OPT: Open Pre-trained Transformer Language Models

Susan Zhang∗, Stephen Roller∗, Naman Goyal∗, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott†, Sam Shleifer†, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
Meta AI
{susanz,roller,naman}@fb.com

∗ Equal contribution.
† Work done while at Meta AI.

arXiv:2205.01068v4 [cs.CL] 21 Jun 2022

Abstract

Large language models, which are often trained for hundreds of thousands of compute days, have shown remarkable capabilities for zero- and few-shot learning. Given their computational cost, these models are difficult to replicate without significant capital. For the few that are available through APIs, no access is granted to the full model weights, making them difficult to study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We show that OPT-175B is comparable to GPT-3,[1] while requiring only 1/7th the carbon footprint to develop. We are also releasing our logbook detailing the infrastructure challenges we faced, along with code for experimenting with all of the released models.

1 Introduction

Large language models (LLMs) trained on massive text collections have shown surprising emergent capabilities to generate text and perform zero- and few-shot learning (Brown et al., 2020; Lieber et al., 2021; Smith et al., 2022; Rae et al., 2021; Chowdhery et al., 2022). While in some cases the public can interact with these models through paid APIs, full model access is currently limited to only a few highly resourced labs.[2] This restricted access has limited researchers’ ability to study how and why these large language models work, hindering progress on improving known challenges in areas such as robustness, bias, and toxicity.

In this technical report, we present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters, which we aim to fully and responsibly share with interested researchers. We train the OPT models to roughly match the performance and sizes of the GPT-3 class of models, while also applying the latest best practices in data collection and efficient training. Our aim in developing this suite of OPT models is to enable reproducible and responsible research at scale, and to bring more voices to the table in studying the impact of these LLMs. Definitions of risk, harm, bias, and toxicity, etc., should be articulated by the collective research community as a whole, which is only possible when models are available for study.

We are releasing all of our models between 125M and 66B parameters, and will provide full research access to OPT-175B upon request. Access will be granted to academic researchers; those affiliated with organizations in government, civil society, and academia; and those in industry research laboratories. We are also releasing both the logbook of our model creation as well as our codebase, metaseq,[3] which enabled training OPT-175B on 992 80GB A100 GPUs, reaching 147 TFLOP/s utilization per GPU. From this implementation, and from using the latest generation of NVIDIA hardware, we are able to develop OPT-175B using only 1/7th the carbon footprint of GPT-3. While this is a significant achievement, the energy cost of creating such a model is still nontrivial, and repeated efforts to replicate a model of this size will only amplify the growing compute footprint of these LLMs.

We believe the entire AI community (academic researchers, civil society, policymakers, and industry) must work together to develop clear guidelines around responsible AI in general and responsible LLMs in particular, given their centrality in many downstream language applications. A much broader segment of the AI community needs access to these models in order to conduct reproducible research and collectively drive the field forward. With the release of OPT-175B and smaller-scale baselines, we hope to increase the diversity of voices defining the ethical considerations of such technologies.

[1] Following Brown et al. (2020), we use GPT-3 to refer to both the 175B model and the smaller scale models as well.
[2] Exceptions include work by EleutherAI, who released dense models up to 20B in size (Black et al., 2022), Salesforce (Nijkamp et al., 2022), and Meta AI, who released dense models up to 13B and sparse models up to 1.1T (Artetxe et al., 2021). There is also ongoing work from the BigScience workshop (https://bigscience.huggingface.co/), which aims to open source very large multilingual language models and datasets.
[3] https://github.com/facebookresearch/metaseq

2 Method

2.1 Models

We present results on eight Transformer language models ranging from 125 million to 175 billion parameters. Architectural details are displayed in Table 1. In the interest of transparency, and to reduce risk of training instabilities, our models and hyperparameters largely follow Brown et al. (2020), with variations in batch size mostly to obtain increased computational efficiency.

    Model   #L   #H   d_model   LR       Batch
    125M    12   12     768     6.0e-4   0.5M
    350M    24   16    1024     3.0e-4   0.5M
    1.3B    24   32    2048     2.0e-4   1M
    2.7B    32   32    2560     1.6e-4   1M
    6.7B    32   32    4096     1.2e-4   2M
    13B     40   40    5120     1.0e-4   4M
    30B     48   56    7168     1.0e-4   4M
    66B     64   72    9216     0.8e-4   2M
    175B    96   96   12288     1.2e-4   2M

Table 1: Model architecture details. We report the number of layers (#L), number of attention heads (#H), and the embedding size (d_model). We also report the peak Learning Rate (LR) and global batch size in number of tokens (Batch).

2.2 Training Setup

For weight initialization, we follow the same settings provided in the Megatron-LM codebase,[4] using a normal distribution with zero mean and standard deviation of 0.006. The standard deviation for output layers is scaled by a 1.0/√(2L) term, where L is the total number of layers. All bias terms are initialized as 0, and all models are trained with ReLU activation and a sequence length of 2048.

We use an AdamW optimizer (Loshchilov and Hutter, 2017) with (β1, β2) set to (0.9, 0.95), and weight decay of 0.1. We follow a linear learning rate schedule, warming up from 0 to the maximum learning rate over the first 2000 steps in OPT-175B, or over 375M tokens in our smaller baselines, and decaying down to 10% of the maximum LR over 300B tokens. A number of mid-flight changes to LR were also required (see Section 2.5). Our batch sizes range from 0.5M to 4M depending on the model size (see Table 1) and are kept constant throughout the course of training.

We use a dropout of 0.1 throughout, but we do not apply any dropout to embeddings. We clip gradient norms at 1.0, except for some mid-flight changes that reduce this threshold down from 1.0 to 0.3 (see Section 2.5). We also include a gradient predivide factor to reduce the risk of over/underflows when computing the gradient across all ranks (splitting the division by the world size of N into two division operations by √N).

[4] https://github.com/NVIDIA/Megatron-LM/blob/main/examples/pretrain_gpt3_175B.sh

[Figure 1 plot, "Empirical Learning Rate": learning rate (0.0e-4 to 1.2e-4) versus iterations (0k to 140k).]
Figure 1: Empirical LR schedule. We found that lowering learning rate was helpful for avoiding instabilities.

2.3 Pre-training Corpus

The pre-training corpus contains a concatenation of datasets used in RoBERTa (Liu et al., 2019b), the Pile (Gao et al., 2021a), and PushShift.io Reddit (Baumgartner et al., 2020; Roller et al., 2021). All corpora were previously collected or filtered to contain predominantly English text, but a small amount of non-English data is still present within the corpus via CommonCrawl.

We removed duplicated documents across all datasets by filtering out documents via MinhashLSH (Rajaraman and Ullman, 2011) with a Jaccard similarity ≥ .95. We found the Pile was particularly full of duplicate documents, and advise future researchers using the Pile to perform additional de-duplication processing.

We tokenize all corpora using the GPT-2 byte level BPE tokenizer (Sennrich et al., 2016; Radford et al., 2019; Brown et al., 2020). Our final corpus contains roughly 180B tokens.

RoBERTa. We included the BookCorpus (Zhu et al., 2015) and Stories (Trinh and Le, 2018) subsets of the RoBERTa corpus and utilized an updated version of CCNews, containing news stories crawled through September 28, 2021. This CCNews v2 corpus was preprocessed the same way as the original RoBERTa CCNews (Liu et al., 2019b).

The Pile. We included a subset of the Pile (Gao et al., 2021a), including: CommonCrawl, DM Mathematics, Project Gutenberg, HackerNews, OpenSubtitles, OpenWebText2, USPTO and Wikipedia. Other subsets of the Pile were eliminated as we found they increased the risk of instabilities, as measured by tendency to cause spikes in gradient norms at the 1.3B scale, or were otherwise deemed unsuitable. All subsets went through additional ad-hoc whitespace normalization.

PushShift.io Reddit. We included a subset of the Pushshift.io corpus produced by Baumgartner et al. (2020) and previously used by Roller et al. (2021). To convert the conversational trees into language-model-accessible documents, we extracted the longest chain of comments in each thread and discarded all other paths in the tree.
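The longest-chain extraction described above can be sketched as follows. This is only an illustration under stated assumptions: each thread is given as a mapping from comment id to parent id, and "longest" is taken to mean the path with the most comments. The paper does not say how length is measured or how ties are broken, and the function name and data layout are hypothetical rather than taken from the authors' pipeline.

```python
from collections import defaultdict


def longest_chain(parents, root_id):
    """Return the ids on the longest root-to-leaf path of one thread.

    `parents` maps each comment id to its parent id (the root maps to None).
    This layout is an assumption made for the sketch.
    """
    # Invert the parent pointers into a children adjacency list.
    children = defaultdict(list)
    for comment_id, parent_id in parents.items():
        children[parent_id].append(comment_id)

    # Iterative depth-first search, remembering the longest path seen so far.
    best = []
    stack = [(root_id, [root_id])]
    while stack:
        node, path = stack.pop()
        if len(path) > len(best):
            best = path
        for child in children.get(node, []):
            stack.append((child, path + [child]))
    return best


# Hypothetical thread: "post" has two replies, and only "a" has a deeper chain.
thread = {"post": None, "a": "post", "b": "post", "c": "a", "d": "c"}
print(longest_chain(thread, "post"))
```

Every comment outside the returned path (here, "b") would be discarded.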
This reduced the corpus by about 66%.

2.4 Training Efficiency

We trained OPT-175B on 992 80GB A100 GPUs, by utilizing Fully Sharded Data Parallel (Artetxe et al., 2021) with Megatron-LM Tensor Parallelism (Shoeybi et al., 2019). We achieve utilization of up to 147 TFLOP/s per GPU. We keep Adam state in FP32, since we shard it across all hosts, while the model weights remained in FP16. To avoid underflows, we used dynamic loss scaling, as described in Micikevicius et al. (2017).

2.5 Training Processes

Here we describe significant training process adjustments that arose during OPT-175B pre-training.

Hardware Failures. We faced a significant number of hardware failures in our compute cluster while training OPT-175B. In total, hardware failures contributed to at least 35 manual restarts and the cycling of over 100 hosts over the course of 2 months. During manual restarts, the training run was paused, and a series of diagnostic tests was conducted to detect problematic nodes. Flagged nodes were then cordoned off and training was resumed from the last saved checkpoint. Given the difference between the number of hosts cycled out and the number of manual restarts, we estimate 70+ automatic restarts due to hardware failures.

Loss Divergences. Loss divergences were also an issue in our training run. When the loss diverged, we found that lowering the learning rate and restarting from an earlier checkpoint allowed for the job to recover and continue training. We noticed a correlation between loss divergence, our dynamic loss scalar crashing to 0, and the ℓ2-norm of the activations of the final layer spiking. These observations led us to pick restart points for which our dynamic loss scalar was still in a “healthy” state (≥ 1.0), and after which our activation norms would trend downward instead of growing unboundedly. Our empirical LR schedule is shown in Figure 1. Early in training, we also noticed that lowering gradient clipping from 1.0 to 0.3 helped with stability; see our released logbook for exact details. Figure 2 shows our validation loss with respect to training iterations.

[Figure 2 plot, "Validation Perplexity": perplexity (7.0 to 10.0) versus iterations (0k to 140k).]
Figure 2: Validation Perplexity. Our mid-flight LR changes had clear effects on validation perplexity.

Other Mid-flight Changes. We conducted a number of other experimental mid-flight changes to handle loss divergences. These included: switching to vanilla SGD (optimization plateaued quickly, and we reverted back to AdamW); resetting the dynamic loss scalar (this helped recover some but not all divergences); and switching to a newer version of Megatron (this reduced pressure on activation norms and improved throughput).
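For comparison with the empirical schedule in Figure 1, the nominal schedule from Section 2.2 (linear warmup, then linear decay to 10% of the peak) can be sketched as below. This is a hedged illustration, not the metaseq implementation: the function name is hypothetical, the decay is assumed to run from the end of warmup to the end of the token budget, and "2M" is read as 2 × 10^6 tokens per batch when converting 300B tokens into steps. None of the mid-flight changes described above are modeled.

```python
def nominal_lr(step, peak_lr, warmup_steps, total_steps, final_fraction=0.1):
    """Linear warmup from 0 to peak_lr, then linear decay to final_fraction * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Assumption: decay spans the steps remaining after warmup.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return peak_lr * (1.0 - (1.0 - final_fraction) * progress)


# OPT-175B values from Table 1 and Section 2.2. The step count is an
# assumption: 300B tokens divided by 2 * 10**6 tokens per batch.
PEAK_LR = 1.2e-4
WARMUP_STEPS = 2000
TOTAL_STEPS = 300_000_000_000 // 2_000_000  # 150_000

for step in (0, 1000, 2000, 76_000, 150_000):
    print(step, nominal_lr(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS))
```

Under these assumptions the rate peaks at 1.2e-4 at step 2000 and ends near 1.2e-5, whereas Figure 1 shows the rate actually used being lowered by hand at several points.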
3 Evaluations

3.1 Prompting & Few-Shot

We evaluate our model on 16 standard NLP tasks utilized in the literature: HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016), PIQA (Bisk et al., 2020), ARC Easy and Challenge (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018), WinoGrad (Levesque et al., 2011), WinoGrande (Sakaguchi et al., 2020), and SuperGLUE (Wang et al., 2019). We follow GPT-3 (Brown et al., 2020) by using their prompts and overall experimental setup. We compare primarily to GPT-3, having aimed to re-implement their evaluation settings, but include reported performance of other LLMs on a per-task basis when available (Lieber et al., 2021; Rae et al., 2021; Hoffmann et al., 2022; Black et al., 2022).

We report performance in accuracy (omitting F1 for MultiRC and ReCoRD for consistency in evaluation metrics). For the Winograd Schema Challenge (WSC) task in the SuperGLUE benchmark, we follow Brown et al. (2020) and formulate the task as multiple choice questions, which is known to affect performance (Liu et al., 2020).

[Figure 3 plot, "Average across 14 NLP Tasks (Zero-Shot)": average accuracy (50 to 70) versus parameters (10^8 to 10^11), with one series each for OPT and GPT.]
Figure 3: Zero-shot NLP Evaluation Averages. Across a variety of tasks and model sizes, OPT largely matches the reported averages of GPT-3. However, performance varies greatly per task: see Appendix A.

Zero-shot. Overall average zero-shot performance across all 14 tasks may be seen in Figure 3. Overall, we see our average performance follows the trend of GPT-3. However, performance can vary radically across the tasks: for a full breakdown, see Appendix A. Note that we intentionally removed MultiRC and WIC from these averages, as these datasets seem to systematically favor GPT-3 or OPT disproportionately.

Our performance roughly matched GPT-3 for 10 tasks, and underperformed in 3 tasks (ARC Challenge and MultiRC). In 3 tasks (CB, BoolQ, WSC), we find both GPT and OPT models display unpredictable behavior with respect to scale, likely due to the small size of the validation set in these 3 tasks (56, 277, and 104 examples, respectively). In WIC, we see that the OPT models always outperform the GPT-3 models, though the numbers reported by Brown et al. (2020) also seem questionable, given WIC being a binary classification task.[5] For MultiRC, we are unable to replicate the GPT-3 results using the Davinci API[6] within our evaluation setup, suggesting differences in the methods of evaluation on this task. For BoolQ and WSC, we note that both OPT and GPT models seem to hover around majority-class accuracy, suggesting small perturbations in probability masses may be dominating the evaluations.

Chinchilla (Hoffmann et al., 2022) and Gopher (Rae et al., 2021) perform roughly consistently with others for their parameter sizes, while PaLM (Chowdhery et al., 2022) generally performs better across all settings, even when controlling for number of parameters. We speculate the high performance of PaLM comes predominantly from higher quality and diversity of pre-training data.

[Figure 4 plot, "Average across 14 NLP Tasks": average accuracy (50 to 75) versus parameters (10^8 to 10^11), for 0, 1, and 32 shots, with one series each for OPT and GPT.]
Figure 4: Multi-shot performance. OPT performance for one- and few-shot lags behind GPT-3 models, but performance depends heavily per task; see Appendix A.

One-shot and Few-shot. Average multi-shot in-context performance is shown in Figure 4 (again, omitting MultiRC and WIC), with detailed performances shown in Appendix A. Across the average of all metrics, we find that OPT models perform similarly to GPT-3 models. However, as with zero-shot, breaking down these results per task shows a different story: in the same set of 10 datasets as zero-shot, we see similar performance across the two models. Some of the remaining datasets show inconsistent performance with respect to model size for both OPT and GPT-3 models (BoolQ, CB, WSC, RTE). In MultiRC, we consistently see underperformance of OPT models compared to GPT-3 models. Similar to our zero-shot evaluation, we hypothesize our one- and few-shot evaluation setup may differ significantly from Brown et al. (2020).

[5] Brown et al. (2020) reports 0% accuracy on WIC, which implies 100% accuracy if the classification was inverted.
[6] https://beta.openai.com/docs/engines/overview

3.2 Dialogue

Given that LLMs are known to be an integral component of modern dialogue models (Adiwardana et al., 2020; Roller et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), we additionally evaluate OPT-175B on several open source dialogue datasets. In particular, we follow Roller et al. (2021), and evaluate on ConvAI2 (Dinan et al., 2020b), Wizard of Wikipedia (Dinan et al., 2019b), Empathetic Dialogues (Rashkin et al., 2019), and Blended Skill Talk (Smith et al., 2020). We additionally evaluate on the more recent Wizard of Internet dataset (Komeili et al., 2021). We focus our comparisons primarily against existing open source dialogue models including the fine-tuned BlenderBot 1 (Roller et al., 2021) and its pre-training counterpart Reddit 2.7B. We also compare against the fine-tuned R2C2 BlenderBot, a 2.7B parameter BlenderBot-like model trained by Shuster et al. (2022).

We report Perplexity and Unigram F1 (UF1) overlap, following the metrics of the ConvAI2 competition (Dinan et al., 2020b). To control for different tokenization in each of the models, we normalize all perplexities to be in the space of the GPT-2 tokenizer (Radford et al., 2019). We also note which models are supervised with respect to these dialogue tasks and which are unsupervised. For OPT-175B, all generations are performed using greedy decoding up to a maximum of 32 tokens. We do not attempt to prompt the model at all except for alternating “Person 1:” and “Person 2:” lines of dialogue. The remaining models use the generation parameters found in BlenderBot 1.

Results are shown in Table 2. We see that OPT-175B significantly outperforms the also-unsupervised Reddit 2.7B model on all tasks, and performs competitively with the fully supervised BlenderBot 1 model, especially in the ConvAI2 dataset. On the Wizard-of-Internet dataset, which is fully unsupervised for all models, we see that OPT-175B obtains the lowest perplexity but still has lower UF1 than the models with Wizard-of-Wikipedia supervision.

We were somewhat surprised that the evaluations of the unsupervised OPT-175B model were as competitive as BlenderBot 1 on the ConvAI2 dataset. This may indicate leakage of the ConvAI2 dataset into the general pre-training corpus or even into the validation data as evaluated in Table 2. To address concerns of leakage, we searched our pre-training corpus for the first conversation in the ConvAI2 dataset, but we did not find any overlap. We additionally evaluated OPT-175B on the ConvAI2 hidden test set, which has never been publicly released, and achieved 10.7 ppl and .185 UF1, matching the performance of the validation set. Furthermore, we evaluated OPT-175B on a subset of the ConvAI2-like MultiSessionChat (MSC) dataset (Xu et al., 2021b) and obtained a perplexity of 9.7 and UF1 of .177, indicating the model is generalizing well across multiple PersonaChat-like datasets. Since both MSC and WoI datasets were released after the CommonCrawl snapshot used in the pre-training corpus, there is minimal risk of leakage. We conclude that OPT-175B has a strong ability to maintain a consistent persona across conversations, a behavior also highlighted in LaMDA (Thoppilan et al., 2022).

4 Bias & Toxicity Evaluations

To understand the potential harm of OPT-175B, we evaluate a series of benchmarks related to hate speech detection, stereotype awareness, and toxic content generation. While there may be shortcomings in these benchmarks (Blodgett et al., 2021; Jacobs and Wallach, 2021), these measurements provide a first step towards understanding the limitations of OPT-175B. We compare primarily against GPT-3 Davinci, as these benchmarks were not yet available to be included in Brown et al. (2020).

4.1 Hate Speech Detection

Using the ETHOS dataset provided in Mollas et al. (2020) and instrumented by Chiu and Alexander (2021), we measure the ability of OPT-175B to identify whether or not certain English statements are racist or sexist (or neither). In the zero-, one-,
                                             Perplexity (↓)                        Unigram F1 (↑)
 Model                  Eval        C2    WW        ED    BST     WoI        C2   WW      ED     BST     WoI
 Reddit 2.7B            Unsup.    18.9     21.0    11.6   17.4     18.0   .126    .133   .135     .133   .124
 BlenderBot 1           Sup.      10.2     12.5     9.0   11.9     14.7   .183    .189   .192     .178   .154
 R2C2 BlenderBot        Sup.      10.5     12.4     9.1   11.7     14.6   .205    .198   .197     .186   .160
 OPT-175B               Unsup.    10.8     13.3    10.3   12.1     12.0   .185    .152   .149     .162   .147

Table 2: Dialogue Evaluations. OPT-175B, in a fully unsupervised setting, performs competitively against fully
supervised models.
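
The two metrics in Table 2 can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used for the paper: the function names are hypothetical, the perplexity conversion assumes that the total negative log-likelihood of the text is held fixed while only the token count changes, and the F1 function uses plain lowercased whitespace tokens, whereas dialogue toolkits usually apply extra text normalization first.

```python
import math
from collections import Counter


def renormalize_perplexity(ppl: float, n_model_tokens: int, n_gpt2_tokens: int) -> float:
    """Re-express a per-token perplexity in GPT-2 token units.

    Assumption: the total negative log-likelihood of the text is unchanged;
    only the number of tokens used for averaging differs between tokenizers.
    """
    total_nll = n_model_tokens * math.log(ppl)
    return math.exp(total_nll / n_gpt2_tokens)


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall over whitespace tokens."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, a model that reaches perplexity 4.0 over 100 of its own tokens scores 2.0 when the same text spans 200 GPT-2 tokens, which is why perplexities from models with different tokenizers are not comparable until they share a token space.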


  Setup                     Davinci      OPT-175B
  Zero-shot                    .628          .667
  One-shot                     .616          .713
  Few-shot (binary)            .354          .759
  Few-shot (multiclass)        .672          .812

Table 3: Hate speech detection. F1 scores of detecting hate speech between Davinci and OPT-175B.
OPT-175B considerably outperforms Davinci in all settings.

  Category                    GPT-3      OPT-175B
  Gender                       62.6          65.7
  Religion                     73.3          68.6
  Race/Color                   64.7          68.6
  Sexual orientation           76.2          78.6
  Age                          64.4          67.8
  Nationality                  61.6          62.9
  Disability                   76.7          76.7
  Physical appearance          74.6          76.2
  Socioeconomic status         73.8          76.2
  Overall                      67.2          69.5

Table 4: CrowS-Pairs evaluation. Lower is better for all categories, indicating more fairness. The
OPT-175B model performs worse than Davinci in most categories.

and few-shot binary cases, the model is presented with text and asked to consider whether the text is
racist or sexist and provide a yes/no response. In the few-shot multiclass setting, the model is asked
to provide a yes/no/neither response.
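
The binary setup described above might look roughly as follows in Python. This is only a sketch under stated assumptions: the prompt template, the function names, and the choice of positive-class F1 are all hypothetical, since the text does not give the exact prompt wording or the F1 variant behind Table 3.

```python
from typing import List, Tuple

# Hypothetical question wording; the actual prompt is not given in the text.
QUESTION = "Question: Is the text racist or sexist?"


def build_binary_prompt(examples: List[Tuple[str, bool]], query: str) -> str:
    """Format labelled examples followed by the unlabelled query.

    An empty example list gives the zero-shot case, one example the one-shot case.
    """
    blocks = []
    for text, is_hateful in examples:
        answer = "yes" if is_hateful else "no"
        blocks.append(f'Text: "{text}"\n{QUESTION}\nAnswer: {answer}')
    blocks.append(f'Text: "{query}"\n{QUESTION}\nAnswer:')
    return "\n\n".join(blocks)


def parse_yes_no(completion: str) -> bool:
    """Map a completion to a label; anything not starting with 'yes' counts as no."""
    return completion.strip().lower().startswith("yes")


def binary_f1(predictions: List[bool], labels: List[bool]) -> float:
    """F1 of the positive (hateful) class."""
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(l and not p for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

The multiclass variant would differ only in allowing a third answer string and in how the per-class scores are aggregated.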
   Results are presented in Table 3. With all of our one-shot through few-shot configurations,
OPT-175B performs considerably better than Davinci. We speculate this occurs from two sources: (1)
evaluating via the Davinci API may be bringing in safety control mechanisms beyond the original
175B GPT-3 model used in Brown et al. (2020); and (2) the significant presence of unmoderated
social media discussions in the pre-training dataset has provided additional inductive bias to aid
in such classification tasks.

   When compared with Davinci in Table 4, OPT-175B appears to exhibit more stereotypical biases
in almost all categories except for religion. Again, this is likely due to differences in training
data; Nangia et al. (2020) showed that the Pushshift.io Reddit corpus has a higher incidence rate
for stereotypes and discriminatory text than other corpora (e.g. Wikipedia). Given this is a
primary data source for OPT-175B, the model may have learned more discriminatory associations,
which directly impacts its performance on CrowS-Pairs.
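
A CrowS-Pairs score of the kind reported in Table 4 is a percentage of sentence pairs. The sketch below assumes that each sentence in a pair has already been scored by the model (for example with a sentence log-likelihood) and counts how often the more stereotypical sentence receives the higher score. The function name and the input layout are hypothetical, and how sentences are scored for an autoregressive model is an assumption that the text does not spell out.

```python
from typing import Sequence, Tuple


def crows_pairs_score(pair_scores: Sequence[Tuple[float, float]]) -> float:
    """Percentage of pairs in which the more stereotypical sentence scores higher.

    Each element is (score_more_stereotypical, score_less_stereotypical),
    e.g. two sentence log-likelihoods from the same model.
    """
    if not pair_scores:
        raise ValueError("need at least one sentence pair")
    preferred = sum(1 for stereo, anti in pair_scores if stereo > anti)
    return 100.0 * preferred / len(pair_scores)


# Toy scores for four pairs; the stereotypical sentence wins in three of them.
example = [(-10.0, -12.0), (-9.0, -8.0), (-5.0, -6.0), (-7.0, -7.5)]
score = crows_pairs_score(example)
```

In the benchmark's original formulation, a model with no systematic preference would land near 50, so the per-category values in Table 4 can be read as distance above that point.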
4.2   CrowS-Pairs
Developed for masked language models, CrowS-Pairs (Nangia et al., 2020) is a crowdsourced
benchmark aiming to measure intrasentence level biases in 9 categories: gender, religion,
race/color, sexual orientation, age, nationality, disability, physical appearance, and
socioeconomic status. Each example consists of a pair of sentences representing a stereotype, or
anti-stereotype, regarding a certain group, with the goal of measuring model preference towards
stereotypical expressions. Higher scores indicate higher bias exhibited by a model.

4.3    StereoSet
Following Lieber et al. (2021) and Artetxe et al. (2021), we use StereoSet (Nadeem et al., 2021)
to measure stereotypical bias across 4 categories: profession, gender, religion, and race. In
addition to intrasentence measurement (similar to CrowS-Pairs), StereoSet includes measurement at
the intersentence level to test a model’s ability to incorporate additional context. To account
for a potential trade-off between bias detection and language modeling capability, StereoSet
includes two metrics:
Language Modeling Score (LMS) and Stereotype Score (SS), which are then combined to form the
Idealized Context Association Test score (ICAT). Nadeem et al. (2021) define
ICAT = LMS × min(SS, 100 − SS) / 50, so the ideal model has an LMS of 100 and an SS of 50.
Unlike Lieber et al. (2021), we normalize scores by token count, rather than character count,
which they report improves metrics for several models.

   Category   Metric       Davinci    OPT-175B
   Prof.      LMS (↑)         78.4        74.1
              SS (↓)          63.4        62.6
              ICAT (↑)        57.5        55.4
   Gend.      LMS (↑)         75.6        74.0
              SS (↓)          66.5        63.6
              ICAT (↑)        50.6        53.8
   Reli.      LMS (↑)         80.8        84.0
              SS (↓)          59.0        59.0
              ICAT (↑)        66.3        68.9
   Race       LMS (↑)         77.0        74.9
              SS (↓)          57.4        56.8
              ICAT (↑)        65.7        64.8
   Overall    LMS (↑)         77.6        74.8
              SS (↓)          60.8        59.9
              ICAT (↑)        60.8        60.0

Table 5: StereoSet Evaluations. Davinci and OPT-175B perform similarly across all evaluations.

   Results are shown in Table 5. We see that Davinci and OPT-175B exhibit similar scores on
aggregate (overall ICAT is very close between the two). In particular, Davinci outperforms in the
areas of profession and race, while OPT-175B outperforms in the areas of gender and religion.
OPT-175B performs better across the board on the SS metric, while Davinci generally outperforms
on the LMS metric.

4.4   RealToxicityPrompts
We evaluate the tendency of OPT-175B to respond with toxic language via the RealToxicityPrompts
(Gehman et al., 2020) dataset. Following PaLM (Chowdhery et al., 2022), we sample 25 generations
of 20 tokens using nucleus sampling (Holtzman et al., 2020) (p = 0.9) for each of 10,000
randomly sampled prompts from RTP, and report mean toxicity probabilities of the continuations,
stratified across bucketed toxicities of the original prompts. For comparison, we report bucketed
toxicity rates from Davinci and PaLM.

[Figure 5 plot, titled "Toxicity Probability of Prompt (TPP)": Toxicity Probability of
Continuation (TPC) against Prompt Toxicity Probability (Binned), with curves for OPT 175B,
Davinci, and PaLM.]

Figure 5: RealToxicityPrompts. OPT-175B is more likely to generate toxic responses than either
Davinci or PaLM. Consistent with prior work, toxicity rates increase as prompt toxicity increases.

   Results are shown in Figure 5. Overall, we see that OPT-175B has a higher toxicity rate than
either PaLM or Davinci. We also observe that all 3 models have increased likelihood of generating
toxic continuations as the toxicity of the prompt increases, which is consistent with the
observations of Chowdhery et al. (2022). As with our experiments in hate speech detection, we
suspect the inclusion of unmoderated social media texts in the pre-training corpus raises model
familiarity with, and therefore propensity to generate and detect, toxic text. This strong
awareness of toxic language may or may not be desirable depending on the specific requirements of
downstream applications. Future applications of OPT-175B should consider this aspect of the
model, and take additional mitigations, or avoid usage entirely as appropriate.

4.5   Dialogue Safety Evaluations
Finally, we compare OPT-175B on two Dialogue Safety evaluations. The first, SaferDialogues (Ung
et al., 2021), measures the ability to recover from explicit safety failures, usually in the form
of apologizing or recognizing its mistake. The second, the Safety Bench Unit Tests (Dinan et al.,
2021), measures how unsafe a model’s response is, stratified across 4 levels of topic
sensitivity: Safe, Realistic, Unsafe, and Adversarial. As with the other dialogue evaluations
(Section 3.2), we compare to several existing open source dialogue models.
   Results for both experiments are shown in Table 6. We observe that OPT-175B has similar
performance as the Reddit 2.7B model across both SaferDialogues and the Unit Tests, with OPT-175B
performing marginally better in the Safe and Adversarial settings. Consistent with Roller et al. (2021)
and Xu et al. (2020), we find that the models fine-tuned on curated dialogue datasets
(BlenderBot 1, R2C2) have overall lower toxicity. We conclude that future experimentation of
OPT-175B for dialogue should contain explicit fine-tuning on curated datasets in order to improve
the safety profile.

                      Safe. Dia.           Unit Tests (↓)
    Model             PPL     F1     Sa      Re     Un       Ad
    Reddit 2.7B       16.2   .140   .300    .261   .450     .439
    BlenderBot 1      12.4   .161   .028    .150   .250     .194
    R2C2 BlenderBot   13.8   .160   .022    .133   .289     .222
    OPT-175B          14.7   .141   .033    .261   .567     .283

Table 6: Dialogue Responsible AI evaluations. OPT-175B is roughly on par with the Reddit 2.7B
model, but performs worse in the Unsafe setting.

5     Limitations

In Sections 3.1 and 4, we carried out extensive evaluation of all released models at varying
scales. We saw parity in performance for standard evaluation datasets used in the GPT-3 models.
Moreover, we performed safety, bias, and inclusion evaluations, again seeing largely comparable
performance with some variations in toxicity and hate speech detection. However, such evaluations
may not fully characterize the complete limitations of these models. In general, we qualitatively
observe that OPT-175B suffers from the same limitations noted in other LLMs (Brown et al., 2020;
Lieber et al., 2021; Thoppilan et al., 2022; Rae et al., 2021; Smith et al., 2022; Chowdhery
et al., 2022; Bender et al., 2021).
   In particular, we found OPT-175B does not work well with declarative instructions or
point-blank interrogatives. Prompting with such instructions tends to produce a simulation of a
dialogue beginning with such an instruction, rather than an execution of the instruction. Future
work into instruction learning, in the vein of InstructGPT (Ouyang et al., 2022), may alleviate
these limitations.
   OPT-175B also tends to be repetitive and can easily get stuck in a loop. While sampling can
reduce the incidence rate of repetitive behavior (Holtzman et al., 2020), we anecdotally found it
did not eliminate it entirely when only one generation is sampled. Future work may wish to
incorporate more modern strategies for reducing repetition and improving diversity, such as
unlikelihood training (Welleck et al., 2020) or best-first decoding (Meister et al., 2020).
   Similar to other LLMs, OPT-175B can produce factually incorrect statements (Adiwardana et al.,
2020; Brown et al., 2020; Roller et al., 2021; Rae et al., 2021; Chowdhery et al., 2022;
Thoppilan et al., 2022). This can be particularly harmful in applications where information
accuracy is critical, such as healthcare and scientific discovery (Weidinger et al., 2021b).
Recently, several efforts have reported that retrieval-augmented models can improve factual
correctness of LLMs (Lewis et al., 2020; Komeili et al., 2021; Thoppilan et al., 2022; Borgeaud
et al., 2021; Shuster et al., 2022; Nakano et al., 2021). We believe OPT-175B will also benefit
from retrieval-augmentation in future iterations.
   As shown in Section 4, we also find OPT-175B has a high propensity to generate toxic language
and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt (Gehman
et al., 2020), and adversarial prompts are trivial to find (Dinan et al., 2021). There has been a
great deal of work on mitigations for toxicity and biases (Dathathri et al., 2019; Dinan et al.,
2019a; Sheng et al., 2019; Dinan et al., 2020a; Liu et al., 2019a; Krause et al., 2020; Xu et al.,
2020; Liang et al., 2021; Dinan et al., 2021; Xu et al., 2021a; Dhamala et al., 2021; Schick
et al., 2021; Ouyang et al., 2022). Depending on downstream applications, future uses of OPT-175B
may need to employ these or novel mitigation approaches, especially before any real world
deployment. Given our primary goal as a replication of GPT-3, we choose not to apply these
mitigations in this first release.
   In summary, we still believe this technology is premature for commercial deployment. Despite
including data sheets and model cards, we believe more scrutiny should be afforded to the training
data with additional data characterization and selection criteria in order to use data
responsibly. The current practice is to feed the model with as much data as possible and minimal
selection within these datasets. Despite having comprehensive evaluations, we would ideally have
more streamlined and consistent evaluation setups to ensure replicability and reproducibility of
evaluation scenarios. Differences in prompting styles and number of shots for in-context learning
could create variations that lead to different results. We hope that the public release of the OPT
models will enable many more researchers to work on these important issues.
6   Considerations for Release

Following the recommendations for individual researchers generated by the Partnership for AI,7
along with the governance guidance outlined by NIST,8 we are disclosing all of the details
involved in training OPT-175B through our logbook,9 our code, and providing researchers access
to model weights for OPT-175B, along with a suite of smaller baselines mirroring the setup for
OPT-175B. We aim to be fully accountable for the development lifecycle of OPT-175B, and only
through increasing transparency around LLM development can we start understanding the limitations
and risks of LLMs before broader deployment occurs.
   By sharing a detailed account of our day-to-day training process, we disclose not only how much
compute was used to train the current version of OPT-175B, but also the human overhead required
when underlying infrastructure or the training process itself becomes unstable at scale. These
details are generally omitted from previous publications, likely due to the inability to fully
ablate changes made mid-flight (without drastically increasing the compute budget). We hope that
by revealing how certain ad-hoc design decisions were made, we can improve upon these practices in
the future, and collectively increase the experimental robustness in developing models at this
scale.
   Outside of these notes, the metaseq codebase itself is the final source of truth in many of our
implementation details. By releasing our development codebase, we aim to shed light on any
implementation detail that may have been omitted from being explicitly enumerated in this paper,
as it is either considered a detail of standard practice in the field, or is simply a detail we
failed to account for. This current codebase is also the only known open-source implementation of
training a decoder-only transformer that is ≥175B parameters without the use of pipeline
parallelism on NVIDIA GPUs.
   To enable experimentation at 175B scale, we are providing researchers with direct access to the
parameters of OPT-175B. The reasoning here is two-fold: enable Responsible AI research into LLMs
while simultaneously reducing the environmental impact of pursuing research at this scale. There
is a growing body of work detailing ethical and social risks from deploying language models with
emergent capabilities at scale (Weidinger et al., 2021a; Bommasani et al., 2021; Dinan et al.,
2021; Kenton et al., 2021). By limiting access to OPT-175B to the research community with a
non-commercial license, we aim to focus development efforts on quantifying the limitations of the
LLMs first, before broader commercial deployment occurs.
   Furthermore, there exists significant compute and carbon cost to reproduce models of this size.
While OPT-175B was developed with an estimated carbon emissions footprint (CO2eq) of 75 tons,10
GPT-3 was estimated to use 500 tons (Patterson et al., 2021), while Gopher required 380 tons (Rae
et al., 2021). These estimates are not universally reported, and the accounting methodologies for
these calculations are also not standardized. In addition, model training is only one component of
the overall carbon footprint of AI systems; we must also consider experimentation and eventual
downstream inference cost, all of which contribute to the growing energy footprint of creating
large-scale models (Wu et al., 2022). By releasing our logbook, we hope to highlight the gap
between a theoretical carbon cost estimate that assumes no hardware failures or training
instabilities, versus one that aims to include the entire LLM development lifecycle. We need to
understand the manufacturing (or embodied) carbon of these systems (Gupta et al., 2021) as they
grow increasingly more complex, and we hope that our paper can help future work in defining
additional factors to consider when measuring the impact of scale on the environment.
   Similarly, by producing a set of baselines across a wide range of scales, we hope to enable the
broader research community to study the impact and limitations of these models with respect to
scale alone. As reported in Hoffmann et al. (2022), many of these LLMs may have been under-trained
as a function of the amount of training data used, which implies that incorporating more data and
continuing to train these baseline models may continue to improve performance. There is also
evidence that step-function changes in capabilities may occur at a scale that is much smaller than
175B (Wei et al., 2021), indicating a need to examine a wider range of scales for different
research applications.

  7
    https://partnershiponai.org/paper/responsible-publication-recommendations/
  8
    https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1270.pdf
  9
    https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/chronicles/OPT175B_Logbook.pdf
  10
    With ablations, baselines and downtime, our own estimate of total cost is roughly 2× higher.
7   Related Work

Since the publication of the Transformer architecture (Vaswani et al., 2017) and BERT (Devlin
et al., 2019), the field of NLP has experienced a massive shift towards the use of LLMs with
self-supervised pre-training. Multiple masked language models, including T5 (Raffel et al., 2020)
and Megatron-LM (Shoeybi et al., 2019), have shown consistent improvements through scale. These
scaling gains come not only from growing the total number of parameters in the models, but also
the amount and quality of pre-training data (Liu et al., 2019b; Hoffmann et al., 2022).
   Auto-regressive language models (Mikolov et al., 2009) have seen the largest growth in model
size, from 117M parameters (Radford et al., 2018) to over 500B parameters (Smith et al., 2022;
Chowdhery et al., 2022). The resulting massive improvement in generative fluency and quality was
first characterized in GPT-2 (Radford et al., 2019) and further improved with GPT-3 (Brown et al.,
2020) and later models. Although a variety of very large (over 100B parameters) generative models
have now been trained (Lieber et al., 2021; Rae et al., 2021; Thoppilan et al., 2022; Smith et al.,
2022; Chowdhery et al., 2022), they are all closed source and accessible only internally or via
paid API services. There are a few notable efforts towards open sourcing LLMs from non-profit
research organizations including EleutherAI (Black et al., 2022) and BigScience.11 These models
differ from the OPT models in pre-training data, target languages and model scale, making it
possible for the community to compare different pre-training strategies.

  11
     https://huggingface.co/bigscience/tr11-176B-ml-logs/tensorboard

   Since Brown et al. (2020), the primary evaluation criterion for LLMs has been prompt-based
(Black et al., 2022; Rae et al., 2021; Chowdhery et al., 2022), as is also performed in this
paper. This is largely due to the convenience of evaluating on many tasks without specialized
task-specific fine-tuning. Prompting itself has a long history: cloze evaluations go back several
decades (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016). More recently, prompting or
masked infilling has been used to probe models for knowledge (Petroni et al., 2019) or perform a
variety of NLP tasks (Radford et al., 2019; Brown et al., 2020). There has also been work on
eliciting prompting behavior in smaller models (Schick and Schütze, 2020; Gao et al., 2021b; Li
and Liang, 2021; Lester et al., 2021; Scao and Rush, 2021), improving the flexibility of prompting
(Shin et al., 2020), and understanding why and how prompting works (Liu et al., 2021; Min et al.,
2022).
   Recent efforts have shown gains by fine-tuning models to directly respond to instruction-style
prompting (Wei et al., 2021; Min et al., 2021; Sanh et al., 2021; Ouyang et al., 2022). However,
effective prompt engineering remains an open research challenge. Results vary significantly and
unpredictably with the selection of the prompt (Lu et al., 2021), and models do not seem to
understand the prompts as fully as we expect (Webson and Pavlick, 2021). Furthermore, it is
challenging to write prompts without a development set, which leads to questions about the extent
to which we are actually achieving zero- or few-shot learning in practice (Perez et al., 2021). We
do not attempt to address these concerns of prompting, and instead only aim to provide evaluation
of OPT-175B in existing settings. However, we hope the full release of OPT-175B will enable others
to better study these challenges in the future.

8   Conclusion

In this technical report, we introduced OPT, a collection of auto-regressive language models
ranging in size from 125M to 175B parameters. Our goal was to replicate the performance and sizes
of the GPT-3 class of models, while also applying the latest best practices in data curation and
training efficiency. We described training details, evaluated performance in a number of NLP and
dialogue settings, and characterized behaviors with respect to bias, toxicity and hate speech. We
also described many other limitations the models have, and discussed a wide set of considerations
for responsibly releasing the models. We believe the entire AI community would benefit from
working together to develop guidelines for responsible LLMs, and we hope that broad access to
these types of models will increase the diversity of voices defining the ethical considerations of
such technologies.

Acknowledgements
We would like to thank Scott Jeschonek, Giri Anantharaman, Diego Sarina, Joaquin Colombo, Chris
Bray, Stephen Roylance, Kalyan Saladi, Shubho Sengupta, and Brian O’Horo for helping to remove
infrastructure blockers along the way; Percy Liang,
Rishi Bommasani, and Emily Dinan for discussions on responsible release practices; Carole-Jean
Wu for discussions on sustainability and carbon footprint considerations; Srini Iyer, Ramakanth
Pasunuru, and Shruti Bhosale for previous contributions to evaluations; Benjamin Lefaudeux, Geeta
Chauhan, Natalia Gimelshein, Horace He, and Sam Gross for discussions on performance improvement
work; Emily Dinan, Carole-Jean Wu, Daniel McKinnon, and Mark Tygert for feedback on this draft;
Antoine Bordes, Joelle Pineau, Mary Williamson, Necip Fazil Ayan, Armand Joulin, Sergey Edunov,
Melanie Kambadur, Zornitsa Kozareva, Ves Stoyanov, Vitaliy Liptchinsky, Rahul Iyer, Jing Xu, Jason
Weston, and many others for supporting this project internally.

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu,
  Robert Sim, and Hanna Wallach. 2021. Stereotyping Norwegian salmon: An inventory of pitfalls in
  fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for
  Computational Linguistics and the 11th International Joint Conference on Natural Language
  Processing (Volume 1: Long Papers), pages 1004–1015, Online. Association for Computational
  Linguistics.

Rishi Bommasani, Drew A. Hudson, Ehsan Adeli,
  Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine
  Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon,
  Niladri Chatterji, Annie S. Chen, Kathleen Creel, Jared Quincy Davis, Dorottya Demszky, Chris
  Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li
  Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie,
                                                           Karan Goel, Noah D. Goodman, Shelby Grossman,
                                                           Neel Guha, Tatsunori Hashimoto, Peter Henderson,
References                                                 John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu,
                                                           Jing Huang, Thomas Icard, Saahil Jain, Dan Juraf-
Daniel Adiwardana, Minh-Thang Luong, David R So,           sky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff
  Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang,       Keeling, Fereshte Khani, Omar Khattab, Pang Wei
  Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu,           Koh, Mark S. Krass, Ranjay Krishna, Rohith Kudi-
  et al. 2020. Towards a human-like open-domain            tipudi, and et al. 2021. On the opportunities and
  chatbot. arXiv preprint arXiv:2001.09977.                risks of foundation models. CoRR, abs/2108.07258.
Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor        Sebastian Borgeaud, Arthur Mensch, Jordan Hoff-
  Mihaylov, Myle Ott, Sam Shleifer, Xi Victoria Lin,       mann, Trevor Cai, Eliza Rutherford, Katie Millican,
  Jingfei Du, Srinivasan Iyer, Ramakanth Pasunuru,         George van den Driessche, Jean-Baptiste Lespiau,
  Giri Anantharaman, Xian Li, Shuohui Chen, Halil          Bogdan Damoc, Aidan Clark, et al. 2021. Improv-
  Akin, Mandeep Baines, Louis Martin, Xing Zhou,           ing language models by retrieving from trillions of
  Punit Singh Koura, Brian O’Horo, Jeff Wang, Luke         tokens. arXiv preprint arXiv:2112.04426.
  Zettlemoyer, Mona T. Diab, Zornitsa Kozareva, and
  Ves Stoyanov. 2021. Efficient large scale lan-         Tom Brown, Benjamin Mann, Nick Ryder, Melanie
  guage modeling with mixtures of experts. CoRR,           Subbiah, Jared D Kaplan, Prafulla Dhariwal,
  abs/2112.10684.                                          Arvind Neelakantan, Pranav Shyam, Girish Sastry,
                                                           Amanda Askell, Sandhini Agarwal, Ariel Herbert-
Jason Baumgartner, Savvas Zannettou, Brian Keegan,         Voss, Gretchen Krueger, Tom Henighan, Rewon
   Megan Squire, and Jeremy Blackburn. 2020. The           Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu,
   pushshift reddit dataset. CoRR, abs/2001.08435.         Clemens Winter, Chris Hesse, Mark Chen, Eric
                                                           Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Emily M Bender, Timnit Gebru, Angelina McMillan-           Jack Clark, Christopher Berner, Sam McCandlish,
  Major, and Shmargaret Shmitchell. 2021. On the           Alec Radford, Ilya Sutskever, and Dario Amodei.
  dangers of stochastic parrots: Can language models       2020. Language models are few-shot learners. In
  be too big? In Proceedings of the 2021 ACM Confer-       Advances in Neural Information Processing Systems,
  ence on Fairness, Accountability, and Transparency,      volume 33, pages 1877–1901. Curran Associates,
  pages 610–623.                                           Inc.
Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng     Nathanael Chambers and Dan Jurafsky. 2008. Unsuper-
  Gao, and Yejin Choi. 2020. Piqa: Reasoning about         vised learning of narrative event chains. In Proceed-
  physical commonsense in natural language. Pro-           ings of ACL-08: HLT, pages 789–797, Columbus,
  ceedings of the AAAI Conference on Artificial Intel-     Ohio. Association for Computational Linguistics.
  ligence, 34(05):7432–7439.
                                                         Ke-Li Chiu and Rohan Alexander. 2021. Detect-
Sid Black, Stella Biderman, Eric Hallahan, Quentin An-     ing hate speech with gpt-3.   arXiv preprint
   thony, Leo Gao, Laurence Golding, Horace He, Con-       arXiv:2103.12407.
   nor Leahy, Kyle McDonell, Jason Phang, Michael
   Pieler, USVSN Sai Prashanth, Shivanshu Purohit,       Aakanksha Chowdhery, Sharan Narang, Jacob Devlin,
   Laria Reynolds, Jonathan Tow, Ben Wang, and             Maarten Bosma, Gaurav Mishra, Adam Roberts,
   Samuel Weinbach. 2022. Gpt-neox-20b: An open-           Paul Barham, Hyung Won Chung, Charles Sutton,
   source autoregressive language model.                   Sebastian Gehrmann, Parker Schuh, Kensen Shi,
  Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer,
  Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury,
  Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay
  Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam
  Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander
  Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai,
  Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon
  Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark
  Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff
  Dean, Slav Petrov, and Noah Fiedel. 2022. Palm: Scaling language modeling with pathways.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick,
  and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the AI2
  reasoning challenge. CoRR, abs/1803.05457.

Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason
  Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to
  controlled text generation. arXiv preprint arXiv:1912.02164.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training
  of deep bidirectional transformers for language understanding. In North American
  Association for Computational Linguistics (NAACL).

Jwala Dhamala, Tony Sun, Varun Kumar, Satyapriya Krishna, Yada Pruksachatkun, Kai-Wei Chang,
  and Rahul Gupta. 2021. Bold: Dataset and metrics for measuring biases in open-ended
  language generation. In Proceedings of the 2021 ACM Conference on Fairness, Accountability,
  and Transparency, pages 862–872.

Emily Dinan, Gavin Abercrombie, A Stevie Bergman, Shannon Spruit, Dirk Hovy, Y-Lan Boureau,
  and Verena Rieser. 2021. Anticipating safety issues in e2e conversational ai: Framework and
  tooling. arXiv preprint arXiv:2107.03451.

Emily Dinan, Angela Fan, Adina Williams, Jack Urbanek, Douwe Kiela, and Jason Weston. 2020a.
  Queens are powerful too: Mitigating gender bias in dialogue generation. In Proceedings of
  the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages
  8173–8188, Online. Association for Computational Linguistics.

Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it
  fix it for dialogue safety: Robustness from adversarial human attack. arXiv preprint
  arXiv:1908.06083.

Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack
  Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W.
  Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason
  Weston. 2020b. The second conversational intelligence challenge (ConvAI2). In The NeurIPS
  ’18 Competition, pages 187–208, Cham. Springer International Publishing.

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b.
  Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the
  International Conference on Learning Representations.

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason
  Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2021a. The
  pile: An 800gb dataset of diverse text for language modeling. CoRR, abs/2101.00027.

Tianyu Gao, Adam Fisch, and Danqi Chen. 2021b. Making pre-trained language models better
  few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for
  Computational Linguistics and the 11th International Joint Conference on Natural Language
  Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021,
  pages 3816–3830. Association for Computational Linguistics.

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach,
  Hal Daumé III, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM, 64(12):86–92.

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. 2020.
  RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings
  of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369, Online.
  Association for Computational Linguistics.

Udit Gupta, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S Lee, Gu-Yeon Wei, David
  Brooks, and Carole-Jean Wu. 2021. Chasing carbon: The elusive environmental footprint of
  computing. IEEE International Symposium on High-Performance Computer Architecture (HPCA
  2021).

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for
  image recognition. In Proceedings of the IEEE conference on computer vision and pattern
  recognition, pages 770–778.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza
  Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom
  Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan
  Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol
  Vinyals, and Laurent Sifre. 2022. Training compute-optimal large language models.

Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural
  text degeneration. ArXiv, abs/1904.09751.

Abigail Z. Jacobs and Hanna Wallach. 2021. Measurement and fairness. In Proceedings of the
  2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 375–385,
  New York, NY, USA. Association for Computing Machinery.

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey
  Irving. 2021. Alignment of language agents. CoRR, abs/2103.14659.

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2021. Internet-augmented dialogue
  generation. CoRR, abs/2107.07566.

Ben Krause, Akhilesh Deepak Gotmare, Bryan McCann, Nitish Shirish Keskar, Shafiq Joty,
  Richard Socher, and Nazneen Fatema Rajani. 2020. GEDI: Generative discriminator guided
  sequence generation. arXiv preprint arXiv:2009.06367.

Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for
  parameter-efficient prompt tuning. CoRR, abs/2104.08691.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge.
  In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, page
  47.

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman
  Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020.
  Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural
  Information Processing Systems, 33:9459–9474.

Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for
  Generation. pages 4582–4597.

Paul Pu Liang, Chiyu Wu, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2021. Towards
  understanding and mitigating social biases in language models. In International Conference
  on Machine Learning, pages 6565–6576. PMLR.

Opher Lieber, Or Sharir, Barak Lenz, and Yoav Shoham. 2021. Jurassic-1: Technical details
  and evaluation. Technical report, AI21 Labs.

Haochen Liu, Jamell Dacon, Wenqi Fan, Hui Liu, Zitao Liu, and Jiliang Tang. 2019a. Does
  gender matter? towards fairness in dialogue systems. arXiv preprint arXiv:1910.10486.

Haokun Liu, William Huang, Dhara Mungra, and Samuel R. Bowman. 2020. Precise task
  formalization matters in Winograd schema evaluations. In Proceedings of the 2020 Conference
  on Empirical Methods in Natural Language Processing (EMNLP), pages 8275–8280, Online.
  Association for Computational Linguistics.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021.
  What makes good in-context examples for gpt-3? CoRR, abs/2101.06804.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike
  Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Roberta: A robustly optimized bert
  pretraining approach. arXiv preprint arXiv:1907.11692.

Ilya Loshchilov and Frank Hutter. 2017. Fixing weight decay regularization in adam. CoRR,
  abs/1711.05101.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021.
  Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order
  sensitivity.

Clara Meister, Tim Vieira, and Ryan Cotterell. 2020. Best-first beam search. Transactions of
  the Association for Computational Linguistics, 8:795–809.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia,
  Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed
  precision training. arXiv preprint arXiv:1710.03740.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a suit of armor
  conduct electricity? A new dataset for open book question answering. CoRR, abs/1809.02789.

Tomas Mikolov, Jiri Kopecky, Lukas Burget, Ondrej Glembek, et al. 2009. Neural network based
  language models for highly inflective languages. In 2009 IEEE international conference on
  acoustics, speech and signal processing, pages 4725–4728. IEEE.

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2021. Metaicl: Learning to
  learn in context.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke
  Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning
  work? arXiv preprint arXiv:2202.12837.

Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson,
  Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2018. Model cards for model
  reporting. CoRR, abs/1810.03993.
Ioannis Mollas, Zoe Chrysopoulou, Stamatis Karlos, and Grigorios Tsoumakas. 2020. ETHOS: an
  online hate speech detection dataset. CoRR, abs/2006.08328.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy
  Vanderwende, Pushmeet Kohli, and James F. Allen. 2016. A corpus and evaluation framework
  for deeper understanding of commonsense stories. CoRR, abs/1604.01696.

Moin Nadeem, Anna Bethke, and Siva Reddy. 2021. StereoSet: Measuring stereotypical bias in
  pretrained language models. In Association for Computational Linguistics (ACL).

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim,
  Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. 2021. Webgpt:
  Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R Bowman. 2020. Crows-pairs: A
  challenge dataset for measuring social biases in masked language models. arXiv preprint
  arXiv:2010.00133.

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese,
  and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong
  Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to
  follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel
  Rothchild, David So, Maud Texier, and Jeff Dean. 2021. Carbon emissions and large neural
  network training. arXiv preprint arXiv:2104.10350.

Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language
  models. Advances in Neural Information Processing Systems, 34.

Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu,
  and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019
  Conference on Empirical Methods in Natural Language Processing and the 9th International
  Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong,
  China. Association for Computational Linguistics.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language
  understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019.
  Language models are unsupervised multitask learners. Technical report, OpenAI.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis
  Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom
  Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne
  Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri,
  Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese,
  Amy Wu, Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme
  Sutherland, Karen Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine
  Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki
  Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug
  Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama,
  Cyprien de Masson d’Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin,
  Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson,
  Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S. Isaac, Edward Lockhart, Simon
  Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne
  Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. 2021. Scaling language
  models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena,
  Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a
  unified text-to-text transformer. The Journal of Machine Learning Research (JMLR),
  21:1–67.

Anand Rajaraman and Jeffrey David Ullman. 2011. Mining of massive datasets. Cambridge
  University Press.

Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. 2019. Towards empathetic
  open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th
  Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, Florence,
  Italy. Association for Computational Linguistics.

Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle
  Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an
  open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the
  Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association
  for Computational Linguistics.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. Winogrande: An
  adversarial winograd schema challenge at scale. In The Thirty-Fourth AAAI Conference on
  Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial
  Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in
  Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, pages
  8732–8740. AAAI Press.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, Zaid Alyafeai,
  Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, Manan Dey, M Saiful Bari,
  Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan
  Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang,
  Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang,
  Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan
  Fries, Ryan Teehan, Stella Biderman, Leo Gao, Tali Bers, Thomas Wolf, and Alexander M.
  Rush. 2021. Multitask prompted training enables zero-shot task generalization.

Teven Le Scao and Alexander M. Rush. 2021. How many data points is a prompt worth? pages
  2627–2636.

Timo Schick and Hinrich Schütze. 2020. It’s not just size that matters: Small language
  models are also few-shot learners. CoRR, abs/2009.07118.

Timo Schick, Sahana Udupa, and Hinrich Schütze. 2021. Self-diagnosis and self-debiasing: A
  proposal for reducing corpus-based bias in nlp. Transactions of the Association for
  Computational Linguistics, 9:1408–1424.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare
  words with subword units. In Proceedings of the 54th Annual Meeting of the Association for
  Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.
  Association for Computational Linguistics.

Emily Sheng, Kai-Wei Chang, Premkumar Natarajan, and Nanyun Peng. 2019. The woman worked as
  a babysitter: On biases in language generation. arXiv preprint arXiv:1909.01326.

Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020.
  AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.
  pages 4222–4235.

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan
  Catanzaro. 2019. Megatron-lm: Training multi-billion parameter language models using model
  parallelism. arXiv preprint arXiv:1909.08053.

Kurt Shuster, Mojtaba Komeili, Leonard Adolphs, Stephen Roller, Arthur Szlam, and Jason
  Weston. 2022. Language models that seek for knowledge: Modular search & generation for
  dialogue and prompt completion. arXiv preprint arXiv:2203.13224.

Eric Smith, Mary Williamson, Kurt Shuster, Jason Weston, and Y-Lan Boureau. 2020. Can you
  put it all together: Evaluating conversational agents’ ability to blend skills. In
  Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  ACL.

Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared
  Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, Elton Zheng, Rewon
  Child, Reza Yazdani Aminabadi, Julie Bernauer, Xia Song, Mohammad Shoeybi, Yuxiong He,
  Michael Houston, Saurabh Tiwary, and Bryan Catanzaro. 2022. Using deepspeed and megatron to
  train megatron-turing NLG 530b, A large-scale generative language model. CoRR,
  abs/2201.11990.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze
  Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for
  dialog applications. arXiv preprint arXiv:2201.08239.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. CoRR,
  abs/1806.02847.

Megan Ung, Jing Xu, and Y-Lan Boureau. 2021. Saferdialogues: Taking feedback gracefully
  after conversational safety failures. ArXiv, abs/2110.07518.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
  Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural
  information processing systems.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill,
  Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose
  language understanding systems. arXiv preprint 1905.00537.

Albert Webson and Ellie Pavlick. 2021. Do prompt-based models really understand the meaning
  of their prompts? arXiv preprint arXiv:2109.01247.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
  Andrew M. Dai, and Quoc V. Le. 2021. Finetuned language models are zero-shot learners.
  CoRR, abs/2109.01652.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
  Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will
  Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia
  Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey Irving,
  and Iason Gabriel. 2021a. Ethical and social risks of harm from language models.

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang,
  Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021b. Ethical and social
  risks of harm from language models. arXiv preprint arXiv:2112.04359.

Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston.
  2020. Neural text generation with unlikelihood training. In International Conference on
  Learning Representations.

Carole-Jean Wu, Ramya Raghavendra, Udit Gupta, Bilge Acun, Newsha Ardalani, Kiwan Maeng,
  Gloria Chang, Fiona Aga Behram, James Huang, Charles Bai, Michael Gschwind, Anurag Gupta,
  Myle Ott, Anastasia Melnikov, Salvatore Candido, David Brooks, Geeta Chauhan, Benjamin Lee,
  Hsien-Hsin S. Lee, Bugra Akyildiz, Maximilian Balandat, Joe Spisak, Ravi Jain, Mike Rabbat,
  and Kim Hazelwood. 2022. Sustainable AI: environmental implications, challenges and
  opportunities. In Proceedings of the Conference on Machine Learning and Systems.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2020. Recipes for
  safety in open-domain chatbots. arXiv preprint arXiv:2010.07079.

Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021a.
  Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021
  Conference of the North American Chapter of the Association for Computational Linguistics:
  Human Language Technologies, pages 2950–2968, Online. Association for Computational
  Linguistics.

Jing Xu, Arthur Szlam, and Jason Weston. 2021b. Beyond goldfish memory: Long-term
  open-domain conversation. arXiv preprint arXiv:2107.07567.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can
  a machine really finish your sentence? In Proceedings of the 57th Conference of the
  Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2,
  2019, Volume 1: Long Papers, pages 4791–4800. Association for Computational Linguistics.

Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio
  Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual
  explanations by watching movies and reading books. CoRR, abs/1506.06724.

A   Additional Evaluations
[Figure 6 plot: a 4×4 grid of panels, one per task: HellaSwag, StoryCloze, PIQA, ARC (Easy), ARC (Challenge), OpenBookQA, Winogrande, Winograd, BoolQ, CB, COPA, WIC, WSC, MultiRC, RTE, ReCoRD. Each panel plots Accuracy (y-axis) against Parameters (x-axis, log scale from 10^8 to 10^12). Legend: OPT, GPT, PaLM, Chinchilla, Gopher, Eleuther, Jurassic.]

Figure 6: Zero-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons where available.
We find that across most tasks, GPT-3 models and OPT models perform similarly, but some tasks display highly
erratic behavior.
[Figure 7 plot: a 4×4 grid of panels, one per task: HellaSwag, StoryCloze, PIQA, ARC (Easy), ARC (Challenge), OpenBookQA, Winogrande, Winograd, BoolQ, CB, COPA, WIC, WSC, MultiRC, RTE, ReCoRD. Each panel plots Accuracy (y-axis) against Parameters (x-axis, log scale from 10^8 to 10^12). Legend: Shot (0, 1, 32); Series (OPT, GPT).]

Figure 7: Multi-shot NLP Evaluations. Full evaluations on all 16 NLP tasks, with comparisons to the
GPT-3 reported performance. As with zero-shot, performance is roughly similar for most tasks, with some tasks
demonstrating erratic behavior.
B     Contributions
Pre-training

    • Initial planning: Susan Zhang

    • Training infrastructure and initial ablations: Naman Goyal, Myle Ott, Stephen Roller, Sam Shleifer,
      Susan Zhang

    • Training efficiency: Naman Goyal, Myle Ott, Sam Shleifer

    • Data curation and deduplication: Shuohui Chen, Myle Ott, Stephen Roller

    • Training and monitoring OPT-175B: Mikel Artetxe, Moya Chen, Naman Goyal, Punit Singh Koura,
      Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Stephen Roller, Susan Zhang

    • Training 125M–66B baselines: Naman Goyal, Stephen Roller, Susan Zhang

Evaluations

    • NLP: Xian Li, Xi Victoria Lin, Todor Mihaylov, Stephen Roller, Anjali Sridhar

    • Dialogue: Stephen Roller

    • Responsible AI Evaluations: Punit Singh Koura, Stephen Roller, Tianlu Wang

Paper writing: Moya Chen, Stephen Roller, Luke Zettlemoyer, Susan Zhang

Code release preparation: Christopher Dewan, Susan Zhang

Responsible AI conduct: Mona Diab, Susan Zhang

C     Datasheet
We follow the recommendations of Gebru et al. (2021) and provide a data card for the dataset used to
train the OPT models.

C.1    Motivation
    • For what purpose was the dataset created? Was there a specific task in mind? Was there a
      specific gap that needed to be filled? Please provide a description. The pre-training data for
      training the OPT-175B model was created by a union of five datasets, including three datasets used
      by RoBERTa (Liu et al., 2019b), a subset of the Pile (Gao et al., 2021a), along with the Pushshift.io
      Reddit dataset that was developed in Baumgartner et al. (2020) and processed in Roller et al. (2021).
      The purpose of creating this dataset was to pre-train the language model on a broad corpus of text,
      with emphasis on human-generated text.

    • Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g.,
      company, institution, organization)? Meta AI.

    • Who funded the creation of the dataset? If there is an associated grant, please provide the
      name of the grantor and the grant name and number. Meta AI.

    • Any other comments? No.
C.2   Composition
  • What do the instances that comprise the dataset represent (e.g., documents, photos, people,
    countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and
    interactions between them; nodes and edges)? Please provide a description. The instances are
    textual documents. The overall dataset is composed from a union of the following datasets:
       – BookCorpus (Zhu et al., 2015) consists of more than 10K unpublished books
       – CC-Stories (Trinh and Le, 2018) contains a subset of CommonCrawl data filtered to match the
         story-like style of Winograd schemas
       – The Pile (Gao et al., 2021a) from which the following was included:
           * Pile-CC
           * OpenWebText2
           * USPTO
           * Project Gutenberg
           * OpenSubtitles
           * Wikipedia
           * DM Mathematics
           * HackerNews
       – Pushshift.io Reddit dataset that was developed in Baumgartner et al. (2020) and processed in
         Roller et al. (2021).
       – CCNewsV2 containing an updated version of the English portion of the CommonCrawl News
         dataset that was used in RoBERTa (Liu et al., 2019b)

  • How many instances are there in total (of each type, if appropriate)? The training data contains
    180B tokens corresponding to 800 GB of data.

  • Does the dataset contain all possible instances or is it a sample (not necessarily random) of
    instances from a larger set? If the dataset is a sample, then what is the larger set? Is the
    sample representative of the larger set (e.g., geographic coverage)? If so, please describe how
    this representativeness was validated/verified. If it is not representative of the larger set, please
    describe why not (e.g., to cover a more diverse range of instances, because instances were
    withheld or unavailable). The CC-stories dataset contains a subset of CommonCrawl data filtered
    to match the story-like style of Winograd schemas. The remainder of the dataset was collected from
    the above sources, reformatted, and deduplicated.

  • What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or
    features? In either case, please provide a description. Each instance consists of raw text data.

  • Is there a label or target associated with each instance? If so, please provide a description. No.

  • Is any information missing from individual instances? If so, please provide a description,
    explaining why this information is missing (e.g., because it was unavailable). This does not
    include intentionally removed information, but might include, e.g., redacted text. No.

  • Are relationships between individual instances made explicit (e.g., users’ movie ratings, social
    network links)? If so, please describe how these relationships are made explicit. There are no
    explicit relationships between individual instances.

  • Are there recommended data splits (e.g., training, development/validation, testing)? If so,
    please provide a description of these splits, explaining the rationale behind them. We hold out
    a random validation set of approximately 200MB from the pretraining data, sampled proportionally
    to each dataset’s size in the pretraining corpus.
  • Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a
    description. Outside of naturally occurring duplication from potential overlaps between the datasets,
    there are no other redundancies, errors, or sources of noise that we add.

  • Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g.,
    websites, tweets, other datasets)? It’s self-contained.

  • Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening,
    or might otherwise cause anxiety? If so, please describe why. Parts of the dataset are a subset of
    public Common Crawl data, along with a subset of public Reddit data, which could contain sentences
    that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety.

  • Does the dataset relate to people? If not, you may skip the remaining questions in this section.
    Some documents of this data relate to people, such as news articles, Wikipedia descriptions, etc.

  • Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how
    these subpopulations are identified and provide a description of their respective distributions
    within the dataset. No, the dataset does not explicitly include subpopulation identification.

  • Any other comments? No.
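
The held-out split described above (a random validation set of roughly 200MB, sampled proportionally to each dataset's size) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the per-dataset sizes, and the sampling routine are all hypothetical, since the datasheet does not list per-dataset sizes or the sampling code.

```python
import random


def allocate_validation_budget(corpus_sizes_mb, budget_mb=200.0):
    """Split a held-out budget across datasets in proportion to their size."""
    total = sum(corpus_sizes_mb.values())
    return {name: budget_mb * size / total
            for name, size in corpus_sizes_mb.items()}


def sample_held_out(docs, budget_bytes, seed=0):
    """Randomly pick document indices until their UTF-8 size reaches the budget."""
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)
    held_out, used = [], 0
    for i in order:
        if used >= budget_bytes:
            break
        held_out.append(i)
        used += len(docs[i].encode("utf-8"))
    return sorted(held_out)


# Hypothetical component sizes in MB (they sum to 800 GB only for illustration).
example_sizes_mb = {
    "dataset_a": 500_000.0,
    "dataset_b": 200_000.0,
    "dataset_c": 100_000.0,
}
allocation = allocate_validation_budget(example_sizes_mb)
for name, mb in allocation.items():
    print(f"{name}: {mb:.1f} MB held out")

# Toy documents of 10 bytes each; a 30-byte budget selects three of them.
toy_docs = ["a" * 10 for _ in range(10)]
toy_held_out = sample_held_out(toy_docs, budget_bytes=30)
```

Allocating the budget per dataset before sampling keeps the validation set's mixture close to that of the training corpus, so validation loss is weighted the same way as the training objective.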

C.3   Collection Process
  • How was the data associated with each instance acquired? Was the data directly observ-
    able (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly
    inferred/ derived from other data (e.g., part-of-speech tags, model-based guesses for age or
    language)? If data was reported by subjects or indirectly inferred/derived from other data,
    was the data validated/verified? If so, please describe how. N/A. The dataset is a union of five
    publicly available datasets.

  • What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or
    sensor, manual human curation, software program, software API)? How were these mecha-
    nisms or procedures validated? The data was downloaded from the internet.

  • If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic,
    probabilistic with specific sampling probabilities)? Please see previous answers for how the
    dataset was created.

  • Who was involved in the data collection process (e.g., students, crowdworkers, contractors)
    and how were they compensated (e.g., how much were crowdworkers paid)? This data is
    mined, filtered and sampled by machines.

  • Over what timeframe was the data collected? Does this timeframe match the creation time-
    frame of the data associated with the instances (e.g., recent crawl of old news articles)? If not,
    please describe the timeframe in which the data associated with the instances was created. The
    CC-News dataset contains English news articles crawled between September 2016 and September
    2021.

  • Does the dataset relate to people? If not, you may skip the remainder of the questions in this
    section. No.

  • Did you collect the data from the individuals in question directly, or obtain it via third parties
    or other sources (e.g., websites)? N/A.

  • Were the individuals in question notified about the data collection? If so, please describe (or
    show with screenshots or other information) how notice was provided, and provide a link or
    other access point to, or otherwise reproduce, the exact language of the notification itself. N/A.
  • Did the individuals in question consent to the collection and use of their data? If so, please
    describe (or show with screenshots or other information) how consent was requested and pro-
    vided, and provide a link or other access point to, or otherwise reproduce, the exact language
    to which the individuals consented. N/A.

  • If consent was obtained, were the consenting individuals provided with a mechanism to revoke
    their consent in the future or for certain uses? If so, please provide a description, as well as a
    link or other access point to the mechanism (if appropriate). N/A.

  • Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a
    data protection impact analysis) been conducted? If so, please provide a description of this
    analysis, including the outcomes, as well as a link or other access point to any supporting
    documentation. Some toxicity and bias evaluations were performed. Please refer to the main
    document and the model card for these details.

  • Any other comments? No.

C.4   Preprocessing/cleaning/labeling
  • Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, to-
    kenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing
    of missing values)? If so, please provide a description. If not, you may skip the remainder
    of the questions in this section. The component datasets went through standard cleaning and
    re-formatting practices, including removing repetitive/non-informative text like “Chapter One,” or
    “This ebook by Project Gutenberg.”

  • Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to sup-
    port unanticipated future uses)? If so, please provide a link or other access point to the “raw”
    data. The “raw” component datasets are publicly available in their respective locations (more details
    can be seen in the respective papers linked in references).

  • Any other comments? No.
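
The cleaning step mentioned above, removing repetitive or non-informative text such as “Chapter One” or “This ebook by Project Gutenberg,” might look roughly like the sketch below. The regular expressions and the function name are assumptions modelled on those two quoted examples only; the actual filters used for OPT are not specified in this datasheet.

```python
import re

# Hypothetical patterns, modelled on the two examples quoted in the datasheet.
BOILERPLATE_PATTERNS = [
    # A line consisting only of a chapter heading, e.g. "Chapter One", "CHAPTER 12".
    re.compile(
        r"^\s*chapter\s+(\d+|[ivxlc]+|one|two|three|four|five|six|seven|eight|nine|ten)"
        r"\s*[.:]?\s*$",
        re.IGNORECASE,
    ),
    # A line starting with the Project Gutenberg ebook notice.
    re.compile(r"^\s*this ebook by project gutenberg\b.*$", re.IGNORECASE),
]


def strip_boilerplate(text):
    """Drop every line that matches one of the boilerplate patterns."""
    kept = [
        line
        for line in text.splitlines()
        if not any(pattern.match(line) for pattern in BOILERPLATE_PATTERNS)
    ]
    return "\n".join(kept)


sample = (
    "Chapter One\n"
    "The rain had stopped by morning.\n"
    "This ebook by Project Gutenberg.\n"
    "She opened the window."
)
cleaned = strip_boilerplate(sample)
print(cleaned)
```

Anchoring the chapter pattern to the whole line means a sentence that merely begins with the word "Chapter" is kept, which avoids deleting ordinary prose.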

C.5   Uses
  • Has the dataset been used for any tasks already? If so, please provide a description. Yes, this
    dataset was used to pre-train the OPT models.

  • Is there a repository that links to any or all papers or systems that use the dataset? If so,
    please provide a link or other access point. https://github.com/facebookresearch/
    metaseq

  • What (other) tasks could the dataset be used for? This data can be used to pre-train language
    models, which are foundational to many current and future language tasks.

  • Is there anything about the composition of the dataset or the way it was collected and prepro-
    cessed/cleaned/labeled that might impact future uses? For example, is there anything that a
    future user might need to know to avoid uses that could result in unfair treatment of individ-
    uals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g.,
    financial harms, legal risks)? If so, please provide a description. Is there anything a future user
    could do to mitigate these undesirable harms? The pipeline for creating this dataset paves a way
    for building a scalable infrastructure for mining datasets.

  • Are there tasks for which the dataset should not be used? If so, please provide a description.
    None that we are currently aware of.

  • Any other comments? No.
C.6    Distribution
    • Will the dataset be distributed to third parties outside of the entity (e.g., company, institution,
      organization) on behalf of which the dataset was created? If so, please provide a description.
      Not at this time.

    • How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the
      dataset have a digital object identifier (DOI)? N/A.

    • When will the dataset be distributed? N/A.

    • Will the dataset be distributed under a copyright or other intellectual property (IP) license,
      and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and
      provide a link or other access point to, or otherwise reproduce, any relevant licensing terms
      or ToU, as well as any fees associated with these restrictions. N/A.

    • Do any export controls or other regulatory restrictions apply to the dataset or to individual
      instances? If so, please describe these restrictions, and provide a link or other access point to,
      or otherwise reproduce, any supporting documentation. N/A.

    • Any other comments? No.

C.7    Maintenance
    • Who is supporting/hosting/maintaining the dataset? Meta AI.

    • How can the owner/curator/manager of the dataset be contacted (e.g., email address)? Refer
      to the main document.

    • Is there an erratum? If so, please provide a link or other access point. N/A.

    • Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)?
      If so, please describe how often, by whom, and how updates will be communicated
      to users (e.g., mailing list, GitHub)? No current plan for updating.

    • If the dataset relates to people, are there applicable limits on the retention of the data associated
      with the instances (e.g., were individuals in question told that their data would be
      retained for a fixed period of time and then deleted)? If so, please describe these limits and
      explain how they will be enforced. N/A.

    • Will older versions of the dataset continue to be supported/hosted/maintained? If so, please
      describe how. If not, please describe how its obsolescence will be communicated to users. N/A.

    • If others want to extend/augment/build on/contribute to the dataset, is there a mechanism
      for them to do so? If so, please provide a description. Will these contributions be validated/
      verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing
      these contributions to other users? If so, please provide a description. No mechanism
      is available right now.

    • Any other comments? No.

D     Model Card
Following Mitchell et al. (2018), we provide a model card for OPT-175B.
D.1     Model Details
     • Person or organization developing model: OPT-175B was developed by Meta AI.

     • Model date: OPT-175B was released on May 3, 2022.

     • Model version: OPT-175B described in this paper is version 1.0.0.

     • Model type: OPT-175B is a large decoder-only transformer language model.

     • Information about training algorithms, parameters, fairness constraints or other applied approaches,
       and features: The OPT models, with parameter sizes from 125M to 175B, were trained with AdamW.
       See the Data Card (Appendix C) for information about training data and Sections 2.2 to 2.5 for
       information about the training process.

     • Paper or other resource for more information: See the rest of this paper for more details on
       OPT-175B as well as the corresponding post on the Meta AI Research Blog. More details are also
       available in metaseq, our open-source repository.12

     • License: OPT-175B and the smaller baseline models are made available through a non-commercial
       use license agreement provided in our model license.13

     • Where to send questions or comments about the model: Please contact the corresponding authors
      {susanz,roller,namangoyal}@fb.com for any questions or comments.

D.2     Intended Use
     • Primary intended uses: We release OPT-175B for research into Language Models, especially as it
       pertains to Responsible AI. See Section 6 for more detailed Considerations for Release. Information
       on how to use the model can be found at metaseq, our open-source repository.

     • Primary intended users: We primarily target researchers and the related research community.

     • Out-of-scope use cases: OPT-175B is not released for production use or real-world deployments.
       As we note in Section 5, OPT-175B, like similar large language models, has a variety of shortcomings
       that make it premature for commercial use.

D.3     Data, Limitations, and Recommendations
     • Data selection for training: Training data for OPT-175B was selected based on a combination of
       breadth and availability. See our Data Card (Appendix C) for more detailed information on the data
       used to train our model.

     • Data selection for evaluation: Evaluations in this paper were chosen to provide comparable performance
       assessments relative to similar-scale models in the literature. Given concerns in the community
       around safety and fairness of large language models in general, we also explicitly provide evaluations
       on Responsible AI (see Section 4).

     • Limitations: As with other large language models, the diversity (or lack thereof) of the training
       data has a downstream impact on model quality, and OPT-175B has limitations in terms
       of bias and safety. OPT-175B can also have quality issues in terms of generation diversity and
       hallucination. In general, OPT-175B is not immune to the plethora of issues that plague modern
       large language models. By releasing with a non-commercial license, we also hope to increase
       communication, transparency, and study of the problems of large language models, especially in
       areas that may not be aligned with commercial interests. See Section 5 for a more detailed
       discussion of limitations of OPT-175B.

     • Recommendations for future work: See Section 6 for more about our Considerations for Release,
       including a discussion of potential avenues of research enabled by opening our model to more of
       the research community. We hope that the release of OPT-175B, as well as information around our
       model training process, will increase open science around both large language models specifically and
       natural language processing and deep learning in general.

 12 https://github.com/facebookresearch/metaseq/
 13 https://github.com/facebookresearch/metaseq/blob/main/projects/OPT/MODEL_LICENSE.md

E   Sample Model Outputs
For all sample outputs, the initial prompt is given in bold and the remainder is the continuation. These
example outputs were intentionally selected to highlight both successes and failures of the OPT-175B
model.




Figure 8: Poetry generation. We have observed the model can write entertaining poetry on topics such as dodos,
samosas, and performance reviews. However, we struggled to get the model to observe rhyme or meter.




Figure 9: Conversation generation. OPT-175B adopts a patriotic personality when prompted as the Statue of
Liberty. However, the model also devolves into somewhat simple and linguistically repetitive generations further
into the conversation.

Figure 10: Basic few-shot translation example. OPT was not intentionally trained to be multilingual, but we
found anecdotally it has limited success with simple translations in German, Spanish, French, and Chinese.

Figure 11: Paper writing example. Prompting with “1. Introduction” generally yielded more interesting results
compared to prompting with “Abstract.” Our prompt here was inspired by the first sentence of the seminal ResNet
work (He et al., 2016).

Figure 12: Arithmetic. We observe mistakes when extending from addition to other operations.

Figure 13: Python programming. Simply switching out a variable name can alter the generated output.