                   Language Models are Unsupervised Multitask Learners


          Alec Radford * 1 Jeffrey Wu * 1 Rewon Child 1 David Luan 1 Dario Amodei ** 1 Ilya Sutskever ** 1


*, ** Equal contribution. 1 OpenAI, San Francisco, California, United States. Correspondence to: Alec Radford <alec@openai.com>.

Abstract

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

1. Introduction

Machine learning systems now excel (in expectation) at tasks they are trained for by using a combination of large datasets, high-capacity models, and supervised learning (Krizhevsky et al., 2012) (Sutskever et al., 2014) (Amodei et al., 2016). Yet these systems are brittle and sensitive to slight changes in the data distribution (Recht et al., 2018) and task specification (Kirkpatrick et al., 2017). Current systems are better characterized as narrow experts rather than competent generalists. We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one.

The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts. But the often erratic behavior of captioning models (Lake et al., 2017), reading comprehension systems (Jia & Liang, 2017), and image classifiers (Alcorn et al., 2018) on the diversity and variety of possible inputs highlights some of the shortcomings of this approach.

Our suspicion is that the prevalence of single task training on single domain datasets is a major contributor to the lack of generalization observed in current systems. Progress towards robust systems with current architectures is likely to require training and measuring performance on a wide range of domains and tasks. Recently, several benchmarks have been proposed such as GLUE (Wang et al., 2018) and decaNLP (McCann et al., 2018) to begin studying this.

Multitask learning (Caruana, 1997) is a promising framework for improving general performance. However, multitask training in NLP is still nascent. Recent work reports modest performance improvements (Yogatama et al., 2019) and the two most ambitious efforts to date have trained on a total of 10 and 17 (dataset, objective) pairs respectively (McCann et al., 2018) (Bowman et al., 2018). From a meta-learning perspective, each (dataset, objective) pair is a single training example sampled from the distribution of datasets and objectives. Current ML systems need hundreds to thousands of examples to induce functions which generalize well. This suggests that multitask training may need just as many effective training pairs to realize its promise with current approaches. It will be very difficult to continue to scale the creation of datasets and the design of objectives to the degree that may be required to brute force our way there with current techniques. This motivates exploring additional setups for performing multitask learning.

The current best performing systems on language tasks




Figure 1. Zero-shot task performance of WebText LMs as a function of model size on many NLP tasks. Reading Comprehension results
are on CoQA (Reddy et al., 2018), translation on WMT-14 Fr-En (Artetxe et al., 2017), summarization on CNN and Daily Mail (See et al.,
2017), and Question Answering on Natural Questions (Kwiatkowski et al., 2019). Section 3 contains detailed descriptions of each result.



utilize a combination of pre-training and supervised fine-tuning. This approach has a long history with a trend towards more flexible forms of transfer. First, word vectors were learned and used as inputs to task-specific architectures (Mikolov et al., 2013) (Collobert et al., 2011), then the contextual representations of recurrent networks were transferred (Dai & Le, 2015) (Peters et al., 2018), and recent work suggests that task-specific architectures are no longer necessary and transferring many self-attention blocks is sufficient (Radford et al., 2018) (Devlin et al., 2018).

These methods still require supervised training in order to perform a task. When only minimal or no supervised data is available, another line of work has demonstrated the promise of language models to perform specific tasks, such as commonsense reasoning (Schwartz et al., 2017) and sentiment analysis (Radford et al., 2017).

In this paper, we connect these two lines of work and continue the trend of more general methods of transfer. We demonstrate language models can perform down-stream tasks in a zero-shot setting – without any parameter or architecture modification. We demonstrate this approach shows potential by highlighting the ability of language models to perform a wide range of tasks in a zero-shot setting. We achieve promising, competitive, and state of the art results depending on the task.

2. Approach

At the core of our approach is language modeling. Language modeling is usually framed as unsupervised distribution estimation from a set of examples $(x_1, x_2, \ldots, x_n)$ each composed of variable length sequences of symbols $(s_1, s_2, \ldots, s_n)$. Since language has a natural sequential ordering, it is common to factorize the joint probabilities over symbols as the product of conditional probabilities (Jelinek & Mercer, 1980) (Bengio et al., 2003):

$$p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1}) \tag{1}$$

This approach allows for tractable sampling from and estimation of $p(x)$ as well as any conditionals of the form $p(s_{n-k}, \ldots, s_n \mid s_1, \ldots, s_{n-k-1})$. In recent years, there have been significant improvements in the expressiveness of models that can compute these conditional probabilities, such as self-attention architectures like the Transformer (Vaswani et al., 2017).

Learning to perform a single task can be expressed in a probabilistic framework as estimating a conditional distribution p(output|input). Since a general system should be able to perform many different tasks, even for the same input, it should condition not only on the input but also on the task to be performed. That is, it should model p(output|input, task). This has been variously formalized in multitask and meta-learning settings. Task conditioning is often implemented at an architectural level, such as the task specific encoders and decoders in (Kaiser et al., 2017) or at an algorithmic level such as the inner and outer loop optimization framework of MAML (Finn et al., 2017). But as exemplified in McCann et al. (2018), language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer). McCann et al. (2018) demonstrated it was possible to train a single model, the MQAN,

to infer and perform many different tasks on examples with this type of format.

Language modeling is also able to, in principle, learn the tasks of McCann et al. (2018) without the need for explicit supervision of which symbols are the outputs to be predicted. Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective. In this slightly toy setting, the concerns with density estimation as a principled training objective discussed in (Sutskever et al., 2015) are side stepped. The problem instead becomes whether we are able to, in practice, optimize the unsupervised objective to convergence. Preliminary experiments confirmed that sufficiently large language models are able to perform multitask learning in this toy-ish setup but learning is much slower than in explicitly supervised approaches.

While it is a large step from the well-posed setup described above to the messiness of “language in the wild”, Weston (2016) argues, in the context of dialog, for the need to develop systems capable of learning from natural language directly and demonstrated a proof of concept – learning a QA task without a reward signal by using forward prediction of a teacher’s outputs. While dialog is an attractive approach, we worry it is overly restrictive. The internet contains a vast amount of information that is passively available without the need for interactive communication. Our speculation is that a language model with sufficient capacity will begin to learn to infer and perform the tasks demonstrated in natural language sequences in order to better predict them, regardless of their method of procurement. If a language model is able to do this it will be, in effect, performing unsupervised multitask learning. We test whether this is the case by analyzing the performance of language models in a zero-shot setting on a wide variety of tasks.

    ”I’m not the cleverest man in the world, but like they say in French: Je ne suis pas un imbecile [I’m not a fool].

    In a now-deleted post from Aug. 16, Soheil Eid, Tory candidate in the riding of Joliette, wrote in French: ”Mentez mentez, il en restera toujours quelque chose,” which translates as, ”Lie lie and something will always remain.”

    “I hate the word ‘perfume,”’ Burr says. ‘It’s somewhat better in French: ‘parfum.’

    If listened carefully at 29:55, a conversation can be heard between two guys in French: “-Comment on fait pour aller de l’autre coté? -Quel autre coté?”, which means “- How do you get to the other side? - What side?”.

    If this sounds like a bit of a stretch, consider this question in French: As-tu aller au cinéma?, or Did you go to the movies?, which literally translates as Have-you to go to movies/theater?

    “Brevet Sans Garantie Du Gouvernement”, translated to English: “Patented without government warranty”.

Table 1. Examples of naturally occurring demonstrations of English to French and French to English translation found throughout the WebText training set.

2.1. Training Dataset

Most prior work trained language models on a single domain of text, such as news articles (Jozefowicz et al., 2016), Wikipedia (Merity et al., 2016), or fiction books (Kiros et al., 2015). Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible.

A promising source of diverse and nearly unlimited text is web scrapes such as Common Crawl. While these archives are many orders of magnitude larger than current language modeling datasets, they have significant data quality issues. Trinh & Le (2018) used Common Crawl in their work on commonsense reasoning but noted a large amount of documents “whose content are mostly unintelligible”. We observed similar data issues in our initial experiments with Common Crawl. Trinh & Le (2018)’s best results were achieved using a small subsample of Common Crawl which included only documents most similar to their target dataset, the Winograd Schema Challenge. While this is a pragmatic approach to improve performance on a specific task, we want to avoid making assumptions about the tasks to be performed ahead of time.

Instead, we created a new web scrape which emphasizes document quality. To do this we only scraped web pages which have been curated/filtered by humans. Manually filtering a full web scrape would be exceptionally expensive so as a starting point, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny.

The resulting dataset, WebText, contains the text subset of these 45 million links. To extract the text from HTML responses we use a combination of the Dragnet (Peters & Lecocq, 2013) and Newspaper¹ content extractors. All results presented in this paper use a preliminary version of WebText which does not include links created after Dec 2017 and which after de-duplication and some heuristic based cleaning contains slightly over 8 million documents for a total of 40 GB of text. We removed all Wikipedia documents from WebText since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks.

¹ https://github.com/codelucas/newspaper
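
The selection steps described in Section 2.1 (a karma threshold, removal of Wikipedia documents, de-duplication) can be illustrated with a small sketch. This is not the authors' pipeline: the record layout (`url`, `karma`, `text`), the function name `select_documents`, and de-duplication by hashing whitespace-normalized, lowercased text are all assumptions made for illustration, since the paper does not describe how de-duplication or the heuristic cleaning was implemented.

```python
import hashlib
from typing import Dict, Iterable, List
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold stated in the text


def select_documents(records: Iterable[Dict]) -> List[Dict]:
    """Keep documents whose link has enough karma, is not Wikipedia, and is not a duplicate."""
    seen = set()
    kept = []
    for rec in records:
        # Heuristic quality signal: the link received at least MIN_KARMA karma.
        if rec["karma"] < MIN_KARMA:
            continue
        # The text says Wikipedia documents were removed from WebText.
        host = urlparse(rec["url"]).hostname or ""
        if host.endswith("wikipedia.org"):
            continue
        # Assumption: exact-match de-duplication on normalized text.
        normalized = " ".join(rec["text"].lower().split())
        key = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

A real pipeline would also need the HTML content extraction step and fuzzier duplicate detection; the sketch only shows how the stated filters compose.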
2.2. Input Representation

A general language model (LM) should be able to compute the probability of (and also generate) any string. Current large scale LMs include pre-processing steps such as lowercasing, tokenization, and out-of-vocabulary tokens which restrict the space of model-able strings. While processing Unicode strings as a sequence of UTF-8 bytes elegantly fulfills this requirement as exemplified in work such as Gillick et al. (2015), current byte-level LMs are not competitive with word-level LMs on large scale datasets such as the One Billion Word Benchmark (Al-Rfou et al., 2018). We observed a similar performance gap in our own attempts to train standard byte-level LMs on WebText.

Byte Pair Encoding (BPE) (Sennrich et al., 2015) is a practical middle ground between character and word level language modeling which effectively interpolates between word level inputs for frequent symbol sequences and character level inputs for infrequent symbol sequences. Despite its name, reference BPE implementations often operate on Unicode code points and not byte sequences. These implementations would require including the full space of Unicode symbols in order to model all Unicode strings. This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256. However, directly applying BPE to the byte sequence results in sub-optimal merges due to BPE using a greedy frequency based heuristic for building the token vocabulary. We observed BPE including many versions of common words like dog since they occur in many variations such as dog. dog! dog? . This results in a sub-optimal allocation of limited vocabulary slots and model capacity. To avoid this, we prevent BPE from merging across character categories for any byte sequence. We add an exception for spaces which significantly improves the compression efficiency while adding only minimal fragmentation of words across multiple vocab tokens.

This input representation allows us to combine the empirical benefits of word-level LMs with the generality of byte-level approaches. Since our approach can assign a probability to any Unicode string, this allows us to evaluate our LMs on any dataset regardless of pre-processing, tokenization, or vocab size.

2.3. Model

We use a Transformer (Vaswani et al., 2017) based architecture for our LMs. The model largely follows the details of the OpenAI GPT model (Radford et al., 2018) with a few modifications. Layer normalization (Ba et al., 2016) was moved to the input of each sub-block, similar to a pre-activation residual network (He et al., 2016) and an additional layer normalization was added after the final self-attention block. A modified initialization which accounts for the accumulation on the residual path with model depth is used. We scale the weights of residual layers at initialization by a factor of $1/\sqrt{N}$ where $N$ is the number of residual layers. The vocabulary is expanded to 50,257. We also increase the context size from 512 to 1024 tokens and a larger batchsize of 512 is used.

    Parameters    Layers    d_model
    117M            12        768
    345M            24       1024
    762M            36       1280
    1542M           48       1600

Table 2. Architecture hyperparameters for the 4 model sizes.

3. Experiments

We trained and benchmarked four LMs with approximately log-uniformly spaced sizes. The architectures are summarized in Table 2. The smallest model is equivalent to the original GPT, and the second smallest equivalent to the largest model from BERT (Devlin et al., 2018). Our largest model, which we call GPT-2, has over an order of magnitude more parameters than GPT. The learning rate of each model was manually tuned for the best perplexity on a 5% held-out sample of WebText. All models still underfit WebText and held-out perplexity has as of yet improved given more training time.

3.1. Language Modeling

As an initial step towards zero-shot task transfer, we are interested in understanding how WebText LMs perform at zero-shot domain transfer on the primary task they are trained for – language modeling. Since our model operates on a byte level and does not require lossy pre-processing or tokenization, we can evaluate it on any language model benchmark. Results on language modeling datasets are commonly reported in a quantity which is a scaled or exponentiated version of the average negative log probability per canonical prediction unit - usually a character, a byte, or a word. We evaluate the same quantity by computing the log-probability of a dataset according to a WebText LM and dividing by the number of canonical units. For many of these datasets, WebText LMs would be tested significantly out-of-distribution, having to predict aggressively standardized text, tokenization artifacts such as disconnected punctuation and contractions, shuffled sentences, and even the string

            LAMBADA         LAMBADA          CBT-CN     CBT-NE       WikiText2    PTB     enwik8    text8   WikiText103      1BW
                (PPL)           (ACC)         (ACC)       (ACC)        (PPL)      (PPL)    (BPB)    (BPC)        (PPL)        (PPL)

  SOTA          99.8            59.23          85.7       82.3         39.14     46.54     0.99      1.08        18.3         21.8
  117M          35.13           45.99         87.65        83.4        29.41     65.85     1.16      1.17       37.50         75.20
  345M          15.60           55.48         92.35        87.1        22.76     47.33     1.01      1.06       26.37         55.72
  762M          10.87           60.12         93.45       88.0         19.93     40.31     0.97      1.02       22.05        44.575
  1542M          8.63           63.24         93.30       89.05        18.34     35.76     0.93      0.98       17.48         42.16


Table 3. Zero-shot results on many datasets. No training or fine-tuning was performed for any of these results. PTB and WikiText-2
results are from (Gong et al., 2018). CBT results are from (Bajgar et al., 2016). LAMBADA accuracy result is from (Hoang et al., 2018)
and LAMBADA perplexity result is from (Grave et al., 2016). Other results are from (Dai et al., 2019).
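
Table 3 mixes three reporting units: perplexity (PPL), bits per byte (BPB), and bits per character (BPC). Each is a scaled or exponentiated average negative log-probability per canonical unit, so one total log-probability for a dataset can be converted into any of them by changing the unit count. The sketch below is a minimal illustration that assumes natural-log probabilities; the function names and the example numbers are hypothetical and are not taken from the paper.

```python
import math


def perplexity(total_log_prob: float, num_units: int) -> float:
    """Exponentiated average negative natural-log probability per unit (e.g. per word)."""
    return math.exp(-total_log_prob / num_units)


def bits_per_unit(total_log_prob: float, num_units: int) -> float:
    """Average negative log2 probability per unit (per byte for BPB, per character for BPC)."""
    return -total_log_prob / (num_units * math.log(2))


# Hypothetical dataset: 1,000 words, 6,000 bytes, total natural-log probability of -3,000.
total_lp = -3000.0
ppl = perplexity(total_lp, 1000)     # per-word perplexity, exp(3)
bpb = bits_per_unit(total_lp, 6000)  # bits per byte, 0.5 / ln(2)
```

The only thing that changes between metrics is the denominator, which is why a model that assigns probability to arbitrary byte strings can be scored against word-level, character-level, and byte-level benchmarks alike.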



<UNK> which is extremely rare in WebText - occurring only 26 times in 40 billion bytes. We report our main results in Table 3 using invertible de-tokenizers which remove as many of these tokenization / pre-processing artifacts as possible. Since these de-tokenizers are invertible, we can still calculate the log probability of a dataset and they can be thought of as a simple form of domain adaptation. We observe gains of 2.5 to 5 perplexity for GPT-2 with these de-tokenizers.

WebText LMs transfer well across domains and datasets, improving the state of the art on 7 out of the 8 datasets in a zero-shot setting. Large improvements are noticed on small datasets such as Penn Treebank and WikiText-2 which have only 1 to 2 million training tokens. Large improvements are also noticed on datasets created to measure long-term dependencies like LAMBADA (Paperno et al., 2016) and the Children’s Book Test (Hill et al., 2015). Our model is still significantly worse than prior work on the One Billion Word Benchmark (Chelba et al., 2013). This is likely due to a combination of it being both the largest dataset and having some of the most destructive pre-processing - 1BW’s sentence level shuffling removes all long-range structure.

3.2. Children’s Book Test

The Children’s Book Test (CBT) (Hill et al., 2015) was created to examine the performance of LMs on different categories of words: named entities, nouns, verbs, and prepositions. Rather than reporting perplexity as an evaluation metric, CBT reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct. Following the LM approach introduced in the original paper, we compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the one with the highest probability. As seen in Figure 2 performance steadily improves as model size is increased and closes the majority of the gap to human performance on this test. Data overlap analysis showed one of the CBT test set books, The Jungle Book by Rudyard Kipling, is in WebText, so we report results on the validation set which has no significant overlap. GPT-2 achieves new state of the art results of 93.3% on common nouns and 89.1% on named entities. A de-tokenizer was applied to remove PTB style tokenization artifacts from CBT.

Figure 2. Performance on the Children’s Book Test as a function of model capacity. Human performance figures are from Bajgar et al. (2016), instead of the much lower estimates from the original paper.

3.3. LAMBADA

The LAMBADA dataset (Paperno et al., 2016) tests the ability of systems to model long-range dependencies in text. The task is to predict the final word of sentences which require at least 50 tokens of context for a human to successfully predict. GPT-2 improves the state of the art from 99.8 (Grave et al., 2016) to 8.6 perplexity and increases the accuracy of LMs on this test from 19% (Dehghani et al., 2018) to 52.66%. Investigating GPT-2’s errors showed most predictions are valid continuations of the sentence, but are not valid final words. This suggests that the LM is not using the additional useful constraint that the word must be the final word of the sentence. Adding a stop-word filter as an approximation to this further increases accuracy to 63.24%, improving the overall state of the art on this task by 4%. The previous state of the art (Hoang et al., 2018) used a different restricted prediction setting where the outputs of the model were constrained to only words that appeared in the context. For GPT-2, this restriction is harmful rather than helpful

since 19% of answers are not in context. We use a version
of the dataset without preprocessing.

3.4. Winograd Schema Challenge

                                                                                  R-1     R-2     R-L    R-AVG
                                                               Bottom-Up Sum     41.22   18.68   38.34   32.75
                                                               Lede-3            40.38   17.66   36.62   31.55
                                                               Seq2Seq + Attn    31.33   11.81   28.83   23.99
                                                               GPT-2 TL;DR:      29.34    8.27   26.58   21.40
                                                               Random-3          28.78    8.63   25.52   20.98
                                                               GPT-2 no hint     21.58    4.03   19.47   15.03

                                                               Table 4. Summarization performance as measured by ROUGE F1
                                                               metrics on the CNN and Daily Mail dataset. Bottom-Up Sum is
                                                               the SOTA model from (Gehrmann et al., 2018)
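
The R-AVG column of Table 4 appears to be the unweighted mean of the R-1, R-2 and R-L scores. That definition is an assumption inferred from the numbers rather than something the text states. The short sketch below recomputes the column under that assumption, along with the gap between the two GPT-2 rows (with and without the TL;DR: hint); the names `TABLE_4` and `rouge_average` are illustrative only.

```python
# ROUGE F1 scores (R-1, R-2, R-L) copied from Table 4.
TABLE_4 = {
    "Bottom-Up Sum": (41.22, 18.68, 38.34),
    "Lede-3": (40.38, 17.66, 36.62),
    "Seq2Seq + Attn": (31.33, 11.81, 28.83),
    "GPT-2 TL;DR:": (29.34, 8.27, 26.58),
    "Random-3": (28.78, 8.63, 25.52),
    "GPT-2 no hint": (21.58, 4.03, 19.47),
}


def rouge_average(scores):
    """Unweighted mean of the three ROUGE scores (assumed definition of R-AVG)."""
    return sum(scores) / len(scores)


# Recompute the aggregate column for every system.
r_avg = {name: rouge_average(scores) for name, scores in TABLE_4.items()}

# Difference between GPT-2 with and without the task hint.
hint_gain = r_avg["GPT-2 TL;DR:"] - r_avg["GPT-2 no hint"]

for name, value in r_avg.items():
    print(f"{name:<16} {value:6.2f}")
print(f"gain from the TL;DR: hint: {hint_gain:.1f}")
```

Under this reading, each recomputed mean agrees with the reported R-AVG to two decimal places, and the difference between the two GPT-2 rows rounds to the 6.4 points quoted in Section 3.6.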


Figure 3. Performance on the Winograd Schema Challenge as a function of model capacity.

The Winograd Schema challenge (Levesque et al., 2012) was constructed to measure the
capability of a system to perform commonsense reasoning by measuring its ability to
resolve ambiguities in text. Recently Trinh & Le (2018) demonstrated significant progress
on this challenge using LMs, by predicting the resolution of the ambiguity with higher
probability. We follow their problem formulation and visualize the performance of our
models with both full and partial scoring techniques in Figure 3. GPT-2 improves state
of the art accuracy by 7%, achieving 70.70%. The dataset is quite small with only 273
examples so we recommend reading Trichelair et al. (2018) to help contextualize this
result.

3.5. Reading Comprehension

The Conversation Question Answering dataset (CoQA) (Reddy et al., 2018) consists of
documents from 7 different domains paired with natural language dialogues between a
question asker and a question answerer about the document. CoQA tests reading
comprehension capabilities and also the ability of models to answer questions that
depend on conversation history (such as “Why?”).

Greedy decoding from GPT-2 when conditioned on a document, the history of the associated
conversation, and a final token A: achieves 55 F1 on the development set. This matches
or exceeds the performance of 3 out of 4 baseline systems without using the 127,000+
manually collected question answer pairs those baselines were trained on. The supervised
SOTA, a BERT based system (Devlin et al., 2018), is nearing the 89 F1 performance of
humans. While GPT-2’s performance is exciting for a system without any supervised
training, some inspection of its answers and errors suggests GPT-2 often uses simple
retrieval based heuristics such as answer with a name from the document in response to
a who question.

3.6. Summarization

We test GPT-2’s ability to perform summarization on the CNN and Daily Mail dataset
(Nallapati et al., 2016). To induce summarization behavior we add the text TL;DR: after
the article and generate 100 tokens with Top-k random sampling (Fan et al., 2018) with
k = 2 which reduces repetition and encourages more abstractive summaries than greedy
decoding. We use the first 3 generated sentences in these 100 tokens as the summary.
While qualitatively the generations resemble summaries, as shown in Table 14, they often
focus on recent content from the article or confuse specific details such as how many
cars were involved in a crash or whether a logo was on a hat or shirt. On the commonly
reported ROUGE 1,2,L metrics the generated summaries only begin to approach the
performance of classic neural baselines and just barely outperform selecting 3 random
sentences from the article. GPT-2’s performance drops by 6.4 points on the aggregate
metric when the task hint is removed which demonstrates the ability to invoke task
specific behavior in a language model with natural language.

3.7. Translation

We test whether GPT-2 has begun to learn how to translate from one language to another.
In order to help it infer that this is the desired task, we condition the language model
on a context of example pairs of the format english sentence = french sentence and then
after a final prompt of english sentence = we sample from the model with greedy decoding
and use the first generated sentence as the translation. On the WMT-14 English-French
test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution
with a bilingual lexicon inferred in previous work on unsupervised word translation


              Question                                                                              Generated Answer        Correct   Probability
              Who wrote the book the origin of species?                                             Charles Darwin            ✓         83.4%
              Who is the founder of the ubuntu project?                                             Mark Shuttleworth         ✓         82.0%
              Who is the quarterback for the green bay packers?                                     Aaron Rodgers             ✓         81.1%
              Panda is a national animal of which country?                                          China                     ✓         76.8%
              Who came up with the theory of relativity?                                            Albert Einstein           ✓         76.4%
              When was the first star wars film released?                                           1977                      ✓         71.4%
              What is the most common blood type in sweden?                                         A                         ✗         70.6%
              Who is regarded as the founder of psychoanalysis?                                     Sigmund Freud             ✓         69.3%
              Who took the first steps on the moon in 1969?                                         Neil Armstrong            ✓         66.8%
              Who is the largest supermarket chain in the uk?                                       Tesco                     ✓         65.3%
              What is the meaning of shalom in english?                                             peace                     ✓         64.0%
              Who was the author of the art of war?                                                 Sun Tzu                   ✓         59.6%
              Largest state in the us by land mass?                                                 California                ✗         59.2%
              Green algae is an example of which type of reproduction?                              parthenogenesis           ✗         56.5%
              Vikram samvat calender is official in which country?                                  India                     ✓         55.6%
              Who is mostly responsible for writing the declaration of independence?                Thomas Jefferson          ✓         53.3%
              What us state forms the western boundary of montana?                                  Montana                   ✗         52.3%
              Who plays ser davos in game of thrones?                                               Peter Dinklage            ✗         52.1%
              Who appoints the chair of the federal reserve system?                                 Janet Yellen              ✗         51.5%
              State the process that divides one nucleus into two genetically identical nuclei?     mitosis                   ✓         50.7%
              Who won the most mvp awards in the nba?                                               Michael Jordan            ✗         50.2%
              What river is associated with the city of rome?                                       the Tiber                 ✓         48.6%
              Who is the first president to be impeached?                                           Andrew Johnson            ✓         48.3%
              Who is the head of the department of homeland security 2017?                          John Kelly                ✓         47.0%
              What is the name given to the common currency to the european union?                  Euro                      ✓         46.8%
              What was the emperor name in star wars?                                               Palpatine                 ✓         46.5%
              Do you have to have a gun permit to shoot at a range?                                 No                        ✓         46.4%
              Who proposed evolution in 1859 as the basis of biological development?                Charles Darwin            ✓         45.7%
              Nuclear power plant that blew up in russia?                                           Chernobyl                 ✓         45.7%
              Who played john connor in the original terminator?                                    Arnold Schwarzenegger     ✗         45.2%


Table 5. The 30 most confident answers generated by GPT-2 on the development set of Natural Questions sorted by their probability
according to GPT-2. None of these questions appear in WebText according to the procedure described in Section 4.
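
Table 5 ranks answers by the probability the model assigns to them, and Section 3.8 scores answers with the exact match metric. The following is a minimal sketch of both steps, not the evaluation code used for the paper. It assumes a SQuAD-style normalization (lower-casing, stripping punctuation and the articles a/an/the), and the helper names and gold answers are illustrative stand-ins rather than the Natural Questions annotations.

```python
import re
import string


def normalize_answer(text):
    """Lower-case, drop punctuation and articles, collapse whitespace (assumed rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def accuracy_of_most_confident(records, fraction):
    """Exact-match accuracy on the `fraction` of records with the highest probability.

    Each record is a (prediction, gold_answers, probability) tuple.
    """
    ranked = sorted(records, key=lambda record: record[2], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top = ranked[:k]
    return sum(exact_match(pred, gold) for pred, gold, _ in top) / k


# Predictions and probabilities are taken from Table 5; the gold answers are
# illustrative stand-ins, not the dataset's own annotations.
records = [
    ("Charles Darwin", ["Charles Darwin"], 0.834),
    ("Mark Shuttleworth", ["Mark Shuttleworth"], 0.820),
    ("California", ["Alaska"], 0.592),
    ("the Tiber", ["Tiber"], 0.486),
]

print(accuracy_of_most_confident(records, 1.0))  # accuracy over all four records
print(accuracy_of_most_confident(records, 0.5))  # accuracy over the two most confident
```

Sorting by probability before scoring is what makes a calibration check possible: if confidence is informative, accuracy on the top slice should exceed accuracy on the full set.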


(Conneau et al., 2017b). On the WMT-14 French-English                                  2019) is a promising resource to test this more quantita-
test set, GPT-2 is able to leverage its very strong English                            tively. Similar to translation, the context of the language
language model to perform significantly better, achieving                              model is seeded with example question answer pairs which
11.5 BLEU. This outperforms several unsupervised machine                               helps the model infer the short answer style of the dataset.
translation baselines from (Artetxe et al., 2017) and (Lample                          GPT-2 answers 4.1% of questions correctly when evalu-
et al., 2017) but is still much worse than the 33.5 BLEU of                            ated by the exact match metric commonly used on reading
the current best unsupervised machine translation approach                            comprehension datasets like SQUAD.³ As a comparison
(Artetxe et al., 2019). Performance on this task was sur-                              point, the smallest model does not exceed the 1.0% accu-
prising to us, since we deliberately removed non-English                               racy of an incredibly simple baseline which returns the most
webpages from WebText as a filtering step. In order to con-                            common answer for each question type (who, what, where,
firm this, we ran a byte-level language detector² on WebText                           common answer for each question type (who, what, where,
which detected only 10MB of data in the French language                                suggesting that model capacity has been a major factor in
which is approximately 500x smaller than the monolingual                               the poor performance of neural systems on this kind of task
French corpus common in prior unsupervised machine trans-                              as of yet. The probability GPT-2 assigns to its generated
lation research.                                                                       answers is well calibrated and GPT-2 has an accuracy of
                                                                                       63.1% on the 1% of questions it is most confident in. The
3.8. Question Answering                                                                30 most confident answers generated by GPT-2 on develop-
                                                                                       ment set questions are shown in Table 5. The performance
A potential way to test what information is contained within                           of GPT-2 is still much, much, worse than the 30 to 50%
a language model is to evaluate how often it generates the                             range of open domain question answering systems which
correct answer to factoid-style questions. Previous showcas-                           hybridize information retrieval with extractive document
ing of this behavior in neural systems where all information                           question answering (Alberti et al., 2019).
is stored in parameters such as A Neural Conversational
                                                                                            3
Model (Vinyals & Le, 2015) reported qualitative results due                                Alec, who previously thought of himself as good at random
to the lack of high-quality evaluation datasets. The recently                         trivia, answered 17 of 100 randomly sampled examples correctly
                                                                                      when tested in the same setting as GPT-2. He actually only got 14 right but he
introduced Natural Questions dataset (Kwiatkowski et al.,                              should have gotten those other 3

   2
       https://github.com/CLD2Owners/cld2
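
The prompt format that Section 3.7 describes (example pairs written as english sentence = french sentence, followed by a final english sentence =) can be sketched as below. This is only an illustration under stated assumptions: the newline separator between pairs, the rule used to cut the first generated sentence, the example pairs, and the function names are all hypothetical, and no model is called.

```python
import re


def build_translation_prompt(example_pairs, source_sentence):
    """Join 'source = target' example pairs, then end with 'source ='.

    Separating the pairs with newlines is an assumption made for illustration.
    """
    lines = [f"{src} = {tgt}" for src, tgt in example_pairs]
    lines.append(f"{source_sentence} =")
    return "\n".join(lines)


def first_sentence(generated_text):
    """Keep the text up to the first sentence-ending mark on the first line (assumed rule)."""
    text = generated_text.strip().split("\n", 1)[0]
    match = re.search(r"[.!?]", text)
    return text[: match.end()] if match else text


# Hypothetical example pairs, used only to show the shape of the prompt.
pairs = [
    ("Good morning.", "Bonjour."),
    ("Thank you very much.", "Merci beaucoup."),
]
prompt = build_translation_prompt(pairs, "I am tired.")
print(prompt)

# A made-up continuation standing in for the model's greedy output.
print(first_sentence(" Je suis fatigué. I am hungry. = J'ai faim."))
```

The design point is that the task is specified entirely through the context: nothing in the prompt names translation, so the model has to infer it from the repeated pattern.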


                                          PTB      WikiText-2       enwik8    text8    Wikitext-103         1BW
                       Dataset train     2.67%       0.66%          7.50%    2.34%        9.09%            13.19%
                       WebText train     0.88%       1.63%          6.31%    3.94%        2.42%            3.75%

                                 Table 6. Percentage of test set 8-grams overlapping with training sets.
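
The overlap measurement behind Table 6 (normalize the text, collect 8-grams from the training set into a Bloom filter, then report the share of evaluation 8-grams found in it) can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the filter size, the number of hash functions, the SHA-256 double-hashing scheme and the use of whitespace-separated words as tokens are all assumptions, and a filter this small would not meet the false positive bound needed at WebText scale.

```python
import hashlib
import re


def normalize(text):
    """Lower-cased alphanumeric words joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def ngrams(text, n=8):
    """All contiguous n-word strings of the normalized text."""
    words = normalize(text).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class BloomFilter:
    """Toy Bloom filter; the sizing parameters are illustrative assumptions."""

    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def overlap_percentage(train_text, test_text, n=8):
    """Percentage of test n-grams that the filter reports as seen in training."""
    bloom = BloomFilter()
    for gram in ngrams(train_text, n):
        bloom.add(gram)
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    hits = sum(gram in bloom for gram in test_grams)
    return 100.0 * hits / len(test_grams)


train = "the quick brown fox jumps over the lazy dog near the river bank today"
print(overlap_percentage(train, "The quick, brown fox jumps over the lazy dog!"))
print(overlap_percentage(train, "completely unrelated sentence with no shared eight word span at all here"))
```

A Bloom filter never misses an item that was added, so it can only overstate overlap, which is consistent with an approach that optimizes for recall.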


4. Generalization vs Memorization                                      gave away the answer.
Recent work in computer vision has shown that common im-               For CoQA, about 15% of documents in the news domain
age datasets contain a non-trivial amount of near-duplicate            are already in WebText and the model performs about 3
images. For instance CIFAR-10 has 3.3% overlap between                 F1 better on these. CoQA’s development set metric reports
train and test images (Barz & Denzler, 2019). This results in          the average performance over 5 different domains and we
an over-reporting of the generalization performance of ma-             measure a gain of about 0.5-1.0 F1 due to overlap across the
chine learning systems. As the size of datasets increases this         various domains. However, no actual training questions or
issue becomes increasingly likely which suggests a similar             answers are in WebText since CoQA was released after the
phenomenon could be happening with WebText. Therefore it               cutoff date for links in WebText.
is important to analyze how much test data also shows up in            On LAMBADA, the average overlap is 1.2%. GPT-2 per-
the training data.                                                     forms about 2 perplexity better on examples with greater
To study this we created Bloom filters containing 8-grams              than 15% overlap. Recalculating metrics when excluding
of WebText training set tokens. To improve recall, strings             all examples with any overlap shifts results from 8.6 to 8.7
were normalized to contain only lower-cased alphanumeric               perplexity and reduces accuracy from 63.2% to 62.9%. This
words with a single space as a delimiter. The Bloom filters            very small change in overall results is likely due to only 1
were constructed such that the false positive rate is upper            in 200 examples having significant overlap.
bounded by 1/10^8. We further verified the low false positive          Overall, our analysis suggests that data overlap between
rate by generating 1M strings, of which zero were found by             WebText training data and specific evaluation datasets pro-
the filter.                                                            vides a small but consistent benefit to reported results. How-
These Bloom filters let us calculate, given a dataset, the             ever, for most datasets we do not notice significantly larger
percentage of 8-grams from that dataset that are also found            overlaps than those already existing between standard train-
in the WebText training set. Table 6 shows this overlap anal-          ing and test sets, as Table 6 highlights.
ysis for the test sets of common LM benchmarks. Common                 Understanding and quantifying how highly similar text im-
LM datasets’ test sets have between 1-6% overlap with Web-             pacts performance is an important research question. Better
Text train, with an average overlap of 3.2%. Somewhat                  de-duplication techniques such as scalable fuzzy matching
surprisingly, many datasets have larger overlaps with their            could also help better answer these questions. For now, we
own training splits, with an average of 5.9% overlap.                  recommend the use of n-gram overlap based de-duplication
Our approach optimizes for recall, and while manual inspec-            as an important verification step and sanity check during the
tion of the overlaps shows many common phrases, there are              creation of training and test splits for new NLP datasets.
many longer matches that are due to duplicated data. This is           Another potential way of determining whether the perfor-
not unique to WebText. For instance, we discovered that the            mance of WebText LMs is attributable to memorization is
test set of WikiText-103 has an article which is also in the           inspecting their performance on their own held-out set. As
training dataset. Since there are only 60 articles in the test         shown in Figure 4, performance on both the training and
set there is at least an overlap of 1.6%.⁴ Potentially more            test sets of WebText are similar and improve together as
worryingly, 1BW has an overlap of nearly 13.2% with its                model size is increased. This suggests even GPT-2 is still
own training set according to our procedure.                           underfitting on WebText in many ways.
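
Two of the figures in this section can be checked with quick arithmetic. The sketch below assumes that one duplicated article among the 60 WikiText-103 test articles is counted as 1/60 of the test set, and it takes the Bloom filter false positive bound to be 1 in 10^8. Both readings are assumptions made for illustration.

```python
# One duplicated article out of 60 test articles (assumed to be counted per article).
articles_in_test_set = 60
min_overlap_pct = 100.0 * 1 / articles_in_test_set

# Bloom filter sanity check: 1M probe strings against an assumed bound of 1e-8.
fp_rate_bound = 1e-8
probe_strings = 1_000_000
expected_false_positives = fp_rate_bound * probe_strings
prob_zero_hits = (1.0 - fp_rate_bound) ** probe_strings

print(f"minimum WikiText-103 overlap: {min_overlap_pct:.2f}%")
print(f"expected false positives among the probes: {expected_false_positives:.2f}")
print(f"probability of seeing zero false positives: {prob_zero_hits:.3f}")
```

Under these assumptions, the WikiText-103 figure comes out slightly above the quoted 1.6%, and observing zero hits among 1M generated strings is the expected outcome rather than a tight measurement of the false positive rate.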
For the Winograd Schema Challenge, we found only 10                    GPT-2 is also able to write news articles about the discovery
schemata which had any 8-gram overlaps with the WebText                of talking unicorns. An example is provided in Table 13.
training set. Of these, 2 were spurious matches. Of the
remaining 8, only 1 schema appeared in any contexts that
                                                                       5. Related Work
   4
    A significant portion of additional overlap is due to editors
reusing some paragraphs across multiple articles with a shared         A significant portion of this work measured the performance
theme such as various battles in the Korean War.                       of larger language models trained on larger datasets. This

                                                                 improved the RNN based fine-tuning approaches of (Dai
                                                                 & Le, 2015). (Conneau et al., 2017a) studied the transfer
                                                                 performance of representations learned by natural language
                                                                 inference models and (Subramanian et al., 2018) explored
                                                                 large-scale multitask training.
                                                                 (Ramachandran et al., 2016) demonstrated that seq2seq mod-
                                                                 els benefit from being initialized with pre-trained language
                                                                 models as encoders and decoders. More recent work has
                                                                 shown that LM pre-training is helpful when fine-tuned for
                                                                 difficult generation tasks like chit-chat dialog and dialog
                                                                 based question answering systems as well (Wolf et al., 2019)
                                                                 (Dinan et al., 2018).

                                                                 6. Discussion
                                                                 Much research has been dedicated to learning (Hill et al.,
                                                                 2016), understanding (Levy & Goldberg, 2014), and criti-
                                                                 cally evaluating (Wieting & Kiela, 2019) the representations
Figure 4. The performance of LMs trained on WebText as a func-   of both supervised and unsupervised pre-training methods.
tion of model size.                                              Our results suggest that unsupervised task learning is an
                                                                 additional promising area of research to explore. These
                                                                 findings potentially help explain the widespread success of
                                                                 pre-training techniques for down-stream NLP tasks as we
is similar to the work of Jozefowicz et al. (2016) which         show that, in the limit, one of these pre-training techniques
scaled RNN based language models on the 1 Billion Word           begins to learn to perform tasks directly without the need
Benchmark. Bajgar et al. (2016) also previously improved         for supervised adaption or modification.
results on the Children’s Book Test by creating a much larger    On reading comprehension the performance of GPT-2 is
training dataset out of Project Gutenberg to supplement the      competitive with supervised baselines in a zero-shot setting.
standard training dataset. Hestness et al. (2017) conducted      However, on other tasks such as summarization, while it
a thorough analysis of how the performance of various deep       is qualitatively performing the task, its performance is still
learning models changes as a function of both model capac-       only rudimentary according to quantitative metrics. While
ity and dataset size. Our experiments, while much noisier        suggestive as a research result, in terms of practical applica-
across tasks, suggest similar trends hold for sub-tasks of an    tions, the zero-shot performance of GPT-2 is still far from
objective and continue into the 1B+ parameter regime.            usable.
Interesting learned functionality in generative models           We have studied the zero-shot performance of WebText
has been documented before such as the cells in an               LMs on many canonical NLP tasks, but there are many addi-
RNN language model performing line-width tracking and            tional tasks that could be evaluated. There are undoubtedly
quote/comment detection (Karpathy et al., 2015). More in-        many practical tasks where the performance of GPT-2 is
spirational to our work was the observation of Liu et al.        still no better than random. Even on common tasks that we
(2018) that a model trained to generate Wikipedia articles       evaluated on, such as question answering and translation,
also learned to translate names between languages.               language models only begin to outperform trivial baselines
Previous work has explored alternative approaches to filter-     when they have sufficient capacity.
ing and constructing a large text corpus of web pages, such      While zero-shot performance establishes a baseline of the
as the iWeb Corpus (Davies, 2018).                               potential performance of GPT-2 on many tasks, it is not
There has been extensive work on pre-training methods            clear where the ceiling is with finetuning. On some tasks,
for language tasks. In addition to those mentioned in the        GPT-2’s fully abstractive output is a significant departure
introduction, GloVe (Pennington et al., 2014) scaled word        from the extractive pointer network (Vinyals et al., 2015)
vector representation learning to all of Common Crawl. An        based outputs which are currently state of the art on many
influential early work on deep representation learning for       question answering and reading comprehension datasets.
text was Skip-thought Vectors (Kiros et al., 2015). McCann       Given the prior success of fine-tuning GPT, we plan to in-
et al. (2017) explored the use of representations derived from   vestigate fine-tuning on benchmarks such as decaNLP and
machine translation models and Howard & Ruder (2018)             GLUE, especially since it is unclear whether the additional

training data and capacity of GPT-2 is sufficient to over-              Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization.
come the inefficiencies of uni-directional representations                arXiv preprint arXiv:1607.06450, 2016.
demonstrated by BERT (Devlin et al., 2018).                             Bajgar, O., Kadlec, R., and Kleindienst, J. Embracing data abun-
                                                                          dance: Booktest dataset for reading comprehension. arXiv
7. Conclusion                                                             preprint arXiv:1610.00956, 2016.

                                                                        Barz, B. and Denzler, J. Do we train on test data? purging cifar of
When a large language model is trained on a sufficiently
                                                                          near-duplicates. arXiv preprint arXiv:1902.00423, 2019.
large and diverse dataset it is able to perform well across
many domains and datasets. GPT-2 zero-shots to state of                 Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural
the art performance on 7 out of 8 tested language model-                  probabilistic language model. Journal of machine learning
ing datasets. The diversity of tasks the model is able to                 research, 3(Feb):1137–1155, 2003.
perform in a zero-shot setting suggests that high-capacity              Bowman, S. R., Pavlick, E., Grave, E., Van Durme, B., Wang, A.,
models trained to maximize the likelihood of a sufficiently               Hula, J., Xia, P., Pappagari, R., McCoy, R. T., Patel, R., et al.
varied text corpus begin to learn how to perform a surprising             Looking for elmo’s friends: Sentence-level pretraining beyond
amount of tasks without the need for explicit supervision.⁵               language modeling. arXiv preprint arXiv:1812.10860, 2018.

                                                                        Caruana, R. Multitask learning. Machine learning, 28(1):41–75,
Acknowledgements                                                          1997.

Thanks to everyone who wrote the text, shared the links,                Chelba, C., Mikolov, T., Schuster, M., Ge, Q., Brants, T., Koehn,
                                                                          P., and Robinson, T. One billion word benchmark for measur-
and upvoted the content in WebText. Many millions of                      ing progress in statistical language modeling. arXiv preprint
people were involved in creating the data that GPT-2 was                  arXiv:1312.3005, 2013.
trained on. Also thanks to all the Googlers who helped us
with training infrastructure, including Zak Stone, JS Riehl,            Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu,
                                                                          K., and Kuksa, P. Natural language processing (almost) from
Jonathan Hseu, Russell Power, Youlong Cheng, Noam
                                                                          scratch. Journal of Machine Learning Research, 12(Aug):2493–
Shazeer, Solomon Boulos, Michael Banfield, Aman Gupta,                    2537, 2011.
Daniel Sohn, and many more. Finally thanks to the people
who gave feedback on drafts of the paper: Jacob Steinhardt,             Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bor-
Sam Bowman, Geoffrey Irving, and Madison May.                             des, A. Supervised learning of universal sentence represen-
                                                                          tations from natural language inference data. arXiv preprint
                                                                          arXiv:1705.02364, 2017a.
References
                                                                        Conneau, A., Lample, G., Ranzato, M., Denoyer, L., and Jégou,
Al-Rfou, R., Choe, D., Constant, N., Guo, M., and Jones, L.               H. Word translation without parallel data. arXiv preprint
  Character-level language modeling with deeper self-attention.           arXiv:1710.04087, 2017b.
  arXiv preprint arXiv:1808.04444, 2018.
                                                                        Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In
Alberti, C., Lee, K., and Collins, M. A bert baseline for the natural     Advances in neural information processing systems, pp. 3079–
  questions. arXiv preprint arXiv:1901.08634, 2019.                       3087, 2015.

Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-S., and      Dai, Z., Yang, Z., Yang, Y., Cohen, W. W., Carbonell, J., Le,
  Nguyen, A. Strike (with) a pose: Neural networks are easily             Q. V., and Salakhutdinov, R. Transformer-xl: Attentive lan-
  fooled by strange poses of familiar objects. arXiv preprint             guage models beyond a fixed-length context. arXiv preprint
  arXiv:1811.11553, 2018.                                                 arXiv:1901.02860, 2019.

                                                                        Davies, M. The 14 billion word iweb corpus.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Batten-           https://corpus.byu.edu/iWeb/, 2018.
 berg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen,
 G., et al. Deep speech 2: End-to-end speech recognition in             Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser,
 english and mandarin. In International Conference on Machine             Ł. Universal transformers. arXiv preprint arXiv:1807.03819,
 Learning, pp. 173–182, 2016.                                             2018.
Artetxe, M., Labaka, G., Agirre, E., and Cho, K. Unsupervised           Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-
  neural machine translation. arXiv preprint arXiv:1710.11041,            training of deep bidirectional transformers for language under-
  2017.                                                                   standing. arXiv preprint arXiv:1810.04805, 2018.

Artetxe, M., Labaka, G., and Agirre, E. An effective ap-                Dinan, E., Roller, S., Shuster, K., Fan, A., Auli, M., and Weston,
  proach to unsupervised machine translation. arXiv preprint              J. Wizard of wikipedia: Knowledge-powered conversational
  arXiv:1902.01313, 2019.                                                 agents. arXiv preprint arXiv:1811.01241, 2018.
    5
      Preliminary code for downloading and using the small model        Fan, A., Lewis, M., and Dauphin, Y. Hierarchical neural story
is available at https://github.com/openai/gpt-2                           generation. arXiv preprint arXiv:1805.04833, 2018.

Finn, C., Abbeel, P., and Levine, S. Model-agnostic meta-learning
  for fast adaptation of deep networks. arXiv preprint
  arXiv:1703.03400, 2017.

Gehrmann, S., Deng, Y., and Rush, A. M. Bottom-up abstractive
  summarization. arXiv preprint arXiv:1808.10792, 2018.

Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. Multilingual
  language processing from bytes. arXiv preprint
  arXiv:1512.00103, 2015.

Gong, C., He, D., Tan, X., Qin, T., Wang, L., and Liu, T.-Y. Frage:
  frequency-agnostic word representation. In Advances in Neural
  Information Processing Systems, pp. 1341–1352, 2018.

Grave, E., Joulin, A., and Usunier, N. Improving neural
  language models with a continuous cache. arXiv preprint
  arXiv:1612.04426, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep
  residual networks. In European conference on computer vision,
  pp. 630–645. Springer, 2016.

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H.,
  Kianinejad, H., Patwary, M., Ali, M., Yang, Y., and Zhou, Y. Deep
  learning scaling is predictable, empirically. arXiv preprint
  arXiv:1712.00409, 2017.

Hill, F., Bordes, A., Chopra, S., and Weston, J. The goldilocks
  principle: Reading children’s books with explicit memory
  representations. arXiv preprint arXiv:1511.02301, 2015.

Hill, F., Cho, K., and Korhonen, A. Learning distributed
  representations of sentences from unlabelled data. arXiv preprint
  arXiv:1602.03483, 2016.

Hoang, L., Wiseman, S., and Rush, A. M. Entity tracking improves
  cloze-style reading comprehension. arXiv preprint
  arXiv:1810.02891, 2018.

Howard, J. and Ruder, S. Universal language model fine-tuning for
  text classification. In Proceedings of the 56th Annual Meeting
  of the Association for Computational Linguistics (Volume 1:
  Long Papers), volume 1, pp. 328–339, 2018.

Jelinek, F. and Mercer, R. L. Interpolated estimation of markov
  source parameters from sparse data. In Proceedings of the
  Workshop on Pattern Recognition in Practice, Amsterdam, The
  Netherlands: North-Holland, May., 1980.

Jia, R. and Liang, P. Adversarial examples for evaluating reading
  comprehension systems. arXiv preprint arXiv:1707.07328,
  2017.

Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., and Wu,
  Y. Exploring the limits of language modeling. arXiv preprint
  arXiv:1602.02410, 2016.

Kaiser, L., Gomez, A. N., Shazeer, N., Vaswani, A., Parmar, N.,
  Jones, L., and Uszkoreit, J. One model to learn them all. arXiv
  preprint arXiv:1706.05137, 2017.

Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and
  understanding recurrent networks. arXiv preprint arXiv:1506.02078,
  2015.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins,
  G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T.,
  Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in
  neural networks. Proceedings of the national academy of sciences,
  pp. 201611835, 2017.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R.,
  Torralba, A., and Fidler, S. Skip-thought vectors. In Advances
  in neural information processing systems, pp. 3294–3302, 2015.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
  classification with deep convolutional neural networks. In Advances in
  neural information processing systems, pp. 1097–1105, 2012.

Kwiatkowski, T., Palomaki, J., Rhinehart, O., Collins, M., Parikh,
  A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin,
  J., et al. Natural questions: a benchmark for question answering
  research. 2019.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J.
  Building machines that learn and think like people. Behavioral
  and Brain Sciences, 40, 2017.

Lample, G., Conneau, A., Denoyer, L., and Ranzato, M. Unsupervised
  machine translation using monolingual corpora only.
  arXiv preprint arXiv:1711.00043, 2017.

Levesque, H., Davis, E., and Morgenstern, L. The winograd
  schema challenge. In Thirteenth International Conference on
  the Principles of Knowledge Representation and Reasoning,
  2012.

Levy, O. and Goldberg, Y. Neural word embedding as implicit matrix
  factorization. In Advances in neural information processing
  systems, pp. 2177–2185, 2014.

Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L.,
  and Shazeer, N. Generating wikipedia by summarizing long
  sequences. arXiv preprint arXiv:1801.10198, 2018.

McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned
  in translation: Contextualized word vectors. In Advances in
  Neural Information Processing Systems, pp. 6294–6305, 2017.

McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The natural
  language decathlon: Multitask learning as question answering.
  arXiv preprint arXiv:1806.08730, 2018.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel
  mixture models. arXiv preprint arXiv:1609.07843, 2016.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean,
  J. Distributed representations of words and phrases and their
  compositionality. In Advances in neural information processing
  systems, pp. 3111–3119, 2013.

Nallapati, R., Zhou, B., Gulcehre, C., Xiang, B., et al. Abstractive
  text summarization using sequence-to-sequence rnns and
  beyond. arXiv preprint arXiv:1602.06023, 2016.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi,
  R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The
  lambada dataset: Word prediction requiring a broad discourse
  context. arXiv preprint arXiv:1606.06031, 2016.

Pennington, J., Socher, R., and Manning, C. Glove: Global vectors
  for word representation. In Proceedings of the 2014 conference
  on empirical methods in natural language processing (EMNLP),
  pp. 1532–1543, 2014.

Peters, M. E. and Lecocq, D. Content extraction using diverse feature
  sets. In Proceedings of the 22nd International Conference
  on World Wide Web, pp. 89–90. ACM, 2013.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C.,
  Lee, K., and Zettlemoyer, L. Deep contextualized word
  representations. arXiv preprint arXiv:1802.05365, 2018.

Radford, A., Jozefowicz, R., and Sutskever, I. Learning to
  generate reviews and discovering sentiment. arXiv preprint
  arXiv:1704.01444, 2017.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I.
  Improving language understanding by generative pre-training.
  2018.

Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pre-training
  for sequence to sequence learning. arXiv preprint
  arXiv:1611.02683, 2016.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do
  cifar-10 classifiers generalize to cifar-10? arXiv preprint
  arXiv:1806.00451, 2018.

Reddy, S., Chen, D., and Manning, C. D. Coqa: A conversational
  question answering challenge. arXiv preprint arXiv:1808.07042,
  2018.

Schwartz, R., Sap, M., Konstas, I., Zilles, L., Choi, Y., and Smith,
  N. A. Story cloze task: Uw nlp system. In Proceedings of the
  2nd Workshop on Linking Models of Lexical, Sentential and
  Discourse-level Semantics, pp. 52–55, 2017.

See, A., Liu, P. J., and Manning, C. D. Get to the point:
  Summarization with pointer-generator networks. arXiv preprint
  arXiv:1704.04368, 2017.

Sennrich, R., Haddow, B., and Birch, A. Neural machine translation
  of rare words with subword units. arXiv preprint
  arXiv:1508.07909, 2015.

Subramanian, S., Trischler, A., Bengio, Y., and Pal, C. J. Learning
  general purpose distributed sentence representations via large
  scale multi-task learning. arXiv preprint arXiv:1804.00079,
  2018.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence
  learning with neural networks. In Advances in neural information
  processing systems, pp. 3104–3112, 2014.

Sutskever, I., Jozefowicz, R., Gregor, K., Rezende, D., Lillicrap,
  T., and Vinyals, O. Towards principled unsupervised learning.
  arXiv preprint arXiv:1511.06440, 2015.

Trichelair, P., Emami, A., Cheung, J. C. K., Trischler, A., Suleman,
  K., and Diaz, F. On the evaluation of common-sense
  reasoning in natural language understanding. arXiv preprint
  arXiv:1811.01778, 2018.

Trinh, T. H. and Le, Q. V. A simple method for commonsense
  reasoning. arXiv preprint arXiv:1806.02847, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
  Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is
  all you need. In Advances in Neural Information Processing
  Systems, pp. 5998–6008, 2017.

Vinyals, O. and Le, Q. A neural conversational model. arXiv
  preprint arXiv:1506.05869, 2015.

Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In
  Advances in Neural Information Processing Systems, pp. 2692–2700,
  2015.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman,
  S. R. Glue: A multi-task benchmark and analysis
  platform for natural language understanding. arXiv preprint
  arXiv:1804.07461, 2018.

Weston, J. E. Dialog-based language learning. In Advances in
  Neural Information Processing Systems, pp. 829–837, 2016.

Wieting, J. and Kiela, D. No training required: Exploring
  random encoders for sentence classification. arXiv preprint
  arXiv:1901.10444, 2019.

Wolf, T., Sanh, V., Chaumond, J., and Delangue, C. Transfertransfo:
  A transfer learning approach for neural network based
  conversational agents. arXiv preprint arXiv:1901.08149, 2019.

Yogatama, D., d’Autume, C. d. M., Connor, J., Kocisky, T.,
  Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L.,
  Dyer, C., et al. Learning and evaluating general linguistic
  intelligence. arXiv preprint arXiv:1901.11373, 2019.

8. Appendix A: Samples

8.1. Model capacity

To complement the reported perplexity gains of bigger LMs on
WebText shown in Figure 4, Tables 7 through 11 show side-by-side
completions of the smallest WebText LM and GPT-2 on random
unseen WebText test set articles.

8.2. Text Memorization

We observe some memorizing behavior in GPT-2 on longer strings
that are repeated many times in the dataset, such as famous quotes
or speeches. For example, when conditioned on the first sentence
and a half of the Gettysburg Address (which occurs approximately
40 times throughout WebText), an argmax decode from GPT-2
recovers the speech. Even when sampling without truncation, we
find that the model copies the speech for a while before drifting,
albeit in a similar style. It typically drifts within 100-200 tokens,
and displays widening diversity once it drifts.

To quantify how often exact memorization shows up in samples,
we generated samples from GPT-2 conditioned on WebText test
set articles and compared the overlap rates of GPT-2’s generations
to the overlap rates of the ground-truth completions. The results of
this analysis are shown in Figure 5 and suggest that GPT-2 repeats
text from the training set less often than the baseline rate of
held-out articles.

8.3. Diversity

Table 12 shows multiple completions of the same random WebText
test set context, showing the diversity of completions with standard
sampling settings.

8.4. Robustness

Table 13 shows the previously mentioned talking unicorns news
article. We find the model to be capable of handling
out-of-distribution contexts, but the quality of these samples is
generally lower.




Figure 5. CDF of percentage 8-gram overlap with WebText training
set, for both WebText test set and samples (conditioned on
WebText test set, with top-k truncated random sampling with
k = 40). Most samples have less than 1% overlap, including over
30% of samples with no overlap, whereas the median for test set is
2.6% overlap.
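
The per-document statistic plotted in Figure 5 can be illustrated with a short sketch. This is not the authors' code: it assumes whitespace tokenization, an in-memory set of training 8-grams, and that every 8-gram position in a document is counted (duplicates included). The helper names `ngrams`, `build_ngram_set`, and `overlap_percentage` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int = 8) -> List[NGram]:
    """Return every contiguous n-gram in a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_set(train_docs: Iterable[List[str]], n: int = 8) -> Set[NGram]:
    """Collect all n-grams seen anywhere in the training documents."""
    seen: Set[NGram] = set()
    for doc in train_docs:
        seen.update(ngrams(doc, n))
    return seen


def overlap_percentage(tokens: List[str], train_ngrams: Set[NGram], n: int = 8) -> float:
    """Percentage of a document's n-grams that also occur in the training set."""
    grams = ngrams(tokens, n)
    if not grams:
        # Assumption: documents shorter than n tokens count as 0% overlap.
        return 0.0
    hits = sum(1 for g in grams if g in train_ngrams)
    return 100.0 * hits / len(grams)


# Toy data standing in for the training set and a generated sample.
train = ["a b c d e f g h i j".split()]
train_set = build_ngram_set(train)
sample = "a b c d e f g h x y".split()
print(overlap_percentage(sample, train_set))  # one of three 8-grams matches
```

Computing this value for each test-set article and each generated sample, then sorting the values, gives the two empirical CDFs that the figure compares.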




                                                  Context (WebText test)
  Outfit 1: Typical

  This pairing was the first outfit I thought of when I bought the shoes. It’s like a summer version of this
  Jake Grantham outfit; in fact, my shoes are close to the colors of his Nike Racers! Instead of a heavy Harris Tweed
  jacket and denim shirt, I’m wearing a cotton DB jacket and and a linen shirt. Both fabrics (in these colors) are an
  absolute must for summer, as they go with both dark and and light pants! As you can see, they pair wonderfully
  with the dark jeans and shoes. It’s a pseudo menswear/prep outfit.

  Overall, this is a very casual outfit which is why I paired my sneakers with it. I’m not about wearing a
  full wool suit with sneakers (as GQ shows a lot) but I’m definitely open to keeping things casual, like this cotton
  DB. Casual fabrics are key to pulling off your sneakers in a dressed down menswear outfit. I’d even suggest to
  wear these sneakers with a khaki chino suit or a white linen suit. Just be sure to ditch the tie or wear a tee or polo;
  wearing a tie with sneakers is a bit too much
                  Small model completion                                           GPT-2 completion
  .                                                            for me!

  Outfit 2: High End                                              Outfit 2: Casual Jacket

  This outfit was just in the eye of the storm. It’s a            I know that some people might be opposed to wearing
  hybrid of what ASICS does with streetwear (like J. Crew)        sneakers in a turtleneck, but I wanted to be true to myself,
  and what J. Crew does with casual pieces that are a bit         so I went with a slim fitting turtleneck in a color more
  more traditional. Again, as I mentioned before, it’s a low      similar to my favorite color of the day. Overall, it’s a
  budget item and they can be very expensive. These will          classic turtleneck, with nothing too flashy. I did swap my
  definitely be an issue if you’re trying to work as a casual     shoes for these ASICS Tiger Killshots and I wish I hadn’t
  sneakermaker, and that’s OK but don’t hesitate if you           did that. The shoe itself is very comfortable, however,
  want to try the lower price versions of their outfits!          I found that wearing them in these sneakers made my
                                                                  feet look a bit larger. I’d wear them with denim, chinos,
  This version features a very low cut-cut neckline               whatever, but would never recommend wearing them
  (which is not a bad thing), slightly longer legs, and a pair    alone. There are just too many variables involved.
  of long legs to keep things nice. My preferred color of
  color these items are blue-gray (which is my first colors,      Outfit 3: Dress
  as they are a light white), or black (which is my favorite
  of my designs). As always, my pick of colors from this          This is the final outfit I wore the ASICS Tiger
  store have nothing to do with that other than the de-           Killshots. This version of the ASICS is a much more
  sign. For a high end, I would recommend this to everyone!       modern and streamlined version. The shoe has the most
                                                                  modern and minimal design that I can think of! With a lot
  Outfit 3: High Light                                            of black, these sneakers look very cool. They’ll definitely
                                                                  be very comfortable for summer.
  This was a ”high light” look while I had a chance
  to get some color. I think the color is based on                To me, wearing them with a suit is the easiest way
                                                                  to have a good pair of sneaker shoes. That’s pretty much
                                                                  it! If

Table 7. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models.
Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-k
random sampling with k = 40 was used for generation.
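
The caption names top-k random sampling without spelling it out. The sketch below assumes the usual formulation: keep the k highest-scoring tokens, renormalize with a softmax over only those, and draw one. The function name `top_k_sample` and the toy logits are hypothetical and are not taken from the released GPT-2 code.

```python
import math
import random
from typing import List, Optional


def top_k_sample(logits: List[float], k: int = 40,
                 rng: Optional[random.Random] = None) -> int:
    """Sample a token index from the k highest-scoring entries of `logits`."""
    if rng is None:
        rng = random.Random()
    # Clamp k to the valid range [1, vocabulary size].
    k = max(1, min(k, len(logits)))
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the kept logits; subtract the max for stability.
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]
    # random.choices normalizes the weights internally.
    return rng.choices(top, weights=weights, k=1)[0]


# Toy four-token vocabulary.
toy_logits = [0.1, 5.0, 2.0, -1.0]
greedy = top_k_sample(toy_logits, k=1)  # always index 1, the largest logit
sampled = top_k_sample(toy_logits, k=2, rng=random.Random(0))  # index 1 or 2
```

With k = 1 the procedure reduces to the argmax decode mentioned in Section 8.2, and with k equal to the vocabulary size it becomes sampling without truncation.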




                                                 Context (WebText test)
  Some of the most glorious historical attractions in Spain date from the period of Muslim rule, including The
  Mezquita, built as the Great Mosque of Córdoba and the Medina Azahara, also in Córdoba and now in ruins but
  still visitable as such and built as the Madinat al-Zahra, the Palace of al-Andalus; and the Alhambra in Granada, a
  splendid, intact palace. There are also two synagogues still standing that were built during the era of Muslim Spain:
  Santa Marı́a la Blanca in Toledo and the Synagogue of Córdoba, in the Old City.

  Reconquista and Imperial era [ edit ]
  Small model completion

  The era of Muslim Spain is very different from that of other parts of the world, with a rather
  extensive medieval history. While the Spanish Empire existed during the second half of the 9th
  century BC it was quite slowly transformed to a small, secularized empire by 1150 A.D., in a much
  more prosperous and diverse state. In 1050 CE the emperor of Naples, at least partially in
  characterised by a series of wars and invasions from the north-western coast to the western coast
  of the region (e.g., on the Danube River which extended all the way from Gibraltar into the south
  of Spain, but was quickly destroyed on the same day by the Roman army), assumed the title of
  Emperor of Naples. He made a large contribution over the last few decades to restore and maintain
  the empire, the first of which to the west. His successors were Ferdinand, who took power from the
  Papal Throne in 1148, while the first imperial successor was King Alfred II of Sweden. The empire
  was again made one of the few independent states in the European Continent and although no more
  than 200 million people lived on its territory during this period the people had much to say to
  the Emperor. From 1720 on the Muslims made up half the population of Spain.

  GPT-2 completion

  The Reconquista (literally, ”The return to the country”) of 1092 and 1095 saw the establishment in
  the Iberian peninsula of a united kingdom of Castile and Aragon in which a Muslim dynasty would
  exist. This was the last time in Europe where Muslim rule would be considered ”native” and the
  last occasion where Christian-inflamed and organized conflicts would make themselves known in
  Europe.

  The reign of Muslim Moors in Spain is often referred to, especially among non-Muslims, as the
  ”Conquest of Granada” and Spain is sometimes referred to as Spain in ”Granada”. The last years of
  Muslim rule were followed by the Spanish Reconquista where the Muslims were driven or compelled
  out of the Iberian Peninsula with the victory of Ferdinand IX. in the Battle of Poitiers in March
  1189 and the conquest of Jerusalem in 1187 by Saladin.

  The Reconquista period in Spain came to an end when the Moors retreated north. However the Moorish
  rule would not be fully replaced by the Christian rule until the Reconquista era resumed when the
  Reconquista lasted from 1229 to 1492.

  The last Muslim state in Spain, Granada, was conquered by the


Table 8. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models.
Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-k
random sampling with k = 40 was used for generation.


                                                  Context (WebText test)
  Pinky

  The pink ghost’s AI is designed to ”feel” opposite of the red ghost’s behavior. Pinky actually attempts to
  get out in front of Pac-Man. This is accomplished by setting the target 4 tiles ahead of Pac-Man’s current location in
  the direction that Pac-Man is travelling. One exception to this is when Pac-Man is traveling up. Due to an overflow
  bug in the code, the calculation includes a left offset equal to the expected up offset.

  The use of Pac-Man’s expected position allows the player to exploit the navigation code for Pinky. This was most
  often seen in situations where Pac-Man and Pinky are in close proximity to each other. For example, if they were
  headed toward one another, once they are less than 4 tiles apart, the target tile is now behind Pinky. If he encounters
  an intersection while in this configuration, he will attempt to double-back (by turning) in order to reach the target tile.

  Player’s would often perform what was referred to as a ”head fake” by changing directions rapidly just
  as Pinky was approaching an intersection. By displacing the target tile just as Pinky
  Small model completion

  was coming to the left, they would get the ”correct” direction.

  If the location is actually farther from what Pac-Man was at, the player’s will not respond to the
  ghost’s movement, and the ghost will instead return to the player’s current location. This is
  because Pac-Man does not move by turning.

  The ghost will have to go to another location of some kind to return where he began, and this is
  achieved in many ways. It will often simply leave the location, as for example if Pac-Man did jump
  off a cliff, he would return to the spot where he was before vanishing into space. There is a good
  number of possibilities these ghost paths may be possible, but the most obvious ones are if the
  player wants to avoid getting caught up in the game, with a goal as easy as keeping Pac-Man in the
  game for 1/3 as long as he was.

  Chasing

  In Pac-Man’s case, this would be similar to how Pac-Man can ”catch up” in a fight.

  A ghost could chase through enemies. This is useful for many reasons, both for Pac-Man and his AI.
  Unlike most Ghost games, this ghost simply travels in the direction from

  GPT-2 completion

  was about to cross the intersection, Pac-Man would be able to dodge the ghosts’ projectiles and
  return to the safe location of the safe house.

  Buster

  The yellow ghost’s AI is specifically designed to try to avoid Pac-Man from leaving the safe
  house. The player must press the SPACE bar to reveal the safe house’s doors, then go into them to
  find the yellow ghost hiding in the next room.

  The player must also note that the ghost will not attempt to exit the safe house if it sees
  Pac-Man already there. In fact, the ghost will not attempt an exit in the first place unless
  they’re already in the safe house.

  These two behaviors are based on the fact that the player does not always attempt to reach the
  safe house before Pac-Man (or Blinky).

  The player must also beware of this glitch if a ghost (or another player) is already in the safe
  house and attempting to get out as soon as Pac-Man or Blinky is gone. At this point, the ghost
  will attempt to exit the safe house in a similar manner as it is in ”Chase” mode.

  Pac-Man

  There are some ghosts that follow Pac-Man. These ghosts are


Table 9. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models.
Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-k
random sampling with k = 40 was used for generation.




                                                 Context (WebText test)
  Corporal Michael P. Goeldin was an unskilled laborer from Ireland when he enlisted in Company A in November
  1860. Goldein survived the war. Corporal Patrick O’Neal, also from Ireland, first enlisted in 1854 and served with
  Company L, 3d U.S. Artillery, in Oregon. He returned to the East Coast and enlisted in the company in 1860.
  O’Neal served until 1874, when he was named superintendent of the National Cemetery at Willets Point, New York.
  Corporal Benjamin Browne was a shoemaker from Orange County, New York. In August 1862, he enlisted in the
  newly formed 124th New York Volunteers, and was one of sixty-one men who transferred into Company A that
  October. Browne reenlisted in the company in February 1864 while it was camped at Brandy Station. He returned
  to civilian life after completing his enlistment in 1867.

  On 10 June, Artificer William Collins was promoted to corporal, probably to fill a combat leadership
  void for the crossing of the James River. Collins’s service record does not reflect the qualities he demonstrated to
  earn this promotion, but he had obviously overcome some serious problems. Born in Sacketts Harbor, New York,
  Collins enlisted in the company in December 1853 at the age of twenty-two, and reenlisted in December 1858. Just
  a month before the war began in April 1861, Collins went ”over the hill” and was not caught until three years later.
  Returned to the company on 22 March 1864, he was tried
  Small model completion

  on military charges and was discharged from the U.S. Military. He returned to the company, and was
  assigned to Division of Internal Affairs during the Civil War.

  As an engineer, Collins also had training in engineering from his native New York. After graduate
  studies in the Union College of Engineering, he was employed in the United States Army for two
  years at the same time. There he worked under Captain Daniel Pfeiffer and was promoted to colonel
  in May 1870, three years after Captain Pfeiffer’s enlistment. Collins returned home to England
  after completing his first year in the First Division to serve the next two years in the Army
  Corps of Engineers, and by his second year, he began to experience his service
  duty.<|endoftext|>MOSCOW, July 5 (Reuters) - U.K.-based Russian state media and government sources
  and experts say Moscow has sought to downplay the ”military buildup” of the Syrian opposition
  while maintaining support for President Donald Trump’s plan to build a wall along the U.S.-Mexico
  border, calling it a ”military coup”.

  U.S.-led Syria war veteran and Syria envoy Michael Flynn (R), right, gestures with U.S.-based
  Syrian political analyst Ahmet Koryat, with

  GPT-2 completion

  in June of that year. Sentenced to be shot for desertion, he remained on active duty.

  Captain James J. Dall was born in Virginia in 1829 during the Revolution and joined his father in
  the Army in 1836. He served with the 2d Virginia Infantry until 1845, when he went to the Western
  Reserve Regiment; and, on his discharge, he reenlisted with Company I, 25th New York Volunteers
  (which in 1845 numbered about 4,500 men). One of his brothers, James, served with his company in
  the same capacity in the same brigade. While the regiment remained in Connecticut through 1846,
  Captain Dall served with Company I at various times before his discharge. A veteran of the Mexican
  War, he had served two years on the staff of Brigadier General John J. Sullivan at San Antonio.
  During the Peninsula Campaign of 1863, he commanded a company of twenty-three and was in charge of
  the battery of fifty-four heavy mortars and the gunboat Pinta of the Battery, a gunboat and
  several other boats. Captain Dall was on active duty with Company I in the fall of 1865. Two years
  later, he was promoted to second lieutenant.

  Captain John A. O’


Table 10. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models.
Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-k
random sampling with k = 40 was used for generation.
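
The top-k random sampling named in the caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `top_k_sample`, the use of a plain list of logits, and the `rng` parameter are all introduced here for the example. At each generation step, only the k highest-scoring tokens are kept, their scores are renormalized with a softmax, and one token is drawn from that reduced distribution.

```python
import math
import random


def top_k_sample(logits, k=40, rng=random):
    """Draw one token index from the k highest-scoring entries of `logits`.

    `logits` is a list of unnormalized scores, one per vocabulary entry.
    `rng` is any object with a `choices` method (the `random` module or a
    `random.Random` instance); it is a hypothetical hook for reproducibility.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not logits:
        raise ValueError("logits must be non-empty")

    # Indices of the k largest logits (all indices if k exceeds the vocabulary).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax numerators over the surviving logits; subtracting the largest
    # logit keeps the exponentials numerically stable.
    peak = logits[top[0]]
    weights = [math.exp(logits[i] - peak) for i in top]

    # `choices` normalizes the weights internally, so no explicit division is needed.
    return rng.choices(top, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [0.1, 5.0, 4.0, -2.0, 3.0]  # made-up scores for a 5-token vocabulary
    print(top_k_sample(toy_logits, k=2, rng=random.Random(0)))
```

With k = 1 the procedure reduces to greedy decoding, and with k equal to the vocabulary size it reduces to sampling from the full distribution; k = 40 sits between the two.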

                                                  Context (WebText test)
  Xmonad, Xresources, and Fonts

  Xresources can be an absolute pain (they were for me). So, this a bit of help:

  First of all, the way that fonts are generally handled in Xmonad is through XFT . Xft is the X FreeType
  Interface library and is a library that handles the actual rendering of fonts in Xmonad/urxvt (Xresources specifies
  a bunch of resources to things that launch under the X window manager. Settings for individual programs are
  specified in Xresources like so: Urxvt.background:background settins here ).

  Anyway, fonts are set using this syntax in Xresources :

  urxvt*font:xft:your font name goes here:size=the size of the font goes here,xft:fall-back font 1 goes here

  So, thats the most basic part of fonts in urxvt specified via Xresources.

  Fixing /.ssh/ Permissions in Cygwin

  In Cygwin on Windows I found that I could not set the permissions of my /.ssh/ folder to be 0600, as is
  required for ssh to allow you to use keys. The symptom I had was that I no matter what I did, it always modfied the
  owner and the group
                   Small model completion                                         GPT-2 completion
  was set to /.ssh/ , which was the way things were. This         , making these files read-only. I tried just starting ssh
  could not be fixed completely. I found I could make that        and using a program like passwd which modifies the
  much more workable with just typing the username with           owner/group to 0644 permissions, but it didn’t work. It
  your mouse then trying to set permissions to be “0600”          was annoying but not nearly as bad as it could be, just
  for the group, which is what my system is now configured        annoying to have to fiddle with permissions.
  to do.
                                                                  So I decided to do this as the reverse:
  However, as of Cygwin 7 on Linux I have found
  that doing so is now “unnecessary” for my system! I also        For each .ssh folder in your Cygwin directory, set
  installed a package called chroot. It does that by allowing     the permission of each file and directory to 2644. Then
  you to install/upgrade to your system directory.                start ssh with: ssh-add /.ssh/

  So, as of Windows 8 and up, installing that can re-             The first time you connect, your permissions of
  quire chroot, so I tried and tried and found that it wasn’t     the /.ssh folder that you just had in Cygwin will be
  so bad when installing the package on Linux and then            set to 0600. Then every time you connect again, your
  uninstalling it. As of my last write-up on Cygwin, the          permissions will change to 0644.
  script just does its job (the sudo commands do the rest), if
  you try it you will have to log the change.                     Makefile for Xresources as used in /.xsessionrc

  If that wasn’t confusing enough, this is what I tried and       These are made in my .xsessionrc :
  just went back to my /.ssh/ directory and the chroot script.
  I found I was finally running out of permissions (because       .xprofile: .xsession-errors: color=””
  i forgot to create a symlink to ssh from the file system and
                                                                  For example:

                                                                  .xprofile: color=blue

                                                                  As the colors are stored in a separate file, the color in your
                                                                  /.xsession-errors will look like cyan instead of blue.

                                                                  This is mostly because, as of now, only the color
                                                                  is assigned to your


Table 11. Random unseen contexts (top), and non-cherry-picked completions from both the smallest (left) and largest (right) models.
Contexts are 768 tokens, with approximately 256 tokens worth of paragraphs shown. Completions are 256 tokens and fully shown. Top-k
random sampling with k = 40 was used for generation.

                                                Context (WebText test)
   Aaron loves mint chocolate cake, but he requires that it be paired with mini chocolate chips, so I threw some of
   those in between the layers. I also had a few Peppermint Jo Jos on hand so I crushed them up and threw some of
   those in along with some crushed meringue cookies because, why not? It’s a total smorgasbord of minty chocolate
   chippy cookie crunchy goodness. I didn’t measure how much of each topping I used, but after I tasted the finished
   product, I wish I had added more. You can add anything you want- crushed candy canes, peppermint bark, etc. And
   don’t be afraid to use a heavy hand. Texture = good.

   If you don’t have 7-inch cake pans, you can get 3 shorter 8-inch layers out of this
           GPT-2 Completion 1                    GPT-2 Completion 2                     GPT-2 Completion 3
   recipe. I made the longer version,        recipe. If you do, and you know           recipe, which will likely yield about
   then used the pan for a layer of          how to make that cake, then your          the same amount of cake.
   buttercream frosting and stuck 1/3        life will never be the same.
   of it into the fridge for a frosting                                                Chocolate Buttermilk Cake
   topper. Just make sure you have all       I have two very simple rules              Print Chunks of chocolate cake,
   of this frosting prepared earlier than    when it comes to cookie crumbs            coated in buttermilk and topped
   you make this cake and you won’t          or chocolate chips. 1) When I buy         with crunchy chopped chocolate,
   have to wait on it as your egg whites     raw sugar, I do not use it until it’s     drizzled with chocolate sauce.
   sit on the bottom of your mixing          cooled down and has at least half a       Simple but a nice change from
   bowl.                                     teaspoon of sugar on it. That’s my        chocolate cake! Author: Momma
                                             rule. And 2) If I have a recipe that      Made Ingredients For the chocolate
   You want the egg whites whipped           calls for something and I want to         buttermilk cake 1 1/2 cup unsalted
   to stiff peaks? Then put your cake        use it in a recipe, I do not substitute   butter , room temperature
   dish over a pot of simmering water        it with something else on-line. Like
   and heat it up until it’s just barely     this.                                     1 1/2 cup granulated sugar
   warm. It doesn’t take more than 3-5                                                 3 large eggs plus 1 egg yolk For
   minutes; just enough to get it started    I found this really cute card             the chocolate glaze 1/3 cup cocoa
                                             with cute little kittens on               powder
                                                                                       1 3/4 cups powdered sugar
                                                                                       6 ounces semis

             GPT-2 Completion 4                      GPT-2 Completion 5                     GPT-2 Completion 6
   recipe.                                 recipe.                                   recipe, which will give you about
                                                                                     6-9 servings of delicious chocolate
   Meringue Cookies                        My 2 failures at Christmas morning        cake. To prevent your layers
                                           were not due to poor frosting skills.     from sticking together, I used my
   1 3/4 cup butter, softened              That was another success story.           8-inch round tart pan, lined it with
   1 cup sugar                             The cake was beautifully moist,           parchment paper, then cut up 4
   1 egg yolk                              and well, just a beautiful cake. It’s     7-inch round cakes from it and
   3 T. heavy cream                        got very little air in it or moisture,    sandwiched the layers between
   3 1/2-4 cups flour                      but all the chocolate and mint was        them (this also worked well for
   pinch of salt                           perfectly blended together.               the 1-layer chocolate cake as well).
   Peppermint-Jojo Topping                                                           When you’re ready to bake this
   1 1/4 cups powdered sugar               One thing that made me cringe a           cake, remove the parchment paper,
   1 cup chopped pecans                    bit was that all those ingredients-       preheat oven to 350 degrees and
   1/2 cup finely chopped mint leaves      egg yolks, sugar, melted butter,          bake for about 30 – 40 minutes,
   1/2 cup chopped fresh mint (about       chocolate chips, melted chocolate,        until the top is golden and a tester
   1/2 sheet)                              and mint- had to be mixed right           comes out clean.
   1 tsp. peppermint extract               into the cake batter, which is
   1/4 tsp. sea salt                       super time-consuming. A friend            Chocolate Buttermilk Layer
   pinch of sea                            suggested that if I put the batter in a   Cake with Peppermint Jo
                                           separate bowl, I


Table 12. Non-cherry-picked completions from GPT-2 generated from the same context (from WebText test). Context is 384 tokens
(shown truncated), and generations are 128 tokens. Top-k random sampling with k = 40 was used for generation.




           Context (human-written): In a shocking finding, scientist discovered a herd of unicorns living
           in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the
           researchers was the fact that the unicorns spoke perfect English.

           GPT-2: The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These
           four-horned, silver-white unicorns were previously unknown to science.

           Now, after almost two centuries, the mystery of what sparked this odd phenomenon is fi-
           nally solved.

           Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several com-
           panions, were exploring the Andes Mountains when they found a small valley, with no other animals
           or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by
           two peaks of rock and silver snow.

           Pérez and the others then ventured further into the valley. “By the time we reached the top
           of one peak, the water looked blue, with some crystals on top,” said Pérez.

           Pérez and his friends were astonished to see the unicorn herd. These creatures could be
           seen from the air without having to move too much to see them – they were so close they could touch
           their horns.

           While examining these bizarre creatures the scientists discovered that the creatures also
           spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common
           ‘language,’ something like a dialect or dialectic.”

           Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals
           were believed to be descendants of a lost race of people who lived there before the arrival of humans
           in those parts of South America.

           While their origins are still unclear, some believe that perhaps the creatures were created
           when a human and a unicorn met each other in a time before human civilization. According to Pérez,
           “In South America, such incidents seem to be quite common.”

           However, Pérez also pointed out that it is likely that the only way of knowing for sure if
           unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to
           communicate in English quite well, which I believe is a sign of evolution, or at least a change in
           social organization,” said the scientist.

Table 13. Conditional generation on an out-of-distribution context by GPT-2. Cherry pick of 10 samples generated with k = 40.


Article: Prehistoric man sketched an incredible array of prehistoric beasts on the rough limestone walls of a cave in
modern day France 36,000 years ago.
Now, with the help of cutting-edge technology, those works of art in the Chauvet-Pont-d’Arc Cave have been
reproduced to create the biggest replica cave in the world.
The manmade cavern named the Caverne du Pont-d’Arc has been built a few miles from the original site in
Vallon-Pont-D’arc in Southern France and contains 1,000 painstakingly-reproduced drawings as well as around 450
bones and other features...
Cavemen and women sketched an incredible array of prehistoric beasts on the rough limestone walls of a cave
36,000 years ago and now a replica has been created (pictured)
...

GPT-2: The original site in Vallon-Pont-D’arc in South-        Reference: Cave mimics famous Caverne du Pont-d’Arc
ern France is a Unesco World Heritage site and is the          in France, the oldest cave decorated by man and the
oldest known and the best preserved cave decorated by          best preserved. The replica contains all 1,000 paint-
man. The replica cave was built a few miles from the orig-     ings which include 425 such as a woolly rhinoceros and
inal site in Vallon-Pont-D’Arc in Southern France. The         mammoths. Minute details were copied using 3D mod-
cave contains images of 14 different species of animals        elling and anamorphic techniques, often used to shoot
including woolly rhinoceros, mammoths, and big cats.           widescreen images. The modern cave also includes replica
                                                               paw prints of bears, bones and details preserved in the
                                                               original cave.

Article: ‘If you can find my submarine, it’s yours,’ Russian oil billionaire Roman Abramovich once said.
And it seems the oligarch and Chelsea FC owner, whose submarine is just one of the extras that came with his
£300million superyacht Eclipse (perfect for getting to shore undetected), is not the only wealthy businessman
splashing out on underwater exploration.
Dubbed Earth’s real ‘final frontier’, the oceans are still so little-explored that billionaires are queuing up to buy
vessels that give them a glimpse of the dramatic seascapes and incredible wildlife of the world’s oceans.
So if you have a spare few million in the bank and want some holiday snaps and Instagram posts that will really
trump everyone else, you may want to snap up one of these...
Whale of a time: The OrcaSub takes you 2000 feet beneath the surface in two pressurised Perspex viewing domes
for optimum exploration
...

GPT-2: The U-Boat Worx submarine can dive to 984 feet          Reference: Feel like a James Bond villain with the most
(300 metres) and is capable of diving to the bottom of         advanced submerged ocean vehicles. Some convert from
the ocean. It can be used for diving, snorkelling and even     a yacht into a submarine, and others detached off for a
snorkel diving. The submarine is also equipped with a          speedy exploration. Oliver’s Travels offers a Mile Low
snorkelling platform.                                          package on its submarine, complete with chef and butler
                                                               for £175,000 a night.

Article: Amina Ali Qassim is sitting with her youngest grandchild on her lap, wiping away tears with her headscarf.
Only a few months old, this is the baby girl whose ears she desperately tried to cover the night the aerial bombardment
started. She lay awake, she says, in a village mosque on the Yemeni island of Birim, counting explosions as the
baby cried.
It could have been worse though. They could have still been in their house when the first missile landed.
“Our neighbor shouted to my husband ‘you have to leave, they’re coming.’ And we just ran. As soon as we left the
house, the first missile fell right by it and then a second on it. It burned everything to the ground,” Qassim tells us
...

GPT-2: Yemen is in the middle of a civil war. Saudi            Reference: Amina Ali Qassim’s family sought shelter in
Arabia is leading the coalition bombing campaign. It’s         a mosque before fleeing Yemen. Thousands like them are
been bombing Yemen for more than two months now.               boarding boats to sail to Djibouti. Saudi Arabia has been
                                                               pounding Yemen in a bid to defeat Houthi rebels.

      Table 14. Summaries generated by GPT-2 and reference summaries on the test set of the CNN and Daily Mail dataset.




English reference                                               GPT-2 French translation

This re-release, titled The Next Day Extra, was presented       Les nouvelles re-releases, tout en premier disc, nécessaire
in the form of three disks: the original album, unpublished     de l’album, un studio session et remixes, plus une DVD
studio sessions and remixes, plus a DVD containing the          de l’écran de quelques clips qui ont été déjà échappés.
four clips that have already been unveiled.

French reference                                                GPT-2 English translation

Un homme expliquait que le fonctionnement de la hernia          This release, originally released on three discs: the origi-
fonctionnelle qu’il avait reconnaît avant de faire, le fonc-    nal album, sessions studio inédites and remixes, a DVD
tionnement de la hernia fonctionnelle que j’ai réussi, j’ai     containing the four clips that were released on the previous
réussi.                                                         day.

English reference                                               GPT-2 French translation

One man explained that the free hernia surgery he’d re-         Un homme expliquait que le fonctionnement de la hernia
ceived will allow him to work again.                            fonctionnelle qu’il avait reconnaît avant de faire, le fonc-
                                                                tionnement de la hernia fonctionnelle que j’ai réussi, j’ai
                                                                réussi.

French reference                                                GPT-2 English translation

Un homme a expliqué que l’opération gratuite qu’il avait      A man told me that the operation gratuity he had been
subie pour soigner une hernie lui permettrait de travailler     promised would not allow him to travel.
à nouveau.

English reference                                               GPT-2 French translation

Speaking to an open government conference in London             Le premier conference de l’édition internationale en ligne
via video link, Mr Kerry said: “There is no question that       sur la vide, le ministre de l’Édition internationale a été
the President and I and others in government have actually      vide, lui dit : “Il n’y a qu’une question que le président
learned of some things that had been happening on an            et moi et tous les autres personnes d’une administration
automatic pilot because the ability has been there, going       qui a été évident par la fois, par la part de la guerre, par la
back to World War Two and to the very difficult years of        part de la guer
the Cold War, and then, of course, 9/11.”

French reference                                                GPT-2 English translation

S’exprimant lors d’une conférence intergouvernementale          In a conférence between the United States and London,
à Londres par liaison vidéo, M. Kerry a déclaré: “Il est        Secretary of State John Kerry said: “It is indeniable that
indéniable que le Président, moi-même et d’autres mem-          the President, myself and others of the government have
bres du gouvernement avons pris connaissance de cer-            been aware of certain certain choices that have been made
taines choses en mode pilote automatique parce que nous         in the past in order to be able to do certain things in a more
en avions la possibilité, dès la Seconde guerre mondiale et     automated way.”
jusqu’aux années difficiles de la Guerre froide, puis bien
sûr le 11 septembre.”

                      Table 15. English to French and French to English translations generated by GPT-2.




Context (passage and previous question/answer pairs)

The 2008 Summer Olympics torch relay was run from March 24 until August 8, 2008, prior to the 2008 Summer
Olympics, with the theme of “one world, one dream”. Plans for the relay were announced on April 26, 2007, in
Beijing, China. The relay, also called by the organizers as the “Journey of Harmony”, lasted 129 days and carried
the torch 137,000 km (85,000 mi) – the longest distance of any Olympic torch relay since the tradition was started
ahead of the 1936 Summer Olympics.

After being lit at the birthplace of the Olympic Games in Olympia, Greece on March 24, the torch trav-
eled to the Panathinaiko Stadium in Athens, and then to Beijing, arriving on March 31. From Beijing, the torch was
following a route passing through six continents. The torch has visited cities along the Silk Road, symbolizing
ancient links between China and the rest of the world. The relay also included an ascent with the flame to the top of
Mount Everest on the border of Nepal and Tibet, China from the Chinese side, which was closed specially for the
event.

Q: What was the theme
A: “one world, one dream”.

Q: What was the length of the race?
A: 137,000 km

Q: Was it larger than previous ones?
A: No

Q: Where did the race begin?
A: Olympia, Greece

Q: Is there anything notable about that place?
A: birthplace of Olympic Games

Q: Where did they go after?
A: Athens

Q: How many days was the race?
A: seven

Q: Did they visit any notable landmarks?
A: Panathinaiko Stadium

Q: And did they climb any mountains?
A:

Model answer: Everest
Turker answers: unknown, yes, Yes, yes


                                        Table 16. Selected CoQA completion.
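
The context layout shown in Table 16 (passage, then alternating "Q:" and "A:" lines, ending on an open "A:") can be assembled mechanically. The sketch below is only an illustration of that layout: the function name `build_coqa_prompt`, its argument names, and the exact blank-line spacing are assumptions made here, not details confirmed by the table.

```python
def build_coqa_prompt(passage, history, question):
    """Format a passage, prior (question, answer) pairs, and a new question.

    The returned string ends with an open "A:" so that a language model
    conditioned on it is expected to continue with the answer.
    """
    lines = [passage.strip(), ""]
    for prev_question, prev_answer in history:
        lines.append(f"Q: {prev_question}")
        lines.append(f"A: {prev_answer}")
        lines.append("")  # blank line between turns, mirroring the table layout
    lines.append(f"Q: {question}")
    lines.append("A:")
    return "\n".join(lines)


if __name__ == "__main__":
    prompt = build_coqa_prompt(
        passage="The relay also included an ascent with the flame to the top of Mount Everest.",
        history=[("Where did the race begin?", "Olympia, Greece")],
        question="And did they climb any mountains?",
    )
    print(prompt)
```

The model's continuation after the final "A:" would then be read off as its answer, as in the "Model answer" row of the table.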




Context (passage and previous question/answer pairs)

Tom goes everywhere with Catherine Green, a 54-year-old secretary. He moves around her office at work and goes
shopping with her. “Most people don’t seem to mind Tom,” says Catherine, who thinks he is wonderful. “He’s my
fourth child,” she says. She may think of him and treat him that way as her son. He moves around buying his food,
paying his health bills and his taxes, but in fact Tom is a dog.

Catherine and Tom live in Sweden, a country where everyone is expected to lead an orderly life accord-
ing to rules laid down by the government, which also provides a high level of care for its people. This level of care
costs money.

People in Sweden pay taxes on everything, so aren’t surprised to find that owning a dog means more
taxes. Some people are paying as much as 500 Swedish kronor in taxes a year for the right to keep their dog, which
is spent by the government on dog hospitals and sometimes medical treatment for a dog that falls ill. However, most
such treatment is expensive, so owners often decide to offer health and even life for their dog.

In Sweden dog owners must pay for any damage their dog does. A Swedish Kennel Club official ex-
plains what this means: if your dog runs out on the road and gets hit by a passing car, you, as the owner, have to pay
for any damage done to the car, even if your dog has been killed in the accident.

Q: How old is Catherine?
A: 54

Q: where does she live?
A:

Model answer: Stockholm
Turker answers: Sweden, Sweden, in Sweden, Sweden


                                        Table 17. Selected CoQA completion.