RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu∗§, Myle Ott∗§, Naman Goyal∗§, Jingfei Du∗§, Mandar Joshi†, Danqi Chen§, Omer Levy§, Mike Lewis§, Luke Zettlemoyer†§, Veselin Stoyanov§

† Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
{mandar90,lsz}@cs.washington.edu

§ Facebook AI
{yinhanliu,myleott,naman,jingfeidu,danqi,omerlevy,mikelewis,lsz,ves}@fb.com




arXiv:1907.11692v1 [cs.CL] 26 Jul 2019
Abstract

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.¹

1 Introduction

Self-training methods such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), BERT (Devlin et al., 2019), XLM (Lample and Conneau, 2019), and XLNet (Yang et al., 2019) have brought significant performance gains, but it can be challenging to determine which aspects of the methods contribute the most. Training is computationally expensive, limiting the amount of tuning that can be done, and is often done with private training data of varying sizes, limiting our ability to measure the effects of the modeling advances.

We present a replication study of BERT pretraining (Devlin et al., 2019), which includes a careful evaluation of the effects of hyperparameter tuning and training set size. We find that BERT was significantly undertrained and propose an improved recipe for training BERT models, which we call RoBERTa, that can match or exceed the performance of all of the post-BERT methods. Our modifications are simple; they include: (1) training the model longer, with bigger batches, over more data; (2) removing the next sentence prediction objective; (3) training on longer sequences; and (4) dynamically changing the masking pattern applied to the training data. We also collect a large new dataset (CC-NEWS) of comparable size to other privately used datasets, to better control for training set size effects.

When controlling for training data, our improved training procedure improves upon the published BERT results on both GLUE and SQuAD. When trained for longer over additional data, our model achieves a score of 88.5 on the public GLUE leaderboard, matching the 88.4 reported by Yang et al. (2019). Our model establishes a new state-of-the-art on 4/9 of the GLUE tasks: MNLI, QNLI, RTE and STS-B. We also match state-of-the-art results on SQuAD and RACE. Overall, we re-establish that BERT's masked language model training objective is competitive with other recently proposed training objectives such as perturbed autoregressive language modeling (Yang et al., 2019).²

In summary, the contributions of this paper are: (1) We present a set of important BERT design choices and training strategies and introduce

∗ Equal contribution.
¹ Our models and code are available at: https://github.com/pytorch/fairseq
² It is possible that these other methods could also improve with more tuning. We leave this exploration to future work.
alternatives that lead to better downstream task performance; (2) We use a novel dataset, CC-NEWS, and confirm that using more data for pretraining further improves performance on downstream tasks; (3) Our training improvements show that masked language model pretraining, under the right design choices, is competitive with all other recently published methods. We release our model, pretraining and fine-tuning code implemented in PyTorch (Paszke et al., 2017).

2 Background

In this section, we give a brief overview of the BERT (Devlin et al., 2019) pretraining approach and some of the training choices that we will examine experimentally in the following section.

2.1 Setup

BERT takes as input a concatenation of two segments (sequences of tokens), x_1, ..., x_N and y_1, ..., y_M. Segments usually consist of more than one natural sentence. The two segments are presented as a single input sequence to BERT with special tokens delimiting them: [CLS], x_1, ..., x_N, [SEP], y_1, ..., y_M, [EOS]. M and N are constrained such that M + N < T, where T is a parameter that controls the maximum sequence length during training.

The model is first pretrained on a large unlabeled text corpus and subsequently finetuned using end-task labeled data.

2.2 Architecture

BERT uses the now ubiquitous transformer architecture (Vaswani et al., 2017), which we will not review in detail. We use a transformer architecture with L layers. Each block uses A self-attention heads and hidden dimension H.

2.3 Training Objectives

During pretraining, BERT uses two objectives: masked language modeling and next sentence prediction.

Masked Language Model (MLM). A random sample of the tokens in the input sequence is selected and replaced with the special token [MASK]. The MLM objective is a cross-entropy loss on predicting the masked tokens. BERT uniformly selects 15% of the input tokens for possible replacement. Of the selected tokens, 80% are replaced with [MASK], 10% are left unchanged, and 10% are replaced by a randomly selected vocabulary token.

In the original implementation, random masking and replacement is performed once in the beginning and saved for the duration of training, although in practice, data is duplicated so the mask is not always the same for every training sentence (see Section 4.1).

Next Sentence Prediction (NSP). NSP is a binary classification loss for predicting whether two segments follow each other in the original text. Positive examples are created by taking consecutive sentences from the text corpus. Negative examples are created by pairing segments from different documents. Positive and negative examples are sampled with equal probability.

The NSP objective was designed to improve performance on downstream tasks, such as Natural Language Inference (Bowman et al., 2015), which require reasoning about the relationships between pairs of sentences.

2.4 Optimization

BERT is optimized with Adam (Kingma and Ba, 2015) using the following parameters: β_1 = 0.9, β_2 = 0.999, ε = 1e-6 and L2 weight decay of 0.01. The learning rate is warmed up over the first 10,000 steps to a peak value of 1e-4, and then linearly decayed. BERT trains with a dropout of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016). Models are pretrained for S = 1,000,000 updates, with minibatches containing B = 256 sequences of maximum length T = 512 tokens.

2.5 Data

BERT is trained on a combination of BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA, which totals 16GB of uncompressed text.³

3 Experimental Setup

In this section, we describe the experimental setup for our replication study of BERT.

3.1 Implementation

We reimplement BERT in FAIRSEQ (Ott et al., 2019). We primarily follow the original BERT

³ Yang et al. (2019) use the same dataset but report having only 13GB of text after data cleaning. This is most likely due to subtle differences in cleaning of the Wikipedia data.
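As an illustration of the masking rule in Section 2.3 (select 15% of the tokens, then replace 80% of the selected tokens with [MASK], leave 10% unchanged, and swap 10% for a random vocabulary token), here is a minimal Python sketch. It is not the BERT or FAIRSEQ implementation. The function name `mask_tokens`, the toy vocabulary, the choice to select exactly round(0.15 × length) positions, and the assumption that special tokens such as [CLS] and [SEP] have already been excluded from `tokens` are all made up for this example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Corrupt a token list following the 15% / 80-10-10 rule.

    Returns the corrupted tokens and the sorted positions on which an
    MLM loss would be computed. Hypothetical helper, for illustration only.
    """
    # Assumption: select a fixed count of positions, uniformly without replacement.
    n_select = max(1, round(len(tokens) * select_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_select))

    corrupted = list(tokens)
    for pos in positions:
        draw = rng.random()
        if draw < 0.8:
            corrupted[pos] = MASK               # 80%: replace with [MASK]
        elif draw < 0.9:
            pass                                # 10%: leave the token unchanged
        else:
            corrupted[pos] = rng.choice(vocab)  # 10%: random vocabulary token
    return corrupted, positions


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "the quick brown fox jumps over the lazy dog again and again".split()
    toy_vocab = sorted(set(sentence))
    corrupted, positions = mask_tokens(sentence, toy_vocab, rng)
    print(positions)
    print(corrupted)
```

Every selected position contributes to the loss, including those left unchanged or replaced by a random token, which is why the sketch returns the positions separately from the corrupted sequence.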
optimization hyperparameters, given in Section 2, except for the peak learning rate and number of warmup steps, which are tuned separately for each setting. We additionally found training to be very sensitive to the Adam epsilon term, and in some cases we obtained better performance or improved stability after tuning it. Similarly, we found setting β_2 = 0.98 to improve stability when training with large batch sizes.

We pretrain with sequences of at most T = 512 tokens. Unlike Devlin et al. (2019), we do not randomly inject short sequences, and we do not train with a reduced sequence length for the first 90% of updates. We train only with full-length sequences.

We train with mixed precision floating point arithmetic on DGX-1 machines, each with 8 × 32GB Nvidia V100 GPUs interconnected by Infiniband (Micikevicius et al., 2018).

3.2 Data

BERT-style pretraining crucially relies on large quantities of text. Baevski et al. (2019) demonstrate that increasing data size can result in improved end-task performance. Several efforts have trained on datasets larger and more diverse than the original BERT (Radford et al., 2019; Yang et al., 2019; Zellers et al., 2019). Unfortunately, not all of the additional datasets can be publicly released. For our study, we focus on gathering as much data as possible for experimentation, allowing us to match the overall quality and quantity of data as appropriate for each comparison.

We consider five English-language corpora of varying sizes and domains, totaling over 160GB of uncompressed text. We use the following text corpora:

• BOOKCORPUS (Zhu et al., 2015) plus English WIKIPEDIA. This is the original data used to train BERT. (16GB).

• CC-NEWS, which we collected from the English portion of the CommonCrawl News dataset (Nagel, 2016). The data contains 63 million English news articles crawled between September 2016 and February 2019. (76GB after filtering).⁴

• OPENWEBTEXT (Gokaslan and Cohen, 2019), an open-source recreation of the WebText corpus described in Radford et al. (2019). The text is web content extracted from URLs shared on Reddit with at least three upvotes. (38GB).⁵

• STORIES, a dataset introduced in Trinh and Le (2018) containing a subset of CommonCrawl data filtered to match the story-like style of Winograd schemas. (31GB).

3.3 Evaluation

Following previous work, we evaluate our pretrained models on downstream tasks using the following three benchmarks.

GLUE. The General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019b) is a collection of 9 datasets for evaluating natural language understanding systems.⁶ Tasks are framed as either single-sentence classification or sentence-pair classification tasks. The GLUE organizers provide training and development data splits as well as a submission server and leaderboard that allow participants to evaluate and compare their systems on private held-out test data.

For the replication study in Section 4, we report results on the development sets after finetuning the pretrained models on the corresponding single-task training data (i.e., without multi-task training or ensembling). Our finetuning procedure follows the original BERT paper (Devlin et al., 2019).

In Section 5 we additionally report test set results obtained from the public leaderboard. These results depend on several task-specific modifications, which we describe in Section 5.1.

SQuAD. The Stanford Question Answering Dataset (SQuAD) provides a paragraph of context and a question. The task is to answer the question by extracting the relevant span from the context. We evaluate on two versions of SQuAD: V1.1 and V2.0 (Rajpurkar et al., 2016, 2018). In V1.1 the context always contains an answer, whereas in

⁴ We use news-please (Hamborg et al., 2017) to collect and extract CC-NEWS. CC-NEWS is similar to the REALNEWS dataset described in Zellers et al. (2019).
⁵ The authors and their affiliated institutions are not in any way affiliated with the creation of the OpenWebText dataset.
⁶ The datasets are: CoLA (Warstadt et al., 2018), Stanford Sentiment Treebank (SST) (Socher et al., 2013), Microsoft Research Paragraph Corpus (MRPC) (Dolan and Brockett, 2005), Semantic Textual Similarity Benchmark (STS) (Agirre et al., 2007), Quora Question Pairs (QQP) (Iyer et al., 2016), Multi-Genre NLI (MNLI) (Williams et al., 2018), Question NLI (QNLI) (Rajpurkar et al., 2016), Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009) and Winograd NLI (WNLI) (Levesque et al., 2011).
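Section 3.1 retunes two parameters of the learning rate schedule from Section 2.4: the peak value and the number of warmup steps. The Python sketch below shows one way to write that schedule, with the Section 2.4 values as defaults. It makes one assumption the text does not state: the linear decay is taken to reach zero at the final update S. The function name `learning_rate` and its parameter names are hypothetical and do not correspond to any FAIRSEQ API.

```python
def learning_rate(step, peak_lr=1e-4, warmup_steps=10_000, total_steps=1_000_000):
    """Linear warmup to peak_lr, then linear decay.

    Defaults follow Section 2.4 (peak 1e-4, 10,000 warmup steps,
    S = 1,000,000 updates). Assumption: the decay ends at zero on the
    last update; the text only says "linearly decayed".
    """
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak value.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from the peak down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 5_000, 10_000, 505_000, 1_000_000):
        print(step, learning_rate(step))
```

Tuning "the peak learning rate and number of warmup steps" then amounts to passing different `peak_lr` and `warmup_steps` values for each setting.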
V2.0 some questions are not answered in the provided context, making the task more challenging.

For SQuAD V1.1 we adopt the same span prediction method as BERT (Devlin et al., 2019). For SQuAD V2.0, we add an additional binary classifier to predict whether the question is answerable, which we train jointly by summing the classification and span loss terms. During evaluation, we only predict span indices on pairs that are classified as answerable.

RACE. The ReAding Comprehension from Examinations (RACE) (Lai et al., 2017) task is a large-scale reading comprehension dataset with more than 28,000 passages and nearly 100,000 questions. The dataset is collected from English examinations in China, which are designed for middle and high school students. In RACE, each passage is associated with multiple questions. For every question, the task is to select one correct answer from four options. RACE has significantly longer context than other popular reading comprehension datasets and the proportion of questions that requires reasoning is very large.

4 Training Procedure Analysis

This section explores and quantifies which choices are important for successfully pretraining BERT models. We keep the model architecture fixed.⁷ Specifically, we begin by training BERT models with the same configuration as BERT_BASE (L = 12, H = 768, A = 12, 110M params).

4.1 Static vs. Dynamic Masking

As discussed in Section 2, BERT relies on randomly masking and predicting tokens. The original BERT implementation performed masking once during data preprocessing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. Thus, each training sequence was seen with the same mask four times during training.

We compare this strategy with dynamic masking where we generate the masking pattern every time we feed a sequence to the model. This becomes crucial when pretraining for more steps or with larger datasets.

Masking                  SQuAD 2.0   MNLI-m   SST-2
reference                76.3        84.3     92.8
Our reimplementation:
static                   78.3        84.3     92.5
dynamic                  78.7        84.0     92.9

Table 1: Comparison between static and dynamic masking for BERT_BASE. We report F1 for SQuAD and accuracy for MNLI-m and SST-2. Reported results are medians over 5 random initializations (seeds). Reference results are from Yang et al. (2019).

Results. Table 1 compares the published BERT_BASE results from Devlin et al. (2019) to our reimplementation with either static or dynamic masking. We find that our reimplementation with static masking performs similarly to the original BERT model, and dynamic masking is comparable or slightly better than static masking.

Given these results and the additional efficiency benefits of dynamic masking, we use dynamic masking in the remainder of the experiments.

4.2 Model Input Format and Next Sentence Prediction

In the original BERT pretraining procedure, the model observes two concatenated document segments, which are either sampled contiguously from the same document (with p = 0.5) or from distinct documents. In addition to the masked language modeling objective, the model is trained to predict whether the observed document segments come from the same or distinct documents via an auxiliary Next Sentence Prediction (NSP) loss.

The NSP loss was hypothesized to be an important factor in training the original BERT model. Devlin et al. (2019) observe that removing NSP hurts performance, with significant performance degradation on QNLI, MNLI, and SQuAD 1.1. However, some recent work has questioned the necessity of the NSP loss (Lample and Conneau, 2019; Yang et al., 2019; Joshi et al., 2019).

To better understand this discrepancy, we compare several alternative training formats:

• SEGMENT-PAIR+NSP: This follows the original input format used in BERT (Devlin et al., 2019), with the NSP loss. Each input has a pair of segments, which can each contain multiple natural sentences, but the total combined length must be less than 512 tokens.

⁷ Studying architectural changes, including larger architectures, is an important area for future work.
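To make the static versus dynamic masking comparison of Section 4.1 concrete, the following Python sketch counts how often a single sequence is seen with each mask over 40 epochs. It is an illustration under stated assumptions, not the original BERT preprocessing or the FAIRSEQ code. The helper names are hypothetical, masks are reduced to sets of positions (15% of the sequence), and cycling through the 10 precomputed masks in round-robin order is an assumption about how the duplicated data is visited.

```python
import random
from collections import Counter


def sample_mask_positions(seq_len, rng, mask_prob=0.15):
    """Pick the positions to mask for one pass over a sequence (hypothetical helper)."""
    n_select = max(1, round(seq_len * mask_prob))
    return tuple(sorted(rng.sample(range(seq_len), n_select)))


def static_masks(seq_len, epochs, rng, num_duplicates=10):
    """Static masking: masks are fixed at preprocessing time.

    The data is duplicated num_duplicates times, each copy with its own mask,
    and training cycles through the copies (round-robin is an assumption).
    """
    precomputed = [sample_mask_positions(seq_len, rng) for _ in range(num_duplicates)]
    return [precomputed[epoch % num_duplicates] for epoch in range(epochs)]


def dynamic_masks(seq_len, epochs, rng):
    """Dynamic masking: a fresh mask is drawn every time the sequence is fed to the model."""
    return [sample_mask_positions(seq_len, rng) for _ in range(epochs)]


if __name__ == "__main__":
    rng = random.Random(0)
    static = static_masks(seq_len=128, epochs=40, rng=rng)
    dynamic = dynamic_masks(seq_len=128, epochs=40, rng=rng)
    # With 10 precomputed masks and 40 epochs, each static mask recurs four times.
    print("static repeats per mask:", sorted(Counter(static).values()))
    print("distinct dynamic masks:", len(set(dynamic)))
```

Under static masking the number of distinct masks is capped by the duplication factor, so training for more epochs only increases the repeats. Dynamic masking draws a new pattern on every pass, which fits the observation that it matters more when pretraining for more steps or on larger datasets.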
Model                     SQuAD 1.1/2.0   MNLI-m   SST-2   RACE
Our reimplementation (with NSP loss):
SEGMENT-PAIR              90.4/78.7       84.0     92.9    64.2
SENTENCE-PAIR             88.7/76.2       82.9     92.1    63.0
Our reimplementation (without NSP loss):
FULL-SENTENCES            90.4/79.1       84.7     92.5    64.8
DOC-SENTENCES             90.6/79.7       84.7     92.7    65.6
BERT_BASE                 88.5/76.3       84.3     92.8    64.3
XLNet_BASE (K = 7)        –/81.3          85.8     92.7    66.1
XLNet_BASE (K = 6)        –/81.0          85.6     93.4    66.7

Table 2: Development set results for base models pretrained over BOOKCORPUS and WIKIPEDIA. All models are trained for 1M steps with a batch size of 256 sequences. We report F1 for SQuAD and accuracy for MNLI-m, SST-2 and RACE. Reported results are medians over five random initializations (seeds). Results for BERT_BASE and XLNet_BASE are from Yang et al. (2019).


• SENTENCE-PAIR+NSP: Each input contains a pair of natural sentences, either sampled from a contiguous portion of one document or from separate documents. Since these inputs are significantly shorter than 512 tokens, we increase the batch size so that the total number of tokens remains similar to SEGMENT-PAIR+NSP. We retain the NSP loss.

• FULL-SENTENCES: Each input is packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. Inputs may cross document boundaries. When we reach the end of one document, we begin sampling sentences from the next document and add an extra separator token between documents. We remove the NSP loss.

• DOC-SENTENCES: Inputs are constructed similarly to FULL-SENTENCES, except that they may not cross document boundaries. Inputs sampled near the end of a document may be shorter than 512 tokens, so we dynamically increase the batch size in these cases to achieve a similar number of total tokens as FULL-SENTENCES. We remove the NSP loss.

Results. Table 2 shows results for the four different settings. We first compare the original SEGMENT-PAIR input format from Devlin et al. (2019) to the SENTENCE-PAIR format; both formats retain the NSP loss, but the latter uses single sentences. We find that using individual sentences hurts performance on downstream tasks, which we hypothesize is because the model is not able to learn long-range dependencies.

We next compare training without the NSP loss and training with blocks of text from a single document (DOC-SENTENCES). We find that this setting outperforms the originally published BERT_BASE results and that removing the NSP loss matches or slightly improves downstream task performance, in contrast to Devlin et al. (2019). It is possible that the original BERT implementation may only have removed the loss term while still retaining the SEGMENT-PAIR input format.

Finally, we find that restricting sequences to come from a single document (DOC-SENTENCES) performs slightly better than packing sequences from multiple documents (FULL-SENTENCES). However, because the DOC-SENTENCES format results in variable batch sizes, we use FULL-SENTENCES in the remainder of our experiments for easier comparison with related work.

4.3 Training with large batches

Past work in Neural Machine Translation has shown that training with very large mini-batches can both improve optimization speed and end-task performance when the learning rate is increased appropriately (Ott et al., 2018). Recent work has shown that BERT is also amenable to large batch training (You et al., 2019).

Devlin et al. (2019) originally trained BERT_BASE for 1M steps with a batch size of 256 sequences. This is equivalent in computational cost, via gradient accumulation, to training for 125K steps with a batch size of 2K sequences, or for 31K steps with a batch size of 8K.

In Table 3 we compare perplexity and end-
 bsz      steps      lr      ppl      MNLI-m          SST-2          The      original     BERT       implementa-
                                                                  tion (Devlin et al., 2019) uses a character-level
 256      1M        1e-4     3.99        84.7          92.7       BPE vocabulary of size 30K, which is learned
 2K      125K       7e-4     3.68        85.2          92.9       after preprocessing the input with heuristic tok-
 8K       31K       1e-3     3.77        84.6          92.8       enization rules. Following Radford et al. (2019),
                                                                  we instead consider training BERT with a larger
Table 3: Perplexity on held-out training data (ppl) and           byte-level BPE vocabulary containing 50K sub-
development set accuracy for base models trained over             word units, without any additional preprocessing
B OOK C ORPUS and W IKIPEDIA with varying batch                   or tokenization of the input. This adds approxi-
sizes (bsz). We tune the learning rate (lr) for each set-
                                                                  mately 15M and 20M additional parameters for
ting. Models make the same number of passes over the
data (epochs) and have the same computational cost.               BERTBASE and BERTLARGE, respectively.
task performance of BERTBASE as we increase the batch size, controlling for the number of passes through the training data. We observe that training with large batches improves perplexity for the masked language modeling objective, as well as end-task accuracy. Large batches are also easier to parallelize via distributed data parallel training,⁸ and in later experiments we train with batches of 8K sequences.

Notably, You et al. (2019) train BERT with even larger batch sizes, up to 32K sequences. We leave further exploration of the limits of large batch training to future work.

⁸ Large batch training can improve training efficiency even without large scale parallel hardware through gradient accumulation, whereby gradients from multiple mini-batches are accumulated locally before each optimization step. This functionality is supported natively in FAIRSEQ (Ott et al., 2019).

4.4 Text Encoding

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) is a hybrid between character- and word-level representations that allows handling the large vocabularies common in natural language corpora. Instead of full words, BPE relies on subword units, which are extracted by performing statistical analysis of the training corpus.

BPE vocabulary sizes typically range from 10K-100K subword units. However, unicode characters can account for a sizeable portion of this vocabulary when modeling large and diverse corpora, such as the ones considered in this work. Radford et al. (2019) introduce a clever implementation of BPE that uses bytes instead of unicode characters as the base subword units. Using bytes makes it possible to learn a subword vocabulary of a modest size (50K units) that can still encode any input text without introducing any “unknown” tokens.

Early experiments revealed only slight differences between these encodings, with the Radford et al. (2019) BPE achieving slightly worse end-task performance on some tasks. Nevertheless, we believe the advantages of a universal encoding scheme outweigh the minor degradation in performance and use this encoding in the remainder of our experiments. A more detailed comparison of these encodings is left to future work.

5 RoBERTa

In the previous section we propose modifications to the BERT pretraining procedure that improve end-task performance. We now aggregate these improvements and evaluate their combined impact. We call this configuration RoBERTa for Robustly optimized BERT approach. Specifically, RoBERTa is trained with dynamic masking (Section 4.1), FULL-SENTENCES without NSP loss (Section 4.2), large mini-batches (Section 4.3) and a larger byte-level BPE (Section 4.4).

Additionally, we investigate two other important factors that have been under-emphasized in previous work: (1) the data used for pretraining, and (2) the number of training passes through the data. For example, the recently proposed XLNet architecture (Yang et al., 2019) is pretrained using nearly 10 times more data than the original BERT (Devlin et al., 2019). It is also trained with a batch size eight times larger for half as many optimization steps, thus seeing four times as many sequences in pretraining compared to BERT.

To help disentangle the importance of these factors from other modeling choices (e.g., the pretraining objective), we begin by training RoBERTa following the BERTLARGE architecture (L = 24, H = 1024, A = 16, 355M parameters). We pretrain for 100K steps over a comparable BOOKCORPUS plus WIKIPEDIA dataset as was used in
Model                         data     bsz    steps   SQuAD (v1.1/2.0)   MNLI-m   SST-2
RoBERTa
  with BOOKS + WIKI           16GB     8K     100K    93.6/87.3          89.0     95.3
  + additional data (§3.2)    160GB    8K     100K    94.0/87.7          89.3     95.6
  + pretrain longer           160GB    8K     300K    94.4/88.7          90.0     96.1
  + pretrain even longer      160GB    8K     500K    94.6/89.4          90.2     96.4
BERTLARGE
  with BOOKS + WIKI           13GB     256    1M      90.9/81.8          86.6     93.7
XLNetLARGE
  with BOOKS + WIKI           13GB     256    1M      94.0/87.8          88.4     94.4
  + additional data           126GB    2K     500K    94.5/88.8          89.8     95.6

Table 4: Development set results for RoBERTa as we pretrain over more data (16GB → 160GB of text) and pretrain for longer (100K → 300K → 500K steps). Each row accumulates improvements from the rows above. RoBERTa matches the architecture and training objective of BERTLARGE. Results for BERTLARGE and XLNetLARGE are from Devlin et al. (2019) and Yang et al. (2019), respectively. Complete results on all GLUE tasks can be found in the Appendix.


Devlin et al. (2019). We pretrain our model using 1024 V100 GPUs for approximately one day.

Results We present our results in Table 4. When controlling for training data, we observe that RoBERTa provides a large improvement over the originally reported BERTLARGE results, reaffirming the importance of the design choices we explored in Section 4.

Next, we combine this data with the three additional datasets described in Section 3.2. We train RoBERTa over the combined data with the same number of training steps as before (100K). In total, we pretrain over 160GB of text. We observe further improvements in performance across all downstream tasks, validating the importance of data size and diversity in pretraining.⁹

Finally, we pretrain RoBERTa for significantly longer, increasing the number of pretraining steps from 100K to 300K, and then further to 500K. We again observe significant gains in downstream task performance, and the 300K and 500K step models outperform XLNetLARGE across most tasks. We note that even our longest-trained model does not appear to overfit our data and would likely benefit from additional training.

In the rest of the paper, we evaluate our best RoBERTa model on the three different benchmarks: GLUE, SQuAD and RACE. Specifically, we consider RoBERTa trained for 500K steps over all five of the datasets introduced in Section 3.2.

⁹ Our experiments conflate increases in data size and diversity. We leave a more careful analysis of these two dimensions to future work.

5.1 GLUE Results

For GLUE we consider two finetuning settings. In the first setting (single-task, dev) we finetune RoBERTa separately for each of the GLUE tasks, using only the training data for the corresponding task. We consider a limited hyperparameter sweep for each task, with batch sizes ∈ {16, 32} and learning rates ∈ {1e−5, 2e−5, 3e−5}, with a linear warmup for the first 6% of steps followed by a linear decay to 0. We finetune for 10 epochs and perform early stopping based on each task’s evaluation metric on the dev set. The rest of the hyperparameters remain the same as during pretraining. In this setting, we report the median development set results for each task over five random initializations, without model ensembling.

In the second setting (ensembles, test), we compare RoBERTa to other approaches on the test set via the GLUE leaderboard. While many submissions to the GLUE leaderboard depend on multi-task finetuning, our submission depends only on single-task finetuning. For RTE, STS and MRPC we found it helpful to finetune starting from the MNLI single-task model, rather than the baseline pretrained RoBERTa. We explore a slightly wider hyperparameter space, described in the Appendix, and ensemble between 5 and 7 models per task.
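The single-task finetuning schedule described in Section 5.1 (linear warmup over the first 6% of steps, then linear decay to 0, swept over two batch sizes and three peak learning rates) can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions, not the FAIRSEQ implementation: the function name `linear_warmup_decay_lr` and the example step count of 1000 are hypothetical.

```python
from itertools import product


def linear_warmup_decay_lr(step: int, total_steps: int, peak_lr: float,
                           warmup_frac: float = 0.06) -> float:
    """Learning rate at `step`: linear warmup from 0 to peak_lr, then linear decay to 0."""
    warmup_steps = round(total_steps * warmup_frac)
    if step < warmup_steps:
        # Warmup phase: rate grows proportionally to the step index.
        return peak_lr * step / warmup_steps
    # Decay phase: rate shrinks linearly and reaches 0 at total_steps.
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)


# The limited sweep from the text: batch sizes {16, 32} x learning rates {1e-5, 2e-5, 3e-5}.
sweep = list(product([16, 32], [1e-5, 2e-5, 3e-5]))

# Example: schedule for one configuration over a hypothetical 1000 update steps.
total_steps = 1000
schedule = [linear_warmup_decay_lr(s, total_steps, peak_lr=2e-5)
            for s in range(total_steps + 1)]
print(len(sweep), max(schedule), schedule[-1])
```

In this sketch the peak rate is reached at the end of warmup (step 60 of 1000) and the rate is exactly 0 at the final step; the sweep contains six configurations per task.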
                MNLI        QNLI   QQP    RTE    SST    MRPC   CoLA   STS    WNLI   Avg
Single-task single models on dev
BERTLARGE       86.6/-      92.3   91.3   70.4   93.2   88.0   60.6   90.0   -      -
XLNetLARGE      89.8/-      93.9   91.8   83.8   95.6   89.2   63.6   91.8   -      -
RoBERTa         90.2/90.2   94.7   92.2   86.6   96.4   90.9   68.0   92.4   91.3   -
Ensembles on test (from leaderboard as of July 25, 2019)
ALICE           88.2/87.9   95.7   90.7   83.5   95.2   92.6   68.6   91.1   80.8   86.3
MT-DNN          87.9/87.4   96.0   89.9   86.3   96.5   92.7   68.4   91.1   89.0   87.6
XLNet           90.2/89.8   98.6   90.3   86.3   96.8   93.0   67.8   91.6   90.4   88.4
RoBERTa         90.8/90.2   98.9   90.2   88.2   96.7   92.3   67.8   92.2   89.0   88.5

Table 5: Results on GLUE. All results are based on a 24-layer architecture. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively. RoBERTa results on the development set are a median over five runs. RoBERTa results on the test set are ensembles of single-task models. For RTE, STS and MRPC we finetune starting from the MNLI model instead of the baseline pretrained model. Averages are obtained from the GLUE leaderboard.


Task-specific modifications Two of the GLUE tasks require task-specific finetuning approaches to achieve competitive leaderboard results.

QNLI: Recent submissions on the GLUE leaderboard adopt a pairwise ranking formulation for the QNLI task, in which candidate answers are mined from the training set and compared to one another, and a single (question, candidate) pair is classified as positive (Liu et al., 2019b,a; Yang et al., 2019). This formulation significantly simplifies the task, but is not directly comparable to BERT (Devlin et al., 2019). Following recent work, we adopt the ranking approach for our test submission, but for direct comparison with BERT we report development set results based on a pure classification approach.

WNLI: We found the provided NLI-format data to be challenging to work with. Instead we use the reformatted WNLI data from SuperGLUE (Wang et al., 2019a), which indicates the span of the query pronoun and referent. We finetune RoBERTa using the margin ranking loss from Kocijan et al. (2019). For a given input sentence, we use spaCy (Honnibal and Montani, 2017) to extract additional candidate noun phrases from the sentence and finetune our model so that it assigns higher scores to positive referent phrases than for any of the generated negative candidate phrases. One unfortunate consequence of this formulation is that we can only make use of the positive training examples, which excludes over half of the provided training examples.¹⁰

¹⁰ While we only use the provided WNLI training data, our results could potentially be improved by augmenting this with additional pronoun disambiguation datasets.

Results We present our results in Table 5. In the first setting (single-task, dev), RoBERTa achieves state-of-the-art results on all 9 of the GLUE task development sets. Crucially, RoBERTa uses the same masked language modeling pretraining objective and architecture as BERTLARGE, yet consistently outperforms both BERTLARGE and XLNetLARGE. This raises questions about the relative importance of model architecture and pretraining objective, compared to more mundane details like dataset size and training time that we explore in this work.

In the second setting (ensembles, test), we submit RoBERTa to the GLUE leaderboard and achieve state-of-the-art results on 4 out of 9 tasks and the highest average score to date. This is especially exciting because RoBERTa does not depend on multi-task finetuning, unlike most of the other top submissions. We expect future work may further improve these results by incorporating more sophisticated multi-task finetuning procedures.

5.2 SQuAD Results

We adopt a much simpler approach for SQuAD compared to past work. In particular, while both BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) augment their training data with additional QA datasets, we only finetune RoBERTa using the provided SQuAD training data. Yang et al. (2019) also employed a custom layer-wise learning rate schedule to finetune
XLNet, while we use the same learning rate for all layers.

For SQuAD v1.1 we follow the same finetuning procedure as Devlin et al. (2019). For SQuAD v2.0, we additionally classify whether a given question is answerable; we train this classifier jointly with the span predictor by summing the classification and span loss terms.

Results We present our results in Table 6. On the SQuAD v1.1 development set, RoBERTa matches the state-of-the-art set by XLNet. On the SQuAD v2.0 development set, RoBERTa sets a new state-of-the-art, improving over XLNet by 0.4 points (EM) and 0.6 points (F1).

We also submit RoBERTa to the public SQuAD 2.0 leaderboard and evaluate its performance relative to other systems. Most of the top systems build upon either BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019), both of which rely on additional external training data. In contrast, our submission does not use any additional data.

Our single RoBERTa model outperforms all but one of the single model submissions, and is the top scoring system among those that do not rely on data augmentation.

                             SQuAD 1.1       SQuAD 2.0
Model                        EM     F1       EM      F1
Single models on dev, w/o data augmentation
BERTLARGE                    84.1   90.9     79.0    81.8
XLNetLARGE                   89.0   94.5     86.1    88.8
RoBERTa                      88.9   94.6     86.5    89.4
Single models on test (as of July 25, 2019)
XLNetLARGE                   -      -        86.3†   89.1†
RoBERTa                      -      -        86.8    89.8
XLNet + SG-Net Verifier      -      -        87.0†   89.9†

Table 6: Results on SQuAD. † indicates results that depend on additional external training data. RoBERTa uses only the provided SQuAD data in both dev and test settings. BERTLARGE and XLNetLARGE results are from Devlin et al. (2019) and Yang et al. (2019), respectively.

5.3 RACE Results

In RACE, systems are provided with a passage of text, an associated question, and four candidate answers. Systems are required to classify which of the four candidate answers is correct.

We modify RoBERTa for this task by concatenating each candidate answer with the corresponding question and passage. We then encode each of these four sequences and pass the resulting [CLS] representations through a fully-connected layer, which is used to predict the correct answer. We truncate question-answer pairs that are longer than 128 tokens and, if needed, the passage so that the total length is at most 512 tokens.

Results on the RACE test sets are presented in Table 7. RoBERTa achieves state-of-the-art results on both middle-school and high-school settings.

Model          Accuracy   Middle   High
Single models on test (as of July 25, 2019)
BERTLARGE      72.0       76.6     70.1
XLNetLARGE     81.7       85.4     80.2
RoBERTa        83.2       86.5     81.3

Table 7: Results on the RACE test set. BERTLARGE and XLNetLARGE results are from Yang et al. (2019).

6 Related Work

Pretraining methods have been designed with different training objectives, including language modeling (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018), machine translation (McCann et al., 2017), and masked language modeling (Devlin et al., 2019; Lample and Conneau, 2019). Many recent papers have used a basic recipe of finetuning models for each end task (Howard and Ruder, 2018; Radford et al., 2018), and pretraining with some variant of a masked language model objective. However, newer methods have improved performance by multi-task fine tuning (Dong et al., 2019), incorporating entity embeddings (Sun et al., 2019), span prediction (Joshi et al., 2019), and multiple variants of autoregressive pretraining (Song et al., 2019; Chan et al., 2019; Yang et al., 2019). Performance is also typically improved by training bigger models on more data (Devlin et al., 2019; Baevski et al., 2019; Yang et al., 2019; Radford et al., 2019). Our goal was to replicate, simplify, and better tune the training of BERT, as a reference point for better understanding the relative performance of all of these methods.
7 Conclusion

We carefully evaluate a number of design decisions when pretraining BERT models. We find that performance can be substantially improved by training the model longer, with bigger batches over more data; removing the next sentence prediction objective; training on longer sequences; and dynamically changing the masking pattern applied to the training data. Our improved pretraining procedure, which we call RoBERTa, achieves state-of-the-art results on GLUE, RACE and SQuAD, without multi-task finetuning for GLUE or additional data for SQuAD. These results illustrate the importance of these previously overlooked design decisions and suggest that BERT’s pretraining objective remains competitive with recently proposed alternatives.

We additionally use a novel dataset, CC-NEWS, and release our models and code for pretraining and finetuning at: https://github.com/pytorch/fairseq.

References

Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007).

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. 2019. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785.

Roy Bar-Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. 2006. The second PASCAL recognising textual entailment challenge. In Proceedings of the second PASCAL challenges workshop on recognising textual entailment.

Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge.

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Empirical Methods in Natural Language Processing (EMNLP).

William Chan, Nikita Kitaev, Kelvin Guu, Mitchell Stern, and Jakob Uszkoreit. 2019. KERMIT: Generative insertion-based modeling for sequences. arXiv preprint arXiv:1906.01604.

Ido Dagan, Oren Glickman, and Bernardo Magnini. 2006. The PASCAL recognising textual entailment challenge. In Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising textual entailment.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Association for Computational Linguistics (NAACL).

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing.

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing.

Aaron Gokaslan and Vanya Cohen. 2019. Openwebtext corpus. http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus.

Felix Hamborg, Norman Meuschke, Corinna Breitinger, and Bela Gipp. 2017. news-please: A generic news crawler and extractor. In Proceedings of the 15th International Symposium of Information Science.

Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146.

Shankar Iyer, Nikhil Dandekar, and Kornél Csernai. 2016. First quora dataset release: Question pairs. https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. 2019. A surprisingly robust trick for winograd schema challenge. arXiv preprint arXiv:1905.06290.

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.

Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.

Hector J Levesque, Ernest Davis, and Leora Morgenstern. 2011. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019a. Improving multi-task deep neural networks via knowledge distillation for natural language understanding. arXiv preprint arXiv:1904.09482.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.

Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems (NIPS), pages 6297–6308.

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. Mixed precision training. In International Conference on Learning Representations.

Sebastian Nagel. 2016. Cc-news. http://web.archive.org/save/http://commoncrawl.org/2016/10/news-dataset-available.

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. FAIRSEQ: A fast, extensible toolkit for sequence modeling. In North American Association for Computational Linguistics (NAACL): System Demonstrations.

Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proceedings of the Third Conference on Machine Translation (WMT).

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In North American Association for Computational Linguistics (NAACL).

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding with unsupervised learning. Technical report, OpenAI.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. Technical report, OpenAI.

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Association for Computational Linguistics (ACL).

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP).

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL), pages 1715–1725.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Empirical Methods in Natural Language Processing (EMNLP).

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning (ICML).

Yu Stephanie Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xinlun Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019. ERNIE: Enhanced representation through knowledge integration. arXiv preprint arXiv:1904.09223.

Trieu H Trinh and Quoc V Le. 2018. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems.

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019b. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations (ICLR).

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sentence understanding through inference. In North American Association for Computational Linguistics (NAACL).

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. 2019. Reducing bert pre-training time from 3 days to 76 minutes. arXiv preprint arXiv:1904.00962.

Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. arXiv preprint arXiv:1905.12616.

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724.

B Pretraining Hyperparameters

Table 9 describes the hyperparameters for pretraining of RoBERTaLARGE and RoBERTaBASE.

C Finetuning Hyperparameters

Finetuning hyperparameters for RACE, SQuAD and GLUE are given in Table 10. We select the best hyperparameter values based on the median of 5 random seeds for each task.


Appendix for “RoBERTa: A Robustly
Optimized BERT Pretraining Approach”
A    Full results on GLUE
In Table 8 we present the full set of development
set results for RoBERTa. We present results for
a LARGE configuration that follows BERTLARGE,
as well as a BASE configuration that follows
BERTBASE.
                             MNLI      QNLI       QQP      RTE      SST    MRPC        CoLA   STS
RoBERTaBASE
  + all data + 500k steps     87.6      92.8      91.9     78.7     94.8     90.2      63.6   91.2
RoBERTaLARGE
  with BOOKS + WIKI           89.0      93.9      91.9     84.5     95.3     90.2      66.3   91.6
  + additional data (§3.2)    89.3      94.0      92.0     82.7     95.6     91.4      66.1   92.2
  + pretrain longer 300k      90.0      94.5      92.2     83.3     96.1     91.1      67.4   92.3
  + pretrain longer 500k      90.2      94.7      92.2     86.6     96.4     90.9      68.0   92.4

      Table 8: Development set results on GLUE tasks for various configurations of RoBERTa.




                 Hyperparam                RoBERTaLARGE            RoBERTaBASE
                 Number of Layers                   24                   12
                 Hidden size                       1024                 768
                 FFN inner hidden size            4096                 3072
                 Attention heads                    16                   12
                 Attention head size                64                   64
                 Dropout                            0.1                  0.1
                 Attention Dropout                  0.1                  0.1
                 Warmup Steps                       30k                 24k
                 Peak Learning Rate                4e-4                 6e-4
                 Batch Size                         8k                   8k
                 Weight Decay                      0.01                 0.01
                 Max Steps                         500k                 500k
                 Learning Rate Decay              Linear               Linear
                 Adam ε                            1e-6                 1e-6
                 Adam β1                            0.9                  0.9
                 Adam β2                           0.98                 0.98
                 Gradient Clipping                  0.0                  0.0

            Table 9: Hyperparameters for pretraining RoBERTaLARGE and RoBERTaBASE.
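
The schedule rows of Table 9 (warmup steps, peak learning rate, max steps, linear decay) can be read as a piecewise-linear learning rate curve. The following is a minimal sketch under two assumptions that the table does not state: warmup rises linearly from zero, and the linear decay reaches zero exactly at the final step. The dictionary keys and the names `learning_rate` and `check_architecture` are illustrative and are not taken from the RoBERTa codebase.

```python
# Values copied from Table 9; the key names are illustrative.
ROBERTA_LARGE = {
    "hidden_size": 1024, "ffn_size": 4096,
    "attention_heads": 16, "head_size": 64,
    "warmup_steps": 30_000, "peak_lr": 4e-4, "max_steps": 500_000,
}
ROBERTA_BASE = {
    "hidden_size": 768, "ffn_size": 3072,
    "attention_heads": 12, "head_size": 64,
    "warmup_steps": 24_000, "peak_lr": 6e-4, "max_steps": 500_000,
}


def learning_rate(step, cfg):
    """Linear warmup to the peak, then linear decay.

    Assumption: warmup starts at 0 and decay ends at 0 at max_steps.
    """
    warmup, peak, total = cfg["warmup_steps"], cfg["peak_lr"], cfg["max_steps"]
    if step < warmup:
        return peak * step / warmup
    remaining = max(total - step, 0)
    return peak * remaining / (total - warmup)


def check_architecture(cfg):
    """Check that the architecture rows of Table 9 are mutually consistent."""
    heads_match = cfg["attention_heads"] * cfg["head_size"] == cfg["hidden_size"]
    ffn_match = cfg["ffn_size"] == 4 * cfg["hidden_size"]
    return heads_match and ffn_match


if __name__ == "__main__":
    for name, cfg in [("LARGE", ROBERTA_LARGE), ("BASE", ROBERTA_BASE)]:
        print(name, "architecture consistent:", check_architecture(cfg))
        for step in (0, cfg["warmup_steps"], 250_000, cfg["max_steps"]):
            print(f"  step {step:>7}: lr = {learning_rate(step, cfg):.2e}")
```

In both configurations the hidden size equals the number of attention heads times the head size, and the FFN inner size is four times the hidden size, which is what `check_architecture` verifies.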




               Hyperparam               RACE       SQuAD               GLUE
               Learning Rate              1e-5      1.5e-5        {1e-5, 2e-5, 3e-5}
               Batch Size                  16         48              {16, 32}
               Weight Decay                0.1       0.01                 0.1
               Max Epochs                   4          2                  10
               Learning Rate Decay       Linear     Linear             Linear
               Warmup ratio               0.06       0.06                0.06

      Table 10: Hyperparameters for finetuning RoBERTaLARGE on RACE, SQuAD and GLUE.
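
The selection rule in Appendix C (pick the best hyperparameter values by the median over 5 random seeds) can be sketched for the GLUE column of Table 10 as follows. This is one plausible reading, not the authors' code: it assumes the grid is the full cross product of the listed learning rates and batch sizes, and that each run yields a single development-set score where higher is better. `run_finetuning` is a hypothetical callback supplied by the caller.

```python
from itertools import product
from statistics import median

# GLUE search space from Table 10; the other GLUE settings are fixed.
GLUE_GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
}
NUM_SEEDS = 5


def select_best_config(run_finetuning, grid=GLUE_GRID, num_seeds=NUM_SEEDS):
    """Return the (config, median score) pair with the highest median.

    run_finetuning(config, seed) is a hypothetical function that finetunes
    once and returns a development-set score (higher is better).
    """
    keys = list(grid)
    best_config, best_score = None, float("-inf")
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        scores = [run_finetuning(config, seed) for seed in range(num_seeds)]
        score = median(scores)  # the median is robust to a single diverged seed
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Using the median rather than the maximum means that one unusually lucky or unlucky seed does not decide which configuration is selected.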