Sources — Compute-Structure AI History

Longformer: The Long-Document Transformer

Iz Beltagy∗ Matthew E. Peters∗ Arman Cohan∗
Allen Institute for Artificial Intelligence, Seattle, WA, USA
{beltagy,matthewp,armanc}@allenai.org

arXiv:2004.05150v2 [cs.CL] 2 Dec 2020

Abstract
Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.¹

∗ Equal contribution.
¹ https://github.com/allenai/longformer

Figure 1: Runtime and memory of full self-attention and different implementations of Longformer’s self-attention; Longformer-loop is non-vectorized, Longformer-chunk is vectorized, and Longformer-cuda is a custom CUDA kernel implementation. Longformer’s memory usage scales linearly with the sequence length, unlike the full self-attention mechanism that runs out of memory for long sequences on current GPUs. Different implementations vary in speed, with the vectorized Longformer-chunk being the fastest. More details are in section 3.2.

1 Introduction
Transformers (Vaswani et al., 2017) have achieved state-of-the-art results in a wide range of natural language tasks including generative language modeling (Dai et al., 2019; Radford et al., 2019) and discriminative language understanding (Devlin et al., 2019). This success is partly due to the self-attention component which enables the network to capture contextual information from the entire sequence. While powerful, the memory and computational requirements of self-attention grow quadratically with sequence length, making it infeasible (or very expensive) to process long sequences.

To address this limitation, we present Longformer, a modified Transformer architecture with a self-attention operation that scales linearly with the sequence length, making it versatile for processing long documents (Fig 1). This is an advantage for natural language tasks such as long document classification, question answering (QA), and coreference resolution, where existing approaches partition or shorten the long context into smaller sequences that fall within the typical 512 token limit of BERT-style pretrained models. Such partitioning could potentially result in loss of important cross-partition information, and to mitigate this problem, existing methods often rely on complex architectures to address such interactions. On the other hand, our proposed Longformer is able to build contextual representations of the entire context using multiple layers of attention, reducing the
need for task-specific architectures.

Model                   attention matrix   char-LM   other tasks   pretrain
Transformer-XL (2019)   ltr                yes       no            no
Adaptive Span (2019)    ltr                yes       no            no
Compressive (2020)      ltr                yes       no            no
Reformer (2020)         sparse             yes       no            no
Sparse (2019)           sparse             yes       no            no
Routing (2020)          sparse             yes       no            no
BP-Transformer (2019)   sparse             yes       MT            no
Blockwise (2019)        sparse             no        QA            yes
Our Longformer          sparse             yes       multiple      yes

Table 1: Summary of prior work on adapting Transformers for long documents. ltr: left-to-right.

Recent work has addressed the computational inefficiency of Transformers on long sequences (see Tab. 1). However, they primarily focus on autoregressive language modeling (LM), while the application of long document transformers to document-level NLP tasks in the transfer learning setting (Dai and Le, 2015; Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019) has remained largely unexplored. We address this gap and show that Longformer’s attention mechanism can act as a drop-in replacement for the self-attention mechanism in pretrained Transformers, and leads to gains across a suite of document NLP tasks.

Longformer’s attention mechanism is a combination of a windowed local-context self-attention and an end task motivated global attention that encodes inductive bias about the task. Through ablations and controlled trials we show both attention types are essential – the local attention is primarily used to build contextual representations, while the global attention allows Longformer to build full sequence representations for prediction.

We first evaluate Longformer on autoregressive character-level language modeling using a combination of windowed and a new dilated attention pattern, allowing the model to process sequences of up to 32K characters on modern GPUs. We achieve state-of-the-art results on text8 and enwik8 benchmark datasets, demonstrating the effectiveness of Longformer in long document modeling.

Then, to evaluate Longformer’s ability to replace the full self-attention operation of existing pretrained models, we pretrain it with the masked language modeling (MLM) objective, continuing from the RoBERTa (Liu et al., 2019) released checkpoint. After pretraining, we apply it to downstream language tasks through finetuning and demonstrate that Longformer consistently outperforms RoBERTa on a wide range of document-level natural language tasks including text classification, QA, and coreference resolution, achieving state-of-the-art results on two of these datasets.

Finally, we introduce a variant of Longformer that, instead of an encoder-only Transformer architecture, follows an encoder-decoder architecture similar to the original Transformer model (Vaswani et al., 2017) and is intended for sequence-to-sequence (seq2seq) learning (Sutskever et al., 2014). We call this model Longformer-Encoder-Decoder (LED); it uses Longformer’s efficient attention pattern on the encoder network, allowing it to address long document seq2seq tasks such as summarization. We demonstrate the effectiveness of LED on the arXiv summarization dataset (Cohan et al., 2018).

2 Related Work
Long-Document Transformers Tab. 1 summarizes recent prior work on long documents. Two types of self-attention approaches have been explored. The first is a left-to-right (ltr) approach that processes the document in chunks moving from left-to-right. While such models have been successful in autoregressive language modeling, they are unsuitable for transfer learning approaches with tasks that benefit from bidirectional context.

Our work falls within the other general approach that defines some form of sparse attention pattern and avoids computing the full quadratic attention matrix multiplication. The model with the most similar attention pattern to ours is Sparse Transformer (Child et al., 2019), which uses a form of dilated sliding window of blocks of size 8x8 provided by BlockSparse (Gray et al., 2017). Our implementation (§3) also includes a custom CUDA kernel, but it is more flexible and maintainable than BlockSparse which is implemented in C++, and designed for a specific version of TensorFlow. We also introduce additional task motivated global attention patterns suitable for common NLP tasks (§3) and show they are essential for good performance in the transfer learning setting.

A few models tried tasks other than autoregressive language modeling, which is a step forward because arguably focusing on language modeling as the primary evaluation has led to the development of models with limited applicability. BP-Transformer (Ye et al., 2019) evaluated on machine

Figure 2: Comparing the full self-attention pattern and the configuration of attention patterns in our Longformer. Panels: (a) full n² attention; (b) sliding window attention; (c) dilated sliding window; (d) global + sliding window.
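
The four patterns in Figure 2 can be pictured as boolean masks over query/key pairs. The following is a minimal NumPy sketch, not the authors' implementation: the helper names `sliding_window_mask` and `add_global`, the toy sizes, and the convention that a window of size w covers w // 2 keys on each side are assumptions made for illustration. It materialises the full n × n mask, which is precisely what an efficient implementation avoids.

```python
import numpy as np

def sliding_window_mask(n, w, d=1):
    """Boolean n x n mask; entry [i, j] is True where query i may attend to key j.

    w is the window size (assumed: w // 2 keys on each side of the query)
    and d is the dilation; d=1 means no gaps between attended keys.
    """
    idx = np.arange(n)
    offset = idx[None, :] - idx[:, None]        # j - i for every (i, j) pair
    within = np.abs(offset) <= (w // 2) * d     # inside the (dilated) span
    on_grid = offset % d == 0                   # keep only every d-th position
    return within & on_grid

def add_global(mask, global_idx):
    """Symmetric global attention on the chosen positions (hypothetical helper)."""
    out = mask.copy()
    out[global_idx, :] = True   # global tokens attend to every token
    out[:, global_idx] = True   # every token attends to the global tokens
    return out

n, w = 16, 4                                    # toy sizes, chosen for illustration
full = np.ones((n, n), dtype=bool)              # (a) full n^2 attention
local = sliding_window_mask(n, w)               # (b) sliding window
dilated = sliding_window_mask(n, w, d=2)        # (c) dilated sliding window
combined = add_global(local, [0])               # (d) global + sliding window

for name, m in [("full", full), ("local", local),
                ("dilated", dilated), ("global+local", combined)]:
    print(f"{name:13s} attended pairs: {int(m.sum())}")
```

In this sketch the dilated mask keeps the same number of keys per query as the plain window but spreads them over twice the distance, and the global row and column add a number of pairs proportional to n.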

translation (MT), but didn’t explore the pretrain-finetune setting. Blockwise attention (Qiu et al., 2019) pretrained their models and evaluated on question answering (QA). However, the evaluation is limited as it doesn’t include language modeling, and the QA datasets are of relatively short documents,² therefore the effectiveness of this model on long document tasks remains unexplored.

Task-specific Models for Long Documents Many task-specific approaches have been developed to work around the 512 limit of pretrained transformer models like BERT. The simplest approach just truncates the document, commonly used for classification (Xie et al., 2019). Another approach chunks the document into chunks of length 512 (could be overlapping), processes each chunk separately, then combines the activations with a task specific model (Joshi et al., 2019). A third approach popular for multihop and open domain QA tasks uses a two-stage model where the first stage retrieves relevant documents that are passed onto the second stage for answer extraction (Clark and Gardner, 2017; Chen et al., 2017). All of these approaches suffer from information loss due to truncation or cascading errors from the two stage approach. In contrast, Longformer can process long sequences without truncating or chunking, allowing us to adopt a much simpler approach that concatenates the available context and processes it in a single pass.

A few contemporaneous works³ have explored similar ideas to Longformer using local + global attention in Transformers, and pre-training it for long document natural language tasks. In particular, ETC (Ainslie et al., 2020) uses a similar local + global attention instead of full self-attention to scale Transformers to long documents. Different from Longformer, ETC uses relative position embeddings (which we only used for the Autoregressive LM setting), introduces an additional training objective (CPC loss) for pre-training, and configures global attention in a slightly different way. It shows strong results on several tasks including reading comprehension and classification. GMAT (Gupta and Berant, 2020) uses a similar idea of few global locations in the input serving as global memory. BigBird (Zaheer et al., 2020) is an extension over ETC with evaluation on additional tasks, including summarization. Importantly, through theoretical analysis, BigBird shows that sparse Transformers are universal approximators of sequence functions and preserve these properties of the full self-attention.

3 Longformer
The original Transformer model has a self-attention component with O(n²) time and memory complexity where n is the input sequence length. To address this challenge, we sparsify the full self-attention matrix according to an “attention pattern” specifying pairs of input locations attending to one another. Unlike the full self-attention, our proposed attention pattern scales linearly with the input sequence, making it efficient for longer sequences. This section discusses the design and implementation of this attention pattern.

3.1 Attention Pattern
Sliding Window Given the importance of local context (Kovaleva et al., 2019), our attention pattern employs a fixed-size window attention surrounding each token. Using multiple stacked layers of such windowed attention results in a large receptive field, where top layers have access to all input locations and have the capacity to build representations that incorporate information across the entire input, similar to CNNs (Wu et al., 2019).
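
To see how stacking windowed layers widens the receptive field, one can propagate reachability through the window mask layer by layer. This is an illustrative NumPy sketch under assumed toy sizes, not code from the paper; `window_mask` and `receptive_field` are hypothetical helper names, and a window of size w is taken to cover w // 2 tokens on each side.

```python
import numpy as np

def window_mask(n, w):
    """True where token i attends to token j under a window of size w."""
    idx = np.arange(n)
    return np.abs(idx[None, :] - idx[:, None]) <= w // 2

def receptive_field(mask, layers):
    """Boolean matrix: which input positions can influence each output
    position after `layers` stacked layers that all use `mask`."""
    reach = np.eye(mask.shape[0], dtype=bool)   # zero layers: a token sees itself
    for _ in range(layers):
        # a token reaches whatever the tokens it attends to already reach
        reach = (mask.astype(int) @ reach.astype(int)) > 0
    return reach

n, w = 64, 8                                    # toy sizes, chosen for illustration
mask = window_mask(n, w)
centre = n // 2
for layers in (1, 2, 4, 8):
    span = int(receptive_field(mask, layers)[centre].sum())
    print(f"layers={layers}: token {centre} sees {span} of {n} positions")
```

In this toy setting the span seen by a token away from the boundaries grows by w positions per layer until it covers the whole input, while each layer still touches only about w keys per query.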
² SQuAD contexts typically fit within the 512 limit, and MRQA is constructed by dropping long-document examples.
³ All were published on arXiv after Longformer.

Given a fixed window size w, each token attends to ½w tokens on each side (Fig. 2b). The computation complexity of this pattern is O(n × w), which scales linearly with input sequence length n. In a transformer with ℓ layers, the receptive field size at the top layer is ℓ × w (assuming w is fixed for all layers). Depending on the application, it might be helpful to use different values of w for each layer to balance between efficiency and model representation capacity (§4.1).

Dilated Sliding Window To further increase the receptive field without increasing computation, the sliding window can be “dilated”. This is analogous to dilated CNNs (van den Oord et al., 2016) where the window has gaps of size dilation d (Fig. 2c). Assuming a fixed d and w for all layers, the receptive field is ℓ × d × w, which can reach tens of thousands of tokens even for small values of d.

In multi-headed attention, each attention head computes a different attention score. We found settings with different dilation configurations per head improve performance by allowing some heads without dilation to focus on local context, while others with dilation focus on longer context.

Global Attention In state-of-the-art BERT-style models for natural language tasks, the optimal input representation differs from language modeling and varies by task. For masked language modeling (MLM), the model uses local context to predict the masked word, while for classification, the model aggregates the representation of the whole sequence into a special token ([CLS] in case of BERT). For QA, the question and document are concatenated, allowing the model to compare the question with the document through self-attention.

In our case, the windowed and dilated attention are not flexible enough to learn task-specific representations. Accordingly, we add “global attention” on few pre-selected input locations. Importantly, we make this attention operation symmetric: that is, a token with a global attention attends to all tokens across the sequence, and all tokens in the sequence attend to it. Fig. 2d shows an example of a sliding window attention with global attention at a few tokens at custom locations. For example for classification, global attention is used for the [CLS] token while in QA global attention is provided on all question tokens. Since the number of such tokens is small relative to and independent of n, the complexity of the combined local and global attention is still O(n). While specifying global attention is task specific, it is an easy way to add inductive bias to the model’s attention, and it is much simpler than existing task specific approaches that use complex architecture to combine information across smaller input chunks.

Linear Projections for Global Attention Recall that given the linear projections Q, K, V, the Transformer model (Vaswani et al., 2017) computes attention scores as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{1}$$

We use two sets of projections, Q_s, K_s, V_s to compute attention scores of sliding window attention, and Q_g, K_g, V_g to compute attention scores for the global attention. The additional projections provide flexibility to model the different types of attention, which we show is critical for best performance on downstream tasks. Q_g, K_g, V_g are all initialized with values that match Q_s, K_s, V_s.

3.2 Implementation
In regular transformers, attention scores are computed as in Eqn. 1. The expensive operation is the matrix multiplication QK^T because both Q and K have n (sequence length) projections. For Longformer, the dilated sliding window attention computes only a fixed number of the diagonals of QK^T. As shown in Fig. 1, this results in a linear increase in memory usage compared to quadratic increase for full self-attention. However, implementing it requires a form of banded matrix multiplication that is not supported in existing deep learning libraries like PyTorch/Tensorflow. Fig. 1 compares the performance of three different ways of implementing it: loop is a memory efficient PyTorch implementation that supports dilation but is unusably slow and only used for testing; chunks only supports the non-dilated case and is used for the pretraining/finetuning setting; and cuda is our fully functioning highly optimized custom CUDA kernel implemented using TVM (Chen et al., 2018) and used for the language modeling experiments (see Appendix A for more details).

4 Autoregressive Language Modeling
Autoregressive or left-to-right language modeling is loosely defined as estimating the probability distribution of an existing token/character given its previous tokens/characters in an input sequence. This task is considered one of the fundamental tasks in natural language and recent prior work on modeling long sequences using transformers has relied

on this task as their primary evaluation (Dai et al., 2019; Rae et al., 2020; Sukhbaatar et al., 2019). Similarly, we develop and evaluate our model on autoregressive language modeling.

4.1 Attention Pattern
For autoregressive language modeling we use our dilated sliding window attention. Following Sukhbaatar et al. (2019) we use differing window sizes across the layers. In particular, we use small window sizes for the lower layers and increase window sizes as we move to higher layers. This allows the top layers to learn higher-level representation of the entire sequence while having the lower layers capture local information. In addition, it provides balance between efficiency (smaller window sizes are less computationally expensive due to fewer nonzero values) and performance (larger window sizes have richer representation power and often result in performance improvements).

We do not use dilated sliding windows for lower layers to maximize their capacity to learn and utilize the immediate local context. For the higher layers, we use a small amount of increasing dilation only on 2 heads. This gives the model the ability to directly attend to distant tokens without sacrificing local context.

4.2 Experiment Setup
To compare to prior work we focus on character-level LM (text8 and enwik8; Mahoney, 2009).

Training Ideally, we would like to train our model on the largest window size and sequence length we can fit in a modern GPU memory. However, we found that the model needs a large number of gradient updates to learn the local context first, before learning to utilize longer context. To accommodate this, we adopt a staged training procedure where we increase the attention window size and sequence length across multiple training phases. In particular, in the first phase we start with a short sequence length and window size, then on each subsequent phase, we double the window size and the sequence length, and halve the learning rate. This makes training fast, while keeping the slow part (longest sequences and window sizes) to the end. We train the model over 5 total phases with starting sequence length of 2,048 and ending sequence length of 23,040 on the last phase (see Appendix B for detailed configurations of each phase, and for all other hyperparameters).

Evaluation We evaluate with sequences of length 32,256. Following Dai et al. (2019), we split the dataset into overlapping sequences of size 32,256 with a step of size 512, and report the performance on the last 512 tokens of the sequence.

4.2.1 Results
Tab. 2 and 3 summarize evaluation results on text8 and enwik8 datasets. We achieve a new state-of-the-art on both text8 and enwik8 using the small models with BPC of 1.10 and 1.00 on text8 and enwik8 respectively, demonstrating the effectiveness of our model.

Model                                #Param   Dev    Test
Dataset text8
T12 (Al-Rfou et al., 2018)           44M      -      1.18
Adaptive (Sukhbaatar et al., 2019)   38M      1.05   1.11
BP-Transformer (Ye et al., 2019)     39M      -      1.11
Our Longformer                       41M      1.04   1.10
Dataset enwik8
T12 (Al-Rfou et al., 2018)           44M      -      1.11
Transformer-XL (Dai et al., 2019)    41M      -      1.06
Reformer (Kitaev et al., 2020)       -        -      1.05
Adaptive (Sukhbaatar et al., 2019)   39M      1.04   1.02
BP-Transformer (Ye et al., 2019)     38M      -      1.02
Our Longformer                       41M      1.02   1.00

Table 2: Small model BPC on text8 & enwik8

Model                                #Param   Test BPC
Transformer-XL (18 layers)           88M      1.03
Sparse (Child et al., 2019)          ≈100M    0.99
Transformer-XL (24 layers)           277M     0.99
Adaptive (Sukhbaatar et al., 2019)   209M     0.98
Compressive (Rae et al., 2020)       277M     0.97
Routing (Roy et al., 2020)           ≈223M    0.99
Our Longformer                       102M     0.99

Table 3: Performance of large models on enwik8

For large models, given how expensive these experiments are, and following recent work (Kitaev et al., 2020; Rae et al., 2020), we are only evaluating on enwik8. Tab. 3 shows that Longformer outperforms the comparable Transformer-XL model, matches the performance of the comparable Sparse Transformer (Child et al., 2019), and matches or slightly underperforms recent models that have more than twice the number of parameters. It is worth noting that Adaptive Span (Sukhbaatar et al., 2019) and Compressive Transformer (Rae et al., 2020) are not a good fit for the pretraining-finetuning paradigm as discussed in §2.
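
The staged training rule from the Training paragraph (double the window size and sequence length, halve the learning rate at each phase) can be written out as a small schedule generator. This is only a sketch of that rule: the starting window size and learning rate below are placeholder values, and `Phase` and `staged_schedule` are hypothetical names. Note that strict doubling from 2,048 over five phases ends at 32,768, whereas the text reports a final sequence length of 23,040, so the per-phase values actually used (Appendix B) evidently deviate from pure doubling.

```python
from dataclasses import dataclass

@dataclass
class Phase:
    index: int
    seq_len: int
    window: int
    lr: float

def staged_schedule(num_phases, seq_len, window, lr):
    """Apply the stated rule: each phase doubles seq_len and window, halves lr."""
    phases = []
    for i in range(num_phases):
        phases.append(Phase(i + 1, seq_len, window, lr))
        seq_len *= 2
        window *= 2
        lr /= 2
    return phases

# window=32 and lr=2.5e-4 are assumed placeholder values, not taken from the text
schedule = staged_schedule(num_phases=5, seq_len=2048, window=32, lr=2.5e-4)
for phase in schedule:
    print(phase)
```

Keeping the longest sequences for the last phase means most gradient updates are taken on short, cheap inputs, which is the motivation the text gives for the procedure.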

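
The evaluation protocol of §4.2 (overlapping sequences, a fixed step, scoring only the final tokens of each sequence) is essentially index arithmetic. The sketch below assumes a hypothetical helper `evaluation_windows` and does not say how the very first tokens of the dataset, which never fall in a scored tail here, are handled; that detail is not specified in the text.

```python
def evaluation_windows(num_tokens, seq_len=32256, step=512):
    """Return (start, end, score_from) triples.

    Each window covers tokens [start, end); only the last `step` tokens,
    i.e. [score_from, end), contribute to the reported metric, so every
    scored token is predicted with up to seq_len - step tokens of context.
    """
    windows = []
    start = 0
    while start + seq_len <= num_tokens:
        end = start + seq_len
        windows.append((start, end, end - step))
        start += step
    return windows

# toy sizes so the overlap is easy to inspect
for start, end, score_from in evaluation_windows(num_tokens=20, seq_len=8, step=4):
    print(f"context [{start}, {end})  scored [{score_from}, {end})")
```

With a step equal to the scored span, the scored regions tile the data without overlap, so each token is counted once.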
Model                           Dev BPC
Decreasing w (from 512 to 32)   1.24
Fixed w (= 230)                 1.23
Increasing w (from 32 to 512)   1.21

No Dilation                     1.21
Dilation on 2 heads             1.20

Table 4: Top: changing window size across layers. Bottom: with/without dilation (@ 150K steps on phase1)

Model                                       base     large
RoBERTa (seqlen: 512)                       1.846    1.496
Longformer (seqlen: 4,096)                  10.299   8.738
+ copy position embeddings                  1.957    1.597
+ 2K gradient updates                       1.753    1.414
+ 65K gradient updates                      1.705    1.358
Longformer (train extra pos. embed. only)   1.850    1.504

Table 5: MLM BPC for RoBERTa and various pretrained Longformer configurations.
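
The tables report bits per character (BPC). As a reminder of the unit, BPC is the average negative log-likelihood per predicted symbol expressed in base 2, so a loss accumulated in nats is divided by ln 2. The function name below is hypothetical, and whether a given training toolkit logs nats or bits is something to check rather than assume.

```python
import math

def bits_per_character(total_nll_nats, num_chars):
    """Convert a summed negative log-likelihood (in nats) to bits per character."""
    return total_nll_nats / (num_chars * math.log(2))

# sanity checks on models with known per-character probabilities
n_chars = 1000
coin_flip = -n_chars * math.log(0.5)        # probability 1/2 for every character
uniform_256 = -n_chars * math.log(1 / 256)  # uniform over 256 byte values
print(bits_per_character(coin_flip, n_chars))
print(bits_per_character(uniform_256, n_chars))
```

Lower is better: a value near 1.0 means the model is, on average, about as uncertain as a fair coin flip per character.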

4.2.2 Ablation Study
To show the importance of the design choices of our attention patterns, we tried different variants and report their controlled experiment results. To make the ablation study more manageable, we train each configuration for 150K steps⁴ with phase 1 configuration on a small model on text8, then report the BPC performance on the dev set.

⁴ One caveat is that the ordering of end performance will not agree with that at step 150K. However, this approximation saves the huge cost of running every experiment to completion.

The top of Tab. 4 demonstrates the impact of different ways of configuring the window sizes per layer. We observe that increasing the window size from the bottom to the top layer leads to the best performance, arranging them in the reverse way leads to worse performance, and using a fixed window size (the average of window sizes of the other configuration) leads to performance that is in between. The bottom of Tab. 4 shows the impact of adding dilation. Adding some dilation to two heads leads to some improvement compared with no dilation at all.

5 Pretraining and Finetuning
Current state-of-the-art systems for many NLP tasks finetune a pretrained model with task supervision (e.g. BERT). One of our main motivations is to develop such a model suitable for long document tasks. To do so, we pretrained Longformer on a document corpus and finetune it for six tasks, including classification, QA and coreference resolution. The resulting model can process sequences up to 4,096 tokens long (8 times longer than BERT).⁵

⁵ Sequences up to 16K are possible on current GPUs.

We pretrain Longformer with masked language modeling (MLM), where the goal is to recover randomly masked tokens in a sequence. Since MLM pretraining is expensive, we continue pretraining from the RoBERTa (Liu et al., 2019) released checkpoint, while only making the minimal changes necessary to support Longformer’s attention mechanism. Note that our attention pattern can be plugged into any pretrained transformer model without the need to change the model architecture.

Attention Pattern We use sliding window attention with window size of 512, therefore using the same amount of computation as RoBERTa.⁶

⁶ Adding dilation on a few heads as in §4.1 hurt performance, likely because it is not compatible with the pretrained RoBERTa weights. Retraining such model from scratch might be needed to improve performance.

Position Embeddings RoBERTa uses learned absolute position embeddings with the maximum position being 512. To support longer documents, we add extra position embeddings to support up to position 4,096. To leverage RoBERTa’s pretrained weights, instead of randomly initializing the new position embeddings, we initialize them by copying the 512 position embeddings from RoBERTa multiple times as analysis of BERT’s attention heads shows a strong learned bias to attending to local context, including the previous or next token (Clark et al., 2019). Using the copy initialization preserves this local structure everywhere except at the partition boundaries. Despite its simplicity, we found this to be very effective (see Tab. 5), allowing Longformer pretraining to rapidly converge with a small number of gradient updates.

Continued MLM Pretraining We pretrain Longformer using fairseq (Ott et al., 2019) on a corpus of long documents that we compiled (see Appendix C for corpus details). We train two model sizes, a base model and a large model. Both models are trained for 65K gradient updates with sequence length 4,096, batch size 64 (2^18 tokens), maximum learning rate of 3e-5, linear warmup of 500 steps, followed by a power 3 polynomial decay. The rest of the hyperparameters are the same as RoBERTa.

Tab. 5 shows the BPC on the development set of our training corpus. The first row shows a 1.846

BPC using RoBERTa-base, which is comparable to the 1.880 BPC reported on the RoBERTa paper on their corpus. This indicates our training corpus is from a distribution close to that used to train RoBERTa. The following two rows show the performance of Longformer before pretraining with randomly initialized position embeddings and with copied position embeddings. The significant difference indicates the importance of the copy initialization, and the relatively small difference between the RoBERTa BPC and the initialized BPC indicates that our sliding window attention is working well with the RoBERTa weights. The following two rows show the impact of continuing pretraining. Training for 2K steps improves BPC from 1.957 to 1.753, which further decreases to 1.705 after 65K steps, demonstrating the model is learning to better utilize the sliding window attention and longer context. Similar patterns are observed with RoBERTa-large and Longformer-large.

Frozen RoBERTa Weights We also pretrained Longformer while freezing all RoBERTa weights, and only training the new position embeddings. The motivation for this configuration is to perfectly preserve the RoBERTa performance on short documents. This configuration has a BPC of 1.850 (down from 1.957 at initialization), but higher than 1.705 where all the weights are trainable.

6 Tasks
We apply Longformer to multiple long document tasks, including QA, coreference resolution and classification. Tab. 6 shows the evaluation datasets have contexts significantly longer than 512 wordpieces. Our primary goal is to evaluate whether our attention mechanism can act as a replacement for the standard self-attention mechanism in BERT style models, and to perform controlled trials against a strong baseline. We are also interested in evaluating whether we can replace complicated task specific models necessitated by BERT’s limited context with simpler models that just concatenate all available context into a single sequence.

Wordpieces   WH      TQA      HQA     ON      IMDB   HY
avg.         1,535   6,589    1,316   506     300    705
95th pctl.   3,627   17,126   1,889   1,147   705    1,975

Table 6: Average and 95th percentile of context length of datasets in wordpieces. WH: WikiHop, TQA: TriviaQA, HQA: HotpotQA, ON: OntoNotes, HY: Hyperpartisan news

Our baseline is a RoBERTa based model that breaks the context into the longest possible segment, passes each individually through RoBERTa, and concatenates the activations for further processing. For QA tasks, we also concatenate the question to each segment so that RoBERTa can condition its contextual representations of the context on the question. The Longformer variant replaces the RoBERTa self-attention mechanism with our windowed attention used during pretraining, plus a task motivated global attention. The global attention uses additional linear projections (§3.1).

6.1 Question answering
We used three datasets: WikiHop (Welbl et al., 2018), TriviaQA (Joshi et al., 2017, Wikipedia setting), and HotpotQA (Yang et al., 2018, distractor setting).⁷

⁷ We use the full version of TriviaQA and HotpotQA, not the simplified versions in MRQA (Fisch et al., 2019).

For WikiHop and TriviaQA we follow the simple QA model of BERT (Devlin et al., 2019), and concatenate question and documents into one long sequence, run it through Longformer, then have a dataset-specific prediction layer. WikiHop uses a classification layer for the candidate while TriviaQA uses the loss function of Clark and Gardner (2017) to predict answer span. We include global attention to question tokens and answer candidates for WikiHop and to question tokens for TriviaQA.

HotpotQA is a multihop QA dataset that involves extracting answer spans and evidence sentences from 10 Wikipedia paragraphs, 2 of which are relevant and the rest are distractors. We use a two-stage model that first selects the most relevant paragraphs then passes them to a second stage for answer extraction. Both stages concatenate question and context into one sequence, run it through Longformer, then use task-specific prediction layers. We train the models in a multi-task way to predict relevant paragraphs, evidence sentences, answer spans and question types (yes/no/span) jointly. Note that this model is simpler than recent SOTA models that include complex task-specific architectures (e.g., (Tu et al., 2019; Chen et al., 2019; Tu et al., 2020; Groeneveld et al., 2020)). See Appendix D for further details about the models and hyperparameters.

6.2 Coreference Resolution
We use OntoNotes (Pradhan et al., 2012), and the model from Joshi et al. (2019), a modification of

                  QA                              Coref.      Classification
Model             WikiHop   TriviaQA   HotpotQA   OntoNotes   IMDB   Hyperpartisan
RoBERTa-base      72.4      74.3       63.5       78.4        95.3   87.4
Longformer-base   75.0      75.2       64.4       78.6        95.7   94.8

Table 7: Summary of finetuning results on QA, coreference resolution, and document classification. Results are on the development sets comparing our Longformer-base with RoBERTa-base. TriviaQA, Hyperpartisan metrics are F1, WikiHop and IMDB use accuracy, HotpotQA is joint F1, OntoNotes is average F1.

the system from Lee et al. (2018) to replace ELMo with BERT. The Longformer system is a straightforward adaptation of the baseline model by replacing RoBERTa with Longformer and extending the sequence length. We didn’t use global attention for this task.

6.3 Document Classification
We evaluate on IMDB (Maas et al., 2011) and Hyperpartisan news detection (Kiesel et al., 2019) datasets.⁸ IMDB is a standard sentiment classification dataset consisting of movie reviews. While most documents in this dataset are short, about 13.6% of them are larger than 512 wordpieces (Tab. 6). Documents in Hyperpartisan are relatively long, and it is small with only 645 documents making it a good test for Longformer’s ability to adapt to limited data. We use global attention on the [CLS] token.

Model              WikiHop   TriviaQA   HotpotQA
Current∗ SOTA      78.3      73.3       74.2
Longformer-large   81.9      77.3       73.2

Table 8: Leaderboard results of Longformer-large at time of submission (May 2020). All numbers are F1 scores.

found that the distance between any two mentions is typically quite small so that a baseline that processes smaller chunks separately is able to stitch together mentions into coreference chains without considering cross chunk interactions.

6.4 Results
Longformer-large for QA We also evaluate the
Main Result Tab. 7 summarizes the results of all performance of Longformer-large on long context
our finetuning experiments. We observe that Long- QA tasks. Tab. 8 shows that our Longformer-large
former consistently outperforms the RoBERTa achieves new state-of-the-art results9 on WikiHop
baseline. Its performance gain is especially ob- and TriviaQA by large margins (3.6 and 4 points
vious for tasks that require long context such as respectively), and for HotpotQA, it underperforms
WikiHop and Hyperpartisan. For TriviaQA, the the current state-of-the-art (Fang et al., 2020) by
improvement is more modest as the local context a point. Tab. 9 shows the detailed results of Hot-
is often sufficient to answer the question. In the potQA compared with published and unpublished
case of HotpotQA, the supporting fact auxiliary concurrent models. Longformer places second
supervision allows models to easily find relevant on the published leaderboard, outperforming all
contexts and then focus on local context, leading to other published results except for HGN (Fang et al.,
smaller gains. This is contrasted with WikiHop that 2020). All published top performing models in
only includes distant supervision of intermediate this task (Tu et al., 2019; Fang et al., 2020; Shao
reasoning chains, where our approach excels by et al., 2020) use GNNs (Kipf and Welling, 2017)
reasoning over the entire context. On the IMDB or graph network of entities, which seem to encode
and OntoNotes datasets the performance gains are an important inductive bias for the task and can po-
smaller. For IMDB, the majority of the dataset tentially improve our results further. Nevertheless,
consists of short documents and thus it is expected Longformer performs strongly outperforming all
to see smaller improvements. For OntoNotes, we other methods including the recent non-GNN meth-
8
For Hyperpartisan we split the training data into 80/10/10 ods (Glaß et al., 2019; Shao et al., 2020; Groen-
train/dev/test sets, and report mean F1 across five seeds. eveld et al., 2020).
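
Section 6.3 places global attention on the [CLS] token, on top of Longformer's local windowed attention. As a rough picture of what such a pattern looks like, the pure-Python sketch below builds a boolean mask in which each token sees a window of neighbours and a few chosen positions are global. The function names, the dense n-by-n mask (built here only for inspection, and itself quadratic in size), and the assumption that global attention is symmetric (global positions attend everywhere and are attended to from everywhere) are choices made for this illustration and are not taken from the released code.

```python
def local_global_mask(seq_len, window, global_positions):
    """Return a seq_len x seq_len list of bools; True where query i may attend to key j."""
    half = window // 2                      # neighbours on each side of a token
    global_set = set(global_positions)
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            in_window = abs(i - j) <= half  # local sliding-window attention
            is_global = i in global_set or j in global_set  # symmetric global attention
            row.append(in_window or is_global)
        mask.append(row)
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that are allowed by the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    # Position 0 plays the role of a [CLS] token with global attention.
    m = local_global_mask(seq_len=8, window=2, global_positions=[0])
    for row in m:
        print("".join("x" if allowed else "." for allowed in row))
    print("allowed pairs:", attended_pairs(m))
```

For a fixed window and a fixed number of global positions, the number of allowed pairs grows in proportion to the sequence length, whereas full self-attention allows every one of the n² pairs.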

9 At submission time, May 2020. Later, BigBird (Zaheer et al., 2020) improved leaderboard results on these datasets. There are confounding factors such as using 16X more compute in BigBird’s pretraining compared with Longformer, potentially affecting the performance.

Model                                    ans.   supp.  joint
TAP 2 (ensemble) (Glaß et al., 2019)     79.8   86.7   70.7
SAE (Tu et al., 2019)                    79.6   86.7   71.4
Quark (dev) (Groeneveld et al., 2020)    81.2   87.0   72.3
C2F Reader (Shao et al., 2020)           81.2   87.6   72.8
Longformer-large                         81.3   88.3   73.2
ETC-large† (Ainslie et al., 2020)        81.2   89.1   73.6
GSAN-large†                              81.6   88.7   73.9
HGN-large (Fang et al., 2020)            82.2   88.5   74.2

Table 9: HotpotQA results in distractor setting test set. Quark’s test results are not available. All numbers are F1 scores. † shows contemporaneous leaderboard submissions.

Model                                               Accuracy / ∆
Longformer (seqlen: 4,096)                          73.8
RoBERTa-base (seqlen: 512)                          72.4 / -1.4
Longformer (seqlen: 4,096, 15 epochs)               75.0 / +1.2
Longformer (seqlen: 512, attention: n^2)            71.7 / -2.1
Longformer (seqlen: 2,048)                          73.1 / -0.7
Longformer (no MLM pretraining)                     73.2 / -0.6
Longformer (no linear proj.)                        72.2 / -1.6
Longformer (no linear proj. no global atten.)       65.5 / -8.3
Longformer (pretrain extra position embed. only)    73.5 / -0.3

Table 10: WikiHop development set ablations

6.5 Ablations on WikiHop
Tab. 10 presents an ablation study for WikiHop on the development set. All results use Longformer-base, fine-tuned for five epochs with identical hyperparameters except where noted. Longformer benefits from longer sequences, global attention, separate projection matrices for global attention, MLM pretraining, and longer training. In addition, when configured as in RoBERTa-base (seqlen: 512, and n^2 attention) Longformer performs slightly worse than RoBERTa-base, confirming that performance gains are not due to additional pretraining. Performance drops slightly when pretraining unfreezes only the additional position embeddings of the RoBERTa model, showing that Longformer can learn to use long range context in task specific fine-tuning with large training datasets such as WikiHop.

7 Longformer-Encoder-Decoder (LED)
The original Transformer (Vaswani et al., 2017) consisted of an encoder-decoder architecture, intended for sequence-to-sequence tasks (Sutskever et al., 2014), such as summarization and translation. While encoder-only Transformers are effective on a variety of NLP tasks, pre-trained encoder-decoder Transformer models (e.g. BART (Lewis et al., 2020) and T5 (Raffel et al., 2020)) have achieved strong results on tasks like summarization. Yet, such models can’t efficiently scale to seq2seq tasks with longer inputs.

To facilitate modeling long sequences for seq2seq learning, we propose a Longformer variant that has both the encoder and decoder Transformer stacks but instead of the full self-attention in the encoder, it uses the efficient local+global attention pattern of the Longformer. The decoder uses the full self-attention to the entire encoded tokens and to previously decoded locations. We call this model Longformer-Encoder-Decoder (LED) which scales linearly with the input. Since pre-training LED is expensive, we initialize LED parameters from BART, and follow BART’s exact architecture in terms of number of layers and hidden sizes. The only difference is that to process longer inputs, we extend position embedding to 16K tokens (up from BART’s 1K tokens) and we initialize the new position embedding matrix by repeatedly copying BART’s 1K position embeddings 16 times as in Section 5 for RoBERTa. Following BART, we release two model sizes, LED-base and LED-large, which respectively have 6 and 12 layers in both encoder and decoder stacks.

We evaluate LED on the summarization task using the arXiv summarization dataset (Cohan et al., 2018) which focuses on long document summarization in the scientific domain. The 90th percentile of document lengths is 14.5K tokens, making it an appropriate testbed for evaluating LED. LED’s encoder reads the document and its decoder generates the output summary. The encoder uses local attention with window size 1,024 tokens and global attention on the first <s> token. The decoder uses full attention to the entire encoder and previously decoded locations. As standard in seq2seq models, LED is trained using teacher forcing on gold training summaries and uses beam search at inference.

Tab. 11 demonstrates the results of LED-large 16K on the arXiv summarization task. This model is merely initialized from BART, with no additional

pre-training. We observe that LED achieves state-of-the-art results on arXiv, slightly outperforming BigBird (Zaheer et al., 2020). Note that the BigBird summarization model supports sequence length of 4K tokens but starts from and continues pre-training Pegasus (Zhang et al., 2020), a model specifically designed and pre-trained for summarization. With no pre-training or task-specific initialization, but with ability to process longer inputs, LED can slightly outperform BigBird. Further improvements should be possible through pre-training of LED. Fig. 3 further illustrates the importance of sequence length, showing that the ability to process longer input significantly improves the results.

                                     R-1     R-2     R-L
Discourse-aware (2018)               35.80   11.05   31.80
Extr-Abst-TLM (2020)                 41.62   14.69   38.03
Dancer (2020)                        42.70   16.54   38.44
Pegasus (2020)                       44.21   16.95   38.83
LED-large (seqlen: 4,096) (ours)     44.40   17.94   39.76
BigBird (seqlen: 4,096) (2020)       46.63   19.02   41.77
LED-large (seqlen: 16,384) (ours)    46.63   19.62   41.83

Table 11: Summarization results of Longformer-Encoder-Decoder (LED) on the arXiv dataset. Metrics from left to right are ROUGE-1, ROUGE-2 and ROUGE-L.

[Figure 3 plot: x-axis is input size (1K, 4K, 16K); series R1 with labeled values 35.21, 44.48, 46.23 and series R2 with labeled values 11.54, 17.99, 19.62.]

Figure 3: ROUGE-1 and ROUGE-2 of LED when varying the input size (arXiv validation set).

8 Conclusion and Future Work
We present Longformer, a transformer-based model that is scalable for processing long documents and that makes it easy to perform a wide range of document-level NLP tasks without chunking/shortening the long input and without complex architecture to combine information across these chunks. Longformer employs an attention pattern that combines local and global information while also scaling linearly with the sequence length. Longformer achieves state-of-the-art results on the character-level language modeling tasks of text8 and enwik8. When pretrained, Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We further present LED, an encoder-decoder variant of Longformer for modeling sequence-to-sequence tasks, and achieve state-of-the-art results on the arXiv long document summarization task. For future work, we would like to study other pretraining objectives, especially for LED, increase the sequence length, and explore other tasks that might benefit from our model.

Acknowledgment
We would like to thank Noah Smith, Dan Weld, Dirk Groeneveld, Kyle Lo, Daniel King and Doug Downey for helpful discussions and feedback, and the AI2 infrastructure team for technical support.

References
Joshua Ainslie, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula, Sumit Sanghai, Qifan Wang, and Li Yang. 2020. ETC: Encoding long and structured inputs in transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 268–284, Online. Association for Computational Linguistics.
Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. In AAAI.
Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In ACL.
Jifan Chen, Shih-Ting Lin, and Greg Durrett. 2019. Multi-hop question answering via reasoning chains. arXiv preprint, abs/1910.02610.
Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI.
Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training deep nets with sublinear memory cost. arXiv preprint, abs/1604.06174.
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. 2019. Generating long sequences with sparse transformers. arXiv preprint, abs/1904.10509.
Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In ACL.

Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does bert look at? an analysis of bert’s attention. arXiv preprint, abs/1906.04341.
Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In NAACL-HLT 2018.
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In NeurIPS.
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. In ACL.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT.
Yuwei Fang, Siqi Sun, Zhe Gan, Rohit Pillai, Shuohang Wang, and Jingjing Liu. 2020. Hierarchical graph network for multi-hop question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8823–8838, Online. Association for Computational Linguistics.
Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading comprehension. In MRQA workshop at EMNLP.
Alexios Gidiotis and Grigorios Tsoumakas. 2020. A divide-and-conquer approach to the summarization of academic articles. arXiv preprint, abs/2004.06190.
Michael Glaß, Alfio Massimiliano Gliozzo, Rishav Chakravarti, Anthony Ferritto, Lin Pan, Gaudani Bhargav, Dinesh Garg, and Avirup Sil. 2019. Span selection pre-training for question answering. arXiv preprint, abs/1909.04120.
Scott Gray, Alec Radford, and Diederik P. Kingma. 2017. Gpu kernels for block-sparse weights.
Dirk Groeneveld, Tushar Khot, Mausam, and Ashish Sabhwaral. 2020. A simple yet strong pipeline for HotpotQA. arXiv preprint, abs/2004.06753.
Ankit Gupta and Jonathan Berant. 2020. Gmat: Global memory augmentation for transformers. arXiv preprint, abs/2006.03274.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In ACL.
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In ACL.
Mandar Joshi, Omer Levy, Luke Zettlemoyer, and Daniel Weld. 2019. BERT for coreference resolution: Baselines and analysis. In EMNLP-IJCNLP.
Johannes Kiesel, Maria Mestre, Rishabh Shukla, Emmanuel Vincent, Payam Adineh, David Corney, Benno Stein, and Martin Potthast. 2019. SemEval-2019 task 4: Hyperpartisan news detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, pages 829–839, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. ICLR.
Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The efficient transformer. In ICLR.
Olga V. Kovaleva, Alexey Romanov, Anna Rogers, and Anna Rumshisky. 2019. Revealing the dark secrets of bert. In EMNLP/IJCNLP.
Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In NAACL.
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint, abs/1907.11692.
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA. Association for Computational Linguistics.
Matt Mahoney. 2009. Large text compression benchmark.
Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. In SSW.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 shared task: Modeling multilingual unrestricted coreference in OntoNotes. In Joint Conference on EMNLP and CoNLL - Shared Task, pages 1–40, Jeju Island, Korea. Association for Computational Linguistics.
Jiezhong Qiu, Hao Ma, Omer Levy, Scott Yih, Sinong Wang, and Jie Tang. 2019. Blockwise self-attention for long document understanding. arXiv preprint, abs/1911.02972.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.
Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, and Timothy P. Lillicrap. 2020. Compressive transformers for long-range sequence modelling. In ICLR.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, W. Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. 2020. Efficient content-based sparse attention with routing transformers. arXiv preprint, abs/2003.05997.
Nan Shao, Yiming Cui, Ting Liu, Shijin Wang, and Guoping Hu. 2020. Is graph structure necessary for multi-hop reasoning? arXiv preprint, abs/2004.03096.
Sandeep Subramanian, Raymond Li, Jonathan Pilault, and C. Pal. 2020. On extractive and abstractive neural document summarization with transformer language models. In EMNLP.
Sainbayar Sukhbaatar, Edouard Grave, Piotr Bojanowski, and Armand Joulin. 2019. Adaptive attention span in transformers. In ACL.
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. arXiv preprint, abs/1806.02847.
Ming Tu, Jinke Huang, Xiaodong He, and Bowen Zhou. 2020. Graph sequential network for reasoning over sequences. In NeurIPS Graph Representation Learning workshop.
Ming Tu, Kevin Huang, Guangtao Wang, Jing Huang, Xiaodong He, and Bufang Zhou. 2019. Select, answer and explain: Interpretable multi-hop reading comprehension over multiple documents. arXiv preprint, abs/1911.00484.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. TACL, 6:287–302.
Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin, and Michael Auli. 2019. Pay less attention with lightweight and dynamic convolutions. arXiv preprint, abs/1901.10430.
Qizhe Xie, Zihang Dai, Eduard H. Hovy, Minh-Thang Luong, and Quoc V. Le. 2019. Unsupervised data augmentation for consistency training. arXiv preprint, abs/1904.12848.
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shu xin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Li-Wei Wang, and Tie-Yan Liu. 2020. On layer normalization in the transformer architecture. arXiv preprint, abs/2002.04745.
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
Zihao Ye, Qipeng Guo, Quan Gan, Xipeng Qiu, and Zheng Zhang. 2019. BP-Transformer: Modelling long-range context via binary partitioning. arXiv preprint, abs/1911.04070.
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, C. Alberti, S. Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, L. Yang, and A. Ahmed. 2020. Big bird: Transformers for longer sequences. arXiv preprint, abs/2007.14062.
Rowan Zellers, Ari Holtzman, Hannah Rashkin, Yonatan Bisk, Ali Farhadi, Franziska Roesner, and Yejin Choi. 2019. Defending against neural fake news. In NeurIPS.
Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. Pegasus: Pre-training with extracted gap-sentences for abstractive summarization. ICML.
Yukun Zhu, Ryan Kiros, Richard S. Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. ICCV, pages 19–27.
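
Appendix A, which follows, describes Longformer-loop as a naive implementation that computes each diagonal of a banded matrix product separately in a loop. As a rough illustration of that idea only, the NumPy sketch below computes query-key scores on the band for the non-dilated case and checks them against a dense product. The function names, the `half_window` parameter (number of neighbours on each side), and the (n, 2 * half_window + 1) output layout are assumptions made for this example. It is not the authors' implementation, and it ignores batching, attention heads, dilation, and masking of padded positions.

```python
import numpy as np


def banded_scores_loop(q, k, half_window):
    """Compute q[i] . k[i + d] for every offset d in [-half_window, half_window].

    Returns an (n, 2 * half_window + 1) array; column half_window + d holds
    offset d, and entries whose key index falls outside the sequence stay 0.
    """
    n = q.shape[0]
    out = np.zeros((n, 2 * half_window + 1))
    for d in range(-half_window, half_window + 1):
        lo, hi = max(0, -d), min(n, n - d)   # query rows i with 0 <= i + d < n
        if hi <= lo:
            continue                          # this diagonal lies outside the matrix
        # Row-wise dot products between each query and its shifted key.
        out[lo:hi, half_window + d] = np.einsum(
            "ij,ij->i", q[lo:hi], k[lo + d:hi + d])
    return out


def banded_scores_dense(q, k, half_window):
    """Reference version: form the full n x n product, then read off the band."""
    n = q.shape[0]
    full = q @ k.T
    out = np.zeros((n, 2 * half_window + 1))
    for i in range(n):
        for d in range(-half_window, half_window + 1):
            if 0 <= i + d < n:
                out[i, half_window + d] = full[i, i + d]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((6, 4))
    k = rng.standard_normal((6, 4))
    same = np.allclose(banded_scores_loop(q, k, 2), banded_scores_dense(q, k, 2))
    print("loop matches dense reference:", same)
```

The loop version stores n * (2 * half_window + 1) scores, which grows linearly with n, whereas the dense reference materializes all n² products before discarding most of them.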

A Implementation Details
Implementing Longformer’s dilated sliding window attention requires a form of banded matrix multiplication (matrix multiplication where the output is all zero except certain diagonals) that is not directly supported in existing deep learning libraries like PyTorch/Tensorflow. Fig. 1 compares the runtime and memory of three different ways of implementing it.

Longformer-loop is a naive implementation that computes each diagonal separately in a loop. It is memory efficient because it only computes the non-zero values, but it is unusably slow. We only use it for testing because it is easy to implement but don’t use it to run experiments.

Longformer-chunks only supports the non-dilated case. It chunks Q and K into overlapping blocks of size w and overlap of size w/2, multiplies the blocks, then masks out the diagonals. This is very compute efficient because it uses a single matrix multiplication operation from PyTorch, but it consumes 2x the amount of memory a perfectly optimized implementation should consume because it computes some of the zero values. Because of the compute efficiency, this implementation is most suitable for the pretrain/finetune case. We didn’t find the increase in memory to be a problem for this setting.

Longformer-cuda is a custom CUDA kernel that we implement using TVM (Chen et al., 2018). It is a fully functioning implementation of our attention (not limited like Longformer-chunks), it is the most memory efficient, and it is as fast as the highly optimized full self-attention.10 We mainly use this implementation for the autoregressive language modeling experiments because of the memory efficiency (allows the longest sequences) and the support of dilation (needed for character-LM experiments).

Tensor Virtual Machine (TVM)
We build our custom CUDA kernel using TVM (Chen et al., 2018), a deep learning compiler stack that compiles high level description of a function into optimized device-specific code. Using TVM, we describe our banded matrix multiplication in high-level python constructs, then TVM generates the corresponding CUDA code and compiles it for GPUs.

10 It is worth noting that theoretically, a perfectly optimized Longformer-cuda should be faster than the n^2 computation. However, achieving this level of performance requires special knowledge of low-level GPU programming, similar to implementing a highly optimized matrix multiplication. Our current implementation is sufficiently fast and practical to use.

B Character LM Hyperparameters
We evaluate on text8 and enwik8, both contain 100M characters from Wikipedia split into 90M, 5M, 5M for train, dev, test. Our model only specifies how the self-attention component works, and it is agnostic to the other design choices for the transformer model. Our implementation is based on the Transformer-XL (Dai et al., 2019) code11 with the memory mechanism disabled. We use relative position embeddings with sinusoidal weights as in Dai et al. (2019). We use two different model sizes, a small (12 layers, 512 hidden size) model as in Dai et al. (2019), and a large (30 layers, 512 hidden size) model as in Child et al. (2019). We employed mixed precision training (floating points 16 and 32) using apex12 to reduce memory consumption and speed up training. However, we kept the attention computation in fp32 to avoid numerical instability issues.13 We used gradient checkpointing (Chen et al., 2016) to reduce memory usage, and ran our experiments on 48GB RTX8000 GPUs. All hyperparameters and stage configurations are listed in Tab. 12. Our CUDA kernel supports the autoregressive mode where each token attends to a window of previous tokens only. Our implementation also includes a version of the relative position embedding that is compatible with our dilated sliding window attention.

11 https://github.com/kimiyoung/transformer-xl
12 https://github.com/NVIDIA/apex
13 We found that using fp16 in attention operation results in floating point overflow and NaNs in later stages of training.

We ran the small model experiments on 4 RTX8000 GPUs for 16 days. For the large model, we ran experiments on 8 RTX8000 GPUs for 13 days. Most of our hyperparameter search is similar to the ablation in Tab. 4 where we run the configuration for 150K steps on text8. We experimented with absolute position embeddings and learned position embeddings, dropout values of [0.1, 0.2] (small model) and [0.1, 0.4] (large model), pre-layernorm and post-layernorm (Xiong et al., 2020), learning rate (LR) of phase1 of values [2.5e-5, 5e-4, 1e-4], constant and cosine LR schedules, and different configurations for dilation (on all heads, on 2 heads, no dilation). Number of gradient updates/phase reported in Tab. 12 is determined by running each phase until the validation BPC stops

getting better.

C Pretraining Data
In order to allow the model to learn long dependencies in pretraining, we compiled a corpus of long documents. Some of these data sources were also included in the original RoBERTa pretraining including the Books corpus (Zhu et al., 2015) plus English Wikipedia. We additionally included one third of a subset of the Realnews dataset (Zellers et al., 2019) with documents longer than 1,200 tokens as well as one third of the Stories (Trinh and Le, 2018) corpus. Our goal was to include a mix of long and short documents so that the model learns longer dependencies without forgetting information from the original RoBERTa pretraining. The statistics of the pretraining data are shown in Tab. 13.

D Task specific model details
All the QA and classification models are implemented using PyTorch-Lightning14. We use the official train/dev/test splits of all datasets except for the Hyperpartisan news which we randomly split into 80/10/10 for train/dev/test.

14 https://github.com/PyTorchLightning/pytorch-lightning

WikiHop
Instances in WikiHop consist of: a question, answer candidates (ranging from two candidates to 79 candidates), supporting contexts (ranging from three paragraphs to 63 paragraphs), and the correct answer. The dataset does not provide any intermediate annotation for the multihop reasoning chains, requiring models to instead infer them from the indirect answer supervision.

To prepare the data for input to Longformer and RoBERTa, we first tokenize the question, answer candidates, and support contexts using RoBERTa’s wordpiece tokenizer. Then we concatenate the question and answer candidates with special tokens as [q] question [/q] [ent] candidate1 [/ent] ... [ent] candidateN [/ent]. The contexts are also concatenated using RoBERTa’s document delimiter tokens as separators: </s> context1 </s> ... </s> contextM </s>. The special tokens [q], [/q], [ent], [/ent] were added to the RoBERTa vocabulary and randomly initialized before task finetuning.

After preparing the input data, we compute activations from the top layer of each model as follows. We take the question and answer candidates and concatenate them to as much context as possible up to the model sequence length (512 for RoBERTa, 4,096 for Longformer), run the sequence through the model, collect the output activations, and repeat until all of the context is exhausted (for all models except Longformer-large, where we just include the first 4,096 length sequence due to memory requirements). Then all activations for all chunks are concatenated into one long sequence. In the case of Longformer, we use global attention to the entire question and answer candidate sequence.

For prediction, we attach a linear layer to each [ent] that outputs a single logit, average over all logits for each candidate across the chunks, apply a softmax and use the cross entropy loss with the correct answer candidate.

Training used the Adam optimizer with linear warmup over 200 gradient updates to a maximum LR, and linear decay over the remainder of training. We used gradient accumulation to reach an effective batch size of 32 instances, checked the development accuracy every 250 gradient updates, and reported the maximum development accuracy. Other hyperparameters (dropout, weight decay) were identical to RoBERTa pretraining.

In general, we ran minimal hyperparameter trials, but for fair comparison between Longformer and RoBERTa ran an identical hyperparameter search with Longformer-base and RoBERTa-base. This consisted of a grid search of LR in [2e-5, 3e-5, 5e-5] and number of epochs in [5, 10, 15]. The best Longformer-base configuration used lr=3e-5, 15 epochs. We ran two hyperparameter trials for Longformer-large, lr=3e-5 and number of epochs in [5, 15] (the 5 epoch model had higher dev accuracy of 77.6, and was the single model submitted to the public leaderboard for test set evaluation). All models were trained on a single RTX8000 GPU, with Longformer-base taking about a day for 5 epochs.

TriviaQA
TriviaQA has more than 100K question, answer, document triplets for training. Documents are Wikipedia articles, and answers are named entities mentioned in the article. The span that answers the question is not annotated, but it is found using simple text matching.

Similar to WikiHop, we tokenize the question and the document using RoBERTa’s tokenizer, then form the input as [s] question [/s]

Param                        Value
Position Embeddings          Relative and Sinusoidal as in Dai et al. (2019)
Small model config           12 layers, 8 heads, 512 hidden size as in Dai et al. (2019)
Large model config           30 layers, 8 heads, 512 hidden size as in Child et al. (2019)
Optimizer                    AdamW
Dropout                      0.2 (small model), 0.4 (large model)
Gradient clipping            0.25
Weight Decay                 0.01
Layernorm Location           pre-layernorm (Xiong et al., 2020)
Activation                   GeLU
Number of phases             5
Phase 1 window sizes         32 (bottom layer) - 8,192 (top layer)
Phase 5 window sizes         512 (bottom layer) - (top layer)
Phase 1 sequence length      2,048
Phase 5 sequence length      23,040 (gpu memory limit)
Phase 1 LR                   0.00025
Phase 5 LR                   0.000015625
Batch size per phase         32, 32, 16, 16, 16
#Steps per phase (small)     430K, 50k, 50k, 35k, 5k
#Steps per phase (large)     350K, 25k, 10k, 5k, 5k
Warmup                       10% of the phase steps with maximum 10K steps
LR scheduler                 constant throughout each phase
Dilation (small model)       0 (layers 0-5), 1 (layers 6-7), 2 (layers 8-9), 3 (layers 10-11)
Dilation (large model)       0 (layers 0-14), 1 (layers 15-19), 2 (layers 20-24), 3 (layers 25-29)
Dilation heads               2 heads only

Table 12: Hyperparameters for the best performing model for character-level language modeling
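
Table 12 lists only the first and last of the five training phases. The Phase 1 learning rate (0.00025) divided by 2^4 gives 0.000015625, and the Phase 1 bottom-layer window (32) multiplied by 2^4 gives the Phase 5 bottom-layer window (512), which is what one would see if each phase halved the learning rate and doubled the window. The Python sketch below fills in the intermediate phases under that assumption. The intermediate values are an extrapolation rather than numbers reported in the table, and the function name is hypothetical. Sequence length is left out because the Phase 5 value (23,040) is stated to be capped by GPU memory.

```python
import math


def staged_schedule(num_phases=5, lr_phase1=0.00025, bottom_window_phase1=32):
    """Return one dict per phase, assuming LR halves and window doubles each phase."""
    phases = []
    for p in range(num_phases):
        phases.append({
            "phase": p + 1,
            "lr": lr_phase1 / (2 ** p),                       # assumed halving
            "bottom_window": bottom_window_phase1 * (2 ** p),  # assumed doubling
        })
    return phases


if __name__ == "__main__":
    schedule = staged_schedule()
    for entry in schedule:
        print(entry)
    # The endpoints should agree with the Phase 5 values listed in Table 12.
    assert math.isclose(schedule[-1]["lr"], 0.000015625)
    assert schedule[-1]["bottom_window"] == 512
```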

document [/s]. We truncate the document at 4,096 wordpieces to avoid it being very slow. Afterwards, we get the activations from RoBERTa and Longformer similar to WikiHop (discussed above). We use global attention on all question tokens.

For prediction, we add one layer that predicts the beginning and end of the answer span. Because of the distant supervision nature of the training data (no gold answer spans), we use the loss function of Clark and Gardner (2017) which works like an OR that the model only needs to get one answer span right, not all of them.

Hyperparameters of the best configuration are listed in Tab. 14. All other hyperparameters are similar to RoBERTa’s. For hyperparameter search, we only tuned LR for the RoBERTa baseline and tried rates [3e-5, 5e-5, 1e-4], then used the best, which is 3e-5, for all subsequent experiments with no further tuning. We trained the Longformer-large with the best configuration once and submitted its output to the leaderboard. We ran our experiments on 32GB V100 GPUs. Small model takes 1 day to train on 4 GPUs, while large model takes 1 day on 8 GPUs.

Source                             Tokens   Avg doc len
Books (Zhu et al., 2015)           0.5B     95.9K
English Wikipedia                  2.1B     506
Realnews (Zellers et al., 2019)    1.8B     1.7K
Stories (Trinh and Le, 2018)       2.1B     7.8K

Table 13: Pretraining data

Param           WikiHop   TriviaQA   HotpotQA
Epochs          15        5          5
LR              3e-5      3e-5       5e-5
Warmup steps    200       1000       1000
Batch size      32        32         32
Optimizer       Adam      Adam       Adam

Table 14: Hyperparameters of the QA models. All models use a similar scheduler with linear warmup and decay.

HotpotQA
HotpotQA dataset involves answering questions from a set of 10 paragraphs from 10 different Wikipedia articles where 2 paragraphs are relevant to the question and the rest are distractors. It includes 2 tasks of answer span extraction and evidence sentence identification. Our model for HotpotQA combines both answer span extraction and evidence extraction in one joint model. We found a higher performance using a two-stage Longformer model with similar setup that first identifies relevant paragraphs and then finds the final answer span and evidence.15 This is largely because removing the distracting paragraphs first reduces the noise for the final evidence and span detection as also found to be important by recent state-of-the-art methods in this dataset (Fang et al., 2020).

15 The final dev performance of the two stage model improves over a single stage model by about 4.2 points on joint-F1 metric

Similar to WikiHop and TriviaQA, to prepare the data for input to Longformer, we concatenate question and then all the 10 paragraphs in one long context. We particularly use the following input format with special tokens: “[CLS] [q] question [/q] ⟨t⟩ title1 ⟨/t⟩ sent1,1 [s] sent1,2 [s] ... ⟨t⟩ title2 ⟨/t⟩ sent2,1 [s] sent2,2 [s] ...” where [q], [/q], ⟨t⟩, ⟨/t⟩, [s], [p] are special tokens representing, question start and end, paragraph title start and end, and sentence, respectively. The special tokens were added to the Longformer vocabulary and randomly initialized before task finetuning. For Longformer, we use global attention to question tokens, paragraph title start tokens as well as sentence tokens. The model includes additional feedforward layers on
top of paragraph title start tokens for prediction document into non-overlapping segments up to the
of relevant paragraphs, as well as sentence tokens maximum sequence length, then concatenates the
for predicting evidence sentences. After training activations for the coarse-to-fine clustering stage
the first stage model, we predict relevant paragraph that forms coreference clusters. The maximum se-
scores for both training and development set. We quence length was 384 for RoBERTa-base, chosen
then keep up to 5 paragraphs whose raw score is after three trials from [256, 384, 512] using the
higher than a pre-specified threshold (-3.0), and default hyperparameters in the original implemen-
remove the other paragraphs from the context. We tation.16 For Longformer-base the sequence length
then train the second stage model on the resulting was 4,096. Similar to the original implementation,
shortened context. For answer span extraction we different learning rates were used for the pretrained
use BERT’s QA model (Devlin et al., 2019) with RoBERTa parameters and the randomly initialized
addition of a question type (yes/no/span) classifi- task parameters. Using a larger learning rate in the
cation head over the first special token ([CLS]). task parameters allows the optimizer to adjust them
For evidence extraction we apply 2 layer feedfor- farther from their randomly initialized values with-
ward networks on top of the representations corre- out destroying the information in the pretrained
sponding to sentence and paragraph tokens to get RoBERTa parameters.
the corresponding evidence prediction scores and Hyperparameter searches were minimal and con-
use binary cross entropy loss to train the model. sisted of grid searches of RoBERTa LR in [1e-5,
At inference time for evidence extraction, we use 2e-5, 3e-5] and task LR in [1e-4, 2e-4, 3e-4] for
a constrained decoding strategy similar to Groen- both RoBERTa and Longformer for a fair compari-
eveld et al. (2020) that ensures that the evidence son. The best configuration for Longformer-base
sentences come from exactly two paragraphs which was RoBERTa lr=1e-5, task lr=1e-4. All other hy-
is the setup of this dataset. We combine span, ques- perparameters were the same as in the original im-
tion classification, sentence, and paragraphs losses plementation. Training takes about 10 hours on a
and train the model in a multitask way using lin- single GPU.
ear combination of losses. Our experiments are Our implementation is a superhack that involves
done on RTX8000 GPUs and training each epoch PyTorch and Tensorflow sharing a single process
takes approximately half a day on 4 GPUs. We and GPU. To avoid re-implementing the com-
trained the model using Adam optimizer with lin- plicated coarse-to-fine logic from Tensorflow in
ear warmup (1000 steps) and linear decay. We used PyTorch (that involves a highly optimized cus-
minimal hyperparameter tuning using LRs of 3e-5 tom GPU kernel originally released by Lee et al.
and 5e-5 and epochs of 3 to 7 and found the model (2018)), we devised a system where the lower trans-
with LR of 3e-5 and 5 epochs to work best. We former portion of the model passes activations and
conduct the same hyperparameter search for the gradients back and forth between PyTorch and Ten-
RoBERTa baseline as well. The rest of hyperpa- sorflow. The input tensors are first run through
rameters are reported in Tab 14. the transformer in PyTorch, the activations are col-
lected from the top layer, transferred from GPU
Coreference model details The coreference
to CPU then from CPU to Tensorflow and back to
model is a straightforward adaptation of the coarse-
GPU to run the coarse-to-fine clustering and com-
to-fine BERT based model from Joshi et al.
pute the loss. Then gradients are back propogated
(2019). After preprocessing each document with
16
the RoBERTa wordpiece tokenizer, it splits each https://github.com/mandarjoshi90/coref

16
in Tensorflow to the top of the transformer and
the process reversed to transfer them to PyTorch
for back propogation through the remainder of the
model. Separate optimizers are maintained with
identical LR schedules for parameter updates. The
overhead in this approach is minimal compared to
the overall cost of running the model.
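
The segment-splitting step described for the coreference model can be illustrated with a minimal sketch. It assumes a document is a flat list of wordpiece ids and uses an identity function in place of the transformer encoder. The helper names are hypothetical, and the sketch ignores any sentence-boundary handling the original implementation may apply.

```python
def split_into_segments(token_ids: list[int], max_len: int) -> list[list[int]]:
    """Split a document into non-overlapping segments of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [token_ids[start:start + max_len]
            for start in range(0, len(token_ids), max_len)]


def concatenate_activations(per_segment: list[list[int]]) -> list[int]:
    """Join per-segment outputs back into one document-level sequence."""
    return [item for segment in per_segment for item in segment]


# A toy 1,000-token document (ids stand in for real wordpieces).
document = list(range(1000))

# RoBERTa-base setting from the text: maximum sequence length 384.
roberta_segments = split_into_segments(document, max_len=384)

# Longformer-base setting from the text: maximum sequence length 4,096.
longformer_segments = split_into_segments(document, max_len=4096)

# Stand-in for the encoder: each segment is "encoded" by the identity function,
# then the per-segment outputs are concatenated for the clustering stage.
activations = concatenate_activations(roberta_segments)

print([len(segment) for segment in roberta_segments])
print(len(longformer_segments))
```

With the shorter limit the toy document is encoded as several independent segments, whereas the 4,096-token limit lets it pass through the encoder in one piece, so no context is cut at segment boundaries.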
Text classification For classification, following
BERT, we used a simple binary cross entropy loss
on top of the first [CLS] token, with the addition of
global attention to [CLS]. We used the Adam
optimizer with a batch size of 32 and linear warmup
and decay, with warmup steps equal to 0.1 of the
total training steps. For both IMDB and Hyperpartisan
news we did a grid search over LRs [3e-5, 5e-5]
and epochs [10, 15, 20] and found the model with
LR 3e-5 and 15 epochs to work best. Experiments
were done on a single RTX8000 GPU.
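
As a rough illustration of the schedule described above, the following sketch computes a learning-rate multiplier with linear warmup over the first 0.1 of training followed by linear decay. It assumes the rate decays to zero at the final step, which the text does not specify, and the function name is hypothetical.

```python
def linear_warmup_decay(step: int, total_steps: int, warmup_frac: float = 0.1) -> float:
    """Return a multiplier in [0, 1] applied to the peak learning rate."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return step / warmup_steps
    # Linear decay from the peak down to 0 at the final step (assumed endpoint).
    decay_steps = max(1, total_steps - warmup_steps)
    return max(0.0, (total_steps - step) / decay_steps)


peak_lr = 3e-5       # best LR reported for text classification
total_steps = 1000   # illustrative value, not taken from the paper

for step in (0, 50, 100, 550, 1000):
    lr = peak_lr * linear_warmup_decay(step, total_steps)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

The multiplier form keeps the schedule independent of the peak rate, so the same function can be reused across the learning rates in the grid search.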