Sources — Compute-Structure AI History

Transformer-XL: Attentive Language Models
Beyond a Fixed-Length Context

Zihang Dai∗¹², Zhilin Yang∗¹², Yiming Yang¹, Jaime Carbonell¹,
Quoc V. Le², Ruslan Salakhutdinov¹
¹ Carnegie Mellon University, ² Google Brain
{dzihang,zhiliny,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com

arXiv:1901.02860v3 [cs.LG] 2 Jun 2019

Abstract

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwik8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both Tensorflow and PyTorch (footnote 1).

∗ Equal contribution. Order determined by swapping the one in Yang et al. (2017).
Footnote 1: https://github.com/kimiyoung/transformer-xl

1 Introduction

Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data. Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), have been a standard solution to language modeling and obtained strong results on multiple benchmarks. Despite the wide adoption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., 2001), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, 2013) might not be sufficient to fully address this issue. Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement.

On the other hand, the direct connections between long-distance word pairs baked in attention mechanisms might ease optimization and enable the learning of long-term dependency (Bahdanau et al., 2014; Vaswani et al., 2017). Recently, Al-Rfou et al. (2018) designed a set of auxiliary losses to train deep Transformer networks for character-level language modeling, which outperform LSTMs by a large margin. Despite the success, the LM training in Al-Rfou et al. (2018) is performed on separated fixed-length segments of a few hundred characters, without any information flow across segments. As a consequence of the fixed context length, the model cannot capture any longer-term dependency beyond the predefined context length. In addition, the fixed-length segments are created by selecting a consecutive chunk of symbols without respecting the sentence or any other semantic boundary. Hence, the model lacks the contextual information needed to predict the first few symbols well, leading to inefficient optimization and inferior performance. We refer to this problem as context fragmentation.

To address the aforementioned limitations of fixed-length contexts, we propose a new architecture called Transformer-XL (meaning extra long). We introduce the notion of recurrence into our
deep self-attention network. In particular, instead of computing the hidden states from scratch for each new segment, we reuse the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. Meanwhile, passing information from the previous segment can also resolve the problem of context fragmentation. More importantly, we show the necessity of using relative positional encodings rather than absolute ones, in order to enable state reuse without causing temporal confusion. Hence, as an additional technical contribution, we introduce a simple but more effective relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.

Transformer-XL obtained strong results on five datasets, varying from word-level to character-level language modeling. Transformer-XL is also able to generate relatively coherent long text articles with thousands of tokens (see Appendix E), trained on only 100M tokens.

Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. These two techniques form a complete set of solutions, as any one of them alone does not address the issue of fixed-length contexts. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling.

2 Related Work

In the last few years, the field of language modeling has witnessed many significant advances, including but not limited to devising novel architectures to better encode the context (Bengio et al., 2003; Mikolov et al., 2010; Merity et al., 2016; Al-Rfou et al., 2018), improving regularization and optimization algorithms (Gal and Ghahramani, 2016), speeding up the Softmax computation (Grave et al., 2016a), and enriching the output distribution family (Yang et al., 2017).

To capture the long-range context in language modeling, a line of work directly feeds a representation of the wider context into the network as an additional input. Existing works range from ones where context representations are manually defined (Mikolov and Zweig, 2012; Ji et al., 2015; Wang and Cho, 2015) to others that rely on document-level topics learned from data (Dieng et al., 2016; Wang et al., 2017).

More broadly, in generic sequence modeling, how to capture long-term dependency has been a long-standing research problem. From this perspective, since the ubiquitous adoption of LSTM, many efforts have been spent on relieving the vanishing gradient problem, including better initialization (Le et al., 2015), additional loss signal (Trinh et al., 2018), augmented memory structure (Ke et al., 2018) and others that modify the internal architecture of RNNs to ease the optimization (Wu et al., 2016; Li et al., 2018). Different from them, our work is based on the Transformer architecture and shows that language modeling as a real-world task benefits from the ability to learn longer-term dependency.

3 Model

Given a corpus of tokens $x = (x_1, \ldots, x_T)$, the task of language modeling is to estimate the joint probability $P(x)$, which is often auto-regressively factorized as $P(x) = \prod_t P(x_t \mid x_{<t})$. With the factorization, the problem reduces to estimating each conditional factor. In this work, we stick to the standard neural approach to modeling the conditional probability. Specifically, a trainable neural network is used to encode the context $x_{<t}$ into a fixed size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the Softmax function, yielding a categorical probability distribution over the next token.

3.1 Vanilla Transformer Language Models

In order to apply Transformer or self-attention to language modeling, the central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed size representation. Given infinite memory and computation, a simple solution would be to process the entire context sequence using an unconditional Transformer decoder, similar to a feed-forward neural network. However, this is usually infeasible with the limited resource in practice.

One feasible but crude approximation is to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments. This is the idea adopted by Al-Rfou et al. (2018). We call it the vanilla model and visualize it in Fig. 1a. Under this training paradigm, information never flows across segments in either the forward or backward pass. There are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper bounded by the segment length, which is a few hundred on character-level language modeling (Al-Rfou et al., 2018). Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem compared to RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect the sentence or other semantic boundaries, it has been standard practice to simply chunk long text into fixed-length segments due to improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments will lead to the context fragmentation problem as discussed in Section 1.

Figure 1: Illustration of the vanilla model with a segment length 4. (a) Train phase: Segment 1 and Segment 2 are processed separately. (b) Evaluation phase: each prediction uses a limited context.

During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive. We will show that our proposed architecture is able to substantially improve the evaluation speed.

3.2 Segment-Level Recurrence with State Reuse

To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism to the Transformer architecture. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, leading to an ability of modeling longer-term dependency and avoiding context fragmentation. Formally, let the two consecutive segments of length $L$ be $s_\tau = [x_{\tau,1}, \cdots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \cdots, x_{\tau+1,L}]$ respectively. Denote the $n$-th layer hidden state sequence produced for the $\tau$-th segment $s_\tau$ by $h_\tau^n \in \mathbb{R}^{L \times d}$, where $d$ is the hidden dimension. Then, the $n$-th layer hidden state for segment $s_{\tau+1}$ is produced (schematically) as follows,

\[
\begin{aligned}
\widetilde{h}_{\tau+1}^{n-1} &= \left[ \mathrm{SG}(h_\tau^{n-1}) \circ h_{\tau+1}^{n-1} \right], \\
q_{\tau+1}^n,\; k_{\tau+1}^n,\; v_{\tau+1}^n &= h_{\tau+1}^{n-1} W_q^\top,\; \widetilde{h}_{\tau+1}^{n-1} W_k^\top,\; \widetilde{h}_{\tau+1}^{n-1} W_v^\top, \\
h_{\tau+1}^n &= \text{Transformer-Layer}\left( q_{\tau+1}^n, k_{\tau+1}^n, v_{\tau+1}^n \right),
\end{aligned}
\]

where the function $\mathrm{SG}(\cdot)$ stands for stop-gradient, the notation $[h_u \circ h_v]$ indicates the concatenation of two hidden sequences along the length dimension, and $W_\cdot$ denotes model parameters. Compared to the standard Transformer, the critical difference lies in that the key $k_{\tau+1}^n$ and value $v_{\tau+1}^n$ are conditioned on the extended context $\widetilde{h}_{\tau+1}^{n-1}$ and hence $h_\tau^{n-1}$ cached from the previous segment. We emphasize this particular design by the green paths in Fig. 2a.

With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. As a result, the effective context being utilized can go way beyond just two segments. However, notice that the recurrent dependency between $h_{\tau+1}^n$ and $h_\tau^{n-1}$ shifts one layer downwards
Figure 2: Illustration of the Transformer-XL model with a segment length 4. (a) Training phase: the previous segment is fixed (no gradient) while the new segment is processed. (b) Evaluation phase: predictions use the extended context.
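
The state reuse illustrated in Fig. 2a and written schematically in Section 3.2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `layer_with_memory`, the single attention head, and the 1/sqrt(d) scaling are choices made here, and positional encodings, the residual connection, layer normalization and the feed-forward sublayer are all omitted. NumPy has no automatic differentiation, so SG(·) appears only as a comment.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_with_memory(h_prev_seg, h_cur_seg, Wq, Wk, Wv):
    """Simplified attention sublayer with segment-level recurrence.

    h_prev_seg: (M, d) cached lower-layer states of the previous segment
    h_cur_seg:  (L, d) lower-layer states of the current segment
    Wq, Wk, Wv: (d, d) projection matrices (hypothetical, randomly set below)
    """
    # SG(.): in an autodiff framework the cached states would be detached
    # here so that no gradient flows into the previous segment.
    memory = h_prev_seg.copy()
    h_ext = np.concatenate([memory, h_cur_seg], axis=0)   # (M + L, d)

    q = h_cur_seg @ Wq.T   # queries come from the current segment only
    k = h_ext @ Wk.T       # keys span memory + current segment
    v = h_ext @ Wv.T       # values span memory + current segment

    L, M = h_cur_seg.shape[0], memory.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[1])                # (L, M + L)
    # Causal mask: query i may see all memory and current positions <= i.
    future = np.triu(np.ones((L, M + L), dtype=bool), k=M + 1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v                            # (L, d)

rng = np.random.default_rng(0)
L, d = 4, 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
prev_states = rng.normal(size=(L, d))
cur_states = rng.normal(size=(L, d))
out = layer_with_memory(prev_states, cur_states, Wq, Wk, Wv)
```

Queries come only from the current segment, while keys and values span the cached states plus the current segment, which is the difference from the vanilla model that Section 3.2 emphasizes.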

per-segment, which differs from the same-layer recurrence in conventional RNN-LMs. Consequently, the largest possible dependency length grows linearly w.r.t. the number of layers as well as the segment length, i.e., $O(N \times L)$, as visualized by the shaded area in Fig. 2b. This is analogous to truncated BPTT (Mikolov et al., 2010), a technique developed for training RNN-LMs. However, different from truncated BPTT, our method caches a sequence of hidden states instead of the last one, and should be applied together with the relative positional encoding technique described in Section 3.3.

Besides achieving extra long context and resolving fragmentation, another benefit that comes with the recurrence scheme is significantly faster evaluation. Specifically, during evaluation, the representations from the previous segments can be reused instead of being computed from scratch as in the case of the vanilla model. In our experiments on enwik8, Transformer-XL is up to 1,800+ times faster than the vanilla model during evaluation (see Section 4).

Finally, notice that the recurrence scheme does not need to be restricted to only the previous segment. In theory, we can cache as many previous segments as the GPU memory allows, and reuse all of them as the extra context when processing the current segment. Thus, we can cache a predefined length-$M$ old hidden states spanning (possibly) multiple segments, and refer to them as the memory $m_\tau^n \in \mathbb{R}^{M \times d}$, due to a clear connection to the memory augmented neural networks (Graves et al., 2014; Weston et al., 2014). In our experiments, we set $M$ equal to the segment length during training, and increase it by multiple times during evaluation.

3.3 Relative Positional Encodings

While we found the idea presented in the previous subsection very appealing, there is a crucial technical challenge we haven't solved in order to reuse the hidden states. That is, how can we keep the positional information coherent when we reuse the states? Recall that, in the standard Transformer, the information of sequence order is provided by a set of positional encodings, denoted as $U \in \mathbb{R}^{L_{\max} \times d}$, where the $i$-th row $U_i$ corresponds to the $i$-th absolute position within a segment and $L_{\max}$ prescribes the maximum possible length to be modeled. Then, the actual input to the Transformer is the element-wise addition of the word embeddings and the positional encodings. If we simply adapt this positional encoding to our recurrence mechanism, the hidden state sequence would be computed schematically by

\[
\begin{aligned}
h_{\tau+1} &= f(h_\tau,\; E_{s_{\tau+1}} + U_{1:L}) \\
h_\tau &= f(h_{\tau-1},\; E_{s_\tau} + U_{1:L}),
\end{aligned}
\]

where $E_{s_\tau} \in \mathbb{R}^{L \times d}$ is the word embedding sequence of $s_\tau$, and $f$ represents a transformation function. Notice that both $E_{s_\tau}$ and $E_{s_{\tau+1}}$ are associated with the same positional encoding $U_{1:L}$. As a result, the model has no information to distinguish the positional difference between $x_{\tau,j}$ and $x_{\tau+1,j}$ for any $j = 1, \ldots, L$, resulting in a sheer performance loss.

In order to avoid this failure mode, the fundamental idea is to only encode the relative positional information in the hidden states. Conceptually, the positional encoding gives the model a temporal clue or “bias” about how information should be gathered, i.e., where to attend. For the same purpose, instead of incorporating bias statically into the initial embedding, one can inject the same information into the attention score of each layer. More importantly, it is more intuitive and generalizable to define the temporal bias in a relative manner. For instance, when a query vector $q_{\tau,i}$ attends on the key vectors $k_{\tau,\le i}$, it does not need to know the absolute position of each key vector to identify the temporal order of the segment. Instead, it suffices to know the relative distance between each key vector $k_{\tau,j}$ and itself $q_{\tau,i}$, i.e. $i - j$. Practically, one can create a set of relative positional encodings $R \in \mathbb{R}^{L_{\max} \times d}$, where the $i$-th row $R_i$ indicates a relative distance of $i$ between two positions. By injecting the relative distance dynamically into the attention score, the query vector can easily distinguish the representations of $x_{\tau,j}$ and $x_{\tau+1,j}$ from their different distances, making the state reuse mechanism feasible. Meanwhile, we won’t lose any temporal information, as the absolute position can be recovered recursively from relative distances.

Previously, the idea of relative positional encodings has been explored in the context of machine translation (Shaw et al., 2018) and music generation (Huang et al., 2018). Here, we offer a different derivation, arriving at a new form of relative positional encodings, which not only has a one-to-one correspondence to its absolute counterpart but also enjoys much better generalization empirically (see Section 4). Firstly, in the standard Transformer (Vaswani et al., 2017), the attention score between query $q_i$ and key vector $k_j$ within the same segment can be decomposed as

\[
A_{i,j}^{\mathrm{abs}} = \underbrace{E_{x_i}^\top W_q^\top W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_k U_j}_{(b)} + \underbrace{U_i^\top W_q^\top W_k E_{x_j}}_{(c)} + \underbrace{U_i^\top W_q^\top W_k U_j}_{(d)}.
\]

Following the idea of only relying on relative positional information, we propose to re-parameterize the four terms as follows

\[
A_{i,j}^{\mathrm{rel}} = \underbrace{E_{x_i}^\top W_q^\top W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_{k,R} R_{i-j}}_{(b)} + \underbrace{u^\top W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^\top W_{k,R} R_{i-j}}_{(d)}.
\]

• The first change we make is to replace all appearances of the absolute positional embedding $U_j$ for computing key vectors in term (b) and (d) with its relative counterpart $R_{i-j}$. This essentially reflects the prior that only the relative distance matters for where to attend. Note that $R$ is a sinusoid encoding matrix (Vaswani et al., 2017) without learnable parameters.

• Secondly, we introduce a trainable parameter $u \in \mathbb{R}^d$ to replace the query $U_i^\top W_q^\top$ in term (c). In this case, since the query vector is the same for all query positions, it suggests that the attentive bias towards different words should remain the same regardless of the query position. With a similar reasoning, a trainable parameter $v \in \mathbb{R}^d$ is added to substitute $U_i^\top W_q^\top$ in term (d).

• Finally, we deliberately separate the two weight matrices $W_{k,E}$ and $W_{k,R}$ for producing the content-based key vectors and location-based key vectors respectively.

Under the new parameterization, each term has an intuitive meaning: term (a) represents content-based addressing, term (b) captures a content-dependent positional bias, term (c) governs a global content bias, and (d) encodes a global positional bias.

In comparison, the formulation in Shaw et al. (2018) only has terms (a) and (b), dropping the two bias terms (c) and (d). Moreover, Shaw et al. (2018) merge the multiplication $W_k R$ into a single trainable matrix $\hat{R}$, which abandons the inductive bias built into the original sinusoid positional encoding (Vaswani et al., 2017). In contrast, our relative positional embedding $R$ adapts the sinusoid formulation. As a benefit of the inductive bias, a model trained on a memory of some certain length can automatically generalize to a memory several times longer during evaluation.

Equipping the recurrence mechanism with our proposed relative positional embedding, we finally arrive at the Transformer-XL architecture. For completeness, we summarize the computational procedure for an $N$-layer Transformer-XL with a single attention head here. For $n = 1, \ldots, N$:

\[
\begin{aligned}
\widetilde{h}_\tau^{n-1} &= \left[ \mathrm{SG}(m_\tau^{n-1}) \circ h_\tau^{n-1} \right] \\
q_\tau^n,\; k_\tau^n,\; v_\tau^n &= h_\tau^{n-1} {W_q^n}^\top,\; \widetilde{h}_\tau^{n-1} {W_{k,E}^n}^\top,\; \widetilde{h}_\tau^{n-1} {W_v^n}^\top \\
A_{\tau,i,j}^n &= {q_{\tau,i}^n}^\top k_{\tau,j}^n + {q_{\tau,i}^n}^\top W_{k,R}^n R_{i-j} + u^\top k_{\tau,j} + v^\top W_{k,R}^n R_{i-j} \\
a_\tau^n &= \text{Masked-Softmax}(A_\tau^n)\, v_\tau^n \\
o_\tau^n &= \text{LayerNorm}(\text{Linear}(a_\tau^n) + h_\tau^{n-1}) \\
h_\tau^n &= \text{Positionwise-Feed-Forward}(o_\tau^n)
\end{aligned}
\]

with $h_\tau^0 := E_{s_\tau}$ defined as the word embedding sequence. In addition, it is worth mentioning that a naive way to compute $A$ requires computing $W_{k,R}^n R_{i-j}$ for all pairs $(i, j)$, whose cost is quadratic w.r.t. the sequence length. However, noticing that the value of $i - j$ only ranges from zero to the sequence length, we show a simple computation procedure in Appendix B, which reduces the cost to be linear w.r.t. the sequence length.

4 Experiments

4.1 Main Results

We apply Transformer-XL to a variety of datasets on both word-level and character-level language
modeling to have a comparison with state-of-the-art systems, including WikiText-103 (Merity et al., 2016), enwik8 (LLC, 2009), text8 (LLC, 2009), One Billion Word (Chelba et al., 2013), and Penn Treebank (Mikolov and Zweig, 2012).

WikiText-103 is the largest available word-level language modeling benchmark with long-term dependency. It contains 103M training tokens from 28K articles, with an average length of 3.6K tokens per article, which allows testing the ability of long-term dependency modeling. We set the attention length to 384 during training and 1600 during evaluation. We adopted adaptive softmax and input representations (Baevski and Auli, 2018; Grave et al., 2016a). As shown in Table 1, Transformer-XL reduces the previous state-of-the-art (SoTA) perplexity from 20.5 to 18.3, which demonstrates the superiority of the Transformer-XL architecture.

Model | #Param | PPL
Grave et al. (2016b) - LSTM | - | 48.7
Bai et al. (2018) - TCN | - | 45.2
Dauphin et al. (2016) - GCNN-8 | - | 44.9
Grave et al. (2016b) - LSTM + Neural cache | - | 40.8
Dauphin et al. (2016) - GCNN-14 | - | 37.2
Merity et al. (2018) - QRNN | 151M | 33.0
Rae et al. (2018) - Hebbian + Cache | - | 29.9
Ours - Transformer-XL Standard | 151M | 24.0
Baevski and Auli (2018) - Adaptive Input | 247M | 20.5
Ours - Transformer-XL Large | 257M | 18.3

Table 1: Comparison with state-of-the-art results on WikiText-103. indicates contemporary work.

The dataset enwik8 contains 100M bytes of unprocessed Wikipedia text. We compare our architecture with the previous results in Table 2. Under the model size constraint, the 12-layer Transformer-XL achieves a new SoTA result, outperforming the 12-layer vanilla Transformer from Al-Rfou et al. (2018) by 0.05, while both Transformer variants have a large margin over conventional RNN-based models. Notably, our 12-layer architecture achieves the same result as the 64-layer network from Al-Rfou et al. (2018), using only 17% of the parameter budget. In order to see whether better performances can be obtained by increasing the model size, we train 18-layer and 24-layer Transformer-XLs with increased model sizes. With the attention length 784 during training and 3,800 during evaluation, we obtained a new SoTA result and our method is the first to break through 1.0 on widely-studied character-level benchmarks. Different from Al-Rfou et al. (2018), Transformer-XL does not need any auxiliary losses, and thus all benefits are credited to a better architecture.

Model | #Param | bpc
Ha et al. (2016) - LN HyperNetworks | 27M | 1.34
Chung et al. (2016) - LN HM-LSTM | 35M | 1.32
Zilly et al. (2016) - RHN | 46M | 1.27
Mujika et al. (2017) - FS-LSTM-4 | 47M | 1.25
Krause et al. (2016) - Large mLSTM | 46M | 1.24
Knol (2017) - cmix v13 | - | 1.23
Al-Rfou et al. (2018) - 12L Transformer | 44M | 1.11
Ours - 12L Transformer-XL | 41M | 1.06
Al-Rfou et al. (2018) - 64L Transformer | 235M | 1.06
Ours - 18L Transformer-XL | 88M | 1.03
Ours - 24L Transformer-XL | 277M | 0.99

Table 2: Comparison with state-of-the-art results on enwik8.

Similar to but different from enwik8, text8 contains 100M processed Wikipedia characters created by lowercasing the text and removing any character other than the 26 letters a through z, and space. Due to the similarity, we simply adapt the best model and the same hyper-parameters on enwik8 to text8 without further tuning. The comparison with previous methods is summarized in Table 3. Again, Transformer-XL achieves the new SoTA result with a clear margin.

Model | #Param | bpc
Cooijmans et al. (2016) - BN-LSTM | - | 1.36
Chung et al. (2016) - LN HM-LSTM | 35M | 1.29
Zilly et al. (2016) - RHN | 45M | 1.27
Krause et al. (2016) - Large mLSTM | 45M | 1.27
Al-Rfou et al. (2018) - 12L Transformer | 44M | 1.18
Al-Rfou et al. (2018) - 64L Transformer | 235M | 1.13
Ours - 24L Transformer-XL | 277M | 1.08

Table 3: Comparison with state-of-the-art results on text8.

Model | #Param | PPL
Shazeer et al. (2014) - Sparse Non-Negative | 33B | 52.9
Chelba et al. (2013) - RNN-1024 + 9 Gram | 20B | 51.3
Kuchaiev and Ginsburg (2017) - G-LSTM-2 | - | 36.0
Dauphin et al. (2016) - GCNN-14 bottleneck | - | 31.9
Jozefowicz et al. (2016) - LSTM | 1.8B | 30.6
Jozefowicz et al. (2016) - LSTM + CNN Input | 1.04B | 30.0
Shazeer et al. (2017) - Low-Budget MoE | ∼5B | 34.1
Shazeer et al. (2017) - High-Budget MoE | ∼5B | 28.0
Shazeer et al. (2018) - Mesh Tensorflow | 4.9B | 24.0
Baevski and Auli (2018) - Adaptive Input | 0.46B | 24.1
Baevski and Auli (2018) - Adaptive Input | 1.0B | 23.7
Ours - Transformer-XL Base | 0.46B | 23.5
Ours - Transformer-XL Large | 0.8B | 21.8

Table 4: Comparison with state-of-the-art results on One Billion Word. indicates contemporary work.
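
Tables 1 to 4 report perplexity (PPL) for the word-level datasets and bits per character (bpc) for the character-level ones. Both are monotone transforms of the same quantity, the mean negative log-likelihood per token, as the following small Python sketch illustrates. The function names are illustrative and the input probabilities are made up; nothing here comes from the released code.

```python
import math

def mean_nll(target_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)

def perplexity(nll_nats):
    """Perplexity is the exponential of the mean NLL measured in nats."""
    return math.exp(nll_nats)

def bits_per_character(nll_nats):
    """bpc is the same mean NLL expressed in bits (log base 2)."""
    return nll_nats / math.log(2)

# Toy example: a model that gives probability 0.5 to every correct symbol.
nll = mean_nll([0.5, 0.5, 0.5, 0.5])
ppl = perplexity(nll)           # approximately 2.0
bpc = bits_per_character(nll)   # approximately 1.0
```

Read this way, the 0.99 bpc reported on enwik8 corresponds to a geometric-mean probability of 2^(-0.99), roughly 0.50, assigned to the correct next byte.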
One Billion Word does not preserve any long-term dependency because sentences have been shuffled. Consequently, this dataset mainly tests the ability of modeling only short-term dependency. The comparison between Transformer-XL and the other methods is shown in Table 4. Although Transformer-XL is mainly designed to better capture longer-term dependency, it dramatically improves the single-model SoTA from 23.7 to 21.8. Specifically, Transformer-XL significantly outperforms a contemporary method using vanilla Transformers (Baevski and Auli, 2018), suggesting the advantage of Transformer-XL is generalizable to modeling short sequences.

We also report the results on word-level Penn Treebank in Table 5. Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight average to Transformer-XL. With proper regularization, Transformer-XL achieves a new SoTA result among models without two-step finetuning. Penn Treebank has only 1M training tokens, which implies that Transformer-XL also generalizes well even on small datasets.

Model | #Param | PPL
Inan et al. (2016) - Tied Variational LSTM | 24M | 73.2
Zilly et al. (2016) - Variational RHN | 23M | 65.4
Zoph and Le (2016) - NAS Cell | 25M | 64.0
Merity et al. (2017) - AWD-LSTM | 24M | 58.8
Pham et al. (2018) - Efficient NAS | 24M | 58.6
Liu et al. (2018) - Differentiable NAS | 23M | 56.1
Yang et al. (2017) - AWD-LSTM-MoS | 22M | 55.97
Melis et al. (2018) - Dropout tuning | 24M | 55.3
Ours - Transformer-XL | 24M | 54.52
Merity et al. (2017) - AWD-LSTM+Finetune† | 24M | 57.3
Yang et al. (2017) - MoS+Finetune† | 22M | 54.44

Table 5: Comparison with state-of-the-art results on Penn Treebank. † indicates using two-step finetuning.

4.2 Ablation Study

We conduct two sets of ablation studies to examine the effects of two proposed techniques used in Transformer-XL: the recurrence mechanism and the new positional encoding scheme.

The first study is performed on WikiText-103, which requires modeling long-term dependency. The results are reported in Table 6. Among the compared encoding schemes, Shaw et al. (2018) is relative, while Vaswani et al. (2017) and Al-Rfou et al. (2018) are absolute. “Full” and “half” losses refer to applying a cross entropy loss to all or the recent half positions in the segment. We found that absolute encodings only work well with half losses because half losses exclude positions with very short attention lengths during training for better generalization. Table 6 shows that both the recurrence mechanism and our encoding scheme are necessary to achieve the best performance, as well as generalizing to longer attention sequences during evaluation time. Although the backpropagation length during training is only 128, with the two techniques the attention length can be increased to 640 at test time. In the standard setting with 151M parameters, the perplexity decreases as the attention length increases.

Since the recurrence mechanism costs additional memory, we also compare Transformer-XL with baselines under the same GPU memory constraints. As shown in Table 10 in Appendix A, despite using a shorter backpropagation length, Transformer-XL remains superior to the baselines.

The second study aims to isolate the effects of resolving the context fragmentation problem from the benefit of capturing longer context length. In order to achieve this goal, we deliberately choose a dataset that does not require long-term dependency, so that any improvement from establishing the recurrence can be attributed to solving the context fragmentation. Specifically, we perform this controlled experiment on the One Billion Word dataset, which can only benefit from removing the context fragmentation. We train a 20-layer Transformer-XL with ∼0.3B parameters for 400K steps. As shown in Table 7, using segment-level recurrence substantially improves performance even when long-term dependency is not needed, which is consistent with our previous discussion that the recurrence mechanism resolves the context fragmentation problem. Moreover, our relative positional encodings are also superior to Shaw et al. (2018) on short sequences.

4.3 Relative Effective Context Length

Khandelwal et al. (2018) proposed a method to evaluate the Effective Context Length (ECL) of a sequence model. ECL is the longest length to which increasing the context span would lead to a gain more than a threshold. However, ECL ignores the fact that it is harder to get improvement when a model already achieves a lower perplexity using only a shorter context, and thus it is not suitable for fair comparison among multiple models. We instead propose a new metric
Remark | Recurrence | Encoding | Loss | PPL init | PPL best | Attn Len
Transformer-XL (128M) | ✓ | Ours | Full | 27.02 | 26.77 | 500
- | ✓ | Shaw et al. (2018) | Full | 27.94 | 27.94 | 256
- | ✓ | Ours | Half | 28.69 | 28.33 | 460
- | ✗ | Ours | Full | 29.59 | 29.02 | 260
- | ✗ | Ours | Half | 30.10 | 30.10 | 120
- | ✗ | Shaw et al. (2018) | Full | 29.75 | 29.75 | 120
- | ✗ | Shaw et al. (2018) | Half | 30.50 | 30.50 | 120
- | ✗ | Vaswani et al. (2017) | Half | 30.97 | 30.97 | 120
Transformer (128M)† | ✗ | Al-Rfou et al. (2018) | Half | 31.16 | 31.16 | 120
Transformer-XL (151M) | ✓ | Ours | Full | 23.43 | 23.09 | 640
(same row, continued) | | | | | 23.16 | 450
(same row, continued) | | | | | 23.35 | 300

Table 6: Ablation study on WikiText-103. For the first two blocks, we use a slightly smaller model (128M parameters). † indicates that the corresponding row is reduced to the same setting as the Transformer network in (Al-Rfou et al., 2018), except that two auxiliary losses are not implemented in our experiments. “PPL init” refers to using the same length as training. “PPL best” indicates the perplexity obtained by using the optimal length. “Attn Len” is the shortest possible attention length during evaluation to achieve the corresponding result (PPL best). Increasing the attention length during evaluation improves performance only when our positional encoding is used. The “Transformer-XL (151M)” setting uses a standard parameter budget as previous work (Merity et al., 2018), where we observe a similar effect when increasing the attention length during evaluation.

called Relative Effective Context Length (RECL). RECL is defined on a model group instead of a
single model, and the gain of a long context is measured by the relative improvement over the best
short context model. As such, the model group shares the same baseline to enable fair comparison.
RECL also has a parameter r, which means constraining the comparison on top-r hard examples. See
Appendix C for more details about RECL. As shown in Table 8, Transformer-XL manages to model
dependency of 900 words long on average with r = 0.1. The RECL of Transformer-XL is 80% and 450%
longer than recurrent networks and Transformer respectively. Both the recurrence mechanism and our
positional encodings contribute to a longer RECL. This further substantiates our argument that
Transformer-XL is able to model longer-term dependency.

4.4 Generated Text

Trained only on WikiText-103 which is medium-sized, Transformer-XL is already able to generate
relatively coherent articles with thousands of tokens without manual cherry picking, despite minor
flaws. Please refer to Appendix E for samples.

4.5 Evaluation Speed

Finally, we compare the evaluation speed of our model with the vanilla Transformer model (Al-Rfou
et al., 2018). As shown in Table 9, due to the state reuse scheme, Transformer-XL achieves up to a
1,874 times speedup during evaluation.

Method                              PPL
Ours                                25.2
With Shaw et al. (2018) encodings   25.7
Without recurrence                  27.1

Table 7: Ablation study on One Billion Word, a dataset without long-term dependency.

Model                               r = 0.1  r = 0.5  r = 1.0
Transformer-XL 151M                 900      800      700
QRNN                                500      400      300
LSTM                                400      300      200
Transformer-XL 128M                 700      600      500
- use Shaw et al. (2018) encoding   400      400      300
- remove recurrence                 300      300      300
Transformer                         128      128      128

Table 8: Relative effective context length (RECL) comparison. See text for the definition of RECL and r. The first three
models and the last four models are compared as two model groups when we calculate RECL (RECL is computed on a
model group rather than a single model). Each group has the same parameter budget.

Attn Len  How much Al-Rfou et al. (2018) is slower
3,800     1,874x
2,800     1,409x
1,800     773x
800       363x

Table 9: Slowdown in terms of running time during evaluation. Evaluation is based on per-token time on one GPU.
5 Conclusions

Transformer-XL obtains strong perplexity results, models longer-term dependency than RNNs and
Transformer, achieves substantial speedup during evaluation, and is able to generate coherent text
articles. We envision interesting applications of Transformer-XL in the fields of text generation,
unsupervised feature learning, image and speech modeling.

Acknowledgments

ZD and YY were supported in part by National Science Foundation (NSF) under the grant IIS-1546329
and by the DOE-Office of Science under the grant ASCR #KJ040201. ZY and RS were supported in part
by the Office of Naval Research grant N000141812861, the NSF grant IIS1763562, the Nvidia
fellowship, and the Siebel scholarship.

References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.

Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. Topicrnn: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016a. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016b. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, and Douglas Eck. 2018. An improved relative self-attention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. arXiv preprint arXiv:1511.03962.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, and Shuichi Adachi. 2018. Sigsoftmax: Reanalysis of the softmax bottleneck. arXiv preprint arXiv:1805.10829.

Nan Rosemary Ke, Anirudh Goyal ALIAS PARTH GOYAL, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. 2018. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7650–7661.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.

Byron Knoll. 2017. cmix v13. http://www.byronknoll.com/cmix.html.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork rnn. arXiv preprint arXiv:1402.3511.

Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2016. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959.

Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for lstm networks. arXiv preprint arXiv:1703.10722.

Quoc V Le, Navdeep Jaitly, and Geoffrey E Hinton. 2015. A simple way to initialize recurrent networks of rectified linear units. arXiv preprint arXiv:1504.00941.

Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. 2018. Independently recurrent neural network (indrnn): Building a longer and deeper rnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5457–5466.

Hanxiao Liu, Karen Simonyan, and Yiming Yang. 2018. Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055.

MultiMedia LLC. 2009. Large text compression benchmark.

Gábor Melis, Charles Blundell, Tomáš Kočiskỳ, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. 2018. Pushing the bounds of dropout. arXiv preprint arXiv:1805.09208.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12(234-239):8.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246–252. Citeseer.

Asier Mujika, Florian Meier, and Angelika Steger. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, pages 5915–5924.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. CoRR, abs/1211.5063.

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V Le, and Jeff Dean. 2018. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268.

Ofir Press and Lior Wolf. 2016. Using the output embedding to improve language models. arXiv preprint arXiv:1608.05859.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/languageunsupervised/language_understanding_paper.pdf.
Jack W Rae, Chris Dyer, Peter Dayan, and Timothy P Lillicrap. 2018. Fast parametric learning with activation memorization. arXiv preprint arXiv:1803.10049.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10434–10443.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Noam Shazeer, Joris Pelemans, and Ciprian Chelba. 2014. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454.

Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. 2018. Learning longer-term dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pages 2856–2864.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.

Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

A Ablation Study with Memory Constraints
A Ablation Study with Memory Constraints

Backprop Len  Recurrence  Encoding  Loss     pplx best  pplx init  Attn Len
128           ✓           Ours      Full     26.77      27.02      500
128           ✓           Ours      Partial  28.33      28.69      460
176           ✗           Ours      Full     27.98      28.43      400
172           ✗           Ours      Partial  28.83      28.83      120

Table 10: Ablation study on WikiText-103 with the same GPU memory constraints.

Table 10 compares Transformer-XL with the baseline under the same memory budget. Transformer-XL
still outperforms the baseline even with a shorter backprop length.

B Efficient Computation of the Attention with Relative Positional Embedding
As we discussed in section 3.3, the naive way of computing $W_{k,R} R_{i-j}$ for all pairs $(i, j)$ is subject
to a quadratic cost. Here, we present a simple method with only a linear cost. Firstly, notice that the
relative distance $i - j$ can only be an integer from $0$ to $M + L - 1$, where $M$ and $L$ are the memory length
and segment length respectively. Hence, the rows of the matrix

\[
Q :=
\begin{bmatrix}
R_{M+L-1}^\top \\ R_{M+L-2}^\top \\ \vdots \\ R_1^\top \\ R_0^\top
\end{bmatrix}
W_{k,R}^\top
=
\begin{bmatrix}
[W_{k,R} R_{M+L-1}]^\top \\ [W_{k,R} R_{M+L-2}]^\top \\ \vdots \\ [W_{k,R} R_1]^\top \\ [W_{k,R} R_0]^\top
\end{bmatrix}
\in \mathbb{R}^{(M+L) \times d}
\]

consist of all possible vector outputs of $W_{k,R} R_{i-j}$ for any $(i, j)$. Note that we have defined $Q$ in a
reversed order, i.e., $Q_k = W_{k,R} R_{M+L-1-k}$, to make further discussion easier.
Next, we collect the term (b) for all possible $i, j$ into the following $L \times (M + L)$ matrix,

\[
B =
\begin{bmatrix}
q_0^\top W_{k,R} R_M & \cdots & q_0^\top W_{k,R} R_0 & 0 & \cdots & 0 \\
q_1^\top W_{k,R} R_{M+1} & \cdots & q_1^\top W_{k,R} R_1 & q_1^\top W_{k,R} R_0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top W_{k,R} R_{M+L-1} & \cdots & q_{L-1}^\top W_{k,R} R_{L-1} & q_{L-1}^\top W_{k,R} R_{L-2} & \cdots & q_{L-1}^\top W_{k,R} R_0
\end{bmatrix}
\]

\[
\phantom{B} =
\begin{bmatrix}
q_0^\top Q_{L-1} & \cdots & q_0^\top Q_{M+L-1} & 0 & \cdots & 0 \\
q_1^\top Q_{L-2} & \cdots & q_1^\top Q_{M+L-2} & q_1^\top Q_{M+L-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top Q_0 & \cdots & q_{L-1}^\top Q_M & q_{L-1}^\top Q_{M+1} & \cdots & q_{L-1}^\top Q_{M+L-1}
\end{bmatrix}
\]

Then, we further define

\[
\widetilde{B} = q Q^\top =
\begin{bmatrix}
q_0^\top Q_0 & \cdots & q_0^\top Q_M & q_0^\top Q_{M+1} & \cdots & q_0^\top Q_{M+L-1} \\
q_1^\top Q_0 & \cdots & q_1^\top Q_M & q_1^\top Q_{M+1} & \cdots & q_1^\top Q_{M+L-1} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top Q_0 & \cdots & q_{L-1}^\top Q_M & q_{L-1}^\top Q_{M+1} & \cdots & q_{L-1}^\top Q_{M+L-1}
\end{bmatrix}.
\]

Now, it is easy to see an immediate relationship between $B$ and $\widetilde{B}$, where the $i$-th row of $B$ is simply a
left-shifted version of the $i$-th row of $\widetilde{B}$: row $i$ is shifted left by $L - 1 - i$ positions, and the vacated
entries on the right are zero. Hence, the computation of $B$ only requires a matrix multiplication
$q Q^\top$ to compute $\widetilde{B}$ and then a set of left-shifts.
Similarly, we can collect all term (d) for all possible $i, j$ into another $L \times (M + L)$ matrix $D$,

\[
D =
\begin{bmatrix}
v^\top Q_{L-1} & \cdots & v^\top Q_{M+L-1} & 0 & \cdots & 0 \\
v^\top Q_{L-2} & \cdots & v^\top Q_{M+L-2} & v^\top Q_{M+L-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
v^\top Q_0 & \cdots & v^\top Q_M & v^\top Q_{M+1} & \cdots & v^\top Q_{M+L-1}
\end{bmatrix}.
\]

Then, we can follow the same procedure to define

\[
\tilde{d} = [Q v]^\top =
\begin{bmatrix}
v^\top Q_0 & \cdots & v^\top Q_M & v^\top Q_{M+1} & \cdots & v^\top Q_{M+L-1}
\end{bmatrix}.
\]

Again, each row of $D$ is simply a left-shifted version of $\tilde{d}$. Hence, the main computation cost comes from
the matrix-vector multiplication $\tilde{d} = [Q v]^\top$, which is not expensive any more.
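
To make the left-shift relationship concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The function names `term_b_naive` and `term_b_shifted`, the list-of-lists layout, and the toy sizes are assumptions made for illustration; `R[k]` is taken to be the embedding of relative distance k, and entries for keys that a query cannot attend to are set to zero as in the matrix B above.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def term_b_naive(q, R, W, M):
    """B[i][j] = q_i^T W R_{i+M-j} for j <= M+i, else 0.
    Recomputes the projection W R for every (i, j) pair."""
    L = len(q)
    B = [[0.0] * (M + L) for _ in range(L)]
    for i in range(L):
        for j in range(M + i + 1):
            B[i][j] = dot(q[i], matvec(W, R[i + M - j]))
    return B

def term_b_shifted(q, R, W, M):
    """Same matrix, computed from B_tilde = q Q^T followed by row-wise left shifts."""
    L = len(q)
    # Q in reversed order: Q[k] = W R_{M+L-1-k}; only M+L projections are needed.
    Q = [matvec(W, R[M + L - 1 - k]) for k in range(M + L)]
    B_tilde = [[dot(q[i], Q[k]) for k in range(M + L)] for i in range(L)]
    B = []
    for i in range(L):
        s = L - 1 - i                      # row i is shifted left by L-1-i positions
        B.append(B_tilde[i][s:] + [0.0] * s)
    return B

# Toy check on random inputs (sizes are arbitrary).
random.seed(0)
L, M, d = 3, 4, 5
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(L)]
R = [[random.gauss(0, 1) for _ in range(d)] for _ in range(M + L)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
B_naive = term_b_naive(q, R, W, M)
B_fast = term_b_shifted(q, R, W, M)
```

In this sketch the naive version applies $W_{k,R}$ once per (i, j) pair, while the shifted version applies it only M + L times; the term (d) matrix D could be handled the same way with a single vector v in place of the per-row queries.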

C Details About RECL

(a) Transformer-XL vs RNNs (b) Transformer-XL vs Baseline

Figure 3: Visualizing unnormalized relative perplexity gains with r = 0.1.

(a) Transformer-XL vs RNNs (b) Transformer-XL vs Baseline

Figure 4: Perplexity vs context length.

In this section, we describe the details of the metric RECL. Let $\mathcal{M} = \{m_1, m_2, \cdots, m_N\}$ be a model
group consisting of $N$ models. Let $l_i(c, t)$ denote the loss of model $m_i$ on the $t$-th token in the corpus
with a context length $c$. Concretely, the loss can be written as

\[
l_i(c, t) = -\log P_{m_i}(x_t \mid x_{t-1}, \cdots, x_{t-c})
\]

where $P_{m_i}$ is the probability distribution given by model $m_i$, and $x_t$ is the $t$-th token in the corpus. Given
a short context length $c$ and a long context length $c'$ such that $c' \geq c$, we can further define a baseline for
each position $t$,

\[
b(c, t) = \min_{i=1}^{N} l_i(c, t)
\]

The relative loss of $m_i$ w.r.t. the model group $\mathcal{M}$ is written as

\[
f_i(c, c') = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \min\big( b(c, t),\, l_i(c', t) \big)
\]

The above equation uses the minimum loss of all models on the short length $c$ as a baseline, and only
losses smaller than the baseline will be effectively counted towards the relative loss. This enables fair
comparison between multiple models because all models with a long context length $c'$ need to improve
over the same baseline. Sometimes we only care about those positions where the baseline performs
poorly (which means short-term dependency with context length $c$ is not sufficient), so given a ratio
parameter $r$, we define the set $\mathcal{T}$ in the above equation as

\[
\mathcal{T} = \text{top-}r \text{ positions } t \text{ with largest } b(c, t)
\]

The relative gain is subsequently defined as the relative perplexity reduction:

\[
g_i(c, c') = \frac{\exp f_i(c, c) - \exp f_i(c, c')}{\exp f_i(c, c)}
\]

Given a step size $\Delta$, we then use an algorithm to find the RECL by thresholding the relative gain:
1. Set initial short context length $c$, and long context length $c' = c + \Delta$
2. Compute $g_i(c, c')$. If $g_i(c, c') < 0.01$, return RECL $= c$. If $g_i(c, c') \geq 0.01$, set $c = c'$, $c' = c + \Delta$
and repeat step 2.
In Figure 3, we visualize the unnormalized relative perplexity gains $(\exp f_i(c, c) - \exp f_i(c, c'))$ with
various pairs of $(c, c')$ when $r = 0.1$. It is clear that Transformer-XL has a longer RECL compared to
RNNs and other baselines because the relative gains are substantially larger.
For reference, we plot the perplexities with varying context lengths in Figure 4. The y-axis denotes
the “normal” perplexity (not calibrated by baselines).
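
The definitions in this section can be sketched in a few lines of Python. This is an illustrative reading of the equations rather than the authors' evaluation code: the data layout `losses[m][c][t]` (per-token loss of model m at context length c), the function names, the `max_len` stopping bound, and the toy numbers are all assumptions.

```python
import math

def relative_loss(losses, i, c, c_long, r):
    """f_i(c, c'): mean over the top-r hardest positions of min(b(c, t), l_i(c', t))."""
    n_tokens = len(losses[i][c])
    # b(c, t): best loss of any model in the group at the short context length c.
    baseline = [min(model[c][t] for model in losses) for t in range(n_tokens)]
    k = max(1, int(r * n_tokens))
    hard = sorted(range(n_tokens), key=lambda t: baseline[t], reverse=True)[:k]
    return sum(min(baseline[t], losses[i][c_long][t]) for t in hard) / k

def relative_gain(losses, i, c, c_long, r):
    """g_i(c, c'): relative perplexity reduction from extending the context."""
    short = math.exp(relative_loss(losses, i, c, c, r))
    longer = math.exp(relative_loss(losses, i, c, c_long, r))
    return (short - longer) / short

def recl(losses, i, c_init, delta, r, max_len, threshold=0.01):
    """Grow the context by delta until the relative gain drops below the threshold."""
    c = c_init
    while c + delta <= max_len:            # max_len is an assumed bound on available lengths
        if relative_gain(losses, i, c, c + delta, r) < threshold:
            return c
        c += delta
    return c

# Toy model group: model 0 benefits from context up to 200 tokens, model 1 never does.
toy_losses = [
    {100: [2.0] * 4, 200: [1.5] * 4, 300: [1.5] * 4, 400: [1.5] * 4},
    {100: [2.0] * 4, 200: [2.0] * 4, 300: [2.0] * 4, 400: [2.0] * 4},
]
recl_model0 = recl(toy_losses, 0, c_init=100, delta=100, r=1.0, max_len=400)
recl_model1 = recl(toy_losses, 1, c_init=100, delta=100, r=1.0, max_len=400)
```

Because the baseline is shared by the whole group, model 1 gets no credit at any length in this toy setup, while model 0 is credited only up to the length at which its own loss stops improving.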

D Attention Visualization
In this section, we provide some visualization of the attention learned by the SoTA model on the
WikiText-103 validation set. Recall that this model has 16 10-head transformer layers and relies on
a memory of length 640.

Figure 5: Average attention over the previous 640 tokens, where each row corresponds to an attention head and each
column corresponds to a relative location. There are 160 attention heads in total, and every 10 heads come from a
single layer. Darker colors indicate higher values.

The first visualization aims at revealing the overall trend of where the model is attending. Specifically,
for each attention head of each layer, we average the attention distributions of all tokens in the validation
set. This is shown in Fig. 5. As we can see, the overall trend is to focus more on the nearby tokens
than the faraway ones. However, it is also very clear that some attention heads have a wider attention
distribution over the entire memory span, notably the head 8 from layer 1, head 78 from layer 8, and the
head 158 from layer 16.
Since we are focused on learning long-range dependency, we are especially interested in these heads
with a wider attention span. Thus, in the second set of visualization, we pick the three notable heads
mentioned above, and visualize their attention behavior for a randomly chosen position, as shown in Fig.
6. Here, we see three different patterns of wider attention:
• For the head 8 in the 1st layer, we see an almost uniform attention over the entire memory span. This
is quite intuitive, as lower-level layers need to screen the entire memory span to decide where to focus
for higher-level layers.
(a) Head 8 from layer 1.

(b) Head 78 from layer 8.

Figure 6: Visualization of the three heads with a wide attention range. Each row corresponds to a target
location/token and each column corresponds to a context location/token. Tokens in the memory that have top 20%
attention values are highlighted in red.

• For the head 78 in the 8th layer (a middle-level layer), we see a very sparse attention pattern scattered
in all ranges of the memory. Again, this fits our intuition well: as information accumulates, the
network may focus on particular positions of special interest.
• For the head 158 in the 16th layer (i.e. the last layer), each target location (corresponding to each row)
has its own distinct sparse focus, differing from head 78 where target locations largely share the same
attentive location in memory. Meanwhile, the pattern is also different from the case of head 8, where
a few locations are clearly attended more than others.
Finally, as we have discussed in section 3.3, the attention score can be decomposed into four intuitive
terms. Here, we want to further investigate how these four terms contribute to the overall attention trend
in Fig. 5. Since the term (c) represents the global content bias, i.e., the prior importance of each word
regardless of the context, we will leave it out and focus on the terms (a), (b) and (d). So, for each term,
we take the Softmax w.r.t. the memory span and average the resulting distribution of all tokens in the
validation set. The results are visualized in Fig. 7:
• Since term (a) is fully content-based addressing, when averaging over all target words, the result is
essentially uniform over the entire context, except for a few very close words, which are likely to be
semantically similar to the target word.
• The overall trend of term (b) highly resembles that of the entire attention distribution in Fig. 5. It
suggests that the global trend of focusing on the nearby context is largely contributed by this content-
dependent positional bias.
• The overall trend of term (d) is also focusing more on nearby words. However, compared to the trend
of term (b), it is clearly flatter and biases towards a longer context.
(a) Term (a).

(b) Term (b).

Figure 7: Visualization of the three terms in computing the attention score. Each row corresponds to an attention
head and each column corresponds to a relative location.

E Generated Text

In this section, we present some generated text from our best model trained on the Wikitext-103 dataset.
We seed our Transformer-XL with a context of at most 512 consecutive tokens randomly sampled
from the test set of Wikitext-103. Then, we run Transformer-XL to generate a pre-defined number of
tokens (500 or 1,000 in our case). For each generation step, we first find the top-40 probabilities of the
next-step distribution and sample from top-40 tokens based on the re-normalized distribution. To help
reading, we detokenize the context, the generated text and the reference text. Three generated examples
are shown in Tables 11, 12, and 13. Note that we do not perform any cherry picking and present the
first three examples we generate in the paper. In the text, “= text =”, “= = text = =” and
“= = = text = = =” denote the Wikipedia page title, section title and subsection title, respectively, due to the
original data preprocessing procedure of Wikitext-103 (Merity et al., 2016).
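
The per-step sampling rule described above (keep the 40 most probable tokens, re-normalize, sample) can be sketched as follows. This is a minimal illustration and not the authors' generation code; the function name `sample_top_k`, the plain-list probability vector, and the toy distribution are assumptions.

```python
import random

def sample_top_k(probs, k=40, rng=random):
    """Sample a token id from the k most probable entries of `probs`, re-normalized."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in top)
    weights = [probs[i] / total for i in top]   # re-normalized distribution over the top k
    return rng.choices(top, weights=weights, k=1)[0]

# Toy next-step distribution over a 4-token vocabulary; with k=2 only ids 0 and 1 can appear.
rng = random.Random(0)
toy_probs = [0.5, 0.3, 0.15, 0.05]
samples = [sample_top_k(toy_probs, k=2, rng=rng) for _ in range(200)]
```

In a full generation loop, `probs` would be the model's next-token distribution at each step, and the sampled id would be appended to the context before the next step.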
As we can see, though only trained on 100M tokens, Transformer-XL is a strong model at generating
long text articles, particularly in the following aspects:
• Transformer-XL is able to structurally maintain the sectional arrangement of Wikipedia.
• Transformer-XL manages to semantically stay on the same topic throughout the course of genera-
tion.
• Long-range references are common in the generated text.
• Transformer-XL often generates novel content that is not present in the training data.
For a more detailed explanation of the interesting observations in each example, please refer to the
corresponding caption.
Despite the overall excellence of the generation quality, the model can only perceive the seed context
and hallucinate what to generate based on the limited knowledge (100M tokens only) it is trained on.
As a result, the generated text sometimes looks clearly relevant but not close enough or to the point
compared to what a human writer would do. That said, we believe this issue is mostly a problem of limited
training data size and could be alleviated by using a larger training set.

Context:
Kershaw started the 2010 season by posting a 3.07 ERA in April, but did so by walking 22 batters in 29 innings. On May 4, he had his worst start of his career
against the Milwaukee Brewers at Dodger Stadium, throwing just 57 pitches in 11 / 3 innings, while retiring only four of the 13 batters he faced — including the
pitcher. He was booed loudly upon being pulled from the game. Kershaw said after the game, " I didn’t give our team any kind of chance. It’s just not a good
feeling to let your teammates down, let everybody down. It stings, it hurts. I ’ve got to figure things out. " Kershaw rebounded his next start by pitching an 8 inning
two-hitter and out-dueling the then undefeated Ubaldo Jiménez. He credited his control of the slider being the major turning point for him. Later in the season, he
was suspended for five games after hitting Aaron Rowand of the Giants with a pitch in a game on July 20. The incident occurred after both teams were given a
warning following Giants ace Tim Lincecum hitting Matt Kemp earlier in the game. He threw his first career complete game shutout on September 14, 2010 also
against San Francisco and finished the season with a record of 13 – 10 and a 2.91 ERA in 32 starts, pitching 2041 / 3 innings and recording 212 strikeouts.
= = = = 2011 season: 1st Cy Young Award = = = =
After finishing the 2010 season strong, the Dodgers named Kershaw as the Opening Day Starter for the 2011 season. On May 29, he pitched the second complete-
game shutout of his career, striking out 10 while winning a two-hitter against the Florida Marlins, 8 – 0; he also had two singles and an RBI, scoring twice in the
game. He produced his third career shutout on June 20, a two-hit, 11-strikeout effort against the Detroit Tigers. Kershaw became the first Dodgers starter to strike
out the side in the 9th inning since Sandy Koufax’s perfect game. In his next start, on June 26, Kershaw pitched another complete game (against the Los Angeles
Angels of Anaheim ). He became the first Dodger starter to have back-to-back complete game victories since Jeff Weaver in the 2005 season and the first Dodger
to have double-digit strikeouts in consecutive starts since Chan-Ho Park in the 2000 season. He was awarded the National League Player of the Week award for the
week of June 20 – 26 as a result of those two starts. Midway through June, Kershaw had amassed 32 career victories
XL Generation:
, tying him with Luis Castillo for the third best start total in the league and trailing only Johnny Bench and Eddie Mathews (37). The Dodgers signed Kershaw
to a one-year deal on July 14, 2011, with a player option for 2012, in a deal worth just under $ 300,000. The Dodgers entered the 2011 season as the major
league leaders, and Kershaw had the best record in the National League in saves (39), games started (78) and walks (120); he also had the lowest win / loss ratio
in the league (1.87), and he had the worst on-base percentage (.349). He had the best ERA (2.00 ), most wins (16 ), or most shutouts (29) in his career (25),
while leading the league in losses per nine innings pitched (4.01). Following the 2011 season, the Dodgers traded Kershaw to the Kansas City Royals on July 29
for pitcher Javier Torres.
= = = Kansas City Royals = = =
= = = = 2012 season = = = =
During spring training, Kershaw played very well. He was selected to spring training as a relief pitcher for the Royals for the 2012 season. After an injury
to closer Javier Vázquez, he was activated on April 29 to replace Matt Holliday in the Royals ’ starting rotation. In his only start with the Royals, on August 6,
2012, Kershaw struck out five batters in seven innings pitched to help the Royals to their first victory in franchise history. On September 27, 2012, it appeared
Kershaw was going to pitch a complete game shutout against the Detroit Tigers, but did not manage to do so since the Tigers won 3 – 1. At the conclusion
of the season, Kershaw was named Major League Baseball’s Most Valuable Player, was chosen to the All-Star Game at Busch Stadium and was named to
the All-Star Game as the starting pitcher at shortstop. The Royals announced on February 4, 2013 that Kershaw would spend 2013 as starting pitcher, though he
was expected to miss the season as a result of a shoulder injury.
= = = = 2013 season = = = =
On May 17, 2013, Kershaw sustained another back injury and did not start in August and October 2013. He appeared in 22 starts, all starts, finishing with
a strikeout-to-walk ratio of 1.50 and a 2.91 ERA. He also had the third most strikeouts in the league: 10. On May 20, 2013, he

Reference:
, a 3.15 ERA and 593 career strikeouts in 568.2 innings. According to the Elias Sports Bureau, Kershaw was the first 23-year-old pitcher to have that many
victories, an ERA that low and an average of more than one strikeout per inning since ERA became an official statistic in 1910. Kershaw was selected to the
National League team for the 2011 Major League Baseball All-Star Game, his first All-Star selection. In the month of July, Kershaw was 4 – 1 with a 2.02
ERA and NL-leading 45 strikeouts, earning him the National League Pitcher of the Month Award. On August 23, he struck out Matt Holliday of the St. Louis
Cardinals for his 200th strikeout of the season and became the 10th Dodger pitcher to record back-to-back 200 strikeout seasons and the first since Chan-Ho
Park did it in the 2001 season. Kershaw finished the 2011 season by leading the NL with 21 wins, 248 strikeouts and a 2.28 ERA, winning the NL pitching
Triple Crown, the first Triple Crown winner since Jake Peavy of the 2007 San Diego Padres and the first Dodger since Sandy Koufax won it in the 1966 season.
Justin Verlander of the Detroit Tigers won the American League Triple Crown the same season, marking the first major-league season since 1924 to feature
Triple Crown-winning pitchers in both leagues. Kershaw’s 21 wins were the most by a Dodger pitcher since Orel Hershiser won 23 during the 1988 season.
His ERA was the lowest by a Dodger since Hershiser’s 2.03 in the 1985 season, his strikeouts were the most by a Dodger since Koufax’s 317 in 1966 and his 233
1 / 3 innings pitched were the most since Chan Ho Park pitched 234 in 2001. Since 1965 when Koufax did it, Peavy and Kershaw are only two pitchers in the
National League have led the league in wins, strikeouts, ERA, and WHIP (walks plus hits per inning pitched). Kershaw also became just the second <unk> to
have a 240-plus strikeouts in a season before the age of 24, joining Vida Blue. After the season, Kershaw was awarded the Warren Spahn Award as the best
left-handed pitcher in 2011, the Players Choice Award for Most Outstanding National League pitcher, the Gold Glove Award as the top fielding pitcher in
the NL and the Sporting News (TSN) National League Pitcher of the Year. He was additionally selected as the starting pitcher for the TSN NL All-Star Team.
On November 17, he was honored with the National League Cy Young Award, making him the youngest Cy Young winner since Dwight Gooden

Table 11: Example 1 – 500 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/Clayton_Kershaw
There are many interesting observations from this example:
• Firstly, Kershaw never played for the Royals in real life. Despite that, Transformer-XL stays on the fully imagined topic and keeps
hallucinating Kershaw’s experience with the Royals across the generated text.
• Secondly, notice that XL correctly tracks the chronological order from 2011 to 2012 and finally to the 2013 season in the
section titles.
• In addition, notice that Transformer-XL accurately uses the phrase “another back injury” in the 2013 season paragraph,
since it has talked about an earlier injury in the 2012 season. This again shows Transformer-XL’s ability to capture
long-term dependency.
Context:
= = Distribution = =
Species range across the Neotropics from Mexico in the north to Bolivia, Paraguay, and southern Brazil in the south. According to <unk> and coauthors, three
species are found in Mexico, four in Central America, and 62 in South America. Three species are present in the Caribbean — two in Trinidad and Tobago, along
the southern edge of the region, and one in Haiti.
= = Habitat and ecology = =
<unk> includes both large trees and small acaulescent palms which occupy a number of different ecological niches. Dense stands of some of the larger species are
conspicuous elements on the landscape, while smaller species are found in both in the forest understorey and in savannas. Disturbance has been implicated in the
formation of vegetation dominated by large <unk> species. In seasonally dry Amazonian forests the density of large adult A. <unk> palms was correlated with
canopy openness; the species also dominates savannas formed by repeated forest fires in Trinidad and Tobago. <unk> speciosa forms pure stands in many parts of
Brazil where natural forest vegetation has been cleared. Similarly, stands of A. <unk> in Bahia, Brazil (which are cultivated for <unk> fibre) are managed using
fire — the seedlings survive cutting and burning, and are able to dominate burned forest patches. The fruit are dispersed by animals; fruit which are not dispersed
frequently suffer seed predation by <unk> beetles. Certain species of <unk> have been mentioned as examples of " anachronistic " species which are adapted
for dispersal by now-extinct Pleistocene megafauna. On <unk> Island, <unk>, in the Brazilian Amazon, <unk> <unk> fruit were consumed by tapirs, collared
peccaries, deer and primates. Rodents, including agoutis, fed upon the fruit and, as the fruit availability declined, they fed on the seeds. Other dispersers of <unk>
fruit include Crested <unk> which consume the fruit and disperse the seeds of A. <unk> in the Brazilian Pantanal.
= = Uses = =
<unk> species have a long history of human utilisation. <unk> <unk> <unk> seeds have been found in archaeological sites in Colombia dating back to 9000 BP.
A variety of species remain important sources of edible oil, thatch, edible seeds and fibre. The leaves of <unk> <unk> and A. <unk> are used extensively for
thatching. Several species are oil palms, with A. speciosa among the most important economically. Products extracted from A. speciosa were reported to support
over 300,000 households in the Brazilian state of Maranhão in 2005, and in 1985 it was estimated to support over 450,000 households throughout the Brazil. <unk>
fibres, extracted from the leaf bases of A. <unk>, are commercially important, and generated about US $
XL Generation:
700,000 in 2013 on the black market. In 2004 the plant was reported to be responsible for a net profit of around US $ 3 million in Brazil. A 2012 study found A. speciosa to cause agricultural production at least 2 – 3 times greater than those achieved without prior knowledge of land management.
= The Tale of Mrs. Tittlemouse =
The Tale of Mrs. Tittlemouse is a children’s book written and illustrated by H. M. Potter. It was first published in February 1910 by Frederick Warne & Co. and included illustrations by Frederick Warne. Mrs. Tittlemouse was intended to be a children’s book, rather than a musical, and the tale describes her attempts to find her own home. Mrs. Tittlemouse is an average kitten and the tale follows a mischievous rabbit named Mr. Tittlemouse who tries to kill her in an attempt to get a ride on a horse. Potter later wrote of the rabbit and her attempts to kill her, " If [ she ] were a fox, I should like to have been able to show her the way. " Potter’s work was inspired by The Tale of Little Black Sambo and her two nieces, Petunia and Petunia. It was the first book in the Potter family to be published in a hardback format, and was also the first book in the Peter Rabbit series. The tale was released on 13 November 1910 and was accompanied by a greeting card by Potter’s brother-in-law, Frederick Warne. The tale was well received by critics and children, and sold over 600,000 copies in its first few years. In the first year of its publication, Potter’s The Tale of Mrs. Tittlemouse won children’s book awards in England, America and Australia. It was selected as a " recommended reading " by children in the US and Canada.
= = Plot summary = =
The tale opens with the pet rabbit Mrs. Tittlemouse wandering through a forest in a small litter of four. He is shot and badly injured by a deer after she attempts to escape. Mrs. Tittlemouse decides that she wants to find her own home, because she is eager to go on her own. She goes alone to the farm where she makes a little money by selling a few seeds and building a small cabin in the woods. She is approached by a wealthy hunter named Mr. Tittlemouse, who tries to kill her but Mrs. Tittlemouse kills him by stuffing a rope into his nose and killing him. She is rescued by Mr. Tittlemouse’s wife Ruth, but Mrs. Tittlemouse then leaves the woodland with the baby. When she is spotted by
Reference:
20 million in annual income to Brazilian farmers in 1996.
= The Heart of Ezra Greer =
The Heart of Ezra Greer is a 1917 American silent drama film produced by the Thanhouser Company and directed by Emile <unk>. The film focuses on Ezra Greer, a successful middle-aged man who searches for his college age daughter, Mary. The wayward Mary was romanced and abandoned by Jack <unk>, later bearing his child. Once Ezra becomes broke he finds employment as the valet for Jack <unk>. After Jack’s engagement to a cabaret girl, Mary becomes upset and leaves her child at Jack’s home. Contrary to Jack’s wishes, Ezra keeps the child and Jack ultimately reveals that the child is his own. Ezra convinces Jack to make things right and Ezra convinces the cabaret girl to leave Jack. After a carriage accident in which the baby is injured, Ezra and Jack rush to the hospital and find Mary as a nurse crying over the child. The film ends with the marriage of Jack and Mary. The film was released by Pathé on October 7, 1917. The film was the final release from Thanhouser and was deemed to be an average film by most reviewers. Criticism for the film hinged on far-fetched coincidences to drive the plot. The film is presumed lost.
= = Plot = =
The film follows Ezra Greer, a middle-aged man who has worked hard since his youth. He cares deeply for his motherless daughter, Mary, but was unable to attend the annual commencement at her co-educational college. He awaits for her to return from college, but Mary leaves with her romantic interest, Jack <unk>. On promise of marriage and wealth, Mary is romanced and gives birth to a fatherless child. Without word from his daughter, Ezra resigns from his job and attempts to seek her out and finds a poor motherless child, Marie. With Ezra’s money exhausted he seeks employment and finds it as the valet of Jack. One day, Mary seeks an announcement of Jack’s engagement to a cabaret girl known as " The Baby Vamp ". Bitter over the prospect of her child’s future, she leaves the child at Jack’s home during his absence with a note. Jack orders Ezra to take the baby to an orphanage, but Marie begs Ezra to keep him. After continually seeing the child, Jack is overcome with remorse and explains to Ezra and seeks his advice. Not knowing he was making the case for his own daughter, Ezra convinces Jack to seek out Mary and forget the Baby Vamp. The Baby

Table 12: Example 2 – 500 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/The_Tale_of_Mrs._Tittlemouse.
This example exhibits some additional interesting properties of Transformer-XL:
• After finishing the last paragraph of the seed context, both the reference and generated text start a new topic (i.e., Wikipedia
page), as marked by the single “= title =” line. This suggests the model is able to identify the end of a topic / page
and to start a new topic at random.
• Even more interestingly, a newly-started page is on a book called “The Tale of Mrs. Tittlemouse”. Transformer-XL manages
to copy the same book title and some related information from the training set, but hallucinates novel content of the book.
This demonstrates a degree of generalization instead of memorization. Please refer to the original book content at the
Wikipedia page.
Context:
= Battle of Dürenstein =
The Battle of Dürenstein (also known as the Battle of <unk>, Battle of <unk> and Battle of <unk>; German: <unk> bei <unk> ), on 11 November 1805 was
an engagement in the Napoleonic Wars during the War of the Third Coalition. Dürenstein (modern <unk>) is located in the <unk> Valley, on the River Danube,
73 kilometers (45 mi) upstream from Vienna, Austria. The river makes a crescent-shaped curve between <unk> and nearby Krems an der Donau and the battle
was fought in the flood plain between the river and the mountains. At Dürenstein a combined force of Russian and Austrian troops trapped a French division
commanded by Théodore Maxime Gazan. The French division was part of the newly created VIII Corps, the so-called Corps Mortier, under command of Édouard
Mortier. In pursuing the Austrian retreat from Bavaria, Mortier had over-extended his three divisions along the north bank of the Danube. Mikhail <unk> Kutuzov,
commander of the Coalition force, enticed Mortier to send Gazan’s division into a trap and French troops were caught in a valley between two Russian columns.
They were rescued by the timely arrival of a second division, under command of Pierre Dupont de l ’Étang. The battle extended well into the night. Both sides
claimed victory. The French lost more than a third of their participants, and Gazan’s division experienced over 40 percent losses. The Austrians and Russians
also had heavy <unk> to 16 <unk> perhaps the most significant was the death in action of Johann Heinrich von Schmitt, one of Austria’s most capable chiefs of
staff. The battle was fought three weeks after the Austrian capitulation at Ulm and three weeks before the Russo-Austrian defeat at the Battle of Austerlitz. After
Austerlitz Austria withdrew from the war. The French demanded a high indemnity and Francis II abdicated as Holy Roman Emperor, releasing the German states
from their allegiance to the Holy Roman Empire.
= = Background = =
In a series of conflicts from 1803-15 known as the Napoleonic Wars, various European powers formed five coalitions against the First French Empire. Like the wars
sparked by the French Revolution (1789 ), these further revolutionized the formation, organization and training of European armies and led to an unprecedented
militarization, mainly due to mass conscription. Under the leadership of Napoleon, French power rose quickly as the Grande Armée conquered most of Europe,
and collapsed rapidly
XL Generation:
after the French conquest of Italy. This victory marked the beginning of the Second Coalition. Napoleon’s rapid advance caused Austria – Russia, Britain and Russia to make peace negotiations. The Russian army under Prince Mikhail Mikhailovich Mikhailovich Chaffee, commander of the Imperial Russian Army in Switzerland, was able to drive off the Austrians in the Battle of Stockach (1805) and to take Pressburg (modern Ö3 Austria) from the Austrians. At an early stage the Austro-Hungarian fleet had been damaged in a storm off Cape Matapan; this allowed the Allies to launch a full-scale invasion of Italy in February 1805. The Battle of Austerlitz was the decisive French victory against Napoleon and the largest naval battle in history, involving more modern-day European navies. The French military situation worsened as Napoleon faced several threats related to his newly formed Army of Europe, particularly Russia itself, which was now on the verge of collapse. The Russian general and friend of Napoleon, Alexander, had been dissatisfied with the conduct of the Austrians; he was still in conflict with Emperor Napoleon, the French Republic’s king, who had declared war on Austria during the summer of 1804. With the war over, Napoleon decided to launch a second invasion of Italy in October 1805.
= = Prelude = =
In July 1805, the French 1st Army entered southern Italy. The army, under the command of Marshal Marmont, were reinforced by a few battalions of infantry under Claude General Auguste de Marmont at the town of Philippsburg and another battalion at Belluno. On 17 September 1805, the army marched from Belluno towards Krems. By 29 September, they had reached Belluno and conducted its advance against a small Austrian force. By 31 September, the whole force had been reinforced by a brigade from the Army of Tyrol under the command of Pierre Augereau. The Austrians were now under the command of Marshal Jean Victor Marie Moreau, a member of the Directory. Moreau had taken command of the Austrian invasion force in the spring of 1805. His command included the VI Corps commanded by Jean Baptiste Drouet de Ney and the VI Corps commanded by Generals Jean Victor Marie Moreau and Joseph Souham. Ney’s corps consisted of the III. Corps and VI. Corps, which consisted of the III Corps and VI. Corps, located in the Austrian Netherlands, was commanded by Friedrich Joseph, Count Baillet de Latour. Moreau’s army consisted of six divisions and several associated brigades.
= = Aftermath = =
= = = First Coalition forces = = =
On 9 October 1805 the French Army of the Danube was attacked by an Austrian army under Archduke Charles at the Battle of Austerlitz. Although Charles and Charles had not had much time to regroup, on 10 October, he launched his attack on the Polish forces under Friedrich Joseph, Count of Lauenburg. After three days, Charles’ army captured Lauenburg. The French forces pursued the Austrians to the Silesian border, where they encountered strong Austrian resistance. These conflicts forced the Austrians to retreat into Tyrol and Austria agreed to a truce. The Austrian army, commanded by Wenzel Anton Karl, Count of Merveldt, was reduced to around 10,000 men. It was initially planned that Archduke Charles would launch a counter-attack against the French army on the same day, as Napoleon had hoped, but this was not carried out. On 25 October, Merveldt left Styria for Tyrol. On the same day, Austria launched its new offensive against the French at Ulm. Charles withdrew his army from the region for a third time at the Battle of Elchingen, under the overall command of the Austrian generals, Ferdinand and Friedrich Wilhelm of Jülich-Cleves-Berg. To prevent Archduke Charles from escaping from the battlefield, the commander of the Habsburg army, Archduke Charles, planned to occupy the fortress Linz; instead, he decided to force Franz von Hipper to surrender the city. However, as Charles moved to the south, Moreau arrived on the scene with additional soldiers – including the entire Imperial Guard – and defeated the Austrians at the Battle of Hohenlinden on 28 October. The loss of Linz resulted in Austria’s complete defeat at Hohenlinden. In the meantime, the French Army of Observation and Preparedness was reorganized into the Army of the Danube under Feldzeugmeister (Colonel-General) Friedrich Freiherr von Hotze. The army was composed of the I, IV, VI, VI, VII, VIII and IX Corps. With reinforcements from Italy and France, it formed new battalions, companies, and squadrons in the Austrian army. On 17 November 1804, at the Battle of Jena-Auerstadt the Army of Silesia and the Army of Silesia joined forces, but by the time that the French approached Vienna, the Prussians had already surrendered. As the Austrians did not want to allow the war to continue, they decided to abandon their territories in the north and move their army to the north and west, cutting off Charles from Vienna. The Battle of Warsaw was fought on 23 November 1805 between the French Army of the Danube and the Austrian Army of Styria in the vicinity of Warsaw and Pressburg (modern Trnava, Slovakia). At that time Habsburg forces
Reference:
after the disastrous invasion of Russia in 1812. Napoleon’s empire ultimately suffered complete military defeat in the 1813 – 14 campaigns, resulting in the restoration of the Bourbon monarchy in France. Although Napoleon made a spectacular return in 1815, known as the Hundred Days, his defeat at the Battle of Waterloo, the pursuit of his army and himself, his abdication and banishment to the Island of Saint Helena concluded the Napoleonic Wars.
= = Danube campaign = =
From 1803-06 the Third Coalition fought the First French Empire and its client states (see table at right ). Although several naval battles determined control of the seas, the outcome of the war was decided on the continent, predominantly in two major land operations in the Danube valley: the Ulm campaign in the upper Danube and the Vienna campaign, in the middle Danube valley. Political conflicts in Vienna delayed Austria’s entry into the Third Coalition until 1805. After hostilities of the War of the Second Coalition ended in 1801, Archduke <unk> emperor’s <unk> advantage of the subsequent years of peace to develop a military restructuring plan. He carefully put this plan into effect beginning in 1803 – 04, but implementation was incomplete in 1805 when Karl Mack, Lieutenant Field Marshal and Quartermaster-General of the Army, implemented his own restructuring. Mack bypassed Charles ’ methodical approach. Occurring in the field, Mack’s plan also undermined the overall command and organizational structure. Regardless, Mack sent an enthusiastic report to Vienna on the military’s readiness. Furthermore, after misreading Napoleon’s maneuvers in Württemberg, Mack also reported to Vienna on the weakness of French dispositions. His reports convinced the war party advising the emperor, Francis II, to enter the conflict against France, despite Charles ’ own advice to the contrary. Responding to the report and rampant anti-French fever in Vienna, Francis dismissed Charles from his post as generalissimo and appointed his <unk> brother-in-law, Archduke Ferdinand, as commander. The inexperienced Ferdinand was a poor choice of replacement for the capable Charles, having neither maturity nor aptitude for the assignment. Although Ferdinand retained nominal command, day-to-day decisions were placed in the hands of Mack, equally ill-suited for such an important assignment. When Mack was wounded early in the campaign, he was unable to take full charge of the army. Consequently, command further devolved to Lieutenant Field Marshal Karl Philipp, Prince of Schwarzenberg, an able cavalry officer but inexperienced in the command of such a large army.
= = = Road to Ulm = = =
The campaign in the upper Danube valley began in October, with several clashes in Swabia. Near the Bavarian town of Wertingen, 40 kilometers (25 mi) northwest of Augsburg, on 8 October the 1st Regiment of dragoons, part of Murat’s Reserve Cavalry Corps, and grenadiers of Lannes ’ V Corps surprised an Austrian force half its size. The Austrians were arrayed in a line and unable to form their defensive squares quickly enough to protect themselves from the 4,000 dragoons and 8,000 grenadiers. Nearly 3,000 Austrians were captured and over 400 were killed or wounded. A day later, at another small town, <unk> south of the Danube <unk> French 59th Regiment of the Line stormed a bridge over the Danube and, humiliatingly, chased two large Austrian columns toward Ulm. The campaign was not entirely bad news for Vienna. At Haslach, Johann von Klenau arranged his 25,000 infantry and cavalry in a prime defensive position and, on 11 October, the overly confident General of Division Pierre Dupont de l’Étang attacked Klenau’s force with fewer than 8,000 men. The French lost 1,500 men killed and wounded. Aside from taking the Imperial Eagles and <unk> of the 15th and 17th Dragoons, Klenau’s force also captured 900 men, 11 guns and 18 ammunition wagons. Klenau’s victory was a singular success. On 14 October Mack sent two columns out of Ulm in preparation for a breakout to the north: one under Johann Sigismund Riesch headed toward Elchingen to secure the bridge there, and the other under Franz von Werneck went north with most of the heavy artillery. Recognizing the opportunity, Marshal Michel Ney hurried the rest of his VI Corps forward to re-establish contact with Dupont, who was still north of the Danube. In a two-pronged attack Ney sent one division to the south of Elchingen on the right bank of the Danube. This division began the assault at Elchingen. At the same time another division crossed the river to the east and moved west against Riesch’s position. After clearing Austrian pickets from a bridge, the French attacked and captured a strategically located abbey at the top of the hill at bayonet point. The Austrian cavalry unsuccessfully tried to fend off the French, but the Austrian infantry broke and ran. In this engagement alone, the Austrians lost more than half their reserve artillery park, 6,000 (out of 8,000 total participants) dead, wounded or captured and four colors. Reisch’s column also failed to destroy the bridges across the Danube. Napoleon’s lightning campaign exposed the Austrian indecisive command structure and poor supply apparatus. Mack

Table 13: Example 3 – 1,000 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/Battle_of_D%C3%BCrenstein.
• Although this example is significantly longer, we can see that Transformer-XL is still able to stay on the same topic and
makes up non-existent stories about the Napoleonic Wars.
• Notably, from the second section on, the generated text correctly follows a fine-grained chronological order on the level of
month and day to narrate events in 1805, except for one mistake (1804 instead of 1805) near the end of the paragraph. To ease
reading, we have highlighted all the date-related phrases in magenta in the generation.