
                                                              Transformer-XL: Attentive Language Models
                                                                    Beyond a Fixed-Length Context

                                                         Zihang Dai∗12 , Zhilin Yang∗12 , Yiming Yang1 , Jaime Carbonell1 ,
                                                                      Quoc V. Le2 , Ruslan Salakhutdinov1
                                                                   1 Carnegie Mellon University, 2 Google Brain
                                                      {dzihang,zhiliny,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com




arXiv:1901.02860v3 [cs.LG] 2 Jun 2019

Abstract

Transformers have the potential to learn longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture, Transformer-XL, that enables learning dependency beyond a fixed length without disrupting temporal coherence. It consists of a segment-level recurrence mechanism and a novel positional encoding scheme. Our method not only enables capturing longer-term dependency, but also resolves the context fragmentation problem. As a result, Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers, achieves better performance on both short and long sequences, and is up to 1,800+ times faster than vanilla Transformers during evaluation. Notably, we improve the state-of-the-art results of bpc/perplexity to 0.99 on enwiki8, 1.08 on text8, 18.3 on WikiText-103, 21.8 on One Billion Word, and 54.5 on Penn Treebank (without finetuning). When trained only on WikiText-103, Transformer-XL manages to generate reasonably coherent, novel text articles with thousands of tokens. Our code, pretrained models, and hyperparameters are available in both TensorFlow and PyTorch.[1]

∗ Equal contribution. Order determined by swapping the one in Yang et al. (2017).
[1] https://github.com/kimiyoung/transformer-xl

1 Introduction

Language modeling is among the important problems that require modeling long-term dependency, with successful applications such as unsupervised pretraining (Dai and Le, 2015; Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). However, it has been a challenge to equip neural networks with the capability to model long-term dependency in sequential data. Recurrent neural networks (RNNs), in particular Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), have been a standard solution to language modeling and obtained strong results on multiple benchmarks. Despite the wide adoption, RNNs are difficult to optimize due to gradient vanishing and explosion (Hochreiter et al., 2001), and the introduction of gating in LSTMs and the gradient clipping technique (Graves, 2013) might not be sufficient to fully address this issue. Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement.

On the other hand, the direct connections between long-distance word pairs baked in attention mechanisms might ease optimization and enable the learning of long-term dependency (Bahdanau et al., 2014; Vaswani et al., 2017). Recently, Al-Rfou et al. (2018) designed a set of auxiliary losses to train deep Transformer networks for character-level language modeling, which outperform LSTMs by a large margin. Despite the success, the LM training in Al-Rfou et al. (2018) is performed on separated fixed-length segments of a few hundred characters, without any information flow across segments. As a consequence of the fixed context length, the model cannot capture any longer-term dependency beyond the predefined context length. In addition, the fixed-length segments are created by selecting a consecutive chunk of symbols without respecting the sentence or any other semantic boundary. Hence, the model lacks the contextual information needed to predict the first few symbols well, leading to inefficient optimization and inferior performance. We refer to this problem as context fragmentation.

To address the aforementioned limitations of fixed-length contexts, we propose a new architecture called Transformer-XL (meaning extra long). We introduce the notion of recurrence into our
deep self-attention network. In particular, instead of computing the hidden states from scratch for each new segment, we reuse the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. Meanwhile, passing information from the previous segment can also resolve the problem of context fragmentation. More importantly, we show the necessity of using relative positional encodings rather than absolute ones, in order to enable state reuse without causing temporal confusion. Hence, as an additional technical contribution, we introduce a simple but more effective relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.

Transformer-XL obtained strong results on five datasets, varying from word-level to character-level language modeling. Transformer-XL is also able to generate relatively coherent long text articles with thousands of tokens (see Appendix E), trained on only 100M tokens.

Our main technical contributions include introducing the notion of recurrence in a purely self-attentive model and deriving a novel positional encoding scheme. These two techniques form a complete set of solutions, as any one of them alone does not address the issue of fixed-length contexts. Transformer-XL is the first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling.

2 Related Work

In the last few years, the field of language modeling has witnessed many significant advances, including but not limited to devising novel architectures to better encode the context (Bengio et al., 2003; Mikolov et al., 2010; Merity et al., 2016; Al-Rfou et al., 2018), improving regularization and optimization algorithms (Gal and Ghahramani, 2016), speeding up the Softmax computation (Grave et al., 2016a), and enriching the output distribution family (Yang et al., 2017).

To capture the long-range context in language modeling, a line of work directly feeds a representation of the wider context into the network as an additional input. Existing works range from ones where context representations are manually defined (Mikolov and Zweig, 2012; Ji et al., 2015; Wang and Cho, 2015) to others that rely on document-level topics learned from data (Dieng et al., 2016; Wang et al., 2017).

More broadly, in generic sequence modeling, how to capture long-term dependency has been a long-standing research problem. From this perspective, since the ubiquitous adoption of LSTM, many efforts have been spent on relieving the vanishing gradient problem, including better initialization (Le et al., 2015), additional loss signal (Trinh et al., 2018), augmented memory structure (Ke et al., 2018) and others that modify the internal architecture of RNNs to ease the optimization (Wu et al., 2016; Li et al., 2018). Different from them, our work is based on the Transformer architecture and shows that language modeling as a real-world task benefits from the ability to learn longer-term dependency.

3 Model

Given a corpus of tokens $x = (x_1, \ldots, x_T)$, the task of language modeling is to estimate the joint probability $P(x)$, which is often auto-regressively factorized as $P(x) = \prod_t P(x_t \mid x_{<t})$. With the factorization, the problem reduces to estimating each conditional factor. In this work, we stick to the standard neural approach to modeling the conditional probability. Specifically, a trainable neural network is used to encode the context $x_{<t}$ into a fixed size hidden state, which is multiplied with the word embeddings to obtain the logits. The logits are then fed into the Softmax function, yielding a categorical probability distribution over the next token.

3.1 Vanilla Transformer Language Models

In order to apply Transformer or self-attention to language modeling, the central problem is how to train a Transformer to effectively encode an arbitrarily long context into a fixed size representation. Given infinite memory and computation, a simple solution would be to process the entire context sequence using an unconditional Transformer decoder, similar to a feed-forward neural network. However, this is usually infeasible with the limited resources available in practice.

One feasible but crude approximation is to split the entire corpus into shorter segments of manageable sizes, and only train the model within each segment, ignoring all contextual information from previous segments. This is the idea adopted by Al-Rfou et al. (2018). We call it the vanilla model and visualize it in Fig. 1a. Under this training paradigm, information never flows across segments in either the forward or backward pass. There are two critical limitations of using a fixed-length context. First, the largest possible dependency length is upper bounded by the segment length, which is a few hundred on character-level language modeling (Al-Rfou et al., 2018). Therefore, although the self-attention mechanism is less affected by the vanishing gradient problem compared to RNNs, the vanilla model is not able to fully exploit this optimization advantage. Second, though it is possible to use padding to respect the sentence or other semantic boundaries, it has been standard practice to simply chunk long text into fixed-length segments due to improved efficiency (Peters et al., 2018; Devlin et al., 2018; Al-Rfou et al., 2018). However, simply chunking a sequence into fixed-length segments will lead to the context fragmentation problem as discussed in Section 1.

Figure 1: Illustration of the vanilla model with a segment length 4. (a) Train phase. (b) Evaluation phase.

During evaluation, at each step, the vanilla model also consumes a segment of the same length as in training, but only makes one prediction at the last position. Then, at the next step, the segment is shifted to the right by only one position, and the new segment has to be processed all from scratch. As shown in Fig. 1b, this procedure ensures that each prediction utilizes the longest possible context exposed during training, and also relieves the context fragmentation issue encountered in training. However, this evaluation procedure is extremely expensive. We will show that our proposed architecture is able to substantially improve the evaluation speed.

3.2 Segment-Level Recurrence with State Reuse

To address the limitations of using a fixed-length context, we propose to introduce a recurrence mechanism to the Transformer architecture. During training, the hidden state sequence computed for the previous segment is fixed and cached to be reused as an extended context when the model processes the next new segment, as shown in Fig. 2a. Although the gradient still remains within a segment, this additional input allows the network to exploit information in the history, making it possible to model longer-term dependency and avoid context fragmentation. Formally, let the two consecutive segments of length $L$ be $s_\tau = [x_{\tau,1}, \cdots, x_{\tau,L}]$ and $s_{\tau+1} = [x_{\tau+1,1}, \cdots, x_{\tau+1,L}]$ respectively. Denote the $n$-th layer hidden state sequence produced for the $\tau$-th segment $s_\tau$ by $h^n_\tau \in \mathbb{R}^{L \times d}$, where $d$ is the hidden dimension. Then, the $n$-th layer hidden state for segment $s_{\tau+1}$ is produced (schematically) as follows,

$$\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right],$$
$$q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_k^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_v^{\top},$$
$$h^{n}_{\tau+1} = \text{Transformer-Layer}\left(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1}\right).$$

where the function $\mathrm{SG}(\cdot)$ stands for stop-gradient, the notation $[h_u \circ h_v]$ indicates the concatenation of two hidden sequences along the length dimension, and $W_\cdot$ denotes model parameters. Compared to the standard Transformer, the critical difference lies in that the key $k^n_{\tau+1}$ and value $v^n_{\tau+1}$ are conditioned on the extended context $\tilde{h}^{n-1}_{\tau+1}$ and hence $h^{n-1}_{\tau}$ cached from the previous segment. We emphasize this particular design by the green paths in Fig. 2a.

With this recurrence mechanism applied to every two consecutive segments of a corpus, it essentially creates a segment-level recurrence in the hidden states. As a result, the effective context being utilized can go way beyond just two segments. However, notice that the recurrent dependency between $h^n_{\tau+1}$ and $h^{n-1}_{\tau}$ shifts one layer downwards
 x1     x2      x3      x4   x5   x6      x7    x8       x1   x2   x3   x4   x5     x6      x7      x8   x9   x10    x11    x12   x1     x2   x3   x4   x5   x6      x7      x8      x9   x10   x11   x12


      Fixed (No Grad)             New Segment                                     Fixed (No Grad)             New Segment                                         Extended Context



                                                     (a) Training phase.                                                                           (b) Evaluation phase.

                                          Figure 2: Illustration of the Transformer-XL model with a segment length 4.

per-segment, which differs from the same-layer                                                           der to reuse the hidden states. That is, how can
recurrence in conventional RNN-LMs. Conse-                                                               we keep the positional information coherent when
quently, the largest possible dependency length                                                          we reuse the states? Recall that, in the standard
grows linearly w.r.t. the number of layers as well                                                       Transformer, the information of sequence order is
as the segment length, i.e., O(N × L), as vi-                                                            provided by a set of positional encodings, denoted
sualized by the shaded area in Fig. 2b. This                                                             as U ∈ RLmax ×d , where the i-th row Ui corre-
is analogous to truncated BPTT (Mikolov et al.,                                                          sponds to the i-th absolute position within a seg-
2010), a technique developed for training RNN-                                                           ment and Lmax prescribes the maximum possible
LMs. However, different from truncated BPTT,                                                             length to be modeled. Then, the actual input to the
our method caches a sequence of hidden states in-                                                        Transformer is the element-wise addition of the
stead of the last one, and should be applied to-                                                         word embeddings and the positional encodings. If
gether with the relative positional encoding tech-                                                       we simply adapt this positional encoding to our
nique described in Section 3.3.                                                                          recurrence mechanism, the hidden state sequence
   Besides achieving extra long context and re-                                                          would be computed schematically by
solving fragmentation, another benefit that comes                                                                                 hτ +1 = f (hτ , Esτ +1 + U1:L )
with the recurrence scheme is significantly faster                                                                                     hτ = f (hτ −1 , Esτ + U1:L ),
evaluation. Specifically, during evaluation, the
                                                                                                         where Esτ ∈ RL×d is the word embedding se-
representations from the previous segments can
                                                                                                         quence of sτ , and f represents a transformation
be reused instead of being computed from scratch
                                                                                                         function. Notice that, both Esτ and Esτ +1 are as-
as in the case of the vanilla model. In our ex-
                                                                                                         sociated with the same positional encoding U1:L .
periments on enwiki8, Transformer-XL is up to
                                                                                                         As a result, the model has no information to dis-
1,800+ times faster than the vanilla model during
                                                                                                         tinguish the positional difference between xτ,j and
evaluation (see Section 4).
                                                                                                         xτ +1,j for any j = 1, . . . , L, resulting in a sheer
   Finally, notice that the recurrence scheme does
                                                                                                         performance loss.
not need to be restricted to only the previous seg-
                                                                                                            In order to avoid this failure mode, the funda-
ment. In theory, we can cache as many previous
                                                                                                         mental idea is to only encode the relative posi-
segments as the GPU memory allows, and reuse
                                                                                                         tional information in the hidden states. Concep-
all of them as the extra context when processing
                                                                                                         tually, the positional encoding gives the model a
the current segment. Thus, we can cache a prede-
                                                                                                         temporal clue or “bias” about how information
fined length-M old hidden states spanning (pos-
                                                                                                         should be gathered, i.e., where to attend. For the
sibly) multiple segments, and refer to them as the
                                                                                                         same purpose, instead of incorporating bias stati-
memory mnτ ∈ RM ×d , due to a clear connection to
                                                                                                         cally into the initial embedding, one can inject the
the memory augmented neural networks (Graves
                                                                                                         same information into the attention score of each
et al., 2014; Weston et al., 2014). In our experi-
                                                                                                         layer. More importantly, it is more intuitive and
ments, we set M equal to the segment length dur-
                                                                                                         generalizable to define the temporal bias in a rela-
ing training, and increase it by multiple times dur-
                                                                                                         tive manner. For instance, when a query vector qτ,i
ing evaluation.
                                                                                                         attends on the key vectors kτ,≤i , it does not need
                                                                                                         to know the absolute position of each key vector
3.3          Relative Positional Encodings
                                                                                                         to identify the temporal order of the segment. In-
While we found the idea presented in the pre-                                                            stead, it suffices to know the relative distance be-
vious subsection very appealing, there is a cru-                                                         tween each key vector kτ,j and itself qτ,i , i.e. i − j.
cial technical challenge we haven’t solved in or-                                                        Practically, one can create a set of relative posi-
tional encodings R ∈ RLmax ×d , where the i-th row      • Finally, we deliberately separate the two weight
Ri indicates a relative distance of i between two          matrices Wk,E and Wk,R for producing the
positions. By injecting the relative distance dy-          content-based key vectors and location-based
namically into the attention score, the query vector       key vectors respectively.
can easily distinguish the representations of $x_{\tau,j}$ and $x_{\tau+1,j}$ from their different distances, making the state reuse mechanism feasible. Meanwhile, we won't lose any temporal information, as the absolute position can be recovered recursively from relative distances.

Previously, the idea of relative positional encodings has been explored in the context of machine translation (Shaw et al., 2018) and music generation (Huang et al., 2018). Here, we offer a different derivation, arriving at a new form of relative positional encodings, which not only has a one-to-one correspondence to its absolute counterpart but also enjoys much better generalization empirically (see Section 4). Firstly, in the standard Transformer (Vaswani et al., 2017), the attention score between query $q_i$ and key vector $k_j$ within the same segment can be decomposed as

$$
\begin{aligned}
A^{abs}_{i,j} &= \underbrace{E_{x_i}^\top W_q^\top W_k E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_k U_j}_{(b)} \\
&\quad + \underbrace{U_i^\top W_q^\top W_k E_{x_j}}_{(c)} + \underbrace{U_i^\top W_q^\top W_k U_j}_{(d)}.
\end{aligned}
$$

Following the idea of only relying on relative positional information, we propose to re-parameterize the four terms as follows

$$
\begin{aligned}
A^{rel}_{i,j} &= \underbrace{E_{x_i}^\top W_q^\top W_{k,E} E_{x_j}}_{(a)} + \underbrace{E_{x_i}^\top W_q^\top W_{k,R} R_{i-j}}_{(b)} \\
&\quad + \underbrace{u^\top W_{k,E} E_{x_j}}_{(c)} + \underbrace{v^\top W_{k,R} R_{i-j}}_{(d)}.
\end{aligned}
$$

• The first change we make is to replace all appearances of the absolute positional embedding $U_j$ for computing key vectors in terms (b) and (d) with its relative counterpart $R_{i-j}$. This essentially reflects the prior that only the relative distance matters for where to attend. Note that R is a sinusoid encoding matrix (Vaswani et al., 2017) without learnable parameters.

• Secondly, we introduce a trainable parameter $u \in \mathbb{R}^d$ to replace the query $U_i^\top W_q^\top$ in term (c). In this case, since the query vector is the same for all query positions, it suggests that the attentive bias towards different words should remain the same regardless of the query position. With a similar reasoning, a trainable parameter $v \in \mathbb{R}^d$ is added to substitute $U_i^\top W_q^\top$ in term (d).

Under the new parameterization, each term has an intuitive meaning: term (a) represents content-based addressing, term (b) captures a content-dependent positional bias, term (c) governs a global content bias, and (d) encodes a global positional bias.

In comparison, the formulation in Shaw et al. (2018) only has terms (a) and (b), dropping the two bias terms (c) and (d). Moreover, Shaw et al. (2018) merge the multiplication $W_k R$ into a single trainable matrix $\hat{R}$, which abandons the inductive bias built into the original sinusoid positional encoding (Vaswani et al., 2017). In contrast, our relative positional embedding R adapts the sinusoid formulation. As a benefit of the inductive bias, a model trained on a memory of a certain length can automatically generalize to a memory several times longer during evaluation.

Equipping the recurrence mechanism with our proposed relative positional embedding, we finally arrive at the Transformer-XL architecture. For completeness, we summarize the computational procedure for an N-layer Transformer-XL with a single attention head here. For n = 1, . . . , N :

$$
\begin{aligned}
\widetilde{h}^{n-1}_\tau &= \left[\mathrm{SG}(m^{n-1}_\tau) \circ h^{n-1}_\tau\right] \\
q^n_\tau,\ k^n_\tau,\ v^n_\tau &= h^{n-1}_\tau {W^n_q}^\top,\ \widetilde{h}^{n-1}_\tau {W^n_{k,E}}^\top,\ \widetilde{h}^{n-1}_\tau {W^n_v}^\top \\
A^n_{\tau,i,j} &= {q^n_{\tau,i}}^\top k^n_{\tau,j} + {q^n_{\tau,i}}^\top W^n_{k,R} R_{i-j} + u^\top k^n_{\tau,j} + v^\top W^n_{k,R} R_{i-j} \\
a^n_\tau &= \text{Masked-Softmax}(A^n_\tau)\, v^n_\tau \\
o^n_\tau &= \text{LayerNorm}(\text{Linear}(a^n_\tau) + h^{n-1}_\tau) \\
h^n_\tau &= \text{Positionwise-Feed-Forward}(o^n_\tau)
\end{aligned}
$$

with $h^0_\tau := E_{s_\tau}$ defined as the word embedding sequence. In addition, it is worth mentioning that a naive way to compute A requires computing $W^n_{k,R} R_{i-j}$ for all pairs (i, j), whose cost is quadratic w.r.t. the sequence length. However, noticing that the value of i − j only ranges from zero to the sequence length, we show a simple computation procedure in Appendix B, which reduces the cost to linear w.r.t. the sequence length.

4 Experiments

4.1 Main Results

We apply Transformer-XL to a variety of datasets on both word-level and character-level language
modeling to compare with state-of-the-art systems, including WikiText-103 (Merity et al., 2016), enwik8 (LLC, 2009), text8 (LLC, 2009), One Billion Word (Chelba et al., 2013), and Penn Treebank (Mikolov and Zweig, 2012).

WikiText-103 is the largest available word-level language modeling benchmark with long-term dependency. It contains 103M training tokens from 28K articles, with an average length of 3.6K tokens per article, which allows testing the ability of long-term dependency modeling. We set the attention length to 384 during training and 1600 during evaluation. We adopted adaptive softmax and input representations (Baevski and Auli, 2018; Grave et al., 2016a). As shown in Table 1, Transformer-XL reduces the previous state-of-the-art (SoTA) perplexity from 20.5 to 18.3, which demonstrates the superiority of the Transformer-XL architecture.

Model                                         #Param   PPL
Grave et al. (2016b) - LSTM                   -        48.7
Bai et al. (2018) - TCN                       -        45.2
Dauphin et al. (2016) - GCNN-8                -        44.9
Grave et al. (2016b) - LSTM + Neural cache    -        40.8
Dauphin et al. (2016) - GCNN-14               -        37.2
Merity et al. (2018) - QRNN                   151M     33.0
Rae et al. (2018) - Hebbian + Cache           -        29.9
Ours - Transformer-XL Standard                151M     24.0
Baevski and Auli (2018) - Adaptive Input ⋄    247M     20.5
Ours - Transformer-XL Large                   257M     18.3

Table 1: Comparison with state-of-the-art results on WikiText-103. ⋄ indicates contemporary work.

The dataset enwik8 contains 100M bytes of unprocessed Wikipedia text. We compare our architecture with the previous results in Table 2. Under the model size constraint, the 12-layer Transformer-XL achieves a new SoTA result, outperforming the 12-layer vanilla Transformer from Al-Rfou et al. (2018) by 0.05 bpc, while both Transformer variants have a large margin over conventional RNN-based models. Notably, our 12-layer architecture achieves the same result as the 64-layer network from Al-Rfou et al. (2018), using only 17% of the parameter budget. In order to see whether better performance can be obtained by increasing the model size, we train 18-layer and 24-layer Transformer-XLs with increased model sizes. With an attention length of 784 during training and 3,800 during evaluation, we obtained a new SoTA result, and our method is the first to break through 1.0 on widely-studied character-level benchmarks. Different from Al-Rfou et al. (2018), Transformer-XL does not need any auxiliary losses, and thus all benefits are credited to a better architecture.

Model                                       #Param   bpc
Ha et al. (2016) - LN HyperNetworks         27M      1.34
Chung et al. (2016) - LN HM-LSTM            35M      1.32
Zilly et al. (2016) - RHN                   46M      1.27
Mujika et al. (2017) - FS-LSTM-4            47M      1.25
Krause et al. (2016) - Large mLSTM          46M      1.24
Knol (2017) - cmix v13                      -        1.23
Al-Rfou et al. (2018) - 12L Transformer     44M      1.11
Ours - 12L Transformer-XL                   41M      1.06
Al-Rfou et al. (2018) - 64L Transformer     235M     1.06
Ours - 18L Transformer-XL                   88M      1.03
Ours - 24L Transformer-XL                   277M     0.99

Table 2: Comparison with state-of-the-art results on enwik8.

Similar to enwik8 but not identical, text8 contains 100M processed Wikipedia characters created by lowercasing the text and removing any character other than the 26 letters a through z, and space. Due to the similarity, we simply adapt the best model and the same hyper-parameters on enwik8 to text8 without further tuning. The comparison with previous methods is summarized in Table 3. Again, Transformer-XL achieves the new SoTA result with a clear margin.

Model                                       #Param   bpc
Cooijmans et al. (2016) - BN-LSTM           -        1.36
Chung et al. (2016) - LN HM-LSTM            35M      1.29
Zilly et al. (2016) - RHN                   45M      1.27
Krause et al. (2016) - Large mLSTM          45M      1.27
Al-Rfou et al. (2018) - 12L Transformer     44M      1.18
Al-Rfou et al. (2018) - 64L Transformer     235M     1.13
Ours - 24L Transformer-XL                   277M     1.08

Table 3: Comparison with state-of-the-art results on text8.

Model                                            #Param   PPL
Shazeer et al. (2014) - Sparse Non-Negative      33B      52.9
Chelba et al. (2013) - RNN-1024 + 9 Gram         20B      51.3
Kuchaiev and Ginsburg (2017) - G-LSTM-2          -        36.0
Dauphin et al. (2016) - GCNN-14 bottleneck       -        31.9
Jozefowicz et al. (2016) - LSTM                  1.8B     30.6
Jozefowicz et al. (2016) - LSTM + CNN Input      1.04B    30.0
Shazeer et al. (2017) - Low-Budget MoE           ∼5B      34.1
Shazeer et al. (2017) - High-Budget MoE          ∼5B      28.0
Shazeer et al. (2018) - Mesh Tensorflow          4.9B     24.0
Baevski and Auli (2018) - Adaptive Input ⋄       0.46B    24.1
Baevski and Auli (2018) - Adaptive Input ⋄       1.0B     23.7
Ours - Transformer-XL Base                       0.46B    23.5
Ours - Transformer-XL Large                      0.8B     21.8

Table 4: Comparison with state-of-the-art results on One Billion Word. ⋄ indicates contemporary work.
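As a reminder of how the two metrics reported in this section relate to the average negative log-likelihood, the following is a minimal Python sketch. The function names and the toy probabilities are illustrative assumptions and are not taken from our experiments; the last lines simply recompute the 17% parameter ratio quoted for enwik8 from the 41M and 235M entries of Table 2.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative natural-log likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def bits_per_character(char_probs):
    """Mean negative base-2 log likelihood per character."""
    return -sum(math.log2(p) for p in char_probs) / len(char_probs)

# Made-up probabilities assigned to the correct next symbol (illustration only).
probs = [0.5, 0.25, 0.125, 0.125]
ppl = perplexity(probs)
bpc = bits_per_character(probs)
print(ppl, bpc)

# Parameter ratio: 12L Transformer-XL (41M) vs. 64L Transformer (235M), Table 2.
ratio = 41 / 235
print(f"{ratio:.1%}")
```

On the same probabilities the two quantities are tied by perplexity = 2^bpc, which is why lower is better for both.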
One Billion Word does not preserve any long-term dependency because sentences have been shuffled. Consequently, this dataset mainly tests the ability to model only short-term dependency. The comparison between Transformer-XL and the other methods is shown in Table 4. Although Transformer-XL is mainly designed to better capture longer-term dependency, it dramatically improves the single-model SoTA from 23.7 to 21.8. Specifically, Transformer-XL significantly outperforms a contemporary method using vanilla Transformers (Baevski and Auli, 2018), suggesting the advantage of Transformer-XL is generalizable to modeling short sequences.

We also report the results on word-level Penn Treebank in Table 5. Similar to AWD-LSTM (Merity et al., 2017), we apply variational dropout and weight averaging to Transformer-XL. With proper regularization, Transformer-XL achieves a new SoTA result among models without two-step finetuning. Penn Treebank has only 1M training tokens, which implies that Transformer-XL also generalizes well even on small datasets.

Model                                         #Param   PPL
Inan et al. (2016) - Tied Variational LSTM    24M      73.2
Zilly et al. (2016) - Variational RHN         23M      65.4
Zoph and Le (2016) - NAS Cell                 25M      64.0
Merity et al. (2017) - AWD-LSTM               24M      58.8
Pham et al. (2018) - Efficient NAS            24M      58.6
Liu et al. (2018) - Differentiable NAS        23M      56.1
Yang et al. (2017) - AWD-LSTM-MoS             22M      55.97
Melis et al. (2018) - Dropout tuning          24M      55.3
Ours - Transformer-XL                         24M      54.52
Merity et al. (2017) - AWD-LSTM+Finetune†     24M      57.3
Yang et al. (2017) - MoS+Finetune†            22M      54.44

Table 5: Comparison with state-of-the-art results on Penn Treebank. † indicates using two-step finetuning.

4.2 Ablation Study

We conduct two sets of ablation studies to examine the effects of two proposed techniques used in Transformer-XL: the recurrence mechanism and the new positional encoding scheme.

The first study is performed on WikiText-103, which requires modeling long-term dependency. The results are reported in Table 6. Among the compared encoding schemes, Shaw et al. (2018) is relative, while Vaswani et al. (2017) and Al-Rfou et al. (2018) are absolute. “Full” and “half” losses refer to applying a cross entropy loss to all or the recent half positions in the segment. We found that absolute encodings only work well with half losses because half losses exclude positions with very short attention lengths during training for better generalization. Table 6 shows that both the recurrence mechanism and our encoding scheme are necessary to achieve the best performance, as well as to generalize to longer attention sequences during evaluation time. Although the backpropagation length during training is only 128, with the two techniques the attention length can be increased to 640 at test time. In the standard setting with 151M parameters, the perplexity decreases as the attention length increases.

Since the recurrence mechanism costs additional memory, we also compare Transformer-XL with baselines under the same GPU memory constraints. As shown in Table 10 in Appendix A, despite using a shorter backpropagation length, Transformer-XL remains superior to the baselines.

The second study aims to isolate the effects of resolving the context fragmentation problem from the benefit of capturing longer context length. In order to achieve this goal, we deliberately choose a dataset that does not require long-term dependency, so that any improvement from establishing the recurrence can be attributed to solving the context fragmentation. Specifically, we perform this controlled experiment on the One Billion Word dataset, which can only benefit from removing the context fragmentation. We train a 20-layer Transformer-XL with ∼0.3B parameters for 400K steps. As shown in Table 7, using segment-level recurrence substantially improves performance even when long-term dependency is not needed, which is consistent with our previous discussion that the recurrence mechanism resolves the context fragmentation problem. Moreover, our relative positional encodings are also superior to Shaw et al. (2018) on short sequences.

4.3 Relative Effective Context Length

Khandelwal et al. (2018) proposed a method to evaluate the Effective Context Length (ECL) of a sequence model. ECL is the longest length to which increasing the context span would lead to a gain more than a threshold. However, ECL ignores the fact that it is harder to get improvement when a model already achieves a lower perplexity using only a shorter context, and thus it is not suitable for fair comparison among multiple models. We instead propose a new metric
called Relative Effective Context Length (RECL). RECL is defined on a model group instead of a single model, and the gain of a long context is measured by the relative improvement over the best short context model. As such, the model group shares the same baseline to enable fair comparison. RECL also has a parameter r, which means constraining the comparison on top-r hard examples. See Appendix C for more details about RECL. As shown in Table 8, Transformer-XL manages to model dependency of 900 words long on average with r = 0.1. The RECL of Transformer-XL is 80% and 450% longer than recurrent networks and Transformer respectively. Both the recurrence mechanism and our positional encodings contribute to a longer RECL. This further substantiates our argument that Transformer-XL is able to model longer-term dependency.

Remark                   Recurrence   Encoding                 Loss   PPL init   PPL best   Attn Len
Transformer-XL (128M)    ✓            Ours                     Full   27.02      26.77      500
-                        ✓            Shaw et al. (2018)       Full   27.94      27.94      256
-                        ✓            Ours                     Half   28.69      28.33      460
-                        ✗            Ours                     Full   29.59      29.02      260
-                        ✗            Ours                     Half   30.10      30.10      120
-                        ✗            Shaw et al. (2018)       Full   29.75      29.75      120
-                        ✗            Shaw et al. (2018)       Half   30.50      30.50      120
-                        ✗            Vaswani et al. (2017)    Half   30.97      30.97      120
Transformer (128M)†      ✗            Al-Rfou et al. (2018)    Half   31.16      31.16      120
Transformer-XL (151M)    ✓            Ours                     Full   23.43      23.09      640
                                                                                 23.16      450
                                                                                 23.35      300

Table 6: Ablation study on WikiText-103. For the first two blocks, we use a slightly smaller model (128M parameters). † indicates that the corresponding row is reduced to the same setting as the Transformer network in (Al-Rfou et al., 2018), except that two auxiliary losses are not implemented in our experiments. “PPL init” refers to using the same length as training. “PPL best” indicates the perplexity obtained by using the optimal length. “Attn Len” is the shortest possible attention length during evaluation to achieve the corresponding result (PPL best). Increasing the attention length during evaluation improves performance only when our positional encoding is used. The “Transformer-XL (151M)” setting uses a standard parameter budget as previous work (Merity et al., 2018), where we observe a similar effect when increasing the attention length during evaluation.

Method                               PPL
Ours                                 25.2
With Shaw et al. (2018) encodings    25.7
Without recurrence                   27.1

Table 7: Ablation study on One Billion Word, a dataset without long-term dependency.

Model                                 r = 0.1   r = 0.5   r = 1.0
Transformer-XL 151M                   900       800       700
QRNN                                  500       400       300
LSTM                                  400       300       200
Transformer-XL 128M                   700       600       500
- use Shaw et al. (2018) encoding     400       400       300
- remove recurrence                   300       300       300
Transformer                           128       128       128

Table 8: Relative effective context length (RECL) comparison. See text for the definition of RECL and r. The first three models and the last four models are compared as two model groups when we calculate RECL (RECL is computed on a model group rather than a single model). Each group has the same parameter budget.

4.4 Generated Text

Trained only on WikiText-103, which is medium-sized, Transformer-XL is already able to generate relatively coherent articles with thousands of tokens without manual cherry picking, despite minor flaws. Please refer to Appendix E for samples.

4.5 Evaluation Speed

Finally, we compare the evaluation speed of our model with the vanilla Transformer model (Al-Rfou et al., 2018). As shown in Table 9, due to the state reuse scheme, Transformer-XL achieves up to a 1,874 times speedup during evaluation.

Attn Len   How much Al-Rfou et al. (2018) is slower
3,800      1,874x
2,800      1,409x
1,800      773x
800        363x

Table 9: Slowdown in terms of running time during evaluation. Evaluation is based on per-token time on one GPU.
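To make the state reuse behind this speedup concrete, the following NumPy sketch runs one single-head attention step with the relative score of Section 3.3 (terms (a) to (d)) over a cached memory plus the current segment. It is an illustration under stated assumptions rather than our implementation: the names `sinusoid` and `rel_attention`, the random weights, the concatenated sin/cos layout of R, and the absence of a scaling factor, LayerNorm and feed-forward sublayers are all choices made here for brevity. Each distinct relative distance is projected once and then looked up, which is one simple way to avoid a per-pair projection and is not claimed to be the procedure of Appendix B.

```python
import numpy as np

def sinusoid(distances, d):
    # Assumed layout: sin and cos halves concatenated; d must be even.
    inv_freq = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = np.outer(distances, inv_freq)            # (T, d/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def rel_attention(h, mem, Wq, WkE, WkR, Wv, u, v):
    """Single-head attention over [mem ; h] with the four-term relative score."""
    L, d = h.shape
    M = mem.shape[0]
    h_tilde = np.concatenate([mem, h], axis=0)   # cached states are reused as-is
    q = h @ Wq.T                                 # (L, d) queries, current segment only
    k = h_tilde @ WkE.T                          # (M+L, d) content-based keys
    val = h_tilde @ Wv.T                         # (M+L, d) values

    # Terms (a) + (c): content addressing plus global content bias u.
    content = (q + u) @ k.T                      # (L, M+L)

    # Terms (b) + (d): project every distinct distance once, then look up i - j.
    rel = sinusoid(np.arange(M + L), d) @ WkR.T  # row t holds W_kR R_t
    by_distance = (q + v) @ rel.T                # column t is the score at distance t
    dist = (M + np.arange(L))[:, None] - np.arange(M + L)[None, :]
    position = np.take_along_axis(by_distance, np.clip(dist, 0, None), axis=1)

    scores = content + position
    scores[dist < 0] = -np.inf                   # masked softmax: no future keys
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ val

rng = np.random.default_rng(0)
d, L = 8, 4                                      # toy sizes, chosen arbitrarily
Wq, WkE, WkR, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(4))
u, v = rng.normal(size=d), rng.normal(size=d)
tokens = rng.normal(size=(3 * L, d))             # stand-in for 12 token embeddings

# Segment-by-segment evaluation: each token is encoded once, then cached.
mem = np.zeros((0, d))
outputs = []
for start in range(0, len(tokens), L):
    seg = tokens[start:start + L]
    outputs.append(rel_attention(seg, mem, Wq, WkE, WkR, Wv, u, v))
    mem = seg                                    # memory for the next segment
out = np.concatenate(outputs)

# Recomputing the last two segments from scratch gives the same last-segment output.
recomputed = rel_attention(tokens[L:], np.zeros((0, d)), Wq, WkE, WkR, Wv, u, v)[-L:]
print(np.allclose(outputs[-1], recomputed))
```

The final check compares the cached evaluation of the last segment with a from-scratch pass over the same context. For this single layer the two agree because the scores depend only on content and relative distance, so the cached states never need to be re-encoded when the window moves forward.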
5 Conclusions

Transformer-XL obtains strong perplexity results, models longer-term dependency than RNNs and Transformer, achieves substantial speedup during evaluation, and is able to generate coherent text articles. We envision interesting applications of Transformer-XL in the fields of text generation, unsupervised feature learning, image and speech modeling.

Acknowledgments

ZD and YY were supported in part by National Science Foundation (NSF) under the grant IIS-1546329 and by the DOE-Office of Science under the grant ASCR #KJ040201. ZY and RS were supported in part by the Office of Naval Research grant N000141812861, the NSF grant IIS1763562, the Nvidia fellowship, and the Siebel scholarship.

References

Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2018. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444.

Alexei Baevski and Michael Auli. 2018. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155.

Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. 2016. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704.

Tim Cooijmans, Nicolas Ballas, César Laurent, Çağlar Gülçehre, and Aaron Courville. 2016. Recurrent batch normalization. arXiv preprint arXiv:1603.09025.

Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural information processing systems, pages 3079–3087.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Adji B Dieng, Chong Wang, Jianfeng Gao, and John Paisley. 2016. Topicrnn: A recurrent neural network with long-range semantic dependency. arXiv preprint arXiv:1611.01702.

Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in neural information processing systems, pages 1019–1027.

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou. 2016a. Efficient softmax approximation for gpus. arXiv preprint arXiv:1609.04309.

Edouard Grave, Armand Joulin, and Nicolas Usunier. 2016b. Improving neural language models with a continuous cache. arXiv preprint arXiv:1612.04426.

Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Alex Graves, Greg Wayne, and Ivo Danihelka. 2014. Neural turing machines. arXiv preprint arXiv:1410.5401.

David Ha, Andrew Dai, and Quoc V Le. 2016. Hypernetworks. arXiv preprint arXiv:1609.09106.

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, and Douglas Eck. 2018. An improved relative self-attention mechanism for transformer with application to music generation. arXiv preprint arXiv:1809.04281.

Hakan Inan, Khashayar Khosravi, and Richard Socher. 2016. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462.
Yangfeng Ji, Trevor Cohn, Lingpeng Kong, Chris Dyer, and Jacob Eisenstein. 2015. Document context language models. arXiv preprint arXiv:1511.03962.

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.

Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, and Shuichi Adachi. 2018. Sigsoftmax: Reanalysis of the softmax bottleneck. arXiv preprint arXiv:1805.10829.

Nan Rosemary Ke, Anirudh Goyal ALIAS PARTH GOYAL, Olexa Bilaniuk, Jonathan Binas, Michael C Mozer, Chris Pal, and Yoshua Bengio. 2018. Sparse attentive backtracking: Temporal credit assignment through reminding. In Advances in Neural Information Processing Systems, pages 7650–7661.

Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. arXiv preprint arXiv:1805.04623.

Bryon Knol. 2017. cmix v13. http://www.byronknoll.com/cmix.html.

Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork rnn. arXiv preprint arXiv:1402.3511.

Ben Krause, Liang Lu, Iain Murray, and Steve Renals. 2016. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959.

Gábor Melis, Charles Blundell, Tomáš Kočiskỳ, Karl Moritz Hermann, Chris Dyer, and Phil Blunsom. 2018. Pushing the bounds of dropout. arXiv preprint arXiv:1805.09208.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2017. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. 2018. An analysis of neural language modeling at multiple scales. arXiv preprint arXiv:1803.08240.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, and Marc’Aurelio Ranzato. 2014. Learning longer memory in recurrent neural networks. arXiv preprint arXiv:1412.7753.

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association.

Tomas Mikolov and Geoffrey Zweig. 2012. Context dependent recurrent neural network language model. SLT, 12(234-239):8.

Frederic Morin and Yoshua Bengio. 2005. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pages 246–252. Citeseer.

Asier Mujika, Florian Meier, and Angelika Steger. 2017. Fast-slow recurrent neural networks. In Advances in Neural Information Processing Systems, pages 5915–5924.

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2012. Understanding the exploding gradient problem. CoRR, abs/1211.5063.

Oleksii Kuchaiev and Boris Ginsburg. 2017. Factorization tricks for lstm networks. arXiv preprint
                                                        Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt
  arXiv:1703.10722.
                                                         Gardner, Christopher Clark, Kenton Lee, and Luke
                                                         Zettlemoyer. 2018. Deep contextualized word rep-
Quoc V Le, Navdeep Jaitly, and Geoffrey E Hin-
                                                         resentations. arXiv preprint arXiv:1802.05365.
  ton. 2015. A simple way to initialize recurrent
  networks of rectified linear units. arXiv preprint    Hieu Pham, Melody Y Guan, Barret Zoph, Quoc V
  arXiv:1504.00941.                                       Le, and Jeff Dean. 2018. Efficient neural architec-
                                                          ture search via parameter sharing. arXiv preprint
Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo       arXiv:1802.03268.
  Gao. 2018. Independently recurrent neural network
  (indrnn): Building a longer and deeper rnn. In Pro-   Ofir Press and Lior Wolf. 2016. Using the output
  ceedings of the IEEE Conference on Computer Vi-         embedding to improve language models. arXiv
  sion and Pattern Recognition, pages 5457–5466.          preprint arXiv:1608.05859.

Hanxiao Liu, Karen Simonyan, and Yiming Yang.           Alec Radford, Karthik Narasimhan, Tim Salimans, and
  2018. Darts: Differentiable architecture search.        Ilya Sutskever. 2018. Improving language under-
  arXiv preprint arXiv:1806.09055.                        standing by generative pre-training. URL https://s3-
                                                          us-west-2. amazonaws. com/openai-assets/research-
MultiMedia LLC. 2009.       Large text compression        covers/languageunsupervised/language         under-
 benchmark.                                               standing paper. pdf.
Jack W Rae, Chris Dyer, Peter Dayan, and Timothy P Lillicrap. 2018. Fast parametric learning with activation memorization. arXiv preprint arXiv:1803.10049.

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.

Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. 2018. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10434–10443.

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.

Noam Shazeer, Joris Pelemans, and Ciprian Chelba. 2014. Skip-gram language modeling using sparse non-negative matrix probability estimation. arXiv preprint arXiv:1412.1454.

Trieu H Trinh, Andrew M Dai, Thang Luong, and Quoc V Le. 2018. Learning longer-term dependencies in rnns with auxiliary losses. arXiv preprint arXiv:1803.00144.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Tian Wang and Kyunghyun Cho. 2015. Larger-context language modelling. arXiv preprint arXiv:1511.03729.

Wenlin Wang, Zhe Gan, Wenqi Wang, Dinghan Shen, Jiaji Huang, Wei Ping, Sanjeev Satheesh, and Lawrence Carin. 2017. Topic compositional neural language model. arXiv preprint arXiv:1712.09783.

Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R Salakhutdinov. 2016. On multiplicative integration with recurrent neural networks. In Advances in neural information processing systems, pages 2856–2864.

Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W Cohen. 2017. Breaking the softmax bottleneck: A high-rank rnn language model. arXiv preprint arXiv:1711.03953.

Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. 2014. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329.

Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. 2016. Recurrent highway networks. arXiv preprint arXiv:1607.03474.

Barret Zoph and Quoc V Le. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

A   Ablation Study with Memory Constraints

              Backprop Len   Recurrence   Encoding   Loss      pplx best   pplx init   Attn Len
              128            ✓            Ours       Full      26.77       27.02       500
              128            ✓            Ours       Partial   28.33       28.69       460
              176            ✗            Ours       Full      27.98       28.43       400
              172            ✗            Ours       Partial   28.83       28.83       120

                 Table 10: Ablation study on WikiText-103 with the same GPU memory constraints.

   Table 10 compares Transformer-XL with the baseline under the same memory budget. Transformer-XL
still outperforms the baseline even with a shorter backprop length.

B   Efficient Computation of the Attention with Relative Positional Embedding
As we discussed in section 3.3, the naive way of computing $W_{k,R} R_{i-j}$ for all pairs $(i, j)$ is subject
to a quadratic cost. Here, we present a simple method with only a linear cost. First, notice that the
relative distance $i - j$ can only be an integer from 0 to $M + L - 1$, where $M$ and $L$ are the memory length
and segment length respectively. Hence, the rows of the matrix
                             >                                   >
                                               [Wk,R RM +L−1 ]
                                                   
                           RM +L−1
                             >                                 >
                           R
                           M +L−2           [Wk,R RM +L−2 ] 
                               ..                     ..
                     Q :=          Wk,R > =                    ∈ R(M +L)×d
                                                              
                                .                      .
                                >                       >     
                              R1                 [Wk,R R1 ]
                                 >                         >
                              R0                 [Wk,R R0 ]

consist of all possible vector outputs of $W_{k,R} R_{i-j}$ for any $(i, j)$. Note that we have defined $Q$ in a
reversed order, i.e., $Q_k = W_{k,R} R_{M+L-1-k}$, to make further discussion easier.
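   Next, we collect the term (b) for all possible $i, j$ into the following $L \times (M + L)$ matrix,

$$
\begin{aligned}
B &= \begin{bmatrix}
q_0^\top W_{k,R} R_M & \cdots & q_0^\top W_{k,R} R_0 & 0 & \cdots & 0 \\
q_1^\top W_{k,R} R_{M+1} & \cdots & q_1^\top W_{k,R} R_1 & q_1^\top W_{k,R} R_0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top W_{k,R} R_{M+L-1} & \cdots & q_{L-1}^\top W_{k,R} R_{L-1} & q_{L-1}^\top W_{k,R} R_{L-2} & \cdots & q_{L-1}^\top W_{k,R} R_0
\end{bmatrix} \\
&= \begin{bmatrix}
q_0^\top Q_{L-1} & \cdots & q_0^\top Q_{M+L-1} & 0 & \cdots & 0 \\
q_1^\top Q_{L-2} & \cdots & q_1^\top Q_{M+L-2} & q_1^\top Q_{M+L-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top Q_0 & \cdots & q_{L-1}^\top Q_M & q_{L-1}^\top Q_{M+1} & \cdots & q_{L-1}^\top Q_{M+L-1}
\end{bmatrix}
\end{aligned}
$$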

Then, we further define

$$
\widetilde{B} = q Q^\top = \begin{bmatrix}
q_0^\top Q_0 & \cdots & q_0^\top Q_M & q_0^\top Q_{M+1} & \cdots & q_0^\top Q_{M+L-1} \\
q_1^\top Q_0 & \cdots & q_1^\top Q_M & q_1^\top Q_{M+1} & \cdots & q_1^\top Q_{M+L-1} \\
\vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\
q_{L-1}^\top Q_0 & \cdots & q_{L-1}^\top Q_M & q_{L-1}^\top Q_{M+1} & \cdots & q_{L-1}^\top Q_{M+L-1}
\end{bmatrix}.
$$

Now, it is easy to see an immediate relationship between $B$ and $\widetilde{B}$, where the $i$-th row of $B$ is simply a
left-shifted version of the $i$-th row of $\widetilde{B}$. Hence, the computation of $B$ only requires a matrix multiplication
$q Q^\top$ to compute $\widetilde{B}$ and then a set of left-shifts.
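Reading the two matrices entry by entry, with zero-based row index $i$ and column index $j$, the shift can be written out explicitly. This restates the definitions above and is not an additional result:

$$
B_{i,j} = \begin{cases}
\widetilde{B}_{i,\, j + (L-1-i)} & \text{if } 0 \le j \le M + i, \\
0 & \text{otherwise,}
\end{cases}
$$

so row $i$ is shifted left by $L - 1 - i$ positions and the $L - 1 - i$ vacated entries on the right are filled with zeros.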
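   Similarly, we can collect all term (d) for all possible $i, j$ into another $L \times (M + L)$ matrix $D$,

$$
D = \begin{bmatrix}
v^\top Q_{L-1} & \cdots & v^\top Q_{M+L-1} & 0 & \cdots & 0 \\
v^\top Q_{L-2} & \cdots & v^\top Q_{M+L-2} & v^\top Q_{M+L-1} & \cdots & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
v^\top Q_0 & \cdots & v^\top Q_M & v^\top Q_{M+1} & \cdots & v^\top Q_{M+L-1}
\end{bmatrix}.
$$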
Then, we can follow the same procedure to define

$$
\widetilde{d} = [Q v]^\top = \begin{bmatrix} v^\top Q_0 & \cdots & v^\top Q_M & v^\top Q_{M+1} & \cdots & v^\top Q_{M+L-1} \end{bmatrix}.
$$

Again, each row of $D$ is simply a left-shifted version of $\widetilde{d}$. Hence, the main computation cost comes from
the matrix-vector multiplication $\widetilde{d} = [Q v]^\top$, which is not expensive any more.
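The shift trick for terms (b) and (d) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the array layout (`R[k]` holds the embedding for relative distance k) and the explicit Python loop for the left-shifts are all choices made here for readability.

```python
import numpy as np

def term_b_naive(q, R, W, M):
    """Reference: B[i, j] = q_i^T W R_{i+M-j}, one projection per (i, j) pair."""
    L = q.shape[0]
    B = np.zeros((L, M + L))
    for i in range(L):
        for j in range(M + i + 1):            # entries with i + M - j >= 0
            B[i, j] = q[i] @ (W @ R[i + M - j])
    return B

def build_Q(R, W):
    """Q_k = W R_{M+L-1-k}: project each distance once, in reversed order."""
    return R[::-1] @ W.T                      # shape (M + L, d)

def left_shift_rows(X_tilde, M):
    """Row i of the result is row i of X_tilde shifted left by L-1-i, zero padded."""
    L = X_tilde.shape[0]
    X = np.zeros_like(X_tilde)
    for i in range(L):
        X[i, : M + i + 1] = X_tilde[i, L - 1 - i :]
    return X

def term_b_shifted(q, R, W, M):
    B_tilde = q @ build_Q(R, W).T             # a single matrix multiplication
    return left_shift_rows(B_tilde, M)

def term_d_shifted(v, R, W, M, L):
    d_tilde = build_Q(R, W) @ v               # a single matrix-vector product
    return left_shift_rows(np.tile(d_tilde, (L, 1)), M)
```

In this sketch `W @ R[k]` is evaluated only M + L times inside `build_Q`, whereas the naive version evaluates it once per non-zero entry of B.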

C     Details About RECL




Figure 3: Visualizing unnormalized relative perplexity gains with r = 0.1. Panels: (a) Transformer-XL vs RNNs; (b) Transformer-XL vs Baseline.

Figure 4: Perplexity vs context length. Panels: (a) Transformer-XL vs RNNs; (b) Transformer-XL vs Baseline.

  In this section, we describe the details of the metric RECL. Let $\mathcal{M} = \{m_1, m_2, \cdots, m_N\}$ be a model
group consisting of $N$ models. Let $l_i(c, t)$ denote the loss of model $m_i$ on the $t$-th token in the corpus
with a context length $c$. Concretely, the loss can be written as

$$
l_i(c, t) = -\log P_{m_i}(x_t \mid x_{t-1}, \cdots, x_{t-c})
$$

where $P_{m_i}$ is the probability distribution given by model $m_i$, and $x_t$ is the $t$-th token in the corpus. Given
a short context length $c$ and a long context length $c'$ such that $c' \ge c$, we can further define a baseline for
each position $t$,

$$
b(c, t) = \min_{i=1}^{N} l_i(c, t)
$$
    The relative loss of $m_i$ w.r.t. the model group $\mathcal{M}$ is written as

$$
f_i(c, c') = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \min\big( b(c, t),\; l_i(c', t) \big)
$$

The above equation uses the minimum loss of all models on the short length $c$ as a baseline, and only
losses smaller than the baseline will be effectively counted towards the relative loss. This enables fair
comparison between multiple models because all models with a long context length $c'$ need to improve
over the same baseline. Sometimes we only care about those positions where the baseline performs
poorly (which means short-term dependency with context length $c$ is not sufficient), so given a ratio
parameter $r$, we define the set $\mathcal{T}$ in the above equation as

$$
\mathcal{T} = \text{top-}r \text{ positions } t \text{ with largest } b(c, t)
$$
    The relative gain is subsequently defined as the relative perplexity reduction:

$$
g_i(c, c') = \frac{\exp f_i(c, c) - \exp f_i(c, c')}{\exp f_i(c, c)}
$$
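One consequence of these definitions is worth spelling out. Since the baseline is a minimum over the whole group, $b(c, t) \le l_i(c, t)$ for every model $m_i$, so the short-context term reduces to the average baseline over the selected positions and the gain is bounded:

$$
f_i(c, c) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} b(c, t),
\qquad
g_i(c, c') = 1 - \exp\big( f_i(c, c') - f_i(c, c) \big) \in [0, 1).
$$

The bound follows because $\min(b(c, t), \cdot) \le b(c, t)$ implies $f_i(c, c') \le f_i(c, c)$.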
    Given a step size $\Delta$, we then use an algorithm to find the RECL by thresholding the relative gain:
    1. Set the initial short context length $c$, and the long context length $c' = c + \Delta$.
    2. Compute $g_i(c, c')$. If $g_i(c, c') < 0.01$, return RECL $= c$. If $g_i(c, c') \ge 0.01$, set $c = c'$, $c' = c + \Delta$
       and repeat step 2.
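The procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch only: `loss_at(k, c)` is a hypothetical callback returning the per-token losses of model k at context length c, and the `c_max` guard is an assumption added here so that the loop always terminates. Neither appears in the paper.

```python
import numpy as np

def relative_loss(short_losses, long_loss, r):
    """f_i(c, c'): mean over T of min(b(c, t), l_i(c', t)).

    short_losses: array (N, num_tokens) with l_k(c, t) for every model k.
    long_loss:    array (num_tokens,) with l_i(c', t) for the scored model.
    """
    baseline = short_losses.min(axis=0)               # b(c, t)
    k = max(1, int(round(r * baseline.size)))
    T = np.argsort(-baseline)[:k]                     # top-r positions, largest b(c, t)
    return np.minimum(baseline[T], long_loss[T]).mean()

def relative_gain(short_losses, i, long_loss, r):
    f_short = relative_loss(short_losses, short_losses[i], r)   # f_i(c, c)
    f_long = relative_loss(short_losses, long_loss, r)          # f_i(c, c')
    return (np.exp(f_short) - np.exp(f_long)) / np.exp(f_short)

def find_recl(loss_at, n_models, i, c_init, delta,
              r=0.1, threshold=0.01, c_max=10_000):
    """Grow the context by delta until the relative gain drops below threshold."""
    c = c_init
    while c + delta <= c_max:                         # c_max: added safety guard
        short_losses = np.stack([loss_at(k, c) for k in range(n_models)])
        gain = relative_gain(short_losses, i, loss_at(i, c + delta), r)
        if gain < threshold:
            return c
        c += delta
    return c
```

Note that the baseline is recomputed from all models at every new short length c, which matches the definition of $b(c, t)$ but means each step evaluates every model in the group.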
  In Figure 3, we visualize the unnormalized relative perplexity gains $(\exp f_i(c, c) - \exp f_i(c, c'))$ with
various pairs of $(c, c')$ when $r = 0.1$. It is clear that Transformer-XL has a longer RECL compared to
RNNs and other baselines because the relative gains are substantially larger.
  For reference, we plot the perplexities with varying context lengths in Figure 4. The y-axis denotes
the “normal” perplexity (not calibrated by baselines).

D     Attention Visualization
In this section, we provide some visualizations of the attention learned by the SoTA model on the
WikiText-103 validation set. Recall that this model has 16 transformer layers with 10 heads each and relies on
a memory of length 640.




Figure 5: Average attention over the previous 640 tokens, where each row corresponds to an attention head and each
column corresponds to a relative location. There are 160 attention heads in total, and every 10 heads come from a
single layer. Darker colors indicate higher values.

   The first visualization aims at revealing the overall trend of where the model is attending. Specifically,
for each attention head of each layer, we average the attention distributions of all tokens in the validation
set. This is shown in Fig. 5. As we can see, the overall trend is to focus more on the nearby tokens
than the faraway ones. However, it is also very clear that some attention heads have a wider attention
distribution over the entire memory span, notably the head 8 from layer 1, head 78 from layer 8, and the
head 158 from layer 16.
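The averaging behind Fig. 5 can be sketched as below. This is a minimal illustration that assumes the attention weights have already been collected into a single array of shape (layers, heads, tokens, memory length); the array layout, the function names and the zero-based row numbering are assumptions made here, not details given in the paper.

```python
import numpy as np

def average_attention(attn):
    """attn: (n_layers, n_heads, n_tokens, mem_len); attn[l, h, t] sums to 1.

    Returns an array (n_layers * n_heads, mem_len) with one averaged
    distribution per head; heads of the same layer occupy consecutive rows.
    """
    n_layers, n_heads, _, mem_len = attn.shape
    return attn.mean(axis=2).reshape(n_layers * n_heads, mem_len)

def layer_of_row(row, n_heads=10):
    """1-based layer index of a heat-map row, assuming 0-based row numbering."""
    return row // n_heads + 1
```

Under this numbering, rows 8, 78 and 158 fall in layers 1, 8 and 16, which agrees with the heads singled out in the text.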
   Since we are focused on learning long-range dependency, we are especially interested in these heads
with a wider attention span. Thus, in the second set of visualizations, we pick the three notable heads
mentioned above, and visualize their attention behavior for a randomly chosen position, as shown in Fig.
6. Here, we see three different patterns of wider attention:
• For the head 8 in the 1st layer, we see an almost uniform attention over the entire memory span. This
  is quite intuitive, as lower-level layers need to screen the entire memory span to decide where to focus
  for higher-level layers.
Figure 6: Visualization of the three heads with a wide attention range: (a) head 8 from layer 1, (b) head 78 from
layer 8, (c) head 158 from layer 16. Each row corresponds to a target location/token and each column corresponds
to a context location/token. Tokens in the memory that have top 20% attention values are highlighted in red.


• For the head 78 in the 8th layer (a middle-level layer), we see a very sparse attention pattern scattered
  in all ranges of the memory. Again, this fits our intuition well: as information accumulates, the
  network may focus on some particular positions of special interest.
• For the head 158 in the 16th layer (i.e. the last layer), each target location (corresponding to each row)
  has its own distinct sparse focus, differing from head 78 where target locations largely share the same
  attentive location in memory. Meanwhile, the pattern is also different from the case of head 8: here,
  a few locations are clearly attended to more than others.
   Finally, as we have discussed in section 3.3, the attention score can be decomposed into four intuitive
terms. Here, we want to further investigate how these four terms contribute to the overall attention trend
in Fig. 5. Since the term (c) represents the global content bias, i.e., the prior importance of each word
regardless of the context, we will leave it out and focus on the terms (a), (b) and (d). So, for each term,
we take the Softmax w.r.t. the memory span and average the resulting distributions of all tokens in the
validation set. The results are visualized in Fig. 7:
• Since term (a) is fully content-based addressing, when averaging over all target words, the result is
  essentially uniform over the entire context, except for a few very close words, which are likely to be
  semantically similar to the target word.
• The overall trend of term (b) highly resembles that of the entire attention distribution in Fig. 5. It
  suggests that the global trend of focusing on the nearby context is largely driven by this content-
  dependent positional bias.
• The overall trend of term (d) is also focusing more on nearby words. However, compared to the trend
  of term (b), it is clearly flatter and is biased towards a longer context.
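The per-term analysis described above (softmax over the memory span, then an average over tokens) can be sketched as follows. This is a minimal illustration that assumes the raw scores of one term for one head are available as a (tokens, memory length) array; the function name and the array layout are assumptions made here.

```python
import numpy as np

def averaged_term_distribution(term_scores):
    """term_scores: (n_tokens, mem_len) raw scores of a single term for one head.

    Applies a softmax over the memory span for every token, then averages
    the resulting distributions over tokens.
    """
    # Subtract the row maximum before exponentiating for numerical stability.
    z = term_scores - term_scores.max(axis=-1, keepdims=True)
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs.mean(axis=0)
```

Because every per-token softmax sums to one, the averaged vector is itself a distribution over relative locations, which is what makes the rows of such a heat map comparable across heads.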
Figure 7: Visualization of the three terms in computing the attention score: (a) term (a), (b) term (b), (c) term (d).
Each row corresponds to an attention head and each column corresponds to a relative location.


E   Generated Text

In this section, we present some generated text from our best model trained on the WikiText-103 dataset.
We seed our Transformer-XL with a context of at most 512 consecutive tokens randomly sampled
from the test set of WikiText-103. Then, we run Transformer-XL to generate a pre-defined number of
tokens (500 or 1,000 in our case). For each generation step, we first find the top-40 probabilities of the
next-step distribution and sample from the top-40 tokens based on the re-normalized distribution. To aid
reading, we detokenize the context, the generated text and the reference text. Three generated examples
are shown in Tables 11, 12, and 13. Note that we do not perform any cherry picking and present the
first three examples we generate in the paper. In the text, “= text =”, “= = text = =” and “= = = text = = =”
denote the Wikipedia page title, section title and subsection title, respectively, due to the original data
preprocessing procedure of WikiText-103 (Merity et al., 2016).
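The sampling step described above (keep the 40 most probable tokens, re-normalize, sample) can be sketched as below. This is an illustrative sketch, not the authors' generation code: `next_token_probs` is a hypothetical callback standing in for the model, and the function names are chosen here.

```python
import numpy as np

def sample_top_k(probs, k=40, rng=None):
    """Sample a token id from the k most probable entries of probs."""
    rng = np.random.default_rng() if rng is None else rng
    k = min(k, probs.size)
    top = np.argpartition(probs, -k)[-k:]     # indices of the k largest probabilities
    p = probs[top] / probs[top].sum()         # re-normalize over the top-k tokens
    return int(rng.choice(top, p=p))

def generate(next_token_probs, context, n_tokens, k=40, rng=None):
    """next_token_probs(tokens) is a hypothetical callback returning the
    model's next-step distribution as a 1-D array over the vocabulary."""
    tokens = list(context)
    for _ in range(n_tokens):
        tokens.append(sample_top_k(next_token_probs(tokens), k, rng))
    return tokens[len(context):]
```

Truncating to the top 40 tokens removes the long tail of unlikely tokens while still sampling, so repeated runs from the same context give different continuations.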
   As we can see, though only trained on 100M tokens, Transformer-XL is a strong model at generating
long text articles, particularly in the following aspects:
   • Transformer-XL is able to structurally maintain the sectional arrangement of Wikipedia.
   • Transformer-XL manages to semantically stay on the same topic throughout the course of genera-
       tion.
   • Long-range references are common in the generated text.
   • Transformer-XL often generates novel content that is not present in the training data.
For a more detailed explanation of the interesting observations in each example, please refer to the corre-
sponding caption.
   Despite the overall excellence of the generation quality, the model can only perceive the seed context
and hallucinate what to generate based on the limited knowledge (100M tokens only) it is trained on.
As a result, the generated text sometimes looks clearly relevant but is not as close or to the point
as what a human writer would produce. That said, we believe this issue is mostly a problem of limited
training data size and could be alleviated by using a larger training set.

 Context:
 Kershaw started the 2010 season by posting a 3.07 ERA in April, but did so by walking 22 batters in 29 innings. On May 4, he had his worst start of his career
 against the Milwaukee Brewers at Dodger Stadium, throwing just 57 pitches in 1 1 / 3 innings, while retiring only four of the 13 batters he faced — including the
 pitcher. He was booed loudly upon being pulled from the game. Kershaw said after the game, " I didn’t give our team any kind of chance. It’s just not a good
 feeling to let your teammates down, let everybody down. It stings, it hurts. I ’ve got to figure things out. " Kershaw rebounded his next start by pitching an 8 inning
 two-hitter and out-dueling the then undefeated Ubaldo Jiménez. He credited his control of the slider being the major turning point for him. Later in the season, he
 was suspended for five games after hitting Aaron Rowand of the Giants with a pitch in a game on July 20. The incident occurred after both teams were given a
 warning following Giants ace Tim Lincecum hitting Matt Kemp earlier in the game. He threw his first career complete game shutout on September 14, 2010 also
 against San Francisco and finished the season with a record of 13 – 10 and a 2.91 ERA in 32 starts, pitching 204 1 / 3 innings and recording 212 strikeouts.
 = = = = 2011 season: 1st Cy Young Award = = = =
 After finishing the 2010 season strong, the Dodgers named Kershaw as the Opening Day Starter for the 2011 season. On May 29, he pitched the second complete-
 game shutout of his career, striking out 10 while winning a two-hitter against the Florida Marlins, 8 – 0; he also had two singles and an RBI, scoring twice in the
 game. He produced his third career shutout on June 20, a two-hit, 11-strikeout effort against the Detroit Tigers. Kershaw became the first Dodgers starter to strike
 out the side in the 9th inning since Sandy Koufax’s perfect game. In his next start, on June 26, Kershaw pitched another complete game (against the Los Angeles
 Angels of Anaheim ). He became the first Dodger starter to have back-to-back complete game victories since Jeff Weaver in the 2005 season and the first Dodger
 to have double-digit strikeouts in consecutive starts since Chan-Ho Park in the 2000 season. He was awarded the National League Player of the Week award for the
 week of June 20 – 26 as a result of those two starts. Midway through June, Kershaw had amassed 32 career victories
 XL Generation:
 , tying him with Luis Castillo for the third best start total in the league and trailing only Johnny Bench and Eddie Mathews (37). The Dodgers signed Kershaw to a one-year deal on July 14, 2011, with a player option for 2012, in a deal worth just under $ 300,000. The Dodgers entered the 2011 season as the major league leaders, and Kershaw had the best record in the National League in saves (39), games started (78) and walks (120); he also had the lowest win / loss ratio in the league (1.87), and he had the worst on-base percentage (.349). He had the best ERA (2.00 ), most wins (16 ), or most shutouts (29) in his career (25), while leading the league in losses per nine innings pitched (4.01). Following the 2011 season, the Dodgers traded Kershaw to the Kansas City Royals on July 29 for pitcher Javier Torres.
 = = = Kansas City Royals = = =
 = = = = 2012 season = = = =
 During spring training, Kershaw played very well. He was selected to spring training as a relief pitcher for the Royals for the 2012 season. After an injury to closer Javier Vázquez, he was activated on April 29 to replace Matt Holliday in the Royals ’ starting rotation. In his only start with the Royals, on August 6, 2012, Kershaw struck out five batters in seven innings pitched to help the Royals to their first victory in franchise history. On September 27, 2012, it appeared Kershaw was going to pitch a complete game shutout against the Detroit Tigers, but did not manage to do so since the Tigers won 3 – 1. At the conclusion of the season, Kershaw was named Major League Baseball’s Most Valuable Player, was chosen to the All-Star Game at Busch Stadium and was named to the All-Star Game as the starting pitcher at shortstop. The Royals announced on February 4, 2013 that Kershaw would spend 2013 as starting pitcher, though he was expected to miss the season as a result of a shoulder injury.
 = = = = 2013 season = = = =
 On May 17, 2013, Kershaw sustained another back injury and did not start in August and October 2013. He appeared in 22 starts, all starts, finishing with a strikeout-to-walk ratio of 1.50 and a 2.91 ERA. He also had the third most strikeouts in the league: 10. On May 20, 2013, he

 Reference:
 , a 3.15 ERA and 593 career strikeouts in 568.2 innings. According to the Elias Sports Bureau, Kershaw was the first 23-year-old pitcher to have that many victories, an ERA that low and an average of more than one strikeout per inning since ERA became an official statistic in 1910. Kershaw was selected to the National League team for the 2011 Major League Baseball All-Star Game, his first All-Star selection. In the month of July, Kershaw was 4 – 1 with a 2.02 ERA and NL-leading 45 strikeouts, earning him the National League Pitcher of the Month Award. On August 23, he struck out Matt Holliday of the St. Louis Cardinals for his 200th strikeout of the season and became the 10th Dodger pitcher to record back-to-back 200 strikeout seasons and the first since Chan-Ho Park did it in the 2001 season. Kershaw finished the 2011 season by leading the NL with 21 wins, 248 strikeouts and a 2.28 ERA, winning the NL pitching Triple Crown, the first Triple Crown winner since Jake Peavy of the 2007 San Diego Padres and the first Dodger since Sandy Koufax won it in the 1966 season. Justin Verlander of the Detroit Tigers won the American League Triple Crown the same season, marking the first major-league season since 1924 to feature Triple Crown-winning pitchers in both leagues. Kershaw’s 21 wins were the most by a Dodger pitcher since Orel Hershiser won 23 during the 1988 season. His ERA was the lowest by a Dodger since Hershiser’s 2.03 in the 1985 season, his strikeouts were the most by a Dodger since Koufax’s 317 in 1966 and his 233 1 / 3 innings pitched were the most since Chan Ho Park pitched 234 in 2001. Since 1965 when Koufax did it, Peavy and Kershaw are only two pitchers in the National League have led the league in wins, strikeouts, ERA, and WHIP (walks plus hits per inning pitched). Kershaw also became just the second <unk> to have a 240-plus strikeouts in a season before the age of 24, joining Vida Blue. After the season, Kershaw was awarded the Warren Spahn Award as the best left-handed pitcher in 2011, the Players Choice Award for Most Outstanding National League pitcher, the Gold Glove Award as the top fielding pitcher in the NL and the Sporting News (TSN) National League Pitcher of the Year. He was additionally selected as the starting pitcher for the TSN NL All-Star Team. On November 17, he was honored with the National League Cy Young Award, making him the youngest Cy Young winner since Dwight Gooden


Table 11: Example 1 – 500 tokens generated by XL using a snippet from the WikiText-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/Clayton_Kershaw
There are many interesting observations from this example:
• Firstly, Kershaw never played for the Royals in real life. Despite that, Transformer-XL stays on the fully imagined topic and keeps
  hallucinating the experience of Kershaw with the Royals across the generated text.
• Secondly, notice that XL correctly tracks the chronological order from 2011 to 2012 and finally to the 2013 season in the
  section titles.
• In addition, notice that Transformer-XL accurately uses the phrase “another back injury” in the 2013 season paragraph,
  since it has talked about one earlier injury in the 2012 season. This again shows Transformer-XL’s ability to capture
  long-term dependency.
 Context:
 = = Distribution = =
 Species range across the Neotropics from Mexico in the north to Bolivia, Paraguay, and southern Brazil in the south. According to <unk> and coauthors, three
 species are found in Mexico, four in Central America, and 62 in South America. Three species are present in the Caribbean — two in Trinidad and Tobago, along
 the southern edge of the region, and one in Haiti.
 = = Habitat and ecology = =
 <unk> includes both large trees and small acaulescent palms which occupy a number of different ecological niches. Dense stands of some of the larger species are
 conspicuous elements on the landscape, while smaller species are found in both in the forest understorey and in savannas. Disturbance has been implicated in the
 formation of vegetation dominated by large <unk> species. In seasonally dry Amazonian forests the density of large adult A. <unk> palms was correlated with
 canopy openness; the species also dominates savannas formed by repeated forest fires in Trinidad and Tobago. <unk> speciosa forms pure stands in many parts of
 Brazil where natural forest vegetation has been cleared. Similarly, stands of A. <unk> in Bahia, Brazil (which are cultivated for <unk> fibre) are managed using
 fire — the seedlings survive cutting and burning, and are able to dominate burned forest patches. The fruit are dispersed by animals; fruit which are not dispersed
 frequently suffer seed predation by <unk> beetles. Certain species of <unk> have been mentioned as examples of " anachronistic " species which are adapted
 for dispersal by now-extinct Pleistocene megafauna. On <unk> Island, <unk>, in the Brazilian Amazon, <unk> <unk> fruit were consumed by tapirs, collared
 peccaries, deer and primates. Rodents, including agoutis, fed upon the fruit and, as the fruit availability declined, they fed on the seeds. Other dispersers of <unk>
 fruit include Crested <unk> which consume the fruit and disperse the seeds of A. <unk> in the Brazilian Pantanal.
 = = Uses = =
 <unk> species have a long history of human utilisation. <unk> <unk> <unk> seeds have been found in archaeological sites in Colombia dating back to 9000 BP.
 A variety of species remain important sources of edible oil, thatch, edible seeds and fibre. The leaves of <unk> <unk> and A. <unk> are used extensively for
 thatching. Several species are oil palms, with A. speciosa among the most important economically. Products extracted from A. speciosa were reported to support
 over 300,000 households in the Brazilian state of Maranhão in 2005, and in 1985 it was estimated to support over 450,000 households throughout the Brazil. <unk>
 fibres, extracted from the leaf bases of A. <unk>, are commercially important, and generated about US $
 XL Generation:                                                                         Reference:
 700,000 in 2013 on the black market. In 2004 the plant was reported to be              20 million in annual income to Brazilian farmers in 1996.
 responsible for a net profit of around US $ 3 million in Brazil. A 2012 study          = The Heart of Ezra Greer =
 found A. speciosa to cause agricultural production at least 2 – 3 times greater        The Heart of Ezra Greer is a 1917 American silent drama film produced by the
 than those achieved without prior knowledge of land management.                        Thanhouser Company and directed by Emile <unk>. The film focuses on Ezra
 = The Tale of Mrs. Tittlemouse =                                                       Greer, a successful middle-aged man who searches for his college age daughter,
 The Tale of Mrs. Tittlemouse is a children’s book written and illustrated by H.        Mary. The wayward Mary was romanced and abandoned by Jack <unk>, later
 M. Potter. It was first published in February 1910 by Frederick Warne & Co.            bearing his child. Once Ezra becomes broke he finds employment as the valet
 and included illustrations by Frederick Warne. Mrs. Tittlemouse was intended           for Jack <unk>. After Jack’s engagement to a cabaret girl, Mary becomes upset
 to be a children’s book, rather than a musical, and the tale describes her attempts    and leaves her child at Jack’s home. Contrary to Jack’s wishes, Ezra keeps the
 to find her own home. Mrs. Tittlemouse is an average kitten and the tale follows       child and Jack ultimately reveals that the child is his own. Ezra convinces Jack
 a mischievous rabbit named Mr. Tittlemouse who tries to kill her in an attempt         to make things right and Ezra convinces the cabaret girl to leave Jack. After a
 to get a ride on a horse. Potter later wrote of the rabbit and her attempts to         carriage accident in which the baby is injured, Ezra and Jack rush to the hospital
 kill her, " If [ she ] were a fox, I should like to have been able to show her the     and find Mary as a nurse crying over the child. The film ends with the marriage
 way. " Potter’s work was inspired by The Tale of Little Black Sambo and her            of Jack and Mary. The film was released by Pathé on October 7, 1917. The film
 two nieces, Petunia and Petunia. It was the first book in the Potter family to be      was the final release from Thanhouser and was deemed to be an average film
 published in a hardback format, and was also the first book in the Peter Rabbit        by most reviewers. Criticism for the film hinged on far-fetched coincidences to
 series. The tale was released on 13 November 1910 and was accompanied by               drive the plot. The film is presumed lost.
 a greeting card by Potter’s brother-in-law, Frederick Warne. The tale was well         = = Plot = =
 received by critics and children, and sold over 600,000 copies in its first few        The film follows Ezra Greer, a middle-aged man who has worked hard since
 years. In the first year of its publication, Potter’s The Tale of Mrs. Tittlemouse     his youth. He cares deeply for his motherless daughter, Mary, but was unable
 won children’s book awards in England, America and Australia. It was selected          to attend the annual commencement at her co-educational college. He awaits
 as a " recommended reading " by children in the US and Canada.                         for her to return from college, but Mary leaves with her romantic interest, Jack
 = = Plot summary = =                                                                   <unk>. On promise of marriage and wealth, Mary is romanced and gives birth
 The tale opens with the pet rabbit Mrs. Tittlemouse wandering through a forest         to a fatherless child. Without word from his daughter, Ezra resigns from his
 in a small litter of four. He is shot and badly injured by a deer after she attempts   job and attempts to seek her out and finds a poor motherless child, Marie. With
 to escape. Mrs. Tittlemouse decides that she wants to find her own home, be-           Ezra’s money exhausted he seeks employment and finds it as the valet of Jack.
 cause she is eager to go on her own. She goes alone to the farm where she makes        One day, Mary seeks an announcement of Jack’s engagement to a cabaret girl
 a little money by selling a few seeds and building a small cabin in the woods.         known as " The Baby Vamp ". Bitter over the prospect of her child’s future,
 She is approached by a wealthy hunter named Mr. Tittlemouse, who tries to kill         she leaves the child at Jack’s home during his absence with a note. Jack orders
 her but Mrs. Tittlemouse kills him by stuffing a rope into his nose and killing        Ezra to take the baby to an orphanage, but Marie begs Ezra to keep him. After
 him. She is rescued by Mr. Tittlemouse’s wife Ruth, but Mrs. Tittlemouse then          continually seeing the child, Jack is overcome with remorse and explains to Ezra
 leaves the woodland with the baby. When she is spotted by                              and seeks his advice. Not knowing he was making the case for his own daughter,
                                                                                        Ezra convinces Jack to seek out Mary and forget the Baby Vamp. The Baby


Table 12: Example 2 – 500 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/The_Tale_of_Mrs._Tittlemouse.
This example exhibits some additional interesting properties of Transformer-XL:
• After finishing the last paragraph of the seed context, both the reference and generated text start a new topic (i.e., Wikipedia
  page), as marked by the single “= title =” line. This suggests the model is able to identify the end of a topic / page
  and to start a new, randomly chosen topic.
• Even more interestingly, the newly started page is about a book called “The Tale of Mrs. Tittlemouse”. Transformer-XL manages
  to copy the same book title and some related information from the training set, but hallucinates novel content for the book.
  This demonstrates a degree of generalization instead of memorization. Please refer to the Wikipedia page for the original
  book content.
Context:
= Battle of Dürenstein =
The Battle of Dürenstein (also known as the Battle of <unk>, Battle of <unk> and Battle of <unk>; German: <unk> bei <unk> ), on 11 November 1805 was
an engagement in the Napoleonic Wars during the War of the Third Coalition. Dürenstein (modern <unk>) is located in the <unk> Valley, on the River Danube,
73 kilometers (45 mi) upstream from Vienna, Austria. The river makes a crescent-shaped curve between <unk> and nearby Krems an der Donau and the battle
was fought in the flood plain between the river and the mountains. At Dürenstein a combined force of Russian and Austrian troops trapped a French division
commanded by Théodore Maxime Gazan. The French division was part of the newly created VIII Corps, the so-called Corps Mortier, under command of Édouard
Mortier. In pursuing the Austrian retreat from Bavaria, Mortier had over-extended his three divisions along the north bank of the Danube. Mikhail <unk> Kutuzov,
commander of the Coalition force, enticed Mortier to send Gazan’s division into a trap and French troops were caught in a valley between two Russian columns.
They were rescued by the timely arrival of a second division, under command of Pierre Dupont de l ’Étang. The battle extended well into the night. Both sides
claimed victory. The French lost more than a third of their participants, and Gazan’s division experienced over 40 percent losses. The Austrians and Russians
also had heavy <unk> to 16 <unk> perhaps the most significant was the death in action of Johann Heinrich von Schmitt, one of Austria’s most capable chiefs of
staff. The battle was fought three weeks after the Austrian capitulation at Ulm and three weeks before the Russo-Austrian defeat at the Battle of Austerlitz. After
Austerlitz Austria withdrew from the war. The French demanded a high indemnity and Francis II abdicated as Holy Roman Emperor, releasing the German states
from their allegiance to the Holy Roman Empire.
= = Background = =
In a series of conflicts from 1803-15 known as the Napoleonic Wars, various European powers formed five coalitions against the First French Empire. Like the wars
sparked by the French Revolution (1789 ), these further revolutionized the formation, organization and training of European armies and led to an unprecedented
militarization, mainly due to mass conscription. Under the leadership of Napoleon, French power rose quickly as the Grande Armée conquered most of Europe,
and collapsed rapidly
XL Generation:                                                                        Reference:
after the French conquest of Italy. This victory marked the beginning of the          after the disastrous invasion of Russia in 1812. Napoleon’s empire ultimately
Second Coalition. Napoleon’s rapid advance caused Austria – Russia, Britain           suffered complete military defeat in the 1813 – 14 campaigns, resulting in the
and Russia to make peace negotiations. The Russian army under Prince Mikhail          restoration of the Bourbon monarchy in France. Although Napoleon made a
Mikhailovich Mikhailovich Chaffee, commander of the Imperial Russian Army             spectacular return in 1815, known as the Hundred Days, his defeat at the Battle
in Switzerland, was able to drive off the Austrians in the Battle of Stockach         of Waterloo, the pursuit of his army and himself, his abdication and banishment
(1805) and to take Pressburg (modern Ö3 Austria) from the Austrians. At an            to the Island of Saint Helena concluded the Napoleonic Wars.
early stage the Austro-Hungarian fleet had been damaged in a storm off Cape           = = Danube campaign = =
Matapan; this allowed the Allies to launch a full-scale invasion of Italy in Febru-   From 1803-06 the Third Coalition fought the First French Empire and its client
ary 1805. The Battle of Austerlitz was the decisive French victory against            states (see table at right ). Although several naval battles determined control of
Napoleon and the largest naval battle in history, involving more modern-day           the seas, the outcome of the war was decided on the continent, predominantly
European navies. The French military situation worsened as Napoleon faced             in two major land operations in the Danube valley: the Ulm campaign in the
several threats related to his newly formed Army of Europe, particularly Russia       upper Danube and the Vienna campaign, in the middle Danube valley. Political
itself, which was now on the verge of collapse. The Russian general and friend        conflicts in Vienna delayed Austria’s entry into the Third Coalition until 1805.
of Napoleon, Alexander, had been dissatisfied with the conduct of the Austri-         After hostilities of the War of the Second Coalition ended in 1801, Archduke
ans; he was still in conflict with Emperor Napoleon, the French Republic’s king,      <unk> emperor’s <unk> advantage of the subsequent years of peace to develop
who had declared war on Austria during the summer of 1804. With the war over,         a military restructuring plan. He carefully put this plan into effect beginning in
Napoleon decided to launch a second invasion of Italy in October 1805.                1803 – 04, but implementation was incomplete in 1805 when Karl Mack, Lieu-
= = Prelude = =                                                                       tenant Field Marshal and Quartermaster-General of the Army, implemented his
In July 1805, the French 1st Army entered southern Italy. The army, under the         own restructuring. Mack bypassed Charles ’ methodical approach. Occurring
command of Marshal Marmont, were reinforced by a few battalions of infantry           in the field, Mack’s plan also undermined the overall command and organiza-
under Claude General Auguste de Marmont at the town of Philippsburg and               tional structure. Regardless, Mack sent an enthusiastic report to Vienna on the
another battalion at Belluno. On 17 September 1805, the army marched from             military’s readiness. Furthermore, after misreading Napoleon’s maneuvers in
Belluno towards Krems. By 29 September, they had reached Belluno and con-             Württemberg, Mack also reported to Vienna on the weakness of French dispo-
ducted its advance against a small Austrian force. By 31 September, the whole         sitions. His reports convinced the war party advising the emperor, Francis II,
force had been reinforced by a brigade from the Army of Tyrol under the com-          to enter the conflict against France, despite Charles ’ own advice to the con-
mand of Pierre Augereau. The Austrians were now under the command of Mar-             trary. Responding to the report and rampant anti-French fever in Vienna, Fran-
shal Jean Victor Marie Moreau, a member of the Directory. Moreau had taken            cis dismissed Charles from his post as generalissimo and appointed his <unk>
command of the Austrian invasion force in the spring of 1805. His command             brother-in-law, Archduke Ferdinand, as commander. The inexperienced Ferdi-
included the VI Corps commanded by Jean Baptiste Drouet de Ney and the VI             nand was a poor choice of replacement for the capable Charles, having neither
Corps commanded by Generals Jean Victor Marie Moreau and Joseph Souham.               maturity nor aptitude for the assignment. Although Ferdinand retained nomi-
Ney’s corps consisted of the III. Corps and VI. Corps, which consisted of the         nal command, day-to-day decisions were placed in the hands of Mack, equally
III Corps and VI. Corps, located in the Austrian Netherlands, was commanded           ill-suited for such an important assignment. When Mack was wounded early
by Friedrich Joseph, Count Baillet de Latour. Moreau’s army consisted of six          in the campaign, he was unable to take full charge of the army. Consequently,
divisions and several associated brigades.                                            command further devolved to Lieutenant Field Marshal Karl Philipp, Prince of
= = Aftermath = =                                                                     Schwarzenberg, an able cavalry officer but inexperienced in the command of
= = = First Coalition forces = = =                                                    such a large army.
On 9 October 1805 the French Army of the Danube was attacked by an Aus-               = = = Road to Ulm = = =
trian army under Archduke Charles at the Battle of Austerlitz. Although Charles       The campaign in the upper Danube valley began in October, with several clashes
and Charles had not had much time to regroup, on 10 October, he launched his          in Swabia. Near the Bavarian town of Wertingen, 40 kilometers (25 mi) north-
attack on the Polish forces under Friedrich Joseph, Count of Lauenburg. Af-           west of Augsburg, on 8 October the 1st Regiment of dragoons, part of Murat’s
ter three days, Charles’ army captured Lauenburg. The French forces pursued           Reserve Cavalry Corps, and grenadiers of Lannes ’ V Corps surprised an Aus-
the Austrians to the Silesian border, where they encountered strong Austrian          trian force half its size. The Austrians were arrayed in a line and unable to form
resistance. These conflicts forced the Austrians to retreat into Tyrol and Aus-       their defensive squares quickly enough to protect themselves from the 4,000
tria agreed to a truce. The Austrian army, commanded by Wenzel Anton Karl,            dragoons and 8,000 grenadiers. Nearly 3,000 Austrians were captured and over
Count of Merveldt, was reduced to around 10,000 men. It was initially planned         400 were killed or wounded. A day later, at another small town, <unk> south
that Archduke Charles would launch a counter-attack against the French army           of the Danube <unk> French 59th Regiment of the Line stormed a bridge over
on the same day, as Napoleon had hoped, but this was not carried out. On 25           the Danube and, humiliatingly, chased two large Austrian columns toward Ulm.
October, Merveldt left Styria for Tyrol. On the same day, Austria launched its        The campaign was not entirely bad news for Vienna. At Haslach, Johann von
new offensive against the French at Ulm. Charles withdrew his army from the           Klenau arranged his 25,000 infantry and cavalry in a prime defensive position
region for a third time at the Battle of Elchingen, under the overall command of      and, on 11 October, the overly confident General of Division Pierre Dupont de
the Austrian generals, Ferdinand and Friedrich Wilhelm of Jülich-Cleves-Berg.         l’Étang attacked Klenau’s force with fewer than 8,000 men. The French lost
To prevent Archduke Charles from escaping from the battlefield, the comman-           1,500 men killed and wounded. Aside from taking the Imperial Eagles and
der of the Habsburg army, Archduke Charles, planned to occupy the fortress            <unk> of the 15th and 17th Dragoons, Klenau’s force also captured 900 men,
Linz; instead, he decided to force Franz von Hipper to surrender the city. How-       11 guns and 18 ammunition wagons. Klenau’s victory was a singular success.
ever, as Charles moved to the south, Moreau arrived on the scene with additional      On 14 October Mack sent two columns out of Ulm in preparation for a breakout
soldiers – including the entire Imperial Guard – and defeated the Austrians at        to the north: one under Johann Sigismund Riesch headed toward Elchingen to
the Battle of Hohenlinden on 28 October. The loss of Linz resulted in Austria’s       secure the bridge there, and the other under Franz von Werneck went north with
complete defeat at Hohenlinden. In the meantime, the French Army of Obser-            most of the heavy artillery. Recognizing the opportunity, Marshal Michel Ney
vation and Preparedness was reorganized into the Army of the Danube under             hurried the rest of his VI Corps forward to re-establish contact with Dupont, who
Feldzeugmeister (Colonel-General) Friedrich Freiherr von Hotze. The army              was still north of the Danube. In a two-pronged attack Ney sent one division to
was composed of the I, IV, VI, VI, VII, VIII and IX Corps. With reinforcements        the south of Elchingen on the right bank of the Danube. This division began the
from Italy and France, it formed new battalions, companies, and squadrons in          assault at Elchingen. At the same time another division crossed the river to the
the Austrian army. On 17 November 1804, at the Battle of Jena-Auerstadt the           east and moved west against Riesch’s position. After clearing Austrian pickets
Army of Silesia and the Army of Silesia joined forces, but by the time that the       from a bridge, the French attacked and captured a strategically located abbey at
French approached Vienna, the Prussians had already surrendered. As the Aus-      the top of the hill at bayonet point. The Austrian cavalry unsuccessfully tried to
trians did not want to allow the war to continue, they decided to abandon their   fend off the French, but the Austrian infantry broke and ran. In this engagement
territories in the north and move their army to the north and west, cutting off   alone, the Austrians lost more than half their reserve artillery park, 6,000 (out
Charles from Vienna. The Battle of Warsaw was fought on 23 November 1805          of 8,000 total participants) dead, wounded or captured and four colors. Reisch’s
between the French Army of the Danube and the Austrian Army of Styria in          column also failed to destroy the bridges across the Danube. Napoleon’s light-
the vicinity of Warsaw and Pressburg (modern Trnava, Slovakia). At that time      ning campaign exposed the Austrian indecisive command structure and poor
Habsburg forces                                                                   supply apparatus. Mack

Table 13: Example 3 – 1,000 tokens generated by XL using a snippet from the Wikitext-103 test set as initial context. The
sample is randomly generated without any cherry picking.
Original Wikipedia page: https://en.wikipedia.org/wiki/Battle_of_D%C3%BCrenstein.
• Although this example is significantly longer, we can see that Transformer-XL is still able to stay on the same topic and
  makes up non-existent stories about the Napoleonic Wars.
• Notably, from the second section on, the generated text correctly follows a fine-grained chronological order on the level of
  month and day to narrate events in 1805, except for one mistake (1804 instead of 1805) near the end of the paragraph. To ease
  reading, we have highlighted all the date-related phrases in magenta in the generation.
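
The chronological-order observation above can also be checked mechanically. The following Python sketch is not part of the paper's evaluation; it only illustrates one way to do it. The excerpt is a hand-abbreviated list of date phrases taken from the Prelude and Aftermath sections of the generation in Table 13, and the helper names `extract_dates` and `out_of_order` are hypothetical.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_INDEX = {name: i + 1 for i, name in enumerate(MONTHS)}

# Optional day, month name, optional four-digit year (e.g. "17 September 1805").
# Assumption: capitalised month names are dates; a sentence-initial "May" used
# as a verb would be a false positive.
DATE_RE = re.compile(
    r"\b(?:(\d{1,2}) )?(" + "|".join(MONTHS) + r")(?: (\d{4}))?\b"
)

def extract_dates(text, default_year):
    """Return (year, month, day) tuples in order of appearance.

    A date without a year inherits the most recent explicit year;
    a date without a day gets day 0.
    """
    dates = []
    year = default_year
    for day, month, explicit_year in DATE_RE.findall(text):
        if explicit_year:
            year = int(explicit_year)
        dates.append((year, MONTH_INDEX[month], int(day) if day else 0))
    return dates

def out_of_order(dates):
    """Return every date that is earlier than the latest date seen so far."""
    flagged = []
    latest = None
    for d in dates:
        if latest is not None and d < latest:
            flagged.append(d)
        else:
            latest = d
    return flagged

# Abbreviated date phrases from the Table 13 generation (hand-copied excerpt).
SAMPLE = (
    "In July 1805; On 17 September 1805; By 29 September; By 31 September; "
    "On 9 October 1805; on 10 October; On 25 October; on 28 October; "
    "On 17 November 1804; on 23 November 1805"
)

dates = extract_dates(SAMPLE, default_year=1805)
print(out_of_order(dates))
```

On this excerpt the only flagged entry is the 17 November 1804 date, which matches the single mistake noted above. The check covers ordering only: it does not validate calendar dates, so "31 September" passes unnoticed.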