Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi (1, 2), Mostofa Patwary (1, 2), Raul Puri (1, 2), Patrick LeGresley (2), Jared Casper (2), Bryan Catanzaro (2)

(1) Equal contribution. (2) NVIDIA. Correspondence to: Mohammad Shoeybi <mshoeybi@nvidia.com>.

arXiv:1909.08053v4 [cs.CL] 13 Mar 2020

Abstract

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training very large transformer models and implement a simple, efficient intra-layer model parallel approach that enables training transformer models with billions of parameters. Our approach does not require a new compiler or library changes, is orthogonal and complementary to pipeline model parallelism, and can be fully implemented with the insertion of a few communication operations in native PyTorch. We illustrate this approach by converging transformer based models up to 8.3 billion parameters using 512 GPUs. We sustain 15.1 PetaFLOPs across the entire application with 76% scaling efficiency when compared to a strong single GPU baseline that sustains 39 TeraFLOPs, which is 30% of peak FLOPs. To demonstrate that large language models can further advance the state of the art (SOTA), we train an 8.3 billion parameter transformer language model similar to GPT-2 and a 3.9 billion parameter model similar to BERT. We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased performance as the model size grows. Using the GPT-2 model we achieve SOTA results on the WikiText103 (10.8 compared to SOTA perplexity of 15.8) and LAMBADA (66.5% compared to SOTA accuracy of 63.2%) datasets. Our BERT model achieves SOTA results on the RACE dataset (90.9% compared to SOTA accuracy of 89.4%).

1. Introduction

Natural Language Processing (NLP) is advancing quickly in part due to an increase in available compute and dataset size. The abundance of compute and data enables training increasingly larger language models via unsupervised pretraining (Devlin et al., 2018; Radford et al., 2019). Empirical evidence indicates that larger language models are dramatically more useful for NLP tasks such as article completion, question answering, and natural language inference (Lan et al., 2019; Raffel et al., 2019). By finetuning these pretrained language models on downstream natural language tasks, one can achieve state of the art results as shown in recent work (Devlin et al., 2018; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; 2017; Ramachandran et al., 2016; Liu et al., 2019b; Dai et al., 2019; Yang et al., 2019; Liu et al., 2019a; Lan et al., 2019).

As these models become larger, they exceed the memory limit of modern processors, and require additional memory management techniques such as activation checkpointing (Chen et al., 2016). Widely used optimization algorithms such as ADAM require additional memory per parameter to store momentum and other optimizer state, which reduces the size of models that can be effectively trained. Several approaches to model parallelism overcome this limit by partitioning the model such that the weights and their associated optimizer state do not need to reside concurrently on the processor. For example, GPipe (Huang et al., 2018) and Mesh-TensorFlow (Shazeer et al., 2018) provide frameworks for model parallelism of different kinds. However, they require rewriting the model, and rely on custom compilers and frameworks that are still under development.

In this work, we implement a simple and efficient model parallel approach using intra-layer model-parallelism. We exploit the inherent structure in transformer based language models to make a simple model-parallel implementation that trains efficiently in PyTorch, with no custom C++ code or compiler required. This approach is orthogonal to pipeline-based model parallelism as advocated by approaches such as GPipe (Huang et al., 2018).

To demonstrate the scalability of our approach, we establish
a baseline by training a model of 1.2 billion parameters on a single NVIDIA V100 32GB GPU, that sustains 39 TeraFLOPs. This is 30% of the theoretical peak FLOPS for a single GPU as configured in a DGX-2H server, and is thus a strong baseline. Scaling the model to 8.3 billion parameters on 512 GPUs with 8-way model parallelism, we achieve up to 15.1 PetaFLOPs per second sustained over the entire application. This is 76% scaling efficiency compared to the single GPU case. Figure 1 shows more detailed scaling results.

Figure 1. Model (blue) and model+data (green) parallel FLOPS as a function of number of GPUs. Model parallel (blue): up to 8-way model parallel weak scaling with approximately 1 billion parameters per GPU (e.g. 2 billion for 2 GPUs and 4 billion for 4 GPUs). Model+data parallel (green): similar configuration as model parallel combined with 64-way data parallel.

To analyze the effect of model size scaling on accuracy, we train both left-to-right GPT-2 (Radford et al., 2019) language models as well as BERT (Devlin et al., 2018) bidirectional transformers and evaluate them on several downstream tasks. We show that the existing BERT architecture results in model degradation as the size increases. We overcome this challenge by rearranging the layer normalization and residual connection in the transformer layers and show that with this change, results for the downstream tasks on development sets improve monotonically as the model size increases. In addition, we show that our models achieve test set state of the art (SOTA) results on WikiText103, cloze-style prediction accuracy on LAMBADA, and reading comprehension RACE datasets.

In summary, our contributions are as follows:

  • We implement a simple and efficient model parallel approach by making only a few targeted modifications to an existing PyTorch transformer implementation.

  • We perform an in-depth empirical analysis of our model and data parallel technique and demonstrate up to 76% scaling efficiency using 512 GPUs.

  • We show that careful attention to the placement of layer normalization in BERT-like models is critical to achieving increased accuracies as the model grows.

  • We demonstrate that scaling the model size results in improved accuracies for both GPT-2 (studied up to 8.3 billion parameters) and BERT (studied up to 3.9B parameters) models.

  • We showcase that our models achieve state of the art results on test sets: perplexity on WikiText103 (10.8 ppl), accuracy on LAMBADA (66.5%), and accuracy on RACE (90.9%).

  • We open source our code along with the training and evaluation pipelines at https://github.com/NVIDIA/Megatron-LM

2. Background and Challenges

2.1. Neural Language Model Pretraining

Pretrained language models have become an indispensable part of NLP researchers’ toolkits. Leveraging large corpus pretraining to learn robust neural representations of language is an active area of research that has spanned the past decade. Early examples of pretraining and transferring neural representations of language demonstrated that pretrained word embedding tables improve downstream task results compared to word embedding tables learned from scratch (Mikolov et al., 2013; Pennington et al., 2014; Turian et al., 2010). Later work advanced research in this area by learning and transferring neural models that capture contextual representations of words (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018; Radford et al., 2017; 2019). Recent parallel work (Ramachandran et al., 2016; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Liu et al., 2019b; Dai et al., 2019; Yang et al., 2019; Liu et al., 2019a; Lan et al., 2019) further builds upon these ideas by not just transferring the language model to extract contextual word representations, but by also finetuning the language model in an end to end fashion on downstream tasks. Through these works, the state of the art has advanced from transferring just word embedding tables to transferring entire multi-billion parameter language models. This progression of methods has necessitated the need for hardware, systems techniques, and frameworks that are able to operate efficiently at scale and satisfy increasing computational needs. Our work aims to provide the tools necessary to take another step forward in this trend.

2.2. Transformer Language Models and Multi-Head Attention

Current work in NLP trends towards using transformer models (Vaswani et al., 2017) due to their superior accuracy
and compute efficiency. The original transformer formulation was designed as a machine translation architecture that transforms an input sequence into another output sequence using two parts, an Encoder and Decoder. However, recent work leveraging transformers for language modeling such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) use only the Encoder or Decoder depending on their needs. This work explores both a decoder architecture, GPT-2, and an encoder architecture, BERT.

Figure 2. Transformer Architecture. Purple blocks correspond to fully connected layers. Each blue block represents a single transformer layer that is replicated N times.

Figure 2 shows a schematic diagram of the model we used. We refer the reader to prior work for a detailed description of the model architecture (Vaswani et al., 2017; Devlin et al., 2018; Radford et al., 2019). It is worthwhile to mention that both GPT-2 and BERT use GeLU (Hendrycks & Gimpel, 2016) nonlinearities and apply layer normalization (Ba et al., 2016) to the input of the multi-head attention and feed forward layers, whereas the original transformer (Vaswani et al., 2017) uses ReLU nonlinearities and applies layer normalization to outputs.

2.3. Data and Model Parallelism in Deep Learning

There are two central paradigms for scaling out deep neural network training to numerous hardware accelerators: data parallelism (Valiant, 1990) where a training minibatch is split across multiple workers, and model parallelism in which the memory usage and computation of a model is distributed across multiple workers. By increasing the minibatch size proportionally to the number of available workers (i.e. weak scaling), one observes near linear scaling in training data throughput. However, large batch training introduces complications into the optimization process that can result in reduced accuracy or longer time to convergence, offsetting the benefit of increased training throughput (Keskar et al., 2017). Further research (Goyal et al., 2017; You et al., 2017; 2019) has developed techniques to mitigate these effects and drive down the training time of large neural networks. To scale out training even further, parallel work (Chen et al., 2016) has combined data parallelism with activation checkpointing: recomputing activations in the backward pass without storing them in the forward pass to reduce memory requirements.

However, these techniques have one fundamental limitation in the problem size they can tackle: the model must fit entirely on one worker. With language models of increasing size and complexity like BERT and GPT-2, neural networks have approached the memory capacity of modern hardware accelerators. One solution to this problem is to employ parameter sharing to reduce the memory footprint of the model (Lan et al., 2019), but this limits the overall capacity of the model. Our approach is to utilize model parallelism to split the model across multiple accelerators. This not only alleviates the memory pressure, but also increases the amount of parallelism independently of the microbatch size.

Within model parallelism, there are two further paradigms: layer-wise pipeline parallelism, and more general distributed tensor computation. In pipeline model parallelism, groups of operations are performed on one device before the outputs are passed to the next device in the pipeline where a different group of operations is performed. Some approaches (Harlap et al., 2018; Chen et al., 2018) use a parameter server (Li et al., 2014) in conjunction with pipeline parallelism. However, these suffer from inconsistency issues. The GPipe framework for TensorFlow (Huang et al., 2018) overcomes this inconsistency issue by using synchronous gradient descent. This approach requires additional logic to handle the efficient pipelining of these communication and computation operations, and suffers from pipeline bubbles that reduce efficiency, or changes to the optimizer itself which impact accuracy.

Distributed tensor computation is an orthogonal and more general approach that partitions a tensor operation across multiple devices to accelerate computation or increase model size. FlexFlow (Jia et al., 2018), a deep learning framework orchestrating such parallel computation, provides a method to pick the best parallelization strategy. Recently, Mesh-TensorFlow (Shazeer et al., 2018) introduced a language for specifying a general class of distributed tensor computations in TensorFlow (Abadi et al., 2015). The parallel dimensions are specified in the language by the end user and the resulting graph is compiled with proper collective primitives. We utilize similar insights to those leveraged in Mesh-TensorFlow and exploit parallelism in computing the transformer’s attention heads to parallelize our transformer model. However, rather than implementing a framework and compiler for model parallelism, we make only a few targeted modifications to existing PyTorch transformer implementations. Our approach is simple, does not require any new compiler or code re-writing, and can be fully implemented by inserting a few simple primitives, as described in the next section.
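
To make the idea of partitioning a single tensor operation across devices concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the authors' implementation: nested lists stand in for tensors, the two "devices" are simply two list entries, and the helper names (matmul, split_columns, split_rows) are hypothetical. It shows the two basic ways to split a matrix multiply: by columns of the weight matrix, where the partial outputs are concatenated, and by rows, where the partial outputs must be summed.

```python
# Toy illustration of partitioning one matrix multiply across two "devices".
# Plain Python lists stand in for tensors; no real communication happens.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(m, parts):
    """Split matrix m into `parts` equally wide blocks of columns."""
    width = len(m[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in m] for i in range(parts)]

def split_rows(m, parts):
    """Split matrix m into `parts` equally tall blocks of rows."""
    height = len(m) // parts
    return [m[i * height:(i + 1) * height] for i in range(parts)]

X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
A = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 1.0, 3.0],
     [2.0, 1.0, 0.0, 1.0],
     [1.0, 2.0, 1.0, 0.0]]

reference = matmul(X, A)  # unpartitioned result

# Column split: each device holds a block of A's columns plus the full input.
# The partial outputs are concatenated column-wise (what an all-gather would do).
col_parts = [matmul(X, a_i) for a_i in split_columns(A, 2)]
col_result = [left + right for left, right in zip(*col_parts)]

# Row split: each device holds a block of A's rows and the matching columns of X.
# The partial outputs are summed elementwise (what an all-reduce would do).
row_parts = [matmul(x_i, a_i)
             for x_i, a_i in zip(split_columns(X, 2), split_rows(A, 2))]
row_result = [[p + q for p, q in zip(r0, r1)] for r0, r1 in zip(*row_parts)]

print(col_result == reference, row_result == reference)
```

Both partitioned results equal the unpartitioned product; the difference lies in what has to be communicated afterwards, a concatenation in the column case and a sum in the row case.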

3. Model Parallel Transformers

We take advantage of the structure of transformer networks to create a simple model parallel implementation by adding a few synchronization primitives. A transformer layer consists of a self attention block followed by a two-layer, multi-layer perceptron (MLP) as shown in Figure 2. We introduce model parallelism in both of these blocks separately.

We start by detailing the MLP block. The first part of the block is a GEMM followed by a GeLU nonlinearity:

    $$Y = \mathrm{GeLU}(XA) \qquad (1)$$

One option to parallelize the GEMM is to split the weight matrix $A$ along its rows and input $X$ along its columns as:

    $$X = [X_1, X_2], \quad A = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix}. \qquad (2)$$

This partitioning will result in $Y = \mathrm{GeLU}(X_1 A_1 + X_2 A_2)$. Since GeLU is a nonlinear function, $\mathrm{GeLU}(X_1 A_1 + X_2 A_2) \neq \mathrm{GeLU}(X_1 A_1) + \mathrm{GeLU}(X_2 A_2)$ and this approach will require a synchronization point before the GeLU function.

Another option is to split $A$ along its columns, $A = [A_1, A_2]$. This partitioning allows the GeLU nonlinearity to be independently applied to the output of each partitioned GEMM:

    $$[Y_1, Y_2] = [\mathrm{GeLU}(XA_1), \mathrm{GeLU}(XA_2)] \qquad (3)$$

This is advantageous as it removes a synchronization point. Hence, we partition the first GEMM in this column parallel fashion and split the second GEMM along its rows so it takes the output of the GeLU layer directly without requiring any communication as shown in Figure 3a. The output of the second GEMM is then reduced across the GPUs before passing the output to the dropout layer. This approach splits both GEMMs in the MLP block across GPUs and requires only a single all-reduce operation in the forward pass (g operator) and a single all-reduce in the backward pass (f operator). These two operators are conjugates of each other and can be implemented in PyTorch with only a few lines of code. As an example, the implementation of the f operator is provided below:

    import torch
    import torch.distributed as dist

    class f(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x  # identity in the forward pass

        @staticmethod
        def backward(ctx, gradient):
            dist.all_reduce(gradient)  # sum over the (assumed default) model parallel group
            return gradient

Code 1. Implementation of f operator. g is similar to f with identity in the backward and all-reduce in the forward functions.

Figure 3. Blocks of Transformer with Model Parallelism: (a) MLP, (b) Self-Attention. f and g are conjugate. f is an identity operator in the forward pass and all reduce in the backward pass while g is an all reduce in the forward pass and identity in the backward pass.

As shown in Figure 3b, for the self attention block we exploit inherent parallelism in the multihead attention operation, partitioning the GEMMs associated with key (K), query (Q), and value (V) in a column parallel fashion such that the matrix multiply corresponding to each attention head is done locally on one GPU. This allows us to split per attention head parameters and workload across the GPUs, and does not require any immediate communication to complete the self-attention. The subsequent GEMM from the output linear layer (after self attention) is parallelized along its rows and takes the output of the parallel attention layer directly, without requiring communication between the GPUs. This approach for both the MLP and self attention layer fuses groups of two GEMMs, eliminates a synchronization point in between, and results in better scaling. This enables us to perform all GEMMs in a simple transformer layer using only two all-reduces in the forward path and two in the backward path (see Figure 4).

The transformer language model has an output embedding with the dimension of hidden-size (H) times vocabulary-size (v). Since the vocabulary size is on the order of tens of thousands of tokens for modern language models (for example, GPT-2 used a vocabulary size of 50,257), it is beneficial to parallelize the output embedding GEMM. However, in transformer language models, the output embedding layer shares weights with the input embedding, requiring modifications to both. We parallelize the input embedding weight matrix $E_{H \times v}$ along the vocabulary dimension, $E = [E_1, E_2]$ (column-wise). Since each partition now only
                                                                   4. Setup
                                                                   Pretrained language understanding models are central tasks
                                                                   in natural language processing and language understanding.
                                                                   There are several formulations of language modeling. In
                                                                   this work we focus on GPT-2 (Radford et al., 2019), a left-
                                                                   to-right generative transformer based language model, and
                                                                   BERT (Devlin et al., 2018), a bi-directional transformer
                                                                   model based on language model masking. We explain our
                                                                   configurations for these models in the following section and
Figure 4. Communication operations in a transformer layer. There   refer to the original papers for more details.
are 4 total communication operations in the forward and backward
pass of a single model parallel transformer layer.                 4.1. Training Dataset
                                                                   To collect a large diverse training set with longterm de-
                                                                   pendencies we aggregate several of the largest language
                                                                   modeling datasets. We create an aggregate dataset consist-
contains a portion of the embedding table, an all-reduce (g        ing of Wikipedia (Devlin et al., 2018), CC-Stories (Trinh &
operator) is required after the input embedding. For the           Le, 2018), RealNews (Zellers et al., 2019), and OpenWeb-
output embedding, one approach is to perform the parallel          text (Radford et al., 2019). To avoid training set leakage
GEMM [Y1 , Y2 ] = [XE1 , XE2 ] to obtain the logits, add an        into our downstream tasks we remove the Wikipedia articles
all-gather Y = all-gather([Y1 , Y2 ]), and send the results to     present in the WikiText103 test set (Merity et al., 2016).
the cross-entropy loss function. However, for this case, the       We also remove unnecessary newlines from the CC-Stories
all-gather will communicate b × s × v elements (b is the           corpus introduced by preprocessing artifacts. For BERT
batch-size and s is the sequence length) which is huge due to      models we include BooksCorpus (Zhu et al., 2015) in the
vocabulary size being large. To reduce the communication           training dataset, however, this dataset is excluded for GPT-2
size, we fuse the output of the parallel GEMM [Y1 , Y2 ] with      trainings as it overlaps with LAMBADA task.
the cross entropy loss which reduces the dimension to b × s.       We combined all the datasets and then filtered out all the
Communicating scalar losses instead of logits is a huge re-        documents with content length less than 128 tokens from
duction in communication that improves the efficiency of           the aggregated dataset. Since similar content might appear
our model parallel approach.                                       multiple times in the aggregated datasets, we used locality-
Much of our model parallel approach can be characterized           sensitive hashing (LSH) to deduplicate content with a jac-
as techniques aimed at reducing communication and keep-            card similarity greater than 0.7. The resulting aggregate
ing the GPUs compute bound. Rather than having one GPU             corpus contains 174 GB of deduplicated text.
compute part of the dropout, layer normalization, or residual
connections and broadcast the results to other GPUs, we            4.2. Training Optimization and Hyperparameters
choose to duplicate the computation across GPUs. Specifi-
                                                                   To train our models efficiently we utilize mixed precision
cally, we maintain duplicate copies of layer normalization parameters on each GPU, and take the
output of the model parallel region and run dropout and residual connection on these tensors
before feeding them as input to the next model parallel regions. To optimize the model we allow
each model parallel worker to optimize its own set of parameters. Since all values are either
local to or duplicated on a GPU, there is no need for communicating updated parameter values in
this formulation.

We present further details about the hybrid model and data parallelism and handling random number
generation in Appendix B for reference. In summary, our approach as described above is simple to
implement, requiring only a few extra all-reduce operations added to the forward and backward
pass. It does not require a compiler, and is orthogonal and complementary to the pipeline model
parallelism advocated by approaches such as (Huang et al., 2018).

training with dynamic loss scaling to take advantage of the V100’s Tensor Cores (Micikevicius
et al., 2017; NVIDIA, 2018). We start by initializing our weights W with a simple normal
distribution W ∼ N(0, 0.02). We then scale weights immediately before residual layers by
1/√(2N), where N is the number of transformer layers comprised of self attention and MLP blocks.
For our optimizer we utilize Adam (Kingma & Ba, 2014) with weight decay (Loshchilov & Hutter,
2019) λ = 0.01. Additionally, we use global gradient norm clipping of 1.0 to improve the
stability of training large models. In all cases, a dropout of 0.1 is used. Lastly, to better
manage our memory footprint we utilize activation checkpointing (Chen et al., 2016) after every
transformer layer.

For GPT-2 models, all training is performed with sequences of 1024 subword units at a batch size
of 512 for 300k iterations. Our learning rate of 1.5e-4 utilizes a warmup period of 3k iterations
before following a single cycle cosine decay over the remaining 297k iterations. We stop the
decay at a minimum learning rate of 1e-5.

Table 1. Parameters used for scaling studies. Hidden size per attention head is kept constant at 96.

  Hidden   Attention   Number of   Number of parameters   Model parallel   Model + data
  size     heads       layers      (billions)             GPUs             parallel GPUs
  1536     16          40          1.2                    1                64
  1920     20          54          2.5                    2                128
  2304     24          64          4.2                    4                256
  3072     32          72          8.3                    8                512

For BERT models, we largely follow the training process described in (Lan et al., 2019). We use
the original BERT dictionary with vocab size of 30,522. In addition, we replace the next sentence
prediction head with sentence order prediction as suggested by (Lan et al., 2019) and use whole
word n-gram masking of (Joshi et al., 2019). For all cases,
we set the batch size to 1024 and use a learning rate of 1.0e-4 warmed up over 10,000 iterations
and decayed linearly over 2 million iterations. Other training parameters are kept the same as
(Devlin et al., 2018).

5. Experiments

[Figure 5: bar chart of weak scaling efficiency (y-axis, 0% to 100%) against number of GPUs
(x-axis). Model parallel: 100%, 95%, 82%, and 77% at 1, 2, 4, and 8 GPUs. Model + data parallel:
96%, 83%, 79%, and 74% at 64, 128, 256, and 512 GPUs.]
Figure 5. Model and model + data parallel weak scaling efficiency as a function of the number of GPUs.

All of our experiments use up to 32 DGX-2H servers (a total of 512 Tesla V100 SXM3 32GB GPUs).
Our infrastructure is optimized for multi-node deep learning applications, with 300 GB/sec
bandwidth between GPUs inside a server via NVSwitch and 100 GB/sec of interconnect bandwidth
between servers using 8 InfiniBand adapters per server.

5.1. Scaling Analysis

To test the scalability of our implementation, we consider GPT-2 models with four sets of
parameters detailed in Table 1. To have consistent GEMM sizes in the self attention layer, the
hidden size per attention head is kept constant at 96 while the number of heads and layers are
varied to obtain configurations ranging from 1 billion to 8 billion parameters. The configuration
with 1.2 billion parameters fits on a single GPU whereas the 8 billion parameter model requires
8-way model parallelism (8 GPUs). The original vocabulary size was 50,257; however, to have
efficient GEMMs for the logit layer, it is beneficial for the per-GPU vocabulary size to be a
multiple of 128. Since we study up to 8-way model parallelism, we pad the vocabulary such that it
is divisible by 128 × 8 = 1024, resulting in a padded vocabulary size of 51,200. We study both
model and model+data parallel scaling. For the model parallel scaling, a fixed batch size of 8 is
used across all configurations. Data parallel scaling is necessary for training many state of the
art models which typically use a much larger global batch size. To this end, for the model+data
parallel cases we fix the global batch size to 512 for all experiments which corresponds to
64-way data parallelism.

5.1.1. MODEL AND DATA PARALLELISM
Throughout this section, we will showcase weak scaling with respect to the model parameters for
both model parallel and model+data parallel cases. Weak scaling is typically done by scaling the
batch size; however, this approach does not address training large models that do not fit on a
single GPU and it leads to training convergence degradation for large batch sizes. In contrast,
here we use weak scaling to train larger models that were not possible otherwise. The baseline
for all the scaling numbers is the first configuration (1.2 billion parameters) in Table 1
running on a single GPU. This is a strong baseline as it achieves 39 TeraFLOPS during the overall
training process, which is 30% of the theoretical peak FLOPS for a single GPU in a DGX-2H server.

Figure 5 shows scaling values for both model and model+data parallelism. We observe excellent
scaling numbers in both settings. For example, the 8.3 billion parameters case with 8-way (8 GPU)
model parallelism achieves 77% of linear scaling. Model+data parallelism requires further
communication of gradients and as a result the scaling numbers drop slightly. However, even for
the largest configuration (8.3 billion parameters) running on 512 GPUs, we achieve 74% scaling
relative to linear scaling of the strong single GPU baseline configuration (1.2 billion
parameters). Further scaling analysis is provided in Appendix D.

5.2. Language Modeling Results Using GPT-2

Table 2. Model configurations used for GPT-2.

  Parameter            Hidden   Attn    Hidden size   Total   Time per
  count       Layers   size     heads   per head      GPUs    epoch (days)
  355M        24       1024     16      64            64      0.86
  2.5B        54       1920     20      96            128     2.27
  8.3B        72       3072     24      128           512     2.10

Table 3. Zero-shot results. SOTA are from (Khandelwal et al., 2019) for Wikitext103 and
(Radford et al., 2019) for LAMBADA.

  Model           Wikitext103    LAMBADA
                  Perplexity ↓   Accuracy ↑
  355M            19.31          45.18%
  2.5B            12.76          61.73%
  8.3B            10.81          66.51%
  Previous SOTA   15.79          63.24%

Figure 6. Validation set perplexity. All language models are trained for 300k iterations. Larger
language models converge noticeably faster and converge to lower validation perplexities than
their smaller counterparts.

Table 4. Model configurations used for BERT.

  Parameter            Hidden   Attention   Total
  count       Layers   size     heads       GPUs
  336M        24       1024     16          128
  1.3B        24       2048     32          256
  3.9B        48       2560     40          512

To demonstrate that large language models can further advance the state of the art, we consider
training GPT-2 models of the sizes and configurations listed in Table 2. The 355M model is
equivalent in size and configuration to the BERT-Large model (Devlin et al., 2018). The 2.5B
model is bigger than the previous largest GPT-2 model, and the 8.3B model is larger than any
left-to-right transformer language model ever trained, to the best of our knowledge. To train and
evaluate our language models we use the procedure described in section 4. Table 2 also lists the
time it takes to advance one epoch which is equivalent to 68,507 iterations. For example, for the
8.3B model on 512 GPUs, each epoch takes around two days. Compared to the configurations used for
our scaling studies in Table 1, the 2.5B model is the same, the 8.3B
model has 24 attention heads instead of 32, and the 355M is much smaller than any seen previously
while still using 64 GPUs to train, leading to the much lower time per epoch.

Figure 6 shows validation perplexity as a function of number of iterations. As the model size
increases, the validation perplexity decreases and reaches a validation perplexity of 9.27 for
the 8.3B model. We report the zero-shot evaluation of the trained models on the LAMBADA and
WikiText103 datasets in Table 3. For more details on evaluation methodology, see Appendix E. We
observe the trend that increasing model size also leads to lower perplexity on WikiText103 and
higher cloze accuracy on LAMBADA. Our 8.3B model achieves state of the art perplexity on the
WikiText103 test set at a properly adjusted perplexity of 10.81. At 66.51% accuracy, the 8.3B
model similarly surpasses prior cloze accuracy results on the LAMBADA task. We have included
samples generated from the 8.3 billion parameter model in Appendix C. Recently researchers from
Microsoft in collaboration with NVIDIA trained a 17 billion parameter GPT-2 model called
Turing-NLG (Microsoft, 2020) using Megatron and showed that the accuracies further improve as
they scale the model, highlighting the value of larger models.

To ensure we do not train on any data found in our test sets, we calculate the percentage of test
set 8-grams that also appear in our training set as done in previous work (Radford et al., 2019).
The WikiText103 test set has at most 10.8% overlap and the LAMBADA test set (Paperno et al.,
2016) has at most 1.4% overlap. We should note that the WikiText103 test set has already 9.09%
overlap with the WikiText103 training set (Radford et al., 2019). As these are consistent with
previous work, we are confident that no documents from our test data are inadvertently included
in our training data.

5.3. Bi-directional Transformer Results Using BERT
In this section, we apply our methodology to BERT-style transformer models and study the effect
of model scaling on several downstream tasks. Prior work (Lan et al., 2019) found that increasing
model size beyond BERT-large with 336M parameters results in unexpected model degradation. To
address this degradation, the authors of that work (Lan et al., 2019) introduced parameter
sharing and showed that their models scale much better compared to the original BERT model.

We further investigated this behaviour and empirically demonstrated that rearranging the order of
the layer normalization and the residual connections as shown in Figure 7 is critical to enable
the scaling of the BERT-style models beyond BERT-Large. The architecture (b) in Figure 7
eliminates instabilities observed using the original BERT architecture in (a) and also has a
lower training loss. To the best of our knowledge, we are the first to report that such a change
enables training larger BERT models.
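
A minimal sketch of the two orderings may help make the rearrangement concrete. It assumes that
architecture (a) normalizes after the residual addition and that architecture (b) normalizes the
sublayer input while leaving the residual path unnormalized. Figure 7 is not reproduced in this
text, so that reading is an assumption. The helper names (`layer_norm`, `post_ln_block`,
`pre_ln_block`, `toy_sublayer`) are hypothetical, and the toy sublayer merely stands in for self
attention or the MLP.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Ordering (a), as assumed here: add the residual first, then normalize the sum."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])


def pre_ln_block(x, sublayer):
    """Ordering (b), as assumed here: normalize the sublayer input, keep the residual path as is."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def toy_sublayer(x):
    """Hypothetical stand-in for self attention or the MLP: a fixed elementwise map."""
    return [0.5 * v + 0.1 for v in x]


x = [1.0, 2.0, 4.0, 8.0]
post = post_ln_block(x, toy_sublayer)  # block output is renormalized
pre = pre_ln_block(x, toy_sublayer)    # input x passes through the residual path unchanged
```

In ordering (a) every block output is renormalized, whereas in ordering (b) the block input
reaches the output through an unnormalized residual path.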

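The test-set overlap check described in Section 5.2 can be sketched as follows. This is an
illustration only: the whitespace tokenization, the choice to count distinct 8-grams rather than
all occurrences, and the function names `ngrams` and `overlap_percentage` are assumptions, not
the actual implementation.

```python
def ngrams(tokens, n=8):
    """Return the set of distinct n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_percentage(test_tokens, train_tokens, n=8):
    """Percentage of distinct test-set n-grams that also occur in the training set."""
    test_grams = ngrams(test_tokens, n)
    if not test_grams:
        return 0.0
    train_grams = ngrams(train_tokens, n)
    return 100.0 * len(test_grams & train_grams) / len(test_grams)


# Toy data, tokenized on whitespace (an assumption made for this example).
train = "a b c d e f g h i j".split()
test = "a b c d e f g h x y".split()
pct = overlap_percentage(test, train)  # one of the three test 8-grams appears in train
```

For corpora of realistic size, the training-set n-grams would need a more memory-efficient
structure than an in-memory Python set.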

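As a rough sanity check on Table 1 and on the vocabulary padding from Section 5.1, the sketch
below uses the common estimate of 12h² parameters per transformer layer plus token and position
embeddings. That estimate, the 1024 position embeddings (taken from the training sequence
length), and the names `pad_vocab` and `approx_params` are assumptions. Biases and layer
normalization parameters are ignored.

```python
def pad_vocab(vocab_size, multiple=128, model_parallel=8):
    """Round the vocabulary up so that each model parallel shard is a multiple of `multiple`."""
    unit = multiple * model_parallel
    return ((vocab_size + unit - 1) // unit) * unit


def approx_params(layers, hidden, vocab, positions=1024):
    """Rough parameter count (assumed estimate): 12*h^2 per layer plus embeddings."""
    per_layer = 12 * hidden * hidden           # attention (4h^2) + MLP (8h^2), biases ignored
    embeddings = (vocab + positions) * hidden  # token and position embeddings
    return layers * per_layer + embeddings


padded = pad_vocab(50257)  # 128 * 8 = 1024 is the padding unit

# (hidden size, layers) pairs taken from Table 1.
table1 = [(1536, 40), (1920, 54), (2304, 64), (3072, 72)]
billions = [round(approx_params(layers, hidden, padded) / 1e9, 1) for hidden, layers in table1]
```

Under these assumptions the estimates round to the parameter counts listed in Table 1.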
Table 5. Development set results for MNLI, QQP, SQuAD 1.1 and SQuAD 2.0 and test set results for RACE. The trained tokens ratio is the
number of tokens consumed during model pretraining (proportional to batch size times number of iterations), normalized by the number of
tokens consumed during pretraining of our 336M model.
                                     trained tokens MNLI m/mm            QQP     SQuAD 1.1 SQuAD 2.0                RACE m/h
               Model                      ratio        accuracy        accuracy    F1 / EM          F1 / EM          accuracy
                                                       (dev set)       (dev set)   (dev set)        (dev set)        (test set)
    RoBERTa (Liu et al., 2019b)             2        90.2 / 90.2         92.2     94.6 / 88.9      89.4 / 86.5   83.2 (86.5 / 81.8)
     ALBERT (Lan et al., 2019)              3            90.8            92.2     94.8 / 89.3      90.2 / 87.4   86.5 (89.0 / 85.5)
      XLNet (Yang et al., 2019)             2        90.8 / 90.8         92.3     95.1 / 89.7      90.6 / 87.9   85.4 (88.6 / 84.0)
          Megatron-336M                     1        89.7 / 90.0         92.3     94.2 / 88.0      88.1 / 84.8   83.0 (86.9 / 81.5)
           Megatron-1.3B                    1        90.9 / 91.0         92.6     94.9 / 89.1      90.2 / 87.1   87.3 (90.4 / 86.1)
           Megatron-3.9B                    1        91.4 / 91.4         92.7     95.5 / 90.0      91.2 / 88.5   89.5 (91.8 / 88.6)
    ALBERT ensemble (Lan et al., 2019)                                            95.5 / 90.1      91.4 / 88.9   89.4 (91.2 / 88.6)
    Megatron-3.9B ensemble                                                        95.8 / 90.5      91.7 / 89.0   90.9 (93.1 / 90.0)



Figure 7. Training loss for BERT model using the original architecture (a) and the rearranged
architecture (b). Left figure shows the training loss for 336M and 752M BERT model. While the
original architecture performs well on the 336M model, the modifications in (b) enable stable
training with lower training loss.

Using the architecture change in Figure 7(b), we consider three different cases as detailed in
Table 4. The 336M model has the same size as BERT-large. The 1.3B is the same as the BERT-xlarge
configuration that was previously shown to get worse results than the 336M BERT-large model (Lan
et al., 2019). We further scale the BERT model using both larger hidden size as well as more
layers to arrive at the 3.9B parameter case. In all cases, the hidden size per attention head is
kept constant at 64. 336M and 1.3B models are trained for 2 million iterations while the 3.9B
model is trained for 1.5 million iterations and is still training.

On a 3% held-out set, 336M, 1.3B, and 3.9B models achieve validation set perplexity of 1.58,
1.30, and 1.16, respectively, a monotonic decrease with the model size. We finetune the trained
models on several downstream tasks including MNLI and QQP from the GLUE benchmark (Wang et al.,
2019), SQuAD 1.1 and SQuAD 2.0 from the Stanford Question answering dataset (Rajpurkar et al.,
2016; 2018), and the reading comprehension RACE dataset (Lai et al., 2017). For finetuning, we
follow the same procedure as (Liu et al., 2019b). We first perform hyperparameter tuning on batch
size and learning rate. Once we obtain the best values, we report the median development set
results over 5 different random seeds for initialization. The hyperparameters used for each model
and task are provided in Appendix A. Table 5 shows the development set results for MNLI, QQP,
SQuAD 1.1, and SQuAD 2.0 and test set results for RACE. For the test set results of RACE, we
first use the development set to find the checkpoint that gives us the median score on the 5
random seeds and we report the results from that checkpoint on the test set. We also report 5-way
ensemble results for the development set of SQuAD and test set of RACE. From Table 5 we observe
that (a) as the model size increases, the downstream task performance improves in all cases, (b)
our 3.9B model establishes state of the art results on the development set compared to other
BERT based models, and (c) our 3.9B model achieves both single model as well as ensembled SOTA
results on the RACE test set.

6. Conclusion and Future Work
In this work, we successfully surpassed the limitations posed by traditional single-GPU-per-model
training by implementing model parallelism with only a few modifications to the existing PyTorch
transformer implementations. We efficiently trained transformer based models up to 8.3 billion
parameters on 512 NVIDIA V100 GPUs with 8-way model parallelism and achieved up to 15.1 PetaFLOPs
sustained over the entire application. We also showed that for BERT models, careful attention to
the placement of layer normalization in BERT-like models is critical to achieving increased
accuracies as the model size increases. We study the effect of model size on downstream task
accuracy and achieve far superior results on downstream tasks and establish new SOTA for
WikiText103, LAMBADA, and RACE datasets. Finally, we open sourced our code to enable future work
leveraging model parallel transformers.

There are several directions for future work. Continuing to increase the scale of pretraining is
a promising line of

investigation that will further test existing deep learning hardware and software. To realize
this, improvements in the efficiency and memory footprint of optimizers will be needed. In
addition, training a model with more than 16 billion parameters will demand more memory than is
available within 16 GPUs of a DGX-2H box. For such models, a hybrid intra-layer and inter-layer
model parallelism along with inter-node model parallelism would be more suitable. Three other
directions of investigation include (a) pretraining different model families (XLNet, T5), (b)
evaluating performance of large models across more difficult and diverse downstream tasks (e.g.
Generative Question Answering, Summarization, and Conversation), and (c) using knowledge
distillation to train small student models from these large pretrained teacher models.

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A.,
  Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y.,
  Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S.,
  Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K.,
  Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M.,
  Wicke, M., Yu, Y., and Zheng, X. TensorFlow: Large-scale machine learning on heterogeneous
  systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Ba, J. L., Kiros, J. R., and Hinton, G. E. Layernorm. CoRR, abs/1607.06450, 2016. URL
  http://arxiv.org/abs/1607.06450.

Chen, C.-C., Yang, C.-L., and Cheng, H.-Y. Efficient and robust parallel dnn training through
  model parallelism on multi-gpu platform. arXiv:1809.02839, 2018.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost.
  CoRR, abs/1604.06174, 2016. URL http://arxiv.org/abs/1604.06174.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q. V., and Salakhutdinov, R. Transformer-xl:
  Attentive language models beyond a fixed-length context. CoRR, abs/1901.02860, 2019. URL
  http://arxiv.org/abs/1901.02860.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional
  transformers for language understanding, 2018.

Goyal, P., Dollár, P., Girshick, R. B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A.,
  Jia, Y., and He, K. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR,
  abs/1706.02677, 2017.

Harlap, A., Narayanan, D., Phanishayee, A., Seshadri, V., Devanur, N., Ganger, G., and
  Gibbons, P. Pipedream: Fast and efficient pipeline parallel dnn training. arXiv:1806.03377,
  2018.

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian
  error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.

Howard, J. and Ruder, S. Fine-tuned language models for text classification. CoRR,
  abs/1801.06146, 2018.

Huang, Y., Cheng, Y., Chen, D., Lee, H., Ngiam, J., Le, Q. V., and Chen, Z. Gpipe: Efficient
  training of giant neural networks using pipeline parallelism. CoRR, abs/1811.06965, 2018. URL
  http://arxiv.org/abs/1811.06965.

Jia, Z., Zaharia, M., and Aiken, A. Beyond data and model parallelism for deep neural networks.
  arXiv:1807.05358, 2018.

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., and Levy, O. Spanbert: Improving
  pre-training by representing and predicting spans. arXiv:1907.10529, 2019.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch
  training for deep learning: Generalization gap and sharp minima. ICLR, 2017.

Khandelwal, U., Levy, O., Jurafsky, D., Zettlemoyer, L., and Lewis, M. Generalization through
  memorization: Nearest neighbor language models. arXiv:1911.00172, 2019.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint
  arXiv:1412.6980, 2014.

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. Race: Large-scale reading comprehension
  dataset from examinations. arXiv:1704.04683, 2017.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., and Soricut, P. S. R. Albert: A lite bert for
  self-supervised learning of language representations. arXiv:1909.11942, 2019.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J.,
  Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server,
  2014.

Liu, X., He, P., Chen, W., and Gao, J. Multi-task deep neural networks for natural language
  understanding. CoRR, abs/1901.11504, 2019a. URL http://arxiv.org/abs/1901.11504.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L.,
  and Stoyanov, V. Roberta: A robustly optimized BERT pretraining approach. CoRR,
  abs/1907.11692, 2019b. URL http://arxiv.org/abs/1907.11692.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference
  on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.

McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word
  vectors. CoRR, abs/1708.00107, 2017.

Melamud, O., Goldberger, J., and Dagan, I. context2vec: Learning generic context embedding with
  bidirectional lstm. In Proceedings of The 20th SIGNLL Conference on Computational Natural
  Language Learning, pp. 51–61, 01 2016.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. CoRR,
  abs/1609.07843, 2016. URL http://arxiv.org/abs/1609.07843.

Micikevicius, P., Narang, S., Alben, J., Diamos, G. F., Elsen, E., Garcia, D., Ginsburg, B.,
  Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. CoRR,
  abs/1710.03740, 2017.

Microsoft. Turing-nlg: A 17-billion-parameter language model by microsoft, 2020. URL
  https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.

Mikolov, T., Deoras, A., Kombrink, S., Burget, L., and Černockỳ, J. Empirical evaluation and
  combination of advanced language modeling techniques. In Twelfth Annual Conference of the
  International Speech Communication Association, 2011.

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. Distributed representations of
  words and phrases and their compositionality. CoRR, abs/1310.4546, 2013.

NVIDIA. Mixed precision training: Choosing a scaling factor, 2018. URL
  https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html#scalefactor.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M.,
  Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse
  context. CoRR, abs/1606.06031, 2016. URL http://arxiv.org/abs/1606.06031.

Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation,
  2014. URL https://www.aclweb.org/anthology/D14-1162.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep
  contextualized word representations. CoRR, abs/1802.05365, 2018. URL
  http://arxiv.org/abs/1802.05365.

Radford, A., Józefowicz, R., and Sutskever, I. Learning to generate reviews and discovering
  sentiment. CoRR, abs/1704.01444, 2017.

Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by
  generative pre-training, 2018. URL https://blog.openai.com/language-unsupervised/.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Better language models
  and their implications, 2019. URL https://openai.com/blog/better-language-models/.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and
  Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer.
  arXiv:1910.10683, 2019.

Rajpurkar, P., Zhang, J., Lopyrev, K., and Liang, P. Squad: 100,000+ questions for machine
  comprehension of text. EMNLP, 2016.

Rajpurkar, P., Jia, R., and Liang, P. Know what you dont know: Unanswerable questions for squad.
  ACL, 2018.

Ramachandran, P., Liu, P. J., and Le, Q. V. Unsupervised pretraining for sequence to sequence
  learning. CoRR, abs/1611.02683, 2016. URL http://arxiv.org/abs/1611.02683.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee,
  H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-TensorFlow: Deep learning for
  supercomputers. In Neural Information Processing Systems, 2018.

Trinh, T. H. and Le, Q. V. A simple method for commonsense reasoning. CoRR, abs/1806.02847,
  2018. URL http://arxiv.org/abs/1806.02847.

Turian, J., Ratinov, L., and Bengio, Y. Word representations: A simple and general method for
  semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for
  Computational Linguistics, ACL ’10, pp. 384–394, Stroudsburg, PA, USA, 2010. Association for
  Computational Linguistics.

Valiant, L. G. A bridging model for parallel computation. Communications of the ACM, 33(8):103-111, 1990.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR, abs/1706.03762, 2017.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. GLUE: A multi-task benchmark and analysis platform for natural language understanding. ICLR, 2019.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J. G., Salakhutdinov, R., and Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019. URL http://arxiv.org/abs/1906.08237.

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv:1708.03888, 2017.

You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., Song, X., Demmel, J., and Hsieh, C.-J. Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv:1904.00962, 2019.

Zellers, R., Holtzman, A., Rashkin, H., Bisk, Y., Farhadi, A., Roesner, F., and Choi, Y. Defending against neural fake news. CoRR, abs/1905.12616, 2019. URL http://arxiv.org/abs/1905.12616.

Zhu, Y., Kiros, R., Zemel, R. S., Salakhutdinov, R., Urtasun, R., Torralba, A., and Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. CoRR, abs/1506.06724, 2015.


A. BERT Finetuning Hyperparameters

Table 6 presents the hyperparameters used for each model and task during finetuning.

Table 6. Hyperparameters for finetuning BERT model on downstream tasks.

Task        Model   Batch size   Learning rate   Training epochs
MNLI        336M    128          1e-5            10
MNLI        1.3B    128          1e-5            10
MNLI        3.8B    128          1e-5            10
QQP         336M    128          5e-5            12
QQP         1.3B    128          3e-5            12
QQP         3.8B    256          4e-5            12
SQUAD 1.1   336M    64           3e-5            2
SQUAD 1.1   1.3B    48           3e-5            2
SQUAD 1.1   3.8B    48           1e-5            2
SQUAD 2.0   336M    48           3e-5            2
SQUAD 2.0   1.3B    64           3e-5            2
SQUAD 2.0   3.8B    48           1e-5            2
RACE        336M    32           2e-5            3
RACE        1.3B    16           1e-5            3
RACE        3.8B    32           2e-5            3


B. Model Parallel Supplementary Material

In this section, we present further details about the hybrid model and data parallelism and handling random number generation.

B.1. Hybrid Model and Data Parallelism

Model parallelism is orthogonal to data parallelism, and so we can use both simultaneously to train large models in a reasonable amount of time. Figure 8 shows a grouping of GPUs for hybrid model and data parallelism. Two or more GPUs within the same server form model parallel groups (for example GPUs 1 to 8 in Figure 8), and contain one instance of the model distributed across these GPUs. The remaining GPUs, which could be within the same server but more typically are located in other servers, run additional model parallel groups. GPUs with the same position in each of the model parallel groups (for example GPUs 1, 9, ..., 505 in Figure 8) form data parallel groups so that all GPUs within a data parallel group hold the same model parameters. During back propagation we run multiple gradient all-reduce operations in parallel to reduce weight gradients within each distinct data parallel group. The total number of required GPUs is the product of the number of model and data parallel groups. For example, for the 8.3 billion parameter model we use 8 GPUs per model parallel group and 64-way data parallelism, for a total of 512 GPUs. All communication is implemented in PyTorch by Python calls to NCCL. GPUs within each model parallel group perform all-reduces amongst all GPUs within the group. For data parallelism, each of the all-reduce operations takes place with one of the GPUs from each model parallel group.

B.2. Model Parallel Random Number Generation

Techniques that utilize random number generation, such as dropout, are a staple of modern deep learning training. Transformers have dropout layers outside the model parallel regions before residual connections and within model parallel regions in the self attention block. Because some dropout layers are in a model parallel region, while others are not, we need to treat random number generation carefully to ensure dropout works correctly. To synchronize residual connection dropout across model parallel workers we seed the random number generators at the beginning of training with the same seed. This results in identical dropout patterns across all model parallel workers. However, dropout within a model parallel region should result in different random
patterns for each worker to achieve randomness across the entire operation. To achieve this we maintain a separate random number generator for dropout within model parallel regions. This random number generator is uniquely seeded for each model parallel worker.

Figure 8. Grouping of GPUs for hybrid model and data parallelism with 8-way model parallel and 64-way data parallel.


C. Text Samples

Below are some text samples generated by Megatron-LM using a context prompt. Some of the texts are cut short.

Context: There is a major force uniting Americas fiercely partisan politicians: big technology companies. Democrats and Republicans at the federal and state levels are coming together to scrutinize the power of the Silicon Valley giants and, potentially, to rein them in. Letitia James, the Democratic attorney general of New York, announced on Friday that attorneys general in eight states four Democrats and four Republicans and the District of Columbia had begun an antitrust investigation of Facebook.

Megatron-LM: The probe, which Ms. James said was likely to result in a lawsuit, is the latest in a string of high-profile regulatory actions, in state and federal courts, that challenge the market dominance of tech companies. The Justice Department is also investigating Facebook’s data practices and whether the company violated a 2011 consent decree with the Federal Trade Commission that required the company to obtain explicit consent from users before sharing their personal information with third parties.

What might prompt a bipartisan coalition to sue Facebook? It isn’t clear that there’s any one thing. But lawmakers and regulators have been increasingly alarmed by the company’s recent controversies, which include a data breach affecting more than 50 million users, the spread of fake news on the platform, and the spreading of propaganda by Russian-backed accounts during the 2016 election.

It’s also not clear how serious the Democratic attorneys general are about filing a lawsuit. While the attorneys general announced their investigation in a press release, they said they had not yet reached a decision on whether to sue. In the past, some of the attorneys general have focused their efforts on trying to extract settlements from the tech companies without filing lawsuits.

”As attorneys general who have investigated and prosecuted health-care fraud, price-fixing, and other violations of our states’ antitrust laws, we are committed to holding companies accountable when they fail to protect consumers and competition in the market,” Ms. James said in a statement.

The investigation, which was first reported by The New York Times, was announced as the companies that dominate the tech industry are facing scrutiny from antitrust regulators around the world. Google, Twitter, Facebook and Amazon have faced a series of antitrust probes and class-action lawsuits in the United States and Europe ...


Context: A flower, sometimes known as a bloom or blossom, is the reproductive structure found in flowering plants (plants of the division Magnoliophyta, also called angiosperms). The biological function of a flower is to effect reproduction, usually by providing a mechanism for the union of sperm with eggs.

Megatron-LM: The structural components of the flower are the petals and sepals, which are the reproductive organs, the hypanthium, which forms the base of the flower, and the calyx, which is the stigma. A flower is a composite structure, made up of sepals and petals, the sepals being composed of three fused tepals and the petals of two fused petals, though there are a few exceptions to this rule. The three sepals are fused to form a tube called the calyx, which is the structure that attracts the pollinating insects. The petals are composed of two fused sepals (the outer ones) which open and fold inward (clasp) at the tips, forming the petals. The four sepals and petals are united to form a corolla, the outermost structure of a flower that carries pollen.

The structure of the flower has important functional significance, including protection of the ovary from insects and the transfer of pollen between the male and female parts of a flower. The structure of the flower is also used by the plants to attract pollinators, for example by the shape of the petals and the colours of the stamens.

The structure of a flower varies widely between plants. The simplest flowers are unbranched and made up of only two sepals and a single petal. Other examples are composed of several petals joined together, with the innermost sepals completely fused to form a tube called the calyx. The petals are usually coloured to attract insects, which can only see certain wavelengths of light. The sepals are often coloured too, but usually have a more limited range of colours.

Some flowers have additional structures, such as stamens, anthers and ovaries. The presence of stamens and anthers in some flowers is used for cross-pollination. The ovary is the organ where the female gametes are produced, and is the first structure that is developed in an embryo of a flower. The ovary is the female flower’s reproductive system, and consists of the ovary wall, the ovule, and the oviduct.

Flowers vary in their size and shape, from tiny to huge, and are often brightly coloured. Flowers may be seen as part of a larger structure, such as a flower head, an inflorescence, a bract, a spathe, or a carpels.

The term ”flower” is usually applied to the floral structures that are visible externally, but many ...


Context: OpenAI did not release the full GPT-2 model due to concerns of malicious use, but they did release a smaller version equivalent in size to the original GPT (117 M parameters), trained on the new, larger dataset. Although not as powerful as the large model, the smaller version still has some language generation chops.

Megatron-LM: In December 2016, DeepMind released an even larger model, GPT-3, with more than 1.5 billion parameters. DeepMind has no plans to release the model, so the question remains whether other AI researchers will be able to copy and build on it.

Other language models have been built using other techniques. In 2016, a team of researchers at Columbia University, the University of Toronto, and Google DeepMind showed that neural network-based models, including deep neural networks, can learn the grammatical rules of a language. A group at Google DeepMind demonstrated that these rules could be represented in a network as a kind of decision tree, where the nodes in the tree are the words in the sentence, and the rules for the words are represented by the decision nodes at the end of the tree. The network would go through the network in order to choose the next word in the sentence based on whether it follows the rules or not. In 2017, the team at DeepMind released a paper detailing the training and performance of their model.

The problems with the model are many. One of the biggest issues is that grammatical rules are ambiguous. For example, the phrase ”I came down” could mean I came down from the stairs, or I came down from the sky. In English, there are numerous other examples of sentences that are ambiguous. The model does not have any way to tell the difference between the ambiguous sentences, and so it simply makes a prediction about the entire sentence based on what rules it does know. This makes the model vulnerable to the phenomenon of overfitting. In order to avoid this, models must be trained using a large set of data, so that the model can find useful patterns in the data, and use those patterns to help make predictions. This approach is also prone to errors, as it relies on the training data to help make correct predictions, and the data might be incorrectly labeled, resulting in a model that produces a poor set of predictions. In the case of GPT-2, the model only has about 100 training sentences for each type of phrase (subject-verb-object).

Additionally, languages have different rules for different types of phrasing. In English, the subject-verb-object construction is fairly universal, but other languages have different rules, which makes training a GPT more complicated.

Finally, there is an inherent trade-off between the number of rules and the performance of the model ...


Context: The 36th International Conference on Machine Learning (ICML 2019) will be held in Long Beach, CA, USA from June 10th to June 15th, 2019. The conference will consist of one day of tutorials (June 10), followed by three days of main conference sessions (June 11-13), followed by two days of workshops (June 14-15).

Megatron-LM: With a broad scope, the conference addresses the challenges and opportunities in machine learning for practitioners and researchers. The conference program will feature sessions that cover different machine learning topics, from reinforcement learning to computational biology and from AI to deep learning. It is the world’s premier forum for exploring and presenting advances in machine learning.

ICML 2019 will offer an impressive range of sessions, which will be of interest to practitioners, researchers, and students. Some of the topics we will cover include:

Reinforcement learning

Soft Computing

Computational biology

Convolutional networks

Deep learning

Statistical learning theory

Statistical learning machine

Natural language processing

Biology

Awards

Since its first edition, the conference has grown and evolved. Today, it is the premier forum for showcasing state-of-the-art research in machine learning. The conference is widely recognized as the preeminent venue for showcasing the latest research in machine learning, and a place where practitioners, researchers, and students can exchange ideas and insights.

The conference is a peer-reviewed venue that draws its prestige and influence from the excellence of the presentations and the quality of the invited speakers. With over 50% of all talks selected based on their content by an expert committee, it is clear that the conference’s high standards of research and quality keep a very high bar.


Context: Rescue workers on Tuesday found a flight recorder from the Russian plane that crashed into the Black Sea over the weekend, the defense ministry said. All 92 people aboard the Russian military’s Tu-154 plane are believed to have died Sunday morning when it crashed two minutes after taking off from the southern Russian city of Sochi.

Megatron-LM: A child holds up a portrait of a dead relative at a makeshift morgue in a building belonging to a local government in the town of Petropavlovsk-Kamchatsky. A Russian Defense Ministry Tu-154 plane with 92 people on board crashed into the Black Sea on Sunday morning shortly after taking off from the southern Russian city of Sochi. (Yuri Kochetkov/EPA)

According to the ministry, the plane was found at a depth of more than 1,000 feet and the recorder was damaged. ”The defense ministry has decided to hand over the recorder to the Interstate Aviation Committee, which is the main investigative body in the air crash,” the statement said.

The search for the plane’s black boxes is being hampered by strong winds and rough seas. Investigators have launched a criminal probe into the crash, which officials said was likely caused by a technical fault.

Russian President Vladimir Putin declared Wednesday a day of mourning for the victims.


D. Further Scaling Analysis

In this section we study the effect of the number of attention heads on the scaling results. We also present strong scaling results for our 1.2 billion parameter model.

D.1. Attention Heads and Scaling

This section studies the effect of attention heads on model parallel scaling. To this end, we consider the 8.3 billion parameter configuration with 8-way model parallelism and vary the number of heads from 16 to 32. The results are presented in Table 7. As the number of attention heads increases, some of the GEMMs inside the self-attention layer become smaller and the number of elements in the self-attention softmax increases. This results in a slight decrease in scaling efficiency. Future research should be wary of this hyperparameter when designing large transformer models that balance model speed and model accuracy.

D.2. Strong Scaling

Our model parallelism is primarily designed to enable training models larger than what can fit in the memory of a
single GPU, but it can also accelerate the training of smaller models without increasing the batch size. To measure this acceleration we train a model with a fixed 1.2 billion parameters. We use a fixed batch size of 8 samples per iteration and increase the number of GPUs using model parallelism. The results are listed in Table 8. Using two GPUs makes training 64% faster. Above that we see diminishing returns as the per-GPU computation decreases and the memory bandwidth and communication overheads begin to dominate.

Table 7. Effect of the number of attention heads on scaling for the 8.3 billion parameter model with 8-way model parallelism.

Attention heads   Hidden size per head   Scaling efficiency
16                192                    82%
24                128                    80%
32                96                     77%

Table 8. Speedup obtained for the 1.2 billion parameter model using model parallelism while keeping the batch size constant.

# of GPUs   1     2      4      8
Speedup     1.0   1.64   2.34   2.98


E. Evaluating Language Models Using WikiText103 and LAMBADA

In this section we detail our evaluation methodology for the WikiText103 dataset (Merity et al., 2016) and cloze-style prediction accuracy on the LAMBADA dataset (Paperno et al., 2016).

E.1. WikiText103 Perplexity

WikiText103 perplexity is an evaluation criterion that has been well studied over the past few years since the creation of the benchmark dataset. Perplexity is the exponentiation of the average cross entropy of a corpus (Mikolov et al., 2011). This makes it a natural evaluation metric for language models which represent a probability distribution over entire sentences or texts.

$$\mathrm{PPL} = \exp\left(-\frac{1}{T_o}\sum_{t}^{T} \log P(t \mid 0:t-1)\right) \tag{4}$$

To calculate perplexity in (4) we tokenize the WikiText103 test corpus according to our subword vocabulary and sum the cross entropy loss from each token $[0, T]$. We then normalize the cross entropy loss by the number of tokens in the original tokenization scheme $T_o$. The WikiText103 test corpus already comes pre-tokenized with word level tokens that prior works have used to compute perplexity. To evaluate our models’ perplexities on a level playing field with prior works we must normalize by the original number of tokens, $T_o$, rather than the number of tokens, $T$, actually in the tokenized data fed as input to our model. This pre-tokenization also introduces artifacts in the text that are not present in our training data. To alleviate this distributional mismatch, we first preprocess the WikiText103 test dataset with invertible detokenizers to remove various artifacts related to punctuation and whitespace. The value of $T_o$ is calculated before this preprocessing. For WikiText103’s test set $T_o = 245566$ and $T = 270329$.

We must also make one further transformer-specific modification to the perplexity calculation. Unlike RNN-based language models, transformers operate on a fixed window input size. Therefore they cannot fully calculate $P(t \mid 0:t-1)$ and can only calculate $P(t \mid t-w:t-1)$ where $w$ is the size of our context: 1024 tokens. However, calculating this value for every token in our dataset is prohibitively expensive since we must compute approximately $T$ evaluations of a $w$ sized context. To evaluate our models efficiently we take a middle ground approach termed overlapping evaluation where we advance the sliding window by some overlap $o$ each time and only compute the cross entropy losses corresponding to the last $o$ tokens of the window. In our experiments we utilize an overlap $o$ of 32, and compute losses over all sliding windows in such a fashion.

E.2. LAMBADA Cloze Accuracy

The capability to handle long term contexts is crucial for state of the art language models and is a necessary prerequisite for problems like long-form generation and document-based question answering. Cloze-style datasets like LAMBADA are designed to measure a model’s ability to operate in and reason about these types of long term contexts. Cloze-style reading comprehension uses a context of word tokens $x = x_{1:t}$ with one token $x_j$ masked; the model’s objective is to correctly predict the value of the missing $j$th token. To accurately predict the missing token, the model requires an in-depth understanding of the surrounding context and how language should be used in such a context. LAMBADA uses cloze-style reading comprehension to test generative left-to-right language models by constructing examples of 4-5 sentences where the last word in the context $x_t$ is masked. Our models utilize subword units, so for LAMBADA evaluation we utilize the raw, unprocessed LAMBADA dataset and require that our model predict the multiple subword tokens that make up the word token. We use teacher forcing, and consider an answer correct only when all output predictions are correct. This formulation is equivalent to the original task of word token prediction.
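
The evaluation rules described in this appendix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names (perplexity, overlapping_windows, word_is_correct) are hypothetical, the per-token losses are assumed to come from some model, the value of mean_subword_nll is invented for the example, and the first window is assumed to score all of its tokens so that every token is scored exactly once (the text does not spell out that boundary case).

```python
import math


def perplexity(total_nll, num_original_tokens):
    """Equation (4): exponentiate the summed negative log-likelihood,
    normalized by the original (word-level) token count T_o."""
    return math.exp(total_nll / num_original_tokens)


def overlapping_windows(num_tokens, window, overlap):
    """Return (start, end, first_scored) triples covering [0, num_tokens).

    Each window spans at most `window` tokens and advances by `overlap`;
    only tokens in [first_scored, end) contribute to the loss.
    Assumption: the first window scores all of its tokens.
    """
    end = min(window, num_tokens)
    windows = [(0, end, 0)]
    while end < num_tokens:
        new_end = min(end + overlap, num_tokens)
        windows.append((max(0, new_end - window), new_end, end))
        end = new_end
    return windows


def word_is_correct(predicted_ids, target_ids):
    """LAMBADA strict match: every subword of the final word must be right."""
    return list(predicted_ids) == list(target_ids)


# Word-level vs. subword-level normalization for the WikiText103 test set.
T_o, T = 245566, 270329
mean_subword_nll = 2.0  # hypothetical average loss per subword token
ppl_word_level = perplexity(mean_subword_nll * T, T_o)
ppl_subword_level = perplexity(mean_subword_nll * T, T)

# Toy sliding-window schedule (the paper uses window=1024, overlap=32).
schedule = overlapping_windows(num_tokens=10, window=4, overlap=2)
```

Because $T > T_o$, dividing the same summed loss by $T_o$ gives a higher perplexity than per-subword normalization, which is what keeps the reported number comparable with word-level results from prior work. In the toy schedule, each later window re-reads earlier tokens as context but scores only its last two positions.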