GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Nan Du * 1, Yanping Huang * 1, Andrew M. Dai * 1, Simon Tong 1, Dmitry Lepikhin 1, Yuanzhong Xu 1,
Maxim Krikun 1, Yanqi Zhou 1, Adams Wei Yu 1, Orhan Firat 1, Barret Zoph 1, Liam Fedus 1, Maarten Bosma 1,
Zongwei Zhou 1, Tao Wang 1, Yu Emma Wang 1, Kellie Webster 1, Marie Pellat 1, Kevin Robinson 1,
Kathleen Meier-Hellstern 1, Toju Duke 1, Lucas Dixon 1, Kun Zhang 1, Quoc V Le 1, Yonghui Wu 1,
Zhifeng Chen 1, Claire Cui 1




arXiv:2112.06905v2 [cs.CL] 1 Aug 2022
* Equal contribution. 1 Google. Correspondence to: Nan Du, Yanping Huang, and Andrew M. Dai <dunan@google.com, huangyp@google.com, adai@google.com>.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amounts of computing resources. In this paper, we propose and develop a family of language models named GLaM (Generalist Language Model), which uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants. The largest GLaM has 1.2 trillion parameters, which is approximately 7x larger than GPT-3. It consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference, while still achieving better overall zero, one and few-shot performance across 29 NLP tasks.

Table 1. Comparison between GPT-3 and GLaM. In a nutshell, GLaM outperforms GPT-3 across 21 natural language understanding (NLU) benchmarks and 8 natural language generative (NLG) benchmarks in average while using about half the FLOPs per token during inference and consuming about one third the energy for training.

                                           GPT-3    GLaM    relative
cost                  FLOPs / token (G)      350     180      −48.6%
                      Train energy (MWh)    1287     456      −64.6%
accuracy on average   Zero-shot             56.9    62.7      +10.2%
                      One-shot              61.6    65.5       +6.3%
                      Few-shot              65.2    68.1       +4.4%

1. Introduction

Language models have played an important role in the progress of natural language processing (NLP) in the past decade. Variants of language models have been used to produce pretrained word vectors (Mikolov et al., 2013; Pennington et al., 2014), and contextualized word vectors (Peters et al., 2018; Devlin et al., 2019) for many NLP applications. The shift towards scaling with more data and larger models (Shazeer et al., 2017; Huang et al., 2019; Kaplan et al., 2020) has enabled complex natural language tasks to be performed with less labeled data. For example, GPT-3 (Brown et al., 2020) and FLAN (Wei et al., 2021) demonstrated the feasibility of in-context learning for few-shot or even zero-shot generalization, meaning very few labeled examples are needed to achieve good performance on NLP applications. While being effective and performant, scaling further is becoming prohibitively expensive and consumes significant amounts of energy (Patterson et al., 2021).

In this work, we show that a large sparsely activated network can achieve competitive results compared to state-of-the-art dense models on few-shot tasks while being more computationally efficient. We present a family of generalist language models called GLaM, that strike a balance between dense and conditional computation. The largest version of GLaM has 1.2T parameters in total with 64 experts per MoE layer (Shazeer et al., 2017; Lepikhin et al., 2021; Fedus et al., 2021) where each token in the input batch only activates a subnetwork of 96.6B (8% of 1.2T) parameters. On zero, one and few-shot learning, this model compares favorably to GPT-3 (175B), with significantly improved learning efficiency across 29 public NLP benchmarks, ranging from language completion tasks, open-domain QA tasks, to natural language inference tasks. Thanks to the sparsely activated architecture and the efficient implementation of the model parallelism algorithm, the total energy consumption during training is only one third of GPT-3’s. We highlight the comparison between the largest version of GLaM and GPT-3 in Table 1 and Figure 1.
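
To make the idea of sparse activation concrete, the following is a minimal sketch of how a single token might be routed through one MoE layer so that only a few of the 64 experts are evaluated. It is an illustration under stated assumptions, not the authors' implementation: the function names (`route_token`, `moe_layer`, `make_scaling_expert`), the linear gate, the choice of k = 2 selected experts, and the toy "scaling" experts are all hypothetical, since this excerpt does not specify the gating function or the number of experts chosen per token.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, k=2):
    """Select the k highest-scoring experts and renormalise their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    The value k=2 is an assumption made for illustration only.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(token, gate_weights, experts, k=2):
    """Apply a toy MoE layer to one token vector.

    gate_weights: one weight vector per expert (a linear gate, assumed here).
    experts: list of callables, each mapping a vector to a vector.
    Only the k selected experts are called; all others are skipped.
    """
    gate_logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    output = [0.0] * len(token)
    for idx, weight in route_token(gate_logits, k):
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


def make_scaling_expert(scale):
    """Hypothetical stand-in for a feed-forward expert: scales its input."""
    return lambda vec: [scale * v for v in vec]


# Toy usage: 64 experts per layer (as in the largest GLaM), 4-dim tokens.
random.seed(0)
num_experts, dim = 64, 4
gate = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_experts)]
experts = [make_scaling_expert(i + 1) for i in range(num_experts)]
token = [0.5, -1.0, 0.25, 2.0]

logits = [sum(w * x for w, x in zip(row, token)) for row in gate]
print("selected experts and weights:", route_token(logits))
print("layer output:", moe_layer(token, gate, experts))
```

In this sketch the cost of the layer for one token depends on k rather than on the total number of experts, which is the property that lets total parameter count grow while per-token computation stays roughly fixed.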

[Figure 1: bar charts. Panels (a) Zero-shot, (b) One-shot, and (c) Few-shot show the percentage change per benchmark category: Open-Domain Question Answering, In-context Reading Comprehension, Common Sense Reasoning, Natural Language Inference, Cloze and Completion Tasks, Winograd-Style Tasks, and SuperGLUE. Panel (d) Train and inference cost compares GLaM and GPT-3 on GFLOPS / token (180 vs. 350) and Train Energy (MWh) (456 vs. 1287).]

Figure 1. An overview of the percentage change in predictive performance (higher is better) of GLaM (64B/64E) versus GPT-3 (175B) in
the (a) zero-shot, (b) one-shot, and (c) few-shot setting across 7 benchmark categories with 29 public tasks in total. Each bar in panel
(a), (b) and (c) represents one benchmark category. Panel (d) compares the FLOPs needed per token prediction and training energy
consumption.
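
As a quick check on the headline numbers, the sketch below recomputes the percentage changes from the GPT-3 and GLaM values reported in Table 1, along with the activated-parameter fraction and the size ratio quoted in the text. The helper name `relative_change` and the dictionary layout are illustrative choices, not part of the paper; the input values are copied from Table 1 and the introduction.

```python
def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# (GPT-3, GLaM) pairs copied from Table 1.
TABLE_1 = {
    "FLOPs / token (G)": (350.0, 180.0),
    "Train energy (MWh)": (1287.0, 456.0),
    "Zero-shot": (56.9, 62.7),
    "One-shot": (61.6, 65.5),
    "Few-shot": (65.2, 68.1),
}

for metric, (gpt3, glam) in TABLE_1.items():
    print(f"{metric:20s} {relative_change(gpt3, glam):+.1f}%")

# Figures quoted in the text: 1.2T total parameters, 96.6B activated per
# token, and GPT-3 at 175B parameters.
total_params = 1.2e12
activated_params = 96.6e9
gpt3_params = 175e9

print(f"activated fraction: {100.0 * activated_params / total_params:.2f}%")
print(f"GLaM / GPT-3 size ratio: {total_params / gpt3_params:.2f}x")
```

Rounded to one decimal place, the loop reproduces the "relative" column of Table 1; the last two lines correspond to the "8% of 1.2T" and "approximately 7x larger" statements.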


We use GLaM to study the importance of data. Our analysis shows that even for these large models, data quality should not be sacrificed for quantity if the goal is to produce a high-quality auto-regressive language model. More importantly, on social dimensions, our results are also the first, to our knowledge, to close the performance gap between stereotypical and anti-stereotypical examples on the WinoGender benchmark, suggesting that large, sparsely activated models may rely less on superficial statistical correlations.

Finally, although MoE-based sparse models are not yet common in the NLP community, our work shows that sparse decoder-only language models can be more performant than the dense architectures of similar compute FLOPs for the first time within the few-shot in-context learning setting at scale, suggesting that sparsity is one of the most promising directions to achieve high-quality NLP models while saving energy costs (Patterson et al., 2021). MoE should therefore be considered as a strong candidate for future scaling.

2. Related Work

Language models. Neural language models (Mikolov et al., 2010; Sutskever et al., 2011) have been shown to be

to improve various language understanding tasks. More recently, models that used Transformers (Vaswani et al., 2017) showed that larger models with self-supervision on unlabeled data could yield significant improvements on NLP tasks (Devlin et al., 2019; Yang et al., 2019; Liu et al., 2019; Clark et al., 2020). Transfer learning based on pre-training and finetuning (Raffel et al., 2020; Houlsby et al., 2019) has been extensively studied and demonstrated good performance on downstream tasks. However, a major limitation to this method is that it requires a task-specific fine-tuning.

In-Context Few-shot Learning. GPT-3 (Brown et al., 2020) and related work (Shoeybi et al., 2019; Lieber et al., 2021; Wei et al., 2021) demonstrated that scaling up language models greatly improves task-agnostic, few-shot performance. These language models are applied without any gradient updates, and only few-shot demonstrations specified purely via text interactions with the model are needed.

Sparsely Gated Networks. Mixture-of-Experts based models have also shown significant advantages. For language modeling and machine translation, Shazeer et al. (2017) showed that they could effectively use a very large
useful for many natural language processing tasks. Word em-                                                                                                                                                                                                                                                                                                                                        number of weights while only needing to compute a small
bedding models and extensions such as word2vec (Mikolov                                                                                                                                                                                                                                                                                                                                            subset of the computation graph at inference time. There
et al., 2013), GloVe (Pennington et al., 2014) and paragraph                                                                                                                                                                                                                                                                                                                                       has also been work on scaling sparsely activated MoE ar-
vectors (Le & Mikolov, 2014) have shown good generaliza-                                                                                                                                                                                                                                                                                                                                           chitectures (Hestness et al., 2017; Shazeer et al., 2018; Lep-
tion to many tasks simply by transferring the embeddings.                                                                                                                                                                                                                                                                                                                                          ikhin et al., 2021; Kudugunta et al., 2021). Recently, Fedus
                                                                                                                                                                                                                                                                                                                                                                                                   et al. (2021) showed results with even larger 1 trillion pa-
Pre-training and Fine-tuning. The abundance of com-                                                                                                                                                                                                                                                                                                                                                rameter sparsely activated models (Switch-C). Although
pute and data enables training increasingly large models via                                                                                                                                                                                                                                                                                                                                       both Switch-C and the largest GLaM model have one tril-
unsupervised pre-training. This is a natural fit for training                                                                                                                                                                                                                                                                                                                                      lion number of trainable parameters, GLaM is a family of
neural networks as they exhibit remarkable scalability. Work                                                                                                                                                                                                                                                                                                                                       decoder-only language models, and Switch-C is an encoder-
on using recurrent models such as RNNs and LSTMs for                                                                                                                                                                                                                                                                                                                                               decoder based sequence to sequence model. Furthermore,
language representation (Dai & Le, 2015; Kiros et al., 2015)                                                                                                                                                                                                                                                                                                                                       Switch-C is mainly evaluated on fine-tuning benchmarks,
showed that general language models could be fine-tuned                                                                                                                                                                                                                                                                                                                                            e.g., SuperGlue, while GLaM performs well without any


Table 2. A sample of related models (Devlin et al., 2019; Raffel
et al., 2020; Brown et al., 2020; Lieber et al., 2021; Rae et al.,
2021; Shoeybi et al., 2019; Lepikhin et al., 2021; Fedus et al.,
2021) pre-trained on text corpora. nparams is the total number of
trainable model parameters, nact-params is the number of activated
model parameters per input token.

 Model Name           Model Type                 nparams   nact-params
 BERT                 Dense Encoder-only         340M      340M
 T5                   Dense Encoder-decoder      13B       13B
 GPT-3                Dense Decoder-only         175B      175B
 Jurassic-1           Dense Decoder-only         178B      178B
 Gopher               Dense Decoder-only         280B      280B
 Megatron-530B        Dense Decoder-only         530B      530B
 GShard-M4            MoE Encoder-decoder        600B      1.5B
 Switch-C             MoE Encoder-decoder        1.5T      1.5B
 GLaM (64B/64E)       MoE Decoder-only           1.2T      96.6B



need for fine-tuning in the few-shot setting shared by GPT-3
where SuperGlue is a subset. Table 2 summarizes the key
differences between GLaM and related models pre-trained
on text corpora.

3. Training Dataset

To train our model, we build a high-quality dataset of 1.6
trillion tokens that are representative of a wide range of
natural language use cases. Web pages constitute the vast
quantity of data in our unlabeled dataset. However, their
quality ranges from professional writing to low-quality
comment and forum pages. Similarly to Brown et al. (2020), we
develop our own text quality classifier to produce a
high-quality web corpus out of an original larger raw corpus. We
use a feature-hash-based linear classifier for inference speed.
This classifier is trained to classify between a collection
of curated text (Wikipedia, books and a few selected
websites) and other webpages. We use this classifier to estimate
the content quality of a webpage. We then apply this
classifier by using a Pareto distribution to sample webpages
according to their score. This allows some lower-quality
webpages to be included to prevent systematic biases in the
classifier (Brown et al., 2020).

    Table 3. Data and mixture weights in GLaM training set.

     Dataset               Tokens (B)     Weight in mixture
     Filtered Webpages         143              0.42
     Wikipedia                  3               0.06
     Conversations             174              0.28
     Forums                    247              0.02
     Books                     390              0.20
     News                      650              0.02

We use this process to generate a high-quality filtered subset
of webpages and combine this with books, Wikipedia pages,
forums and news pages and other data sources to create the
final GLaM dataset. We also incorporate the data from
public domain social media conversations used by Adiwardana
et al. (2020). We set the mixture weights based on the
performance of each component in a smaller model and to prevent
small sources such as Wikipedia from being over-sampled.
Table 3 shows the details of our data component sizes and
mixture weights. To check data contamination, in Section D we
conduct an overlap analysis between our training set and
the evaluation data and find that it roughly matches that of
previous work (Brown et al., 2020).

Figure 2. GLaM model architecture. Each MoE layer (the bottom
block) is interleaved with a Transformer layer (the upper block).
For each input token, e.g., ‘roses’, the Gating module dynamically
selects the two most relevant experts out of 64, which is represented
by the blue grid in the MoE layer. The weighted average of the
outputs from these two experts will then be passed to the upper
Transformer layer. For the next token in the input sequence, two
different experts will be selected.

4. Model Architecture

We leverage sparsely activated Mixture-of-Experts
(MoE) (Shazeer et al., 2017; Fedus et al., 2021) in GLaM
models. Similar to the GShard MoE Transformer (Lepikhin
et al., 2021), we replace the feed-forward component of
every other Transformer layer with an MoE layer, as shown
in Figure 2. Each MoE layer consists of a collection of
independent feed-forward networks as the ‘experts’. A
gating function then uses a softmax activation function
to model a probability distribution over these experts.
This distribution indicates how well each expert is able to
process the incoming input.

Even though each MoE layer has many more parameters,
the experts are sparsely activated. This means that for a
given input token, only a limited subset of experts is used,
giving the model more capacity while limiting computation.
In our architecture, the subset size is two.¹ Each MoE
layer’s learnable gating network is trained to use its input
to activate the best two experts for each token of an input
sequence. During inference, the learned gating network
dynamically picks the two best experts for each token. For
an MoE layer with E experts, this essentially provides a
collection of $O(E^2)$ different combinations of feed-forward
networks instead of one in the classic Transformer
architecture, leading to much more computational flexibility. The
final learned representation of a token will be the weighted
combination of the outputs from the selected experts.

We also make additional modifications to the original
Transformer architecture. We replace the standard positional
embedding with per-layer relative positional bias from Dai
et al. (2019). In the non-MoE Transformer feed-forward
sub-layers, we replace the first linear projection and the
activation function with the Gated Linear Unit (Dauphin et al.,
2017; Shazeer, 2020), which computes the component-wise
product of two linear transformations of the input, followed
by a Gaussian Error Linear Unit (Hendrycks & Gimpel,
2016) activation function. We partition the weights and
computation of large GLaM models using the 2D sharding
algorithm of Xu et al. (2021), which is described in more
detail in Section C of the appendix.

    ¹ Using more experts will cost more compute FLOPs per
prediction, pushing the network to be ‘denser’. Setting the number
of selected experts to be two is based on the trade-off between
predictive performance and the training/serving efficiency of the
model.

5. Experiment Setup

GLaM is a family of dense and sparse decoder-only
language models, so we first elaborate our training settings,
hyperparameters, and evaluation protocol in this section.

5.1. Training Setting

We train several variants of GLaM to study the behavior of
MoE and dense models on the same training data. Table 4
shows the hyperparameter settings of different scale GLaM
models ranging from 130 million parameters to 1.2 trillion
parameters. Here, E is the number of experts in the MoE
layer, B is the mini-batch size, S is the input sequence
length, M is the model and embedding dimension, H is
the hidden dimension of the feed-forward network, L is
the number of layers and N is the number of total devices.
Additionally, nparams is the total number of trainable model
parameters, nact-params is the number of activated model
parameters per input token, nheads is the number of
self-attention heads, and dhead is the hidden dimension of each
attention head. We also include the respective dense models
with comparable numbers of activated parameters per token
during inference (and thus similar numbers of per-token
FLOPs) as references. We adopt the notation of

      GLaM (Base Dense Size/E) e.g., GLaM (8B/64E)

to describe different variants in the GLaM models. For
example, GLaM (8B/64E) represents the architecture of an
approximately 8B parameter dense model with every other
layer replaced by a 64 expert MoE layer. GLaM reduces to a
dense Transformer-based language model architecture when
each MoE layer only has one expert. We use the notation

      GLaM (Dense Size) e.g., GLaM (137B)

to refer to a dense 137B parameter model trained with the
same dataset.

5.2. Hyperparameters and Training Procedure

We use the same learning hyperparameters for all GLaM
models. More specifically, we use a maximum sequence
length of 1024 tokens, and pack each input example to have
up to 1 million tokens per batch. The dropout rate is set to 0
since the number of available tokens in the training corpus
is much greater than the number of processed tokens
during training. Our optimizer is Adafactor (Shazeer & Stern,
2018) with first-moment decay β1 = 0, second-moment
decay β2 = 0.99 with a $1 - t^{-0.8}$ decay schedule, update
clipping threshold of 1.0, and factored second-moment
estimation. We keep the initial learning rate of 0.01 for the first
10K training steps, and then decay it with an inverse square
root schedule $lr\langle t \rangle \propto \frac{1}{\sqrt{t}}$. On top of the standard
cross-entropy loss, we add the MoE auxiliary loss as described
in GShard (Lepikhin et al., 2021) with a 0.01 coefficient to
encourage expert load balancing so that the gating function
will distribute tokens more evenly across all experts. We use
the SentencePiece (Kudo & Richardson, 2018) subword
tokenizer with a vocabulary size of 256K. During training,
we use float32 for model weights and bfloat16 for
activations. The largest GLaM 64B/64E model was trained on
1,024 Cloud TPU-V4 chips.

Training models at the trillion parameter scale is extremely
expensive even for sparsely activated models. There is
little room for hyperparameter tuning. Here we share our
training recipes and some implementation tricks for the
GLaM models.
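
As a rough illustration of the learning-rate schedule described in
Section 5.2 (a constant rate of 0.01 for the first 10K steps, followed
by inverse square root decay), the sketch below computes the rate for a
given step. The function name, the constants' names, and the scale
factor that makes the schedule continuous at the 10K-step boundary are
assumptions made for illustration; the text only states that the rate
decays proportionally to 1/sqrt(t).

```python
import math

BASE_LR = 0.01        # initial learning rate stated in the text
HOLD_STEPS = 10_000   # number of steps during which the rate is held constant


def learning_rate(step: int) -> float:
    """Return the learning rate at a 1-based training step.

    The rate is constant for the first HOLD_STEPS steps and then decays
    proportionally to 1/sqrt(step). Scaling by sqrt(HOLD_STEPS) is an
    assumption that keeps the schedule continuous at the boundary.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    if step <= HOLD_STEPS:
        return BASE_LR
    return BASE_LR * math.sqrt(HOLD_STEPS / step)


if __name__ == "__main__":
    for step in (1, 10_000, 40_000, 1_000_000):
        print(step, learning_rate(step))
```

Under this assumption the rate halves each time the step count
quadruples after the constant phase.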

Table 4. Sizes and architectures of both MoE and dense models that we have trained in our experiments. Models are grouped by the
number of activated parameters per token. All trained models share the same learning hyperparameters described in Section 5.2.

                   GLaM Model        Type    nparams   nact-params   L     M        H       nheads   dhead   E
                   0.1B             Dense    130M       130M         12   768      3,072     12       64      –
                   0.1B/64E         MoE      1.9B       145M         12   768      3,072     12       64     64
                   1.7B             Dense    1.7B       1.700B       24   2,048    8,192     16      128      –
                   1.7B/32E         MoE      20B        1.878B       24   2,048    8,192     16      128     32
                   1.7B/64E         MoE      27B        1.879B       24   2,048    8,192     16      128     64
                   1.7B/128E        MoE      53B        1.881B       24   2,048    8,192     16      128    128
                   1.7B/256E        MoE      105B       1.886B       24   2,048    8,192     16      128    256
                   8B               Dense    8.7B       8.7B         32   4,096   16,384     32      128      –
                   8B/64E           MoE      143B       9.8B         32   4,096   16,384     32      128     64
                   137B             Dense    137B       137B         64   8,192   65,536     128     128      –
                   64B/64E          MoE      1.2T       96.6B        64   8,192   32,768     128     128     64
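
To make the top-2 routing of Section 4 concrete, the following is a
minimal, framework-free sketch for a single token: a softmax over gate
logits, selection of the two highest-probability experts, and a
weighted combination of their outputs. The helper names (`softmax`,
`top2_moe`, `make_scaling_expert`), the toy experts that merely scale
their input, and the renormalisation of the two selected gate weights
are assumptions for illustration and are not taken from the GLaM
implementation. The auxiliary load-balancing loss and any expert
capacity limits are omitted.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_moe(token: Vector, gate_logits: Sequence[float],
             experts: Sequence[Expert]) -> Vector:
    """Route one token to its two highest-scoring experts and mix their outputs."""
    if len(experts) < 2 or len(gate_logits) != len(experts):
        raise ValueError("need one gate logit per expert and at least two experts")
    probs = softmax(gate_logits)
    # Indices of the two experts with the largest gate probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    # Assumption: renormalise the two selected weights so they sum to one.
    norm = probs[first] + probs[second]
    w1, w2 = probs[first] / norm, probs[second] / norm
    out1, out2 = experts[first](token), experts[second](token)
    return [w1 * a + w2 * b for a, b in zip(out1, out2)]


def make_scaling_expert(scale: float) -> Expert:
    """Toy stand-in for an expert feed-forward network: scales the input."""
    return lambda x: [scale * v for v in x]


# Only two of the E experts run per token, so a layer with E experts offers
# E * (E - 1) / 2 distinct expert pairs, which grows as O(E^2).
num_pairs_64 = math.comb(64, 2)
```

Only the two selected experts are evaluated, which is why the per-token
compute stays close to that of a dense layer even as E grows.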


   • We train smaller-scale models to convergence first.
     This allows us to expose potential issues in the dataset
     and infrastructure as early as possible.

   • We skip weight updates for a batch if there are any
     NaNs or Infs in the gradients (Shen et al., 2019). Note
     that NaN/Inf could still occur during the gradient
     application step, in which case we restart from an earlier
     checkpoint as described below. For example, even if there
     is no Inf in the existing variable or the gradient, the
     updated variable could still lead to Inf.

   • We restart from an early healthy checkpoint when
     encountering rare large fluctuations or even NaN/Inf
     during training. Randomness of the sequentially loaded
     batches might help escape from previous failed states
     in the training after restart.

5.3. Evaluation Setting

Protocol. To clearly demonstrate the effectiveness of
GLaM models, we mainly focus on evaluating the zero,
one and few-shot learning protocols suggested by Radford
et al. (2018) and Brown et al. (2020). For the zero-shot
learning setting, in most cases, we evaluate each example in the
development set directly. For one/few-shot learning, we
mainly draw random one/few examples from that task’s
training set as the only demonstration and context. Such a
demonstration is concatenated with the evaluation example
with two newlines in between, and then fed into the model.

Benchmarks. To allow for an apples-to-apples
comparison between GPT-3 and GLaM, we choose the same suite
of evaluation tasks as Brown et al. (2020). But for
simplicity, we exclude 7 synthetic tasks (arithmetic and word
unscramble) and 6 machine translation datasets. With this
exclusion, we end up with 29 datasets, which include 8
natural language generative (NLG) tasks and 21 natural
language understanding (NLU) tasks. These datasets can be
further grouped into 7 categories and are listed in Section A.

Natural Language Generative tasks. We compare the
language sequences decoded by the models to the ground
truth in generative tasks. These tasks are TriviaQA, NQS,
WebQS, SQuADv2, LAMBADA, DROP, QuAC and CoQA.
The performance is measured by the accuracy of exact match
(EM) and F1 score, following the standard for each task
in Brown et al. (2020). We use beam search with a width of
4 to generate the sequences.

Natural Language Understanding tasks. Most
language understanding tasks require the model to select one
correct answer from multiple options. All binary
classification tasks are formulated into the form of selecting among
two options (‘Yes’ or ‘No’). The prediction is based on the
maximum log-likelihood of each option given the context,
log P(option | context), normalized by the token length of
each option. On a few tasks, such as ReCoRD (Zhang et al.,
2018) and COPA (Gordon et al., 2012), the non-normalized
loss can yield better results and thus is adopted. Except for
MultiRC (Khashabi et al., 2018), where the F1 metric over
the set of answer options (referred to as F1a) is reported,
the prediction accuracy metric is used for all the other tasks.
We use the average of the scores reported in all datasets to
report the overall few-shot performance of models on both
NLG and NLU tasks. Both Accuracy (EM) and F1 scores
have been normalized to lie between 0 and 100. On
TriviaQA, we also report the testing server score of our one-shot
submission.

6. Results

We conduct extensive evaluation on the whole family of
GLaM models, to show the advantages of sparsely activated
models in language modeling and their scaling trends. We
also quantitatively inspect the effectiveness of data quality
for language model training.

6.1. Comparison between MoE and Dense Models

Table 5. GLaM (64B/64E) one-shot performance significantly
outperforms prior SOTAs for open domain settings in the wiki split.

     Model                                                  TriviaQA (Open-Domain)
     KG-FiD (large) (Yu et al., 2022) (finetuned, test)            69.8
     Switch-C (finetuned, dev)                                     47.5
     GPT-3 One-shot (dev)                                          68.0
     GPT-3 64-shot (test)                                          71.2
     GLaM One-shot (test)                                          75.0
     GLaM One-shot (dev)                                           75.8

As previously presented in Table 1, GLaM (64B/64E) has
competitive performance compared to GPT-3 (175B) for
zero, one and few-shot learning. Figure 1 compares the
performance for each category of tasks. In total, GLaM
(64B/64E) outperforms GPT-3 in 6 out of 7 categories on
average, indicating the performance gain is consistent. For
more details on each individual task, see Table 11. We
include results on the much larger and computationally
demanding Megatron-NLG and Gopher for reference. More
                                                                 The filtered webpages consist of 143B tokens whereas the
importantly, as shown in Table 4, GLaM (64B/64E) acti-
                                                                 unfiltered webpages consist of around 7T tokens.
vates roughly 96.6B parameters per token during inference,
which requires only half of the compute FLOPs needed by          Figure 3 (c) and (d) show that the model trained on fil-
GPT-3 given the same input.                                      tered data performs consistently better on both NLG and
                                                                 NLU tasks. In particular, the effect of filtering is bigger
We highlight one particular challenging open-domain ques-
                                                                 on NLG than that on NLU. Perhaps this is because NLG
tion answer task: TriviaQA. In open-domain question an-
                                                                 often requires generating high-quality language and filtered
swer tasks, the model is required to directly answer a given
                                                                 pretraining corpora are crucial to the generation capability
query without access to any additional context. Brown
                                                                 of language models. Our study highlights the fact that the
et al. (2020) show that the few-shot performance of Trivi-
                                                                 quality of the pretrained data also plays a critical role, specif-
aQA is able to grow smoothly with model size, indicating
                                                                 ically, in the performance of downstream tasks.
a language model is able to absorb knowledge using its
model capacity. As shown in Table 5, GLaM (64B/64E) is
better than the dense model and outperforms the previous         6.3. Scaling Studies
finetuned state-of-the-art (SOTA) on this dataset in the open-   Scaling up dense language models generally involves mak-
domain setting. Our one-shot result exceeds the previous         ing the models deeper by adding more layers, and wider by
finetuned SOTA (Yu et al., 2022) where additional knowl-         increasing the embedding dimension of token representa-
edge graph information is infused by 8.6%, and outperforms       tions. This process increases the total number of parameters
the few-shot GPT-3 on the testing server by 5.3%. This sug-      nparams of the model. For each prediction on a given input
gests that the additional capacity of GLaM plays a crucial       example, these models are ‘dense’ in that all nparams param-
role in the performance gain even though the nact-params of      eters will be activated, i.e., nparams = nact-params in Table 4.
GLaM (64B/64E) is only half of that in GPT-3. Comparing          Therefore, the effective FLOPs per prediction increases
to Switch-C, even though both models have similar total          linearly with the model size nparams . While the increased
number of parameters, GLaM (64B/64E) uses much larger            FLOPs may lead to boosted predictive performance, it also
experts (beyond one TPU core) than Switch-C. Therefore,          raises the overall cost per prediction.
GLaM’s one-shot performance on TriviaQA is also better
than the fine-tuned results of Switch-C in the open-domain       In contrast, GLaM MoE models are sparsely activated in
setting. Finally, we report zero, one and few-shot evaluation    that only a small fraction of the total nparams parameters will
mainly on the development set for all tasks in Tables 11, 12,    be activated for each prediction where nparams  nact-params .
13 and 14 of the appendix.                                       Therefore, GLaM MoE models can scale by also growing
                                                                 the size or number of experts in the MoE layer.
6.2. Effect of Data Quality                                      As shown in Figure 3(a), the average zero, one and few-shot
                                                                 performance across the generative tasks scales well with the
We study the impact of data quality on the few-shot perfor-
                                                                 effective FLOPs per prediction which is in turn determined
mance of downstream tasks. We use a modest-size GLaM
                                                                 by nact-params . We also find that GLaM MoE models perform
model (1.7B/64E) to show the effectiveness of filtering text
                                                                 consistently better than GLaM dense models for similar ef-
on model quality. We train models with the same hyper-
                                                                 fective FLOPs per token. For language understanding tasks
parameters on two datasets. One is the original dataset
                                                                 shown in Figure 3(b), the performance gain of GLaM MoE
described in Section 3 and the second consists of the dataset
                                                                 models has a similar scaling trend to that of the generative
with the filtered webpages replaced with the unfiltered web-
                                                                 tasks. We observe that both MoE and dense models perform
pages. The mixing proportions are fixed as given in Table 3.
                                                                 similarly at smaller scales but MoE models outperform at

[Figure 3: four panels plotting average score. (a) Scaling (NLG) and (b) Scaling (NLU): score versus GFlops per token prediction (ticks from 0.1 to 1 000) for dense models (0.1B, 1.7B, 8B, 137B) and MoE models (0.1B/64E, 1.7B/64E, 8B/64E, 64B/64E) in zero-, one- and few-shot settings, with GPT-3 zero-, one- and few-shot scores marked for reference. (c) Data filtering (NLG) and (d) Data filtering (NLU): score versus training tokens (x10^9, ticks from 200 to 800) for filtered and unfiltered data in zero-, one- and few-shot settings.]

Figure 3. Average zero, one and few-shot performance of GLaM MoE models versus GLaM dense models for similar effective FLOPs per
token over the 8 NLG tasks (a) and 21 NLU tasks (b). Comparison of model performance with filtered and unfiltered training data using
GLaM (1.7B/64E). Filtered data improves results significantly over unfiltered data for both (c) NLG and (d) NLU tasks across zero, one
and few-shot settings.
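
The dense-versus-sparse accounting behind the "effective FLOPs per token" axis in panels (a) and (b) can be illustrated with a small sketch. The class and field names (`ModelSize`, `shared_params`, `params_per_expert`), the toy parameter counts, and the choice of two active experts per token are illustrative assumptions, not values taken from this section. The factor of 2 FLOPs per activated parameter is a common rule of thumb, not necessarily the accounting used for the figure.

```python
from dataclasses import dataclass


@dataclass
class ModelSize:
    """Toy parameter accounting for a dense or MoE model (hypothetical fields)."""
    shared_params: float      # used by every token: attention, embeddings, ...
    params_per_expert: float  # one expert's parameters, summed over MoE layers
    num_experts: int          # experts per MoE layer; 1 means a dense model
    experts_per_token: int    # experts the router activates for each token

    @property
    def n_params(self) -> float:
        # Total parameters: every expert counts, whether or not it is used.
        return self.shared_params + self.params_per_expert * self.num_experts

    @property
    def n_act_params(self) -> float:
        # Activated parameters: only the experts selected for a given token.
        return self.shared_params + self.params_per_expert * self.experts_per_token


def gflops_per_token(model: ModelSize, flops_per_param: float = 2.0) -> float:
    """Rough forward-pass cost, assumed proportional to activated parameters."""
    return flops_per_param * model.n_act_params / 1e9


# Toy sizes, chosen only to make the arithmetic easy to follow.
dense = ModelSize(1e9, 1e9, num_experts=1, experts_per_token=1)
moe_64 = ModelSize(1e9, 1e9, num_experts=64, experts_per_token=2)
moe_128 = ModelSize(1e9, 1e9, num_experts=128, experts_per_token=2)

for name, m in [("dense", dense), ("64 experts", moe_64), ("128 experts", moe_128)]:
    print(f"{name}: n_params={m.n_params:.2e}, "
          f"n_act_params={m.n_act_params:.2e}, "
          f"GFLOPs/token={gflops_per_token(m):.1f}")
```

In this toy setting, doubling the number of experts roughly doubles `n_params` while leaving `n_act_params`, and therefore the estimated cost per token, unchanged. For the dense model the two counts coincide.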


larger scales. We also show experiments with scaling the number of experts in Section B where we observe that, for a fixed budget of computation per prediction, adding more experts generally leads to better predictive performance.

6.4. Efficiency of GLaM
Existing large dense language models usually require tremendous amounts of computation resources for training and serving (Patterson et al., 2021). They also need to consume massive amounts of pretraining data. We investigate the data and compute efficiency of the proposed GLaM models.

Data Efficiency. Figure 4 (a-c) and Figure 4 (e-g) show the learning curves of our models compared to the dense baselines of similar effective FLOPs in both NLG and NLU tasks. The x-axis is the number of tokens used in training, where we explicitly include GPT-3’s results when it is around 300B tokens. We first observe that GLaM MoE models require significantly less data than dense models of comparable FLOPs to achieve similar zero, one, and few-shot performance. In other words, when the same amount of data is used for training, MoE models perform much better, and the difference in performance becomes larger when training up to 630B tokens. Moreover, the GLaM (64B/64E) model trained with 280B tokens outperforms GPT-3 trained with 300B tokens by large margins on 4 out of the 6 learning settings (zero-shot/one-shot NLU and one-shot/few-shot NLG), and matches GPT-3 scores for the remaining setting, i.e., zero-shot NLG tasks.

Computation Efficiency & Energy Consumption. Figure 4 (d) and Figure 4 (h) show how the average zero, one and few-shot performance scales with the number of TPU years spent training MoE and dense models. We find that to achieve similar performance on downstream tasks, training sparsely activated models takes far fewer computational resources than training dense models.

As previously presented in Table 1, the GLaM (64B/64E) training after 600B tokens consumes 456 MWh, about 1/3 of the energy cost of 1287 MWh used by GPT-3. Moreover, to reach scores similar to (and slightly exceeding) those of GPT-3, we train using 1,024 TPU-v4 chips for 574 hours (with 280B tokens). This consumes 213 MWh or 1/6 of the GPT-3 energy cost. The reduced energy consumption of GLaM is due to the MoE architecture and computation efficiency optimizations from TPU-v4 hardware and GSPMD software. Energy calculations can be found in Section F.

7. Ethics and Unintended Biases
Large language models’ zero- and few-shot inference is an exciting capability: being able to control model behaviour intuitively with natural language and small datasets significantly lowers the barrier to prototyping and the development of new applications; it has the potential to help democratise using AI by dramatically decreasing the need for specialist knowledge. However, such opportunities also serve to highlight the importance of the many ethical challenges (Leidner & Plachouras, 2017; Bender et al., 2021; Bommasani et al., 2021) including representation bias (Blodgett et al., 2020), proper selection and handling of training data (Rogers, 2021) and its documentation (Bender & Friedman, 2018), privacy (Abadi et al., 2016b; Carlini et al., 2020), and environmental concerns (Strubell et al., 2019; Patterson et al., 2021). An important strand of this research focuses on unintended biases learnt by language models, including correlations between gender and profession (Bolukbasi et al., 2016; Rudinger et al., 2018; Zhao et al., 2018), negative sentiment about racial and religious groups (Li et al., 2020; Nadeem et al., 2021), and about people with disabilities (Hutchinson et al., 2020), as well as other social biases

[Figure 4: eight panels plotting average score. (a) Zero-shot (NLG), (b) One-shot (NLG), (c) Few-shot (NLG), (e) Zero-shot (NLU), (f) One-shot (NLU) and (g) Few-shot (NLU): score versus training tokens (x10^9, ticks at 100, 158.5, 251.2, 398.1 and 631) for dense models (0.1B, 1.7B, 137B) and MoE models (0.1B/64E, 1.7B/64E, 64B/64E), with GPT-3 marked for reference. (d) Scaling in TPU years (NLG) and (h) Scaling in TPU years (NLU): score versus TPU x years for the 64B/64E and 137B models in zero-, one- and few-shot settings.]

Figure 4. Learning efficiency comparison. Average zero-shot, one-shot and few-shot performance of GLaM MoE models versus GLaM dense models as more tokens are processed during training for 8 NLG tasks (a-c) and 21 NLU tasks (e-g). Panels (d) and (h) display the learning curves against the number of TPU years for the NLG and NLU tasks, respectively.
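
The compute and energy figures quoted in Section 6.4 can be cross-checked with simple arithmetic. The sketch below converts the reported 1,024 TPU-v4 chips running for 574 hours into TPU-years, the unit on the x-axis of panels (d) and (h), and recomputes the energy ratios against GPT-3. The helper name `tpu_years` and the 365-day year are assumptions made for illustration; the paper's own energy accounting is in Section F.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a 365-day year


def tpu_years(chips: int, hours: float) -> float:
    """Convert a training run (chip count x wall-clock hours) into TPU-years."""
    return chips * hours / HOURS_PER_YEAR


# Figures reported in Section 6.4.
GPT3_ENERGY_MWH = 1287
GLAM_600B_ENERGY_MWH = 456   # GLaM (64B/64E) after 600B tokens
GLAM_280B_ENERGY_MWH = 213   # GLaM (64B/64E) after 280B tokens

glam_280b_tpu_years = tpu_years(chips=1024, hours=574)
ratio_600b = GLAM_600B_ENERGY_MWH / GPT3_ENERGY_MWH
ratio_280b = GLAM_280B_ENERGY_MWH / GPT3_ENERGY_MWH

print(f"280B-token run: {glam_280b_tpu_years:.1f} TPU-years")
print(f"Energy vs GPT-3 after 600B tokens: {ratio_600b:.2f}")
print(f"Energy vs GPT-3 after 280B tokens: {ratio_280b:.2f}")
```

Under these assumptions the 280B-token run comes to roughly 67 TPU-years, and the two energy ratios are about 0.35 and 0.17, consistent with the rounded "about 1/3" and "1/6" quoted in the text.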


(Caliskan et al., 2017; Rudinger et al., 2017; Sap et al., 2020; Sotnikova et al., 2021). While measuring and mitigating the potential harm of language models is a very active area of research, as recognized by Blodgett et al. (2021) and Jacobs & Wallach (2021), there is still a significant need for more rigorous evaluation methods to assess the degree to which language models encode harmful stereotypes (May et al., 2019; Webster et al., 2021).

While there is not yet consensus on measurement methods or criteria for such general purpose large language models, the versatility and power of these models make it important to assess them on a range of metrics. We take inspiration from GPT-3 (Brown et al., 2020) and examine the co-occurrence in generated text referencing identity terms as well as report on the WinoGender benchmark (Rudinger et al., 2018). We also analyse toxicity degeneration similarly to Gopher (Rae et al., 2021), and extend the analysis to consider the human-behavioral baseline.

7.1. Co-occurrence prompts
Following the procedure described in Brown et al. (2020), we analyze commonly co-occurring words in the continuations when given prompts like “{term} was very...” where the substituted term references gender, religion, or racial and ethnic identity. For each prompt (Table 7 of the appendix), 800 outputs are generated using top-k sampling (k = 40) with a temperature of 1. An off-the-shelf POS tagger (Bird & Loper, 2004) is used to remove stop words and select only descriptive words (i.e., adjectives and adverbs). Adverbs are included because we noticed a common pattern of errors where adjectives are misclassified as adverbs; for example “pretty” in the phrase “She was very pretty and very accomplished”. Like Brown et al. (2020), to make the analysis transparent and easily reproducible, we omit any manual human labeling.

Like the analysis of other large language models that we build on, we note associative biases for all dimensions are obvious; for example, “pretty” is the most associated description for the term “She”, while it is not in the top-10 for the term “He”. Table 8 shows the most frequently occurring descriptive words in response to prompt-templates for gendered pronouns, and Tables 9 and 10 of the appendix show the same for race and religion prompts.

7.2. WinoGender
Coreference resolution is a capability that many applications require to perform well, including machine translation (Stanovsky et al., 2019; Webster & Pitler, 2020) and question answering (Lamm et al., 2020). To assess whether gendered correlations in GLaM cause it to make coreference errors in the one-shot setting, we measure WinoGender (Rudinger et al., 2018). GLaM (64B/64E) achieves a new state-of-the-art of 71.7% on the full dataset (compared to 64.2% for GPT-3 (Brown et al., 2020)). Promisingly, accuracy is remarkably close between ‘he’ examples (70.8%) and ‘she’ examples (72.5%), as well as between stereotypical examples (where the intended distribution is assumed to be close to the US occupation statistics (Rudinger et al., 2018)) and anti-stereotypical (or ‘gotcha’) examples (both 71.7%).

[Figure 5: line plot of continuation toxicity probability (y-axis, 0.1 to 0.5) against prompt toxicity probability, binned (x-axis, 0.0 to 1.0). Legend: model size (<1B dotted, 1-10B dashed, >100B solid) and model type (Dense, MoE, Human).]

Figure 5. The relationship between the Toxicity Probability of the Prompt (TPP), and the Toxicity Probability of the Continuation (TPC). Human refers to the continuation of the original human-written sentence.

to create the underlying dataset: sentences were selected across the toxicity spectrum. Moreover, toxicity can often be identified locally within a sentence, and toxicity in this dataset tends to occur later in the sentences. This causes the human-TPC to slightly drop as the TPP increases. In contrast, it is noteworthy that the model’s TPC closely follows TPP, reflecting the frequent observation that large language models are sometimes overly-strongly influenced by their prompt, e.g. repeating phrases from the prompt.

We also analysed the distribution of toxicity probabilities from the API for batches of 25 continuations. This highlighted that, even for low toxicity prompts, it is very likely that some generated continuation will be judged as toxic by most people reviewing it, according to the Perspective API’s predicted probability; further details can be found in Figure 8. We also note that this dataset’s sampling strategy, and the source it is taken from (Reddit), are likely not reflective of other domains. Moreover, even for very low TPP, applications are likely to want a much lower TPC: even generating 1 in 100 toxic suggestions is likely to be very problematic for applications.

8. Discussion
As observed in previous work on sparsely-activated models (Fedus et al., 2021), MoE models are more performant in knowledge-oriented tasks. Open-domain tasks are one way of measuring the amount of knowledge stored in a model.
7.3. Toxicity Degeneration                                                                                                  The performance of the MoE model in open-domain QA
                                                                                                                            benchmarks such as TriviaQA demonstrate the significantly
Toxicity degeneration is when a language model produces                                                                     increased information capacity of these models compared
text that is unintentionally toxic. To evaluate toxicity de-                                                                to dense models of similar effective FLOPs. Despite the
generation, we adapt the methodology used in (Welbl et al.,                                                                 in-context learning and training efficiency advantages, the
2021; Rae et al., 2021). We use the RealToxicityPrompts                                                                     sparsely activated models consist of a higher number of pa-
dataset (Gehman et al., 2020) which consists of sentences                                                                   rameters and thus require a larger number of devices. This
that have been split into two parts: a prompt prefix, and a                                                                 limits the resource accessibility and increases the serving
continuation postfix. Like the previous studies, we also use                                                                cost especially when the serving traffic is low.
the Perspective API which assigns a probability that the text
would be considered to be rude, disrespectful or otherwise
likely to make people want to leave a conversation. We then
                                                                                                                            9. Conclusions
asses how likely a continuation is to be toxic given various                                                                We propose and develop a family of generalist language
likelihoods that the prompt was toxic.                                                                                      models called GLaM, which use a sparsely activated
For each of 10K randomly sampled prompts, we generate                                                                       mixture-of-experts architecture to achieve better average
25 continuations,           with up to 100 tokens per continuations                                                         scores than not only their dense counterparts of similar effec-
     file:///Users/kevinrobinson/Downloads/visualization (25).svg
using top-k sampling (k = 40) with a temperature of 1. The                                                                  tive FLOPs, but also the GPT-3 models on 1/129 representative
Perspective API requires an non-empty string therefore we                                                                   NLP tasks in zero, one and few-shot learning. In partic-
assign a score of toxicity 0.0 when the continuation is the                                                                 ular, GLaM (64B/64E), our largest 1.2 trillion parameter
empty string; this could represent, for example, a chat bot                                                                 MoE language model, achieves better average performance
simply refusing to respond.                                                                                                 with only one third of energy consumption compared to
                                                                                                                            training GPT-3. We hope that our work will encourage
Figure 5 shows the relationship between the Toxicity Proba-                                                                 more research into methods for obtaining high-quality data,
bility of the Prompt (TPP), and the Toxicity Probability of                                                                 and using MoE for more efficient scaling of giant language
the Continuation (TPC). Note that, for low TPP, the rela-                                                                   models.
tively high human TPC is due to the sampling strategy used
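
As a rough illustration of the toxicity degeneration protocol in Section 7.3, the following Python sketch aggregates continuation scores by binned prompt toxicity. It is a minimal sketch under stated assumptions, not the evaluation code used for GLaM: the `scorer` callable is a hypothetical stand-in for a toxicity classifier such as the Perspective API, the 0.5 threshold for calling a continuation toxic and the ten equal-width bins are illustrative choices, and all function names are invented here.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# One sample: (toxicity probability of the prompt, its generated continuations).
Sample = Tuple[float, Sequence[str]]


def score_continuation(text: str, scorer: Callable[[str], float]) -> float:
    """Return a toxicity probability, scoring the empty string as 0.0."""
    # The (hypothetical) scorer is never called on an empty string,
    # mirroring the rule described in the text.
    return 0.0 if text == "" else scorer(text)


def tpp_bin(tpp: float, n_bins: int = 10) -> int:
    """Map a prompt toxicity probability in [0, 1] to a bin index."""
    return min(int(tpp * n_bins), n_bins - 1)


def summarize(
    samples: Sequence[Sample],
    scorer: Callable[[str], float],
    n_bins: int = 10,
    threshold: float = 0.5,  # assumed cut-off for "toxic"
) -> Tuple[Dict[int, float], float]:
    """Return (mean continuation toxicity per TPP bin,
    fraction of prompts with at least one toxic continuation)."""
    per_bin: Dict[int, List[float]] = defaultdict(list)
    batches_with_toxic = 0
    for tpp, continuations in samples:
        scores = [score_continuation(c, scorer) for c in continuations]
        per_bin[tpp_bin(tpp, n_bins)].extend(scores)
        if any(s >= threshold for s in scores):
            batches_with_toxic += 1
    mean_tpc = {b: sum(v) / len(v) for b, v in sorted(per_bin.items()) if v}
    frac = batches_with_toxic / len(samples) if samples else 0.0
    return mean_tpc, frac


def p_any_toxic(p: float, n: int = 25) -> float:
    """Chance that at least one of n continuations is toxic, assuming
    (for illustration only) independent continuations with rate p."""
    return 1.0 - (1.0 - p) ** n
```

Under the further illustrative assumption that continuations are independently toxic with probability 0.01, `p_any_toxic(0.01, 25)` is about 0.22, which shows how quickly a small per-continuation rate compounds over a batch of 25.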

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X. TensorFlow: A system for Large-Scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283, Savannah, GA, November 2016a. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. Deep learning with differential privacy. Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, Oct 2016b. doi: 10.1145/2976749.2978318. URL http://dx.doi.org/10.1145/2976749.2978318.

Adiwardana, D., Luong, M., So, D. R., Hall, J., Fiedel, N., Thoppilan, R., Yang, Z., Kulshreshtha, A., Nemade, G., Lu, Y., and Le, Q. V. Towards a human-like open-domain chatbot. CoRR, abs/2001.09977, 2020. URL https://arxiv.org/abs/2001.09977.

Bender, E. M. and Friedman, B. Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604, 2018. doi: 10.1162/tacl_a_00041. URL https://aclanthology.org/Q18-1041.

Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, pp. 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922. URL https://doi.org/10.1145/3442188.3445922.

Berant, J., Chou, A., Frostig, R., and Liang, P. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1533–1544, Seattle, Washington, USA, October 2013. Association for Computational Linguistics. URL https://aclanthology.org/D13-1160.

Bird, S. and Loper, E. NLTK: The natural language toolkit. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pp. 214–217, Barcelona, Spain, July 2004. Association for Computational Linguistics. URL https://aclanthology.org/P04-3031.

Bisk, Y., Zellers, R., Bras, R. L., Gao, J., and Choi, Y. Piqa: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.

Blodgett, S. L., Barocas, S., Daumé III, H., and Wallach, H. Language (technology) is power: A critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5454–5476, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.485. URL https://aclanthology.org/2020.acl-main.485.

Blodgett, S. L., Lopez, G., Olteanu, A., Sim, R., and Wallach, H. Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1004–1015, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL https://aclanthology.org/2021.acl-long.81.

Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/a486cd07e4ac3d270571622f4f316ec5-Paper.pdf.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K., Davis, J. Q., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn, C., Gale, T., Gillespie, L., Goel, K., Goodman, N. D., Grossman, S., Guha, N., Hashimoto, T., Henderson, P., Hewitt, J., Ho, D. E., Hong, J., Hsu, K., Huang, J., Icard, T., Jain, S., Jurafsky, D., Kalluri, P., Karamcheti, S., Keeling, G., Khani, F., Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. On the opportunities and risks of foundation models. CoRR, abs/2108.07258, 2021. URL https://arxiv.org/abs/2108.07258.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry,
G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

Caliskan, A., Bryson, J. J., and Narayanan, A. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186, Apr 2017. ISSN 1095-9203. doi: 10.1126/science.aal4230. URL http://dx.doi.org/10.1126/science.aal4230.

Carlini, N., Tramèr, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T. B., Song, D., Erlingsson, Ú., Oprea, A., and Raffel, C. Extracting training data from large language models. CoRR, abs/2012.07805, 2020.

Choi, E., He, H., Iyyer, M., Yatskar, M., Yih, W.-t., Choi, Y., Liang, P., and Zettlemoyer, L. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2174–2184, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924–2936, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300.

Clark, K., Luong, M.-T., Le, Q. V., and Manning, C. D. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1, 2018.

Dagan, I., Glickman, O., and Magnini, B. The pascal recognising textual entailment challenge. In Quiñonero-Candela, J., Dagan, I., Magnini, B., and d’Alché Buc, F. (eds.), Machine Learning Challenges. Evaluating Predictive Uncertainty, Visual Object Classification, and Recognising Tectual Entailment, pp. 177–190, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg. ISBN 978-3-540-33428-6.

Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/7137debd45ae4d0ab9aa953017286b20-Paper.pdf.

Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2978–2988, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1285. URL https://aclanthology.org/P19-1285.

Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. Language modeling with gated convolutional networks. In International conference on machine learning, pp. 933–941. PMLR, 2017.

de Marneffe, M.-C., Simons, M., and Tonhauser, J. The commitmentbank: Investigating projection in naturally occurring discourse. Proceedings of Sinn und Bedeutung, 23(2):107–124, Jul. 2019. doi: 10.18148/sub/2019.v23i2.601. URL https://ojs.ub.uni-konstanz.de/sub/index.php/sub/article/view/601.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.

Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., and Gardner, M. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Burstein, J., Doran, C., and Solorio, T. (eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 2368–2378. Association for Computational Linguistics, 2019. doi:
10.18653/v1/n19-1246. URL https://doi.org/10.18653/v1/n19-1246.

Fedus, W., Zoph, B., and Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961, 2021. URL https://arxiv.org/abs/2101.03961.

Fyodorov, Y., Winter, Y., and Francez, N. A natural logic inference system. In Inference in Computational Semantics, 2000.

Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. Realtoxicityprompts: Evaluating neural toxic degeneration in language models, 2020.

Gordon, A., Kozareva, Z., and Roemmele, M. SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012), pp. 394–398, Montréal, Canada, 7-8 June 2012. Association for Computational Linguistics. URL https://aclanthology.org/S12-1052.

Hendrycks, D. and Gimpel, K. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016. URL http://arxiv.org/abs/1606.08415.

Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.

Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., and Gelly, S. Parameter-efficient transfer learning for NLP. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2790–2799. PMLR, 09–15 Jun 2019. URL https://proceedings.mlr.press/v97/houlsby19a.html.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M. X., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. Gpipe: Efficient training of giant neural networks using pipeline parallelism. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp. 103–112, 2019.

Hutchinson, B., Prabhakaran, V., Denton, E., Webster, K., Zhong, Y., and Denuyl, S. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491–5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. URL https://aclanthology.org/2020.acl-main.487.

Jacobs, A. Z. and Wallach, H. Measurement and fairness. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Mar 2021. doi: 10.1145/3442188.3445901. URL http://dx.doi.org/10.1145/3442188.3445901.

Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada, July 2017. Association for Computational Linguistics.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S., and Roth, D. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 252–262, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1023. URL https://aclanthology.org/N18-1023.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/f442d33fa06832082290ad8544a8da27-Paper.pdf.

Kudo, T. and Richardson, J. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP, 2018.

Kudugunta, S., Huang, Y., Bapna, A., Krikun, M., Lepikhin, D., Luong, M.-T., and Firat, O. Beyond distillation: Task-level mixture-of-experts for efficient inference. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 3577–3599, 2021.

Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. Natural questions: a benchmark for question answering research. Transactions of the Association of Computational Linguistics, 2019.

Lai, G., Xie, Q., Liu, H., Yang, Y., and Hovy, E. RACE: Large-scale ReAding comprehension dataset from examinations. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 785–794, Copenhagen, Denmark, September 2017. Association for Computational Linguistics. doi: 10.18653/v1/D17-1082. URL https://aclanthology.org/D17-1082.

Lamm, M., Palomaki, J., Alberti, C., Andor, D., Choi, E., Soares, L. B., and Collins, M. QED: A framework and dataset for explanations in question answering. CoRR, abs/2009.06354, 2020. URL https://arxiv.org/abs/2009.06354.

Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In International conference on machine learning, 2014.

Leidner, J. L. and Plachouras, V. Ethical by design: Ethics best practices for natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 30–40, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1604. URL https://aclanthology.org/W17-1604.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=qrwe7XHTmYb.

Levesque, H., Davis, E., and Morgenstern, L. The winograd schema challenge. In 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012, Proceedings of the International Conference on Knowledge Representation and Reasoning, pp. 552–561. Institute of Electrical and Electronics Engineers Inc., 2012. ISBN 9781577355601. 13th International Conference on the Principles of Knowledge Representation and Reasoning, KR 2012; Conference date: 10-06-2012 Through 14-06-2012.

Li, T., Khashabi, D., Khot, T., Sabharwal, A., and Srikumar, V. UNQOVERing stereotyping biases via underspecified questions. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 3475–3489, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.311. URL https://aclanthology.org/2020.findings-emnlp.311.

Lieber, O., Sharir, O., Lenz, B., and Shoham, Y. Jurassic-1: Technical details and evaluation. White Paper. AI21 Labs, 2021.

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

May, C., Wang, A., Bordia, S., Bowman, S. R., and Rudinger, R. On measuring social biases in sentence encoders. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 622–628, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1063. URL https://aclanthology.org/N19-1063.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP, 2018.

Mikolov, T., Karafiát, M., Burget, L., Cernocký, J. H., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, 2010.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. In Bengio, Y. and LeCun, Y. (eds.), 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. URL http://arxiv.org/abs/1301.3781.

Mostafazadeh, N., Chambers, N., He, X., Parikh, D., Batra, D., Vanderwende, L., Kohli, P., and Allen, J. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 839–849, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1098. URL https://aclanthology.org/N16-1098.

Nadeem, M., Bethke, A., and Reddy, S. StereoSet: Measuring stereotypical bias in pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5356–5371, Online,
August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.416. URL https://aclanthology.org/2021.acl-long.416.

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, N. Q., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., and Fernández, R. The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1525–1534, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

Patterson, D., Gonzalez, J., Le, Q., Liang, C., Munguia, L.-M., Rothchild, D., So, D., Texier, M., and Dean, J. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.

Pennington, J., Socher, R., and Manning, C. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1162. URL https://aclanthology.org/D14-1162.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.

Pilehvar, M. T. and Camacho-Collados, J. WiC: 10,000 example pairs for evaluating context-sensitive representations. ArXiv, abs/1808.09121, 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2018. URL https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, H. F., Aslanides, J., Henderson, S., Ring, R., Young, S., Rutherford, E., Hennigan, T., Menick, J., Cassirer, A., Powell, R., van den Driessche, G., Hendricks, L. A., Rauh, M., Huang, P., Glaese, A., Welbl, J., Dathathri, S., Huang, S., Uesato, J., Mellor, J., Higgins, I., Creswell, A., McAleese, N., Wu, A., Elsen, E., Jayakumar, S. M., Buchatskaya, E., Budden, D., Sutherland, E., Simonyan, K., Paganini, M., Sifre, L., Martens, L., Li, X. L., Kuncoro, A., Nematzadeh, A., Gribovskaya, E., Donato, D., Lazaridou, A., Mensch, A., Lespiau, J., Tsimpoukelli, M., Grigorev, N., Fritz, D., Sottiaux, T., Pajarskas, M., Pohlen, T., Gong, Z., Toyama, D., de Masson d’Autume, C., Li, Y., Terzi, T., Mikulik, V., Babuschkin, I., Clark, A., de Las Casas, D., Guy, A., Jones, C., Bradbury, J., Johnson, M., Hechtman, B. A., Weidinger, L., Gabriel, I., Isaac, W. S., Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.

Rajpurkar, P., Jia, R., and Liang, P. Know what you don’t know: Unanswerable questions for squad. In ACL, 2018.

Reddy, S., Chen, D., and Manning, C. D. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, March 2019. doi: 10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016.

Rogers, A. Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 2182–2194, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.170. URL https://aclanthology.org/2021.acl-long.170.

Rudinger, R., May, C., and Van Durme, B. Social bias in elicited natural language inferences. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing, pp. 74–79, Valencia, Spain, April 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-1609. URL https://aclanthology.org/W17-1609.

Rudinger, R., Naradowsky, J., Leonard, B., and Van Durme, B. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 8–14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2002. URL https://aclanthology.org/N18-2002.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. In AAAI, pp. 8732–8740. AAAI Press, 2020.

Sap, M., Gabriel, S., Qin, L., Jurafsky, D., Smith, N. A., and Choi, Y. Social bias frames: Reasoning about
social and power implications of language. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5477–5490, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.486. URL https://aclanthology.org/2020.acl-main.486.

Shazeer, N. Glu variants improve transformer, 2020.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. ArXiv, abs/1804.04235, 2018.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q. V., Hinton, G. E., and Dean, J. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-tensorflow: Deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, pp. 10435–10444, Red Hook, NY, USA, 2018. Curran Associates Inc.

Shen, J., Nguyen, P., Wu, Y., Chen, Z., Chen, M. X., Jia, Y., Kannan, A., Sainath, T. N., Cao, Y., Chiu, C., He, Y., Chorowski, J., Hinsu, S., Laurenzo, S., Qin, J., Firat, O., Macherey, W., Gupta, S., Bapna, A., Zhang, S., Pang, R., Weiss, R. J., Prabhavalkar, R., Liang, Q., Jacob, B., Liang, B., Lee, H., Chelba, C., Jean, S., Li, B., Johnson, M., Anil, R., Tibrewal, R., Liu, X., Eriguchi, A., Jaitly, N., Ari, N., Cherry, C., Haghani, P., Good, O., Cheng, Y., Alvarez, R., Caswell, I., Hsu, W., Yang, Z., Wang, K., Gonina, E., Tomanek, K., Vanik, B., Wu, Z., Jones, L., Schuster, M., Huang, Y., Chen, D., Irie, K., Foster, G. F., Richardson, J., Macherey, K., Bruguier, A., Zen, H., Raffel, C., Kumar, S., Rao, K., Rybach, D., Murray, M., Peddinti, V., Krikun, M., Bacchiani, M., Jablin, T. B., Suderman, R., Williams, I., Lee, B., Bhatia, D., Carlson, J., Yavuz, S., Zhang, Y., McGraw, I., Galkin, M., Ge, Q., Pundak, G., Whipkey, C., Wang, T., Alon, U., Lepikhin, D., Tian, Y., Sabour, S., Chan, W., Toshniwal, S., Liao, B., Nirschl, M., and Rondon, P. Lingvo: a modular and scalable framework for sequence-to-sequence modeling. CoRR, abs/1902.08295, 2019.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Sotnikova, A., Cao, Y. T., Daumé III, H., and Rudinger, R. Analyzing stereotypes in generative text inference tasks. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 4052–4065, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-acl.355. URL https://aclanthology.org/2021.findings-acl.355.

Stanovsky, G., Smith, N. A., and Zettlemoyer, L. Evaluating gender bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1679–1684, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1164. URL https://aclanthology.org/P19-1164.

Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3645–3650, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1355. URL https://aclanthology.org/P19-1355.

Sutskever, I., Martens, J., and Hinton, G. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, pp. 1017–1024, Madison, WI, USA, 2011. Omnipress. ISBN 9781450306195.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. Superglue: A stickier benchmark for general-purpose language understanding systems. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/4496bf24afe7fab6f046bf4923da8de6-Paper.pdf.

Webster, K. and Pitler, E. Scalable cross lingual pivots to model pronoun gender for translation. CoRR,
abs/2006.08881, 2020. URL https://arxiv.org/abs/2006.08881.

Webster, K., Wang, X., Tenney, I., Beutel, A., Pitler, E., Pavlick, E., Chen, J., Chi, E., and Petrov, S. Measuring and reducing gendered correlations in pre-trained models, 2021.

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. Finetuned language models are zero-shot learners, 2021.

Welbl, J., Glaese, A., Uesato, J., Dathathri, S., Mellor, J., Hendricks, L. A., Anderson, K., Kohli, P., Coppin, B., and Huang, P.-S. Challenges in detoxifying language models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 2447–2469, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.210. URL https://aclanthology.org/2021.findings-emnlp.210.

Xu, Y., Lee, H., Chen, D., Hechtman, B. A., Huang, Y., Joshi, R., Krikun, M., Lepikhin, D., Ly, A., Maggioni, M., Pang, R., Shazeer, N., Wang, S., Wang, T., Wu, Y., and Chen, Z. GSPMD: general and scalable parallelization for ML computation graphs. CoRR, abs/2105.04663, 2021. URL https://arxiv.org/abs/2105.04663.

Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R. R., and Le, Q. V. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019.

Yu, D., Zhu, C., Fang, Y., Yu, W., Wang, S., Xu, Y., Ren, X., Yang, Y., and Zeng, M. KG-FiD: Infusing knowledge graph in fusion-in-decoder for open-domain question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 4961–4974, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.340. URL https://aclanthology.org/2022.acl-long.340.

Yu, Y., Abadi, M., Barham, P., Brevdo, E., Burrows, M., Davis, A., Dean, J., Ghemawat, S., Harley, T., Hawkins, P., Isard, M., Kudlur, M., Monga, R., Murray, D., and Zheng, X. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, EuroSys ’18, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355841. doi: 10.1145/3190508.3190551. URL https://doi.org/10.1145/3190508.3190551.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. URL https://aclanthology.org/P19-1472.

Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K., and Durme, B. V. Record: Bridging the gap between human and machine commonsense reading comprehension. CoRR, abs/1810.12885, 2018.

Zhao, J., Wang, T., Yatskar, M., Ordonez, V., and Chang, K.-W. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 15–20, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URL https://aclanthology.org/N18-2003.

A. Benchmarks

Open-Domain Question Answering: TriviaQA (Joshi et al., 2017), Natural Questions (NQS) (Kwiatkowski et al., 2019), Web Questions (WebQS) (Berant et al., 2013)

Cloze and Completion Tasks: LAMBADA (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), StoryCloze (Mostafazadeh et al., 2016)

Winograd-Style Tasks: Winograd (Levesque et al., 2012), WinoGrande (Sakaguchi et al., 2020)

Common Sense Reasoning: PIQA (Bisk et al., 2020), ARC (Easy) (Clark et al., 2018), ARC (Challenge) (Clark et al., 2018), OpenBookQA (Mihaylov et al., 2018)

In-context Reading Comprehension: DROP (Dua et al., 2019), CoQA (Reddy et al., 2019), QuAC (Choi et al., 2018), SQuADv2 (Rajpurkar et al., 2018), RACE-h (Lai et al., 2017), RACE-m (Lai et al., 2017)

SuperGLUE (Wang et al., 2019): BoolQ (Clark et al., 2019), CB (de Marneffe et al., 2019), COPA (Gordon et al., 2012), RTE (Dagan et al., 2006), WiC (Pilehvar & Camacho-Collados, 2018), WSC (Levesque et al., 2012), MultiRC (Khashabi et al., 2018), ReCoRD (Zhang et al., 2018)

Natural Language Inference: ANLI R1, ANLI R2, ANLI R3 (Fyodorov et al., 2000)

B. Scaling the Number of Experts

We also study the effects of increasing the number of experts per MoE layer. More concretely, we start with a modest-size 1.7B model, essentially a GLaM (1.7B/1E) model in which each MoE layer reduces to a single feed-forward network as the expert. We then increase the number of experts in each MoE layer from 1 to 256. Although the number of experts grows exponentially, the number of activated parameters (n_act-params) in each model barely increases, owing to the sparsity of GLaM. In fact, as shown in Table 4, all of these models have almost identical FLOPs per prediction.

In Figure 6, we observe that, for a fixed budget of computation per prediction, adding more experts generally leads to better predictive performance. This further verifies the performance gain of GLaM sparsely activated models over their dense counterparts when both have similar FLOPs per prediction, thanks to the increased capacity and flexibility from more experts.

[Figure 6: two panels plotting score (y-axis) against the number of experts per layer (x-axis: 1, 4, 16, 64, 256), each with zero-shot, one-shot, and few-shot curves.]
Figure 6. Average zero, one and few-shot performance versus the number of experts per layer for a set of modest-size models from 1.7B/1E to 1.7B/256E.

C. Model Partitioning

We partition the weights and computation of large GLaM models using the 2D sharding algorithm described in Xu et al. (2021), which exploits the 2D topology of the device network of the TPU cluster. We place experts with the same index across different MoE layers on the same device in order to generate an identical computation graph for different MoE layers. As a result, we can wrap the repetitive modules of the MoE Transformer architecture in a while-loop control flow statement (Abadi et al., 2016a; Yu et al., 2018) to reduce compilation time. Our experiments reveal that we should grow the size of the experts to get high-quality models. Therefore, when each expert gets sufficiently large, we have to allocate each expert across a set of N_E devices. For example, we partition the expert weight tensor with the shape [E, M, H] in the MoE layer along the expert dimension E and the hidden dimension H, and partition the input activation tensors with the shape [B, S, M] along the batch dimension B and the model dimension M. With this 2D sharding algorithm, we are then able to fully divide those large weight and activation tensors into smaller pieces such that there is no redundancy in data or compute across all devices. We rely on GSPMD’s compiler pass (Xu et al., 2021) to automatically determine the sharding properties for the rest of the tensors.

D. Data Contamination

As GLaM was trained on over 1.6 trillion tokens of text, it is a valid concern that some of the test data might appear exactly in the pretraining dataset, inflating some of the results. We therefore follow Brown et al. (2020) and Wei et al. (2021) and quantify the overlap between pretraining data and evaluation datasets.

Our analysis uses the same methodology as Wei et al. (2021), which in turn closely follows Brown et al. (2020). For each evaluation dataset we report the number of examples that overlap with the pretraining data, defining overlap as
having any n-gram that also appears in the pretraining data (varying n between datasets). We find that the number of validation examples appearing verbatim in the training data roughly matches that of prior work. We report these numbers in Table 6.

Table 6. Overlap statistics for the subset of datasets that are also used in GPT-3. An evaluation example was dirty if it had any n-gram collision with the pretraining corpus.

   Dataset          Split        Dirty count   Total count   % clean
   ANLI R1          validation           962          1000       3.8
   ANLI R2          validation           968          1000       3.2
   ANLI R3          validation           596          1200     50.33
   ARC Challenge    validation            95           299     68.23
   ARC Easy         validation           185           570     67.54
   BoolQ            validation          3013          3270      7.86
   CB               validation            15            56     73.21
   COPA             validation             3           100      97.0
   CoQa             test                 375           500      25.0
   DROP             dev                 9361          9536      1.84
   HellaSwag        validation          1989         10042     80.19
   LAMBADA          test                1125          5153     78.17
   MultiRC          validation          3334          4848     31.23
   NQs              validation           141          3610     96.09
   OpenBookQA       validation           100           500      80.0
   PIQA             validation           902          1838     50.92
   Quac             validation          7353          7354      0.01
   RACE-h           dev                 2552          3451     26.05
   RACE-m           dev                  838          1436     41.64
   RTE              validation           152           277     45.13
   ReCoRD           validation          9861         10000      1.39
   SQuADv2          validation         11234         11873      5.38
   StoryCloze       validation          1871          1871       0.0
   TriviaQA         validation          2121         11313     81.25
   WSC              test                 157           273     42.49
   WiC              validation            46           638     92.79
   Winograd         validation            70           104     32.69
   Winogrande       test                   6          1767     99.66

E. Ethics and Unintended Biases

Like Rae et al. (2021), we also analyzed toxicity degeneration with respect to model scale. This is shown in Figure 7. As in our other analyses, GLaM’s performance on this benchmark is fairly consistent across model sizes and MoE variants. The 0.1B/64E MoE variant, the smallest sparse variant analyzed, stands out in the plot; smaller MoE models may be less stable, as noted by Rae et al. (2021).

Following Rae et al. (2021), we also analyzed the distribution of generated toxicity probabilities with respect to model scale. The same pattern of scale invariance is observed with respect to the maximal expected toxicity probability of a continuation. The distribution of toxicity probabilities from the Perspective API for 25 continuations is plotted for low toxicity prompts in Figure 8. This shows that, even for low toxicity prompts, it is very likely that some generated continuation would be judged as toxic by most people reviewing it, according to the Perspective API’s model.

Table 7 shows the prompts used for the co-occurrence evaluation; these are the same as those of Brown et al. (2020). The top associations for gender templates are shown in Table 8, and Tables 9 and 10 show the same for the race and religion prompt templates.

[Figure 7: continuation toxicity probability (y-axis, 0.1 to 0.5) versus model parameters (x-axis, log scale from 1e+8 to 1e+13), with one curve per prompt toxicity bucket: High (>70%), Medium (40-70%), Low (10-40%), Very low.]
Figure 7. Toxicity degeneration scaling plot. The different shades show different buckets of prompt toxicity probability. The y-axis shows the expected probability of a continuation being toxic. The 0.1B/64E variant is noticeable, but as model parameters scale, the relationship to continuation toxicity is constant.

Table 7. Prompts used for each co-occurrence category.

   Category   Prompts
   gender     {term} was very...
              {term} would be described as...
   race       The {term} man was very...
              The {term} woman was very...
              People would describe the {term} person as...
   religion   {term} practitioners are...

F. Energy Usage

The power usage effectiveness (PUE) of the datacenter at the time of training (August and September 2021) was 1.11. Using 326 W measured system power per TPU-v4 chip, this leads to a total energy consumption of 213 MWh for GLaM, about 1/6 of the 1287 MWh energy cost of GPT-3. The datacenter PUE was 1.10 at the time of training GPT-3 (Patterson et al., 2021). The reduced energy consumption of GLaM is due to the MoE architecture and computation efficiency optimizations from TPU-v4 hardware and GSPMD software.
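The energy figures in Appendix F can be checked with a little arithmetic. The sketch below is a minimal illustration, not the authors' accounting: it computes the GLaM-to-GPT-3 energy ratio from the two totals quoted in that section, and it backs out an implied number of TPU-v4 chip-hours under the assumption (ours, not stated in the text) that total energy equals chip-hours × per-chip power × PUE. The function and variable names are hypothetical.

```python
# Figures quoted in Appendix F.
GLAM_ENERGY_MWH = 213.0    # total training energy reported for GLaM
GPT3_ENERGY_MWH = 1287.0   # total training energy reported for GPT-3
TPU_V4_POWER_W = 326.0     # measured system power per TPU-v4 chip
GLAM_PUE = 1.11            # datacenter PUE during GLaM training


def energy_ratio(a_mwh: float, b_mwh: float) -> float:
    """Return a / b, the relative energy cost of run a versus run b."""
    return a_mwh / b_mwh


def implied_chip_hours(total_mwh: float, chip_power_w: float, pue: float) -> float:
    """Back out chip-hours, ASSUMING total = chip_hours * chip_power * PUE."""
    total_wh = total_mwh * 1_000_000.0  # 1 MWh = 1e6 Wh
    return total_wh / (chip_power_w * pue)


ratio = energy_ratio(GLAM_ENERGY_MWH, GPT3_ENERGY_MWH)
chip_hours = implied_chip_hours(GLAM_ENERGY_MWH, TPU_V4_POWER_W, GLAM_PUE)

print(f"GLaM / GPT-3 energy: {ratio:.3f} (about 1/{1 / ratio:.1f})")
print(f"Implied TPU-v4 chip-hours (under the stated assumption): {chip_hours:,.0f}")
```

The ratio comes out close to the 1/6 stated in the text. The chip-hours value is only as reliable as the assumed accounting formula and should not be read as a figure reported by the paper.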

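The percentile view behind the expected maximum toxicity metric (Appendix E and Figure 8) can be illustrated with a small sketch. It assumes each prompt already has 25 continuation toxicity probabilities. The scores here are synthetic stand-ins drawn from a seeded random number generator, not Perspective API outputs, and the helper names (`summarize_prompt`, `aggregate`) are hypothetical. For each prompt it takes the minimum, quartiles, and maximum over continuations, then averages each statistic across prompts; the paper's exact aggregation may differ.

```python
import random
import statistics

NUM_CONTINUATIONS = 25  # continuations sampled per prompt, as in the text


def summarize_prompt(scores):
    """Min, quartiles, and max of one prompt's continuation toxicity scores."""
    q25, q50, q75 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"min": min(scores), "25th": q25, "50th": q50, "75th": q75, "max": max(scores)}


def aggregate(per_prompt_scores):
    """Average each per-prompt statistic across prompts."""
    summaries = [summarize_prompt(s) for s in per_prompt_scores]
    return {k: statistics.fmean(s[k] for s in summaries) for k in summaries[0]}


rng = random.Random(0)
# Synthetic toxicity probabilities in [0, 1]; NOT real API scores.
prompts = [
    [rng.betavariate(1.0, 6.0) for _ in range(NUM_CONTINUATIONS)]
    for _ in range(100)
]
result = aggregate(prompts)
expected_max_toxicity = result["max"]  # mean over prompts of the per-prompt maximum
print({k: round(v, 3) for k, v in result.items()})
```

Under this reading, the "Max" column corresponds to the expected maximum toxicity, and the "Min" column to the least toxic of the 25 samples averaged over prompts.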

[Figure 8 plot: "Expected toxicity for non-toxic prompts, 8B-dense". y-axis: expected toxicity (0.0 to 1.0); x-axis: percentiles, aggregating across prompts (Min, 25th, 50th, 75th, Max).]

Figure 8. Expected toxicity probability given low toxicity probability prompts for the 8B Dense variant. This chart shows the distributions underlying the expected maximum toxicity metric for the 8B Dense model. The y-axis shows expected toxicity and the x-axis shows the distribution aggregated at different percentiles. At the left, the minimum continuation toxicity reflects that, after repeated evaluations of 25 samples, the least toxic response for some outlier non-toxic prompts still had a 0.8 probability of being perceived as toxic. At the right we see that the worst-case toxicity has an almost uniform distribution across non-toxic prompts. In other words, in 25 samples across low toxicity probability prompts, for the majority of trials, there will be a high toxicity probability continuation.

Table 8. Gender: top co-occurrences for prompts like “{term} was very...”
The top 10 most common descriptive words (and counts).

    “He”                 “She”
    much (188)           pretty (232)
    great (130)          little (185)
    well (129)           much (154)
    little (129)         beautiful (148)
    good (124)           always (142)
    always (114)         good (136)
    black (103)          black (117)
    even (92)            never (116)
    many (87)            even (111)
    also (83)            well (110)

Table 9. Race: co-occurrence in response to prompts like “People would describe the {term} person as...”.

    Term                   Most common descriptive words
    Asian                  Asian, black, white, polite, even, really, Chinese, good, also, nice
    Black                  white, black, much, even, well, angry, good, also, proud, happy
    White                  white, black, many, even, Indian, much, good, happy, angry, never
    Latinx                 white, black, even, really, also, Spanish, much, well, different, never
    Indian                 Indian, white, black, much, even, different, happy, really, never, good
    Middle-Eastern         white, black, even, eastern, polite, really, middle, nice, brown, also

As a result of its low energy consumption, GLaM training also has lower CO2 emissions. The net tCO2e per MWh of the datacenter at the time was 0.088, so training GLaM with 280B tokens emits a total of 18.7 net tCO2e, compared to 552 net tCO2e for GPT-3 (Patterson et al., 2021). The complete GLaM training using 600B tokens consumes only 456 MWh and emits 40.2 net tCO2e.
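The emissions figures in the energy discussion follow from multiplying energy by the datacenter's net carbon intensity. The sketch below is only an illustration of that arithmetic. It assumes the 213 MWh figure corresponds to the 280B-token run (the matching product suggests this, but the text does not state it outright), and the function name is hypothetical.

```python
NET_TCO2E_PER_MWH = 0.088  # datacenter net carbon intensity reported in the text


def net_emissions_tco2e(energy_mwh: float, intensity: float = NET_TCO2E_PER_MWH) -> float:
    """Net emissions in tCO2e as energy (MWh) times carbon intensity (tCO2e per MWh)."""
    return energy_mwh * intensity


emissions_280b = net_emissions_tco2e(213.0)  # assumed: 213 MWh is the 280B-token run
emissions_600b = net_emissions_tco2e(456.0)  # complete 600B-token training

print(f"280B tokens: {emissions_280b:.1f} net tCO2e")
print(f"600B tokens: {emissions_600b:.1f} net tCO2e")
```

The first product reproduces the reported 18.7 net tCO2e. The second gives about 40.1, slightly below the reported 40.2; the gap is presumably due to rounding in the published inputs.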

G. Results on All Tasks for All Model Sizes
We include the zero/one/few-shot results of different model sizes on all the tasks in Tables 11, 12, 13 and 14.

Table 10. Religion: co-occurrence in response to prompts like “{term} practitioners are...”

    Term              Most common descriptive words
    Atheism           religious, also, bad, likely, really, much, many, moral, even, sure
    Buddhism          also, generally, many, religious, always, often, even, good, first, different
    Christianity      religious, also, Christian, many, even, often, always, likely, different, bad
    Islam             also, religious, even, many, likely, still, different, generally, much, violent
    Hinduism          generally, also, religious, many, different, even, often, well, Indian, likely
    Judaism           Jewish, also, religious, responsible, many, even, well, generally, often, different




Table 11. Scores of GLaM (64B/64E), GPT-3 and Gopher across all 29 benchmarks. We include the significantly larger and more
computationally expensive Gopher and Megatron-NLG models for reference.

                                        Zero-shot        One-shot                         Few-shot (shots)
                                    GPT-3    GLaM    GPT-3    GLaM        GPT-3       Gopher      Megatron-NLG   GLaM
  Name           Metric     Split
                                    (175B) (64B/64E) (175B) (64B/64E)     (175B)      (280B)         (530B)    (64B/64E)
  TriviaQA       acc (em)   dev      64.3     71.3    68.0      75.8     71.2 (64)    57.1 (64)        –        75.8 (1)
  NQs            acc (em)   test     14.6     24.7    23.0      26.3     29.9 (64)    28.2 (64)        –        32.5 (64)
  WebQS          acc (em)   test     14.4     19.0    25.3      24.4     41.5 (64)       –             –        41.1 (64)

  Lambada        acc (em)   test     76.2     64.2    72.5      80.9     86.4 (15)    74.5 (0)        87.2      86.6 (9)
  HellaSwag      acc        dev      78.9     76.6    78.1      76.8     79.3 (20)    79.2 (0)        82.4      77.2 (8)
  StoryCloze     acc        test     83.2     82.5    84.7      84.0     87.7 (70)       –             –        86.7 (16)

  Winograd       acc        test     88.3     87.2    89.7      83.9      88.6 (7)       –             –        88.6 (2)
  WinoGrande     acc        dev      70.2     73.5    73.2      73.1     77.7 (16)    70.1 (0)        78.9      79.2 (16)

  DROP           f1         dev      23.6     57.3    34.3      57.8     36.5 (20)       –             –        58.6 (2)
  CoQA           f1         dev      81.5     78.8    84.0      79.6      85.0 (5)       –             –        79.6 (1)
  QuAC           f1         dev      41.5     40.3    43.4      42.8      44.3 (5)       –             –        42.7 (1)
  SQuADv2        f1         dev      62.1     71.1    64.6      71.8     69.8 (16)       –             –        71.8 (10)
  SQuADv2        acc (em)   dev      52.6     64.7    60.1      66.5     64.9 (16)       –             –        67.0 (10)
  RACE-m         acc        test     58.4     64.0    57.4      65.5     58.1 (10)    75.1 (5)         –        66.9 (8)
  RACE-h         acc        test     45.5     46.9    45.9      48.7     46.8 (10)    71.6 (5)        47.9      49.3 (2)

  PIQA           acc        dev      81.0     80.4    80.5      81.4     82.3 (50)    81.8 (0)        83.2      81.8 (32)
  ARC-e          acc        test     68.8     71.6    71.2      76.6     70.1 (50)       –             –        78.9 (16)
  ARC-c          acc        test     51.4     48.0    53.2      50.3     51.5 (50)       –             –        52.0 (3)
  OpenbookQA     acc        test     57.6     53.4    58.8      55.2     65.4 (100)      –             –        63.0 (32)

  BoolQ          acc        dev      60.5     83.1    76.7      82.8     77.5 (32)       –            84.8      83.1 (8)
  Copa           acc        dev      91.0     90.0    87.0      92.0     92.0 (32)       –             –        93.0 (16)
  RTE            acc        dev      63.5     67.9    70.4      71.5     72.9 (32)       –             –        76.2 (8)
  WiC            acc        dev      0.0      50.3    48.6      52.7     55.3 (32)       –            58.5      56.3 (4)
  Multirc        f1a        dev      72.9     73.7    72.9      74.7     74.8 (32)       –             –        77.5 (4)
  WSC            acc        dev      65.4     85.3    69.2      83.9     75.0 (32)       –             –        85.6 (2)
  ReCoRD         acc        dev      90.2     90.3    90.2      90.3     89.0 (32)       –             –        90.6 (2)
  CB             acc        dev      46.4     48.2    64.3      73.2     82.1 (32)       –             –        84.0 (8)

  ANLI R1        acc        test     34.6     39.2    32.0      42.4     36.8 (50)       –             –        44.3 (2)
  ANLI R2        acc        test     35.4     37.3    33.9      40.0     34.0 (50)       –            39.6      41.2 (10)
  ANLI R3        acc        test     34.5     41.3    35.1      40.8     40.2 (50)       –             –        44.7 (4)

  Avg NLG        –           –       47.6     54.6    52.9      58.4        58.8         –             –          61.6
  Avg NLU        –           –       60.8     66.2    65.4      68.6        68.4         –             –          71.4




           Table 12. Zero-shot scores on all 29 benchmarks for GPT3 and different GLaM MoE and dense models.

                                                 GLaM (MoE)                             GLaM (Dense)           GPT3

Name            Metric      Split   0.1B/64E   1.7B/64E    8B/64E    64B/64E    0.1B     1.7B   8B     137B    175B

TriviaQA        acc (em)    dev       9.42       44.0       55.1       71.3      2.3     27.0   48.1   64.0    64.3
NQs             acc (em)    test      2.24        9.2       11.9       24.7      1.1     5.6    9.0    17.3    14.6
WebQS           acc (em)    test      3.44        8.3       10.7       19.0      0.7     5.9    7.7    13.8    14.4

Lambada         acc (em)    test      41.4       63.7       67.3       64.2      37.8    60.1   69.3   70.9    76.2
HellaSwag       acc         dev       43.1       65.8       74.0       76.6      34.7    60.6   72.2   76.9    78.9
StoryCloze      acc         test      66.4       76.2       78.9       82.5      63.3    75.1   79.5   81.1    83.2

Winograd        acc         test      66.3       80.2       83.9       87.2      67      78.7   81.6   84.3    88.3
WinoGrande      acc         dev       51.0       63.9       67.8       73.5      49.7    62.6   70.1   71.5    70.2

DROP            f1          dev       9.43       13.4       16.8       57.3      5.67    14.0   17.0   21.8    23.6
CoQA            f1          dev       45.9       65.3       65.5       78.8      40.7    66.5   68.7   72.1    81.5
QuAC            f1          dev       25.2       32.8       33.8       40.3      25.4    33.3   30.7   38.3    41.5
SQuADv2         f1          dev       22.9       49.2       57.1       71.1      16.8    44.9   55.7   65.5    59.5
SQuADv2         acc (em)    dev       7.06       29.6        38        64.7      3.4     24     35.8   48.2    52.6
RACE-m          acc         test      43.4       56.1       61.9       64.0      40.6    53.6   63.0   67.8    58.4
RACE-h          acc         test      30.4       40.4       43.4       46.9      29.4    40.0   45.0   47.2    45.5

PIQA            acc         dev       70.0       76.9       78.6       80.4      64.4    73.6   78.2   78.5    80.4
ARC-e           acc         test      52.0       66.2       66.2       71.6      44.5    62.2   67.9   71.7    68.8
ARC-c           acc         test      26.5       37.6       42.8       48.0      23.2    35.1   42.7   47.2    51.4
Openbookqa      acc         test      40.0       46.4       50.0       53.4      36.8    46.7   49.8   52.0    57.6

BoolQ           acc         dev       56.6       62.7       72.2       83.1      56.6    56.1   73.6    78     60.5
Copa            acc         dev       73          85         86         90       67      80     86      90      91
RTE             acc         dev       45.8       58.8       60.3       67.9      51.3    49.1   63.8   50.5    63.5
WiC             acc         dev       50.0       49.8       49.5       50.3      50.8    50.3   44     50.6     0.0
Multirc         f1a         dev       57.7       58.0       52.4       73.7      58.6    53.0   39.0   54.8    72.9
WSC             acc         dev       65.6       79.3       81.8       85.3      66.3    77.2   80.7   82.8    65.4
ReCoRD          acc         dev       77.5       87.1       88.9       90.3      71.6    86.7   89.2   90.3    90.2
CB              acc         dev       66.1       33.9       40.7       48.2      42.9    37.5   33.9   42.9    46.4

ANLI R1         acc         dev       34.1       33.9       33.4       39.2      36.1    33.2   34.7   39.4    34.6
ANLI R2         acc         dev       33.8       32.4       34.9       37.3      36.7    33.6   34.8   35.7    35.4
ANLI R3         acc         dev       32.8       34.0       34.6       41.3      34.8    34.1   34.9   34.6    34.5

Avg NLG         -             -       18.6       35.1       39.6       54.6      14.9    31.3   38.0   45.8    47.6
Avg NLU         -             -       51.5       58.3       61.1       66.2      48.9    56.1   60.2   63.2    60.8
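The table does not state how the "Avg NLG" row is computed. The sketch below tests one plausible reading and should not be taken as the authors' definition: it assumes an unweighted mean over the nine rows scored with acc (em) or f1, counting both SQuADv2 rows. Under that assumption, the zero-shot values copied from Table 12 reproduce the reported averages for the two GLaM MoE columns checked here. The names `NLG_ROWS` and `avg_nlg` are hypothetical.

```python
from statistics import mean

# Zero-shot scores copied from Table 12: (GLaM 0.1B/64E, GLaM 64B/64E).
# Assumption: these nine rows are the ones averaged into "Avg NLG".
NLG_ROWS = {
    ("TriviaQA", "acc (em)"): (9.42, 71.3),
    ("NQs", "acc (em)"): (2.24, 24.7),
    ("WebQS", "acc (em)"): (3.44, 19.0),
    ("Lambada", "acc (em)"): (41.4, 64.2),
    ("DROP", "f1"): (9.43, 57.3),
    ("CoQA", "f1"): (45.9, 78.8),
    ("QuAC", "f1"): (25.2, 40.3),
    ("SQuADv2", "f1"): (22.9, 71.1),
    ("SQuADv2", "acc (em)"): (7.06, 64.7),
}


def avg_nlg(column: int) -> float:
    """Unweighted mean of one model column over the assumed NLG rows, to one decimal."""
    return round(mean(scores[column] for scores in NLG_ROWS.values()), 1)


small_avg = avg_nlg(0)  # GLaM 0.1B/64E; the table reports 18.6
large_avg = avg_nlg(1)  # GLaM 64B/64E; the table reports 54.6

print(small_avg, large_avg)
```

Only these two columns were checked; agreement elsewhere in the tables is not guaranteed.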




            Table 13. One-shot scores on all 29 benchmarks for GPT3 and different GLaM MoE and dense models.

                                                GLaM (MoE)                          GLaM (Dense)               GPT3

Name            Metric      Split   0.1B/64E   1.7B/64E   8B/64E   64B/64E   0.1B    1.7B   8B     137B   GPT-3 (175B)

TriviaQA        acc (em)    dev       15.2       54.1      65.9     75.8     8.3     36.3   56.4   70.0        68.0
NQs             acc (em)    test      2.5        10.7      16.0     26.3     1.19    6.5    10.7   19.1        23.0
WebQS           acc (em)    test      5.9        13.9      17.0     24.4     3.44    9.3    11.6   18.8        25.3

Lambada         acc (em)    test      36.9       57.4      64.1     80.9     21.8    52.3   64.7   68.5        72.5
HellaSwag       acc         dev       43.5       66.4      74.0     76.8     34.7    60.5   72.6   76.8        78.1
StoryCloze      acc         test      67.0       77.9      80.0     84.0     63.7    76.4   82.1   82.6        84.7

Winograd        acc         test      69.2       80.2      85.3     83.9     65.6    80.2   84     85.3        89.7
WinoGrande      acc         dev       51.7       63.5      68.7     73.0     49.8    62.8   70.0   73.1        73.2

DROP            f1          dev       16.3       24.8      28.4     57.8     19.3    24.9   41.2   49.4        34.3
CoQA            f1          dev       48.3       72.8       76      79.6     33.3    72.7   74.4   78.8        84.0
QuAC            f1          dev       28.7       35.2      43.1     42.7     23.7    35.7   35.1   44.6        43.4
SQuADv2         f1          dev       35.5       69.5      76.3     71.8     34.2    67.1   69.2   70.0        65.4
SQuADv2         acc (em)    dev       21.8       53.6      60.9     66.5     29.0    50.8   64.2   63.7        60.1
RACE-m          acc         test      42.7       60.9      60.6     65.5     43.1    56.4   63.1   69.0        57.4
RACE-h          acc         test      29.1       41.9      44.6     48.7     29.4    40.8   45.3   47.7        45.9

PIQA            acc         dev       69.0       76.0      78.1     81.4     63.7    73.1   76.3   79.5        80.5
ARC-e           acc         test      53.5       68.1      73.4     76.6     45.9    63.8   62.6   77.2        71.2
ARC-c           acc         test      27.0       39.3      44.8     50.3     24.5    35.2   41.5   50.7        53.2
Openbookqa      acc         test      39.6       47.6      50.6     55.2     37.8    47.2   53.0   55.4        58.8

BoolQ           acc         dev       53.6       62.0      70.8     82.8     55.7    58.1   76.4   77.5        76.7
Copa            acc         dev       75         81         86       92      71       81    86      91          87
RTE             acc         dev       53.1       54.5      57.0     71.5     53.4    55.2   62.0   58.4        70.4
WiC             acc         dev       47.3       47.0      48.0     52.7     47.3    46.8   48.0   48.7        48.6
Multirc         f1a         dev       58.5       59.6      62.0     74.7     56.3    59.4   61.9   64.2        72.9
WSC             acc         dev       67.7       77.5      83.8     83.9     63.8    78.5   83.0   86.3        69.2
ReCoRD          acc         dev       77.5       87.3      89.0     90.3     71.6    86.2   89.2   90.2        90.1
CB              acc         dev       41.1       35.7      44.6     73.2     42.9    41.1   30.4   48.2        64.3

ANLI R1         acc         dev       32.1       31.1      32.3     42.4     32.5    31.4   31.9   34.8        32.0
ANLI R2         acc         dev       31.1       30.7      32.5     40.0     30.7    31.2   30.7   32.6        33.9
ANLI R3         acc         dev       30.5       31.6      34.8     40.8     30.9    30.3   32.4   35.0        35.1

Avg NLG         -             -       23.5       43.6      49.7     58.4     19.4    39.5   47.5   52.8        52.7
Avg NLU         -             -       50.4       58.1      61.9     68.6     48.3    56.9   61.7   65.0        65.4




Table 14. Few-shot scores on all 29 benchmarks for GPT-3 and different GLaM MoE and dense models. For each task, we tune the number of shots up to the value used by GPT-3.

                                                   GLaM (MoE)                          GLaM (Dense)             GPT3

    Name           Metric      Split   0.1B/64E   1.7B/64E   8B/64E   64B/64E   0.1B    1.7B   8B     137B   GPT-3 (175B)

    TriviaQA       acc (em)    dev       21.7       60.1      67.7     75.8      8.3    38.8   56.4   70.0       71.2
    NQs            acc (em)    test      5.3        17.7      24.4     32.5     1.50    9.0    20.1   27.9       29.9
    WebQS          acc (em)    test      12.1       24.4      29.6     41.1     6.90    9.3    25.5   32.9       41.5

    Lambada        acc (em)    test      36.9       64.3      79.0     86.6     21.8    63.0   77.1   84.2       86.4
    HellaSwag      acc         dev       45.6       66.2      74.0     77.2     34.7    60.7   72.6   76.8       79.3
    StoryCloze     acc         test      69.4       80.0      82.8     86.7     63.7    78.7   83.7   85.7       87.7

    Winograd       acc         test      69.2       82.8      85.3     88.6     65.6    80.5   85.4   85.3       88.6
    WinoGrande     acc         dev       52.6       66.2      71.4     79.2     49.8    64.2   72.3   76.6       77.7

    DROP           f1          dev       23.5       37.0      40.0     58.6     19.3    41.4   49.4   49.4       36.5
    CoQA           f1          dev       48.3       66.0       72      79.6     33.3    66.0   74.4   78.8       85.0
    QuAC           f1          dev       26.0       34.2      43.1     42.8     23.7    34.3   35.1   37.2       44.3
    SQuADv2        f1          dev       38.7       61.8      67.1     71.8     34.2    60.0   69.6   70.0       69.8
    SQuADv2        acc (em)    dev       32.7       55.5      60.9     67.0     29.0    53.9   64.2   63.7       64.9
    RACE-m         acc         test      41.8       53.6      60.6     66.9     43.1    56.5   56     65.1       58.1
    RACE-h         acc         test      31.5       40.2      44.6     49.3     29.5    40.8   43     48.1       46.8

    PIQA           acc         dev       69.0       76.1      78.1     81.8     64.2    73.1   77     80.8       82.3
    ARC-e          acc         test      57.8       70.1      75.3     78.9     48.9    66.0   74     79.0       70.1
    ARC-c          acc         test      29.7       38.3      45.5     52.0     24.8    35.2   41.5   45.7       51.5
    Openbookqa     acc         test      41.6       49.6      53.0     63.0     37.8     54    54.0   58.8       65.4

    BoolQ          acc         dev       53.6       62.0      70.5     83.1     59.9    63.1   76.4   80.5       77.5
    Copa           acc         dev       75         82         88      93.0      71      83    92.0   91.0       92.0
    RTE            acc         dev       53.1       54.5      60.0     76.2     54.9    55.2   64.0   63.9       72.9
    WiC            acc         dev       49.4       51.3      53.3     56.3     51.9    50.9   50.0   53.6       55.3
    Multirc        f1a         dev       58.5       59.7      62.0     77.5     56.3    59.4   61.5   68.1       74.8
    WSC            acc         dev       67.7       80.4      83.8     85.6     65.6    80.0   82.0   87.4       75.0
    ReCoRD         acc         dev       77.5       87.3      89.0     90.6     71.8    86.2   89.0   90.5       89.0
    CB             acc         dev       43.0       53.6      60.7     84.0     42.9    55.4   58     53.6       82.1

    ANLI R1        acc         dev       34.3       31.4      34.0     44.3     33.5    33.1   33.2   35.8       36.8
    ANLI R2        acc         dev       32.3       33.0      32.0     41.2     34.4    33.7   33.9   35.6       34.0
    ANLI R3        acc         dev       33.9       35.8      33.0     44.7     32.9    33.3   35.0   34.7       40.2

    Avg NLG        -             -       27.2       46.8      53.0     61.6     19.8    42.7   52.4   57.1       58.8
    Avg NLU        -             -       51.7       59.7      63.6     71.4     49.2    59.2   63.7   66.8       68.4