来源材料

资料来源

← 首页

                                                                                                                                                                                             1




                                                               A Survey of Large Language Models
                                              Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
                                               Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
                                                   Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen†

                                                  Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence
                                                  by machine. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a
                                                  significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major
                                                  approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving
                                                  from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-
                                                  training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP)
                                                  tasks. Since the researchers have found that model scaling can lead to an improved model capacity, they further investigate the scaling




arXiv:2303.18223v19 [cs.CL] 18 Mar 2026
                                                  effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these
                                                  enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-
                                                  context learning) that are not present in small-scale language models (e.g., BERT). To discriminate the language models in different
                                                  parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g.,
                                                  containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia
                                                  and industry, and a remarkable progress is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has
                                                  attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI
                                                  community, which would revolutionize the way how we develop and use AI algorithms. Considering this rapid technical progress, in this
                                                  survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,
                                                  we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we
                                                  also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides
                                                  an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

                                                  Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

                                                                                                                              ✦

                                          1      I NTRODUCTION
                                              “The limits of my language mean the limits of my world.”                            extensive attention in the literature, which can be divided
                                                                              —Ludwig Wittgenstein                                into four major development stages:
                                                                                                                                      • Statistical language models (SLM). SLMs [6–9] are de-
                                                                                                                                  veloped based on statistical learning methods that rose in
                                          L    ANGUAGE is a prominent ability in human beings to
                                               express and communicate, which develops in early
                                          childhood and evolves over a lifetime [3, 4]. Machines,
                                                                                                                                  the 1990s. The basic idea is to build the word prediction
                                                                                                                                  model based on the Markov assumption, e.g., predicting the
                                                                                                                                  next word based on the most recent context. The SLMs with
                                          however, cannot naturally grasp the abilities of understand-
                                                                                                                                  a fixed context length n are also called n-gram language
                                          ing and communicating in the form of human language,
                                                                                                                                  models, e.g., bigram and trigram language models. SLMs
                                          unless equipped with powerful artificial intelligence (AI)
                                                                                                                                  have been widely applied to enhance task performance
                                          algorithms. It has been a longstanding research challenge
                                                                                                                                  in information retrieval (IR) [10, 11] and natural language
                                          to achieve this goal, to enable machines to read, write, and
                                                                                                                                  processing (NLP) [12–14]. However, they often suffer from
                                          communicate like humans [5].
                                                                                                                                  the curse of dimensionality: it is difficult to accurately
                                              Technically, language modeling (LM) is one of the major
                                                                                                                                  estimate high-order language models since an exponential
                                          approaches to advancing language intelligence of machines.
                                                                                                                                  number of transition probabilities need to be estimated.
                                          In general, LM aims to model the generative likelihood
                                                                                                                                  Thus, specially designed smoothing strategies such as back-
                                          of word sequences, so as to predict the probabilities of
                                                                                                                                  off estimation [15] and Good–Turing estimation [16] have
                                          future (or missing) tokens. The research of LM has received
                                                                                                                                  been introduced to alleviate the data sparsity problem.
                                                                                                                                      • Neural language models (NLM). NLMs [1, 17, 18] charac-
                                          •     Version: v18 (major update on March 7, 2026).                                     terize the probability of word sequences by neural networks,
                                          •     GitHub link: https://github.com/RUCAIBox/LLMSurvey
                                          •     Chinese book link: lmbook-zh.github.io
                                                                                                                                  e.g., multi-layer perceptron (MLP) and recurrent neural net-
                                          •     † Corresponding Author.                                                           works (RNNs). As a remarkable contribution, the work in
                                          •     * K. Zhou and J. Li contribute equally to this work.                              [1] introduced the concept of distributed representation of
                                          •     The authors are mainly with Gaoling School of Artificial Intelligence and         words and built the word prediction function conditioned
                                                School of Information, Renmin University of China, Beijing, China; Jian-
                                                Yun Nie is with DIRO, Université de Montréal, Canada.                           on the aggregated context features (i.e., the distributed
                                                Contact e-mail: batmanfly@gmail.com                                               word vectors). By extending the idea of learning effective
                                          •     The authors of this survey paper reserve all the copyrights of the fig-           features for text data, a general neural network approach
                                                ures/tables, and any use of these materials for publication purpose must be
                                                officially granted by the survey authors.
                                                                                                                                  was developed to build a unified, end-to-end solution for
                                                                                                                                                                                                                                                                           2

 
                                                                                                                                       
                                                                                                                                                   
                                                                                                                              *37
                                                                                                                                 *37
  
                                                                                                                  //D0$
                                                                                                                            //D0$           
                                                                                                                                                   
                                                                                                                  &KDW*37
                                                                                                                    &KDW*37               
                                                                                                                                                                                                                                                    *37
                                                                                                                                                                                                                                                            *37
  
      
                                                                                                ,QVWUXFW*37
                                                                                                    ,QVWUXFW*37                       
                                                                                                                                                                                                                                                //D0$
                                                                                                                                                                                                                                                         //D0$
  
                                                                                   &RGH[
                                                                                              &RGH[                                          
                                                                                                                                                    
                                                                                                                                                                                                                                                   &KDW*37
                                                                                                                                                                                                                                                  &KDW*37
                                                                                                                                                    
                                                                                                                                                  
      
                                                          *37
                                                               *37
                                                    77                                                                                        
                                                                                                                                                                                                                        ,QVWUXFW*37
                                                                                                                                                                                                                            ,QVWUXFW*37
                                     *37
                       *37 *37
                    *37                                                                                                                                                                                &RGH[
                                                                                                                                                                                                             &RGH[
                                                                                                                                                               77         *37
                                                                                                                                                                               *37
                           %(57
                            %(57                                                                                                                  
                
                             
                                                    
                                                                          
                                                                                                      
                                                                                                                             
                                                                                                                                                                
                                                                                                                                                                                      
                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                               
                                                                       7L7LPPH H                                                                                                                 7L7LPPHH
                                   (a) Query=”Language Model”                                                                                                  (b) Query=”Large Language Model”

Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018)
and “large language model” (since October 2019), respectively. The statistics are calculated using exact match by querying
the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because “language
models” have been explored at an earlier time. We label the points corresponding to important landmarks in the research
progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers
that contain “large language model” in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).


                                                                                                                                                                                               General-purpose
                                                                                                                                                   Transferable                                  task solver
                                                                                                               Task-agnostic                      NLP task solver                            GPT-3/4!ChatGPT!Claude
                                                                                                                                                                                             Scaling language models
              Task                                  Specific task                                             feature learner                   ELMO!BERT!GPT-1/2                            Prompt based completion
              solving                                 helper                                               Word2vec (NPLM)!NLPS Context-aware representations                                Solve various real-world tasks
              capacity                            n-gram models                                            Static word representations
                                                                                                                                       Pre-training + fine-tuning
                                                                                                                                       Solve various NLP tasks                                                      LLM
                                                  Statistical methods                                      Neural context modeling
                                                  Probability estimation                                   Solve typical NLP tasks           Pre-trained LM
                                                  Assist in specific tasks                                                   Neural LM
                                                           Statistical LM


                                                                 1990s                                                           2013                         2018                                                     2020

Fig. 2: An evolution process of the four generations of language models (LM) from the perspective of task solving capacity.
Note that the time period for each stage may not be very accurate, and we set the time mainly according to the publish
date of the most representative studies at each stage. For neural language models, we abbreviate the paper titles of
two representative studies to name the two approaches: NPLM [1] (“A neural probabilistic language model”) and NLPS [2]
(“Natural language processing (almost) from scratch”). Due to the space limitation, we don’t list all representative studies in
this figure.


various NLP tasks [2]. Furthermore, word2vec [19, 20] was                                                                                       designed pre-training tasks on large-scale unlabeled cor-
proposed to build a simplified shallow neural network                                                                                           pora. These pre-trained context-aware word representations
for learning distributed word representations, which were                                                                                       are very effective as general-purpose semantic features,
demonstrated to be very effective across a variety of NLP                                                                                       which have largely raised the performance bar of NLP
tasks. These studies have initiated the use of language                                                                                         tasks. This study has inspired a large number of follow-up
models for representation learning (beyond word sequence                                                                                        work, which sets the “pre-training and fine-tuning” learning
modeling), having an important impact on the field of NLP.                                                                                      paradigm. Following this paradigm, a great number of stud-
                                                                                                                                                ies on PLMs have been developed, introducing either differ-
   • Pre-trained language models (PLM). As an early at-                                                                                         ent architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or
tempt, ELMo [21] was proposed to capture context-aware                                                                                          improved pre-training strategies [27–29]. In this paradigm, it
word representations by first pre-training a bidirectional                                                                                      often requires fine-tuning the PLM for adapting to different
LSTM (biLSTM) network (instead of learning fixed word                                                                                           downstream tasks.
representations) and then fine-tuning the biLSTM network
according to specific downstream tasks. Furthermore, based                                                                                           • Large language models (LLM). Researchers find that
on the highly parallelizable Transformer architecture [22]                                                                                      scaling PLM (e.g., scaling model size or data size) often
with self-attention mechanisms, BERT [23] was proposed by                                                                                       leads to an improved model capacity on downstream tasks
pre-training bidirectional language models with specially                                                                                       (i.e., following the scaling law [30]). A number of studies
                                                                                                                                         3

have explored the performance limit by training an ever                  the prompting interface (e.g., GPT-4 API). Humans have to
larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-                 understand how LLMs work and format their tasks in a way
parameter PaLM). Although scaling is mainly conducted                    that LLMs can follow. Third, the development of LLMs no
in model size (with similar architectures and pre-training               longer draws a clear distinction between research and en-
tasks), these large-sized PLMs display different behaviors               gineering. The training of LLMs requires extensive practical
from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-                   experiences in large-scale data processing and distributed
parameter GPT-2) and show surprising abilities (called emer-             parallel training. To develop capable LLMs, researchers
gent abilities [31]) in solving a series of complex tasks. For           have to solve complicated engineering issues, working with
example, GPT-3 can solve few-shot tasks through in-context               engineers or being engineers.
learning, whereas GPT-2 cannot do well. Thus, the research                   Nowadays, LLMs are posing a significant impact on
community coins the term “large language models (LLM)”1                  the AI community, and the advent of ChatGPT and GPT-4
for these large-sized PLMs [32–35], which attract increasing             leads to the rethinking of the possibilities of artificial general
research attention (See Figure 1). A remarkable application              intelligence (AGI). OpenAI has published a technical article
of LLMs is ChatGPT2 that adapts the LLMs from the GPT                    entitled “Planning for AGI and beyond”, which discusses
series for dialogue, which presents an amazing conversation              the short-term and long-term plans to approach AGI [40],
ability with humans. We can observe a sharp increase of the              and a more recent paper has argued that GPT-4 might be
arXiv papers that are related to LLMs after the release of               considered as an early version of an AGI system [41]. The
ChatGPT in Figure 1.                                                     research areas of AI are being revolutionized by the rapid
    As discussed before, language model is not a new tech-               progress of LLMs. In the field of NLP, LLMs can serve as a
nical concept specially for LLMs, but has evolved with the               general-purpose language task solver (to some extent), and
advance of artificial intelligence over the decades. Early lan-          the research paradigm has been shifting towards the use
guage models mainly aim to model and generate text data,                 of LLMs. In the field of IR, traditional search engines are
while latest language models (e.g., GPT-4) focus on complex              challenged by the new information seeking way through AI
task solving. From language modeling to task solving, it is an           chatbots (i.e., ChatGPT), and New Bing3 presents an initial
important leap in scientific thinking, which is the key to               attempt that enhances the search results based on LLMs. In
understand the development of language models in the re-                 the field of CV, the researchers try to develop ChatGPT-like
search history. From the perspective of task solving, the four           vision-language models that can better serve multimodal
generations of language models have exhibited different lev-             dialogues [42–45], and GPT-4 [46] has supported multi-
els of model capacities. In Figure 2, we describe the evolu-             modal input by integrating the visual information. This new
tion process of language models in terms of the task solving             wave of technology would potentially lead to a prosperous
capacity. At first, statistical language models mainly assisted          ecosystem of real-world applications based on LLMs. For
in some specific tasks (e.g., retrieval or speech tasks), in             instance, Microsoft 365 is being empowered by LLMs (i.e.,
which the predicted or estimated probabilities can enhance               Copilot) to automate the office work, and OpenAI supports
the performance of task-specific approaches. Subsequently,               the use of plugins in ChatGPT for implementing special
neural language models focused on learning task-agnostic                 functions.
representations (e.g., features), aiming to reduce the efforts               Despite the progress and impact, the underlying prin-
for human feature engineering. Furthermore, pre-trained                  ciples of LLMs are still not well explored. Firstly, it is
language models learned context-aware representations that               mysterious why emergent abilities occur in LLMs, instead of
can be optimized according to downstream tasks. For the                  smaller PLMs. As a more general issue, there lacks a deep,
latest generation of language model, LLMs are enhanced by                detailed investigation of the key factors that contribute to
exploring the scaling effect on model capacity, which can be             the superior abilities of LLMs. It is important to study when
considered as general-purpose task solvers. To summarize,                and how LLMs obtain such abilities [47]. Although there are
in the evolution process, the task scope that can be solved              some meaningful discussions about this problem [31, 47],
by language models have been greatly extended, and the                   more principled investigations are needed to uncover the
task performance attained by language models have been                   “secrets“ of LLMs. Secondly, it is difficult for the research
significantly enhanced.                                                  community to train capable LLMs. Due to the huge de-
    In the existing literature, PLMs have been widely dis-               mand of computation resources, it is very costly to carry
cussed and surveyed [36–39], while LLMs are seldom re-                   out repetitive, ablating studies for investigating the effect
viewed in a systematic way. To motivate our survey, we first             of various strategies for training LLMs. Indeed, LLMs are
highlight three major differences between LLMs and PLMs.                 mainly trained by industry, where many important training
First, LLMs display some surprising emergent abilities that              details (e.g., data collection and cleaning) are not revealed
may not be observed in previous smaller PLMs. These abili-               to the public. Thirdly, it is challenging to align LLMs with
ties are key to the performance of language models on com-               human values or preferences. Despite the capacities, LLMs
plex tasks, making AI algorithms unprecedently powerful                  are also likely to produce toxic, fictitious, or harmful con-
and effective. Second, LLMs would revolutionize the way                  tents. It requires effective and efficient control approaches
that humans develop and use AI algorithms. Unlike small                  to eliminating the potential risk of the use of LLMs [46].
PLMs, the major approach to accessing LLMs is through                        Faced with both opportunities and challenges, it needs
                                                                         more attention on the research and development of LLMs. In
  1. Note that a LLM is not necessarily more capable than a small PLM,   order to provide a basic understanding of LLMs, this survey
and emergent abilities may not occur in some LLMs.
  2. https://openai.com/blog/chatgpt/                                      3. https://www.bing.com/new
                                                                                                                                                      4

conducts a literature review of the recent advances in LLMs                shown that scaling can largely improve the model capacity
from four major aspects, including pre-training (how to pre-               of LLMs [26, 55, 56]. Thus, it is useful to establish a quantita-
train a capable LLM), adaptation (how to effectively adapt                 tive approach to characterizing the scaling effect. Next, we
pre-trained LLMs for better use), utilization (how to use                  introduce two representative scaling laws for Transformer
LLMs for solving various downstream tasks) and capability                  language models [30, 34].
evaluation (how to evaluate the abilities of LLMs and existing                 • KM scaling law5 . In 2020, Kaplan et al. [30] (the OpenAI
empirical findings). We thoroughly comb the literature and                 team) firstly proposed to model the power-law relationship
summarize the key findings, techniques, and methods of                     of model performance with respective to three major factors,
LLMs. For this survey, we also create a GitHub project                     namely model size (N ), dataset size (D), and the amount of
website by collecting the supporting resources for LLMs, at                training compute (C ), for neural language models. Given
the link https://github.com/RUCAIBox/LLMSurvey. We                         a compute budget c, they empirically presented three basic
are also aware of several related review articles on PLMs                  formulas for the scaling law6 :
or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either
                                                                                                   α
discuss PLMs or some specific (or general) aspects of LLMs.
                                                                                              
                                                                                                Nc N
Compared with them, we focus on the techniques and                              L(N )    =             , αN ∼ 0.076, Nc ∼ 8.8 × 1013                 (1)
                                                                                                N
methods to develop and use LLMs and provide a relatively                                          α
                                                                                                Dc D
comprehensive reference to important aspects of LLMs.                           L(D)     =             , αD ∼ 0.095, Dc ∼ 5.4 × 1013
                                                                                                D
    The remainder of this survey is organized as follows:                                      αC
                                                                                                Cc
Section 2 introduces the background for LLMs and the evo-                       L(C)     =            , αC ∼ 0.050, Cc ∼ 3.1 × 108
                                                                                                C
lution of GPT-series models, followed by the summarization
of available resources for developing LLMs in Section 3.                   where L(·) denotes the cross entropy loss in nats, and
Sections 4, 5, 6, and 7 review and summarize the recent                    a follow-up study [58] from OpenAI has shown that the
progress from the four aspects of pre-training, adaptation,                language modeling loss can be decomposed into two parts,
utilization, and capacity evaluation, respectively. Then, Sec-             namely irreducible loss (the entropy of the true data distri-
tion 8 discusses the practical guide for prompt design,                    bution) and reducible loss (an estimate of the KL divergence
and Section 9 reviews the applications of LLMs in several                  between the true and model distributions). The three laws
representative domains. Finally, we conclude the survey in                 were derived by fitting the model performance with varied
Section 10 by summarizing the major findings and discuss                   data sizes (22M to 23B tokens), model sizes (768 to 1.5B non-
the remaining issues for future work.                                      embedding parameters) and training compute, under some
                                                                           assumptions (e.g., the analysis of one factor should be not
                                                                           bottlenecked by the other two factors). They showed that
2     OVERVIEW                                                             the model performance has a strong dependence relation on
In this section, we present an overview about the back-                    the three factors.
ground of LLMs and then summarize the technical evolu-                        • Chinchilla scaling law. As another representative study,
tion of the GPT-series models.                                             Hoffmann et al. [34] (the Google DeepMind team) proposed
                                                                           an alternative form for scaling laws to instruct the compute-
2.1   Background for LLMs                                                  optimal training for LLMs. They conducted rigorous exper-
                                                                           iments by varying a larger range of model sizes (70M to
Typically, large language models (LLMs) refer to Transformer               16B) and data sizes (5B to 500B tokens), and fitted a similar
language models that contain hundreds of billions (or                      scaling law yet with different coefficients as below [34]:
more) of parameters4 , which are trained on massive text
data [32], such as GPT-3 [55], PaLM [56], Galactica [35],                                                 A      B
                                                                                              L(N, D) = E +  + β,                 (2)
and LLaMA [57]. LLMs exhibit strong capacities to un-                                                    Nα     D
derstand natural language and solve complex tasks (via                     where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and
text generation). To have a quick understanding of how                     β = 0.28. By optimizing the loss L(N, D) under the con-
LLMs work, this part introduces the basic background for                   straint C ≈ 6N D, they showed that the optimal allocation
LLMs, including scaling laws, emergent abilities and key                   of compute budget to model size and data size can be
techniques.                                                                derived as follows:
Formulation of Scaling Laws for LLMs. Currently, LLMs                                                    a                               b
are mainly built upon the Transformer architecture [22],                                              C                                 C
                                                                                 Nopt (C) = G                  ,   Dopt (C) = G−1                ,   (3)
where multi-head attention layers are stacked in a very                                               6                                 6
deep neural network. Existing LLMs adopt similar Trans-                               α         β
                                                                           where a = α+β , b = α+β and G is a scaling coefficient that
former architectures and pre-training objectives (e.g., lan-
                                                                           can be computed by A, B , α and β . As analyzed in [34],
guage modeling) as small language models. However, LLMs
significantly extend the model size, data size, and total                    5. Since there was not a model trained following this law in the
compute (orders of magnification). Extensive research has                  original paper, we took the last names of the two co-first authors to
                                                                           name this scaling law.
  4. In existing literature, there is no formal consensus on the minimum     6. Here, Nc , Dc and Cc are measured in the number of non-
parameter scale for LLMs, since the model capacity is also related to      embedding parameters, the number of training tokens and the number
data size and total compute. In this survey, we take a slightly loose      of FP-days, respectively. According to the original paper [30], Cc and C
definition of LLMs, and mainly focus on discussing language models         should be denoted by Ccmin and Cmin , corresponding to the optimal
with a model size larger than 10B.                                         use of compute. We use the simplified notations for ease of discussions.
                                                                                                                                               5

given an increase in compute budget, the KM scaling law              characterize task-level scaling laws, since it might be also
favors a larger budget allocation in model size than the data        dependent on task-related information (task metric, task
size, while the Chinchilla scaling law argues that the two           difficulty, etc.). Furthermore, some capacities (e.g., in-context
sizes should be increased in equal scales, i.e., having similar      learning [55]) are unpredictable according to the scaling law,
values for a and b in Equation (3).                                  which can be observed only when the model size exceeds a
                                                                     certain level (as discussed below).
Discussion on Scaling Laws. After introducing the formu-
lations, we continue to discuss scaling law in the following         Emergent Abilities of LLMs. In the literature [31], emergent
two aspects, to enhance its understanding:                           abilities of LLMs are formally defined as “the abilities that
    • Predictable scaling. In practice, scaling law can be used      are not present in small models but arise in large models”,
to instruct the training of LLMs, and it has been proven             which is one of the most prominent features that distin-
feasible to reliably estimate the performance of larger mod-         guish LLMs from previous PLMs. It further introduces a
els based on that of smaller models, called predictable scal-        notable characteristic when emergent abilities occur [31]:
ing [46]. The benefits of predictable scaling for training           performance rises significantly above random when the
LLMs are mainly twofold. Firstly, for large models, it is            scale reaches a certain level. By analogy, such an emergent
infeasible to rigorously examine various training tricks or          pattern has close connections with the phenomenon of phase
variants, and it would be very helpful if experiences gained         transition in physics [31, 63]. In principle, emergent abilities
from small models could also apply to large models. For              can be defined in relation to some complex tasks [31, 64],
instance, small proxy models can be trained to find the              while we are more concerned with general abilities that
optimal schedule of the data mixture for large models [59].          can be applied to solve a variety of tasks. Here, we briefly
Secondly, the training of large-scale models takes a long            introduce three typical emergent abilities for LLMs and
time, often suffering from issues such as training loss spike,       representative models that possess such an ability8 .
and scaling law can be employed to monitor the training                  • In-context learning. The in-context learning (ICL) ability
status of LLMs, e.g., identifying abnormal performance at an         is formally introduced by GPT-3 [55]: assuming that the
early time. Despite that scaling law characterizes a smooth          language model has been provided with a natural language
trend of performance increase (or loss decrease), it also            instruction and/or several task demonstrations, it can gen-
indicates that diminishing returns7 might occur as model             erate the expected output for the test instances by com-
scaling. An empirical study [58] from the OpenAI team                pleting the word sequence of input text, without requiring
has shown that representation quality or semantic content            additional training or gradient update9 . Among the GPT-
can still effectively improve even if approaching the point          series models, the 175B GPT-3 model exhibited a strong ICL
of diminishing returns (i.e., approaching the irreducible            ability in general, but not the GPT-1 and GPT-2 models. Such
loss) [58]. This finding suggests that training large models         an ability also depends on the specific downstream task. For
are promising for improving the performance of down-                 example, the ICL ability can emerge on the arithmetic tasks
stream tasks. To further explore scaling effect, a potential         (e.g., the 3-digit addition and subtraction) for the 13B GPT-3,
issue is that the amount of available data for training LLMs         but 175B GPT-3 even cannot work well on the Persian QA
is actually limited. With the ever-increasing model scale, the       task [31].
public text data would be soon “exhausted” for LLMs [60].                • Instruction following. By fine-tuning with a mixture of
Thus, it will be meaningful to study how scaling laws apply          multi-task datasets formatted via natural language descrip-
to a data-constrained regime [61], where data repetition or          tions (called instruction tuning), LLMs are shown to perform
augmentation might be useful to alleviate data scarcity.             well on unseen tasks that are also described in the form
    • Task-level predictability. Existing research of scaling laws   of instructions [28, 66, 67]. With instruction tuning, LLMs
are mostly conducted in terms of language modeling loss              are enabled to follow the task instructions for new tasks
(e.g., per-token cross-entropy loss in nats [30]), while in          without using explicit examples, thus having an improved
practice we are more concerned about the performance of              generalization ability. According to the experiments in [67],
LLMs on actual tasks. Thus, a basic problem is that how              instruction-tuned LaMDA-PT [68] started to significantly
the decrease of language modeling loss translates into the           outperform the untuned one on unseen tasks when the
improvement of task performance [58]. Intuitively, a model           model size reached 68B, but not for 8B or smaller model
with a smaller language modeling loss tends to yield a               sizes. A recent study [69] found that a model size of 62B is
better performance on downstream tasks, since language               at least required for PaLM to perform well on various tasks
modeling loss can be considered as a general measure of              in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA
the overall model capacity. GPT-4 [46] has reported that             and MGSM), though a much smaller size might suffice for
some capabilities (e.g., coding ability) can be accurately           some specific tasks (e.g., MMLU).
predicted via scaling law. Despite that, readers should be               • Step-by-step reasoning. For small language models, it
aware that a direct decrease in language modeling loss does          is usually difficult to solve complex tasks that involve
not always indicate an improvement of model performance
                                                                        8. It is difficult to accurately examine the critical size for emergent
on downstream tasks. Specially, the phenomenon of inverse
                                                                     abilities of LLMs (i.e., the minimum size to possess an ability), since it
scaling would occur for some tasks, where task performance           might vary for different models or tasks. Also, existing studies often
surprisingly becomes worse as the language modeling loss             test emergent abilities on very limited model sizes for a specific LLM.
decreases [62]. Overall, it is more difficult to explore and         For example, PaLM is often tested with three sizes of 8B, 62B and 540B.
                                                                     It is unclear about the model performance of the untested sizes.
                                                                        9. In a recent study [65], it also shows that in-context learning implic-
  7. https://en.wikipedia.org/wiki/Diminishing returns               itly performs meta-optimization through the attention mechanism.
                                                                                                                                           6

multiple reasoning steps, e.g., mathematical word problems.                 every day. It is interesting that young parents would be often
In contrast, with the chain-of-thought (CoT) prompting                      surprised by unexpected progress of the speaking ability
strategy [33], LLMs can solve such tasks by utilizing the                   exhibited by their babies.
prompting mechanism that involves intermediate reasoning
steps for deriving the final answer. This ability is speculated             Key Techniques for LLMs. It has been a long way that
to be potentially obtained by training on code [33, 47]. An                 LLMs evolve into the current state: general and capable
empirical study [33] has shown that CoT prompting can                       learners. In the development process, a number of impor-
bring performance gains (on arithmetic reasoning bench-                     tant techniques are proposed, which largely improve the
marks) when applied to PaLM and LaMDA variants with                         capacity of LLMs. Here, we briefly list several important
a model size larger than 60B, while its advantage over                      techniques that (potentially) lead to the success of LLMs, as
the standard prompting becomes more evident when the                        follows.
model size exceeds 100B. Furthermore, the performance                            • Scaling. As discussed in previous parts, there exists
improvement with CoT prompting seems to be also varied                      an evident scaling effect in Transformer language mod-
for different tasks, e.g., GSM8K > MAWPS > SWAMP for                        els: larger model/data sizes and more training compute
PaLM [33].                                                                  typically lead to an improved model capacity [30, 34]. As
                                                                            two representative models, GPT-3 and PaLM explored the
How Emergent Abilities Relate to Scaling Laws. In existing                  scaling limits by increasing the model size to 175B and
literature [30, 31, 34], scaling laws and emergent abilities                540B, respectively. Since compute budget is usually limited,
provide two perspectives to understand the advantage of                     scaling laws can be further employed to conduct a more
large models over small models. In general, scaling law                     compute-efficient allocation of the compute resources. For
(often measured by language modeling loss) describes pre-                   example, Chinchilla (with more training tokens) outper-
dictable performance relation with the potential effect of                  forms its counterpart model Gopher (with a larger model
diminishing returns, while emergent abilities (often mea-                   size) by increasing the data scale with the same compute
sured by task performance) are unpredictable but very prof-                 budget [34]. In addition, data scaling should be with careful
itable once such abilities actually emerge. Since the two                   cleaning process, since the quality of pre-training data plays
perspectives reflect different performance trends (continu-                 a key role in the model capacity.
ous improvement v.s. sharp performance leap), they might                         • Training. Due to the huge model size, it is very chal-
lead to misaligned findings or observations. There are also                 lenging to successfully train a capable LLM. Distributed
extensive debates on the rationality of emergent abilities.                 training algorithms are needed to learn the network param-
A popular speculation is that emergent abilities might be                   eters of LLMs, in which various parallel strategies are of-
partially attributed to the evaluation setting for special tasks            ten jointly utilized. To support distributed training, several
(e.g., the discontinuous evaluation metrics) [70, 71]: when                 optimization frameworks have been released to facilitate
evaluation metrics are altered accordingly, the sharpness of                the implementation and deployment of parallel algorithms,
the emergent ability curve would disappear. However, the                    such as DeepSpeed [74] and Megatron-LM [75–77]. Also, op-
performance of LLMs on most tasks are perceived by users                    timization tricks are also important for training stability and
naturally in a discontinuous way. For instance, end users                   model performance, e.g., restart to overcome training loss
prefer a reliable code generated by LLMs that can success-                  spike [56] and mixed precision training [78]. More recently,
fully pass the test case, but are less interested in selecting a            GPT-4 [46] proposes to develop special infrastructure and
better code with fewer errors between two failed ones. More                 optimization methods that reliably predict the performance
recently, a study [72] proposes a new evaluation setting                    of large models with much smaller models.
that can enlarge the resolution of task metrics, making task                     • Ability eliciting. After being pre-trained on large-scale
performance more predictable. Despite these efforts, more                   corpora, LLMs are endowed with potential abilities as
fundamental research (e.g., grokking10 ) about the working                  general-purpose task solvers. These abilities might not be
mechanism of LLMs is still in need to understand the emer-                  explicitly exhibited when LLMs perform some specific tasks.
gence of certain abilities. The subtle relation between scaling             As the technical approach, it is useful to design suitable task
law and emergent abilities can be explained by analogy with                 instructions or specific in-context learning strategies to elicit
the ability acquisition of human11 . Take the speaking ability              such abilities. For instance, chain-of-thought prompting has
as an example. For children, language development (espe-                    been shown to be useful to solve complex reasoning tasks
cially infants) can be also considered as a multi-level process             by including intermediate reasoning steps. Furthermore,
where “emergent abilities” occur. Specially, the language                   we can perform instruction tuning on LLMs with task
ability would relatively stable within a time interval, but                 descriptions expressed in natural language, for improving
qualitative change only occurs when evolving into another                   the generalizability of LLMs on unseen tasks. These eliciting
ability level (e.g., from speaking simple words to speaking                 techniques mainly correspond to the emergent abilities of
simple sentences). Such a learning process is essentially not               LLMs, which may not show the same effect on small lan-
smooth and stable (i.e., language ability does not develop at               guage models.
a constant rate over time), though a child actually grows                        • Alignment tuning. Since LLMs are trained to capture
                                                                            the data characteristics of pre-training corpora (including
   10. Grokking refers that “a pattern in the data, improving generaliza-   both high-quality and low-quality data), they are likely to
tion performance from random chance level to perfect generalization”,
quoted from the original paper [73].
                                                                            generate toxic, biased, or even harmful content for humans.
   11. This explanation is only for ease of understanding, and there is     It is necessary to align LLMs with human values, e.g., helpful,
not direct evidence to connect the two points.                              honest, and harmless. For this purpose, InstructGPT [66]
                                                                                                                                          7

designs an effective tuning approach that enables LLMs to                  models was already explored in the early days of Ope-
follow the expected instructions, which utilizes the tech-                 nAI, while it was attempted with recurrent neural net-
nique of reinforcement learning with human feedback [66, 79].              works (RNN) [121]. With the advent of Transformer, OpenAI
It incorporates human in the training loop with elaborately                developed two initial GPT models, namely GPT-1 [122] and
designed labeling strategies. ChatGPT is indeed developed                  GPT-2 [26], which can be considered as the foundation to
on a similar technique to InstructGPT, which shows a strong                more powerful models subsequently i.e., GPT-3 and GPT-4.
alignment capacity in producing high-quality, harmless re-                      • GPT-1. In 2017, the Transformer model [22] was intro-
sponses, e.g., rejecting to answer insulting questions.                    duced by Google, and the OpenAI team quickly adapted
    • Tools manipulation. In essence, LLMs are trained as text             their language modeling work to this new neural network
generators over massive plain text corpora, thus performing                architecture. They released the first GPT model in 2018,
less well on the tasks that are not best expressed in the                  i.e., GPT-1 [122], and coined the abbreviation term GPT
form of text (e.g., numerical computation). In addition, their             as the model name, standing for Generative Pre-Training.
capacities are also limited to the pre-training data, e.g., the            GPT-1 was developed based on a generative, decoder-only
inability to capture up-to-date information. To tackle these               Transformer architecture, and adopted a hybrid approach of
issues, a recently proposed technique is to employ external                unsupervised pre-training and supervised fine-tuning. GPT-
tools to compensate for the deficiencies of LLMs [80, 81].                 1 has set up the core architecture for the GPT-series models
For example, LLMs can utilize the calculator for accurate                  and established the underlying principle to model natural
computation [80] and employ search engines to retrieve                     language text, i.e., predicting the next word.
unknown information [81]. More recently, ChatGPT has                            • GPT-2. Following a similar architecture of GPT-1,
enabled the mechanism of using external plugins (existing                  GPT-2 [26] increased the parameter scale to 1.5B, which
or newly created apps)12 , which are by analogy with the                   was trained with a large webpage dataset WebText. As
“eyes and ears” of LLMs. Such a mechanism can broadly                      claimed in the paper of GPT-2, it sought to perform
expand the scope of capacities for LLMs.                                   tasks via unsupervised language modeling, without explicit
    In addition, many other factors (e.g., the upgrade of                  fine-tuning using labeled data. To motivate the approach,
hardware) also contribute to the success of LLMs. Currently,               they introduced a probabilistic form for multi-task solving,
we limit our discussion to the major technical approaches                  i.e., p(output|input, task) (similar approaches have been
and key findings for developing LLMs.                                      adopted in [123]), which predicts the output conditioned on
                                                                           the input and task information. To model this conditional
                                                                           probability, language text can be naturally employed as a
2.2   Technical Evolution of GPT-series Models                             unified way to format input, output and task information.
Due to the excellent capacity in communicating with hu-                    In this way, the process of solving a task can be cast as a
mans, ChatGPT has ignited the excitement of the AI com-                    word prediction problem for generating the solution text.
munity since its release. ChatGPT is developed based on the                Further, they introduced a more formal claim for this idea:
powerful GPT model with specially optimized conversation                   “Since the (task-specific) supervised objective is the same
capacities. Considering the ever-growing interest in Chat-                 as the unsupervised (language modeling) objective but only
GPT and GPT models, we add a special discussion about the                  evaluated on a subset of the sequence, the global minimum
technical evolution of the GPT-series models, to briefly sum-              of the unsupervised objective is also the global minimum
marize the progress how they have been developed in the                    of the supervised objective (for various tasks)” [26]15 . A
past years. Meanwhile, we drew a schematic diagram de-                     basic understanding of this claim is that each (NLP) task
picting the technological evolution of the GPT-series models               can be considered as the word prediction problem based
in Figure 4. The basic principle underlying GPT models is                  on a subset of the world text. Thus, unsupervised language
to compress the world knowledge into the decoder-only                      modeling could be capable in solving various tasks, if it was
Transformer model by language modeling, such that it can                   trained to have sufficient capacity in recovering the world
recover (or memorize) the semantics of world knowledge                     text. These early discussion in GPT-2’s paper echoed in the
and serve as a general-purpose task solver. Two key points                 interview of Ilya Sutskever by Jensen Huang: “What the
to the success are (I) training decoder-only Transformer                   neural network learns is some representation of the process
language models that can accurately predict the next word                  that produced the text. This text is actually a projection of
and (II) scaling up the size of language models. Overall, the              the world...the more accurate you are in predicting the next
research of OpenAI on LLMs can be roughly divided into                     word, the higher the fidelity, the more resolution you get in
the following stages13 .                                                   this process...”16 .

Early Explorations. According to one interview with Ilya                   Capacity Leap. Although GPT-2 is intended to be an “un-
Sutskever14 (a co-founder and chief scientist of OpenAI),                  supervised multitask learner”, it overall has an inferior
the idea of approaching intelligent systems with language                  performance compared with supervised fine-tuning state-
                                                                           of-the-art methods. Because it has a relatively small model
   12. https://openai.com/blog/chatgpt-plugins                             size, it has been widely fine-tuned in downstream tasks,
   13. Note that the discussion of this part can be somewhat subjective.   especially the dialog tasks [124, 125]. Based on GPT-2, GPT-3
The overall viewpoints and summaries are made based on the under-
standing of the survey authors by reading the papers, blog articles,
interview reports and APIs released by OpenAI.                              15. To better understand this sentence, we put some explanation
   14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-         words in parentheses.
founder-of-openai                                                           16. https://lifearchitect.ai/ilya/
                                                                                                                              8




TABLE 1: Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the
capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs.
In this table, we only include LLMs with a public paper about the technical details. Here, “Release Time” indicates the
date when the corresponding paper was officially released. “Publicly Available” means that the model checkpoints can be
publicly accessible while “Closed Source” means the opposite. “Adaptation” indicates whether the model has been with
subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback.
“Evaluation” indicates whether the model has been evaluated with corresponding abilities in their original paper: ICL
denotes in-context learning and CoT denotes chain-of-thought. “*” denotes the largest publicly available version.

                               Release   Size   Base    Adaptation Pre-train Latest Data Hardware  Training Evaluation
          Model
                                Time     (B)    Model   IT RLHF Data Scale Timestamp (GPUs / TPUs)   Time ICL CoT
           T5 [82]             Oct-2019   11       -    -    -     1T tokens    Apr-2019     1024 TPU v3        -     ✓   -
           mT5 [83]            Oct-2020   13       -    -    -     1T tokens       -               -            -     ✓   -
           PanGu-α [84]        Apr-2021 13*        -    -    -       1.1TB         -       2048 Ascend 910      -     ✓   -
           CPM-2 [85]          Jun-2021 198        -    -    -       2.6TB         -               -            -     -   -
           T0 [28]             Oct-2021   11      T5    ✓    -          -          -          512 TPU v3      27 h    ✓   -
           CodeGen [86]        Mar-2022  16        -    -    -    577B tokens      -               -            -     ✓   -
           GPT-NeoX-20B [87]   Apr-2022  20        -    -    -       825GB         -         96 40G A100        -     ✓   -
           Tk-Instruct [88]    Apr-2022   11      T5    ✓    -          -          -          256 TPU v3       4h     ✓   -
           UL2 [89]            May-2022   20       -    -    -     1T tokens    Apr-2019      512 TPU v4        -     ✓   ✓
           OPT [90]            May-2022 175        -    -    -    180B tokens      -         992 80G A100       -     ✓   -
           NLLB [91]           Jul-2022 54.5       -    -    -          -          -               -            -     ✓   -
           CodeGeeX [92]       Sep-2022  13        -    -    -    850B tokens      -       1536 Ascend 910   60 d     ✓   -
           GLM [93]            Oct-2022 130        -    -    -    400B tokens      -         768 40G A100     60 d    ✓   -
           Flan-T5 [69]        Oct-2022   11      T5    ✓    -          -          -               -            -     ✓   ✓
           BLOOM [78]          Nov-2022 176        -    -    -    366B tokens      -         384 80G A100    105 d    ✓   -
           mT0 [94]            Nov-2022   13     mT5    ✓    -          -          -               -            -     ✓   -
           Galactica [35]      Nov-2022 120        -    -    -    106B tokens      -               -            -     ✓   ✓
           BLOOMZ [94]         Nov-2022 176     BLOOM   ✓    -          -          -               -            -     ✓   -
 Publicly OPT-IML [95]         Dec-2022 175      OPT    ✓    -          -          -         128 40G A100       -     ✓   ✓
 Available LLaMA [57]          Feb-2023   65       -    -    -    1.4T tokens      -        2048 80G A100     21 d    ✓   -
           Pythia [96]         Apr-2023   12       -    -    -    300B tokens      -         256 40G A100       -     ✓   -
           CodeGen2 [97]       May-2023   16       -    -    -    400B tokens      -               -            -     ✓   -
           StarCoder [98]      May-2023 15.5       -    -    -     1T tokens       -         512 40G A100       -     ✓   ✓
           LLaMA2 [99]         Jul-2023   70       -    ✓    ✓     2T tokens       -        2000 80G A100       -     ✓   -
           Baichuan2 [100]     Sep-2023   13       -    ✓    ✓    2.6T tokens      -           1024 A800        -     ✓   -
           QWEN [101]          Sep-2023  14        -    ✓    ✓     3T tokens       -               -            -     ✓   -
           FLM [102]           Sep-2023 101        -    ✓    -    311B tokens      -           192 A800       22 d    ✓   -
           Skywork [103]       Oct-2023   13       -    -    -    3.2T tokens      -         512 80G A800       -     ✓   -

          GPT-3 [55]            May-2020 175     -      -    -    300B tokens       -              -            -     ✓   -
          GShard [104]          Jun-2020 600     -      -    -     1T tokens        -       2048 TPU v3        4d     -   -
          Codex [105]           Jul-2021   12  GPT-3    -    -    100B tokens   May-2020           -            -     ✓   -
          ERNIE 3.0 [106]       Jul-2021   10    -      -    -    375B tokens       -          384 V100         -     ✓   -
          Jurassic-1 [107]      Aug-2021 178     -      -    -    300B tokens       -          800 GPU          -     ✓   -
          HyperCLOVA [108]      Sep-2021   82    -      -    -    300B tokens       -         1024 A100      13.4 d   ✓   -
          FLAN [67]             Sep-2021 137 LaMDA-PT   ✓    -          -           -        128 TPU v3       60 h    ✓   -
          Yuan 1.0 [109]        Oct-2021 245     -      -    -    180B tokens       -         2128 GPU          -     ✓   -
          Anthropic [110]       Dec-2021   52    -      -    -    400B tokens       -              -            -     ✓   -
          WebGPT [81]           Dec-2021 175   GPT-3    -    ✓          -           -              -            -     ✓   -
          Gopher [64]           Dec-2021 280     -      -    -    300B tokens       -       4096 TPU v3      920 h    ✓   -
          ERNIE 3.0 Titan [111] Dec-2021 260     -      -    -          -           -              -            -     ✓   -
          GLaM [112]            Dec-2021 1200    -      -    -    280B tokens       -       1024 TPU v4      574 h    ✓   -
          LaMDA [68]            Jan-2022 137     -      -    -    768B tokens       -       1024 TPU v3      57.7 d   -   -
          MT-NLG [113]          Jan-2022 530     -      -    -    270B tokens       -      4480 80G A100        -     ✓   -
 Closed
          AlphaCode [114]       Feb-2022   41    -      -    -    967B tokens   Jul-2021           -            -     -   -
 Source
          InstructGPT [66]      Mar-2022 175   GPT-3    ✓    ✓          -           -              -            -     ✓   -
          Chinchilla [34]       Mar-2022   70    -      -    -    1.4T tokens       -              -            -     ✓   -
          PaLM [56]             Apr-2022 540     -      -    -    780B tokens       -       6144 TPU v4         -     ✓   ✓
          AlexaTM [115]         Aug-2022   20    -      -    -    1.3T tokens       -          128 A100      120 d    ✓   ✓
          Sparrow [116]         Sep-2022   70    -      -    ✓          -           -         64 TPU v3         -     ✓   -
          WeLM [117]            Sep-2022   10    -      -    -    300B tokens       -       128 A100 40G      24 d    ✓   -
          U-PaLM [118]          Oct-2022 540   PaLM     -    -          -           -        512 TPU v4       5d      ✓   ✓
          Flan-PaLM [69]        Oct-2022 540   PaLM     ✓    -          -           -        512 TPU v4       37 h    ✓   ✓
          Flan-U-PaLM [69]      Oct-2022 540 U-PaLM     ✓    -          -           -              -            -     ✓   ✓
          GPT-4 [46]            Mar-2023    -    -      ✓    ✓          -           -              -            -     ✓   ✓
          PanGu-Σ [119]         Mar-2023 1085 PanGu-α   -    -    329B tokens       -      512 Ascend 910    100 d    ✓   -
          PaLM2 [120]           May-2023   16    -      ✓    -    100B tokens       -              -            -     ✓   ✓
                                                                                                                                                                                                   9


                            T5                      ChatGPT                                                                                                         Publicly Available
             2019                                                                   Deepseek
                                                                    GPT-4
                                        2022                                        InternLM         Gemini 2.0            DeepSeek-R1              Kimi K2
          GPT-3                                           2023
                                                                                    Mistral          Gemma-2               Kimi K1.5                Step 3
         GShard              OPT
                                                                                    Mixtral          GPT-4o                                         GLM 4.5 / 4.6           DeepSeek-V3.2
                                                LLaMA2                                                                     Seed 1.5 / 1.6
            mT5              GLM
                                                                                                     InternLM2             GPT-o3                   Gemma-3                 Mistral 3
                                          YuLan-Chat
          Codex         Galatica                                                                                                                    Claude 4.5
                                                                          Qwen                       Qwen2                                                                 GPT- 5.2
                                            ChatGLM                                                                        LLaDA
       PanGu-𝛂         BLOOM                                                                                                                        Ring-1T / Ling-1T       MiMo-V2
                                                                                  2024               DeepSeek-V2           GPT-4.5
                        LaMDA                    Falcon                                                                                             MiniMax M2
       Anthropic                                                                                                                                                            Seed 1.8
                                                                                                     LLaMA3                LLaMA4
                      AlphaCode                  MOSS
        WebGPT                                                                                                                                      GPT-5.1                 Nanbeige4
                                                 PaLM2                                               MiniCPM                  Qwen3
                        Flan-T5                                      Claude 3.5
        Ernie 3.0                                                                                                                                   Gemini 3.0             GLM 4.7
                                                PanGu-Σ                Qwen2.5                       Gemma                 Gemini 2.5
         Gopher                                                                                                                                     Grok 4.1               MiniMax M2.1
                                                   Bard          DeepSeek-V3                       2025
     InstructGPT
                                                LLaMA               YuLan-Mini
                                                                                                                     1-6
                                                              Mixtral Large 2                                                                7-11                           12          2026


Fig. 3: A timeline of representative LLMs released in recent years. Models with publicly available checkpoints are
highlighted in yellow.


                                                                                                                                                              ChatGPT

        GPT-1                              GPT-2                                  GPT-3         +code         Codex                     GPT-3.5                                  GPT-4
        2018.06                             2019.02                               2020.05                     2021.07                       2022.03                              2023.03
decoder-only architecture        unsupervised multitask learner           in-context learning             code pre-training                                             strong reasoning ability
 generative pre-training             scaling the model size             exploring scaling limits
                                                                                                                                                                            GPT-4 Turbo
                                                                                                                                                                                 2023.09
                                                                                                                                                                        longer context window
      code-davinci-002           +instruction     text-davinci-002          +RLHF        text-davinci-003         +chat       gpt-3.5-turbo
            2022.03                                       2022.03                              2022.09                           2023.03                            GPT-4 Turbo with vision
                                                                                                                                                                                 2023.09
      capable code model                         instruction following                   human alignment            excellent comprehensive ability                      multimodal ability


Fig. 4: A brief illustration for the technical evolution of GPT-series models. We plot this figure mainly based on the papers,
blog articles and official APIs from OpenAI. Here, solid lines denote that there exists an explicit evidence (e.g., the official
statement that a new model is developed based on a base model) on the evolution path between two models, while dashed
lines denote a relatively weaker evolution relation.


demonstrates a key capacity leap by scaling of the (nearly                                           Figure 1.2 of the GPT-3’s paper [55]). Overall, GPT-3 can be
same) generative pre-training architecture.                                                          viewed as a remarkable landmark in the journey evolving
    • GPT-3. GPT-3 [55] was released in 2020, which scaled                                           from PLMs to LLMs. It has empirically proved that scaling
the model parameters to an ever larger size of 175B. In                                              the neural networks to a significant size can lead to a huge
the GPT-3’s paper, it formally introduced the concept of                                             increase in model capacity.
in-context learning (ICL)17 , which utilizes LLMs in a few-
shot or zero-shot way. ICL can teach (or instruct) LLMs to                                           Capacity Enhancement. Due to the strong capacities, GPT-
understand the tasks in the form of natural language text.                                           3 has been the base model to develop even more capable
With ICL, the pre-training and utilization of LLMs converge                                          LLMs for OpenAI. Overall, OpenAI has explored two major
to the same language modeling paradigm: pre-training pre-                                            approaches to further improving the GPT-3 model, i.e., train-
dicts the following text sequence conditioned on the context,                                        ing on code data and alignment with human preference,
while ICL predicts the correct task solution, which can be                                           which are detailed as follows.
also formatted as a text sequence, given the task description
                                                                                                        • Training on code data. A major limitation of the original
and demonstrations. GPT-3 not only demonstrates very ex-
                                                                                                     GPT-3 model (pre-trained on plain text) lies in the lack of
cellent performance in a variety of NLP tasks, but also on a
                                                                                                     the reasoning ability on complex tasks, e.g., completing the
number of specially designed tasks that require the abilities
                                                                                                     code and solving math problems. To enhance this ability,
of reasoning or domain adaptation. Although the GPT-3’s
                                                                                                     Codex [105] was introduced by OpenAI in July 2021, which
paper does not explicitly discuss the emergent abilities of
                                                                                                     was a GPT model fine-tuned on a large corpus of GitHub
LLMs, we can observe large performance leap that might
                                                                                                     code. It demonstrated that Codex can solve very difficult
transcend the basic scaling law [30], e.g., larger models have
                                                                                                     programming problems, and also lead to a significant per-
significantly stronger ICL ability (illustrated in the original
                                                                                                     formance improvement in solving math problems [126].
  17. GPT-2 essentially used ICL for unsupervised task learning,                                     Further, a contrastive approach [127] to training text and
though it wasn’t called ICL at that time.                                                            code embedding was reported in January 2022, which was
                                                                                                                               10

shown to improve a series of related tasks (i.e., linear-          InstructGPT in the data collection setup: human-generated
probe classification, text search and code search). Actually,      conversations (playing both the roles of user and AI) are
the GPT-3.5 models are developed based on a code-based             combined with the InstructGPT dataset in a dialogue format
GPT model (i.e., code-davinci-002), which indicates that           for training ChatGPT. ChatGPT exhibited superior capaci-
training on code data is a very useful practice to improve         ties in communicating with humans: possessing a vast store
the model capacity of GPT models, especially the reasoning         of knowledge, skill at reasoning on mathematical problems,
ability. Furthermore, there is also a speculation that train-      tracing the context accurately in multi-turn dialogues, and
ing on code data can greatly increase the chain-of-thought         aligning well with human values for safe use. Later on, the
prompting abilities of LLMs [47], while it is still worth          plugin mechanism has been supported in ChatGPT, which
further investigation with more thorough verification.             further extends the capacities of ChatGPT with existing tools
    • Human alignment. The related research of human               or apps. So far, it seems to be the ever most powerful chatbot
alignment can be dated back to the year 2017 (or earlier)          in the AI history. The launch of ChatGPT has a significant
for OpenAI: a blog article entitled “learning from human           impact on the AI research in the future, which sheds light
preferences”18 was posted on the OpenAI blog describing            on the exploration of human-like AI systems.
a work that applied reinforcement learning (RL) to learn               • GPT-4. As another remarkable progress, GPT-4 [46]
from the preference comparisons annotated by humans [79]           was released in March 2023, which extended the text input
(similar to the reward training step in the aligning algorithm     to multimodal signals. Overall, GPT-4 has stronger capac-
of InstructGPT in Figure 12). Shortly after the release of this    ities in solving complex tasks than GPT-3.5, showing a
RL paper [79], the paper of the Proximal Policy Optimiza-          large performance improvement on many evaluation tasks.
tion (PPO) [128] was published in July 2017, which now has         A recent study [41] investigated the capacities of GPT-
been the foundational RL algorithm for learning from hu-           4 by conducting qualitative tests with human-generated
man preferences [66]. Later in January 2020, GPT-2 was fine-       problems, spanning a diverse range of difficult tasks, and
tuned using the aforementioned RL algorithms [79, 128],            showed that GPT-4 can achieve more superior performance
which leveraged human preferences to improve the capac-            than prior GPT models. Furthermore, GPT-4 responds more
ities of GPT-2 on NLP tasks. In the same year, another             safely to malicious or provocative queries, due to a six-
work [129] trained a summarization model for optimizing            month iterative alignment (with an additional safety re-
human preferences in a similar way. Based on these prior           ward signal in the RLHF training). In the technical report,
work, InstructGPT [66] was proposed in January 2022 to             OpenAI has emphasized how to safely develop GPT-4 and
improve the GPT-3 model for human alignment, which                 applied a number of intervention strategies to mitigate the
formally established a three-stage reinforcement learning from     possible issues of LLMs, such as hallucinations, privacy
human feedback (RLHF) algorithm. Note that it seems that           and overreliance. For example, they introduced the mech-
the wording of “instruction tuning” has seldom been used in        anism called red teaming [132] to reduce the harm or toxic
OpenAI’s paper and documentation, which is substituted by          content generation. As another important aspect, GPT-4
supervised fine-tuning on human demonstrations (i.e., the first    has been developed on a well-established deep learning
step of the RLHF algorithm [66]). In addition to improving         infrastructure with improved optimization methods. They
the instruction following capacity, the RLHF algorithm is          introduced a new mechanism called predictable scaling that
particularly useful to mitigate the issues of generating harm      can accurately predict the final performance with a small
or toxic content for LLMs, which is key to the safe deploy-        proportion of compute during model training.
ment of LLMs in practice. OpenAI describes their approach              • GPT-4V, GPT-4 turbo, and beyond. Based on the work
to alignment research in a technical article [130], which          done for GPT-4 [46], OpenAI further released GPT-4V in
has summarized three promising directions: “training AI            September 2023, which focused on the safe deployment of
systems to use human feedback, to assist human evaluation          the vision capabilities of GPT-4. In the GPT-4V’s system
and to do alignment research”.                                     card [133], it has extensively discussed the assessment and
    These enhancement techniques lead to the improved              mitigation of risks related to visually augmented inputs.
GPT-3 models with stronger capacities, which are called            Specially, GPT-4V exhibited strong vision capacities in var-
GPT-3.5 models by OpenAI (see the discussion about the             ious application scenarios, showing the great potential as
OpenAI API in Section 3.1).                                        a powerful multimodal learning system. More recently, in
                                                                   November 2023, OpenAI released an upgraded generation
The Milestones of Language Models. Based on all the ex-            of GPT-4 model at DevDay, named GPT-4 Turbo, with a
ploration efforts, two major milestones have been achieved         series of technical improvements. GPT-4 Turbo is featured
by OpenAI, namely ChatGPT [131] and GPT-4 [46], which              by the improved model capacity (more capable than GPT-
have largely raised the capacity bar of existing AI systems.       4), the extended knowledge source (up to April 2023),
   • ChatGPT. In November 2022, OpenAI released the                long context window (up to 128k tokens), optimized model
conversation model ChatGPT, based on the GPT models                performance (cheaper price), and other useful functional-
(GPT-3.5 and GPT-4). As the official blog article intro-           ity updates (function call, reproducible outputs, etc.). At
duced [131], ChatGPT was trained in a similar way as               the same time, Assistants API was launched to ease the
InstructGPT (called “a sibling model to InstructGPT” in the        rapid development of agent-like assistants. With this API,
original post), while specially optimized for dialogue. They       developers can easily create goal-oriented assistants within
reported a difference between the training of ChatGPT and          their applications, by leveraging specific instruction, extra
                                                                   knowledge and tool use. Furthermore, multimodal capaci-
 18. https://openai.com/research/learning-from-human-preferences   ties (see, hear, and speak) were also enhanced in this new
                                                                                                                               11

release, supported by GPT-4 Turbo with vision, DALL·E 3,           closed-source LLMs, such as GPT-4, GPT-4o, and Claude
Text-to-speech (TTS), and Listen to voice samples. These           3.5 Sonnet in various benchmarks (e.g., MMLU, GSM8k,
improvements have greatly extended the capacity scope and          and HumanEval). The pre-training of LLaMA (65B version)
enhanced the task performance of GPT models. More impor-           involves 2,048 A100-80G GPUs, whereas LLaMA-3.1 (405B
tantly, the application ecosystem will be greatly strength-        version) involves more than 16,000 H100 GPUs.
ened with the technology upgrade in improved models,                   • Mistral. The Mistral series [137, 138] consist of Mis-
APIs, and functionalities.                                         tral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and
    Despite the huge progress, there are still limitations with    Mixtral (8×7B and 8×22B), which have been widely known
these superior LLMs, e.g., generating hallucinations with          for their strong performance on various mainstream bench-
factual errors or potentially risky response within some           marks (e.g., MMLU and GSM8k). Mistral NeMo is featured
specific context [46]. More limitations or issues of LLMs will     with a long context window of 128K at the parameter scale
be discussed in Section 7. It poses long-standing research         of 12B. Although Mistral NeMo is trained with quantization
challenges to develop more capable, safer LLMs. From               awareness, it enables FP8 inference without sacrificing per-
the perspective of engineering, OpenAI has adopted an              formance. Mistral Large 2 is the largest and most powerful
iterative deployment strategy [134] to develop the models          model of the Mistral series, which supports 11 natural
and products by following a five-stage development and             languages and more than 80 programming languages. Mix-
deployment life-cycle, which aims to effectively reduce the        tral is a kind of sparse Mixture-of-Experts (SMoE) model
potential risks of using the models. In the following, we          that activates only part of the parameters during inference,
will dive into the technical details in order to have a specific   making it more efficient compared to dense models of the
understanding of how they have been developed.                     same size.
                                                                       • Gemma. Gemma [139, 140] is a series of lightweight,
                                                                   strong, and open models, consisting of Gemma-1 (2B and
3     R ESOURCES OF LLM S                                          7B) and Gemma-2 (2B, 9B, and 27B). During the pre-training
It is by no means an easy job to develop or reproduce LLMs,        stage, Gemma-2 2B, 9B, and 27B versions are trained on
considering the challenging technical issues and huge de-          2T, 8T, and 13T primarily English tokens, respectively. The
mands of computation resources. A feasible way is to learn         largest version of Gemma-2 is trained on 6144 TPUv5p
experiences from existing LLMs and reuse publicly avail-           chips. Gemma-2 has achieved excellent performance in mul-
able resources for incremental development or experimental         tiple benchmarks (e.g., ARC-c, MMLU, and GSM8k).
study. In this section, we briefly summarize the publicly              • Qwen. Qwen [141, 142] is an open-source large
available resources for developing LLMs, including model           model series consisting of Qwen (raging from 7B to 72B),
checkpoints (or APIs), corpora and libraries.                      Qwen1.5 (raging from 0.5B to 110B), Qwen2 (ranging from
                                                                   0.5B to 72B), and Qwen2.5 (ranging from 0.5B to 72B).
3.1   Publicly Available Model Checkpoints or APIs                 Qwen2.5 is the newest LLM collection of Qwen, which
                                                                   is pre-trained on up to 18T tokens. Compared to Qwen2,
Given the huge cost of model pre-training, well-trained
                                                                   Qwen2.5 demonstrates a significant increase in knowledge
model checkpoints are critical to the study and development
                                                                   retention, as well as notable advancements in coding and
of LLMs for the research community. Due to space limita-
                                                                   mathematical abilities. Qwen2.5 has also shown large im-
tion, we can only selectively discuss several representative
                                                                   provements in instruction following, long texts generation
LLMs. In addition, for inference, we can directly employ
                                                                   (over 8K tokens), structured data understanding and gener-
public APIs to perform our tasks, without running the
                                                                   ation (e.g., JSON).
model locally. Next, we introduce the publicly available
model checkpoints and APIs.
                                                                       • GLM. GLM [143] is a series of LLMs featuring compre-
                                                                   hensive capabilities in both English and Chinese. GLM has
Publicly Available Model Checkpoints. To assist re-                been upgraded to its fourth-generation model, GLM-4, with
searchers in selecting a suitable model based on the resource      a parameter scale of up to 9B, possesses excellent conver-
budget and usage needs, we focus on discussing the model’s         sational abilities. It has achieved excellent performance in
parameter size, data and computational resources required          evaluations from multiple perspectives including semantics,
for training, the relevant technologies employed by the            mathematics, reasoning, code, and knowledge. In addition
model, and its performance evaluation in downstream tasks.         to the base model GLM-4-9B, it has open-sourced human
For more details of LLMs, see Table 1.                             preference-aligned model GLM-4-9B-Chat, and long context
   • LLaMA. The LLaMA series of models has gained im-              conversational model GLM-4-9B-Chat-1M.
mense popularity and widespread attention due to its open-             • Baichuan. Baichuan is a series of open-source bilingual
ness and effectiveness. From LLaMA [57], LLaMA-2 [99],             LLMs and the latest version is Baichuan-2. Both Baichuan
LLaMA-3 [135] to LLaMA-3.1 [136], continuous updates               and Baichuan-2 have two available parameter sizes (7B
have been made and the development is still ongoing. With          and 13B). Baichuan supports both Chinese and English,
increased parameters (the largest version has 405B), more          with pre-training data reaching 1.2 trillion tokens. Further-
pre-training tokens (15T tokens), and an extended context          more, Baichuan-2 expands its pre-training data to 2.6 trillion
window (128K), LLaMA-3.1 has significantly enhanced its            tokens. Baichuan-2 surpasses Baichuan in all evaluation
capabilities, and it also integrates additional components         benchmarks, demonstrating excellent multilingual capabil-
that work in synergy with the model, including new se-             ities and showing potential for vertical applications in the
curity and safety tools. In evaluation, LLaMa-3.1 (405B ver-       domains such as law and healthcare (e.g., JEC-QA [144] and
sion) achieves competitive performance against prominent           MedQA [145]).
                                                                                                                                                                                         12


        Continue pre-training                                                                     LLaMA                                                Parameter-efficient fine-tuning
        Model inheritance       Instruction
        Data inheritance          tuning                                                                                                                 Full parameter fine-tuning
                                                                      + chinese data                              + chat data




                                                           Chinese
 Open-Chinese-LLaMA                                                                               + synthetic data
                                                           Vicuna
                                                                                                                                                         Vicuna
                                           Panda                                                                             + task data
                                                                                                     Alpaca
 Linly-Chinese-LLaMA
                                Chinese                                                                                                                                   Yulan-Chat
                                                                                                                      RLHF
                                LLaMA                                                  Alpaca                                              Goat
          + chat data                                                                   Lora                       PKU-Beaver
                                                                       BiLLa
     Cornucopia
                                                                                                                                                      + synthetic data
                                                                         + chat data
                            + Alpaca data
              Lawyer
              LLaMA                                                                                                             OpenFlamingo        LLaVA
                                                                           BELLE                                                                                         MiniGPT-4
              + chat data
                                                                                           Ziya               + task data
         QiZhenGPT                                            Baize
                                          Chinese                                                                                                        + task data
                                          Alpaca                               + task data                      Guanaco
                  + task data
                                                                                   Koala            + task data             VisionLLM          InstructBLIP              Chatbridge
                   TaoLi

                                                                                                     LLaMA
                                                    ChatMed
                                                                                                     Adapter                                                  PandaGPT
             BenTsao             LAWGPT                                                                                         Multimodal models




                                                    Math          Finance            Medicine           Law          Bilingualism       Education



Fig. 5: An evolutionary graph of the research work conducted on LLaMA. Due to the huge number, we cannot include all
the LLaMA variants in this figure, even much excellent work. To support incremental update, we share the source file of
this figure, and welcome the readers to include the desired models by submitting the pull requests on our GitHub page.


LLaMA Model Family. The collection of LLaMA mod-                                                  Due to the excellent performance and availability of the
els [57] were introduced by Meta AI in February, 2023,                                            LLaMA model family, many multimodal models incorpo-
consisting of four sizes (7B, 13B, 30B and 65B). Since                                            rate them as the base language models, to achieve strong
released, LLaMA has attracted extensive attention from                                            language understanding and generation abilities. Compared
both research and industry communities. LLaMA mod-                                                with other variants, Vicuna is more preferred in multimodal
els have achieved very excellent performance on various                                           language models, which have led to the emergence of a va-
open benchmarks, which have become the most popu-                                                 riety of popular models, including LLaVA [154], MiniGPT-
lar open language models thus far. A large number of                                              4 [155], InstructBLIP [156], and PandaGPT [157]. The re-
researchers have extended LLaMA models by either in-                                              lease of LLaMA has greatly advanced the research progress
struction tuning or continual pre-training. In particular,                                        of LLMs. To summarize the research work conducted on
instruction tuning LLaMA has become a major approach                                              LLaMA, we present a brief evolutionary graph in Figure 5.
to developing customized or specialized models, due to
the relatively low computational costs. To effectively adapt                                      Public API of LLMs. Instead of directly using the model
LLaMA models in non-English languages, it often needs to                                          copies, APIs provide a more convenient way for common
extend the original vocabulary (trained mainly on English                                         users to use LLMs, without the need of running the model
corpus) or fine-tune it with instructions or data in the                                          locally. As a representative interface for using LLMs, the
target language. Among these extended models, Stanford                                            APIs for the GPT-series models [46, 55, 66, 105] have
Alpaca [146] is the first open instruct-following model                                           been widely used for both academia and industry19 .
fine-tuned based on LLaMA (7B). It is trained by 52K                                              OpenAI has provided seven major interfaces to the models
instruction-following demonstrations generated via self-                                          in GPT-3 series: ada, babbage, curie, davinci (the
instruct [147] using text-davinci-003. The instruction                                            most powerful version in GPT-3 series), text-ada-001,
data, named Alpaca-52K, and training code have been ex-                                           text-babbage-001, and text-curie-001. Among
tensively adopted in subsequent work, such as Alpaca-                                             them, the first four interfaces can be further fine-
LoRA [148] (a reproduction of Stanford Alpaca using                                               tuned on the host server of OpenAI. In particular,
LoRA [149]), Koala [150], and BELLE [151]. In addition, Vi-                                       babbage, curie, and davinci correspond to the
cuna [152] is another popular LLaMA variant, trained upon                                         GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models,
user-shared conversations collected from ShareGPT [153].
                                                                                                    19. https://platform.openai.com/docs/api-reference/introduction
                                                                                                                               13

respectively [55]. In addition, there are also two APIs           Facebook and has been used in processing datasets like
related to Codex [105], called code-cushman-001 (a                RedPajama-Data [170].
powerful and multilingual version of the Codex (12B) [105])            • C4. The Colossal Clean Crawled Corpus (C4) includes
and code-davinci-002. Further, GPT-3.5 series                     five variants21 , namely en (806G), en.noclean (6T), real-
include one base model code-davinci-002 and                       newslike (36G), webtextlike (17G), and multilingual (38T).
three enhanced versions, namely text-davinci-002,                 The en version has been utilized for pre-training T5 [82],
text-davinci-003, and gpt-3.5-turbo. As more                      LaMDA [68], Gopher [64], and UL2 [89]. The multilingual
powerful alternatives, in this year, OpenAI has released          C4, also called mC4, has been used in mT5 [83].
the model interfaces for GPT-4 series, including gpt-4,                • RedPajama-Data. RedPajama-Data [170] is a publicly
gpt-4-32k, gpt-4-1106-preview (i.e., GPT-4 Turbo)                 available comprehensive web dataset, comprising 100 bil-
and gpt-4-vision-preview (i.e., GPT-4 Turbo with                  lion documents from Common Crawl. It has been cleaned,
vision, a multimodal model). It is worth noting that OpenAI       filtered, and deduplicated using the CCNet tool, resulting in
has been maintaining and upgrading these model interfaces         approximately 30T tokens, which is available for download
(gpt-3.5-turbo, gpt-4, gpt-4-32k), so the API name                on Hugging Face. RedPajama-Data is a multilingual dataset
will actually point to the latest version. Currently, ChatGPT     that includes five languages: English, French, Spanish, Ger-
can be powered by either GPT-3.5 or GPT-4 models. Overall,        man, and Italian. Additionally, it offers over 40 quality
one select the suitable model interface based on the specific     labels, making it feasible to filter or reweight the dataset
application scenarios and response requirements. The              according to specific criteria. The dataset is continuously
detailed usage can be found on their project websites20 .         updated and maintained, with all data processing scripts
                                                                  open-sourced on GitHub for convenient use.
      TABLE 2: Statistics of commonly-used data sources.               • RefinedWeb. RefinedWeb [171] is a web dataset obtained
                                                                  through rigorous selection and deduplication based on data
 Corpora            Size       Source     Latest Update Time
                                                                  from Common Crawl, encompassing all Common Crawl
 BookCorpus [158] 5GB        Books             Dec-2015           web records from 2008 to June 2023, totaling around 5T
 Gutenberg [159]    -        Books             Dec-2021
 C4 [82]            800GB CommonCrawl          Apr-2019
                                                                  tokens. The open-source portion consists of 600B tokens,
 CC-Stories-R [160] 31GB CommonCrawl           Sep-2019           with a data size of approximately 500GB. After decompres-
 CC-NEWS [27]       78GB CommonCrawl           Feb-2019           sion, it requires 2.8TB of local storage space and is available
 REALNEWs [161] 120GB CommonCrawl              Apr-2019           for download on Hugging Face. This dataset serves as the
 OpenWebText [162] 38GB    Reddit links        Mar-2023
 Pushift.io [163]   2TB    Reddit links        Mar-2023           primary training dataset for the open-source large language
 Wikipedia [164]    21GB    Wikipedia          Mar-2023           model Falcon.
 BigQuery [165]     -        Codes             Mar-2023                • WebText. WebText [26] is a well-known corpus com-
 the Pile [166]     800GB    Other             Dec-2020
 ROOTS [167]        1.6TB    Other             Jun-2022           posed of highly upvoted links from Reddit, a social media
                                                                  platform that enables users to submit links and text posts,
                                                                  but it is not publicly available. As a surrogate, there is a
                                                                  readily accessible open-source alternative called OpenWeb-
3.2    Commonly Used Corpora for Pre-training                     Text [162].
In contrast to earlier PLMs, LLMs which consist of a signifi-     Books & Academic Data. Books and academic data contains
cantly larger number of parameters require a higher volume        a wealth of world knowledge and linguistic information,
of training data that covers a broad range of content. For        serving as a high-quality corpus for model learning.
this need, there are increasingly more accessible training            • Book Data. BookCorpus [158] is a commonly used
datasets that have been released for research. In this section,   dataset in previous small-scale models (e.g., GPT [122] and
we will briefly summarize several widely used corpora for         GPT-2 [26]), consisting of over 11,000 books covering a wide
training LLMs. Based on their content types, we categorize        range of topics and genres (e.g., novels and biographies).
these corpora into five groups: web pages, books, Wikipedia,      Another large-scale book corpus is Project Gutenberg [159],
code, and others.                                                 consisting of over 70,000 literary books including novels,
Web pages. Web pages are a primary data source for train-         essays, poetry, drama, history, science, philosophy, and
ing language models.                                              other types of works in the public domain. It is currently
    • CommonCrawl. CommonCrawl [168] is one of the                one of the largest open-source book collections, which is
largest open-source web crawling databases, containing a          used in training of MT-NLG [113] and LLaMA [57]. As for
petabyte-scale data volume, which has been widely used            Books1 [55] and Books2 [55] used in GPT-3 [55], they are
as training data for existing LLMs. As the whole dataset is       much larger than BookCorpus but have not been publicly
very large, existing studies mainly extract subsets of web        released so far.
pages from it within a specific period or specific needs              • Academic Data. In addition to book data, scientific
(e.g., extracting mathematical texts). However, due to the        publication data such as paper is also important for model
widespread existence of noisy and low-quality information         pre-training. arXiv Dataset [172] is a corpus of 1.7 mil-
in web page data, it is necessary to perform data preprocess-     lion academic papers, covering a wide range of papers in
ing before usage. One commonly used toolkit for cleaning          the fields of physics, mathematics, and computer science.
CommonCrawl is CC-Net [169], which is developed by                S2ORC [173] is a corpora that consists of 136M academic

  20. https://platform.openai.com/docs/models/overview             21. https://www.tensorflow.org/datasets/catalog/c4
                                                                                                                                   14

papers collected by Semantic Scholar. It also releases a         pre-training data.
derivative dataset peS2o [174], which contains about 42B
tokens.                                                          TABLE 3: A detailed list of available collections for instruc-
                                                                 tion tuning.
Wikipedia. Wikipedia [164] is an online encyclopedia con-
taining a large volume of high-quality articles on diverse               Categories Collections              Time     #Examples
topics. Most of these articles are composed in an expository                         Nat. Inst. [179]      Apr-2021     193K
style of writing (with supporting references), covering a                            FLAN [67]             Sep-2021     4.4M
wide range of languages and fields. Typically, the English-                          P3 [180]              Oct-2021     12.1M
                                                                            Task     Super Nat. Inst. [88] Apr-2022       5M
only filtered versions of Wikipedia are widely used in most                          MVPCorpus [181]       Jun-2022      41M
LLMs (e.g., GPT-3 [55], LaMDA [68], and LLaMA [57]).                                 xP3 [94]              Nov-2022      81M
Wikipedia is available in multiple languages, so it can be                           OIG[182]              Mar-2023      43M
used in multilingual settings.                                                       HH-RLHF [183]       Apr-2022       160K
                                                                                     HC3 [184]           Jan-2023        87K
Code. To collect code data, existing work mainly crawls                     Chat     ShareGPT [153]      Mar-2023        90K
open-source licensed codes from the Internet. Two major                              Dolly [185]         Apr-2023       15K
                                                                                     OpenAssistant [186] Apr-2023       161K
sources are public code repositories under open-source li-
censes (e.g., GitHub) and code-related question-answering                           Self-Instruct [147]    Dec-2022      82K
                                                                                    Alpaca [187]           Mar-2023      52K
platforms (e.g., StackOverflow). Google has publicly re-
                                                                          Synthetic Guanaco [188]          Mar-2023     535K
leased the BigQuery dataset [165], which includes a sub-                            Baize [189]            Apr-2023     158K
stantial number of open-source licensed code snippets in                            BELLE [190]            Apr-2023     1.5M
various programming languages, serving as a representa-
tive code dataset. CodeGen has utilized BIGQUERY [86], a
subset of the BigQuery dataset, for training the multilingual       TABLE 4: A list of available collections for alignment.
version of CodeGen (CodeGen-Multi). In addition, Hugging
Face has collected and released a code dataset named The               Dataset                            Release Time #Examples
Stack [175], covering more than 30 programming languages.              Summarize from Feedback [129]       Sep-2020       193K
The Stack is continuously updated, and the v1.2 version                SHP [191]                           Oct-2021       385K
                                                                       WebGPT Comparisons [81]             Dec-2021        19K
has expanded to 358 programming languages. Based on                    Stack Exchange Preferences [192]    Dec-2021       10M
this dataset, BigCode further processed it and released                HH-RLHF [183]                       Apr-2022       169K
StarCoder [98], which is also the pre-training data of the             Sandbox Alignment Data [193]        May-2023       169K
                                                                       CValues [194]                       Jul-2023       145K
model StarCoder.                                                       PKU-SafeRLHF [195]                  Oct-2023       330K
Mixed Data. In addition to the aforementioned specific
types of datasets, different types of data have been com-
bined to facilitate usage by researchers. The Pile [166] is a    3.3    Commonly Used Datasets for Fine-tuning
large-scale, diverse, and open-source text dataset consisting    After pre-training, it requires further fine-tuning LLMs to
of over 800GB of data from multiple sources, including           enhance the model capacity, which often involve two major
books, websites, codes, scientific papers, and social media      steps, namely instruction tuning (supervised fine-tuning)
platforms. It is constructed from 22 diverse high-quality        and alignment tuning. In this section, we mainly focus on
subsets. The Pile dataset is widely used in models with          discussing the related available datasets for the two kinds of
different parameter scales, such as GPT-J (6B) [176], Code-      tuning approaches, and more algorithm details can be found
Gen (16B) [86], and Megatron-Turing NLG (530B) [113].            in Section 5.
ROOTS [167] is composed of various smaller datasets (to-
tally 1.61 TB of text) and covers 59 different languages (con-   3.3.1 Instruction Tuning Datasets
taining natural languages and programming languages),
                                                                 After pre-training, instruction tuning (a.k.a., supervised fine-
which have been used for training BLOOM [78]. Another
                                                                 tuning) is an important method to enhance or unlock spe-
mixture dataset is Dolma [177], which includes web text
                                                                 cific abilities of LLMs (e.g., instruction following). In this
from Common Crawl, academic papers from Semantic
                                                                 part, we introduce several widely used datasets for in-
Scholar, GitHub code, books, social media from Reddit,
                                                                 struction tuning, and categorize them into three main types
and Wikipedia data. Dolma consisting of 3T tokens of ap-
                                                                 based on the construction method of formatted instruction
proximately 200TB of raw text and has been used to train
                                                                 instances, namely NLP task datasets, daily chat datasets and
OLMo [178].
                                                                 synthetic datasets. We show their details in Table 3.
    In practice, it commonly requires a mixture of different
data sources for pre-training LLMs (see Figure 6), instead       NLP Task Datasets. This kind of datasets are formatted
of a single corpus. Therefore, existing studies commonly         based on collected NLP task datasets (e.g., text classifica-
mix several ready-made datasets (e.g., C4, OpenWebText,          tion and summarization) with corresponding natural lan-
and the Pile), and then perform further processing to obtain     guage task descriptions. In this category, P3 [196] and
the pre-training corpus. Furthermore, to train the LLMs that     FLAN [67, 197] are two widely used datasets for instruction
are adaptive to specific applications, it is also important      tuning.
to extract data from relevant sources (e.g., Wikipedia and           • P3 [196] is composed of 170 English NLP datasets and
BigQuery) for enriching the corresponding information in         2,052 English prompt templates, where the input and output
                                                                                                                                  15

of each data example have been formatted with specific              and outputs. Moreover, 60% of the examples are pure in-
prompt templates for composing the training instance.               structions without the input part in the final dataset.
    • FLAN [67] consists of 62 widely used NLP benchmarks               • Baize [189] is an English multi-turn conversation corpus
in its original version. Recently, FLAN-v2 [197] is also pro-       constructed using ChatGPT, comprising 111.5K instances. To
posed, which expands FLAN by mixing additional instruc-             create Baize, a method called “self-chat” [189] is purposed,
tion datasets, including Muffin [67], NIV2 [88], T0-SF [28],        where ChatGPT takes on the roles of both the user and the
and CoT [198–200]. Muffin contains 62 tasks from the orig-          AI assistant in turns, generating information in a conversa-
inal FLAN and additional 26 tasks, including conversation           tional format.
and code synthesis tasks. T0-SF is extracted from T0 [28]
while ensuring no overlap with Muffin. NIV2 refers to the           3.3.2   Alignment Datasets
Natural-Instructions v2 dataset [88], and CoT [198–200] is
a combination of nine reasoning tasks with corresponding            Apart from instruction tuning, it is important to construct
chain-of-thought prompts and outputs.                               high-quality datasets for aligning LLMs with human values
                                                                    and preferences (e.g., helpfulness, honesty, and harmless-
Daily Chat Datasets. This kind of datasets are constructed          ness). In this section, we introduce several widely used
based on real user conversations where queries are posed            datasets for alignment tuning, including HH-RLHF [183],
by humans and responses are mainly generated by hu-                 SHP [191], PKU-SafeRLHF [195], Stack Exchange Prefer-
man labelers or LLMs (e.g., ChatGPT, GPT-4). The con-               ences [192] and Sandbox Alignment Data [193]. We show
versation types include open-ended generation, question             their details in Table 4.
answering, brainstorming, and chatting. In this category,                • HH-RLHF [183] consists of around 169K instances, and
ShareGPT [153], OpenAssistant [186] and Dolly [185] are             can be divided into two parts that focus on the helpfulness
three commonly used datasets for LLM fine-tuning.                   and harmlessness of LLMs, respectively. Each instance is
    • ShareGPT [153] is collected from a data collection            an open-ended conversation between a crowdworker and
platform where users can upload their conversations with            a chat model, about seeking assistance, advice, or task
ChatGPT or GPT-4 through the ShareGPT API. Currently,               completion. The chat model provides two responses to each
this dataset consists of approximately 90,000 conversations,        user query, and the more helpful or harmful responses will
including real instructions or inquiries from human and             be chosen as the annotations.
responses from ChatGPT.                                                  • SHP [191] focuses on the helpfulness of responses.
    • OpenAssistant [186] is a multilingual corpus containing       It comprises 385K collective human preferences over re-
66,497 real-world conversation trees between human and AI           sponses to questions/instructions across 18 diverse subject
assistant. Each conversation tree consists of multiple nodes,       areas, spanning topics from cooking to legal advice. Each
and each node represents the information generated by a             instance is a Reddit post containing a question or instruction
role in the dialogue. It spans 35 languages and includes            and a pair of top-level comments, one of which is deemed
461,292 manually annotated quality ratings of responses.            as more preferable by Reddit users and the other one is
    • Dolly [185] is an English dataset comprising 15,000           deemed as less helpful. Different from HH-RLHF [183], the
human-generated data instances (prompt-response pairs)              data in SHP consists of naturally occurring and human-
from Databricks. This dataset covers seven domains out-             written responses.
lined in the InstructGPT [66], including brainstorming, clas-            • PKU-SafeRLHF [195] encompasses more than 330K
sification, closed-book quality assurance, generation, infor-       instances of expert comparison data, concentrating on the
mation extraction, open-book quality assurance, and sum-            helpfulness and harmlessness. Each instance in the dataset
marization.                                                         includes a question and two responses, accompanied by
                                                                    safety labels for each response and two preference anno-
Synthetic Datasets. This kind of datasets are typically             tations between the two responses according to helpfulness
constructed by instructing LLMs, based on pre-defined               and harmlessness. The harmlessness of a response indicates
guidance rules or methods. In this category, Self-Instruct-         its classification as risk-neutral across all 14 harm categories,
52K [147], Alpaca [146] and Baize [189] are three commonly          while the helpfulness of a response is evaluated based on its
used synthetic datasets for LLMs.                                   effectiveness in addressing the question.
    • Self-Instruct-52K [147] is an instruction dataset gener-           • Stack Exchange Preferences [192] focuses on the help-
ated through the self-instruct [147] method, consisting of          fulness of answers. It comprises about 10M questions and
82,000 instances with 52,000 instructions. Concretely, the          answers from Stack Overflow. Each instance consists of a
authors construct 175 seed instances, and then iteratively          question and more than two corresponding answers. Each
prompt the LLM [55] to synthesize additional instructions           answer is annotated with a score calculated based on its
based on randomly selected 8 instructions as reference.             votes and a label denoting whether it is selected.
Subsequently, the LLM is further instructed to generate in-              • Sandbox Alignment Data [193] is an alignment dataset
stance inputs and their corresponding outputs based on the          containing feedback from LLMs rather than human. It
synthetic instructions, and finally obtain the Self-Instruct-       comes from a virtual interaction environment called SAND-
52K dataset.                                                        BOX, where the model simulates social interactions with
    • Alpaca [146] is also a synthetic dataset based on the self-   other models and revise responses according to the feedback
instruct [147] method. It utilizes the text-davinci-003             from other models. The dataset contains 169K instances, and
model on the 175 seed datasets from Self-Instruct-52K to            each instance consists of a societal query, several responses,
obtain 52,000 new instructions and corresponding inputs             and corresponding ratings from other models.
                                                                                                                              16

3.4   Library Resource                                           decoding algorithms, tensor parallelism and streaming out-
                                                                 puts. To ease the integration with other systems, vLLM is
In this part, we briefly introduce a series of available li-
                                                                 friendly to the use of HuggingFace models, and also provide
braries for developing LLMs.
                                                                 OpenAI-compatible API servers.
     • Transformers [201] is an open-source Python library
                                                                     • DeepSpeed-MII [209] is also a memory efficient
for building models using the Transformer architecture,
                                                                 Python library developed by DeepSpeed [74]. It aims to
which is developed and maintained by Hugging Face. It
                                                                 democratize LLMs inference by prioritizing high through-
has a simple and user-friendly API, making it easy to use
                                                                 put, low latency, and cost-effectiveness. DeepSpeed-MII
and customize various pre-trained models. It is a powerful
                                                                 achieves accelerated text generation inference by leveraging
library with a large and active community of users and
                                                                 four essential technologies: blocked KV caching, continuous
developers who regularly update and improve the models
                                                                 batching, dynamic SplitFuse, and high-performance CUDA
and algorithms.
                                                                 Kernels. It currently supports over 13,000 models across
     • DeepSpeed [74] is a deep learning optimization library    three popular model architectures, such as LLaMA [57],
(compatible with PyTorch) developed by Microsoft, which
                                                                 Mistral [137], and OPT [90].
has been used to train a number of LLMs, such as MT-
                                                                     • DeepSpeed-Chat [210] is a fast, cost-effective, and
NLG [113] and BLOOM [78]. It provides the support of
                                                                 easy-to-use system framework that enables the integration
various optimization techniques for distributed training,
                                                                 of the complete RLHF process during model training. It
such as memory optimization (ZeRO technique, gradient
                                                                 is featured by three major functionalities: (1) it simplifies
checkpointing), and pipeline parallelism.
                                                                 the training and inference process for ChatGPT-like models,
     • Megatron-LM [75–77] is a deep learning library devel-     enabling using a simple script to implement multiple train-
oped by NVIDIA for training large-scale language models.
                                                                 ing or inference steps; (2) it replicates the training mode
It also provides rich optimization techniques for distributed
                                                                 of InstructGPT [66] and provides a complete pipeline for
training, including model and data parallelism, mixed-
                                                                 three training steps (i.e., SFT, reward model fine-tuning, and
precision training, and FlashAttention. These optimization
                                                                 RLHF); (3) it integrates the training engine and inference en-
techniques can largely improve the training efficiency and
                                                                 gine of Deepspeed into a unified hybrid engine (Deepspeed
speed, enabling efficient distributed training across GPUs.
                                                                 HE) for RLHF training, which enables seamless switch be-
     • JAX [202] is a Python library for high-performance        tween training and inference modes, and leveraging various
machine learning algorithms developed by Google, allow-          optimizations from DeepSpeed Inference.
ing users to easily perform computations on arrays with              In addition to the above library resources, existing deep
hardware acceleration (e.g., GPU or TPU). It enables efficient   learning frameworks (e.g., PyTorch [211], TensorFlow [212],
computation on various devices and also supports several         MXNet [213], PaddlePaddle [214], MindSpore [215] and
featured functions, such as automatic differentiation and        OneFlow [216]) have also provided the support for parallel
just-in-time compilation.                                        algorithms, which are commonly used for training large-
     • Colossal-AI [203] is a deep learning library developed    scale models.
by HPC-AI Tech for training large-scale AI models. It is
implemented based on PyTorch and supports a rich collec-
tion of parallel training strategies. Furthermore, it can also   4     P RE - TRAINING
optimize heterogeneous memory management with meth-
ods proposed by PatrickStar [204]. Recently, a ChatGPT-like      Pre-training establishes the basis of the abilities of LLMs. By
model called ColossalChat [205] has been publicly released       pre-training on large-scale corpora, LLMs can acquire essen-
with two versions (7B and 13B), which are developed using        tial language understanding and generation skills [55, 56]. In
Colossal-AI based on LLaMA [57].                                 this process, the scale and quality of the pre-training corpus
     • BMTrain [206] is an efficient library developed by        are critical for LLMs to attain powerful capabilities. Fur-
OpenBMB for training models with large-scale parameters          thermore, to effectively pre-train LLMs, model architectures,
in a distributed manner, which emphasizes code simplicity,       acceleration methods, and optimization techniques need to
low resource, and high availability. BMTrain has already         be well designed. In what follows, we first discuss the data
incorporated several common LLMs (e.g., Flan-T5 [69] and         collection and processing in Section 4.1, then introduce the
GLM [93]) into its ModelCenter, where developers can use         commonly used model architectures in Section 4.2, and fi-
these models directly.                                           nally present the training techniques to stably and efficiently
     • FastMoE [207] is a specialized training library for MoE   optimize LLMs in Section 4.3.
(i.e., mixture-of-experts) models. It is developed based on
PyTorch, prioritizing both efficiency and user-friendliness
in its design. FastMoE simplifies the process of transferring    4.1   Data Collection and Preparation
Transformer models to MoE models and supports both data          Compared with small-scale language models, LLMs have
parallelism and model parallelism during training.               a stronger demand for high-quality data for model pre-
     • vLLM [208] is a fast, memory efficient, and easy-         training, and their model capacities largely rely on the pre-
to-use library for LLM inference and serving. To enable          training corpus and how it has been preprocessed. In this
fast inference, it is specially optimized with high serving      part, we discuss the collection and processing of pre-training
throughput, effective attention memory management using          data, including data sources, preprocessing methods, and
PagedAttention [208], continuous batching, and optimized         important analysis of how pre-training data affects the
CUDA kernels. Furthermore, vLLM also supports various            performance of LLMs.
                                                                                                                                                   17

         T5 (11B)         Falcon (40B)         LLaMA (65B)           GPT-3 (175B)         MT-NLG (530B)      Gopher (280B)     Chinchilla (70B)
                                                 3%                                               2%
                                               2% 5%                  16%                                          3%                4%
                                                  5%                                       26% 4%            37%               40%
                                                                                                       62%               60%                56%
                                                                                           6%
           100%              100%                      87%                    84%


          Yi (34B)        PaLM (540B)         LaMDA (137B)          Galactica (120B)      GPT-NeoX (20B)     CodeGen (16B)     StarCoder 2 (15B)
                             5%                                                                                                        2%
                                                                              8%
        5% 9%                                    13%                                            8%                                    1%
                          14%                                               7%                                          20%
                                    31%                                                                30%                             5%
          4%                                              38%                                                39%
                                                                                                                          6%
                                                                                          38%
                                                                                                     10%                 10%
                83%                            50%
                            50%                                             86%                   15%              25%               92%


                      💻 C4 (800G, 2019), 💻 OpenWebText (38G, 2023), 💻 Wikipedia (21G, 2023)
                      💬 the Pile - StackExchange (41G, 2020)
                      📚 BookCorpus (5G, 2015), 📚 Gutenberg (-, 2021), 📚 CC-Stories-R (31G, 2019), 📰 CC-NEWES (78G, 2019), 📰 REALNEWs (120G, 2019)
                      🔬 the Pile - ArXiv (72G, 2020), 🔬 the Pile - PubMed Abstracts (25G, 2020)
                      ⌨ BigQuery (-, 2023), the Pile - GitHub (61G, 2020)


                       Fig. 6: Ratios of various data sources in the pre-training data for existing LLMs.


4.1.1   Data Source                                                               conversational competence of LLMs [90] and potentially im-
To develop a capable LLM, it is key to collect a large amount                     prove their performance on a range of question-answering
of natural language corpus from various data sources. Ex-                         tasks [56]. Researchers can utilize subsets of public conver-
isting LLMs mainly leverage a mixture of diverse public                           sation corpus (e.g., PushShift.io Reddit corpus) [163, 217] or
textual datasets as the pre-training corpus. Figure 6 shows                       collect conversation data from online social media. Since on-
the distribution of the sources of pre-training data for a                        line conversational data often involves discussions among
number of representative LLMs.                                                    multiple participants, an effective processing way is to
                                                                                  transform a conversation into a tree structure, where the
    The source of pre-training corpus can be broadly cate-
                                                                                  utterance is linked to the one it responds to. In this way, the
gorized into two types: general data and specialized data.
                                                                                  multi-party conversation tree can be divided into multiple
General data, such as webpages, books, and conversational
                                                                                  sub-conversations, which can be collected in the pre-training
text, is utilized by most LLMs [55, 56, 90] due to its large,
                                                                                  corpus. Furthermore, a potential risk is that the excessive
diverse, and accessible nature, which can enhance the lan-
                                                                                  integration of dialogue data into LLMs may result in a side
guage modeling and generalization abilities of LLMs. In
                                                                                  effect [90]: declarative instructions and direct interrogatives
light of the impressive generalization capabilities exhibited
                                                                                  are erroneously perceived as the beginning of conversations,
by LLMs, there are also studies that extend their pre-training
                                                                                  thus leading to a decline in the efficacy of the instructions.
corpus to more specialized datasets, such as multilingual
data, scientific data, and code, endowing LLMs with specific                          • Books. Compared to other corpus, books provide an
task-solving capabilities [35, 56, 86]. In what follows, we                       important source of formal long texts, which are potentially
describe these two types of pre-training data sources and                         beneficial for LLMs to learn linguistic knowledge, model
their effects on LLMs. For a detailed introduction to the                         long-term dependency, and generate narrative and coherent
commonly used corpus, one can refer to Section 3.2.                               texts. To obtain open-source book data, existing studies
                                                                                  usually adopt the Books3 and Bookcorpus2 datasets, which
General Text Data. As we can see in Figure 6, the vast                            are available in the Pile dataset [166].
majority of LLMs adopt general-purpose pre-training data,
such as webpages, books, and conversational text, which                           Specialized Text Data. Specialized datasets are useful to
provides rich text sources on a variety of topics. Next, we                       improve the specific capabilities of LLMs on downstream
briefly summarize three important kinds of general data.                          tasks. Next, we introduce three kinds of specialized data.
    • Webpages. Owing to the proliferation of the Internet,                           • Multilingual text. In addition to the text in the target
various types of data have been created, which enables                            language, integrating a multilingual corpus can enhance
LLMs to gain diverse linguistic knowledge and enhance                             the multilingual abilities of language understanding and
their generalization capabilities [26, 82]. For convenient                        generation. For example, BLOOM [78] and PaLM [56] have
use of these data resources, a large amount of data is                            curated multilingual data covering 46 and 122 languages,
crawled from the web in previous work, such as Com-                               respectively, within their pre-training corpora. FLM [102]
monCrawl [168]. However, the crawled web data tends to                            mixes Chinese and English corpora in nearly equal propor-
contain both high-quality text, such as Wikipedia and low-                        tions. These models demonstrate impressive performance in
quality text, like spam mail, thus it is important to filter and                  multilingual tasks, such as translation, multilingual summa-
process webpages for improving the data quality.                                  rization, and multilingual question answering, and achieve
    • Conversation text. Conversation data can enhance the                        comparable or superior performance to the state-of-the-
                                                                                                                                  18

art models that are fine-tuned on the corpus in the target          quality data. Typically, these methods train a binary classi-
language(s).                                                        fier using positive instances that are: well-curated data (e.g.,
    • Scientific text. The exploration of science by humans has     Wikipedia pages) [55, 56, 112], high-quality synthesized
been witnessed by the increasing growth of scientific publi-        data [135, 229–231], or a combination of both. They sample
cations. In order to enhance the understanding of scientific        candidate data as negative instances and predict the score
knowledge for LLMs [35, 218], it is useful to incorporate a         that measures the quality of each data example. However,
scientific corpus for model pre-training [35, 218]. By pre-         several studies [64, 112] find that a classifier-based approach
training on a vast amount of scientific text, LLMs can              may result in the unintentional removal of high-quality texts
achieve impressive performance in scientific and reasoning          in dialectal, colloquial, and sociolectal languages, which
tasks [219]. To construct the scientific corpus, existing efforts   potentially leads to bias in the pre-training corpus and
mainly collect arXiv papers, scientific textbooks, math web-        diminishes the corpus diversity. As the second approach,
pages, and other related scientific resources. Due to the com-      several studies, such as BLOOM [78] and Gopher [64],
plex nature of data in scientific fields, such as mathematical      employ heuristic-based approaches to eliminate low-quality
symbols and protein sequences, specific tokenization and            texts through a set of well-designed rules, which can be
preprocessing techniques are usually required to transform          summarized as follows:
these different formats of data into a unified form that can         • Language based filtering. If a LLM would be mainly used
be processed by language models.                                       in the tasks of certain languages, the text in other lan-
    • Code. Program synthesis has been widely studied in               guages can be filtered.
the research community [105, 220–223], especially the use of
                                                                     • Metric based filtering. Evaluation metrics about the gener-
PLMs trained on code [176, 224]. However, it remains chal-
lenging for these PLMs (e.g., GPT-J [176]) to generate high-           ated texts, e.g., perplexity, can be employed to detect and
quality and accurate programs. Recent studies [105, 223]               remove unnatural sentences.
have found that training LLMs on a vast code corpus                  • Statistic based filtering. Statistical features of a corpus,
can lead to a substantial improvement in the quality of                e.g., the punctuation distribution, symbol-to-word ratio,
the synthesized programs. The generated programs can                   and sentence length, can be utilized to measure the text
successfully pass expert-designed unit-test cases [105] or             quality and filter the low-quality data.
solve competitive programming questions [114]. In gen-
                                                                     • Keyword based filtering. Based on specific keyword set, the
eral, two types of code corpora are commonly used for
                                                                       noisy or unuseful elements in the text, such as HTML
pre-training LLMs. The first source is from programming
                                                                       tags, hyperlinks, boilerplates, and offensive words, can
question answering communities like Stack Exchange [225].
                                                                       be identified and removed.
The second source is from public software repositories
such as GitHub [86, 105, 223], where code data (includ-                 In addition to the above methods, LLMs (especially rela-
ing comments and docstrings) are collected for utilization.         tively small models) can be also employed for data selection,
Compared to natural language text, code is in the format            either by computing perplexity [232] or directly prompting
of a programming language, corresponding to long-range              LLMs [233] for measuring the sample importance. However,
dependencies and accurate execution logic [226]. A recent           using LLMs is unavoidably computationally intensive for
study [47] also speculates that training on code might be a         large-scale data selection.
source of complex reasoning abilities (e.g., chain-of-thought       De-duplication. Existing work [234] has found that dupli-
ability [33]). Furthermore, it has been shown that formatting       cate data in a corpus would reduce the diversity of language
reasoning tasks into code can help LLMs generate more               models, which may cause the training process to become un-
accurate results [226].                                             stable and thus affect the model performance. Therefore, it is
                                                                    necessary to de-duplicate the pre-training corpus. Specially,
4.1.2   Data Preprocessing                                          de-duplication can be performed at different granularities,
After collecting a large amount of text data, it is essential       including sentence-level, document-level, and dataset-level
to preprocess the data for constructing the pre-training            de-duplication. First, low-quality sentences that contain re-
corpus, especially removing noisy, redundant, irrelevant,           peated words and phrases should be removed, as they may
and potentially toxic data [56, 64, 227], which may largely         introduce repetitive patterns in language modeling [235].
affect the capacity and performance of LLMs. To facilitate          At the document level, existing studies mostly rely on the
the data processing, a recent study [228] proposes a useful         overlap ratio of surface features (e.g., words and n-grams
data processing system for LLMs, named Data-Juicer, which           overlap) between documents to detect and remove duplicate
provides over 50 processing operators and tools. In this            documents containing similar contents [57, 64, 78, 236].
part, we review the detailed data preprocessing strategies          Furthermore, to avoid the dataset contamination problem,
to improve the quality of the collected data [64, 78, 112]. A       it is also crucial to prevent the overlap between the training
typical pipeline of preprocessing the pre-training data for         and evaluation sets [56], by removing the possible duplicate
LLMs has been illustrated in Figure 7.                              texts from the training set. It has been shown that the three
                                                                    levels of de-duplication are useful to improve the training
Filtering and Selection. To remove low-quality data from            of LLMs [56, 237], which should be jointly used in practice.
the collected corpus, existing work generally adopts two ap-
proaches, namely classifier-based and heuristic-based. The          Privacy Reduction. Thus, it is necessary to remove the
former approach trains a selection classifier based on high-        personally identifiable information (PII) from the pre-training
quality texts and leverages it to identify and filter out low-      corpus. One direct and effective approach is to employ
                                                                                                                                                                                        19


                                                                                                                                                                   Ready to
      Raw Corpus           Filtering & Selection              De-duplication                  Privacy Reduction                Tokenization
                                                                                                                                                                   pre-train!

                             Language Filtering               Sentence-level                 Detect Personality              Reuse Existing
                                                                                             Identifiable                    Tokenizer
                             Metric Filtering                 Document-level
                                                                                             Information (PII)               SentencePiece
                             Statistic Filtering              Set-level
                                                                                             Remove PII                      Byte-level BPE
                             Keyword Filtering

                        Alice is writing a paper about   Alice is writing a paper about    Replace('Alice') is           Encode('[Somebody] is           32, 145, 66, 79, 12, 56, ...
                        LLMs. #$^& Alice is writing      LLMs. Alice is writing a paper    writing a paper about LLMs.   writing a paper about LLMs.')
                        a paper about LLMs.              about LLMs.




           Fig. 7: An illustration of a typical data preprocessing pipeline for pre-training large language models.


rule-based methods, such as keyword spotting, to detect                                       • WordPiece tokenization. WordPiece was a Google inter-
and remove PII such as names, addresses, and phone num-                                   nal subword tokenization algorithm. It was originally pro-
bers [167]. Furthermore, researchers also find that the vul-                              posed by Google in developing voice search systems [242].
nerability of LLMs under privacy attacks can be attributed                                Then, it was used in the neural machine translation system
to the presence of duplicate PII data in the pre-training cor-                            in 2016 [243], and was adopted as the word tokenizer for
pus [238]. Therefore, de-duplication can also reduce privacy                              BERT in 2018 [23]. WordPiece has a very similar idea with
risks to some extent.                                                                     BPE by iteratively merging consecutive tokens, whereas
                                                                                          taking a slightly different selection criterion for the merge.
Tokenization. Tokenization is also a crucial step for data                                To conduct the merge, it first trains a language model and
preprocessing. It aims to segment raw text into sequences                                 employs it to score all possible pairs. Then, at each merge, it
of individual tokens, which are subsequently used as the                                  selects the pair that leads to the most increase in the likeli-
inputs of LLMs. In traditional NLP research (e.g., sequence                               hood of training data. Since Google has’t released the official
labeling with conditional random fields [239]), word-based                                implementation of the WordPiece algorithm, HuggingFace
tokenization is the predominant approach, which is more                                   gives a more intuitive selection measure in its online NLP
aligned with human’s language cognition. However, word-                                   course: a pair is scored by dividing the co-occurrence count
based tokenization can yield different segmentation results                               by the product of the occurrence counts of two tokens in the
for the same input in some languages (e.g., Chinese word                                  pair based on training corpus.
segmentation), generate a huge word vocabulary containing
many low-frequency words, and also suffer from the “out-                                       • Unigram tokenization. Unlike BPE and WordPiece, Un-
of-vocabulary” issue. Thus, several neural network models                                 igram tokenization [244] starts with a sufficiently large
employ character as the minimum unit to derive the word                                   set of possible substrings or subtokens for a corpus, and
representation (e.g., a CNN word encoder in ELMo [21]).                                   iteratively removes the tokens in the current vocabulary
Recently, subword tokenizers have been widely used in Trans-                              until the expected vocabulary size is reached. As the se-
former based language models, typically including Byte-                                   lection criterion, it calculates the yielded increase in the
Pair Encoding tokenization, WordPiece tokenization and                                    likelihood of training corpus by assuming that some to-
Unigram tokenization. HuggingFace has maintained an                                       ken was removed from current vocabulary. This step is
excellent online NLP course on tokenizer22 with running                                   conducted based on a trained unigram language model.
examples, and we refer to the beginners to this course. Next,                             To estimate the unigram language model, it adopts an
we briefly describe the three representative tokenization                                 expectation–maximization (EM) algorithm: at each iteration,
methods.                                                                                  we first find the currently optimal tokenization of words
    • Byte-Pair Encoding (BPE) tokenization. BPE was origi-                               based on the old language model, and then re-estimate the
nally proposed as a general data compression algorithm in                                 probabilities of unigrams to update the language model.
1994 [240], and then adapted to NLP for tokenization [241].                               During this procedure, dynamic programming algorithms
It starts with a set of basic symbols (e.g., the alphabets                                (i.e., the Viterbi algorithm) are used to efficiently find the
and boundary characters), and iteratively combine frequent                                optimal decomposition way of a word given the language
pairs of two consecutive tokens in the corpus as new to-                                  model. Representative models that adopt this tokenization
kens (called merge). For each merge, the selection criterion                              approach include T5 and mBART.
is based on the co-occurrence frequency of two contigu-                                       Although it is expedient to leverage an existing tokenizer
ous tokens: the top frequent pair would be selected. The                                  (e.g., OPT [90] and GPT-3 [55] utilize the tokenizer of GPT-
merge process continues until it reaches the predefined                                   2 [26]), using a tokenizer specially designed for the pre-
size. Further, Byte-level BPE has been used to improve the                                training corpus can be highly beneficial [78], especially for
tokenization quality for multilingual corpus (e.g., the text                              the corpus that consists of diverse domains, languages, and
containing non-ASCII characters) by considering bytes as the                              formats. Therefore, recent LLMs often train the customized
basic symbols for merge. Representative language models                                   tokenizers specially for the pre-training corpus with the
with this tokenization approach include GPT-2, BART, and                                  SentencePiece library [245], which includes Byte-level BPE
LLaMA.                                                                                    and Unigram tokenization. A note is that normalization
                                                                                          techniques in BPE, such as NFKC [246], may degrade the
 22. https://huggingface.co/learn/nlp-course/chapter6                                     tokenization performance [34, 64, 78]. When extending exist-
                                                                                                                                   20


   Data              1                                               of the entire pre-training data), and can be also locally set
   Source
                     2
                     3                Data Mixture                   to varied proportions at different training stages. During
                     4
                                                                     pre-training, data samples from different sources would be
  Stage 1          Stage 2               Stage          Stage        selected according to the mixture proportions: more data
                                                                     will be sampled from a data source with a larger weight.
                                ···                                  Typically, existing LLMs such as LLaMA [57] may employ
                                                                     upsampling or downsampling on the full data of each
                                                                     source to create specific data mixtures as pre-training data.
                         Data Curriculum                             As Figure 6 illustrates, existing LLMs use different data mix-
                                                                     tures to construct the pre-training data. As a representative
Fig. 8: An illustration of data scheduling for pre-training          model, the pre-training data of LLaMA [57] mainly consists
LLMs.                                                                of webpages (over 80%), alongside 6.5% of code-heavy data
                                                                     from GitHub and StackExchange, 4.5% from books, and
                                                                     2.5% of scientific data sourced from arXiv, which has become
ing LLMs (i.e., continual pre-training or instruction tuning),       an important reference for training general-purpose LLMs.
we should be also aware of the potential side effect with            Furthermore, special data mixtures can be used to facilitate
customized tokenizers. For example, LLaMA trains the BPE             different purposes. For example, Falcon [171] is trained on
tokenizer based on a pre-training corpus mainly consisting           pure webpages, and CodeGen [86] largely increases the
of English texts, and the derived vocabulary might be less           amount of code data. In practice, data mixture is often de-
capable in processing non-English data, e.g., taking longer          termined empirically, and we summarize several common
inference latency to generate Chinese texts.                         strategies for finding an effective data mixture as follows:
Discussion on Effect of Data Quality. For pre-training, the              • Increasing the diversity of data sources. Recent studies
quality of pre-training data is vital to the model capacities        have empirically shown that training on excessive data
of LLMs. Existing work has shown that pre-training on the            about a certain domain would degrade the generalization
low-quality corpus, such as noisy, toxic, and duplicate data,        capability of LLMs on other domains [35, 64]. In contrast,
would largely hurt the performance of models [64, 234,               increasing the data source heterogeneity (e.g., including
236, 238]. Recent studies, such as T5 [82], GLaM [112], and          diverse data sources) is critical for improving the down-
Gopher [64], have investigated the influence of data quality         stream performance of LLMs [227, 248, 249]. To further
on the LLMs’ capacities. By comparing the performance of             examine the effect of different data sources, some studies
models trained on the filtered and unfiltered corpus, they           have conducted ablation experiments by removing each
have reached the similar conclusion that pre-training LLMs           data source one by one, and pre-train LLMs with specially
on cleaned data can improve the model performance. More              curated datasets [227]. It has been shown that dropping data
specifically, the duplication of data may result in “double          sources with high heterogeneity (e.g., webpages) impacts
descent” (referring to the phenomenon of performance ini-            LLM’s abilities more severely than dropping sources with
tially deteriorating and subsequently improving) [234, 247],         low heterogeneity (e.g., academic corpus).
or even overwhelm the training process [234]. In addition,               • Optimizing data mixtures. In addition to manually set-
it has been shown that duplicate data degrades the ability           ting the data mixtures, several studies have proposed to
of LLMs to copy from the context, which might further                optimize the data mixtures for improving the model pre-
affect the generalization capacity of LLMs using in-context          training [59, 250]. Given the target downstream tasks, one
learning [234]. Therefore, as suggested in [56, 64, 78, 227],        can select pre-training data with either higher proximity
it is essential to utilize preprocessing methods like quality        in the feature space [250] or those that provide positive
filtering, toxic filtering and deduplication to carefully clean      influences on downstream task performance [251]. Further,
the pre-training corpus (as illustrated in Section 4.1.2), to        to reduce the reliance of target tasks, DoReMi [59] first trains
improve stability of the training process and avoid affecting        a small reference model using given initial domain weights,
the model performance.                                               and then trains another small proxy model, upweighting the
                                                                     domains on which the greatest discrepancies in likelihood
4.1.3 Data Scheduling                                                between the two models are observed. Finally, the learned
                                                                     domain weights of the proxy model are applied to train
After data preprocessing, it is essential to design suit-
                                                                     a much larger LLM. In a more simple way, one can train
able strategies to schedule these multi-source data for pre-
                                                                     several small language models with different data mixtures,
training a capable LLM. Generally, two key aspects should
                                                                     and select the data mixture that leads to the most desir-
be paid close attention for data scheduling: the proportion
                                                                     able performance. However, an assumption made in this
of each data source (data mixture), and the order in which
                                                                     approach is, when trained in a similar way, small models
each data source is scheduled for training (data curriculum).
                                                                     would resemble with large models in model abilities or
Next, we discuss the two aspects in detail. An illustration of
                                                                     behaviors, which may not always hold in practice.
data scheduling has been presented in Figure 8.
                                                                         • Specializing the targeted abilities. The model capacities
Data Mixture. Since each kind of data source is closely              of LLMs heavily rely on data selection and mixture, and
related to the development of certain capacities for LLMs            one can boost the proportions of specific data sources to
(referring to the discussions in Section 4.1), it is important       enhance certain model abilities [64, 227]. For example, the
to set a suitable distribution to mix these data. The data           mathematical reasoning and coding abilities can be specially
mixture is generally set in a global level (i.e., the distribution   enhanced by training with more mathematical texts and
                                                                                                                                     21

code data, respectively. Furthermore, experimental results            of scientific papers, web data containing mathematical text
on the LAMBADA dataset [252] show that increasing the                 and code (2T general tokens → 500B code-heavy tokens
proportion of books data can improve the model capacity in            → 50∼200B math-heavy tokens). Note that the pre-training
capturing long-term dependencies from text, and increasing            data of Llemma also contains 5% general domain data as a
the proportion of the C4 dataset [82] leads to performance            form of regularization.
improvement on the C4 validation dataset [64]. Generally,                  • Long context. Long context modeling is an important
it is important to identify more implicit relations between           ability for LLMs, and many studies have explored extend-
data sources and model abilities. To enhance specific skills          ing the context windows of LLMs via continually train-
such as mathematics and coding in LLMs, or to develop                 ing [254, 257]. With modifications on position embeddings
specialized LLMs, a practical way is to employ a multi-stage          (i.e., position interpolation) of RoPE-based LLMs [57, 99,
training approach, e.g., general and skill-specific data can          259], CodeLLaMA further extends the context window of
be scheduled at two consecutive stages. This approach of              LLaMA 2 (2.5T tokens with 4K context window → 20B
training LLMs on varying sources or proportions of data               tokens with 16K context window). LongLLaMA [257] also
across multiple stages is also known as “data curriculum”,            achieves longer context window with the help of external
which will be introduced below.                                       memory and a unique training objective (1T tokens with 2K
                                                                      context window → 10B tokens with 8K context window).
Data Curriculum. After preparing the data mixture, it is
important to schedule the order that specific data is pre-            4.1.4 Summary of Data Preparation
sented to LLMs for pre-training. It has been shown that,              In this part, we summarize the general procedure and key
in some cases, to learn a certain skill, learning in a skill-         points to prepare pre-training data for LLMs, which are
set sequence (e.g., basic skills → target skill) outperforms          detailed in the following three aspects.
direct learning from a corpus focused solely on the target                • Data collection. It is suggested to include diverse data
skill [253, 254]. Following the idea of curriculum learn-             sources in the pre-training data. Although Falcon [171]
ing [255], data curriculum has been proposed and widely               shows that webpages alone can be employed to train power-
used in model pre-training [253, 254, 256, 257]. It aims to           ful LLMs, a more typical approach is to also incorporate di-
organize different parts of pre-training data for LLMs in             verse high-quality text like code, books, scientific papers, etc.
a specific order, e.g., starting with easy/general examples           If a LLM is specialized with a certain skill, the proportion of
and progressively introducing more challenging/special-               corresponding data source should be increased accordingly.
ized ones. More generally, it can broadly refer to the adap-          For example, Gopher [64] and Chinchilla [34] are trained
tive adjustment of data proportions for different sources             with approximately 40% of data from books. PaLM [44] and
during pre-training. Existing work about data curriculum              LaMDA [68] use approximately 50% conversational data.
mainly focuses on continual pre-training, such as special-                • Data cleaning. After data collection, it is crucial to clean
ized coding LLMs (e.g., CodeLLaMA [254]) or long context              the raw corpus to enhance its quality as possible. First,
LLMs (e.g., LongLLaMA [257]). However, it still lacks of              deduplication is commonly used in existing work [99, 171,
more detailed report about data curriculum for general-               248]. Second, low-quality text, toxic content, and data with
purpose LLMs (e.g., LLaMA) in the literature. To determine            privacy concerns should be removed at different granulari-
data curriculum, a practical approach is to monitor the de-           ties (e.g., document, passage or sentence). In practice, both
velopment of key abilities of LLMs based on specially con-            heuristic and classifier-based methods can be employed
structed evaluation benchmarks, and then adaptively adjust            for quality and toxicity filtering (e.g., CCNet [260], fast-
the data mixture during pre-training. Next, we take three             Text [261], and Data-Juicer [262]). Third, with the cleaned
common abilities as examples to introduce how the concept             data, one can further unify or specify the format for pre-
of data curriculum23 applies in continual pre-training.               training data, and perform the tokenization by training
    • Coding. To improve the coding ability of LLMs, CodeL-           the tokenizer on the filtered and deduplicated corpus with
LaMA [254] is developed based on LLaMA 2 [99] (2T general             libraries like SentencePiece [245].
tokens → 500B code-heavy tokens), aiming to improve the                   • Data scheduling. With the preprocessed data, the next
code generation ability and retain natural language under-            step is to determine the data mixture and the specific order
standing skills. CodeLLaMA also provides a version that               of data for pre-training LLMs. To determine both settings, a
is further specialized to a certain programming language,             practical way is to first train several small language models
namely CodeLLaMA-Python (2T general tokens → 500B                     with multiple candidate plans and then select a good plan
code-heavy tokens → 100B Python-heavy tokens).                        among them [59]. Overall, it is more difficult to find a
    • Mathematics. Llemma [258] is proposed to enhance                suitable data curriculum. In practice, one can monitor the
the mathematical capacities of general-purpose LLMs. It               performance of intermediate model checkpoints on specific
is developed based on CodeLLaMA. Although CodeL-                      evaluation benchmarks, and dynamically tune the data mix-
LaMA [254] mainly focuses on the coding ability, exper-               ture and distribution during pre-training. In this process, it
iments have shown that it performs better than its base               is also useful to explore the potential relations between data
model LLaMA 2 on mathematics benchmarks [258]. Based                  sources and model abilities to instruct the design of data
on CodeLLaMA, Llemma is continually trained on mixtures               curriculum.

  23. We utilize the symbol “→” to represent the data order in data   4.2   Architecture
curriculum. For example, “2T webpage tokens → 500B code tokens”
means that the LLM is firstly trained with 2T webpage tokens and      In this section, we review the architecture design of LLMs,
subsequently with 500B code data tokens.                              i.e., mainstream architecture, pre-training objective, and de-
                                                                                                                              22

tailed configuration. Table 5 presents the model cards of         e.g., U-PaLM [118] is derived from PaLM [56]. Existing rep-
several representative LLMs with public details.                  resentative LLMs based on prefix decoders include GLM-
                                                                  130B [93] and U-PaLM [118].
4.2.1 Typical Architectures
                                                                  Mixture-of-Experts. For the above three types of archi-
Due to the excellent parallelizability and capacity, the Trans-   tectures, we can further extend them via the mixture-of-
former architecture [22] has become the de facto backbone to      experts (MoE) scaling, in which a subset of neural network
develop various LLMs, making it possible to scale language        weights for each input are sparsely activated, e.g., Switch
models to hundreds or thousands of billions of parameters.        Transformer [25] and GLaM [112]. The major merit is that
In general, the mainstream architectures of existing LLMs         MoE is a flexible way to scale up the model parameter while
can be roughly categorized into three major types, namely         maintaining a constant computational cost [25]. It has been
encoder-decoder, causal decoder, and prefix decoder, as           shown that substantial performance improvement can be
shown in Figure 9.                                                observed by increasing either the number of experts or the
Encoder-decoder Architecture. The vanilla Transformer             total parameter size [265]. Despite the merits, training large
model is built on the encoder-decoder architecture [22],          MoE models may suffer from instability issues due to the
which consists of two stacks of Transformer blocks as             complex, hard-switching nature of the routing operation.
the encoder and decoder, respectively. The encoder adopts         To enhance the training stability of MoE-based language
stacked multi-head self-attention layers to encode the input      models, techniques such as selectively using high-precision
sequence for generating its latent representations, while         tensors in the routing module or initializing the model with
the decoder performs cross-attention on these representa-         a smaller range have been introduced [25]. More recently,
tions and autoregressively generates the target sequence.         there is widespread speculation that GPT-4 has been devel-
Encoder-decoder PLMs (e.g., T5 [82] and BART [24]) have           oped based on the MoE architecture, but without official
shown effectiveness on a variety of NLP tasks. So far,            verification.
there are only a small number of LLMs that are built based
on the encoder-decoder architecture, e.g., Flan-T5 [69]. We       Emergent Architectures. The conventional Transformer ar-
leave a detailed discussion about the architecture selection      chitecture typically suffers from quadratic computational
in Section 4.2.5.                                                 complexity with respect to sequence length, resulting in a
Causal Decoder Architecture. The causal decoder archi-            high processing cost for dealing with long inputs. To im-
tecture incorporates the unidirectional attention mask, to        prove efficiency, recent studies aim to devise new architec-
guarantee that each input token can only attend to the            tures for language modeling, most based on parameterized
past tokens and itself. The input and output tokens are           state space models (SSM) [266], which can be viewed as
processed in the same fashion through the decoder. As             a combination of RNN and CNN. On the one hand, SSM
representative language models of this architecture, the          can generate outputs recursively like RNN, meaning that
GPT-series models [26, 55, 122] are developed based on            they only need to refer to the single previous state during
the causal-decoder architecture. In particular, GPT-3 [55]        decoding. It makes the decoding process more efficient
has successfully demonstrated the effectiveness of this ar-       as it eliminates the need to revisit all previous states as
chitecture, also showing an amazing in-context learning           in conventional Transformers. On the other hand, these
capability of LLMs. Interestingly, GPT-1 [122] and GPT-           models have the capability to encode an entire sequence
2 [26] do not exhibit such superior abilities as those in         in parallel like Transformers via convolution computation.
GPT-3, and it seems that scaling plays an important role          Thus, they can benefit from the parallelism of GPUs with
in increasing the model capacity of this model architecture.      techniques such as Parallel Scan [267, 268], FFT [269, 270],
So far, the causal decoders have been widely adopted as           and Chunkwise Recurrent [271]. Despite the high computa-
the architecture of LLMs by various existing LLMs, such           tion efficiency of SSMs, their performance still lags behind
as OPT [90], BLOOM [78], and Gopher [64]. Note that both          Transformer. Thus, several variants of SSM have been pro-
the causal decoder and prefix decoder discussed next belong       posed, including Mamba [272], RetNet [271], RWKV [273],
to decoder-only architectures. When mentioning “decoder-          and Hyena [269].
only architecture”, it mainly refers to the causal decoder            • Mamba. Mamba [272] aims to selectively filter out or
architecture in existing literature, unless specified.            remember information during state update. It replaces the
                                                                  original fixed parameters of SSM layers with functions of the
Prefix Decoder Architecture. The prefix decoder architec-         input, selectively filtering out information of the previous
ture (a.k.a., non-causal decoder [263]) revises the masking       state and the current input depending on the current input.
mechanism of causal decoders, to enable performing bidi-          Compared with traditional SSMs, Mamba has demonstrated
rectional attention over the prefix tokens [264] and unidi-       improved text modeling capacities.
rectional attention only on generated tokens. In this way,            • RWKV. RWKV [273] combines the advantages of Trans-
like the encoder-decoder architecture, the prefix decoders        former and RNN. It employs time-mixing modules, i.e.,
can bidirectionally encode the prefix sequence and autore-        RNN with gating, and channel-mixing modules that are
gressively predict the output tokens one by one, where the        special feedforward neural networks [273]. Within these
same parameters are shared during encoding and decoding.          modules, token shift, a linear combination of the current and
Instead of pre-training from scratch, a practical suggestion      previous token, is used instead of the token representation
is to continually train causal decoders and then convert          as the input.
them into prefix decoders for accelerating convergence [29],          • RetNet. RetNet [271] proposes multi-scale retention
                                                                                                                                                                                                                                                       23

TABLE 5: Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding,
#L denotes the number of layers, #H denotes the number of attention heads, dmodel denotes the size of hidden states, and
MCL denotes the maximum context length during training.

           Model                                           Category               Size       Normalization                                PE          Activation         Bias                             #L        #H        dmodel      MCL
           GPT3 [55]                                    Causal decoder           175B       Pre LayerNorm                              Learned          GeLU              ✓                            96           96         12288       2048
           PanGU- α [84]                                Causal decoder           207B       Pre LayerNorm                              Learned          GeLU              ✓                            64          128         16384       1024
           OPT [90]                                     Causal decoder           175B       Pre LayerNorm                              Learned          ReLU              ✓                            96           96         12288       2048
           PaLM [56]                                    Causal decoder           540B       Pre LayerNorm                               RoPE           SwiGLU             ×                           118           48         18432       2048
           BLOOM [78]                                   Causal decoder           176B       Pre LayerNorm                               ALiBi           GeLU              ✓                            70          112         14336       2048
           MT-NLG [113]                                 Causal decoder           530B              -                                      -               -               -                           105          128         20480       2048
           Gopher [64]                                  Causal decoder           280B        Pre RMSNorm                               Relative           -               -                            80          128         16384       2048
           Chinchilla [34]                              Causal decoder            70B        Pre RMSNorm                               Relative           -               -                            80           64          8192          -
           Galactica [35]                               Causal decoder           120B       Pre LayerNorm                              Learned          GeLU              ×                            96           80         10240       2048
           LaMDA [68]                                   Causal decoder           137B              -                                   Relative        GeGLU              -                            64          128          8192          -
           Jurassic-1 [107]                             Causal decoder           178B       Pre LayerNorm                              Learned          GeLU              ✓                            76           96         13824       2048
           LLaMA [57]                                   Causal decoder            65B        Pre RMSNorm                                RoPE           SwiGLU             ×                            80           64          8192       2048
           LLaMA 2 [99]                                 Causal decoder            70B        Pre RMSNorm                                RoPE           SwiGLU             ×                            80           64          8192       4096
           Falcon [171]                                 Causal decoder            40B       Pre LayerNorm                               RoPE            GeLU              ×                            60           64          8192       2048
           GLM-130B [93]                                Prefix decoder           130B       Post DeepNorm                               RoPE           GeGLU              ✓                            70           96         12288       2048
           T5 [82]                                     Encoder-decoder            11B        Pre RMSNorm                               Relative         ReLU              ×                            24          128          1024        512



                                        Causal Decoder                                                                  Prefix Decoder                                                                     Encoder-Decoder

             A                                                                               A                                                                                 A




                                                                                                                                                               Encoder
               Survey                                                                          Survey                                                                            Survey




 Decoder                                                                          Decoder
             of                                                                              of                                                                                of




            Models Language Large                                                           Models Language Large                                              Decoder        Models Language Large




                                    A   Survey    of     Large Language Models                                      A   Survey    of   Large Language Models                                          A        Survey    of    Large Language Models




                                                 Decoder                                                                         Decoder                                                                  Encoder                 Decoder

Fig. 9: A comparison of the attention patterns in three mainstream architectures. Here, the blue, green, yellow and grey
rounded rectangles indicate the attention between prefix tokens, attention between prefix and target tokens, attention
between target tokens, and masked attention respectively.


TABLE 6: Comparison of parallelism and complexity of dif-                                                                              to traditional SSMs, RetNet keeps the parallel and recurrent
ferent models. T represents sequence length, H represents                                                                              computation capacity at the same time.
the dimension of the input representation, N represents the                                                                                 • Hyena. Hyena employs long convolution to replace
dimension after compression in SSMs, and M represents the                                                                              the attention module. In the long convolution module, the
number of layers in each Hyena module.                                                                                                 filters based on relative positions are used to aggregate
                                                                                                                                       information at different positions into the middle represen-
  Model                                 Decoding Complexity                      Training Complexity
                                                                                                                                       tations, and gating functions are employed to further project
  Transformer                                 O(H(T + H))                        O(T H(T + H))                                         intermediate representations into the final output. However,
  SSM                                        O(H(N 2 + H))                   O(T H(log T + N 2 + H))                                   due to the long convolution, Hyena can not infer like RNN
  Mamba                                      O(H(N 2 + H))                      O(T H(N 2 + H))
  RWKV                                          O(H 2 )                             O(T H 2 )
                                                                                                                                       and must explicitly access all previous states.
  RetNet                                        O(H 2 )                             O(T H 2 )
  Hyena                                      O(M H(T + H))                    O(T M H(log T + H))                                      4.2.2 Detailed Configuration
                                                                                                                                       Since the launch of Transformer [22], various improvements
                                                                                                                                       have been proposed to enhance its training stability, per-
                                                                                                                                       formance, and computational efficiency. In this part, we
(MSR) to replace the attention module in Transformer. Sim-                                                                             will discuss the corresponding configurations for four major
ilar to linear attention, in the MSR module, the input is                                                                              parts of the Transformer, including normalization, position
first mapped into query, key, and value, and the product                                                                               embeddings, activation functions, and attention and bias.
of key and value is employed to update the state. Then, the                                                                            To make this survey more self-contained, we present the
query is used to project the state into the output. Similar                                                                            detailed formulations for these configurations in Table 7.
                                                                                                                               24

TABLE 7: Detailed formulations for the network configurations. Here, Sublayer denotes a FFN or a self-attention module
in a Transformer layer, d denotes the size of hidden states, pi denotes position embedding at position i, Aij denotes the
attention score between a query and a key, ri−j denotes a learnable scalar based on the offset between the query and the
key, and RΘ,t denotes a rotary matrix with rotation degree t · Θ.

                 Configuration             Method           Equation
                                         Post Norm [22]     Norm(x+Sublayer(x))
             Normalization position      Pre Norm [26]      x + Sublayer(Norm(x))
                                      Sandwich Norm [274]   x + Norm(Sublayer(Norm(x)))
                                                                                                 q P
                                                            x−µ
                                                                 · γ + β, µ = d1 di=1 xi , σ =    1  d                  2
                                                                                P
                                        LayerNorm [275]      σ                                      d   i=1 (xi − µ))
             Normalization method                              x
                                                                                   q P
                                                                                      1    d    2
                                        RMSNorm [276]       RMS(x)
                                                                     · γ, RMS(x) =    d    i=1 xi
                                        DeepNorm [277]      LayerNorm(α · x + Sublayer(x))
                                          ReLU [278]        ReLU(x) = max(x, 0)
                                                                                          √                     Rx    −t2 dt
                                          GeLU [279]        GeLU(x) = 0.5x ⊗ [1 + erf(x/ 2)],    erf(x) = √2π   0 e
              Activation function
                                          Swish [280]       Swish(x) = x ⊗ sigmoid(x)
                                         SwiGLU [281]       SwiGLU(x1 , x2 ) = Swish(x1 ) ⊗ x2
                                         GeGLU [281]        GeGLU(x1 , x2 ) = GeLU(x1 ) ⊗ x2
                                         Absolute [22]      xi = x i + p i
              Position embedding         Relative [82]      Aij = Wq xi xT    T
                                                                           j Wk + ri−j
                                          RoPE [282]        Aij = Wq xi RΘ,i−j xT    T                             T
                                                                                 j Wk = (Wq xi RΘ,i )(Wk xj RΘ,j )
                                          ALiBi [283]                      T  T
                                                            Aij = Wq xi xj Wk − m(i − j)



Normalization Methods. Training instability is a challeng-        employed in existing LLMs except combined with other
ing issue for pre-training LLMs. To alleviate this issue,         strategies (e.g., combining post-LN with pre-LN in GLM-
normalization is a widely adopted strategy to stabilize the       130B [93]).
training of neural networks. In the vanilla Transformer [22],         • Pre-LN. Different from post-LN, pre-LN [287] is applied
LayerNorm [275] is employed. Recently, several advanced           before each sub-layer, and an additional LN is placed before
normalization techniques have been proposed as alterna-           the final prediction. Compared with post-LN, the Trans-
tives to LayerNorm, e.g., RMSNorm, and DeepNorm.                  formers with pre-LN are more stable in training. However,
    • LayerNorm. In the early research, BatchNorm [284] is        it performs worse than the variants with post-LN [288].
a commonly used normalization method. However, it is              Despite the decreasing performance, most LLMs still adopt
difficult to deal with sequence data of variable lengths and      pre-LN due to the training stability. However, one excep-
small-batch data. Thus, LayerNorm [275] is introduced to          tion is that pre-LN has been found unstable in GLM when
conduct layerwise normalization. Specifically, the mean and       training models more than 100B parameters [93].
variance over all activations per layer are calculated to re-         • Sandwich-LN. Based on pre-LN, Sandwich-LN [274]
center and re-scale the activations.                              adds extra LN before the residual connections to avoid
    • RMSNorm. To improve the training speed of Lay-              the value explosion issues in Transformer layer outputs.
erNorm (LN), RMSNorm [276] is proposed by re-scaling              However, it has been found that Sandwich-LN sometimes
the activations with only the root mean square (RMS) of           fails to stabilize the training of LLMs and may lead to the
the summed activations, instead of the mean and variance.         collapse of training [93].
Related research has demonstrated its superiority in training     Activation Functions. To obtain good performance, activa-
speed and performance on Transformer [285]. Representa-           tion functions also need to be properly set in feed-forward
tive models that adopt RMSNorm include Gopher [64] and            networks. In existing LLMs, GeLU activations [289] are
Chinchilla [34].                                                  widely used. Specially, in the latest LLMs (e.g., PaLM and
    • DeepNorm. DeepNorm is proposed by Microsoft [277]           LaMDA), variants of GLU activation [281, 290] have also
to stabilize the training of deep Transformers. With Deep-        been utilized, especially the SwiGLU and GeGLU variants,
Norm as residual connections, Transformers can be scaled          which often achieve better performance in practice [285].
up to 1,000 layers [277], which has shown the advantages          However, compared with GeLU, they require extra parame-
of stability and good performance. It has been adopted by         ters (about 50%) in the feed-forward networks [291].
GLM-130B [93].
                                                                  Position Embeddings. Since the self-attention modules in
Normalization Position. In addition to the normalization          Transformer are permutation equivariant, position embed-
method, normalization position also plays a crucial role in       dings (PE) are employed to inject absolute or relative posi-
the LLMs. There are generally three choices for the normal-       tion information for modeling sequences.
ization position, i.e., post-LN, pre-LN, and sandwich-LN.             • Absolute position embedding. In the vanilla Trans-
    • Post-LN. Post-LN is used in the vanilla Trans-              former [22], absolute position embeddings are employed.
former [22], which is placed between residual blocks. How-        At the bottoms of the encoder and the decoder, the absolute
ever, existing work has found that the training of Trans-         positional embeddings are added to the input embeddings.
formers with post-LN tends to be instable due to the large        There are two variants of absolute position embeddings
gradients near the output layer [286]. Thus, post-LN is rarely    proposed in the vanilla Transformer [22], i.e., sinusoidal and
                                                                                                                                  25

learned position embeddings, where the latter is commonly               • Full attention. In the vanilla Transformer [22], the atten-
used in existing pre-trained language models.                      tion mechanism is conducted in a pairwise way, considering
    • Relative position embedding. Unlike absolute position        the relations between all token pairs in a sequence. It adopts
embeddings, relative positional embeddings are generated           scaled dot-product attention, in which the hidden states
according to the offsets between keys and queries [292].           are mapped into queries, keys, and values. Additionally,
A popular variant of relative PE was introduced in                 Transformer uses multi-head attention instead of single
Transformer-XL [293, 294]. The calculation of attention            attention, projecting the queries, keys, and values with
scores between keys and queries has been modified to               different projections in different heads. The concatenation
introduce learnable embeddings corresponding to relative           of the output of each head is taken as the final output.
positions. T5 [82] further simplified relative positional em-           • Sparse attention. A crucial challenge of full attention
beddings, which was subsequently adopted by Gopher [64].           is the quadratic computational complexity, which becomes
Specifically, it adds learnable scalars to the attention scores,   a burden when dealing with long sequences. Therefore,
where the scalars are calculated based on the distances            various efficient Transformer variants are proposed to re-
between the positions of the query and the key. Compared           duce the computational complexity of the attention mecha-
with the absolute PE, Transformers with relative position          nism [297, 298]. For instance, locally banded sparse attention
embedding can generalize to sequences longer than those            (i.e., Factorized Attention [299] has been adopted in GPT-
sequences for training, i.e., extrapolation [283].                 3 [55]. Instead of the whole sequence, each query can only
    • Rotary position embedding. Rotary position embedding         attend to a subset of tokens based on the positions.
(RoPE) [282] sets specific rotatory matrices based on the               • Multi-query/grouped-query attention. Multi-query atten-
absolute position of each key or query. The scores between         tion refers to the attention variant where different heads
keys and queries can be computed with relative position            share the same linear transformation matrices on the keys
information (Table 7). RoPE combines each consecutive pair         and values [300]. It achieves higher inference speed with
of elements in query and key vectors as a dimension, so there      only a minor sacrifice in model quality. Representative
are d/2 dimensions for an original d-length embedding.             models with multi-query attention include PaLM [56] and
For each dimension i ∈ {1, . . . , d/2}, the pair of involved      StarCoder [98]. To make a trade-off between multi-query
elements will rotate based on the rotation angle t · θi , where    attention and multi-head attention, grouped-query attention
t denotes the position index and θi is the basis in the            (GQA) [301] has been explored. In GQA, heads are assigned
dimension. Following sinusoidal position embeddings [22],          into different groups, and those heads that belong to the
RoPE defines the basis θi as an exponentiation of the base b       same group will share the same transformation matrices.
(set to 10000 by default):                                         Specially, GQA has been adopted and empirically tested in
                                                                   the recently released LLaMA 2 model [99].
         Θ = {θi = b−2(i−1)/d |i ∈ {1, 2, . . . , d/2}}.     (4)
                                                                        • FlashAttention. Different from most existing approx-
Furthermore, a recent study [295] defines the distance re-         imate attention methods that trade-off model quality to
quired to rotate one cycle (2π ) for each dimension as wave-       improve the computing efficiency, FlashAttention [302] pro-
length:                                                            poses to optimize the speed and memory consumption of
                 λi = 2πb2(i−1)/d = 2π/θi .              (5)       attention modules on GPUs from an IO-aware perspective.
                                                                   There exist different levels of memory on modern GPUs,
Due to the excellent performance and the long-term decay
                                                                   e.g., SRAM with a fast IO and HBM with a relatively
property, RoPE is widely adopted in the latest LLMs, e.g.,
                                                                   slow IO. FlashAttention organizes the input into blocks and
PaLM [56] and LLaMA [57]. Based on RoPE, xPos [296] fur-
                                                                   introduces necessary recomputation, both to make better
ther improves the translation invariance and length extrap-
                                                                   use of the fast memory SRAM. Implemented as a fused
olation of Transformer. At each dimension of the rotation
                                                                   kernel in CUDA, FlashAttention has been integrated into
angle vector, xPos adds a special exponential decay that is
                                                                   PyTorch [211], DeepSpeed [74], and Megatron-LM [75]. The
smaller when the basis is larger. It can alleviate the unstable
                                                                   updated version FlashAttention-2 [303] further optimizes
phenomenon during training as the distance increases.
                                                                   the work partitioning of GPU thread blocks and warps, lead-
    • ALiBi. ALiBi [283] is proposed to improve the extrap-
                                                                   ing to around 2× speedup when compared to the original
olation of Transformer. Similar to relative position embed-
                                                                   FlashAttention.
ding, it biases attention scores with a penalty based on the
                                                                        • PagedAttention. It has been observed when LLM are
distances between keys and queries. Different from the rela-
                                                                   deployed on servers, GPU memory is largely occupied by
tive positional embedding methods like T5 [82], the penalty
                                                                   cached attention key and value tensors (called KV cache).
scores in ALiBi are pre-defined without any trainable pa-
                                                                   The major reason is that the input lengths are often varied,
rameters. Empirical results in [283] have shown that ALiBi
                                                                   leading to fragmentation and over-reservation issues. In-
has a better extrapolation performance on sequences that are
                                                                   spired by the classic paging technique in operating systems,
longer than those for training than several popular position
                                                                   PagedAttention has been proposed to improve the memory
embedding methods such as sinusoidal PE [22], RoPE [282],
                                                                   efficiency and throughput of deployed LLMs [304]. In detail,
and T5 bias [82]. In addition, it has been shown that ALiBi
                                                                   PagedAttention partitions each sequence into subsequences,
can also improve training stability in BLOOM [78].
                                                                   and the corresponding KV caches of these subsequences are
Attention. Attention mechanism is a critical component of          allocated into non-contiguous physical blocks. The paging
Transformer. It allows the tokens across the sequence to           technique increases the GPU utilization and enables efficient
interact with each other and compute the representations           memory sharing in parallel sampling.
of the input and output sequence.                                       To put all these discussions together, we summarize the
                                                                                                                                  26

suggestions from existing literature for detailed configura-                       I am sleepy. I start a pot of
tion. For stronger generalization and training stability, it is
                                                                          coffee   0.661    strong    0.008    soup     0.005
suggested to choose the pre RMSNorm for layer normaliza-
                                                                          water    0.119    black     0.008    ...       ...
tion, and SwiGLU or GeGLU as the activation function. In
                                                                          tea      0.057    hot       0.007    happy    4.3e-6
addition, LN may not be used immediately after embedding
                                                                          rice     0.017    oat       0.006    Boh      4.3e-6
layers, which is likely to incur performance degradation. As
                                                                          chai     0.012    beans     0.006    ...       ...
for position embeddings, RoPE or ALiBi is a better choice
since it performs better on long sequences.                       Fig. 10: The probability distribution over the vocabulary in
                                                                  descending order for the next token of the context “I am
4.2.3   Pre-training Tasks                                        sleepy. I start a pot of ”. For ease of discussion, this example is
Pre-training plays a key role that encodes general knowl-         given in word units instead of subword units.
edge from large-scale corpus into the massive model param-
eters. For training LLMs, there are two commonly used pre-
training tasks, namely language modeling and denoising            LM and DAE objectives as different types of denoising tasks,
autoencoding.                                                     namely S-denoiser (LM), R-denoiser (DAE, short span and
                                                                  low corruption), and X-denoiser (DAE, long span or high
Language Modeling. The language modeling task (LM) is             corruption). Among the three denoising tasks, S-denoiser
the most commonly used objective to pre-train decoder-only        is similar to the conventional LM objective (Equation (6)),
LLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of          while R-denoiser and X-denoiser are similar to DAE ob-
tokens x = {x1 , . . . , xn }, the LM task aims to autoregres-    jectives (Equation (7)) but differ from each other in the
sively predict the target tokens xi based on the preceding        lengths of spans and ratio of corrupted text. For input sen-
tokens x<i in a sequence. A general training objective is to      tences started with different special tokens (i.e., {[R], [S],
maximize the following likelihood:                                [X]}), the model will be optimized using the corresponding
                             n
                             X                                    denoisers. MoD has been applied in the latest PaLM 2
                LLM (x) =          log P (xi |x<i ).        (6)   model [120].
                             i=1

    Since most language tasks can be cast as the prediction       4.2.4    Decoding Strategy
problem based on the input, these decoder-only LLMs might         After the LLMs have been pre-trained, it is essential to em-
be potentially advantageous to implicitly learn how to ac-        ploy a specific decoding strategy to generate the appropriate
complish these tasks in a unified LM way. Some studies            output from the LLMs.
have also revealed that decoder-only LLMs can be naturally
                                                                  Background. We start the discussion with the prevalent
transferred to certain tasks by autoregressively predicting
                                                                  decoder-only architecture, and introduce the auto-regressive
the next tokens [26, 55], without fine-tuning. An important
                                                                  decoding mechanism. Since such LLMs are pre-trained
variant of LM is the prefix language modeling task, which is
                                                                  based on the language modeling task (Equation 6), a basic
designed for pre-training models with the prefix decoder
                                                                  decoding method is greedy search that predicts the most
architecture. The tokens within a randomly selected prefix
                                                                  likely token at each step based on the previously generated
would not be used in computing the loss of prefix language
                                                                  tokens, formally modeled as:
modeling. With the same amount of tokens seen during pre-
training, prefix language modeling performs slightly worse                              xi = arg maxP (x|x<i ),                   (8)
than language modeling, since fewer tokens in the sequence                                       x
are involved for model pre-training [29].                         where xi is the token with the highest probability at i-
                                                                  th step of generation conditioned on the context x<i . For
Denoising Autoencoding. In addition to conventional
                                                                  instance in Figure 10, when predicting the next token of
LM, the denoising autoencoding task (DAE) has also been
                                                                  the sentence “I am sleepy. I start a pot of”, greedy search
widely used to pre-train language models [24, 82]. The
                                                                  selects the token “coffee” which has the highest probability
inputs x\x̃ for DAE task are corrupted text with randomly
                                                                  at the current step. Greedy search can achieve satisfactory
replaced spans. Then, the language models are trained to re-
                                                                  results in text generation tasks (e.g., machine translation
cover the replaced tokens x̃. Formally, the training objective
                                                                  and text summarization), in which the output is highly
of DAE is denoted as follows:
                                                                  dependent on the input [305]. However, in terms of open-
                 LDAE (x) = log P (x̃|x\x̃ ).               (7)   ended generation tasks (e.g., story generation and dialog),
                                                                  greedy search sometimes tends to generate awkward and
    However, the DAE task seems to be more complicated            repetitive sentences [306].
in implementation than LM task. As a result, it has not               As another alternative decoding strategy, sampling-
been widely used to pre-train large language models. Exist-       based methods are proposed to randomly select the next
ing LLMs that take DAE as pre-training objectives include         token based on the probability distribution to enhance the
T5 [82] and GLM-130B [93]. These models are mainly trained        randomness and diversity during generation:
to recover the replaced spans in an autoregressive way.
                                                                                            xi ∼ P (x|x<i ).                      (9)
Mixture-of-Denoisers. Mixture-of-Denoisers (MoD) [89],
also known as UL2 loss, was introduced as a unified ob-           For the example in Figure 10, sampling-based methods will
jective for pre-training language models. MoD regards both        sample the word “coffee” with higher probability while
                                                                                                                                27

also retaining the possibilities of selecting the rest words,     from the smallest set having a cumulative probability above
“water”, “tea”, “rice”, etc.                                      (or equal to) p [306]. In practice, the smallest set can be con-
    Not limited to the decoder-only architecture, these two       structed by gradually adding tokens from the vocabulary
decoding methods can be generally applied to encoder-             sorted in descending order of generative probability, until
decoder models and prefix decoder models in a similar way.        their cumulative value exceeds p.
Improvement for Greedy Search. Selecting the token with               Recently, researchers have also explored other sampling
the highest probability at each step may result in overlook-      strategies for LLMs. For instance, η -sampling [313] further
ing a sentence with a higher overall probability but a lower      improves top-p sampling by introducing a dynamic thresh-
local estimation. Next, we introduce several improvement          old based on the probability distribution. Furthermore, con-
strategies to alleviate this issue.                               trastive search [314] and typical sampling [315] can be utilized
    • Beam search. Beam search [307] retains the sentences        to improve the generation coherence during decoding. Since
with the n (beam size) highest probabilities at each step         it has been found that large models tend to assign higher
during the decoding process, and finally selects the gener-       probability to important tokens compared to small models,
ated response with the top probability. Typically, the beam       contrastive decoding [316] utilizes a larger LM (e.g., OPT-
size is configured within the range of 3 to 6. However,           13B) and a smaller LM (e.g., OPT-125M) to measure their
opting for a larger beam size might result in a decline in        log-likelihood differences. Subsequently, tokens are sampled
performance [308].                                                based on the delta value of the probability distribution,
    • Length penalty. Since beam search favours shorter sen-      thereby amplifying the impact of important tokens. Based
tences, imposing length penalty (a.k.a., length normaliza-        on this contrastive idea, DoLa [317] further extends this
tion) is a commonly used technique [309] to overcome this         approach to contrasting the logits across different layers of
issue, which normalizes the sentence probability according        a single LLM, as higher layers tend to assign more weight
to the sentence length (divided by an exponential power α         to important tokens.
of the length).
                                                                  Practical Settings. In practice, existing libraries (e.g., Trans-
    Besides, some researchers [310] propose to penalize the
                                                                  formers [201]) and public APIs of LLMs (e.g., OpenAI) have
generation of previously generated tokens or n-grams to
                                                                  supported various decoding strategies to serve different
alleviate the issue of repetitive generation. In addition,
                                                                  scenarios of text generation. Next, we present the decoding
diverse beam search [311] can be leveraged to produce a
                                                                  settings of several representative LLMs:
set of diverse outputs based on the same input.
                                                                      • T5 [82] utilizes greedy search as the default setting and
Improvement for Random Sampling. Sampling-based                   applies beam search (beam size of 4) with a length penalty
methods sample the token over the whole vocabulary, which         of 0.6 for translation and summarization tasks.
may select wrong or irrelevant tokens (e.g., “happy” and              • GPT-3 [55] employs beam search with a beam size of 4
“Boh” in Figure 10) based on the context. To improve the          and a length penalty of 0.6 for all generation tasks.
generation quality, several strategies have been proposed
                                                                      • Alpaca [146] utilizes sampling-based strategies with
for mitigating or preventing the selection of words with
                                                                  top-k (k = 50), top-p (p = 0.9), and temperature of 0.7 for
exceedingly low probabilities.
                                                                  open-ended generation.
    • Temperature sampling. To modulate the randomness of
sampling, a practical method is to adjust the temperature             • LLaMA [57] applies diverse decoding strategies tai-
coefficient of the softmax function for computing the proba-      lored to specific tasks. For instance, it employs the greedy
bility of the j -th token over the vocabulary:                    search for question answering tasks while utilizes a sam-
                                                                  pling strategy with the temperature settings of 0.1 (pass@1)
                                 exp (lj /t)                      and 0.8 (pass@100) for code generation.
                P (xj |x<i ) = P                   ,       (10)
                                 j ′ exp (lj ′ /t)                    • OpenAI API supports several basic decoding strate-
where lj ′ is the logits of each word and t is the temperature    gies, including greedy search (by setting temperature to
coefficient. Reducing the temperature t increases the chance      0), beam search (with the setting best_of), temperature
of selecting words with high probabilities while decreases        sampling (with the setting temperature), nucleus sam-
the chances of selecting words with low probabilities. When       pling (with the setting top_p). It also introduce param-
t is set to 1, it becomes the default random sampling; when       eters presence_penalty and frequency_penalty to
t is approaching 0, it is equivalent to greedy search. In         control the repetition degree of generation. According to
addition, when t goes to infinity, it degenerates to uniform      the OpenAI’s document, their APIs would produce different
sampling.                                                         outputs even if the input and the hyper-parameters are the
    • Top-k sampling. Different from temperature sampling,        same. Setting temperature to 0 can yield more deterministic
top-k sampling directly truncates the tokens with lower           outputs, albeit with a slight chance of variability.
probability and only samples from the tokens with the top
k highest probabilities [312]. For example in Figure 10, top-
5 sampling will sample from the words “coffee”, “water”,          4.2.5   Summary and Discussion
“tea”, “rice”, and “chai” from their re-scaled probabilities.
    • Top-p sampling. Since top-k sampling does not consider      The choice of architecture and pre-training tasks may incur
the overall possibility distribution, a constant value of k may   different inductive biases for LLMs, which would lead to
be not be suitable for different contexts. Therefore, top-p       different model capacities. In this part, we discuss one open
sampling (a.k.a., nucleus sampling) is proposed by sampling       issue about the architecture choice for LLMs.
                                                                                                                              28

    Why does Predicting the Next Word Works?                     tures and pre-training objectives are in need to analyze how
                                                                 the choices of the architecture and pre-training tasks affect
    The essence of decoder-only architecture is to               the capacity of LLMs, especially for encoder-decoder archi-
    accurately predict the next word for reconstructing          tectures. Despite the effectiveness of decoder-only architec-
    the pre-training data. Till now, there has been no           ture, it is also suggested to make more diverse exploration
    formal study that theoretically demonstrates its             on architecture design. Besides the major architecture, the
    advantage over other architectures. An interesting           detailed configuration of LLM is also worth attention, which
    explanation was from Ilya Sutskever during the               has been discussed in Section 4.2.2.
    interview held by Jensen Huanga . The original
    transcript from the interview was copied belowb :            4.3   Model Training
    Say you read a detective novel. It’s                         In this part, we review the important settings, techniques,
    like complicated plot, a storyline,                          or tricks for training LLMs.
    different characters, lots of events,
    mysteries like clues, it’s unclear.                          4.3.1 Optimization Setting
    Then, let’s say that at the last                             For parameter optimization of LLMs, we present the com-
    page of the book, the detective has                          monly used settings for batch training, learning rate, opti-
    gathered all the clues, gathered                             mizer, and training stability.
    all the people and saying, "okay,                            Batch Training. For language model pre-training, existing
    I’m going to reveal the identity of                          work generally sets the batch size to a large number (e.g.,
    whoever committed the crime and that                         2,048 examples or 4M tokens) to improve the training
    person’s name is". Predict that word.                        stability and throughput. For LLMs such as GPT-3 and
    ...                                                          PaLM, they have introduced a new strategy that dynam-
    Now, there are many different words.                         ically increases the batch size during training, ultimately
    But predicting those words better and                        reaching a million scale. Specifically, the batch size of GPT-3
    better, the understanding of the text                        is gradually increasing from 32K to 3.2M tokens. Empirical
    keeps on increasing. GPT-4 predicts                          results have demonstrated that the dynamic schedule of
    the next word better.                                        batch size can effectively stabilize the training process of
      a. https://www.nvidia.com/en-us/on-                        LLMs [56].
    demand/session/gtcspring23-S52092/
      b. https://lifearchitect.ai/ilya/                          Learning Rate. Existing LLMs usually adopt a similar learn-
                                                                 ing rate schedule with the warm-up and decay strategies
                                                                 during pre-training. Specifically, in the initial 0.1% to 0.5%
Architecture Choice. In earlier literature of pre-trained lan-   of the training steps, a linear warm-up schedule is employed
guage models, there are lots of discussions on the effects       for gradually increasing the learning rate to the maximum
of different architectures [29, 89]. However, most LLMs are      value that ranges from approximately 5 × 10−5 to 1 × 10−4
developed based on the causal decoder architecture, and          (e.g., 6 × 10−5 for GPT-3). Then, a cosine decay strategy
there still lacks a theoretical analysis on its advantage over   is adopted in the subsequent steps, gradually reducing the
the other alternatives. Next, we briefly summarize existing      learning rate to approximately 10% of its maximum value,
discussions on this issue.                                       until the convergence of the training loss.
    • By pre-training with the LM objective, it seems that
                                                                 Optimizer. The Adam optimizer [318] and AdamW opti-
causal decoder architecture can achieve a superior zero-
                                                                 mizer [319] are widely utilized for training LLMs (e.g., GPT-
shot and few-shot generalization capacity. Existing research
                                                                 3), which are based on adaptive estimates of lower-order
has shown that without multi-task fine-tuning, the causal
                                                                 moments for first-order gradient-based optimization. Com-
decoder has better zero-shot performance than other archi-
                                                                 monly, its hyper-parameters are set as follows: β1 = 0.9,
tectures [29]. The success of GPT-3 [55] has demonstrates
                                                                 β2 = 0.95 and ϵ = 10−8 . Meanwhile, the Adafactor op-
that the large causal decoder model can be a good few-
                                                                 timizer [320] has also been utilized in training LLMs (e.g.,
shot learner. In addition, instruction tuning and alignment
                                                                 PaLM and T5), which is a variant of the Adam optimizer
tuning discussed in Section 5 have been proven to fur-
                                                                 specially designed for conserving GPU memory during
ther enhance the capability of large causal decoder mod-
                                                                 training. The hyper-parameters of the Adafactor optimizer
els [66, 67, 69].
                                                                 are set as: β1 = 0.9 and β2 = 1.0 − k −0.8 , where k denotes
    • Scaling law has been widely observed in causal de-
                                                                 the number of training steps.
coders. By scaling the model size, the dataset size, and
the total computation, the performance of causal decoders        Stabilizing the Training. During the pre-training of LLMs,
can be substantially improved [30, 55]. Thus, it has become      it often suffers from the training instability issue, which
an important strategy to increase the model capacity of          may cause the model collapse. To address this issue, weight
the causal decoder via scaling. However, more detailed           decay and gradient clipping have been widely utilized,
investigation on encoder-decoder models is still lacking, and    where existing studies [55, 78, 90, 93, 113] commonly set
more efforts are needed to investigate the performance of        the threshold of gradient clipping to 1.0 and weight decay
encoder-decoder models at a large scale.                         rate to 0.1. However, with the scaling of LLMs, the training
    More research efforts about the discussions on architec-     loss spike is also more likely to occur, leading to unstable
                                                                                                                                        29

                                 TABLE 8: Detailed optimization settings of several existing LLMs.

                             Batch Size      Learning                                               Precision   Weight   Grad
  Model                                                  Warmup       Decay Method      Optimizer                               Dropout
                              (#tokens)          Rate                                               Type        Decay    Clip
  GPT3 (175B)                 32K→3.2M        6 × 10−5    yes     cosine decay to 10%     Adam        FP16       0.1     1.0       -
  PanGu-α (200B)                       -      2 × 10−5     -                -             Adam          -        0.1      -        -
  OPT (175B)                         2M     1.2 × 10−4    yes        manual decay        AdamW        FP16       0.1      -       0.1
  PaLM (540B)                    1M→4M        1 × 10−2    no      inverse square root   Adafactor     BF16       lr2     1.0      0.1
  BLOOM (176B)                       4M       6 × 10−5    yes     cosine decay to 10%     Adam        BF16       0.1     1.0      0.0
  MT-NLG (530B)            64 K→3.75M         5 × 10−5    yes     cosine decay to 10%     Adam        BF16       0.1     1.0       -
  Gopher (280B)                  3M→6M        4 × 10−5    yes     cosine decay to 10%     Adam        BF16        -      1.0       -
  Chinchilla (70B)             1.5M→3M        1 × 10−4    yes     cosine decay to 10%    AdamW        BF16        -       -        -
  Galactica (120B)                   2M       7 × 10−6    yes     linear decay to 10%    AdamW          -        0.1     1.0      0.1
  LaMDA (137B)                     256K              -     -                -               -         BF16        -       -        -
  Jurassic-1 (178B)          32 K→3.2M        6 × 10−5    yes               -               -           -         -       -        -
  LLaMA (65B)                        4M     1.5 × 10−4    yes     cosine decay to 10%    AdamW          -        0.1     1.0       -
  LLaMA 2 (70B)                      4M     1.5 × 10−4    yes     cosine decay to 10%    AdamW          -        0.1     1.0       -
  Falcon (40B)                       2M    1.85 × 10−4    yes     cosine decay to 10%    AdamW        BF16       0.1      -        -
  GLM (130B)               0.4M→8.25M         8 × 10−5    yes     cosine decay to 10%    AdamW        FP16       0.1     1.0      0.1
  T5 (11B)                          64K       1 × 10−2    no      inverse square root   AdaFactor       -         -       -       0.1
  ERNIE 3.0 Titan (260B)               -      1 × 10−4     -                -             Adam        FP16       0.1     1.0       -
  PanGu-Σ (1.085T)                 0.5M       2 × 10−5    yes               -             Adam        FP16        -       -        -



training. To mitigate this problem, PaLM [56] and OPT [90]              technique is simple in implementation, and most of existing
use a simple strategy that restarts the training process from           popular deep learning libraries have already implemented
an earlier checkpoint before the occurrence of the spike and            data parallelism, such as TensorFlow and PyTorch.
skips over the data that may have caused the problem.                       • Pipeline parallelism. Pipeline parallelism aims to dis-
Further, GLM [93] finds that the abnormal gradients of the              tribute the different layers of a LLM into multiple GPUs.
embedding layer usually lead to spikes, and proposes to                 Especially, in the case of a Transformer model, pipeline
shrink the embedding layer gradients to alleviate it.                   parallelism loads consecutive layers onto the same GPU, to
                                                                        reduce the cost of transmitting the computed hidden states
4.3.2     Scalable Training Techniques                                  or gradients between GPUs. However, a naive implemen-
As the model and data sizes increase, it has become chal-               tation of pipeline parallelism may result in a lower GPU
lenging to efficiently train LLMs under a limited computa-              utilization rate as each GPU has to wait for the previous
tional resource. Especially, two primary technical issues are           one to complete the computation, leading to the unneces-
required to be resolved, i.e., increasing training throughput           sary cost of bubbles overhead [321]. To reduce these bubbles
and loading larger models into GPU memory. In this part,                in pipeline parallelism, GPipe [321] and PipeDream [322]
we review several widely used approaches in existing work               propose the techniques of padding multiple batches of data
to address the above two challenges, namely 3D paral-                   and asynchronous gradient update to improve the pipeline
lelism [75, 321, 322] and mixed precision training [323], and           efficiency.
also give general suggestions about how to utilize them for                 • Tensor parallelism. Tensor parallelism is also a com-
training.                                                               monly used technique that aims to decompose the LLM for
                                                                        multi-GPU loading. Unlike pipeline parallelism, tensor par-
3D Parallelism. 3D parallelism is actually a combination of             allelism focuses on decomposing the tensors (the parameter
three commonly used parallel training techniques, namely                matrices) of LLMs. For a matrix multiplication operation
data parallelism, pipeline parallelism [321, 322], and tensor           Y = XA in the LLM, the parameter matrix A can be
parallelism [75]24 . We next introduce the three parallel train-        split into two submatrices, A1 and A2 , by column, which
ing techniques.                                                         can be expressed as Y = [XA1 , XA2 ]. By placing matrices
    • Data parallelism. Data parallelism is one of the most             A1 and A2 on different GPUs, the matrix multiplication
fundamental approaches to improving the training through-               operation would be invoked at two GPUs in parallel, and
put. It replicates the model parameters and optimizer states            the final result can be obtained by combining the outputs
across multiple GPUs and then distributes the whole train-              from the two GPUs through across-GPU communication.
ing corpus into these GPUs. In this way, each GPU only                  Currently, tensor parallelism has been supported in several
needs to process the assigned data for it, and performs                 open-source libraries, e.g., Megatron-LM [75], and can be
the forward and backward propagation to obtain the gra-                 extended to higher-dimensional tensors. Also, Colossal-AI
dients. The computed gradients on different GPUs will be                has implemented tensor parallelism for higher-dimensional
further aggregated to obtain the gradients of the entire batch          tensors [324–326] and proposed sequence parallelism [327]
for updating the models in all GPUs. In this way, as the                especially for sequence data, which can further decompose
calculations of gradients are independently performed on                the attention operation of the Transformer model.
different GPUs, the data parallelism mechanism is highly
scalable, enabling the way that increases the number of                 Mixed Precision Training. In previous PLMs (e.g.,
GPUs to improve training throughput. Furthermore, this                  BERT [23]), 32-bit floating-point numbers, also known as
                                                                        FP32, have been predominantly used for pre-training. In
  24. Model parallelism is a more broader term that includes tensor     recent years, to pre-train extremely large language models,
parallelism and pipeline parallelism in some work [75].                 some studies [323] have started to utilize 16-bit floating-
                                                                                                                               30

point numbers (FP16), which reduces memory usage and             to supervised fine-tuning [66] and multi-task prompted
communication overhead. Additionally, as popular NVIDIA          training [28]. In order to perform instruction tuning, we first
GPUs (e.g., A100) have twice the amount of FP16 computa-         need to collect or construct instruction-formatted instances.
tion units as FP32, the computational efficiency of FP16 can     Then, we employ these formatted instances to fine-tune
be further improved. However, existing work has found that       LLMs in a supervised learning way (e.g., training with the
FP16 may lead to the loss of computational accuracy [64, 78],    sequence-to-sequence loss). After instruction tuning, LLMs
which affects the final model performance. To alleviate it, an   can demonstrate superior abilities to generalize to unseen
alternative called Brain Floating Point (BF16) has been used     tasks [28, 67, 69], even in a multilingual setting [94].
for training, which allocates more exponent bits and fewer           A recent survey [331] presents a systematic overview
significant bits than FP16. For pre-training, BF16 generally     of the research on instruction tuning. In comparison to
performs better than FP16 on representation accuracy [78].       that, we mainly focus on the effect of instruction tuning
                                                                 on LLMs and provide detailed guidelines or strategies for
Overall Training Suggestion. In practice, the above train-
                                                                 instance collection and tuning. In addition, we also discuss
ing techniques, especially 3D parallelism, are often jointly
                                                                 the use of instruction tuning for satisfying the real needs of
used to improve the training throughput and large model
                                                                 users, which has been widely applied in existing LLMs, e.g.,
loading. For instance, researchers have incorporated 8-way
                                                                 InstructGPT [66] and GPT-4 [46].
data parallelism, 4-way tensor parallelism, and 12-way
pipeline parallelism, enabling the training of BLOOM [78]
on 384 A100 GPUs. Currently, open-source libraries like          5.1.1   Formatted Instance Construction
DeepSpeed [74], Colossal-AI [203], and Alpa [328] can well       Generally, an instruction-formatted instance consists of a
support the three parallel training methods. To reduce the       task description (called an instruction), an optional input,
memory redundancy, ZeRO, FSDP, and activation recom-             the corresponding output, and a small number of demon-
putation techniques [77, 329] can be also employed for           strations (optional). As important public resources, existing
training LLMs, which have already been integrated into           studies have released a large number of labeled data format-
DeepSpeed, PyTorch, and Megatron-LM. In addition, the            ted in natural language (see the list of available resources in
mixed precision training technique such as BF16 can be           Table 3) as introduced in Section 3.3.1. Next, we introduce
also leveraged to improve the training efficiency and reduce     four major methods for constructing formatted instances
GPU memory usage, while it requires necessary support on         (see an illustration in Figure 11) and then discuss several
hardware (e.g., A100 GPU). Because training large models is      key factors for instance construction.
a time-intensive process, it would be useful to forecast the
model performance and detect abnormal issues at an early         Formatting NLP Task Datasets. Before instruction tuning
stage. For this purpose, GPT-4 [46] has recently introduced      was proposed, several early studies [181, 332, 333] collected
a new mechanism called predictable scaling built on a deep       the instances from a diverse range of traditional NLP tasks
learning stack, enabling the performance prediction of large     (e.g., text summarization, text classification, and translation)
models with a much smaller model, which might be quite           to create supervised multi-task training datasets. As a major
useful for developing LLMs. In practice, one can further         source of instruction tuning instances, it is convenient to for-
leverage the supporting training techniques of mainstream        mat these multi-task training datasets with natural language
deep learning frameworks. For instance, PyTorch supports         task descriptions. Specifically, recent work [28, 66, 67, 88]
the data parallel training algorithm FSDP [330] (i.e., fully     augments the labeled datasets with human-written task de-
sharded data parallel), which allows for partial offloading      scriptions, which instructs LLMs to understand the tasks by
of training computations to CPUs if desired.                     explaining the task goal. For example, in Figure 11(a), a task
                                                                 description “Please answer this question” is added for each
                                                                 example in the question-answering task. After instruction
5     P OST- TRAINING OF LLM S                                   tuning, LLMs can generalize well to other unseen tasks by
After pre-training, LLMs can acquire the general abilities       following their task descriptions [28, 67, 69]. In particular,
for solving various tasks. However, an increasing number         it has been shown that instructions are the crucial factor
of studies have shown that LLM’s abilities can be further        in task generalization ability for LLMs [67]: by fine-tuning
adapted according to specific goals. In this section, we         the model on labeled datasets with the task descriptions re-
introduce two major approaches to adapting pre-trained           moved, it results in a dramatic drop in model performance.
LLMs, namely instruction tuning and alignment tuning. The        To better generate labeled instances for instruction tuning,
former approach mainly aims to enhance (or unlock) the           a crowd-sourcing platform, PromptSource [180] has been
abilities of LLMs, while the latter approach aims to align the   proposed to effectively create, share, and verify the task
behaviors of LLMs with human values or preferences. Fur-         descriptions for different datasets. To enrich the training
ther, we will also discuss efficient tuning and quantization     instances, several studies [28, 181, 334] also try to invert the
for model adaptation in resource-limited settings. In what       input-output pairs of existing instances with specially de-
follows, we will introduce the four parts in detail.             signed task descriptions for instruction tuning. For instance,
                                                                 given a question-answer pair, we can create a new instance
5.1   Instruction Tuning                                         by predicting the answer-conditioned question (e.g., “Please
                                                                 generate a question based on the answer:”).
In essence, instruction tuning is the approach to fine-tuning
pre-trained LLMs on a collection of formatted instances in       Formatting Daily Chat Data. Despite that a large number
the form of natural language [67], which is highly related       of training instances have been formatted with instructions,
                                                                                                                                                            31


                                                                  API collection         Human-written         Seed
                                                                                                                                      Instance Pool
     Human-written                  Task description                                                         Instances

                                Please answer this question:                         &                          Instruction
                                                                                                                Generation         LLM                 Filter
                                    Demonstrations                                                                       Task description
                                                                            Task description
     NLP Datasets             Q: What is the capital of France?                                                      Give me a quote from a
                              A: Paris.                               Can you recommend some ways
                                                                                                                   famous person on this topic.
                                                                             to lose weight?
                              Q: What is the capital of Brazil?
                              A: Brasilia                                                                     Input-Output
                                                                  Desired output written by human              Generation          LLM

                                   Input          Output                          Output                            Input             Output
                                                                   Here are some ways to lose weight:        Input: The importance of being honest.
                             Q: What is the capital of China?
                                                                   1. Eat a healthy diet: Focus on …         Output: Honesty is the first chapter in
                             A: Beijing.
                                                                   2. Increase physical activity: Engage …   the book of wisdom.


           (a) Formatting Task Datasets                            (b) Formatting Daily Chat Data                 (c) Formatting Synthetic Data

Fig. 11: An illustration of instance formatting and three different methods for constructing the instruction-formatted
instances.


they mainly come from public NLP datasets, either lack-                        economical way to generate large-scale instruction data for
ing instruction diversity or mismatching with real human                       LLMs. However, the instances generated by the Self-Instruct
needs [66]. To overcome this issue, InstructGPT [66] pro-                      method might be simplistic or lack the diversity. To improve
poses to take the queries that real users have submitted to                    the quality of synthetic instructions, WizardLM [335] intro-
the OpenAI API as the task descriptions. Additionally, to                      duces Evol-Instruct by proposing in-depth and in-breadth
enrich the task diversity, human labelers are also asked to                    evolving to enrich the complexity and diversity of the
compose the instructions for real-life tasks, including open-                  instances. Furthermore, Self-Align [336] establishes multiple
ended generation, open question answering, brainstorm-                         human-aligned principles to filter the synthesized instances.
ing, and chatting. Then, they let another group of labelers                    It then employs these instances to train a LLM in order
directly answer these instructions as the output. Finally,                     to yield more aligned instances. To enhance the quality
they pair one instruction (i.e., the collected user query) and                 of the instance output, researchers directly adopt human-
the expected output (i.e., the human-written answer) as a                      written texts as the output and synthesize corresponding
training instance. Note that InstructGPT also employs these                    instructions using ICL examples [337].
real-world tasks formatted in natural language for align-
ment tuning (discussed in Section 5.2). Further, GPT-4 [46]                    Key Factors for Instruction Dataset Construction. The
has designed potentially high-risk instructions and guided                     quality of instruction instances has an important impact
the model to reject these instructions through supervised                      on the performance of the model. Here, we discuss some
fine-tuning for safety concerns. Considering the absence                       essential factors for instance construction.
of high-quality public chat data, several studies have also                        • Scaling the instructions. It has been widely shown that
collected users’ chat requests as input data, and then utilized                scaling the number of tasks can largely enhance the gen-
ChatGPT or GPT-4 to generate responses as output data. A                       eralization ability of LLMs [28, 67, 88]. With the increasing
notable example of such a dataset is the conversational data                   of the task number, the model performance initially shows
from ShareGPT [153]. Additionally, Dolly [185] and Ope-                        a continuous growth pattern, while the gain becomes neg-
nAssistant [186] have further released their conversation                      ligible when it reaches a certain level [69, 88]. A plausible
data, which has been carefully labeled by human annotators                     speculation is that a certain number of representative tasks
to attain a high level of quality.                                             can provide relatively sufficient knowledge and adding
                                                                               more tasks may not bring additional gains [69]. Also, it is
Formatting Synthetic Data. To reduce the burden of human                       beneficial to enhance the diversity of the task descriptions in
annotation or manual collection, several semi-automated                        several aspects, such as length, structure, and creativity [28].
approaches [147] have been proposed for constructing in-                       As for the number of instances per task, it has been found
stances by feeding existing instances into LLMs to synthe-                     that a small number of instances can usually saturate the
size diverse task descriptions and instances. As illustrated                   generalization performance of the model to perform a spe-
in Figure 11(c), the Self-Instruct method only needs 175                       cific task [67, 69]. Specially, several recent work [338, 339]
instances as the initial task pool. Then, they randomly select                 has explored the effect of fine-tuning with a small amount
a few instances from the pool as demonstrations and prompt                     of high-quality instruction data (e.g., one or a few thousand
a LLM to generate new instructions and corresponding                           instances), showing very promising results on the evalua-
input-output pairs. After the quality and diversity filter-                    tion tasks. In contrast, another line of studies continue to
ing, newly generated instances would be added into the                         explore the scaling effect of instruction data [340, 341]. For
task pool. Hence, the synthetic method is an effective and                     example, Orca [340] scales up the synthesized instances to
                                                                                                                               32

5 million with step-by-step explanations, and it achieves         tion for the LLM capability, more complex metrics (e.g.,
superior performance across a wide range of tasks.                IFD [346]) have also been proposed, which are computed by
    • Formatting design. As an important factor, the design       combining multiple simple metrics. Additionally, diversity-
of natural language format also highly impacts the gener-         aware sampling methods have been introduced to ensure
alization performance of LLMs [88]. Typically, we can add         the overall coverage of representative instruction data [347].
task descriptions and optional demonstrations to the input-       Besides, when downstream task data is available, cross-
output pairs of existing datasets, where the task description     instance gradient similarity can be employed to measure
is the most key part for LLMs to understand the task [88].        the value of training instances for the target task. LESS [348]
Further, it can lead to substantial improvements by using an      computes gradients for both downstream validation and
appropriate number of exemplars as demonstrations [69],           training instruction data, to evaluate the contribution of
which also alleviates the model sensitivity to instruction        instruction data based on extensions of influence func-
engineering [67, 69]. However, incorporating other compo-         tion [349].
nents (e.g., things to avoid, reasons, and suggestions) into          To summarize, diversity and quality of instructions are
instructions may have a negligible or even adverse effect         important factors to consider when scaling the number of
on the performance of LLMs [88, 179]. Recently, to elicit         instances [338]. As the capacities of LLMs improve, data
the step-by-step reasoning ability of LLMs, some work [69]        synthesis methods have become the mainstream approach
proposes to include chain-of-thought (CoT) examples for           for generating large amount of instruction data. Following
some reasoning datasets, such as arithmetic reasoning. It         this trend, there are increasingly more automatically gener-
has been shown that fine-tuning LLMs with both CoT and            ated instruction datasets available, and selection and refin-
non-CoT examples can lead to a good performance across            ing methods are key to effectively use these datasets. To help
various reasoning tasks, including those that require multi-      readers understand how different factors affect instruction
hop reasoning ability (e.g., commonsense question answer-         tuning, we conduct an empirical study by experimenting
ing and arithmetic reasoning) as well as those without the        with multiple specially constructed instruction datasets in
need for such a reasoning way (e.g., sentiment analysis and       Section 5.1.4.
extractive question answering) [69, 95].
    • Instruction quality improvement. Data quality is very
important for the performance of instruction tuning, and          5.1.2   Instruction Tuning Strategies
a surge of work has been proposed to further improve              Unlike pre-training, instruction tuning is often more effi-
the quality of existing instruction datasets. Typically, these    cient since only a moderate number of instances are used
methods mostly rely on carefully designed prompts, to             for training. Since instruction tuning can be considered as
guide LLMs to refine or rewrite the given instruction. Wiz-       a supervised training process, its optimization is different
ardLM [335] aims to complexify and diversify the Alpaca           from pre-training in several aspects [69], such as the training
dataset [187] by devising prompts to widen and deepen             objective (i.e., sequence-to-sequence loss) and optimization
the required knowledge of given instructions. It also crafts      configuration (e.g., smaller batch size and learning rate),
the filter strategy to remove the low-quality instructions.       which require special attention in practice. In addition to
To further provide fine-grained knowledge guidance, recent        these optimization configurations, there are also four im-
work also involves the knowledge taxonomy into the input          portant aspects to consider for instruction tuning:
prompt, e.g., knowledge key points [342] and the human-
AI conversation topic taxonomy [343]. To guarantee the in-        Balancing the Data Distribution. Since instruction tun-
struction quality, early methods mainly employ close-source       ing involves a mixture of different tasks, it is important
API or powerful open-source LLMs, which would take a              to balance the proportion of different tasks during fine-
huge cost for large-scale instructions synthesis. Considering     tuning. A widely used method is the examples-proportional
this issue, recent studies widely explore the potential of        mixing strategy [82], i.e., combining all the datasets and
relatively small models for data synthesis. For instance,         sampling each instance equally from the mixed datasets.
JiuZhang3.0 [344] fine-tunes a 7B language model to syn-          Furthermore, increasing the sampling ratio of high-quality
thesize questions by distilling the knowledge from GPT-           collections (e.g., FLAN [67] and P3 [180]) can generally
4, and then utilizes it to synthesize massive high-quality        lead to performance improvement according to recent find-
instructions based on pre-training corpus. Such a way can         ings [69, 95]. Further, it is common to set a maximum
achieve better performance on mathematical reasoning tasks        cap to control the maximum number of examples that a
than baseline methods, with only 20% data synthesis cost.         dataset can contain during instruction tuning [82], which
    • Instruction selection. As a surge of instruction datasets   is set to prevent larger datasets from overwhelming the
are proposed, it is non-trivial to select the high-quality        entire distribution [82, 95]. In practice, the maximum cap
ones from them to construct the training dataset. Generally,      is typically set to several thousands or tens of thousands
existing work either leverages quality estimation metrics or      according to different datasets [67, 69]. Recently, it has been
employs LLMs as the judge model to rank all the instruc-          empirically found that existing instruction datasets (Table 3)
tion instances, and then selects those with relatively higher     mainly focus on enhancing LLMs’ capabilities in certain
scores. Concretely, for metrics, perplexity and other heuristic   aspects, and a single dataset alone cannot lead to a compre-
measurements (e.g., length) [345] have been widely used in        hensive enhancement in model capacity [350]. Therefore, it
practice, e.g., we can consider removing high-perplexity or       is often suggested to use a mixture of existing instruction
very short instructions, which might correspond to low-           datasets to achieve a balanced improvement in different
quality ones. To better estimate the effect of an instruc-        capacities, including NLP task data (e.g., FLAN v2 [351]),
                                                                                                                                 33

TABLE 9: Basic statistics of the required number of GPUs, tuning time, batch size (denoted as BS) per device (full tuning
and LoRA tuning), and inference rate (the number of generated tokes per second). Our experiments are conducted based
on two Linux servers having 8 A800-80G SXM4 GPUs with 6 NVSwitch and 8 3090-24G GPUs, respectively. The major
difference between A800 and A100 lies in the NVLink interconnect speed. Thus, our estimations about training and
inference efficiency would be slightly improved for A100, while the rest memory consumption would remain the same.
For full tuning experiments, we use data parallel training, ZeRO Stage 3, BF16, and gradient checkpointing. Additionally,
the LoRA tuning can be executed on one 80G GPU utilizing INT8 quantization with the rank setting set to 16. All the
experiments are conducted with Alpaca-52K dataset by training LLaMA models three epochs. The max sequence length
for both training settings is set to 512. The inference experiments are performed with the batch size set to 1.

                     A800 Full Tuning A800 LoRA Tuning A800 Inference (16-bit) 3090 Inference (16-bit) 3090 Inference (8-bit)
          Models
                     #GPU BS Time #GPU BS Time #GPU              #Token/s      #GPU      #Token/s      #GPU     #Token/s
       LLaMA (7B)       2    8   3.0h    1    80    3.5h     1         36.6          1         24.3        1        7.5
       LLaMA (13B)      4    8   3.1h    1    48    5.1h     1         26.8          2          9.9        1        4.5
       LLaMA (30B)      8    4   6.1h    1    24   14.3h     1         17.7          4          3.8        2        2.6
       LLaMA (65B)     16    2   11.2h   1    4    60.6h     2          8.8          8          2.0        4        1.5



chat data (e.g., ShareGPT [153]), and synthetic data (e.g.,         ing context for all splits (i.e., at each utterance from the
GPT4-Alpaca [352]).                                                 user). In such a fine-tuning way, it is apparent that there
                                                                    exist overlapping utterances in the split examples from a
Combining Instruction Tuning and Pre-Training. To make              conversation. To save the training cost, Vicuna [152] has
the tuning process more effective and stable, OPT-IML [95]          adopted an efficient way that feeds the whole conversation
incorporates pre-training data during instruction tuning,           into the LLM, but relies on a loss mask that only computes
which can be regarded as regularization for model tuning.           the loss on the responses of the chatbot for training. It can
Further, instead of using a separate two-stage process (pre-        significantly reduce the compute costs derived from the
training then instruction tuning), some studies attempt to          overlapped utterances.
train a model from scratch with a mixture of pre-training               • Establishing self-identification for LLM. To deploy LLMs
data (i.e., plain texts) and instruction tuning data (i.e., for-    for real-world applications, it is necessary to establish its
matted datasets) using multi-task learning [82]. Specifically,      identity and make LLMs aware of these identity informa-
GLM-130B [93] and Galactica [35] integrate instruction-             tion, such as name, developer and affiliation. A practical
formatted datasets as a small proportion of the pre-training        way is to create identity-related instructions for fine-tuning
corpora to pre-train LLMs, which potentially achieves the           the LLM. It is also feasible to prefix the input with the self-
advantages of pre-training and instruction tuning at the            identification prompt, e.g., “The following is a conversation
same time.                                                          between a human and an AI assistant called C HATBOT N AME,
Multi-stage Instruction Tuning. For instruction tuning,             developed by D EVELOPER.”, where C HATBOT N AME and D E -
there are two kinds of important instruction data, namely           VELOPER refer to the name and developer of the chatbot,
task-formatted instructions and daily chat instructions. Gen-       respectively.
erally, the former has a significantly larger volume than the           In addition to the above practical strategies and tricks,
latter. It is important to balance the training with the two        existing work has also used other tricks, e.g., concatenating
kinds of instruction data. In addition to carefully mixing          multiple examples into a single sequence to approach the
different instruction data, we can also adopt a multi-stage         max length [353].
instruction tuning strategy [341], where LLMs are first fine-
tuned with large-scale task-formatted instructions and sub-         5.1.3     The Effect of Instruction Tuning
sequently fine-tuned on daily chat ones. To avoid the capac-        In this part, we discuss the effect of instruction tuning on
ity forgetting issue, it is also useful to add an amount of task-   LLMs in three major aspects.
formatted instructions at the second stage. Actually, such
a multi-stage tuning strategy can be also applied to other          Performance Improvement. Despite being tuned on a mod-
settings for instruction tuning. For example, we can sched-         erate number of instances, instruction tuning has become
ule different fine-tuning stages with progressively increased       an important way to improve or unlock the abilities of
levels on difficulty and complexity, and gradually improve          LLMs [69]. Recent studies have experimented with language
the capacities of LLMs to follow complex instructions.              models in multiple scales (ranging from 77M to 540B),
                                                                    showing that the models of different scales can all benefit
Other Practical Tricks. In practice, there are also several         from instruction tuning [69, 334], yielding improved perfor-
useful strategies and tricks that are helpful to improve the        mance as the parameter scale increases [94]. Further, smaller
fine-tuning performance of LLMs. We list several represen-          models with instruction tuning can even perform better
tative ones as follows:                                             than larger models without fine-tuning [28, 69]. Besides
    • Efficient training for multi-turn chat data. Given a multi-   the model scale, instruction tuning demonstrates consistent
turn chat example (the conversation between a user and              improvements in various model architectures, pre-training
chatbot), a straightforward fine-tuning way is to split it into     objectives, and model adaptation methods [69]. In practice,
multiple context-response pairs for training: a LLM is fine-        instruction tuning offers a general approach to enhancing
tuned to generate the response based on the correspond-             the abilities of existing language models [69] (including
                                                                                                                                       34

small-sized PLMs). Also, it is much less costly than pre-               Instruction Datasets. According to the discussion in Sec-
training, since the amount of instruction data required by              tion 5.1.1, we mainly consider three common kinds of in-
LLMs is significantly smaller than pre-training data.                   structions as follows:
                                                                             • Task-specific instructions. For the first type of instruc-
Task Generalization. Instruction tuning encourages the                  tions, we adopt the most commonly-used multi-task instruc-
model to understand natural language instructions for task              tion dataset, FLAN-T5 [69], which contains 1,836 tasks and
completion. It endows LLMs with the ability (often con-                 over 15M instructions by combining four data mixtures from
sidered as an emergent ability) to follow human instruc-                prior work.
tions [31] to perform specific tasks without demonstrations,
                                                                             • Daily chat instructions. This type of instructions are con-
even on unseen tasks [69]. A large number of studies
                                                                        versations posed by users about daily life, which are more
have confirmed the effectiveness of instruction tuning to
                                                                        closely related to real-life scenarios. We adopt the ShareGPT
achieve superior performance on both seen and unseen
                                                                        instruciton set, consisting of 63K real-user instructions. It
tasks [95, 334]. Also, instruction tuning has been shown to
                                                                        has been used as the core instructions for Vicuna.
be useful in alleviating several weaknesses of LLMs (e.g.,
                                                                             • Synthetic instructions. In addition to reusing existing
repetitive generation or complementing the input without
                                                                        instructions, we can also automatically synthesize massive
accomplishing a certain task) [66, 69], leading to a superior
                                                                        instructions using LLMs. We adopt the popular synthetic
capacity to solve real-world tasks for LLMs. Furthermore,
                                                                        instruction dataset Self-Instruct-52K [147], consisting of 52K
LLMs trained with instruction tuning can generalize to re-
                                                                        instructions paired with about 82K instance inputs and
lated tasks across languages. For example, BLOOMZ-P3 [94]
                                                                        outputs. These generated instructions have a similar data
is fine-tuned based on BLOOM [78] using English-only task
                                                                        distribution as the human-written seed tasks (e.g., grammar
collection P3 [180]. Interestingly, BLOOMZ-P3 can achieve
                                                                        checking, brainstorming).
a more than 50% improvement in multilingual sentence
                                                                             As the original FLAN-T5 dataset is very large (i.e., over
completion tasks compared to BLOOM, which shows that
                                                                        15M), we randomly sample 80,000 instructions from it for
instruction tuning can help LLMs acquire general task skills
                                                                        conducting a fair comparison with other instruction datasets
from English-only datasets and transfer such skills into
                                                                        (i.e., ShareGPT and Self-Instruct-52K) at a similar scale. In
other languages [94]. In addition, it has been found that
                                                                        our experiments, we test on each individual instruction
using English-only instructions can produce satisfactory
                                                                        set to explore their own effects and also examine their
results on multilingual tasks [94], which helps reduce the
                                                                        combinatorial effects on model performance.
effort of instruction engineering for a specific language.
                                                                        Improvement Strategies. Although real-world instructions
Domain Specialization. Existing LLMs have showcased su-
                                                                        from human users are more suitable for fine-tuning LLMs,
perior capabilities in traditional NLP tasks (e.g., generation
                                                                        it is difficult to collect them at a large scale. As alternatives
and reasoning) and daily questions. However, they may
                                                                        to human-generated instructions, most existing research
still lack domain knowledge to accomplish specific tasks,
                                                                        mainly adopts synthetic instructions generated by LLMs.
such as medicine, law, and finance (See Section 8 for a
                                                                        However, there are some potential problems with synthetic
detailed discussion of LLMs in different applications). In-
                                                                        instructions, such as poor topic diversity and uneven in-
struction tuning is an effective approach to adapting existing
                                                                        struction difficulty (either too simple or too difficult). Thus,
general LLMs to be domain-specific experts. For instance,
                                                                        it is necessary to improve the quality of the synthetic in-
researchers propose to fine-tune Flan-PaLM [69] using medi-
                                                                        structions. Next, we summarize four major improvement
cal datasets to create Med-PaLM [354], a medical knowledge
                                                                        strategies widely used in existing work as follows:
assistant that achieves performance levels comparable to
those of expert clinicians. Furthermore, a recent study [355]                • Enhancing the instruction complexity. As discussed in
fine-tunes FLAN-T5 to support e-commerce recommender                    existing work [335], enhancing the complexity of instruc-
systems with natural language instructions, showing strong              tions can improve the model capacity of LLMs in following
performance in a variety of recommendation tasks. There                 complex instructions, e.g., including more task demands or
are also several open-sourced medical models instruction-               requiring more reasoning steps. To validate this strategy,
tuned based on LLaMA [57], such as BenTsao [356]. Also,                 we follow WizardLM [335] by gradually increasing the
researchers explore instruction tuning on law [357], fi-                complexity levels, e.g., adding constraints, increasing rea-
nance [358], and arithmetic computation [359].                          soning steps, and complicating the input. We leverage the
                                                                        publicly released WizardLM-70K instructions [335] as the
                                                                        complexity-enhanced instruction dataset, which has been
5.1.4   Empirical Analysis for Instruction Tuning                       generated via the above enhancement approach based on
Fine-tuning LLMs with different instruction sets tend to lead           the Self-Instruct-52K dataset [335].
to model variants with varied performance on downstream                      • Increasing the topic diversity. In addition to the complex-
tasks. In this section, we will explore the effect of different         ity, improving the topic diversity of the instruction dataset
types of instructions in fine-tuning LLMs (i.e., LLaMA (7B)             can help elicit different abilities of LLMs on diverse tasks in
and LLaMA (13B)25 ), as well as examine the usefulness of               real world [336]. However, it is difficult to directly control
several instruction improvement strategies.                             the self-instruct process for generating diverse instructions.
                                                                        Following YuLan-Chat [341], we employ ChatGPT to rewrite
                                                                        the instructions from Self-Instruct-52K dataset for adapting
   25. Due to the limit of computational resources, we cannot conduct
large-scale experiments on larger LLaMA variants right now, which       them into 293 topics via specific prompts. Finally, we obtain
would be scheduled in a future version.                                 70K instructions as the diversity-increased dataset.
                                                                                                                                  35

TABLE 10: Results of instruction-tuning experiments (all in a single-turn conversation) based on the LLaMA (7B) and
LLaMA (13B) model under the chat and QA setting. We employ four instruction improvement strategies on the Self-
Instruct-52K dataset, i.e., enhancing the complexity (w/ complexity), increasing the diversity (w/ diversity), balancing the
difficulty (w/ difficulty), and scaling the instruction number (w/ scaling). ∗ Since we select the LLaMA (7B)/(13B) model
fine-tuned on Self-Instruct-52K as the baseline, we omit the win rate of the fine-tuned model with Self-Instruct-52K against
itself.

                                Dataset               Instruction     Lexical          Chat                QA
                Models
                                Mixtures               Numbers       Diversity     AlpacaFarm     MMLU          BBH3k
                LLaMA (7B)      ① FLAN-T5                 80,000       48.48          23.77        38.58        32.79
                                ② ShareGPT                63,184       77.31          81.30        38.11        27.71
                                ③ Self-Instruct-52K       82,439       25.92           /∗          37.52        29.81
                                ②+③                      145,623       48.22          71.36        41.26        28.36
                                ①+②+③                    225,623       48.28          70.00        43.69        29.69
                                ③ Self-Instruct-52K       82,439       25.92           /∗          37.52        29.81
                                w/ complexity             70,000       70.43          76.96        39.73        33.25
                                w/ diversity              70,000       75.59          81.55        38.01        30.03
                                w/ difficulty             70,000       73.48          79.15        32.55        31.25
                                w/ scaling               220,000       57.78          51.13        33.81        26.63
                LLaMA (13B)     ① FLAN-T5                 80,000       48.48          22.12        34.12        34.05
                                ② ShareGPT                63,184       77.31          77.13        47.49        33.82
                                ③ Self-Instruct-52K       82,439       25.92           /∗          36.73        25.43
                                ②+③                      145,623       48.22          72.85        41.16        29.49
                                ①+②+③                    225,623       48.28          69.49        43.50        31.16
                                ③ Self-Instruct-52K       82,439       25.92           /∗          36.73        25.43
                                w/ complexity             70,000       70.43          77.94        46.89        35.75
                                w/ diversity              70,000       75.59          78.92        44.97        36.40
                                w/ difficulty             70,000       73.48          80.45        43.15        34.59
                                w/ scaling               220,000       57.78          58.12        38.07        27.28



    • Scaling the instruction number. In addition to the above      of fine-tuned models, we consider two settings, namely
aspects, the number of instructions is also an important            Chat setting and QA setting. The chat setting mainly utilizes
factor that may affect the model performance. Specially,            user instructions and queries from daily chat, whereas the
using more instructions can extend the task knowledge and           QA setting mainly employs question answering examples
improve the ability of instruction following for LLMs [69].         from existing NLP datasets. The evaluation on the chat
To examine this strategy, we sample new instructions from           setting is conducted based on the AlpacaFarm evaluation
the synthesized instruction set released from the MOSS              set [361]. Instead of using a full pairwise comparison, we
project [360], as they are also synthesized using the same          select the LLaMA 7B and 13B models fine-tuned on Self-
self-instruct method [147]. We mix them with the Self-              Instruct-52K as the reference baselines, and then compare
Instruct-52K dataset to compose a larger one containing             them with other fine-tuned LLaMA 7B and 13B models
220K instructions.                                                  using different instructions, respectively. Since our focus is
                                                                    to examine the usefulness of different strategies to generate
    • Balancing the instruction difficulty. As the synthetic
                                                                    the instructions, the model fine-tuned on Self-Instruct-52K
instructions tend to contain too easy or too hard ones, it
                                                                    can serve as a good reference. Following AlpacaFarm [361],
is likely to result in training instability or even overfitting
                                                                    for each comparison, we employ ChatGPT to automatically
for LLMs. To explore the potential effects, we leverage
                                                                    annotate which response from two compared models each
the perplexity score of LLMs to estimate the difficulty of
                                                                    time is the best for the user query, and report the win
instructions and remove too easy or too hard instructions. To
                                                                    rate (%) as the evaluation metric. For the QA setting, we
generate the same scale of instructions for fair comparison,
                                                                    select two benchmarks, MMLU [362] and BBH [363], and
we adopt a LLaMA (7B) model to compute the perplexity for
                                                                    evaluate the accuracy based on their default settings by
the 220K instructions from the large instruction dataset, and
                                                                    using heuristic rules to parse the answers from these LLMs.
then keep 70K instructions of moderate perplexity scores as
the difficulty-balanced dataset.                                        For both instruction tuning and evaluation, we adopt
                                                                    the following prompt: “The following is a conversation be-
Experimental Setup. To conduct the experiments on the               tween a human and an AI assistant. The AI assistant gives
effect of instruction data, we leverage these new instruction       helpful, detailed, and polite answers to the user’s questions.\n
datasets for tuning LLaMA, a popular LLM backbone that              [|Human|]:{input}\n[|AI|]:”. To reproduce our results, we
has been widely used for instruction-tuning. We use the             release the code and data at the link: https://github.com/
code from YuLan-Chat [341] for our experiments, and train           RUCAIBox/LLMSurvey/tree/main/Experiments.
LLaMA 7B and 13B on a server of 8 A800-80G GPUs. All
the hyper-parameters settings remain the same as Stanford           Results and Analysis. The results using different instruction
Alpaca. To better evaluate the instruction following ability        datasets based on 7B and 13B LLaMA are in Table 10. Next,
                                                                                                                                   36

we summarize and analyze our findings in detail.                     likely because that the larger models generally have better
    • Task-formatted instructions are more proper for the QA         knowledge utilization and reasoning capability [33, 55],
setting, but may not be useful for the chat setting. By comparing    which can accurately answer more complex questions.
the performance of instruction tuning using FLAN-T5 with
that of ShareGPT and Self-Instruct-52K, we can observe                     Instruction Tuning Suggestions
that FLAN-T5 mostly achieves a better performance on QA
benchmarks while underperforms ShareGPT on the chat set-                   To conduct instruction tuning on LLMs, one can
ting. The reason is that FLAN-T5 is composed of a mixture                  prepare the computational resources according to
of instructions and examples from existing NLP tasks, e.g.,                the basic statistics about the required number of
translation and reading comprehension. As a result, LLaMA                  GPUs and tuning time in Table 9. After setting
fine-tuned with FLAN-T5 performs better on QA tasks, but                   up the development environment, we recommend
poorly on user queries. In contrast, ShareGPT consists of                  beginners to follow the code of Alpaca reposi-
real-world human-ChatGPT conversations, which is able to                   tory [187] for instruction tuning. Subsequently, one
better elicit LLaMA to follow user instructions in daily life,             should select the base model and construct the
while may not be suitable for accomplishing the QA tasks.                  instruction datasets as we discuss in this section.
    • A mixture of different kinds of instructions are helpful to          When computational resources for training are con-
improve the comprehensive abilities of LLMs. After mixing the              strained, users can utilize LoRA for parameter-
three kinds of instructions for fine-tuning, we can see that               efficient tuning (see Section 5.3). As for inference,
the derived LLaMA variant (with FLAN-T5, ShareGPT and                      users can further use quantization methods to de-
Self-Instruct-52K) performs well in both task settings. In                 ploy LLMs on fewer or smaller GPUs (see Sec-
MMLU, the performance of LLaMA (7B) can surpass the                        tion 5.3).
ones using individual instruction set by a large margin, i.e.,
43.69 vs. 38.58 (FLAN-T5). It shows that mixing multiple
                                                                     5.2     Alignment Tuning
sources of instruction datasets is helpful to improve the
performance of instruction-tuned LLMs, which scales the              This part first presents the background of alignment with
instruction number as well as increases the diversity.               its definition and criteria, then focuses on the collection
    • Enhancing the complexity and diversity of instructions         of human feedback data for aligning LLMs, and finally
leads to an improved model performance. By increasing the            discusses the key technique of reinforcement learning from
complexity and diversity of the Self-Instruct-52K dataset            human feedback (RLHF) for alignment tuning.
respectively, the chat and QA performance of LLaMA can
be consistently improved, e.g., from 37.52 to 39.73 in MMLU          5.2.1    Background and Criteria for Alignment
for LLaMA (7B). It demonstrates that both strategies are
                                                                     Background. LLMs have shown remarkable capabilities
useful to improve the instruction following ability of LLMs.
                                                                     in a wide range of NLP tasks [55, 56, 67, 90]. However,
Further, we can see that improving the complexity yields a
                                                                     these models may sometimes exhibit unintended behav-
larger performance improvement on QA tasks. The reason
                                                                     iors, e.g., fabricating false information, pursuing inaccurate
is that the QA tasks mostly consist of difficult questions for
                                                                     objectives, and producing harmful, misleading, and biased
evaluating LLMs, which can be better solved by LLMs that
                                                                     expressions [66, 364]. For LLMs, the language modeling
have learned complex instructions at the fine-tuning stage.
                                                                     objective pre-trains the model parameters by word predic-
    • Simply increasing the number of instructions may not be
                                                                     tion while lacking the consideration of human values or
that useful, and balancing the difficulty is not always helpful.
                                                                     preferences. To avert these unexpected behaviors, human
As the results shown in Table 10, balancing the difficulty
                                                                     alignment has been proposed to make LLMs act in line with
and increasing the number of fine-tuning instructions are
                                                                     human expectations [66, 365]. However, unlike the original
not very helpful in our experiments. Especially for scaling
                                                                     pre-training and adaptation tuning (e.g., instruction tuning),
the instruction number, it even hurts the performance, e.g.,
                                                                     such an alignment requires considering very different crite-
a decrease from 29.81 to 26.63 in BBH for LLaMA (7B).
                                                                     ria (e.g., helpfulness, honesty, and harmlessness). It has been
It shows that simply scaling the number of synthesized
                                                                     shown that alignment might harm the general abilities of
instructions without quality control may not be effective to
                                                                     LLMs to some extent, which is called alignment tax in related
improve the performance. Furthermore, fine-tuning with the
                                                                     literature [366].
instructions of moderate difficulty also performs well in the
chat setting, while slightly decreasing the performance in           Alignment Criteria. Recently, there is increasing attention
the QA setting. A possible reason is that we filter complex          on developing multifarious criteria to regulate the behav-
and hard instructions with large perplexity scores, hurting          iors of LLMs. Here, we take three representative alignment
the model performance in answering complex questions.                criteria (i.e., helpful, honest, and harmless) as examples for
    • A larger model scale leads to a better instruction following   discussion, which have been widely adopted in existing
performance. By comparing the performance of LLaMA (7B)              literature [66, 366]. In addition, there are other alignment
and LLaMA (13B) models fine-tuned with the same set                  criteria for LLMs from different perspectives including be-
of instruction data, we can see that LLaMA (13B) mostly              havior, intent, incentive, and inner aspects [364], which
achieves a better performance. It indicates that scaling the         are essentially similar (or at least with similar alignment
model size is helpful for improving the instruction following        techniques) to the above three criteria. It is also feasible to
capability. Besides, we can see that the QA performance has          modify the three criteria according to specific needs, e.g.,
been improved a lot, e.g., from 38.11 to 47.49 in MMLU. It is        substituting honesty with correctness [116]. Next, we give
                                                                                                                               37

brief explanations about the three representative alignment       to be UK-based native English speakers who have obtained
criteria:                                                         at least an undergraduate-level educational qualification.
    • Helpfulness. To be helpful, the LLM should demon-           Even then, several studies [365] have found that there still
strate a clear attempt to assist users in solving their tasks     exists a mismatch between the intentions of researchers
or answering questions in a concise and efficient manner          and human labelers, which may lead to low-quality human
as possible. At a higher level, when further clarification        feedback and cause LLMs to produce unexpected output.
is needed, the LLM should demonstrate the capability of           To address this issue, InstructGPT [66] further conducts a
eliciting additional relevant information through pertinent       screening process to filter labelers by assessing the agree-
inquiries and exhibit suitable levels of sensitivity, percep-     ment between human labelers and researchers. Specifically,
tiveness, and prudence [366]. Realizing the alignment of          researchers first label a small amount of data and then
helpful behavior is challenging for LLMs since it is difficult    measure the agreement between themselves and human
to precisely define and measure the intention of users [364].     labelers. The labelers with the highest agreement will be
    • Honesty. At a basic level, a LLM aligned to be honest       selected to proceed with the subsequent annotation work.
should present accurate content to users instead of fabri-        In some other work [368], “super raters” are used to ensure
cating information. Additionally, it is crucial for the LLM       the high quality of human feedback. Researchers evaluate
to convey appropriate degrees of uncertainty in its output,       the performance of human labelers and select a group of
in order to avoid any form of deception or misrepresen-           well-performing human labelers (e.g., high agreement) as
tation of information. This requires the model to know            super raters. The super raters will be given priority to
about its capabilities and levels of knowledge (e.g., “know       collaborate with the researchers in the subsequent study.
unknowns”). According to the discussion in [366], honesty         When human labelers annotate the output of LLMs, it is
is a more objective criterion compared to helpfulness and         helpful to specify detailed instructions and provide instant
harmlessness, hence honesty alignment could potentially be        guidance for human labelers, which can further regulate the
developed with less reliance on human efforts.                    annotation of labelers.
    • Harmlessness. To be harmless, it requires that the lan-
guage produced by the model should not be offensive or            Human Feedback Collection. In existing work, there are
discriminatory. To the best of its abilities, the model should    mainly three kinds of approaches to collecting feedback and
be capable of detecting covert endeavors aimed at soliciting      preference data from human labelers.
requests for malicious purposes. Ideally, when the model              • Ranking-based approach. In early work [365], human
was induced to conduct a dangerous action (e.g., commit-          labelers often evaluate model-generated outputs in a coarse-
ting a crime), the LLM should politely refuse. Nonetheless,       grained manner (i.e., only selecting the best) without taking
what behaviors are deemed harmful and to what extent vary         into account more fine-grained alignment criteria. Nonethe-
amongst individuals or societies [366] highly depend on           less, different labelers may hold diverse opinions on the
who is using the LLM, the type of the posed question, and         selection of the best candidate output, and this method
the context (e.g., time) at which the LLM is being used.          disregards the unselected samples, which may lead to inac-
    As we can see, these criteria are quite subjective, and are   curate or incomplete human feedback. To address this issue,
developed based on human cognition. Thus, it is difficult         subsequent studies [116] introduce the Elo rating system
to directly formulate them as optimization objectives for         to derive the preference ranking by comparing candidate
LLMs. In existing work, there are many ways to fulfill these      outputs. The ranking of outputs serves as the training signal
criteria when aligning LLMs. A promising technique is red         that guides the model to prefer certain outputs over others,
teaming [367], which involves using manual or automated           thus inducing outputs that are more reliable and safer.
means to probe LLMs in an adversarial way to generate                 • Question-based approach. Further, human labelers can
harmful outputs and then updates LLMs to prevent such             provide more detailed feedback by answering certain ques-
outputs.                                                          tions designed by researchers [81], covering the alignment
                                                                  criteria as well as additional constraints for LLMs. Specially,
5.2.2   Collecting Human Feedback                                 in WebGPT [81], to assist the model in filtering and utiliz-
                                                                  ing relevant information from retrieved documents, human
During the pre-training stage, LLMs are trained using the
                                                                  labelers are required to answer questions with multiple
language modeling objective on a large-scale corpus. How-
                                                                  options about whether the retrieved documents are useful
ever, it cannot take into account the subjective and qualita-
                                                                  for answering the given input.
tive evaluations of LLM outputs by humans (called human
feedback in this survey). High-quality human feedback is              • Rule-based approach. Many studies also develop rule-
extremely important for aligning LLMs with human pref-            based methods to provide more detailed human feedback.
erences and values. In this part, we discuss how to select a      As a typical case, Sparrow [116] not only selects the response
team of human labelers for feedback data collection.              that labelers consider the best but also uses a series of
                                                                  rules to test whether model-generated responses meet the
Human Labeler Selection. In existing work, the dominant           alignment criteria of being helpful, correct, and harmless.
method for generating human feedback data is human                In this way, two kinds of human feedback data can be ob-
annotation [66, 116, 365]. This highlights the critical role      tained: (1) the response preference feedback is obtained by
of selecting appropriate human labelers. To provide high-         comparing the quality of model-generated output in pairs,
quality feedback, human labelers are supposed to have a           and (2) the rule violation feedback is obtained by collecting
qualified level of education and excellent proficiency in En-     the assessment from human labelers (i.e., a score indicating
glish. For example, Sparrow [116] requires human labelers         to what extent the generated output has violated the rules).
                                                                                                                                                             38

                                                                    Supervised Fine-tuning     reward models having a parameter scale different from that
                            Prompts                                                            of the aligned LM [66, 368]. For example, OpenAI uses 6B
                                                    Training with demonstration data
                                                                                               GPT-3 and DeepMind uses 7B Gopher as the reward model,
 Human
Annotator
                            Demonstrations                             Pre-trained LM
                                                                                         🔥     respectively. Finally, to optimize the pre-trained LM using
                                                                                               the signal from the reward model, a specific RL algorithm
                         Demonstration Data
                                                                                               is designed for large-scale model tuning. Specifically, Prox-
                                                                                               imal Policy Optimization (PPO) [128] is a widely used RL
                                                                    Reward Model Training      algorithm for alignment in existing work [66, 116, 368].
                                                           🔥                                   Key Steps for RLHF. Figure 12 illustrates the overall three-
     Prompts              LM Outputs               Reward
                                                   Model
                                                                       Pre-trained LM
                                                                                         🧊     step process of RLHF [66] as introduced below.
                                                                                                    • Supervised fine-tuning. To make the LM initially perform
                         Human Feedback                                                        desired behaviors, it usually needs to collect a supervised
            Ranking                           Training with feedback data
                                                                                               dataset containing input prompts (instruction) and desired
                                                                              RL Fine-tuning   outputs for fine-tuning the LM. These prompts and outputs
                                             🧊                                                 can be written by human labelers for some specific tasks
             Prompts
                                   Reward
                                   Model
                                                                            Aligned LM
                                                                                         🔥     while ensuring the diversity of tasks. For example, Instruct-
                                                                                               GPT [66] asks human labelers to compose prompts (e.g.,

            LM Outputs             😊/😞
                                   Reward
                                                  Training with RL algorithm (PPO)
                                                                                               “List five ideas for how to regain enthusiasm for my career”) and
                                                                                               desired outputs for several generative tasks such as open
                                                                                               QA, brainstorming, chatting, and rewriting. Note that the
        Fig. 12: The workflow of the RLHF algorithm.                                           first step is optional in specific settings or scenarios.
                                                                                                    • Reward model training. The second step is to train the
                                                                                               RM using human feedback data. Specifically, we employ
Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers                                the LM to generate a certain number of output texts using
(based on GPT-4 itself) as rule-based reward models, which                                     sampled prompts (from either the supervised dataset or
can automatically determine whether the model-generated                                        the human-generated prompt) as input. We then invite
outputs violate a set of human-written rules.                                                  human labelers to annotate the preference for these pairs.
    In the following, we focus on a well-known technique,                                      The annotation process can be conducted in multiple forms,
reinforcement learning from human feedback (RLHF),                                             and a common approach is to annotate by ranking the
which has been widely used in the recent powerful LLMs                                         generated candidate texts, which can reduce the inconsis-
such as ChatGPT. As discussed below, the alignment criteria                                    tency among annotators. Then, the RM is trained to predict
introduced in Section 5.2.1 can be fulfilled by learning from                                  the human-preferred output. In InstructGPT, labelers rank
human feedback on the responses of LLMs to users’ queries.                                     model-generated outputs from best to worst, and the RM
                                                                                               (i.e., 6B GPT-3) is trained to predict the ranking. Note that, in
5.2.3   Reinforcement Learning from Human Feedback                                             recent work [369], the annotation of preference on response
                                                                                               pairs has been conducted by an AI agent (usually an aligned
To align LLMs with human values, reinforcement learning
                                                                                               LLM) instead of humans, which is called “reinforcement
from human feedback (RLHF) [79, 365] has been proposed
                                                                                               learning from AI feedback (RLAIF)”. LLMs trained with typical
to fine-tune LLMs with the collected human feedback data,
                                                                                               RLHF algorithms tend to generate harmless responses with
which is useful to improve the alignment criteria (e.g.,
                                                                                               less helpfulness, which is called evasion problem [369]. To
helpfulness, honesty, and harmlessness). RLHF employs
                                                                                               guarantee both the harmlessness and helpfulness, RLAIF
reinforcement learning (RL) algorithms (e.g., Proximal Pol-
                                                                                               generates the AI feedback based on pre-set alignment prin-
icy Optimization (PPO) [128]) to adapt LLMs to human
                                                                                               ciples in instructions [369, 370], which can also reduce the
feedback by learning a reward model. Such an approach
                                                                                               efforts of human annotation.
incorporates humans in the training loop for developing
well-aligned LLMs, as exemplified by InstructGPT [66].                                              • RL fine-tuning. At this step, aligning (i.e., fine-tuning)
                                                                                               the LM is formalized as an RL problem. In this setting,
RLHF System. The RLHF system mainly comprises three                                            the pre-trained LM acts as the policy that takes as input
key components: a pre-trained LM to be aligned, a reward                                       a prompt and returns an output text, the action space of
model learning from human feedback, and a RL algorithm                                         it is the vocabulary, the state is the currently generated
training the LM. Specifically, the pre-trained LM is typically                                 token sequence, and the reward is provided by the RM. To
a generative model that is initialized with existing pre-                                      avoid eviating significantly from the initial (before tuning)
trained LM parameters. For example, OpenAI uses 175B                                           LM, a penalty term is commonly incorporated into the
GPT-3 for its first popular RLHF model, InstructGPT [66],                                      reward function. For example, InstructGPT optimizes the
and DeepMind uses the 280 billion parameter model Go-                                          LM against the RM using the PPO algorithm. For each input
pher [64] for its GopherCite model [368]. Further, the reward                                  prompt, InstructGPT calculates the KL divergence between
model (RM) provides (learned) guidance signals that reflect                                    the generated results from the current LM and the initial
human preferences for the text generated by the LM, usually                                    LM as the penalty. It is noted that the second and final steps
in the form of a scalar value. The reward model can take on                                    can be iterated in multiple turns for better aligning LLMs.
two forms: a fine-tuned LM or a LM trained de novo using                                       Due to the instability of the RL algorithm, recent work [371]
human preference data. Existing work typically employs                                         replaces the RL tuning with another supervised fine-tuning
                                                                                                                                     39



  MHA Adapter FFN Adapter            Prefix        Layer #N                    Layer #N                     Wdown
                                                                                                                     Layer #N
              …                                       …                          …                     LoRA             …
  MHA Adapter FFN Adapter            Prefix        Layer #1                    Layer #1                     Wdown
                                                                                                                      Layer #1


             Input                                   Input            Prompt         Input                             Input
       (a) Adapter Tuning                 (b) Prefix Tuning               (c) Prompt Tuning               (d) Low-Rank Adapation

Fig. 13: An illustration of four different parameter-efficient fine-tuning methods. MHA and FFN denote the multi-head
attention and feed-forward networks in the Transformer layer, respectively.


by reusing the best ranked samples with higher rewards.            by the reward model for learning. After fine-tuning the
                                                                   LLM on the best samples until convergence, the RL process
Practical Strategies for RLHF. Although RLHF is promising
                                                                   will be performed to further improve the performance.
to effectively improve the alignment of LLMs with humans,
                                                                   LLaMA 2 [99] has successively trained five versions of RLHF
it is practically challenging for researchers to successfully
                                                                   models, where the LLM has been progressively improved
implement it. In this part, we focus on discussing several
                                                                   with the improvement of the reward models. In this way,
useful strategies and tricks for improving the effectiveness
                                                                   the collected prompts and annotations of human preference
and efficiency of RLHF. Concretely, we focus on the effective
                                                                   data can better reflect the issues of the current model check-
training of reward models, efficient and effective RL train-
                                                                   point, thus making special tuning to address these issues. In
ing, respectively.
                                                                   addition, LLaMA 2 also adds samples from prior iterations
    • Effective reward model training. Despite that InstructGPT
                                                                   into the subsequent ones, to alleviate the possible capacity
used a small reward model (6B GPT model), increasing
                                                                   regression issue during iterative optimization.
work [99] has shown it is often more effective to use a
                                                                       • Efficient RL training. As the RL training requires to
large reward model (e.g., equal or greater than the original
                                                                   iterate the inference process of both the LLM and reward
model size), since large reward models generally perform
                                                                   models, it would greatly increase the total memory and
better in judging the quality of the LLM generated outputs.
                                                                   computation cost, especially for larger reward models and
In LLaMa 2 [99], pretrained chat model checkpoints are
                                                                   LLMs. As a practical trick, we can deploy the reward model
used to initialize the reward model, they argue that such an
                                                                   on a separate server, and invoke the corresponding API
approach can effectively reduce the information mismatch
                                                                   to work with the LLM on its own server. In addition, as
between the model to be aligned and the reward model
                                                                   RLHF requires the LLM to generate multiple candidate
by sharing the same pre-training knowledge. Whereas, it is
                                                                   outputs, instead of calling the sample decoding procedure
common to encounter the overfitting problem when train-
                                                                   for multiple times, it is more efficient to utilize the beam
ing large-scale reward models. As a simple yet effective
                                                                   search decoding algorithm26 . It only needs to perform one-
solution, existing work [372, 373] has introduced the LM
                                                                   pass decoding for response generation, meanwhile such a
loss on the preferred response of the input prompt from
                                                                   strategy can also enhance the diversity of the generated
the human-annotated alignment dataset as a regularizer,
                                                                   candidate responses.
which alleviates the overfitting of the reward model on the
binary classification task. In addition, as there are multiple     Process-Supervised RLHF. In existing literature of
criteria for alignment (e.g., helpfulness and honesty), it is      RLHF [374], the supervision approach for RL training
often difficult to train a single reward model that can satisfy    generally takes two major forms, either using outcome-
all the alignment criteria. Therefore, it is useful to train       supervision signals or process-supervision signals. The
multiple reward models that focus on different alignment           outcome-supervised RLHF employs a quantitative score to
criteria [99], and compute the final reward based on the           assess the quality of the whole text generated by LLMs. In
produced ones from them via special combination strategies         contrast, process-supervised RLHF offers an evaluation of
(e.g., mean pooling and weighted sum). Such a way enables          each individual component (e.g., sentence, word, or reason-
more flexible rules or standards on multiple criteria, e.g.,       ing step) within the generated content, which leverage fine-
relaxing the requirement on helpfulness while posing more          grained supervision signals to guide the training, helping
strict limits on harmfulness.                                      LLMs refine the undesired generation contents [374, 375].
    • Effective RL training. As the RL training process tends to   In what follows, we discuss two key aspects of process-
be unstable and hyper-parameter sensitive, it is suggested         supervised RLHF.
that the language model should be well supervised fine-               • Obtaining Fine-grained Supervision Signals. Compared
tuned before RL training, so as to reaching a good model           with outcome rewards, it is more difficult to obtain fine-
capacity. A commonly-used way is to fine-tune the LLM              grained supervision signals. OpenAI has released a fine-
on its best outputs of the prompts (referred to as rejec-          grained annotation dataset named PRM800k [375] consist-
tion sampling or best-of-N ) from the alignment dataset until
convergence before RL. Given a prompt, the LLM would
                                                                      26. https://huggingface.co/docs/transformers/v4.31.0/en/main
first produce N outputs via the sampling algorithm, and            classes/text generation#transformers.GenerationMixin.group beam
then the best candidate from the model will be selected            search
                                                                                                                                40

ing of 12K process-annotated mathematical problems (i.e.,         iors from these demonstration data via suitable fine-tuning
MATH dataset [376]) and 75K solutions generated by LLMs           strategies. Thus, to implement this approach, two key issues
of these problems, where each reasoning step of mathe-            are the construction of alignment dataset and the design of
matical problems is labeled as positive, negative or neutral      fine-tuning loss. For the first issue, the alignment dataset
in PRM800k. Considering the cost and efficiency of the            can be automatically constructed by an aligned LLMs ac-
human annotation process, several methods aim to auto-            cording to human-written safety principles [336] or refining
matically annotate the correctness of intermediate reason-        existing examples using edits operations [385]. In addition,
ing steps, e.g., using powerful LLMs to directly replace          we can also reuse existing reward models to select high-
human annotators [377] or Monte Carlo tree search [378].          rated responses from existing human feedback data [371].
After obtaining fine-grained supervision signals, existing        For the second issue, non-RL alignment approaches mainly
work typically leverages them to train process-supervised         fine-tune LLMs in a supervised learning way (the same
reward models (PRM) [375, 379], which can produce step-           as the original instruction tuning loss) on a high-quality
level rewards (e.g., sentence based or token based rewards)       alignment dataset, meanwhile auxiliary learning objectives
during the RLHF procedure. Furthermore, rather than lever-        can be used to enhance the alignment performance, e.g.,
aging the discriminative model to produce the rewards,            ranking responses or contrasting instruction-response pairs.
RLMEC [380] utilizes a generative reward model trained on
                                                                  Alignment Data Collection. The construction of alignment
rewriting tasks with the minimum editing constraint, to pro-
                                                                  data is important to effectively align the behaviors of LLMs
vide token-level rewards. In addition, for the downstream
                                                                  with human preferences. To collect high-quality alignment
tasks where fine-grained supervision signals are difficult to
                                                                  data, some work tries to reuse existing reward models to
collected, outcome-supervision signals can also be utilized
                                                                  select high-rated responses, and others explore to leverage
to perform process-supervised RLHF [381].
                                                                  powerful LLMs (e.g., ChatGPT) or build a simulated envi-
    • Utilizing the PRMs. To effectively leverage process-
                                                                  ronment to generate synthetic alignment examples. Next,
supervision signals from PRMs, existing work mainly uti-
                                                                  we will discuss these three lines of research.
lizes these fine-grained signals to evaluate individual parts
                                                                      • Reward model based approaches. The reward model in
within the LLM responses and then guides LLMs to adjust
                                                                  RLHF has been trained to measure the alignment degree
their generation behaviors to maximize the received reward
                                                                  on the responses of LLMs. It is straightforward to leverage
of the response. Concretely, expert iteration [382, 383], an
                                                                  existing reward models to select high-quality responses as
effective RL algorithm, has been utilized to improve the base
                                                                  alignment data for subsequent fine-tuning. Based on this
policy via learning from expert policy [374]. Typically, expert
                                                                  idea, RAFT [371] adopts reward models trained on human
iteration contains two main stages: policy improvement and
                                                                  preference data to rank the responses of LLMs and collect
distillation [374]. In the policy improvement stage, expert
                                                                  those with higher rewards for supervised fine-tuning. In
policy processes the systematic search procedure to produce
                                                                  addition, the reward model can be also used to score model
the samples under the guidance of PRMs. Subsequently,
                                                                  responses and assign them to different quality groups.
during the distillation stage, the samples generated by ex-
                                                                  Quark [386] sorts the responses of LLMs into different quan-
pert policy in the first stage are utilized to improve the
                                                                  tiles based on the reward scores. Each quantile is attached
base policy through supervised fine-tuning. In addition to
                                                                  with a special reward token to represent the reward level
expert iteration, PRMs can also be utilized to re-rank the
                                                                  of the quantile. Conditioned on the highest-reward tokens,
candidates of the final answers generated by LLMs [375] or
                                                                  LLMs are subsequently prompted to generate high-quality
to select better intermediate reasoning steps during step by
                                                                  responses. Given an initial answer and the corresponding
step reasoning [379, 384].
                                                                  human feedback, ILF [387] first adopts LLMs to generate
                                                                  refined answers, then utilizes the reward model to select
5.2.4   Alignment without RLHF
                                                                  the answer that best matches the feedback for further
Although RLHF has achieved great success in aligning the          training. As valuable resources for aligning LLMs, several
behaviors of LLMs with human values and preferences, it           reward models have been released, including DeBERTa-
also suffers from notable limitations. First, RLHF needs to       base/large/xxlarge from OpenAssistant27 , Moss-7B from
train multiple LMs including the model being aligned, the         Fudan28 , and Flan-T5-xl from Stanford29 .
reward model, and the reference model at the same time,               • LLM based generative approaches. Reward models help
which is tedious in algorithmic procedure and memory-             to select aligned data from model responses. However,
consuming in practice. Besides, the commonly-used PPO             training reward models itself necessitates substantial high-
algorithm in RLHF is rather complex and often sensitive           quality human-labeled data, which is typically expensive
to hyper-parameters. As an alternative, increasing studies        and in short supply. In addition, although existing reward
explore to directly optimize LLMs to adhere to human pref-        models can be reused, they might not be able to accurately
erences, using supervised fine-tuning without reinforcement       capture the nonalignment behaviors in another separately
learning [338].                                                   trained LLM. Therefore, some work explores leveraging
                                                                  powerful LLMs to automatically generate human-aligned
Overview. The basic idea of non-RL alignment approaches
                                                                  data. As a representative work, constitutional AI [369] pro-
is to directly fine-tune LLMs with supervised learning on
                                                                  poses that human supervision comes from a set of principles
high-quality alignment dataset. It basically assumes that re-
sponse feedback or golden rules to avert unsafe behaviors          27. https://huggingface.co/OpenAssistant
have been injected or included in the specially curated align-     28. https://github.com/OpenLMLab/MOSS-RLHF
ment dataset, so that LLMs can directly learn aligned behav-       29. https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl
                                                                                                                                 41

(i.e., natural language instructions) governing AI behaviors.       DPO [391] proposes to reparameterize the response rewards
Based on these principles, LLMs will critique their own             using the policy model (i.e., the language model being
harmful responses and revise them repeatedly into finally           optimized), and then the original reward modeling objective
aligned responses. Similarly, Self-Align [336] first adopts         can be reformulated only based on the policy model. In this
self-instruct [147] to generate instructions focusing on cov-       way, DPO removes the explicit reward modeling step, and
ering diverse topics. Then, the model is also prompted              optimizing the new learning objective that only involves the
with multiple human-written principles that describe the            policy model is equivalent to optimizing the rewards. Based
rules of expected model behaviors (also with several in-            on DPO, existing work has proposed several improvement
context exemplars), to generate helpful, ethical, and reliable      strategies for enhancing the effectiveness or efficiency, e.g.,
responses as alignment data. To mitigate the limit that the         decomposing the optimization of positive responses and
original SFT method can only learn from positive responses,         negative responses into two independent components [392]
FIGA [388] develops an improved supervised alignment                or removing the probability of the reference model in the
approach, where both negative (the original output of low           objective function [393]. Furthermore, FIGA [388] designs a
quality) and positive (the refined output by LLMs) re-              token-level contrastive loss that aims to encourage desirable
sponses are leveraged in a contrastive way, to enable LLMs          tokens, penalize undesirable ones, and disregard trivial
to deeply understand what fine-grained revisions actually           tokens. Despite the effectiveness, recent work has also
lead to good response.                                              revealed that DPO may have inherent limitations in several
     • LLM based interactive approaches. Most existing ap-          aspects. First, based on the analysis about the magnitude
proaches train LLMs in isolation, where LLMs are not                and gradient directions, recent work reveals that DPO might
present in actual environments to improve themselves                have difficulty in well balancing the learning of positive
through external feedback signals. As a comparison, hu-             instances and negative instances [394]. In addition, as the
mans learn social norms and values from interactions with           reference model provides the reward scores for itself in DPO
others in social environments [389]. To mimic such a learn-         algorithm, a weak reference model would also influence
ing approach, Stable Alignment [193] builds a simulated             the alignment performance [395], which can be enhanced
interaction environment consisting of a number of LLM               by improved learning strategies [396] or well-trained policy
agents, where AI agents keep interacting with and each              model [395].
other, receiving feedback on improvement. Once a central                • Auxiliary optimization objectives. Besides the primary
agent receives an instruction, it produces a response and           cross-entropy loss, several studies propose auxiliary train-
shares it with nearby agents. These critic agents generate          ing loss to enhance the learning from the alignment data.
feedback comprising ratings about the response and re-              First, since the responses of each instruction can be scored
vision suggestions. Then the central agent would revise             by the reward model, the ranking loss can be used to train
the original response following these suggestions. Such             the model to preserve the ranking order of these responses.
an alignment approach can be also extended to real-world            For example, RRHF [397] samples responses from multi-
environment with humans.                                            ple sources, including model-generated responses, such as
                                                                    those derived from the model itself, ChatGPT, and GPT-4,
Supervised Alignment Tuning. After obtaining alignment              as well as human-written responses, spanning both high-
data, it is also key to design suitable fine-tuning strategies      quality and low-quality instances. To align with the scores
for direct alignment. A straightforward approach is to op-          from reward models, it further optimizes the ranking loss
timize LLMs using the conventional sequence-to-sequence             by encouraging the model to have a higher conditional
objective based on the alignment data. In addition to the           log probability for the response with a higher ranking.
conventional optimization objective, several studies further        Moreover, SLiC-HF [398] proposes to assess the similarity
explore auxiliary losses that enhance the learning from the         between model outputs and human preference via the dis-
alignment data.                                                     tance in the latent space, and introduces specific calibration
    • Primary training objective. Since the alignment data          and regularization loss to calibrate the candidate sequences
typically consists of an input instruction and an output re-        based on human-preference data. Similarly, the difference
sponse, the primary training loss is still the traditional cross-   between positive and negative responses from the reward
entropy loss for sequence-to-sequence learning. Based on            model can be employed to construct the regularization
this loss, many studies propose a number of improvement             loss [399], to enhance the discrimination between positive
variants for enhancing the supervised alignment tuning.             and negative responses by LLMs. Second, to enhance the
For example, CoH [390] constructs the training data by              relatedness between the response and the instruction, some
prepending “A helpful answer:” and “An unhelpful answer:”           work adopts contrastive learning to push up the probability
to the annotated good and bad responses, respectively, and          of correct instruction-response pairs while pushing down in-
only compute losses for those response tokens with special          correct instruction-response pairs. Specifically, for an output
masking. Quark [386] sorts model responses into different           response, the proposed approach in [400] contrasts the target
quantiles with varying alignment quality, it prepends a spe-        instruction to the other irrelevant instructions. By doing so,
cial reward token to each model response to represent the           it can enable the model to learn the right correlation between
reward level of the response. These studies basically adopt         instructions and responses.
the maximum likelihood objective, and employ instruction
prefixes to guide the learning of human preference.                 5.2.5 Remarks on SFT and RLHF
    • Direct preference optimization. To better mimic the           As discussed in Section 5.1, instruction tuning is the process
learning approach of RLHF in a supervised learning way,             of training pre-trained language models with formatted
                                                                                                                                       42

demonstration data (instructions paired with desired out-              hallucination behaviors when demonstration data is beyond
puts). At early exploration, instruction data was mainly col-          the knowledge or ability scope of LLMs, e.g., training a LLM
lected from NLP tasks [67], while it has been now extended             to answer questions about its unknown facts. An interesting
to more diverse supervision data that pairs input and                  viewpoint from John Schulman’s talk on RLHF [403] is that
output texts (e.g., the utterances of open-ended dialogues).           distilling superior models to train less capable models (e.g.,
Training with such paired texts is also called supervised fine-        prompting GPT-4 to generate the response as fine-tuning
tuning (SFT) in the context of LLMs [66]. In this part, we             data) might increase the possibilities of generating the hal-
mainly use the abbreviation SFT for discussion but not                 lucinated texts, thus likely affecting the factual accuracy
instruction tuning, due to the simplicity and popularity.              of LLMs. Furthermore, as a behavior cloning method, SFT
    Since SFT and RLHF are two major adaptation tuning                 aims to imitate the behaviors (without explorations) of the
methods for LLMs, it is important to understand the con-               experts who construct the demonstration data. However,
nections and difference between them. Next, we make some               there often exist variations among different annotators on
discussions on this issue30 .                                          the writing styles, quality, and preferences of demonstration
                                                                       data, which tends to affect the learning performance of SFT.
Overall Comparison with RL Formulation. Following the                  Thus, high-quality instruction data (but not the quantity) is
discussion in Section 5.2.3 (the part related to RL training),         the primary factor for effective training of LLMs during the
the text generation problem can be formulated as a decision-           SFT stage [99].
making process based on RL. Taking a prompt as input,
the task of a LLM is to generate a text completion that                Pros and Cons of RLHF. RLHF was early explored in the
appropriately responds to the prompt. This task would be               literature of deep RL [79], then borrowed to improve the
completed step by step. At each step, an agent (i.e., LLM)             capacity of language models (e.g., summarization [129]),
will perform an action (i.e., generating a token) according            and subsequently adopted as the fundamental technique to
to the policy (i.e., the generative probability distribution of        develop InstructGPT [66]. Recently, increasing evidence [99,
LLM) conditioned on the current state (currently generated             369] has demonstrated the effectiveness of RLHF in miti-
token sequence and other available context information).               gating the harmful responses and enhancing the model ca-
It is expected that a high-quality output text would be                pacity. Specially, LLaMA 2 has demonstrated that RLHF can
produced by the LLM, which can earn a large reward score               improve both the helpfulness and harmlessness scores [99],
based on the entire response. Overall, RLHF and SFT can be             and attributed this to a better human-LLM synergy for data
considered as two different training approaches to optimiz-            annotation. They explain this reason in two major aspects
ing the above decision making process for LLMs. Specially,             as follows. First, since human annotators mainly provide
RLHF firstly learns the reward model, and then employs                 preference annotations for RLHF, it can largely alleviate the
it to improve the LLM with RL training (e.g., PPO). As a               discrepancies of annotators as that in SFT. Secondly, pref-
comparison, SFT adopts a teacher-forcing approach, which               erence annotation is much easier than writing the demon-
directly optimizes the likelihood of a demonstration output.           stration data, and annotators can even judge the quality of
Such a token-level training way essentially does behavior              more superior generations than those they create, making it
cloning (a special algorithm of imitation learning [401]): it          possible to explore a broader state space beyond what can
utilizes the expert’s action (i.e., the target token at each step)     be demonstrated by human annotators. Another key point
as the supervision label and directly learns to imitate the            is that RLHF essentially encourages LLMs to learn correct
demonstrations from experts without specifying a reward                policies by contrasting the self-generated responses (dis-
model as in typical RL algorithms. To learn the desired                criminating between good and bad responses). It no longer
policies, SFT adopts a “local” optimization way (i.e., token-          forces the model to imitate external demonstration data,
level loss) based on demonstration data, while RLHF takes a            and thus can mitigate the hallucination issues with SFT as
“global” optimization way (i.e., text-level loss) by involving         discussed above31 . Actually, RLHF has been demonstrated
human preference. More theoretical analysis about imitation            to be an important approach to reduce the hallucination
learning and reinforcement learning can be referred to the             behaviors in GPT-4 [46]. However, RLHF inherits the draw-
related RL literature [401, 402].                                      backs of classic RL algorithms, e.g., sample inefficiency and
                                                                       training instability. When adapted to LLMs, RLHF further
Pros and Cons of SFT. SFT has been shown to be an                      relies on a strong SFT model as initial model checkpoint for
effective approach to boosting the performance of LLMs                 efficiently achieving good performance. In addition, human
on various benchmarks [67, 69, 152, 187], which can largely            annotators are involved in a complex iterative optimization
enhance the task generalization ability and flexibly endow             process, in which a number of important details (e.g., the
specific functions (e.g., establishing the chatbot’s identity).        prompt selection, the schedule of reward model training and
More discussions about the usefulness of SFT can be found              PPO training, and the settings of hyper-parameters) have
in Section 5.1.3. It has been widely recognized that SFT               important impact on the whole model performance.
mainly unlocks the abilities but not inject new abilities into             Overall, SFT is particularly useful to increase the model
LLMs. Thus, it might become problematic when one tries                 capacity of pre-trained model checkpoints right after pre-
to stimulate the non-endogenous abilities of LLMs via SFT.             training, while RLHF is promising to further improve the
As a concrete scenario, it would potentially advocate the
                                                                         31. In RLHF, it seems to be also important that reward models
  30. This part would be somehow subjective, mainly based on the au-   should be aware of the knowledge or ability of a LLM to be aligned.
thors’ opinions and experiences. Comments or corrections are welcome   For example, LLaMA 2 adopts pre-trained chat model checkpoints to
to enhance this part.                                                  initialize reward models [99].
                                                                                                                                         43

model capacity of SFT models. However, RLHF has been               the Transformer architecture specially for natural language
difficult to implement, and far from well explored (ac-            understanding, which also utilizes multi-task learning for
cording to public literature), and more improvements (e.g.,        jointly optimizing shared prompts. It has been shown to
efficient and reliable annotation [369] and simplified opti-       be useful in improving the model performance of different
mization [391]) are still needed for further research.             parameter scales on natural language understanding tasks.

                                                                   Prompt Tuning. Different from prefix tuning, prompt tun-
5.3   Parameter-Efficient Model Adaptation                         ing [405, 410] mainly focuses on incorporating trainable
In the above, we have discussed the approaches of instruc-         prompt vectors at the input layer32 . Based on the discrete
tion tuning and alignment tuning to adapt LLMs according           prompting methods [412, 413], it augments the input text
to specific goals. Since LLMs consist of a huge amount of          by including a group of soft prompt tokens (either in a
model parameters, it would be costly to perform the full-          free form [410] or a prefix form [405]), and then takes
parameter tuning. In this section, we will discuss how to          the prompt-augmented input to solve specific downstream
conduct efficient tuning on LLMs. We first review several          tasks. In implementation, task-specific prompt embeddings
representative parameter-efficient fine-tuning methods for         are combined with the input text embeddings, which are
Transformer language models, and then summarize existing           subsequently fed into language models. P-tuning [410] has
work on parameter-efficient fine-tuned LLMs.                       proposed a free form to combine the context, prompt and
                                                                   target tokens, which can be applied to the architectures for
5.3.1 Parameter-Efficient Fine-Tuning Methods                      both natural language understanding and generation. They
In existing literature, parameter-efficient fine-tuning [149,      further learn the representations of soft prompt tokens by a
404, 405] has been an important topic that aims to reduce          bidirectional LSTM. Another representative approach [405]
the number of trainable parameters while retaining a good          named prompt tuning directly prepends prefix prompts to
performance as possible. In what follows, we briefly re-           the input. During training, only the prompt embeddings
view four parameter-efficient fine-tuning methods for Trans-       would be learned according to task-specific supervisions.
former language models, including adapter tuning, prefix           Since this method only includes a small number of trainable
tuning, prompt tuning and LoRA. The illustration of these          parameters at the input layer, it has been found that the
four methods are shown in Figure 13.                               performance highly relies on the model capacity of the
                                                                   underlying language models [405].
Adapter Tuning. Adapter tuning incorporates small neural
network modules (called adapter) into the Transformer mod-         Low-Rank Adaptation (LoRA). LoRA [149] imposes the
els [406]. To implement the adapter module, a bottleneck           low-rank constraint for approximating the update matrix at
architecture has been proposed in [406, 407], which first          each dense layer, so as to reduce the trainable parameters
compresses the original feature vector into a smaller di-          for adapting to downstream tasks. Consider the case of
mension (followed by a nonlinear transformation) and then          optimizing a parameter matrix W. The update process can
recovers it to the original dimension. The adapter modules         be written in a general form as: W ← W + ∆W. The basic
would be integrated into each Transformer layer, typically         idea of LoRA is to freeze the original matrix W ∈ Rm×n
using a serial insertion after each of the two core parts (i.e.,   while approximating the parameter update ∆W by low-
attention layer and feed-forward layer) of a Transformer           rank decomposition matrices, i.e., ∆W = A · B⊤ , where
layer. Alternatively, parallel adapters [408] can be also used     A ∈ Rm×k and B ∈ Rn×k are the trainable parameters for
in Transformer layers, where it places two adapter modules         task adaptation and k ≪ min(m, n) is the reduced rank. The
in parallel with the attention layer and feed-forward layer        major merit of LoRA is that it can largely save the memory
accordingly. During fine-tuning, the adapter modules would         and storage usage (e.g., VRAM). Further, one can only keep
be optimized according to the specific task goals, while the       a single large model copy, while maintaining a number of
parameters of the original language model are frozen in this       task-specific low-rank decomposition matrices for adapting
process. In this way, we can effectively reduce the number         to different downstream tasks. Further, several studies have
of trainable parameters during fine-tuning.                        also discussed how to set the rank in a more principled
                                                                   approach, e.g., importance score based allocation [414] and
Prefix Tuning. Prefix tuning [404] prepends a sequence of          search-free optimal rank selection [415].
prefixes, which are a set of trainable continuous vectors, to
                                                                       Besides the above methods, there is extensive research
each Transformer layer in language models. These prefix
                                                                   on efficient tuning of Transformer language models. How-
vectors are task-specific, which can be considered as virtual
                                                                   ever, a more comprehensive discussion of efficient tuning is
token embeddings. To optimize the prefix vectors, a repa-
                                                                   beyond the scope of this article, which can be found in the
rameterization trick [404] has been proposed by learning a
                                                                   related papers on this topic [408, 416].
MLP function that maps a smaller matrix to the parameter
matrix of prefixes, instead of directly optimizing the pre-
fixes. It has been shown that this trick is useful for stable         32. Here, prompt tuning denotes a category of related efficient tuning
                                                                   methods exemplified by the work [405, 410, 411], instead of a spe-
training. After optimization, the mapping function would           cific method as used in [405]. Indeed, the prefix based tuning meth-
be discarded, and only the derived prefix vectors are kept         ods [404, 409] can be also considered as prompting methods, which
to enhance task-specific performance. Since only the prefix        are called deep prompting tuning in [409]. In this survey, prompt tuning
parameters would be trained, it can lead to a parameter-           specially refer to the methods that only include the prompt tokens at
                                                                   the input layer, in the context of LLMs. We assign p-tuning v2 [409] to
efficient model optimization. Similar to prefix tuning, p-         the category of prefix tuning, because it incorporates layerwise prompts
tuning v2 [409] incorporates layer-wise prompt vectors into        in langauge models.
                                                                                                                                  44

5.3.2     Parameter-Efficient Fine-Tuning on LLMs                  task description and/or demonstrations in the form of natu-
With the rising of LLMs, efficient tuning has attracted            ral language text. In addition, chain-of-thought prompting [33]
increasing research attention for developing a more                can be employed to enhance in-context learning by involv-
lightweight adaptation approach in downstream tasks.               ing a series of intermediate reasoning steps in prompts.
    In particular, LoRA [149] has been widely applied              Furthermore, planning [432] is proposed for solving complex
to open-source LLMs (e.g., LLaMA and BLOOM) for                    tasks, which first breaks them down into smaller sub-tasks
parameter-efficient fine-tuning. Among these research at-          and then generates a plan of action to solve these sub-tasks
tempts, LLaMA and its variants have gained much atten-             one by one. We summarize representative work for these
tion for parameter-efficient tuning. For example, Alpaca-          prompting approaches in Table 11. Next, we will elaborate
LoRA [148] has been trained using LoRA as a lightweight            on the details of the four techniques.
tuned version of Alpaca [146] (a fine-tuned 7B LLaMA
model with 52K human demonstrations of instruction fol-            6.1     Prompting
lowing). There are extensive explorations of Alpaca-LoRA
ranging in different languages or model sizes, which can           As discussed in previous work [36], prompting is the major
be found in the collection page33 . A recent study LLaMA-          approach to utilizing LLMs for solving various tasks. Since
Adapter [417] inserts learnable prompt vectors into each           the quality of prompts will largely influence the perfor-
Transformer layer, in which zero-initialized attention has         mance of LLMs in specific tasks, there have been a series of
been proposed to improve the training by mitigating the            studies proposed to generate suitable task prompts through
influence of under-fitted prompt vectors. They also extend         manual creation or automatic optimization, which will be
this approach to a multi-modal setting, e.g., visual question      introduced in this section.
answering.
    Further, an empirical study [407] has been conducted           6.1.1    Prompt Creation
to examine the effect of different tuning methods on lan-          The process of manually creating a suitable prompt is also
guage models. They compare four efficient tuning methods           called prompt engineering [445, 446]. A well-designed prompt
including serial adapter tuning [406], parallel adapter tun-       is very helpful to elicit the abilities of LLMs for accomplish-
ing [408, 418], and LoRA [149], on three open-source LLMs,         ing specific tasks. In this part, we will first introduce the
namely GPT-J (6B), BLOOM (7.1B) and LLaMA (7B), for                key components of prompts and discuss several principles
evaluation. Based on the experimental results on six math          for prompt design. Then, we evaluate ChatGPT with differ-
reasoning datasets, they show that these efficient-tuning          ent prompts to show the results on several representative
methods under-perform the reference baseline GPT-3.5 on            tasks. We are aware that there have been several existing
difficult tasks, while achieving a comparable performance          papers [446, 447] and websites [448–450] that present the
on simple tasks. Overall, LoRA performs relatively well            suggestions and guidelines to design good prompts. As a
among these comparison methods, using significantly fewer          comparison, we mainly aim to discuss the key factors (ingre-
trainable parameters.                                              dients and principles) that are useful for prompt creation,
    As an important resource, the library PEFT [419] (stand-       and provide experimental results and analysis on popular
ing for parameter-efficient fine-tuning) has been released on      tasks as the reference to the beginners.
GitHub34 . It has included several widely used efficient tun-
ing methods, including LoRA [149]/AdaLoRA [414], prefix-           Key Ingredients. Typically, there are four key ingredients
tuning [404, 409], P-Tuning [410], and prompt-tuning [405].        that depict the functionality of a prompt for eliciting the
Further, it supports a number of language models such as           abilities of LLMs to complete the tasks, including task
GPT-2 and LLaMA, and also covers several representative            description, input data, contextual information, and prompt
vision Transformer models (e.g., ViT and Swin Transformer).        style. To have an intuitive understanding of our discussion,
    As discussed in Section 5.3.1, there have been a large         we also present three prompt examples for question answer-
number of efficient tuning methods proposed in the existing        ing, meta-review generation, and text-to-SQL in Table 13.
literature. However, most of these approaches are tested               • Task description. A task description is typically a specific
on small-sized pre-trained language models, instead of the         instruction that LLMs are expected to follow. In general, one
LLMs. So far, there still lacks a thorough investigation on        should clearly describe the task goal in natural language.
the effect of different efficient tuning methods on large-sized    For the tasks with special input or output format, detailed
language models at different settings or tasks.                    clarifications are often needed, and one can further utilize
                                                                   keywords to highlight the special settings for better guiding
                                                                   LLMs in task completion.
6     U TILIZATION                                                     • Input data. In common cases, it is straightforward to
After pre-training or adaptation tuning, a major approach          describe input data (e.g., an instance to be responded by
to using LLMs is to design suitable prompting strategies           LLMs) in natural language. For special input data, such
for solving various tasks. In existing literature, task-specific   as knowledge graph and table, it is necessary to apply an
prompts can be effectively learned through manual creation         appropriate and convenient way to make them readable
and automatic optimization. A representative prompting             for LLMs. For structured data, linearization is commonly
method is in-context learning [50, 55], which formulates the       used to transform the original records (e.g., knowledge
                                                                   triples) into sequences [451] due to the simplicity. Further,
    33. https://github.com/tloen/alpaca-lora                       the programming language (e.g., executable code) has also
    34. https://github.com/huggingface/peft                        been utilized to formulate the structured data, which can
                                                                                                                                                 45

TABLE 11: Typical LLM utilization methods and their key points for ICL, CoT, and planning. Note that the key points only
highlight the most important technical contribution.

 Approach           Representative Work                   Key Point
                    KATE [420]                            Demonstration selection (similar; k-NN)
                    EPR [421]                             Demonstration selection (dense retrieval; constrative learning)
 In-context         SG-ICL [422]                          Demonstration selection (LLM as the demonstration generator)
 Learning (ICL)     APE [423]                             Demonstration format (automatic generation & selection)
                    Structured Prompting [424]            Demonstration format (grouped context encoding; rescaled attention)
                    GlobalE & LocalE [425]                Demonstration order (entropy-based metric; probing set generation with LLM)
                    Complex CoT [426]                     Demonstration (complexity-based selection)
                    Auto-CoT [427]                        Demonstration (automatic generation)
 Chain-of-thought   Selection-Inference [428]             Generation (alternate between selection and inference)
 Prompting (CoT)    Self-consistency [429]                Generation (diverse paths; self-ensemble)
                    DIVERSE [430]                         Generation (diverse paths); Verification (step-wise voting)
                    Rationale-augmented ensembles [431]   Generation (rationale sampling)
                    Least-to-most prompting [432]         Plan generation (text-based; problem decomposition)
                    DECOMP [433]                          Plan generation (text-based; problem decomposition)
                    PS [434]                              Plan generation (text-based)
                    Faithful CoT [435]                    Plan generation (code-based)
                    PAL [436]                             Plan generation (code-based; Python)
                    HuggingGPT [437]                      Plan generation (code-based; models from HuggingFace)
 Planning           AdaPlanner [438]                      Plan refinement (skill memory)
                    TIP [439]                             Feedback acquisition (visual perception)
                    RAP [440]                             Feedback acquisition (LLM as the world model); Plan refinement (Monte Carlo Tree Search)
                    ChatCoT [441]                         Feedback acquisition (tool); Plan refinement (conversation between LLM and tools)
                    ReAct [442]                           Feedback acquisition (tool); Plan refinement (synergizing reasoning and acting)
                    Reflexion [443]                       Feedback acquisition (text-based self-reflection); Plan refinement (dynamic memory)
                    Tree of Thoughts [444]                Feedback acquisition (vote comparison); Plan refinement (tree-based search)



also support using external tools (e.g., program executor) to             need for clear and unambiguous directives when utilizing
produce the precise results [452, 453].                                   these models [66]. A clear and detailed description should
    • Contextual information. In addition to the task descrip-            contain various elements to explain a task, including task
tion and input data, contextual or background information                 objective, input/output data (e.g., “Given a long document, I
is also essential for specific tasks. For example, retrieved              want you to generate a concise summary.”), and the response
documents are highly useful for open-domain question                      constraints (e.g., “the length of the summary cannot exceed 50.”).
answering as supporting evidence. Both the quality of the                 By providing a well-clarified task description, LLMs can
retrieved documents and their relevance to the question                   more effectively understand the target task and generate the
have an impact on the generated answers [454]. Thus, it                   desired output.
needs to include such information in a proper prompt                           • Decomposing into easy, detailed sub-tasks. To solve com-
pattern or expression format. Furthermore, in-context task                plex tasks, it is important to decompose the difficult task
exemplars are also helpful for eliciting LLMs to accomplish               into several more easier, detailed sub-tasks for helping
a complex task, which can better depict the task goal, the                LLMs accomplish the goal step by step, which is closely re-
special output formats, and the mapping relation between                  lated to the planning technique in Section 6.4. For example,
input and output.                                                         following the suggestion [447], we can explicitly list the sub-
    • Prompt style. For different LLMs, it is important to                tasks in the form of multiple numbered items (e.g., “Braid a
design a suitable prompt style for eliciting their abilities to           coherent narrative by performing the following tasks: 1. ...; 2. ...; 3.
solve specific tasks. Overall, one should express the prompt              ...”). By decomposing a target task into sub-tasks, LLMs can
as a clear question or detailed instruction that can be well              focus on solving easier sub-tasks and finally achieve more
understood and answered. In some cases, it is also useful to              accurate results for complex tasks.
add the prefix or suffix to better guide LLMs. For example,                    • Providing few-shot demonstrations. As discussed in Sec-
using the prefix “Let us think step by step” can help elicit              tion 6.2, LLMs can benefit from in-context learning for
LLMs perform step-by-step reasoning, and using the prefix                 solving complex tasks, where the prompts contain a small
“You are an expert on this task (or in this domain)” can boost            number of task examples of the desired input-output pairs,
the performance of LLMs in some specific tasks. Further, for              i.e., few-shot demonstrations. Few-shot demonstrations can
chat-based LLMs (e.g., ChatGPT), instead of directly feeding              help LLMs learn the semantic mapping between input and
a long or complex task prompt, it is suggested to decompose               output without parameter tuning. In practice, it is suggested
it into multiple prompts for the sub-tasks and then feed                  that one should generate a few high-quality demonstrations
them into LLMs via a multi-turn conversation [441].                       for the target task, which would highly benefit the final task
                                                                          performance.
Design Principles. Based on the key ingredients of prompts,                    • Utilizing model-friendly format. Since LLMs are pre-
we summarize several critical design principles that can                  trained on specially constructed datasets, there are some
help create more effective prompts for solving various tasks.             prompt formats that can make LLMs better understand
   • Expressing the task goal clearly. Task descriptions should           the instruction. For example, as the OpenAI documentation
not be ambiguous or unclear, which likely lead to in-                     suggests, we can use ### or """ as a stop symbol to
accurate or inappropriate responses. This highlights the                  separate the instruction and context, which can be better
                                                                                                                                46

understood by LLMs. As a general guideline, most existing          specific output formats or require background knowledge.
LLMs perform a task better in English, thus it is useful to        Our example prompts provide more detailed task descrip-
employ English instructions to solve difficult tasks based on      tion (e.g., output format and task goal), which can help
machine translation.                                               ChatGPT better understand the complex task requirement
    • Adopting role-playing strategies. Since LLMs are pre-        for fulfilling it.
trained on extensive corpora containing diverse characters             • For mathematical reasoning tasks, it is more effective to
and dialogues, they possess an inherent ability for role-          design specific prompts based on the format of programming
playing. This feature can be harnessed through specific            language. For GSM8k, the designed prompt employs code-
prompts to enhance the corresponding capacity for some             formatted few-shot demonstrations to convert this mathe-
specific domains [455]. For instance, when solving a math          matical reasoning task into code generation task, which can
problem, we can use a prompt prefix like “You are an expert in     leverage the strong code synthesis ability of ChatGPT for
mathematics”. This enables LLMs to solve the problem from          solving mathematical problems. Further, with the help of an
an expert’s perspective, thereby leveraging their pretrained       external program executor, we are able to obtain more pre-
knowledge more effectively. By guiding LLMs with role-             cise results instead of using LLMs for arithmetic operation.
playing prompts, they can often generate more reasonable           As we can see, the performance is boosted from 78.47 to
and accurate solutions.                                            79.30 on GSM8k, indicating the usefulness of programming
                                                                   language in mathematical reasoning tasks.
Useful Tips. In addition to the design principles, we also
                                                                       • In knowledge utilization and complex reasoning tasks,
present a collection of useful prompt tips based on existing
                                                                   ChatGPT with proper prompts achieves comparable performance
work or our empirical experiences in Table 12. Note that
                                                                   or even outperforms the supervised baselines methods. In knowl-
these tips are suggested in a general manner, it does not
                                                                   edge utilization and complex reasoning tasks, ChatGPT
indicate that they are the best prompts for the corresponding
                                                                   with proper zero-shot or few-shot prompts can achieve
tasks. This part will be continuously updated with more
                                                                   comparable performance or even outperform the super-
guidelines or tips. We welcome readers to contribute to this
                                                                   vised methods, e.g., 31.21 (ChatGPT) v.s. 34.20 (supervised
collection of prompt tips. We present the detailed procedure
                                                                   baseline) on WikiFact. Despite that, ChatGPT still performs
to contribute to the prompt tips, at the link: https://github.
                                                                   worse than supervised baseline models on some specific
com/RUCAIBox/LLMSurvey/tree/main/Prompts.
                                                                   tasks (e.g., ARC and WikiFact), since these supervised mod-
Empirical Analysis. We further conduct empirical studies           els have been specially optimized with task-specific data.
to present the impact of prompts on task performance. To               • Through suitable prompt engineering, LLMs can handle
conduct the experiments, we select a variety of tasks that         some non-traditional NLP tasks. With the help of specific
span language generation, knowledge utilization, complex           prompts, ChatGPT can also accomplish non-traditional NLP
reasoning, structure data generation, and information re-          tasks, i.e., the general recommendation and conversational
trieval. For each task, we manually write a prompt that            recommendation. A key point is that these tasks can be
follows general guidelines introduced above. Note that the         well expressed or described in natural language. However,
tested prompts may not be the optimal for these tasks,             the performance of ChatGPT is still far from the referenced
since they mainly aim to help readers understand how to            performance in these tasks, as LLMs cannot directly fit these
write an effective prompt for solving different tasks. Also,       tasks, which require specific domain knowledge and task
we add a simplified prompt as the comparison for most              adaptation [355, 456].
tasks. Following the experimental settings in Section 7.4, we
examine the 3-shot performance of ChatGPT on complex               6.1.2   Prompt Optimization
reasoning tasks (Colored Objects and GSM8k), and zero-             Although manually creating task prompts is more intuitive,
shot performance on other tasks. We report the experimental        it is time consuming and, more importantly, models are
results in Table 17, where we also include the supervised          highly sensitive to the crafted prompts—improper prompts
performance in existing papers as reference.                       will lead to low task performance (as shown in Table 17).
    • Carefully designed prompts can boost the zero-shot or few-   Therefore, a large body of studies propose automatic opti-
shot performance of ChatGPT. By comparing the results of           mization approaches for discrete prompts and continuous
using different prompts on the same task, we can see that          prompts to achieve the optimal performance [404, 413]. In
using the carefully designed prompts can achieve better per-       this part, we will detail these studies from two perspectives,
formance than the simpler ones. In the carefully designed          i.e., discrete prompts and continuous prompts.
prompts, we provide a more clearly expressed task de-
scription (e.g., WMT and WikiFact), or use a model-friendly        Discrete Prompt Optimization. Discrete prompt is typically
format (e.g., GSM8k and OBQA). For example, for WikiFact           composed of a sequence of natural language tokens. Despite
task, the prompt with a more detailed task description leads       that the form is simple and flexible, optimizing prompts in
to a performance increase from 29.25 to 31.21.                     discrete space is a challenging problem due to the combina-
    • More complex tasks can benefit more from careful prompt      torial huge search space. To automatically search effective
engineering on ChatGPT. In the WikiFact and Colored Objects        prompts for downstream tasks, existing studies propose a
tasks, the designed prompts have greatly improved the per-         wide spectrum of discrete prompt optimization approaches,
formance of ChatGPT, i.e., from 23.61 to 28.47 on WikiFact         which are detailed as follows.
and from 53.20 to 66.75 on Colored Objects. It indicates               • Gradient-based approaches. This kind of approaches
the necessity of prompt engineering for LLMs to perform            aims to optimize the prompt search process by maximizing
well on complex tasks, since these tasks typically have            the output likelihood via gradient update [413, 458–461].
                                                                                                                                                                             47

TABLE 12: A collection of useful tips for designing prompts that are collected from online notes [446–449] and experiences
from our authors, where we also show the related ingredients and principles (introduced in Section 6.1.1). We abbreviate
principles as Prin. and list the IDs of the related principles for each prompt. 1 ⃝: expressing the task goal clearly; ⃝ 2:
decomposing into easy, detailed sub-tasks; 3⃝: providing few-shot demonstrations; ⃝ 4 : utilizing model-friendly format.

       Ingredient         Collected Prompts                                                                                                                             Prin.
                          T1. Make your prompt as detailed as possible, e.g., “Summarize the article into a short paragraph within 50 words. The major                   ⃝
                                                                                                                                                                         1
    Task Description      storyline and conclusion should be included, and the unimportant details can be omitted.”
                          T2. It is helpful to let the LLM know that it is an expert with a prefixed prompt, e.g., “You are a sophisticated expert in the                ⃝
                                                                                                                                                                         1
                          domain of compute science.”
                          T3. Tell the model more what it should do, but not what it should not do.                                                                      ⃝
                                                                                                                                                                         1
                          T4. To avoid the LLM to generate too long output, you can just use the prompt: “Question: Short Answer: ”. Besides, you can                    ⃝
                                                                                                                                                                         1
                          also use the following suffixes, “in a or a few words”, “in one of two sentences”.
                          I1. For the question required factual knowledge, it is useful to first retrieve relevant documents via the search engine, and                  ⃝
                                                                                                                                                                         4
      Input Data
                          then concatenate them into the prompt as reference.
                          I2. To highlight some important parts in your prompt, please use special marks, e.g., quotation (””) and line break (\n). You                  ⃝
                                                                                                                                                                         4
                          can also use both of them for emphasizing.
                          C1. For complex tasks, you can clearly describe the required intermediate steps to accomplish it, e.g., “Please answer the                     ⃝
                                                                                                                                                                         2
 Contextual Information   question step by step as: Step 1 - Decompose the question into several sub-questions, · · · ”
                          C2. If you want LLMs to provide the score for a text, it is necessary to provide a detailed description about the                              ⃝
                                                                                                                                                                         1
                          scoring standard with examples as reference.
                          C3. When LLMs generate text according to some context (e.g., making recommendations according to purchase history),                            ⃝
                                                                                                                                                                         2
                          instructing them with the explanation about the generated result conditioned on context is helpful to improve the quality
                          of the generated text.
                          C4. An approach similar to tree-of-thoughts but can be done in one prompt: e.g., Imagine three different experts are answering                 ⃝
                                                                                                                                                                         2
                          this question. All experts will write down one step of their thinking, then share it with the group of experts. Then all experts will go on
                          to the next step, etc. If any expert realizes they’re wrong at any point then they leave. The question is
                          D1. Well-formatted in-context exemplars are very useful, especially for producing the outputs with complex formats.                            ⃝3
                          D2. For few-shot chain-of-thought prompting, you can also use the prompt “Let’s think step-by-step”, and the few-shot                         ⃝⃝
                                                                                                                                                                        1   3
                          examples should be separated by “\n” instead of full stop.
                          D3. You can also retrieve similar examples in context to supply the useful task-specific knowledge for LLMs. To retrieve                      ⃝⃝
                                                                                                                                                                        3 4
     Demonstration        more relevant examples, it is useful to first obtain the answer of the question, and then concatenate it with the question for
                          retrieval.
                          D4. The diversity of the in-context exemplars within the prompt is also useful. If it is not easy to obtain diverse questions,                 ⃝
                                                                                                                                                                         3
                          you can also seek to keep the diversity of the solutions for the questions.
                          D5. When using chat-based LLMs, you can decompose in-context exemplars into multi-turn messages, to better match the                           ⃝
                                                                                                                                                                         3
                          human-chatbot conversation format. Similarly, you can also decompose the reasoning process of an exemplars into multi-turn
                          conversation.
                          D6. Complex and informative in-context exemplars can help LLMs answer complex questions.                                                       ⃝3
                          D7. As a symbol sequence can typically be divided into multiple segments (e.g., i1 , i2 , i3 −→ i1 , i2 and i2 , i3 ), the preceding          ⃝⃝
                                                                                                                                                                        2   3
                          ones can be used as in-context exemplars to guide LLMs to predict the subsequent ones, meanwhile providing historical
                          information.
                          D8. Order matters for in-context exemplars and prompts components. For very long input data, the position of the question                      ⃝
                                                                                                                                                                         3
                          (first or last) may also affect the performance.
                          D9. If you can not obtain the in-context exemplars from existing datasets, an alternative way is to use the zero-shot                          ⃝
                                                                                                                                                                         3
                          generated ones from the LLM itself.

                          O1. Let the LLM check its outputs before draw the conclusion, e.g., “Check whether the above solution is correct or not.”                      ⃝
                                                                                                                                                                         2
                          O2. If the LLM can not well solve the task, you can seek help from external tools by prompting the LLM to manipulate                           ⃝
                                                                                                                                                                         4
                          them. In this way, the tools should be encapsulated into callable APIs with detailed description about their functions, to
     Other Designs        better guide the LLM to utilize the tools.
                          O3. The prompt should be self-contained, and better not include pronouns (e.g., it and they) in the context.                                   ⃝
                                                                                                                                                                         1
                          O4. When using LLMs for comparing two or more examples, the order affects the performance a lot.                                               ⃝
                                                                                                                                                                         1
                          O5. Before the prompt, assigning a role for the LLM is useful to help it better fulfill the following task instruction, e.g., “I               ⃝
                                                                                                                                                                         1
                          want you to act as a lawyer”.
                          O6. OpenAI models can perform a task better in English than other languages. Thus, it is useful to first                                       ⃝
                                                                                                                                                                         4
                          translate the input into English and then feed it to LLMs.
                          O7. For multi-choice questions, it is useful to constrain the output space of the LLM. You can use a more detailed explanation                 ⃝
                                                                                                                                                                         1
                          or just imposing constraints on the logits.
                          O8. For sorting based tasks (e.g., recommendation), instead of directly outputting the complete text of each item after sorting,               ⃝
                                                                                                                                                                         1
                          one can assign indicators (e.g., ABCD) to the unsorted items and instruct the LLMs to directly output the sorted indicators.




As a representative work, Auto-Prompt [413] proposes a                                 the gradient on continuous space during optimization.
gradient-guided method to greedily search the optimal to-
ken for each position of the prompt, leveraging the gradient                              • RL-based approaches. Since discrete prompts are difficult
approximated by the change in the log-likelihood when                                  to be learned through gradient back-propagation, a num-
replacing a prompt token with another candidate token                                  ber of studies propose to formulate the discrete prompt
from vocabulary. However, such a search process can be ex-                             optimization as a reinforcement learning (RL) problem and
tremely expensive since it needs to evaluate each candidate                            leverage RL algorithms for optimization [462–465]. For ex-
token for each position of the prompt, leading to a num-                               ample, RLPrompt [462] trains a policy network to generate
ber of additional forward passes. Therefore, an improved                               desired prompts with multiple reward functions. In this
gradient method [458] has been proposed by transforming                                approach, several effective reward stabilization strategies
discrete tokens into continuous embeddings and computing                               are also proposed to enhance the RL training efficiency.
                                                                                       Compared to previous work that requires sufficient data
                                                                                                                                                48

TABLE 13: Example instructions collected from [447, 457]. The blue text denotes the task description, the red text denotes
the contextual information, the green text denotes the demonstrations, and the gold text denotes the prompt style.

 Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write “I could not find an
 answer.”
 Articles: “““Joao Moutinho is a Portuguese footballer who last played as a central midfielder for Premier League club Wolverhampton Wanderers
 and the Portugal national team.”””
 Question: Is the following sentence plausible? ’Joao Moutinho was out at third.’
 Answer: Let’s think step by step. Joao Moutinho is a soccer player. Being out at third is part of baseball, not soccer. So the answer is No.
 ...
 <Demonstrations>

 Articles: <insert articles, each delimited by triple quotes>
 Question: <insert question>
 Answer:

 Prepare a meta-review by answering the following questions from the reviewer comments (provided after the questions).
 1. Based on the reviewer’s comments, what are the core contributions made by this manuscript?
 2. What are the common strengths of this work, as mentioned by multiple reviewers?
 3. What are the common weaknesses of this work, as highlighted by multiple reviewers?
 4. What suggestions would you provide for improving this paper?
 5. What are the missing references mentioned by the individual reviews?
 The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
 Meta-review: <insert meta-review>
 ...
 <Demonstrations>

 Provide justification for your response in detail by explaining why you made the choices you actually made. A good output should be coherent,
 highlight major strengths/issues mentioned by multiple reviewers, be less than 400 words in length, and finally, the response should be in English
 only.

 The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
 Meta-review:

 CREATE TABLE Highschooler (
 ID int primary key,
 name text,
 grade int
 );
 /*
 3 example rows:
 SELECT * FROM Highschooler LIMIT 3;
 ID name grade
 1234 Janie 8
 5678 Mary 8
 9012 Mike 9
 */
 Using valid SQLite, answer the following questions for the tables provided above.
 Question: What is Kyle’s id?
 SQL: SELECT ID FROM Highschooler WHERE name=“Kyle”;
 ...
 <Demonstrations>

 Question: <insert question>
 SQL:



for training, TEMPERA [463] proposes to edit prompts at                   ally demanding for ever larger models, and may not be fea-
test time by utilizing a pre-trained RL agent to sequentially             sible for API-based model calls (e.g., ChatGPT). Therefore,
edit different parts of a manually-written initial prompt. Al-            another line of work aims to directly edit existing prompts
though these methods are simple and effective, they explore               based on the task performance. Specifically, GPS [466] bor-
a manually defined edit space (e.g., add, swap and delete)                rows an idea from the genetic algorithm and proposes
and focus on modifying the original prompt, which limits                  a genetic prompt search method that utilizes a language
the flexibility of prompt search. In contrast, PRewrite [465]             model (i.e., T5) to edit prompts by taking the cloze task form.
employs RL to train a prompt rewriter for generating new                  In addition to model based edit methods, human-defined
prompts instead of modification, which does not impose                    operations can be also employed for prompt editing [467],
any restrictions in the prompt rewriting and offers improved              including delete, swap, paraphrase, and addition. Based
flexibility in the action space.                                          on these operations, they iteratively edit the prompts and
                                                                          greedily search for the best prompt guided by the model
   • Edit-based approaches. For the above methods, gradient-              performance on a small pool of examples.
based and RL-based tuning can be extremely computation-
                                                                                                                                    49

    • LLM-based approaches. Due to the exceptional capacities     layer in language models, while prompt tuning [405] only
of LLMs, an increasing number of studies directly leverage        incorporates trainable prompt vectors at the input layer. By
LLMs as prompt generator [468–475]. Specifically, APE [468]       fixing the large-scale parameters of LLMs and only tuning
utilizes an LLM to generate initial prompts, then selects the     continuous prompt vector, this kind of approaches can be
best prompt with the highest accuracy, and finally improves       extremely parameter-efficient (Section 5.3). However, these
the best candidate through an iterative Monte Carlo search        approaches are typically independent of the inputs, lacking
method. However, this method does not effectively con-            sufficient consideration of input semantics. Therefore, the
strain the prompt search space, which might likely lead           authors in [477] propose context tuning, where the continu-
to unstable results. To achieve good performance and fast         ous prompts are derived based on the input text and learned
convergence, one line of work utilizes heuristic methods          through the downstream task losses.
(e.g., evolutionary algorithms [473, 474] and adversarial             • Prompt transferring with scarce data. Supervised learn-
learning [475]) for prompt optimization. Another line of          ing approaches demand in sufficient training data to learn
work draws an analogy to gradient-based model optimiz-            optimal continuous prompts, which may not work well
ers for LLM-based prompt optimization.         For example,       in data-scarce domains and tasks. To address this prob-
APO [469] instructs the LLM to generate text feedback on          lem, SPoT [478] proposes a prompt-based transfer learning
how to refine an old prompt into new improved prompts             approach, which first learns a single continuous prompt
and then execute textual gradient descent. However, their         for several representative source tasks and then uses this
search in the prompt space might be inefficient without           prompt to initialize the prompt for a target task. However,
fully considering the whole refinement trace of previous          this approach leverages the same prompt for solving all
prompts, thus potentially leading to sub-optimal results.         instances of the target task. For a single task, even a well-
Therefore, some recent studies [470, 471] incorporate the         learned prompt may not be suitable for all the data instances
previous prompts with their scores to instruct LLMs for           from a large population. To address this issue, an improved
progressively generating better new prompts. To further           method [479] designs an adaptive attention mechanism dur-
design formalized guidelines about the design of prompt           ing the prompt transfer process to derive the target prompts,
optimizers, GPO [472] conducts a systematic analogy for           considering both task- and instance-level information. The
LLM-based prompt optimizers with gradient-based model             prompt transfer paradigm can leverage the knowledge of
optimizers. It further develops a more formal LLM-based           data-sufficient source tasks encoded in source prompts for
prompt optimization framework, which extensively bor-             solving data-scarce target tasks.
rows the idea of machine learning optimization. Specifally,
it retrieves relevant prompts from the previous prompts
and utilizes the generation-based refinement strategy to
perform the update. In order to avoid large variation at each
iteration, GPO further adopts a cosine-based decay strategy
to control the edit distance. However, these approaches still     6.2     In-Context Learning
struggle in exploring the vast space of effective prompts.
Inspired by human-like trial-and-error, prompt optimization       As a special prompting form, in-context learning (ICL) is
is further formulated as a strategic planning problem [476]       first proposed along with GPT-3 [55], which has become a
and uses Monte Carlo tree search to navigate the vast             typical approach to utilizing LLMs.
prompt space.

Continuous Prompt Optimization. Different from discrete
prompts, continuous prompts consist of a set of continuous
                                                                  6.2.1    ICL Formulation
embeddings, which can be directly optimized through the
gradient update based on the loss of downstream tasks.
Note that continuous prompt optimization has been mainly          As stated in [55], ICL uses a formatted natural language
studied in PLMs, but draws limited attention in era of LLMs       prompt, consisting of the task description and/or a few task
due to their massive magnitudes of parameters. We include         examples as demonstrations. Figure 14 presents an illustra-
the discussion of this part for content completeness. In prior    tion of ICL. First, starting with a task description, a few ex-
work, most studies typically rely on supervised learning to       amples are selected from the task dataset as demonstrations.
train continuous prompts based on task data. Furthermore,         Then, they are combined in a specific order to form nat-
in data-scarce scenarios, transfer learning methods can be        ural language prompts with specially designed templates.
employed to alleviate the lack of labeled data on target tasks.   Finally, the test instance is appended to the demonstration
These two approaches are detailed below.                          as the input for LLMs to generate the output. Based on task
    • Prompt learning with sufficient data. In this approach,     demonstrations, LLMs can recognize and perform a new
most existing methods regard continuous prompts as train-         task without explicit gradient update.
able model parameters and then leverage supervised learn-             Formally, let Dk = {f (x1 , y1 ), . . . , f (xk , yk )} represent
ing to optimize the continuous prompts by minimizing              a set of demonstrations with k examples, where f (xk , yk ) is
the cross-entropy loss based on sufficient downstream task        the prompt function that transforms the k -th task example
data [404, 405, 409, 477]. As discussed in Section 5.3.1,         into natural language prompts. Given the task description
prefix tuning [404] prepends a sequence of prefixes (i.e.,        I , demonstration Dk , and a new input query xk+1 , the
a set of trainable continuous vectors) to each Transformer        prediction of the output ŷk+1 generated from LLMs can be
                                                                                                                                        50

formulated as follows35 :                                                  can be utilized to directly measure the informativeness
                                                                          of each example according to the performance gain after
 LLM I, f (x1 , y1 ), . . . , f (xk , yk ), f (xk+1 ,        ) → ŷk+1 .   adding the example [486]. In addition, EPR [421] proposes
        |              {z               }      | {z } |{z}
                   demonstrations            input   answer                a two-stage retrieval approach that first recalls similar ex-
                                                            (11)           amples with an unsupervised method (e.g., BM25) and then
where the actual answer yk+1 is left as a blank to be                      ranks them using a dense retriever (trained with positive
predicted by the LLM. Since the performance of ICL heavily                 and negative examples labeled by LLMs). As an alterna-
relies on demonstrations, it is important to properly design               tive approach, the task of demonstration selection can be
them in the prompts. According to the construction process                 formulated into a RL problem, where LLMs serve as the
in Equation (11), we focus on three major aspects of for-                  reward function to provide feedback for training the policy
matting demonstrations in the prompts, including how to                    model [487]. Since LLMs perform well for text annota-
select examples that make up demonstrations, format each                   tion [488], some recent studies employ LLM itself as the
example into the prompt with the function f (·), and arrange               demonstration generator without human intervention [489].
demonstrations in a reasonable order.
                                                                               To summarize, as discussed in [490], the selected demon-
     A comprehensive review of ICL has been presented in
                                                                           stration examples in ICL should contain sufficient informa-
the survey paper [50], and we suggest the readers refer-
                                                                           tion about the task to solve as well as be relevant to the test
ring to it for a more general, detailed discussion on this
                                                                           query, for the above two selection approaches.
topic. Compared with this survey, we specially focus on the
discussion of applying ICL to LLMs in two major aspects,                   Demonstration Format. After selecting task examples, the
i.e., demonstration design and the underlying mechanism                    next step is to integrate and format them into a natural
of ICL. Also, ICL has a close connection with instruction                  language prompt for LLMs. A straightforward method is to
tuning (discussed in Section 5.1) in that both utilize nat-                instantiate a pre-defined template with the corresponding
ural language to format the task or instances. However,                    input-output pairs [36]. To construct more informative tem-
instruction tuning needs to fine-tune LLMs for adaptation,                 plates, recent studies consider adding task descriptions [69]
while ICL only prompts LLMs for utilization. Furthermore,                  or enhancing the reasoning capability of LLMs with chain-
instruction tuning can enhance the ICL ability of LLMs to                  of-thought prompts [33]. For instance, in [179], the authors
perform target tasks, especially in the zero-shot setting (only            collect a large-scale dataset with task descriptions written by
using task descriptions) [69].                                             humans. After tuning with this dataset, the performance on
                                                                           seen tasks can be boosted, and LLMs can also generalize to
6.2.2 Demonstration Design                                                 unseen tasks to some extent. To reduce the annotation costs,
Several studies have shown that the effectiveness of ICL                   a semi-automated approach has been proposed in [147]
is highly affected by the design of demonstrations [425,                   by employing a seed set consisting of human-written task
480, 481] Following the discussion in Section 6.2.1, we will               descriptions to guide LLMs to generate task descriptions
introduce the demonstration design of ICL from three major                 for new tasks. Since it is costly to manually annotate
aspects, i.e., demonstration selection, format, and order.                 demonstration formats for different tasks, some work also
Demonstration Selection. The performance of ICL tends                      studies how to automatically generate high-quality ones.
to have a large variance with different demonstration exam-                As two representative methods, Auto-CoT [427] leverages
ples [420], so it is important to select a subset of examples              LLMs with the zero-shot prompt “Let’s think step by step”
that can effectively leverage the ICL capability of LLMs.                  for generating intermediate reasoning steps, while least-to-
There are two main demonstration selection approaches,                     most prompting [432] first queries LLMs to perform prob-
namely heuristic and LLM-based approaches:                                 lem decomposition and then utilizes LLMs to sequentially
    • Heuristic approaches. Due to their simplicity and low                solve sub-problems based on the intermediate answers to
costs, existing work widely adopts heuristic methods to                    previously solved ones.
select demonstrations. Several studies employ a k -NN based
                                                                           Demonstration Order. LLMs are shown to sometimes suffer
retriever to select examples that are semantically relevant to
                                                                           from the recency bias, i.e., they are prone to repeat answers
the query [420, 482]. However, they perform the selection
                                                                           that are near the end of demonstrations [481]. Thus, it is
individually for each example, rather than evaluating the
                                                                           important to arrange demonstrations (i.e., task examples)
example set as a whole. To resolve this issue, diversity-
                                                                           in a reasonable order. Early work proposes several heuris-
based selection strategies are proposed to choose the most
                                                                           tic methods to quickly find a good order. For example,
representative set of examples for specific tasks [483, 484].
                                                                           demonstrations can be directly organized according to their
Furthermore, in [485], both relevance and diversity are taken
                                                                           similarity to the query in the embedding space [420]: the
into consideration when selecting demonstrations.
                                                                           more similar, the closer to the end. In addition, global
    • LLM-based approaches. Another line of work selects
                                                                           and local entropy metrics can be used to score different
demonstrations by making use of LLMs. For example, LLMs
                                                                           demonstration orders [425]. To integrate more task infor-
  35. When ICL was introduced in the GPT-3’s paper [55], it was            mation, some recent studies propose to minimize the code
originally defined to be a combination of the task description and         length required to compress and transmit task labels, which
demonstration examples, wherein either component is dispensable.           is inspired by information theory [491]. However, these
Following this definition, when a LLM is required to solve an unseen       methods need additional labeled data as the validation
task by using only task descriptions, it can be also considered to
perform ICL for task solving, whereas the ICL ability can be enhanced      set to evaluate the performance of specific demonstration
by instruction tuning.                                                     orders. To eliminate this need, the authors in [425] propose
                                                                                                                                                             51

                         In-Context Learning                                                              Chain-of-Thought Prompting
        Answer the following mathematical reasoning questions:                            Answer the following mathematical reasoning questions:

        Q: If you have 12 candies and you give 4 candies to your friend,                  Q: If a rectangle has a length of 6 cm and a width of 3 cm,
           how many candies do you have left?                                                  what is the perimeter of the rectangle?
 Nx     A: The answer is 8.                                                          Nx   A: For a rectangle, add up the length and width and double it.
           If a rectangle has a length of 6 cm and a width of 3 cm,
        Q: what is the perimeter of the rectangle?                                             So, the perimeter of this rectangle is (6 + 3) x 2 = 18 cm.

        A: The answer is 18 cm.                                                                The answer is 18 cm.

        Q: Sam has 12 marbles. He gives 1/4 of them to his sister.                        Q: Sam has 12 marbles. He gives 1/4 of them to his sister.
        How many marbles does Sam have left?                                              How many marbles does Sam have left?



                                                                                                      A: He gives (1 / 4) x 12 = 3 marbles.
               A: The answer is 9.                                             LLM                    So Sam is left with 12 – 3 = 9 marbles.
                                                                                                      The answer is 9.


                           : Task description                : Demonstration                : Chain-of-Thought                    : Query

Fig. 14: A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs
with a natural language description, several demonstrations, and a test query, while CoT prompting involves a series of
intermediate reasoning steps in prompts.


to sample the validation data from the LLM itself.                               obtained from pre-training to solve new test tasks. A Proba-
                                                                                 bly Approximately Correct (PAC) framework [496] has been
6.2.3   Underlying Mechanism                                                     proposed to assess the learnability of ICL. It assumes that
                                                                                 there exists a latent variable representing the task in the pre-
After pre-training, LLMs can exhibit intriguing ICL capabil-
                                                                                 training data, and LLMs have been shown to be capable
ity without being updated. In what follows, we discuss two
                                                                                 of capturing this variable from demonstrations, enabling
key questions about the ICL ability of LLMs, i.e., “how does
                                                                                 them to recognize the task in ICL. Also, the interpretation
pre-training affect the ICL ability” and “how do LLMs perform
                                                                                 of ICL as task recognition is supported by several empir-
ICL during inference”.
                                                                                 ical studies [480, 497]. For example, it has been observed
How Pre-Training Affects ICL? ICL is first proposed in                           that replacing the inputs or labels of demonstrations with
GPT-3 [55], and it has been shown that the ICL ability                           random ones sampled from the input or label space does
becomes more significant with a larger model size. Further,                      not seriously hurt the performance of LLMs, indicating that
some studies reveal that small-scale PLMs can also demon-                        LLMs mainly recognize the target task from demonstrations
strate a strong ICL ability by continual pre-training [492]                      instead of learning from them [480, 495]. Similarly, LLMs
or fine-tuning [493] on specially designed training tasks,                       can exhibit decent performance even if the prompt template
which typically involve additional task examples in the                          is irrelevant or misleading [497].
input during the training process. It suggests that the design
of training tasks is an important influence factor on the ICL                        • Task learning. In the second way, LLMs learn new tasks
capability of LLMs. Besides training tasks, recent studies                       unseen in the pre-training stage only through demonstra-
have also investigated the relationship between ICL and                          tions. Specially, task learning is analyzed mainly from the
pre-training corpora [490, 494]. For example, ICL can be                         perspective of gradient descent and considered as implicit
theoretically explained as the product of pre-training on                        fine-tuning [65, 498]. Then, ICL can be explained as follows:
documents that exhibit long-range coherence [490]. Fur-                          by means of forward computation, LLMs generate meta-
ther, another study [494] theoretically analyzes that when                       gradients with respect to demonstrations and implicitly per-
scaling parameters and data, LLMs based on next-word                             form gradient descent via the attention mechanism. Exper-
prediction can emerge the ability of ICL by learning from                        iments also show that certain attention heads in LLMs are
the compositional structure (e.g., how words and phrases                         capable of performing task-agnostic atomic operations (e.g.,
are combined to form larger linguistic units like sentences)                     copying and prefix matching), which are closely related to
present in language data.                                                        the ICL ability [499]. Furthermore, some studies abstract
                                                                                 ICL as an algorithm learning process [500]. For example, the
How LLMs Perform ICL? At the inference stage, researchers                        authors in [500] find that LLMs essentially encode implicit
focus on analyzing how the ICL capability operates based                         models through their parameters during pre-training. With
on given demonstrations since no explicit learning or updat-                     the examples provided in ICL, LLMs can implement learn-
ing is involved. According to the discussion in [495], there                     ing algorithms such as gradient descent or directly compute
are two main ways for LLMs to utilize demonstrations: task                       the closed-form solution to update these models during
recognition and task learning.                                                   forward computation. Under this explanation framework,
    • Task recognition. In the first way, LLMs recognize the                     it has been shown that LLMs can effectively learn simple
task from demonstrations and utilize the prior knowledge                         linear functions and even some complex functions like deci-
                                                                                                                              52

sion trees with ICL [500].                                      that require rigorous logic for reasoning. Considering this,
    As discussed in a recent study [495], LLMs exhibit the      some work uses code [508, 509] due to its structured and
abilities of both task recognition and task learning in ICL,    precise nature. Furthermore, the authors in [510] propose
but the two abilities seem to be possessed with different       to dynamically select text or code as the format of CoTs to
model scales. As shown in the experiments [495], the ability    combine their advantages.
of task recognition is easier to obtain, and even a small LM
with only 350M parameters can exhibit this ability, while       6.3.2 Improved CoT Prompting Strategies
task learning can only emerge for LLMs with at least 66B        Despite the performance improvement in complex reason-
parameters. Another study [501] also supports this find-        ing tasks, CoT prompting still suffers from problems like
ing with specially designed experiments. They set up the        incorrect reasoning and instability. In this part, we first
tasks with flipped and semantically unrelated labels in the     introduce how to design better CoT prompts and enhanced
experiment, which require task learning when performing         CoT generation strategies, and then introduce the extension
ICL. The results suggest that small LMs tend to disregard       of the basic chain structure of CoT. Figure 15 illustrates the
the labels and mainly depend on their prior knowledge           evolution of representative CoT prompting strategies.
to accomplish the task, while LLMs have the ability to
                                                                Better Prompt Design. Since CoT prompting relies on
surpass their prior knowledge and acquire new knowledge
                                                                prompts to elicit the reasoning capabilities of LLMs, the
from demonstrations, resulting in better outcomes. Further-
                                                                design of prompts is critical to its performance. As a di-
more, to improve the task learning ability, Meta-In-Context
                                                                rect approach, it is shown that using diverse CoTs (i.e.,
Learning [502] proposes to include multiple related tasks
                                                                multiple reasoning paths for each problem) can effectively
instead of just a single one in the prompt. In addition,
                                                                enhance the performance [430]. Another intuitive idea is
Symbol Tuning [503] fine-tunes LLMs on demonstrations
                                                                that prompts with more complex reasoning paths are more
with semantically unrelated labels (e.g., foo/bar instead of
                                                                likely to elicit the reasoning ability of LLMs [426], which
positive/negative for sentiment analysis), forcing LLMs to
                                                                can result in higher accuracy in generating correct an-
learn the task from demonstrations instead of relying on
                                                                swers. However, all these approaches rely on annotated CoT
prior knowledge.
                                                                datasets, which limits their use in practice. To overcome
                                                                this limitation, magic instructions such as “Let’s think step
6.3   Chain-of-Thought Prompting                                by step” can be used to automatically construct CoTs by
Chain-of-Thought (CoT) prompting [33, 504] is an improved       prompting LLMs [427].
prompting strategy to boost the performance of LLMs on          Enhanced CoT Generation. Since LLMs are prone to
complex reasoning tasks, such as arithmetic reasoning [505],    producing incorrect reasoning steps and exhibiting insta-
commonsense reasoning [506], and symbolic reasoning [33].       bility in the generation process, there are a number of
Instead of simply constructing the prompts with input-          studies [429, 511] to improve the generation of CoT. In this
output pairs like ICL, CoT prompting further incorporates       part, we will introduce two typical approaches to enhancing
intermediate reasoning steps, which serve as the bridge be-     the generation of CoT: sampling- and verification-based
tween inputs and outputs. Figure 14 presents an illustration    methods.
of CoT. In the following part, we will first elaborate on the        • Sampling-based methods. LLMs are known to suffer
basic CoT prompting approach and its improved strategies,       from instability during inference, which can lead to un-
then discuss when and why CoT prompting works.                  faithfulness in the generated reasoning steps. To address
                                                                this issue, some work proposes to sample multiple rea-
6.3.1 Basic CoT Prompting Approach                              soning paths instead of using greedy decoding. As a rep-
CoT prompting is first proposed as an extension of ICL [33],    resentative solution, self-consistency [429] first generates
which augments each demonstration ⟨input, output⟩ as            several reasoning paths and then takes an ensemble over
⟨input, CoT, output⟩. A CoT is a series of intermediate         the corresponding answers, selecting the most consistent
reasoning steps for connecting the input and output. With       one through majority voting. However, such a method can
these augmented demonstrations, LLMs can follow them to         still lead to wrong answers when most of the reasoning
generate CoTs and the answer for a new input. However,          paths are misled. Considering this, the authors in [426] only
unlike ⟨input, output⟩ pairs in ICL, CoTs are difficult to      vote on the k most complex reasoning paths based on their
obtain and usually require human annotation. Fortunately,       observation that reasoning paths with higher complexity
it has been found that LLMs can be triggered to generate        (e.g., more reasoning steps) usually have better performance.
CoTs through simple instructions like “Let’s think step by      Furthermore, MCR [512] proposes referring to the steps
step.” [507], making CoT prompting easy to use. There are       from other reasoning paths when generating the next step,
also alternative magic prompts that can elicit the ability      and performs reasoning across multiple reasoning paths to
of CoT reasoning and further improve the performance of         generate the final answer.
LLMs, such as “Take a deep breath and work on this problem           • Verification-based methods. The sequential nature of rea-
step-by-step.” [470].                                           soning steps in CoTs can lead to the accumulation of errors
    As illustrated in Figure 15, the generation process of      in the generated CoTs when certain steps are incorrect. To
CoT follows a chain structure in the basic CoT prompt-          mitigate this problem, recent studies propose to verify the
ing approach, where LLMs generate CoTs step by step.            correctness of generated reasoning steps with either trained
Typically, CoT takes the format of natural language text.       verifiers or LLMs themselves. For example, DIVERSE [511]
However, textual CoTs may not work well on complex tasks        trains solution-level and step-level verifiers respectively to
                                                                                                                                                  53


                             Sampling-                Verification-
      CoT                    based CoT                 based CoT                                    ToT                        GoT
      Input                    Input                        Input                                  Input                       Input




       ...             ...       ...      ...            Verification



                                                    ✖️                  ✖️
                                                              ...
                             Ensemble



     Output                    Output                       Output                                 Output                      Output



              Reason          Backtrack         Aggregate                    Unevaluated thought            Positive thought   Negative thought



Fig. 15: An illustration of the evolution of CoT prompting strategies. It begins with the basic CoT approach and progresses
to enhanced CoT generation techniques, including sampling-based and verification-based methods. Finally, it extends to
variations of the chain structure, such as trees and graphs. Here, “thought” refers to an intermediate reasoning step as
stated in [33, 444].


examine the reasoning steps at different granularities. An-                      with LLMs, making the thought exploration process highly
other approach [513] utilizes LLMs to verify the correctness                     inefficient.  To reduce potentially meaningless thought
of reasoning steps through step-by-step self-verification                        exploration, XoT [521] further proposes to guide the search
with a specially designed reasoning format. In addition,                         of thoughts with pre-trained policy and value networks.
several studies propose backward reasoning for verification:
it first deduces the necessary question conditions [514, 515]                    6.3.3     Further Discussion on CoT Prompting
or variables [516] from the model’s predictions, and then
compares them with the original ones.                                            In this part, we present discussions regarding two funda-
                                                                                 mental questions related to CoT prompting, i.e., “when does
Reasoning Structure Extension. Despite the generality, the                       CoT prompting work for LLMs” and “why can LLMs perform
chain reasoning structure of basic CoT prompting limits its                      CoT reasoning”.
effectiveness in solving complex tasks, which require ex-
                                                                                 When CoT Prompting Works For LLMs? Since CoT reason-
ploration like foresight and backtracking during inference.
                                                                                 ing is an emergent ability [31], it only has a positive effect
Therefore, many studies have been devoted to extending
                                                                                 on sufficiently large models (typically containing 10B or
the reasoning structure by designing more intricate thought
                                                                                 more parameters [33]) but not on small models. Moreover,
processes, e.g., tree- and graph-structured reasoning.
                                                                                 since CoT prompting augments the standard prompting
    • Tree-structured reasoning. This approach (exemplified by
                                                                                 with intermediate reasoning steps, it is mainly effective
Tree of Thoughts (ToT) [444, 517]) formulates the reasoning
                                                                                 for the tasks that require step-by-step reasoning [33], e.g.,
process in a hierarchical tree structure, where intermediate
                                                                                 arithmetic reasoning, commonsense reasoning, and sym-
thoughts are nodes. In this way, it enables LLMs to explore
                                                                                 bolic reasoning. Whereas, for other tasks that do not rely
multiple reasoning paths in parallel and further supports
                                                                                 on complex reasoning, CoT prompting might lead to worse
the operation of lookahead and backtracking to facilitate
                                                                                 performance than standard prompting [431], e.g., MNLI-
more comprehensive decisions. In addition, TouT [518] takes
                                                                                 m/mm, SST-2, and QQP from GLUE [279]. Interestingly, it
the uncertainty of intermediate thoughts into account for
                                                                                 seems that the performance gain brought by CoT prompting
thought evaluation based on Monte Carlo Dropout.
                                                                                 could be significant only when standard prompting yields
    • Graph-structured reasoning. Although the tree structure                    poor results [33].
facilitates parallel reasoning, it also imposes restrictions on
the reasoning process. With more complex topological struc-                      Why LLMs Can Perform CoT Reasoning? As the second
tures, graphs offer greater flexibility in reasoning, enabling                   question, we discuss the underlying mechanism of CoT
the characterization of more intricate relationships and in-                     prompting in the following two aspects.
teractions. For instance, Graph of Thoughts (GoT) [519, 520]                         • The source of CoT reasoning ability. Regarding the source
conceptualizes the reasoning process as an arbitrary graph,                      of CoT reasoning capability, it is widely hypothesized that it
where vertices denote intermediate thoughts and edges                            can be attributed to training on code since models trained on
denote the interdependence between these thoughts. Com-                          it show a strong reasoning ability [47, 522, 523]. Intuitively,
pared with ToT, it can further utilize thoughts from other                       code data is well organized with algorithmic logic and
reasoning paths when generating new thoughts. However,                           programming flow, which may be useful to improve the rea-
such an approach requires a large number of interactions                         soning performance of LLMs. However, this hypothesis still
                                                                                                                                            54

lacks publicly reported evidence of ablation experiments
                                                                                                    Planning
(with and without training on code). In addition, instruction                    Task                                          Result
                                                                                                   Framework
tuning seems not to be the key reason for obtaining the CoT
reasoning ability, since it has been empirically shown that
instruction tuning on non-CoT data does not improve the                     Task Planner               Plan
                                                                                                                            Plan Executor
performance on held-out CoT reasoning benchmarks [69].                         (LLM)           (generate & refine)
    • The effect of CoT prompting components. The major dis-
tinction between CoT prompting and standard prompting                                      Feedback                Action
is the incorporation of reasoning paths prior to the final
answer. Thus, some researchers investigate the effects of                       Memory              Environment                 Tool
different components in the reasoning paths. Specifically,
a recent study identifies three key components in CoT
                                                                            Internal                          External
prompting, namely symbols (e.g., numerical quantities in
arithmetic reasoning), patterns (e.g., equations in arithmetic
reasoning), and text (i.e., the rest of tokens that are not                                                                         …
symbols or patterns) [524]. It is shown that the latter two                     LLM         Human                 World           Others
parts (i.e., patterns and text) are essential to the model
performance, and removing either one would lead to a
significant performance drop. However, the correctness of               Fig. 16: An illustration of the formulation for prompt based
symbols and patterns does not seem critical. Further, there             planning by LLMs for solving complex tasks.
exists a symbiotic relationship between text and patterns:
the text helps LLMs to generate useful patterns, and patterns
aid LLMs to understand tasks and generate texts that help               for coding tasks [443]. Furthermore, environment refers to
solve them [524].                                                       where the plan executor carries out the actions, which can
    In summary, CoT prompting provides a general and                    be set differently according to specific tasks, e.g., the LLM
flexible approach to eliciting the reasoning ability of LLMs.           itself [529] or an external virtual world like Minecraft [530].
There are also some preliminary attempts to extend this                 It provides feedback about the execution result of the action to
technique to solve multimodal [525] and multilingual                    the task planner, either in the form of natural language [443]
tasks [526].                                                            or from other multimodal signals [439].
                                                                            For solving a complex task, the task planner first needs to
                                                                        clearly understand the task goal and generate a reasonable
6.4   Planning
                                                                        plan based on the reasoning of LLMs (See Section 6.4.2).
Prompting with ICL and CoT is a conceptually simple yet                 Then, the plan executor acts according to the plan in the
general approach to solving various tasks. However, this                environment, and the environment will produce feedback
approach struggles with complex tasks like mathematical                 for the task planner (See Section 6.4.3). The task planner
reasoning [527] and multi-hop question answering [528]. As              can further incorporate the feedback obtained from the
an enhanced approach, prompt-based planning has been                    environment to refine its initial plan and iteratively perform
proposed to break down complex tasks into smaller sub-                  the above process to get better results as the task solution
tasks and generate a plan of actions to accomplish the task.            (See Section 6.4.4).

6.4.1 The Overall Framework                                             6.4.2    Plan Generation
In this part, we first formulate the general planning                   Plan generation focuses on directly generating action se-
paradigm of LLMs for solving complex tasks, which is                    quences by prompting LLMs. Based on the format of the
illustrated in Figure 16.                                               generated plans, existing work can be divided into two
    In this paradigm, there are typically three components:             groups: text-based and code-based approaches.
task planner, plan executor, and environment36 . Specifically,
task planner, which is played by LLMs, aims to generate the             Text-based Approaches. It is straightforward for LLMs to
whole plan to solve a target task. The plan can be presented            generate plans in the form of natural language. In this
in various forms, e.g., an action sequence in the form of               approach, LLMs are prompted to generate a sequence of
natural language [432] or an executable program written in              actions for the plan executor to perform and solve the com-
programming language [436]. The LLM-based task planner                  plex task. For example, Plan-and-Solve [434] adds explicit
can be enhanced with the memory mechanism for plan                      instructions like “devise a plan” to directly prompt
storage and retrieval, which is helpful for long-horizon                the LLM for planning in a zero-shot manner, while Self-
tasks. Then, plan executor is responsible for executing the             planning [531] and DECOMP [433] add demonstrations in
actions in the plan. It can be implemented by models like               the prompt to guide the LLM to devise a plan through ICL.
LLMs for textual tasks [434] or by tools like code interpreters         Following this way, some work further considers incorpo-
                                                                        rating extra tools or models when planning. For example,
   36. Despite the similarity with RL, our formulation decouples the    ToolFormer [80] first annotates a pre-training corpus with
planning and execution phases, whereas in RL, they are typically        potential API calls using LLMs, and then fine-tunes LLMs
interleaved in the agent. This paradigm is defined in a general yet
slightly loose way, and it mainly aims to help readers understand the   on it, so that LLMs can learn when and how to call APIs
key idea underlying the planning approaches of LLMs.                    and incorporate the results returned by APIs during gener-
                                                                                                                               55

ation. HuggingGPT [437] introduces the models available in         6.4.4 Plan Refinement
HuggingFace and regards LLMs as the controller to select           With access to feedback from the environment, the task
suitable models based on their descriptions and aggregate          planner can accordingly refine its current plan and itera-
their results as the final solution.                               tively go through the “planning – execution – refinement” loop
                                                                   for better results. In this part, we summarizes three major
Code-based Approaches. Although text-based approaches
                                                                   refinement approaches in existing work.
sound intuitive, they cannot guarantee faithful execution of
the plan, which may lead to failure even when the plan is          Reasoning. The feedback data from the environment may
sound. To address this issue, code-based approaches have           not be directly suitable to be utilized by LLMs for plan
been proposed to generate more verifiable plans in the             refinement, e.g., containing irrelevant information or taking
form of executable code in programming languages, e.g.,            a non-language form. To solve this, some work adds the
Python or PDDL. In this way, LLMs are first prompted               explicit reasoning process to extract critical information
to generate the program and then utilize a deterministic           from feedback [441, 442]. For example, React [442] prompts
solver to execute it. For example, Faithful CoT [435] and          LLMs with demonstrations to generate reasoning traces
PAL [436] decompose a reasoning task into two stages: at           over feedback. It has been widely used in autonomous agent
the first stage, the LLM generates a plan conditioned on the       projects, such as AutoGPT [536], which can automatically
query; at the second stage, a deterministic solver executes        reason over the observed feedback to revise the initial
the plan to derive the final answer. Furthermore, code-based       plan for solving various user requests. However, these ap-
approaches can be applied to embodied agents in a similar          proaches typically fix the order of reasoning and planning.
way. For example, PROGPROMPT [532] and LLM+P [533]                 To support flexible switching between the two processes for
first utilize LLMs to generate plans in the form of python         better performance, ChatCoT [441] further unifies the tool-
functions or PDDL files, and then leverage a virtual agent         augmented reasoning process into a multi-turn conversation
or classical planner to solve the problem according to the         between the LLM-based task planner and the tool-based
code-based plans.                                                  environment.
                                                                   Backtracking. Early methods mainly consider planning
6.4.3   Feedback Acquisition                                       forward actions while maintaining the existing plan, thus
                                                                   likely leading to local optimal plans based on a short-term
After executing the generated plan, the environment would          evaluation. To solve this, Tree of Thoughts [529] allows back-
produce the feedback signal to the LLM-based task planner,         tracking with search algorithms like breadth-first and depth-
which can be used to refine its initial plan for better results.   first search to make global planning. It refines the plan
In existing work, there are typically two sources of feedback      step by step by backtracking to the last state in the initial
from the environment, depending on their relationship with         plan and choosing the next unexplored action. Furthermore,
the LLM-based task planner: internal (i.e., the LLM itself)        some studies [439, 537] utilize feedback signals to revise the
and external (e.g., tools or virtual worlds) feedback.             entire plan. For example, DEPS [537] selects a better plan
                                                                   according to feedback signals, while TIP [439] adds feedback
Internal Feedback. The LLM itself can be utilized as a             signals to prompts for the LLM-based planner to revise each
feedback provider. One straightforward way is to directly          step in the initial plan.
evaluate the quality of the generated plans through prompt-
ing. For example, RAP [440] evaluate the likelihood that           Memorization. In order to handle long-horizon tasks, it has
each candidate plan can lead to task success, while Tree of        become a key approach to aid plan refinement with long-
Thoughts [529] proposes to vote across plans by making             term memory in addition to utilizing the short-term memory of
comparisons between them. Further, LLMs can provide                LLMs through ICL. For example, Reflexion [443] stores the
feedback based on the intermediate results from the plan           feedback from self-reflection into the memory, so previous
executor. For example, Reflexion [443] utilizes LLMs to            feedback can be retrieved for plan refinement. Generative
transform sparse result signals (e.g., success or failure) into    Agents [535] designs the memory stream mechanism as the
concrete text-based feedback (e.g., “You should recommend          core component of agents for action planning and reflection.
comedies that the user mentions in the query instead of horror     Further, the skill library mechanism [438, 530] is proposed
movies”) and stores this feedback in long-term memory for          to store successful plans in the library, which can be reused
future planning.                                                   and synthesized as complex plans for novel tasks. To imple-
                                                                   ment the long-term memory mechanism, tools like vector
External Feedback. In addition to LLMs, external objects           databases (e.g., milvus [538]) can be used to encode plans or
can also provide feedback signals. For example, tools like         feedbacks into high-dimensional vectors for efficient storage
code interpreters are widely used in programming tasks to          and retrieval at a large scale. MemoryBank [539] further
provide real-time error messages [443], models like stable         proposes the memory updating mechanism to allow mem-
diffusion [534] can be used in multimodal tasks to provide         ory forgetting and strengthening following the Ebbinghaus
visual perception [439], and virtual worlds like Minecraft         Forgetting Curve theory.
can provide immersive experiences [530]. Besides, some
work (e.g., Generative Agents [535]) explores multi-agent
collaboration in simulated environments, where each agent          7   C APACITY AND E VALUATION
receives feedback not only from interaction with the envi-         To examine the effectiveness and superiority of LLMs, a
ronment but also from communication with other agents.             surge of tasks and benchmarks have been proposed for
                                                                                                                                  56

conducting empirical ability evaluation and analysis. In this     LLMs also demonstrate comparable performance with hu-
section, we first introduce three types of basic ability evalu-   man freelance writers [630]. Despite the rapid progress
ation of LLMs for language generation and understanding,          on model capacity, there are increasing concerns on the
then present several advanced ability evaluations with more       feasibility of existing automatic metrics to faithfully assess
complicated settings or goals, and finally discuss existing       the performance of LLMs in conditional text generation
benchmarks, evaluation approaches, and empirical analysis.        tasks [630–632]. As the alternatives to automatic metrics,
                                                                  recent studies also propose to incorporate LLMs as gener-
7.1   Basic Ability                                               ation evaluators to examine the quality of the generated
                                                                  content [152, 633, 634]. Moreover, researchers also explore
In this part, we mainly focus on three basic types of ability     more challenging language generation tasks for LLMs, such
evaluation for LLMs, i.e., language generation, knowledge         as structured data generation [451] and long text genera-
utilization, and complex reasoning. It is noted that we do not    tion [46, 635, 636].
intend to have complete coverage of all the related tasks, but
instead only focus on the most widely discussed or studied        Code Synthesis. In addition to generating high-quality nat-
tasks for LLMs. Next, we introduce these tasks in detail.         ural language text, existing LLMs also show strong abilities
                                                                  to generate formal language, especially computer programs
7.1.1 Language Generation                                         (i.e., code) that satisfy specific conditions, called code syn-
According to the task definition, existing tasks about lan-       thesis [637]. Unlike natural language generation, as the gen-
guage generation can be roughly categorized into language         erated code can be directly checked by execution with cor-
modeling, conditional text generation, and code synthesis         responding compilers or interpreters, existing work mostly
tasks. Note that code synthesis is not a typical NLP task, we     evaluates the quality of the generated code from LLMs by
include it for discussion because it can be directly solved       calculating the pass rate against the test cases, i.e., pass@k 37 .
by a number of LLMs (trained on code data) in a similar           Recently, several code benchmarks focusing on functional
generation approach as natural language text.                     correctness are proposed to assess the code synthesis abil-
                                                                  ities of LLMs, such as APPS [376], HumanEval [105], and
Language Modeling. As the most fundamental ability of             MBPP [223]. Typically, they consist of diverse programming
LLMs, language modeling aims to predict the next token            problems, with text specification and test cases for cor-
based on the previous tokens [1], which mainly focuses            rectness checking. To improve such an ability, it is key to
on the capacity of basic language understanding and gen-          fine-tuning (or pre-training) LLMs on code data, which can
eration. For evaluating such an ability, typical language         effectively adapt LLMs to code synthesis tasks [86]. In addi-
modeling datasets that existing work uses include Penn            tion, existing work has proposed new strategies to generate
Treebank [540], WikiText-103 [541], and the Pile [166], where     code, e.g., sampling multiple candidate solutions [223] and
the metric of perplexity is commonly used for evaluating the      planning-guided decoding [638], which can be considered
model performance under the zero-shot setting. Empirical          as the imitation of bug-fixing and code-planning processes
studies [55, 93] show that LLMs bring substantial per-            by programmers. Impressively, LLMs have recently shown
formance gains over the previous state-of-the-art methods         competitive performance with humans by achieving a rank-
on these evaluation datasets. To better test the modeling         ing of the top 28% among users on the programming contest
capacity of long-range dependencies in text, the LAMBADA          platform Codeforces [114]. Further, GitHub Copilot has been
dataset [252] has been introduced, where LLMs are required        released to assist programming in coding IDEs (e.g., Visual
to predict the last word of sentences based on a paragraph of     Studio and JetBrains IDEs), which can support a variety
context. Then, the accuracy and perplexity of the predicted       of languages including Python, JavaScript, and Java. A
last words are employed to evaluate LLMs. As shown in             viewpoint article entitled “The End of Programming” [639] in
existing work, the performance on the language modeling           Communications of the ACM has discussed the impact of AI
tasks typically follows the scaling law [30], which means         programming in the field of computer science, emphasizing
that scaling language models would improve the accuracy           an important shift towards the highly adaptive LLM as a
and reduce the perplexity.                                        new atomic unit of computation.
Conditional Text Generation. As an important topic in             Major Issues. Although LLMs have achieved splendid per-
language generation, conditional text generation [48] fo-         formance in generating human-like text, they are susceptible
cuses on generating texts satisfying specific task demands        to suffering from two major issues in language generation
based on the given conditions, typically including machine        as discussed below.
translation [626], text summarization [550], and question             • Unreliable generation evaluation. With the advancement
answering [559]. To measure the quality of the generated          of language generation ability of LLMs, existing studies
text, automatic metrics (e.g., Accuracy, BLEU [627] and           find that the generated texts from LLMs have reached a
ROUGE [628]) and human ratings have been typically used           comparable quality to the reference texts on a variety of text
for evaluating the performance. Due to the powerful lan-          generation tasks. However, due to the intrinsic weakness
guage generation capabilities, LLMs have achieved remark-         of existing evaluation benchmarks, there exists pronounced
able performance on existing datasets and benchmarks. For         inconsistency between human evaluation and automatic
instance, GPT-4 exhibits comparable performance as com-           reference-based metrics [630–632, 640]. For example, in
mercial translation products, even for the translation task of
languages that are with significant linguistic distance [629].      37. Given k programs generated by the LLM, pass@k is computed as
On news summarization tasks (i.e., CNN/DM and XSUM),              1 when at least one program passes all test cases, or else 0
                                                                                                                                             57

    TABLE 14: Representative basic and advanced abilities and corresponding representative datasets for evaluating.

   Level           Ability                  Task                                                Dataset
                                      Language Modeling        Penn Treebank [540], WikiText-103 [541], the Pile [166], LAMBADA [252]
                                                                   WMT’14,16,19,20,21,22 [542–547], Flores-101 [548], DiaBLa [549],
            Language Generation Conditional Text Generation           CNN/DailyMail [550], XSum [551], WikiLingua [552]
                                                                                        OpenDialKG [553]
                                                              APPS [376], HumanEval [105], MBPP [223], CodeContest [114], MTPB [86],
                                        Code Synthesis
                                                                                  DS-1000 [554], ODEX [555]
                                                              Natural Questions [556], ARC [557], TruthfulQA [558], Web Questions [559],
                                       Closed-Book QA         TriviaQA [560], PIQA [561], LC-quad2.0 [562], GrailQA [563], KQApro [564],
                                                                               CWQ [565], MKQA [566], ScienceQA [567]
                                                               Natural Questions [556], OpenBookQA [568], ARC [557], TriviaQA [560],
            Knowledge Utilization
                                        Open-Book QA            Web Questions [559], MS MARCO [569], QASC [570], SQuAD [571],
   Basic                                                                                  WikiMovies [572]
                                                                    WikiFact [573], FB15k-237 [574], Freebase [575], WN18RR [576],
                                    Knowledge Completion
                                                                     WordNet [577], LAMA [578], YAGO3-10 [579], YAGO [580]
                                                               CSQA [506], StrategyQA [199], HotpotQA [581], ARC [557], BoolQ [582],
                                                               PIQA [561], SIQA [583], HellaSwag [584], WinoGrande [585], COPA [586],
                                    Knowledge Reasoning
                                                                 OpenBookQA [568], ScienceQA [567], proScript [587], ProPara [588],
                                                                     ExplaGraphs [589], ProofWriter [590], EntailmentBank [591],
                                                                                          ProOntoQA [592]
             Complex Reasoning                                 CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [593],
                                     Symbolic Reasoning               Parity [593], Colored Object [70], Penguins in a Table [70],
                                                                               Repeat Copy [436], Object Counting [436]
                                                               MATH [362], GSM8k [198], SVAMP [594], MultiArith [595], ASDiv [505],
                                    Mathematical Reasoning        MathQA [596], AQUA-RAT [597], MAWPS [598], DROP [599],
                                                                  NaturalProofs [600], PISA [601], miniF2F [602], ProofNet [603]
                                         Honestness                                TruthfulQA [558], HaluEval [604]
                                         Helpfulness                                        HH-RLHF [183]
             Human Alignment
                                                                                 HH-RLHF [183], Crows-Pairs [605]
                                        Harmlessness
                                                                             WinoGender [606], RealToxicityPrompts [607]
                                          Household              VirtualHome [608], BEHAVIOR [609], ALFRED [610],ALFWorld [611]
               Interaction with
                                     Website Environment                           WebShop [612], Mind2Web [613]
            External Environment
 Advanced                                Open World                                  MineRL [614], MineDojo [615]
                                        Search Engine                  HotpotQA [581], TriviaQA [560], Natural Questions [556]
                                        Code Executor                   GSM8k [198], TabMWP [616], Date Understanding [70]
                                          Calculator                            GSM8k [198], MATH [362], CARP [617]
             Tool Manipulation
                                       Model Interface                               GPT4Tools [618], Gorilla [619]
                                                                               WebQSP [620], MetaQA [621], WTQ [622]
                                        Data Interface
                                                                               WikiSQL [623], TabFact [624], Spider [625]



OpenDialKG [553], ChatGPT underperforms a fine-tuned                 generation in a reference-free manner, including evaluating
GPT-2 on BLEU and ROUGE-L metrics, while earning more                a single prediction [633, 634, 644] or comparing several
favor from human judgment [640]. Furthermore, existing               candidates [152, 645–647]. Nevertheless, LLMs may expose
work argues that even human evaluation may not be robust             bias (e.g., order bias or preference for LLM-generated texts
enough [630, 631, 641, 642]. In some cases, it is difficult          over human-written texts) as language generation evalua-
to achieve a high level of consensus among human an-                 tors, demonstrating disparities when compared to human
notators [631], and there is also a large gap between the            evaluation [634, 648, 649].
annotation quality of crowdworkers and experts [641, 642].
Thus, how to conduct reliable evaluation for language gen-
eration tasks in the era of LLMs has become a fundamental
yet challenging research topic. Recently, increasing research
work proposes to leverage LLMs to improve the evaluation
quality of the generated texts. Specially, LLMs can be used
to improve the evaluation quality of existing metrics. For ex-
ample, Para-Ref [643] augments various automatic metrics
by leveraging LLMs to paraphrase existing references into
semantically equivalent references with diverse expressions.
Further, LLMs are widely employed as the evaluators of text
                                                                                                                                       58

     Unreliable Generation Evaluation                             or knowledge completion) and evaluation settings (with or
                                                                  without external resources), we categorize existing knowl-
     LLMs have been capable of generating texts with              edge utilization tasks into three types, namely closed-book
     a comparable quality to human-written texts,                 QA, open-book QA38 , and knowledge completion.
     which however might be underestimated by au-
     tomatic reference-based metrics. As an alterna-              Closed-Book QA. Closed-book QA tasks [654] test the
     tive evaluation approach, LLMs can serve as lan-             acquired factual knowledge of LLMs from the pre-training
     guage generation evaluators to evaluate a single             corpus, where LLMs should answer the question only based
     text, compare multiple candidates, and improve               on the given context without using external resources. For
     existing metrics. However, this evaluation ap-               evaluating this ability, there are several datasets that can
     proach still needs more inspections and exami-               be leveraged, including Natural Questions [556], Web Ques-
     nations in real-world tasks.                                 tions [559], and TriviaQA [560], where the accuracy metric is
                                                                  widely adopted. Empirical results have revealed that LLMs
     • Underperforming specialized generation. Although LLMs      can perform well in this setting and even match the per-
have learned general language patterns to generate coherent       formance of state-of-the-art open-domain QA systems [56].
text, their proficiency in generation might be constrained        Also, the performance of LLMs on closed-book QA tasks
when dealing with a specialized domain or task. For in-           shows a scaling law pattern in terms of both model size
stance, a language model that has been trained on gen-            and data size: scaling the parameters and training tokens
eral web articles may face challenges when generating a           can increase the capacity of LLMs and help them learn (or
medical report which involves many medical jargon and             memorize) more knowledge from the pre-training data [56].
methods. Intuitively, domain knowledge should be critical         Further, under a similar parameter scale, LLMs with more
for model specialization. However, it is not easy to inject       pre-training data relevant to the evaluated tasks would
such specialized knowledge into LLMs. As discussed in             achieve better performance [81]. Also, the closed-book QA
recent analyses [47, 650], when LLMs are trained to exhibit       setting provides a testbed for probing the accuracy of the
some specific ability that allows them to excel in some areas,    factual knowledge encoded by LLMs. However, as shown
they might struggle in others. Such an issue is related to        in existing work [55], LLMs might perform less well on QA
catastrophic forgetting [651, 652] in training neural networks,   tasks relying on fine-grained knowledge, even when it exists
which refers to the conflict phenomenon of integrating new        in the pre-training data.
and old knowledge. Similar cases also occur in human align-
                                                                  Open-Book QA. Unlike closed-book QA, in open-book QA
ment of LLMs, where “alignment tax” [66] (e.g., a potential
                                                                  tasks, LLMs can extract useful evidence from the external
loss in the in-context learning ability) has to be paid for
                                                                  knowledge base or document collections, and then answer
aligning to human values and needs. Moreover, due to
                                                                  the question based on the extracted evidence [655–658]. Typ-
the limitations of sequence modeling architecture, LLMs
                                                                  ical open-book QA datasets (e.g., Natural Questions [556],
still face challenges in the understanding and generation
                                                                  OpenBookQA [568], and SQuAD [571]) have overlap with
of structured data. Consequently, they often fall behind
                                                                  closed-book QA datasets, but they incorporate external data
task-specific models on complex structured data tasks, such
                                                                  sources, e.g., Wikipedia. The metrics of accuracy and F1
as knowledge-base question answering and semantic pars-
                                                                  score are widely used in open-book QA tasks for evalua-
ing [451, 653]. Therefore, it is important to develop effective
                                                                  tion. To select relevant knowledge from external resources,
model specialization methods that can flexibly adapt LLMs
                                                                  LLMs are often paired with a text retriever (or even a
to various task scenarios, meanwhile retaining the original
                                                                  search engine), which is trained independently or jointly
abilities as possible.
                                                                  with LLMs [81, 655, 659]. Also, previous work [660–662]
     Underperforming Specialized Generation                       has indicated that retrievers can assist LLMs in verifying
                                                                  and rectifying the reasoning path. In evaluation, existing
     LLMs may fall short in mastering generation                  studies mainly focus on testing how LLMs utilize the ex-
     tasks that require domain-specific knowledge or              tracted knowledge to answer the question and show that
     generating structured data. It is non-trivial to             the retrieved evidence can largely improve the accuracy
     inject specialized knowledge into LLMs, mean-                of the generated answers, even enabling a smaller LLM to
     while maintaining the original abilities of LLMs.            outperform 10× larger ones [655, 659]. Further, open-book
                                                                  QA tasks can be also employed to evaluate the recency
                                                                  of knowledge information. Pre-training or retrieving from
                                                                  outdated knowledge resources may cause LLMs to generate
7.1.2 Knowledge Utilization
                                                                  incorrect answers for time-sensitive questions [655].
Knowledge utilization is an important ability of intelligent
systems to accomplish knowledge-intensive tasks (e.g., com-       Knowledge Completion. In knowledge completion tasks,
monsense question answering and fact completion) based            LLMs might be (to some extent) considered as a knowledge
on supporting factual evidence. Concretely, it requires LLMs
to properly utilize the rich factual knowledge from the pre-        38. In this part, open-book QA refers to the QA tasks that require
training corpus or retrieve external data when necessary. In      to extract and utilize useful information from external knowledge
particular, question answering (QA) and knowledge com-            resources, as the antithesis of closed-book QA (only using the encoded
                                                                  information from pre-training corpus). Note that there is a dataset also
pletion have been two commonly used tasks for evaluating          named OpenBookQA [568], which follows the settings of open-book
this ability. According to the test tasks (question answering     QA tasks by extracting and utilizing external science facts.
                                                                                                                                59



                  Bob’s wife is Amy. Bob’s daughter is Cindy.
                                                                                              Explain RLHF for LLMs.
                  Who is Cindy to Amy?



                                                                       RLHF stands for "Rights, Limitations, Harms, and
             Cindy is Amy’s daughter-in-law.                           Freedoms" and is a framework for …… models like
                                                                       LLMs (Large Language Models).

                   (a) Intrinsic hallucination                                          (b) Extrinsic hallucination


Fig. 17: Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). As an example
of intrinsic hallucination, the LLM gives a conflicting judgment about the relationship between Cindy and Amy, which
contradicts the input. For extrinsic hallucination, in this example, the LLM seems to have an incorrect understanding of
the meaning of RLHF (reinforcement learning from human feedback), though it can correctly understand the meaning of
LLMs (in this context).


base [578], which can be leveraged to complete or predict the       deploying LLMs in real-world applications. To alleviate
missing parts of knowledge units (e.g., knowledge triples).         this problem, alignment tuning strategies (as discussed in
Such tasks can probe and evaluate how much and what kind            Section 5.2) have been widely utilized in existing work [66],
of knowledge LLMs have learned from the pre-training                which rely on tuning LLMs on high-quality data or using
data. Existing knowledge completion tasks can be roughly            human feedback. Moreover, the integration of external
divided into knowledge graph completion tasks (e.g., FB15k-         tools for the provision of credible information sources can
237 [574] and WN18RR [576]) and fact completion tasks (e.g.,        help alleviate the hallucination issue [81, 604, 661]. Another
WikiFact [573]), which aim to complete the triples from a           line of research work leverages uncertainty estimation of
knowledge graph and incomplete sentences about specific             LLMs to identify hallucinations [665, 666]. For instance,
facts, respectively. Empirical studies have revealed that it        considering that hallucinated facts are prone to exhibit
is difficult for existing LLMs to accomplish knowledge              inconsistency across different sampled outputs, SelfCheck-
completion tasks related to specific relation types [522].          GPT [666] detects hallucination by measuring information
As shown in the evaluation results on WikiFact, LLMs                inconsistency within sampled outputs. For the evaluation
perform well on several frequent relations that occur in            of the hallucination problem, a set of hallucination de-
the pre-training data (e.g., currency and author), while            tection tasks have been proposed, e.g., TruthfulQA [558]
not well on rare ones (e.g., discoverer_or_inventor                 for detecting human falsehood mimicked by models. More
and place_of_birth). Interestingly, under the same eval-            recently, HaluEval [604] creates a large-scale LLM-generated
uation settings (e.g., in-context learning), InstructGPT (i.e.,     and human-annotated hallucinated samples to evaluate the
text-davinci-002) outperforms GPT-3 in all subsets of               ability of language models to recognize hallucination in both
WikiFact.                                                           task-specific and general scenarios.

Major Issues. Although LLMs have achieved key progress                   Hallucination
in capturing and utilizing knowledge information, they
suffer from two major issues as discussed below.                         LLMs are prone to generate untruthful informa-
                                                                         tion that either conflicts with the existing source
    • Hallucination. In generating factual texts, a challeng-
                                                                         or cannot be verified by the available source.
ing issue is hallucination generations [640, 663], where the
                                                                         Even the most powerful LLMs such as ChatGPT
generated information is either in conflict with the existing
                                                                         face great challenges in migrating the hallucina-
source (intrinsic hallucination) or cannot be verified by the
                                                                         tions of the generated texts. This issue can be
available source (extrinsic hallucination), which are illustrated
                                                                         partially alleviated by special approaches such as
by two examples in Figure 17. Hallucination widely occurs
                                                                         alignment tuning and tool utilization.
in existing LLMs, even the most superior LLMs such as
GPT-4 [46]. Furthermore, existing work shows that LLMs
encounter difficulties in recognizing the hallucinated con-             • Knowledge recency. As another major challenge, LLMs
tent in text [604], even the powerful ChatGPT. Additionally,        would encounter difficulties when solving tasks that require
beyond language tasks, a recent study has shown that large          the latest knowledge beyond the training data. To tackle
vision-language models (LVLM) also face challenges with             this issue, a straightforward approach is to regularly update
hallucination, i.e., generating objects that are not present in     LLMs with new data. However, it is very costly to fine-tune
the accompanying images [664]. In essence, LLMs seem                LLMs, and also likely to cause the catastrophic forgetting
to “unconsciously” utilize the knowledge in task solving,           issue when incrementally training LLMs. Therefore, it is
which still lack an ability to accurately control the use           necessary to develop efficient and effective approaches that
of internal or external knowledge. Hallucinations would             can integrate new knowledge into existing LLMs, making
mislead LLMs to generate undesired outputs and mostly               them up-to-date. Existing studies have explored how to
degrade the performance, leading to potential risks when            utilize the external knowledge source (e.g., search engine)
                                                                                                                                       60

to complement LLMs, which can be either jointly optimized           knowledge reasoning tasks into code generation tasks, re-
with LLMs [655] or used as a plug-and-play module [661].            searchers have found that the performance of LLMs can
For instance, ChatGPT utilizes a retrieval plugin to access         be further improved [226], especially with the LLMs pre-
up-to-date information sources [667]. By incorporating the          trained on code. However, due to the complexity of knowl-
extracted relevant information into the context [668–670],          edge reasoning tasks, the performance of current LLMs still
LLMs can acquire new factual knowledge and perform                  lags behind human results on tasks such as commonsense
better on relevant tasks. However, such an approach seems           reasoning [33, 56, 677]. As a common type of mistakes, LLMs
to be still at a superficial level. In addition, existing studies   might generate inaccurate intermediate steps, leading to a
also explore editing parameters of language models to up-           wrong final result. To address this issue, existing work has
date intrinsic knowledge [671–673]. Nevertheless, previous          proposed special decoding or ensemble strategies to im-
work [674] has shown that several parameter editing meth-           prove the accuracy of the whole reasoning chain [429, 430].
ods perform not well on LLMs, though they can improve
                                                                    Symbolic Reasoning39 . The symbolic reasoning tasks
the performance of small language models. Therefore, it
                                                                    mainly focus on manipulating the symbols in a formal rule
is still difficult to directly amend intrinsic knowledge or
                                                                    setting to fulfill some specific goal [51], where the operations
inject specific knowledge into LLMs, which remains an
                                                                    and rules may have never been seen by LLMs during pre-
open research problem [674]. Recently, a useful framework
                                                                    training. Existing work [33, 432, 507] commonly evaluates
EasyEdit [675] has been released to facilitate the research of
                                                                    LLMs on the task of last letter concatenation and coin flip,
knowledge editing for LLMs.
                                                                    where the evaluation examples require the same reasoning
        Knowledge Recency                                           steps as the in-context examples (called in-domain test) or
                                                                    more steps (called out-of-domain test). For an example of
        The parametric knowledge of LLMs is hard to be              the out-of-domain test, LLMs could only see the examples
        updated in a timely manner. Augmenting LLMs                 with two words in context, but it requires LLMs to concate-
        with external knowledge sources is a practical              nate the last letters of three or more words. Typically, the
        approach to tackling the issue. However, how                accuracy of the generated symbols is adopted to evaluate
        to effectively update knowledge within LLMs                 the performance of LLMs on these tasks. Thus, LLMs need
        remains an open research problem.                           to understand the semantic relations among the symbolic
                                                                    operations and their composition in complex scenarios.
                                                                    However, under the out-of-domain setting, as LLMs have
                                                                    not seen the complex compositions of symbolic operations
7.1.3    Complex Reasoning
                                                                    and rules (e.g., twice the number of operations in context
Complex reasoning refers to the ability of understanding            examples), it is hard for LLMs to capture their accurate
and utilizing supporting evidence or logic to derive con-           meanings. To solve this issue, existing studies incorporate
clusions or make decisions [51, 52]. According to the type          scratchpad [593, 678] and tutor [679] strategies to help
of involved logic and evidence in the reasoning process,            LLMs better manipulate symbolic operations, for generating
we consider dividing existing evaluation tasks into three           longer and more complex reasoning processes. Another
major categories, namely knowledge reasoning, symbolic              line of research work utilizes the formal programming
reasoning, and mathematical reasoning.                              language to represent the symbolic operations and rules,
                                                                    which requires LLMs to generate code and perform the
Knowledge Reasoning. The knowledge reasoning tasks
                                                                    reasoning process by executing it with external interpreters.
rely on logical relations and evidence about factual
                                                                    Such a way can decompose the complex reasoning process
knowledge to answer the given question. Existing work
                                                                    into code synthesis and program execution for LLMs and
mainly uses specific datasets to evaluate the reasoning
                                                                    interpreters, respectively, leading to a simplified reasoning
capacity of the corresponding type of knowledge, e.g.,
                                                                    process with yet more accurate results [436].
CSQA [506]/StrategyQA [199] for commonsense knowledge
reasoning and ScienceQA [567] for science knowledge rea-            Mathematical Reasoning. The mathematical reasoning
soning. In addition to the accuracy of the predicted results,       tasks need to comprehensively utilize mathematical knowl-
existing work [567] has also evaluated the quality of the           edge, logic, and computation for solving problems or gen-
generated reasoning process, via automatic metrics (e.g.,           erating proof statements. Existing mathematical reasoning
BLEU) or human evaluation. Typically, these tasks require           tasks can be mainly categorized into math problem solv-
LLMs to perform step-by-step reasoning based on factual             ing and automated theorem proving. For math problem
knowledge, until reaching the answer to the given ques-             solving tasks, SVAMP [594], GSM8k [198] and MATH [362]
tion. To elicit the step-by-step reasoning ability, chain-of-       datasets are commonly used for evaluation, where LLMs
thought (CoT) prompting strategy [33] has been proposed             need to generate accurate concrete numbers or equations
for enhancing the complex reasoning capacity of LLMs.               to answer the mathematical problem. As these tasks also
As discussed in Section 6.3, CoT involves the intermediate          require multi-step reasoning, the CoT prompting strategy
reasoning steps, which can be manually created [33] or              has been widely adopted for LLMs to improve the reasoning
automatically generated [676], into the prompts to guide            performance [33]. As another practical strategy, continu-
LLMs to perform multi-step reasoning. Such a way largely
improves the reasoning performance of LLMs, leading to                 39. Following [33], we mainly discuss symbolic reasoning tasks spe-
                                                                    cially designed for evaluating LLMs. We do not consider symbolic
new state-of-the-art results on several complex knowledge           reasoning methods in traditional NLP tasks, such as deducing logical
reasoning tasks [33, 56, 528]. Further, after reformulating         rules from the knowledge graphs in KBQA.
                                                                                                                                 61

ally pre-training LLMs on large-scale mathematical corpora               Reasoning Inconsistency
can largely boost their performance on mathematical rea-
soning tasks [35, 218, 680]. Further, since math problems                LLMs may generate the correct answer following
in different languages share the same mathematical logic,                an invalid reasoning path, or produce a wrong
researchers also propose a multilingual math word problem                answer after a correct reasoning process, leading
benchmark [526] to evaluate the multilingual mathematical                to inconsistency between the derived answer and
reasoning capacity of LLMs. As another challenging task,                 the reasoning process. The issue can be alleviated
automated theorem proving (ATP) [600, 602, 681] requires                 by fine-tuning LLMs with process-level feedback,
the reasoning model to strictly follow the reasoning logic               using an ensemble of diverse reasoning paths,
and mathematical skills. To evaluate the performance on                  and refining the reasoning process with self-
this task, PISA [601] and miniF2F [602] are two typical ATP              reflection or external feedback.
datasets with the proof success rate as the evaluation metric.
As a typical approach, existing work on ATP utilizes LLMs              • Numerical computation. For complex reasoning tasks,
to aid the search for proofs using an interactive theorem          LLMs still face difficulties in the involved numerical com-
prover (ITP), such as Lean, Metamath, and Isabelle [682–           putation, especially for the symbols that are seldom en-
684]. A major limitation of ATP research is the lack of related    countered during pre-training, such as arithmetic with large
corpora in formal language. To tackle it, several studies          numbers [49, 679, 692]. To tackle this issue, a direct way is
utilize LLMs to convert informal statements into formal            to tune LLMs on synthesized arithmetic problems [359, 693].
proofs for augmenting new data [685] or generate drafts and        Also, a surge of studies improve the numerical computation
proof sketches to reduce the search space of the proofs [686].     performance by tracing intermediate calculation steps in
                                                                   training and inference stages [359, 678, 694], e.g., scratchpad
                                                                   tracing. In addition, existing work [80] has also incorpo-
                                                                   rated external tools (e.g., calculator), especially for handling
Major Issues. In spite of the advancements, LLMs still have        arithmetic operations. More recently, ChatGPT has provided
several limitations in solving complex reasoning tasks.            a plugin mechanism to use external tools [667]. In this
                                                                   way, LLMs need to learn how to properly manipulate the
                                                                   tools. For this purpose, researchers have augmented the
    • Reasoning inconsistency. With improved reasoning             examples using tools (even the LLM itself) for tuning the
strategies (e.g., CoT prompting), LLMs can solve some com-         LLM [80, 695], or devised instructions and exemplars for
plex reasoning tasks, by performing step-by-step reasoning         in-context learning [436]. In addition to the aid of ex-
based on the supporting logic and evidence. Despite the            ternal tools, recent studies find that tokenizing digits into
effectiveness, the reasoning inconsistency issue often occurs in   individual tokens (e.g., LLaMA and Galactica tokenizers)
the decomposed reasoning process. Concretely, LLMs may             is a useful approach to enhancing the inherent arithmetic
generate the correct answer following an invalid reasoning         ability of LLMs [359, 692]. One possible explanation is that
path, or produce a wrong answer after a correct reason-            subword tokenization techniques can result in inconsistent
ing process [33, 435], leading to inconsistency between the        sequences when tokenizing numbers. For instance, with
derived answer and the reasoning process. To alleviate             a subword tokenizer the integer 7481 may be tokenized
this problem, existing work has proposed to guide the              as 7 481, while 74815 may be tokenized as 748 15 (the
whole generation process of LLMs via external tools or             same numerical substrings with different splits) [359]. As a
models [430, 444, 638], to re-check the reasoning process          comparison, digit-based tokenization for numbers can avoid
and final answer for correcting the potential errors [687–689]     such an inconsistency, thus likely improving the numerical
or fine-tune LLMs with process-based feedback [690, 691].          computation ability of LLMs.
For instance, Tree of Thoughts (ToT) [444] empowers LLMs
to engage in the decision-making process by concurrently                 Numerical Computation
exploring and self-evaluating various reasoning paths. To
refine the reasoning processes, Self-Refine [687] elicits feed-          LLMs face difficulties in numerical computation,
back from LLMs on self-generated solutions, enabling the                 especially for the symbols that are seldom en-
iterative refinement of solutions based on the feedback.                 countered during pre-training. In addition to us-
Moreover, several studies improve the consistency in the                 ing mathematical tools, tokenizing digits into in-
reasoning chain of LLMs through the integration of process-              dividual tokens is also an effective design choice
based supervision during training [690, 691]. As a promis-               for improving the arithmetic ability of LLMs.
ing solution, recent approaches reformulate the complex
reasoning tasks into code generation tasks, where the strict
execution of the generated code ensures the consistency            7.2   Advanced Ability
between the reasoning process and the outcome. Also,               In addition to the above basic evaluation tasks, LLMs also
it has been revealed that there might exist inconsistency          exhibit some superior abilities that require special consider-
between tasks with similar inputs, where small changes             ations for evaluation. In this part, we discuss several rep-
in the task description may cause the model to produce             resentative advanced abilities and the corresponding eval-
different results [49, 594]. To mitigate this problem, self-       uation approaches, including human alignment, interaction
consistency [429] adopts the ensemble of multiple reasoning        with the external environment, and tool manipulation. Next,
paths to enhance the decoding process of LLMs.                     we discuss these advanced abilities in detail.
                                                                                                                                  62

7.2.1   Human Alignment                                             work either adopts the regular metrics (e.g., executability
It is desired that LLMs could well conform to human values          and correctness of the generated action plans) [696] in the
and needs, i.e., human alignment, which is a key ability for        benchmark or directly conducts real-world experiments and
the broad use of LLMs in real-world applications.                   measures the success rate [700], to evaluate such ability. It
     To evaluate this ability, existing studies consider multiple   has been shown that LLMs are capable in interacting with
criteria for human alignment, such as helpfulness, honesty,         the external environment and generating accurate action
and safety [46, 183, 366]. For helpfulness and honesty, adver-      plans [701]. Recently, several improvement methods have
sarial question answering tasks (e.g., TruthfulQA [558]) can        been proposed to enhance the interaction ability of LLMs,
be utilized to examine LLM’s ability in detecting possible          e.g., designing code-like prompts [532] and providing real-
falsehood in the text [46, 81]. Furthermore, harmlessness           world grounding [700].
can be also evaluated by several existing benchmarks, e.g.,             In addition, recent work also explores multi-agent col-
CrowS-Pairs [605] and Winogender [606]. Despite the auto-           laboration based on LLMs in simulated environments [535,
matic evaluation with the above datasets, human evaluation          702, 703]. These studies simulate human social behaviors
is still a more direct way to effectively test the human            by instantiating multiple LLM-based agents with observa-
alignment ability of LLMs. OpenAI invites many experts              tions, planning, and memories in a sandbox environment.
in domains related to AI risks to evaluate and improve the          In controlled evaluation, the abilities of generative agents
behaviors of GPT-4 when encountering risky contents [46].           to search, plan, and think are evaluated by humans in an
In addition, for other aspects of human alignment (e.g.,            interview-like manner. Further, they also conduct descrip-
truthfulness), several studies propose to use specific instruc-     tive measurements on multiple agents within a simulated
tions and devise annotation rules to guide the annotation           environment to examine emergent social behaviors.
process [81]. Empirical studies have revealed that these
                                                                    7.2.3 Tool Manipulation
strategies can greatly improve the human alignment ability
of LLMs [183]. For instance, after alignment tuning on data         When solving complex problems, LLMs can turn to external
collected through interactions with experts, the incorrect          tools if they determine it is necessary. By encapsulating
behavior rate of GPT-4 can be largely reduced when it deals         available tools with API calls, existing work has involved
with sensitive or disallowed prompts. In addition, high-            a variety of external tools, e.g., search engine [81], calcula-
quality pre-training data can reduce the effort required for        tor [80], and compiler [436], to enhance the performance of
alignment [46]. For instance, Galactica is potentially more         LLMs on several specific tasks. Recently, OpenAI has sup-
harmless due to the less biased contents in the scientific          ported the use of plugins in ChatGPT [667], which can equip
corpus [35].                                                        LLMs with broader capacities beyond language modeling.
                                                                    For example, the web browser plugin enables ChatGPT
                                                                    to access fresh information. Further, incorporating third-
7.2.2   Interaction with External Environment                       party plugins is particularly key for creating a prosperous
In addition to standard evaluation tasks, LLMs have the             ecosystem of applications based on LLMs.
ability to receive feedback from the external environment               To examine the ability of tool manipulation, existing
and perform actions according to the behavior instruction,          work mostly adopts complex reasoning tasks for evaluation,
e.g., generating action plans in natural language to manip-         such as mathematical problem solving (e.g., GSM8k [198]
ulate agents [696, 697]. Such an ability is also emergent in        and SVAMP [594]) or knowledge question answering (e.g.,
LLMs that can generate detailed and highly realistic action         TruthfulQA [558]), where the successful utilization of tools is
plans, while smaller models (e.g., GPT-2) tend to generate          very important for enhancing the required skills that LLMs
shorter or meaningless plans [696].                                 are incapable in (e.g., numerical calculation). In this way, the
    To test this ability, several embodied AI environments          evaluated performance on these tasks can reflect the ability
and benchmarks can be used for evaluation, described                of LLMs in tool manipulation. To teach LLMs to utilize tools,
as follows. VirtualHome [608] builds a 3D simulator for             existing studies add exemplars using tools in context to elicit
household tasks such as cleaning and cooking, in which              LLMs [436], or fine-tune LLMs on simulated data about
the agent can execute natural language actions generated            tool utilization [80, 695]. It has been found that with the
by LLMs. ALFRED [610] includes more challenging tasks               help of tools, LLMs become more capable of handling the
that require LLMs to accomplish compositional targets. BE-          issues that they are not good at, e.g., equation calculation
HAVIOR [609] focuses on everyday chores in simulation               and answering timely questions [80, 441]. However, as
environments and requires LLMs to generate complex so-              the number of available tools increases, the limited context
lutions, e.g., changing the internal status of objects. Apart       length of LLMs may pose challenges in describing and
from restricted environments such as household tasks, a             demonstrating extensive tool APIs. To address this issue,
line of research work investigates the proficiency of LLM-          existing work retrieves the usage of relevant tools, or en-
based agents to explore open-world environments, such as            coding tool information as tokens within the embedding
Minecraft and the Internet [698, 699]. Voyager [699] intro-         space [704–706].
duces an automatic curriculum module that enables LLMs                  In addition to existing tools developed by humans,
to continuously acquire new skills based on feedback from           LLMs possess the capability to make their own tools for
the environment. GITM [698] focuses on solving various              specific tasks autonomously [707]. This enables the models
challenges in Minecraft based on LLM, through task de-              to independently explore and manipulate these self-created
composition, planning, and invocation of interfaces. Based          tools, thereby expanding their potential for autonomous
on the generated action plans or task completions, existing         exploration in solving a wide range of real-world tasks.
                                                                                                                              63

    Summary. The above three abilities are of great value to      the experimental results of HELM, instruction tuning can
the practical performance of LLMs: conforming to human            consistently boost the performance of LLMs in terms of
values and preferences (human alignment), acting properly         accuracy, robustness, and fairness. Further, for reasoning
in real-world scenarios (interaction with the external envi-      tasks, the LLMs that have been pre-trained on the code
ronment), and expanding the ability scope (tool manipu-           corpus show superior performance.
lation). In addition to the above three advanced abilities,           • Human-level test benchmarks aim to evaluate the compre-
LLMs might also show other abilities that are specially           hensive ability of LLMs with questions designed for testing
related to some tasks (e.g., data annotation [488]) or learning   humans, such as AGIEval [710], MMCU [711], M3KE [712],
mechanisms (e.g., self-improvement [708]). It will be an open     C-Eval [713] and Xiezhi [714]. These benchmarks encompass
direction to discover, measure and evaluate these newly           a wide range of domains, difficulty levels, and languages
emerging abilities, so as to better utilize and improve LLMs.     to provide a comprehensive evaluation of LLMs’ general
                                                                  capabilities. Compared to publicly available models, models
                                                                  offering API services (e.g., GPT-4, ChatGPT, Claude) demon-
7.3   Benchmarks and Evaluation Approaches                        strate superior performance compared to publicly avail-
                                                                  able models on these evaluation benchmarks. As the best-
In the above, we have discussed the basic and advanced            performing model in evaluations, GPT-4 surpasses average
abilities of LLMs. Next, we will introduce existing evalua-       human performance in AGIEval [710]. However, it still lags
tion benchmarks and approaches [735, 736].                        behind the top human performance on these challenging
                                                                  benchmarks. Hence, there remains ample room for further
7.3.1 Comprehensive Evaluation Benchmarks                         enhancements in the overall abilities of LLMs, particularly
Recently, several comprehensive benchmarks [70, 362, 522]         for publicly accessible models.
have been released for the evaluation of LLMs. In this                The above benchmarks cover a variety of mainstream
part, we introduce several widely used benchmarks, i.e.,          evaluation tasks and real-world human exam questions for
MMLU, BIG-bench, HELM, and a series of human exam                 the evaluation of LLMs. Also, there are several benchmarks
benchmarks.                                                       that focus on evaluating specific abilities of LLMs, such
    • MMLU [362] is a versatile benchmark for large-scale         as TyDiQA [737] for multilingual knowledge utilization
evaluation of multi-task knowledge understanding, cover-          and MGSM [526] for multilingual mathematical reasoning.
ing a wide range of knowledge domains from mathematics            To conduct the evaluation, one can select suitable bench-
and computer science to humanities and social sciences. The       marks according to specific goals. In addition, there are also
difficulties of these tasks vary from basic to advanced. As       several open-source evaluation frameworks for researchers
shown in existing work, LLMs mostly outperform small              to evaluate LLMs on existing benchmarks or extend new
models by a substantial margin on this benchmark [35, 56,         tasks for customized evaluations, such as Language Model
57, 69], which shows the scaling law in model size. More          Evaluation Harness [738] and OpenAI Evals [46]. Fur-
recently, GPT-4 achieves a remarkable record (86.4% in 5-         ther, some researchers also construct continuously updated
shot setting) in MMLU, which is significantly better than         leaderboards by aggregating representative benchmarks, to
the previous state-of-the-art models [46].                        compare the performance of existing LLMs, such as Open
    • BIG-bench [70] is a collaborative benchmark intended        LLM Leaderboard [709]. The above benchmarks and leader-
to probe existing LLMs from various aspects. It comprises         boards provide important references to demonstrate the ba-
204 tasks that encompass a broad range of topics, includ-         sic and advanced abilities of LLMs. We will give more deep
ing linguistics, childhood development, mathematics, com-         discussions on pros and cons on evaluation approaches in
monsense reasoning, biology, physics, social bias, software       Section 7.3.2.
development, and so on. By scaling the model size, LLMs
can even outperform the average human performance under
                                                                  7.3.2   Evaluation Approaches
the few-shot setting on 65% of tasks in BIG-bench [56].
Considering the high evaluation cost of the entire bench-         After introducing existing benchmarks, in this part, we
mark, a lightweight benchmark BIG-bench-Lite has been             will review existing evaluation approaches for assessing
proposed, which contains 24 small yet diverse and challeng-       the performance of LLMs. To organize our discussion, we
ing tasks from BIG-bench. Additionally, the BIG-bench hard        categorize LLMs into three different types: base LLMs (pre-
(BBH) benchmark [363] has been proposed to concentrate            trained model checkpoints), fine-tuned LLMs (instruction or
on investigating the currently unsolvable tasks of LLMs by        alignment fine-tuned model checkpoints), and specialized
selecting the challenging tasks in which LLMs exhibit infe-       LLMs (adapted model checkpoints for some specific task
rior performance compared to humans. Since BBH becomes            or domain). Here, we keep both fine-tuned LLMs and
more difficult, small models mostly achieve performance           specialized LLMs, to distinguish the different purposes of
close to random. As a comparison, CoT prompting can               LLMs: general or specific task solvers. To evaluate the three
elicit the abilities of LLMs to perform step-by-step reasoning    types of LLMs, we can test the LLM’s performance related
for enhancing the performance, even exceeding the average         to different abilities (e.g., basic or advanced abilities as
human performance in BBH.                                         discussed in Section 7.1 and 7.2). In general, there are three
    • HELM [522] is a comprehensive benchmark that cur-           main approaches to evaluating LLMs, namely benchmark-
rently implements a core set of 16 scenarios and 7 categories     based approach [362], human-based approach [729], and
of metrics. It is built on top of many prior studies, conduct-    model-based approach [731]. Table 15 shows an illustration
ing a holistic evaluation of language models. As shown in         of the relationship among LLM type, evaluation approach,
                                                                                                                                         64

TABLE 15: A category of existing evaluation work. “General” denotes that the evaluation focuses on an overall performance
of multiple abilities. The evaluated abilities are not limited to the representative basic and advanced abilities mentioned in
Section 7.1 and 7.2.

    Method              Evaluation                   Model Types                 Abilities/Domain                    Data Source
                       MMLU [362]             Base/Fine-tuned/Specialized           General                   Human exam/practice
                      BIG-bench [70]          Base/Fine-tuned/Specialized           General                    Human annotation
                        HELM [522]            Base/Fine-tuned/Specialized           General                   Benchmark collection
                Open LLM Leaderboard [709]    Base/Fine-tuned/Specialized           General                   Benchmark collection
                       AGIEval [710]          Base/Fine-tuned/Specialized           General                   Human exam/practice
                       MMCU [711]             Base/Fine-tuned/Specialized           General                   Human exam/practice
                        M3KE [712]            Base/Fine-tuned/Specialized           General                   Human exam/practice
                        C-Eval [713]          Base/Fine-tuned/Specialized           General                   Human exam/practice
                        Xiezhi [714]          Base/Fine-tuned/Specialized           General                   Human exam/practice
                   OpenCompass [715]          Base/Fine-tuned/Specialized           General                   Benchmark collection
                Chain-of-Thought Hub [716]         Base/Fine-tuned                  General                   Benchmark collection
                        KoLA [717]                 Base/Fine-tuned            Knowledge utilization                   Web
                         ARB [718]                    Fine-tuned               Complex reasoning              Human exam/practice
                      APIBench [719]               Base/Fine-tuned             Tool manipulation                      Web
  Benchmark
                       APIBank [720]                  Fine-tuned               Tool manipulation                    Synthesis
                     ToolAlpaca [721]              Base/Fine-tuned             Tool manipulation                    Synthesis
                       T-Bench [722]                  Fine-tuned               Tool manipulation                    Synthesis
                      ToolBench [723]                 Fine-tuned               Tool manipulation                    Synthesis
                       BOLAA [724]                 Base/Fine-tuned           Environment interaction          Benchmark collection
                     AgentBench [725]              Base/Fine-tuned           Environment interaction        Human annotation/Synthesis
                      HaluEval [604]               Base/Fine-tuned             Human alignment              Human annotation/Synthesis
                    PromptBench [726]              Base/Fine-tuned                 Robustness                 Benchmark collection
                     HumanEval [105]          Base/Fine-tuned/Specialized        Code synthesis                Human annotation
                    MultiMedQA [354]                  Specialized                  Healthcare                 Benchmark collection
                        FLUE [727]                    Specialized                   Finance                   Benchmark collection
                     LegalBench [728]                 Specialized                    Legal                     Human annotation
                    Chatbot Arena [729]       Base/Fine-tuned/Specialized       Human Alignment                  Human annotation
    Human
                      SciBench [730]                  Fine-tuned                Complex reasoning               Human exam/practice
                     AlpacaEval [731]                Fine-tuned                Instruction following                 Synthesis
                      MT-bench [729]                 Fine-tuned                 Human alignment                  Human annotation
    Model             TrustGPT [732]               Base/Fine-tuned              Human alignment                 Benchmark collection
                     LMExamQA [733]                Base/Fine-tuned             Knowledge utilization                 Synthesis
                      ChatEval [734]               Base/Fine-tuned             Knowledge utilization            Benchmark collection



and tested abilities. Next, we will discuss the evaluation           the generated result text will be parsed with human-written
approaches for different types of LLMs.                              rules to get the predicted answer. Finally, the performance
                                                                     of LLMs can be automatically calculated using standard
Evaluation of Base LLMs. Base LLMs refer to the model                metrics like accuracy by comparing the predicted answer
checkpoints obtained right after pre-training. For base              with the ground-truth one. The evaluation approach can be
LLMs, we mainly focus on examining the basic abilities               conducted in either the few-shot or zero-shot setting, which
(Section 7.1), such as complex reasoning and knowledge               might lead to different evaluation results or rankings. Since
utilization. Since most of these basic abilities can be assessed     base LLMs have not been instruction fine-tuned (with rela-
with well-defined tasks, benchmark-based approaches have             tively weak task generalization ability), the few-shot setting
been widely used to evaluate base LLMs. Next, we will                is often more suitable for evaluation. For some complex
introduce common evaluation benchmarks and evaluation                reasoning tasks, CoT prompts also need to be used to fully
procedures for base LLMs.                                            exhibit the capacity during evaluation. Another note is that
   • Common benchmarks. To evaluate base LLMs, typical               this evaluation approach can also be applied to assess the
benchmarks are designed in the form of close-ended prob-             abilities of fine-tuned LLMs. Actually, several leaderboards
lems like multiple-choice questions. These commonly used             (e.g., Open LLM Leaderboard [709]) are built upon this
benchmarks can be mainly divided into two categories:                approach, evaluating both base and fine-tuned LLMs.
knowledge-oriented and reasoning-oriented benchmarks.
                                                                     Evaluation of Fine-tuned LLMs. Fine-tuned LLMs in this
Knowledge-oriented benchmarks (e.g., MMLU [362] and C-
                                                                     part refer to the model checkpoints obtained after in-
Eval [713]) aim to evaluate the capacity of world knowledge,
                                                                     struction tuning or alignment tuning based on pre-trained
while reasoning-oriented benchmarks (e.g., GSM8K [645],
                                                                     model weights40 . Typically, fine-tuned LLMs will be tested
BBH [363], and MATH [362]) focus on evaluating the ca-
                                                                     on various abilities (e.g., knowledge utilization and hu-
pability of solving complex reasoning tasks. Further, some
                                                                     man alignment), and thus it is common that they are as-
recently proposed benchmarks (e.g., OpenCompass [715])
                                                                     sessed with multiple evaluation approaches. In addition
combine these two types for a comprehensive comparison.
                                                                     to benchmark-based evaluation, human-based and model-
    • Benchmark based evaluation procedure. To perform the           based approaches have also been widely used to evaluate
benchmark evaluation, each problem will first be formatted
into a prompt for LLMs to generate the result text. Then,             40. In some cases, it is also called chat models.
                                                                                                                               65

the advanced abilities of fine-tuned LLMs. Next, we will         analysis to question answering. It has been used collab-
introduce the two evaluation methods.                            oratively with BBH [363] to evaluate finical LLMs like
    • Human-based evaluation. Unlike automatic evaluation        BloombergGPT [358].
for basic abilities, human evaluation typically considers
                                                                 Pros and Cons of Different Evaluation Approaches. In the
more factors or abilities in real-world use, such as hu-
                                                                 above, we have discussed different evaluation approaches
man alignment and tool manipulation. In this evaluation
                                                                 to assess the abilities of LLMs. Next, we simply analyze the
approach, test tasks are usually in the form of open-
                                                                 pros and cons of each evaluation approach.
ended questions, and human evaluators are invited to make
judgments on the quality of answers generated by LLMs.                • Benchmark-based approach. This evaluation approach can
Typically, there are two main types of scoring methods           leverage existing benchmarks for assessing the performance
for human evaluators: pairwise comparison and single-            of LLMs. The tasks involved in these benchmarks often
answer grading. In pairwise comparison, given the same           contain sufficient test samples to measure the core abilities
question, humans are assigned two answers from different         (e.g., reasoning). The whole evaluation procedure can be
models to determine which one is better, while in single-        (almost) automatic, and it is convenient to carry out test
answer grading, they only need to score a single answer          experiments for various base LLMs, especially useful for
at a time. For example, HELM [522] employs humans                monitoring the performance of model checkpoints during
to perform single-answer grading on summarization and            pre-training. However, LLMs are often sensitive to the eval-
disinformation tasks, while Chatbot Arena [729] constructs       uation settings, including the question prompts, zero-shot or
a crowdsourcing platform that allows users to engage in          few-shot tests, and the answer parsing methods. Thus, one
conversations with two anonymous chat LLMs and report            should take possible influencing factors into consideration
pairwise comparison results.                                     when conducting the evaluation experiments. The evalua-
                                                                 tion results should be noted with the adopted evaluation
    • Model-based evaluation. Since human-based evaluation
                                                                 settings. Another issue is the data contamination [56, 740],
is both expensive and time-consuming, some work has
                                                                 i.e., the test data itself or relevant content has been contained
proposed leveraging powerful closed-source LLMs such
                                                                 in the pre-training corpora. This phenomenon has become
as ChatGPT and GPT-4 as a surrogate for human evalu-
                                                                 increasingly severe since more and more open data has been
ators [729, 731]. For example, AlpacaEval [731] collects a
                                                                 collected for developing LLMs.
set of instructions and utilizes a capable LLM (e.g., GPT-4)
as the judge to perform pair-wise comparisons against the             • Human-based approach. Human evaluation offers several
reference outputs. Furthermore, MT-bench [729] collects a        advantages when assessing the capabilities of LLMs to solve
set of multi-turn questions for evaluation and improves the      real-world tasks. One of the key benefits is its ability to
reliability of LLM-based evaluators through methods like         directly reflect the actual abilities of LLMs. Based on feed-
ICL and CoT. Compared with human evaluators, LLMs such           back and experiences from real users, human evaluation
as ChatGPT and GPT-4 can achieve high agreement with             provides a more direct measure of LLMs’ performance in
humans, in both small-scale handcrafted and large-scale          real-world scenarios. Further, it can conduct more flexible
crowdsourced evaluation tasks. Despite this, these closed-       and diverse evaluation tasks based on human evaluators.
source LLMs are limited in access and have the potential         For instance, users can submit various queries and test the
risk of data leakage. To address this, recent work [729] has     abilities of LLMs according to their own task cognition. It
explored fine-tuning open-source LLMs (e.g., Vicuna [152])       allows for a deep understanding of the strengths and weak-
as model evaluators using scoring data from human eval-          nesses of LLMs across different types of tasks and contexts.
uators, which has narrowed the gap with powerful closed-         However, human evaluation also has inherent limitations
source LLMs (e.g., GPT-4).                                       that could potentially affect its accuracy and consistency.
                                                                 Factors such as personalized tastes and varying education
Evaluation of Specialized LLMs. Specialized LLMs refer           levels among evaluators can introduce biases or even incon-
to the model checkpoints specially adapted to some do-           sistencies in the evaluation process. In some cases, users’
mains or applications like healthcare [354] and finance [739].   judgments are likely to be subjective, which may not reflect
As special task solvers, specialized LLMs will be tested         the true capabilities of the LLMs. Moreover, conducting
not only on general abilities (e.g., basic ability like com-     robust and reliable human evaluations often requires a large
plex reasoning and advanced ability like human align-            number of evaluators, which can be very expensive and
ment), but also on specific abilities related to their des-      time-consuming. In addition, human evaluation is often
ignated domains or applications. For this purpose, one           not reproducible, making it infeasible to extend existing
often needs to construct specific benchmarks tailored for the    evaluation results or track the progress of LLMs.
target domains or applications. Then, these domain-specific           • Model-based approach. As a surrogate for human-based
benchmarks can be combined with general benchmarks to            approaches, model-based approaches serve to diminish the
conduct both comprehensive and targeted evaluation for           reliance on human involvement, and enable more efficient
specialized LLMs. For example, MultiMedQA [354] is a             and scalable evaluation. In addition, LLMs can provide
specific benchmark in healthcare, which includes medical         meaningful explanations for the assigned rating scores,
examinations and healthcare questions. In this work [354],       thereby enhancing the interpretability of evaluations. De-
MultiMedQA has been combined with MMLU [362] to                  spite their scalability and explanability, model-based ap-
assess the performance of specialized LLMs for healthcare,       proaches have been found to suffer from several issues, in-
such as Med-PaLM [354]. Similarly, FLUE [739] constructs a       cluding position, verbosity, and self-enhancement bias [729].
benchmark for finance, spanning from financial sentiment         Specially, position bias (i.e., the order to present the re-
                                                                                                                                     66

sponses) refers to the fact that LLMs tend to assign high               Claude 2, where the first three models are developed by
scores for the answers at specific positions over others,               OpenAI and the other two are developed by Anthropic.
verbosity bias means that LLMs favor verbose answers even
if they are short in quality compared with shorter answers,             Tasks and Datasets. Next, we set up the evaluation tasks
and self-enhancement bias indicates that LLMs often over-               and datasets for the abilities discussed in Section 7.1 and
rate in their own generations. In addition, since LLMs have             Section 7.2. We mainly evaluate the zero-shot performance
limited capacities in solving complex reasoning problems,               of LLMs on these datasets. For more complex tasks that are
they cannot serve as qualified evaluators for some difficult            hard to be solved in the zero-shot manner (e.g., mathemati-
tasks (e.g., mathematical reasoning). These limitations can             cal reasoning and tool manipulation), we mainly report the
be mitigated to some extent by specific prompt engineering              3-shot performance, considering the context length limit of
and fine-tuning strategies [729].                                       open-source models.
    To summarize, our categorization (Table 15) of existing                  • Language generation. As discussed before, for language
work on LLM evaluation is mainly based on two major di-                 generation, we consider evaluating three kinds of tasks,
mensions, namely evaluation methodology and model type,                 i.e., language modeling, conditional text generation, and
which are further extended with the test abilities. There               code synthesis. Specially, we select four commonly-used
are some recent work [735, 736] that also has discussed                 datasets, namely LAMBADA [252] (language modeling),
the categorization or taxonomies of existing work for LLM               WMT’22 [547] (machine translation), XSum [551] (text sum-
evaluation.                                                             marization), and HumanEval [105] (code synthesis) for eval-
                                                                        uation. In WMT’22, we construct a new evaluation set
                                                                        by selecting 1000 examples for each language pair from
7.4     Empirical Evaluation                                            the original large-scale test set to examine the average
The above evaluation benchmarks and approaches are                      performance of LLMs in machine translation. We evaluate
mainly employed to evaluate the overall abilities of LLMs.              the zero-shot performance of LLMs on these datasets, and
In this part, we conduct a fine-grained evaluation of the               compute the accuracy of predicting words for LAMBADA,
abilities discussed in Section 7.1 and Section 7.2. For each            BLEU-4 for WMT’22, ROUGE-L for XSum, and pass@10 for
kind of ability, we select representative tasks and datasets            HumanEval.
for conducting evaluation experiments to examine the cor-                    • Knowledge utilization. To evaluate the ability of knowl-
responding performance of LLMs.                                         edge utilization, we select four question answering datasets
                                                                        (i.e., TriviaQA [560], Natural Questions [556], Web Ques-
                                                                        tions [559], and ARC [557]), and a fact extraction dataset,
7.4.1    Experimental Settings
                                                                        WikiFact [573]. We also report the zero-shot performance of
In this part, we introduce the experimental settings for our            LLMs on these datasets, and compute accuracy for ARC and
evaluation.                                                             exact match for other datasets.
Evaluation Models. To conduct the evaluation, we consider                    • Complex reasoning. For complex reasoning, we eval-
representative LLMs from open-source models to closed-                  uate the comparison models on OpenbookQA [568], Hel-
source API-accessing models as follows:                                 laSwag [584], and SocialIQA [583] for knowledge reason-
    • Open-source models. Existing open-source models can be            ing; Colored Objects [70] and Penguins in the Table [70]
categorized into base models and instruction-tuned models.              for symbolic reasoning; GSM8k [198] and MATH [362] for
Base models are only pre-trained on a large general-purpose             mathematical reasoning. We compute the accuracy for Open-
corpus with the language modeling objective, but without                bookQA, HellaSwag, and SocialIQA; solve rate for Colored
further supervised fine-tuning. In our evaluation, we select            Objects and Penguins in the Table; and accuracy for GSM8k
four representative base models including LLaMA (7B) [57],              and MATH. For knowledge reasoning tasks, we evaluate
LLaMA 2 (7B) [99], Pythia (7B and 12B) [96], and Falcon                 the zero-shot performance, since they are all QA tasks that
(7B) [749]41 . Instruction-tuned models are those fine-tuned            can be solved in a zero-shot setting. For complex symbolic
using instructions (i.e., task datasets, daily chat, or syn-            reasoning and mathematical reasoning tasks, we leverage
thetic instructions). In our experiments, we select four rep-           3-shot in-context exemplars to better elicit LLMs to accom-
resentative instruction-tuned models including Vicuna (7B               plish them. Following existing work [33, 436], we also utilize
and 13B) [152], Alpaca (7B) [187], and ChatGLM (6B) [93].               the chain-of-thought prompting strategy for better solving
In addition, we also include LLaMA 2-Chat (7B) [99] for                 the mathematical reasoning tasks.
comparison, and it is a representative model that has been                   • Human alignment. For human alignment, we select
aligned with human via instruction tuning and RLHF, based               TruthfulQA [558] to measure whether a LLM is truth-
on LLaMA 2 (7B).                                                        ful in generating answers to questions, CrowS-Pairs [605]
    • Closed-source models. In addition to the open-source              and WinoGender [606] to assess the stereotypes in LLMs,
models, there are also closed-source models that can only               RealToxityPrompts [607] to evaluate the extent to which
be accessed via APIs, which have gained much attention                  LLMs generate toxic language, and HaluEval [604] to test
from both developers and researchers. Here, we select four              the ability of LLMs to recognize hallucination. As the test
representative closed-source models including text-davinci-             set of Real-Toxicity-Prompts is too large, we randomly
002/003 (short as Davinci002/003), ChatGPT, Claude, and                 sample 10000 examples from it for evaluation. We fol-
                                                                        low LLaMA [57] to report the zero-shot performance, and
  41. Experiments with larger models are still in schedule due to the   compute the accuracy of identifying a claim as true for
limit of computational resources.                                       TruthfulQA, accuracy of recognizing biased sentences (high
                                                                                                                                                   67

TABLE 16: Evaluation on the eight abilities of LLMs with specially selected tasks. The shade of the Orange and Blue
fonts denote the performance orders of the results in closed-source and open-source models, respectively. This table will
be continuously updated by incorporating the results of more models.

                                         Language Generation                                           Knowledge Utilization
      Models
                          LBD↑       WMT↑           XSum↑          HumanEval↑ TriviaQA↑ NaturalQ↑          WebQ↑       ARC↑        WikiFact↑
      ChatGPT              55.81      36.44          21.71            79.88        54.54       21.52        17.77       93.69         29.25
      Claude               64.47      31.23          18.63            51.22        40.92       13.77        14.57       66.62         34.34
      Claude 2             45.20      12.93          19.13            78.04        54.30       21.30        21.06       79.97         35.83
      Davinci003           69.98      37.46          18.19            67.07        51.51       17.76        16.68       88.47         28.29
      Davinci002           58.85      35.11          19.15            56.70        52.11       20.47        18.45       89.23         29.15
      LLaMA 2-Chat (7B)    56.12      12.62          16.00            11.59        38.93       12.96        11.32       72.35         23.37
      Vicuna (13B)         62.45      20.49          17.87            20.73        29.04       10.75        11.52       20.69         28.76
      Vicuna (7B)          63.90      19.95          13.59            17.07        28.58       9.17         6.64        16.96         26.95
      Alpaca (7B)          63.35      21.52          8.74             13.41        17.14        3.24        3.00        49.75         26.05
      ChatGLM (6B)         33.34      16.58          13.48            13.42        13.42        4.40        9.20        55.39         16.01
      LLaMA 2 (7B)         66.39      11.57          11.57            17.07        30.92        5.15        2.51        24.16         28.06
      LLaMA (7B)           67.68      13.84          8.77             15.24        34.62       7.92         11.12        4.88         19.78
      Falcon (7B)          66.89       4.05          10.00            10.37        28.74       10.78        8.46         4.08         23.91
      Pythia (12B)         61.19       5.43          8.87             14.63        15.73        1.99        4.72        11.66         20.57
      Pythia (7B)          56.96       3.68          8.23             9.15         10.16        1.77        3.74        11.03         15.75
                                   Knowledge Reasoning              Symbolic Reasoning       Mathematical Reasoning Interaction with Environment
      Models
                          OBQA↑ HellaSwag↑        SocialIQA↑       C-Objects↑    Penguins↑    GSM8k↑      MATH↑       ALFW↑        WebShop↑
      ChatGPT              81.20      61.43          73.23            53.20        40.27       78.47        33.78       58.96      45.12/15.60
      Claude               81.80      54.95          73.23            59.95        47.65       70.81        20.18       76.87      47.72/23.00
      Claude 2             71.60      50.75          58.34            66.76        74.50       82.87        32.24       77.61      34.96/19.20
      Davinci003           74.40      62.65          69.70            64.60        61.07       57.16        17.66       65.67      64.08/32.40
      Davinci002           69.80      47.81          57.01            62.55        67.11       49.96        14.28       76.87      29.66/15.20
      LLaMA 2-Chat (7B)    45.62      74.01          43.84            43.40        38.93        9.63        2.22        11.19      24.51/5.60
      Vicuna (13B)         43.65      70.51          45.97            53.55        36.91       18.50        3.72         8.96      22.74/5.00
      Vicuna (7B)          43.84      69.25          46.27            44.25        36.24       14.03        3.54        1.49        6.90/1.40
      Alpaca (7B)          47.82      69.81          47.55            39.35        40.27       4.93         4.16        4.48       0.00/0.00
      ChatGLM (6B)         30.42      29.27          33.18            14.05        14.09        3.41        1.10        0.00       0.00/0.00
      LLaMA 2 (7B)         44.81      74.25          41.72            43.95        35.75       10.99        2.64        8.96        0.00/0.00
      LLaMA (7B)           42.42      73.91          41.46            39.95        34.90       10.99        3.12        2.24       0.00/0.00
      Falcon (7B)          39.46      74.58          42.53            29.80        24.16        1.67        0.94        7.46        0.00/0.00
      Pythia (12B)         37.02      65.45          41.53            32.40        26.17        2.88        1.96        5.22       3.68/0.60
      Pythia (7B)          34.88      61.82          41.01            29.05        27.52        1.82        1.46        7.46       10.75/1.80
                                                Human Alignment                                               Tool Manipulation
      Models
                          TfQA↑     C-Pairs↓    WinoGender↑           RTP↓       HaluEval↑ HotpotQA↑ Gorilla-TH↑ Gorilla-TF↑      Gorilla-HF↑
      ChatGPT              69.16      18.60    62.50/72.50/79.17      3.07         66.64       23.80        67.20       44.53         19.36
      Claude               67.93      32.73    71.67/55.00/52.50      3.75         63.75       33.80        22.04        7.74          7.08
      Claude 2             71.11      10.67    60.00/60.00/55.83      3.20         50.63       36.4         61.29       22.19         23.67
      Davinci003           60.83       0.99    67.50/68.33/79.17      8.81         58.94       34.40        72.58        3.80          6.42
      Davinci002           53.73       7.56    72.50/70.00/64.17      10.65        59.67       26.00        2.69         1.02          1.00
      LLaMA 2-Chat (7B)    69.77      48.54    47.50/46.67/46.67      4.61         43.82       4.40         0.00        0.00          0.22
      Vicuna (13B)         62.30      45.95    50.83/50.83/52.50      5.00         49.01       11.20        0.00        0.44          0.89
      Vicuna (7B)          57.77      67.44    49.17/49.17/49.17      4.70         43.44       6.20         0.00        0.00          0.33
      Alpaca (7B)          46.14      65.45    53.33/51.67/53.33      4.78         44.16       11.60        0.00        0.00          0.11
      ChatGLM (6B)         63.53      50.53    47.50/47.50/46.67      2.89         41.82       4.00         0.00        0.00          0.00
      LLaMA 2 (7B)         50.06      51.39    48.83/48.83/50.83      6.17         42.23        3.80        0.00        0.00          0.11
      LLaMA (7B)           47.86      67.84    54.17/52.50/51.67       5.94        14.18        1.60        0.00        0.00          0.11
      Falcon (7B)          53.24      68.04    50.00/50.83/50.00       6.71        37.41        1.00        0.00        0.00          0.00
      Pythia (12B)         54.47      65.78    49.17/48.33/49.17       6.59        27.09        0.40        0.00        0.00          0.00
      Pythia (7B)          50.92      64.79    51.67/49.17/50.00      13.02        25.84        0.20        0.00        0.00          0.00



perplexity) for CrowS-Pairs, coreference resolution accuracy                    evaluation.
(he/she/they) for WinoGender, toxicity score for RealToxi-                          • Interaction with environment. To test this ability, we
tyPrompts, and average accuracy of recognizing hallucina-                       select ALFWorld [611] and WebShop [612] for evaluation,
tions for HaluEval. For TruthfulQA, we follow existing                          which simulate real-world scenarios such as household
work [57] that utilizes text-davinci-003 to replace humans                      and e-commerce environments. We follow the setting of
for scoring. For Crows-Pairs and WinoGender, we follow                          ReAct [442] that evaluate the 1-shot and 2-shot performance
the experimental settings of LLaMA [57] to compute the                          of LLMs on WebShop and ALFWorld respectively, and com-
perplexity and coreference resolution score. For RealTox-                       pute success rate for ALFWorld and average score/success rate
ityPrompts, we utilize the Perspective-API42 for toxicity                       for WebShop. Further, we also follow ReAct [442] to reduce
                                                                                the length of the input prompt and utilize line break as the
 42. https://perspectiveapi.com/                                                EOS token.
                                                                                                                              68

TABLE 17: Prompt examples and their performance of ChatGPT on representative tasks. For most tasks, we compare the
performance for simple and complex prompts. We also present the reported performance of supervised methods. “LG”,
“KU”, “CR”, “SDG”, “IR” are short for “language generation”, “knowledge utilization”, “complex reasoning”, “structured
data generation”, “information retrieval”. “-” means there is no reported supervised result previously on this dataset.

        Tasks             Datasets                                   Instructions                           ChatGPT    Supervised
                                      I want you to act as a translator. Please translate the English          20.66
                                      sentence into Czech.
       Translation          WMT                                                                                        41.40 [741]
                                      I want you to act as a translator. Translate the given English           21.12
                                      sentence into Czech, and ensure that the translated sentence is
                                      semantically consistent with the given sentence. \n Sentence:
                                      {source sentence} \n Translation:
 LG
                                      Please generate a one-sentence summary for the given document.           21.71

       Summarization       XSum       {document} Try your best to summarize the main content of the given      23.01   42.08 [742]
                                      document. And generate a short summary in 1 sentence for it.\n
                                      Summary:
                                      Choose your answer to the question. {query} {options}                    85.19
       Closed-Book QA       ARC                                                                                        92.00 [743]
                                      Choose a correct answer according to the given question, and output      85.86
                                      the corresponding id, do not answer other content except the answer
                                      id.
                                      Choose your answer to the question: {question} {choices}. You must       81.20
 KU                                   only output A, B, C, or D without any extra explanation. The answer
                                      is
       Open-Book QA        OBQA                                                                                        87.20 [743]
                                      Following is a question that requires multi-step reasoning, use          82.20
                                      of additional common and commonsense knowledge, and rich text
                                      comprehension. Choose your answer to the question: \n Question:
                                      Frilled sharks and angler fish live far beneath the surface of the
                                      ocean, which is why they are known as \n Choices: \n A. Deep sea
                                      animals \n B. fish \n C. Long Sea Fish \n D. Far Sea Animals \n You
                                      must only output A, B, C, or D without any extra explanation. The
                                      answer is
                                      Complete the sentence with one or a few words.                           29.25
       Fact Extraction      WikiF                                                                                      34.20 [522]
                                      Complete the given sentence with one entity name in Wikipedia (MUST      31.21
                                      be a noun) as short as possible, and ensure that the completed
                                      sentence conforms to the facts.
                                      Problem: {problem}\n Answer:                                             53.20
       Symbolic Reasoning C-Objects                                                                                            —
                                      You are an expert in reasoning problem. Here are some examples           66.75
                                      about symbolic reasoning. You can use the knowledge in examples and
                                      solve the last problem. You should follow the examples and generate
                                      the final answer without external solution or words.
 CR                                   Problem: {problem}\n Solution: Let’s think step by step.                 78.47

       Math Word Problems GSM8k       Let’s use python to solve math problems. Here are three examples         79.30   63.20 [744]
                                      how to do it,\n Q: Olivia has $23. She bought five bagels for $3
                                      each. How much money does she have left?\n‘‘‘def solution():\n
                                      """Olivia has $23. She bought five bagels for $3 each. How
                                      much money does she have left?"""\n     money_initial = 23\n
                                      bagels = 5\n     bagel_cost = 3\n     money_spent = bagels *
                                      bagel_cost\n     money_left = money_initial - money_spent\n
                                      result = money_left\n     return result‘‘‘\n ...... \n How about
                                      this question?\n Q:
       Code Synthesis    HumanEval    I want you act as a code completer. Given a code snippet, your           79.88   48.20 [745]
                                      objective is to complete the code and ensure that it can achieve
                                      the described functionality.
 SDG
       Text-to-SQL         Spider     ### Complete sqlite SQL query only and with no explanation.\n            70.10   84.10 [746]
                                      #\n### Sqlite SQL tables, with their properties: \n#\n{table}\n#
                                      {foreign_key}\n#\n### {question}\n SELECT
       Recommendation    MovieLens    I’ve watched the following movies in the past in order: \n               48.80   76.25 [747]
                                      {user_his_text} \n\n Now there are {recall_budget} candidate movies
                                      that I can watch next: \n {candidate_text_order} \n Please rank
                                      these {recall_budget} movies by measuring the possibilities that I
                                      would like to watch next most, according to my watching history.
                                      Please think step by step. \n Note that my most recently watched
                                      movie is {recent_item}. Please show me your ranking results with
  IR                                  order numbers. Split your output with line break. You MUST rank the
                                      given candidate movies. You can not generate movies that are not in
                                      the given candidate list.
       Conversational      ReDial     Recommend 10 items that are consistent with user preference. The         17.20   25.60 [748]
       Recommenda-                    recommendation list can contain items that the dialog mentioned
       tion                           before. The format of the recommendation list is: no. title (year).
                                      Don’t mention anything other than the title of items in your
                                      recommendation list
                                                                                                                               69

    • Tool manipulation. For tool manipulation, we consider           • All the comparison models perform not well on very diffi-
two kinds of tools including search engine and model in-          cult reasoning tasks. On MATH and HotpotQA, all models
terfaces. Therefore, we adopt two tool manipulation bench-        (including ChatGPT) perform not well. The two tasks are
marks, i.e., HotpotQA [581] and Gorilla [619]. HotpotQA           very difficult to solve, requiring accurate understanding of
requires LLMs to use search engine to retrieve documents          complex mathematical knowledge and performing multi-
from the web, and Gorilla to invoke model APIs from               hop reasoning across documents, respectively. Further, these
three hubs of TorchHub, TensorHub and HuggingFace. We             models also have a relatively weak performance on machine
compute exact match for HotpotQA and accuracy for Gorilla.        translation task (WMT). A possible reason is that WMT also
For HotpotQA, we follow ReAct [442] to report the 3-shot          contains many evaluation examples in minor languages,
performance. For Gorilla, we follow the code released by its      which might not be well covered in the pre-training data
paper [619], and evaluate the zero-shot performance.              of these LLMs.
Implementation Details. For each task and dataset, we             Analysis of Open-Source Models. Next, we continue to
evaluate the compared LLMs using the same prompts and             show our analysis and findings about eight open-source
results parsing method provided by existing work (i.e.,           models (i.e., LLaMA 2-Chat, Vicuna, Alpaca, ChatGLM,
TruthfulQA, HotPotQA, Gorilla, HaluEval) or designed ac-          LLaMA 2, LLaMA, Pythia and Falcon) as follows:
cording to our empirical experience (i.e., TriviaQA, Nat-             • Instruction-tuned models mostly perform better than the
ural Questions, Web Questions, ARC, WikiFact, GSM8k,              base models. Among all the compared open-source methods,
MATH, C-Objects, Penguins, LAMBADA, WMT’22, XSum,                 the instruction-tuned models (i.e., LLaMA 2-Chat, Vicuna,
HumanEval, CrowS-Pairs, WinoGender, RealToxityPrompt).            Alpaca and ChatGLM) mostly perform better than non-
Specifically, all the experiments about closed-source models      instruction-tuned models (i.e., LLaMA 2, LLaMA, Pythia
are based on invoking their official APIs, while for open-        and Falcon). It indicates that instruction tuning is generally
source models, we utilize their publicly available code and       capable of improving the few-shot or zero-shot ability of
model parameters, and perform the inference on 8 A800-            LLMs in solving various tasks. However, after instruction
80G GPUs. For TriviaQA, OpenbookQA, HellaSwag, and                tuning, Vicuna (7B) and Alpaca (7B) suffer from perfor-
SocialIQA, we experiment on the development set since the         mance degradations on LAMBADA, a language modeling
test set is not publicly released. While for other datasets,      task. The reason may be that the instruction data mainly
we experiment on the test set. To reproduce our experi-           focuses on enabling LLMs to follow human instructions,
ments, we also publicly release our experimental code and         which is not always useful for the general language gen-
data in https://github.com/RUCAIBox/LLMSurvey/tree/               eration task.
main/Experiments.                                                     • These small-sized open-source models perform not well on
                                                                  mathematical reasoning, interaction with environment, and tool
7.4.2   Results Analysis and Findings                             manipulation tasks. On the tasks of mathematical reasoning,
                                                                  interaction with environment and tool manipulation, all
We report the experimental results in Table 16, and analyze
                                                                  these evaluated open-source models perform not well, in-
the results in the following.
                                                                  cluding instruction-tuned ones. A possible reason is that the
Analysis of Closed-Source Models. We summarize our                instruction data for fine-tuning these models is not specif-
analysis and findings of the four closed-source models (i.e.,     ically designed for these tasks. In addition, these closed-
ChatGPT, Claude, Davinci003 and Davinci002) as follows:           source models may have limited model capacities due to
    • These five closed-source models achieve promising results   small model sizes.
as general-purpose task solvers, in which ChatGPT mostly per-         • The top-performing model varies on different human align-
forms the best. ChatGPT, Claude, Claude 2, Davinci003 and         ment tasks. For different human alignment tasks, we can see
Davinci002 perform well in most of tasks, including com-          that these models achieve inconsistent performance rank-
plex tasks (e.g., GSM8k), which have shown great potential        ings. For example, LLaMA 2-Chat (7B) performs the best
to be general-purpose task solvers. Among them, ChatGPT           among the compared open-source models on TruthfulQA,
exhibits a more superior model capacity on the evaluation         while Vicuna (13B) performs the best on CrowS-Pairs. A
tasks, winning the most across all tasks. In some evaluation      possible reason is that these tasks are designed with spe-
tasks, the performance gap between ChatGPT and other              cific purposes for evaluating different aspects of human
closed-source models is very large, especially for complex        alignment, and these models exhibit varied performance
tasks e.g., 78.47 (ChatGPT) v.s. 49.96 (Davinci002) on GSM8k,     on different tasks, even for the variants of the same model
and 79.88 (ChatGPT) v.s. 51.22 (Claude) on HumanEval.             (e.g., Pythia (7B) and Pythia (12B)). More experiments and
    • Claude 2, ChatGPT and Davinci003 perform better on inter-   analysis on human alignment evaluation are needed to
action with environment and tool manipulation tasks. On the two   reveal more detailed findings.
evaluation tasks, Claude 2, ChatGPT and Davinci003, per-              • As a more recently released model, LLaMA 2 (7B) overall
form better than other models by a large margin, e.g., 36.40      achieves a good performance, especially on complex reasoning
(Claude 2) v.s. 26.00 (Davinci002) on HotpotQA, 44.53 (Chat-      tasks. For complex reasoning tasks, LLaMA 2 (7B) mostly
GPT) v.s. 7.74 (Claude) on Gorilla-TF, and 72.58 (Davinci003)     performs better than other base models, e.g., 43.95 (LLaMA
v.s. 22.04 (Claude) on Gorilla-TH. A possible reason is that      2 (7B)) v.s. 29.80 (Falcon (7B)) in C-Objects. For other
these three models have been specially optimized towards          tasks (e.g., language generation and knowledge utilization),
these advanced abilities, e.g., supporting the use of external    LLaMA 2 (7B) can also achieve comparable performance
plugins.                                                          as the best-performing base models. It has used more data
                                                                                                                                          70

for pre-training (i.e., about 2 trillion tokens), which mainly              performance by fine-tuning on it. Recent studies [55, 754]
contributes to the excellent performance. Furthermore, it                   have also tested the performance of LLMs on these tasks,
also conducts a more robust data cleaning process.                          showing that LLMs can also perform well via in-context
    • Scaling the open-source modes can improve the performance             learning (with very few examples). Whereas, as small mod-
consistently. By comparing the performance of Vicuna (7B)                   els can be specially optimized on these tasks to learn the
and Vicuna (13B), Pythia (7B) and Pythia (13B), we can see                  specific task requirement and domain knowledge, full-data
that the models with larger scales mostly perform better                    fine-tuned small models can mostly outperform LLMs using
than smaller ones on these evaluation tasks, indicating the                 in-context learning on several classic tasks [755, 756], e.g.,
effectiveness of scaling up the model size. Across different                semantic matching and sentiment analysis.
tasks, scaling model is more beneficial for more complex
tasks (e.g., symbolic and mathematical reasoning), where the                Sequence Tagging. The sequence tagging tasks, e.g., named
larger models mostly outperform smaller ones in a large                     entity recognition (NER) [757] and part-of-speech (POS)
margin.                                                                     tagging [758], are also fundamental tasks. Typically, such
    The readers should be note that these findings about                    tasks require assigning each token in the input sequence a
open-source language models are limited to the model sizes.                 proper semantic category label, e.g., the classic B-I-O (Be-
We will continually update this part by including the results               ginning, Inside and Outside) tagging scheme for NER tasks.
of larger versions of these models, and also call for the                   In the era of deep learning, early efforts [759, 760] mainly
support of computational resources for more experiments.                    integrate the learned sequence representations (e.g., using
                                                                            CNN, LSTM, and BERT) into the classic conditional random
                                                                            field model (CRF), which performs the tagging task based
8     A PPLICATIONS                                                         on structural prediction. Recently, researchers have tested
In this section, we briefly review the recent progress on the               the performance of LLMs in sequence tagging tasks, but ob-
applications of LLMs in two aspects, namely the impact to                   served that LLMs still face challenges in solving them using
research community and representative domains. Figure 18                    in-context learning [755], especially for special categories
shows a content organization of this section43 .                            with ambiguous or rare names, e.g., the “MISC” (miscella-
                                                                            neous entity) and “ORG” (organization) classes. A possible
                                                                            reason is that LLMs may misunderstand the meanings of
8.1   LLM for Research Community                                            these classes in the human-annotated dataset, making it
As LLMs have revolutionized the way how we develop                          difficult to accurately understand their semantics according
AI algorithms, it poses significant impact on the research                  to the instruction and limited examples in the context.
community. In this part, we briefly review the advances that
                                                                            Information Extraction. The information extraction task
led by LLMs for several representative research directions.
                                                                            focuses on automatically extracting useful structured infor-
                                                                            mation from unstructured text data, such as relation extrac-
8.1.1 LLM for Classic NLP Tasks
                                                                            tion [761] and event extraction [762], which is also a crucial
As pre-trained language models (e.g., BERT) have originated                 task relating to many NLP applications. Typically, previous
in the field of NLP, the technical advances of language                     studies formulate this task as a text classification task or
models has an important impact on the research of NLP. In                   a sequential labeling task. As information extraction often
this part, we discuss the application of LLMs on five kinds                 needs to accurately understand and process complex se-
of classic NLP tasks, including word-level, sentence-level,                 mantic relations (multiple relations within one sentence), in-
sequence tagging, relation extraction, and text generation                  context learning with LLMs typically underperform state-
tasks, which had been the foundation of many existing NLP                   of-the-art full-data fine-tuning methods [763, 764]. Whereas,
systems and applications. Note that we do not intend to                     it is shown that enabling collaboration between LLMs and
comprehensively cover all NLP tasks, but instead try to                     small models can further boost the performance of specific
analyze the impact of LLMs for fundamental NLP research                     tasks [764, 765]. In addition, a recent study [766] also reveals
through the basic tasks. We also omit the discussion of sev-                that LLMs can achieve competitive zero-shot performance
eral tasks (e.g., language modeling) that have been discussed               for information extraction with a two-stage workflow, mak-
early in this survey.                                                       ing this approach attractive in future applications.
Word/Sentence-level Tasks. As long-standing NLP tasks,                      Text Generation. Text generation tasks, e.g., machine trans-
word-level (e.g., word clustering [750] and sense disam-                    lation [626] and automatic summarization [550], are long-
biguation [751]) and sentence-level tasks (sentence match-                  standing NLP tasks that have been widely studied, and
ing [752] and sentiment classification [753]) have been                     there have been a number of deployed products and sys-
widely studied in the literature and applied in real-world                  tems based on fine-tuned small models [309, 767]. Since the
platforms. To solve these tasks, the key is to accurately                   pre-training of LLMs is established on text prediction, they
understand the semantic information about the words or                      exhibit strong language generation abilities as commercial
sentences. As rich high-quality labeled data about these                    products [629] and humans [630], with the help of proper
tasks has been accumulated so far, existing work [23, 39]                   prompts [768, 769]. Additionally, LLMs are flexible to effec-
finds that small language models can achieve very good                      tively handle special requirement in real-world application
                                                                            scenarios, e.g., document-level translation [770], and also
  43. Note that we don’t aim to cover all the related research directions
or domains, but instead demonstrating the use or impact of LLMs via         enable natural language interaction with users to further
these selected examples.                                                    improve the generation quality [771]. Despite the above
                                                                                                                                                             71

                                                                                                      •   Word/Sentence-level Tasks
                                                                       LLM for Classic NLP Tasks      •   Sequence Tagging
                                                                                                      •   Information Extraction
                                                                                                      •   Text Generation
                                          Classic Scenarios                                           •   LLM as IR Model
                                                                              LLM for IR              •   LLM-Enhanced IR Models

                                                                                                      •   LLM as Recommendation Model
                                                                       LLM for Recommendation         •   LLM-enhanced Recommendation Models
                     Research                                                                         •   LLM as Recommendation Simulator
                     Directions                                                                       •   Vision-Language Alignment Pre-Training
                                                                           Multimodal LLMs            •   Visual Instruction Tuning
                                                                                                      •   Evaluation of MLLM
                                        Enhanced Capabilities
                                                                                                      •   Retrieval-augmented LLM
                                                                           KG Enhanced LLM            •   Synergy Augmented LLM

   LLM for                                                                                            •   Components: Memory/Planning/Execution
  Application                                                              LLM-based Agent
                                                                                                      •   Single/Multi-agent based Application
                                           New Scenarios
                                                                                                      •   Score/Language-based Evaluation
                                                                           LLM for Evaluation         •   Instruction Design, Multiple Feedbacks, Debate Agent
                                                                                                      •   Meta-Evaluation

                                                                                         Scientific
                     Specific Domains                         Healthcare    Finance                        Law          Education
                                                                                         Research

                Fig. 18: The applications of LLMs in representative research directions and downstream domains.


success, recent work also reveals that LLMs are hard to well                  8.1.2   LLM for Information Retrieval
address the generation tasks about low-resource languages
and domains, e.g., Marathi-to-English translation [772], due                  The goal of information retrieval (IR) systems is to assist
to their unbalanced training data across different languages.                 users in discovering ideal information resources (typically
                                                                              documents) and mitigating the information overload issue.
                                                                              Typically, contemporary IR systems adopt a retrieve-then-
Summary. Based on the above discussion, we summarize
                                                                              rerank pipeline framework [54]. Within this framework,
the suggestions, and future direction about the use of LLMs
                                                                              the retriever initially retrieves relevant information from a
in classic NLP tasks as follows:
                                                                              large-scale corpus, and the reranker subsequently performs
    • Suggestions: LLMs and small models have their own                       multi-stage ranking procedure to acquire the most relevant
merits in different aspects: LLMs are can provide unified                     information [776]. Since the advent of LLMs has significant
solutions to various NLP tasks and achieve competitive                        impact on the way of information access, we discuss how
performance (especially in the zero/few-shot setting), while                  it advances the development of IR from two main aspects,
small models are economical to develop and can be specially                   namely LLMs as IR models and LLM-enhanced IR models.
tuned according to target tasks, which can achieve good
performance with sufficient high-quality labeled data [755,                   LLMs as IR Models. Existing IR models can be overall
756, 773, 774]. In applications, one can make suitable choices                categorized into sparse models (relying on term-based lexical
based on the actual needs, comprehensively considering                        similarity) and dense models (relying on embedding based
flexibility, data availability, training compute, and efficiency.             semantic similarity) [54]. Specially, dense models are mainly
                                                                              implemented by fine-tuned PLMs (e.g., BERT). Compared
    • Future direction: Despite the excellent general capac-                  to PLMs, LLMs have more strong model capacities in
ities, LLMs still cannot effectively process the NLP tasks                    capturing text semantics, thus having the potential to im-
in low-resource domains, e.g., minor language translation.                    prove existing dense IR models. However, due to the high
To tackle such tasks, it needs to develop effective ap-                       overhead of LLMs, the majority of studies concentrate on
proaches to injecting necessary task information or domain-                   employing LLMs as rerankers, aiming to refine the rank-
specific knowledge into LLMs, either through fine-tuning                      ing of retrieved candidates. To achieve this, recent efforts
or prompting. In addition, it is still challenging for LLMs to                often formulate special instructions that enable LLMs to
handle complex semantic relations in classic NLP tasks (e.g.,                 perform reranking on a small set of provided candidate
nested entity extraction), which is worth more exploration                    documents. Typically, such an approach does not necessitate
from the underlying working mechanism of LLMs. It is also                     model training, and achieve promising results compared
promising to combine LLMs and fine-tuned small language                       with well-trained reranking methods [777, 778]. Specially,
models for complementing with each other in solving com-                      the LLM-based reranking approach can be implemented
plex cases of classic NLP tasks [775]. Another promising di-                  in different ways by zero-shot or few-shot instruction, in-
rection is to conduct human-machine collaborative research                    cluding pointwise (estimating the relevance scores for query-
(e.g., conversational translation [771]) on NLP tasks, since                  document pairs) [779], pairwise (determining the relevance order
LLMs can effectively understand human instructions and                        of two documents) [778], or listwise ranking (sorting a subset of
make meaningful responses.                                                    candidate documents) [780]. The essence of these methods lies
                                                                                                                              72

in the special design of instructions for text reranking, such   be discussed in Section 8.1.4, multimodal large language
as sliding window strategy for document lists [777, 781],        models [800] are also widely studied, making it feasible to
setwise selection prompting [782], fine-grained relevance la-    develop more powerful multimedia retrieval systems.
bels incorporation [783], and pairwise comparison prompt-
ing [778]. In addition, recent efforts employ LLMs to gen-       8.1.3   LLM for Recommender Systems
erate intermediate texts (e.g., URLs) as retrieval results us-
ing few-shot demonstrations [784]. To further enhance the        Unlike IR systems that analyze user search queries to
model performance, LLMs can be specially fine-tuned as           retrieve relevant documents, recommender systems (RS)
backbones for reranking [785, 786] or retrieval (including       aim to capture the underlying user preference and pro-
dense retrieval [54] and model-based retrieval [787, 788]),      vide appropriate information resources to users [801–804].
similar to the fine-tuning process for traditional PLM-based     Typically, existing studies train a recommendation model
IR models [785]. However, fine-tuning LLMs as IR models          (either classic or deep learning model) by fitting it over
entails considerable expenses given the huge parameter           the user’s logged data (e.g., click data) [747, 805]. However,
scale of LLMs.                                                   these models often suffer from a series of technical issues,
                                                                 e.g., cold-start recommendation, domain transfer, and poor
LLM-Enhanced IR Models. As another major research                explainability. Recently, LLMs have demonstrated the po-
direction, LLMs can be employed to improve existing IR           tential to alleviate these issues of recommendation mod-
models (e.g., small models). A common challenge faced            els [355, 806, 807], due to the strong capacities of domain
by existing IR models is the lack of relevant judgment           generalization and language generation. In this part, we
annotation [789, 790]. To tackle this problem, LLMs can be       briefly review the recent progress of LLMs in recommender
instructed to annotate positive or negative documents for        systems, from the following three aspects, namely LLMs as
a given query [791], or to generate corresponding queries        recommendation models, LLM-enhanced recommendation
based on a set of documents in the corpus by referring to a      models, and LLMs as recommendation simulators.
few demonstrations [792, 793]. In addition to training data      LLMs as Recommendation Models. With specific methods
augmentation, LLM has the potential to improve existing          or mechanisms, LLMs can be adapted to serve as recom-
IR models by refining the search-oriented informativeness        mendation models. Existing work along this line can be
of both queries and documents. In IR systems, the in-            generally divided into two main categories. First, some
put queries may be constrained by a user’s cognitive and         methods prompt LLMs for completing the recommendation
cultural competency, making it challenging to accurately         task in a zero-shot paradigm (i.e., without parameter tun-
express the real intent, and irrelevant content present in       ing) [808, 809]. A series of prompt engineering methods like
documents can also impact the relevance evaluation with          recency-focused and in-context learning are introduced to
the query. As a solution, LLM can be utilized to rewrite the     improve recommendation performance as well as alleviate
query for enhancing the understanding of the query intent        the potential model biases [810, 811]. Second, another cat-
and incorporating additional knowledge into the query            egory of studies aim to specialize LLMs for personalized
through well-designed instructions. The rewritten query          recommendation through instruction tuning [355, 812]. Spe-
can take the form of an improved version of the original         cially, high-quality instruction data is key to adapt LLMs
query [794], a document in the corpus that related to the        to the recommendation tasks, which can be constructed
query [795], or an expansion of the query that concatenated      based on user-item interactions with heuristic templates. To
with a pseudo generated document [796]. In addition, docu-       further improve the instruction diversity, InstructRec [355]
ments can also be expanded with queries that are generated       employs self-instruct technique to simulate large amounts of
based on the original documents using LLMs for context           potential user instructions in various scenarios like product
extension [797].                                                 search and personalized recommendations. In addition to
                                                                 representing each item by its text description, there is also
Remaining Issues. In this part, we further discuss several       growing attention on extending LLM’s vocabulary with
important issues to apply LLMs to improve IR systems.            semantic identifiers in recommender systems [813, 814], to
First, though LLMs are capable of being as general-purpose       incorporate collaborative semantics into LLMs.
task solvers, they are not directly well suited for existing
IR systems: they require high overhead for inference [777,       LLM-enhanced Recommendation Models. In addition to
785], have limitations in modeling long texts or document        instructing LLMs to directly provide recommendations, re-
lists [781], and need special adaptation (e.g., instruction      searchers also propose leveraging the universal knowledge
tuning) to perform the text ranking task [798]. Therefore,       encoded in LLMs to improve traditional recommender sys-
more systematic approaches to adapt LLMs for modern IR           tems. Existing approaches in this line can be divided into
systems should be investigated, to leverage their benefits       three main categories. The first category employs LLMs to
and meanwhile overcome these limitations. Secondly, the          infer users’ potential intention from their historical interac-
advent of LLMs sheds lights on the development of new            tion data. Furthermore, traditional recommendation/search
information seeking ways (e.g., New Bing). It is meaningful      models employ the inferred intentions to improve the re-
to explore how to reshape the architecture and paradigm          trieval of relevant items [815, 816]. Additionally, several
of IR by integrating the LLMs’ capacities and the merits         studies explore the use of LLMs as feature encoders. They
of existing IR systems [799]. Thirdly, existing work mainly      employ LLMs to encode the side information of items and
focuses on text retrieval tasks, lacking a comprehensive         users (e.g., item’s descriptions and user’s reviews), thus de-
consideration of multimodal information sources. As will         riving more informative representations of users and items.
                                                                                                                                     73

These representations are then fed into traditional recom-        extension of LLMs by enabling the information modeling
mender systems as augmented input [817, 818]. As an-              of non-textual modalities, especially the vision modality,
other alternative approach, several studies [819, 820] adopt      called multimodal large language models (MLLMs) [800]44 . To
a distillation-like way to transfer LLM’s capacities (e.g.,       start our discussion, we specify the input to be text-image
semantic encoding) to improve traditional recommenders            pairs and the output to be text responses. Similar discus-
(i.e., small models). Specially, they align the hidden states     sions can be made for other modalities, e.g., language-audio
of LLMs and traditional recommendation models via joint           models [828], which is beyond our scope here. In essence,
training. After training, since only the enhanced small           MLLMs are developed by adapting the information from
model will be deployed online, it can avoid the huge over-        other modalities to the text modality, so as to leverage the
head of LLMs in online service.                                   excellent model capacities of LLMs that are learned based on
                                                                  world text. Typically, a MLLM comprises an image encoder
LLM as Recommendation Simulator. Inspired by the recent
                                                                  for image encoding and a LLM for text generation, associ-
success of autonomous AI agents [821], LLMs have been
                                                                  ated by a connection module that aligns vision and language
also utilized to develop recommendation simulators [822,
                                                                  representations. During generation, the image is first split
823] (exemplified by RecAgent [822]), showing great po-
                                                                  into patches, and then transformed into patch embeddings
tential to simulate user real behaviors in recommender
                                                                  by the image encoder and the connection module, to derive
systems [822, 824, 825]. Specifically, to make personalized
                                                                  a visual representation that can be understood by the LLM.
simulation, an agent will be equipped with a profiling
                                                                  Subsequently, the patch embeddings and text embeddings
module that encompasses relevant identity information.
                                                                  are concatenated, and fed into the MLLM, allowing the
Then, a memory module is introduced to store agents’ past
                                                                  language model to generate the response autoregressively.
interaction experiences. During the process of simulation,
                                                                  In the following, we will discuss the training, evaluation,
agents are further prompted to conduct self-reflection based
                                                                  and key points to develop capable MLLMs.
on their past experiences, to capture their underlying user
preference. Most of existing recommendation simulators are        Training Process. The training process of the MLLM in-
conducted in a user-oriented way, without explicitly mod-         cludes two major stages: vision-language alignment pre-
eling the items in the interaction process. To address this,      training and visual instruction tuning.
AgentCF [824] models both users and items as agents, and              • Vision-language alignment pre-training. To develop
further facilitates collaborative reflections to simulate user-   MLLMs, existing work mostly initializes the vision encoder
item interactions, so as to capturing the two-sided relations     and the LLM with pre-trained models [154, 155, 829]. These
between users and items.                                          models retain excellent vision and language capacities, but
                                                                  span different semantic spaces. Thus, the goal of vision-
Remaining Issues. Despite these efforts, there are still
                                                                  language alignment pre-training (i.e., the first-stage training)
several challenges to address when applying LLMs in
                                                                  is to align the vision encoder and the LLM through end-to-
recommender systems. First, existing studies have shown
                                                                  end training on large-scale image-text pairs [830, 831]. How-
that LLM-based recommendation models in zero/few-shot
                                                                  ever, directly tuning these two models on image-text pairs
settings tend to perform worse than traditional ID-based
                                                                  may cause the degradation of the original representation ca-
recommenders [809, 810]. This indicates that LLMs might
                                                                  pacities. To improve the alignment performance, it is crucial
lack an understanding of personalized user behaviors and
                                                                  to design effective training strategies and select appropriate
domain-specific collaborative semantics. Although instruc-
                                                                  pre-training data [832, 833]. Existing work mainly employs
tion tuning alleviates this issue to some extent [355, 812],
                                                                  the following strategies for cross-modality alignment: (1) if
it can’t fully reduce the semantic gap between LLMs and
                                                                  the number of image-text pairs is not sufficiently large (e.g.,
recommender systems, and also suffers from high tuning
                                                                  less than 1M), it is often suggested to only update the
costs. Furthermore, recommender systems prioritize min-
                                                                  connection module [834]; (2) if the training data includes
imizing inference latency to enhance users’ experience in
                                                                  high-quality text corpora [835] or image-text pairs with
low-resourced environments (e.g., phones), which poses a
                                                                  fine-grained annotations [836], fine-tuning the LLM can be
challenge to LLMs’ inference speed as well as memory
                                                                  conducted to boost the performance; (3) if the number of
overhead. Therefore, it is important to explore improvement
                                                                  image-text pairs is very large (e.g., about 1B), fine-tuning
techniques, such as efficient tuning and quantization meth-
                                                                  the vision encoder is also plausible [832, 833], but the benefit
ods, to deploy LLMs efficiently and effectively in real-world
                                                                  remains further verification.
recommender systems. In addition, existing LLMs have
                                                                      • Visual instruction tuning. After vision-language pre-
limited capacities in long context modeling, make it difficult
                                                                  training, the second-stage training, i.e., visual instruction
to process the huge amount of user-item interaction data.
                                                                  tuning, aims to improve the instruction-following and task-
Improved context length extension and context information
                                                                  solving abilities of MLLMs. Generally, the input of vi-
utilization approaches should be developed to improve the
                                                                  sual instruction tuning consists of an image and a task
modeling capacities of LLMs in long interaction sequences.
                                                                  description, and the task is to generate a corresponding
8.1.4 Multimodal Large Language Model                             text output. To boost the performance, high-quality visual
In existing literature [826, 827], multimodal models mainly       instruction data is key to eliciting and enhancing the abil-
refer to the models that can process and integrate informa-
tion of various modalities (e.g., text, image, and audio) from      44. In existing work, large vision language models (LVLMs) [664] are
                                                                  also used to term such bimodal models that are developed based on
input, and further produce corresponding output in certain        LLMs. We use the naming of MLLMs in this part due to its wide use in
modalities. In this part, we mainly focus on the multimodal       existing literature.
                                                                                                                                  74

ities of MLLMs. Therefore, most studies are dedicated to            questions. Similarly, LLaVA [854] utilizes GPT-4 for eval-
constructing various visual instruction datasets. As the basic      uating MLLMs’ output, where GPT-4 takes the generated
approaches, early studies construct visual instructions by          image captions and object bounding boxes as visual inputs
distilling from GPT-4 [154] or reformulating vision-language        for assessment. Such open-ended evaluation methods can
task datasets [156]. To enhance the quality of instruction          improve assessment accuracy while incurring higher costs
data, recent work further proposes improved strategies by           due to the involvement of humans or LLMs.
increasing the instruction diversity [837], incorporating fine-         • Evaluation benchmarks. To facilitate a more thorough
grained information (e.g., coordinate of objects) into the          evaluation of MLLMs, various benchmarks have been devel-
instruction [836], or synthesizing complex visual reasoning         oped. Part of them collect existing vision-language tasks for
instructions [838].                                                 comprehensive evaluation. For instance, LVLM-eHub [855]
                                                                    aggregates 47 existing text-related visual tasks to assess
Evaluation of MLLM. After introducing the approaches to
                                                                    six distinct capabilities of MLLMs, and Reform-Eval [856]
developing MLLMs, we further discuss how to effectively
                                                                    takes this a step further by standardizing questions from
assess the multimodal capabilities of MLLMs from the fol-
                                                                    existing benchmarks into a uniform format and discusses
lowing three aspects.
                                                                    how the backbone models influence MLLMs’ performance.
    • Evaluation perspectives. The evaluation tasks for MLLMs       In addition to incorporating existing tasks, several work
can be categorized into two main types: perception and              also derives new questions annotated by humans or with
cognition tasks. Specifically, perception tasks aim to assess the   the help of LLMs. MME [842] creates a dataset by pair-
model’s abilities in understanding the basic semantics of the       ing images from public sources with manually-collected
image content, while cognition tasks evaluate models with           text instructions for perception and cognition evaluations.
more complex tasks that require reasoning based on per-             MMBench [841] transforms these instructions into multiple-
ception results. The perception ability is typically evaluated      choice questions and introduces CircularEval to ensure
through classification tasks about attributes of image (e.g.,       evaluation consistency. SEED-Bench [857] further considers
topic and style) and object (e.g., existence and color) or OCR-     temporal understanding tasks and enlarges the evaluation
related tasks, based on existing datasets or new datasets           scale to 19K multiple-choice questions with the assistance of
derived from existing images with annotations by humans             LLMs. MM-Vet [858] presents more complex tasks to assess
or LLMs [839–842]. A notable perception issue is hallucina-         the integrated multimodal capabilities of MLLMs. It starts
tion [843], where the model’s responses contain inconsistent        by defining six essential multimodal abilities and then cre-
content with the image. Among existing studies about hallu-         ates intricate questions by combining multiple abilities. In
cination in MLLMs [837, 844, 845], object hallucination [846]       summary, the above benchmarks collectively contribute to
has received much research attention. To conduct a stable,          the comprehensive evaluation and improved development
robust evaluation of object hallucination, POPE [847] pro-          of MLLMs.
poses a polling-based object probing approach for convert-
ing object recognition into a series of binary questions, and       Key Points for Improving MLLMs. To develop capable
the results indicate that current MLLMs often struggle with         MLLMs, we continue to discuss three key points to improve
object hallucination. Cognition tasks, on the other hand, re-       the model capacities, from the perspectives of instruction
quire MLLMs to perform reasoning based on image percep-             data, training strategy, and safety and alignment.
tion. A common reasoning task is visual question answering               • Visual instruction data. Extensive work [834, 859] has
(VQA), where models answer questions about images that              empirically found that both quantity and quality of visual
demand reasoning about spatial relationships [848], general         instructions have an important impact on model perfor-
knowledge [849], or scene text [850]. To fully explore the          mance of MLLMs. One basic way to construct visual in-
capabilities of MLLMs, HallusionBench [851] collects 200            structions is to leverage the exceptional capability of LLMs
sophisticated visual dependent or supplement questions, on          to synthesize instructions based on text descriptions of
which even the most advanced MLLMs like LLaVA-1.5 [834]             images [854]. To further enhance the quality of instructions,
and GPT-4V [133] fail to achieve good performance.                  one can construct fine-grained visual instructions with the
    • Evaluation paradigms. The responses of MLLMs can              help of human annotation [836, 860] or synthesize more
be evaluated either in a closed-ended or an open-ended              complex data through carefully-designed prompts [838].
manner. Traditional multimodal tasks often rely on a closed-        Despite the effectiveness of the above LLM-based ap-
ended evaluation framework, where the assessment is based           proaches, one primary question emerges as to whether a
on the exact match between the model’s response and the             LLM (i.e., text generation model without training on any
ground-truth answer. Examples include the VQA score [852]           images) possesses the ability to generate sufficiently good
for visual question answering tasks and the CIDEr [853]             visual instructions solely based on verbalized visual infor-
score for captioning tasks. However, MLLMs generate re-             mation (e.g., captions and coordinates). Specially, existing
sponses in an open-ended way, which may contain the                 work has also revealed that visual instructions generated
correct answer but not exactly match the ground-truth per-          by LLMs sometimes contain misinterpretations about the
fectly. This discrepancy can lead to the underestimation of         visual information, e.g., object hallucination [847]. Therefore,
the model’s performance in previous evaluation paradigms.           it is crucial to design effective verification methods to con-
To address this issue, recent approaches have incorporated          trol the quality of instruction data generated by LLMs [838].
humans or LLMs as evaluators [832]. For instance, MM-               Furthermore, it still needs more investigation about what
Bench [841] employs ChatGPT to align the model responses            makes good visual instructions and how visual instructions
with the most relevant option in a set of multiple-choice           elicit specific multimodal abilities in MLLMs.
                                                                                                                                              75

     • Model training. Different from LLMs, MLLMs are not                 tential to generate hallucinated content [604] and the lack of
trained from scratch, but instead developed based on pre-                 domain-specific knowledge [865]. As a promising solution,
trained language and vision models. Existing work em-                     knowledge graphs (KGs), which store enormous knowledge
ploys a typical two-stage approach for training MLLMs,                    in the triple format, i.e., ⟨ head entity, relation, tail entity ⟩, can
i.e., vision-language alignment pre-training and visual in-               be utilized to enhance the task performance of LLMs by pro-
struction tuning. In essence, existing MLLMs aim to (1) pre-              viding precise and necessary knowledge. Generally, knowl-
serve the inherent capabilities and parametric knowledge                  edge enhanced approaches can be expanded into other
of LLMs as possible, and meanwhile (2) effectively adapt                  forms of structured data (e.g., tables and databases) [864],
to multimodal tasks by leveraging the pre-trained LLMs                    while we limit our discussion to the integration of KG for
and visual encoders. To achieve the above two goals, two                  improving LLMs, which are detailed in two aspects, namely
typical training strategies are often employed for visual                 retrieval-augmented LLM and synergy-augmented LLM.
instruction tuning, either only optimizing the connection
module [156] or fine-tuning both the connector module                     Retrieval-Augmented LLM. Due to the huge amount of
and LLM component [854]. As we can see, the former                        fact records in a KG, existing work typically adopts a
can reserve the original capacities of LLMs but likely have               retrieval model to first obtain a relatively small subgraph
a weak an adaptation performance, while the latter can                    from KG, and then leverages it to enhance LLMs by en-
fully adapt to multimodal tasks but suffer from the loss of               riching the relevant knowledge. Before the advent of LLMs,
original capacities of LLMs. More efforts should be made to               the retrieved subgraphs are often supplemented into train-
investigate how to effectively balance the two aspects, so as             ing data, injecting knowledge information into PLMs via
to achieving improved multimodal capacities. In addition,                 parameter learning [866–868]. In contrast, to leverage the
existing MLLMs are still overly dependent on the capacities               retrieved knowledge, LLMs mainly incorporate it as part of
of LLMs, which pose the limits on many multimodal tasks                   the prompt, without parameter update. To implement this
(e.g., space positioning). It will be meaningful to explore               approach, there are two main technical problems, i.e., how
improved training approaches of language models, so that                  to retrieve relevant knowledge from KGs and how to make
multimodal information can be also utilized in this process.              better use of the structured data by LLMs. For the first issue
     • Safety and alignment. Safety and alignment has been                (i.e., retrieving relevant knowledge), a typical approach is
widely discussed in LLMs, which aim to regulate the behav-                to train a small language model (e.g., RoBERTa) to iden-
iors of models by technical approaches [66]. This topic is also           tify question-related fact triples [869]. To further improve
important to MLLMs. Even a highly advanced MLLM (e.g.,                    the retrieval performance, several studies also propose an
GPT-4V [133]) can be susceptible to safety issues. For exam-              iterative reading-then-reasoning framework, enabling the
ple, GPT-4V might occasionally exhibit factual inaccuracies               LLM to interact with the KG multiple times and acquire the
and baseless inferences about images. In some cases, it may               required knowledge in a more accurate way [451]. For the
even generate harmful content targeting specific individuals              second issue (i.e., utilizing retrieved knowledge), a straight-
or groups [133]. Furthermore, open-sourced MLLMs are                      forward approach is to serialize the retrieved subgraph
also prone to generate hallucinated response [847] and can                and craft specific prompts to include it as the input of
be easily manipulated to produce harmful content [861].                   LLMs [468, 653]. However, due to the loss of structured
To address the aforementioned issues, some studies collect                information in knowledge serialization, LLMs cannot fully
specialized visual instructions to mitigate the problem of                capture the structural semantics conveyed by original KGs.
hallucination [837]. Another alternative approach is to train             To address this issue, several model-based approaches train
a revision model to rectify hallucinated response generated               a specialized language model (e.g., T5) to transform the
by MLLMs in a post-hoc way [862]. Additionally, aligning                  subgraph into the natural language text [870]. To guarantee
MLLMs with RLHF can also assist MLLMs in generating                       the transformation accuracy, it relies on sufficient training
responses with improved factuality [863]. Despite these                   pairs (often unsupervised constructed) [871] and excellent
efforts, existing alignment techniques for MLLMs mainly                   model capability [872].
concentrate on several specific aspects (e.g., hallucination),
lacking a comprehensive consideration of alignment criteria.              Synergy-Augmented LLM. To solve complex tasks (e.g.,
More efforts should be made to promote the research of                    multi-hop question answering [658]), it often requires LLMs
safety and alignment for MLLMs. As a promising solution,                  to query a KG multiple times, following a systematic solu-
knowledge graphs (KGs), which store enormous knowledge                    tion plan. We call such a multi-turn interaction approach to
in the triple format, i.e., ⟨ head entity, relation, tail entity ⟩, can   enhancing LLM synergy-augmented LLM. To better synergize
be utilized to enhance the task performance of LLMs by pro-               the LLM and KG in a complementary manner, recent studies
viding precise and necessary knowledge. Generally, knowl-                 propose to decompose the complex task into multiple sub-
edge enhanced approaches can be expanded into other                       goals and iteratively solve each one by leveraging the nec-
forms of structured data (e.g., tables and databases) [864],              essary knowledge from KG [451, 873, 874]. In this process,
while we limit our discussion to the integration of KG for                the LLM can be regarded as an autonomous agent (detailed
improving LLMs, which are detailed in two aspects, namely                 in Section 9.2), which automatically generates the plan
retrieval-augmented LLM and synergy-augmented LLM.                        and executes it through interaction with the KG environ-
                                                                          ment [873]. Specially, the mainstream approaches typically
8.1.5 KG-Enhanced LLM                                                     start by enumerating the candidates using the available
Despite the excellent capacities, LLMs often suffer from                  knowledge information at the current step, and then retrieve
challenges on knowledge-intensive tasks, such as the po-                  the most appropriate candidates for the next step according
                                                                                                                              76

to the question [873, 874]. By iterating the above two steps,     candidate texts following specific guidelines [352, 649, 729],
LLMs can gradually collect relevant evidence [873, 874], and      which greatly simplifies the evaluation task. However,
finally approach the correct solution. Despite the effective-     it may face the inefficiency issue when scaling up the
ness, enumeration of the candidates over the KG would lead        number of candidates [729]. When high-quality reference
to a vast search space [875]. To address it, StructGPT [451]      texts are available during evaluation, LLMs can be in-
proposes a more efficient way to access knowledge infor-          structed to score texts under the guidance provided by ref-
mation using the specialized interfaces for KGs. Specifically,    erences [718, 729, 730]. On the other hand, language-based
it carefully designs the specialized interfaces according to      evaluation focuses on generating critiques and suggestions,
the common data operations on KG (e.g., relation extraction       offering qualitative explanation beyond simple quantitative
and triple extraction), to ensure efficient and accurate data     scoring [369, 377, 882, 883]. It is particularly useful for
extraction. In this way, LLMs can be instructed to better         gathering language feedback signals for human alignment
manipulate and process the structural information of KGs,         tuning [369, 882]. Furthermore, it can evolve into a multi-
thus achieving improved task performance.                         turn interaction framework, where LLM-based evaluators
                                                                  provide natural language feedback to existing solutions
Future Directions. Besides the above approaches, there
                                                                  from task solvers [884]. This framework evaluates the ability
are several promising directions for KG-enhanced LLM
                                                                  of LLMs to leverage language feedback for refining self-
remaining underexplored. First, due to the variety of struc-
                                                                  generated solutions.
tured data, it is still difficult for LLMs to directly leverage
various kinds of knowledge sources, e.g., domain-specific
                                                                  Evaluation Methods. A common method for LLM-based
KGs. Therefore, it is essential to explore the unified way
                                                                  evaluation involves prompting LLMs with specific instruc-
to manipulate and utilize different knowledge sources by
                                                                  tions. To further improve the quality of LLM-based eval-
LLMs. As a potential solution, it is promising to develop
                                                                  uation, recent work proposes to prompt LLMs with varied
effective approaches to help LLMs comprehend and make
                                                                  contexts to generate diverse evaluation feedback. These con-
use of the access interfaces provided by specific knowledge
                                                                  texts vary in aspects such as the candidate order [649, 729],
sources to acquire precise knowledge [451], while more ef-
                                                                  evaluation perspectives [885, 886] (e.g., relevance, clarity,
forts should be made to investigate how to adapt to the data
                                                                  originality), and evaluation explanation [649]. The gener-
variety in a cost-effective way. Second, with the evolution of
                                                                  ated multiple evaluation feedbacks are then aggregated to
real-world information, the knowledge stored in LLMs may
                                                                  produce a final evaluation result, which makes the evalua-
become outdated or incorrect. It is necessary to explore how
                                                                  tion process less prone to biases from individual feedback
to synchronize the updated knowledge into LLMs through
                                                                  and allows for a more thorough evaluation by covering
a cost-effective manner [876, 877]. Third, it is promising to
                                                                  a wider range of evaluation aspects. To further improve
investigate the use of factual information from KG to align
                                                                  the quality of the single-model evaluation, recent studies
LLMs in generating more faithful content [878, 879], which
                                                                  also develop multi-agent collaboration frameworks [886–
can help reduce the hallucination of LLMs.
                                                                  888] or fine-tune LLMs as specified evaluators [369, 377, 882,
    In addition to exploring KG-enhanced LLMs, it is also
                                                                  883, 889]. In a multi-model collaboration mode, different
meaningful to leverage LLMs to improve the tasks on the
                                                                  LLMs evaluate the candidates by engaging in discussions
KG side (i.e., LLM4KG) [865, 880]. A typical example is that
                                                                  to align preferences and reach a consensus [887, 888]. This
LLMs can help supplement or construct the KG. We omit
                                                                  method helps reduce the potential biases in individual
the discussion of this part, since it is beyond our scope.
                                                                  models through the consensus reached by multiple agents.
8.1.6 LLM for Evaluation                                          Another approach to improving single-model evaluation
                                                                  is to specialize LLMs as scores or critics through fine-
While human evaluation can generally offer reliable quality
                                                                  tuning [369, 377, 882, 883, 889]. This process involves creat-
assessment, it is also often hindered by high annotation
                                                                  ing datasets annotated with preferences and feedback from
costs, significant time requirements, and annotation incon-
                                                                  humans or proficient LLMs. These datasets are then used to
sistencies [881]. In contrast, automatic evaluation can be
                                                                  train evaluation-oriented models, enabling them to generate
employed as a scalable alternative to human evaluation.
                                                                  pairwise preference or language feedback. The specialized
Traditional automatic evaluations have relied on reference-
                                                                  LLM evaluators demonstrate competitive performance with
based metrics (e.g., BLEU and ROUGE). Recently, with
                                                                  fewer parameters [377, 883, 889].
the emergence of LLMs as general task solvers highlights
their potential as automatic evaluators [649, 729], making it
                                                                  Meta-Evaluation. To effectively assess the quality of
promising to conduct LLM based evaluation. In the follow-
                                                                  LLM-based evaluators, meta-evaluation benchmarks have
ing part, we will introduce the recent progress on LLM for
                                                                  been introduced, for gauging the agreement with human
evaluation, including evaluation formats, methods, meta-
                                                                  preferences and the fairness of the evaluations made by
evaluation, and the remaining issues.
                                                                  LLMs [649, 729, 886, 890, 891]. As a representative bench-
Evaluation Formats. Depending on the type of evaluation           mark, MT-Bench [729] evaluates the agreement between
outcome, the evaluation format can be categorized into            LLMs and human judgments, demonstrating that GPT-4
score-based evaluation and language-based evaluation. Score-      aligns closely with human preferences in no-tie compar-
based evaluation employs measurable metrics to assign             isons on 80 multi-turn questions. In addition, to address
quality scores (e.g., ratings or rankings) for evaluated texts.   potential biases arising from subjective human evaluations,
A prevalent way is to conduct pairwise comparison, where          LLMBar [890] manually designs outputs that are objectively
LLMs are used to determine the partial order relation of          worse but superficially appealing, which could mislead
                                                                                                                              77

evaluators. The evaluation results reveal that even the most     consistent answers across disciplines, balancing both depth
advanced LLMs still fall short of human-level evaluation in      and breadth. Another quantitative analysis [901] shows that
the challenging setting.                                         students utilizing ChatGPT (either keeping or refining the
                                                                 results from LLMs as their own answers) perform better
Remaining Issues. As discussed in Section 7.1.1, recent
                                                                 than average students in some courses from the computer
studies demonstrate that LLM-based evaluators expose
                                                                 security field. Recently, several perspective papers [903, 904]
multiple types of bias, such as order bias, self-preference
                                                                 also explore various application scenarios of LLMs in class-
bias, and length bias [649, 729]. Although some biases can
                                                                 room teaching, such as teacher-student collaboration, per-
be mitigated through methods like multi-path ensemble or
                                                                 sonalized learning, and assessment automation. However,
multi-agent collaboration, they remain inherent to LLM-
                                                                 the application of LLMs in education may lead to a series
based evaluators. Consequently, addressing these biases
                                                                 of practical issues, e.g., plagiarism, potential bias in AI-
intrinsically within the models continues to be an a chal-
                                                                 generated content, overreliance on LLMs, and inequitable
lenging issue. In addition, recent work has revealed that
                                                                 access for non-English speaking individuals [905].
LLMs may be incapable of understanding the self-generated
content, exhibiting a weaker understanding capacity com-         Law is a specialized domain that is built on professional
pared to their generation capabilities [892]. Even the most      domain knowledge. Recently, a number of studies have ap-
advanced LLMs still struggle identifying their reasoning or      plied LLMs to solve various legal tasks, e.g., legal document
factual errors without external feedback [893, 894]. Conse-      analysis [906], legal judgment prediction [907], and legal
quently, current LLM-based evaluators might not be ade-          document writing [908]. A recent study [909] has found
quate for evaluating top-tier LLMs or complex tasks. This        that LLMs exhibit powerful abilities of legal interpretation
underscores the importance of improvement approaches             and reasoning. Moreover, the latest GPT-4 model achieves
for LLM-based evaluators, especially for evaluating capable      a top 10% score in a simulated bar exam compared with
LLMs and complex tasks demanding sophisticated reason-           human test-takers [46]. To further improve the performance
ing, planning, and domain-specific knowledge.                    of LLMs in the law domain, specially designed legal prompt
                                                                 engineering are employed to yield advanced performance
8.2   LLM for Specific Domains                                   in long legal document comprehension and complex legal
In this part, we discuss the applications of LLMs on several     reasoning [910, 911]. To summarize the progress, LLMs can
representative domains, including healthcare, education,         act as helpful assistants to legal profession. Despite the
law, finance, and scientific research assistance.                progress, the use of LLMs in law raises concerns about
                                                                 legal challenges, including copyright issues [912], personal
Healthcare is a vital application field closely related to       information leakage [913], or bias and discrimination [914].
human life. Ever since the advent of ChatGPT, a number of
studies have applied ChatGPT or other LLMs to the medical        Finance is an important field where LLMs have promis-
domain. It has been shown that LLMs are capable of han-          ing application prospects. LLMs have been employed on
dling a variety of healthcare tasks, e.g., biology information   various finance related tasks, such as numerical claim
extraction [765], medical advice consultation [895], mental      detection [915], financial sentiment analysis [916], finan-
health analysis [896], and report simplification [897]. As       cial named entity recognition [917], and financial reason-
the major technical approach, researchers typically design       ing [918]. Despite the competitive zero-shot performance
specific prompts or instructions to guide LLMs to perform a      exhibited by general-purpose LLMs in the finance tasks,
wide range of medical tasks. To further harness the power        they still underperform domain-specific PLMs containing
of LLMs in the healthcare domain, researchers propose to         million-scale parameters [915]. To leverage the scaling effect
develop healthcare-related LLMs [354, 898, 899]. Specifically,   of LLMs, researchers collect large-scale finance corpora for
the Med-PaLM models [354, 898] achieves expert-level per-        continually pre-training LLMs (e.g., BloombergGPT [358],
formance on the United States Medical Licensing Exami-           XuanYuan 2.0 [919], and FinGPT [920]). BloombergGPT
nation (USMLE), and earns greater approval from physi-           has demonstrated remarkable performance across a diverse
cians in answering consumer’s medical questions. However,        range of financial tasks while maintaining competitive per-
LLMs may fabricate medical misinformation [897, 900],            formance in general-purpose tasks [358]. Nevertheless, it is
e.g., misinterpreting medical terms and suggesting advice        imperative to consider the potential risks in the application
inconsistent with medical guidelines. In addition, it would      of LLMs in finance, as the generation of inaccurate or
also raise privacy concerns to upload the health information     harmful content by LLMs could have significant adverse
of patients [765] into a commercial server that support the      implications for financial markets [358]. Therefore, it needs
LLM.                                                             more strict reviewing and monitoring on the use of LLMs in
                                                                 the financial field.
Education is also an important application domain where
LLMs potentially exert significant influence. Existing work      Scientific research is another promising field that LLMs
has found that LLMs can achieve student-level performance        can empower the development progress. Prior research
on standardized tests [46] in a variety of subjects of math-     demonstrates the effectiveness of LLMs in handling
ematics (e.g., physics, computer science) on both multiple-      knowledge-intensive scientific tasks (e.g., PubMedQA [921],
choice and free-response problems. In addition, empirical        BioASQ [922]), especially for LLMs that are trained on
studies have shown that LLMs can serve as writing or read-       scientific-related corpora [35, 218, 923]. Given the excel-
ing assistant for education [901, 902]. A recent study [902]     lent general abilities and broad scientific knowledge, LLMs
reveals that ChatGPT is capable of generating logically          hold significant potential as helpful assistants across var-
                                                                                                                                78

ious stages of the scientific research pipeline [924]. First,     9.1.1 Scaling Position Embeddings
during the literature survey stage, LLMs can help conduct         Transformer-based LLMs can learn effective position em-
a comprehensive overview of the progress in a specific            beddings within the maximum training length. When
research field [925, 926]. Second, during the research idea       adapting LLMs to language tasks beyond the maximum
generation stage, LLMs demonstrate the ability to generate        training length, it is necessary to scale to larger position
intriguing scientific hypotheses [927]. Third, during the data    indices. Specially, some position embedding methods have
analysis stage, LLMs can be employed to conduct automatic         been shown to possess a certain degree of ability to gener-
approaches to analyzing the data characteristics, includ-         alize to text beyond the training length, which is termed as
ing data exploration, visualization, and deriving analytical      extrapolation capability, including T5 bias [82], ALiBi [283],
conclusions [928, 929]. Fourth, during the paper writing          xPos [296] and even NoPE [939]. However, as one of the
stage, researchers can also benefit from the assistance of        mainstream position embedding methods, RoPE exhibits
LLMs in scientific writing [930, 931], in which LLMs can          limited extrapolation ability in empirical studies [259]. In
offer valuable support for scientific writing through diverse     the following, we discuss several methods that adapt RoPE
means, such as summarizing the existing content and pol-          to longer texts.
ishing the writing [932]. In addition, LLMs can aid in                • Direct model fine-tuning. To adapt LLMs to a long
the automated paper review process, encompassing tasks            context window, a straightforward approach is to directly
such as error detection, checklist verification, and candidate    fine-tune the models on long texts with the target length.
ranking [933]. Despite these advances, there is much room         The context extension can be scheduled with gradually
for improving the capacities of LLMs to serve as helpful,         increased lengths in a multi-stage manner (e.g., 2K → 8K
trustworthy scientific assistants, to both increase the quality   → 32K). To conduct effective extension, it often requires
of the generated scientific content and reduce the harmful        specially prepared long text data for training (Section 9.1.3),
hallucinations.                                                   and data quality plays a critical role in improving LLM’s
    Summary. In addition to the aforementioned work, the          long context capacities [940]. However, such a direct fine-
applications of LLMs have been also discussed in several          tuning approach tends to be inherently slow when adapting
other domains. For instance, in the psychologic domain,           LLMs for long texts [259].
some recent work has studied the human-like characteristics           • Position interpolation. This method downscales the po-
of LLMs, such as self-awareness, theory of mind (ToM), and        sition indices within the original context window, to avoid
affective computing [934, 935]. In particular, an empirical       out-of-distribution rotation angles during pre-training [259,
evaluation of ToM conducted on two classic false-belief           941]. Specifically, this approach multiplies all position in-
tasks speculates that LLMs may have ToM-like abilities            dices by a scaling coefficient L/L′ (L < L′ ), where L and
since the model in the GPT-3.5 series achieves comparable         L′ denote the original and target context window length,
performance with nine-year-old children in ToM task [934].        respectively. Experimental results [259] have shown that
In addition, another line of work has investigated applying       this method can extend the context window effectively and
LLMs into the software development domain, e.g., code             efficiently, compared to the above approach of direct model
suggestion [936], code summarization [937], and automated         fine-tuning. However, it is worth noting that this technique
program repair [938]. To summarize, to assist humans by           may have an adverse impact on the model’s performance
LLMs in real-world tasks has become a significant area of         when handling normal texts within the original context
research. However, it also presents challenges. Ensuring the      window [259, 942].
accuracy of LLM-generated content, addressing biases, and             • Position truncation. To mitigate the challenges posed
maintaining user privacy and data security are crucial con-       by out-of-distribution rotation angles, another practical ap-
siderations when applying LLMs to real-world scenarios.           proach is to truncate longer relative positions to satisfy the
                                                                  requirement of the maximum training length. ReRoPE and
9     A DVANCED TOPICS                                            LeakyReRoPE [943] introduce a pre-defined window length
                                                                  for truncation, which is smaller than the maximum training
In this section, we focus on discussing several advanced
                                                                  length. Specifically, position indices within this pre-defined
topics that have attracted extensive attention in the research
                                                                  window would be retained, while those indices beyond the
community, and these topics are related to challenging
                                                                  window are either truncated to the pre-defined window
technical issues that largely limit LLM’s capacity. Next, we
                                                                  length or interpolated to align with the maximum training
will introduce these issues and discuss how to address them
                                                                  length. This strategy can preserve the attention mechanism
with feasible approaches.
                                                                  with the neighbor tokens (within the window length), and
                                                                  further enhance the extrapolation capacity. However, this
9.1   Long Context Modeling                                       approach needs to compute the attention matrices twice,
In real-world application scenarios, there are increasing         accommodating additional computational costs.
demands for long context modeling capacities of LLMs,                 • Base modification. Since LLMs are usually trained with
especially for text file processing (e.g., information parsing,   a pre-set maximum training length, wavelengths in certain
extraction, and summarization). Many mainstream LLMs              dimensions of RoPE may exceed the training length for
have provided support for long context window. To enhance         longer text [295], on which language models may not be
the long context modeling abilities, there are generally two      sufficiently trained, i.e., training data can’t cover a complete
widely used approaches, namely scaling position embed-            rotation cycle. Thus, when processing long text, some ro-
dings and adapting context window. Next, we introduce the         tation angles for certain dimensions would never be seen
two approaches in detail.                                         in the training phase [351]. Formally, given a fixed rotation
                                                                                                                               79

angle t · θi , a smaller basis θi allows for a greater distance   of attention patterns in a Transformer [951], e.g.,the top-
t, i.e., enabling the modeling of longer texts [254, 295, 940].   k attention scores can well approximate the original full
According to the formula θi = b−2(i−1)/d in Equation 4,           attention. Therefore, a number of studies propose different
decreasing the basis can be achieved by increasing the            methods to select the most relevant tokens from token-level
value of the base. In addition, decreasing the base can also      or block-level memory units for generation. Token-level se-
help re-scale the wavelengths of all dimensions below the         lection methods store the past keys in external memory and
training length, while it often needs continual pre-training      utilize a k -NN search method to retrieve the k most relevant
to adapt the LLMs to long context windows [351]. A re-            tokens for generation [257, 951, 952]. For a decoder model,
cent study [351] has empirically compared these two base          it typically employs one certain layer to access these top-
modification methods, and shown that decreasing the base          k external tokens, while still adopting the normal context
demonstrates better extrapolation performance, while in-          window in the rest layers [257, 952]. Block-level selection
creasing the base performs better within the training length.     methods [953, 954] first segment the long sequence into
     • Basis truncation. Similar to the base modification, the    blocks with the same length and represent each block into
truncation of the basis also concentrates on dealing with         several key vectors for retrieval. Then, the most relevant
the singular dimensions with wavelengths exceeding the            blocks to the query as well as the neighbor and initial
training length [944]. According to the definition λi = 2π/θi     blocks will be selected for attention computations. Unlike
in Equation 5, the dimension with a large wavelength λi           token-level selection methods, block-level selection methods
has a small basis θi accordingly. Based on this observation,      typically retrieve different tokens with specific heads.
this approach first defines a basis range [a, c]. Given the
basis range, the value of basis is modified according to the      9.1.3 Long Text Data
following ways: (1) when θi ≥ c, the value is retained,           To further enhance the long context modeling capacity,
(2) when θi ≤ a, the value is set to zero, and (3) when           it typically requires continual pre-training with specially
a < θi < c, the value is truncated to a fixed small               curated long text data. Next, we discuss how to prepare the
value. Via basis truncation, the out-of-distribution rotation     long text data from the two aspects of quantity and quality.
angles can be avoided at larger position indices. However,            • Quantity effect. Different from the pre-training phase
this approach does not perform very well at long context          that requires vast amounts of data, a small amount of long-
tasks [944].                                                      text data for continual pre-training is sufficient for context
                                                                  window extension [259]. Several studies show that LLMs
9.1.2 Adapting Context Window                                     have obtained the capability of utilizing distant information
Since Transformer-based LLMs have limited context win-            via large-scale pre-training data, and thus it only needs
dows, they can not directly integrate or utilize the entire       to adapt for extended context windows during continual
information of the long sequences exceeding the context           pre-training [955]. Typically, it has shown that LLaMA-2-
window. To alleviate the limitation, several methods have         7B or LLaMA-2-13B can achieve a context window length
been proposed to adapt LLMs to long context, as discussed         of over 100K tokens and effective context utilization [955]
below.                                                            with the training on several billion tokens. However, the
    • Parallel context window. Inspired by fusion-in-             ability to handle short text of LLMs may be affected to some
decoder [945], parallel context window methods [424, 946]         extent [259].
adopt a divide-and-conquer strategy to process input text.            • Quality effect. In addition to the quantity, the quality
Specially, it divides the input text into multiple segments,      of long text data is essential to long context modeling for
each independently encoded with shared position embed-            LLMs. For instance, LongWanjuan [956] categorize long
dings. At the generation stage, the attention masks are           texts into holistic, aggregated, and chaotic long texts based
modified to make that subsequent tokens can access to             on three metrics, i.e., coherence, cohesion, and complexity,
previous tokens in each segment. Nevertheless, this method        and they show that removing chaotic data and keeping
cannot distinguish the order of different segments, resulting     coherent and cohesive data are useful to enhance the long
in a limited model capacity on certain tasks.                     text modeling capacities of LLMs. Further, up-sampling
    • Λ-shaped context window. Some prior work has revealed       cohesive data can lead to further improvement. In addition,
that LLMs tend to allocate greater attention weights to           when preparing long text data, data mixture should be
the starting and nearest tokens among all previous to-            carefully adjusted for avoiding large distribution drift with
kens [947, 948], and it potentially results in the “lost in the   the original pre-training data.
middle” phenomenon [949]. Based on this observation, LM-              In addition to the studies based on vanilla Transformer,
Infinite [950] and StreamingLLM [948] propose to employ           there are a surge of Transformer variants with efficient at-
a “Λ-shaped” attention mask, which selectively preserves          tentions and other efficient architectures, aiming to alleviate
the initial tokens and the nearest tokens that each query can     the high computational costs for modeling long texts. These
attend to and then discards any tokens beyond this scope.         studies are discussed in Section 4.2.1 and Section 4.2.2. Fur-
Experiments demonstrate that this method can facilitate           thermore, context compression and prompting techniques
extra-long text generation with a fixed memory [948]. How-        (e.g., iterative reasoning [957]) have also been proven to
ever, it may struggle to model the long-range dependency          be a viable strategy for handling long text tasks [957–960],
in the context window, since it cannot effectively utilize the    without the need of model adaption.
information from the discarded tokens [948].
    • Token selection. It has been shown that a relatively        9.2   LLM-empowered Agent
small subset of tokens can effectively capture the majority       The research on agents in AI aims to develop entities that
                                                                                                                                       80

can perceive the environment, make decisions, and take                     are assigned goals, they follow the above workflow to
actions to achieve specific goals [961]. However, traditional              accomplish tasks through multi-turn interactions with the
agents are often limited to heuristic rules or specific environ-           environment.
ments, which constrain their generalization to open-domain                     To summarize, in an LLM-based agent, the LLM serves
scenarios [962]. Given that LLMs possess excellent capacities              as the core computation unit and is equipped with compo-
in solving complex tasks, they have rapidly emerged as                     nents including memory, planning, and execution. These com-
promising solutions for serving as the core computation                    ponents are integrated in a systematic way under the control
unit of agents [821]. In this part, we will first introduce                of the LLM during interactions with the environment. For
the framework for LLM-based agents, then explore their                     more details, the readers might refer to the comprehensive
applications, and finally discuss the future directions.                   survey for LLM-based AI agents [821].

9.2.1   Overall Framework.                                                 9.2.2 Applications
Next, we first detail the key components of an LLM-based                   Recently, LLM-based agents have shown great potential in
agent and then present the typical workflow.                               autonomously solving complex tasks, making it feasible to
                                                                           rapidly develop capable applications for specific domains
Components. Typically, there are three main components                     or tasks. In this section, we will discuss the applications in
in an LLM-based agent: memory, planning45 , and execution.                 single-agent and multi-agent scenarios.
Specifically, the memory component aims to store the in-
formation perceived from the environment and can be                        Single-agent based Applications. Applications based on
utilized to support decision-making. In particular, LLM-                   a single-agent mode mainly aim to develop capable task
based agents usually maintain information in both short-                   solvers that can autonomously complete user requests. A
term memory and long-term memory with the operations                       large number of single-agent projects have been developed,
of reading and writing. Short-term memory usually refers                   which focus on general-purpose task solving. As a rep-
to the internal context window of LLMs (i.e., input), where                resentative project, AutoGPT [536] empowers LLMs with
LLMs can read and write through actions like reason-                       long/short-term memory management and external tools
ing [963]. While long-term memory can be mapped to the                     like search engines. In order to autonomously address a
external storage like vector databases [539], where LLMs                   user request, AutoGPT understands the request with knowl-
can read through retrieval and write with reflection [688].                edge from its memory and actions like reasoning, decom-
Specially, profiles are usually implemented with long-term                 poses it into a detailed plan, executes the plan step-by-
memory, which is an important feature for an agent that                    step with the assistance of tools, and refines the rest plan
specifies its role and function [821]. The planning component              based on feedback from the environment. Such an iterative
is responsible for generating the action plan based on the in-             process continues until the user request is successfully re-
formation from the memory component. In data format, the                   solved. Other similar projects include GPT-Engineer [964]
plan usually takes the form of text-based instructions [434]               and XAgent [965]. In addition, there is also some work that
or code-based programs [436]. To generate it, LLM-based                    aims to develop autonomous agents for specific domains,
agents will first propose several candidates and then select               such as WebGPT [81] for the web-browsing environment,
a more suitable one among them [429]. The initial plan                     ProgPrompt [532] for the real-life environment, and Voy-
can be further refined with execution feedback from the                    ager [699] for the Minecraft environment.
environment [530]. The execution component is in charge                    Multi-agent based Applications. Different from single-
of carrying out the plan from the planning component,                      agent systems where agents work independently, multi-
which can be fulfilled by the internal LLM [434] or external               agent systems work in collaboration to unleash collective
tools [963].                                                               intelligence. Typically, multiple agents can be instantiated
Workflow. With the three components mentioned above, a                     from the same or different LLMs, each with their respective
typical workflow of an LLM-based agent is as follows. First,               roles and functions. According to the coordinating strategies
it receives information from the environment and writes                    among these agents, multi-agent systems can be divided
it into short-term memory. Then, the agent processes the                   into two categories: cooperation-based and competition-
newly received information in the short-term memory. Such                  based. In the cooperation-based mode, to share informa-
a process can be enhanced with information retrieved from                  tion and seek collaborative actions among agents, various
long-term memory. Subsequently, the planning component                     communication protocols have been proposed, including
utilizes the processed information from short-term memory                  free-form dialogue [966], structured document [967], and
to generate the next plan. Finally, the execution component                data embedding [968]. Based on the communication pro-
carries out the plan generated from the planning compo-                    tocol, agents can be effectively organized for downstream
nent, which can be further assisted with external tools.                   applications, such as software engineering [967], user be-
By repeating the aforementioned process, the LLM-based                     havior analysis [822, 824], and society simulation [535].
agent can autonomously adjust its behavior in response                     As a representative project, LangChain46 is a framework
to feedback from the environment and ultimately achieve                    for developing multi-agent based applications powered by
its goal. Once LLM-based agents receive user requests or                   LLMs. It enables users to deploy different roles of LLM-
                                                                           based agents and utilize them to solve tasks via working in
  45. Section 6.4 introduces planning as a utilization approach for        collaboration. In addition, other similar frameworks, such
LLMs, while in this section, we describe its utilization as a functional
component in LLM-based agents.                                              46. https://www.langchain.com/
                                                                                                                             81

as AgentVerse [969] and AutoGen [970], can also be utilized      Robustness and Trustworthiness. The deployment of LLM-
for developing multi-agent collaborative systems. In the         based agent systems necessitates robustness and trustwor-
competition-based mode, debate serves as one of the pop-         thiness [973]. The system should be resilient against adver-
ular communication protocols to foster divergent thinking        sarial inputs from various modalities such as text, image,
and elicit valuable external feedback among agents. Such a       or audio. Incorporating existing techniques like adversarial
way is beneficial for domains that demand precise decision-      training, data augmentation, and sample detection to in-
making and accurate responses, such as mathematical rea-         crease sensitivity to aggressive information in the input can
soning [971] and evaluation [734].                               fortify the system’s security. Concurrently, it is challenging
                                                                 to ensure the credibility of LLM-based agents given the se-
9.2.3 Discussion                                                 vere hallucination issues inherently rooted in LLMs. While
Despite the huge success, there still remain several technical   existing methods such as constrained decoding during infer-
challenges that limit the development and application of         ence and external knowledge integration can mitigate these
LLM-based agents. In this part, we discuss the remaining         issues to some extent [974], further exploration of efficient
challenges from the perspective of computational burden,         and effective alignment methods is necessary to develop
human alignment, complex capability extension, and ro-           reliable agent systems.
bustness.
                                                                 9.3   Analysis and Optimization for Model Training
Computational Costs. With the ever-increasing capabilities
of LLMs [821], their performance on agent applications           In Section 4.3, we have introduced basic techniques for
demonstrate promising performance. However, it also in-          training LLMs. As the scale of model parameters and data
troduces significant issues in terms of efficiency due to        continues to expand, efficiently training larger models with
the high computational demands and intricate interaction         limited computational resources has become a critical tech-
mechanisms involved. Furthermore, in multi-agent systems         nical challenge in the development of LLMs. This challenge
with numerous LLM instances, as the number of agents in-         primarily encompasses two technical issues: firstly, how
creases, this issue would be more severe, since the commu-       to optimize memory usage when loading and processing
nication network within multi-agent systems also becomes         models across GPU clusters, and secondly, how to maintain
increasingly complex. Therefore, more effective and efficient    or improve training efficiency as models scale. Next, we
communication protocols and architectures are essential          will conduct quantitative analyses and introduce advanced
to support the heightened coordination demands among             training techniques addressing the two aforementioned is-
agents.                                                          sues.

Alignment with Human Sociality. LLM-based agents can             9.3.1 Estimation of Training Memory Consumption
be conceptualized as individual entities, with the emergence
                                                                 In this part, we will first estimate the GPU memory con-
of sociability resulting from the interaction among these
                                                                 sumption for training LLMs.
agents. Autonomous agents often assume specific roles such
as coders or researchers, making role-playing a vital capa-      Model States Cost. Model states often occupy the majority
bility for agents to solve downstream tasks [972]. However,      of memory during training, typically consisting of model
LLMs, typically trained on web corpora, face difficulties in     parameters, gradients, and optimizer states. As introduced
accurately mimicking roles that are infrequently discussed       in Section 4.3.2, mixed precision training has been widely
online or are emergent. They also lack self-awareness in         utilized in LLM training. For a model containing P param-
conversational scenarios due to inadequate modeling of hu-       eters, both the model parameters and their gradients are
man cognitive psychology. Thus, it is imperative to develop      typically stored as 16-bit floating-point numbers, requiring
improved agent technique, including both training methods        a total storage of 4P bytes (2P for the parameters and 2P for
and architectures, to better align LLMs with human prefer-       the gradients). When using optimizers such as Adam [318]
ences and enhance their role-playing abilities.                  or AdamW [975], an additional set of 32-bit floating-point
                                                                 numbers are needed to store the optimizer states, including
Capability Extension. LLM-based agents, similar to hu-
                                                                 the copy of model parameters, gradient momenta, and
mans, require advanced capabilities (e.g., tool learning) to
                                                                 gradient variances, which leads to a total storage of 12P
fulfill complex functions or tasks, which might be beyond
                                                                 bytes (4P each for each of these components). Consequently,
their capacity scope. To address this issue, tool use has
                                                                 the total memory required for storing the model states
become a widely-used approach to enhancing LLMs’ capac-
                                                                 during training is 16P bytes. For instance, training LLaMA-
ities in various complex tasks. For example, when answer-
                                                                 7B (P ≈ 6.7 × 109 ) requires around 100GB memory to store
ing informative user questions, they use search engines to
                                                                 the model states alone.
retrieve information from the internet. However, the quality
and quantity of existing available tools impose limitations      Activations Cost. Activations are the intermediate states
on their accessibility and comprehensiveness. And it would       that require to be stored in the forward pass for gradient
become more difficult for LLM-based agents to use such           computation during backpropagation. For example, for a
                                                                                                                        ∂Y
limited tools when interacting with dynamic and changing         binary operation Y = W X , calculating the gradient ∂W
environments. In addition, as the scale of tools expands,        necessitates the input X , which should be preserved dur-
the compatibility and extensibility between the agents and       ing the forward pass. In Table 18, we list the estimation
tools must be further improved to facilitate complex task        of the activation memory consumption for different com-
resolution.                                                      ponents within the Transformer model. Take LLaMA-7B
                                                                                                                                 82

(V = 32, 000, L = 32, H = 4, 096, H ′ = 11, 008, N = 32) as        used to optimize memory usage during backpropagation.
an example, it would take 16GB memory to store activations         Specifically, the activations need to be retained during the
per device under the setting B = 1, T = 2, 048.                    forward pass. However, storing all activation values for each
                                                                   layer requires a significant amount of memory resources
TABLE 18: The activation memory consumption of each                (detailed in Table 18). To reduce the memory cost, gradient
computation within the LLaMA model based on research               checkpointing retains only a subset of the activations during
work [976]. We denote batch size by B , sequence length by         the forward pass and recomputes these values during the
T , the vocabulary size by V , the number of head in the           backward pass to save memory, albeit with additional com-
attention module by N , the dimension of each head by D,           putational overhead. In implementation, gradient check-
the hidden size by H (H = N D), and the intermediate               pointing typically involves storing the input of each Trans-
size inside FFN by H ′ . Equations ➀-➈ are layer-wise and          former layer and recomputing the corresponding activation
need to be multiplied by the number of the layers L when           values during backpropagation.
computing the total consumption.
                                                                   ZeRO. Zero redundancy optimizer (ZeRO) [977] technique,
 Equations                   Activation consumption                proposed by the DeepSpeed library, focuses on alleviating
                                                                   the issue of memory redundancy in data parallelism. As
 ➀ Q, K, V = XW Q,K,V        store X with size 2BT H
 ➁ Q, K = RoPE(Q, K)         store Q and K with size 4BT H         mentioned in Section 4.3.2, data parallelism requires each
 ➂ O = Attn (Q, K, V )       store Q, K , and V with size          GPU to store the same copy of the model states, resulting
                             6T H and results of softmax           in a memory consumption of 16P bytes per GPU. A direct
                             with size 2BT 2 N                     side effect of data parallelism is that it memory redundancy
 ➃ X = OW O                  store O with size 2BT H               issues, since not all of the above data is necessary to be
 ➄ X = Add&Norm(X)           store X with size 2BT H               retained on each GPU. To resolve it, the ZeRO technique
 ➅ G, U = X[W G , W U ]      store X with size 2BT H               aims to retain only a fraction of data on each GPU, while the
 ➆ D = Swish(G) · U          store G and U with size 4BT H ′
                                                                   rest data can be obtained from other GPUs when required.
 ➇ X = DW D                  store D with size 2BT H ′
                                                                   Specifically, ZeRO provides three strategies, depending on
 ➈ X = Add&Norm(X)           store X with size 2BT H
 ➉ CE(softmax(XW L ))        store X with size 2BT H and re-       how the three parts of the data are stored, namely optimizer
                             sults of softmax with size 4BT V      state partitioning (ZeRO-1), gradient partitioning (ZeRO-
                                                                   2), and parameter partitioning (ZeRO-3). Empirical results
                                                                   indicate that the first two strategies do not increase the
Other Memory Cost. In addition to the main factors af-             communication overhead, and the third solution increases
fecting GPU memory consumption discussed above, the                about 50% communication overhead but saves memory
memory usage also includes the following aspects:                  proportional to the number of GPUs. PyTorch has imple-
    • Deep learning frameworks. The PyTorch framework re-          mented a similar technique as ZeRO, called fully sharded
quires approximately 1GB of GPU memory when loading                data parallel (FSDP) [330].
its core functions. This is the essential overhead for the
framework to operate.                                              Offload. In GPU-limited environments, DeepSpeed has pro-
    • Distributed frameworks. When utilizing distributed           posed the offload technique [978], which can significantly
training frameworks (e.g., DeepSpeed), its GPU memory              reduce the GPU memory required for training by offloading
usage can fluctuate between 1GB and 4GB. The exact                 part of the model states and computational overhead to CPU
amount depends on the level of optimization and the hyper-         memory. Specifically, gradients and optimizer states would
parameter settings. This portion of the memory is primarily        be offloaded to CPU memory, with only the model param-
used to optimize memory management and communication               eters kept on GPU. The computationally intensive forward
efficiency during the training process.                            and backward propagation still need to be performed on
    • Intermediate results and memory fragmentation. Besides       GPU to ensure efficiency, while parameter update, which
the activations, there also exist intermediate results that will   requires relatively fewer computations, are executed on
affect the peak memory consumption. Take the computation           CPU to reduce GPU memory overhead. Furthermore, In-
of the softmax function in Equation ➉ as an example,               finity [979] allows training models that exceed the GPU
the implementation of the Transformers library requires an         memory limits by utilizing high-speed disk storage (e.g.,
additional 8BT V bytes of memory, as it needs to store two         NVMe).
additional copies of the 32-bit input (4BT V bytes each).
Moreover, during the training process, memory fragmenta-           9.3.3 Efficiency Optimization Methods
tion occurs due to the non-contiguous allocation and release       In addition to memory-saving techniques, it is also crucial to
of memory, typically leading to an additional 0.5GB to 1GB         maintain computational throughput as the model scales. In
of memory consumption.                                             what follows, we will describe two representative efficiency
                                                                   optimization methods.
9.3.2 Memory Optimization Methods
                                                                   FlashAttention. FlashAttention [303, 980] is an optimization
Based on the aforementioned analysis, we will next intro-
                                                                   method for the attention mechanism that significantly re-
duce several typical methods for optimizing the memory
                                                                   duces the memory transfer during attention computation.
usage for training LLMs.
                                                                   The core idea is to minimize the storage of intermediate
Gradient Checkpointing. Gradient checkpointing [329],              results and directly obtain the final result. According
                                                                                                                    ⊺
                                                                                                                             to the
also known as activation recomputation, is a technique             attention computation equation softmax( QK    √
                                                                                                                   D
                                                                                                                      )V , multiple
                                                                                                                                83

intermediate results, such as QK ⊺ and the attention score        which is measured in FLOP/byte. For example, the half-
matrix, need to be explicitly retained, leading to numerous       precision compute and bandwidth of the A100 GPU are 312
memory read-write operations. FlashAttention uses spe-            TFLOP/s and 2039GB/s, respectively. Correspondingly, its
cially designed methods, such as matrix partition and opera-      maximum arithmetic intensity is 142.51 FLOP/byte47 .
tor fusion, to keep intermediate results in the cache until the       • Model efficiency metrics. Similarly, each operation (e.g.,
final result is obtained, thus reducing the amount of mem-        matrix multiplication) of the model can be measured by
ory read and write operations. Additionally, FlashAttention       two corresponding metrics: the computation amount and the
can effectively reduce the peak memory usage and activa-          data transfer amount. The former refers to the total number
tion memory consumption (Section 9.3) during the LLM              of floating-point operations, measured in FLOPs. The latter
training and inference. By using FlashAttention, LLaMA-           refers to the total amount of GPU memory read and write
2 (7B) with a sequence length of 2,048 and a batch size of 8      operations, measured in bytes. Analogous to the arithmetic
requires only one-tenth of the computation time compared          intensity of a GPU, the arithmetic intensity I of a model oper-
to the standard method.                                           ation (e.g., matrix multiplication) can be defined as the ratio
                                                                  of computation to data transfer, with units of FLOP/byte.
Sequence Parallelism. Compared with the 3D parallelism                When the model’s arithmetic intensity I is less than the
introduced in Section 4.3, sequence parallelism can be            GPU’s maximum arithmetic intensity Imax , it indicates that
considered a fourth parallelism dimension in pre-training,        the maximum memory bandwidth of the GPU is lower than
particularly when handling long data sequences. The core          the speed required. Consequently, the model’s efficiency
idea is to partition the sequence across multiple devices         will primarily be limited by memory bandwidth, and the
for parallel computation. The primary challenge lies in           operation is called memory-bound. Conversely, when I ex-
minimizing communication across the devices during atten-         ceeds Imax , it suggests that the GPU’s maximum floating-
tion computation. DeepSpeed-Ulysses [981] partitions the          point operation speed is lower than the speed required. In
sequence along the hidden dimension, allowing each device         this case, the model’s efficiency will mainly be constrained
to receive a subset of the attention heads and compute            by the GPU’s compute capability, and the operation is called
attention for different heads in parallel. In comparison, Ring    compute-bound.
Attention [982] partitions the sequence along the length
dimension, where the query matrices on each device are in         Bottleneck Analysis. Based on the above analysis, we can
turn computed with the key and value matrices on other            obtain the arithmetic intensity for each operation during
devices. Furthermore, Ring Attention is also compatible           both the prefill and decoding stages, as shown in Tables 19
with FlashAttention and can be considered as its distributed      and 20, thereby better identifying the bottleneck operations
extension.                                                        in the inference process.
                                                                      • Prefill stage. In the following analysis, we will still
                                                                  take the LLaMA (7B) model in Table 18 as an example
9.4   Analysis and Optimization for Model Inference
                                                                  (N = 32, D = 128, H = 4096) and assume a batch size of
In Section 4.2.4, we have introduced the basic decoding           8 and a sequence length of 1024 (i.e., B = 8, T = 1024).
strategies for using LLMs. As inference efficiency is criti-      Substituting these values into Table 19, we can find that
cally important for the application of LLMs, we next will         the arithmetic intensity for linear transformations (Equa-
quantitatively analyze the efficiency of the inference process    tions ➀➃➅➇) is approximately 2730.67, for multi-head at-
and also present corresponding optimization methods.              tention (Equation ➂) it is approximately 114.67, while the
                                                                  intensity for other operations (Equations ➁➄➆➉) is around
9.4.1 Analysis of Inference Efficiency                            1. When using an A100 (80G) GPU with Imax = 142.51,
Overall, the inference process of LLMs can be divided into        the arithmetic intensities of the linear transformations and
two stages for overhead analysis: (1) the prefill stage, which    multi-head attention operations are all above or close to the
computes the states and caches the key-value tensors for the      maximum value. Given that these operations occupy the
input sequence; and (2) the decoding stage, which computes        majority of the computations during the prefill stage, we
the states of the newly generated tokens, updates the key-        can conclude that prefill stage is actually compute-bound.
value cache (KV cache, and continuously generate tokens               • Decoding stage. Similarly, substituting these values into
in an auto-regressive way until the generation process is         the arithmetic intensity formulas in Table 20 for the decod-
complete [984].                                                   ing stage reveals that the arithmetic intensities of the lin-
                                                                  ear transformations and multi-head attention are all below
Inference Efficiency Measurement. To quantitatively an-
                                                                  8, which is much lower than the A100 GPU’s maximum
alyze the inference efficiency, we next will introduce two
                                                                  intensity 142.51. This indicates that the decoding stage is
widely-used metrics for measuring inference efficiency.
                                                                  constrained by the GPU’s data transfer speed (i.e., memory-
    • GPU performance metrics. First, we introduce the com-
                                                                  bound), a problem commonly referred to as the memory wall.
pute capability and memory bandwidth to evaluate the effi-
                                                                  The analysis indicates that inefficiencies in LLM inference
ciency of a certain GPU. The compute capability of a GPU
                                                                  primarily occur during the decoding stage.
refers to the number of floating-point operations (FLOP)
that it can perform per second, measured in FLOP/s. The           9.4.2 System-level Optimization
bandwidth of a GPU refers to the amount of memory read            To mitigate the memory wall issue, an intuitive idea is
and write operations it can perform per second, measured in       to reduce the data transfer operations as possible, thereby
byte/s. The ratio of compute to bandwidth is known as the
maximum arithmetic intensity of the GPU, denoted as Imax ,          47. https://www.nvidia.com/en-us/data-center/a100/
                                                                                                                                   84

TABLE 19: The computation, data transfer, and arithmetic intensity during the prefill stage. We use the asymptotic notation
O to denote the complexity of data transfer amount, where the constant factor of the complexity is related to the specific
implementation method. Table source: [983].

              Equations                   Computation              Data transfer                    Arithmetic intensity
                                                                                                              
              ➀ Q, K, V = XW Q,K,V        6BT H 2                  O(BT H + H 2 )                   O 1 +1 1
                                                                                                        H BT
              ➁ Q, K = RoPE(Q, K)         6BT H                    O(BT H)                          O(1)
                                                                                                            1
                                                                                                           1+ D
                                               2            2
              ➂ O = Attn(Q, K, V )        4BT N D + 4BT N          O(BT 2 N + BT N D)               O      1       1
                                                                                                         D+T 
                           O                       2                                2
              ➃ X = OW                    2BT H                    O(BT H + H )                     O 1 +1 1
                                                                                                      H BT
                                                                                                               1
              ➄ X = Add&Norm(X)           5BT H                    O(BT H + H)                      O              1
                                                                                                        1+ BT                 
              ➅ G, U = X[W G , W U ]      4BT HH ′                 O(BT H + BT H ′ + HH ′ )         O      1 + 1 + 1
                                                                                                                       1
                                                                                                           H  H′  BT
              ➆ D = Swish(G) · U          2BT H ′                  O(BT H ′ )                       O(1)
                                                                                                                
                           D                           ′                                ′       ′
              ➇ X = DW                    2BT HH                   O(BT H + BT H + HH )             O 1 + 11 + 1
                                                                                                      H H ′ BT
              ➈ X = Add&Norm(X)           5BT H                    O(BT H + H)                      O 1+1 1
                                                                                                               BT



    TABLE 20: The computation, data transfer, and arithmetic intensity during the decoding stage. Table source: [983].

              Equations                  Computation            Data transfer                       Arithmetic intensity
                                                                                                             
              ➀ q, k, v = XW QKV         6BH 2                  O(BH + H 2 )                        O 1+  1
                                                                                                            1
                                                                                                        H   B
              ➁ q, k = RoPE(q, k)        6BH                    O(BH)                               O(1)
              ➂ K, V = Cache(k, v)       -                      O(BT N D) or O(BN D)                -          1
                                                                                                              1+ D
              ➃ o = Attn(q, K, V )       4BT N D + 4BT N        O(BT N + BT N D + BN D)             O        1   1
                                                                                                         1+ D +T
              ➄ X = oW O                 2BH 2                  O(BH + H 2 )                        O     1
                                                                                                               1
                                                                                                                   1
                                                                                                         H + B
              ➅ X = Add&Norm(X)          5BH                    O(BH + H)                           O 1+1 1
                                                                                                      B                   
              ➆ g, u = X[W G , W U ]     4BHH ′                 O(BH + BH ′ + HH ′ )                O     1 + 1 + 1
                                                                                                                   1
                                                                                                          H  H′   B
              ➇ d = Swish(g) · u         2BH ′                  O(BH ′ )                            O(1)
                                                                                                                
                          D                        ′                            ′           ′
              ➈ X = dW                   2BHH                   O(BH + BH + HH )                    O 1 + 11 + 1
                                                                                                      H H′ B
              ➉ X = Add&Norm(X)          5BH                    O(BH + H)                           O 1+1 1
                                                                                                               B




enhancing the arithmetic intensity. In this part, we will intro-   cate new GPU memory for each concatenation, copying the
duce several system-level optimization methods to achieve          original KV cache and the new hidden states into the newly
the reduction in data transfer.                                    allocated memory. This process leads to repeated memory
                                                                   read-write operations and substantial memory fragmenta-
FlashAttention and Flash-Decoding. The FlashAttention              tion. PagedAttention addresses this issue by introducing
method discussed in Section 9.3.3 can also be applied at           a memory paging management method, preallocating sev-
the prefill stage, as it reduces data transfer operations and      eral blocks of memory for future KV caches, which can
effectively increases arithmetic intensity. However, this op-      largely reduce the memory allocation operations during
timization technique is not directly applicable during the         concatenation. Additionally, PagedAttention optimizes the
decoding stage, where only the current query vector needs          attention computation by increasing the parallelism. It uses
to be computed with the KV cache matrices. To further              operator fusion to parallelize the computation of the query
optimize the decoding process, Flash-Decoding [985] has            vector with multiple KV cache chunk, thereby enhancing
been proposed based on FlashAttention, particularly for            the computational efficiency.
long sequences, which shares a similar idea with sequence
parallelism. Specifically, Flash-Decoding splits the KV cache      Batch Management Optimization. Batch management op-
into smaller chunks, allowing the computation of the query         timization aims to increase the batch size during the decod-
vector with these chunks in parallel, thereby improving the        ing stage to enhance arithmetic intensity. A representative
decoding efficiency.                                               method is continuous batching, proposed by vLLM [304].
                                                                   Unlike traditional fixed-length batch processing, this tech-
PagedAttention. PagedAttention [304] focuses on optimiz-           nique breaks down each request into a prefill iteration
ing KV cache and attention computation, significantly re-          and several single-step decoding iterations, and continu-
ducing data transfer operations in these two aspects. In KV        ous batching further employ heuristic algorithms to select
cache concatenation, traditional methods often need to allo-       requests for prefill or single-step decoding iteration. This
                                                                                                                               85

fine-grained batching mechanism allows for handling more          of this method still largely lags behind autoregressive meth-
requests simultaneously, which is has the same effect as in-      ods. To improve the quality of the generated text, several
creasing the batch size. Furthermore, DeepSpeed-MII [986]         studies attempt to combine both decoding methods, propos-
introduces Dynamic SplitFuse, which splits the prefill stage      ing semi-autoregressive decoding methods [994] that gener-
into multiple iterations and allows simultaneous prefill and      ate a group of tokens (e.g., 3 to 10 tokens) at each step and
decoding in one computation, resulting in larger batches          use these tokens as input to generate the next group. How-
and higher inference throughput.                                  ever, existing mainstream LLMs are pre-trained to predict
                                                                  the next token, making direct non- or semi-autoregressive
9.4.3 Algorithm-level Optimization                                generation infeasible. To address this, Medusa [995] trains
In addition to system-level optimization methods, existing        two additional prediction heads on the Vicuna model to
research work has proposed a series of improvements for           predict the second and third tokens respectively, thereby
autoregressive inference algorithms aimed at enhancing in-        achieving the generation of three tokens simultaneously.
ference efficiency. This part introduces four typical inference   However, due to the decreased generation quality, these
optimization algorithms.                                          methods have been rarely used directly in practice, but are
                                                                  more often combined with other methods (e.g., speculative
Speculative Decoding. Intuitively, the generation steps in        decoding) to accelerate the inference process of LLMs. For
language modeling have varied difficulty levels. For exam-        instance, after Medusa generates three tokens in parallel, the
ple, predicting the next word of “The founder of Microsoft        original Vicuna model would still be employed to verify the
is” may be more challenging than predicting the next word         generation quality.
of “The founder of Microsoft is Bill”. Even a small model
may successfully predict the answer in this case. Based on        Early Exit. It has been found that in multi-layer Transformer
this idea, speculative decoding [987, 988] has been proposed      models, it may not be necessary to perform the computation
to accelerate the inference speed. Specifically, it employs a     through all layers to reliably predict the next token [996].
relatively smaller yet more efficient model (such as an n-        Based on this idea, several studies [996, 997] have proposed
gram statistical model or a small pre-trained model) to au-       improved generation methods based on early exit. During
toregressively generate several tokens. Then, a larger model      model decoding, when the conditions for early exit are
then verifies these tokens, determining whether each token        satisfied, the model can directly use intermediate compu-
is the top-ranked prediction at the each generation step. The     tation results from certain layers to generate tokens, thereby
small and large models iteratively repeat this process until      improving the inference efficiency. To determine the exit
decoding is complete. Speculative decoding can lead to a          condition, prediction confidence [997] or the entropy [996]
notable 2× to 3× speedup without compromising the gener-          of the next token’s generation probability distribution can
ation quality. Researchers further suggest several variants to    be used as reference measures. More recently, mixture-
improve the efficiency of this approach, such as a learning-      of-depths [998] has proposed to dynamically adjust the
based method to combine several small models [989] and            computation load of each layer. Similar to MoE networks,
a stage-wise acceleration which employs a more smaller            the mixture-of-depths method calculates a score for each
model to accelerate the small model first [990].                  layer’s input via a routing network. If the score exceeds a
                                                                  preset threshold, the layer would be computed; otherwise,
Cascade Inference. Cascade inference optimizes the inference      the layer would be skipped. Unlike traditional early exit
efficiency by addressing requests of varying difficulty with      mechanisms that skip all subsequent layers, the mixture-
models of different scales. FrugalGPT [991] introduces a          of-depths method selectively skips certain layers, which
series of models arranged by efficiency from high to low,         can adaptively utilize the characteristics of different layers
sequentially processing a request through these models. A         during generation.
specially trained binary classification model then evaluates
whether the generated result meets the task requirements.         9.5   Model Compression
If the result is deemed reliable, subsequent models would
                                                                  Due to the huge number of model parameters, LLMs take
be bypassed, thus improving the inference speed. This
                                                                  a significant memory footprint for inference, making it very
strategy can be applied to various open-source models and
                                                                  costly to be deployed in real-world applications [999]. In this
commercial APIs, allowing for the flexible adjustment the
                                                                  section, we focus on how to reduce the memory footprint
classification threshold to balance inference efficiency and
                                                                  of LLMs via technical approaches. In particular, we will
generation quality according to specific needs. For reason-
                                                                  primarily introduce the model quantization approach, and
ing tasks, researchers [992] further propose to utilize the
                                                                  also briefly discuss other model compression methods, e.g.,
self-consistency [429] of generated answers to evaluate the
                                                                  model pruning and distillation.
quality of the small model: the large model is employed for
generation only when the small model’s answers exhibit a          9.5.1 Quantization Methods
low consistency.
                                                                  There are generally two major model quantization ap-
Non-autoregressive Decoding. Existing decoding methods            proaches, namely quantization-aware training (QAT) (requir-
predominantly adopt the autoregressive mechanism, gen-            ing additional full model retraining) and post-training quanti-
erating tokens one by one, which is a primary reason              zation (PTQ) (requires no model retraining). Compared with
for lower inference efficiency. Therefore, non-autoregressive     small-sized language models, two major differences need
decoding [993] has been proposed by generating all tokens         to be considered when designing or selecting quantization
based on the input at once. However, the generation quality       methods for LLMs. Firstly, LLMs consist of a huge number
                                                                                                                                    86

of parameters, and thus PTQ methods are more preferred                   and can be pre-processed before model deployment. By
due to a much lower computational cost than QAT methods.                 identifying and preserving these salient weights, the error
Secondly, LLMs exhibit very different activation patterns                associated with model quantization can be effectively re-
(i.e., large outlier features), and it becomes more difficult            duced. In existing literature, various methods have been
to quantize LLMs, especially hidden activations. Next, we                proposed to detect these salient weights. For instance, PB-
will briefly review several representative PTQ methods48 for             LLM [1003] utilizes the magnitude of weights for finding
LLMs.                                                                    critical weights, SpQR [1004] categorizes the outliers in
                                                                         weights into small groups by investigating the structural
Background for Quantization. In this part, we present a                  patterns, APTQ [1005] employs the Hessian trace as a sen-
general introduction of quantization techniques for neu-                 sitivity metric, and OWQ [1006] selects the top sensitive
ral networks. In neural network compression, quantization                columns based on both the Hessian matrix and weight
often refers to the mapping process from floating-point                  perturbations.
numbers to integers [1000], especially the 8-bit integer quan-
                                                                             • Fine-grained quantization. For Transformer models,
tization (i.e., INT8 quantization). For neural network models,
                                                                         weights and activations are usually represented in the
there are typically two kinds of data to be quantized, namely
                                                                         form of tensors. A straightforward approach is to use
weights (model parameters) and activations (hidden activa-
                                                                         coarse-grained quantization parameters for the whole ten-
tions), which are originally represented in floating-point
                                                                         sor (i.e., per-tensor quantization) [1007]. However, it usu-
numbers. To illustrate the essential idea of model quan-
                                                                         ally leads to inaccurate reconstruction results. Thus, fine-
tization, we introduce a simple yet popular quantization
                                                                         grained methods are proposed to reduce the quantization
function: xq = R(x/S) − Z , which transforms a floating
                                                                         error. ZeroQuant [1008] adopts a token-wise quantization
number x into a quantized value xq . In this function, S
                                                                         approach with dynamic calibration for compressing acti-
and Z denote the scaling factor (involving two parameters
                                                                         vations. Whereas for weights (easier to be quantized), it
α and β that determine the clipping range) and zero-point                uses a group-wise quantization. In practice, a group size of
factor (determining symmetric or asymmetric quantization),
                                                                         128 [1002, 1008] is commonly used for model quantization.
respectively, and R(·) denotes the rounding operation that
                                                                             • Balancing the quantization difficulty. Considering that
maps a scaled floating value to an approximate integer.
                                                                         weights are easier to be quantized than activations,
As the reverse process, dequantization recovers the original
                                                                         SmoothQuant [1007] proposes to migrate the difficulty from
value from the quantized value accordingly: x̃ = S·(xq +Z).
                                                                         activations to weights. Specially, they incorporate a scaling
The quantization error is calculated as the numerical differ-
                                                                         transformation to balance the difficulty between weights
ence between the original value x and the recovered value
                                                                         and activations in a linear layer: Y = (Xdiag(s)−1 ) ·
x̃. The range parameters α and β have a large impact on the
                                                                         (diag(s)W). By introducing an mathematically equivalent
quantization performance, which often need to be calibrated
                                                                         transformation, this formula controls the quantization diffi-
according to real data distributions, in either a static (offline)
                                                                         culty through the scaling factor s. To set s, it incorporates
or dynamic way (runtime). For more details, we refer to the
                                                                         a migration strength parameter α to balance the difficulties,
readers to the excellent survey [1000] about quantization
                                                                         where each entry sj = max(xj )α / max(wj )(1−α) is deter-
methods on neural networks.
                                                                         mined by the migration strength.
Post-Training Quantization (PTQ). We first introduce the                     • Layerwise quantization. This approach finds optimal
PTQ methods for LLMs.                                                    quantized weights that minimize a layerwise reconstruction
                                                                                                             2
    • Mixed-precision decomposition. As found in [1001], ex-             loss: arg minW c ∥ WX − WX ∥2 . To efficiently optimize
                                                                                                     c
tremely large values would occur in hidden activations                   this objective, GPTQ [1009] improves the original opti-
(called the emergence of outliers) when the model size reaches           mal brain quantization (OBQ) [1010] method by fixing the
6.7B parameters or above. These outliers significantly influ-            quantization order of weights for all rows. Further, with
ence the data distribution ranges of the hidden activations,             specially designed methods (i.e., lazy batch-updates and
making it challenging to conduct effective model quantiza-               Cholesky reformulation), GPTQ is feasible to quantize very
tion. To reduce the quantization error, a straightforward                large models (e.g., 175B OPT) in 3 or 4 bit precision. More
method is to separately process the outliers and the rest                recently, AWQ [1002] further simplifies the optimization
weight values. Specifically, LLM.int8() [1001] has observed              form by incorporating activation-aware scaling for weights,
that these outliers are mainly distributed in certain feature            which resembles the idea of SmoothQuant [1007]: weights
dimensions at Transformer layers. Based on this finding, a               corresponding to outlier activations are more important
vector-wise quantization approach is proposed to separate                to be precisely quantized. It does not directly optimize
the outliers and the rest in matrix multiplication.                      the reconstruction loss, but instead performs simple hyper-
    • Salient weights protection. For Transformer based lan-             parameter search to achieve the minimal loss on calibration
guage models, there often exists a subset of weight values               data.
that are more sensitive to quantization, which are also                      These strategies in the above methods can be jointly
referred to as salient weights [1002]. Unlike activation out-            used to improve the quantization performance. In order to
liers, which occur dynamically during inference and may                  achieve high-efficiency implementation, quantization meth-
require complex runtime handling, weight outliers are static             ods also rely on hardware- or system-level support (e.g., ef-
                                                                         ficient GPU kernels or hardware-friendly group partition).
  48. Since we mainly focus on discussing quantization methods in the
context of LLMs, the line of quantization work on small-sized language   Other Quantization Methods. In the above, we mainly fo-
models (e.g., BERT) has not been included in this survey.                cus on PTQ methods, and next introduce two recent studies
                                                                                                                                    87

that explore efficient fine-tuning methods or QAT methods           and difficulty migration [1007], can be applied to alleviate
for quanitizing LLMs.                                               the influence of outlier values. Since large outliers mainly
     • Efficient fine-tuning enhanced quantization. For post-       exist in the activations of LLMs, small language models
training quantization, direct low-bit quantization (e.g., INT4      are more resistant to activation quantization [1013, 1015].
quantization) often results in large performance degrada-           In practice, high-quality INT8 activation quantization is still
tion. To overcome this challenge, QLoRA [1011] incorporates         a difficult task, though several methods can attain satisfying
additional small tunable adapters (16-bit precision) into the       results. Further, lower precision activation quantization has
quantized models, to achieve an efficient, high-precision           still not been successfully explored, even for QAT meth-
model fine-tuning. It combines the merits of LoRA (See              ods [1012].
Section 5.3.1) and quantization methods. The experiment                  • Efficient fine-tuning enhanced quantization is a good option
results show that 4-bit quantized models can achieve the            to enhance the performance of quantized LLMs [149, 1011]. The
full 16-bit fine-tuning performance by QLoRA.                       benefits of efficient fine-tuning methods in quantization can
     • Quantization-aware training (QAT) for LLMs. A recent         be twofold. Firstly, it can directly compensate for the per-
study [1012] explores the effect of QAT methods by applying         formance degradation suffered from low-bit quantization.
a data-free distillation method to compress the weights,            This can be achieved either by increasing the fitting capacity
activations as well as key-value cache. By conducting exten-        via updating high precision adapters [1013, 1015, 1016],
sive experiments based on LLaMA, they show promising                or by finding a proper low-rank initizalization for LoRA
results with 4-bit quantization on both weights and key-            fine-tuning [1017]. Secondly, it is flexible to support task-
value cache, but not on 4-bit activation quantization, which        specific or goal-specific fine-tuning of LLMs in a lightweight
still needs more exploration.                                       way [1011], e.g., instruction tuning or chat-oriented tuning,
                                                                    by only tuning the small adapters. Overall, it makes a
Empirical Analysis and Findings. Quantization has cur-              good trade-off between the effectiveness and training cost,
rently become a common technique to reduce the memory               which provides a promising approach to enhancing the
footprint and latency of LLMs in deployment. In particular,         performance of quantized LLMs.
it is important to understand what level of precision (e.g.,
INT8 or INT4) can be applied to quantize different parts of         Empirical Analysis on Quantization Experiments. To fur-
LLMs (e.g., weights or activations), while retaining a high         ther help readers understand the impact of quantization on
accuracy. In this part, we first summarize the major findings       LLMs, we also conduct a group of experiments to investi-
about the quantization of LLMs in existing literature, and          gate the inference performance of quantized models here.
then present some empirical analysis with quantization              Specifically, we focus on the fine-tuned LLaMA models (i.e.,
experiments.                                                        7B and 13B) using popular SFT datasets, including FLAN-
    • INT8 weight quantization can often yield very good results    v2 [69], Alpaca-52K [187] and ShareGPT [153]. For evalua-
on LLMs, while the performance of lower precision weight quan-      tion, we utilize the same tasks in Table 10, and follow the
tization depends on specific methods [1002, 1007, 1009, 1013]. In   quantization settings in the study [1015] examining the per-
most cases, INT8 weight quantization can be effectively ap-         formance of quantized language models at three precision
plied to reduce the memory footprint without performance            levels: 4-bit, 8-bit and 16-bit. The results are summarized
degradation. While for INT4 (or INT3) weight quantiza-              in Table 21. As can be observed from Table 21, the results
tion, existing methods rely on specific strategies to reduce        obtained with 8-bit and 4-bit weight quantization are close
the performance degradation, e.g., layerwise method [1008,          to the performance of 16-bit models while significantly
1009], activation-aware scaling [1002] and low-rank adapter         reducing memory consumption. In practice, it is recom-
tuning [1011]. Interestingly, LLMs seem to be less sensitive        mended to first examine the performance of 4-bit weight
to low-bit weight quantization than small-sized language            quantization for LLMs if reducing memory usage is a critical
models [1013]. In practice, with the same memory cost, it           consideration for deployment.
is suggested to use a larger language model with a lower            9.5.2 Other Model Compression Methods
quantization precision rather than a smaller language model
                                                                    In addition to model quantization, we next introduce two
with a higher quantization precision. For example, a 4-bit
                                                                    other model compression methods for LLMs, namely model
60B LLM is demonstrated to have better performance than
                                                                    distillation and model pruning. Unlike model quantization,
an 8-bit 30B LLM [1014]. Moreover, focusing on emergent
                                                                    model distillation and pruning aim to simplify the model
capabilities, the study [1015] finds that in-context learning,
                                                                    architecture, thereby reducing the total number of parame-
step-by-step reasoning, and instruction following all seem
                                                                    ters.
to be seldom affected with 4-bit weight quantization. This
result suggests that INT4 quantization exhibits a favorable         Distillation for LLMs. In general, model distillation aims to
trade-off in terms of both total bits and performance of            transfer the capabilities from a capable model (referred to
emergent abilities.                                                 as the teacher model) to a less capable model (referred to
    • Activations are more difficult to be quantized than           as the student model), thereby achieving the compression
weights [1001, 1007, 1013]. It has been found that large out-       of the capable model. Based on whether the weights of
liers would occur for Transformer language models having            teacher models are accessible, one can employ either the
a size of 6.7B or above [1001]. This issue has been one             white-box approach or the black-box approach for LLM
of the most fundamental difficulties to quantize LLMs. To           distillation. The white-box approach often employs the
overcome this issue, various methods, e.g., mixed-precision         traditional knowledge distillation technique, which incor-
decomposition [1001], fine-grained quantization [766, 1001]         porates additional loss functions (i.e., distillation loss) for
                                                                                                                                                                       88

TABLE 21: Evaluation results for quantized LLaMA models (7B and 13B). We employ existing model checkpoints provided
by [350] for quantization experiments, which have been fine-tuned on FLAN-v2, Alpaca-52K, and ShareGPT, respectively.
Specifically, we report the performance with AlpacaFarm, MMLU, and BBH, as well as the memory usage of the loaded
model (Mem.). For quantization, we employ bitsandbytes to quantize the 16-bit models to 8/4 bits by specifying the
commands load_in_8bit and load_in_4bit when loading the weights. It is worth noting that we select text-davinci-
003 as the baseline model for the AlpacaFarm dataset.

                                               16-bit                                          8-bit                                       4-bit
   Models        SFT Dataset
                               AlpacaFarm   MMLU        BBH     Mem.(GiB)   AlpacaFarm      MMLU       BBH     Mem.(GiB)   AlpacaFarm   MMLU       BBH     Mem.(GiB)
   LLaMA (7B)    FLAN-v2          6.65      47.34       35.05     12.58       6.15          47.02      35.17     6.65         7.83      46.23      34.77     3.94
                 Alpaca-52K      32.55      40.87       33.66     12.58       33.60         39.98      34.38     6.65        29.57      39.24      32.80     3.94
                 ShareGPT        72.05      41.30       32.90     12.58       72.86         39.34      32.71     6.65        70.31      40.08      32.11     3.94
   LLaMA (13B)   FLAN-v2          8.14      51.67       41.46     24.40       7.64          51.02      41.25     12.53        7.52      50.48      40.68     7.34
                 Alpaca-52K      33.60      47.63       36.10     24.40       31.43         47.04      35.98     12.53       30.87      46.20      36.16     7.34
                 ShareGPT        75.59      47.58       38.00     24.40       73.79         47.71      38.31     12.53       71.99      45.77      36.97     7.34




aligning the outputs or intermediate states of the student                            optimizers [1023]. It focuses on the quantization of both
model to those of the teacher model. Based on this ap-                                activations and weights for LLMs, including the support on
proach, MINILLM [1018] effectively distills the 13B LLaMA                             8-bit and 4-bit (NF4,FP4) matrix multiplication for efficient
model down to a 7B model. The black-box approach [1019],                              inference, as well as an 8-bit optimizer for efficient training.
on the other hand, can only make use of the textual re-                                   • GPTQ-for-LLaMA50 is developed specially for quantiz-
sponse of the teacher model for training the student model.                           ing LLaMA models. It enables 4-bit quantization of LLaMA
These studies mainly focus on utilizing the generated re-                             models of varied sizes based on the GPTQ algorithm [1009].
sponses for enhancing the key capabilities from the teacher                           Also, it provides a comparison with bitsandbytes in both
model [146, 384], such as in-context learning and chain-of-                           memory and performance (PPL) on the project website.
thought prompting.                                                                        • AutoGPTQ51 is a quantization package developed
                                                                                      based on the GPTQ algorithm [1009], which supports INT4
Pruning for LLMs. The goal of model pruning is to min-
                                                                                      quantization for LLMs. It includes a number of quantized
imize the number of parameters in a model while pre-
                                                                                      models in the library, and supports LoRA by integrating
serving its performance as much as possible. In general,
                                                                                      with HuggingFace PEFT library.
model pruning methods can be categorized into two lines:
structured pruning and unstructured pruning. Structured
                                                                                          • llama.cpp52 makes it feasible to run quantized LLaMA
                                                                                      models on a MacBook device. It supports INT4, INT5 and
pruning aims to remove certain less important model com-
                                                                                      INT8 quantization, which is developed in efficient C/C++
ponents (e.g., neurons, channels, layers) that have minimal
                                                                                      implementation. It also supports a number of LLaMA based
impact on performance. On the other hand, unstructured
                                                                                      models, such as Alpaca and Vicuna.
pruning mainly focuses on removing individual weights or
connections within a neural network model without chang-                              Other Libraries. In addition, there are also libraries for
ing the model’s main structure. As for LLMs, unstructured                             supporting other model compression methods.
pruning can generally lead to higher compression rates.                                  • Torch-Pruning 53 is a toolkit developed for general-
For instance, SparseGPT [1020] achieves 60% unstructured                              purpose structural pruning, including the pruning for vision
sparsity for OPT-175B using unstructured pruning (i.e.,                               models, diffusion models and large language models. It em-
60% of the elements in the weights are masked), and the                               ploys dependency graph for automatic structural pruning
pruned LLM still retains a relatively low perplexity. With                            and supports several high-level pruners (e.g., MetaPruner
suitable strategies, structured pruning for LLMs can also                             and BNScalePruner).
achieve promising model compression rate. For instance,                                  • LLM-Pruner54 is designed specifically for the pruning
LLM-pruner [1021] selectively removes 20% of the non-                                 of LLMs. It enables efficient gradient-based structral prun-
essential parameters from LLaMA (7B) based on gradient                                ing for LLMs with minimal training samples and training
information, while maintaining 93.6% performance of the                               time. Currently, it supports a number of LLMs, such as
original model. Furthermore, Sheared LLaMA [1022] in-                                 Baichuan, BLOOM, and LLaMA3.
troduces two techniques: targeted structured pruning and
dynamic batch loading, which effectively prunes LLaMA-
2 (7B) to a parameter size of 2.7B, while preserving 87.8% of                         9.6   Retrieval-Augmented Generation
the original model’s performance.                                                     When dealing with real-time information or specialized
9.5.3 Open-source Libraries                                                           domain knowledge, LLMs may struggle to generate ac-
In this part, we briefly introduce the available open-source                          curate outputs solely based on their internal knowledge.
libraries for memory-efficient deployment.                                            To address this issue, retrieval-augmented generation (RAG)
                                                                                      technique [1024, 1025] has been proposed by incorporating
Quantization Libraries. Next, we introduce three popular
quantization libraries for LLMs, including:                                             50. https://github.com/qwopqwop200/GPTQ-for-LLaMa
    • Bitsandbytes49 is developed based on the methods                                  51. https://github.com/PanQiWei/AutoGPTQ
introduced in the papers of LLM.int8() [1001] and 8-bit                                 52. https://github.com/ggerganov/llama.cpp
                                                                                        53. https://github.com/VainF/Torch-Pruning
 49. https://github.com/TimDettmers/bitsandbytes                                        54. https://github.com/horseee/LLM-Pruner
                                                                                                                                  89

external knowledge source for improving the model re-              determine whether the current task requires retrieval or the
sponse. This technique aims to retrieve relevant information       use of retrieved content [662].
from external sources (e.g., the internet or domain-specific
knowledge bases) using an information retrieval system,            Improvement Strategies. In practice, factors such as the
thereby providing LLMs with timely or domain-relevant              quality of retrieved documents, prompt design, and the
context to reduce the factual errors in generated content.         generation method of LLMs might impact the final per-
In the format, RAG can also be considered as a specific            formance of RAG. Next, we discuss how to enhance the
prompting strategy that integrates auxiliary information           RAG performance by summarizing existing improvement
from external sources into the original prompt. Next, we will      strategies.
introduce the basic workflow of the retrieval-augmented                • Retrieval method improvement. The incorporation of
generation technique and related optimization strategies.          retrieval supplements the LLM with relevant contextual
                                                                   information, and the retrieval performance directly affects
Basic Workflow. Typically, the standard RAG procedure              the quality of the final generated response [454]. To design
consists of three steps, including context retrieval, prompt       effective retrieval strategy, an important factor to consider
construction, and response generation.                             is the text granularity. Intuitively, a coarser granularity (e.g.,
    • Context Retrieval. The retrieval step primarily focuses      document-level) may result in efficient retrieval but tend to
on finding relevant context information from existing infor-       incorporate substantial irrelevant information, while a finer
mation sources that are helpful for addressing the current         granularity (e.g., sentence-level) increases the proportion of
information need. To achieve efficient retrieval, it is often      relevant content in the retrieval results but can lead to higher
necessary to build a search index over the collection of can-      retrieval latency. To balance relevance and latency, existing
didate documents and then use appropriate methodologies            research work proposes using “propositions” as the retrieval
for text retrieval. There are two commonly used retrieval ap-      unit [1031], corresponding to semantically complete and
proaches: lexical-based retrieval [1026] using sparse vector       relatively independent text fragments, which can effectively
representations and semantic retrieval methods using dense         reduce the recall of irrelevant information. In particular, they
vector representations [54]. The former tokenizes the docu-        mainly use GPT-4 to synthesize instruction data for the ex-
ments and building an inverted index based on a vocabu-            traction of proposition text, training a smaller model specifi-
lary, followed by retrieving relevant documents using lexical      cally to construct proposition text data [1031]. Furthermore,
matching. The latter maps documents to low-dimensional             to improve retrieval performance, methods such as query
dense vectors and then constructs an efficient index of doc-       expansion and query rewriting can be utilized to optimize
ument vectors using approximate nearest neighbor search            query formulation. Query expansion focuses on adding
algorithms, ranking candidate documents based on the sim-          supplementary information to the original query, such as
ilarity of embeddings. Both methods can often perform well         incorporating related entity information or providing de-
for large-scale document collection, which are widely used         tailed explanations of key information in the query [796],
in existing RAG systems.                                           which helps strengthen relevance matching. However, tra-
    • Prompt Construction. After the retrieval stage returns       ditional query expansion methods may disrupt the original
the relevant documents, these documents need to be incor-          semantics for complex queries. To address this issue, we can
porated into the input prompt of the LLM along with the            employ LLMs to decompose complex queries into several
task description. The prompt should guide the model to uti-        sub-queries, which are subsequently expanded individually,
lize the retrieved information to complete the corresponding       allowing for multi-path recall of related information [1032].
task. For example, a prompt could be, “Please refer to the         As another query enhancment technique, query rewriting
information contained in the following documents to complete the   focuses on modifying the query content to highlight key
task”. Since the retrieved documents are typically lengthy,        information and eliminate potential ambiguities, facilitating
simply concatenating them into the prompt might lead               the retrieval of related documents [1033]. LLMs can be ap-
to a poor utilization of the provided context due to the           plied directly to query rewriting, transforming the original
biased attention (e.g., lost in the middle [949]). To address      query into a more suitable form through well-designed
this issue, existing approaches often introduce reranking          prompts [1034]. To reduce inference overhead, the query
models to select the most relevant documents from the              optimization capabilities of LLMs can also be transferred
retrieval results [1027]. Alternatively, information extraction    to smaller models through knowledge distillation [1035].
or text compression techniques can be used to retain only the          • Retrieval results refinement. In addition to the initial
highly relevant information from the documents, thereby            retrieval methods, the refinement of retrieval results also
reducing the input context length [1028, 1029].                    plays an important role in RAG systems, since the retrieved
    • Response Generation. In this step, the constructed           documents may be not best suited for RAG systems, e.g.,
prompt is input into the LLM, enabling it to utilize the re-       LLMs might have difficulty in utilizing long contexts or
trieved content to better accomplish the corresponding task.       be affected by irrelevant information in the retrieved docu-
However, the retrieved documents may contain irrelevant            ments. As a solution, the documents returned during the re-
information or even contradictory information to the true          trieval stage can be reranked according to their relevance to
answer, which might affect the output generated by the             the input [1036], filtering out low-quality or irrelevant doc-
LLM. To address this, the LLM can be further prompted              uments or placing less relevant documents in non-optimal
to self-check the quality of the generated output and decide       positions within the prompt. Furthermore, both generation
whether to re-perform the retrieval based on the new out-          and reranking tasks [1027] can be jointly optimized to facil-
puts [1030], or it can perform a confidence assessment to          iate better utilize of context documents. Additionally, LLMs
                                                                                                                               90

can be directly used for document re-ranking by designing        reconstruct the remaining content of the original document
specific prompts or using context examples to accomplish         based on the retrieval results [1043].
this task [777]. In addition to document filtering or rerank-
ing, information extraction or automatic summarization
                                                                 9.7     Hallucination
techniques can be employed to refine the retrieved content
by extracting more concise and query-relevant content from       Hallucination, which refers to the phenomenon that LLMs
the retrieved documents. Furthermore, existing research has      generate content inconsistent with factual information, has
proposed token-level compression strategies [1037], which        become a significant issue that greatly affects the task
select important tokens and remove unimportant parts from        performance of LLMs [1044]. In this section, we focus on
the candidate documents.                                         discussing the topic of LLM hallucination, first introducing
    • Iterative retrieval enhancement. In some complex appli-    the definition and source of hallucination and then summa-
cation scenarios, a single retrieval procedure may not suffice   rizing the detection and mitigation methods.
for RAG systems. To address this issue, we can further use
iterative retrieval augmentation and adaptive retrieval aug-     9.7.1    Definition of Hallucination
mentation. Iterative retrieval augmentation aims to itera-       Early research typically defines hallucinations based on
tively refine the initial query based on the model’s generated   the relationship between a model’s output and the given
results to achieve a comprehensive coverage of the required      input [1045]. In this manner, hallucinations are categorized
information. As it involves accumulating multiple rounds         into intrinsic hallucinations where the model’s output does
of retrieval information, the performance of RAG systems         not match the input text and extrinsic hallucinations where
may be affected by redundant or conflicting information. To      the model’s output cannot be verified against the input.
address this issue, stop mechanism has been introduced for       However, in real-world scenarios, user inputs often do not
retrieval iteration, using the LLM to evaluate the confidence    contain reference documents, and thus existing work mainly
of the current generation results to determine whether to        focuses on open-domain factual hallucinations, where the
continue the iteration process [662]. Additionally, for more     model-generated content does not align with or cannot be
complex scenarios, iterative retrieval can be combined with      verified by existing world knowledge [1044, 1046]. Accord-
the LLM’s own CoT reasoning capability. For example,             ing to a recent study [1044], factual hallucinations can be
intermediate results from the chain of thought can be used       further categorized into the following types:
as the query input for the next round of retrieval, and after        • Entity-error hallucination. This type of hallucination
completing the retrieval process, the returned results can       refers to LLMs generating text containing incorrect entities,
be integrated into the chain of thought. Building on the         such as names of people, dates, locations, or objects that
iterative retrieval augmentation method, adaptive retrieval      contradict world knowledge.
augmentation further enhances the LLM’s autonomous use               • Relation-error hallucination. This type of hallucination
of the retrieval mechanism [1038], thereby improving the         involves LLMs generating incorrect relationships between
overall framework’s efficacy in using the retrieval systems.     entities, such as inaccurate quantitative or chronological
In practical implementation, for the above two types of aug-     connections.
mentation methods, LLM first need to determine when to               • Incompleteness hallucination. LLMs may produce incom-
use the retriever and then utilize pre-set prompts to initiate   plete outputs, especially when generating lengthy or list-
query generation and retrieval result processing [1039].         based responses. This hallucination arises when LLMs are
    • RAG-enhanced training. In addition to the improvement      asked about aggregated facts and they fail to reserve the
strategies mentioned above, specialized training tasks can       factual completeness.
be designed to further enhance the LLM’s ability to utilize          • Outdatedness hallucination. This type of hallucination
the retrieved content, including both instruction tuning and     occurs when LLMs generate information that was accurate
pre-training tasks. By constructing instruction data focused     at a past time but is no longer correct at present. This issue
on retrieval context utilization [1040], instruction tuning      typically arises due to that most LLMs were trained on time-
can improve the LLM’s ability to utilize relevant retrieval      limited corpora.
information. When curating the instruction data, it is essen-        • Overclaim hallucination. This type of hallucination refers
tial to consider two important issues: positional bias and       to cases where the statement expressed in the generated text
irrelevant information within the input context. Specifically,   of LLMs is beyond the scale of factual knowledge.
relevant documents can be placed at different positions              • Unverifiability hallucination. This hallucination refers
within the prompt, which can enhance the model’s attention       to cases where the information produced by LLMs cannot
to relevant content in various positions and prevent the         be verified against existing information sources, making it
model from neglecting certain positions [949]. Additionally,     difficult to assess its accuracy.
irrelevant information can be added to the instructions data,
so as to improving the model’s ability to resist interference    9.7.2    Source of Hallucination
from such information [1041]. In addition, special training      In this part, we will discuss the potential factors that might
tasks can be introduced during the pre-training stage to         lead to hallucination for LLMs.
further enhance the LLM’s retrieval and generation capa-
bilities [657, 1042]. Existing work mainly constructs unsu-      Training Data. The quality of training data significantly
pervised pre-training data aimed at retrieval augmentation.      impacts the model’s output and is a primary source of
A common data construction method uses portions of the           hallucinations. Further, the distribution of training data also
original document as queries and then trains the model to        plays a key role in shaping the behaviors of LLMs. We next
                                                                                                                                91

introduce the effect of training data on hallucinations from      data may contain hallucinated content, which might lead
these two aspects.                                                to more hallucinations for the trained model. Addition-
    • Data quality. In practice, the pre-training dataset is      ally, during the human alignment process, existing training
typically constructed by collecting diverse data from various     methods may also cause hallucination issues. Some research
sources. While increasing pre-training data can lead to im-       work has revealed that LLMs may cater to human responses
proved model performance, low-quality data can severely           for earning higher rewards, likely resulting in answers that
damage the generation performance of large models. On             do not align with factual knowledge [1048].
the one hand, pre-training data may contain erroneous
                                                                  Response Generation. Given the input prompt, LLMs
information, and the goal of training large models is to
                                                                  employ decoding strategies (e.g., top-k sampling in Sec-
imitate and memorize the training data as possible. If inac-
                                                                  tion 4.2.4) for generating the response. In this process, the
curate information frequently appears in the training data,
                                                                  prompt formulation and the decoding strategies potentially
the model may memorize and directly copy this content
                                                                  affect the generation behaviors of LLMs.
during generation, leading to the phenomenon known as
“imitative falsehoods” [558]. On the other hand, pre-training         • Prompt design. Prompting has become the primary
data may contain biased content and the subjective views          way for using LLMs to solve downstream tasks. However,
of its creators. Such biased content can severely affect the      inappropriate prompt design can cause the model to over-
model’s learning of world knowledge, possibly leading to          look or misunderstand important information, leading to
inappropriate representations.                                    incorrect or irrelevant content [1044]. Recent studies have
                                                                  shown that the readability, format, and concreteness of user
    • Data distribution. The distribution of pre-training data
                                                                  instructions would impact the model’s output [1049]. For
also significantly affects the model’s behavior. Firstly, re-
                                                                  instance, the use of complex words or long phrases in the
garding the recency factor, LLMs are typically trained on
                                                                  prompt reduces the readability, which makes LLMs more
data from a limited period. As world knowledge continu-
                                                                  difficult to understand the real intentions of user instruction,
ously evolves, the model’s stored knowledge can become
                                                                  thereby increasing the chance of hallucination. Additionally,
outdated, thereby likely leading to fabrications or outdated
                                                                  non-standard expressions or abstract concepts can also ex-
information when addressing questions beyond its knowl-
                                                                  acerbate hallucinations.
edge scope. In terms of data composition, pre-training
data may lack domain-specific knowledge, which would                  • Decoding strategy. To improve the diversity of the
affect the model performance on tasks requiring specialized       generated content, multiple random sampling strategies are
knowledge, such as medical or legal issues, and it will           introduced (e.g., beam search, top-p sampling). However,
also result in significant hallucinations. Additionally, recent   increasing diversity also brings a higher likelihood of gen-
studies show that when addressing questions involving             erating hallucinated content. For example, increasing the
long-tail knowledge that appears infrequently in the train-       temperature t (Equation 10) will result in a more uniform to-
ing corpus, models are more likely to generate inaccurate         ken probability distribution, which potentially leads to more
content [1044].                                                   hallucinations, since low-frequency yet irrelevant words
                                                                  would be assigned a higher probability for generation in
Training Methods. The training process of LLMs typically          this setting.
includes two major stages: pre-training and post-training.
Inappropriate training methods across the two stages are          9.7.3   Hallucination Detection
also likely to result in the hallucination behaviors of LLMs.
    • Pre-training. Currently, the pre-training stage primar-     To effectively detect the hallucinated content, existing work
ily employs the next token prediction method for model            mainly adopts three approaches, namely model-based,
training. Recent studies [949] indicate that under the au-        uncertainty-based and tool-based methods.
toregressive training method, the model’s attention distri-
                                                                  Model-based Methods. Due to the powerful language ca-
bution tends to decay as the sequence length increases. This
                                                                  pabilities and rich world knowledge, existing work exten-
would prevent LLMs from effectively modeling long-range
                                                                  sively adopts powerful LLMs to detect hallucinations from
dependencies, potentially resulting in inference errors or
                                                                  the model-generated text. In this approach, hallucination
hallucinations. Additionally, the teacher-forcing strategy is
                                                                  detection can be considered as a normal text task that re-
commonly used during the training of large models. In this
                                                                  quires prompt formulation. To facilitate the research in this
approach, the correct tokens from the previous steps are
                                                                  line, HaluEval [604] introduces a comprehensive dataset of
used to predict the next token instead of the model output.
                                                                  model-generated and human-annotated hallucinated sam-
However, during model inference, the model can only use
                                                                  ples to evaluate how well LLMs can identify such instances,
its own generated content for subsequent predictions. This
                                                                  and they empirically show specific prompting strategies
discrepancy between the training and generation phases
                                                                  such as CoT can effectively improve the model’s accuracy
leads to “exposure bias” [1047], which may in turn cause
                                                                  in detecting hallucinations. Furthermore, research work pro-
hallucination issues.
                                                                  poses to decompose the hallucination detection into two
    • Post-training. During the instruction-tuning process,       subtasks: first, extract factual statements, and then assess
existing works typically employ knowledge distillation to         whether each statement is hallucinated or not [1044, 1050].
improve the model’s instruction-following ability. This in-
volves using high-performance models (such as GPT-4) to           Uncertainty-based Methods. Recent studies suggest that
generate large-scale instruction data and then fine-tuning        the occurrence of hallucinations in LLMs may be related
weaker models with this data. However, these synthesized          to the uncertainty of their outputs [1051]. Based on such
                                                                                                                                       92

assumptions, a series of works propose detecting hallucina-         further expands the knowledge source to local databases,
tions by assessing the uncertainty of model-generated con-          devising an agent framework to retrieve, consolidate, and
tent. Some research work focuses on the internal features of        generate feedback to the LLM for the final answer. Other
LLMs, such as token probability and logits. For key concepts        research explores placing the retrieval process at different
in the generated text, a lower token probability indicates a        positions relative to the generation process. Verify-and-
higher uncertainty, which represents an increased likelihood        Edit [1060] proposes to perform the retrieval procedure
of hallucination [1052]. Other research efforts evaluate the        after the generation process, allowing the original answer to
uncertainty by examining the consistency of the models’             be edited based on the retrieved documents. Furthermore,
responses. For instance, SelfCheckGPT [1051] lets LLMs an-          to help LLMs better handle complex tasks, IRCoT [1061]
swer the same questions multiple times to judge whether the         interleaves the knowledge retrieval process with CoT gen-
generated answers are consistent or not. Another alternative        eration, where the retrieved documents guide the LLM in
way requires LLMs to reconstruct the input questions based          generating additional reasoning steps and CoT sentences
on the responses and then check the consistency between             assist in retrieving more relevant and diverse documents.
the generated and original questions [1053].
                                                                    Improved Decoding Strategy. In addition to the above
Tool-based Methods. LLMs can detect hallucinations by               methods, hallucinations can also be mitigated by using im-
calling external tools to verify the model-generated content.       proved decoding strategies. Typically, the internal states or
Typically, the model’s output contains various segments of          knowledge of LLMs themselves can be exploited to reduce
factual knowledge, which can be broken down into fine-              the hallucinations. DoLa [317] proposes that the lower layers
grained factual statements. FActScore [1054] refers to knowl-       of LLMs tend to assign higher probabilities to syntactically
edge sources like search engines to verify these statements.        plausible words, while higher layers encode more factual
FacTool [1055] further proposes to use a series of external         knowledge. Therefore, DoLa devises a contrastive decod-
verification tools such as calculators and code interpreters to     ing strategy by subtracting the lower logits from the last
check different types of text. In addition, HaluAgent [1056]        layer’s logits and using the results for next-token predic-
proposes an agent framework to employ smaller open-                 tion. ITI [1062] finds that specific attention heads show
source models for hallucination detection. With the assis-          high linear probing accuracy and regards their activation as
tance of tools like search engines and calculators, HaluAgent       truth-correlated directions. During inference, certain heads’
enables 7B-size models to achieve comparable performance            activations would be shifted along these pivot directions.
as GPT-4 in hallucination detection.                                Some other work introduces external knowledge sources
                                                                    to aid the decoding process. CAD [1063] provides LLMs
9.7.4 Hallucination Mitigation                                      with extra context about the query, and then contrasts the
In practice, it is essential to effectively mitigate the halluci-   output probabilities by those without using context, thereby
nation behaviors of LLMs, to provide accurate and helpful           adjusting the influence of the model’s prior knowledge.
responses. In this part, we will introduce several widely-          KCTS [878] applies an auxiliary knowledge classifier on top
used approaches for alleviating the hallucination, including        of the LLM to detect hallucinations, and uses its knowledge
human alignment, retrieval-augmented generation and im-             faithfulness score to reweight the token distribution.
proved decoding strategy.
                                                                    9.8   Complex Reasoning
Human Alignment. Hallucination mitigation is closely re-
lated to the honest criterion in “3H” standards for human           In this section, we introduce a new reasoning paradigm
alignment, and various alignment methods like RLHF can              for LLMs aimed at solving complex tasks by allocating
be adopted to mitigate the model hallucination. HaluEval            more time to thinking before responding to a problem,
2.0 [1044] proposes to first collect hallucinated and non-          i.e., conducting complex reasoning. Specially, we focus on
hallucinated responses to train a reward model, and then            long chain-of-thought (CoT) reasoning55 , which is the main-
fine-tune the LLM with the reward model’s feedback us-              stream approach taken by recent large reaonsing models,
ing RL algorithms. However, recent research shows that              such as OpenAI’s o-series models. We will begin by pro-
human preference data may lead LLMs to exhibit syco-                viding an overview of long CoT reasoning, then introduce
phantic behavior [1057], where models prioritize catering           the construction of long CoT data and the corresponding
to human demands over maintaining truthfulness. Some                training methods, and finally discuss more general test-time
work proposes to refine the annotation process of preference        scaling methods.
data, such as by aggregating multiple human preferences
                                                                    9.8.1 Overview and Analysis
to improve feedback quality [1057] or fine-tuning LLMs on
prompts where the truthfulness of a claim is independent of         Generally, long CoT reasoning is a method to search for
the user’s opinion [1058].                                          solutions within the natural language space, as reflected in
                                                                    the output responses of LLMs. This approach is akin to the
Retrieval-Augmented Generation. Providing LLMs with                 slow thinking mode of the human brain [1064], which takes
highly reliable external knowledge as context can help re-          significantly more time to think through difficult problems
duce hallucinations. RARR [1059] first generates multiple
questions about the generated text, then retrieves web pages           55. The phrase “long CoT” may not be conceptually precise since the
from Google Search as evidence, and finally, an editing             model’s thought process could be tree- or graph-structured rather than
                                                                    strictly linear. We use this terminology in line with OpenAI’s intro-
model is employed if any disagreement is detected between           duction of the o1 model, which generally refers to extended thought
the evidence and the generated text. LLM-Augmenter [661]            processes for complex reasoning.
                                                                                                                             93

compared to the fast thinking mode used for simpler ones.        ther analyzed the reasoning patterns exhibited in the o1
This subsection will first qualitatively analyze the reasoning   model [1065]. These studies, based on empirical investiga-
patterns and then briefly discuss the main advantages of         tion, have identified several key reasoning patterns, includ-
this reasoning mode.                                             ing systematic analysis, method reuse, divide-and-conquer,
                                                                 self-refinement, context identification, and constraint em-
                                                                 phasis. Additionally, the use of these reasoning patterns
                                                                 varies across different tasks, significantly enhancing cogni-
                                                                 tive processes compared to standard CoT reasoning.
                                                                 Reasoning Advantages. Unlike standard CoT reasoning,
                                                                 this approach does not enforce a linear reasoning chain. In-
                                                                 stead, it integrates various reasoning actions and strategies,
                                                                 such as reflection and backtracking, into a single response.
                                                                 Overall, it has two major advantages compared to the stan-
                                                                 dard CoT method or direct prompting methods.
                                                                      Firstly, due to the autoregressive nature, the standard
                                                                 generation paradigm of LLMs is a “one-time” reasoning
                                                                 process. This means that if the generated solution contains
                                                                 obvious mistakes, or even if LLMs are aware of other
                                                                 promising solutions, there are no opportunities for refine-
                                                                 ment or verification. This issue becomes more pronounced
                                                                 in complex reasoning tasks, where the search space is much
                                                                 larger, preventing LLMs from fully exploring it [1066]. In
                                                                 contrast, long CoT reasoning mitigates this problem by
                                                                 allowing the model to autonomously check and revise its
                                                                 attempts, thus enabling more effective reasoning.
                                                                      Secondly, this text-based reasoning process can, in prin-
Fig. 19: Examples of long CoT reasoning from DeepSeek-           ciple, emulate various search algorithms that rely on more
R1 (accessed on January 25, 2025). Grey fonts denote the         complex search structures. For example, to represent a tree-
thought part of the model output, and italic fonts denote        structured search space, one might employ a textual process
the final answer.                                                that combines forward exploration with backward revis-
                                                                 its, incorporating necessary reflection and verification steps
Reasoning Patterns Analysis. As demonstrated in Exam-            along the way. Consequently, long CoT reasoning can repli-
ple 19, existing long CoT reasoning models typically gen-        cate the effects of previously introduced methods like tree-
erate an extended thought process (in grey) before arriving at   of-thought (ToT) and graph-of-thought (GoT). However, this
the final answer (in italic). It is crucial to understand how    capability is not inherently present in the LLM; it emerges
this thought process is conducted and the types of reason-       in a manner similar to the standard CoT ability, developing
ing patterns generated by LLMs during problem-solving.           through appropriate training (see Section 9.8.3).
To provide an intuitive understanding of this reasoning               Overall, long CoT represents a significant different rea-
process, we present two examples from the DeepSeek-R1            soning mode compared to the standard CoT method, fa-
model.                                                           cilitating search algorithms within the natural language
                                                                 space of LLMs. It emphasizes how to navigate correct paths
    In the first example, we present a mathematical problem
                                                                 through a trial-and-error approach, typically incorporating
to the model, and the corresponding long CoT can be ob-
                                                                 critical reasoning actions such as planning, evaluation, re-
served in the reasoning portion of the response. The thought
                                                                 flection, and exploration. In contrast, short CoT data typi-
process here is informal and flexible, while showcasing a
                                                                 cally presents a direct solution process in which all reason-
systematic exploration of the solution within the natural
                                                                 ing steps are expected to be correct.
language space. Concretely, the model follows a complete
reasoning process consisting of action steps like “factorize     9.8.2 Construction of Long CoT Data
196” and“take the exponents”. Notably, the thought process
                                                                 To guide LLMs in producing long-form reasoning followed
naturally includes trigger keywords like “double check” and
                                                                 by solutions, it is crucial to curate high-quality long CoT
“wait”, which invoke the corresponding verification or re-
                                                                 data for warming up or training the models. While human
flection actions. In the second example, we ask the model
                                                                 annotators can construct extended CoT data, this process
which Chinese city has the largest population. Interestingly,
                                                                 is costly and requires professional expertise for challenging
it exhibits similar thought patterns, even though the ques-
                                                                 problems, making it difficult to scale. Consequently, existing
tion could be addressed in a more compact and straight-
                                                                 studies often develop various methods for automatically
forward manner. The model generates a comprehensive
                                                                 constructing long CoT data, such as distillation from more
reasoning process with actions such as “confirm the latest
                                                                 advanced models, search based data synthesis, and multi-
data” and “clarify”, with trigger keywords like “make sure”
                                                                 agent collaboration, which are detailed below.
and “avoid”.
    To gain a more comprehensive understanding of                Long CoT Data Distillation. Benefiting from the open-
this complex reasoning mode, some research has fur-              ness of o1-like LLMs endowed with powerful reasoning
                                                                                                                               94

capabilities, the leading approach to curating high-quality       assessing whether each agent’s contribution is grounded
long CoT data involves using open models or APIs for              in logical consistency. Additionally, the debate mechanism
data synthesis. The basic idea is to first construct a set of     incorporates alternative perspectives and counterarguments
prompts (i.e., problems) and then feed them into the teacher      into the reasoning process, resulting in more robust and
model to collect the corresponding long CoT response data.        nuanced outcomes for complex decision-making tasks. By
Specifically, STILL-2 [1067] utilizes two slow-thinking sys-      combining these cognitive processes, the framework fosters
tems, i.e., DeepSeek-R1-Lite-Preview [1068] and QwQ-32B-          an environment where complex problems can be tackled col-
preview [1069] for distillation to construct a dataset of         laboratively, with diverse viewpoints contributing to more
long-form thought processes. A key finding is that length         comprehensive solutions.
distribution is a critical factor in determining the quality of
long CoT data. They suggest that length directly reflects the     9.8.3   Training Methods
difficulty of prompt problems, with mathematical problems         To elicit and enhance long CoT reasoning capabilities, the
being particularly important to collect, as they often involve    existing literature extensively explore two methods: long
extensive thought processes in their solutions. The research      CoT instruction tuning and scaling reinforcement learning
shows that even a small amount of carefully curated long          (RL) training. We will describe each approach in detail
CoT data can effectively activate the slow-thinking mode in       below.
LLMs. Furthermore, this effect is corroborated by the work
on DeepSeek-R1 [1070], which demonstrates that training           Long CoT Instruction Tuning. As discussed in Section 9.8.1,
with distilled data from DeepSeek-R1 consistently enhances        long-form thought processes require models to engage in
the performance of multiple Qwen and Llama models.                extended reasoning before responding. To develop this
                                                                  reasoning capability, we can instruction-tune LLMs using
Search based Data Synthesis. Search algorithms like Monte         carefully curated long CoT data. The core concept is to train
Carlo Tree Search (MCTS) [382] have been widely applied           LLMs to “imitate” the demonstrated behaviors presented in
to synthesizing long-form reasoning data. As a represen-          the long CoT data.
tative technique, MCTS integrates the principles of tree              In general, this fine-tuning method aims to achieve two
exploration and random simulation to estimate potential           key objectives: format adherence (i.e., following a long CoT
outcomes of actions, making it particularly effective for         format) and ability elicitation (i.e., activating the complex
decision-making tasks with large action spaces. In com-           reasoning mode). Specifically, format adherence requires
plex problem-solving, MCTS decomposes the process into            the model to produce outputs consisting of two sequential
multi-step generation, with each node at a specific tree          parts—thought and solution—while ability elicitation acti-
layer representing a step in the solution [1071]. At each         vates the model’s inherent capacities for executing appro-
step, a LLM, serving as the policy model, samples several         priate long-form thought processes. It has been shown that
candidate nodes, each generating a one-step CoT. MCTS             both objectives can be effectively achieved through super-
extensively uses rollouts to automatically assign a Q-value       vised fine-tuning: a small amount of high-quality long CoT
to each intermediate step based on its contribution: steps        data can suffice to elicit the long CoT reasoning capabilities
potentially leading to more trajectories that correctly solve     of LLMs. For instance, by fine-tuning Qwen2.5 (32B) on just
the problem receive higher Q-values. After iterating through      3.9K distilled long CoT data, STILL-2 [1067] achieved per-
multiple steps to successfully address the problem, the           formance comparable to industry counterparts such as o1-
complete reasoning trajectories from the root node to the         preview and QwQ in mathematical problem-solving. This
terminal node can be viewed as long-form CoT data, where          effectiveness is largely because strong LLMs inherently pos-
intermediate nodes represent either correct reasoning steps       sess various specific reasoning abilities (e.g., reflection and
or trial-and-error attempts.                                      backtracking). Instruction tuning with long CoT data further
                                                                  enhances these innate abilities, comprehensively integrating
Multi-Agent Collaboration. Beyond relying on a single             and extending their utilization, which enables the model to
model, an alternative approach to generating long CoT data        manage more complex reasoning processes.
is to construct a multi-agent framework [1072] in which               An interesting finding is that this reasoning capability
several models collaborate or debate to produce long-form         appears to generalize well across different domains. For
reasoning data.                                                   example, when trained exclusively on mathematical data, it
    The multi-agent framework for synthesizing long-form          can lead to significant improvements in other disciplines,
CoT data typically involves the coordination of multiple          such as physics and chemistry [1067]. This is primarily
autonomous agents, each specializing in distinct roles or         because long CoT reasoning is inherently a reasoning mode
functions. These agents work together using iterative reflec-     rather than a specific ability tied to any particular domain.
tion and strategic debate to enhance the reasoning process.       This can be seen in the example shown in Example 19, where
Within this framework, one agent might initiate a chain of        the query, “Which city in China has the largest population?”,
thought by presenting an initial hypothesis or argument,          is answered through a complex thought process, despite
while others critique and challenge these ideas through log-      being solvable in a more straightforward manner. Moreover,
ical examination and counter-arguments. This process en-          this capability can be naturally extended to multimodal
courages deep reflection by prompting agents to reconsider        LLMs, as these models are typically built on the backbone
assumptions, address potential biases, and refine conclu-         of language models [1073].
sions through continuous discourse. In this context, reflec-          Furthermore, this training approach can be naturally
tion involves not only reconsidering past decisions but also      enhanced by other supervised training strategies [1067],
                                                                                                                                  95

such as rejection sampling and directional preference op-          problems with specific answers are selected. The ground-
timization. In general, one can begin by warming up a LLM          truth answer is used to derive the reward scores, such as 1
through instruction tuning with long CoT instruction data          for a correct solution and 0 for an incorrect solution. This
and then use the model itself to generate rollout samples as       might seem counterintuitive: how can a complex reasoning
training data. These enhancements can have a certain effect,       system be effectively developed using such a simple reward
particularly when the amount of warmup instruction data            model? The explanation lies in the essence of RL: unlike su-
is limited. However, their impact tends to diminish when           pervised fine-tuning, it encourages the autonomous explo-
sufficient long CoT instruction data is available, especially if   rations of models through simple yet appropriate incentives.
the quality of self-generated samples is not superior to that      In this way, the complex reasoning capability can be well
of the demonstration data [1067]. These findings suggest           internalized within the model. In addition to the accuracy
that this advanced capability of a model may quickly reach a       reward, other simple rewards can be considered, including
performance ceiling when trained through supervised fine-          completeness, avoidance of excessively long texts, and other
tuning, due to the inherent limitations typical of imitation       formatting issues like repetition. OpenAI has proposed the
learning (for further discussion, see Section 9.8.4).              reinforcement fine-tuning (ReFT) [1075] approach for tuning
    Another downside of this fine-tuning method is its ten-        the o-series models to build domain-specific models, which
dency to default to long CoT reasoning mode even for sim-          also uses a simple accuracy reward to guide the training.
pler problems (See Example 19). To better manage reasoning         One limitation of this reward model is that it can only utilize
behavior, it is essential to explore systematic approaches         problems with definite and concise answers for training.
that integrate both long CoT reasoning mode and standard           More general task data, such as summarization, cannot be
response mode.                                                     directly used for training. In such cases, incorporating a
                                                                   trainable reward model becomes necessary. However, as
Scaling RL Training. Although OpenAI has not disclosed             we have discussed, once this reasoning mode is elicited
technical details about the o-series models, training meth-        in specific domains, it can naturally generalize well across
ods have been published through initiatives that imple-            different domains.
ment long chain-of-thought (CoT) reasoning systems, such
as DeepSeek-R1 [1070] and Kimi-K1.5 [1074], which have                 • RL Training. After configuring the policy and reward
demonstrated performance comparable to o1. The technical           models, suitable RL algorithms are selected to train the
methods employed converge on the approach of scaling RL            policy model [1076, 1077]. In Section 5.2, we provide a
training to enhance the complex reasoning capabilities of          detailed implementation of the PPO algorithm, which can
LLMs. In the following part, we introduce the detailed RL          be applied directly for training such models. Nonetheless,
method through three components: the policy model, the             PPO requires the maintenance and updating of an addi-
reward model, and the RL training algorithm.                       tional value model, which leads to high training costs,
    • Policy model. The policy model refers to the LLM that        especially when scaling RL training. As a result, existing ap-
needs to be enhanced by the complex reasoning capacities.          proaches [1070] often prefer more simplified RL algorithms,
Typically, it should be warmed up through supervised fine-         such as GRPO [1078] and RLOO [1079], which use heuristic
tuning with long CoT data, as outlined in the aforemen-            methods to eliminate the need of a value model. These algo-
tioned method. The main purpose of this warm-up is to              rithms typically exhibit higher efficiency and demonstrate
activate the long CoT reasoning mode, enabling the policy          strong training performance, especially in long CoT reason-
model to conduct appropriate explorations using a long-            ing. A critical factor to monitor during RL training is the
form thought process. It is also recommended that the policy       response length of the reasoning models, as a longer average
model possesses strong foundational capabilities, as this is       response length often corresponds to enhanced reasoning
crucial for eliciting high-reward actions in a more efficient      capabilities. Therefore, it is important to track the trends
way. An interesting attempt taken by DeepSeek-R1-Zero is           in average response lengths. With appropriate training, the
to omit the supervised fine-tuning step. Instead, it leverages     model should show progressively longer response lengths,
its strong instruction-following capacity to adhere to the         accompanied by simultaneous performance improvements.
response format and reasoning mode, guiding the model              In fact, response length is directly connected to the test-time
to generate formatted responses comprising two parts:              scaling law demonstrated by OpenAI56 . This law suggests
thought and answer. This method uses a format reward to            that as more output tokens are generated, a model’s rea-
reinforce the correct reasoning mode.                              soning performance can improve substantially. Nonetheless,
                                                                   achieving stable and effective RL training is challenging and
    • Reward model. To effectively guide the policy model,
                                                                   necessitates consideration of various factors, such as the
it is necessary to set an appropriate reward model in RL
                                                                   selection of query problems (e.g., choosing problems that
algorithms. As discussed in Section 5.2, RLHF employs a
                                                                   are challenging yet solvable by the model), the updating of
specially trained reward model to instruct the learning of
                                                                   the reference model (e.g., continually updating it as training
the policy model. However, this approach has become less
                                                                   progresses), and the enhancement of exploration strategies
effective for long CoT reasoning, given the difficulty of
                                                                   (e.g., sampling more responses with higher temperature
training reliable reward models to assess the quality of long
                                                                   settings).
CoT reasoning processes. Consequently, existing approaches
typically employ a verifiable reward model primarily built
on reference answers (e.g., mathematical problems) or test
samples (e.g., coding problems). Typically, the mathematical
domain serves as the major source of training data, where            56. https://openai.com/index/learning-to-reason-with-llms/
                                                                                                                                  96

9.8.4 Extended Discussion                                           been well covered in the existing literature. In particular,
In the preceding discussions, we have introduced the long           our survey has discussed four important aspects of LLMs,
CoT reasoning in technical detail. Actually, it can be con-         i.e., pre-training, adaptation, utilization, and evaluation. For
sidered a specific approach to achieve test-time scaling (a.k.a.,   each aspect, we highlight the techniques or findings that are
inference-time scaling), which is the focus of this subsection.     key to the success of LLMs. Furthermore, we also summa-
    From a broader perspective, test-time scaling encom-            rize the available resources for developing LLMs and dis-
passes various approaches that enhance model performance            cuss important implementation guidelines for reproducing
by increasing the outputs or computations from LLMs. In             LLMs. This survey tries to cover the most recent literature
this way, many methods can be considered test-time scaling          about LLMs and provides a good reference resource on this
techniques. For example, Self-Consistency [429] generates           topic for both researchers and engineers.
multiple responses and then aggregates the solutions using              Next, we summarize the discussions of this survey, and
majority vote, resulting in higher inference costs due to           introduce the challenges and future directions for LLMs, in
the increased number of rollouts. Additionally, planning            the following aspects.
techniques (Section 6.4) and their agentic instantiations (Sec-
tion 9.2) can also be considered test-time scaling approaches,      Basics and Principles. Instead of training on specific task
as they involve prompting LLMs multiple times and utiliz-           goals, LLMs learn from unsupervised pre-training on large-
ing tools or memory components. Therefore, the essence of           scale text data. This is quite different from previous multi-
test-time scaling is to trade additional inference costs for        task learning approaches, which aim to extend the training
performance gains. Unlike previous approaches, long CoT             tasks as possible to achieve sufficient generalization. Thus,
reasoning directly searches for solutions within the natural        it is essential to reveal the basic principles or elements that
language space, notably within a single response.                   establish the foundation of the abilities of LLMs. Although
    When comparing different test-time scaling methods,             the basic idea of language models is intuitive, it is still chal-
two critical factors require careful examination: token effi-       lenging to formally explain why LLMs trained by simple
ciency (the performance improvement per token cost) and             language modeling objectives (e.g., next token prediction)
performance ceiling (the maximum attainable performance).           can become capable of solving various real-world tasks.
Research has shown that scaling test-time computation can           To investigate this problem, a promising approach is to
effectively enhance model performance [1070, 1080] through          study the capacity learning (or selection) mechanism based
the use of simple aggregation methods or specially trained          on unsupervised pre-training, since the model capacity of
models, though token efficiency may vary. Overall, scaling          LLMs strongly depends on pre-training data. In addition,
RL training tends to exhibit higher token efficiency com-           scaling plays an important role in improving the capacity
pared to existing test-time scaling methods [1070]. Addi-           of LLMs [31, 55, 64], and it is very useful to conduct more
tionally, both heuristic methods and supervised fine-tuning         theoretical analysis about how the behaviors of large models
often exhibit a relatively limited performance ceiling that         relate to those of small models, e.g., what behaviors of large
cannot be substantially elevated once scaling reaches a             models can be inferred from small models and what can’t be
certain level [1067, 1080]. In contrast, scaling RL training can    predicted indeed. Another research direction is to explore
lead to continuous performance improvements in reasoning            more deep analysis on model generalization for LLMs,
models as training time increases. For example, DeepSeek-           since increasing concerns have been raised about whether
R1-Zero demonstrates a consistent upward trend in perfor-           LLMs can generalize beyond the knowledge encoded by
mance even after more than 8,000 training steps [1070].             pre-training data. Furthermore, data contamination has be-
    These scaling effects are crucial for solving complex           come a severe issue for fairly assessing the performance of
tasks. Notably, a potential advantage of long CoT reasoning         LLMs [740], and thus setting appropriate evaluation proto-
models is that they make it feasible to develop expert-level        col will be the basis to investigate and analyze the model
models in specialized domains or for specific tasks, which          capacity of LLMs.
could significantly impact the advancement of scientific
research challenges. Moreover, as inference methods and             Model Architecture. Due to the scalability and effective-
hardware techniques improve, the deployment and use cost            ness, Transformer has become the de facto architecture
of these models will be significantly reduced, enhancing            for building LLMs. Various strategies have been proposed
the contribution of these highly intelligent models to real-        to improve the performance of this architecture, such as
world applications. Additionally, addressing security issues        neural network configuration and scalable parallel training
in long CoT reasoning models is crucial. Given their unique         (see discussions in Section 4.2.2). However, Transformer
reasoning mode, specialized alignment strategies should be          still suffers from high training costs and slow inference
developed to ensure safer use of these models.                      rates. More efforts [270, 271] are still in need to develop
                                                                    improved model architectures for large-scale pre-training.
                                                                    Specially, system-level or hardware-level optimization (e.g.,
10    C ONCLUSION AND F UTURE D IRECTIONS                           FlashAttention [303]) is worth more exploration to improve
In this survey, we have reviewed the recent progress of large       the efficiency of Transformer architectures. In addition, as an
language models (LLMs), and introduced the key concepts,            important basic capacity, existing LLMs typically maintain
findings, and techniques for understanding and utilizing            a long context window. For example, the most recent GPT-4
LLMs. We focus on the large-sized models (i.e., having a size       Turbo enables a long context of 128K tokens, and Claude
larger than 10B) while excluding the contents of early pre-         2.1 also supports the input up to 200K tokens. Although
trained language models (e.g., BERT and GPT-2) that have            many efforts have been made to enhance the long context
                                                                                                                                97

modeling ability of LLMs [283, 943], the resulting mod-             cost of LLMs, especially in large-scale deployment. Another
els still can’t well process the information in the context         popular research direction is retrieval-augmented gener-
window [949]. To address this issue, specific architecture          ation, where retrieved contexts from supporting sources
adaptations or algorithms might be needed to enhance the            are included into prompts for task solving. It has been
modeling and utilization of long context information. An-           shown that retrieval augmentation can extend the knowl-
other worrying concern is that existing work mostly focuses         edge boundary and improve the question answering ca-
on training LLMs with decoder-only Transformers. Despite            pacity [454], but may suffer from the effectiveness of long
the effectiveness, it severely limits the more wide, diverse        context utilization by LLMs [949].
explorations on alternative model architectures.
                                                                    Safety and Alignment. Despite the capacities, LLMs are
Model Training. For pre-training, it is essential to establish      faced with great safety challenges in practical use. As a
a data-centric infrastructure and training procedure for LLM        fundamental issue of probabilistic modeling nature, LLMs
optimization, which can effectively support a systematic            exhibit a tendency to generate hallucinations [640], refer-
process of data collection, data cleaning, data mixture, and        ring to texts that seem plausible but may be factually
data curriculum. Furthermore, it also calls for more flexible       incorrect [46]. What is worse, LLMs might be elicited by
mechanisms of hardware support or resource schedule, so             intentional instructions to produce harmful, biased, or toxic
as to better organize and utilize the resources in a computing      texts for malicious systems, leading to the potential risks
cluster. In practice, it is very challenging to pre-train capable   of misuse [55, 66]. To have a detailed discussion of the
LLMs, due to the huge compute consumption and the                   safety issues of LLMs (e.g., privacy, overreliance, disinfor-
sensitivity to data quality and training tricks [78, 93]. Thus,     mation, and influence operations), the readers can refer to
it becomes particularly important to develop systemic, eco-         the GPT-3/4 technical reports [46, 55]. As the major tech-
nomical pre-training approaches for optimizing LLMs, e.g.,          nical approach to averting these issues, alignment methods
predictable scaling [46] and proxy model training [59]. More        (e.g., RLHF) [66, 116] have been widely used by leveraging
training recipes or principles should be investigated and           human feedback for developing well-aligned LLMs. How-
shared to reduce the potential risk of degradation or failure       ever, RLHF heavily relies on high-quality human feedback
in large-scale model optimization. Although increasingly            data from professional labelers, which is costly and time-
more model checkpoints and cleaned datasets have been               consuming to recruit qualified human annotators. There-
released, there still lacks reproducible work on pre-training       fore, it is necessary to improve the RLHF framework for
data preparation (e.g., detailed cleaning strategies) and data      reducing the efforts of human labelers and seek a more
scheduling (e.g., data mixture and curriculum). Since it is         efficient annotation approach with guaranteed data quality,
very costly to pre-train a LLM from scratch, it is important        e.g., LLMs can be employed to assist the labeling work.
to design suitable mechanisms for continually pre-training          Furthermore, it is also suggested to develop simplified
or fine-tuning the LLM based on publicly available model            optimization algorithms for alignment [388, 391], to reduce
checkpoints (e.g., LLaMA [57] and Flan-T5 [69]). For this           the training difficulty and unstability of RLHF. As another
purpose, a number of technical issues have to be resolved,          practical approach, red teaming [132, 367] has been adopted
e.g., catastrophic forgetting and task specialization. Further-     for improving the model safety of LLMs, which utilizes
more, it is also useful to develop effective tuning strategies      the collected adversarial prompts to refine the LLMs (i.e.,
that effectively inject or edit specific knowledge [674], e.g.,     avoiding the attacks from red teaming). In addition, privacy
correcting the outdated facts.                                      concerns are also important to consider when fine-tuning
                                                                    LLMs with domain-specific data, and thus federated based
Model Utilization. Based on the natural language inter-             learning [1081] can be useful in privacy-restricted scenarios.
face, prompting has become the prominent approach for
using LLMs to solving various tasks. By combining task              Application and Ecosystem. As LLMs have shown strong
descriptions and demonstration examples into prompts, in-           capacities in solving various tasks, they can be applied
context learning (ICL) endows LLMs with the ability to              in a broad range of real-world applications (i.e., following
perform well on new tasks, even outperforming full-data             task-specific natural language instructions). As a remarkable
fine-tuned models in some cases. To enhance the ability of          progress, ChatGPT has potentially changed the way how
complex reasoning, advanced prompting techniques have               humans access information, which has been additionally
been proposed, exemplified by the chain-of-thought (CoT)            integrated in the release of New Bing. Generally, in the
strategy, which includes the intermediate reasoning steps           near future, it can be foreseen that LLMs would have a
into prompts. Furthermore, planning is a promising ap-              significant impact on information-seeking techniques, in-
proach for solving complex tasks, which iteratively invokes         cluding both search engines and recommender systems.
LLMs by leveraging tool use capacities. Despite these ef-           Furthermore, LLMs make it possible to develop more intel-
forts, several basic problems related to prompting are still        ligent systems (e.g., autonomous AI agents) to tackle various
under-explored: why a good prompt can elicit the correct            complex tasks in real-world scenarios. Specially, Assistants
answer but a bad prompt cannot, how to reveal the working           API has been launched by OpenAI (featured by instructions,
principles of advanced prompting methods (e.g., ICL and             knowledge and tool use), enabling rapid development of
CoT) and further improve these existing approaches, and             agent-like assistants within the applications. This wave of
how to efficiently find the effective prompts for LLMs on           technical innovation would lead to an ecosystem of LLM-
specific tasks. Furthermore, from a practical perspective, it       empowered applications (e.g., OpenAI’s GPT Store), which
has become a fundamental challenge to reduce the inference          has a close connection with human life. Lastly, the rise of
                                                                                                                            98

LLMs sheds light on the exploration of artificial general         illustrative figure for the attention patterns for different
intelligence (AGI). It is promising to develop more smart AI      architectures in Figure 9, and add the detailed formulas
systems than ever. However, in this development process,          in Table 7.
AI safety should be one of the primary concerns, i.e., making   • Update on April 25, 2023: revise some copy errors in
AI lead to good for humanity but not bad [40].                    figures and tables.
                                                                • Update on April 27, 2023: add efficient tuning in Sec-
                                                                  tion 5.3.
C ODA                                                           • Update on April 28, 2023: revise Section 5.3.
It is not an easy job to write this long survey and update      • Update on May 7, 2023: revise Table 1, Table 2, and
its content with timely work. First of all, we would like to      some minor points.
sincerely thank the support from the readers and our team       • Update on June 29, 2023 (major revision):
members. We work very hard on this survey, and hope that          – Section 1: add Figure 1 for the trends of published
it can present a comprehensive, timely reference for LLMs.           LLM papers in arXiv;
                                                                  – Section 2: add Figure 4 for GPT’s evolution and the
Survey Writing. This survey was planned during a discus-
                                                                     corresponding discussion;
sion meeting held by our research team, and we aimed to
                                                                  – Section 3: add Figure 5 for LLaMA family and the
summarize the recent advances of large language models
                                                                     corresponding discussion;
as a highly readable report for our team members. The
                                                                  – Section 5: add latest discussion about the synthetic
first draft was finished on March 13, 2023, in which our
                                                                     data formatting of instruction tuning in Section 5.1.1,
team members tried their best to include the related stud-
                                                                     the empirical analysis for instruction tuning in Sec-
ies about LLMs in a relatively objective, comprehensive
                                                                     tion 5.1.4, parameter-efficient model adaptation in
way. Then, we have extensively revised the writing and
                                                                     Section 5.3 and memory-efficient adaptation in Sec-
contents in several passes. Due to the space limit, we can
                                                                     tion 5.3;
only include a fraction of existing LLMs in Figure 3 and
                                                                  – Section 6: add latest discussion about the underlying
Table 1, by setting the selection criterion. However, we set
                                                                     mechanism of ICL 6.2.3, planning for complex task
a more relaxed criterion for model selection on our GitHub
                                                                     solving in Section 6.4;
page (https://github.com/RUCAIBox/LLMSurvey), which
                                                                  – Section 7: update Table 14 for representative datasets
will be regularly maintained. We release the initial version
                                                                     for evaluating advanced abilities of LLMs, and em-
on March 31, 2023, the major revision on June 29, 2023,
                                                                     pirical ability evaluation in Section 7.4;
and second version on September 10, 2023, and this latest
                                                                  – Section 6.1.1: add prompt design;
version (major revision) on November 23, 2023.
                                                                  – Section 8: add the discussions on applications of
Seeking for Advice. Despite all our efforts, this survey             LLMs in finance and scientific research domains;
is still far from perfect: we are likely to miss important      • Update on September 10, 2023 (major revision):
references or topics, and might also have non-rigorous            – Claim the copyrights of the figures and tables in this
expressions or discussions. We will continuously update              paper.
this survey, and improve the quality as much as we can.           – Add latest LLMs, techniques and their descriptions in
For us, survey writing is also a learning process for LLMs           Section 3, Section 4, Section 5, Section 6 and Section 7;
by ourselves. For readers with constructive suggestions to        – Section 4: add latest discussion about the decoding
improve this survey, you are welcome to leave comments on            strategy in Section 4.2.4;
the GitHub page of our survey or directly email our authors.      – Section 5: add latest discussion about the practical
We will make revisions following the received comments               tricks for instruction tuning in Section 5.1.2, the
or suggestions in a future version, and acknowledge the              empirical analysis on LLaMA (13B) for instruction
readers who have contributed constructive suggestions in             tuning in Section 5.1.4, practical strategies for RLHF
our survey.                                                          in Section 5.2.3, alignment without RLHF in Sec-
Update log. In this part, we regularly maintain an update            tion 5.2.4 and remarks on SFT and RLHF in Sec-
log for the submissions of this survey to arXiv:                     tion 5.2.5;
                                                                  – Section 6: update the content about the planning for
  • First release on March 31, 2023: the initial version.
                                                                     complex task solving in Section 6.4;
  • Update on April 9, 2023: add the affiliation information,
                                                                  – Section 7: add discussions about evaluation ap-
     revise Figure 3 and Table 1 and clarify the correspond-
                                                                     proaches in Section 7.3.2, Table 15 for the category
     ing selection criterion for LLMs, improve the writing,
                                                                     of existing evaluation work, and update empirical
     and correct some minor errors.
                                                                     ability evaluation in Section 7.4 and the results on
  • Update on April 11, 2023: correct the errors for library
                                                                     Table 16;
     resources.
                                                                  – Section 6.1.1: add new prompt examples in Table 12;
  • Update on April 12, 2023: revise Figure 3 and Table 1,
                                                                • Update on November 23, 2023 (major revision):
     and clarify the release date of LLMs.
  • Update on April 16, 2023: add a new Section 2.2 about         – Section 1: add Figure 2 for the evolution process of
     the technical evolution of GPT-series models.                   four generations of language models;
  • Update on April 24, 2023: add the discussion about            – Section 2: add more discussion about scaling laws
     scaling laws and add some explanations about the                and how emergent abilities relate to scaling laws;
     model sizes for emergent abilities (Section 2.1); add an     – Section 3: add latest LLMs in Figure 3 and Table 1,
                                                                                                                             99

      latest APIs in Section 3.1, commonly used datasets       engineering (Table 17). Due to the limit of computational
      for instruction tuning and alignment tuning in Sec-      resources, our experiments are not complete, limited to
      tion 3.3, and several libraries in Section 3.4;          small-sized models or a few comparisons. Despite that, we
    – Section 4: add latest discussion about the data          feel that it might be meaningful to share the partial results to
      scheduling, including data mixtures and data cur-        the public. We will try to include the missing results of larger
      riculum in Section 4.1.3; add summary of data prepa-     models or more comparisons in the future versions. We also
      ration in Section 4.1.4; add discussion about model-     call for support of computing power for conducting more
      ing long context in Section 9.1; add discussion about    comprehensive experiments.
      decoding efficiency issues and add latest decoding
                                                               Chinese Book. We also released a Chinese book based on
      strategies in Section 4.2.4;
                                                               this survey article, at the link: https://llmbook-zh.github.io.
    – Section 5: add latest discussion about instance con-
                                                               This book is in the publication process.
      struction and tuning strategies in Section 5.1; add
      latest discussion about process-supervised RLHF in
      Section 5.2.3, and the empirical study on quantized      ACKNOWLEDGMENTS
      LLaMA models (7B and 13B) in Section 9.5.1;              The authors would like to thank Yankai Lin and Yutao Zhu
    – Section 6: add latest discussion about prompt op-        for proofreading this paper. Since the first release of this
      timization in Section 6.1.2, and update the content      paper, we have received a number of valuable comments
      about chain-of-thought prompting in Section 6.3;         from the readers. We sincerely thank the readers who have
    – Section 8: add latest discussion about LLM for re-       written to us with constructive suggestions and comments:
      search directions in Section 8.1;                        Tyler Suard, Damai Dai, Liang Ding, Stella Biderman,
    – Section 10: revise the content in the several aspects.   Kevin Gray, Jay Alammar, Yubo Feng, Mark Holmstrom,
  • Update on September 25, 2024:                              Xingdong Liu, Il-Seok Oh, Yiting Liu, Shaojun Wang,
    – Section 3: reorganize the content of “public available   Gaoyan Ou, Todd Morrill, Hao Liu, Zhenyu Zhang, and
      model checkpoints” into multiple series; add the         Xinlin Zhuang.
      latest LLMs in Figure 3.
    – Section 4: add LLM-based data filtering and selec-           Since the v11 version (June 29, 2023), we have been
      tion methods in Section 4.1.2; update Section 4.2.1,     adding a large number of experiments and prompt prac-
      “Emergent Architectures” to include more discus-         tices. These new contents are completed by a number of
      sions about SSM-based architectures; add Table 6         volunteers in our team. Here, we add a special part to thank
      to compare parallelism and complexity of different       all the students who have worked very hard on this part
      architectures.                                           (also including the ones on our author list).
    – Section 5: add latest discussion about instruction
                                                               Contribution on Experiments. We would like to sincerely
      quality improvement and instruction selection in
                                                               thank the following people for their hard work involved in
      Section 5.1.1; add latest discussion about practical
                                                               experiments shown in Table 16.
      strategies for RLHF and process-supervised RLHF
                                                                   • Xiaoxue Cheng: implement the experiments for evalu-
      in Section 5.2.3; update the content about supervised
                                                               ation on Language Generation and HaluEval tasks.
      alignment tuning in Section 5.2.4.
                                                                   • Yuhao Wang: implement the experiments for evalua-
    – Section 6: add latest papers about discrete prompt
                                                               tion on interaction with environment tasks.
      optimization in Section 6.1.2.
                                                                   • Bowen Zheng: implement the experiments for evalua-
    – Section 9: add latest discussion about advanced
                                                               tion on tool manipulation tasks.
      topics, including long context modeling, LLM-
      based agent, analysis and optimization for training      Contribution on Tips. We list the following guys for their
      and inference, model inference, model compression,       contributions on the corresponding numbers of provided
      retrieval-augmented generation, and hallucination.       tips for designing prompts in Table 12.
  • Update on October 12, 2024:                                    • Xiaolei Wang: T3, O3
    – Section 8.1.5: correct the errors.                           • Beichen Zhang: D2, D5
                                                                   • Zhipeng Chen: D3, D4
  • Update on March 11, 2025:
                                                                   • Junjie Zhang: D6
    – Section 9.8: add latest papers about long CoT rea-           • Bowen Zheng: D7
      soning, including the analysis of reasoning patterns         • Zican Dong: D8
      and advantages, construction of long CoT data (i.e.,         • Xinyu Tang: C2
      distillation, search-based, and multi-agent collabora-       • Yifan Du: T4
      tion), and training methods (i.e., instruction tuning        • Tianyi Tang: O6, O7, D9
      and reinforcement learning).                                 • Yupeng Hou: O8, C3
  • Update on February 27, 2026:                                   • Salvatore Raieli: C4
    – Figure 3: update with new LLMs.

Clarifications on Experiments. In this version, we have        R EFERENCES
included a number experiments on instruction-tuning (Ta-       [1]    Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A
ble 10), overall ability evaluation (Table 16), and prompt            neural probabilistic language model,” J. Mach. Learn.
                                                                      Res., vol. 3, pp. 1137–1155, 2003.
                                                                                                                                100

[2]    R. Collobert, J. Weston, L. Bottou, M. Karlen,                     ISCA, 2010, pp. 1045–1048.
       K. Kavukcuoglu, and P. P. Kuksa, “Natural language          [18]   S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget,
       processing (almost) from scratch,” J. Mach. Learn.                 “Recurrent neural network based language modeling
       Res., vol. 12, pp. 2493–2537, 2011.                                in meeting recognition,” in INTERSPEECH 2011, 12th
[3]    S. Pinker, The Language Instinct: How the Mind Creates             Annual Conference of the International Speech Commu-
       Language.      Brilliance Audio; Unabridged edition,               nication Association, Florence, Italy, August 27-31, 2011.
       2014.                                                              ISCA, 2011, pp. 2877–2880.
[4]    M. D. Hauser, N. Chomsky, and W. T. Fitch, “The             [19]   T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
       faculty of language: what is it, who has it, and how               J. Dean, “Distributed representations of words and
       did it evolve?” science, vol. 298, no. 5598, pp. 1569–             phrases and their compositionality,” in Advances in
       1579, 2002.                                                        Neural Information Processing Systems 26: 27th Annual
[5]    A. M. Turing, “Computing machinery and intelli-                    Conference on Neural Information Processing Systems
       gence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950.                2013. Proceedings of a meeting held December 5-8, 2013,
[6]    F. Jelinek, Statistical Methods for Speech Recognition.            Lake Tahoe, Nevada, United States, C. J. C. Burges,
       MIT Press, 1998.                                                   L. Bottou, Z. Ghahramani, and K. Q. Weinberger,
[7]    J. Gao and C. Lin, “Introduction to the special issue              Eds., 2013, pp. 3111–3119.
       on statistical language modeling,” ACM Trans. Asian         [20]   T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Ef-
       Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.               ficient estimation of word representations in vector
[8]    R. Rosenfeld, “Two decades of statistical language                 space,” in 1st International Conference on Learning Rep-
       modeling: Where do we go from here?” Proceedings                   resentations, ICLR 2013, Scottsdale, Arizona, USA, May
       of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.                  2-4, 2013, Workshop Track Proceedings, Y. Bengio and
[9]    A. Stolcke, “Srilm-an extensible language modeling                 Y. LeCun, Eds., 2013.
       toolkit,” in Seventh international conference on spoken     [21]   M. E. Peters, M. Neumann, M. Iyyer, M. Gardner,
       language processing, 2002.                                         C. Clark, K. Lee, and L. Zettlemoyer, “Deep contex-
[10]   X. Liu and W. B. Croft, “Statistical language mod-                 tualized word representations,” in Proceedings of the
       eling for information retrieval,” Annu. Rev. Inf. Sci.             2018 Conference of the North American Chapter of the
       Technol., vol. 39, no. 1, pp. 1–31, 2005.                          Association for Computational Linguistics: Human Lan-
[11]   C. Zhai, Statistical Language Models for Information Re-           guage Technologies, NAACL-HLT 2018, New Orleans,
       trieval, ser. Synthesis Lectures on Human Language                 Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers),
       Technologies. Morgan & Claypool Publishers, 2008.                  M. A. Walker, H. Ji, and A. Stent, Eds. Association
[12]   S. M. Thede and M. P. Harper, “A second-order                      for Computational Linguistics, 2018, pp. 2227–2237.
       hidden markov model for part-of-speech tagging,”            [22]   A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
       in 27th Annual Meeting of the Association for Computa-             L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,
       tional Linguistics, University of Maryland, College Park,          “Attention is all you need,” in Advances in Neural In-
       Maryland, USA, 20-26 June 1999, R. Dale and K. W.                  formation Processing Systems 30: Annual Conference on
       Church, Eds. ACL, 1999, pp. 175–182.                               Neural Information Processing Systems 2017, December
[13]   L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mer-            4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
       cer, “A tree-based statistical language model for nat-      [23]   J. Devlin, M. Chang, K. Lee, and K. Toutanova,
       ural language speech recognition,” IEEE Transactions               “BERT: pre-training of deep bidirectional transform-
       on Acoustics, Speech, and Signal Processing, vol. 37,              ers for language understanding,” in Proceedings of
       no. 7, pp. 1001–1008, 1989.                                        the 2019 Conference of the North American Chapter of
[14]   T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean,             the Association for Computational Linguistics: Human
       “Large language models in machine translation,”                    Language Technologies, NAACL-HLT 2019, Minneapolis,
       in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint                 MN, USA, June 2-7, 2019, Volume 1 (Long and Short
       Conference on Empirical Methods in Natural Language                Papers), J. Burstein, C. Doran, and T. Solorio, Eds.
       Processing and Computational Natural Language Learn-               Association for Computational Linguistics, 2019, pp.
       ing, June 28-30, 2007, Prague, Czech Republic, J. Eisner,          4171–4186.
       Ed. ACL, 2007, pp. 858–867.                                 [24]   M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo-
[15]   S. M. Katz, “Estimation of probabilities from sparse               hamed, O. Levy, V. Stoyanov, and L. Zettlemoyer,
       data for the language model component of a speech                  “BART: denoising sequence-to-sequence pre-training
       recognizer,” IEEE Trans. Acoust. Speech Signal Process.,           for natural language generation, translation, and
       vol. 35, no. 3, pp. 400–401, 1987.                                 comprehension,” in Proceedings of the 58th Annual
[16]   W. A. Gale and G. Sampson, “Good-turing frequency                  Meeting of the Association for Computational Linguistics,
       estimation without tears,” J. Quant. Linguistics, vol. 2,          ACL 2020, Online, July 5-10, 2020, 2020, pp. 7871–
       no. 3, pp. 217–237, 1995.                                          7880.
[17]   T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and      [25]   W. Fedus, B. Zoph, and N. Shazeer, “Switch trans-
       S. Khudanpur, “Recurrent neural network based lan-                 formers: Scaling to trillion parameter models with
       guage model,” in INTERSPEECH 2010, 11th Annual                     simple and efficient sparsity,” J. Mach. Learn. Res, pp.
       Conference of the International Speech Communication               1–40, 2021.
       Association, Makuhari, Chiba, Japan, September 26-30,       [26]   A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
       2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds.               I. Sutskever et al., “Language models are unsuper-
                                                                                                                                   101

       vised multitask learners,” OpenAI blog, p. 9, 2019.                   Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun,
[27]   Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,                   “A comprehensive survey on pretrained foundation
       O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov,                   models: A history from BERT to chatgpt,” CoRR, vol.
       “Roberta: A robustly optimized BERT pretraining                       abs/2302.09419, 2023.
       approach,” CoRR, vol. abs/1907.11692, 2019.                    [38]   X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo,
[28]   V. Sanh, A. Webson, C. Raffel, S. H. Bach,                            J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han,
       L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler,                    M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu,
       A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S.                 X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao,
       Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.                     and J. Zhu, “Pre-trained models: Past, present and
       Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang,                      future,” AI Open, vol. 2, pp. 225–250, 2021.
       M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw-             [39]   X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang,
       den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San-                 “Pre-trained models for natural language processing:
       tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi-          A survey,” CoRR, vol. abs/2003.08271, 2020.
       derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask            [40]   S. Altman, “Planning for agi and beyond,” OpenAI
       prompted training enables zero-shot task generaliza-                  Blog, February 2023.
       tion,” in The Tenth International Conference on Learning       [41]   S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke,
       Representations, ICLR 2022, Virtual Event, April 25-29,               E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li,
       2022. OpenReview.net, 2022.                                           S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and
[29]   T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W.                    Y. Zhang, “Sparks of artificial general intelligence:
       Chung, I. Beltagy, J. Launay, and C. Raffel, “What                    Early experiments with gpt-4,” vol. abs/2303.12712,
       language model architecture and pretraining objec-                    2023.
       tive works best for zero-shot generalization?” in              [42]   S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal,
       International Conference on Machine Learning, ICML                    S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra,
       2022, 17-23 July 2022, Baltimore, Maryland, USA, ser.                 Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary,
       Proceedings of Machine Learning Research, vol. 162,                   S. Som, X. Song, and F. Wei, “Language is not all you
       2022, pp. 22 964–22 984.                                              need: Aligning perception with language models,”
[30]   J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown,                   CoRR, vol. abs/2302.14045, 2023.
       B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and            [43]   Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and
       D. Amodei, “Scaling laws for neural language mod-                     L. Sun, “A comprehensive survey of ai-generated
       els,” CoRR, vol. abs/2001.08361, 2020.                                content (aigc): A history of generative ai from gan
[31]   J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph,                     to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
       S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou,                   [44]   D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdh-
       D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals,                      ery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu
       P. Liang, J. Dean, and W. Fedus, “Emergent                            et al., “Palm-e: An embodied multimodal language
       abilities of large language models,” CoRR, vol.                       model,” arXiv preprint arXiv:2303.03378, 2023.
       abs/2206.07682, 2022.                                          [45]   C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and
[32]   M. Shanahan, “Talking about large language mod-                       N. Duan, “Visual chatgpt: Talking, drawing and edit-
       els,” CoRR, vol. abs/2212.03551, 2022.                                ing with visual foundation models,” arXiv preprint
[33]   J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi,                  arXiv:2303.04671, 2023.
       Q. Le, and D. Zhou, “Chain of thought prompting                [46]   OpenAI, “Gpt-4 technical report,” OpenAI, 2023.
       elicits reasoning in large language models,” CoRR,             [47]   Y. Fu, H. Peng, and T. Khot, “How does gpt obtain
       vol. abs/2201.11903, 2022.                                            its ability? tracing emergent abilities of language
[34]   J.     Hoffmann,        S.     Borgeaud,    A.    Mensch,             models to their sources,” Yao Fu’s Notion, Dec 2022.
       E.      Buchatskaya,        T.    Cai,   E.    Rutherford,     [48]   J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained
       D. de Las Casas, L. A. Hendricks, J. Welbl,                           language model for text generation: A survey,” in
       A. Clark, T. Hennigan, E. Noland, K. Millican,                        Proceedings of the Thirtieth International Joint Confer-
       G. van den Driessche, B. Damoc, A. Guy, S. Osindero,                  ence on Artificial Intelligence, IJCAI 2021, Virtual Event
       K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and                     / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed.
       L. Sifre, “Training compute-optimal large language                    ijcai.org, 2021, pp. 4492–4499.
       models,” vol. abs/2203.15556, 2022.                            [49]   P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A
[35]   R. Taylor, M. Kardas, G. Cucurull, T. Scialom,                        survey of deep learning for mathematical reason-
       A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and                  ing,” CoRR, vol. abs/2212.10535, 2022.
       R. Stojnic, “Galactica: A large language model for             [50]   Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang,
       science,” CoRR, vol. abs/2211.09085, 2022.                            X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in-
[36]   P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and                     context learning,” CoRR, vol. abs/2301.00234, 2023.
       G. Neubig, “Pre-train, prompt, and predict: A sys-             [51]   J. Huang and K. C. Chang, “Towards reasoning
       tematic survey of prompting methods in natural                        in large language models: A survey,” CoRR, vol.
       language processing,” ACM Comput. Surv., pp. 195:1–                   abs/2212.10403, 2022.
       195:35, 2023.                                                  [52]   S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng,
[37]   C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang,                        C. Tan, F. Huang, and H. Chen, “Reasoning with
       K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu,                language model prompting: A survey,” CoRR, vol.
                                                                                                                              102

       abs/2212.09597, 2022.                                            preprint arXiv:2305.16264, 2023.
[53]   J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang,           [62]   I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu,
       “Chatgpt: potential, prospects, and limitations,” in             A. Mueller, N. Kim, S. Bowman, and E. Perez, “The
       Frontiers of Information Technology & Electronic Engi-           inverse scaling prize,” 2022. [Online]. Available:
       neering, 2023, pp. 1–6.                                          https://github.com/inverse-scaling/prize
[54]   W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense         [63]   B. A. Huberman and T. Hogg, “Phase transitions in
       text retrieval based on pretrained language models:              artificial intelligence systems,” Artificial Intelligence,
       A survey,” ACM Transactions on Information Systems,              vol. 33, no. 2, pp. 155–171, 1987.
       vol. 42, no. 4, pp. 1–60, 2024.                           [64]   J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff-
[55]   T. B. Brown, B. Mann, N. Ryder, M. Subbiah,                      mann, H. F. Song, J. Aslanides, S. Henderson,
       J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,                R. Ring, S. Young, E. Rutherford, T. Hennigan,
       G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss,               J. Menick, A. Cassirer, R. Powell, G. van den
       G. Krueger, T. Henighan, R. Child, A. Ramesh,                    Driessche, L. A. Hendricks, M. Rauh, P. Huang,
       D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen,              A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Ue-
       E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark,               sato, J. Mellor, I. Higgins, A. Creswell, N. McAleese,
       C. Berner, S. McCandlish, A. Radford, I. Sutskever,              A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya,
       and D. Amodei, “Language models are few-shot                     D. Budden, E. Sutherland, K. Simonyan, M. Paganini,
       learners,” in Advances in Neural Information Processing          L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Ne-
       Systems 33: Annual Conference on Neural Information              matzadeh, E. Gribovskaya, D. Donato, A. Lazaridou,
       Processing Systems 2020, NeurIPS 2020, December 6-12,            A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grig-
       2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell,            orev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen,
       M. Balcan, and H. Lin, Eds., 2020.                               Z. Gong, D. Toyama, C. de Masson d’Autume,
[56]   A. Chowdhery, S. Narang, J. Devlin, M. Bosma,                    Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark,
       G. Mishra, A. Roberts, P. Barham, H. W. Chung,                   D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J.
       C. Sutton, S. Gehrmann, P. Schuh, K. Shi,                        Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel,
       S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes,                  W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell,
       Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,              C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett,
       B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Is-           D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling
       ard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghe-           language models: Methods, analysis & insights from
       mawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,              training gopher,” CoRR, vol. abs/2112.11446, 2021.
       K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan,     [65]   D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei,
       H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Do-               “Why can GPT learn in-context? language models se-
       han, S. Agrawal, M. Omernick, A. M. Dai, T. S.                   cretly perform gradient descent as meta-optimizers,”
       Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,           CoRR, vol. abs/2212.10559, 2022.
       O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta,           [66]   L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wain-
       M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-                 wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
       Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel,            A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller,
       “Palm: Scaling language modeling with pathways,”                 M. Simens, A. Askell, P. Welinder, P. F. Christiano,
       CoRR, vol. abs/2204.02311, 2022.                                 J. Leike, and R. Lowe, “Training language models
[57]   H. Touvron, T. Lavril, G. Izacard, X. Martinet,                  to follow instructions with human feedback,” CoRR,
       M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham-           vol. abs/2203.02155, 2022.
       bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and     [67]   J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu,
       G. Lample, “Llama: Open and efficient foundation                 B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned
       language models,” CoRR, 2023.                                    language models are zero-shot learners,” in The Tenth
[58]   T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse,              International Conference on Learning Representations,
       J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray            ICLR 2022, Virtual Event, April 25-29, 2022. Open-
       et al., “Scaling laws for autoregressive generative              Review.net, 2022.
       modeling,” arXiv preprint arXiv:2010.14701, 2020.         [68]   R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer,
[59]   S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu,               A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker,
       P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi:                Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri,
       Optimizing data mixtures speeds up language model                M. Menegali, Y. Huang, M. Krikun, D. Lepikhin,
       pretraining,” arXiv preprint arXiv:2305.10429, 2023.             J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma,
[60]   P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu,                Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pick-
       M. Hobbhahn, and A. Ho, “Will we run out of                      ett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi,
       data? an analysis of the limits of scaling datasets in           R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen,
       machine learning,” CoRR, vol. abs/2211.04325, 2022.              V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson,
       [Online]. Available: https://doi.org/10.48550/arXiv.             A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Ra-
       2211.04325                                                       jakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fen-
[61]   N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao,                ton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-
       A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel,          Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le,
       “Scaling data-constrained language models,” arXiv                “Lamda: Language models for dialog applications,”
                                                                                                                            103

       CoRR, vol. abs/2201.08239, 2022.                                  den, S. Bekman, A. McMillan-Major, I. Beltagy,
[69]   H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay,                 H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,
       W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma,                 H. Laurençon, Y. Jernite, J. Launay, M. Mitchell,
       A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen,                  C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji,
       A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y.                  A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou,
       Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H.                C. Emezue, C. Klamm, C. Leong, D. van Strien,
       Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le,           D. I. Adelani, and et al., “BLOOM: A 176b-parameter
       and J. Wei, “Scaling instruction-finetuned language               open-access multilingual language model,” CoRR,
       models,” CoRR, vol. abs/2210.11416, 2022.                         vol. abs/2211.05100, 2022.
[70]   A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb,         [79]   P. F. Christiano, J. Leike, T. B. Brown, M. Martic,
       A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta,             S. Legg, and D. Amodei, “Deep reinforcement learn-
       A. Garriga-Alonso, A. Kluska, A. Lewkowycz,                       ing from human preferences,” in Advances in Neural
       A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W.                  Information Processing Systems 30: Annual Conference
       Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish,              on Neural Information Processing Systems 2017, Decem-
       A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane,              ber 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von
       A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller,          Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N.
       A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang,               Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299–
       A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli,             4307.
       A. Venkatesh, A. Gholamidavoodi, A. Tabassum,              [80]   T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu,
       A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sab-               M. Lomeli, L. Zettlemoyer, N. Cancedda, and
       harwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas,               T. Scialom, “Toolformer: Language models can teach
       and et al., “Beyond the imitation game: Quantifying               themselves to use tools,” CoRR, vol. abs/2302.04761,
       and extrapolating the capabilities of language mod-               2023.
       els,” CoRR, vol. abs/2206.04615, 2022.                     [81]   R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang,
[71]   R. Schaeffer, B. Miranda, and S. Koyejo, “Are emer-               C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saun-
       gent abilities of large language models a mirage?”                ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger,
       arXiv preprint arXiv:2304.15004, 2023.                            K. Button, M. Knight, B. Chess, and J. Schulman,
[72]   S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao,                  “Webgpt: Browser-assisted question-answering with
       Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun,              human feedback,” CoRR, vol. abs/2112.09332, 2021.
       “Unlock predictable scaling from emergent abilities,”      [82]   C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
       2023.                                                             M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
[73]   A. Power, Y. Burda, H. Edwards, I. Babuschkin, and                the limits of transfer learning with a unified text-
       V. Misra, “Grokking: Generalization beyond overfit-               to-text transformer,” J. Mach. Learn. Res., pp. 140:1–
       ting on small algorithmic datasets,” arXiv preprint               140:67, 2020.
       arXiv:2201.02177, 2022.                                    [83]   L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-
[74]   J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He,                  Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A
       “Deepspeed: System optimizations enable training                  massively multilingual pre-trained text-to-text trans-
       deep learning models with over 100 billion param-                 former,” in Proceedings of the 2021 Conference of the
       eters,” in KDD, 2020, pp. 3505–3506.                              North American Chapter of the Association for Com-
[75]   M. Shoeybi, M. Patwary, R. Puri, P. LeGresley,                    putational Linguistics: Human Language Technologies,
       J. Casper, and B. Catanzaro, “Megatron-lm: Train-                 NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp.
       ing multi-billion parameter language models using                 483–498.
       model parallelism,” CoRR, vol. abs/1909.08053, 2019.       [84]   W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang,
[76]   D. Narayanan, M. Shoeybi, J. Casper, P. LeGres-                   X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li,
       ley, M. Patwary, V. Korthikanti, D. Vainbrand,                    Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo,
       P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phan-               Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi,
       ishayee, and M. Zaharia, “Efficient large-scale lan-              F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang,
       guage model training on GPU clusters using                        Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan,
       megatron-lm,” in International Conference for High Per-           Y. Wang, X. Jin, Q. Liu, and Y. Tian, “Pangu-α: Large-
       formance Computing, Networking, Storage and Analysis,             scale autoregressive pretrained chinese language
       SC 2021, St. Louis, Missouri, USA, November 14-19,                models with auto-parallel computation,” CoRR, vol.
       2021. ACM, 2021, p. 58.                                           abs/2104.12369, 2021.
[77]   V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. An-       [85]   Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao,
       dersch, M. Shoeybi, and B. Catanzaro, “Reducing ac-               Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai,
       tivation recomputation in large transformer models,”              G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu,
       CoRR, vol. abs/2205.05198, 2022.                                  X. Zhu, and M. Sun, “CPM-2: large-scale cost-
[78]   T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic,                effective pre-trained language models,” CoRR, vol.
       D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon,                abs/2106.10715, 2021.
       M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Web-        [86]   E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang,
       son, P. S. Ammanamanchi, T. Wang, B. Sagot,                       Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An
       N. Muennighoff, A. V. del Moral, O. Ruwase, R. Baw-               open large language model for code with mtulti-turn
                                                                                                                             104

       program synthesis,” arXiv preprint arXiv:2203.13474,            P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura,
       2022.                                                           X. Li, B. O’Horo, G. Pereyra, J. Wang, C. Dewan,
[87]   S. Black, S. Biderman, E. Hallahan, Q. Anthony,                 A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov,
       L. Gao, L. Golding, H. He, C. Leahy, K. McDonell,               “OPT-IML: scaling language model instruction meta
       J. Phang, M. Pieler, U. S. Prashanth, S. Purohit,               learning through the lens of generalization,” CoRR,
       L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt-            vol. abs/2212.12017, 2022.
       neox-20b: An open-source autoregressive language          [96]  S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley,
       model,” CoRR, vol. abs/2204.06745, 2022.                        K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit,
[88]   Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi,              U. S. Prashanth, E. Raff et al., “Pythia: A suite for
       A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran,              analyzing large language models across training and
       A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis,             scaling,” arXiv preprint arXiv:2304.01373, 2023.
       H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuz-    [97]  E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and
       nia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi,              Y. Zhou, “Codegen2: Lessons for training llms on
       M. Parmar, M. Purohit, N. Varshney, P. R. Kaza,                 programming and natural languages,” CoRR, vol.
       P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K.                 abs/2305.02309, 2023.
       Sampat, S. Mishra, S. R. A, S. Patro, T. Dixit, and       [98]  R. Li, L. B. Allal, Y. Zi, N. Muennighoff,
       X. Shen, “Super-naturalinstructions: Generalization             D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li,
       via declarative instructions on 1600+ NLP tasks,” in            J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo,
       Proceedings of the 2022 Conference on Empirical Methods         T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier,
       in Natural Language Processing, EMNLP 2022, Abu                 J. Monteiro, O. Shliazhko, N. Gontier, N. Meade,
       Dhabi, United Arab Emirates, December 7-11, 2022,               A. Zebaze, M. Yee, L. K. Umapathi, J. Zhu, B. Lipkin,
       2022, pp. 5085–5109.                                            M. Oblokulov, Z. Wang, R. M. V, J. Stillerman,
[89]   Y. Tay, M. Dehghani, V. Q. Tran, X. Garcı́a, J. Wei,            S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey,
       X. Wang, H. W. Chung, D. Bahri, T. Schuster,                    Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu,
       H. Zheng, D. Zhou, N. Houlsby, and D. Metzler,                  S. Singh, S. Luccioni, P. Villegas, M. Kunakov,
       “Ul2: Unifying language learning paradigms,” 2022.              F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding,
[90]   S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen,             C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao,
       S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin,                M. Mishra, A. Gu, J. Robinson, C. J. Anderson,
       T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig,         B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried,
       P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer,           D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes,
       “OPT: open pre-trained transformer language mod-                T. Wolf, A. Guha, L. von Werra, and H. de Vries,
       els,” CoRR, vol. abs/2205.01068, 2022.                          “Starcoder: may the source be with you!” CoRR,
[91]   M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad,           vol. abs/2305.06161, 2023. [Online]. Available:
       K. Heafield, K. Heffernan, E. Kalbassi, J. Lam,                 https://doi.org/10.48550/arXiv.2305.06161
       D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen-           [99]  H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma-
       zek, A. Youngblood, B. Akula, L. Barrault, G. M.                hairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava,
       Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R.            S. Bhosale et al., “Llama 2: Open foundation and fine-
       Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews,             tuned chat models,” arXiv preprint arXiv:2307.09288,
       N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao,              2023.
       V. Goswami, F. Guzmán, P. Koehn, A. Mourachko,           [100] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv,
       C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “No              D. Pan, D. Wang, D. Yan, F. Yang et al., “Baichuan
       language left behind: Scaling human-centered ma-                2: Open large-scale language models,” arXiv preprint
       chine translation,” CoRR, vol. abs/2207.04672, 2022.            arXiv:2309.10305, 2023.
[92]   Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue,       [101] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan,
       Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex:             W. Ge, Y. Han, F. Huang et al., “Qwen technical
       A pre-trained model for code generation with multi-             report,” arXiv preprint arXiv:2309.16609, 2023.
       lingual evaluations on humaneval-x,” arXiv preprint       [102] X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan,
       arXiv:2303.17568, 2023.                                         P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open
[93]   A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding,               llm and how to train it with $100 k budget,” arXiv
       Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma,             preprint arXiv:2309.03852, 2023.
       Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and          [103] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang,
       J. Tang, “GLM-130B: an open bilingual pre-trained               B. Li, C. Cheng, W. Lü, R. Hu et al., “Skywork:
       model,” vol. abs/2210.02414, 2022.                              A more open bilingual foundation model,” arXiv
[94]   N. Muennighoff, T. Wang, L. Sutawika, A. Roberts,               preprint arXiv:2310.19341, 2023.
       S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X.       [104] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat,
       Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji,              Y. Huang, M. Krikun, N. Shazeer, and Z. Chen,
       K. Almubarak, S. Albanie, Z. Alyafeai, A. Web-                  “Gshard: Scaling giant models with conditional com-
       son, E. Raff, and C. Raffel, “Crosslingual general-             putation and automatic sharding,” in 9th International
       ization through multitask finetuning,” CoRR, vol.               Conference on Learning Representations, ICLR 2021, Vir-
       abs/2211.01786, 2022.                                           tual Event, Austria, May 3-7, 2021, 2021.
[95]   S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig,   [105] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
                                                                                                                             105

        de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,               Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S.
        N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger,              Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V.
        M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,             Le, Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient
        S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser,                scaling of language models with mixture-of-experts,”
        M. Bavarian, C. Winter, P. Tillet, F. P. Such,                    in International Conference on Machine Learning, ICML
        D. Cummings, M. Plappert, F. Chantzis, E. Barnes,                 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022,
        A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino,                 pp. 5547–5569.
        N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain,     [113]   S. Smith, M. Patwary, B. Norick, P. LeGresley,
        W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam,           S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye,
        V. Misra, E. Morikawa, A. Radford, M. Knight,                     G. Zerveas, V. Korthikanti, E. Zheng, R. Child,
        M. Brundage, M. Murati, K. Mayer, P. Welinder,                    R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi,
        B. McGrew, D. Amodei, S. McCandlish, I. Sutskever,                Y. He, M. Houston, S. Tiwary, and B. Catanzaro,
        and W. Zaremba, “Evaluating large language models                 “Using deepspeed and megatron to train megatron-
        trained on code,” CoRR, vol. abs/2107.03374, 2021.                turing NLG 530b, A large-scale generative language
[106]   Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang,             model,” CoRR, vol. abs/2201.11990, 2022.
        J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu,           [114]   Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrit-
        W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu,                      twieser, R. Leblond, T. Eccles, J. Keeling, F. Gi-
        X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang,                    meno, A. D. Lago, T. Hubert, P. Choy, C. de Mas-
        “ERNIE 3.0: Large-scale knowledge enhanced pre-                   son d’Autume, I. Babuschkin, X. Chen, P. Huang,
        training for language understanding and genera-                   J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J.
        tion,” CoRR, vol. abs/2107.02137, 2021.                           Mankowitz, E. S. Robson, P. Kohli, N. de Freitas,
[107]   O. Lieber, O. Sharir, B. Lenz, and Y. Shoham,                     K. Kavukcuoglu, and O. Vinyals, “Competition-level
        “Jurassic-1: Technical details and evaluation,” White             code generation with alphacode,” Science, 2022.
        Paper. AI21 Labs, vol. 1, 2021.                           [115]   S. Soltan, S. Ananthakrishnan, J. FitzGerald,
[108]   B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon,              R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls,
        S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong,                A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Srid-
        S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim,                   har, F. Triefenbach, A. Verma, G. Tür, and P. Natara-
        S. Kang, N. Ryu, K. M. Yoo, M. Chang, S. Suh,                     jan, “Alexatm 20b: Few-shot learning using a
        S. In, J. Park, K. Kim, H. Kim, J. Jeong, Y. G. Yeo,              large-scale multilingual seq2seq model,” CoRR, vol.
        D. Ham, D. Park, M. Y. Lee, J. Kang, I. Kang, J. Ha,              abs/2208.01448, 2022.
        W. Park, and N. Sung, “What changes can large-            [116]   A. Glaese, N. McAleese, M. Trebacz, J. Aslanides,
        scale language models bring? intensive study on hy-               V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chad-
        perclova: Billions-scale korean generative pretrained             wick, P. Thacker, L. Campbell-Gillingham, J. Ue-
        transformers,” in Proceedings of the 2021 Conference              sato, P. Huang, R. Comanescu, F. Yang, A. See,
        on Empirical Methods in Natural Language Processing,              S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S.
        EMNLP 2021, Virtual Event / Punta Cana, Dominican                 Elias, R. Green, S. Mokrá, N. Fernando, B. Wu,
        Republic, 7-11 November, 2021. Association for Com-               R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mel-
        putational Linguistics, 2021.                                     lor, D. Hassabis, K. Kavukcuoglu, L. A. Hendricks,
[109]   S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu,                 and G. Irving, “Improving alignment of dialogue
        F. Li, H. Zhu, J. Luo, L. Xu et al., “Yuan 1.0: Large-            agents via targeted human judgements,” CoRR, vol.
        scale pre-trained language model in zero-shot and                 abs/2209.14375, 2022.
        few-shot learning,” arXiv preprint arXiv:2110.04725,      [117]   H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and
        2021.                                                             J. Zhou, “Welm: A well-read pre-trained language
[110]   A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli,                 model for chinese,” CoRR, vol. abs/2209.10372, 2022.
        T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das-        [118]   Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So,
        Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez,                S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowd-
        J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B.               hery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby,
        Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka-               Q. V. Le, and M. Dehghani, “Transcending scal-
        plan, “A general language assistant as a laboratory               ing laws with 0.1% extra compute,” CoRR, vol.
        for alignment,” CoRR, vol. abs/2112.00861, 2021.                  abs/2210.11399, 2022.
[111]   S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong,       [119]   X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang,
        S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen,             W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov,
        Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao,                 A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su,
        S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu,               Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion pa-
        W. Zeng, G. Li, W. Gao, and H. Wang, “ERNIE                       rameter language model with sparse heterogeneous
        3.0 titan: Exploring larger-scale knowledge enhanced              computing,” CoRR, vol. abs/2303.10845, 2023.
        pre-training for language understanding and gener-        [120]   R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep-
        ation,” CoRR, vol. abs/2112.12731, 2021.                          ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey,
[112]   N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin,                 Z. Chen et al., “Palm 2 technical report,” arXiv
        Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat,                    preprint arXiv:2305.10403, 2023.
        B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang,         [121]   A. Radford, R. Józefowicz, and I. Sutskever, “Learn-
                                                                                                                                106

        ing to generate reviews and discovering sentiment,”       [133]   OpenAI, “Gpt-4v(ision) system card,” OpenAI, 2023.
        CoRR, vol. abs/1704.01444, 2017.                          [134]   ——, “Lessons learned on language model safety
[122]   A. Radford, K. Narasimhan, T. Salimans, I. Sutskever              and misuse,” OpenAI blog, 2022.
        et al., “Improving language understanding by gener-       [135]   Meta, “Introducing meta llama 3: The most capable
        ative pre-training,” 2018.                                        openly available llm to date,” https://ai.meta.com/
[123]   B. McCann, N. S. Keskar, C. Xiong, and R. Socher,                 blog/meta-llama-3/, 2024.
        “The natural language decathlon: Multitask learning       [136]   “Introducing Llama 3.1: Our most capable models to
        as question answering,” CoRR, vol. abs/1806.08730,                date ,” https://ai.meta.com/blog/meta-llama-3-1/,
        2018.                                                             2023.
[124]   Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett,        [137]   A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam-
        X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT                   ford, D. S. Chaplot, D. de las Casas, F. Bressand,
        : Large-scale generative pre-training for conversa-               G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-
        tional response generation,” in Proceedings of the 58th           A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
        Annual Meeting of the Association for Computational               T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023.
        Linguistics: System Demonstrations, ACL 2020, Online,     [138]   A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
        July 5-10, 2020, A. Celikyilmaz and T. Wen, Eds.                  B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas,
        Association for Computational Linguistics, 2020, pp.              E. B. Hanna, F. Bressand, G. Lengyel, G. Bour,
        270–278.                                                          G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux,
[125]   D. Ham, J. Lee, Y. Jang, and K. Kim, “End-to-end                  P. Stock, S. Subramanian, S. Yang, S. Antoniak,
        neural pipeline for goal-oriented dialogue systems                T. L. Scao, T. Gervet, T. Lavril, T. Wang,
        using GPT-2,” in Proceedings of the 58th Annual Meet-             T. Lacroix, and W. E. Sayed, “Mixtral of experts,”
        ing of the Association for Computational Linguistics,             CoRR, vol. abs/2401.04088, 2024. [Online]. Available:
        ACL 2020, Online, July 5-10, 2020. Association for                https://doi.org/10.48550/arXiv.2401.04088
        Computational Linguistics, 2020, pp. 583–592.             [139]   T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju,
[126]   I. Drori, S. Tran, R. Wang, N. Cheng, K. Liu, L. Tang,            S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love,
        E. Ke, N. Singh, T. L. Patti, J. Lynch, A. Shporer,               P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts,
        N. Verma, E. Wu, and G. Strang, “A neural network                 A. Barua, A. Botev, A. Castro-Ros, A. Slone,
        solves and generates mathematics problems by pro-                 A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson,
        gram synthesis: Calculus, differential equations, lin-            B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo,
        ear algebra, and more,” CoRR, vol. abs/2112.15594,                C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya,
        2021.                                                             E. Ni, E. Noland, G. Yan, G. Tucker, G. Muraru,
[127]   A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M.                 G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Gr-
        Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim,                     ishchenko, J. Austin, J. Keeling, J. Labanowski,
        C. Hallacy, J. Heidecke, P. Shyam, B. Power, T. E.                J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret,
        Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P.                  J. Chiu, and et al., “Gemma: Open models based
        Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov,                 on gemini research and technology,” CoRR, vol.
        J. Jang, P. Welinder, and L. Weng, “Text and code                 abs/2403.08295, 2024.
        embeddings by contrastive pre-training,” CoRR, vol.       [140]   M. Rivière, S. Pathak, P. G. Sessa, C. Hardin, S. Bhu-
        abs/2201.10005, 2022.                                             patiraju, L. Hussenot, T. Mesnard, B. Shahriari,
[128]   J. Schulman, F. Wolski, P. Dhariwal, A. Radford,                  A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Cas-
        and O. Klimov, “Proximal policy optimization algo-                bon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsit-
        rithms,” arXiv preprint arXiv:1707.06347, 2017.                   sulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Mom-
[129]   N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler,                     chev, M. Hoffman, S. Thakoor, J. Grill, B. Neyshabur,
        R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F.                O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ah-
        Christiano, “Learning to summarize from human                     mad, A. Hutchison, A. Abdagic, A. Carl, A. Shen,
        feedback,” CoRR, vol. abs/2009.01325, 2020.                       A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bas-
[130]   OpenAI, “Our approach to alignment research,” Ope-                tian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar,
        nAI Blog, August 2022.                                            C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopal-
[131]   ——, “Introducing chatgpt,” OpenAI Blog, November                  nikov, D. Weinberger, D. Vijaykumar, D. Rogozin-
        2022.                                                             ska, D. Herbison, E. Bandy, E. Wang, E. Noland,
[132]   D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai,             E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin,
        S. Kadavath, B. Mann, E. Perez, N. Schiefer,                      G. Wei, G. Cameron, G. Martins, H. Hashemi,
        K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Con-                 H. Klimczak-Plucinska, H. Batra, H. Dhand, I. Nar-
        erly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk,              dini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan,
        S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernan-               J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fer-
        dez, T. Hume, J. Jacobson, S. Johnston, S. Kravec,                nandez, J. van Amersfoort, J. Gordon, J. Lipschultz,
        C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei,                 J. Newlan, J. Ji, K. Mohamed, K. Badola, K. Black,
        T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Ka-               K. Millican, K. McDonell, K. Nguyen, K. Sodhia,
        plan, and J. Clark, “Red teaming language models                  K. Greene, L. L. Sjösund, L. Usui, L. Sifre, L. Heuer-
        to reduce harms: Methods, scaling behaviors, and                  mann, L. Lago, and L. McNealus, “Gemma 2: Im-
        lessons learned,” CoRR, vol. abs/2209.07858, 2022.                proving open language models at a practical size,”
                                                                                                                                  107

        CoRR, vol. abs/2408.00118, 2024.                             [151]   Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, B. Ma,
[141]   A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou,                  and X. Li, “Belle: Be everyone’s large language
        C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin,             model engine,” https://github.com/LianjiaTech/
        J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu,            BELLE, 2023.
        J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen,     [152]   W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
        K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang,                    H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
        R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai,                    Gonzalez, I. Stoica, and E. P. Xing, “Vicuna:
        S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou,              An open-source chatbot impressing gpt-4 with
        X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao,                    90%* chatgpt quality,” 2023. [Online]. Available:
        Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang,                  https://vicuna.lmsys.org
        and Z. Fan, “Qwen2 technical report,” arXiv preprint         [153]   D. Eccleston, “Sharegpt,” https://sharegpt.com/,
        arXiv:2407.10671, 2024.                                              2023.
[142]   Q. Team, “Qwen2.5: A party of foundation                     [154]   H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
        models,” September 2024. [Online]. Available:                        tuning,” CoRR, vol. abs/2304.08485, 2023.
        https://qwenlm.github.io/blog/qwen2.5/                       [155]   D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny,
[143]   T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin,                   “Minigpt-4: Enhancing vision-language understand-
        D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang,                  ing with advanced large language models,” CoRR,
        J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang,               vol. abs/2304.10592, 2023.
        J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang,           [156]   W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang,
        P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao,                B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: To-
        S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang,               wards general-purpose vision-language models with
        X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, X. Song,                      instruction tuning,” CoRR, vol. abs/2305.06500, 2023.
        X. Zhang, Y. An, Y. Xu, Y. Niu, Y. Yang, Y. Li, Y. Bai,      [157]   Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai,
        Y. Dong, Z. Qi, Z. Wang, Z. Yang, Z. Du, Z. Hou,                     “Pandagpt: One model to instruction-follow them
        and Z. Wang, “Chatglm: A family of large language                    all,” 2023.
        models from glm-130b to glm-4 all tools,” 2024.              [158]   Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov,
[144]   H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and                      R. Urtasun, A. Torralba, and S. Fidler, “Aligning
        M. Sun, “JEC-QA: A legal-domain question answer-                     books and movies: Towards story-like visual expla-
        ing dataset,” in The Thirty-Fourth AAAI Conference                   nations by watching movies and reading books,” in
        on Artificial Intelligence, AAAI 2020, The Thirty-Second             2015 IEEE International Conference on Computer Vision,
        Innovative Applications of Artificial Intelligence Confer-           ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE
        ence, IAAI 2020, The Tenth AAAI Symposium on Edu-                    Computer Society, 2015, pp. 19–27.
        cational Advances in Artificial Intelligence, EAAI 2020,     [159]   “Project gutenberg.” [Online]. Available: https:
        New York, NY, USA, February 7-12, 2020. AAAI Press,                  //www.gutenberg.org/
        2020, pp. 9701–9708.                                         [160]   T. H. Trinh and Q. V. Le, “A simple method for com-
[145]   D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang,                   monsense reasoning,” CoRR, vol. abs/1806.02847,
        and P. Szolovits, “What disease does this patient                    2018.
        have? a large-scale open domain question answer-             [161]   R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk,
        ing dataset from medical exams,” Applied Sciences,                   A. Farhadi, F. Roesner, and Y. Choi, “Defending
        vol. 11, no. 14, p. 6421, 2021.                                      against neural fake news,” in Advances in Neu-
[146]   R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,                  ral Information Processing Systems 32: Annual Confer-
        C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford                ence on Neural Information Processing Systems 2019,
        alpaca: An instruction-following llama model,”                       NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
        https://github.com/tatsu-lab/stanford alpaca,                        Canada, H. M. Wallach, H. Larochelle, A. Beygelz-
        2023.                                                                imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,
[147]   Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith,                   2019, pp. 9051–9062.
        D. Khashabi, and H. Hajishirzi, “Self-instruct: Align-       [162]   A. Gokaslan, V. C. E. Pavlick, and S. Tellex,
        ing language model with self generated instruc-                      “Openwebtext corpus,” http://Skylion007.github.
        tions,” CoRR, vol. abs/2212.10560, 2022.                             io/OpenWebTextCorpus, 2019.
[148]   Alpaca-LoRA, “Instruct-tune llama on consumer                [163]   J. Baumgartner, S. Zannettou, B. Keegan, M. Squire,
        hardware,” https://github.com/tloen/alpaca-lora,                     and J. Blackburn, “The pushshift reddit dataset,” in
        2023.                                                                Proceedings of the Fourteenth International AAAI Con-
[149]   E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li,                   ference on Web and Social Media, ICWSM 2020, Held
        S. Wang, L. Wang, and W. Chen, “Lora: Low-rank                       Virtually, Original Venue: Atlanta, Georgia, USA, June
        adaptation of large language models,” in The Tenth                   8-11, 2020. AAAI Press, 2020, pp. 830–839.
        International Conference on Learning Representations,        [164]   “Wikipedia.” [Online]. Available: https://en.
        ICLR 2022, Virtual Event, April 25-29, 2022. Open-                   wikipedia.org/wiki/Main Page
        Review.net, 2022.                                            [165]   “Bigquery dataset.” [Online]. Available: https:
[150]   X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel,                //cloud.google.com/bigquery?hl=zh-cn
        S. Levine, and D. Song, “Koala: A dialogue model for         [166]   L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe,
        academic research,” Blog post, April 2023.                           C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
                                                                                                                                 108

        S. Presser, and C. Leahy, “The pile: An 800gb dataset              guage crowdsourcing instructions,” in Proceedings of
        of diverse text for language modeling,” CoRR, vol.                 the 60th Annual Meeting of the Association for Com-
        abs/2101.00027, 2021.                                              putational Linguistics (Volume 1: Long Papers), ACL
[167]   H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V.               2022, Dublin, Ireland, May 22-27, 2022, S. Muresan,
        del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G.                 P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–
        Ponferrada, H. Nguyen et al., “The bigscience roots                3487.
        corpus: A 1.6 tb composite multilingual dataset,” in       [180]   S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel,
        Thirty-sixth Conference on Neural Information Process-             N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry,
        ing Systems Datasets and Benchmarks Track, 2022.                   Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David,
[168]   “Common crawl.” [Online]. Available: https://                      C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S.
        commoncrawl.org/                                                   AlShaibani, S. Sharma, U. Thakker, K. Almubarak,
[169]   G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud-                    X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush,
        hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet:                “Promptsource: An integrated development environ-
        Extracting high quality monolingual datasets from                  ment and repository for natural language prompts,”
        web crawl data,” in Proceedings of The 12th Language               in ACL (demo). Association for Computational Lin-
        Resources and Evaluation Conference, 2020, pp. 4003–               guistics, 2022, pp. 93–104.
        4012.                                                      [181]   T. Tang, J. Li, W. X. Zhao, and J. Wen, “MVP: multi-
[170]   T. Computer, “Redpajama: an open dataset for train-                task supervised pre-training for natural language
        ing large language models,” https://github.com/                    generation,” CoRR, vol. abs/2206.12131, 2022.
        togethercomputer/RedPajama-Data, 2023.                     [182]   H. Nguyen, S. Suri, K. Tsui, Shahules786, T. team,
[171]   G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru,                   and C. Schuhmann, “The oig dataset,” https://laion.
        A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei,              ai/blog/oig-dataset/, 2023.
        and J. Launay, “The RefinedWeb dataset for Falcon          [183]   Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen,
        LLM: outperforming curated corpora with web data,                  N. DasSarma, D. Drain, S. Fort, D. Ganguli,
        and web data only,” arXiv preprint arXiv:2306.01116,               T. Henighan, N. Joseph, S. Kadavath, J. Kernion,
        2023.                                                              T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-
[172]   C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and                    Dodds, D. Hernandez, T. Hume, S. Johnston,
        A. A. Alemi, “On the use of arxiv as a dataset,” arXiv             S. Kravec, L. Lovitt, N. Nanda, C. Olsson,
        preprint arXiv:1905.00075, 2019.                                   D. Amodei, T. B. Brown, J. Clark, S. McCandlish,
[173]   K. Lo, L. L. Wang, M. Neumann, R. Kinney, and                      C. Olah, B. Mann, and J. Kaplan, “Training a helpful
        D. Weld, “S2ORC: The semantic scholar open re-                     and harmless assistant with reinforcement learning
        search corpus,” in ACL, 2020.                                      from human feedback,” CoRR, vol. abs/2204.05862,
[174]   L. Soldaini and K. Lo, “peS2o (Pretraining Efficiently             2022. [Online]. Available: https://doi.org/10.48550/
        on S2ORC) Dataset,” ODC-By, https://github.com/                    arXiv.2204.05862
        allenai/pes2o, 2023.                                       [184]   B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding,
[175]   D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M.              J. Yue, and Y. Wu, “How close is chatgpt to human
        Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf             experts? comparison corpus, evaluation, and detec-
        et al., “The stack: 3 tb of permissively licensed source           tion,” arXiv preprint arXiv:2301.07597, 2023.
        code,” arXiv preprint arXiv:2211.15533, 2022.              [185]   M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan,
[176]   B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion                 S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and
        Parameter Autoregressive Language Model,” https:                   R. Xin. (2023) Free dolly: Introducing the world’s first
        //github.com/kingoflolz/mesh-transformer-jax,                      truly open instruction-tuned llm.
        2021.                                                      [186]   A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-
[177]   L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk,                     R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stan-
        D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Du-                ley, R. Nagyfi et al., “Openassistant conversations–
        mas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar,                   democratizing large language model alignment,”
        L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morri-               arXiv preprint arXiv:2304.07327, 2023.
        son, N. Muennighoff, A. Naik, C. Nam, M. E. Peters,        [187]   R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
        A. Ravichander, K. Richardson, Z. Shen, E. Strubell,               C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford
        N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer,                alpaca: An instruction-following llama model,”
        N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld,             https://github.com/tatsu-lab/stanford alpaca,
        J. Dodge, and K. Lo, “Dolma: an open corpus of                     2023.
        three trillion tokens for language model pretraining       [188]   J. Cheung, “Guanaco - generative universal assistant
        research,” arXiv preprint arXiv:2402.00159, 2024.                  for natural-language adaptive context-aware om-
[178]   D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kin-            nilingual outputs,” https://guanaco-model.github.
        ney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson,               io/, 2023.
        Y. Wang et al., “Olmo: Accelerating the science of         [189]   C. Xu, D. Guo, N. Duan, and J. McAuley,
        language models,” arXiv preprint arXiv:2402.00838,                 “Baize: An open-source chat model with parameter-
        2024.                                                              efficient tuning on self-chat data,” arXiv preprint
[179]   S. Mishra, D. Khashabi, C. Baral, and H. Ha-                       arXiv:2304.01196, 2023.
        jishirzi, “Cross-task generalization via natural lan-      [190]   Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma,
                                                                                                                                   109

        and X. Li, “Towards better instruction following                       2020, pp. 4157–4165.
        language models for chinese: Investigating the im-             [201]   T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
        pact of training data and evaluation,” arXiv preprint                  langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
        arXiv:2304.07854, 2023.                                                towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
[191]   K. Ethayarajh, Y. Choi, and S. Swayamdipta, “Under-                    Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
        standing dataset difficulty with V -usable informa-                    M. Drame, Q. Lhoest, and A. M. Rush, “Transform-
        tion,” in Proceedings of the 39th International Conference             ers: State-of-the-art natural language processing,” in
        on Machine Learning, 2022, pp. 5988–6008.                              Proceedings of the 2020 Conference on Empirical Methods
[192]   N.       Lambert,        L.     Tunstall,    N.     Rajani,            in Natural Language Processing: System Demonstrations,
        and       T.    Thrush.       (2023)   Huggingface       h4            EMNLP 2020 - Demos, Online, November 16-20, 2020.
        stack       exchange        preference    dataset.     [On-            Association for Computational Linguistics, 2020, pp.
        line]. Available: https://huggingface.co/datasets/                     38–45.
        HuggingFaceH4/stack-exchange-preferences                       [202]   J. Bradbury, R. Frostig, P. Hawkins, M. J.
[193]   R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M.                      Johnson, C. Leary, D. Maclaurin, G. Necula,
        Dai, D. Yang, and S. Vosoughi, “Training socially                      A. Paszke, J. VanderPlas, S. Wanderman-Milne,
        aligned language models in simulated human soci-                       and Q. Zhang, “JAX: composable transformations
        ety,” CoRR, vol. abs/2305.16960, 2023.                                 of Python+NumPy programs,” 2018. [Online].
[194]   G. Xu, J. Liu, M. Yan, H. Xu, J. Si, Z. Zhou, P. Yi,                   Available: http://github.com/google/jax
        X. Gao, J. Sang, R. Zhang, J. Zhang, C. Peng,                  [203]   Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang,
        F. Huang, and J. Zhou, “Cvalues: Measuring the                         F. Cui, and Y. You, “Colossal-ai: A unified deep learn-
        values of chinese large language models from safety                    ing system for large-scale parallel training,” CoRR,
        to responsibility,” 2023.                                              vol. abs/2110.14883, 2021.
[195]   J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and     [204]   J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrick-
        Y. Yang, “Safe rlhf: Safe reinforcement learning from                  star: Parallel training of pre-trained models via
        human feedback,” arXiv preprint arXiv:2310.12773,                      a chunk-based memory management,” CoRR, vol.
        2023.                                                                  abs/2108.05818, 2021.
[196]   V. Sanh, A. Webson, C. Raffel, S. H. Bach,                     [205]   Y. You, “Colossalchat: An open-source solution
        L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler,                     for     cloning     chatgpt     with     a    complete
        A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S.                  rlhf     pipeline,”    2023.    [Online].    Available:
        Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V.                      https://medium.com/@yangyou berkeley/
        Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang,                       colossalchat-an-open-source-solution-for-cloning-
        M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw-                     chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b
        den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San-          [206]   “Bmtrain: Effient training for big models.” [Online].
        tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi-           Available: https://github.com/OpenBMB/BMTrain
        derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask            [207]   J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang,
        prompted training enables zero-shot task generaliza-                   “Fastmoe: A fast mixture-of-expert training system,”
        tion,” in The Tenth International Conference on Learning               CoRR, vol. abs/2103.13262, 2021.
        Representations, ICLR 2022, Virtual Event, April 25-29,        [208]   W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng,
        2022. OpenReview.net, 2022.                                            C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica,
[197]   S. Longpre, L. Hou, T. Vu, A. Webson, H. W.                            “Efficient memory management for large language
        Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei                      model serving with pagedattention,” in Proceedings
        et al., “The flan collection: Designing data and meth-                 of the ACM SIGOPS 29th Symposium on Operating
        ods for effective instruction tuning,” arXiv preprint                  Systems Principles, 2023.
        arXiv:2301.13688, 2023.                                        [209]   (2023) Deepspeed-mii. [Online]. Available: https:
[198]   K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton,                         //github.com/microsoft/DeepSpeed-MII
        R. Nakano, C. Hesse, and J. Schulman, “Training                [210]   Z. Yao, R. Y. Aminabadi, O. Ruwase, S. Rajbhan-
        verifiers to solve math word problems,” CoRR, vol.                     dari, X. Wu, A. A. Awan, J. Rasley, M. Zhang,
        abs/2110.14168, 2021.                                                  C. Li, C. Holmes, Z. Zhou, M. Wyatt, M. Smith,
[199]   M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth,                      L. Kurilenko, H. Qin, M. Tanaka, S. Che, S. L. Song,
        and J. Berant, “Did aristotle use a laptop? A ques-                    and Y. He, “DeepSpeed-Chat: Easy, Fast and Afford-
        tion answering benchmark with implicit reasoning                       able RLHF Training of ChatGPT-like Models at All
        strategies,” Trans. Assoc. Comput. Linguistics, vol. 9,                Scales,” arXiv preprint arXiv:2308.01320, 2023.
        pp. 346–361, 2021.                                             [211]   A. Paszke, S. Gross, F. Massa, A. Lerer, J. Brad-
[200]   O. Camburu, B. Shillingford, P. Minervini,                             bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
        T. Lukasiewicz, and P. Blunsom, “Make up your                          L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang,
        mind! adversarial generation of inconsistent natural                   Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
        language explanations,” in Proceedings of the 58th                     B. Steiner, L. Fang, J. Bai, and S. Chintala, “Py-
        Annual Meeting of the Association for Computational                    torch: An imperative style, high-performance deep
        Linguistics, ACL 2020, Online, July 5-10, 2020,                        learning library,” in Advances in Neural Information
        D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault,                Processing Systems 32: Annual Conference on Neural
        Eds.       Association for Computational Linguistics,                  Information Processing Systems 2019, NeurIPS 2019,
                                                                                                                                110

        December 8-14, 2019, Vancouver, BC, Canada, H. M.           [224]   S. Black, L. Gao, P. Wang, C. Leahy, and S. Bider-
        Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-                man, “GPT-Neo: Large Scale Autoregressive Lan-
        Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 8024–               guage Modeling with Mesh-Tensorflow,” 2021.
        8035.                                                       [225]   F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn,
[212]   M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis,                    “A systematic evaluation of large language models
        J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Is-                   of code,” in MAPS@PLDI, 2022.
        ard, M. Kudlur, J. Levenberg, R. Monga, S. Moore,           [226]   A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neu-
        D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan,               big, “Language models of code are few-shot com-
        P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensor-                  monsense learners,” in Proceedings of the 2022 Confer-
        flow: A system for large-scale machine learning,” in                ence on Empirical Methods in Natural Language Process-
        12th USENIX Symposium on Operating Systems Design                   ing, EMNLP 2022, Abu Dhabi, United Arab Emirates,
        and Implementation, OSDI 2016, Savannah, GA, USA,                   December 7-11, 2022, Y. Goldberg, Z. Kozareva, and
        November 2-4, 2016, K. Keeton and T. Roscoe, Eds.                   Y. Zhang, Eds. Association for Computational Lin-
        USENIX Association, 2016, pp. 265–283.                              guistics, 2022, pp. 1384–1403.
[213]   T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang,            [227]   S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts,
        T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet:                     B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno
        A flexible and efficient machine learning library                   et al., “A pretrainer’s guide to training data: Measur-
        for heterogeneous distributed systems,” CoRR, vol.                  ing the effects of data age, domain coverage, quality,
        abs/1512.01274, 2015.                                               & toxicity,” arXiv preprint arXiv:2305.13169, 2023.
[214]   Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle:            [228]   D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,
        An open-source deep learning platform from indus-                   D. Gao, Y. Xie, Z. Liu, J. Gao, Y. Li, B. Ding, and
        trial practice,” Frontiers of Data and Domputing, vol. 1,           J. Zhou, “Data-juicer: A one-stop data processing
        no. 1, p. 105, 2019.                                                system for large language models,” 2023.
[215]   L. Huawei Technologies Co., “Huawei mindspore               [229]   M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja,
        ai development framework,” in Artificial Intelligence               A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
        Technology. Springer, 2022, pp. 137–162.                            A. Bakhtiari, H. Behl et al., “Phi-3 technical report:
[216]   J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao,           A highly capable language model locally on your
        F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “One-                 phone,” arXiv preprint arXiv:2404.14219, 2024.
        flow: Redesign the distributed deep learning frame-         [230]   G. Penedo, H. Kydlı́ček, A. Lozhkov, M. Mitchell,
        work from scratch,” CoRR, vol. abs/2110.15032, 2021.                C. Raffel, L. Von Werra, T. Wolf et al., “The fineweb
[217]   S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson,                datasets: Decanting the web for the finest text data at
        Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and                 scale,” arXiv preprint arXiv:2406.17557, 2024.
        J. Weston, “Recipes for building an open-domain             [231]   P. Maini, S. Seto, H. Bai, D. Grangier, Y. Zhang, and
        chatbot,” in Proceedings of the 16th Conference of the              N. Jaitly, “Rephrasing the web: A recipe for compute
        European Chapter of the Association for Computational               and data-efficient language modeling,” in ICLR 2024
        Linguistics: Main Volume, EACL 2021, Online, April 19               Workshop on Navigating and Addressing Data Problems
        - 23, 2021, 2021, pp. 300–325.                                      for Foundation Models, 2024.
[218]   A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer,             [232]   M. Marion, A. Üstün, L. Pozzobon, A. Wang,
        H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil,                  M. Fadaee, and S. Hooker, “When less is more: Inves-
        I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur,                     tigating data pruning for pretraining llms at scale,”
        G. Gur-Ari, and V. Misra, “Solving quantitative rea-                arXiv preprint arXiv:2309.04564, 2023.
        soning problems with language models,” CoRR, vol.           [233]   N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong,
        abs/2206.14858, 2022.                                               E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng,
[219]   T. Saier, J. Krause, and M. Färber, “unarxive 2022:                “How to train data-efficient llms,” arXiv preprint
        All arxiv publications pre-processed for nlp, includ-               arXiv:2402.09668, 2024.
        ing structured full-text and citation network,” arXiv       [234]   D. Hernandez, T. B. Brown, T. Conerly, N. Das-
        preprint arXiv:2303.14957, 2023.                                    Sarma, D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-
[220]   H. A. Simon, “Experiments with a heuristic com-                     Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann,
        piler,” J. ACM, vol. 10, no. 4, pp. 493–506, 1963.                  C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Ka-
[221]   Z. Manna and R. J. Waldinger, “Toward automatic                     plan, and S. McCandlish, “Scaling laws and inter-
        program synthesis,” Commun. ACM, vol. 14, no. 3,                    pretability of learning from repeated data,” CoRR,
        pp. 151–165, 1971.                                                  vol. abs/2205.10487, 2022.
[222]   Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,        [235]   A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
        L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou,                     “The curious case of neural text degeneration,” in 8th
        “Codebert: A pre-trained model for programming                      International Conference on Learning Representations,
        and natural languages,” in Findings of EMNLP, 2020.                 ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[223]   J. Austin, A. Odena, M. I. Nye, M. Bosma,                           OpenReview.net, 2020.
        H. Michalewski, D. Dohan, E. Jiang, C. J. Cai,              [236]   K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck,
        M. Terry, Q. V. Le, and C. Sutton, “Program syn-                    C. Callison-Burch, and N. Carlini, “Deduplicating
        thesis with large language models,” CoRR, vol.                      training data makes language models better,” in Pro-
        abs/2108.07732, 2021.                                               ceedings of the 60th Annual Meeting of the Association
                                                                                                                               111

        for Computational Linguistics (Volume 1: Long Papers),             Conference on Learning Representations, ICLR 2020,
        ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.              Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe-
        8424–8445.                                                         view.net, 2020.
[237]   N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr,   [248] K. Tirumala, D. Simig, A. Aghajanyan, and A. S. Mor-
        and C. Zhang, “Quantifying memorization across                     cos, “D4: Improving llm pretraining via document
        neural language models,” CoRR, 2022.                               de-duplication and diversification,” arXiv preprint
[238]   N. Kandpal, E. Wallace, and C. Raffel, “Deduplicat-                arXiv:2308.12284, 2023.
        ing training data mitigates privacy risks in language        [249] Z. Shen, T. Tao, L. Ma, W. Neiswanger, J. Hes-
        models,” in International Conference on Machine Learn-             tness, N. Vassilieva, D. Soboleva, and E. Xing,
        ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland,              “Slimpajama-dc: Understanding data combinations
        USA. PMLR, 2022, pp. 10 697–10 707.                                for llm training,” arXiv preprint arXiv:2309.10818,
[239]   J. D. Lafferty, A. McCallum, and F. C. N. Pereira,                 2023.
        “Conditional random fields: Probabilistic models             [250] S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data
        for segmenting and labeling sequence data,” in                     selection for language models via importance resam-
        Proceedings of the Eighteenth International Conference             pling,” arXiv preprint arXiv:2302.03169, 2023.
        on Machine Learning (ICML 2001), Williams College,           [251] X. Wang, W. Zhou, Q. Zhang, J. Zhou, S. Gao,
        Williamstown, MA, USA, June 28 - July 1, 2001, C. E.               J. Wang, M. Zhang, X. Gao, Y. Chen, and T. Gui,
        Brodley and A. P. Danyluk, Eds. Morgan Kaufmann,                   “Farewell to aimless large-scale pretraining: Influ-
        2001, pp. 282–289.                                                 ential subset selection for language model,” arXiv
[240]   P. Gage, “A new algorithm for data compression,” C                 preprint arXiv:2305.12816, 2023.
        Users Journal, vol. 12, no. 2, pp. 23–38, 1994.              [252] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N.
[241]   R. Sennrich, B. Haddow, and A. Birch, “Neural ma-                  Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda,
        chine translation of rare words with subword units,”               and R. Fernández, “The LAMBADA dataset: Word
        in Proceedings of the 54th Annual Meeting of the Associa-          prediction requiring a broad discourse context,” in
        tion for Computational Linguistics, ACL 2016, August 7-            ACL (1). The Association for Computer Linguistics,
        12, 2016, Berlin, Germany, Volume 1: Long Papers. The              2016.
        Association for Computer Linguistics, 2016.                  [253] M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang,
[242]   M. Schuster and K. Nakajima, “Japanese and korean                  F. Sala, and C. Ré, “Skill-it! a data-driven skills
        voice search,” in 2012 IEEE international conference on            framework for understanding and training language
        acoustics, speech and signal processing (ICASSP). IEEE,            models,” arXiv preprint arXiv:2307.14430, 2023.
        2012, pp. 5149–5152.                                         [254] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla,
[243]   Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,                 I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez,
        W. Macherey, M. Krikun, Y. Cao, Q. Gao,                            J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton,
        K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu,             M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong,
        L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa,                  A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Mar-
        K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,                tin, N. Usunier, T. Scialom, and G. Synnaeve, “Code
        J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor-                llama: Open foundation models for code,” CoRR, vol.
        rado, M. Hughes, and J. Dean, “Google’s neural                     abs/2308.12950, 2023.
        machine translation system: Bridging the gap be-             [255] Y. Bengio, J. Louradour, R. Collobert, and J. Weston,
        tween human and machine translation,” CoRR, vol.                   “Curriculum learning,” in ICML, 2009, pp. 41–48.
        abs/1609.08144, 2016.                                        [256] C. Xu, C. Rosset, L. Del Corro, S. Mahajan,
[244]   T. Kudo, “Subword regularization: Improving neural                 J. McAuley, J. Neville, A. H. Awadallah, and N. Rao,
        network translation models with multiple subword                   “Contrastive post-training large language models
        candidates,” in Proceedings of the 56th Annual Meeting             on data curriculum,” arXiv preprint arXiv:2310.02263,
        of the Association for Computational Linguistics, ACL              2023.
        2018, Melbourne, Australia, July 15-20, 2018, Volume 1:      [257] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu,
        Long Papers, I. Gurevych and Y. Miyao, Eds. Associ-                H. Michalewski, and P. Milos, “Focused transformer:
        ation for Computational Linguistics, 2018, pp. 66–75.              Contrastive training for context scaling,” CoRR, vol.
[245]   T. Kudo and J. Richardson, “Sentencepiece: A simple                abs/2307.03170, 2023.
        and language independent subword tokenizer and               [258] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos,
        detokenizer for neural text processing,” in Proceed-               S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and
        ings of the 2018 Conference on Empirical Methods in                S. Welleck, “Llemma: An open language model for
        Natural Language Processing, EMNLP 2018: System                    mathematics,” arXiv preprint arXiv:2310.10631, 2023.
        Demonstrations, Brussels, Belgium, October 31 - Novem-       [259] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extend-
        ber 4, 2018, E. Blanco and W. Lu, Eds. Association                 ing context window of large language models via
        for Computational Linguistics, 2018.                               positional interpolation,” CoRR, vol. abs/2306.15595,
[246]   M. Davis and M. Dürst, “Unicode normalization                     2023.
        forms,” 2001.                                                [260] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud-
[247]   P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak,              hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet:
        and I. Sutskever, “Deep double descent: Where big-                 Extracting high quality monolingual datasets from
        ger models and more data hurt,” in 8th International               web crawl data,” in Proceedings of the Twelfth Language
                                                                                                                               112

        Resources and Evaluation Conference, 2020, pp. 4003–              J. Wang, and F. Wei, “Retentive network: A successor
        4012.                                                             to transformer for large language models,” arXiv
[261]   A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov,               preprint arXiv:2307.08621, 2023.
        “Bag of tricks for efficient text classification,” in     [272]   A. Gu and T. Dao, “Mamba: Linear-time sequence
        EACL, 2017, pp. 427–431.                                          modeling with selective state spaces,” CoRR, vol.
[262]   D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,                 abs/2312.00752, 2023.
        D. Gao, Y. Xie, Z. Liu, J. Gao et al., “Data-juicer: A    [273]   B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Ar-
        one-stop data processing system for large language                cadinho, H. Cao, X. Cheng, M. Chung, M. Grella,
        models,” arXiv preprint arXiv:2309.02033, 2023.                   K. K. GV et al., “Rwkv: Reinventing rnns for the
[263]   B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia,             transformer era,” arXiv preprint arXiv:2305.13048,
        J. Shen, and O. Firat, “Examining scaling and transfer            2023.
        of language model architectures for machine transla-      [274]   M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou,
        tion,” in International Conference on Machine Learning,           D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang,
        ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA,             “Cogview: Mastering text-to-image generation via
        2022, pp. 26 176–26 192.                                          transformers,” in Advances in Neural Information Pro-
[264]   L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang,               cessing Systems 34: Annual Conference on Neural Infor-
        J. Gao, M. Zhou, and H. Hon, “Unified language                    mation Processing Systems 2021, NeurIPS 2021, Decem-
        model pre-training for natural language understand-               ber 6-14, 2021, virtual, 2021, pp. 19 822–19 835.
        ing and generation,” in Advances in Neural Information    [275]   L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-
        Processing Systems 32: Annual Conference on Neural                ization,” vol. abs/1607.06450, 2016.
        Information Processing Systems 2019, NeurIPS 2019,        [276]   B. Zhang and R. Sennrich, “Root mean square layer
        December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.             normalization,” in Advances in Neural Information
        13 042–13 054.                                                    Processing Systems 32: Annual Conference on Neural
[265]   A. Clark, D. de Las Casas, A. Guy, A. Mensch,                     Information Processing Systems 2019, NeurIPS 2019,
        M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht-                  December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
        man, T. Cai, S. Borgeaud, G. van den Driessche,                   12 360–12 371.
        E. Rutherford, T. Hennigan, M. J. Johnson, A. Cas-        [277]   H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang,
        sirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre,             and F. Wei, “Deepnet: Scaling transformers to 1, 000
        S. Osindero, O. Vinyals, M. Ranzato, J. W. Rae,                   layers,” vol. abs/2203.00555, 2022.
        E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Uni-          [278]   V. Nair and G. E. Hinton, “Rectified linear units im-
        fied scaling laws for routed language models,” in                 prove restricted boltzmann machines,” in Proceedings
        International Conference on Machine Learning, ICML                of the 27th international conference on machine learning
        2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022,            (ICML-10), 2010, pp. 807–814.
        pp. 4057–4086.                                            [279]   A. Wang, A. Singh, J. Michael, F. Hill, O. Levy,
[266]   A. Gu, K. Goel, and C. Ré, “Efficiently modeling                 and S. R. Bowman, “GLUE: A multi-task benchmark
        long sequences with structured state spaces,”                     and analysis platform for natural language under-
        in The Tenth International Conference on Learning                 standing,” in Proceedings of the Workshop: Analyz-
        Representations, ICLR 2022, Virtual Event, April 25-29,           ing and Interpreting Neural Networks for NLP, Black-
        2022. OpenReview.net, 2022. [Online]. Available:                  boxNLP@EMNLP 2018, Brussels, Belgium, November 1,
        https://openreview.net/forum?id=uYLFoz1vlAC                       2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds.
[267]   J. T. Smith, A. Warrington, and S. Linderman, “Sim-               Association for Computational Linguistics, 2018, pp.
        plified state space layers for sequence modeling,” in             353–355.
        ICLR, 2023.                                               [280]   P. Ramachandran, B. Zoph, and Q. V. Le,
[268]   A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gul-              “Searching for activation functions,” arXiv preprint
        cehre, R. Pascanu, and S. De, “Resurrecting recurrent             arXiv:1710.05941, 2017.
        neural networks for long sequences,” in ICML, 2023.       [281]   N. Shazeer, “GLU variants improve transformer,”
[269]   M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao,               vol. abs/2002.05202, 2020.
        S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena        [282]   J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer:
        hierarchy: Towards larger convolutional language                  Enhanced transformer with rotary position embed-
        models,” in ICML, 2023.                                           ding,” vol. abs/2104.09864, 2021.
[270]   B. Peng, E. Alcaide, Q. Anthony, A. Albalak,              [283]   O. Press, N. A. Smith, and M. Lewis, “Train short,
        S. Arcadinho, H. Cao, X. Cheng, M. Chung,                         test long: Attention with linear biases enables input
        M. Grella, K. K. G. V., X. He, H. Hou, P. Kazienko,               length extrapolation,” in The Tenth International Con-
        J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I.                   ference on Learning Representations, ICLR 2022, Virtual
        Mantri, F. Mom, A. Saito, X. Tang, B. Wang,                       Event, April 25-29, 2022, 2022.
        J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang,               [284]   S. Ioffe and C. Szegedy, “Batch normalization:
        Q. Zhao, P. Zhou, J. Zhu, and R. Zhu, “RWKV:                      Accelerating deep network training by reducing
        reinventing rnns for the transformer era,” CoRR,                  internal covariate shift,” in Proceedings of the
        vol. abs/2305.13048, 2023. [Online]. Available:                   32nd International Conference on Machine Learning,
        https://doi.org/10.48550/arXiv.2305.13048                         ICML 2015, Lille, France, 6-11 July 2015, ser.
[271]   Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue,                 JMLR Workshop and Conference Proceedings,
                                                                                                                               113

      F. R. Bach and D. M. Blei, Eds., vol. 37.                          Linguistics, 2019, pp. 2978–2988. [Online]. Available:
      JMLR.org, 2015, pp. 448–456. [Online]. Available:                  https://doi.org/10.18653/v1/p19-1285
      http://proceedings.mlr.press/v37/ioffe15.html              [294]   Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhut-
[285] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry,               dinov, and Q. V. Le, “Xlnet: Generalized autoregres-
      M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan,               sive pretraining for language understanding,” Ad-
      Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts,                    vances in neural information processing systems, vol. 32,
      and C. Raffel, “Do transformer modifications transfer              2019.
      across implementations and applications?” in Pro-          [295]   B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn:
      ceedings of the 2021 Conference on Empirical Methods               Efficient context window extension of large language
      in Natural Language Processing, EMNLP 2021, Virtual                models,” CoRR, vol. abs/2309.00071, 2023.
      Event / Punta Cana, Dominican Republic, 7-11 Novem-        [296]   Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang,
      ber, 2021, 2021, pp. 5758–5773.                                    A. Benhaim, V. Chaudhary, X. Song, and F. Wei,
[286] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing,             “A length-extrapolatable transformer,” CoRR, vol.
      H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer                   abs/2212.10554, 2022. [Online]. Available: https:
      normalization in the transformer architecture,” in                 //doi.org/10.48550/arXiv.2212.10554
      ICML, 2020.                                                [297]   H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A.
[287] A. Baevski and M. Auli, “Adaptive input repre-                     Smith, and L. Kong, “Random feature attention,”
      sentations for neural language modeling,” in 7th                   in 9th International Conference on Learning Representa-
      International Conference on Learning Representations,              tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
      ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.            [298]   M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie,
      OpenReview.net, 2019.                                              C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang,
[288] L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Under-               L. Yang, and A. Ahmed, “Big bird: Transformers for
      standing the difficulty of training transformers,” in              longer sequences,” in Advances in Neural Information
      Proceedings of the 2020 Conference on Empirical Methods            Processing Systems 33: Annual Conference on Neural
      in Natural Language Processing, EMNLP 2020, Online,                Information Processing Systems 2020, NeurIPS 2020,
      November 16-20, 2020.       Association for Computa-               December 6-12, 2020, virtual, 2020.
      tional Linguistics, 2020, pp. 5747–5763.                   [299]   R. Child, S. Gray, A. Radford, and I. Sutskever, “Gen-
[289] D. Hendrycks and K. Gimpel, “Gaussian error linear                 erating long sequences with sparse transformers,”
      units (gelus),” arXiv preprint arXiv:1606.08415, 2016.             CoRR, vol. abs/1904.10509, 2019.
[290] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier,           [300]   N. Shazeer, “Fast transformer decoding: One write-
      “Language modeling with gated convolutional net-                   head is all you need,” CoRR, vol. abs/1911.02150,
      works,” in Proceedings of the 34th International Confer-           2019. [Online]. Available: http://arxiv.org/abs/1911.
      ence on Machine Learning, ICML 2017, Sydney, NSW,                  02150
      Australia, 6-11 August 2017, 2017, pp. 933–941.            [301]   J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy,
[291] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S.                  F. Lebrón, and S. Sanghai, “Gqa: Training gener-
      Bari, S. Biderman, H. Elsahar, N. Muennighoff,                     alized multi-query transformer models from multi-
      J. Phang, O. Press, C. Raffel, V. Sanh, S. Shen,                   head checkpoints,” arXiv preprint arXiv:2305.13245,
      L. Sutawika, J. Tae, Z. X. Yong, J. Launay, and I. Belt-           2023.
      agy, “What language model to train if you have one         [302]   T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re,
      million GPU hours?” in Findings of the Association for             “Flashattention: Fast and memory-efficient exact at-
      Computational Linguistics: EMNLP 2022, Abu Dhabi,                  tention with IO-awareness,” in NeurIPS, 2022.
      United Arab Emirates, December 7-11, 2022, 2022, pp.       [303]   T. Dao, “Flashattention-2: Faster attention with better
      765–782.                                                           parallelism and work partitioning,” arXiv preprint
[292] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-                      arXiv:2307.08691, 2023.
      attention with relative position representations,”         [304]   “vllm: Easy, fast, and cheap llm serving with
      in Proceedings of the 2018 Conference of the North                 pagedattention.” [Online]. Available: https://vllm.
      American Chapter of the Association for Computational              ai/
      Linguistics: Human Language Technologies, NAACL-           [305]   K. Murray and D. Chiang, “Correcting length bias in
      HLT, New Orleans, Louisiana, USA, June 1-6, 2018,                  neural machine translation,” in WMT. Association
      Volume 2 (Short Papers), M. A. Walker, H. Ji,                      for Computational Linguistics, 2018, pp. 212–223.
      and A. Stent, Eds. Association for Computational           [306]   A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
      Linguistics, 2018, pp. 464–468. [Online]. Available:               “The curious case of neural text degeneration,” in
      https://doi.org/10.18653/v1/n18-2074                               ICLR, 2020.
[293] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell,                 [307]   C.-M. U. P. P. D. O. C. SCIENCE, Speech Under-
      Q. V. Le, and R. Salakhutdinov, “Transformer-xl:                   standing Systems. Summary of Results of the Five-Year
      Attentive language models beyond a fixed-length                    Research Effort at Carnegie-Mellon University, 1977.
      context,” in Proceedings of the 57th Conference of         [308]   P. Koehn and R. Knowles, “Six challenges for neural
      the Association for Computational Linguistics, ACL                 machine translation,” in NMT@ACL. Association
      2019, Florence, Italy, July 28- August 2, 2019, Volume             for Computational Linguistics, 2017, pp. 28–39.
      1: Long Papers, A. Korhonen, D. R. Traum, and              [309]   Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
      L. Màrquez, Eds. Association for Computational                    W. Macherey, M. Krikun, Y. Cao, Q. Gao,
                                                                                                                               114

        K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu,           2019, pp. 103–112.
        L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa,          [322] A. Harlap, D. Narayanan, A. Phanishayee, V. Se-
        K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young,              shadri, N. R. Devanur, G. R. Ganger, and P. B. Gib-
        J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor-              bons, “Pipedream: Fast and efficient pipeline parallel
        rado, M. Hughes, and J. Dean, “Google’s neural                   DNN training,” CoRR, vol. abs/1806.03377, 2018.
        machine translation system: Bridging the gap be-           [323] P. Micikevicius, S. Narang, J. Alben, G. F. Di-
        tween human and machine translation,” CoRR, vol.                 amos, E. Elsen, D. Garcı́a, B. Ginsburg, M. Houston,
        abs/1609.08144, 2016.                                            O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed pre-
[310]   R. Paulus, C. Xiong, and R. Socher, “A deep re-                  cision training,” CoRR, vol. abs/1710.03740, 2017.
        inforced model for abstractive summarization,” in          [324] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient
        ICLR (Poster). OpenReview.net, 2018.                             2d method for training super-large deep learning
[311]   A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju,                 models,” CoRR, vol. abs/2104.05343, 2021.
        Q. Sun, S. Lee, D. J. Crandall, and D. Batra, “Diverse     [325] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract:
        beam search: Decoding diverse solutions from neural              Parallelize the tensor parallelism efficiently,” in Pro-
        sequence models,” CoRR, vol. abs/1610.02424, 2016.               ceedings of the 51st International Conference on Parallel
[312]   A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical               Processing, ICPP 2022, Bordeaux, France, 29 August
        neural story generation,” in ACL (1). Association for            2022 - 1 September 2022. ACM, 2022.
        Computational Linguistics, 2018, pp. 889–898.              [326] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing
[313]   J. Hewitt, C. D. Manning, and P. Liang, “Trunca-                 parallelism in distributed training for huge neural
        tion sampling as language model desmoothing,” in                 networks,” CoRR, vol. abs/2105.14450, 2021.
        EMNLP (Findings). Association for Computational            [327] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Se-
        Linguistics, 2022, pp. 3414–3427.                                quence parallelism: Long sequence training from
[314]   Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and                system perspective,” arXiv e-prints, pp. arXiv–2105,
        N. Collier, “A contrastive framework for neural text             2021.
        generation,” in NeurIPS, 2022.                             [328] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen,
[315]   C. Meister, T. Pimentel, G. Wiher, and R. Cotterell,             Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing
        “Locally typical sampling,” Trans. Assoc. Comput. Lin-           et al., “Alpa: Automating inter-and {Intra-Operator}
        guistics, 2023.                                                  parallelism for distributed deep learning,” in OSDI,
[316]   X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eis-               2022, pp. 559–578.
        ner, T. Hashimoto, L. Zettlemoyer, and M. Lewis,           [329] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training
        “Contrastive decoding: Open-ended text generation                deep nets with sublinear memory cost,” CoRR, vol.
        as optimization,” in ACL (1). Association for Com-               abs/1604.06174, 2016.
        putational Linguistics, 2023, pp. 12 286–12 312.           [330] FairScale authors, “Fairscale: A general purpose
[317]   Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and              modular pytorch library for high performance
        P. He, “Dola: Decoding by contrasting layers im-                 and large scale training,” https://github.com/
        proves factuality in large language models,” CoRR,               facebookresearch/fairscale, 2021.
        vol. abs/2309.03883, 2023.                                 [331] R. Lou, K. Zhang, and W. Yin, “Is prompt all you
[318]   D. P. Kingma and J. Ba, “Adam: A method for                      need? no. A comprehensive and broader view of in-
        stochastic optimization,” in 3rd International Confer-           struction learning,” CoRR, vol. abs/2303.10475, 2023.
        ence on Learning Representations, ICLR 2015, San Diego,    [332] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep
        CA, USA, May 7-9, 2015, Conference Track Proceedings,            neural networks for natural language understand-
        Y. Bengio and Y. LeCun, Eds., 2015.                              ing,” in ACL (1). Association for Computational
[319]   I. Loshchilov and F. Hutter, “Fixing weight decay                Linguistics, 2019, pp. 4487–4496.
        regularization in adam,” CoRR, vol. abs/1711.05101,        [333] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen,
        2017.                                                            L. Zettlemoyer, and S. Gupta, “Muppet: Massive
[320]   N. Shazeer and M. Stern, “Adafactor: Adaptive                    multi-task representations with pre-finetuning,” in
        learning rates with sublinear memory cost,” in Pro-              EMNLP (1). Association for Computational Linguis-
        ceedings of the 35th International Conference on Machine         tics, 2021, pp. 5799–5811.
        Learning, ICML 2018, Stockholmsmässan, Stockholm,         [334] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung,
        Sweden, July 10-15, 2018, ser. Proceedings of Machine            Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and
        Learning Research, J. G. Dy and A. Krause, Eds.,                 A. Roberts, “The flan collection: Designing data and
        vol. 80. PMLR, 2018, pp. 4603–4611.                              methods for effective instruction tuning,” CoRR, vol.
[321]   Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen,                 abs/2301.13688, 2023.
        M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and         [335] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,
        Z. Chen, “Gpipe: Efficient training of giant neural              C. Tao, and D. Jiang, “Wizardlm: Empowering large
        networks using pipeline parallelism,” in Advances                language models to follow complex instructions,”
        in Neural Information Processing Systems 32: Annual              CoRR, vol. abs/2304.12244, 2023. [Online]. Available:
        Conference on Neural Information Processing Systems              https://doi.org/10.48550/arXiv.2304.12244
        2019, NeurIPS 2019, December 8-14, 2019, Vancouver,        [336] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox,
        BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelz-            Y. Yang, and C. Gan, “Principle-driven self-alignment
        imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,          of language models from scratch with minimal hu-
                                                                                                                               115

        man supervision,” arXiv preprint arXiv:2305.03047,                1894.
        2023.                                                     [350]   Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R.
[337]   X. Li, P. Yu, C. Zhou, T. Schick, L. Zettle-                      Chandu, D. Wadden, K. MacMillan, N. A. Smith,
        moyer, O. Levy, J. Weston, and M. Lewis, “Self-                   I. Beltagy, and H. Hajishirzi, “How far can camels
        alignment with instruction backtranslation,” CoRR,                go? exploring the state of instruction tuning on open
        vol. abs/2308.06259, 2023.                                        resources,” CoRR, vol. abs/2306.04751, 2023.
[338]   C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma,   [351]   X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin,
        A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for            “Scaling laws of rope-based extrapolation,” CoRR,
        alignment,” arXiv preprint arXiv:2305.11206, 2023.                vol. abs/2310.05209, 2023.
[339]   L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Ya-     [352]   B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruc-
        dav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and               tion tuning with GPT-4,” CoRR, vol. abs/2304.03277,
        H. Jin, “Alpagasus: Training A better alpaca with                 2023.
        fewer data,” CoRR, vol. abs/2307.08701, 2023.             [353]   M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgib-
[340]   S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal,                   bon, “Efficient sequence packing without cross-
        H. Palangi, and A. H. Awadallah, “Orca: Progressive               contamination: Accelerating large language mod-
        learning from complex explanation traces of GPT-4,”               els without impacting performance,” arXiv preprint
        CoRR, vol. abs/2306.02707, 2023.                                  arXiv:2107.02027, 2021.
[341]   YuLan-Chat-Team, “Yulan-chat: An open-source              [354]   K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei,
        bilingual chatbot,” https://github.com/RUC-GSAI/                  H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
        YuLan-Chat, 2023.                                                 S. Pfohl et al., “Large language models encode clinical
[342]   Y. Huang, X. Liu, Y. Gong, Z. Gou, Y. Shen, N. Duan,              knowledge,” arXiv preprint arXiv:2212.13138, 2022.
        and W. Chen, “Key-point-driven data synthesis with        [355]   J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and
        its enhancement on mathematical reasoning,” CoRR,                 J. Wen, “Recommendation as instruction following:
        vol. abs/2403.02333, 2024.                                        A large language model empowered recommenda-
[343]   N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun,           tion approach,” CoRR, vol. abs/2305.07001, 2023.
        and B. Zhou, “Enhancing chat language models by           [356]   H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and
        scaling high-quality instructional conversations,” in             T. Liu, “Huatuo: Tuning llama model with chinese
        Proceedings of the 2023 Conference on Empirical Methods           medical knowledge,” arXiv preprint arXiv:2304.06975,
        in Natural Language Processing, EMNLP 2023, Sin-                  2023.
        gapore, December 6-10, 2023, H. Bouamor, J. Pino,         [357]   Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen,
        and K. Bali, Eds. Association for Computational                   Z. Wu, and Y. Feng, “Lawyer llama technical report,”
        Linguistics, 2023, pp. 3029–3051.                                 arXiv preprint arXiv:2305.15062, 2023.
[344]   K. Zhou, B. Zhang, J. Wang, Z. Chen, W. X. Zhao,          [358]   S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
        J. Sha, Z. Sheng, S. Wang, and J. Wen, “Jiuzhang3.0:              S. Gehrmann, P. Kambadur, D. Rosenberg, and
        Efficiently improving mathematical reasoning by                   G. Mann, “Bloomberggpt: A large language model
        training small data synthesis models,” CoRR, vol.                 for finance,” arXiv preprint arXiv:2303.17564, 2023.
        abs/2405.14365, 2024.                                     [359]   T. Liu and B. K. H. Low, “Goat: Fine-tuned llama out-
[345]   Y. Cao, Y. Kang, and L. Sun, “Instruction mining:                 performs gpt-4 on arithmetic tasks,” arXiv preprint
        High-quality instruction data selection for large lan-            arXiv:2305.14201, 2023.
        guage models,” CoRR, vol. abs/2307.06290, 2023.           [360]   T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan,
[346]   M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen,                         X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng,
        N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From                    Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu,
        quantity to quality: Boosting LLM performance with                Z. Yin, X. Huang, and X. Qiu, “Moss: Training con-
        self-guided data selection for instruction tuning,”               versational language models from synthetic data,”
        CoRR, vol. abs/2308.12032, 2023. [Online]. Available:             2023.
        https://doi.org/10.48550/arXiv.2308.12032                 [361]   Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani,
[347]   O. Sener and S. Savarese, “Active learning                        J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto,
        for convolutional neural networks: A core-set                     “Alpacafarm: A simulation framework for methods
        approach,” in 6th International Conference on Learning            that learn from human feedback,” CoRR, vol.
        Representations, ICLR 2018, Vancouver, BC, Canada,                abs/2305.14387, 2023. [Online]. Available: https:
        April 30 - May 3, 2018, Conference Track Proceedings.             //doi.org/10.48550/arXiv.2305.14387
        OpenReview.net, 2018. [Online]. Available: https:         [362]   D. Hendrycks, C. Burns, S. Basart, A. Zou,
        //openreview.net/forum?id=H1aIuk-RW                               M. Mazeika, D. Song, and J. Steinhardt, “Measur-
[348]   M. Xia, S. Malladi, S. Gururangan, S. Arora,                      ing massive multitask language understanding,” in
        and D. Chen, “LESS: selecting influential                         ICLR. OpenReview.net, 2021.
        data for targeted instruction tuning,” CoRR,              [363]   M. Suzgun, N. Scales, N. Schärli, S. Gehrmann,
        vol. abs/2402.04333, 2024. [Online]. Available:                   Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H.
        https://doi.org/10.48550/arXiv.2402.04333                         Chi, D. Zhou, and J. Wei, “Challenging big-bench
[349]   P. W. Koh and P. Liang, “Understanding black-box                  tasks and whether chain-of-thought can solve them,”
        predictions via influence functions,” in International            CoRR, vol. abs/2210.09261, 2022.
        conference on machine learning. PMLR, 2017, pp. 1885–     [364]   Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel,
                                                                                                                                116

        V. Mikulik, and G. Irving, “Alignment of language                    in large language models part i: Ppo,” arXiv preprint
        agents,” CoRR, vol. abs/2103.14659, 2021.                            arXiv:2307.04964, 2023.
[365]   D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown,              [374]   J. Uesato, N. Kushman, R. Kumar, H. F. Song,
        A. Radford, D. Amodei, P. F. Christiano, and G. Irv-                 N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and
        ing, “Fine-tuning language models from human pref-                   I. Higgins, “Solving math word problems with
        erences,” CoRR, vol. abs/1909.08593, 2019.                           process- and outcome-based feedback,” CoRR, vol.
[366]   A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli,                    abs/2211.14275, 2022.
        T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das-           [375]   H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
        Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez,                   B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
        J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B.                  and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
        Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka-                  abs/2305.20050, 2023.
        plan, “A general language assistant as a laboratory          [376]   D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
        for alignment,” CoRR, vol. abs/2112.00861, 2021.                     A. Arora, E. Guo, C. Burns, S. Puranik, H. He,
[367]   E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring,                     D. Song, and J. Steinhardt, “Measuring coding chal-
        J. Aslanides, A. Glaese, N. McAleese, and G. Irving,                 lenge competence with APPS,” in NeurIPS Datasets
        “Red teaming language models with language mod-                      and Benchmarks, 2021.
        els,” in Proceedings of the 2022 Conference on Empir-        [377]   T. Wang, P. Yu, X. E. Tan, S. O’Brien, R. Pa-
        ical Methods in Natural Language Processing, EMNLP                   sunuru, J. Dwivedi-Yu, O. Golovneva, L. Zettle-
        2022, Abu Dhabi, United Arab Emirates, December 7-11,                moyer, M. Fazel-Zarandi, and A. Celikyilmaz, “Shep-
        2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.                   herd: A critic for language model generation,” CoRR,
        Association for Computational Linguistics, 2022, pp.                 vol. abs/2308.04592, 2023.
        3419–3448.                                                   [378]   G. Chen, M. Liao, C. Li, and K. Fan, “Alphamath
[368]   J. Menick, M. Trebacz, V. Mikulik, J. Aslanides,                     almost zero: process supervision without process,”
        H. F. Song, M. Chadwick, M. Glaese, S. Young,                        CoRR, vol. abs/2405.03553, 2024.
        L.     Campbell-Gillingham,       G.      Irving,     and    [379]   Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and
        N. McAleese, “Teaching language models to                            H. Yang, “Let’s reward step by step: Step-level re-
        support answers with verified quotes,” CoRR, vol.                    ward model as the navigators for reasoning,” CoRR,
        abs/2203.11147, 2022.                                                vol. abs/2310.10080, 2023.
[369]   Y. Bai, S. Kadavath, S. Kundu, A. Askell,                    [380]   Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang,
        J. Kernion, A. Jones, A. Chen, A. Goldie,                            D. Zhang, and J. Wen, “Improving large language
        A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson,                      models via fine-grained reinforcement learning
        C. Olah, D. Hernandez, D. Drain, D. Ganguli,                         with minimum editing constraint,” CoRR, vol.
        D. Li, E. Tran-Johnson, E. Perez, J. Kerr,                           abs/2401.06081, 2024. [Online]. Available: https:
        J. Mueller, J. Ladish, J. Landau, K. Ndousse,                        //doi.org/10.48550/arXiv.2401.06081
        K. Lukosiute, L. Lovitt, M. Sellitto, N. Elhage,             [381]   Z. Xi, W. Chen, B. Hong, S. Jin, R. Zheng, W. He,
        N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby,                    Y. Ding, S. Liu, X. Guo, J. Wang, H. Guo, W. Shen,
        R. Larson, S. Ringer, S. Johnston, S. Kravec,                        X. Fan, Y. Zhou, S. Dou, X. Wang, X. Zhang,
        S. E. Showk, S. Fort, T. Lanham, T. Telleen-                         P. Sun, T. Gui, Q. Zhang, and X. Huang, “Train-
        Lawton, T. Conerly, T. Henighan, T. Hume, S. R.                      ing large language models for reasoning through
        Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei,                       reverse curriculum reinforcement learning,” CoRR,
        N. Joseph, S. McCandlish, T. Brown, and J. Kaplan,                   vol. abs/2402.05808, 2024.
        “Constitutional AI: harmlessness from AI feedback,”          [382]   D. Silver, J. Schrittwieser, K. Simonyan,
        CoRR, vol. abs/2212.08073, 2022. [Online]. Available:                I. Antonoglou, A. Huang, A. Guez, T. Hubert,
        https://doi.org/10.48550/arXiv.2212.08073                            L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap,
[370]   H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,                   F. Hui, L. Sifre, G. van den Driessche, T. Graepel,
        C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:                       and D. Hassabis, “Mastering the game of go without
        scaling reinforcement learning from human feedback                   human knowledge,” Nat., pp. 354–359, 2017.
        with AI feedback,” CoRR, vol. abs/2309.00267, 2023.          [383]   T. Anthony, Z. Tian, and D. Barber, “Thinking fast
[371]   H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao,                        and slow with deep learning and tree search,” in
        J. Zhang, K. Shum, and T. Zhang, “RAFT:                              Advances in Neural Information Processing Systems 30:
        reward ranked finetuning for generative foundation                   Annual Conference on Neural Information Processing
        model alignment,” CoRR, vol. abs/2304.06767, 2023.                   Systems 2017, December 4-9, 2017, Long Beach, CA,
        [Online]. Available: https://doi.org/10.48550/arXiv.                 USA, 2017, pp. 5360–5370.
        2304.06767                                                   [384]   H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao,
[372]   A. Askell, Y. Bai, A. Chen, D. Drain, D. Gan-                        X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizard-
        guli, T. Henighan, A. Jones, N. Joseph, B. Mann,                     math: Empowering mathematical reasoning for large
        N. DasSarma et al., “A general language assis-                       language models via reinforced evol-instruct,” CoRR,
        tant as a laboratory for alignment,” arXiv preprint                  vol. abs/2308.09583, 2023.
        arXiv:2112.00861, 2021.                                      [385]   R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and
[373]   R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu,                  S. Vosoughi, “Second thoughts are best: Learning
        S. Jin, Q. Liu, L. Xiong, L. Chen et al., “Secrets of rlhf           to re-align with human values from text edits,” in
                                                                                                                                 117

        NeurIPS, 2022.                                                     J. E. Gonzalez, “The wisdom of hindsight makes
[386]   X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West,           language models better instruction followers,”
        P. Ammanabrolu, and Y. Choi, “QUARK: control-                      CoRR, vol. abs/2302.05206, 2023. [Online]. Available:
        lable text generation with reinforced unlearning,” in              https://doi.org/10.48550/arXiv.2302.05206
        NeurIPS, 2022.                                             [401]   A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne,
[387]   J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan,                  “Imitation learning: A survey of learning methods,”
        A. Chen, K. Cho, and E. Perez, “Training language                  ACM Comput. Surv., vol. 50, no. 2, apr 2017. [Online].
        models with language feedback at scale,” CoRR, vol.                Available: https://doi.org/10.1145/3054912
        abs/2303.16755, 2023.                                      [402]   S. Levine, “Should i imitate or reinforce,”
[388]   G. Guo, R. Zhao, T. Tang, W. X. Zhao, and                          2022. [Online]. Available: https://www.youtube.
        J.-R. Wen, “Beyond imitation: Leveraging fine-                     com/watch?v=sVPm7zOrBxM
        grained quality signals for alignment,” arXiv preprint     [403]   J. Schulman, “Reinforcement learning from
        arXiv:2311.04072, 2023.                                            human feedback: Progress and challenges,” 2023.
[389]   R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein,               [Online]. Available: https://www.youtube.com/
        “Socially situated artificial intelligence enables                 watch?v=hhiLw5Q UFg
        learning from human interaction,” Proceedings of the       [404]   X. L. Li and P. Liang, “Prefix-tuning: Optimizing
        National Academy of Sciences of the United States of               continuous prompts for generation,” in Proceedings of
        America, vol. 119, 2022. [Online]. Available: https:               the 59th Annual Meeting of the Association for Compu-
        //api.semanticscholar.org/CorpusID:252381954                       tational Linguistics and the 11th International Joint Con-
[390]   H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hind-               ference on Natural Language Processing, ACL/IJCNLP
        sight aligns language models with feedback,” CoRR,                 2021, (Volume 1: Long Papers), Virtual Event, August
        vol. abs/2302.02676, 2023.                                         1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds.
[391]   R. Rafailov, A. Sharma, E. Mitchell, S. Ermon,                     Association for Computational Linguistics, 2021, pp.
        C. D. Manning, and C. Finn, “Direct preference                     4582–4597.
        optimization: Your language model is secretly a            [405]   B. Lester, R. Al-Rfou, and N. Constant, “The power
        reward model,” CoRR, vol. abs/2305.18290, 2023.                    of scale for parameter-efficient prompt tuning,” in
        [Online]. Available: https://doi.org/10.48550/arXiv.               Proceedings of the 2021 Conference on Empirical Methods
        2305.18290                                                         in Natural Language Processing, EMNLP 2021, Virtual
[392]   K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky,                 Event / Punta Cana, Dominican Republic, 7-11 Novem-
        and D. Kiela, “KTO: model alignment as prospect                    ber, 2021, M. Moens, X. Huang, L. Specia, and S. W.
        theoretic optimization,” CoRR, vol. abs/2402.01306,                Yih, Eds. Association for Computational Linguistics,
        2024.                                                              2021, pp. 3045–3059.
[393]   Y. Meng, M. Xia, and D. Chen, “Simpo: Simple pref-         [406]   N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone,
        erence optimization with a reference-free reward,”                 Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and
        CoRR, vol. abs/2405.14734, 2024.                                   S. Gelly, “Parameter-efficient transfer learning for
[394]   D. Feng, B. Qin, C. Huang, Z. Zhang, and W. Lei,                   NLP,” in Proceedings of the 36th International Confer-
        “Towards analyzing and understanding the limita-                   ence on Machine Learning, ICML 2019, 9-15 June 2019,
        tions of DPO: A theoretical perspective,” CoRR, vol.               Long Beach, California, USA, 2019, pp. 2790–2799.
        abs/2404.04626, 2024.                                      [407]   Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee,
[395]   A. Gorbatovski, B. Shaposhnikov, A. Malakhov,                      L. Bing, and S. Poria, “Llm-adapters: An adapter
        N. Surnachev, Y. Aksenov, I. Maksimov, N. Balagan-                 family for parameter-efficient fine-tuning of large
        sky, and D. Gavrilov, “Learn your reference model                  language models,” CoRR, vol. abs/2304.01933, 2023.
        for real good alignment,” CoRR, vol. abs/2404.09656,       [408]   J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and
        2024.                                                              G. Neubig, “Towards a unified view of parameter-
[396]   D. Kim, Y. Kim, W. Song, H. Kim, Y. Kim, S. Kim,                   efficient transfer learning,” in The Tenth International
        and C. Park, “sdpo: Don’t use your data all at once,”              Conference on Learning Representations, ICLR 2022, Vir-
        CoRR, vol. abs/2403.19270, 2024.                                   tual Event, April 25-29, 2022. OpenReview.net, 2022.
[397]   Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and           [409]   X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P-
        F. Huang, “RRHF: rank responses to align language                  tuning v2: Prompt tuning can be comparable to fine-
        models with human feedback without tears,”                         tuning universally across scales and tasks,” CoRR,
        CoRR, vol. abs/2304.05302, 2023. [Online]. Available:              vol. abs/2110.07602, 2021.
        https://doi.org/10.48550/arXiv.2304.05302                  [410]   X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang,
[398]   Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh,                   and J. Tang, “GPT understands, too,” CoRR, vol.
        and P. J. Liu, “Slic-hf: Sequence likelihood calibration           abs/2103.10385, 2021.
        with human feedback,” CoRR, vol. abs/2305.10425,           [411]   Y. Gu, X. Han, Z. Liu, and M. Huang, “Ppt: Pre-
        2023.                                                              trained prompt tuning for few-shot learning,” in Pro-
[399]   A. Fisch, J. Eisenstein, V. Zayats, A. Agarwal,                    ceedings of the 60th Annual Meeting of the Association
        A. Beirami, C. Nagpal, P. Shaw, and J. Berant, “Ro-                for Computational Linguistics (Volume 1: Long Papers),
        bust preference optimization through reward model                  2022, pp. 8410–8423.
        distillation,” CoRR, vol. abs/2405.19316, 2024.            [412]   Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can
[400]   T. Zhang, F. Liu, J. Wong, P. Abbeel, and                          we know what language models know?” Transactions
                                                                                                                                118

        of the Association for Computational Linguistics, vol. 8,           2023.
        pp. 423–438, 2020.                                          [424]   Y. Hao, Y. Sun, L. Dong, Z. Han, Y. Gu, and F. Wei,
[413]   T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace,                    “Structured prompting: Scaling in-context learning
        and S. Singh, “Autoprompt: Eliciting knowledge                      to 1, 000 examples,” CoRR, 2022.
        from language models with automatically gener-              [425]   Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stene-
        ated prompts,” in Proceedings of the 2020 Conference                torp, “Fantastically ordered prompts and where to
        on Empirical Methods in Natural Language Processing                 find them: Overcoming few-shot prompt order sen-
        (EMNLP), 2020, pp. 4222–4235.                                       sitivity,” in Proceedings of the 60th Annual Meeting of
[414]   Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng,                    the Association for Computational Linguistics (Volume
        W. Chen, and T. Zhao, “Adaptive budget allocation                   1: Long Papers), ACL 2022, Dublin, Ireland, May 22-
        for parameter-efficient fine-tuning,” CoRR, vol.                    27, 2022, S. Muresan, P. Nakov, and A. Villavicencio,
        abs/2303.10512, 2023. [Online]. Available: https:                   Eds., 2022, pp. 8086–8098.
        //doi.org/10.48550/arXiv.2303.10512                         [426]   Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot,
[415]   M. Valipour, M. Rezagholizadeh, I. Kobyzev, and                     “Complexity-based prompting for multi-step reason-
        A. Ghodsi, “Dylora: Parameter efficient tuning                      ing,” CoRR, vol. abs/2210.00720, 2022.
        of pre-trained models using dynamic search-free             [427]   Z. Zhang, A. Zhang, M. Li, and A. Smola, “Auto-
        low-rank adaptation,” CoRR, vol. abs/2210.07558,                    matic chain of thought prompting in large language
        2022. [Online]. Available: https://doi.org/10.48550/                models,” CoRR, vol. abs/2210.03493, 2022.
        arXiv.2210.07558                                            [428]   A. Creswell, M. Shanahan, and I. Higgins,
[416]   N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su,                “Selection-inference: Exploiting large language mod-
        S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao,                els for interpretable logical reasoning,” CoRR, vol.
        X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang,             abs/2205.09712, 2022.
        J. Li, and M. Sun, “Parameter-efficient fine-tuning         [429]   X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi,
        of large-scale pre-trained language models,” Nature                 and D. Zhou, “Self-consistency improves chain of
        Machine Intelligence, vol. 5, pp. 1–16, 03 2023.                    thought reasoning in language models,” CoRR, vol.
[417]   R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li,             abs/2203.11171, 2022.
        P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-        [430]   Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou,
        tuning of language models with zero-init attention,”                and W. Chen, “On the advance of making language
        CoRR, vol. abs/2303.16199, 2023.                                    models better reasoners,” CoRR, vol. abs/2206.02336,
[418]   J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, “MAD-             2022.
        X: an adapter-based framework for multi-task cross-         [431]   X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi,
        lingual transfer,” in Proceedings of the 2020 Conference            and D. Zhou, “Rationale-augmented ensembles in
        on Empirical Methods in Natural Language Processing,                language models,” CoRR, 2022.
        EMNLP 2020, Online, November 16-20, 2020, B. Web-           [432]   D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales,
        ber, T. Cohn, Y. He, and Y. Liu, Eds. Association for               X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and
        Computational Linguistics, 2020, pp. 7654–7673.                     E. H. Chi, “Least-to-most prompting enables com-
[419]   S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada,                     plex reasoning in large language models,” CoRR, vol.
        and S. Paul, “Peft: State-of-the-art parameter-                     abs/2205.10625, 2022.
        efficient fine-tuning methods,” https://github.com/         [433]   T. Khot, H. Trivedi, M. Finlayson, Y. Fu,
        huggingface/peft, 2022.                                             K. Richardson, P. Clark, and A. Sabhar-
[420]   J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and                  wal, “Decomposed prompting: A modular
        W. Chen, “What makes good in-context examples for                   approach for solving complex tasks,” CoRR,
        gpt-3?” in Proceedings of Deep Learning Inside Out: The             vol. abs/2210.02406, 2022. [Online]. Available:
        3rd Workshop on Knowledge Extraction and Integration                https://doi.org/10.48550/arXiv.2210.02406
        for Deep Learning Architectures, DeeLIO@ACL 2022,           [434]   L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.
        Dublin, Ireland and Online, May 27, 2022, 2022, pp.                 Lee, and E. Lim, “Plan-and-solve prompting:
        100–114.                                                            Improving zero-shot chain-of-thought reasoning by
[421]   O. Rubin, J. Herzig, and J. Berant, “Learning to                    large language models,” CoRR, vol. abs/2305.04091,
        retrieve prompts for in-context learning,” in Pro-                  2023. [Online]. Available: https://doi.org/10.48550/
        ceedings of the 2022 Conference of the North American               arXiv.2305.04091
        Chapter of the Association for Computational Linguistics:   [435]   Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao,
        Human Language Technologies, NAACL 2022, Seattle,                   E. Wong, M. Apidianaki, and C. Callison-Burch,
        WA, United States, July 10-15, 2022, 2022, pp. 2655–                “Faithful chain-of-thought reasoning,” CoRR, vol.
        2671.                                                               abs/2301.13379, 2023.
[422]   H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and           [436]   L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
        S. Lee, “Self-generated in-context learning: Leverag-               J. Callan, and G. Neubig, “PAL: program-aided lan-
        ing auto-regressive language models as a demonstra-                 guage models,” CoRR, vol. abs/2211.10435, 2022.
        tion generator,” CoRR, vol. abs/2206.08082, 2022.           [437]   Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and
[423]   Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,               Y. Zhuang, “Hugginggpt: Solving ai tasks with chat-
        H. Chan, and J. Ba, “Large language models are                      gpt and its friends in huggingface,” arXiv preprint
        human-level prompt engineers,” in Proc. of ICLR,                    arXiv:2303.17580, 2023.
                                                                                                                              119

[438]   H. Sun, Y. Zhuang, L. Kong, B. Dai, and                          guage models,” arXiv preprint arXiv:2304.09842, 2023.
        C. Zhang, “Adaplanner: Adaptive planning from            [454]   R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian,
        feedback with language models,” arXiv preprint                   H. Wu, J.-R. Wen, and H. Wang, “Investigating
        arXiv:2305.16653, 2023.                                          the factual knowledge boundary of large language
[439]   Y. Lu, P. Lu, Z. Chen, W. Zhu, X. E. Wang, and W. Y.             models with retrieval augmentation,” arXiv preprint
        Wang, “Multimodal procedural planning via dual                   arXiv:2307.11019, 2023.
        text-image prompting,” CoRR, vol. abs/2305.01795,        [455]   X. Amatriain, “Prompt design and engineering:
        2023.                                                            Introduction and advanced methods,” CoRR, vol.
[440]   S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang,           abs/2401.14423, 2024.
        and Z. Hu, “Reasoning with language model is plan-       [456]   Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
        ning with world model,” CoRR, vol. abs/2305.14992,               and W. X. Zhao, “Large language models are zero-
        2023.                                                            shot rankers for recommender systems,” CoRR, vol.
[441]   Z. Chen, K. Zhou, B. Zhang, Z. Gong, W. X.                       abs/2305.08845, 2023.
        Zhao, and J. Wen, “Chatcot: Tool-augmented chain-        [457]   S. Chang and E. Fosler-Lussier, “How to prompt
        of-thought reasoning on chat-based large language                llms for text-to-sql: A study in zero-shot, single-
        models,” CoRR, vol. abs/2305.14323, 2023.                        domain, and cross-domain settings,” CoRR, vol.
[442]   S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran,                       abs/2305.11853, 2023. [Online]. Available: https:
        K. Narasimhan, and Y. Cao, “React: Synergizing rea-              //doi.org/10.48550/arXiv.2305.11853
        soning and acting in language models,” CoRR, vol.        [458]   Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum,
        abs/2210.03629, 2022.                                            J. Geiping, and T. Goldstein, “Hard prompts
[443]   N. Shinn, F. Cassano, B. Labash, A. Gopinath,                    made easy: Gradient-based discrete optimization
        K. Narasimhan, and S. Yao, “Reflexion: Language                  for prompt tuning and discovery,” CoRR, vol.
        agents with verbal reinforcement learning,” 2023.                abs/2302.03668, 2023. [Online]. Available: https:
[444]   S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,             //doi.org/10.48550/arXiv.2302.03668
        Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib-     [459]   T. Gao, A. Fisch, and D. Chen, “Making pre-trained
        erate problem solving with large language models,”               language models better few-shot learners,” in Pro-
        CoRR, vol. abs/2305.10601, 2023.                                 ceedings of the 59th Annual Meeting of the Association
[445]   V. Liu and L. B. Chilton, “Design guidelines for                 for Computational Linguistics and the 11th International
        prompt engineering text-to-image generative mod-                 Joint Conference on Natural Language Processing, ACL/I-
        els,” in Proceedings of the 2022 CHI Conference on               JCNLP 2021, (Volume 1: Long Papers), Virtual Event,
        Human Factors in Computing Systems, 2022, pp. 1–23.              August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Nav-
[446]   J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea,                  igli, Eds. Association for Computational Linguistics,
        H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C.             2021, pp. 3816–3830.
        Schmidt, “A prompt pattern catalog to enhance            [460]   L. Chen, J. Chen, T. Goldstein, H. Huang,
        prompt engineering with chatgpt,” arXiv preprint                 and T. Zhou, “Instructzero: Efficient instruction
        arXiv:2302.11382, 2023.                                          optimization for black-box large language models,”
[447]   S. K. K. Santu and D. Feng, “Teler: A general                    CoRR, vol. abs/2306.03082, 2023. [Online]. Available:
        taxonomy of LLM prompts for benchmarking                         https://doi.org/10.48550/arXiv.2306.03082
        complex tasks,” CoRR, vol. abs/2305.11430, 2023.         [461]   X. Lin, Z. Wu, Z. Dai, W. Hu, Y. Shu, S. Ng, P. Jaillet,
        [Online]. Available: https://doi.org/10.48550/arXiv.             and B. K. H. Low, “Use your INSTINCT: instruc-
        2305.11430                                                       tion optimization using neural bandits coupled with
[448]   OpenAI, “Gpt best practices,” OpenAI, 2023.                      transformers,” CoRR, vol. abs/2310.02905, 2023.
        [Online]. Available: https://platform.openai.com/        [462]   M. Deng, J. Wang, C. Hsieh, Y. Wang, H. Guo, T. Shu,
        docs/guides/gpt-best-practices                                   M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimiz-
[449]   Contributors, “Ai short,” 2023. [Online]. Available:             ing discrete text prompts with reinforcement learn-
        https://www.aishort.top/                                         ing,” in Proceedings of the 2022 Conference on Empir-
[450]   ——, “Awesome chatgpt prompts,” Github,                           ical Methods in Natural Language Processing, EMNLP
        2023. [Online]. Available: https://github.com/f/                 2022, Abu Dhabi, United Arab Emirates, December 7-11,
        awesome-chatgpt-prompts/                                         2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.
[451]   J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and               Association for Computational Linguistics, 2022, pp.
        J. Wen, “Structgpt: A general framework for large                3369–3391.
        language model to reason over structured data,”          [463]   T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and
        CoRR, vol. abs/2305.09645, 2023.                                 J. E. Gonzalez, “TEMPERA: test-time prompt editing
[452]   L. Beurer-Kellner, M. Fischer, and M. Vechev,                    via reinforcement learning,” in The Eleventh Inter-
        “Prompting is programming: A query language for                  national Conference on Learning Representations, ICLR
        large language models,” Proceedings of the ACM on                2023, Kigali, Rwanda, May 1-5, 2023.           OpenRe-
        Programming Languages, vol. 7, no. PLDI, pp. 1946–               view.net, 2023.
        1969, 2023.                                              [464]   Y. Jafari, D. Mekala, R. Yu, and T. Berg-Kirkpatrick,
[453]   P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang,                “Morl-prompt: An empirical analysis of multi-
        Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-               objective reinforcement learning for discrete prompt
        and-play compositional reasoning with large lan-                 optimization,” CoRR, vol. abs/2402.11711, 2024.
                                                                                                                               120

[465]   W. Kong, S. A. Hombaiah, M. Zhang, Q. Mei, and                   models enables expert-level prompt optimization,”
        M. Bendersky, “Prewrite: Prompt rewriting with re-               CoRR, vol. abs/2310.16427, 2023. [Online]. Available:
        inforcement learning,” CoRR, vol. abs/2401.08189,                https://doi.org/10.48550/arXiv.2310.16427
        2024.                                                    [477]   T. Tang, J. Li, W. X. Zhao, and J. Wen, “Context-
[466]   H. Xu, Y. Chen, Y. Du, N. Shao, Y. Wang, H. Li,                  tuning: Learning contextualized prompts for natu-
        and Z. Yang, “GPS: genetic prompt search for effi-               ral language generation,” in Proceedings of the 29th
        cient few-shot learning,” in Proceedings of the 2022             International Conference on Computational Linguistics,
        Conference on Empirical Methods in Natural Language              COLING 2022, Gyeongju, Republic of Korea, October 12-
        Processing, EMNLP 2022, Abu Dhabi, United Arab Emi-              17, 2022, N. Calzolari, C. Huang, H. Kim, J. Puste-
        rates, December 7-11, 2022, Y. Goldberg, Z. Kozareva,            jovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Do-
        and Y. Zhang, Eds. Association for Computational                 natelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim,
        Linguistics, 2022, pp. 8162–8171.                                Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and
[467]   A. Prasad, P. Hase, X. Zhou, and M. Bansal,                      S. Na, Eds. International Committee on Computa-
        “Grips: Gradient-free, edit-based instruction search             tional Linguistics, 2022, pp. 6340–6354.
        for prompting large language models,” in Proceedings     [478]   T. Vu, B. Lester, N. Constant, R. Al-Rfou’, and D. Cer,
        of the 17th Conference of the European Chapter of the            “Spot: Better frozen model adaptation through soft
        Association for Computational Linguistics, EACL 2023,            prompt transfer,” in Proceedings of the 60th Annual
        Dubrovnik, Croatia, May 2-6, 2023, A. Vlachos and                Meeting of the Association for Computational Linguistics
        I. Augenstein, Eds. Association for Computational                (Volume 1: Long Papers), ACL 2022, Dublin, Ireland,
        Linguistics, 2023, pp. 3827–3846.                                May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavi-
[468]   Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis,            cencio, Eds. Association for Computational Linguis-
        H. Chan, and J. Ba, “Large language models are                   tics, 2022, pp. 5039–5059.
        human-level prompt engineers,” in The Eleventh           [479]   J. Li, T. Tang, J. Nie, J. Wen, and X. Zhao, “Learn-
        International Conference on Learning Representations,            ing to transfer prompts for text generation,” in Pro-
        ICLR 2023, Kigali, Rwanda, May 1-5, 2023. Open-                  ceedings of the 2022 Conference of the North American
        Review.net, 2023.                                                Chapter of the Association for Computational Linguistics:
[469]   R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu,                   Human Language Technologies, NAACL 2022, Seattle,
        and M. Zeng, “Automatic prompt optimization                      WA, United States, July 10-15, 2022, M. Carpuat,
        with ”gradient descent” and beam search,” CoRR,                  M. de Marneffe, and I. V. M. Ruı́z, Eds. Association
        vol. abs/2305.03495, 2023. [Online]. Available:                  for Computational Linguistics, 2022, pp. 3506–3518.
        https://doi.org/10.48550/arXiv.2305.03495                [480]   S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis,
[470]   C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le,                       H. Hajishirzi, and L. Zettlemoyer, “Rethinking the
        D. Zhou, and X. Chen, “Large language models                     role of demonstrations: What makes in-context learn-
        as optimizers,” CoRR, vol. abs/2309.03409, 2023.                 ing work?” in Proceedings of the 2022 Conference
        [Online]. Available: https://doi.org/10.48550/arXiv.             on Empirical Methods in Natural Language Processing,
        2309.03409                                                       EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
[471]   Q. Ye, M. Axmed, R. Pryzant, and F. Khani,                       cember 7-11, 2022. Association for Computational
        “Prompt engineering a prompt engineer,” CoRR, vol.               Linguistics, 2022, pp. 11 048–11 064.
        abs/2311.05661, 2023.                                    [481]   Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh,
[472]   X. Tang, X. Wang, W. X. Zhao, S. Lu, Y. Li, and                  “Calibrate before use: Improving few-shot perfor-
        J. Wen, “Unleashing the potential of large language              mance of language models,” in Proceedings of the 38th
        models as prompt optimizers: An analogical analysis              International Conference on Machine Learning, ICML
        with gradient-based model optimizers,” CoRR, vol.                2021, 18-24 July 2021, Virtual Event, M. Meila and
        abs/2402.17564, 2024.                                            T. Zhang, Eds., 2021, pp. 12 697–12 706.
[473]   H. Yang and K. Li, “Instoptima: Evolutionary             [482]   Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate
        multi-objective instruction optimization via large               empathetic dialogues? A novel in-context example
        language model-based instruction operators,” in                  selection method and automatic evaluation metric
        EMNLP (Findings). Association for Computational                  for empathetic dialogue generation,” in Proceedings
        Linguistics, 2023, pp. 13 593–13 602.                            of the 29th International Conference on Computational
[474]   Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan,                 Linguistics, COLING 2022, Gyeongju, Republic of Korea,
        G. Liu, J. Bian, and Y. Yang, “Connecting large                  October 12-17, 2022, N. Calzolari, C. Huang, H. Kim,
        language models with evolutionary algorithms                     J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen,
        yields powerful prompt optimizers,” CoRR, vol.                   L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue,
        abs/2309.08532, 2023.                                            S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond,
[475]   X. L. Do, Y. Zhao, H. Brown, Y. Xie, J. X. Zhao, N. F.           and S. Na, Eds. International Committee on Com-
        Chen, K. Kawaguchi, M. Q. Xie, and J. He, “Prompt                putational Linguistics, 2022, pp. 669–683.
        optimization via adversarial in-context learning,”       [483]   I. Levy, B. Bogin, and J. Berant, “Diverse demonstra-
        CoRR, vol. abs/2312.02614, 2023.                                 tions improve in-context compositional generaliza-
[476]   X. Wang, C. Li, Z. Wang, F. Bai, H. Luo,                         tion,” CoRR, vol. abs/2212.06800, 2022.
        J. Zhang, N. Jojic, E. P. Xing, and Z. Hu,               [484]   H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin,
        “Promptagent: Strategic planning with language                   R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith,
                                                                                                                                  121

        and T. Yu, “Selective annotation makes language                     guli, Z. Hatfield-Dodds, D. Hernandez, S. John-
        models better few-shot learners,” CoRR, 2022.                       ston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse,
[485]   X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett,            D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCan-
        and R. Pasunuru, “Complementary explanations for                    dlish, and C. Olah, “In-context learning and induc-
        effective in-context learning,” CoRR, 2022.                         tion heads,” CoRR, vol. abs/2209.11895, 2022.
[486]   X. Li and X. Qiu, “Finding supporting examples for          [500]   E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and
        in-context learning,” CoRR, 2023.                                   D. Zhou, “What learning algorithm is in-context
[487]   Y. Zhang, S. Feng, and C. Tan, “Active example                      learning? investigations with linear models,” CoRR,
        selection for in-context learning,” in Proceedings of               vol. abs/2211.15661, 2022.
        the 2022 Conference on Empirical Methods in Natural         [501]   J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu,
        Language Processing, EMNLP 2022, Abu Dhabi, United                  X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger
        Arab Emirates, December 7-11, 2022, 2022, pp. 9134–                 language models do in-context learning differently,”
        9148.                                                               arXiv preprint arXiv:2303.03846, 2023.
[488]   F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt out-        [502]   J. Coda-Forno, M. Binz, Z. Akata, M. M. Botvinick,
        performs crowd-workers for text-annotation tasks,”                  J. X. Wang, and E. Schulz, “Meta-in-context
        2023.                                                               learning in large language models,” CoRR, vol.
[489]   H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and                   abs/2305.12907, 2023.
        S. Lee, “Self-generated in-context learning: Leverag-       [503]   J. W. Wei, L. Hou, A. K. Lampinen, X. Chen,
        ing auto-regressive language models as a demonstra-                 D. Huang, Y. Tay, X. Chen, Y. Lu, D. Zhou, T. Ma, and
        tion generator,” CoRR, vol. abs/2206.08082, 2022.                   Q. V. Le, “Symbol tuning improves in-context learn-
[490]   S. M. Xie, A. Raghunathan, P. Liang, and T. Ma,                     ing in language models,” CoRR, vol. abs/2305.08298,
        “An explanation of in-context learning as implicit                  2023.
        bayesian inference,” in The Tenth International Con-        [504]   Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang,
        ference on Learning Representations, ICLR 2022, Virtual             W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of
        Event, April 25-29, 2022, 2022.                                     chain of thought reasoning: Advances, frontiers and
[491]   Z. Wu, Y. Wang, J. Ye, and L. Kong, “Self-adaptive in-              future,” CoRR, vol. abs/2309.15402, 2023.
        context learning,” CoRR, vol. abs/2212.10375, 2022.         [505]   S. Miao, C. Liang, and K. Su, “A diverse corpus
[492]   Y. Gu, L. Dong, F. Wei, and M. Huang, “Pre-training                 for evaluating and developing english math word
        to learn in context,” CoRR, vol. abs/2305.09137, 2023.              problem solvers,” in Proceedings of the 58th Annual
[493]   S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi,                Meeting of the Association for Computational Linguistics,
        “Metaicl: Learning to learn in context,” in Proceed-                ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai,
        ings of the 2022 Conference of the North American                   N. Schluter, and J. R. Tetreault, Eds. Association for
        Chapter of the Association for Computational Linguistics:           Computational Linguistics, 2020, pp. 975–984.
        Human Language Technologies, NAACL 2022, Seattle,           [506]   A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Com-
        WA, United States, July 10-15, 2022, M. Carpuat,                    monsenseqa: A question answering challenge tar-
        M. de Marneffe, and I. V. M. Ruı́z, Eds., 2022, pp.                 geting commonsense knowledge,” in Proceedings of
        2791–2809.                                                          the 2019 Conference of the North American Chapter of
[494]   M. Hahn and N. Goyal, “A theory of emergent                         the Association for Computational Linguistics: Human
        in-context learning as implicit structure induction,”               Language Technologies, NAACL-HLT 2019, Minneapolis,
        CoRR, vol. abs/2303.07971, 2023.                                    MN, USA, June 2-7, 2019, Volume 1 (Long and Short
[495]   J. Pan, T. Gao, H. Chen, and D. Chen, “What in-                     Papers), J. Burstein, C. Doran, and T. Solorio, Eds.
        context learning ”learns” in-context: Disentangling                 Association for Computational Linguistics, 2019, pp.
        task recognition and task learning,” CoRR, vol.                     4149–4158.
        abs/2305.09731, 2023.                                       [507]   T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwa-
[496]   N. Wies, Y. Levine, and A. Shashua, “The learnability               sawa, “Large language models are zero-shot reason-
        of in-context learning,” CoRR, vol. abs/2303.07895,                 ers,” CoRR, vol. abs/2205.11916, 2022.
        2023.                                                       [508]   W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Pro-
[497]   A. Webson and E. Pavlick, “Do prompt-based models                   gram of thoughts prompting: Disentangling com-
        really understand the meaning of their prompts?” in                 putation from reasoning for numerical reasoning
        Proceedings of the 2022 Conference of the North American            tasks,” CoRR, vol. abs/2211.12588, 2022.
        Chapter of the Association for Computational Linguistics:   [509]   L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang,
        Human Language Technologies, NAACL 2022, Seattle,                   J. Callan, and G. Neubig, “PAL: program-aided lan-
        WA, United States, July 10-15, 2022, 2022, pp. 2300–                guage models,” in International Conference on Ma-
        2344.                                                               chine Learning, ICML 2023, 23-29 July 2023, Honolulu,
[498]   J. von Oswald, E. Niklasson, E. Randazzo, J. Sacra-                 Hawaii, USA, A. Krause, E. Brunskill, K. Cho, B. En-
        mento, A. Mordvintsev, A. Zhmoginov, and M. Vla-                    gelhardt, S. Sabato, and J. Scarlett, Eds., 2023.
        dymyrov, “Transformers learn in-context by gradient         [510]   X. Zhao, Y. Xie, K. Kawaguchi, J. He, and Q. Xie, “Au-
        descent,” CoRR, vol. abs/2212.07677, 2022.                          tomatic model selection with large language models
[499]   C. Olsson, N. Elhage, N. Nanda, N. Joseph,                          for reasoning,” CoRR, vol. abs/2305.14333, 2023.
        N. DasSarma, T. Henighan, B. Mann, A. Askell,               [511]   Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou,
        Y. Bai, A. Chen, T. Conerly, D. Drain, D. Gan-                      and W. Chen, “Making large language models better
                                                                                                                              122

        reasoners with step-aware verifier,” 2023.               [526]   F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Sri-
[512]   O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch,              vats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder,
        and J. Berant, “Answering questions by meta-                     D. Zhou, D. Das, and J. Wei, “Language models are
        reasoning over multiple chains of thought,” CoRR,                multilingual chain-of-thought reasoners,” CoRR, vol.
        vol. abs/2304.13007, 2023.                                       abs/2210.03057, 2022.
[513]   Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memi-      [527]   J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
        sevic, and H. Su, “Deductive verification of chain-of-           tions of language models in arithmetic and symbolic
        thought reasoning,” CoRR, vol. abs/2306.03872, 2023.             induction,” CoRR, vol. abs/2208.05051, 2022.
[514]   T. Xue, Z. Wang, Z. Wang, C. Han, P. Yu, and H. Ji,      [528]   N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He,
        “RCOT: detecting and rectifying factual inconsis-                “ChatGPT is a Knowledgeable but Inexperienced
        tency in reasoning by reversing chain-of-thought,”               Solver: An Investigation of Commonsense Problem
        CoRR, vol. abs/2305.11499, 2023.                                 in Large Language Models,” CoRR, 2023.
[515]   Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and       [529]   S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths,
        J. Zhao, “Large language models are better reasoners             Y. Cao, and K. Narasimhan, “Tree of thoughts: Delib-
        with self-verification,” CoRR, abs/2212.09561, 2023.             erate problem solving with large language models,”
[516]   W. Jiang, H. Shi, L. Yu, Z. Liu, Y. Zhang, Z. Li,                CoRR, vol. abs/2305.10601, 2023.
        and J. T. Kwok, “Forward-backward reasoning in           [530]   G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
        large language models for mathematical verifica-                 Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
        tion,” 2023.                                                     open-ended embodied agent with large language
[517]   J. Long, “Large language model guided tree-of-                   models,” arXiv preprint arXiv:2305.16291, 2023.
        thought,” CoRR, vol. abs/2305.08291, 2023.               [531]   X. Jiang, Y. Dong, L. Wang, Q. Shang, and
[518]   S. Mo and M. Xin, “Tree of uncertain thoughts                    G. Li, “Self-planning code generation with large
        reasoning for large language models,” CoRR, vol.                 language model,” CoRR, vol. abs/2303.06689, 2023.
        abs/2309.07694, 2023.                                            [Online]. Available: https://doi.org/10.48550/arXiv.
[519]   M. Besta, N. Blach, A. Kubicek, R. Gerstenberger,                2303.06689
        L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski,      [532]   I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu,
        H. Niewiadomski, P. Nyczyk, and T. Hoefler, “Graph               J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Prog-
        of thoughts: Solving elaborate problems with large               prompt: Generating situated robot task plans using
        language models,” CoRR, vol. abs/2308.09687, 2023.               large language models,” CoRR, vol. abs/2209.11302,
[520]   B. Lei, P. Lin, C. Liao, and C. Ding, “Boosting log-             2022.
        ical reasoning in large language models through a        [533]   B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas,
        new framework: The graph of thought,” CoRR, vol.                 and P. Stone, “LLM+P: empowering large language
        abs/2308.08614, 2023.                                            models with optimal planning proficiency,” CoRR,
[521]   R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang,              vol. abs/2304.11477, 2023. [Online]. Available:
        S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Ev-                  https://doi.org/10.48550/arXiv.2304.11477
        erything of thoughts: Defying the law of pen-            [534]   R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and
        rose triangle for thought generation,” arXiv preprint            B. Ommer, “High-resolution image synthesis with
        arXiv:2311.04254, 2023.                                          latent diffusion models,” in IEEE/CVF Conference on
[522]   P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu,            Computer Vision and Pattern Recognition, CVPR 2022,
        M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Ku-               New Orleans, LA, USA, June 18-24, 2022, 2022, pp.
        mar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cos-               10 674–10 685.
        grove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A.     [535]   J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris,
        Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong,              P. Liang, and M. S. Bernstein, “Generative agents:
        H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr,                Interactive simulacra of human behavior,” CoRR, vol.
        L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim,                  abs/2304.03442, 2023.
        N. Guha, N. S. Chatterji, O. Khattab, P. Henderson,      [536]   2023. [Online]. Available: https://github.com/
        Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli,           Significant-Gravitas/Auto-GPT
        T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary,          [537]   Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang,
        W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda,                “Describe, explain, plan and select: Interactive plan-
        “Holistic evaluation of language models,” CoRR, vol.             ning with large language models enables open-world
        abs/2211.09110, 2022.                                            multi-task agents,” CoRR, vol. abs/2302.01560, 2023.
[523]   Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, and        [538]   J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang,
        H. Chen, “When do program-of-thoughts work for                   X. Guo, C. Li, X. Xu et al., “Milvus: A purpose-
        reasoning?” CoRR, vol. abs/2308.15452, 2023.                     built vector data management system,” in Proceedings
[524]   A. Madaan and A. Yazdanbakhsh, “Text and pat-                    of the 2021 International Conference on Management of
        terns: For effective chain of thought, it takes two to           Data, 2021, pp. 2614–2627.
        tango,” CoRR, vol. abs/2209.07686, 2022.                 [539]   W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang,
[525]   Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and              “Memorybank: Enhancing large language models
        A. Smola, “Multimodal chain-of-thought reasoning                 with long-term memory,” CoRR, vol. abs/2305.10250,
        in language models,” CoRR, vol. abs/2302.00923,                  2023.
        2023.                                                    [540]   M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz,
                                                                                                                                123

        “Building a large annotated corpus of english: The                 2021, L. Barrault, O. Bojar, F. Bougares, R. Chat-
        penn treebank,” Comput. Linguistics, vol. 19, no. 2,               terjee, M. R. Costa-jussà, C. Federmann, M. Fishel,
        pp. 313–330, 1993.                                                 A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz,
[541]   S. Merity, C. Xiong, J. Bradbury, and R. Socher,                   P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes,
        “Pointer sentinel mixture models,” in ICLR (Poster).               P. Koehn, T. Kocmi, A. Martins, M. Morishita, and
        OpenReview.net, 2017.                                              C. Monz, Eds. Association for Computational Lin-
[542]   O. Bojar, C. Buck, C. Federmann, B. Haddow,                        guistics, 2021, pp. 1–88.
        P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post,        [547]   T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Fe-
        H. Saint-Amand, R. Soricut, L. Specia, and A. Tam-                 dermann, M. Fishel, T. Gowda, Y. Graham, R. Grund-
        chyna, “Findings of the 2014 workshop on statistical               kiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz,
        machine translation,” in WMT@ACL. The Associa-                     M. Morishita, M. Nagata, T. Nakazawa, M. Novák,
        tion for Computer Linguistics, 2014, pp. 12–58.                    M. Popel, and M. Popovic, “Findings of the 2022
[543]   O. Bojar, R. Chatterjee, C. Federmann, Y. Graham,                  conference on machine translation (WMT22),” in Pro-
        B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn,                     ceedings of the Seventh Conference on Machine Trans-
        V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L.                lation, WMT 2022, Abu Dhabi, United Arab Emi-
        Neves, M. Popel, M. Post, R. Rubino, C. Scarton,                   rates (Hybrid), December 7-8, 2022, P. Koehn, L. Bar-
        L. Specia, M. Turchi, K. Verspoor, and M. Zampieri,                rault, O. Bojar, F. Bougares, R. Chatterjee, M. R.
        “Findings of the 2016 conference on machine trans-                 Costa-jussà, C. Federmann, M. Fishel, A. Fraser,
        lation,” in WMT. The Association for Computer                      M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman,
        Linguistics, 2016, pp. 131–198.                                    B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi,
[544]   L. Barrault, O. Bojar, M. R. Costa-jussà, C. Feder-               A. Martins, M. Morishita, C. Monz, M. Nagata,
        mann, M. Fishel, Y. Graham, B. Haddow, M. Huck,                    T. Nakazawa, M. Negri, A. Névéol, M. Neves,
        P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal,                 M. Popel, M. Turchi, and M. Zampieri, Eds. Associ-
        M. Post, and M. Zampieri, “Findings of the 2019                    ation for Computational Linguistics, 2022, pp. 1–45.
        conference on machine translation (WMT19),” in Pro-        [548]   N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek,
        ceedings of the Fourth Conference on Machine Transla-              D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and
        tion, WMT 2019, Florence, Italy, August 1-2, 2019 - Vol-           A. Fan, “The flores-101 evaluation benchmark for
        ume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatter-            low-resource and multilingual machine translation,”
        jee, C. Federmann, M. Fishel, Y. Graham, B. Haddow,                Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538,
        M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins,                    2022.
        C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post,      [549]   R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset,
        M. Turchi, and K. Verspoor, Eds. Association for                   “Diabla: a corpus of bilingual spontaneous writ-
        Computational Linguistics, 2019, pp. 1–61.                         ten dialogues for machine translation,” Lang. Resour.
[545]   L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-                 Evaluation, vol. 55, no. 3, pp. 635–660, 2021.
        jussà, C. Federmann, Y. Graham, R. Grundkiewicz,          [550]   R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre,
        B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn,                 and B. Xiang, “Abstractive text summarization using
        C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Na-                  sequence-to-sequence rnns and beyond,” in Proceed-
        gata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri,               ings of the 20th SIGNLL Conference on Computational
        “Findings of the 2020 conference on machine trans-                 Natural Language Learning, CoNLL 2016, Berlin, Ger-
        lation (WMT20),” in Proceedings of the Fifth Con-                  many, August 11-12, 2016, Y. Goldberg and S. Riezler,
        ference on Machine Translation, WMT@EMNLP 2020,                    Eds. ACL, 2016, pp. 280–290.
        Online, November 19-20, 2020, L. Barrault, O. Bojar,       [551]   S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give
        F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Fed-            me the details, just the summary! topic-aware con-
        ermann, M. Fishel, A. Fraser, Y. Graham, P. Guzman,                volutional neural networks for extreme summariza-
        B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn,                     tion,” in EMNLP. Association for Computational
        A. Martins, M. Morishita, C. Monz, M. Nagata,                      Linguistics, 2018, pp. 1797–1807.
        T. Nakazawa, and M. Negri, Eds. Association for            [552]   F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown,
        Computational Linguistics, 2020, pp. 1–55.                         “Wikilingua: A new benchmark dataset for cross-
[546]   F. Akhbardeh, A. Arkhangorodsky, M. Biesialska,                    lingual abstractive summarization,” in Findings of
        O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-                the Association for Computational Linguistics: EMNLP
        jussà, C. España-Bonet, A. Fan, C. Federmann,                    2020, 2020, pp. 4034–4048.
        M. Freitag, Y. Graham, R. Grundkiewicz, B. Had-            [553]   S. Moon, P. Shah, A. Kumar, and R. Subba, “Open-
        dow, L. Harter, K. Heafield, C. Homan, M. Huck,                    dialkg: Explainable conversational reasoning with
        K. Amponsah-Kaakyire, J. Kasai, D. Khashabi,                       attention-based walks over knowledge graphs,” in
        K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz,                 ACL (1). Association for Computational Linguistics,
        M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa,                   2019, pp. 845–854.
        M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin,        [554]   Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettle-
        and M. Zampieri, “Findings of the 2021 confer-                     moyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu,
        ence on machine translation (WMT21),” in Proceed-                  “DS-1000: A natural and reliable benchmark for data
        ings of the Sixth Conference on Machine Translation,               science code generation,” CoRR, vol. abs/2211.11501,
        WMT@EMNLP 2021, Online Event, November 10-11,                      2022.
                                                                                                                                   124

[555]   Z. Wang, S. Zhou, D. Fried, and G. Neubig,                           for Computational Linguistics (Volume 1: Long Papers),
        “Execution-based evaluation for open-domain code                     ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.
        generation,” CoRR, vol. abs/2212.10481, 2022.                        6101–6119.
[556]   T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins,        [565]   X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form gen-
        A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin,                 eration via multi-task learning for complex question
        J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kel-                   answering over knowledge bases,” in Proceedings
        cey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and                   of the 29th International Conference on Computational
        S. Petrov, “Natural questions: a benchmark for ques-                 Linguistics, COLING 2022, Gyeongju, Republic of Korea,
        tion answering research,” Trans. Assoc. Comput. Lin-                 October 12-17, 2022, 2022, pp. 1687–1696.
        guistics, pp. 452–466, 2019.                                 [566]   S. Longpre, Y. Lu, and J. Daiber, “MKQA: A lin-
[557]   P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal,              guistically diverse benchmark for multilingual open
        C. Schoenick, and O. Tafjord, “Think you have solved                 domain question answering,” Trans. Assoc. Comput.
        question answering? try arc, the AI2 reasoning chal-                 Linguistics, vol. 9, pp. 1389–1406, 2021.
        lenge,” CoRR, vol. abs/1803.05457, 2018.                     [567]   T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhat-
[558]   S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measur-                tacharyya, “Scienceqa: a novel resource for question
        ing how models mimic human falsehoods,” in Pro-                      answering on scholarly articles,” Int. J. Digit. Libr.,
        ceedings of the 60th Annual Meeting of the Association               vol. 23, no. 3, pp. 289–301, 2022.
        for Computational Linguistics (Volume 1: Long Papers),       [568]   T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal,
        ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp.                “Can a suit of armor conduct electricity? A new
        3214–3252.                                                           dataset for open book question answering,” in Pro-
[559]   J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic              ceedings of the 2018 Conference on Empirical Methods in
        parsing on freebase from question-answer pairs,” in                  Natural Language Processing, Brussels, Belgium, October
        Proceedings of the 2013 Conference on Empirical Methods              31 - November 4, 2018, 2018, pp. 2381–2391.
        in Natural Language Processing, EMNLP 2013, 18-21            [569]   T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary,
        October 2013, Grand Hyatt Seattle, Seattle, Washington,              R. Majumder, and L. Deng, “MS MARCO: A human
        USA, A meeting of SIGDAT, a Special Interest Group of                generated machine reading comprehension dataset,”
        the ACL, 2013, pp. 1533–1544.                                        in Proceedings of the Workshop on Cognitive Computa-
[560]   M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer,                   tion: Integrating neural and symbolic approaches 2016
        “Triviaqa: A large scale distantly supervised chal-                  co-located with the 30th Annual Conference on Neural
        lenge dataset for reading comprehension,” in Pro-                    Information Processing Systems (NIPS 2016), Barcelona,
        ceedings of the 55th Annual Meeting of the Association               Spain, December 9, 2016, 2016.
        for Computational Linguistics, ACL 2017, Vancouver,          [570]   T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sab-
        Canada, July 30 - August 4, Volume 1: Long Papers, 2017,             harwal, “QASC: A dataset for question answering
        pp. 1601–1611.                                                       via sentence composition,” in The Thirty-Fourth AAAI
[561]   Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi,                Conference on Artificial Intelligence, AAAI 2020, The
        “PIQA: reasoning about physical commonsense in                       Thirty-Second Innovative Applications of Artificial Intel-
        natural language,” in The Thirty-Fourth AAAI Con-                    ligence Conference, IAAI 2020, The Tenth AAAI Sympo-
        ference on Artificial Intelligence, AAAI 2020, The Thirty-           sium on Educational Advances in Artificial Intelligence,
        Second Innovative Applications of Artificial Intelligence            EAAI 2020, New York, NY, USA, February 7-12, 2020,
        Conference, IAAI 2020, The Tenth AAAI Symposium                      2020, pp. 8082–8090.
        on Educational Advances in Artificial Intelligence, EAAI     [571]   P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang,
        2020, New York, NY, USA, February 7-12, 2020, 2020,                  “Squad: 100, 000+ questions for machine compre-
        pp. 7432–7439.                                                       hension of text,” in Proceedings of the 2016 Conference
[562]   M. Dubey, D. Banerjee, A. Abdelkawi, and                             on Empirical Methods in Natural Language Processing,
        J. Lehmann, “Lc-quad 2.0: A large dataset for com-                   EMNLP 2016, Austin, Texas, USA, November 1-4, 2016,
        plex question answering over wikidata and dbpe-                      2016, pp. 2383–2392.
        dia,” in The Semantic Web - ISWC 2019 - 18th In-             [572]   A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes,
        ternational Semantic Web Conference, Auckland, New                   and J. Weston, “Key-value memory networks for
        Zealand, October 26-30, 2019, Proceedings, Part II, 2019,            directly reading documents,” in Proceedings of the
        pp. 69–78.                                                           2016 Conference on Empirical Methods in Natural Lan-
[563]   Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang,                    guage Processing, EMNLP 2016, Austin, Texas, USA,
        X. Yan, and Y. Su, “Beyond I.I.D.: three levels of                   November 1-4, 2016, 2016, pp. 1400–1409.
        generalization for question answering on knowledge           [573]   B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “As-
        bases,” in WWW ’21: The Web Conference 2021, Virtual                 sessing the factual accuracy of generated text,” in
        Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp.            Proceedings of the 25th ACM SIGKDD International
        3477–3488.                                                           Conference on Knowledge Discovery & Data Mining,
[564]   S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou,                    KDD 2019, Anchorage, AK, USA, August 4-8, 2019,
        J. Li, B. He, and H. Zhang, “KQA pro: A dataset                      2019, pp. 166–175.
        with explicit compositional programs for complex             [574]   K. Toutanova and D. Chen, “Observed versus latent
        question answering over knowledge base,” in Pro-                     features for knowledge base and text inference,” in
        ceedings of the 60th Annual Meeting of the Association               Proceedings of the 3rd Workshop on Continuous Vector
                                                                                                                                   125

        Space Models and their Compositionality, CVSC 2015,                  Florence, Italy, July 28- August 2, 2019, Volume 1: Long
        Beijing, China, July 26-31, 2015, 2015, pp. 57–66.                   Papers, A. Korhonen, D. R. Traum, and L. Màrquez,
[575]   K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge,                Eds.     Association for Computational Linguistics,
        and J. Taylor, “Freebase: a collaboratively created                  2019, pp. 4791–4800.
        graph database for structuring human knowledge,”             [585]   K. Sakaguchi, R. L. Bras, C. Bhagavatula, and
        in Proceedings of the ACM SIGMOD International Con-                  Y. Choi, “Winogrande: An adversarial winograd
        ference on Management of Data, SIGMOD 2008, Vancou-                  schema challenge at scale,” in AAAI. AAAI Press,
        ver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250.              2020, pp. 8732–8740.
[576]   T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel,      [586]   M. Roemmele, C. A. Bejan, and A. S. Gordon,
        “Convolutional 2d knowledge graph embeddings,”                       “Choice of plausible alternatives: An evaluation of
        in Proceedings of the Thirty-Second AAAI Conference                  commonsense causal reasoning,” in Logical Formaliza-
        on Artificial Intelligence, (AAAI-18), the 30th innovative           tions of Commonsense Reasoning, Papers from the 2011
        Applications of Artificial Intelligence (IAAI-18), and the           AAAI Spring Symposium, Technical Report SS-11-06,
        8th AAAI Symposium on Educational Advances in Ar-                    Stanford, California, USA, March 21-23, 2011. AAAI,
        tificial Intelligence (EAAI-18), New Orleans, Louisiana,             2011.
        USA, February 2-7, 2018, 2018, pp. 1811–1818.                [587]   K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon,
[577]   G. A. Miller, “Wordnet: A lexical database for en-                   P. Clark, and Y. Choi, “proscript: Partially ordered
        glish,” Commun. ACM, pp. 39–41, 1995.                                scripts generation,” in Findings of the Association for
[578]   F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis,              Computational Linguistics: EMNLP 2021, Virtual Event
        A. Bakhtin, Y. Wu, and A. H. Miller, “Language mod-                  / Punta Cana, Dominican Republic, 16-20 November,
        els as knowledge bases?” in Proceedings of the 2019                  2021, M. Moens, X. Huang, L. Specia, and S. W. Yih,
        Conference on Empirical Methods in Natural Language                  Eds.     Association for Computational Linguistics,
        Processing and the 9th International Joint Conference                2021, pp. 2138–2149.
        on Natural Language Processing, EMNLP-IJCNLP 2019,           [588]   B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark,
        Hong Kong, China, November 3-7, 2019, 2019, pp. 2463–                “Tracking state changes in procedural text: a chal-
        2473.                                                                lenge dataset and models for process paragraph com-
[579]   F. Mahdisoltani, J. Biega, and F. M. Suchanek,                       prehension,” in Proceedings of the 2018 Conference of
        “YAGO3: A knowledge base from multilingual                           the North American Chapter of the Association for Com-
        wikipedias,” in Seventh Biennial Conference on Innova-               putational Linguistics: Human Language Technologies,
        tive Data Systems Research, CIDR 2015, Asilomar, CA,                 NAACL-HLT 2018, New Orleans, Louisiana, USA, June
        USA, January 4-7, 2015, Online Proceedings, 2015.                    1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji,
[580]   F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago:                    and A. Stent, Eds. Association for Computational
        a core of semantic knowledge,” in Proceedings of                     Linguistics, 2018, pp. 1595–1604.
        the 16th International Conference on World Wide Web,         [589]   S. Saha, P. Yadav, L. Bauer, and M. Bansal, “Expla-
        WWW 2007, Banff, Alberta, Canada, May 8-12, 2007,                    graphs: An explanation graph generation task for
        2007, pp. 697–706.                                                   structured commonsense reasoning,” in Proceedings
[581]   Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen,                    of the 2021 Conference on Empirical Methods in Natural
        R. Salakhutdinov, and C. D. Manning, “Hotpotqa:                      Language Processing, EMNLP 2021, Virtual Event /
        A dataset for diverse, explainable multi-hop ques-                   Punta Cana, Dominican Republic, 7-11 November, 2021,
        tion answering,” in Proceedings of the 2018 Conference               M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds.
        on Empirical Methods in Natural Language Processing,                 Association for Computational Linguistics, 2021, pp.
        Brussels, Belgium, October 31 - November 4, 2018. As-                7716–7740.
        sociation for Computational Linguistics, 2018, pp.           [590]   O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter:
        2369–2380.                                                           Generating implications, proofs, and abductive state-
[582]   C. Clark, K. Lee, M. Chang, T. Kwiatkowski,                          ments over natural language,” in Findings of the
        M. Collins, and K. Toutanova, “Boolq: Exploring the                  Association for Computational Linguistics: ACL/IJCNLP
        surprising difficulty of natural yes/no questions,” in               2021, Online Event, August 1-6, 2021, ser. Findings of
        Proceedings of the 2019 Conference of the North American             ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol.
        Chapter of the Association for Computational Linguis-                ACL/IJCNLP 2021. Association for Computational
        tics: Human Language Technologies, NAACL-HLT 2019,                   Linguistics, 2021, pp. 3621–3634.
        Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long         [591]   B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pi-
        and Short Papers), J. Burstein, C. Doran, and T. Solorio,            patanangkura, and P. Clark, “Explaining answers
        Eds.       Association for Computational Linguistics,                with entailment trees,” in Proceedings of the 2021
        2019, pp. 2924–2936.                                                 Conference on Empirical Methods in Natural Language
[583]   M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi,                Processing, EMNLP 2021, Virtual Event / Punta Cana,
        “Socialiqa: Commonsense reasoning about social in-                   Dominican Republic, 7-11 November, 2021, M. Moens,
        teractions,” CoRR, vol. abs/1904.09728, 2019.                        X. Huang, L. Specia, and S. W. Yih, Eds. Association
[584]   R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and                    for Computational Linguistics, 2021, pp. 7358–7370.
        Y. Choi, “Hellaswag: Can a machine really finish             [592]   A. Saparov and H. He, “Language models are greedy
        your sentence?” in Proceedings of the 57th Conference of             reasoners: A systematic formal analysis of chain-of-
        the Association for Computational Linguistics, ACL 2019,             thought,” CoRR, vol. abs/2210.01240, 2022.
                                                                                                                                126

[593]   C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz,                      system benchmark for formal olympiad-level mathe-
        V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari,                   matics,” in The Tenth International Conference on Learn-
        E. Dyer, and B. Neyshabur, “Exploring length gen-                 ing Representations, ICLR 2022, Virtual Event, April 25-
        eralization in large language models,” CoRR, vol.                 29, 2022. OpenReview.net, 2022.
        abs/2207.04901, 2022.                                       [603] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W.
[594]   A. Patel, S. Bhattamishra, and N. Goyal, “Are NLP                 Ayers, D. Radev, and J. Avigad, “Proofnet: Autofor-
        models really able to solve simple math word prob-                malizing and formally proving undergraduate-level
        lems?” in NAACL-HLT. Association for Computa-                     mathematics,” CoRR, vol. abs/2302.12433, 2023.
        tional Linguistics, 2021, pp. 2080–2094.                    [604] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J. Wen,
[595]   S. Roy and D. Roth, “Solving general arithmetic                   “Halueval: A large-scale hallucination evaluation
        word problems,” in Proceedings of the 2015 Conference             benchmark for large language models,” CoRR, vol.
        on Empirical Methods in Natural Language Processing,              abs/2305.11747, 2023.
        EMNLP 2015, Lisbon, Portugal, September 17-21, 2015,        [605] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman,
        L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and             “Crows-pairs: A challenge dataset for measuring
        Y. Marton, Eds. The Association for Computational                 social biases in masked language models,” in Pro-
        Linguistics, 2015, pp. 1743–1752.                                 ceedings of the 2020 Conference on Empirical Methods
[596]   A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski,               in Natural Language Processing, EMNLP 2020, Online,
        Y. Choi, and H. Hajishirzi, “Mathqa: Towards inter-               November 16-20, 2020, 2020, pp. 1953–1967.
        pretable math word problem solving with operation-          [606] R. Rudinger, J. Naradowsky, B. Leonard, and B. V.
        based formalisms,” in Proceedings of the 2019 Confer-             Durme, “Gender bias in coreference resolution,” in
        ence of the North American Chapter of the Association for         Proceedings of the 2018 Conference of the North American
        Computational Linguistics: Human Language Technolo-               Chapter of the Association for Computational Linguistics:
        gies, NAACL-HLT 2019, Minneapolis, MN, USA, June                  Human Language Technologies, NAACL-HLT, New Or-
        2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein,         leans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short
        C. Doran, and T. Solorio, Eds.          Association for           Papers), 2018, pp. 8–14.
        Computational Linguistics, 2019, pp. 2357–2367.             [607] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and
[597]   W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Pro-              N. A. Smith, “Realtoxicityprompts: Evaluating neu-
        gram induction by rationale generation: Learning to               ral toxic degeneration in language models,” in Find-
        solve and explain algebraic word problems,” in Pro-               ings of the Association for Computational Linguistics:
        ceedings of the 55th Annual Meeting of the Association            EMNLP 2020, Online Event, 16-20 November 2020, ser.
        for Computational Linguistics, ACL 2017, Vancouver,               Findings of ACL, T. Cohn, Y. He, and Y. Liu, Eds.,
        Canada, July 30 - August 4, Volume 1: Long Papers,                vol. EMNLP 2020. Association for Computational
        R. Barzilay and M. Kan, Eds. Association for Com-                 Linguistics, 2020, pp. 3356–3369.
        putational Linguistics, 2017, pp. 158–167.                  [608] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler,
[598]   R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman,               and A. Torralba, “Virtualhome: Simulating house-
        and H. Hajishirzi, “Mawps: A math word problem                    hold activities via programs,” in CVPR. Computer
        repository,” in Proceedings of the 2016 conference of             Vision Foundation / IEEE Computer Society, 2018,
        the north american chapter of the association for compu-          pp. 8494–8502.
        tational linguistics: human language technologies, 2016,    [609] S. Srivastava, C. Li, M. Lingelbach, R. Martı́n-Martı́n,
        pp. 1152–1157.                                                    F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch,
[599]   D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh,               C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-
        and M. Gardner, “DROP: A reading comprehension                    Fei, “BEHAVIOR: benchmark for everyday house-
        benchmark requiring discrete reasoning over para-                 hold activities in virtual, interactive, and ecological
        graphs,” in Proceedings of the 2019 Conference of the             environments,” in CoRL, ser. Proceedings of Machine
        North American Chapter of the Association for Com-                Learning Research, vol. 164. PMLR, 2021, pp. 477–
        putational Linguistics: Human Language Technologies,              490.
        NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7,             [610] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk,
        2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–           W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox,
        2378.                                                             “ALFRED: A benchmark for interpreting grounded
[600]   S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi,           instructions for everyday tasks,” in CVPR. Com-
        and K. Cho, “Naturalproofs: Mathematical theorem                  puter Vision Foundation / IEEE, 2020, pp. 10 737–
        proving in natural language,” in Proceedings of the               10 746.
        Neural Information Processing Systems Track on Datasets     [611] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler,
        and Benchmarks 1, NeurIPS Datasets and Benchmarks                 and M. J. Hausknecht, “Alfworld: Aligning text and
        2021, December 2021, virtual, J. Vanschoren and S. Ye-            embodied environments for interactive learning,” in
        ung, Eds., 2021.                                                  9th International Conference on Learning Representa-
[601]   A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, “Lisa: Lan-             tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
        guage models of isabelle proofs,” in 6th Conference               OpenReview.net, 2021.
        on Artificial Intelligence and Theorem Proving, 2021, pp.   [612] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Web-
        378–392.                                                          shop: Towards scalable real-world web interaction
[602]   K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-              with grounded language agents,” in NeurIPS, 2022.
                                                                                                                                  127

[613]   X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens,                      International Conference on Learning Representations,
        B. Wang, H. Sun, and Y. Su, “Mind2web: To-                          ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
        wards a generalist agent for the web,” CoRR, vol.                   OpenReview.net, 2020.
        abs/2306.06070, 2023.                                       [625]   T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang,
[614]   W. H. Guss, B. Houghton, N. Topin, P. Wang,                         Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and
        C. Codel, M. Veloso, and R. Salakhutdinov, “Minerl:                 D. R. Radev, “Spider: A large-scale human-labeled
        A large-scale dataset of minecraft demonstrations,”                 dataset for complex and cross-domain semantic pars-
        in Proceedings of the Twenty-Eighth International Joint             ing and text-to-sql task,” in Proceedings of the 2018
        Conference on Artificial Intelligence, IJCAI 2019, Macao,           Conference on Empirical Methods in Natural Language
        China, August 10-16, 2019, S. Kraus, Ed. ijcai.org,                 Processing, Brussels, Belgium, October 31 - November 4,
        2019, pp. 2442–2448.                                                2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsu-
[615]   L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang,                   jii, Eds. Association for Computational Linguistics,
        H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anand-                    2018, pp. 3911–3921.
        kumar, “Minedojo: Building open-ended embodied              [626]   D. Bahdanau, K. Cho, and Y. Bengio, “Neural ma-
        agents with internet-scale knowledge,” in NeurIPS,                  chine translation by jointly learning to align and
        2022.                                                               translate,” in ICLR, 2015.
[616]   P. Lu, L. Qiu, K. Chang, Y. N. Wu, S. Zhu, T. Rajpuro-      [627]   K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu:
        hit, P. Clark, and A. Kalyan, “Dynamic prompt learn-                a method for automatic evaluation of machine trans-
        ing via policy gradient for semi-structured mathe-                  lation,” in Proceedings of the 40th Annual Meeting of
        matical reasoning,” CoRR, vol. abs/2209.14610, 2022.                the Association for Computational Linguistics, July 6-12,
[617]   B. Zhang, K. Zhou, X. Wei, W. X. Zhao, J. Sha,                      2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.
        S. Wang, and J. rong Wen, “Evaluating and improv-           [628]   C.-Y. Lin, “ROUGE: A package for automatic evalu-
        ing tool-augmented computation-intensive math rea-                  ation of summaries,” in Text Summarization Branches
        soning,” CoRR, vol. abs/2306.02408, 2023.                           Out. Association for Computational Linguistics, Jul.
[618]   R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li,                     2004, pp. 74–81.
        and Y. Shan, “Gpt4tools: Teaching large language            [629]   W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu,
        model to use tools via self-instruction,” CoRR, vol.                “Is chatgpt a good translator? a preliminary study,”
        abs/2305.18752, 2023.                                               arXiv preprint arXiv:2301.08745, 2023.
[619]   S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez,         [630]   T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. R. McK-
        “Gorilla: Large language model connected with mas-                  eown, and T. B. Hashimoto, “Benchmarking large
        sive apis,” CoRR, vol. abs/2305.15334, 2023.                        language models for news summarization,” CoRR,
[620]   W. Yih, M. Richardson, C. Meek, M. Chang, and                       vol. abs/2301.13848, 2023.
        J. Suh, “The value of semantic parse labeling for           [631]   T. Goyal, J. J. Li, and G. Durrett, “News summariza-
        knowledge base question answering,” in Proceedings                  tion and evaluation in the era of GPT-3,” CoRR, vol.
        of the 54th Annual Meeting of the Association for Com-              abs/2209.12356, 2022.
        putational Linguistics, ACL 2016, August 7-12, 2016,        [632]   S. Gehrmann, E. Clark, and T. Sellam, “Repairing
        Berlin, Germany, Volume 2: Short Papers. The Associ-                the cracked foundation: A survey of obstacles in
        ation for Computer Linguistics, 2016.                               evaluation practices for generated text,” CoRR, vol.
[621]   H. Puerto, G. G. Sahin, and I. Gurevych, “Metaqa:                   abs/2202.06935, 2022.
        Combining expert agents for multi-skill question an-        [633]   J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu,
        swering,” in Proceedings of the 17th Conference of the              and J. Zhou, “Is chatgpt a good NLG evaluator? A
        European Chapter of the Association for Computational               preliminary study,” CoRR, vol. abs/2303.04048, 2023.
        Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6,        [634]   Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu,
        2023, A. Vlachos and I. Augenstein, Eds. Association                “G-eval: NLG evaluation using GPT-4 with better
        for Computational Linguistics, 2023, pp. 3548–3562.                 human alignment,” CoRR, vol. abs/2303.16634, 2023.
[622]   P. Pasupat and P. Liang, “Compositional semantic            [635]   K. Yang, Y. Tian, N. Peng, and D. Klein, “Re3: Gen-
        parsing on semi-structured tables,” in Proceedings of               erating longer stories with recursive reprompting
        the 53rd Annual Meeting of the Association for Com-                 and revision,” in Proceedings of the 2022 Conference
        putational Linguistics and the 7th International Joint              on Empirical Methods in Natural Language Processing,
        Conference on Natural Language Processing of the Asian              EMNLP 2022, Abu Dhabi, United Arab Emirates, De-
        Federation of Natural Language Processing, ACL 2015,                cember 7-11, 2022, Y. Goldberg, Z. Kozareva, and
        July 26-31, 2015, Beijing, China, Volume 1: Long Papers.            Y. Zhang, Eds. Association for Computational Lin-
        The Association for Computer Linguistics, 2015, pp.                 guistics, 2022, pp. 4393–4479.
        1470–1480.                                                  [636]   W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou,
[623]   V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Gener-                 R. Cotterell, and M. Sachan, “Recurrentgpt: Interac-
        ating structured queries from natural language using                tive generation of (arbitrarily) long text,” CoRR, vol.
        reinforcement learning,” CoRR, vol. abs/1709.00103,                 abs/2305.13304, 2023.
        2017.                                                       [637]   S. Gulwani, O. Polozov, and R. Singh, “Program
[624]   W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang,                       synthesis,” Found. Trends Program. Lang., vol. 4, no.
        S. Li, X. Zhou, and W. Y. Wang, “Tabfact: A large-                  1-2, pp. 1–119, 2017.
        scale dataset for table-based fact verification,” in 8th    [638]   S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum,
                                                                                                                                    128

        and C. Gan, “Planning with large language models                      Advances in Artificial Intelligence (EAAI-18), New Or-
        for code generation,” 2023.                                           leans, Louisiana, USA, February 2-7, 2018, 2018, pp.
[639]   M. Welsh, “The end of programming,” Commun.                           3390–3398.
        ACM, vol. 66, no. 1, pp. 34–35, 2023.                         [653]   T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak,
[640]   Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su,                       M. Yasunaga, C. Wu, M. Zhong, P. Yin, S. I. Wang,
        B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do,               V. Zhong, B. Wang, C. Li, C. Boyle, A. Ni, Z. Yao,
        Y. Xu, and P. Fung, “A multitask, multilingual, mul-                  D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
        timodal evaluation of chatgpt on reasoning, halluci-                  L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and
        nation, and interactivity,” CoRR, vol. abs/2302.04023,                multi-tasking structured knowledge grounding with
        2023.                                                                 text-to-text language models,” in EMNLP. Associ-
[641]   Y. Liu, A. R. Fabbri, P. Liu, Y. Zhao, L. Nan, R. Han,                ation for Computational Linguistics, 2022, pp. 602–
        S. Han, S. R. Joty, C. Wu, C. Xiong, and D. Radev,                    631.
        “Revisiting the gold standard: Grounding summa-               [654]   A. Roberts, C. Raffel, and N. Shazeer, “How much
        rization evaluation with robust human evaluation,”                    knowledge can you pack into the parameters of a lan-
        CoRR, vol. abs/2212.07981, 2022.                                      guage model?” in Proceedings of the 2020 Conference
[642]   A. R. Fabbri, W. Kryscinski, B. McCann, C. Xiong,                     on Empirical Methods in Natural Language Processing,
        R. Socher, and D. R. Radev, “Summeval: Re-                            EMNLP 2020, Online, November 16-20, 2020, 2020, pp.
        evaluating summarization evaluation,” Trans. Assoc.                   5418–5426.
        Comput. Linguistics, vol. 9, pp. 391–409, 2021.               [655]   G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hos-
[643]   T. Tang, H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X.                seini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin,
        Zhao, and F. Wei, “Not all metrics are guilty: Improv-                S. Riedel, and E. Grave, “Few-shot learning with
        ing NLG evaluation with LLM paraphrasing,” CoRR,                      retrieval augmented language models,” CoRR, vol.
        vol. abs/2305.15067, 2023.                                            abs/2208.03299, 2022.
[644]   X. Wang, X. Tang, W. X. Zhao, J. Wang, and J. Wen,            [656]   K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang,
        “Rethinking the evaluation for conversational rec-                    “Retrieval augmented language model pre-training,”
        ommendation in the era of large language models,”                     in Proceedings of the 37th International Conference on
        CoRR, vol. abs/2305.13112, 2023.                                      Machine Learning, ICML 2020, 13-18 July 2020, Virtual
[645]   M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan,                 Event, 2020, pp. 3929–3938.
        “Human-like summarization evaluation with chat-               [657]   P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni,
        gpt,” CoRR, vol. abs/2304.02554, 2023.                                V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih,
[646]   Y. Ji, Y. Gong, Y. Peng, C. Ni, P. Sun, D. Pan, B. Ma,                T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-
        and X. Li, “Exploring chatgpt’s ability to rank con-                  augmented generation for knowledge-intensive NLP
        tent: A preliminary study on consistency with hu-                     tasks,” in Advances in Neural Information Processing
        man preferences,” CoRR, vol. abs/2303.07610, 2023.                    Systems 33: Annual Conference on Neural Information
[647]   Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu,                Processing Systems 2020, NeurIPS 2020, December 6-12,
        K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou,                2020, virtual, 2020.
        “Benchmarking foundation models with language-                [658]   Y. Lan, G. He, J. Jiang, J. Jiang, W. X. Zhao, and J. Wen,
        model-as-an-examiner,” CoRR, vol. abs/2306.04181,                     “Complex knowledge base question answering: A
        2023.                                                                 survey,” CoRR, vol. abs/2108.06688, 2021.
[648]   Y. Liu, S. Feng, D. Wang, Y. Zhang, and H. Schütze,          [659]   S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai,
        “Evaluate what you can’t evaluate: Unassess-                          E. Rutherford, K. Millican, G. van den Driessche,
        able generated responses quality,” CoRR, vol.                         J. Lespiau, B. Damoc, A. Clark, D. de Las Casas,
        abs/2305.14658, 2023.                                                 A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang,
[649]   P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu,              L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Pa-
        T. Liu, and Z. Sui, “Large language models are not                    ganini, G. Irving, O. Vinyals, S. Osindero, K. Si-
        fair evaluators,” CoRR, vol. abs/2305.17926, 2023.                    monyan, J. W. Rae, E. Elsen, and L. Sifre, “Improv-
[650]   J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui,                ing language models by retrieving from trillions of
        Z. Zhou, C. Gong, Y. Shen, J. Zhou, S. Chen, T. Gui,                  tokens,” in International Conference on Machine Learn-
        Q. Zhang, and X. Huang, “A comprehensive capabil-                     ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
        ity analysis of gpt-3 and gpt-3.5 series models,” arXiv               USA, ser. Proceedings of Machine Learning Research,
        preprint arXiv:2303.10420, 2023.                                      K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári,
[651]   M. McCloskey and N. J. Cohen, “Catastrophic in-                       G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022,
        terference in connectionist networks: The sequential                  pp. 2206–2240.
        learning problem,” in Psychology of learning and moti-        [660]   S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua,
        vation, 1989, pp. 109–165.                                            “Search-in-the-chain: Towards accurate, credible and
[652]   R. Kemker, M. McClure, A. Abitino, T. L. Hayes,                       traceable large language models for knowledge-
        and C. Kanan, “Measuring catastrophic forgetting in                   intensive tasks,” CoRR, vol. abs/2304.14732, 2023.
        neural networks,” in Proceedings of the Thirty-Second         [661]   B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu,
        AAAI Conference on Artificial Intelligence, (AAAI-18),                Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao,
        the 30th innovative Applications of Artificial Intelligence           “Check your facts and try again: Improving large
        (IAAI-18), and the 8th AAAI Symposium on Educational                  language models with external knowledge and au-
                                                                                                                                  129

        tomated feedback,” CoRR, vol. abs/2302.12813, 2023.                 8493–8502.
[662]   Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-     [672]   K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov,
        Yu, Y. Yang, J. Callan, and G. Neubig, “Ac-                         “Locating and editing factual associations in gpt,”
        tive retrieval augmented generation,” CoRR, vol.                    in Advances in Neural Information Processing Systems,
        abs/2305.06983, 2023.                                               2022.
[663]   L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng,                  [673]   M. Geva, R. Schuster, J. Berant, and O. Levy, “Trans-
        H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and                     former feed-forward layers are key-value memo-
        T. Liu, “A survey on hallucination in large language                ries,” in Proceedings of the 2021 Conference on Empirical
        models: Principles, taxonomy, challenges, and open                  Methods in Natural Language Processing, EMNLP 2021,
        questions,” CoRR, vol. abs/2311.05232, 2023.                        Virtual Event / Punta Cana, Dominican Republic, 7-
[664]   Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and                     11 November, 2021, M. Moens, X. Huang, L. Specia,
        J. Wen, “Evaluating object hallucination in large                   and S. W. Yih, Eds. Association for Computational
        vision-language models,” CoRR, vol. abs/2305.10355,                 Linguistics, 2021, pp. 5484–5495.
        2023.                                                       [674]   Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,
[665]   S. Kadavath, T. Conerly, A. Askell, T. J. Henighan,                 H. Chen, and N. Zhang, “Editing large language
        D. Drain, E. Perez, N. Schiefer, Z. Dodds, N. Das-                  models: Problems, methods, and opportunities,”
        Sarma, E. Tran-Johnson, S. Johnston, S. El-Showk,                   CoRR, vol. abs/2305.13172, 2023.
        A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai,              [675]   P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,
        S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Ja-                M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and
        cobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse,               H. Chen, “Easyedit: An easy-to-use knowledge edit-
        C. Olsson, S. Ringer, D. Amodei, T. B. Brown, J. Clark,             ing framework for large language models,” CoRR,
        N. Joseph, B. Mann, S. McCandlish, C. Olah, and                     vol. abs/2308.07269, 2023.
        J. Kaplan, “Language models (mostly) know what              [676]   Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and
        they know,” CoRR, vol. abs/2207.05221, 2022.                        W. Chen, “Synthetic prompting: Generating chain-of-
[666]   P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck-              thought demonstrations for large language models,”
        gpt: Zero-resource black-box hallucination detection                CoRR, vol. abs/2302.00618, 2023.
        for generative large language models,” ArXiv, vol.          [677]   Sifatkaur, M. Singh, V. S. B, and N. Malviya, “Mind
        abs/2305.06983, 2023.                                               meets machine: Unravelling gpt-4’s cognitive psy-
[667]   S. Agarwal, I. Akkaya, V. Balcom, M. Bavarian,                      chology,” CoRR, vol. abs/2303.11436, 2023.
        G. Bernadett-Shapiro, G. Brockman, M. Brundage,             [678]   M. I. Nye, A. J. Andreassen, G. Gur-Ari,
        J. Chan, F. Chantzis, N. Deutsch, B. Eastman, A. Eleti,             H. Michalewski, J. Austin, D. Bieber, D. Dohan,
        N. Felix, S. P. Fishman, I. Fulford, C. Gibson, J. Gross,           A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and
        M. Heaton, J. Hilton, X. Hu, S. Jain, H. Jin, L. Kil-               A. Odena, “Show your work: Scratchpads for inter-
        patrick, C. Kim, M. Kolhede, A. Mayne, P. McMil-                    mediate computation with language models,” CoRR,
        lan, D. Medina, J. Menick, A. Mishchenko, A. Nair,                  vol. abs/2112.00114, 2021.
        R. Nayak, A. Neelakantan, R. Nuttall, J. Parish,            [679]   J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
        A. T. Passos, A. Perelman, F. de Avila Belbute Peres,               tions of language models in arithmetic and symbolic
        V. Pong, J. Schulman, E. Sigler, N. Staudacher, N. Tur-             induction,” CoRR, vol. abs/2208.05051, 2022.
        ley, J. Tworek, R. Greene, A. Vijayvergiya, C. Voss,        [680]   W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou,
        J. Weng, M. Wiethoff, S. Yoo, K. Yu, W. Zaremba,                    J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, “Ji-
        S. Zhao, W. Zhuk, and B. Zoph, “Chatgpt plugins,”                   uzhang: A chinese pre-trained language model for
        OpenAI Blog, March 2023.                                            mathematical problem understanding,” in KDD ’22:
[668]   A. Lazaridou, E. Gribovskaya, W. Stokowiec, and                     The 28th ACM SIGKDD Conference on Knowledge Dis-
        N. Grigorev, “Internet-augmented language models                    covery and Data Mining, Washington, DC, USA, August
        through few-shot prompting for open-domain ques-                    14 - 18, 2022, A. Zhang and H. Rangwala, Eds. ACM,
        tion answering,” CoRR, vol. abs/2203.05115, 2022.                   2022, pp. 4571–4581.
[669]   H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu,           [681]   Q. Wang, C. Kaliszyk, and J. Urban, “First experi-
        R. Lai, Z. Cao, J. Nie, and J. Wen, “Webbrain: Learn-               ments with neural translation of informal to formal
        ing to generate factually correct articles for queries              mathematics,” in Intelligent Computer Mathematics -
        by grounding on large web corpus,” CoRR, vol.                       11th International Conference, CICM 2018, Hagenberg,
        abs/2304.04358, 2023.                                               Austria, August 13-17, 2018, Proceedings, ser. Lecture
[670]   J. Liu, J. Jin, Z. Wang, J. Cheng, Z. Dou, and J. Wen,              Notes in Computer Science, F. Rabe, W. M. Farmer,
        “RETA-LLM: A retrieval-augmented large language                     G. O. Passmore, and A. Youssef, Eds., vol. 11006.
        model toolkit,” CoRR, vol. abs/2306.05212, 2023.                    Springer, 2018, pp. 255–270.
[671]   D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei,      [682]   S. Polu and I. Sutskever, “Generative language mod-
        “Knowledge neurons in pretrained transformers,” in                  eling for automated theorem proving,” CoRR, vol.
        Proceedings of the 60th Annual Meeting of the Asso-                 abs/2009.03393, 2020.
        ciation for Computational Linguistics (Volume 1: Long       [683]   A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski,
        Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022,                T. Odrzygózdz, P. Milos, Y. Wu, and M. Jamnik,
        S. Muresan, P. Nakov, and A. Villavicencio, Eds.                    “Thor: Wielding hammers to integrate language
        Association for Computational Linguistics, 2022, pp.                models and automated theorem provers,” CoRR, vol.
                                                                                                                             130

        abs/2205.10893, 2022.                                      [698] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang,
[684]   S. Polu, J. M. Han, K. Zheng, M. Baksys,                         G. Huang, B. Li, L. Lu, X. Wang, Y. Qiao, Z. Zhang,
        I. Babuschkin, and I. Sutskever, “Formal mathe-                  and J. Dai, “Ghost in the minecraft: Generally capa-
        matics statement curriculum learning,” CoRR, vol.                ble agents for open-world environments via large
        abs/2202.01344, 2022.                                            language models with text-based knowledge and
[685]   Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats,                memory,” CoRR, vol. abs/2305.17144, 2023.
        M. Jamnik, and C. Szegedy, “Autoformalization with         [699] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
        large language models,” CoRR, vol. abs/2205.12615,               Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
        2022.                                                            open-ended embodied agent with large language
[686]   A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu,              models,” CoRR, vol. abs/2305.16291, 2023.
        M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, “Draft,       [700] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes,
        sketch, and prove: Guiding formal theorem provers                B. David, C. Finn, K. Gopalakrishnan, K. Hausman,
        with informal proofs,” CoRR, vol. abs/2210.12283,                A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Ir-
        2022.                                                            pan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth,
[687]   A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao,             N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang,
        S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye,                  K. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor,
        Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yaz-           J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Ser-
        danbakhsh, and P. Clark, “Self-refine: Iterative refine-         manet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke,
        ment with self-feedback,” CoRR, vol. abs/2303.17651,             F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan, “Do as
        2023.                                                            I can, not as I say: Grounding language in robotic
[688]   N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an             affordances,” CoRR, vol. abs/2204.01691, 2022.
        autonomous agent with dynamic memory and self-             [701] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman,
        reflection,” CoRR, vol. abs/2303.11366, 2023.                    B. Ichter, P. Florence, and A. Zeng, “Code as policies:
[689]   Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan,             Language model programs for embodied control,”
        and W. Chen, “CRITIC: large language models can                  CoRR, vol. abs/2209.07753, 2022.
        self-correct with tool-interactive critiquing,” CoRR,      [702] Y. Fu, H. Peng, T. Khot, and M. Lapata, “Improv-
        vol. abs/2305.11738, 2023.                                       ing language model negotiation with self-play and
[690]   J. Uesato, N. Kushman, R. Kumar, H. F. Song,                     in-context learning from AI feedback,” CoRR, vol.
        N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and               abs/2305.10142, 2023.
        I. Higgins, “Solving math word problems with               [703] N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H.
        process- and outcome-based feedback,” CoRR, vol.                 Awadallah, and J. Kiseleva, “Improving grounded
        abs/2211.14275, 2022.                                            language understanding in a collaborative environ-
[691]   H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,                  ment by interacting with agents through help feed-
        B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,           back,” CoRR, vol. abs/2304.10750, 2023.
        and K. Cobbe, “Let’s verify step by step,” CoRR, vol.      [704] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez,
        abs/2305.20050, 2023.                                            “Gorilla: Large language model connected with mas-
[692]   Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang,                 sive apis,” CoRR, vol. abs/2305.15334, 2023.
        “How well do large language models perform in              [705] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt:
        arithmetic tasks?” CoRR, vol. abs/2304.02015, 2023.              Augmenting frozen language models with mas-
[693]   X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu,                sive tools via tool embeddings,” CoRR, vol.
        Y. Gao, J. Lou, and W. Chen, “Reasoning like pro-                abs/2305.11554, 2023.
        gram executors,” in Proceedings of the 2022 Conference     [706] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou,
        on Empirical Methods in Natural Language Processing,             S. Lu, L. Ji, S. Mao, Y. Wang, L. Shou, M. Gong,
        EMNLP 2022, Abu Dhabi, United Arab Emirates, De-                 and N. Duan, “Taskmatrix.ai: Completing tasks by
        cember 7-11, 2022, 2022, pp. 761–779.                            connecting foundation models with millions of apis,”
[694]   H. Zhou, A. Nova, H. Larochelle, A. C. Courville,                CoRR, vol. abs/2303.16434, 2023.
        B. Neyshabur, and H. Sedghi, “Teaching algorith-           [707] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou,
        mic reasoning via in-context learning,” CoRR, vol.               “Large language models as tool makers,” CoRR, vol.
        abs/2211.09066, 2022.                                            abs/2305.17126, 2023.
[695]   A. Parisi, Y. Zhao, and N. Fiedel, “TALM:                  [708] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang,
        tool augmented language models,” CoRR, vol.                      H. Yu, and J. Han, “Large language models can self-
        abs/2205.12255, 2022.                                            improve,” CoRR, vol. abs/2210.11610, 2022.
[696]   W. Huang, P. Abbeel, D. Pathak, and I. Mordatch,           [709] E. Beeching, C. Fourrier, N. Habib, S. Han,
        “Language models as zero-shot planners: Extract-                 N.     Lambert,      N.    Rajani,   O.    Sanseviero,
        ing actionable knowledge for embodied agents,” in                L. Tunstall, and T. Wolf, “Open llm leaderboard,”
        ICML, ser. Proceedings of Machine Learning Re-                   https://huggingface.co/spaces/HuggingFaceH4/
        search, vol. 162. PMLR, 2022, pp. 9118–9147.                     open llm leaderboard, 2023.
[697]   T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud,       [710] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,
        and P. Oudeyer, “Grounding large language models                 A. Saied, W. Chen, and N. Duan, “Agieval: A human-
        in interactive environments with online reinforce-               centric benchmark for evaluating foundation mod-
        ment learning,” CoRR, vol. abs/2302.02662, 2023.                 els,” CoRR, vol. abs/2304.06364, 2023.
                                                                                                                                 131

[711]   H. Zeng, “Measuring massive multitask chinese un-                   D. Arpit, R. Xu, P. Mui, H. Wang, C. Xiong, and
        derstanding,” CoRR, vol. abs/2304.12986, 2023.                      S. Savarese, “BOLAA: benchmarking and orchestrat-
[712]   C. Liu, R. Jin, Y. Ren, L. Yu, T. Dong, X. Peng,                    ing llm-augmented autonomous agents,” CoRR, vol.
        S. Zhang, J. Peng, P. Zhang, Q. Lyu, X. Su, Q. Liu,                 abs/2308.05960, 2023.
        and D. Xiong, “M3KE: A massive multi-level multi-           [725]   X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai,
        subject knowledge evaluation benchmark for chinese                  Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng,
        large language models,” CoRR, vol. abs/2305.10263,                  A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang,
        2023.                                                               Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang,
[713]   Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su,                “Agentbench: Evaluating llms as agents,” CoRR, vol.
        J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and                 abs/2308.03688, 2023.
        J. He, “C-eval: A multi-level multi-discipline chinese      [726]   K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang,
        evaluation suite for foundation models,” CoRR, vol.                 L. Yang, W. Ye, N. Z. Gong, Y. Zhang, and X. Xie,
        abs/2305.08322, 2023.                                               “Promptbench: Towards evaluating the robustness
[714]   Z. Gu, X. Zhu, H. Ye, L. Zhang, J. Wang, S. Jiang,                  of large language models on adversarial prompts,”
        Z. Xiong, Z. Li, Q. He, R. Xu, W. Huang, W. Zheng,                  CoRR, vol. abs/2306.04528, 2023.
        H. Feng, and Y. Xiao, “Xiezhi: An ever-updating             [727]   R. S. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,
        benchmark for holistic domain knowledge evalua-                     S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,
        tion,” CoRR, vol. abs/2306.05783, 2023.                             “WHEN FLUE MEETS FLANG: benchmarks and
[715]   O. Contributors, “Opencompass: A universal eval-                    large pre-trained language model for financial do-
        uation platform for foundation models,” https://                    main,” CoRR, vol. abs/2211.00083, 2022.
        github.com/InternLM/OpenCompass, 2023.                      [728]   N. Guha, D. E. Ho, J. Nyarko, and C. Ré, “Legal-
[716]   Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and                         bench: Prototyping a collaborative benchmark for
        T. Khot, “Chain-of-thought hub: A continuous effort                 legal reasoning,” CoRR, vol. abs/2209.06120, 2022.
        to measure large language models’ reasoning perfor-         [729]   L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu,
        mance,” CoRR, vol. abs/2305.17306, 2023.                            Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang,
[717]   J. Yu, X. Wang, S. Tu, S. Cao, D. Zhang-li, X. Lv,                  J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-
        H. Peng, Z. Yao, X. Zhang, H. Li, C. Li, Z. Zhang,                  judge with mt-bench and chatbot arena,” CoRR, vol.
        Y. Bai, Y. Liu, A. Xin, N. Lin, K. Yun, L. Gong, J. Chen,           abs/2306.05685, 2023.
        Z. Wu, Y. Qi, W. Li, Y. Guan, K. Zeng, J. Qi, H. Jin,       [730]   X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Sub-
        J. Liu, Y. Gu, Y. Yao, N. Ding, L. Hou, Z. Liu, B. Xu,              ramaniam, A. R. Loomba, S. Zhang, Y. Sun, and
        J. Tang, and J. Li, “Kola: Carefully benchmarking                   W. Wang, “Scibench: Evaluating college-level sci-
        world knowledge of large language models,” CoRR,                    entific problem-solving abilities of large language
        vol. abs/2306.09296, 2023.                                          models,” CoRR, vol. abs/2307.10635, 2023.
[718]   T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vi-     [731]   X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani,
        das, A. Kranias, J. J. Nay, K. Gupta, and A. Komat-                 C. Guestrin, P. Liang, and T. B. Hashimoto, “Al-
        suzaki, “ARB: advanced reasoning benchmark for                      pacaeval: An automatic evaluator of instruction-
        large language models,” CoRR, vol. abs/2307.13692,                  following models,” https://github.com/tatsu-lab/
        2023.                                                               alpaca eval, 2023.
[719]   Y. Peng, S. Li, W. Gu, Y. Li, W. Wang, C. Gao, and          [732]   Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt:
        M. R. Lyu, “Revisiting, benchmarking and exploring                  A benchmark for trustworthy and responsible large
        API recommendation: How far are we?” IEEE Trans.                    language models,” CoRR, vol. abs/2306.11507, 2023.
        Software Eng., vol. 49, no. 4, pp. 1876–1897, 2023.         [733]   Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu,
[720]   M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li,           K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou,
        “Api-bank: A benchmark for tool-augmented llms,”                    “Benchmarking foundation models with language-
        CoRR, vol. abs/2304.08244, 2023.                                    model-as-an-examiner,” CoRR, vol. abs/2306.04181,
[721]   Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and                     2023.
        L. Sun, “Toolalpaca: Generalized tool learning for          [734]   C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
        language models with 3000 simulated cases,” CoRR,                   and Z. Liu, “Chateval: Towards better llm-based
        vol. abs/2306.05301, 2023.                                          evaluators through multi-agent debate,” CoRR, vol.
[722]   Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang,                abs/2308.07201, 2023.
        “On the tool manipulation capability of open-source         [735]   Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen,
        large language models,” CoRR, vol. abs/2305.16504,                  L. Yang, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang,
        2023.                                                               Y. Chang, P. S. Yu, Q. Yang, and X. Xie, “A survey
[723]   Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin,             on evaluation of large language models,” CoRR, vol.
        X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie,                abs/2307.03109, 2023.
        J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Tool-     [736]   Z. Zhuang, Q. Chen, L. Ma, M. Li, Y. Han, Y. Qian,
        llm: Facilitating large language models to master                   H. Bai, Z. Feng, W. Zhang, and T. Liu, “Through
        16000+ real-world apis,” CoRR, vol. abs/2307.16789,                 the lens of core competency: Survey on evaluation of
        2023.                                                               large language models,” CoRR, vol. abs/2308.07902,
[724]   Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke,                      2023.
        R. Murthy, Y. Feng, Z. Chen, J. C. Niebles,                 [737]   J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Gar-
                                                                                                                                132

        rette, M. Collins, and T. Kwiatkowski, “Tydi QA:                  2018. IEEE Computer Society, 2018, pp. 197–206.
        A benchmark for information-seeking question an-          [748]   B. Yang, C. Han, Y. Li, L. Zuo, and Z. Yu, “Improv-
        swering in typologically diverse languages,” Trans.               ing conversational recommendation systems’ quality
        Assoc. Comput. Linguistics, vol. 8, pp. 454–470, 2020.            with context-aware item meta-information,” in Find-
[738]   L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi,                 ings of the Association for Computational Linguistics:
        C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muen-              NAACL 2022, Seattle, WA, United States, July 10-15,
        nighoff, J. Phang, L. Reynolds, E. Tang, A. Thite,                2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruı́z,
        B. Wang, K. Wang, and A. Zou, “A framework for                    Eds.     Association for Computational Linguistics,
        few-shot language model evaluation,” Sep. 2021.                   2022, pp. 38–48.
[739]   R. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,           [749]   E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cap-
        S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,              pelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Hes-
        “When flue meets flang: Benchmarks and large pre-                 low, J. Launay, Q. Malartic, B. Noune, B. Pannier,
        trained language model for financial domain,” in                  and G. Penedo, “Falcon-40B: an open large language
        Proceedings of the 2022 Conference on Empirical Methods           model with state-of-the-art performance,” 2023.
        in Natural Language Processing, 2022, pp. 2322–2335.      [750]   S. Martin, J. Liermann, and H. Ney, “Algorithms for
[740]   K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao,                    bigram and trigram word clustering,” Speech commu-
        X. Chen, Y. Lin, J.-R. Wen, and J. Han, “Don’t make               nication, vol. 24, no. 1, pp. 19–37, 1998.
        your llm an evaluation benchmark cheater,” arXiv          [751]   R. Navigli, “Word sense disambiguation: A survey,”
        preprint arXiv:2311.01964, 2023.                                  ACM computing surveys (CSUR), vol. 41, no. 2, pp.
[741]   C. Zan, K. Peng, L. Ding, B. Qiu, B. Liu, S. He, Q. Lu,           1–69, 2009.
        Z. Zhang, C. Liu, W. Liu, Y. Zhan, and D. Tao, “Vega-     [752]   W. H. Gomaa, A. A. Fahmy et al., “A survey of
        mt: The JD explore academy machine translation                    text similarity approaches,” international journal of
        system for WMT22,” in Proceedings of the Seventh Con-             Computer Applications, vol. 68, no. 13, pp. 13–18, 2013.
        ference on Machine Translation, WMT 2022, Abu Dhabi,      [753]   S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad,
        United Arab Emirates (Hybrid), December 7-8, 2022,                M. Chenaghlu, and J. Gao, “Deep learning–based
        P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chat-            text classification: a comprehensive review,” ACM
        terjee, M. R. Costa-jussà, C. Federmann, M. Fishel,              computing surveys (CSUR), vol. 54, no. 3, pp. 1–40,
        A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz,                2021.
        P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes,           [754]   N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham,
        T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Na-               C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier,
        gata, T. Nakazawa, M. Negri, A. Névéol, M. Neves,               M. Noetel, and A. Stuhlmüller, “RAFT: A real-world
        M. Popel, M. Turchi, and M. Zampieri, Eds. Asso-                  few-shot text classification benchmark,” in NeurIPS
        ciation for Computational Linguistics, 2022, pp. 411–             Datasets and Benchmarks, 2021.
        422.                                                      [755]   C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga,
[742]   Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh,              and D. Yang, “Is chatgpt a general-purpose nat-
        and P. J. Liu, “Calibrating sequence likelihood                   ural language processing task solver?” CoRR, vol.
        improves conditional language generation,” CoRR,                  abs/2302.06476, 2023.
        vol. abs/2210.00045, 2022. [Online]. Available:           [756]   X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng,
        https://doi.org/10.48550/arXiv.2210.00045                         J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How
[743]   D. Khashabi, S. Min, T. Khot, A. Sabharwal,                       robust is gpt-3.5 to predecessors? a comprehensive
        O. Tafjord, P. Clark, and H. Hajishirzi, “Unifiedqa:              study on language understanding tasks,” 2023.
        Crossing format boundaries with a single QA sys-          [757]   D. Nadeau and S. Sekine, “A survey of named entity
        tem,” in EMNLP (Findings), ser. Findings of ACL,                  recognition and classification,” Lingvisticae Investiga-
        vol. EMNLP 2020. Association for Computational                    tiones, vol. 30, no. 1, pp. 3–26, 2007.
        Linguistics, 2020, pp. 1896–1907.                         [758]   A. Ratnaparkhi, “A maximum entropy model for
[744]   X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan,                      part-of-speech tagging,” in Conference on empirical
        J. Zhang, and Y. Yang, “Solving math word problem                 methods in natural language processing, 1996.
        via cooperative reasoning induced language mod-           [759]   V. Yadav and S. Bethard, “A survey on recent
        els,” arXiv preprint arXiv:2210.16257, 2022.                      advances in named entity recognition from deep
[745]   A. Nguyen, N. Karampatziakis, and W. Chen, “Meet                  learning models,” in Proceedings of the 27th Interna-
        in the middle: A new pre-training paradigm,”                      tional Conference on Computational Linguistics, 2018,
        CoRR, vol. abs/2303.07295, 2023. [Online]. Available:             pp. 2145–2158.
        https://doi.org/10.48550/arXiv.2303.07295                 [760]   F. Souza, R. Nogueira, and R. Lotufo, “Portuguese
[746]   H. Li, J. Zhang, C. Li, and H. Chen, “RESDSQL:                    named entity recognition using bert-crf,” arXiv
        decoupling schema linking and skeleton parsing                    preprint arXiv:1909.10649, 2019.
        for text-to-sql,” CoRR, vol. abs/2302.05965, 2023.        [761]   S. Pawar, G. K. Palshikar, and P. Bhattacharyya,
        [Online]. Available: https://doi.org/10.48550/arXiv.              “Relation extraction: A survey,” arXiv preprint
        2302.05965                                                        arXiv:1712.05191, 2017.
[747]   W. Kang and J. J. McAuley, “Self-attentive sequential     [762]   C. Walker and et al., “Ace 2005 multilingual training
        recommendation,” in IEEE International Conference on              corpus ldc2006t06,” Philadelphia, 2006.
        Data Mining, ICDM 2018, Singapore, November 17-20,        [763]   J. Gao, H. Zhao, C. Yu, and R. Xu, “Exploring the
                                                                                                                                 133

        feasibility of chatgpt for event extraction,” CoRR, vol.            “Is chatgpt good at search? investigating large lan-
        abs/2303.03836, 2023.                                               guage models as re-ranking agent,” arXiv preprint
[764]   Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language                 arXiv:2304.09542, 2023.
        model is not a good few-shot information extractor,         [778]   Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu,
        but a good reranker for hard samples!” CoRR, vol.                   J. Shen, T. Liu, J. Liu, D. Metzler, X. Wang et al.,
        abs/2303.08559, 2023.                                               “Large language models are effective text rankers
[765]   R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic               with pairwise ranking prompting,” arXiv preprint
        data generation of llms help clinical text mining?”                 arXiv:2306.17563, 2023.
        arXiv preprint arXiv:2303.04360, 2023.                      [779]   S. Cho, S. Jeong, J. Seo, and J. C. Park, “Discrete
[766]   X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang,                        prompt optimization via constrained generation for
        S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al.,                  zero-shot re-ranker,” arXiv preprint arXiv:2305.13729,
        “Zero-shot information extraction via chatting with                 2023.
        chatgpt,” arXiv preprint arXiv:2302.10205, 2023.            [780]   R. Tang, X. Zhang, X. Ma, J. Lin, and F. Ture,
[767]   A. Vaswani, S. Bengio, E. Brevdo, F. Chollet,                       “Found in the middle: Permutation self-consistency
        A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalch-                  improves listwise ranking in large language mod-
        brenner, N. Parmar et al., “Tensor2tensor for neural                els,” arXiv preprint arXiv:2310.07712, 2023.
        machine translation,” in Proceedings of the 13th Con-       [781]   X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot
        ference of the Association for Machine Translation in the           listwise document reranking with a large language
        Americas (Volume 1: Research Track), 2018, pp. 193–199.             model,” arXiv preprint arXiv:2305.02156, 2023.
[768]   B. Zhang, B. Haddow, and A. Birch, “Prompting               [782]   S. Zhuang, H. Zhuang, B. Koopman, and G. Zuccon,
        large language model for machine translation: A case                “A setwise approach for effective and highly effi-
        study,” arXiv preprint arXiv:2301.07069, 2023.                      cient zero-shot ranking with large language models,”
[769]   M. Ghazvininejad, H. Gonen, and L. Zettlemoyer,                     arXiv preprint arXiv:2310.09497, 2023.
        “Dictionary-based phrase-level prompting of large           [783]   H. Zhuang, Z. Qin, K. Hui, J. Wu, L. Yan, X. Wang,
        language models for machine translation,” arXiv                     and M. Berdersky, “Beyond yes and no: Improving
        preprint arXiv:2302.07856, 2023.                                    zero-shot llm rankers via scoring fine-grained rele-
[770]   L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi,                    vance labels,” arXiv preprint arXiv:2310.14122, 2023.
        and Z. Tu, “Document-level machine transla-                 [784]   N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large
        tion with large language models,” arXiv preprint                    language models are built-in autoregressive search
        arXiv:2304.02210, 2023.                                             engines,” arXiv preprint arXiv:2305.09612, 2023.
[771]   W. Jiao, J.-t. Huang, W. Wang, X. Wang, S. Shi, and         [785]   X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin, “Fine-
        Z. Tu, “Parrot: Translating during chat using large                 tuning llama for multi-stage text retrieval,” arXiv
        language models,” arXiv preprint arXiv:2304.02426,                  preprint arXiv:2310.08319, 2023.
        2023.                                                       [786]   R. Pradeep, S. Sharifymoghaddam, and J. Lin,
[772]   W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans:                   “Rankvicuna: Zero-shot listwise document rerank-
        Augmenting large language models with multi-                        ing with open-source large language models,” arXiv
        lingual translation capability over 100 languages,”                 preprint arXiv:2309.15088, 2023.
        arXiv preprint arXiv:2305.18098, 2023.                      [787]   Y. Tay, V. Q. Tran, M. Dehghani, J. Ni, D. Bahri,
[773]   J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek,                     H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta et al.,
        D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza,                     “Transformer memory as a differentiable search in-
        A. Janz, K. Kanclerz, A. Kocon, B. Koptyra,                         dex,” in Advances in Neural Information Processing
        W. Mieleszczenko-Kowszewicz, P. Milkowski,                          Systems, 2022.
        M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik,          [788]   R. Ren, W. X. Zhao, J. Liu, H. Wu, J.-R. Wen,
        S. Wozniak, and P. Kazienko, “Chatgpt: Jack of all                  and H. Wang, “TOME: A two-stage approach for
        trades, master of none,” CoRR, vol. abs/2302.10724,                 model-based retrieval,” in Proceedings of the 61st
        2023.                                                               Annual Meeting of the Association for Computational
[774]   Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao,                       Linguistics (Volume 1: Long Papers). Association
        “Can chatgpt understand too? A comparative study                    for Computational Linguistics, 2023, pp. 6102–6114.
        on chatgpt and fine-tuned BERT,” CoRR, vol.                         [Online]. Available: https://aclanthology.org/2023.
        abs/2302.10198, 2023.                                               acl-long.336
[775]   D. Cheng, S. Huang, J. Bi, Y. Zhan, J. Liu, Y. Wang,        [789]   Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao,
        H. Sun, F. Wei, D. Deng, and Q. Zhang, “Uprise:                     D. Dong, H. Wu, and H. Wang, “Rocketqa: An op-
        Universal prompt retrieval for improving zero-shot                  timized training approach to dense passage retrieval
        evaluation,” arXiv preprint arXiv:2303.08518, 2023.                 for open-domain question answering,” in Proceedings
[776]   R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu,                   of the 2021 Conference of the North American Chapter
        H. Wang, and J.-R. Wen, “Rocketqav2: A joint train-                 of the Association for Computational Linguistics: Human
        ing method for dense passage retrieval and pas-                     Language Technologies, 2021, pp. 5835–5847.
        sage re-ranking,” in Proceedings of the 2021 Conference     [790]   R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She,
        on Empirical Methods in Natural Language Processing,                H. Wu, H. Wang, and J.-R. Wen, “Pair: Leverag-
        2021, pp. 2825–2835.                                                ing passage-centric similarity relation for improving
[777]   W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren,                  dense passage retrieval,” in Findings of the Association
                                                                                                                                  134

        for Computational Linguistics: ACL-IJCNLP 2021, 2021,               and J. Wen, “Recbole 2.0: Towards a more up-to-date
        pp. 2173–2183.                                                      recommendation library,” in CIKM, M. A. Hasan and
[791]   Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning                    L. Xiong, Eds. ACM, 2022, pp. 4722–4726.
        for augmenting dense retrieval with large language          [804]   L. Xu, Z. Tian, G. Zhang, J. Zhang, L. Wang, B. Zheng,
        models,” arXiv preprint arXiv:2307.08303, 2023.                     Y. Li, J. Tang, Z. Zhang, Y. Hou, X. Pan, W. X. Zhao,
[792]   Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu,                   X. Chen, and J. Wen, “Towards a more user-friendly
        A. Bakalov, K. Guu, K. Hall, and M.-W. Chang,                       and easy-to-use benchmark library for recommender
        “Promptagator: Few-shot dense retrieval from 8 ex-                  systems,” in SIGIR, H. Chen, W. E. Duh, H. Huang,
        amples,” in The Eleventh International Conference on                M. P. Kato, J. Mothe, and B. Poblete, Eds. ACM,
        Learning Representations, 2023.                                     2023, pp. 2837–2847.
[793]   A. Askari, M. Aliannejadi, E. Kanoulas, and S. Ver-         [805]   S. Rendle, C. Freudenthaler, Z. Gantner, and
        berne, “Generating synthetic documents for cross-                   L. Schmidt-Thieme, “BPR: bayesian personalized
        encoder re-rankers: A comparative study of chatgpt                  ranking from implicit feedback,” CoRR, vol.
        and human experts,” arXiv preprint arXiv:2305.02320,                abs/1205.2618, 2012.
        2023.                                                       [806]   W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang,
[794]   K. Mao, Z. Dou, H. Chen, F. Mo, and H. Qian, “Large                 and Q. Li, “Recommender systems in the era of large
        language models know your contextual search in-                     language models (llms),” CoRR, 2023.
        tent: A prompting framework for conversational              [807]   L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen,
        search,” arXiv preprint arXiv:2303.06573, 2023.                     C. Qin, C. Zhu, H. Zhu, Q. Liu, H. Xiong, and
[795]   L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-                E. Chen, “A survey on large language models for
        shot dense retrieval without relevance labels,” in                  recommendation,” CoRR, 2023.
        Proceedings of the 61st Annual Meeting of the Association   [808]   Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and
        for Computational Linguistics (Volume 1: Long Papers).              J. Zhang, “Chat-rec: Towards interactive and explain-
        Association for Computational Linguistics, 2023, pp.                able llms-augmented recommender system,” CoRR,
        1762–1777.                                                          vol. abs/2303.14524, 2023.
[796]   L. Wang, N. Yang, and F. Wei, “Query2doc: Query ex-         [809]   S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun,
        pansion with large language models,” arXiv preprint                 X. Zhang, and J. Xu, “Uncovering chatgpt’s capabil-
        arXiv:2303.07678, 2023.                                             ities in recommender systems,” in RecSys, J. Zhang,
[797]   G. Ma, X. Wu, P. Wang, Z. Lin, and S. Hu, “Pre-                     L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basil-
        training with large language model-based document                   ico, L. Pizzato, and Y. Song, Eds. ACM, 2023, pp.
        expansion for dense passage retrieval,” arXiv preprint              1126–1132.
        arXiv:2308.08285, 2023.                                     [810]   Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
[798]   W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren,                    and W. X. Zhao, “Large language models are zero-
        Z. Chen, D. Yin, and Z. Ren, “Instruction distilla-                 shot rankers for recommender systems,” CoRR, 2023.
        tion makes large language models efficient zero-shot        [811]   J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is
        rankers,” arXiv preprint arXiv:2311.01555, 2023.                    chatgpt a good recommender? A preliminary study,”
[799]   L. Wang, N. Yang, X. Huang, L. Yang, R. Ma-                         CoRR, vol. abs/2304.10149, 2023.
        jumder, and F. Wei, “Large search model: Redefin-           [812]   K. Bao, J. Zhang, Y. Zhang, W. Wang, F. Feng,
        ing search stack in the era of llms,” arXiv preprint                and X. He, “Tallrec: An effective and efficient tun-
        arXiv:2310.14587, 2023.                                             ing framework to align large language model with
[800]   C. Li, Z. Gan, Z. Yang, J. Yang, L. Li, L. Wang,                    recommendation,” in RecSys, J. Zhang, L. Chen,
        and J. Gao, “Multimodal foundation models: From                     S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Piz-
        specialists to general-purpose assistants,” CoRR, vol.              zato, and Y. Song, Eds. ACM, 2023, pp. 1007–1014.
        abs/2309.10020, 2023.                                       [813]   Y. Zhu, L. Wu, Q. Guo, L. Hong, and J. Li, “Col-
[801]   W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan,                 laborative large language model for recommender
        K. Li, Y. Lu, H. Wang, C. Tian, Y. Min, Z. Feng, X. Fan,            systems,” arXiv preprint arXiv:2311.01343, 2023.
        X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, and J. Wen,        [814]   B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X.
        “Recbole: Towards a unified, comprehensive and ef-                  Zhao, and J.-R. Wen, “Adapting large language
        ficient framework for recommendation algorithms,”                   models by integrating collaborative semantics for
        in CIKM, G. Demartini, G. Zuccon, J. S. Culpepper,                  recommendation,” 2023. [Online]. Available: https:
        Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 4653–                    //api.semanticscholar.org/CorpusID:265213194
        4664.                                                       [815]   Y. Xi, W. Liu, J. Lin, J. Zhu, B. Chen, R. Tang,
[802]   K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang,                      W. Zhang, R. Zhang, and Y. Yu, “Towards open-
        F. Zhang, Z. Wang, and J. Wen, “S3-rec: Self-                       world recommendation with knowledge augmen-
        supervised learning for sequential recommendation                   tation from large language models,” CoRR, vol.
        with mutual information maximization,” in CIKM,                     abs/2306.10933, 2023.
        M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and              [816]   Q. Liu, N. Chen, T. Sakai, and X. Wu, “A first look
        P. Cudré-Mauroux, Eds. ACM, 2020, pp. 1893–1902.                   at llm-powered generative news recommendation,”
[803]   W. X. Zhao, Y. Hou, X. Pan, C. Yang, Z. Zhang, Z. Lin,              CoRR, vol. abs/2305.06566, 2023.
        J. Zhang, S. Bian, J. Tang, W. Sun, Y. Chen, L. Xu,         [817]   R. Li, W. Deng, Y. Cheng, Z. Yuan, J. Zhang,
        G. Zhang, Z. Tian, C. Tian, S. Mu, X. Fan, X. Chen,                 and F. Yuan, “Exploring the upper limits of
                                                                                                                                135

        text-based collaborative filtering using large lan-         [830]   C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon,
        guage models: Discoveries and insights,” CoRR, vol.                 R. Wightman, M. Cherti, T. Coombes, A. Katta,
        abs/2305.11700, 2023.                                               C. Mullis, M. Wortsman, P. Schramowski, S. Kun-
[818]   W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng,                  durthy, K. Crowson, L. Schmidt, R. Kaczmarczyk,
        J. Wang, D. Yin, and C. Huang, “Llmrec: Large lan-                  and J. Jitsev, “LAION-5B: an open large-scale dataset
        guage models with graph augmentation for recom-                     for training next generation image-text models,” in
        mendation,” CoRR, vol. abs/2311.00423, 2023.                        NeurIPS, 2022.
[819]   X. Li, B. Chen, L. Hou, and R. Tang, “Ctrl: Connect         [831]   S. Changpinyo, P. Sharma, N. Ding, and R. Soricut,
        tabular and language model for ctr prediction,” arXiv               “Conceptual 12m: Pushing web-scale image-text pre-
        preprint arXiv:2306.02841, 2023.                                    training to recognize long-tail visual concepts,” in
[820]   A. Muhamed, I. Keivanloo, S. Perera, J. Mracek,                     IEEE Conference on Computer Vision and Pattern Recog-
        Y. Xu, Q. Cui, S. Rajagopalan, B. Zeng, and                         nition, CVPR 2021, virtual, June 19-25, 2021. Com-
        T. Chilimbi, “Ctr-bert: Cost-effective knowledge dis-               puter Vision Foundation / IEEE, 2021, pp. 3558–3568.
        tillation for billion-parameter teacher models,” in         [832]   Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang,
        NeurIPS Efficient Natural Language and Speech Process-              A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian,
        ing Workshop, 2021.                                                 Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Mod-
[821]   L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang,                         ularization empowers large language models with
        J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X.                  multimodality,” CoRR, vol. abs/2304.14178, 2023.
        Zhao, Z. Wei, and J. Wen, “A survey on large lan-           [833]   J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang,
        guage model based autonomous agents,” CoRR, vol.                    J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier
        abs/2308.11432, 2023.                                               large vision-language model with versatile abilities,”
[822]   L. Wang, J. Zhang, X. Chen, Y. Lin, R. Song, W. X.                  CoRR, vol. abs/2308.12966, 2023.
        Zhao, and J. Wen, “Recagent: A novel simulation             [834]   H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved base-
        paradigm for recommender systems,” CoRR, vol.                       lines with visual instruction tuning,” CoRR, vol.
        abs/2306.02552, 2023.                                               abs/2310.03744, 2023.
[823]   E. Ie, C. Hsu, M. Mladenov, V. Jain, S. Narvekar,           [835]   P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu,
        J. Wang, R. Wu, and C. Boutilier, “Recsim: A con-                   L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan,
        figurable simulation platform for recommender sys-                  W. Zhang, H. Yan, X. Zhang, W. Li, J. Li,
        tems,” CoRR, vol. abs/1909.04847, 2019.                             K. Chen, C. He, X. Zhang, Y. Qiao, D. Lin, and
[824]   J. Zhang, Y. Hou, R. Xie, W. Sun, J. J. McAuley,                    J. Wang, “Internlm-xcomposer: A vision-language
        W. X. Zhao, L. Lin, and J. Wen, “Agentcf: Collabora-                large model for advanced text-image comprehension
        tive learning with autonomous language agents for                   and composition,” CoRR, vol. abs/2309.15112, 2023.
        recommender systems,” CoRR, vol. abs/2310.09233,            [836]   K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and
        2023.                                                               R. Zhao, “Shikra: Unleashing multimodal llm’s ref-
[825]   A. Zhang, L. Sheng, Y. Chen, H. Li, Y. Deng, X. Wang,               erential dialogue magic,” CoRR, vol. abs/2306.15195,
        and T. Chua, “On generative agents in recommenda-                   2023.
        tion,” CoRR, vol. abs/2310.10108, 2023.                     [837]   F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang,
[826]   Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of                  “Aligning large multi-modal model with robust in-
        vision-language pre-trained models,” in Proceedings                 struction tuning,” CoRR, vol. abs/2306.14565, 2023.
        of the Thirty-First International Joint Conference on Ar-   [838]   Y. Du, H. Guo, K. Zhou, W. X. Zhao, J. Wang,
        tificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29           C. Wang, M. Cai, R. Song, and J.-R. Wen, “What
        July 2022, L. D. Raedt, Ed. ijcai.org, 2022, pp. 5436–              makes for good visual instructions? synthesizing
        5443.                                                               complex visual reasoning instructions for visual in-
[827]   Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, and J. Gao,                  struction tuning,” 2023.
        “Vision-language pre-training: Basics, recent ad-           [839]   D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin,
        vances, and future trends,” Found. Trends Comput.                   K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand
        Graph. Vis., vol. 14, no. 3-4, pp. 163–352, 2022.                   challenge: Answering visual questions from blind
[828]   P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen,                  people,” in CVPR. Computer Vision Foundation
        A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen,                / IEEE Computer Society, 2018, pp. 3608–3617.
        D. E. Badawy, W. Han, E. Kharitonov et al., “Au-            [840]   A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down
        diopalm: A large language model that can speak and                  and bottom-up cues for scene text recognition,” in
        listen,” CoRR, 2023.                                                CVPR. IEEE Computer Society, 2012, pp. 2687–2694.
[829]   J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr,          [841]   Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao,
        Y. Hasson, K. Lenc, A. Mensch, K. Millican,                         Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and
        M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han,               D. Lin, “Mmbench: Is your multi-modal model an
        Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick,                  all-around player?” CoRR, vol. abs/2307.06281, 2023.
        S. Borgeaud, A. Brock, A. Nematzadeh, S. Shar-              [842]   C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin,
        ifzadeh, M. Binkowski, R. Barreira, O. Vinyals,                     Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and
        A. Zisserman, and K. Simonyan, “Flamingo: a visual                  R. Ji, “MME: A comprehensive evaluation bench-
        language model for few-shot learning,” in NeurIPS,                  mark for multimodal large language models,” CoRR,
        2022.                                                               vol. abs/2306.13394, 2023.
                                                                                                                                  136

[843]   Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,   [857]   B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and
        E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,              Y. Shan, “Seed-bench: Benchmarking multimodal
        F. Shi, and S. Shi, “Siren’s song in the AI ocean: A                llms with generative comprehension,” CoRR, vol.
        survey on hallucination in large language models,”                  abs/2307.16125, 2023.
        CoRR, vol. abs/2309.01219, 2023.                            [858]   W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang,
[844]   A. Gunjal, J. Yin, and E. Bas, “Detecting and prevent-              and L. Wang, “Mm-vet: Evaluating large multi-
        ing hallucinations in large vision language models,”                modal models for integrated capabilities,” CoRR, vol.
        CoRR, vol. abs/2308.06394, 2023.                                    abs/2308.02490, 2023.
[845]   J. Lu, J. Rao, K. Chen, X. Guo, Y. Zhang, B. Sun,           [859]   J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and
        C. Yang, and J. Yang, “Evaluation and mitigation                    Y. Jiang, “To see is to believe: Prompting GPT-
        of agnosia in multimodal large language models,”                    4V for better visual instruction tuning,” CoRR, vol.
        CoRR, vol. abs/2309.04041, 2023.                                    abs/2311.07574, 2023.
[846]   A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell,         [860]   Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang,
        and K. Saenko, “Object hallucination in image cap-                  and T. Sun, “Llavar: Enhanced visual instruction tun-
        tioning,” in EMNLP. Association for Computational                   ing for text-rich image understanding,” arXiv preprint
        Linguistics, 2018, pp. 4035–4045.                                   arXiv:2306.17107, 2023.
[847]   Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and             [861]   X. Qi, K. Huang, A. Panda, M. Wang, and P. Mittal,
        J.-R. Wen, “Evaluating object hallucination in large                “Visual adversarial examples jailbreak aligned large
        vision-language models,” in The 2023 Conference on                  language models,” in The Second Workshop on New
        Empirical Methods in Natural Language Processing,                   Frontiers in Adversarial Machine Learning, 2023.
        2023. [Online]. Available: https://openreview.net/          [862]   Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn,
        forum?id=xozJw0kZXF                                                 M. Bansal, and H. Yao, “Analyzing and mitigating
[848]   D. A. Hudson and C. D. Manning, “GQA: A new                         object hallucination in large vision-language mod-
        dataset for real-world visual reasoning and compo-                  els,” arXiv preprint arXiv:2310.00754, 2023.
        sitional question answering,” in CVPR. Computer             [863]   Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan,
        Vision Foundation / IEEE, 2019, pp. 6700–6709.                      L.-Y. Gui, Y.-X. Wang, Y. Yang et al., “Aligning large
[849]   P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu,                 multimodal models with factually augmented rlhf,”
        O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain:             arXiv preprint arXiv:2309.14525, 2023.
        Multimodal reasoning via thought chains for science         [864]   E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou,
        question answering,” in NeurIPS, 2022.                              J. Chen, and K. Srinivas, “Semtab 2019: Resources to
[850]   A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen,                  benchmark tabular data to knowledge graph match-
        D. Parikh, and M. Rohrbach, “Towards vqa models                     ing systems,” in The Semantic Web - 17th International
        that can read,” in Proceedings of the IEEE Conference               Conference, ESWC 2020, Heraklion, Crete, Greece, May
        on Computer Vision and Pattern Recognition, 2019, pp.               31-June 4, 2020, Proceedings, ser. Lecture Notes in
        8317–8326.                                                          Computer Science, vol. 12123. Springer, 2020, pp.
[851]   F. Liu, T. Guan, Z. Li, L. Chen, Y. Yacoob,                         514–530.
        D. Manocha, and T. Zhou, “Hallusionbench: You               [865]   S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu,
        see what you think? or you think what you see?                      “Unifying large language models and knowledge
        an image-context reasoning benchmark challenging                    graphs: A roadmap,” CoRR, vol. abs/2306.08302,
        for gpt-4v(ision), llava-1.5, and other multi-modality              2023.
        models,” CoRR, vol. abs/2310.14566, 2023.                   [866]   Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang,
[852]   S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra,                 J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu,
        C. L. Zitnick, and D. Parikh, “VQA: visual question                 W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang,
        answering,” in ICCV. IEEE Computer Society, 2015,                   P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian,
        pp. 2425–2433.                                                      H. Wu, and H. Wang, “ERNIE 3.0: Large-
[853]   R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider:                  scale knowledge enhanced pre-training for
        Consensus-based image description evaluation,” in                   language understanding and generation,” CoRR,
        CVPR. IEEE Computer Society, 2015, pp. 4566–4575.                   vol. abs/2107.02137, 2021. [Online]. Available:
[854]   H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction            https://arxiv.org/abs/2107.02137
        tuning,” CoRR, vol. abs/2304.08485, 2023.                   [867]   Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and
[855]   P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei,                   Q. Liu, “ERNIE: enhanced language representation
        F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub:                 with informative entities,” in Proceedings of the 57th
        A comprehensive evaluation benchmark for large                      Conference of the Association for Computational Linguis-
        vision-language models,” CoRR, vol. abs/2306.09265,                 tics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
        2023.                                                               Volume 1: Long Papers. Association for Computa-
[856]   Z. Li, Y. Wang, M. Du, Q. Liu, B. Wu, J. Zhang,                     tional Linguistics, 2019, pp. 1441–1451.
        C. Zhou, Z. Fan, J. Fu, J. Chen, X. Huang, and              [868]   X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li,
        Z. Wei, “Reform-eval: Evaluating large vision lan-                  and J. Tang, “KEPLER: A unified model for knowl-
        guage models via unified re-formulation of task-                    edge embedding and pre-trained language represen-
        oriented benchmarks,” CoRR, vol. abs/2310.02569,                    tation,” Trans. Assoc. Comput. Linguistics, vol. 9, pp.
        2023.                                                               176–194, 2021.
                                                                                                                               137

[869]   J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li,               abs/2305.13669, 2023.
        and H. Chen, “Subgraph retrieval enhanced model           [880]   Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou,
        for multi-hop knowledge base question answering,”                 Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms
        in Proceedings of the 60th Annual Meeting of the As-              for knowledge graph construction and reasoning:
        sociation for Computational Linguistics (Volume 1: Long           Recent capabilities and future opportunities,” CoRR,
        Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022.              vol. abs/2305.13168, 2023. [Online]. Available:
        Association for Computational Linguistics, 2022, pp.              https://doi.org/10.48550/arXiv.2305.13168
        5773–5784.                                                [881]   M. Karpinska, N. Akoury, and M. Iyyer, “The perils
[870]   P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu,           of using mechanical turk to evaluate open-ended
        and M. Huang, “Jointgt: Graph-text joint represen-                text generation,” in Proceedings of the 2021 Conference
        tation learning for text generation from knowledge                on Empirical Methods in Natural Language Processing,
        graphs,” in Findings of the Association for Compu-                EMNLP 2021, Virtual Event / Punta Cana, Dominican
        tational Linguistics: ACL/IJCNLP 2021, Online Event,              Republic, 7-11 November, 2021, M. Moens, X. Huang,
        August 1-6, 2021, ser. Findings of ACL, vol. ACL/I-               L. Specia, and S. W. Yih, Eds.          Association for
        JCNLP 2021. Association for Computational Lin-                    Computational Linguistics, 2021, pp. 1265–1285.
        guistics, 2021, pp. 2526–2538.                            [882]   H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,
[871]   O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou,                    C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:
        “Large scale knowledge graph based synthetic cor-                 scaling reinforcement learning from human feedback
        pus generation for knowledge-enhanced language                    with AI feedback,” CoRR, vol. abs/2309.00267, 2023.
        model pre-training,” CoRR, vol. abs/2010.12688,           [883]   G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni,
        2020.                                                             G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting
[872]   W. Chen, Y. Su, X. Yan, and W. Y. Wang, “KGPT:                    language models with high-quality feedback,” CoRR,
        knowledge-grounded pre-training for data-to-text                  vol. abs/2310.01377, 2023.
        generation,” in Proceedings of the 2020 Conference        [884]   X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng,
        on Empirical Methods in Natural Language Processing,              and H. Ji, “MINT: evaluating llms in multi-turn in-
        EMNLP 2020, Online, November 16-20, 2020. Associ-                 teraction with tools and language feedback,” CoRR,
        ation for Computational Linguistics, 2020, pp. 8635–              vol. abs/2309.10691, 2023.
        8648.                                                     [885]   S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. We-
[873]   Y. Gu, X. Deng, and Y. Su, “Don’t generate, discrim-              ston, and X. Li, “Branch-solve-merge improves large
        inate: A proposal for grounding language models to                language model evaluation and generation,” CoRR,
        real-world environments,” in Proceedings of the 61st              vol. abs/2310.15123, 2023.
        Annual Meeting of the Association for Computational       [886]   X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu,
        Linguistics (Volume 1: Long Papers), ACL 2023, Toronto,           and Y. Li, “Wider and deeper LLM networks are
        Canada, July 9-14, 2023. Association for Computa-                 fairer LLM evaluators,” CoRR, vol. abs/2308.01862,
        tional Linguistics, 2023, pp. 4928–4949.                          2023.
[874]   L. Luo, Y. Li, G. Haffari, and S. Pan, “Reasoning         [887]   C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
        on graphs: Faithful and interpretable large language              and Z. Liu, “Chateval: Towards better llm-based
        model reasoning,” CoRR, vol. abs/2310.01061, 2023.                evaluators through multi-agent debate,” CoRR, vol.
[875]   Y. Lan and J. Jiang, “Query graph generation for an-              abs/2308.07201, 2023.
        swering multi-hop complex questions from knowl-           [888]   R. Li, T. Patel, and X. Du, “PRD: peer rank and dis-
        edge bases,” in Proceedings of the 58th Annual Meeting            cussion improve large language model based evalu-
        of the Association for Computational Linguistics, ACL             ations,” CoRR, vol. abs/2307.02762, 2023.
        2020, Online, July 5-10, 2020, D. J. and, Ed. Asso-       [889]   L. Zhu, X. Wang, and X. Wang, “Judgelm: Fine-tuned
        ciation for Computational Linguistics, 2020, pp. 969–             large language models are scalable judges,” CoRR,
        974.                                                              vol. abs/2310.17631, 2023.
[876]   P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,               [890]   Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal,
        M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and                   and D. Chen, “Evaluating large language mod-
        H. Chen, “Easyedit: An easy-to-use knowledge edit-                els at evaluating instruction following,” CoRR, vol.
        ing framework for large language models,” CoRR,                   abs/2310.07641, 2023.
        vol. abs/2308.07269, 2023.                                [891]   R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim,
[877]   Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,               and D. Kang, “Benchmarking cognitive biases in
        H. Chen, and N. Zhang, “Editing large language                    large language models as evaluators,” CoRR, vol.
        models: Problems, methods, and opportunities,”                    abs/2309.17012, 2023.
        CoRR, vol. abs/2305.13172, 2023.                          [892]   P. West, X. Lu, N. Dziri, F. Brahman, L. Li,
[878]   S. Choi, T. Fang, Z. Wang, and Y. Song, “KCTS:                    J. D. Hwang, L. Jiang, J. Fisher, A. Ravichander,
        knowledge-constrained tree search decoding with                   K. Chandu, B. Newman, P. W. Koh, A. Ettinger,
        token-level hallucination detection,” CoRR, vol.                  and Y. Choi, “The generative AI paradox: ”what
        abs/2310.09044, 2023.                                             it can create, it may not understand”,” CoRR, vol.
[879]   S. Zhang, L. Pan, J. Zhao, and W. Y. Wang, “Mit-                  abs/2311.00059, 2023.
        igating language model hallucination with inter-          [893]   J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W.
        active question-knowledge alignment,” CoRR, vol.                  Yu, X. Song, and D. Zhou, “Large language mod-
                                                                                                                              138

        els cannot self-correct reasoning yet,” CoRR, vol.                prediction,” CoRR, vol. abs/2212.02199, 2022.
        abs/2310.01798, 2023.                                     [908]   J. H. Choi, K. E. Hickman, A. Monahan, and
[894]   K. Stechly, M. Marquez, and S. Kambhampati, “GPT-                 D. Schwarcz, “Chatgpt goes to law school,” Available
        4 doesn’t know it’s wrong: An analysis of itera-                  at SSRN, 2023.
        tive prompting for reasoning problems,” CoRR, vol.        [909]   J. J. Nay, “Law informs code: A legal informatics
        abs/2310.12397, 2023.                                             approach to aligning artificial intelligence with hu-
[895]   O. Nov, N. Singh, and D. M. Mann, “Putting chat-                  mans,” CoRR, vol. abs/2209.13020, 2022.
        gpt’s medical advice to the (turing) test,” CoRR, vol.    [910]   F. Yu, L. Quartey, and F. Schilder, “Legal prompting:
        abs/2301.10035, 2023.                                             Teaching a language model to think like a lawyer,”
[896]   K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Anani-                   CoRR, vol. abs/2212.01326, 2022.
        adou, “On the evaluations of chatgpt and emotion-         [911]   D. Trautmann, A. Petrova, and F. Schilder, “Legal
        enhanced prompting for mental health analysis,”                   prompt engineering for multilingual legal judgement
        CoRR, vol. abs/2304.03347, 2023.                                  prediction,” CoRR, vol. abs/2212.02199, 2022.
[897]   K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier,       [912]   A. Tamkin, M. Brundage, J. Clark, and D. Ganguli,
        A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O.               “Understanding the capabilities, limitations, and so-
        Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes                  cietal impact of large language models,” CoRR, vol.
        medicine easy to swallow: An exploratory case                     abs/2102.02503, 2021.
        study on simplified radiology reports,” CoRR, vol.        [913]   Z. Sun, “A short survey of viewing large language
        abs/2212.14882, 2022.                                             models in legal aspect,” CoRR, vol. abs/2303.09136,
[898]   K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wul-                2023.
        czyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis,          [914]   A. Abid, M. Farooqi, and J. Zou, “Persistent anti-
        D. Neal, M. Schaekermann, A. Wang, M. Amin,                       muslim bias in large language models,” in AIES
        S. Lachgar, P. A. Mansfield, S. Prakash, B. Green,                ’21: AAAI/ACM Conference on AI, Ethics, and Society,
        E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu,                 Virtual Event, USA, May 19-21, 2021, M. Fourcade,
        R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral,                 B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM,
        D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi,                2021, pp. 298–306.
        A. Karthikesalingam, and V. Natarajan, “Towards           [915]   A. Shah and S. Chava, “Zero is not hero yet: Bench-
        expert-level medical question answering with large                marking zero-shot performance of llms for financial
        language models,” CoRR, vol. abs/2305.09617, 2023.                tasks,” CoRR, vol. abs/2305.16633, 2023.
[899]   S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, and     [916]   D. Araci, “Finbert: Financial sentiment analysis
        H. Zan, “Zhongjing: Enhancing the chinese medical                 with pre-trained language models,” CoRR, vol.
        capabilities of large language model through expert               abs/1908.10063, 2019.
        feedback and real-world multi-turn dialogue,” CoRR,       [917]   J. C. S. Alvarado, K. Verspoor, and T. Baldwin,
        vol. abs/2308.03549, 2023.                                        “Domain adaption of named entity recognition to
[900]   S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts,                    support credit risk assessment,” in Proceedings of
        G. K. Savova, R. H. Mak, and D. S. Bitterman, “The                the Australasian Language Technology Association Work-
        utility of chatgpt for cancer treatment information,”             shop, ALTA 2015, Parramatta, Australia, December 8 - 9,
        medRxiv, 2023.                                                    2015, B. Hachey and K. Webster, Eds. ACL, 2015,
[901]   K. Malinka, M. Peresı́ni, A. Firc, O. Hujnak, and                 pp. 84–90.
        F. Janus, “On the educational impact of chatgpt:          [918]   G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond
        Is artificial intelligence ready to obtain a university           classification: Financial reasoning in state-of-the-art
        degree?” CoRR, vol. abs/2303.11146, 2023.                         language models,” CoRR, vol. abs/2305.01505, 2023.
[902]   T. Susnjak, “Chatgpt: The end of online exam in-          [919]   X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A
        tegrity?” CoRR, vol. abs/2212.09292, 2022.                        large chinese financial chat model with hundreds of
[903]   K. Tan, T. Pang, and C. Fan, “Towards applying                    billions parameters,” arXiv preprint arXiv:2305.12002,
        powerful large ai models in classroom teaching: Op-               2023.
        portunities, challenges and prospects,” 2023.             [920]   H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-
[904]   F. Kamalov and I. Gurrib, “A new era of artificial                source financial large language models,” CoRR, vol.
        intelligence in education: A multifaceted revolution,”            abs/2306.06031, 2023.
        CoRR, vol. abs/2305.18303, 2023.                          [921]   Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu,
[905]   E. Kasneci, K. Seßler, S. Küchemann, M. Bannert,                 “Pubmedqa: A dataset for biomedical research ques-
        D. Dementieva, F. Fischer, U. Gasser, G. Groh,                    tion answering,” in Proceedings of the 2019 Conference
        S. Günnemann, E. Hüllermeier et al., “Chatgpt for               on Empirical Methods in Natural Language Processing
        good? on opportunities and challenges of large lan-               and the 9th International Joint Conference on Natu-
        guage models for education,” Learning and Individual              ral Language Processing, EMNLP-IJCNLP 2019, Hong
        Differences, vol. 103, p. 102274, 2023.                           Kong, China, November 3-7, 2019, 2019, pp. 2567–2577.
[906]   A. Blair-Stanek, N. Holzenberger, and B. V. Durme,        [922]   A. Krithara, A. Nentidis, K. Bougiatiotis, and
        “Can GPT-3 perform statutory reasoning?” CoRR,                    G. Paliouras, “Bioasq-qa: A manually curated corpus
        vol. abs/2302.06100, 2023.                                        for biomedical question answering,” 2022.
[907]   D. Trautmann, A. Petrova, and F. Schilder, “Legal         [923]   Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng,
        prompt engineering for multilingual legal judgement               and H. Chen, “Oceangpt: A large language model
                                                                                                                               139

        for ocean science tasks,” CoRR, vol. abs/2310.02031,              of foundation models,” CoRR, vol. abs/2309.16039,
        2023.                                                             2023.
[924]   C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K.       [941]   kaiokendev, “Things I’m learning while training su-
        Dam, M. Zhang, J. U. Kim, S. T. Kim, J. Choi, G. Park,            perhot.” 2023.
        S. Bae, L. Lee, P. Hui, I. S. Kweon, and C. S. Hong,      [942]   Z. Dong, T. Tang, J. Li, W. X. Zhao, and J. Wen,
        “One small step for generative ai, one giant leap for             “BAMBOO: A comprehensive benchmark for evalu-
        AGI: A complete survey on chatgpt in AIGC era,”                   ating long text modeling capacities of large language
        CoRR, vol. abs/2304.06488, 2023.                                  models,” CoRR, vol. abs/2309.13345, 2023.
[925]   M. Haman and M. Skolnik, “Using chatgpt to con-           [943]   J. Su. (2023) Transformer upgrade path: 12, infinite
        duct a literature review.” Accountability in research,            extrapolation of rerope?
        2023.                                                     [944]   A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sun-
[926]   Ö. Aydın and E. Karaarslan, “Openai chatgpt gen-                 dararajan, and S. Naidu, “Giraffe: Adventures in
        erated literature review: Digital twin in healthcare,”            expanding context lengths in llms,” CoRR, vol.
        SSRN Electronic Journal, 2022.                                    abs/2308.10882, 2023.
[927]   Y. J. Park, D. Kaplan, Z. Ren, C. Hsu, C. Li, H. Xu,      [945]   G. Izacard and E. Grave, “Leveraging passage re-
        S. Li, and J. Li, “Can chatgpt be used to generate                trieval with generative models for open domain
        scientific hypotheses?” CoRR, vol. abs/2304.12208,                question answering,” in Proceedings of the 16th Con-
        2023.                                                             ference of the European Chapter of the Association for
[928]   M. M. Hassan, R. A. Knipper, and S. K. K. Santu,                  Computational Linguistics: Main Volume, EACL 2021,
        “Chatgpt as your personal data scientist,” CoRR, vol.             Online, April 19 - 23, 2021. Association for Compu-
        abs/2305.13657, 2023.                                             tational Linguistics, 2021, pp. 874–880.
[929]   L. Cheng, X. Li, and L. Bing, “Is GPT-4 a good data       [946]   N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar,
        analyst?” CoRR, vol. abs/2305.15038, 2023.                        O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown,
[930]   S. I. M. Hussam Alkaissi, “Artificial hallucinations in           and Y. Shoham, “Parallel context windows for large
        chatgpt: Implications in scientific writing,” PubMed,             language models,” in Proceedings of the 61st Annual
        2023.                                                             Meeting of the Association for Computational Linguistics
[931]   A. Azaria, R. Azoulay, and S. Reches, “Chatgpt                    (Volume 1: Long Papers), ACL 2023, Toronto, Canada,
        is a remarkable tool – for experts,” CoRR, vol.                   July 9-14, 2023.       Association for Computational
        abs/2306.03102, 2023.                                             Linguistics, 2023, pp. 6383–6402.
[932]   O. O. Buruk, “Academic writing with GPT-3.5: reflec-      [947]   I. Beltagy, M. E. Peters, and A. Cohan, “Long-
        tions on practices, efficacy and transparency,” CoRR,             former: The long-document transformer,” CoRR, vol.
        vol. abs/2304.11079, 2023.                                        abs/2004.05150, 2020.
[933]   R. Liu and N. B. Shah, “Reviewergpt? an exploratory       [948]   G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis,
        study on using large language models for paper                    “Efficient streaming language models with attention
        reviewing,” CoRR, vol. abs/2306.00622, 2023.                      sinks,” CoRR, vol. abs/2309.17453, 2023.
[934]   M. Kosinski, “Theory of mind may have sponta-             [949]   N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilac-
        neously emerged in large language models,” CoRR,                  qua, F. Petroni, and P. Liang, “Lost in the middle:
        vol. abs/2302.02083, 2023.                                        How language models use long contexts,” Transac-
[935]   M. M. Amin, E. Cambria, and B. W. Schuller, “Will                 tions of the Association for Computational Linguistics,
        affective computing emerge from foundation models                 vol. 12, pp. 157–173, 2024.
        and general ai? A first evaluation on chatgpt,” CoRR,     [950]   C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and
        vol. abs/2303.03186, 2023.                                        S. Wang, “Lm-infinite: Simple on-the-fly length gen-
[936]   G. Sridhara, R. H. G., and S. Mazumdar, “Chatgpt: A               eralization for large language models,” CoRR, vol.
        study on its utility for ubiquitous software engineer-            abs/2308.16137, 2023.
        ing tasks,” CoRR, vol. abs/2305.16837, 2023.              [951]   A. Bertsch, U. Alon, G. Neubig, and M. R. Gorm-
[937]   W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li,                  ley, “Unlimiformer: Long-range transformers with
        G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian,                    unlimited length input,” CoRR, vol. abs/2305.01625,
        Y. Liu, and Z. Chen, “Automatic code summariza-                   2023.
        tion via chatgpt: How far are we?” CoRR, vol.             [952]   Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy,
        abs/2305.12865, 2023.                                             “Memorizing transformers,” in The Tenth Interna-
[938]   C. S. Xia and L. Zhang, “Conversational automated                 tional Conference on Learning Representations, ICLR
        program repair,” CoRR, vol. abs/2301.13246, 2023.                 2022, Virtual Event, April 25-29, 2022.         OpenRe-
[939]   A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das,                view.net, 2022.
        and S. Reddy, “The impact of positional encoding          [953]   Y. Lu, X. Zhou, W. He, J. Zhao, T. Ji, T. Gui, Q. Zhang,
        on length generalization in transformers,” CoRR, vol.             and X. Huang, “Longheads: Multi-head attention
        abs/2305.19466, 2023.                                             is secretly a long context processor,” CoRR, vol.
[940]   W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava,              abs/2402.10685, 2024.
        R. Hou, L. Martin, R. Rungta, K. A. Sankararaman,         [954]   C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang,
        B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang,                Z. Liu, S. Han, and M. Sun, “Infllm: Unveiling the in-
        K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis,                trinsic capacity of llms for understanding extremely
        S. Wang, and H. Ma, “Effective long-context scaling               long sequences with training-free memory,” CoRR,
                                                                                                                               140

        vol. abs/2402.04617, 2024.                                 [971]   Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and
[955]   Y. Fu, R. Panda, X. Niu, X. Yue, H. Hajishirzi, Y. Kim,            I. Mordatch, “Improving factuality and reasoning in
        and H. Peng, “Data engineering for scaling language                language models through multiagent debate,” CoRR,
        models to 128k context,” CoRR, vol. abs/2402.10171,                vol. abs/2305.14325, 2023.
        2024.                                                      [972]   Y. Shao, L. Li, J. Dai, and X. Qiu, “Character-llm:
[956]   K. Lv, X. Liu, Q. Guo, H. Yan, C. He, X. Qiu,                      A trainable agent for role-playing,” in Proceedings of
        and D. Lin, “Longwanjuan: Towards systematic                       the 2023 Conference on Empirical Methods in Natural
        measurement for long text quality,” CoRR, vol.                     Language Processing, EMNLP 2023, Singapore, Decem-
        abs/2402.13583, 2024.                                              ber 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds.
[957]   H. Chen, R. Pasunuru, J. Weston, and A. Celiky-                    Association for Computational Linguistics, 2023, pp.
        ilmaz, “Walking down the memory maze: Beyond                       13 153–13 187.
        context limit through interactive reading,” CoRR, vol.     [973]   W. Hua, X. Yang, Z. Li, W. Cheng, and Y. Zhang,
        abs/2310.05029, 2023.                                              “Trustagent: Towards safe and trustworthy llm-
[958]   W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou,            based agents through agent constitution,” CoRR, vol.
        R. Cotterell, and M. Sachan, “Recurrentgpt: Interac-               abs/2402.01586, 2024.
        tive generation of (arbitrarily) long text,” CoRR, vol.    [974]   L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng,
        abs/2305.13304, 2023.                                              H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and
[959]   C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and           T. Liu, “A survey on hallucination in large language
        J. E. Gonzalez, “Memgpt: Towards llms as operating                 models: Principles, taxonomy, challenges, and open
        systems,” CoRR, vol. abs/2310.08560, 2023.                         questions,” CoRR, vol. abs/2311.05232, 2023.
[960]   P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu,          [975]   I. Loshchilov and F. Hutter, “Decoupled weight de-
        S. Subramanian, E. Bakhturina, M. Shoeybi, and                     cay regularization,” in ICLR (Poster).        OpenRe-
        B. Catanzaro, “Retrieval meets long context large                  view.net, 2019.
        language models,” CoRR, vol. abs/2310.03025, 2023.         [976]   V. A. Korthikanti, J. Casper, S. Lym, L. McAfee,
[961]   S. Russell and P. Norvig, Artificial Intelligence:                 M. Andersch, M. Shoeybi, and B. Catanzaro, “Re-
        A Modern Approach (4th Edition). Pearson, 2020.                    ducing activation recomputation in large transformer
        [Online]. Available: http://aima.cs.berkeley.edu/                  models,” in MLSys. mlsys.org, 2023.
[962]   B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J.       [977]   S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He,
        Gershman, “Building machines that learn and think                  “Zero: memory optimizations toward training tril-
        like people,” CoRR, vol. abs/1604.00289, 2016.                     lion parameter models,” in Proceedings of the Interna-
[963]   S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran,                         tional Conference for High Performance Computing, Net-
        K. Narasimhan, and Y. Cao, “React: Synergizing rea-                working, Storage and Analysis, SC 2020, Virtual Event /
        soning and acting in language models,” CoRR, vol.                  Atlanta, Georgia, USA, November 9-19, 2020, C. Cuic-
        abs/2210.03629, 2022.                                              chi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM,
[964]   2023. [Online]. Available: https://github.com/                     2020, p. 20.
        AntonOsika/gpt-engineer                                    [978]   J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase,
[965]   X. Team, “Xagent: An autonomous agent for complex                  S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload:
        task solving,” 2023.                                               Democratizing billion-scale model training,” in 2021
[966]   G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin,               USENIX Annual Technical Conference, USENIX ATC
        and B. Ghanem, “CAMEL: communicative agents for                    2021, July 14-16, 2021, I. Calciu and G. Kuenning, Eds.
        ”mind” exploration of large scale language model                   USENIX Association, 2021, pp. 551–564.
        society,” CoRR, vol. abs/2303.17760, 2023.                 [979]   S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and
[967]   S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang,                     Y. He, “Zero-infinity: breaking the GPU memory wall
        C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou,                  for extreme scale deep learning,” in SC. ACM, 2021,
        C. Ran, L. Xiao, and C. Wu, “Metagpt: Meta pro-                    p. 59.
        gramming for multi-agent collaborative framework,”         [980]   T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré,
        CoRR, vol. abs/2308.00352, 2023.                                   “Flashattention: Fast and memory-efficient exact at-
[968]   C. Pham, B. Liu, Y. Yang, Z. Chen, T. Liu, J. Yuan,                tention with io-awareness,” in NeurIPS, 2022.
        B. A. Plummer, Z. Wang, and H. Yang, “Let models           [981]   S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L.
        speak ciphers: Multiagent debate through embed-                    Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses:
        dings,” CoRR, vol. abs/2310.06272, 2023.                           System optimizations for enabling training of ex-
[969]   W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian,                 treme long sequence transformer models,” CoRR,
        C.-M. Chan, Y. Qin, Y. Lu, R. Xie et al., “Agent-                  vol. abs/2309.14509, 2023.
        verse: Facilitating multi-agent collaboration and ex-      [982]   H. Liu, M. Zaharia, and P. Abbeel, “Ring attention
        ploring emergent behaviors in agents,” arXiv preprint              with blockwise transformers for near-infinite con-
        arXiv:2308.10848, 2023.                                            text,” CoRR, vol. abs/2310.01889, 2023.
[970]   Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu,          [983]   Y. Chen, T. Tang, E. Xiang, L. Li, W. X. Zhao,
        L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah,             J. Wang, Y. Chai, and J. Wen, “Towards coarse-to-fine
        R. W. White, D. Burger, and C. Wang, “Autogen:                     evaluation of inference efficiency for large language
        Enabling next-gen llm applications via multi-agent                 models,” CoRR, vol. abs/2404.11502, 2024.
        conversation framework,” 2023.                             [984]   Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin,
                                                                                                                                 141

        B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang,                abs/2404.02258, 2024.
        “Flexgen: High-throughput generative inference of           [999]  Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng,
        large language models with a single GPU,” in ICML,                 J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang,
        ser. Proceedings of Machine Learning Research, vol.                M. Chowdhury, and M. Zhang, “Efficient large
        202. PMLR, 2023, pp. 31 094–31 116.                                language models: A survey,” 2024. [Online].
[985]   T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-                 Available: https://arxiv.org/abs/2312.03863
        decoding for long-context inference,” 2023. [Online].       [1000] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W.
        Available: https://crfm.stanford.edu/2023/10/12/                   Mahoney, and K. Keutzer, “A survey of quantization
        flashdecoding.html                                                 methods for efficient neural network inference,”
[986]   C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan,                        CoRR, vol. abs/2103.13630, 2021. [Online]. Available:
        J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin,                https://arxiv.org/abs/2103.13630
        A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed-          [1001] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettle-
        fastgen: High-throughput text generation for llms                  moyer, “Llm.int8(): 8-bit matrix multiplication for
        via MII and deepspeed-inference,” CoRR, vol.                       transformers at scale,” CoRR, vol. abs/2208.07339,
        abs/2401.08671, 2024.                                              2022.
[987]   Y. Leviathan, M. Kalman, and Y. Matias, “Fast infer-        [1002] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han,
        ence from transformers via speculative decoding,” in               “Awq: Activation-aware weight quantization for llm
        International Conference on Machine Learning, 2023.                compression and acceleration,” 2023.
[988]   C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre,      [1003] Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “PB-
        and J. Jumper, “Accelerating large language model                  LLM: partially binarized large language models,”
        decoding with speculative sampling,” CoRR, vol.                    CoRR, vol. abs/2310.00034, 2023. [Online]. Available:
        abs/2302.01318, 2023.                                              https://doi.org/10.48550/arXiv.2310.00034
[989]   X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang,            [1004] T. Dettmers, R. Svirschevski, V. Egiazarian,
        R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar,                   D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov,
        and Z. Jia, “Specinfer: Accelerating generative LLM                T. Hoefler, and D. Alistarh, “Spqr: A sparse-
        serving with speculative inference and token tree                  quantized representation for near-lossless LLM
        verification,” CoRR, vol. abs/2305.09781, 2023.                    weight compression,” CoRR, vol. abs/2306.03078,
[990]   B. Spector and C. Ré, “Accelerating LLM infer-                    2023.
        ence with staged speculative decoding,” CoRR, vol.          [1005] Z. Guan, H. Huang, Y. Su, H. Huang, N. Wong, and
        abs/2308.04623, 2023.                                              H. Yu, “APTQ: attention-aware post-training mixed-
[991]   L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to                precision quantization for large language models,”
        use large language models while reducing cost and                  CoRR, vol. abs/2402.14866, 2024. [Online]. Available:
        improving performance,” CoRR, vol. abs/2305.05176,                 https://doi.org/10.48550/arXiv.2402.14866
        2023.                                                       [1006] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “OWQ:
[992]   M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large               outlier-aware weight quantization for efficient fine-
        language model cascades with mixture of thoughts                   tuning and inference of large language models,” in
        representations for cost-efficient reasoning,” CoRR,               Thirty-Eighth AAAI Conference on Artificial Intelligence,
        vol. abs/2310.03094, 2023.                                         AAAI 2024, Thirty-Sixth Conference on Innovative
[993]   J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher,          Applications of Artificial Intelligence, IAAI 2024,
        “Non-autoregressive neural machine translation,” in                Fourteenth Symposium on Educational Advances in
        ICLR (Poster). OpenReview.net, 2018.                               Artificial Intelligence, EAAI 2014, February 20-
[994]   C. Wang, J. Zhang, and H. Chen, “Semi-                             27, 2024, Vancouver, Canada, M. J. Wooldridge,
        autoregressive neural machine translation,” in                     J. G. Dy, and S. Natarajan, Eds. AAAI Press,
        EMNLP. Association for Computational Linguistics,                  2024, pp. 13 355–13 364. [Online]. Available: https:
        2018, pp. 479–488.                                                 //doi.org/10.1609/aaai.v38i12.29237
[995]   T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and    [1007] G. Xiao, J. Lin, M. Seznec, J. Demouth, and
        T. Dao, “Medusa: Simple LLM inference acceleration                 S. Han, “Smoothquant: Accurate and efficient post-
        framework with multiple decoding heads,” CoRR,                     training quantization for large language models,”
        vol. abs/2401.10774, 2024.                                         CoRR, vol. abs/2211.10438, 2022. [Online]. Available:
[996]   S. Teerapittayanon, B. McDanel, and H. T. Kung,                    https://doi.org/10.48550/arXiv.2211.10438
        “Branchynet: Fast inference via early exiting from          [1008] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li,
        deep neural networks,” in ICPR. IEEE, 2016, pp.                    and Y. He, “Zeroquant: Efficient and affordable post-
        2464–2469.                                                         training quantization for large-scale transformers,”
[997]   G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten,                in NeurIPS, 2022.
        and K. Q. Weinberger, “Multi-scale dense networks           [1009] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alis-
        for resource efficient image classification,” in ICLR.             tarh, “Gptq: Accurate post-training quantization for
        OpenReview.net, 2018.                                              generative pre-trained transformers,” arXiv preprint
[998]   D. Raposo, S. Ritter, B. A. Richards, T. P. Lilli-                 arXiv:2210.17323, 2022.
        crap, P. C. Humphreys, and A. Santoro, “Mixture-            [1010] E. Frantar and D. Alistarh, “Optimal brain compres-
        of-depths: Dynamically allocating compute in                       sion: A framework for accurate post-training quanti-
        transformer-based language models,” CoRR, vol.                     zation and pruning,” in NeurIPS, 2022.
                                                                                                                                 142

[1011] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettle-               Chua, and Q. Li, “A survey on rag meets llms: To-
       moyer, “Qlora: Efficient finetuning of quantized                   wards retrieval-augmented large language models,”
       llms,” arXiv preprint arXiv:2305.14314, 2023.                      arXiv preprint arXiv:2405.06211, 2024.
[1012] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock,               [1025] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai,
       Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chan-                 J. Sun, and H. Wang, “Retrieval-augmented gener-
       dra, “Llm-qat: Data-free quantization aware training               ation for large language models: A survey,” arXiv
       for large language models,” 2023.                                  preprint arXiv:2312.10997, 2023.
[1013] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-       [1026] S. Robertson and H. Zaragoza, The probabilistic rele-
       v2: Exploring post-training quantization in llms from              vance framework: BM25 and beyond, 2009.
       comprehensive study to low rank compensation,”              [1027] Y. Wang, R. Ren, J. Li, W. X. Zhao, J. Liu, and J.-R.
       2023.                                                              Wen, “Rear: A relevance-aware retrieval-augmented
[1014] T. Dettmers and L. Zettlemoyer, “The case for 4-bit                framework for open-domain question answering,”
       precision: k-bit inference scaling laws,” CoRR, vol.               arXiv preprint arXiv:2402.17497, 2024.
       abs/2212.09720, 2022.                                       [1028] D. Rau, S. Wang, H. Déjean, and S. Clinchant, “Con-
[1015] L. Peiyu, L. Zikang, G. Ze-Feng, G. Dawei, Z. W. Xin,              text embeddings for efficient answer generation in
       L. Yaliang, D. Bolin, and W. Ji-Rong, “Do emergent                 rag,” arXiv preprint arXiv:2407.09252, 2024.
       abilities exist in quantized large language models:         [1029] F. Xu, W. Shi, and E. Choi, “Recomp: Improving
       An empirical study,” arXiv preprint arXiv:2307.08072,              retrieval-augmented lms with context compression
       2023.                                                              and selective augmentation,” in The Twelfth Interna-
[1016] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang,                           tional Conference on Learning Representations, 2024.
       H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora:         [1030] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan,
       Quantization-aware low-rank adaptation of large                    and W. Chen, “Enhancing retrieval-augmented large
       language models,” CoRR, vol. abs/2309.14717, 2023.                 language models with iterative retrieval-generation
       [Online]. Available: https://doi.org/10.48550/arXiv.               synergy,” in Findings of the Association for Computa-
       2309.14717                                                         tional Linguistics: EMNLP 2023, 2023, pp. 9248–9274.
[1017] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis,           [1031] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao,
       W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning-                    D. Yu, and H. Zhang, “Dense x retrieval: What re-
       aware quantization for large language models,”                     trieval granularity should we use?” arXiv preprint
       CoRR, vol. abs/2310.08659, 2023. [Online]. Available:              arXiv:2312.06648, 2023.
       https://doi.org/10.48550/arXiv.2310.08659                   [1032] X. Huang, S. Cheng, Y. Shu, Y. Bao, and Y. Qu,
[1018] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge                   “Question decomposition tree for answering com-
       distillation of large language models,” CoRR,                      plex questions over knowledge bases,” in Proceedings
       vol. abs/2306.08543, 2023. [Online]. Available:                    of the AAAI Conference on Artificial Intelligence, vol. 37,
       https://doi.org/10.48550/arXiv.2306.08543                          no. 11, 2023, pp. 12 924–12 932.
[1019] C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii,              [1033] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and
       A. Ratner, R. Krishna, C. Lee, and T. Pfister,                     Y. Chang, “Learning to rewrite queries,” in Pro-
       “Distilling step-by-step! outperforming larger                     ceedings of the 25th ACM International on Conference
       language models with less training data and                        on Information and Knowledge Management, 2016, pp.
       smaller model sizes,” in Findings of the Association for           1443–1452.
       Computational Linguistics: ACL 2023, Toronto, Canada,       [1034] J. Liu and B. Mozafari, “Query rewriting via large
       July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and                 language models,” arXiv preprint arXiv:2403.09060,
       N. Okazaki, Eds. Association for Computational                     2024.
       Linguistics, 2023, pp. 8003–8017. [Online]. Available:      [1035] F. Ye, M. Fang, S. Li, and E. Yilmaz, “Enhancing
       https://doi.org/10.18653/v1/2023.findings-acl.507                  conversational search: Large language model-aided
[1020] E. Frantar and D. Alistarh, “Sparsegpt: Massive lan-               informative query rewriting,” in Findings of the As-
       guage models can be accurately pruned in one-                      sociation for Computational Linguistics: EMNLP 2023,
       shot,” in International Conference on Machine Learning.            2023, pp. 5985–6006.
       PMLR, 2023, pp. 10 323–10 337.                              [1036] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C.
[1021] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the                   Park, “Adaptive-rag: Learning to adapt retrieval-
       structural pruning of large language models,” Ad-                  augmented large language models through question
       vances in neural information processing systems, vol. 36,          complexity,” arXiv preprint arXiv:2403.14403, 2024.
       pp. 21 702–21 720, 2023.                                    [1037] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu,
[1022] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared                     “Llmlingua: Compressing prompts for accelerated
       llama: Accelerating language model pre-training via                inference of large language models,” in Proceedings
       structured pruning,” arXiv preprint arXiv:2310.06694,              of the 2023 Conference on Empirical Methods in Natural
       2023.                                                              Language Processing, 2023, pp. 13 358–13 376.
[1023] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettle-          [1038] T. Xu, S. Wu, S. Diao, X. Liu, X. Wang, Y. Chen,
       moyer, “8-bit optimizers via block-wise quantiza-                  and J. Gao, “Sayself: Teaching llms to express con-
       tion,” 9th International Conference on Learning Repre-             fidence with self-reflective rationales,” arXiv preprint
       sentations, ICLR, 2022.                                            arXiv:2405.20974, 2024.
[1024] Y. Ding, W. Fan, L. Ning, S. Wang, H. Li, D. Yin, T.-S.     [1039] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Ha-
                                                                                                                                143

       jishirzi, “Self-rag: Learning to retrieve, generate,               hallucinations of llms by validating low-confidence
       and critique through self-reflection,” arXiv preprint              generation,” CoRR, vol. abs/2307.03987, 2023.
       arXiv:2310.11511, 2023.                                     [1053] Y. Yehuda, I. Malkiel, O. Barkan, J. Weill, R. Ronen,
[1040] H. Luo, Y.-S. Chuang, Y. Gong, T. Zhang, Y. Kim,                   and N. Koenigstein, “In search of truth: An interro-
       X. Wu, D. Fox, H. Meng, and J. Glass, “Sail: Search-               gation approach to hallucination detection,” CoRR,
       augmented instruction learning,” arXiv preprint                    vol. abs/2403.02889, 2024.
       arXiv:2305.15225, 2023.                                     [1054] S. Min, K. Krishna, X. Lyu, M. Lewis, W. tau Yih, P. W.
[1041] X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli,                    Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi,
       R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis             “Factscore: Fine-grained atomic evaluation of factual
       et al., “Ra-dit: Retrieval-augmented dual instruction              precision in long form text generation,” 2023.
       tuning,” arXiv preprint arXiv:2310.01352, 2023.             [1055] I. Chern, S. Chern, S. Chen, W. Yuan, K. Feng,
[1042] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang,                 C. Zhou, J. He, G. Neubig, and P. Liu, “Factool:
       “Retrieval augmented language model pre-training,”                 Factuality detection in generative AI - A tool aug-
       in International conference on machine learning. PMLR,             mented framework for multi-task and multi-domain
       2020, pp. 3929–3938.                                               scenarios,” CoRR, vol. abs/2307.13528, 2023.
[1043] K. Lee, M.-W. Chang, and K. Toutanova, “Latent re-          [1056] X. Cheng, J. Li, W. X. Zhao, H. Zhang, F. Zhang,
       trieval for weakly supervised open domain question                 D. Zhang, K. Gai, and J.-R. Wen, “Small agent can
       answering,” in Proceedings of the 57th Annual Meeting              also rock! empowering small language models as
       of the Association for Computational Linguistics, 2019,            hallucination detector,” CoRR, vol. abs/2406.11277,
       pp. 6086–6096.                                                     2024.
[1044] J. Li, J. Chen, R. Ren, X. Cheng, W. X. Zhao, J.-Y.         [1057] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,
       Nie, and J.-R. Wen, “The dawn after the dark: An                   A. Askell, S. R. Bowman, E. Durmus, Z. Hatfield-
       empirical study on factuality hallucination in large               Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. Mc-
       language models,” arXiv preprint arXiv:2401.03205,                 Candlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan,
       2024.                                                              M. Zhang, and E. Perez, “Towards understanding
[1045] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii,          sycophancy in language models,” in ICLR. Open-
       Y. J. Bang, A. Madotto, and P. Fung, “Survey of                    Review.net, 2024.
       hallucination in natural language generation,” ACM          [1058] J. W. Wei, D. Huang, Y. Lu, D. Zhou, and Q. V. Le,
       Comput. Surv., 2023.                                               “Simple synthetic data reduces sycophancy in large
[1046] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang,          language models,” CoRR, vol. abs/2308.03958, 2023.
       E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi,      [1059] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty,
       F. Shi, and S. Shi, “Siren’s song in the AI ocean: A               Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and
       survey on hallucination in large language models,”                 K. Guu, “RARR: researching and revising what lan-
       arXiv preprint arXiv:2309.01219, 2023.                             guage models say, using language models,” in ACL
[1047] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer,                  (1). Association for Computational Linguistics, 2023,
       “Scheduled sampling for sequence prediction with                   pp. 16 477–16 508.
       recurrent neural networks,” in NIPS, 2015, pp. 1171–        [1060] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-
       1179.                                                              and-edit: A knowledge-enhanced chain-of-thought
[1048] M. Sharma, M. Tong, T. Korbak, D. Duvenaud,                        framework,” in ACL (1). Association for Compu-
       A. Askell, S. R. Bowman, N. Cheng, E. Dur-                         tational Linguistics, 2023, pp. 5823–5840.
       mus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec,          [1061] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sab-
       T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch,                  harwal, “Interleaving retrieval with chain-of-thought
       N. Schiefer, D. Yan, M. Zhang, and E. Perez, “To-                  reasoning for knowledge-intensive multi-step ques-
       wards understanding sycophancy in language mod-                    tions,” CoRR, vol. abs/2212.10509, 2022.
       els,” CoRR, vol. abs/2310.13548, 2023.                      [1062] K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Watten-
[1049] V. Rawte, P. Priya, S. M. T. I. Tonmoy, S. M. M.                   berg, “Inference-time intervention: Eliciting truthful
       Zaman, A. P. Sheth, and A. Das, “Exploring the re-                 answers from a language model,” in NeurIPS, 2023.
       lationship between LLM hallucinations and prompt            [1063] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer,
       linguistic nuances: Readability, formality, and con-               and S. W. Yih, “Trusting your evidence: Halluci-
       creteness,” CoRR, vol. abs/2309.11064, 2023.                       nate less with context-aware decoding,” CoRR, vol.
[1050] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li,              abs/2305.14739, 2023.
       A. Celikyilmaz, and J. Weston, “Chain-of-verification       [1064] D. Kahneman, “Thinking, fast and slow,” Farrar,
       reduces hallucination in large language models,”                   Straus and Giroux, 2011.
       CoRR, vol. abs/2309.11495, 2023.                            [1065] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma,
[1051] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck-             Y. Li, J. Yang, W. Zhou et al., “A comparative study
       gpt: Zero-resource black-box hallucination detection               on reasoning patterns of openai’s o1 model,” arXiv
       for generative large language models,” in EMNLP.                   preprint arXiv:2410.13639, 2024.
       Association for Computational Linguistics, 2023, pp.        [1066] T. Zhong, Z. Liu, Y. Pan, Y. Zhang, Y. Zhou, S. Liang,
       9004–9017.                                                         Z. Wu, Y. Lyu, P. Shu, X. Yu et al., “Evaluation
[1052] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu,                 of openai o1: Opportunities and challenges of agi,”
       “A stitch in time saves nine: Detecting and mitigating             arXiv preprint arXiv:2409.18486, 2024.
                                                                 144

[1067] Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu,
       Y. Tang, J. Wang, X. Cheng, H. Song et al., “Imitate,
       explore, and self-improve: A reproduction report
       on slow-thinking reasoning systems,” arXiv preprint
       arXiv:2412.09413, 2024.
[1068] D. Team, “Deepseek-r1-lite-preview is now live: un-
       leashing supercharged reasoning power,” 2024.
[1069] Q. Team, “Qwq: Reflect deeply on the boundaries of
       the unknown, november 2024,” URL https://qwenlm.
       github. io/blog/qwq-32b-preview.
[1070] DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning
       capability in llms via reinforcement learning,” 2025.
[1071] J. Jiang, Z. Chen, Y. Min, J. Chen, X. Cheng, J. Wang,
       Y. Tang, H. Sun, J. Deng, W. X. Zhao, Z. Liu, D. Yan,
       J. Xie, Z. Wang, and J.-R. Wen, “Enhancing llm rea-
       soning with reward-guided tree search,” 2024.
[1072] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang,
       Y. Yang, S. Shi, and Z. Tu, “Encouraging divergent
       thinking in large language models through multi-
       agent debate,” arXiv preprint arXiv:2305.19118, 2023.
[1073] Y. Du, Z. Liu, Y. Li, W. X. Zhao, Y. Huo, B. Wang,
       W. Chen, Z. Liu, Z. Wang, and J.-R. Wen, “Virgo:
       A preliminary exploration on reproducing o1-like
       mllm,” arXiv preprint arXiv:2501.01904, 2025.
[1074] K. Team, “Kimi k1.5: Scaling reinforcement learning
       with llms,” 2025. [Online]. Available: https://arxiv.
       org/abs/2501.12599
[1075] OpenAI, “Openai’s reinforcement fine-tuning re-
       search program,” OpenAI Blog, 2024.
[1076] Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou,
       Q. Guo, X. Huang, and X. Qiu, “Scaling of search
       and learning: A roadmap to reproduce o1 from
       reinforcement learning perspective,” arXiv preprint
       arXiv:2412.14135, 2024.
[1077] Z. Chen, Y. Min, B. Zhang, J. Chen, J. Jiang, D. Cheng,
       W. X. Zhao, Z. Liu, X. Miao, Y. Lu, L. Fang, Z. Wang,
       and J.-R. Wen, “An empirical study on eliciting and
       improving r1-like reasoning models,” 2025.
[1078] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi,
       H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseek-
       math: Pushing the limits of mathematical rea-
       soning in open language models,” arXiv preprint
       arXiv:2402.03300, 2024.
[1079] W. Kool, H. van Hoof, and M. Welling, “Buy 4 REIN-
       FORCE samples, get a baseline for free!” in Deep Re-
       inforcement Learning Meets Structured Prediction, ICLR
       2019 Workshop, New Orleans, Louisiana, United States,
       May 6, 2019. OpenReview.net, 2019.
[1080] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm
       test-time compute optimally can be more effective
       than scaling model parameters,” 2024. [Online].
       Available: https://arxiv.org/abs/2408.03314
[1081] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan,
       Y. Xie, Y. Li, B. Ding, and J. Zhou, “Federatedscope-
       llm: A comprehensive package for fine-tuning large
       language models in federated learning,” 2023.