Source material: History of AI Computing Architectures

A Survey of Large Language Models
Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen†

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored the mastering of language intelligence
by machine. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a
significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major
approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving
from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-
training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP)
tasks. Since researchers have found that model scaling can lead to an improved model capacity, they further investigate the scaling
effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these
enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g.,
in-context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models of different
parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g.,
containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia
and industry, and a remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has
attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI
community, which would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this
survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,
we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Furthermore, we
also summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides
an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

arXiv:2303.18223v19 [cs.CL] 18 Mar 2026

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation


1 INTRODUCTION
“The limits of my language mean the limits of my world.”
—Ludwig Wittgenstein

Language is a prominent ability in human beings to express and communicate, which develops in early childhood and evolves over a lifetime [3, 4]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, to enable machines to read, write, and communicate like humans [5].

Technically, language modeling (LM) is one of the major approaches to advancing language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, which can be divided into four major development stages:

• Statistical language models (SLM). SLMs [6–9] are developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [10, 11] and natural language processing (NLP) [12–14]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [15] and Good–Turing estimation [16] have been introduced to alleviate the data sparsity problem.

• Neural language models (NLM). NLMs [1, 17, 18] characterize the probability of word sequences by neural networks, e.g., multi-layer perceptron (MLP) and recurrent neural networks (RNNs). As a remarkable contribution, the work in [1] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for text data, a general neural network approach was developed to build a unified, end-to-end solution for

• Version: v18 (major update on March 7, 2026).
• GitHub link: https://github.com/RUCAIBox/LLMSurvey
• Chinese book link: lmbook-zh.github.io
• † Corresponding Author.
• * K. Zhou and J. Li contribute equally to this work.
• The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: batmanfly@gmail.com
• The authors of this survey paper reserve all the copyrights of the figures/tables, and any use of these materials for publication purpose must be officially granted by the survey authors.
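
The n-gram idea described for statistical language models above can be illustrated with a minimal sketch. It assumes a toy whitespace-tokenized corpus and uses add-k smoothing, which is a simpler stand-in for the back-off and Good–Turing estimators cited in the text, not an implementation of them. The function names and the `<s>`/`</s>` boundary tokens are illustrative choices.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams over whitespace-tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sent in sentences:
        # Boundary tokens are an assumption of this sketch.
        tokens = ["<s>"] + sent.split() + ["</s>"]
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, nxt, context_counts, bigram_counts, vocab, k=1.0):
    """P(nxt | prev) under the first-order Markov assumption, with add-k smoothing."""
    return (bigram_counts[(prev, nxt)] + k) / (context_counts[prev] + k * len(vocab))

# Toy corpus (made up for illustration).
corpus = ["the cat sat", "the dog sat", "the cat ran"]
ctx, bi, vocab = train_bigram(corpus)

p_seen = bigram_prob("the", "cat", ctx, bi, vocab)    # bigram observed in the corpus
p_unseen = bigram_prob("the", "ran", ctx, bi, vocab)  # bigram never observed
```

Without smoothing, the unseen bigram would receive probability zero, which is the data sparsity problem the text refers to. Moving from bigrams to higher-order n-grams multiplies the number of contexts to estimate, which is the curse of dimensionality mentioned above.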

[Figure 1: two line plots of cumulative arXiv paper counts; x-axis: Time. Landmark labels recoverable from the extraction: BERT, GPT-series models, Codex, InstructGPT, ChatGPT, LLaMA.]
(a) Query="Language Model"    (b) Query="Large Language Model"

Fig. 1: The trends of the cumulative numbers of arXiv papers that contain the keyphrases “language model” (since June 2018)
and “large language model” (since October 2019), respectively. The statistics are calculated using exact match by querying
the keyphrases in title or abstract by months. We set different x-axis ranges for the two keyphrases, because “language
models” have been explored at an earlier time. We label the points corresponding to important landmarks in the research
progress of LLMs. A sharp increase occurs after the release of ChatGPT: the average number of published arXiv papers
that contain “large language model” in title or abstract goes from 0.40 per day to 8.58 per day (Figure 1(b)).

[Figure 2: timeline of four generations of language models; y-axis: task solving capacity; x-axis: time.]
Statistical LM (1990s): specific task helper. n-gram models. Statistical methods; probability estimation; assist in specific tasks.
Neural LM (2013): task-agnostic feature learner. Word2vec (NPLM), NLPS. Static word representations; neural context modeling; solve typical NLP tasks.
Pre-trained LM (2018): transferable NLP task solver. ELMo, BERT, GPT-1/2. Context-aware representations; pre-training + fine-tuning; solve various NLP tasks.
LLM (2020): general-purpose task solver. GPT-3/4, ChatGPT, Claude. Scaling language models; prompt-based completion; solve various real-world tasks.

Fig. 2: An evolution process of the four generations of language models (LM) from the perspective of task solving capacity.
Note that the time period for each stage may not be very accurate, and we set the time mainly according to the publication
date of the most representative studies at each stage. For neural language models, we abbreviate the paper titles of
two representative studies to name the two approaches: NPLM [1] (“A neural probabilistic language model”) and NLPS [2]
(“Natural language processing (almost) from scratch”). Due to the space limitation, we don’t list all representative studies in
this figure.

various NLP tasks [2]. Furthermore, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.

• Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Furthermore, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up studies, which set the “pre-training and fine-tuning” learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, it often requires fine-tuning the PLM for adapting to different downstream tasks.

• Large language models (LLM). Researchers find that scaling PLM (e.g., scaling model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies

have explored the performance limit by training an ever larger PLM (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community coins the term “large language models (LLM)”¹ for these large-sized PLMs [32–35], which attract increasing research attention (see Figure 1). A remarkable application of LLMs is ChatGPT² that adapts the LLMs from the GPT series for dialogue, which presents an amazing conversation ability with humans. We can observe a sharp increase of the arXiv papers that are related to LLMs after the release of ChatGPT in Figure 1.

As discussed before, language model is not a new technical concept specially for LLMs, but has evolved with the advance of artificial intelligence over the decades. Early language models mainly aim to model and generate text data, while latest language models (e.g., GPT-4) focus on complex task solving. From language modeling to task solving, it is an important leap in scientific thinking, which is the key to understanding the development of language models in the research history. From the perspective of task solving, the four generations of language models have exhibited different levels of model capacities. In Figure 2, we describe the evolution process of language models in terms of the task solving capacity. At first, statistical language models mainly assisted in some specific tasks (e.g., retrieval or speech tasks), in which the predicted or estimated probabilities can enhance the performance of task-specific approaches. Subsequently, neural language models focused on learning task-agnostic representations (e.g., features), aiming to reduce the efforts for human feature engineering. Furthermore, pre-trained language models learned context-aware representations that can be optimized according to downstream tasks. For the latest generation of language model, LLMs are enhanced by exploring the scaling effect on model capacity, which can be considered as general-purpose task solvers. To summarize, in the evolution process, the task scope that can be solved by language models has been greatly extended, and the task performance attained by language models has been significantly enhanced.

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experiences in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are posing a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to the rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled “Planning for AGI and beyond”, which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information seeking way through AI chatbots (i.e., ChatGPT), and New Bing³ presents an initial attempt that enhances the search results based on LLMs. In the field of CV, the researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 [46] has supported multimodal input by integrating the visual information. This new wave of technology would potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate the office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs, instead of smaller PLMs. As a more general issue, there is a lack of deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the “secrets” of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand of computation resources, it is very costly to carry out repetitive ablation studies for investigating the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite the capacities, LLMs are also likely to produce toxic, fictitious, or harmful contents. It requires effective and efficient control approaches to eliminate the potential risk of the use of LLMs [46].

Faced with both opportunities and challenges, more attention is needed on the research and development of LLMs. In order to provide a basic understanding of LLMs, this survey

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/
3. https://www.bing.com/new

conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation (how to effectively adapt pre-trained LLMs for better use), utilization (how to use LLMs for solving various downstream tasks) and capacity evaluation (how to evaluate the abilities of LLMs and existing empirical findings). We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website by collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs and the evolution of GPT-series models, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation, utilization, and capacity evaluation, respectively. Then, Section 8 discusses the practical guide for prompt design, and Section 9 reviews the applications of LLMs in several representative domains. Finally, we conclude the survey in Section 10 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW
In this section, we present an overview about the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs
Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters⁴, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To have a quick understanding of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.

Formulation of Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models. However, LLMs significantly extend the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

• KM scaling law⁵. In 2020, Kaplan et al. [30] (the OpenAI team) first proposed to model the power-law relationship of model performance with respect to three major factors, namely model size ($N$), dataset size ($D$), and the amount of training compute ($C$), for neural language models. Given a compute budget $c$, they empirically presented three basic formulas for the scaling law⁶:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \sim 0.076, \quad N_c \sim 8.8 \times 10^{13} \qquad (1)$$
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \sim 0.095, \quad D_c \sim 5.4 \times 10^{13}$$
$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \sim 0.050, \quad C_c \sim 3.1 \times 10^{8}$$

where $L(\cdot)$ denotes the cross entropy loss in nats, and a follow-up study [58] from OpenAI has shown that the language modeling loss can be decomposed into two parts, namely irreducible loss (the entropy of the true data distribution) and reducible loss (an estimate of the KL divergence between the true and model distributions). The three laws were derived by fitting the model performance with varied data sizes (22M to 23B tokens), model sizes (768 to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that the model performance has a strong dependence relation on the three factors.

• Chinchilla scaling law. As another representative study, Hoffmann et al. [34] (the Google DeepMind team) proposed an alternative form for scaling laws to instruct the compute-optimal training for LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients as below [34]:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}, \qquad (2)$$

where $E = 1.69$, $A = 406.4$, $B = 410.7$, $\alpha = 0.34$ and $\beta = 0.28$. By optimizing the loss $L(N, D)$ under the constraint $C \approx 6ND$, they showed that the optimal allocation of compute budget to model size and data size can be derived as follows:

$$N_{opt}(C) = G\left(\frac{C}{6}\right)^{a}, \quad D_{opt}(C) = G^{-1}\left(\frac{C}{6}\right)^{b}, \qquad (3)$$

where $a = \frac{\beta}{\alpha+\beta}$, $b = \frac{\alpha}{\alpha+\beta}$ and $G$ is a scaling coefficient that can be computed by $A$, $B$, $\alpha$ and $\beta$. As analyzed in [34],

4. In existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B.
5. Since there was not a model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law.
6. Here, $N_c$, $D_c$ and $C_c$ are measured in the number of non-embedding parameters, the number of training tokens and the number of PF-days (petaflop/s-days), respectively. According to the original paper [30], $C_c$ and $C$ should be denoted by $C_c^{min}$ and $C_{min}$, corresponding to the optimal use of compute. We use the simplified notations for ease of discussions.
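
As a worked step, the compute-optimal allocation in Equation (3) can be recovered from Equation (2). The sketch below treats the constraint $C \approx 6ND$ as an exact equality, which is an assumption made for the derivation; the closed form for $G$ follows from the same steps.

```latex
\begin{align*}
L\!\left(N, \tfrac{C}{6N}\right) &= E + \frac{A}{N^{\alpha}} + B\left(\frac{6N}{C}\right)^{\beta}
  && \text{(substitute } D = C/(6N) \text{ into Eq. (2))} \\
0 &= -\alpha A\, N^{-\alpha-1} + \beta B \left(\frac{6}{C}\right)^{\beta} N^{\beta-1}
  && \text{(set } \partial L / \partial N = 0\text{)} \\
N^{\alpha+\beta} &= \frac{\alpha A}{\beta B}\left(\frac{C}{6}\right)^{\beta} \\
N_{opt}(C) &= G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}},
  \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \\
D_{opt}(C) &= \frac{C}{6\,N_{opt}(C)} = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}
\end{align*}
```

With the fitted values $\alpha = 0.34$ and $\beta = 0.28$, the two exponents are roughly 0.45 and 0.55, so model size and data size grow at similar rates as the compute budget increases.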
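
The scaling-law formulas above can also be evaluated numerically. The following is a minimal sketch, not the original authors' code: it plugs in the constants quoted for Equations (1) and (2) and uses the closed-form minimizer of Equation (2) under the assumption that $C = 6ND$ holds exactly. The function names and the example budget are hypothetical.

```python
# Constants as quoted in the text for Equations (1) and (2).
KM_ALPHA_N, KM_N_C = 0.076, 8.8e13
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def km_loss_from_params(n_params):
    """KM law, Eq. (1): L(N) = (N_c / N) ** alpha_N, in nats."""
    return (KM_N_C / n_params) ** KM_ALPHA_N

def chinchilla_loss(n_params, n_tokens):
    """Chinchilla law, Eq. (2): L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA

def compute_optimal(flops):
    """Closed-form minimizer of Eq. (2) subject to C = 6 * N * D (assumed exact)."""
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    a = BETA / (ALPHA + BETA)   # exponent for model size
    b = ALPHA / (ALPHA + BETA)  # exponent for data size
    n_opt = g * (flops / 6.0) ** a
    d_opt = (flops / 6.0) ** b / g
    return n_opt, d_opt

budget = 1e21  # hypothetical compute budget in FLOPs
n_opt, d_opt = compute_optimal(budget)
loss_at_optimum = chinchilla_loss(n_opt, d_opt)
```

Moving the model size away from `n_opt` in either direction, while keeping the same budget, yields a higher value of `chinchilla_loss`, which is what "compute-optimal" means in this setting.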

given an increase in compute budget, the KM scaling law favors a larger budget allocation in model size than the data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3).

Discussion on Scaling Laws. After introducing the formulations, we continue to discuss scaling law in the following two aspects, to enhance its understanding:

• Predictable scaling. In practice, scaling law can be used to instruct the training of LLMs, and it has been proven feasible to reliably estimate the performance of larger models based on that of smaller models, called predictable scaling [46]. The benefits of predictable scaling for training LLMs are mainly twofold. Firstly, for large models, it is infeasible to rigorously examine various training tricks or variants, and it would be very helpful if experiences gained from small models could also apply to large models. For instance, small proxy models can be trained to find the optimal schedule of the data mixture for large models [59]. Secondly, the training of large-scale models takes a long time, often suffering from issues such as training loss spike, and scaling law can be employed to monitor the training status of LLMs, e.g., identifying abnormal performance at an early time. Although scaling law characterizes a smooth trend of performance increase (or loss decrease), it also indicates that diminishing returns⁷ might occur as models scale. An empirical study [58] from the OpenAI team has shown that representation quality or semantic content can still effectively improve even if approaching the point of diminishing returns (i.e., approaching the irreducible loss). This finding suggests that training large models is promising for improving the performance of downstream tasks. To further explore scaling effect, a potential issue is that the amount of available data for training LLMs is actually limited. With the ever-increasing model scale, the public text data would be soon “exhausted” for LLMs [60]. Thus, it will be meaningful to study how scaling laws apply to a data-constrained regime [61], where data repetition or augmentation might be useful to alleviate data scarcity.

• Task-level predictability. Existing research on scaling laws is mostly conducted in terms of language modeling loss (e.g., per-token cross-entropy loss in nats [30]), while in practice we are more concerned about the performance of LLMs on actual tasks. Thus, a basic problem is how the decrease of language modeling loss translates into the improvement of task performance [58]. Intuitively, a model with a smaller language modeling loss tends to yield a better performance on downstream tasks, since language modeling loss can be considered as a general measure of the overall model capacity. GPT-4 [46] has reported that some capabilities (e.g., coding ability) can be accurately predicted via scaling law. Despite that, readers should be aware that a direct decrease in language modeling loss does not always indicate an improvement of model performance on downstream tasks. In particular, the phenomenon of inverse scaling would occur for some tasks, where task performance surprisingly becomes worse as the language modeling loss decreases [62]. Overall, it is more difficult to explore and characterize task-level scaling laws, since it might be also dependent on task-related information (task metric, task difficulty, etc.). Furthermore, some capacities (e.g., in-context learning [55]) are unpredictable according to the scaling law, which can be observed only when the model size exceeds a certain level (as discussed below).

Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as “the abilities that are not present in small models but arise in large models”, which is one of the most prominent features that distinguish LLMs from previous PLMs. It further introduces a notable characteristic when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 63]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 64], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability⁸.

• In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of input text, without requiring additional training or gradient update⁹. Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on the arithmetic tasks (e.g., the 3-digit addition and subtraction) for the 13B GPT-3, but 175B GPT-3 cannot even work well on the Persian QA task [31].

• Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [69] found that a model size of 62B is at least required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

• Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve

7. https://en.wikipedia.org/wiki/Diminishing_returns
8. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Also, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B. It is unclear about the model performance of the untested sizes.
9. A recent study [65] also shows that in-context learning implicitly performs meta-optimization through the attention mechanism.
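
The predictable-scaling idea discussed above (estimating a larger model's loss from smaller models) can be sketched as a power-law fit in log-log space. This is a minimal illustration under stated assumptions: the "measurements" are synthetic values generated from the KM form of Equation (1), the fit ignores any irreducible-loss term, and the function names are hypothetical.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(c) - alpha * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (c, alpha) in loss = c * size ** -alpha

def predict_loss(c, alpha, size):
    """Extrapolate the fitted power law to a new model size."""
    return c * size ** -alpha

# Synthetic small-model losses generated from a known power law (not real data).
small_sizes = [1e6, 1e7, 1e8, 1e9]
small_losses = [(8.8e13 / n) ** 0.076 for n in small_sizes]

c, alpha = fit_power_law(small_sizes, small_losses)
predicted_large = predict_loss(c, alpha, 1e11)  # extrapolate two orders of magnitude
```

On real training runs the points would not lie exactly on a line, and a large gap between the extrapolated and observed loss is the kind of abnormal signal the text suggests monitoring for.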
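
The in-context learning setup described above (an instruction plus several demonstrations, completed by the model without any gradient update) amounts to building a single text prompt. The sketch below only constructs that prompt; the `Input:`/`Output:` template and the function name are assumptions for illustration, and no model is called.

```python
def build_icl_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, k demonstrations, and the test input into one prompt."""
    parts = [instruction.strip(), ""]
    for source, target in demonstrations:
        parts.append(f"Input: {source}")
        parts.append(f"Output: {target}")
        parts.append("")
    # The test instance is left open so the model completes the word sequence.
    parts.append(f"Input: {query}")
    parts.append("Output:")
    return "\n".join(parts)

# Two demonstrations for a 3-digit arithmetic task (made-up examples).
demos = [("123 + 456", "579"), ("300 - 125", "175")]
prompt = build_icl_prompt("Solve the arithmetic problem.", demos, "214 + 385")
```

Passing an empty demonstration list gives a zero-shot prompt that relies on the instruction alone, which is closer to the instruction-following setting.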

multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing the prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over the standard prompting becomes more evident when the model size exceeds 100B. Furthermore, the performance improvement with CoT prompting seems to be also varied for different tasks, e.g., GSM8K > MAWPS > SVAMP for PaLM [33].

How Emergent Abilities Relate to Scaling Laws. In existing literature [30, 31, 34], scaling laws and emergent abilities provide two perspectives to understand the advantage of large models over small models. In general, scaling law (often measured by language modeling loss) describes predictable performance relation with the potential effect of diminishing returns, while emergent abilities (often measured by task performance) are unpredictable but very profitable once such abilities actually emerge. Since the two perspectives reflect different performance trends (continuous improvement vs. sharp performance leap), they might lead to misaligned findings or observations. There are also extensive debates on the rationality of emergent abilities. A popular speculation is that emergent abilities might be partially attributed to the evaluation setting for special tasks (e.g., the discontinuous evaluation metrics) [70, 71]: when evaluation metrics are altered accordingly, the sharpness of the emergent ability curve would disappear. However, the performance of LLMs on most tasks is perceived by users naturally in a discontinuous way. For instance, end users prefer reliable code generated by LLMs that can successfully pass the test case, but are less interested in selecting the better of two failed programs because it has fewer errors. More

every day. It is interesting that young parents are often surprised by unexpected progress of the speaking ability exhibited by their babies.

Key Techniques for LLMs. It has been a long way that LLMs evolve into the current state: general and capable learners. In the development process, a number of important techniques are proposed, which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

• Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Since compute budget is usually limited, scaling laws can be further employed to conduct a more compute-efficient allocation of the compute resources. For example, Chinchilla (with more training tokens) outperforms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget [34]. In addition, data scaling should be accompanied by a careful cleaning process, since the quality of pre-training data plays a key role in the model capacity.

• Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [74] and Megatron-LM [75–77]. Optimization tricks are also important for training stability and model performance, e.g., restart to overcome training loss spike [56] and mixed precision training [78]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance
recently, a study [72] proposes a new evaluation setting of large models with much smaller models.
that can enlarge the resolution of task metrics, making task • Ability eliciting. After being pre-trained on large-scale
performance more predictable. Despite these efforts, more corpora, LLMs are endowed with potential abilities as
fundamental research (e.g., grokking10 ) about the working general-purpose task solvers. These abilities might not be
mechanism of LLMs is still in need to understand the emer- explicitly exhibited when LLMs perform some specific tasks.
gence of certain abilities. The subtle relation between scaling As the technical approach, it is useful to design suitable task
law and emergent abilities can be explained by analogy with instructions or specific in-context learning strategies to elicit
the ability acquisition of human11 . Take the speaking ability such abilities. For instance, chain-of-thought prompting has
as an example. For children, language development (espe- been shown to be useful to solve complex reasoning tasks
cially infants) can be also considered as a multi-level process by including intermediate reasoning steps. Furthermore,
where “emergent abilities” occur. Specially, the language we can perform instruction tuning on LLMs with task
ability would relatively stable within a time interval, but descriptions expressed in natural language, for improving
qualitative change only occurs when evolving into another the generalizability of LLMs on unseen tasks. These eliciting
ability level (e.g., from speaking simple words to speaking techniques mainly correspond to the emergent abilities of
simple sentences). Such a learning process is essentially not LLMs, which may not show the same effect on small lan-
smooth and stable (i.e., language ability does not develop at guage models.
a constant rate over time), though a child actually grows • Alignment tuning. Since LLMs are trained to capture
the data characteristics of pre-training corpora (including
10. Grokking refers that “a pattern in the data, improving generaliza- both high-quality and low-quality data), they are likely to
tion performance from random chance level to perfect generalization”,
quoted from the original paper [73].
generate toxic, biased, or even harmful content for humans.
11. This explanation is only for ease of understanding, and there is It is necessary to align LLMs with human values, e.g., helpful,
not direct evidence to connect the two points. honest, and harmless. For this purpose, InstructGPT [66]

designs an effective tuning approach that enables LLMs to follow the expected instructions, which utilizes the technique of reinforcement learning with human feedback [66, 79]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed with a technique similar to that of InstructGPT, and shows a strong alignment capacity in producing high-quality, harmless responses, e.g., refusing to answer insulting questions.
• Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, and thus perform less well on tasks that are not best expressed in the form of text (e.g., numerical computation). In addition, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [80, 81]. For example, LLMs can utilize a calculator for accurate computation [80] and employ search engines to retrieve unknown information [81]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps; see footnote 12), which are analogous to the “eyes and ears” of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.
In addition, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Currently, we limit our discussion to the major technical approaches and key findings for developing LLMs.

12. https://openai.com/blog/chatgpt-plugins

2.2 Technical Evolution of GPT-series Models
Due to its excellent capacity in communicating with humans, ChatGPT has ignited the excitement of the AI community since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in ChatGPT and GPT models, we add a special discussion about the technical evolution of the GPT-series models, to briefly summarize how they have been developed in the past years. Meanwhile, we drew a schematic diagram depicting the technological evolution of the GPT-series models in Figure 4. The basic principle underlying GPT models is to compress world knowledge into the decoder-only Transformer model by language modeling, such that it can recover (or memorize) the semantics of world knowledge and serve as a general-purpose task solver. Two key points to the success are (I) training decoder-only Transformer language models that can accurately predict the next word and (II) scaling up the size of language models. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages (see footnote 13).

Early Explorations. According to one interview with Ilya Sutskever (a co-founder and chief scientist of OpenAI; see footnote 14), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, when it was attempted with recurrent neural networks (RNN) [121]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [122] and GPT-2 [26], which can be considered the foundation for the subsequent, more powerful models, i.e., GPT-3 and GPT-4.
• GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1 [122], and coined the abbreviation GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pre-training and supervised fine-tuning. GPT-1 set up the core architecture for the GPT-series models and established the underlying principle for modeling natural language text, i.e., predicting the next word.
• GPT-2. Following an architecture similar to that of GPT-1, GPT-2 [26] increased the parameter scale to 1.5B and was trained with a large webpage dataset, WebText. As claimed in the GPT-2 paper, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, the authors introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in [123]), which predicts the output conditioned on the input and task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information. In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text. Further, they introduced a more formal claim for this idea: “Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)” [26] (see footnote 15). A basic understanding of this claim is that each (NLP) task can be considered as a word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable of solving various tasks, if it was trained to have sufficient capacity in recovering the world text. This early discussion in the GPT-2 paper is echoed in the interview of Ilya Sutskever by Jensen Huang: “What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world...the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process...” (see footnote 16).

13. Note that the discussion of this part can be somewhat subjective. The overall viewpoints and summaries are made based on the understanding of the survey authors by reading the papers, blog articles, interview reports and APIs released by OpenAI.
14. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai
15. To better understand this sentence, we put some explanation words in parentheses.
16. https://lifearchitect.ai/ilya/

Capacity Leap. Although GPT-2 is intended to be an “unsupervised multitask learner”, its overall performance is inferior to that of supervised fine-tuning state-of-the-art methods. Because it has a relatively small model size, it has been widely fine-tuned in downstream tasks, especially dialog tasks [124, 125]. Based on GPT-2, GPT-3

TABLE 1: Statistics of large language models (with a size larger than 10B in this survey) in recent years, including the
capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs.
In this table, we only include LLMs with a public paper about the technical details. Here, “Release Time” indicates the
date when the corresponding paper was officially released. “Publicly Available” means that the model checkpoints are
publicly accessible, while “Closed Source” means the opposite. “Adaptation” indicates whether the model has undergone
subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback.
“Evaluation” indicates whether the model has been evaluated for the corresponding abilities in its original paper: ICL
denotes in-context learning and CoT denotes chain-of-thought. “*” denotes the largest publicly available version.

Model | Release Time | Size (B) | Base Model | Adaptation: IT | Adaptation: RLHF | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | Evaluation: ICL | Evaluation: CoT

Publicly Available
T5 [82] | Oct-2019 | 11 | - | - | - | 1T tokens | Apr-2019 | 1024 TPU v3 | - | ✓ | -
mT5 [83] | Oct-2020 | 13 | - | - | - | 1T tokens | - | - | - | ✓ | -
PanGu-α [84] | Apr-2021 | 13* | - | - | - | 1.1TB | - | 2048 Ascend 910 | - | ✓ | -
CPM-2 [85] | Jun-2021 | 198 | - | - | - | 2.6TB | - | - | - | - | -
T0 [28] | Oct-2021 | 11 | T5 | ✓ | - | - | - | 512 TPU v3 | 27 h | ✓ | -
CodeGen [86] | Mar-2022 | 16 | - | - | - | 577B tokens | - | - | - | ✓ | -
GPT-NeoX-20B [87] | Apr-2022 | 20 | - | - | - | 825GB | - | 96 40G A100 | - | ✓ | -
Tk-Instruct [88] | Apr-2022 | 11 | T5 | ✓ | - | - | - | 256 TPU v3 | 4 h | ✓ | -
UL2 [89] | May-2022 | 20 | - | - | - | 1T tokens | Apr-2019 | 512 TPU v4 | - | ✓ | ✓
OPT [90] | May-2022 | 175 | - | - | - | 180B tokens | - | 992 80G A100 | - | ✓ | -
NLLB [91] | Jul-2022 | 54.5 | - | - | - | - | - | - | - | ✓ | -
CodeGeeX [92] | Sep-2022 | 13 | - | - | - | 850B tokens | - | 1536 Ascend 910 | 60 d | ✓ | -
GLM [93] | Oct-2022 | 130 | - | - | - | 400B tokens | - | 768 40G A100 | 60 d | ✓ | -
Flan-T5 [69] | Oct-2022 | 11 | T5 | ✓ | - | - | - | - | - | ✓ | ✓
BLOOM [78] | Nov-2022 | 176 | - | - | - | 366B tokens | - | 384 80G A100 | 105 d | ✓ | -
mT0 [94] | Nov-2022 | 13 | mT5 | ✓ | - | - | - | - | - | ✓ | -
Galactica [35] | Nov-2022 | 120 | - | - | - | 106B tokens | - | - | - | ✓ | ✓
BLOOMZ [94] | Nov-2022 | 176 | BLOOM | ✓ | - | - | - | - | - | ✓ | -
OPT-IML [95] | Dec-2022 | 175 | OPT | ✓ | - | - | - | 128 40G A100 | - | ✓ | ✓
LLaMA [57] | Feb-2023 | 65 | - | - | - | 1.4T tokens | - | 2048 80G A100 | 21 d | ✓ | -
Pythia [96] | Apr-2023 | 12 | - | - | - | 300B tokens | - | 256 40G A100 | - | ✓ | -
CodeGen2 [97] | May-2023 | 16 | - | - | - | 400B tokens | - | - | - | ✓ | -
StarCoder [98] | May-2023 | 15.5 | - | - | - | 1T tokens | - | 512 40G A100 | - | ✓ | ✓
LLaMA2 [99] | Jul-2023 | 70 | - | ✓ | ✓ | 2T tokens | - | 2000 80G A100 | - | ✓ | -
Baichuan2 [100] | Sep-2023 | 13 | - | ✓ | ✓ | 2.6T tokens | - | 1024 A800 | - | ✓ | -
QWEN [101] | Sep-2023 | 14 | - | ✓ | ✓ | 3T tokens | - | - | - | ✓ | -
FLM [102] | Sep-2023 | 101 | - | ✓ | - | 311B tokens | - | 192 A800 | 22 d | ✓ | -
Skywork [103] | Oct-2023 | 13 | - | - | - | 3.2T tokens | - | 512 80G A800 | - | ✓ | -

Closed Source
GPT-3 [55] | May-2020 | 175 | - | - | - | 300B tokens | - | - | - | ✓ | -
GShard [104] | Jun-2020 | 600 | - | - | - | 1T tokens | - | 2048 TPU v3 | 4 d | - | -
Codex [105] | Jul-2021 | 12 | GPT-3 | - | - | 100B tokens | May-2020 | - | - | ✓ | -
ERNIE 3.0 [106] | Jul-2021 | 10 | - | - | - | 375B tokens | - | 384 V100 | - | ✓ | -
Jurassic-1 [107] | Aug-2021 | 178 | - | - | - | 300B tokens | - | 800 GPU | - | ✓ | -
HyperCLOVA [108] | Sep-2021 | 82 | - | - | - | 300B tokens | - | 1024 A100 | 13.4 d | ✓ | -
FLAN [67] | Sep-2021 | 137 | LaMDA-PT | ✓ | - | - | - | 128 TPU v3 | 60 h | ✓ | -
Yuan 1.0 [109] | Oct-2021 | 245 | - | - | - | 180B tokens | - | 2128 GPU | - | ✓ | -
Anthropic [110] | Dec-2021 | 52 | - | - | - | 400B tokens | - | - | - | ✓ | -
WebGPT [81] | Dec-2021 | 175 | GPT-3 | - | ✓ | - | - | - | - | ✓ | -
Gopher [64] | Dec-2021 | 280 | - | - | - | 300B tokens | - | 4096 TPU v3 | 920 h | ✓ | -
ERNIE 3.0 Titan [111] | Dec-2021 | 260 | - | - | - | - | - | - | - | ✓ | -
GLaM [112] | Dec-2021 | 1200 | - | - | - | 280B tokens | - | 1024 TPU v4 | 574 h | ✓ | -
LaMDA [68] | Jan-2022 | 137 | - | - | - | 768B tokens | - | 1024 TPU v3 | 57.7 d | - | -
MT-NLG [113] | Jan-2022 | 530 | - | - | - | 270B tokens | - | 4480 80G A100 | - | ✓ | -
AlphaCode [114] | Feb-2022 | 41 | - | - | - | 967B tokens | Jul-2021 | - | - | - | -
InstructGPT [66] | Mar-2022 | 175 | GPT-3 | ✓ | ✓ | - | - | - | - | ✓ | -
Chinchilla [34] | Mar-2022 | 70 | - | - | - | 1.4T tokens | - | - | - | ✓ | -
PaLM [56] | Apr-2022 | 540 | - | - | - | 780B tokens | - | 6144 TPU v4 | - | ✓ | ✓
AlexaTM [115] | Aug-2022 | 20 | - | - | - | 1.3T tokens | - | 128 A100 | 120 d | ✓ | ✓
Sparrow [116] | Sep-2022 | 70 | - | - | ✓ | - | - | 64 TPU v3 | - | ✓ | -
WeLM [117] | Sep-2022 | 10 | - | - | - | 300B tokens | - | 128 A100 40G | 24 d | ✓ | -
U-PaLM [118] | Oct-2022 | 540 | PaLM | - | - | - | - | 512 TPU v4 | 5 d | ✓ | ✓
Flan-PaLM [69] | Oct-2022 | 540 | PaLM | ✓ | - | - | - | 512 TPU v4 | 37 h | ✓ | ✓
Flan-U-PaLM [69] | Oct-2022 | 540 | U-PaLM | ✓ | - | - | - | - | - | ✓ | ✓
GPT-4 [46] | Mar-2023 | - | - | ✓ | ✓ | - | - | - | - | ✓ | ✓
PanGu-Σ [119] | Mar-2023 | 1085 | PanGu-α | - | - | 329B tokens | - | 512 Ascend 910 | 100 d | ✓ | -
PaLM2 [120] | May-2023 | 16 | - | ✓ | - | 100B tokens | - | - | - | ✓ | ✓

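The sizes and token counts in Table 1 can be used to illustrate the compute-allocation point made in the scaling discussion (Chinchilla versus Gopher under a similar budget). The following is a minimal sketch, not the survey's own method: it assumes the commonly used approximation that training compute is about 6 × parameters × tokens, which is not stated in this section, and the helper names `tokens_per_param` and `approx_train_flops` are hypothetical.

```python
# Rough tokens-per-parameter and training-compute estimates for a few Table 1 rows.
# Assumption (not from the survey text): training FLOPs ~= 6 * parameters * tokens.

MODELS = {
    # name: (parameters, pre-training tokens), as listed in Table 1
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA": (65e9, 1.4e12),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Number of training tokens seen per model parameter."""
    return n_tokens / n_params


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute under the assumed 6*N*D rule of thumb."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(
            f"{name:<10} tokens/param = {tokens_per_param(n, d):6.2f}   "
            f"approx. FLOPs = {approx_train_flops(n, d):.2e}"
        )
```

Under this approximation, Gopher and Chinchilla land at a comparable order of compute, while Chinchilla sees roughly 20 tokens per parameter against roughly 1 for Gopher, which matches the qualitative claim that Chinchilla trades model size for data.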
[Figure 3 graphic: timeline of representative LLMs, 2019 to 2026]

Fig. 3: A timeline of representative LLMs released in recent years. Models with publicly available checkpoints are
highlighted in yellow.

[Figure 4 graphic: GPT-1 (2018.06), GPT-2 (2019.02), GPT-3 (2020.05), Codex (2021.07), GPT-3.5 (2022.03), GPT-4 (2023.03), GPT-4 Turbo (2023.09), GPT-4 Turbo with vision (2023.09); GPT-3.5 branch: code-davinci-002 (2022.03), text-davinci-002 (2022.03), text-davinci-003 (2022.09), gpt-3.5-turbo (2023.03); ChatGPT]

Fig. 4: A brief illustration of the technical evolution of GPT-series models. We plot this figure mainly based on the papers,
blog articles and official APIs from OpenAI. Here, solid lines denote that there is explicit evidence (e.g., the official
statement that a new model is developed based on a base model) for the evolution path between two models, while dashed
lines denote a relatively weaker evolution relation.

demonstrates a key capacity leap by scaling the (nearly identical) generative pre-training architecture.
• GPT-3. GPT-3 [55] was released in 2020, and scaled the model parameters to an even larger size of 175B. The GPT-3 paper formally introduced the concept of in-context learning (ICL; see footnote 17), which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand the tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can also be formatted as a text sequence, given the task description and demonstrations. GPT-3 demonstrates excellent performance not only on a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3 paper does not explicitly discuss the emergent abilities of LLMs, we can observe a large performance leap that might transcend the basic scaling law [30], e.g., larger models have significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3 paper [55]). Overall, GPT-3 can be viewed as a remarkable landmark in the journey from PLMs to LLMs. It has empirically proved that scaling neural networks to a significant size can lead to a huge increase in model capacity.

17. GPT-2 essentially used ICL for unsupervised task learning, though it wasn’t called ICL at that time.

Capacity Enhancement. Due to its strong capacities, GPT-3 has been the base model for OpenAI to develop even more capable LLMs. Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., training on code data and alignment with human preference, which are detailed as follows.
• Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in its lack of reasoning ability on complex tasks, e.g., completing code and solving math problems. To enhance this ability, Codex [105] was introduced by OpenAI in July 2021, which was a GPT model fine-tuned on a large corpus of GitHub code. It was demonstrated that Codex can solve very difficult programming problems, and that it also leads to a significant performance improvement in solving math problems [126]. Further, a contrastive approach [127] to training text and code embedding was reported in January 2022, which was

shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search). Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice for improving the model capacity of GPT models, especially the reasoning ability. Furthermore, there is also a speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47], though this still requires more thorough verification.
• Human alignment. The related research on human alignment can be dated back to the year 2017 (or earlier) for OpenAI: a blog article entitled “learning from human preferences” (see footnote 18) was posted on the OpenAI blog, describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans [79] (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 12). Shortly after the release of this RL paper [79], the paper on Proximal Policy Optimization (PPO) [128] was published in July 2017, which has now become the foundational RL algorithm for learning from human preferences [66]. Later in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [79, 128], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work [129] trained a summarization model for optimizing human preferences in a similar way. Based on these prior works, InstructGPT [66] was proposed in January 2022 to improve the GPT-3 model for human alignment, which formally established a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that the wording “instruction tuning” seems to have seldom been used in OpenAI’s papers and documentation, where it is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm [66]). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful for mitigating the generation of harmful or toxic content by LLMs, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article [130], which summarizes three promising directions: “training AI systems to use human feedback, to assist human evaluation and to do alignment research”.

18. https://openai.com/research/learning-from-human-preferences

These enhancement techniques lead to improved GPT-3 models with stronger capacities, which are called GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1).

The Milestones of Language Models. Based on all the exploration efforts, two major milestones have been achieved by OpenAI, namely ChatGPT [131] and GPT-4 [46], which have largely raised the capacity bar of existing AI systems.
• ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4). As the official blog article introduced [131], ChatGPT was trained in a similar way to InstructGPT (called “a sibling model to InstructGPT” in the original post), while specially optimized for dialogue. They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capacities in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism was supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it seems to be the most powerful chatbot in AI history. The launch of ChatGPT has a significant impact on future AI research, and sheds light on the exploration of human-like AI systems.
• GPT-4. As another remarkable step, GPT-4 [46] was released in March 2023, which extended the text input to multimodal signals. Overall, GPT-4 has stronger capacities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks. A recent study [41] investigated the capacities of GPT-4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 can achieve superior performance to prior GPT models. Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to a six-month iterative alignment (with an additional safety reward signal in the RLHF training). In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate the possible issues of LLMs, such as hallucinations, privacy and overreliance. For example, they introduced the mechanism called red teaming [132] to reduce the generation of harmful or toxic content. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training.
• GPT-4V, GPT-4 turbo, and beyond. Based on the work done for GPT-4 [46], OpenAI further released GPT-4V in September 2023, which focused on the safe deployment of the vision capabilities of GPT-4. The GPT-4V system card [133] extensively discusses the assessment and mitigation of risks related to visually augmented inputs. In particular, GPT-4V exhibited strong vision capacities in various application scenarios, showing great potential as a powerful multimodal learning system. More recently, in November 2023, OpenAI released an upgraded generation of the GPT-4 model at DevDay, named GPT-4 Turbo, with a series of technical improvements. GPT-4 Turbo features an improved model capacity (more capable than GPT-4), an extended knowledge source (up to April 2023), a long context window (up to 128k tokens), optimized model performance (cheaper price), and other useful functionality updates (function call, reproducible outputs, etc.). At the same time, the Assistants API was launched to ease the rapid development of agent-like assistants. With this API, developers can easily create goal-oriented assistants within their applications, by leveraging specific instructions, extra knowledge and tool use. Furthermore, multimodal capacities (see, hear, and speak) were also enhanced in this new

release, supported by GPT-4 Turbo with vision, DALL·E 3, closed-source LLMs, such as GPT-4, GPT-4o, and Claude
Text-to-speech (TTS), and Listen to voice samples. These 3.5 Sonnet in various benchmarks (e.g., MMLU, GSM8k,
improvements have greatly extended the capacity scope and and HumanEval). The pre-training of LLaMA (65B version)
enhanced the task performance of GPT models. More impor- involves 2,048 A100-80G GPUs, whereas LLaMA-3.1 (405B
tantly, the application ecosystem will be greatly strength- version) involves more than 16,000 H100 GPUs.
ened with the technology upgrade in improved models, • Mistral. The Mistral series [137, 138] consist of Mis-
APIs, and functionalities. tral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and
Despite the huge progress, there are still limitations with Mixtral (8×7B and 8×22B), which have been widely known
these superior LLMs, e.g., generating hallucinations with for their strong performance on various mainstream bench-
factual errors or potentially risky response within some marks (e.g., MMLU and GSM8k). Mistral NeMo is featured
specific context [46]. More limitations or issues of LLMs will with a long context window of 128K at the parameter scale
be discussed in Section 7. It poses long-standing research of 12B. Although Mistral NeMo is trained with quantization
challenges to develop more capable, safer LLMs. From awareness, it enables FP8 inference without sacrificing per-
the perspective of engineering, OpenAI has adopted an formance. Mistral Large 2 is the largest and most powerful
iterative deployment strategy [134] to develop the models model of the Mistral series, which supports 11 natural
and products by following a five-stage development and languages and more than 80 programming languages. Mix-
deployment life-cycle, which aims to effectively reduce the tral is a kind of sparse Mixture-of-Experts (SMoE) model
potential risks of using the models. In the following, we that activates only part of the parameters during inference,
will dive into the technical details in order to have a specific making it more efficient compared to dense models of the
understanding of how they have been developed. same size.
• Gemma. Gemma [139, 140] is a series of lightweight,
strong, and open models, consisting of Gemma-1 (2B and
3 R ESOURCES OF LLM S 7B) and Gemma-2 (2B, 9B, and 27B). During the pre-training
It is by no means an easy job to develop or reproduce LLMs, stage, Gemma-2 2B, 9B, and 27B versions are trained on
considering the challenging technical issues and huge de- 2T, 8T, and 13T primarily English tokens, respectively. The
mands of computation resources. A feasible way is to learn largest version of Gemma-2 is trained on 6144 TPUv5p
experiences from existing LLMs and reuse publicly avail- chips. Gemma-2 has achieved excellent performance in mul-
able resources for incremental development or experimental tiple benchmarks (e.g., ARC-c, MMLU, and GSM8k).
study. In this section, we briefly summarize the publicly • Qwen. Qwen [141, 142] is an open-source large
available resources for developing LLMs, including model checkpoints (or APIs), corpora and
libraries.

3.1 Publicly Available Model Checkpoints or APIs
Given the huge cost of model pre-training, well-trained model checkpoints are critical to the
study and development of LLMs for the research community. Due to space limitations, we can only
selectively discuss several representative LLMs. In addition, for inference, we can directly
employ public APIs to perform our tasks, without running the model locally. Next, we introduce
the publicly available model checkpoints and APIs.

Publicly Available Model Checkpoints. To assist researchers in selecting a suitable model based
on the resource budget and usage needs, we focus on discussing the model’s parameter size, data
and computational resources required for training, the relevant technologies employed by the
model, and its performance evaluation in downstream tasks. For more details of LLMs, see Table 1.
• LLaMA. The LLaMA series of models has gained immense popularity and widespread attention due
to its openness and effectiveness. From LLaMA [57], LLaMA-2 [99], LLaMA-3 [135] to
LLaMA-3.1 [136], continuous updates have been made and the development is still ongoing. With
increased parameters (the largest version has 405B), more pre-training tokens (15T tokens), and
an extended context window (128K), LLaMA-3.1 has significantly enhanced its capabilities, and it
also integrates additional components that work in synergy with the model, including new
security and safety tools. In evaluation, LLaMA-3.1 (405B version) achieves competitive
performance against prominent

model series consisting of Qwen (ranging from 7B to 72B), Qwen1.5 (ranging from 0.5B to 110B),
Qwen2 (ranging from 0.5B to 72B), and Qwen2.5 (ranging from 0.5B to 72B). Qwen2.5 is the newest
LLM collection of Qwen, which is pre-trained on up to 18T tokens. Compared to Qwen2, Qwen2.5
demonstrates a significant increase in knowledge retention, as well as notable advancements in
coding and mathematical abilities. Qwen2.5 has also shown large improvements in instruction
following, long-text generation (over 8K tokens), and structured data understanding and
generation (e.g., JSON).
• GLM. GLM [143] is a series of LLMs featuring comprehensive capabilities in both English and
Chinese. GLM has been upgraded to its fourth-generation model, GLM-4, which has a parameter
scale of up to 9B and possesses excellent conversational abilities. It has achieved excellent
performance in evaluations from multiple perspectives including semantics, mathematics,
reasoning, code, and knowledge. In addition to the base model GLM-4-9B, the human
preference-aligned model GLM-4-9B-Chat and the long-context conversational model
GLM-4-9B-Chat-1M have also been open-sourced.
• Baichuan. Baichuan is a series of open-source bilingual LLMs and the latest version is
Baichuan-2. Both Baichuan and Baichuan-2 have two available parameter sizes (7B and 13B).
Baichuan supports both Chinese and English, with pre-training data reaching 1.2 trillion tokens.
Furthermore, Baichuan-2 expands its pre-training data to 2.6 trillion tokens. Baichuan-2
surpasses Baichuan in all evaluation benchmarks, demonstrating excellent multilingual
capabilities and showing potential for vertical applications in domains such as law and
healthcare (e.g., JEC-QA [144] and MedQA [145]).
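When choosing among the checkpoint sizes discussed above under a resource budget, a rough first check is whether the weights alone fit in memory. The sketch below is a back-of-the-envelope estimate, assuming the parameters are stored densely at a fixed width (2 bytes for 16-bit formats, 4 bytes for 32-bit). The function name is illustrative, and activations, KV cache, and optimizer states are deliberately ignored, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (in GB, 1 GB = 1e9 bytes) needed to hold model weights only."""
    if num_params <= 0 or bytes_per_param <= 0:
        raise ValueError("num_params and bytes_per_param must be positive")
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    # Parameter counts correspond to model sizes mentioned in the text.
    for name, n in [("7B", 7e9), ("13B", 13e9), ("72B", 72e9), ("405B", 405e9)]:
        print(f"{name}: ~{weight_memory_gb(n):.0f} GB in 16-bit, "
              f"~{weight_memory_gb(n, 4):.0f} GB in 32-bit")
```

Under these assumptions a 7B checkpoint needs roughly 14 GB for 16-bit weights, which is why the smaller sizes of each series are the usual starting point on a single accelerator.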

[Figure 5 graphic: evolutionary graph of models derived from LLaMA. Legend: continue
pre-training, instruction tuning, model inheritance, data inheritance, parameter-efficient
fine-tuning, full parameter fine-tuning. Application areas shown: Math, Finance, Medicine, Law,
Bilingualism, Education, and multimodal models.]

Fig. 5: An evolutionary graph of the research work conducted on LLaMA. Given their sheer number,
we cannot include all the LLaMA variants in this figure, and much excellent work is omitted. To
support incremental updates, we share the source file of this figure, and welcome readers to add
the desired models by submitting pull requests on our GitHub page.

LLaMA Model Family. The collection of LLaMA models [57] was introduced by Meta AI in February
2023, consisting of four sizes (7B, 13B, 30B and 65B). Since its release, LLaMA has attracted
extensive attention from both research and industry communities. LLaMA models have achieved
excellent performance on various open benchmarks and have become the most popular open language
models thus far. A large number of researchers have extended LLaMA models by either instruction
tuning or continual pre-training. In particular, instruction tuning LLaMA has become a major
approach to developing customized or specialized models, due to the relatively low computational
costs. To effectively adapt LLaMA models to non-English languages, it is often necessary to
extend the original vocabulary (trained mainly on an English corpus) or fine-tune the model with
instructions or data in the target language. Among these extended models, Stanford Alpaca [146]
is the first open instruction-following model fine-tuned based on LLaMA (7B). It is trained on
52K instruction-following demonstrations generated via self-instruct [147] using
text-davinci-003. The instruction data, named Alpaca-52K, and training code have been
extensively adopted in subsequent work, such as Alpaca-LoRA [148] (a reproduction of Stanford
Alpaca using LoRA [149]), Koala [150], and BELLE [151]. In addition, Vicuna [152] is another
popular LLaMA variant, trained upon user-shared conversations collected from ShareGPT [153].
Due to the excellent performance and availability of the LLaMA model family, many multimodal
models incorporate them as the base language models, to achieve strong language understanding
and generation abilities. Compared with other variants, Vicuna is preferred in multimodal
language models, which has led to the emergence of a variety of popular models, including
LLaVA [154], MiniGPT-4 [155], InstructBLIP [156], and PandaGPT [157]. The release of LLaMA has
greatly advanced the research progress of LLMs. To summarize the research work conducted on
LLaMA, we present a brief evolutionary graph in Figure 5.

Public API of LLMs. Instead of directly using the model copies, APIs provide a more convenient
way for common users to use LLMs, without the need of running the model locally. As a
representative interface for using LLMs, the APIs for the GPT-series models [46, 55, 66, 105]
have been widely used in both academia and industry (see footnote 19). OpenAI has provided seven
major interfaces to the models in the GPT-3 series: ada, babbage, curie, davinci (the most
powerful version in the GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001. Among
them, the first four interfaces can be further fine-tuned on the host server of OpenAI. In
particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and
GPT-3 (175B) models,
19. https://platform.openai.com/docs/api-reference/introduction

respectively [55]. In addition, there are also two APIs related to Codex [105], called
code-cushman-001 (a powerful and multilingual version of the Codex (12B) [105]) and
code-davinci-002. Further, the GPT-3.5 series includes one base model code-davinci-002 and three
enhanced versions, namely text-davinci-002, text-davinci-003, and gpt-3.5-turbo. As more
powerful alternatives, OpenAI has released the model interfaces for the GPT-4 series, including
gpt-4, gpt-4-32k, gpt-4-1106-preview (i.e., GPT-4 Turbo) and gpt-4-vision-preview (i.e., GPT-4
Turbo with vision, a multimodal model). It is worth noting that OpenAI has been maintaining and
upgrading these model interfaces (gpt-3.5-turbo, gpt-4, gpt-4-32k), so the API name will
actually point to the latest version. Currently, ChatGPT can be powered by either GPT-3.5 or
GPT-4 models. Overall, one can select the suitable model interface based on the specific
application scenarios and response requirements. The detailed usage can be found on their
project websites (see footnote 20).

TABLE 2: Statistics of commonly-used data sources.

Corpora              Size    Source         Latest Update Time
BookCorpus [158]     5GB     Books          Dec-2015
Gutenberg [159]      -       Books          Dec-2021
C4 [82]              800GB   CommonCrawl    Apr-2019
CC-Stories-R [160]   31GB    CommonCrawl    Sep-2019
CC-NEWS [27]         78GB    CommonCrawl    Feb-2019
REALNEWs [161]       120GB   CommonCrawl    Apr-2019
OpenWebText [162]    38GB    Reddit links   Mar-2023
Pushshift.io [163]   2TB     Reddit links   Mar-2023
Wikipedia [164]      21GB    Wikipedia      Mar-2023
BigQuery [165]       -       Codes          Mar-2023
the Pile [166]       800GB   Other          Dec-2020
ROOTS [167]          1.6TB   Other          Jun-2022

3.2 Commonly Used Corpora for Pre-training
In contrast to earlier PLMs, LLMs, which consist of a significantly larger number of parameters,
require a higher volume of training data that covers a broad range of content. For this need,
an increasing number of accessible training datasets have been released for research. In this
section, we will briefly summarize several widely used corpora for training LLMs. Based on their
content types, we categorize these corpora into five groups: web pages, books, Wikipedia, code,
and others.
Web pages. Web pages are a primary data source for training language models.
• CommonCrawl. CommonCrawl [168] is one of the largest open-source web crawling databases,
containing a petabyte-scale data volume, which has been widely used as training data for
existing LLMs. As the whole dataset is very large, existing studies mainly extract subsets of
web pages from it within a specific period or for specific needs (e.g., extracting mathematical
texts). However, due to the widespread existence of noisy and low-quality information in web
page data, it is necessary to perform data preprocessing before usage. One commonly used toolkit
for cleaning CommonCrawl is CCNet [169], which is developed by Facebook and has been used in
processing datasets like RedPajama-Data [170].
• C4. The Colossal Clean Crawled Corpus (C4) includes five variants (see footnote 21), namely en
(806G), en.noclean (6T), realnewslike (36G), webtextlike (17G), and multilingual (38T). The en
version has been utilized for pre-training T5 [82], LaMDA [68], Gopher [64], and UL2 [89]. The
multilingual C4, also called mC4, has been used in mT5 [83].
• RedPajama-Data. RedPajama-Data [170] is a publicly available comprehensive web dataset,
comprising 100 billion documents from Common Crawl. It has been cleaned, filtered, and
deduplicated using the CCNet tool, resulting in approximately 30T tokens, which are available
for download on Hugging Face. RedPajama-Data is a multilingual dataset that includes five
languages: English, French, Spanish, German, and Italian. Additionally, it offers over 40
quality labels, making it feasible to filter or reweight the dataset according to specific
criteria. The dataset is continuously updated and maintained, with all data processing scripts
open-sourced on GitHub for convenient use.
• RefinedWeb. RefinedWeb [171] is a web dataset obtained through rigorous selection and
deduplication based on data from Common Crawl, encompassing all Common Crawl web records from
2008 to June 2023, totaling around 5T tokens. The open-source portion consists of 600B tokens,
with a data size of approximately 500GB. After decompression, it requires 2.8TB of local storage
space and is available for download on Hugging Face. This dataset serves as the primary training
dataset for the open-source large language model Falcon.
• WebText. WebText [26] is a well-known corpus composed of highly upvoted links from Reddit, a
social media platform that enables users to submit links and text posts, but it is not publicly
available. As a surrogate, there is a readily accessible open-source alternative called
OpenWebText [162].
Books & Academic Data. Books and academic data contain a wealth of world knowledge and
linguistic information, serving as a high-quality corpus for model learning.
• Book Data. BookCorpus [158] is a commonly used dataset in previous small-scale models (e.g.,
GPT [122] and GPT-2 [26]), consisting of over 11,000 books covering a wide range of topics and
genres (e.g., novels and biographies). Another large-scale book corpus is Project
Gutenberg [159], consisting of over 70,000 literary books including novels, essays, poetry,
drama, history, science, philosophy, and other types of works in the public domain. It is
currently one of the largest open-source book collections, which is used in training of
MT-NLG [113] and LLaMA [57]. As for Books1 [55] and Books2 [55] used in GPT-3 [55], they are
much larger than BookCorpus but have not been publicly released so far.
• Academic Data. In addition to book data, scientific publication data such as papers is also
important for model pre-training. arXiv Dataset [172] is a corpus of 1.7 million academic
papers, covering a wide range of papers in the fields of physics, mathematics, and computer
science. S2ORC [173] is a corpus that consists of 136M academic

20. https://platform.openai.com/docs/models/overview
21. https://www.tensorflow.org/datasets/catalog/c4

papers collected by Semantic Scholar. It also releases a derivative dataset peS2o [174], which
contains about 42B tokens.
Wikipedia. Wikipedia [164] is an online encyclopedia containing a large volume of high-quality
articles on diverse topics. Most of these articles are composed in an expository style of
writing (with supporting references), covering a wide range of languages and fields. Typically,
the English-only filtered versions of Wikipedia are widely used in most LLMs (e.g., GPT-3 [55],
LaMDA [68], and LLaMA [57]). Wikipedia is available in multiple languages, so it can be used in
multilingual settings.
Code. To collect code data, existing work mainly crawls open-source licensed code from the
Internet. Two major sources are public code repositories under open-source licenses (e.g.,
GitHub) and code-related question-answering platforms (e.g., StackOverflow). Google has publicly
released the BigQuery dataset [165], which includes a substantial number of open-source licensed
code snippets in various programming languages, serving as a representative code dataset.
CodeGen has utilized BIGQUERY [86], a subset of the BigQuery dataset, for training the
multilingual version of CodeGen (CodeGen-Multi). In addition, Hugging Face has collected and
released a code dataset named The Stack [175], covering more than 30 programming languages. The
Stack is continuously updated, and the v1.2 version has expanded to 358 programming languages.
Based on this dataset, BigCode further processed it and released StarCoder [98], which is also
the pre-training data of the model StarCoder.
Mixed Data. In addition to the aforementioned specific types of datasets, different types of
data have been combined to facilitate usage by researchers. The Pile [166] is a large-scale,
diverse, and open-source text dataset consisting of over 800GB of data from multiple sources,
including books, websites, code, scientific papers, and social media platforms. It is
constructed from 22 diverse high-quality subsets. The Pile dataset is widely used in models with
different parameter scales, such as GPT-J (6B) [176], CodeGen (16B) [86], and Megatron-Turing
NLG (530B) [113]. ROOTS [167] is composed of various smaller datasets (1.61 TB of text in total)
and covers 59 different languages (containing natural languages and programming languages),
which has been used for training BLOOM [78]. Another mixture dataset is Dolma [177], which
includes web text from Common Crawl, academic papers from Semantic Scholar, GitHub code, books,
social media from Reddit, and Wikipedia data. Dolma consists of 3T tokens drawn from
approximately 200TB of raw text and has been used to train OLMo [178].
In practice, pre-training LLMs commonly requires a mixture of different data sources (see
Figure 6), instead of a single corpus. Therefore, existing studies commonly mix several
ready-made datasets (e.g., C4, OpenWebText, and the Pile), and then perform further processing
to obtain the pre-training corpus. Furthermore, to train the LLMs that are adaptive to specific
applications, it is also important to extract data from relevant sources (e.g., Wikipedia and
BigQuery) for enriching the corresponding information in pre-training data.

TABLE 3: A detailed list of available collections for instruction tuning.

Categories  Collections             Time      #Examples
Task        Nat. Inst. [179]        Apr-2021  193K
            FLAN [67]               Sep-2021  4.4M
            P3 [180]                Oct-2021  12.1M
            Super Nat. Inst. [88]   Apr-2022  5M
            MVPCorpus [181]         Jun-2022  41M
            xP3 [94]                Nov-2022  81M
            OIG [182]               Mar-2023  43M
Chat        HH-RLHF [183]           Apr-2022  160K
            HC3 [184]               Jan-2023  87K
            ShareGPT [153]          Mar-2023  90K
            Dolly [185]             Apr-2023  15K
            OpenAssistant [186]     Apr-2023  161K
Synthetic   Self-Instruct [147]     Dec-2022  82K
            Alpaca [187]            Mar-2023  52K
            Guanaco [188]           Mar-2023  535K
            Baize [189]             Apr-2023  158K
            BELLE [190]             Apr-2023  1.5M

TABLE 4: A list of available collections for alignment.

Dataset                            Release Time  #Examples
Summarize from Feedback [129]      Sep-2020      193K
SHP [191]                          Oct-2021      385K
WebGPT Comparisons [81]            Dec-2021      19K
Stack Exchange Preferences [192]   Dec-2021      10M
HH-RLHF [183]                      Apr-2022      169K
Sandbox Alignment Data [193]       May-2023      169K
CValues [194]                      Jul-2023      145K
PKU-SafeRLHF [195]                 Oct-2023      330K

3.3 Commonly Used Datasets for Fine-tuning
After pre-training, LLMs require further fine-tuning to enhance the model capacity, which often
involves two major steps, namely instruction tuning (supervised fine-tuning) and alignment
tuning. In this section, we mainly focus on discussing the related available datasets for the
two kinds of tuning approaches, and more algorithm details can be found in Section 5.

3.3.1 Instruction Tuning Datasets
After pre-training, instruction tuning (a.k.a., supervised fine-tuning) is an important method
to enhance or unlock specific abilities of LLMs (e.g., instruction following). In this part, we
introduce several widely used datasets for instruction tuning, and categorize them into three
main types based on the construction method of formatted instruction instances, namely NLP task
datasets, daily chat datasets and synthetic datasets. We show their details in Table 3.
NLP Task Datasets. Datasets of this kind are formatted based on collected NLP task datasets
(e.g., text classification and summarization) with corresponding natural language task
descriptions. In this category, P3 [196] and FLAN [67, 197] are two widely used datasets for
instruction tuning.
• P3 [196] is composed of 170 English NLP datasets and 2,052 English prompt templates, where the
input and output

of each data example have been formatted with specific prompt templates for composing the
training instance.
• FLAN [67] consists of 62 widely used NLP benchmarks in its original version. Recently,
FLAN-v2 [197] has also been proposed, which expands FLAN by mixing additional instruction
datasets, including Muffin [67], NIV2 [88], T0-SF [28], and CoT [198–200]. Muffin contains 62
tasks from the original FLAN and 26 additional tasks, including conversation and code synthesis
tasks. T0-SF is extracted from T0 [28] while ensuring no overlap with Muffin. NIV2 refers to the
Natural-Instructions v2 dataset [88], and CoT [198–200] is a combination of nine reasoning tasks
with corresponding chain-of-thought prompts and outputs.

Daily Chat Datasets. Datasets of this kind are constructed based on real user conversations
where queries are posed by humans and responses are mainly generated by human labelers or LLMs
(e.g., ChatGPT, GPT-4). The conversation types include open-ended generation, question
answering, brainstorming, and chatting. In this category, ShareGPT [153], OpenAssistant [186]
and Dolly [185] are three commonly used datasets for LLM fine-tuning.
• ShareGPT [153] is collected from a data collection platform where users can upload their
conversations with ChatGPT or GPT-4 through the ShareGPT API. Currently, this dataset consists
of approximately 90,000 conversations, including real instructions or inquiries from humans and
responses from ChatGPT.
• OpenAssistant [186] is a multilingual corpus containing 66,497 real-world conversation trees
between humans and an AI assistant. Each conversation tree consists of multiple nodes, and each
node represents the information generated by a role in the dialogue. It spans 35 languages and
includes 461,292 manually annotated quality ratings of responses.
• Dolly [185] is an English dataset comprising 15,000 human-generated data instances
(prompt-response pairs) from Databricks. This dataset covers seven domains outlined in
InstructGPT [66], including brainstorming, classification, closed-book question answering,
generation, information extraction, open-book question answering, and summarization.

Synthetic Datasets. Datasets of this kind are typically constructed by instructing LLMs, based
on pre-defined guidance rules or methods. In this category, Self-Instruct-52K [147],
Alpaca [146] and Baize [189] are three commonly used synthetic datasets for LLMs.
• Self-Instruct-52K [147] is an instruction dataset generated through the self-instruct [147]
method, consisting of 82,000 instances with 52,000 instructions. Concretely, the authors
construct 175 seed instances, and then iteratively prompt the LLM [55] to synthesize additional
instructions based on 8 randomly selected instructions as reference. Subsequently, the LLM is
further instructed to generate instance inputs and their corresponding outputs based on the
synthetic instructions, finally yielding the Self-Instruct-52K dataset.
• Alpaca [146] is also a synthetic dataset based on the self-instruct [147] method. It utilizes
the text-davinci-003 model on the 175 seed instances from Self-Instruct-52K to obtain 52,000 new
instructions and corresponding inputs and outputs. Moreover, 60% of the examples are pure
instructions without the input part in the final dataset.
• Baize [189] is an English multi-turn conversation corpus constructed using ChatGPT, comprising
111.5K instances. To create Baize, a method called “self-chat” [189] is proposed, where ChatGPT
takes on the roles of both the user and the AI assistant in turns, generating information in a
conversational format.

3.3.2 Alignment Datasets
Apart from instruction tuning, it is important to construct high-quality datasets for aligning
LLMs with human values and preferences (e.g., helpfulness, honesty, and harmlessness). In this
section, we introduce several widely used datasets for alignment tuning, including
HH-RLHF [183], SHP [191], PKU-SafeRLHF [195], Stack Exchange Preferences [192] and Sandbox
Alignment Data [193]. We show their details in Table 4.
• HH-RLHF [183] consists of around 169K instances, and can be divided into two parts that focus
on the helpfulness and harmlessness of LLMs, respectively. Each instance is an open-ended
conversation between a crowdworker and a chat model, about seeking assistance, advice, or task
completion. The chat model provides two responses to each user query, and the more helpful or
harmful responses will be chosen as the annotations.
• SHP [191] focuses on the helpfulness of responses. It comprises 385K collective human
preferences over responses to questions/instructions across 18 diverse subject areas, spanning
topics from cooking to legal advice. Each instance is a Reddit post containing a question or
instruction and a pair of top-level comments, one of which is deemed more preferable by Reddit
users and the other less helpful. Different from HH-RLHF [183], the data in SHP consists of
naturally occurring and human-written responses.
• PKU-SafeRLHF [195] encompasses more than 330K instances of expert comparison data,
concentrating on helpfulness and harmlessness. Each instance in the dataset includes a question
and two responses, accompanied by safety labels for each response and two preference
annotations between the two responses according to helpfulness and harmlessness. The
harmlessness of a response indicates its classification as risk-neutral across all 14 harm
categories, while the helpfulness of a response is evaluated based on its effectiveness in
addressing the question.
• Stack Exchange Preferences [192] focuses on the helpfulness of answers. It comprises about 10M
questions and answers from Stack Overflow. Each instance consists of a question and more than
two corresponding answers. Each answer is annotated with a score calculated based on its votes
and a label denoting whether it is selected.
• Sandbox Alignment Data [193] is an alignment dataset containing feedback from LLMs rather than
humans. It comes from a virtual interaction environment called SANDBOX, where the model
simulates social interactions with other models and revises responses according to the feedback
from other models. The dataset contains 169K instances, and each instance consists of a societal
query, several responses, and corresponding ratings from other models.
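Several of the alignment datasets above reduce to pairwise comparisons: HH-RLHF and PKU-SafeRLHF provide two responses per query, while Stack Exchange Preferences attaches a vote-based score to each answer. The sketch below shows one way such scored answers could be turned into (chosen, rejected) pairs for preference-based training. The record layout and the names `PreferencePair` and `pairs_from_scored_answers` are assumptions made for illustration, not the schema of any dataset listed here.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response preferred by the annotation signal
    rejected: str  # response judged worse


def pairs_from_scored_answers(
    question: str, answers: List[Tuple[str, float]]
) -> List[PreferencePair]:
    """Build one pair for every two answers with different scores; ties are skipped."""
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(answers, 2):
        if score_a == score_b:
            continue  # no preference signal when scores tie
        chosen, rejected = (text_a, text_b) if score_a > score_b else (text_b, text_a)
        pairs.append(PreferencePair(question, chosen, rejected))
    return pairs


if __name__ == "__main__":
    # Hypothetical question and vote scores, for demonstration only.
    demo = pairs_from_scored_answers(
        "How do I reverse a list in Python?",
        [("Use reversed(xs).", 12.0), ("Use xs[::-1].", 30.0), ("Write a loop.", 12.0)],
    )
    for p in demo:
        print(f"chosen={p.chosen!r} rejected={p.rejected!r}")
```

Skipping tied scores is a design choice of this sketch: a tie carries no ordering information, so emitting a pair for it would inject an arbitrary label.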

3.4 Library Resource
In this part, we briefly introduce a series of available libraries for developing LLMs.
• Transformers [201] is an open-source Python library for building models using the Transformer
architecture, which is developed and maintained by Hugging Face. It has a simple and
user-friendly API, making it easy to use and customize various pre-trained models. It is a
powerful library with a large and active community of users and developers who regularly update
and improve the models and algorithms.
• DeepSpeed [74] is a deep learning optimization library (compatible with PyTorch) developed by
Microsoft, which has been used to train a number of LLMs, such as MT-NLG [113] and BLOOM [78].
It supports various optimization techniques for distributed training, such as memory
optimization (ZeRO technique, gradient checkpointing), and pipeline parallelism.
• Megatron-LM [75–77] is a deep learning library developed by NVIDIA for training large-scale
language models. It also provides rich optimization techniques for distributed training,
including model and data parallelism, mixed-precision training, and FlashAttention. These
optimization techniques can largely improve the training efficiency and speed, enabling
efficient distributed training across GPUs.
• JAX [202] is a Python library for high-performance machine learning algorithms developed by
Google, allowing users to easily perform computations on arrays with hardware acceleration
(e.g., GPU or TPU). It enables efficient computation on various devices and also supports
several featured functions, such as automatic differentiation and just-in-time compilation.
• Colossal-AI [203] is a deep learning library developed by HPC-AI Tech for training large-scale
AI models. It is implemented based on PyTorch and supports a rich collection of parallel
training strategies. Furthermore, it can also optimize heterogeneous memory management with
methods proposed by PatrickStar [204]. Recently, a ChatGPT-like model called ColossalChat [205]
has been publicly released with two versions (7B and 13B), which are developed using
Colossal-AI based on LLaMA [57].
• BMTrain [206] is an efficient library developed by OpenBMB for training models with
large-scale parameters in a distributed manner, which emphasizes code simplicity, low resource
usage, and high availability. BMTrain has already incorporated several common LLMs (e.g.,
Flan-T5 [69] and GLM [93]) into its ModelCenter, where developers can use these models directly.
• FastMoE [207] is a specialized training library for MoE (i.e., mixture-of-experts) models. It
is developed based on PyTorch, prioritizing both efficiency and user-friendliness in its design.
FastMoE simplifies the process of transferring Transformer models to MoE models and supports
both data parallelism and model parallelism during training.
• vLLM [208] is a fast, memory-efficient, and easy-to-use library for LLM inference and serving.
To enable fast inference, it is specially optimized with high serving throughput, effective
attention memory management using PagedAttention [208], continuous batching, and optimized CUDA
kernels. Furthermore, vLLM also supports various decoding algorithms, tensor parallelism and
streaming outputs. To ease the integration with other systems, vLLM is friendly to the use of
HuggingFace models, and also provides OpenAI-compatible API servers.
• DeepSpeed-MII [209] is also a memory-efficient Python library developed by DeepSpeed [74]. It
aims to democratize LLM inference by prioritizing high throughput, low latency, and
cost-effectiveness. DeepSpeed-MII achieves accelerated text generation inference by leveraging
four essential technologies: blocked KV caching, continuous batching, dynamic SplitFuse, and
high-performance CUDA kernels. It currently supports over 13,000 models across three popular
model architectures, such as LLaMA [57], Mistral [137], and OPT [90].
• DeepSpeed-Chat [210] is a fast, cost-effective, and easy-to-use system framework that enables
the integration of the complete RLHF process during model training. It features three major
functionalities: (1) it simplifies the training and inference process for ChatGPT-like models,
enabling a simple script to implement multiple training or inference steps; (2) it replicates
the training mode of InstructGPT [66] and provides a complete pipeline for three training steps
(i.e., SFT, reward model fine-tuning, and RLHF); (3) it integrates the training engine and
inference engine of DeepSpeed into a unified hybrid engine (DeepSpeed HE) for RLHF training,
which enables seamless switching between training and inference modes and leverages various
optimizations from DeepSpeed Inference.
In addition to the above library resources, existing deep learning frameworks (e.g.,
PyTorch [211], TensorFlow [212], MXNet [213], PaddlePaddle [214], MindSpore [215] and
OneFlow [216]) have also provided support for parallel algorithms, which are commonly used for
training large-scale models.

4 PRE-TRAINING
Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale
corpora, LLMs can acquire essential language understanding and generation skills [55, 56]. In
this process, the scale and quality of the pre-training corpus are critical for LLMs to attain
powerful capabilities. Furthermore, to effectively pre-train LLMs, model architectures,
acceleration methods, and optimization techniques need to be well designed. In what follows, we
first discuss the data collection and processing in Section 4.1, then introduce the commonly
used model architectures in Section 4.2, and finally present the training techniques to stably
and efficiently optimize LLMs in Section 4.3.

4.1 Data Collection and Preparation
Compared with small-scale language models, LLMs have a stronger demand for high-quality data for
model pre-training, and their model capacities largely rely on the pre-training corpus and how
it has been preprocessed. In this part, we discuss the collection and processing of pre-training
data, including data sources, preprocessing methods, and important analysis of how pre-training
data affects the performance of LLMs.

[Figure 6 graphic: pie charts of the pre-training data composition of T5 (11B), Falcon (40B),
LLaMA (65B), GPT-3 (175B), MT-NLG (530B), Gopher (280B), Chinchilla (70B), Yi (34B),
PaLM (540B), LaMDA (137B), Galactica (120B), GPT-NeoX (20B), CodeGen (16B), and
StarCoder 2 (15B). The per-source percentages are not recoverable from the extracted text.]

Legend (example corpora for each source type):
💻 C4 (800G, 2019), 💻 OpenWebText (38G, 2023), 💻 Wikipedia (21G, 2023)
💬 the Pile - StackExchange (41G, 2020)
📚 BookCorpus (5G, 2015), 📚 Gutenberg (-, 2021), 📚 CC-Stories-R (31G, 2019), 📰 CC-NEWS (78G, 2019), 📰 REALNEWs (120G, 2019)
🔬 the Pile - ArXiv (72G, 2020), 🔬 the Pile - PubMed Abstracts (25G, 2020)
⌨ BigQuery (-, 2023), the Pile - GitHub (61G, 2020)

Fig. 6: Ratios of various data sources in the pre-training data for existing LLMs.
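
The ratios in Fig. 6 can be read as sampling weights over data sources. The sketch below is a minimal illustration of drawing training documents according to such a mixture. The source names and weights are made-up placeholders rather than values read from the figure, and `normalize_mixture` and `sample_sources` are hypothetical helper names.

```python
import random
from collections import Counter
from typing import Dict, List


def normalize_mixture(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative source weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0 or any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative with a positive sum")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights: Dict[str, float], n: int, seed: int = 0) -> List[str]:
    """Draw n source names, each with probability proportional to its weight."""
    probs = normalize_mixture(weights)
    rng = random.Random(seed)  # local RNG keeps the sketch reproducible
    return rng.choices(list(probs), weights=list(probs.values()), k=n)


if __name__ == "__main__":
    # Placeholder mixture (percentages), NOT taken from any model in the figure.
    mixture = {"webpages": 60, "books": 15, "code": 15, "scientific": 10}
    counts = Counter(sample_sources(mixture, n=10_000))
    for name, share in normalize_mixture(mixture).items():
        print(f"{name}: target {share:.2f}, sampled {counts[name] / 10_000:.2f}")
```

Sampling by source, and then picking a document within the chosen source, lets a small but valuable corpus be upweighted or a very large one downweighted without physically duplicating or truncating the data.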

4.1.1 Data Source
To develop a capable LLM, it is key to collect a large amount of natural language corpus from various data sources. Existing LLMs mainly leverage a mixture of diverse public textual datasets as the pre-training corpus. Figure 6 shows the distribution of the sources of pre-training data for a number of representative LLMs.

The source of pre-training corpus can be broadly categorized into two types: general data and specialized data. General data, such as webpages, books, and conversational text, is utilized by most LLMs [55, 56, 90] due to its large, diverse, and accessible nature, which can enhance the language modeling and generalization abilities of LLMs. In light of the impressive generalization capabilities exhibited by LLMs, there are also studies that extend their pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 86]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs. For a detailed introduction to the commonly used corpus, one can refer to Section 3.2.

General Text Data. As we can see in Figure 6, the vast majority of LLMs adopt general-purpose pre-training data, such as webpages, books, and conversational text, which provides rich text sources on a variety of topics. Next, we briefly summarize three important kinds of general data.
• Webpages. Owing to the proliferation of the Internet, various types of data have been created, which enables LLMs to gain diverse linguistic knowledge and enhance their generalization capabilities [26, 82]. For convenient use of these data resources, a large amount of data is crawled from the web in previous work, such as CommonCrawl [168]. However, the crawled web data tends to contain both high-quality text, such as Wikipedia, and low-quality text, like spam mail; thus, it is important to filter and process webpages to improve the data quality.
• Conversation text. Conversation data can enhance the conversational competence of LLMs [90] and potentially improve their performance on a range of question-answering tasks [56]. Researchers can utilize subsets of public conversation corpus (e.g., PushShift.io Reddit corpus) [163, 217] or collect conversation data from online social media. Since online conversational data often involves discussions among multiple participants, an effective way to process it is to transform a conversation into a tree structure, where each utterance is linked to the one it responds to. In this way, the multi-party conversation tree can be divided into multiple sub-conversations, which can be collected in the pre-training corpus. Furthermore, a potential risk is that the excessive integration of dialogue data into LLMs may result in a side effect [90]: declarative instructions and direct interrogatives are erroneously perceived as the beginning of conversations, thus leading to a decline in the efficacy of the instructions.
• Books. Compared to other corpora, books provide an important source of formal long texts, which are potentially beneficial for LLMs to learn linguistic knowledge, model long-term dependency, and generate narrative and coherent texts. To obtain open-source book data, existing studies usually adopt the Books3 and BookCorpus2 datasets, which are available in the Pile dataset [166].

Specialized Text Data. Specialized datasets are useful to improve the specific capabilities of LLMs on downstream tasks. Next, we introduce three kinds of specialized data.
• Multilingual text. In addition to the text in the target language, integrating a multilingual corpus can enhance the multilingual abilities of language understanding and generation. For example, BLOOM [78] and PaLM [56] have curated multilingual data covering 46 and 122 languages, respectively, within their pre-training corpora. FLM [102] mixes Chinese and English corpora in nearly equal proportions. These models demonstrate impressive performance in multilingual tasks, such as translation, multilingual summarization, and multilingual question answering, and achieve comparable or superior performance to the state-of-the-art models that are fine-tuned on the corpus in the target language(s).
• Scientific text. The exploration of science by humans has been witnessed by the increasing growth of scientific publications. In order to enhance the understanding of scientific knowledge for LLMs [35, 218], it is useful to incorporate a scientific corpus for model pre-training [35, 218]. By pre-training on a vast amount of scientific text, LLMs can achieve impressive performance in scientific and reasoning tasks [219]. To construct the scientific corpus, existing efforts mainly collect arXiv papers, scientific textbooks, math webpages, and other related scientific resources. Due to the complex nature of data in scientific fields, such as mathematical symbols and protein sequences, specific tokenization and preprocessing techniques are usually required to transform these different formats of data into a unified form that can be processed by language models.
• Code. Program synthesis has been widely studied in the research community [105, 220–223], especially the use of PLMs trained on code [176, 224]. However, it remains challenging for these PLMs (e.g., GPT-J [176]) to generate high-quality and accurate programs. Recent studies [105, 223] have found that training LLMs on a vast code corpus can lead to a substantial improvement in the quality of the synthesized programs. The generated programs can successfully pass expert-designed unit-test cases [105] or solve competitive programming questions [114]. In general, two types of code corpora are commonly used for pre-training LLMs. The first source is from programming question answering communities like Stack Exchange [225]. The second source is from public software repositories such as GitHub [86, 105, 223], where code data (including comments and docstrings) are collected for utilization. Compared to natural language text, code is in the format of a programming language, corresponding to long-range dependencies and accurate execution logic [226]. A recent study [47] also speculates that training on code might be a source of complex reasoning abilities (e.g., chain-of-thought ability [33]). Furthermore, it has been shown that formatting reasoning tasks into code can help LLMs generate more accurate results [226].

4.1.2 Data Preprocessing
After collecting a large amount of text data, it is essential to preprocess the data for constructing the pre-training corpus, especially removing noisy, redundant, irrelevant, and potentially toxic data [56, 64, 227], which may largely affect the capacity and performance of LLMs. To facilitate the data processing, a recent study [228] proposes a useful data processing system for LLMs, named Data-Juicer, which provides over 50 processing operators and tools. In this part, we review the detailed data preprocessing strategies to improve the quality of the collected data [64, 78, 112]. A typical pipeline of preprocessing the pre-training data for LLMs has been illustrated in Figure 7.

Filtering and Selection. To remove low-quality data from the collected corpus, existing work generally adopts two approaches, namely classifier-based and heuristic-based. The former approach trains a selection classifier based on high-quality texts and leverages it to identify and filter out low-quality data. Typically, these methods train a binary classifier using positive instances that are: well-curated data (e.g., Wikipedia pages) [55, 56, 112], high-quality synthesized data [135, 229–231], or a combination of both. They sample candidate data as negative instances and predict the score that measures the quality of each data example. However, several studies [64, 112] find that a classifier-based approach may result in the unintentional removal of high-quality texts in dialectal, colloquial, and sociolectal languages, which potentially leads to bias in the pre-training corpus and diminishes the corpus diversity. As the second approach, several studies, such as BLOOM [78] and Gopher [64], employ heuristic-based approaches to eliminate low-quality texts through a set of well-designed rules, which can be summarized as follows:
• Language based filtering. If an LLM is mainly used in the tasks of certain languages, the text in other languages can be filtered.
• Metric based filtering. Evaluation metrics about the generated texts, e.g., perplexity, can be employed to detect and remove unnatural sentences.
• Statistic based filtering. Statistical features of a corpus, e.g., the punctuation distribution, symbol-to-word ratio, and sentence length, can be utilized to measure the text quality and filter the low-quality data.
• Keyword based filtering. Based on a specific keyword set, the noisy or useless elements in the text, such as HTML tags, hyperlinks, boilerplates, and offensive words, can be identified and removed.
In addition to the above methods, LLMs (especially relatively small models) can also be employed for data selection, either by computing perplexity [232] or directly prompting LLMs [233] for measuring the sample importance. However, using LLMs is unavoidably computationally intensive for large-scale data selection.

De-duplication. Existing work [234] has found that duplicate data in a corpus would reduce the diversity of language models, which may cause the training process to become unstable and thus affect the model performance. Therefore, it is necessary to de-duplicate the pre-training corpus. Specifically, de-duplication can be performed at different granularities, including sentence-level, document-level, and dataset-level de-duplication. First, low-quality sentences that contain repeated words and phrases should be removed, as they may introduce repetitive patterns in language modeling [235]. At the document level, existing studies mostly rely on the overlap ratio of surface features (e.g., word and n-gram overlap) between documents to detect and remove duplicate documents containing similar contents [57, 64, 78, 236]. Furthermore, to avoid the dataset contamination problem, it is also crucial to prevent the overlap between the training and evaluation sets [56], by removing the possible duplicate texts from the training set. It has been shown that the three levels of de-duplication are useful to improve the training of LLMs [56, 237], which should be jointly used in practice.

Privacy Reduction. Because pre-training text collected from the web can contain sensitive or personal information, it is necessary to remove the personally identifiable information (PII) from the pre-training corpus. One direct and effective approach is to employ
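The heuristic rules summarized in Section 4.1.2 (statistic-based and keyword-based filtering in particular) can be illustrated with a short Python sketch. The thresholds, the blocked-keyword set, and the function names are assumptions chosen for illustration; they are not values reported by BLOOM, Gopher, or this survey.

```python
import re

# Hypothetical thresholds and keyword list, chosen only for illustration.
MAX_SYMBOL_TO_WORD_RATIO = 0.5
MIN_WORDS, MAX_WORDS = 3, 10_000
BLOCKED_KEYWORDS = {"lorem ipsum", "click here to subscribe"}

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")


def strip_markup(text: str) -> str:
    """Keyword/pattern-based cleaning: drop HTML tags, then hyperlinks."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    return " ".join(text.split())


def symbol_to_word_ratio(text: str) -> float:
    """Statistic-based feature: non-alphanumeric symbols per word."""
    words = text.split()
    if not words:
        return float("inf")
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(words)


def keep_document(text: str) -> bool:
    """Return True if the cleaned text passes every heuristic rule."""
    cleaned = strip_markup(text)
    n_words = len(cleaned.split())
    if not (MIN_WORDS <= n_words <= MAX_WORDS):
        return False
    if symbol_to_word_ratio(cleaned) > MAX_SYMBOL_TO_WORD_RATIO:
        return False
    lowered = cleaned.lower()
    return not any(keyword in lowered for keyword in BLOCKED_KEYWORDS)
```

Language-based and metric-based filtering would need a language identifier and a scoring language model, respectively, so they are left out of this sketch.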
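Document-level de-duplication by surface overlap, as described in Section 4.1.2, can be sketched with word n-gram sets and their Jaccard overlap ratio. The n-gram size, the 0.8 threshold, and the greedy keep-first policy are assumptions made for this example rather than settings taken from the cited systems.

```python
def word_ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase word n-grams in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Overlap ratio |A ∩ B| / |A ∪ B| of two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, n: int = 3, threshold: float = 0.8):
    """Greedily keep a document only if it is not a near-duplicate of a kept one."""
    kept, kept_ngrams = [], []
    for doc in docs:
        grams = word_ngrams(doc, n)
        if all(jaccard(grams, other) < threshold for other in kept_ngrams):
            kept.append(doc)
            kept_ngrams.append(grams)
    return kept
```

The pairwise comparison is quadratic in the number of documents, so this form only suits small collections; it is meant to show the overlap criterion, not a scalable implementation.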

[Figure 7 stages: Raw Corpus → Filtering & Selection → De-duplication → Privacy Reduction → Tokenization → Ready to pre-train!]
Filtering & Selection: Language Filtering, Metric Filtering, Statistic Filtering, Keyword Filtering
De-duplication: Sentence-level, Document-level, Set-level
Privacy Reduction: Detect Personally Identifiable Information (PII), Remove PII
Tokenization: Reuse Existing Tokenizer, SentencePiece, Byte-level BPE
Example: "Alice is writing a paper about LLMs. #$^& Alice is writing a paper about LLMs." → "Alice is writing a paper about LLMs. Alice is writing a paper about LLMs." → "Alice is writing a paper about LLMs." → Replace('Alice') is writing a paper about LLMs. → Encode('[Somebody] is writing a paper about LLMs.') → 32, 145, 66, 79, 12, 56, ...

Fig. 7: An illustration of a typical data preprocessing pipeline for pre-training large language models.
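The running example in Fig. 7 can be traced in a few lines of Python. The sketch below is an illustration under simplifying assumptions: the noise filter is a symbol-only-token rule, the PII detector is a fixed list of known names (`KNOWN_NAMES` is hypothetical), and the last step emits raw UTF-8 byte values rather than the vocabulary IDs shown in the figure.

```python
import re
from typing import List

KNOWN_NAMES = ["Alice"]  # hypothetical PII keyword list


def filter_noise(text: str) -> str:
    """Drop tokens that contain no letters or digits (e.g. '#$^&')."""
    kept = [tok for tok in text.split() if any(ch.isalnum() for ch in tok)]
    return " ".join(kept)


def dedup_sentences(text: str) -> str:
    """Sentence-level de-duplication that keeps the first occurrence."""
    sentences = re.findall(r"[^.!?]+[.!?]", text)
    seen, unique = set(), []
    for sentence in (s.strip() for s in sentences):
        if sentence not in seen:
            seen.add(sentence)
            unique.append(sentence)
    return " ".join(unique)


def mask_pii(text: str, placeholder: str = "[Somebody]") -> str:
    """Rule-based PII removal via keyword spotting on known names."""
    for name in KNOWN_NAMES:
        text = re.sub(rf"\b{re.escape(name)}\b", placeholder, text)
    return text


def encode_bytes(text: str) -> List[int]:
    """Stand-in tokenizer: one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


raw = "Alice is writing a paper about LLMs. #$^& Alice is writing a paper about LLMs."
filtered = filter_noise(raw)
deduped = dedup_sentences(filtered)
masked = mask_pii(deduped)
token_ids = encode_bytes(masked)
```

Each intermediate string matches the corresponding stage of the figure; only the final IDs differ, because a trained tokenizer would map subwords to vocabulary indices instead of bytes.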

rule-based methods, such as keyword spotting, to detect and remove PII such as names, addresses, and phone numbers [167]. Furthermore, researchers also find that the vulnerability of LLMs under privacy attacks can be attributed to the presence of duplicate PII data in the pre-training corpus [238]. Therefore, de-duplication can also reduce privacy risks to some extent.

Tokenization. Tokenization is also a crucial step for data preprocessing. It aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. In traditional NLP research (e.g., sequence labeling with conditional random fields [239]), word-based tokenization is the predominant approach, which is more aligned with human language cognition. However, word-based tokenization can yield different segmentation results for the same input in some languages (e.g., Chinese word segmentation), generate a huge word vocabulary containing many low-frequency words, and also suffer from the “out-of-vocabulary” issue. Thus, several neural network models employ characters as the minimum unit to derive the word representation (e.g., a CNN word encoder in ELMo [21]). Recently, subword tokenizers have been widely used in Transformer-based language models, typically including Byte-Pair Encoding tokenization, WordPiece tokenization and Unigram tokenization. HuggingFace has maintained an excellent online NLP course on tokenizers (footnote 22) with running examples, and we refer beginners to this course. Next, we briefly describe the three representative tokenization methods.

22. https://huggingface.co/learn/nlp-course/chapter6

• Byte-Pair Encoding (BPE) tokenization. BPE was originally proposed as a general data compression algorithm in 1994 [240], and then adapted to NLP for tokenization [241]. It starts with a set of basic symbols (e.g., the alphabets and boundary characters), and iteratively combines frequent pairs of two consecutive tokens in the corpus as new tokens (called merge). For each merge, the selection criterion is based on the co-occurrence frequency of two contiguous tokens: the most frequent pair is selected. The merge process continues until the vocabulary reaches the predefined size. Further, Byte-level BPE has been used to improve the tokenization quality for multilingual corpus (e.g., the text containing non-ASCII characters) by considering bytes as the basic symbols for merge. Representative language models with this tokenization approach include GPT-2, BART, and LLaMA.
• WordPiece tokenization. WordPiece was a Google internal subword tokenization algorithm. It was originally proposed by Google in developing voice search systems [242]. Then, it was used in the neural machine translation system in 2016 [243], and was adopted as the word tokenizer for BERT in 2018 [23]. WordPiece is very similar to BPE in that it iteratively merges consecutive tokens, but it uses a slightly different selection criterion for the merge. To conduct the merge, it first trains a language model and employs it to score all possible pairs. Then, at each merge, it selects the pair that leads to the largest increase in the likelihood of training data. Since Google has not released the official implementation of the WordPiece algorithm, HuggingFace gives a more intuitive selection measure in its online NLP course: a pair is scored by dividing the co-occurrence count by the product of the occurrence counts of two tokens in the pair based on training corpus.
• Unigram tokenization. Unlike BPE and WordPiece, Unigram tokenization [244] starts with a sufficiently large set of possible substrings or subtokens for a corpus, and iteratively removes the tokens in the current vocabulary until the expected vocabulary size is reached. As the selection criterion, it calculates the change in the likelihood of the training corpus that would result from removing a token from the current vocabulary, so that tokens whose removal hurts the likelihood least are pruned first. This step is conducted based on a trained unigram language model. To estimate the unigram language model, it adopts an expectation–maximization (EM) algorithm: at each iteration, we first find the currently optimal tokenization of words based on the old language model, and then re-estimate the probabilities of unigrams to update the language model. During this procedure, dynamic programming algorithms (i.e., the Viterbi algorithm) are used to efficiently find the optimal decomposition of a word given the language model. Representative models that adopt this tokenization approach include T5 and mBART.

Although it is expedient to leverage an existing tokenizer (e.g., OPT [90] and GPT-3 [55] utilize the tokenizer of GPT-2 [26]), using a tokenizer specially designed for the pre-training corpus can be highly beneficial [78], especially for the corpus that consists of diverse domains, languages, and formats. Therefore, recent LLMs often train the customized tokenizers specially for the pre-training corpus with the SentencePiece library [245], which includes Byte-level BPE and Unigram tokenization. A note is that normalization techniques in BPE, such as NFKC [246], may degrade the tokenization performance [34, 64, 78]. When extending exist-
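The merge loop that BPE and WordPiece share, and the point where they differ, can be sketched as a toy trainer over a word-frequency dictionary. This is an illustration only: the WordPiece-style score follows the count-ratio measure that the text attributes to the HuggingFace course, the toy omits word-boundary markers and continuation prefixes, and names such as `train_merges` are hypothetical rather than part of any library.

```python
from collections import Counter


def pair_and_symbol_counts(vocab):
    """Count adjacent symbol pairs and single symbols, weighted by word frequency."""
    pairs, symbols = Counter(), Counter()
    for word, freq in vocab.items():
        for sym in word:
            symbols[sym] += freq
        for left, right in zip(word, word[1:]):
            pairs[(left, right)] += freq
    return pairs, symbols


def merge_pair(vocab, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def pair_score(pair, pairs, symbols, criterion):
    """BPE: raw pair frequency. WordPiece-style: pair count / product of symbol counts."""
    if criterion == "bpe":
        return pairs[pair]
    return pairs[pair] / (symbols[pair[0]] * symbols[pair[1]])


def train_merges(word_freqs, num_merges, criterion="bpe"):
    """Learn up to `num_merges` merges from a {word: frequency} dictionary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs, symbols = pair_and_symbol_counts(vocab)
        if not pairs:
            break
        # sorted() makes tie-breaking deterministic (first pair in sorted order wins).
        best = max(sorted(pairs), key=lambda p: pair_score(p, pairs, symbols, criterion))
        merges.append(best)
        vocab = merge_pair(vocab, best)
    return merges
```

On a small corpus such as `{"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}`, the frequency criterion first merges the most common pair, while the ratio criterion prefers a rarer pair whose parts seldom occur apart.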
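The dynamic-programming step of Unigram tokenization can be made concrete with a small Viterbi segmenter. The vocabulary and its probabilities below are invented for the example; a real tokenizer would estimate them with the EM procedure described in the text.

```python
import math


def viterbi_segment(word, token_probs):
    """Return the most probable segmentation of `word` under a unigram model."""
    n = len(word)
    best_logp = [-math.inf] * (n + 1)  # best log-probability of word[:i]
    back = [0] * (n + 1)               # start index of the last token on that path
    best_logp[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in token_probs and best_logp[start] > -math.inf:
                score = best_logp[start] + math.log(token_probs[piece])
                if score > best_logp[end]:
                    best_logp[end] = score
                    back[end] = start
    if best_logp[n] == -math.inf:
        raise ValueError(f"cannot segment {word!r} with the given vocabulary")
    tokens, end = [], n
    while end > 0:
        start = back[end]
        tokens.append(word[start:end])
        end = start
    return tokens[::-1], best_logp[n]


# Hypothetical unigram probabilities (they need not sum to 1 for this demo).
TOKEN_PROBS = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02,
               "u": 0.01, "n": 0.01, "r": 0.01, "e": 0.01, "l": 0.01,
               "a": 0.01, "t": 0.01, "d": 0.01}

tokens, logp = viterbi_segment("unrelated", TOKEN_PROBS)
```

Re-running the segmenter with a candidate token removed from `TOKEN_PROBS` and comparing the resulting log-probabilities over a corpus is one way to picture the pruning criterion, although this sketch does not implement the pruning loop itself.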

[Figure 8 labels: Data Source 1, 2, 3, 4; Data Mixture; Stage 1, Stage 2, ···; Data Curriculum]
Fig. 8: An illustration of data scheduling for pre-training LLMs.

ing LLMs (i.e., continual pre-training or instruction tuning), we should also be aware of the potential side effect with customized tokenizers. For example, LLaMA trains the BPE tokenizer based on a pre-training corpus mainly consisting of English texts, and the derived vocabulary might be less capable of processing non-English data, e.g., taking longer inference latency to generate Chinese texts.

Discussion on Effect of Data Quality. For pre-training, the quality of pre-training data is vital to the model capacities of LLMs. Existing work has shown that pre-training on the low-quality corpus, such as noisy, toxic, and duplicate data, would largely hurt the performance of models [64, 234, 236, 238]. Recent studies, such as T5 [82], GLaM [112], and Gopher [64], have investigated the influence of data quality on the LLMs’ capacities. By comparing the performance of models trained on the filtered and unfiltered corpus, they have reached the similar conclusion that pre-training LLMs on cleaned data can improve the model performance. More specifically, the duplication of data may result in “double descent” (referring to the phenomenon of performance initially deteriorating and subsequently improving) [234, 247], or even overwhelm the training process [234]. In addition, it has been shown that duplicate data degrades the ability of LLMs to copy from the context, which might further affect the generalization capacity of LLMs using in-context learning [234]. Therefore, as suggested in [56, 64, 78, 227], it is essential to utilize preprocessing methods like quality filtering, toxic filtering and deduplication to carefully clean the pre-training corpus (as illustrated in Section 4.1.2), to improve stability of the training process and avoid affecting the model performance.

4.1.3 Data Scheduling
After data preprocessing, it is essential to design suitable strategies to schedule these multi-source data for pre-training a capable LLM. Generally, two key aspects deserve close attention in data scheduling: the proportion of each data source (data mixture), and the order in which each data source is scheduled for training (data curriculum). Next, we discuss the two aspects in detail. An illustration of data scheduling has been presented in Figure 8.

Data Mixture. Since each kind of data source is closely related to the development of certain capacities for LLMs (referring to the discussions in Section 4.1), it is important to set a suitable distribution to mix these data. The data mixture is generally set at a global level (i.e., the distribution of the entire pre-training data), and can also be locally set to varied proportions at different training stages. During pre-training, data samples from different sources would be selected according to the mixture proportions: more data will be sampled from a data source with a larger weight. Typically, existing LLMs such as LLaMA [57] may employ upsampling or downsampling on the full data of each source to create specific data mixtures as pre-training data. As Figure 6 illustrates, existing LLMs use different data mixtures to construct the pre-training data. As a representative model, the pre-training data of LLaMA [57] mainly consists of webpages (over 80%), alongside 6.5% of code-heavy data from GitHub and StackExchange, 4.5% from books, and 2.5% of scientific data sourced from arXiv, which has become an important reference for training general-purpose LLMs. Furthermore, special data mixtures can be used to facilitate different purposes. For example, Falcon [171] is trained on pure webpages, and CodeGen [86] largely increases the amount of code data. In practice, data mixture is often determined empirically, and we summarize several common strategies for finding an effective data mixture as follows:
• Increasing the diversity of data sources. Recent studies have empirically shown that training on excessive data about a certain domain would degrade the generalization capability of LLMs on other domains [35, 64]. In contrast, increasing the data source heterogeneity (e.g., including diverse data sources) is critical for improving the downstream performance of LLMs [227, 248, 249]. To further examine the effect of different data sources, some studies have conducted ablation experiments by removing each data source one by one, and pre-training LLMs with specially curated datasets [227]. It has been shown that dropping data sources with high heterogeneity (e.g., webpages) impacts LLMs’ abilities more severely than dropping sources with low heterogeneity (e.g., academic corpus).
• Optimizing data mixtures. In addition to manually setting the data mixtures, several studies have proposed to optimize the data mixtures for improving the model pre-training [59, 250]. Given the target downstream tasks, one can select pre-training data with either higher proximity in the feature space [250] or those that provide positive influences on downstream task performance [251]. Further, to reduce the reliance on target tasks, DoReMi [59] first trains a small reference model using given initial domain weights, and then trains another small proxy model, upweighting the domains on which the greatest discrepancies in likelihood between the two models are observed. Finally, the learned domain weights of the proxy model are applied to train a much larger LLM. More simply, one can train several small language models with different data mixtures, and select the data mixture that leads to the most desirable performance. However, an assumption made in this approach is that, when trained in a similar way, small models would resemble large models in model abilities or behaviors, which may not always hold in practice.
• Specializing the targeted abilities. The model capacities of LLMs heavily rely on data selection and mixture, and one can boost the proportions of specific data sources to enhance certain model abilities [64, 227]. For example, the mathematical reasoning and coding abilities can be specially enhanced by training with more mathematical texts and
code data, respectively. Furthermore, experimental results on the LAMBADA dataset [252] show that increasing the proportion of books data can improve the model capacity in capturing long-term dependencies from text, and increasing the proportion of the C4 dataset [82] leads to performance improvement on the C4 validation dataset [64]. Generally, it is important to identify more implicit relations between data sources and model abilities. To enhance specific skills such as mathematics and coding in LLMs, or to develop specialized LLMs, a practical way is to employ a multi-stage training approach, e.g., general and skill-specific data can be scheduled at two consecutive stages. This approach of training LLMs on varying sources or proportions of data across multiple stages is also known as “data curriculum”, which will be introduced below.

Data Curriculum. After preparing the data mixture, it is important to schedule the order in which specific data is presented to LLMs for pre-training. It has been shown that, in some cases, to learn a certain skill, learning in a skill-set sequence (e.g., basic skills → target skill) outperforms direct learning from a corpus focused solely on the target skill [253, 254]. Following the idea of curriculum learning [255], data curriculum has been proposed and widely used in model pre-training [253, 254, 256, 257]. It aims to organize different parts of pre-training data for LLMs in a specific order, e.g., starting with easy/general examples and progressively introducing more challenging/specialized ones. More generally, it can broadly refer to the adaptive adjustment of data proportions for different sources during pre-training. Existing work about data curriculum mainly focuses on continual pre-training, such as specialized coding LLMs (e.g., CodeLLaMA [254]) or long context LLMs (e.g., LongLLaMA [257]). However, the literature still lacks more detailed reports about data curriculum for general-purpose LLMs (e.g., LLaMA). To determine data curriculum, a practical approach is to monitor the development of key abilities of LLMs based on specially constructed evaluation benchmarks, and then adaptively adjust the data mixture during pre-training. Next, we take three common abilities as examples to introduce how the concept of data curriculum (footnote 23) applies in continual pre-training.
• Coding. To improve the coding ability of LLMs, CodeLLaMA [254] is developed based on LLaMA 2 [99] (2T general tokens → 500B code-heavy tokens), aiming to improve the code generation ability and retain natural language understanding skills. CodeLLaMA also provides a version that is further specialized to a certain programming language, namely CodeLLaMA-Python (2T general tokens → 500B code-heavy tokens → 100B Python-heavy tokens).
• Mathematics. Llemma [258] is proposed to enhance the mathematical capacities of general-purpose LLMs. It is developed based on CodeLLaMA. Although CodeLLaMA [254] mainly focuses on the coding ability, experiments have shown that it performs better than its base model LLaMA 2 on mathematics benchmarks [258]. Based on CodeLLaMA, Llemma is continually trained on mixtures of scientific papers, web data containing mathematical text and code (2T general tokens → 500B code-heavy tokens → 50∼200B math-heavy tokens). Note that the pre-training data of Llemma also contains 5% general domain data as a form of regularization.
• Long context. Long context modeling is an important ability for LLMs, and many studies have explored extending the context windows of LLMs via continually training [254, 257]. With modifications on position embeddings (i.e., position interpolation) of RoPE-based LLMs [57, 99, 259], CodeLLaMA further extends the context window of LLaMA 2 (2.5T tokens with 4K context window → 20B tokens with 16K context window). LongLLaMA [257] also achieves longer context window with the help of external memory and a unique training objective (1T tokens with 2K context window → 10B tokens with 8K context window).

4.1.4 Summary of Data Preparation
In this part, we summarize the general procedure and key points to prepare pre-training data for LLMs, which are detailed in the following three aspects.
• Data collection. It is suggested to include diverse data sources in the pre-training data. Although Falcon [171] shows that webpages alone can be employed to train powerful LLMs, a more typical approach is to also incorporate diverse high-quality text like code, books, scientific papers, etc. If an LLM is specialized for a certain skill, the proportion of corresponding data source should be increased accordingly. For example, Gopher [64] and Chinchilla [34] are trained with approximately 40% of data from books. PaLM [44] and LaMDA [68] use approximately 50% conversational data.
• Data cleaning. After data collection, it is crucial to clean the raw corpus to enhance its quality as much as possible. First, deduplication is commonly used in existing work [99, 171, 248]. Second, low-quality text, toxic content, and data with privacy concerns should be removed at different granularities (e.g., document, passage or sentence). In practice, both heuristic and classifier-based methods can be employed for quality and toxicity filtering (e.g., CCNet [260], fastText [261], and Data-Juicer [262]). Third, with the cleaned data, one can further unify or specify the format for pre-training data, and perform the tokenization by training the tokenizer on the filtered and deduplicated corpus with libraries like SentencePiece [245].
• Data scheduling. With the preprocessed data, the next step is to determine the data mixture and the specific order of data for pre-training LLMs. To determine both settings, a practical way is to first train several small language models with multiple candidate plans and then select a good plan among them [59]. Overall, it is more difficult to find a suitable data curriculum. In practice, one can monitor the performance of intermediate model checkpoints on specific evaluation benchmarks, and dynamically tune the data mixture and distribution during pre-training. In this process, it is also useful to explore the potential relations between data sources and model abilities to guide the design of data curriculum.
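Mixture-based sampling, in which a source with a larger weight contributes more examples, can be sketched as follows. The source names and weights are placeholders loosely patterned on the LLaMA proportions quoted in Section 4.1.3 (the "other" share is an assumed remainder), and `sample_batch` is a hypothetical helper rather than an API of any training framework.

```python
import random
from collections import Counter

# Illustrative mixture weights (assumed values; normalized before use).
MIXTURE = {"webpages": 0.82, "code": 0.065, "books": 0.045, "arxiv": 0.025, "other": 0.045}


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("mixture weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def sample_batch(sources, weights, batch_size, rng):
    """Pick a source per example by weight, then draw an example from that source."""
    probs = normalize(weights)
    names = list(probs)
    chosen = rng.choices(names, weights=[probs[n] for n in names], k=batch_size)
    return [(name, rng.choice(sources[name])) for name in chosen]


rng = random.Random(0)
sources = {name: [f"{name}-doc-{i}" for i in range(100)] for name in MIXTURE}
batch = sample_batch(sources, MIXTURE, batch_size=10_000, rng=rng)
counts = Counter(name for name, _ in batch)
```

Raising a weight above a source's natural share of the corpus corresponds to upsampling that source, and lowering it corresponds to downsampling.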

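A data curriculum written in the “→” notation of footnote 23 can be represented as an ordered list of stages, each with a token budget and its own mixture. The sketch below takes its budgets from the CodeLLaMA-Python schedule quoted in Section 4.1.3; the per-stage mixture weights are made-up placeholders, and `stage_at` and `describe` are hypothetical helpers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Stage:
    name: str
    token_budget: int          # number of training tokens in this stage
    mixture: Dict[str, float]  # source -> sampling weight (placeholder values)


B = 10**9
CURRICULUM: List[Stage] = [
    Stage("general", 2000 * B, {"general": 1.0}),
    Stage("code-heavy", 500 * B, {"code": 0.85, "general": 0.15}),
    Stage("python-heavy", 100 * B, {"python": 0.75, "code": 0.15, "general": 0.10}),
]


def stage_at(tokens_seen: int, curriculum: List[Stage]) -> Stage:
    """Return the stage that is active after `tokens_seen` training tokens."""
    boundary = 0
    for stage in curriculum:
        boundary += stage.token_budget
        if tokens_seen < boundary:
            return stage
    return curriculum[-1]


def describe(curriculum: List[Stage]) -> str:
    """Render the schedule in the arrow notation used in the text."""
    return " → ".join(f"{s.token_budget // B}B {s.name} tokens" for s in curriculum)
```

A training loop could call `stage_at` with its running token count and pass the returned `mixture` to a weighted sampler; adjusting the weights from benchmark results during training, as the summary suggests, is outside the scope of this sketch.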
23. We utilize the symbol “→” to represent the data order in data curriculum. For example, “2T webpage tokens → 500B code tokens” means that the LLM is firstly trained with 2T webpage tokens and subsequently with 500B code data tokens.

4.2 Architecture
In this section, we review the architecture design of LLMs, i.e., mainstream architecture, pre-training objective, and detailed configuration. Table 5 presents the model cards of several representative LLMs with public details.

4.2.1 Typical Architectures
Due to the excellent parallelizability and capacity, the Transformer architecture [22] has become the de facto backbone to develop various LLMs, making it possible to scale language models to hundreds or thousands of billions of parameters. In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder, as shown in Figure 9.

Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence for generating its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence. Encoder-decoder PLMs (e.g., T5 [82] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, there are only a small number of LLMs that are built based on the encoder-decoder architecture, e.g., Flan-T5 [69]. We leave a detailed discussion about the architecture selection in Section 4.2.5.

Causal Decoder Architecture. The causal decoder architecture incorporates the unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself. The input and output tokens are processed in the same fashion through the decoder. As representative language models of this architecture, the GPT-series models [26, 55, 122] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing an amazing in-context learning capability of LLMs. Interestingly, GPT-1 [122] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the model capacity of this model architecture. So far, the causal decoders have been widely adopted as the architecture of LLMs by various existing LLMs, such as OPT [90], BLOOM [78], and Gopher [64]. Note that both the causal decoder and prefix decoder discussed next belong to decoder-only architectures. When mentioning “decoder-only architecture”, it mainly refers to the causal decoder architecture in existing literature, unless specified.

e.g., U-PaLM [118] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [93] and U-PaLM [118].

Mixture-of-Experts. For the above three types of architectures, we can further extend them via the mixture-of-experts (MoE) scaling, in which a subset of neural network weights for each input are sparsely activated, e.g., Switch Transformer [25] and GLaM [112]. The major merit is that MoE is a flexible way to scale up the model parameter while maintaining a constant computational cost [25]. It has been shown that substantial performance improvement can be observed by increasing either the number of experts or the total parameter size [265]. Despite the merits, training large MoE models may suffer from instability issues due to the complex, hard-switching nature of the routing operation. To enhance the training stability of MoE-based language models, techniques such as selectively using high-precision tensors in the routing module or initializing the model with a smaller range have been introduced [25]. More recently, there is widespread speculation that GPT-4 has been developed based on the MoE architecture, but without official verification.

Emergent Architectures. The conventional Transformer architecture typically suffers from quadratic computational complexity with respect to sequence length, resulting in a high processing cost for dealing with long inputs. To improve efficiency, recent studies aim to devise new architectures for language modeling, most based on parameterized state space models (SSM) [266], which can be viewed as a combination of RNN and CNN. On the one hand, SSM can generate outputs recursively like RNN, meaning that they only need to refer to the single previous state during decoding. It makes the decoding process more efficient as it eliminates the need to revisit all previous states as in conventional Transformers. On the other hand, these models have the capability to encode an entire sequence in parallel like Transformers via convolution computation. Thus, they can benefit from the parallelism of GPUs with techniques such as Parallel Scan [267, 268], FFT [269, 270], and Chunkwise Recurrent [271]. Despite the high computation efficiency of SSMs, their performance still lags behind Transformer. Thus, several variants of SSM have been proposed, including Mamba [272], RetNet [271], RWKV [273], and Hyena [269].
• Mamba. Mamba [272] aims to selectively filter out or remember information during state update. It replaces the
original fixed parameters of SSM layers with functions of the
Prefix Decoder Architecture. The prefix decoder architec- input, selectively filtering out information of the previous
ture (a.k.a., non-causal decoder [263]) revises the masking state and the current input depending on the current input.
mechanism of causal decoders, to enable performing bidi- Compared with traditional SSMs, Mamba has demonstrated
rectional attention over the prefix tokens [264] and unidi- improved text modeling capacities.
rectional attention only on generated tokens. In this way, • RWKV. RWKV [273] combines the advantages of Trans-
like the encoder-decoder architecture, the prefix decoders former and RNN. It employs time-mixing modules, i.e.,
can bidirectionally encode the prefix sequence and autore- RNN with gating, and channel-mixing modules that are
gressively predict the output tokens one by one, where the special feedforward neural networks [273]. Within these
same parameters are shared during encoding and decoding. modules, token shift, a linear combination of the current and
Instead of pre-training from scratch, a practical suggestion previous token, is used instead of the token representation
is to continually train causal decoders and then convert as the input.
them into prefix decoders for accelerating convergence [29], • RetNet. RetNet [271] proposes multi-scale retention

TABLE 5: Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding, #L denotes the number of layers, #H denotes the number of attention heads, $d_{model}$ denotes the size of hidden states, and MCL denotes the maximum context length during training.

| Model | Category | Size | Normalization | PE | Activation | Bias | #L | #H | $d_{model}$ | MCL |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT3 [55] | Causal decoder | 175B | Pre LayerNorm | Learned | GeLU | ✓ | 96 | 96 | 12288 | 2048 |
| PanGU-α [84] | Causal decoder | 207B | Pre LayerNorm | Learned | GeLU | ✓ | 64 | 128 | 16384 | 1024 |
| OPT [90] | Causal decoder | 175B | Pre LayerNorm | Learned | ReLU | ✓ | 96 | 96 | 12288 | 2048 |
| PaLM [56] | Causal decoder | 540B | Pre LayerNorm | RoPE | SwiGLU | × | 118 | 48 | 18432 | 2048 |
| BLOOM [78] | Causal decoder | 176B | Pre LayerNorm | ALiBi | GeLU | ✓ | 70 | 112 | 14336 | 2048 |
| MT-NLG [113] | Causal decoder | 530B | - | - | - | - | 105 | 128 | 20480 | 2048 |
| Gopher [64] | Causal decoder | 280B | Pre RMSNorm | Relative | - | - | 80 | 128 | 16384 | 2048 |
| Chinchilla [34] | Causal decoder | 70B | Pre RMSNorm | Relative | - | - | 80 | 64 | 8192 | - |
| Galactica [35] | Causal decoder | 120B | Pre LayerNorm | Learned | GeLU | × | 96 | 80 | 10240 | 2048 |
| LaMDA [68] | Causal decoder | 137B | - | Relative | GeGLU | - | 64 | 128 | 8192 | - |
| Jurassic-1 [107] | Causal decoder | 178B | Pre LayerNorm | Learned | GeLU | ✓ | 76 | 96 | 13824 | 2048 |
| LLaMA [57] | Causal decoder | 65B | Pre RMSNorm | RoPE | SwiGLU | × | 80 | 64 | 8192 | 2048 |
| LLaMA 2 [99] | Causal decoder | 70B | Pre RMSNorm | RoPE | SwiGLU | × | 80 | 64 | 8192 | 4096 |
| Falcon [171] | Causal decoder | 40B | Pre LayerNorm | RoPE | GeLU | × | 60 | 64 | 8192 | 2048 |
| GLM-130B [93] | Prefix decoder | 130B | Post DeepNorm | RoPE | GeGLU | ✓ | 70 | 96 | 12288 | 2048 |
| T5 [82] | Encoder-decoder | 11B | Pre RMSNorm | Relative | ReLU | × | 24 | 128 | 1024 | 512 |
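
As a rough sanity check on Table 5, the sketch below estimates model size from #L and $d_{model}$ alone. It assumes the common rule of thumb of about 12 · #L · $d_{model}^2$ non-embedding parameters for a Transformer layer with a 4× feed-forward expansion; this approximation is not taken from the survey, and the function name `approx_params` is hypothetical. Models with other feed-forward widths or attention variants can deviate noticeably from it.

```python
def approx_params(num_layers: int, d_model: int) -> int:
    """Rough non-embedding parameter count (assumed rule of thumb: 12 * L * d^2).

    Per layer: about 4*d^2 for the attention projections (Q, K, V, output)
    plus about 8*d^2 for a feed-forward network with hidden size 4*d.
    Embeddings, biases, and normalization parameters are ignored.
    """
    return 12 * num_layers * d_model * d_model


# (#L, d_model, reported size) copied from Table 5.
TABLE5_ROWS = {
    "GPT3": (96, 12288, 175e9),
    "OPT": (96, 12288, 175e9),
    "BLOOM": (70, 14336, 176e9),
}

for name, (n_layers, d_model, reported) in TABLE5_ROWS.items():
    estimate = approx_params(n_layers, d_model)
    print(f"{name}: estimated {estimate / 1e9:.1f}B, reported {reported / 1e9:.0f}B")
```

For these three rows the estimate lands within a few percent of the reported size, which suggests that depth and hidden size account for most of the parameter budget in this family of configurations.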

[Figure 9 image: three attention-pattern panels, titled Causal Decoder, Prefix Decoder, and Encoder-Decoder, each drawn over the example token sequence “A Survey of Large Language Models”.]

Fig. 9: A comparison of the attention patterns in three mainstream architectures. Here, the blue, green, yellow and grey
rounded rectangles indicate the attention between prefix tokens, attention between prefix and target tokens, attention
between target tokens, and masked attention respectively.
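
The attention patterns compared in Fig. 9 can be sketched as binary masks, where entry (i, j) is 1 if token i may attend to token j. This is a minimal illustration rather than the implementation of any particular model; the helper names and the choice of a three-token prefix are assumptions made for the example.

```python
from typing import List

Mask = List[List[int]]


def causal_mask(n: int) -> Mask:
    """Causal decoder: token i attends to itself and to earlier tokens only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def prefix_mask(n: int, prefix_len: int) -> Mask:
    """Prefix decoder: bidirectional attention inside the prefix,
    unidirectional (causal) attention for the generated tokens."""
    return [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(n)]
        for i in range(n)
    ]


tokens = ["A", "Survey", "of", "Large", "Language", "Models"]
prefix_len = 3  # assumed split: "A Survey of" is the prefix

for title, mask in [
    ("Causal decoder", causal_mask(len(tokens))),
    ("Prefix decoder", prefix_mask(len(tokens), prefix_len)),
]:
    print(title)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:<9}" + " ".join(str(v) for v in row))
```

In an encoder-decoder model the same two ingredients appear in separate modules: the encoder uses an all-ones mask over the input, the decoder uses the causal mask over the output, and cross-attention lets every output token see the whole encoded input.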

TABLE 6: Comparison of parallelism and complexity of different models. T represents sequence length, H represents the dimension of the input representation, N represents the dimension after compression in SSMs, and M represents the number of layers in each Hyena module.

| Model | Decoding Complexity | Training Complexity |
|---|---|---|
| Transformer | $O(H(T + H))$ | $O(TH(T + H))$ |
| SSM | $O(H(N^2 + H))$ | $O(TH(\log T + N^2 + H))$ |
| Mamba | $O(H(N^2 + H))$ | $O(TH(N^2 + H))$ |
| RWKV | $O(H^2)$ | $O(TH^2)$ |
| RetNet | $O(H^2)$ | $O(TH^2)$ |
| Hyena | $O(MH(T + H))$ | $O(TMH(\log T + H))$ |

(MSR) to replace the attention module in Transformer. Similar to linear attention, in the MSR module, the input is first mapped into query, key, and value, and the product of key and value is employed to update the state. Then, the query is used to project the state into the output. Similar to traditional SSMs, RetNet keeps the parallel and recurrent computation capacity at the same time.
• Hyena. Hyena employs long convolution to replace the attention module. In the long convolution module, filters based on relative positions are used to aggregate information at different positions into intermediate representations, and gating functions are employed to further project the intermediate representations into the final output. However, due to the long convolution, Hyena cannot perform inference recurrently like an RNN and must explicitly access all previous states.

4.2.2 Detailed Configuration
Since the launch of Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency. In this part, we will discuss the corresponding configurations for four major parts of the Transformer, including normalization, position embeddings, activation functions, and attention and bias. To make this survey more self-contained, we present the detailed formulations for these configurations in Table 7.

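TABLE 7: Detailed formulations for the network configurations. Here, Sublayer denotes an FFN or a self-attention module in a Transformer layer, $d$ denotes the size of hidden states, $p_i$ denotes the position embedding at position $i$, $A_{ij}$ denotes the attention score between a query and a key, $r_{i-j}$ denotes a learnable scalar based on the offset between the query and the key, and $R_{\Theta,t}$ denotes a rotary matrix with rotation degree $t \cdot \Theta$.

| Configuration | Method | Equation |
|---|---|---|
| Normalization position | Post Norm [22] | $\mathrm{Norm}(x + \mathrm{Sublayer}(x))$ |
| Normalization position | Pre Norm [26] | $x + \mathrm{Sublayer}(\mathrm{Norm}(x))$ |
| Normalization position | Sandwich Norm [274] | $x + \mathrm{Norm}(\mathrm{Sublayer}(\mathrm{Norm}(x)))$ |
| Normalization method | LayerNorm [275] | $\frac{x - \mu}{\sigma} \cdot \gamma + \beta$, $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$, $\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2}$ |
| Normalization method | RMSNorm [276] | $\frac{x}{\mathrm{RMS}(x)} \cdot \gamma$, $\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$ |
| Normalization method | DeepNorm [277] | $\mathrm{LayerNorm}(\alpha \cdot x + \mathrm{Sublayer}(x))$ |
| Activation function | ReLU [278] | $\mathrm{ReLU}(x) = \max(x, 0)$ |
| Activation function | GeLU [279] | $\mathrm{GeLU}(x) = 0.5x \otimes [1 + \mathrm{erf}(x/\sqrt{2})]$, $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_{0}^{x} e^{-t^2}\,dt$ |
| Activation function | Swish [280] | $\mathrm{Swish}(x) = x \otimes \mathrm{sigmoid}(x)$ |
| Activation function | SwiGLU [281] | $\mathrm{SwiGLU}(x_1, x_2) = \mathrm{Swish}(x_1) \otimes x_2$ |
| Activation function | GeGLU [281] | $\mathrm{GeGLU}(x_1, x_2) = \mathrm{GeLU}(x_1) \otimes x_2$ |
| Position embedding | Absolute [22] | $x_i = x_i + p_i$ |
| Position embedding | Relative [82] | $A_{ij} = W_q x_i x_j^T W_k^T + r_{i-j}$ |
| Position embedding | RoPE [282] | $A_{ij} = W_q x_i R_{\Theta,i-j} x_j^T W_k^T = (W_q x_i R_{\Theta,i})(W_k x_j R_{\Theta,j})^T$ |
| Position embedding | ALiBi [283] | $A_{ij} = W_q x_i x_j^T W_k^T - m(i - j)$ |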
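
Three of the formulations in Table 7 (RMSNorm, SwiGLU, and RoPE) can be written out directly. The sketch below uses plain Python lists instead of tensors, omits the small epsilon that practical normalization layers usually add, and applies RoPE to already projected query and key vectors; all function names are chosen for this illustration and do not come from any library.

```python
import math
from typing import List

Vector = List[float]


def rms_norm(x: Vector, gamma: Vector) -> Vector:
    """RMSNorm: x / RMS(x) * gamma, with RMS(x) = sqrt(mean(x_i^2))."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms * g for v, g in zip(x, gamma)]


def swish(v: float) -> float:
    """Swish(x) = x * sigmoid(x)."""
    return v / (1.0 + math.exp(-v))


def swiglu(x1: Vector, x2: Vector) -> Vector:
    """SwiGLU(x1, x2) = Swish(x1) * x2, element-wise."""
    return [swish(a) * b for a, b in zip(x1, x2)]


def rope(vec: Vector, t: int, base: float = 10000.0) -> Vector:
    """Rotate each consecutive pair of elements by the angle t * theta_i.

    The loop index i is 0-based, so theta = base^(-2i/d) corresponds to
    theta_i = b^(-2(i-1)/d) for the 1-based index used in the text.
    """
    d = len(vec)
    out: Vector = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(t * theta), math.sin(t * theta)
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# The attention score depends only on the offset i - j (here 3 in both cases).
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 13), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)

print(rms_norm([3.0, 4.0], [1.0, 1.0]))
print(swiglu([1.0, -1.0], [2.0, 2.0]))
```

The RoPE check illustrates the identity in the RoPE row of Table 7: rotating query and key by their absolute positions yields a score that is a function of the relative position only.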

Normalization Methods. Training instability is a challenging issue for pre-training LLMs. To alleviate this issue, normalization is a widely adopted strategy to stabilize the training of neural networks. In the vanilla Transformer [22], LayerNorm [275] is employed. Recently, several advanced normalization techniques have been proposed as alternatives to LayerNorm, e.g., RMSNorm and DeepNorm.
• LayerNorm. In early research, BatchNorm [284] was a commonly used normalization method. However, it has difficulty dealing with sequence data of variable lengths and with small-batch data. Thus, LayerNorm [275] was introduced to conduct layerwise normalization. Specifically, the mean and variance over all activations per layer are calculated to re-center and re-scale the activations.
• RMSNorm. To improve the training speed of LayerNorm (LN), RMSNorm [276] is proposed by re-scaling the activations with only the root mean square (RMS) of the summed activations, instead of the mean and variance. Related research has demonstrated its superiority in training speed and performance on Transformer [285]. Representative models that adopt RMSNorm include Gopher [64] and Chinchilla [34].
• DeepNorm. DeepNorm is proposed by Microsoft [277] to stabilize the training of deep Transformers. With DeepNorm as residual connections, Transformers can be scaled up to 1,000 layers [277], showing the advantages of stability and good performance. It has been adopted by GLM-130B [93].

Normalization Position. In addition to the normalization method, the normalization position also plays a crucial role in LLMs. There are generally three choices for the normalization position, i.e., post-LN, pre-LN, and sandwich-LN.
• Post-LN. Post-LN is used in the vanilla Transformer [22], where it is placed between residual blocks. However, existing work has found that the training of Transformers with post-LN tends to be unstable due to the large gradients near the output layer [286]. Thus, post-LN is rarely employed in existing LLMs except when combined with other strategies (e.g., combining post-LN with pre-LN in GLM-130B [93]).
• Pre-LN. Different from post-LN, pre-LN [287] is applied before each sub-layer, and an additional LN is placed before the final prediction. Compared with post-LN, Transformers with pre-LN are more stable in training. However, they perform worse than the variants with post-LN [288]. Despite the decreased performance, most LLMs still adopt pre-LN due to its training stability. One exception is that pre-LN has been found unstable in GLM when training models with more than 100B parameters [93].
• Sandwich-LN. Based on pre-LN, Sandwich-LN [274] adds extra LN before the residual connections to avoid value explosion in Transformer layer outputs. However, it has been found that Sandwich-LN sometimes fails to stabilize the training of LLMs and may lead to the collapse of training [93].

Activation Functions. To obtain good performance, activation functions also need to be properly set in feed-forward networks. In existing LLMs, GeLU activations [289] are widely used. In particular, in the latest LLMs (e.g., PaLM and LaMDA), variants of GLU activation [281, 290] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [285]. However, compared with GeLU, they require extra parameters (about 50%) in the feed-forward networks [291].

Position Embeddings. Since the self-attention modules in Transformer are permutation equivariant, position embeddings (PE) are employed to inject absolute or relative position information for modeling sequences.
• Absolute position embedding. In the vanilla Transformer [22], absolute position embeddings are employed. At the bottoms of the encoder and the decoder, the absolute positional embeddings are added to the input embeddings. There are two variants of absolute position embeddings proposed in the vanilla Transformer [22], i.e., sinusoidal and

learned position embeddings, where the latter is commonly used in existing pre-trained language models.
• Relative position embedding. Unlike absolute position embeddings, relative positional embeddings are generated according to the offsets between keys and queries [292]. A popular variant of relative PE was introduced in Transformer-XL [293, 294]. The calculation of attention scores between keys and queries has been modified to introduce learnable embeddings corresponding to relative positions. T5 [82] further simplified relative positional embeddings, an approach subsequently adopted by Gopher [64]. Specifically, it adds learnable scalars to the attention scores, where the scalars are calculated based on the distances between the positions of the query and the key. Compared with absolute PE, Transformers with relative position embedding can generalize to sequences longer than those seen during training, i.e., extrapolation [283].
• Rotary position embedding. Rotary position embedding (RoPE) [282] sets specific rotary matrices based on the absolute position of each key or query. The scores between keys and queries can then be computed with relative position information (Table 7). RoPE combines each consecutive pair of elements in the query and key vectors as a dimension, so there are $d/2$ dimensions for an original $d$-length embedding. For each dimension $i \in \{1, \dots, d/2\}$, the pair of involved elements is rotated by the angle $t \cdot \theta_i$, where $t$ denotes the position index and $\theta_i$ is the basis in the dimension. Following sinusoidal position embeddings [22], RoPE defines the basis $\theta_i$ as an exponentiation of the base $b$ (set to 10000 by default):

$$\Theta = \{\theta_i = b^{-2(i-1)/d} \mid i \in \{1, 2, \dots, d/2\}\}. \tag{4}$$

Furthermore, a recent study [295] defines the distance required to rotate one cycle ($2\pi$) for each dimension as the wavelength:

$$\lambda_i = 2\pi b^{2(i-1)/d} = 2\pi/\theta_i. \tag{5}$$

Due to its excellent performance and long-term decay property, RoPE is widely adopted in the latest LLMs, e.g., PaLM [56] and LLaMA [57]. Based on RoPE, xPos [296] further improves the translation invariance and length extrapolation of Transformer. At each dimension of the rotation angle vector, xPos adds a special exponential decay that is smaller when the basis is larger. It can alleviate the instability during training as the distance increases.
• ALiBi. ALiBi [283] is proposed to improve the extrapolation of Transformer. Similar to relative position embedding, it biases attention scores with a penalty based on the distances between keys and queries. Different from relative positional embedding methods like T5 [82], the penalty scores in ALiBi are pre-defined without any trainable parameters. Empirical results in [283] have shown that ALiBi extrapolates better to sequences longer than those seen during training than several popular position embedding methods such as sinusoidal PE [22], RoPE [282], and T5 bias [82]. In addition, it has been shown that ALiBi can also improve training stability in BLOOM [78].

Attention. The attention mechanism is a critical component of Transformer. It allows the tokens across the sequence to interact with each other and compute the representations of the input and output sequence.
• Full attention. In the vanilla Transformer [22], the attention mechanism is conducted in a pairwise way, considering the relations between all token pairs in a sequence. It adopts scaled dot-product attention, in which the hidden states are mapped into queries, keys, and values. Additionally, Transformer uses multi-head attention instead of single attention, projecting the queries, keys, and values with different projections in different heads. The concatenation of the output of each head is taken as the final output.
• Sparse attention. A crucial challenge of full attention is the quadratic computational complexity, which becomes a burden when dealing with long sequences. Therefore, various efficient Transformer variants are proposed to reduce the computational complexity of the attention mechanism [297, 298]. For instance, locally banded sparse attention (i.e., Factorized Attention [299]) has been adopted in GPT-3 [55]. Instead of the whole sequence, each query can only attend to a subset of tokens based on the positions.
• Multi-query/grouped-query attention. Multi-query attention refers to the attention variant where different heads share the same linear transformation matrices on the keys and values [300]. It achieves higher inference speed with only a minor sacrifice in model quality. Representative models with multi-query attention include PaLM [56] and StarCoder [98]. To make a trade-off between multi-query attention and multi-head attention, grouped-query attention (GQA) [301] has been explored. In GQA, heads are assigned to different groups, and the heads that belong to the same group share the same transformation matrices. In particular, GQA has been adopted and empirically tested in the recently released LLaMA 2 model [99].
• FlashAttention. Different from most existing approximate attention methods that trade off model quality to improve computing efficiency, FlashAttention [302] proposes to optimize the speed and memory consumption of attention modules on GPUs from an IO-aware perspective. There exist different levels of memory on modern GPUs, e.g., SRAM with fast IO and HBM with relatively slow IO. FlashAttention organizes the input into blocks and introduces necessary recomputation, both to make better use of the fast SRAM. Implemented as a fused kernel in CUDA, FlashAttention has been integrated into PyTorch [211], DeepSpeed [74], and Megatron-LM [75]. The updated version FlashAttention-2 [303] further optimizes the work partitioning of GPU thread blocks and warps, leading to around 2× speedup when compared to the original FlashAttention.
• PagedAttention. It has been observed that when LLMs are deployed on servers, GPU memory is largely occupied by cached attention key and value tensors (called KV cache). The major reason is that the input lengths often vary, leading to fragmentation and over-reservation issues. Inspired by the classic paging technique in operating systems, PagedAttention has been proposed to improve the memory efficiency and throughput of deployed LLMs [304]. In detail, PagedAttention partitions each sequence into subsequences, and the corresponding KV caches of these subsequences are allocated into non-contiguous physical blocks. The paging technique increases GPU utilization and enables efficient memory sharing in parallel sampling.

To put all these discussions together, we summarize the

suggestions from existing literature for detailed configuration. For stronger generalization and training stability, it is suggested to choose pre RMSNorm for layer normalization, and SwiGLU or GeGLU as the activation function. In addition, LN should preferably not be used immediately after the embedding layers, since this is likely to incur performance degradation. As for position embeddings, RoPE or ALiBi is a better choice since both perform better on long sequences.

4.2.3 Pre-training Tasks
Pre-training plays a key role in encoding general knowledge from large-scale corpora into the massive model parameters. For training LLMs, there are two commonly used pre-training tasks, namely language modeling and denoising autoencoding.

Language Modeling. The language modeling task (LM) is the most commonly used objective to pre-train decoder-only LLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of tokens $\mathbf{x} = \{x_1, \dots, x_n\}$, the LM task aims to autoregressively predict the target tokens $x_i$ based on the preceding tokens $\mathbf{x}_{<i}$ in a sequence. A general training objective is to maximize the following likelihood:

$$\mathcal{L}_{LM}(\mathbf{x}) = \sum_{i=1}^{n} \log P(x_i \mid \mathbf{x}_{<i}). \tag{6}$$

Since most language tasks can be cast as a prediction problem based on the input, these decoder-only LLMs might be potentially advantageous in implicitly learning how to accomplish these tasks in a unified LM way. Some studies have also revealed that decoder-only LLMs can be naturally transferred to certain tasks by autoregressively predicting the next tokens [26, 55], without fine-tuning. An important variant of LM is the prefix language modeling task, which is designed for pre-training models with the prefix decoder architecture. The tokens within a randomly selected prefix are not used in computing the loss of prefix language modeling. With the same amount of tokens seen during pre-training, prefix language modeling performs slightly worse than language modeling, since fewer tokens in the sequence are involved in model pre-training [29].

Denoising Autoencoding. In addition to conventional LM, the denoising autoencoding task (DAE) has also been widely used to pre-train language models [24, 82]. The inputs $\mathbf{x}_{\backslash \tilde{\mathbf{x}}}$ for the DAE task are corrupted text with randomly replaced spans. Then, the language models are trained to recover the replaced tokens $\tilde{\mathbf{x}}$. Formally, the training objective of DAE is denoted as follows:

$$\mathcal{L}_{DAE}(\mathbf{x}) = \log P(\tilde{\mathbf{x}} \mid \mathbf{x}_{\backslash \tilde{\mathbf{x}}}). \tag{7}$$

However, the DAE task seems to be more complicated to implement than the LM task. As a result, it has not been widely used to pre-train large language models. Existing LLMs that take DAE as a pre-training objective include T5 [82] and GLM-130B [93]. These models are mainly trained to recover the replaced spans in an autoregressive way.

Mixture-of-Denoisers. Mixture-of-Denoisers (MoD) [89], also known as UL2 loss, was introduced as a unified objective for pre-training language models. MoD regards both LM and DAE objectives as different types of denoising tasks, namely S-denoiser (LM), R-denoiser (DAE, short span and low corruption), and X-denoiser (DAE, long span or high corruption). Among the three denoising tasks, S-denoiser is similar to the conventional LM objective (Equation (6)), while R-denoiser and X-denoiser are similar to DAE objectives (Equation (7)) but differ from each other in the lengths of spans and the ratio of corrupted text. For input sentences starting with different special tokens (i.e., {[R], [S], [X]}), the model is optimized using the corresponding denoisers. MoD has been applied in the latest PaLM 2 model [120].

4.2.4 Decoding Strategy
After the LLMs have been pre-trained, it is essential to employ a specific decoding strategy to generate the appropriate output from the LLMs.

Background. We start the discussion with the prevalent decoder-only architecture, and introduce the auto-regressive decoding mechanism. Since such LLMs are pre-trained based on the language modeling task (Equation (6)), a basic decoding method is greedy search, which predicts the most likely token at each step based on the previously generated tokens, formally modeled as:

$$x_i = \arg\max_{x} P(x \mid \mathbf{x}_{<i}), \tag{8}$$

where $x_i$ is the token with the highest probability at the $i$-th step of generation conditioned on the context $\mathbf{x}_{<i}$. For instance in Figure 10, when predicting the next token of the sentence “I am sleepy. I start a pot of”, greedy search selects the token “coffee”, which has the highest probability at the current step. Greedy search can achieve satisfactory results in text generation tasks (e.g., machine translation and text summarization), in which the output is highly dependent on the input [305]. However, in open-ended generation tasks (e.g., story generation and dialog), greedy search sometimes tends to generate awkward and repetitive sentences [306].

    I am sleepy. I start a pot of
    coffee  0.661    strong  0.008    soup   0.005
    water   0.119    black   0.008    ...    ...
    tea     0.057    hot     0.007    happy  4.3e-6
    rice    0.017    oat     0.006    Boh    4.3e-6
    chai    0.012    beans   0.006    ...    ...

Fig. 10: The probability distribution over the vocabulary in descending order for the next token of the context “I am sleepy. I start a pot of ”. For ease of discussion, this example is given in word units instead of subword units.

As an alternative decoding strategy, sampling-based methods are proposed to randomly select the next token based on the probability distribution to enhance the randomness and diversity during generation:

$$x_i \sim P(x \mid \mathbf{x}_{<i}). \tag{9}$$

For the example in Figure 10, sampling-based methods will sample the word “coffee” with higher probability while

also retaining the possibilities of selecting the remaining words, “water”, “tea”, “rice”, etc.

Not limited to the decoder-only architecture, these two decoding methods can be generally applied to encoder-decoder models and prefix decoder models in a similar way.

Improvement for Greedy Search. Selecting the token with the highest probability at each step may result in overlooking a sentence with a higher overall probability but a lower local estimation. Next, we introduce several improvement strategies to alleviate this issue.
• Beam search. Beam search [307] retains the sentences with the n (beam size) highest probabilities at each step during the decoding process, and finally selects the generated response with the top probability. Typically, the beam size is configured within the range of 3 to 6. However, opting for a larger beam size might result in a decline in performance [308].
• Length penalty. Since beam search favours shorter sentences, imposing a length penalty (a.k.a., length normalization) is a commonly used technique [309] to overcome this issue, which normalizes the sentence probability according to the sentence length (divided by an exponential power α of the length).

Besides, some researchers [310] propose to penalize the generation of previously generated tokens or n-grams to alleviate the issue of repetitive generation. In addition, diverse beam search [311] can be leveraged to produce a set of diverse outputs based on the same input.

Improvement for Random Sampling. Sampling-based methods sample the token over the whole vocabulary, which may select wrong or irrelevant tokens (e.g., “happy” and “Boh” in Figure 10) based on the context. To improve the generation quality, several strategies have been proposed for mitigating or preventing the selection of words with exceedingly low probabilities.
• Temperature sampling. To modulate the randomness of sampling, a practical method is to adjust the temperature coefficient of the softmax function for computing the probability of the $j$-th token over the vocabulary:

$$P(x_j \mid \mathbf{x}_{<i}) = \frac{\exp(l_j / t)}{\sum_{j'} \exp(l_{j'} / t)}, \tag{10}$$

where $l_{j'}$ is the logit of each word and $t$ is the temperature coefficient. Reducing the temperature $t$ increases the chance of selecting words with high probabilities while decreasing the chance of selecting words with low probabilities. When $t$ is set to 1, it becomes the default random sampling; when $t$ approaches 0, it is equivalent to greedy search. In addition, when $t$ goes to infinity, it degenerates to uniform sampling.
• Top-k sampling. Different from temperature sampling, top-k sampling directly truncates the tokens with lower probability and only samples from the tokens with the top k highest probabilities [312]. For example in Figure 10, top-5 sampling will sample from the words “coffee”, “water”, “tea”, “rice”, and “chai” according to their re-scaled probabilities.
• Top-p sampling. Since top-k sampling does not consider the overall probability distribution, a constant value of k may not be suitable for different contexts. Therefore, top-p sampling (a.k.a., nucleus sampling) is proposed by sampling from the smallest set having a cumulative probability above (or equal to) p [306]. In practice, the smallest set can be constructed by gradually adding tokens from the vocabulary sorted in descending order of generative probability, until their cumulative value reaches or exceeds p.

Recently, researchers have also explored other sampling strategies for LLMs. For instance, η-sampling [313] further improves top-p sampling by introducing a dynamic threshold based on the probability distribution. Furthermore, contrastive search [314] and typical sampling [315] can be utilized to improve the generation coherence during decoding. Since it has been found that large models tend to assign higher probability to important tokens compared to small models, contrastive decoding [316] utilizes a larger LM (e.g., OPT-13B) and a smaller LM (e.g., OPT-125M) to measure their log-likelihood differences. Subsequently, tokens are sampled based on the delta value of the probability distribution, thereby amplifying the impact of important tokens. Based on this contrastive idea, DoLa [317] further extends this approach to contrasting the logits across different layers of a single LLM, as higher layers tend to assign more weight to important tokens.

Practical Settings. In practice, existing libraries (e.g., Transformers [201]) and public APIs of LLMs (e.g., OpenAI) have supported various decoding strategies to serve different scenarios of text generation. Next, we present the decoding settings of several representative LLMs:
• T5 [82] utilizes greedy search as the default setting and applies beam search (beam size of 4) with a length penalty of 0.6 for translation and summarization tasks.
• GPT-3 [55] employs beam search with a beam size of 4 and a length penalty of 0.6 for all generation tasks.
• Alpaca [146] utilizes sampling-based strategies with top-k (k = 50), top-p (p = 0.9), and temperature of 0.7 for open-ended generation.
• LLaMA [57] applies diverse decoding strategies tailored to specific tasks. For instance, it employs greedy search for question answering tasks while utilizing a sampling strategy with temperature settings of 0.1 (pass@1) and 0.8 (pass@100) for code generation.
• OpenAI API supports several basic decoding strategies, including greedy search (by setting temperature to 0), beam search (with the setting best_of), temperature sampling (with the setting temperature), and nucleus sampling (with the setting top_p). It also introduces the parameters presence_penalty and frequency_penalty to control the degree of repetition in generation. According to OpenAI’s documentation, their APIs may produce different outputs even if the input and the hyper-parameters are the same. Setting temperature to 0 can yield more deterministic outputs, albeit with a slight chance of variability.

4.2.5 Summary and Discussion
The choice of architecture and pre-training tasks may incur different inductive biases for LLMs, which would lead to different model capacities. In this part, we discuss one open issue about the architecture choice for LLMs.
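
Looking back at the decoding strategies of Section 4.2.4, greedy search, temperature sampling, top-k sampling, and top-p sampling can be tried out on the word probabilities listed in Fig. 10. The sketch below is illustrative only: it uses just the eleven probabilities printed in the figure (the rest of the vocabulary is dropped), treats log-probabilities as logits for the temperature step, and uses helper names invented for this example.

```python
import math
import random
from typing import Dict

Dist = Dict[str, float]

# Probabilities copied from Fig. 10; the elided tail of the vocabulary is omitted.
FIG10: Dist = {
    "coffee": 0.661, "water": 0.119, "tea": 0.057, "rice": 0.017,
    "chai": 0.012, "strong": 0.008, "black": 0.008, "hot": 0.007,
    "oat": 0.006, "beans": 0.006, "soup": 0.005,
}


def renormalize(weights: Dist) -> Dist:
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def greedy(probs: Dist) -> str:
    """Equation (8): pick the most likely token."""
    return max(probs, key=probs.get)


def with_temperature(probs: Dist, t: float) -> Dist:
    """Equation (10), using log-probabilities as the logits l_j."""
    return renormalize({tok: math.exp(math.log(p) / t) for tok, p in probs.items()})


def top_k(probs: Dist, k: int) -> Dist:
    """Keep the k most likely tokens and re-scale their probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p(probs: Dist, p: float) -> Dist:
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept: Dist = {}
    cumulative = 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def sample(probs: Dist, rng: random.Random) -> str:
    """Equation (9): draw one token according to the distribution."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)
print("greedy:", greedy(FIG10))
print("top-5 candidates:", list(top_k(FIG10, 5)))
print("top-p (p=0.8) candidates:", list(top_p(FIG10, 0.8)))
print("sampled with t=0.7:", sample(with_temperature(renormalize(FIG10), 0.7), rng))
```

With p = 0.8 the nucleus contains “coffee”, “water”, and “tea”, because the first two words only accumulate 0.78 of the probability mass. Lowering the temperature below 1 moves further mass towards “coffee”, which is the sense in which small temperatures approach greedy search.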

Why Does Predicting the Next Word Work?

The essence of decoder-only architecture is to accurately predict the next word for reconstructing the pre-training data. Till now, there has been no formal study that theoretically demonstrates its advantage over other architectures. An interesting explanation was given by Ilya Sutskever during the interview held by Jensen Huang [a]. The original transcript from the interview is copied below [b]:

    Say you read a detective novel. It’s like complicated plot, a storyline, different characters, lots of events, mysteries like clues, it’s unclear. Then, let’s say that at the last page of the book, the detective has gathered all the clues, gathered all the people and saying, "okay, I’m going to reveal the identity of whoever committed the crime and that person’s name is". Predict that word.
    ...
    Now, there are many different words. But predicting those words better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better.

[a] https://www.nvidia.com/en-us/on-demand/session/gtcspring23-S52092/
[b] https://lifearchitect.ai/ilya/

Architecture Choice. In earlier literature on pre-trained language models, there are many discussions on the effects of different architectures [29, 89]. However, most LLMs are developed based on the causal decoder architecture, and a theoretical analysis of its advantage over the other alternatives is still lacking. Next, we briefly summarize existing discussions on this issue.

tures and pre-training objectives are in need to analyze how the choices of the architecture and pre-training tasks affect the capacity of LLMs, especially for encoder-decoder architectures. Despite the effectiveness of the decoder-only architecture, it is also suggested to make more diverse exploration on architecture design. Besides the major architecture, the detailed configuration of LLMs is also worth attention, which has been discussed in Section 4.2.2.

4.3 Model Training
In this part, we review the important settings, techniques, or tricks for training LLMs.

4.3.1 Optimization Setting
For parameter optimization of LLMs, we present the commonly used settings for batch training, learning rate, optimizer, and training stability.

Batch Training. For language model pre-training, existing work generally sets the batch size to a large number (e.g., 2,048 examples or 4M tokens) to improve the training stability and throughput. LLMs such as GPT-3 and PaLM have introduced a new strategy that dynamically increases the batch size during training, ultimately reaching a million scale. Specifically, the batch size of GPT-3 is gradually increased from 32K to 3.2M tokens. Empirical results have demonstrated that the dynamic schedule of batch size can effectively stabilize the training process of LLMs [56].

Learning Rate. Existing LLMs usually adopt a similar learning rate schedule with warm-up and decay strategies during pre-training. Specifically, in the initial 0.1% to 0.5% of the training steps, a linear warm-up schedule is employed for gradually increasing the learning rate to the maximum value, which ranges from approximately $5 \times 10^{-5}$ to $1 \times 10^{-4}$ (e.g., $6 \times 10^{-5}$ for GPT-3). Then, a cosine decay strategy is adopted in the subsequent steps, gradually reducing the learning rate to approximately 10% of its maximum value, until the convergence of the training loss.
• By pre-training with the LM objective, it seems that
Optimizer. The Adam optimizer [318] and AdamW opti-
causal decoder architecture can achieve a superior zero-
mizer [319] are widely utilized for training LLMs (e.g., GPT-
shot and few-shot generalization capacity. Existing research
3), which are based on adaptive estimates of lower-order
has shown that without multi-task fine-tuning, the causal
moments for first-order gradient-based optimization. Com-
decoder has better zero-shot performance than other archi-
monly, its hyper-parameters are set as follows: β1 = 0.9,
tectures [29]. The success of GPT-3 [55] has demonstrates
β2 = 0.95 and ϵ = 10−8 . Meanwhile, the Adafactor op-
that the large causal decoder model can be a good few-
timizer [320] has also been utilized in training LLMs (e.g.,
shot learner. In addition, instruction tuning and alignment
PaLM and T5), which is a variant of the Adam optimizer
tuning discussed in Section 5 have been proven to fur-
specially designed for conserving GPU memory during
ther enhance the capability of large causal decoder mod-
training. The hyper-parameters of the Adafactor optimizer
els [66, 67, 69].
are set as: β1 = 0.9 and β2 = 1.0 − k −0.8 , where k denotes
• Scaling law has been widely observed in causal de-
the number of training steps.
coders. By scaling the model size, the dataset size, and
the total computation, the performance of causal decoders Stabilizing the Training. During the pre-training of LLMs,
can be substantially improved [30, 55]. Thus, it has become it often suffers from the training instability issue, which
an important strategy to increase the model capacity of may cause the model collapse. To address this issue, weight
the causal decoder via scaling. However, more detailed decay and gradient clipping have been widely utilized,
investigation on encoder-decoder models is still lacking, and where existing studies [55, 78, 90, 93, 113] commonly set
more efforts are needed to investigate the performance of the threshold of gradient clipping to 1.0 and weight decay
encoder-decoder models at a large scale. rate to 0.1. However, with the scaling of LLMs, the training
More research efforts about the discussions on architec- loss spike is also more likely to occur, leading to unstable

TABLE 8: Detailed optimization settings of several existing LLMs.

Model                  | Batch Size (#tokens) | Learning Rate         | Warmup | Decay Method        | Optimizer | Precision Type | Weight Decay | Grad Clip | Dropout
GPT3 (175B)            | 32K→3.2M             | $6 \times 10^{-5}$    | yes    | cosine decay to 10% | Adam      | FP16           | 0.1          | 1.0       | -
PanGu-α (200B)         | -                    | $2 \times 10^{-5}$    | -      | -                   | Adam      | -              | 0.1          | -         | -
OPT (175B)             | 2M                   | $1.2 \times 10^{-4}$  | yes    | manual decay        | AdamW     | FP16           | 0.1          | -         | 0.1
PaLM (540B)            | 1M→4M                | $1 \times 10^{-2}$    | no     | inverse square root | Adafactor | BF16           | $lr^2$       | 1.0       | 0.1
BLOOM (176B)           | 4M                   | $6 \times 10^{-5}$    | yes    | cosine decay to 10% | Adam      | BF16           | 0.1          | 1.0       | 0.0
MT-NLG (530B)          | 64K→3.75M            | $5 \times 10^{-5}$    | yes    | cosine decay to 10% | Adam      | BF16           | 0.1          | 1.0       | -
Gopher (280B)          | 3M→6M                | $4 \times 10^{-5}$    | yes    | cosine decay to 10% | Adam      | BF16           | -            | 1.0       | -
Chinchilla (70B)       | 1.5M→3M              | $1 \times 10^{-4}$    | yes    | cosine decay to 10% | AdamW     | BF16           | -            | -         | -
Galactica (120B)       | 2M                   | $7 \times 10^{-6}$    | yes    | linear decay to 10% | AdamW     | -              | 0.1          | 1.0       | 0.1
LaMDA (137B)           | 256K                 | -                     | -      | -                   | -         | BF16           | -            | -         | -
Jurassic-1 (178B)      | 32K→3.2M             | $6 \times 10^{-5}$    | yes    | -                   | -         | -              | -            | -         | -
LLaMA (65B)            | 4M                   | $1.5 \times 10^{-4}$  | yes    | cosine decay to 10% | AdamW     | -              | 0.1          | 1.0       | -
LLaMA 2 (70B)          | 4M                   | $1.5 \times 10^{-4}$  | yes    | cosine decay to 10% | AdamW     | -              | 0.1          | 1.0       | -
Falcon (40B)           | 2M                   | $1.85 \times 10^{-4}$ | yes    | cosine decay to 10% | AdamW     | BF16           | 0.1          | -         | -
GLM (130B)             | 0.4M→8.25M           | $8 \times 10^{-5}$    | yes    | cosine decay to 10% | AdamW     | FP16           | 0.1          | 1.0       | 0.1
T5 (11B)               | 64K                  | $1 \times 10^{-2}$    | no     | inverse square root | Adafactor | -              | -            | -         | 0.1
ERNIE 3.0 Titan (260B) | -                    | $1 \times 10^{-4}$    | -      | -                   | Adam      | FP16           | 0.1          | 1.0       | -
PanGu-Σ (1.085T)       | 0.5M                 | $2 \times 10^{-5}$    | yes    | -                   | Adam      | FP16           | -            | -         | -
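
The warm-up and decay settings listed in Table 8 can be sketched in a few lines. The following is a minimal illustration, assuming a linear warm-up over a fixed fraction of the steps followed by cosine decay to 10% of the peak value. The function names and default values are illustrative choices (the peak value is taken from the GPT-3 row, the warm-up fraction from the range quoted in Section 4.3.1), not any model's actual training code. The second helper simply evaluates the Adafactor setting in which the second-moment decay equals one minus the step count raised to the power of -0.8.

```python
import math


def learning_rate(step, total_steps, peak_lr=6e-5, warmup_frac=0.005, final_frac=0.1):
    """Linear warm-up to peak_lr, then cosine decay to final_frac * peak_lr.

    All defaults are illustrative assumptions: 6e-5 is the GPT-3 peak value,
    and 0.5% warm-up is the upper end of the quoted 0.1%-0.5% range.
    """
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warm-up step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    min_lr = final_frac * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def adafactor_beta2(k):
    """Adafactor second-moment decay, 1 - k^(-0.8), at training step k >= 1."""
    return 1.0 - k ** -0.8


if __name__ == "__main__":
    total = 10_000
    for s in (0, 49, 50, 5_000, 10_000):
        print(s, learning_rate(s, total))
    print(adafactor_beta2(10))
```

In a real training loop a schedule like this would normally be supplied through the scheduler utilities of the deep learning framework in use; the standalone function is only meant to make the shape of the schedule explicit.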

training. To mitigate this problem, PaLM [56] and OPT [90] use a simple strategy that restarts the training process from an earlier checkpoint before the occurrence of the spike and skips over the data that may have caused the problem. Further, GLM [93] finds that abnormal gradients of the embedding layer usually lead to spikes, and proposes to shrink the embedding layer gradients to alleviate it.

4.3.2 Scalable Training Techniques

As the model and data sizes increase, it has become challenging to efficiently train LLMs under limited computational resources. In particular, two primary technical issues need to be resolved, i.e., increasing training throughput and loading larger models into GPU memory. In this part, we review several widely used approaches in existing work to address these two challenges, namely 3D parallelism [75, 321, 322] and mixed precision training [323], and also give general suggestions about how to utilize them for training.

3D Parallelism. 3D parallelism is a combination of three commonly used parallel training techniques, namely data parallelism, pipeline parallelism [321, 322], and tensor parallelism [75]^24. We next introduce the three parallel training techniques.

• Data parallelism. Data parallelism is one of the most fundamental approaches to improving the training throughput. It replicates the model parameters and optimizer states across multiple GPUs and then distributes the whole training corpus across these GPUs. In this way, each GPU only needs to process the data assigned to it, and performs forward and backward propagation to obtain the gradients. The gradients computed on different GPUs are then aggregated to obtain the gradients of the entire batch for updating the models on all GPUs. Since the gradient calculations are performed independently on different GPUs, the data parallelism mechanism is highly scalable: training throughput can be improved by increasing the number of GPUs. Furthermore, this technique is simple to implement, and most popular deep learning libraries, such as TensorFlow and PyTorch, have already implemented data parallelism.

• Pipeline parallelism. Pipeline parallelism aims to distribute the different layers of an LLM across multiple GPUs. In particular, in the case of a Transformer model, pipeline parallelism loads consecutive layers onto the same GPU, to reduce the cost of transmitting the computed hidden states or gradients between GPUs. However, a naive implementation of pipeline parallelism may result in a lower GPU utilization rate, as each GPU has to wait for the previous one to complete its computation, leading to the unnecessary cost of bubble overhead [321]. To reduce these bubbles in pipeline parallelism, GPipe [321] and PipeDream [322] propose the techniques of padding multiple batches of data and asynchronous gradient update to improve the pipeline efficiency.

• Tensor parallelism. Tensor parallelism is also a commonly used technique that aims to decompose the LLM for multi-GPU loading. Unlike pipeline parallelism, tensor parallelism focuses on decomposing the tensors (the parameter matrices) of LLMs. For a matrix multiplication operation $Y = XA$ in the LLM, the parameter matrix $A$ can be split by column into two submatrices, $A_1$ and $A_2$, so that $Y = [XA_1, XA_2]$. By placing matrices $A_1$ and $A_2$ on different GPUs, the matrix multiplication is performed on two GPUs in parallel, and the final result is obtained by combining the outputs from the two GPUs through cross-GPU communication. Currently, tensor parallelism is supported in several open-source libraries, e.g., Megatron-LM [75], and can be extended to higher-dimensional tensors. Also, Colossal-AI has implemented tensor parallelism for higher-dimensional tensors [324–326] and proposed sequence parallelism [327] especially for sequence data, which can further decompose the attention operation of the Transformer model.

24. Model parallelism is a broader term that includes tensor parallelism and pipeline parallelism in some work [75].

Mixed Precision Training. In previous PLMs (e.g., BERT [23]), 32-bit floating-point numbers, also known as FP32, have been predominantly used for pre-training. In recent years, to pre-train extremely large language models, some studies [323] have started to utilize 16-bit floating-point numbers (FP16), which reduces memory usage and communication overhead. Additionally, as popular NVIDIA GPUs (e.g., A100) have twice as many FP16 computation units as FP32 ones, the computational efficiency of FP16 can be further improved. However, existing work has found that FP16 may lead to a loss of computational accuracy [64, 78], which affects the final model performance. To alleviate this, an alternative called Brain Floating Point (BF16) has been used for training, which allocates more exponent bits and fewer significand bits than FP16. For pre-training, BF16 generally performs better than FP16 on representation accuracy [78].

Overall Training Suggestion. In practice, the above training techniques, especially 3D parallelism, are often jointly used to improve training throughput and to load large models. For instance, researchers have combined 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism to train BLOOM [78] on 384 A100 GPUs. Currently, open-source libraries like DeepSpeed [74], Colossal-AI [203], and Alpa [328] support the three parallel training methods well. To reduce memory redundancy, ZeRO, FSDP, and activation recomputation techniques [77, 329] can also be employed for training LLMs; these have already been integrated into DeepSpeed, PyTorch, and Megatron-LM. In addition, mixed precision training techniques such as BF16 can also be leveraged to improve training efficiency and reduce GPU memory usage, although they require hardware support (e.g., A100 GPUs). Because training large models is a time-intensive process, it would be useful to forecast the model performance and detect abnormal issues at an early stage. For this purpose, GPT-4 [46] has recently introduced a new mechanism called predictable scaling built on a deep learning stack, enabling the performance prediction of large models with a much smaller model, which might be quite useful for developing LLMs. In practice, one can further leverage the supporting training techniques of mainstream deep learning frameworks. For instance, PyTorch supports the data parallel training algorithm FSDP [330] (i.e., fully sharded data parallel), which allows for partial offloading of training computations to CPUs if desired.

5 POST-TRAINING OF LLMS

After pre-training, LLMs can acquire the general abilities for solving various tasks. However, an increasing number of studies have shown that LLMs’ abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning. The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences. Further, we will also discuss efficient tuning and quantization for model adaptation in resource-limited settings. In what follows, we will introduce the four parts in detail.

5.1 Instruction Tuning

In essence, instruction tuning is the approach of fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [67], which is highly related to supervised fine-tuning [66] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 67, 69], even in a multilingual setting [94].

A recent survey [331] presents a systematic overview of the research on instruction tuning. In comparison, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. In addition, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [66] and GPT-4 [46].

5.1.1 Formatted Instance Construction

Generally, an instruction-formatted instance consists of a task description (called an instruction), an optional input, the corresponding output, and a small number of demonstrations (optional). As important public resources, existing studies have released a large amount of labeled data formatted in natural language (see the list of available resources in Table 3), as introduced in Section 3.3.1. Next, we introduce three major methods for constructing formatted instances (see an illustration in Figure 11) and then discuss several key factors for instance construction.

Formatting NLP Task Datasets. Before instruction tuning was proposed, several early studies [181, 332, 333] collected instances from a diverse range of traditional NLP tasks (e.g., text summarization, text classification, and translation) to create supervised multi-task training datasets. As a major source of instruction tuning instances, these multi-task training datasets are convenient to format with natural language task descriptions. Specifically, recent work [28, 66, 67, 88] augments the labeled datasets with human-written task descriptions, which instruct LLMs to understand the tasks by explaining the task goal. For example, in Figure 11(a), a task description “Please answer this question” is added for each example in the question-answering task. After instruction tuning, LLMs can generalize well to other unseen tasks by following their task descriptions [28, 67, 69]. In particular, it has been shown that instructions are the crucial factor in task generalization ability for LLMs [67]: fine-tuning the model on labeled datasets with the task descriptions removed results in a dramatic drop in model performance. To better generate labeled instances for instruction tuning, a crowd-sourcing platform, PromptSource [180], has been proposed to effectively create, share, and verify the task descriptions for different datasets. To enrich the training instances, several studies [28, 181, 334] also try to invert the input-output pairs of existing instances with specially designed task descriptions for instruction tuning. For instance, given a question-answer pair, we can create a new instance by predicting the answer-conditioned question (e.g., “Please generate a question based on the answer:”).

Formatting Daily Chat Data. Although a large number of training instances have been formatted with instructions,

[Figure 11 panels: (a) Formatting Task Datasets; (b) Formatting Daily Chat Data; (c) Formatting Synthetic Data]

Fig. 11: An illustration of instance formatting and three different methods for constructing the instruction-formatted instances.
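
The instance structure described in Section 5.1.1 (a task description, an optional input, the output, and optional demonstrations) can be written down concretely. The sketch below is only an illustration: the `Instance` class, its field names, and the `format_qa` / `invert_qa` helpers are hypothetical, and the prompt layout (parts separated by blank lines, `Q:` / `A:` prefixes) is an assumption rather than a format prescribed by any dataset. The two task descriptions are the ones quoted in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One instruction-formatted training instance (hypothetical layout)."""
    instruction: str                 # task description
    output: str                      # target text the model should produce
    input: str = ""                  # optional input
    demonstrations: list = field(default_factory=list)  # optional (input, output) pairs

    def to_prompt(self):
        """Join the instruction, the demonstrations, and the input into one prompt."""
        parts = [self.instruction]
        for demo_input, demo_output in self.demonstrations:
            parts.append(f"{demo_input}\n{demo_output}")
        if self.input:
            parts.append(self.input)
        return "\n\n".join(parts)


def format_qa(question, answer, demos=()):
    """Format a question-answer pair by adding a task description."""
    return Instance(
        instruction="Please answer this question:",
        input=f"Q: {question}",
        output=f"A: {answer}",
        demonstrations=[(f"Q: {q}", f"A: {a}") for q, a in demos],
    )


def invert_qa(question, answer):
    """Invert the pair: the answer becomes the input and the question the output."""
    return Instance(
        instruction="Please generate a question based on the answer:",
        input=f"A: {answer}",
        output=f"Q: {question}",
    )


if __name__ == "__main__":
    demos = [("What is the capital of France?", "Paris.")]
    instance = format_qa("What is the capital of China?", "Beijing.", demos)
    print(instance.to_prompt())
    print(instance.output)
    print(invert_qa("What is the capital of China?", "Beijing.").to_prompt())
```

Keeping the demonstrations as a separate field makes it easy to produce both zero-shot and few-shot variants of the same instance.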

they mainly come from public NLP datasets, either lacking instruction diversity or mismatching real human needs [66]. To overcome this issue, InstructGPT [66] proposes to take the queries that real users have submitted to the OpenAI API as the task descriptions. Additionally, to enrich the task diversity, human labelers are also asked to compose instructions for real-life tasks, including open-ended generation, open question answering, brainstorming, and chatting. Then, another group of labelers directly answers these instructions as the output. Finally, one instruction (i.e., the collected user query) and the expected output (i.e., the human-written answer) are paired as a training instance. Note that InstructGPT also employs these real-world tasks formatted in natural language for alignment tuning (discussed in Section 5.2). Further, GPT-4 [46] has designed potentially high-risk instructions and guided the model to reject these instructions through supervised fine-tuning for safety concerns. Considering the absence of high-quality public chat data, several studies have also collected users’ chat requests as input data, and then utilized ChatGPT or GPT-4 to generate responses as output data. A notable example of such a dataset is the conversational data from ShareGPT [153]. Additionally, Dolly [185] and OpenAssistant [186] have further released their conversation data, which has been carefully labeled by human annotators to attain a high level of quality.

Formatting Synthetic Data. To reduce the burden of human annotation or manual collection, several semi-automated approaches [147] have been proposed for constructing instances by feeding existing instances into LLMs to synthesize diverse task descriptions and instances. As illustrated in Figure 11(c), the Self-Instruct method only needs 175 instances as the initial task pool. Then, a few instances are randomly selected from the pool as demonstrations, and an LLM is prompted to generate new instructions and corresponding input-output pairs. After quality and diversity filtering, the newly generated instances are added to the task pool. Hence, the synthetic method is an effective and economical way to generate large-scale instruction data for LLMs. However, the instances generated by the Self-Instruct method might be simplistic or lack diversity. To improve the quality of synthetic instructions, WizardLM [335] introduces Evol-Instruct, which proposes in-depth and in-breadth evolving to enrich the complexity and diversity of the instances. Furthermore, Self-Align [336] establishes multiple human-aligned principles to filter the synthesized instances. It then employs these instances to train an LLM in order to yield more aligned instances. To enhance the quality of the instance output, researchers directly adopt human-written texts as the output and synthesize corresponding instructions using ICL examples [337].

Key Factors for Instruction Dataset Construction. The quality of instruction instances has an important impact on the performance of the model. Here, we discuss some essential factors for instance construction.

• Scaling the instructions. It has been widely shown that scaling the number of tasks can largely enhance the generalization ability of LLMs [28, 67, 88]. As the number of tasks increases, the model performance initially shows a continuous growth pattern, while the gain becomes negligible when it reaches a certain level [69, 88]. A plausible speculation is that a certain number of representative tasks can provide relatively sufficient knowledge and adding more tasks may not bring additional gains [69]. Also, it is beneficial to enhance the diversity of the task descriptions in several aspects, such as length, structure, and creativity [28]. As for the number of instances per task, it has been found that a small number of instances can usually saturate the generalization performance of the model on a specific task [67, 69]. In particular, several recent studies [338, 339] have explored the effect of fine-tuning with a small amount of high-quality instruction data (e.g., one or a few thousand instances), showing very promising results on the evaluation tasks. In contrast, another line of studies continues to explore the scaling effect of instruction data [340, 341]. For example, Orca [340] scales up the synthesized instances to

5 million with step-by-step explanations, and it achieves superior performance across a wide range of tasks.

• Formatting design. As an important factor, the design of the natural language format also highly impacts the generalization performance of LLMs [88]. Typically, we can add task descriptions and optional demonstrations to the input-output pairs of existing datasets, where the task description is the key part for LLMs to understand the task [88]. Further, using an appropriate number of exemplars as demonstrations can lead to substantial improvements [69], which also alleviates the model sensitivity to instruction engineering [67, 69]. However, incorporating other components (e.g., things to avoid, reasons, and suggestions) into instructions may have a negligible or even adverse effect on the performance of LLMs [88, 179]. Recently, to elicit the step-by-step reasoning ability of LLMs, some work [69] proposes to include chain-of-thought (CoT) examples for some reasoning datasets, such as arithmetic reasoning. It has been shown that fine-tuning LLMs with both CoT and non-CoT examples can lead to good performance across various reasoning tasks, including those that require multi-hop reasoning ability (e.g., commonsense question answering and arithmetic reasoning) as well as those that do not need such reasoning (e.g., sentiment analysis and extractive question answering) [69, 95].

• Instruction quality improvement. Data quality is very important for the performance of instruction tuning, and a large body of work has been proposed to further improve the quality of existing instruction datasets. Typically, these methods mostly rely on carefully designed prompts to guide LLMs to refine or rewrite the given instruction. WizardLM [335] aims to complexify and diversify the Alpaca dataset [187] by devising prompts to widen and deepen the required knowledge of given instructions. It also crafts a filter strategy to remove low-quality instructions. To further provide fine-grained knowledge guidance, recent work also incorporates a knowledge taxonomy into the input prompt, e.g., knowledge key points [342] and the human-AI conversation topic taxonomy [343]. To guarantee the instruction quality, early methods mainly employ closed-source APIs or powerful open-source LLMs, which incurs a huge cost for large-scale instruction synthesis. Considering this issue, recent studies widely explore the potential of relatively small models for data synthesis. For instance, JiuZhang3.0 [344] fine-tunes a 7B language model to synthesize questions by distilling the knowledge from GPT-4, and then utilizes it to synthesize massive high-quality instructions based on pre-training corpus. This approach can achieve better performance on mathematical reasoning tasks than baseline methods, with only 20% of the data synthesis cost.

• Instruction selection. As a large number of instruction datasets have been proposed, it is non-trivial to select the high-quality ones from them to construct the training dataset. Generally, existing work either leverages quality estimation metrics or employs LLMs as the judge model to rank all the instruction instances, and then selects those with relatively higher scores. Concretely, for metrics, perplexity and other heuristic measurements (e.g., length) [345] have been widely used in practice, e.g., we can consider removing high-perplexity or very short instructions, which might correspond to low-quality ones. To better estimate the effect of an instruction for the LLM capability, more complex metrics (e.g., IFD [346]) have also been proposed, which are computed by combining multiple simple metrics. Additionally, diversity-aware sampling methods have been introduced to ensure the overall coverage of representative instruction data [347]. Besides, when downstream task data is available, cross-instance gradient similarity can be employed to measure the value of training instances for the target task. LESS [348] computes gradients for both downstream validation and training instruction data, to evaluate the contribution of instruction data based on extensions of influence function [349].

To summarize, diversity and quality of instructions are important factors to consider when scaling the number of instances [338]. As the capacities of LLMs improve, data synthesis methods have become the mainstream approach for generating large amounts of instruction data. Following this trend, there are increasingly more automatically generated instruction datasets available, and selection and refining methods are key to using these datasets effectively. To help readers understand how different factors affect instruction tuning, we conduct an empirical study by experimenting with multiple specially constructed instruction datasets in Section 5.1.4.

5.1.2 Instruction Tuning Strategies

Unlike pre-training, instruction tuning is often more efficient since only a moderate number of instances are used for training. Since instruction tuning can be considered as a supervised training process, its optimization is different from pre-training in several aspects [69], such as the training objective (i.e., sequence-to-sequence loss) and optimization configuration (e.g., smaller batch size and learning rate), which require special attention in practice. In addition to these optimization configurations, there are also four important aspects to consider for instruction tuning:

Balancing the Data Distribution. Since instruction tuning involves a mixture of different tasks, it is important to balance the proportion of different tasks during fine-tuning. A widely used method is the examples-proportional mixing strategy [82], i.e., combining all the datasets and sampling each instance equally from the mixed datasets. Furthermore, increasing the sampling ratio of high-quality collections (e.g., FLAN [67] and P3 [180]) can generally lead to performance improvement according to recent findings [69, 95]. Further, it is common to set a maximum cap on the number of examples that a dataset can contribute during instruction tuning [82], which prevents larger datasets from overwhelming the entire distribution [82, 95]. In practice, the maximum cap is typically set to several thousands or tens of thousands according to different datasets [67, 69]. Recently, it has been empirically found that existing instruction datasets (Table 3) mainly focus on enhancing LLMs’ capabilities in certain aspects, and a single dataset alone cannot lead to a comprehensive enhancement in model capacity [350]. Therefore, it is often suggested to use a mixture of existing instruction datasets to achieve a balanced improvement in different capacities, including NLP task data (e.g., FLAN v2 [351]),

TABLE 9: Basic statistics of the required number of GPUs, tuning time, batch size (denoted as BS) per device (full tuning and LoRA tuning), and inference rate (the number of generated tokens per second). Our experiments are conducted based on two Linux servers having 8 A800-80G SXM4 GPUs with 6 NVSwitch and 8 3090-24G GPUs, respectively. The major difference between A800 and A100 lies in the NVLink interconnect speed. Thus, our estimations about training and inference efficiency would be slightly improved for A100, while the memory consumption would remain the same. For full tuning experiments, we use data parallel training, ZeRO Stage 3, BF16, and gradient checkpointing. Additionally, the LoRA tuning can be executed on one 80G GPU utilizing INT8 quantization with the rank set to 16. All the experiments are conducted with the Alpaca-52K dataset by training LLaMA models for three epochs. The max sequence length for both training settings is set to 512. The inference experiments are performed with the batch size set to 1.

Models      | A800 Full Tuning | A800 LoRA Tuning | A800 Inference (16-bit) | 3090 Inference (16-bit) | 3090 Inference (8-bit)
            | #GPU  BS  Time   | #GPU  BS  Time   | #GPU  #Token/s          | #GPU  #Token/s          | #GPU  #Token/s
LLaMA (7B)  | 2     8   3.0h   | 1     80  3.5h   | 1     36.6              | 1     24.3              | 1     7.5
LLaMA (13B) | 4     8   3.1h   | 1     48  5.1h   | 1     26.8              | 2     9.9               | 1     4.5
LLaMA (30B) | 8     4   6.1h   | 1     24  14.3h  | 1     17.7              | 4     3.8               | 2     2.6
LLaMA (65B) | 16    2   11.2h  | 1     4   60.6h  | 2     8.8               | 8     2.0               | 4     1.5
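
The examples-proportional mixing strategy with a maximum cap, described in Section 5.1.2, amounts to sampling each dataset with probability proportional to its capped size. A minimal sketch follows. The function names, dataset names, and dataset sizes are made up for illustration, and the cap value is an arbitrary choice within the "several thousands or tens of thousands" range mentioned in the text.

```python
import random


def mixing_rates(dataset_sizes, max_cap):
    """Return the sampling probability of each dataset.

    Each dataset contributes min(size, max_cap) examples to the mixture, so
    sampling instances uniformly from the capped mixture picks a dataset with
    probability proportional to its capped size.
    """
    capped = {name: min(size, max_cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}


def sample_dataset_names(rates, num_samples, seed=0):
    """Draw dataset names according to the mixing rates (seeded for repeatability)."""
    rng = random.Random(seed)
    names = list(rates)
    weights = [rates[name] for name in names]
    return rng.choices(names, weights=weights, k=num_samples)


if __name__ == "__main__":
    # Hypothetical dataset sizes, for illustration only.
    sizes = {"task_a": 500_000, "task_b": 20_000, "task_c": 2_000}
    rates = mixing_rates(sizes, max_cap=30_000)
    print(rates)
    print(sample_dataset_names(rates, num_samples=5))
```

Without the cap, the largest dataset in this example would receive over 95% of the samples; with the cap it is limited to the same weight as a dataset of 30,000 examples. Raising the sampling ratio of high-quality collections could be expressed by multiplying their capped counts by a weight before normalizing.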

chat data (e.g., ShareGPT [153]), and synthetic data (e.g., GPT4-Alpaca [352]).

Combining Instruction Tuning and Pre-Training. To make the tuning process more effective and stable, OPT-IML [95] incorporates pre-training data during instruction tuning, which can be regarded as regularization for model tuning. Further, instead of using a separate two-stage process (pre-training then instruction tuning), some studies attempt to train a model from scratch with a mixture of pre-training data (i.e., plain texts) and instruction tuning data (i.e., formatted datasets) using multi-task learning [82]. Specifically, GLM-130B [93] and Galactica [35] integrate instruction-formatted datasets as a small proportion of the pre-training corpora to pre-train LLMs, which potentially achieves the advantages of pre-training and instruction tuning at the same time.

Multi-stage Instruction Tuning. For instruction tuning, there are two kinds of important instruction data, namely task-formatted instructions and daily chat instructions. Generally, the former has a significantly larger volume than the latter. It is important to balance the training with the two kinds of instruction data. In addition to carefully mixing different instruction data, we can also adopt a multi-stage instruction tuning strategy [341], where LLMs are first fine-tuned with large-scale task-formatted instructions and subsequently fine-tuned on daily chat ones. To avoid the capacity forgetting issue, it is also useful to add an amount of task-formatted instructions at the second stage. Actually, such a multi-stage tuning strategy can also be applied to other settings for instruction tuning. For example, we can schedule different fine-tuning stages with progressively increased levels of difficulty and complexity, and gradually improve the capacities of LLMs to follow complex instructions.

Other Practical Tricks. In practice, there are also several useful strategies and tricks that are helpful to improve the fine-tuning performance of LLMs. We list several representative ones as follows:
• Efficient training for multi-turn chat data. Given a multi-turn chat example (the conversation between a user and chatbot), a straightforward fine-tuning way is to split it into multiple context-response pairs for training: an LLM is fine-tuned to generate the response based on the corresponding context for all splits (i.e., at each utterance from the user). In such a fine-tuning way, it is apparent that there exist overlapping utterances in the split examples from a conversation. To save the training cost, Vicuna [152] has adopted an efficient way that feeds the whole conversation into the LLM, but relies on a loss mask that only computes the loss on the responses of the chatbot for training. It can significantly reduce the compute costs derived from the overlapped utterances.
• Establishing self-identification for LLM. To deploy LLMs for real-world applications, it is necessary to establish their identity and make LLMs aware of this identity information, such as name, developer and affiliation. A practical way is to create identity-related instructions for fine-tuning the LLM. It is also feasible to prefix the input with the self-identification prompt, e.g., “The following is a conversation between a human and an AI assistant called CHATBOTNAME, developed by DEVELOPER.”, where CHATBOTNAME and DEVELOPER refer to the name and developer of the chatbot, respectively.
In addition to the above practical strategies and tricks, existing work has also used other tricks, e.g., concatenating multiple examples into a single sequence to approach the max length [353].

5.1.3 The Effect of Instruction Tuning
In this part, we discuss the effect of instruction tuning on LLMs in three major aspects.

Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [69]. Recent studies have experimented with language models in multiple scales (ranging from 77M to 540B), showing that the models of different scales can all benefit from instruction tuning [69, 334], yielding improved performance as the parameter scale increases [94]. Further, smaller models with instruction tuning can even perform better than larger models without fine-tuning [28, 69]. Besides the model scale, instruction tuning demonstrates consistent improvements in various model architectures, pre-training objectives, and model adaptation methods [69]. In practice, instruction tuning offers a general approach to enhancing the abilities of existing language models [69] (including

small-sized PLMs). Also, it is much less costly than pre-training, since the amount of instruction data required by LLMs is significantly smaller than pre-training data.

Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered as an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [69]. A large number of studies have confirmed the effectiveness of instruction tuning to achieve superior performance on both seen and unseen tasks [95, 334]. Also, instruction tuning has been shown to be useful in alleviating several weaknesses of LLMs (e.g., repetitive generation or complementing the input without accomplishing a certain task) [66, 69], leading to a superior capacity to solve real-world tasks for LLMs. Furthermore, LLMs trained with instruction tuning can generalize to related tasks across languages. For example, BLOOMZ-P3 [94] is fine-tuned based on BLOOM [78] using English-only task collection P3 [180]. Interestingly, BLOOMZ-P3 can achieve a more than 50% improvement in multilingual sentence completion tasks compared to BLOOM, which shows that instruction tuning can help LLMs acquire general task skills from English-only datasets and transfer such skills into other languages [94]. In addition, it has been found that using English-only instructions can produce satisfactory results on multilingual tasks [94], which helps reduce the effort of instruction engineering for a specific language.

Domain Specialization. Existing LLMs have showcased superior capabilities in traditional NLP tasks (e.g., generation and reasoning) and daily questions. However, they may still lack domain knowledge to accomplish specific tasks, such as medicine, law, and finance (see Section 8 for a detailed discussion of LLMs in different applications). Instruction tuning is an effective approach to adapting existing general LLMs to be domain-specific experts. For instance, researchers propose to fine-tune Flan-PaLM [69] using medical datasets to create Med-PaLM [354], a medical knowledge assistant that achieves performance levels comparable to those of expert clinicians. Furthermore, a recent study [355] fine-tunes FLAN-T5 to support e-commerce recommender systems with natural language instructions, showing strong performance in a variety of recommendation tasks. There are also several open-sourced medical models instruction-tuned based on LLaMA [57], such as BenTsao [356]. Also, researchers explore instruction tuning on law [357], finance [358], and arithmetic computation [359].

5.1.4 Empirical Analysis for Instruction Tuning
Fine-tuning LLMs with different instruction sets tends to lead to model variants with varied performance on downstream tasks. In this section, we will explore the effect of different types of instructions in fine-tuning LLMs (i.e., LLaMA (7B) and LLaMA (13B)^25), as well as examine the usefulness of several instruction improvement strategies.

25. Due to the limit of computational resources, we cannot conduct large-scale experiments on larger LLaMA variants right now, which would be scheduled in a future version.

Instruction Datasets. According to the discussion in Section 5.1.1, we mainly consider three common kinds of instructions as follows:
• Task-specific instructions. For the first type of instructions, we adopt the most commonly-used multi-task instruction dataset, FLAN-T5 [69], which contains 1,836 tasks and over 15M instructions by combining four data mixtures from prior work.
• Daily chat instructions. Instructions of this type are conversations posed by users about daily life, which are more closely related to real-life scenarios. We adopt the ShareGPT instruction set, consisting of 63K real-user instructions. It has been used as the core instructions for Vicuna.
• Synthetic instructions. In addition to reusing existing instructions, we can also automatically synthesize massive instructions using LLMs. We adopt the popular synthetic instruction dataset Self-Instruct-52K [147], consisting of 52K instructions paired with about 82K instance inputs and outputs. These generated instructions have a similar data distribution as the human-written seed tasks (e.g., grammar checking, brainstorming).
As the original FLAN-T5 dataset is very large (i.e., over 15M), we randomly sample 80,000 instructions from it for conducting a fair comparison with other instruction datasets (i.e., ShareGPT and Self-Instruct-52K) at a similar scale. In our experiments, we test on each individual instruction set to explore their own effects and also examine their combinatorial effects on model performance.

Improvement Strategies. Although real-world instructions from human users are more suitable for fine-tuning LLMs, it is difficult to collect them at a large scale. As alternatives to human-generated instructions, most existing research mainly adopts synthetic instructions generated by LLMs. However, there are some potential problems with synthetic instructions, such as poor topic diversity and uneven instruction difficulty (either too simple or too difficult). Thus, it is necessary to improve the quality of the synthetic instructions. Next, we summarize four major improvement strategies widely used in existing work as follows:
• Enhancing the instruction complexity. As discussed in existing work [335], enhancing the complexity of instructions can improve the model capacity of LLMs in following complex instructions, e.g., including more task demands or requiring more reasoning steps. To validate this strategy, we follow WizardLM [335] by gradually increasing the complexity levels, e.g., adding constraints, increasing reasoning steps, and complicating the input. We leverage the publicly released WizardLM-70K instructions [335] as the complexity-enhanced instruction dataset, which has been generated via the above enhancement approach based on the Self-Instruct-52K dataset [335].
• Increasing the topic diversity. In addition to the complexity, improving the topic diversity of the instruction dataset can help elicit different abilities of LLMs on diverse tasks in the real world [336]. However, it is difficult to directly control the self-instruct process for generating diverse instructions. Following YuLan-Chat [341], we employ ChatGPT to rewrite the instructions from the Self-Instruct-52K dataset for adapting them into 293 topics via specific prompts. Finally, we obtain 70K instructions as the diversity-increased dataset.

TABLE 10: Results of instruction-tuning experiments (all in a single-turn conversation) based on the LLaMA (7B) and
LLaMA (13B) model under the chat and QA setting. We employ four instruction improvement strategies on the Self-
Instruct-52K dataset, i.e., enhancing the complexity (w/ complexity), increasing the diversity (w/ diversity), balancing the
difficulty (w/ difficulty), and scaling the instruction number (w/ scaling). ∗ Since we select the LLaMA (7B)/(13B) model
fine-tuned on Self-Instruct-52K as the baseline, we omit the win rate of the fine-tuned model with Self-Instruct-52K against
itself.

Models        Dataset Mixtures      Instruction   Lexical     Chat         QA      QA
                                    Numbers       Diversity   AlpacaFarm   MMLU    BBH3k
LLaMA (7B)    ① FLAN-T5             80,000        48.48       23.77        38.58   32.79
              ② ShareGPT            63,184        77.31       81.30        38.11   27.71
              ③ Self-Instruct-52K   82,439        25.92       /∗           37.52   29.81
              ②+③                   145,623       48.22       71.36        41.26   28.36
              ①+②+③                 225,623       48.28       70.00        43.69   29.69
              ③ Self-Instruct-52K   82,439        25.92       /∗           37.52   29.81
              w/ complexity         70,000        70.43       76.96        39.73   33.25
              w/ diversity          70,000        75.59       81.55        38.01   30.03
              w/ difficulty         70,000        73.48       79.15        32.55   31.25
              w/ scaling            220,000       57.78       51.13        33.81   26.63
LLaMA (13B)   ① FLAN-T5             80,000        48.48       22.12        34.12   34.05
              ② ShareGPT            63,184        77.31       77.13        47.49   33.82
              ③ Self-Instruct-52K   82,439        25.92       /∗           36.73   25.43
              ②+③                   145,623       48.22       72.85        41.16   29.49
              ①+②+③                 225,623       48.28       69.49        43.50   31.16
              ③ Self-Instruct-52K   82,439        25.92       /∗           36.73   25.43
              w/ complexity         70,000        70.43       77.94        46.89   35.75
              w/ diversity          70,000        75.59       78.92        44.97   36.40
              w/ difficulty         70,000        73.48       80.45        43.15   34.59
              w/ scaling            220,000       57.78       58.12        38.07   27.28
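
The sizes of the mixture rows in Table 10 follow from concatenating the individual instruction sets after FLAN-T5 has been subsampled to 80,000 instructions. A minimal sketch of this preparation step is given below, assuming each dataset is already available as a Python list. The helper names `subsample` and `mix`, the fixed seed, the shuffling of the merged set, and the toy placeholder records are illustrative assumptions and not details of the released experiment code.

```python
import random


def subsample(instructions, k, seed=0):
    """Draw k instructions uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(instructions, k)


def mix(*datasets, seed=0):
    """Concatenate several instruction sets and shuffle the result.

    Shuffling is an assumption made here so that sources are interleaved
    during training; the source text does not specify the ordering.
    """
    merged = [example for dataset in datasets for example in dataset]
    random.Random(seed).shuffle(merged)
    return merged


# Toy stand-ins for the real datasets (hypothetical records).
# Only the sizes of ShareGPT and Self-Instruct-52K mirror Table 10.
flan_full = [{"source": "flan", "id": i} for i in range(100_000)]
sharegpt = [{"source": "sharegpt", "id": i} for i in range(63_184)]
self_instruct = [{"source": "self_instruct", "id": i} for i in range(82_439)]

flan_80k = subsample(flan_full, 80_000)             # mixture 1
mix_2_3 = mix(sharegpt, self_instruct)              # mixture 2+3
mix_1_2_3 = mix(flan_80k, sharegpt, self_instruct)  # mixture 1+2+3

print(len(flan_80k), len(mix_2_3), len(mix_1_2_3))
```

With these sizes the script reports 80,000, 145,623, and 225,623 examples, which agree with the Instruction Numbers column for ①, ②+③, and ①+②+③.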

• Scaling the instruction number. In addition to the above aspects, the number of instructions is also an important factor that may affect the model performance. Specifically, using more instructions can extend the task knowledge and improve the ability of instruction following for LLMs [69]. To examine this strategy, we sample new instructions from the synthesized instruction set released from the MOSS project [360], as they are also synthesized using the same self-instruct method [147]. We mix them with the Self-Instruct-52K dataset to compose a larger one containing 220K instructions.
• Balancing the instruction difficulty. As the synthetic instructions tend to contain too easy or too hard ones, it is likely to result in training instability or even overfitting for LLMs. To explore the potential effects, we leverage the perplexity score of LLMs to estimate the difficulty of instructions and remove too easy or too hard instructions. To generate the same scale of instructions for fair comparison, we adopt a LLaMA (7B) model to compute the perplexity for the 220K instructions from the large instruction dataset, and then keep 70K instructions of moderate perplexity scores as the difficulty-balanced dataset.

Experimental Setup. To conduct the experiments on the effect of instruction data, we leverage these new instruction datasets for tuning LLaMA, a popular LLM backbone that has been widely used for instruction-tuning. We use the code from YuLan-Chat [341] for our experiments, and train LLaMA 7B and 13B on a server of 8 A800-80G GPUs. All the hyper-parameter settings remain the same as Stanford Alpaca. To better evaluate the instruction following ability of fine-tuned models, we consider two settings, namely Chat setting and QA setting. The chat setting mainly utilizes user instructions and queries from daily chat, whereas the QA setting mainly employs question answering examples from existing NLP datasets. The evaluation on the chat setting is conducted based on the AlpacaFarm evaluation set [361]. Instead of using a full pairwise comparison, we select the LLaMA 7B and 13B models fine-tuned on Self-Instruct-52K as the reference baselines, and then compare them with other fine-tuned LLaMA 7B and 13B models using different instructions, respectively. Since our focus is to examine the usefulness of different strategies to generate the instructions, the model fine-tuned on Self-Instruct-52K can serve as a good reference. Following AlpacaFarm [361], for each comparison, we employ ChatGPT to automatically annotate which response from the two compared models is the best for the user query, and report the win rate (%) as the evaluation metric. For the QA setting, we select two benchmarks, MMLU [362] and BBH [363], and evaluate the accuracy based on their default settings by using heuristic rules to parse the answers from these LLMs.
For both instruction tuning and evaluation, we adopt the following prompt: “The following is a conversation between a human and an AI assistant. The AI assistant gives helpful, detailed, and polite answers to the user’s questions.\n[|Human|]:{input}\n[|AI|]:”. To reproduce our results, we release the code and data at the link: https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

Results and Analysis. The results using different instruction datasets based on 7B and 13B LLaMA are in Table 10. Next,

we summarize and analyze our findings in detail.
• Task-formatted instructions are more proper for the QA setting, but may not be useful for the chat setting. By comparing the performance of instruction tuning using FLAN-T5 with that of ShareGPT and Self-Instruct-52K, we can observe that FLAN-T5 mostly achieves a better performance on QA benchmarks while underperforming ShareGPT on the chat setting. The reason is that FLAN-T5 is composed of a mixture of instructions and examples from existing NLP tasks, e.g., translation and reading comprehension. As a result, LLaMA fine-tuned with FLAN-T5 performs better on QA tasks, but poorly on user queries. In contrast, ShareGPT consists of real-world human-ChatGPT conversations, which is able to better elicit LLaMA to follow user instructions in daily life, while it may not be suitable for accomplishing the QA tasks.
• A mixture of different kinds of instructions is helpful to improve the comprehensive abilities of LLMs. After mixing the three kinds of instructions for fine-tuning, we can see that the derived LLaMA variant (with FLAN-T5, ShareGPT and Self-Instruct-52K) performs well in both task settings. In MMLU, the performance of LLaMA (7B) can surpass the ones using an individual instruction set by a large margin, i.e., 43.69 vs. 38.58 (FLAN-T5). It shows that mixing multiple sources of instruction datasets is helpful to improve the performance of instruction-tuned LLMs, which scales the instruction number as well as increases the diversity.
• Enhancing the complexity and diversity of instructions leads to an improved model performance. By increasing the complexity and diversity of the Self-Instruct-52K dataset respectively, the chat and QA performance of LLaMA can be consistently improved, e.g., from 37.52 to 39.73 in MMLU for LLaMA (7B). It demonstrates that both strategies are useful to improve the instruction following ability of LLMs. Further, we can see that improving the complexity yields a larger performance improvement on QA tasks. The reason is that the QA tasks mostly consist of difficult questions for evaluating LLMs, which can be better solved by LLMs that have learned complex instructions at the fine-tuning stage.
• Simply increasing the number of instructions may not be that useful, and balancing the difficulty is not always helpful. As the results in Table 10 show, balancing the difficulty and increasing the number of fine-tuning instructions are not very helpful in our experiments. Especially for scaling the instruction number, it even hurts the performance, e.g., a decrease from 29.81 to 26.63 in BBH for LLaMA (7B). It shows that simply scaling the number of synthesized instructions without quality control may not be effective to improve the performance. Furthermore, fine-tuning with the instructions of moderate difficulty also performs well in the chat setting, while slightly decreasing the performance in the QA setting. A possible reason is that we filter complex and hard instructions with large perplexity scores, hurting the model performance in answering complex questions.
• A larger model scale leads to a better instruction following performance. By comparing the performance of LLaMA (7B) and LLaMA (13B) models fine-tuned with the same set of instruction data, we can see that LLaMA (13B) mostly achieves a better performance. It indicates that scaling the model size is helpful for improving the instruction following capability. Besides, we can see that the QA performance has been improved a lot, e.g., from 38.11 to 47.49 in MMLU. It is likely because the larger models generally have better knowledge utilization and reasoning capability [33, 55], which can accurately answer more complex questions.

Instruction Tuning Suggestions
To conduct instruction tuning on LLMs, one can prepare the computational resources according to the basic statistics about the required number of GPUs and tuning time in Table 9. After setting up the development environment, we recommend beginners to follow the code of Alpaca repository [187] for instruction tuning. Subsequently, one should select the base model and construct the instruction datasets as we discuss in this section. When computational resources for training are constrained, users can utilize LoRA for parameter-efficient tuning (see Section 5.3). As for inference, users can further use quantization methods to deploy LLMs on fewer or smaller GPUs (see Section 5.3).

5.2 Alignment Tuning
This part first presents the background of alignment with its definition and criteria, then focuses on the collection of human feedback data for aligning LLMs, and finally discusses the key technique of reinforcement learning from human feedback (RLHF) for alignment tuning.

5.2.1 Background and Criteria for Alignment
Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 67, 90]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [66, 364]. For LLMs, the language modeling objective pre-trains the model parameters by word prediction while lacking the consideration of human values or preferences. To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [66, 365]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such an alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called alignment tax in related literature [366].

Alignment Criteria. Recently, there is increasing attention on developing multifarious criteria to regulate the behaviors of LLMs. Here, we take three representative alignment criteria (i.e., helpful, honest, and harmless) as examples for discussion, which have been widely adopted in existing literature [66, 366]. In addition, there are other alignment criteria for LLMs from different perspectives including behavior, intent, incentive, and inner aspects [364], which are essentially similar (or at least with similar alignment techniques) to the above three criteria. It is also feasible to modify the three criteria according to specific needs, e.g., substituting honesty with correctness [116]. Next, we give

brief explanations about the three representative alignment criteria:
• Helpfulness. To be helpful, the LLM should demonstrate a clear attempt to assist users in solving their tasks or answering questions in as concise and efficient a manner as possible. At a higher level, when further clarification is needed, the LLM should demonstrate the capability of eliciting additional relevant information through pertinent inquiries and exhibit suitable levels of sensitivity, perceptiveness, and prudence [366]. Realizing the alignment of helpful behavior is challenging for LLMs since it is difficult to precisely define and measure the intention of users [364].
• Honesty. At a basic level, an LLM aligned to be honest should present accurate content to users instead of fabricating information. Additionally, it is crucial for the LLM to convey appropriate degrees of uncertainty in its output, in order to avoid any form of deception or misrepresentation of information. This requires the model to know about its capabilities and levels of knowledge (e.g., “know unknowns”). According to the discussion in [366], honesty is a more objective criterion compared to helpfulness and harmlessness, hence honesty alignment could potentially be developed with less reliance on human efforts.
• Harmlessness. To be harmless, it requires that the language produced by the model should not be offensive or discriminatory. To the best of its abilities, the model should be capable of detecting covert endeavors aimed at soliciting requests for malicious purposes. Ideally, when the model is induced to conduct a dangerous action (e.g., committing a crime), the LLM should politely refuse. Nonetheless, what behaviors are deemed harmful, and to what extent, varies amongst individuals or societies [366] and highly depends on who is using the LLM, the type of the posed question, and the context (e.g., time) in which the LLM is being used.
As we can see, these criteria are quite subjective, and are developed based on human cognition. Thus, it is difficult to directly formulate them as optimization objectives for LLMs. In existing work, there are many ways to fulfill these criteria when aligning LLMs. A promising technique is red teaming [367], which involves using manual or automated means to probe LLMs in an adversarial way to generate harmful outputs and then updates LLMs to prevent such outputs.

5.2.2 Collecting Human Feedback
During the pre-training stage, LLMs are trained using the language modeling objective on a large-scale corpus. However, it cannot take into account the subjective and qualitative evaluations of LLM outputs by humans (called human feedback in this survey). High-quality human feedback is extremely important for aligning LLMs with human preferences and values. In this part, we discuss how to select a team of human labelers for feedback data collection.

Human Labeler Selection. In existing work, the dominant method for generating human feedback data is human annotation [66, 116, 365]. This highlights the critical role of selecting appropriate human labelers. To provide high-quality feedback, human labelers are supposed to have a qualified level of education and excellent proficiency in English. For example, Sparrow [116] requires human labelers to be UK-based native English speakers who have obtained at least an undergraduate-level educational qualification. Even then, several studies [365] have found that there still exists a mismatch between the intentions of researchers and human labelers, which may lead to low-quality human feedback and cause LLMs to produce unexpected output. To address this issue, InstructGPT [66] further conducts a screening process to filter labelers by assessing the agreement between human labelers and researchers. Specifically, researchers first label a small amount of data and then measure the agreement between themselves and human labelers. The labelers with the highest agreement will be selected to proceed with the subsequent annotation work. In some other work [368], “super raters” are used to ensure the high quality of human feedback. Researchers evaluate the performance of human labelers and select a group of well-performing human labelers (e.g., high agreement) as super raters. The super raters will be given priority to collaborate with the researchers in the subsequent study. When human labelers annotate the output of LLMs, it is helpful to specify detailed instructions and provide instant guidance for human labelers, which can further regulate the annotation of labelers.

Human Feedback Collection. In existing work, there are mainly three kinds of approaches to collecting feedback and preference data from human labelers.
• Ranking-based approach. In early work [365], human labelers often evaluate model-generated outputs in a coarse-grained manner (i.e., only selecting the best) without taking into account more fine-grained alignment criteria. Nonetheless, different labelers may hold diverse opinions on the selection of the best candidate output, and this method disregards the unselected samples, which may lead to inaccurate or incomplete human feedback. To address this issue, subsequent studies [116] introduce the Elo rating system to derive the preference ranking by comparing candidate outputs. The ranking of outputs serves as the training signal that guides the model to prefer certain outputs over others, thus inducing outputs that are more reliable and safer.
• Question-based approach. Further, human labelers can provide more detailed feedback by answering certain questions designed by researchers [81], covering the alignment criteria as well as additional constraints for LLMs. Specifically, in WebGPT [81], to assist the model in filtering and utilizing relevant information from retrieved documents, human labelers are required to answer questions with multiple options about whether the retrieved documents are useful for answering the given input.
• Rule-based approach. Many studies also develop rule-based methods to provide more detailed human feedback. As a typical case, Sparrow [116] not only selects the response that labelers consider the best but also uses a series of rules to test whether model-generated responses meet the alignment criteria of being helpful, correct, and harmless. In this way, two kinds of human feedback data can be obtained: (1) the response preference feedback is obtained by comparing the quality of model-generated output in pairs, and (2) the rule violation feedback is obtained by collecting the assessment from human labelers (i.e., a score indicating to what extent the generated output has violated the rules).
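
As an illustration of how pairwise comparisons can be turned into a preference ranking with the Elo rating system mentioned for the ranking-based approach, the following sketch applies the standard Elo update to a list of (winner, loser) judgments. The initial rating of 1000, the K-factor of 32, and the function names are assumptions chosen for illustration and are not taken from [116].

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_rank(candidates, comparisons, k=32.0, initial=1000.0):
    """Rank candidate outputs from pairwise preference judgments.

    candidates:  list of hashable candidate identifiers
    comparisons: iterable of (winner, loser) pairs from labelers
    k, initial:  illustrative constants (assumptions, not from the source)
    """
    ratings = {c: initial for c in candidates}
    for winner, loser in comparisons:
        e_win = expected_score(ratings[winner], ratings[loser])
        # The winner scored 1 against an expectation of e_win; the loser
        # scored 0 against an expectation of (1 - e_win). The two updates
        # are equal and opposite, so the total rating mass is conserved.
        delta = k * (1.0 - e_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    ranking = sorted(candidates, key=lambda c: ratings[c], reverse=True)
    return ranking, ratings


# Three candidate responses to one prompt and three labeler judgments.
ranking, ratings = elo_rank(
    ["A", "B", "C"],
    [("A", "B"), ("A", "C"), ("B", "C")],
)
print(ranking)
```

Unlike selecting only the best output, every comparison updates both candidates, so the unselected samples also contribute to the final ordering. Note that sequential Elo updates depend on the order in which comparisons are processed.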

[Figure 12 panels: Supervised Fine-tuning (training with demonstration data); Reward Model Training (training with feedback data); RL Fine-tuning (training with RL algorithm (PPO))]
Fig. 12: The workflow of the RLHF algorithm.

Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers (based on GPT-4 itself) as rule-based reward models, which can automatically determine whether the model-generated outputs violate a set of human-written rules.
In the following, we focus on a well-known technique, reinforcement learning from human feedback (RLHF), which has been widely used in the recent powerful LLMs such as ChatGPT. As discussed below, the alignment criteria introduced in Section 5.2.1 can be fulfilled by learning from human feedback on the responses of LLMs to users’ queries.

5.2.3 Reinforcement Learning from Human Feedback
To align LLMs with human values, reinforcement learning from human feedback (RLHF) [79, 365] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful to improve the alignment criteria (e.g., helpfulness, honesty, and harmlessness). RLHF employs reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization (PPO) [128]) to adapt LLMs to human feedback by learning a reward model. Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [66].

RLHF System. The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and an RL algorithm training the LM. Specifically, the pre-trained LM is typically a generative model that is initialized with existing pre-trained LM parameters. For example, OpenAI uses 175B GPT-3 for its first popular RLHF model, InstructGPT [66], and DeepMind uses the 280 billion parameter model Gopher [64] for its GopherCite model [368]. Further, the reward model (RM) provides (learned) guidance signals that reflect human preferences for the text generated by the LM, usually in the form of a scalar value. The reward model can take on two forms: a fine-tuned LM or an LM trained de novo using human preference data. Existing work typically employs reward models having a parameter scale different from that of the aligned LM [66, 368]. For example, OpenAI uses 6B GPT-3 and DeepMind uses 7B Gopher as the reward model, respectively. Finally, to optimize the pre-trained LM using the signal from the reward model, a specific RL algorithm is designed for large-scale model tuning. Specifically, Proximal Policy Optimization (PPO) [128] is a widely used RL algorithm for alignment in existing work [66, 116, 368].

Key Steps for RLHF. Figure 12 illustrates the overall three-step process of RLHF [66] as introduced below.
• Supervised fine-tuning. To make the LM initially perform desired behaviors, it usually needs to collect a supervised dataset containing input prompts (instruction) and desired outputs for fine-tuning the LM. These prompts and outputs can be written by human labelers for some specific tasks while ensuring the diversity of tasks. For example, InstructGPT [66] asks human labelers to compose prompts (e.g., “List five ideas for how to regain enthusiasm for my career”) and desired outputs for several generative tasks such as open QA, brainstorming, chatting, and rewriting. Note that the first step is optional in specific settings or scenarios.
• Reward model training. The second step is to train the RM using human feedback data. Specifically, we employ the LM to generate a certain number of output texts using sampled prompts (from either the supervised dataset or the human-generated prompt) as input. We then invite human labelers to annotate the preference for these pairs. The annotation process can be conducted in multiple forms, and a common approach is to annotate by ranking the generated candidate texts, which can reduce the inconsistency among annotators. Then, the RM is trained to predict the human-preferred output. In InstructGPT, labelers rank model-generated outputs from best to worst, and the RM (i.e., 6B GPT-3) is trained to predict the ranking. Note that, in recent work [369], the annotation of preference on response pairs has been conducted by an AI agent (usually an aligned LLM) instead of humans, which is called “reinforcement learning from AI feedback (RLAIF)”. LLMs trained with typical RLHF algorithms tend to generate harmless responses with less helpfulness, which is called the evasion problem [369]. To guarantee both the harmlessness and helpfulness, RLAIF generates the AI feedback based on pre-set alignment principles in instructions [369, 370], which can also reduce the efforts of human annotation.
• RL fine-tuning. At this step, aligning (i.e., fine-tuning) the LM is formalized as an RL problem. In this setting, the pre-trained LM acts as the policy that takes as input a prompt and returns an output text, the action space of it is the vocabulary, the state is the currently generated token sequence, and the reward is provided by the RM. To avoid deviating significantly from the initial (before tuning) LM, a penalty term is commonly incorporated into the reward function. For example, InstructGPT optimizes the LM against the RM using the PPO algorithm. For each input prompt, InstructGPT calculates the KL divergence between the generated results from the current LM and the initial LM as the penalty. It is noted that the second and final steps can be iterated in multiple turns for better aligning LLMs. Due to the instability of the RL algorithm, recent work [371] replaces the RL tuning with another supervised fine-tuning

[Figure 13 panels: (a) Adapter Tuning (MHA and FFN adapters in Layers #1 to #N); (b) Prefix Tuning (a prefix at
each layer); (c) Prompt Tuning (a prompt at the input); (d) Low-Rank Adaptation (LoRA)]

Fig. 13: An illustration of four different parameter-efficient fine-tuning methods. MHA and FFN denote the multi-head
attention and feed-forward networks in the Transformer layer, respectively.
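
As a rough illustration of the two quantities that the RLHF description in Section 5.2.3 relies on, the sketch
below computes (i) a ranking loss over reward-model scores for responses that labelers ordered from best to
worst, and (ii) a reward with a KL-style penalty against the initial LM. The pairwise logistic form of the
ranking loss, the single-sample log-ratio estimate of the KL divergence, the coefficient `beta`, and all
function names are assumptions made for illustration; the survey text does not fix these details.

```python
import math


def pairwise_ranking_loss(scores_best_to_worst):
    """Average -log(sigmoid(r_better - r_worse)) over all pairs of a ranked list.

    scores_best_to_worst: reward-model scores for responses to one prompt,
    listed in the order the labelers ranked them (best first).
    """
    losses = []
    n = len(scores_best_to_worst)
    for i in range(n):
        for j in range(i + 1, n):
            margin = scores_best_to_worst[i] - scores_best_to_worst[j]
            # -log(sigmoid(margin)) is equal to log(1 + exp(-margin)).
            losses.append(math.log1p(math.exp(-margin)))
    if not losses:
        raise ValueError("need at least two ranked responses")
    return sum(losses) / len(losses)


def kl_penalized_reward(rm_score, policy_logprobs, initial_logprobs, beta=0.1):
    """Reward-model score minus beta times a log-ratio penalty.

    policy_logprobs / initial_logprobs: per-token log-probabilities of the same
    sampled response under the current LM and the initial LM. Their summed
    difference is a single-sample estimate of KL(current || initial).
    beta is a hypothetical penalty coefficient.
    """
    if len(policy_logprobs) != len(initial_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    log_ratio = sum(p - q for p, q in zip(policy_logprobs, initial_logprobs))
    return rm_score - beta * log_ratio
```

When the current and initial LM assign identical log-probabilities, the penalty vanishes and the shaped reward
equals the RM score; a larger `beta` keeps the tuned LM closer to the initial one.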

by reusing the best ranked samples with higher rewards.

Practical Strategies for RLHF. Although RLHF is promising to effectively improve the alignment of LLMs with
humans, it is practically challenging for researchers to successfully implement it. In this part, we focus on
discussing several useful strategies and tricks for improving the effectiveness and efficiency of RLHF.
Concretely, we focus on the effective training of reward models and on effective and efficient RL training.

• Effective reward model training. Although InstructGPT used a small reward model (6B GPT model), a growing
body of work [99] has shown it is often more effective to use a large reward model (e.g., equal to or greater
than the original model size), since large reward models generally perform better in judging the quality of
the LLM generated outputs. In LLaMA 2 [99], pretrained chat model checkpoints are used to initialize the reward
model; the authors argue that such an approach can effectively reduce the information mismatch between the
model to be aligned and the reward model by sharing the same pre-training knowledge. However, it is common to
encounter the overfitting problem when training large-scale reward models. As a simple yet effective solution,
existing work [372, 373] has introduced the LM loss on the preferred response of the input prompt from the
human-annotated alignment dataset as a regularizer, which alleviates the overfitting of the reward model on
the binary classification task. In addition, as there are multiple criteria for alignment (e.g., helpfulness
and honesty), it is often difficult to train a single reward model that can satisfy all the alignment criteria.
Therefore, it is useful to train multiple reward models that focus on different alignment criteria [99], and
compute the final reward from their outputs via special combination strategies (e.g., mean pooling and weighted
sum). This enables more flexible rules or standards on multiple criteria, e.g., relaxing the requirement on
helpfulness while posing more strict limits on harmfulness.

• Effective RL training. As the RL training process tends to be unstable and hyper-parameter sensitive, it is
suggested that the language model be well fine-tuned in a supervised manner before RL training, so as to reach
a good model capacity. A commonly-used way is to fine-tune the LLM on its best outputs of the prompts (referred
to as rejection sampling or best-of-N) from the alignment dataset until convergence before RL. Given a prompt,
the LLM would first produce N outputs via the sampling algorithm, and then the best candidate from the model
will be selected by the reward model for learning. After fine-tuning the LLM on the best samples until
convergence, the RL process will be performed to further improve the performance. LLaMA 2 [99] has successively
trained five versions of RLHF models, where the LLM has been progressively improved with the improvement of the
reward models. In this way, the collected prompts and annotations of human preference data can better reflect
the issues of the current model checkpoint, which allows targeted tuning to address these issues. In addition,
LLaMA 2 also adds samples from prior iterations into the subsequent ones, to alleviate the possible capacity
regression issue during iterative optimization.

• Efficient RL training. As the RL training requires iterating the inference process of both the LLM and reward
models, it would greatly increase the total memory and computation cost, especially for larger reward models
and LLMs. As a practical trick, we can deploy the reward model on a separate server, and invoke the
corresponding API to work with the LLM on its own server. In addition, as RLHF requires the LLM to generate
multiple candidate outputs, instead of calling the sample decoding procedure multiple times, it is more
efficient to utilize the beam search decoding algorithm (footnote 26). It only needs to perform one-pass
decoding for response generation, while such a strategy can also enhance the diversity of the generated
candidate responses.

26. https://huggingface.co/docs/transformers/v4.31.0/en/main_classes/text_generation#transformers.GenerationMixin.group_beam_search

Process-Supervised RLHF. In the existing literature on RLHF [374], the supervision approach for RL training
generally takes two major forms, either using outcome-supervision signals or process-supervision signals.
Outcome-supervised RLHF employs a quantitative score to assess the quality of the whole text generated by LLMs.
In contrast, process-supervised RLHF offers an evaluation of each individual component (e.g., sentence, word,
or reasoning step) within the generated content, which leverages fine-grained supervision signals to guide the
training and helps LLMs refine undesired generated content [374, 375]. In what follows, we discuss two key
aspects of process-supervised RLHF.

• Obtaining Fine-grained Supervision Signals. Compared with outcome rewards, it is more difficult to obtain
fine-grained supervision signals. OpenAI has released a fine-grained annotation dataset named PRM800k [375]
consisting of 12K process-annotated mathematical problems (i.e., MATH dataset [376]) and 75K LLM-generated
solutions to these problems, where each reasoning step of mathematical problems is labeled as positive,
negative or neutral in PRM800k. Considering the cost and efficiency of the human annotation process, several
methods aim to automatically annotate the correctness of intermediate reasoning steps, e.g., using powerful
LLMs to directly replace human annotators [377] or Monte Carlo tree search [378]. After obtaining fine-grained
supervision signals, existing work typically leverages them to train process-supervised reward models
(PRM) [375, 379], which can produce step-level rewards (e.g., sentence based or token based rewards) during the
RLHF procedure. Furthermore, rather than leveraging a discriminative model to produce the rewards, RLMEC [380]
utilizes a generative reward model trained on rewriting tasks with the minimum editing constraint, to provide
token-level rewards. In addition, for the downstream tasks where fine-grained supervision signals are difficult
to collect, outcome-supervision signals can also be utilized to perform process-supervised RLHF [381].

• Utilizing the PRMs. To effectively leverage process-supervision signals from PRMs, existing work mainly
utilizes these fine-grained signals to evaluate individual parts within the LLM responses and then guides LLMs
to adjust their generation behaviors to maximize the received reward of the response. Concretely, expert
iteration [382, 383], an effective RL algorithm, has been utilized to improve the base policy by learning from
an expert policy [374]. Typically, expert iteration contains two main stages: policy improvement and
distillation [374]. In the policy improvement stage, the expert policy performs a systematic search procedure
to produce samples under the guidance of PRMs. Subsequently, during the distillation stage, the samples
generated by the expert policy in the first stage are utilized to improve the base policy through supervised
fine-tuning. In addition to expert iteration, PRMs can also be utilized to re-rank the candidates of the final
answers generated by LLMs [375] or to select better intermediate reasoning steps during step by step
reasoning [379, 384].

5.2.4 Alignment without RLHF

Although RLHF has achieved great success in aligning the behaviors of LLMs with human values and preferences,
it also suffers from notable limitations. First, RLHF needs to train multiple LMs including the model being
aligned, the reward model, and the reference model at the same time, which is tedious in algorithmic procedure
and memory-consuming in practice. In addition, the commonly-used PPO algorithm in RLHF is rather complex and
often sensitive to hyper-parameters. As an alternative, a growing number of studies explore directly optimizing
LLMs to adhere to human preferences, using supervised fine-tuning without reinforcement learning [338].

Overview. The basic idea of non-RL alignment approaches is to directly fine-tune LLMs with supervised learning
on a high-quality alignment dataset. It basically assumes that response feedback or golden rules to avert
unsafe behaviors have been injected or included in the specially curated alignment dataset, so that LLMs can
directly learn aligned behaviors from these demonstration data via suitable fine-tuning strategies. Thus, to
implement this approach, two key issues are the construction of the alignment dataset and the design of the
fine-tuning loss. For the first issue, the alignment dataset can be automatically constructed by an aligned LLM
according to human-written safety principles [336] or by refining existing examples using edit
operations [385]. In addition, we can also reuse existing reward models to select high-rated responses from
existing human feedback data [371]. For the second issue, non-RL alignment approaches mainly fine-tune LLMs in
a supervised learning way (the same as the original instruction tuning loss) on a high-quality alignment
dataset, while auxiliary learning objectives can be used to enhance the alignment performance, e.g., ranking
responses or contrasting instruction-response pairs.

Alignment Data Collection. The construction of alignment data is important to effectively align the behaviors
of LLMs with human preferences. To collect high-quality alignment data, some work tries to reuse existing
reward models to select high-rated responses, and others explore leveraging powerful LLMs (e.g., ChatGPT) or
building a simulated environment to generate synthetic alignment examples. Next, we will discuss these three
lines of research.

• Reward model based approaches. The reward model in RLHF has been trained to measure the alignment degree on
the responses of LLMs. It is straightforward to leverage existing reward models to select high-quality
responses as alignment data for subsequent fine-tuning. Based on this idea, RAFT [371] adopts reward models
trained on human preference data to rank the responses of LLMs and collect those with higher rewards for
supervised fine-tuning. In addition, the reward model can also be used to score model responses and assign
them to different quality groups. Quark [386] sorts the responses of LLMs into different quantiles based on the
reward scores. Each quantile is assigned a special reward token to represent the reward level of the quantile.
Conditioned on the highest-reward tokens, LLMs are subsequently prompted to generate high-quality responses.
Given an initial answer and the corresponding human feedback, ILF [387] first adopts LLMs to generate refined
answers, then utilizes the reward model to select the answer that best matches the feedback for further
training. As valuable resources for aligning LLMs, several reward models have been released, including
DeBERTa-base/large/xxlarge from OpenAssistant (footnote 27), Moss-7B from Fudan (footnote 28), and Flan-T5-xl
from Stanford (footnote 29).

27. https://huggingface.co/OpenAssistant
28. https://github.com/OpenLMLab/MOSS-RLHF
29. https://huggingface.co/stanfordnlp/SteamSHP-flan-t5-xl

• LLM based generative approaches. Reward models help to select aligned data from model responses. However,
training reward models itself requires substantial high-quality human-labeled data, which is typically
expensive and in short supply. In addition, although existing reward models can be reused, they might not be
able to accurately capture the non-aligned behaviors in another separately trained LLM. Therefore, some work
explores leveraging powerful LLMs to automatically generate human-aligned data. As a representative work,
constitutional AI [369] proposes that human supervision comes from a set of principles

(i.e., natural language instructions) governing AI behaviors. Based on these principles, LLMs will critique
their own harmful responses and revise them repeatedly into finally aligned responses. Similarly,
Self-Align [336] first adopts self-instruct [147] to generate instructions focusing on covering diverse topics.
Then, the model is also prompted with multiple human-written principles that describe the rules of expected
model behaviors (also with several in-context exemplars), to generate helpful, ethical, and reliable responses
as alignment data. To mitigate the limit that the original SFT method can only learn from positive responses,
FIGA [388] develops an improved supervised alignment approach, where both negative (the original output of low
quality) and positive (the refined output by LLMs) responses are leveraged in a contrastive way, to enable LLMs
to deeply understand what fine-grained revisions actually lead to good responses.

• LLM based interactive approaches. Most existing approaches train LLMs in isolation, where LLMs are not
present in actual environments to improve themselves through external feedback signals. By comparison, humans
learn social norms and values from interactions with others in social environments [389]. To mimic such a
learning approach, Stable Alignment [193] builds a simulated interaction environment consisting of a number of
LLM agents, where AI agents keep interacting with each other, receiving feedback on improvement. Once a central
agent receives an instruction, it produces a response and shares it with nearby agents. These critic agents
generate feedback comprising ratings about the response and revision suggestions. Then the central agent would
revise the original response following these suggestions. Such an alignment approach can also be extended to
real-world environments with humans.

Supervised Alignment Tuning. After obtaining alignment data, it is also key to design suitable fine-tuning
strategies for direct alignment. A straightforward approach is to optimize LLMs using the conventional
sequence-to-sequence objective based on the alignment data. In addition to the conventional optimization
objective, several studies further explore auxiliary losses that enhance the learning from the alignment data.

• Primary training objective. Since the alignment data typically consists of an input instruction and an output
response, the primary training loss is still the traditional cross-entropy loss for sequence-to-sequence
learning. Based on this loss, many studies propose a number of improvement variants for enhancing the
supervised alignment tuning. For example, CoH [390] constructs the training data by prepending “A helpful
answer:” and “An unhelpful answer:” to the annotated good and bad responses, respectively, and only computes
losses for those response tokens with special masking. Quark [386] sorts model responses into different
quantiles with varying alignment quality, and prepends a special reward token to each model response to
represent the reward level of the response. These studies basically adopt the maximum likelihood objective, and
employ instruction prefixes to guide the learning of human preference.

• Direct preference optimization. To better mimic the learning approach of RLHF in a supervised learning way,
DPO [391] proposes to reparameterize the response rewards using the policy model (i.e., the language model
being optimized), and then the original reward modeling objective can be reformulated only based on the policy
model. In this way, DPO removes the explicit reward modeling step, and optimizing the new learning objective
that only involves the policy model is equivalent to optimizing the rewards. Based on DPO, existing work has
proposed several improvement strategies for enhancing the effectiveness or efficiency, e.g., decomposing the
optimization of positive responses and negative responses into two independent components [392] or removing
the probability of the reference model in the objective function [393]. Furthermore, FIGA [388] designs a
token-level contrastive loss that aims to encourage desirable tokens, penalize undesirable ones, and disregard
trivial tokens. Despite the effectiveness, recent work has also revealed that DPO may have inherent limitations
in several aspects. First, based on the analysis about the magnitude and gradient directions, recent work
reveals that DPO might have difficulty in well balancing the learning of positive instances and negative
instances [394]. In addition, as the reference model provides the reward scores for itself in the DPO
algorithm, a weak reference model would also influence the alignment performance [395], which can be enhanced
by improved learning strategies [396] or a well-trained policy model [395].

• Auxiliary optimization objectives. Besides the primary cross-entropy loss, several studies propose auxiliary
training losses to enhance the learning from the alignment data. First, since the responses of each instruction
can be scored by the reward model, the ranking loss can be used to train the model to preserve the ranking
order of these responses. For example, RRHF [397] samples responses from multiple sources, including
model-generated responses, such as those derived from the model itself, ChatGPT, and GPT-4, as well as
human-written responses, spanning both high-quality and low-quality instances. To align with the scores from
reward models, it further optimizes the ranking loss by encouraging the model to have a higher conditional log
probability for the response with a higher ranking. Moreover, SLiC-HF [398] proposes to assess the similarity
between model outputs and human preference via the distance in the latent space, and introduces specific
calibration and regularization loss to calibrate the candidate sequences based on human-preference data.
Similarly, the difference between positive and negative responses from the reward model can be employed to
construct the regularization loss [399], to enhance the discrimination between positive and negative responses
by LLMs. Second, to enhance the relatedness between the response and the instruction, some work adopts
contrastive learning to push up the probability of correct instruction-response pairs while pushing down
incorrect instruction-response pairs. Specifically, for an output response, the proposed approach in [400]
contrasts the target instruction to the other irrelevant instructions. By doing so, it can enable the model to
learn the right correlation between instructions and responses.

5.2.5 Remarks on SFT and RLHF

As discussed in Section 5.1, instruction tuning is the process of training pre-trained language models with
formatted

demonstration data (instructions paired with desired outputs). In early exploration, instruction data was
mainly collected from NLP tasks [67], while it has now been extended to more diverse supervision data that
pairs input and output texts (e.g., the utterances of open-ended dialogues). Training with such paired texts is
also called supervised fine-tuning (SFT) in the context of LLMs [66]. In this part, we mainly use the
abbreviation SFT for discussion but not instruction tuning, due to its simplicity and popularity.
Since SFT and RLHF are two major adaptation tuning methods for LLMs, it is important to understand the
connections and differences between them. Next, we discuss this issue (footnote 30).

30. This part is somewhat subjective, mainly based on the authors’ opinions and experiences. Comments or
corrections are welcome to enhance this part.

Overall Comparison with RL Formulation. Following the discussion in Section 5.2.3 (the part related to RL
training), the text generation problem can be formulated as a decision-making process based on RL. Taking a
prompt as input, the task of an LLM is to generate a text completion that appropriately responds to the prompt.
This task would be completed step by step. At each step, an agent (i.e., LLM) will perform an action (i.e.,
generating a token) according to the policy (i.e., the generative probability distribution of LLM) conditioned
on the current state (currently generated token sequence and other available context information). It is
expected that a high-quality output text would be produced by the LLM, which can earn a large reward score
based on the entire response. Overall, RLHF and SFT can be considered as two different training approaches to
optimizing the above decision making process for LLMs. Specifically, RLHF first learns the reward model, and
then employs it to improve the LLM with RL training (e.g., PPO). By comparison, SFT adopts a teacher-forcing
approach, which directly optimizes the likelihood of a demonstration output. Such token-level training
essentially performs behavior cloning (a special algorithm of imitation learning [401]): it utilizes the
expert’s action (i.e., the target token at each step) as the supervision label and directly learns to imitate
the demonstrations from experts without specifying a reward model as in typical RL algorithms. To learn the
desired policies, SFT adopts a “local” optimization approach (i.e., token-level loss) based on demonstration
data, while RLHF takes a “global” optimization approach (i.e., text-level loss) by involving human preference.
More theoretical analysis about imitation learning and reinforcement learning can be found in the related RL
literature [401, 402].

Pros and Cons of SFT. SFT has been shown to be an effective approach to boosting the performance of LLMs on
various benchmarks [67, 69, 152, 187], which can largely enhance the task generalization ability and flexibly
endow specific functions (e.g., establishing the chatbot’s identity). More discussions about the usefulness of
SFT can be found in Section 5.1.3. It has been widely recognized that SFT mainly unlocks existing abilities
rather than injecting new abilities into LLMs. Thus, it might become problematic when one tries to stimulate
the non-endogenous abilities of LLMs via SFT. As a concrete scenario, it would potentially encourage
hallucination behaviors when demonstration data is beyond the knowledge or ability scope of LLMs, e.g.,
training an LLM to answer questions about facts unknown to it. An interesting viewpoint from John Schulman’s
talk on RLHF [403] is that distilling superior models to train less capable models (e.g., prompting GPT-4 to
generate the response as fine-tuning data) might increase the possibility of generating hallucinated texts,
thus likely affecting the factual accuracy of LLMs. Furthermore, as a behavior cloning method, SFT aims to
imitate the behaviors (without explorations) of the experts who construct the demonstration data. However,
there often exist variations among different annotators on the writing styles, quality, and preferences of
demonstration data, which tends to affect the learning performance of SFT. Thus, high-quality instruction data
(but not the quantity) is the primary factor for effective training of LLMs during the SFT stage [99].

Pros and Cons of RLHF. RLHF was first explored in the literature of deep RL [79], then borrowed to improve the
capacity of language models (e.g., summarization [129]), and subsequently adopted as the fundamental technique
to develop InstructGPT [66]. Recently, increasing evidence [99, 369] has demonstrated the effectiveness of RLHF
in mitigating harmful responses and enhancing the model capacity. Specifically, LLaMA 2 has demonstrated that
RLHF can improve both the helpfulness and harmlessness scores [99], and attributed this to a better human-LLM
synergy for data annotation. They explain this in two major aspects as follows. First, since human annotators
mainly provide preference annotations for RLHF, this can largely alleviate the discrepancies among annotators
that arise in SFT. Secondly, preference annotation is much easier than writing the demonstration data, and
annotators can even judge the quality of generations superior to those they could create themselves, making it
possible to explore a broader state space beyond what can be demonstrated by human annotators. Another key
point is that RLHF essentially encourages LLMs to learn correct policies by contrasting the self-generated
responses (discriminating between good and bad responses). It no longer forces the model to imitate external
demonstration data, and thus can mitigate the hallucination issues with SFT as discussed above (footnote 31).
In fact, RLHF has been demonstrated to be an important approach to reduce the hallucination behaviors in
GPT-4 [46]. However, RLHF inherits the drawbacks of classic RL algorithms, e.g., sample inefficiency and
training instability. When adapted to LLMs, RLHF further relies on a strong SFT model as the initial model
checkpoint for efficiently achieving good performance. In addition, human annotators are involved in a complex
iterative optimization process, in which a number of important details (e.g., the prompt selection, the
schedule of reward model training and PPO training, and the settings of hyper-parameters) have an important
impact on the whole model performance.

31. In RLHF, it also seems important that reward models should be aware of the knowledge or ability of an LLM
to be aligned. For example, LLaMA 2 adopts pre-trained chat model checkpoints to initialize reward models [99].

Overall, SFT is particularly useful for increasing the model capacity of pre-trained model checkpoints right
after pre-training, while RLHF is promising to further improve the

model capacity of SFT models. However, RLHF has been the Transformer architecture specially for natural language
difficult to implement, and far from well explored (ac- understanding, which also utilizes multi-task learning for
cording to public literature), and more improvements (e.g., jointly optimizing shared prompts. It has been shown to
efficient and reliable annotation [369] and simplified opti- be useful in improving the model performance of different
mization [391]) are still needed for further research. parameter scales on natural language understanding tasks.

Prompt Tuning. Different from prefix tuning, prompt tun-
5.3 Parameter-Efficient Model Adaptation ing [405, 410] mainly focuses on incorporating trainable
In the above, we have discussed the approaches of instruc- prompt vectors at the input layer32 . Based on the discrete
tion tuning and alignment tuning to adapt LLMs according prompting methods [412, 413], it augments the input text
to specific goals. Since LLMs consist of a huge amount of by including a group of soft prompt tokens (either in a
model parameters, it would be costly to perform the full- free form [410] or a prefix form [405]), and then takes
parameter tuning. In this section, we will discuss how to the prompt-augmented input to solve specific downstream
conduct efficient tuning on LLMs. We first review several tasks. In implementation, task-specific prompt embeddings
representative parameter-efficient fine-tuning methods for are combined with the input text embeddings, which are
Transformer language models, and then summarize existing subsequently fed into language models. P-tuning [410] has
work on parameter-efficient fine-tuned LLMs. proposed a free form to combine the context, prompt and
target tokens, which can be applied to the architectures for
5.3.1 Parameter-Efficient Fine-Tuning Methods both natural language understanding and generation. They
In existing literature, parameter-efficient fine-tuning [149, further learn the representations of soft prompt tokens by a
404, 405] has been an important topic that aims to reduce bidirectional LSTM. Another representative approach [405]
the number of trainable parameters while retaining a good named prompt tuning directly prepends prefix prompts to
performance as possible. In what follows, we briefly re- the input. During training, only the prompt embeddings
view four parameter-efficient fine-tuning methods for Trans- would be learned according to task-specific supervisions.
former language models, including adapter tuning, prefix Since this method only includes a small number of trainable
tuning, prompt tuning and LoRA. The illustration of these parameters at the input layer, it has been found that the
four methods are shown in Figure 13. performance highly relies on the model capacity of the
underlying language models [405].
Adapter Tuning. Adapter tuning incorporates small neural network modules (called adapters) into the
Transformer models [406]. To implement the adapter module, a bottleneck architecture has been
proposed in [406, 407], which first compresses the original feature vector into a smaller dimension
(followed by a nonlinear transformation) and then recovers it to the original dimension. The adapter
modules are integrated into each Transformer layer, typically using a serial insertion after each of
the two core parts (i.e., attention layer and feed-forward layer) of a Transformer layer.
Alternatively, parallel adapters [408] can also be used in Transformer layers, which place two
adapter modules in parallel with the attention layer and feed-forward layer, respectively. During
fine-tuning, the adapter modules are optimized according to the specific task goals, while the
parameters of the original language model are frozen. In this way, we can effectively reduce the
number of trainable parameters during fine-tuning.

Prefix Tuning. Prefix tuning [404] prepends a sequence of prefixes, which are a set of trainable
continuous vectors, to each Transformer layer in language models. These prefix vectors are
task-specific and can be considered as virtual token embeddings. To optimize the prefix vectors, a
reparameterization trick [404] has been proposed, which learns an MLP function that maps a smaller
matrix to the parameter matrix of prefixes, instead of directly optimizing the prefixes. It has been
shown that this trick is useful for stable training. After optimization, the mapping function is
discarded, and only the derived prefix vectors are kept to enhance task-specific performance. Since
only the prefix parameters are trained, it can lead to a parameter-efficient model optimization.
Similar to prefix tuning, p-tuning v2 [409] incorporates layer-wise prompt vectors into

Low-Rank Adaptation (LoRA). LoRA [149] imposes a low-rank constraint for approximating the update
matrix at each dense layer, so as to reduce the trainable parameters for adapting to downstream
tasks. Consider the case of optimizing a parameter matrix $W$. The update process can be written in
a general form as: $W \leftarrow W + \Delta W$. The basic idea of LoRA is to freeze the original
matrix $W \in \mathbb{R}^{m \times n}$ while approximating the parameter update $\Delta W$ by
low-rank decomposition matrices, i.e., $\Delta W = A \cdot B^{\top}$, where
$A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{n \times k}$ are the trainable parameters for
task adaptation and $k \ll \min(m, n)$ is the reduced rank. The major merit of LoRA is that it can
largely save memory and storage usage (e.g., VRAM). Further, one can keep only a single large model
copy, while maintaining a number of task-specific low-rank decomposition matrices for adapting to
different downstream tasks. In addition, several studies have discussed how to set the rank in a
more principled way, e.g., importance score based allocation [414] and search-free optimal rank
selection [415].

Besides the above methods, there is extensive research on efficient tuning of Transformer language
models. However, a more comprehensive discussion of efficient tuning is beyond the scope of this
article, and can be found in the related papers on this topic [408, 416].

32. Here, prompt tuning denotes a category of related efficient tuning methods exemplified by the
work [405, 410, 411], instead of a specific method as used in [405]. Indeed, the prefix based tuning
methods [404, 409] can also be considered as prompting methods, which are called deep prompting
tuning in [409]. In this survey, prompt tuning specifically refers to the methods that only include
the prompt tokens at the input layer, in the context of LLMs. We assign p-tuning v2 [409] to the
category of prefix tuning, because it incorporates layerwise prompts in language models.
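
The LoRA update described above, with a frozen W and a trainable low-rank product ∆W = A · B⊤, can be illustrated with a minimal sketch in plain Python. The dimensions, the random initialization, and the choice to zero-initialize B (so that ∆W starts at zero) are assumptions made for illustration; the helper names are hypothetical and this is not the implementation of [149].

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

def lora_delta(A, B):
    """Low-rank update Delta W = A @ B^T, with A of shape (m, k) and B of shape (n, k)."""
    return matmul(A, transpose(B))

def adapted_weight(W, A, B):
    """Effective weight W + Delta W; W is read but never modified (it stays frozen)."""
    delta = lora_delta(A, B)
    return [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]

def lora_param_count(m, n, k):
    """Number of trainable parameters in A and B combined."""
    return k * (m + n)

# Toy sizes (assumption): a 6 x 4 weight matrix adapted with rank k = 2.
m, n, k = 6, 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(m)]     # frozen weights
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(m)]  # trainable
B = [[0.0] * k for _ in range(n)]                               # trainable, zero init (assumption)

delta = lora_delta(A, B)
W_adapted = adapted_weight(W, A, B)

# Parameter savings for a hypothetical 4096 x 4096 dense layer with rank 8.
full_params = 4096 * 4096
low_rank_params = lora_param_count(4096, 4096, 8)
```

With these hypothetical sizes, the dense layer has 16,777,216 entries while A and B together hold 65,536, which is where the memory and storage savings come from. Keeping several (A, B) pairs next to one shared W corresponds to the multi-task use described above.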

5.3.2 Parameter-Efficient Fine-Tuning on LLMs
With the rise of LLMs, efficient tuning has attracted increasing research attention for developing
more lightweight adaptation approaches for downstream tasks.
In particular, LoRA [149] has been widely applied to open-source LLMs (e.g., LLaMA and BLOOM) for
parameter-efficient fine-tuning. Among these research attempts, LLaMA and its variants have gained
much attention for parameter-efficient tuning. For example, Alpaca-LoRA [148] has been trained using
LoRA as a lightweight tuned version of Alpaca [146] (a fine-tuned 7B LLaMA model with 52K
instruction-following demonstrations). There are extensive explorations of Alpaca-LoRA across
different languages or model sizes, which can be found in the collection page (footnote 33). A
recent study, LLaMA-Adapter [417], inserts learnable prompt vectors into each Transformer layer, in
which zero-initialized attention has been proposed to improve the training by mitigating the
influence of under-fitted prompt vectors. They also extend this approach to a multi-modal setting,
e.g., visual question answering.
Further, an empirical study [407] has been conducted to examine the effect of different tuning
methods on language models. They compare four efficient tuning methods including serial adapter
tuning [406], parallel adapter tuning [408, 418], and LoRA [149], on three open-source LLMs, namely
GPT-J (6B), BLOOM (7.1B) and LLaMA (7B), for evaluation. Based on the experimental results on six
math reasoning datasets, they show that these efficient-tuning methods under-perform the reference
baseline GPT-3.5 on difficult tasks, while achieving a comparable performance on simple tasks.
Overall, LoRA performs relatively well among these comparison methods, using significantly fewer
trainable parameters.
As an important resource, the library PEFT [419] (standing for parameter-efficient fine-tuning) has
been released on GitHub (footnote 34). It includes several widely used efficient tuning methods,
including LoRA [149]/AdaLoRA [414], prefix-tuning [404, 409], P-Tuning [410], and prompt-tuning
[405]. Further, it supports a number of language models such as GPT-2 and LLaMA, and also covers
several representative vision Transformer models (e.g., ViT and Swin Transformer).
As discussed in Section 5.3.1, a large number of efficient tuning methods have been proposed in the
existing literature. However, most of these approaches are tested on small-sized pre-trained
language models, instead of LLMs. So far, there is still a lack of thorough investigation of the
effect of different efficient tuning methods on large-sized language models in different settings
or tasks.

33. https://github.com/tloen/alpaca-lora
34. https://github.com/huggingface/peft

6 UTILIZATION
After pre-training or adaptation tuning, a major approach to using LLMs is to design suitable
prompting strategies for solving various tasks. In existing literature, task-specific prompts can
be effectively learned through manual creation and automatic optimization. A representative
prompting method is in-context learning [50, 55], which formulates the task description and/or
demonstrations in the form of natural language text. In addition, chain-of-thought prompting [33]
can be employed to enhance in-context learning by involving a series of intermediate reasoning
steps in prompts. Furthermore, planning [432] is proposed for solving complex tasks, which first
breaks them down into smaller sub-tasks and then generates a plan of action to solve these
sub-tasks one by one. We summarize representative work for these prompting approaches in Table 11.
Next, we will elaborate on the details of the four techniques.

6.1 Prompting
As discussed in previous work [36], prompting is the major approach to utilizing LLMs for solving
various tasks. Since the quality of prompts largely influences the performance of LLMs in specific
tasks, a series of studies have been proposed to generate suitable task prompts through manual
creation or automatic optimization, which will be introduced in this section.

6.1.1 Prompt Creation
The process of manually creating a suitable prompt is also called prompt engineering [445, 446]. A
well-designed prompt is very helpful to elicit the abilities of LLMs for accomplishing specific
tasks. In this part, we will first introduce the key components of prompts and discuss several
principles for prompt design. Then, we evaluate ChatGPT with different prompts to show the results
on several representative tasks. We are aware that there have been several existing papers
[446, 447] and websites [448–450] that present suggestions and guidelines for designing good
prompts. In comparison, we mainly aim to discuss the key factors (ingredients and principles) that
are useful for prompt creation, and provide experimental results and analysis on popular tasks as a
reference for beginners.

Key Ingredients. Typically, there are four key ingredients that depict the functionality of a
prompt for eliciting the abilities of LLMs to complete the tasks, including task description, input
data, contextual information, and prompt style. To have an intuitive understanding of our
discussion, we also present three prompt examples for question answering, meta-review generation,
and text-to-SQL in Table 13.
• Task description. A task description is typically a specific instruction that LLMs are expected
to follow. In general, one should clearly describe the task goal in natural language. For the tasks
with special input or output format, detailed clarifications are often needed, and one can further
utilize keywords to highlight the special settings for better guiding LLMs in task completion.
• Input data. In common cases, it is straightforward to describe input data (e.g., an instance to
be responded to by LLMs) in natural language. For special input data, such as knowledge graphs and
tables, it is necessary to apply an appropriate and convenient way to make them readable for LLMs.
For structured data, linearization is commonly used to transform the original records (e.g.,
knowledge triples) into sequences [451] due to its simplicity. Further, the programming language
(e.g., executable code) has also been utilized to formulate the structured data, which can

TABLE 11: Typical LLM utilization methods and their key points for ICL, CoT, and planning. Note that the key points only
highlight the most important technical contribution.

Approach | Representative Work | Key Point
In-context Learning (ICL) | KATE [420] | Demonstration selection (similar; k-NN)
In-context Learning (ICL) | EPR [421] | Demonstration selection (dense retrieval; contrastive learning)
In-context Learning (ICL) | SG-ICL [422] | Demonstration selection (LLM as the demonstration generator)
In-context Learning (ICL) | APE [423] | Demonstration format (automatic generation & selection)
In-context Learning (ICL) | Structured Prompting [424] | Demonstration format (grouped context encoding; rescaled attention)
In-context Learning (ICL) | GlobalE & LocalE [425] | Demonstration order (entropy-based metric; probing set generation with LLM)
Chain-of-thought Prompting (CoT) | Complex CoT [426] | Demonstration (complexity-based selection)
Chain-of-thought Prompting (CoT) | Auto-CoT [427] | Demonstration (automatic generation)
Chain-of-thought Prompting (CoT) | Selection-Inference [428] | Generation (alternate between selection and inference)
Chain-of-thought Prompting (CoT) | Self-consistency [429] | Generation (diverse paths; self-ensemble)
Chain-of-thought Prompting (CoT) | DIVERSE [430] | Generation (diverse paths); Verification (step-wise voting)
Chain-of-thought Prompting (CoT) | Rationale-augmented ensembles [431] | Generation (rationale sampling)
Planning | Least-to-most prompting [432] | Plan generation (text-based; problem decomposition)
Planning | DECOMP [433] | Plan generation (text-based; problem decomposition)
Planning | PS [434] | Plan generation (text-based)
Planning | Faithful CoT [435] | Plan generation (code-based)
Planning | PAL [436] | Plan generation (code-based; Python)
Planning | HuggingGPT [437] | Plan generation (code-based; models from HuggingFace)
Planning | AdaPlanner [438] | Plan refinement (skill memory)
Planning | TIP [439] | Feedback acquisition (visual perception)
Planning | RAP [440] | Feedback acquisition (LLM as the world model); Plan refinement (Monte Carlo Tree Search)
Planning | ChatCoT [441] | Feedback acquisition (tool); Plan refinement (conversation between LLM and tools)
Planning | ReAct [442] | Feedback acquisition (tool); Plan refinement (synergizing reasoning and acting)
Planning | Reflexion [443] | Feedback acquisition (text-based self-reflection); Plan refinement (dynamic memory)
Planning | Tree of Thoughts [444] | Feedback acquisition (vote comparison); Plan refinement (tree-based search)
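
As an illustration of the "diverse paths; self-ensemble" key point listed for Self-consistency [429] in Table 11, the sketch below takes several sampled reasoning paths and returns the most frequent final answer. The sampled paths are hard-coded stand-ins for LLM outputs, and the "The answer is" marker is an assumed output format; both are hypothetical and not taken from [429].

```python
from collections import Counter

ANSWER_MARKER = "The answer is"  # assumed output format, for illustration only

def extract_answer(path: str) -> str:
    """Return the text after the last answer marker, without the trailing period.

    If the marker is missing, the whole path is returned unchanged apart from trimming.
    """
    return path.rsplit(ANSWER_MARKER, 1)[-1].strip().rstrip(".")

def self_consistent_answer(paths):
    """Majority vote over the final answers of several sampled reasoning paths."""
    votes = Counter(extract_answer(p) for p in paths)
    answer, count = votes.most_common(1)[0]
    return answer, count / len(paths)

# Hypothetical samples for the same question (stand-ins for sampled LLM outputs).
sampled_paths = [
    "There are 3 cars and 2 more arrive. 3 + 2 = 5. The answer is 5.",
    "Start with 3 cars, add 2 cars, giving 5 cars. The answer is 5.",
    "3 cars times 2 is 6. The answer is 6.",
]

answer, agreement = self_consistent_answer(sampled_paths)
```

The agreement ratio is a simple by-product of the vote; it is included here only as a convenience and is not a calibrated confidence estimate.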

also support using external tools (e.g., program executor) to produce the precise results
[452, 453].
• Contextual information. In addition to the task description and input data, contextual or
background information is also essential for specific tasks. For example, retrieved documents are
highly useful for open-domain question answering as supporting evidence. Both the quality of the
retrieved documents and their relevance to the question have an impact on the generated answers
[454]. Thus, such information needs to be included in a proper prompt pattern or expression format.
Furthermore, in-context task exemplars are also helpful for eliciting LLMs to accomplish a complex
task, which can better depict the task goal, the special output formats, and the mapping relation
between input and output.
• Prompt style. For different LLMs, it is important to design a suitable prompt style for eliciting
their abilities to solve specific tasks. Overall, one should express the prompt as a clear question
or detailed instruction that can be well understood and answered. In some cases, it is also useful
to add a prefix or suffix to better guide LLMs. For example, using the prefix “Let us think step by
step” can help elicit LLMs to perform step-by-step reasoning, and using the prefix “You are an
expert on this task (or in this domain)” can boost the performance of LLMs in some specific tasks.
Further, for chat-based LLMs (e.g., ChatGPT), instead of directly feeding a long or complex task
prompt, it is suggested to decompose it into multiple prompts for the sub-tasks and then feed them
into LLMs via a multi-turn conversation [441].

Design Principles. Based on the key ingredients of prompts, we summarize several critical design
principles that can help create more effective prompts for solving various tasks.
• Expressing the task goal clearly. Task descriptions should not be ambiguous or unclear, which is
likely to lead to inaccurate or inappropriate responses. This highlights the need for clear and
unambiguous directives when utilizing these models [66]. A clear and detailed description should
contain various elements to explain a task, including task objective, input/output data (e.g.,
“Given a long document, I want you to generate a concise summary.”), and the response constraints
(e.g., “the length of the summary cannot exceed 50.”). By providing a well-clarified task
description, LLMs can more effectively understand the target task and generate the desired output.
• Decomposing into easy, detailed sub-tasks. To solve complex tasks, it is important to decompose
the difficult task into several easier, detailed sub-tasks for helping LLMs accomplish the goal
step by step, which is closely related to the planning technique in Section 6.4. For example,
following the suggestion [447], we can explicitly list the sub-tasks in the form of multiple
numbered items (e.g., “Braid a coherent narrative by performing the following tasks: 1. ...; 2.
...; 3. ...”). By decomposing a target task into sub-tasks, LLMs can focus on solving easier
sub-tasks and finally achieve more accurate results for complex tasks.
• Providing few-shot demonstrations. As discussed in Section 6.2, LLMs can benefit from in-context
learning for solving complex tasks, where the prompts contain a small number of task examples of
the desired input-output pairs, i.e., few-shot demonstrations. Few-shot demonstrations can help
LLMs learn the semantic mapping between input and output without parameter tuning. In practice, it
is suggested that one should generate a few high-quality demonstrations for the target task, which
would highly benefit the final task performance.
• Utilizing model-friendly format. Since LLMs are pre-trained on specially constructed datasets,
there are some prompt formats that can make LLMs better understand the instruction. For example, as
the OpenAI documentation suggests, we can use ### or """ as a delimiter to separate the instruction
and context, which can be better

understood by LLMs. As a general guideline, most existing LLMs perform a task better in English,
thus it is useful to employ English instructions to solve difficult tasks based on machine
translation.
• Adopting role-playing strategies. Since LLMs are pre-trained on extensive corpora containing
diverse characters and dialogues, they possess an inherent ability for role-playing. This feature
can be harnessed through specific prompts to enhance the corresponding capacity for some specific
domains [455]. For instance, when solving a math problem, we can use a prompt prefix like “You are
an expert in mathematics”. This enables LLMs to solve the problem from an expert’s perspective,
thereby leveraging their pretrained knowledge more effectively. When guided with role-playing
prompts, LLMs can often generate more reasonable and accurate solutions.

Useful Tips. In addition to the design principles, we also present a collection of useful prompt
tips based on existing work or our empirical experiences in Table 12. Note that these tips are
suggested in a general manner; this does not indicate that they are the best prompts for the
corresponding tasks. This part will be continuously updated with more guidelines or tips. We
welcome readers to contribute to this collection of prompt tips. We present the detailed procedure
to contribute to the prompt tips at the link:
https://github.com/RUCAIBox/LLMSurvey/tree/main/Prompts.

Empirical Analysis. We further conduct empirical studies to present the impact of prompts on task
performance. To conduct the experiments, we select a variety of tasks that span language
generation, knowledge utilization, complex reasoning, structured data generation, and information
retrieval. For each task, we manually write a prompt that follows the general guidelines introduced
above. Note that the tested prompts may not be optimal for these tasks, since they mainly aim to
help readers understand how to write an effective prompt for solving different tasks. Also, we add
a simplified prompt as the comparison for most tasks. Following the experimental settings in
Section 7.4, we examine the 3-shot performance of ChatGPT on complex reasoning tasks (Colored
Objects and GSM8k), and zero-shot performance on other tasks. We report the experimental results in
Table 17, where we also include the supervised performance in existing papers as reference.
• Carefully designed prompts can boost the zero-shot or few-shot performance of ChatGPT. By
comparing the results of using different prompts on the same task, we can see that using the
carefully designed prompts can achieve better performance than the simpler ones. In the carefully
designed prompts, we provide a more clearly expressed task description (e.g., WMT and WikiFact), or
use a model-friendly format (e.g., GSM8k and OBQA). For example, for the WikiFact task, the prompt
with a more detailed task description leads to a performance increase from 29.25 to 31.21.
• More complex tasks can benefit more from careful prompt engineering on ChatGPT. In the WikiFact
and Colored Objects tasks, the designed prompts have greatly improved the performance of ChatGPT,
i.e., from 23.61 to 28.47 on WikiFact and from 53.20 to 66.75 on Colored Objects. It indicates the
necessity of prompt engineering for LLMs to perform well on complex tasks, since these tasks
typically have specific output formats or require background knowledge. Our example prompts provide
more detailed task description (e.g., output format and task goal), which can help ChatGPT better
understand the complex task requirement for fulfilling it.
• For mathematical reasoning tasks, it is more effective to design specific prompts based on the
format of programming language. For GSM8k, the designed prompt employs code-formatted few-shot
demonstrations to convert this mathematical reasoning task into a code generation task, which can
leverage the strong code synthesis ability of ChatGPT for solving mathematical problems. Further,
with the help of an external program executor, we are able to obtain more precise results instead
of using LLMs for arithmetic operations. As we can see, the performance is boosted from 78.47 to
79.30 on GSM8k, indicating the usefulness of programming language in mathematical reasoning tasks.
• In knowledge utilization and complex reasoning tasks, ChatGPT with proper prompts achieves
comparable performance or even outperforms the supervised baseline methods. In knowledge
utilization and complex reasoning tasks, ChatGPT with proper zero-shot or few-shot prompts can
achieve comparable performance or even outperform the supervised methods, e.g., 31.21 (ChatGPT) vs.
34.20 (supervised baseline) on WikiFact. Despite that, ChatGPT still performs worse than supervised
baseline models on some specific tasks (e.g., ARC and WikiFact), since these supervised models have
been specially optimized with task-specific data.
• Through suitable prompt engineering, LLMs can handle some non-traditional NLP tasks. With the
help of specific prompts, ChatGPT can also accomplish non-traditional NLP tasks, i.e., the general
recommendation and conversational recommendation. A key point is that these tasks can be well
expressed or described in natural language. However, the performance of ChatGPT is still far from
the referenced performance in these tasks, as LLMs cannot directly fit these tasks, which require
specific domain knowledge and task adaptation [355, 456].

6.1.2 Prompt Optimization
Although manually creating task prompts is more intuitive, it is time consuming and, more
importantly, models are highly sensitive to the crafted prompts: improper prompts will lead to low
task performance (as shown in Table 17). Therefore, a large body of studies propose automatic
optimization approaches for discrete prompts and continuous prompts to achieve the optimal
performance [404, 413]. In this part, we will detail these studies from two perspectives, i.e.,
discrete prompts and continuous prompts.

Discrete Prompt Optimization. A discrete prompt is typically composed of a sequence of natural
language tokens. Although the form is simple and flexible, optimizing prompts in discrete space is
a challenging problem due to the combinatorially huge search space. To automatically search
effective prompts for downstream tasks, existing studies propose a wide spectrum of discrete prompt
optimization approaches, which are detailed as follows.
• Gradient-based approaches. This kind of approach aims to optimize the prompt search process by
maximizing the output likelihood via gradient update [413, 458–461].

TABLE 12: A collection of useful tips for designing prompts that are collected from online notes [446–449] and experiences
from our authors, where we also show the related ingredients and principles (introduced in Section 6.1.1). We abbreviate
principles as Prin. and list the IDs of the related principles for each prompt. ①: expressing the task goal clearly; ②:
decomposing into easy, detailed sub-tasks; ③: providing few-shot demonstrations; ④: utilizing model-friendly format.

Ingredient | Collected Prompts | Prin.
Task Description | T1. Make your prompt as detailed as possible, e.g., “Summarize the article into a short paragraph within 50 words. The major storyline and conclusion should be included, and the unimportant details can be omitted.” | ①
Task Description | T2. It is helpful to let the LLM know that it is an expert with a prefixed prompt, e.g., “You are a sophisticated expert in the domain of computer science.” | ①
Task Description | T3. Tell the model more about what it should do, rather than what it should not do. | ①
Task Description | T4. To keep the LLM from generating overly long output, you can just use the prompt: “Question: Short Answer: ”. Besides, you can also use the following suffixes: “in a or a few words”, “in one or two sentences”. | ①
Input Data | I1. For questions that require factual knowledge, it is useful to first retrieve relevant documents via a search engine, and then concatenate them into the prompt as reference. | ④
Input Data | I2. To highlight some important parts in your prompt, please use special marks, e.g., quotation ("") and line break (\n). You can also use both of them for emphasis. | ④
Contextual Information | C1. For complex tasks, you can clearly describe the required intermediate steps to accomplish it, e.g., “Please answer the question step by step as: Step 1 - Decompose the question into several sub-questions, · · · ” | ②
Contextual Information | C2. If you want LLMs to provide the score for a text, it is necessary to provide a detailed description about the scoring standard with examples as reference. | ①
Contextual Information | C3. When LLMs generate text according to some context (e.g., making recommendations according to purchase history), instructing them with the explanation about the generated result conditioned on context is helpful to improve the quality of the generated text. | ②
Contextual Information | C4. An approach similar to tree-of-thoughts but can be done in one prompt: e.g., Imagine three different experts are answering this question. All experts will write down one step of their thinking, then share it with the group of experts. Then all experts will go on to the next step, etc. If any expert realizes they’re wrong at any point then they leave. The question is | ②
Demonstration | D1. Well-formatted in-context exemplars are very useful, especially for producing the outputs with complex formats. | ③
Demonstration | D2. For few-shot chain-of-thought prompting, you can also use the prompt “Let’s think step-by-step”, and the few-shot examples should be separated by “\n” instead of full stop. | ① ③
Demonstration | D3. You can also retrieve similar examples in context to supply the useful task-specific knowledge for LLMs. To retrieve more relevant examples, it is useful to first obtain the answer of the question, and then concatenate it with the question for retrieval. | ③ ④
Demonstration | D4. The diversity of the in-context exemplars within the prompt is also useful. If it is not easy to obtain diverse questions, you can also seek to keep the diversity of the solutions for the questions. | ③
Demonstration | D5. When using chat-based LLMs, you can decompose in-context exemplars into multi-turn messages, to better match the human-chatbot conversation format. Similarly, you can also decompose the reasoning process of an exemplar into a multi-turn conversation. | ③
Demonstration | D6. Complex and informative in-context exemplars can help LLMs answer complex questions. | ③
Demonstration | D7. As a symbol sequence can typically be divided into multiple segments (e.g., i1, i2, i3 −→ i1, i2 and i2, i3), the preceding ones can be used as in-context exemplars to guide LLMs to predict the subsequent ones, meanwhile providing historical information. | ② ③
Demonstration | D8. Order matters for in-context exemplars and prompt components. For very long input data, the position of the question (first or last) may also affect the performance. | ③
Demonstration | D9. If you cannot obtain the in-context exemplars from existing datasets, an alternative way is to use the zero-shot generated ones from the LLM itself. | ③
Other Designs | O1. Let the LLM check its outputs before drawing the conclusion, e.g., “Check whether the above solution is correct or not.” | ②
Other Designs | O2. If the LLM cannot solve the task well, you can seek help from external tools by prompting the LLM to manipulate them. In this way, the tools should be encapsulated into callable APIs with detailed description about their functions, to better guide the LLM to utilize the tools. | ④
Other Designs | O3. The prompt should be self-contained, and better not include pronouns (e.g., it and they) in the context. | ①
Other Designs | O4. When using LLMs for comparing two or more examples, the order affects the performance a lot. | ①
Other Designs | O5. Before the prompt, assigning a role for the LLM is useful to help it better fulfill the following task instruction, e.g., “I want you to act as a lawyer”. | ①
Other Designs | O6. OpenAI models can perform a task better in English than other languages. Thus, it is useful to first translate the input into English and then feed it to LLMs. | ④
Other Designs | O7. For multi-choice questions, it is useful to constrain the output space of the LLM. You can use a more detailed explanation or just impose constraints on the logits. | ①
Other Designs | O8. For sorting based tasks (e.g., recommendation), instead of directly outputting the complete text of each item after sorting, one can assign indicators (e.g., ABCD) to the unsorted items and instruct the LLMs to directly output the sorted indicators. | ①
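
Several of the tips in Table 12 can be combined mechanically when a prompt is assembled from its ingredients. The following sketch is one possible arrangement, under stated assumptions: the function name, the field order, and the example strings are hypothetical, and the layout (a role prefix as in O5, a ### delimiter between instruction and context, retrieved documents wrapped in triple quotes as in I1, and exemplars separated by line breaks as in D2) is an illustrative choice rather than a recommended template.

```python
def build_prompt(task, question, documents=(), demonstrations=(), role=None):
    """Assemble a prompt from role, task description, context, exemplars, and input."""
    # Instruction part: optional role prefix (tip O5) followed by the task description.
    header = "\n".join(part for part in (role, task) if part)

    body_parts = []
    if documents:
        # Retrieved documents as reference (tip I1), each delimited by triple quotes.
        quoted = "\n".join(f'"""{doc}"""' for doc in documents)
        body_parts.append("Articles:\n" + quoted)
    # Few-shot exemplars, separated by line breaks rather than full stops (tip D2).
    for demo_question, demo_answer in demonstrations:
        body_parts.append(f"Question: {demo_question}\nAnswer: {demo_answer}")
    # The actual input goes last, with an open answer slot for the model to fill.
    body_parts.append(f"Question: {question}\nAnswer:")

    # "###" separates the instruction from the context and input.
    return header + "\n###\n" + "\n\n".join(body_parts)

# Hypothetical usage with made-up content.
prompt = build_prompt(
    task="Use the provided articles to answer the question in one or two sentences.",
    question="Which club did the player last play for?",
    documents=["The player last played as a central midfielder for a Premier League club."],
    demonstrations=[("Is the player a goalkeeper?", "No, the player is a midfielder.")],
    role="You are an expert in football statistics.",
)
```

Because order can matter for exemplars and prompt components (tip D8), keeping the assembly in one function makes it easy to try alternative orderings on a small validation set.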

As a representative work, Auto-Prompt [413] proposes a gradient-guided method to greedily search
the optimal token for each position of the prompt, leveraging the gradient approximated by the
change in the log-likelihood when replacing a prompt token with another candidate token from
vocabulary. However, such a search process can be extremely expensive since it needs to evaluate
each candidate token for each position of the prompt, leading to a number of additional forward
passes. Therefore, an improved gradient method [458] has been proposed by transforming discrete
tokens into continuous embeddings and computing the gradient on continuous space during
optimization.
• RL-based approaches. Since discrete prompts are difficult to be learned through gradient
back-propagation, a number of studies propose to formulate the discrete prompt optimization as a
reinforcement learning (RL) problem and leverage RL algorithms for optimization [462–465]. For
example, RLPrompt [462] trains a policy network to generate desired prompts with multiple reward
functions. In this approach, several effective reward stabilization strategies are also proposed to
enhance the RL training efficiency. Compared to previous work that requires sufficient data

TABLE 13: Example instructions collected from [447, 457]. The blue text denotes the task description, the red text denotes
the contextual information, the green text denotes the demonstrations, and the gold text denotes the prompt style.

Use the provided articles delimited by triple quotes to answer questions. If the answer cannot be found in the articles, write “I could not find an
answer.”
Articles: """Joao Moutinho is a Portuguese footballer who last played as a central midfielder for Premier League club Wolverhampton Wanderers
and the Portugal national team."""
Question: Is the following sentence plausible? 'Joao Moutinho was out at third.'
Answer: Let’s think step by step. Joao Moutinho is a soccer player. Being out at third is part of baseball, not soccer. So the answer is No.
...
<Demonstrations>

Articles: <insert articles, each delimited by triple quotes>
Question: <insert question>
Answer:

Prepare a meta-review by answering the following questions from the reviewer comments (provided after the questions).
1. Based on the reviewer’s comments, what are the core contributions made by this manuscript?
2. What are the common strengths of this work, as mentioned by multiple reviewers?
3. What are the common weaknesses of this work, as highlighted by multiple reviewers?
4. What suggestions would you provide for improving this paper?
5. What are the missing references mentioned by the individual reviews?
The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
Meta-review: <insert meta-review>
...
<Demonstrations>

Provide justification for your response in detail by explaining why you made the choices you actually made. A good output should be coherent,
highlight major strengths/issues mentioned by multiple reviewers, be less than 400 words in length, and finally, the response should be in English
only.

The review texts are below: <insert three comments R1 , R2 , R3 from the reviewers>
Meta-review:

CREATE TABLE Highschooler (
ID int primary key,
name text,
grade int
);
/*
3 example rows:
SELECT * FROM Highschooler LIMIT 3;
ID name grade
1234 Janie 8
5678 Mary 8
9012 Mike 9
*/
Using valid SQLite, answer the following questions for the tables provided above.
Question: What is Kyle’s id?
SQL: SELECT ID FROM Highschooler WHERE name="Kyle";
...
<Demonstrations>

Question: <insert question>
SQL:
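
The text-to-SQL prompt in Table 13 (schema, three example rows, instruction, question) can be generated from a live database, and a candidate query can then be checked with an external executor. The sketch below uses Python's built-in sqlite3 module; the helper names, the row for Kyle, and the candidate SQL string (a stand-in for model output) are assumptions for illustration.

```python
import sqlite3

def schema_prompt(conn, table, n_rows=3):
    """Render the CREATE TABLE statement plus a few example rows as an SQL comment."""
    # `table` is interpolated directly into SQL, so it must be a trusted identifier.
    create_sql = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()[0]
    query = f"SELECT * FROM {table} LIMIT {n_rows};"
    cursor = conn.execute(query)
    header = " ".join(column[0] for column in cursor.description)
    rows = "\n".join(" ".join(str(value) for value in row) for row in cursor.fetchall())
    return f"{create_sql};\n/*\n{n_rows} example rows:\n{query}\n{header}\n{rows}\n*/"

def text_to_sql_prompt(conn, table, question):
    """Combine schema context, task description, and the question into one prompt."""
    return (
        schema_prompt(conn, table)
        + "\nUsing valid SQLite, answer the following questions for the tables provided above.\n"
        + f"Question: {question}\nSQL:"
    )

# In-memory database mirroring the table in Table 13; the Kyle row is made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Highschooler (ID int primary key, name text, grade int)")
conn.executemany(
    "INSERT INTO Highschooler VALUES (?, ?, ?)",
    [(1234, "Janie", 8), (5678, "Mary", 8), (9012, "Mike", 9), (1934, "Kyle", 12)],
)

prompt = text_to_sql_prompt(conn, "Highschooler", "What is Kyle's id?")

# Stand-in for the SQL an LLM might return; executing it gives an exact result.
candidate_sql = "SELECT ID FROM Highschooler WHERE name = 'Kyle';"
result = conn.execute(candidate_sql).fetchall()
```

Running the candidate query through the database rather than asking the model for the final value is one simple instance of using an external tool to obtain precise results.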

for training, TEMPERA [463] proposes to edit prompts at test time by utilizing a pre-trained RL
agent to sequentially edit different parts of a manually-written initial prompt. Although these
methods are simple and effective, they explore a manually defined edit space (e.g., add, swap and
delete) and focus on modifying the original prompt, which limits the flexibility of prompt search.
In contrast, PRewrite [465] employs RL to train a prompt rewriter for generating new prompts
instead of modification, which does not impose any restrictions in the prompt rewriting and offers
improved flexibility in the action space.

• Edit-based approaches. For the above methods, gradient-based and RL-based tuning can be extremely
computationally demanding for ever larger models, and may not be feasible for API-based model calls
(e.g., ChatGPT). Therefore, another line of work aims to directly edit existing prompts based on
the task performance. Specifically, GPS [466] borrows an idea from the genetic algorithm and
proposes a genetic prompt search method that utilizes a language model (i.e., T5) to edit prompts
by taking the cloze task form. In addition to model based edit methods, human-defined operations
can also be employed for prompt editing [467], including delete, swap, paraphrase, and addition.
Based on these operations, they iteratively edit the prompts and greedily search for the best
prompt guided by the model performance on a small pool of examples.
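
The greedy edit loop described above can be sketched as follows. This is a simplified illustration, not the procedure of [467]: edits operate on whitespace-separated words, only delete, swap, and addition are implemented (paraphrase would need a separate model), and `toy_score` is a made-up stand-in for measuring model performance on a small pool of examples.

```python
def neighbors(tokens, vocab):
    """Yield every prompt reachable by one delete, adjacent swap, or addition edit."""
    for i in range(len(tokens)):                      # delete one word
        yield tokens[:i] + tokens[i + 1:]
    for i in range(len(tokens) - 1):                  # swap two adjacent words
        yield tokens[:i] + [tokens[i + 1], tokens[i]] + tokens[i + 2:]
    for i in range(len(tokens) + 1):                  # insert one candidate word
        for word in vocab:
            yield tokens[:i] + [word] + tokens[i:]

def greedy_edit_search(prompt, score_fn, vocab=(), max_steps=10):
    """Repeatedly apply the single best-scoring edit until no edit improves the score."""
    best = prompt.split()
    best_score = score_fn(" ".join(best))
    for _ in range(max_steps):
        candidates = list(neighbors(best, vocab))
        if not candidates:
            break
        top = max(candidates, key=lambda tokens: score_fn(" ".join(tokens)))
        top_score = score_fn(" ".join(top))
        if top_score <= best_score:                   # no improving edit: local optimum
            break
        best, best_score = top, top_score
    return " ".join(best), best_score

def toy_score(prompt):
    """Hypothetical scorer: rewards two keywords and penalizes length.

    In practice this would be task accuracy of the model on a small example pool.
    """
    words = prompt.split()
    return 2 * ("step" in words) + 2 * ("concisely" in words) - 0.1 * len(words)

start = "please kindly answer the question"
best_prompt, best_score = greedy_edit_search(start, toy_score, vocab=("step", "concisely"))
```

Each step scores every neighbor, so the number of scoring calls grows with prompt length and vocabulary size; with a real model behind the scorer, this cost is one reason to keep the example pool small.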

• LLM-based approaches. Due to the exceptional capacities of LLMs, an increasing number of studies
directly leverage LLMs as prompt generators [468–475]. Specifically, APE [468] utilizes an LLM to
generate initial prompts, then selects the best prompt with the highest accuracy, and finally improves
the best candidate through an iterative Monte Carlo search method. However, this method does not
effectively constrain the prompt search space, which might lead to unstable results. To achieve good
performance and fast convergence, one line of work utilizes heuristic methods (e.g., evolutionary
algorithms [473, 474] and adversarial learning [475]) for prompt optimization. Another line of work
draws an analogy to gradient-based model optimizers for LLM-based prompt optimization. For example,
APO [469] instructs the LLM to generate text feedback on how to refine an old prompt into new improved
prompts and then executes textual gradient descent. However, their search in the prompt space might be
inefficient without fully considering the whole refinement trace of previous prompts, thus potentially
leading to sub-optimal results. Therefore, some recent studies [470, 471] incorporate the previous
prompts with their scores to instruct LLMs for progressively generating better new prompts. To
establish more formal guidelines for designing prompt optimizers, GPO [472] draws a systematic analogy
between LLM-based prompt optimizers and gradient-based model optimizers. It further develops a more
formal LLM-based prompt optimization framework, which extensively borrows ideas from machine learning
optimization. Specifically, it retrieves relevant prompts from the previous prompts and utilizes the
generation-based refinement strategy to perform the update. In order to avoid large variation at each
iteration, GPO further adopts a cosine-based decay strategy to control the edit distance. However,
these approaches still struggle in exploring the vast space of effective prompts. Inspired by
human-like trial-and-error, prompt optimization is further formulated as a strategic planning
problem [476] that uses Monte Carlo tree search to navigate the vast prompt space.

Continuous Prompt Optimization. Different from discrete prompts, continuous prompts consist of a set of
continuous embeddings, which can be directly optimized through the gradient update based on the loss of
downstream tasks. Note that continuous prompt optimization has been mainly studied in PLMs, but draws
limited attention in the era of LLMs due to their massive magnitudes of parameters. We include the
discussion of this part for content completeness. In prior work, most studies typically rely on
supervised learning to train continuous prompts based on task data. Furthermore, in data-scarce
scenarios, transfer learning methods can be employed to alleviate the lack of labeled data on target
tasks. These two approaches are detailed below.

• Prompt learning with sufficient data. In this approach, most existing methods regard continuous
prompts as trainable model parameters and then leverage supervised learning to optimize the continuous
prompts by minimizing the cross-entropy loss based on sufficient downstream task data [404, 405, 409,
477]. As discussed in Section 5.3.1, prefix tuning [404] prepends a sequence of prefixes (i.e., a set
of trainable continuous vectors) to each Transformer layer in language models, while prompt
tuning [405] only incorporates trainable prompt vectors at the input layer. By fixing the large-scale
parameters of LLMs and only tuning the continuous prompt vector, this kind of approach can be extremely
parameter-efficient (Section 5.3). However, these approaches are typically independent of the inputs,
lacking sufficient consideration of input semantics. Therefore, the authors in [477] propose context
tuning, where the continuous prompts are derived based on the input text and learned through the
downstream task losses.

• Prompt transferring with scarce data. Supervised learning approaches demand sufficient training data
to learn optimal continuous prompts, which may not work well in data-scarce domains and tasks. To
address this problem, SPoT [478] proposes a prompt-based transfer learning approach, which first learns
a single continuous prompt for several representative source tasks and then uses this prompt to
initialize the prompt for a target task. However, this approach leverages the same prompt for solving
all instances of the target task. For a single task, even a well-learned prompt may not be suitable for
all the data instances from a large population. To address this issue, an improved method [479] designs
an adaptive attention mechanism during the prompt transfer process to derive the target prompts,
considering both task- and instance-level information. The prompt transfer paradigm can leverage the
knowledge of data-sufficient source tasks encoded in source prompts for solving data-scarce target
tasks.

6.2 In-Context Learning

As a special prompting form, in-context learning (ICL) was first proposed along with GPT-3 [55] and has
become a typical approach to utilizing LLMs.

6.2.1 ICL Formulation

As stated in [55], ICL uses a formatted natural language prompt, consisting of the task description
and/or a few task examples as demonstrations. Figure 14 presents an illustration of ICL. First,
starting with a task description, a few examples are selected from the task dataset as demonstrations.
Then, they are combined in a specific order to form natural language prompts with specially designed
templates. Finally, the test instance is appended to the demonstration as the input for LLMs to
generate the output. Based on task demonstrations, LLMs can recognize and perform a new task without
explicit gradient update.

Formally, let $D_k = \{f(x_1, y_1), \ldots, f(x_k, y_k)\}$ represent a set of demonstrations with $k$
examples, where $f(x_k, y_k)$ is the prompt function that transforms the $k$-th task example into
natural language prompts. Given the task description $I$, demonstrations $D_k$, and a new input query
$x_{k+1}$, the prediction of the output $\hat{y}_{k+1}$ generated from LLMs can be

formulated as follows (see footnote 35):

$$\mathrm{LLM}\big(I,\ \underbrace{f(x_1, y_1), \ldots, f(x_k, y_k)}_{\text{demonstrations}},\ f(\underbrace{x_{k+1}}_{\text{input}}, \underbrace{\_\_\_}_{\text{answer}})\big) \rightarrow \hat{y}_{k+1}. \tag{11}$$

where the actual answer $y_{k+1}$ is left as a blank to be predicted by the LLM. Since the performance
of ICL heavily relies on demonstrations, it is important to properly design them in the prompts.
According to the construction process in Equation (11), we focus on three major aspects of formatting
demonstrations in the prompts, including how to select examples that make up demonstrations, format
each example into the prompt with the function $f(\cdot)$, and arrange demonstrations in a reasonable
order.

35. When ICL was introduced in the GPT-3 paper [55], it was originally defined to be a combination of
the task description and demonstration examples, wherein either component is dispensable. Following
this definition, when an LLM is required to solve an unseen task by using only task descriptions, it
can also be considered to perform ICL for task solving, whereas the ICL ability can be enhanced by
instruction tuning.

A comprehensive review of ICL has been presented in the survey paper [50], and we suggest that readers
refer to it for a more general, detailed discussion on this topic. Compared with this survey, we
specifically focus on the discussion of applying ICL to LLMs in two major aspects, i.e., demonstration
design and the underlying mechanism of ICL. Also, ICL has a close connection with instruction tuning
(discussed in Section 5.1) in that both utilize natural language to format the task or instances.
However, instruction tuning needs to fine-tune LLMs for adaptation, while ICL only prompts LLMs for
utilization. Furthermore, instruction tuning can enhance the ICL ability of LLMs to perform target
tasks, especially in the zero-shot setting (only using task descriptions) [69].

6.2.2 Demonstration Design

Several studies have shown that the effectiveness of ICL is highly affected by the design of
demonstrations [425, 480, 481]. Following the discussion in Section 6.2.1, we will introduce the
demonstration design of ICL from three major aspects, i.e., demonstration selection, format, and order.

Demonstration Selection. The performance of ICL tends to have a large variance with different
demonstration examples [420], so it is important to select a subset of examples that can effectively
leverage the ICL capability of LLMs. There are two main demonstration selection approaches, namely
heuristic and LLM-based approaches:

• Heuristic approaches. Due to their simplicity and low costs, existing work widely adopts heuristic
methods to select demonstrations. Several studies employ a k-NN based retriever to select examples that
are semantically relevant to the query [420, 482]. However, they perform the selection individually for
each example, rather than evaluating the example set as a whole. To resolve this issue, diversity-based
selection strategies are proposed to choose the most representative set of examples for specific
tasks [483, 484]. Furthermore, in [485], both relevance and diversity are taken into consideration when
selecting demonstrations.

• LLM-based approaches. Another line of work selects demonstrations by making use of LLMs. For example,
LLMs can be utilized to directly measure the informativeness of each example according to the
performance gain after adding the example [486]. In addition, EPR [421] proposes a two-stage retrieval
approach that first recalls similar examples with an unsupervised method (e.g., BM25) and then ranks
them using a dense retriever (trained with positive and negative examples labeled by LLMs). As an
alternative approach, the task of demonstration selection can be formulated as an RL problem, where
LLMs serve as the reward function to provide feedback for training the policy model [487]. Since LLMs
perform well for text annotation [488], some recent studies employ the LLM itself as the demonstration
generator without human intervention [489].

To summarize, as discussed in [490], the selected demonstration examples in ICL should contain
sufficient information about the task to solve as well as be relevant to the test query, for the above
two selection approaches.

Demonstration Format. After selecting task examples, the next step is to integrate and format them into
a natural language prompt for LLMs. A straightforward method is to instantiate a pre-defined template
with the corresponding input-output pairs [36]. To construct more informative templates, recent studies
consider adding task descriptions [69] or enhancing the reasoning capability of LLMs with
chain-of-thought prompts [33]. For instance, in [179], the authors collect a large-scale dataset with
task descriptions written by humans. After tuning with this dataset, the performance on seen tasks can
be boosted, and LLMs can also generalize to unseen tasks to some extent. To reduce the annotation
costs, a semi-automated approach has been proposed in [147] by employing a seed set consisting of
human-written task descriptions to guide LLMs to generate task descriptions for new tasks. Since it is
costly to manually annotate demonstration formats for different tasks, some work also studies how to
automatically generate high-quality ones. As two representative methods, Auto-CoT [427] leverages LLMs
with the zero-shot prompt “Let’s think step by step” for generating intermediate reasoning steps, while
least-to-most prompting [432] first queries LLMs to perform problem decomposition and then utilizes
LLMs to sequentially solve sub-problems based on the intermediate answers to previously solved ones.

Demonstration Order. LLMs are shown to sometimes suffer from the recency bias, i.e., they are prone to
repeat answers that are near the end of demonstrations [481]. Thus, it is important to arrange
demonstrations (i.e., task examples) in a reasonable order. Early work proposes several heuristic
methods to quickly find a good order. For example, demonstrations can be directly organized according
to their similarity to the query in the embedding space [420]: the more similar, the closer to the end.
In addition, global and local entropy metrics can be used to score different demonstration
orders [425]. To integrate more task information, some recent studies propose to minimize the code
length required to compress and transmit task labels, which is inspired by information theory [491].
However, these methods need additional labeled data as the validation set to evaluate the performance
of specific demonstration orders. To eliminate this need, the authors in [425] propose

[Figure 14, left panel: In-Context Learning]
Answer the following mathematical reasoning questions:

Q: If you have 12 candies and you give 4 candies to your friend, how many candies do you have left?
A: The answer is 8.
Q: If a rectangle has a length of 6 cm and a width of 3 cm, what is the perimeter of the rectangle?
A: The answer is 18 cm.
(N demonstrations)

Q: Sam has 12 marbles. He gives 1/4 of them to his sister. How many marbles does Sam have left?
LLM output: A: The answer is 9.

[Figure 14, right panel: Chain-of-Thought Prompting]
Answer the following mathematical reasoning questions:

Q: If a rectangle has a length of 6 cm and a width of 3 cm, what is the perimeter of the rectangle?
A: For a rectangle, add up the length and width and double it. So, the perimeter of this rectangle is
(6 + 3) x 2 = 18 cm. The answer is 18 cm.
(N demonstrations)

Q: Sam has 12 marbles. He gives 1/4 of them to his sister. How many marbles does Sam have left?
LLM output: A: He gives (1 / 4) x 12 = 3 marbles. So Sam is left with 12 – 3 = 9 marbles. The answer is 9.

Legend: task description, demonstration, chain-of-thought, query.

Fig. 14: A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs
with a natural language description, several demonstrations, and a test query, while CoT prompting involves a series of
intermediate reasoning steps in prompts.
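
The prompt construction in Equation (11) and Figure 14 can be sketched in a few lines. This is an illustrative sketch only: `Example`, `format_example` (playing the role of the prompt function f), and `build_prompt` are hypothetical names, and the "Q:/A:" template is simply the one shown in the figure. No model is called; the code only assembles the text that would be sent to an LLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    question: str
    answer: str
    rationale: Optional[str] = None  # chain of thought, if one was annotated


def format_example(question: str, answer: Optional[str] = None,
                   rationale: Optional[str] = None) -> str:
    """Prompt function f: render one example; a missing answer is left blank."""
    if answer is None:
        return f"Q: {question}\nA:"  # the query, f(x_{k+1}, ___)
    if rationale:
        return f"Q: {question}\nA: {rationale} The answer is {answer}."
    return f"Q: {question}\nA: The answer is {answer}."


def build_prompt(instruction: str, demos: List[Example], query: str,
                 use_cot: bool = False) -> str:
    """Concatenate task description I, k demonstrations, and the query."""
    parts = [instruction]
    for demo in demos:
        rationale = demo.rationale if use_cot else None
        parts.append(format_example(demo.question, demo.answer, rationale))
    parts.append(format_example(query))
    return "\n\n".join(parts)


instruction = "Answer the following mathematical reasoning questions:"
demos = [
    Example(
        question=("If a rectangle has a length of 6 cm and a width of 3 cm, "
                  "what is the perimeter of the rectangle?"),
        answer="18 cm",
        rationale=("For a rectangle, add up the length and width and double it. "
                   "So, the perimeter of this rectangle is (6 + 3) x 2 = 18 cm."),
    )
]
query = ("Sam has 12 marbles. He gives 1/4 of them to his sister. "
         "How many marbles does Sam have left?")

icl_prompt = build_prompt(instruction, demos, query)
cot_prompt = build_prompt(instruction, demos, query, use_cot=True)
print(cot_prompt)
```

The only difference between the two prompts is whether each demonstration carries its rationale, which mirrors the ⟨input, output⟩ versus ⟨input, CoT, output⟩ distinction.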

to sample the validation data from the LLM itself.

6.2.3 Underlying Mechanism

After pre-training, LLMs can exhibit intriguing ICL capability without being updated. In what follows,
we discuss two key questions about the ICL ability of LLMs, i.e., “how does pre-training affect the ICL
ability” and “how do LLMs perform ICL during inference”.

How Pre-Training Affects ICL? ICL was first proposed with GPT-3 [55], and it has been shown that the
ICL ability becomes more significant with a larger model size. Further, some studies reveal that
small-scale PLMs can also demonstrate a strong ICL ability by continual pre-training [492] or
fine-tuning [493] on specially designed training tasks, which typically involve additional task
examples in the input during the training process. This suggests that the design of training tasks is
an important factor influencing the ICL capability of LLMs. Besides training tasks, recent studies have
also investigated the relationship between ICL and pre-training corpora [490, 494]. For example, ICL
can be theoretically explained as the product of pre-training on documents that exhibit long-range
coherence [490]. Further, another study [494] shows theoretically that, when parameters and data are
scaled up, LLMs based on next-word prediction can develop the ICL ability by learning from the
compositional structure (e.g., how words and phrases are combined to form larger linguistic units like
sentences) present in language data.

How LLMs Perform ICL? At the inference stage, researchers focus on analyzing how the ICL capability
operates based on given demonstrations since no explicit learning or updating is involved. According to
the discussion in [495], there are two main ways for LLMs to utilize demonstrations: task recognition
and task learning.

• Task recognition. In the first way, LLMs recognize the task from demonstrations and utilize the prior
knowledge obtained from pre-training to solve new test tasks. A Probably Approximately Correct (PAC)
framework [496] has been proposed to assess the learnability of ICL. It assumes that there exists a
latent variable representing the task in the pre-training data, and LLMs have been shown to be capable
of capturing this variable from demonstrations, enabling them to recognize the task in ICL. Also, the
interpretation of ICL as task recognition is supported by several empirical studies [480, 497]. For
example, it has been observed that replacing the inputs or labels of demonstrations with random ones
sampled from the input or label space does not seriously hurt the performance of LLMs, indicating that
LLMs mainly recognize the target task from demonstrations instead of learning from them [480, 495].
Similarly, LLMs can exhibit decent performance even if the prompt template is irrelevant or
misleading [497].

• Task learning. In the second way, LLMs learn new tasks unseen in the pre-training stage only through
demonstrations. Specifically, task learning is analyzed mainly from the perspective of gradient descent
and considered as implicit fine-tuning [65, 498]. Then, ICL can be explained as follows: by means of
forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform
gradient descent via the attention mechanism. Experiments also show that certain attention heads in
LLMs are capable of performing task-agnostic atomic operations (e.g., copying and prefix matching),
which are closely related to the ICL ability [499]. Furthermore, some studies abstract ICL as an
algorithm learning process [500]. For example, the authors in [500] find that LLMs essentially encode
implicit models through their parameters during pre-training. With the examples provided in ICL, LLMs
can implement learning algorithms such as gradient descent or directly compute the closed-form solution
to update these models during forward computation. Under this explanation framework, it has been shown
that LLMs can effectively learn simple linear functions and even some complex functions like decision
trees with ICL [500].

As discussed in a recent study [495], LLMs exhibit the abilities of both task recognition and task
learning in ICL, but the two abilities seem to emerge at different model scales. As shown in the
experiments [495], the ability of task recognition is easier to obtain, and even a small LM with only
350M parameters can exhibit this ability, while task learning can only emerge for LLMs with at least
66B parameters. Another study [501] also supports this finding with specially designed experiments.
They set up the tasks with flipped and semantically unrelated labels in the experiment, which require
task learning when performing ICL. The results suggest that small LMs tend to disregard the labels and
mainly depend on their prior knowledge to accomplish the task, while LLMs have the ability to surpass
their prior knowledge and acquire new knowledge from demonstrations, resulting in better outcomes.
Furthermore, to improve the task learning ability, Meta-In-Context Learning [502] proposes to include
multiple related tasks instead of just a single one in the prompt. In addition, Symbol Tuning [503]
fine-tunes LLMs on demonstrations with semantically unrelated labels (e.g., foo/bar instead of
positive/negative for sentiment analysis), forcing LLMs to learn the task from demonstrations instead
of relying on prior knowledge.

6.3 Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting [33, 504] is an improved prompting strategy to boost the performance
of LLMs on complex reasoning tasks, such as arithmetic reasoning [505], commonsense reasoning [506],
and symbolic reasoning [33]. Instead of simply constructing the prompts with input-output pairs like
ICL, CoT prompting further incorporates intermediate reasoning steps, which serve as the bridge between
inputs and outputs. Figure 14 presents an illustration of CoT. In the following part, we will first
elaborate on the basic CoT prompting approach and its improved strategies, then discuss when and why
CoT prompting works.

6.3.1 Basic CoT Prompting Approach

CoT prompting is first proposed as an extension of ICL [33], which augments each demonstration
⟨input, output⟩ as ⟨input, CoT, output⟩. A CoT is a series of intermediate reasoning steps for
connecting the input and output. With these augmented demonstrations, LLMs can follow them to generate
CoTs and the answer for a new input. However, unlike ⟨input, output⟩ pairs in ICL, CoTs are difficult
to obtain and usually require human annotation. Fortunately, it has been found that LLMs can be
triggered to generate CoTs through simple instructions like “Let’s think step by step.” [507], making
CoT prompting easy to use. There are also alternative magic prompts that can elicit the ability of CoT
reasoning and further improve the performance of LLMs, such as “Take a deep breath and work on this
problem step-by-step.” [470].

As illustrated in Figure 15, the generation process of CoT follows a chain structure in the basic CoT
prompting approach, where LLMs generate CoTs step by step. Typically, CoT takes the format of natural
language text. However, textual CoTs may not work well on complex tasks that require rigorous logic for
reasoning. Considering this, some work uses code [508, 509] due to its structured and precise nature.
Furthermore, the authors in [510] propose to dynamically select text or code as the format of CoTs to
combine their advantages.

6.3.2 Improved CoT Prompting Strategies

Despite the performance improvement in complex reasoning tasks, CoT prompting still suffers from
problems like incorrect reasoning and instability. In this part, we first introduce how to design
better CoT prompts and enhanced CoT generation strategies, and then introduce the extension of the
basic chain structure of CoT. Figure 15 illustrates the evolution of representative CoT prompting
strategies.

Better Prompt Design. Since CoT prompting relies on prompts to elicit the reasoning capabilities of
LLMs, the design of prompts is critical to its performance. As a direct approach, it is shown that
using diverse CoTs (i.e., multiple reasoning paths for each problem) can effectively enhance the
performance [430]. Another intuitive idea is that prompts with more complex reasoning paths are more
likely to elicit the reasoning ability of LLMs [426], which can result in higher accuracy in generating
correct answers. However, all these approaches rely on annotated CoT datasets, which limits their use
in practice. To overcome this limitation, magic instructions such as “Let’s think step by step” can be
used to automatically construct CoTs by prompting LLMs [427].

Enhanced CoT Generation. Since LLMs are prone to producing incorrect reasoning steps and exhibiting
instability in the generation process, a number of studies [429, 511] aim to improve the generation of
CoT. In this part, we will introduce two typical approaches to enhancing the generation of CoT:
sampling- and verification-based methods.

• Sampling-based methods. LLMs are known to suffer from instability during inference, which can lead to
unfaithfulness in the generated reasoning steps. To address this issue, some work proposes to sample
multiple reasoning paths instead of using greedy decoding. As a representative solution,
self-consistency [429] first generates several reasoning paths and then takes an ensemble over the
corresponding answers, selecting the most consistent one through majority voting. However, such a
method can still lead to wrong answers when most of the reasoning paths are misled. Considering this,
the authors in [426] only vote on the k most complex reasoning paths based on their observation that
reasoning paths with higher complexity (e.g., more reasoning steps) usually have better performance.
Furthermore, MCR [512] proposes referring to the steps from other reasoning paths when generating the
next step, and performs reasoning across multiple reasoning paths to generate the final answer.

• Verification-based methods. The sequential nature of reasoning steps in CoTs can lead to the
accumulation of errors in the generated CoTs when certain steps are incorrect. To mitigate this
problem, recent studies propose to verify the correctness of generated reasoning steps with either
trained verifiers or LLMs themselves. For example, DIVERSE [511] trains solution-level and step-level
verifiers respectively to

[Figure 15 panels, each mapping an input to an output: CoT; sampling-based CoT (with an ensemble step);
verification-based CoT (with a verification step); ToT; GoT. Legend: reason, backtrack, aggregate;
unevaluated thought, positive thought, negative thought.]

Fig. 15: An illustration of the evolution of CoT prompting strategies. It begins with the basic CoT approach and progresses
to enhanced CoT generation techniques, including sampling-based and verification-based methods. Finally, it extends to
variations of the chain structure, such as trees and graphs. Here, “thought” refers to an intermediate reasoning step as
stated in [33, 444].
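
The sampling-based panel of Figure 15 can be made concrete with a small sketch of answer aggregation. It assumes the reasoning paths have already been sampled from an LLM (here they are hard-coded, hypothetical strings with one step per line) and that each path ends with "The answer is ...". The function names are illustrative; `complexity_vote` is a simplified reading of voting over the k most complex paths [426], with the number of lines used as a crude proxy for complexity.

```python
import re
from collections import Counter
from typing import List, Optional

# Matches the final "The answer is X." at the end of a reasoning path.
ANSWER_RE = re.compile(r"The answer is\s+(.+?)\.?\s*$")


def extract_answer(path: str) -> Optional[str]:
    """Pull the final answer out of a reasoning path, if it has one."""
    match = ANSWER_RE.search(path.strip())
    return match.group(1) if match else None


def majority_vote(paths: List[str]) -> Optional[str]:
    """Self-consistency style aggregation: the most frequent answer wins."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def num_steps(path: str) -> int:
    """Complexity proxy (assumption): count non-empty lines as steps."""
    return len([line for line in path.split("\n") if line.strip()])


def complexity_vote(paths: List[str], k: int) -> Optional[str]:
    """Vote only among the k paths with the most reasoning steps."""
    ranked = sorted(paths, key=num_steps, reverse=True)
    return majority_vote(ranked[:k])


# Hypothetical sampled paths for the marbles question in Figure 14.
sampled_paths = [
    "He gives away 4 marbles.\nThe answer is 8.",
    "12 - 4 = 8.\nThe answer is 8.",
    "He gives (1 / 4) x 12 = 3 marbles.\n"
    "So Sam is left with 12 - 3 = 9 marbles.\nThe answer is 9.",
]

print(majority_vote(sampled_paths))       # plain majority over all paths
print(complexity_vote(sampled_paths, 1))  # vote among the most complex path(s)
```

The example is deliberately adversarial: two short paths share the same mistake, so plain majority voting returns the wrong answer while restricting the vote to the longest path does not. Real samples will rarely separate this cleanly.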

examine the reasoning steps at different granularities. Another approach [513] utilizes LLMs to verify
the correctness of reasoning steps through step-by-step self-verification with a specially designed
reasoning format. In addition, several studies propose backward reasoning for verification: they first
deduce the necessary question conditions [514, 515] or variables [516] from the model’s predictions,
and then compare them with the original ones.

Reasoning Structure Extension. Despite its generality, the chain reasoning structure of basic CoT
prompting limits its effectiveness in solving complex tasks, which require exploration like foresight
and backtracking during inference. Therefore, many studies have been devoted to extending the reasoning
structure by designing more intricate thought processes, e.g., tree- and graph-structured reasoning.

• Tree-structured reasoning. This approach (exemplified by Tree of Thoughts (ToT) [444, 517])
formulates the reasoning process in a hierarchical tree structure, where intermediate thoughts are
nodes. In this way, it enables LLMs to explore multiple reasoning paths in parallel and further
supports the operation of lookahead and backtracking to facilitate more comprehensive decisions. In
addition, TouT [518] takes the uncertainty of intermediate thoughts into account for thought evaluation
based on Monte Carlo Dropout.

• Graph-structured reasoning. Although the tree structure facilitates parallel reasoning, it also
imposes restrictions on the reasoning process. With more complex topological structures, graphs offer
greater flexibility in reasoning, enabling the characterization of more intricate relationships and
interactions. For instance, Graph of Thoughts (GoT) [519, 520] conceptualizes the reasoning process as
an arbitrary graph, where vertices denote intermediate thoughts and edges denote the interdependence
between these thoughts. Compared with ToT, it can further utilize thoughts from other reasoning paths
when generating new thoughts. However, such an approach requires a large number of interactions with
LLMs, making the thought exploration process highly inefficient. To reduce potentially meaningless
thought exploration, XoT [521] further proposes to guide the search of thoughts with pre-trained policy
and value networks.

6.3.3 Further Discussion on CoT Prompting

In this part, we present discussions regarding two fundamental questions related to CoT prompting,
i.e., “when does CoT prompting work for LLMs” and “why can LLMs perform CoT reasoning”.

When CoT Prompting Works For LLMs? Since CoT reasoning is an emergent ability [31], it only has a
positive effect on sufficiently large models (typically containing 10B or more parameters [33]) but not
on small models. Moreover, since CoT prompting augments the standard prompting with intermediate
reasoning steps, it is mainly effective for the tasks that require step-by-step reasoning [33], e.g.,
arithmetic reasoning, commonsense reasoning, and symbolic reasoning. For other tasks that do not rely
on complex reasoning, however, CoT prompting might lead to worse performance than standard
prompting [431], e.g., MNLI-m/mm, SST-2, and QQP from GLUE [279]. Interestingly, it seems that the
performance gain brought by CoT prompting could be significant only when standard prompting yields poor
results [33].

Why LLMs Can Perform CoT Reasoning? As the second question, we discuss the underlying mechanism of CoT
prompting in the following two aspects.

• The source of CoT reasoning ability. Regarding the source of CoT reasoning capability, it is widely
hypothesized that it can be attributed to training on code since models trained on it show a strong
reasoning ability [47, 522, 523]. Intuitively, code data is well organized with algorithmic logic and
programming flow, which may be useful to improve the reasoning performance of LLMs. However, this
hypothesis still

lacks publicly reported evidence of ablation experiments (with and without training on code). In
addition, instruction tuning seems not to be the key reason for obtaining the CoT reasoning ability,
since it has been empirically shown that instruction tuning on non-CoT data does not improve the
performance on held-out CoT reasoning benchmarks [69].

• The effect of CoT prompting components. The major distinction between CoT prompting and standard
prompting is the incorporation of reasoning paths prior to the final answer. Thus, some researchers
investigate the effects of different components in the reasoning paths. Specifically, a recent study
identifies three key components in CoT prompting, namely symbols (e.g., numerical quantities in
arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the
remaining tokens that are not symbols or patterns) [524]. It is shown that the latter two parts (i.e.,
patterns and text) are essential to the model performance, and removing either one would lead to a
significant performance drop. However, the correctness of symbols and patterns does not seem critical.
Further, there exists a symbiotic relationship between text and patterns: the text helps LLMs to
generate useful patterns, and patterns help LLMs understand tasks and generate texts that help solve
them [524].

In summary, CoT prompting provides a general and flexible approach to eliciting the reasoning ability
of LLMs. There are also some preliminary attempts to extend this technique to solve multimodal [525]
and multilingual tasks [526].

6.4 Planning

Prompting with ICL and CoT is a conceptually simple yet general approach to solving various tasks.
However, this approach struggles with complex tasks like mathematical reasoning [527] and multi-hop
question answering [528]. As an enhanced approach, prompt-based planning has been proposed to break
down complex tasks into smaller sub-tasks and generate a plan of actions to accomplish the task.

6.4.1 The Overall Framework

In this part, we first formulate the general planning paradigm of LLMs for solving complex tasks, which
is illustrated in Figure 16.

[Figure 16 diagram labels: Task; Planning Framework; Result; Task Planner (LLM); Plan (generate &
refine); Plan Executor; Action; Feedback; Memory; Tool; Environment; Internal; External; LLM, Human,
World, Others.]

Fig. 16: An illustration of the formulation for prompt-based planning by LLMs for solving complex tasks.

In this paradigm, there are typically three components: task planner, plan executor, and environment
(footnote 36). Specifically, task planner, which is played by LLMs, aims to generate the whole plan to
solve a target task. The plan can be presented in various forms, e.g., an action sequence in the form
of natural language [432] or an executable program written in a programming language [436]. The
LLM-based task planner can be enhanced with the memory mechanism for plan storage and retrieval, which
is helpful for long-horizon tasks. Then, plan executor is responsible for executing the actions in the
plan. It can be implemented by models like LLMs for textual tasks [434] or by tools like code
interpreters for coding tasks [443]. Furthermore, environment refers to where the plan executor carries
out the actions, which can be set differently according to specific tasks, e.g., the LLM itself [529]
or an external virtual world like Minecraft [530]. It provides feedback about the execution result of
the action to the task planner, either in the form of natural language [443] or from other multimodal
signals [439].

For solving a complex task, the task planner first needs to clearly understand the task goal and
generate a reasonable plan based on the reasoning of LLMs (see Section 6.4.2). Then, the plan executor
acts according to the plan in the environment, and the environment will produce feedback for the task
planner (see Section 6.4.3). The task planner can further incorporate the feedback obtained from the
environment to refine its initial plan and iteratively perform the above process to get better results
as the task solution (see Section 6.4.4).

6.4.2 Plan Generation

Plan generation focuses on directly generating action sequences by prompting LLMs. Based on the format
of the generated plans, existing work can be divided into two groups: text-based and code-based
approaches.

Text-based Approaches. It is straightforward for LLMs to generate plans in the form of natural
language. In this approach, LLMs are prompted to generate a sequence of actions for the plan executor
to perform and solve the complex task. For example, Plan-and-Solve [434] adds explicit instructions
like “devise a plan” to directly prompt the LLM for planning in a zero-shot manner, while
Self-planning [531] and DECOMP [433] add demonstrations in the prompt to guide the LLM to devise a plan
through ICL. Along this line, some work further considers incorporating extra tools or models when
planning. For example,
36. Despite the similarity with RL, our formulation decouples the ToolFormer [80] first annotates a pre-training corpus with
planning and execution phases, whereas in RL, they are typically potential API calls using LLMs, and then fine-tunes LLMs
interleaved in the agent. This paradigm is defined in a general yet
slightly loose way, and it mainly aims to help readers understand the on it, so that LLMs can learn when and how to call APIs
key idea underlying the planning approaches of LLMs. and incorporate the results returned by APIs during gener-
55

ation. HuggingGPT [437] introduces the models available in 6.4.4 Plan Refinement
HuggingFace and regards LLMs as the controller to select With access to feedback from the environment, the task
suitable models based on their descriptions and aggregate planner can accordingly refine its current plan and itera-
their results as the final solution. tively go through the “planning – execution – refinement” loop
for better results. In this part, we summarizes three major
Code-based Approaches. Although text-based approaches
refinement approaches in existing work.
sound intuitive, they cannot guarantee faithful execution of
the plan, which may lead to failure even when the plan is Reasoning. The feedback data from the environment may
sound. To address this issue, code-based approaches have not be directly suitable to be utilized by LLMs for plan
been proposed to generate more verifiable plans in the refinement, e.g., containing irrelevant information or taking
form of executable code in programming languages, e.g., a non-language form. To solve this, some work adds the
Python or PDDL. In this way, LLMs are first prompted explicit reasoning process to extract critical information
to generate the program and then utilize a deterministic from feedback [441, 442]. For example, React [442] prompts
solver to execute it. For example, Faithful CoT [435] and LLMs with demonstrations to generate reasoning traces
PAL [436] decompose a reasoning task into two stages: at over feedback. It has been widely used in autonomous agent
the first stage, the LLM generates a plan conditioned on the projects, such as AutoGPT [536], which can automatically
query; at the second stage, a deterministic solver executes reason over the observed feedback to revise the initial
the plan to derive the final answer. Furthermore, code-based plan for solving various user requests. However, these ap-
approaches can be applied to embodied agents in a similar proaches typically fix the order of reasoning and planning.
way. For example, PROGPROMPT [532] and LLM+P [533] To support flexible switching between the two processes for
first utilize LLMs to generate plans in the form of python better performance, ChatCoT [441] further unifies the tool-
functions or PDDL files, and then leverage a virtual agent augmented reasoning process into a multi-turn conversation
or classical planner to solve the problem according to the between the LLM-based task planner and the tool-based
code-based plans. environment.
Backtracking. Early methods mainly consider planning
6.4.3 Feedback Acquisition forward actions while maintaining the existing plan, thus
likely leading to local optimal plans based on a short-term
After executing the generated plan, the environment would evaluation. To solve this, Tree of Thoughts [529] allows back-
produce the feedback signal to the LLM-based task planner, tracking with search algorithms like breadth-first and depth-
which can be used to refine its initial plan for better results. first search to make global planning. It refines the plan
In existing work, there are typically two sources of feedback step by step by backtracking to the last state in the initial
from the environment, depending on their relationship with plan and choosing the next unexplored action. Furthermore,
the LLM-based task planner: internal (i.e., the LLM itself) some studies [439, 537] utilize feedback signals to revise the
and external (e.g., tools or virtual worlds) feedback. entire plan. For example, DEPS [537] selects a better plan
according to feedback signals, while TIP [439] adds feedback
Internal Feedback. The LLM itself can be utilized as a signals to prompts for the LLM-based planner to revise each
feedback provider. One straightforward way is to directly step in the initial plan.
evaluate the quality of the generated plans through prompt-
ing. For example, RAP [440] evaluate the likelihood that Memorization. In order to handle long-horizon tasks, it has
each candidate plan can lead to task success, while Tree of become a key approach to aid plan refinement with long-
Thoughts [529] proposes to vote across plans by making term memory in addition to utilizing the short-term memory of
comparisons between them. Further, LLMs can provide LLMs through ICL. For example, Reflexion [443] stores the
feedback based on the intermediate results from the plan feedback from self-reflection into the memory, so previous
executor. For example, Reflexion [443] utilizes LLMs to feedback can be retrieved for plan refinement. Generative
transform sparse result signals (e.g., success or failure) into Agents [535] designs the memory stream mechanism as the
concrete text-based feedback (e.g., “You should recommend core component of agents for action planning and reflection.
comedies that the user mentions in the query instead of horror Further, the skill library mechanism [438, 530] is proposed
movies”) and stores this feedback in long-term memory for to store successful plans in the library, which can be reused
future planning. and synthesized as complex plans for novel tasks. To imple-
ment the long-term memory mechanism, tools like vector
External Feedback. In addition to LLMs, external objects databases (e.g., milvus [538]) can be used to encode plans or
can also provide feedback signals. For example, tools like feedbacks into high-dimensional vectors for efficient storage
code interpreters are widely used in programming tasks to and retrieval at a large scale. MemoryBank [539] further
provide real-time error messages [443], models like stable proposes the memory updating mechanism to allow mem-
diffusion [534] can be used in multimodal tasks to provide ory forgetting and strengthening following the Ebbinghaus
visual perception [439], and virtual worlds like Minecraft Forgetting Curve theory.
can provide immersive experiences [530]. Besides, some
work (e.g., Generative Agents [535]) explores multi-agent
collaboration in simulated environments, where each agent 7 C APACITY AND E VALUATION
receives feedback not only from interaction with the envi- To examine the effectiveness and superiority of LLMs, a
ronment but also from communication with other agents. surge of tasks and benchmarks have been proposed for

conducting empirical ability evaluation and analysis. In this section, we first introduce three types of basic ability evaluation of LLMs for language generation and understanding, then present several advanced ability evaluations with more complicated settings or goals, and finally discuss existing benchmarks, evaluation approaches, and empirical analysis.

7.1 Basic Ability

In this part, we mainly focus on three basic types of ability evaluation for LLMs, i.e., language generation, knowledge utilization, and complex reasoning. It is noted that we do not intend to have complete coverage of all the related tasks, but instead only focus on the most widely discussed or studied tasks for LLMs. Next, we introduce these tasks in detail.

7.1.1 Language Generation

According to the task definition, existing tasks about language generation can be roughly categorized into language modeling, conditional text generation, and code synthesis tasks. Note that although code synthesis is not a typical NLP task, we include it for discussion because it can be directly solved by a number of LLMs (trained on code data) in a similar generation approach as natural language text.

Language Modeling. As the most fundamental ability of LLMs, language modeling aims to predict the next token based on the previous tokens [1], which mainly focuses on the capacity of basic language understanding and generation. For evaluating such an ability, typical language modeling datasets that existing work uses include Penn Treebank [540], WikiText-103 [541], and the Pile [166], where the metric of perplexity is commonly used for evaluating the model performance under the zero-shot setting. Empirical studies [55, 93] show that LLMs bring substantial performance gains over the previous state-of-the-art methods on these evaluation datasets. To better test the modeling capacity of long-range dependencies in text, the LAMBADA dataset [252] has been introduced, where LLMs are required to predict the last word of sentences based on a paragraph of context. Then, the accuracy and perplexity of the predicted last words are employed to evaluate LLMs. As shown in existing work, the performance on the language modeling tasks typically follows the scaling law [30], which means that scaling language models would improve the accuracy and reduce the perplexity.

Conditional Text Generation. As an important topic in language generation, conditional text generation [48] focuses on generating texts satisfying specific task demands based on the given conditions, typically including machine translation [626], text summarization [550], and question answering [559]. To measure the quality of the generated text, automatic metrics (e.g., Accuracy, BLEU [627] and ROUGE [628]) and human ratings have been typically used for evaluating the performance. Due to the powerful language generation capabilities, LLMs have achieved remarkable performance on existing datasets and benchmarks. For instance, GPT-4 exhibits performance comparable to commercial translation products, even for the translation task of languages with significant linguistic distance [629]. On news summarization tasks (i.e., CNN/DM and XSum), LLMs also demonstrate performance comparable to that of human freelance writers [630]. Despite the rapid progress on model capacity, there are increasing concerns on the feasibility of existing automatic metrics to faithfully assess the performance of LLMs in conditional text generation tasks [630–632]. As the alternatives to automatic metrics, recent studies also propose to incorporate LLMs as generation evaluators to examine the quality of the generated content [152, 633, 634]. Moreover, researchers also explore more challenging language generation tasks for LLMs, such as structured data generation [451] and long text generation [46, 635, 636].

Code Synthesis. In addition to generating high-quality natural language text, existing LLMs also show strong abilities to generate formal language, especially computer programs (i.e., code) that satisfy specific conditions, called code synthesis [637]. Unlike natural language generation, as the generated code can be directly checked by execution with corresponding compilers or interpreters, existing work mostly evaluates the quality of the generated code from LLMs by calculating the pass rate against the test cases, i.e., pass@k^37. Recently, several code benchmarks focusing on functional correctness are proposed to assess the code synthesis abilities of LLMs, such as APPS [376], HumanEval [105], and MBPP [223]. Typically, they consist of diverse programming problems, with text specification and test cases for correctness checking. To improve such an ability, it is key to fine-tune (or pre-train) LLMs on code data, which can effectively adapt LLMs to code synthesis tasks [86]. In addition, existing work has proposed new strategies to generate code, e.g., sampling multiple candidate solutions [223] and planning-guided decoding [638], which can be considered as the imitation of bug-fixing and code-planning processes by programmers. Impressively, LLMs have recently shown competitive performance with humans by achieving a ranking of the top 28% among users on the programming contest platform Codeforces [114]. Further, GitHub Copilot has been released to assist programming in coding IDEs (e.g., Visual Studio and JetBrains IDEs), which can support a variety of languages including Python, JavaScript, and Java. A viewpoint article entitled “The End of Programming” [639] in Communications of the ACM has discussed the impact of AI programming in the field of computer science, emphasizing an important shift towards the highly adaptive LLM as a new atomic unit of computation.

Major Issues. Although LLMs have achieved splendid performance in generating human-like text, they are susceptible to suffering from two major issues in language generation as discussed below.

• Unreliable generation evaluation. With the advancement of language generation ability of LLMs, existing studies find that the generated texts from LLMs have reached a comparable quality to the reference texts on a variety of text generation tasks. However, due to the intrinsic weakness of existing evaluation benchmarks, there exists pronounced inconsistency between human evaluation and automatic reference-based metrics [630–632, 640]. For example, in

37. Given k programs generated by the LLM, pass@k is computed as 1 when at least one program passes all test cases, or else 0.
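The pass@k definition in footnote 37 can be sketched in a few lines of Python. The first function follows the footnote directly. The second is an assumption beyond this survey's text: it is the estimator commonly used when n > k programs are sampled per problem and c of them pass, namely 1 - C(n-c, k) / C(n, k). All function names here are illustrative.

```python
from math import comb
from typing import Sequence


def pass_at_k_indicator(passed: Sequence[bool], k: int) -> int:
    """Footnote-37 definition: 1 if at least one of k generated programs
    passes all test cases, else 0. `passed[i]` is True when program i passes."""
    if len(passed) < k:
        raise ValueError("need at least k generated programs")
    return int(any(passed[:k]))


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Assumed extension (not stated in the footnote): probability that a
    random size-k subset of n samples, c of which are correct, contains at
    least one correct program."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct program
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(per_problem: Sequence[Sequence[bool]], k: int) -> float:
    """Average the per-problem indicator over a benchmark."""
    return sum(pass_at_k_indicator(p, k) for p in per_problem) / len(per_problem)
```

A benchmark score is then the mean over problems, e.g. `benchmark_pass_at_k([[False, True], [False, False]], k=2)` averages one solved and one unsolved problem.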
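Perplexity, the metric named in Section 7.1.1 for language modeling datasets, is the exponential of the average negative log-likelihood per token. The sketch below assumes that per-token natural-log probabilities are already available from some model. The function name and input format are illustrative, not taken from any benchmark toolkit.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum_i log p(token_i | previous tokens)).

    `token_log_probs` holds natural-log probabilities, one per token.
    Lower perplexity means the model assigns higher probability to the text.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

As a sanity check on the formula, a model that assigns probability 1/4 to every token has a perplexity of about 4, and a model that assigns probability 1 to every token has a perplexity of 1.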

TABLE 14: Representative basic and advanced abilities and corresponding representative datasets for evaluating.

Level | Ability | Task | Dataset
Basic | Language Generation | Language Modeling | Penn Treebank [540], WikiText-103 [541], the Pile [166], LAMBADA [252]
Basic | Language Generation | Conditional Text Generation | WMT’14,16,19,20,21,22 [542–547], Flores-101 [548], DiaBLa [549], CNN/DailyMail [550], XSum [551], WikiLingua [552], OpenDialKG [553]
Basic | Language Generation | Code Synthesis | APPS [376], HumanEval [105], MBPP [223], CodeContest [114], MTPB [86], DS-1000 [554], ODEX [555]
Basic | Knowledge Utilization | Closed-Book QA | Natural Questions [556], ARC [557], TruthfulQA [558], Web Questions [559], TriviaQA [560], PIQA [561], LC-quad2.0 [562], GrailQA [563], KQApro [564], CWQ [565], MKQA [566], ScienceQA [567]
Basic | Knowledge Utilization | Open-Book QA | Natural Questions [556], OpenBookQA [568], ARC [557], TriviaQA [560], Web Questions [559], MS MARCO [569], QASC [570], SQuAD [571], WikiMovies [572]
Basic | Knowledge Utilization | Knowledge Completion | WikiFact [573], FB15k-237 [574], Freebase [575], WN18RR [576], WordNet [577], LAMA [578], YAGO3-10 [579], YAGO [580]
Basic | Complex Reasoning | Knowledge Reasoning | CSQA [506], StrategyQA [199], HotpotQA [581], ARC [557], BoolQ [582], PIQA [561], SIQA [583], HellaSwag [584], WinoGrande [585], COPA [586], OpenBookQA [568], ScienceQA [567], proScript [587], ProPara [588], ExplaGraphs [589], ProofWriter [590], EntailmentBank [591], ProOntoQA [592]
Basic | Complex Reasoning | Symbolic Reasoning | CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [593], Parity [593], Colored Object [70], Penguins in a Table [70], Repeat Copy [436], Object Counting [436]
Basic | Complex Reasoning | Mathematical Reasoning | MATH [362], GSM8k [198], SVAMP [594], MultiArith [595], ASDiv [505], MathQA [596], AQUA-RAT [597], MAWPS [598], DROP [599], NaturalProofs [600], PISA [601], miniF2F [602], ProofNet [603]
Advanced | Human Alignment | Honestness | TruthfulQA [558], HaluEval [604]
Advanced | Human Alignment | Helpfulness | HH-RLHF [183]
Advanced | Human Alignment | Harmlessness | HH-RLHF [183], Crows-Pairs [605], WinoGender [606], RealToxicityPrompts [607]
Advanced | Interaction with External Environment | Household | VirtualHome [608], BEHAVIOR [609], ALFRED [610], ALFWorld [611]
Advanced | Interaction with External Environment | Website Environment | WebShop [612], Mind2Web [613]
Advanced | Interaction with External Environment | Open World | MineRL [614], MineDojo [615]
Advanced | Tool Manipulation | Search Engine | HotpotQA [581], TriviaQA [560], Natural Questions [556]
Advanced | Tool Manipulation | Code Executor | GSM8k [198], TabMWP [616], Date Understanding [70]
Advanced | Tool Manipulation | Calculator | GSM8k [198], MATH [362], CARP [617]
Advanced | Tool Manipulation | Model Interface | GPT4Tools [618], Gorilla [619]
Advanced | Tool Manipulation | Data Interface | WebQSP [620], MetaQA [621], WTQ [622], WikiSQL [623], TabFact [624], Spider [625]

OpenDialKG [553], ChatGPT underperforms a fine-tuned GPT-2 on BLEU and ROUGE-L metrics, while earning more favor from human judgment [640]. Furthermore, existing work argues that even human evaluation may not be robust enough [630, 631, 641, 642]. In some cases, it is difficult to achieve a high level of consensus among human annotators [631], and there is also a large gap between the annotation quality of crowdworkers and experts [641, 642]. Thus, how to conduct reliable evaluation for language generation tasks in the era of LLMs has become a fundamental yet challenging research topic. Recently, a growing body of work proposes to leverage LLMs to improve the evaluation quality of the generated texts. Specifically, LLMs can be used to improve the evaluation quality of existing metrics. For example, Para-Ref [643] augments various automatic metrics by leveraging LLMs to paraphrase existing references into semantically equivalent references with diverse expressions. Further, LLMs are widely employed as the evaluators of text generation in a reference-free manner, including evaluating a single prediction [633, 634, 644] or comparing several candidates [152, 645–647]. Nevertheless, LLMs may expose bias (e.g., order bias or preference for LLM-generated texts over human-written texts) as language generation evaluators, demonstrating disparities when compared to human evaluation [634, 648, 649].
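One simple way to probe the order bias mentioned for LLM-based evaluators is to present each candidate pair in both orders and count how often the preferred text changes. The sketch below is illustrative only. The `judge` callable stands in for a hypothetical LLM evaluator, and the flip-rate measure is an assumption of this example rather than a protocol taken from the cited studies.

```python
from typing import Callable, Sequence, Tuple

# A judge receives (first_text, second_text) and returns 0 if it prefers the
# first text or 1 if it prefers the second. In practice this would wrap an
# LLM call; here it is a hypothetical placeholder.
Judge = Callable[[str, str], int]


def order_flip_rate(judge: Judge, pairs: Sequence[Tuple[str, str]]) -> float:
    """Fraction of pairs whose preferred text changes when the order is swapped."""
    if not pairs:
        raise ValueError("need at least one candidate pair")
    flips = 0
    for a, b in pairs:
        winner_forward = (a, b)[judge(a, b)]   # verdict with a shown first
        winner_backward = (b, a)[judge(b, a)]  # verdict with b shown first
        if winner_forward != winner_backward:
            flips += 1
    return flips / len(pairs)


def always_first(first: str, second: str) -> int:
    """Toy judge with maximal order bias: always prefers the first text."""
    return 0


def prefers_longer(first: str, second: str) -> int:
    """Toy judge with no order bias on pairs of unequal length."""
    return 0 if len(first) >= len(second) else 1
```

A flip rate near zero suggests the verdicts do not depend on presentation order, while a judge that always favors the first position flips on every pair of distinct texts.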

Unreliable Generation Evaluation

LLMs have been capable of generating texts with a comparable quality to human-written texts, which however might be underestimated by automatic reference-based metrics. As an alternative evaluation approach, LLMs can serve as language generation evaluators to evaluate a single text, compare multiple candidates, and improve existing metrics. However, this evaluation approach still needs more inspections and examinations in real-world tasks.

• Underperforming specialized generation. Although LLMs have learned general language patterns to generate coherent text, their proficiency in generation might be constrained when dealing with a specialized domain or task. For instance, a language model that has been trained on general web articles may face challenges when generating a medical report which involves much medical jargon and many medical methods. Intuitively, domain knowledge should be critical for model specialization. However, it is not easy to inject such specialized knowledge into LLMs. As discussed in recent analyses [47, 650], when LLMs are trained to exhibit some specific ability that allows them to excel in some areas, they might struggle in others. Such an issue is related to catastrophic forgetting [651, 652] in training neural networks, which refers to the conflict phenomenon of integrating new and old knowledge. Similar cases also occur in human alignment of LLMs, where “alignment tax” [66] (e.g., a potential loss in the in-context learning ability) has to be paid for aligning to human values and needs. Moreover, due to the limitations of sequence modeling architecture, LLMs still face challenges in the understanding and generation of structured data. Consequently, they often fall behind task-specific models on complex structured data tasks, such as knowledge-base question answering and semantic parsing [451, 653]. Therefore, it is important to develop effective model specialization methods that can flexibly adapt LLMs to various task scenarios, while retaining the original abilities as much as possible.

Underperforming Specialized Generation

LLMs may fall short in mastering generation tasks that require domain-specific knowledge or generating structured data. It is non-trivial to inject specialized knowledge into LLMs while maintaining the original abilities of LLMs.

7.1.2 Knowledge Utilization

Knowledge utilization is an important ability of intelligent systems to accomplish knowledge-intensive tasks (e.g., commonsense question answering and fact completion) based on supporting factual evidence. Concretely, it requires LLMs to properly utilize the rich factual knowledge from the pre-training corpus or retrieve external data when necessary. In particular, question answering (QA) and knowledge completion have been two commonly used tasks for evaluating this ability. According to the test tasks (question answering or knowledge completion) and evaluation settings (with or without external resources), we categorize existing knowledge utilization tasks into three types, namely closed-book QA, open-book QA^38, and knowledge completion.

Closed-Book QA. Closed-book QA tasks [654] test the acquired factual knowledge of LLMs from the pre-training corpus, where LLMs should answer the question only based on the given context without using external resources. For evaluating this ability, there are several datasets that can be leveraged, including Natural Questions [556], Web Questions [559], and TriviaQA [560], where the accuracy metric is widely adopted. Empirical results have revealed that LLMs can perform well in this setting and even match the performance of state-of-the-art open-domain QA systems [56]. Also, the performance of LLMs on closed-book QA tasks shows a scaling law pattern in terms of both model size and data size: scaling the parameters and training tokens can increase the capacity of LLMs and help them learn (or memorize) more knowledge from the pre-training data [56]. Further, under a similar parameter scale, LLMs with more pre-training data relevant to the evaluated tasks would achieve better performance [81]. Also, the closed-book QA setting provides a testbed for probing the accuracy of the factual knowledge encoded by LLMs. However, as shown in existing work [55], LLMs might perform less well on QA tasks relying on fine-grained knowledge, even when it exists in the pre-training data.

Open-Book QA. Unlike closed-book QA, in open-book QA tasks, LLMs can extract useful evidence from the external knowledge base or document collections, and then answer the question based on the extracted evidence [655–658]. Typical open-book QA datasets (e.g., Natural Questions [556], OpenBookQA [568], and SQuAD [571]) overlap with closed-book QA datasets, but they incorporate external data sources, e.g., Wikipedia. The metrics of accuracy and F1 score are widely used in open-book QA tasks for evaluation. To select relevant knowledge from external resources, LLMs are often paired with a text retriever (or even a search engine), which is trained independently or jointly with LLMs [81, 655, 659]. Also, previous work [660–662] has indicated that retrievers can assist LLMs in verifying and rectifying the reasoning path. In evaluation, existing studies mainly focus on testing how LLMs utilize the extracted knowledge to answer the question and show that the retrieved evidence can largely improve the accuracy of the generated answers, even enabling a smaller LLM to outperform 10× larger ones [655, 659]. Further, open-book QA tasks can also be employed to evaluate the recency of knowledge information. Pre-training or retrieving from outdated knowledge resources may cause LLMs to generate incorrect answers for time-sensitive questions [655].

Knowledge Completion. In knowledge completion tasks, LLMs might be (to some extent) considered as a knowledge

38. In this part, open-book QA refers to the QA tasks that require extracting and utilizing useful information from external knowledge resources, as the antithesis of closed-book QA (only using the encoded information from pre-training corpus). Note that there is a dataset also named OpenBookQA [568], which follows the settings of open-book QA tasks by extracting and utilizing external science facts.
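The accuracy and F1 metrics mentioned for QA evaluation are often computed as exact match and token-level F1 between a predicted answer and a reference answer. The sketch below assumes a SQuAD-style normalization (lowercasing, stripping punctuation and English articles). That normalization is a common convention and an assumption here, not something this survey specifies, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Token-level F1 gives partial credit when the prediction contains the reference plus extra words, which exact match would score as wrong.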
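The open-book QA setting, in which a text retriever selects evidence that is then placed in the LLM's context, can be sketched with a toy retriever. Everything below is a simplifying assumption for illustration: the word-overlap scoring, the prompt template, and the function names are hypothetical, and real systems typically use trained dense or sparse retrievers.

```python
import re
from typing import List, Sequence


def tokenize(text: str) -> set:
    """Lowercased alphanumeric word set used for overlap scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: Sequence[str], top_k: int = 2) -> List[str]:
    """Rank documents by the number of words shared with the question.
    Ties keep the original document order because sorted() is stable."""
    q_tokens = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_tokens & tokenize(d)), reverse=True)
    return list(ranked[:top_k])


def build_prompt(question: str, evidence: Sequence[str]) -> str:
    """Place the retrieved evidence in the context ahead of the question."""
    lines = ["Answer the question using only the evidence below.", ""]
    lines += [f"Evidence {i + 1}: {doc}" for i, doc in enumerate(evidence)]
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

The resulting prompt would then be passed to whichever LLM is being evaluated. Swapping the document collection for a more recent one is also how such a setup can be used to test knowledge recency.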

(a) Intrinsic hallucination
Input: Bob’s wife is Amy. Bob’s daughter is Cindy. Who is Cindy to Amy?
LLM: Cindy is Amy’s daughter-in-law.

(b) Extrinsic hallucination
Input: Explain RLHF for LLMs.
LLM: RLHF stands for "Rights, Limitations, Harms, and Freedoms" and is a framework for …… models like LLMs (Large Language Models).

Fig. 17: Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). As an example of intrinsic hallucination, the LLM gives a conflicting judgment about the relationship between Cindy and Amy, which contradicts the input. For extrinsic hallucination, in this example, the LLM seems to have an incorrect understanding of the meaning of RLHF (reinforcement learning from human feedback), though it can correctly understand the meaning of LLMs (in this context).

base [578], which can be leveraged to complete or predict the missing parts of knowledge units (e.g., knowledge triples). Such tasks can probe and evaluate how much and what kind of knowledge LLMs have learned from the pre-training data. Existing knowledge completion tasks can be roughly divided into knowledge graph completion tasks (e.g., FB15k-237 [574] and WN18RR [576]) and fact completion tasks (e.g., WikiFact [573]), which aim to complete the triples from a knowledge graph and incomplete sentences about specific facts, respectively. Empirical studies have revealed that it is difficult for existing LLMs to accomplish knowledge completion tasks related to specific relation types [522]. As shown in the evaluation results on WikiFact, LLMs perform well on several frequent relations that occur in the pre-training data (e.g., currency and author), while not well on rare ones (e.g., discoverer_or_inventor and place_of_birth). Interestingly, under the same evaluation settings (e.g., in-context learning), InstructGPT (i.e., text-davinci-002) outperforms GPT-3 in all subsets of WikiFact.

Major Issues. Although LLMs have achieved key progress in capturing and utilizing knowledge information, they suffer from two major issues as discussed below.

• Hallucination. In generating factual texts, a challenging issue is hallucination [640, 663], where the generated information is either in conflict with the existing source (intrinsic hallucination) or cannot be verified by the available source (extrinsic hallucination), which are illustrated by two examples in Figure 17. Hallucination widely occurs in existing LLMs, even the most superior LLMs such as GPT-4 [46]. Furthermore, existing work shows that LLMs encounter difficulties in recognizing the hallucinated content in text [604], even the powerful ChatGPT. Additionally, beyond language tasks, a recent study has shown that large vision-language models (LVLM) also face challenges with hallucination, i.e., generating objects that are not present in the accompanying images [664]. In essence, LLMs seem to “unconsciously” utilize the knowledge in task solving, and still lack the ability to accurately control the use of internal or external knowledge. Hallucinations would mislead LLMs to generate undesired outputs and mostly degrade the performance, leading to potential risks when deploying LLMs in real-world applications. To alleviate this problem, alignment tuning strategies (as discussed in Section 5.2) have been widely utilized in existing work [66], which rely on tuning LLMs on high-quality data or using human feedback. Moreover, the integration of external tools for the provision of credible information sources can help alleviate the hallucination issue [81, 604, 661]. Another line of research work leverages uncertainty estimation of LLMs to identify hallucinations [665, 666]. For instance, considering that hallucinated facts are prone to exhibit inconsistency across different sampled outputs, SelfCheckGPT [666] detects hallucination by measuring information inconsistency within sampled outputs. For the evaluation of the hallucination problem, a set of hallucination detection tasks have been proposed, e.g., TruthfulQA [558] for detecting human falsehood mimicked by models. More recently, HaluEval [604] creates a large-scale collection of LLM-generated and human-annotated hallucinated samples to evaluate the ability of language models to recognize hallucination in both task-specific and general scenarios.

Hallucination

LLMs are prone to generate untruthful information that either conflicts with the existing source or cannot be verified by the available source. Even the most powerful LLMs such as ChatGPT face great challenges in mitigating the hallucinations of the generated texts. This issue can be partially alleviated by special approaches such as alignment tuning and tool utilization.

• Knowledge recency. As another major challenge, LLMs would encounter difficulties when solving tasks that require the latest knowledge beyond the training data. To tackle this issue, a straightforward approach is to regularly update LLMs with new data. However, it is very costly to fine-tune LLMs, and also likely to cause the catastrophic forgetting issue when incrementally training LLMs. Therefore, it is necessary to develop efficient and effective approaches that can integrate new knowledge into existing LLMs, making them up-to-date. Existing studies have explored how to utilize the external knowledge source (e.g., search engine)

to complement LLMs, which can be either jointly optimized with LLMs [655] or used as a
plug-and-play module [661]. For instance, ChatGPT utilizes a retrieval plugin to access
up-to-date information sources [667]. By incorporating the extracted relevant information
into the context [668–670], LLMs can acquire new factual knowledge and perform better on
relevant tasks. However, such an approach seems to be still at a superficial level. In
addition, existing studies also explore editing parameters of language models to update
intrinsic knowledge [671–673]. Nevertheless, previous work [674] has shown that several
parameter editing methods do not perform well on LLMs, though they can improve the
performance of small language models. Therefore, it is still difficult to directly amend
intrinsic knowledge or inject specific knowledge into LLMs, which remains an open research
problem [674]. Recently, a useful framework EasyEdit [675] has been released to facilitate
the research of knowledge editing for LLMs.

Knowledge Recency

The parametric knowledge of LLMs is hard to update in a timely manner. Augmenting LLMs
with external knowledge sources is a practical approach to tackling the issue. However,
how to effectively update knowledge within LLMs remains an open research problem.

7.1.3 Complex Reasoning

Complex reasoning refers to the ability of understanding and utilizing supporting evidence
or logic to derive conclusions or make decisions [51, 52]. According to the type of
involved logic and evidence in the reasoning process, we consider dividing existing
evaluation tasks into three major categories, namely knowledge reasoning, symbolic
reasoning, and mathematical reasoning.

Knowledge Reasoning. The knowledge reasoning tasks rely on logical relations and evidence
about factual knowledge to answer the given question. Existing work mainly uses specific
datasets to evaluate the reasoning capacity of the corresponding type of knowledge, e.g.,
CSQA [506]/StrategyQA [199] for commonsense knowledge reasoning and ScienceQA [567] for
science knowledge reasoning. In addition to the accuracy of the predicted results, existing
work [567] has also evaluated the quality of the generated reasoning process, via automatic
metrics (e.g., BLEU) or human evaluation. Typically, these tasks require LLMs to perform
step-by-step reasoning based on factual knowledge, until reaching the answer to the given
question. To elicit the step-by-step reasoning ability, chain-of-thought (CoT) prompting
strategy [33] has been proposed for enhancing the complex reasoning capacity of LLMs. As
discussed in Section 6.3, CoT incorporates the intermediate reasoning steps, which can be
manually created [33] or automatically generated [676], into the prompts to guide LLMs to
perform multi-step reasoning. This approach largely improves the reasoning performance of
LLMs, leading to new state-of-the-art results on several complex knowledge reasoning
tasks [33, 56, 528]. Further, after reformulating knowledge reasoning tasks into code
generation tasks, researchers have found that the performance of LLMs can be further
improved [226], especially with the LLMs pre-trained on code. However, due to the
complexity of knowledge reasoning tasks, the performance of current LLMs still lags behind
human results on tasks such as commonsense reasoning [33, 56, 677]. As a common type of
mistake, LLMs might generate inaccurate intermediate steps, leading to a wrong final
result. To address this issue, existing work has proposed special decoding or ensemble
strategies to improve the accuracy of the whole reasoning chain [429, 430].

Symbolic Reasoning (see footnote 39). The symbolic reasoning tasks mainly focus on
manipulating the symbols in a formal rule setting to fulfill some specific goal [51], where
the operations and rules may have never been seen by LLMs during pre-training. Existing
work [33, 432, 507] commonly evaluates LLMs on the task of last letter concatenation and
coin flip, where the evaluation examples require the same reasoning steps as the in-context
examples (called in-domain test) or more steps (called out-of-domain test). As an example
of the out-of-domain test, LLMs see only examples with two words in context, but are
required to concatenate the last letters of three or more words. Typically, the accuracy
of the generated symbols is adopted to evaluate the performance of LLMs on these tasks.
Thus, LLMs need to understand the semantic relations among the symbolic operations and
their composition in complex scenarios. However, under the out-of-domain setting, as LLMs
have not seen the complex compositions of symbolic operations and rules (e.g., twice the
number of operations in context examples), it is hard for LLMs to capture their accurate
meanings. To solve this issue, existing studies incorporate scratchpad [593, 678] and
tutor [679] strategies to help LLMs better manipulate symbolic operations, for generating
longer and more complex reasoning processes. Another line of research work utilizes the
formal programming language to represent the symbolic operations and rules, which requires
LLMs to generate code and perform the reasoning process by executing it with external
interpreters. This approach decomposes the complex reasoning process into code synthesis
and program execution for LLMs and interpreters, respectively, leading to a simplified
reasoning process with yet more accurate results [436].

39. Following [33], we mainly discuss symbolic reasoning tasks specially designed for
evaluating LLMs. We do not consider symbolic reasoning methods in traditional NLP tasks,
such as deducing logical rules from the knowledge graphs in KBQA.

Mathematical Reasoning. The mathematical reasoning tasks need to comprehensively utilize
mathematical knowledge, logic, and computation for solving problems or generating proof
statements. Existing mathematical reasoning tasks can be mainly categorized into math
problem solving and automated theorem proving. For math problem solving tasks,
SVAMP [594], GSM8k [198] and MATH [362] datasets are commonly used for evaluation, where
LLMs need to generate accurate concrete numbers or equations to answer the mathematical
problem. As these tasks also require multi-step reasoning, the CoT prompting strategy has
been widely adopted for LLMs to improve the reasoning performance [33]. As another
practical strategy, continually pre-training LLMs on large-scale mathematical corpora can
largely boost their performance on mathematical reasoning tasks [35, 218, 680]. Further,
since math problems in different languages share the same mathematical logic, researchers
also propose a multilingual math word problem benchmark [526] to evaluate the multilingual
mathematical reasoning capacity of LLMs. As another challenging task, automated theorem
proving (ATP) [600, 602, 681] requires the reasoning model to strictly follow the reasoning
logic and mathematical skills. To evaluate the performance on this task, PISA [601] and
miniF2F [602] are two typical ATP datasets with the proof success rate as the evaluation
metric. As a typical approach, existing work on ATP utilizes LLMs to aid the search for
proofs using an interactive theorem prover (ITP), such as Lean, Metamath, and
Isabelle [682–684]. A major limitation of ATP research is the lack of related corpora in
formal language. To tackle it, several studies utilize LLMs to convert informal statements
into formal proofs for augmenting new data [685] or generate drafts and proof sketches to
reduce the search space of the proofs [686].

Major Issues. In spite of the advancements, LLMs still have several limitations in solving
complex reasoning tasks.

• Reasoning inconsistency. With improved reasoning strategies (e.g., CoT prompting), LLMs
can solve some complex reasoning tasks, by performing step-by-step reasoning based on the
supporting logic and evidence. Despite the effectiveness, the reasoning inconsistency issue
often occurs in the decomposed reasoning process. Concretely, LLMs may generate the correct
answer following an invalid reasoning path, or produce a wrong answer after a correct
reasoning process [33, 435], leading to inconsistency between the derived answer and the
reasoning process. To alleviate this problem, existing work has proposed to guide the whole
generation process of LLMs via external tools or models [430, 444, 638], to re-check the
reasoning process and final answer for correcting the potential errors [687–689] or
fine-tune LLMs with process-based feedback [690, 691]. For instance, Tree of Thoughts
(ToT) [444] empowers LLMs to engage in the decision-making process by concurrently
exploring and self-evaluating various reasoning paths. To refine the reasoning processes,
Self-Refine [687] elicits feedback from LLMs on self-generated solutions, enabling the
iterative refinement of solutions based on the feedback. Moreover, several studies improve
the consistency in the reasoning chain of LLMs through the integration of process-based
supervision during training [690, 691]. As a promising solution, recent approaches
reformulate the complex reasoning tasks into code generation tasks, where the strict
execution of the generated code ensures the consistency between the reasoning process and
the outcome. Also, it has been revealed that there might exist inconsistency between tasks
with similar inputs, where small changes in the task description may cause the model to
produce different results [49, 594]. To mitigate this problem, self-consistency [429]
adopts the ensemble of multiple reasoning paths to enhance the decoding process of LLMs.

Reasoning Inconsistency

LLMs may generate the correct answer following an invalid reasoning path, or produce a
wrong answer after a correct reasoning process, leading to inconsistency between the
derived answer and the reasoning process. The issue can be alleviated by fine-tuning LLMs
with process-level feedback, using an ensemble of diverse reasoning paths, and refining the
reasoning process with self-reflection or external feedback.

• Numerical computation. For complex reasoning tasks, LLMs still face difficulties in the
involved numerical computation, especially for the symbols that are seldom encountered
during pre-training, such as arithmetic with large numbers [49, 679, 692]. To tackle this
issue, a direct way is to tune LLMs on synthesized arithmetic problems [359, 693]. Also, a
surge of studies improve the numerical computation performance by tracing intermediate
calculation steps in training and inference stages [359, 678, 694], e.g., scratchpad
tracing. In addition, existing work [80] has also incorporated external tools (e.g.,
calculator), especially for handling arithmetic operations. More recently, ChatGPT has
provided a plugin mechanism to use external tools [667]. In this way, LLMs need to learn
how to properly manipulate the tools. For this purpose, researchers have augmented the
examples using tools (even the LLM itself) for tuning the LLM [80, 695], or devised
instructions and exemplars for in-context learning [436]. In addition to the aid of
external tools, recent studies find that tokenizing digits into individual tokens (e.g.,
LLaMA and Galactica tokenizers) is a useful approach to enhancing the inherent arithmetic
ability of LLMs [359, 692]. One possible explanation is that subword tokenization
techniques can result in inconsistent sequences when tokenizing numbers. For instance, with
a subword tokenizer the integer 7481 may be tokenized as 7 481, while 74815 may be
tokenized as 748 15 (the same numerical substrings with different splits) [359]. As a
comparison, digit-based tokenization for numbers can avoid such an inconsistency, thus
likely improving the numerical computation ability of LLMs.

Numerical Computation

LLMs face difficulties in numerical computation, especially for the symbols that are seldom
encountered during pre-training. In addition to using mathematical tools, tokenizing digits
into individual tokens is also an effective design choice for improving the arithmetic
ability of LLMs.

7.2 Advanced Ability

In addition to the above basic evaluation tasks, LLMs also exhibit some superior abilities
that require special considerations for evaluation. In this part, we discuss several
representative advanced abilities and the corresponding evaluation approaches, including
human alignment, interaction with the external environment, and tool manipulation. Next,
we discuss these advanced abilities in detail.
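To make the tokenization point from the numerical computation discussion in Section 7.1.3 more concrete, the following Python sketch contrasts a toy subword tokenizer with digit-level tokenization. The vocabulary `TOY_VOCAB` and the greedy longest-match rule are assumptions chosen purely for illustration; they are not the tokenizer of any particular LLM, and the exact splits they produce differ from the 7481/74815 splits quoted from [359].

```python
# Hypothetical toy vocabulary: a few multi-digit pieces plus all single digits.
TOY_VOCAB = {"74", "81", "815"} | set("0123456789")


def subword_tokenize(number: str, vocab=TOY_VOCAB) -> list[str]:
    """Greedy left-to-right longest-match segmentation.

    This is only a simplified stand-in for subword tokenization.
    """
    tokens, i = [], 0
    while i < len(number):
        # Try the longest candidate piece first, then shorter ones.
        for j in range(len(number), i, -1):
            if number[i:j] in vocab:
                tokens.append(number[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {number!r} at position {i}")
    return tokens


def digit_tokenize(number: str) -> list[str]:
    """Digit-based tokenization: every digit becomes its own token."""
    return list(number)


for n in ("7481", "74815"):
    print(n, subword_tokenize(n), digit_tokenize(n))
```

With this toy vocabulary, the digits "81" shared by both numbers end up inside different tokens ("81" versus "815"), whereas the digit-level split of 7481 is always a prefix of the digit-level split of 74815.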
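The self-consistency strategy [429] mentioned under reasoning inconsistency in Section 7.1.3 can be pictured as a majority vote over the final answers of several sampled reasoning paths. In the minimal Python sketch below, the sampled paths are hard-coded strings, and the "The answer is X" format as well as the helper names `extract_answer` and `self_consistency_vote` are assumptions for illustration, not the original implementation.

```python
import re
from collections import Counter

# Assumed answer format; real prompts and parsers vary by benchmark.
ANSWER_PATTERN = re.compile(r"The answer is\s+(-?\d+(?:\.\d+)?)")


def extract_answer(path: str):
    """Return the final numeric answer of one reasoning path, or None."""
    match = ANSWER_PATTERN.search(path)
    return match.group(1) if match else None


def self_consistency_vote(paths):
    """Pick the most frequent final answer among the sampled paths."""
    answers = [a for a in map(extract_answer, paths) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for reasoning paths sampled from an LLM with a non-zero temperature.
sampled_paths = [
    "3 boxes with 4 pens each gives 3 * 4 = 12. The answer is 12.",
    "4 + 4 + 4 = 12 pens in total. The answer is 12.",
    "3 + 4 = 7. The answer is 7.",
]
print(self_consistency_vote(sampled_paths))
```

Here two of the three paths agree on 12, so the vote discards the path with the faulty intermediate step.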

7.2.1 Human Alignment

It is desired that LLMs could well conform to human values and needs, i.e., human
alignment, which is a key ability for the broad use of LLMs in real-world applications.

To evaluate this ability, existing studies consider multiple criteria for human alignment,
such as helpfulness, honesty, and safety [46, 183, 366]. For helpfulness and honesty,
adversarial question answering tasks (e.g., TruthfulQA [558]) can be utilized to examine
LLM’s ability in detecting possible falsehood in the text [46, 81]. Furthermore,
harmlessness can be also evaluated by several existing benchmarks, e.g., CrowS-Pairs [605]
and Winogender [606]. Despite the automatic evaluation with the above datasets, human
evaluation is still a more direct way to effectively test the human alignment ability of
LLMs. OpenAI invites many experts in domains related to AI risks to evaluate and improve
the behaviors of GPT-4 when encountering risky contents [46]. In addition, for other
aspects of human alignment (e.g., truthfulness), several studies propose to use specific
instructions and devise annotation rules to guide the annotation process [81]. Empirical
studies have revealed that these strategies can greatly improve the human alignment ability
of LLMs [183]. For instance, after alignment tuning on data collected through interactions
with experts, the incorrect behavior rate of GPT-4 can be largely reduced when it deals
with sensitive or disallowed prompts. In addition, high-quality pre-training data can
reduce the effort required for alignment [46]. For instance, Galactica is potentially more
harmless due to the less biased contents in the scientific corpus [35].

7.2.2 Interaction with External Environment

In addition to standard evaluation tasks, LLMs have the ability to receive feedback from
the external environment and perform actions according to the behavior instruction, e.g.,
generating action plans in natural language to manipulate agents [696, 697]. Such an
ability is also emergent in LLMs that can generate detailed and highly realistic action
plans, while smaller models (e.g., GPT-2) tend to generate shorter or meaningless
plans [696].

To test this ability, several embodied AI environments and benchmarks can be used for
evaluation, described as follows. VirtualHome [608] builds a 3D simulator for household
tasks such as cleaning and cooking, in which the agent can execute natural language actions
generated by LLMs. ALFRED [610] includes more challenging tasks that require LLMs to
accomplish compositional targets. BEHAVIOR [609] focuses on everyday chores in simulation
environments and requires LLMs to generate complex solutions, e.g., changing the internal
status of objects. Apart from restricted environments such as household tasks, a line of
research work investigates the proficiency of LLM-based agents to explore open-world
environments, such as Minecraft and the Internet [698, 699]. Voyager [699] introduces an
automatic curriculum module that enables LLMs to continuously acquire new skills based on
feedback from the environment. GITM [698] focuses on solving various challenges in
Minecraft based on LLM, through task decomposition, planning, and invocation of interfaces.
Based on the generated action plans or task completions, existing work either adopts the
regular metrics (e.g., executability and correctness of the generated action plans) [696]
in the benchmark or directly conducts real-world experiments and measures the success
rate [700], to evaluate such ability. It has been shown that LLMs are capable of
interacting with the external environment and generating accurate action plans [701].
Recently, several improvement methods have been proposed to enhance the interaction ability
of LLMs, e.g., designing code-like prompts [532] and providing real-world grounding [700].

In addition, recent work also explores multi-agent collaboration based on LLMs in simulated
environments [535, 702, 703]. These studies simulate human social behaviors by
instantiating multiple LLM-based agents with observations, planning, and memories in a
sandbox environment. In controlled evaluation, the abilities of generative agents to
search, plan, and think are evaluated by humans in an interview-like manner. Further, they
also conduct descriptive measurements on multiple agents within a simulated environment to
examine emergent social behaviors.

7.2.3 Tool Manipulation

When solving complex problems, LLMs can turn to external tools if they determine it is
necessary. By encapsulating available tools with API calls, existing work has involved a
variety of external tools, e.g., search engine [81], calculator [80], and compiler [436],
to enhance the performance of LLMs on several specific tasks. Recently, OpenAI has
supported the use of plugins in ChatGPT [667], which can equip LLMs with broader capacities
beyond language modeling. For example, the web browser plugin enables ChatGPT to access
fresh information. Further, incorporating third-party plugins is particularly key for
creating a prosperous ecosystem of applications based on LLMs.

To examine the ability of tool manipulation, existing work mostly adopts complex reasoning
tasks for evaluation, such as mathematical problem solving (e.g., GSM8k [198] and
SVAMP [594]) or knowledge question answering (e.g., TruthfulQA [558]), where the successful
utilization of tools is very important for enhancing the required skills that LLMs are
incapable of (e.g., numerical calculation). In this way, the evaluated performance on these
tasks can reflect the ability of LLMs in tool manipulation. To teach LLMs to utilize tools,
existing studies add exemplars using tools in context to elicit LLMs [436], or fine-tune
LLMs on simulated data about tool utilization [80, 695]. It has been found that with the
help of tools, LLMs become more capable of handling the issues that they are not good at,
e.g., equation calculation and answering timely questions [80, 441]. However, as the number
of available tools increases, the limited context length of LLMs may pose challenges in
describing and demonstrating extensive tool APIs. To address this issue, existing work
retrieves the usage of relevant tools, or encodes tool information as tokens within the
embedding space [704–706].

In addition to existing tools developed by humans, LLMs possess the capability to make
their own tools for specific tasks autonomously [707]. This enables the models to
independently explore and manipulate these self-created tools, thereby expanding their
potential for autonomous exploration in solving a wide range of real-world tasks.
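As a rough illustration of encapsulating a tool behind an API-style call, the Python sketch below replaces calculator calls embedded in model output with their computed results. The `CALC(...)` call syntax, the function names, and the example text are hypothetical; actual tool-augmented systems define their own call formats and dispatch logic.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator tool is allowed to apply.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval").body)


# Hypothetical call syntax; nested parentheses are deliberately not supported.
CALL_PATTERN = re.compile(r"CALC\(([^()]*)\)")


def run_tool_calls(model_output: str) -> str:
    """Replace each CALC(...) call in the text with the tool's result."""
    return CALL_PATTERN.sub(lambda m: str(calculator(m.group(1))), model_output)


text = "The total cost is CALC(12 * 37 + 5) dollars."
print(run_tool_calls(text))
```

In this arrangement the model only has to produce a well-formed call, while the arithmetic itself is delegated to the tool.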

Summary. The above three abilities are of great value to the practical performance of LLMs:
conforming to human values and preferences (human alignment), acting properly in real-world
scenarios (interaction with the external environment), and expanding the ability scope
(tool manipulation). In addition to the above three advanced abilities, LLMs might also
show other abilities that are specially related to some tasks (e.g., data
annotation [488]) or learning mechanisms (e.g., self-improvement [708]). It will be an open
direction to discover, measure and evaluate these newly emerging abilities, so as to better
utilize and improve LLMs.

7.3 Benchmarks and Evaluation Approaches

In the above, we have discussed the basic and advanced abilities of LLMs. Next, we will
introduce existing evaluation benchmarks and approaches [735, 736].

7.3.1 Comprehensive Evaluation Benchmarks

Recently, several comprehensive benchmarks [70, 362, 522] have been released for the
evaluation of LLMs. In this part, we introduce several widely used benchmarks, i.e., MMLU,
BIG-bench, HELM, and a series of human exam benchmarks.

• MMLU [362] is a versatile benchmark for large-scale evaluation of multi-task knowledge
understanding, covering a wide range of knowledge domains from mathematics and computer
science to humanities and social sciences. The difficulties of these tasks vary from basic
to advanced. As shown in existing work, LLMs mostly outperform small models by a
substantial margin on this benchmark [35, 56, 57, 69], which shows the scaling law in model
size. More recently, GPT-4 achieves a remarkable record (86.4% in 5-shot setting) in MMLU,
which is significantly better than the previous state-of-the-art models [46].

• BIG-bench [70] is a collaborative benchmark intended to probe existing LLMs from various
aspects. It comprises 204 tasks that encompass a broad range of topics, including
linguistics, childhood development, mathematics, commonsense reasoning, biology, physics,
social bias, software development, and so on. By scaling the model size, LLMs can even
outperform the average human performance under the few-shot setting on 65% of tasks in
BIG-bench [56]. Considering the high evaluation cost of the entire benchmark, a lightweight
benchmark BIG-bench-Lite has been proposed, which contains 24 small yet diverse and
challenging tasks from BIG-bench. Additionally, the BIG-bench hard (BBH) benchmark [363]
has been proposed to concentrate on investigating the currently unsolvable tasks of LLMs by
selecting the challenging tasks in which LLMs exhibit inferior performance compared to
humans. Since BBH is more difficult, small models mostly achieve performance close to
random. As a comparison, CoT prompting can elicit the abilities of LLMs to perform
step-by-step reasoning for enhancing the performance, even exceeding the average human
performance in BBH.

• HELM [522] is a comprehensive benchmark that currently implements a core set of 16
scenarios and 7 categories of metrics. It is built on top of many prior studies, conducting
a holistic evaluation of language models. As shown in the experimental results of HELM,
instruction tuning can consistently boost the performance of LLMs in terms of accuracy,
robustness, and fairness. Further, for reasoning tasks, the LLMs that have been pre-trained
on the code corpus show superior performance.

• Human-level test benchmarks aim to evaluate the comprehensive ability of LLMs with
questions designed for testing humans, such as AGIEval [710], MMCU [711], M3KE [712],
C-Eval [713] and Xiezhi [714]. These benchmarks encompass a wide range of domains,
difficulty levels, and languages to provide a comprehensive evaluation of LLMs’ general
capabilities. Models offering API services (e.g., GPT-4, ChatGPT, Claude) demonstrate
superior performance compared to publicly available models on these evaluation benchmarks.
As the best-performing model in evaluations, GPT-4 surpasses average human performance in
AGIEval [710]. However, it still lags behind the top human performance on these challenging
benchmarks. Hence, there remains ample room for further enhancements in the overall
abilities of LLMs, particularly for publicly accessible models.

The above benchmarks cover a variety of mainstream evaluation tasks and real-world human
exam questions for the evaluation of LLMs. Also, there are several benchmarks that focus on
evaluating specific abilities of LLMs, such as TyDiQA [737] for multilingual knowledge
utilization and MGSM [526] for multilingual mathematical reasoning. To conduct the
evaluation, one can select suitable benchmarks according to specific goals. In addition,
there are also several open-source evaluation frameworks for researchers to evaluate LLMs
on existing benchmarks or extend new tasks for customized evaluations, such as Language
Model Evaluation Harness [738] and OpenAI Evals [46]. Further, some researchers also
construct continuously updated leaderboards by aggregating representative benchmarks, to
compare the performance of existing LLMs, such as Open LLM Leaderboard [709]. The above
benchmarks and leaderboards provide important references to demonstrate the basic and
advanced abilities of LLMs. We will give a deeper discussion of the pros and cons of
evaluation approaches in Section 7.3.2.

7.3.2 Evaluation Approaches

After introducing existing benchmarks, in this part, we will review existing evaluation
approaches for assessing the performance of LLMs. To organize our discussion, we categorize
LLMs into three different types: base LLMs (pre-trained model checkpoints), fine-tuned LLMs
(instruction or alignment fine-tuned model checkpoints), and specialized LLMs (adapted
model checkpoints for some specific task or domain). Here, we keep both fine-tuned LLMs and
specialized LLMs, to distinguish the different purposes of LLMs: general or specific task
solvers. To evaluate the three types of LLMs, we can test the LLM’s performance related to
different abilities (e.g., basic or advanced abilities as discussed in Section 7.1 and
7.2). In general, there are three main approaches to evaluating LLMs, namely
benchmark-based approach [362], human-based approach [729], and model-based approach [731].
Table 15 shows an illustration of the relationship among LLM type, evaluation approach,

TABLE 15: A category of existing evaluation work. “General” denotes that the evaluation focuses on an overall performance
of multiple abilities. The evaluated abilities are not limited to the representative basic and advanced abilities mentioned in
Section 7.1 and 7.2.

Method | Evaluation | Model Types | Abilities/Domain | Data Source
Benchmark | MMLU [362] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | BIG-bench [70] | Base/Fine-tuned/Specialized | General | Human annotation
Benchmark | HELM [522] | Base/Fine-tuned/Specialized | General | Benchmark collection
Benchmark | Open LLM Leaderboard [709] | Base/Fine-tuned/Specialized | General | Benchmark collection
Benchmark | AGIEval [710] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | MMCU [711] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | M3KE [712] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | C-Eval [713] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | Xiezhi [714] | Base/Fine-tuned/Specialized | General | Human exam/practice
Benchmark | OpenCompass [715] | Base/Fine-tuned/Specialized | General | Benchmark collection
Benchmark | Chain-of-Thought Hub [716] | Base/Fine-tuned | General | Benchmark collection
Benchmark | KoLA [717] | Base/Fine-tuned | Knowledge utilization | Web
Benchmark | ARB [718] | Fine-tuned | Complex reasoning | Human exam/practice
Benchmark | APIBench [719] | Base/Fine-tuned | Tool manipulation | Web
Benchmark | APIBank [720] | Fine-tuned | Tool manipulation | Synthesis
Benchmark | ToolAlpaca [721] | Base/Fine-tuned | Tool manipulation | Synthesis
Benchmark | T-Bench [722] | Fine-tuned | Tool manipulation | Synthesis
Benchmark | ToolBench [723] | Fine-tuned | Tool manipulation | Synthesis
Benchmark | BOLAA [724] | Base/Fine-tuned | Environment interaction | Benchmark collection
Benchmark | AgentBench [725] | Base/Fine-tuned | Environment interaction | Human annotation/Synthesis
Benchmark | HaluEval [604] | Base/Fine-tuned | Human alignment | Human annotation/Synthesis
Benchmark | PromptBench [726] | Base/Fine-tuned | Robustness | Benchmark collection
Benchmark | HumanEval [105] | Base/Fine-tuned/Specialized | Code synthesis | Human annotation
Benchmark | MultiMedQA [354] | Specialized | Healthcare | Benchmark collection
Benchmark | FLUE [727] | Specialized | Finance | Benchmark collection
Benchmark | LegalBench [728] | Specialized | Legal | Human annotation
Human | Chatbot Arena [729] | Base/Fine-tuned/Specialized | Human Alignment | Human annotation
Human | SciBench [730] | Fine-tuned | Complex reasoning | Human exam/practice
Model | AlpacaEval [731] | Fine-tuned | Instruction following | Synthesis
Model | MT-bench [729] | Fine-tuned | Human alignment | Human annotation
Model | TrustGPT [732] | Base/Fine-tuned | Human alignment | Benchmark collection
Model | LMExamQA [733] | Base/Fine-tuned | Knowledge utilization | Synthesis
Model | ChatEval [734] | Base/Fine-tuned | Knowledge utilization | Benchmark collection

and tested abilities. Next, we will discuss the evaluation approaches for different types
of LLMs.

Evaluation of Base LLMs. Base LLMs refer to the model checkpoints obtained right after
pre-training. For base LLMs, we mainly focus on examining the basic abilities (Section
7.1), such as complex reasoning and knowledge utilization. Since most of these basic
abilities can be assessed with well-defined tasks, benchmark-based approaches have been
widely used to evaluate base LLMs. Next, we will introduce common evaluation benchmarks and
evaluation procedures for base LLMs.

• Common benchmarks. To evaluate base LLMs, typical benchmarks are designed in the form of
close-ended problems like multiple-choice questions. These commonly used benchmarks can be
mainly divided into two categories: knowledge-oriented and reasoning-oriented benchmarks.
Knowledge-oriented benchmarks (e.g., MMLU [362] and C-Eval [713]) aim to evaluate the
capacity of world knowledge, while reasoning-oriented benchmarks (e.g., GSM8K [645],
BBH [363], and MATH [362]) focus on evaluating the capability of solving complex reasoning
tasks. Further, some recently proposed benchmarks (e.g., OpenCompass [715]) combine these
two types for a comprehensive comparison.

• Benchmark-based evaluation procedure. To perform the benchmark evaluation, each problem
will first be formatted into a prompt for LLMs to generate the result text. Then, the
generated result text will be parsed with human-written rules to get the predicted answer.
Finally, the performance of LLMs can be automatically calculated using standard metrics
like accuracy by comparing the predicted answer with the ground-truth one. The evaluation
approach can be conducted in either the few-shot or zero-shot setting, which might lead to
different evaluation results or rankings. Since base LLMs have not been instruction
fine-tuned (with relatively weak task generalization ability), the few-shot setting is
often more suitable for evaluation. For some complex reasoning tasks, CoT prompts also need
to be used to fully exhibit the capacity during evaluation. Another note is that this
evaluation approach can also be applied to assess the abilities of fine-tuned LLMs.
Actually, several leaderboards (e.g., Open LLM Leaderboard [709]) are built upon this
approach, evaluating both base and fine-tuned LLMs.

Evaluation of Fine-tuned LLMs. Fine-tuned LLMs in this part refer to the model checkpoints
obtained after instruction tuning or alignment tuning based on pre-trained model weights
(footnote 40: in some cases, these are also called chat models). Typically, fine-tuned LLMs
will be tested on various abilities (e.g., knowledge utilization and human alignment), and
thus it is common that they are assessed with multiple evaluation approaches. In addition
to benchmark-based evaluation, human-based and model-based approaches have also been widely
used to evaluate
the advanced abilities of fine-tuned LLMs. Next, we will introduce the two evaluation
methods.

• Human-based evaluation. Unlike automatic evaluation for basic abilities, human evaluation
typically considers more factors or abilities in real-world use, such as human alignment
and tool manipulation. In this evaluation approach, test tasks are usually in the form of
open-ended questions, and human evaluators are invited to make judgments on the quality of
answers generated by LLMs. Typically, there are two main types of scoring methods for human
evaluators: pairwise comparison and single-answer grading. In pairwise comparison, given
the same question, humans are assigned two answers from different models to determine which
one is better, while in single-answer grading, they only need to score a single answer at a
time. For example, HELM [522] employs humans to perform single-answer grading on
summarization and disinformation tasks, while Chatbot Arena [729] constructs a
crowdsourcing platform that allows users to engage in conversations with two anonymous chat
LLMs and report pairwise comparison results.

• Model-based evaluation. Since human-based evaluation is both expensive and
time-consuming, some work has proposed leveraging powerful closed-source LLMs such as
ChatGPT and GPT-4 as a surrogate for human evaluators [729, 731]. For example,
AlpacaEval [731] collects a set of instructions and utilizes a capable LLM (e.g., GPT-4) as
the judge to perform pair-wise comparisons against the reference outputs. Furthermore,
MT-bench [729] collects a set of multi-turn questions for evaluation and improves the
reliability of LLM-based evaluators through methods like ICL and CoT. Compared with human
evaluators, LLMs such as ChatGPT and GPT-4 can achieve high agreement with humans, in both
small-scale handcrafted and large-scale crowdsourced evaluation tasks. Despite this, these
closed-source LLMs are limited in access and have the potential risk of data leakage. To
address this, recent work [729] has

analysis to question answering. It has been used collaboratively with BBH [363] to evaluate
financial LLMs like BloombergGPT [358].

Pros and Cons of Different Evaluation Approaches. In the above, we have discussed different
evaluation approaches to assess the abilities of LLMs. Next, we simply analyze the pros and
cons of each evaluation approach.

• Benchmark-based approach. This evaluation approach can leverage existing benchmarks for
assessing the performance of LLMs. The tasks involved in these benchmarks often contain
sufficient test samples to measure the core abilities (e.g., reasoning). The whole
evaluation procedure can be (almost) automatic, and it is convenient to carry out test
experiments for various base LLMs, especially useful for monitoring the performance of
model checkpoints during pre-training. However, LLMs are often sensitive to the evaluation
settings, including the question prompts, zero-shot or few-shot tests, and the answer
parsing methods. Thus, one should take possible influencing factors into consideration when
conducting the evaluation experiments. The evaluation results should be noted with the
adopted evaluation settings. Another issue is the data contamination [56, 740], i.e., the
test data itself or relevant content has been contained in the pre-training corpora. This
phenomenon has become increasingly severe since more and more open data has been collected
for developing LLMs.

• Human-based approach. Human evaluation offers several advantages when assessing the
capabilities of LLMs to solve real-world tasks. One of the key benefits is its ability to
directly reflect the actual abilities of LLMs. Based on feedback and experiences from real
users, human evaluation provides a more direct measure of LLMs’ performance in real-world
scenarios. Further, it can conduct more flexible and diverse evaluation tasks based on
human evaluators. For instance, users can submit various queries and test the abilities of
LLMs according to their own task cognition. It
explored fine-tuning open-source LLMs (e.g., Vicuna [152]) allows for a deep understanding of the strengths and weak-
as model evaluators using scoring data from human eval- nesses of LLMs across different types of tasks and contexts.
uators, which has narrowed the gap with powerful closed- However, human evaluation also has inherent limitations
source LLMs (e.g., GPT-4). that could potentially affect its accuracy and consistency.
Factors such as personalized tastes and varying education
Evaluation of Specialized LLMs. Specialized LLMs refer levels among evaluators can introduce biases or even incon-
to the model checkpoints specially adapted to some do- sistencies in the evaluation process. In some cases, users’
mains or applications like healthcare [354] and finance [739]. judgments are likely to be subjective, which may not reflect
As special task solvers, specialized LLMs will be tested the true capabilities of the LLMs. Moreover, conducting
not only on general abilities (e.g., basic ability like com- robust and reliable human evaluations often requires a large
plex reasoning and advanced ability like human align- number of evaluators, which can be very expensive and
ment), but also on specific abilities related to their des- time-consuming. In addition, human evaluation is often
ignated domains or applications. For this purpose, one not reproducible, making it infeasible to extend existing
often needs to construct specific benchmarks tailored for the evaluation results or track the progress of LLMs.
target domains or applications. Then, these domain-specific • Model-based approach. As a surrogate for human-based
benchmarks can be combined with general benchmarks to approaches, model-based approaches serve to diminish the
conduct both comprehensive and targeted evaluation for reliance on human involvement, and enable more efficient
specialized LLMs. For example, MultiMedQA [354] is a and scalable evaluation. In addition, LLMs can provide
specific benchmark in healthcare, which includes medical meaningful explanations for the assigned rating scores,
examinations and healthcare questions. In this work [354], thereby enhancing the interpretability of evaluations. De-
MultiMedQA has been combined with MMLU [362] to spite their scalability and explanability, model-based ap-
assess the performance of specialized LLMs for healthcare, proaches have been found to suffer from several issues, in-
such as Med-PaLM [354]. Similarly, FLUE [739] constructs a cluding position, verbosity, and self-enhancement bias [729].
benchmark for finance, spanning from financial sentiment Specially, position bias (i.e., the order to present the re-
66

sponses) refers to the fact that LLMs tend to assign high Claude 2, where the first three models are developed by
scores for the answers at specific positions over others, OpenAI and the other two are developed by Anthropic.
verbosity bias means that LLMs favor verbose answers even
if they are short in quality compared with shorter answers, Tasks and Datasets. Next, we set up the evaluation tasks
and self-enhancement bias indicates that LLMs often over- and datasets for the abilities discussed in Section 7.1 and
rate in their own generations. In addition, since LLMs have Section 7.2. We mainly evaluate the zero-shot performance
limited capacities in solving complex reasoning problems, of LLMs on these datasets. For more complex tasks that are
they cannot serve as qualified evaluators for some difficult hard to be solved in the zero-shot manner (e.g., mathemati-
tasks (e.g., mathematical reasoning). These limitations can cal reasoning and tool manipulation), we mainly report the
be mitigated to some extent by specific prompt engineering 3-shot performance, considering the context length limit of
and fine-tuning strategies [729]. open-source models.
To summarize, our categorization (Table 15) of existing • Language generation. As discussed before, for language
work on LLM evaluation is mainly based on two major di- generation, we consider evaluating three kinds of tasks,
mensions, namely evaluation methodology and model type, i.e., language modeling, conditional text generation, and
which are further extended with the test abilities. There code synthesis. Specially, we select four commonly-used
are some recent work [735, 736] that also has discussed datasets, namely LAMBADA [252] (language modeling),
the categorization or taxonomies of existing work for LLM WMT’22 [547] (machine translation), XSum [551] (text sum-
evaluation. marization), and HumanEval [105] (code synthesis) for eval-
uation. In WMT’22, we construct a new evaluation set
by selecting 1000 examples for each language pair from
7.4 Empirical Evaluation the original large-scale test set to examine the average
The above evaluation benchmarks and approaches are performance of LLMs in machine translation. We evaluate
mainly employed to evaluate the overall abilities of LLMs. the zero-shot performance of LLMs on these datasets, and
In this part, we conduct a fine-grained evaluation of the compute the accuracy of predicting words for LAMBADA,
abilities discussed in Section 7.1 and Section 7.2. For each BLEU-4 for WMT’22, ROUGE-L for XSum, and pass@10 for
kind of ability, we select representative tasks and datasets HumanEval.
for conducting evaluation experiments to examine the cor- • Knowledge utilization. To evaluate the ability of knowl-
responding performance of LLMs. edge utilization, we select four question answering datasets
(i.e., TriviaQA [560], Natural Questions [556], Web Ques-
tions [559], and ARC [557]), and a fact extraction dataset,
7.4.1 Experimental Settings
WikiFact [573]. We also report the zero-shot performance of
In this part, we introduce the experimental settings for our LLMs on these datasets, and compute accuracy for ARC and
evaluation. exact match for other datasets.
Evaluation Models. To conduct the evaluation, we consider • Complex reasoning. For complex reasoning, we eval-
representative LLMs from open-source models to closed- uate the comparison models on OpenbookQA [568], Hel-
source API-accessing models as follows: laSwag [584], and SocialIQA [583] for knowledge reason-
• Open-source models. Existing open-source models can be ing; Colored Objects [70] and Penguins in the Table [70]
categorized into base models and instruction-tuned models. for symbolic reasoning; GSM8k [198] and MATH [362] for
Base models are only pre-trained on a large general-purpose mathematical reasoning. We compute the accuracy for Open-
corpus with the language modeling objective, but without bookQA, HellaSwag, and SocialIQA; solve rate for Colored
further supervised fine-tuning. In our evaluation, we select Objects and Penguins in the Table; and accuracy for GSM8k
four representative base models including LLaMA (7B) [57], and MATH. For knowledge reasoning tasks, we evaluate
LLaMA 2 (7B) [99], Pythia (7B and 12B) [96], and Falcon the zero-shot performance, since they are all QA tasks that
(7B) [749]41 . Instruction-tuned models are those fine-tuned can be solved in a zero-shot setting. For complex symbolic
using instructions (i.e., task datasets, daily chat, or syn- reasoning and mathematical reasoning tasks, we leverage
thetic instructions). In our experiments, we select four rep- 3-shot in-context exemplars to better elicit LLMs to accom-
resentative instruction-tuned models including Vicuna (7B plish them. Following existing work [33, 436], we also utilize
and 13B) [152], Alpaca (7B) [187], and ChatGLM (6B) [93]. the chain-of-thought prompting strategy for better solving
In addition, we also include LLaMA 2-Chat (7B) [99] for the mathematical reasoning tasks.
comparison, and it is a representative model that has been • Human alignment. For human alignment, we select
aligned with human via instruction tuning and RLHF, based TruthfulQA [558] to measure whether a LLM is truth-
on LLaMA 2 (7B). ful in generating answers to questions, CrowS-Pairs [605]
• Closed-source models. In addition to the open-source and WinoGender [606] to assess the stereotypes in LLMs,
models, there are also closed-source models that can only RealToxityPrompts [607] to evaluate the extent to which
be accessed via APIs, which have gained much attention LLMs generate toxic language, and HaluEval [604] to test
from both developers and researchers. Here, we select four the ability of LLMs to recognize hallucination. As the test
representative closed-source models including text-davinci- set of Real-Toxicity-Prompts is too large, we randomly
002/003 (short as Davinci002/003), ChatGPT, Claude, and sample 10000 examples from it for evaluation. We fol-
low LLaMA [57] to report the zero-shot performance, and
41. Experiments with larger models are still in schedule due to the compute the accuracy of identifying a claim as true for
limit of computational resources. TruthfulQA, accuracy of recognizing biased sentences (high

TABLE 16: Evaluation on the eight abilities of LLMs with specially selected tasks. The shade of the Orange and Blue
fonts denote the performance orders of the results in closed-source and open-source models, respectively. This table will
be continuously updated by incorporating the results of more models.

Column groups: Language Generation (LBD, WMT, XSum, HumanEval); Knowledge Utilization (TriviaQA, NaturalQ, WebQ, ARC, WikiFact)
Models LBD↑ WMT↑ XSum↑ HumanEval↑ TriviaQA↑ NaturalQ↑ WebQ↑ ARC↑ WikiFact↑
ChatGPT 55.81 36.44 21.71 79.88 54.54 21.52 17.77 93.69 29.25
Claude 64.47 31.23 18.63 51.22 40.92 13.77 14.57 66.62 34.34
Claude 2 45.20 12.93 19.13 78.04 54.30 21.30 21.06 79.97 35.83
Davinci003 69.98 37.46 18.19 67.07 51.51 17.76 16.68 88.47 28.29
Davinci002 58.85 35.11 19.15 56.70 52.11 20.47 18.45 89.23 29.15
LLaMA 2-Chat (7B) 56.12 12.62 16.00 11.59 38.93 12.96 11.32 72.35 23.37
Vicuna (13B) 62.45 20.49 17.87 20.73 29.04 10.75 11.52 20.69 28.76
Vicuna (7B) 63.90 19.95 13.59 17.07 28.58 9.17 6.64 16.96 26.95
Alpaca (7B) 63.35 21.52 8.74 13.41 17.14 3.24 3.00 49.75 26.05
ChatGLM (6B) 33.34 16.58 13.48 13.42 13.42 4.40 9.20 55.39 16.01
LLaMA 2 (7B) 66.39 11.57 11.57 17.07 30.92 5.15 2.51 24.16 28.06
LLaMA (7B) 67.68 13.84 8.77 15.24 34.62 7.92 11.12 4.88 19.78
Falcon (7B) 66.89 4.05 10.00 10.37 28.74 10.78 8.46 4.08 23.91
Pythia (12B) 61.19 5.43 8.87 14.63 15.73 1.99 4.72 11.66 20.57
Pythia (7B) 56.96 3.68 8.23 9.15 10.16 1.77 3.74 11.03 15.75
Column groups: Knowledge Reasoning (OBQA, HellaSwag, SocialIQA); Symbolic Reasoning (C-Objects, Penguins); Mathematical Reasoning (GSM8k, MATH); Interaction with Environment (ALFW, WebShop)
Models OBQA↑ HellaSwag↑ SocialIQA↑ C-Objects↑ Penguins↑ GSM8k↑ MATH↑ ALFW↑ WebShop↑
ChatGPT 81.20 61.43 73.23 53.20 40.27 78.47 33.78 58.96 45.12/15.60
Claude 81.80 54.95 73.23 59.95 47.65 70.81 20.18 76.87 47.72/23.00
Claude 2 71.60 50.75 58.34 66.76 74.50 82.87 32.24 77.61 34.96/19.20
Davinci003 74.40 62.65 69.70 64.60 61.07 57.16 17.66 65.67 64.08/32.40
Davinci002 69.80 47.81 57.01 62.55 67.11 49.96 14.28 76.87 29.66/15.20
LLaMA 2-Chat (7B) 45.62 74.01 43.84 43.40 38.93 9.63 2.22 11.19 24.51/5.60
Vicuna (13B) 43.65 70.51 45.97 53.55 36.91 18.50 3.72 8.96 22.74/5.00
Vicuna (7B) 43.84 69.25 46.27 44.25 36.24 14.03 3.54 1.49 6.90/1.40
Alpaca (7B) 47.82 69.81 47.55 39.35 40.27 4.93 4.16 4.48 0.00/0.00
ChatGLM (6B) 30.42 29.27 33.18 14.05 14.09 3.41 1.10 0.00 0.00/0.00
LLaMA 2 (7B) 44.81 74.25 41.72 43.95 35.75 10.99 2.64 8.96 0.00/0.00
LLaMA (7B) 42.42 73.91 41.46 39.95 34.90 10.99 3.12 2.24 0.00/0.00
Falcon (7B) 39.46 74.58 42.53 29.80 24.16 1.67 0.94 7.46 0.00/0.00
Pythia (12B) 37.02 65.45 41.53 32.40 26.17 2.88 1.96 5.22 3.68/0.60
Pythia (7B) 34.88 61.82 41.01 29.05 27.52 1.82 1.46 7.46 10.75/1.80
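Column groups: Human Alignment (TfQA, C-Pairs, WinoGender, RTP, HaluEval); Tool Manipulation (HotpotQA, Gorilla-TH, Gorilla-TF, Gorilla-HF)
Models TfQA↑ C-Pairs↓ WinoGender↑ RTP↓ HaluEval↑ HotpotQA↑ Gorilla-TH↑ Gorilla-TF↑ Gorilla-HF↑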
ChatGPT 69.16 18.60 62.50/72.50/79.17 3.07 66.64 23.80 67.20 44.53 19.36
Claude 67.93 32.73 71.67/55.00/52.50 3.75 63.75 33.80 22.04 7.74 7.08
Claude 2 71.11 10.67 60.00/60.00/55.83 3.20 50.63 36.4 61.29 22.19 23.67
Davinci003 60.83 0.99 67.50/68.33/79.17 8.81 58.94 34.40 72.58 3.80 6.42
Davinci002 53.73 7.56 72.50/70.00/64.17 10.65 59.67 26.00 2.69 1.02 1.00
LLaMA 2-Chat (7B) 69.77 48.54 47.50/46.67/46.67 4.61 43.82 4.40 0.00 0.00 0.22
Vicuna (13B) 62.30 45.95 50.83/50.83/52.50 5.00 49.01 11.20 0.00 0.44 0.89
Vicuna (7B) 57.77 67.44 49.17/49.17/49.17 4.70 43.44 6.20 0.00 0.00 0.33
Alpaca (7B) 46.14 65.45 53.33/51.67/53.33 4.78 44.16 11.60 0.00 0.00 0.11
ChatGLM (6B) 63.53 50.53 47.50/47.50/46.67 2.89 41.82 4.00 0.00 0.00 0.00
LLaMA 2 (7B) 50.06 51.39 48.83/48.83/50.83 6.17 42.23 3.80 0.00 0.00 0.11
LLaMA (7B) 47.86 67.84 54.17/52.50/51.67 5.94 14.18 1.60 0.00 0.00 0.11
Falcon (7B) 53.24 68.04 50.00/50.83/50.00 6.71 37.41 1.00 0.00 0.00 0.00
Pythia (12B) 54.47 65.78 49.17/48.33/49.17 6.59 27.09 0.40 0.00 0.00 0.00
Pythia (7B) 50.92 64.79 51.67/49.17/50.00 13.02 25.84 0.20 0.00 0.00 0.00

perplexity) for CrowS-Pairs, coreference resolution accuracy (he/she/they) for WinoGender, toxicity score for RealToxicityPrompts, and average accuracy of recognizing hallucinations for HaluEval. For TruthfulQA, we follow existing work [57] that utilizes text-davinci-003 to replace humans for scoring. For CrowS-Pairs and WinoGender, we follow the experimental settings of LLaMA [57] to compute the perplexity and coreference resolution score. For RealToxicityPrompts, we utilize the Perspective API (footnote 42) for toxicity evaluation.

• Interaction with environment. To test this ability, we select ALFWorld [611] and WebShop [612] for evaluation, which simulate real-world scenarios such as household and e-commerce environments. We follow the setting of ReAct [442], which evaluates the 1-shot and 2-shot performance of LLMs on WebShop and ALFWorld respectively, and compute success rate for ALFWorld and average score/success rate for WebShop. Further, we also follow ReAct [442] to reduce the length of the input prompt and utilize line break as the EOS token.

42. https://perspectiveapi.com/
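
Two of the metrics named in the task descriptions above, exact match for the question answering datasets and pass@k for HumanEval, are simple to state in code. The following is a minimal sketch rather than the released evaluation code. The function names are hypothetical. `exact_match` assumes a common normalization (lowercasing, dropping punctuation and English articles) that the text does not specify, and `pass_at_k` assumes the unbiased estimator introduced with HumanEval [105], where n samples are drawn per problem and c of them pass the unit tests.

```python
import re
import string
from math import comb


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset of the samples contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A dataset-level score would then be the mean of `exact_match` (or of `pass_at_k`) over all test items. The choice of normalization matters: as noted for the benchmark-based approach, results are sensitive to the answer parsing method, so it should be reported with the scores.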

TABLE 17: Prompt examples and their performance of ChatGPT on representative tasks. For most tasks, we compare the
performance for simple and complex prompts. We also present the reported performance of supervised methods. “LG”,
“KU”, “CR”, “SDG”, “IR” are short for “language generation”, “knowledge utilization”, “complex reasoning”, “structured
data generation”, “information retrieval”. “-” means there is no reported supervised result previously on this dataset.

Columns: Tasks, Datasets, Instructions, ChatGPT, Supervised. Each entry below lists the task, the dataset, the supervised result, and then each instruction with its ChatGPT score.

LG

Translation, WMT. Supervised: 41.40 [741]
Prompt 1 (ChatGPT: 20.66): I want you to act as a translator. Please translate the English sentence into Czech.
Prompt 2 (ChatGPT: 21.12): I want you to act as a translator. Translate the given English sentence into Czech, and ensure that the translated sentence is semantically consistent with the given sentence. \n Sentence: {source sentence} \n Translation:

Summarization, XSum. Supervised: 42.08 [742]
Prompt 1 (ChatGPT: 21.71): Please generate a one-sentence summary for the given document.
Prompt 2 (ChatGPT: 23.01): {document} Try your best to summarize the main content of the given document. And generate a short summary in 1 sentence for it.\n Summary:

KU

Closed-Book QA, ARC. Supervised: 92.00 [743]
Prompt 1 (ChatGPT: 85.19): Choose your answer to the question. {query} {options}
Prompt 2 (ChatGPT: 85.86): Choose a correct answer according to the given question, and output the corresponding id, do not answer other content except the answer id.

Open-Book QA, OBQA. Supervised: 87.20 [743]
Prompt 1 (ChatGPT: 81.20): Choose your answer to the question: {question} {choices}. You must only output A, B, C, or D without any extra explanation. The answer is
Prompt 2 (ChatGPT: 82.20): Following is a question that requires multi-step reasoning, use of additional common and commonsense knowledge, and rich text comprehension. Choose your answer to the question: \n Question: Frilled sharks and angler fish live far beneath the surface of the ocean, which is why they are known as \n Choices: \n A. Deep sea animals \n B. fish \n C. Long Sea Fish \n D. Far Sea Animals \n You must only output A, B, C, or D without any extra explanation. The answer is

Fact Extraction, WikiF. Supervised: 34.20 [522]
Prompt 1 (ChatGPT: 29.25): Complete the sentence with one or a few words.
Prompt 2 (ChatGPT: 31.21): Complete the given sentence with one entity name in Wikipedia (MUST be a noun) as short as possible, and ensure that the completed sentence conforms to the facts.

CR

Symbolic Reasoning, C-Objects. Supervised: -
Prompt 1 (ChatGPT: 53.20): Problem: {problem}\n Answer:
Prompt 2 (ChatGPT: 66.75): You are an expert in reasoning problem. Here are some examples about symbolic reasoning. You can use the knowledge in examples and solve the last problem. You should follow the examples and generate the final answer without external solution or words.

Math Word Problems, GSM8k. Supervised: 63.20 [744]
Prompt 1 (ChatGPT: 78.47): Problem: {problem}\n Solution: Let’s think step by step.
Prompt 2 (ChatGPT: 79.30): Let’s use python to solve math problems. Here are three examples how to do it,\n Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?\n‘‘‘def solution():\n """Olivia has $23. She bought five bagels for $3 each. How much money does she have left?"""\n money_initial = 23\n bagels = 5\n bagel_cost = 3\n money_spent = bagels * bagel_cost\n money_left = money_initial - money_spent\n result = money_left\n return result‘‘‘\n ...... \n How about this question?\n Q:

SDG

Code Synthesis, HumanEval. Supervised: 48.20 [745]
Prompt (ChatGPT: 79.88): I want you act as a code completer. Given a code snippet, your objective is to complete the code and ensure that it can achieve the described functionality.

Text-to-SQL, Spider. Supervised: 84.10 [746]
Prompt (ChatGPT: 70.10): ### Complete sqlite SQL query only and with no explanation.\n #\n### Sqlite SQL tables, with their properties: \n#\n{table}\n# {foreign_key}\n#\n### {question}\n SELECT

IR

Recommendation, MovieLens. Supervised: 76.25 [747]
Prompt (ChatGPT: 48.80): I’ve watched the following movies in the past in order: \n {user_his_text} \n\n Now there are {recall_budget} candidate movies that I can watch next: \n {candidate_text_order} \n Please rank these {recall_budget} movies by measuring the possibilities that I would like to watch next most, according to my watching history. Please think step by step. \n Note that my most recently watched movie is {recent_item}. Please show me your ranking results with order numbers. Split your output with line break. You MUST rank the given candidate movies. You can not generate movies that are not in the given candidate list.

Conversational Recommendation, ReDial. Supervised: 25.60 [748]
Prompt (ChatGPT: 17.20): Recommend 10 items that are consistent with user preference. The recommendation list can contain items that the dialog mentioned before. The format of the recommendation list is: no. title (year). Don’t mention anything other than the title of items in your recommendation list
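
The instructions in Table 17 contain placeholders such as {question} and {choices}, and the reported scores depend on how a model's reply is mapped back to a label. As a rough illustration of that pipeline, the sketch below fills the first OBQA instruction and extracts an option letter from a reply. The helper names (`build_prompt`, `parse_choice`), the option labelling, and the regular expression are assumptions made for illustration; they are not the parsing rules used in the reported experiments.

```python
import re
from typing import Optional

# First OBQA instruction from Table 17, with its two placeholders.
OBQA_TEMPLATE = (
    "Choose your answer to the question: {question} {choices}. "
    "You must only output A, B, C, or D without any extra explanation. "
    "The answer is"
)


def build_prompt(question: str, options: list[str]) -> str:
    """Fill the template, labelling the options A, B, C, D in order."""
    labelled = " ".join(f"{letter}. {opt}" for letter, opt in zip("ABCD", options))
    return OBQA_TEMPLATE.format(question=question, choices=labelled)


def parse_choice(reply: str) -> Optional[str]:
    """Return the first standalone uppercase option letter, or None if absent."""
    match = re.search(r"\b([ABCD])\b", reply.strip())
    return match.group(1) if match else None
```

A reply for which `parse_choice` returns `None` would typically be counted as incorrect. A stricter or looser pattern changes the measured accuracy, which is one reason the prompt and the parsing method should be held fixed across the compared models.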

• Tool manipulation. For tool manipulation, we consider two kinds of tools including search engine and model interfaces. Therefore, we adopt two tool manipulation benchmarks, i.e., HotpotQA [581] and Gorilla [619]. HotpotQA requires LLMs to use a search engine to retrieve documents from the web, and Gorilla requires them to invoke model APIs from three hubs: TorchHub, TensorHub and HuggingFace. We compute exact match for HotpotQA and accuracy for Gorilla. For HotpotQA, we follow ReAct [442] to report the 3-shot performance. For Gorilla, we follow the code released by its paper [619], and evaluate the zero-shot performance.

Implementation Details. For each task and dataset, we evaluate the compared LLMs using the same prompts and results parsing method, either provided by existing work (i.e., TruthfulQA, HotpotQA, Gorilla, HaluEval) or designed according to our empirical experience (i.e., TriviaQA, Natural Questions, Web Questions, ARC, WikiFact, GSM8k, MATH, C-Objects, Penguins, LAMBADA, WMT’22, XSum, HumanEval, CrowS-Pairs, WinoGender, RealToxicityPrompts). Specifically, all the experiments about closed-source models are based on invoking their official APIs, while for open-source models, we utilize their publicly available code and model parameters, and perform the inference on 8 A800-80G GPUs. For TriviaQA, OpenbookQA, HellaSwag, and SocialIQA, we experiment on the development set since the test set is not publicly released. For other datasets, we experiment on the test set. To reproduce our experiments, we also publicly release our experimental code and data at https://github.com/RUCAIBox/LLMSurvey/tree/main/Experiments.

7.4.2 Results Analysis and Findings
We report the experimental results in Table 16, and analyze the results in the following.

Analysis of Closed-Source Models. We summarize our analysis and findings of the five closed-source models (i.e., ChatGPT, Claude, Claude 2, Davinci003 and Davinci002) as follows:

• These five closed-source models achieve promising results as general-purpose task solvers, in which ChatGPT mostly performs the best. ChatGPT, Claude, Claude 2, Davinci003 and Davinci002 perform well in most tasks, including complex tasks (e.g., GSM8k), showing great potential to be general-purpose task solvers. Among them, ChatGPT exhibits a superior model capacity on the evaluation tasks, winning the most across all tasks. In some evaluation tasks, the performance gap between ChatGPT and other closed-source models is very large, especially for complex tasks, e.g., 78.47 (ChatGPT) vs. 49.96 (Davinci002) on GSM8k, and 79.88 (ChatGPT) vs. 51.22 (Claude) on HumanEval.

• Claude 2, ChatGPT and Davinci003 perform better on interaction with environment and tool manipulation tasks. On the two evaluation tasks, Claude 2, ChatGPT and Davinci003 perform better than other models by a large margin, e.g., 36.40 (Claude 2) vs. 26.00 (Davinci002) on HotpotQA, 44.53 (ChatGPT) vs. 7.74 (Claude) on Gorilla-TF, and 72.58 (Davinci003) vs. 22.04 (Claude) on Gorilla-TH. A possible reason is that these three models have been specially optimized towards these advanced abilities, e.g., supporting the use of external plugins.

• None of the comparison models performs well on very difficult reasoning tasks. On MATH and HotpotQA, all models (including ChatGPT) perform poorly. The two tasks are very difficult to solve, requiring accurate understanding of complex mathematical knowledge and performing multi-hop reasoning across documents, respectively. Further, these models also have a relatively weak performance on the machine translation task (WMT). A possible reason is that WMT also contains many evaluation examples in minor languages, which might not be well covered in the pre-training data of these LLMs.

Analysis of Open-Source Models. Next, we continue to show our analysis and findings about eight open-source models (i.e., LLaMA 2-Chat, Vicuna, Alpaca, ChatGLM, LLaMA 2, LLaMA, Pythia and Falcon) as follows:

• Instruction-tuned models mostly perform better than the base models. Among all the compared open-source methods, the instruction-tuned models (i.e., LLaMA 2-Chat, Vicuna, Alpaca and ChatGLM) mostly perform better than non-instruction-tuned models (i.e., LLaMA 2, LLaMA, Pythia and Falcon). It indicates that instruction tuning is generally capable of improving the few-shot or zero-shot ability of LLMs in solving various tasks. However, after instruction tuning, Vicuna (7B) and Alpaca (7B) suffer from performance degradations on LAMBADA, a language modeling task. The reason may be that the instruction data mainly focuses on enabling LLMs to follow human instructions, which is not always useful for the general language generation task.

• These small-sized open-source models do not perform well on mathematical reasoning, interaction with environment, and tool manipulation tasks. On the tasks of mathematical reasoning, interaction with environment and tool manipulation, all these evaluated open-source models perform poorly, including instruction-tuned ones. A possible reason is that the instruction data for fine-tuning these models is not specifically designed for these tasks. In addition, these open-source models may have limited model capacities due to small model sizes.

• The top-performing model varies on different human alignment tasks. For different human alignment tasks, we can see that these models achieve inconsistent performance rankings. For example, LLaMA 2-Chat (7B) performs the best among the compared open-source models on TruthfulQA, while Vicuna (13B) performs the best on CrowS-Pairs. A possible reason is that these tasks are designed with specific purposes for evaluating different aspects of human alignment, and these models exhibit varied performance on different tasks, even for the variants of the same model (e.g., Pythia (7B) and Pythia (12B)). More experiments and analysis on human alignment evaluation are needed to reveal more detailed findings.

• As a more recently released model, LLaMA 2 (7B) overall achieves a good performance, especially on complex reasoning tasks. For complex reasoning tasks, LLaMA 2 (7B) mostly performs better than other base models, e.g., 43.95 (LLaMA 2 (7B)) vs. 29.80 (Falcon (7B)) in C-Objects. For other tasks (e.g., language generation and knowledge utilization), LLaMA 2 (7B) can also achieve comparable performance as the best-performing base models. It has used more data
for pre-training (i.e., about 2 trillion tokens), which mainly contributes to the excellent performance. Furthermore, it also conducts a more robust data cleaning process.

• Scaling the open-source models can improve the performance consistently. By comparing the performance of Vicuna (7B) and Vicuna (13B), Pythia (7B) and Pythia (12B), we can see that the models with larger scales mostly perform better than smaller ones on these evaluation tasks, indicating the effectiveness of scaling up the model size. Across different tasks, scaling the model is more beneficial for more complex tasks (e.g., symbolic and mathematical reasoning), where the larger models mostly outperform smaller ones by a large margin.

Readers should note that these findings about open-source language models are limited to the evaluated model sizes. We will continually update this part by including the results of larger versions of these models, and also call for the support of computational resources for more experiments.

8 APPLICATIONS
In this section, we briefly review the recent progress on the applications of LLMs in two aspects, namely the impact on the research community and representative domains. Figure 18 shows a content organization of this section (footnote 43).

43. Note that we don’t aim to cover all the related research directions or domains, but instead demonstrate the use or impact of LLMs via these selected examples.

8.1 LLM for Research Community
As LLMs have revolutionized the way we develop AI algorithms, they have had a significant impact on the research community. In this part, we briefly review the advances led by LLMs for several representative research directions.

8.1.1 LLM for Classic NLP Tasks
As pre-trained language models (e.g., BERT) originated in the field of NLP, the technical advances of language models have an important impact on the research of NLP. In this part, we discuss the application of LLMs on five kinds of classic NLP tasks, including word-level, sentence-level, sequence tagging, information extraction, and text generation tasks, which have been the foundation of many existing NLP systems and applications. Note that we do not intend to comprehensively cover all NLP tasks, but instead try to analyze the impact of LLMs for fundamental NLP research through the basic tasks. We also omit the discussion of several tasks (e.g., language modeling) that have been discussed earlier in this survey.

Word/Sentence-level Tasks. As long-standing NLP tasks, word-level (e.g., word clustering [750] and sense disambiguation [751]) and sentence-level tasks (sentence matching [752] and sentiment classification [753]) have been widely studied in the literature and applied in real-world platforms. To solve these tasks, the key is to accurately understand the semantic information about the words or sentences. As rich high-quality labeled data about these tasks has been accumulated so far, existing work [23, 39] finds that small language models can achieve very good performance by fine-tuning on it. Recent studies [55, 754] have also tested the performance of LLMs on these tasks, showing that LLMs can also perform well via in-context learning (with very few examples). However, as small models can be specially optimized on these tasks to learn the specific task requirement and domain knowledge, full-data fine-tuned small models can mostly outperform LLMs using in-context learning on several classic tasks [755, 756], e.g., semantic matching and sentiment analysis.

Sequence Tagging. The sequence tagging tasks, e.g., named entity recognition (NER) [757] and part-of-speech (POS) tagging [758], are also fundamental tasks. Typically, such tasks require assigning each token in the input sequence a proper semantic category label, e.g., the classic B-I-O (Beginning, Inside and Outside) tagging scheme for NER tasks. In the era of deep learning, early efforts [759, 760] mainly integrate the learned sequence representations (e.g., using CNN, LSTM, and BERT) into the classic conditional random field model (CRF), which performs the tagging task based on structural prediction. Recently, researchers have tested the performance of LLMs in sequence tagging tasks, but observed that LLMs still face challenges in solving them using in-context learning [755], especially for special categories with ambiguous or rare names, e.g., the “MISC” (miscellaneous entity) and “ORG” (organization) classes. A possible reason is that LLMs may misunderstand the meanings of these classes in the human-annotated dataset, making it difficult to accurately understand their semantics according to the instruction and limited examples in the context.

Information Extraction. The information extraction task focuses on automatically extracting useful structured information from unstructured text data, such as relation extraction [761] and event extraction [762], which is also a crucial task relating to many NLP applications. Typically, previous studies formulate this task as a text classification task or a sequential labeling task. As information extraction often needs to accurately understand and process complex semantic relations (multiple relations within one sentence), in-context learning with LLMs typically underperforms state-of-the-art full-data fine-tuning methods [763, 764]. However, it is shown that enabling collaboration between LLMs and small models can further boost the performance of specific tasks [764, 765]. In addition, a recent study [766] also reveals that LLMs can achieve competitive zero-shot performance for information extraction with a two-stage workflow, making this approach attractive in future applications.

Text Generation. Text generation tasks, e.g., machine translation [626] and automatic summarization [550], are long-standing NLP tasks that have been widely studied, and there have been a number of deployed products and systems based on fine-tuned small models [309, 767]. Since the pre-training of LLMs is established on text prediction, they exhibit strong language generation abilities comparable to commercial products [629] and humans [630], with the help of proper prompts [768, 769]. Additionally, LLMs are flexible enough to effectively handle special requirements in real-world application scenarios, e.g., document-level translation [770], and also enable natural language interaction with users to further improve the generation quality [771]. Despite the above
71

LLM for Application
  Research Directions
    Classic Scenarios
      LLM for Classic NLP Tasks
        • Word/Sentence-level Tasks
        • Sequence Tagging
        • Information Extraction
        • Text Generation
      LLM for IR
        • LLM as IR Model
        • LLM-Enhanced IR Models
      LLM for Recommendation
        • LLM as Recommendation Model
        • LLM-enhanced Recommendation Models
        • LLM as Recommendation Simulator
    Enhanced Capabilities
      Multimodal LLMs
        • Vision-Language Alignment Pre-Training
        • Visual Instruction Tuning
        • Evaluation of MLLM
      KG Enhanced LLM
        • Retrieval-augmented LLM
        • Synergy Augmented LLM
    New Scenarios
      LLM-based Agent
        • Components: Memory/Planning/Execution
        • Single/Multi-agent based Application
      LLM for Evaluation
        • Score/Language-based Evaluation
        • Instruction Design, Multiple Feedbacks, Debate Agent
        • Meta-Evaluation
  Specific Domains
    Healthcare, Finance, Law, Education, Scientific Research

Fig. 18: The applications of LLMs in representative research directions and downstream domains.

success, recent work also reveals that LLMs struggle to address generation tasks for low-resource languages and domains, e.g., Marathi-to-English translation [772], due to their unbalanced training data across different languages.

Summary. Based on the above discussion, we summarize the suggestions and future directions for the use of LLMs in classic NLP tasks as follows:

• Suggestions: LLMs and small models have their own merits in different aspects: LLMs can provide unified solutions to various NLP tasks and achieve competitive performance (especially in the zero/few-shot setting), while small models are economical to develop and can be specially tuned according to target tasks, which can achieve good performance with sufficient high-quality labeled data [755, 756, 773, 774]. In applications, one can make suitable choices based on the actual needs, comprehensively considering flexibility, data availability, training compute, and efficiency.

• Future direction: Despite their excellent general capacities, LLMs still cannot effectively process NLP tasks in low-resource domains, e.g., minor language translation. To tackle such tasks, effective approaches need to be developed for injecting necessary task information or domain-specific knowledge into LLMs, either through fine-tuning or prompting. In addition, it is still challenging for LLMs to handle complex semantic relations in classic NLP tasks (e.g., nested entity extraction), which is worth more exploration from the underlying working mechanism of LLMs. It is also promising to combine LLMs and fine-tuned small language models so that they complement each other in solving complex cases of classic NLP tasks [775]. Another promising direction is to conduct human-machine collaborative research (e.g., conversational translation [771]) on NLP tasks, since LLMs can effectively understand human instructions and make meaningful responses.

8.1.2 LLM for Information Retrieval
The goal of information retrieval (IR) systems is to assist users in discovering ideal information resources (typically documents) and mitigating the information overload issue. Typically, contemporary IR systems adopt a retrieve-then-rerank pipeline framework [54]. Within this framework, the retriever initially retrieves relevant information from a large-scale corpus, and the reranker subsequently performs a multi-stage ranking procedure to acquire the most relevant information [776]. Since the advent of LLMs has a significant impact on the way of information access, we discuss how it advances the development of IR from two main aspects, namely LLMs as IR models and LLM-enhanced IR models.

LLMs as IR Models. Existing IR models can be broadly categorized into sparse models (relying on term-based lexical similarity) and dense models (relying on embedding-based semantic similarity) [54]. In particular, dense models are mainly implemented by fine-tuned PLMs (e.g., BERT). Compared to PLMs, LLMs have stronger model capacities in capturing text semantics, thus having the potential to improve existing dense IR models. However, due to the high overhead of LLMs, the majority of studies concentrate on employing LLMs as rerankers, aiming to refine the ranking of retrieved candidates. To achieve this, recent efforts often formulate special instructions that enable LLMs to perform reranking on a small set of provided candidate documents. Typically, such an approach does not necessitate model training, and achieves promising results compared with well-trained reranking methods [777, 778]. In particular, the LLM-based reranking approach can be implemented in different ways by zero-shot or few-shot instruction, including pointwise (estimating the relevance scores for query-document pairs) [779], pairwise (determining the relevance order of two documents) [778], or listwise ranking (sorting a subset of candidate documents) [780]. The essence of these methods lies in the special design of instructions for text reranking, such as the sliding window strategy for document lists [777, 781], setwise selection prompting [782], fine-grained relevance label incorporation [783], and pairwise comparison prompting [778]. In addition, recent efforts employ LLMs to generate intermediate texts (e.g., URLs) as retrieval results using few-shot demonstrations [784]. To further enhance the model performance, LLMs can be specially fine-tuned as backbones for reranking [785, 786] or retrieval (including dense retrieval [54] and model-based retrieval [787, 788]), similar to the fine-tuning process for traditional PLM-based IR models [785]. However, fine-tuning LLMs as IR models entails considerable expenses given the huge parameter scale of LLMs.

LLM-Enhanced IR Models. As another major research direction, LLMs can be employed to improve existing IR models (e.g., small models). A common challenge faced by existing IR models is the lack of relevance judgment annotations [789, 790]. To tackle this problem, LLMs can be instructed to annotate positive or negative documents for a given query [791], or to generate corresponding queries based on a set of documents in the corpus by referring to a few demonstrations [792, 793]. In addition to training data augmentation, LLMs have the potential to improve existing IR models by refining the search-oriented informativeness of both queries and documents. In IR systems, the input queries may be constrained by a user's cognitive and cultural competency, making it challenging to accurately express the real intent, and irrelevant content present in documents can also impact the relevance evaluation with the query. As a solution, LLMs can be utilized to rewrite the query for enhancing the understanding of the query intent and incorporating additional knowledge into the query through well-designed instructions. The rewritten query can take the form of an improved version of the original query [794], a document in the corpus that is related to the query [795], or an expansion of the query that is concatenated with a pseudo generated document [796]. In addition, documents can also be expanded with queries that are generated based on the original documents using LLMs for context extension [797].

Remaining Issues. In this part, we further discuss several important issues in applying LLMs to improve IR systems. First, though LLMs are capable of serving as general-purpose task solvers, they are not directly well suited for existing IR systems: they require high overhead for inference [777, 785], have limitations in modeling long texts or document lists [781], and need special adaptation (e.g., instruction tuning) to perform the text ranking task [798]. Therefore, more systematic approaches to adapt LLMs for modern IR systems should be investigated, to leverage their benefits and meanwhile overcome these limitations. Secondly, the advent of LLMs sheds light on the development of new information seeking ways (e.g., New Bing). It is meaningful to explore how to reshape the architecture and paradigm of IR by integrating the LLMs' capacities and the merits of existing IR systems [799]. Thirdly, existing work mainly focuses on text retrieval tasks, lacking a comprehensive consideration of multimodal information sources. As will be discussed in Section 8.1.4, multimodal large language models [800] are also widely studied, making it feasible to develop more powerful multimedia retrieval systems.

8.1.3 LLM for Recommender Systems
Unlike IR systems that analyze user search queries to retrieve relevant documents, recommender systems (RS) aim to capture the underlying user preference and provide appropriate information resources to users [801–804]. Typically, existing studies train a recommendation model (either classic or deep learning model) by fitting it over the user's logged data (e.g., click data) [747, 805]. However, these models often suffer from a series of technical issues, e.g., cold-start recommendation, domain transfer, and poor explainability. Recently, LLMs have demonstrated the potential to alleviate these issues of recommendation models [355, 806, 807], due to their strong capacities of domain generalization and language generation. In this part, we briefly review the recent progress of LLMs in recommender systems, from the following three aspects, namely LLMs as recommendation models, LLM-enhanced recommendation models, and LLMs as recommendation simulators.

LLMs as Recommendation Models. With specific methods or mechanisms, LLMs can be adapted to serve as recommendation models. Existing work along this line can be generally divided into two main categories. First, some methods prompt LLMs for completing the recommendation task in a zero-shot paradigm (i.e., without parameter tuning) [808, 809]. A series of prompt engineering methods like recency-focused and in-context learning are introduced to improve recommendation performance as well as alleviate the potential model biases [810, 811]. Second, another category of studies aims to specialize LLMs for personalized recommendation through instruction tuning [355, 812]. In particular, high-quality instruction data is key to adapting LLMs to the recommendation tasks, which can be constructed based on user-item interactions with heuristic templates. To further improve the instruction diversity, InstructRec [355] employs the self-instruct technique to simulate large amounts of potential user instructions in various scenarios like product search and personalized recommendations. In addition to representing each item by its text description, there is also growing attention on extending the LLM's vocabulary with semantic identifiers in recommender systems [813, 814], to incorporate collaborative semantics into LLMs.

LLM-enhanced Recommendation Models. In addition to instructing LLMs to directly provide recommendations, researchers also propose leveraging the universal knowledge encoded in LLMs to improve traditional recommender systems. Existing approaches in this line can be divided into three main categories. The first category employs LLMs to infer users' potential intention from their historical interaction data. Furthermore, traditional recommendation/search models employ the inferred intentions to improve the retrieval of relevant items [815, 816]. Additionally, several studies explore the use of LLMs as feature encoders. They employ LLMs to encode the side information of items and users (e.g., item's descriptions and user's reviews), thus deriving more informative representations of users and items.
These representations are then fed into traditional recommender systems as augmented input [817, 818]. As another alternative approach, several studies [819, 820] adopt a distillation-like way to transfer the LLM's capacities (e.g., semantic encoding) to improve traditional recommenders (i.e., small models). In particular, they align the hidden states of LLMs and traditional recommendation models via joint training. After training, since only the enhanced small model will be deployed online, it can avoid the huge overhead of LLMs in online service.

LLM as Recommendation Simulator. Inspired by the recent success of autonomous AI agents [821], LLMs have also been utilized to develop recommendation simulators [822, 823] (exemplified by RecAgent [822]), showing great potential to simulate real user behaviors in recommender systems [822, 824, 825]. Specifically, to make personalized simulation, an agent will be equipped with a profiling module that encompasses relevant identity information. Then, a memory module is introduced to store agents' past interaction experiences. During the process of simulation, agents are further prompted to conduct self-reflection based on their past experiences, to capture their underlying user preference. Most existing recommendation simulators are conducted in a user-oriented way, without explicitly modeling the items in the interaction process. To address this, AgentCF [824] models both users and items as agents, and further facilitates collaborative reflections to simulate user-item interactions, so as to capture the two-sided relations between users and items.

Remaining Issues. Despite these efforts, there are still several challenges to address when applying LLMs in recommender systems. First, existing studies have shown that LLM-based recommendation models in zero/few-shot settings tend to perform worse than traditional ID-based recommenders [809, 810]. This indicates that LLMs might lack an understanding of personalized user behaviors and domain-specific collaborative semantics. Although instruction tuning alleviates this issue to some extent [355, 812], it cannot fully reduce the semantic gap between LLMs and recommender systems, and also suffers from high tuning costs. Furthermore, recommender systems prioritize minimizing inference latency to enhance users' experience in low-resourced environments (e.g., phones), which poses a challenge to LLMs' inference speed as well as memory overhead. Therefore, it is important to explore improvement techniques, such as efficient tuning and quantization methods, to deploy LLMs efficiently and effectively in real-world recommender systems. In addition, existing LLMs have limited capacities in long context modeling, making it difficult to process the huge amount of user-item interaction data. Improved context length extension and context information utilization approaches should be developed to improve the modeling capacities of LLMs in long interaction sequences.

8.1.4 Multimodal Large Language Model
In existing literature [826, 827], multimodal models mainly refer to the models that can process and integrate information of various modalities (e.g., text, image, and audio) from input, and further produce corresponding output in certain modalities. In this part, we mainly focus on the multimodal extension of LLMs by enabling the information modeling of non-textual modalities, especially the vision modality, called multimodal large language models (MLLMs) [800]⁴⁴. To start our discussion, we specify the input to be text-image pairs and the output to be text responses. Similar discussions can be made for other modalities, e.g., language-audio models [828], which is beyond our scope here. In essence, MLLMs are developed by adapting the information from other modalities to the text modality, so as to leverage the excellent model capacities of LLMs that are learned based on world text. Typically, an MLLM comprises an image encoder for image encoding and an LLM for text generation, associated by a connection module that aligns vision and language representations. During generation, the image is first split into patches, and then transformed into patch embeddings by the image encoder and the connection module, to derive a visual representation that can be understood by the LLM. Subsequently, the patch embeddings and text embeddings are concatenated, and fed into the MLLM, allowing the language model to generate the response autoregressively. In the following, we will discuss the training, evaluation, and key points to develop capable MLLMs.

44. In existing work, large vision language models (LVLMs) [664] are also used to term such bimodal models that are developed based on LLMs. We use the naming of MLLMs in this part due to its wide use in existing literature.

Training Process. The training process of the MLLM includes two major stages: vision-language alignment pre-training and visual instruction tuning.

• Vision-language alignment pre-training. To develop MLLMs, existing work mostly initializes the vision encoder and the LLM with pre-trained models [154, 155, 829]. These models retain excellent vision and language capacities, but span different semantic spaces. Thus, the goal of vision-language alignment pre-training (i.e., the first-stage training) is to align the vision encoder and the LLM through end-to-end training on large-scale image-text pairs [830, 831]. However, directly tuning these two models on image-text pairs may cause the degradation of the original representation capacities. To improve the alignment performance, it is crucial to design effective training strategies and select appropriate pre-training data [832, 833]. Existing work mainly employs the following strategies for cross-modality alignment: (1) if the number of image-text pairs is not sufficiently large (e.g., less than 1M), it is often suggested to only update the connection module [834]; (2) if the training data includes high-quality text corpora [835] or image-text pairs with fine-grained annotations [836], fine-tuning the LLM can be conducted to boost the performance; (3) if the number of image-text pairs is very large (e.g., about 1B), fine-tuning the vision encoder is also plausible [832, 833], but the benefit remains to be further verified.

• Visual instruction tuning. After vision-language pre-training, the second-stage training, i.e., visual instruction tuning, aims to improve the instruction-following and task-solving abilities of MLLMs. Generally, the input of visual instruction tuning consists of an image and a task description, and the task is to generate a corresponding text output. To boost the performance, high-quality visual instruction data is key to eliciting and enhancing the abilities of MLLMs. Therefore, most studies are dedicated to constructing various visual instruction datasets. As the basic approaches, early studies construct visual instructions by distilling from GPT-4 [154] or reformulating vision-language task datasets [156]. To enhance the quality of instruction data, recent work further proposes improved strategies by increasing the instruction diversity [837], incorporating fine-grained information (e.g., coordinates of objects) into the instruction [836], or synthesizing complex visual reasoning instructions [838].

Evaluation of MLLM. After introducing the approaches to developing MLLMs, we further discuss how to effectively assess the multimodal capabilities of MLLMs from the following three aspects.

• Evaluation perspectives. The evaluation tasks for MLLMs can be categorized into two main types: perception and cognition tasks. Specifically, perception tasks aim to assess the model's abilities in understanding the basic semantics of the image content, while cognition tasks evaluate models with more complex tasks that require reasoning based on perception results. The perception ability is typically evaluated through classification tasks about attributes of image (e.g., topic and style) and object (e.g., existence and color) or OCR-related tasks, based on existing datasets or new datasets derived from existing images with annotations by humans or LLMs [839–842]. A notable perception issue is hallucination [843], where the model's responses contain content inconsistent with the image. Among existing studies about hallucination in MLLMs [837, 844, 845], object hallucination [846] has received much research attention. To conduct a stable, robust evaluation of object hallucination, POPE [847] proposes a polling-based object probing approach for converting object recognition into a series of binary questions, and the results indicate that current MLLMs often struggle with object hallucination. Cognition tasks, on the other hand, require MLLMs to perform reasoning based on image perception. A common reasoning task is visual question answering (VQA), where models answer questions about images that demand reasoning about spatial relationships [848], general knowledge [849], or scene text [850]. To fully explore the capabilities of MLLMs, HallusionBench [851] collects 200 sophisticated visual dependent or supplement questions, on which even the most advanced MLLMs like LLaVA-1.5 [834] and GPT-4V [133] fail to achieve good performance.

• Evaluation paradigms. The responses of MLLMs can be evaluated either in a closed-ended or an open-ended manner. Traditional multimodal tasks often rely on a closed-ended evaluation framework, where the assessment is based on the exact match between the model's response and the ground-truth answer. Examples include the VQA score [852] for visual question answering tasks and the CIDEr [853] score for captioning tasks. However, MLLMs generate responses in an open-ended way, which may contain the correct answer but not exactly match the ground truth. This discrepancy can lead to the underestimation of the model's performance in previous evaluation paradigms. To address this issue, recent approaches have incorporated humans or LLMs as evaluators [832]. For instance, MMBench [841] employs ChatGPT to align the model responses with the most relevant option in a set of multiple-choice questions. Similarly, LLaVA [854] utilizes GPT-4 for evaluating MLLMs' output, where GPT-4 takes the generated image captions and object bounding boxes as visual inputs for assessment. Such open-ended evaluation methods can improve assessment accuracy while incurring higher costs due to the involvement of humans or LLMs.

• Evaluation benchmarks. To facilitate a more thorough evaluation of MLLMs, various benchmarks have been developed. Part of them collect existing vision-language tasks for comprehensive evaluation. For instance, LVLM-eHub [855] aggregates 47 existing text-related visual tasks to assess six distinct capabilities of MLLMs, and Reform-Eval [856] takes this a step further by standardizing questions from existing benchmarks into a uniform format and discusses how the backbone models influence MLLMs' performance. In addition to incorporating existing tasks, several studies also derive new questions annotated by humans or with the help of LLMs. MME [842] creates a dataset by pairing images from public sources with manually-collected text instructions for perception and cognition evaluations. MMBench [841] transforms these instructions into multiple-choice questions and introduces CircularEval to ensure evaluation consistency. SEED-Bench [857] further considers temporal understanding tasks and enlarges the evaluation scale to 19K multiple-choice questions with the assistance of LLMs. MM-Vet [858] presents more complex tasks to assess the integrated multimodal capabilities of MLLMs. It starts by defining six essential multimodal abilities and then creates intricate questions by combining multiple abilities. In summary, the above benchmarks collectively contribute to the comprehensive evaluation and improved development of MLLMs.

Key Points for Improving MLLMs. To develop capable MLLMs, we continue to discuss three key points to improve the model capacities, from the perspectives of instruction data, training strategy, and safety and alignment.

• Visual instruction data. Extensive work [834, 859] has empirically found that both quantity and quality of visual instructions have an important impact on model performance of MLLMs. One basic way to construct visual instructions is to leverage the exceptional capability of LLMs to synthesize instructions based on text descriptions of images [854]. To further enhance the quality of instructions, one can construct fine-grained visual instructions with the help of human annotation [836, 860] or synthesize more complex data through carefully-designed prompts [838]. Despite the effectiveness of the above LLM-based approaches, one primary question emerges as to whether an LLM (i.e., a text generation model without training on any images) possesses the ability to generate sufficiently good visual instructions solely based on verbalized visual information (e.g., captions and coordinates). In particular, existing work has also revealed that visual instructions generated by LLMs sometimes contain misinterpretations about the visual information, e.g., object hallucination [847]. Therefore, it is crucial to design effective verification methods to control the quality of instruction data generated by LLMs [838]. Furthermore, more investigation is still needed about what makes good visual instructions and how visual instructions elicit specific multimodal abilities in MLLMs.
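
As a minimal illustration of the two ideas in the visual instruction data discussion (verbalizing visual information for a text-only LLM, and verifying the generated instructions), consider the Python sketch below. It is not the procedure of any cited work: the names `Box`, `verbalize`, `build_synthesis_prompt`, `unsupported_objects`, and the closed object `vocabulary` are hypothetical, and the check only handles single-word object names in their exact surface form.

```python
import re
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical annotated object: a label plus normalized corner coordinates."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def verbalize(captions, boxes):
    """Turn image annotations into plain text that a text-only LLM can read."""
    lines = ["Captions:"]
    lines += [f"- {c}" for c in captions]
    lines.append("Objects (label: x1, y1, x2, y2):")
    lines += [
        f"- {b.label}: {b.x1:.2f}, {b.y1:.2f}, {b.x2:.2f}, {b.y2:.2f}" for b in boxes
    ]
    return "\n".join(lines)


def build_synthesis_prompt(captions, boxes):
    """Assemble an (assumed) prompt asking an LLM to synthesize one instruction."""
    return (
        "You are given a textual description of an image.\n"
        + verbalize(captions, boxes)
        + "\nWrite one question about the image and its answer."
    )


def unsupported_objects(text, boxes, vocabulary):
    """Return vocabulary objects mentioned in `text` but absent from the annotations.

    A non-empty result flags a possible object hallucination in generated data.
    """
    annotated = {b.label.lower() for b in boxes}
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(w for w in vocabulary if w in words and w not in annotated)


if __name__ == "__main__":
    boxes = [Box("dog", 0.1, 0.2, 0.5, 0.9), Box("frisbee", 0.6, 0.1, 0.8, 0.3)]
    print(build_synthesis_prompt(["A dog jumps for a frisbee."], boxes))
    vocab = {"dog", "cat", "frisbee"}
    print(unsupported_objects("Is the dog chasing a cat?", boxes, vocab))
```

In such a setup, a generated instruction with a non-empty `unsupported_objects` result could be discarded or sent for human review. A realistic filter would additionally need to handle plurals, synonyms, and multi-word object names.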

• Model training. Different from LLMs, MLLMs are not tential to generate hallucinated content [604] and the lack of
trained from scratch, but instead developed based on pre- domain-specific knowledge [865]. As a promising solution,
trained language and vision models. Existing work em- knowledge graphs (KGs), which store enormous knowledge
ploys a typical two-stage approach for training MLLMs, in the triple format, i.e., ⟨ head entity, relation, tail entity ⟩, can
i.e., vision-language alignment pre-training and visual in- be utilized to enhance the task performance of LLMs by pro-
struction tuning. In essence, existing MLLMs aim to (1) pre- viding precise and necessary knowledge. Generally, knowl-
serve the inherent capabilities and parametric knowledge edge enhanced approaches can be expanded into other
of LLMs as possible, and meanwhile (2) effectively adapt forms of structured data (e.g., tables and databases) [864],
to multimodal tasks by leveraging the pre-trained LLMs while we limit our discussion to the integration of KG for
and visual encoders. To achieve the above two goals, two improving LLMs, which are detailed in two aspects, namely
typical training strategies are often employed for visual retrieval-augmented LLM and synergy-augmented LLM.
instruction tuning, either only optimizing the connection
module [156] or fine-tuning both the connector module Retrieval-Augmented LLM. Due to the huge amount of
and LLM component [854]. As we can see, the former fact records in a KG, existing work typically adopts a
can reserve the original capacities of LLMs but likely have retrieval model to first obtain a relatively small subgraph
a weak an adaptation performance, while the latter can from KG, and then leverages it to enhance LLMs by en-
fully adapt to multimodal tasks but suffer from the loss of riching the relevant knowledge. Before the advent of LLMs,
original capacities of LLMs. More efforts should be made to the retrieved subgraphs are often supplemented into train-
investigate how to effectively balance the two aspects, so as ing data, injecting knowledge information into PLMs via
to achieving improved multimodal capacities. In addition, parameter learning [866–868]. In contrast, to leverage the
existing MLLMs are still overly dependent on the capacities retrieved knowledge, LLMs mainly incorporate it as part of
of LLMs, which pose the limits on many multimodal tasks the prompt, without parameter update. To implement this
(e.g., space positioning). It will be meaningful to explore approach, there are two main technical problems, i.e., how
improved training approaches of language models, so that to retrieve relevant knowledge from KGs and how to make
multimodal information can be also utilized in this process. better use of the structured data by LLMs. For the first issue
• Safety and alignment. Safety and alignment has been (i.e., retrieving relevant knowledge), a typical approach is
widely discussed in LLMs, which aim to regulate the behav- to train a small language model (e.g., RoBERTa) to iden-
iors of models by technical approaches [66]. This topic is also tify question-related fact triples [869]. To further improve
important to MLLMs. Even a highly advanced MLLM (e.g., the retrieval performance, several studies also propose an
GPT-4V [133]) can be susceptible to safety issues. For exam- iterative reading-then-reasoning framework, enabling the
ple, GPT-4V might occasionally exhibit factual inaccuracies LLM to interact with the KG multiple times and acquire the
and baseless inferences about images. In some cases, it may even generate harmful content targeting
specific individuals or groups [133]. Furthermore, open-sourced MLLMs are also prone to generating
hallucinated responses [847] and can be easily manipulated to produce harmful content [861]. To
address the aforementioned issues, some studies collect specialized visual instructions to mitigate
the problem of hallucination [837]. An alternative approach is to train a revision model to rectify
hallucinated responses generated by MLLMs in a post-hoc way [862]. Additionally, aligning MLLMs
with RLHF can also assist MLLMs in generating responses with improved factuality [863]. Despite
these efforts, existing alignment techniques for MLLMs mainly concentrate on several specific
aspects (e.g., hallucination), lacking a comprehensive consideration of alignment criteria. More
efforts should be made to promote the research of safety and alignment for MLLMs.

8.1.5 KG-Enhanced LLM
Despite the excellent capacities, LLMs often suffer from challenges on knowledge-intensive tasks,
such as the po-
As a promising solution, knowledge graphs (KGs), which store enormous knowledge in the triple
format, i.e., ⟨head entity, relation, tail entity⟩, can be utilized to enhance the task performance
of LLMs by providing precise and necessary knowledge. Generally, knowledge-enhanced approaches can
be expanded into other forms of structured data (e.g., tables and databases) [864], while we limit
our discussion to the integration of KG for improving LLMs, which is detailed in two aspects,
namely retrieval-augmented LLM and synergy-augmented LLM.

required knowledge in a more accurate way [451]. For the second issue (i.e., utilizing retrieved
knowledge), a straightforward approach is to serialize the retrieved subgraph and craft specific
prompts to include it as the input of LLMs [468, 653]. However, due to the loss of structured
information in knowledge serialization, LLMs cannot fully capture the structural semantics conveyed
by the original KGs. To address this issue, several model-based approaches train a specialized
language model (e.g., T5) to transform the subgraph into natural language text [870]. To guarantee
the transformation accuracy, this relies on sufficient training pairs (often constructed in an
unsupervised manner) [871] and excellent model capability [872].

Synergy-Augmented LLM. To solve complex tasks (e.g., multi-hop question answering [658]), LLMs
often need to query a KG multiple times, following a systematic solution plan. We call such a
multi-turn interaction approach to enhancing LLMs synergy-augmented LLM. To better synergize the
LLM and KG in a complementary manner, recent studies propose to decompose the complex task into
multiple sub-goals and iteratively solve each one by leveraging the necessary knowledge from the KG
[451, 873, 874]. In this process, the LLM can be regarded as an autonomous agent (detailed in
Section 9.2), which automatically generates the plan and executes it through interaction with the
KG environment [873]. Specifically, the mainstream approaches typically start by enumerating the
candidates using the available knowledge information at the current step, and then retrieve the
most appropriate candidates for the next step according
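As a minimal sketch of the enumerate-then-select loop used by synergy-augmented approaches, the following Python example walks a toy KG hop by hop. The triples, the function names, and the word-overlap scorer are illustrative assumptions; in a real system the selection step would be carried out by prompting an LLM, which is not shown here.

```python
from collections import defaultdict

# Toy KG of (head, relation, tail) triples; all entries are made up for illustration.
TRIPLES = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "field", "Physics"),
    ("Warsaw", "capital_of", "Poland"),
    ("Warsaw", "located_on", "Vistula"),
]


def build_index(triples):
    """Index triples by head entity so candidates can be enumerated per step."""
    index = defaultdict(list)
    for head, relation, tail in triples:
        index[head].append((relation, tail))
    return index


def select_candidate(question, candidates):
    """Hypothetical stand-in for the LLM: score a relation by word overlap with the question."""
    words = set(question.lower().replace("?", "").split())

    def score(candidate):
        relation, _tail = candidate
        return len(words & set(relation.split("_")))

    return max(candidates, key=score)


def multi_hop(question, start_entity, index, max_hops=2):
    """Alternate between enumerating candidates and selecting one, collecting evidence."""
    entity, evidence = start_entity, []
    for _ in range(max_hops):
        candidates = index.get(entity, [])  # step 1: enumerate neighbors of the current entity
        if not candidates:
            break
        relation, tail = select_candidate(question, candidates)  # step 2: pick the next hop
        evidence.append((entity, relation, tail))
        entity = tail
    return entity, evidence


index = build_index(TRIPLES)
answer, evidence = multi_hop(
    "Which country is the city where Marie Curie was born in the capital of?",
    "Marie Curie",
    index,
)
```

Here `answer` is the entity reached after the last hop and `evidence` holds the traversed triples. Restricting each step to the neighbors of the current entity is what keeps the candidate set small in this toy setting.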
to the question [873, 874]. By iterating the above two steps, LLMs can gradually collect relevant
evidence [873, 874], and finally approach the correct solution. Despite the effectiveness,
enumeration of the candidates over the KG would lead to a vast search space [875]. To address it,
StructGPT [451] proposes a more efficient way to access knowledge information using specialized
interfaces for KGs. Specifically, it carefully designs the specialized interfaces according to the
common data operations on KGs (e.g., relation extraction and triple extraction), to ensure
efficient and accurate data extraction. In this way, LLMs can be instructed to better manipulate
and process the structural information of KGs, thus achieving improved task performance.

Future Directions. Besides the above approaches, there are several promising directions for
KG-enhanced LLM remaining underexplored. First, due to the variety of structured data, it is still
difficult for LLMs to directly leverage various kinds of knowledge sources, e.g., domain-specific
KGs. Therefore, it is essential to explore a unified way for LLMs to manipulate and utilize
different knowledge sources. As a potential solution, it is promising to develop effective
approaches to help LLMs comprehend and make use of the access interfaces provided by specific
knowledge sources to acquire precise knowledge [451], while more efforts should be made to
investigate how to adapt to the data variety in a cost-effective way. Second, with the evolution of
real-world information, the knowledge stored in LLMs may become outdated or incorrect. It is
necessary to explore how to synchronize the updated knowledge into LLMs in a cost-effective manner
[876, 877]. Third, it is promising to investigate the use of factual information from KGs to align
LLMs in generating more faithful content [878, 879], which can help reduce the hallucination of
LLMs.
In addition to exploring KG-enhanced LLMs, it is also meaningful to leverage LLMs to improve the
tasks on the KG side (i.e., LLM4KG) [865, 880]. A typical example is that LLMs can help supplement
or construct the KG. We omit the discussion of this part, since it is beyond our scope.

8.1.6 LLM for Evaluation
While human evaluation can generally offer reliable quality assessment, it is also often hindered
by high annotation costs, significant time requirements, and annotation inconsistencies [881]. In
contrast, automatic evaluation can be employed as a scalable alternative to human evaluation.
Traditional automatic evaluations have relied on reference-based metrics (e.g., BLEU and ROUGE).
Recently, the emergence of LLMs as general task solvers highlights their potential as automatic
evaluators [649, 729], making it promising to conduct LLM-based evaluation. In the following part,
we will introduce the recent progress on LLM for evaluation, including evaluation formats, methods,
meta-evaluation, and the remaining issues.

Evaluation Formats. Depending on the type of evaluation outcome, the evaluation format can be
categorized into score-based evaluation and language-based evaluation. Score-based evaluation
employs measurable metrics to assign quality scores (e.g., ratings or rankings) for evaluated
texts. A prevalent way is to conduct pairwise comparison, where LLMs are used to determine the
partial order relation of candidate texts following specific guidelines [352, 649, 729], which
greatly simplifies the evaluation task. However, it may face the inefficiency issue when scaling up
the number of candidates [729]. When high-quality reference texts are available during evaluation,
LLMs can be instructed to score texts under the guidance provided by references [718, 729, 730]. On
the other hand, language-based evaluation focuses on generating critiques and suggestions, offering
qualitative explanation beyond simple quantitative scoring [369, 377, 882, 883]. It is particularly
useful for gathering language feedback signals for human alignment tuning [369, 882]. Furthermore,
it can evolve into a multi-turn interaction framework, where LLM-based evaluators provide natural
language feedback to existing solutions from task solvers [884]. This framework evaluates the
ability of LLMs to leverage language feedback for refining self-generated solutions.

Evaluation Methods. A common method for LLM-based evaluation involves prompting LLMs with specific
instructions. To further improve the quality of LLM-based evaluation, recent work proposes to
prompt LLMs with varied contexts to generate diverse evaluation feedback. These contexts vary in
aspects such as the candidate order [649, 729], evaluation perspectives [885, 886] (e.g.,
relevance, clarity, originality), and evaluation explanation [649]. The multiple pieces of
generated evaluation feedback are then aggregated to produce a final evaluation result, which makes
the evaluation process less prone to biases from individual feedback and allows for a more thorough
evaluation by covering a wider range of evaluation aspects. To further improve the quality of the
single-model evaluation, recent studies also develop multi-agent collaboration frameworks [886–888]
or fine-tune LLMs as specialized evaluators [369, 377, 882, 883, 889]. In a multi-model
collaboration mode, different LLMs evaluate the candidates by engaging in discussions to align
preferences and reach a consensus [887, 888]. This method helps reduce the potential biases in
individual models through the consensus reached by multiple agents. Another approach to improving
single-model evaluation is to specialize LLMs as scorers or critics through fine-tuning [369, 377,
882, 883, 889]. This process involves creating datasets annotated with preferences and feedback
from humans or proficient LLMs. These datasets are then used to train evaluation-oriented models,
enabling them to generate pairwise preference or language feedback. The specialized LLM evaluators
demonstrate competitive performance with fewer parameters [377, 883, 889].

Meta-Evaluation. To effectively assess the quality of LLM-based evaluators, meta-evaluation
benchmarks have been introduced for gauging the agreement with human preferences and the fairness
of the evaluations made by LLMs [649, 729, 886, 890, 891]. As a representative benchmark, MT-Bench
[729] evaluates the agreement between LLMs and human judgments, demonstrating that GPT-4 aligns
closely with human preferences in no-tie comparisons on 80 multi-turn questions. In addition, to
address potential biases arising from subjective human evaluations, LLMBar [890] manually designs
outputs that are objectively worse but superficially appealing, which could mislead
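The idea of varying the candidate order and then aggregating the verdicts, together with an agreement rate over no-tie comparisons, can be sketched as follows. This is only an illustration under stated assumptions: `toy_judge` is a hypothetical stand-in for an LLM call (it prefers longer answers and is deliberately biased toward the first position), and the example answers and human labels are made up.

```python
def toy_judge(first, second):
    """Hypothetical stand-in for an LLM judge; it favors the first-listed answer (order bias)."""
    bias = 5  # characters of "free" advantage granted to the first position
    return "first" if len(first) + bias >= len(second) else "second"


def debiased_compare(answer_a, answer_b, judge=toy_judge):
    """Query the judge under both candidate orders and aggregate the two verdicts."""
    a_wins_when_first = judge(answer_a, answer_b) == "first"
    a_wins_when_second = judge(answer_b, answer_a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "A"
    if not a_wins_when_first and not a_wins_when_second:
        return "B"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def agreement_rate(model_votes, human_votes):
    """Fraction of comparisons where the judge matches the human label, skipping ties."""
    pairs = [(m, h) for m, h in zip(model_votes, human_votes) if "tie" not in (m, h)]
    return sum(m == h for m, h in pairs) / len(pairs) if pairs else 0.0


comparisons = [
    ("a detailed, well-supported answer", "short"),
    ("ok", "a much longer and more complete reply"),
    ("roughly same", "about the same"),
]
human = ["A", "B", "A"]  # made-up human preferences
model = [debiased_compare(a, b) for a, b in comparisons]
rate = agreement_rate(model, human)
```

In the third comparison the two answers are close enough that the positional bias decides the verdict, so swapping the order flips it and the aggregate falls back to a tie instead of reporting a spurious preference.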
evaluators. The evaluation results reveal that even the most advanced LLMs still fall short of
human-level evaluation in the challenging setting.

Remaining Issues. As discussed in Section 7.1.1, recent studies demonstrate that LLM-based
evaluators exhibit multiple types of bias, such as order bias, self-preference bias, and length
bias [649, 729]. Although some biases can be mitigated through methods like multi-path ensemble or
multi-agent collaboration, they remain inherent to LLM-based evaluators. Consequently, addressing
these biases intrinsically within the models continues to be a challenging issue. In addition,
recent work has revealed that LLMs may be incapable of understanding their self-generated content,
exhibiting a weaker understanding capacity compared to their generation capabilities [892]. Even
the most advanced LLMs still struggle to identify their reasoning or factual errors without
external feedback [893, 894]. Consequently, current LLM-based evaluators might not be adequate for
evaluating top-tier LLMs or complex tasks. This underscores the importance of improvement
approaches for LLM-based evaluators, especially for evaluating capable LLMs and complex tasks
demanding sophisticated reasoning, planning, and domain-specific knowledge.

8.2 LLM for Specific Domains
In this part, we discuss the applications of LLMs on several representative domains, including
healthcare, education, law, finance, and scientific research assistance.

Healthcare is a vital application field closely related to human life. Ever since the advent of
ChatGPT, a number of studies have applied ChatGPT or other LLMs to the medical domain. It has been
shown that LLMs are capable of handling a variety of healthcare tasks, e.g., biology information
extraction [765], medical advice consultation [895], mental health analysis [896], and report
simplification [897]. As the major technical approach, researchers typically design specific
prompts or instructions to guide LLMs to perform a wide range of medical tasks. To further harness
the power of LLMs in the healthcare domain, researchers propose to develop healthcare-related LLMs
[354, 898, 899]. Specifically, the Med-PaLM models [354, 898] achieve expert-level performance on
the United States Medical Licensing Examination (USMLE), and earn greater approval from physicians
in answering consumers' medical questions. However, LLMs may fabricate medical misinformation [897,
900], e.g., misinterpreting medical terms and suggesting advice inconsistent with medical
guidelines. In addition, it would also raise privacy concerns to upload the health information of
patients [765] into a commercial server that supports the LLM.

Education is also an important application domain where LLMs potentially exert significant
influence. Existing work has found that LLMs can achieve student-level performance on standardized
tests [46] in a variety of subjects (e.g., mathematics, physics, computer science) on both
multiple-choice and free-response problems. In addition, empirical studies have shown that LLMs can
serve as writing or reading assistants for education [901, 902]. A recent study [902] reveals that
ChatGPT is capable of generating logically consistent answers across disciplines, balancing both
depth and breadth. Another quantitative analysis [901] shows that students utilizing ChatGPT
(either keeping or refining the results from LLMs as their own answers) perform better than average
students in some courses from the computer security field. Recently, several perspective papers
[903, 904] also explore various application scenarios of LLMs in classroom teaching, such as
teacher-student collaboration, personalized learning, and assessment automation. However, the
application of LLMs in education may lead to a series of practical issues, e.g., plagiarism,
potential bias in AI-generated content, overreliance on LLMs, and inequitable access for
non-English speaking individuals [905].

Law is a specialized domain that is built on professional domain knowledge. Recently, a number of
studies have applied LLMs to solve various legal tasks, e.g., legal document analysis [906], legal
judgment prediction [907], and legal document writing [908]. A recent study [909] has found that
LLMs exhibit powerful abilities of legal interpretation and reasoning. Moreover, the latest GPT-4
model achieves a top 10% score in a simulated bar exam compared with human test-takers [46]. To
further improve the performance of LLMs in the law domain, specially designed legal prompt
engineering is employed to yield advanced performance in long legal document comprehension and
complex legal reasoning [910, 911]. To summarize the progress, LLMs can act as helpful assistants
to the legal profession. Despite the progress, the use of LLMs in law raises concerns about legal
challenges, including copyright issues [912], personal information leakage [913], or bias and
discrimination [914].

Finance is an important field where LLMs have promising application prospects. LLMs have been
employed on various finance-related tasks, such as numerical claim detection [915], financial
sentiment analysis [916], financial named entity recognition [917], and financial reasoning [918].
Despite the competitive zero-shot performance exhibited by general-purpose LLMs in the finance
tasks, they still underperform domain-specific PLMs containing million-scale parameters [915]. To
leverage the scaling effect of LLMs, researchers collect large-scale finance corpora for
continually pre-training LLMs (e.g., BloombergGPT [358], XuanYuan 2.0 [919], and FinGPT [920]).
BloombergGPT has demonstrated remarkable performance across a diverse range of financial tasks
while maintaining competitive performance in general-purpose tasks [358]. Nevertheless, it is
imperative to consider the potential risks in the application of LLMs in finance, as the generation
of inaccurate or harmful content by LLMs could have significant adverse implications for financial
markets [358]. Therefore, stricter review and monitoring of the use of LLMs in the financial field
is needed.

Scientific research is another promising field where LLMs can empower development progress. Prior
research demonstrates the effectiveness of LLMs in handling knowledge-intensive scientific tasks
(e.g., PubMedQA [921], BioASQ [922]), especially for LLMs that are trained on scientific-related
corpora [35, 218, 923]. Given the excellent general abilities and broad scientific knowledge, LLMs
hold significant potential as helpful assistants across various stages of the scientific research
pipeline [924]. First, during the literature survey stage, LLMs can help conduct a comprehensive
overview of the progress in a specific research field [925, 926]. Second, during the research idea
generation stage, LLMs demonstrate the ability to generate intriguing scientific hypotheses [927].
Third, during the data analysis stage, LLMs can be employed to conduct automatic approaches to
analyzing the data characteristics, including data exploration, visualization, and deriving
analytical conclusions [928, 929]. Fourth, during the paper writing stage, researchers can also
benefit from the assistance of LLMs in scientific writing [930, 931], in which LLMs can offer
valuable support through diverse means, such as summarizing the existing content and polishing the
writing [932]. In addition, LLMs can aid in the automated paper review process, encompassing tasks
such as error detection, checklist verification, and candidate ranking [933]. Despite these
advances, there is much room for improving the capacities of LLMs to serve as helpful, trustworthy
scientific assistants, to both increase the quality of the generated scientific content and reduce
the harmful hallucinations.

Summary. In addition to the aforementioned work, the applications of LLMs have also been discussed
in several other domains. For instance, in the psychology domain, some recent work has studied the
human-like characteristics of LLMs, such as self-awareness, theory of mind (ToM), and affective
computing [934, 935]. In particular, an empirical evaluation of ToM conducted on two classic
false-belief tasks speculates that LLMs may have ToM-like abilities since the model in the GPT-3.5
series achieves comparable performance with nine-year-old children in ToM tasks [934]. In addition,
another line of work has investigated applying LLMs to the software development domain, e.g., code
suggestion [936], code summarization [937], and automated program repair [938]. To summarize,
assisting humans with LLMs in real-world tasks has become a significant area of research. However,
it also presents challenges. Ensuring the accuracy of LLM-generated content, addressing biases, and
maintaining user privacy and data security are crucial considerations when applying LLMs to
real-world scenarios.

9 ADVANCED TOPICS

In this section, we focus on discussing several advanced topics that have attracted extensive
attention in the research community, and these topics are related to challenging technical issues
that largely limit LLMs' capacity. Next, we will introduce these issues and discuss how to address
them with feasible approaches.

9.1 Long Context Modeling
In real-world application scenarios, there are increasing demands for long context modeling
capacities of LLMs, especially for text file processing (e.g., information parsing, extraction, and
summarization). Many mainstream LLMs have provided support for long context windows. To enhance the
long context modeling abilities, there are generally two widely used approaches, namely scaling
position embeddings and adapting context window. Next, we introduce the two approaches in detail.

9.1.1 Scaling Position Embeddings
Transformer-based LLMs can learn effective position embeddings within the maximum training length.
When adapting LLMs to language tasks beyond the maximum training length, it is necessary to scale
to larger position indices. Specifically, some position embedding methods have been shown to
possess a certain degree of ability to generalize to text beyond the training length, which is
termed extrapolation capability, including T5 bias [82], ALiBi [283], xPos [296] and even NoPE
[939]. However, as one of the mainstream position embedding methods, RoPE exhibits limited
extrapolation ability in empirical studies [259]. In the following, we discuss several methods that
adapt RoPE to longer texts.
• Direct model fine-tuning. To adapt LLMs to a long context window, a straightforward approach is
to directly fine-tune the models on long texts with the target length. The context extension can be
scheduled with gradually increased lengths in a multi-stage manner (e.g., 2K → 8K → 32K). To
conduct effective extension, it often requires specially prepared long text data for training
(Section 9.1.3), and data quality plays a critical role in improving LLMs' long context capacities
[940]. However, such a direct fine-tuning approach tends to be inherently slow when adapting LLMs
for long texts [259].
• Position interpolation. This method downscales the position indices so that they fall within the
original context window, to avoid rotation angles that were out of distribution during pre-training
[259, 941]. Specifically, this approach multiplies all position indices by a scaling coefficient
$L/L'$ ($L < L'$), where $L$ and $L'$ denote the original and target context window length,
respectively. Experimental results [259] have shown that this method can extend the context window
effectively and efficiently, compared to the above approach of direct model fine-tuning. However,
it is worth noting that this technique may have an adverse impact on the model's performance when
handling normal texts within the original context window [259, 942].
• Position truncation. To mitigate the challenges posed by out-of-distribution rotation angles,
another practical approach is to truncate longer relative positions to satisfy the requirement of
the maximum training length. ReRoPE and LeakyReRoPE [943] introduce a pre-defined window length for
truncation, which is smaller than the maximum training length. Specifically, position indices
within this pre-defined window would be retained, while those indices beyond the window are either
truncated to the pre-defined window length or interpolated to align with the maximum training
length. This strategy can preserve the attention mechanism with the neighbor tokens (within the
window length), and further enhance the extrapolation capacity. However, this approach needs to
compute the attention matrices twice, incurring additional computational costs.
• Base modification. Since LLMs are usually trained with a pre-set maximum training length,
wavelengths in certain dimensions of RoPE may exceed the training length for longer text [295], on
which language models may not be sufficiently trained, i.e., training data cannot cover a complete
rotation cycle. Thus, when processing long text, some rotation angles for certain dimensions would
never be seen in the training phase [351]. Formally, given a fixed rotation
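To make the position-interpolation and wavelength arguments of this subsection concrete, here is a small pure-Python sketch. It assumes the usual RoPE parameterization, with per-pair frequencies $\theta_i = b^{-2(i-1)/d}$ and wavelengths $\lambda_i = 2\pi/\theta_i$; the function names, the toy dimension, the base of 10000, and the context lengths are illustrative assumptions rather than settings of any particular model.

```python
import math


def rope_thetas(dim, base=10000.0):
    """Per-pair rotation frequencies theta_i = base^(-2(i-1)/dim) for i = 1..dim/2."""
    return [base ** (-2 * (i - 1) / dim) for i in range(1, dim // 2 + 1)]


def wavelengths(thetas):
    """Wavelength lambda_i = 2*pi / theta_i: number of positions needed for one full rotation."""
    return [2 * math.pi / theta for theta in thetas]


def interpolate_positions(positions, train_len, target_len):
    """Position interpolation: multiply each index by L / L' with L < L'."""
    scale = train_len / target_len
    return [p * scale for p in positions]


dim, train_len, target_len = 8, 2048, 8192  # illustrative values only
thetas = rope_thetas(dim)
lams = wavelengths(thetas)

# Dimensions whose wavelength exceeds the training length never complete a rotation in training.
undertrained = [i for i, lam in enumerate(lams, start=1) if lam > train_len]

# After interpolation, even the largest target position stays below the training length.
scaled = interpolate_positions([0, 4096, target_len - 1], train_len, target_len)
```

With these toy values only the lowest-frequency pair has a wavelength longer than the training length, and the last target position is mapped to a fractional index just under `train_len`, which is why interpolated positions produce rotation angles within the range covered during training.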
angle $t \cdot \theta_i$, a smaller basis $\theta_i$ allows for a greater distance $t$, i.e.,
enabling the modeling of longer texts [254, 295, 940]. According to the formula
$\theta_i = b^{-2(i-1)/d}$ in Equation 4, decreasing the basis can be achieved by increasing the
value of the base. In addition, decreasing the base can also help re-scale the wavelengths of all
dimensions below the training length, while it often needs continual pre-training to adapt the LLMs
to long context windows [351]. A recent study [351] has empirically compared these two base
modification methods, and shown that decreasing the base demonstrates better extrapolation
performance, while increasing the base performs better within the training length.
• Basis truncation. Similar to the base modification, the truncation of the basis also concentrates
on dealing with the singular dimensions with wavelengths exceeding the training length [944].
According to the definition $\lambda_i = 2\pi/\theta_i$ in Equation 5, the dimension with a large
wavelength $\lambda_i$ has a small basis $\theta_i$ accordingly. Based on this observation, this
approach first defines a basis range $[a, c]$. Given the basis range, the value of the basis is
modified in the following ways: (1) when $\theta_i \geq c$, the value is retained, (2) when
$\theta_i \leq a$, the value is set to zero, and (3) when $a < \theta_i < c$, the value is
truncated to a fixed small value. Via basis truncation, the out-of-distribution rotation angles can
be avoided at larger position indices. However, this approach does not perform very well at long
context tasks [944].

9.1.2 Adapting Context Window
Since Transformer-based LLMs have limited context windows, they cannot directly integrate or
utilize the entire information of the long sequences exceeding the context window. To alleviate the
limitation, several methods have been proposed to adapt LLMs to long context, as discussed below.
• Parallel context window. Inspired by fusion-in-decoder [945], parallel context window methods
[424, 946] adopt a divide-and-conquer strategy to process input text. Specifically, they divide the
input text into multiple segments, each independently encoded with shared position embeddings. At
the generation stage, the attention masks are modified so that subsequent tokens can access
previous tokens in each segment. Nevertheless, this method cannot distinguish the order of
different segments, resulting in a limited model capacity on certain tasks.
• Λ-shaped context window. Some prior work has revealed that LLMs tend to allocate greater
attention weights to the starting and nearest tokens among all previous tokens [947, 948], and it
potentially results in the “lost in the middle” phenomenon [949]. Based on this observation,
LM-Infinite [950] and StreamingLLM [948] propose to employ a “Λ-shaped” attention mask, which
selectively preserves the initial tokens and the nearest tokens that each query can attend to and
then discards any tokens beyond this scope. Experiments demonstrate that this method can facilitate
extra-long text generation with a fixed memory [948]. However, it may struggle to model the
long-range dependency in the context window, since it cannot effectively utilize the information
from the discarded tokens [948].
• Token selection. It has been shown that a relatively small subset of tokens can effectively
capture the majority of attention patterns in a Transformer [951], e.g., the top-k attention scores
can well approximate the original full attention. Therefore, a number of studies propose different
methods to select the most relevant tokens from token-level or block-level memory units for
generation. Token-level selection methods store the past keys in external memory and utilize a k-NN
search method to retrieve the k most relevant tokens for generation [257, 951, 952]. For a decoder
model, it typically employs one certain layer to access these top-k external tokens, while still
adopting the normal context window in the remaining layers [257, 952]. Block-level selection
methods [953, 954] first segment the long sequence into blocks with the same length and represent
each block as several key vectors for retrieval. Then, the most relevant blocks to the query as
well as the neighbor and initial blocks will be selected for attention computations. Unlike
token-level selection methods, block-level selection methods typically retrieve different tokens
with specific heads.

9.1.3 Long Text Data
To further enhance the long context modeling capacity, it typically requires continual pre-training
with specially curated long text data. Next, we discuss how to prepare the long text data from the
two aspects of quantity and quality.
• Quantity effect. Different from the pre-training phase that requires vast amounts of data, a
small amount of long-text data for continual pre-training is sufficient for context window
extension [259]. Several studies show that LLMs have obtained the capability of utilizing distant
information via large-scale pre-training data, and thus they only need to adapt to extended context
windows during continual pre-training [955]. For example, it has been shown that LLaMA-2-7B or
LLaMA-2-13B can achieve a context window length of over 100K tokens and effective context
utilization [955] with training on several billion tokens. However, the ability of LLMs to handle
short text may be affected to some extent [259].
• Quality effect. In addition to the quantity, the quality of long text data is essential to long
context modeling for LLMs. For instance, LongWanjuan [956] categorizes long texts into holistic,
aggregated, and chaotic long texts based on three metrics, i.e., coherence, cohesion, and
complexity, and shows that removing chaotic data and keeping coherent and cohesive data are useful
to enhance the long text modeling capacities of LLMs. Further, up-sampling cohesive data can lead
to further improvement. In addition, when preparing long text data, the data mixture should be
carefully adjusted to avoid large distribution drift from the original pre-training data.
In addition to the studies based on vanilla Transformer, there is a surge of Transformer variants
with efficient attentions and other efficient architectures, aiming to alleviate the high
computational costs for modeling long texts. These studies are discussed in Section 4.2.1 and
Section 4.2.2. Furthermore, context compression and prompting techniques (e.g., iterative reasoning
[957]) have also been proven to be a viable strategy for handling long text tasks [957–960],
without the need for model adaptation.

9.2 LLM-empowered Agent
The research on agents in AI aims to develop entities that
can perceive the environment, make decisions, and take actions to achieve specific goals [961].
However, traditional agents are often limited to heuristic rules or specific environments, which
constrain their generalization to open-domain scenarios [962]. Given that LLMs possess excellent
capacities in solving complex tasks, they have rapidly emerged as promising solutions for serving
as the core computation unit of agents [821]. In this part, we will first introduce the framework
for LLM-based agents, then explore their applications, and finally discuss the future directions.

9.2.1 Overall Framework
Next, we first detail the key components of an LLM-based agent and then present the typical
workflow.

Components. Typically, there are three main components in an LLM-based agent: memory, planning^45,
and execution. Specifically, the memory component aims to store the information perceived from the
environment and can be utilized to support decision-making. In particular, LLM-based agents usually
maintain information in both short-term memory and long-term memory with the operations of reading
and writing. Short-term memory usually refers to the internal context window of LLMs (i.e., input),
where LLMs can read and write through actions like reasoning [963]. Long-term memory, in contrast,
can be mapped to external storage like vector databases [539], where LLMs can read through
retrieval and write with reflection [688]. In particular, profiles are usually implemented with
long-term memory, which is an important feature for an agent that specifies its role and function
[821]. The planning component is responsible for generating the action plan based on the
information from the memory component. In data format, the plan usually takes the form of
text-based instructions [434] or code-based programs [436]. To generate it, LLM-based agents will
first propose several candidates and then select a more suitable one among them [429]. The initial
plan can be further refined with execution feedback from the environment [530]. The execution
component is in charge of carrying out the plan from the planning component, which can be fulfilled
by the internal LLM [434] or external tools [963].
Workflow. With the three components mentioned above, a typical workflow of an LLM-based agent is as
follows. First, it receives information from the environment and writes it into short-term memory.
Then, the agent processes the newly received information in the short-term memory. Such a process
can be enhanced with information retrieved from long-term memory. Subsequently, the planning
component utilizes the processed information from short-term memory to generate the next plan.
Finally, the execution component carries out the plan generated from the planning component, which
can be further assisted with external tools. By repeating the aforementioned process, the LLM-based
agent can autonomously adjust its behavior in response to feedback from the environment and
ultimately achieve its goal. Once LLM-based agents receive user requests or are assigned goals,
they follow the above workflow to accomplish tasks through multi-turn interactions with the
environment.
To summarize, in an LLM-based agent, the LLM serves as the core computation unit and is equipped
with components including memory, planning, and execution. These components are integrated in a
systematic way under the control of the LLM during interactions with the environment. For more
details, readers may refer to the comprehensive survey for LLM-based AI agents [821].

9.2.2 Applications
Recently, LLM-based agents have shown great potential in autonomously solving complex tasks, making
it feasible to rapidly develop capable applications for specific domains or tasks. In this section,
we will discuss the applications in single-agent and multi-agent scenarios.

Single-agent based Applications. Applications based on a single-agent mode mainly aim to develop
capable task solvers that can autonomously complete user requests. A large number of single-agent
projects have been developed, which focus on general-purpose task solving. As a representative
project, AutoGPT [536] empowers LLMs with long/short-term memory management and external tools like
search engines. In order to autonomously address a user request, AutoGPT understands the request
with knowledge from its memory and actions like reasoning, decomposes it into a detailed plan,
executes the plan step-by-step with the assistance of tools, and refines the remaining plan based
on feedback from the environment. Such an iterative process continues until the user request is
successfully resolved. Other similar projects include GPT-Engineer [964] and XAgent [965]. In
addition, there is also some work that aims to develop autonomous agents for specific domains, such
as WebGPT [81] for the web-browsing environment, ProgPrompt [532] for the real-life environment,
and Voyager [699] for the Minecraft environment.
Multi-agent based Applications. Different from single-agent systems where agents work
independently, multi-agent systems work in collaboration to unleash collective intelligence.
Typically, multiple agents can be instantiated from the same or different LLMs, each with their
respective roles and functions. According to the coordinating strategies among these agents,
multi-agent systems can be divided into two categories: cooperation-based and competition-based. In
the cooperation-based mode, to share information and seek collaborative actions among agents,
various communication protocols have been proposed, including free-form dialogue [966], structured
document [967], and data embedding [968]. Based on the communication protocol, agents can be
effectively organized for downstream applications, such as software engineering [967], user
behavior analysis [822, 824], and society simulation [535]. As a representative project,
LangChain^46 is a framework for developing multi-agent based applications powered by LLMs. It
enables users to deploy different roles of LLM-based agents and utilize them to solve tasks via
working in collaboration. In addition, other similar frameworks, such

45. Section 6.4 introduces planning as a utilization approach for
LLMs, while in this section, we describe its utilization as a functional
component in LLM-based agents. 46. https://www.langchain.com/

as AgentVerse [969] and AutoGen [970], can also be utilized for developing multi-agent collaborative systems. In the competition-based mode, debate serves as one of the popular communication protocols to foster divergent thinking and elicit valuable external feedback among agents. This approach is beneficial for domains that demand precise decision-making and accurate responses, such as mathematical reasoning [971] and evaluation [734].

9.2.3 Discussion

Despite the huge success, there still remain several technical challenges that limit the development and application of LLM-based agents. In this part, we discuss the remaining challenges from the perspective of computational burden, human alignment, complex capability extension, and robustness.

Computational Costs. With the ever-increasing capabilities of LLMs [821], they demonstrate promising performance on agent applications. However, this also introduces significant issues in terms of efficiency due to the high computational demands and intricate interaction mechanisms involved. Furthermore, in multi-agent systems with numerous LLM instances, as the number of agents increases, this issue would be more severe, since the communication network within multi-agent systems also becomes increasingly complex. Therefore, more effective and efficient communication protocols and architectures are essential to support the heightened coordination demands among agents.

Alignment with Human Sociality. LLM-based agents can be conceptualized as individual entities, with the emergence of sociability resulting from the interaction among these agents. Autonomous agents often assume specific roles such as coders or researchers, making role-playing a vital capability for agents to solve downstream tasks [972]. However, LLMs, typically trained on web corpora, face difficulties in accurately mimicking roles that are infrequently discussed online or are emergent. They also lack self-awareness in conversational scenarios due to inadequate modeling of human cognitive psychology. Thus, it is imperative to develop improved agent techniques, including both training methods and architectures, to better align LLMs with human preferences and enhance their role-playing abilities.

Capability Extension. LLM-based agents, similar to humans, require advanced capabilities (e.g., tool learning) to fulfill complex functions or tasks, which might be beyond their capacity scope. To address this issue, tool use has become a widely-used approach to enhancing LLMs’ capacities in various complex tasks. For example, when answering informative user questions, they use search engines to retrieve information from the internet. However, the quality and quantity of existing available tools impose limitations on their accessibility and comprehensiveness. It would become even more difficult for LLM-based agents to use such limited tools when interacting with dynamic and changing environments. In addition, as the scale of tools expands, the compatibility and extensibility between the agents and tools must be further improved to facilitate complex task resolution.

Robustness and Trustworthiness. The deployment of LLM-based agent systems necessitates robustness and trustworthiness [973]. The system should be resilient against adversarial inputs from various modalities such as text, image, or audio. Incorporating existing techniques like adversarial training, data augmentation, and sample detection to increase sensitivity to aggressive information in the input can fortify the system’s security. Concurrently, it is challenging to ensure the credibility of LLM-based agents given the severe hallucination issues inherently rooted in LLMs. While existing methods such as constrained decoding during inference and external knowledge integration can mitigate these issues to some extent [974], further exploration of efficient and effective alignment methods is necessary to develop reliable agent systems.

9.3 Analysis and Optimization for Model Training

In Section 4.3, we have introduced basic techniques for training LLMs. As the scale of model parameters and data continues to expand, efficiently training larger models with limited computational resources has become a critical technical challenge in the development of LLMs. This challenge primarily encompasses two technical issues: firstly, how to optimize memory usage when loading and processing models across GPU clusters, and secondly, how to maintain or improve training efficiency as models scale. Next, we will conduct quantitative analyses and introduce advanced training techniques addressing the two aforementioned issues.

9.3.1 Estimation of Training Memory Consumption

In this part, we will first estimate the GPU memory consumption for training LLMs.

Model States Cost. Model states often occupy the majority of memory during training, typically consisting of model parameters, gradients, and optimizer states. As introduced in Section 4.3.2, mixed precision training has been widely utilized in LLM training. For a model containing P parameters, both the model parameters and their gradients are typically stored as 16-bit floating-point numbers, requiring a total storage of 4P bytes (2P for the parameters and 2P for the gradients). When using optimizers such as Adam [318] or AdamW [975], an additional set of 32-bit floating-point numbers is needed to store the optimizer states, including the copy of model parameters, gradient momenta, and gradient variances, which leads to a total storage of 12P bytes (4P for each of these components). Consequently, the total memory required for storing the model states during training is 16P bytes. For instance, training LLaMA-7B (P ≈ 6.7 × 10^9) requires around 100GB memory to store the model states alone.

Activations Cost. Activations are the intermediate states that need to be stored in the forward pass for gradient computation during backpropagation. For example, for a binary operation $Y = WX$, calculating the gradient $\partial Y / \partial W$ necessitates the input $X$, which should be preserved during the forward pass. In Table 18, we list the estimation of the activation memory consumption for different components within the Transformer model. Take LLaMA-7B

(V = 32,000, L = 32, H = 4,096, H′ = 11,008, N = 32) as an example, it would take 16GB memory to store activations per device under the setting B = 1, T = 2,048.

TABLE 18: The activation memory consumption of each computation within the LLaMA model based on research work [976]. We denote batch size by B, sequence length by T, the vocabulary size by V, the number of heads in the attention module by N, the dimension of each head by D, the hidden size by H (H = ND), and the intermediate size inside FFN by H′. Equations ➀-➈ are layer-wise and need to be multiplied by the number of the layers L when computing the total consumption.

Equations | Activation consumption
➀ $Q, K, V = XW^{Q,K,V}$ | store $X$ with size $2BTH$
➁ $Q, K = \mathrm{RoPE}(Q, K)$ | store $Q$ and $K$ with size $4BTH$
➂ $O = \mathrm{Attn}(Q, K, V)$ | store $Q$, $K$, and $V$ with size $6BTH$ and results of softmax with size $2BT^2N$
➃ $X = OW^{O}$ | store $O$ with size $2BTH$
➄ $X = \mathrm{Add\&Norm}(X)$ | store $X$ with size $2BTH$
➅ $G, U = X[W^{G}, W^{U}]$ | store $X$ with size $2BTH$
➆ $D = \mathrm{Swish}(G) \cdot U$ | store $G$ and $U$ with size $4BTH'$
➇ $X = DW^{D}$ | store $D$ with size $2BTH'$
➈ $X = \mathrm{Add\&Norm}(X)$ | store $X$ with size $2BTH$
➉ $\mathrm{CE}(\mathrm{softmax}(XW^{L}))$ | store $X$ with size $2BTH$ and results of softmax with size $4BTV$

Other Memory Cost. In addition to the main factors affecting GPU memory consumption discussed above, the memory usage also includes the following aspects:
• Deep learning frameworks. The PyTorch framework requires approximately 1GB of GPU memory when loading its core functions. This is the essential overhead for the framework to operate.
• Distributed frameworks. When utilizing distributed training frameworks (e.g., DeepSpeed), the GPU memory usage of the framework can fluctuate between 1GB and 4GB. The exact amount depends on the level of optimization and the hyper-parameter settings. This portion of the memory is primarily used to optimize memory management and communication efficiency during the training process.
• Intermediate results and memory fragmentation. Besides the activations, there also exist intermediate results that will affect the peak memory consumption. Take the computation of the softmax function in Equation ➉ as an example: the implementation of the Transformers library requires an additional 8BTV bytes of memory, as it needs to store two additional copies of the 32-bit input (4BTV bytes each). Moreover, during the training process, memory fragmentation occurs due to the non-contiguous allocation and release of memory, typically leading to an additional 0.5GB to 1GB of memory consumption.

9.3.2 Memory Optimization Methods

Based on the aforementioned analysis, we will next introduce several typical methods for optimizing the memory usage for training LLMs.

Gradient Checkpointing. Gradient checkpointing [329], also known as activation recomputation, is a technique used to optimize memory usage during backpropagation. Specifically, the activations need to be retained during the forward pass. However, storing all activation values for each layer requires a significant amount of memory resources (detailed in Table 18). To reduce the memory cost, gradient checkpointing retains only a subset of the activations during the forward pass and recomputes the remaining ones during the backward pass to save memory, albeit with additional computational overhead. In implementation, gradient checkpointing typically involves storing the input of each Transformer layer and recomputing the corresponding activation values during backpropagation.

ZeRO. The zero redundancy optimizer (ZeRO) [977] technique, proposed by the DeepSpeed library, focuses on alleviating the issue of memory redundancy in data parallelism. As mentioned in Section 4.3.2, data parallelism requires each GPU to store the same copy of the model states, resulting in a memory consumption of 16P bytes per GPU. A direct side effect of data parallelism is that it causes memory redundancy issues, since not all of the above data needs to be retained on each GPU. To resolve this, the ZeRO technique aims to retain only a fraction of the data on each GPU, while the remaining data can be obtained from other GPUs when required. Specifically, ZeRO provides three strategies, depending on how the three parts of the data are stored, namely optimizer state partitioning (ZeRO-1), gradient partitioning (ZeRO-2), and parameter partitioning (ZeRO-3). Empirical results indicate that the first two strategies do not increase the communication overhead, and the third solution increases the communication overhead by about 50% but saves memory proportional to the number of GPUs. PyTorch has implemented a technique similar to ZeRO, called fully sharded data parallel (FSDP) [330].

Offload. In GPU-limited environments, DeepSpeed has proposed the offload technique [978], which can significantly reduce the GPU memory required for training by offloading part of the model states and computational overhead to CPU memory. Specifically, gradients and optimizer states would be offloaded to CPU memory, with only the model parameters kept on GPU. The computationally intensive forward and backward propagation still need to be performed on GPU to ensure efficiency, while parameter updates, which require relatively fewer computations, are executed on CPU to reduce GPU memory overhead. Furthermore, Infinity [979] allows training models that exceed the GPU memory limits by utilizing high-speed disk storage (e.g., NVMe).

9.3.3 Efficiency Optimization Methods

In addition to memory-saving techniques, it is also crucial to maintain computational throughput as the model scales. In what follows, we will describe two representative efficiency optimization methods.

FlashAttention. FlashAttention [303, 980] is an optimization method for the attention mechanism that significantly reduces the memory transfer during attention computation. The core idea is to minimize the storage of intermediate results and directly obtain the final result. According to the attention computation equation $\mathrm{softmax}\left(\frac{QK^\top}{\sqrt{D}}\right)V$, multiple

intermediate results, such as $QK^\top$ and the attention score matrix, need to be explicitly retained, leading to numerous memory read-write operations. FlashAttention uses specially designed methods, such as matrix partition and operator fusion, to keep intermediate results in the cache until the final result is obtained, thus reducing the amount of memory read and write operations. Additionally, FlashAttention can effectively reduce the peak memory usage and activation memory consumption (Section 9.3) during LLM training and inference. By using FlashAttention, LLaMA-2 (7B) with a sequence length of 2,048 and a batch size of 8 requires only one-tenth of the computation time compared to the standard method.

Sequence Parallelism. Compared with the 3D parallelism introduced in Section 4.3, sequence parallelism can be considered a fourth parallelism dimension in pre-training, particularly when handling long data sequences. The core idea is to partition the sequence across multiple devices for parallel computation. The primary challenge lies in minimizing communication across the devices during attention computation. DeepSpeed-Ulysses [981] partitions the sequence along the hidden dimension, allowing each device to receive a subset of the attention heads and compute attention for different heads in parallel. In comparison, Ring Attention [982] partitions the sequence along the length dimension, where the query matrices on each device are in turn computed with the key and value matrices on other devices. Furthermore, Ring Attention is also compatible with FlashAttention and can be considered as its distributed extension.

9.4 Analysis and Optimization for Model Inference

In Section 4.2.4, we have introduced the basic decoding strategies for using LLMs. As inference efficiency is critically important for the application of LLMs, we will next quantitatively analyze the efficiency of the inference process and also present corresponding optimization methods.

9.4.1 Analysis of Inference Efficiency

Overall, the inference process of LLMs can be divided into two stages for overhead analysis: (1) the prefill stage, which computes the states and caches the key-value tensors for the input sequence; and (2) the decoding stage, which computes the states of the newly generated tokens, updates the key-value cache (KV cache), and continuously generates tokens in an auto-regressive way until the generation process is complete [984].

Inference Efficiency Measurement. To quantitatively analyze the inference efficiency, we will next introduce two widely-used metrics for measuring inference efficiency.
• GPU performance metrics. First, we introduce the compute capability and memory bandwidth to evaluate the efficiency of a certain GPU. The compute capability of a GPU refers to the number of floating-point operations (FLOP) that it can perform per second, measured in FLOP/s. The bandwidth of a GPU refers to the amount of memory read and write operations it can perform per second, measured in byte/s. The ratio of compute to bandwidth is known as the maximum arithmetic intensity of the GPU, denoted as $I_{\max}$, which is measured in FLOP/byte. For example, the half-precision compute and bandwidth of the A100 GPU are 312 TFLOP/s and 2039GB/s, respectively. Correspondingly, its maximum arithmetic intensity is 142.51 FLOP/byte (footnote 47: https://www.nvidia.com/en-us/data-center/a100/), where 1 GB is counted as 2^30 bytes.
• Model efficiency metrics. Similarly, each operation (e.g., matrix multiplication) of the model can be measured by two corresponding metrics: the computation amount and the data transfer amount. The former refers to the total number of floating-point operations, measured in FLOPs. The latter refers to the total amount of GPU memory read and write operations, measured in bytes. Analogous to the arithmetic intensity of a GPU, the arithmetic intensity $I$ of a model operation (e.g., matrix multiplication) can be defined as the ratio of computation to data transfer, with units of FLOP/byte. When the model’s arithmetic intensity $I$ is less than the GPU’s maximum arithmetic intensity $I_{\max}$, it indicates that the maximum memory bandwidth of the GPU is lower than the speed required. Consequently, the model’s efficiency will primarily be limited by memory bandwidth, and the operation is called memory-bound. Conversely, when $I$ exceeds $I_{\max}$, it suggests that the GPU’s maximum floating-point operation speed is lower than the speed required. In this case, the model’s efficiency will mainly be constrained by the GPU’s compute capability, and the operation is called compute-bound.

Bottleneck Analysis. Based on the above analysis, we can obtain the arithmetic intensity for each operation during both the prefill and decoding stages, as shown in Tables 19 and 20, thereby better identifying the bottleneck operations in the inference process.
• Prefill stage. In the following analysis, we will still take the LLaMA (7B) model in Table 18 as an example (N = 32, D = 128, H = 4096) and assume a batch size of 8 and a sequence length of 1024 (i.e., B = 8, T = 1024). Substituting these values into Table 19, we can find that the arithmetic intensity for linear transformations (Equations ➀➃➅➇) is approximately 2730.67, for multi-head attention (Equation ➂) it is approximately 114.67, while the intensity for other operations (Equations ➁➄➆➉) is around 1. When using an A100 (80G) GPU with $I_{\max}$ = 142.51, the arithmetic intensities of the linear transformations and multi-head attention operations are all above or close to the maximum value. Given that these operations occupy the majority of the computations during the prefill stage, we can conclude that the prefill stage is actually compute-bound.
• Decoding stage. Similarly, substituting these values into the arithmetic intensity formulas in Table 20 for the decoding stage reveals that the arithmetic intensities of the linear transformations and multi-head attention are all below 8, which is much lower than the A100 GPU’s maximum intensity 142.51. This indicates that the decoding stage is constrained by the GPU’s data transfer speed (i.e., memory-bound), a problem commonly referred to as the memory wall. The analysis indicates that inefficiencies in LLM inference primarily occur during the decoding stage.

9.4.2 System-level Optimization

To mitigate the memory wall issue, an intuitive idea is to reduce the data transfer operations as much as possible, thereby

TABLE 19: The computation, data transfer, and arithmetic intensity during the prefill stage. We use the asymptotic notation
O to denote the complexity of data transfer amount, where the constant factor of the complexity is related to the specific
implementation method. Table source: [983].

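Equations | Computation | Data transfer | Arithmetic intensity
➀ $Q, K, V = XW^{Q,K,V}$ | $6BTH^2$ | $O(BTH + H^2)$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{BT}}\right)$
➁ $Q, K = \mathrm{RoPE}(Q, K)$ | $6BTH$ | $O(BTH)$ | $O(1)$
➂ $O = \mathrm{Attn}(Q, K, V)$ | $4BT^2ND + 4BT^2N$ | $O(BT^2N + BTND)$ | $O\left(\frac{1 + \frac{1}{D}}{\frac{1}{D} + \frac{1}{T}}\right)$
➃ $X = OW^{O}$ | $2BTH^2$ | $O(BTH + H^2)$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{BT}}\right)$
➄ $X = \mathrm{Add\&Norm}(X)$ | $5BTH$ | $O(BTH + H)$ | $O\left(\frac{1}{1 + \frac{1}{BT}}\right)$
➅ $G, U = X[W^{G}, W^{U}]$ | $4BTHH'$ | $O(BTH + BTH' + HH')$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{H'} + \frac{1}{BT}}\right)$
➆ $D = \mathrm{Swish}(G) \cdot U$ | $2BTH'$ | $O(BTH')$ | $O(1)$
➇ $X = DW^{D}$ | $2BTHH'$ | $O(BTH + BTH' + HH')$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{H'} + \frac{1}{BT}}\right)$
➈ $X = \mathrm{Add\&Norm}(X)$ | $5BTH$ | $O(BTH + H)$ | $O\left(\frac{1}{1 + \frac{1}{BT}}\right)$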
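As a rough numerical check of the prefill-stage formulas in Table 19, the following Python sketch plugs in the LLaMA (7B) setting used in the text (B = 8, T = 1024, H = 4096, D = 128). The function names are illustrative and not taken from any library, the constant factors hidden by the O-notation are ignored, and the A100 limit is computed under the assumption that 1 GB counts as 2^30 bytes, which reproduces the quoted value of 142.51 FLOP/byte.

```python
# Sketch only: prefill arithmetic intensity from the Table 19 formulas,
# with the constant factors of the O-notation dropped.

def linear_intensity(h: int, tokens: int) -> float:
    """1 / (1/H + 1/(BT)): projections such as Equations 1 and 4."""
    return 1.0 / (1.0 / h + 1.0 / tokens)


def attention_intensity(d: int, t: int) -> float:
    """(1 + 1/D) / (1/D + 1/T): multi-head attention, Equation 3."""
    return (1.0 + 1.0 / d) / (1.0 / d + 1.0 / t)


def bound(intensity: float, i_max: float) -> str:
    """Classify an operation by comparing its intensity with the GPU limit."""
    return "compute-bound" if intensity >= i_max else "memory-bound"


# A100 figures quoted in the text. Assumption: 1 GB = 2**30 bytes.
I_MAX = 312e12 / (2039 * 2**30)

B, T, H, D = 8, 1024, 4096, 128
lin = linear_intensity(H, B * T)
att = attention_intensity(D, T)

print(f"I_max     = {I_MAX:.2f} FLOP/byte")
print(f"linear    = {lin:.2f} ({bound(lin, I_MAX)})")
print(f"attention = {att:.2f} ({bound(att, I_MAX)})")
```

Under these assumptions the linear projections land far above the A100 limit, while attention falls somewhat below it, which is consistent with the text describing it as close to the maximum value.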

TABLE 20: The computation, data transfer, and arithmetic intensity during the decoding stage. Table source: [983].

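Equations | Computation | Data transfer | Arithmetic intensity
➀ $q, k, v = XW^{QKV}$ | $6BH^2$ | $O(BH + H^2)$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{B}}\right)$
➁ $q, k = \mathrm{RoPE}(q, k)$ | $6BH$ | $O(BH)$ | $O(1)$
➂ $K, V = \mathrm{Cache}(k, v)$ | - | $O(BTND)$ or $O(BND)$ | -
➃ $o = \mathrm{Attn}(q, K, V)$ | $4BTND + 4BTN$ | $O(BTN + BTND + BND)$ | $O\left(\frac{1 + \frac{1}{D}}{1 + \frac{1}{D} + \frac{1}{T}}\right)$
➄ $X = oW^{O}$ | $2BH^2$ | $O(BH + H^2)$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{B}}\right)$
➅ $X = \mathrm{Add\&Norm}(X)$ | $5BH$ | $O(BH + H)$ | $O\left(\frac{1}{1 + \frac{1}{B}}\right)$
➆ $g, u = X[W^{G}, W^{U}]$ | $4BHH'$ | $O(BH + BH' + HH')$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{H'} + \frac{1}{B}}\right)$
➇ $d = \mathrm{Swish}(g) \cdot u$ | $2BH'$ | $O(BH')$ | $O(1)$
➈ $X = dW^{D}$ | $2BHH'$ | $O(BH + BH' + HH')$ | $O\left(\frac{1}{\frac{1}{H} + \frac{1}{H'} + \frac{1}{B}}\right)$
➉ $X = \mathrm{Add\&Norm}(X)$ | $5BH$ | $O(BH + H)$ | $O\left(\frac{1}{1 + \frac{1}{B}}\right)$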
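The decoding-stage formulas in Table 20 can be checked in the same spirit. The sketch below is a minimal illustration rather than a profiling tool: the helper names are hypothetical, constant factors are dropped, and the A100 limit again assumes 1 GB = 2^30 bytes. It also searches for the smallest batch size at which a linear projection would reach the GPU limit, which is one way to see why larger decoding batches raise arithmetic intensity.

```python
# Sketch only: decoding arithmetic intensity from the Table 20 formulas,
# with the constant factors of the O-notation dropped.

def decode_linear_intensity(h: int, b: int) -> float:
    """1 / (1/H + 1/B): projections such as Equations 1 and 5."""
    return 1.0 / (1.0 / h + 1.0 / b)


def decode_attention_intensity(d: int, t: int) -> float:
    """(1 + 1/D) / (1 + 1/D + 1/T): attention over the KV cache, Equation 4."""
    return (1.0 + 1.0 / d) / (1.0 + 1.0 / d + 1.0 / t)


# A100 figures quoted in the text. Assumption: 1 GB = 2**30 bytes.
I_MAX = 312e12 / (2039 * 2**30)

H, D, T = 4096, 128, 1024
lin_b8 = decode_linear_intensity(H, 8)
att = decode_attention_intensity(D, T)

# Smallest batch size whose projection intensity reaches the GPU limit.
min_batch = next(
    b for b in range(1, H + 1) if decode_linear_intensity(H, b) >= I_MAX
)

print(f"linear (B=8) = {lin_b8:.2f} FLOP/byte")
print(f"attention    = {att:.2f} FLOP/byte")
print(f"smallest B with linear intensity >= I_max: {min_batch}")
```

Note that the attention formula in Table 20 does not contain B, so in this simplified model a larger batch raises the intensity of the linear projections only.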

enhancing the arithmetic intensity. In this part, we will introduce several system-level optimization methods to achieve the reduction in data transfer.

FlashAttention and Flash-Decoding. The FlashAttention method discussed in Section 9.3.3 can also be applied at the prefill stage, as it reduces data transfer operations and effectively increases arithmetic intensity. However, this optimization technique is not directly applicable during the decoding stage, where only the current query vector needs to be computed with the KV cache matrices. To further optimize the decoding process, Flash-Decoding [985] has been proposed based on FlashAttention, particularly for long sequences, which shares a similar idea with sequence parallelism. Specifically, Flash-Decoding splits the KV cache into smaller chunks, allowing the computation of the query vector with these chunks in parallel, thereby improving the decoding efficiency.

PagedAttention. PagedAttention [304] focuses on optimizing KV cache and attention computation, significantly reducing data transfer operations in these two aspects. In KV cache concatenation, traditional methods often need to allocate new GPU memory for each concatenation, copying the original KV cache and the new hidden states into the newly allocated memory. This process leads to repeated memory read-write operations and substantial memory fragmentation. PagedAttention addresses this issue by introducing a memory paging management method, preallocating several blocks of memory for future KV caches, which can largely reduce the memory allocation operations during concatenation. Additionally, PagedAttention optimizes the attention computation by increasing the parallelism. It uses operator fusion to parallelize the computation of the query vector with multiple KV cache chunks, thereby enhancing the computational efficiency.

Batch Management Optimization. Batch management optimization aims to increase the batch size during the decoding stage to enhance arithmetic intensity. A representative method is continuous batching, proposed by vLLM [304]. Unlike traditional fixed-length batch processing, this technique breaks down each request into a prefill iteration and several single-step decoding iterations, and continuous batching further employs heuristic algorithms to select requests for prefill or single-step decoding iteration. This

fine-grained batching mechanism allows for handling more requests simultaneously, which has the same effect as increasing the batch size. Furthermore, DeepSpeed-MII [986] introduces Dynamic SplitFuse, which splits the prefill stage into multiple iterations and allows simultaneous prefill and decoding in one computation, resulting in larger batches and higher inference throughput.

9.4.3 Algorithm-level Optimization

In addition to system-level optimization methods, existing research work has proposed a series of improvements for autoregressive inference algorithms aimed at enhancing inference efficiency. This part introduces four typical inference optimization algorithms.

Speculative Decoding. Intuitively, the generation steps in language modeling have varied difficulty levels. For example, predicting the next word of “The founder of Microsoft is” may be more challenging than predicting the next word of “The founder of Microsoft is Bill”. Even a small model may successfully predict the answer in the latter case. Based on this idea, speculative decoding [987, 988] has been proposed to accelerate the inference speed. Specifically, it employs a relatively smaller yet more efficient model (such as an n-gram statistical model or a small pre-trained model) to autoregressively generate several tokens. A larger model then verifies these tokens, determining whether each token is the top-ranked prediction at each generation step. The small and large models iteratively repeat this process until decoding is complete. Speculative decoding can lead to a notable 2× to 3× speedup without compromising the generation quality. Researchers further suggest several variants to improve the efficiency of this approach, such as a learning-based method to combine several small models [989] and a stage-wise acceleration which employs an even smaller model to accelerate the small model first [990].

Cascade Inference. Cascade inference optimizes the inference efficiency by addressing requests of varying difficulty with models of different scales. FrugalGPT [991] introduces a series of models arranged by efficiency from high to low, sequentially processing a request through these models. A specially trained binary classification model then evaluates whether the generated result meets the task requirements. If the result is deemed reliable, subsequent models would be bypassed, thus improving the inference speed. This strategy can be applied to various open-source models and commercial APIs, allowing for the flexible adjustment of the classification threshold to balance inference efficiency and generation quality according to specific needs. For reasoning tasks, researchers [992] further propose to utilize the self-consistency [429] of generated answers to evaluate the quality of the small model: the large model is employed for generation only when the small model’s answers exhibit a low consistency.

Non-autoregressive Decoding. Existing decoding methods predominantly adopt the autoregressive mechanism, generating tokens one by one, which is a primary reason for lower inference efficiency. Therefore, non-autoregressive decoding [993] has been proposed, which generates all tokens based on the input at once. However, the generation quality of this method still largely lags behind autoregressive methods. To improve the quality of the generated text, several studies attempt to combine both decoding methods, proposing semi-autoregressive decoding methods [994] that generate a group of tokens (e.g., 3 to 10 tokens) at each step and use these tokens as input to generate the next group. However, existing mainstream LLMs are pre-trained to predict the next token, making direct non- or semi-autoregressive generation infeasible. To address this, Medusa [995] trains two additional prediction heads on the Vicuna model to predict the second and third tokens respectively, thereby achieving the generation of three tokens simultaneously. However, due to the decreased generation quality, these methods have been rarely used directly in practice, but are more often combined with other methods (e.g., speculative decoding) to accelerate the inference process of LLMs. For instance, after Medusa generates three tokens in parallel, the original Vicuna model would still be employed to verify the generation quality.

Early Exit. It has been found that in multi-layer Transformer models, it may not be necessary to perform the computation through all layers to reliably predict the next token [996]. Based on this idea, several studies [996, 997] have proposed improved generation methods based on early exit. During model decoding, when the conditions for early exit are satisfied, the model can directly use intermediate computation results from certain layers to generate tokens, thereby improving the inference efficiency. To determine the exit condition, prediction confidence [997] or the entropy [996] of the next token’s generation probability distribution can be used as reference measures. More recently, mixture-of-depths [998] has proposed to dynamically adjust the computation load of each layer. Similar to MoE networks, the mixture-of-depths method calculates a score for each layer’s input via a routing network. If the score exceeds a preset threshold, the layer would be computed; otherwise, the layer would be skipped. Unlike traditional early exit mechanisms that skip all subsequent layers, the mixture-of-depths method selectively skips certain layers, which can adaptively utilize the characteristics of different layers during generation.

9.5 Model Compression

Due to the huge number of model parameters, LLMs take a significant memory footprint for inference, making it very costly to deploy them in real-world applications [999]. In this section, we focus on how to reduce the memory footprint of LLMs via technical approaches. In particular, we will primarily introduce the model quantization approach, and also briefly discuss other model compression methods, e.g., model pruning and distillation.

9.5.1 Quantization Methods

There are generally two major model quantization approaches, namely quantization-aware training (QAT) (requiring additional full model retraining) and post-training quantization (PTQ) (requiring no model retraining). Compared with small-sized language models, two major differences need to be considered when designing or selecting quantization methods for LLMs. Firstly, LLMs consist of a huge number

of parameters, and thus PTQ methods are preferred due to a much lower computational cost than QAT methods. Secondly, LLMs exhibit very different activation patterns (i.e., large outlier features), and it becomes more difficult to quantize LLMs, especially hidden activations. Next, we will briefly review several representative PTQ methods⁴⁸ for LLMs.

Background for Quantization. In this part, we present a general introduction of quantization techniques for neural networks. In neural network compression, quantization often refers to the mapping process from floating-point numbers to integers [1000], especially the 8-bit integer quantization (i.e., INT8 quantization). For neural network models, there are typically two kinds of data to be quantized, namely weights (model parameters) and activations (hidden activations), which are originally represented in floating-point numbers. To illustrate the essential idea of model quantization, we introduce a simple yet popular quantization function: $x_q = R(x/S) - Z$, which transforms a floating-point number $x$ into a quantized value $x_q$. In this function, $S$ and $Z$ denote the scaling factor (involving two parameters $\alpha$ and $\beta$ that determine the clipping range) and zero-point factor (determining symmetric or asymmetric quantization), respectively, and $R(\cdot)$ denotes the rounding operation that maps a scaled floating value to an approximate integer. As the reverse process, dequantization recovers the original value from the quantized value accordingly: $\tilde{x} = S \cdot (x_q + Z)$. The quantization error is calculated as the numerical difference between the original value $x$ and the recovered value $\tilde{x}$. The range parameters $\alpha$ and $\beta$ have a large impact on the quantization performance, and often need to be calibrated according to real data distributions, in either a static (offline) or dynamic (runtime) way. For more details, we refer the readers to the excellent survey [1000] about quantization methods on neural networks.

Post-Training Quantization (PTQ). We first introduce the PTQ methods for LLMs.
• Mixed-precision decomposition. As found in [1001], extremely large values would occur in hidden activations (called the emergence of outliers) when the model size reaches 6.7B parameters or above. These outliers significantly influence the data distribution ranges of the hidden activations, making it challenging to conduct effective model quantization. To reduce the quantization error, a straightforward method is to separately process the outliers and the rest of the weight values. Specifically, LLM.int8() [1001] has observed that these outliers are mainly distributed in certain feature dimensions at Transformer layers. Based on this finding, a vector-wise quantization approach is proposed to separate the outliers and the rest in matrix multiplication.
• Salient weights protection. For Transformer based language models, there often exists a subset of weight values that are more sensitive to quantization, which are also referred to as salient weights [1002]. Unlike activation outliers, which occur dynamically during inference and may require complex runtime handling, weight outliers are static and can be pre-processed before model deployment. By identifying and preserving these salient weights, the error associated with model quantization can be effectively reduced. In existing literature, various methods have been proposed to detect these salient weights. For instance, PB-LLM [1003] utilizes the magnitude of weights for finding critical weights, SpQR [1004] categorizes the outliers in weights into small groups by investigating the structural patterns, APTQ [1005] employs the Hessian trace as a sensitivity metric, and OWQ [1006] selects the top sensitive columns based on both the Hessian matrix and weight perturbations.
• Fine-grained quantization. For Transformer models, weights and activations are usually represented in the form of tensors. A straightforward approach is to use coarse-grained quantization parameters for the whole tensor (i.e., per-tensor quantization) [1007]. However, it usually leads to inaccurate reconstruction results. Thus, fine-grained methods are proposed to reduce the quantization error. ZeroQuant [1008] adopts a token-wise quantization approach with dynamic calibration for compressing activations, whereas for weights (which are easier to quantize), it uses a group-wise quantization. In practice, a group size of 128 [1002, 1008] is commonly used for model quantization.
• Balancing the quantization difficulty. Considering that weights are easier to quantize than activations, SmoothQuant [1007] proposes to migrate the difficulty from activations to weights. Specifically, they incorporate a scaling transformation to balance the difficulty between weights and activations in a linear layer: $Y = (X \mathrm{diag}(s)^{-1}) \cdot (\mathrm{diag}(s) W)$. By introducing a mathematically equivalent transformation, this formula controls the quantization difficulty through the scaling factor $s$. To set $s$, it incorporates a migration strength parameter $\alpha$ to balance the difficulties, where each entry $s_j = \max(x_j)^{\alpha} / \max(w_j)^{(1-\alpha)}$ is determined by the migration strength.
• Layerwise quantization. This approach finds optimal quantized weights that minimize a layerwise reconstruction loss: $\arg\min_{\widehat{W}} \| WX - \widehat{W}X \|_2^2$. To efficiently optimize this objective, GPTQ [1009] improves the original optimal brain quantization (OBQ) [1010] method by fixing the quantization order of weights for all rows. Further, with specially designed methods (i.e., lazy batch-updates and Cholesky reformulation), GPTQ makes it feasible to quantize very large models (e.g., 175B OPT) in 3 or 4 bit precision. More recently, AWQ [1002] further simplifies the optimization form by incorporating activation-aware scaling for weights, which resembles the idea of SmoothQuant [1007]: weights corresponding to outlier activations are more important to be precisely quantized. It does not directly optimize the reconstruction loss, but instead performs simple hyper-parameter search to achieve the minimal loss on calibration data.

These strategies in the above methods can be jointly used to improve the quantization performance. In order to achieve high-efficiency implementation, quantization methods also rely on hardware- or system-level support (e.g., efficient GPU kernels or hardware-friendly group partition).
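As a minimal sketch of the ideas reviewed above (not the implementation of any cited method), the following NumPy code applies the quantization function x_q = R(x/S) − Z and its dequantization, compares per-tensor with group-wise scaling on a tensor containing one outlier, and checks that a SmoothQuant-style rescaling leaves the layer output unchanged. Several choices here are assumptions made for illustration: symmetric max-abs scaling with Z = 0, the use of absolute values when computing s_j, the function names, and the toy tensors.

```python
import numpy as np

def quantize(x, scale, zero_point=0, bits=8):
    """x_q = R(x / S) - Z, clipped to the signed integer range for `bits`."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(x / scale) - zero_point, -qmax - 1, qmax).astype(np.int32)

def dequantize(x_q, scale, zero_point=0):
    """x_tilde = S * (x_q + Z)."""
    return scale * (x_q + zero_point)

def fake_quant(x, bits=8, group_size=None):
    """Quantize then dequantize a 1-D array with symmetric max-abs scaling (Z = 0).

    group_size=None uses one scale for the whole tensor (per-tensor);
    otherwise every consecutive block of `group_size` values gets its own scale.
    """
    qmax = 2 ** (bits - 1) - 1
    groups = x.reshape(1, -1) if group_size is None else x.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero for all-zero groups
    return dequantize(quantize(groups, scale, bits=bits), scale).reshape(x.shape)

def smoothquant_scales(X, W, alpha=0.5):
    """s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) for each input channel j."""
    return np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

# Per-tensor vs. group-wise scaling on a toy tensor with a single outlier.
x = np.linspace(-1.0, 1.0, 256)
x[0] = 100.0
err_tensor = np.mean((x - fake_quant(x)) ** 2)
err_group = np.mean((x - fake_quant(x, group_size=128)) ** 2)
print("per-tensor MSE:", err_tensor, "group-wise MSE:", err_group)

# SmoothQuant-style migration: Y = (X diag(s)^-1) (diag(s) W) equals X W.
X = np.array([[1.0, -2.0, 0.5, 60.0],
              [0.5, 1.5, -1.0, -45.0]])      # channel 3 is an activation outlier
W = np.array([[0.5, -1.0],
              [1.0, 0.2],
              [-0.3, 0.8],
              [0.1, -0.2]])
s = smoothquant_scales(X, W, alpha=0.5)
X_s, W_s = X / s, W * s[:, None]
print("max |X| before:", np.abs(X).max(), "after:", np.abs(X_s).max())
```

In this toy setting the per-tensor scale is dominated by the single outlier, so the group that does not contain it is reconstructed far more accurately under group-wise scaling, while the rescaling shrinks the activation range without changing the product XW.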
48. Since we mainly focus on discussing quantization methods in the context of LLMs, the line of quantization work on small-sized language models (e.g., BERT) has not been included in this survey.

Other Quantization Methods. In the above, we mainly focus on PTQ methods, and next introduce two recent studies

that explore efficient fine-tuning methods or QAT methods for quantizing LLMs.
• Efficient fine-tuning enhanced quantization. For post-training quantization, direct low-bit quantization (e.g., INT4 quantization) often results in large performance degradation. To overcome this challenge, QLoRA [1011] incorporates additional small tunable adapters (16-bit precision) into the quantized models, to achieve an efficient, high-precision model fine-tuning. It combines the merits of LoRA (see Section 5.3.1) and quantization methods. The experiment results show that 4-bit quantized models can achieve the full 16-bit fine-tuning performance by QLoRA.
• Quantization-aware training (QAT) for LLMs. A recent study [1012] explores the effect of QAT methods by applying a data-free distillation method to compress the weights, activations as well as key-value cache. By conducting extensive experiments based on LLaMA, they show promising results with 4-bit quantization on both weights and key-value cache, but not on 4-bit activation quantization, which still needs more exploration.

Empirical Analysis and Findings. Quantization has currently become a common technique to reduce the memory footprint and latency of LLMs in deployment. In particular, it is important to understand what level of precision (e.g., INT8 or INT4) can be applied to quantize different parts of LLMs (e.g., weights or activations), while retaining a high accuracy. In this part, we first summarize the major findings about the quantization of LLMs in existing literature, and then present some empirical analysis with quantization experiments.
• INT8 weight quantization can often yield very good results on LLMs, while the performance of lower precision weight quantization depends on specific methods [1002, 1007, 1009, 1013]. In most cases, INT8 weight quantization can be effectively applied to reduce the memory footprint without performance degradation. For INT4 (or INT3) weight quantization, however, existing methods rely on specific strategies to reduce the performance degradation, e.g., layerwise methods [1008, 1009], activation-aware scaling [1002] and low-rank adapter tuning [1011]. Interestingly, LLMs seem to be less sensitive to low-bit weight quantization than small-sized language models [1013]. In practice, with the same memory cost, it is suggested to use a larger language model with a lower quantization precision rather than a smaller language model with a higher quantization precision. For example, a 4-bit 60B LLM is demonstrated to have better performance than an 8-bit 30B LLM [1014]. Moreover, focusing on emergent capabilities, the study [1015] finds that in-context learning, step-by-step reasoning, and instruction following all seem to be seldom affected by 4-bit weight quantization. This result suggests that INT4 quantization exhibits a favorable trade-off in terms of both total bits and performance of emergent abilities.
• Activations are more difficult to quantize than weights [1001, 1007, 1013]. It has been found that large outliers would occur for Transformer language models having a size of 6.7B or above [1001]. This issue has been one of the most fundamental difficulties in quantizing LLMs. To overcome this issue, various methods, e.g., mixed-precision decomposition [1001], fine-grained quantization [766, 1001] and difficulty migration [1007], can be applied to alleviate the influence of outlier values. Since large outliers mainly exist in the activations of LLMs, small language models are more resistant to activation quantization [1013, 1015]. In practice, high-quality INT8 activation quantization is still a difficult task, though several methods can attain satisfying results. Further, lower precision activation quantization has still not been successfully explored, even for QAT methods [1012].
• Efficient fine-tuning enhanced quantization is a good option to enhance the performance of quantized LLMs [149, 1011]. The benefits of efficient fine-tuning methods in quantization can be twofold. Firstly, it can directly compensate for the performance degradation suffered from low-bit quantization. This can be achieved either by increasing the fitting capacity via updating high precision adapters [1013, 1015, 1016], or by finding a proper low-rank initialization for LoRA fine-tuning [1017]. Secondly, it is flexible to support task-specific or goal-specific fine-tuning of LLMs in a lightweight way [1011], e.g., instruction tuning or chat-oriented tuning, by only tuning the small adapters. Overall, it makes a good trade-off between the effectiveness and training cost, which provides a promising approach to enhancing the performance of quantized LLMs.

Empirical Analysis on Quantization Experiments. To further help readers understand the impact of quantization on LLMs, we also conduct a group of experiments to investigate the inference performance of quantized models here. Specifically, we focus on the fine-tuned LLaMA models (i.e., 7B and 13B) using popular SFT datasets, including FLAN-v2 [69], Alpaca-52K [187] and ShareGPT [153]. For evaluation, we utilize the same tasks in Table 10, and follow the quantization settings in the study [1015] examining the performance of quantized language models at three precision levels: 4-bit, 8-bit and 16-bit. The results are summarized in Table 21. As can be observed from Table 21, the results obtained with 8-bit and 4-bit weight quantization are close to the performance of 16-bit models while significantly reducing memory consumption. In practice, it is recommended to first examine the performance of 4-bit weight quantization for LLMs if reducing memory usage is a critical consideration for deployment.

9.5.2 Other Model Compression Methods

In addition to model quantization, we next introduce two other model compression methods for LLMs, namely model distillation and model pruning. Unlike model quantization, model distillation and pruning aim to simplify the model architecture, thereby reducing the total number of parameters.

Distillation for LLMs. In general, model distillation aims to transfer the capabilities from a capable model (referred to as the teacher model) to a less capable model (referred to as the student model), thereby achieving the compression of the capable model. Based on whether the weights of teacher models are accessible, one can employ either the white-box approach or the black-box approach for LLM distillation. The white-box approach often employs the traditional knowledge distillation technique, which incorporates additional loss functions (i.e., distillation loss) for

TABLE 21: Evaluation results for quantized LLaMA models (7B and 13B). We employ existing model checkpoints provided by [350] for quantization experiments, which have been fine-tuned on FLAN-v2, Alpaca-52K, and ShareGPT, respectively. Specifically, we report the performance with AlpacaFarm, MMLU, and BBH, as well as the memory usage of the loaded model (Mem.). For quantization, we employ bitsandbytes to quantize the 16-bit models to 8/4 bits by specifying the commands load_in_8bit and load_in_4bit when loading the weights. It is worth noting that we select text-davinci-003 as the baseline model for the AlpacaFarm dataset.

                         |               16-bit               |               8-bit                |               4-bit
Models       SFT Dataset | AlpacaFarm   MMLU    BBH Mem.(GiB) | AlpacaFarm   MMLU    BBH Mem.(GiB) | AlpacaFarm   MMLU    BBH Mem.(GiB)
LLaMA (7B)   FLAN-v2     |       6.65  47.34  35.05     12.58 |       6.15  47.02  35.17      6.65 |       7.83  46.23  34.77      3.94
             Alpaca-52K  |      32.55  40.87  33.66     12.58 |      33.60  39.98  34.38      6.65 |      29.57  39.24  32.80      3.94
             ShareGPT    |      72.05  41.30  32.90     12.58 |      72.86  39.34  32.71      6.65 |      70.31  40.08  32.11      3.94
LLaMA (13B)  FLAN-v2     |       8.14  51.67  41.46     24.40 |       7.64  51.02  41.25     12.53 |       7.52  50.48  40.68      7.34
             Alpaca-52K  |      33.60  47.63  36.10     24.40 |      31.43  47.04  35.98     12.53 |      30.87  46.20  36.16      7.34
             ShareGPT    |      75.59  47.58  38.00     24.40 |      73.79  47.71  38.31     12.53 |      71.99  45.77  36.97      7.34
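
The memory figures in Table 21 can be sanity-checked with simple arithmetic: weight storage is roughly the parameter count times the number of bytes per parameter. The sketch below is only a lower-bound estimate under stated assumptions. It assumes approximate parameter counts of 6.7B and 13.0B for the two LLaMA models (these values are not given in the table), ignores all overhead, and does not describe how bitsandbytes actually allocates memory.

```python
GIB = 2 ** 30  # bytes per GiB

# Approximate parameter counts (assumed for illustration; not taken from Table 21).
PARAM_COUNTS = {"LLaMA (7B)": 6.7e9, "LLaMA (13B)": 13.0e9}

# Measured memory usage in GiB, copied from Table 21.
MEASURED = {
    ("LLaMA (7B)", 16): 12.58, ("LLaMA (7B)", 8): 6.65, ("LLaMA (7B)", 4): 3.94,
    ("LLaMA (13B)", 16): 24.40, ("LLaMA (13B)", 8): 12.53, ("LLaMA (13B)", 4): 7.34,
}

def weight_memory_gib(n_params: float, bits: int) -> float:
    """Lower-bound estimate: every parameter stored with exactly `bits` bits."""
    return n_params * bits / 8 / GIB

estimates = {key: weight_memory_gib(PARAM_COUNTS[key[0]], key[1]) for key in MEASURED}

for (model, bits), measured in MEASURED.items():
    est = estimates[(model, bits)]
    print(f"{model:12s} {bits:2d}-bit  estimate {est:6.2f} GiB  measured {measured:6.2f} GiB")
```

Under these assumptions the estimates fall slightly below the measured values, and the relative gap is larger at 4 bits. One plausible reading is that some tensors stay in higher precision and that quantization constants add storage, but Table 21 itself does not break this down.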

aligning the outputs or intermediate states of the student model to those of the teacher model. Based on this approach, MINILLM [1018] effectively distills the 13B LLaMA model down to a 7B model. The black-box approach [1019], on the other hand, can only make use of the textual response of the teacher model for training the student model. These studies mainly focus on utilizing the generated responses for enhancing the key capabilities from the teacher model [146, 384], such as in-context learning and chain-of-thought prompting.

Pruning for LLMs. The goal of model pruning is to minimize the number of parameters in a model while preserving its performance as much as possible. In general, model pruning methods can be categorized into two lines: structured pruning and unstructured pruning. Structured pruning aims to remove certain less important model components (e.g., neurons, channels, layers) that have minimal impact on performance. On the other hand, unstructured pruning mainly focuses on removing individual weights or connections within a neural network model without changing the model’s main structure. As for LLMs, unstructured pruning can generally lead to higher compression rates. For instance, SparseGPT [1020] achieves 60% unstructured sparsity for OPT-175B using unstructured pruning (i.e., 60% of the elements in the weights are masked), and the pruned LLM still retains a relatively low perplexity. With suitable strategies, structured pruning for LLMs can also achieve a promising model compression rate. For instance, LLM-Pruner [1021] selectively removes 20% of the non-essential parameters from LLaMA (7B) based on gradient information, while maintaining 93.6% performance of the original model. Furthermore, Sheared LLaMA [1022] introduces two techniques: targeted structured pruning and dynamic batch loading, which effectively prunes LLaMA-2 (7B) to a parameter size of 2.7B, while preserving 87.8% of the original model’s performance.

9.5.3 Open-source Libraries

In this part, we briefly introduce the available open-source libraries for memory-efficient deployment.

Quantization Libraries. Next, we introduce four popular quantization libraries for LLMs, including:
• Bitsandbytes⁴⁹ is developed based on the methods introduced in the papers of LLM.int8() [1001] and 8-bit optimizers [1023]. It focuses on the quantization of both activations and weights for LLMs, including the support on 8-bit and 4-bit (NF4, FP4) matrix multiplication for efficient inference, as well as an 8-bit optimizer for efficient training.
• GPTQ-for-LLaMA⁵⁰ is developed specifically for quantizing LLaMA models. It enables 4-bit quantization of LLaMA models of varied sizes based on the GPTQ algorithm [1009]. Also, it provides a comparison with bitsandbytes in both memory and performance (PPL) on the project website.
• AutoGPTQ⁵¹ is a quantization package developed based on the GPTQ algorithm [1009], which supports INT4 quantization for LLMs. It includes a number of quantized models in the library, and supports LoRA by integrating with the HuggingFace PEFT library.
• llama.cpp⁵² makes it feasible to run quantized LLaMA models on a MacBook device. It supports INT4, INT5 and INT8 quantization, and is developed in an efficient C/C++ implementation. It also supports a number of LLaMA based models, such as Alpaca and Vicuna.

Other Libraries. In addition, there are also libraries for supporting other model compression methods.
• Torch-Pruning⁵³ is a toolkit developed for general-purpose structural pruning, including the pruning for vision models, diffusion models and large language models. It employs a dependency graph for automatic structural pruning and supports several high-level pruners (e.g., MetaPruner and BNScalePruner).
• LLM-Pruner⁵⁴ is designed specifically for the pruning of LLMs. It enables efficient gradient-based structural pruning for LLMs with minimal training samples and training time. Currently, it supports a number of LLMs, such as Baichuan, BLOOM, and LLaMA3.

49. https://github.com/TimDettmers/bitsandbytes
50. https://github.com/qwopqwop200/GPTQ-for-LLaMa
51. https://github.com/PanQiWei/AutoGPTQ
52. https://github.com/ggerganov/llama.cpp
53. https://github.com/VainF/Torch-Pruning
54. https://github.com/horseee/LLM-Pruner

9.6 Retrieval-Augmented Generation

When dealing with real-time information or specialized domain knowledge, LLMs may struggle to generate accurate outputs solely based on their internal knowledge. To address this issue, retrieval-augmented generation (RAG) technique [1024, 1025] has been proposed by incorporating

an external knowledge source for improving the model response. This technique aims to retrieve relevant information from external sources (e.g., the internet or domain-specific knowledge bases) using an information retrieval system, thereby providing LLMs with timely or domain-relevant context to reduce the factual errors in generated content. In form, RAG can also be considered as a specific prompting strategy that integrates auxiliary information from external sources into the original prompt. Next, we will introduce the basic workflow of the retrieval-augmented generation technique and related optimization strategies.

Basic Workflow. Typically, the standard RAG procedure consists of three steps, including context retrieval, prompt construction, and response generation.
• Context Retrieval. The retrieval step primarily focuses on finding relevant context information from existing information sources that are helpful for addressing the current information need. To achieve efficient retrieval, it is often necessary to build a search index over the collection of candidate documents and then use appropriate methodologies for text retrieval. There are two commonly used retrieval approaches: lexical-based retrieval [1026] using sparse vector representations and semantic retrieval methods using dense vector representations [54]. The former tokenizes the documents and builds an inverted index based on a vocabulary, followed by retrieving relevant documents using lexical matching. The latter maps documents to low-dimensional dense vectors and then constructs an efficient index of document vectors using approximate nearest neighbor search algorithms, ranking candidate documents based on the similarity of embeddings. Both methods can often perform well for large-scale document collections and are widely used in existing RAG systems.
• Prompt Construction. After the retrieval stage returns the relevant documents, these documents need to be incorporated into the input prompt of the LLM along with the task description. The prompt should guide the model to utilize the retrieved information to complete the corresponding task. For example, a prompt could be, “Please refer to the information contained in the following documents to complete the task”. Since the retrieved documents are typically lengthy, simply concatenating them into the prompt might lead to a poor utilization of the provided context due to the biased attention (e.g., lost in the middle [949]). To address this issue, existing approaches often introduce reranking models to select the most relevant documents from the retrieval results [1027]. Alternatively, information extraction or text compression techniques can be used to retain only the highly relevant information from the documents, thereby reducing the input context length [1028, 1029].
• Response Generation. In this step, the constructed prompt is input into the LLM, enabling it to utilize the retrieved content to better accomplish the corresponding task. However, the retrieved documents may contain irrelevant information or even information that contradicts the true answer, which might affect the output generated by the LLM. To address this, the LLM can be further prompted to self-check the quality of the generated output and decide whether to re-perform the retrieval based on the new outputs [1030], or it can perform a confidence assessment to determine whether the current task requires retrieval or the use of retrieved content [662].

Improvement Strategies. In practice, factors such as the quality of retrieved documents, prompt design, and the generation method of LLMs might impact the final performance of RAG. Next, we discuss how to enhance the RAG performance by summarizing existing improvement strategies.
• Retrieval method improvement. The incorporation of retrieval supplements the LLM with relevant contextual information, and the retrieval performance directly affects the quality of the final generated response [454]. To design an effective retrieval strategy, an important factor to consider is the text granularity. Intuitively, a coarser granularity (e.g., document-level) may result in efficient retrieval but tends to incorporate substantial irrelevant information, while a finer granularity (e.g., sentence-level) increases the proportion of relevant content in the retrieval results but can lead to higher retrieval latency. To balance relevance and latency, existing research work proposes using “propositions” as the retrieval unit [1031], corresponding to semantically complete and relatively independent text fragments, which can effectively reduce the recall of irrelevant information. In particular, they mainly use GPT-4 to synthesize instruction data for the extraction of proposition text, training a smaller model specifically to construct proposition text data [1031]. Furthermore, to improve retrieval performance, methods such as query expansion and query rewriting can be utilized to optimize query formulation. Query expansion focuses on adding supplementary information to the original query, such as incorporating related entity information or providing detailed explanations of key information in the query [796], which helps strengthen relevance matching. However, traditional query expansion methods may disrupt the original semantics for complex queries. To address this issue, we can employ LLMs to decompose complex queries into several sub-queries, which are subsequently expanded individually, allowing for multi-path recall of related information [1032]. As another query enhancement technique, query rewriting focuses on modifying the query content to highlight key information and eliminate potential ambiguities, facilitating the retrieval of related documents [1033]. LLMs can be applied directly to query rewriting, transforming the original query into a more suitable form through well-designed prompts [1034]. To reduce inference overhead, the query optimization capabilities of LLMs can also be transferred to smaller models through knowledge distillation [1035].
• Retrieval results refinement. In addition to the initial retrieval methods, the refinement of retrieval results also plays an important role in RAG systems, since the retrieved documents may not be best suited for RAG systems, e.g., LLMs might have difficulty in utilizing long contexts or be affected by irrelevant information in the retrieved documents. As a solution, the documents returned during the retrieval stage can be reranked according to their relevance to the input [1036], filtering out low-quality or irrelevant documents or placing less relevant documents in non-optimal positions within the prompt. Furthermore, both generation and reranking tasks [1027] can be jointly optimized to facilitate better utilization of context documents. Additionally, LLMs

can be directly used for document re-ranking by designing specific prompts or using context examples to accomplish this task [777]. In addition to document filtering or reranking, information extraction or automatic summarization techniques can be employed to refine the retrieved content by extracting more concise and query-relevant content from the retrieved documents. Furthermore, existing research has proposed token-level compression strategies [1037], which select important tokens and remove unimportant parts from the candidate documents.
• Iterative retrieval enhancement. In some complex application scenarios, a single retrieval procedure may not suffice for RAG systems. To address this issue, we can further use iterative retrieval augmentation and adaptive retrieval augmentation. Iterative retrieval augmentation aims to iteratively refine the initial query based on the model’s generated results to achieve a comprehensive coverage of the required information. As it involves accumulating multiple rounds of retrieval information, the performance of RAG systems may be affected by redundant or conflicting information. To address this issue, a stop mechanism has been introduced for retrieval iteration, using the LLM to evaluate the confidence of the current generation results to determine whether to continue the iteration process [662]. Additionally, for more complex scenarios, iterative retrieval can be combined with the LLM’s own CoT reasoning capability. For example, intermediate results from the chain of thought can be used as the query input for the next round of retrieval, and after completing the retrieval process, the returned results can be integrated into the chain of thought. Building on the iterative retrieval augmentation method, adaptive retrieval augmentation further enhances the LLM’s autonomous use of the retrieval mechanism [1038], thereby improving the overall framework’s efficacy in using the retrieval systems. In practical implementation, for the above two types of augmentation methods, the LLM first needs to determine when to use the retriever and then utilize pre-set prompts to initiate query generation and retrieval result processing [1039].
• RAG-enhanced training. In addition to the improvement strategies mentioned above, specialized training tasks can be designed to further enhance the LLM’s ability to utilize the retrieved content, including both instruction tuning and pre-training tasks. By constructing instruction data focused on retrieval context utilization [1040], instruction tuning can improve the LLM’s ability to utilize relevant retrieval information. When curating the instruction data, it is essential to consider two important issues: positional bias and irrelevant information within the input context. Specifically, relevant documents can be placed at different positions within the prompt, which can enhance the model’s attention to relevant content in various positions and prevent the model from neglecting certain positions [949]. Additionally, irrelevant information can be added to the instruction data, so as to improve the model’s ability to resist interference from such information [1041]. In addition, special training tasks can be introduced during the pre-training stage to further enhance the LLM’s retrieval and generation capabilities [657, 1042]. Existing work mainly constructs unsupervised pre-training data aimed at retrieval augmentation. A common data construction method uses portions of the original document as queries and then trains the model to reconstruct the remaining content of the original document based on the retrieval results [1043].

9.7 Hallucination

Hallucination, which refers to the phenomenon that LLMs generate content inconsistent with factual information, has become a significant issue that greatly affects the task performance of LLMs [1044]. In this section, we focus on discussing the topic of LLM hallucination, first introducing the definition and source of hallucination and then summarizing the detection and mitigation methods.

9.7.1 Definition of Hallucination

Early research typically defines hallucinations based on the relationship between a model’s output and the given input [1045]. In this manner, hallucinations are categorized into intrinsic hallucinations where the model’s output does not match the input text and extrinsic hallucinations where the model’s output cannot be verified against the input. However, in real-world scenarios, user inputs often do not contain reference documents, and thus existing work mainly focuses on open-domain factual hallucinations, where the model-generated content does not align with or cannot be verified by existing world knowledge [1044, 1046]. According to a recent study [1044], factual hallucinations can be further categorized into the following types:
• Entity-error hallucination. This type of hallucination refers to LLMs generating text containing incorrect entities, such as names of people, dates, locations, or objects that contradict world knowledge.
• Relation-error hallucination. This type of hallucination involves LLMs generating incorrect relationships between entities, such as inaccurate quantitative or chronological connections.
• Incompleteness hallucination. LLMs may produce incomplete outputs, especially when generating lengthy or list-based responses. This hallucination arises when LLMs are asked about aggregated facts and they fail to preserve factual completeness.
• Outdatedness hallucination. This type of hallucination occurs when LLMs generate information that was accurate at a past time but is no longer correct at present. This issue typically arises because most LLMs were trained on time-limited corpora.
• Overclaim hallucination. This type of hallucination refers to cases where the statement expressed in the generated text of LLMs is beyond the scope of factual knowledge.
• Unverifiability hallucination. This hallucination refers to cases where the information produced by LLMs cannot be verified against existing information sources, making it difficult to assess its accuracy.

9.7.2 Source of Hallucination

In this part, we will discuss the potential factors that might lead to hallucination for LLMs.

Training Data. The quality of training data significantly impacts the model’s output and is a primary source of hallucinations. Further, the distribution of training data also plays a key role in shaping the behaviors of LLMs. We next

introduce the effect of training data on hallucinations from these two aspects.
• Data quality. In practice, the pre-training dataset is typically constructed by collecting diverse data from various sources. While increasing pre-training data can lead to improved model performance, low-quality data can severely damage the generation performance of large models. On the one hand, pre-training data may contain erroneous information, and the goal of training large models is to imitate and memorize the training data as much as possible. If inaccurate information frequently appears in the training data, the model may memorize and directly copy this content during generation, leading to the phenomenon known as “imitative falsehoods” [558]. On the other hand, pre-training data may contain biased content and the subjective views of its creators. Such biased content can severely affect the model’s learning of world knowledge, possibly leading to inappropriate representations.
• Data distribution. The distribution of pre-training data also significantly affects the model’s behavior. Firstly, regarding the recency factor, LLMs are typically trained on data from a limited period. As world knowledge continuously evolves, the model’s stored knowledge can become outdated, thereby likely leading to fabrications or outdated information when addressing questions beyond its knowledge scope. In terms of data composition, pre-training data may lack domain-specific knowledge, which would affect the model performance on tasks requiring specialized knowledge, such as medical or legal issues, and it will also result in significant hallucinations. Additionally, recent studies show that when addressing questions involving long-tail knowledge that appears infrequently in the training corpus, models are more likely to generate inaccurate content [1044].

Training Methods. The training process of LLMs typically includes two major stages: pre-training and post-training. Inappropriate training methods across the two stages are also likely to result in the hallucination behaviors of LLMs.
• Pre-training. Currently, the pre-training stage primarily employs the next token prediction method for model training. Recent studies [949] indicate that under the autoregressive training method, the model’s attention distribution tends to decay as the sequence length increases. This would prevent LLMs from effectively modeling long-range dependencies, potentially resulting in inference errors or hallucinations. Additionally, the teacher-forcing strategy is commonly used during the training of large models. In this approach, the correct tokens from the previous steps are used to predict the next token instead of the model output. However, during model inference, the model can only use its own generated content for subsequent predictions. This discrepancy between the training and generation phases leads to “exposure bias” [1047], which may in turn cause hallucination issues.
• Post-training. During the instruction-tuning process, existing works typically employ knowledge distillation to improve the model’s instruction-following ability. This involves using high-performance models (such as GPT-4) to generate large-scale instruction data and then fine-tuning weaker models with this data. However, these synthesized data may contain hallucinated content, which might lead to more hallucinations for the trained model. Additionally, during the human alignment process, existing training methods may also cause hallucination issues. Some research work has revealed that LLMs may cater to human responses for earning higher rewards, likely resulting in answers that do not align with factual knowledge [1048].

Response Generation. Given the input prompt, LLMs employ decoding strategies (e.g., top-k sampling in Section 4.2.4) for generating the response. In this process, the prompt formulation and the decoding strategies potentially affect the generation behaviors of LLMs.
• Prompt design. Prompting has become the primary way of using LLMs to solve downstream tasks. However, inappropriate prompt design can cause the model to overlook or misunderstand important information, leading to incorrect or irrelevant content [1044]. Recent studies have shown that the readability, format, and concreteness of user instructions would impact the model’s output [1049]. For instance, the use of complex words or long phrases in the prompt reduces the readability, which makes it more difficult for LLMs to understand the real intentions of the user instruction, thereby increasing the chance of hallucination. Additionally, non-standard expressions or abstract concepts can also exacerbate hallucinations.
• Decoding strategy. To improve the diversity of the generated content, multiple random sampling strategies are introduced (e.g., top-k sampling, top-p sampling). However, increasing diversity also brings a higher likelihood of generating hallucinated content. For example, increasing the temperature t (Equation 10) will result in a more uniform token probability distribution, which potentially leads to more hallucinations, since low-frequency yet irrelevant words would be assigned a higher probability for generation in this setting.

9.7.3 Hallucination Detection
To effectively detect the hallucinated content, existing work mainly adopts three approaches, namely model-based, uncertainty-based and tool-based methods.

Model-based Methods. Due to their powerful language capabilities and rich world knowledge, existing work extensively adopts powerful LLMs to detect hallucinations from the model-generated text. In this approach, hallucination detection can be considered as a normal text task that requires prompt formulation. To facilitate the research in this line, HaluEval [604] introduces a comprehensive dataset of model-generated and human-annotated hallucinated samples to evaluate how well LLMs can identify such instances, and they empirically show that specific prompting strategies such as CoT can effectively improve the model’s accuracy in detecting hallucinations. Furthermore, research work proposes to decompose the hallucination detection into two subtasks: first, extract factual statements, and then assess whether each statement is hallucinated or not [1044, 1050].

Uncertainty-based Methods. Recent studies suggest that the occurrence of hallucinations in LLMs may be related to the uncertainty of their outputs [1051]. Based on such

assumptions, a series of works propose detecting hallucinations by assessing the uncertainty of model-generated content. Some research work focuses on the internal features of LLMs, such as token probability and logits. For key concepts in the generated text, a lower token probability indicates a higher uncertainty, which represents an increased likelihood of hallucination [1052]. Other research efforts evaluate the uncertainty by examining the consistency of the models’ responses. For instance, SelfCheckGPT [1051] lets LLMs answer the same questions multiple times to judge whether the generated answers are consistent or not. Another alternative way requires LLMs to reconstruct the input questions based on the responses and then check the consistency between the generated and original questions [1053].

Tool-based Methods. LLMs can detect hallucinations by calling external tools to verify the model-generated content. Typically, the model’s output contains various segments of factual knowledge, which can be broken down into fine-grained factual statements. FActScore [1054] refers to knowledge sources like search engines to verify these statements. FacTool [1055] further proposes to use a series of external verification tools such as calculators and code interpreters to check different types of text. In addition, HaluAgent [1056] proposes an agent framework to employ smaller open-source models for hallucination detection. With the assistance of tools like search engines and calculators, HaluAgent enables 7B-size models to achieve performance comparable to GPT-4 in hallucination detection.

9.7.4 Hallucination Mitigation
In practice, it is essential to effectively mitigate the hallucination behaviors of LLMs, to provide accurate and helpful responses. In this part, we will introduce several widely-used approaches for alleviating the hallucination, including human alignment, retrieval-augmented generation and improved decoding strategy.

Human Alignment. Hallucination mitigation is closely related to the honesty criterion in “3H” standards for human alignment, and various alignment methods like RLHF can be adopted to mitigate the model hallucination. HaluEval 2.0 [1044] proposes to first collect hallucinated and non-hallucinated responses to train a reward model, and then fine-tune the LLM with the reward model’s feedback using RL algorithms. However, recent research shows that human preference data may lead LLMs to exhibit sycophantic behavior [1057], where models prioritize catering to human demands over maintaining truthfulness. Some work proposes to refine the annotation process of preference data, such as by aggregating multiple human preferences to improve feedback quality [1057] or fine-tuning LLMs on prompts where the truthfulness of a claim is independent of the user’s opinion [1058].

Retrieval-Augmented Generation. Providing LLMs with highly reliable external knowledge as context can help reduce hallucinations. RARR [1059] first generates multiple questions about the generated text, then retrieves web pages from Google Search as evidence, and finally, an editing model is employed if any disagreement is detected between the evidence and the generated text. LLM-Augmenter [661] further expands the knowledge source to local databases, devising an agent framework to retrieve, consolidate, and generate feedback to the LLM for the final answer. Other research explores placing the retrieval process at different positions relative to the generation process. Verify-and-Edit [1060] proposes to perform the retrieval procedure after the generation process, allowing the original answer to be edited based on the retrieved documents. Furthermore, to help LLMs better handle complex tasks, IRCoT [1061] interleaves the knowledge retrieval process with CoT generation, where the retrieved documents guide the LLM in generating additional reasoning steps and CoT sentences assist in retrieving more relevant and diverse documents.

Improved Decoding Strategy. In addition to the above methods, hallucinations can also be mitigated by using improved decoding strategies. Typically, the internal states or knowledge of LLMs themselves can be exploited to reduce the hallucinations. DoLa [317] proposes that the lower layers of LLMs tend to assign higher probabilities to syntactically plausible words, while higher layers encode more factual knowledge. Therefore, DoLa devises a contrastive decoding strategy by subtracting the lower layers’ logits from the last layer’s logits and using the results for next-token prediction. ITI [1062] finds that specific attention heads show high linear probing accuracy and regards their activation as truth-correlated directions. During inference, certain heads’ activations would be shifted along these pivot directions. Some other work introduces external knowledge sources to aid the decoding process. CAD [1063] provides LLMs with extra context about the query, and then contrasts the output probabilities with those obtained without using context, thereby adjusting the influence of the model’s prior knowledge. KCTS [878] applies an auxiliary knowledge classifier on top of the LLM to detect hallucinations, and uses its knowledge faithfulness score to reweight the token distribution.

9.8 Complex Reasoning
In this section, we introduce a new reasoning paradigm for LLMs aimed at solving complex tasks by allocating more time to thinking before responding to a problem, i.e., conducting complex reasoning. Specifically, we focus on long chain-of-thought (CoT) reasoning⁵⁵, which is the mainstream approach taken by recent large reasoning models, such as OpenAI’s o-series models. We will begin by providing an overview of long CoT reasoning, then introduce the construction of long CoT data and the corresponding training methods, and finally discuss more general test-time scaling methods.

55. The phrase “long CoT” may not be conceptually precise since the model’s thought process could be tree- or graph-structured rather than strictly linear. We use this terminology in line with OpenAI’s introduction of the o1 model, which generally refers to extended thought processes for complex reasoning.

9.8.1 Overview and Analysis
Generally, long CoT reasoning is a method to search for solutions within the natural language space, as reflected in the output responses of LLMs. This approach is akin to the slow thinking mode of the human brain [1064], which takes significantly more time to think through difficult problems

compared to the fast thinking mode used for simpler ones. This subsection will first qualitatively analyze the reasoning patterns and then briefly discuss the main advantages of this reasoning mode.

Fig. 19: Examples of long CoT reasoning from DeepSeek-R1 (accessed on January 25, 2025). Grey fonts denote the thought part of the model output, and italic fonts denote the final answer.

Reasoning Patterns Analysis. As demonstrated in Fig. 19, existing long CoT reasoning models typically generate an extended thought process (in grey) before arriving at the final answer (in italic). It is crucial to understand how this thought process is conducted and the types of reasoning patterns generated by LLMs during problem-solving. To provide an intuitive understanding of this reasoning process, we present two examples from the DeepSeek-R1 model.

In the first example, we present a mathematical problem to the model, and the corresponding long CoT can be observed in the reasoning portion of the response. The thought process here is informal and flexible, while showcasing a systematic exploration of the solution within the natural language space. Concretely, the model follows a complete reasoning process consisting of action steps like “factorize 196” and “take the exponents”. Notably, the thought process naturally includes trigger keywords like “double check” and “wait”, which invoke the corresponding verification or reflection actions. In the second example, we ask the model which Chinese city has the largest population. Interestingly, it exhibits similar thought patterns, even though the question could be addressed in a more compact and straightforward manner. The model generates a comprehensive reasoning process with actions such as “confirm the latest data” and “clarify”, with trigger keywords like “make sure” and “avoid”.

To gain a more comprehensive understanding of this complex reasoning mode, some research has further analyzed the reasoning patterns exhibited in the o1 model [1065]. These studies, based on empirical investigation, have identified several key reasoning patterns, including systematic analysis, method reuse, divide-and-conquer, self-refinement, context identification, and constraint emphasis. Additionally, the use of these reasoning patterns varies across different tasks, significantly enhancing cognitive processes compared to standard CoT reasoning.

Reasoning Advantages. Unlike standard CoT reasoning, this approach does not enforce a linear reasoning chain. Instead, it integrates various reasoning actions and strategies, such as reflection and backtracking, into a single response. Overall, it has two major advantages compared to the standard CoT method or direct prompting methods.

Firstly, due to the autoregressive nature, the standard generation paradigm of LLMs is a “one-time” reasoning process. This means that if the generated solution contains obvious mistakes, or even if LLMs are aware of other promising solutions, there are no opportunities for refinement or verification. This issue becomes more pronounced in complex reasoning tasks, where the search space is much larger, preventing LLMs from fully exploring it [1066]. In contrast, long CoT reasoning mitigates this problem by allowing the model to autonomously check and revise its attempts, thus enabling more effective reasoning.

Secondly, this text-based reasoning process can, in principle, emulate various search algorithms that rely on more complex search structures. For example, to represent a tree-structured search space, one might employ a textual process that combines forward exploration with backward revisits, incorporating necessary reflection and verification steps along the way. Consequently, long CoT reasoning can replicate the effects of previously introduced methods like tree-of-thought (ToT) and graph-of-thought (GoT). However, this capability is not inherently present in the LLM; it emerges in a manner similar to the standard CoT ability, developing through appropriate training (see Section 9.8.3).

Overall, long CoT represents a significantly different reasoning mode compared to the standard CoT method, facilitating search algorithms within the natural language space of LLMs. It emphasizes how to navigate correct paths through a trial-and-error approach, typically incorporating critical reasoning actions such as planning, evaluation, reflection, and exploration. In contrast, short CoT data typically presents a direct solution process in which all reasoning steps are expected to be correct.

9.8.2 Construction of Long CoT Data
To guide LLMs in producing long-form reasoning followed by solutions, it is crucial to curate high-quality long CoT data for warming up or training the models. While human annotators can construct extended CoT data, this process is costly and requires professional expertise for challenging problems, making it difficult to scale. Consequently, existing studies often develop various methods for automatically constructing long CoT data, such as distillation from more advanced models, search-based data synthesis, and multi-agent collaboration, which are detailed below.

Long CoT Data Distillation. Benefiting from the openness of o1-like LLMs endowed with powerful reasoning

capabilities, the leading approach to curating high-quality long CoT data involves using open models or APIs for data synthesis. The basic idea is to first construct a set of prompts (i.e., problems) and then feed them into the teacher model to collect the corresponding long CoT response data. Specifically, STILL-2 [1067] utilizes two slow-thinking systems, i.e., DeepSeek-R1-Lite-Preview [1068] and QwQ-32B-preview [1069], for distillation to construct a dataset of long-form thought processes. A key finding is that length distribution is a critical factor in determining the quality of long CoT data. They suggest that length directly reflects the difficulty of prompt problems, with mathematical problems being particularly important to collect, as they often involve extensive thought processes in their solutions. The research shows that even a small amount of carefully curated long CoT data can effectively activate the slow-thinking mode in LLMs. Furthermore, this effect is corroborated by the work on DeepSeek-R1 [1070], which demonstrates that training with distilled data from DeepSeek-R1 consistently enhances the performance of multiple Qwen and Llama models.

Search-based Data Synthesis. Search algorithms like Monte Carlo Tree Search (MCTS) [382] have been widely applied to synthesizing long-form reasoning data. As a representative technique, MCTS integrates the principles of tree exploration and random simulation to estimate potential outcomes of actions, making it particularly effective for decision-making tasks with large action spaces. In complex problem-solving, MCTS decomposes the process into multi-step generation, with each node at a specific tree layer representing a step in the solution [1071]. At each step, an LLM, serving as the policy model, samples several candidate nodes, each generating a one-step CoT. MCTS extensively uses rollouts to automatically assign a Q-value to each intermediate step based on its contribution: steps potentially leading to more trajectories that correctly solve the problem receive higher Q-values. After iterating through multiple steps to successfully address the problem, the complete reasoning trajectories from the root node to the terminal node can be viewed as long-form CoT data, where intermediate nodes represent either correct reasoning steps or trial-and-error attempts.

Multi-Agent Collaboration. Beyond relying on a single model, an alternative approach to generating long CoT data is to construct a multi-agent framework [1072] in which several models collaborate or debate to produce long-form reasoning data.

The multi-agent framework for synthesizing long-form CoT data typically involves the coordination of multiple autonomous agents, each specializing in distinct roles or functions. These agents work together using iterative reflection and strategic debate to enhance the reasoning process. Within this framework, one agent might initiate a chain of thought by presenting an initial hypothesis or argument, while others critique and challenge these ideas through logical examination and counter-arguments. This process encourages deep reflection by prompting agents to reconsider assumptions, address potential biases, and refine conclusions through continuous discourse. In this context, reflection involves not only reconsidering past decisions but also assessing whether each agent’s contribution is grounded in logical consistency. Additionally, the debate mechanism incorporates alternative perspectives and counterarguments into the reasoning process, resulting in more robust and nuanced outcomes for complex decision-making tasks. By combining these cognitive processes, the framework fosters an environment where complex problems can be tackled collaboratively, with diverse viewpoints contributing to more comprehensive solutions.

9.8.3 Training Methods
To elicit and enhance long CoT reasoning capabilities, the existing literature extensively explores two methods: long CoT instruction tuning and scaling reinforcement learning (RL) training. We will describe each approach in detail below.

Long CoT Instruction Tuning. As discussed in Section 9.8.1, long-form thought processes require models to engage in extended reasoning before responding. To develop this reasoning capability, we can instruction-tune LLMs using carefully curated long CoT data. The core concept is to train LLMs to “imitate” the demonstrated behaviors presented in the long CoT data.

In general, this fine-tuning method aims to achieve two key objectives: format adherence (i.e., following a long CoT format) and ability elicitation (i.e., activating the complex reasoning mode). Specifically, format adherence requires the model to produce outputs consisting of two sequential parts (thought and solution), while ability elicitation activates the model’s inherent capacities for executing appropriate long-form thought processes. It has been shown that both objectives can be effectively achieved through supervised fine-tuning: a small amount of high-quality long CoT data can suffice to elicit the long CoT reasoning capabilities of LLMs. For instance, by fine-tuning Qwen2.5 (32B) on just 3.9K distilled long CoT examples, STILL-2 [1067] achieved performance comparable to industry counterparts such as o1-preview and QwQ in mathematical problem-solving. This effectiveness is largely because strong LLMs inherently possess various specific reasoning abilities (e.g., reflection and backtracking). Instruction tuning with long CoT data further enhances these innate abilities, comprehensively integrating and extending their utilization, which enables the model to manage more complex reasoning processes.

An interesting finding is that this reasoning capability appears to generalize well across different domains. For example, when trained exclusively on mathematical data, it can lead to significant improvements in other disciplines, such as physics and chemistry [1067]. This is primarily because long CoT reasoning is inherently a reasoning mode rather than a specific ability tied to any particular domain. This can be seen in the example shown in Fig. 19, where the query, “Which city in China has the largest population?”, is answered through a complex thought process, despite being solvable in a more straightforward manner. Moreover, this capability can be naturally extended to multimodal LLMs, as these models are typically built on the backbone of language models [1073].

Furthermore, this training approach can be naturally enhanced by other supervised training strategies [1067],

such as rejection sampling and direct preference optimization. In general, one can begin by warming up an LLM through instruction tuning with long CoT instruction data and then use the model itself to generate rollout samples as training data. These enhancements can have a certain effect, particularly when the amount of warmup instruction data is limited. However, their impact tends to diminish when sufficient long CoT instruction data is available, especially if the quality of self-generated samples is not superior to that of the demonstration data [1067]. These findings suggest that this advanced capability of a model may quickly reach a performance ceiling when trained through supervised fine-tuning, due to the inherent limitations typical of imitation learning (for further discussion, see Section 9.8.4).

Another downside of this fine-tuning method is its tendency to default to long CoT reasoning mode even for simpler problems (see Fig. 19). To better manage reasoning behavior, it is essential to explore systematic approaches that integrate both long CoT reasoning mode and standard response mode.

Scaling RL Training. Although OpenAI has not disclosed technical details about the o-series models, training methods have been published through initiatives that implement long chain-of-thought (CoT) reasoning systems, such as DeepSeek-R1 [1070] and Kimi-K1.5 [1074], which have demonstrated performance comparable to o1. The technical methods employed converge on the approach of scaling RL training to enhance the complex reasoning capabilities of LLMs. In the following part, we introduce the detailed RL method through three components: the policy model, the reward model, and the RL training algorithm.
• Policy model. The policy model refers to the LLM whose complex reasoning capacities need to be enhanced. Typically, it should be warmed up through supervised fine-tuning with long CoT data, as outlined in the aforementioned method. The main purpose of this warm-up is to activate the long CoT reasoning mode, enabling the policy model to conduct appropriate explorations using a long-form thought process. It is also recommended that the policy model possesses strong foundational capabilities, as this is crucial for eliciting high-reward actions in a more efficient way. An interesting attempt taken by DeepSeek-R1-Zero is to omit the supervised fine-tuning step. Instead, it leverages its strong instruction-following capacity to adhere to the response format and reasoning mode, guiding the model to generate formatted responses comprising two parts: thought and answer. This method uses a format reward to reinforce the correct reasoning mode.
• Reward model. To effectively guide the policy model, it is necessary to set an appropriate reward model in RL algorithms. As discussed in Section 5.2, RLHF employs a specially trained reward model to instruct the learning of the policy model. However, this approach has become less effective for long CoT reasoning, given the difficulty of training reliable reward models to assess the quality of long CoT reasoning processes. Consequently, existing approaches typically employ a verifiable reward model primarily built on reference answers (e.g., mathematical problems) or test samples (e.g., coding problems). Typically, the mathematical domain serves as the major source of training data, where problems with specific answers are selected. The ground-truth answer is used to derive the reward scores, such as 1 for a correct solution and 0 for an incorrect solution. This might seem counterintuitive: how can a complex reasoning system be effectively developed using such a simple reward model? The explanation lies in the essence of RL: unlike supervised fine-tuning, it encourages the autonomous explorations of models through simple yet appropriate incentives. In this way, the complex reasoning capability can be well internalized within the model. In addition to the accuracy reward, other simple rewards can be considered, including completeness, avoidance of excessively long texts, and other formatting issues like repetition. OpenAI has proposed the reinforcement fine-tuning (ReFT) [1075] approach for tuning the o-series models to build domain-specific models, which also uses a simple accuracy reward to guide the training. One limitation of this reward model is that it can only utilize problems with definite and concise answers for training. More general task data, such as summarization, cannot be directly used for training. In such cases, incorporating a trainable reward model becomes necessary. However, as we have discussed, once this reasoning mode is elicited in specific domains, it can naturally generalize well across different domains.
• RL Training. After configuring the policy and reward models, suitable RL algorithms are selected to train the policy model [1076, 1077]. In Section 5.2, we provide a detailed implementation of the PPO algorithm, which can be applied directly for training such models. Nonetheless, PPO requires the maintenance and updating of an additional value model, which leads to high training costs, especially when scaling RL training. As a result, existing approaches [1070] often prefer more simplified RL algorithms, such as GRPO [1078] and RLOO [1079], which use heuristic methods to eliminate the need for a value model. These algorithms typically exhibit higher efficiency and demonstrate strong training performance, especially in long CoT reasoning. A critical factor to monitor during RL training is the response length of the reasoning models, as a longer average response length often corresponds to enhanced reasoning capabilities. Therefore, it is important to track the trends in average response lengths. With appropriate training, the model should show progressively longer response lengths, accompanied by simultaneous performance improvements. In fact, response length is directly connected to the test-time scaling law demonstrated by OpenAI⁵⁶. This law suggests that as more output tokens are generated, a model’s reasoning performance can improve substantially. Nonetheless, achieving stable and effective RL training is challenging and necessitates consideration of various factors, such as the selection of query problems (e.g., choosing problems that are challenging yet solvable by the model), the updating of the reference model (e.g., continually updating it as training progresses), and the enhancement of exploration strategies (e.g., sampling more responses with higher temperature settings).

56. https://openai.com/index/learning-to-reason-with-llms/
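To make the reward design in this subsection concrete, the following is a minimal sketch of a rule-based reward that combines a 0/1 accuracy reward with a format reward. It is an illustration under stated assumptions rather than the implementation used by any of the cited systems: the `<think>`/`<answer>` tag layout, the string normalization, and the `format_weight` parameter are all hypothetical choices made for this example.

```python
import re

# Hypothetical response layout (an assumption of this sketch):
# a thought part followed by an answer part, each wrapped in tags.
RESPONSE_PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def normalize(text: str) -> str:
    """Crude answer normalization: trim, lowercase, drop a trailing period."""
    return text.strip().lower().rstrip(".")


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the thought/answer layout, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference answer, else 0.0."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0  # no parseable answer, so the response cannot be verified
    return 1.0 if normalize(match.group(2)) == normalize(reference) else 0.0


def total_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both signals; format_weight is an illustrative value."""
    return accuracy_reward(response, reference) + format_weight * format_reward(response)
```

Exact string matching only works for problems with definite and concise answers, which mirrors the limitation noted above; free-form tasks such as summarization would need a trained reward model instead.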
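The value-model-free algorithms mentioned in Section 9.8.3 (GRPO and RLOO) replace the learned value baseline with statistics computed over a group of responses sampled for the same prompt. The sketch below shows only this advantage computation, under simplifying assumptions: one scalar reward per response, no KL penalty or clipping, and the function names and the `eps` constant are chosen here for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """RLOO-style advantage: baseline is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two sampled responses")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Example group: four responses to one prompt, scored with a 0/1 accuracy reward.
example_rewards = [1.0, 0.0, 0.0, 1.0]
grpo_adv = group_relative_advantages(example_rewards)
rloo_adv = leave_one_out_advantages(example_rewards)
```

In both variants, responses that score above the rest of their group receive a positive advantage and the others a negative one. If every response in a group receives the same reward, all advantages are zero and the group provides no learning signal, which is one reason for selecting problems that are challenging yet solvable by the model.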

9.8.4 Extended Discussion
In the preceding discussions, we have introduced long CoT reasoning in technical detail. It can in fact be
considered a specific approach to achieving test-time scaling (a.k.a., inference-time scaling), which is the focus of
this subsection.

From a broader perspective, test-time scaling encompasses various approaches that enhance model performance by
increasing the outputs or computations from LLMs. In this way, many methods can be considered test-time scaling
techniques. For example, Self-Consistency [429] generates multiple responses and then aggregates the solutions using
majority vote, resulting in higher inference costs due to the increased number of rollouts. Additionally, planning
techniques (Section 6.4) and their agentic instantiations (Section 9.2) can also be considered test-time scaling
approaches, as they involve prompting LLMs multiple times and utilizing tools or memory components. Therefore, the
essence of test-time scaling is to trade additional inference costs for performance gains. Unlike previous
approaches, long CoT reasoning directly searches for solutions within the natural language space, notably within a
single response.

When comparing different test-time scaling methods, two critical factors require careful examination: token
efficiency (the performance improvement per token cost) and performance ceiling (the maximum attainable
performance). Research has shown that scaling test-time computation can effectively enhance model performance
[1070, 1080] through the use of simple aggregation methods or specially trained models, though token efficiency may
vary. Overall, scaling RL training tends to exhibit higher token efficiency compared to existing test-time scaling
methods [1070]. Additionally, both heuristic methods and supervised fine-tuning often exhibit a relatively limited
performance ceiling that cannot be substantially elevated once scaling reaches a certain level [1067, 1080]. In
contrast, scaling RL training can lead to continuous performance improvements in reasoning models as training time
increases. For example, DeepSeek-R1-Zero demonstrates a consistent upward trend in performance even after more than
8,000 training steps [1070].

These scaling effects are crucial for solving complex tasks. Notably, a potential advantage of long CoT reasoning
models is that they make it feasible to develop expert-level models in specialized domains or for specific tasks,
which could significantly advance progress on scientific research challenges. Moreover, as inference methods and
hardware techniques improve, the deployment and usage costs of these models will be significantly reduced,
enhancing the contribution of these highly intelligent models to real-world applications. Additionally, addressing
security issues in long CoT reasoning models is crucial. Given their unique reasoning mode, specialized alignment
strategies should be developed to ensure safer use of these models.

10 CONCLUSION AND FUTURE DIRECTIONS
In this survey, we have reviewed the recent progress of large language models (LLMs), and introduced the key
concepts, findings, and techniques for understanding and utilizing LLMs. We focus on the large-sized models (i.e.,
having a size larger than 10B) while excluding the contents of early pre-trained language models (e.g., BERT and
GPT-2) that have been well covered in the existing literature. In particular, our survey has discussed four
important aspects of LLMs, i.e., pre-training, adaptation, utilization, and evaluation. For each aspect, we
highlight the techniques or findings that are key to the success of LLMs. Furthermore, we also summarize the
available resources for developing LLMs and discuss important implementation guidelines for reproducing LLMs. This
survey tries to cover the most recent literature about LLMs and provides a good reference resource on this topic for
both researchers and engineers.

Next, we summarize the discussions of this survey, and introduce the challenges and future directions for LLMs, in
the following aspects.

Basics and Principles. Instead of training on specific task goals, LLMs learn from unsupervised pre-training on
large-scale text data. This is quite different from previous multi-task learning approaches, which aim to extend the
training tasks as much as possible to achieve sufficient generalization. Thus, it is essential to reveal the basic
principles or elements that establish the foundation of the abilities of LLMs. Although the basic idea of language
models is intuitive, it is still challenging to formally explain why LLMs trained by simple language modeling
objectives (e.g., next token prediction) can become capable of solving various real-world tasks. To investigate this
problem, a promising approach is to study the capacity learning (or selection) mechanism based on unsupervised
pre-training, since the model capacity of LLMs strongly depends on pre-training data. In addition, scaling plays an
important role in improving the capacity of LLMs [31, 55, 64], and it is very useful to conduct more theoretical
analysis about how the behaviors of large models relate to those of small models, e.g., which behaviors of large
models can be inferred from small models and which cannot be predicted. Another research direction is to conduct
deeper analysis of model generalization for LLMs, since increasing concerns have been raised about whether LLMs can
generalize beyond the knowledge encoded by pre-training data. Furthermore, data contamination has become a severe
issue for fairly assessing the performance of LLMs [740], and thus setting appropriate evaluation protocols will be
the basis for investigating and analyzing the model capacity of LLMs.

Model Architecture. Due to its scalability and effectiveness, Transformer has become the de facto architecture for
building LLMs. Various strategies have been proposed to improve the performance of this architecture, such as neural
network configuration and scalable parallel training (see discussions in Section 4.2.2). However, Transformer still
suffers from high training costs and slow inference rates. More effort [270, 271] is still needed to develop
improved model architectures for large-scale pre-training. In particular, system-level or hardware-level
optimization (e.g., FlashAttention [303]) is worth more exploration to improve the efficiency of Transformer
architectures. In addition, as an important basic capacity, existing LLMs typically maintain a long context window.
For example, the most recent GPT-4 Turbo enables a long context of 128K tokens, and Claude 2.1 also supports the
input up to 200K tokens. Although many efforts have been made to enhance the long context

modeling ability of LLMs [283, 943], the resulting models still cannot process the information in the context
window well [949]. To address this issue, specific architecture adaptations or algorithms might be needed to enhance
the modeling and utilization of long context information. Another worrying concern is that existing work mostly
focuses on training LLMs with decoder-only Transformers. Despite its effectiveness, this focus severely limits
wider, more diverse exploration of alternative model architectures.

Model Training. For pre-training, it is essential to establish a data-centric infrastructure and training procedure
for LLM optimization, which can effectively support a systematic process of data collection, data cleaning, data
mixture, and data curriculum. Furthermore, it also calls for more flexible mechanisms of hardware support or
resource scheduling, so as to better organize and utilize the resources in a computing cluster. In practice, it is
very challenging to pre-train capable LLMs, due to the huge compute consumption and the sensitivity to data quality
and training tricks [78, 93]. Thus, it becomes particularly important to develop systemic, economical pre-training
approaches for optimizing LLMs, e.g., predictable scaling [46] and proxy model training [59]. More training recipes
or principles should be investigated and shared to reduce the potential risk of degradation or failure in
large-scale model optimization. Although increasingly more model checkpoints and cleaned datasets have been
released, there is still a lack of reproducible work on pre-training data preparation (e.g., detailed cleaning
strategies) and data scheduling (e.g., data mixture and curriculum). Since it is very costly to pre-train an LLM
from scratch, it is important to design suitable mechanisms for continually pre-training or fine-tuning the LLM
based on publicly available model checkpoints (e.g., LLaMA [57] and Flan-T5 [69]). For this purpose, a number of
technical issues have to be resolved, e.g., catastrophic forgetting and task specialization. Furthermore, it is also
useful to develop effective tuning strategies that inject or edit specific knowledge [674], e.g., correcting
outdated facts.

Model Utilization. Based on the natural language interface, prompting has become the prominent approach for using
LLMs to solve various tasks. By combining task descriptions and demonstration examples into prompts, in-context
learning (ICL) endows LLMs with the ability to perform well on new tasks, even outperforming full-data fine-tuned
models in some cases. To enhance the ability of complex reasoning, advanced prompting techniques have been proposed,
exemplified by the chain-of-thought (CoT) strategy, which includes the intermediate reasoning steps in prompts.
Furthermore, planning is a promising approach for solving complex tasks, which iteratively invokes LLMs by
leveraging tool use capacities. Despite these efforts, several basic problems related to prompting are still
under-explored: why a good prompt can elicit the correct answer but a bad prompt cannot, how to reveal the working
principles of advanced prompting methods (e.g., ICL and CoT) and further improve these existing approaches, and how
to efficiently find effective prompts for LLMs on specific tasks. Furthermore, from a practical perspective, it has
become a fundamental challenge to reduce the inference cost of LLMs, especially in large-scale deployment. Another
popular research direction is retrieval-augmented generation, where retrieved contexts from supporting sources are
included in prompts for task solving. It has been shown that retrieval augmentation can extend the knowledge
boundary and improve the question answering capacity [454], but it may be limited by how effectively LLMs utilize
long contexts [949].

Safety and Alignment. Despite their capacities, LLMs are faced with great safety challenges in practical use. As a
fundamental issue arising from their probabilistic modeling nature, LLMs exhibit a tendency to generate
hallucinations [640], referring to texts that seem plausible but may be factually incorrect [46]. What is worse,
LLMs might be elicited by intentional instructions to produce harmful, biased, or toxic texts for malicious systems,
leading to the potential risks of misuse [55, 66]. For a detailed discussion of the safety issues of LLMs (e.g.,
privacy, overreliance, disinformation, and influence operations), the readers can refer to the GPT-3/4 technical
reports [46, 55]. As the major technical approach to averting these issues, alignment methods (e.g., RLHF) [66, 116]
have been widely used by leveraging human feedback for developing well-aligned LLMs. However, RLHF heavily relies on
high-quality human feedback data from professional labelers, and recruiting qualified human annotators is costly and
time-consuming. Therefore, it is necessary to improve the RLHF framework to reduce the efforts of human labelers and
to seek a more efficient annotation approach with guaranteed data quality, e.g., LLMs can be employed to assist the
labeling work. Furthermore, it is also suggested to develop simplified optimization algorithms for alignment
[388, 391], to reduce the training difficulty and instability of RLHF. As another practical approach, red teaming
[132, 367] has been adopted for improving the model safety of LLMs, which utilizes the collected adversarial prompts
to refine the LLMs (i.e., avoiding the attacks from red teaming). In addition, privacy concerns are also important
to consider when fine-tuning LLMs with domain-specific data, and thus federated learning [1081] can be useful in
privacy-restricted scenarios.

Application and Ecosystem. As LLMs have shown strong capacities in solving various tasks, they can be applied in a
broad range of real-world applications (i.e., following task-specific natural language instructions). As a
remarkable progress, ChatGPT has potentially changed the way humans access information, which has been additionally
integrated in the release of New Bing. Generally, in the near future, it can be foreseen that LLMs would have a
significant impact on information-seeking techniques, including both search engines and recommender systems.
Furthermore, LLMs make it possible to develop more intelligent systems (e.g., autonomous AI agents) to tackle
various complex tasks in real-world scenarios. In particular, the Assistants API has been launched by OpenAI
(featuring instructions, knowledge and tool use), enabling rapid development of agent-like assistants within
applications. This wave of technical innovation would lead to an ecosystem of LLM-empowered applications (e.g.,
OpenAI’s GPT Store), which has a close connection with human life. Lastly, the rise of

LLMs sheds light on the exploration of artificial general intelligence (AGI). It is promising to develop smarter AI
systems than ever. However, in this development process, AI safety should be one of the primary concerns, i.e.,
ensuring that AI benefits rather than harms humanity [40].

CODA
It is not an easy job to write this long survey and update its content with timely work. First of all, we would
like to sincerely thank the readers and our team members for their support. We work very hard on this survey, and
hope that it can present a comprehensive, timely reference for LLMs.

Survey Writing. This survey was planned during a discussion meeting held by our research team, and we aimed to
summarize the recent advances of large language models as a highly readable report for our team members. The first
draft was finished on March 13, 2023, in which our team members tried their best to include the related studies
about LLMs in a relatively objective, comprehensive way. Then, we have extensively revised the writing and contents
in several passes. Due to the space limit, we can only include a fraction of existing LLMs in Figure 3 and Table 1,
by setting the selection criterion. However, we set a more relaxed criterion for model selection on our GitHub page
(https://github.com/RUCAIBox/LLMSurvey), which will be regularly maintained. We released the initial version on
March 31, 2023, the major revision on June 29, 2023, the second version on September 10, 2023, and this latest
version (major revision) on November 23, 2023.

Seeking Advice. Despite all our efforts, this survey is still far from perfect: we are likely to miss important
references or topics, and might also have non-rigorous expressions or discussions. We will continuously update this
survey, and improve the quality as much as we can. For us, survey writing is also a process of learning about LLMs
ourselves. Readers with constructive suggestions to improve this survey are welcome to leave comments on the GitHub
page of our survey or directly email our authors. We will make revisions following the received comments or
suggestions in a future version, and acknowledge the readers who have contributed constructive suggestions in our
survey.

Update log. In this part, we regularly maintain an update log for the submissions of this survey to arXiv:
• First release on March 31, 2023: the initial version.
• Update on April 9, 2023: add the affiliation information, revise Figure 3 and Table 1 and clarify the
  corresponding selection criterion for LLMs, improve the writing, and correct some minor errors.
• Update on April 11, 2023: correct the errors for library resources.
• Update on April 12, 2023: revise Figure 3 and Table 1, and clarify the release date of LLMs.
• Update on April 16, 2023: add a new Section 2.2 about the technical evolution of GPT-series models.
• Update on April 24, 2023: add the discussion about scaling laws and add some explanations about the model sizes
  for emergent abilities (Section 2.1); add an illustrative figure for the attention patterns for different
  architectures in Figure 9, and add the detailed formulas in Table 7.
• Update on April 25, 2023: revise some copy errors in figures and tables.
• Update on April 27, 2023: add efficient tuning in Section 5.3.
• Update on April 28, 2023: revise Section 5.3.
• Update on May 7, 2023: revise Table 1, Table 2, and some minor points.
• Update on June 29, 2023 (major revision):
  – Section 1: add Figure 1 for the trends of published LLM papers in arXiv;
  – Section 2: add Figure 4 for GPT’s evolution and the corresponding discussion;
  – Section 3: add Figure 5 for LLaMA family and the corresponding discussion;
  – Section 5: add latest discussion about the synthetic data formatting of instruction tuning in Section 5.1.1,
    the empirical analysis for instruction tuning in Section 5.1.4, parameter-efficient model adaptation in
    Section 5.3 and memory-efficient adaptation in Section 5.3;
  – Section 6: add latest discussion about the underlying mechanism of ICL 6.2.3, planning for complex task
    solving in Section 6.4;
  – Section 7: update Table 14 for representative datasets for evaluating advanced abilities of LLMs, and
    empirical ability evaluation in Section 7.4;
  – Section 6.1.1: add prompt design;
  – Section 8: add the discussions on applications of LLMs in finance and scientific research domains;
• Update on September 10, 2023 (major revision):
  – Claim the copyrights of the figures and tables in this paper.
  – Add latest LLMs, techniques and their descriptions in Section 3, Section 4, Section 5, Section 6 and Section 7;
  – Section 4: add latest discussion about the decoding strategy in Section 4.2.4;
  – Section 5: add latest discussion about the practical tricks for instruction tuning in Section 5.1.2, the
    empirical analysis on LLaMA (13B) for instruction tuning in Section 5.1.4, practical strategies for RLHF in
    Section 5.2.3, alignment without RLHF in Section 5.2.4 and remarks on SFT and RLHF in Section 5.2.5;
  – Section 6: update the content about the planning for complex task solving in Section 6.4;
  – Section 7: add discussions about evaluation approaches in Section 7.3.2, Table 15 for the category of existing
    evaluation work, and update empirical ability evaluation in Section 7.4 and the results on Table 16;
  – Section 6.1.1: add new prompt examples in Table 12;
• Update on November 23, 2023 (major revision):
  – Section 1: add Figure 2 for the evolution process of four generations of language models;
  – Section 2: add more discussion about scaling laws and how emergent abilities relate to scaling laws;
  – Section 3: add latest LLMs in Figure 3 and Table 1,

    latest APIs in Section 3.1, commonly used datasets for instruction tuning and alignment tuning in Section 3.3,
    and several libraries in Section 3.4;
  – Section 4: add latest discussion about the data scheduling, including data mixtures and data curriculum in
    Section 4.1.3; add summary of data preparation in Section 4.1.4; add discussion about modeling long context in
    Section 9.1; add discussion about decoding efficiency issues and add latest decoding strategies in
    Section 4.2.4;
  – Section 5: add latest discussion about instance construction and tuning strategies in Section 5.1; add latest
    discussion about process-supervised RLHF in Section 5.2.3, and the empirical study on quantized LLaMA models
    (7B and 13B) in Section 9.5.1;
  – Section 6: add latest discussion about prompt optimization in Section 6.1.2, and update the content about
    chain-of-thought prompting in Section 6.3;
  – Section 8: add latest discussion about LLM for research directions in Section 8.1;
  – Section 10: revise the content in several aspects.
• Update on September 25, 2024:
  – Section 3: reorganize the content of “public available model checkpoints” into multiple series; add the latest
    LLMs in Figure 3.
  – Section 4: add LLM-based data filtering and selection methods in Section 4.1.2; update Section 4.2.1,
    “Emergent Architectures” to include more discussions about SSM-based architectures; add Table 6 to compare
    parallelism and complexity of different architectures.
  – Section 5: add latest discussion about instruction quality improvement and instruction selection in
    Section 5.1.1; add latest discussion about practical strategies for RLHF and process-supervised RLHF in
    Section 5.2.3; update the content about supervised alignment tuning in Section 5.2.4.
  – Section 6: add latest papers about discrete prompt optimization in Section 6.1.2.
  – Section 9: add latest discussion about advanced topics, including long context modeling, LLM-based agent,
    analysis and optimization for training and inference, model inference, model compression, retrieval-augmented
    generation, and hallucination.
• Update on October 12, 2024:
  – Section 8.1.5: correct the errors.
• Update on March 11, 2025:
  – Section 9.8: add latest papers about long CoT reasoning, including the analysis of reasoning patterns and
    advantages, construction of long CoT data (i.e., distillation, search-based, and multi-agent collaboration),
    and training methods (i.e., instruction tuning and reinforcement learning).
• Update on February 27, 2026:
  – Figure 3: update with new LLMs.

Clarifications on Experiments. In this version, we have included a number of experiments on instruction-tuning
(Table 10), overall ability evaluation (Table 16), and prompt engineering (Table 17). Due to the limit of
computational resources, our experiments are not complete, limited to small-sized models or a few comparisons.
Despite that, we feel that it might be meaningful to share the partial results with the public. We will try to
include the missing results of larger models or more comparisons in future versions. We also call for support of
computing power for conducting more comprehensive experiments.

Chinese Book. We also released a Chinese book based on this survey article, at the link:
https://llmbook-zh.github.io. This book is in the publication process.

ACKNOWLEDGMENTS
The authors would like to thank Yankai Lin and Yutao Zhu for proofreading this paper. Since the first release of
this paper, we have received a number of valuable comments from the readers. We sincerely thank the readers who have
written to us with constructive suggestions and comments: Tyler Suard, Damai Dai, Liang Ding, Stella Biderman,
Kevin Gray, Jay Alammar, Yubo Feng, Mark Holmstrom, Xingdong Liu, Il-Seok Oh, Yiting Liu, Shaojun Wang, Gaoyan Ou,
Todd Morrill, Hao Liu, Zhenyu Zhang, and Xinlin Zhuang.

Since the v11 version (June 29, 2023), we have been adding a large number of experiments and prompt practices. These
new contents were completed by a number of volunteers in our team. Here, we add a special part to thank all the
students who have worked very hard on this part (also including the ones on our author list).

Contribution on Experiments. We would like to sincerely thank the following people for their hard work involved in
experiments shown in Table 16.
• Xiaoxue Cheng: implement the experiments for evaluation on Language Generation and HaluEval tasks.
• Yuhao Wang: implement the experiments for evaluation on interaction with environment tasks.
• Bowen Zheng: implement the experiments for evaluation on tool manipulation tasks.

Contribution on Tips. We list the following people for their contributions on the corresponding numbers of provided
tips for designing prompts in Table 12.
• Xiaolei Wang: T3, O3
• Beichen Zhang: D2, D5
• Zhipeng Chen: D3, D4
• Junjie Zhang: D6
• Bowen Zheng: D7
• Zican Dong: D8
• Xinyu Tang: C2
• Yifan Du: T4
• Tianyi Tang: O6, O7, D9
• Yupeng Hou: O8, C3
• Salvatore Raieli: C4

REFERENCES
[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” J. Mach. Learn.
Res., vol. 3, pp. 1137–1155, 2003.

[2] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011.
[3] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[4] M. D. Hauser, N. Chomsky, and W. T. Fitch, “The faculty of language: what is it, who has it, and how did it evolve?” Science, vol. 298, no. 5598, pp. 1569–1579, 2002.
[5] A. M. Turing, “Computing machinery and intelligence,” Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[6] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[7] J. Gao and C. Lin, “Introduction to the special issue on statistical language modeling,” ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.
[8] R. Rosenfeld, “Two decades of statistical language modeling: Where do we go from here?” Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[9] A. Stolcke, “Srilm-an extensible language modeling toolkit,” in Seventh international conference on spoken language processing, 2002.
[10] X. Liu and W. B. Croft, “Statistical language modeling for information retrieval,” Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.
[11] C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008.
[12] S. M. Thede and M. P. Harper, “A second-order hidden markov model for part-of-speech tagging,” in 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182.
[13] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “A tree-based statistical language model for natural language speech recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[14] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867.
[15] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recognizer,” IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987.
[16] W. A. Gale and G. Sampson, “Good-turing frequency estimation without tears,” J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[17] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048.
[18] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, “Recurrent neural network based language modeling in meeting recognition,” in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 2227–2237.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186.
[24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 7871–7880.
[25] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” J. Mach. Learn. Res, pp. 1–40, 2021.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, p. 9, 2019.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR, vol. abs/1907.11692, 2019.
[28] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush, “Multitask prompted training enables zero-shot task generalization,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[29] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel, “What language model architecture and pretraining objective works best for zero-shot generalization?” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 162, 2022, pp. 22 964–22 984.
[30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” CoRR, vol. abs/2001.08361, 2020.
[31] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” CoRR, vol. abs/2206.07682, 2022.
[32] M. Shanahan, “Talking about large language models,” CoRR, vol. abs/2212.03551, 2022.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou, “Chain of thought prompting elicits reasoning in large language models,” CoRR, vol. abs/2201.11903, 2022.

Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, “A comprehensive survey on pretrained foundation models: A history from BERT to chatgpt,” CoRR, vol. abs/2302.09419, 2023.
[38] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao, and J. Zhu, “Pre-trained models: Past, present and future,” AI Open, vol. 2, pp. 225–250, 2021.
[39] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, “Pre-trained models for natural language processing: A survey,” CoRR, vol. abs/2003.08271, 2020.
[40] S. Altman, “Planning for agi and beyond,” OpenAI Blog, February 2023.
[41] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, “Sparks of artificial general intelligence: Early experiments with gpt-4,” vol. abs/2303.12712, 2023.
[42] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei, “Language is not all you need: Aligning perception with language models,” CoRR, vol. abs/2302.14045, 2023.
[43] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226, 2023.
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., “Palm-e: An embodied multimodal language model,” arXiv preprint arXiv:2303.03378, 2023.
[45] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, “Visual chatgpt: Talking, drawing and editing with visual foundation models,” arXiv preprint arXiv:2303.04671, 2023.
[46] OpenAI, “Gpt-4 technical report,” OpenAI, 2023.
[47] Y. Fu, H. Peng, and T. Khot, “How does gpt obtain its ability? tracing emergent abilities of language
[34] J. Hoffmann, S. Borgeaud, A. Mensch, models to their sources,” Yao Fu’s Notion, Dec 2022.
E. Buchatskaya, T. Cai, E. Rutherford, [48] J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained
D. de Las Casas, L. A. Hendricks, J. Welbl, language model for text generation: A survey,” in
A. Clark, T. Hennigan, E. Noland, K. Millican, Proceedings of the Thirtieth International Joint Confer-
G. van den Driessche, B. Damoc, A. Guy, S. Osindero, ence on Artificial Intelligence, IJCAI 2021, Virtual Event
K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed.
L. Sifre, “Training compute-optimal large language ijcai.org, 2021, pp. 4492–4499.
models,” CoRR, vol. abs/2203.15556, 2022. [49] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A
[35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, survey of deep learning for mathematical reason-
A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and ing,” CoRR, vol. abs/2212.10535, 2022.
R. Stojnic, “Galactica: A large language model for [50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang,
science,” CoRR, vol. abs/2211.09085, 2022. X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in-
[36] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and context learning,” CoRR, vol. abs/2301.00234, 2023.
G. Neubig, “Pre-train, prompt, and predict: A sys- [51] J. Huang and K. C. Chang, “Towards reasoning
tematic survey of prompting methods in natural in large language models: A survey,” CoRR, vol.
language processing,” ACM Comput. Surv., pp. 195:1– abs/2212.10403, 2022.
195:35, 2023. [52] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng,
[37] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, C. Tan, F. Huang, and H. Chen, “Reasoning with
K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, language model prompting: A survey,” CoRR, vol.
abs/2212.09597, 2022. preprint arXiv:2305.16264, 2023.
[53] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, [62] I. McKenzie, A. Lyzhov, A. Parrish, A. Prabhu,
“Chatgpt: potential, prospects, and limitations,” in A. Mueller, N. Kim, S. Bowman, and E. Perez, “The
Frontiers of Information Technology & Electronic Engi- inverse scaling prize,” 2022. [Online]. Available:
neering, 2023, pp. 1–6. https://github.com/inverse-scaling/prize
[54] W. X. Zhao, J. Liu, R. Ren, and J.-R. Wen, “Dense [63] B. A. Huberman and T. Hogg, “Phase transitions in
text retrieval based on pretrained language models: artificial intelligence systems,” Artificial Intelligence,
A survey,” ACM Transactions on Information Systems, vol. 33, no. 2, pp. 155–171, 1987.
vol. 42, no. 4, pp. 1–60, 2024. [64] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff-
[55] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, mann, H. F. Song, J. Aslanides, S. Henderson,
J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, R. Ring, S. Young, E. Rutherford, T. Hennigan,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, J. Menick, A. Cassirer, R. Powell, G. van den
G. Krueger, T. Henighan, R. Child, A. Ramesh, Driessche, L. A. Hendricks, M. Rauh, P. Huang,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Ue-
E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, sato, J. Mellor, I. Higgins, A. Creswell, N. McAleese,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya,
and D. Amodei, “Language models are few-shot D. Budden, E. Sutherland, K. Simonyan, M. Paganini,
learners,” in Advances in Neural Information Processing L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Ne-
Systems 33: Annual Conference on Neural Information matzadeh, E. Gribovskaya, D. Donato, A. Lazaridou,
Processing Systems 2020, NeurIPS 2020, December 6-12, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grig-
2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, orev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen,
M. Balcan, and H. Lin, Eds., 2020. Z. Gong, D. Toyama, C. de Masson d’Autume,
[56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark,
G. Mishra, A. Roberts, P. Barham, H. W. Chung, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J.
C. Sutton, S. Gehrmann, P. Schuh, K. Shi, Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel,
S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell,
Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett,
B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Is- D. Hassabis, K. Kavukcuoglu, and G. Irving, “Scaling
ard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghe- language models: Methods, analysis & insights from
mawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, training gopher,” CoRR, vol. abs/2112.11446, 2021.
K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, [65] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei,
H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Do- “Why can GPT learn in-context? language models se-
han, S. Agrawal, M. Omernick, A. M. Dai, T. S. cretly perform gradient descent as meta-optimizers,”
Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, CoRR, vol. abs/2212.10559, 2022.
O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, [66] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wain-
M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier- wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller,
“Palm: Scaling language modeling with pathways,” M. Simens, A. Askell, P. Welinder, P. F. Christiano,
CoRR, vol. abs/2204.02311, 2022. J. Leike, and R. Lowe, “Training language models
[57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, to follow instructions with human feedback,” CoRR,
M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham- vol. abs/2203.02155, 2022.
bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and [67] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu,
G. Lample, “Llama: Open and efficient foundation B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned
language models,” CoRR, 2023. language models are zero-shot learners,” in The Tenth
[58] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, International Conference on Learning Representations,
J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray ICLR 2022, Virtual Event, April 25-29, 2022. Open-
et al., “Scaling laws for autoregressive generative Review.net, 2022.
modeling,” arXiv preprint arXiv:2010.14701, 2020. [68] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer,
[59] S. M. Xie, H. Pham, X. Dong, N. Du, H. Liu, Y. Lu, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker,
P. Liang, Q. V. Le, T. Ma, and A. W. Yu, “Doremi: Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri,
Optimizing data mixtures speeds up language model M. Menegali, Y. Huang, M. Krikun, D. Lepikhin,
pretraining,” arXiv preprint arXiv:2305.10429, 2023. J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma,
[60] P. Villalobos, J. Sevilla, L. Heim, T. Besiroglu, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pick-
M. Hobbhahn, and A. Ho, “Will we run out of ett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi,
data? an analysis of the limits of scaling datasets in R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen,
machine learning,” CoRR, vol. abs/2211.04325, 2022. V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson,
[Online]. Available: https://doi.org/10.48550/arXiv. A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Ra-
2211.04325 jakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fen-
[61] N. Muennighoff, A. M. Rush, B. Barak, T. L. Scao, ton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-
A. Piktus, N. Tazi, S. Pyysalo, T. Wolf, and C. Raffel, Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le,
“Scaling data-constrained language models,” arXiv “Lamda: Language models for dialog applications,”
CoRR, vol. abs/2201.08239, 2022. den, S. Bekman, A. McMillan-Major, I. Beltagy,
[69] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh,
W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell,
A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji,
A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou,
Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. C. Emezue, C. Klamm, C. Leong, D. van Strien,
Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, D. I. Adelani et al., “BLOOM: A 176b-parameter
and J. Wei, “Scaling instruction-finetuned language open-access multilingual language model,” CoRR,
models,” CoRR, vol. abs/2210.11416, 2022. vol. abs/2211.05100, 2022.
[70] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, [79] P. F. Christiano, J. Leike, T. B. Brown, M. Martic,
A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, S. Legg, and D. Amodei, “Deep reinforcement learn-
A. Garriga-Alonso, A. Kluska, A. Lewkowycz, ing from human preferences,” in Advances in Neural
A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Information Processing Systems 30: Annual Conference
Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, on Neural Information Processing Systems 2017, Decem-
A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, ber 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von
A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller, Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N.
A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang, Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299–
A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, 4307.
A. Venkatesh, A. Gholamidavoodi, A. Tabassum, [80] T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu,
A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sab- M. Lomeli, L. Zettlemoyer, N. Cancedda, and
harwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, T. Scialom, “Toolformer: Language models can teach
et al., “Beyond the imitation game: Quantifying themselves to use tools,” CoRR, vol. abs/2302.04761,
and extrapolating the capabilities of language mod- 2023.
els,” CoRR, vol. abs/2206.04615, 2022. [81] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang,
[71] R. Schaeffer, B. Miranda, and S. Koyejo, “Are emer- C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saun-
gent abilities of large language models a mirage?” ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger,
arXiv preprint arXiv:2304.15004, 2023. K. Button, M. Knight, B. Chess, and J. Schulman,
[72] S. Hu, X. Liu, X. Han, X. Zhang, C. He, W. Zhao, “Webgpt: Browser-assisted question-answering with
Y. Lin, N. Ding, Z. Ou, G. Zeng, Z. Liu, and M. Sun, human feedback,” CoRR, vol. abs/2112.09332, 2021.
“Unlock predictable scaling from emergent abilities,” [82] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
2023. M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
[73] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and the limits of transfer learning with a unified text-
V. Misra, “Grokking: Generalization beyond overfit- to-text transformer,” J. Mach. Learn. Res., pp. 140:1–
ting on small algorithmic datasets,” arXiv preprint 140:67, 2020.
arXiv:2201.02177, 2022. [83] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-
[74] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A
“Deepspeed: System optimizations enable training massively multilingual pre-trained text-to-text trans-
deep learning models with over 100 billion param- former,” in Proceedings of the 2021 Conference of the
eters,” in KDD, 2020, pp. 3505–3506. North American Chapter of the Association for Com-
[75] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, putational Linguistics: Human Language Technologies,
J. Casper, and B. Catanzaro, “Megatron-lm: Train- NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp.
ing multi-billion parameter language models using 483–498.
model parallelism,” CoRR, vol. abs/1909.08053, 2019. [84] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang,
[76] D. Narayanan, M. Shoeybi, J. Casper, P. LeGres- X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li,
ley, M. Patwary, V. Korthikanti, D. Vainbrand, Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo,
P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phan- Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi,
ishayee, and M. Zaharia, “Efficient large-scale lan- F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang,
guage model training on GPU clusters using Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan,
megatron-lm,” in International Conference for High Per- Y. Wang, X. Jin, Q. Liu, and Y. Tian, “Pangu-α: Large-
formance Computing, Networking, Storage and Analysis, scale autoregressive pretrained chinese language
SC 2021, St. Louis, Missouri, USA, November 14-19, models with auto-parallel computation,” CoRR, vol.
2021. ACM, 2021, p. 58. abs/2104.12369, 2021.
[77] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. An- [85] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao,
dersch, M. Shoeybi, and B. Catanzaro, “Reducing ac- Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai,
tivation recomputation in large transformer models,” G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu,
CoRR, vol. abs/2205.05198, 2022. X. Zhu, and M. Sun, “CPM-2: large-scale cost-
[78] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, effective pre-trained language models,” CoRR, vol.
D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, abs/2106.10715, 2021.
M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Web- [86] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang,
son, P. S. Ammanamanchi, T. Wang, B. Sagot, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An
N. Muennighoff, A. V. del Moral, O. Ruwase, R. Baw- open large language model for code with multi-turn
program synthesis,” arXiv preprint arXiv:2203.13474, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura,
2022. X. Li, B. O’Horo, G. Pereyra, J. Wang, C. Dewan,
[87] S. Black, S. Biderman, E. Hallahan, Q. Anthony, A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov,
L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, “OPT-IML: scaling language model instruction meta
J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, learning through the lens of generalization,” CoRR,
L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt- vol. abs/2212.12017, 2022.
neox-20b: An open-source autoregressive language [96] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley,
model,” CoRR, vol. abs/2204.06745, 2022. K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit,
[88] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, U. S. Prashanth, E. Raff et al., “Pythia: A suite for
A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, analyzing large language models across training and
A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, scaling,” arXiv preprint arXiv:2304.01373, 2023.
H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuz- [97] E. Nijkamp, H. Hayashi, C. Xiong, S. Savarese, and
nia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, Y. Zhou, “Codegen2: Lessons for training llms on
M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, programming and natural languages,” CoRR, vol.
P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. abs/2305.02309, 2023.
Sampat, S. Mishra, S. R. A, S. Patro, T. Dixit, and [98] R. Li, L. B. Allal, Y. Zi, N. Muennighoff,
X. Shen, “Super-naturalinstructions: Generalization D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li,
via declarative instructions on 1600+ NLP tasks,” in J. Chim, Q. Liu, E. Zheltonozhskii, T. Y. Zhuo,
Proceedings of the 2022 Conference on Empirical Methods T. Wang, O. Dehaene, M. Davaadorj, J. Lamy-Poirier,
in Natural Language Processing, EMNLP 2022, Abu J. Monteiro, O. Shliazhko, N. Gontier, N. Meade,
Dhabi, United Arab Emirates, December 7-11, 2022, A. Zebaze, M. Yee, L. K. Umapathi, J. Zhu, B. Lipkin,
2022, pp. 5085–5109. M. Oblokulov, Z. Wang, R. M. V, J. Stillerman,
[89] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcı́a, J. Wei, S. S. Patel, D. Abulkhanov, M. Zocca, M. Dey,
X. Wang, H. W. Chung, D. Bahri, T. Schuster, Z. Zhang, N. Fahmy, U. Bhattacharyya, W. Yu,
H. Zheng, D. Zhou, N. Houlsby, and D. Metzler, S. Singh, S. Luccioni, P. Villegas, M. Kunakov,
“Ul2: Unifying language learning paradigms,” 2022. F. Zhdanov, M. Romero, T. Lee, N. Timor, J. Ding,
[90] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, C. Schlesinger, H. Schoelkopf, J. Ebert, T. Dao,
S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, M. Mishra, A. Gu, J. Robinson, C. J. Anderson,
T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, B. Dolan-Gavitt, D. Contractor, S. Reddy, D. Fried,
P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, D. Bahdanau, Y. Jernite, C. M. Ferrandis, S. Hughes,
“OPT: open pre-trained transformer language mod- T. Wolf, A. Guha, L. von Werra, and H. de Vries,
els,” CoRR, vol. abs/2205.01068, 2022. “Starcoder: may the source be with you!” CoRR,
[91] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, vol. abs/2305.06161, 2023. [Online]. Available:
K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, https://doi.org/10.48550/arXiv.2305.06161
D. Licht, J. Maillard, A. Sun, S. Wang, G. Wen- [99] H. Touvron, L. Martin, K. Stone, P. Albert, A. Alma-
zek, A. Youngblood, B. Akula, L. Barrault, G. M. hairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava,
Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. S. Bhosale et al., “Llama 2: Open foundation and fine-
Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, tuned chat models,” arXiv preprint arXiv:2307.09288,
N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, 2023.
V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, [100] A. Yang, B. Xiao, B. Wang, B. Zhang, C. Yin, C. Lv,
C. Ropers, S. Saleem, H. Schwenk, and J. Wang, “No D. Pan, D. Wang, D. Yan, F. Yang et al., “Baichuan
language left behind: Scaling human-centered ma- 2: Open large-scale language models,” arXiv preprint
chine translation,” CoRR, vol. abs/2207.04672, 2022. arXiv:2309.10305, 2023.
[92] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, [101] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan,
Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex: W. Ge, Y. Han, F. Huang et al., “Qwen technical
A pre-trained model for code generation with multi- report,” arXiv preprint arXiv:2309.16609, 2023.
lingual evaluations on humaneval-x,” arXiv preprint [102] X. Li, Y. Yao, X. Jiang, X. Fang, X. Meng, S. Fan,
arXiv:2303.17568, 2023. P. Han, J. Li, L. Du, B. Qin et al., “Flm-101b: An open
[93] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, llm and how to train it with $100K budget,” arXiv
Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, preprint arXiv:2309.03852, 2023.
Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and [103] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang,
J. Tang, “GLM-130B: an open bilingual pre-trained B. Li, C. Cheng, W. Lü, R. Hu et al., “Skywork:
model,” CoRR, vol. abs/2210.02414, 2022. [103] T. Wei, L. Zhao, L. Zhang, B. Zhu, L. Wang, H. Yang,
[94] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, preprint arXiv:2310.19341, 2023.
S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. [104] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat,
Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen,
K. Almubarak, S. Albanie, Z. Alyafeai, A. Web- “Gshard: Scaling giant models with conditional com-
son, E. Raff, and C. Raffel, “Crosslingual general- putation and automatic sharding,” in 9th International
ization through multitask finetuning,” CoRR, vol. Conference on Learning Representations, ICLR 2021, Vir-
abs/2211.01786, 2022. tual Event, Austria, May 3-7, 2021, 2021.
[95] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, [105] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, Y. E. Wang, K. Webster, M. Pellat, K. Robinson, K. S.
N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, Meier-Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V.
M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, Le, Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient
S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, scaling of language models with mixture-of-experts,”
M. Bavarian, C. Winter, P. Tillet, F. P. Such, in International Conference on Machine Learning, ICML
D. Cummings, M. Plappert, F. Chantzis, E. Barnes, 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022,
A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, pp. 5547–5569.
N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, [113] S. Smith, M. Patwary, B. Norick, P. LeGresley,
W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye,
V. Misra, E. Morikawa, A. Radford, M. Knight, G. Zerveas, V. Korthikanti, E. Zheng, R. Child,
M. Brundage, M. Murati, K. Mayer, P. Welinder, R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi,
B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, Y. He, M. Houston, S. Tiwary, and B. Catanzaro,
and W. Zaremba, “Evaluating large language models “Using deepspeed and megatron to train megatron-
trained on code,” CoRR, vol. abs/2107.03374, 2021. turing NLG 530b, A large-scale generative language
[106] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, model,” CoRR, vol. abs/2201.11990, 2022.
J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, [114] Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrit-
W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, twieser, R. Leblond, T. Eccles, J. Keeling, F. Gi-
X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, meno, A. D. Lago, T. Hubert, P. Choy, C. de Mas-
“ERNIE 3.0: Large-scale knowledge enhanced pre- son d’Autume, I. Babuschkin, X. Chen, P. Huang,
training for language understanding and genera- J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J.
tion,” CoRR, vol. abs/2107.02137, 2021. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas,
[107] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, K. Kavukcuoglu, and O. Vinyals, “Competition-level
“Jurassic-1: Technical details and evaluation,” White code generation with alphacode,” Science, 2022.
Paper. AI21 Labs, vol. 1, 2021. [115] S. Soltan, S. Ananthakrishnan, J. FitzGerald,
[108] B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls,
S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Srid-
S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim, har, F. Triefenbach, A. Verma, G. Tür, and P. Natara-
S. Kang, N. Ryu, K. M. Yoo, M. Chang, S. Suh, jan, “Alexatm 20b: Few-shot learning using a
S. In, J. Park, K. Kim, H. Kim, J. Jeong, Y. G. Yeo, large-scale multilingual seq2seq model,” CoRR, vol.
D. Ham, D. Park, M. Y. Lee, J. Kang, I. Kang, J. Ha, abs/2208.01448, 2022.
W. Park, and N. Sung, “What changes can large- [116] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides,
scale language models bring? intensive study on hy- V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chad-
perclova: Billions-scale korean generative pretrained wick, P. Thacker, L. Campbell-Gillingham, J. Ue-
transformers,” in Proceedings of the 2021 Conference sato, P. Huang, R. Comanescu, F. Yang, A. See,
on Empirical Methods in Natural Language Processing, S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S.
EMNLP 2021, Virtual Event / Punta Cana, Dominican Elias, R. Green, S. Mokrá, N. Fernando, B. Wu,
Republic, 7-11 November, 2021. Association for Com- R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mel-
putational Linguistics, 2021. lor, D. Hassabis, K. Kavukcuoglu, L. A. Hendricks,
[109] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, and G. Irving, “Improving alignment of dialogue
F. Li, H. Zhu, J. Luo, L. Xu et al., “Yuan 1.0: Large- agents via targeted human judgements,” CoRR, vol.
scale pre-trained language model in zero-shot and abs/2209.14375, 2022.
few-shot learning,” arXiv preprint arXiv:2110.04725, [117] H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and
2021. J. Zhou, “Welm: A well-read pre-trained language
[110] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, model for chinese,” CoRR, vol. abs/2209.10372, 2022.
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- [118] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So,
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowd-
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. hery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby,
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- Q. V. Le, and M. Dehghani, “Transcending scal-
plan, “A general language assistant as a laboratory ing laws with 0.1% extra compute,” CoRR, vol.
for alignment,” CoRR, vol. abs/2112.00861, 2021. abs/2210.11399, 2022.
[111] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, [119] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang,
S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov,
Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su,
S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu, Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion pa-
W. Zeng, G. Li, W. Gao, and H. Wang, “ERNIE rameter language model with sparse heterogeneous
3.0 titan: Exploring larger-scale knowledge enhanced computing,” CoRR, vol. abs/2303.10845, 2023.
pre-training for language understanding and gener- [120] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lep-
ation,” CoRR, vol. abs/2112.12731, 2021. ikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey,
[112] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Z. Chen et al., “Palm 2 technical report,” arXiv
Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, preprint arXiv:2305.10403, 2023.
B. Zoph, L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, [121] A. Radford, R. Józefowicz, and I. Sutskever, “Learn-
ing to generate reviews and discovering sentiment,” [133] OpenAI, “Gpt-4v(ision) system card,” OpenAI, 2023.
CoRR, vol. abs/1704.01444, 2017. [134] ——, “Lessons learned on language model safety
[122] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever and misuse,” OpenAI blog, 2022.
et al., “Improving language understanding by gener- [135] Meta, “Introducing meta llama 3: The most capable
ative pre-training,” 2018. openly available llm to date,” https://ai.meta.com/
[123] B. McCann, N. S. Keskar, C. Xiong, and R. Socher, blog/meta-llama-3/, 2024.
“The natural language decathlon: Multitask learning [136] “Introducing Llama 3.1: Our most capable models to
as question answering,” CoRR, vol. abs/1806.08730, date,” https://ai.meta.com/blog/meta-llama-3-1/,
2018. 2024.
[124] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, [137] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam-
X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT: ford, D. S. Chaplot, D. de las Casas, F. Bressand,
Large-scale generative pre-training for conversa- G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-
tional response generation,” in Proceedings of the 58th A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang,
Annual Meeting of the Association for Computational T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023.
Linguistics: System Demonstrations, ACL 2020, Online, [138] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
July 5-10, 2020, A. Celikyilmaz and T. Wen, Eds. B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas,
Association for Computational Linguistics, 2020, pp. E. B. Hanna, F. Bressand, G. Lengyel, G. Bour,
270–278. G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux,
[125] D. Ham, J. Lee, Y. Jang, and K. Kim, “End-to-end P. Stock, S. Subramanian, S. Yang, S. Antoniak,
neural pipeline for goal-oriented dialogue systems T. L. Scao, T. Gervet, T. Lavril, T. Wang,
using GPT-2,” in Proceedings of the 58th Annual Meet- T. Lacroix, and W. E. Sayed, “Mixtral of experts,”
ing of the Association for Computational Linguistics, CoRR, vol. abs/2401.04088, 2024. [Online]. Available:
ACL 2020, Online, July 5-10, 2020. Association for https://doi.org/10.48550/arXiv.2401.04088
Computational Linguistics, 2020, pp. 583–592. [139] T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju,
[126] I. Drori, S. Tran, R. Wang, N. Cheng, K. Liu, L. Tang, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love,
E. Ke, N. Singh, T. L. Patti, J. Lynch, A. Shporer, P. Tafti, L. Hussenot, A. Chowdhery, A. Roberts,
N. Verma, E. Wu, and G. Strang, “A neural network A. Barua, A. Botev, A. Castro-Ros, A. Slone,
solves and generates mathematics problems by pro- A. Héliou, A. Tacchetti, A. Bulanova, A. Paterson,
gram synthesis: Calculus, differential equations, lin- B. Tsai, B. Shahriari, C. L. Lan, C. A. Choquette-Choo,
ear algebra, and more,” CoRR, vol. abs/2112.15594, C. Crepy, D. Cer, D. Ippolito, D. Reid, E. Buchatskaya,
2021. E. Ni, E. Noland, G. Yan, G. Tucker, G. Muraru,
[127] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. G. Rozhdestvenskiy, H. Michalewski, I. Tenney, I. Gr-
Han, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim, ishchenko, J. Austin, J. Keeling, J. Labanowski,
C. Hallacy, J. Heidecke, P. Shyam, B. Power, T. E. J. Lespiau, J. Stanway, J. Brennan, J. Chen, J. Ferret,
Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P. J. Chiu et al., “Gemma: Open models based
Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov, on gemini research and technology,” CoRR, vol.
J. Jang, P. Welinder, and L. Weng, “Text and code abs/2403.08295, 2024.
embeddings by contrastive pre-training,” CoRR, vol. [140] M. Rivière, S. Pathak, P. G. Sessa, C. Hardin, S. Bhu-
abs/2201.10005, 2022. patiraju, L. Hussenot, T. Mesnard, B. Shahriari,
[128] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, A. Ramé, J. Ferret, P. Liu, P. Tafti, A. Friesen, M. Cas-
and O. Klimov, “Proximal policy optimization algo- bon, S. Ramos, R. Kumar, C. L. Lan, S. Jerome, A. Tsit-
rithms,” arXiv preprint arXiv:1707.06347, 2017. sulin, N. Vieillard, P. Stanczyk, S. Girgin, N. Mom-
[129] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, chev, M. Hoffman, S. Thakoor, J. Grill, B. Neyshabur,
R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. O. Bachem, A. Walton, A. Severyn, A. Parrish, A. Ah-
Christiano, “Learning to summarize from human mad, A. Hutchison, A. Abdagic, A. Carl, A. Shen,
feedback,” CoRR, vol. abs/2009.01325, 2020. A. Brock, A. Coenen, A. Laforge, A. Paterson, B. Bas-
[130] OpenAI, “Our approach to alignment research,” Ope- tian, B. Piot, B. Wu, B. Royal, C. Chen, C. Kumar,
nAI Blog, August 2022. C. Perry, C. Welty, C. A. Choquette-Choo, D. Sinopal-
[131] ——, “Introducing chatgpt,” OpenAI Blog, November nikov, D. Weinberger, D. Vijaykumar, D. Rogozin-
2022. ska, D. Herbison, E. Bandy, E. Wang, E. Noland,
[132] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, E. Moreira, E. Senter, E. Eltyshev, F. Visin, G. Rasskin,
S. Kadavath, B. Mann, E. Perez, N. Schiefer, G. Wei, G. Cameron, G. Martins, H. Hashemi,
K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Con- H. Klimczak-Plucinska, H. Batra, H. Dhand, I. Nar-
erly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, dini, J. Mein, J. Zhou, J. Svensson, J. Stanway, J. Chan,
S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernan- J. P. Zhou, J. Carrasqueira, J. Iljazi, J. Becker, J. Fer-
dez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, nandez, J. van Amersfoort, J. Gordon, J. Lipschultz,
C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, J. Newlan, J. Ji, K. Mohamed, K. Badola, K. Black,
T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Ka- K. Millican, K. McDonell, K. Nguyen, K. Sodhia,
plan, and J. Clark, “Red teaming language models K. Greene, L. L. Sjösund, L. Usui, L. Sifre, L. Heuer-
to reduce harms: Methods, scaling behaviors, and mann, L. Lago, and L. McNealus, “Gemma 2: Im-
lessons learned,” CoRR, vol. abs/2209.07858, 2022. proving open language models at a practical size,”
107

CoRR, vol. abs/2408.00118, 2024. [151] Y. Ji, Y. Deng, Y. Gong, Y. Peng, Q. Niu, B. Ma,
[141] A. Yang, B. Yang, B. Hui, B. Zheng, B. Yu, C. Zhou, and X. Li, “Belle: Be everyone’s large language
C. Li, C. Li, D. Liu, F. Huang, G. Dong, H. Wei, H. Lin, model engine,” https://github.com/LianjiaTech/
J. Tang, J. Wang, J. Yang, J. Tu, J. Zhang, J. Ma, J. Xu, BELLE, 2023.
J. Zhou, J. Bai, J. He, J. Lin, K. Dang, K. Lu, K. Chen, [152] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu,
K. Yang, M. Li, M. Xue, N. Ni, P. Zhang, P. Wang, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.
R. Peng, R. Men, R. Gao, R. Lin, S. Wang, S. Bai, Gonzalez, I. Stoica, and E. P. Xing, “Vicuna:
S. Tan, T. Zhu, T. Li, T. Liu, W. Ge, X. Deng, X. Zhou, An open-source chatbot impressing gpt-4 with
X. Ren, X. Zhang, X. Wei, X. Ren, Y. Fan, Y. Yao, 90%* chatgpt quality,” 2023. [Online]. Available:
Y. Zhang, Y. Wan, Y. Chu, Y. Liu, Z. Cui, Z. Zhang, https://vicuna.lmsys.org
and Z. Fan, “Qwen2 technical report,” arXiv preprint [153] D. Eccleston, “Sharegpt,” https://sharegpt.com/,
arXiv:2407.10671, 2024. 2023.
[142] Q. Team, “Qwen2.5: A party of foundation [154] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction
models,” September 2024. [Online]. Available: tuning,” CoRR, vol. abs/2304.08485, 2023.
https://qwenlm.github.io/blog/qwen2.5/ [155] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny,
[143] T. GLM, A. Zeng, B. Xu, B. Wang, C. Zhang, D. Yin, “Minigpt-4: Enhancing vision-language understand-
D. Rojas, G. Feng, H. Zhao, H. Lai, H. Yu, H. Wang, ing with advanced large language models,” CoRR,
J. Sun, J. Zhang, J. Cheng, J. Gui, J. Tang, J. Zhang, vol. abs/2304.10592, 2023.
J. Li, L. Zhao, L. Wu, L. Zhong, M. Liu, M. Huang, [156] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang,
P. Zhang, Q. Zheng, R. Lu, S. Duan, S. Zhang, S. Cao, B. Li, P. Fung, and S. C. H. Hoi, “Instructblip: To-
S. Yang, W. L. Tam, W. Zhao, X. Liu, X. Xia, X. Zhang, wards general-purpose vision-language models with
X. Gu, X. Lv, X. Liu, X. Liu, X. Yang, X. Song, instruction tuning,” CoRR, vol. abs/2305.06500, 2023.
X. Zhang, Y. An, Y. Xu, Y. Niu, Y. Yang, Y. Li, Y. Bai, [157] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai,
Y. Dong, Z. Qi, Z. Wang, Z. Yang, Z. Du, Z. Hou, “Pandagpt: One model to instruction-follow them
and Z. Wang, “Chatglm: A family of large language all,” 2023.
models from glm-130b to glm-4 all tools,” 2024. [158] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov,
[144] H. Zhong, C. Xiao, C. Tu, T. Zhang, Z. Liu, and R. Urtasun, A. Torralba, and S. Fidler, “Aligning
M. Sun, “JEC-QA: A legal-domain question answer- books and movies: Towards story-like visual expla-
ing dataset,” in The Thirty-Fourth AAAI Conference nations by watching movies and reading books,” in
on Artificial Intelligence, AAAI 2020, The Thirty-Second 2015 IEEE International Conference on Computer Vision,
Innovative Applications of Artificial Intelligence Confer- ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE
ence, IAAI 2020, The Tenth AAAI Symposium on Edu- Computer Society, 2015, pp. 19–27.
cational Advances in Artificial Intelligence, EAAI 2020, [159] “Project gutenberg.” [Online]. Available: https:
New York, NY, USA, February 7-12, 2020. AAAI Press, //www.gutenberg.org/
2020, pp. 9701–9708. [160] T. H. Trinh and Q. V. Le, “A simple method for com-
[145] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, monsense reasoning,” CoRR, vol. abs/1806.02847,
and P. Szolovits, “What disease does this patient 2018.
have? a large-scale open domain question answer- [161] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk,
ing dataset from medical exams,” Applied Sciences, A. Farhadi, F. Roesner, and Y. Choi, “Defending
vol. 11, no. 14, p. 6421, 2021. against neural fake news,” in Advances in Neu-
[146] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, ral Information Processing Systems 32: Annual Confer-
C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford ence on Neural Information Processing Systems 2019,
alpaca: An instruction-following llama model,” NeurIPS 2019, December 8-14, 2019, Vancouver, BC,
https://github.com/tatsu-lab/stanford alpaca, Canada, H. M. Wallach, H. Larochelle, A. Beygelz-
2023. imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds.,
[147] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, 2019, pp. 9051–9062.
D. Khashabi, and H. Hajishirzi, “Self-instruct: Align- [162] A. Gokaslan, V. C. E. Pavlick, and S. Tellex,
ing language model with self generated instruc- “Openwebtext corpus,” http://Skylion007.github.
tions,” CoRR, vol. abs/2212.10560, 2022. io/OpenWebTextCorpus, 2019.
[148] Alpaca-LoRA, “Instruct-tune llama on consumer [163] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire,
hardware,” https://github.com/tloen/alpaca-lora, and J. Blackburn, “The pushshift reddit dataset,” in
2023. Proceedings of the Fourteenth International AAAI Con-
[149] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, ference on Web and Social Media, ICWSM 2020, Held
S. Wang, L. Wang, and W. Chen, “Lora: Low-rank Virtually, Original Venue: Atlanta, Georgia, USA, June
adaptation of large language models,” in The Tenth 8-11, 2020. AAAI Press, 2020, pp. 830–839.
International Conference on Learning Representations, [164] “Wikipedia.” [Online]. Available: https://en.
ICLR 2022, Virtual Event, April 25-29, 2022. Open- wikipedia.org/wiki/Main Page
Review.net, 2022. [165] “Bigquery dataset.” [Online]. Available: https:
[150] X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, //cloud.google.com/bigquery?hl=zh-cn
S. Levine, and D. Song, “Koala: A dialogue model for [166] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe,
academic research,” Blog post, April 2023. C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
108

S. Presser, and C. Leahy, “The pile: An 800gb dataset guage crowdsourcing instructions,” in Proceedings of
of diverse text for language modeling,” CoRR, vol. the 60th Annual Meeting of the Association for Com-
abs/2101.00027, 2021. putational Linguistics (Volume 1: Long Papers), ACL
[167] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V. 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan,
del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G. P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–
Ponferrada, H. Nguyen et al., “The bigscience roots 3487.
corpus: A 1.6 tb composite multilingual dataset,” in [180] S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel,
Thirty-sixth Conference on Neural Information Process- N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry,
ing Systems Datasets and Benchmarks Track, 2022. Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David,
[168] “Common crawl.” [Online]. Available: https:// C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S.
commoncrawl.org/ AlShaibani, S. Sharma, U. Thakker, K. Almubarak,
[169] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud- X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush,
hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet: “Promptsource: An integrated development environ-
Extracting high quality monolingual datasets from ment and repository for natural language prompts,”
web crawl data,” in Proceedings of The 12th Language in ACL (demo). Association for Computational Lin-
Resources and Evaluation Conference, 2020, pp. 4003– guistics, 2022, pp. 93–104.
4012. [181] T. Tang, J. Li, W. X. Zhao, and J. Wen, “MVP: multi-
[170] T. Computer, “Redpajama: an open dataset for train- task supervised pre-training for natural language
ing large language models,” https://github.com/ generation,” CoRR, vol. abs/2206.12131, 2022.
togethercomputer/RedPajama-Data, 2023. [182] H. Nguyen, S. Suri, K. Tsui, Shahules786, T. team,
[171] G. Penedo, Q. Malartic, D. Hesslow, R. Cojocaru, and C. Schuhmann, “The oig dataset,” https://laion.
A. Cappelli, H. Alobeidli, B. Pannier, E. Almazrouei, ai/blog/oig-dataset/, 2023.
and J. Launay, “The RefinedWeb dataset for Falcon [183] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen,
LLM: outperforming curated corpora with web data, N. DasSarma, D. Drain, S. Fort, D. Ganguli,
and web data only,” arXiv preprint arXiv:2306.01116, T. Henighan, N. Joseph, S. Kadavath, J. Kernion,
2023. T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-
[172] C. B. Clement, M. Bierbaum, K. P. O’Keeffe, and Dodds, D. Hernandez, T. Hume, S. Johnston,
A. A. Alemi, “On the use of arxiv as a dataset,” arXiv S. Kravec, L. Lovitt, N. Nanda, C. Olsson,
preprint arXiv:1905.00075, 2019. D. Amodei, T. B. Brown, J. Clark, S. McCandlish,
[173] K. Lo, L. L. Wang, M. Neumann, R. Kinney, and C. Olah, B. Mann, and J. Kaplan, “Training a helpful
D. Weld, “S2ORC: The semantic scholar open re- and harmless assistant with reinforcement learning
search corpus,” in ACL, 2020. from human feedback,” CoRR, vol. abs/2204.05862,
[174] L. Soldaini and K. Lo, “peS2o (Pretraining Efficiently 2022. [Online]. Available: https://doi.org/10.48550/
on S2ORC) Dataset,” ODC-By, https://github.com/ arXiv.2204.05862
allenai/pes2o, 2023. [184] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding,
[175] D. Kocetkov, R. Li, L. B. Allal, J. Li, C. Mou, C. M. J. Yue, and Y. Wu, “How close is chatgpt to human
Ferrandis, Y. Jernite, M. Mitchell, S. Hughes, T. Wolf experts? comparison corpus, evaluation, and detec-
et al., “The stack: 3 tb of permissively licensed source tion,” arXiv preprint arXiv:2301.07597, 2023.
code,” arXiv preprint arXiv:2211.15533, 2022. [185] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan,
[176] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion S. Shah, A. Ghodsi, P. Wendell, M. Zaharia, and
Parameter Autoregressive Language Model,” https: R. Xin. (2023) Free dolly: Introducing the world’s first
//github.com/kingoflolz/mesh-transformer-jax, truly open instruction-tuned llm.
2021. [186] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z.-
[177] L. Soldaini, R. Kinney, A. Bhagia, D. Schwenk, R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stan-
D. Atkinson, R. Authur, B. Bogin, K. Chandu, J. Du- ley, R. Nagyfi et al., “Openassistant conversations–
mas, Y. Elazar, V. Hofmann, A. H. Jha, S. Kumar, democratizing large language model alignment,”
L. Lucy, X. Lyu, N. Lambert, I. Magnusson, J. Morri- arXiv preprint arXiv:2304.07327, 2023.
son, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, [187] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
A. Ravichander, K. Richardson, Z. Shen, E. Strubell, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford
N. Subramani, O. Tafjord, P. Walsh, L. Zettlemoyer, alpaca: An instruction-following llama model,”
N. A. Smith, H. Hajishirzi, I. Beltagy, D. Groeneveld, https://github.com/tatsu-lab/stanford alpaca,
J. Dodge, and K. Lo, “Dolma: an open corpus of 2023.
three trillion tokens for language model pretraining [188] J. Cheung, “Guanaco - generative universal assistant
research,” arXiv preprint arXiv:2402.00159, 2024. for natural-language adaptive context-aware om-
[178] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kin- nilingual outputs,” https://guanaco-model.github.
ney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, io/, 2023.
Y. Wang et al., “Olmo: Accelerating the science of [189] C. Xu, D. Guo, N. Duan, and J. McAuley,
language models,” arXiv preprint arXiv:2402.00838, “Baize: An open-source chat model with parameter-
2024. efficient tuning on self-chat data,” arXiv preprint
[179] S. Mishra, D. Khashabi, C. Baral, and H. Ha- arXiv:2304.01196, 2023.
jishirzi, “Cross-task generalization via natural lan- [190] Y. Ji, Y. Gong, Y. Deng, Y. Peng, Q. Niu, B. Ma,
109

and X. Li, “Towards better instruction following 2020, pp. 4157–4165.
language models for chinese: Investigating the im- [201] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
pact of training data and evaluation,” arXiv preprint langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
arXiv:2304.07854, 2023. towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
[191] K. Ethayarajh, Y. Choi, and S. Swayamdipta, “Under- Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
standing dataset difficulty with V -usable informa- M. Drame, Q. Lhoest, and A. M. Rush, “Transform-
tion,” in Proceedings of the 39th International Conference ers: State-of-the-art natural language processing,” in
on Machine Learning, 2022, pp. 5988–6008. Proceedings of the 2020 Conference on Empirical Methods
[192] N. Lambert, L. Tunstall, N. Rajani, in Natural Language Processing: System Demonstrations,
and T. Thrush. (2023) Huggingface h4 EMNLP 2020 - Demos, Online, November 16-20, 2020.
stack exchange preference dataset. [On- Association for Computational Linguistics, 2020, pp.
line]. Available: https://huggingface.co/datasets/ 38–45.
HuggingFaceH4/stack-exchange-preferences [202] J. Bradbury, R. Frostig, P. Hawkins, M. J.
[193] R. Liu, R. Yang, C. Jia, G. Zhang, D. Zhou, A. M. Johnson, C. Leary, D. Maclaurin, G. Necula,
Dai, D. Yang, and S. Vosoughi, “Training socially A. Paszke, J. VanderPlas, S. Wanderman-Milne,
aligned language models in simulated human soci- and Q. Zhang, “JAX: composable transformations
ety,” CoRR, vol. abs/2305.16960, 2023. of Python+NumPy programs,” 2018. [Online].
[194] G. Xu, J. Liu, M. Yan, H. Xu, J. Si, Z. Zhou, P. Yi, Available: http://github.com/google/jax
X. Gao, J. Sang, R. Zhang, J. Zhang, C. Peng, [203] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang,
F. Huang, and J. Zhou, “Cvalues: Measuring the F. Cui, and Y. You, “Colossal-ai: A unified deep learn-
values of chinese large language models from safety ing system for large-scale parallel training,” CoRR,
to responsibility,” 2023. vol. abs/2110.14883, 2021.
[195] J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and [204] J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrick-
Y. Yang, “Safe rlhf: Safe reinforcement learning from star: Parallel training of pre-trained models via
human feedback,” arXiv preprint arXiv:2310.12773, a chunk-based memory management,” CoRR, vol.
2023. abs/2108.05818, 2021.
[196] V. Sanh, A. Webson, C. Raffel, S. H. Bach, [205] Y. You, “Colossalchat: An open-source solution
L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, for cloning chatgpt with a complete
A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. rlhf pipeline,” 2023. [Online]. Available:
Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. https://medium.com/@yangyou berkeley/
Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, colossalchat-an-open-source-solution-for-cloning-
M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Baw- chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b
den, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. San- [206] “Bmtrain: Effient training for big models.” [Online].
tilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bi- Available: https://github.com/OpenBMB/BMTrain
derman, L. Gao, T. Wolf, and A. M. Rush, “Multitask [207] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang,
prompted training enables zero-shot task generaliza- “Fastmoe: A fast mixture-of-expert training system,”
tion,” in The Tenth International Conference on Learning CoRR, vol. abs/2103.13262, 2021.
Representations, ICLR 2022, Virtual Event, April 25-29, [208] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng,
2022. OpenReview.net, 2022. C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica,
[197] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. “Efficient memory management for large language
Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei model serving with pagedattention,” in Proceedings
et al., “The flan collection: Designing data and meth- of the ACM SIGOPS 29th Symposium on Operating
ods for effective instruction tuning,” arXiv preprint Systems Principles, 2023.
arXiv:2301.13688, 2023. [209] (2023) Deepspeed-mii. [Online]. Available: https:
[198] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, //github.com/microsoft/DeepSpeed-MII
R. Nakano, C. Hesse, and J. Schulman, “Training [210] Z. Yao, R. Y. Aminabadi, O. Ruwase, S. Rajbhan-
verifiers to solve math word problems,” CoRR, vol. dari, X. Wu, A. A. Awan, J. Rasley, M. Zhang,
abs/2110.14168, 2021. C. Li, C. Holmes, Z. Zhou, M. Wyatt, M. Smith,
[199] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, L. Kurilenko, H. Qin, M. Tanaka, S. Che, S. L. Song,
and J. Berant, “Did aristotle use a laptop? A ques- and Y. He, “DeepSpeed-Chat: Easy, Fast and Afford-
tion answering benchmark with implicit reasoning able RLHF Training of ChatGPT-like Models at All
strategies,” Trans. Assoc. Comput. Linguistics, vol. 9, Scales,” arXiv preprint arXiv:2308.01320, 2023.
pp. 346–361, 2021. [211] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Brad-
[200] O. Camburu, B. Shillingford, P. Minervini, bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein,
T. Lukasiewicz, and P. Blunsom, “Make up your L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang,
mind! adversarial generation of inconsistent natural Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy,
language explanations,” in Proceedings of the 58th B. Steiner, L. Fang, J. Bai, and S. Chintala, “Py-
Annual Meeting of the Association for Computational torch: An imperative style, high-performance deep
Linguistics, ACL 2020, Online, July 5-10, 2020, learning library,” in Advances in Neural Information
D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Processing Systems 32: Annual Conference on Neural
Eds. Association for Computational Linguistics, Information Processing Systems 2019, NeurIPS 2019,
110

December 8-14, 2019, Vancouver, BC, Canada, H. M. [224] S. Black, L. Gao, P. Wang, C. Leahy, and S. Bider-
Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché- man, “GPT-Neo: Large Scale Autoregressive Lan-
Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 8024– guage Modeling with Mesh-Tensorflow,” 2021.
8035. [225] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn,
[212] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, “A systematic evaluation of large language models
J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Is- of code,” in MAPS@PLDI, 2022.
ard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, [226] A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neu-
D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, big, “Language models of code are few-shot com-
P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensor- monsense learners,” in Proceedings of the 2022 Confer-
flow: A system for large-scale machine learning,” in ence on Empirical Methods in Natural Language Process-
12th USENIX Symposium on Operating Systems Design ing, EMNLP 2022, Abu Dhabi, United Arab Emirates,
and Implementation, OSDI 2016, Savannah, GA, USA, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and
November 2-4, 2016, K. Keeton and T. Roscoe, Eds. Y. Zhang, Eds. Association for Computational Lin-
USENIX Association, 2016, pp. 265–283. guistics, 2022, pp. 1384–1403.
[213] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, [227] S. Longpre, G. Yauney, E. Reif, K. Lee, A. Roberts,
T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: B. Zoph, D. Zhou, J. Wei, K. Robinson, D. Mimno
A flexible and efficient machine learning library et al., “A pretrainer’s guide to training data: Measur-
for heterogeneous distributed systems,” CoRR, vol. ing the effects of data age, domain coverage, quality,
abs/1512.01274, 2015. & toxicity,” arXiv preprint arXiv:2305.13169, 2023.
[214] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: [228] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge,
An open-source deep learning platform from indus- D. Gao, Y. Xie, Z. Liu, J. Gao, Y. Li, B. Ding, and
trial practice,” Frontiers of Data and Domputing, vol. 1, J. Zhou, “Data-juicer: A one-stop data processing
no. 1, p. 105, 2019. system for large language models,” 2023.
[215] L. Huawei Technologies Co., “Huawei mindspore [229] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja,
ai development framework,” in Artificial Intelligence A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
Technology. Springer, 2022, pp. 137–162. A. Bakhtiari, H. Behl et al., “Phi-3 technical report:
[216] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, A highly capable language model locally on your
F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “One- phone,” arXiv preprint arXiv:2404.14219, 2024.
flow: Redesign the distributed deep learning frame- [230] G. Penedo, H. Kydlı́ček, A. Lozhkov, M. Mitchell,
work from scratch,” CoRR, vol. abs/2110.15032, 2021. C. Raffel, L. Von Werra, T. Wolf et al., “The fineweb
[217] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, datasets: Decanting the web for the finest text data at
Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and scale,” arXiv preprint arXiv:2406.17557, 2024.
J. Weston, “Recipes for building an open-domain [231] P. Maini, S. Seto, H. Bai, D. Grangier, Y. Zhang, and
chatbot,” in Proceedings of the 16th Conference of the N. Jaitly, “Rephrasing the web: A recipe for compute
European Chapter of the Association for Computational and data-efficient language modeling,” in ICLR 2024
Linguistics: Main Volume, EACL 2021, Online, April 19 Workshop on Navigating and Addressing Data Problems
- 23, 2021, 2021, pp. 300–325. for Foundation Models, 2024.
[218] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, [232] M. Marion, A. Üstün, L. Pozzobon, A. Wang,
H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, M. Fadaee, and S. Hooker, “When less is more: Inves-
I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, tigating data pruning for pretraining llms at scale,”
G. Gur-Ari, and V. Misra, “Solving quantitative rea- arXiv preprint arXiv:2309.04564, 2023.
soning problems with language models,” CoRR, vol. [233] N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong,
abs/2206.14858, 2022. E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng,
[219] T. Saier, J. Krause, and M. Färber, “unarxive 2022: “How to train data-efficient llms,” arXiv preprint
All arxiv publications pre-processed for nlp, includ- arXiv:2402.09668, 2024.
ing structured full-text and citation network,” arXiv [234] D. Hernandez, T. B. Brown, T. Conerly, N. Das-
preprint arXiv:2303.14957, 2023. Sarma, D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-
[220] H. A. Simon, “Experiments with a heuristic com- Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann,
piler,” J. ACM, vol. 10, no. 4, pp. 493–506, 1963. C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Ka-
[221] Z. Manna and R. J. Waldinger, “Toward automatic plan, and S. McCandlish, “Scaling laws and inter-
program synthesis,” Commun. ACM, vol. 14, no. 3, pretability of learning from repeated data,” CoRR,
pp. 151–165, 1971. vol. abs/2205.10487, 2022.
[222] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, [235] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “The curious case of neural text degeneration,” in 8th
“Codebert: A pre-trained model for programming International Conference on Learning Representations,
and natural languages,” in Findings of EMNLP, 2020. ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
[223] J. Austin, A. Odena, M. I. Nye, M. Bosma, OpenReview.net, 2020.
H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, [236] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck,
M. Terry, Q. V. Le, and C. Sutton, “Program syn- C. Callison-Burch, and N. Carlini, “Deduplicating
thesis with large language models,” CoRR, vol. training data makes language models better,” in Pro-
abs/2108.07732, 2021. ceedings of the 60th Annual Meeting of the Association
111

for Computational Linguistics (Volume 1: Long Papers), Conference on Learning Representations, ICLR 2020,
ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. Addis Ababa, Ethiopia, April 26-30, 2020. OpenRe-
8424–8445. view.net, 2020.
[237] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, [248] K. Tirumala, D. Simig, A. Aghajanyan, and A. S. Mor-
and C. Zhang, “Quantifying memorization across cos, “D4: Improving llm pretraining via document
neural language models,” CoRR, 2022. de-duplication and diversification,” arXiv preprint
[238] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicat- arXiv:2308.12284, 2023.
ing training data mitigates privacy risks in language [249] Z. Shen, T. Tao, L. Ma, W. Neiswanger, J. Hes-
models,” in International Conference on Machine Learn- tness, N. Vassilieva, D. Soboleva, and E. Xing,
ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland, “Slimpajama-dc: Understanding data combinations
USA. PMLR, 2022, pp. 10 697–10 707. for llm training,” arXiv preprint arXiv:2309.10818,
[239] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, 2023.
“Conditional random fields: Probabilistic models [250] S. M. Xie, S. Santurkar, T. Ma, and P. Liang, “Data
for segmenting and labeling sequence data,” in selection for language models via importance resam-
Proceedings of the Eighteenth International Conference pling,” arXiv preprint arXiv:2302.03169, 2023.
on Machine Learning (ICML 2001), Williams College, [251] X. Wang, W. Zhou, Q. Zhang, J. Zhou, S. Gao,
Williamstown, MA, USA, June 28 - July 1, 2001, C. E. J. Wang, M. Zhang, X. Gao, Y. Chen, and T. Gui,
Brodley and A. P. Danyluk, Eds. Morgan Kaufmann, “Farewell to aimless large-scale pretraining: Influ-
2001, pp. 282–289. ential subset selection for language model,” arXiv
[240] P. Gage, “A new algorithm for data compression,” C preprint arXiv:2305.12816, 2023.
Users Journal, vol. 12, no. 2, pp. 23–38, 1994. [252] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N.
[241] R. Sennrich, B. Haddow, and A. Birch, “Neural ma- Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda,
chine translation of rare words with subword units,” and R. Fernández, “The LAMBADA dataset: Word
in Proceedings of the 54th Annual Meeting of the Associa- prediction requiring a broad discourse context,” in
tion for Computational Linguistics, ACL 2016, August 7- ACL (1). The Association for Computer Linguistics,
12, 2016, Berlin, Germany, Volume 1: Long Papers. The 2016.
Association for Computer Linguistics, 2016. [253] M. F. Chen, N. Roberts, K. Bhatia, J. Wang, C. Zhang,
[242] M. Schuster and K. Nakajima, “Japanese and korean F. Sala, and C. Ré, “Skill-it! a data-driven skills
voice search,” in 2012 IEEE international conference on framework for understanding and training language
acoustics, speech and signal processing (ICASSP). IEEE, models,” arXiv preprint arXiv:2307.14430, 2023.
2012, pp. 5149–5152. [254] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla,
[243] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez,
W. Macherey, M. Krikun, Y. Cao, Q. Gao, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton,
K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, M. Bhatt, C. Canton-Ferrer, A. Grattafiori, W. Xiong,
L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, A. Défossez, J. Copet, F. Azhar, H. Touvron, L. Mar-
K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, tin, N. Usunier, T. Scialom, and G. Synnaeve, “Code
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor- llama: Open foundation models for code,” CoRR, vol.
rado, M. Hughes, and J. Dean, “Google’s neural abs/2308.12950, 2023.
machine translation system: Bridging the gap be- [255] Y. Bengio, J. Louradour, R. Collobert, and J. Weston,
tween human and machine translation,” CoRR, vol. “Curriculum learning,” in ICML, 2009, pp. 41–48.
abs/1609.08144, 2016. [256] C. Xu, C. Rosset, L. Del Corro, S. Mahajan,
[244] T. Kudo, “Subword regularization: Improving neural J. McAuley, J. Neville, A. H. Awadallah, and N. Rao,
network translation models with multiple subword “Contrastive post-training large language models
candidates,” in Proceedings of the 56th Annual Meeting on data curriculum,” arXiv preprint arXiv:2310.02263,
of the Association for Computational Linguistics, ACL 2023.
2018, Melbourne, Australia, July 15-20, 2018, Volume 1: [257] S. Tworkowski, K. Staniszewski, M. Pacek, Y. Wu,
Long Papers, I. Gurevych and Y. Miyao, Eds. Associ- H. Michalewski, and P. Milos, “Focused transformer:
ation for Computational Linguistics, 2018, pp. 66–75. Contrastive training for context scaling,” CoRR, vol.
[245] T. Kudo and J. Richardson, “Sentencepiece: A simple abs/2307.03170, 2023.
and language independent subword tokenizer and [258] Z. Azerbayev, H. Schoelkopf, K. Paster, M. D. Santos,
detokenizer for neural text processing,” in Proceed- S. McAleer, A. Q. Jiang, J. Deng, S. Biderman, and
ings of the 2018 Conference on Empirical Methods in S. Welleck, “Llemma: An open language model for
Natural Language Processing, EMNLP 2018: System mathematics,” arXiv preprint arXiv:2310.10631, 2023.
Demonstrations, Brussels, Belgium, October 31 - Novem- [259] S. Chen, S. Wong, L. Chen, and Y. Tian, “Extend-
ber 4, 2018, E. Blanco and W. Lu, Eds. Association ing context window of large language models via
for Computational Linguistics, 2018. positional interpolation,” CoRR, vol. abs/2306.15595,
[246] M. Davis and M. Dürst, “Unicode normalization 2023.
forms,” 2001. [260] G. Wenzek, M.-A. Lachaux, A. Conneau, V. Chaud-
[247] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, hary, F. Guzmán, A. Joulin, and É. Grave, “Ccnet:
and I. Sutskever, “Deep double descent: Where big- Extracting high quality monolingual datasets from
ger models and more data hurt,” in 8th International web crawl data,” in Proceedings of the Twelfth Language
112

Resources and Evaluation Conference, 2020, pp. 4003– J. Wang, and F. Wei, “Retentive network: A successor
4012. to transformer for large language models,” arXiv
[261] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, preprint arXiv:2307.08621, 2023.
“Bag of tricks for efficient text classification,” in [272] A. Gu and T. Dao, “Mamba: Linear-time sequence
EACL, 2017, pp. 427–431. modeling with selective state spaces,” CoRR, vol.
[262] D. Chen, Y. Huang, Z. Ma, H. Chen, X. Pan, C. Ge, abs/2312.00752, 2023.
D. Gao, Y. Xie, Z. Liu, J. Gao et al., “Data-juicer: A [273] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Ar-
one-stop data processing system for large language cadinho, H. Cao, X. Cheng, M. Chung, M. Grella,
models,” arXiv preprint arXiv:2309.02033, 2023. K. K. GV et al., “Rwkv: Reinventing rnns for the
[263] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia, transformer era,” arXiv preprint arXiv:2305.13048,
J. Shen, and O. Firat, “Examining scaling and transfer 2023.
of language model architectures for machine transla- [274] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou,
tion,” in International Conference on Machine Learning, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang,
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, “Cogview: Mastering text-to-image generation via
2022, pp. 26 176–26 192. transformers,” in Advances in Neural Information Pro-
[264] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, cessing Systems 34: Annual Conference on Neural Infor-
J. Gao, M. Zhou, and H. Hon, “Unified language mation Processing Systems 2021, NeurIPS 2021, Decem-
model pre-training for natural language understand- ber 6-14, 2021, virtual, 2021, pp. 19 822–19 835.
ing and generation,” in Advances in Neural Information [275] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-
Processing Systems 32: Annual Conference on Neural ization,” vol. abs/1607.06450, 2016.
Information Processing Systems 2019, NeurIPS 2019, [276] B. Zhang and R. Sennrich, “Root mean square layer
December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. normalization,” in Advances in Neural Information
13 042–13 054. Processing Systems 32: Annual Conference on Neural
[265] A. Clark, D. de Las Casas, A. Guy, A. Mensch, Information Processing Systems 2019, NeurIPS 2019,
M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht- December 8-14, 2019, Vancouver, BC, Canada, 2019, pp.
man, T. Cai, S. Borgeaud, G. van den Driessche, 12 360–12 371.
E. Rutherford, T. Hennigan, M. J. Johnson, A. Cas- [277] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang,
sirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, and F. Wei, “Deepnet: Scaling transformers to 1, 000
S. Osindero, O. Vinyals, M. Ranzato, J. W. Rae, layers,” vol. abs/2203.00555, 2022.
E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Uni- [278] V. Nair and G. E. Hinton, “Rectified linear units im-
fied scaling laws for routed language models,” in prove restricted boltzmann machines,” in Proceedings
International Conference on Machine Learning, ICML of the 27th international conference on machine learning
2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, (ICML-10), 2010, pp. 807–814.
pp. 4057–4086. [279] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy,
[266] A. Gu, K. Goel, and C. Ré, “Efficiently modeling and S. R. Bowman, “GLUE: A multi-task benchmark
long sequences with structured state spaces,” and analysis platform for natural language under-
in The Tenth International Conference on Learning standing,” in Proceedings of the Workshop: Analyz-
Representations, ICLR 2022, Virtual Event, April 25-29, ing and Interpreting Neural Networks for NLP, Black-
2022. OpenReview.net, 2022. [Online]. Available: boxNLP@EMNLP 2018, Brussels, Belgium, November 1,
https://openreview.net/forum?id=uYLFoz1vlAC 2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds.
[267] J. T. Smith, A. Warrington, and S. Linderman, “Sim- Association for Computational Linguistics, 2018, pp.
plified state space layers for sequence modeling,” in 353–355.
ICLR, 2023. [280] P. Ramachandran, B. Zoph, and Q. V. Le,
[268] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gul- “Searching for activation functions,” arXiv preprint
cehre, R. Pascanu, and S. De, “Resurrecting recurrent arXiv:1710.05941, 2017.
neural networks for long sequences,” in ICML, 2023. [281] N. Shazeer, “GLU variants improve transformer,”
[269] M. Poli, S. Massaroli, E. Nguyen, D. Y. Fu, T. Dao, vol. abs/2002.05202, 2020.
S. Baccus, Y. Bengio, S. Ermon, and C. Ré, “Hyena [282] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer:
hierarchy: Towards larger convolutional language Enhanced transformer with rotary position embed-
models,” in ICML, 2023. ding,” vol. abs/2104.09864, 2021.
[270] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, [283] O. Press, N. A. Smith, and M. Lewis, “Train short,
S. Arcadinho, H. Cao, X. Cheng, M. Chung, test long: Attention with linear biases enables input
M. Grella, K. K. G. V., X. He, H. Hou, P. Kazienko, length extrapolation,” in The Tenth International Con-
J. Kocon, J. Kong, B. Koptyra, H. Lau, K. S. I. ference on Learning Representations, ICLR 2022, Virtual
Mantri, F. Mom, A. Saito, X. Tang, B. Wang, Event, April 25-29, 2022, 2022.
J. S. Wind, S. Wozniak, R. Zhang, Z. Zhang, [284] S. Ioffe and C. Szegedy, “Batch normalization:
Q. Zhao, P. Zhou, J. Zhu, and R. Zhu, “RWKV: Accelerating deep network training by reducing
reinventing rnns for the transformer era,” CoRR, internal covariate shift,” in Proceedings of the
vol. abs/2305.13048, 2023. [Online]. Available: 32nd International Conference on Machine Learning,
https://doi.org/10.48550/arXiv.2305.13048 ICML 2015, Lille, France, 6-11 July 2015, ser.
[271] Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, JMLR Workshop and Conference Proceedings,
113

F. R. Bach and D. M. Blei, Eds., vol. 37. Linguistics, 2019, pp. 2978–2988. [Online]. Available:
JMLR.org, 2015, pp. 448–456. [Online]. Available: https://doi.org/10.18653/v1/p19-1285
http://proceedings.mlr.press/v37/ioffe15.html [294] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhut-
[285] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry, dinov, and Q. V. Le, “Xlnet: Generalized autoregres-
M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, sive pretraining for language understanding,” Ad-
Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts, vances in neural information processing systems, vol. 32,
and C. Raffel, “Do transformer modifications transfer 2019.
across implementations and applications?” in Pro- [295] B. Peng, J. Quesnelle, H. Fan, and E. Shippole, “Yarn:
ceedings of the 2021 Conference on Empirical Methods Efficient context window extension of large language
in Natural Language Processing, EMNLP 2021, Virtual models,” CoRR, vol. abs/2309.00071, 2023.
Event / Punta Cana, Dominican Republic, 7-11 Novem- [296] Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang,
ber, 2021, 2021, pp. 5758–5773. A. Benhaim, V. Chaudhary, X. Song, and F. Wei,
[286] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, “A length-extrapolatable transformer,” CoRR, vol.
H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer abs/2212.10554, 2022. [Online]. Available: https:
normalization in the transformer architecture,” in //doi.org/10.48550/arXiv.2212.10554
ICML, 2020. [297] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A.
[287] A. Baevski and M. Auli, “Adaptive input repre- Smith, and L. Kong, “Random feature attention,”
sentations for neural language modeling,” in 7th in 9th International Conference on Learning Representa-
International Conference on Learning Representations, tions, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. [298] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie,
OpenReview.net, 2019. C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang,
[288] L. Liu, X. Liu, J. Gao, W. Chen, and J. Han, “Under- L. Yang, and A. Ahmed, “Big bird: Transformers for
standing the difficulty of training transformers,” in longer sequences,” in Advances in Neural Information
Proceedings of the 2020 Conference on Empirical Methods Processing Systems 33: Annual Conference on Neural
in Natural Language Processing, EMNLP 2020, Online, Information Processing Systems 2020, NeurIPS 2020,
November 16-20, 2020. Association for Computa- December 6-12, 2020, virtual, 2020.
tional Linguistics, 2020, pp. 5747–5763. [299] R. Child, S. Gray, A. Radford, and I. Sutskever, “Gen-
[289] D. Hendrycks and K. Gimpel, “Gaussian error linear erating long sequences with sparse transformers,”
units (gelus),” arXiv preprint arXiv:1606.08415, 2016. CoRR, vol. abs/1904.10509, 2019.
[290] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, [300] N. Shazeer, “Fast transformer decoding: One write-
“Language modeling with gated convolutional net- head is all you need,” CoRR, vol. abs/1911.02150,
works,” in Proceedings of the 34th International Confer- 2019. [Online]. Available: http://arxiv.org/abs/1911.
ence on Machine Learning, ICML 2017, Sydney, NSW, 02150
Australia, 6-11 August 2017, 2017, pp. 933–941. [301] J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy,
[291] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S. F. Lebrón, and S. Sanghai, “Gqa: Training gener-
Bari, S. Biderman, H. Elsahar, N. Muennighoff, alized multi-query transformer models from multi-
J. Phang, O. Press, C. Raffel, V. Sanh, S. Shen, head checkpoints,” arXiv preprint arXiv:2305.13245,
L. Sutawika, J. Tae, Z. X. Yong, J. Launay, and I. Belt- 2023.
agy, “What language model to train if you have one [302] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re,
million GPU hours?” in Findings of the Association for “Flashattention: Fast and memory-efficient exact at-
Computational Linguistics: EMNLP 2022, Abu Dhabi, tention with IO-awareness,” in NeurIPS, 2022.
United Arab Emirates, December 7-11, 2022, 2022, pp. [303] T. Dao, “Flashattention-2: Faster attention with better
765–782. parallelism and work partitioning,” arXiv preprint
[292] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self- arXiv:2307.08691, 2023.
attention with relative position representations,” [304] “vllm: Easy, fast, and cheap llm serving with
in Proceedings of the 2018 Conference of the North pagedattention.” [Online]. Available: https://vllm.
American Chapter of the Association for Computational ai/
Linguistics: Human Language Technologies, NAACL- [305] K. Murray and D. Chiang, “Correcting length bias in
HLT, New Orleans, Louisiana, USA, June 1-6, 2018, neural machine translation,” in WMT. Association
Volume 2 (Short Papers), M. A. Walker, H. Ji, for Computational Linguistics, 2018, pp. 212–223.
and A. Stent, Eds. Association for Computational [306] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
Linguistics, 2018, pp. 464–468. [Online]. Available: “The curious case of neural text degeneration,” in
https://doi.org/10.18653/v1/n18-2074 ICLR, 2020.
[293] Z. Dai, Z. Yang, Y. Yang, J. G. Carbonell, [307] C.-M. U. P. P. D. O. C. SCIENCE, Speech Under-
Q. V. Le, and R. Salakhutdinov, “Transformer-xl: standing Systems. Summary of Results of the Five-Year
Attentive language models beyond a fixed-length Research Effort at Carnegie-Mellon University, 1977.
context,” in Proceedings of the 57th Conference of [308] P. Koehn and R. Knowles, “Six challenges for neural
the Association for Computational Linguistics, ACL machine translation,” in NMT@ACL. Association
2019, Florence, Italy, July 28- August 2, 2019, Volume for Computational Linguistics, 2017, pp. 28–39.
1: Long Papers, A. Korhonen, D. R. Traum, and [309] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi,
L. Màrquez, Eds. Association for Computational W. Macherey, M. Krikun, Y. Cao, Q. Gao,
114

K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, 2019, pp. 103–112.
L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, [322] A. Harlap, D. Narayanan, A. Phanishayee, V. Se-
K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, shadri, N. R. Devanur, G. R. Ganger, and P. B. Gib-
J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Cor- bons, “Pipedream: Fast and efficient pipeline parallel
rado, M. Hughes, and J. Dean, “Google’s neural DNN training,” CoRR, vol. abs/1806.03377, 2018.
machine translation system: Bridging the gap be- [323] P. Micikevicius, S. Narang, J. Alben, G. F. Di-
tween human and machine translation,” CoRR, vol. amos, E. Elsen, D. Garcı́a, B. Ginsburg, M. Houston,
abs/1609.08144, 2016. O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed pre-
[310] R. Paulus, C. Xiong, and R. Socher, “A deep re- cision training,” CoRR, vol. abs/1710.03740, 2017.
inforced model for abstractive summarization,” in [324] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient
ICLR (Poster). OpenReview.net, 2018. 2d method for training super-large deep learning
[311] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, models,” CoRR, vol. abs/2104.05343, 2021.
Q. Sun, S. Lee, D. J. Crandall, and D. Batra, “Diverse [325] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract:
beam search: Decoding diverse solutions from neural Parallelize the tensor parallelism efficiently,” in Pro-
sequence models,” CoRR, vol. abs/1610.02424, 2016. ceedings of the 51st International Conference on Parallel
[312] A. Fan, M. Lewis, and Y. N. Dauphin, “Hierarchical Processing, ICPP 2022, Bordeaux, France, 29 August
neural story generation,” in ACL (1). Association for 2022 - 1 September 2022. ACM, 2022.
Computational Linguistics, 2018, pp. 889–898. [326] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing
[313] J. Hewitt, C. D. Manning, and P. Liang, “Trunca- parallelism in distributed training for huge neural
tion sampling as language model desmoothing,” in networks,” CoRR, vol. abs/2105.14450, 2021.
EMNLP (Findings). Association for Computational [327] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Se-
Linguistics, 2022, pp. 3414–3427. quence parallelism: Long sequence training from
[314] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, and system perspective,” arXiv e-prints, pp. arXiv–2105,
N. Collier, “A contrastive framework for neural text 2021.
generation,” in NeurIPS, 2022. [328] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen,
[315] C. Meister, T. Pimentel, G. Wiher, and R. Cotterell, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing
“Locally typical sampling,” Trans. Assoc. Comput. Lin- et al., “Alpa: Automating inter-and {Intra-Operator}
guistics, 2023. parallelism for distributed deep learning,” in OSDI,
[316] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eis- 2022, pp. 559–578.
ner, T. Hashimoto, L. Zettlemoyer, and M. Lewis, [329] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training
“Contrastive decoding: Open-ended text generation deep nets with sublinear memory cost,” CoRR, vol.
as optimization,” in ACL (1). Association for Com- abs/1604.06174, 2016.
putational Linguistics, 2023, pp. 12 286–12 312. [330] FairScale authors, “Fairscale: A general purpose
[317] Y. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and modular pytorch library for high performance
P. He, “Dola: Decoding by contrasting layers im- and large scale training,” https://github.com/
proves factuality in large language models,” CoRR, facebookresearch/fairscale, 2021.
vol. abs/2309.03883, 2023. [331] R. Lou, K. Zhang, and W. Yin, “Is prompt all you
[318] D. P. Kingma and J. Ba, “Adam: A method for need? no. A comprehensive and broader view of in-
stochastic optimization,” in 3rd International Confer- struction learning,” CoRR, vol. abs/2303.10475, 2023.
ence on Learning Representations, ICLR 2015, San Diego, [332] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep
CA, USA, May 7-9, 2015, Conference Track Proceedings, neural networks for natural language understand-
Y. Bengio and Y. LeCun, Eds., 2015. ing,” in ACL (1). Association for Computational
[319] I. Loshchilov and F. Hutter, “Fixing weight decay Linguistics, 2019, pp. 4487–4496.
regularization in adam,” CoRR, vol. abs/1711.05101, [333] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen,
2017. L. Zettlemoyer, and S. Gupta, “Muppet: Massive
[320] N. Shazeer and M. Stern, “Adafactor: Adaptive multi-task representations with pre-finetuning,” in
learning rates with sublinear memory cost,” in Pro- EMNLP (1). Association for Computational Linguis-
ceedings of the 35th International Conference on Machine tics, 2021, pp. 5799–5811.
Learning, ICML 2018, Stockholmsmässan, Stockholm, [334] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung,
Sweden, July 10-15, 2018, ser. Proceedings of Machine Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and
Learning Research, J. G. Dy and A. Krause, Eds., A. Roberts, “The flan collection: Designing data and
vol. 80. PMLR, 2018, pp. 4603–4611. methods for effective instruction tuning,” CoRR, vol.
[321] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, abs/2301.13688, 2023.
M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and [335] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng,
Z. Chen, “Gpipe: Efficient training of giant neural C. Tao, and D. Jiang, “Wizardlm: Empowering large
networks using pipeline parallelism,” in Advances language models to follow complex instructions,”
in Neural Information Processing Systems 32: Annual CoRR, vol. abs/2304.12244, 2023. [Online]. Available:
Conference on Neural Information Processing Systems https://doi.org/10.48550/arXiv.2304.12244
2019, NeurIPS 2019, December 8-14, 2019, Vancouver, [336] Z. Sun, Y. Shen, Q. Zhou, H. Zhang, Z. Chen, D. Cox,
BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelz- Y. Yang, and C. Gan, “Principle-driven self-alignment
imer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., of language models from scratch with minimal hu-
115

man supervision,” arXiv preprint arXiv:2305.03047, 1894.
2023. [350] Y. Wang, H. Ivison, P. Dasigi, J. Hessel, T. Khot, K. R.
[337] X. Li, P. Yu, C. Zhou, T. Schick, L. Zettle- Chandu, D. Wadden, K. MacMillan, N. A. Smith,
moyer, O. Levy, J. Weston, and M. Lewis, “Self- I. Beltagy, and H. Hajishirzi, “How far can camels
alignment with instruction backtranslation,” CoRR, go? exploring the state of instruction tuning on open
vol. abs/2308.06259, 2023. resources,” CoRR, vol. abs/2306.04751, 2023.
[338] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, [351] X. Liu, H. Yan, S. Zhang, C. An, X. Qiu, and D. Lin,
A. Efrat, P. Yu, L. Yu et al., “Lima: Less is more for “Scaling laws of rope-based extrapolation,” CoRR,
alignment,” arXiv preprint arXiv:2305.11206, 2023. vol. abs/2310.05209, 2023.
[339] L. Chen, S. Li, J. Yan, H. Wang, K. Gunaratna, V. Ya- [352] B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruc-
dav, Z. Tang, V. Srinivasan, T. Zhou, H. Huang, and tion tuning with GPT-4,” CoRR, vol. abs/2304.03277,
H. Jin, “Alpagasus: Training A better alpaca with 2023.
fewer data,” CoRR, vol. abs/2307.08701, 2023. [353] M. M. Krell, M. Kosec, S. P. Perez, and A. Fitzgib-
[340] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, bon, “Efficient sequence packing without cross-
H. Palangi, and A. H. Awadallah, “Orca: Progressive contamination: Accelerating large language mod-
learning from complex explanation traces of GPT-4,” els without impacting performance,” arXiv preprint
CoRR, vol. abs/2306.02707, 2023. arXiv:2107.02027, 2021.
[341] YuLan-Chat-Team, “Yulan-chat: An open-source [354] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei,
bilingual chatbot,” https://github.com/RUC-GSAI/ H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis,
YuLan-Chat, 2023. S. Pfohl et al., “Large language models encode clinical
[342] Y. Huang, X. Liu, Y. Gong, Z. Gou, Y. Shen, N. Duan, knowledge,” arXiv preprint arXiv:2212.13138, 2022.
and W. Chen, “Key-point-driven data synthesis with [355] J. Zhang, R. Xie, Y. Hou, W. X. Zhao, L. Lin, and
its enhancement on mathematical reasoning,” CoRR, J. Wen, “Recommendation as instruction following:
vol. abs/2403.02333, 2024. A large language model empowered recommenda-
[343] N. Ding, Y. Chen, B. Xu, Y. Qin, S. Hu, Z. Liu, M. Sun, tion approach,” CoRR, vol. abs/2305.07001, 2023.
and B. Zhou, “Enhancing chat language models by [356] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and
scaling high-quality instructional conversations,” in T. Liu, “Huatuo: Tuning llama model with chinese
Proceedings of the 2023 Conference on Empirical Methods medical knowledge,” arXiv preprint arXiv:2304.06975,
in Natural Language Processing, EMNLP 2023, Sin- 2023.
gapore, December 6-10, 2023, H. Bouamor, J. Pino, [357] Q. Huang, M. Tao, Z. An, C. Zhang, C. Jiang, Z. Chen,
and K. Bali, Eds. Association for Computational Z. Wu, and Y. Feng, “Lawyer llama technical report,”
Linguistics, 2023, pp. 3029–3051. arXiv preprint arXiv:2305.15062, 2023.
[344] K. Zhou, B. Zhang, J. Wang, Z. Chen, W. X. Zhao, [358] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze,
J. Sha, Z. Sheng, S. Wang, and J. Wen, “Jiuzhang3.0: S. Gehrmann, P. Kambadur, D. Rosenberg, and
Efficiently improving mathematical reasoning by G. Mann, “Bloomberggpt: A large language model
training small data synthesis models,” CoRR, vol. for finance,” arXiv preprint arXiv:2303.17564, 2023.
abs/2405.14365, 2024. [359] T. Liu and B. K. H. Low, “Goat: Fine-tuned llama out-
[345] Y. Cao, Y. Kang, and L. Sun, “Instruction mining: performs gpt-4 on arithmetic tasks,” arXiv preprint
High-quality instruction data selection for large lan- arXiv:2305.14201, 2023.
guage models,” CoRR, vol. abs/2307.06290, 2023. [360] T. Sun, X. Zhang, Z. He, P. Li, Q. Cheng, H. Yan,
[346] M. Li, Y. Zhang, Z. Li, J. Chen, L. Chen, X. Liu, Y. Shao, Q. Tang, X. Zhao, K. Chen, Y. Zheng,
N. Cheng, J. Wang, T. Zhou, and J. Xiao, “From Z. Zhou, R. Li, J. Zhan, Y. Zhou, L. Li, X. Yang, L. Wu,
quantity to quality: Boosting LLM performance with Z. Yin, X. Huang, and X. Qiu, “Moss: Training con-
self-guided data selection for instruction tuning,” versational language models from synthetic data,”
CoRR, vol. abs/2308.12032, 2023. [Online]. Available: 2023.
https://doi.org/10.48550/arXiv.2308.12032 [361] Y. Dubois, X. Li, R. Taori, T. Zhang, I. Gulrajani,
[347] O. Sener and S. Savarese, “Active learning J. Ba, C. Guestrin, P. Liang, and T. B. Hashimoto,
for convolutional neural networks: A core-set “Alpacafarm: A simulation framework for methods
approach,” in 6th International Conference on Learning that learn from human feedback,” CoRR, vol.
Representations, ICLR 2018, Vancouver, BC, Canada, abs/2305.14387, 2023. [Online]. Available: https:
April 30 - May 3, 2018, Conference Track Proceedings. //doi.org/10.48550/arXiv.2305.14387
OpenReview.net, 2018. [Online]. Available: https: [362] D. Hendrycks, C. Burns, S. Basart, A. Zou,
//openreview.net/forum?id=H1aIuk-RW M. Mazeika, D. Song, and J. Steinhardt, “Measur-
[348] M. Xia, S. Malladi, S. Gururangan, S. Arora, ing massive multitask language understanding,” in
and D. Chen, “LESS: selecting influential ICLR. OpenReview.net, 2021.
data for targeted instruction tuning,” CoRR, [363] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann,
vol. abs/2402.04333, 2024. [Online]. Available: Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H.
https://doi.org/10.48550/arXiv.2402.04333 Chi, D. Zhou, and J. Wei, “Challenging big-bench
[349] P. W. Koh and P. Liang, “Understanding black-box tasks and whether chain-of-thought can solve them,”
predictions via influence functions,” in International CoRR, vol. abs/2210.09261, 2022.
conference on machine learning. PMLR, 2017, pp. 1885– [364] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel,
116

V. Mikulik, and G. Irving, “Alignment of language in large language models part i: Ppo,” arXiv preprint
agents,” CoRR, vol. abs/2103.14659, 2021. arXiv:2307.04964, 2023.
[365] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, [374] J. Uesato, N. Kushman, R. Kumar, H. F. Song,
A. Radford, D. Amodei, P. F. Christiano, and G. Irv- N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and
ing, “Fine-tuning language models from human pref- I. Higgins, “Solving math word problems with
erences,” CoRR, vol. abs/1909.08593, 2019. process- and outcome-based feedback,” CoRR, vol.
[366] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, abs/2211.14275, 2022.
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- [375] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards,
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever,
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. and K. Cobbe, “Let’s verify step by step,” CoRR, vol.
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- abs/2305.20050, 2023.
plan, “A general language assistant as a laboratory [376] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika,
for alignment,” CoRR, vol. abs/2112.00861, 2021. A. Arora, E. Guo, C. Burns, S. Puranik, H. He,
[367] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, D. Song, and J. Steinhardt, “Measuring coding chal-
J. Aslanides, A. Glaese, N. McAleese, and G. Irving, lenge competence with APPS,” in NeurIPS Datasets
“Red teaming language models with language mod- and Benchmarks, 2021.
els,” in Proceedings of the 2022 Conference on Empir- [377] T. Wang, P. Yu, X. E. Tan, S. O’Brien, R. Pa-
ical Methods in Natural Language Processing, EMNLP sunuru, J. Dwivedi-Yu, O. Golovneva, L. Zettle-
2022, Abu Dhabi, United Arab Emirates, December 7-11, moyer, M. Fazel-Zarandi, and A. Celikyilmaz, “Shep-
2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. herd: A critic for language model generation,” CoRR,
Association for Computational Linguistics, 2022, pp. vol. abs/2308.04592, 2023.
3419–3448. [378] G. Chen, M. Liao, C. Li, and K. Fan, “Alphamath
[368] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, almost zero: process supervision without process,”
H. F. Song, M. Chadwick, M. Glaese, S. Young, CoRR, vol. abs/2405.03553, 2024.
L. Campbell-Gillingham, G. Irving, and [379] Q. Ma, H. Zhou, T. Liu, J. Yuan, P. Liu, Y. You, and
N. McAleese, “Teaching language models to H. Yang, “Let’s reward step by step: Step-level re-
support answers with verified quotes,” CoRR, vol. ward model as the navigators for reasoning,” CoRR,
abs/2203.11147, 2022. vol. abs/2310.10080, 2023.
[369] Y. Bai, S. Kadavath, S. Kundu, A. Askell, [380] Z. Chen, K. Zhou, W. X. Zhao, J. Wan, F. Zhang,
J. Kernion, A. Jones, A. Chen, A. Goldie, D. Zhang, and J. Wen, “Improving large language
A. Mirhoseini, C. McKinnon, C. Chen, C. Olsson, models via fine-grained reinforcement learning
C. Olah, D. Hernandez, D. Drain, D. Ganguli, with minimum editing constraint,” CoRR, vol.
D. Li, E. Tran-Johnson, E. Perez, J. Kerr, abs/2401.06081, 2024. [Online]. Available: https:
J. Mueller, J. Ladish, J. Landau, K. Ndousse, //doi.org/10.48550/arXiv.2401.06081
K. Lukosiute, L. Lovitt, M. Sellitto, N. Elhage, [381] Z. Xi, W. Chen, B. Hong, S. Jin, R. Zheng, W. He,
N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, Y. Ding, S. Liu, X. Guo, J. Wang, H. Guo, W. Shen,
R. Larson, S. Ringer, S. Johnston, S. Kravec, X. Fan, Y. Zhou, S. Dou, X. Wang, X. Zhang,
S. E. Showk, S. Fort, T. Lanham, T. Telleen- P. Sun, T. Gui, Q. Zhang, and X. Huang, “Train-
Lawton, T. Conerly, T. Henighan, T. Hume, S. R. ing large language models for reasoning through
Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, reverse curriculum reinforcement learning,” CoRR,
N. Joseph, S. McCandlish, T. Brown, and J. Kaplan, vol. abs/2402.05808, 2024.
“Constitutional AI: harmlessness from AI feedback,” [382] D. Silver, J. Schrittwieser, K. Simonyan,
CoRR, vol. abs/2212.08073, 2022. [Online]. Available: I. Antonoglou, A. Huang, A. Guez, T. Hubert,
https://doi.org/10.48550/arXiv.2212.08073 L. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap,
[370] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, F. Hui, L. Sifre, G. van den Driessche, T. Graepel,
C. Bishop, V. Carbune, and A. Rastogi, “RLAIF: and D. Hassabis, “Mastering the game of go without
scaling reinforcement learning from human feedback human knowledge,” Nat., pp. 354–359, 2017.
with AI feedback,” CoRR, vol. abs/2309.00267, 2023. [383] T. Anthony, Z. Tian, and D. Barber, “Thinking fast
[371] H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, and slow with deep learning and tree search,” in
J. Zhang, K. Shum, and T. Zhang, “RAFT: Advances in Neural Information Processing Systems 30:
reward ranked finetuning for generative foundation Annual Conference on Neural Information Processing
model alignment,” CoRR, vol. abs/2304.06767, 2023. Systems 2017, December 4-9, 2017, Long Beach, CA,
[Online]. Available: https://doi.org/10.48550/arXiv. USA, 2017, pp. 5360–5370.
2304.06767 [384] H. Luo, Q. Sun, C. Xu, P. Zhao, J. Lou, C. Tao,
[372] A. Askell, Y. Bai, A. Chen, D. Drain, D. Gan- X. Geng, Q. Lin, S. Chen, and D. Zhang, “Wizard-
guli, T. Henighan, A. Jones, N. Joseph, B. Mann, math: Empowering mathematical reasoning for large
N. DasSarma et al., “A general language assis- language models via reinforced evol-instruct,” CoRR,
tant as a laboratory for alignment,” arXiv preprint vol. abs/2308.09583, 2023.
arXiv:2112.00861, 2021. [385] R. Liu, C. Jia, G. Zhang, Z. Zhuang, T. X. Liu, and
[373] R. Zheng, S. Dou, S. Gao, W. Shen, B. Wang, Y. Liu, S. Vosoughi, “Second thoughts are best: Learning
S. Jin, Q. Liu, L. Xiong, L. Chen et al., “Secrets of rlhf to re-align with human values from text edits,” in
117

NeurIPS, 2022. J. E. Gonzalez, “The wisdom of hindsight makes
[386] X. Lu, S. Welleck, J. Hessel, L. Jiang, L. Qin, P. West, language models better instruction followers,”
P. Ammanabrolu, and Y. Choi, “QUARK: control- CoRR, vol. abs/2302.05206, 2023. [Online]. Available:
lable text generation with reinforced unlearning,” in https://doi.org/10.48550/arXiv.2302.05206
NeurIPS, 2022. [401] A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne,
[387] J. Scheurer, J. A. Campos, T. Korbak, J. S. Chan, “Imitation learning: A survey of learning methods,”
A. Chen, K. Cho, and E. Perez, “Training language ACM Comput. Surv., vol. 50, no. 2, apr 2017. [Online].
models with language feedback at scale,” CoRR, vol. Available: https://doi.org/10.1145/3054912
abs/2303.16755, 2023. [402] S. Levine, “Should i imitate or reinforce,”
[388] G. Guo, R. Zhao, T. Tang, W. X. Zhao, and 2022. [Online]. Available: https://www.youtube.
J.-R. Wen, “Beyond imitation: Leveraging fine- com/watch?v=sVPm7zOrBxM
grained quality signals for alignment,” arXiv preprint [403] J. Schulman, “Reinforcement learning from
arXiv:2311.04072, 2023. human feedback: Progress and challenges,” 2023.
[389] R. Krishna, D. Lee, L. Fei-Fei, and M. S. Bernstein, [Online]. Available: https://www.youtube.com/
“Socially situated artificial intelligence enables watch?v=hhiLw5Q UFg
learning from human interaction,” Proceedings of the [404] X. L. Li and P. Liang, “Prefix-tuning: Optimizing
National Academy of Sciences of the United States of continuous prompts for generation,” in Proceedings of
America, vol. 119, 2022. [Online]. Available: https: the 59th Annual Meeting of the Association for Compu-
//api.semanticscholar.org/CorpusID:252381954 tational Linguistics and the 11th International Joint Con-
[390] H. Liu, C. Sferrazza, and P. Abbeel, “Chain of hind- ference on Natural Language Processing, ACL/IJCNLP
sight aligns language models with feedback,” CoRR, 2021, (Volume 1: Long Papers), Virtual Event, August
vol. abs/2302.02676, 2023. 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds.
[391] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, Association for Computational Linguistics, 2021, pp.
C. D. Manning, and C. Finn, “Direct preference 4582–4597.
optimization: Your language model is secretly a [405] B. Lester, R. Al-Rfou, and N. Constant, “The power
reward model,” CoRR, vol. abs/2305.18290, 2023. of scale for parameter-efficient prompt tuning,” in
[Online]. Available: https://doi.org/10.48550/arXiv. Proceedings of the 2021 Conference on Empirical Methods
2305.18290 in Natural Language Processing, EMNLP 2021, Virtual
[392] K. Ethayarajh, W. Xu, N. Muennighoff, D. Jurafsky, Event / Punta Cana, Dominican Republic, 7-11 Novem-
and D. Kiela, “KTO: model alignment as prospect ber, 2021, M. Moens, X. Huang, L. Specia, and S. W.
theoretic optimization,” CoRR, vol. abs/2402.01306, Yih, Eds. Association for Computational Linguistics,
2024. 2021, pp. 3045–3059.
[393] Y. Meng, M. Xia, and D. Chen, “Simpo: Simple pref- [406] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone,
erence optimization with a reference-free reward,” Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and
CoRR, vol. abs/2405.14734, 2024. S. Gelly, “Parameter-efficient transfer learning for
[394] D. Feng, B. Qin, C. Huang, Z. Zhang, and W. Lei, NLP,” in Proceedings of the 36th International Confer-
“Towards analyzing and understanding the limita- ence on Machine Learning, ICML 2019, 9-15 June 2019,
tions of DPO: A theoretical perspective,” CoRR, vol. Long Beach, California, USA, 2019, pp. 2790–2799.
abs/2404.04626, 2024. [407] Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee,
[395] A. Gorbatovski, B. Shaposhnikov, A. Malakhov, L. Bing, and S. Poria, “Llm-adapters: An adapter
N. Surnachev, Y. Aksenov, I. Maksimov, N. Balagan- family for parameter-efficient fine-tuning of large
sky, and D. Gavrilov, “Learn your reference model language models,” CoRR, vol. abs/2304.01933, 2023.
for real good alignment,” CoRR, vol. abs/2404.09656, [408] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and
2024. G. Neubig, “Towards a unified view of parameter-
[396] D. Kim, Y. Kim, W. Song, H. Kim, Y. Kim, S. Kim, efficient transfer learning,” in The Tenth International
and C. Park, “sdpo: Don’t use your data all at once,” Conference on Learning Representations, ICLR 2022, Vir-
CoRR, vol. abs/2403.19270, 2024. tual Event, April 25-29, 2022. OpenReview.net, 2022.
[397] Z. Yuan, H. Yuan, C. Tan, W. Wang, S. Huang, and [409] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, “P-
F. Huang, “RRHF: rank responses to align language tuning v2: Prompt tuning can be comparable to fine-
models with human feedback without tears,” tuning universally across scales and tasks,” CoRR,
CoRR, vol. abs/2304.05302, 2023. [Online]. Available: vol. abs/2110.07602, 2021.
https://doi.org/10.48550/arXiv.2304.05302 [410] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang,
[398] Y. Zhao, R. Joshi, T. Liu, M. Khalman, M. Saleh, and J. Tang, “GPT understands, too,” CoRR, vol.
and P. J. Liu, “Slic-hf: Sequence likelihood calibration abs/2103.10385, 2021.
with human feedback,” CoRR, vol. abs/2305.10425, [411] Y. Gu, X. Han, Z. Liu, and M. Huang, “Ppt: Pre-
2023. trained prompt tuning for few-shot learning,” in Pro-
[399] A. Fisch, J. Eisenstein, V. Zayats, A. Agarwal, ceedings of the 60th Annual Meeting of the Association
A. Beirami, C. Nagpal, P. Shaw, and J. Berant, “Ro- for Computational Linguistics (Volume 1: Long Papers),
bust preference optimization through reward model 2022, pp. 8410–8423.
distillation,” CoRR, vol. abs/2405.19316, 2024. [412] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, “How can
[400] T. Zhang, F. Liu, J. Wong, P. Abbeel, and we know what language models know?” Transactions
of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
[413] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, “Autoprompt: Eliciting knowledge from language models with automatically generated prompts,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4222–4235.
[414] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, “Adaptive budget allocation for parameter-efficient fine-tuning,” CoRR, vol. abs/2303.10512, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.10512
[415] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, “Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation,” CoRR, vol. abs/2210.07558, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.07558
[416] N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao, X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang, J. Li, and M. Sun, “Parameter-efficient fine-tuning of large-scale pre-trained language models,” Nature Machine Intelligence, vol. 5, pp. 1–16, 03 2023.
[417] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” CoRR, vol. abs/2303.16199, 2023.
[418] J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, “MAD-X: an adapter-based framework for multi-task cross-lingual transfer,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Association for Computational Linguistics, 2020, pp. 7654–7673.
[419] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, and S. Paul, “Peft: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
[420] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, “What makes good in-context examples for gpt-3?” in Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022, 2022, pp. 100–114.
[421] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2655–2671.
[422] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and S. Lee, “Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator,” CoRR, vol. abs/2206.08082, 2022.
[423] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” in Proc. of ICLR, 2023.
[424] Y. Hao, Y. Sun, L. Dong, Z. Han, Y. Gu, and F. Wei, “Structured prompting: Scaling in-context learning to 1,000 examples,” CoRR, 2022.
[425] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., 2022, pp. 8086–8098.
[426] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, “Complexity-based prompting for multi-step reasoning,” CoRR, vol. abs/2210.00720, 2022.
[427] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” CoRR, vol. abs/2210.03493, 2022.
[428] A. Creswell, M. Shanahan, and I. Higgins, “Selection-inference: Exploiting large language models for interpretable logical reasoning,” CoRR, vol. abs/2205.09712, 2022.
[429] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” CoRR, vol. abs/2203.11171, 2022.
[430] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou, and W. Chen, “On the advance of making language models better reasoners,” CoRR, vol. abs/2206.02336, 2022.
[431] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou, “Rationale-augmented ensembles in language models,” CoRR, 2022.
[432] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and E. H. Chi, “Least-to-most prompting enables complex reasoning in large language models,” CoRR, vol. abs/2205.10625, 2022.
[433] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal, “Decomposed prompting: A modular approach for solving complex tasks,” CoRR, vol. abs/2210.02406, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.02406
[434] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K. Lee, and E. Lim, “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” CoRR, vol. abs/2305.04091, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.04091
[435] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch, “Faithful chain-of-thought reasoning,” CoRR, vol. abs/2301.13379, 2023.
[436] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” CoRR, vol. abs/2211.10435, 2022.
[437] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang, “Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface,” arXiv preprint arXiv:2303.17580, 2023.
[438] H. Sun, Y. Zhuang, L. Kong, B. Dai, and C. Zhang, “Adaplanner: Adaptive planning from feedback with language models,” arXiv preprint arXiv:2305.16653, 2023.
[439] Y. Lu, P. Lu, Z. Chen, W. Zhu, X. E. Wang, and W. Y. Wang, “Multimodal procedural planning via dual text-image prompting,” CoRR, vol. abs/2305.01795, 2023.
[440] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” CoRR, vol. abs/2305.14992, 2023.
[441] Z. Chen, K. Zhou, B. Zhang, Z. Gong, W. X. Zhao, and J. Wen, “Chatcot: Tool-augmented chain-of-thought reasoning on chat-based large language models,” CoRR, vol. abs/2305.14323, 2023.
[442] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” CoRR, vol. abs/2210.03629, 2022.
[443] N. Shinn, F. Cassano, B. Labash, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” 2023.
[444] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” CoRR, vol. abs/2305.10601, 2023.
[445] V. Liu and L. B. Chilton, “Design guidelines for prompt engineering text-to-image generative models,” in Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, 2022, pp. 1–23.
[446] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,” arXiv preprint arXiv:2302.11382, 2023.
[447] S. K. K. Santu and D. Feng, “Teler: A general taxonomy of LLM prompts for benchmarking complex tasks,” CoRR, vol. abs/2305.11430, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.11430
[448] OpenAI, “Gpt best practices,” OpenAI, 2023. [Online]. Available: https://platform.openai.com/docs/guides/gpt-best-practices
[449] Contributors, “Ai short,” 2023. [Online]. Available: https://www.aishort.top/
[450] Contributors, “Awesome chatgpt prompts,” Github, 2023. [Online]. Available: https://github.com/f/awesome-chatgpt-prompts/
[451] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and J. Wen, “Structgpt: A general framework for large language model to reason over structured data,” CoRR, vol. abs/2305.09645, 2023.
[452] L. Beurer-Kellner, M. Fischer, and M. Vechev, “Prompting is programming: A query language for large language models,” Proceedings of the ACM on Programming Languages, vol. 7, no. PLDI, pp. 1946–1969, 2023.
[453] P. Lu, B. Peng, H. Cheng, M. Galley, K.-W. Chang, Y. N. Wu, S.-C. Zhu, and J. Gao, “Chameleon: Plug-and-play compositional reasoning with large language models,” arXiv preprint arXiv:2304.09842, 2023.
[454] R. Ren, Y. Wang, Y. Qu, W. X. Zhao, J. Liu, H. Tian, H. Wu, J.-R. Wen, and H. Wang, “Investigating the factual knowledge boundary of large language models with retrieval augmentation,” arXiv preprint arXiv:2307.11019, 2023.
[455] X. Amatriain, “Prompt design and engineering: Introduction and advanced methods,” CoRR, vol. abs/2401.14423, 2024.
[456] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley, and W. X. Zhao, “Large language models are zero-shot rankers for recommender systems,” CoRR, vol. abs/2305.08845, 2023.
[457] S. Chang and E. Fosler-Lussier, “How to prompt llms for text-to-sql: A study in zero-shot, single-domain, and cross-domain settings,” CoRR, vol. abs/2305.11853, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.11853
[458] Y. Wen, N. Jain, J. Kirchenbauer, M. Goldblum, J. Geiping, and T. Goldstein, “Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery,” CoRR, vol. abs/2302.03668, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.03668
[459] T. Gao, A. Fisch, and D. Chen, “Making pre-trained language models better few-shot learners,” in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. 3816–3830.
[460] L. Chen, J. Chen, T. Goldstein, H. Huang, and T. Zhou, “Instructzero: Efficient instruction optimization for black-box large language models,” CoRR, vol. abs/2306.03082, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.03082
[461] X. Lin, Z. Wu, Z. Dai, W. Hu, Y. Shu, S. Ng, P. Jaillet, and B. K. H. Low, “Use your INSTINCT: instruction optimization using neural bandits coupled with transformers,” CoRR, vol. abs/2310.02905, 2023.
[462] M. Deng, J. Wang, C. Hsieh, Y. Wang, H. Guo, T. Shu, M. Song, E. P. Xing, and Z. Hu, “Rlprompt: Optimizing discrete text prompts with reinforcement learning,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 3369–3391.
[463] T. Zhang, X. Wang, D. Zhou, D. Schuurmans, and J. E. Gonzalez, “TEMPERA: test-time prompt editing via reinforcement learning,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[464] Y. Jafari, D. Mekala, R. Yu, and T. Berg-Kirkpatrick, “Morl-prompt: An empirical analysis of multi-objective reinforcement learning for discrete prompt optimization,” CoRR, vol. abs/2402.11711, 2024.
[465] W. Kong, S. A. Hombaiah, M. Zhang, Q. Mei, and M. Bendersky, “Prewrite: Prompt rewriting with reinforcement learning,” CoRR, vol. abs/2401.08189, 2024.
[466] H. Xu, Y. Chen, Y. Du, N. Shao, Y. Wang, H. Li, and Z. Yang, “GPS: genetic prompt search for efficient few-shot learning,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 8162–8171.
[467] A. Prasad, P. Hase, X. Zhou, and M. Bansal, “Grips: Gradient-free, edit-based instruction search for prompting large language models,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, A. Vlachos and I. Augenstein, Eds. Association for Computational Linguistics, 2023, pp. 3827–3846.
[468] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” in The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023.
[469] R. Pryzant, D. Iter, J. Li, Y. T. Lee, C. Zhu, and M. Zeng, “Automatic prompt optimization with “gradient descent” and beam search,” CoRR, vol. abs/2305.03495, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.03495
[470] C. Yang, X. Wang, Y. Lu, H. Liu, Q. V. Le, D. Zhou, and X. Chen, “Large language models as optimizers,” CoRR, vol. abs/2309.03409, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.03409
[471] Q. Ye, M. Axmed, R. Pryzant, and F. Khani, “Prompt engineering a prompt engineer,” CoRR, vol. abs/2311.05661, 2023.
[472] X. Tang, X. Wang, W. X. Zhao, S. Lu, Y. Li, and J. Wen, “Unleashing the potential of large language models as prompt optimizers: An analogical analysis with gradient-based model optimizers,” CoRR, vol. abs/2402.17564, 2024.
[473] H. Yang and K. Li, “Instoptima: Evolutionary multi-objective instruction optimization via large language model-based instruction operators,” in EMNLP (Findings). Association for Computational Linguistics, 2023, pp. 13 593–13 602.
[474] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y. Yang, “Connecting large language models with evolutionary algorithms yields powerful prompt optimizers,” CoRR, vol. abs/2309.08532, 2023.
[475] X. L. Do, Y. Zhao, H. Brown, Y. Xie, J. X. Zhao, N. F. Chen, K. Kawaguchi, M. Q. Xie, and J. He, “Prompt optimization via adversarial in-context learning,” CoRR, vol. abs/2312.02614, 2023.
[476] X. Wang, C. Li, Z. Wang, F. Bai, H. Luo, J. Zhang, N. Jojic, E. P. Xing, and Z. Hu, “Promptagent: Strategic planning with language models enables expert-level prompt optimization,” CoRR, vol. abs/2310.16427, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.16427
[477] T. Tang, J. Li, W. X. Zhao, and J. Wen, “Context-tuning: Learning contextualized prompts for natural language generation,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, Eds. International Committee on Computational Linguistics, 2022, pp. 6340–6354.
[478] T. Vu, B. Lester, N. Constant, R. Al-Rfou’, and D. Cer, “Spot: Better frozen model adaptation through soft prompt transfer,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 5039–5059.
[479] J. Li, T. Tang, J. Nie, J. Wen, and X. Zhao, “Learning to transfer prompts for text generation,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, Eds. Association for Computational Linguistics, 2022, pp. 3506–3518.
[480] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2022, pp. 11 048–11 064.
[481] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang, Eds., 2021, pp. 12 697–12 706.
[482] Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate empathetic dialogues? A novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, Eds. International Committee on Computational Linguistics, 2022, pp. 669–683.
[483] I. Levy, B. Bogin, and J. Berant, “Diverse demonstrations improve in-context compositional generalization,” CoRR, vol. abs/2212.06800, 2022.
[484] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith,
and T. Yu, “Selective annotation makes language models better few-shot learners,” CoRR, 2022.
[485] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, and R. Pasunuru, “Complementary explanations for effective in-context learning,” CoRR, 2022.
[486] X. Li and X. Qiu, “Finding supporting examples for in-context learning,” CoRR, 2023.
[487] Y. Zhang, S. Feng, and C. Tan, “Active example selection for in-context learning,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 9134–9148.
[488] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd-workers for text-annotation tasks,” 2023.
[489] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and S. Lee, “Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator,” CoRR, vol. abs/2206.08082, 2022.
[490] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma, “An explanation of in-context learning as implicit bayesian inference,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
[491] Z. Wu, Y. Wang, J. Ye, and L. Kong, “Self-adaptive in-context learning,” CoRR, vol. abs/2212.10375, 2022.
[492] Y. Gu, L. Dong, F. Wei, and M. Huang, “Pre-training to learn in context,” CoRR, vol. abs/2305.09137, 2023.
[493] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi, “Metaicl: Learning to learn in context,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, Eds., 2022, pp. 2791–2809.
[494] M. Hahn and N. Goyal, “A theory of emergent in-context learning as implicit structure induction,” CoRR, vol. abs/2303.07971, 2023.
[495] J. Pan, T. Gao, H. Chen, and D. Chen, “What in-context learning “learns” in-context: Disentangling task recognition and task learning,” CoRR, vol. abs/2305.09731, 2023.
[496] N. Wies, Y. Levine, and A. Shashua, “The learnability of in-context learning,” CoRR, vol. abs/2303.07895, 2023.
[497] A. Webson and E. Pavlick, “Do prompt-based models really understand the meaning of their prompts?” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2300–2344.
[498] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov, “Transformers learn in-context by gradient descent,” CoRR, vol. abs/2212.07677, 2022.
[499] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah, “In-context learning and induction heads,” CoRR, vol. abs/2209.11895, 2022.
[500] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou, “What learning algorithm is in-context learning? investigations with linear models,” CoRR, vol. abs/2211.15661, 2022.
[501] J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu, X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger language models do in-context learning differently,” arXiv preprint arXiv:2303.03846, 2023.
[502] J. Coda-Forno, M. Binz, Z. Akata, M. M. Botvinick, J. X. Wang, and E. Schulz, “Meta-in-context learning in large language models,” CoRR, vol. abs/2305.12907, 2023.
[503] J. W. Wei, L. Hou, A. K. Lampinen, X. Chen, D. Huang, Y. Tay, X. Chen, Y. Lu, D. Zhou, T. Ma, and Q. V. Le, “Symbol tuning improves in-context learning in language models,” CoRR, vol. abs/2305.08298, 2023.
[504] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang, W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of chain of thought reasoning: Advances, frontiers and future,” CoRR, vol. abs/2309.15402, 2023.
[505] S. Miao, C. Liang, and K. Su, “A diverse corpus for evaluating and developing english math word problem solvers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association for Computational Linguistics, 2020, pp. 975–984.
[506] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4149–4158.
[507] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” CoRR, vol. abs/2205.11916, 2022.
[508] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” CoRR, vol. abs/2211.12588, 2022.
[509] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., 2023.
[510] X. Zhao, Y. Xie, K. Kawaguchi, J. He, and Q. Xie, “Automatic model selection with large language models for reasoning,” CoRR, vol. abs/2305.14333, 2023.
[511] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, “Making large language models better
reasoners with step-aware verifier,” 2023.
[512] O. Yoran, T. Wolfson, B. Bogin, U. Katz, D. Deutch, and J. Berant, “Answering questions by meta-reasoning over multiple chains of thought,” CoRR, vol. abs/2304.13007, 2023.
[513] Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memisevic, and H. Su, “Deductive verification of chain-of-thought reasoning,” CoRR, vol. abs/2306.03872, 2023.
[514] T. Xue, Z. Wang, Z. Wang, C. Han, P. Yu, and H. Ji, “RCOT: detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought,” CoRR, vol. abs/2305.11499, 2023.
[515] Y. Weng, M. Zhu, F. Xia, B. Li, S. He, K. Liu, and J. Zhao, “Large language models are better reasoners with self-verification,” CoRR, vol. abs/2212.09561, 2023.
[516] W. Jiang, H. Shi, L. Yu, Z. Liu, Y. Zhang, Z. Li, and J. T. Kwok, “Forward-backward reasoning in large language models for mathematical verification,” 2023.
[517] J. Long, “Large language model guided tree-of-thought,” CoRR, vol. abs/2305.08291, 2023.
[518] S. Mo and M. Xin, “Tree of uncertain thoughts reasoning for large language models,” CoRR, vol. abs/2309.07694, 2023.
[519] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk, and T. Hoefler, “Graph of thoughts: Solving elaborate problems with large language models,” CoRR, vol. abs/2308.09687, 2023.
[520] B. Lei, P. Lin, C. Liao, and C. Ding, “Boosting logical reasoning in large language models through a new framework: The graph of thought,” CoRR, vol. abs/2308.08614, 2023.
[521] R. Ding, C. Zhang, L. Wang, Y. Xu, M. Ma, W. Zhang, S. Qin, S. Rajmohan, Q. Lin, and D. Zhang, “Everything of thoughts: Defying the law of penrose triangle for thought generation,” arXiv preprint arXiv:2311.04254, 2023.
[522] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr, L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda, “Holistic evaluation of language models,” CoRR, vol. abs/2211.09110, 2022.
[523] Z. Bi, N. Zhang, Y. Jiang, S. Deng, G. Zheng, and H. Chen, “When do program-of-thoughts work for reasoning?” CoRR, vol. abs/2308.15452, 2023.
[524] A. Madaan and A. Yazdanbakhsh, “Text and patterns: For effective chain of thought, it takes two to tango,” CoRR, vol. abs/2209.07686, 2022.
[525] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” CoRR, vol. abs/2302.00923, 2023.
[526] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei, “Language models are multilingual chain-of-thought reasoners,” CoRR, vol. abs/2210.03057, 2022.
[527] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limitations of language models in arithmetic and symbolic induction,” CoRR, vol. abs/2208.05051, 2022.
[528] N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He, “ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models,” CoRR, 2023.
[529] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” CoRR, vol. abs/2305.10601, 2023.
[530] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An open-ended embodied agent with large language models,” arXiv preprint arXiv:2305.16291, 2023.
[531] X. Jiang, Y. Dong, L. Wang, Q. Shang, and G. Li, “Self-planning code generation with large language model,” CoRR, vol. abs/2303.06689, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.06689
[532] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated robot task plans using large language models,” CoRR, vol. abs/2209.11302, 2022.
[533] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: empowering large language models with optimal planning proficiency,” CoRR, vol. abs/2304.11477, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2304.11477
[534] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022, pp. 10 674–10 685.
[535] J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” CoRR, vol. abs/2304.03442, 2023.
[536] 2023. [Online]. Available: https://github.com/Significant-Gravitas/Auto-GPT
[537] Z. Wang, S. Cai, A. Liu, X. Ma, and Y. Liang, “Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents,” CoRR, vol. abs/2302.01560, 2023.
[538] J. Wang, X. Yi, R. Guo, H. Jin, P. Xu, S. Li, X. Wang, X. Guo, C. Li, X. Xu et al., “Milvus: A purpose-built vector data management system,” in Proceedings of the 2021 International Conference on Management of Data, 2021, pp. 2614–2627.
[539] W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “Memorybank: Enhancing large language models with long-term memory,” CoRR, vol. abs/2305.10250, 2023.
[540] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz,
“Building a large annotated corpus of english: The penn treebank,” Comput. Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[541] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in ICLR (Poster). OpenReview.net, 2017.
[542] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna, “Findings of the 2014 workshop on statistical machine translation,” in WMT@ACL. The Association for Computer Linguistics, 2014, pp. 12–58.
[543] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, “Findings of the 2016 conference on machine translation,” in WMT. The Association for Computer Linguistics, 2016, pp. 131–198.
[544] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri, “Findings of the 2019 conference on machine translation (WMT19),” in Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post, M. Turchi, and K. Verspoor, Eds. Association for Computational Linguistics, 2019, pp. 1–61.
[545] L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y. Graham, R. Grundkiewicz, B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn, C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri, “Findings of the 2020 conference on machine translation (WMT20),” in Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, Y. Graham, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, and M. Negri, Eds. Association for Computational Linguistics, 2020, pp. 1–55.
[546] F. Akhbardeh, A. Arkhangorodsky, M. Biesialska, O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-jussà, C. España-Bonet, A. Fan, C. Federmann, M. Freitag, Y. Graham, R. Grundkiewicz, B. Haddow, L. Harter, K. Heafield, C. Homan, M. Huck, K. Amponsah-Kaakyire, J. Kasai, D. Khashabi, K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz, M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa, M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin, and M. Zampieri, “Findings of the 2021 conference on machine translation (WMT21),” in Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, and C. Monz, Eds. Association for Computational Linguistics, 2021, pp. 1–88.
[547] T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, M. Fishel, T. Gowda, Y. Graham, R. Grundkiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, M. Novák, M. Popel, and M. Popovic, “Findings of the 2022 conference on machine translation (WMT22),” in Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, and M. Zampieri, Eds. Association for Computational Linguistics, 2022, pp. 1–45.
[548] N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan, “The flores-101 evaluation benchmark for low-resource and multilingual machine translation,” Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538, 2022.
[549] R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset, “Diabla: a corpus of bilingual spontaneous written dialogues for machine translation,” Lang. Resour. Evaluation, vol. 55, no. 3, pp. 635–660, 2021.
[550] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence rnns and beyond,” in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, Y. Goldberg and S. Riezler, Eds. ACL, 2016, pp. 280–290.
[551] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” in EMNLP. Association for Computational Linguistics, 2018, pp. 1797–1807.
[552] F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown, “Wikilingua: A new benchmark dataset for cross-lingual abstractive summarization,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4034–4048.
[553] S. Moon, P. Shah, A. Kumar, and R. Subba, “Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs,” in ACL (1). Association for Computational Linguistics, 2019, pp. 845–854.
[554] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu, “DS-1000: A natural and reliable benchmark for data science code generation,” CoRR, vol. abs/2211.11501, 2022.
[555] Z. Wang, S. Zhou, D. Fried, and G. Neubig, “Execution-based evaluation for open-domain code generation,” CoRR, vol. abs/2212.10481, 2022.
[556] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: a benchmark for question answering research,” Trans. Assoc. Comput. Linguistics, pp. 452–466, 2019.
[557] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the AI2 reasoning challenge,” CoRR, vol. abs/1803.05457, 2018.
[558] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 3214–3252.
[559] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on freebase from question-answer pairs,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 2013, pp. 1533–1544.
[560] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017, pp. 1601–1611.
[561] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA: reasoning about physical commonsense in natural language,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 7432–7439.
[562] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, “Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia,” in The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II, 2019, pp. 69–78.
[563] Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang, X. Yan, and Y. Su, “Beyond I.I.D.: three levels of generalization for question answering on knowledge bases,” in WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp. 3477–3488.
[564] S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou, J. Li, B. He, and H. Zhang, “KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 6101–6119.
[565] X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form generation via multi-task learning for complex question answering over knowledge bases,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, 2022, pp. 1687–1696.
[566] S. Longpre, Y. Lu, and J. Daiber, “MKQA: A linguistically diverse benchmark for multilingual open domain question answering,” Trans. Assoc. Comput. Linguistics, vol. 9, pp. 1389–1406, 2021.
[567] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhattacharyya, “Scienceqa: a novel resource for question answering on scholarly articles,” Int. J. Digit. Libr., vol. 23, no. 3, pp. 289–301, 2022.
[568] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? A new dataset for open book question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 2381–2391.
[569] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, 2016.
[570] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal, “QASC: A dataset for question answering via sentence composition,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8082–8090.
[571] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 2383–2392.
[572] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston, “Key-value memory networks for directly reading documents,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 1400–1409.
[573] B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “Assessing the factual accuracy of generated text,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, 2019, pp. 166–175.
[574] K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in Proceedings of the 3rd Workshop on Continuous Vector
Space Models and their Compositionality, CVSC 2015, Beijing, China, July 26-31, 2015, 2015, pp. 57–66.
[575] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250.
[576] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2d knowledge graph embeddings,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 1811–1818.
[577] G. A. Miller, “Wordnet: A lexical database for english,” Commun. ACM, pp. 39–41, 1995.
[578] F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis, A. Bakhtin, Y. Wu, and A. H. Miller, “Language models as knowledge bases?” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 2463–2473.
[579] F. Mahdisoltani, J. Biega, and F. M. Suchanek, “YAGO3: A knowledge base from multilingual wikipedias,” in Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.
[580] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, 2007, pp. 697–706.
[581] Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning, “Hotpotqa: A dataset for diverse, explainable multi-hop question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018. Association for Computational Linguistics, 2018, pp. 2369–2380.
[582] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2924–2936.
[583] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, “Socialiqa: Commonsense reasoning about social interactions,” CoRR, vol. abs/1904.09728, 2019.
[584] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 4791–4800.
[585] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” in AAAI. AAAI Press, 2020, pp. 8732–8740.
[586] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI, 2011.
[587] K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon, P. Clark, and Y. Choi, “proscript: Partially ordered scripts generation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 2138–2149.
[588] B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark, “Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 1595–1604.
[589] S. Saha, P. Yadav, L. Bauer, and M. Bansal, “Explagraphs: An explanation graph generation task for structured commonsense reasoning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7716–7740.
[590] O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: Generating implications, proofs, and abductive statements over natural language,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, ser. Findings of ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol. ACL/IJCNLP 2021. Association for Computational Linguistics, 2021, pp. 3621–3634.
[591] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark, “Explaining answers with entailment trees,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7358–7370.
[592] A. Saparov and H. He, “Language models are greedy reasoners: A systematic formal analysis of chain-of-thought,” CoRR, vol. abs/2210.01240, 2022.
[593] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur, “Exploring length generalization in large language models,” CoRR, vol. abs/2207.04901, 2022.
[594] A. Patel, S. Bhattamishra, and N. Goyal, “Are NLP models really able to solve simple math word problems?” in NAACL-HLT. Association for Computational Linguistics, 2021, pp. 2080–2094.
[595] S. Roy and D. Roth, “Solving general arithmetic word problems,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, Eds. The Association for Computational Linguistics, 2015, pp. 1743–1752.
[596] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi, “Mathqa: Towards interpretable math word problem solving with operation-based formalisms,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2357–2367.
[597] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Program induction by rationale generation: Learning to solve and explain algebraic word problems,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan, Eds. Association for Computational Linguistics, 2017, pp. 158–167.
[598] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, and H. Hajishirzi, “Mawps: A math word problem repository,” in Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 2016, pp. 1152–1157.
[599] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, “DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–2378.
[600] S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi, and K. Cho, “Naturalproofs: Mathematical theorem proving in natural language,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung, Eds., 2021.
[601] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, “Lisa: Language models of isabelle proofs,” in 6th Conference on Artificial Intelligence and Theorem Proving, 2021, pp. 378–392.
[602] K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-system benchmark for formal olympiad-level mathematics,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[603] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” CoRR, vol. abs/2302.12433, 2023.
[604] J. Li, X. Cheng, W. X. Zhao, J. Nie, and J. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” CoRR, vol. abs/2305.11747, 2023.
[605] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “Crows-pairs: A challenge dataset for measuring social biases in masked language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020, pp. 1953–1967.
[606] R. Rudinger, J. Naradowsky, B. Leonard, and B. V. Durme, “Gender bias in coreference resolution,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 2018, pp. 8–14.
[607] S. Gehman, S. Gururangan, M. Sap, Y. Choi, and N. A. Smith, “Realtoxicityprompts: Evaluating neural toxic degeneration in language models,” in Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, ser. Findings of ACL, T. Cohn, Y. He, and Y. Liu, Eds., vol. EMNLP 2020. Association for Computational Linguistics, 2020, pp. 3356–3369.
[608] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, “Virtualhome: Simulating household activities via programs,” in CVPR. Computer Vision Foundation / IEEE Computer Society, 2018, pp. 8494–8502.
[609] S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-Fei, “BEHAVIOR: benchmark for everyday household activities in virtual, interactive, and ecological environments,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2021, pp. 477–490.
[610] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in CVPR. Computer Vision Foundation / IEEE, 2020, pp. 10 737–10 746.
[611] M. Shridhar, X. Yuan, M. Côté, Y. Bisk, A. Trischler, and M. J. Hausknecht, “Alfworld: Aligning text and embodied environments for interactive learning,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
[612] S. Yao, H. Chen, J. Yang, and K. Narasimhan, “Webshop: Towards scalable real-world web interaction with grounded language agents,” in NeurIPS, 2022.
[613] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su, “Mind2web: Towards a generalist agent for the web,” CoRR, vol. abs/2306.06070, 2023.
[614] W. H. Guss, B. Houghton, N. Topin, P. Wang, C. Codel, M. Veloso, and R. Salakhutdinov, “Minerl: A large-scale dataset of minecraft demonstrations,” in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, S. Kraus, Ed. ijcai.org, 2019, pp. 2442–2448.
[615] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D. Huang, Y. Zhu, and A. Anandkumar, “Minedojo: Building open-ended embodied agents with internet-scale knowledge,” in NeurIPS, 2022.
[616] P. Lu, L. Qiu, K. Chang, Y. N. Wu, S. Zhu, T. Rajpurohit, P. Clark, and A. Kalyan, “Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning,” CoRR, vol. abs/2209.14610, 2022.
[617] B. Zhang, K. Zhou, X. Wei, W. X. Zhao, J. Sha, S. Wang, and J. rong Wen, “Evaluating and improving tool-augmented computation-intensive math reasoning,” CoRR, vol. abs/2306.02408, 2023.
[618] R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: Teaching large language model to use tools via self-instruction,” CoRR, vol. abs/2305.18752, 2023.
[619] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” CoRR, vol. abs/2305.15334, 2023.
[620] W. Yih, M. Richardson, C. Meek, M. Chang, and J. Suh, “The value of semantic parse labeling for knowledge base question answering,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics, 2016.
[621] H. Puerto, G. G. Sahin, and I. Gurevych, “Metaqa: Combining expert agents for multi-skill question answering,” in Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2023, Dubrovnik, Croatia, May 2-6, 2023, A. Vlachos and I. Augenstein, Eds. Association for Computational Linguistics, 2023, pp. 3548–3562.
[622] P. Pasupat and P. Liang, “Compositional semantic parsing on semi-structured tables,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers. The Association for Computer Linguistics, 2015, pp. 1470–1480.
[623] V. Zhong, C. Xiong, and R. Socher, “Seq2sql: Generating structured queries from natural language using reinforcement learning,” CoRR, vol. abs/1709.00103, 2017.
[624] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, and W. Y. Wang, “Tabfact: A large-scale dataset for table-based fact verification,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[625] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev, “Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, Eds. Association for Computational Linguistics, 2018, pp. 3911–3921.
[626] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[627] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.
[628] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Association for Computational Linguistics, Jul. 2004, pp. 74–81.
[629] W. Jiao, W. Wang, J.-t. Huang, X. Wang, and Z. Tu, “Is chatgpt a good translator? a preliminary study,” arXiv preprint arXiv:2301.08745, 2023.
[630] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. R. McKeown, and T. B. Hashimoto, “Benchmarking large language models for news summarization,” CoRR, vol. abs/2301.13848, 2023.
[631] T. Goyal, J. J. Li, and G. Durrett, “News summarization and evaluation in the era of GPT-3,” CoRR, vol. abs/2209.12356, 2022.
[632] S. Gehrmann, E. Clark, and T. Sellam, “Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text,” CoRR, vol. abs/2202.06935, 2022.
[633] J. Wang, Y. Liang, F. Meng, H. Shi, Z. Li, J. Xu, J. Qu, and J. Zhou, “Is chatgpt a good NLG evaluator? A preliminary study,” CoRR, vol. abs/2303.04048, 2023.
[634] Y. Liu, D. Iter, Y. Xu, S. Wang, R. Xu, and C. Zhu, “G-eval: NLG evaluation using GPT-4 with better human alignment,” CoRR, vol. abs/2303.16634, 2023.
[635] K. Yang, Y. Tian, N. Peng, and D. Klein, “Re3: Generating longer stories with recursive reprompting and revision,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 4393–4479.
[636] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou, R. Cotterell, and M. Sachan, “Recurrentgpt: Interactive generation of (arbitrarily) long text,” CoRR, vol. abs/2305.13304, 2023.
[637] S. Gulwani, O. Polozov, and R. Singh, “Program synthesis,” Found. Trends Program. Lang., vol. 4, no. 1-2, pp. 1–119, 2017.
[638] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum,
and C. Gan, “Planning with large language models Advances in Artificial Intelligence (EAAI-18), New Or-
for code generation,” 2023. leans, Louisiana, USA, February 2-7, 2018, 2018, pp.
[639] M. Welsh, “The end of programming,” Commun. 3390–3398.
ACM, vol. 66, no. 1, pp. 34–35, 2023. [653] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak,
[640] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, M. Yasunaga, C. Wu, M. Zhong, P. Yin, S. I. Wang,
B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, V. Zhong, B. Wang, C. Li, C. Boyle, A. Ni, Z. Yao,
Y. Xu, and P. Fung, “A multitask, multilingual, mul- D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith,
timodal evaluation of chatgpt on reasoning, halluci- L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and
nation, and interactivity,” CoRR, vol. abs/2302.04023, multi-tasking structured knowledge grounding with
2023. text-to-text language models,” in EMNLP. Associ-
[641] Y. Liu, A. R. Fabbri, P. Liu, Y. Zhao, L. Nan, R. Han, ation for Computational Linguistics, 2022, pp. 602–
S. Han, S. R. Joty, C. Wu, C. Xiong, and D. Radev, 631.
“Revisiting the gold standard: Grounding summa- [654] A. Roberts, C. Raffel, and N. Shazeer, “How much
rization evaluation with robust human evaluation,” knowledge can you pack into the parameters of a lan-
CoRR, vol. abs/2212.07981, 2022. guage model?” in Proceedings of the 2020 Conference
[642] A. R. Fabbri, W. Kryscinski, B. McCann, C. Xiong, on Empirical Methods in Natural Language Processing,
R. Socher, and D. R. Radev, “Summeval: Re- EMNLP 2020, Online, November 16-20, 2020, 2020, pp.
evaluating summarization evaluation,” Trans. Assoc. 5418–5426.
Comput. Linguistics, vol. 9, pp. 391–409, 2021. [655] G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hos-
[643] T. Tang, H. Lu, Y. E. Jiang, H. Huang, D. Zhang, W. X. seini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin,
Zhao, and F. Wei, “Not all metrics are guilty: Improv- S. Riedel, and E. Grave, “Few-shot learning with
ing NLG evaluation with LLM paraphrasing,” CoRR, retrieval augmented language models,” CoRR, vol.
vol. abs/2305.15067, 2023. abs/2208.03299, 2022.
[644] X. Wang, X. Tang, W. X. Zhao, J. Wang, and J. Wen, [656] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang,
“Rethinking the evaluation for conversational rec- “Retrieval augmented language model pre-training,”
ommendation in the era of large language models,” in Proceedings of the 37th International Conference on
CoRR, vol. abs/2305.13112, 2023. Machine Learning, ICML 2020, 13-18 July 2020, Virtual
[645] M. Gao, J. Ruan, R. Sun, X. Yin, S. Yang, and X. Wan, Event, 2020, pp. 3929–3938.
“Human-like summarization evaluation with chat- [657] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni,
gpt,” CoRR, vol. abs/2304.02554, 2023. V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih,
[646] Y. Ji, Y. Gong, Y. Peng, C. Ni, P. Sun, D. Pan, B. Ma, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-
and X. Li, “Exploring chatgpt’s ability to rank con- augmented generation for knowledge-intensive NLP
tent: A preliminary study on consistency with hu- tasks,” in Advances in Neural Information Processing
man preferences,” CoRR, vol. abs/2303.07610, 2023. Systems 33: Annual Conference on Neural Information
[647] Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu, Processing Systems 2020, NeurIPS 2020, December 6-12,
K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou, 2020, virtual, 2020.
“Benchmarking foundation models with language- [658] Y. Lan, G. He, J. Jiang, J. Jiang, W. X. Zhao, and J. Wen,
model-as-an-examiner,” CoRR, vol. abs/2306.04181, “Complex knowledge base question answering: A
2023. survey,” CoRR, vol. abs/2108.06688, 2021.
[648] Y. Liu, S. Feng, D. Wang, Y. Zhang, and H. Schütze, [659] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai,
“Evaluate what you can’t evaluate: Unassess- E. Rutherford, K. Millican, G. van den Driessche,
able generated responses quality,” CoRR, vol. J. Lespiau, B. Damoc, A. Clark, D. de Las Casas,
abs/2305.14658, 2023. A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang,
[649] P. Wang, L. Li, L. Chen, D. Zhu, B. Lin, Y. Cao, Q. Liu, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Pa-
T. Liu, and Z. Sui, “Large language models are not ganini, G. Irving, O. Vinyals, S. Osindero, K. Si-
fair evaluators,” CoRR, vol. abs/2305.17926, 2023. monyan, J. W. Rae, E. Elsen, and L. Sifre, “Improv-
[650] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, ing language models by retrieving from trillions of
Z. Zhou, C. Gong, Y. Shen, J. Zhou, S. Chen, T. Gui, tokens,” in International Conference on Machine Learn-
Q. Zhang, and X. Huang, “A comprehensive capabil- ing, ICML 2022, 17-23 July 2022, Baltimore, Maryland,
ity analysis of gpt-3 and gpt-3.5 series models,” arXiv USA, ser. Proceedings of Machine Learning Research,
preprint arXiv:2303.10420, 2023. K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári,
[651] M. McCloskey and N. J. Cohen, “Catastrophic in- G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022,
terference in connectionist networks: The sequential pp. 2206–2240.
learning problem,” in Psychology of learning and moti- [660] S. Xu, L. Pang, H. Shen, X. Cheng, and T.-S. Chua,
vation, 1989, pp. 109–165. “Search-in-the-chain: Towards accurate, credible and
[652] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, traceable large language models for knowledge-
and C. Kanan, “Measuring catastrophic forgetting in intensive tasks,” CoRR, vol. abs/2304.14732, 2023.
neural networks,” in Proceedings of the Thirty-Second [661] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu,
AAAI Conference on Artificial Intelligence, (AAAI-18), Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao,
the 30th innovative Applications of Artificial Intelligence “Check your facts and try again: Improving large
(IAAI-18), and the 8th AAAI Symposium on Educational language models with external knowledge and au-
129

tomated feedback,” CoRR, vol. abs/2302.12813, 2023. 8493–8502.
[662] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi- [672] K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov,
Yu, Y. Yang, J. Callan, and G. Neubig, “Ac- “Locating and editing factual associations in gpt,”
tive retrieval augmented generation,” CoRR, vol. in Advances in Neural Information Processing Systems,
abs/2305.06983, 2023. 2022.
[663] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, [673] M. Geva, R. Schuster, J. Berant, and O. Levy, “Trans-
H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and former feed-forward layers are key-value memo-
T. Liu, “A survey on hallucination in large language ries,” in Proceedings of the 2021 Conference on Empirical
models: Principles, taxonomy, challenges, and open Methods in Natural Language Processing, EMNLP 2021,
questions,” CoRR, vol. abs/2311.05232, 2023. Virtual Event / Punta Cana, Dominican Republic, 7-
[664] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and 11 November, 2021, M. Moens, X. Huang, L. Specia,
J. Wen, “Evaluating object hallucination in large and S. W. Yih, Eds. Association for Computational
vision-language models,” CoRR, vol. abs/2305.10355, Linguistics, 2021, pp. 5484–5495.
2023. [674] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng,
[665] S. Kadavath, T. Conerly, A. Askell, T. J. Henighan, H. Chen, and N. Zhang, “Editing large language
D. Drain, E. Perez, N. Schiefer, Z. Dodds, N. Das- models: Problems, methods, and opportunities,”
Sarma, E. Tran-Johnson, S. Johnston, S. El-Showk, CoRR, vol. abs/2305.13172, 2023.
A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, [675] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian,
S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Ja- M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and
cobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, H. Chen, “Easyedit: An easy-to-use knowledge edit-
C. Olsson, S. Ringer, D. Amodei, T. B. Brown, J. Clark, ing framework for large language models,” CoRR,
N. Joseph, B. Mann, S. McCandlish, C. Olah, and vol. abs/2308.07269, 2023.
J. Kaplan, “Language models (mostly) know what [676] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and
they know,” CoRR, vol. abs/2207.05221, 2022. W. Chen, “Synthetic prompting: Generating chain-of-
[666] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheck- thought demonstrations for large language models,”
gpt: Zero-resource black-box hallucination detection CoRR, vol. abs/2302.00618, 2023.
for generative large language models,” ArXiv, vol. [677] Sifatkaur, M. Singh, V. S. B, and N. Malviya, “Mind
abs/2305.06983, 2023. meets machine: Unravelling gpt-4’s cognitive psy-
[667] S. Agarwal, I. Akkaya, V. Balcom, M. Bavarian, chology,” CoRR, vol. abs/2303.11436, 2023.
G. Bernadett-Shapiro, G. Brockman, M. Brundage, [678] M. I. Nye, A. J. Andreassen, G. Gur-Ari,
J. Chan, F. Chantzis, N. Deutsch, B. Eastman, A. Eleti, H. Michalewski, J. Austin, D. Bieber, D. Dohan,
N. Felix, S. P. Fishman, I. Fulford, C. Gibson, J. Gross, A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and
M. Heaton, J. Hilton, X. Hu, S. Jain, H. Jin, L. Kil- A. Odena, “Show your work: Scratchpads for inter-
patrick, C. Kim, M. Kolhede, A. Mayne, P. McMil- mediate computation with language models,” CoRR,
lan, D. Medina, J. Menick, A. Mishchenko, A. Nair, vol. abs/2112.00114, 2021.
R. Nayak, A. Neelakantan, R. Nuttall, J. Parish, [679] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limita-
A. T. Passos, A. Perelman, F. de Avila Belbute Peres, tions of language models in arithmetic and symbolic
V. Pong, J. Schulman, E. Sigler, N. Staudacher, N. Tur- induction,” CoRR, vol. abs/2208.05051, 2022.
ley, J. Tworek, R. Greene, A. Vijayvergiya, C. Voss, [680] W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou,
J. Weng, M. Wiethoff, S. Yoo, K. Yu, W. Zaremba, J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, “Ji-
S. Zhao, W. Zhuk, and B. Zoph, “Chatgpt plugins,” uzhang: A chinese pre-trained language model for
OpenAI Blog, March 2023. mathematical problem understanding,” in KDD ’22:
[668] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and The 28th ACM SIGKDD Conference on Knowledge Dis-
N. Grigorev, “Internet-augmented language models covery and Data Mining, Washington, DC, USA, August
through few-shot prompting for open-domain ques- 14 - 18, 2022, A. Zhang and H. Rangwala, Eds. ACM,
tion answering,” CoRR, vol. abs/2203.05115, 2022. 2022, pp. 4571–4581.
[669] H. Qian, Y. Zhu, Z. Dou, H. Gu, X. Zhang, Z. Liu, [681] Q. Wang, C. Kaliszyk, and J. Urban, “First experi-
R. Lai, Z. Cao, J. Nie, and J. Wen, “Webbrain: Learn- ments with neural translation of informal to formal
ing to generate factually correct articles for queries mathematics,” in Intelligent Computer Mathematics -
by grounding on large web corpus,” CoRR, vol. 11th International Conference, CICM 2018, Hagenberg,
abs/2304.04358, 2023. Austria, August 13-17, 2018, Proceedings, ser. Lecture
[670] J. Liu, J. Jin, Z. Wang, J. Cheng, Z. Dou, and J. Wen, Notes in Computer Science, F. Rabe, W. M. Farmer,
“RETA-LLM: A retrieval-augmented large language G. O. Passmore, and A. Youssef, Eds., vol. 11006.
model toolkit,” CoRR, vol. abs/2306.05212, 2023. Springer, 2018, pp. 255–270.
[671] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei, [682] S. Polu and I. Sutskever, “Generative language mod-
“Knowledge neurons in pretrained transformers,” in eling for automated theorem proving,” CoRR, vol.
Proceedings of the 60th Annual Meeting of the Asso- abs/2009.03393, 2020.
ciation for Computational Linguistics (Volume 1: Long [683] A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski,
Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, T. Odrzygózdz, P. Milos, Y. Wu, and M. Jamnik,
S. Muresan, P. Nakov, and A. Villavicencio, Eds. “Thor: Wielding hammers to integrate language
Association for Computational Linguistics, 2022, pp. models and automated theorem provers,” CoRR, vol.
130

abs/2205.10893, 2022. [698] X. Zhu, Y. Chen, H. Tian, C. Tao, W. Su, C. Yang,
[684] S. Polu, J. M. Han, K. Zheng, M. Baksys, G. Huang, B. Li, L. Lu, X. Wang, Y. Qiao, Z. Zhang,
I. Babuschkin, and I. Sutskever, “Formal mathe- and J. Dai, “Ghost in the minecraft: Generally capa-
matics statement curriculum learning,” CoRR, vol. ble agents for open-world environments via large
abs/2202.01344, 2022. language models with text-based knowledge and
[685] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, memory,” CoRR, vol. abs/2305.17144, 2023.
M. Jamnik, and C. Szegedy, “Autoformalization with [699] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao,
large language models,” CoRR, vol. abs/2205.12615, Y. Zhu, L. Fan, and A. Anandkumar, “Voyager: An
2022. open-ended embodied agent with large language
[686] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, models,” CoRR, vol. abs/2305.16291, 2023.
M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, “Draft, [700] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes,
sketch, and prove: Guiding formal theorem provers B. David, C. Finn, K. Gopalakrishnan, K. Hausman,
with informal proofs,” CoRR, vol. abs/2210.12283, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Ir-
2022. pan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth,
[687] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang,
S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, K. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor,
Y. Yang, S. Welleck, B. P. Majumder, S. Gupta, A. Yaz- J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Ser-
danbakhsh, and P. Clark, “Self-refine: Iterative refine- manet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke,
ment with self-feedback,” CoRR, vol. abs/2303.17651, F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan, “Do as
2023. I can, not as I say: Grounding language in robotic
[688] N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an affordances,” CoRR, vol. abs/2204.01691, 2022.
autonomous agent with dynamic memory and self- [701] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman,
reflection,” CoRR, vol. abs/2303.11366, 2023. B. Ichter, P. Florence, and A. Zeng, “Code as policies:
[689] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, Language model programs for embodied control,”
and W. Chen, “CRITIC: large language models can CoRR, vol. abs/2209.07753, 2022.
self-correct with tool-interactive critiquing,” CoRR, [702] Y. Fu, H. Peng, T. Khot, and M. Lapata, “Improv-
vol. abs/2305.11738, 2023. ing language model negotiation with self-play and
[690] J. Uesato, N. Kushman, R. Kumar, H. F. Song, in-context learning from AI feedback,” CoRR, vol.
N. Y. Siegel, L. Wang, A. Creswell, G. Irving, and abs/2305.10142, 2023.
I. Higgins, “Solving math word problems with [703] N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H.
process- and outcome-based feedback,” CoRR, vol. Awadallah, and J. Kiseleva, “Improving grounded
abs/2211.14275, 2022. language understanding in a collaborative environ-
[691] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, ment by interacting with agents through help feed-
B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, back,” CoRR, vol. abs/2304.10750, 2023.
and K. Cobbe, “Let’s verify step by step,” CoRR, vol. [704] S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez,
abs/2305.20050, 2023. “Gorilla: Large language model connected with mas-
[692] Z. Yuan, H. Yuan, C. Tan, W. Wang, and S. Huang, sive apis,” CoRR, vol. abs/2305.15334, 2023.
“How well do large language models perform in [705] S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt:
arithmetic tasks?” CoRR, vol. abs/2304.02015, 2023. Augmenting frozen language models with mas-
[693] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu, sive tools via tool embeddings,” CoRR, vol.
Y. Gao, J. Lou, and W. Chen, “Reasoning like pro- abs/2305.11554, 2023.
gram executors,” in Proceedings of the 2022 Conference [706] Y. Liang, C. Wu, T. Song, W. Wu, Y. Xia, Y. Liu, Y. Ou,
on Empirical Methods in Natural Language Processing, S. Lu, L. Ji, S. Mao, Y. Wang, L. Shou, M. Gong,
EMNLP 2022, Abu Dhabi, United Arab Emirates, De- and N. Duan, “Taskmatrix.ai: Completing tasks by
cember 7-11, 2022, 2022, pp. 761–779. connecting foundation models with millions of apis,”
[694] H. Zhou, A. Nova, H. Larochelle, A. C. Courville, CoRR, vol. abs/2303.16434, 2023.
B. Neyshabur, and H. Sedghi, “Teaching algorith- [707] T. Cai, X. Wang, T. Ma, X. Chen, and D. Zhou,
mic reasoning via in-context learning,” CoRR, vol. “Large language models as tool makers,” CoRR, vol.
abs/2211.09066, 2022. abs/2305.17126, 2023.
[695] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: [708] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang,
tool augmented language models,” CoRR, vol. H. Yu, and J. Han, “Large language models can self-
abs/2205.12255, 2022. improve,” CoRR, vol. abs/2210.11610, 2022.
[696] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, [709] E. Beeching, C. Fourrier, N. Habib, S. Han,
“Language models as zero-shot planners: Extract- N. Lambert, N. Rajani, O. Sanseviero,
ing actionable knowledge for embodied agents,” in L. Tunstall, and T. Wolf, “Open llm leaderboard,”
ICML, ser. Proceedings of Machine Learning Re- https://huggingface.co/spaces/HuggingFaceH4/
search, vol. 162. PMLR, 2022, pp. 9118–9147. open llm leaderboard, 2023.
[697] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, [710] W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang,
and P. Oudeyer, “Grounding large language models A. Saied, W. Chen, and N. Duan, “Agieval: A human-
in interactive environments with online reinforce- centric benchmark for evaluating foundation mod-
ment learning,” CoRR, vol. abs/2302.02662, 2023. els,” CoRR, vol. abs/2304.06364, 2023.
131

[711] H. Zeng, “Measuring massive multitask chinese un- D. Arpit, R. Xu, P. Mui, H. Wang, C. Xiong, and
derstanding,” CoRR, vol. abs/2304.12986, 2023. S. Savarese, “BOLAA: benchmarking and orchestrat-
[712] C. Liu, R. Jin, Y. Ren, L. Yu, T. Dong, X. Peng, ing llm-augmented autonomous agents,” CoRR, vol.
S. Zhang, J. Peng, P. Zhang, Q. Lyu, X. Su, Q. Liu, abs/2308.05960, 2023.
and D. Xiong, “M3KE: A massive multi-level multi- [725] X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai,
subject knowledge evaluation benchmark for chinese Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng,
large language models,” CoRR, vol. abs/2305.10263, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang,
2023. Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang,
[713] Y. Huang, Y. Bai, Z. Zhu, J. Zhang, J. Zhang, T. Su, “Agentbench: Evaluating llms as agents,” CoRR, vol.
J. Liu, C. Lv, Y. Zhang, J. Lei, Y. Fu, M. Sun, and abs/2308.03688, 2023.
J. He, “C-eval: A multi-level multi-discipline chinese [726] K. Zhu, J. Wang, J. Zhou, Z. Wang, H. Chen, Y. Wang,
evaluation suite for foundation models,” CoRR, vol. L. Yang, W. Ye, N. Z. Gong, Y. Zhang, and X. Xie,
abs/2305.08322, 2023. “Promptbench: Towards evaluating the robustness
[714] Z. Gu, X. Zhu, H. Ye, L. Zhang, J. Wang, S. Jiang, of large language models on adversarial prompts,”
Z. Xiong, Z. Li, Q. He, R. Xu, W. Huang, W. Zheng, CoRR, vol. abs/2306.04528, 2023.
H. Feng, and Y. Xiao, “Xiezhi: An ever-updating [727] R. S. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du,
benchmark for holistic domain knowledge evalua- S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang,
tion,” CoRR, vol. abs/2306.05783, 2023. “WHEN FLUE MEETS FLANG: benchmarks and
[715] O. Contributors, “Opencompass: A universal eval- large pre-trained language model for financial do-
uation platform for foundation models,” https:// main,” CoRR, vol. abs/2211.00083, 2022.
github.com/InternLM/OpenCompass, 2023. [728] N. Guha, D. E. Ho, J. Nyarko, and C. Ré, “Legal-
[716] Y. Fu, L. Ou, M. Chen, Y. Wan, H. Peng, and bench: Prototyping a collaborative benchmark for
T. Khot, “Chain-of-thought hub: A continuous effort legal reasoning,” CoRR, vol. abs/2209.06120, 2022.
to measure large language models’ reasoning perfor- [729] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu,
mance,” CoRR, vol. abs/2305.17306, 2023. Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang,
[717] J. Yu, X. Wang, S. Tu, S. Cao, D. Zhang-li, X. Lv, J. E. Gonzalez, and I. Stoica, “Judging llm-as-a-
H. Peng, Z. Yao, X. Zhang, H. Li, C. Li, Z. Zhang, judge with mt-bench and chatbot arena,” CoRR, vol.
Y. Bai, Y. Liu, A. Xin, N. Lin, K. Yun, L. Gong, J. Chen, abs/2306.05685, 2023.
Z. Wu, Y. Qi, W. Li, Y. Guan, K. Zeng, J. Qi, H. Jin, [730] X. Wang, Z. Hu, P. Lu, Y. Zhu, J. Zhang, S. Sub-
J. Liu, Y. Gu, Y. Yao, N. Ding, L. Hou, Z. Liu, B. Xu, ramaniam, A. R. Loomba, S. Zhang, Y. Sun, and
J. Tang, and J. Li, “Kola: Carefully benchmarking W. Wang, “Scibench: Evaluating college-level sci-
world knowledge of large language models,” CoRR, entific problem-solving abilities of large language
vol. abs/2306.09296, 2023. models,” CoRR, vol. abs/2307.10635, 2023.
[718] T. Sawada, D. Paleka, A. Havrilla, P. Tadepalli, P. Vi- [731] X. Li, T. Zhang, Y. Dubois, R. Taori, I. Gulrajani,
das, A. Kranias, J. J. Nay, K. Gupta, and A. Komat- C. Guestrin, P. Liang, and T. B. Hashimoto, “Al-
suzaki, “ARB: advanced reasoning benchmark for pacaeval: An automatic evaluator of instruction-
large language models,” CoRR, vol. abs/2307.13692, following models,” https://github.com/tatsu-lab/
2023. alpaca eval, 2023.
[719] Y. Peng, S. Li, W. Gu, Y. Li, W. Wang, C. Gao, and [732] Y. Huang, Q. Zhang, P. S. Yu, and L. Sun, “Trustgpt:
M. R. Lyu, “Revisiting, benchmarking and exploring A benchmark for trustworthy and responsible large
API recommendation: How far are we?” IEEE Trans. language models,” CoRR, vol. abs/2306.11507, 2023.
Software Eng., vol. 49, no. 4, pp. 1876–1897, 2023. [733] Y. Bai, J. Ying, Y. Cao, X. Lv, Y. He, X. Wang, J. Yu,
[720] M. Li, F. Song, B. Yu, H. Yu, Z. Li, F. Huang, and Y. Li, K. Zeng, Y. Xiao, H. Lyu, J. Zhang, J. Li, and L. Hou,
“Api-bank: A benchmark for tool-augmented llms,” “Benchmarking foundation models with language-
CoRR, vol. abs/2304.08244, 2023. model-as-an-examiner,” CoRR, vol. abs/2306.04181,
[721] Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and 2023.
L. Sun, “Toolalpaca: Generalized tool learning for [734] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
language models with 3000 simulated cases,” CoRR, and Z. Liu, “Chateval: Towards better llm-based
vol. abs/2306.05301, 2023. evaluators through multi-agent debate,” CoRR, vol.
[722] Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang, abs/2308.07201, 2023.
“On the tool manipulation capability of open-source [735] Y. Chang, X. Wang, J. Wang, Y. Wu, K. Zhu, H. Chen,
large language models,” CoRR, vol. abs/2305.16504, L. Yang, X. Yi, C. Wang, Y. Wang, W. Ye, Y. Zhang,
2023. Y. Chang, P. S. Yu, Q. Yang, and X. Xie, “A survey
[723] Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, on evaluation of large language models,” CoRR, vol.
X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, abs/2307.03109, 2023.
J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “Tool- [736] Z. Zhuang, Q. Chen, L. Ma, M. Li, Y. Han, Y. Qian,
llm: Facilitating large language models to master H. Bai, Z. Feng, W. Zhang, and T. Liu, “Through
16000+ real-world apis,” CoRR, vol. abs/2307.16789, the lens of core competency: Survey on evaluation of
2023. large language models,” CoRR, vol. abs/2308.07902,
[724] Z. Liu, W. Yao, J. Zhang, L. Xue, S. Heinecke, 2023.
R. Murthy, Y. Feng, Z. Chen, J. C. Niebles, [737] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Gar-
132

rette, M. Collins, and T. Kwiatkowski, “Tydi QA: 2018. IEEE Computer Society, 2018, pp. 197–206.
A benchmark for information-seeking question an- [748] B. Yang, C. Han, Y. Li, L. Zuo, and Z. Yu, “Improv-
swering in typologically diverse languages,” Trans. ing conversational recommendation systems’ quality
Assoc. Comput. Linguistics, vol. 8, pp. 454–470, 2020. with context-aware item meta-information,” in Find-
[738] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi, ings of the Association for Computational Linguistics:
C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muen- NAACL 2022, Seattle, WA, United States, July 10-15,
nighoff, J. Phang, L. Reynolds, E. Tang, A. Thite, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruı́z,
B. Wang, K. Wang, and A. Zou, “A framework for Eds. Association for Computational Linguistics,
few-shot language model evaluation,” Sep. 2021. 2022, pp. 38–48.
[739] R. Shah, K. Chawla, D. Eidnani, A. Shah, W. Du, [749] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cap-
S. Chava, N. Raman, C. Smiley, J. Chen, and D. Yang, pelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Hes-
“When flue meets flang: Benchmarks and large pre- low, J. Launay, Q. Malartic, B. Noune, B. Pannier,
trained language model for financial domain,” in and G. Penedo, “Falcon-40B: an open large language
Proceedings of the 2022 Conference on Empirical Methods model with state-of-the-art performance,” 2023.
in Natural Language Processing, 2022, pp. 2322–2335. [750] S. Martin, J. Liermann, and H. Ney, “Algorithms for
[740] K. Zhou, Y. Zhu, Z. Chen, W. Chen, W. X. Zhao, bigram and trigram word clustering,” Speech commu-
X. Chen, Y. Lin, J.-R. Wen, and J. Han, “Don’t make nication, vol. 24, no. 1, pp. 19–37, 1998.
your llm an evaluation benchmark cheater,” arXiv [751] R. Navigli, “Word sense disambiguation: A survey,”
preprint arXiv:2311.01964, 2023. ACM computing surveys (CSUR), vol. 41, no. 2, pp.
[741] C. Zan, K. Peng, L. Ding, B. Qiu, B. Liu, S. He, Q. Lu, 1–69, 2009.
Z. Zhang, C. Liu, W. Liu, Y. Zhan, and D. Tao, “Vega- [752] W. H. Gomaa, A. A. Fahmy et al., “A survey of
mt: The JD explore academy machine translation text similarity approaches,” international journal of
system for WMT22,” in Proceedings of the Seventh Con- Computer Applications, vol. 68, no. 13, pp. 13–18, 2013.
ference on Machine Translation, WMT 2022, Abu Dhabi, [753] S. Minaee, N. Kalchbrenner, E. Cambria, N. Nikzad,
United Arab Emirates (Hybrid), December 7-8, 2022, M. Chenaghlu, and J. Gao, “Deep learning–based
P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chat- text classification: a comprehensive review,” ACM
terjee, M. R. Costa-jussà, C. Federmann, M. Fishel, computing surveys (CSUR), vol. 54, no. 3, pp. 1–40,
A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, 2021.
P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, [754] N. Alex, E. Lifland, L. Tunstall, A. Thakur, P. Maham,
T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Na- C. J. Riedel, E. Hine, C. Ashurst, P. Sedille, A. Carlier,
gata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Noetel, and A. Stuhlmüller, “RAFT: A real-world
M. Popel, M. Turchi, and M. Zampieri, Eds. Asso- few-shot text classification benchmark,” in NeurIPS
ciation for Computational Linguistics, 2022, pp. 411– Datasets and Benchmarks, 2021.
422. [755] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga,
[742] Y. Zhao, M. Khalman, R. Joshi, S. Narayan, M. Saleh, and D. Yang, “Is chatgpt a general-purpose nat-
and P. J. Liu, “Calibrating sequence likelihood ural language processing task solver?” CoRR, vol.
improves conditional language generation,” CoRR, abs/2302.06476, 2023.
vol. abs/2210.00045, 2022. [Online]. Available: [756] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng,
https://doi.org/10.48550/arXiv.2210.00045 J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How
[743] D. Khashabi, S. Min, T. Khot, A. Sabharwal, robust is gpt-3.5 to predecessors? a comprehensive
O. Tafjord, P. Clark, and H. Hajishirzi, “Unifiedqa: study on language understanding tasks,” 2023.
Crossing format boundaries with a single QA sys- [757] D. Nadeau and S. Sekine, “A survey of named entity
tem,” in EMNLP (Findings), ser. Findings of ACL, recognition and classification,” Lingvisticae Investiga-
vol. EMNLP 2020. Association for Computational tiones, vol. 30, no. 1, pp. 3–26, 2007.
Linguistics, 2020, pp. 1896–1907. [758] A. Ratnaparkhi, “A maximum entropy model for
[744] X. Zhu, J. Wang, L. Zhang, Y. Zhang, R. Gan, part-of-speech tagging,” in Conference on empirical
J. Zhang, and Y. Yang, “Solving math word problem methods in natural language processing, 1996.
via cooperative reasoning induced language mod- [759] V. Yadav and S. Bethard, “A survey on recent
els,” arXiv preprint arXiv:2210.16257, 2022. advances in named entity recognition from deep
[745] A. Nguyen, N. Karampatziakis, and W. Chen, “Meet learning models,” in Proceedings of the 27th Interna-
in the middle: A new pre-training paradigm,” tional Conference on Computational Linguistics, 2018,
CoRR, vol. abs/2303.07295, 2023. [Online]. Available: pp. 2145–2158.
https://doi.org/10.48550/arXiv.2303.07295 [760] F. Souza, R. Nogueira, and R. Lotufo, “Portuguese
[746] H. Li, J. Zhang, C. Li, and H. Chen, “RESDSQL: named entity recognition using bert-crf,” arXiv
decoupling schema linking and skeleton parsing preprint arXiv:1909.10649, 2019.
for text-to-sql,” CoRR, vol. abs/2302.05965, 2023. [761] S. Pawar, G. K. Palshikar, and P. Bhattacharyya,
[Online]. Available: https://doi.org/10.48550/arXiv. “Relation extraction: A survey,” arXiv preprint
2302.05965 arXiv:1712.05191, 2017.
[747] W. Kang and J. J. McAuley, “Self-attentive sequential [762] C. Walker and et al., “Ace 2005 multilingual training
recommendation,” in IEEE International Conference on corpus ldc2006t06,” Philadelphia, 2006.
Data Mining, ICDM 2018, Singapore, November 17-20, [763] J. Gao, H. Zhao, C. Yu, and R. Xu, “Exploring the
133

feasibility of chatgpt for event extraction,” CoRR, vol. “Is chatgpt good at search? investigating large lan-
abs/2303.03836, 2023. guage models as re-ranking agent,” arXiv preprint
[764] Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language arXiv:2304.09542, 2023.
model is not a good few-shot information extractor, [778] Z. Qin, R. Jagerman, K. Hui, H. Zhuang, J. Wu,
but a good reranker for hard samples!” CoRR, vol. J. Shen, T. Liu, J. Liu, D. Metzler, X. Wang et al.,
abs/2303.08559, 2023. “Large language models are effective text rankers
[765] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic with pairwise ranking prompting,” arXiv preprint
data generation of llms help clinical text mining?” arXiv:2306.17563, 2023.
arXiv preprint arXiv:2303.04360, 2023. [779] S. Cho, S. Jeong, J. Seo, and J. C. Park, “Discrete
[766] X. Wei, X. Cui, N. Cheng, X. Wang, X. Zhang, prompt optimization via constrained generation for
S. Huang, P. Xie, J. Xu, Y. Chen, M. Zhang et al., zero-shot re-ranker,” arXiv preprint arXiv:2305.13729,
“Zero-shot information extraction via chatting with 2023.
chatgpt,” arXiv preprint arXiv:2302.10205, 2023. [780] R. Tang, X. Zhang, X. Ma, J. Lin, and F. Ture,
[767] A. Vaswani, S. Bengio, E. Brevdo, F. Chollet, “Found in the middle: Permutation self-consistency
A. Gomez, S. Gouws, L. Jones, Ł. Kaiser, N. Kalch- improves listwise ranking in large language mod-
brenner, N. Parmar et al., “Tensor2tensor for neural els,” arXiv preprint arXiv:2310.07712, 2023.
machine translation,” in Proceedings of the 13th Con- [781] X. Ma, X. Zhang, R. Pradeep, and J. Lin, “Zero-shot
ference of the Association for Machine Translation in the listwise document reranking with a large language
Americas (Volume 1: Research Track), 2018, pp. 193–199. model,” arXiv preprint arXiv:2305.02156, 2023.
[768] B. Zhang, B. Haddow, and A. Birch, “Prompting [782] S. Zhuang, H. Zhuang, B. Koopman, and G. Zuccon,
large language model for machine translation: A case “A setwise approach for effective and highly effi-
study,” arXiv preprint arXiv:2301.07069, 2023. cient zero-shot ranking with large language models,”
[769] M. Ghazvininejad, H. Gonen, and L. Zettlemoyer, arXiv preprint arXiv:2310.09497, 2023.
“Dictionary-based phrase-level prompting of large [783] H. Zhuang, Z. Qin, K. Hui, J. Wu, L. Yan, X. Wang,
language models for machine translation,” arXiv and M. Berdersky, “Beyond yes and no: Improving
preprint arXiv:2302.07856, 2023. zero-shot llm rankers via scoring fine-grained rele-
[770] L. Wang, C. Lyu, T. Ji, Z. Zhang, D. Yu, S. Shi, vance labels,” arXiv preprint arXiv:2310.14122, 2023.
and Z. Tu, “Document-level machine transla- [784] N. Ziems, W. Yu, Z. Zhang, and M. Jiang, “Large
tion with large language models,” arXiv preprint language models are built-in autoregressive search
arXiv:2304.02210, 2023. engines,” arXiv preprint arXiv:2305.09612, 2023.
[771] W. Jiao, J.-t. Huang, W. Wang, X. Wang, S. Shi, and [785] X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin, “Fine-
Z. Tu, “Parrot: Translating during chat using large tuning llama for multi-stage text retrieval,” arXiv
language models,” arXiv preprint arXiv:2304.02426, preprint arXiv:2310.08319, 2023.
2023. [786] R. Pradeep, S. Sharifymoghaddam, and J. Lin,
[772] W. Yang, C. Li, J. Zhang, and C. Zong, “Bigtrans: “Rankvicuna: Zero-shot listwise document rerank-
Augmenting large language models with multi- ing with open-source large language models,” arXiv
lingual translation capability over 100 languages,” preprint arXiv:2309.15088, 2023.
arXiv preprint arXiv:2305.18098, 2023. [787] Y. Tay, V. Q. Tran, M. Dehghani, J. Ni, D. Bahri,
[773] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek, H. Mehta, Z. Qin, K. Hui, Z. Zhao, J. Gupta et al.,
D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza, “Transformer memory as a differentiable search in-
A. Janz, K. Kanclerz, A. Kocon, B. Koptyra, dex,” in Advances in Neural Information Processing
W. Mieleszczenko-Kowszewicz, P. Milkowski, Systems, 2022.
M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik, [788] R. Ren, W. X. Zhao, J. Liu, H. Wu, J.-R. Wen,
S. Wozniak, and P. Kazienko, “Chatgpt: Jack of all and H. Wang, “TOME: A two-stage approach for
trades, master of none,” CoRR, vol. abs/2302.10724, model-based retrieval,” in Proceedings of the 61st
2023. Annual Meeting of the Association for Computational
[774] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, Linguistics (Volume 1: Long Papers). Association
“Can chatgpt understand too? A comparative study for Computational Linguistics, 2023, pp. 6102–6114.
on chatgpt and fine-tuned BERT,” CoRR, vol. [Online]. Available: https://aclanthology.org/2023.
abs/2302.10198, 2023. acl-long.336
[775] D. Cheng, S. Huang, J. Bi, Y. Zhan, J. Liu, Y. Wang, [789] Y. Qu, Y. Ding, J. Liu, K. Liu, R. Ren, W. X. Zhao,
H. Sun, F. Wei, D. Deng, and Q. Zhang, “Uprise: D. Dong, H. Wu, and H. Wang, “Rocketqa: An op-
Universal prompt retrieval for improving zero-shot timized training approach to dense passage retrieval
evaluation,” arXiv preprint arXiv:2303.08518, 2023. for open-domain question answering,” in Proceedings
[776] R. Ren, Y. Qu, J. Liu, W. X. Zhao, Q. She, H. Wu, of the 2021 Conference of the North American Chapter
H. Wang, and J.-R. Wen, “Rocketqav2: A joint train- of the Association for Computational Linguistics: Human
ing method for dense passage retrieval and pas- Language Technologies, 2021, pp. 5835–5847.
sage re-ranking,” in Proceedings of the 2021 Conference [790] R. Ren, S. Lv, Y. Qu, J. Liu, W. X. Zhao, Q. She,
on Empirical Methods in Natural Language Processing, H. Wu, H. Wang, and J.-R. Wen, “Pair: Leverag-
2021, pp. 2825–2835. ing passage-centric similarity relation for improving
[777] W. Sun, L. Yan, X. Ma, P. Ren, D. Yin, and Z. Ren, dense passage retrieval,” in Findings of the Association
134

for Computational Linguistics: ACL-IJCNLP 2021, 2021, and J. Wen, “Recbole 2.0: Towards a more up-to-date
pp. 2173–2183. recommendation library,” in CIKM, M. A. Hasan and
[791] Z. Peng, X. Wu, and Y. Fang, “Soft prompt tuning L. Xiong, Eds. ACM, 2022, pp. 4722–4726.
for augmenting dense retrieval with large language [804] L. Xu, Z. Tian, G. Zhang, J. Zhang, L. Wang, B. Zheng,
models,” arXiv preprint arXiv:2307.08303, 2023. Y. Li, J. Tang, Z. Zhang, Y. Hou, X. Pan, W. X. Zhao,
[792] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, X. Chen, and J. Wen, “Towards a more user-friendly
A. Bakalov, K. Guu, K. Hall, and M.-W. Chang, and easy-to-use benchmark library for recommender
“Promptagator: Few-shot dense retrieval from 8 ex- systems,” in SIGIR, H. Chen, W. E. Duh, H. Huang,
amples,” in The Eleventh International Conference on M. P. Kato, J. Mothe, and B. Poblete, Eds. ACM,
Learning Representations, 2023. 2023, pp. 2837–2847.
[793] A. Askari, M. Aliannejadi, E. Kanoulas, and S. Ver- [805] S. Rendle, C. Freudenthaler, Z. Gantner, and
berne, “Generating synthetic documents for cross- L. Schmidt-Thieme, “BPR: bayesian personalized
encoder re-rankers: A comparative study of chatgpt ranking from implicit feedback,” CoRR, vol.
and human experts,” arXiv preprint arXiv:2305.02320, abs/1205.2618, 2012.
2023. [806] W. Fan, Z. Zhao, J. Li, Y. Liu, X. Mei, Y. Wang, J. Tang,
[794] K. Mao, Z. Dou, H. Chen, F. Mo, and H. Qian, “Large and Q. Li, “Recommender systems in the era of large
language models know your contextual search in- language models (llms),” CoRR, 2023.
tent: A prompting framework for conversational [807] L. Wu, Z. Zheng, Z. Qiu, H. Wang, H. Gu, T. Shen,
search,” arXiv preprint arXiv:2303.06573, 2023. C. Qin, C. Zhu, H. Zhu, Q. Liu, H. Xiong, and
[795] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero- E. Chen, “A survey on large language models for
shot dense retrieval without relevance labels,” in recommendation,” CoRR, 2023.
Proceedings of the 61st Annual Meeting of the Association [808] Y. Gao, T. Sheng, Y. Xiang, Y. Xiong, H. Wang, and
for Computational Linguistics (Volume 1: Long Papers). J. Zhang, “Chat-rec: Towards interactive and explain-
Association for Computational Linguistics, 2023, pp. able llms-augmented recommender system,” CoRR,
1762–1777. vol. abs/2303.14524, 2023.
[796] L. Wang, N. Yang, and F. Wei, “Query2doc: Query ex- [809] S. Dai, N. Shao, H. Zhao, W. Yu, Z. Si, C. Xu, Z. Sun,
pansion with large language models,” arXiv preprint X. Zhang, and J. Xu, “Uncovering chatgpt’s capabil-
arXiv:2303.07678, 2023. ities in recommender systems,” in RecSys, J. Zhang,
[797] G. Ma, X. Wu, P. Wang, Z. Lin, and S. Hu, “Pre- L. Chen, S. Berkovsky, M. Zhang, T. D. Noia, J. Basil-
training with large language model-based document ico, L. Pizzato, and Y. Song, Eds. ACM, 2023, pp.
expansion for dense passage retrieval,” arXiv preprint 1126–1132.
arXiv:2308.08285, 2023. [810] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. J. McAuley,
[798] W. Sun, Z. Chen, X. Ma, L. Yan, S. Wang, P. Ren, and W. X. Zhao, “Large language models are zero-
Z. Chen, D. Yin, and Z. Ren, “Instruction distilla- shot rankers for recommender systems,” CoRR, 2023.
tion makes large language models efficient zero-shot [811] J. Liu, C. Liu, R. Lv, K. Zhou, and Y. Zhang, “Is
rankers,” arXiv preprint arXiv:2311.01555, 2023. chatgpt a good recommender? A preliminary study,”
[799] L. Wang, N. Yang, X. Huang, L. Yang, R. Ma- CoRR, vol. abs/2304.10149, 2023.
jumder, and F. Wei, “Large search model: Redefin- [812] K. Bao, J. Zhang, Y. Zhang, W. Wang, F. Feng,
ing search stack in the era of llms,” arXiv preprint and X. He, “Tallrec: An effective and efficient tun-
arXiv:2310.14587, 2023. ing framework to align large language model with
[800] C. Li, Z. Gan, Z. Yang, J. Yang, L. Li, L. Wang, recommendation,” in RecSys, J. Zhang, L. Chen,
and J. Gao, “Multimodal foundation models: From S. Berkovsky, M. Zhang, T. D. Noia, J. Basilico, L. Piz-
specialists to general-purpose assistants,” CoRR, vol. zato, and Y. Song, Eds. ACM, 2023, pp. 1007–1014.
abs/2309.10020, 2023. [813] Y. Zhu, L. Wu, Q. Guo, L. Hong, and J. Li, “Col-
[801] W. X. Zhao, S. Mu, Y. Hou, Z. Lin, Y. Chen, X. Pan, laborative large language model for recommender
K. Li, Y. Lu, H. Wang, C. Tian, Y. Min, Z. Feng, X. Fan, systems,” arXiv preprint arXiv:2311.01343, 2023.
X. Chen, P. Wang, W. Ji, Y. Li, X. Wang, and J. Wen, [814] B. Zheng, Y. Hou, H. Lu, Y. Chen, W. X.
“Recbole: Towards a unified, comprehensive and ef- Zhao, and J.-R. Wen, “Adapting large language
ficient framework for recommendation algorithms,” models by integrating collaborative semantics for
in CIKM, G. Demartini, G. Zuccon, J. S. Culpepper, recommendation,” 2023. [Online]. Available: https:
Z. Huang, and H. Tong, Eds. ACM, 2021, pp. 4653– //api.semanticscholar.org/CorpusID:265213194
4664. [815] Y. Xi, W. Liu, J. Lin, J. Zhu, B. Chen, R. Tang,
[802] K. Zhou, H. Wang, W. X. Zhao, Y. Zhu, S. Wang, W. Zhang, R. Zhang, and Y. Yu, “Towards open-
F. Zhang, Z. Wang, and J. Wen, “S3-rec: Self- world recommendation with knowledge augmen-
supervised learning for sequential recommendation tation from large language models,” CoRR, vol.
with mutual information maximization,” in CIKM, abs/2306.10933, 2023.
M. d’Aquin, S. Dietze, C. Hauff, E. Curry, and [816] Q. Liu, N. Chen, T. Sakai, and X. Wu, “A first look
P. Cudré-Mauroux, Eds. ACM, 2020, pp. 1893–1902. at llm-powered generative news recommendation,”
[803] W. X. Zhao, Y. Hou, X. Pan, C. Yang, Z. Zhang, Z. Lin, CoRR, vol. abs/2305.06566, 2023.
J. Zhang, S. Bian, J. Tang, W. Sun, Y. Chen, L. Xu, [817] R. Li, W. Deng, Y. Cheng, Z. Yuan, J. Zhang,
G. Zhang, Z. Tian, C. Tian, S. Mu, X. Fan, X. Chen, and F. Yuan, “Exploring the upper limits of
135

text-based collaborative filtering using large lan- [830] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon,
guage models: Discoveries and insights,” CoRR, vol. R. Wightman, M. Cherti, T. Coombes, A. Katta,
abs/2305.11700, 2023. C. Mullis, M. Wortsman, P. Schramowski, S. Kun-
[818] W. Wei, X. Ren, J. Tang, Q. Wang, L. Su, S. Cheng, durthy, K. Crowson, L. Schmidt, R. Kaczmarczyk,
J. Wang, D. Yin, and C. Huang, “Llmrec: Large lan- and J. Jitsev, “LAION-5B: an open large-scale dataset
guage models with graph augmentation for recom- for training next generation image-text models,” in
mendation,” CoRR, vol. abs/2311.00423, 2023. NeurIPS, 2022.
[819] X. Li, B. Chen, L. Hou, and R. Tang, “Ctrl: Connect [831] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut,
tabular and language model for ctr prediction,” arXiv “Conceptual 12m: Pushing web-scale image-text pre-
preprint arXiv:2306.02841, 2023. training to recognize long-tail visual concepts,” in
[820] A. Muhamed, I. Keivanloo, S. Perera, J. Mracek, IEEE Conference on Computer Vision and Pattern Recog-
Y. Xu, Q. Cui, S. Rajagopalan, B. Zeng, and nition, CVPR 2021, virtual, June 19-25, 2021. Com-
T. Chilimbi, “Ctr-bert: Cost-effective knowledge dis- puter Vision Foundation / IEEE, 2021, pp. 3558–3568.
tillation for billion-parameter teacher models,” in [832] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang,
NeurIPS Efficient Natural Language and Speech Process- A. Hu, P. Shi, Y. Shi, C. Li, Y. Xu, H. Chen, J. Tian,
ing Workshop, 2021. Q. Qi, J. Zhang, and F. Huang, “mplug-owl: Mod-
[821] L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, ularization empowers large language models with
J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, W. X. multimodality,” CoRR, vol. abs/2304.14178, 2023.
Zhao, Z. Wei, and J. Wen, “A survey on large lan- [833] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang,
guage model based autonomous agents,” CoRR, vol. J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: A frontier
abs/2308.11432, 2023. large vision-language model with versatile abilities,”
[822] L. Wang, J. Zhang, X. Chen, Y. Lin, R. Song, W. X. CoRR, vol. abs/2308.12966, 2023.
Zhao, and J. Wen, “Recagent: A novel simulation [834] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved base-
paradigm for recommender systems,” CoRR, vol. lines with visual instruction tuning,” CoRR, vol.
abs/2306.02552, 2023. abs/2310.03744, 2023.
[823] E. Ie, C. Hsu, M. Mladenov, V. Jain, S. Narvekar, [835] P. Zhang, X. Dong, B. Wang, Y. Cao, C. Xu,
J. Wang, R. Wu, and C. Boutilier, “Recsim: A con- L. Ouyang, Z. Zhao, S. Ding, S. Zhang, H. Duan,
figurable simulation platform for recommender sys- W. Zhang, H. Yan, X. Zhang, W. Li, J. Li,
tems,” CoRR, vol. abs/1909.04847, 2019. K. Chen, C. He, X. Zhang, Y. Qiao, D. Lin, and
[824] J. Zhang, Y. Hou, R. Xie, W. Sun, J. J. McAuley, J. Wang, “Internlm-xcomposer: A vision-language
W. X. Zhao, L. Lin, and J. Wen, “Agentcf: Collabora- large model for advanced text-image comprehension
tive learning with autonomous language agents for and composition,” CoRR, vol. abs/2309.15112, 2023.
recommender systems,” CoRR, vol. abs/2310.09233, [836] K. Chen, Z. Zhang, W. Zeng, R. Zhang, F. Zhu, and
2023. R. Zhao, “Shikra: Unleashing multimodal llm’s ref-
[825] A. Zhang, L. Sheng, Y. Chen, H. Li, Y. Deng, X. Wang, erential dialogue magic,” CoRR, vol. abs/2306.15195,
and T. Chua, “On generative agents in recommenda- 2023.
tion,” CoRR, vol. abs/2310.10108, 2023. [837] F. Liu, K. Lin, L. Li, J. Wang, Y. Yacoob, and L. Wang,
[826] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of “Aligning large multi-modal model with robust in-
vision-language pre-trained models,” in Proceedings struction tuning,” CoRR, vol. abs/2306.14565, 2023.
of the Thirty-First International Joint Conference on Ar- [838] Y. Du, H. Guo, K. Zhou, W. X. Zhao, J. Wang,
tificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 C. Wang, M. Cai, R. Song, and J.-R. Wen, “What
July 2022, L. D. Raedt, Ed. ijcai.org, 2022, pp. 5436– makes for good visual instructions? synthesizing
5443. complex visual reasoning instructions for visual in-
[827] Z. Gan, L. Li, C. Li, L. Wang, Z. Liu, and J. Gao, struction tuning,” 2023.
“Vision-language pre-training: Basics, recent ad- [839] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin,
vances, and future trends,” Found. Trends Comput. K. Grauman, J. Luo, and J. P. Bigham, “Vizwiz grand
Graph. Vis., vol. 14, no. 3-4, pp. 163–352, 2022. challenge: Answering visual questions from blind
[828] P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, people,” in CVPR. Computer Vision Foundation
A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, / IEEE Computer Society, 2018, pp. 3608–3617.
D. E. Badawy, W. Han, E. Kharitonov et al., “Au- [840] A. Mishra, K. Alahari, and C. V. Jawahar, “Top-down
diopalm: A large language model that can speak and and bottom-up cues for scene text recognition,” in
listen,” CoRR, 2023. CVPR. IEEE Computer Society, 2012, pp. 2687–2694.
[829] J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, [841] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao,
Y. Hasson, K. Lenc, A. Mensch, K. Millican, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and
M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, D. Lin, “Mmbench: Is your multi-modal model an
Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, all-around player?” CoRR, vol. abs/2307.06281, 2023.
S. Borgeaud, A. Brock, A. Nematzadeh, S. Shar- [842] C. Fu, P. Chen, Y. Shen, Y. Qin, M. Zhang, X. Lin,
ifzadeh, M. Binkowski, R. Barreira, O. Vinyals, Z. Qiu, W. Lin, J. Yang, X. Zheng, K. Li, X. Sun, and
A. Zisserman, and K. Simonyan, “Flamingo: a visual R. Ji, “MME: A comprehensive evaluation bench-
language model for few-shot learning,” in NeurIPS, mark for multimodal large language models,” CoRR,
2022. vol. abs/2306.13394, 2023.
136

[843] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, [857] B. Li, R. Wang, G. Wang, Y. Ge, Y. Ge, and
E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, Y. Shan, “Seed-bench: Benchmarking multimodal
F. Shi, and S. Shi, “Siren’s song in the AI ocean: A llms with generative comprehension,” CoRR, vol.
survey on hallucination in large language models,” abs/2307.16125, 2023.
CoRR, vol. abs/2309.01219, 2023. [858] W. Yu, Z. Yang, L. Li, J. Wang, K. Lin, Z. Liu, X. Wang,
[844] A. Gunjal, J. Yin, and E. Bas, “Detecting and prevent- and L. Wang, “Mm-vet: Evaluating large multi-
ing hallucinations in large vision language models,” modal models for integrated capabilities,” CoRR, vol.
CoRR, vol. abs/2308.06394, 2023. abs/2308.02490, 2023.
[845] J. Lu, J. Rao, K. Chen, X. Guo, Y. Zhang, B. Sun, [859] J. Wang, L. Meng, Z. Weng, B. He, Z. Wu, and
C. Yang, and J. Yang, “Evaluation and mitigation Y. Jiang, “To see is to believe: Prompting GPT-
of agnosia in multimodal large language models,” 4V for better visual instruction tuning,” CoRR, vol.
CoRR, vol. abs/2309.04041, 2023. abs/2311.07574, 2023.
[846] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, [860] Y. Zhang, R. Zhang, J. Gu, Y. Zhou, N. Lipka, D. Yang,
and K. Saenko, “Object hallucination in image cap- and T. Sun, “Llavar: Enhanced visual instruction tun-
tioning,” in EMNLP. Association for Computational ing for text-rich image understanding,” arXiv preprint
Linguistics, 2018, pp. 4035–4045. arXiv:2306.17107, 2023.
[847] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and [861] X. Qi, K. Huang, A. Panda, M. Wang, and P. Mittal,
J.-R. Wen, “Evaluating object hallucination in large “Visual adversarial examples jailbreak aligned large
vision-language models,” in The 2023 Conference on language models,” in The Second Workshop on New
Empirical Methods in Natural Language Processing, Frontiers in Adversarial Machine Learning, 2023.
2023. [Online]. Available: https://openreview.net/ [862] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn,
forum?id=xozJw0kZXF M. Bansal, and H. Yao, “Analyzing and mitigating
[848] D. A. Hudson and C. D. Manning, “GQA: A new object hallucination in large vision-language mod-
dataset for real-world visual reasoning and compo- els,” arXiv preprint arXiv:2310.00754, 2023.
sitional question answering,” in CVPR. Computer [863] Z. Sun, S. Shen, S. Cao, H. Liu, C. Li, Y. Shen, C. Gan,
Vision Foundation / IEEE, 2019, pp. 6700–6709. L.-Y. Gui, Y.-X. Wang, Y. Yang et al., “Aligning large
[849] P. Lu, S. Mishra, T. Xia, L. Qiu, K. Chang, S. Zhu, multimodal models with factually augmented rlhf,”
O. Tafjord, P. Clark, and A. Kalyan, “Learn to explain: arXiv preprint arXiv:2309.14525, 2023.
Multimodal reasoning via thought chains for science [864] E. Jiménez-Ruiz, O. Hassanzadeh, V. Efthymiou,
question answering,” in NeurIPS, 2022. J. Chen, and K. Srinivas, “Semtab 2019: Resources to
[850] A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen, benchmark tabular data to knowledge graph match-
D. Parikh, and M. Rohrbach, “Towards vqa models ing systems,” in The Semantic Web - 17th International
that can read,” in Proceedings of the IEEE Conference Conference, ESWC 2020, Heraklion, Crete, Greece, May
on Computer Vision and Pattern Recognition, 2019, pp. 31-June 4, 2020, Proceedings, ser. Lecture Notes in
8317–8326. Computer Science, vol. 12123. Springer, 2020, pp.
[851] F. Liu, T. Guan, Z. Li, L. Chen, Y. Yacoob, 514–530.
D. Manocha, and T. Zhou, “Hallusionbench: You [865] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu,
see what you think? or you think what you see? “Unifying large language models and knowledge
an image-context reasoning benchmark challenging graphs: A roadmap,” CoRR, vol. abs/2306.08302,
for gpt-4v(ision), llava-1.5, and other multi-modality 2023.
models,” CoRR, vol. abs/2310.14566, 2023. [866] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang,
[852] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu,
C. L. Zitnick, and D. Parikh, “VQA: visual question W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang,
answering,” in ICCV. IEEE Computer Society, 2015, P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian,
pp. 2425–2433. H. Wu, and H. Wang, “ERNIE 3.0: Large-
[853] R. Vedantam, C. L. Zitnick, and D. Parikh, “Cider: scale knowledge enhanced pre-training for
Consensus-based image description evaluation,” in language understanding and generation,” CoRR,
CVPR. IEEE Computer Society, 2015, pp. 4566–4575. vol. abs/2107.02137, 2021. [Online]. Available:
[854] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction https://arxiv.org/abs/2107.02137
tuning,” CoRR, vol. abs/2304.08485, 2023. [867] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and
[855] P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, Q. Liu, “ERNIE: enhanced language representation
F. Meng, S. Huang, Y. Qiao, and P. Luo, “Lvlm-ehub: with informative entities,” in Proceedings of the 57th
A comprehensive evaluation benchmark for large Conference of the Association for Computational Linguis-
vision-language models,” CoRR, vol. abs/2306.09265, tics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
2023. Volume 1: Long Papers. Association for Computa-
[856] Z. Li, Y. Wang, M. Du, Q. Liu, B. Wu, J. Zhang, tional Linguistics, 2019, pp. 1441–1451.
C. Zhou, Z. Fan, J. Fu, J. Chen, X. Huang, and [868] X. Wang, T. Gao, Z. Zhu, Z. Zhang, Z. Liu, J. Li,
Z. Wei, “Reform-eval: Evaluating large vision lan- and J. Tang, “KEPLER: A unified model for knowl-
guage models via unified re-formulation of task- edge embedding and pre-trained language represen-
oriented benchmarks,” CoRR, vol. abs/2310.02569, tation,” Trans. Assoc. Comput. Linguistics, vol. 9, pp.
2023. 176–194, 2021.
137

[869] J. Zhang, X. Zhang, J. Yu, J. Tang, J. Tang, C. Li, abs/2305.13669, 2023.
and H. Chen, “Subgraph retrieval enhanced model [880] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou,
for multi-hop knowledge base question answering,” Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms
in Proceedings of the 60th Annual Meeting of the As- for knowledge graph construction and reasoning:
sociation for Computational Linguistics (Volume 1: Long Recent capabilities and future opportunities,” CoRR,
Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022. vol. abs/2305.13168, 2023. [Online]. Available:
Association for Computational Linguistics, 2022, pp. https://doi.org/10.48550/arXiv.2305.13168
5773–5784. [881] M. Karpinska, N. Akoury, and M. Iyyer, “The perils
[870] P. Ke, H. Ji, Y. Ran, X. Cui, L. Wang, L. Song, X. Zhu, of using mechanical turk to evaluate open-ended
and M. Huang, “Jointgt: Graph-text joint represen- text generation,” in Proceedings of the 2021 Conference
tation learning for text generation from knowledge on Empirical Methods in Natural Language Processing,
graphs,” in Findings of the Association for Compu- EMNLP 2021, Virtual Event / Punta Cana, Dominican
tational Linguistics: ACL/IJCNLP 2021, Online Event, Republic, 7-11 November, 2021, M. Moens, X. Huang,
August 1-6, 2021, ser. Findings of ACL, vol. ACL/I- L. Specia, and S. W. Yih, Eds. Association for
JCNLP 2021. Association for Computational Lin- Computational Linguistics, 2021, pp. 1265–1285.
guistics, 2021, pp. 2526–2538. [882] H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard,
[871] O. Agarwal, H. Ge, S. Shakeri, and R. Al-Rfou, C. Bishop, V. Carbune, and A. Rastogi, “RLAIF:
“Large scale knowledge graph based synthetic cor- scaling reinforcement learning from human feedback
pus generation for knowledge-enhanced language with AI feedback,” CoRR, vol. abs/2309.00267, 2023.
model pre-training,” CoRR, vol. abs/2010.12688, [883] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni,
2020. G. Xie, Z. Liu, and M. Sun, “Ultrafeedback: Boosting
[872] W. Chen, Y. Su, X. Yan, and W. Y. Wang, “KGPT: language models with high-quality feedback,” CoRR,
knowledge-grounded pre-training for data-to-text vol. abs/2310.01377, 2023.
generation,” in Proceedings of the 2020 Conference [884] X. Wang, Z. Wang, J. Liu, Y. Chen, L. Yuan, H. Peng,
on Empirical Methods in Natural Language Processing, and H. Ji, “MINT: evaluating llms in multi-turn in-
EMNLP 2020, Online, November 16-20, 2020. Associ- teraction with tools and language feedback,” CoRR,
ation for Computational Linguistics, 2020, pp. 8635– vol. abs/2309.10691, 2023.
8648. [885] S. Saha, O. Levy, A. Celikyilmaz, M. Bansal, J. We-
[873] Y. Gu, X. Deng, and Y. Su, “Don’t generate, discrim- ston, and X. Li, “Branch-solve-merge improves large
inate: A proposal for grounding language models to language model evaluation and generation,” CoRR,
real-world environments,” in Proceedings of the 61st vol. abs/2310.15123, 2023.
Annual Meeting of the Association for Computational [886] X. Zhang, B. Yu, H. Yu, Y. Lv, T. Liu, F. Huang, H. Xu,
Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, and Y. Li, “Wider and deeper LLM networks are
Canada, July 9-14, 2023. Association for Computa- fairer LLM evaluators,” CoRR, vol. abs/2308.01862,
tional Linguistics, 2023, pp. 4928–4949. 2023.
[874] L. Luo, Y. Li, G. Haffari, and S. Pan, “Reasoning [887] C. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu,
on graphs: Faithful and interpretable large language and Z. Liu, “Chateval: Towards better llm-based
model reasoning,” CoRR, vol. abs/2310.01061, 2023. evaluators through multi-agent debate,” CoRR, vol.
[875] Y. Lan and J. Jiang, “Query graph generation for an- abs/2308.07201, 2023.
swering multi-hop complex questions from knowl- [888] R. Li, T. Patel, and X. Du, “PRD: peer rank and dis-
edge bases,” in Proceedings of the 58th Annual Meeting cussion improve large language model based evalu-
of the Association for Computational Linguistics, ACL ations,” CoRR, vol. abs/2307.02762, 2023.
2020, Online, July 5-10, 2020, D. J. and, Ed. Asso- [889] L. Zhu, X. Wang, and X. Wang, “Judgelm: Fine-tuned
ciation for Computational Linguistics, 2020, pp. 969– large language models are scalable judges,” CoRR,
974. vol. abs/2310.17631, 2023.
[876] P. Wang, N. Zhang, X. Xie, Y. Yao, B. Tian, [890] Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal,
M. Wang, Z. Xi, S. Cheng, K. Liu, G. Zheng, and and D. Chen, “Evaluating large language mod-
H. Chen, “Easyedit: An easy-to-use knowledge edit- els at evaluating instruction following,” CoRR, vol.
ing framework for large language models,” CoRR, abs/2310.07641, 2023.
vol. abs/2308.07269, 2023. [891] R. Koo, M. Lee, V. Raheja, J. I. Park, Z. M. Kim,
[877] Y. Yao, P. Wang, B. Tian, S. Cheng, Z. Li, S. Deng, and D. Kang, “Benchmarking cognitive biases in
H. Chen, and N. Zhang, “Editing large language large language models as evaluators,” CoRR, vol.
models: Problems, methods, and opportunities,” abs/2309.17012, 2023.
CoRR, vol. abs/2305.13172, 2023. [892] P. West, X. Lu, N. Dziri, F. Brahman, L. Li,
[878] S. Choi, T. Fang, Z. Wang, and Y. Song, “KCTS: J. D. Hwang, L. Jiang, J. Fisher, A. Ravichander,
knowledge-constrained tree search decoding with K. Chandu, B. Newman, P. W. Koh, A. Ettinger,
token-level hallucination detection,” CoRR, vol. and Y. Choi, “The generative AI paradox: ”what
abs/2310.09044, 2023. it can create, it may not understand”,” CoRR, vol.
[879] S. Zhang, L. Pan, J. Zhao, and W. Y. Wang, “Mit- abs/2311.00059, 2023.
igating language model hallucination with inter- [893] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W.
active question-knowledge alignment,” CoRR, vol. Yu, X. Song, and D. Zhou, “Large language mod-
138

els cannot self-correct reasoning yet,” CoRR, vol. prediction,” CoRR, vol. abs/2212.02199, 2022.
abs/2310.01798, 2023. [908] J. H. Choi, K. E. Hickman, A. Monahan, and
[894] K. Stechly, M. Marquez, and S. Kambhampati, “GPT- D. Schwarcz, “Chatgpt goes to law school,” Available
4 doesn’t know it’s wrong: An analysis of itera- at SSRN, 2023.
tive prompting for reasoning problems,” CoRR, vol. [909] J. J. Nay, “Law informs code: A legal informatics
abs/2310.12397, 2023. approach to aligning artificial intelligence with hu-
[895] O. Nov, N. Singh, and D. M. Mann, “Putting chat- mans,” CoRR, vol. abs/2209.13020, 2022.
gpt’s medical advice to the (turing) test,” CoRR, vol. [910] F. Yu, L. Quartey, and F. Schilder, “Legal prompting:
abs/2301.10035, 2023. Teaching a language model to think like a lawyer,”
[896] K. Yang, S. Ji, T. Zhang, Q. Xie, and S. Anani- CoRR, vol. abs/2212.01326, 2022.
adou, “On the evaluations of chatgpt and emotion- [911] D. Trautmann, A. Petrova, and F. Schilder, “Legal
enhanced prompting for mental health analysis,” prompt engineering for multilingual legal judgement
CoRR, vol. abs/2304.03347, 2023. prediction,” CoRR, vol. abs/2212.02199, 2022.
[897] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, [912] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli,
A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. “Understanding the capabilities, limitations, and so-
Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes cietal impact of large language models,” CoRR, vol.
medicine easy to swallow: An exploratory case abs/2102.02503, 2021.
study on simplified radiology reports,” CoRR, vol. [913] Z. Sun, “A short survey of viewing large language
abs/2212.14882, 2022. models in legal aspect,” CoRR, vol. abs/2303.09136,
[898] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wul- 2023.
czyn, L. Hou, K. Clark, S. Pfohl, H. Cole-Lewis, [914] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-
D. Neal, M. Schaekermann, A. Wang, M. Amin, muslim bias in large language models,” in AIES
S. Lachgar, P. A. Mansfield, S. Prakash, B. Green, ’21: AAAI/ACM Conference on AI, Ethics, and Society,
E. Dominowska, B. A. y Arcas, N. Tomasev, Y. Liu, Virtual Event, USA, May 19-21, 2021, M. Fourcade,
R. Wong, C. Semturs, S. S. Mahdavi, J. K. Barral, B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM,
D. R. Webster, G. S. Corrado, Y. Matias, S. Azizi, 2021, pp. 298–306.
A. Karthikesalingam, and V. Natarajan, “Towards [915] A. Shah and S. Chava, “Zero is not hero yet: Bench-
expert-level medical question answering with large marking zero-shot performance of llms for financial
language models,” CoRR, vol. abs/2305.09617, 2023. tasks,” CoRR, vol. abs/2305.16633, 2023.
[899] S. Yang, H. Zhao, S. Zhu, G. Zhou, H. Xu, Y. Jia, and [916] D. Araci, “Finbert: Financial sentiment analysis
H. Zan, “Zhongjing: Enhancing the chinese medical with pre-trained language models,” CoRR, vol.
capabilities of large language model through expert abs/1908.10063, 2019.
feedback and real-world multi-turn dialogue,” CoRR, [917] J. C. S. Alvarado, K. Verspoor, and T. Baldwin,
vol. abs/2308.03549, 2023. “Domain adaption of named entity recognition to
[900] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, support credit risk assessment,” in Proceedings of
G. K. Savova, R. H. Mak, and D. S. Bitterman, “The the Australasian Language Technology Association Work-
utility of chatgpt for cancer treatment information,” shop, ALTA 2015, Parramatta, Australia, December 8 - 9,
medRxiv, 2023. 2015, B. Hachey and K. Webster, Eds. ACL, 2015,
[901] K. Malinka, M. Peresı́ni, A. Firc, O. Hujnak, and pp. 84–90.
F. Janus, “On the educational impact of chatgpt: [918] G. Son, H. Jung, M. Hahm, K. Na, and S. Jin, “Beyond
Is artificial intelligence ready to obtain a university classification: Financial reasoning in state-of-the-art
degree?” CoRR, vol. abs/2303.11146, 2023. language models,” CoRR, vol. abs/2305.01505, 2023.
[902] T. Susnjak, “Chatgpt: The end of online exam in- [919] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A
tegrity?” CoRR, vol. abs/2212.09292, 2022. large chinese financial chat model with hundreds of
[903] K. Tan, T. Pang, and C. Fan, “Towards applying billions parameters,” arXiv preprint arXiv:2305.12002,
powerful large ai models in classroom teaching: Op- 2023.
portunities, challenges and prospects,” 2023. [920] H. Yang, X.-Y. Liu, and C. D. Wang, “Fingpt: Open-
[904] F. Kamalov and I. Gurrib, “A new era of artificial source financial large language models,” CoRR, vol.
intelligence in education: A multifaceted revolution,” abs/2306.06031, 2023.
CoRR, vol. abs/2305.18303, 2023. [921] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu,
[905] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, “Pubmedqa: A dataset for biomedical research ques-
D. Dementieva, F. Fischer, U. Gasser, G. Groh, tion answering,” in Proceedings of the 2019 Conference
S. Günnemann, E. Hüllermeier et al., “Chatgpt for on Empirical Methods in Natural Language Processing
good? on opportunities and challenges of large lan- and the 9th International Joint Conference on Natu-
guage models for education,” Learning and Individual ral Language Processing, EMNLP-IJCNLP 2019, Hong
Differences, vol. 103, p. 102274, 2023. Kong, China, November 3-7, 2019, 2019, pp. 2567–2577.
[906] A. Blair-Stanek, N. Holzenberger, and B. V. Durme, [922] A. Krithara, A. Nentidis, K. Bougiatiotis, and
“Can GPT-3 perform statutory reasoning?” CoRR, G. Paliouras, “Bioasq-qa: A manually curated corpus
vol. abs/2302.06100, 2023. for biomedical question answering,” 2022.
[907] D. Trautmann, A. Petrova, and F. Schilder, “Legal [923] Z. Bi, N. Zhang, Y. Xue, Y. Ou, D. Ji, G. Zheng,
prompt engineering for multilingual legal judgement and H. Chen, “Oceangpt: A large language model
139

for ocean science tasks,” CoRR, vol. abs/2310.02031, of foundation models,” CoRR, vol. abs/2309.16039,
2023. 2023.
[924] C. Zhang, C. Zhang, C. Li, Y. Qiao, S. Zheng, S. K. [941] kaiokendev, “Things I’m learning while training su-
Dam, M. Zhang, J. U. Kim, S. T. Kim, J. Choi, G. Park, perhot.” 2023.
S. Bae, L. Lee, P. Hui, I. S. Kweon, and C. S. Hong, [942] Z. Dong, T. Tang, J. Li, W. X. Zhao, and J. Wen,
“One small step for generative ai, one giant leap for “BAMBOO: A comprehensive benchmark for evalu-
AGI: A complete survey on chatgpt in AIGC era,” ating long text modeling capacities of large language
CoRR, vol. abs/2304.06488, 2023. models,” CoRR, vol. abs/2309.13345, 2023.
[925] M. Haman and M. Skolnik, “Using chatgpt to con- [943] J. Su. (2023) Transformer upgrade path: 12, infinite
duct a literature review.” Accountability in research, extrapolation of rerope?
2023. [944] A. Pal, D. Karkhanis, M. Roberts, S. Dooley, A. Sun-
[926] Ö. Aydın and E. Karaarslan, “Openai chatgpt gen- dararajan, and S. Naidu, “Giraffe: Adventures in
erated literature review: Digital twin in healthcare,” expanding context lengths in llms,” CoRR, vol.
SSRN Electronic Journal, 2022. abs/2308.10882, 2023.
[927] Y. J. Park, D. Kaplan, Z. Ren, C. Hsu, C. Li, H. Xu, [945] G. Izacard and E. Grave, “Leveraging passage re-
S. Li, and J. Li, “Can chatgpt be used to generate trieval with generative models for open domain
scientific hypotheses?” CoRR, vol. abs/2304.12208, question answering,” in Proceedings of the 16th Con-
2023. ference of the European Chapter of the Association for
[928] M. M. Hassan, R. A. Knipper, and S. K. K. Santu, Computational Linguistics: Main Volume, EACL 2021,
“Chatgpt as your personal data scientist,” CoRR, vol. Online, April 19 - 23, 2021. Association for Compu-
abs/2305.13657, 2023. tational Linguistics, 2021, pp. 874–880.
[929] L. Cheng, X. Li, and L. Bing, “Is GPT-4 a good data [946] N. Ratner, Y. Levine, Y. Belinkov, O. Ram, I. Magar,
analyst?” CoRR, vol. abs/2305.15038, 2023.
[930] S. I. M. Hussam Alkaissi, “Artificial hallucinations in chatgpt: Implications in scientific writing,” PubMed, 2023.
[931] A. Azaria, R. Azoulay, and S. Reches, “Chatgpt is a remarkable tool – for experts,” CoRR, vol. abs/2306.03102, 2023.
[932] O. O. Buruk, “Academic writing with GPT-3.5: reflections on practices, efficacy and transparency,” CoRR, vol. abs/2304.11079, 2023.
[933] R. Liu and N. B. Shah, “Reviewergpt? an exploratory study on using large language models for paper reviewing,” CoRR, vol. abs/2306.00622, 2023.
[934] M. Kosinski, “Theory of mind may have spontaneously emerged in large language models,” CoRR, vol. abs/2302.02083, 2023.
[935] M. M. Amin, E. Cambria, and B. W. Schuller, “Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt,” CoRR, vol. abs/2303.03186, 2023.
[936] G. Sridhara, R. H. G., and S. Mazumdar, “Chatgpt: A study on its utility for ubiquitous software engineering tasks,” CoRR, vol. abs/2305.16837, 2023.
[937] W. Sun, C. Fang, Y. You, Y. Miao, Y. Liu, Y. Li, G. Deng, S. Huang, Y. Chen, Q. Zhang, H. Qian, Y. Liu, and Z. Chen, “Automatic code summarization via chatgpt: How far are we?” CoRR, vol. abs/2305.12865, 2023.
[938] C. S. Xia and L. Zhang, “Conversational automated program repair,” CoRR, vol. abs/2301.13246, 2023.
[939] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das, and S. Reddy, “The impact of positional encoding on length generalization in transformers,” CoRR, vol. abs/2305.19466, 2023.
[940] W. Xiong, J. Liu, I. Molybog, H. Zhang, P. Bhargava, R. Hou, L. Martin, R. Rungta, K. A. Sankararaman, B. Oguz, M. Khabsa, H. Fang, Y. Mehdad, S. Narang, K. Malik, A. Fan, S. Bhosale, S. Edunov, M. Lewis, S. Wang, and H. Ma, “Effective long-context scaling

O. Abend, E. Karpas, A. Shashua, K. Leyton-Brown, and Y. Shoham, “Parallel context windows for large language models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023. Association for Computational Linguistics, 2023, pp. 6383–6402.
[947] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document transformer,” CoRR, vol. abs/2004.05150, 2020.
[948] G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” CoRR, vol. abs/2309.17453, 2023.
[949] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024.
[950] C. Han, Q. Wang, W. Xiong, Y. Chen, H. Ji, and S. Wang, “Lm-infinite: Simple on-the-fly length generalization for large language models,” CoRR, vol. abs/2308.16137, 2023.
[951] A. Bertsch, U. Alon, G. Neubig, and M. R. Gormley, “Unlimiformer: Long-range transformers with unlimited length input,” CoRR, vol. abs/2305.01625, 2023.
[952] Y. Wu, M. N. Rabe, D. Hutchins, and C. Szegedy, “Memorizing transformers,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[953] Y. Lu, X. Zhou, W. He, J. Zhao, T. Ji, T. Gui, Q. Zhang, and X. Huang, “Longheads: Multi-head attention is secretly a long context processor,” CoRR, vol. abs/2402.10685, 2024.
[954] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, S. Han, and M. Sun, “Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory,” CoRR,
vol. abs/2402.04617, 2024.
[955] Y. Fu, R. Panda, X. Niu, X. Yue, H. Hajishirzi, Y. Kim, and H. Peng, “Data engineering for scaling language models to 128k context,” CoRR, vol. abs/2402.10171, 2024.
[956] K. Lv, X. Liu, Q. Guo, H. Yan, C. He, X. Qiu, and D. Lin, “Longwanjuan: Towards systematic measurement for long text quality,” CoRR, vol. abs/2402.13583, 2024.
[957] H. Chen, R. Pasunuru, J. Weston, and A. Celikyilmaz, “Walking down the memory maze: Beyond context limit through interactive reading,” CoRR, vol. abs/2310.05029, 2023.
[958] W. Zhou, Y. E. Jiang, P. Cui, T. Wang, Z. Xiao, Y. Hou, R. Cotterell, and M. Sachan, “Recurrentgpt: Interactive generation of (arbitrarily) long text,” CoRR, vol. abs/2305.13304, 2023.
[959] C. Packer, V. Fang, S. G. Patil, K. Lin, S. Wooders, and J. E. Gonzalez, “Memgpt: Towards llms as operating systems,” CoRR, vol. abs/2310.08560, 2023.
[960] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, and B. Catanzaro, “Retrieval meets long context large language models,” CoRR, vol. abs/2310.03025, 2023.
[961] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach (4th Edition). Pearson, 2020. [Online]. Available: http://aima.cs.berkeley.edu/
[962] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,” CoRR, vol. abs/1604.00289, 2016.
[963] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” CoRR, vol. abs/2210.03629, 2022.
[964] 2023. [Online]. Available: https://github.com/AntonOsika/gpt-engineer
[965] X. Team, “Xagent: An autonomous agent for complex task solving,” 2023.
[966] G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “CAMEL: communicative agents for “mind” exploration of large scale language model society,” CoRR, vol. abs/2303.17760, 2023.
[967] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, and C. Wu, “Metagpt: Meta programming for multi-agent collaborative framework,” CoRR, vol. abs/2308.00352, 2023.
[968] C. Pham, B. Liu, Y. Yang, Z. Chen, T. Liu, J. Yuan, B. A. Plummer, Z. Wang, and H. Yang, “Let models speak ciphers: Multiagent debate through embeddings,” CoRR, vol. abs/2310.06272, 2023.
[969] W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Qian, C.-M. Chan, Y. Qin, Y. Lu, R. Xie et al., “Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors in agents,” arXiv preprint arXiv:2308.10848, 2023.
[970] Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang, “Autogen: Enabling next-gen llm applications via multi-agent conversation framework,” 2023.
[971] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch, “Improving factuality and reasoning in language models through multiagent debate,” CoRR, vol. abs/2305.14325, 2023.
[972] Y. Shao, L. Li, J. Dai, and X. Qiu, “Character-llm: A trainable agent for role-playing,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, H. Bouamor, J. Pino, and K. Bali, Eds. Association for Computational Linguistics, 2023, pp. 13 153–13 187.
[973] W. Hua, X. Yang, Z. Li, W. Cheng, and Y. Zhang, “Trustagent: Towards safe and trustworthy llm-based agents through agent constitution,” CoRR, vol. abs/2402.01586, 2024.
[974] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu, “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” CoRR, vol. abs/2311.05232, 2023.
[975] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in ICLR (Poster). OpenReview.net, 2019.
[976] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation in large transformer models,” in MLSys. mlsys.org, 2023.
[977] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: memory optimizations toward training trillion parameter models,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, C. Cuicchi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM, 2020, p. 20.
[978] J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He, “Zero-offload: Democratizing billion-scale model training,” in 2021 USENIX Annual Technical Conference, USENIX ATC 2021, July 14-16, 2021, I. Calciu and G. Kuenning, Eds. USENIX Association, 2021, pp. 551–564.
[979] S. Rajbhandari, O. Ruwase, J. Rasley, S. Smith, and Y. He, “Zero-infinity: breaking the GPU memory wall for extreme scale deep learning,” in SC. ACM, 2021, p. 59.
[980] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast and memory-efficient exact attention with io-awareness,” in NeurIPS, 2022.
[981] S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, S. L. Song, S. Rajbhandari, and Y. He, “Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models,” CoRR, vol. abs/2309.14509, 2023.
[982] H. Liu, M. Zaharia, and P. Abbeel, “Ring attention with blockwise transformers for near-infinite context,” CoRR, vol. abs/2310.01889, 2023.
[983] Y. Chen, T. Tang, E. Xiang, L. Li, W. X. Zhao, J. Wang, Y. Chai, and J. Wen, “Towards coarse-to-fine evaluation of inference efficiency for large language models,” CoRR, vol. abs/2404.11502, 2024.
[984] Y. Sheng, L. Zheng, B. Yuan, Z. Li, M. Ryabinin,
B. Chen, P. Liang, C. Ré, I. Stoica, and C. Zhang, “Flexgen: High-throughput generative inference of large language models with a single GPU,” in ICML, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 31 094–31 116.
[985] T. Dao, D. Haziza, F. Massa, and G. Sizov, “Flash-decoding for long-context inference,” 2023. [Online]. Available: https://crfm.stanford.edu/2023/10/12/flashdecoding.html
[986] C. Holmes, M. Tanaka, M. Wyatt, A. A. Awan, J. Rasley, S. Rajbhandari, R. Y. Aminabadi, H. Qin, A. Bakhtiari, L. Kurilenko, and Y. He, “Deepspeed-fastgen: High-throughput text generation for llms via MII and deepspeed-inference,” CoRR, vol. abs/2401.08671, 2024.
[987] Y. Leviathan, M. Kalman, and Y. Matias, “Fast inference from transformers via speculative decoding,” in International Conference on Machine Learning, 2023.
[988] C. Chen, S. Borgeaud, G. Irving, J. Lespiau, L. Sifre, and J. Jumper, “Accelerating large language model decoding with speculative sampling,” CoRR, vol. abs/2302.01318, 2023.
[989] X. Miao, G. Oliaro, Z. Zhang, X. Cheng, Z. Wang, R. Y. Y. Wong, Z. Chen, D. Arfeen, R. Abhyankar, and Z. Jia, “Specinfer: Accelerating generative LLM serving with speculative inference and token tree verification,” CoRR, vol. abs/2305.09781, 2023.
[990] B. Spector and C. Ré, “Accelerating LLM inference with staged speculative decoding,” CoRR, vol. abs/2308.04623, 2023.
[991] L. Chen, M. Zaharia, and J. Zou, “Frugalgpt: How to use large language models while reducing cost and improving performance,” CoRR, vol. abs/2305.05176, 2023.
[992] M. Yue, J. Zhao, M. Zhang, L. Du, and Z. Yao, “Large language model cascades with mixture of thoughts representations for cost-efficient reasoning,” CoRR, vol. abs/2310.03094, 2023.
[993] J. Gu, J. Bradbury, C. Xiong, V. O. K. Li, and R. Socher, “Non-autoregressive neural machine translation,” in ICLR (Poster). OpenReview.net, 2018.
[994] C. Wang, J. Zhang, and H. Chen, “Semi-autoregressive neural machine translation,” in EMNLP. Association for Computational Linguistics, 2018, pp. 479–488.
[995] T. Cai, Y. Li, Z. Geng, H. Peng, J. D. Lee, D. Chen, and T. Dao, “Medusa: Simple LLM inference acceleration framework with multiple decoding heads,” CoRR, vol. abs/2401.10774, 2024.
[996] S. Teerapittayanon, B. McDanel, and H. T. Kung, “Branchynet: Fast inference via early exiting from deep neural networks,” in ICPR. IEEE, 2016, pp. 2464–2469.
[997] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger, “Multi-scale dense networks for resource efficient image classification,” in ICLR. OpenReview.net, 2018.
[998] D. Raposo, S. Ritter, B. A. Richards, T. P. Lillicrap, P. C. Humphreys, and A. Santoro, “Mixture-of-depths: Dynamically allocating compute in transformer-based language models,” CoRR, vol. abs/2404.02258, 2024.
[999] Z. Wan, X. Wang, C. Liu, S. Alam, Y. Zheng, J. Liu, Z. Qu, S. Yan, Y. Zhu, Q. Zhang, M. Chowdhury, and M. Zhang, “Efficient large language models: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2312.03863
[1000] A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, “A survey of quantization methods for efficient neural network inference,” CoRR, vol. abs/2103.13630, 2021. [Online]. Available: https://arxiv.org/abs/2103.13630
[1001] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” CoRR, vol. abs/2208.07339, 2022.
[1002] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, and S. Han, “Awq: Activation-aware weight quantization for llm compression and acceleration,” 2023.
[1003] Y. Shang, Z. Yuan, Q. Wu, and Z. Dong, “PB-LLM: partially binarized large language models,” CoRR, vol. abs/2310.00034, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.00034
[1004] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh, “Spqr: A sparse-quantized representation for near-lossless LLM weight compression,” CoRR, vol. abs/2306.03078, 2023.
[1005] Z. Guan, H. Huang, Y. Su, H. Huang, N. Wong, and H. Yu, “APTQ: attention-aware post-training mixed-precision quantization for large language models,” CoRR, vol. abs/2402.14866, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.14866
[1006] C. Lee, J. Jin, T. Kim, H. Kim, and E. Park, “OWQ: outlier-aware weight quantization for efficient fine-tuning and inference of large language models,” in Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, M. J. Wooldridge, J. G. Dy, and S. Natarajan, Eds. AAAI Press, 2024, pp. 13 355–13 364. [Online]. Available: https://doi.org/10.1609/aaai.v38i12.29237
[1007] G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han, “Smoothquant: Accurate and efficient post-training quantization for large language models,” CoRR, vol. abs/2211.10438, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.10438
[1008] Z. Yao, R. Y. Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He, “Zeroquant: Efficient and affordable post-training quantization for large-scale transformers,” in NeurIPS, 2022.
[1009] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “Gptq: Accurate post-training quantization for generative pre-trained transformers,” arXiv preprint arXiv:2210.17323, 2022.
[1010] E. Frantar and D. Alistarh, “Optimal brain compression: A framework for accurate post-training quantization and pruning,” in NeurIPS, 2022.
[1011] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.
[1012] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra, “Llm-qat: Data-free quantization aware training for large language models,” 2023.
[1013] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,” 2023.
[1014] T. Dettmers and L. Zettlemoyer, “The case for 4-bit precision: k-bit inference scaling laws,” CoRR, vol. abs/2212.09720, 2022.
[1015] L. Peiyu, L. Zikang, G. Ze-Feng, G. Dawei, Z. W. Xin, L. Yaliang, D. Bolin, and W. Ji-Rong, “Do emergent abilities exist in quantized large language models: An empirical study,” arXiv preprint arXiv:2307.08072, 2023.
[1016] Y. Xu, L. Xie, X. Gu, X. Chen, H. Chang, H. Zhang, Z. Chen, X. Zhang, and Q. Tian, “Qa-lora: Quantization-aware low-rank adaptation of large language models,” CoRR, vol. abs/2309.14717, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2309.14717
[1017] Y. Li, Y. Yu, C. Liang, P. He, N. Karampatziakis, W. Chen, and T. Zhao, “Loftq: Lora-fine-tuning-aware quantization for large language models,” CoRR, vol. abs/2310.08659, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.08659
[1018] Y. Gu, L. Dong, F. Wei, and M. Huang, “Knowledge distillation of large language models,” CoRR, vol. abs/2306.08543, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2306.08543
[1019] C. Hsieh, C. Li, C. Yeh, H. Nakhost, Y. Fujii, A. Ratner, R. Krishna, C. Lee, and T. Pfister, “Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes,” in Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada, July 9-14, 2023, A. Rogers, J. L. Boyd-Graber, and N. Okazaki, Eds. Association for Computational Linguistics, 2023, pp. 8003–8017. [Online]. Available: https://doi.org/10.18653/v1/2023.findings-acl.507
[1020] E. Frantar and D. Alistarh, “Sparsegpt: Massive language models can be accurately pruned in one-shot,” in International Conference on Machine Learning. PMLR, 2023, pp. 10 323–10 337.
[1021] X. Ma, G. Fang, and X. Wang, “Llm-pruner: On the structural pruning of large language models,” Advances in neural information processing systems, vol. 36, pp. 21 702–21 720, 2023.
[1022] M. Xia, T. Gao, Z. Zeng, and D. Chen, “Sheared llama: Accelerating language model pre-training via structured pruning,” arXiv preprint arXiv:2310.06694, 2023.
[1023] T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer, “8-bit optimizers via block-wise quantization,” 9th International Conference on Learning Representations, ICLR, 2022.
[1024] Y. Ding, W. Fan, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meets llms: Towards retrieval-augmented large language models,” arXiv preprint arXiv:2405.06211, 2024.
[1025] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
[1026] S. Robertson and H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, 2009.
[1027] Y. Wang, R. Ren, J. Li, W. X. Zhao, J. Liu, and J.-R. Wen, “Rear: A relevance-aware retrieval-augmented framework for open-domain question answering,” arXiv preprint arXiv:2402.17497, 2024.
[1028] D. Rau, S. Wang, H. Déjean, and S. Clinchant, “Context embeddings for efficient answer generation in rag,” arXiv preprint arXiv:2407.09252, 2024.
[1029] F. Xu, W. Shi, and E. Choi, “Recomp: Improving retrieval-augmented lms with context compression and selective augmentation,” in The Twelfth International Conference on Learning Representations, 2024.
[1030] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 9248–9274.
[1031] T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao, D. Yu, and H. Zhang, “Dense x retrieval: What retrieval granularity should we use?” arXiv preprint arXiv:2312.06648, 2023.
[1032] X. Huang, S. Cheng, Y. Shu, Y. Bao, and Y. Qu, “Question decomposition tree for answering complex questions over knowledge bases,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 11, 2023, pp. 12 924–12 932.
[1033] Y. He, J. Tang, H. Ouyang, C. Kang, D. Yin, and Y. Chang, “Learning to rewrite queries,” in Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 1443–1452.
[1034] J. Liu and B. Mozafari, “Query rewriting via large language models,” arXiv preprint arXiv:2403.09060, 2024.
[1035] F. Ye, M. Fang, S. Li, and E. Yilmaz, “Enhancing conversational search: Large language model-aided informative query rewriting,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 5985–6006.
[1036] S. Jeong, J. Baek, S. Cho, S. J. Hwang, and J. C. Park, “Adaptive-rag: Learning to adapt retrieval-augmented large language models through question complexity,” arXiv preprint arXiv:2403.14403, 2024.
[1037] H. Jiang, Q. Wu, C.-Y. Lin, Y. Yang, and L. Qiu, “Llmlingua: Compressing prompts for accelerated inference of large language models,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 13 358–13 376.
[1038] T. Xu, S. Wu, S. Diao, X. Liu, X. Wang, Y. Chen, and J. Gao, “Sayself: Teaching llms to express confidence with self-reflective rationales,” arXiv preprint arXiv:2405.20974, 2024.
[1039] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” arXiv preprint arXiv:2310.11511, 2023.
[1040] H. Luo, Y.-S. Chuang, Y. Gong, T. Zhang, Y. Kim, X. Wu, D. Fox, H. Meng, and J. Glass, “Sail: Search-augmented instruction learning,” arXiv preprint arXiv:2305.15225, 2023.
[1041] X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis et al., “Ra-dit: Retrieval-augmented dual instruction tuning,” arXiv preprint arXiv:2310.01352, 2023.
[1042] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in International conference on machine learning. PMLR, 2020, pp. 3929–3938.
[1043] K. Lee, M.-W. Chang, and K. Toutanova, “Latent retrieval for weakly supervised open domain question answering,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 6086–6096.
[1044] J. Li, J. Chen, R. Ren, X. Cheng, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “The dawn after the dark: An empirical study on factuality hallucination in large language models,” arXiv preprint arXiv:2401.03205, 2024.
[1045] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Comput. Surv., 2023.
[1046] Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, L. Wang, A. T. Luu, W. Bi, F. Shi, and S. Shi, “Siren’s song in the AI ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
[1047] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in NIPS, 2015, pp. 1171–1179.
[1048] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and E. Perez, “Towards understanding sycophancy in language models,” CoRR, vol. abs/2310.13548, 2023.
[1049] V. Rawte, P. Priya, S. M. T. I. Tonmoy, S. M. M. Zaman, A. P. Sheth, and A. Das, “Exploring the relationship between LLM hallucinations and prompt linguistic nuances: Readability, formality, and concreteness,” CoRR, vol. abs/2309.11064, 2023.
[1050] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston, “Chain-of-verification reduces hallucination in large language models,” CoRR, vol. abs/2309.11495, 2023.
[1051] P. Manakul, A. Liusie, and M. J. F. Gales, “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models,” in EMNLP. Association for Computational Linguistics, 2023, pp. 9004–9017.
[1052] N. Varshney, W. Yao, H. Zhang, J. Chen, and D. Yu, “A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation,” CoRR, vol. abs/2307.03987, 2023.
[1053] Y. Yehuda, I. Malkiel, O. Barkan, J. Weill, R. Ronen, and N. Koenigstein, “In search of truth: An interrogation approach to hallucination detection,” CoRR, vol. abs/2403.02889, 2024.
[1054] S. Min, K. Krishna, X. Lyu, M. Lewis, W. tau Yih, P. W. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi, “Factscore: Fine-grained atomic evaluation of factual precision in long form text generation,” 2023.
[1055] I. Chern, S. Chern, S. Chen, W. Yuan, K. Feng, C. Zhou, J. He, G. Neubig, and P. Liu, “Factool: Factuality detection in generative AI - A tool augmented framework for multi-task and multi-domain scenarios,” CoRR, vol. abs/2307.13528, 2023.
[1056] X. Cheng, J. Li, W. X. Zhao, H. Zhang, F. Zhang, D. Zhang, K. Gai, and J.-R. Wen, “Small agent can also rock! empowering small language models as hallucination detector,” CoRR, vol. abs/2406.11277, 2024.
[1057] M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, E. Durmus, Z. Hatfield-Dodds, S. R. Johnston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, and E. Perez, “Towards understanding sycophancy in language models,” in ICLR. OpenReview.net, 2024.
[1058] J. W. Wei, D. Huang, Y. Lu, D. Zhou, and Q. V. Le, “Simple synthetic data reduces sycophancy in large language models,” CoRR, vol. abs/2308.03958, 2023.
[1059] L. Gao, Z. Dai, P. Pasupat, A. Chen, A. T. Chaganty, Y. Fan, V. Y. Zhao, N. Lao, H. Lee, D. Juan, and K. Guu, “RARR: researching and revising what language models say, using language models,” in ACL (1). Association for Computational Linguistics, 2023, pp. 16 477–16 508.
[1060] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” in ACL (1). Association for Computational Linguistics, 2023, pp. 5823–5840.
[1061] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal, “Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions,” CoRR, vol. abs/2212.10509, 2022.
[1062] K. Li, O. Patel, F. B. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” in NeurIPS, 2023.
[1063] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, and S. W. Yih, “Trusting your evidence: Hallucinate less with context-aware decoding,” CoRR, vol. abs/2305.14739, 2023.
[1064] D. Kahneman, “Thinking, fast and slow,” Farrar, Straus and Giroux, 2011.
[1065] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A comparative study on reasoning patterns of openai’s o1 model,” arXiv preprint arXiv:2410.13639, 2024.
[1066] T. Zhong, Z. Liu, Y. Pan, Y. Zhang, Y. Zhou, S. Liang, Z. Wu, Y. Lyu, P. Shu, X. Yu et al., “Evaluation of openai o1: Opportunities and challenges of agi,” arXiv preprint arXiv:2409.18486, 2024.
[1067] Y. Min, Z. Chen, J. Jiang, J. Chen, J. Deng, Y. Hu, Y. Tang, J. Wang, X. Cheng, H. Song et al., “Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems,” arXiv preprint arXiv:2412.09413, 2024.
[1068] D. Team, “Deepseek-r1-lite-preview is now live: unleashing supercharged reasoning power,” 2024.
[1069] Q. Team, “Qwq: Reflect deeply on the boundaries of the unknown, november 2024,” URL https://qwenlm.github.io/blog/qwq-32b-preview.
[1070] DeepSeek-AI, “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025.
[1071] J. Jiang, Z. Chen, Y. Min, J. Chen, X. Cheng, J. Wang, Y. Tang, H. Sun, J. Deng, W. X. Zhao, Z. Liu, D. Yan, J. Xie, Z. Wang, and J.-R. Wen, “Enhancing llm reasoning with reward-guided tree search,” 2024.
[1072] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu, “Encouraging divergent thinking in large language models through multi-agent debate,” arXiv preprint arXiv:2305.19118, 2023.
[1073] Y. Du, Z. Liu, Y. Li, W. X. Zhao, Y. Huo, B. Wang, W. Chen, Z. Liu, Z. Wang, and J.-R. Wen, “Virgo: A preliminary exploration on reproducing o1-like mllm,” arXiv preprint arXiv:2501.01904, 2025.
[1074] K. Team, “Kimi k1.5: Scaling reinforcement learning with llms,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12599
[1075] OpenAI, “Openai’s reinforcement fine-tuning research program,” OpenAI Blog, 2024.
[1076] Z. Zeng, Q. Cheng, Z. Yin, B. Wang, S. Li, Y. Zhou, Q. Guo, X. Huang, and X. Qiu, “Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective,” arXiv preprint arXiv:2412.14135, 2024.
[1077] Z. Chen, Y. Min, B. Zhang, J. Chen, J. Jiang, D. Cheng, W. X. Zhao, Z. Liu, X. Miao, Y. Lu, L. Fang, Z. Wang, and J.-R. Wen, “An empirical study on eliciting and improving r1-like reasoning models,” 2025.
[1078] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu et al., “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” arXiv preprint arXiv:2402.03300, 2024.
[1079] W. Kool, H. van Hoof, and M. Welling, “Buy 4 REINFORCE samples, get a baseline for free!” in Deep Reinforcement Learning Meets Structured Prediction, ICLR 2019 Workshop, New Orleans, Louisiana, United States, May 6, 2019. OpenReview.net, 2019.
[1080] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” 2024. [Online]. Available: https://arxiv.org/abs/2408.03314
[1081] W. Kuang, B. Qian, Z. Li, D. Chen, D. Gao, X. Pan, Y. Xie, Y. Li, B. Ding, and J. Zhou, “Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning,” 2023.