

A Survey of Large Language Models


Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored how machines can master language intelligence. Language is essentially a complex, intricate system of human expression governed by grammatical rules, and it poses a significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP) tasks. Since researchers have found that model scaling can lead to improved model capacity, they have further investigated the scaling effect by increasing the parameter scale to even larger sizes. Interestingly, when the parameter scale exceeds a certain level, these enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-context learning) that are not present in small-scale language models (e.g., BERT). To distinguish language models at different parameter scales, the research community has coined the term large language models (LLM) for PLMs of significant size (e.g., containing tens or hundreds of billions of parameters). Recently, research on LLMs has been advanced substantially by both academia and industry, and one remarkable milestone is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI community, and would revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this survey, we review the recent advances in LLMs by introducing the background, key findings, and mainstream techniques. In particular, we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also summarize the available resources for developing LLMs and discuss remaining issues for future directions. This survey provides an up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

• Version: v10 (update on May 7, 2023).
• GitHub link: https://github.com/RUCAIBox/LLMSurvey
• * K. Zhou and J. Li contribute equally to this work.
• The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: [email protected]

1 INTRODUCTION

LANGUAGE is a prominent ability in human beings to express and communicate, which develops in early childhood and evolves over a lifetime [1, 2]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal, to enable machines to read, write, and communicate like humans [3].

Technically, language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, and can be divided into four major development stages:

• Statistical language models (SLM). SLMs [4–7] are developed based on statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. The SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [8, 9] and natural language processing (NLP) [10–12]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [13] and Good–Turing estimation [14] have been introduced to alleviate the data sparsity problem (a minimal sketch follows below).
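To make the Markov assumption concrete, the following sketch trains a bigram model with add-one (Laplace) smoothing on a toy corpus. The corpus and helper names are our own illustrative choices, not drawn from the cited papers, and add-one smoothing stands in for the more careful back-off and Good–Turing strategies mentioned above.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams from whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(unigrams, bigrams, prev, word):
    """P(word | prev) under the Markov assumption, with add-one smoothing
    so that unseen transitions get a small nonzero probability (the data
    sparsity problem discussed above)."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

corpus = ["the model predicts the next word", "the next word is predicted"]
uni, bi = train_bigram(corpus)
print(bigram_prob(uni, bi, "the", "next"))  # P(next | the)
```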
• Neural language models (NLM). NLMs [15–17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs). As a remarkable contribution, the work in [15] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for words or sentences, a general neural network approach was developed to build a unified solution for various NLP tasks [18]. Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
• Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study inspired a large number of follow-up works, which set the "pre-training and fine-tuning" learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, fine-tuning the PLM is often required for adapting to different downstream tasks.

• Large language models (LLM). Researchers find that scaling PLMs (e.g., scaling the model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies have explored the performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community has coined the term "large language models (LLM)"1 for these large-sized PLMs [32–35]. A remarkable application of LLMs is ChatGPT2, which adapts the LLMs from the GPT series for dialogue and exhibits an amazing ability to converse with humans.

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., the GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. The training of LLMs requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are having a significant impact on the AI community, and the advent of ChatGPT and GPT-4 has led to a rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled "Planning for AGI and beyond", which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information-seeking mode of AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt at enhancing search results based on LLMs. In the field of CV, researchers are trying to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 [46] has supported multimodal input by integrating visual information. This new wave of technology could potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

Despite the progress and impact, the underlying principles of LLMs remain under-explored. Firstly, it is mysterious why emergent abilities occur in LLMs but not in smaller PLMs. As a more general issue, there is a lack of deep, detailed investigation into the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the "secrets" of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand for computation resources, it is very costly to carry out repetitive, ablating studies to investigate the effect of various training strategies. Indeed, LLMs are mainly trained in industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite their capacities, LLMs are also likely to produce toxic, fictitious, or harmful content. Effective and efficient control approaches are required to eliminate the potential risks of using LLMs [46].

Faced with both opportunities and challenges, the research and development of LLMs needs more attention. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation tuning (how to effectively tune pre-trained LLMs from the two perspectives of effectiveness and safety), utilization (how to use LLMs for solving various downstream tasks) and capacity evaluation (how to evaluate the abilities of LLMs and existing empirical findings).

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/
3. https://www.bing.com/new
We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background for LLMs, with the terminology, settings, resources, and organization outline, followed by the summarization of available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation tuning, utilization, and capacity evaluation, respectively. Finally, we conclude the survey in Section 8 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we present an overview of the background of LLMs and then summarize the technical evolution of the GPT-series models.

2.1 Background for LLMs

Typically, large language models (LLMs) refer to Transformer language models that contain hundreds of billions (or more) of parameters4, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. LLMs exhibit strong capacities to understand natural language and solve complex tasks (via text generation). To give a quick sense of how LLMs work, this part introduces the basic background for LLMs, including scaling laws, emergent abilities and key techniques.

Scaling Laws for LLMs. Currently, LLMs are mainly built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs adopt similar Transformer architectures and pre-training objectives (e.g., language modeling) as small language models, but largely scale up the model size, data size, and total compute (by orders of magnitude). Extensive research has shown that scaling can largely improve the model capacity of LLMs [26, 55, 56]. Thus, it is useful to establish a quantitative approach to characterizing the scaling effect. Next, we introduce two representative scaling laws for Transformer language models [30, 34].

• KM scaling law5. In 2020, Kaplan et al. [30] (the OpenAI team) first proposed to model the power-law relationship of model performance with respect to three major factors, namely model size (N), dataset size (D), and the amount of training compute (C), for neural language models. Given a compute budget c, they empirically presented three basic formulas for the scaling law6:

  L(N) = (N_c / N)^{α_N},   α_N ∼ 0.076,  N_c ∼ 8.8 × 10^13      (1)
  L(D) = (D_c / D)^{α_D},   α_D ∼ 0.095,  D_c ∼ 5.4 × 10^13
  L(C) = (C_c / C)^{α_C},   α_C ∼ 0.050,  C_c ∼ 3.1 × 10^8

where L(·) denotes the cross-entropy loss in nats. The three laws were derived by fitting model performance with varied data sizes (22M to 23B tokens), model sizes (768M to 1.5B non-embedding parameters) and training compute, under some assumptions (e.g., the analysis of one factor should not be bottlenecked by the other two factors). They showed that model performance has a strong dependence on these three factors.

• Chinchilla scaling law. As another representative study, Hoffmann et al. [34] (the Google DeepMind team) proposed an alternative form of scaling laws to instruct compute-optimal training for LLMs. They conducted rigorous experiments by varying a larger range of model sizes (70M to 16B) and data sizes (5B to 500B tokens), and fitted a similar scaling law yet with different coefficients, as below [34]:

  L(N, D) = E + A / N^α + B / D^β,      (2)

where E = 1.69, A = 406.4, B = 410.7, α = 0.34 and β = 0.28. By optimizing the loss L(N, D) under the constraint C ≈ 6ND, they showed that the optimal allocation of the compute budget to model size and data size can be derived as follows:

  N_opt(C) = G · (C/6)^a,   D_opt(C) = G^{-1} · (C/6)^b,      (3)

where a = β/(α+β), b = α/(α+β), and G is a scaling coefficient that can be computed from A, B, α and β. As analyzed in [34], given an increase in compute budget, the KM scaling law favors a larger budget allocation to model size than to data size, while the Chinchilla scaling law argues that the two sizes should be increased in equal scales, i.e., having similar values for a and b in Equation (3).
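As a worked illustration of Equations (2) and (3), the sketch below plugs the fitted Chinchilla coefficients into the closed-form allocation under the C ≈ 6ND approximation. It is a minimal sketch of ours, not code from [34]; the function names and the example budget are illustrative.

```python
# Fitted coefficients of the Chinchilla law, L(N, D) = E + A/N^alpha + B/D^beta.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n, d):
    """Predicted cross-entropy loss for n parameters and d training tokens."""
    return E + A / n**alpha + B / d**beta

def compute_optimal(c):
    """Loss-minimizing (N, D) for a FLOP budget c under C ~= 6*N*D (Eq. 3)."""
    a, b = beta / (alpha + beta), alpha / (alpha + beta)
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    return G * (c / 6) ** a, (1 / G) * (c / 6) ** b

# Example: a budget comparable to a Chinchilla-scale run, C = 6 * 70e9 * 1.4e12.
n_opt, d_opt = compute_optimal(6 * 70e9 * 1.4e12)
print(f"N_opt ~ {n_opt:.2e} params, D_opt ~ {d_opt:.2e} tokens, "
      f"predicted loss ~ {loss(n_opt, d_opt):.3f}")
```

Because a and b are close to each other, doubling the compute budget roughly doubles both the optimal parameter count and the optimal token count, which is the "equal scales" claim in the text.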
Though derived under some restricted assumptions, these scaling laws provide an intuitive understanding of the scaling effect, making it feasible to predict the performance of LLMs during training [46]. However, some abilities (e.g., in-context learning [55]) are unpredictable according to the scaling law, and can be observed only when the model size exceeds a certain level (as discussed below).

4. In the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs and mainly focus on discussing language models with a model size larger than 10B.
5. Since there was no model trained following this law in the original paper, we took the last names of the two co-first authors to name this scaling law.
6. Here, N_c, D_c and C_c are measured in the number of non-embedding parameters, the number of training tokens and the number of PF-days, respectively. According to the original paper [30], C_c and C should be denoted as C_c^min and C_min, corresponding to the optimal use of compute. We use the simplified notations for ease of discussion.
Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as "the abilities that are not present in small models but arise in large models", which is one of the most prominent features that distinguish LLMs from previous PLMs. The work [31] further introduces a notable characteristic of when emergent abilities occur: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 58]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 59], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three typical emergent abilities for LLMs and representative models that possess such an ability7; a prompt sketch illustrating the first and third abilities follows this list.

• In-context learning. The in-context learning (ICL) ability is formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for the test instances by completing the word sequence of the input text, without requiring additional training or gradient update8. Among the GPT-series models, the 175B GPT-3 model exhibited a strong ICL ability in general, but not the GPT-1 and GPT-2 models. Such an ability also depends on the specific downstream task. For example, the ICL ability can emerge on arithmetic tasks (e.g., 3-digit addition and subtraction) for the 13B GPT-3, yet even the 175B GPT-3 cannot work well on the Persian QA task [31].

• Instruction following. By fine-tuning with a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 61, 62]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. According to the experiments in [62], instruction-tuned LaMDA-PT [63] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8B or smaller model sizes. A recent study [64] found that a model size of at least 62B is required for PaLM to perform well on various tasks in four evaluation benchmarks (i.e., MMLU, BBH, TyDiQA and MGSM), though a much smaller size might suffice for some specific tasks (e.g., MMLU).

• Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. With the chain-of-thought (CoT) prompting strategy [33], LLMs can solve such tasks by utilizing a prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47]. An empirical study [33] has shown that CoT prompting can bring performance gains (on arithmetic reasoning benchmarks) when applied to PaLM and LaMDA variants with a model size larger than 60B, while its advantage over standard prompting becomes more evident when the model size exceeds 100B. Besides, the performance improvement from CoT prompting also seems to vary across tasks, e.g., GSM8K > MAWPS > SWAMP for PaLM [33].

7. It is difficult to accurately examine the critical size for emergent abilities of LLMs (i.e., the minimum size to possess an ability), since it might vary for different models or tasks. Besides, existing studies often test emergent abilities on very limited model sizes for a specific LLM. For example, PaLM is often tested with three sizes of 8B, 62B and 540B, and it is unclear how the model performs at the untested sizes.
8. A recent study [60] also shows that in-context learning implicitly performs meta-optimization through the attention mechanism.
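To make in-context learning and chain-of-thought prompting concrete, the sketch below assembles both styles of prompt as plain strings. The demonstrations and helper names are illustrative examples of ours, not taken from the cited papers; the only difference between the two prompts is whether the demonstration includes intermediate reasoning steps.

```python
def icl_prompt(instruction, demos, query):
    """Few-shot in-context learning: condition the model on an instruction
    and demonstrations, then let it complete the final answer."""
    lines = [instruction]
    for q, a in demos:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

# Standard few-shot prompt: the demonstration shows only the final answer.
standard = icl_prompt(
    "Answer the arithmetic question.",
    [("What is 123 + 456?", "579")],
    "What is 234 + 567?",
)

# Chain-of-thought prompt: the demonstration also spells out intermediate
# steps, which elicits step-by-step reasoning in sufficiently large models.
cot = icl_prompt(
    "Answer the arithmetic question.",
    [("What is 123 + 456?",
      "123 + 456 = 123 + 400 + 56 = 523 + 56 = 579. The answer is 579.")],
    "What is 234 + 567?",
)
print(standard, cot, sep="\n\n---\n\n")
```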
Key Techniques for LLMs. It has been a long way for LLMs to evolve into their current state: general and capable learners. In the development process, a number of important techniques have been proposed that largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

• Scaling. As discussed in previous parts, there exists an evident scaling effect in Transformer language models: larger model/data sizes and more training compute typically lead to an improved model capacity [30, 34]. As two representative models, GPT-3 and PaLM explored the scaling limits by increasing the model size to 175B and 540B, respectively. Furthermore, since the compute budget is usually limited, scaling laws can be employed to conduct a more compute-efficient allocation of compute resources. For example, Chinchilla (with more training tokens) outperforms its counterpart model Gopher (with a larger model size) by increasing the data scale with the same compute budget [34]. It should be noted, though, that data scaling should come with a careful cleaning process, since the quality of pre-training data plays a key role in the model capacity.

• Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [65] and Megatron-LM [66–68]. Besides, optimization tricks are also important for training stability and model performance, e.g., restarts to overcome training loss spikes [56] and mixed precision training [69]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models from much smaller models.

• Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. However, these abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Besides, we can further perform instruction tuning on LLMs with task descriptions expressed in natural language, to improve the generalizability of LLMs on unseen tasks. These techniques mainly correspond to the emergent abilities of LLMs, and may not show the same effect on small language models.
• Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT [61] designs an effective tuning approach that enables LLMs to follow expected instructions, which utilizes the technique of reinforcement learning with human feedback [61, 70]. It incorporates humans into the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a similar technique to InstructGPT, and shows a strong alignment capacity in producing high-quality, harmless responses, e.g., rejecting to answer insulting questions.

• Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, and thus perform less well on tasks that are not best expressed in the form of text (e.g., numerical computation). Besides, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [71, 72]. For example, LLMs can utilize a calculator for accurate computation [71] and employ search engines to retrieve unknown information [72]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)9, which are, by analogy, the "eyes and ears" of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.

Besides, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. We limit our discussion, though, to the major technical approaches and key findings for developing LLMs.

2.2 Technical Evolution of GPT-series Models

Due to its excellent capacity in communicating with humans, ChatGPT has ignited the excitement of the AI community since its release. ChatGPT is developed based on the powerful GPT model with specially optimized conversation capacities. Considering the ever-growing interest in ChatGPT and GPT models, we add a special discussion on the technical evolution of the GPT-series models, briefly summarizing how they have been developed in the past years. Overall, the research of OpenAI on LLMs can be roughly divided into the following stages10.

Early Explorations. According to one interview with Ilya Sutskever11 (a co-founder and chief scientist of OpenAI), the idea of approaching intelligent systems with language models was already explored in the early days of OpenAI, though it was attempted with recurrent neural networks (RNN) [104]. With the advent of the Transformer, OpenAI developed two initial GPT models, namely GPT-1 [105] and GPT-2 [26], which can be considered as the foundation of the subsequent, more powerful models, i.e., GPT-3 and GPT-4.

• GPT-1. In 2017, the Transformer model [22] was introduced by Google, and the OpenAI team quickly adapted their language modeling work to this new neural network architecture. They released the first GPT model in 2018, i.e., GPT-1 [105], and coined the abbreviation GPT as the model name, standing for Generative Pre-Training. GPT-1 was developed based on a generative, decoder-only Transformer architecture, and adopted a hybrid approach of unsupervised pre-training and supervised fine-tuning. GPT-1 set up the core architecture for the GPT-series models and established the underlying principle of modeling natural language text, i.e., predicting the next word.

• GPT-2. Following a similar architecture to GPT-1, GPT-2 [26] increased the parameter scale to 1.5B and was trained with a large webpage dataset, WebText. As claimed in the GPT-2 paper, it sought to perform tasks via unsupervised language modeling, without explicit fine-tuning using labeled data. To motivate the approach, they introduced a probabilistic form for multi-task solving, i.e., p(output|input, task) (similar approaches have been adopted in [106]), which predicts the output conditioned on the input and the task information. To model this conditional probability, language text can be naturally employed as a unified way to format input, output and task information. In this way, the process of solving a task can be cast as a word prediction problem for generating the solution text (see the sketch below). Further, they introduced a more formal claim for this idea: "Since the (task-specific) supervised objective is the same as the unsupervised (language modeling) objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective (for various tasks)" [26]12. A basic understanding of this claim is that each (NLP) task can be considered as a word prediction problem based on a subset of the world text. Thus, unsupervised language modeling could be capable of solving various tasks if it were trained with sufficient capacity to recover the world text. These early discussions in the GPT-2 paper are echoed in the interview of Ilya Sutskever by Jensen Huang: "What the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world...the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process..."13.
be roughly divided into the following stages10 . this process...”13 .

Early Explorations. According to one interview with Ilya Capacity Leap. Although GPT-2 is intended to be an “un-
Sutskever11 (a co-founder and chief scientist of OpenAI), supervised multitask learner”, it overall has an inferior
the idea of approaching intelligent systems with language performance compared with supervised fine-tuning state-
models was already explored in the early days of Ope- of-the-art methods. While, it has a relatively small model
nAI, while it was attempted with recurrent neural net- size, it has widely fine-tuned in downstream tasks, espe-
works (RNN) [104]. With the advent of Transformer, OpenAI cially the dialog tasks [107, 108]. Based on GPT-2, GPT-3
developed two initial GPT models, namely GPT-1 [105] and demonstrates a key capacity leap by scaling of the (nearly
GPT-2 [26], which can considered as the foundation to more same) generative pre-training architecture.
powerful models subsequently i.e., GPT-3 and GPT-4. • GPT-3. GPT-3 [55] was released in 2020, which scaled
the model parameters to an ever larger size of 175B. In
9. https://openai.com/blog/chatgpt-plugins
10. Note that the discussion of this part can be somewhat subjective. The overall viewpoints and summaries are made based on the understanding of the authors by surveying the papers, blog articles, interview reports and APIs released by OpenAI.
11. https://hackernoon.com/an-interview-with-ilya-sutskever-co-founder-of-openai
12. To better understand this sentence, we put some explanatory words in parentheses.
13. https://lifearchitect.ai/ilya/
[Figure 1 (timeline graphic) omitted from this text version: it charts LLMs larger than 10B released from 2019 (T5, GPT-3) through early 2023 (LLaMA, ChatGPT, GPT-4), distinguishing publicly available checkpoints from closed-source models.]

Fig. 1. A timeline of existing large language models (having a size larger than 10B) in recent years. The timeline was established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for a model. If there was not a corresponding paper, we set the date of a model as the earliest time of its public release or announcement. We mark the LLMs with publicly available model checkpoints in yellow color. Due to the space limit of the figure, we only include the LLMs with publicly reported evaluation results.

In the GPT-3 paper, the concept of in-context learning (ICL)14 was formally introduced, which utilizes LLMs in a few-shot or zero-shot way. ICL can teach (or instruct) LLMs to understand tasks in the form of natural language text. With ICL, the pre-training and utilization of LLMs converge to the same language modeling paradigm: pre-training predicts the following text sequence conditioned on the context, while ICL predicts the correct task solution, which can also be formatted as a text sequence, given the task description and demonstrations. GPT-3 demonstrates excellent performance not only on a variety of NLP tasks, but also on a number of specially designed tasks that require the abilities of reasoning or domain adaptation. Although the GPT-3 paper does not explicitly discuss the emergent abilities of LLMs, we can observe a large performance leap that might transcend the basic scaling law [30], e.g., larger models have significantly stronger ICL ability (illustrated in the original Figure 1.2 of the GPT-3 paper [55]). Overall, GPT-3 can be viewed as a remarkable landmark in the journey evolving from PLMs to LLMs. It has empirically proved that scaling the neural networks to a significant size can lead to a huge increase in model capacity.

Capacity Enhancement. Due to its strong capacities, GPT-3 has been the base model for developing even more capable LLMs at OpenAI. Overall, OpenAI has explored two major approaches to further improving the GPT-3 model, i.e., training on code data and alignment with human preference, which are detailed as follows.

• Training on code data. A major limitation of the original GPT-3 model (pre-trained on plain text) lies in the lack of reasoning ability on complex tasks, e.g., completing code and solving math problems. To enhance this ability, Codex [89] was introduced by OpenAI in July 2021; it is a GPT model fine-tuned on a large corpus of GitHub code. It was demonstrated that Codex can solve very difficult programming problems, and training on code also leads to a significant performance improvement in solving math problems [109]. Further, a contrastive approach [110] to training text and code embeddings was reported in January 2022, which was shown to improve a series of related tasks (i.e., linear-probe classification, text search and code search). Actually, the GPT-3.5 models are developed based on a code-based GPT model (i.e., code-davinci-002), which indicates that training on code data is a very useful practice for improving the model capacity of GPT models, especially the reasoning ability. Besides, there is also speculation that training on code data can greatly increase the chain-of-thought prompting abilities of LLMs [47], though this still awaits more thorough verification.

• Human alignment. The related research on human alignment can be dated back to 2017 (or earlier) for OpenAI: a blog article entitled "learning from human preferences"15 was posted on the OpenAI blog, describing a work that applied reinforcement learning (RL) to learn from the preference comparisons annotated by humans [70] (similar to the reward training step in the aligning algorithm of InstructGPT in Figure 6). Shortly after the release of this RL paper [70], the paper of Proximal Policy Optimization (PPO) [111] was published in July 2017; PPO has since become the foundational RL algorithm for learning from human preferences [61].

14. GPT-2 essentially used ICL for unsupervised task learning, though it wasn't called ICL at that time.
15. https://openai.com/research/learning-from-human-preferences
TABLE 1
Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs. In this table, we only include LLMs with a public paper about the technical details. Here, "Release Time" indicates the date when the corresponding paper was officially released. "Publicly Available" means that the model checkpoints can be publicly accessed, while "Closed Source" means the opposite. "Adaptation" indicates whether the model has undergone subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback. "Evaluation" indicates whether the model has been evaluated with corresponding abilities in its original paper: ICL denotes in-context learning and CoT denotes chain-of-thought. "*" denotes the largest publicly available version.

| Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | ICL | CoT |

Publicly Available:
| T5 [73] | Oct-2019 | 11 | - | - | - | 1T tokens | Apr-2019 | 1024 TPU v3 | - | ✓ | - |
| mT5 [74] | Oct-2020 | 13 | - | - | - | 1T tokens | - | - | - | ✓ | - |
| PanGu-α [75] | Apr-2021 | 13* | - | - | - | 1.1TB | - | 2048 Ascend 910 | - | ✓ | - |
| CPM-2 [76] | Jun-2021 | 198 | - | - | - | 2.6TB | - | - | - | - | - |
| T0 [28] | Oct-2021 | 11 | T5 | ✓ | - | - | - | 512 TPU v3 | 27 h | ✓ | - |
| CodeGen [77] | Mar-2022 | 16 | - | - | - | 577B tokens | - | - | - | ✓ | - |
| GPT-NeoX-20B [78] | Apr-2022 | 20 | - | - | - | 825GB | - | 96 40G A100 | - | ✓ | - |
| Tk-Instruct [79] | Apr-2022 | 11 | T5 | ✓ | - | - | - | 256 TPU v3 | 4 h | ✓ | - |
| UL2 [80] | May-2022 | 20 | - | - | - | 1T tokens | Apr-2019 | 512 TPU v4 | - | ✓ | ✓ |
| OPT [81] | May-2022 | 175 | - | - | - | 180B tokens | - | 992 80G A100 | - | ✓ | - |
| NLLB [82] | Jul-2022 | 54.5 | - | - | - | - | - | - | - | ✓ | - |
| GLM [83] | Oct-2022 | 130 | - | - | - | 400B tokens | - | 768 40G A100 | 60 d | ✓ | - |
| Flan-T5 [64] | Oct-2022 | 11 | T5 | ✓ | - | - | - | - | - | ✓ | ✓ |
| BLOOM [69] | Nov-2022 | 176 | - | - | - | 366B tokens | - | 384 80G A100 | 105 d | ✓ | - |
| mT0 [84] | Nov-2022 | 13 | mT5 | ✓ | - | - | - | - | - | ✓ | - |
| Galactica [35] | Nov-2022 | 120 | - | - | - | 106B tokens | - | - | - | ✓ | ✓ |
| BLOOMZ [84] | Nov-2022 | 176 | BLOOM | ✓ | - | - | - | - | - | ✓ | - |
| OPT-IML [85] | Dec-2022 | 175 | OPT | ✓ | - | - | - | 128 40G A100 | - | ✓ | ✓ |
| LLaMA [57] | Feb-2023 | 65 | - | - | - | 1.4T tokens | - | 2048 80G A100 | 21 d | ✓ | - |
| CodeGeeX [86] | Sep-2022 | 13 | - | - | - | 850B tokens | - | 1536 Ascend 910 | 60 d | ✓ | - |
| Pythia [87] | Apr-2023 | 12 | - | - | - | 300B tokens | - | 256 40G A100 | - | ✓ | - |

Closed Source:
| GPT-3 [55] | May-2020 | 175 | - | - | - | 300B tokens | - | - | - | ✓ | - |
| GShard [88] | Jun-2020 | 600 | - | - | - | 1T tokens | - | 2048 TPU v3 | 4 d | - | - |
| Codex [89] | Jul-2021 | 12 | GPT-3 | - | - | 100B tokens | May-2020 | - | - | ✓ | - |
| ERNIE 3.0 [90] | Jul-2021 | 10 | - | - | - | 375B tokens | - | 384 V100 | - | ✓ | - |
| Jurassic-1 [91] | Aug-2021 | 178 | - | - | - | 300B tokens | - | 800 GPU | - | ✓ | - |
| HyperCLOVA [92] | Sep-2021 | 82 | - | - | - | 300B tokens | - | 1024 A100 | 13.4 d | ✓ | - |
| FLAN [62] | Sep-2021 | 137 | LaMDA-PT | ✓ | - | - | - | 128 TPU v3 | 60 h | ✓ | - |
| Yuan 1.0 [93] | Oct-2021 | 245 | - | - | - | 180B tokens | - | 2128 GPU | - | ✓ | - |
| Anthropic [94] | Dec-2021 | 52 | - | - | - | 400B tokens | - | - | - | ✓ | - |
| WebGPT [72] | Dec-2021 | 175 | GPT-3 | - | ✓ | - | - | - | - | ✓ | - |
| Gopher [59] | Dec-2021 | 280 | - | - | - | 300B tokens | - | 4096 TPU v3 | 920 h | ✓ | - |
| ERNIE 3.0 Titan [95] | Dec-2021 | 260 | - | - | - | - | - | - | - | ✓ | - |
| GLaM [96] | Dec-2021 | 1200 | - | - | - | 280B tokens | - | 1024 TPU v4 | 574 h | ✓ | - |
| LaMDA [63] | Jan-2022 | 137 | - | - | - | 768B tokens | - | 1024 TPU v3 | 57.7 d | - | - |
| MT-NLG [97] | Jan-2022 | 530 | - | - | - | 270B tokens | - | 4480 80G A100 | - | ✓ | - |
| AlphaCode [98] | Feb-2022 | 41 | - | - | - | 967B tokens | Jul-2021 | - | - | - | - |
| InstructGPT [61] | Mar-2022 | 175 | GPT-3 | ✓ | ✓ | - | - | - | - | ✓ | - |
| Chinchilla [34] | Mar-2022 | 70 | - | - | - | 1.4T tokens | - | - | - | ✓ | - |
| PaLM [56] | Apr-2022 | 540 | - | - | - | 780B tokens | - | 6144 TPU v4 | - | ✓ | ✓ |
| AlexaTM [99] | Aug-2022 | 20 | - | - | - | 1.3T tokens | - | 128 A100 | 120 d | ✓ | ✓ |
| Sparrow [100] | Sep-2022 | 70 | - | - | ✓ | - | - | 64 TPU v3 | - | ✓ | - |
| WeLM [101] | Sep-2022 | 10 | - | - | - | 300B tokens | - | 128 A100 40G | 24 d | ✓ | - |
| U-PaLM [102] | Oct-2022 | 540 | PaLM | - | - | - | - | 512 TPU v4 | 5 d | ✓ | ✓ |
| Flan-PaLM [64] | Oct-2022 | 540 | PaLM | ✓ | - | - | - | 512 TPU v4 | 37 h | ✓ | ✓ |
| Flan-U-PaLM [64] | Oct-2022 | 540 | U-PaLM | ✓ | - | - | - | - | - | ✓ | ✓ |
| GPT-4 [46] | Mar-2023 | - | - | ✓ | ✓ | - | - | - | - | ✓ | ✓ |
| PanGu-Σ [103] | Mar-2023 | 1085 | PanGu-α | - | - | 329B tokens | - | 512 Ascend 910 | 100 d | ✓ | - |

Later, in January 2020, GPT-2 was fine-tuned using the aforementioned RL algorithms [70, 111], which leveraged human preferences to improve the capacities of GPT-2 on NLP tasks. In the same year, another work [112] trained a summarization model for optimizing human preferences in a similar way. Based on these prior works, InstructGPT [61] was proposed in January 2022 to improve the GPT-3 model for human alignment, formally establishing a three-stage reinforcement learning from human feedback (RLHF) algorithm. Note that the wording "instruction tuning" has seldom been used in OpenAI's papers and documentation; it is substituted by supervised fine-tuning on human demonstrations (i.e., the first step of the RLHF algorithm [61]). In addition to improving the instruction following capacity, the RLHF algorithm is particularly useful for mitigating the issues of generating harmful or toxic content with LLMs, which is key to the safe deployment of LLMs in practice. OpenAI describes their approach to alignment research in a technical article [113], which summarizes three promising directions: "training AI systems to use human feedback, to assist human evaluation and to do alignment research".
These enhancement techniques lead to the improved GPT-3 models with stronger capacities, which are called GPT-3.5 models by OpenAI (see the discussion about the OpenAI API in Section 3.1).

The Milestones of Language Models. Based on all the exploration efforts, two major milestones have been achieved by OpenAI, namely ChatGPT [114] and GPT-4 [46], which have largely raised the capacity bar of existing AI systems.

• ChatGPT. In November 2022, OpenAI released the conversation model ChatGPT, based on the GPT models (GPT-3.5 and GPT-4). As the official blog article introduced [114], ChatGPT was trained in a similar way as InstructGPT (called "a sibling model to InstructGPT" in the original post), while specially optimized for dialogue. They reported a difference between the training of ChatGPT and InstructGPT in the data collection setup: human-generated conversations (playing both the roles of user and AI) are combined with the InstructGPT dataset in a dialogue format for training ChatGPT. ChatGPT exhibited superior capacities in communicating with humans: possessing a vast store of knowledge, skill at reasoning on mathematical problems, tracing the context accurately in multi-turn dialogues, and aligning well with human values for safe use. Later on, the plugin mechanism was supported in ChatGPT, which further extends the capacities of ChatGPT with existing tools or apps. So far, it seems to be the most powerful chatbot in AI history. The launch of ChatGPT has a significant impact on future AI research, shedding light on the exploration of human-like AI systems.

• GPT-4. As another remarkable progress, GPT-4 [46] was released in March 2023, which extended the text input to multimodal signals. Overall, GPT-4 has stronger capacities in solving complex tasks than GPT-3.5, showing a large performance improvement on many evaluation tasks. A recent study [41] investigated the capacities of GPT-4 by conducting qualitative tests with human-generated problems, spanning a diverse range of difficult tasks, and showed that GPT-4 can achieve superior performance to prior GPT models such as ChatGPT. Furthermore, GPT-4 responds more safely to malicious or provocative queries, due to six months of iterative alignment (with an additional safety reward signal in the RLHF training). In the technical report, OpenAI has emphasized how to safely develop GPT-4 and applied a number of intervention strategies to mitigate possible issues of LLMs, such as hallucinations, privacy, and overreliance. For example, they introduced a mechanism called red teaming [115] to reduce harmful or toxic content generation. As another important aspect, GPT-4 has been developed on a well-established deep learning infrastructure with improved optimization methods. They introduced a new mechanism called predictable scaling that can accurately predict the final performance with a small proportion of compute during model training.

Despite the huge progress, there are still limitations with these superior LLMs, e.g., generating hallucinations with factual errors or potentially risky responses within some specific contexts [46]. More limitations or issues of LLMs will be discussed in Section 7. Developing more capable, safer LLMs poses long-standing research challenges. From the perspective of engineering, OpenAI has adopted an iterative deployment strategy [116] to develop the models and products following a five-stage development and deployment life-cycle, which aims to effectively reduce the potential risks of using the models. In the following, we dive into the technical details in order to have a specific understanding of how they have been developed.

3 RESOURCES OF LLMS

It is by no means an easy job to develop or reproduce LLMs, considering the challenging technical issues and huge demands for computation resources. A feasible way is to learn from the experiences of existing LLMs and reuse publicly available resources for incremental development or experimental study. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints (or APIs), corpora and libraries.

3.1 Publicly Available Model Checkpoints or APIs

Given the huge cost of model pre-training, well-trained model checkpoints are critical to the study and development of LLMs for the research community. Since the parameter scale is a key factor to consider when using LLMs, we categorize these public models into two scale levels (i.e., tens of billions of parameters and hundreds of billions of parameters), which is useful for users to identify suitable resources according to their resource budget. Besides, for inference, we can directly employ public APIs to perform our tasks, without running the model locally. Next, we introduce the publicly available model checkpoints and APIs.

Models with Tens of Billions of Parameters. Most of the models in this category have a parameter scale ranging from 10B to 20B, except LLaMA [57] (containing 65B parameters in the largest version) and NLLB [82] (containing 54.5B parameters in the largest version). Other models within this range include mT5 [74], PanGu-α [75], T0 [28], GPT-NeoX-20B [78], CodeGen [77], UL2 [80], Flan-T5 [64], and mT0 [84]. Among them, Flan-T5 (11B version) can serve as a premier model for research on instruction tuning, since it explores instruction tuning from three aspects [64]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data. Besides, CodeGen (11B version), an autoregressive language model designed for generating code, can be considered a good candidate for exploring code generation ability. It also introduces a new benchmark, MTPB [77], specially for multi-turn program synthesis, which is composed of 115 expert-generated problems. To solve these problems, LLMs need to acquire sufficient programming knowledge (e.g., math, array operations, and algorithms). As for multilingual tasks, mT0 (13B version) might be a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts. Furthermore, PanGu-α [75] shows good performance in Chinese downstream tasks in zero-shot or few-shot settings; it is developed based on the deep learning framework MindSpore [117]. Note that PanGu-α [75] holds multiple versions of models (up to 200B parameters), while the largest public version has 13B parameters.
As a more recent release, LLaMA (65B version) [57], which contains approximately five times as many parameters as the other models in this category, has exhibited superior performance in tasks related to instruction following. Due to its openness and effectiveness, LLaMA has attracted significant attention from the research community, and many efforts [118–121] have been devoted to fine-tuning or continually pre-training its different model versions for implementing new models or tools. Typically, pre-training models at this scale requires hundreds or even thousands of GPUs or TPUs. For instance, GPT-NeoX-20B uses 12 supermicro servers, each equipped with 8 NVIDIA A100-SXM4-40GB GPUs, while LLaMA utilizes 2,048 A100-80G GPUs, as reported in their original publications. To accurately estimate the computation resources needed, it is suggested to use metrics measuring the number of involved computations, such as FLOPS (i.e., FLoating point number Operations Per Second) [30].
Operations Per Second) [30]. OpenWebText [126] 38GB Reddit links Mar-2023
Pushift.io [127] 2TB Reddit links Mar-2023
Models with Hundreds of Billions of Parameters. For Wikipedia [128] 21GB Wikipedia Mar-2023
BigQuery [129] - Codes Mar-2023
models in this category, only a handful of models have been the Pile [130] 800GB Other Dec-2020
publicly released. For example, OPT [81], OPT-IML [85], ROOTS [131] 1.6TB Other Jun-2022
BLOOM [69], and BLOOMZ [84] have nearly the same num-
ber of parameters as GPT-3 (175B version), while GLM [83]
and Galactica [35] have 130B and 120B parameters, respec- 3.2 Commonly Used Corpora
tively. Among them, OPT (175B version) has been spe- In contrast to earlier PLMs, LLMs which consist of a signifi-
cially motivated for open sharing, which aims to enable cantly larger number of parameters require a higher volume
researchers to carry out reproducible research at scale. For of training data that covers a broad range of content. For
research in cross-lingual generalization, BLOOM (176B ver- this need, there are increasingly more accessible training
sion) and BLOOMZ (176B version) can be used as base datasets that have been released for research. In this section,
models, due to the competence in multilingual language we will briefly summarize several widely used corpora for
modeling tasks. Among these models, OPT-IML have been training LLMs. Based on their content types, we catego-
tuned with instructions, which might be good candidates for rize these corpora into six groups: Books, CommonCrawl,
studying the effect of instruction tuning. Models of this scale Reddit links, Wikipedia, Code, and others.
typically require thousands of GPUs or TPUs to train. For
Books. BookCorpus [122] is a commonly used dataset in
instance, OPT (175B version) used 992 A100-80GB GPUs,
previous small-scale models (e.g., GPT [105] and GPT-2 [26]),
while GLM (130B version) used a cluster of 96 NVIDIA
consisting of over 11,000 books covering a wide range of
DGX-A100 (8x40G) GPU nodes.
topics and genres (e.g., novels and biographies). Another
Public API of LLMs. Instead of directly using the large-scale book corpus is Project Gutenberg [123], consist-
model copies, APIs provide a more convenient way ing of over 70,000 literary books including novels, essays,
for common users to use LLMs, without the need of poetry, drama, history, science, philosophy, and other types
running the model locally. As a representative inter- of works in the public domain. It is currently one of the
face for using LLMs, the APIs for the GPT-series mod- largest open-source book collections, which is used in train-
els [46, 55, 61, 89] have been widely used for both ing of MT-NLG [97] and LLaMA [57]. As for Books1 [55] and
academia and industry16 . OpenAI has provided seven Books2 [55] used in GPT-3 [55], they are much larger than
major interfaces to the models in GPT-3 series: ada, BookCorpus but have not been publicly released so far.
babbage, curie, davinci (the most powerful version in CommonCrawl. CommonCrawl [132] is one of the largest
GPT-3 series), text-ada-001, text-babbage-001, and open-source web crawling databases, containing a petabyte-
text-curie-001. Among them, the first four interfaces scale data volume, which has been widely used as training
can be further fine-tuned on the host server of OpenAI. data for existing LLMs. As the whole dataset is very large,
In particular, babbage, curie, and davinci correspond existing studies mainly extract subsets of web pages from
to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, it within a specific period. However, due to the widespread
respectively [55]. Besides, there are also two APIs related existence of noisy and low-quality information in web data,
to Codex [89], called code-cushman-001 (a powerful it is necessary to perform data preprocessing before usage.
and multilingual version of the Codex (12B) [89]) and Based on CommonCrawl, there are four filtered datasets
code-davinci-002. Further, GPT-3.5 series include one that are commonly used in existing work: C4 [73], CC-
base model code-davinci-002 and three enhanced ver- Stories [124], CC-News [27], and RealNews [125]. The Colos-
sions, namely text-davinci-002, text-davinci-003, sal Clean Crawled Corpus (C4) includes five variants18 ,
and gpt-3.5-turbo-0301. It is worth noting that
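As a minimal sketch of invoking ChatGPT through this API, the snippet below uses the chat-completion interface of the openai Python package as it existed in early 2023 (pre-v1.0); the API key placeholder and the prompt content are our own illustrative examples.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # hypothetical placeholder

# Chat models (gpt-3.5-turbo, gpt-4) take a list of role-tagged messages.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the idea of in-context learning."},
    ],
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])
```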
16. https://platform.openai.com/docs/api-reference/introduction
17. https://platform.openai.com/docs/models/overview
18. https://www.tensorflow.org/datasets/catalog/c4

TABLE 2
Statistics of commonly-used data sources.

| Corpora | Size | Source | Latest Update Time |
| BookCorpus [122] | 5GB | Books | Dec-2015 |
| Gutenberg [123] | - | Books | Dec-2021 |
| C4 [73] | 800GB | CommonCrawl | Apr-2019 |
| CC-Stories-R [124] | 31GB | CommonCrawl | Sep-2019 |
| CC-NEWS [27] | 78GB | CommonCrawl | Feb-2019 |
| RealNews [125] | 120GB | CommonCrawl | Apr-2019 |
| OpenWebText [126] | 38GB | Reddit links | Mar-2023 |
| PushShift.io [127] | 2TB | Reddit links | Mar-2023 |
| Wikipedia [128] | 21GB | Wikipedia | Mar-2023 |
| BigQuery [129] | - | Codes | Mar-2023 |
| the Pile [130] | 800GB | Other | Dec-2020 |
| ROOTS [131] | 1.6TB | Other | Jun-2022 |

3.2 Commonly Used Corpora

In contrast to earlier PLMs, LLMs, which consist of a significantly larger number of parameters, require a higher volume of training data covering a broad range of content. For this need, increasingly more accessible training datasets have been released for research. In this section, we briefly summarize several widely used corpora for training LLMs. Based on their content types, we categorize these corpora into six groups: Books, CommonCrawl, Reddit links, Wikipedia, Code, and others.

Books. BookCorpus [122] is a dataset commonly used in previous small-scale models (e.g., GPT [105] and GPT-2 [26]), consisting of over 11,000 books covering a wide range of topics and genres (e.g., novels and biographies). Another large-scale book corpus is Project Gutenberg [123], consisting of over 70,000 literary books including novels, essays, poetry, drama, history, science, philosophy, and other types of works in the public domain. It is currently one of the largest open-source book collections, and is used in the training of MT-NLG [97] and LLaMA [57]. As for Books1 [55] and Books2 [55] used in GPT-3 [55], they are much larger than BookCorpus but have not been publicly released so far.

CommonCrawl. CommonCrawl [132] is one of the largest open-source web crawling databases, containing a petabyte-scale data volume, which has been widely used as training data for existing LLMs. As the whole dataset is very large, existing studies mainly extract subsets of web pages from it within a specific period. However, due to the widespread existence of noisy and low-quality information in web data, it is necessary to perform data preprocessing before usage. Based on CommonCrawl, there are four filtered datasets that are commonly used in existing work: C4 [73], CC-Stories [124], CC-News [27], and RealNews [125]. The Colossal Clean Crawled Corpus (C4) includes five variants18,
namely en (806G), en.noclean (6T), realnewslike (36G), webtextlike (17G), and multilingual (38T). The en version has been utilized for pre-training T5 [73], LaMDA [63], Gopher [59], and UL2 [80]. The multilingual C4, also called mC4, has been used in mT5 [74]. CC-Stories (31G) is composed of a subset of CommonCrawl data, in which the contents are made in a story-like way. The original source of CC-Stories is not available now, so a reproduction version, CC-Stories-R [133], has been included in Table 2. Moreover, two news corpora extracted from CommonCrawl, i.e., RealNews (120G) and CC-News (76G), are also commonly used as pre-training data.

Reddit Links. Reddit is a social media platform that enables users to submit links and text posts, which can be voted on by others through "upvotes" or "downvotes". Highly upvoted posts are often considered useful, and can be utilized to create high-quality datasets. WebText [26] is a well-known corpus composed of highly upvoted links from Reddit, but it is not publicly available. As a surrogate, there is a readily accessible open-source alternative called OpenWebText [126]. Another corpus extracted from Reddit is PushShift.io [127], a real-time updated dataset that consists of historical data from Reddit since its creation day. PushShift provides not only monthly data dumps but also useful utility tools to support users in searching, summarizing, and conducting preliminary investigations on the entire dataset. This makes it easy for users to collect and process Reddit data.

Wikipedia. Wikipedia [128] is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository style of writing (with supporting references), covering a wide range of languages and fields. Typically, the English-only filtered versions of Wikipedia are widely used in most LLMs (e.g., GPT-3 [55], LaMDA [63], and LLaMA [57]). Wikipedia is available in multiple languages, so it can also be used in multilingual settings.

Code. To collect code data, existing work mainly crawls open-source licensed code from the Internet; a representative source is Google's publicly released BigQuery dataset [129], which includes a substantial number of open-source licensed code snippets (see Table 2).

Others. The Pile [130] is a large-scale, diverse, open-source text dataset (over 800GB) drawn from multiple sources such as books, websites, code, scientific papers, and social media platforms. Besides, ROOTS [131] is composed of various smaller datasets covering 59 languages (1.6TB of text in total), which has been used in training BLOOM [69].

In practice, pre-training LLMs commonly requires a mixture of different data sources (see Figure 2), rather than a single corpus. Therefore, existing studies commonly mix several ready-made datasets (e.g., C4, OpenWebText, and the Pile), and then perform further processing to obtain the pre-training corpus. Besides, to train LLMs that are adaptive to specific applications, it is also important to extract data from relevant sources (e.g., Wikipedia and BigQuery) for enriching the corresponding information in the pre-training data. To have a quick reference of the data sources used in existing LLMs, we present the pre-training corpora of three representative LLMs:

• GPT-3 (175B) [55] was trained on a mixed dataset of 300B tokens, including CommonCrawl [132], WebText2 [55], Books1 [55], Books2 [55], and Wikipedia [128].

• PaLM (540B) [56] uses a pre-training dataset of 780B tokens, which is sourced from social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.

• LLaMA [57] extracts training data from various sources, including CommonCrawl, C4 [73], Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (6B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (32B) and LLaMA (65B).

3.3 Library Resource

In this part, we briefly introduce a series of available libraries for developing LLMs.

• Transformers [135] is an open-source Python library for building models using the Transformer architecture, which is developed and maintained by Hugging Face. It has a simple and user-friendly API, making it easy to use and customize various pre-trained models (a short usage sketch follows at the end of this section). It is a powerful library with a large and active community of users and developers who regularly update and improve the models and algorithms.

• DeepSpeed [65] is a deep learning optimization library
open-source licensed codes from the Internet. Two major (compatible with PyTorch) developed by Microsoft, which
sources are public code repositories under open-source li- has been used to train a number of LLMs, such as MT-
censes (e.g., GitHub) and code-related question-answering NLG [97] and BLOOM [69]. It provides the support of
platforms (e.g., StackOverflow). Google has publicly re- various optimization techniques for distributed training,
leased the BigQuery dataset [129], which includes a substan- such as memory optimization (ZeRO technique, gradient
tial number of open-source licensed code snippets in various checkpointing), and pipeline parallelism.
programming languages, serving as a representative code • Megatron-LM [66–68] is a deep learning library devel-
dataset. CodeGen has utilized BIGQUERY [77], a subset of oped by NVIDIA for training large-scale language models.
the BigQuery dataset, for training the multilingual version It also provides rich optimization techniques for distributed
of CodeGen (CodeGen-Multi). training, including model and data parallelism, mixed-
precision training, and FlashAttention. These optimization
Others. The Pile [130] is a large-scale, diverse, and open- techniques can largely improve the training efficiency and
source text dataset consisting of over 800GB of data from speed, enabling efficient distributed training across GPUs.
multiple sources, including books, websites, codes, scien- • JAX [136] is a Python library for high-performance
tific papers, and social media platforms. It is constructed machine learning algorithms developed by Google, allow-
from 22 diverse high-quality subsets. The Pile dataset is ing users to easily perform computations on arrays with
widely used in models with different parameter scales, such hardware acceleration (e.g., GPU or TPU). It enables efficient
as GPT-J (6B) [134], CodeGen (16B) [77], and Megatron- computation on various devices and also supports several
Turing NLG (530B) [97]. Besides, ROOTS [131] is composed featured functions, such as automatic differentiation and
of various smaller datasets (totally 1.61 TB of text) and just-in-time compilation.
covers 59 different languages (containing natural languages • Colossal-AI [137] is a deep learning library developed
and programming languages), which have been used for by HPC-AI Tech for training large-scale AI models. It is
11

implemented based on PyTorch and supports a rich collec- General data, such as webpages, books, and conversational
tion of parallel training strategies. Furthermore, it can also text, is utilized by most LLMs [55, 56, 81] due to its large,
optimize heterogeneous memory management with meth- diverse, and accessible nature, which can enhance the lan-
ods proposed by PatrickStar [138]. Recently, a ChatGPT-like guage modeling and generalization abilities of LLMs. In
model called ColossalChat [121] has been publicly released light of the impressive generalization capabilities exhibited
with two versions (7B and 13B), which are developed using by LLMs, there are also studies that extend their pre-training
Colossal-AI based on LLaMA [57]. corpus to more specialized datasets, such as multilingual
• BMTrain [139] is an efficient library developed by data, scientific data, and code, endowing LLMs with specific
OpenBMB for training models with large-scale parameters task-solving capabilities [35, 56, 77]. In what follows, we
in a distributed manner, which emphasizes code simplicity, describe these two types of pre-training data sources and
low resource, and high availability. BMTrain has already their effects on LLMs. For a detailed introduction to the
incorporated several common LLMs (e.g., Flan-T5 [64] and commonly used corpus, one can refer to Section 3.2.
GLM [83]) into its ModelCenter, where developers can use
General Text Data. As we can see in Figure 2, the vast
these models directly.
majority of LLMs adopt general-purpose pre-training data,
• FastMoE [140] is a specialized training library for MoE
such as webpages, books, and conversational text, which
(i.e., mixture-of-experts) models. It is developed based on
provides rich text sources on a variety of topics. Next, we
PyTorch, prioritizing both efficiency and user-friendliness
briefly summarize three important kinds of general data.
in its design. FastMoE simplifies the process of transferring
• Webpages. Owing to the proliferation of the Internet,
Transformer models to MoE models and supports both data
various types of data have been created, which enables
parallelism and model parallelism during training.
LLMs to gain diverse linguistic knowledge and enhance
Besides the above library resources, existing deep learn-
their generalization capabilities [26, 73]. For convenient
ing frameworks (e.g., PyTorch [141], TensorFlow [142],
use of these data resources, a large amount of data is
MXNet [143], PaddlePaddle [144], MindSpore [117] and
crawled from the web in previous work, such as Com-
OneFlow [145]) have also provided the support for parallel
monCrawl [132]. However, the crawled web data tends to
algorithms, which are commonly used for training large-
contain both high-quality text, such as Wikipedia and low-
scale models.
quality text, like spam mail, thus it is important to filter and
process webpages for improving the data quality.
4 P RE - TRAINING • Conversation text. Conversation data can enhance the
Pre-training establishes the basis of the abilities of LLMs. By conversational competence of LLMs [81] and potentially im-
pre-training on large-scale corpora, LLMs can acquire essen- prove their performance on a range of question-answering
tial language understanding and generation skills [55, 56]. tasks [56]. Researchers can utilize subsets of public conver-
In this process, the scale and quality of the pre-training sation corpus (e.g., PushShift.io Reddit corpus) [127, 146] or
corpus are critical for LLMs to attain powerful capabilities. collect conversation data from online social media. Since on-
Besides, to effectively pre-train LLMs, model architectures, line conversational data often involves discussions among
acceleration methods, and optimization techniques need to multiple participants, an effective processing way is to
be well designed. In what follows, we first discuss the data transform a conversation into a tree structure, where the
collection and processing in Section 4.1, then introduce the utterance is linked to the one it responds to. In this way, the
commonly used model architectures in Section 4.2, and fi- multi-party conversation tree can be divided into multiple
nally present the training techniques to stably and efficiently sub-conversations, which can be collected in the pre-training
optimize LLMs in Section 4.3. corpus. Furthermore, a potential risk is that the excessive
integration of dialogue data into LLMs may result in a side
4.1 Data Collection effect [81]: declarative instructions and direct interrogatives
are erroneously perceived as the beginning of conversations,
Compared with small-scale language models, LLMs have
thus leading to a decline in the efficacy of the instructions.
a stronger demand for high-quality data for model pre-
• Books. Compared to other corpus, books provide an
training, and their model capacities largely rely on the pre-
important source of formal long texts, which are potentially
training corpus and how it has been preprocessed. In this
beneficial for LLMs to learn linguistic knowledge, model
part, we discuss the collection and processing of pre-training
long-term dependency, and generate narrative and coherent
data, including data sources, preprocessing methods, and
texts. To obtain open-source book data, existing studies
important analysis of how pre-training data affects the
usually adopt the Books3 and Bookcorpus2 datasets, which
performance of LLMs.
are available in the Pile dataset [130].
4.1.1 Data Source Specialized Text Data. Specialized datasets are useful to
To develop a capable LLM, it is key to collect a large amount improve the specific capabilities of LLMs on downstream
of natural language corpus from various data sources. Ex- tasks. Next, we introduce three kinds of specialized data.
isting LLMs mainly leverage a mixture of diverse public • Multilingual text. Besides the text in the target lan-
textual datasets as the pre-training corpus. Figure 2 shows guage, integrating a multilingual corpus can enhance the
the distribution of the sources of pre-training data for a multilingual abilities of language understanding and gen-
number of representative LLMs. eration. For example, BLOOM [69] and PaLM [56] have
The source of pre-training corpus can be broadly cate- curated multilingual data covering 46 and 122 languages,
gorized into two types: general data and specialized data. respectively, within their pre-training corpora. These models
12

T5 (11B) mT5 (13B) LLaMA (65B) GPT-3 (175B) MT-NLG (530B) Gopher (280B) Chinchilla (70B)
3% 2%
2% 5% 16% 3% 4%
5% 26% 4% 37% 40%
62% 60% 56%
6%
100% 100% 87% 84%

GLaM (1200B) PaLM (540B) LaMDA (137B) Galactica (120B) GPT-NeoX (20B) CodeGen (16B) AlphaCode (41B)
5% 8%
13% 8% 20%
22% 14% 7% 30%
31% 38% 39%
48% 6%
38%
10% 10%
30% 50%
50% 86% 15% 25% 100%

Webpages Conversation Data Books & News Scientific Data Code

Fig. 2. Ratios of various data sources in the pre-training data for existing LLMs.

demonstrate impressive performance in multilingual tasks, abilities (e.g., chain-of-thought ability [33]). Besides, it has
such as translation, multilingual summarization, and mul- been shown that formatting reasoning tasks into code can
tilingual question answering, and achieve comparable or help LLMs generate more accurate results [156, 157].
superior performance to the state-of-the-art models that are
fine-tuned on the corpus in the target language(s).
4.1.2 Data Preprocessing
• Scientific text. The exploration of science by humans has
been witnessed by the increasing growth of scientific publi- After collecting a large amount of text data, it is essential
cations. In order to enhance the understanding of scientific to preprocess the data for constructing the pre-training cor-
knowledge for LLMs [35, 147], it is useful to incorporate a pus, especially removing noisy, redundant, irrelevant, and
scientific corpus for model pre-training [35, 147]. By pre- potentially toxic data [56, 59], which may largely affect the
training on a vast amount of scientific text, LLMs can capacity and performance of LLMs. In this part, we review
achieve impressive performance in scientific and reasoning the detailed data preprocessing strategies to improve the
tasks [148]. To construct the scientific corpus, existing efforts quality of the collected data [59, 69, 96]. A typical pipeline
mainly collect arXiv papers, scientific textbooks, math web- of preprocessing the pre-training data for LLMs has been
pages, and other related scientific resources. Due to the com- illustrated in Figure 3.
plex nature of data in scientific fields, such as mathematical
Quality Filtering. To remove low-quality data from the
symbols and protein sequences, specific tokenization and
collected corpus, existing work generally adopts two ap-
preprocessing techniques are usually required to transform
proaches: (1) classifier-based, and (2) heuristic-based. The
these different formats of data into a unified form that can
former approach trains a selection classifier based on high-
be processed by language models.
quality texts and leverages it to identify and filter out low-
• Code. Program synthesis has been widely studied in quality data. Typically, these methods [55, 56, 96] train a bi-
the research community [89, 149–152], especially the use of nary classifier with well-curated data (e.g., Wikipedia pages)
PLMs trained on code [134, 153]. However, it remains chal- as positive instances and sample candidate data as negative
lenging for these PLMs (e.g., GPT-J [134]) to generate high- instances, and predict the score that measures the quality
quality and accurate programs. Recent studies [89, 152] have of each data example. However, several studies [59, 96]
found that training LLMs on a vast code corpus can lead to also find that a classifier-based approach may result in the
a substantial improvement in the quality of the synthesized unintentional removal of high-quality texts in dialectal, col-
programs. The generated programs can successfully pass loquial, and sociolectal languages, which potentially leads
expert-designed unit-test cases [89] or solve competitive to bias in the pre-training corpus and diminishes the corpus
programming questions [98]. In general, two types of code diversity. As the second approach, several studies, such
corpora are commonly used for pre-training LLMs. The first as BLOOM [69] and Gopher [59], employ heuristic-based
source is from programming question answering communi- approaches to eliminate low-quality texts through a set of
ties like Stack Exchange [154, 155]. The second source is from well-designed rules, which can be summarized as follows:
public software repositories such as GitHub [77, 89, 152],
• Language based filtering. If a LLM would be mainly used
where code data (including comments and docstrings) are
in the tasks of certain languages, the text in other lan-
collected for utilization. Compared to natural language text,
guages can be filtered.
code is in the format of a programming language, corre-
sponding to long-range dependencies and accurate execu- • Metric based filtering. Evaluation metrics about the gener-
tion logic [156]. A recent study [47] also speculates that ated texts, e.g., perplexity, can be employed to detect and
training on code might be a source of complex reasoning remove unnatural sentences.
13

Ready to
Raw Corpus Quality Filtering De-duplication Privacy Reduction Tokenization
pre-train!

Language Filtering Sentence-level Detect Personality Reuse Existing


Document-level Identifiable Tokenizer
Metric Filtering
Information (PII) SentencePiece
Statistic Filtering Set-level
Remove PII Byte-level BPE
Keyword Filtering

Alice is writing a paper about Alice is writing a paper about Replace('Alice') is Encode('[Somebody] is 32, 145, 66, 79, 12, 56, ...
LLMs. #$^& Alice is writing LLMs. Alice is writing a paper writing a paper about LLMs. writing a paper about LLMs.')
a paper about LLMs. about LLMs.

Fig. 3. An illustration of a typical data preprocessing pipeline for pre-training large language models.

• Statistic based filtering. Statistical features of a corpus, existing tokenizer (e.g., OPT [81] and GPT-3 [55] utilize
e.g., the punctuation distribution, symbol-to-word ratio, the tokenizer of GPT-2 [26]), using a tokenizer specially
and sentence length, can be utilized to measure the text designed for the pre-training corpus can be highly benefi-
quality and filter the low-quality data. cial [69], especially for the corpus that consists of diverse
domains, languages, and formats. Therefore, several recent
• Keyword based filtering. Based on specific keyword set, the
LLMs train the customized tokenizers specially for the pre-
noisy or unuseful elements in the text, such as HTML
training corpus with SentencePiece [164]. The byte-level Byte
tags, hyperlinks, boilerplates, and offensive words, can
Pair Encoding (BPE) algorithm [165] is utilized to ensure that
be identified and removed.
the information after tokenization is lossless [56, 59]. While,
normalization techniques in BPE, such as NFKC [166], may
De-duplication. Existing work [158] has found that dupli-
degrade the tokenization performance [34, 59, 69].
cate data in a corpus would reduce the diversity of language
models, which may cause the training process to become un-
stable and thus affect the model performance. Therefore, it is 4.1.3 Effect of Pre-training Data on LLMs
necessary to de-duplicate the pre-training corpus. Specially, Unlike small-scale PLMs, it is usually infeasible to iterate
de-duplication can be performed at different granularities, the pre-training of LLMs multiple times, due to the huge
including sentence-level, document-level, and dataset-level demand for computational resources. Thus, it is particularly
de-duplication. First, low-quality sentences that contain re- important to construct a well-prepared pre-training corpus
peated words and phrases should be removed, as they may before training a LLM. In this part, we discuss how the qual-
introduce repetitive patterns in language modeling [159]. ity and distribution of the pre-training corpus potentially
At the document level, existing studies mostly rely on the influence the performance of LLMs.
overlap ratio of surface features (e.g., words and n-grams
overlap) between documents to detect and remove duplicate Mixture of Sources. As discussed before, pre-training data
documents containing similar contents [57, 59, 69, 160]. from different domains or scenarios has distinct linguistic
Furthermore, to avoid the dataset contamination problem, characteristics or semantic knowledge. By pre-training on a
it is also crucial to prevent the overlap between the training mixture of text data from diverse sources, LLMs can acquire
and evaluation sets [56], by removing the possible duplicate a broad scope of knowledge and may exhibit a strong
texts from the training set. It has been shown that the three generalization capacity. When mixing different sources, one
levels of de-duplication are useful to improve the training needs to carefully set the distribution of pre-training data,
of LLMs [56, 161], which should be jointly used in practice. since it is also likely to affect the performance of LLMs on
downstream tasks [59]. Gopher [59] conducts the ablation
Privacy Redaction. The majority of pre-training text data is experiment on data distribution to examine the impact of
obtained from web sources, including user-generated con- mixed sources on downstream tasks. Experimental results
tent involving sensitive or personal information, which may on the LAMBADA dataset [167] show that increasing the
increase the risk of privacy breaches [162]. Thus, it is nec- proportion of books data can improve the capacity of the
essary to remove the personally identifiable information (PII) model in capturing long-term dependencies from text, and
from the pre-training corpus. One direct and effective ap- increasing the proportion of the C4 dataset [73] leads to
proach is to employ rule-based methods, such as keyword performance improvement on the C4 validation dataset [59].
spotting, to detect and remove PII such as names, addresses, While, as a side effect, training on excessive data about a
and phone numbers [131]. Furthermore, researchers also certain domain would affect the generalization capability of
find that the vulnerability of LLMs under privacy attacks LLMs on other domains [35, 59]. Therefore, it is suggested
can be attributed to the presence of duplicate PII data in the that researchers should carefully determine the proportion
pre-training corpus [163]. Therefore, de-duplication can also of data from different domains in the pre-training corpus, in
reduce privacy risks to some extent. order to develop LLMs that better meet their specific needs.
The readers can refer to Figure 2 for a comparison of the
Tokenization. Tokenization is also a crucial step for data data sources for different LLMs.
preprocessing. It aims to segment raw text into sequences
of individual tokens, which are subsequently used as the Amount of Pre-training Data. For pre-training an effective
inputs of LLMs. Although it is expedient to leverage an LLM, it is important to collect sufficient high-quality data
14

that satisfies the data quantity demand of the LLM. Exist- the encoder and decoder, respectively. The encoder adopts
ing studies have found that with the increasing parameter stacked multi-head self-attention layers to encode the input
scale in the LLM, more data is also required to train the sequence for generating its latent representations, while
model [34, 57]: a similar scaling law as model size is also the decoder performs cross-attention on these representa-
observed in data size, with respect to model performance. tions and autoregressively generates the target sequence.
A recent study has shown that a number of existing LLMs Encoder-decoder PLMs (e.g., T5 [73] and BART [24]) have
suffer from sub-optimal training due to inadequate pre- shown effectiveness on a variety of NLP tasks. So far,
training data [34]. By conducting extensive experiments, it there are only a small number of LLMs that are built based
further demonstrates increasing the model size and data size on the encoder-decoder architecture, e.g., Flan-T5 [64]. We
in equal scales can lead to a more compute-efficient model leave a detailed discussion about the architecture selection
(i.e., the Chinchilla model), for a given compute budget. in Section 4.2.4.
More recently, LLaMA [57] shows that with more data
Causal Decoder Architecture. The causal decoder archi-
and longer training, smaller models can also achieve good
tecture incorporates the unidirectional attention mask, to
performance. Overall, it is suggested that researchers should
guarantee that each input token can only attend to the past
pay more attention to the amount of high-quality data for
tokens and itself. The input and output tokens are processed
adequately training the model, especially when scaling the
in the same fashion through the decoder. As representa-
model parameters.
tive language models of this architecture, the GPT-series
Quality of Pre-training Data. Existing work has shown models [26, 55, 105] are developed based on the causal-
that pre-training on the low-quality corpus, such as noisy, decoder architecture. In particular, GPT-3 [55] has success-
toxic, and duplicate data, may hurt the performance of fully demonstrated the effectiveness of this architecture, also
models [59, 158, 160, 163]. For developing a well-performing showing an amazing in-context learning capability of LLMs.
LLM, it is crucial to consider both the quantity ant the Interestingly, GPT-1 [105] and GPT-2 [26] do not exhibit such
quality of the collected training data. Recent studies, such superior abilities as those in GPT-3, and it seems that scaling
as T5 [73], GLaM [96], and Gopher [59], have investigated plays an important role in increasing the model capacity
the influence of data quality on the performance of down- of this model architecture. So far, the causal decoders have
stream tasks. By comparing the performance of models been widely adopted as the architecture of LLMs by var-
trained on the filtered and unfiltered corpus, they reach ious existing LLMs, such as OPT [81], BLOOM [69], and
the same conclusion that pre-training LLMs on cleaned Gopher [59]. Note that both the causal decoder and prefix
data can improve the performance. More specifically, the decoder discussed next belong to decoder-only architec-
duplication of data may result in “double descent” (referring tures. While, when mentioning “decoder-only architecture”,
to the phenomenon of performance initially deteriorating it mainly refers to the causal decoder architecture in existing
and subsequently improving) [158, 168], or even overwhelm literature, unless specified.
the training process [158]. Besides, it has been shown that
Prefix Decoder Architecture. The prefix decoder architec-
duplicate data degrades the ability of LLMs to copy from
ture (a.k.a., non-causal decoder [169]) revises the masking
the context, which might further affect the generalization
mechanism of causal decoders, to enable performing bidi-
capacity of LLMs using in-context learning [158]. Therefore,
rectional attention over the prefix tokens [170] and unidi-
as suggested in [56, 59, 69], it is essential to incorporate
rectional attention only on generated tokens. In this way,
preprocessing methods on the pre-training corpus carefully
like the encoder-decoder architecture, the prefix decoders
(as illustrated in Section 4.1.2), to improve stability of the
can bidirectionally encode the prefix sequence and autore-
training process and avoid affecting the model performance.
gressively predict the output tokens one by one, where the
same parameters are shared during encoding and decoding.
4.2 Architecture Instead of pre-training from scratch, a practical suggestion
In this section, we review the architecture design of LLMs, is to continually train causal decoders and then convert
i.e., mainstream architecture, pre-training objective, and de- them into prefix decoders for accelerating convergence [29],
tailed configuration. Table 3 presents the model cards of e.g., U-PaLM [102] is derived from PaLM [56]. Existing rep-
several representative LLMs with public details. resentative LLMs based on prefix decoders include GLM-
130B [83] and U-PaLM [102].
4.2.1 Mainstream Architectures For the three types of architectures, we can also consider
Due to the excellent parallelizability and capacity, the Trans- extending them via the mixture-of-experts (MoE) scaling, in
former architecture [22] has become the de facto backbone to which a subset of neural network weights for each input
develop various LLMs, making it possible to scale language are sparsely activated, e.g., Switch Transformer [25] and
models to hundreds or thousands of billions of parameters. GLaM [96]. It has been shown that substantial performance
In general, the mainstream architectures of existing LLMs improvement can be observed by increasing either the num-
can be roughly categorized into three major types, namely ber of experts or the total parameter size [171].
encoder-decoder, causal decoder, and prefix decoder, as 4.2.2 Detailed Configuration
shown in Figure 4.
Since the launch of Transformer [22], various improvements
Encoder-decoder Architecture. The vanilla Transformer have been proposed to enhance its training stability, per-
model is built on the encoder-decoder architecture [22], formance, and computational efficiency. In this part, we
which consists of two stacks of Transformer blocks as will discuss the corresponding configurations for four major
15

TABLE 3
Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding, #L denotes the number of layers, #H
denotes the number of attention heads, dmodel denotes the size of hidden states, and MCL denotes the maximum context length during training.

Model Category Size Normalization PE Activation Bias #L #H dmodel MCL


GPT3 [55] Causal decoder 175B Pre Layer Norm Learned GeLU X 96 96 12288 2048
PanGU- α [75] Causal decoder 207B Pre Layer Norm Learned GeLU X 64 128 16384 1024
OPT [81] Causal decoder 175B Pre Layer Norm Learned ReLU X 96 96 12288 2048
PaLM [56] Causal decoder 540B Pre Layer Norm RoPE SwiGLU × 118 48 18432 2048
BLOOM [69] Causal decoder 176B Pre Layer Norm ALiBi GeLU X 70 112 14336 2048
MT-NLG [97] Causal decoder 530B - - - - 105 128 20480 2048
Gopher [59] Causal decoder 280B Pre RMS Norm Relative - - 80 128 16384 2048
Chinchilla [34] Causal decoder 70B Pre RMS Norm Relative - - 80 64 8192 -
Galactica [35] Causal decoder 120B Pre Layer Norm Learned GeLU × 96 80 10240 2048
LaMDA [63] Causal decoder 137B - Relative GeGLU - 64 128 8192 -
Jurassic-1 [91] Causal decoder 178B Pre Layer Norm Learned GeLU X 76 96 13824 2048
LLaMA [57] Causal decoder 65B Pre RMS Norm RoPE SwiGLU X 80 64 8192 2048
GLM-130B [83] Prefix decoder 130B Post Deep Norm RoPE GeGLU X 70 96 12288 2048
T5 [73] Encoder-decoder 11B Pre RMS Norm Relative ReLU × 24 128 1024 512

Causal Decoder Prefix Decoder Encoder-Decoder

A
A

Encoder

Survey
Survey

Survey
Decoder

Decoder

of
of

of

Models Language Large


Models Language Large

Models Language Large

Decoder
A Survey of Large Language Models A Survey of Large Language Models A Survey of Large Language Models

Decoder Decoder Encoder Decoder

Fig. 4. A comparison of the attention patterns in three mainstream architectures. Here, the blue, green, yellow and grey rounded rectangles indicate
the attention between prefix tokens, attention between prefix and target tokens, attention between target tokens, and masked attention respectively.

parts of the Transformer, including normalization, position tion. In addition, adding an extra LN after the embedding
embeddings, activation functions, and attention and bias. layer can also stabilize the training of LLMs. However, it
To make this survey more self-contained, we present the tends to incur a significant performance drop [184], which
detailed formulations for these configurations in Table 4. has been removed in several recent LLMs [69].

Normalization. Training instability is a challenging issue Activation Functions. To obtain good performance, activa-
for pre-training LLMs. To alleviate this problem, layer nor- tion functions also need to be properly set in feed-forward
malization (Layer Norm, LN) [173] is widely employed in networks. In existing LLMs, GeLU activations [185] are
Transformer architectures. The position of LN is vital to the widely used. Besides, in the latest LLMs (e.g., PaLM and
performance of LLMs. While the initial Transformer [22] LaMDA), variants of GLU activation [179, 186] have also
uses post-LN, most LLMs employ pre-LN for more stable been utilized, especially the SwiGLU and GeGLU variants,
training in spite of decreasing performance [182]. Based which often achieve better performance in practice [183].
on pre-LN, Sandwich-LN [172] adds extra LN before the However, compared with GeLU, they require extra parame-
residual connections to avoid value explosion. However, ters (about 50%) in the feed-forward networks [184].
it has been found that Sandwich-LN sometimes fails to
stabilize the training of LLMs and may lead to the collapse Position Embeddings. Since the self-attention modules in
of training [83]. Recently, several advanced normalization Transformer are permutation equivariant, position embed-
techniques have been proposed as alternatives to LN. In dings are employed to inject absolute or relative position
Gopher [59] and Chinchilla [34], RMS Norm [174] is em- information for modeling sequences. There are two vari-
ployed due to its superiority in training speed and per- ants of absolute position embeddings in the vanilla Trans-
formance [183]. Compared with LN, DeepNorm [175] has former [22], i.e., sinusoids and learned position embeddings,
shown a better capability to ensure the stability in training, where the latter is commonly employed in LLMs. Unlike
which has been adopted by GLM-130B with post normaliza- absolute position embeddings, relative positional encodings
16

TABLE 4
Detailed formulations for the network configurations. Here, Sublayer denotes a FFN or a self-attention module in a Transformer layer, d denotes
the size of hidden states, pi denotes position embedding at position i, Aij denotes the attention score between a query and a key, ri−j denotes a
learnable scalar based on the offset between the query and the key, and Rθ,t denotes a rotary matrix with rotation degree t · θ.

Configuration Method Equation


Post Norm [22] Norm(x+Sulayerb(x))
Normalization position Pre Norm [26] x + Sublayer(Norm(x))
Sandwich Norm [172] x + Norm(Sublayer(Norm(x)))
q P
x−µ 1 Pd 1 d 2
LayerNorm [173] √
σ
·γ + β, µ= d
xi , σ =
i=1 d i=1 (xi − µ))
Normalization method x
q P
1 d 2
RMSNorm [174] RMS(x)
· γ, RMS(x) = d i=1 xi
DeepNorm [175] LayerNorm(α · x + Sublayer(x))
ReLU [176] ReLU(x) = max(x, 0)
√ Rx 2
GeLU [177] GeLU(x) = 0.5x ⊗ [1 + erf(x/ 2)], erf(x) = √2 e−t dt
π 0
Activation function
Swish [178] Swish(x) = x ⊗ sigmoid(x)
SwiGLU [179] SwiGLU(x1 , x2 ) = Swish(x1 ) ⊗ x2
GeGLU [179] GeGLU(x1 , x2 ) = GeLU(x1 ) ⊗ x2
Absolute [22] xi = xi + pi
Position embedding Relative [73] Aij = Wq xi xT T
j Wk + ri−j
RoPE [180] Aij = Wq xi Rθ,i−j xT
j Wk
T

Alibi [181] Aij = Wq xi Rθ,i−j xj WkT Aij = Wq xi xT


T T
j Wk − m(i − j)

generate embeddings according to the offsets between keys 4.2.3 Pre-training Tasks
and queries [73], so it can perform well on sequences Pre-training plays a key role that encodes general knowl-
longer than those it has seen during training, i.e., extrap- edge from large-scale corpus into the massive model param-
olation [181]. ALiBi [181] biases attention scores using a eters. For training LLMs, there are two commonly used pre-
penalty based on the distance between keys and queries. training tasks, namely language modeling and denoising
Empirical results have shown that it has better zero-shot autoencoding.
generalization with a stronger extrapolation capacity than
other position embeddings [29]. Besides, by setting specific Language Modeling. The language modeling task (LM) is
rotatory matrices based on the absolute position, the scores the most commonly used objective to pre-train decoder-only
between keys and queries in RoPE [180] can be computed LLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of
with relative position information, which is useful to model tokens x = {x1 , . . . , xn }, the LM task aims to autoregres-
long sequences. As a result, RoPE has been widely adopted sively predict the target tokens xi based on the preceding
in several latest LLMs [56, 57, 83] tokens x<i in a sequence. A general training objective is to
maximize the following likelihood:
n
X
Attention and Bias. Beyond the full self-attention in the LLM (x) = log P (xi |x<i ). (4)
original Transformer [22], sparse attention with lower com- i=1

putation complexity is employed in GPT-3 (i.e., Factorized Since most language tasks can be cast as the prediction
Attention [55, 187]). In order to effectively and efficiently problem based on the input, these decoder-only LLMs might
model longer sequences, more attempts have been made by be potentially advantageous to implicitly learn how to ac-
either introducing special attention patterns [188, 189] or complish these tasks in a unified LM way. Some studies
considering GPU memory access (i.e., FlashAttention [190]). have also revealed that decoder-only LLMs can be naturally
Besides, following the original Transformer, most LLMs transferred to certain tasks by autoregressively predicting
keep the biases in each dense kernel and Layer Norm. How- the next tokens [26, 55], without fine-tuning. An important
ever, in PaLM [56] and Galactica [35], biases are removed. variant of LM is the prefix language modeling task, which is
It demonstrates that no biases can enhance training stability designed for pre-training models with the prefix decoder
for LLMs [56]. architecture. The tokens within a randomly selected prefix
would not be used in computing the loss of prefix language
modeling. With the same amount of tokens seen during pre-
To put all these discussions together, we summarize the training, prefix language modeling performs slightly worse
suggestions from existing literature for detailed configura- than language modeling, since fewer tokens in the sequence
tion. For stronger generalization and training stability, it is are involved for model pre-training [29].
suggested to choose the pre RMS Norm for layer normal-
ization, and SwiGLU or GeGLU as the activation function. Denoising Autoencoding. Besides conventional LM, the
While, LN may not be used immediately after embedding denoising autoencoding task (DAE) has also been widely
layers, which is likely to incur performance degradation. used to pre-train language models [24, 73]. The inputs x\x̃
Besides, as for position embeddings, RoPE or ALiBi is a for DAE task are corrupted text with randomly replaced
better choice since it performs better on long sequences. spans. Then, the language models are trained to recover the
17

replaced tokens x̃. Formally, the training objective of DAE reaching a million scale. Specifically, the batch size of GPT-3
is denoted as follows: is gradually increasing from 32K to 3.2M tokens. Empirical
results have demonstrated that the dynamic schedule of
LDAE (x) = log P (x̃|x\x̃ ). (5) batch size can effectively stabilize the training process of
However, the DAE task seems to be more complicated LLMs [56].
in implementation than LM task. As a result, it has not
Learning Rate. Existing LLMs usually adopt a similar learn-
been widely used to pre-train large language models. Exist-
ing rate schedule with the warm-up and decay strategies
ing LLMs that take DAE as pre-training objectives include
during pre-training. Specifically, in the initial 0.1% to 0.5%
T5 [73] and GLM-130B [83]. These models are mainly trained
of the training steps, a linear warm-up schedule is employed
to recover the replaced spans in an autoregressive way.
for gradually increasing the learning rate to the maximum
value that ranges from approximately 5 × 10−5 to 1 × 10−4
4.2.4 Summary and Discussion
(e.g., 6 × 10−5 for GPT-3). Then, a cosine decay strategy
The choice of architecture and pre-training tasks may incur is adopted in the subsequent steps, gradually reducing the
different inductive biases for LLMs, which would lead to learning rate to approximately 10% of its maximum value,
different model capacities. In this part, we summarize some until the convergence of the training loss.
important findings or discussions in the existing literature
on this issue. Optimizer. The Adam optimizer [191] and AdamW opti-
• By pre-training with the LM objective, it seems that mizer [192] are widely utilized for training LLMs (e.g., GPT-
causal decoder architecture can achieve a more superior 3), which are based on adaptive estimates of lower-order
zero-shot and few-shot generalization capacity. Existing moments for first-order gradient-based optimization. Com-
research has shown that without multi-task fine-tuning, monly, its hyper-parameters are set as follows: β1 = 0.9,
the causal decoder has better zero-shot performance than β2 = 0.95 and  = 10−8 . Meanwhile, the Adafactor op-
other architectures [29]. The success of GPT-3 [55] has timizer [193] has also been utilized in training LLMs (e.g.,
demonstrated that the large causal decoder model can be PaLM and T5), which is a variant of the Adam optimizer
a good few-shot learner. In addition, instruction tuning and specially designed for conserving GPU memory during
alignment tuning discussed in Section 5 have been proven training. The hyper-parameters of the Adafactor optimizer
to further enhance the capability of large causal decoder are set as: β1 = 0.9 and β2 = 1.0 − k −0.8 , where k denotes
models [61, 62, 64]. the number of training steps.
• Scaling law has been widely observed in causal de-
coders. By scaling the model size, the dataset size, and Stabilizing the Training. During the pre-training of LLMs,
the total computation, the performance of causal decoders it often suffers from the training instability issue, which
can be substantially improved [30, 55]. Thus, it has become may cause the model collapse. To address this issue, weight
an important strategy to increase the model capacity of decay and gradient clipping have been widely utilized,
the causal decoder via scaling. However, more detailed where existing studies [55, 69, 81, 83, 97] commonly set
investigation on encoder-decoder models is still lacking, and the threshold of gradient clipping to 1.0 and weight decay
more efforts are needed to investigate the performance of rate to 0.1. However, with the scaling of LLMs, the training
encoder-decoder models at a large scale. loss spike is also more likely to occur, leading to unstable
More research efforts about the discussions on archi- training. To mitigate this problem, PaLM [56] and OPT [81]
tectures and pre-training objectives are in need to analyze use a simple strategy that restarts the training process from
how the choices of the architecture and pre-training tasks an earlier checkpoint before the occurrence of the spike and
affect the capacity of LLMs, especially for encoder-decoder skips over the data that may have caused the problem.
architectures. Besides the major architecture, the detailed Further, GLM [83] finds that the abnormal gradients of the
configuration of LLM is also worth attention, which has embedding layer usually lead to spikes, and proposes to
been discussed in Section 4.2.2. shrink the embedding layer gradients to alleviate it.

4.3 Model Training 4.3.2 Scalable Training Techniques


In this part, we review the important settings, techniques, As the model and data sizes increase, it has become chal-
or tricks for training LLMs. lenging to efficiently train LLMs under a limited compu-
tational resource. Especially, two primary technical issues
4.3.1 Optimization Setting are required to be resolved, i.e., increasing training through-
For parameter optimization of LLMs, we present the com- put and loading larger models into GPU memory. In this
monly used settings for batch training, learning rate, opti- part, we review several widely used approaches in existing
mizer, and training stability. work to address the above two challenges, namely 3D
parallelism [66, 194, 195], ZeRO [196], and mixed precision
Batch Training. For language model pre-training, existing training [197], and also give general suggestions about how
work generally sets the batch size to a large number (e.g., to utilize them for training.
8,196 examples or 16M tokens) to improve the training
stability and throughput. For LLMs such as GPT-3 and 3D Parallelism. 3D parallelism is actually a combination of
PaLM, they have introduced a new strategy that dynam- three commonly used parallel training techniques, namely
ically increases the batch size during training, ultimately data parallelism, pipeline parallelism [194, 195], and tensor
18

TABLE 5
Detailed optimization settings of several existing LLMs.

Batch Size Learning Precision Weight Grad


Model Warmup Decay Method Optimizer Dropout
(#tokens) Rate Type Decay Clip
GPT3 (175B) 32K→3.2M 6 × 10−5 yes cosine decay to 10% Adam FP16 0.1 1.0 -
PanGu-α (200B) - 2 × 10−5 - - Adam - 0.1 - -
OPT (175B) 2M 1.2 × 10−4 yes manual decay AdamW FP16 0.1 - 0.1
PaLM (540B) 1M→4M 1 × 10−2 no inverse square root Adafactor BF16 lr2 1.0 0.1
BLOOM (176B) 4M 6 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 0.0
MT-NLG (530B) 64 K→3.75M 5 × 10−5 yes cosine decay to 10% Adam BF16 0.1 1.0 -
Gopher (280B) 3M→6M 4 × 10−5 yes cosine decay to 10% Adam BF16 - 1.0 -
Chinchilla (70B) 1.5M→3M 1 × 10−4 yes cosine decay to 10% AdamW BF16 - - -
Galactica (120B) 2M 7 × 10−6 yes linear decay to 10% AdamW - 0.1 1.0 0.1
LaMDA (137B) 256K - - - - BF16 - - -
Jurassic-1 (178B) 32 K→3.2M 6 × 10−5 yes - - - - - -
LLaMA (65B) 4M 1.5 × 10−4 yes cosine decay to 10% AdamW - 0.1 1.0 -
GLM (130B) 0.4M→8.25M 8 × 10−5 yes cosine decay to 10% AdamW FP16 0.1 1.0 0.1
T5 (11B) 64K 1 × 10−2 no inverse square root AdaFactor - - - 0.1
ERNIE 3.0 Titan (260B) - 1 × 10−4 - - Adam FP16 0.1 1.0 -
PanGu-Σ (1.085T) 0.5M 2 × 10−5 yes - Adam FP16 - - -

parallelism [66]19 . We next introduce the three parallel train- into two submatrices, A1 and A2 , by column, which can be
ing techniques. expressed as Y = [XA1 , XA2 ]. By placing matrices A1 and
• Data parallelism. Data parallelism is one of the most A2 on different GPUs, the matrix multiplication operation
fundamental approaches to improving the training through- would be invoked at two GPUs in parallel, and the final
put. It replicates the model parameters and optimizer states result can be obtained by combining the outputs from the
across multiple GPUs and then distributes the whole train- two GPUs through across-GPU communication. Currently,
ing corpus into these GPUs. In this way, each GPU only tensor parallelism has been supported in several open-
needs to process the assigned data for it, and performs source libraries, e.g., Megatron-LM [66], and can be extended
the forward and backward propagation to obtain the gra- to higher-dimensional tensors. Besides, Colossal-AI has also
dients. The computed gradients on different GPUs will be implemented tensor parallelism for higher-dimensional ten-
further aggregated to obtain the gradients of the entire batch sors [198–200] and proposed sequence parallelism [201]
for updating the models in all GPUs. In this way, as the especially for sequence data, which can further decompose
calculations of gradients are independently performed on the attention operation of the Transformer model.
different GPUs, the data parallelism mechanism is highly
scalable, enabling the way that increases the number of ZeRO. ZeRO [196] technique, proposed by the Deep-
GPUs to improve training throughput. Furthermore, this Speed [65] library, focuses on the issue of memory re-
technique is simple in implementation, and most of existing dundancy in data parallelism. As mentioned before, data
popular deep learning libraries have already implemented parallelism requires each GPU to store the same copy of
data parallelism, such as TensorFlow and PyTorch. a LLM, including model parameters, model gradients, and
• Pipeline parallelism. Pipeline parallelism aims to dis- optimizer parameters. Whereas, not all of the above data is
tribute the different layers of a LLM into multiple GPUs. necessary to be retained on each GPU, which would cause
Especially, in the case of a Transformer model, pipeline a memory redundancy problem. To resolve it, the ZeRO
parallelism loads consecutive layers onto the same GPU, to technique aims to retain only a fraction of data on each
reduce the cost of transmitting the computed hidden states GPU, while the rest data can be retrieved from other GPUs
or gradients between GPUs. However, a naive implemen- when required. Specifically, ZeRO provides three solutions,
tation of pipeline parallelism may result in a lower GPU depending on how the three parts of the data are stored,
utilization rate as each GPU has to wait for the previous namely optimizer state partitioning, gradient partitioning,
one to complete the computation, leading to the unneces- and parameter partitioning. Empirical results indicate that
sary cost of bubbles overhead [194]. To reduce these bubbles the first two solutions do not increase the communication
in pipeline parallelism, GPipe [194] and PipeDream [195] overhead, and the third solution increases about 50% com-
propose the techniques of padding multiple batches of data munication overhead but saves memory proportional to
and asynchronous gradient update to improve the pipeline the number of GPUs. PyTorch has implemented a similar
efficiency. technique as ZeRO, called FSDP [202].
• Tensor parallelism. Tensor parallelism is also a com-
monly used technique that aims to decompose the LLM for Mixed Precision Training. In previous PLMs (e.g.,
multi-GPU loading. Unlike pipeline parallelism, tensor par- BERT [23]), 32-bit floating-point numbers, also known as
allelism focuses on decomposing the tensors (the parameter FP32, have been predominantly used for pre-training. In
matrices) of LLMs. For a matrix multiplication operation recent years, to pre-train extremely large language models,
Y = XA in the LLM, the parameter matrix A can be split some studies [197] have started to utilize 16-bit floating-
point numbers (FP16), which reduces memory usage and
19. Model parallelism is a more broader term that includes tensor communication overhead. Additionally, as popular NVIDIA
parallelism and pipeline parallelism in some work [66]. GPUs (e.g., A100) have twice the amount of FP16 computa-
19

tion units as FP32, the computational efficiency of FP16 can to specific goals. In this section, we introduce two major ap-
be further improved. However, existing work has found that proaches to adapting pre-trained LLMs, namely instruction
FP16 may lead to the loss of computational accuracy [59, 69], tuning and alignment tuning. The former approach mainly
which affects the final model performance. To alleviate it, an aims to enhance (or unlock) the abilities of LLMs, while the
alternative called Brain Floating Point (BF16) has been used latter approach aims to align the behaviors of LLMs with
for training, which allocates more exponent bits and fewer human values or preferences. Further, we will also discuss
significant bits than FP16. For pre-training, BF16 generally efficient tuning for rapid model adaptation. In what follows,
performs better than FP16 on representation accuracy [69]. we will introduce the three parts in detail.
Overall Training Suggestion. In practice, the above train-
TABLE 6
ing techniques, especially 3D parallelism, are often jointly A detailed list of available task collections for instruction tuning. Note
used to improve the training throughput and large model that OIG is a large collection consisting of existing collections.
loading. For instance, researchers have incorporated 8-way
data parallelism, 4-way tensor parallelism, and 12-way Collections Time #Task types #Tasks #Examples
pipeline parallelism, enabling the training of BLOOM [69] Nat. Inst. [208] Apr-2021 6 61 193K
on 384 A100 GPUs. Currently, open-source libraries like CrossFit [209] Apr-2021 13 160 7.1M
DeepSpeed [65], Colossal-AI [137], and Alpa [203] can well FLAN [62] Sep-2021 12 62 4.4M
support the three parallel training methods. To reduce the P3 [210] Oct-2021 13 267 12.1M
ExMix [211] Nov-2021 11 107 18M
memory redundancy, ZeRO, FSDP, and activation recompu- UnifiedSKG [212] Jan-2022 6 21 812K
tation techniques [68, 204] can be also employed for training Super Nat. Inst. [79] Apr-2022 76 1616 5M
LLMs, which have already been integrated into DeepSpeed, MVPCorpus [213] Jun-2022 11 77 41M
xP3 [84] Nov-2022 17 85 81M
PyTorch, and Megatron-LM. Besides, the mixed precision
OIG23 Mar-2023 - - 43M
training technique such as BF16 can be also leveraged to
improve the training efficiency and reduce GPU memory
usage, while it requires necessary support on hardware
(e.g., A100 GPU). Because training large models is a time- 5.1 Instruction Tuning
intensive process, it would be useful to forecast the model In essence, instruction tuning is the approach to fine-tuning
performance and detect abnormal issues at an early stage. pre-trained LLMs on a collection of formatted instances in
For this purpose, GPT-4 [46] has recently introduced a the form of natural language [62], which is highly related
new mechanism called predictable scaling built on a deep to supervised fine-tuning [61] and multi-task prompted
learning stack, enabling the performance prediction of large training [28]. In order to perform instruction tuning, we first
models with a much smaller model, which might be quite need to collect or construct instruction-formatted instances.
useful for developing LLMs. In practice, one can further Then, we employ these formatted instances to fine-tune
leverage the supporting training techniques of mainstream LLMs in a supervised learning way (e.g., training with the
deep learning frameworks. For instance, PyTorch supports sequence-to-sequence loss). After instruction tuning, LLMs
the data parallel training algorithm FSDP [202] (i.e., fully can demonstrate superior abilities to generalize to unseen
sharded data parallel), which allows for partial offloading tasks [28, 62, 64], even in a multilingual setting [84].
of training computations to CPUs if desired. A recent survey [214] presents a systematic overview
Besides the above training strategies, it is also important of the research on instruction tuning. In comparison to
to improve the inference speed for using LLMs. Typically, that, we mainly focus on the effect of instruction tuning
quantization techniques are widely used to reduce both on LLMs and provide detailed guidelines or strategies for
the time and space costs of LLMs during the inference instance collection and tuning. Besides, we also discuss the
stage [205]. With some loss in model performance, quan- use of instruction tuning for satisfying the real needs of
tized language models have smaller model sizes and can users, which has been widely applied in existing LLMs, e.g.,
achieve faster inference speed [83, 206, 207]. For model InstructGPT [61] and GPT-4 [46].
quantization, a popular choice is INT8-quantization [206].
Further, some research work attempts to develop more 5.1.1 Formatted Instance Construction
aggressive INT4-quantization methods [83]. Recently, quan- Generally, an instruction-formatted instance consists of a
tized model copies of several publicly available language task description (called an instruction), an input-output pair,
models have been released on Hugging Face, including and a small number of demonstrations (optional). As impor-
BLOOM20 , GPT-J21 , and ChatGLM22 . tant public resources, existing studies have released a large
5 ADAPTATION TUNING OF LLMS

After pre-training, LLMs can acquire the general abilities for solving various tasks. However, increasing studies have shown that LLMs' abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning (Section 5.1) and alignment tuning (Section 5.2), as well as efficient tuning methods for lightweight model adaptation (Section 5.3).

5.1 Instruction Tuning

In essence, instruction tuning is the approach to fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [62], which is highly related to supervised fine-tuning [61] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 62, 64], even in a multilingual setting [84].

A recent survey [214] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. Besides, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [61] and GPT-4 [46].

5.1.1 Formatted Instance Construction

Generally, an instruction-formatted instance consists of a task description (called an instruction), an input-output pair, and a small number of demonstrations (optional). As important public resources, existing studies have released a large number of labeled datasets formatted in natural language (see the list of available resources in Table 6). Next, we introduce two major methods for constructing formatted instances (see an illustration in Figure 5) and then discuss several key factors for instance construction.

[Figure 5 omitted: panel (a) shows the instance format (task description, optional demonstrations, input, output); panel (b) shows formatting existing NLP datasets (e.g., adding the task description "Please answer this question:" or "Please translate the French to English:"); panel (c) shows formatting human needs (e.g., API-collected user queries paired with human-written answers).]

Fig. 5. An illustration of instance formatting and two different methods for constructing the instruction-formatted instances.

Formatting Existing Datasets. Before instruction tuning was proposed, several early studies [211, 213, 215, 216] collected instances from a diverse range of tasks (e.g., text summarization, text classification, and translation) to create supervised multi-task training datasets. As a major source of instruction tuning instances, it is convenient to format these multi-task training datasets with natural language task descriptions. Specifically, recent work [28, 61, 62, 79] augments the labeled datasets with human-written task descriptions, which instruct LLMs to understand the tasks by explaining the task goal. For example, in Figure 5(b), a task description "Please answer this question:" is added for each example in the question-answering task. After instruction tuning, LLMs can generalize well to other unseen tasks by following their task descriptions [28, 62, 64]. In particular, instructions have been shown to be the crucial factor in the task generalization ability of LLMs [62]: fine-tuning the model on labeled datasets with the task descriptions removed results in a dramatic drop in model performance. To better generate labeled instances for instruction tuning, a crowd-sourcing platform, PromptSource [210], has been proposed to effectively create, share, and verify the task descriptions for different datasets. To enrich the training instances, several studies [28, 213, 217] also try to invert the input-output pairs of existing instances with specially designed task descriptions for instruction tuning. For instance, given a question-answer pair, we can create a new instance by predicting the answer-conditioned question (e.g., "Please generate a question based on the answer:"). Besides, some work [218] also leverages heuristic task templates to convert massive unlabeled texts into labeled instances.
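As a minimal illustration of the instance format described above, the Python sketch below turns a plain question-answer example into an instruction-formatted training instance with an optional set of demonstrations. All function and field names are illustrative, not taken from any particular dataset toolkit.

```python
# A minimal sketch of turning a plain QA example into an instruction-formatted
# instance (task description, optional demonstrations, input, output).
def format_instance(task_description, example, demonstrations=()):
    """Render one instruction-formatted training instance as a (prompt, target) pair."""
    parts = [task_description]
    # Optional demonstrations: formatted input-output pairs shown before the query.
    for demo_in, demo_out in demonstrations:
        parts.append(f"Q: {demo_in}\nA: {demo_out}")
    parts.append(f"Q: {example['input']}\nA:")
    # The target output is what the LLM is fine-tuned to generate.
    return "\n\n".join(parts), example["output"]

prompt, target = format_instance(
    "Please answer this question:",
    {"input": "Where is the capital of France?", "output": "Paris."},
)
print(prompt)   # task description + query
print(target)   # expected completion: "Paris."
```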
Formatting Human Needs. Despite the fact that a large number of training instances have been formatted with instructions, they mainly come from public NLP datasets, either lacking instruction diversity or mismatching real human needs [61]. To overcome this issue, InstructGPT [61] proposes to take the queries that real users have submitted to the OpenAI API as the task descriptions. User queries are expressed in natural language, which is particularly suitable for eliciting the instruction-following ability of LLMs. Additionally, to enrich the task diversity, human labelers are also asked to compose the instructions for real-life tasks, including open-ended generation, open question answering, brainstorming, and chatting. Then, they let another group of labelers directly answer these instructions as the output. Finally, they pair one instruction (i.e., the collected user query) and the expected output (i.e., the human-written answer) as a training instance. Note that InstructGPT also employs these real-world tasks formatted in natural language for alignment tuning (discussed in Section 5.2). Further, GPT-4 [46] has designed potentially high-risk instructions and guided the model to reject them through supervised fine-tuning for safety concerns. Besides, to reduce the burden of human annotation, several semi-automated approaches [219–221] have also been proposed for constructing instances by feeding existing instances into LLMs to generate diverse task descriptions and instances.

Key Factors for Instance Construction. The quality of instruction instances has an important impact on the performance of the model. Here, we discuss some essential factors for instance construction.

• Scaling the instructions. It has been widely shown that scaling the number of tasks can largely enhance the generalization ability of LLMs [28, 62, 79]. As the task number increases, the model performance initially shows a continuous growth pattern, while the gain becomes negligible when it reaches a certain level [64, 79]. A plausible speculation is that a certain number of representative tasks can provide relatively sufficient knowledge, and adding more tasks may not bring additional gains [64]. Besides, it is also beneficial to enhance the diversity of the task descriptions in several aspects, such as length, structure, and creativity [28]. As for the number of instances per task, it has been found that a small number of instances can usually saturate the generalization performance of the model [62, 64]. In contrast, increasing the number of instances for some tasks to a large number (e.g., a few hundred) could potentially cause overfitting and impair the model performance [79].

• Formatting design. As an important factor, the design of the natural language format also highly impacts the generalization performance of LLMs [79]. Typically, we can add task descriptions and optional demonstrations to the input-output pairs of existing datasets, where the task description is the most key part for LLMs to understand the task [79]. Further, using an appropriate number of exemplars as demonstrations can lead to substantial improvements [64], and also alleviates the model's sensitivity to instruction engineering [62, 64]. However, incorporating other components (e.g., things to avoid, reasons, and suggestions) into instructions may have a negligible or even adverse effect on the performance of LLMs [79, 208]. Recently, to elicit the step-by-step reasoning ability of LLMs, some work [64] proposes to include chain-of-thought (CoT) examples for some reasoning datasets, such as arithmetic reasoning. It has been shown that fine-tuning LLMs with both CoT and non-CoT examples can lead to good performance across various reasoning tasks, including those that require multi-hop reasoning ability (e.g., commonsense question answering and arithmetic reasoning) as well as those without the need for such a reasoning way (e.g., sentiment analysis and extractive question answering) [64, 85].

To summarize, it seems that the diversity of instructions is more important than the number of instances, since the well-performing InstructGPT [61] and Alpaca [221] utilize fewer but more diverse instructions (or instances) than the Flan-series LLMs [62, 64]. Further, it is more useful to invite labelers to compose human-need tasks than to use dataset-specific tasks. However, there is still a lack of guidelines for annotating human-need instances, making the task composition somewhat heuristic. To reduce human efforts, we can either reuse existing formatted datasets (Table 6) or automatically construct the instructions using existing LLMs [219].

5.1.2 Instruction Tuning Strategies

Unlike pre-training, instruction tuning is often more efficient since only a moderate number of instances are used for training. Since instruction tuning can be considered as a supervised training process, its optimization differs from pre-training in several aspects [64], such as the training objective (i.e., sequence-to-sequence loss) and optimization configuration (e.g., smaller batch size and learning rate), which require special attention in practice. In addition to these optimization configurations, there are also two important aspects to consider for instruction tuning:

Balancing the Data Distribution. Since instruction tuning involves a mixture of different tasks, it is important to balance the proportion of different tasks during fine-tuning. A widely used method is the examples-proportional mixing strategy [73], i.e., combining all the datasets and sampling each instance equally from the mixed datasets. Furthermore, increasing the sampling ratio of high-quality collections (e.g., FLAN [62] and P3 [210]) can generally lead to performance improvement according to recent findings [64, 85]. Meanwhile, it is common to set a maximum cap to control the maximum number of examples that a dataset can contribute during instruction tuning [73], which is set to prevent larger datasets from overwhelming the entire distribution [73, 85]. In practice, the maximum cap is typically set to several thousands or tens of thousands, depending on the dataset [62, 64] (a sketch of this mixing scheme is given below).
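The following Python sketch illustrates examples-proportional mixing with a maximum cap, as just described. The dataset names and sizes are purely illustrative, and the cap value is only one plausible setting.

```python
# A hedged sketch of examples-proportional mixing with a maximum cap [73]:
# each dataset contributes in proportion to its size, but no dataset counts
# for more than `max_cap` examples when the mixing weights are computed.
import random

def mixing_weights(dataset_sizes, max_cap=10_000):
    capped = {name: min(size, max_cap) for name, size in dataset_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}

sizes = {"translation": 500_000, "summarization": 40_000, "qa": 8_000}
weights = mixing_weights(sizes)

# Sample the dataset that each training instance is drawn from.
names, probs = zip(*weights.items())
batch_sources = random.choices(names, weights=probs, k=8)
print(weights)        # large datasets no longer dominate the mixture
print(batch_sources)
```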
Combining Instruction Tuning and Pre-Training. To make the tuning process more effective and stable, OPT-IML [85] incorporates pre-training data during instruction tuning, which can be regarded as regularization for model tuning. Further, instead of using a separate two-stage process (pre-training then instruction tuning), some studies attempt to train a model from scratch with a mixture of pre-training data (i.e., plain texts) and instruction tuning data (i.e., formatted datasets) using multi-task learning [73, 211]. Specifically, GLM-130B [83] and Galactica [35] integrate instruction-formatted datasets as a small proportion of the pre-training corpora to pre-train LLMs, which potentially achieves the advantages of pre-training and instruction tuning at the same time.

5.1.3 The Effect of Instruction Tuning

In this part, we discuss the effect of instruction tuning on LLMs in two major aspects.

Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [64]. Recent studies have experimented with language models at multiple scales (ranging from 77M to 540B), showing that models of different scales can all benefit from instruction tuning [64, 217], yielding improved performance as the parameter scale increases [84]. Further, smaller models with instruction tuning can even perform better than larger models without fine-tuning [28, 64]. Besides the model scale, instruction tuning demonstrates consistent improvements across various model architectures, pre-training objectives, and model adaptation methods [64]. In practice, instruction tuning offers a general approach to enhancing the abilities of existing language models [64] (including small-sized PLMs). Besides, it is also much less costly than pre-training, since the amount of instruction data required by LLMs is significantly smaller than pre-training data.

Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered as an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [64]. A large number of studies have confirmed the effectiveness of instruction tuning in achieving superior performance on both seen and unseen tasks [85, 217]. Besides, instruction tuning has been shown to be useful in alleviating several weaknesses of LLMs (e.g., repetitive generation or complementing the input without accomplishing a certain task) [61, 64], leading to a superior capacity to solve real-world tasks for LLMs. Furthermore, LLMs trained with instruction tuning can generalize to related tasks across languages. For example, BLOOMZ-P3 [84] is fine-tuned based on BLOOM [69] using the English-only task collection P3 [210]. Interestingly, BLOOMZ-P3 can achieve a more than 50% improvement in multilingual sentence completion tasks compared to BLOOM, which shows that instruction tuning can help LLMs acquire general task skills from English-only datasets and transfer such skills into other languages [84]. In addition, it has been found that using English-only instructions can produce satisfactory results on multilingual tasks [84], which helps reduce the effort of instruction engineering for a specific language.

5.2 Alignment Tuning

This part first presents the background of alignment with its definition and criteria, then focuses on the collection of human feedback data for aligning LLMs, and finally discusses the key technique of reinforcement learning from human feedback for alignment tuning.

5.2.1 Background and Criteria for Alignment

Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 62, 81]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [61, 222]. For LLMs, the language modeling objective pre-trains the model parameters by word prediction while lacking the consideration of human values or preferences. To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [61, 100]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called the alignment tax in related literature [61, 223, 224].

Alignment Criteria. Recently, there has been increasing attention on developing multifarious criteria to regulate the behaviors of LLMs. Here, we take three representative alignment criteria (i.e., helpful, honest, and harmless) as examples for discussion, which have been widely adopted in existing literature [61, 222, 223]. Besides, there are other alignment criteria for LLMs from different perspectives, including behavior, intent, incentive, and inner aspects [222], which are essentially similar (or at least rely on similar alignment techniques) to the above three criteria. It is also feasible to modify the three criteria according to specific needs, e.g., substituting honesty with correctness [100] or focusing on some specified criteria [224]. Next, we give brief explanations of the three representative alignment criteria:

• Helpfulness. To be helpful, the LLM should demonstrate a clear attempt to assist users in solving their tasks or answering questions in as concise and efficient a manner as possible. At a higher level, when further clarification is needed, the LLM should demonstrate the capability of eliciting additional relevant information through pertinent inquiries and exhibit suitable levels of sensitivity, perceptiveness, and prudence [223]. Realizing the alignment of helpful behavior is challenging for LLMs since it is difficult to precisely define and measure the intention of users [222].

• Honesty. At a basic level, an LLM aligned to be honest should present accurate content to users instead of fabricating information. Additionally, it is crucial for the LLM to convey appropriate degrees of uncertainty in its output, in order to avoid any form of deception or misrepresentation of information. This requires the model to know about its capabilities and levels of knowledge (e.g., "known unknowns"). According to the discussion in [223], honesty is a more objective criterion compared to helpfulness and harmlessness, hence honesty alignment could potentially be developed with less reliance on human efforts.

• Harmlessness. To be harmless, it is required that the language produced by the model should not be offensive or discriminatory. To the best of its abilities, the model should be capable of detecting covert endeavors aimed at soliciting requests for malicious purposes. Ideally, when the model is induced to conduct a dangerous action (e.g., committing a crime), the LLM should politely refuse. Nonetheless, what behaviors are deemed harmful and to what extent vary amongst individuals or societies [223], depending highly on who is using the LLM, the type of the posed question, and the context (e.g., time) at which the LLM is being used.

As we can see, these criteria are quite subjective and are developed based on human cognition. Thus, it is difficult to directly formulate them as optimization objectives for LLMs. In existing work, there are many ways to fulfill these criteria when aligning LLMs. A promising technique is red teaming [115, 225], which involves using manual or automated means to probe LLMs in an adversarial way to generate harmful outputs and then updating LLMs to prevent such outputs.

5.2.2 Collecting Human Feedback

During the pre-training stage, LLMs are trained using the language modeling objective on a large-scale corpus. However, this objective cannot take into account the subjective and qualitative evaluations of LLM outputs by humans (called human feedback in this survey). High-quality human feedback is extremely important for aligning LLMs with human preferences and values. In this part, we discuss how to select a team of human labelers for feedback data collection.

Human Labeler Selection. In existing work, the dominant method for generating human feedback data is human annotation [61, 100, 226]. This highlights the critical role of selecting appropriate human labelers. To provide high-quality feedback, human labelers are supposed to have a qualified level of education and excellent proficiency in English. For example, Sparrow [100] requires human labelers to be UK-based native English speakers who have obtained at least an undergraduate-level educational qualification. Further, in [224], about half of the human labelers for high-priority tasks were recruited from the US-based Amazon Mechanical Turk workforce with a master's qualification. Even then, several studies [112, 226] have found that there still exists a mismatch between the intentions of researchers and human labelers, which may lead to low-quality human feedback and cause LLMs to produce unexpected outputs. To address this issue, InstructGPT [61] further conducts a screening process to filter labelers by assessing the agreement between human labelers and researchers. Specifically, researchers first label a small amount of data and then measure the agreement between themselves and human labelers. The labelers with the highest agreement will be selected to proceed with the subsequent annotation work (a minimal sketch of such agreement-based screening is given below). In some other work [227], "super raters" are used to ensure the high quality of human feedback. Researchers evaluate the performance of human labelers and select a group of well-performing human labelers (e.g., with high agreement) as super raters. The super raters will be given priority to collaborate with the researchers in the subsequent studies. When human labelers annotate the output of LLMs, it is helpful to specify detailed instructions and provide instant guidance for them [112], which can further regulate their annotation.
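The following Python sketch illustrates the agreement-based screening idea described above: researchers label a small calibration set, and only candidate labelers whose choices agree with the researcher labels often enough proceed to full annotation. The threshold and all names are illustrative assumptions, not values reported in [61].

```python
# A minimal sketch of agreement-based labeler screening (illustrative only).
def agreement_rate(researcher_labels, labeler_labels):
    matches = sum(r == l for r, l in zip(researcher_labels, labeler_labels))
    return matches / len(researcher_labels)

def screen_labelers(researcher_labels, candidates, threshold=0.75):
    """Keep candidate labelers whose agreement with researchers passes a threshold."""
    return [
        name for name, labels in candidates.items()
        if agreement_rate(researcher_labels, labels) >= threshold
    ]

gold = ["A", "B", "A", "A", "B"]
candidates = {"labeler_1": ["A", "B", "A", "B", "B"],   # 0.8 agreement
              "labeler_2": ["B", "B", "B", "A", "A"]}   # 0.2 agreement
print(screen_labelers(gold, candidates))  # ['labeler_1']
```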

Human Feedback Collection. In existing work, there are mainly three kinds of approaches to collecting feedback and preference data from human labelers.

• Ranking-based approach. In early work [226, 228], human labelers often evaluated model-generated outputs in a coarse-grained manner (i.e., only selecting the best) without taking into account more fine-grained alignment criteria. Nonetheless, different labelers may hold diverse opinions on the selection of the best candidate output, and this method disregards the unselected samples, which may lead to inaccurate or incomplete human feedback. To address this issue, subsequent studies [100, 224] introduce the Elo rating system to derive the preference ranking by comparing candidate outputs. The ranking of outputs serves as the training signal that guides the model to prefer certain outputs over others, thus inducing outputs that are more reliable and safer.

• Question-based approach. Further, human labelers can provide more detailed feedback by answering certain questions designed by researchers [72], covering the alignment criteria as well as additional constraints for LLMs. Specially, in WebGPT [72], to assist the model in filtering and utilizing relevant information from retrieved documents, human labelers are required to answer multiple-choice questions about whether the retrieved documents are useful for answering the given input.

• Rule-based approach. Besides, many studies develop rule-based methods to provide more detailed human feedback. As a typical case, Sparrow [100] not only selects the response that labelers consider the best but also uses a series of rules to test whether model-generated responses meet the alignment criteria of being helpful, correct, and harmless. In this way, two kinds of human feedback data can be obtained: (1) the response preference feedback, obtained by comparing the quality of model-generated outputs in pairs, and (2) the rule violation feedback, obtained by collecting the assessment from human labelers (i.e., a score indicating to what extent the generated output has violated the rules). Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers (based on GPT-4 itself) as rule-based reward models, which can automatically determine whether the model-generated outputs violate a set of human-written rules.

In the following, we focus on a well-known technique, reinforcement learning from human feedback (RLHF), which has been widely used in recent powerful LLMs such as ChatGPT. As discussed below, the alignment criteria introduced in Section 5.2.1 can be fulfilled by learning from human feedback on the responses of LLMs to users' queries.

5.2.3 Reinforcement Learning from Human Feedback

To align LLMs with human values, reinforcement learning from human feedback (RLHF) [70, 226] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful to improve the alignment criteria (e.g., helpfulness, honesty, and harmlessness). RLHF employs reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization (PPO) [111]) to adapt LLMs to human feedback by learning a reward model. Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [61].

[Figure 6 omitted: a three-stage diagram showing supervised fine-tuning (training the pre-trained LM with demonstration data), reward model training (training the reward model on human-ranked LM outputs), and RL fine-tuning (optimizing the aligned LM against the reward model with an RL algorithm such as PPO).]

Fig. 6. The workflow of the RLHF algorithm.

RLHF System. The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and an RL algorithm training the LM. Specifically, the pre-trained LM is typically a generative model that is initialized with existing pre-trained LM parameters. For example, OpenAI uses 175B GPT-3 for its first popular RLHF model, InstructGPT [61], and DeepMind uses the 280 billion parameter model Gopher [59] for its GopherCite model [227]. Further, the reward model (RM) provides (learned) guidance signals that reflect human preferences for the text generated by the LM, usually in the form of a scalar value. The reward model can take on two forms: a fine-tuned LM or an LM trained de novo on human preference data. Existing work typically employs reward models with a parameter scale different from that of the aligned LM [61, 227]. For example, OpenAI uses 6B GPT-3 and DeepMind uses 7B Gopher as the reward model, respectively. Finally, to optimize the pre-trained LM using the signal from the reward model, a specific RL algorithm is designed for large-scale model tuning. Specifically, Proximal Policy Optimization (PPO) [111] is a widely used RL algorithm for alignment in existing work [61, 100, 227].

Key Steps for RLHF. Figure 6 illustrates the overall three-step process of RLHF [61, 112], as introduced below (a simplified sketch of the reward modeling loss and the KL-penalized reward follows the three steps).

• Supervised fine-tuning. To make the LM initially perform the desired behaviors, it usually needs to collect a supervised dataset containing input prompts (instructions) and desired outputs for fine-tuning the LM. These prompts and outputs can be written by human labelers for some specific tasks while ensuring the diversity of tasks. For example, InstructGPT [61] asks human labelers to compose prompts (e.g., "List five ideas for how to regain enthusiasm for my career") and desired outputs for several generative tasks such as open QA, brainstorming, chatting, and rewriting. Note that the first step is optional in specific settings or scenarios.

• Reward model training. The second step is to train the RM using human feedback data. Specifically, we employ the LM to generate a certain number of output texts using sampled prompts (from either the supervised dataset or the human-generated prompts) as input. We then invite human labelers to annotate the preference for these pairs. The annotation process can be conducted in multiple forms, and a common approach is to annotate by ranking the generated candidate texts, which can reduce the inconsistency among annotators. Then, the RM is trained to predict the human-preferred output. In InstructGPT, labelers rank model-generated outputs from best to worst, and the RM (i.e., 6B GPT-3) is trained to predict the ranking.

• RL fine-tuning. At this step, aligning (i.e., fine-tuning) the LM is formalized as an RL problem. In this setting, the pre-trained LM acts as the policy that takes a prompt as input and returns an output text, its action space is the vocabulary, the state is the currently generated token sequence, and the reward is provided by the RM. To avoid deviating significantly from the initial (before tuning) LM, a penalty term is commonly incorporated into the reward function. For example, InstructGPT optimizes the LM against the RM using the PPO algorithm. For each input prompt, InstructGPT calculates the KL divergence between the generated results from the current LM and the initial LM as the penalty. It is noted that the second and final steps can be iterated in multiple turns for better aligning LLMs.
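To make the second and third steps concrete, the following PyTorch sketch shows (a) a pairwise ranking loss of the kind commonly used to train the RM and (b) the KL-style penalty subtracted from the RM reward during RL fine-tuning. It is an illustrative reconstruction under simplified assumptions (scalar rewards and whole-sequence log-probabilities), not the exact InstructGPT implementation.

```python
# A simplified sketch of two RLHF ingredients (illustrative, not exact).
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards, rejected_rewards):
    """Train the RM so that human-preferred outputs score higher:
    loss = -log(sigmoid(r_chosen - r_rejected))."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def penalized_reward(rm_reward, policy_logprob, init_logprob, beta=0.1):
    """Penalize the RM reward when the tuned policy drifts from the initial LM:
    r = r_RM - beta * (log pi_theta(y|x) - log pi_init(y|x))."""
    return rm_reward - beta * (policy_logprob - init_logprob)

# Toy usage with scalar rewards and sequence log-probabilities.
loss = reward_model_loss(torch.tensor([1.2]), torch.tensor([-0.3]))
r = penalized_reward(torch.tensor(0.8), torch.tensor(-12.0), torch.tensor(-11.0))
print(loss.item(), r.item())
```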
5.3 Efficient Tuning

In the above, we have discussed the approaches of instruction tuning and alignment tuning to adapt LLMs according to specific goals. Since LLMs consist of a huge number of model parameters, it would be costly to perform full-parameter tuning. In this section, we discuss how to conduct efficient tuning on LLMs. We first review several representative parameter-efficient fine-tuning methods for Transformer language models, and then summarize existing work on parameter-efficient fine-tuned LLMs.

5.3.1 Parameter-Efficient Fine-Tuning Methods

In the existing literature, parameter-efficient fine-tuning [229–232] has been an important topic that aims to reduce the number of trainable parameters while retaining as good a performance as possible. In what follows, we briefly review four parameter-efficient fine-tuning methods for Transformer language models: adapter tuning, prefix tuning, prompt tuning, and LoRA.

Adapter Tuning. Adapter tuning incorporates small neural network modules (called adapters) into the Transformer models [233]. To implement the adapter module, a bottleneck architecture has been proposed in [233, 234], which first compresses the original feature vector into a smaller dimension (followed by a nonlinear transformation) and then recovers it to the original dimension. The adapter modules are typically integrated into each Transformer layer using a serial insertion after each of its two core parts (i.e., the attention layer and the feed-forward layer). Alternatively, parallel adapters [235] can also be used in Transformer layers, placing two adapter modules in parallel with the attention layer and the feed-forward layer, respectively. During fine-tuning, the adapter modules are optimized according to the specific task goals, while the parameters of the original language model are frozen. In this way, we can effectively reduce the number of trainable parameters during fine-tuning.

Prefix Tuning. Prefix tuning [230] prepends a sequence of prefixes, which are a set of trainable continuous vectors, to each Transformer layer in language models. These prefix vectors are task-specific and can be considered as virtual token embeddings. To optimize the prefix vectors, a reparameterization trick [230] has been proposed: instead of directly optimizing the prefixes, it learns an MLP function that maps a smaller matrix to the parameter matrix of prefixes. It has been shown that this trick is useful for stable training. After optimization, the mapping function is discarded, and only the derived prefix vectors are kept to enhance task-specific performance. Since only the prefix parameters are trained, this leads to parameter-efficient model optimization. Similar to prefix tuning, p-tuning v2 [236] incorporates layer-wise prompt vectors into the Transformer architecture, specially for natural language understanding, and also utilizes multi-task learning for jointly optimizing shared prompts. It has been shown to be useful in improving the performance of models of different parameter scales on natural language understanding tasks.

Prompt Tuning. Different from prefix tuning, prompt tuning [231, 237] mainly focuses on incorporating trainable prompt vectors at the input layer24. Based on the discrete prompting methods [239, 240], it augments the input text by including a group of soft prompt tokens (either in a free form [237] or a prefix form [231]), and then takes the prompt-augmented input to solve specific downstream tasks. In implementation, task-specific prompt embeddings are combined with the input text embeddings, which are subsequently fed into language models. P-tuning [237] proposes a free form to combine the context, prompt, and target tokens, which can be applied to architectures for both natural language understanding and generation, and further learns the representations of soft prompt tokens with a bidirectional LSTM. Another representative approach [231], named prompt tuning, directly prepends prefix prompts to the input. During training, only the prompt embeddings are learned according to task-specific supervision. However, since this method only includes a small number of trainable parameters at the input layer, it has been found that its performance highly relies on the model capacity of the underlying language models [231].

24. Here, prompt tuning denotes a category of related efficient tuning methods exemplified by the work [231, 237, 238], instead of a specific method as used in [231]. Indeed, the prefix-based tuning methods [230, 236] can also be considered as prompting methods, which are called deep prompting tuning in [236]. In this survey, prompt tuning specially refers to the methods that only include the prompt tokens at the input layer, in the context of LLMs. We assign p-tuning v2 [236] to the category of prefix tuning because it incorporates layer-wise prompts in language models.

Low-Rank Adaptation (LoRA). LoRA [232] imposes a low-rank constraint for approximating the update matrix at each dense layer, so as to reduce the number of trainable parameters for adapting to downstream tasks. Consider the case of optimizing a parameter matrix W. The update process can be written in a general form as W ← W + ∆W. The basic idea of LoRA is to freeze the original matrix W ∈ R^{m×n} while approximating the parameter update ∆W by low-rank decomposition matrices, i.e., ∆W = A · B^⊤, where A ∈ R^{m×k} and B ∈ R^{n×k} are the trainable parameters for task adaptation and k ≪ min(m, n) is the reduced rank. The major merit of LoRA is that it can largely save memory and storage usage (e.g., VRAM). Moreover, one only needs to keep a single copy of the large model, while maintaining a number of task-specific low-rank decomposition matrices for adapting to different downstream tasks. Further, several studies have also discussed how to set the rank in a more principled way, e.g., importance-score-based allocation [241] and search-free optimal rank selection [242].
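The following minimal PyTorch sketch illustrates the LoRA idea just described: the original weight W is frozen, and the low-rank product A · B^⊤ provides the trainable update in the forward pass. It is a simplified illustration of the technique in [232], not the reference implementation.

```python
# A minimal sketch of a LoRA-augmented linear layer (W: m x n, rank k).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base_linear
        for p in self.base.parameters():       # freeze W (and its bias)
            p.requires_grad_(False)
        m, n = base_linear.out_features, base_linear.in_features
        self.A = nn.Parameter(torch.randn(m, rank) * 0.01)  # trainable, m x k
        self.B = nn.Parameter(torch.zeros(n, rank))         # trainable, n x k

    def forward(self, x):
        # y = x W^T + x (A B^T)^T ; the update starts at zero since B is zero-initialized
        return self.base(x) + x @ self.B @ self.A.T

layer = LoRALinear(nn.Linear(16, 32), rank=4)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B
```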

Besides the above methods, there is extensive research on the efficient tuning of Transformer language models. However, a more comprehensive discussion of efficient tuning is beyond the scope of this article and can be found in the related papers on this topic [229, 235].

5.3.2 Parameter-Efficient Fine-Tuning on LLMs

With the rise of LLMs, efficient tuning has attracted increasing research attention for developing a more lightweight adaptation approach for downstream tasks. In particular, LoRA [232] has been widely applied to open-source LLMs (e.g., LLaMA and BLOOM) for parameter-efficient fine-tuning. Among these research attempts, LLaMA and its variants have gained much attention for parameter-efficient tuning. For example, Alpaca-LoRA [243] has been trained using LoRA as a lightweight tuned version of Alpaca [221] (a fine-tuned 7B LLaMA model with 52K human demonstrations of instruction following). There are extensive explorations of Alpaca-LoRA across different languages and model sizes, which can be found in the collection page25. Besides, LLaMA-Adapter [244] inserts learnable prompt vectors into each Transformer layer, in which zero-initialized attention has been proposed to improve training by mitigating the influence of under-fitted prompt vectors. This approach has also been extended to a multi-modal setting, e.g., visual question answering.

Further, an empirical study [234] has been conducted to examine the effect of different tuning methods on language models. It compares several efficient tuning methods, including serial adapter tuning [233], parallel adapter tuning [235, 245], and LoRA [232], on three open-source LLMs, namely GPT-J (6B), BLOOM (7.1B), and LLaMA (7B). Based on the experimental results on six math reasoning datasets, these efficient-tuning methods under-perform the reference baseline GPT-3.5 on difficult tasks, while achieving comparable performance on simple tasks. Overall, LoRA performs relatively well among the compared methods, using significantly fewer trainable parameters.

As an important resource, the library PEFT [246] (standing for parameter-efficient fine-tuning) has been released on GitHub26. It includes several widely used efficient tuning methods, including LoRA [232]/AdaLoRA [241], prefix tuning [230, 236], P-Tuning [237], and prompt tuning [231]. Further, it supports a number of language models such as GPT-2 and LLaMA, and also covers several representative vision Transformer models (e.g., ViT and Swin Transformer).

25. https://github.com/tloen/alpaca-lora
26. https://github.com/huggingface/peft
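As a brief usage illustration, the following sketch applies LoRA to a small model with the PEFT library mentioned above. The exact API may differ across library versions, and the model name and hyperparameter values are illustrative choices.

```python
# A hedged sketch of applying LoRA with the PEFT library (API may vary by version).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("gpt2")
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,  # decoder-only language modeling
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor for the LoRA update
    lora_dropout=0.05,
)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```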

As discussed in Section 5.3.1, a large number of efficient tuning methods have been proposed in the existing literature. However, most of these approaches have been tested on small-sized pre-trained language models instead of LLMs. So far, there still lacks a thorough investigation into the effect of different efficient tuning methods on large-sized language models across different settings or tasks.

6 UTILIZATION

After pre-training or adaptation tuning, a major approach to using LLMs is to design suitable prompting strategies for solving various tasks. A typical prompting method is in-context learning [50, 55], which formulates the task description and/or demonstrations in the form of natural language text. In addition, chain-of-thought prompting [33] can be employed to enhance in-context learning by involving a series of intermediate reasoning steps in prompts. Next, we will elaborate on the details of the two techniques.

6.1 In-Context Learning

As a special prompting form, in-context learning (ICL) was first proposed along with GPT-3 [55] and has become a typical approach to utilizing LLMs.

6.1.1 Prompting Formulation

As stated in [55], ICL uses a formatted natural language prompt, consisting of the task description and/or a few task examples as demonstrations. Figure 7 presents an illustration of ICL. First, starting with a task description, a few examples are selected from the task dataset as demonstrations. Then, they are combined in a specific order to form natural language prompts with specially designed templates. Finally, the test instance is appended to the demonstrations as the input for LLMs to generate the output. Based on task demonstrations, LLMs can recognize and perform a new task without explicit gradient updates.

[Figure 7 omitted: side-by-side examples of an ICL prompt (task description, N demonstrations as question-answer pairs, and a test query) and a CoT prompt (the same structure with intermediate reasoning steps included in the demonstration answers).]

Fig. 7. A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs with a natural language description, several demonstrations, and a test query, while CoT prompting involves a series of intermediate reasoning steps in prompts.

Formally, let D_k = {f(x_1, y_1), ..., f(x_k, y_k)} represent a set of demonstrations with k examples, where f(x_k, y_k) is the prompt function that transforms the k-th task example into a natural language prompt. Given the task description I, the demonstrations D_k, and a new input query x_{k+1}, the prediction of the output ŷ_{k+1} generated from the LLM can be formulated as follows27:

$$\mathrm{LLM}\big(I,\; \underbrace{f(x_1, y_1), \dots, f(x_k, y_k)}_{\text{demonstrations}},\; \underbrace{f(x_{k+1},\; \_\!\_\,)}_{\text{input}}\big) \rightarrow \underbrace{\hat{y}_{k+1}}_{\text{answer}}, \tag{6}$$

where the actual answer y_{k+1} is left as a blank to be predicted by the LLM. Since the performance of ICL heavily relies on demonstrations, it is important to properly design them in the prompts. According to the construction process in Equation (6), we focus on three major aspects of formatting demonstrations in the prompts: how to select the examples that make up the demonstrations, how to format each example into the prompt with the function f(·), and how to arrange the demonstrations in a reasonable order (a sketch of this construction is given below).
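The following Python sketch renders Equation (6) directly as prompt construction: the task description I, k formatted demonstrations, and the query with the answer slot left blank for the LLM to complete. The template used by f is illustrative.

```python
# A direct rendering of Equation (6) as ICL prompt construction.
def f(x, y=None):
    """Prompt function: format one task example as natural language."""
    return f"Q: {x}\nA: {y}" if y is not None else f"Q: {x}\nA:"

def build_icl_prompt(task_description, demonstrations, query):
    pieces = [task_description]
    pieces += [f(x, y) for x, y in demonstrations]  # f(x_1, y_1), ..., f(x_k, y_k)
    pieces.append(f(query))                         # f(x_{k+1}, __) with a blank answer
    return "\n\n".join(pieces)

prompt = build_icl_prompt(
    "Answer the following questions:",
    [("Where is the capital of France?", "Paris.")],
    "Where is the capital of Brazil?",
)
print(prompt)  # the LLM's completion plays the role of the predicted answer
```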
A comprehensive review of ICL has been presented in the survey paper [50], and we suggest that readers refer to it for a more general, detailed discussion of this topic. Compared with that survey, we specially focus on applying ICL to LLMs in two major aspects, i.e., demonstration design and the underlying mechanism of ICL. Besides, ICL also has a close connection with instruction tuning (discussed in Section 5.1) in that both utilize natural language to format the task or instances. However, instruction tuning needs to fine-tune LLMs for adaptation, while ICL only prompts LLMs for utilization. Furthermore, instruction tuning can enhance the ICL ability of LLMs to perform target tasks, especially in the zero-shot setting (only using task descriptions) [64].

6.1.2 Demonstration Design

Several studies have shown that the effectiveness of ICL is highly affected by the design of demonstrations [247–249]. Following the discussion in Section 6.1.1, we introduce the demonstration design of ICL from three major aspects, i.e., demonstration selection, format, and order.

Demonstration Selection. The performance of ICL tends to have a large variance with different demonstration examples [250], so it is important to select a subset of examples that can effectively leverage the ICL capability of LLMs. There are two main demonstration selection approaches, namely heuristic and LLM-based approaches (a sketch of a simple heuristic selector follows this list):

• Heuristic approaches. Due to their simplicity and low costs, existing work widely adopts heuristic methods to select demonstrations. Several studies employ a k-NN based retriever to select examples that are semantically relevant to the query [250, 251]. However, they perform the selection individually for each example, rather than evaluating the example set as a whole. To resolve this issue, diversity-based selection strategies have been proposed to choose the most representative set of examples for specific tasks [252, 253]. Furthermore, in [254], both relevance and diversity are taken into consideration when selecting demonstrations.

• LLM-based approaches. Another line of work selects demonstrations by making use of LLMs. For example, LLMs can be utilized to directly measure the informativeness of each example according to the performance gain after adding the example [255]. Besides, EPR [256] proposes a two-stage retrieval approach that first recalls similar examples with an unsupervised method (e.g., BM25) and then ranks them using a dense retriever (trained with positive and negative examples labeled by LLMs). As an alternative approach, the task of demonstration selection can be formulated as an RL problem, where LLMs serve as the reward function to provide feedback for training the policy model [257]. Since LLMs perform well at text annotation [258], some recent studies employ the LLM itself as the demonstration generator without human intervention [259, 260].

27. When ICL was introduced in the GPT-3 paper [55], it was originally defined as a combination of the task description and demonstration examples, wherein either component is dispensable. Following this definition, when an LLM is required to solve an unseen task by using only task descriptions, it can also be considered to perform ICL for task solving, and the ICL ability can be enhanced by instruction tuning.
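The following Python sketch illustrates the heuristic k-NN style selection discussed above: the k candidate examples whose embeddings are most similar to the test query are chosen as demonstrations. The embed() function is a stand-in for any real sentence encoder and is purely illustrative.

```python
# A minimal sketch of k-NN demonstration selection by embedding similarity.
import numpy as np

def embed(text):
    # Placeholder encoder: replace with a real sentence embedder in practice.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(64)

def select_demonstrations(query, pool, k=3):
    q = embed(query)

    def cosine(v):  # cosine similarity to the query embedding
        return float(v @ q / (np.linalg.norm(v) * np.linalg.norm(q)))

    scored = sorted(pool, key=lambda ex: cosine(embed(ex["input"])), reverse=True)
    return scored[:k]

pool = [{"input": f"question {i}", "output": f"answer {i}"} for i in range(10)]
demos = select_demonstrations("a new test question", pool, k=3)
print([d["input"] for d in demos])
```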

To summarize, as discussed in [261], for both of the above selection approaches, the selected demonstration examples in ICL should contain sufficient information about the task to solve and be relevant to the test query.

Demonstration Format. After selecting task examples, the next step is to integrate and format them into a natural language prompt for LLMs. A straightforward method is to instantiate a pre-defined template with the corresponding input-output pairs [36]. To construct more informative templates, recent studies consider adding task descriptions [64] or enhancing the reasoning capability of LLMs with chain-of-thought prompts [33]. For instance, in [208], the authors collect a large-scale dataset with task descriptions written by humans. After tuning with this dataset, the performance on seen tasks can be boosted, and LLMs can also generalize to unseen tasks to some extent. To reduce the annotation costs, a semi-automated approach has been proposed in [219], which employs a seed set consisting of human-written task descriptions to guide LLMs to generate task descriptions for new tasks. Since it is costly to manually annotate demonstration formats for different tasks, some work also studies how to automatically generate high-quality ones. As two representative methods, Auto-CoT [262] leverages LLMs with the zero-shot prompt "Let's think step by step" to generate intermediate reasoning steps, while least-to-most prompting [263] first queries LLMs to perform problem decomposition and then utilizes LLMs to sequentially solve sub-problems based on the intermediate answers to previously solved ones.

Demonstration Order. LLMs are shown to sometimes suffer from recency bias, i.e., they are prone to repeat answers that are near the end of the demonstrations [249]. Thus, it is important to arrange the demonstrations (i.e., task examples) in a reasonable order. Early work proposes several heuristic methods to quickly find a good order. For example, demonstrations can be directly organized according to their similarity to the query in the embedding space [250]: the more similar, the closer to the end. Besides, global and local entropy metrics can be used to score different demonstration orders [248]. To integrate more task information, some recent studies propose to minimize the code length required to compress and transmit task labels, which is inspired by information theory [264]. However, these methods need additional labeled data as the validation set to evaluate the performance of specific demonstration orders. To eliminate this need, the authors in [248] propose to sample the validation data from the LLM itself.

6.1.3 Underlying Mechanism

After pre-training, LLMs can exhibit an intriguing ICL capability without being updated. In what follows, we discuss two key questions about the ICL ability of LLMs, i.e., "how does pre-training affect the ICL ability" and "how do LLMs perform ICL during inference".

How Pre-Training Affects ICL? ICL was first proposed in GPT-3 [55], which showed that the ICL ability becomes more significant with a larger model size. Meanwhile, some studies reveal that small-scale PLMs can also demonstrate a strong ICL ability with specially designed training tasks (e.g., learning to predict the label with task examples and the query as the input), and may even surpass larger models [265]. This suggests that the design of training tasks is an important influence factor on the ICL capability of LLMs. Besides training tasks, recent studies have also investigated the relationship between ICL and the pre-training corpora [261, 266, 267]. It has been shown that the performance of ICL heavily depends on the source of the pre-training corpora rather than their scale [267]. Another study [266] provides an in-depth analysis of the impact of the training data distribution. It finds that ICL emerges when the training data can be clustered into numerous infrequent classes, instead of being uniformly distributed. Furthermore, the authors in [261] theoretically explain ICL as the product of pre-training on documents that exhibit long-range coherence.

How LLMs Perform ICL? At the inference stage, researchers focus on analyzing how the ICL capability operates based on given demonstrations, since no explicit learning or updating is involved. They typically analyze from the perspective of gradient descent and consider ICL as implicit fine-tuning [60, 268]. Under this framework, the ICL process can be explained as follows: by means of forward computation, LLMs generate meta-gradients with respect to demonstrations and implicitly perform gradient descent via the attention mechanism. Experiments also show that certain attention heads in LLMs are capable of performing task-agnostic atomic operations (e.g., copying and prefix matching), which are closely related to the ICL ability [269, 270]. To further explore the working mechanism of ICL, some studies abstract ICL as a process of algorithm learning [271–273]. Specifically, the authors in [272] find that LLMs essentially encode implicit models through their parameters during pre-training. With the examples provided in ICL, LLMs can implement learning algorithms such as gradient descent or directly compute the closed-form solution to update these models during forward computation. Under this explanation framework, it has been shown that LLMs can effectively learn simple linear functions and even some complex functions like decision trees with ICL [271–273].

6.2 Chain-of-Thought Prompting

Chain-of-Thought (CoT) [33] is an improved prompting strategy to boost the performance of LLMs on complex reasoning tasks, such as arithmetic reasoning [274–276], commonsense reasoning [277, 278], and symbolic reasoning [33]. Instead of simply constructing the prompts with input-output pairs as in ICL, CoT incorporates into the prompts intermediate reasoning steps that can lead to the final output. In the following, we elaborate on the usage of CoT with ICL and discuss when and why CoT prompting works.

6.2.1 In-context Learning with CoT

Typically, CoT can be used with ICL in two major settings, namely the few-shot and zero-shot settings, as introduced below.

Few-shot CoT. Few-shot CoT is a special case of ICL, which augments each demonstration ⟨input, output⟩ as ⟨input, CoT, output⟩ by incorporating the CoT reasoning steps. To apply this strategy, we next discuss two key issues, i.e., how to design appropriate CoT prompts and how to utilize the generated CoTs for deriving the final answer.

• CoT prompt design. It is critical to design appropriate CoT prompts for effectively eliciting the complex reasoning ability of LLMs. As a direct approach, it has been shown that using diverse CoTs (i.e., multiple reasoning paths for each problem) can effectively enhance performance [279]. Another intuitive idea is that prompts with more complex reasoning paths are more likely to elicit the reasoning ability of LLMs [280], which can result in higher accuracy in generating correct answers. However, both of these approaches rely on annotated CoT datasets, which limits their use in practice. To overcome this limitation, Auto-CoT [262] proposes to utilize Zero-shot-CoT [281] (detailed in the following part "Zero-shot CoT") to generate CoT reasoning paths by specially prompting LLMs, thus eliminating manual efforts. To boost the performance, Auto-CoT further divides the questions in the training set into different clusters and then chooses the questions closest to the centroid of each cluster, which are supposed to well represent the questions in the training set. Although few-shot CoT can be considered as a special prompt case of ICL, the ordering of demonstrations seems to have a relatively small impact compared to the standard prompt in ICL: reordering the demonstrations only results in a performance variation of less than 2% in most tasks [33].

• Enhanced CoT strategies. Besides enriching the contextual information, CoT prompting also provides more options for inferring the answer to a given question. Existing studies mainly focus on generating multiple reasoning paths and try to find a consensus among the derived answers [282–284]. For instance, self-consistency [282] is proposed as a new decoding strategy when generating CoTs and the final answer. It first generates several reasoning paths and then takes an ensemble over all the answers (e.g., selecting the most consistent answer by voting among these paths). Self-consistency boosts the performance of CoT reasoning by a large margin, and can even improve some tasks where CoT prompting is usually worse than standard prompting (e.g., closed-book question answering and natural language inference). Further, the authors in [283] expand the self-consistency strategy to a more general ensemble framework (extending to ensembles over the prompts), and they find that diverse reasoning paths are the key to the performance improvement in CoT reasoning. The above methods can be easily integrated into CoT prompting to enhance the performance without additional training. In contrast, other studies train a scoring model to measure the reliability of the generated reasoning paths [279] or continually train LLMs on the reasoning paths generated by themselves [285, 286] to improve the performance.

Zero-shot CoT. Different from few-shot CoT, zero-shot CoT does not include human-annotated task demonstrations in the prompts. Instead, it directly generates reasoning steps and then employs the generated CoTs to derive the answers. Zero-shot CoT is first proposed in [281], where the LLM is first prompted by "Let's think step by step" to generate reasoning steps and then prompted by "Therefore, the answer is" to derive the final answer. They find that such a strategy drastically boosts the performance when the model scale exceeds a certain size, but is not effective for small-scale models, showing a significant pattern of emergent abilities. In order to unlock the CoT ability on more tasks, Flan-T5 and Flan-PaLM [64] further perform instruction tuning on CoT annotations, which improves the zero-shot performance on unseen tasks.

6.2.2 Further Discussion on CoT

In this part, we present discussions regarding two fundamental questions related to CoT, i.e., "when does CoT work for LLMs" and "why can LLMs perform CoT reasoning".

When CoT Works for LLMs? Since CoT is an emergent ability [31], it only has a positive effect on sufficiently large models (e.g., typically containing 10B or more parameters [33]) but not on small models. Moreover, since CoT augments the standard prompting with intermediate reasoning steps, it is mainly effective for improving tasks that require step-by-step reasoning [33], such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. For other tasks that do not rely on complex reasoning, CoT might show worse performance than standard prompting [283], e.g., MNLI-m/mm, SST-2, and QQP from GLUE [177]. Interestingly, it seems that the performance gain brought by CoT prompting can be significant only when standard prompting yields poor results [33].

Why LLMs Can Perform CoT Reasoning? As the second question, we discuss the underlying mechanism of CoT in the following two aspects.

• The source of CoT ability. Regarding the source of the CoT capability, it is widely hypothesized that it can be attributed to training on code, since models trained on code show strong reasoning abilities [47, 287]. Intuitively, code data is well organized with algorithmic logic and programming flow, which may be useful to improve the reasoning performance of LLMs. However, this hypothesis still lacks publicly reported evidence from ablation experiments (with and without training on code). Besides, instruction tuning seems not to be the key reason for obtaining the CoT ability, since it has been empirically shown that instruction tuning on non-CoT data does not improve the performance on held-out CoT benchmarks [64].

• The effect of prompting components. The major distinction between CoT prompting and standard prompting is the incorporation of reasoning paths prior to the final answer. Thus, some researchers investigate the effects of different components in the reasoning paths. Specifically, a recent study identifies three key components in CoT prompting, namely symbols (e.g., numerical quantities in arithmetic reasoning), patterns (e.g., equations in arithmetic reasoning), and text (i.e., the rest of the tokens that are neither symbols nor patterns) [288]. It is shown that the latter two parts (i.e., patterns and text) are essential to the model performance, and removing either one leads to a significant performance drop. However, the correctness of symbols and patterns does not seem critical. Further, there exists a symbiotic relationship between text and patterns: the text helps LLMs to generate useful patterns, and patterns aid LLMs to understand tasks and generate texts that help solve them [288].
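To illustrate the techniques discussed in Section 6.2.1, the following Python sketch combines zero-shot CoT prompting with self-consistency decoding: several reasoning paths are sampled with the "Let's think step by step" trigger, an answer is extracted from each with the "Therefore, the answer is" prompt, and the majority-vote answer is returned. Here, sample_from_llm is a placeholder that the reader must replace with an actual LLM sampling API.

```python
# A schematic sketch of zero-shot CoT combined with self-consistency voting.
from collections import Counter

def sample_from_llm(prompt, temperature=0.7):
    raise NotImplementedError("call your LLM of choice here")

def extract_answer(reasoning, llm=sample_from_llm):
    # Second stage of zero-shot CoT: prompt for the final answer.
    return llm(reasoning + "\nTherefore, the answer is", temperature=0.0)

def self_consistent_answer(question, n_paths=10, llm=sample_from_llm):
    prompt = f"Q: {question}\nA: Let's think step by step."
    paths = [llm(prompt) for _ in range(n_paths)]         # diverse reasoning paths
    answers = [extract_answer(prompt + p, llm) for p in paths]
    return Counter(answers).most_common(1)[0][0]          # majority vote
```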

There are also some preliminary attempts that extend this technique to solve multimodal tasks [289] and multilingual tasks [290]. In addition to directly utilizing LLMs with ICL and CoT, some recent studies explore how to specialize the ability of LLMs towards specific tasks [291–293], which is called model specialization [294]. For example, the researchers in [294] specialize the ability of mathematical reasoning from LLMs through fine-tuning the small-scale Flan-T5 [64] on CoT reasoning paths generated by LLMs. Model specialization can also be applied to solve a variety of tasks like question answering [295], code synthesis [296], and information retrieval [297].
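As an illustrative sketch of the data-construction step in such distillation-style specialization (the data format and filtering rule below are assumptions for illustration, not the exact recipe of [294]), one can sample CoT paths from a large teacher model and keep only those whose final answer matches the gold label, then fine-tune the small model on the resulting pairs with standard sequence-to-sequence training:

    import json

    def generate(prompt: str) -> str:
        """Stand-in for a call to the large teacher LLM."""
        raise NotImplementedError

    def build_specialization_data(problems, out_path="cot_distill.jsonl"):
        """Keep only teacher reasoning paths that reach the gold answer."""
        with open(out_path, "w") as f:
            for question, gold_answer in problems:
                path = generate(f"Q: {question}\nA: Let's think step by step.")
                # Crude correctness filter on the trailing answer span.
                if gold_answer in path.split("answer is")[-1]:
                    record = {"input": question, "target": path}
                    f.write(json.dumps(record) + "\n")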
7 CAPACITY EVALUATION

To examine the effectiveness and superiority of LLMs, a surge of tasks and benchmarks have been leveraged for conducting empirical evaluation and analysis. We first introduce three types of basic evaluation tasks of LLMs for language generation and understanding, then present several advanced tasks of LLMs with more complicated settings or goals, and finally discuss existing benchmarks and empirical analyses.

7.1 Basic Evaluation Tasks

In this part, we mainly focus on three types of evaluation tasks for LLMs, i.e., language generation, knowledge utilization, and complex reasoning. It is noted that we do not intend to have complete coverage of all the related tasks, but instead only focus on the most widely discussed or studied tasks for LLMs. Next, we introduce these tasks in detail.

7.1.1 Language Generation

According to the task definition, existing tasks about language generation can be roughly categorized into language modeling, conditional text generation, and code synthesis tasks. Although code synthesis is not a typical NLP task, we include it for discussion because it can be directly solved by a number of LLMs (trained on code data) in a similar generation approach as natural language text.
Language Modeling. As the most fundamental ability of LLMs, language modeling aims to predict the next token based on the previous tokens [15], which mainly focuses on the capacity of basic language understanding and generation. For evaluating such an ability, typical language modeling datasets that existing work uses include Penn Treebank [298], WikiText-103 [299], and the Pile [130], where the metric of perplexity is commonly used for evaluating the model performance under the zero-shot setting. Empirical studies [55, 83] show that LLMs bring substantial performance gains over the previous state-of-the-art methods on these evaluation datasets. To better test the modeling capacity of long-range dependencies in text, the LAMBADA dataset [167] has been introduced, where LLMs are required to predict the last word of sentences based on a paragraph of context. Then, the accuracy and perplexity of the predicted last words are employed to evaluate LLMs. As shown in existing work, the performance on the language modeling tasks typically follows the scaling law [30], which means that scaling language models would improve the accuracy and reduce the perplexity.
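For reference, the perplexity metric has a standard closed form: for a test sequence x = (x_1, ..., x_N), it is the exponentiated average negative log-likelihood assigned by the model p_θ, so lower values indicate better predictions:

    \mathrm{PPL}(x) = \exp\Big( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid x_{<i}) \Big)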
Conditional Text Generation. As an important topic in language generation, conditional text generation [48] focuses on generating texts satisfying specific task demands based on the given conditions, typically including machine translation [367], text summarization [368], and question answering [369]. To measure the quality of the generated text, automatic metrics (e.g., Accuracy, BLEU [370] and ROUGE [371]) and human ratings have typically been used for evaluating the performance. Due to their powerful language generation capabilities, LLMs have achieved remarkable performance on existing datasets and benchmarks, even surpassing human performance (on test datasets). For instance, given only 32 examples as the input, GPT-3 with in-context learning can outperform a full-data fine-tuned BERT-Large on the average score of SuperGLUE [312]; on MMLU, a 5-shot Chinchilla [34] nearly doubles the average accuracy of human raters, and GPT-4 [46] in the 5-shot setting further achieves the state-of-the-art performance, which yields more than 10% improvement in average accuracy compared to the previous best model. This raises serious concerns about whether existing benchmarks for conditional text generation tasks can appropriately evaluate and reflect the capability of LLMs. Considering this issue, researchers have sought to create new evaluation benchmarks (e.g., BIG-bench Hard [314]) by collecting currently unsolvable tasks (i.e., tasks on which LLMs fail to perform well) or creating more challenging tasks, e.g., super-long text generation [372]. Moreover, recent studies also find that the automatic metrics may underestimate the generation quality of LLMs. In OpenDialKG [311], ChatGPT underperforms a fine-tuned GPT-2 on BLEU and ROUGE-L metrics, while earning more favor from human judgment [373]. Therefore, more efforts need to be devoted to developing new metrics that are more aligned with human judgment.

Code Synthesis. Besides generating high-quality natural language, existing LLMs also show strong abilities to generate formal language, especially computer programs (i.e., code) that satisfy specific conditions, called code synthesis [374]. Unlike natural language generation, the generated code can be directly checked by execution with corresponding compilers or interpreters, so existing work mostly evaluates the quality of the generated code from LLMs by calculating the pass rate against the test cases, i.e., pass@k (given k programs generated by the LLM, pass@k is computed as 1 if at least one program passes all test cases, and 0 otherwise). Recently, several code benchmarks focusing on functional correctness have been proposed to assess the code synthesis abilities of LLMs, such as APPS [316], HumanEval [89], and MBPP [152]. Typically, they consist of diverse programming problems, with text specifications and test cases for correctness checking.
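The per-problem pass@k just described can be sketched as follows; run_tests is a hypothetical harness that executes one candidate program against the problem's test cases in a sandbox:

    from typing import Callable, List

    def pass_at_k(candidates: List[str],
                  run_tests: Callable[[str], bool],
                  k: int) -> int:
        """1 if any of the first k sampled programs passes all tests, else 0."""
        return int(any(run_tests(program) for program in candidates[:k]))

    def benchmark_pass_at_k(all_candidates: List[List[str]],
                            run_tests: Callable[[str], bool],
                            k: int) -> float:
        """Benchmark-level pass@k: the mean of the per-problem scores."""
        scores = [pass_at_k(c, run_tests, k) for c in all_candidates]
        return sum(scores) / len(scores)

Note that HumanEval [89] additionally uses an unbiased estimator of this quantity when more than k samples are drawn per problem; the simple form above follows the definition given here.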
To improve such an ability, it is key to fine-tune (or pre-train) LLMs on code data, which can effectively adapt LLMs to code synthesis tasks [77]. Besides, existing work has proposed new strategies to generate code, e.g., sampling multiple candidate solutions [152] and planning-guided decoding [375], which can be considered as imitations of the bug-fixing and code-planning processes of programmers. Impressively, LLMs have recently shown competitive performance with humans by achieving a ranking in the top 28% of users on the programming contest platform Codeforces [98].
TABLE 7
Basic evaluation tasks and corresponding representative datasets of LLMs.

Language Generation
  Language Modeling: Penn Treebank [298], WikiText-103 [299], the Pile [130], LAMBADA [167]
  Conditional Text Generation: WMT’14,16,19,20,21,22 [300–305], Flores-101 [306], DiaBLa [307], CNN/DailyMail [308], XSum [309], WikiLingua [310], OpenDialKG [311], SuperGLUE [312], MMLU [313], BIG-bench Hard [314], CLUE [315]
  Code Synthesis: APPS [316], HumanEval [89], MBPP [152], CodeContest [98], MTPB [77], DS-1000 [317], ODEX [318]

Knowledge Utilization
  Closed-Book QA: Natural Questions [319], ARC [320], TruthfulQA [321], Web Questions [322], TriviaQA [323], PIQA [324], LC-quad2.0 [325], GrailQA [326], KQApro [327], CWQ [328], MKQA [329], ScienceQA [330]
  Open-Book QA: Natural Questions [319], OpenBookQA [331], ARC [320], Web Questions [322], TriviaQA [323], MS MARCO [332], QASC [333], SQuAD [334], WikiMovies [335]
  Knowledge Completion: WikiFact [336], FB15k-237 [337], Freebase [338], WN18RR [339], WordNet [340], LAMA [341], YAGO3-10 [342], YAGO [343]

Complex Reasoning
  Knowledge Reasoning: CSQA [277], StrategyQA [278], ARC [320], BoolQ [344], PIQA [324], SIQA [345], HellaSwag [346], WinoGrande [347], OpenBookQA [331], COPA [348], ScienceQA [330], proScript [349], ProPara [350], ExplaGraphs [351], ProofWriter [352], EntailmentBank [353], ProOntoQA [354]
  Symbolic Reasoning: CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [355], Parity [355], Colored Object [356], Penguins in a Table [356], Repeat Copy [357], Object Counting [357]
  Mathematical Reasoning: MATH [313], GSM8k [274], SVAMP [275], MultiArith [358], ASDiv [276], MathQA [359], AQUA-RAT [360], MAWPS [361], DROP [362], NaturalProofs [363], PISA [364], miniF2F [365], ProofNet [366]

Further, GitHub Copilot has been released to assist programming in coding IDEs (e.g., Visual Studio and JetBrains IDEs), and it can support a variety of languages including Python, JavaScript, and Java. A viewpoint article entitled “The End of Programming” [376] in Communications of the ACM has discussed the impact of AI programming in the field of computer science, emphasizing an important shift towards the highly adaptive LLM as a new atomic unit of computation.

Major Issues. Although LLMs have achieved splendid performance in generating human-like text, they are susceptible to two major issues in language generation, as discussed below.

• Controllable generation. For LLMs, the mainstream way to generate texts under given conditions is through the use of natural language instructions or prompts. Despite its simplicity, such a mechanism poses significant challenges in terms of exerting fine-grained or structural constraints over the generated outputs of these models. Existing work [41] shows that, when generating texts with complex constraints on their structures, LLMs can handle local planning (e.g., interactions between proximal sentences) very well but might struggle with global planning (i.e., long-range relatedness). For example, to generate a complex long passage with several paragraphs, it is still difficult to directly ensure a specific text structure (e.g., the order of concepts and the logical flow), considering the whole text. This case becomes even more challenging for generation tasks that require formal rules or grammar, e.g., code synthesis. To tackle this issue, a potential solution is to extend the one-pass generation into the iterative prompting of LLMs. This simulates the human writing process to break down language generation into multiple steps such as planning, drafting, rewriting, and editing [372] (a minimal sketch follows this discussion). Several studies have proven that iterative prompting can elicit relevant knowledge to achieve better performance on sub-tasks [377, 378]. In essence, chain-of-thought prompting has utilized the idea of decomposing complex tasks into multi-step reasoning chains. Besides, the safety control of generated texts is also important for practical deployment. It has been shown that LLMs may generate texts that contain sensitive information or offensive expressions [46]. Although the RLHF algorithm [61] can alleviate this problem to some extent, it still relies on considerable human-labeled data for tuning LLMs, without an objective optimization goal to follow. Thus, it is imperative to explore effective methods to overcome these limitations and enable safer control over the outputs of LLMs.
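A minimal sketch of such an iterative plan-draft-refine loop is given below; the stage prompts and the generate stub are illustrative assumptions, not a specific method from the cited works:

    def generate(prompt: str) -> str:
        """Stand-in for any text-completion LLM API."""
        raise NotImplementedError

    def iterative_writing(task: str, n_refine: int = 2) -> str:
        # Step 1: ask for a global plan to address long-range structure.
        plan = generate(f"Task: {task}\nWrite an outline, one line per paragraph.")
        # Step 2: draft the passage conditioned on the plan.
        draft = generate(f"Task: {task}\nOutline:\n{plan}\nWrite the full passage.")
        # Step 3: iteratively rewrite, checking the order of concepts and
        # the logical flow against the outline.
        for _ in range(n_refine):
            draft = generate(
                f"Task: {task}\nOutline:\n{plan}\nDraft:\n{draft}\n"
                "Revise the draft so that it follows the outline and flows logically."
            )
        return draft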
• Specialized generation. Although LLMs have learned general language patterns to generate coherent text, their proficiency in generation might be constrained when dealing with a specialized domain or task. For instance, a language model that has been trained on general web articles may face challenges when generating a medical report that involves extensive medical jargon and domain-specific methods. Intuitively, domain knowledge should be critical for model specialization. However, it is not easy to inject such specialized knowledge into LLMs. As discussed in recent analyses [47, 379], when LLMs are trained to exhibit some specific ability that allows them to excel in some areas, they might struggle in others. Such an issue is related to catastrophic forgetting [380, 381] in training neural networks, which refers to the conflict phenomenon of integrating new and old knowledge. Similar cases also occur in the human alignment of LLMs, where an “alignment tax” [61] (e.g., a potential loss in the in-context learning ability) has to be paid for aligning to human values and needs. Therefore, it is important to develop effective model specialization methods that can flexibly adapt LLMs to various task scenarios, while retaining the original abilities as much as possible.

7.1.2 Knowledge Utilization

Knowledge utilization is an important ability of intelligent systems to accomplish knowledge-intensive tasks (e.g., commonsense question answering and fact completion) based on supporting factual evidence. Concretely, it requires LLMs to properly utilize the rich factual knowledge from the pre-training corpus or retrieve external data when necessary. In particular, question answering (QA) and knowledge completion have been two commonly used tasks for evaluating this ability. According to the test tasks (question answering or knowledge completion) and evaluation settings (with or without external resources), we categorize existing knowledge utilization tasks into three types, namely closed-book QA, open-book QA²⁹, and knowledge completion.

29. In this part, open-book QA refers to the QA tasks that require extracting and utilizing useful information from external knowledge resources, as the antithesis of closed-book QA (only using the encoded information from the pre-training corpus). Note that there is also a dataset named OpenBookQA [331], which follows the settings of open-book QA tasks by extracting and utilizing external science facts.
Closed-Book QA. Closed-book QA tasks [382] test the acquired factual knowledge of LLMs from the pre-training corpus, where LLMs should answer the question only based on the given context without using external resources. For evaluating this ability, there are several datasets that can be leveraged, including Natural Questions [319], Web Questions [322], and TriviaQA [323], where the accuracy metric is widely adopted. Empirical results have revealed that LLMs can perform well in this setting and even match the performance of state-of-the-art open-domain QA systems [56]. Besides, the performance of LLMs on closed-book QA tasks also shows a scaling law pattern in terms of both model size and data size: scaling the parameters and training tokens can increase the capacity of LLMs and help them learn (or memorize) more knowledge from the pre-training data [56]. Further, under a similar parameter scale, LLMs with more pre-training data relevant to the evaluated tasks would achieve better performance [72]. Besides, the closed-book QA setting also provides a testbed for probing the accuracy of the factual knowledge encoded by LLMs. However, as shown in existing work [55], LLMs might perform less well on QA tasks relying on fine-grained knowledge, even when it exists in the pre-training data.
Open-Book QA. Unlike closed-book QA, in open-book QA tasks, LLMs can extract useful evidence from an external knowledge base or document collections, and then answer the question based on the extracted evidence [383–386]. Typical open-book QA datasets (e.g., Natural Questions [319], OpenBookQA [331], and SQuAD [334]) overlap with closed-book QA datasets, but they incorporate external data sources, e.g., Wikipedia. The metrics of accuracy and F1 score are widely used in open-book QA tasks for evaluation. To select relevant knowledge from external resources, LLMs are often paired with a text retriever (or even a search engine), which is trained independently or jointly with LLMs [72, 383, 387]. In evaluation, existing studies mainly focus on testing how LLMs utilize the extracted knowledge to answer the question, and show that the retrieved evidence can largely improve the accuracy of the generated answers, even enabling a smaller LLM to outperform 10× larger ones [383, 387]. Besides, open-book QA tasks can also evaluate the recency of knowledge information. Pre-training or retrieving from outdated knowledge resources may cause LLMs to generate incorrect answers for time-sensitive questions [383].
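A common retrieve-then-read pipeline for open-book QA can be sketched as below; retrieve and generate are hypothetical stand-ins for a retriever (or search engine) over an external corpus such as Wikipedia and for an LLM API:

    from typing import List

    def retrieve(query: str, k: int = 5) -> List[str]:
        """Stand-in for a text retriever or search engine."""
        raise NotImplementedError

    def generate(prompt: str) -> str:
        """Stand-in for an LLM API."""
        raise NotImplementedError

    def open_book_qa(question: str) -> str:
        # Condition the LLM on retrieved evidence instead of relying
        # only on knowledge encoded during pre-training.
        passages = retrieve(question)
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        prompt = ("Answer the question using the evidence.\n"
                  f"Evidence:\n{context}\nQuestion: {question}\nAnswer:")
        return generate(prompt).strip()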
Fig. 8. Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). (a) Intrinsic hallucination: given the input “Bob’s wife is Amy. Bob’s daughter is Cindy. Who is Cindy to Amy?”, the LLM answers “Cindy is Amy’s daughter-in-law.”, a conflicting judgment about the relationship between Cindy and Amy that contradicts the input. (b) Extrinsic hallucination: asked to “Explain RLHF for LLMs.”, the LLM replies that “RLHF stands for ‘Rights, Limitations, Harms, and Freedoms’ and is a framework for … models like LLMs (Large Language Models).”, reflecting an incorrect understanding of the meaning of RLHF (reinforcement learning from human feedback), though it can correctly understand the meaning of LLMs (in this context).

Knowledge Completion. In knowledge completion tasks, LLMs might be (to some extent) considered as a knowledge base [341], which can be leveraged to complete or predict the missing parts of knowledge units (e.g., knowledge triples). Such tasks can probe and evaluate how much and what kind of knowledge LLMs have learned from the pre-training data. Existing knowledge completion tasks can be roughly divided into knowledge graph completion tasks (e.g., FB15k-237 [337] and WN18RR [339]) and fact completion tasks (e.g., WikiFact [336]), which aim to complete triples from a knowledge graph and incomplete sentences about specific facts, respectively. Empirical studies have revealed that it is difficult for existing LLMs to accomplish knowledge completion tasks related to specific relation types [287]. As shown in the evaluation results on WikiFact, LLMs perform well on several frequent relations that occur in the pre-training data (e.g., currency and author), while not well on rare ones (e.g., discoverer_or_inventor and place_of_birth). Interestingly, under the same evaluation settings (e.g., in-context learning), InstructGPT (i.e., text-davinci-002) outperforms GPT-3 in all subsets of WikiFact. This indicates that instruction tuning is helpful for LLMs to accomplish knowledge completion tasks.

Major Issues. Although LLMs have achieved key progress in capturing and utilizing knowledge information, they suffer from two major issues as discussed below.

• Hallucination. In generating factual texts, a challenging issue is hallucinated generations [373], where the generated information is either in conflict with the existing source (intrinsic hallucination) or cannot be verified by the available source (extrinsic hallucination); both are illustrated with examples in Figure 8. Hallucination widely occurs in existing LLMs, even the most capable ones such as GPT-4 [46]. In essence, LLMs seem to “unconsciously” utilize knowledge in task solving, and still lack the ability to accurately control the use of internal or external knowledge. Hallucination would mislead LLMs into generating undesired outputs and mostly degrades the performance, leading to potential risks when deploying LLMs in real-world applications. To alleviate this problem, alignment tuning strategies (as discussed in Section 5.2) have been widely utilized in existing works [61], which rely on tuning LLMs on high-quality data or using human feedback. For the evaluation of the hallucination problem, a set of hallucination detection tasks have been proposed, e.g., TruthfulQA [321] for detecting human falsehoods mimicked by models.

• Knowledge recency. As another major challenge, LLMs would encounter difficulties when solving tasks that require the latest knowledge beyond the training data. To tackle this issue, a straightforward approach is to regularly update LLMs with new data. However, it is very costly to fine-tune LLMs, and doing so is also likely to cause the catastrophic forgetting issue when incrementally training LLMs. Therefore, it is necessary to develop efficient and effective approaches that can integrate new knowledge into existing LLMs, keeping them up-to-date. Existing studies have explored how to utilize external knowledge sources (e.g., a search engine) to complement LLMs, which can be either jointly optimized with LLMs [383] or used as a plug-and-play module [388]. For instance, ChatGPT utilizes a retrieval plugin to access up-to-date information sources [389]. By incorporating the extracted relevant information into the context [390, 391], LLMs can acquire new factual knowledge and perform better on relevant tasks. However, such an approach still seems to operate at a superficial level. It has been revealed that it is difficult to directly amend intrinsic knowledge or inject specific knowledge into LLMs, which remains an open research problem [392, 393].

7.1.3 Complex Reasoning

Complex reasoning refers to the ability of understanding and utilizing supporting evidence or logic to derive conclusions or make decisions [51, 52]. According to the type of logic and evidence involved in the reasoning process, we consider dividing existing evaluation tasks into three major categories, namely knowledge reasoning, symbolic reasoning, and mathematical reasoning.

Knowledge Reasoning. The knowledge reasoning tasks rely on logical relations and evidence about factual knowledge to answer the given question. Existing work mainly uses specific datasets to evaluate the reasoning capacity for the corresponding type of knowledge, e.g., CSQA [277]/StrategyQA [278] for commonsense knowledge reasoning and ScienceQA [330] for science knowledge reasoning. In addition to the accuracy of the predicted results, existing work [330] has also evaluated the quality of the generated reasoning process, via automatic metrics (e.g., BLEU) or human evaluation. Typically, these tasks require LLMs to perform step-by-step reasoning based on factual knowledge, until reaching the answer to the given question. To elicit the step-by-step reasoning ability, the chain-of-thought (CoT) prompting strategy [33] has been proposed for enhancing the complex reasoning capacity of LLMs. As discussed in Section 6.2, CoT incorporates intermediate reasoning steps, which can be manually created [33] or automatically generated [394], into the prompts to guide LLMs to perform multi-step reasoning. Such an approach largely improves the reasoning performance of LLMs, leading to new state-of-the-art results on several complex knowledge reasoning tasks [33, 56, 395]. Further, after reformulating knowledge reasoning tasks into code generation tasks, researchers have found that the performance of LLMs can be further improved [156], especially with LLMs pre-trained on code. However, due to the complexity of knowledge reasoning tasks, the performance of current LLMs still lags behind human results on tasks such as commonsense reasoning [33, 56, 396]. As one of the most common mistakes, LLMs might generate inaccurate intermediate steps based on wrong factual knowledge, leading to a wrong final result. To address this issue, existing work has proposed special decoding or ensemble strategies to improve the accuracy of the whole reasoning chain [279, 282]. More recently, an empirical study [395] reveals that LLMs may have difficulty in explicitly inferring the commonsense knowledge required by a specific task, though they can successfully solve it. It further shows that leveraging self-generated knowledge may not be beneficial for improving the reasoning performance.

Symbolic Reasoning³⁰. The symbolic reasoning tasks mainly focus on manipulating symbols in a formal rule setting to fulfill some specific goal [51], where the operations and rules may have never been seen by LLMs during pre-training. Existing work [33, 263, 281] commonly evaluates LLMs on the tasks of last letter concatenation and coin flip, where the evaluation examples require either the same reasoning steps as the in-context examples (called the in-domain test) or more steps (called the out-of-domain test). As an example of the out-of-domain test, LLMs may only see examples with two words in context, but are required to concatenate the last letters of three or more words. Typically, the accuracy of the generated symbols is adopted to evaluate the performance of LLMs on these tasks. Thus, LLMs need to understand the semantic relations among the symbolic operations and their composition in complex scenarios. However, under the out-of-domain setting, as LLMs have not seen the complex compositions of symbolic operations and rules (e.g., twice the number of operations in the context examples), it is hard for LLMs to capture their accurate meanings. To solve this issue, existing studies incorporate scratchpad [355, 397] and tutor [398] strategies to help LLMs better manipulate symbolic operations, generating longer and more complex reasoning processes. Another line of research utilizes formal programming languages to represent the symbolic operations and rules, which requires LLMs to generate code and perform the reasoning process by executing it with external interpreters. Such an approach decomposes the complex reasoning process into code synthesis and program execution for LLMs and interpreters, respectively, leading to a simplified reasoning process with yet more accurate results [357].

30. Following [33], we mainly discuss symbolic reasoning tasks specially designed for evaluating LLMs. We do not consider symbolic reasoning methods in traditional NLP tasks, such as deducing logical rules from knowledge graphs in KBQA.
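To make the in-domain versus out-of-domain distinction concrete, the following sketch builds last-letter-concatenation instances of controllable length (two-word examples for the prompt demonstrations, longer ones for the out-of-domain evaluation); the word pool and formatting are illustrative assumptions:

    import random

    WORDS = ["Elon", "Musk", "Larry", "Page", "Bill", "Gates", "Ada", "Lovelace"]

    def last_letter_instance(n_words: int, rng: random.Random):
        """One symbolic reasoning instance: concatenate the last letters."""
        words = rng.sample(WORDS, n_words)
        question = f'Take the last letters of the words in "{" ".join(words)}".'
        answer = "".join(w[-1] for w in words)
        return question, answer

    rng = random.Random(0)
    in_domain = [last_letter_instance(2, rng) for _ in range(4)]      # same steps as demos
    out_of_domain = [last_letter_instance(4, rng) for _ in range(4)]  # more steps than demos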
Mathematical Reasoning. The mathematical reasoning tasks need to comprehensively utilize mathematical knowledge, logic, and computation for solving problems or generating proof statements. Existing mathematical reasoning tasks can be mainly categorized into math problem solving and automated theorem proving. For math problem solving tasks, the SVAMP [275], GSM8k [274], and MATH [313] datasets are commonly used for evaluation, where LLMs need to generate accurate concrete numbers or equations to answer the mathematical problem. As these tasks also require multi-step reasoning, the chain-of-thought prompting strategy has been widely adopted for LLMs to improve the reasoning performance [33]. As a practical strategy, continually pre-training LLMs on large-scale mathematical corpora can largely boost their performance on mathematical reasoning tasks [35, 147, 399]. Further, since math problems in different languages share the same mathematical logic, researchers also propose a multilingual math word problem benchmark [290] to evaluate the multilingual mathematical reasoning capacity of LLMs. As another challenging task, automated theorem proving (ATP) [363, 365, 400] requires the reasoning model to strictly follow the reasoning logic and mathematical skills. To evaluate the performance on this task, PISA [364] and miniF2F [365] are two typical ATP datasets with the proof success rate as the evaluation metric. As a typical approach, existing work on ATP utilizes LLMs to aid the search for proofs using an interactive theorem prover (ITP), such as Lean, Metamath, and Isabelle [401–403]. A major limitation of ATP research is the lack of related corpora in formal language. To tackle it, several studies utilize LLMs to convert informal statements into formal proofs for augmenting new data [157], or to generate drafts and proof sketches to reduce the search space of the proofs [404].
Major Issues. In spite of the advancements, LLMs still have several limitations in solving complex reasoning tasks.

• Inconsistency. With improved reasoning strategies (e.g., CoT prompting), LLMs can solve some complex reasoning tasks by performing step-by-step reasoning based on the supporting logic and evidence. Despite the effectiveness, the inconsistency issue often occurs in the decomposed reasoning process. Concretely, LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process [33, 405], leading to inconsistency between the derived answer and the reasoning process. To alleviate this problem, existing work has proposed to guide the whole generation process of LLMs via external tools or models [375], or to re-check the reasoning process and final answer for correcting them [406]. As a promising solution, recent approaches reformulate complex reasoning tasks into code generation tasks, where the strict execution of the generated code ensures consistency between the reasoning process and the outcome (see the sketch after these two issues). Besides, it has been revealed that there might also exist inconsistency between tasks with similar inputs, where small changes in the task description may cause the model to produce different results [49, 275]. To mitigate this problem, an ensemble of multiple reasoning paths can be applied to enhance the decoding process of LLMs [282].

• Numerical computation. For complex reasoning tasks, LLMs still face difficulties in the involved numerical computation, especially for symbols that are seldom encountered during pre-training, such as arithmetic with large numbers [49, 398]. To tackle this issue, a direct way is to tune LLMs on synthesized arithmetic problems [407]. A surge of studies follow this approach and further improve the numerical computation performance with special training and inference strategies [397], e.g., scratchpad tracing. Besides, existing work [71] has also incorporated external tools (e.g., a calculator), especially for handling arithmetic operations. More recently, ChatGPT has provided a plugin mechanism to use external tools [389]. In this way, LLMs need to learn how to properly manipulate the tools. For this purpose, researchers have augmented the examples using tools (even the LLM itself) for tuning the LLM [71, 408], or devised instructions and exemplars for in-context learning [357]. However, such LLMs still rely on the text context to capture the semantic meanings of mathematical symbols (during the pre-training stage), which is not best suited for numerical computation in essence.
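The following sketches the reformulation of a reasoning problem into code generation and execution. The executed result, rather than free-form text, yields the answer, which enforces consistency between the reasoning steps and the outcome and delegates exact arithmetic to the interpreter. The generate stub and prompt format are illustrative assumptions, and executing model-written code must be sandboxed in practice:

    def generate(prompt: str) -> str:
        """Stand-in for a code-capable LLM API."""
        raise NotImplementedError

    def program_aided_answer(question: str):
        prompt = ("Write a Python function solve() that returns the answer.\n"
                  f"Question: {question}\nCode:\n")
        code = generate(prompt)
        namespace: dict = {}
        # WARNING: executing model-written code is unsafe outside a sandbox.
        exec(code, namespace)
        # The interpreter, not the LLM, performs the numerical computation,
        # so large-number arithmetic is exact by construction.
        return namespace["solve"]()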
7.2 Advanced Ability Evaluation

In addition to the above basic evaluation tasks, LLMs also exhibit some superior abilities that require special considerations for evaluation. In this part, we discuss several representative advanced abilities and the corresponding evaluation approaches, including human alignment, interaction with the external environment, and tool manipulation. Next, we discuss these advanced abilities in detail.

7.2.1 Human Alignment

It is desired that LLMs could well conform to human values and needs, i.e., human alignment, which is a key ability for the broad use of LLMs in real-world applications.

To evaluate this ability, existing studies consider multiple criteria for human alignment, such as helpfulness, honesty, and safety [46, 223, 224]. For helpfulness and honesty, adversarial question answering tasks (e.g., TruthfulQA [321]) can be utilized to examine the ability of LLMs in detecting possible falsehoods in text [46, 72]. Furthermore, harmlessness can also be evaluated by several existing benchmarks, e.g., CrowS-Pairs [409] and Winogender [410]. Despite the automatic evaluation with the above datasets, human evaluation is still a more direct way to effectively test the human alignment ability of LLMs. OpenAI invites many experts in domains related to AI risks to evaluate and improve the behaviors of GPT-4 when encountering risky contents [46]. Besides, for other aspects of human alignment (e.g., truthfulness), several studies propose to use specific instructions and devise annotation rules to guide the annotation process [72]. Empirical studies have revealed that these strategies can greatly improve the human alignment ability of LLMs [224]. For instance, after alignment tuning on data collected through interactions with experts, the incorrect behavior rate of GPT-4 can be largely reduced when it deals with sensitive or disallowed prompts. In addition, high-quality pre-training data can reduce the effort required for alignment [46]. For instance, Galactica is potentially more harmless due to the less biased contents in the scientific corpus [35].

7.2.2 Interaction with External Environment

Besides standard evaluation tasks, LLMs have the ability to receive feedback from the external environment and perform actions according to the behavior instruction, e.g., generating action plans in natural language to manipulate agents [411, 412]. Such an ability is also emergent in LLMs, which can generate detailed and highly realistic action plans, while smaller models (e.g., GPT-2) tend to generate shorter or meaningless plans [411].

To test this ability, several embodied AI benchmarks can be used for evaluation, described as follows. VirtualHome [413] builds a 3D simulator for household tasks such as cleaning and cooking, in which the agent can execute natural language actions generated by LLMs. ALFRED [414] includes more challenging tasks that require LLMs to accomplish compositional targets. BEHAVIOR [415] focuses on everyday chores in simulation environments and requires LLMs to generate complex solutions, e.g., changing the internal status of objects. Based on the generated action plans from LLMs, existing work either adopts the regular metrics (e.g., executability and correctness of the generated action plans) [411] in the benchmark, or directly conducts real-world experiments and measures the success rate [416], to evaluate such an ability. Existing work has shown the effectiveness of LLMs in interacting with the external environment and generating accurate action plans [417]. Recently, several improved methods have been proposed to enhance the interaction ability of LLMs, e.g., designing code-like prompts [418] and providing real-world grounding [416].

7.2.3 Tool Manipulation

When solving complex problems, LLMs can turn to external tools if they determine it is necessary. By encapsulating available tools with API calls, existing work has involved a variety of external tools, e.g., a search engine [72], a calculator [71], and a compiler [357], to enhance the performance of LLMs on several specific tasks. Recently, OpenAI has supported the use of plugins in ChatGPT [389], which can equip LLMs with broader capacities beyond language modeling. For example, the web browser plugin enables ChatGPT to access fresh information. Further, incorporating third-party plugins is particularly key for creating a prosperous ecosystem of applications based on LLMs.

To examine the ability of tool manipulation, existing work mostly adopts complex reasoning tasks for evaluation, such as mathematical problem solving (e.g., GSM8k [274] and SVAMP [275]) or knowledge question answering (e.g., TruthfulQA [321]), where the successful utilization of tools is very important for enhancing the required skills that LLMs are incapable of (e.g., numerical calculation). In this way, the evaluated performance on these tasks can reflect the ability of LLMs in tool manipulation. To teach LLMs to utilize tools, existing studies add in-context exemplars that use tools to elicit LLMs [357], or fine-tune LLMs on simulated data about tool utilization [71, 408]. Existing work has found that, with the help of tools, LLMs become more capable of handling the issues that they are not good at, e.g., equation calculation and utilizing real-time information, and eventually improve the final performance [71].
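A minimal sketch of tool manipulation with a single calculator tool is given below; the Calculator[...] call syntax and the generate stub are assumptions for illustration, not the ChatGPT plugin protocol:

    import re

    def generate(prompt: str) -> str:
        """Stand-in for an LLM API prompted with tool-use exemplars."""
        raise NotImplementedError

    def answer_with_calculator(question: str, max_steps: int = 5) -> str:
        transcript = f"Question: {question}\n"
        for _ in range(max_steps):
            step = generate(transcript)
            call = re.search(r"Calculator\[(.+?)\]", step)
            if call is None:
                return step  # the model produced a final answer
            # Evaluate the arithmetic expression with the external tool;
            # a real implementation would use a safe expression parser.
            result = eval(call.group(1), {"__builtins__": {}})
            transcript += f"{step}\nResult: {result}\n"
        return transcript

The key design point is that the interpreter, not the LLM, carries out the arithmetic, so the step the model is weakest at is delegated to a tool that performs it exactly.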
Summary. The above three abilities are of great value to the practical performance of LLMs: conforming to human values and preferences (human alignment), acting properly in real-world scenarios (interaction with the external environment), and expanding the ability scope (tool manipulation). In addition to the above three advanced abilities, LLMs might also show other abilities that are specially related to some tasks (e.g., data annotation [258]) or learning mechanisms (e.g., self-improvement [286]). It will be an open direction to discover, measure, and evaluate these newly emerging abilities, so as to better utilize and improve LLMs.

7.3 Public Benchmarks and Empirical Analysis

In the aforementioned parts, we have discussed the evaluation tasks of LLMs and their corresponding settings. Next, we will introduce existing evaluation benchmarks and empirical analyses for LLMs, which focus on exploring more comprehensive discussions from a general perspective.

7.3.1 Evaluation Benchmarks

Recently, several comprehensive benchmarks [287, 313, 356] have been released for the evaluation of LLMs. In this part, we introduce several representative and widely used benchmarks, i.e., MMLU, BIG-bench, and HELM.

• MMLU [313] is a versatile benchmark for large-scale evaluation of multi-task knowledge understanding, covering a wide range of knowledge domains from mathematics and computer science to the humanities and social sciences. The difficulty of these tasks varies from basic to advanced. As shown in existing work, LLMs mostly outperform small models by a substantial margin on this benchmark [35, 56, 57, 64], which shows the scaling law in model size. More recently, GPT-4 achieves a remarkable record (86.4% in the 5-shot setting) on MMLU, which is significantly better than the previous state-of-the-art models [46].

• BIG-bench [356] is a collaborative benchmark intended to probe existing LLMs from various aspects. It comprises 204 tasks that encompass a broad range of topics, including linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and so on. By scaling the model size, LLMs can even outperform the average human performance under the few-shot setting on 65% of tasks in BIG-bench [56]. Considering the high evaluation cost of the entire benchmark, a lightweight benchmark, BIG-bench-Lite, has been proposed, which contains 24 small yet diverse and challenging tasks from BIG-bench. Additionally, the BIG-bench hard (BBH) benchmark has been proposed to concentrate on investigating the currently unsolvable tasks of LLMs, by selecting challenging tasks on which LLMs exhibit inferior performance compared to humans. Since BBH is more difficult, small models mostly achieve performance close to random. As a comparison, CoT prompting can elicit the abilities of LLMs to perform step-by-step reasoning for enhancing the performance, even exceeding the average human performance on BBH [314].

• HELM [287] is a comprehensive benchmark that currently implements a core set of 16 scenarios and 7 categories of metrics. It is built on top of many prior studies, conducting a holistic evaluation of language models. As shown in the experimental results of HELM [287], instruction tuning can consistently boost the performance of LLMs in terms of accuracy, robustness, and fairness. Further, for reasoning tasks, the LLMs that have been pre-trained on a code corpus show superior performance.

The above benchmarks cover a variety of mainstream evaluation tasks for the evaluation of LLMs. Besides, there are also several benchmarks that focus on evaluating specific abilities of LLMs, such as TyDiQA [419] for multilingual knowledge utilization and MGSM [290] for multilingual mathematical reasoning. To conduct an evaluation, one can select suitable benchmarks according to specific goals. In addition, there are also several open-source evaluation frameworks for researchers to evaluate LLMs on existing benchmarks or to extend new tasks for customized evaluations, such as Language Model Evaluation Harness [420] and OpenAI Evals [46].
7.3.2 Comprehensive Analyses on LLMs’ Capacities

In addition to constructing large-scale evaluation benchmarks, a surge of studies have conducted comprehensive analyses to investigate the strengths and limitations of LLMs. In this part, we briefly discuss them in two major aspects, namely generalist (general-purpose capacity) and specialist (domain-specific capacity).

Generalist. Due to the remarkable performance, existing work [41, 46, 373, 379, 421–423] has systematically evaluated the general capacities of LLMs, to explore their competences in a variety of different tasks or applications. Typically, these studies mainly focus on the newly emerged LLMs (e.g., ChatGPT and GPT-4) that had not been well investigated before, which are discussed as follows:

• Mastery. To evaluate the mastery level of LLMs in solving general tasks, existing work [423] typically collects a set of datasets covering a range of tasks and domains, and then tests LLMs under the few/zero-shot setting. Empirical results [41, 46, 379, 423] have shown the superior capacities of LLMs as a general-purpose task solver. As remarkable progress, GPT-4 has surpassed state-of-the-art methods with benchmark-specific training in a wide range of tasks, such as language understanding, commonsense reasoning, and mathematical reasoning [46]. Furthermore, it can achieve human-like performance in real-world exams designed for humans (e.g., Advanced Placement exams and the Graduate Record Examination [46]). More recently, a comprehensive qualitative analysis [41] has revealed that GPT-4 approaches human-level performance in a variety of challenging tasks across various fields (e.g., mathematics, computer vision, and programming), and considered it as “an early version of an artificial general intelligence system”. Despite the promising results, this analysis has also revealed that GPT-4 still has severe limitations. For example, it is hard for GPT-4 to calibrate its confidence about the generated result, and it cannot verify its consistency with the training data and itself. Besides, it demonstrates inferior performance on tasks that require planning (e.g., solving the “Tower of Hanoi” problem) or conceptual leaps (e.g., proposing a new scientific hypothesis). Furthermore, several studies have also shown that LLMs may misunderstand unfamiliar concepts [423, 424] on information extraction tasks from specific domains, and face challenges in solving pragmatic emotion-related tasks [422] (e.g., personalized emotion recognition), showing inferior performance compared to specific fine-tuned models.

• Robustness. Besides mastery, another aspect to consider is the stability of LLMs against noises or perturbations, which is particularly important for practical applications. To evaluate the robustness of LLMs against noises or perturbations, existing work [425] conducts adversarial attacks (e.g., token replacement) on the input, and then evaluates the robustness of LLMs based on the change of output results. It has been shown that LLMs are more robust than small language models in a variety of tasks, but may encounter new issues regarding robustness, e.g., robustness instability and prompt sensitivity. Concretely, LLMs are prone to providing different answers when using varied expressions of the same input, even in conflict with the content generated by themselves [426]. Such an issue would also lead to unstable results when evaluating the robustness using different prompts, making the evaluation results of robustness analysis themselves less reliable.
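A sketch of the perturbation-based robustness protocol follows: compare outputs on original and perturbed inputs. Here a simple random token replacement stands in for the adversarial attack methods of [425], and the generate stub stands in for the evaluated LLM:

    import random

    def generate(prompt: str) -> str:
        """Stand-in for the evaluated LLM."""
        raise NotImplementedError

    def perturb(text: str, rng: random.Random, rate: float = 0.1) -> str:
        """Randomly replace tokens; real attacks choose replacements
        adversarially rather than at random."""
        tokens = text.split()
        for i in range(len(tokens)):
            if rng.random() < rate:
                tokens[i] = rng.choice(tokens)
        return " ".join(tokens)

    def robustness_score(inputs, rng=None) -> float:
        """Fraction of inputs whose output is unchanged under perturbation;
        real protocols usually compare extracted answers, not raw text."""
        rng = rng or random.Random(0)
        unchanged = sum(generate(x) == generate(perturb(x, rng)) for x in inputs)
        return unchanged / len(inputs)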
Specialist. As LLMs have been pre-trained on large-scale mixture-of-source corpora, they can capture rich knowledge from the pre-training data. Thus, LLMs are also employed as domain experts or specialists for specific areas. Therefore, recent studies have widely explored the use of LLMs for solving domain-specific tasks and evaluated the adaptation capacity of LLMs. Typically, these studies collect or construct domain-specific datasets to evaluate the performance of LLMs using in-context learning. Since our focus is not to cover all possible application domains, we briefly discuss three representative domains receiving considerable attention from the research community, namely healthcare, education, and law.

• Healthcare is a vital application field closely related to human life. Since the advent of ChatGPT, a series of studies have applied ChatGPT or other LLMs to the medical domain. It has been shown that LLMs are capable of handling a variety of healthcare tasks, e.g., biology information extraction [427], medical advice consultation [428–430], and report simplification [431], and can even pass the medical license exams [432–434] specially designed for professional doctors. However, LLMs may fabricate medical misinformation [429, 431], e.g., misinterpreting medical terms and suggesting advice inconsistent with medical guidelines. Besides, it would also raise privacy concerns to upload the health information of patients [427].

• Education is also an important application domain where LLMs potentially exert significant influence. Existing work has found that LLMs can achieve student-level performance on standardized tests [46, 435, 436] in subjects such as mathematics, physics, and computer science, in both multiple-choice and free-response problems. Besides, empirical studies have shown that LLMs can serve as a writing or reading assistant for education [437, 438]. A recent study [438] reveals that ChatGPT is capable of generating logically consistent answers across disciplines, balancing both depth and breadth. Another quantitative analysis [437] shows that students utilizing ChatGPT perform better than average students with different usage methods (e.g., keeping or refining the results from LLMs as their own answers) in some courses from the computer security field. However, the increasing popularity of LLMs has been raising concerns (e.g., cheating on homework) about the rational use of such intelligent assistants for education.

• Law is a specialized domain that is built on professional domain knowledge. Recently, a number of studies have applied LLMs to solve various legal tasks, e.g., legal document analysis [439, 440], legal judgment prediction [441], and legal document writing [442]. A recent study [443] has found that LLMs possess powerful abilities of legal interpretation and reasoning. Moreover, the latest GPT-4 model achieves a top 10% score in a simulated bar exam compared with human test-takers. However, the use of LLMs in law also raises concerns about legal challenges, including copyright issues [444], personal information leakage [445], and bias and discrimination [446].

Besides the aforementioned work, the capacities of LLMs have also been analyzed from other perspectives. For instance, some recent work has studied the human-like characteristics of LLMs, such as self-awareness, theory of mind (ToM), and affective computing [41, 447–449]. In particular, an empirical evaluation of ToM conducted on two classic false-belief tasks speculates that LLMs may have ToM-like abilities, since a model in the GPT-3.5 series achieves comparable performance with nine-year-old children on ToM tasks [448]. Further, another line of work has investigated the fairness and accuracy of existing evaluation settings for LLMs [450], e.g., the large-scale mixture-of-source pre-training data may contain the data in test sets.

8 CONCLUSION AND FUTURE DIRECTIONS

In this survey, we have reviewed the recent progress of large language models (LLMs), and introduced the key concepts, findings, and techniques for understanding and utilizing LLMs. We focus on the large-sized models (i.e., having a size larger than 10B) while excluding the contents of early pre-trained language models (e.g., BERT and GPT-2) that have been well covered in the existing literature. In particular, our survey has discussed four important aspects of LLMs, i.e., pre-training, adaptation tuning, utilization, and evaluation. For each aspect, we highlight the techniques or findings that are key to the success of LLMs. Besides, we also summarize the available resources for developing LLMs and discuss important implementation guidelines for reproducing LLMs. This survey tries to cover the most recent literature about LLMs and provides a good reference resource on this topic for both researchers and engineers.

Next, we summarize the discussions of this survey, and introduce the challenges and future directions for LLMs, in the following aspects.

Theory and Principle. To understand the underlying working mechanism of LLMs, one of the greatest mysteries is how information is distributed, organized, and utilized through the very large, deep neural network. It is important to reveal the basic principles or elements that establish the foundation of the abilities of LLMs. In particular, scaling seems to play an important role in increasing the capacity of LLMs [31, 55, 59]. It has been shown that some emergent abilities occur in an unexpected way (a sudden performance leap) when the parameter scale of language models increases to a critical size (e.g., 10B) [31, 33], typically including in-context learning, instruction following, and step-by-step reasoning. These emergent abilities are fascinating yet perplexing: when and how they are obtained by LLMs is not yet clear. Recent studies either conduct extensive experiments for investigating the effect of emergent abilities and the contributing factors to such abilities [250, 267, 451], or explain some specific abilities with existing theoretical frameworks [60, 261]. An insightful technical post also specially discusses this topic [47], taking the GPT-series models as the target. However, more formal theories and principles to understand, characterize, and explain the abilities or behaviors of LLMs are still missing. Since emergent abilities bear a close analogy to phase transitions in nature [31, 58], cross-discipline theories or principles (e.g., whether LLMs can be considered as some kind of complex system) might be useful to explain and understand the behaviors of LLMs. These fundamental questions are worth exploring for the research community, and they are important for developing the next-generation LLMs.

Model Architecture. Due to its scalability and effectiveness, the Transformer, consisting of stacked multi-head self-attention layers, has become the de facto architecture for building LLMs. Various strategies have been proposed to improve the performance of this architecture, such as neural network configuration and scalable parallel training (see discussions in Section 4.2.2). To enhance the model capacity (e.g., the multi-turn conversation ability), existing LLMs typically maintain a long context window, e.g., GPT-4-32k has an extremely large context length of 32,768 tokens. Thus, a practical consideration is to reduce the time complexity (originally quadratic) incurred by the standard self-attention mechanism. It is important to investigate the effect of more efficient Transformer variants in building LLMs [452], e.g., sparse attention has been used in GPT-3 [55]. Besides, catastrophic forgetting has been a long-standing challenge for neural networks, which also has a negative impact on LLMs. When tuning LLMs with new data, the originally learned knowledge is likely to be damaged, e.g., fine-tuning an LLM for some specific tasks will affect the general ability of LLMs. A similar case occurs when LLMs are aligned with human values (called alignment tax [61, 223]).
extending existing architectures with more flexible mech- similar safety challenges as small language models. For
anisms or modules that can effectively support data update example, LLMs exhibit a tendency to generate hallucina-
and task specialization. tions [373], which are texts that seem plausible but may be
factually incorrect. What is worse, LLMs might be elicited by
Model Training. In practice, it is very difficult to pre-
intentional instructions to produce harmful, biased, or toxic
train capable LLMs, due to the huge computation con-
texts for malicious systems, leading to the potential risks of
sumption and the sensitivity to data quality and training
misuse [55, 61]. To have a detailed discussion of the safety
tricks [69, 83]. Thus, it becomes particularly important to
issues of LLMs (e.g., privacy, overreliance, disinformation,
develop more systemic, economical pre-training approaches
and influence operations), the readers can refer to the GPT-
for optimizing LLMs, considering the factors of model ef-
3/4 technical reports [46, 55]. As the major approach to
fectiveness, efficiency optimization, and training stability.
averting these issues, reinforcement learning from human
More model checking or performance diagnosis methods
feedback (RLHF) [61, 100] has been widely used by in-
(e.g., predictable scaling in GPT-4 [46]) should be developed
corporating humans in the training loop for developing
in order to detect early abnormal issues during training.
well-aligned LLMs. To improve the model safety, it is also
Furthermore, it also calls for more flexible mechanisms of
important to include safety-relevant prompts during RLHF,
hardware support or resource schedule, so as to better
as shown by GPT-4 [46]. However, RLHF heavily relies
organize and utilize the resources in a computing cluster.
on high-quality human feedback data from professional
Since it is very costly to pre-train a LLM from scratch, it is
labelers, making it difficult to be properly implemented in
important to design a suitable mechanisms for continually
practice. Therefore, it is necessary to improve the RLHF
pre-training or fine-tuning the LLM based on publicly avail-
framework for reducing the efforts of human labelers and
able model checkpoints (e.g., LLaMA [57] and Flan-T5 [64]).
seek a more efficient annotation approach with guaranteed
For this purpose, a number of technical issues have to be
data quality, e.g., LLMs can be employed to assist the
resolved, e.g., catastrophic forgetting and task specialization.
labeling work. More recently, red teaming [115, 225] has
However, to date, there still lack open-source model check-
been adopted for improving the model safety of LLMs,
points for LLMs with complete pre-processing and training
which utilizes the collected adversarial prompts to refine
logs (e.g., the scripts to prepare the pre-training data) for
the LLMs (i.e., avoiding the attacks from red teaming).
reproduction. We believe that it will be of great value to
Furthermore, it is also meaningful to establish the proper
report more technical details in open-source models for the
learning mechanism for LLMs to obtain human feedback
research of LLMs. Besides, it is also important to develop
via chatting and directly utilize it for self-improvement.
more improvement tuning strategies that effectively elicits
the model abilities. Application and Ecosystem. As LLMs have shown a strong
Model Utilization. Since fine-tuning is very costly in real capacity in solving various tasks, they can be applied in a
applications, prompting has become the prominent approach broad range of real-world applications (i.e., following task-
to using LLMs. By combining task descriptions and demon- specific natural language instructions). As a remarkable
stration examples into prompts, in-context learning (a spe- progress, ChatGPT has potentially changed the way how
cial form of prompting) endows LLMs with the ability to humans access information, which has been implemented
perform well on new tasks, even outperforming full-data in the release of New Bing. In the near future, it can be
fine-tuned models in some cases. Furthermore, to enhance foreseen that LLMs would have a significant impact on
the ability of complex reasoning, advanced prompting tech- information-seeking techniques, including both search en-
niques have been proposed, exemplified by the chain-of- gines and recommender systems. Furthermore, the develop-
thought (CoT) strategy, which includes the intermediate ment and use of intelligent information assistants would be
reasoning steps into prompts. However, existing prompt- highly promoted with the technology upgrade from LLMs.
ing approaches still have several deficiencies described as In a broader scope, this wave of technical innovation would
follows. Firstly, it involves considerable human efforts in lead to an ecosystem of LLM-empowered applications (e.g.,
the design of prompts. It would be quite useful to au- the support of plugins by ChatGPT), which has a close con-
tomatically generate effective prompts for solving various nection with human life. Lastly, the rise of LLMs sheds light
tasks. Secondly, some complex tasks (e.g., formal proof and on the exploration of artificial general intelligence (AGI).
numerical computation) require specific knowledge or logic It is promising to develop more smart intelligent systems
rules, which may not be well expressed in natural language (possibly with multi-modality signals) than ever. However,
or demonstrated by examples. Thus, it is important to in this development process, AI safety should be one of the
develop more informative, flexible task formatting methods primary concerns, i.e., making AI lead to good for humanity
for prompts31 . Thirdly, existing prompting strategies mainly but not bad [40].
focus on single-turn performance. It is useful to develop C ODA: This survey was planned during a discussion
interactive prompting mechanisms (e.g., through natural meeting held by our research team, and we aimed to sum-
language conversations) for solving complex tasks, which marize the recent advances of large language models as
have been demonstrated to be very useful by ChatGPT. a highly readable report for our team members. The first
draft was finished on March 13, 2023, in which our team
Safety and Alignment. Despite their capacities, LLMs pose members tried their best to include the related studies about
Application and Ecosystem. As LLMs have shown a strong capacity in solving various tasks, they can be applied in a broad range of real-world applications (i.e., following task-specific natural language instructions). As a remarkable advance, ChatGPT has potentially changed the way humans access information, a shift reflected in the release of New Bing. In the near future, it can be foreseen that LLMs will have a significant impact on information-seeking techniques, including both search engines and recommender systems. Furthermore, the development and use of intelligent information assistants will be greatly promoted by the technology upgrade from LLMs. In a broader scope, this wave of technical innovation will lead to an ecosystem of LLM-empowered applications (e.g., the support of plugins by ChatGPT) that is closely connected with human life. Lastly, the rise of LLMs sheds light on the exploration of artificial general intelligence (AGI). It is promising to develop smarter intelligent systems than ever before, possibly with multi-modality signals. However, in this development process, AI safety should be one of the primary concerns, i.e., making AI bring good rather than harm to humanity [40].

CODA: This survey was planned during a discussion meeting held by our research team, and we aimed to summarize the recent advances of large language models as a highly readable report for our team members. The first draft was finished on March 13, 2023, in which our team members tried their best to include the related studies about LLMs in a relatively objective, comprehensive way. Then, we have extensively revised the writing and contents in
several passes. Despite all our efforts, this survey is still far from perfect: we are likely to miss important references or topics, and might also have non-rigorous expressions or discussions. Due to the space limit, we can only include a fraction of existing LLMs in Figure 1 and Table 1 by setting the selection criterion. However, we set a more relaxed criterion for model selection on our GitHub page (https://github.com/RUCAIBox/LLMSurvey), which will be regularly maintained. We will continuously update this survey and improve its quality as much as we can. For us, survey writing is also a learning process about LLMs. Readers with constructive suggestions to improve this survey are welcome to leave comments on the GitHub page of our survey or to email the authors directly. We will make revisions following the received comments or suggestions in a future version, and acknowledge the readers who have contributed constructive suggestions in our survey.

Update log. In this part, we regularly maintain an update log for the submissions of this survey to arXiv:
• First release on March 31, 2023: the initial version.
• Update on April 9, 2023: add the affiliation information, revise Figure 1 and Table 1 and clarify the corresponding selection criterion for LLMs, improve the writing, and correct some minor errors.
• Update on April 11, 2023: correct the errors for library resources.
• Update on April 12, 2023: revise Figure 1 and Table 1, and clarify the release dates of LLMs.
• Update on April 16, 2023: add a new Section 2.2 about the technical evolution of GPT-series models.
• Update on April 24, 2023: add the discussion about scaling laws and some explanations about the model sizes for emergent abilities (Section 2.1); add an illustrative figure for the attention patterns of different architectures in Figure 4, and add the detailed formulas in Table 4.
• Update on April 25, 2023: revise some copy errors in figures and tables.
• Update on April 27, 2023: add efficient tuning in Section 5.3.
• Update on April 28, 2023: revise Section 5.3.
• Update on May 7, 2023: revise Table 1, Table 2, and some minor points.

Planning content. We will regularly include new content in this survey to make it more self-contained and up-to-date. Here, we list several potential topics that might appear in the next major version(s): (1) the technical evolution from GPT-1 to ChatGPT (partially done), (2) LLaMA-based tuning (e.g., Alpaca), (3) lightweight tuning strategies (done), and (4) detailed formulations for model details (done). If you have a specific topic to suggest for this survey, please drop us a message about it.

ACKNOWLEDGMENTS
The authors would like to thank Yankai Lin and Yutao Zhu for proofreading this paper. Since the first release of this paper, we have received a number of valuable comments from the readers. We sincerely thank the readers who have written to us with constructive suggestions and comments: Tyler Suard, Damai Dai, Liang Ding, Stella Biderman, Kevin Gray, Jay Alammar, Yubo Feng, and Mark Holmstrom.

REFERENCES
[1] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[2] M. D. Hauser, N. Chomsky, and W. T. Fitch, "The faculty of language: what is it, who has it, and how did it evolve?" Science, vol. 298, no. 5598, pp. 1569–1579, 2002.
[3] A. M. Turing, "Computing machinery and intelligence," Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[5] J. Gao and C. Lin, "Introduction to the special issue on statistical language modeling," ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.
[6] R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?" Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[7] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Seventh International Conference on Spoken Language Processing, 2002.
[8] X. Liu and W. B. Croft, "Statistical language modeling for information retrieval," Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.
[9] C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008.
[10] S. M. Thede and M. P. Harper, "A second-order hidden Markov model for part-of-speech tagging," in ACL, 1999, pp. 175–182.
[11] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A tree-based statistical language model for natural language speech recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[12] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in EMNLP-CoNLL, 2007, pp. 858–867.
[13] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987.
[14] W. A. Gale and G. Sampson, "Good-Turing frequency estimation without tears," J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[15] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
guage model,” in INTERSPEECH 2010, 11th Annual [25] W. Fedus, B. Zoph, and N. Shazeer, “Switch trans-
Conference of the International Speech Communication formers: Scaling to trillion parameter models with
Association, Makuhari, Chiba, Japan, September 26-30, simple and efficient sparsity,” J. Mach. Learn. Res, pp.
2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. 1–40, 2021.
ISCA, 2010, pp. 1045–1048. [26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei,
[17] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, I. Sutskever et al., “Language models are unsuper-
“Recurrent neural network based language modeling vised multitask learners,” OpenAI blog, p. 9, 2019.
in meeting recognition,” in INTERSPEECH 2011, 12th [27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen,
Annual Conference of the International Speech Commu- O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov,
nication Association, Florence, Italy, August 27-31, 2011. “Roberta: A robustly optimized BERT pretraining ap-
ISCA, 2011, pp. 2877–2880. proach,” CoRR, vol. abs/1907.11692, 2019.
[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, [28] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika,
K. Kavukcuoglu, and P. P. Kuksa, “Natural language Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey,
processing (almost) from scratch,” J. Mach. Learn. Res., M. S. Bari, C. Xu, U. Thakker, S. S. Sharma,
vol. 12, pp. 2493–2537, 2011. E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak,
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Man-
J. Dean, “Distributed representations of words and ica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden,
phrases and their compositionality,” in Advances in T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli,
Neural Information Processing Systems 26: 27th Annual T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Bider-
Conference on Neural Information Processing Systems man, L. Gao, T. Wolf, and A. M. Rush, “Multitask
2013. Proceedings of a meeting held December 5-8, 2013, prompted training enables zero-shot task generaliza-
Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bot- tion,” in The Tenth International Conference on Learning
tou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, Representations, ICLR 2022, Virtual Event, April 25-29,
pp. 3111–3119. 2022. OpenReview.net, 2022.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Ef- [29] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W.
ficient estimation of word representations in vector Chung, I. Beltagy, J. Launay, and C. Raffel, “What
space,” in 1st International Conference on Learning Rep- language model architecture and pretraining objective
resentations, ICLR 2013, Scottsdale, Arizona, USA, May works best for zero-shot generalization?” in Interna-
2-4, 2013, Workshop Track Proceedings, Y. Bengio and tional Conference on Machine Learning, ICML 2022, 17-23
Y. LeCun, Eds., 2013. July 2022, Baltimore, Maryland, USA, ser. Proceedings
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, of Machine Learning Research, vol. 162, 2022, pp.
C. Clark, K. Lee, and L. Zettlemoyer, “Deep contex- 22 964–22 984.
tualized word representations,” in Proceedings of the [30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown,
2018 Conference of the North American Chapter of the As- B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and
sociation for Computational Linguistics: Human Language D. Amodei, “Scaling laws for neural language mod-
Technologies, NAACL-HLT 2018, New Orleans, Louisiana, els,” CoRR, vol. abs/2001.08361, 2020.
USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. [31] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph,
Walker, H. Ji, and A. Stent, Eds. Association for S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou,
Computational Linguistics, 2018, pp. 2227–2237. D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals,
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of
L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, large language models,” CoRR, vol. abs/2206.07682,
“Attention is all you need,” in Advances in Neural 2022.
Information Processing Systems 30: Annual Conference on [32] M. Shanahan, “Talking about large language models,”
Neural Information Processing Systems 2017, December 4- CoRR, vol. abs/2212.03551, 2022.
9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008. [33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi,
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Q. Le, and D. Zhou, “Chain of thought prompting
pre-training of deep bidirectional transformers for elicits reasoning in large language models,” CoRR, vol.
language understanding,” in Proceedings of the 2019 abs/2201.11903, 2022.
Conference of the North American Chapter of the Asso- [34] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya,
ciation for Computational Linguistics: Human Language T. Cai, E. Rutherford, D. de Las Casas, L. A. Hen-
Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, dricks, J. Welbl, A. Clark, T. Hennigan, E. Noland,
June 2-7, 2019, Volume 1 (Long and Short Papers), K. Millican, G. van den Driessche, B. Damoc, A. Guy,
J. Burstein, C. Doran, and T. Solorio, Eds. Association S. Osindero, K. Simonyan, E. Elsen, J. W. Rae,
for Computational Linguistics, 2019, pp. 4171–4186. O. Vinyals, and L. Sifre, “Training compute-optimal
[24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mo- large language models,” vol. abs/2203.15556, 2022.
hamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, [35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom,
“BART: denoising sequence-to-sequence pre-training A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and
for natural language generation, translation, and com- R. Stojnic, “Galactica: A large language model for
prehension,” in Proceedings of the 58th Annual Meeting science,” CoRR, vol. abs/2211.09085, 2022.
of the Association for Computational Linguistics, ACL [36] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and
2020, Online, July 5-10, 2020, 2020, pp. 7871–7880. G. Neubig, “Pre-train, prompt, and predict: A system-
40

atic survey of prompting methods in natural language abs/2212.10403, 2022.


processing,” ACM Comput. Surv., pp. 195:1–195:35, [52] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng,
2023. C. Tan, F. Huang, and H. Chen, “Reasoning with
[37] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, language model prompting: A survey,” CoRR, vol.
C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, Z. Liu, P. Xie, abs/2212.09597, 2022.
C. Xiong, J. Pei, P. S. Yu, and L. Sun, “A comprehensive [53] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, “Chat-
survey on pretrained foundation models: A history gpt: potential, prospects, and limitations,” in Frontiers
from BERT to chatgpt,” CoRR, vol. abs/2302.09419, of Information Technology & Electronic Engineering, 2023,
2023. pp. 1–6.
[38] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, [54] W. X. Zhao, J. Liu, R. Ren, and J. Wen, “Dense text
J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, retrieval based on pretrained language models: A
Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, survey,” CoRR, vol. abs/2211.14876, 2022.
J. Tang, J. Wen, J. Yuan, W. X. Zhao, and J. Zhu, “Pre- [55] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan,
trained models: Past, present and future,” AI Open, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry,
vol. 2, pp. 225–250, 2021. A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger,
[39] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler,
“Pre-trained models for natural language processing: J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
A survey,” CoRR, vol. abs/2003.08271, 2020. M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. Mc-
[40] S. Altman, “Planning for agi and beyond,” OpenAI Candlish, A. Radford, I. Sutskever, and D. Amodei,
Blog, February 2023. “Language models are few-shot learners,” in Ad-
[41] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, vances in Neural Information Processing Systems 33: An-
E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lund- nual Conference on Neural Information Processing Sys-
berg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, tems 2020, NeurIPS 2020, December 6-12, 2020, virtual,
“Sparks of artificial general intelligence: Early experi- H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and
ments with gpt-4,” vol. abs/2303.12712, 2023. H. Lin, Eds., 2020.
[42] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, [56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma,
T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, G. Mishra, A. Roberts, P. Barham, H. W. Chung,
K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, C. Sutton, S. Gehrmann, P. Schuh, K. Shi,
X. Song, and F. Wei, “Language is not all you need: S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes,
Aligning perception with language models,” CoRR, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du,
vol. abs/2302.14045, 2023. B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Is-
[43] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and ard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghe-
L. Sun, “A comprehensive survey of ai-generated mawat, S. Dev, H. Michalewski, X. Garcia, V. Misra,
content (aigc): A history of generative ai from gan to K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan,
chatgpt,” arXiv preprint arXiv:2303.04226, 2023. H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Do-
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdh- han, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pil-
ery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu lai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child,
et al., “Palm-e: An embodied multimodal language O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta,
model,” arXiv preprint arXiv:2303.03378, 2023. M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-
[45] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel,
N. Duan, “Visual chatgpt: Talking, drawing and edit- “Palm: Scaling language modeling with pathways,”
ing with visual foundation models,” arXiv preprint CoRR, vol. abs/2204.02311, 2022.
arXiv:2303.04671, 2023. [57] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
[46] OpenAI, “Gpt-4 technical report,” OpenAI, 2023. M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham-
[47] Y. Fu, H. Peng, and T. Khot, “How does gpt obtain its bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and
ability? tracing emergent abilities of language models G. Lample, “Llama: Open and efficient foundation
to their sources,” Yao Fu’s Notion, Dec 2022. language models,” CoRR, 2023.
[48] J. Li, T. Tang, W. X. Zhao, and J. Wen, “Pretrained [58] B. A. Huberman and T. Hogg, “Phase transitions in
language model for text generation: A survey,” in artificial intelligence systems,” Artificial Intelligence,
Proceedings of the Thirtieth International Joint Conference vol. 33, no. 2, pp. 155–171, 1987.
on Artificial Intelligence, IJCAI 2021, Virtual Event / [59] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoff-
Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. mann, H. F. Song, J. Aslanides, S. Henderson, R. Ring,
ijcai.org, 2021, pp. 4492–4499. S. Young, E. Rutherford, T. Hennigan, J. Menick,
[49] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, “A A. Cassirer, R. Powell, G. van den Driessche, L. A.
survey of deep learning for mathematical reasoning,” Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl,
CoRR, vol. abs/2212.10535, 2022. S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins,
[50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M.
X. Sun, J. Xu, L. Li, and Z. Sui, “A survey for in-context Jayakumar, E. Buchatskaya, D. Budden, E. Suther-
learning,” CoRR, vol. abs/2301.00234, 2023. land, K. Simonyan, M. Paganini, L. Sifre, L. Martens,
[51] J. Huang and K. C. Chang, “Towards reasoning X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya,
in large language models: A survey,” CoRR, vol. D. Donato, A. Lazaridou, A. Mensch, J. Lespiau,
41

M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sotti- P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phan-


aux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, ishayee, and M. Zaharia, “Efficient large-scale lan-
C. de Masson d’Autume, Y. Li, T. Terzi, V. Mikulik, guage model training on GPU clusters using
I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, megatron-lm,” in International Conference for High Per-
C. Jones, J. Bradbury, M. J. Johnson, B. A. Hechtman, formance Computing, Networking, Storage and Analysis,
L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, SC 2021, St. Louis, Missouri, USA, November 14-19,
S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, 2021. ACM, 2021, p. 58.
J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, [68] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. An-
and G. Irving, “Scaling language models: Methods, dersch, M. Shoeybi, and B. Catanzaro, “Reducing ac-
analysis & insights from training gopher,” CoRR, vol. tivation recomputation in large transformer models,”
abs/2112.11446, 2021. CoRR, vol. abs/2205.05198, 2022.
[60] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, [69] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hess-
“Why can GPT learn in-context? language models se- low, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé,
cretly perform gradient descent as meta-optimizers,” J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S.
CoRR, vol. abs/2212.10559, 2022. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff,
[61] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wain- A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman,
wright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier,
A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, S. Tan, P. O. Suarez, V. Sanh, H. Laurençon, Y. Jer-
M. Simens, A. Askell, P. Welinder, P. F. Christiano, nite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan,
J. Leike, and R. Lowe, “Training language models to A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers,
follow instructions with human feedback,” CoRR, vol. A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm,
abs/2203.02155, 2022. C. Leong, D. van Strien, D. I. Adelani, and et al.,
[62] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, “BLOOM: A 176b-parameter open-access multilingual
B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Fine- language model,” CoRR, vol. abs/2211.05100, 2022.
tuned language models are zero-shot learners,” in [70] P. F. Christiano, J. Leike, T. B. Brown, M. Martic,
The Tenth International Conference on Learning Repre- S. Legg, and D. Amodei, “Deep reinforcement learn-
sentations, ICLR 2022, Virtual Event, April 25-29, 2022. ing from human preferences,” in Advances in Neural
OpenReview.net, 2022. Information Processing Systems 30: Annual Conference on
[63] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, Neural Information Processing Systems 2017, December
A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von
Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N.
M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299–
J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, 4307.
Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pick- [71] T. Schick, J. Dwivedi-Yu, R. Dessı̀, R. Raileanu,
ett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, M. Lomeli, L. Zettlemoyer, N. Cancedda, and
R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, T. Scialom, “Toolformer: Language models can teach
V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, themselves to use tools,” CoRR, vol. abs/2302.04761,
A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Ra- 2023.
jakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, [72] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang,
A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-Arcas, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saun-
C. Cui, M. Croak, E. H. Chi, and Q. Le, “Lamda: ders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger,
Language models for dialog applications,” CoRR, vol. K. Button, M. Knight, B. Chess, and J. Schulman,
abs/2201.08239, 2022. “Webgpt: Browser-assisted question-answering with
[64] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, human feedback,” CoRR, vol. abs/2112.09332, 2021.
W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, [73] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang,
A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring
A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. the limits of transfer learning with a unified text-
Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. to-text transformer,” J. Mach. Learn. Res., pp. 140:1–
Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, 140:67, 2020.
and J. Wei, “Scaling instruction-finetuned language [74] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-
models,” CoRR, vol. abs/2210.11416, 2022. Rfou, A. Siddhant, A. Barua, and C. Raffel, “mt5: A
[65] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, massively multilingual pre-trained text-to-text trans-
“Deepspeed: System optimizations enable training former,” in Proceedings of the 2021 Conference of the
deep learning models with over 100 billion parame- North American Chapter of the Association for Com-
ters,” in KDD, 2020, pp. 3505–3506. putational Linguistics: Human Language Technologies,
[66] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp.
J. Casper, and B. Catanzaro, “Megatron-lm: Training 483–498.
multi-billion parameter language models using model [75] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang,
parallelism,” CoRR, vol. abs/1909.08053, 2019. X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li,
[67] D. Narayanan, M. Shoeybi, J. Casper, P. LeGres- Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo,
ley, M. Patwary, V. Korthikanti, D. Vainbrand, Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi,
42

F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang, model,” vol. abs/2210.02414, 2022.


Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan, [84] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts,
Y. Wang, X. Jin, Q. Liu, and Y. Tian, “Pangu-α: S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong,
Large-scale autoregressive pretrained chinese lan- H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Al-
guage models with auto-parallel computation,” CoRR, mubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff,
vol. abs/2104.12369, 2021. and C. Raffel, “Crosslingual generalization through
[76] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, multitask finetuning,” CoRR, vol. abs/2211.01786,
Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai, G. Zeng, Z. Tan, 2022.
Z. Liu, M. Huang, W. Han, Y. Liu, X. Zhu, and [85] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig,
M. Sun, “CPM-2: large-scale cost-effective pre-trained P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, X. Li,
language models,” CoRR, vol. abs/2106.10715, 2021. B. O’Horo, G. Pereyra, J. Wang, C. Dewan, A. Celikyil-
[77] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, maz, L. Zettlemoyer, and V. Stoyanov, “OPT-IML: scal-
Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An ing language model instruction meta learning through
open large language model for code with mtulti-turn the lens of generalization,” CoRR, vol. abs/2212.12017,
program synthesis,” arXiv preprint arXiv:2203.13474, 2022.
2022. [86] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue,
[78] S. Black, S. Biderman, E. Hallahan, Q. Anthony, Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex:
L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, A pre-trained model for code generation with mul-
J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, tilingual evaluations on humaneval-x,” arXiv preprint
L. Reynolds, J. Tow, B. Wang, and S. Weinbach, “Gpt- arXiv:2303.17568, 2023.
neox-20b: An open-source autoregressive language [87] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley,
model,” CoRR, vol. abs/2204.06745, 2022. K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S.
[79] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, Prashanth, E. Raff et al., “Pythia: A suite for analyzing
A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, large language models across training and scaling,”
A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, arXiv preprint arXiv:2304.01373, 2023.
H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuz- [88] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat,
nia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen,
M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, “Gshard: Scaling giant models with conditional com-
P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. Sampat, putation and automatic sharding,” in 9th International
S. Mishra, S. R. A, S. Patro, T. Dixit, and X. Shen, Conference on Learning Representations, ICLR 2021, Vir-
“Super-naturalinstructions: Generalization via declar- tual Event, Austria, May 3-7, 2021, 2021.
ative instructions on 1600+ NLP tasks,” in Proceedings [89] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P.
of the 2022 Conference on Empirical Methods in Natural de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda,
Language Processing, EMNLP 2022, Abu Dhabi, United N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger,
Arab Emirates, December 7-11, 2022, 2022, pp. 5085– M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan,
5109. S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser,
[80] Y. Tay, M. Dehghani, V. Q. Tran, X. Garcı́a, J. Wei, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cum-
X. Wang, H. W. Chung, D. Bahri, T. Schuster, mings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-
H. Zheng, D. Zhou, N. Houlsby, and D. Metzler, “Ul2: Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak,
Unifying language learning paradigms,” 2022. J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saun-
[81] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, ders, C. Hesse, A. N. Carr, J. Leike, J. Achiam,
S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, V. Misra, E. Morikawa, A. Radford, M. Knight,
T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, M. Brundage, M. Murati, K. Mayer, P. Welinder,
P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever,
“OPT: open pre-trained transformer language mod- and W. Zaremba, “Evaluating large language models
els,” CoRR, vol. abs/2205.01068, 2022. trained on code,” CoRR, vol. abs/2107.03374, 2021.
[82] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, [90] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang,
K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu,
D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, X. Ouyang,
A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, D. Yu, H. Tian, H. Wu, and H. Wang, “ERNIE 3.0:
P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, Large-scale knowledge enhanced pre-training for lan-
D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, guage understanding and generation,” CoRR, vol.
S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, abs/2107.02137, 2021.
F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, [91] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-
S. Saleem, H. Schwenk, and J. Wang, “No language 1: Technical details and evaluation,” White Paper. AI21
left behind: Scaling human-centered machine transla- Labs, vol. 1, 2021.
tion,” CoRR, vol. abs/2207.04672, 2022. [92] B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon,
[83] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong,
Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim, S. Kang,
Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and N. Ryu, K. M. Yoo, M. Chang, S. Suh, S. In, J. Park,
J. Tang, “GLM-130B: an open bilingual pre-trained K. Kim, H. Kim, J. Jeong, Y. G. Yeo, D. Ham, D. Park,
43

M. Y. Lee, J. Kang, I. Kang, J. Ha, W. Park, and [100] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides,
N. Sung, “What changes can large-scale language V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chad-
models bring? intensive study on hyperclova: Billions- wick, P. Thacker, L. Campbell-Gillingham, J. Ue-
scale korean generative pretrained transformers,” in sato, P. Huang, R. Comanescu, F. Yang, A. See,
Proceedings of the 2021 Conference on Empirical Methods S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S. Elias,
in Natural Language Processing, EMNLP 2021, Virtual R. Green, S. Mokrá, N. Fernando, B. Wu, R. Foley,
Event / Punta Cana, Dominican Republic, 7-11 November, S. Young, I. Gabriel, W. Isaac, J. Mellor, D. Hassabis,
2021. Association for Computational Linguistics, K. Kavukcuoglu, L. A. Hendricks, and G. Irving,
2021. “Improving alignment of dialogue agents via targeted
[93] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, human judgements,” CoRR, vol. abs/2209.14375, 2022.
H. Zhu, J. Luo, L. Xu et al., “Yuan 1.0: Large-scale [101] H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and
pre-trained language model in zero-shot and few-shot J. Zhou, “Welm: A well-read pre-trained language
learning,” arXiv preprint arXiv:2110.04725, 2021. model for chinese,” CoRR, vol. abs/2209.10372, 2022.
[94] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, [102] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So,
T. Henighan, A. Jones, N. Joseph, B. Mann, N. Das- S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowdh-
Sarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, ery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby, Q. V.
J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Le, and M. Dehghani, “Transcending scaling laws
Brown, J. Clark, S. McCandlish, C. Olah, and J. Ka- with 0.1% extra compute,” CoRR, vol. abs/2210.11399,
plan, “A general language assistant as a laboratory 2022.
for alignment,” CoRR, vol. abs/2112.00861, 2021. [103] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang,
[95] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov,
S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su,
Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao, Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion pa-
S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu, rameter language model with sparse heterogeneous
W. Zeng, G. Li, W. Gao, and H. Wang, “ERNIE 3.0 computing,” CoRR, vol. abs/2303.10845, 2023.
titan: Exploring larger-scale knowledge enhanced pre- [104] A. Radford, R. Józefowicz, and I. Sutskever, “Learn-
training for language understanding and generation,” ing to generate reviews and discovering sentiment,”
CoRR, vol. abs/2112.12731, 2021. CoRR, vol. abs/1704.01444, 2017.
[96] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, [105] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever
Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, B. Zoph, et al., “Improving language understanding by genera-
L. Fedus, M. P. Bosma, Z. Zhou, T. Wang, Y. E. tive pre-training,” 2018.
Wang, K. Webster, M. Pellat, K. Robinson, K. S. Meier- [106] B. McCann, N. S. Keskar, C. Xiong, and R. Socher,
Hellstern, T. Duke, L. Dixon, K. Zhang, Q. V. Le, “The natural language decathlon: Multitask learning
Y. Wu, Z. Chen, and C. Cui, “Glam: Efficient scaling as question answering,” CoRR, vol. abs/1806.08730,
of language models with mixture-of-experts,” in In- 2018.
ternational Conference on Machine Learning, ICML 2022, [107] Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett,
17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. X. Gao, J. Gao, J. Liu, and B. Dolan, “DIALOGPT :
5547–5569. Large-scale generative pre-training for conversational
[97] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajb- response generation,” in Proceedings of the 58th Annual
handari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, Meeting of the Association for Computational Linguistics:
V. Korthikanti, E. Zheng, R. Child, R. Y. Aminabadi, System Demonstrations, ACL 2020, Online, July 5-10,
J. Bernauer, X. Song, M. Shoeybi, Y. He, M. Hous- 2020, A. Celikyilmaz and T. Wen, Eds. Association
ton, S. Tiwary, and B. Catanzaro, “Using deepspeed for Computational Linguistics, 2020, pp. 270–278.
and megatron to train megatron-turing NLG 530b, [108] D. Ham, J. Lee, Y. Jang, and K. Kim, “End-to-end neu-
A large-scale generative language model,” CoRR, vol. ral pipeline for goal-oriented dialogue systems using
abs/2201.11990, 2022. GPT-2,” in Proceedings of the 58th Annual Meeting of the
[98] Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrit- Association for Computational Linguistics, ACL 2020, On-
twieser, R. Leblond, T. Eccles, J. Keeling, F. Gi- line, July 5-10, 2020. Association for Computational
meno, A. D. Lago, T. Hubert, P. Choy, C. de Mas- Linguistics, 2020, pp. 583–592.
son d’Autume, I. Babuschkin, X. Chen, P. Huang, [109] I. Drori, S. Tran, R. Wang, N. Cheng, K. Liu, L. Tang,
J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. E. Ke, N. Singh, T. L. Patti, J. Lynch, A. Shporer,
Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, N. Verma, E. Wu, and G. Strang, “A neural network
K. Kavukcuoglu, and O. Vinyals, “Competition-level solves and generates mathematics problems by pro-
code generation with alphacode,” Science, 2022. gram synthesis: Calculus, differential equations, linear
[99] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, algebra, and more,” CoRR, vol. abs/2112.15594, 2021.
W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosen- [110] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. Han,
baum, A. Rumshisky, C. S. Prakash, M. Sridhar, J. Tworek, Q. Yuan, N. Tezak, J. W. Kim, C. Hal-
F. Triefenbach, A. Verma, G. Tür, and P. Natara- lacy, J. Heidecke, P. Shyam, B. Power, T. E. Nekoul,
jan, “Alexatm 20b: Few-shot learning using a G. Sastry, G. Krueger, D. Schnurr, F. P. Such, K. Hsu,
large-scale multilingual seq2seq model,” CoRR, vol. M. Thompson, T. Khan, T. Sherbakov, J. Jang, P. Welin-
abs/2208.01448, 2022. der, and L. Weng, “Text and code embeddings by
44

contrastive pre-training,” CoRR, vol. abs/2201.10005, against neural fake news,” in Advances in Neural Infor-
2022. mation Processing Systems 32: Annual Conference on Neu-
[111] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, ral Information Processing Systems 2019, NeurIPS 2019,
and O. Klimov, “Proximal policy optimization algo- December 8-14, 2019, Vancouver, BC, Canada, H. M.
rithms,” arXiv preprint arXiv:1707.06347, 2017. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-
[112] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 9051–
C. Voss, A. Radford, D. Amodei, and P. F. Chris- 9062.
tiano, “Learning to summarize from human feed- [126] A. Gokaslan, V. C. E. Pavlick, and S. Tellex,
back,” CoRR, vol. abs/2009.01325, 2020. “Openwebtext corpus,” http://Skylion007.github.io/
[113] OpenAI, “Our approach to alignment research,” Ope- OpenWebTextCorpus, 2019.
nAI Blog, August 2022. [127] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire,
[114] ——, “Introducing chatgpt,” OpenAI Blog, November and J. Blackburn, “The pushshift reddit dataset,” in
2022. Proceedings of the Fourteenth International AAAI Con-
[115] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, ference on Web and Social Media, ICWSM 2020, Held
S. Kadavath, B. Mann, E. Perez, N. Schiefer, Virtually, Original Venue: Atlanta, Georgia, USA, June
K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Con- 8-11, 2020. AAAI Press, 2020, pp. 830–839.
erly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, [128] “Wikipedia.” [Online]. Available: https://en.
S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernan- wikipedia.org/wiki/Main Page
dez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, [129] “Bigquery dataset.” [Online]. Available: https://
C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, cloud.google.com/bigquery?hl=zh-cn
T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Ka- [130] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe,
plan, and J. Clark, “Red teaming language models C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima,
to reduce harms: Methods, scaling behaviors, and S. Presser, and C. Leahy, “The pile: An 800gb dataset
lessons learned,” CoRR, vol. abs/2209.07858, 2022. of diverse text for language modeling,” CoRR, vol.
[116] OpenAI, “Lessons learned on language model safety abs/2101.00027, 2021.
and misuse,” OpenAI Blog, March 2022. [131] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V.
[117] L. Huawei Technologies Co., “Huawei mindspore del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G.
ai development framework,” in Artificial Intelligence Ponferrada, H. Nguyen et al., “The bigscience roots
Technology. Springer, 2022, pp. 137–162. corpus: A 1.6 tb composite multilingual dataset,” in
[118] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, Thirty-sixth Conference on Neural Information Processing
C. Guestrin, P. Liang, and T. B. Hashimoto, “Stan- Systems Datasets and Benchmarks Track, 2022.
ford alpaca: An instruction-following llama model,” [132] “Common crawl.” [Online]. Available: https://
https://github.com/tatsu-lab/stanford alpaca, 2023. commoncrawl.org/
[119] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, [133] “A reproduction version of cc-stories on hugging
L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, face.” [Online]. Available: https://huggingface.co/
I. Stoica, and E. P. Xing, “Vicuna: An open-source datasets/spacemanidol/cc-stories
chatbot impressing gpt-4 with 90%* chatgpt quality,” [134] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion
2023. [Online]. Available: https://vicuna.lmsys.org Parameter Autoregressive Language Model,” https://
[120] 2023. [Online]. Available: https://github.com/ github.com/kingoflolz/mesh-transformer-jax, 2021.
nebuly-ai/nebullvm/tree/main/apps/accelerate/ [135] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue,
chatllama A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz,
[121] Y. You, “Colossalchat: An open-source J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jer-
solution for cloning chatgpt with a complete nite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame,
rlhf pipeline,” 2023. [Online]. Available: Q. Lhoest, and A. M. Rush, “Transformers: State-of-
https://medium.com/@yangyou berkeley/ the-art natural language processing,” in Proceedings of
colossalchat-an-open-source-solution-for-cloning- the 2020 Conference on Empirical Methods in Natural Lan-
chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b guage Processing: System Demonstrations, EMNLP 2020
[122] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Ur- - Demos, Online, November 16-20, 2020. Association
tasun, A. Torralba, and S. Fidler, “Aligning books for Computational Linguistics, 2020, pp. 38–45.
and movies: Towards story-like visual explanations [136] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson,
by watching movies and reading books,” in 2015 IEEE C. Leary, D. Maclaurin, G. Necula, A. Paszke,
International Conference on Computer Vision, ICCV 2015, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang,
Santiago, Chile, December 7-13, 2015. IEEE Computer “JAX: composable transformations of Python+NumPy
Society, 2015, pp. 19–27. programs,” 2018. [Online]. Available: http://github.
[123] “Project gutenberg.” [Online]. Available: https:// com/google/jax
www.gutenberg.org/ [137] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang,
[124] T. H. Trinh and Q. V. Le, “A simple method for F. Cui, and Y. You, “Colossal-ai: A unified deep learn-
commonsense reasoning,” CoRR, vol. abs/1806.02847, ing system for large-scale parallel training,” CoRR,
2018. vol. abs/2110.14883, 2021.
[125] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, [138] J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrick-
A. Farhadi, F. Roesner, and Y. Choi, “Defending star: Parallel training of pre-trained models via
45

a chunk-based memory management,” CoRR, vol. [150] Z. Manna and R. J. Waldinger, “Toward automatic
abs/2108.05818, 2021. program synthesis,” Commun. ACM, vol. 14, no. 3, pp.
[139] “Bmtrain: Effient training for big models.” [Online]. 151–165, 1971.
Available: https://github.com/OpenBMB/BMTrain [151] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong,
[140] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou,
“Fastmoe: A fast mixture-of-expert training system,” “Codebert: A pre-trained model for programming and
CoRR, vol. abs/2103.13262, 2021. natural languages,” in Findings of EMNLP, 2020.
[141] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Brad- [152] J. Austin, A. Odena, M. I. Nye, M. Bosma,
bury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry,
L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. De- Q. V. Le, and C. Sutton, “Program synthesis with large
Vito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, language models,” CoRR, vol. abs/2108.07732, 2021.
L. Fang, J. Bai, and S. Chintala, “Pytorch: An imper- [153] S. Black, L. Gao, P. Wang, C. Leahy, and S. Bi-
ative style, high-performance deep learning library,” derman, “GPT-Neo: Large Scale Autoregressive Lan-
in Advances in Neural Information Processing Systems guage Modeling with Mesh-Tensorflow,” 2021.
32: Annual Conference on Neural Information Process- [154] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn,
ing Systems 2019, NeurIPS 2019, December 8-14, 2019, “A systematic evaluation of large language models of
Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, code,” in MAPS@PLDI, 2022.
A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Gar- [155] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace,
nett, Eds., 2019, pp. 8024–8035. F. Shi, R. Zhong, W. Yih, L. Zettlemoyer, and M. Lewis,
[142] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, “Incoder: A generative model for code infilling and
J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Is- synthesis,” in ICLR, 2023.
ard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, [156] A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neubig,
D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, “Language models of code are few-shot commonsense
P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensor- learners,” in Proceedings of the 2022 Conference on Em-
flow: A system for large-scale machine learning,” in pirical Methods in Natural Language Processing, EMNLP
12th USENIX Symposium on Operating Systems Design 2022, Abu Dhabi, United Arab Emirates, December 7-11,
and Implementation, OSDI 2016, Savannah, GA, USA, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds.
November 2-4, 2016, K. Keeton and T. Roscoe, Eds. Association for Computational Linguistics, 2022, pp.
USENIX Association, 2016, pp. 265–283. 1384–1403.
[143] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, [157] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats,
T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: M. Jamnik, and C. Szegedy, “Autoformalization with
A flexible and efficient machine learning library large language models,” CoRR, vol. abs/2205.12615,
for heterogeneous distributed systems,” CoRR, vol. 2022.
abs/1512.01274, 2015. [158] D. Hernandez, T. B. Brown, T. Conerly, N. DasSarma,
[144] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-Dodds,
open-source deep learning platform from industrial T. Henighan, T. Hume, S. Johnston, B. Mann, C. Olah,
practice,” Frontiers of Data and Domputing, vol. 1, no. 1, C. Olsson, D. Amodei, N. Joseph, J. Kaplan, and S. Mc-
p. 105, 2019. Candlish, “Scaling laws and interpretability of learn-
[145] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, ing from repeated data,” CoRR, vol. abs/2205.10487,
F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “One- 2022.
flow: Redesign the distributed deep learning frame- [159] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi,
work from scratch,” CoRR, vol. abs/2110.15032, 2021. “The curious case of neural text degeneration,” in
[146] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, 8th International Conference on Learning Representations,
Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.
J. Weston, “Recipes for building an open-domain chat- OpenReview.net, 2020.
bot,” in Proceedings of the 16th Conference of the European [160] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck,
Chapter of the Association for Computational Linguistics: C. Callison-Burch, and N. Carlini, “Deduplicating
Main Volume, EACL 2021, Online, April 19 - 23, 2021, training data makes language models better,” in Pro-
2021, pp. 300–325. ceedings of the 60th Annual Meeting of the Association for
[147] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, Computational Linguistics (Volume 1: Long Papers), ACL
H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 8424–
I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, 8445.
G. Gur-Ari, and V. Misra, “Solving quantitative rea- [161] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr,
soning problems with language models,” CoRR, vol. and C. Zhang, “Quantifying memorization across
abs/2206.14858, 2022. neural language models,” CoRR, 2022.
[148] T. Saier, J. Krause, and M. Färber, “unarxive 2022: [162] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski,
All arxiv publications pre-processed for nlp, includ- A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown,
ing structured full-text and citation network,” arXiv D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Ex-
preprint arXiv:2303.14957, 2023. tracting training data from large language models,”
[149] H. A. Simon, “Experiments with a heuristic compiler,” in 30th USENIX Security Symposium, USENIX Security
J. ACM, vol. 10, no. 4, pp. 493–506, 1963. 2021, August 11-13, 2021, 2021, pp. 2633–2650.
46

[163] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicating cessing Systems 34: Annual Conference on Neural Infor-
training data mitigates privacy risks in language mod- mation Processing Systems 2021, NeurIPS 2021, December
els,” in International Conference on Machine Learning, 6-14, 2021, virtual, 2021, pp. 19 822–19 835.
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. [173] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normal-
PMLR, 2022, pp. 10 697–10 707. ization,” vol. abs/1607.06450, 2016.
[164] T. Kudo and J. Richardson, “Sentencepiece: A simple [174] B. Zhang and R. Sennrich, “Root mean square layer
and language independent subword tokenizer and normalization,” in Advances in Neural Information Pro-
detokenizer for neural text processing,” in Proceedings cessing Systems 32: Annual Conference on Neural Infor-
of the 2018 Conference on Empirical Methods in Natural mation Processing Systems 2019, NeurIPS 2019, December
Language Processing, EMNLP 2018: System Demonstra- 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 12 360–
tions, Brussels, Belgium, October 31 - November 4, 2018, 12 371.
E. Blanco and W. Lu, Eds. Association for Computa- [175] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and
tional Linguistics, 2018. F. Wei, “Deepnet: Scaling transformers to 1, 000 lay-
[165] R. Sennrich, B. Haddow, and A. Birch, “Neural ma- ers,” vol. abs/2203.00555, 2022.
chine translation of rare words with subword units,” [176] V. Nair and G. E. Hinton, “Rectified linear units im-
in Proceedings of the 54th Annual Meeting of the Asso- prove restricted boltzmann machines,” in Proceedings
ciation for Computational Linguistics, ACL 2016, August of the 27th international conference on machine learning
7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The (ICML-10), 2010, pp. 807–814.
Association for Computer Linguistics, 2016. [177] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy,
[166] M. Davis and M. Dürst, “Unicode normalization and S. R. Bowman, “GLUE: A multi-task bench-
forms,” 2001. mark and analysis platform for natural language un-
[167] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. derstanding,” in Proceedings of the Workshop: Analyz-
Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, ing and Interpreting Neural Networks for NLP, Black-
and R. Fernández, “The LAMBADA dataset: Word boxNLP@EMNLP 2018, Brussels, Belgium, November 1,
prediction requiring a broad discourse context,” in 2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds.
ACL (1). The Association for Computer Linguistics, Association for Computational Linguistics, 2018, pp.
2016. 353–355.
[168] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, [178] P. Ramachandran, B. Zoph, and Q. V. Le,
and I. Sutskever, “Deep double descent: Where bigger “Searching for activation functions,” arXiv preprint
models and more data hurt,” in 8th International Con- arXiv:1710.05941, 2017.
ference on Learning Representations, ICLR 2020, Addis [179] N. Shazeer, “GLU variants improve transformer,” vol.
Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, abs/2002.05202, 2020.
2020. [180] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: En-
[169] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia, hanced transformer with rotary position embedding,”
J. Shen, and O. Firat, “Examining scaling and transfer vol. abs/2104.09864, 2021.
of language model architectures for machine transla- [181] O. Press, N. A. Smith, and M. Lewis, “Train short, test
tion,” in International Conference on Machine Learning, long: Attention with linear biases enables input length
ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, extrapolation,” in The Tenth International Conference on
2022, pp. 26 176–26 192. Learning Representations, ICLR 2022, Virtual Event, April
[170] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, 25-29, 2022, 2022.
J. Gao, M. Zhou, and H. Hon, “Unified language [182] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing,
model pre-training for natural language understand- H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer nor-
ing and generation,” in Advances in Neural Informa- malization in the transformer architecture,” in ICML,
tion Processing Systems 32: Annual Conference on Neu- 2020.
ral Information Processing Systems 2019, NeurIPS 2019, [183] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry,
December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan,
13 042–13 054. Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts,
[171] A. Clark, D. de Las Casas, A. Guy, A. Mensch, and C. Raffel, “Do transformer modifications transfer
M. Paganini, J. Hoffmann, B. Damoc, B. A. Hecht- across implementations and applications?” in Proceed-
man, T. Cai, S. Borgeaud, G. van den Driessche, ings of the 2021 Conference on Empirical Methods in Nat-
E. Rutherford, T. Hennigan, M. J. Johnson, A. Cassirer, ural Language Processing, EMNLP 2021, Virtual Event /
C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osin- Punta Cana, Dominican Republic, 7-11 November, 2021,
dero, O. Vinyals, M. Ranzato, J. W. Rae, E. Elsen, 2021, pp. 5758–5773.
K. Kavukcuoglu, and K. Simonyan, “Unified scaling [184] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S. Bari,
laws for routed language models,” in International S. Biderman, H. Elsahar, N. Muennighoff, J. Phang,
Conference on Machine Learning, ICML 2022, 17-23 July O. Press, C. Raffel, V. Sanh, S. Shen, L. Sutawika, J. Tae,
2022, Baltimore, Maryland, USA, 2022, pp. 4057–4086. Z. X. Yong, J. Launay, and I. Beltagy, “What language
[172] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, model to train if you have one million GPU hours?” in
D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, Findings of the Association for Computational Linguistics:
“Cogview: Mastering text-to-image generation via EMNLP 2022, Abu Dhabi, United Arab Emirates, Decem-
transformers,” in Advances in Neural Information Pro- ber 7-11, 2022, 2022, pp. 765–782.
47

[185] D. Hendrycks and K. Gimpel, "Gaussian error linear units (gelus)," arXiv preprint arXiv:1606.08415, 2016.
[186] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 933–941.
[187] R. Child, S. Gray, A. Radford, and I. Sutskever, "Generating long sequences with sparse transformers," CoRR, vol. abs/1904.10509, 2019.
[188] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, "Random feature attention," in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
[189] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, "Big bird: Transformers for longer sequences," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[190] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re, "Flashattention: Fast and memory-efficient exact attention with IO-awareness," in NeurIPS, 2022.
[191] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[192] I. Loshchilov and F. Hutter, "Fixing weight decay regularization in adam," CoRR, vol. abs/1711.05101, 2017.
[193] N. Shazeer and M. Stern, "Adafactor: Adaptive learning rates with sublinear memory cost," in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, ser. Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp. 4603–4611.
[194] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, "Gpipe: Efficient training of giant neural networks using pipeline parallelism," in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 103–112.
[195] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, and P. B. Gibbons, "Pipedream: Fast and efficient pipeline parallel DNN training," CoRR, vol. abs/1806.03377, 2018.
[196] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, "Zero: memory optimizations toward training trillion parameter models," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, C. Cuicchi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM, 2020, p. 20.
[197] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, "Mixed precision training," CoRR, vol. abs/1710.03740, 2017.
[198] Q. Xu, S. Li, C. Gong, and Y. You, "An efficient 2d method for training super-large deep learning models," CoRR, vol. abs/2104.05343, 2021.
[199] B. Wang, Q. Xu, Z. Bian, and Y. You, "Tesseract: Parallelize the tensor parallelism efficiently," in Proceedings of the 51st International Conference on Parallel Processing, ICPP 2022, Bordeaux, France, 29 August 2022 - 1 September 2022. ACM, 2022.
[200] Z. Bian, Q. Xu, B. Wang, and Y. You, "Maximizing parallelism in distributed training for huge neural networks," CoRR, vol. abs/2105.14450, 2021.
[201] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, "Sequence parallelism: Long sequence training from system perspective," arXiv e-prints, pp. arXiv–2105, 2021.
[202] FairScale authors, "Fairscale: A general purpose modular pytorch library for high performance and large scale training," https://github.com/facebookresearch/fairscale, 2021.
[203] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing et al., "Alpa: Automating inter- and intra-operator parallelism for distributed deep learning," in OSDI, 2022, pp. 559–578.
[204] T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," CoRR, vol. abs/1604.06174, 2016.
[205] Z. Yao, C. Li, X. Wu, S. Youn, and Y. He, "A comprehensive study on post-training quantization for large language models," CoRR, vol. abs/2303.08302, 2023.
[206] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "Llm.int8(): 8-bit matrix multiplication for transformers at scale," CoRR, vol. abs/2208.07339, 2022.
[207] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, "Compression of generative pre-trained language models via quantization," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 4821–4836.
[208] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, "Cross-task generalization via natural language crowdsourcing instructions," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–3487.
[209] Q. Ye, B. Y. Lin, and X. Ren, "Crossfit: A few-shot learning challenge for cross-task generalization in NLP," in EMNLP (1). Association for Computational Linguistics, 2021, pp. 7163–7189.
[210] S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry, Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David, C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S. AlShaibani, S. Sharma, U. Thakker, K. Almubarak, X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush, "Promptsource: An integrated development
environment and repository for natural language prompts," in ACL (demo). Association for Computational Linguistics, 2022, pp. 93–104.
[211] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta, H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, J. P. Gupta, K. Hui, S. Ruder, and D. Metzler, "Ext5: Towards extreme multi-task scaling for transfer learning," in ICLR. OpenReview.net, 2022.
[212] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C. Wu, M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle, A. Ni, Z. Yao, D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith, L. Zettlemoyer, and T. Yu, "Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models," in EMNLP. Association for Computational Linguistics, 2022, pp. 602–631.
[213] T. Tang, J. Li, W. X. Zhao, and J. Wen, "MVP: multi-task supervised pre-training for natural language generation," CoRR, vol. abs/2206.12131, 2022.
[214] R. Lou, K. Zhang, and W. Yin, "Is prompt all you need? no. A comprehensive and broader view of instruction learning," CoRR, vol. abs/2303.10475, 2023.
[215] X. Liu, P. He, W. Chen, and J. Gao, "Multi-task deep neural networks for natural language understanding," in ACL (1). Association for Computational Linguistics, 2019, pp. 4487–4496.
[216] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta, "Muppet: Massive multi-task representations with pre-finetuning," in EMNLP (1). Association for Computational Linguistics, 2021, pp. 5799–5811.
[217] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts, "The flan collection: Designing data and methods for effective instruction tuning," CoRR, vol. abs/2301.13688, 2023.
[218] Y. Gu, P. Ke, X. Zhu, and M. Huang, "Learning instructions with unlabeled data for zero-shot cross-task generalization," in EMNLP. Association for Computational Linguistics, 2022, pp. 1617–1634.
[219] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-instruct: Aligning language model with self generated instructions," CoRR, vol. abs/2212.10560, 2022.
[220] O. Honovich, T. Scialom, O. Levy, and T. Schick, "Unnatural instructions: Tuning language models with (almost) no human labor," CoRR, vol. abs/2212.09689, 2022.
[221] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford alpaca: An instruction-following llama model," https://github.com/tatsu-lab/stanford_alpaca, 2023.
[222] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, and G. Irving, "Alignment of language agents," CoRR, vol. abs/2103.14659, 2021.
[223] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan, "A general language assistant as a laboratory for alignment," CoRR, vol. abs/2112.00861, 2021.
[224] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, "Training a helpful and harmless assistant with reinforcement learning from human feedback," CoRR, vol. abs/2204.05862, 2022.
[225] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, "Red teaming language models with language models," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 3419–3448.
[226] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. F. Christiano, and G. Irving, "Fine-tuning language models from human preferences," CoRR, vol. abs/1909.08593, 2019.
[227] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, H. F. Song, M. Chadwick, M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, and N. McAleese, "Teaching language models to support answers with verified quotes," CoRR, vol. abs/2203.11147, 2022.
[228] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. F. Christiano, "Recursively summarizing books with human feedback," CoRR, vol. abs/2109.10862, 2021.
[229] N. Ding, Y. Qin, G. Yang, F. Wei, Y. Zonghan, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen, J. Yi, W. Zhao, X. Wang, Z. Liu, H.-T. Zheng, J. Chen, Y. Liu, J. Tang, J. Li, and M. Sun, "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, pp. 1–16, 2023.
[230] X. L. Li and P. Liang, "Prefix-tuning: Optimizing continuous prompts for generation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Association for Computational Linguistics, 2021, pp. 4582–4597.
[231] B. Lester, R. Al-Rfou, and N. Constant, "The power of scale for parameter-efficient prompt tuning," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 3045–3059.
[232] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "Lora: Low-rank adaptation of large language models," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[233] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, "Parameter-efficient transfer learning for NLP," in Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 2019, pp. 2790–2799.
[234] Z. Hu, Y. Lan, L. Wang, W. Xu, E. Lim, R. K. Lee, L. Bing, and S. Poria, "Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models," CoRR, vol. abs/2304.01933, 2023.
[235] J. He, C. Zhou, X. Ma, T. Berg-Kirkpatrick, and G. Neubig, "Towards a unified view of parameter-efficient transfer learning," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[236] X. Liu, K. Ji, Y. Fu, Z. Du, Z. Yang, and J. Tang, "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," CoRR, vol. abs/2110.07602, 2021.
[237] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, "GPT understands, too," CoRR, vol. abs/2103.10385, 2021.
[238] Y. Gu, X. Han, Z. Liu, and M. Huang, "Ppt: Pre-trained prompt tuning for few-shot learning," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 8410–8423.
[239] Z. Jiang, F. F. Xu, J. Araki, and G. Neubig, "How can we know what language models know?" Transactions of the Association for Computational Linguistics, vol. 8, pp. 423–438, 2020.
[240] T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh, "Autoprompt: Eliciting knowledge from language models with automatically generated prompts," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 4222–4235.
[241] Q. Zhang, M. Chen, A. Bukharin, P. He, Y. Cheng, W. Chen, and T. Zhao, "Adaptive budget allocation for parameter-efficient fine-tuning," CoRR, vol. abs/2303.10512, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.10512
[242] M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi, "Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation," CoRR, vol. abs/2210.07558, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2210.07558
[243] Alpaca-LoRA, "Instruct-tune llama on consumer hardware," https://github.com/tloen/alpaca-lora, 2023.
[244] R. Zhang, J. Han, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, P. Gao, and Y. Qiao, "Llama-adapter: Efficient fine-tuning of language models with zero-init attention," CoRR, vol. abs/2303.16199, 2023.
[245] J. Pfeiffer, I. Vulic, I. Gurevych, and S. Ruder, "MAD-X: an adapter-based framework for multi-task cross-lingual transfer," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, B. Webber, T. Cohn, Y. He, and Y. Liu, Eds. Association for Computational Linguistics, 2020, pp. 7654–7673.
[246] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, and S. Paul, "Peft: State-of-the-art parameter-efficient fine-tuning methods," https://github.com/huggingface/peft, 2022.
[247] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, "Rethinking the role of demonstrations: What makes in-context learning work?" in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2022, pp. 11048–11064.
[248] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, "Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., 2022, pp. 8086–8098.
[249] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, "Calibrate before use: Improving few-shot performance of language models," in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang, Eds., 2021, pp. 12697–12706.
[250] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, "What makes good in-context examples for gpt-3?" in Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022, 2022, pp. 100–114.
[251] Y. Lee, C. Lim, and H. Choi, "Does GPT-3 generate empathetic dialogues? A novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation," in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, Eds. International Committee on Computational Linguistics, 2022, pp. 669–683.
[252] I. Levy, B. Bogin, and J. Berant, "Diverse demonstrations improve in-context compositional generalization," CoRR, vol. abs/2212.06800, 2022.
[253] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith, and T. Yu, "Selective annotation makes language models better few-shot learners," CoRR, 2022.
[254] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, and R. Pasunuru, "Complementary explanations for effective in-context learning," CoRR, 2022.
[255] X. Li and X. Qiu, "Finding supporting examples for in-context learning," CoRR, 2023.
[256] O. Rubin, J. Herzig, and J. Berant, "Learning to retrieve prompts for in-context learning," in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2655–2671.
[257] Y. Zhang, S. Feng, and C. Tan, "Active example selection for in-context learning," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 9134–9148.
[258] F. Gilardi, M. Alizadeh, and M. Kubli, "Chatgpt outperforms crowd-workers for text-annotation tasks," 2023.
[259] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and S. Lee, "Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator," CoRR, vol. abs/2206.08082, 2022.
[260] Y. Lin, A. Papangelis, S. Kim, S. Lee, D. Hazarika, M. Namazifar, D. Jin, Y. Liu, and D. Hakkani-Tur, "Selective in-context data augmentation for intent detection using pointwise v-information," CoRR, 2023.
[261] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma, "An explanation of in-context learning as implicit bayesian inference," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
[262] Z. Zhang, A. Zhang, M. Li, and A. Smola, "Automatic chain of thought prompting in large language models," CoRR, vol. abs/2210.03493, 2022.
[263] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and E. H. Chi, "Least-to-most prompting enables complex reasoning in large language models," CoRR, vol. abs/2205.10625, 2022.
[264] Z. Wu, Y. Wang, J. Ye, and L. Kong, "Self-adaptive in-context learning," CoRR, vol. abs/2212.10375, 2022.
[265] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi, "Metaicl: Learning to learn in context," in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, Eds., 2022, pp. 2791–2809.
[266] S. C. Y. Chan, A. Santoro, A. K. Lampinen, J. X. Wang, A. Singh, P. H. Richemond, J. McClelland, and F. Hill, "Data distributional properties drive emergent in-context learning in transformers," CoRR, vol. abs/2205.05055, 2022.
[267] S. Shin, S. Lee, H. Ahn, S. Kim, H. Kim, B. Kim, K. Cho, G. Lee, W. Park, J. Ha, and N. Sung, "On the effect of pretraining corpora on in-context learning by a large-scale language model," in NAACL-HLT. Association for Computational Linguistics, 2022, pp. 5168–5186.
[268] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov, "Transformers learn in-context by gradient descent," CoRR, vol. abs/2212.07677, 2022.
[269] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah, "In-context learning and induction heads," CoRR, vol. abs/2209.11895, 2022.
[270] H. Bansal, K. Gopalakrishnan, S. Dingliwal, S. Bodapati, K. Kirchhoff, and D. Roth, "Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale," CoRR, vol. abs/2212.09095, 2022.
[271] Y. Li, M. E. Ildiz, D. S. Papailiopoulos, and S. Oymak, "Transformers as algorithms: Generalization and implicit model selection in in-context learning," CoRR, vol. abs/2301.07067, 2023.
[272] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou, "What learning algorithm is in-context learning? investigations with linear models," CoRR, vol. abs/2211.15661, 2022.
[273] S. Garg, D. Tsipras, P. Liang, and G. Valiant, "What can transformers learn in-context? A case study of simple function classes," CoRR, vol. abs/2208.01066, 2022.
[274] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, "Training verifiers to solve math word problems," CoRR, vol. abs/2110.14168, 2021.
[275] A. Patel, S. Bhattamishra, and N. Goyal, "Are NLP models really able to solve simple math word problems?" in NAACL-HLT. Association for Computational Linguistics, 2021, pp. 2080–2094.
[276] S. Miao, C. Liang, and K. Su, "A diverse corpus for evaluating and developing english math word problem solvers," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association for Computational Linguistics, 2020, pp. 975–984.
[277] A. Talmor, J. Herzig, N. Lourie, and J. Berant, "Commonsenseqa: A question answering challenge targeting commonsense knowledge," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4149–4158.
[278] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant, "Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies," Trans. Assoc. Comput. Linguistics, vol. 9, pp. 346–361, 2021.
[279] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou, and W. Chen, "On the advance of making language models better reasoners," CoRR, vol. abs/2206.02336, 2022.
[280] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, "Complexity-based prompting for multi-step reasoning," CoRR, vol. abs/2210.00720, 2022.
[281] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, "Large language models are zero-shot reasoners," CoRR, vol. abs/2205.11916, 2022.
[282] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou, "Self-consistency improves chain of thought reasoning in language models," CoRR, vol. abs/2203.11171, 2022.
[283] ——, "Rationale-augmented ensembles in language
models," CoRR, 2022.
[284] S. Imani, L. Du, and H. Shrivastava, "Mathprompter: Mathematical reasoning using large language models," arXiv preprint arXiv:2303.05398, 2023.
[285] E. Zelikman, J. Mu, N. D. Goodman, and Y. T. Wu, "Star: Self-taught reasoner bootstrapping reasoning with reasoning," 2022.
[286] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, "Large language models can self-improve," CoRR, vol. abs/2210.11610, 2022.
[287] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr, L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda, "Holistic evaluation of language models," CoRR, vol. abs/2211.09110, 2022.
[288] A. Madaan and A. Yazdanbakhsh, "Text and patterns: For effective chain of thought, it takes two to tango," CoRR, vol. abs/2209.07686, 2022.
[289] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, "Multimodal chain-of-thought reasoning in language models," CoRR, vol. abs/2302.00923, 2023.
[290] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei, "Language models are multilingual chain-of-thought reasoners," CoRR, vol. abs/2210.03057, 2022.
[291] K. Shridhar, A. Stolfo, and M. Sachan, "Distilling multi-step reasoning capabilities of large language models into smaller models via semantic decompositions," ArXiv, vol. abs/2212.00193, 2022.
[292] N. Ho, L. Schmid, and S. Yun, "Large language models are reasoning teachers," CoRR, vol. abs/2212.10071, 2022.
[293] L. C. Magister, J. Mallinson, J. Adámek, E. Malmi, and A. Severyn, "Teaching small language models to reason," CoRR, vol. abs/2212.08410, 2022.
[294] Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, "Specializing smaller language models towards multi-step reasoning," CoRR, vol. abs/2301.12726, 2023.
[295] A. Chan, Z. Zeng, W. Lake, B. Joshi, H. Chen, and X. Ren, "Knife: Distilling meta-reasoning knowledge with free-text rationales," in ICLR 2023 Workshop on Pitfalls of limited data and computation for Trustworthy ML.
[296] Z. Li, C. Wang, P. Ma, C. Liu, S. Wang, D. Wu, and C. Gao, "On the feasibility of specialized ability stealing for large language code models," CoRR, 2023.
[297] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, A. Bakalov, K. Guu, K. B. Hall, and M. Chang, "Promptagator: Few-shot dense retrieval from 8 examples," CoRR, 2022.
[298] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, "Building a large annotated corpus of english: The penn treebank," Comput. Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[299] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," in ICLR (Poster). OpenReview.net, 2017.
[300] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna, "Findings of the 2014 workshop on statistical machine translation," in WMT@ACL. The Association for Computer Linguistics, 2014, pp. 12–58.
[301] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, "Findings of the 2016 conference on machine translation," in WMT. The Association for Computer Linguistics, 2016, pp. 131–198.
[302] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri, "Findings of the 2019 conference on machine translation (WMT19)," in Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post, M. Turchi, and K. Verspoor, Eds. Association for Computational Linguistics, 2019, pp. 1–61.
[303] L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y. Graham, R. Grundkiewicz, B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn, C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri, "Findings of the 2020 conference on machine translation (WMT20)," in Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, Y. Graham, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, and M. Negri, Eds. Association for Computational Linguistics, 2020, pp. 1–55.
[304] F. Akhbardeh, A. Arkhangorodsky, M. Biesialska, O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-jussà, C. España-Bonet, A. Fan, C. Federmann, M. Freitag, Y. Graham, R. Grundkiewicz, B. Haddow, L. Harter, K. Heafield, C. Homan, M. Huck, K. Amponsah-Kaakyire, J. Kasai, D. Khashabi, K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz, M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa, M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin, and M. Zampieri, "Findings of the 2021 conference on machine translation (WMT21)," in Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C.
Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, and C. Monz, Eds. Association for Computational Linguistics, 2021, pp. 1–88.
[305] T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, M. Fishel, T. Gowda, Y. Graham, R. Grundkiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, M. Novák, M. Popel, and M. Popovic, "Findings of the 2022 conference on machine translation (WMT22)," in Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, and M. Zampieri, Eds. Association for Computational Linguistics, 2022, pp. 1–45.
[306] N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan, "The flores-101 evaluation benchmark for low-resource and multilingual machine translation," Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538, 2022.
[307] R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset, "Diabla: a corpus of bilingual spontaneous written dialogues for machine translation," Lang. Resour. Evaluation, vol. 55, no. 3, pp. 635–660, 2021.
[308] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang, "Abstractive text summarization using sequence-to-sequence rnns and beyond," in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, Y. Goldberg and S. Riezler, Eds. ACL, 2016, pp. 280–290.
[309] S. Narayan, S. B. Cohen, and M. Lapata, "Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization," in EMNLP. Association for Computational Linguistics, 2018, pp. 1797–1807.
[310] F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown, "Wikilingua: A new benchmark dataset for cross-lingual abstractive summarization," in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4034–4048.
[311] S. Moon, P. Shah, A. Kumar, and R. Subba, "Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs," in ACL (1). Association for Computational Linguistics, 2019, pp. 845–854.
[312] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, "Superglue: A stickier benchmark for general-purpose language understanding systems," in NeurIPS, 2019, pp. 3261–3275.
[313] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, "Measuring massive multitask language understanding," in ICLR. OpenReview.net, 2021.
[314] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei, "Challenging big-bench tasks and whether chain-of-thought can solve them," CoRR, vol. abs/2210.09261, 2022.
[315] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan, "CLUE: A chinese language understanding evaluation benchmark," in COLING. International Committee on Computational Linguistics, 2020, pp. 4762–4772.
[316] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, "Measuring coding challenge competence with APPS," in NeurIPS Datasets and Benchmarks, 2021.
[317] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu, "DS-1000: A natural and reliable benchmark for data science code generation," CoRR, vol. abs/2211.11501, 2022.
[318] Z. Wang, S. Zhou, D. Fried, and G. Neubig, "Execution-based evaluation for open-domain code generation," CoRR, vol. abs/2212.10481, 2022.
[319] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, "Natural questions: a benchmark for question answering research," Trans. Assoc. Comput. Linguistics, pp. 452–466, 2019.
[320] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, "Think you have solved question answering? try arc, the AI2 reasoning challenge," CoRR, vol. abs/1803.05457, 2018.
[321] S. Lin, J. Hilton, and O. Evans, "Truthfulqa: Measuring how models mimic human falsehoods," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 3214–3252.
[322] J. Berant, A. Chou, R. Frostig, and P. Liang, "Semantic parsing on freebase from question-answer pairs," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 2013, pp. 1533–1544.
[323] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, "Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017, pp. 1601–1611.
[324] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, "PIQA: reasoning about physical commonsense in natural language," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The
Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 7432–7439.
[325] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, "Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia," in The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II, 2019, pp. 69–78.
[326] Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang, X. Yan, and Y. Su, "Beyond I.I.D.: three levels of generalization for question answering on knowledge bases," in WWW '21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp. 3477–3488.
[327] S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou, J. Li, B. He, and H. Zhang, "KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 6101–6119.
[328] X. Hu, X. Wu, Y. Shu, and Y. Qu, "Logical form generation via multi-task learning for complex question answering over knowledge bases," in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, 2022, pp. 1687–1696.
[329] S. Longpre, Y. Lu, and J. Daiber, "MKQA: A linguistically diverse benchmark for multilingual open domain question answering," Trans. Assoc. Comput. Linguistics, vol. 9, pp. 1389–1406, 2021.
[330] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhattacharyya, "Scienceqa: a novel resource for question answering on scholarly articles," Int. J. Digit. Libr., vol. 23, no. 3, pp. 289–301, 2022.
[331] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, "Can a suit of armor conduct electricity? A new dataset for open book question answering," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 2381–2391.
[332] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, "MS MARCO: A human generated machine reading comprehension dataset," in Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, 2016.
[333] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal, "QASC: A dataset for question answering via sentence composition," in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8082–8090.
[334] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, "Squad: 100,000+ questions for machine comprehension of text," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 2383–2392.
[335] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston, "Key-value memory networks for directly reading documents," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 1400–1409.
[336] B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, "Assessing the factual accuracy of generated text," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, 2019, pp. 166–175.
[337] K. Toutanova and D. Chen, "Observed versus latent features for knowledge base and text inference," in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, CVSC 2015, Beijing, China, July 26-31, 2015, 2015, pp. 57–66.
[338] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor, "Freebase: a collaboratively created graph database for structuring human knowledge," in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250.
[339] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, "Convolutional 2d knowledge graph embeddings," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 1811–1818.
[340] G. A. Miller, "Wordnet: A lexical database for english," Commun. ACM, pp. 39–41, 1995.
[341] F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis, A. Bakhtin, Y. Wu, and A. H. Miller, "Language models as knowledge bases?" in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 2463–2473.
[342] F. Mahdisoltani, J. Biega, and F. M. Suchanek, "YAGO3: A knowledge base from multilingual wikipedias," in Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.
[343] F. M. Suchanek, G. Kasneci, and G. Weikum, "Yago: a core of semantic knowledge," in Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, 2007, pp. 697–706.
[344] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, "Boolq: Exploring the surprising difficulty of natural yes/no questions," in Proceedings of the 2019 Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2924–2936.
[345] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, "Socialiqa: Commonsense reasoning about social interactions," CoRR, vol. abs/1904.09728, 2019.
[346] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, "Hellaswag: Can a machine really finish your sentence?" in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 4791–4800.
[347] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, "Winogrande: An adversarial winograd schema challenge at scale," in AAAI. AAAI Press, 2020, pp. 8732–8740.
[348] M. Roemmele, C. A. Bejan, and A. S. Gordon, "Choice of plausible alternatives: An evaluation of commonsense causal reasoning," in Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI, 2011.
[349] K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon, P. Clark, and Y. Choi, "proscript: Partially ordered scripts generation," in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 2138–2149.
[350] B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark, "Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 1595–1604.
[351] S. Saha, P. Yadav, L. Bauer, and M. Bansal, "Explagraphs: An explanation graph generation task for structured commonsense reasoning," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7716–7740.
[352] O. Tafjord, B. Dalvi, and P. Clark, "Proofwriter: Generating implications, proofs, and abductive statements over natural language," in Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, ser. Findings of ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol. ACL/IJCNLP 2021. Association for Computational Linguistics, 2021, pp. 3621–3634.
[353] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark, "Explaining answers with entailment trees," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7358–7370.
[354] A. Saparov and H. He, "Language models are greedy reasoners: A systematic formal analysis of chain-of-thought," CoRR, vol. abs/2210.01240, 2022.
[355] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur, "Exploring length generalization in large language models," CoRR, vol. abs/2207.04901, 2022.
[356] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller, A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang, A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, A. Venkatesh, A. Gholamidavoodi, A. Tabassum, A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sabharwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, and et al., "Beyond the imitation game: Quantifying and extrapolating the capabilities of language models," CoRR, vol. abs/2206.04615, 2022.
[357] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, "PAL: program-aided language models," CoRR, vol. abs/2211.10435, 2022.
[358] S. Roy and D. Roth, "Solving general arithmetic word problems," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, Eds. The Association for Computational Linguistics, 2015, pp. 1743–1752.
[359] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi, "Mathqa: Towards interpretable math word problem solving with operation-based formalisms," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2357–2367.
[360] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, "Program induction by rationale generation: Learning to solve and explain algebraic word problems," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan, Eds. Association for Computational Linguistics, 2017, pp. 158–167.
[361] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman,
and H. Hajishirzi, "Mawps: A math word problem repository," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1152–1157.
[362] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, "DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–2378.
[363] S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi, and K. Cho, "Naturalproofs: Mathematical theorem proving in natural language," in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung, Eds., 2021.
[364] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, "Lisa: Language models of isabelle proofs," in 6th Conference on Artificial Intelligence and Theorem Proving, 2021, pp. 378–392.
[365] K. Zheng, J. M. Han, and S. Polu, "minif2f: a cross-system benchmark for formal olympiad-level mathematics," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[366] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, "Proofnet: Autoformalizing and formally proving undergraduate-level mathematics," CoRR, vol. abs/2302.12433, 2023.
[367] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
[368] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in EMNLP. The Association for Computational Linguistics, 2015, pp. 379–389.
[369] D. Chen, A. Fisch, J. Weston, and A. Bordes, "Reading wikipedia to answer open-domain questions," in ACL (1). Association for Computational Linguistics, 2017, pp. 1870–1879.
[370] K. Papineni, S. Roukos, T. Ward, and W. Zhu, "Bleu: a method for automatic evaluation of machine translation," in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.
[371] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out. Association for Computational Linguistics, Jul. 2004, pp. 74–81.
[372] K. Yang, Y. Tian, N. Peng, and D. Klein, "Re3: Generating longer stories with recursive reprompting and revision," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 4393–4479.
[373] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, and P. Fung, "A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity," CoRR, vol. abs/2302.04023, 2023.
[374] S. Gulwani, O. Polozov, and R. Singh, "Program synthesis," Found. Trends Program. Lang., vol. 4, no. 1-2, pp. 1–119, 2017.
[375] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan, "Planning with large language models for code generation," 2023.
[376] M. Welsh, "The end of programming," Commun. ACM, vol. 66, no. 1, pp. 34–35, 2023.
[377] B. Wang, X. Deng, and H. Sun, "Iteratively prompt pre-trained language models for chain of thought," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 2714–2730.
[378] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, "Measuring and narrowing the compositionality gap in language models," CoRR, vol. abs/2210.03350, 2022.
[379] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, Z. Zhou, C. Gong, Y. Shen, J. Zhou, S. Chen, T. Gui, Q. Zhang, and X. Huang, "A comprehensive capability analysis of gpt-3 and gpt-3.5 series models," arXiv preprint arXiv:2303.10420, 2023.
[380] M. McCloskey and N. J. Cohen, "Catastrophic interference in connectionist networks: The sequential learning problem," in Psychology of learning and motivation, 1989, pp. 109–165.
[381] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan, "Measuring catastrophic forgetting in neural networks," in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 3390–3398.
[382] A. Roberts, C. Raffel, and N. Shazeer, "How much knowledge can you pack into the parameters of a language model?" in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020, pp. 5418–5426.
[383] G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, "Few-shot learning with retrieval augmented language models," CoRR, vol. abs/2208.03299, 2022.
[384] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, "Retrieval augmented language model pre-training," in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020, pp. 3929–3938.
[385] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, "Retrieval-augmented generation for knowledge-intensive NLP tasks," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[386] Y. Lan, G. He, J. Jiang, J. Jiang, W. X. Zhao, and J. Wen, "Complex knowledge base question answering: A survey," CoRR, vol. abs/2108.06688, 2021.
[387] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre, "Improving language models by retrieving from trillions of tokens," in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022, pp. 2206–2240.
[388] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, "Check your facts and try again: Improving large language models with external knowledge and automated feedback," CoRR, vol. abs/2302.12813, 2023.
[389] S. Agarwal, I. Akkaya, V. Balcom, M. Bavarian, G. Bernadett-Shapiro, G. Brockman, M. Brundage, J. Chan, F. Chantzis, N. Deutsch, B. Eastman, A. Eleti, N. Felix, S. P. Fishman, I. Fulford, C. Gibson, J. Gross, M. Heaton, J. Hilton, X. Hu, S. Jain, H. Jin, L. Kilpatrick, C. Kim, M. Kolhede, A. Mayne, P. McMillan, D. Medina, J. Menick, A. Mishchenko, A. Nair, R. Nayak, A. Neelakantan, R. Nuttall, J. Parish, A. T. Passos, A. Perelman, F. de Avila Belbute Peres, V. Pong, J. Schulman, E. Sigler, N. Staudacher, N. Turley, J. Tworek, R. Greene, A. Vijayvergiya, C. Voss, J. Weng, M. Wiethoff, S. Yoo, K. Yu, W. Zaremba, S. Zhao, W. Zhuk, and B. Zoph, "Chatgpt plugins," OpenAI Blog, March 2023.
[390] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev, "Internet-augmented language models through few-shot prompting for open-domain question answering," CoRR, vol. abs/2203.05115, 2022.
[391] A. Madaan, N. Tandon, P. Clark, and Y. Yang, "Memory-assisted prompt editing to improve GPT-3 after deployment," in EMNLP. Association for Computational Linguistics, 2022, pp. 2833–2861.
[392] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei, "Knowledge neurons in pretrained transformers," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 8493–8502.
[393] K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov, "Locating and editing factual associations in gpt," in Advances in Neural Information Processing Systems, 2022.
[394] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, "Synthetic prompting: Generating chain-of-thought demonstrations for large language models," CoRR, vol. abs/2302.00618, 2023.
[395] N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He, "ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models," CoRR, 2023.
[396] Sifatkaur, M. Singh, V. S. B, and N. Malviya, "Mind meets machine: Unravelling gpt-4's cognitive psychology," CoRR, vol. abs/2303.11436, 2023.
[397] M. I. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and A. Odena, "Show your work: Scratchpads for intermediate computation with language models," CoRR, vol. abs/2112.00114, 2021.
[398] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, "Limitations of language models in arithmetic and symbolic induction," CoRR, vol. abs/2208.05051, 2022.
[399] W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou, J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, "Jiuzhang: A chinese pre-trained language model for mathematical problem understanding," in KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, A. Zhang and H. Rangwala, Eds. ACM, 2022, pp. 4571–4581.
[400] Q. Wang, C. Kaliszyk, and J. Urban, "First experiments with neural translation of informal to formal mathematics," in Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings, ser. Lecture Notes in Computer Science, F. Rabe, W. M. Farmer, G. O. Passmore, and A. Youssef, Eds., vol. 11006. Springer, 2018, pp. 255–270.
[401] S. Polu and I. Sutskever, "Generative language modeling for automated theorem proving," CoRR, vol. abs/2009.03393, 2020.
[402] A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygózdz, P. Milos, Y. Wu, and M. Jamnik, "Thor: Wielding hammers to integrate language models and automated theorem provers," CoRR, vol. abs/2205.10893, 2022.
[403] S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever, "Formal mathematics statement curriculum learning," CoRR, vol. abs/2202.01344, 2022.
[404] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, "Draft, sketch, and prove: Guiding formal theorem provers with informal proofs," CoRR, vol. abs/2210.12283, 2022.
[405] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch, "Faithful chain-of-thought reasoning," CoRR, vol. abs/2301.13379, 2023.
[406] Y. Weng, M. Zhu, S. He, K. Liu, and J. Zhao, "Large language models are reasoners with self-verification," CoRR, vol. abs/2212.09561, 2022.
[407] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu, Y. Gao,
[408] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: tool augmented language models,” CoRR, vol. abs/2205.12255, 2022.
[409] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “Crows-pairs: A challenge dataset for measuring social biases in masked language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020, pp. 1953–1967.
[410] R. Rudinger, J. Naradowsky, B. Leonard, and B. V. Durme, “Gender bias in coreference resolution,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 2018, pp. 8–14.
[411] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in ICML, ser. Proceedings of Machine Learning Research, vol. 162. PMLR, 2022, pp. 9118–9147.
[412] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P. Oudeyer, “Grounding large language models in interactive environments with online reinforcement learning,” CoRR, vol. abs/2302.02662, 2023.
[413] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, “Virtualhome: Simulating household activities via programs,” in CVPR. Computer Vision Foundation / IEEE Computer Society, 2018, pp. 8494–8502.
[414] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in CVPR. Computer Vision Foundation / IEEE, 2020, pp. 10737–10746.
[415] S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-Fei, “BEHAVIOR: benchmark for everyday household activities in virtual, interactive, and ecological environments,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2021, pp. 477–490.
[416] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan, “Do as I can, not as I say: Grounding language in robotic affordances,” CoRR, vol. abs/2204.01691, 2022.
[417] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” CoRR, vol. abs/2209.07753, 2022.
[418] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated robot task plans using large language models,” CoRR, vol. abs/2209.11302, 2022.
[419] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Garrette, M. Collins, and T. Kwiatkowski, “Tydi QA: A benchmark for information-seeking question answering in typologically diverse languages,” Trans. Assoc. Comput. Linguistics, vol. 8, pp. 454–470, 2020.
[420] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muennighoff, J. Phang, L. Reynolds, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou, “A framework for few-shot language model evaluation,” Sep. 2021.
[421] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, “Can chatgpt understand too? A comparative study on chatgpt and fine-tuned BERT,” CoRR, vol. abs/2302.10198, 2023.
[422] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz, A. Kocon, B. Koptyra, W. Mieleszczenko-Kowszewicz, P. Milkowski, M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik, S. Wozniak, and P. Kazienko, “Chatgpt: Jack of all trades, master of none,” CoRR, vol. abs/2302.10724, 2023.
[423] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is chatgpt a general-purpose natural language processing task solver?” CoRR, vol. abs/2302.06476, 2023.
[424] Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language model is not a good few-shot information extractor, but a good reranker for hard samples!” CoRR, vol. abs/2303.08559, 2023.
[425] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How robust is gpt-3.5 to predecessors? a comprehensive study on language understanding tasks,” 2023.
[426] M. Jang and T. Lukasiewicz, “Consistency analysis of chatgpt,” CoRR, vol. abs/2303.06273, 2023.
[427] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” arXiv preprint arXiv:2303.04360, 2023.
[428] O. Nov, N. Singh, and D. M. Mann, “Putting chatgpt’s medical advice to the (turing) test,” CoRR, vol. abs/2301.10035, 2023.
[429] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, G. K. Savova, R. H. Mak, and D. S. Bitterman, “The utility of chatgpt for cancer treatment information,” medRxiv, 2023.
[430] L. Yunxiang, L. Zihan, Z. Kai, D. Ruilong, and Z. You, “Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge,” 2023.
[431] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports,” CoRR, vol. abs/2212.14882, 2022.
[432] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” CoRR, vol. abs/2303.13375, 2023.
[433] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, “How close is chatgpt to human experts? comparison corpus, evaluation, and detection,” CoRR, vol. abs/2301.07597, 2023.
[434] V. Liévin, C. E. Hother, and O. Winther, “Can large language models reason about medical questions?” CoRR, vol. abs/2207.08143, 2022.
[435] G. Kortemeyer, “Could an artificial-intelligence agent pass an introductory physics course?” arXiv preprint arXiv:2301.12127, 2023.
[436] S. Bordt and U. von Luxburg, “Chatgpt participates in a computer science exam,” CoRR, vol. abs/2303.09461, 2023.
[437] K. Malinka, M. Peresíni, A. Firc, O. Hujnak, and F. Janus, “On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?” CoRR, vol. abs/2303.11146, 2023.
[438] T. Susnjak, “Chatgpt: The end of online exam integrity?” CoRR, vol. abs/2212.09292, 2022.
[439] A. Blair-Stanek, N. Holzenberger, and B. V. Durme, “Can GPT-3 perform statutory reasoning?” CoRR, vol. abs/2302.06100, 2023.
[440] F. Yu, L. Quartey, and F. Schilder, “Legal prompting: Teaching a language model to think like a lawyer,” CoRR, vol. abs/2212.01326, 2022.
[441] D. Trautmann, A. Petrova, and F. Schilder, “Legal prompt engineering for multilingual legal judgement prediction,” CoRR, vol. abs/2212.02199, 2022.
[442] J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
[443] J. J. Nay, “Law informs code: A legal informatics approach to aligning artificial intelligence with humans,” CoRR, vol. abs/2209.13020, 2022.
[444] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the capabilities, limitations, and societal impact of large language models,” CoRR, vol. abs/2102.02503, 2021.
[445] Z. Sun, “A short survey of viewing large language models in legal aspect,” CoRR, vol. abs/2303.09136, 2023.
[446] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-muslim bias in large language models,” in AIES ’21: AAAI/ACM Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, M. Fourcade, B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM, 2021, pp. 298–306.
[447] A. Borji, “A categorical archive of chatgpt failures,” CoRR, vol. abs/2302.03494, 2023.
[448] M. Kosinski, “Theory of mind may have spontaneously emerged in large language models,” CoRR, vol. abs/2302.02083, 2023.
[449] M. M. Amin, E. Cambria, and B. W. Schuller, “Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt,” CoRR, vol. abs/2303.03186, 2023.
[450] R. Aiyappa, J. An, H. Kwak, and Y.-Y. Ahn, “Can we trust the evaluation on chatgpt?” CoRR, vol. abs/2303.12767, 2023.
[451] H. Cho, H. J. Kim, J. Kim, S. Lee, S. Lee, K. M. Yoo, and T. Kim, “Prompt-augmented linear probing: Scaling beyond the limit of few-shot in-context learners,” CoRR, vol. abs/2212.10873, 2022.
[452] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” ACM Comput. Surv., vol. 55, no. 6, pp. 109:1–109:28, 2023.