A Survey of Large Language Models


Wayne Xin Zhao, Kun Zhou*, Junyi Li*, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang,
Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie and Ji-Rong Wen

Abstract—Ever since the Turing Test was proposed in the 1950s, humans have explored how machines can master language intelligence. Language is essentially a complex, intricate system of human expressions governed by grammatical rules. It poses a
significant challenge to develop capable artificial intelligence (AI) algorithms for comprehending and grasping a language. As a major
approach, language modeling has been widely studied for language understanding and generation in the past two decades, evolving
from statistical language models to neural language models. Recently, pre-trained language models (PLMs) have been proposed by pre-
training Transformer models over large-scale corpora, showing strong capabilities in solving various natural language processing (NLP)
tasks. Since researchers have found that model scaling can lead to improved model capacity, they have further investigated the scaling
effect by increasing the parameter scale to an even larger size. Interestingly, when the parameter scale exceeds a certain level, these
enlarged language models not only achieve a significant performance improvement, but also exhibit some special abilities (e.g., in-
context learning) that are not present in small-scale language models (e.g., BERT). To discriminate the language models in different
parameter scales, the research community has coined the term large language models (LLM) for the PLMs of significant size (e.g.,
containing tens or hundreds of billions of parameters). Recently, the research on LLMs has been largely advanced by both academia
and industry, and a remarkable advance is the launch of ChatGPT (a powerful AI chatbot developed based on LLMs), which has
attracted widespread attention from society. The technical evolution of LLMs has been making an important impact on the entire AI
community, and it may well revolutionize the way we develop and use AI algorithms. Considering this rapid technical progress, in this
survey, we review the recent advances of LLMs by introducing the background, key findings, and mainstream techniques. In particular,
we focus on four major aspects of LLMs, namely pre-training, adaptation tuning, utilization, and capacity evaluation. Besides, we also
summarize the available resources for developing LLMs and discuss the remaining issues for future directions. This survey provides an
up-to-date review of the literature on LLMs, which can be a useful resource for both researchers and engineers.

Index Terms—Large Language Models; Emergent Abilities; Adaptation Tuning; Utilization; Alignment; Capacity Evaluation

• GitHub link: https://github.com/RUCAIBox/LLMSurvey
• * K. Zhou and J. Li contribute equally to this work.
• The authors are mainly with Gaoling School of Artificial Intelligence and School of Information, Renmin University of China, Beijing, China; Jian-Yun Nie is with DIRO, Université de Montréal, Canada. Contact e-mail: [email protected]

1 INTRODUCTION

LANGUAGE is a prominent ability of human beings for expression and communication, which develops in early childhood and evolves over a lifetime [1, 2]. Machines, however, cannot naturally grasp the abilities of understanding and communicating in the form of human language, unless equipped with powerful artificial intelligence (AI) algorithms. It has been a longstanding research challenge to achieve this goal: to enable machines to read, write, and communicate like humans [3].

Technically, language modeling (LM) is one of the major approaches to advancing the language intelligence of machines. In general, LM aims to model the generative likelihood of word sequences, so as to predict the probabilities of future (or missing) tokens. The research of LM has received extensive attention in the literature, and it can be divided into four major development stages:

• Statistical language models (SLM). SLMs [4–7] are developed based on the statistical learning methods that rose in the 1990s. The basic idea is to build the word prediction model based on the Markov assumption, e.g., predicting the next word based on the most recent context. SLMs with a fixed context length n are also called n-gram language models, e.g., bigram and trigram language models. SLMs have been widely applied to enhance task performance in information retrieval (IR) [8, 9] and natural language processing (NLP) [10–12]. However, they often suffer from the curse of dimensionality: it is difficult to accurately estimate high-order language models, since an exponential number of transition probabilities need to be estimated. Thus, specially designed smoothing strategies such as back-off estimation [13] and Good–Turing estimation [14] have been introduced to alleviate the data sparsity problem.
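To make the Markov assumption concrete, here is a minimal sketch of a bigram (2-gram) model with add-one smoothing; the toy corpus, function names, and smoothing choice are illustrative, not drawn from the surveyed papers.

```python
from collections import Counter

def train_bigram(corpus):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    """P(w | w_prev) with add-one smoothing over the vocabulary."""
    v = len(unigrams)  # vocabulary size
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + v)

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
print(bigram_prob("the", "cat", uni, bi))  # P(cat | the)
```

Higher-order n-gram models condition on longer contexts in the same way, which is exactly where the exponential blow-up of transition probabilities (and the need for smoothing) comes from.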
• Neural language models (NLM). NLMs [15–17] characterize the probability of word sequences by neural networks, e.g., recurrent neural networks (RNNs). As a remarkable contribution, the work in [15] introduced the concept of distributed representation of words and built the word prediction function conditioned on the aggregated context features (i.e., the distributed word vectors). By extending the idea of learning effective features for words or sentences, a general neural network approach was developed to build a unified solution for various NLP tasks [18]. Further, word2vec [19, 20] was proposed to build a simplified shallow neural network for learning distributed word representations, which were demonstrated to be very effective across a variety of NLP tasks. These studies have initiated the use of language models for representation learning (beyond word sequence modeling), having an important impact on the field of NLP.
• Pre-trained language models (PLM). As an early attempt, ELMo [21] was proposed to capture context-aware word representations by first pre-training a bidirectional LSTM (biLSTM) network (instead of learning fixed word representations) and then fine-tuning the biLSTM network according to specific downstream tasks. Further, based on the highly parallelizable Transformer architecture [22] with self-attention mechanisms, BERT [23] was proposed by pre-training bidirectional language models with specially designed pre-training tasks on large-scale unlabeled corpora. These pre-trained context-aware word representations are very effective as general-purpose semantic features, which have largely raised the performance bar of NLP tasks. This study has inspired a large number of follow-up works, which set the "pre-training and fine-tuning" learning paradigm. Following this paradigm, a great number of studies on PLMs have been developed, introducing either different architectures [24, 25] (e.g., GPT-2 [26] and BART [24]) or improved pre-training strategies [27–29]. In this paradigm, it is often required to fine-tune the PLM for adapting to different downstream tasks.

• Large language models (LLM). Researchers find that scaling PLMs (e.g., scaling the model size or data size) often leads to an improved model capacity on downstream tasks (i.e., following the scaling law [30]). A number of studies have explored the performance limit by training ever larger PLMs (e.g., the 175B-parameter GPT-3 and the 540B-parameter PaLM). Although scaling is mainly conducted in model size (with similar architectures and pre-training tasks), these large-sized PLMs display different behaviors from smaller PLMs (e.g., 330M-parameter BERT and 1.5B-parameter GPT-2) and show surprising abilities (called emergent abilities [31]) in solving a series of complex tasks. For example, GPT-3 can solve few-shot tasks through in-context learning, whereas GPT-2 cannot do well. Thus, the research community has coined the term "large language models (LLM)"1 for these large-sized PLMs [32–35]. A remarkable application of LLMs is ChatGPT2, which adapts the LLMs from the GPT series for dialogue and presents an amazing conversational ability with humans.

1. Note that an LLM is not necessarily more capable than a small PLM, and emergent abilities may not occur in some LLMs.
2. https://openai.com/blog/chatgpt/

In the existing literature, PLMs have been widely discussed and surveyed [36–39], while LLMs are seldom reviewed in a systematic way. To motivate our survey, we first highlight three major differences between LLMs and PLMs. First, LLMs display some surprising emergent abilities that may not be observed in previous smaller PLMs. These abilities are key to the performance of language models on complex tasks, making AI algorithms unprecedentedly powerful and effective. Second, LLMs would revolutionize the way that humans develop and use AI algorithms. Unlike small PLMs, the major approach to accessing LLMs is through the prompting interface (e.g., the GPT-4 API). Humans have to understand how LLMs work and format their tasks in a way that LLMs can follow. Third, the development of LLMs no longer draws a clear distinction between research and engineering. Training an LLM requires extensive practical experience in large-scale data processing and distributed parallel training. To develop capable LLMs, researchers have to solve complicated engineering issues, working with engineers or being engineers.

Nowadays, LLMs are exerting a significant impact on the AI community, and the advent of ChatGPT and GPT-4 leads to a rethinking of the possibilities of artificial general intelligence (AGI). OpenAI has published a technical article entitled "Planning for AGI and beyond", which discusses the short-term and long-term plans to approach AGI [40], and a more recent paper has argued that GPT-4 might be considered as an early version of an AGI system [41]. The research areas of AI are being revolutionized by the rapid progress of LLMs. In the field of NLP, LLMs can serve as a general-purpose language task solver (to some extent), and the research paradigm has been shifting towards the use of LLMs. In the field of IR, traditional search engines are challenged by the new information-seeking mode of AI chatbots (i.e., ChatGPT), and New Bing3 presents an initial attempt at enhancing search results based on LLMs. In the field of CV, researchers try to develop ChatGPT-like vision-language models that can better serve multimodal dialogues [42–45], and GPT-4 [46] has supported multimodal input by integrating visual information. This new wave of technology could potentially lead to a prosperous ecosystem of real-world applications based on LLMs. For instance, Microsoft 365 is being empowered by LLMs (i.e., Copilot) to automate office work, and OpenAI supports the use of plugins in ChatGPT for implementing special functions.

3. https://www.bing.com/new

Despite the progress and impact, the underlying principles of LLMs are still not well explored. Firstly, it is mysterious why emergent abilities occur in LLMs instead of smaller PLMs. As a more general issue, there is still a lack of deep, detailed investigation of the key factors that contribute to the superior abilities of LLMs. It is important to study when and how LLMs obtain such abilities [47]. Although there are some meaningful discussions about this problem [31, 47], more principled investigations are needed to uncover the "secrets" of LLMs. Secondly, it is difficult for the research community to train capable LLMs. Due to the huge demand for computation resources, it is very costly to carry out repetitive ablation studies to investigate the effect of various strategies for training LLMs. Indeed, LLMs are mainly trained by industry, where many important training details (e.g., data collection and cleaning) are not revealed to the public. Thirdly, it is challenging to align LLMs with human values or preferences. Despite their capacities, LLMs are also likely to produce toxic, fictitious, or harmful content. Effective and efficient control approaches are required to eliminate the potential risks of using LLMs [46].

Faced with both opportunities and challenges, more attention needs to be paid to the research and development of LLMs. In order to provide a basic understanding of LLMs, this survey conducts a literature review of the recent advances in LLMs from four major aspects, including pre-training (how to pre-train a capable LLM), adaptation tuning (how to effectively tune pre-trained LLMs from the two perspectives of effectiveness and safety), utilization (how to use LLMs for solving various downstream tasks), and capability evaluation (how to evaluate the abilities of LLMs and existing empirical findings).
We thoroughly comb the literature and summarize the key findings, techniques, and methods of LLMs. For this survey, we also create a GitHub project website collecting the supporting resources for LLMs, at the link https://github.com/RUCAIBox/LLMSurvey. We are also aware of several related review articles on PLMs or LLMs [32, 36, 38, 39, 43, 48–54]. These papers either discuss PLMs or some specific (or general) aspects of LLMs. Compared with them, we focus on the techniques and methods to develop and use LLMs, and provide a relatively comprehensive reference to important aspects of LLMs.

The remainder of this survey is organized as follows: Section 2 introduces the background of LLMs, with the terminology, settings, resources, and organization outline, followed by a summarization of the available resources for developing LLMs in Section 3. Sections 4, 5, 6, and 7 review and summarize the recent progress from the four aspects of pre-training, adaptation tuning, utilization, and capacity evaluation, respectively. Finally, we conclude the survey in Section 8 by summarizing the major findings and discussing the remaining issues for future work.

2 OVERVIEW

In this section, we introduce the background of LLMs with key terminologies, abilities, and techniques.

Background. Typically, large language models (LLMs) refer to language models that contain hundreds of billions (or more) of parameters4, which are trained on massive text data [32], such as GPT-3 [55], PaLM [56], Galactica [35], and LLaMA [57]. Specifically, LLMs are built upon the Transformer architecture [22], where multi-head attention layers are stacked in a very deep neural network. Existing LLMs mainly adopt similar model architectures (i.e., Transformer) and pre-training objectives (i.e., language modeling) as small language models. As the major difference, LLMs largely scale up the model size, pre-training data, and total compute (by orders of magnitude). They can better understand natural language and generate high-quality text based on the given context (i.e., prompts). Such a capacity improvement can be partially described by the scaling law, where performance improves substantially with model size [30]. However, some abilities (e.g., in-context learning [55]) are unpredictable according to the scaling law, and can be observed only when the model size exceeds a certain level (as discussed below).

4. In the existing literature, there is no formal consensus on the minimum parameter scale for LLMs, since the model capacity is also related to data size and total compute. In this survey, we take a slightly loose definition of LLMs, and mainly focus on discussing language models with a model size larger than 10B.
Emergent Abilities of LLMs. In the literature [31], emergent abilities of LLMs are formally defined as "the abilities that are not present in small models but arise in large models", which is one of the most prominent features that distinguish LLMs from previous PLMs. That work further introduces a notable characteristic of when emergent abilities occur [31]: performance rises significantly above random when the scale reaches a certain level. By analogy, such an emergent pattern has close connections with the phenomenon of phase transition in physics [31, 58]. In principle, emergent abilities can be defined in relation to some complex tasks [31, 59], while we are more concerned with general abilities that can be applied to solve a variety of tasks. Here, we briefly introduce three representative emergent abilities of LLMs, described as follows.

• In-context learning. The in-context learning ability was formally introduced by GPT-3 [55]: assuming that the language model has been provided with a natural language instruction and/or several task demonstrations, it can generate the expected output for test instances by completing the word sequence of the input text, without requiring additional training or gradient updates5.

5. Some recent studies [60] also show that in-context learning implicitly performs meta-optimization through the attention mechanism.

• Instruction following. By fine-tuning on a mixture of multi-task datasets formatted via natural language descriptions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 61, 62]. With instruction tuning, LLMs are enabled to follow task instructions for new tasks without using explicit examples, thus gaining an improved generalization ability.

• Step-by-step reasoning. For small language models, it is usually difficult to solve complex tasks that involve multiple reasoning steps, e.g., mathematical word problems. In contrast, with the chain-of-thought reasoning strategy [33], LLMs can solve such tasks by utilizing a prompting mechanism that involves intermediate reasoning steps for deriving the final answer. This ability is speculated to be potentially obtained by training on code [33, 47].
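To illustrate the first and third of these abilities, the sketch below constructs a few-shot in-context learning prompt and a chain-of-thought prompt; the tasks, demonstrations, and the commented-out model call are invented for illustration.

```python
# Few-shot in-context learning: demonstrations are placed directly in the
# prompt; the model is expected to continue the pattern without any
# gradient update.
icl_prompt = (
    "Review: The film was a delight. Sentiment: positive\n"
    "Review: A tedious, overlong mess. Sentiment: negative\n"
    "Review: An instant classic. Sentiment:"
)

# Chain-of-thought prompting: the demonstration includes intermediate
# reasoning steps, encouraging the model to derive the answer step by step.
cot_prompt = (
    "Q: Roger has 3 boxes of 4 apples. How many apples does he have?\n"
    "A: Each box has 4 apples, and 3 * 4 = 12. The answer is 12.\n"
    "Q: A shelf holds 5 rows of 6 books. How many books is that?\n"
    "A:"
)

# completion = some_llm.generate(icl_prompt)  # hypothetical model call
print(icl_prompt)
print(cot_prompt)
```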
Key Techniques for LLMs. It has been a long way for LLMs to evolve into their current state: general and capable learners. In the development process, a number of important techniques have been proposed, which largely improve the capacity of LLMs. Here, we briefly list several important techniques that (potentially) lead to the success of LLMs, as follows.

• Scaling. Scaling is the key factor for increasing the model capacity of LLMs. As the initial attempt, GPT-3 first increased the model size to an extremely large scale of 175B parameters. Later on, PaLM further raised the parameter scale to a new record of 540B. As discussed before, a large model size is essential to emergent abilities. However, scaling concerns not only model size but also data size and total compute [34, 63]. A recent study [34] has discussed the optimal schedule among the three aspects of model size, data size, and total compute, given a fixed budget. Further, the quality of the pre-training data plays a key role in achieving good performance, so data collection and cleaning strategies are very important to consider when scaling the pre-training corpora.
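For orientation, these scaling relations are often written in the following approximate forms (the constants are rounded values reported in [30] and [34], not results of this survey):

```latex
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} \;(\alpha_N \approx 0.076),
\qquad C \approx 6\,N D,
\qquad N_{\mathrm{opt}} \propto C^{a},\; D_{\mathrm{opt}} \propto C^{b} \;(a \approx b \approx 0.5)
```

Here N is the number of model parameters, D the number of training tokens, and C the total training compute; under the compute-optimal allocation studied in [34], roughly equal fractions of additional compute should go to model size and data size.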
• Training. Due to the huge model size, it is very challenging to successfully train a capable LLM. Distributed training algorithms are needed to learn the network parameters of LLMs, in which various parallel strategies are often jointly utilized. To support distributed training, several optimization frameworks have been released to facilitate the implementation and deployment of parallel algorithms, such as DeepSpeed [64] and Megatron-LM [65–67]. Besides, optimization tricks are also important for training stability and model performance, e.g., restarting training to overcome loss spikes [56] and mixed precision training [68]. More recently, GPT-4 [46] proposes to develop special infrastructure and optimization methods that reliably predict the performance of large models with much smaller models.
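As a concrete instance of one such optimization trick, the following is a minimal PyTorch sketch of mixed precision training with dynamic loss scaling; the tiny model and random data are placeholders rather than the setup of any model discussed here.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = GradScaler()  # scales the loss to avoid fp16 gradient underflow

for step in range(10):
    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with autocast():  # runs the forward pass in half precision where safe
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)         # unscales gradients, then steps
    scaler.update()                # adjusts the scale factor dynamically
```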
• Ability eliciting. After being pre-trained on large-scale corpora, LLMs are endowed with potential abilities as general-purpose task solvers. However, these abilities might not be explicitly exhibited when LLMs perform some specific tasks. As a technical approach, it is useful to design suitable task instructions or specific in-context learning strategies to elicit such abilities. For instance, chain-of-thought prompting has been shown to be useful for solving complex reasoning tasks by including intermediate reasoning steps. Besides, we can further perform instruction tuning on LLMs with task descriptions expressed in natural language, to improve the generalizability of LLMs on unseen tasks. These techniques mainly correspond to the emergent abilities of LLMs, and they may not show the same effect on small language models.

• Alignment tuning. Since LLMs are trained to capture the data characteristics of pre-training corpora (including both high-quality and low-quality data), they are likely to generate toxic, biased, or even harmful content for humans. It is necessary to align LLMs with human values, e.g., helpful, honest, and harmless. For this purpose, InstructGPT [61] designs an effective tuning approach that enables LLMs to follow expected instructions, which utilizes the technique of reinforcement learning with human feedback [61, 69]. It incorporates humans in the training loop with elaborately designed labeling strategies. ChatGPT is indeed developed on a technique similar to InstructGPT, and shows a strong alignment capacity in producing high-quality, harmless responses, e.g., rejecting to answer insulting questions.

• Tools manipulation. In essence, LLMs are trained as text generators over massive plain text corpora, thus performing less well on tasks that are not best expressed in the form of text (e.g., numerical computation). Besides, their capacities are also limited to the pre-training data, e.g., the inability to capture up-to-date information. To tackle these issues, a recently proposed technique is to employ external tools to compensate for the deficiencies of LLMs [70, 71]. For example, LLMs can utilize a calculator for accurate computation [70] and employ search engines to retrieve unknown information [71]. More recently, ChatGPT has enabled the mechanism of using external plugins (existing or newly created apps)6, which are by analogy the "eyes and ears" of LLMs. Such a mechanism can broadly expand the scope of capacities for LLMs.

6. https://openai.com/blog/chatgpt-plugins

Besides, many other factors (e.g., the upgrade of hardware) also contribute to the success of LLMs. Still, we limit our discussion to the major technical approaches and key findings for developing LLMs.

3 RESOURCES OF LLMS

It is by no means an easy job to develop or reproduce LLMs, considering the challenging technical issues and huge demands of computation resources. A feasible way is to learn from the experiences of existing LLMs and reuse publicly available resources for incremental development or experimental study. In this section, we briefly summarize the publicly available resources for developing LLMs, including model checkpoints (or APIs), corpora, and libraries.

3.1 Publicly Available Model Checkpoints or APIs

Given the huge cost of model pre-training, well-trained model checkpoints are critical to the study and development of LLMs for the research community. Since the parameter scale is a key factor to consider for using LLMs, we categorize these public models into two scale levels (i.e., tens of billions of parameters and hundreds of billions of parameters), which is useful for users to identify suitable resources according to their resource budget. Besides, for inference, we can directly employ public APIs to perform our tasks, without running the model locally. Next, we introduce the publicly available model checkpoints and APIs.

Models with Tens of Billions of Parameters. Most of the models in this category have a parameter scale ranging from 10B to 20B, except LLaMA [57] (containing 65B parameters in the largest version) and NLLB [81] (containing 54.5B parameters in the largest version). Other models within this range include mT5 [73], PanGu-α [74], T0 [28], GPT-NeoX-20B [77], CodeGen [76], UL2 [79], Flan-T5 [83], and mT0 [84]. Among them, Flan-T5 (11B version) can serve as a premier model for research on instruction tuning, since it explores instruction tuning from three aspects [83]: increasing the number of tasks, scaling the model size, and fine-tuning with chain-of-thought prompting data. Besides, CodeGen (11B version), as an autoregressive language model designed for generating code, can be considered a good candidate for exploring code generation ability. It also introduces a new benchmark, MTPB [76], specially for multi-turn program synthesis, which is composed of 115 expert-generated problems. To solve these problems, LLMs need to acquire sufficient programming knowledge (e.g., math, array operations, and algorithms). As for multilingual tasks, mT0 (13B version) might be a good candidate model, which has been fine-tuned on multilingual tasks with multilingual prompts. Furthermore, PanGu-α [74], developed on the deep learning framework MindSpore [104], shows good performance in Chinese downstream tasks in zero-shot or few-shot settings. Note that PanGu-α [74] has multiple versions (up to 200B parameters), while the largest public version has 13B parameters. As a more recent release, LLaMA (65B version) [57], which contains approximately five times as many parameters as other models in this category, has exhibited superior performance in tasks related to instruction following. Due to its openness and effectiveness, LLaMA has attracted significant attention from the research community, and many efforts [105–108] have been devoted to fine-tuning or continually pre-training its different model versions for implementing new models or tools.
[Figure 1 omitted: a timeline of released LLMs (larger than 10B) from 2019 to 2023, from T5, GShard, and GPT-3 through ChatGPT, LLaMA, and GPT-4.]

Fig. 1. A timeline of existing large language models (having a size larger than 10B) in recent years. The timeline was established mainly according to the release date (e.g., the submission date to arXiv) of the technical paper for a model. If there was not a corresponding paper, we set the date of a model as the earliest time of its public release or announcement. We mark the LLMs with publicly available model checkpoints in yellow color. Due to the space limit of the figure, we only include the LLMs with publicly reported evaluation results.

Typically, pre-training models at this scale requires hundreds or even thousands of GPUs or TPUs. For instance, GPT-NeoX-20B uses 12 supermicro servers, each equipped with 8 NVIDIA A100-SXM4-40GB GPUs, while LLaMA utilizes 2,048 A100-80G GPUs, as reported in their original publications. To accurately estimate the computation resources needed, it is suggested to use metrics measuring the number of involved computations, such as FLOPS (i.e., FLoating point number Operations Per Second) [30].
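A rough estimate of total training compute can be obtained with the common C ≈ 6ND approximation for dense Transformer language models; the figures below are illustrative.

```python
def estimate_training_flops(n_params, n_tokens):
    """Approximate total training compute as C = 6 * N * D,
    a widely used rule of thumb for dense Transformer LMs."""
    return 6 * n_params * n_tokens

# e.g., a 175B-parameter model trained on 300B tokens (GPT-3-like scale)
flops = estimate_training_flops(175e9, 300e9)
print(f"{flops:.2e} FLOPs")  # ~3.15e+23
```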
Models with Hundreds of Billions of Parameters. For models in this category, only a handful have been publicly released. For example, OPT [80], OPT-IML [85], BLOOM [68], and BLOOMZ [84] have nearly the same number of parameters as GPT-3 (175B version), while GLM [82] and Galactica [35] have 130B and 120B parameters, respectively. Among them, OPT (175B version) has been specially motivated by open sharing, aiming to enable researchers to carry out reproducible research at scale. For research on cross-lingual generalization, BLOOM (176B version) and BLOOMZ (176B version) can be used as base models, due to their competence in multilingual language modeling tasks. Among these models, OPT-IML has been tuned with instructions, which makes it a good candidate for studying the effect of instruction tuning. Models of this scale typically require thousands of GPUs or TPUs to train. For instance, OPT (175B version) used 992 A100-80GB GPUs, while GLM (130B version) used a cluster of 96 NVIDIA DGX-A100 (8x40G) GPU nodes.

Public API of LLMs. Instead of directly using the model copies, APIs provide a more convenient way for common users to use LLMs, without the need of running the model locally. As a representative interface for using LLMs, the APIs for the GPT-series models [46, 55, 61, 88] have been widely used in both academia and industry7. OpenAI has provided seven major interfaces to the models in the GPT-3 series: ada, babbage, curie, davinci (the most powerful version in the GPT-3 series), text-ada-001, text-babbage-001, and text-curie-001. Among them, the first four interfaces can be further fine-tuned on the host server of OpenAI. In particular, babbage, curie, and davinci correspond to the GPT-3 (1B), GPT-3 (6.7B), and GPT-3 (175B) models, respectively [55]. Besides, there are also two APIs related to Codex [88], called code-cushman-001 (a powerful and multilingual version of Codex (12B) [88]) and code-davinci-002. Further, the GPT-3.5 series includes one base model, code-davinci-002, and three enhanced versions, namely text-davinci-002, text-davinci-003, and gpt-3.5-turbo-0301. It is worth noting that gpt-3.5-turbo-0301 is the interface to invoke ChatGPT. More recently, OpenAI has also released the corresponding APIs for GPT-4, including gpt-4, gpt-4-0314, gpt-4-32k, and gpt-4-32k-0314. Overall, the choice of API interfaces depends on the specific application scenarios and response requirements. The detailed usage can be found on their project websites8.

7. https://platform.openai.com/docs/api-reference/introduction
8. https://platform.openai.com/docs/models/overview
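The sketch below shows how these interfaces were typically invoked with the openai Python package of that period (pre-1.0 interface); the API key and prompts are placeholders.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Text completion with a GPT-3-series interface
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt="Summarize the idea of in-context learning in one sentence.",
    max_tokens=64,
)
print(completion["choices"][0]["text"])

# Chat completion with the ChatGPT interface
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",
    messages=[{"role": "user", "content": "What is instruction tuning?"}],
)
print(chat["choices"][0]["message"]["content"])
```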
TABLE 1
Statistics of large language models (having a size larger than 10B in this survey) in recent years, including the capacity evaluation, pre-training data scale (either in the number of tokens or storage size) and hardware resource costs. In this table, we only include LLMs with a public paper about the technical details. Here, "Release Time" indicates the date when the corresponding paper was officially released. "Publicly Available" means that the model checkpoints can be publicly accessible while "Closed Source" means the opposite. "Adaptation" indicates whether the model has been with subsequent fine-tuning: IT denotes instruction tuning and RLHF denotes reinforcement learning with human feedback. "Evaluation" indicates whether the model has been evaluated with corresponding abilities in their original paper: ICL denotes in-context learning and CoT denotes chain-of-thought. An "X" mark indicates that the corresponding adaptation was applied or ability evaluated. "*" denotes the largest publicly available version.

Model | Release Time | Size (B) | Base Model | IT | RLHF | Pre-train Data Scale | Latest Data Timestamp | Hardware (GPUs / TPUs) | Training Time | ICL | CoT

Publicly Available:
T5 [72] | Oct-2019 | 11 | - | - | - | 1T tokens | Apr-2019 | 1024 TPU v3 | - | X | -
mT5 [73] | Oct-2020 | 13 | - | - | - | 1T tokens | - | - | - | X | -
PanGu-α [74] | Apr-2021 | 13* | - | - | - | 1.1TB | - | 2048 Ascend 910 | - | X | -
CPM-2 [75] | Jun-2021 | 198 | - | - | - | 2.6TB | - | - | - | - | -
T0 [28] | Oct-2021 | 11 | T5 | X | - | - | - | 512 TPU v3 | 27 h | X | -
CodeGen [76] | Mar-2022 | 16 | - | - | - | 577B tokens | - | - | - | X | -
GPT-NeoX-20B [77] | Apr-2022 | 20 | - | - | - | 825GB | - | 96 40G A100 | - | X | -
Tk-Instruct [78] | Apr-2022 | 11 | T5 | X | - | - | - | 256 TPU v3 | 4 h | X | -
UL2 [79] | May-2022 | 20 | - | - | - | 1T tokens | Apr-2019 | 512 TPU v4 | - | X | X
OPT [80] | May-2022 | 175 | - | - | - | 180B tokens | - | 992 80G A100 | - | X | -
NLLB [81] | Jul-2022 | 54.5 | - | - | - | - | - | - | - | X | -
GLM [82] | Oct-2022 | 130 | - | - | - | 400B tokens | - | 768 40G A100 | 60 d | X | -
Flan-T5 [83] | Oct-2022 | 11 | T5 | X | - | - | - | - | - | X | X
BLOOM [68] | Nov-2022 | 176 | - | - | - | 366B tokens | - | 384 80G A100 | 105 d | X | -
mT0 [84] | Nov-2022 | 13 | mT5 | X | - | - | - | - | - | X | -
Galactica [35] | Nov-2022 | 120 | - | - | - | 106B tokens | - | - | - | X | X
BLOOMZ [84] | Nov-2022 | 176 | BLOOM | X | - | - | - | - | - | X | -
OPT-IML [85] | Dec-2022 | 175 | OPT | X | - | - | - | 128 40G A100 | - | X | X
LLaMA [57] | Feb-2023 | 65 | - | - | - | 1.4T tokens | - | 2048 80G A100 | 21 d | X | -
Pythia [86] | Apr-2023 | 12 | - | - | - | 300B tokens | - | 256 40G A100 | - | X | -

Closed Source:
GPT-3 [55] | May-2020 | 175 | - | - | - | 300B tokens | - | - | - | X | -
GShard [87] | Jun-2020 | 600 | - | - | - | 1T tokens | - | 2048 TPU v3 | 4 d | - | -
Codex [88] | Jul-2021 | 12 | GPT-3 | - | - | 100B tokens | May-2020 | - | - | X | -
ERNIE 3.0 [89] | Jul-2021 | 10 | - | - | - | 375B tokens | - | 384 V100 | - | X | -
Jurassic-1 [90] | Aug-2021 | 178 | - | - | - | 300B tokens | - | 800 GPU | - | X | -
HyperCLOVA [91] | Sep-2021 | 82 | - | - | - | 300B tokens | - | 1024 A100 | 13.4 d | X | -
FLAN [62] | Sep-2021 | 137 | LaMDA | X | - | - | - | 128 TPU v3 | 60 h | X | -
Yuan 1.0 [92] | Oct-2021 | 245 | - | - | - | 180B tokens | - | 2128 GPU | - | X | -
Anthropic [93] | Dec-2021 | 52 | - | - | - | 400B tokens | - | - | - | X | -
WebGPT [71] | Dec-2021 | 175 | GPT-3 | - | X | - | - | - | - | X | -
Gopher [59] | Dec-2021 | 280 | - | - | - | 300B tokens | - | 4096 TPU v3 | 920 h | X | -
ERNIE 3.0 Titan [94] | Dec-2021 | 260 | - | - | - | 300B tokens | - | 2048 V100 | 28 d | X | -
GLaM [95] | Dec-2021 | 1200 | - | - | - | 280B tokens | - | 1024 TPU v4 | 574 h | X | -
LaMDA [96] | Jan-2022 | 137 | - | - | - | 2.81T tokens | - | 1024 TPU v3 | 57.7 d | - | -
MT-NLG [97] | Jan-2022 | 530 | - | - | - | 270B tokens | - | 4480 80G A100 | - | X | -
AlphaCode [98] | Feb-2022 | 41 | - | - | - | 967B tokens | Jul-2021 | - | - | - | -
InstructGPT [61] | Mar-2022 | 175 | GPT-3 | X | X | - | - | - | - | X | -
Chinchilla [34] | Mar-2022 | 70 | - | - | - | 1.4T tokens | - | - | - | X | -
PaLM [56] | Apr-2022 | 540 | - | - | - | 780B tokens | - | 6144 TPU v4 | - | X | X
AlexaTM [99] | Aug-2022 | 20 | - | - | - | 1.3T tokens | - | 128 A100 | 120 d | X | X
Sparrow [100] | Sep-2022 | 70 | - | - | X | - | - | 64 TPU v3 | - | X | -
WeLM [101] | Sep-2022 | 10 | - | - | - | 300B tokens | - | 128 40G A100 | 24 d | X | -
U-PaLM [102] | Oct-2022 | 540 | PaLM | - | - | - | - | 512 TPU v4 | 5 d | X | X
Flan-PaLM [83] | Oct-2022 | 540 | PaLM | X | - | - | - | 512 TPU v4 | 37 h | X | X
Flan-U-PaLM [83] | Oct-2022 | 540 | U-PaLM | X | - | - | - | - | - | X | X
GPT-4 [46] | Mar-2023 | - | - | X | X | - | - | - | - | X | X
PanGu-Σ [103] | Mar-2023 | 1085 | PanGu-α | - | - | 329B tokens | - | 512 Ascend 910 | 100 d | X | -

3.2 Commonly Used Corpora

In contrast to earlier PLMs, LLMs consist of a significantly larger number of parameters and thus require a higher volume of training data that covers a broad range of content. For this need, increasingly more accessible training datasets have been released for research. In this section, we will briefly summarize several widely used corpora for training LLMs. Based on their content types, we categorize these corpora into six groups: Books, CommonCrawl, Reddit links, Wikipedia, Code, and others.

Books. BookCorpus [109] is a dataset commonly used in previous small-scale models (e.g., GPT [119] and GPT-2 [26]), consisting of over 11,000 books covering a wide range of topics and genres (e.g., novels and biographies). Another large-scale book corpus is Project Gutenberg [110], consisting of over 70,000 literary books including novels, essays, poetry, drama, history, science, philosophy, and other types of works in the public domain. It is currently one of the largest open-source book collections, and it is used in the training of MT-NLG [97] and LLaMA [57]. As for Books1 [55] and Books2 [55] used in GPT-3 [55], they are much larger than BookCorpus but have not been publicly released so far.

CommonCrawl. CommonCrawl [120] is one of the largest open-source web crawling databases, containing a petabyte-scale data volume, which has been widely used as training data for existing LLMs.
TABLE 2
Statistics of commonly-used data sources.

Corpora | Size | Source | Latest Update Time
BookCorpus [109] | 5GB | Books | Dec-2015
Gutenberg [110] | - | Books | Dec-2021
C4 [72] | 800GB | CommonCrawl | Apr-2019
CC-Stories-R [111] | 31GB | CommonCrawl | Sep-2019
CC-NEWS [27] | 78GB | CommonCrawl | Feb-2019
REALNEWs [112] | 120GB | CommonCrawl | Apr-2019
OpenWebText [113] | 38GB | Reddit links | Mar-2023
PushShift.io [114] | - | Reddit links | Mar-2023
Wikipedia [115] | - | Wikipedia | Mar-2023
BigQuery [116] | - | Codes | Mar-2023
the Pile [117] | 800GB | Other | Dec-2020
ROOTS [118] | 1.6TB | Other | Jun-2022
As the whole dataset is very large, existing studies mainly extract subsets of web pages from it within a specific period. However, due to the widespread existence of noisy and low-quality information in web data, it is necessary to perform data preprocessing before usage. Based on CommonCrawl, there are four filtered datasets that are commonly used in existing work: C4 [72], CC-Stories [111], CC-News [27], and RealNews [112]. The Colossal Clean Crawled Corpus (C4) includes five variants9, namely en (806G), en.noclean (6T), realnewslike (36G), webtextlike (17G), and multilingual (38T). The en version has been utilized for pre-training T5 [72], LaMDA [96], Gopher [59], and UL2 [79]. The multilingual C4, also called mC4, has been used in mT5 [73]. CC-Stories (31G) is composed of a subset of CommonCrawl data, in which the contents are made in a story-like way. However, the original source of CC-Stories is no longer available, so a reproduction version, CC-Stories-R [121], has been included in Table 2. Moreover, two news corpora extracted from CommonCrawl, i.e., REALNEWS (120G) and CC-News (76G), are also commonly used as pre-training data.

9. https://www.tensorflow.org/datasets/catalog/c4

Reddit Links. Reddit is a social media platform that enables users to submit links and text posts, which can be voted on by others through "upvotes" or "downvotes". Highly upvoted posts are often considered useful, and can be utilized to create high-quality datasets. WebText [26] is a well-known corpus composed of highly upvoted links from Reddit, but it is not publicly available. As a surrogate, there is a readily accessible open-source alternative called OpenWebText [113]. Another corpus extracted from Reddit is PushShift.io [114], a real-time updated dataset that consists of historical data from Reddit since its creation day. Pushshift provides not only monthly data dumps but also useful utility tools to support users in searching, summarizing, and conducting preliminary investigations on the entire dataset. This makes it easy for users to collect and process Reddit data.

Wikipedia. Wikipedia [115] is an online encyclopedia containing a large volume of high-quality articles on diverse topics. Most of these articles are composed in an expository style of writing (with supporting references), covering a wide range of languages and fields. Typically, the English-only filtered versions of Wikipedia are widely used in most LLMs (e.g., GPT-3 [55], LaMDA [96], and LLaMA [57]). Wikipedia is available in multiple languages, so it can be used in multilingual settings.

Code. To collect code data, existing work mainly crawls open-source licensed code from the Internet. Two major sources are public code repositories under open-source licenses (e.g., GitHub) and code-related question-answering platforms (e.g., StackOverflow). Google has publicly released the BigQuery dataset [116], which includes a substantial number of open-source licensed code snippets in various programming languages, serving as a representative code dataset. CodeGen has utilized BIGQUERY [76], a subset of the BigQuery dataset, for training the multilingual version of CodeGen (CodeGen-Multi).

Others. The Pile [117] is a large-scale, diverse, and open-source text dataset consisting of over 800GB of data from multiple sources, including books, websites, code, scientific papers, and social media platforms. It is constructed from 22 diverse high-quality subsets. The Pile dataset is widely used in models of different parameter scales, such as GPT-J (6B) [122], CodeGen (16B) [76], and Megatron-Turing NLG (530B) [97]. Besides, ROOTS [118] is composed of various smaller datasets (1.61 TB of text in total) and covers 59 different languages (containing natural languages and programming languages), and it has been used for training BLOOM [68].

In practice, pre-training LLMs commonly requires a mixture of different data sources (see Figure 2), rather than a single corpus. Therefore, existing studies commonly mix several ready-made datasets (e.g., C4, OpenWebText, and the Pile), and then perform further processing to obtain the pre-training corpus. Besides, to train LLMs that are adaptive to specific applications, it is also important to extract data from relevant sources (e.g., Wikipedia and BigQuery) for enriching the corresponding information in the pre-training data. To provide a quick reference to the data sources used in existing LLMs, we present the pre-training corpora of three representative LLMs:

• GPT-3 (175B) [55] was trained on a mixed dataset of 300B tokens, including CommonCrawl [120], WebText2 [55], Books1 [55], Books2 [55], and Wikipedia [115].
• PaLM (540B) [56] uses a pre-training dataset of 780B tokens, which is sourced from social media conversations, filtered webpages, books, Github, multilingual Wikipedia, and news.
• LLaMA [57] extracts training data from various sources, including CommonCrawl, C4 [72], Github, Wikipedia, books, ArXiv, and StackExchange. The training data size for LLaMA (6B) and LLaMA (13B) is 1.0T tokens, while 1.4T tokens are used for LLaMA (32B) and LLaMA (65B).
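One simple way to realize such a mixture is to sample each training document from a set of corpora according to fixed weights, as in the following sketch; the source names and ratios are placeholders in the spirit of Figure 2, not the actual recipe of any model above.

```python
import random

# Hypothetical mixture weights (illustrative, not actual ratios)
mixture = {"webpages": 0.60, "books": 0.15, "wikipedia": 0.05,
           "code": 0.10, "conversation": 0.10}

def sample_source(weights):
    """Pick the corpus for the next training document, proportional to its weight."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

counts = {name: 0 for name in mixture}
for _ in range(10000):
    counts[sample_source(mixture)] += 1
print(counts)  # empirical draws roughly follow the mixture ratios
```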
3.3 Library Resource

In this part, we briefly introduce a series of available libraries for developing LLMs.

• Transformers [123] is an open-source Python library for building models using the Transformer architecture, which is developed and maintained by Hugging Face. It has a simple and user-friendly API, making it easy to use and customize various pre-trained models. It is a powerful library with a large and active community of users and developers who regularly update and improve the models and algorithms.
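As a quick illustration of this API, the following loads a small pre-trained model and generates a continuation (the model name and prompt are arbitrary examples):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Large language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```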
• DeepSpeed [64] is a deep learning optimization library (compatible with PyTorch) developed by Microsoft, which has been used to train a number of LLMs, such as MT-NLG [97] and BLOOM [68]. It provides support for various distributed-training optimization techniques, such as memory optimization (the ZeRO technique, gradient checkpointing) and pipeline parallelism.
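For a sense of how such options are enabled, below is a minimal DeepSpeed-style JSON configuration switching on fp16 and ZeRO; treat the exact keys and values as an illustrative sketch and consult the DeepSpeed documentation for the authoritative schema.

```python
import json

# A minimal DeepSpeed-style configuration (illustrative sketch)
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},          # mixed precision training
    "zero_optimization": {"stage": 2},  # partition optimizer states and gradients
    "gradient_clipping": 1.0,
}
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```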
• Megatron-LM [65–67] is a deep learning library developed by NVIDIA for training large-scale language models. It also provides rich optimization techniques for distributed training, including model and data parallelism, mixed-precision training, and FlashAttention. These optimization techniques can largely improve the training efficiency and speed, enabling efficient distributed training across GPUs.

• JAX [124] is a Python library for high-performance machine learning algorithms developed by Google, allowing users to easily perform computations on arrays with hardware acceleration (e.g., GPU or TPU). It enables efficient computation on various devices and also supports several featured functions, such as automatic differentiation and just-in-time compilation.

• Colossal-AI [125] is a deep learning library developed by HPC-AI Tech for training large-scale AI models. It is implemented based on PyTorch and supports a rich collection of parallel training strategies. Furthermore, it can also optimize heterogeneous memory management with methods proposed by PatrickStar [126]. Recently, a ChatGPT-like model called ColossalChat [108] has been publicly released in two versions (7B and 13B), which are developed using Colossal-AI based on LLaMA [57].

• BMTrain [127] is an efficient library developed by OpenBMB for training models with large-scale parameters in a distributed manner, which emphasizes code simplicity, low resource usage, and high availability. BMTrain has already incorporated several common LLMs (e.g., Flan-T5 [83] and GLM [82]) into its ModelCenter, where developers can use these models directly.

• FastMoE [128] is a specialized training library for MoE (i.e., mixture-of-experts) models. It is developed based on PyTorch, prioritizing both efficiency and user-friendliness in its design. FastMoE simplifies the process of transferring Transformer models into MoE models and supports both data parallelism and model parallelism during training.

Besides the above library resources, existing deep learning frameworks (e.g., PyTorch [129], TensorFlow [130], MXNet [131], PaddlePaddle [132], MindSpore [104] and OneFlow [133]) also provide support for parallel algorithms, which are commonly used for training large-scale models.

4 PRE-TRAINING

Pre-training establishes the basis of the abilities of LLMs. By pre-training on large-scale corpora, LLMs can acquire essential language understanding and generation skills [55, 56]. In this process, the scale and quality of the pre-training corpus are critical for LLMs to attain powerful capabilities. Besides, to effectively pre-train LLMs, model architectures, acceleration methods, and optimization techniques need to be well designed. In what follows, we first discuss the data collection and processing in Section 4.1, then introduce the commonly used model architectures in Section 4.2, and finally present the training techniques to stably and efficiently optimize LLMs in Section 4.3.

4.1 Data Collection

Compared with small-scale language models, LLMs have a stronger demand for high-quality data for model pre-training, and their model capacities largely rely on the pre-training corpus and how it has been preprocessed. In this part, we discuss the collection and processing of pre-training data, including data sources, preprocessing methods, and important analysis of how pre-training data affects the performance of LLMs.

4.1.1 Data Source

To develop a capable LLM, it is key to collect a large amount of natural language corpus from various data sources. Existing LLMs mainly leverage a mixture of diverse public textual datasets as the pre-training corpus. Figure 2 shows the distribution of the sources of pre-training data for a number of representative LLMs.

The source of the pre-training corpus can be broadly categorized into two types: general data and specialized data. General data, such as webpages, books, and conversational text, is utilized by most LLMs [55, 56, 80] due to its large, diverse, and accessible nature, which can enhance the language modeling and generalization abilities of LLMs. In light of the impressive generalization capabilities exhibited by LLMs, there are also studies that extend the pre-training corpus to more specialized datasets, such as multilingual data, scientific data, and code, endowing LLMs with specific task-solving capabilities [35, 56, 76]. In what follows, we describe these two types of pre-training data sources and their effects on LLMs. For a detailed introduction to the commonly used corpora, one can refer to Section 3.2.

General Text Data. As we can see in Figure 2, the vast majority of LLMs adopt general-purpose pre-training data, such as webpages, books, and conversational text, which provides rich text sources on a variety of topics. Next, we briefly summarize three important kinds of general data.

• Webpages. Owing to the proliferation of the Internet, various types of data have been created, which enables LLMs to gain diverse linguistic knowledge and enhance their generalization capabilities [26, 72]. For convenient use of these data resources, a large amount of data is crawled from the web in previous work, such as CommonCrawl [120]. However, the crawled web data tends to contain both high-quality text, such as Wikipedia, and low-quality text, like spam mail; thus, it is important to filter and process webpages to improve the data quality.

• Conversation text. Conversation data can enhance the conversational competence of LLMs [80] and potentially improve their performance on a range of question-answering tasks [56]. Researchers can utilize subsets of public conversation corpora (e.g., the PushShift.io Reddit corpus) [114, 134] or collect conversation data from online social media.
[Figure 2 omitted: per-model pie charts of pre-training data composition for T5 (11B), mT5 (13B), LLaMA (65B), GPT-3 (175B), MT-NLG (530B), Gopher (280B), Chinchilla (70B), GLaM (1200B), PaLM (540B), LaMDA (137B), Galactica (120B), GPT-NeoX (20B), CodeGen (16B), and AlphaCode (41B); legend: Webpages, Conversation Data, Books & News, Scientific Data, Code.]

Fig. 2. Ratios of various data sources in the pre-training data for existing LLMs.

Since online conversational data often involves discussions among multiple participants, an effective processing approach is to transform a conversation into a tree structure, where each utterance is linked to the one it responds to. In this way, the multi-party conversation tree can be divided into multiple sub-conversations, which can be collected into the pre-training corpus. Furthermore, a potential risk is that the excessive integration of dialogue data into LLMs may result in a side effect [80]: declarative instructions and direct interrogatives are erroneously perceived as the beginning of conversations, thus leading to a decline in the efficacy of the instructions.

• Books. Compared to other corpora, books provide an important source of formal long texts, which are potentially beneficial for LLMs to learn linguistic knowledge, model long-term dependency, and generate narrative and coherent texts. To obtain open-source book data, existing studies usually adopt the Books3 and Bookcorpus2 datasets, which are available in the Pile dataset [117].

Specialized Text Data. Specialized datasets are useful to improve the specific capabilities of LLMs on downstream tasks. Next, we introduce three kinds of specialized data.

• Multilingual text. Besides the text in the target language, integrating a multilingual corpus can enhance the multilingual abilities of language understanding and generation. For example, BLOOM [68] and PaLM [56] have curated multilingual data covering 46 and 122 languages, respectively, within their pre-training corpora. These models demonstrate impressive performance in multilingual tasks, such as translation, multilingual summarization, and multilingual question answering, and achieve comparable or superior performance to state-of-the-art models that are fine-tuned on the corpus in the target language(s).

• Scientific text. The exploration of science by humans has been witnessed by the increasing growth of scientific publications. In order to enhance the understanding of scientific knowledge for LLMs [35, 135], it is useful to incorporate a scientific corpus for model pre-training [35, 135]. By pre-training on a vast amount of scientific text, LLMs can achieve impressive performance in scientific and reasoning tasks [136]. To construct the scientific corpus, existing efforts mainly collect arXiv papers, scientific textbooks, math webpages, and other related scientific resources. Due to the complex nature of data in scientific fields, such as mathematical symbols and protein sequences, specific tokenization and preprocessing techniques are usually required to transform these different formats of data into a unified form that can be processed by language models.

• Code. Program synthesis has been widely studied in the research community [88, 137–140], especially the use of PLMs trained on code [122, 141]. However, it remains challenging for these PLMs (e.g., GPT-J [122]) to generate high-quality and accurate programs. Recent studies [88, 140] have found that training LLMs on a vast code corpus can lead to a substantial improvement in the quality of the synthesized programs. The generated programs can successfully pass expert-designed unit-test cases [88] or solve competitive programming questions [98]. In general, two types of code corpora are commonly used for pre-training LLMs. The first source is programming question-answering communities like Stack Exchange [142, 143]. The second source is public software repositories such as GitHub [76, 88, 140], where code data (including comments and docstrings) are collected for utilization. Compared to natural language text, code is in the format of a programming language, corresponding to long-range dependencies and accurate execution logic [144]. A recent study [47] also speculates that training on code might be a source of complex reasoning abilities (e.g., the chain-of-thought ability [33]). Besides, it has been shown that formatting reasoning tasks into code can help LLMs generate more accurate results [144, 145].

4.1.2 Data Preprocessing

After collecting a large amount of text data, it is essential to preprocess the data for constructing the pre-training corpus, especially removing noisy, redundant, irrelevant, and potentially toxic data [56, 59], which may largely affect the capacity and performance of LLMs. In this part, we review the detailed data preprocessing strategies to improve the quality of the collected data [59, 68, 95]. A typical pipeline of preprocessing the pre-training data for LLMs is illustrated in Figure 3.
[Figure 3 omitted: pipeline stages Raw Corpus → Quality Filtering (language / metric / statistic / keyword filtering) → De-duplication (sentence-level / document-level / set-level) → Privacy Reduction (detect and remove personally identifiable information, PII) → Tokenization (reuse existing tokenizer; SentencePiece; byte-level BPE) → ready to pre-train.]

Fig. 3. An illustration of a typical data preprocessing pipeline for pre-training large language models.

Quality Filtering. To remove low-quality data from the collected corpus, existing work generally adopts two approaches: (1) classifier-based and (2) heuristic-based. The former approach trains a selection classifier based on high-quality texts and leverages it to identify and filter out low-quality data. Typically, these methods [55, 56, 95] train a binary classifier with well-curated data (e.g., Wikipedia pages) as positive instances and sampled candidate data as negative instances, and predict a score that measures the quality of each data example. However, several studies [59, 95] also find that a classifier-based approach may result in the unintentional removal of high-quality texts in dialectal, colloquial, and sociolectal languages, which potentially leads to bias in the pre-training corpus and diminishes the corpus diversity. As the second approach, several studies, such as BLOOM [68] and Gopher [59], employ heuristic-based approaches to eliminate low-quality texts through a set of well-designed rules, which can be summarized as follows (a minimal sketch of two such rules is given after the list):

• Language based filtering. If an LLM would be mainly used in tasks of certain languages, text in other languages can be filtered out.

• Metric based filtering. Evaluation metrics on the generated texts, e.g., perplexity, can be employed to detect and remove unnatural sentences.

• Statistic based filtering. Statistical features of a corpus, e.g., the punctuation distribution, symbol-to-word ratio, and sentence length, can be utilized to measure the text quality and filter out low-quality data.

• Keyword based filtering. Based on a specific keyword set, noisy or unuseful elements in the text, such as HTML tags, hyperlinks, boilerplate, and offensive words, can be identified and removed.
noisy or unuseful elements in the text, such as HTML training corpus with SentencePiece [152]. The byte-level Byte
tags, hyperlinks, boilerplates, and offensive words, can Pair Encoding (BPE) algorithm [153] is utilized to ensure that
be identified and removed. the information after tokenization is lossless [56, 59]. While,
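To make these rules concrete, the following is a minimal sketch of a heuristic filter combining statistic- and keyword-based criteria; all thresholds and the keyword set are illustrative assumptions, not values prescribed by BLOOM or Gopher.

```python
# A toy heuristic quality filter; thresholds and keywords are illustrative only.
MIN_WORDS, MAX_WORDS = 5, 100_000          # statistic-based: document length bounds
MAX_SYMBOL_TO_WORD_RATIO = 0.1             # statistic-based: symbol-to-word ratio
BLOCKLIST = ("<html>", "lorem ipsum", "click here")  # keyword-based: boilerplate markers

def keep_document(text: str) -> bool:
    words = text.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    symbols = sum(ch in "#$%^&*{}<>|\\" for ch in text)
    if symbols / max(len(words), 1) > MAX_SYMBOL_TO_WORD_RATIO:
        return False
    lowered = text.lower()
    return not any(kw in lowered for kw in BLOCKLIST)

corpus = ["A clean paragraph with enough ordinary words to keep.",
          "#$^& <html> boilerplate junk"]
print([keep_document(doc) for doc in corpus])  # [True, False]
```

Perplexity-based metric filtering would add a language model score on top of such purely surface-level checks.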
De-duplication. Existing work [146] has found that duplicate data in a corpus reduces the diversity of language models, which may cause the training process to become unstable and thus affect model performance. Therefore, it is necessary to de-duplicate the pre-training corpus. Specifically, de-duplication can be performed at different granularities, including sentence-level, document-level, and dataset-level de-duplication. First, low-quality sentences that contain repeated words and phrases should be removed, as they may introduce repetitive patterns in language modeling [147]. At the document level, existing studies mostly rely on the overlap ratio of surface features (e.g., word and n-gram overlap) between documents to detect and remove duplicate documents containing similar contents [57, 59, 68, 148]. Furthermore, to avoid the dataset contamination problem, it is also crucial to prevent overlap between the training and evaluation sets [56] by removing possible duplicate texts from the training set. It has been shown that the three levels of de-duplication are useful for improving the training of LLMs [56, 149], and they should be jointly used in practice.
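As an illustration of document-level de-duplication by surface-feature overlap, the sketch below compares n-gram sets with the Jaccard ratio; real pipelines typically approximate this with MinHash/LSH to avoid the quadratic comparison, and the threshold here is an assumed value.

```python
def ngrams(text: str, n: int = 3) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

def deduplicate(docs: list, threshold: float = 0.8) -> list:
    """Keep a document only if it overlaps with no kept document above threshold."""
    kept, kept_grams = [], []
    for doc in docs:
        grams = ngrams(doc)
        # O(N^2) exact comparison; production systems use MinHash/LSH instead.
        if all(jaccard(grams, g) < threshold for g in kept_grams):
            kept.append(doc)
            kept_grams.append(grams)
    return kept

docs = ["the cat sat on the mat today",
        "the cat sat on the mat today .",
        "an entirely different document about llms"]
print(len(deduplicate(docs)))  # 2: the near-duplicate second document is dropped
```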
Privacy Redaction. The majority of pre-training text data is obtained from web sources, including user-generated content involving sensitive or personal information, which may increase the risk of privacy breaches [150]. Thus, it is necessary to remove personally identifiable information (PII) from the pre-training corpus. One direct and effective approach is to employ rule-based methods, such as keyword spotting, to detect and remove PII such as names, addresses, and phone numbers [118]. Furthermore, researchers also find that the vulnerability of LLMs under privacy attacks can be attributed to the presence of duplicate PII data in the pre-training corpus [151]. Therefore, de-duplication can also reduce privacy risks to some extent.
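A minimal sketch of rule-based PII redaction via pattern spotting; the regular expressions below are illustrative and far from exhaustive compared with production PII detectors.

```python
import re

# Illustrative patterns only; real PII detection is far more comprehensive.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        # Replace each match with a placeholder tag such as [EMAIL].
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact_pii("Contact Alice at [email protected] or +1 (555) 123-4567."))
# -> "Contact Alice at [EMAIL] or [PHONE]."
```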
Tokenization. Tokenization is also a crucial step of data preprocessing. It aims to segment raw text into sequences of individual tokens, which are subsequently used as the inputs of LLMs. Although it is expedient to leverage an existing tokenizer (e.g., OPT [80] and GPT-3 [55] utilize the tokenizer of GPT-2 [26]), using a tokenizer specially designed for the pre-training corpus can be highly beneficial [68], especially for a corpus that consists of diverse domains, languages, and formats. Therefore, several recent LLMs train customized tokenizers for the pre-training corpus with SentencePiece [152]. The byte-level Byte Pair Encoding (BPE) algorithm [153] is utilized to ensure that no information is lost after tokenization [56, 59]. However, normalization techniques in BPE, such as NFKC [154], may degrade the tokenization performance [34, 59, 68].
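For illustration, training a corpus-specific BPE tokenizer with the SentencePiece library mentioned above might look as follows; the file names, vocabulary size, and byte-fallback setting are assumed choices for the sketch.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a (hypothetical) cleaned corpus file;
# vocab_size and byte_fallback are illustrative choices.
spm.SentencePieceTrainer.train(
    input="pretrain_corpus.txt",      # assumed path: one sentence per line
    model_prefix="llm_tokenizer",
    vocab_size=32000,
    model_type="bpe",
    byte_fallback=True,               # fall back to bytes so no input is unencodable
)

sp = spm.SentencePieceProcessor(model_file="llm_tokenizer.model")
print(sp.encode("Large language models are data-hungry.", out_type=str))
```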
4.1.3 Effect of Pre-training Data on LLMs
Unlike small-scale PLMs, it is usually infeasible to iterate the pre-training of LLMs multiple times, due to the huge demand for computational resources. Thus, it is particularly important to construct a well-prepared pre-training corpus before training an LLM. In this part, we discuss how the quality and distribution of the pre-training corpus potentially influence the performance of LLMs.
Mixture of Sources. As discussed before, pre-training data from different domains or scenarios has distinct linguistic characteristics or semantic knowledge. By pre-training on a mixture of text data from diverse sources, LLMs can acquire a broad scope of knowledge and may exhibit a strong generalization capacity. When mixing different sources, one needs to carefully set the distribution of pre-training data, since it is also likely to affect the performance of LLMs on downstream tasks [59]. Gopher [59] conducts an ablation experiment on data distribution to examine the impact of mixed sources on downstream tasks. Experimental results on the LAMBADA dataset [155] show that increasing the proportion of books data can improve the capacity of the model in capturing long-term dependencies from text, and increasing the proportion of the C4 dataset [72] leads to performance improvement on the C4 validation dataset [59]. However, as a side effect, training on excessive data from a certain domain would affect the generalization capability of LLMs on other domains [35, 59]. Therefore, it is suggested that researchers carefully determine the proportion of data from different domains in the pre-training corpus, in order to develop LLMs that better meet their specific needs. Readers can refer to Figure 2 for a comparison of the data sources for different LLMs.

Amount of Pre-training Data. For pre-training an effective LLM, it is important to collect sufficient high-quality data that satisfies the data quantity demand of the LLM. Existing studies have found that with the increasing parameter scale of the LLM, more data is also required to train the model [34, 57]: a scaling law similar to that for model size is also observed for data size, with respect to model performance. Chinchilla [34] demonstrates that a number of existing LLMs suffer from sub-optimal training due to inadequate pre-training data. By conducting extensive experiments, it further shows that it is necessary to scale the model parameters and training tokens equally for a given compute budget. More recently, LLaMA [57] shows that with more data and longer training, smaller models can also achieve good performance. Therefore, it is suggested that researchers pay more attention to the amount of high-quality data for adequately training the model, especially when scaling the model parameters.

Quality of Pre-training Data. Existing work has shown that pre-training on a low-quality corpus, such as noisy, toxic, and duplicate data, may hurt the performance of models [59, 146, 148, 151]. For developing a well-performing LLM, it is crucial to consider both the quantity and the quality of the collected training data. Recent studies, such as T5 [72], GLaM [95], and Gopher [59], have investigated the influence of data quality on the performance of downstream tasks. By comparing the performance of models trained on filtered and unfiltered corpora, they reach the same conclusion that pre-training LLMs on cleaned data can improve performance. More specifically, the duplication of data may result in "double descent" (referring to the phenomenon of performance initially deteriorating and subsequently improving) [146, 156], or even overwhelm the training process [146]. Besides, it has been shown that duplicate data degrades the ability of LLMs to copy from the context, which might further affect the generalization capacity of LLMs using in-context learning [146]. Therefore, as suggested in [56, 59, 68], it is essential to carefully apply the preprocessing methods on the pre-training corpus (as illustrated in Section 4.1.2), to improve the stability of the training process and avoid affecting the model performance.

4.2 Architecture
In this section, we review the architecture design of LLMs, i.e., mainstream architecture, pre-training objective, and detailed configuration. Table 3 presents the model cards of several representative LLMs with public details.

4.2.1 Mainstream Architectures
Due to its excellent parallelizability and capacity, the Transformer architecture [22] has become the de facto backbone for developing various LLMs, making it possible to scale language models to hundreds or thousands of billions of parameters. In general, the mainstream architectures of existing LLMs can be roughly categorized into three major types, namely encoder-decoder, causal decoder, and prefix decoder.

Encoder-decoder Architecture. The vanilla Transformer model is built on the encoder-decoder architecture [22], which consists of two stacks of Transformer blocks as the encoder and decoder, respectively. The encoder adopts stacked multi-head self-attention layers to encode the input sequence and generate its latent representations, while the decoder performs cross-attention on these representations and autoregressively generates the target sequence. Encoder-decoder PLMs (e.g., T5 [72] and BART [24]) have shown effectiveness on a variety of NLP tasks. So far, only a small number of LLMs are built on the encoder-decoder architecture, e.g., Flan-T5 [83]. We leave a detailed discussion of architecture selection to Section 4.2.4.

Causal Decoder Architecture. The causal decoder architecture incorporates a unidirectional attention mask, to guarantee that each input token can only attend to the past tokens and itself. The input and output tokens are processed in the same fashion through the decoder. As representative language models of this architecture, the GPT-series models [26, 55, 119] are developed based on the causal-decoder architecture. In particular, GPT-3 [55] has successfully demonstrated the effectiveness of this architecture, also showing an amazing in-context learning capability of LLMs. Interestingly, GPT-1 [119] and GPT-2 [26] do not exhibit such superior abilities as those in GPT-3, and it seems that scaling plays an important role in increasing the capacity of this model architecture. So far, the causal decoder has been widely adopted as the architecture of various existing LLMs, such as OPT [80], BLOOM [68], and Gopher [59]. Note that both the causal decoder and the prefix decoder discussed next belong to decoder-only architectures. However, when mentioning "decoder-only architecture", the existing literature mainly refers to the causal decoder architecture, unless specified otherwise.
TABLE 3
Model cards of several selected LLMs with public configuration details. Here, PE denotes position embedding, #L denotes the number of layers, #H denotes the number of attention heads, d_model denotes the size of hidden states, and MCL denotes the maximum context length during training.

Model            Category         Size    Normalization   PE        Activation  Bias  #L   #H   d_model  MCL
GPT3 [55]        Causal decoder   175B    Pre Layer Norm  Learned   GeLU        ✓     96   96   12288    2048
PanGu-α [74]     Causal decoder   207B    Pre Layer Norm  Learned   GeLU        ✓     64   128  16384    1024
OPT [80]         Causal decoder   175B    Pre Layer Norm  Learned   ReLU        ✓     96   96   12288    2048
PaLM [56]        Causal decoder   540B    Pre Layer Norm  RoPE      SwiGLU      ✗     118  48   18432    2048
BLOOM [68]       Causal decoder   176B    Pre Layer Norm  ALiBi     GeLU        ✓     70   112  14336    2048
MT-NLG [97]      Causal decoder   530B    -               -         -           -     105  128  20480    2048
Gopher [59]      Causal decoder   280B    Pre RMS Norm    Relative  -           -     80   128  16384    2048
Chinchilla [34]  Causal decoder   70B     Pre RMS Norm    Relative  -           -     80   64   8192     -
Galactica [35]   Causal decoder   120B    Pre Layer Norm  Learned   GeLU        ✗     96   80   10240    2048
LaMDA [96]       Causal decoder   137B    -               Relative  GeGLU       -     64   128  8192     -
Jurassic-1 [90]  Causal decoder   178B    Pre Layer Norm  Learned   GeLU        ✓     76   96   13824    2048
LLaMA [57]       Causal decoder   65B     Pre RMS Norm    RoPE      SwiGLU      ✓     80   64   8192     2048
GLM-130B [82]    Prefix decoder   130B    Post Deep Norm  RoPE      GeGLU       ✓     70   96   12288    2048
T5 [72]          Encoder-decoder  11B     Pre RMS Norm    Relative  ReLU        ✗     24   128  1024     512
Prefix Decoder Architecture. The prefix decoder architecture (a.k.a., non-causal decoder [157]) revises the masking mechanism of causal decoders, to enable bidirectional attention over the prefix tokens [158] and unidirectional attention only on generated tokens. In this way, like the encoder-decoder architecture, the prefix decoder can bidirectionally encode the prefix sequence and autoregressively predict the output tokens one by one, where the same parameters are shared during encoding and decoding. Instead of pre-training from scratch, a practical suggestion is to continually train causal decoders and then convert them into prefix decoders to accelerate convergence [29]; e.g., U-PaLM [102] is derived from PaLM [56]. Existing representative LLMs based on prefix decoders include GLM-130B [82] and U-PaLM [102].
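The decoder-only architectures differ essentially in their attention masks. A minimal PyTorch sketch of the causal and prefix (non-causal) masks, as a generic illustration rather than any particular model's implementation:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Each position may attend only to itself and earlier positions.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Prefix tokens attend bidirectionally; generated tokens remain causal.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

print(causal_mask(4).int())
print(prefix_mask(4, prefix_len=2).int())
```

In an encoder-decoder model, by contrast, the encoder side would simply use a fully bidirectional (all-True) mask over the input sequence.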
For the three types of architectures, we can also consider extending them via mixture-of-experts (MoE) scaling, in which a subset of the neural network weights is sparsely activated for each input, e.g., Switch Transformer [25] and GLaM [95]. It has been shown that substantial performance improvement can be observed by increasing either the number of experts or the total parameter size [159].
4.2.2 Detailed Configuration
Since the launch of the Transformer [22], various improvements have been proposed to enhance its training stability, performance, and computational efficiency. In this part, we discuss the corresponding configurations for four major parts of the Transformer, namely normalization, position embeddings, activation functions, and attention and bias.

Normalization. Training instability is a challenging issue for pre-training LLMs. To alleviate this problem, layer normalization (Layer Norm, LN) [160] is widely employed in Transformer architectures. The position of LN is vital to the performance of LLMs. While the initial Transformer [22] uses post-LN, most LLMs employ pre-LN for more stable training, in spite of a decrease in performance [161]. Based on pre-LN, Sandwich-LN [162] adds an extra LN before the residual connections to avoid value explosion. However, it has been found that Sandwich-LN sometimes fails to stabilize the training of LLMs and may lead to a collapse of training [82]. Recently, several advanced normalization techniques have been proposed as alternatives to LN. In Gopher [59] and Chinchilla [34], RMS Norm [163] is employed due to its superiority in training speed and performance [164]. Compared with LN, DeepNorm [165] has shown a better capability to ensure stability in training, and it has been adopted by GLM-130B with post normalization. In addition, adding an extra LN after the embedding layer can also stabilize the training of LLMs; however, it tends to incur a significant performance drop [166], so it has been removed in several recent LLMs [68].
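For reference, RMS Norm rescales by the root mean square alone, dropping the mean-centering of standard LN; a minimal sketch following the formulation in [163] (the epsilon value is an illustrative default):

```python
import torch
from torch import nn

class RMSNorm(nn.Module):
    """Root-mean-square layer normalization: x / RMS(x) * g (no mean subtraction)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learnable gain g

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

x = torch.randn(2, 8, 512)
print(RMSNorm(512)(x).shape)  # torch.Size([2, 8, 512])
```

Skipping the mean subtraction and bias is what gives RMS Norm its training-speed advantage over standard LN.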
Activation Functions. To obtain good performance, activation functions also need to be properly set in the feed-forward networks. In existing LLMs, GeLU activations [167] are widely used. Besides, in the latest LLMs (e.g., PaLM and LaMDA), variants of the GLU activation [168, 169] have also been utilized, especially the SwiGLU and GeGLU variants, which often achieve better performance in practice [164]. However, compared with GeLU, they require extra parameters (about 50%) in the feed-forward networks [166].

Position Embeddings. Since the self-attention modules in the Transformer are permutation equivariant, position embeddings are employed to inject absolute or relative position information for modeling sequences. There are two variants of absolute position embeddings in the vanilla Transformer [22], i.e., sinusoidal and learned position embeddings, where the latter is commonly employed in LLMs. Unlike absolute position embeddings, relative positional encodings generate embeddings according to the offsets between keys and queries [72], so they can perform well on sequences longer than those seen during training, i.e., extrapolation [170]. ALiBi [170] biases attention scores using a penalty based on the distance between keys and queries. Empirical results have shown that it has better zero-shot generalization with a stronger extrapolation capacity than other position embeddings [29]. Besides, by setting specific rotary matrices based on the absolute position, the scores between keys and queries in RoPE [171] can be computed with relative position information, which is useful for modeling long sequences. As a result, RoPE has been widely adopted in several of the latest LLMs [56, 57, 82].

Attention and Bias. Beyond the full self-attention in the original Transformer [22], sparse attention with lower computation complexity is employed in GPT-3 (i.e., Factorized
Attention [55, 172]). In order to effectively and efficiently model longer sequences, more attempts have been made, either introducing special attention patterns [173, 174] or considering GPU memory access (i.e., FlashAttention [175]). Besides, following the original Transformer, most LLMs keep the biases in each dense kernel and Layer Norm. However, in PaLM [56] and Galactica [35], the biases are removed, and it has been demonstrated that removing biases can enhance training stability for LLMs [56].
To put all these discussions together, we summarize the suggestions from the existing literature for detailed configuration. For stronger generalization and training stability, it is suggested to choose pre RMS Norm for layer normalization, and SwiGLU or GeGLU as the activation function. Meanwhile, LN should not be used immediately after embedding layers, as it is likely to incur performance degradation. Besides, as for position embeddings, RoPE or ALiBi is a better choice, since they perform better on long sequences.
4.2.3 Pre-training Tasks
Pre-training plays a key role in encoding general knowledge from a large-scale corpus into the massive model parameters. For training LLMs, there are two commonly used pre-training tasks, namely language modeling and denoising autoencoding.

Language Modeling. The language modeling task (LM) is the most commonly used objective to pre-train decoder-only LLMs, e.g., GPT-3 [55] and PaLM [56]. Given a sequence of tokens $x = \{x_1, \ldots, x_n\}$, the LM task aims to autoregressively predict the target tokens $x_i$ based on the preceding tokens $x_{<i}$ in the sequence. A general training objective is to maximize the following likelihood:

$\mathcal{L}_{LM}(x) = \sum_{i=1}^{n} \log P(x_i \mid x_{<i})$.    (1)

Since most language tasks can be cast as a prediction problem based on the input, these decoder-only LLMs may be potentially advantageous for implicitly learning how to accomplish these tasks in a unified LM way. Some studies have also revealed that decoder-only LLMs can be naturally transferred to certain tasks by autoregressively predicting the next tokens [26, 55], without fine-tuning. An important variant of LM is the prefix language modeling task, which is designed for pre-training models with the prefix decoder architecture. The tokens within a randomly selected prefix are not used in computing the loss of prefix language modeling. With the same amount of tokens seen during pre-training, prefix language modeling performs slightly worse than language modeling, since fewer tokens in the sequence are involved in model pre-training [29].
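In implementation, maximizing the likelihood in Equation (1) amounts to a token-level cross-entropy loss over shifted inputs; a model-agnostic PyTorch sketch with assumed toy dimensions:

```python
import torch
import torch.nn.functional as F

def lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of Eq. (1): predict token i from tokens < i.

    logits: (batch, seq_len, vocab) produced by a decoder-only LM.
    tokens: (batch, seq_len) input token ids.
    """
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # positions predicting the next token
    target = tokens[:, 1:].reshape(-1)                  # the next tokens themselves
    return F.cross_entropy(pred, target)

logits = torch.randn(4, 16, 32000)        # toy batch: 4 sequences, 16 tokens, 32K vocab
tokens = torch.randint(0, 32000, (4, 16))
print(lm_loss(logits, tokens))
```

Prefix language modeling would differ only in masking out the loss terms for tokens inside the randomly selected prefix.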
Denoising Autoencoding. Besides conventional LM, the denoising autoencoding task (DAE) has also been widely used to pre-train language models [24, 72]. The inputs $x_{\setminus \tilde{x}}$ for the DAE task are corrupted text with randomly replaced spans. The language models are then trained to recover the replaced tokens $\tilde{x}$. Formally, the training objective of DAE is denoted as follows:

$\mathcal{L}_{DAE}(x) = \log P(\tilde{x} \mid x_{\setminus \tilde{x}})$.    (2)

However, the DAE task seems to be more complicated in implementation than the LM task. As a result, it has not been widely used to pre-train large language models. Existing LLMs that take DAE as the pre-training objective include T5 [72] and GLM-130B [82]. These models are mainly trained to recover the replaced spans in an autoregressive way.

4.2.4 Summary and Discussion
The choice of architecture and pre-training tasks may incur different inductive biases for LLMs, which would lead to different model capacities. In this part, we summarize some important findings or discussions from the existing literature on this issue.
• By pre-training with the LM objective, it seems that the causal decoder architecture can achieve a superior zero-shot and few-shot generalization capacity. Existing research has shown that without multi-task fine-tuning, the causal decoder has better zero-shot performance than other architectures [29]. The success of GPT-3 [55] has demonstrated that a large causal decoder model can be a good few-shot learner. In addition, instruction tuning and alignment tuning, discussed in Section 5, have been proven to further enhance the capability of large causal decoder models [61, 62, 83].
• Scaling laws have been widely observed for causal decoders. By scaling the model size, the dataset size, and the total computation, the performance of causal decoders can be substantially improved [30, 55]. Thus, it has become an important strategy to increase the model capacity of the causal decoder via scaling. However, detailed investigation of encoder-decoder models is still lacking, and more efforts are needed to investigate the performance of encoder-decoder models at a large scale.
More research efforts on architectures and pre-training objectives are needed to analyze how these choices affect the capacity of LLMs, especially for encoder-decoder architectures. Besides the major architecture, the detailed configuration of an LLM is also worth attention, as discussed in Section 4.2.2.

4.3 Model Training
In this part, we review the important settings, techniques, and tricks for training LLMs.

4.3.1 Optimization Setting
For parameter optimization of LLMs, we present the commonly used settings for batch training, learning rate, optimizer, and training stability.

Batch Training. For language model pre-training, existing work generally sets the batch size to a large number (e.g., 8,196 examples or 1.6M tokens) to improve training stability and throughput. LLMs such as GPT-3 and PaLM have introduced a new strategy that dynamically increases the batch size during training, ultimately reaching a million-token scale. Specifically, the batch size of GPT-3 gradually increases from 32K to 3.2M tokens. Empirical results have demonstrated that a dynamic batch size schedule can effectively stabilize the training process of LLMs [56].
TABLE 4
Detailed optimization settings of several existing LLMs.

Model                    Batch Size (#tokens)  Learning Rate  Warmup  Decay Method          Optimizer  Precision Type  Weight Decay  Grad Clip  Dropout
GPT3 (175B)              32K→3.2M              6 × 10^{-5}    yes     cosine decay to 10%   Adam       FP16            0.1           1.0        -
PanGu-α (200B)           -                     2 × 10^{-5}    -       -                     Adam       -               0.1           -          -
OPT (175B)               2M                    1.2 × 10^{-4}  yes     manual decay          AdamW      FP16            0.1           -          0.1
PaLM (540B)              1M→4M                 1 × 10^{-2}    no      inverse square root   Adafactor  BF16            lr^2          1.0        0.1
BLOOM (176B)             4M                    6 × 10^{-5}    yes     cosine decay to 10%   Adam       BF16            0.1           1.0        0.0
MT-NLG (530B)            64K→3.75M             5 × 10^{-5}    yes     cosine decay to 10%   Adam       BF16            0.1           1.0        -
Gopher (280B)            3M→6M                 4 × 10^{-5}    yes     cosine decay to 10%   Adam       BF16            -             1.0        -
Chinchilla (70B)         1.5M→3M               1 × 10^{-4}    yes     cosine decay to 10%   AdamW      BF16            -             -          -
Galactica (120B)         2M                    7 × 10^{-6}    yes     linear decay to 10%   AdamW      -               0.1           1.0        0.1
LaMDA (137B)             256K                  -              -       -                     -          BF16            -             -          -
Jurassic-1 (178B)        32K→3.2M              6 × 10^{-5}    yes     -                     -          -               -             -          -
LLaMA (65B)              4M                    1.5 × 10^{-4}  yes     cosine decay to 10%   AdamW      -               0.1           1.0        -
GLM (130B)               0.4M→8.25M            8 × 10^{-5}    yes     cosine decay to 10%   AdamW      FP16            0.1           1.0        0.1
T5 (11B)                 64K                   1 × 10^{-2}    no      inverse square root   AdaFactor  -               -             -          0.1
ERNIE 3.0 Titan (260B)   -                     1 × 10^{-4}    -       -                     Adam       FP16            0.1           1.0        -
PanGu-Σ (1.085T)         0.5M                  2 × 10^{-5}    yes     -                     Adam       FP16            -             -          -
Learning Rate. Existing LLMs usually adopt a similar learning rate schedule with warm-up and decay strategies during pre-training. Specifically, in the initial 0.1% to 0.5% of the training steps, a linear warm-up schedule is employed to gradually increase the learning rate to the maximum value, which ranges from approximately 5 × 10^{-5} to 1 × 10^{-4} (e.g., 6 × 10^{-5} for GPT-3). Then, a cosine decay strategy is adopted in the subsequent steps, gradually reducing the learning rate to approximately 10% of its maximum value, until the convergence of the training loss.
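A sketch of this warm-up-then-cosine schedule; the peak mirrors GPT-3's 6 × 10^{-5} and the floor the 10% decay described above, while the warm-up fraction is an illustrative choice within the 0.1%-0.5% range:

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 6e-5,
               warmup_frac: float = 0.003, floor_frac: float = 0.1) -> float:
    """Linear warm-up to peak_lr, then cosine decay to floor_frac * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = peak_lr * floor_frac
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))

print(lr_at_step(100, 100_000))      # still warming up
print(lr_at_step(50_000, 100_000))   # mid-way through the cosine decay
```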
Optimizer. The Adam optimizer [176] and the AdamW optimizer [177] are widely utilized for training LLMs (e.g., GPT-3); they are based on adaptive estimates of lower-order moments for first-order gradient-based optimization. Commonly, the hyper-parameters are set as follows: β1 = 0.9, β2 = 0.95 and ε = 10^{-8}. Meanwhile, the Adafactor optimizer [178] has also been utilized for training LLMs (e.g., PaLM and T5); it is a variant of the Adam optimizer specially designed for conserving GPU memory during training. The hyper-parameters of the Adafactor optimizer are set as: β1 = 0.9 and β2 = 1.0 − k^{-0.8}, where k denotes the number of training steps.
Stabilizing the Training. The pre-training of LLMs often suffers from training instability, which may cause model collapse. To address this issue, weight decay and gradient clipping have been widely utilized, where existing studies [55, 68, 80, 82, 97] commonly set the threshold of gradient clipping to 1.0 and the weight decay rate to 0.1. However, with the scaling of LLMs, training loss spikes are also more likely to occur, leading to unstable training. To mitigate this problem, PaLM [56] and OPT [80] use a simple strategy: restart the training process from an earlier checkpoint before the occurrence of the spike and skip over the data that may have caused the problem. Further, GLM [82] finds that abnormal gradients of the embedding layer usually lead to spikes, and proposes shrinking the embedding layer gradients to alleviate the issue.

4.3.2 Scalable Training Techniques
As the model and data sizes increase, it has become challenging to efficiently train LLMs under limited computational resources. In particular, two primary technical issues need to be resolved, i.e., increasing training throughput and loading larger models into GPU memory. In this part, we review several widely used approaches in existing work to address these two challenges, namely 3D parallelism [65, 179, 180], ZeRO [181], and mixed precision training [182], and also give general suggestions on how to utilize them for training.

3D Parallelism. 3D parallelism is a combination of three commonly used parallel training techniques, namely data parallelism, pipeline parallelism [179, 180], and tensor parallelism [65].^10 We next introduce the three parallel training techniques.

10. Model parallelism is a broader term that includes tensor parallelism and pipeline parallelism in some work [65].

• Data parallelism. Data parallelism is one of the most fundamental approaches to improving training throughput. It replicates the model parameters and optimizer states across multiple GPUs and then distributes the whole training corpus across these GPUs. In this way, each GPU only needs to process its assigned data, performing forward and backward propagation to obtain the gradients. The gradients computed on different GPUs are then aggregated to obtain the gradients of the entire batch, which are used to update the models on all GPUs. Since the gradient calculations are performed independently on different GPUs, the data parallelism mechanism is highly scalable: training throughput can be improved by increasing the number of GPUs. Furthermore, this technique is simple to implement, and most existing popular deep learning libraries, such as TensorFlow and PyTorch, have already implemented it.
• Pipeline parallelism. Pipeline parallelism aims to distribute the different layers of an LLM across multiple GPUs. Especially, in the case of a Transformer model, pipeline parallelism loads consecutive layers onto the same GPU, to reduce the cost of transmitting the computed hidden states
or gradients between GPUs. However, a naive implementation of pipeline parallelism may result in a lower GPU utilization rate, as each GPU has to wait for the previous one to complete its computation, leading to the unnecessary cost of bubble overhead [179]. To reduce these bubbles in pipeline parallelism, GPipe [179] and PipeDream [180] propose the techniques of padding multiple batches of data and asynchronous gradient updates to improve pipeline efficiency.
• Tensor parallelism. Tensor parallelism is also a commonly used technique that aims to decompose the LLM for multi-GPU loading. Unlike pipeline parallelism, tensor parallelism focuses on decomposing the tensors (the parameter matrices) of LLMs. For a matrix multiplication operation Y = XA in the LLM, the parameter matrix A can be split into two submatrices, A1 and A2, by column, which can be expressed as Y = [XA1, XA2]. By placing matrices A1 and A2 on different GPUs, the matrix multiplication is invoked on two GPUs in parallel, and the final result can be obtained by combining the outputs from the two GPUs through across-GPU communication. Currently, tensor parallelism is supported in several open-source libraries, e.g., Megatron-LM [65], and can be extended to higher-dimensional tensors. Besides, Colossal-AI has also implemented tensor parallelism for higher-dimensional tensors [183–185] and proposed sequence parallelism [186], especially for sequence data, which can further decompose the attention operation of the Transformer model.
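The column-splitting identity Y = [XA1, XA2] can be verified on a single device as below; in an actual tensor-parallel system (e.g., Megatron-LM), A1 and A2 would live on different GPUs and the final concatenation would be an across-GPU gather:

```python
import torch

batch, d_in, d_out = 8, 512, 1024
X = torch.randn(batch, d_in)
A = torch.randn(d_in, d_out)

# Split the parameter matrix A by column into A1 and A2 ...
A1, A2 = A.chunk(2, dim=1)
# ... compute both halves "in parallel" (here sequentially, on one device) ...
Y1, Y2 = X @ A1, X @ A2
# ... and gather the partial results, mimicking across-GPU communication.
Y = torch.cat([Y1, Y2], dim=1)

assert torch.allclose(Y, X @ A, atol=1e-5)
```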
ZeRO. The ZeRO [181] technique, proposed by the DeepSpeed [64] library, focuses on the issue of memory redundancy in data parallelism. As mentioned before, data parallelism requires each GPU to store the same copy of an LLM, including model parameters, model gradients, and optimizer parameters. However, not all of the above data is necessary to retain on each GPU, which causes a memory redundancy problem. To resolve it, the ZeRO technique aims to retain only a fraction of the data on each GPU, while the rest can be retrieved from other GPUs when required. Specifically, ZeRO provides three solutions, depending on how the three parts of the data are stored, namely optimizer state partitioning, gradient partitioning, and parameter partitioning. Empirical results indicate that the first two solutions do not increase the communication overhead, while the third solution increases the communication overhead by about 50% but saves memory proportional to the number of GPUs. PyTorch has implemented a technique similar to ZeRO, called FSDP [187].
Mixed Precision Training. In previous PLMs (e.g., BERT [23]), 32-bit floating-point numbers, also known as FP32, have been predominantly used for pre-training. In recent years, to pre-train extremely large language models, some studies [182] have started to utilize 16-bit floating-point numbers (FP16), which reduces memory usage and communication overhead. Additionally, as popular NVIDIA GPUs (e.g., A100) have twice the number of FP16 computation units as FP32, the computational efficiency of FP16 can be further improved. However, existing work has found that FP16 may lead to a loss of computational accuracy [59, 68], which affects the final model performance. To alleviate this, an alternative called Brain Floating Point (BF16) has been used for training, which allocates more exponent bits and fewer significand bits than FP16. For pre-training, BF16 generally performs better than FP16 in representation accuracy [68].
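As a generic sketch of BF16 mixed-precision training with PyTorch autocast (assuming a CUDA device with BF16 support, e.g., an A100; large-scale stacks wire this into their parallelism frameworks):

```python
import torch
from torch import nn

model = nn.Linear(512, 512).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 512, device="cuda")

# Forward pass under BF16 autocast; master weights stay in FP32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()   # BF16 keeps FP32's exponent range, so no loss scaling is needed
optimizer.step()
optimizer.zero_grad()
```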
Overall Training Suggestion. In practice, the above training techniques, especially 3D parallelism, are often jointly used to improve training throughput and large model loading. For instance, researchers have incorporated 8-way data parallelism, 4-way tensor parallelism, and 12-way pipeline parallelism, enabling the training of BLOOM [68] on 384 A100 GPUs. Currently, open-source libraries like DeepSpeed [64], Colossal-AI [125], and Alpa [188] support the three parallel training methods well. To reduce memory redundancy, ZeRO, FSDP, and activation recomputation techniques [67, 189] can also be employed for training LLMs; these have already been integrated into DeepSpeed, PyTorch, and Megatron-LM. Besides, mixed precision training techniques such as BF16 can also be leveraged to improve training efficiency and reduce GPU memory usage, although they require the necessary hardware support (e.g., A100 GPUs). Because training large models is a time-intensive process, it is useful to forecast model performance and detect abnormal issues at an early stage. For this purpose, GPT-4 [46] has recently introduced a new mechanism called predictable scaling, built on a deep learning stack, enabling the performance prediction of large models with much smaller models, which might be quite useful for developing LLMs. In practice, one can further leverage the supporting training techniques of mainstream deep learning frameworks. For instance, PyTorch supports the data parallel training algorithm FSDP [187] (i.e., fully sharded data parallel), which allows for partial offloading of training computations to CPUs if desired.
Besides the above training strategies, it is also important to improve the inference speed when using LLMs. Typically, quantization techniques are widely used to reduce both the time and space costs of LLMs during the inference stage [190]. With some loss in model performance, quantized language models have smaller model sizes and can achieve faster inference speed [82, 191, 192]. For model quantization, a popular choice is INT8 quantization [191]. Further, some research work attempts to develop more aggressive INT4 quantization methods [82]. Among open-source LLMs, BLOOM^11, GPT-J^12, and GLM^13 have released the corresponding quantized model copies.

5 ADAPTATION TUNING OF LLMS
After pre-training, LLMs can acquire the general abilities for solving various tasks. However, a growing number of studies have shown that an LLM's abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning. The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences. In what follows, we will introduce the two approaches in detail.

11. https://huggingface.co/joaoalvarenga/bloom-8bit
12. https://huggingface.co/hivemind/gpt-j-6B-8bit
13. https://github.com/ggerganov/llama.cpp
[Figure 4 shows three panels: (a) the instance format, consisting of a task description, optional demonstrations, and an input-output pair; (b) formatting existing NLP datasets with human-written task descriptions (e.g., prepending "Please translate the French to English:" to translation pairs); (c) formatting real human needs, pairing an API-collected query (e.g., "Can you recommend some ways to lose weight?") with a desired output written by a human.]
Fig. 4. An illustration of instance formatting and two different methods for constructing the instruction-formatted instances.
TABLE 5
A detailed list of available task collections for instruction tuning. Note that OIG is a large collection consisting of existing collections.

Collections            Time      #Task types  #Tasks  #Examples
Nat. Inst. [193]       Apr-2021  6            61      193K
CrossFit [194]         Apr-2021  13           160     7.1M
FLAN [62]              Sep-2021  12           62      4.4M
P3 [195]               Oct-2021  13           267     12.1M
ExMix [196]            Nov-2021  11           107     18M
UnifiedSKG [197]       Jan-2022  6            21      812K
Super Nat. Inst. [78]  Apr-2022  76           1616    5M
MVPCorpus [198]        Jun-2022  11           77      41M
xP3 [84]               Nov-2022  17           85      81M
OIG^14                 Mar-2023  -            -       43M

5.1 Instruction Tuning
In essence, instruction tuning is the approach of fine-tuning pre-trained LLMs on a collection of formatted instances in the form of natural language [62], which is highly related to supervised fine-tuning [61] and multi-task prompted training [28]. In order to perform instruction tuning, we first need to collect or construct instruction-formatted instances. Then, we employ these formatted instances to fine-tune LLMs in a supervised learning way (e.g., training with the sequence-to-sequence loss). After instruction tuning, LLMs can demonstrate superior abilities to generalize to unseen tasks [28, 62, 83], even in a multilingual setting [84].
A recent survey [199] presents a systematic overview of the research on instruction tuning. In comparison to that, we mainly focus on the effect of instruction tuning on LLMs and provide detailed guidelines or strategies for instance collection and tuning. Besides, we also discuss the use of instruction tuning for satisfying the real needs of users, which has been widely applied in existing LLMs, e.g., InstructGPT [61] and GPT-4 [46].

5.1.1 Formatted Instance Construction
Generally, an instruction-formatted instance consists of a task description (called an instruction), an input-output pair, and a small number of demonstrations (optional). As important public resources, existing studies have released a large amount of labeled data formatted in natural language (see the list of available resources in Table 5). Next, we introduce two major methods for constructing formatted instances (see an illustration in Figure 4) and then discuss several key factors for instance construction.

Formatting Existing Datasets. Before instruction tuning was proposed, several early studies [196, 198, 200, 201] collected instances from a diverse range of tasks (e.g., text summarization, text classification, and translation) to create supervised multi-task training datasets. As a major source of instruction tuning instances, it is convenient to format these multi-task training datasets with natural language task descriptions. Specifically, recent work [28, 61, 62, 78] augments the labeled datasets with human-written task descriptions, which instruct LLMs to understand the tasks by explaining the task goal. For example, in Figure 4(b), a task description "Please answer this question:" is added to each example in the question-answering task. After instruction tuning, LLMs can generalize well to other unseen tasks by following their task descriptions [28, 62, 83]. In particular, it has been shown that instructions are the crucial factor in the task generalization ability of LLMs [62]: fine-tuning the model on labeled datasets with the task descriptions removed results in a dramatic drop in model performance. To better generate labeled instances for instruction tuning, a crowd-sourcing platform, PromptSource [195], has been proposed to effectively create, share, and verify the task descriptions for different datasets. To enrich the training instances, several studies [28, 198, 202] also try to invert the input-output pairs of existing instances with specially designed task descriptions for instruction tuning. For instance, given a question-answer pair, we can create a new instance by predicting the question conditioned on the answer, with a task description such as "Please generate a question based on the answer:". Besides, some work [203] also leverages heuristic task templates to convert massive unlabeled texts into labeled instances.
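A sketch of both formatting strategies on a single QA pair, using the template strings from Figure 4 and the inversion trick just described (the dictionary schema is an illustrative choice):

```python
def format_instance(question: str, answer: str, invert: bool = False) -> dict:
    """Wrap a QA pair with a natural-language task description."""
    if not invert:
        return {
            "instruction": "Please answer this question:",
            "input": question,
            "output": answer,
        }
    # Inverted instance: generate the question conditioned on the answer.
    return {
        "instruction": "Please generate a question based on the answer:",
        "input": answer,
        "output": question,
    }

print(format_instance("Where is the capital of France?", "Paris."))
print(format_instance("Where is the capital of France?", "Paris.", invert=True))
```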
Formatting Human Needs. Although a large number of training instances have been formatted with instructions, they mainly come from public NLP datasets, either lacking instruction diversity or mismatching real human needs [61]. To overcome this issue, InstructGPT [61] proposes to take the queries that real users have submitted to the OpenAI API as the task descriptions. User queries are expressed in natural language, which is particularly suitable for eliciting the instruction-following ability of LLMs. Additionally, to enrich the task diversity, human labelers are also asked to compose instructions for real-life tasks, including open-ended generation, open question answering, brainstorming, and chatting. Then, another group of labelers directly answers these instructions as the output. Finally, one instruction (i.e., the collected user query) and the expected output (i.e., the human-written answer) are paired as a training instance. Note that InstructGPT also employs these real-world tasks formatted in natural language for alignment tuning (discussed in Section 5.2). Further, GPT-4 [46] has designed potentially high-risk instructions and guided the model to reject these instructions through supervised fine-tuning for safety concerns. Besides, to reduce the burden of human annotation, several semi-automated approaches [204–206] have also been proposed to construct instances by feeding existing instances into LLMs to generate diverse task descriptions and instances.

Key Factors for Instance Construction. The quality of instruction instances has an important impact on the performance of the model. Here, we discuss some essential factors for instance construction.
• Scaling the instructions. It has been widely shown that scaling the number of tasks can largely enhance the generalization ability of LLMs [28, 62, 78]. With an increasing task number, the model performance initially shows continuous growth, while the gain becomes negligible once the number reaches a certain level [78, 83]. A plausible speculation is that a certain number of representative tasks can provide relatively sufficient knowledge, and adding more tasks may not bring additional gains [83]. Besides, it is also beneficial to enhance the diversity of the task descriptions in several aspects, such as length, structure, and creativity [28]. As for the number of instances per task, it has been found that a small number of instances can usually saturate the generalization performance of the model [62, 83], whereas increasing the number of instances for some tasks to a large number (e.g., hundreds of thousands) could potentially result in overfitting and impair the model performance [78].
• Formatting design. As an important factor, the design of the natural language format also highly impacts the generalization performance of LLMs [78]. Typically, we can add task descriptions and optional demonstrations to the input-output pairs of existing datasets, where the task description is the most essential part for LLMs to understand the task [78]. Further, using an appropriate number of exemplars as demonstrations can lead to substantial improvements [83], which also alleviates the model's sensitivity to instruction engineering [62, 83]. However, incorporating other components (e.g., things to avoid, reasons, and suggestions) into instructions may have a negligible or even adverse effect on the performance of LLMs [78, 193]. Recently, to elicit the step-by-step reasoning ability of LLMs, some work [83] proposes to include chain-of-thought (CoT) examples for some reasoning datasets, such as arithmetic reasoning. It has been shown that fine-tuning LLMs with both CoT and non-CoT examples can lead to good performance across various reasoning tasks, including those that require multi-hop reasoning ability (e.g., commonsense question answering and arithmetic reasoning) as well as those that do not need such reasoning (e.g., sentiment analysis and extractive question answering) [83, 85].
To summarize, it seems that the diversity of instructions is more important than the number of instances, since the well-performing InstructGPT [61] and Alpaca [206] utilize fewer but more diverse instructions (or instances) than the Flan-series LLMs [62, 83]. Further, it is more useful to invite labelers to compose human-need tasks than to use dataset-specific tasks. However, there is still a lack of guidelines for annotating human-need instances, making the task composition somewhat heuristic. To reduce human efforts, we can either reuse existing formatted datasets (Table 5) or automatically construct the instructions using existing LLMs [204].

5.1.2 Instruction Tuning Strategies
Unlike pre-training, instruction tuning is often more efficient since only a moderate number of instances are used for training. Although instruction tuning can be considered a supervised training process, its optimization differs from pre-training in several aspects [83], such as the training objective (i.e., sequence-to-sequence loss) and optimization configuration (e.g., smaller batch size and learning rate), which require special attention in practice. In addition to these optimization configurations, there are also two important aspects to consider for instruction tuning:

Balancing the Data Distribution. Since instruction tuning involves a mixture of different tasks, it is important to balance the proportion of different tasks during fine-tuning. A widely used method is the examples-proportional mixing strategy [72], i.e., combining all the datasets and sampling each instance equally from the mixed datasets. Furthermore, increasing the sampling ratio of high-quality collections (e.g., FLAN [62] and P3 [195]) can generally lead to performance improvement according to recent findings [83, 85]. Meanwhile, it is common to set a maximum cap to control the maximum number of examples that a dataset can contribute during instruction tuning [72], which prevents larger datasets from overwhelming the entire distribution [72, 85]. In practice, the maximum cap is typically set to several thousand or tens of thousands of examples, depending on the dataset [62, 83].
Combining Instruction Tuning and Pre-Training. To make the tuning process more effective and stable, OPT-IML [85] incorporates pre-training data during instruction tuning, which can be regarded as regularization for model tuning. Further, instead of using a separate two-stage process (pre-training then instruction tuning), some studies attempt to train a model from scratch on a mixture of pre-training data (i.e., plain texts) and instruction tuning data (i.e., formatted datasets) using multi-task learning [72, 196]. Specifically, GLM-130B [82] and Galactica [35] integrate instruction-formatted datasets as a small proportion of the pre-training corpora to pre-train LLMs, which potentially achieves the advantages of pre-training and instruction tuning at the same time.

5.1.3 The Effect of Instruction Tuning
In this part, we discuss the effect of instruction tuning on LLMs in two major aspects.

Performance Improvement. Despite being tuned on a moderate number of instances, instruction tuning has become an important way to improve or unlock the abilities of LLMs [83]. Recent studies have experimented with language models at multiple scales (ranging from 77M to 540B), showing that models of different scales can all benefit from instruction tuning [83, 202], with improved performance as the parameter scale increases [84]. Further, smaller models with instruction tuning can even perform better than larger models without fine-tuning [28, 83]. Besides the model scale, instruction tuning demonstrates consistent improvements across various model architectures, pre-training objectives, and model adaptation methods [83]. In practice, instruction tuning offers a general approach to enhancing the abilities of existing language models [83] (including small-sized PLMs). It is also much less costly than pre-training, since the amount of instruction data required by LLMs is significantly smaller than the pre-training data.

Task Generalization. Instruction tuning encourages the model to understand natural language instructions for task completion. It endows LLMs with the ability (often considered an emergent ability) to follow human instructions [31] to perform specific tasks without demonstrations, even on unseen tasks [83]. A large number of studies have confirmed the effectiveness of instruction tuning in achieving superior performance on both seen and unseen tasks [85, 202]. Besides, instruction tuning has been shown to be useful in alleviating several weaknesses of LLMs (e.g., repetitive generation or completing the input without accomplishing the task) [61, 83], leading to a superior capacity to solve real-world tasks. Furthermore, LLMs trained with instruction tuning can generalize to related tasks across languages. For example, BLOOMZ-P3 [84] is fine-tuned based on BLOOM [68] using the English-only task collection P3 [195]. Interestingly, BLOOMZ-P3 achieves a more than 50% improvement in multilingual sentence completion tasks compared to BLOOM, which shows that instruction tuning can help LLMs acquire general task skills from English-only datasets and transfer such skills to other languages [84]. In addition, it has been found that using English-only instructions can produce satisfactory results on multilingual tasks [84], which helps reduce the effort of instruction engineering for a specific language.

5.2 Alignment Tuning
This part first presents the background of alignment, with its definition and criteria, then focuses on the collection of human feedback data for aligning LLMs, and finally discusses the key technique of reinforcement learning from human feedback for alignment tuning.

5.2.1 Background and Criteria for Alignment
Background. LLMs have shown remarkable capabilities in a wide range of NLP tasks [55, 56, 62, 80]. However, these models may sometimes exhibit unintended behaviors, e.g., fabricating false information, pursuing inaccurate objectives, and producing harmful, misleading, and biased expressions [61, 207]. For LLMs, the language modeling objective pre-trains the model parameters by word prediction while lacking consideration of human values or preferences. To avert these unexpected behaviors, human alignment has been proposed to make LLMs act in line with human expectations [61, 100]. However, unlike the original pre-training and adaptation tuning (e.g., instruction tuning), such alignment requires considering very different criteria (e.g., helpfulness, honesty, and harmlessness). It has been shown that alignment might harm the general abilities of LLMs to some extent, which is called the alignment tax in the related literature [61, 208, 209].

Alignment Criteria. Recently, increasing attention has been paid to developing multifarious criteria to regulate the behaviors of LLMs. Here, we take three representative alignment criteria (i.e., helpful, honest, and harmless) as examples for discussion, which have been widely adopted in the existing literature [61, 207, 208]. Besides, there are other alignment criteria for LLMs from different perspectives, including behavior, intent, incentive, and inner aspects [207], which are essentially similar to (or at least rely on similar alignment techniques as) the above three criteria. It is also feasible to modify the three criteria according to specific needs, e.g., substituting honesty with correctness [100] or focusing on some specified criteria [209]. Next, we give brief explanations of the three representative alignment criteria:
• Helpfulness. To be helpful, the LLM should demonstrate a clear attempt to assist users in solving their tasks or answering questions in as concise and efficient a manner as possible. At a higher level, when further clarification is needed, the LLM should demonstrate the capability of eliciting additional relevant information through pertinent inquiries and exhibit suitable levels of sensitivity, perceptiveness, and prudence [208]. Realizing the alignment of helpful behavior is challenging for LLMs, since it is difficult to precisely define and measure the intention of users [207].
• Honesty. At a basic level, an LLM aligned to be honest should present accurate content to users instead of fabricating information. Additionally, it is crucial for the LLM to convey appropriate degrees of uncertainty in its output, in order to avoid any form of deception or misrepresentation of information. This requires the model to know about its capabilities and levels of knowledge (e.g., "known unknowns"). According to the discussion in [208], honesty is a more objective criterion than helpfulness and harmlessness; hence, honesty alignment could potentially be developed with less reliance on human efforts.
• Harmlessness. To be harmless, the language produced by the model should not be offensive or discriminatory. To the best of its abilities, the model should be capable of detecting covert endeavors aimed at soliciting requests for malicious purposes. Ideally, when the model is induced to conduct a dangerous action
(e.g., committing a crime), the LLM should politely refuse. Nonetheless, what behaviors are deemed harmful, and to what extent, varies among individuals and societies [208], and depends highly on who is using the LLM, the type of question posed, and the context (e.g., the time) at which the LLM is being used.
As we can see, these criteria are quite subjective and are developed based on human cognition. Thus, it is difficult to directly formulate them as optimization objectives for LLMs. In existing work, there are many ways to fulfill these criteria when aligning LLMs. A promising technique is red teaming [210, 211], which involves using manual or automated means to probe LLMs in an adversarial way to generate harmful outputs, and then updating LLMs to prevent such outputs.

5.2.2 Collecting Human Feedback
During the pre-training stage, LLMs are trained using the language modeling objective on a large-scale corpus. However, this objective cannot take into account the subjective and qualitative evaluations of LLM outputs by humans (called human feedback in this survey). High-quality human feedback is extremely important for aligning LLMs with human preferences and values. In this part, we discuss how to select a team of human labelers for feedback data collection.

Human Labeler Selection. In existing work, the dominant method for generating human feedback data is human annotation [61, 100, 212]. This highlights the critical role of selecting appropriate human labelers. To provide high-quality feedback, human labelers are supposed to have a qualified level of education and excellent proficiency in English. For example, Sparrow [100] requires human labelers to be UK-based native English speakers who have obtained at least an undergraduate-level educational qualification. Further, in [209], about half of the human labelers for high-priority tasks were recruited from the US-based Amazon Mechanical Turk workforce with a master's qualification. Even then, several studies [212, 213] have found that there still exists a mismatch between the intentions of researchers and human labelers, which may lead to low-quality human feedback and cause LLMs to produce unexpected outputs. To address this issue, InstructGPT [61] further conducts a screening process to filter labelers by assessing the agreement between human labelers and researchers. Specifically, researchers first label a small amount of data and then measure the agreement between themselves and human labelers. The labelers with the highest agreement are selected to proceed with the subsequent annotation work. In some other work [214], "super raters" are used to ensure the high quality of human feedback. Researchers evaluate the performance of human labelers and select a group of well-performing labelers (e.g., with high agreement) as super raters. The super raters are given priority to collaborate with the researchers in subsequent studies. When human labelers annotate the output of LLMs, it is helpful to specify detailed instructions and provide instant guidance for the labelers [213], which can further regulate their annotations.

Human Feedback Collection. In existing work, there are mainly three kinds of approaches to collecting feedback and preference data from human labelers.
• Ranking-based approach. In early work [212, 215], human labelers often evaluated model-generated outputs in a coarse-grained manner (i.e., only selecting the best) without taking into account more fine-grained alignment criteria. Nonetheless, different labelers may hold diverse opinions on the selection of the best candidate output, and this method disregards the unselected samples, which may lead to inaccurate or incomplete human feedback. To address this issue, subsequent studies [100, 209] introduce the Elo rating system to derive the preference ranking by comparing candidate outputs. The ranking of outputs serves as the training signal that guides the model to prefer certain outputs over others, thus inducing outputs that are more reliable and safer.
• Question-based approach. Further, human labelers can provide more detailed feedback by answering certain questions designed by researchers [71], covering the alignment criteria as well as additional constraints for LLMs. Specifically, in WebGPT [71], to assist the model in filtering and utilizing relevant information from retrieved documents, human labelers are required to answer multiple-choice questions about whether the retrieved documents are useful for answering the given input.
• Rule-based approach. Besides, many studies develop rule-based methods to provide more detailed human feedback. As a typical case, Sparrow [100] not only selects the response that labelers consider the best but also uses a series of rules to test whether model-generated responses meet the alignment criteria of being helpful, correct, and harmless. In this way, two kinds of human feedback data can be obtained: (1) the response preference feedback, obtained by comparing the quality of model-generated outputs in pairs, and (2) the rule violation feedback, obtained by collecting the assessment from human labelers (i.e., a score indicating to what extent the generated output has violated the rules). Furthermore, GPT-4 [46] utilizes a set of zero-shot classifiers (based on GPT-4 itself) as rule-based reward models, which can automatically determine whether the model-generated outputs violate a set of human-written rules.
In the following, we focus on a well-known technique, reinforcement learning from human feedback (RLHF), which has been widely used in recent powerful LLMs such as ChatGPT. As discussed below, the alignment criteria introduced in Section 5.2.1 can be fulfilled by learning from human feedback on the responses of LLMs to users' queries.

5.2.3 Reinforcement Learning from Human Feedback
To align LLMs with human values, reinforcement learning from human feedback (RLHF) [69, 212] has been proposed to fine-tune LLMs with the collected human feedback data, which is useful for improving the alignment criteria (e.g., helpfulness, honesty, and harmlessness). RLHF employs reinforcement learning (RL) algorithms (e.g., Proximal Policy Optimization (PPO) [216]) to adapt LLMs to human feedback by learning a reward model. Such an approach incorporates humans in the training loop for developing well-aligned LLMs, as exemplified by InstructGPT [61].

RLHF System. The RLHF system mainly comprises three key components: a pre-trained LM to be aligned, a reward model learning from human feedback, and an RL algorithm
20

Supervised Fine-tuning the human-generated prompt) as input. We then invite


Prompts human labelers to annotate the preference for these pairs.
Training with demonstration data
The annotation process can be conducted in multiple forms,
Human
Annotator
Demonstrations Pre-trained LM
🔥 and a common approach is to annotate by ranking the
generated candidate texts, which can reduce the inconsis-
Demonstration Data
tency among annotators. Then, the RM is trained to predict
the human-preferred output. In InstructGPT, labelers rank
Reward Model Training model-generated outputs from best to worst, and the RM
🔥 (i.e., 6B GPT-3) is trained to predict the ranking.
Prompts LM Outputs Reward
Model
Pre-trained LM
🧊 • RL fine-tuning. At this step, aligning (i.e., fine-tuning)
the LM is formalized as an RL problem. In this setting,
the pre-trained LM acts as the policy that takes as input
Ranking Human Feedback Training with feedback data a prompt and returns an output text, the action space of
it is the vocabulary, the state is the currently generated
RL Fine-tuning token sequence, and the reward is provided by the RM. To
🧊 avoid eviating significantly from the initial (before tuning)
Prompts
Reward
Model
Aligned LM
🔥 LM, a penalty term is commonly incorporated into the
reward function. For example, InstructGPT optimizes the
LM Outputs 😊/ 😞
Reward
Training with RL algorithm (PPO)
LM against the RM using the PPO algorithm. For each input
prompt, InstructGPT calculates the KL divergence between
the generated results from the current LM and the initial LM
Fig. 5. The workflow of the RLHF algorithm. as the penalty. It is noted that the second and final steps can
be iterated in multiple turns for better aligning LLMs.

training the LM. Specifically, the pre-trained LM is typically


a generative model that is initialized with existing pre- 6 U TILIZATION
trained LM parameters. For example, OpenAI uses 175B
GPT-3 for its first popular RLHF model, InstructGPT [61], After pre-training or adaptation tuning, a major approach
and DeepMind uses the 280 billion parameter model Go- to using LLMs is to design suitable prompting strategies for
pher [59] for its GopherCite model [214]. Further, the reward solving various tasks. A typical prompting method is in-
model (RM) provides (learned) guidance signals that reflect context learning [50, 55], which formulates the task descrip-
human preferences for the text generated by the LM, usually tion and/or demonstrations in the form of natural language
in the form of a scalar value. The reward model can take on text. In addition, chain-of-thought prompting [33] can be em-
two forms: a fine-tuned LM or a LM trained de novo using ployed to enhance in-context learning by involving a series
human preference data. Existing work typically employs of intermediate reasoning steps into prompts. Next, we will
reward models having a parameter scale different from that elaborate on the details of the two techniques.
of the aligned LM [61, 214]. For example, OpenAI uses 6B
GPT-3 and DeepMind uses 7B Gopher as the reward model,
6.1 In-Context Learning
respectively. Finally, to optimize the pre-trained LM using
the signal from the reward model, a specific RL algorithm As a special prompting form, in-context learning (ICL) is
is designed for large-scale model tuning. Specifically, Prox- first proposed along with GPT-3 [55], which has become a
imal Policy Optimization (PPO) [216] is a widely used RL typical approach to utilizing LLMs.
algorithm for alignment in existing work [61, 100, 214].

Key Steps for RLHF. Figure 5 illustrates the overall three- 6.1.1 Prompting Formulation
step process of RLHF [61, 213] as introduced below. As stated in [55], ICL uses a formatted natural language
• Supervised fine-tuning. To make the LM initially perform prompt, consisting of the task description and/or a few task
desired behaviors, it usually needs to collect a supervised examples as demonstrations. Figure 6 presents the illustra-
dataset containing input prompts (instruction) and desired tion of ICL. First, starting with a task description, a few ex-
outputs for fine-tuning the LM. These prompts and outputs amples are selected from the task dataset as demonstrations.
can be written by human labelers for some specific tasks Then, they are combined in a specific order to form nat-
while ensuring the diversity of tasks. For example, Instruct- ural language prompts with specially designed templates.
GPT [61] asks human labelers to compose prompts (e.g., Finally, the test instance is appended to the demonstration
“List five ideas for how to regain enthusiasm for my career”) and as the input for LLMs to generate the output. Based on task
desired outputs for several generative tasks such as open demonstrations, LLMs can recognize and perform a new
QA, brainstorming, chatting, and rewriting. Note that the task without explicit gradient update.
first step is optional in specific settings or scenarios. Formally, let Dk = {f (x1 , y1 ), . . . , f (xk , yk )} represent
• Reward model training. The second step is to train the a set of demonstrations with k examples, where f (xk , yk ) is
RM using human feedback data. Specifically, we employ the prompt function that transforms the k -th task example
the LM to generate a certain number of output texts using into natural language prompts. Given the task description
sampled prompts (from either the supervised dataset or I , demonstration Dk , and a new input query xk+1 , the
21

In-Context Learning Chain-of-Thought Prompting


Answer the following mathematical reasoning questions: Answer the following mathematical reasoning questions:

Q: If you have 12 candies and you give 4 candies to your friend, Q: If a rectangle has a length of 6 cm and a width of 3 cm,
how many candies do you have left? what is the perimeter of the rectangle?
Nx A: The answer is 8. Nx A: For a rectangle, add up the length and width and double it.
If a rectangle has a length of 6 cm and a width of 3 cm,
Q: what is the perimeter of the rectangle? So, the perimeter of this rectangle is (6 + 3) x 2 = 18 cm.

A: The answer is 18 cm. The answer is 18 cm.

Q: Sam has 12 marbles. He gives 1/4 of them to his sister. Q: Sam has 12 marbles. He gives 1/4 of them to his sister.
How many marbles does Sam have left? How many marbles does Sam have left?

A: He gives (1 / 4) x 12 = 3 marbles.
A: The answer is 9. LLM So Sam is left with 12 – 3 = 9 marbles.
The answer is 9.

: Task description : Demonstration : Chain-of-Thought : Query

Fig. 6. A comparative illustration of in-context learning (ICL) and chain-of-thought (CoT) prompting. ICL prompts LLMs with a natural language
description, several demonstrations, and a test query. While CoT prompting involves a series of intermediate reasoning steps in prompts.

prediction of the output ŷk+1 generated from LLMs can be Following the discussion in Section 6.1.1, we will introduce
formulated as follows15 : the demonstration design of ICL from three major aspects,
 i.e., demonstration selection, format, and order.
LLM I, f (x1 , y1 ), . . . , f (xk , yk ), f (xk+1 , ) → ŷk+1 .
| {z } | {z } |{z}
demonstrations input answer Demonstration Selection. The performance of ICL tends
(3) to have a large variance with different demonstration exam-
where the actual answer yk+1 is left as a blank to be ples [220], so it is important to select a subset of examples
predicted by the LLM. Since the performance of ICL heavily that can effectively leverage the ICL capability of LLMs.
relies on demonstrations, it is an important issue to properly There are two main demonstration selection approaches,
design them in the prompts. According to the construction namely heuristic and LLM-based approaches:
process in Equation (3), we focus on three major aspects in • Heuristic approaches. Due to the simplicity and low
formatting demonstrations in the prompts, including how to costs, existing work widely adopts heuristic methods to
select examples that make up demonstrations, format each select demonstrations. Several studies employ a k -NN based
example into the prompt with the function f (·), and arrange retriever to select examples that are semantically relevant to
demonstrations in a reasonable order. the query [220, 221]. However, they perform the selection
A comprehensive review of ICL has been presented in individually for each example, rather than evaluating the
the survey paper [50], and we suggest the readers refer- example set as a whole. To resolve this issue, diversity-
ring to it for a more general, detailed discussion on this based selection strategies are proposed to choose the most
topic. Compared with this survey, we specially focus on the representative set of examples for specific tasks [222, 223].
discussion of applying ICL to LLMs in two major aspects, Furthermore, in [224], both relevance and diversity are taken
i.e., demonstration design and the underlying mechanism into consideration when selecting demonstrations.
of ICL. Besides, ICL also has a close connection with • LLM-based approaches. Another line of work selects
instruction tuning (discussed in Section 5.1) in that both demonstrations by making use of LLMs. For example, LLMs
utilize natural language to format the task or instances. can be utilized to directly measure the informativeness
However, instruction tuning needs to fine-tune LLMs for of each example according to the performance gain after
adaptation, while ICL only prompts LLMs for utilization. adding the example [225]. Besides, EPR [226] proposes a
Furthermore, instruction tuning can enhance the ICL ability two-stage retrieval approach that first recalls similar ex-
of LLMs to perform target tasks, especially in the zero-shot amples with an unsupervised method (e.g., BM25) and
setting (only using task descriptions) [83]. then ranks them using a dense retriever (trained with
positive and negative examples labeled by LLMs). As an
6.1.2 Demonstration Design
alternative approach, the task of demonstration selection
Several studies have shown that the effectiveness of ICL is can be formulated into a RL problem, where LLMs serve
highly affected by the design of demonstrations [217–219] as the reward function to provide feedback for training
15. When ICL was introduced in the GPT-3’s paper [55], it was
the policy model [227]. Since LLMs perform well for text
originally defined to be a combination of the task description and annotation [228], some recent studies employ LLM itself
demonstration examples, wherein either component is dispensable. as the demonstration generator without human interven-
Following this definition, when a LLM is required to solve an unseen tion [229, 230].
task by using only task descriptions, it can be also considered to
perform ICL for task solving, whereas the ICL ability can be enhanced To summarize, as discussed in [231], the selected demon-
by instruction tuning. stration examples in ICL should contain sufficient informa-
22

tion about the task to solve as well as be relevant to the test els [235]. It suggests that the design of training tasks is an
query, for the above two selection approaches. important influence factor of the ICL capability of LLMs.
Besides training tasks, recent studies have also investigated
Demonstration Format. After selecting task examples, the
the relationship between ICL and the pre-training cor-
next step is to integrate and format them into a natural
pora [231, 236, 237]. It has been shown that the performance
language prompt for LLMs. A straightforward method is to
of ICL heavily depends on the source of pre-training corpora
instantiate a pre-defined template with the corresponding
rather than the scale [237]. Another study [236] provides an
input-output pairs [36]. To construct more informative tem-
in-depth analysis of the impact of training data distribution.
plates, recent studies consider adding task descriptions [83]
They find that ICL emerges when the training data can be
or enhancing the reasoning capability of LLMs with chain-
clustered into numerous infrequent classes, instead of being
of-thought prompts [33]. For instance, in [193], the authors
uniformly distributed. Furthermore, the authors in [231]
collect a large-scale dataset with task descriptions written by
theoretically explain ICL as the product of pre-training on
humans. After tuning with this dataset, the performance on
documents that exhibit long-range coherence.
seen tasks can be boosted, and LLMs can also generalize to
unseen tasks to some extent. To reduce the annotation costs, How LLMs Perform ICL? At the inference stage, researchers
a semi-automated approach has been proposed in [204] focus on analyzing how the ICL capability operates based
by employing a seed set consisting of human-written task on given demonstrations since no explicit learning or up-
descriptions to guide LLMs to generate task descriptions dating is involved. They typically analyze from the per-
for new tasks. Since it is costly to manually annotate spective of gradient descent and consider ICL as implicit
demonstration formats for different tasks, some work also fine-tuning [60, 238]. Under this framework, the ICL process
studies how to automatically generate high-quality ones. can be explained as follows: by means of forward computa-
As two representative methods, Auto-CoT [232] leverages tion, LLMs generate meta-gradients with respect to demon-
LLMs with the zero-shot prompt “Let’s think step by step” strations and implicitly perform gradient descent via the
for generating intermediate reasoning steps, while least-to- attention mechanism. Experiments also show that certain
most prompting [233] first queries LLMs to perform prob- attention heads in LLMs are capable of performing task-
lem decomposition and then utilizes LLMs to sequentially agnostic atomic operations (e.g., copying and prefix match-
solve sub-problems based on the intermediate answers to ing), which are closely related to the ICL ability [239, 240].
previously solved ones. To further explore the working mechanism of ICL, some
studies abstract ICL as an algorithm learning process [241–
Demonstration Order. LLMs are shown to sometimes suffer
243]. Specifically, the authors in [242] find that LLMs es-
from the recency bias, i.e., they are prone to repeat answers
sentially encode implicit models through their parameters
that are near the end of demonstrations [219]. Thus, it is
during pre-training. With the examples provided in ICL,
important to arrange demonstrations (i.e., task examples)
LLMs can implement learning algorithms such as gradient
in a reasonable order. Early work proposes several heuris-
descent or directly compute the closed-form solution to
tic methods to quickly find a good order. For example,
update these models during forward computation. Under
demonstrations can be directly organized according to their
this explanation framework, it has been shown that LLMs
similarity to the query in the embedding space [220]: the
can effectively learn simple linear functions and even some
more similar, the closer to the end. Besides, global and local
complex functions like decision trees with ICL [241–243].
entropy metrics can be used to score different demonstra-
tion orders [218]. To integrate more task information, some
recent studies propose to minimize the code length required 6.2 Chain-of-Thought Prompting
to compress and transmit task labels, which is inspired
by information theory [234]. However, these methods need Chain-of-Thought (CoT) [33] is an improved prompting
additional labeled data as the validation set to evaluate the strategy to boost the performance of LLMs on complex
performance of specific demonstration orders. To eliminate reasoning tasks, such as arithmetic reasoning [244–246],
this need, the authors in [218] propose to sample the valida- commonsense reasoning [247, 248], and symbolic reason-
tion data from the LLM itself. ing [33]. Instead of simply constructing the prompts with
input-output pairs as in ICL, CoT incorporates intermediate
6.1.3 Underlying Mechanism reasoning steps that can lead to the final output into the
prompts. In the following, we will elaborate on the usage of
After pre-training, LLMs can exhibit intriguing ICL capabil- CoT with ICL and discuss when and why CoT prompting
ity without being updated. In what follows, we discuss two works.
key questions about the ICL ability of LLMs, i.e., “how does
pre-training affect the ICL ability” and “how do LLMs perform
ICL during inference”. 6.2.1 In-context Learning with CoT
Typically, CoT can be used with ICL in two major settings,
How Pre-Training Affects ICL? ICL is first proposed in namely the few-shot and zero-shot settings, as introduced
GPT-3 [55], and it has shown that the ICL ability becomes below.
more significant with a larger model size. While, some
studies reveal that small-scale PLMs can also demonstrate Few-shot CoT. Few-shot CoT is a special case of ICL, which
a strong ICL ability with specially designed training tasks augments each demonstration hinput, outputi as hinput, CoT,
(e.g., learning to predict the label with task examples and outputi by incorporating the CoT reasoning steps. To apply
the query as the input), and may even surpass larger mod- this strategy, we next discuss two key issues, i.e., how to
23

design appropriate CoT prompts and how to utilize the exceeds a certain size, but is not effective with small-scale
generated CoTs for deriving the final answer. models, showing a significant pattern of emergent abilities.
• CoT prompt design. It is critical to design appropriate In order to unlock the CoT ability on more tasks, Flan-T5
CoT prompts for effectively eliciting the complex reasoning and Flan-PaLM [83] further perform instruction tuning on
ability of LLMs. As a direct approach, it is shown that CoT annotations and the zero-shot performance on unseen
using diverse CoTs (i.e., multiple reasoning paths for each tasks has been improved.
problem) can effectively enhance their performance [249].
Another intuitive idea is that prompts with more complex 6.2.2 Further Discussion on CoT
reasoning paths are more likely to elicit the reasoning ability In this part, we present discussions regarding two funda-
of LLMs [250], which can result in higher accuracy in mental questions related to CoT, i.e., “when does CoT work for
generating correct answers. However, both of these two LLMs” and “why can LLMs perform CoT reasoning”.
approaches rely on annotated CoT datasets, which limits When CoT works for LLMs? Since CoT is an emergent
their use in practice. To overcome this limitation, Auto- ability [31], it only has a positive effect on sufficiently
CoT [232] proposes to utilize Zero-shot-CoT [251] (detailed large models (e.g., typically containing 10B or more pa-
in the following part “Zero-shot CoT”) to generate CoT rea- rameters [33]) but not on small models. Moreover, since
soning paths by specially prompting LLMs, thus eliminating CoT augments the standard prompting with intermediate
manual efforts. In order to boost the performance, Auto-CoT reasoning steps, it is mainly effective to improve the tasks
further divides the questions in the training set into different that require step-by-step reasoning [33], such as arithmetic
clusters and then chooses the questions that are closest to the reasoning, commonsense reasoning, and symbolic reason-
centroid of each cluster, which is supposed to well represent ing. Whereas, for other tasks that do not rely on complex
the questions in the training set. Although few-shot CoT can reasoning, it might show worse performance than standard
be considered as a special prompt case of ICL, the ordering prompting [253], e.g., MNLI-m/mm, SST-2, and QQP from
of demonstrations seems to have a relatively small impact GLUE [257]. Interestingly, it seems that the performance
compared to the standard prompt in ICL: reordering the gain brought by CoT prompting could be significant only
demonstrations only results in a performance variation of when standard prompting yields poor results [33].
less than 2% in most tasks [33].
• Enhanced CoT strategies. Besides enriching the contex- Why LLMs Can Perform CoT Reasoning? As the second
tual information, CoT prompting also provides more op- question, we discuss the underlying mechanism of CoT in
tions to infer the answer given a question. Existing studies the following two aspects.
mainly focus on generating multiple reasoning paths, and • The source of CoT ability. Regarding the source of CoT
try to find a consensus among the derived answers [252– capability, it is widely hypothesized that it can be attributed
254]. For instance, self-consistency [252] is proposed as a to training on code since models trained on it show a strong
new decoding strategy when generating CoT and the final reasoning ability [47, 258]. Intuitively, code data is well orga-
answer. It first generates several reasoning paths and then nized with algorithmic logic and programming flow, which
takes an ensemble over all the answers (e.g., selecting the may be useful to improve the reasoning performance of
most consistent answer by voting among these paths). Self- LLMs. However, this hypothesis still lacks publicly reported
consistency boosts the performance in CoT reasoning by evidence of ablation experiments (with and without training
a large margin, and can even improve some tasks where on code). Besides, instruction tuning seems not to be the key
CoT prompting is usually worse than standard prompting reason to obtain the CoT ability, since it has been empirically
(e.g., closed-book question answering and natural language shown that instruction tuning on non-CoT data does not
inference). Further, the authors in [253] expand the self- improve the performance on held-out CoT benchmarks [83].
consistency strategy to a more general ensemble frame- • The effect of prompting components. The major distinction
work (extending to ensemble on the prompts), and they find between CoT prompting and standard prompting is the
that diverse reasoning paths are the key to the performance incorporation of reasoning paths prior to the final answer.
improvement in CoT reasoning. The above methods can Thus, some researchers investigate the effect of different
be easily integrated into CoT prompting to enhance the components in the reasoning paths. Specifically, a recent
performance without additional training. In contrast, other study identifies three key components in CoT prompting,
studies train a scoring model to measure the reliability of the namely symbols (e.g., numerical quantities in arithmetic rea-
generated reasoning paths [249] or continually train LLMs soning), patterns (e.g., equations in arithmetic reasoning),
on the reasoning paths generated by themselves [255, 256] and text (i.e., the rest of tokens that are not symbols or
to improve the performance. patterns) [259]. It is shown that the latter two parts (i.e., pat-
terns and text) are essential to the model performance, and
Zero-shot CoT. Different from few-shot CoT, zero-shot CoT removing either one would lead to a significant performance
does not include human-annotated task demonstrations in drop. However, the correctness of symbols and patterns
the prompts. Instead, it directly generates reasoning steps does not seem critical. Further, there exists a symbiotic
and then employs the generated CoTs to derive the answers. relationship between text and patterns: the text helps LLMs
Zero-shot CoT is first proposed in [251], where the LLM to generate useful patterns, and patterns aid LLMs to under-
is first prompted by “Let’s think step by step” to generate stand tasks and generate texts that help solve them [259].
reasoning steps and then prompted by “Therefore, the answer In summary, CoT prompting provides a general yet
is” to derive the final answer. They find that such a strategy flexible approach to eliciting the reasoning ability of LLMs.
drastically boosts the performance when the model scale There are also some preliminary attempts that extend this
24

technique to solve multimodal tasks [260] and multilingual Conditional Text Generation. As an important topic in lan-
tasks [261]. In addition to directly utilizing LLMs with ICL guage generation, conditional text generation [48] focuses
and CoT, some recent studies explore how to specialize the on generating texts satisfying specific task demands based
ability of LLMs towards specific tasks [262–264], which is on the given conditions, typically including machine trans-
called model specialization [265]. For example, the researchers lation [338], text summarization [339], and question answer-
in [265] specialize the ability of mathematical reasoning ing [340]. To measure the quality of the generated text, auto-
from LLMs through fine-tuning the small-scale Flan-T5 [83] matic metrics (e.g., Accuracy, BLEU [341] and ROUGE [342])
on CoT reasoning paths generated by LLMs. Model spe- and human ratings have been typically used for evaluating
cialization can also be applied to solve a variety of tasks the performance. Due to the powerful language generation
like question answering [266], code synthesis [267], and capabilities, LLMs have achieved remarkable performance
information retrieval [268]. on existing datasets and benchmarks, even surpassing hu-
man performance (on test datasets). For instance, given
only 32 examples as the input, GPT-3 with in-context learn-
7 C APACITY E VALUATION
ing can outperform a full-data fine-tuned BERT-Large on
To examine the effectiveness and superiority of LLMs, a the average score of SuperGLUE [283]; on MMLU, a 5-
surge of tasks and benchmarks have been leveraged for shot Chinchilla [34] nearly doubles the average accuracy
conducting empirical evaluation and analysis. We first intro- of human raters, and GPT-4 [46] in 5-shot setting further
duce three types of basic evaluation tasks of LLMs for lan- achieves the state-of-the-art performance which yields more
guage generation and understanding, then present several than 10% improvement in average accuracy compared to the
advanced tasks of LLMs with more complicated settings or previous best model. Thus, it raises serious concern about
goals, and finally discuss existing benchmarks and empirical whether existing benchmarks for conditional text generation
analyses. tasks can appropriately evaluate and reflect the capability
of LLMs. Considering this issue, researchers try to make
7.1 Basic Evaluation Tasks new evaluation benchmarks (e.g., BIG-bench Hard [285]) by
In this part, we mainly focus on three types of evaluation collecting currently unsolvable tasks (i.e., the task on which
tasks for LLMs, i.e., language generation, knowledge uti- LLMs fail to perform well) or creating more challenging
lization, and complex reasoning. It is noted that we do not tasks, e.g., super-long text generation [343]. Moreover, recent
intend to have complete coverage of all the related tasks, but studies also find that the automatic metrics may underesti-
instead only focus on the most widely discussed or studied mate the generation quality of LLMs. In OpenDialKG [282],
tasks for LLMs. Next, we introduce these tasks in detail. ChatGPT underperforms a fine-tuned GPT-2 on BLEU and
ROUGE-L metrics, while earning more favor from human
7.1.1 Language Generation judgment [344]. Therefore, more efforts need to be devoted
to developing new metrics that are more aligned with
According to the task definition, existing tasks about lan-
human judgment.
guage generation can be roughly categorized into language
modeling, conditional text generation, and code synthesis Code Synthesis. Besides generating high-quality natural
tasks. Note that code synthesis is not a typical NLP task, we language, existing LLMs also show strong abilities to gen-
include it for discussion because it can be directly solved erate formal language, especially computer programs (i.e.,
by a number of LLMs (trained on code data) in a similar code) that satisfy specific conditions, called code synthe-
generation approach as natural language text. sis [345]. Unlike natural language generation, as the gen-
erated code can be directly checked by execution with cor-
Language Modeling. As the most fundamental ability of
responding compilers or interpreters, existing work mostly
LLMs, language modeling aims to predict the next token
evaluates the quality of the generated code from LLMs by
based on the previous tokens [15], which mainly focuses
calculating the pass rate against the test cases, i.e., pass@k 16 .
on the capacity of basic language understanding and gen-
Recently, several code benchmarks focusing on functional
eration. For evaluating such an ability, typical language
correctness are proposed to assess the code synthesis abil-
modeling datasets that existing work uses include Penn
ities of LLMs, such as APPS [287], HumanEval [88], and
Treebank [269], WikiText-103 [270], and the Pile [117], where
MBPP [140]. Typically, they consist of diverse program-
the metric of perplexity is commonly used for evaluating the
ming problems, with text specification and test cases for
model performance under the zero-shot setting. Empirical
correctness checking. To improve such an ability, it is key
studies [55, 82] show that LLMs bring substantial per-
to fine-tuning (or pre-training) LLMs on code data, which
formance gains over the previous state-of-the-art methods
can effectively adapt LLMs to code synthesis tasks [76]. Be-
on these evaluation datasets. To better test the modeling
sides, existing work has proposed new strategies to generate
capacity of long-range dependencies in text, the LAMBADA
code, e.g., sampling multiple candidate solutions [140] and
dataset [155] has been introduced, where LLMs are required
planning-guided decoding [346], which can be considered
to predict the last word of sentences based on a paragraph of
as the imitation of bug-fixing and code-planning processes
context. Then, the accuracy and perplexity of the predicted
by programmers. Impressively, LLMs have recently shown
last words are employed to evaluate LLMs. As shown in
competitive performance with humans by achieving a rank-
existing work, the performance on the language modeling
ing of the top 28% among users on the programming contest
tasks typically follows the scaling law [30], which means
that scaling language models would improve the accuracy 16. Given k programs generated by the LLM, pass@k is computed as
and reduce the perplexity. 1 when at least one program passes all test cases, or else 0
25

TABLE 6
Basic evaluation tasks and corresponding representative datasets of LLMs.

Task Dataset
Language Modeling Penn Treebank [269], WikiText-103 [270], the Pile [117], LAMBADA [155]
WMT’14,16,19,20,21,22 [271–276], Flores-101 [277], DiaBLa [278],
Conditional Text Generation CNN/DailyMail [279], XSum [280], WikiLingua [281], OpenDialKG [282]
Language Generation
SuperGLUE [283], MMLU [284], BIG-bench Hard [285], CLUE [286]
APPS [287], HumanEval [88], MBPP [140], CodeContest [98], MTPB [76],
Code Synthesis
DS-1000 [288], ODEX [289]
Natural Questions [290], ARC [291], TruthfulQA [292], Web Questions [293],
Closed-Book QA TriviaQA [294], PIQA [295], LC-quad2.0 [296], GrailQA [297], KQApro [298],
CWQ [299], MKQA [300], ScienceQA [301]
Knowledge Utilization Natural Questions [290], OpenBookQA [302], ARC [291], Web Questions [293],
Open-Book QA
TriviaQA [294], MS MARCO [303], QASC [304], SQuAD [305], WikiMovies [306]
WikiFact [307], FB15k-237 [308], Freebase [309], WN18RR [310], WordNet [311],
Knowledge Completion
LAMA [312], YAGO3-10 [313], YAGO [314]
CSQA [247], StrategyQA [248], ARC [291], BoolQ [315], PIQA [295], SIQA [316],
HellaSwag [317], WinoGrande [318], OpenBookQA [302], COPA [319],
Knowledge Reasoning
ScienceQA [301], proScript [320], ProPara [321], ExplaGraphs [322],
ProofWriter [323], EntailmentBank [324], ProOntoQA [325]
CoinFlip [33], ReverseList [33], LastLetter [33], Boolean Assignment [326],
Complex Reasoning
Symbolic Reasoning Parity [326], Colored Object [327], Penguins in a Table [327],
Repeat Copy [328], Object Counting [328]
MATH [284], GSM8k [244], SVAMP [245], MultiArith [329], ASDiv [246],
Mathematical Reasoning MathQA [330], AQUA-RAT [331], MAWPS [332], DROP [333], NaturalProofs [334],
PISA [335], miniF2F [336], ProofNet [337]

platform Codeforces [98]. Further, GitHub Copilot has been into multiple steps such as planning, drafting, rewriting,
released to assist programming in coding IDEs (e.g., Visual and editting [343]. Several studies have proven that iterative
Studio and JetBrains IDEs), which can support a variety prompting can elicit relevant knowledge to achieve better
of languages including Python, JavaScript, and Java. A performance in sub-tasks [348, 349]. In essence, chain-of-
viewpoint article entitled “The End of Programming” [347] in thought prompting has utilized the idea of decomposing
Communications of the ACM has discussed the impact of AI complex tasks into multi-step reasoning chains. Besides,
programming in the field of computer science, emphasizing the safety control of generated texts is also important for
an important shift towards the highly adaptive LLM as a practical deployment. It has been shown that LLMs may
new atomic unit of computation. generate texts that contain sensitive information or offensive
expressions [46]. Although the RLHF algorithm [61] can
Major Issues. Although LLMs have achieved splendid per- alleviate this problem to some extent, it still relies on con-
formance in generating human-like text, they are susceptible siderable human-labeled data for tuning LLMs, without an
to suffering from two major issues in language generation objective optimization goal to follow. Thus, it is imperative
as discussed below. to explore effective methods to overcome these limitations
• Controllable generation. For LLMs, the mainstream way and enable safer control over the outputs of LLMs.
to generate texts under given conditions is through the use • Specialized generation. Although LLMs have learned
of natural language instructions or prompts. Despite the general language patterns to generate coherent text, their
simplicity, such a mechanism poses significant challenges in proficiency in generation might be constrained when deal-
terms of exerting fine-grained or structural constraints over ing with a specialized domain or task. For instance, a
the generated outputs of these models. Existing work [41] language model that has been trained on general web
shows that, when generating texts with complex constraints articles may face challenges when generating a medical
on their structures, LLMs can handle local planning (e.g., in- report which involves many medical jargon and methods.
teractions between proximal sentences) very well but might Intuitively, domain knowledge should be critical for model
struggle with global planning (i.e., long-range relatedness). specialization. Whereas, it is not easy to inject such special-
For example, to generate a complex long passage with sev- ized knowledge into LLMs. As discussed in recent analy-
eral paragraphs, it is still difficult to directly ensure specific ses [47, 350], when LLMs are trained to exhibit some specific
text structure (e.g., the order of concepts and the logical ability that allows them to excel in some areas, they might
flow), considering the whole text. This case will become struggle in others. Such an issue is related to catastrophic
even more challenging for generation tasks that require forgetting [351, 352] in training neural networks, which refers
formal rules or grammar, e.g., code synthesis. To tackle this to the conflict phenomenon of integrating new and old
issue, a potential solution is to extend the one-pass genera- knowledge. Similar cases also occur in human alignment
tion into the iterative prompting of LLMs. This simulates the of LLMs, where “alignment tax” [61] (e.g., a potential loss in
human writing process to break down language generation the in-context learning ability) has to be paid for aligning
26

to human values and needs. Therefore, it is important to LLMs [71, 354, 358]. In evaluation, existing studies mainly
develop effective model specialization methods that can focus on testing how LLMs utilize the extracted knowledge
flexibly adapt LLMs to various task scenarios, meanwhile to answer the question and show that the retrieved evi-
retaining the original abilities as possible. dence can largely improve the accuracy of the generated
answers, even enabling a smaller LLM to outperform 10×
7.1.2 Knowledge Utilization larger ones [354, 358]. Besides, open-book QA tasks can
Knowledge utilization is an important ability of intelligent also evaluate the recency of knowledge information. Pre-
systems to accomplish knowledge-intensive tasks (e.g., com- training or retrieving from outdated knowledge resources
monsense question answering and fact completion) based may cause LLMs to generate incorrect answers for time-
on supporting factual evidence. Concretely, it requires LLMs sensitive questions [354].
to properly utilize the rich factual knowledge from the pre-
training corpus or retrieve external data when necessary. In Knowledge Completion. In knowledge completion tasks,
particular, question answering (QA) and knowledge com- LLMs might be (to some extent) considered as a knowledge
pletion have been two commonly used tasks for evaluating base [312], which can be leveraged to complete or predict the
this ability. According to the test tasks (question answering missing parts of knowledge units (e.g., knowledge triples).
or knowledge completion) and evaluation settings (with or Such tasks can probe and evaluate how much and what kind
without external resources), we categorize existing knowl- of knowledge LLMs have learned from the pre-training
edge utilization tasks into three types, namely closed-book data. Existing knowledge completion tasks can be roughly
QA, open-book QA17 , and knowledge completion. divided into knowledge graph completion tasks (e.g., FB15k-
237 [308] and WN18RR [310]) and fact completion tasks (e.g.,
Closed-Book QA. Closed-book QA tasks [353] test the WikiFact [307]), which aim to complete the triples from a
acquired factual knowledge of LLMs from the pre-training knowledge graph and incomplete sentences about specific
corpus, where LLMs should answer the question only based facts, respectively. Empirical studies have revealed that it
on the given context without using external resources. For is difficult for existing LLMs to accomplish knowledge
evaluating this ability, there are several datasets that can completion tasks related to specific relation types [258].
be leveraged, including Natural Questions [290], Web Ques- As shown in the evaluation results on WikiFact, LLMs
tions [293], and TriviaQA [294], where the accuracy metric is perform well on several frequent relations that occur in
widely adopted. Empirical results have revealed that LLMs the pre-training data (e.g., currency and author), while
can perform well in this setting and even match the per- not well on rare ones (e.g., discoverer_or_inventor
formance of state-of-the-art open-domain QA systems [56]. and place_of_birth). Interestingly, under the same eval-
Besides, the performance of LLMs on closed-book QA tasks uation settings (e.g., in-context learning), InstructGPT (i.e.,
also shows a scaling law pattern in terms of both model size text-davinci-002) outperforms GPT-3 in all subsets of
and data size: scaling the parameters and training tokens WikiFact. It indicates that instruction tuning is helpful for
can increase the capacity of LLMs and help them learn (or LLMs to accomplish knowledge completion tasks.
memorize) more knowledge from the pre-training data [56].
Further, under a similar parameter scale, LLMs with more Major Issues. Although LLMs have achieved key progress
pre-training data relevant to the evaluated tasks would in capturing and utilizing knowledge information, they
achieve better performance [71]. Besides, the closed-book suffer from two major issues as discussed below.
QA setting also provides a testbed for probing the accuracy • Hallucination. In generating factual texts, a challenging
of the factual knowledge encoded by LLMs. However, as issue is hallucination generations [344], where the generated
shown in existing work [55], LLMs might perform less well information is either in conflict with the existing source
on QA tasks relying on fine-grained knowledge, even when (intrinsic hallucination) or cannot be verified by the available
it exists in the pre-training data. source (extrinsic hallucination), which are illustrated with
two examples in Figure 7. Hallucination widely occurs in
Open-Book QA. Unlike closed-book QA, in open-book QA
existing LLMs, even the most superior LLMs such as GPT-
tasks, LLMs can extract useful evidence from the external
4 [46]. In essence, LLMs seem to “unconsciously” utilize
knowledge base or document collections, and then answer
the knowledge in task solving, which still lack an ability to
the question based on the extracted evidence [354–357]. Typ-
accurately control the use of internal or external knowledge.
ical open-book QA datasets (e.g., Natural Questions [290],
Hallucination would mislead LLMs to generate undesired
OpenBookQA [302], and SQuAD [305]) have overlap with
outputs and mostly degrade the performance, leading to
closed-book QA datasets, but they incorporate external data
potential risks when deploying LLMs in real-world ap-
sources, e.g., Wikipedia. The metrics of accuracy and F1
plications. To alleviate this problem, the alignment tuning
score are widely used in open-book QA tasks for evaluation.
strategies (as discussed in Section 5.2) have been widely
To select relevant knowledge from external resources, LLMs
utilized in existing works [61], which rely on tuning LLMs
are often paired with a text retriever (or even a search
on high-quality data or using human feedback. For the eval-
engine), which is trained independently or jointly with
uation of the hallucination problem, a set of hallucination
17. In this part, open-book QA refers to the QA tasks that require detection tasks have been proposed, e.g., TruthfulQA [292],
to extract and utilize useful information from external knowledge for detecting human falsehood mimicked by models.
resources, as the antithesis of closed-book QA (only using the encoded • Knowledge recency. As another major challenge, LLMs
information from pre-training corpus). Note that there is a dataset also
named OpenBookQA [302], which follows the settings of open-book would encounter difficulties when solving tasks that require
QA tasks by extracting and utilizing external science facts. the latest knowledge beyond the training data. To tackle
27

Bob’s wife is Amy. Bob’s daughter is Cindy.


Explain RLHF for LLMs.
Who is Cindy to Amy?

RLHF stands for "Rights, Limitations, Harms, and


Cindy is Amy’s daughter-in-law. Freedoms" and is a framework for …… models like
LLMs (Large Language Models).

(a) Intrinsic hallucination (b) Extrinsic hallucination

Fig. 7. Examples of intrinsic and extrinsic hallucination for a public LLM (access date: March 19, 2023). As an example of intrinsic hallucination,
the LLM gives a conflicting judgment about the relationship between Cindy and Amy, which contradicts the input. For extrinsic hallucination, in this
example, the LLM seems to have an incorrect understanding of the meaning of RLHF (reinforcement learning from human feedback), though it can
correctly understand the meaning of LLMs (in this context).

this issue, a straightforward approach is to regularly update As discussed in Section 6.2, CoT involves the intermediate
LLMs with new data. However, it is very costly to fine-tune reasoning steps, which can be manually created [33] or
LLMs, and also likely to cause the catastrophic forgetting automatically generated [365], into the prompts to guide
issue when incrementally training LLMs. Therefore, it is LLMs to perform multi-step reasoning. Such a way largely
necessary to develop efficient and effective approaches that improves the reasoning performance of LLMs, leading to
can integrate new knowledge into existing LLMs, making new state-of-the-art results on several complex knowledge
them up-to-date. Existing studies have explored how to reasoning tasks [33, 56, 366]. Further, after reformulating
utilize the external knowledge source (e.g., search engine) knowledge reasoning tasks into code generation tasks, re-
to complement LLMs, which can be either jointly optimized searchers have found that the performance of LLMs can
with LLMs [354] or used as a plug-and-play module [359]. be further improved [144], especially with the LLMs pre-
For instance, ChatGPT utilizes a retrieval plugin to access trained on code. However, due to the complexity of knowl-
up-to-date information sources [360]. By incorporating the edge reasoning tasks, the performance of current LLMs still
extracted relevant information into the context [361, 362], lags behind human results on tasks such as commonsense
LLMs can acquire new factual knowledge and perform reasoning [33, 56, 367]. As one of the most common mis-
better on relevant tasks. However, such an approach seems takes, LLMs might generate inaccurate intermediate steps
to be still at a superficial level. It has been revealed that it based on wrong factual knowledge, leading to a wrong final
is difficult to directly amend intrinsic knowledge or inject result. To address this issue, existing work has proposed
specific knowledge into LLMs, which remains an open special decoding or ensemble strategies to improve the accu-
research problem [363, 364]. racy of the whole reasoning chain [249, 252]. More recently,
an empirical study [366] reveals that LLMs may have dif-
7.1.3 Complex Reasoning ficulty in explicitly inferring the commonsense knowledge
Complex reasoning refers to the ability of understanding required by a specific task, though they can successfully
and utilizing supporting evidence or logic to derive con- solve it. Further, it further shows that leveraging self-
clusions or make decisions [51, 52]. According to the type generated knowledge may not be beneficial for improving
of involved logic and evidence in the reasoning process, the reasoning performance.
we consider dividing existing evaluation tasks into three
Symbolic Reasoning18 . The symbolic reasoning tasks
major categories, namely knowledge reasoning, symbolic
mainly focus on manipulating the symbols in a formal rule
reasoning, and mathematical reasoning.
setting to fulfill some specific goal [51], where the operations
Knowledge Reasoning. The knowledge reasoning tasks and rules may have never been seen by LLMs during pre-
rely on logical relations and evidence about factual training. Existing work [33, 233, 251] commonly evaluates
knowledge to answer the given question. Existing work LLMs on the task of last letter concatenation and coin flip,
mainly uses specific datasets to evaluate the reasoning where the evaluation examples require the same reasoning
capacity of the corresponding type of knowledge, e.g., steps as the in-context examples (called in-domain test) or
CSQA [247]/StrategyQA [248] for commonsense knowledge more steps (called out-of-domain test). For an example of
reasoning and ScienceQA [301] for science knowledge rea- the out-of-domain test, LLMs could only see the examples
soning. In addition to the accuracy of the predicted results, with two words in context, but it requires LLMs to concate-
existing work [301] has also evaluated the quality of the nate the last letters of three or more words. Typically, the
generated reasoning process, via automatic metrics (e.g., accuracy of the generated symbols is adopted to evaluate
BLEU) or human evaluation. Typically, these tasks require the performance of LLMs on these tasks. Thus, LLMs need
LLMs to perform step-by-step reasoning based on factual to understand the semantic relations among the symbolic
knowledge, until reaching the answer to the given ques-
tion. To elicit the step-by-step reasoning ability, chain-of- 18. Following [33], we mainly discuss symbolic reasoning tasks spe-
cially designed for evaluating LLMs. We do not consider symbolic
thought (CoT) prompting strategy [33] has been proposed reasoning methods in traditional NLP tasks, such as deducing logical
for enhancing the complex reasoning capacity of LLMs. rules from the knowledge graphs in KBQA.
28

operations and their composition in complex scenarios. answer after a correct reasoning process [33, 376], leading
However, under the out-of-domain setting, as LLMs have to inconsistency between the derived answer and the rea-
not seen the complex compositions of symbolic operations soning process. To alleviate this problem, existing work has
and rules (e.g., twice the number of operations in context proposed to guide the whole generation process of LLMs
examples), it is hard for LLMs to capture their accurate via external tools or models [346], or re-check the reasoning
meanings. To solve this issue, existing studies incorporate process and final answer for correcting them [377]. As
scratchpad [326, 368] and tutor [369] strategies to help a promising solution, recent approaches reformulate the
LLMs better manipulate symbolic operations, for generating complex reasoning tasks into code generation tasks, where
longer and more complex reasoning processes. Another the strict execution of the generated code ensures the con-
line of research work utilizes the formal programming sistency between the reasoning process and the outcome.
language to represent the symbolic operations and rules, Besides, it has been revealed that there might also exist in-
which requires LLMs to generate code and perform the consistency between tasks with similar inputs, where small
reasoning process by executing it with external interpreters. changes in the task description may cause the model to
Such a way can decompose the complex reasoning process produce different results [49, 245]. To mitigate this problem,
into code synthesis and program execution for LLMs and the ensemble of multiple reasoning paths can be applied to
interpreters, respectively, leading to a simplified reasoning enhance the decoding process of LLMs [252].
process with yet more accurate results [328]. • Numerical computation. For complex reasoning tasks,
LLMs still face difficulties in the involved numerical com-
Mathematical Reasoning. The mathematical reasoning putation, especially for the symbols that are seldom en-
tasks need to comprehensively utilize mathematical knowl- countered during pre-training, such as arithmetic with large
edge, logic, and computation for solving problems or gen- numbers [49, 369]. To tackle this issue, a direct way is to tune
erating proof statements. Existing mathematical reasoning LLMs on synthesized arithmetic problems [378]. A surge
tasks can be mainly categorized into math problem solving of studies follow this approach and further improve the
and automated theorem proving. For math problem solving numerical computation performance by special training and
tasks, SVAMP [245], GSM8k [244], and MATH [284] datasets inference strategies [368], e.g., scratchpad tracing. Besides,
are commonly used for evaluation, where LLMs need to existing work [70] has also incorporated external tools (e.g.,
generate accurate concrete numbers or equations to answer calculator), especially for handling arithmetic operations.
the mathematical problem. As these tasks also require multi- More recently, ChatGPT has provided a plugin mechanism
step reasoning, the chain-of-thought prompting strategy has to use external tools [360]. In this way, LLMs need to learn
been widely adopted for LLMs to improve the reasoning how to properly manipulate the tools. For this purpose,
performance [33]. As a practical strategy, continually pre- researchers have augmented the examples using tools (even
training LLMs on large-scale mathematical corpora can the LLM itself) for tuning the LLM [70, 379], or devised
largely boost their performance on mathematical reason- instructions and exemplars for in-context learning [328].
ing tasks [35, 135, 370]. Further, since math problems in While, these LLMs still rely on the text context to capture
different languages share the same mathematical logic, re- the semantic meanings of mathematical symbols (during the
searchers also propose a multilingual math word problem pre-training stage), which is not best suited for numerical
benchmark [261] to evaluate the multilingual mathematical computation in essence.
reasoning capacity of LLMs. As another challenging task,
automated theorem proving (ATP) [334, 336, 371] requires
7.2 Advanced Ability Evaluation
the reasoning model to strictly follow the reasoning logic
and mathematical skills. To evaluate the performance on In addition to the above basic evaluation tasks, LLMs also
this task, PISA [335] and miniF2F [336] are two typical ATP exhibit some superior abilities that require special consider-
datasets with the proof success rate as the evaluation metric. ations for evaluation. In this part, we discuss several rep-
As a typical approach, existing work on ATP utilizes LLMs resentative advanced abilities and the corresponding eval-
to aid the search for proofs using an interactive theorem uation approaches, including human alignment, interaction
prover (ITP), such as Lean, Metamath, and Isabelle [372– with the external environment, and tool manipulation. Next,
374]. A major limitation of ATP research is the lack of related we discuss these advanced abilities in detail.
corpora in formal language. To tackle it, several studies
utilize LLMs to convert informal statements into formal 7.2.1 Human Alignment
proofs for augmenting new data [145] or generate drafts and It is desired that LLMs could well conform to human values
proof sketches to reduce the search space of the proofs [375]. and needs, i.e., human alignment, which is a key ability for
Major Issues. In spite of these advancements, LLMs still have several limitations in solving complex reasoning tasks.

• Inconsistency. With improved reasoning strategies (e.g., CoT prompting), LLMs can solve some complex reasoning tasks by performing step-by-step reasoning based on supporting logic and evidence. Despite this effectiveness, the inconsistency issue often occurs in the decomposed reasoning process. Concretely, LLMs may generate the correct answer following an invalid reasoning path, or produce a wrong answer after a correct reasoning process, leading to inconsistency between the derived answer and the reasoning process.

• Numerical computation. LLMs also face difficulties in numerical computation. A direct way to enhance this ability is to tune LLMs on synthesized arithmetic problems [378]. A surge of studies follow this approach and further improve the numerical computation performance with special training and inference strategies [368], e.g., scratchpad tracing. Besides, existing work [70] has also incorporated external tools (e.g., a calculator), especially for handling arithmetic operations. More recently, ChatGPT has provided a plugin mechanism to use external tools [360]. In this way, LLMs need to learn how to properly manipulate the tools. For this purpose, researchers have augmented the examples using tools (even the LLM itself) for tuning the LLM [70, 379], or devised instructions and exemplars for in-context learning [328]. However, these LLMs still rely on the text context to capture the semantic meanings of mathematical symbols (during the pre-training stage), which is not well suited to numerical computation in essence.

7.2 Advanced Ability Evaluation

In addition to the above basic evaluation tasks, LLMs also exhibit some superior abilities that require special considerations for evaluation. In this part, we discuss several representative advanced abilities and the corresponding evaluation approaches, including human alignment, interaction with the external environment, and tool manipulation. Next, we discuss these advanced abilities in detail.

7.2.1 Human Alignment

It is desirable for LLMs to conform well to human values and needs, i.e., human alignment, which is a key ability for the broad use of LLMs in real-world applications.

To evaluate this ability, existing studies consider multiple criteria for human alignment, such as helpfulness, honesty, and safety [46, 208, 209]. For helpfulness and honesty, adversarial question answering tasks (e.g., TruthfulQA [292]) can be utilized to examine an LLM's ability to detect possible falsehood in text [46, 71]. Furthermore, harmlessness can also be evaluated with several existing benchmarks, e.g., CrowS-Pairs [380] and Winogender [381].
Despite the automatic evaluation with the above datasets, human evaluation is still a more direct way to effectively test the human alignment ability of LLMs. OpenAI has invited many experts in domains related to AI risks to evaluate and improve the behaviors of GPT-4 when encountering risky content [46]. Besides, for other aspects of human alignment (e.g., truthfulness), several studies propose to use specific instructions and devise annotation rules to guide the annotation process [71]. Empirical studies have revealed that these strategies can greatly improve the human alignment ability of LLMs [209]. For instance, after alignment tuning on data collected through interactions with experts, the incorrect behavior rate of GPT-4 can be largely reduced when it deals with sensitive or disallowed prompts. In addition, high-quality pre-training data can reduce the effort required for alignment [46]. For instance, Galactica is potentially more harmless due to the less biased content in its scientific corpus [35].
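To illustrate how likelihood-based harmlessness benchmarks operate, the following is a minimal sketch in the spirit of CrowS-Pairs [380]: the model's likelihoods for the two sentences of a minimal pair are compared. Note that the actual benchmark uses pseudo-log-likelihood with masked language models; the causal-LM scoring, the GPT-2 checkpoint, and the example pair below are simplifying assumptions for illustration only.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of a minimal-pair bias probe: a model that systematically
# assigns higher likelihood to the stereotypical sentence may encode
# the corresponding social bias. Model and sentence pair are stand-ins.

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(text: str) -> float:
    """Sum of token log-probabilities under the causal LM."""
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1]       # predictions for next tokens
    logp = torch.log_softmax(logits, dim=-1)
    target = ids[:, 1:]                      # shift targets by one
    return logp.gather(2, target.unsqueeze(-1)).sum().item()

stereo = "The nurse said she would be late."
anti = "The nurse said he would be late."
print(sentence_logprob(stereo) > sentence_logprob(anti))
```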
7.2.2 Interaction with External Environment

Besides standard evaluation tasks, LLMs have the ability to receive feedback from the external environment and perform actions according to behavior instructions, e.g., generating action plans in natural language to manipulate agents [382, 383]. Such an ability is also emergent in LLMs, which can generate detailed and highly realistic action plans, while smaller models (e.g., GPT-2) tend to generate shorter or meaningless plans [382].

To test this ability, several embodied AI benchmarks can be used for evaluation, described as follows. VirtualHome [384] builds a 3D simulator for household tasks such as cleaning and cooking, in which the agent can execute natural language actions generated by LLMs. ALFRED [385] includes more challenging tasks that require LLMs to accomplish compositional targets. BEHAVIOR [386] focuses on everyday chores in simulation environments and requires LLMs to generate complex solutions, e.g., changing the internal status of objects. Based on the action plans generated by LLMs, existing work either adopts the regular metrics of the benchmark (e.g., executability and correctness of the generated action plans) [382] or directly conducts real-world experiments and measures the success rate [387] to evaluate this ability. Existing work has shown the effectiveness of LLMs in interacting with the external environment and generating accurate action plans [388]. Recently, several improved methods have been proposed to enhance the interaction ability of LLMs, e.g., designing code-like prompts [389] and providing real-world grounding [387].
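As a concrete illustration of the executability metric mentioned above, the following sketch checks a generated plan step by step against a simulated environment's action and object vocabulary. The vocabulary, the plan format, and the parser are illustrative assumptions, not the actual evaluation code of VirtualHome or the other benchmarks.

```python
# Toy sketch of an "executability" check for LLM-generated action plans:
# each step must name an action and an object that the (hypothetical)
# environment actually supports.

ALLOWED_ACTIONS = {"walk", "grab", "open", "close", "put"}
KNOWN_OBJECTS = {"kitchen", "fridge", "cup", "table"}

def executable_steps(plan: str) -> float:
    """Fraction of '<action> <object>' steps the environment could run."""
    steps = [s.strip().lower() for s in plan.splitlines() if s.strip()]
    ok = 0
    for step in steps:
        parts = step.split(maxsplit=1)
        if len(parts) == 2 and parts[0] in ALLOWED_ACTIONS \
                and parts[1] in KNOWN_OBJECTS:
            ok += 1
    return ok / len(steps) if steps else 0.0

plan = "walk kitchen\nopen fridge\ngrab cup\nput table"
print(executable_steps(plan))  # 1.0 for a fully executable plan
```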
7.2.3 Tool Manipulation

When solving complex problems, LLMs can turn to external tools if they determine it to be necessary. By encapsulating available tools with API calls, existing work has involved a variety of external tools, e.g., a search engine [71], a calculator [70], and a compiler [328], to enhance the performance of LLMs on several specific tasks. Recently, OpenAI has supported the use of plugins in ChatGPT [360], which can equip LLMs with broader capacities beyond language modeling. For example, the web browser plugin enables ChatGPT to access fresh information. Further, incorporating third-party plugins is particularly key for creating a prosperous ecosystem of applications based on LLMs.

To examine the ability of tool manipulation, existing work mostly adopts complex reasoning tasks for evaluation, such as mathematical problem solving (e.g., GSM8k [244] and SVAMP [245]) or knowledge question answering (e.g., TruthfulQA [292]), where the successful utilization of tools is critical for skills that LLMs lack (e.g., numerical calculation). In this way, the evaluated performance on these tasks can reflect the ability of LLMs in tool manipulation. To teach LLMs to utilize tools, existing studies add exemplars of tool use in context to elicit LLMs [328], or fine-tune LLMs on simulated data about tool utilization [70, 379]. Existing work has found that, with the help of tools, LLMs become more capable of handling the issues that they are not good at, e.g., equation calculation and utilizing real-time information, and eventually improve the final performance [70].
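The following is a minimal sketch of the tool-manipulation pattern described above: the LLM is prompted to emit a marked tool call, the host program executes the tool, and the result is fed back into the context for the model to continue from. The CALC(...) call syntax and the `generate` interface are illustrative assumptions, not the API of Toolformer or the ChatGPT plugins.

```python
import re

def calculator(expression: str) -> str:
    """A deliberately restricted arithmetic tool."""
    if not re.fullmatch(r"[0-9+\-*/(). ]+", expression):
        return "error: unsupported expression"
    return str(eval(expression))  # acceptable only due to the whitelist

def answer_with_tool(generate, question: str) -> str:
    prompt = (
        "Use CALC(<expression>) whenever arithmetic is needed.\n"
        f"Q: {question}\nA:"
    )
    draft = generate(prompt)                 # hypothetical LLM call
    match = re.search(r"CALC\(([^)]*)\)", draft)
    if match is None:
        return draft                         # no tool use required
    result = calculator(match.group(1))
    # Feed the tool output back so the model can finish the answer.
    return generate(prompt + draft + f"\n[tool output: {result}]\n")
```

The essential design choice is that the model only decides when to call the tool; the numerical work itself is delegated to code, which sidesteps the numerical computation weakness discussed earlier.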
Summary. The above three abilities are of great value to the practical performance of LLMs: conforming to human values and preferences (human alignment), acting properly in real-world scenarios (interaction with the external environment), and expanding the ability scope (tool manipulation). In addition to the above three advanced abilities, LLMs might also show other abilities that are specially related to some tasks (e.g., data annotation [228]) or learning mechanisms (e.g., self-improvement [256]). It will be an open direction to discover, measure, and evaluate these newly emerging abilities, so as to better utilize and improve LLMs.

7.3 Public Benchmarks and Empirical Analysis

In the aforementioned parts, we have discussed the evaluation tasks of LLMs and their corresponding settings. Next, we will introduce existing evaluation benchmarks and empirical analyses for LLMs, which focus on more comprehensive discussions from a general perspective.

7.3.1 Evaluation Benchmarks

Recently, several comprehensive benchmarks [258, 284, 327] have been released for the evaluation of LLMs. In this part, we introduce several representative and widely used benchmarks, i.e., MMLU, BIG-bench, and HELM.

• MMLU [284] is a versatile benchmark for large-scale evaluation of multi-task knowledge understanding, covering a wide range of knowledge domains from mathematics and computer science to the humanities and social sciences. The difficulties of these tasks vary from basic to advanced. As shown in existing work, LLMs mostly outperform small models by a substantial margin on this benchmark [35, 56, 57, 83], which illustrates the effect of scaling model size. More recently, GPT-4 achieves a remarkable record (86.4% in the 5-shot setting) on MMLU, significantly better than the previous state-of-the-art models [46].

• BIG-bench [327] is a collaborative benchmark intended to probe existing LLMs from various aspects. It comprises 204 tasks that encompass a broad range of topics, including linguistics, childhood development, mathematics, commonsense reasoning, biology, physics, social bias, software development, and so on. By scaling the model size, LLMs can even outperform the average human performance under the few-shot setting on 65% of the tasks in BIG-bench [56].
Considering the high evaluation cost of the entire benchmark, a lightweight benchmark, BIG-bench-Lite, has been proposed, which contains 24 small yet diverse and challenging tasks from BIG-bench. Additionally, the BIG-bench hard (BBH) benchmark has been proposed to concentrate on the currently unsolvable tasks of LLMs, by selecting the challenging tasks on which LLMs exhibit inferior performance compared to humans. Since BBH is more difficult, small models mostly achieve performance close to random on it. As a comparison, CoT prompting can elicit the abilities of LLMs to perform step-by-step reasoning and enhance the performance, even exceeding the average human performance on BBH [285].

• HELM [258] is a comprehensive benchmark that currently implements a core set of 16 scenarios and 7 categories of metrics. It is built on top of many prior studies, conducting a holistic evaluation of language models. As shown in the experimental results of HELM [258], instruction tuning can consistently boost the performance of LLMs in terms of accuracy, robustness, and fairness. Further, for reasoning tasks, the LLMs that have been pre-trained on a code corpus show superior performance.
The above benchmarks cover a variety of mainstream evaluation tasks for LLMs. Besides, there are also several benchmarks that focus on evaluating specific abilities of LLMs, such as TyDiQA [390] for multilingual knowledge utilization and MGSM [261] for multilingual mathematical reasoning. To conduct the evaluation, one can select suitable benchmarks according to specific goals. In addition, there are several open-source evaluation frameworks that researchers can use to evaluate LLMs on existing benchmarks, or to extend new tasks for customized evaluations, such as the Language Model Evaluation Harness [391] and OpenAI Evals [46].
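For a sense of what such frameworks automate, the following is a minimal sketch of the likelihood-based multiple-choice protocol that benchmarks like MMLU are commonly evaluated with: each candidate answer is scored by its conditional log-probability given the question, and the highest-scoring option is selected. The `choice_logprob` function and the item format are placeholders, not the API of any particular framework.

```python
# Sketch of likelihood-based multiple-choice evaluation.

def choice_logprob(question: str, choice: str) -> float:
    """Return log p(choice | question) under the evaluated LM."""
    raise NotImplementedError  # backed by the model in a real harness

def evaluate(items: list) -> float:
    """items: list of (question, [choices], gold_index) triples."""
    correct = 0
    for question, choices, gold in items:
        scores = [choice_logprob(question, c) for c in choices]
        if scores.index(max(scores)) == gold:
            correct += 1
    return correct / len(items)
```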
7.3.2 Comprehensive Analyses on LLMs' Capacities

In addition to constructing large-scale evaluation benchmarks, a surge of studies has conducted comprehensive analyses to investigate the strengths and limitations of LLMs. In this part, we briefly discuss them in two major aspects, namely generalist (general-purpose capacity) and specialist (domain-specific capacity).

Generalist. Due to their remarkable performance, existing work [41, 46, 344, 350, 392-394] has systematically evaluated the general capacities of LLMs, to explore their competence in a variety of different tasks or applications. Typically, these studies mainly focus on the newly emerged LLMs (e.g., ChatGPT and GPT-4) that have not been well investigated before, which are discussed as follows:

• Mastery. To evaluate the mastery level of LLMs in solving general tasks, existing work [394] typically collects a set of datasets covering a range of tasks and domains, and then tests LLMs under the few/zero-shot setting. Empirical results [41, 46, 350, 394] have shown the superior capacities of LLMs as a general-purpose task solver. As remarkable progress, GPT-4 has surpassed state-of-the-art methods with benchmark-specific training in a wide range of tasks, such as language understanding, commonsense reasoning, and mathematical reasoning [46]. Furthermore, it can achieve human-like performance in real-world exams designed for humans (e.g., Advanced Placement exams and the Graduate Record Examination [46]). More recently, a comprehensive qualitative analysis [41] has revealed that GPT-4 approaches human-level performance in a variety of challenging tasks across various fields (e.g., mathematics, computer vision, and programming), and considers it "an early version of an artificial general intelligence system". Despite the promising results, this analysis has also revealed that GPT-4 still has severe limitations. For example, GPT-4 struggles to calibrate its confidence about the generated result, and cannot verify its consistency with the training data and with itself. Besides, it demonstrates inferior performance on tasks that require planning (e.g., solving the "Tower of Hanoi" problem) or conceptual leaps (e.g., proposing a new scientific hypothesis). Furthermore, several studies have also shown that LLMs may misunderstand unfamiliar concepts [394, 395] in information extraction tasks from specific domains, and face challenges in solving pragmatic emotion-related tasks [393] (e.g., personalized emotion recognition), showing inferior performance compared to specifically fine-tuned models.

• Robustness. Besides mastery, another aspect to consider is the stability of LLMs against noise or perturbations, which is particularly important for practical applications. To evaluate this robustness, existing work [396] conducts adversarial attacks (e.g., token replacement) on the input, and then evaluates the robustness of LLMs based on the change of the output results. It has been shown that LLMs are more robust than small language models in a variety of tasks, but may encounter new robustness issues, e.g., robustness instability and prompt sensitivity. Concretely, LLMs are prone to provide different answers when using varied expressions of the same input, even in conflict with content generated by the model itself [397]. Such an issue would also lead to unstable results when evaluating robustness using different prompts, making the evaluation results of robustness analysis themselves less reliable, as illustrated by the sketch below.
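The following is a small sketch of the prompt-sensitivity check just described: paraphrases of one input are sent to the model, and the spread of answers indicates (in)stability. The paraphrases and the `generate` interface are illustrative assumptions.

```python
from collections import Counter

PARAPHRASES = [
    "What is the capital of Australia?",
    "Name the capital city of Australia.",
    "Australia's capital is which city?",
]

def consistency(generate) -> float:
    """Fraction of paraphrases agreeing with the majority answer."""
    answers = [generate(p).strip().lower() for p in PARAPHRASES]
    top_count = Counter(answers).most_common(1)[0][1]
    return top_count / len(answers)  # 1.0 means fully consistent
```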
Specialist. As LLMs have been pre-trained on large-scale mixture-of-source corpora, they can capture rich knowledge from the pre-training data. Thus, LLMs are also employed as domain experts or specialists for specific areas. Therefore, recent studies have widely explored the use of LLMs for solving domain-specific tasks and evaluated the adaptation capacity of LLMs. Typically, these studies collect or construct domain-specific datasets to evaluate the performance of LLMs using in-context learning. Since our focus is not to cover all the possible application domains, we briefly discuss three representative domains that have received considerable attention from the research community, namely healthcare, education, and law.

• Healthcare is a vital application field closely related to human life. Since the advent of ChatGPT, a series of studies have applied ChatGPT or other LLMs to the medical domain. It has been shown that LLMs are capable of handling a variety of healthcare tasks, e.g., biology information extraction [398], medical advice consultation [399-401], and report simplification [402], and can even pass the medical license exams [403-405] specially designed for professional doctors. However, LLMs may fabricate medical misinformation [400, 402], e.g., misinterpreting medical terms and suggesting advice inconsistent with medical guidelines.
Besides, it would also raise privacy concerns to upload the health information of patients [398].

• Education is also an important application domain where LLMs may potentially exert significant influence. Existing work has found that LLMs can achieve student-level performance on standardized tests [46, 406, 407] in subjects such as mathematics, physics, and computer science, in both multiple-choice and free-response problems. Besides, empirical studies have shown that LLMs can serve as a writing or reading assistant for education [408, 409]. A recent study [409] reveals that ChatGPT is capable of generating logically consistent answers across disciplines, balancing both depth and breadth. Another quantitative analysis [408] shows that students utilizing ChatGPT perform better than average students, across different usage methods (e.g., keeping or refining the results from LLMs as their own answers), in some courses in the computer security field. However, the increasing popularity of LLMs has been raising concerns (e.g., cheating on homework) about the rational use of such intelligent assistants for education.

• Law is a specialized domain that is built on professional domain knowledge. Recently, a number of studies have applied LLMs to solve various legal tasks, e.g., legal document analysis [410, 411], legal judgment prediction [412], and legal document writing [413]. A recent study [414] has found that LLMs possess powerful abilities of legal interpretation and reasoning. Moreover, the latest GPT-4 model achieves a top-10% score on a simulated bar exam compared with human test-takers. However, the use of LLMs in law also raises concerns about legal challenges, including copyright issues [415], personal information leakage [416], and bias and discrimination [417].

Besides the aforementioned work, the capacities of LLMs have also been analyzed from other perspectives. For instance, some recent work has studied human-like characteristics of LLMs, such as self-awareness, theory of mind (ToM), and affective computing [41, 418-420]. In particular, an empirical evaluation of ToM conducted on two classic false-belief tasks speculates that LLMs may have ToM-like abilities, since a model in the GPT-3.5 series achieves performance comparable with that of nine-year-old children on ToM tasks [419]. Further, another line of work has investigated the fairness and accuracy of existing evaluation settings for LLMs [421]; e.g., the large-scale mixture-of-source pre-training data may contain the data in test sets.

8 CONCLUSION AND FUTURE DIRECTIONS

In this survey, we have reviewed the recent progress of large language models (LLMs), and introduced the key concepts, findings, and techniques for understanding and utilizing LLMs. We focus on the large-sized models (i.e., having a size larger than 10B) while excluding the contents of early pre-trained language models (e.g., BERT and GPT-2) that have been well covered in the existing literature. In particular, our survey has discussed four important aspects of LLMs, i.e., pre-training, adaptation tuning, utilization, and evaluation. For each aspect, we highlight the techniques or findings that are key to the success of LLMs. Besides, we also summarize the available resources for developing LLMs and discuss important implementation guidelines for reproducing LLMs. This survey tries to cover the most recent literature about LLMs and provides a good reference resource on this topic for both researchers and engineers.

Next, we summarize the discussions of this survey, and introduce the challenges and future directions for LLMs, in the following aspects.

Theory and Principle. To understand the underlying working mechanism of LLMs, one of the greatest mysteries is how information is distributed, organized, and utilized through the very large, deep neural network. It is important to reveal the basic principles or elements that establish the foundation of the abilities of LLMs. In particular, scaling seems to play an important role in increasing the capacity of LLMs [31, 55, 59]. It has been shown that some emergent abilities occur in an unexpected way (a sudden performance leap) when the parameter scale of language models increases to a critical size (e.g., 10B) [31, 33], typically including in-context learning, instruction following, and step-by-step reasoning. These emergent abilities are fascinating yet perplexing: when and how they are obtained by LLMs is not yet clear. Recent studies either conduct extensive experiments to investigate the effect of emergent abilities and the factors contributing to such abilities [220, 237, 422], or explain some specific abilities with existing theoretical frameworks [60, 231]. An insightful technical post also specially discusses this topic [47], taking the GPT-series models as the target. However, more formal theories and principles to understand, characterize, and explain the abilities or behaviors of LLMs are still missing. Since emergent abilities bear a close analogy to phase transitions in nature [31, 58], cross-discipline theories or principles (e.g., whether LLMs can be considered as some kind of complex system) might be useful to explain and understand the behaviors of LLMs. These fundamental questions are worth exploring by the research community and are important for developing the next-generation LLMs.

Model Architecture. Due to its scalability and effectiveness, the Transformer, consisting of stacked multi-head self-attention layers, has become the de facto architecture for building LLMs. Various strategies have been proposed to improve the performance of this architecture, such as neural network configuration and scalable parallel training (see the discussions in Section 4.2.2). To enhance the model capacity (e.g., the multi-turn conversation ability), existing LLMs typically maintain a long context window; e.g., GPT-4-32k has an extremely large context length of 32,768 tokens. Thus, a practical consideration is to reduce the time complexity (originally quadratic) incurred by the standard self-attention mechanism. It is important to investigate the effect of more efficient Transformer variants in building LLMs [423]; e.g., sparse attention has been used in GPT-3 [55]. Besides, catastrophic forgetting has been a long-standing challenge for neural networks, which also has a negative impact on LLMs. When tuning LLMs with new data, the originally learned knowledge is likely to be damaged; e.g., fine-tuning an LLM on some specific tasks will affect its general ability. A similar case occurs when LLMs are aligned with human values (called the alignment tax [61, 208]).
Thus, it is necessary to consider extending existing architectures with more flexible mechanisms or modules that can effectively support data updates and task specialization.
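To make the quadratic-cost argument above concrete, the following is a minimal sketch of one common sparse-attention pattern, a fixed local (sliding-window) mask, which reduces the per-query cost from O(n) to O(w) and the total from O(n^2) to O(n*w). This is a simplified illustration of the general idea, not the specific factorized pattern used in GPT-3.

```python
import numpy as np

# Local (sliding-window) sparse attention: each query attends only to
# the w most recent positions, so cost per layer drops from
# O(n^2 * d) to O(n * w * d).

def local_causal_attention(Q, K, V, w: int = 64):
    """Q, K, V: arrays of shape (n, d). Returns an (n, d) array."""
    n, d = Q.shape
    out = np.empty_like(V)
    for i in range(n):
        lo = max(0, i - w + 1)                   # local causal window
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())  # stable softmax
        weights /= weights.sum()
        out[i] = weights @ V[lo:i + 1]
    return out

n, d = 1024, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
print(local_causal_attention(Q, K, V).shape)  # (1024, 64)
```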
Model Training. In practice, it is very difficult to pre-train capable LLMs, due to the huge computation consumption and the sensitivity to data quality and training tricks [68, 82]. Thus, it becomes particularly important to develop more systematic, economical pre-training approaches for optimizing LLMs, considering the factors of model effectiveness, efficiency optimization, and training stability. More model checking or performance diagnosis methods (e.g., predictable scaling in GPT-4 [46]) should be developed in order to detect abnormal issues early during training. Furthermore, this also calls for more flexible mechanisms of hardware support and resource scheduling, so as to better organize and utilize the resources in a computing cluster. Since it is very costly to pre-train an LLM from scratch, it is important to design suitable mechanisms for continually pre-training or fine-tuning LLMs based on publicly available model checkpoints (e.g., LLaMA [57] and Flan-T5 [83]). For this purpose, a number of technical issues have to be resolved, e.g., catastrophic forgetting and task specialization. However, to date, there is still a lack of open-source model checkpoints for LLMs with complete pre-processing and training logs (e.g., the scripts to prepare the pre-training data) for reproduction. We believe that it will be of great value to report more technical details in open-source models for the research of LLMs. Besides, it is also important to develop improved tuning strategies that effectively elicit the model abilities.

Model Utilization. Since fine-tuning is very costly in real applications, prompting has become the prominent approach to using LLMs. By combining task descriptions and demonstration examples into prompts, in-context learning (a special form of prompting) endows LLMs with the ability to perform well on new tasks, in some cases even outperforming full-data fine-tuned models. Furthermore, to enhance the ability of complex reasoning, advanced prompting techniques have been proposed, exemplified by the chain-of-thought (CoT) strategy, which includes the intermediate reasoning steps in prompts. However, existing prompting approaches still have several deficiencies, described as follows. Firstly, they involve considerable human effort in the design of prompts; it would be quite useful to automatically generate effective prompts for solving various tasks. Secondly, some complex tasks (e.g., formal proof and numerical computation) require specific knowledge or logic rules, which may not be well expressed in natural language or demonstrated by examples; thus, it is important to develop more informative, flexible task formatting methods for prompts19. Thirdly, existing prompting strategies mainly focus on single-turn performance; it is useful to develop interactive prompting mechanisms (e.g., through natural language conversations) for solving complex tasks, which have been demonstrated to be very useful by ChatGPT.

19. It seems that an alternative approach to this issue is to invoke external tools, e.g., the plugins for ChatGPT, when the task is difficult to solve via text generation.
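To ground the prompting discussion above, the following is a minimal sketch of how an in-context-learning prompt is assembled from a task description and demonstrations, optionally with CoT rationales. The field names and formatting are illustrative choices, not a standard; in practice the exact template strongly affects performance (the prompt-sensitivity issue noted in Section 7.3.2).

```python
def build_icl_prompt(task: str, demos, query: str, rationales=None) -> str:
    """demos: list of (input, output) pairs; rationales: optional list
    of intermediate reasoning strings enabling chain-of-thought."""
    lines = [task, ""]
    for i, (x, y) in enumerate(demos):
        lines.append(f"Input: {x}")
        if rationales:
            lines.append(f"Reasoning: {rationales[i]}")
        lines.append(f"Output: {y}")
        lines.append("")
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_icl_prompt(
    "Classify the sentiment as positive or negative.",
    demos=[("great movie!", "positive"), ("dull plot.", "negative")],
    query="the acting was superb",
)
```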
Safety and Alignment. Despite their capacities, LLMs pose similar safety challenges as small language models. For example, LLMs exhibit a tendency to generate hallucinations [344], which are texts that seem plausible but may be factually incorrect. What is worse, LLMs might be elicited by intentional instructions to produce harmful, biased, or toxic texts for malicious systems, leading to potential risks of misuse [55, 61]. For a detailed discussion of the safety issues of LLMs (e.g., privacy, overreliance, disinformation, and influence operations), the readers can refer to the GPT-3/4 technical reports [46, 55]. As the major approach to averting these issues, reinforcement learning from human feedback (RLHF) [61, 100] has been widely used, incorporating humans in the training loop for developing well-aligned LLMs. To improve model safety, it is also important to include safety-relevant prompts during RLHF, as shown by GPT-4 [46]. However, RLHF heavily relies on high-quality human feedback data from professional labelers, which makes it difficult to implement properly in practice. Therefore, it is necessary to improve the RLHF framework to reduce the effort of human labelers, and to seek a more efficient annotation approach with guaranteed data quality; e.g., LLMs can be employed to assist the labeling work. More recently, red teaming [210, 211] has been adopted for improving the model safety of LLMs, which utilizes collected adversarial prompts to refine the LLMs (i.e., avoiding the attacks from red teaming). Furthermore, it is also meaningful to establish a proper learning mechanism for LLMs to obtain human feedback via chatting and directly utilize it for self-improvement.
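For reference, the core of the RLHF pipeline's reward modeling stage is a pairwise preference objective: given two responses to the same prompt, the model is trained so that the human-preferred ("chosen") response scores higher. The sketch below shows this standard Bradley-Terry-style loss; the `reward_model` interface is an assumption, not any specific implementation.

```python
import torch

def reward_pair_loss(reward_model, prompt, chosen, rejected):
    """-log sigmoid(r(prompt, chosen) - r(prompt, rejected))"""
    r_chosen = reward_model(prompt, chosen)      # scalar tensors
    r_rejected = reward_model(prompt, rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected)
```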
Application and Ecosystem. As LLMs have shown strong capacities in solving various tasks, they can be applied in a broad range of real-world applications (i.e., following task-specific natural language instructions). As a remarkable advance, ChatGPT has potentially changed the way humans access information, which has been implemented in the release of New Bing. In the near future, it can be foreseen that LLMs will have a significant impact on information-seeking techniques, including both search engines and recommender systems. Furthermore, the development and use of intelligent information assistants will be highly promoted by the technology upgrade from LLMs. In a broader scope, this wave of technical innovation will lead to an ecosystem of LLM-empowered applications (e.g., the support of plugins by ChatGPT), which have a close connection with human life. Lastly, the rise of LLMs sheds light on the exploration of artificial general intelligence (AGI). It is promising to develop smarter intelligent systems than ever (possibly with multi-modality signals). However, in this development process, AI safety should be one of the primary concerns, i.e., making AI lead to good for humanity rather than harm [40].

CODA: This survey was planned during a discussion meeting held by our research team, and we aimed to summarize the recent advances of large language models as a highly readable report for our team members. The first draft was finished on March 13, 2023, in which our team members tried their best to include the related studies about LLMs in a relatively objective, comprehensive way. Then, we have extensively revised the writing and contents in several passes.
Despite all our efforts, this survey is still far from perfect: we are likely to have missed important references or topics, and might also have non-rigorous expressions or discussions. Due to the space limit, we can only include a fraction of existing LLMs in Figure 2 and Table 1, by setting a selection criterion. However, we set a more relaxed criterion for model selection on our GitHub page (https://github.com/RUCAIBox/LLMSurvey), which will be regularly maintained. We will continuously update this survey and improve its quality as much as we can. For us, survey writing is also a learning process about LLMs for ourselves. Readers with constructive suggestions to improve this survey are welcome to leave comments on the GitHub page of our survey or to directly email our authors. We will make revisions following the received comments or suggestions in a future version, and acknowledge the readers who have contributed constructive suggestions in our survey.

Update log. In this part, we regularly maintain an update log for the submissions of this survey to arXiv:
• First release on March 31, 2023: the initial version.
• Update on April 9, 2023: add the affiliation information, revise Figure 2 and Table 1 and clarify the corresponding selection criterion for LLMs, improve the writing, and correct some minor errors.
• Update on April 11, 2023: correct the errors for library resources.
• Update on April 12, 2023: revise Figure 2 and Table 1, and clarify the release dates of LLMs.

Planning content. We will regularly include new content into this survey, to make it more self-contained and up-to-date. Here, we list several potential topics that might appear in the next major version(s): (1) the technical evolution from GPT-1 to ChatGPT, (2) LLaMA-based tuning (e.g., Alpaca), (3) lightweight tuning strategies (e.g., LoRA), and (4) detailed formulations for model details (Section 4.2). If you have a specific topic to suggest for this survey, please drop us a message about it.

ACKNOWLEDGMENTS

The authors would like to thank Yankai Lin and Yutao Zhu for proofreading this paper. Since the first release of this paper, we have received a number of valuable comments from the readers. We sincerely thank the readers who have written to us with constructive suggestions and comments: Tyler Suard, Damai Dai, Liang Ding, Stella Biderman, Kevin Gray, and Jay Alammar.

REFERENCES

[1] S. Pinker, The Language Instinct: How the Mind Creates Language. Brilliance Audio; Unabridged edition, 2014.
[2] M. D. Hauser, N. Chomsky, and W. T. Fitch, "The faculty of language: what is it, who has it, and how did it evolve?" Science, vol. 298, no. 5598, pp. 1569–1579, 2002.
[3] A. M. Turing, "Computing machinery and intelligence," Mind, vol. LIX, no. 236, pp. 433–460, 1950.
[4] F. Jelinek, Statistical Methods for Speech Recognition. MIT Press, 1998.
[5] J. Gao and C. Lin, "Introduction to the special issue on statistical language modeling," ACM Trans. Asian Lang. Inf. Process., vol. 3, no. 2, pp. 87–93, 2004.
[6] R. Rosenfeld, "Two decades of statistical language modeling: Where do we go from here?" Proceedings of the IEEE, vol. 88, no. 8, pp. 1270–1278, 2000.
[7] A. Stolcke, "SRILM - an extensible language modeling toolkit," in Seventh International Conference on Spoken Language Processing, 2002.
[8] X. Liu and W. B. Croft, "Statistical language modeling for information retrieval," Annu. Rev. Inf. Sci. Technol., vol. 39, no. 1, pp. 1–31, 2005.
[9] C. Zhai, Statistical Language Models for Information Retrieval, ser. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2008.
[10] S. M. Thede and M. P. Harper, "A second-order hidden markov model for part-of-speech tagging," in 27th Annual Meeting of the Association for Computational Linguistics, University of Maryland, College Park, Maryland, USA, 20-26 June 1999, R. Dale and K. W. Church, Eds. ACL, 1999, pp. 175–182.
[11] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "A tree-based statistical language model for natural language speech recognition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 7, pp. 1001–1008, 1989.
[12] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, "Large language models in machine translation," in EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, J. Eisner, Ed. ACL, 2007, pp. 858–867.
[13] S. M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400–401, 1987.
[14] W. A. Gale and G. Sampson, "Good-turing frequency estimation without tears," J. Quant. Linguistics, vol. 2, no. 3, pp. 217–237, 1995.
[15] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137–1155, 2003.
[16] T. Mikolov, M. Karafiát, L. Burget, J. Cernocký, and S. Khudanpur, "Recurrent neural network based language model," in INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010, T. Kobayashi, K. Hirose, and S. Nakamura, Eds. ISCA, 2010, pp. 1045–1048.
[17] S. Kombrink, T. Mikolov, M. Karafiát, and L. Burget, "Recurrent neural network based language modeling in meeting recognition," in INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011. ISCA, 2011, pp. 2877–2880.
[18] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa, "Natural language processing (almost) from scratch," J. Mach. Learn. Res., vol. 12, pp. 2493–2537, 2011.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, Eds., 2013, pp. 3111–3119.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2013.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, "Deep contextualized word representations," in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 2227–2237.
[22] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[23] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4171–4186.
[24] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer, "BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, 2020, pp. 7871–7880.
[25] W. Fedus, B. Zoph, and N. Shazeer, "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity," J. Mach. Learn. Res., pp. 1–40, 2021.
[26] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI blog, p. 9, 2019.
[27] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, "Roberta: A robustly optimized BERT pretraining approach," CoRR, vol. abs/1907.11692, 2019.
[28] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. V. Nayak, D. Datta, J. Chang, M. T. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Févry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush, "Multitask prompted training enables zero-shot task generalization," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[29] T. Wang, A. Roberts, D. Hesslow, T. L. Scao, H. W. Chung, I. Beltagy, J. Launay, and C. Raffel, "What language model architecture and pretraining objective works best for zero-shot generalization?" in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, vol. 162, 2022, pp. 22964–22984.
[30] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, "Scaling laws for neural language models," CoRR, vol. abs/2001.08361, 2020.
[31] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, "Emergent abilities of large language models," CoRR, vol. abs/2206.07682, 2022.
[32] M. Shanahan, "Talking about large language models," CoRR, vol. abs/2212.03551, 2022.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, and D. Zhou, "Chain of thought prompting elicits reasoning in large language models," CoRR, vol. abs/2201.11903, 2022.
[34] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. de Las Casas, L. A. Hendricks, J. Welbl, A. Clark, T. Hennigan, E. Noland, K. Millican, G. van den Driessche, B. Damoc, A. Guy, S. Osindero, K. Simonyan, E. Elsen, J. W. Rae, O. Vinyals, and L. Sifre, "Training compute-optimal large language models," CoRR, vol. abs/2203.15556, 2022.
[35] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Saravia, A. Poulton, V. Kerkez, and R. Stojnic, "Galactica: A large language model for science," CoRR, vol. abs/2211.09085, 2022.
[36] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing," ACM Comput. Surv., pp. 195:1–195:35, 2023.
[37] C. Zhou, Q. Li, C. Li, J. Yu, Y. Liu, G. Wang, K. Zhang, C. Ji, Q. Yan, L. He, H. Peng, J. Li, J. Wu, Z. Liu, P. Xie, C. Xiong, J. Pei, P. S. Yu, and L. Sun, "A comprehensive survey on pretrained foundation models: A history from BERT to chatgpt," CoRR, vol. abs/2302.09419, 2023.
[38] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, Y. Yao, A. Zhang, L. Zhang, W. Han, M. Huang, Q. Jin, Y. Lan, Y. Liu, Z. Liu, Z. Lu, X. Qiu, R. Song, J. Tang, J. Wen, J. Yuan, W. X. Zhao, and J. Zhu, "Pre-trained models: Past, present and future," AI Open, vol. 2, pp. 225–250, 2021.
[39] X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang, "Pre-trained models for natural language processing: A survey," CoRR, vol. abs/2003.08271, 2020.
[40] S. Altman, "Planning for agi and beyond," OpenAI Blog, February 2023.
[41] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang, "Sparks of artificial general intelligence: Early experiments with gpt-4," CoRR, vol. abs/2303.12712, 2023.
[42] S. Huang, L. Dong, W. Wang, Y. Hao, S. Singhal, S. Ma, T. Lv, L. Cui, O. K. Mohammed, B. Patra, Q. Liu, K. Aggarwal, Z. Chi, J. Bjorck, V. Chaudhary, S. Som, X. Song, and F. Wei, "Language is not all you need: Aligning perception with language models," CoRR, vol. abs/2302.14045, 2023.
[43] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, "A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt," arXiv preprint arXiv:2303.04226, 2023.
[44] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu et al., "Palm-e: An embodied multimodal language model," arXiv preprint arXiv:2303.03378, 2023.
[45] C. Wu, S. Yin, W. Qi, X. Wang, Z. Tang, and N. Duan, "Visual chatgpt: Talking, drawing and editing with visual foundation models," arXiv preprint arXiv:2303.04671, 2023.
[46] OpenAI, "Gpt-4 technical report," OpenAI, 2023.
[47] Y. Fu, H. Peng, and T. Khot, "How does gpt obtain its ability? tracing emergent abilities of language models to their sources," Yao Fu's Notion, Dec 2022.
[48] J. Li, T. Tang, W. X. Zhao, and J. Wen, "Pretrained language model for text generation: A survey," in Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, Z. Zhou, Ed. ijcai.org, 2021, pp. 4492–4499.
[49] P. Lu, L. Qiu, W. Yu, S. Welleck, and K. Chang, "A survey of deep learning for mathematical reasoning," CoRR, vol. abs/2212.10535, 2022.
[50] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, and Z. Sui, "A survey for in-context learning," CoRR, vol. abs/2301.00234, 2023.
[51] J. Huang and K. C. Chang, "Towards reasoning in large language models: A survey," CoRR, vol. abs/2212.10403, 2022.
[52] S. Qiao, Y. Ou, N. Zhang, X. Chen, Y. Yao, S. Deng, C. Tan, F. Huang, and H. Chen, "Reasoning with language model prompting: A survey," CoRR, vol. abs/2212.09597, 2022.
[53] J. Zhou, P. Ke, X. Qiu, M. Huang, and J. Zhang, "Chatgpt: potential, prospects, and limitations," in Frontiers of Information Technology & Electronic Engineering, 2023, pp. 1–6.
[54] W. X. Zhao, J. Liu, R. Ren, and J. Wen, "Dense text retrieval based on pretrained language models: A survey," CoRR, vol. abs/2211.14876, 2022.
[55] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, "Language models are few-shot learners," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
[56] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel, "Palm: Scaling language modeling with pathways," CoRR, vol. abs/2204.02311, 2022.
[57] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "Llama: Open and efficient foundation language models," CoRR, 2023.
[58] B. A. Huberman and T. Hogg, "Phase transitions in artificial intelligence systems," Artificial Intelligence, vol. 33, no. 2, pp. 155–171, 1987.
[59] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, H. F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young, E. Rutherford, T. Hennigan, J. Menick, A. Cassirer, R. Powell, G. van den Driessche, L. A. Hendricks, M. Rauh, P. Huang, A. Glaese, J. Welbl, S. Dathathri, S. Huang, J. Uesato, J. Mellor, I. Higgins, A. Creswell, N. McAleese, A. Wu, E. Elsen, S. M. Jayakumar, E. Buchatskaya, D. Budden, E. Sutherland, K. Simonyan, M. Paganini, L. Sifre, L. Martens, X. L. Li, A. Kuncoro, A. Nematzadeh, E. Gribovskaya, D. Donato, A. Lazaridou, A. Mensch, J. Lespiau, M. Tsimpoukelli, N. Grigorev, D. Fritz, T. Sottiaux, M. Pajarskas, T. Pohlen, Z. Gong, D. Toyama, C. de Masson d'Autume, Y. Li, T. Terzi, V. Mikulik, I. Babuschkin, A. Clark, D. de Las Casas, A. Guy, C. Jones, J. Bradbury, M. J. Johnson, B. A. Hechtman, L. Weidinger, I. Gabriel, W. S. Isaac, E. Lockhart, S. Osindero, L. Rimell, C. Dyer, O. Vinyals, K. Ayoub, J. Stanway, L. Bennett, D. Hassabis, K. Kavukcuoglu, and G. Irving, "Scaling language models: Methods, analysis & insights from training gopher," CoRR, vol. abs/2112.11446, 2021.
[60] D. Dai, Y. Sun, L. Dong, Y. Hao, Z. Sui, and F. Wei, "Why can GPT learn in-context? language models secretly perform gradient descent as meta-optimizers," CoRR, vol. abs/2212.10559, 2022.
[61] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe, "Training language models to follow instructions with human feedback," CoRR, vol. abs/2203.02155, 2022.
[62] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[63] A. Ananthaswamy, "In ai, is bigger always better?" Nature, 2023.
[64] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, "Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters," in KDD, 2020, pp. 3505–3506.
[65] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-lm: Training multi-billion parameter language models using model parallelism," CoRR, vol. abs/1909.08053, 2019.
[66] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia, "Efficient large-scale language model training on GPU clusters using megatron-lm," in International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021. ACM, 2021, p. 58.
[67] V. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch, M. Shoeybi, and B. Catanzaro, "Reducing activation recomputation in large transformer models," CoRR, vol. abs/2205.05198, 2022.
[68] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilic, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, J. Tow, A. M. Rush, S. Biderman, A. Webson, P. S. Ammanamanchi, T. Wang, B. Sagot, N. Muennighoff, A. V. del Moral, O. Ruwase, R. Bawden, S. Bekman, A. McMillan-Major, I. Beltagy, H. Nguyen, L. Saulnier, S. Tan, P. O. Suarez, V. Sanh, H. Laurençon, Y. Jernite, J. Launay, M. Mitchell, C. Raffel, A. Gokaslan, A. Simhi, A. Soroa, A. F. Aji, A. Alfassy, A. Rogers, A. K. Nitzav, C. Xu, C. Mou, C. Emezue, C. Klamm, C. Leong, D. van Strien, D. I. Adelani, and et al., "BLOOM: A 176b-parameter open-access multilingual language model," CoRR, vol. abs/2211.05100, 2022.
[69] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, Eds., 2017, pp. 4299–4307.
[70] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, "Toolformer: Language models can teach themselves to use tools," CoRR, vol. abs/2302.04761, 2023.
[71] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, "Webgpt: Browser-assisted question-answering with human feedback," CoRR, vol. abs/2112.09332, 2021.
[72] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," J. Mach. Learn. Res., pp. 140:1–140:67, 2020.
[73] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel, "mt5: A massively multilingual pre-trained text-to-text transformer," in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, 2021, pp. 483–498.
[74] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, K. Wang, X. Zhang, C. Li, Z. Gong, Y. Yao, X. Huang, J. Wang, J. Yu, Q. Guo, Y. Yu, Y. Zhang, J. Wang, H. Tao, D. Yan, Z. Yi, F. Peng, F. Jiang, H. Zhang, L. Deng, Y. Zhang, Z. Lin, C. Zhang, S. Zhang, M. Guo, S. Gu, G. Fan, Y. Wang, X. Jin, Q. Liu, and Y. Tian, "Pangu-α: Large-scale autoregressive pretrained chinese language models with auto-parallel computation," CoRR, vol. abs/2104.12369, 2021.
[75] Z. Zhang, Y. Gu, X. Han, S. Chen, C. Xiao, Z. Sun, Y. Yao, F. Qi, J. Guan, P. Ke, Y. Cai, G. Zeng, Z. Tan, Z. Liu, M. Huang, W. Han, Y. Liu, X. Zhu, and M. Sun, "CPM-2: large-scale cost-effective pre-trained language models," CoRR, vol. abs/2106.10715, 2021.
[76] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, "Codegen: An open large language model for code with multi-turn program synthesis," arXiv preprint arXiv:2203.13474, 2022.
[77] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Golding, H. He, C. Leahy, K. McDonell, J. Phang, M. Pieler, U. S. Prashanth, S. Purohit, L. Reynolds, J. Tow, B. Wang, and S. Weinbach, "Gpt-neox-20b: An open-source autoregressive language model," CoRR, vol. abs/2204.06745, 2022.
[78] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, E. Pathak, G. Karamanolakis, H. G. Lai, I. Purohit, I. Mondal, J. Anderson, K. Kuznia, K. Doshi, K. K. Pal, M. Patel, M. Moradshahi, M. Parmar, M. Purohit, N. Varshney, P. R. Kaza, P. Verma, R. S. Puri, R. Karia, S. Doshi, S. K. Sampat, S. Mishra, S. R. A, S. Patro, T. Dixit, and X. Shen, "Super-naturalinstructions: Generalization via declarative instructions on 1600+ NLP tasks," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 5085–5109.
[79] Y. Tay, M. Dehghani, V. Q. Tran, X. García, J. Wei, X. Wang, H. W. Chung, D. Bahri, T. Schuster, H. Zheng, D. Zhou, N. Houlsby, and D. Metzler, "Ul2: Unifying language learning paradigms," 2022.
[80] S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. T. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, "OPT: open pre-trained transformer language models," CoRR, vol. abs/2205.01068, 2022.
[81] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, and J. Wang, "No language left behind: Scaling human-centered machine translation," CoRR, vol. abs/2207.04672, 2022.
[82] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and J. Tang, "GLM-130B: an open bilingual pre-trained model," CoRR, vol. abs/2210.02414, 2022.
[83] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, S. Narang, G. Mishra, A. Yu, V. Y. Zhao, Y. Huang, A. M. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei, "Scaling instruction-finetuned language models," CoRR, vol. abs/2210.11416, 2022.
[84] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, and C. Raffel, "Crosslingual generalization through multitask finetuning," CoRR, vol. abs/2211.01786, 2022.
[85] S. Iyer, X. V. Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, X. Li, B. O'Horo, G. Pereyra, J. Wang, C. Dewan, A. Celikyilmaz, L. Zettlemoyer, and V. Stoyanov, "OPT-IML: scaling language model instruction meta learning through the lens of generalization," CoRR, vol. abs/2212.12017, 2022.
[86] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O'Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., "Pythia: A suite for analyzing large language models across training and scaling," arXiv preprint arXiv:2304.01373, 2023.
[87] D. Lepikhin, H. Lee, Y. Xu, D. Chen, O. Firat, Y. Huang, M. Krikun, N. Shazeer, and Z. Chen, "Gshard: Scaling giant models with conditional computation and automatic sharding," in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, 2021.
[88] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba, "Evaluating large language models trained on code," CoRR, vol. abs/2107.03374, 2021.
[89] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, Y. Zhao, Y. Lu, W. Liu, Z. Wu, W. Gong, J. Liang, Z. Shang, P. Sun, W. Liu, X. Ouyang, D. Yu, H. Tian, H. Wu, and H. Wang, "ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation," CoRR, vol. abs/2107.02137, 2021.
[90] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, "Jurassic-1: Technical details and evaluation," White Paper, AI21 Labs, vol. 1, 2021.
[91] B. Kim, H. Kim, S. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, S. Kim, S. Kim, D. Seo, H. Lee, M. Jeong, S. Lee, M. Kim, S. Ko, S. Kim, T. Park, J. Kim, S. Kang, N. Ryu, K. M. Yoo, M. Chang, S. Suh, S. In, J. Park, K. Kim, H. Kim, J. Jeong, Y. G. Yeo, D. Ham, D. Park, M. Y. Lee, J. Kang, I. Kang, J. Ha, W. Park, and N. Sung, "What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021. Association for Computational Linguistics, 2021.
[92] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, J. Luo, L. Xu et al., "Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning," arXiv preprint arXiv:2110.04725, 2021.
[93] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan, "A general language assistant as a laboratory for alignment," CoRR, vol. abs/2112.00861, 2021.
[94] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, J. Shang, Y. Zhao, C. Pang, J. Liu, X. Chen, Y. Lu, W. Liu, X. Wang, Y. Bai, Q. Chen, L. Zhao, S. Li, P. Sun, D. Yu, Y. Ma, H. Tian, H. Wu, T. Wu, W. Zeng, G. Li, W. Gao, and H. Wang, "ERNIE 3.0 titan: Exploring larger-scale knowledge enhanced pre-training for language understanding and generation," CoRR, vol. abs/2112.12731, 2021.
[96] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, Y. Zhou, C. Chang, I. Krivokon, W. Rusch, M. Pickett, K. S. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera-Arcas, C. Cui, M. Croak, E. H. Chi, and Q. Le, “Lamda: Language models for dialog applications,” CoRR, vol. abs/2201.08239, 2022.
[97] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti, E. Zheng, R. Child, R. Y. Aminabadi, J. Bernauer, X. Song, M. Shoeybi, Y. He, M. Houston, S. Tiwary, and B. Catanzaro, “Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale generative language model,” CoRR, vol. abs/2201.11990, 2022.
[98] Y. Li, D. H. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. J. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals, “Competition-level code generation with alphacode,” Science, 2022.
[99] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky, C. S. Prakash, M. Sridhar, F. Triefenbach, A. Verma, G. Tür, and P. Natarajan, “Alexatm 20b: Few-shot learning using a large-scale multilingual seq2seq model,” CoRR, vol. abs/2208.01448, 2022.
[100] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, L. Campbell-Gillingham, J. Uesato, P. Huang, R. Comanescu, F. Yang, A. See, S. Dathathri, R. Greig, C. Chen, D. Fritz, J. S. Elias, R. Green, S. Mokrá, N. Fernando, B. Wu, R. Foley, S. Young, I. Gabriel, W. Isaac, J. Mellor, D. Hassabis, K. Kavukcuoglu, L. A. Hendricks, and G. Irving, “Improving alignment of dialogue agents via targeted human judgements,” CoRR, vol. abs/2209.14375, 2022.
[101] H. Su, X. Zhou, H. Yu, Y. Chen, Z. Zhu, Y. Yu, and J. Zhou, “Welm: A well-read pre-trained language model for chinese,” CoRR, vol. abs/2209.10372, 2022.
[102] Y. Tay, J. Wei, H. W. Chung, V. Q. Tran, D. R. So, S. Shakeri, X. Garcia, H. S. Zheng, J. Rao, A. Chowdhery, D. Zhou, D. Metzler, S. Petrov, N. Houlsby, Q. V. Le, and M. Dehghani, “Transcending scaling laws with 0.1% extra compute,” CoRR, vol. abs/2210.11399, 2022.
[103] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang, W. Wang, P. Li, X. Zhang, A. Podolskiy, G. Arshinov, A. Bout, I. Piontkovskaya, J. Wei, X. Jiang, T. Su, Q. Liu, and J. Yao, “Pangu-Σ: Towards trillion parameter language model with sparse heterogeneous computing,” CoRR, vol. abs/2303.10845, 2023.
[104] Huawei Technologies Co., Ltd., “Huawei mindspore ai development framework,” in Artificial Intelligence Technology. Springer, 2022, pp. 137–162.
[105] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023.
[106] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” 2023. [Online]. Available: https://vicuna.lmsys.org
[107] 2023. [Online]. Available: https://github.com/nebuly-ai/nebullvm/tree/main/apps/accelerate/chatllama
[108] Y. You, “Colossalchat: An open-source solution for cloning chatgpt with a complete rlhf pipeline,” 2023. [Online]. Available: https://medium.com/@yangyou_berkeley/colossalchat-an-open-source-solution-for-cloning-chatgpt-with-a-complete-rlhf-pipeline-5edf08fb538b
[109] Y. Zhu, R. Kiros, R. S. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler, “Aligning books and movies: Towards story-like visual explanations by watching movies and reading books,” in 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. IEEE Computer Society, 2015, pp. 19–27.
[110] “Project gutenberg.” [Online]. Available: https://www.gutenberg.org/
[111] T. H. Trinh and Q. V. Le, “A simple method for commonsense reasoning,” CoRR, vol. abs/1806.02847, 2018.
[112] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi, “Defending against neural fake news,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 9051–9062.
[113] A. Gokaslan, V. C. E. Pavlick, and S. Tellex, “Openwebtext corpus,” http://Skylion007.github.io/OpenWebTextCorpus, 2019.
[114] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, and J. Blackburn, “The pushshift reddit dataset,” in Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, ICWSM 2020, Held Virtually, Original Venue: Atlanta, Georgia, USA, June 8-11, 2020. AAAI Press, 2020, pp. 830–839.
[115] “Wikipedia.” [Online]. Available: https://en.wikipedia.org/wiki/Main_Page
[116] “Bigquery dataset.” [Online]. Available: https://cloud.google.com/bigquery?hl=zh-cn
[117] L. Gao, S. Biderman, S. Black, L. Golding, T. Hoppe, C. Foster, J. Phang, H. He, A. Thite, N. Nabeshima, S. Presser, and C. Leahy, “The pile: An 800gb dataset of diverse text for language modeling,” CoRR, vol. abs/2101.00027, 2021.
[118] H. Laurençon, L. Saulnier, T. Wang, C. Akiki, A. V. del Moral, T. Le Scao, L. Von Werra, C. Mou, E. G. Ponferrada, H. Nguyen et al., “The bigscience roots corpus: A 1.6 tb composite multilingual dataset,” in Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
[119] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
[120] “Common crawl.” [Online]. Available: https://commoncrawl.org/
[121] “A reproduction version of cc-stories on hugging face.” [Online]. Available: https://huggingface.co/datasets/spacemanidol/cc-stories
[122] B. Wang and A. Komatsuzaki, “GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model,” https://github.com/kingoflolz/mesh-transformer-jax, 2021.
[123] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020. Association for Computational Linguistics, 2020, pp. 38–45.
[124] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary, D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-Milne, and Q. Zhang, “JAX: composable transformations of Python+NumPy programs,” 2018. [Online]. Available: http://github.com/google/jax
[125] Z. Bian, H. Liu, B. Wang, H. Huang, Y. Li, C. Wang, F. Cui, and Y. You, “Colossal-ai: A unified deep learning system for large-scale parallel training,” CoRR, vol. abs/2110.14883, 2021.
[126] J. Fang, Y. Yu, S. Li, Y. You, and J. Zhou, “Patrickstar: Parallel training of pre-trained models via a chunk-based memory management,” CoRR, vol. abs/2108.05818, 2021.
[127] “Bmtrain: Efficient training for big models.” [Online]. Available: https://github.com/OpenBMB/BMTrain
[128] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “Fastmoe: A fast mixture-of-expert training system,” CoRR, vol. abs/2103.13262, 2021.
[129] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Z. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 8024–8035.
[130] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, K. Keeton and T. Roscoe, Eds. USENIX Association, 2016, pp. 265–283.
[131] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” CoRR, vol. abs/1512.01274, 2015.
[132] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An open-source deep learning platform from industrial practice,” Frontiers of Data and Computing, vol. 1, no. 1, p. 105, 2019.
[133] J. Yuan, X. Li, C. Cheng, J. Liu, R. Guo, S. Cai, C. Yao, F. Yang, X. Yi, C. Wu, H. Zhang, and J. Zhao, “Oneflow: Redesign the distributed deep learning framework from scratch,” CoRR, vol. abs/2110.15032, 2021.
[134] S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and J. Weston, “Recipes for building an open-domain chatbot,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, 2021, pp. 300–325.
[135] A. Lewkowycz, A. Andreassen, D. Dohan, E. Dyer, H. Michalewski, V. V. Ramasesh, A. Slone, C. Anil, I. Schlag, T. Gutman-Solo, Y. Wu, B. Neyshabur, G. Gur-Ari, and V. Misra, “Solving quantitative reasoning problems with language models,” CoRR, vol. abs/2206.14858, 2022.
[136] T. Saier, J. Krause, and M. Färber, “unarxive 2022: All arxiv publications pre-processed for nlp, including structured full-text and citation network,” arXiv preprint arXiv:2303.14957, 2023.
[137] H. A. Simon, “Experiments with a heuristic compiler,” J. ACM, vol. 10, no. 4, pp. 493–506, 1963.
[138] Z. Manna and R. J. Waldinger, “Toward automatic program synthesis,” Commun. ACM, vol. 14, no. 3, pp. 151–165, 1971.
[139] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “Codebert: A pre-trained model for programming and natural languages,” in Findings of EMNLP, 2020.
[140] J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton, “Program synthesis with large language models,” CoRR, vol. abs/2108.07732, 2021.
[141] S. Black, L. Gao, P. Wang, C. Leahy, and S. Biderman, “GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow,” 2021.
[142] F. F. Xu, U. Alon, G. Neubig, and V. J. Hellendoorn, “A systematic evaluation of large language models of code,” in MAPS@PLDI, 2022.
[143] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W. Yih, L. Zettlemoyer, and M. Lewis, “Incoder: A generative model for code infilling and synthesis,” in ICLR, 2023.
[144] A. Madaan, S. Zhou, U. Alon, Y. Yang, and G. Neubig, “Language models of code are few-shot commonsense learners,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 1384–1403.
[145] Y. Wu, A. Q. Jiang, W. Li, M. N. Rabe, C. Staats, M. Jamnik, and C. Szegedy, “Autoformalization with large language models,” CoRR, vol. abs/2205.12615, 2022.
[146] D. Hernandez, T. B. Brown, T. Conerly, N. DasSarma, D. Drain, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, T. Henighan, T. Hume, S. Johnston, B. Mann, C. Olah, C. Olsson, D. Amodei, N. Joseph, J. Kaplan, and S. McCandlish, “Scaling laws and interpretability of learning from repeated data,” CoRR, vol. abs/2205.10487, 2022.
[147] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi, “The curious case of neural text degeneration,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[148] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, and N. Carlini, “Deduplicating training data makes language models better,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 8424–8445.
[149] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramèr, and C. Zhang, “Quantifying memorization across neural language models,” CoRR, 2022.
[150] N. Carlini, F. Tramèr, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, Ú. Erlingsson, A. Oprea, and C. Raffel, “Extracting training data from large language models,” in 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, 2021, pp. 2633–2650.
[151] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicating training data mitigates privacy risks in language models,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA. PMLR, 2022, pp. 10697–10707.
[152] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, E. Blanco and W. Lu, Eds. Association for Computational Linguistics, 2018.
[153] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics, 2016.
[154] M. Davis and M. Dürst, “Unicode normalization forms,” 2001.
[155] D. Paperno, G. Kruszewski, A. Lazaridou, Q. N. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández, “The LAMBADA dataset: Word prediction requiring a broad discourse context,” in ACL (1). The Association for Computer Linguistics, 2016.
[156] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever, “Deep double descent: Where bigger models and more data hurt,” in 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020.
[157] B. Zhang, B. Ghorbani, A. Bapna, Y. Cheng, X. Garcia, J. Shen, and O. Firat, “Examining scaling and transfer of language model architectures for machine translation,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. 26176–26192.
[158] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao, M. Zhou, and H. Hon, “Unified language model pre-training for natural language understanding and generation,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 13042–13054.
[159] A. Clark, D. de Las Casas, A. Guy, A. Mensch, M. Paganini, J. Hoffmann, B. Damoc, B. A. Hechtman, T. Cai, S. Borgeaud, G. van den Driessche, E. Rutherford, T. Hennigan, M. J. Johnson, A. Cassirer, C. Jones, E. Buchatskaya, D. Budden, L. Sifre, S. Osindero, O. Vinyals, M. Ranzato, J. W. Rae, E. Elsen, K. Kavukcuoglu, and K. Simonyan, “Unified scaling laws for routed language models,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, 2022, pp. 4057–4086.
[160] L. J. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol. abs/1607.06450, 2016.
[161] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu, “On layer normalization in the transformer architecture,” in ICML, 2020.
[162] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, “Cogview: Mastering text-to-image generation via transformers,” in Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 2021, pp. 19822–19835.
[163] B. Zhang and R. Sennrich, “Root mean square layer normalization,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, 2019, pp. 12360–12371.
[164] S. Narang, H. W. Chung, Y. Tay, L. Fedus, T. Févry, M. Matena, K. Malkan, N. Fiedel, N. Shazeer, Z. Lan, Y. Zhou, W. Li, N. Ding, J. Marcus, A. Roberts, and C. Raffel, “Do transformer modifications transfer across implementations and applications?” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, 2021, pp. 5758–5773.
[165] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei, “Deepnet: Scaling transformers to 1,000 layers,” CoRR, vol. abs/2203.00555, 2022.
[166] T. L. Scao, T. Wang, D. Hesslow, S. Bekman, M. S. Bari, S. Biderman, H. Elsahar, N. Muennighoff, J. Phang, O. Press, C. Raffel, V. Sanh, S. Shen, L. Sutawika, J. Tae, Z. X. Yong, J. Launay, and I. Beltagy, “What language model to train if you have one million GPU hours?” in Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 765–782.
[167] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
[168] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, 2017, pp. 933–941.
[169] N. Shazeer, “GLU variants improve transformer,” CoRR, vol. abs/2002.05202, 2020.
[170] O. Press, N. A. Smith, and M. Lewis, “Train short, test long: Attention with linear biases enables input length extrapolation,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
[171] J. Su, Y. Lu, S. Pan, B. Wen, and Y. Liu, “Roformer: Enhanced transformer with rotary position embedding,” CoRR, vol. abs/2104.09864, 2021.
[172] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences with sparse transformers,” CoRR, vol. abs/1904.10509, 2019.
[173] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong, “Random feature attention,” in 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021.
[174] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed, “Big bird: Transformers for longer sequences,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[175] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Re, “Flashattention: Fast and memory-efficient exact attention with IO-awareness,” in NeurIPS, 2022.
[176] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2015.
[177] I. Loshchilov and F. Hutter, “Fixing weight decay regularization in adam,” CoRR, vol. abs/1711.05101, 2017.
[178] N. Shazeer and M. Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, ser. Proceedings of Machine Learning Research, J. G. Dy and A. Krause, Eds., vol. 80. PMLR, 2018, pp. 4603–4611.
[179] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. X. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, “Gpipe: Efficient training of giant neural networks using pipeline parallelism,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 103–112.
[180] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, and P. B. Gibbons, “Pipedream: Fast and efficient pipeline parallel DNN training,” CoRR, vol. abs/1806.03377, 2018.
[181] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: memory optimizations toward training trillion parameter models,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, C. Cuicchi, I. Qualters, and W. T. Kramer, Eds. IEEE/ACM, 2020, p. 20.
[182] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. García, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu, “Mixed precision training,” CoRR, vol. abs/1710.03740, 2017.
[183] Q. Xu, S. Li, C. Gong, and Y. You, “An efficient 2d method for training super-large deep learning models,” CoRR, vol. abs/2104.05343, 2021.
[184] B. Wang, Q. Xu, Z. Bian, and Y. You, “Tesseract: Parallelize the tensor parallelism efficiently,” in Proceedings of the 51st International Conference on Parallel Processing, ICPP 2022, Bordeaux, France, 29 August 2022 - 1 September 2022. ACM, 2022.
[185] Z. Bian, Q. Xu, B. Wang, and Y. You, “Maximizing parallelism in distributed training for huge neural networks,” CoRR, vol. abs/2105.14450, 2021.
[186] S. Li, F. Xue, C. Baranwal, Y. Li, and Y. You, “Sequence parallelism: Long sequence training from system perspective,” arXiv preprint arXiv:2105.13120, 2021.
[187] FairScale authors, “Fairscale: A general purpose modular pytorch library for high performance and large scale training,” https://github.com/facebookresearch/fairscale, 2021.
[188] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing et al., “Alpa: Automating inter- and intra-operator parallelism for distributed deep learning,” in OSDI, 2022, pp. 559–578.
[189] T. Chen, B. Xu, C. Zhang, and C. Guestrin, “Training deep nets with sublinear memory cost,” CoRR, vol. abs/1604.06174, 2016.
[190] Z. Yao, C. Li, X. Wu, S. Youn, and Y. He, “A comprehensive study on post-training quantization for large language models,” CoRR, vol. abs/2303.08302, 2023.
[191] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Llm.int8(): 8-bit matrix multiplication for transformers at scale,” CoRR, vol. abs/2208.07339, 2022.
[192] C. Tao, L. Hou, W. Zhang, L. Shang, X. Jiang, Q. Liu, P. Luo, and N. Wong, “Compression of generative pre-trained language models via quantization,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 4821–4836.
[193] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi, “Cross-task generalization via natural language crowdsourcing instructions,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., 2022, pp. 3470–3487.
[194] Q. Ye, B. Y. Lin, and X. Ren, “Crossfit: A few-shot learning challenge for cross-task generalization in NLP,” in EMNLP (1). Association for Computational Linguistics, 2021, pp. 7163–7189.
[195] S. H. Bach, V. Sanh, Z. X. Yong, A. Webson, C. Raffel, N. V. Nayak, A. Sharma, T. Kim, M. S. Bari, T. Févry, Z. Alyafeai, M. Dey, A. Santilli, Z. Sun, S. Ben-David, C. Xu, G. Chhablani, H. Wang, J. A. Fries, M. S. AlShaibani, S. Sharma, U. Thakker, K. Almubarak, X. Tang, D. R. Radev, M. T. Jiang, and A. M. Rush, “Promptsource: An integrated development environment and repository for natural language prompts,” in ACL (demo). Association for Computational Linguistics, 2022, pp. 93–104.
[196] V. Aribandi, Y. Tay, T. Schuster, J. Rao, H. S. Zheng, S. V. Mehta, H. Zhuang, V. Q. Tran, D. Bahri, J. Ni, J. P. Gupta, K. Hui, S. Ruder, and D. Metzler, “Ext5: Towards extreme multi-task scaling for transfer learning,” in ICLR. OpenReview.net, 2022.
[197] T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C. Wu, M. Zhong, P. Yin, S. I. Wang, V. Zhong, B. Wang, C. Li, C. Boyle, A. Ni, Z. Yao, D. Radev, C. Xiong, L. Kong, R. Zhang, N. A. Smith, L. Zettlemoyer, and T. Yu, “Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models,” in EMNLP. Association for Computational Linguistics, 2022, pp. 602–631.
[198] T. Tang, J. Li, W. X. Zhao, and J. Wen, “MVP: multi-task supervised pre-training for natural language generation,” CoRR, vol. abs/2206.12131, 2022.
[199] R. Lou, K. Zhang, and W. Yin, “Is prompt all you need? no. A comprehensive and broader view of instruction learning,” CoRR, vol. abs/2303.10475, 2023.
[200] X. Liu, P. He, W. Chen, and J. Gao, “Multi-task deep neural networks for natural language understanding,” in ACL (1). Association for Computational Linguistics, 2019, pp. 4487–4496.
[201] A. Aghajanyan, A. Gupta, A. Shrivastava, X. Chen, L. Zettlemoyer, and S. Gupta, “Muppet: Massive multi-task representations with pre-finetuning,” in EMNLP (1). Association for Computational Linguistics, 2021, pp. 5799–5811.
[202] S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, and A. Roberts, “The flan collection: Designing data and methods for effective instruction tuning,” CoRR, vol. abs/2301.13688, 2023.
[203] Y. Gu, P. Ke, X. Zhu, and M. Huang, “Learning instructions with unlabeled data for zero-shot cross-task generalization,” in EMNLP. Association for Computational Linguistics, 2022, pp. 1617–1634.
[204] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language model with self generated instructions,” CoRR, vol. abs/2212.10560, 2022.
[205] O. Honovich, T. Scialom, O. Levy, and T. Schick, “Unnatural instructions: Tuning language models with (almost) no human labor,” CoRR, vol. abs/2212.09689, 2022.
[206] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford_alpaca, 2023.
[207] Z. Kenton, T. Everitt, L. Weidinger, I. Gabriel, V. Mikulik, and G. Irving, “Alignment of language agents,” CoRR, vol. abs/2103.14659, 2021.
[208] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, J. Kernion, K. Ndousse, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, and J. Kaplan, “A general language assistant as a laboratory for alignment,” CoRR, vol. abs/2112.00861, 2021.
[209] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. E. Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. B. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan, “Training a helpful and harmless assistant with reinforcement learning from human feedback,” CoRR, vol. abs/2204.05862, 2022.
[210] E. Perez, S. Huang, H. F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving, “Red teaming language models with language models,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 3419–3448.
[211] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, A. Jones, S. Bowman, A. Chen, T. Conerly, N. DasSarma, D. Drain, N. Elhage, S. E. Showk, S. Fort, Z. Hatfield-Dodds, T. Henighan, D. Hernandez, T. Hume, J. Jacobson, S. Johnston, S. Kravec, C. Olsson, S. Ringer, E. Tran-Johnson, D. Amodei, T. Brown, N. Joseph, S. McCandlish, C. Olah, J. Kaplan, and J. Clark, “Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned,” CoRR, vol. abs/2209.07858, 2022.
[212] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. F. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” CoRR, vol. abs/1909.08593, 2019.
[213] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize from human feedback,” CoRR, vol. abs/2009.01325, 2020.
[214] J. Menick, M. Trebacz, V. Mikulik, J. Aslanides, H. F. Song, M. Chadwick, M. Glaese, S. Young, L. Campbell-Gillingham, G. Irving, and N. McAleese, “Teaching language models to support answers with verified quotes,” CoRR, vol. abs/2203.11147, 2022.
[215] J. Wu, L. Ouyang, D. M. Ziegler, N. Stiennon, R. Lowe, J. Leike, and P. F. Christiano, “Recursively summarizing books with human feedback,” CoRR, vol. abs/2109.10862, 2021.
[216] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[217] S. Min, X. Lyu, A. Holtzman, M. Artetxe, M. Lewis, H. Hajishirzi, and L. Zettlemoyer, “Rethinking the role of demonstrations: What makes in-context learning work?” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022. Association for Computational Linguistics, 2022, pp. 11048–11064.
[218] Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp, “Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds., 2022, pp. 8086–8098.
[219] Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh, “Calibrate before use: Improving few-shot performance of language models,” in Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang, Eds., 2021, pp. 12697–12706.
[220] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen, “What makes good in-context examples for gpt-3?” in Proceedings of Deep Learning Inside Out: The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, DeeLIO@ACL 2022, Dublin, Ireland and Online, May 27, 2022, 2022, pp. 100–114.
[221] Y. Lee, C. Lim, and H. Choi, “Does GPT-3 generate empathetic dialogues? A novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, Eds. International Committee on Computational Linguistics, 2022, pp. 669–683.
[222] I. Levy, B. Bogin, and J. Berant, “Diverse demonstrations improve in-context compositional generalization,” CoRR, vol. abs/2212.06800, 2022.
[223] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith, and T. Yu, “Selective annotation makes language models better few-shot learners,” CoRR, 2022.
[224] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, and R. Pasunuru, “Complementary explanations for effective in-context learning,” CoRR, 2022.
[225] X. Li and X. Qiu, “Finding supporting examples for in-context learning,” CoRR, 2023.
[226] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, 2022, pp. 2655–2671.
[227] Y. Zhang, S. Feng, and C. Tan, “Active example selection for in-context learning,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 9134–9148.
[228] F. Gilardi, M. Alizadeh, and M. Kubli, “Chatgpt outperforms crowd-workers for text-annotation tasks,” 2023.
[229] H. J. Kim, H. Cho, J. Kim, T. Kim, K. M. Yoo, and S. Lee, “Self-generated in-context learning: Leveraging auto-regressive language models as a demonstration generator,” CoRR, vol. abs/2206.08082, 2022.
[230] Y. Lin, A. Papangelis, S. Kim, S. Lee, D. Hazarika, M. Namazifar, D. Jin, Y. Liu, and D. Hakkani-Tur, “Selective in-context data augmentation for intent detection using pointwise v-information,” CoRR, 2023.
[231] S. M. Xie, A. Raghunathan, P. Liang, and T. Ma, “An explanation of in-context learning as implicit bayesian inference,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, 2022.
[232] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought prompting in large language models,” CoRR, vol. abs/2210.03493, 2022.
[233] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, O. Bousquet, Q. Le, and E. H. Chi, “Least-to-most prompting enables complex reasoning in large language models,” CoRR, vol. abs/2205.10625, 2022.
[234] Z. Wu, Y. Wang, J. Ye, and L. Kong, “Self-adaptive in-context learning,” CoRR, vol. abs/2212.10375, 2022.
[235] S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi, “Metaicl: Learning to learn in context,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2022, Seattle, WA, United States, July 10-15, 2022, M. Carpuat, M. de Marneffe, and I. V. M. Ruíz, Eds., 2022, pp. 2791–2809.
[236] S. C. Y. Chan, A. Santoro, A. K. Lampinen, J. X. Wang, A. Singh, P. H. Richemond, J. McClelland, and F. Hill, “Data distributional properties drive emergent in-context learning in transformers,” CoRR, vol. abs/2205.05055, 2022.
[237] S. Shin, S. Lee, H. Ahn, S. Kim, H. Kim, B. Kim, K. Cho, G. Lee, W. Park, J. Ha, and N. Sung, “On the effect of pretraining corpora on in-context learning by a large-scale language model,” in NAACL-HLT. Association for Computational Linguistics, 2022, pp. 5168–5186.
[238] J. von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov, “Transformers learn in-context by gradient descent,” CoRR, vol. abs/2212.07677, 2022.
[239] C. Olsson, N. Elhage, N. Nanda, N. Joseph, N. DasSarma, T. Henighan, B. Mann, A. Askell, Y. Bai, A. Chen, T. Conerly, D. Drain, D. Ganguli, Z. Hatfield-Dodds, D. Hernandez, S. Johnston, A. Jones, J. Kernion, L. Lovitt, K. Ndousse, D. Amodei, T. Brown, J. Clark, J. Kaplan, S. McCandlish, and C. Olah, “In-context learning and induction heads,” CoRR, vol. abs/2209.11895, 2022.
[240] H. Bansal, K. Gopalakrishnan, S. Dingliwal, S. Bodapati, K. Kirchhoff, and D. Roth, “Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale,” CoRR, vol. abs/2212.09095, 2022.
[241] Y. Li, M. E. Ildiz, D. S. Papailiopoulos, and S. Oymak, “Transformers as algorithms: Generalization and implicit model selection in in-context learning,” CoRR, vol. abs/2301.07067, 2023.
[242] E. Akyürek, D. Schuurmans, J. Andreas, T. Ma, and D. Zhou, “What learning algorithm is in-context learning? investigations with linear models,” CoRR, vol. abs/2211.15661, 2022.
[243] S. Garg, D. Tsipras, P. Liang, and G. Valiant, “What can transformers learn in-context? A case study of simple function classes,” CoRR, vol. abs/2208.01066, 2022.
[244] K. Cobbe, V. Kosaraju, M. Bavarian, J. Hilton, R. Nakano, C. Hesse, and J. Schulman, “Training verifiers to solve math word problems,” CoRR, vol. abs/2110.14168, 2021.
[245] A. Patel, S. Bhattamishra, and N. Goyal, “Are NLP models really able to solve simple math word problems?” in NAACL-HLT. Association for Computational Linguistics, 2021, pp. 2080–2094.
[246] S. Miao, C. Liang, and K. Su, “A diverse corpus for evaluating and developing english math word problem solvers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, Eds. Association for Computational Linguistics, 2020, pp. 975–984.
[247] A. Talmor, J. Herzig, N. Lourie, and J. Berant, “Commonsenseqa: A question answering challenge targeting commonsense knowledge,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 4149–4158.
[248] M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant, “Did aristotle use a laptop? A question answering benchmark with implicit reasoning strategies,” Trans. Assoc. Comput. Linguistics, vol. 9, pp. 346–361, 2021.
[249] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J. Lou, and W. Chen, “On the advance of making language models better reasoners,” CoRR, vol. abs/2206.02336, 2022.
[250] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, “Complexity-based prompting for multi-step reasoning,” CoRR, vol. abs/2210.00720, 2022.
[251] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” CoRR, vol. abs/2205.11916, 2022.
[252] X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” CoRR, vol. abs/2203.11171, 2022.
[253] ——, “Rationale-augmented ensembles in language models,” CoRR, 2022.
[254] S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical reasoning using large language models,” arXiv preprint arXiv:2303.05398, 2023.
[255] E. Zelikman, J. Mu, N. D. Goodman, and Y. T. Wu, “Star: Self-taught reasoner bootstrapping reasoning with reasoning,” 2022.
[256] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, and J. Han, “Large language models can self-improve,” CoRR, vol. abs/2210.11610, 2022.
[257] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A multi-task benchmark and analysis platform for natural language understanding,” in Proceedings of the Workshop: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@EMNLP 2018, Brussels, Belgium, November 1, 2018, T. Linzen, G. Chrupala, and A. Alishahi, Eds. Association for Computational Linguistics, 2018, pp. 353–355.
[258] P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y. Zhang, D. Narayanan, Y. Wu, A. Kumar, B. Newman, B. Yuan, B. Yan, C. Zhang, C. Cosgrove, C. D. Manning, C. Ré, D. Acosta-Navas, D. A. Hudson, E. Zelikman, E. Durmus, F. Ladhak, F. Rong, H. Ren, H. Yao, J. Wang, K. Santhanam, L. J. Orr, L. Zheng, M. Yüksekgönül, M. Suzgun, N. Kim, N. Guha, N. S. Chatterji, O. Khattab, P. Henderson, Q. Huang, R. Chi, S. M. Xie, S. Santurkar, S. Ganguli, T. Hashimoto, T. Icard, T. Zhang, V. Chaudhary, W. Wang, X. Li, Y. Mai, Y. Zhang, and Y. Koreeda, “Holistic evaluation of language models,” CoRR, vol. abs/2211.09110, 2022.
[259] A. Madaan and A. Yazdanbakhsh, “Text and patterns: For effective chain of thought, it takes two to tango,” CoRR, vol. abs/2209.07686, 2022.
[260] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” CoRR, vol. abs/2302.00923, 2023.
[261] F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei, “Language models are multilingual chain-of-thought reasoners,” CoRR, vol. abs/2210.03057, 2022.
[262] K. Shridhar, A. Stolfo, and M. Sachan, “Distilling multi-step reasoning capabilities of large language models into smaller models via semantic decompositions,” CoRR, vol. abs/2212.00193, 2022.
[263] N. Ho, L. Schmid, and S. Yun, “Large language models are reasoning teachers,” CoRR, vol. abs/2212.10071, 2022.
[264] L. C. Magister, J. Mallinson, J. Adámek, E. Malmi, and A. Severyn, “Teaching small language models to reason,” CoRR, vol. abs/2212.08410, 2022.
[265] Y. Fu, H. Peng, L. Ou, A. Sabharwal, and T. Khot, “Specializing smaller language models towards multi-step reasoning,” CoRR, vol. abs/2301.12726, 2023.
[266] A. Chan, Z. Zeng, W. Lake, B. Joshi, H. Chen, and X. Ren, “Knife: Distilling meta-reasoning knowledge with free-text rationales,” in ICLR 2023 Workshop on Pitfalls of limited data and computation for Trustworthy ML.
[267] Z. Li, C. Wang, P. Ma, C. Liu, S. Wang, D. Wu, and C. Gao, “On the feasibility of specialized ability stealing for large language code models,” CoRR, 2023.
[268] Z. Dai, V. Y. Zhao, J. Ma, Y. Luan, J. Ni, J. Lu, A. Bakalov, K. Guu, K. B. Hall, and M. Chang, “Promptagator: Few-shot dense retrieval from 8 examples,” CoRR, 2022.
[269] M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz, “Building a large annotated corpus of english: The penn treebank,” Comput. Linguistics, vol. 19, no. 2, pp. 313–330, 1993.
[270] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” in ICLR (Poster). OpenReview.net, 2017.
[271] O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna, “Findings of the 2014 workshop on statistical machine translation,” in WMT@ACL. The Association for Computer Linguistics, 2014, pp. 12–58.
[272] O. Bojar, R. Chatterjee, C. Federmann, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, V. Logacheva, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Popel, M. Post, R. Rubino, C. Scarton, L. Specia, M. Turchi, K. Verspoor, and M. Zampieri, “Findings of the 2016 conference on machine translation,” in WMT. The Association for Computer Linguistics, 2016, pp. 131–198.
[273] L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, C. Monz, M. Müller, S. Pal, M. Post, and M. Zampieri, “Findings of the 2019 conference on machine translation (WMT19),” in Proceedings of the Fourth Conference on Machine Translation, WMT 2019, Florence, Italy, August 1-2, 2019 - Volume 2: Shared Task Papers, Day 1, O. Bojar, R. Chatterjee, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, C. Monz, M. Negri, A. Névéol, M. L. Neves, M. Post, M. Turchi, and K. Verspoor, Eds. Association for Computational Linguistics, 2019, pp. 1–61.
[274] L. Barrault, M. Biesialska, O. Bojar, M. R. Costa-jussà, C. Federmann, Y. Graham, R. Grundkiewicz, B. Haddow, M. Huck, E. Joanis, T. Kocmi, P. Koehn, C. Lo, N. Ljubesic, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, S. Pal, M. Post, and M. Zampieri, “Findings of the 2020 conference on machine translation (WMT20),” in Proceedings of the Fifth Conference on Machine Translation, WMT@EMNLP 2020, Online, November 19-20, 2020, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, Y. Graham, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, and M. Negri, Eds. Association for Computational Linguistics, 2020, pp. 1–55.
[275] F. Akhbardeh, A. Arkhangorodsky, M. Biesialska, O. Bojar, R. Chatterjee, V. Chaudhary, M. R. Costa-jussà, C. España-Bonet, A. Fan, C. Federmann, M. Freitag, Y. Graham, R. Grundkiewicz, B. Haddow, L. Harter, K. Heafield, C. Homan, M. Huck, K. Amponsah-Kaakyire, J. Kasai, D. Khashabi, K. Knight, T. Kocmi, P. Koehn, N. Lourie, C. Monz, M. Morishita, M. Nagata, A. Nagesh, T. Nakazawa, M. Negri, S. Pal, A. A. Tapo, M. Turchi, V. Vydrin, and M. Zampieri, “Findings of the 2021 conference on machine translation (WMT21),” in Proceedings of the Sixth Conference on Machine Translation, WMT@EMNLP 2021, Online Event, November 10-11, 2021, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, P. Koehn, T. Kocmi, A. Martins, M. Morishita, and C. Monz, Eds. Association for Computational Linguistics, 2021, pp. 1–88.
[276] T. Kocmi, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, M. Fishel, T. Gowda, Y. Graham, R. Grundkiewicz, B. Haddow, R. Knowles, P. Koehn, C. Monz, M. Morishita, M. Nagata, T. Nakazawa, M. Novák, M. Popel, and M. Popovic, “Findings of the 2022 conference on machine translation (WMT22),” in Proceedings of the Seventh Conference on Machine Translation, WMT 2022, Abu Dhabi, United Arab Emirates (Hybrid), December 7-8, 2022, P. Koehn, L. Barrault, O. Bojar, F. Bougares, R. Chatterjee, M. R. Costa-jussà, C. Federmann, M. Fishel, A. Fraser, M. Freitag, Y. Graham, R. Grundkiewicz, P. Guzman, B. Haddow, M. Huck, A. Jimeno-Yepes, T. Kocmi, A. Martins, M. Morishita, C. Monz, M. Nagata, T. Nakazawa, M. Negri, A. Névéol, M. Neves, M. Popel, M. Turchi, and M. Zampieri, Eds. Association for Computational Linguistics, 2022, pp. 1–45.
[277] N. Goyal, C. Gao, V. Chaudhary, P. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan, “The flores-101 evaluation benchmark for low-resource and multilingual machine translation,” Trans. Assoc. Comput. Linguistics, vol. 10, pp. 522–538, 2022.
[278] R. Bawden, E. Bilinski, T. Lavergne, and S. Rosset, “Diabla: a corpus of bilingual spontaneous written dialogues for machine translation,” Lang. Resour. Evaluation, vol. 55, no. 3, pp. 635–660, 2021.
[279] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence rnns and beyond,” in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, Y. Goldberg and S. Riezler, Eds. ACL, 2016, pp. 280–290.
[280] S. Narayan, S. B. Cohen, and M. Lapata, “Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization,” in EMNLP. Association for Computational Linguistics, 2018, pp. 1797–1807.
[281] F. Ladhak, E. Durmus, C. Cardie, and K. Mckeown, “Wikilingua: A new benchmark dataset for cross-lingual abstractive summarization,” in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4034–4048.
[282] S. Moon, P. Shah, A. Kumar, and R. Subba, “Opendialkg: Explainable conversational reasoning with attention-based walks over knowledge graphs,” in ACL (1). Association for Computational Linguistics, 2019, pp. 845–854.
[283] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language understanding systems,” in NeurIPS, 2019, pp. 3261–3275.
[284] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt, “Measuring massive multitask language understanding,” in ICLR. OpenReview.net, 2021.
[285] M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei, “Challenging big-bench tasks and whether chain-of-thought can solve them,” CoRR, vol. abs/2210.09261, 2022.
[286] L. Xu, H. Hu, X. Zhang, L. Li, C. Cao, Y. Li, Y. Xu, K. Sun, D. Yu, C. Yu, Y. Tian, Q. Dong, W. Liu, B. Shi, Y. Cui, J. Li, J. Zeng, R. Wang, W. Xie, Y. Li, Y. Patterson, Z. Tian, Y. Zhang, H. Zhou, S. Liu, Z. Zhao, Q. Zhao, C. Yue, X. Zhang, Z. Yang, K. Richardson, and Z. Lan, “CLUE: A chinese language understanding evaluation benchmark,” in COLING. International Committee on Computational Linguistics, 2020, pp. 4762–4772.
[287] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt, “Measuring coding challenge competence with APPS,” in NeurIPS Datasets and Benchmarks, 2021.
[288] Y. Lai, C. Li, Y. Wang, T. Zhang, R. Zhong, L. Zettlemoyer, S. W. Yih, D. Fried, S. I. Wang, and T. Yu, “DS-1000: A natural and reliable benchmark for data science code generation,” CoRR, vol. abs/2211.11501, 2022.
[289] Z. Wang, S. Zhou, D. Fried, and G. Neubig, “Execution-based evaluation for open-domain code generation,” CoRR, vol. abs/2212.10481, 2022.
[290] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov, “Natural questions: a benchmark for question answering research,” Trans. Assoc. Comput. Linguistics, pp. 452–466, 2019.
[291] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the AI2 reasoning challenge,” CoRR, vol. abs/1803.05457, 2018.
[292] S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 3214–3252.
[293] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on freebase from question-answer pairs,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 2013, pp. 1533–1544.
[294] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, 2017, pp. 1601–1611.
[295] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi, “PIQA: reasoning about physical commonsense in natural language,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 7432–7439.
[296] M. Dubey, D. Banerjee, A. Abdelkawi, and J. Lehmann, “Lc-quad 2.0: A large dataset for complex question answering over wikidata and dbpedia,” in The Semantic Web - ISWC 2019 - 18th International Semantic Web Conference, Auckland, New Zealand, October 26-30, 2019, Proceedings, Part II, 2019, pp. 69–78.
[297] Y. Gu, S. Kase, M. Vanni, B. M. Sadler, P. Liang, X. Yan, and Y. Su, “Beyond I.I.D.: three levels of generalization for question answering on knowledge bases,” in WWW ’21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, 2021, pp. 3477–3488.
[298] S. Cao, J. Shi, L. Pan, L. Nie, Y. Xiang, L. Hou, J. Li, B. He, and H. Zhang, “KQA pro: A dataset with explicit compositional programs for complex question answering over knowledge base,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, 2022, pp. 6101–6119.
[299] X. Hu, X. Wu, Y. Shu, and Y. Qu, “Logical form generation via multi-task learning for complex question answering over knowledge bases,” in Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, 2022, pp. 1687–1696.
[300] S. Longpre, Y. Lu, and J. Daiber, “MKQA: A linguistically diverse benchmark for multilingual open domain question answering,” Trans. Assoc. Comput. Linguistics, vol. 9, pp. 1389–1406, 2021.
[301] T. Saikh, T. Ghosal, A. Mittal, A. Ekbal, and P. Bhattacharyya, “Scienceqa: a novel resource for question answering on scholarly articles,” Int. J. Digit. Libr., vol. 23, no. 3, pp. 289–301, 2022.
[302] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? A new dataset for open book question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 2381–2391.
[303] T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng, “MS MARCO: A human generated machine reading comprehension dataset,” in Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, 2016.
[304] T. Khot, P. Clark, M. Guerquin, P. Jansen, and A. Sabharwal, “QASC: A dataset for question answering via sentence composition,” in The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020, 2020, pp. 8082–8090.
[305] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 2383–2392.
[306] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston, “Key-value memory networks for directly reading documents,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 1400–1409.
[307] B. Goodrich, V. Rao, P. J. Liu, and M. Saleh, “Assessing the factual accuracy of generated text,” in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019, 2019, pp. 166–175.
[308] K. Toutanova and D. Chen, “Observed versus latent features for knowledge base and text inference,” in Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, CVSC 2015, Beijing, China, July 26-31, 2015, 2015, pp. 57–66.
[309] K. D. Bollacker, C. Evans, P. K. Paritosh, T. Sturge, and J. Taylor, “Freebase: a collaboratively created graph database for structuring human knowledge,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, 2008, pp. 1247–1250.
[310] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel, “Convolutional 2d knowledge graph embeddings,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 1811–1818.
[311] G. A. Miller, “Wordnet: A lexical database for english,” Commun. ACM, pp. 39–41, 1995.
[312] F. Petroni, T. Rocktäschel, S. Riedel, P. S. H. Lewis, A. Bakhtin, Y. Wu, and A. H. Miller, “Language models as knowledge bases?” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 2463–2473.
[313] F. Mahdisoltani, J. Biega, and F. M. Suchanek, “YAGO3: A knowledge base from multilingual wikipedias,” in Seventh Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, CA, USA, January 4-7, 2015, Online Proceedings, 2015.
[314] F. M. Suchanek, G. Kasneci, and G. Weikum, “Yago: a core of semantic knowledge,” in Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, 2007, pp. 697–706.
[315] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2924–2936.
[316] M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi, “Socialiqa: Commonsense reasoning about social interactions,” CoRR, vol. abs/1904.09728, 2019.
[317] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” in Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez, Eds. Association for Computational Linguistics, 2019, pp. 4791–4800.
[318] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi, “Winogrande: An adversarial winograd schema challenge at scale,” in AAAI. AAAI Press, 2020, pp. 8732–8740.
[319] M. Roemmele, C. A. Bejan, and A. S. Gordon, “Choice of plausible alternatives: An evaluation of commonsense causal reasoning,” in Logical Formalizations of Commonsense Reasoning, Papers from the 2011 AAAI Spring Symposium, Technical Report SS-11-06, Stanford, California, USA, March 21-23, 2011. AAAI, 2011.
[320] K. Sakaguchi, C. Bhagavatula, R. L. Bras, N. Tandon, P. Clark, and Y. Choi, “proscript: Partially ordered scripts generation,” in Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 2138–2149.
[321] B. Dalvi, L. Huang, N. Tandon, W. Yih, and P. Clark, “Tracking state changes in procedural text: a challenge dataset and models for process paragraph comprehension,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), M. A. Walker, H. Ji, and A. Stent, Eds. Association for Computational Linguistics, 2018, pp. 1595–1604.
[322] S. Saha, P. Yadav, L. Bauer, and M. Bansal, “Explagraphs: An explanation graph generation task for structured commonsense reasoning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7716–7740.
[323] O. Tafjord, B. Dalvi, and P. Clark, “Proofwriter: Generating implications, proofs, and abductive statements over natural language,” in Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, ser. Findings of ACL, C. Zong, F. Xia, W. Li, and R. Navigli, Eds., vol. ACL/IJCNLP 2021. Association for Computational Linguistics, 2021, pp. 3621–3634.
[324] B. Dalvi, P. Jansen, O. Tafjord, Z. Xie, H. Smith, L. Pipatanangkura, and P. Clark, “Explaining answers with entailment trees,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, M. Moens, X. Huang, L. Specia, and S. W. Yih, Eds. Association for Computational Linguistics, 2021, pp. 7358–7370.
[325] A. Saparov and H. He, “Language models are greedy reasoners: A systematic formal analysis of chain-of-thought,” CoRR, vol. abs/2210.01240, 2022.
[326] C. Anil, Y. Wu, A. Andreassen, A. Lewkowycz, V. Misra, V. V. Ramasesh, A. Slone, G. Gur-Ari, E. Dyer, and B. Neyshabur, “Exploring length generalization in large language models,” CoRR, vol. abs/2207.04901, 2022.
[327] A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Rahane, A. S. Iyer, A. Andreassen, A. Santilli, A. Stuhlmüller, A. M. Dai, A. La, A. K. Lampinen, A. Zou, A. Jiang, A. Chen, A. Vuong, A. Gupta, A. Gottardi, A. Norelli, A. Venkatesh, A. Gholamidavoodi, A. Tabassum, A. Menezes, A. Kirubarajan, A. Mullokandov, A. Sabharwal, A. Herrick, A. Efrat, A. Erdem, A. Karakas, and et al., “Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,” CoRR, vol. abs/2206.04615, 2022.
[328] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” CoRR, vol. abs/2211.10435, 2022.
[329] S. Roy and D. Roth, “Solving general arithmetic word problems,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, and Y. Marton, Eds. The Association for Computational Linguistics, 2015, pp. 1743–1752.
[330] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi, “Mathqa: Towards interpretable math word problem solving with operation-based formalisms,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds. Association for Computational Linguistics, 2019, pp. 2357–2367.
[331] W. Ling, D. Yogatama, C. Dyer, and P. Blunsom, “Program induction by rationale generation: Learning to solve and explain algebraic word problems,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, R. Barzilay and M. Kan, Eds. Association for Computational Linguistics, 2017, pp. 158–167.
[332] R. Koncel-Kedziorski, S. Roy, A. Amini, N. Kushman, and H. Hajishirzi, “Mawps: A math word problem repository,” in Proceedings of the 2016 conference of the north american chapter of the association for computational linguistics: human language technologies, 2016, pp. 1152–1157.
[333] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner, “DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 2368–2378.
[334] S. Welleck, J. Liu, R. L. Bras, H. Hajishirzi, Y. Choi, and K. Cho, “Naturalproofs: Mathematical theorem proving in natural language,” in Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1, NeurIPS Datasets and Benchmarks 2021, December 2021, virtual, J. Vanschoren and S. Yeung, Eds., 2021.
[335] A. Q. Jiang, W. Li, J. M. Han, and Y. Wu, “Lisa: Language models of isabelle proofs,” in 6th Conference on Artificial Intelligence and Theorem Proving, 2021, pp. 378–392.
[336] K. Zheng, J. M. Han, and S. Polu, “minif2f: a cross-system benchmark for formal olympiad-level mathematics,” in The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022.
[337] Z. Azerbayev, B. Piotrowski, H. Schoelkopf, E. W. Ayers, D. Radev, and J. Avigad, “Proofnet: Autoformalizing and formally proving undergraduate-level mathematics,” CoRR, vol. abs/2302.12433, 2023.
[338] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[339] A. M. Rush, S. Chopra, and J. Weston, “A neural attention model for abstractive sentence summarization,” in EMNLP. The Association for Computational Linguistics, 2015, pp. 379–389.
[340] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading wikipedia to answer open-domain questions,” in ACL (1). Association for Computational Linguistics, 2017, pp. 1870–1879.
[341] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. ACL, 2002, pp. 311–318.
[342] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out. Association for Computational Linguistics, Jul. 2004, pp. 74–81.
[343] K. Yang, Y. Tian, N. Peng, and D. Klein, “Re3: Generating longer stories with recursive reprompting and revision,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 4393–4479.
[344] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. Xu, and P. Fung, “A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity,” CoRR, vol. abs/2302.04023, 2023.
[345] S. Gulwani, O. Polozov, and R. Singh, “Program synthesis,” Found. Trends Program. Lang., vol. 4, no. 1-2, pp. 1–119, 2017.
[346] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan, “Planning with large language models for code generation,” 2023.
[347] M. Welsh, “The end of programming,” Commun. ACM, vol. 66, no. 1, pp. 34–35, 2023.
[348] B. Wang, X. Deng, and H. Sun, “Iteratively prompt pre-trained language models for chain of thought,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, Y. Goldberg, Z. Kozareva, and Y. Zhang, Eds. Association for Computational Linguistics, 2022, pp. 2714–2730.
[349] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis, “Measuring and narrowing the compositionality gap in language models,” CoRR, vol. abs/2210.03350, 2022.
[350] J. Ye, X. Chen, N. Xu, C. Zu, Z. Shao, S. Liu, Y. Cui, Z. Zhou, C. Gong, Y. Shen, J. Zhou, S. Chen, T. Gui, Q. Zhang, and X. Huang, “A comprehensive capability analysis of gpt-3 and gpt-3.5 series models,” arXiv preprint arXiv:2303.10420, 2023.
[351] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of learning and motivation, 1989, pp. 109–165.
[352] R. Kemker, M. McClure, A. Abitino, T. L. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 2018, pp. 3390–3398.
[353] A. Roberts, C. Raffel, and N. Shazeer, “How much knowledge can you pack into the parameters of a language model?” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020, pp. 5418–5426.
[354] G. Izacard, P. S. H. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave, “Few-shot learning with retrieval augmented language models,” CoRR, vol. abs/2208.03299, 2022.
[355] K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang, “Retrieval augmented language model pre-training,” in Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, 2020, pp. 3929–3938.
[356] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[357] Y. Lan, G. He, J. Jiang, J. Jiang, W. X. Zhao, and J. Wen, “Complex knowledge base question answering: A survey,” CoRR, vol. abs/2108.06688, 2021.
[358] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J. Lespiau, B. Damoc, A. Clark, D. de Las Casas, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, and L. Sifre, “Improving language models by retrieving from trillions of tokens,” in International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, ser. Proceedings of Machine Learning Research, K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvári, G. Niu, and S. Sabato, Eds., vol. 162. PMLR, 2022, pp. 2206–2240.
[359] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao, “Check your facts and try again: Improving large language models with external knowledge and automated feedback,” CoRR, vol. abs/2302.12813, 2023.
[360] S. Agarwal, I. Akkaya, V. Balcom, M. Bavarian, G. Bernadett-Shapiro, G. Brockman, M. Brundage, J. Chan, F. Chantzis, N. Deutsch, B. Eastman, A. Eleti, N. Felix, S. P. Fishman, I. Fulford, C. Gibson, J. Gross, M. Heaton, J. Hilton, X. Hu, S. Jain, H. Jin, L. Kilpatrick, C. Kim, M. Kolhede, A. Mayne, P. McMillan, D. Medina, J. Menick, A. Mishchenko, A. Nair, R. Nayak, A. Neelakantan, R. Nuttall, J. Parish, A. T. Passos, A. Perelman, F. de Avila Belbute Peres, V. Pong, J. Schulman, E. Sigler, N. Staudacher, N. Turley, J. Tworek, R. Greene, A. Vijayvergiya, C. Voss, J. Weng, M. Wiethoff, S. Yoo, K. Yu, W. Zaremba, S. Zhao, W. Zhuk, and B. Zoph, “Chatgpt plugins,” OpenAI Blog, March 2023.
[361] A. Lazaridou, E. Gribovskaya, W. Stokowiec, and N. Grigorev, “Internet-augmented language models through few-shot prompting for open-domain question answering,” CoRR, vol. abs/2203.05115, 2022.
[362] A. Madaan, N. Tandon, P. Clark, and Y. Yang, “Memory-assisted prompt editing to improve GPT-3 after deployment,” in EMNLP. Association for Computational Linguistics, 2022, pp. 2833–2861.
[363] D. Dai, L. Dong, Y. Hao, Z. Sui, B. Chang, and F. Wei, “Knowledge neurons in pretrained transformers,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, S. Muresan, P. Nakov, and A. Villavicencio, Eds. Association for Computational Linguistics, 2022, pp. 8493–8502.
[364] K. Meng, D. Bau, A. J. Andonian, and Y. Belinkov, “Locating and editing factual associations in gpt,” in Advances in Neural Information Processing Systems, 2022.
[365] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Synthetic prompting: Generating chain-of-thought demonstrations for large language models,” CoRR, vol. abs/2302.00618, 2023.
[366] N. Bian, X. Han, L. Sun, H. Lin, Y. Lu, and B. He, “ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models,” CoRR, 2023.
[367] Sifatkaur, M. Singh, V. S. B, and N. Malviya, “Mind meets machine: Unravelling gpt-4’s cognitive psychology,” CoRR, vol. abs/2303.11436, 2023.
[368] M. I. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, C. Sutton, and A. Odena, “Show your work: Scratchpads for intermediate computation with language models,” CoRR, vol. abs/2112.00114, 2021.
[369] J. Qian, H. Wang, Z. Li, S. Li, and X. Yan, “Limitations of language models in arithmetic and symbolic induction,” CoRR, vol. abs/2208.05051, 2022.
[370] W. X. Zhao, K. Zhou, Z. Gong, B. Zhang, Y. Zhou, J. Sha, Z. Chen, S. Wang, C. Liu, and J. Wen, “Jiuzhang: A chinese pre-trained language model for mathematical problem understanding,” in KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14-18, 2022, A. Zhang and H. Rangwala, Eds. ACM, 2022, pp. 4571–4581.
[371] Q. Wang, C. Kaliszyk, and J. Urban, “First experiments with neural translation of informal to formal mathematics,” in Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings, ser. Lecture Notes in Computer Science, F. Rabe, W. M. Farmer, G. O. Passmore, and A. Youssef, Eds., vol. 11006. Springer, 2018, pp. 255–270.
[372] S. Polu and I. Sutskever, “Generative language modeling for automated theorem proving,” CoRR, vol. abs/2009.03393, 2020.
[373] A. Q. Jiang, W. Li, S. Tworkowski, K. Czechowski, T. Odrzygózdz, P. Milos, Y. Wu, and M. Jamnik, “Thor: Wielding hammers to integrate language models and automated theorem provers,” CoRR, vol. abs/2205.10893, 2022.
[374] S. Polu, J. M. Han, K. Zheng, M. Baksys, I. Babuschkin, and I. Sutskever, “Formal mathematics statement curriculum learning,” CoRR, vol. abs/2202.01344, 2022.
[375] A. Q. Jiang, S. Welleck, J. P. Zhou, W. Li, J. Liu, M. Jamnik, T. Lacroix, Y. Wu, and G. Lample, “Draft, sketch, and prove: Guiding formal theorem provers with informal proofs,” CoRR, vol. abs/2210.12283, 2022.
[376] Q. Lyu, S. Havaldar, A. Stein, L. Zhang, D. Rao, E. Wong, M. Apidianaki, and C. Callison-Burch, “Faithful chain-of-thought reasoning,” CoRR, vol. abs/2301.13379, 2023.
[377] Y. Weng, M. Zhu, S. He, K. Liu, and J. Zhao, “Large language models are reasoners with self-verification,” CoRR, vol. abs/2212.09561, 2022.
[378] X. Pi, Q. Liu, B. Chen, M. Ziyadi, Z. Lin, Q. Fu, Y. Gao, J. Lou, and W. Chen, “Reasoning like program executors,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, 2022, pp. 761–779.
[379] A. Parisi, Y. Zhao, and N. Fiedel, “TALM: tool augmented language models,” CoRR, vol. abs/2205.12255, 2022.
[380] N. Nangia, C. Vania, R. Bhalerao, and S. R. Bowman, “Crows-pairs: A challenge dataset for measuring social biases in masked language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, 2020, pp. 1953–1967.
[381] R. Rudinger, J. Naradowsky, B. Leonard, and B. V. Durme, “Gender bias in coreference resolution,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), 2018, pp. 8–14.
[382] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in ICML, ser. Proceedings of Machine Learning Research, vol. 162. PMLR, 2022, pp. 9118–9147.
[383] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P. Oudeyer, “Grounding large language models in interactive environments with online reinforcement learning,” CoRR, vol. abs/2302.02662, 2023.
[384] X. Puig, K. Ra, M. Boben, J. Li, T. Wang, S. Fidler, and A. Torralba, “Virtualhome: Simulating household activities via programs,” in CVPR. Computer Vision Foundation / IEEE Computer Society, 2018, pp. 8494–8502.
[385] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, and D. Fox, “ALFRED: A benchmark for interpreting grounded instructions for everyday tasks,” in CVPR. Computer Vision Foundation / IEEE, 2020, pp. 10737–10746.
[386] S. Srivastava, C. Li, M. Lingelbach, R. Martín-Martín, F. Xia, K. E. Vainio, Z. Lian, C. Gokmen, S. Buch, C. K. Liu, S. Savarese, H. Gweon, J. Wu, and L. Fei-Fei, “BEHAVIOR: benchmark for everyday household activities in virtual, interactive, and ecological environments,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 164. PMLR, 2021, pp. 477–490.
[387] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. J. Ruano, K. Jeffrey, S. Jesmonth, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, and M. Yan, “Do as I can, not as I say: Grounding language in robotic affordances,” CoRR, vol. abs/2204.01691, 2022.
[388] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” CoRR, vol. abs/2209.07753, 2022.
[389] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, and A. Garg, “Progprompt: Generating situated robot task plans using large language models,” CoRR, vol. abs/2209.11302, 2022.
[390] J. H. Clark, J. Palomaki, V. Nikolaev, E. Choi, D. Garrette, M. Collins, and T. Kwiatkowski, “Tydi QA: A benchmark for information-seeking question answering in typologically diverse languages,” Trans. Assoc. Comput. Linguistics, vol. 8, pp. 454–470, 2020.
[391] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muennighoff, J. Phang, L. Reynolds, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou, “A framework for few-shot language model evaluation,” Sep. 2021.
[392] Q. Zhong, L. Ding, J. Liu, B. Du, and D. Tao, “Can chatgpt understand too? A comparative study on chatgpt and fine-tuned BERT,” CoRR, vol. abs/2302.10198, 2023.
[393] J. Kocon, I. Cichecki, O. Kaszyca, M. Kochanek, D. Szydlo, J. Baran, J. Bielaniewicz, M. Gruza, A. Janz, K. Kanclerz, A. Kocon, B. Koptyra, W. Mieleszczenko-Kowszewicz, P. Milkowski, M. Oleksy, M. Piasecki, L. Radlinski, K. Wojtasik, S. Wozniak, and P. Kazienko, “Chatgpt: Jack of all trades, master of none,” CoRR, vol. abs/2302.10724, 2023.
[394] C. Qin, A. Zhang, Z. Zhang, J. Chen, M. Yasunaga, and D. Yang, “Is chatgpt a general-purpose natural language processing task solver?” CoRR, vol. abs/2302.06476, 2023.
[395] Y. Ma, Y. Cao, Y. Hong, and A. Sun, “Large language model is not a good few-shot information extractor, but a good reranker for hard samples!” CoRR, vol. abs/2303.08559, 2023.
[396] X. Chen, J. Ye, C. Zu, N. Xu, R. Zheng, M. Peng, J. Zhou, T. Gui, Q. Zhang, and X. Huang, “How robust is gpt-3.5 to predecessors? a comprehensive study on language understanding tasks,” 2023.
[397] M. Jang and T. Lukasiewicz, “Consistency analysis of chatgpt,” CoRR, vol. abs/2303.06273, 2023.
[398] R. Tang, X. Han, X. Jiang, and X. Hu, “Does synthetic data generation of llms help clinical text mining?” arXiv preprint arXiv:2303.04360, 2023.
[399] O. Nov, N. Singh, and D. M. Mann, “Putting chatgpt’s medical advice to the (turing) test,” CoRR, vol. abs/2301.10035, 2023.
[400] S. Chen, B. H. Kann, M. B. Foote, H. J. Aerts, G. K. Savova, R. H. Mak, and D. S. Bitterman, “The utility of chatgpt for cancer treatment information,” medRxiv, 2023.
[401] L. Yunxiang, L. Zihan, Z. Kai, D. Ruilong, and Z. You, “Chatdoctor: A medical chat model fine-tuned on llama model using medical domain knowledge,” 2023.
[402] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. O. Sabel, J. Ricke, and M. Ingrisch, “Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports,” CoRR, vol. abs/2212.14882, 2022.
[403] H. Nori, N. King, S. M. McKinney, D. Carignan, and E. Horvitz, “Capabilities of gpt-4 on medical challenge problems,” CoRR, vol. abs/2303.13375, 2023.
[404] B. Guo, X. Zhang, Z. Wang, M. Jiang, J. Nie, Y. Ding, J. Yue, and Y. Wu, “How close is chatgpt to human experts? comparison corpus, evaluation, and detection,” CoRR, vol. abs/2301.07597, 2023.
[405] V. Liévin, C. E. Hother, and O. Winther, “Can large language models reason about medical questions?” CoRR, vol. abs/2207.08143, 2022.
[406] G. Kortemeyer, “Could an artificial-intelligence agent pass an introductory physics course?” arXiv preprint arXiv:2301.12127, 2023.
[407] S. Bordt and U. von Luxburg, “Chatgpt participates in a computer science exam,” CoRR, vol. abs/2303.09461, 2023.
[408] K. Malinka, M. Peresíni, A. Firc, O. Hujnak, and F. Janus, “On the educational impact of chatgpt: Is artificial intelligence ready to obtain a university degree?” CoRR, vol. abs/2303.11146, 2023.
[409] T. Susnjak, “Chatgpt: The end of online exam integrity?” CoRR, vol. abs/2212.09292, 2022.
[410] A. Blair-Stanek, N. Holzenberger, and B. V. Durme, “Can GPT-3 perform statutory reasoning?” CoRR, vol. abs/2302.06100, 2023.
[411] F. Yu, L. Quartey, and F. Schilder, “Legal prompting: Teaching a language model to think like a lawyer,” CoRR, vol. abs/2212.01326, 2022.
[412] D. Trautmann, A. Petrova, and F. Schilder, “Legal prompt engineering for multilingual legal judgement prediction,” CoRR, vol. abs/2212.02199, 2022.
[413] J. H. Choi, K. E. Hickman, A. Monahan, and D. Schwarcz, “Chatgpt goes to law school,” Available at SSRN, 2023.
[414] J. J. Nay, “Law informs code: A legal informatics approach to aligning artificial intelligence with humans,” CoRR, vol. abs/2209.13020, 2022.
[415] A. Tamkin, M. Brundage, J. Clark, and D. Ganguli, “Understanding the capabilities, limitations, and societal impact of large language models,” CoRR, vol. abs/2102.02503, 2021.
[416] Z. Sun, “A short survey of viewing large language models in legal aspect,” CoRR, vol. abs/2303.09136, 2023.
[417] A. Abid, M. Farooqi, and J. Zou, “Persistent anti-muslim bias in large language models,” in AIES ’21: AAAI/ACM Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, M. Fourcade, B. Kuipers, S. Lazar, and D. K. Mulligan, Eds. ACM, 2021, pp. 298–306.
[418] A. Borji, “A categorical archive of chatgpt failures,” CoRR, vol. abs/2302.03494, 2023.
[419] M. Kosinski, “Theory of mind may have spontaneously emerged in large language models,” CoRR, vol. abs/2302.02083, 2023.
[420] M. M. Amin, E. Cambria, and B. W. Schuller, “Will affective computing emerge from foundation models and general ai? A first evaluation on chatgpt,” CoRR, vol. abs/2303.03186, 2023.
[421] R. Aiyappa, J. An, H. Kwak, and Y.-Y. Ahn, “Can we trust the evaluation on chatgpt?” CoRR, vol. abs/2303.12767, 2023.
[422] H. Cho, H. J. Kim, J. Kim, S. Lee, S. Lee, K. M. Yoo, and T. Kim, “Prompt-augmented linear probing: Scaling beyond the limit of few-shot in-context learners,” CoRR, vol. abs/2212.10873, 2022.
[423] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” ACM Comput. Surv., vol. 55, no. 6, pp. 109:1–109:28, 2023.