A Comprehensive Overview of Large Language Models
Abstract—Large Language Models (LLMs) have recently demonstrated remarkable capabilities in natural language processing tasks and
Fig. 2: Chronological display of LLM releases: light blue rectangles represent ’pre-trained’ models, while dark rectangles
correspond to ’instruction-tuned’ models. Models on the upper half signify open-source availability, whereas those on the
bottom half are closed-source. The chart illustrates the increasing trend towards instruction-tuned models and open-source
models, highlighting the evolving landscape and trends in natural language processing research.
in the number of released LLMs, including open-source and closed-source models, over the years. Furthermore, Fig. 2 highlights the names of significant releases of various LLMs, and Fig. 3 provides a broader overview of LLMs.

During the early days of Large Language Models (LLMs), many research efforts focused on developing models for transfer learning to downstream tasks [11], [12], [15], until the emergence of models like GPT-3 [8], which demonstrated impressive performance even without fine-tuning. Due to the closed-source nature of GPT-3, there was a demand for open-source alternatives, leading to the development of various models [9], [10] operating at the scale of GPT-3 and trained on extensive web-based datasets [16], [17], [18], [19]. Subsequently, researchers proposed several architectural designs and training strategies that showed superior performance compared to GPT-3 across various tasks [15], [14], [20], [21].

The performance of LLMs improves further with instruction fine-tuning, outperforming pre-trained LLMs on various benchmarks [22], [23]. Instruction fine-tuning of LLMs refers to a training approach that incorporates additional prompts or instructions during the fine-tuning phase to guide the output, thus enabling users to have more fine-grained control over the outputs of LLMs. These prompts can be natural language instructions or example demonstrations, based on the task's requirements. In the literature, different datasets have been curated for instruction fine-tuning. These datasets include more instances and tasks that further improve the performance over baselines [24], [23], [25], [26]. When performing instruction fine-tuning, all the model parameters need to be updated. However, parameter-efficient fine-tuning takes a different approach by updating only a small number of parameters while still maintaining good performance. This method keeps the original model frozen and adds a few extra parameters at different locations within the model [27], [28], [29], [30], [31]. This approach helps achieve efficient fine-tuning while minimizing the impact on the model's overall performance.

Due to the success of LLMs on a wide variety of tasks, the research literature has recently experienced a large influx of LLM-related contributions. Naturally, the research community has started the effort of organizing this literature in survey articles. For instance, Zhou et al. [32] presented an overview of the foundation models. An impressive effort was recently made by Zhou et al. [33] in their survey that also discusses aspects related to model architectures, fine-tuning, emergent abilities, and more. Another recent survey on augmented language models provides a historical account of the foundation models [34]. In contrast to these surveys, our contribution focuses on providing a comprehensive yet concise overview of the general direction of LLM research. On one hand, this
Fig. 3: A broader overview of LLMs, dividing LLMs into five branches: 1. Training 2. Inference 3. Evaluation 4. Applications
5. Challenges
article summarizes more details of the individual models as compared to the existing efforts. On the other hand, it also covers more models in providing their summaries. It also delves into the details of model development, architectures, training datasets, and other related concepts to provide a self-contained, comprehensive overview of this direction. Hence, this article addresses an important gap by providing a concise yet comprehensive overview of the rapidly developing general direction of LLM research. Our key contributions are summarized as follows.
• We present the first survey on the developments in LLM research with the specific aim of providing a concise yet comprehensive overview of the direction. We present extensive summaries that include fine-grained details of the reviewed contributions.
• In this self-contained article, we cover a range of concepts to comprehend the general direction of LLMs, including background concepts, popular models, crucial discoveries, related datasets, evaluation details, etc.
• Besides paying special attention to the chronological order of LLMs throughout the article, we also summarize major findings of the popular contributions, and provide a detailed discussion on the key design and deployment aspects of LLMs to help practitioners effectively leverage this technology.

It is noteworthy that although this article is the first contribution in its own right in terms of providing a concise yet comprehensive overview of LLMs, our work complements the recent (and emerging) surveys of this direction, e.g., [33], [32]. Infrequently, we also loosely follow the existing terminologies to ensure a more standardized outlook on this research direction. For instance, following [33], our survey considers a language model to be large if it has 10B parameters or more. Hence, we discuss such models in detail in this survey. We refer readers interested in smaller models to [35], [36], [32].

The organization of this paper is as follows. Section II discusses the background of LLMs. Section III focuses on the LLMs overview, architectures, and training pipelines and strategies. Section IV presents the key findings derived from each LLM. Section V highlights the configuration and parameters that play a crucial role in the functioning of these models. The LLM training and evaluation benchmarks are discussed in Section VI, followed by concluding remarks and future directions.
positional encoding scheme which decays with the distance between the tokens.

E. Activation Functions
The activation functions serve a crucial role in the curve-fitting abilities of neural networks, as proved in [50]. The modern activation functions used in LLMs are different from the earlier squashing functions but are critical to the success of LLMs. We discuss these activation functions in this section.
1. ReLU [51]: The rectified linear unit (ReLU) is defined as

ReLU(x) = max(0, x).   (1)

2. GeLU [52]: The Gaussian Error Linear Unit (GeLU) is the combination of ReLU, dropout [53], and zoneout [54]. It is the most widely used activation function in contemporary LLM literature.
3. GLU variants [55]: The Gated Linear Unit [56] is a neural network layer that is an element-wise product (⊗) of a linear transformation and a sigmoid-transformed (σ) linear projection of the input, given as

GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c),   (2)

where x is the input of the layer and W, V, b, and c are learned parameters.
GLU was modified in [55] to evaluate the effect of different variations in the training and testing of transformers, resulting in better empirical results. The following GLU variations were introduced in [55] and are used in LLMs:

ReGLU(x, W, V, b, c) = max(0, xW + b) ⊗ (xV + c),
GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c),
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c).

F. Layer Normalization
Layer normalization leads to faster convergence and is a widely used component in transformers. In this section, we describe different normalization techniques widely used in the LLM literature.
1. LayerNorm: Layer norm computes statistics over all the hidden units in a layer l as follows:

u^l = (1/n) Σ_{i=1}^{n} a_i^l,    σ^l = sqrt( (1/n) Σ_{i=1}^{n} (a_i^l − u^l)^2 ),   (3)

where n is the number of neurons in layer l and a_i^l is the summed input of the i-th neuron in layer l. LayerNorm provides invariance to rescaling of the weights and re-centering of the distribution.
2. RMSNorm: [57] proposed that the invariance properties of LayerNorm are spurious, and that we can achieve the same performance benefits as we get from LayerNorm by using a computationally efficient normalization technique that trades off re-centering invariance for speed. LayerNorm gives the normalized summed input to layer l as follows:

ā_i^l = g_i^l (a_i^l − u^l) / σ^l,   (4)

where g_i^l is the gain parameter. RMSNorm [57] modifies a_i^l as

ā_i^l = g_i^l a_i^l / RMS(a^l),  where RMS(a^l) = sqrt( (1/n) Σ_{i=1}^{n} (a_i^l)^2 ).   (5)

3. Pre-Norm and Post-Norm: LLMs use the transformer [44] architecture with some variations. The original implementation [44] used layer normalization after the residual connection, commonly called post-LN, following the order Multihead attention – Residual – LN. There is another order of the normalization, referred to as pre-LN [58] due to the position of the normalization step before the self-attention layer, as in LN – Multihead attention – Residual. Pre-LN is known to provide more stability in training [59].
4. DeepNorm: While pre-LN has certain benefits over post-LN training, pre-LN training has an unwanted effect on the gradients [59]: the earlier layers have larger gradients than those at the bottom. DeepNorm [60] mitigates these adverse effects on the gradients. It is given as

x_{l_f} = LN(α x_{l_p} + G_{l_p}(x_{l_p}, θ_{l_p})),   (6)

where α is a constant and θ_{l_p} represents the parameters of layer l_p. These parameters are scaled by another constant β. Both of these constants depend only on the architecture.

G. Distributed LLM Training
This section briefly describes distributed LLM training approaches. More details are available in [9], [61], [62], [63].
1. Data Parallelism: Data parallelism replicates the model on multiple devices, where the data in a batch gets divided across devices. At the end of each training iteration, weights are synchronized across all devices.
2. Tensor Parallelism: Tensor parallelism shards a tensor computation across devices. It is also known as horizontal parallelism or intra-layer model parallelism.
3. Pipeline Parallelism: Pipeline parallelism shards model layers across different devices. This is also known as vertical parallelism.
4. Model Parallelism: A combination of tensor and pipeline parallelism is known as model parallelism.
5. 3D Parallelism: A combination of data, tensor, and model parallelism is known as 3D parallelism.
6. Optimizer Parallelism: Optimizer parallelism, also known as the zero redundancy optimizer [61], implements optimizer state partitioning, gradient partitioning, and parameter partitioning across devices to reduce memory consumption while keeping the communication costs as low as possible.

H. Libraries
Some commonly used libraries for LLM training are: 1) Transformers [64], 2) DeepSpeed [65], 3) Megatron-LM [62], 4) JAX [66], 5) Colossal-AI [67], 6) BMTrain [63], 7) FastMoE [68]; and frameworks are 1) MindSpore [69], 2) PyTorch [70], 3) TensorFlow [71], 4) MXNet [72].
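To make the normalization and activation formulas above concrete, the following is a minimal PyTorch-style sketch of RMSNorm (Eq. 5) and SwiGLU. It is illustrative code written for this overview (module and variable names are our own), not an implementation taken from any of the cited models.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RMSNorm(nn.Module):
        """RMSNorm (Eq. 5): rescale by the root-mean-square statistic only, with no re-centering."""
        def __init__(self, dim: int, eps: float = 1e-6):
            super().__init__()
            self.g = nn.Parameter(torch.ones(dim))  # gain g_i
            self.eps = eps

        def forward(self, a):
            rms = torch.sqrt(a.pow(2).mean(dim=-1, keepdim=True) + self.eps)
            return (a / rms) * self.g

    class SwiGLU(nn.Module):
        """SwiGLU: Swish(xW + b) ⊗ (xV + c), an element-wise gated feed-forward unit."""
        def __init__(self, d_in: int, d_hidden: int):
            super().__init__()
            self.W = nn.Linear(d_in, d_hidden)  # gate branch
            self.V = nn.Linear(d_in, d_hidden)  # value branch

        def forward(self, x):
            return F.silu(self.W(x)) * self.V(x)  # F.silu is Swish with beta = 1

    # Quick shape check.
    x = torch.randn(2, 8, 512)
    print(RMSNorm(512)(x).shape, SwiGLU(512, 2048)(x).shape)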
I. Data PreProcessing
This section briefly summarizes data preprocessing tech-
niques used in LLMs training.
1. Quality Filtering: For better results, training data quality
is essential. Some approaches to filtering data are: 1) classifier-
based and 2) heuristics-based. Classifier-based approaches
train a classifier on high-quality data and predict the quality of
text for filtering, whereas heuristics-based employ some rules
for filtering like language, metrics, statistics, and keywords.
2. Data Deduplication: Duplicated data can affect model performance and increase data memorization; therefore, data deduplication is one of the preprocessing steps to train LLMs. This can be performed at multiple levels, like sentences, documents, and datasets.
3. Privacy Reduction: Most of the training data for LLMs is collected through web sources. This data contains private information; therefore, many LLMs employ heuristics-based methods to filter information such as names, addresses, and phone numbers to avoid learning personal information.

Fig. 4: An example of attention patterns in language models; image is taken from [74].
Fig. 5: An example of language model training objectives; image from [74].

J. Architectures
Here we discuss the variants of the transformer architectures at a higher level, which arise due to differences in the application of the attention and the connection of transformer blocks. An illustration of the attention patterns of these architectures is shown in Figure 4.
1. Encoder Decoder: Transformers were originally designed as sequence transduction models and followed other prevalent model architectures for machine translation systems. They selected the encoder-decoder architecture to train on human language translation tasks. This architecture is adopted by [11], [15]. In this architectural scheme, an encoder encodes the input sequences to variable-length context vectors, which are then passed to the decoder to maximize a joint objective of minimizing the gap between predicted token labels and the actual target token labels.
2. Causal Decoder: The underlying objective of an LLM is to predict the next token based on the input sequence. While additional information from the encoder binds the prediction strongly to the context, it is found in practice that LLMs can perform well in the absence of an encoder [73], relying only on the decoder. Similar to the original encoder-decoder architecture's decoder block, this decoder restricts the flow of information backward, i.e., the predicted token t_k only depends on the tokens preceding it, up to t_{k-1}. This is the most widely used variant in state-of-the-art LLMs.
3. Prefix Decoder: The causal masked attention is reasonable in the encoder-decoder architectures, where the encoder can attend to all the tokens in the sentence from every position using self-attention. This means that the encoder can also attend to tokens t_{k+1} to t_n, in addition to the tokens from t_1 to t_{k-1}, while calculating the representation for t_k. But when we drop the encoder and only keep the decoder, we also lose this flexibility in attention. A variation in the decoder-only architectures is to change the mask from strictly causal to fully visible on a portion of the input sequence, as shown in Figure 4. The prefix decoder is also known as the non-causal decoder architecture.

K. Pre-Training Objectives
This section describes LLM pre-training objectives. For more details, see the paper [74].
1. Full Language Modeling: An autoregressive language modeling objective where the model is asked to predict future tokens given the previous tokens; an example is shown in Figure 5.
2. Prefix Language Modeling: A non-causal training objective, where a prefix is chosen randomly and only the remaining target tokens are used to calculate the loss. An example is shown in Figure 5.
3. Masked Language Modeling: In this training objective, tokens or spans (a sequence of tokens) are masked randomly and the model is asked to predict the masked tokens given the past and future context. An example is shown in Figure 5.
4. Unified Language Modeling: Unified language modeling [75] is a combination of causal, non-causal, and masked language training objectives. Here, in masked language modeling, the attention is not bidirectional but unidirectional, attending either the left-to-right or the right-to-left context.

L. Model Adaptation
This section discusses various model adaptation techniques, where a model is pre-trained on large data and then adapted for downstream tasks. An example of different training stages and inference in LLMs is shown in Figure 6.
1. Transfer Learning: Fine-tuning a pre-trained model with data for the downstream task is known as transfer learning. In this type of model adaptation, the model is initialized with pre-trained weights and updated according to the new data. Some of the LLMs employing this technique are [11], [12], [15], [20].
2. Parameter Efficient Learning: Parameter-efficient learning fine-tunes a few parameters, either by adding new parameters to the model or by updating a small subset of the existing ones.
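To illustrate the difference between the causal-decoder and prefix (non-causal) decoder attention patterns described in Sections J and K above, here is a small, self-contained sketch; the mask-building helpers are our own illustrative code rather than the implementation of any specific model.

    import torch

    def causal_mask(seq_len: int) -> torch.Tensor:
        # Position k may attend only to positions <= k (lower-triangular mask).
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    def prefix_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
        # Non-causal (prefix) decoder: fully visible attention over the prefix,
        # causal attention over the remaining target positions.
        mask = causal_mask(seq_len)
        mask[:, :prefix_len] = True  # every position can see the whole prefix
        return mask

    print(causal_mask(5).int())
    print(prefix_mask(5, prefix_len=2).int())

A True entry means the query position (row) is allowed to attend to the key position (column).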
Fig. 6: A basic flow diagram depicting various stages of LLMs from pre-training to prompting/utilization. Prompting LLMs
to generate responses is possible at different training stages like pre-training, instruction-tuning, or alignment tuning.
Prompt Tuning: [30], [76] adds trainable prompt token embeddings as prefixes or free-style to the input token embeddings. During fine-tuning, only these embedding parameters are trained for the downstream task while keeping the rest of the weights frozen.
Prefix Tuning: [31] adds task-specific trainable prefix vectors to the transformer layers, where only the prefix parameters are fine-tuned and the rest of the model stays frozen. The input sequence tokens can attend to the prefixes, which act as virtual tokens.
Adapter Tuning: An adapter module is an encoder-decoder architecture that is placed either sequentially or in parallel to the attention and feed-forward layers in the transformer block [77], [28], [29]. Only these layers are fine-tuned, and the rest of the model is kept frozen.
3. Instruction Finetuning: Instruction tuning is an approach to fine-tuning pre-trained models on instruction-formatted data. Instructions generally comprise multiple tasks in plain natural language, guiding the model to respond according to the prompt and the input. The training data consists of an instruction and an input-output pair. More details on formatting instruction data and its various styles are available in [33].
4. Alignment Tuning: Alignment techniques play a crucial role in ensuring large language models (LLMs) operate according to human intentions and values. These models can generate text and make decisions, making it vital to control their behavior and outputs to avoid undesirable outcomes. Alignment techniques aim to bridge the gap between what humans expect from LLMs and their actual behavior. A model is defined to be an "aligned" model if the model fulfills three criteria: helpful, honest, and harmless, or "HHH" [78].
To align a model with human values, researchers widely employ reinforcement learning with human feedback (RLHF) [79]. In RLHF, a model fine-tuned on demonstrations is further trained with reward modeling (RM) and reinforcement learning (RL), shown in Figure 6. Below we briefly discuss the RM and RL pipelines in RLHF.
4.1 Reward modeling: Reward modeling trains a model to rank generated responses according to human preferences using a classification objective. To train the classifier, humans annotate the responses based on the HHH criteria.
4.2 Reinforcement Learning: In this stage, the reward model trained previously ranks LLM-generated responses into preferred vs. dispreferred. The output of the reward model is used to train the LLM with proximal policy optimization (PPO). This process repeats iteratively until convergence.
5. Prompting/Utilization: Prompting is a method to query trained LLMs for generating responses, as illustrated in Figure 6. LLMs can be prompted in various prompt setups, where they can be adapted to the instructions without fine-tuning.
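As a concrete illustration of the parameter-efficient tuning techniques above, the sketch below shows a bottleneck adapter and trainable soft-prompt embeddings, with everything else kept frozen. This is illustrative code written for this overview (the names and sizes are our own choices), not the implementation of any of the cited methods.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Small trainable module inserted next to a frozen sub-layer (adapter tuning)."""
        def __init__(self, d_model: int, d_bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, d_bottleneck)
            self.up = nn.Linear(d_bottleneck, d_model)

        def forward(self, h):
            return h + self.up(torch.relu(self.down(h)))  # residual keeps the frozen path intact

    class SoftPrompt(nn.Module):
        """Prompt tuning: trainable prompt embeddings prepended to the input embeddings."""
        def __init__(self, n_prompt_tokens: int, d_model: int):
            super().__init__()
            self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

        def forward(self, input_embeds):  # input_embeds: (batch, seq, d_model)
            batch = input_embeds.size(0)
            return torch.cat([self.prompt.expand(batch, -1, -1), input_embeds], dim=1)

    # During fine-tuning, only the adapter/prompt parameters receive gradients, e.g.:
    # for p in base_model.parameters():
    #     p.requires_grad = False

    x = torch.randn(2, 10, 512)
    print(SoftPrompt(20, 512)(x).shape)     # torch.Size([2, 30, 512])
    print(BottleneckAdapter(512)(x).shape)  # torch.Size([2, 10, 512])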
prompts at various positions: front, middle, and back. CPM-2 also proposes INFMOE, a memory-efficient framework with a strategy to dynamically offload parameters to the CPU for inference at a 100B scale. It overlaps data movement with inference computation for lower inference time.
1.6 ERNIE 3.0 [92]: ERNIE 3.0 takes inspiration from multi-task learning to build a modular architecture using Transformer-XL [93] as the backbone. The universal representation module is shared by all the tasks, which serves as the basic block for the task-specific representation modules, which are all trained jointly for natural language understanding, natural language generation, and knowledge extraction. This LLM is primarily focused on the Chinese language, claims to train on the largest Chinese text corpora for LLM training, and achieved state-of-the-art results in 54 Chinese NLP tasks.
1.7 Jurassic-1 [94]: A pair of auto-regressive language models, including a 7B-parameter J1-Large model and a 178B-parameter J1-Jumbo model. The training vocabulary of Jurassic-1 comprises word pieces, complete words, and multi-word expressions without any word boundaries, where possible out-of-vocabulary instances are interpreted as Unicode bytes. Compared to the GPT-3 counterparts, the Jurassic-1 models apply a more balanced depth-to-width self-attention architecture [95] and an improved tokenizer for faster prediction based on broader resources, achieving comparable performance in zero-shot learning tasks and superior performance in few-shot learning tasks, given the ability to feed more examples as a prompt.
1.8 HyperCLOVA [96]: A Korean language model with the GPT-3 architecture.
1.9 Yuan 1.0 [97]: Trained on a Chinese corpus with 5TB of high-quality text collected from the Internet. A Massive Data Filtering System (MDFS) built on Spark is developed to process the raw data via coarse and fine filtering techniques. To speed up the training of Yuan 1.0, with the aim of saving energy expenses and carbon emissions, various factors that improve the performance of distributed training are incorporated in the architecture and training: increasing the hidden size improves pipeline and tensor parallelism performance, larger micro-batches improve pipeline parallelism performance, and a higher global batch size improves data parallelism performance. In practice, the Yuan 1.0 model performs well on text classification, Winograd Schema, natural language inference, and reading comprehension tasks.
1.10 Gopher [98]: The Gopher family of models ranges from 44M to 280B parameters in size, built to study the effect of scale on LLM performance. The 280B model beats GPT-3 [8], Jurassic-1 [94], MT-NLG [21], and others on 81% of the evaluated tasks.
1.11 ERNIE 3.0 TITAN [99]: ERNIE 3.0 Titan extends ERNIE 3.0 by training a larger model with 26x the number of parameters of the latter. This bigger model outperformed other state-of-the-art models in 68 NLP tasks. LLMs produce text with incorrect facts. In order to control the factual consistency of the generated text, ERNIE 3.0 Titan adds another task, Credible and Controllable Generations, to its multi-task learning setup. It introduces additional self-supervised adversarial and controllable language modeling losses to the pre-training step, which enables ERNIE 3.0 Titan to beat other LLMs in their manually selected Factual QA task set evaluations.
1.12 GPT-NeoX-20B [100]: An auto-regressive model that largely follows GPT-3, with a few deviations in architecture design, trained on the Pile dataset without any data deduplication. GPT-NeoX has parallel attention and feed-forward layers in a transformer block, given in Eq. 8, which increases throughput by 15%. It uses rotary positional embedding [48], applying it to only 25% of the embedding vector dimension, as in [101]. This reduces the computation without performance degradation. Opposite to GPT-3, which uses dense and sparse layers, GPT-NeoX-20B uses only dense layers. Hyperparameter tuning at this scale is difficult; therefore, the model chooses hyperparameters from the method [8] and interpolates values between the 13B and 175B models for the 20B model. The model training is distributed among GPUs using both tensor and pipeline parallelism.

x + Attn(LN_1(x)) + FF(LN_2(x))   (8)

1.13 OPT [10]: It is a clone of GPT-3, developed with the intention to open-source a model that replicates GPT-3 performance. Training of OPT employs dynamic loss scaling [102] and restarts from an earlier checkpoint with a lower learning rate whenever loss divergence is observed. Overall, the performance of the OPT-175B model is comparable to the GPT-3 175B model.
1.14 BLOOM [9]: A causal decoder model trained on the ROOTS corpus with the aim of open-sourcing an LLM. The architecture of BLOOM is shown in Figure 9, with differences like the ALiBi positional embedding and an additional normalization layer after the embedding layer, as suggested by the bitsandbytes¹ library. These changes stabilize training with improved downstream performance.
1.15 GLaM [103]: Generalist Language Model (GLaM) represents a family of language models using a sparsely activated decoder-only mixture-of-experts (MoE) structure [104], [105]. To gain more model capacity while reducing computation, the experts are sparsely activated, where only the best two experts are used to process each input token. The largest GLaM model, GLaM (64B/64E), is about 7× larger than GPT-3 [8], while only a part of the parameters is activated per input token. The largest GLaM (64B/64E) model achieves better overall results as compared to GPT-3 while consuming only one-third of GPT-3's training energy.
1.16 MT-NLG [21]: A 530B causal decoder based on the GPT-2 architecture, with roughly 3× the parameters of GPT-3. MT-NLG is trained on filtered high-quality data collected from various public datasets and blends various types of datasets in a single batch, which beats GPT-3 on a number of evaluations.
1.17 Chinchilla [106]: A causal decoder trained on the same dataset as Gopher [98] but with a slightly different data sampling distribution (sampled from MassiveText). The model architecture is similar to the one used for Gopher, with the exception of the AdamW optimizer instead of Adam.

¹ https://github.com/TimDettmers/bitsandbytes
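The parallel attention and feed-forward formulation of Eq. 8 can be contrasted with the standard sequential block in a few lines. The sketch below is a simplified illustration written for this overview (causal masking and other details omitted), not GPT-NeoX-20B's actual implementation.

    import torch
    import torch.nn as nn

    class ParallelBlock(nn.Module):
        """Eq. 8: attention and feed-forward branches computed from the same input x, then summed."""
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                    nn.Linear(4 * d_model, d_model))

        def forward(self, x):
            a = self.ln1(x)
            attn_out, _ = self.attn(a, a, a, need_weights=False)
            # A sequential (cascaded) block would instead feed x + attn_out into the FF path.
            return x + attn_out + self.ff(self.ln2(x))

    print(ParallelBlock(512, 8)(torch.randn(2, 10, 512)).shape)  # torch.Size([2, 10, 512])

Because the two branches read the same input, their matrix multiplications can be fused or run concurrently, which is the source of the reported throughput gain.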
TABLE I: Noteworthy findings and insights from pre-trained Large Language Models.

Models | Findings & Insights

T5
• Encoder and decoder with shared parameters perform equivalently to when parameters are not shared
• Fine-tuning model layers (adapter layers) works better than the conventional way of training on only classification layers

GPT-3
• Few-shot performance of LLMs is better than zero-shot, suggesting that LLMs are meta-learners

mT5
• Large multi-lingual models perform equivalently to single-language models on downstream tasks. However, smaller multi-lingual models perform worse

Gopher
• Relative encodings enable models to be evaluated for longer sequences than those on which they were trained

ERNIE 3.0 Titan
• This LLM builds on top of ERNIE 3.0 and adds a self-supervised adversarial loss to distinguish whether a text is generated or original
• This ability to distinguish between real and generated text improves the LLM's performance as compared to ERNIE 3.0

GPT-NeoX-20B
• Parallel attention + FF layers speed up training by 15% with the same performance as with cascaded layers
• Initializing feed-forward output layers before residuals with the scheme in [131] avoids activations from growing with increasing depth and width
• Training on Pile outperforms GPT-3 on five-shot

OPT
• Restart training from an earlier checkpoint with a lower learning rate if loss diverges
• Model is prone to generating repetitive text and getting stuck in a loop

BLOOM
• None

Galactica
• Galactica's performance has continued to improve across validation set, in-domain, and out-of-domain benchmarks, even with multiple repetitions of the corpus, which is superior to existing research on LLMs
• A working memory token approach can achieve strong performance over existing methods on mathematical MMLU and MATH benchmarks. It sets a new state-of-the-art on several downstream tasks such as PubMedQA (77.6%) and MedMCQA dev (52.9%)

GLaM
• The feed-forward component of each Transformer layer can be replaced with a mixture-of-experts (MoE) module consisting of a set of independent feed-forward networks (i.e., the 'experts'). By sparsely activating these experts, the model capacity can be maintained while much computation is saved
• By leveraging sparsity, we can make significant strides toward developing high-quality NLP models while simultaneously reducing energy consumption. Consequently, MoE emerges as a robust candidate for future scaling endeavors
• The model trained on filtered data shows consistently better performances on both NLG and NLU tasks, where the effect of filtering is more significant on the former tasks
• Filtered pretraining corpora play a crucial role in the generation capability of LLMs, especially for the downstream tasks
• The scaling of GLaM MoE models can be achieved by increasing the size or number of experts in the MoE layer. Given a fixed budget of computation, more experts contribute to better predictions

LaMDA
• The model can be fine-tuned to learn to call different external information resources and tools

MT-NLG
• None

AlphaCode
• For higher effectiveness and efficiency, a transformer model can be asymmetrically constructed with a shallower encoder and a deeper decoder
• To achieve better performances, it is necessary to employ strategies such as massively scaling up sampling, followed by the filtering and clustering of samples into a compact set
• The utilization of novel sampling-efficient transformer architectures designed to facilitate large-scale sampling is crucial
• Simplifying problem descriptions can effectively improve the model's performance

GLM-130B
• Pre-training data with a small proportion of multi-task instruction data improves the overall model performance

CodeGen
• Multi-step prompting for code synthesis leads to a better user intent understanding and code generation

LLaMA
• LLaMA is open-source and can be fine-tuned or continually pre-trained to develop new models or instruction-based tools
• A few optimizations are proposed to improve the training efficiency of LLaMA, such as an efficient implementation of multi-head self-attention and a reduced amount of activations during back-propagation
• Training exclusively on public data can also achieve state-of-the-art performance
• A constant performance improvement is gained when scaling the model
• Smaller models can also realize good performance using more training data and time

PanGu-Σ
• Sparse models provide the benefits of large models at a lower computation cost
• Randomly Routed Experts reduce catastrophic forgetting effects, which in turn is essential for continual learning
• Randomly Routed Experts allow extracting a domain-specific sub-model in deployment, which is cost-efficient while maintaining a performance similar to the original

BloombergGPT
• Pre-training with general-purpose and task-specific data improves task performance without hurting other model capabilities

XuanYuan 2.0
• Combining pre-training and fine-tuning stages in single training avoids catastrophic forgetting

CodeT5+
• Causal LM is crucial for a model's generation capability in encoder-decoder architectures
• Multiple training objectives like span corruption, Causal LM, matching, etc., complement each other for better performance

StarCoder
• HHH prompt by Anthropic allows the model to follow instructions without fine-tuning

LLaMA-2
• Model trained on unfiltered data is more toxic but may perform better on downstream tasks after fine-tuning
• Model trained on unfiltered data requires fewer samples for safety alignment

PaLM-2
• Data quality is important to train better models
• Model and data size should be scaled with 1:1 proportions
• Smaller models trained for larger iterations outperform larger models
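Several of the findings above (GLaM, PanGu-Σ) concern sparsely activated mixture-of-experts layers in which only the top-scoring experts process each token. The toy sketch below illustrates top-2 routing; it is written for this overview and is not the routing code of any of these models.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Top2MoE(nn.Module):
        """Toy mixture-of-experts feed-forward layer with top-2 gating."""
        def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):  # x: (n_tokens, d_model)
            scores = F.softmax(self.gate(x), dim=-1)
            top_w, top_idx = scores.topk(2, dim=-1)          # best two experts per token
            top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize the two gate weights
            out = torch.zeros_like(x)
            for slot in range(2):                            # only the selected experts run per token
                for e, expert in enumerate(self.experts):
                    hit = top_idx[:, slot] == e
                    if hit.any():
                        out[hit] += top_w[hit, slot, None] * expert(x[hit])
            return out

    print(Top2MoE(512, 2048)(torch.randn(16, 512)).shape)  # torch.Size([16, 512])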
We review various fine-tuned LLMs and strategies for effective fine-tuning in this section.

1. Instruction-Tuning with Manually Created Datasets: Numerous hand-crafted instruction-tuning datasets with different design choices have been proposed in the literature to instruction-tune LLMs. The performance of fine-tuned LLMs depends on multiple factors, such as the dataset, instruction diversity, prompting templates, model size, and training objectives. Keeping this in view, diverse fine-tuned models have emerged in the literature using manually created datasets.
The models T0 [22] and mT0 (multi-lingual) [134] employ templates to convert existing datasets into prompt datasets. They have shown improvements in generalization to zero-shot and held-out tasks. Tk-Instruct [26] fine-tuned the T5 model with in-context instructions to study generalization on unseen tasks when given in-context instructions during test time. The model outperformed Instruct-GPT despite being smaller in size, i.e., 11B parameters as compared to the 175B of GPT-3.
Increasing Tasks and Prompt Setups: Zero-shot and few-shot performance improves significantly by expanding the task collection and prompt styles. OPT-IML [24] and Flan [25] curated larger datasets of 2k and 1.8k tasks, respectively. While increasing task size alone is not enough, OPT-IML and Flan add more prompting setups in their datasets: zero-shot, few-shot, and CoT. In continuation, CoT Collection [80] fine-tunes Flan-T5 further on 1.88M CoT samples. Another method [81] uses symbolic tasks along with the tasks in T0, Flan, etc.

2. Instruction-Tuning with LLM-Generated Datasets: Generating an instruction-tuning dataset requires carefully writing instructions and input-output pairs, which are often written by humans, smaller in size, and less diverse. To overcome this, self-instruct [138] proposed an approach to prompt available LLMs to generate instruction-tuning datasets. Self-instruct outperformed models trained on the manually created dataset SUPER-NATURALINSTRUCTIONS (a dataset with 1600+ tasks) [26] by 33%. It starts with a seed of 175 tasks, 1 instruction, and 1 sample per task and iteratively generates
TABLE II: Key insights and findings from the study of instruction-tuned Large Language Models.

Models | Findings & Insights

T0
• Multi-task prompting enables zero-shot generalization and outperforms baselines
• Even a single prompt per dataset task is enough to improve performance

WebGPT
• The answer quality of LLMs can be further improved with human feedback
• To aid the model in effectively filtering and utilizing relevant information, human labelers play a crucial role in answering questions regarding the usefulness of the retrieved documents
• Interacting a fine-tuned language model with a text-based web-browsing environment can improve end-to-end retrieval and synthesis via imitation learning and reinforcement learning
• Generating answers with references can make labelers easily judge the factual accuracy of answers

Tk-INSTRUCT
• Instruction tuning leads to a stronger generalization to unseen tasks
• More tasks improve generalization, whereas only increasing task instances does not help
• Supervised trained models are better than generalized models
• Models pre-trained with instructions and examples perform well for different types of inputs

mT0 and BLOOMZ
• Instruction tuning enables zero-shot generalization to tasks never seen before
• Multi-lingual training leads to even better zero-shot generalization for both English and non-English
• Training on machine-translated prompts improves performance for held-out tasks with non-English prompts
• English-only fine-tuning on a multilingual pre-trained language model is enough to generalize to other pre-trained language tasks

OPT-IML
• Task size sampling to create a batch with most of the task examples is important for better performance
• Only example-proportional sampling is not enough; training datasets/benchmarks should also be proportional for better generalization/performance
• Fully held-out and partially supervised tasks' performance improves by scaling tasks or categories, whereas fully supervised tasks see no effect
• Including small amounts, i.e., 5% of pretraining data, during fine-tuning is effective
• Only 1% reasoning data improves the performance; adding more deteriorates performance
• Adding dialogue data makes the performance worse
• Finetuning with CoT improves performance on held-out tasks

Flan
• Fine-tuning along with CoT data improves reasoning abilities
• CoT tuning improves zero-shot reasoning
• Performance improves with more tasks
• Instruction fine-tuning improves usability, which otherwise is challenging for pre-trained models
• Improving the model's performance with instruction tuning is compute-efficient
• Multitask prompting enables zero-shot generalization abilities in LLMs

Sparrow
• The judgments of labelers and the alignments with defined rules can help the model generate better responses
• Good dialogue goals can be broken down into detailed natural language rules for the agent and the raters
• The combination of reinforcement learning (RL) with reranking yields optimal performance in terms of preference win rates and resilience against adversarial probing

WizardCoder
• Fine-tuning with instruction-tuning data re-written into a more complex set improves the performance significantly

LLaMA-2-Chat
• Model learns to write safe responses with fine-tuning on safe demonstrations, while an additional RLHF step further improves model safety and makes it less prone to jailbreak attacks
TABLE III: Summary of pre-trained LLMs. Only the LLMs discussed individually in the previous sections are summarized.
“Data/Tokens” is the model’s pre-training data which is either the number of tokens or data size. “Data Cleaning” indicates
whether the data cleaning is performed or not. This includes heuristics (Heur), deduplication (Dedup), quality filtering (QF),
and privacy filtering (PF). “Cost” is the calculated training cost, obtained by multiplying the GPU/TPU hourly rate by the
number of GPUs and the training time. The actual cost may vary due to many reasons such as using in-house GPUs or getting
a discounted rate, re-training, number of employees working on the problem, etc. “Training Parallelism” indicates distributed
training using data parallelism (D), tensor parallelism (T), pipeline parallelism (P), model parallelism (M), optimizer parallelism
(OP), and rematerialization (R). In the “Library” column, “DS” is short for DeepSpeed. In the column “Commercial
Use”, we assumed a model is for non-commercial purposes if its license is not available.
Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Steps Trained | Data/Tokens | Data Cleaning | No. of Processing Units | Processing Unit Type | Training Time | Calculated Train. Cost | Training Parallelism | Library
T5 [11] JMLR'20 Apache-2.0 Google General 11B ✓ 1M 1T Heur+Dedup 1024 TPU v3 - - D+M Mesh TensorFlow
GPT-3 [8] NeurIPS'20 - OpenAI General 175B × - 300B Dedup+QF - V100 - - M -
mT5 [12] NAACL'21 Apache-2.0 Google General 13B ✓ 1M 1T - - - - - - -
PanGu-α [90] arXiv'21 Apache-2.0 Huawei General 200B ✓ 260k 1.1TB Heur+Dedup 2048 Ascend 910 - - D+OP+P+O+R MindSpore
CPM-2 [13] AI Open'21 MIT Tsinghua General 198B ✓ 1M 2.6TB Dedup - - - - D+M JAXFormer
Codex [117] arXiv'21 - OpenAI Coding 12B × - 100B Heur - - - - - -
ERNIE 3.0 [92] arXiv'21 - Baidu General 10B × 120k∗ 375B Heur+Dedup 384 V100 - - M∗ PaddlePaddle
Jurassic-1 [94] White-Paper'21 Apache-2.0 AI21 General 178B ✓ - 300B - 800 GPU - - D+M+P Megatron+DS
HyperCLOVA [96] EMNLP'21 - Naver General 82B × - 300B Clf+Dedup+PF 1024 A100 321h 1.32 Mil M Megatron
Yuan 1.0 [97] arXiv'21 Apache-2.0 - General 245B ✓ 26k∗ 180B Heur+Clf+Dedup 2128 GPU - - D+T+P -
Gopher [98] arXiv'21 - Google General 280B × - 300B QF+Dedup 4096 TPU v3 920h 13.19 Mil D+M JAX+Haiku
ERNIE 3.0 Titan [99] arXiv'21 - Baidu General 260B × - 300B Heur+Dedup - Ascend 910 - - D+M+P+D* PaddlePaddle
GPT-NeoX-20B [132] BigScience'22 Apache-2.0 EleutherAI General 20B ✓ 150k 825GB None 96 40G A100 - - M Megatron+DS+PyTorch
OPT [10] arXiv'22 MIT Meta General 175B ✓ 150k 180B Dedup 992 80G A100 - - D+T Megatron
BLOOM [9] arXiv'22 RAIL-1.0 BigScience General 176B ✓ - 366B Dedup+PR 384 80G A100 2520h 3.87 Mil D+T+P Megatron+DS
Galactica [125] arXiv'22 Apache-2.0 Meta Science 120B × 225k 106B Dedup 128 80GB A100 - - - Metaseq
GLaM [103] ICML'22 - Google General 1.2T × 600k∗ 600B Clf 1024 TPU v4 - - M GSPMD
LaMDA [127] arXiv'22 - Google Dialog 137B × 3M 2.81T Filtered 1024 TPU v3 1384h 4.96 Mil D+M Lingvo
MT-NLG [21] arXiv'22 Apache-v2.0 MS.+Nvidia General 530B × - 270B - 4480 80G A100 - - D+T+P Megatron+DS
AlphaCode [118] Science'22 Apache-v2.0 Google Coding 41B ✓ 205k 967B Heur+Dedup - TPU v4 - - M JAX+Haiku
Chinchilla [106] arXiv'22 - Google General 70B × - 1.4T QF+Dedup - TPUv4 - - - JAX+Haiku
PaLM [14] arXiv'22 - Google General 540B × 255k 780B Heur 6144 TPU v4 - - D+M JAX+T5X
AlexaTM [107] arXiv'22 Apache v2.0 Amazon General 20B × 500k 1.1T Filtered 128 A100 2880h 1.47 Mil M DS
U-PaLM [20] arXiv'22 - Google General 540B × 20k - - 512 TPU v4 120h 0.25 Mil - -
UL2 [15] ICLR'23 Apache-2.0 Google General 20B ✓ 2M 1T - 512 TPU v4 - - M JAX+T5X
GLM [109] ICLR'23 Apache-2.0 Multiple General 130B × - 400B - 768 40G A100 1440h 3.37 Mil M -
CodeGen [116] ICLR'23 Apache-2.0 Salesforce Coding 16B ✓ 650k 577B Heur+Dedup - TPU v4 - - D+M JAXFormer
LLaMA [111] arXiv'23 - Meta General 65B × 350k 1.4T Clf+Heur+Dedup 2048 80G A100 504h 4.12 Mil D+M xFormers
PanGuΣ [115] arXiv'23 - Huawei General 1.085T × - 329B - 512 Ascend 910 2400h - D+OP+P+O+R MindSpore
BloombergGPT [128] arXiv'23 - Bloomberg Finance 50B × 139k 569B Dedup 512 40G A100 1272h 1.97 Mil M PyTorch
Xuan Yuan 2.0 [130] arXiv'23 RAIL-1.0 Du Xiaoman Finance 176B ✓ - 366B Filtered 80GB A100 - - P DS
CodeT5+ [122] arXiv'23 BSD-3 Salesforce Coding 16B ✓ 110k 51.5B Dedup 16 40G A100 - - - DS
StarCoder [124] arXiv'23 OpenRAIL-M BigCode Coding 15.5B ✓ 250k 1T Dedup+QF+PF 512 80G A100 624h 1.28 Mil D+T+P Megatron-LM
LLaMA-2 [112] arXiv'23 LLaMA-2.0 Meta General 70B ✓ 500k 2T Minimal Filtering - 80G A100 1.7Mh - - -
PaLM-2 [108] arXiv'23 - Google General - × - - Dedup+PF+QF - - - - - -
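As a worked example of the “Calculated Train. Cost” column, the figures in Table III follow from multiplying the number of accelerators by the training time and an hourly rate; the rate of roughly 4 USD per A100-hour used below is our own illustrative assumption inferred from the table rows, not a number stated in the paper.

    def training_cost(num_gpus: int, hours: float, usd_per_gpu_hour: float = 4.0) -> float:
        # Cost = number of GPUs/TPUs x training time x hourly rate.
        return num_gpus * hours * usd_per_gpu_hour

    print(training_cost(1024, 321))   # HyperCLOVA: ~1.31e6 USD (table: 1.32 Mil)
    print(training_cost(384, 2520))   # BLOOM:      ~3.87e6 USD (table: 3.87 Mil)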
TABLE IV: Summary of instruction-tuned LLMs. All abbreviations are the same as in Table III. Entries in “Data/Tokens” starting with “S-” represent the number of training samples.
Models | Publication Venue | License Type | Model Creators | Purpose | No. of Params | Commercial Use | Pre-trained Models | Steps Trained | Data/Tokens | No. of Processing Units | Processing Unit Type | Train. Time | Calculated Train. Cost | Train. Parallelism | Library
WebGPT [133] arXiv'21 - OpenAI General 175B × GPT-3 - - - - - - - -
T0 [22] ICLR'22 Apache-2.0 BigScience General 11B ✓ T5 - 250B 512 TPU v3 270h 0.48 Mil - -
Tk-Instruct [26] EMNLP'22 MIT AI2+ General 11B ✓ T5 1000 - 256 TPU v3 4h 0.0036 Mil - Google T5
OPT-IML [24] arXiv'22 - Meta General 175B × OPT 8k 2B 128 40G A100 - - D+T Megatron
Flan-U-PaLM [25] ICLR'22 Apache-2.0 Google General 540B ✓ U-PaLM 30k - 512 TPU v4 - - - JAX+T5X
mT0 [134] ACL'23 Apache-2.0 HuggingFace+ General 13B ✓ mT5 - - - - - - - -
Sparrow [135] arXiv'22 - Google Dialog 70B × Chinchilla - - 64 TPU v3 - - M -
WizardCoder [136] arXiv'23 Apache-2.0 HK Bapt. Coding 15B × StarCoder 200 S-78k - - - - - -
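The models in Table IV are tuned on instruction-formatted data, i.e., records pairing an instruction with an input and an output (see "Instruction Finetuning" under Model Adaptation above). A minimal sketch of such a formatting step is shown below; the template wording is a hypothetical example of ours, not a prompt taken from any of these datasets.

    def format_instruction_example(instruction: str, inp: str, output: str) -> dict:
        """Turn one (instruction, input, output) record into a prompt/target pair for fine-tuning."""
        prompt = f"Instruction: {instruction}\n"
        if inp:
            prompt += f"Input: {inp}\n"
        prompt += "Response:"
        return {"prompt": prompt, "target": " " + output}

    example = format_instruction_example(
        instruction="Classify the sentiment of the sentence as positive or negative.",
        inp="The movie was a complete waste of time.",
        output="negative",
    )
    print(example["prompt"] + example["target"])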
Bard, and others.

3. Aligning with Human Preferences: Incorporating human preferences into LLMs presents a significant advantage in mitigating undesirable behaviors and ensuring accurate outputs. The initial work on alignment, such as InstructGPT [137], aligns GPT-3 using a 3-step approach: instruction-tuning, reward modeling, and fine-tuning with reinforcement learning (RL). The GPT-3 supervised fine-tuned on demonstrations is queried to generate responses, which human labelers rank according to human values, and a reward model is trained on the ranked data. Lastly, GPT-3 is trained with proximal policy optimization (PPO) using rewards on the generated data from the reward model. LLaMA 2-Chat [112] improves alignment by dividing reward modeling into helpfulness and safety rewards and using rejection sampling in addition to PPO. The initial four versions of LLaMA 2-Chat are fine-tuned with rejection sampling and then with PPO on top of rejection sampling.
Aligning with Supported Evidence: This style of alignment allows the model to generate responses with proofs and facts, reduces hallucination, and assists humans more effectively, which increases trust in the model's output. Similar to the RLHF training style, a reward model is trained to rank generated responses containing web citations in answers to questions, which is later used to train the model, as in GopherCite [145], WebGPT [133], and Sparrow [135]. The ranking model in Sparrow [135] is divided into two branches, preference reward and rule reward, where human annotators adversarially probe the model to break a rule. These two rewards together rank a response to train with RL.
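To make the preference-learning step concrete, below is a minimal sketch of the pairwise reward-model loss used in the RLHF recipe above and, for comparison, the direct preference optimization (DPO) objective discussed under "Aligning Directly with SFT" below. Both are generic textbook formulations written for this overview, not the exact code of InstructGPT, LLaMA 2-Chat, or [146].

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Pairwise ranking loss: push the reward of the human-preferred response
        # above the reward of the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # DPO: optimize on preference pairs directly, using a frozen reference model instead of PPO.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -F.logsigmoid(beta * margin).mean()

    # Dummy usage with random scalars standing in for per-response rewards / sequence log-probs.
    r_c, r_r = torch.randn(8), torch.randn(8)
    print(reward_model_loss(r_c, r_r).item())
    lp_c, lp_r = torch.randn(8), torch.randn(8)
    print(dpo_loss(lp_c, lp_r, lp_c - 0.1, lp_r - 0.1).item())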
Aligning Directly with SFT: The PPO in the RLHF pipeline is complex, memory-intensive, and unstable, requiring multiple models: reward, value, policy, and reference models. Avoiding this sophisticated alignment pipeline is possible by incorporating minimal changes in the supervised fine-tuning (SFT) pipeline, as in [146], [147], [148], with better or comparable performance to PPO. Direct preference optimization (DPO) [146] trains a model directly on the human-preferred responses to maximize the likelihood of preferred against unpreferred responses, with a per-sample importance weight. Reward-ranked fine-tuning (RAFT) [147] fine-tunes the model on responses ranked by the reward model. Preference ranking optimization (PRO) [149] and RRHF [148] penalize the model to rank responses with human preferences and a supervised loss. On the other hand, chain-of-hindsight (CoH) [150] provides feedback to the model in language rather than reward, to learn good versus bad responses.
Aligning with Synthetic Feedback: Aligning LLMs with human feedback is slow and costly. The literature suggests a semi-automated process to align LLMs by prompting LLMs to generate helpful, honest, and ethical responses to the queries, and fine-tuning using the newly created dataset. Constitutional AI [151] replaces human feedback in RLHF with AI, calling it RL from AI feedback (RLAIF). AlpacaFarm [152] designs prompts to imitate human feedback using LLM APIs. Opposite to Constitutional AI, AlpacaFarm injects noise in the feedback to replicate human mistakes. Self-Align [153] prompts the LLM with ICL examples, instructing the LLM about what the response should contain to be considered useful and ethical. The same LLM is later fine-tuned with the new dataset.
Aligning with Prompts: LLMs can be steered with prompts to generate desirable responses without training [154], [155]. The self-correction prompting in [155] concatenates instructions and CoT with questions, guiding the model to answer its instruction-following strategy to ensure moral safety before the actual answer. This strategy is shown to reduce the harm in generated responses significantly.
Red-Teaming/Jailbreaking/Adversarial Attacks: LLMs exhibit harmful behaviors, hallucinations, leaking personal information, and other shortcomings through adversarial probing. The models are susceptible to generating harmful responses even though they are aligned for safety [156], [157]. Red-teaming is a common approach to address illicit outputs, where the LLMs are prompted to generate harmful outputs [157], [158]. The dataset collected through red-teaming is used to fine-tune models for safety. While red-teaming largely relies on human annotators, another work [159] red-teams LLMs to find prompts that lead to harmful outputs of other LLMs.
4. Continue Pre-Training: Although fine-tuning boosts a model's performance, it leads to catastrophic forgetting of previously learned information. Concatenating fine-tuning data with a few randomly selected pre-training samples in every iteration avoids network forgetting [160], [130]. This is also effective in adapting LLMs for cases where the fine-tuning data is small and the original capacity is to be maintained. Prompt-based continued pre-training (PCP) [161] trains the model with text and instructions related to tasks and then finally instruction-tunes the model for downstream tasks.
5. Sample Efficiency: While fine-tuning data is generally many-fold smaller than the pre-training data, it still has to be large enough for acceptable performance [25], [24], [26] and requires proportional computing resources. To study the effects on performance with less data, existing literature [162], [163] finds that models trained on less data can outperform models trained with more data. In [162], 25% of the total downstream data is found to be enough for state-of-the-art performance. Selecting a coreset-based 0.5% of the total instruction-tuning data improves the model performance by 2% in [163], as compared to the complete data tuning. Less is more for alignment (LIMA) [164] uses only 1000 carefully created demonstrations to fine-tune the model and has achieved comparable performance to GPT-4.

C. Robotics
LLMs have been rapidly adopted across various domains in the scientific community due to their multipurpose capabilities [33]. In robotics research, LLMs have very promising applications as well, such as enhancing human-robot interaction [165], [166], [167], [168], task planning [169], [170], [171], navigation [172], [173], and learning [174], [175]. They can enable robots to understand and generate natural language, aiding in instruction following, data annotation, and collaborative problem-solving. They can facilitate continuous learning by allowing robots to access and integrate information from a wide range of sources. This can help robots acquire new skills, adapt to changes, and refine their performance based on real-time data.
LLMs have also started assisting in simulating environments for testing and offer potential for innovative research in robotics, despite challenges like bias mitigation and integration complexity. The work in [176] focuses on personalizing robot household cleanup tasks. By combining language-based planning and perception with LLMs, such that users provide object placement examples, which the LLM summarizes to generate generalized preferences, they show that robots can generalize user preferences from a few examples. An embodied LLM is introduced in [177], which employs a Transformer-based language model where sensor inputs are embedded alongside language tokens, enabling joint processing to enhance decision-making in real-world scenarios. The model is trained end-to-end for various embodied tasks, achieving positive transfer from diverse training across language and vision domains. LLMs have also been explored as zero-shot human models for enhancing human-robot interaction.
The study in [165] demonstrates that LLMs, trained on vast text data, can serve as effective human models for certain HRI tasks, achieving predictive performance comparable to specialized machine-learning models. However, limitations were identified, such as sensitivity to prompts and difficulties with spatial/numerical reasoning. In another study [178], the authors enable LLMs to reason over sources of natural language feedback, forming an "inner monologue" that enhances
their ability to process and plan actions in robotic control language modality and additional modalities, the learnable
scenarios. They combine LLMs with various forms of textual interface is introduced to connect different modalities from
feedback, allowing the LLMs to incorporate conclusions into frozen pre-trained models. Particularly, the learnable interface
their decision-making process for improving the execution of is expected to work in a parameter-efficient tuning manner:
user instructions in different domains, including simulated and e.g., LLaMA-Adapter [200] applies an efficient transformer-
real-world robotic tasks involving tabletop rearrangement and based adapter module for training, and LaVIN [199] dynam-
mobile manipulation. All of these studies employ LLMs as the ically learns the multimodal feature weights using a mixture-
core mechanism for assimilating everyday intuitive knowledge of-modality adapter. Different from the learnable interface, the
into the functionality of robotic systems. expert models can directly convert multimodalities into lan-
guage: e.g., VideoChat-Text [182] incorporates Whisper [201],
D. Multimodal LLMs a speech recognition expert model, to generate the captions of
Inspired by the success of LLMs in natural language pro- given videos for the understanding of following LLMs.
cessing applications, an increasing number of research works Prompting Different from the fine-tuning technique that
are now facilitating LLMs to perceive different modalities directly updates the model parameters given task-specific
of information like image [179], [180], [181], video [182], datasets, the prompting technique provides certain context,
[183], [184], audio [185], [184], [186], etc. Multimodal LLMs examples, or instructions to the model, fulfilling specialized
(MLLMs) present substantial benefits compared to standard tasks without changing the model parameters. Since prompting
LLMs that process only text. By incorporating information can significantly reduce the needs of large-scale multimodal
from various modalities, MLLMs can achieve a deeper understanding of context, leading to more intelligent responses infused with a variety of expressions. Importantly, MLLMs align closely with human perceptual experiences, leveraging the synergistic nature of our multisensory inputs to form a comprehensive understanding of the world [186], [177]. Coupled with a user-friendly interface, MLLMs can offer intuitive, flexible, and adaptable interactions, allowing users to engage with intelligent assistants through a spectrum of input methods. According to the ways of constructing models, current MLLMs can generally be divided into three streams: pre-training, fine-tuning, and prompting. In this section, we discuss these main streams in more detail, as well as the important application of MLLMs in visual reasoning.
Pre-training This stream of MLLMs intends to support different modalities using unified end-to-end models. For instance, Flamingo [179] applies gated cross-attention to fuse the vision and language modalities, which come from a frozen pre-trained visual encoder and a frozen pre-trained LLM, respectively. Moreover, BLIP-2 [180] proposes a two-stage strategy to pre-train a Querying Transformer (Q-Former) for the alignment between the vision and language modalities: in the first stage, vision-language representation learning is bootstrapped from a frozen visual encoder; and in the second stage, a frozen LLM bootstraps vision-to-language generative learning for zero-shot image-to-text generation. Similarly, MiniGPT-4 [187] also deploys a pre-trained and frozen ViT [188], Q-Former, and Vicuna LLM [189], while only a linear projection layer needs to be trained for vision-language alignment.
Fine-tuning Derived from instruction tuning [25] for NLP tasks [137], [25], [24], researchers are now fine-tuning pre-trained LLMs using multimodal instructions. Following this method, LLMs can be easily and effectively extended as multimodal chatbots [187], [181], [190] and multimodal task solvers [191], [192], [193]. The key issue of this stream of MLLMs is to collect multimodal instruction-following data for fine-tuning [194]. To address this issue, the solutions of benchmark adaptation [191], [195], [196], self-instruction [138], [197], [198], and hybrid composition [199], [193] are employed, respectively. To mitigate the gap between the original
Prompting Because prompting typically requires little or no task-specific training data, this technique is widely used to construct MLLMs. Particularly, to solve multimodal Chain of Thought (CoT) problems [85], LLMs are prompted to generate both the reasoning process and the answer given multimodal inputs [202]. On this front, different learning paradigms are exploited in practice: for example, Multimodal-CoT [202] involves two stages of rationale generation and answer inference, where the input of the second stage is a combination of the original input and the output of the first stage; and CoT-PT [203] applies both prompt tuning and a specific visual bias to generate a chain of reasoning implicitly. In addition to CoT problems, LLMs can also be prompted with multimodal descriptions and tools, effectively dividing complex tasks into sub-tasks [204], [205].
Visual Reasoning Application Recent visual reasoning systems [206], [207], [208], [209] tend to apply LLMs for better visual information analysis and visual-language integration. Different from previous works [210], [211] that rely on limited VQA datasets and small-scale neural networks, current LLM-aided methods offer the benefits of stronger generalization ability, emergent ability, and interactivity [194]. To realize visual reasoning with the help of LLMs, the prompting and fine-tuning techniques can also be utilized: for example, PointCLIP V2 [207] applies LLMs to generate 3D-specific prompts, which are encoded as textual features and then combined with visual features for 3D recognition; and GPT4Tools [197] employs LoRA [212] to fine-tune LLMs following tool-related instructions. Serving as a controller [209], decision maker [213], or semantics refiner [206], [214], LLMs significantly facilitate the progress of visual reasoning research.
IV. FINDINGS & INSIGHTS
Training a billion-scale model is far more difficult than training a smaller one. LLMs are prone to various issues during training, such as hardware failures and training instabilities. Beyond this, LLMs exhibit distinctive behaviors such as emergent abilities and improved zero-shot, few-shot, and reasoning abilities. Researchers report these essential details in their papers to enable reproduction of results and progress in the field. We identify critical information in Tables I and II such as architecture, training
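Returning to the MLLM construction streams above, the following is a minimal, illustrative sketch of the projection-based alignment recipe used by MiniGPT-4-style systems: a frozen vision encoder produces patch features, a single trainable linear layer maps them into the LLM embedding space, and the projected visual tokens are prepended to the text embeddings of a frozen LLM. The VisionToLLMProjector name, the dimensions, and the frozen_llm(inputs_embeds=..., labels=...) interface are assumptions for illustration rather than the exact implementation of any cited system.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Trainable linear bridge between a frozen vision encoder and a frozen LLM.

    Sketch of the projection-based alignment used by MiniGPT-4-style MLLMs:
    only this layer receives gradients, while the vision encoder and LLM stay frozen.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen ViT / Q-Former
        return self.proj(patch_features)            # (batch, num_patches, llm_dim)


def multimodal_step(projector, frozen_vision, frozen_llm, images, text_embeds, labels):
    """One illustrative training step: projected visual tokens prefix the text embeddings."""
    with torch.no_grad():                           # vision encoder is frozen
        patches = frozen_vision(images)             # (B, P, vision_dim)
    visual_tokens = projector(patches)              # only trainable component
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)
    # Assumed interface: the frozen LLM consumes input embeddings and returns the LM loss.
    loss = frozen_llm(inputs_embeds=inputs, labels=labels)
    return loss
```

The same frozen-backbone setup can be combined with LoRA-style adapters or multimodal instruction data, as in the fine-tuning stream discussed above.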
TABLE V: Architecture details of LLMs. Here, “PE” is the positional embedding, “nL” is the number of layers, “nH” is the
number of attention heads, “HS” is the size of hidden states.
Models Type Training Objective Attention Vocab Tokenizer Norm PE Activation Bias nL nH HS
T5 (11B) Enc-Dec Span Corruption Standard 32k SentencePiece Pre-RMS Relative ReLU × 24 128 1024
GPT3 (175B) Causal-Dec Next Token Dense+Sparse - - Layer Learned GeLU ✓ 96 96 12288
mT5 (13B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - - - -
PanGu-α (200B) Causal-Dec Next Token Standard 40k BPE Layer - - - 64 128 16384
CPM-2 (198B) Enc-Dec Span Corruption Standard 250k SentencePiece Pre-RMS Relative ReLU - 24 64 -
Codex (12B) Causal-Dec Next Token Standard - BPE+ Pre-Layer Learned GeLU - 96 96 12288
ERNIE 3.0 (10B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 64 4096
Jurassic-1 (178B) Causal-Dec Next Token Standard 256k SentencePiece∗ Pre-Layer Learned GeLU ✓ 76 96 13824
HyperCLOVA (82B) Causal-Dec Next Token Dense+Sparse - BPE* Pre-Layer Learned GeLU - 64 80 10240
Yuan 1.0 (245B) Causal-Dec Next Token Standard - - - - - - 76 - 16384
Gopher (280B) Causal-Dec Next Token Standard 32k SentencePiece Pre-RMS Relative GeLU ✓ 80 128 16384
ERNIE 3.0 Titan (260B) Causal-Dec Next Token Standard - WordPiece Post-Layer Relative GeLU - 48 192 12288
GPT-NeoX-20B Causal-Dec Next Token Parallel 50k BPE Layer Rotary GeLU ✓ 44 64 -
OPT (175B) Causal-Dec Next Token Standard - BPE - - ReLU ✓ 96 96 -
BLOOM (176B) Causal-Dec Next Token Standard 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
Galactica (120B) Causal-Dec Next Token Standard 50k BPE+custom Layer Learned GeLU × 96 80 10240
GLaM (1.2T) MoE-Dec Next Token Standard 256k SentencePiece Layer Relative GeLU ✓ 64 128 32768
LaMDA (137B) Causal-Dec Next Token Standard 32k BPE Layer Relative GeGLU - 64 128 8192
MT-NLG (530B) Causal-Dec Next Token Standard 50k BPE Pre-Layer Learned GeLU ✓ 105 128 20480
AlphaCode (41B) Enc-Dec Next Token Multi-query 8k SentencePiece - - - - 64 128 6144
Chinchilla (70B) Causal-Dec Next Token Standard 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 80 64 8192
PaLM (540B) Causal-Dec Next Token Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
AlexaTM (20B) Enc-Dec Denoising Standard 150k SentencePiece Pre-Layer Learned GeLU ✓ 78 32 4096
Sparrow (70B) Causal-Dec Pref.&Rule RM - 32k SentencePiece-NFKC Pre-RMS Relative GeLU ✓ 16∗ 64 8192
U-PaLM (540B) Non-Causal-Dec MoD Parallel+Multi-query 256k SentencePiece Layer RoPE SwiGLU × 118 48 18432
UL2 (20B) Enc-Dec MoD Standard 32k SentencePiece - - - - 64 16 4096
GLM (130B) Non-Causal-Dec AR Blank Infilling Standard 130k SentencePiece Deep RoPE GeGLU ✓ 70 96 12288
CodeGen (16B) Causal-Dec Next Token Parallel - BPE Layer RoPE - - 34 24 -
LLaMA (65B) Causal-Dec Next Token Standard 32k BPE Pre-RMS RoPE SwiGLU - 80 64 8192
PanGu-Σ (1085B) Causal-Dec Next Token Standard - BPE Fused Layer - FastGeLU - 40 40 5120
BloombergGPT (50B) Causal-Dec Next Token Standard 131k Unigram Layer ALiBi GeLU ✓ 70 40 7680
Xuan Yuan 2.0 (176B) Causal-Dec Next Token Self 250k BPE Layer ALiBi GeLU ✓ 70 112 14336
CodeT5+ (16B) Enc-Dec SC+NT+Cont.+Match Standard - Code-Specific - - - - - - -
StarCoder (15.5B) Causal-Dec FIM Multi-query 49k BPE - Learned - - 40 48 6144
LLaMA-2 (70B) Causal-Dec Next Token Grouped-query 32k BPE Pre-RMS RoPE SwiGLU - - - -
PaLM-2 - MoD Parallel - - - - - - - - -
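As a rough sanity check on the nL and HS columns in Table V, the parameter count of a standard decoder-only transformer can be approximated from the layer count and hidden size alone. The sketch below uses the common rule of thumb of about 12·HS² parameters per layer (Q/K/V/output projections plus a 4x-expanded feed-forward block) plus the embedding matrix; the default vocabulary size, the 4x expansion factor, and the omission of biases, norms, and positional parameters are simplifying assumptions, so the result is only an order-of-magnitude estimate.

```python
def approx_decoder_params(n_layers: int, hidden_size: int,
                          vocab_size: int = 50_000, ffn_mult: int = 4) -> float:
    """Rule-of-thumb parameter count (in billions) for a decoder-only transformer.

    Per layer: 4*HS^2 for the Q/K/V/output projections plus 2*ffn_mult*HS^2 for the
    feed-forward block; embeddings add vocab_size*HS. Biases, norms, and positional
    parameters are ignored, so this is only an order-of-magnitude estimate.
    """
    per_layer = 4 * hidden_size ** 2 + 2 * ffn_mult * hidden_size ** 2
    total = n_layers * per_layer + vocab_size * hidden_size
    return total / 1e9


# Example values taken from Table V.
print(round(approx_decoder_params(96, 12288), 1))          # GPT-3: ~174.6 (reported 175B)
print(round(approx_decoder_params(80, 8192, 32_000), 1))   # LLaMA:  ~64.7 (reported 65B)
```

For LLaMA the estimate still lands close to the reported size even though its SwiGLU feed-forward block uses a different expansion factor, since the total feed-forward parameter budget is comparable.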
TABLE VI: Summary of optimization settings used for pre-trained LLMs. The values for weight decay, gradient clipping, and
dropout are 0.1, 1.0, and 0.1, respectively, for most of the LLMs.
Models Batch Size Sequence Length LR Warmup LR Decay Optimizer (AdaFactor / Adam / AdamW) Precision (FP16 / BF16 / Mixed) Weight Decay Grad Clip Dropout
T5 (11B) 211 512 0.01 × inverse square root ✓ - - - ✓
GPT3 (175B) 32K - 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
mT5 (13B) 1024 1024 0.01 - inverse square root ✓ - - - ✓
PanGu-α (200B) - 1024 2e-5 - - - - ✓ - -
CPM-2 (198B) 1024 1024 0.001 - - ✓ - - - ✓
Codex (12B) - - 6e-5 ✓ cosine ✓ ✓ ✓ - -
ERNIE 3.0 (12B) 6144 512 1e-4 ✓ linear ✓ - ✓ - -
Jurassic-1 (178B) 3.2M 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
HyperCLOVA (82B) 1024 - 6e-5 - cosine ✓ - ✓ - -
Yuan 1.0 (245B) <10M 2048 1.6e-4 ✓ cosine decay to 10% ✓ - ✓ - -
Gopher (280B) 3M 2048 4e-5 ✓ cosine decay to 10% ✓ ✓ - ✓ -
ERNIE 3.0 Titan (260B) - 512 1e-4 ✓ linear ✓ ✓ ✓ ✓ -
GPT-NeoX-20B 1538 2048 0.97e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
OPT (175B) 2M 2048 1.2e-4 - linear ✓ ✓ ✓ ✓ ✓
BLOOM (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Galactica (120B) 2M 2048 7e-6 ✓ linear decay to 10% ✓ - ✓ ✓ ✓
GLaM (1.2T) 1M 1024 0.01 - inverse square root ✓ FP32 + ✓ - ✓ ×
LaMDA (137B) 256K - - - - - - - - - - - - -
MT-NLG (530B) 1920 2048 5e-5 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
AlphaCode (41B) 2048 1536+768 1e-4 ✓ cosine decay to 10% ✓ ✓ ✓ ✓ -
Chinchilla (70B) 1.5M 2048 1e-4 ✓ cosine decay to 10% ✓ ✓ - - -
PaLM (540B) 2048 2048 0.01 - inverse square root ✓ - ✓ ✓ ×
AlexaTM (20B) 2M 1024 1e-4 - linear decay to 5% ✓ ✓ ✓ - ✓
Sparrow (70B) RM: 8+16, RL:16 - 2e-6 ✓ cosine decay to 10% ✓ ✓ ✓ - ✓ ×
U-PaLM (540B) 32 2048 1e-4 - cosine ✓ - - - -
UL2 (20B) 1024 1024 - - inverse square root - - - - - - ×
GLM (130B) 4224 2048 8e-5 ✓ cosine ✓ ✓ ✓ ✓ ✓
CodeGen (16B) 2M 2048 5e-5 ✓ cosine ✓ - ✓ ✓ -
LLaMA (65B) 4M Tokens 2048 1.5e-4 ✓ cosine decay to 10% ✓ - ✓ ✓ -
PanGu-Σ (1.085T) 512 1024 2e-5 ✓ - ✓ ✓ - - -
BloombergGPT (50B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ ×
Xuan Yuan 2.0 (176B) 2048 2048 6e-5 ✓ cosine ✓ ✓ ✓ ✓ -
CodeT5+ (16B) 2048 1024 2e-4 - linear ✓ ✓ ✓ - -
StarCoder (15.5B) 512 8k 3e-4 ✓ cosine ✓ ✓ ✓ - -
LLaMA-2 (70B) 4M Tokens 4k 1.5e-4 ✓ cosine ✓ ✓ ✓ ✓ -
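Several of the schedules listed above, e.g., "cosine decay to 10%", pair a linear warmup with a cosine decay that bottoms out at a fraction of the peak learning rate. The sketch below shows one way such a schedule can be written; the peak learning rate, warmup length, and total step count are illustrative placeholders rather than values reported for any particular model.

```python
import math

def lr_at_step(step: int, peak_lr: float = 6e-5, warmup_steps: int = 2_000,
               total_steps: int = 300_000, min_ratio: float = 0.1) -> float:
    """Linear warmup followed by cosine decay to `min_ratio` of the peak LR.

    Mirrors the "cosine decay to 10%" entries in the table; the step counts here
    are illustrative placeholders, not values used by any specific model.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)          # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))       # decays from 1 to 0
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```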
TABLE VII: Summary of optimization settings used for instruction-tuned LLMs. Values for gradient clipping and dropout are
the same as the pre-trained models, while no model is using weight decay for instruction tuning.
Models Batch Size Sequence Length LR Warmup LR Decay Optimizer (AdaFactor / Adam) Grad Clip Dropout
WebGPT (175B) BC:512, RM:32 - 6e-5 - - ✓ - -
T0 (11B) 1024 1280 1e-3 - - ✓ - ✓
Tk-Instruct (11B) 1024 - 1e-5 - constant - - - -
OPT-IML (175B) 128 2048 5e-5 × linear ✓ ✓ ✓
Flan-U-PaLM (540B) 32 - 1e-3 - constant ✓ - ✓
WizardCoder (15B) 512 2048 2e-5 ✓ cosine - - - -
Fig. 12: Distribution of benchmark datasets available for different natural language processing tasks. We include only the tasks
for which at least 20 datasets have already been proposed.
TABLE VIII: Performance comparison of top performing LLMs across various NLU and NLG tasks. Here, ‘N-Shots’ indicate
the number of example prompts provided to the model during the evaluation, representing its capability in few-shot or zero-shot
learning settings, and ‘B’ represents the benchmark.
Task | Dataset/Benchmark | Model | Model Size | N-Shots | Score
Multi-Task | BIG-bench (B) | Chinchilla | 70B | 5-shot | 65.1
Multi-Task | BIG-bench (B) | Gopher | 280B | 5-shot | 53.97
Multi-Task | BIG-bench (B) | PaLM | 540B | 5-shot | 53.7
Multi-Task | MMLU (B) | PaLM | 540B | 5-shot | 69.3
Multi-Task | MMLU (B) | Chinchilla | 70B | 5-shot | 67.6
Multi-Task | MMLU (B) | LLaMA | 65B | 5-shot | 63.4
Language Understanding | SuperGLUE (B) | ERNIE 3.0 | 12B | - | 90.6
Language Understanding | SuperGLUE (B) | T5 | 11B | - | 88.9
Language Understanding | SuperGLUE (B) | GPT3 | 175B | 32-shot | 71.8
Story Comprehension and Generation | HellaSwag | LLaMA | 65B | zero-shot | 84.2
Story Comprehension and Generation | HellaSwag | PaLM | 540B | zero-shot | 83.6
Story Comprehension and Generation | HellaSwag | Chinchilla | 70B | zero-shot | 80.8
Story Comprehension and Generation | StoryCloze | GPT3 | 175B | few-shot | 87.7
Story Comprehension and Generation | StoryCloze | OPT | 175B | - | 79.82
Physical Knowledge and World Understanding | PIQA | Chinchilla | 70B | zero-shot | 85.0
Physical Knowledge and World Understanding | PIQA | LLaMA | 65B | zero-shot | 82.8
Physical Knowledge and World Understanding | PIQA | MT-NLG | 530B | zero-shot | 81.8
Physical Knowledge and World Understanding | TriviaQA | PaLM | 540B | one-shot | 81.4
Physical Knowledge and World Understanding | TriviaQA | GLaM | 62B | one-shot | 75.8
Physical Knowledge and World Understanding | TriviaQA | LLaMA | 65B | 64-shot | 73.0
Physical Knowledge and World Understanding | OpenBookQA | AlexaTM | 20B | - | 94.4
Physical Knowledge and World Understanding | OpenBookQA | OPT | 175B | few-shot | 65.4
Physical Knowledge and World Understanding | OpenBookQA | GPT-NeoX-20B | 20B | one-shot | 44.2
Contextual Language Understanding | LAMBADA | PaLM | 540B | few-shot | 89.7
Contextual Language Understanding | LAMBADA | GPT3 | 175B | few-shot | 86.4
Contextual Language Understanding | LAMBADA | GLM | 130B | - | 80.2
Commonsense Reasoning | WinoGrande | PaLM | 540B | zero-shot | 81.1
Commonsense Reasoning | WinoGrande | LLaMA | 65B | zero-shot | 77.0
Commonsense Reasoning | WinoGrande | Chinchilla | 70B | zero-shot | 74.9
Commonsense Reasoning | SIQA | LLaMA | 65B | zero-shot | 52.3
Commonsense Reasoning | SIQA | Chinchilla | 70B | zero-shot | 51.3
Commonsense Reasoning | SIQA | Gopher | 280B | zero-shot | 50.6
Reading Comprehension | BoolQ | LLaMA | 65B | zero-shot | 85.3
Reading Comprehension | BoolQ | Chinchilla | 70B | zero-shot | 83.7
Truthfulness | Truthful-QA | LLaMA | 65B | - | 57
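The "N-Shots" column above reflects how many solved examples are placed in the prompt ahead of the test question during evaluation. A minimal sketch of how such a k-shot prompt is typically assembled is given below; the Question/Answer template is an assumption, and real benchmarks and papers differ in their exact prompt formats.

```python
def build_k_shot_prompt(demonstrations, question, k=5):
    """Concatenate k solved examples before the test question (in-context learning).

    `demonstrations` is a list of (question, answer) pairs; only the first k are used.
    The template here is illustrative; exact formats vary across benchmarks.
    """
    parts = []
    for q, a in demonstrations[:k]:
        parts.append(f"Question: {q}\nAnswer: {a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)
```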
10. RACE-High [225]: A subset of the RACE [225] dataset, RACE-High consists of high school-level English exam questions. It is designed to evaluate the comprehension ability of models in a more academic and challenging context.
11. RACE-Middle [225]: Another subset of the RACE [225] dataset, RACE-Middle contains middle school-level English exam questions. It offers a slightly less challenging but academically oriented evaluation of a model's comprehension skills.
12. Truthful-QA [226]: A unique benchmark that measures a language model's truthfulness when generating answers. The dataset includes questions across various categories like health, law, and politics, some of which are designed to test the model against common human misconceptions.
13. ANLI [227]: A large-scale dataset designed to test the robustness of machine learning models in Natural Language Inference (NLI). It is created through an iterative, adversarial process where humans try to generate examples that models cannot correctly classify.
14. ARC-Challenge [228]: A rigorous question-answering dataset, ARC-Challenge includes complex, grade-school-level questions that demand reasoning beyond simple retrieval, testing the true comprehension capabilities of models.
15. XNLI [229]: A cross-lingual benchmark, XNLI extends the MultiNLI [230] corpus to 15 languages, including low-resource ones like Urdu. It tests models on cross-lingual sentence understanding, with 112,500 annotated pairs across three categories: entailment, contradiction, and neutral.
16. PAWS-X [231]: PAWS-X, or Cross-lingual Paraphrase Adversaries from Word Scrambling, is a multilingual version of the PAWS [232] dataset for paraphrase identification. It includes examples in seven languages and is designed to evaluate the performance of cross-lingual paraphrase identification models.
17. ARC [228]: A larger version of the ARC-Challenge, this dataset contains both easy and challenging grade-school-level, multiple-choice science questions. It is a comprehensive test of a model's ability to understand and answer complex questions.
18. ARC-Easy [228]: A subset of the ARC dataset, ARC-Easy contains questions that are answered correctly by either a retrieval-based algorithm or a word co-occurrence algorithm. It is a good starting point for models beginning to explore advanced question-answering.
19. CoQA [233]: A conversational question-answering dataset, CoQA challenges models with questions that rely on conversation history and require free-form text answers. Its diverse content from seven domains makes it a rigorous test of a model's ability to handle a wide range of topics and conversational contexts.
20. DROP [234]: DROP, or Discrete Reasoning Over the content of Paragraphs, is designed to test a model's ability to understand a wide variety of reading phenomena. It encourages comprehensive and reliable evaluation of reading comprehension capabilities.
21. RTE [235]: The Recognizing Textual Entailment (RTE) datasets come from a series of annual competitions on textual entailment, predicting whether a given sentence logically follows from another and evaluating a model's understanding of logical relationships in a text.
22. BIG-bench [236]: BIG-bench (Beyond the Imitation Game Benchmark) is a large-scale benchmark designed to test the abilities of LLMs across a wide range of tasks, including reasoning, creativity, ethics, and understanding of specific domains.
23. SQUADv2 [237]: The Stanford Question Answering Dataset (SQuAD) [238] is a collection of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. SQuADv2 combines the original SQuAD1.1 dataset with over 50,000 unanswerable questions. The aim is to evaluate a model's ability to understand and answer questions based on a given context and to determine when a question is unanswerable.
24. GSM8K [239]: A dataset of diverse grade school math word problems, testing a model's ability to perform multi-step mathematical reasoning.
25. WiC [240]: This dataset assesses a model's ability to discern word meanings based on context, aiding in tasks related to Word Sense Disambiguation.
26. Math23k [241]: This dataset challenges a model's ability to understand and solve mathematical word problems. It contains 23,000 Chinese arithmetic word problems that require models to perform reasoning and computation based on the problem description.
27. LCQMC [242]: The Large-scale Chinese Question Matching Corpus (LCQMC) is a dataset for evaluating the performance of models in semantic matching tasks. It contains pairs of questions in Chinese and their matching status, making it a valuable resource for research in Chinese language understanding.
28. MATH [243]: This dataset is a platform for evaluating the mathematical problem-solving abilities of AI models. It contains a diverse set of math problems, ranging from arithmetic to calculus, and is designed to test the model's ability to understand and solve complex mathematical problems.
29. ETHOS [244]: ETHOS is a hate speech detection dataset built from YouTube and Reddit comments. It is a tool in the fight against online hate speech, offering binary and multi-label variants for robust content moderation.
30. StereoSet [245]: StereoSet is a comprehensive dataset designed to measure and evaluate the presence of stereotypical biases in language models. It focuses on four key domains: gender, profession, race, and religion. By contrasting stereotypical bias against language modeling ability, it provides a valuable tool for understanding and mitigating biases in large language models.
31. HumanEval [246]: A benchmark of hand-written programming problems with accompanying unit tests, HumanEval evaluates the functional correctness of model-generated code, making it a standard tool for assessing the code-generation ability of AI models.
32. WebQA [247]: A dataset for open-domain question answering, WebQA offers a large collection of web-based question-answer pairs. It is designed to assess the ability of
AI models to understand and answer questions based on web content.
33. CMRC2018 [248]: This dataset is a test of Chinese language models' ability to reason comprehensively and is designed with a challenging span-extraction format that pushes the boundaries of machine performance.
34. Wikitext103 [249]: With over 100 million tokens from Wikipedia's top articles, this dataset is a rich resource for tasks that require understanding long-term dependencies, such as language modeling and translation.
35. PG19 [250]: This is a digital library of diverse books from Project Gutenberg. It is specifically designed to facilitate research in unsupervised learning and language modeling, with a special focus on long-form content.
36. C4 [11]: A clean, multilingual dataset, C4 offers billions of tokens from web-crawled data. It is a comprehensive resource for training advanced Transformer models on various languages.
37. QuAC [251]: This dataset simulates an information-seeking dialog between students and teachers using hidden Wikipedia text. It introduces unique challenges not found in machine comprehension datasets, making it a valuable resource for advancing dialog systems.
38. COPA [252]: This dataset evaluates a model's progress in open-domain commonsense causal reasoning. Each question comprises a premise and two alternatives, and the model must select the more plausible alternative, testing a model's ability to understand and reason about cause and effect.
39. WSC [220]: The Winograd Schema Challenge (WSC) is a reading comprehension task in which a system must resolve references in a text, often requiring world knowledge and reasoning about the text.
40. RACE [225]: RACE is a reading comprehension dataset collected from English examinations in China, which benchmarks AI models for understanding and answering questions on long and complex passages, simulating the challenge of a real-world examination.
41. StrategyQA [253]: A question-answering dataset that requires reasoning over multiple pieces of evidence to evaluate the strategic reasoning ability of AI models, pushing the boundaries of what machines can understand and answer.
42. CSQA [254]: CommonsenseQA is a question-answering dataset that requires commonsense knowledge to answer, testing the ability of AI models to understand and answer questions that require commonsense reasoning.
43. GLUE [222]: The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. It includes a variety of tasks that test a wide range of linguistic phenomena, making it a comprehensive tool for evaluating language understanding in AI.
VII. SUMMARY AND DISCUSSION
A. Architecture
Due to the gigantic scale of LLMs, minor changes in architecture and training strategies have a big impact on performance and stability. Here, we summarize key architectural modules used in various LLMs, leading to better performance, reduced training time and memory, and better training stability.
Layer Normalization is found to have a significant effect on the performance and training stability of LLMs. Pre-norm, that is, normalizing inputs rather than outputs, is more common among LLMs and stabilizes training [8], [111], [90]. BLOOM [9] and AlexaTM [107] utilize an additional layer normalization before the embedding layer to stabilize the training of large-scale models, although this can negatively impact the model's zero-shot generalization ability [9]. However, another study [109] finds that pre-norm degrades fine-tuned model performance as compared to post-norm, and that there are no stability benefits of pre-norm beyond the 100B scale. Therefore, GLM-130B [109] used deep-norm, a variant of post-norm, for better downstream task performance after fine-tuning.
Positional Encoding affects the performance and training stability of LLMs like other building blocks of a model. BLOOM [9] finds ALiBi outperforming learned and rotary positional encodings. Contrary to this, GLM-130B [109] identifies rotary positional encoding as better than ALiBi. So far, the literature offers no firm conclusion about positional encodings.
Parallel Attention, where the attention and feed-forward layers are parallel to each other rather than sequential in a transformer block, has been shown to reduce training time by about 15%. There is no evidence of a performance drop due to this change in the literature, and it is used by the models PaLM [14], GPT-NeoX [100], and CodeGen [116].
Multi-Query Attention has shared key and value attention heads in a transformer block, while query attention heads are projected as usual. This reduces memory usage and speeds up sampling in autoregressive decoding. No performance degradation has been observed with this change, and it makes training efficient by allowing larger batch sizes. Multi-query attention is used in [14], [118].
Mixture of Experts allows easily scaling models to trillions of parameters [115], [103]. Only a few experts are activated during the computation, making them compute-efficient. The performance of MoE models is better than that of dense models for the same amount of data, and they require less computation during fine-tuning to achieve performance similar to dense models, as discussed in [103]. MoE architectures are less prone to catastrophic forgetting and are therefore more suited for continual learning [115]. Extracting smaller sub-models for downstream tasks is possible without losing any performance, making the MoE architecture hardware-friendly [115].
Sparse vs Dense Activated GPT-3 [8] uses sparse transformers [45], whereas GLaM [103] and PanGu-Σ [115] use the MoE [104] architecture to lower computational costs and increase the model size and capacity. According to the literature, sparse modules do not degrade the model's performance [45]. However, more experiments are required to verify this statement.
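To make the multi-query variant described above concrete, the sketch below shares a single key/value head across all query heads, in contrast to standard multi-head attention where every head has its own keys and values. It is a simplified illustration: causal masking, dropout, KV caching, and positional encodings (rotary or ALiBi) are omitted.

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: per-head queries, one shared key/value head.

    Sharing K/V shrinks the KV cache and speeds up autoregressive decoding;
    this sketch omits causal masking, dropout, and positional encodings.
    """

    def __init__(self, hidden_size: int, n_heads: int):
        super().__init__()
        assert hidden_size % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = hidden_size // n_heads
        self.q_proj = nn.Linear(hidden_size, hidden_size)     # one query per head
        self.k_proj = nn.Linear(hidden_size, self.head_dim)   # single shared key head
        self.v_proj = nn.Linear(hidden_size, self.head_dim)   # single shared value head
        self.o_proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, h, t, d)
        k = self.k_proj(x).unsqueeze(1)   # (b, 1, t, d), broadcast over heads
        v = self.v_proj(x).unsqueeze(1)   # (b, 1, t, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v                    # (b, h, t, d)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```

Grouped-query attention, as listed for LLaMA-2 (70B) in Table V, sits between the two extremes by sharing each key/value head among a group of query heads.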
TABLE IX: Training and evaluation datasets for pre-trained LLMs. Here, "D" denotes Dialogue, "QA" denotes question
answering, “CR” is for commonsense reasoning, “CoT” is for chain-of-thought, “RC” for reading comprehension, “LU”
for language understanding, “IRC” for in-context reading comprehension, “NLI” for natural language inference, “WT”
for winograd-style tasks, “SC” for sentence completion, “WSD” for word sense disambiguation, “CorefR” for coreference
resolution.
Models Training Dataset Evaluation Dataset
GLUE [222], CNNDM, SQuAD [238], SuperGLUE [3], EnDe, ENFr, EnRo,
QQP [255], MNLI-m [256], MNLI-mm [256], QNLI [238],
T5 C4 [11]
WNLI [220], CB [257],
WiC [240], WMT [258], CNN/DM
QA: NaturalQS, WebQS, TriviaQA, ARC, CoQA, DROP
GPT-3 Common Crawl, WebText, Books Corpora, Wikipedia
SuperGLUE, WMT, LAMBADA, StoryCloze, HellaSwag
SP: XNLI [229], PAWS-X [231] S: WikiAnn NER [259]
mT5 mC4 [12]
QA: MLQA [260], TyDiQA-GoldP [261]
PanGu-α 1.1TB Chinese Text Corpus -
CCPM [262], C3 [263], Sogou-Log,
CPM-2 WuDaoCorpus [91] WMT20 [264], Math23k [241], LCSTS [265],
LCQMC [242], AdGen [266], CUGE [267]
54 million public software repositories hosted on GitHub HumanEval [246],
Codex
containing python files under 1MB 64 original programming problems with unit test
NLU: NLPCC2014-SC, SE-ABSA16_PHNS, SE-ABSA16_CAME,
BDCI2019, COTE-BD [268], COTE-DP [268], COTE-MFW [268],
XNLI [229], OCNLI [269], CMNLI [269], CLUEWSC2020 [269],
FinRE [270], SanWen [271], CCKS2020, AFQMC [269],
LCQMC [242], CSL [269], PAWS-X [231], BQ Corpus [272],
TNEWS, IFLYTEK [273], THUCNEWS, CNSE [274], CNSS [274],
Chinese text corpora, Baidu Search, Web text, NLPCC-DBQA, CHIP2019, cMedQA [275],
QA-long, QA-short, Poetry & Couplet cMedQA2 [276], CKBQA 13 [277], WebQA [247],
ERNIE3.0
Domain-specific data from medical, law and financial area CLUENER [269], Weibo [278], OntoNotes [279], CCKS2019,
Baidu knowledge graph with more than 50 million facts CMRC 2018 [248], CMRC2019 [280], DRCD [281],
DuReader [282], Dureaderrobust [283], Dureaderchecklist , Dureaderyesno ,
C3 [263], CHID [284], CAIL2018-Task1 & Task2 [285],
DogWhistle Insider & Outsider [286], Sogou-log [287];
NLG: LCSTS [265], KBQG, DuReader-QG [282],
Dureaderrobust -QG [283], MATINF-QA [288], Math23KMath23k [241],
AdGen [266], WMT20-enzh [264], KdConv [289]
ARC-Challenge [228], ARC-Easy [228], BoolQ [224],
Wikipedia, OWT, Books, C4 [11], HellaSwag [215], PIQA [216],
Jurassic-1
PileCC [290], arXiv, GitHub RACE-high [225], RACE-middle [225],
RTE [235], StoryCloze [223], WinoGrande [219]
Korean blogs, Community sites, News, KiN NSMC: a movie review dataset from NAVER movies;
Korean Wikipedia, Wikipedia (English and Japanese); KorQuAD 1.0 [291], Korean ML dataset
HyperCLOVA
Modu-Corpus: Messenger, News, AI Hub Korean-English, YNAT [292],
Spoken and written language corpus, Web corpus KLUE-TC [292], KLUE-STS [292]
Common Crawl, SogouT, Sogou News, FewCLUE [293], ZeroCLUE [269],
Yuan 1.0
Baidu Baike, Wikipedia, Books CMRC2018 [248], WebQA [247]
LM: Pile [290], LAMBADA [218],
Wikitext103 [249], PG-19 [250], C4 [11];
LU: MMLU [221], BIG-bench [236];
subsets of MassiveWeb [98] RC: RACE-middle [225], RACE-high [225]
Gopher Books, C4 [11], News, GitHub and QA: TriviaQA [217], TruthfulQA [226], Natural Questions [294];
Wikipedia samples from MassiveText [98] Fact Checking on Fever [295], MultiFC [296];
HellaSwag [215], PIQA [216], WinoGrande [219], SIQA [297];
RealToxicityPrompts [298], Twitter Dataset [299],
CivilComments toxicity classification [300]
NLU: NLPCC2014-SC, SE-ABSA16_PHNS, SE-ABSA16_CAME,
BDCI2019, EPRSTMT [293], COTE-BD [268], COTE-MFW [268],
OCNLI [269], CMNLI [269], OCNLI-FC [293], CLUEWSC [269]
CLUEWSC-FC [293], FinRE [270], SanWen [271], AFQMC [269],
LCQMC [242], PAWS-X [231], BQ Corpus [272], CSL [269]
Chinese text corpora, Baidu Search, Web text,
CSL-FC [293], BUSTM, TNEWS, TNEWS-FC [293], IFLYTEK [273], IFLYTEK-FC
QA-long, QA-short, Poetry & Couplet
THUCNEWS, CNSE [274], CNSS [274], CSLDCP
ERNIE3.0 TITAN Domain-specific data from medical, law and financial area
NLPCC-DBQA, CHIP2019, cMedQA [275],
Baidu knowledge graph with more than 50 million facts
cMedQA2 [276], CKBQA 13 [277], WebQA [247],
ERNIE 3.0 adversarial dataset, ERNIE 3.0 controllable dataset
PD&CFT, CMRC2017 [301], CMRC2019 [280]
CHID [284], CHID-FC [293], WPLC, DRCD [281],
DuReader [282], Dureaderrobust [283], Dureaderchecklist , Dureaderyesno ,
C3 [263], CMRC 2018 [248], CAIL2018-Task1 & Task2 [285]
DogWhistle Insider & Outsider [286]
ANLI [227], ARC [228], HeadQA [302], HellaSwag [215],
LAMBADA [218], LogiQA [303], OpenBookQA [304], PIQA [216],
GPT-NeoX-20B Pile [290] PROST [305], QA4MRE [306], SciQ [307], TriviaQA [217],
WinoGrande [219], SuperGLUE [3], MATH [243],
Advanced Knowledge-Based Tasks
HellaSwag [215], StoryCloze [223], PIQA [216],
ARC-Easy [228], ARC-Challenge [228], OpenBookQA [304],
WinoGrad [220], WinoGrande [219], SuperGLUE [3],
RoBERTa [308], Pile [290],
OPT Wizard of Wikipedia [310], Empathetic Dialogues [311],
PushShift.io Reddit [309]
ConvAI2 [312], Blended Skill Talk [313], Wizard of Internet [314]
ETHOS [244], CrowS-Pairs [315], StereoSet [245],
RealToxicPrompts [298], Dialogue Responsible AI evaluations
TABLE X: Training and evaluation datasets for instruction-tuned LLMs. All the abbreviations are the same as Table IX
Models Training Datasets Evaluation Datasets
NLI: ANLI [227], CB [257], RTE [235];
T0 - SC: COPA [318], HellaSwag [215] StoryCloze [223];
WSD: WiC [240]; CorefR: WSC [220], Wino (XL) [219]
ELI5 [351], ELI5 fact-check [133], TriviaQA [217],
ARC-Challenge [228], ARC-Easy [228],
WebGPT ELI5 [351], TruthfulQA [226], TriviaQA [217]
Hand-written data, Demonstrations of humans,
Comparisons between model-generated answers
Tk-INSTRUCT SUP-NATINST [26] SUP-NATINST [26]
mT0 xP3 [134] -
PromptSource [22], FLAN [25],
PromptSource [22], FLAN [25], Super-NaturalInstructions [352],
Super-NaturalInstructions [352],
OPT-IML UnifiedSKG [353], CrossFit [354], ExMix [355], T5 [11],
UnifiedSKG [353], CrossFit [354],
Reasoning, MMLU [221], BBH [236], RAFT [356]
ExMix [355], T5 [11], Reasoning
Flan Muffin, T0-SF, NIv2, CoT MMLU [221], BBH [236], TyDiQA [261], MGSM [343]
WizardCoder Code Alpaca HumanEval [246], MBPP [345], DS-1000 [349]
B. Training Strategies
Training models at a huge scale requires some tricks to reduce training costs, avoid loss divergence, and achieve better performance. We summarize and discuss some of these key tricks used in different LLMs; a minimal training-step sketch combining several of them is given at the end of this section.
Mixed Precision is a famous method for LLMs to reduce memory usage and improve training efficiency. In mixed precision, forward and backward passes are performed in FP16 format, whereas optimizer states and master weights are kept in FP32 format [357]. A drawback associated with this format change is training instability due to a smaller value range, resulting in loss spikes [109]. An alternative to FP16 is BF16, which has a comparatively larger range and performs some precision-sensitive operations like gradient accumulation and softmax in FP32 [9]. BF16 has better performance and training stability but uses more memory and is supported only on specific hardware, for example, A100 GPUs. Therefore, its adoption in LLMs is limited.
Training Instability is a common issue in LLMs where loss divergence or spiking is observed multiple times during training, even in the presence of gradient clipping [14]. To mitigate this problem, many approaches suggest restarting training from an earlier checkpoint [14], [109], [103], skipping 200-500 earlier data batches at the point of divergence [14], and re-shuffling batches [103]. The embedding layer gradient shrink proves to further stabilize the training, as its gradient norm is significantly larger than that of the other layers [109]. Another suggestion to improve training stability for larger models is not to use biases in dense and norm layers, as in [14].
Weight Initialization plays a significant role in model convergence and training stability. GPT-NeoX [100] initializes feed-forward layers before residuals with 2/(L√d), as in [131], and other layers with the small initialization scheme [358]. This avoids activations growing exponentially with increasing depth. MT-NLG [21] found that higher variance for weight initialization leads to unstable training, hence validating the small initialization scheme [358]. Various models perform random weight initialization, which can cause bad initialization; Galactica [125] suggests a longer warmup to negate the effect.
Learning Rate is important for stable training. It is suggested to use a lower value [9], [14], [20] with warmup and decay (cosine or linear). Usually, the learning rate is within the range 1e-4 to 8e-4. Moreover, MT-NLG (530B) [21] and GPT-NeoX (20B) [100] suggest interpolating learning rates based on the model size using the GPT-3 [8] models ranging between 13B and 175B. This avoids tuning the learning rate hyperparameter.
Training Parallelism 3D parallelism, a combination of data, pipeline, and tensor parallelism, is the most utilized training parallelism approach in LLMs [109], [14], [10], [9], [21], [97], [94]. In addition to 3D parallelism, BLOOM [9] uses the zero optimizer [61] to shard optimizer states. PanGu-α [90] and PanGu-Σ [115] go beyond 3D parallelism and apply 5D parallelism, which additionally contains optimizer parallelism and rematerialization.
Mode Switching adds task-related tokens at the beginning of the text during training. These tokens refer to the natural language understanding and natural language generation tasks and are shown to improve downstream task performance in [15], [20], [107]. During fine-tuning and inference, tokens are appended based on the downstream task.
Controllable Text Generation Generating credible and controlled text from a pre-trained model is challenging. GPT-3 [8] and other LLMs use in-context learning to control generated text. While in-context learning helps in controlling the generated text, ERNIE 3.0 Titan [99] suggests using an adversarial loss to rank its generated text for credibility and soft prompts such as genre, topic, keywords, sentiment, and length for better control over the generated text.
C. Pre-Training vs Instruction Tuning
While pre-training is important for the generalization of LLMs, instruction-tuning improves the performance of LLMs further and makes them usable. Therefore, it is suggested to perform instruction fine-tuning of pre-trained LLMs to use them effectively [25], [26], [137], [24], [133].
D. Supervised Models vs Generalized Models
Although generalized models are capable of performing diverse tasks with good performance, they have not yet outperformed models trained in supervised settings. The supervised trained models are still state-of-the-art in various NLP tasks by a large margin, as shown in [8], [14], [26].
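As a concrete illustration of the mixed-precision and gradient-clipping practices summarized in this section, the sketch below performs one training step with BF16 autocast and a gradient-norm clip of 1.0, the value used by most models in Table VI. The model, optimizer, and batch objects are placeholders, the dictionary-style model(**batch) call is an assumed interface, and the checkpoint-restart and batch-skipping logic used against loss spikes is omitted.

```python
import torch

def train_step(model, batch, optimizer, clip_norm: float = 1.0):
    """One illustrative mixed-precision training step with gradient clipping.

    BF16 autocast keeps a larger dynamic range than FP16, so no loss scaler is
    used here; parameters and optimizer states remain in FP32.
    """
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch)                      # forward pass runs in BF16
    loss.backward()                                # gradients stored in FP32
    torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
    optimizer.step()                               # FP32 optimizer states updated
    return loss.detach()
```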
VIII. CONCLUSION
This paper has reviewed various LLMs, discussing the pros and cons of multiple models. Our review has distilled significant findings and provided a detailed analysis of the design aspects of each LLM, including architecture, datasets, and training pipelines. We have identified crucial architectural components and training strategies employed by different LLMs and presented a summary and discussion. Moreover, we have compared the performance of LLMs in zero-shot and few-shot settings, explored the impact of fine-tuning, and compared supervised vs. generalized models as well as encoder vs. decoder vs. encoder-decoder architectures. This paper will serve as a valuable resource for researchers, offering insights into the recent advancements in LLMs and providing fundamental concepts and details to develop improved LLMs.
[42] C. W. Eriksen and J. E. Hoffman, “Some characteristics of selective [66] J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, C. Leary,
attention in visual perception determined by vocal reaction time,” D. Maclaurin, G. Necula, A. Paszke, J. VanderPlas, S. Wanderman-
Perception & Psychophysics, vol. 11, no. 2, pp. 169–171, 1972. 4 Milne et al., “Jax: composable transformations of python+ numpy
[43] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by programs,” 2018. 5
jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, [67] S. Li, J. Fang, Z. Bian, H. Liu, Y. Liu, H. Huang, B. Wang, and
2014. 4 Y. You, “Colossal-ai: A unified deep learning system for large-scale
[44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. parallel training,” arXiv preprint arXiv:2110.14883, 2021. 5
Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” [68] J. He, J. Qiu, A. Zeng, Z. Yang, J. Zhai, and J. Tang, “Fastmoe: A fast
Advances in neural information processing systems, vol. 30, 2017. 4, mixture-of-expert training system,” arXiv preprint arXiv:2103.13262,
5, 8 2021. 5
[45] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long [69] L. Huawei Technologies Co., “Huawei mindspore ai development
sequences with sparse transformers,” arXiv preprint arXiv:1904.10509, framework,” in Artificial Intelligence Technology. Springer, 2022, pp.
2019. 4, 8, 22 137–162. 5
[46] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré, “Flashattention: Fast [70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
and memory-efficient exact attention with io-awareness,” Advances in T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An
Neural Information Processing Systems, vol. 35, pp. 16 344–16 359, imperative style, high-performance deep learning library,” Advances
2022. 4 in neural information processing systems, vol. 32, 2019. 5
[47] O. Press, N. Smith, and M. Lewis, “Train short, test long: Attention [71] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
with linear biases enables input length extrapolation,” in International S. Ghemawat, G. Irving, M. Isard et al., “Tensorflow: a system for
Conference on Learning Representations, 2022. [Online]. Available: large-scale machine learning.” in Osdi, vol. 16, no. 2016. Savannah,
https://openreview.net/forum?id=R8sQPpGCv0 4 GA, USA, 2016, pp. 265–283. 5
[48] J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu, “Roformer: [72] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu,
Enhanced transformer with rotary position embedding,” arXiv preprint C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine
arXiv:2104.09864, 2021. 4, 9 learning library for heterogeneous distributed systems,” arXiv preprint
[49] A. Kazemnejad, I. Padhi, K. N. Ramamurthy, P. Das, and S. Reddy, arXiv:1512.01274, 2015. 5
“The impact of positional encoding on length generalization in trans- [73] P. J. Liu*, M. Saleh*, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and
formers,” arXiv preprint arXiv:2305.19466, 2023. 4 N. Shazeer, “Generating wikipedia by summarizing long sequences,”
[50] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward in International Conference on Learning Representations, 2018.
networks are universal approximators,” Neural networks, vol. 2, no. 5, [Online]. Available: https://openreview.net/forum?id=Hyg0vbWC- 6
pp. 359–366, 1989. 5 [74] T. Wang, A. Roberts, D. Hesslow, T. Le Scao, H. W. Chung, I. Beltagy,
[51] V. Nair and G. E. Hinton, “Rectified linear units improve restricted J. Launay, and C. Raffel, “What language model architecture and
boltzmann machines,” in Proceedings of the 27th international confer- pretraining objective works best for zero-shot generalization?” in
ence on machine learning (ICML-10), 2010, pp. 807–814. 5 International Conference on Machine Learning. PMLR, 2022, pp.
[52] D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” 22 964–22 984. 6
arXiv preprint arXiv:1606.08415, 2016. 5 [75] L. Dong, N. Yang, W. Wang, F. Wei, X. Liu, Y. Wang, J. Gao,
[53] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut- M. Zhou, and H.-W. Hon, “Unified language model pre-training for
dinov, “Dropout: a simple way to prevent neural networks from natural language understanding and generation,” Advances in neural
overfitting,” The journal of machine learning research, vol. 15, no. 1, information processing systems, vol. 32, 2019. 6
pp. 1929–1958, 2014. 5
[76] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang, “Gpt
[54] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke,
understands, too,” arXiv preprint arXiv:2103.10385, 2021. 7
A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regulariz-
[77] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe,
ing rnns by randomly preserving hidden activations,” arXiv preprint
A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer
arXiv:1606.01305, 2016. 5
learning for nlp,” in International Conference on Machine Learning.
[55] N. Shazeer, “Glu variants improve transformer,” arXiv preprint
PMLR, 2019, pp. 2790–2799. 7, 8
arXiv:2002.05202, 2020. 5
[56] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling [78] A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan,
with gated convolutional networks,” in International conference on A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general
machine learning. PMLR, 2017, pp. 933–941. 5 language assistant as a laboratory for alignment,” arXiv preprint
[57] B. Zhang and R. Sennrich, “Root mean square layer normalization,” arXiv:2112.00861, 2021. 7
Advances in Neural Information Processing Systems, vol. 32, 2019. 5 [79] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford,
[58] A. Baevski and M. Auli, “Adaptive input representations for neural D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models
language modeling,” arXiv preprint arXiv:1809.10853, 2018. 5 from human preferences,” arXiv preprint arXiv:1909.08593, 2019. 7
[59] S. Shleifer, J. Weston, and M. Ott, “Normformer: Improved [80] S. Kim, S. J. Joo, D. Kim, J. Jang, S. Ye, J. Shin, and M. Seo,
transformer pretraining with extra normalization,” arXiv preprint “The cot collection: Improving zero-shot and few-shot learning of
arXiv:2110.09456, 2021. 5 language models via chain-of-thought fine-tuning,” arXiv preprint
[60] H. Wang, S. Ma, L. Dong, S. Huang, D. Zhang, and F. Wei, arXiv:2305.14045, 2023. 8, 13
“Deepnet: Scaling transformers to 1,000 layers,” arXiv preprint [81] Q. Liu, F. Zhou, Z. Jiang, L. Dou, and M. Lin, “From zero to hero:
arXiv:2203.00555, 2022. 5 Examining the power of symbolic tasks in instruction tuning,” arXiv
[61] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “Zero: Memory preprint arXiv:2304.07995, 2023. 8, 13
optimizations toward training trillion parameter models,” in SC20: In- [82] E. Saravia, “Prompt Engineering Guide,” https://github.com/dair-
ternational Conference for High Performance Computing, Networking, ai/Prompt-Engineering-Guide, 12 2022. 8
Storage and Analysis. IEEE, 2020, pp. 1–16. 5, 25 [83] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun,
[62] M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catan- J. Xu, and Z. Sui, “A survey for in-context learning,” arXiv preprint
zaro, “Megatron-lm: Training multi-billion parameter language models arXiv:2301.00234, 2022. 8
using model parallelism,” arXiv preprint arXiv:1909.08053, 2019. 5 [84] J. Huang and K. C.-C. Chang, “Towards reasoning in large language
[63] “"bmtrain: Efficient training for big models.".” [Online]. Available: models: A survey,” arXiv preprint arXiv:2212.10403, 2022. 8
https://github.com/OpenBMB/BMTrain 5 [85] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V.
[64] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in
P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Transformers: large language models,” Advances in Neural Information Processing
State-of-the-art natural language processing,” in Proceedings of the Systems, vol. 35, pp. 24 824–24 837, 2022. 8, 17
2020 conference on empirical methods in natural language processing: [86] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd-
system demonstrations, 2020, pp. 38–45. 5 hery, and D. Zhou, “Self-consistency improves chain of thought rea-
[65] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He, “Deepspeed: Sys- soning in language models,” arXiv preprint arXiv:2203.11171, 2022.
tem optimizations enable training deep learning models with over 8
100 billion parameters,” in Proceedings of the 26th ACM SIGKDD [87] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and
International Conference on Knowledge Discovery & Data Mining, K. Narasimhan, “Tree of thoughts: Deliberate problem solving with
2020, pp. 3505–3506. 5 large language models,” arXiv preprint arXiv:2305.10601, 2023. 8
[88] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., [109] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu,
“Language models are unsupervised multitask learners,” OpenAI blog, W. Zheng, X. Xia et al., “Glm-130b: An open bilingual pre-trained
vol. 1, no. 8, p. 9, 2019. 8 model,” arXiv preprint arXiv:2210.02414, 2022. 10, 15, 22, 25
[89] S. McCandlish, J. Kaplan, D. Amodei, and O. D. Team, “An empirical [110] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang,
model of large-batch training,” arXiv preprint arXiv:1812.06162, 2018. “Glm: General language model pretraining with autoregressive blank
8 infilling,” in Proceedings of the 60th Annual Meeting of the Association
[90] W. Zeng, X. Ren, T. Su, H. Wang, Y. Liao, Z. Wang, X. Jiang, Z. Yang, for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320–
K. Wang, X. Zhang et al., “Pangu-α : Large-scale autoregressive 335. 10
pretrained chinese language models with auto-parallel computation,” [111] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux,
arXiv preprint arXiv:2104.12369, 2021. 8, 15, 22, 25 T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama:
[91] S. Yuan, H. Zhao, Z. Du, M. Ding, X. Liu, Y. Cen, X. Zou, Z. Yang, Open and efficient foundation language models,” arXiv preprint
and J. Tang, “Wudaocorpora: A super large-scale chinese corpora for arXiv:2302.13971, 2023. 10, 15, 22
pre-training language models,” AI Open, vol. 2, pp. 65–68, 2021. 8, [112] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei,
23, 24 N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama
[92] Y. Sun, S. Wang, S. Feng, S. Ding, C. Pang, J. Shang, J. Liu, X. Chen, 2: Open foundation and fine-tuned chat models,” arXiv preprint
Y. Zhao, Y. Lu et al., “Ernie 3.0: Large-scale knowledge enhanced arXiv:2307.09288, 2023. 10, 15
pre-training for language understanding and generation,” arXiv preprint [113] M. N. Rabe and C. Staats, “Self-attention does not need o(n2 ) memory,”
arXiv:2107.02137, 2021. 9, 15 arXiv preprint arXiv:2112.05682, 2021. 10
[93] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, [114] V. A. Korthikanti, J. Casper, S. Lym, L. McAfee, M. Andersch,
“Transformer-xl: Attentive language models beyond a fixed-length M. Shoeybi, and B. Catanzaro, “Reducing activation recomputation
context,” arXiv preprint arXiv:1901.02860, 2019. 9 in large transformer models,” Proceedings of Machine Learning and
[94] O. Lieber, O. Sharir, B. Lenz, and Y. Shoham, “Jurassic-1: Technical Systems, vol. 5, 2023. 10
details and evaluation,” White Paper. AI21 Labs, vol. 1, 2021. 9, 15, [115] X. Ren, P. Zhou, X. Meng, X. Huang, Y. Wang,PW. Wang, P. Li,
25 X. Zhang, A. Podolskiy, G. Arshinov et al., “Pangu- : Towards trillion
[95] Y. Levine, N. Wies, O. Sharir, H. Bata, and A. Shashua, “Limits to parameter language model with sparse heterogeneous computing,”
depth efficiencies of self-attention,” Advances in Neural Information arXiv preprint arXiv:2303.10845, 2023. 10, 11, 15, 22, 25
Processing Systems, vol. 33, pp. 22 640–22 651, 2020. 9 [116] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou,
[96] B. Kim, H. Kim, S.-W. Lee, G. Lee, D. Kwak, D. H. Jeon, S. Park, S. Savarese, and C. Xiong, “Codegen: An open large language
S. Kim, S. Kim, D. Seo et al., “What changes can large-scale language model for code with multi-turn program synthesis,” arXiv preprint
models bring? intensive study on hyperclova: Billions-scale korean arXiv:2203.13474, 2022. 10, 15, 22, 24
generative pretrained transformers,” arXiv preprint arXiv:2109.04650, [117] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
2021. 9, 15 H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
[97] S. Wu, X. Zhao, T. Yu, R. Zhang, C. Shen, H. Liu, F. Li, H. Zhu, language models trained on code,” arXiv preprint arXiv:2107.03374,
J. Luo, L. Xu et al., “Yuan 1.0: Large-scale pre-trained language model 2021. 10, 15
in zero-shot and few-shot learning,” arXiv preprint arXiv:2110.04725, [118] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,
2021. 9, 15, 25 T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-
[98] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, level code generation with alphacode,” Science, vol. 378, no. 6624, pp.
J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language 1092–1097, 2022. 11, 15, 22, 24
models: Methods, analysis & insights from training gopher,” arXiv [119] N. Shazeer, “Fast transformer decoding: One write-head is all you
preprint arXiv:2112.11446, 2021. 9, 10, 15, 23, 24 need,” arXiv preprint arXiv:1911.02150, 2019. 11
[99] S. Wang, Y. Sun, Y. Xiang, Z. Wu, S. Ding, W. Gong, S. Feng, [120] R. Y. Pang and H. He, “Text generation by learning from demonstra-
J. Shang, Y. Zhao, C. Pang et al., “Ernie 3.0 titan: Exploring larger- tions,” arXiv preprint arXiv:2009.07839, 2020. 11
scale knowledge enhanced pre-training for language understanding and [121] R. Dabre and A. Fujita, “Softmax tempering for training neural
generation,” arXiv preprint arXiv:2112.12731, 2021. 9, 15, 25 machine translation models,” arXiv preprint arXiv:2009.09372, 2020.
[100] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Gold- 11
ing, H. He, C. Leahy, K. McDonell, J. Phang et al., “Gpt-neox- [122] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi,
20b: An open-source autoregressive language model,” arXiv preprint “Codet5+: Open code large language models for code understanding
arXiv:2204.06745, 2022. 9, 22, 25, 26 and generation,” arXiv preprint arXiv:2305.07922, 2023. 11, 15, 26
[101] W. Ben and K. Aran, “Gpt-j-6b: A 6 billion parameter autoregressive [123] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware
language model,” 2021. 9 unified pre-trained encoder-decoder models for code understanding and
[102] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, generation,” arXiv preprint arXiv:2109.00859, 2021. 11
B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed [124] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou,
precision training,” arXiv preprint arXiv:1710.03740, 2017. 9 M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source
[103] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, be with you!” arXiv preprint arXiv:2305.06161, 2023. 11, 15
Y. Zhou, A. W. Yu, O. Firat et al., “Glam: Efficient scaling of [125] R. Taylor, M. Kardas, G. Cucurull, T. Scialom, A. Hartshorn, E. Sar-
language models with mixture-of-experts,” in International Conference avia, A. Poulton, V. Kerkez, and R. Stojnic, “Galactica: A large
on Machine Learning. PMLR, 2022, pp. 5547–5569. 9, 15, 22, 25 language model for science,” arXiv preprint arXiv:2211.09085, 2022.
[104] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, 11, 15, 24, 25
and J. Dean, “Outrageously large neural networks: The sparsely-gated [126] FairScale authors, “Fairscale: A general purpose modular pytorch
mixture-of-experts layer,” arXiv preprint arXiv:1701.06538, 2017. 9, library for high performance and large scale training,” https://github.
22 com/facebookresearch/fairscale, 2021. 11
[105] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling [127] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T.
to trillion parameter models with simple and efficient sparsity,” The Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models
Journal of Machine Learning Research, vol. 23, no. 1, pp. 5232–5270, for dialog applications,” arXiv preprint arXiv:2201.08239, 2022. 11,
2022. 9 15, 24
[106] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, [128] S. Wu, O. Irsoy, S. Lu, V. Dabravolski, M. Dredze, S. Gehrmann,
E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., P. Kambadur, D. Rosenberg, and G. Mann, “Bloomberggpt: A large
“Training compute-optimal large language models,” arXiv preprint language model for finance,” arXiv preprint arXiv:2303.17564, 2023.
arXiv:2203.15556, 2022. 9, 15, 24 11, 15, 24
[107] S. Soltan, S. Ananthakrishnan, J. FitzGerald, R. Gupta, W. Hamza, [129] Y. Levine, N. Wies, O. Sharir, H. Bata, and A. Shashua, “Limits to
H. Khan, C. Peris, S. Rawls, A. Rosenbaum, A. Rumshisky et al., depth efficiencies of self-attention,” Advances in Neural Information
“Alexatm 20b: Few-shot learning using a large-scale multilingual Processing Systems, vol. 33, pp. 22 640–22 651, 2020. 11
seq2seq model,” arXiv preprint arXiv:2208.01448, 2022. 10, 15, 22, [130] X. Zhang, Q. Yang, and D. Xu, “Xuanyuan 2.0: A large chinese
25, 26 financial chat model with hundreds of billions parameters,” arXiv
[108] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, preprint arXiv:2305.12002, 2023. 11, 15, 16
S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical [131] W. Ben, “Mesh-transformer-jax: Model-parallel implementation of
report,” arXiv preprint arXiv:2305.10403, 2023. 10, 15 transformer language model with jax,” 2021. 12, 25
[132] S. Black, S. Biderman, E. Hallahan, Q. Anthony, L. Gao, L. Gold- [155] D. Ganguli, A. Askell, N. Schiefer, T. Liao, K. Lukošiūtė, A. Chen,
ing, H. He, C. Leahy, K. McDonell, J. Phang et al., “Gpt-neox- A. Goldie, A. Mirhoseini, C. Olsson, D. Hernandez et al., “The capacity
20b: An open-source autoregressive language model,” arXiv preprint for moral self-correction in large language models,” arXiv preprint
arXiv:2204.06745, 2022. 15 arXiv:2302.07459, 2023. 16
[133] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, [156] A. Wei, N. Haghtalab, and J. Steinhardt, “Jailbroken: How does llm
C. Hesse, S. Jain, V. Kosaraju, W. Saunders et al., “Webgpt: Browser- safety training fail?” arXiv preprint arXiv:2307.02483, 2023. 16
assisted question-answering with human feedback,” arXiv preprint [157] D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y. Bai, S. Kadavath,
arXiv:2112.09332, 2021. 15, 25 B. Mann, E. Perez, N. Schiefer, K. Ndousse et al., “Red teaming
[134] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, language models to reduce harms: Methods, scaling behaviors, and
T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf lessons learned,” arXiv preprint arXiv:2209.07858, 2022. 16
et al., “Crosslingual generalization through multitask finetuning,” arXiv [158] S. Casper, J. Lin, J. Kwon, G. Culp, and D. Hadfield-Menell, “Explore,
preprint arXiv:2211.01786, 2022. 13, 15, 25 establish, exploit: Red teaming language models from scratch,” arXiv
[135] A. Glaese, N. McAleese, M. Tr˛ebacz, J. Aslanides, V. Firoiu, T. Ewalds, preprint arXiv:2306.09442, 2023. 16
M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving [159] E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese,
alignment of dialogue agents via targeted human judgements,” arXiv N. McAleese, and G. Irving, “Red teaming language models with
preprint arXiv:2209.14375, 2022. 15 language models,” arXiv preprint arXiv:2202.03286, 2022. 16