Pre-Trained Models for NLP
Engineering
Research: Artificial Intelligence—Review
Article history: Received 10 November 2021; Revised 8 March 2022; Accepted 5 April 2022; Available online 7 September 2022

Keywords: Pre-trained models; Natural language processing

Abstract: Pre-trained language models have achieved striking success in natural language processing (NLP), leading to a paradigm shift from supervised learning to pre-training followed by fine-tuning. The NLP community has witnessed a surge of research interest in improving pre-trained models. This article presents a comprehensive review of representative work and recent progress in the NLP field and introduces the taxonomy of pre-trained models. We first give a brief introduction of pre-trained models, followed by characteristic methods and frameworks. We then introduce and analyze the impact and challenges of pre-trained models and their downstream applications. Finally, we briefly conclude and address future research directions in this field.

© 2022 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and Higher Education Press Limited Company. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. A brief history of pre-trained models

The concept of pre-training is related to transfer learning [1]. The idea of transfer learning is to reuse the knowledge learned from one or more tasks and apply it to new tasks. Traditional transfer learning employs annotated data for supervised training, which has been the common practice for at least a decade. Within deep learning, pre-training with self-supervised learning on massive unannotated data has become the dominant transfer learning approach. The difference is that pre-training methods use unannotated data for self-supervised training and can be applied to various downstream tasks via fine-tuning or few-shot learning.

In natural language processing (NLP), model pre-training is based on the task of language modeling. The goal of language modeling is to predict the next token, given a history of unannotated texts [2–4]. The first milestone of neural language modeling appears in Ref. [5], which models n-gram probabilities through distributed representations of words and feed-forward neural networks. Since then, deep learning methods have begun to dominate the training paradigm of language modeling. In early methods for neural language modeling, recurrent neural networks (RNNs) were widely used [6,7]. Among the RNN family, long short-term memory (LSTM) [8] stands out due to its advantage of being less prone to the gradient vanishing problem via its well-designed gating mechanism. With the emergence of the model known as the transformer [9], considerable efforts have been devoted to building stronger and more efficient language models based on the transformer architecture [10–14]. In neural language modeling, distributed word representations named "word embeddings" that are learned with models such as Word2Vec [15] and GloVe [16] have become common initializations for the word vectors of deep learning models, significantly improving the performance of downstream tasks such as named-entity recognition [16], part-of-speech tagging [17], and question answering [18].

Although methods that leverage static word embeddings for warm startup can improve the performance of downstream NLP tasks, they lack the ability to represent different meanings of words in context. To solve this problem, context-aware language models were proposed to incorporate the complete context information into the training procedure. Dai and Le [19] introduced context-aware language modeling, which uses unannotated data to improve sequence learning with recurrent networks. This achieves significant performance improvement in sentiment analysis, text classification, and object classification tasks. In 2017, contextualized word vectors were proposed, which are derived from an encoder that is pre-trained on machine translation and then transferred to a variety of downstream NLP tasks [20]. However, these studies use a small amount of data for pre-training and do not achieve consistent performance improvement across all NLP tasks. Nonetheless, these pioneering studies greatly motivated follow-up pre-training methods for context modeling.

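As a minimal illustration of how static word embeddings are used for warm startup, the following PyTorch sketch initializes an embedding layer from pre-trained vectors before fine-tuning it with a downstream model. The vocabulary, dimensions, and the random stand-in for real GloVe/Word2Vec vectors are hypothetical placeholders, not taken from any specific system described above.

```python
import numpy as np
import torch
import torch.nn as nn

# Hypothetical vocabulary and pre-trained vectors (stand-in for rows loaded from a GloVe/Word2Vec file).
vocab = {"<pad>": 0, "<unk>": 1, "movie": 2, "love": 3}
embedding_dim = 50
pretrained = np.random.randn(len(vocab), embedding_dim).astype("float32")

# Initialize the embedding layer with the pre-trained vectors; freeze=False lets fine-tuning update them.
embedding = nn.Embedding.from_pretrained(torch.from_numpy(pretrained), freeze=False, padding_idx=0)

token_ids = torch.tensor([[3, 2]])        # "love movie"
word_vectors = embedding(token_ids)       # shape (1, 2, 50), fed into a downstream encoder
```

In practice the downstream encoder (an RNN, CNN, or transformer) is stacked on top of this layer and trained on annotated task data, which is exactly the warm-startup recipe the static-embedding methods above rely on.
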
In another pioneering study on pre-trained models (PTMs), embeddings from language models were proposed to leverage bidirectional LSTMs in order to learn contextual word representations, and the pre-trained contextual embeddings were then applied to downstream tasks [21]. This method demonstrated great improvements in a broad range of NLP tasks, including question answering, textual entailment, sentiment analysis, semantic role labeling, coreference resolution, and named-entity extraction.

Since then, numerous PTMs within the "pre-training then fine-tuning" paradigm have started to emerge. Generative pre-training (GPT) [22] was the first model to use unidirectional transformers as the backbone for generative pre-training of language models, thereby illustrating the dramatic potential of pre-training methods for diverse downstream tasks. Following GPT, Bidirectional Encoder Representations from Transformers (BERT) [23] was the first model to leverage bidirectional transformers; this model learns bidirectional contexts by means of conditioning on both the left and the right contexts in deep stacked layers. BERT introduced a denoising autoencoding pre-training task, termed masked language modeling (MLM), to recover the corrupted tokens of input sentences according to their contexts, in what was akin to a cloze task. This approach greatly boosted the performance gain of downstream natural language understanding (NLU) tasks. In this type of pre-training, which is also known as self-supervised learning, the pre-training labels are derived from unannotated data. By resorting to web-scale unannotated data from the Internet, PTMs can automatically learn syntactic and semantic representations.

The great success of PTMs has attracted a wide range of interest in scaling them up and exploring the boundaries of pre-training techniques; examples include decoding-enhanced BERT with disentangled attention (DeBERTa) [24], text-to-text transfer transformers (T5) [25], GPT-3 [26], the large-scale generative Chinese pre-trained language model (CPM) [27], PanGu-α [28], and ERNIE 3.0 Titan [29]. Large-scale PTMs, such as GPT-3, have now demonstrated the powerful capabilities of zero-shot and few-shot learning. With dozens of examples, GPT-3 achieved a performance similar to that of BERT fine-tuned with tens of thousands of pieces of data on SuperGLUE [30]. GPT-3 can also generate high-quality creative texts, such that even humans cannot determine whether or not the texts were written by a human. The success of GPT-3 makes it possible to use this model for general-purpose text generation, which was considered to be impossible in the past decades.

Another line of pre-training methods has attempted to incorporate knowledge in order to enhance the representation capability of PTMs [31]. Some studies employ linguistic knowledge to design entity-related tasks with weak supervision. For example, they corrupt entity spans in texts and use knowledge-masking strategies such as entity-level or phrase-level masking [31] and entity replacement prediction [32] to better learn lexical, syntactic, and semantic information from texts. Another direction of research integrates structured knowledge together with plain texts into pre-training, such as knowledge-enabled BERT (K-BERT) [33], contextualized language and knowledge embedding (CoLAKE) [34], enhanced language representation with informative entities (ERNIE-THU) [35], knowledge-enhanced BERT (KnowBERT) [36], SenseBERT [37], knowledge embedding and pre-trained language representation (KEPLER) [38], and ERNIE 3.0 [39]. ERNIE 3.0, which powers PTMs with knowledge, has achieved new state-of-the-art (SOTA) performances across 54 Chinese NLP benchmarks, as well as some English benchmarks, including SuperGLUE [30]. Moreover, K-Adapter [40] uses multiple adapters for different tasks independently in order to better fuse various knowledge sources and mitigate catastrophic forgetting. Knowledge-based incorporation has dramatically improved knowledge sharing between unstructured text and structured knowledge, greatly promoting the capacity of knowledge memorization and reasoning in PTMs [39].

However, the aforementioned models only focus on rich-resource languages, such as English and Chinese, and thus may overlook numerous low-resource languages. Recent work on multilingual models aims to transfer knowledge from rich-resource languages to low-resource languages by modeling the semantic representation of disparate languages in a unified vector space. Inspired by BERT, multilingual BERT (mBERT) was developed and released; this model is trained via multilingual masked language modeling (MMLM) on multilingual corpora [41]. From an intuitive perspective, the use of parallel corpora is conducive to learning cross-lingual representations in different languages. Therefore, the cross-lingual language model (XLM) [42] leverages bilingual sentence pairs to perform translation language modeling (TLM), which encourages models to align the representations of two languages together. Researchers have also released more multilingual language models, such as XLM-RoBERTa (XLM-R) [43], InfoXLM [44], and ERNIE-M [45], by improving MMLM or TLM. These studies have demonstrated that pre-trained multilingual language models can significantly improve the performance of multilingual NLP tasks or low-resource language tasks.

Given the success of PTMs in NLP, these models have quickly been extended to other fields such as computer vision [46–48] and speech processing [49]. Although self-supervised pre-training has been the most successful transfer learning method in NLP, the PTMs used for computer vision tasks are diversified. The dominant method in computer vision tasks is still supervised learning. Sun et al. [48] show that representation learning holds promise for advancing model performance based on large-scale (noisy) annotated datasets, such as ImageNet [50] or JFT-300M [48]. These methods learn visual representations and significantly improve the performance of various downstream vision tasks [48]. Self-supervised pre-training has also been explored in computer vision [51–56]. Doersch et al. [53] propose various prediction tasks as pretext tasks to learn visual representations. Dosovitskiy et al. [57] explore the masked patch prediction task using the transformer architecture for images and demonstrate that pre-trained transformers achieve excellent results compared with convolutional neural networks (CNNs).

Recently, contrastive learning has been successfully utilized for visual self-supervised pre-training. Contrastive predictive coding [58] has achieved strong results in various scenarios, including speech, image, and text. These methods [58–60] attempt to maximize the similarity of two augmentations of an image and minimize the similarity of different images with a contrastive loss. More recently, pre-training methods have been advanced by utilizing language supervision for visual representation learning [61], achieving a strong performance in image classification tasks and other vision tasks.

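The contrastive objective sketched above can be made concrete with a simplified InfoNCE-style loss over a batch of paired augmentations. This is a generic sketch rather than the exact formulation of any of the cited methods; the embedding dimensions and temperature are illustrative choices.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.1):
    """Simplified InfoNCE-style loss: z1[i] and z2[i] are embeddings of two
    augmentations of the same image; all other pairs in the batch act as negatives."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                 # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    # Matching pairs sit on the diagonal; use a symmetric cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Usage with hypothetical encoder outputs for a batch of 8 images:
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = contrastive_loss(z1, z2)
```

In an actual pre-training loop, z1 and z2 would come from the same image encoder applied to two random augmentations of each image, and the loss would be backpropagated through the encoder.
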
Pre-training methods have also been applied to multimodal applications, in which texts are combined with other modalities, such as images [62–65], videos [66,67], and speech [68], enabling a broad application scope of PTMs. Such methods [63] significantly improve the performance of various multimodal tasks by jointly learning task-agnostic representations of images and texts. Based on the transformer architecture, PTMs build cross-modal semantic alignments from large-scale image–text pairs. For image generation, DALL-E [69] and CLIP-guided generation [61] leverage multimodal language and vision input to render compelling visual scenes. Although the most commonly used pre-training tasks for multimodal contexts are MLM and masked region prediction, Yu et al. [70] propose knowledge-enhanced scene graph prediction to capture the alignments of more detailed semantics. Gan et al. [71] incorporate adversarial training into pre-training and achieve higher performance. Cho et al. [72] formulate multimodal pre-training as a unified language modeling task based on multimodal context. This demonstrates that PTMs are playing a critical role in the artificial intelligence (AI) community and will potentially promote the unification of the pre-training framework across research fields such as speech, computer vision, and NLP.

There are some existing reviews on PTMs. Some focus on particular types and applications of PTMs, such as transformer-based pre-trained language models [73], BERT-based training techniques [74], prompt-based learning [75], data augmentation [76], text generation [77], and conversational agent design [78]. Another line provides a panoramic perspective of the whole progress of PTMs. For example, Ramponi and Plank [79] provide an overview from early traditional non-neural methods to PTMs in NLP. Qiu et al. [80] systematically categorize existing PTMs from four different perspectives and outline some potential directions of PTMs for future research. Bommasani et al. [81] propose the concept of foundation models to unify PTMs in different subfields such as NLP, computer vision, and speech, and analyze their opportunities and challenges in various AI domains. Han et al. [82] take a deep look into the history of PTMs to reveal the crucial position of PTMs in the AI development spectrum. In our review, we mainly focus on the PTMs in NLP: We first provide a detailed analysis of different PTMs and trends in PTMs at scale, discussing their impact on the field of NLP and the main challenges of PTMs; we then focus on our observations of and practices in the industrial applications of PTMs.

In this paper, we will first summarize the methods and taxonomy of pre-trained language models in Section 2, followed by a discussion of the impact and challenges of pre-trained language models in Section 3. Next, we will introduce the industrial applications of pre-training techniques in Section 4. Finally, we will conclude and address potential future work in this area.

2. Methods of PTMs

2.1. Different frameworks and extensions of PTMs

When working with PTMs, it is essential to design efficient training methods that can fully use unannotated data and assist downstream fine-tuning. In this section, we briefly introduce some widely used pre-training frameworks to date. Fig. 1 summarizes the existing prevalent pre-training frameworks, which can be classified into three categories: transformer decoders only; transformer encoders only; and transformer encoder–decoders. A brief description of each category is given below, and more detail is provided in the subsections that follow.

Fig. 1. An illustration of the existing prevalent pre-training frameworks, where x is the original sentence, x_t (t = 1, 2, ..., T) is the tth token, T is the sequence length, and M(x) is the set of masked tokens in x. S denotes the start token embedding of a sequence. p1, p2, p3, and p4 denote the position embeddings of the first to fourth tokens. P is the conditional probability. i and j indicate the start and the end indices of input tokens of the encoder, respectively.

• Transformer-decoder-only frameworks use a unidirectional (left-to-right) transformer decoder as the pre-training backbone and predict tokens in a unidirectional autoregressive fashion. Here, "auto-regression" refers to predicting the current token based on historical tokens, that is, the partial sequence on the left of the current token. More specifically, given the text sequence x = (x_1, x_2, x_3, ..., x_T) (where x is the original sentence, x_t (t = 1, 2, ..., T) is the tth token, and T is the sequence length), an autoregressive model factorizes the likelihood of the input text sequence as p(x) = ∏_{t=1}^{T} p(x_t | x_{<t}), where p is the likelihood of the input text sequence; a minimal sketch of this factorization is given after this list.

• Transformer-encoder-only frameworks leverage a bidirectional transformer encoder and aim to recover corrupted tokens, given input sentences with randomly masked tokens.

• Transformer encoder–decoder frameworks aim at pre-training a sequence-to-sequence (seq2seq) generation model by masking tokens on the source side and recovering them on the target side. These frameworks consist of two classes: ① seq2seq encoder–decoders, which consist of a bidirectional transformer encoder and a unidirectional decoder with separate parameters; and ② unified encoder–decoders, in which a bidirectional transformer encoder and a left-to-right decoder are simultaneously pre-trained with shared parameters.

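As a minimal sketch of the autoregressive factorization above, the snippet below accumulates the per-token conditionals p(x_t | x_<t) with a publicly available decoder-only checkpoint (GPT-2 via the HuggingFace transformers library). The choice of model and input sentence is purely illustrative; any causal language model could be substituted.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "I love the movie"
input_ids = tokenizer(text, return_tensors="pt").input_ids     # (1, T)

with torch.no_grad():
    logits = model(input_ids).logits                            # (1, T, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)

# Each position predicts the next token, so sum log p(x_t | x_<t) for t = 2..T.
token_log_probs = log_probs[0, :-1].gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
sequence_log_likelihood = token_log_probs.sum()
```

The sum of the per-token log-probabilities is exactly the log of the product in the factorization, which is also the (negated, length-averaged) training loss used when pre-training such models.
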
2.1.1. Transformer decoders only

The objective for language modeling is to predict the next token auto-regressively, given its history. The nature of auto-regression entails the future invisibility of input tokens at each position; that is, each token can only attend to the preceding words. GPT [22] was the first model to use the transformer decoder architecture as its backbone. Given a sequence of words as context, GPT computes the probability distribution of the next word with the masked multi-head self-attention of the transformer. In the fine-tuning phase, the pre-trained parameters are set as the initialization of the model for downstream tasks. GPT is pre-trained on the BooksCorpus dataset, which is nearly the same size as the 1B Word Benchmark. It has hundreds of millions of parameters and improves SOTA results on nine out of 12 NLP datasets, showing the potential of large-scale PTMs. GPT-2 [83] follows the unidirectional framework with a transformer decoder that was trained with a larger corpus, namely WebText, and 1.5 billion model parameters. GPT-2 achieves SOTA results on seven out of eight tested language modeling datasets in a zero-shot setting. GPT-3 [26] further increases the parameters of the transformer to 175 billion and introduces in-context learning. Both GPT-2 and GPT-3 can be applied to downstream tasks without fine-tuning. They achieve a strong performance by scaling up the model size and dataset size.

Unidirectional language modeling lacks attention on its full contexts on both sides, which may degrade its performance on downstream tasks. To tackle this problem, Yang et al. [84] propose the use of permuted language modeling (PLM), which performs autoregressive modeling on permuted input tokens. For example, a permutation of the sentence "I love the movie" can be "I the movie love." Once the permutation is chosen, the last few tokens of the permuted sentence are the target to predict. In the above example, the token "love" is the target, depending on the visible context "I the movie." An advantage of PLM is that it can fully leverage the contextual information for different masked tokens, thus building dependent context relationships with both preceding and successive words. To enable PLM, Yang et al. [84] propose a novel two-stream self-attention mechanism, with one query stream to compute the query vectors and another content stream to compute the key/context vectors. The two-stream self-attention approach evades the leakage of visible context to the masked positions.

2.1.2. Transformer encoders only

Pre-trained transformer encoders, such as BERT [23], have become the standard in NLP systems. BERT uses an MLM framework with a transformer as the backbone. In the pre-training stage, BERT randomly replaces tokens with a special token [MASK] and tries to recover the corrupted words based on their contextual representations. It also adopts an objective of next-sentence prediction (NSP) to capture the discourse relations between two sentences, which is helpful for sentence-level tasks, such as question answering. Devlin et al. [23] refer to this procedure as a cloze task, according to Ref. [85]. BERT was pre-trained on a combination of the BooksCorpus (800 million words) and English Wikipedia (2500 million words), and achieved great improvements on 17 NLP tasks, attaining a level even better than human performance on some of the downstream tasks. However, BERT's shortcomings are also obvious: Because the [MASK] token does not appear in real data during fine-tuning, it creates a mismatch between pre-training and fine-tuning. To amend this discrepancy, BERT uses a novel method to mask tokens: Among the 15% of token positions randomly selected for prediction, only 80% are replaced by the [MASK] token, while 10% are kept as the original tokens, and 10% are replaced by random tokens in the training process. This masking strategy causes the model to take more steps to converge, since only 15% of the tokens in the training batch are predicted. Another problem with BERT is that it predicts tokens independently, without considering other masked tokens. The model proposed in Ref. [86], a unified encoder–decoder model, aims to solve this problem by blanking out text spans of input sentences and predicting the masked spans auto-regressively, which mitigates the independence assumption of masked tokens within the same span in the pre-training of masked language models.

Following the success of BERT, an enormous amount of research effort has gone into MLM. SpanBERT [87] is designed to predict spans of text. It chooses to mask random contiguous spans instead of random tokens, and a span boundary prediction objective is introduced to force the model to predict masked spans according to the structural information of the span boundaries. It also achieves better performance by replacing the NSP objective in BERT with single-sequence training. SpanBERT outperforms BERT on span-related tasks such as question answering and coreference resolution. Like SpanBERT, which uses lexical analysis and chunking tools to locate the span boundary, enhanced representation through knowledge integration (ERNIE) [31] uses a Chinese tokenizer to obtain phrase information and then replaces the random token masking in BERT with entity or phrase masking. ERNIE also utilizes a named-entity recognition toolkit to identify the entity boundary and randomly masks tokens at the entity level, thus enabling the integration of external knowledge into model pre-training.

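The 80/10/10 masking strategy described above can be sketched as follows. This is a simplified, self-contained version (real implementations also skip special tokens and may apply whole-word or span constraints), and the function and argument names are illustrative rather than BERT's reference implementation.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select ~15% of positions as prediction targets; of these,
    80% become [MASK], 10% become a random token, and 10% keep the original token."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    masked_indices = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()
    labels[~masked_indices] = -100                       # only selected positions contribute to the loss

    # 80% of selected positions -> [MASK]
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = mask_token_id

    # Half of the remaining selected positions (10% overall) -> random token
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked_indices & ~replaced
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape, dtype=torch.long)[randomized]

    # The final ~10% keep their original tokens, so the encoder also sees unmasked targets.
    return input_ids, labels
```

The returned labels use -100 for unselected positions so that a standard cross-entropy loss only scores the corrupted positions, matching the "only 15% of tokens are predicted" property noted above.
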
2.1.3. Transformer encoder–decoders

The transformer encoder–decoder architecture is dedicated to natural language generation (NLG) tasks. Unlike NLU, which focuses on comprehending texts, NLG aims to generate a coherent, meaningful, and human-like natural language expression according to specific inputs. For example, the goal of machine translation is to generate a sentence in the target language with the same meaning as the given source language input; for text summarization, the goal is to generate a short version of the input document that captures the core meanings and opinions. The critical point is to model two sequences simultaneously: one for the input and the other for the output.

Song et al. [88] propose Masked Sequence-to-Sequence Learning (MASS) for language generation, in order to pre-train a seq2seq model. The basic idea of MASS is to take a sentence with a masked fragment (i.e., several consecutive tokens) as input and predict the masked fragment conditioned on the encoder representations. In this way, MASS successfully transforms the transformer encoder framework into an autoregressive framework by masking on the source side and predicting on the target side. MASS uses monolingual data from the News Crawl datasets of the Workshop on Machine Translation (WMT) to pre-train the model, and shows substantial improvement in machine translation quality in comparison with models directly trained on annotated data.

Pre-training both a transformer encoder and a transformer decoder results in a unified model that can simultaneously deal with both language understanding and language generation. One member of this class is the standard transformer encoder–decoder model that does not share unified encoder and decoder components. Bidirectional and Auto-Regressive Transformers (BART) [89] proposes a similar objective as MASS, but differs in that MASS masks a consecutive series of tokens (i.e., n-grams of the input), while BART corrupts text with an arbitrary noising function (i.e., masking, deleting, replacing, or exchanging random tokens in different positions). BART can be viewed as a combination of the above two architectures: The random masking strategy on the source side enables the model to deal with NLU tasks, and the overall seq2seq pre-training framework enables the model to be generalized to NLG tasks. Pre-trained on 160 GB of data from news, books, stories, and web text, BART achieves comparable results to RoBERTa [90] and new SOTA results on dialogue and abstractive text summarization. Another member of this category unifies the encoder and decoder as identical transformer blocks. Dong et al. [91] and Bao et al. [92] also propose a unified language model pre-training framework for NLU and generation. These studies partition the self-attention matrix into bidirectional, unidirectional, and seq2seq components, which respectively correspond to bidirectional, unidirectional, and seq2seq language models. Their experiments show performance gains over using a single pre-training objective. Du et al. [86] propose a variant of the model reported in Ref. [91], putting the masked tokens on the right of the unmasked tokens and conducting autoregressive blank filling. Xiao et al. [93] mask multiple segments at different granularities to encourage the decoder to rely more on the encoder representations, thus enhancing the correlation between the encoder and the decoder. Zhang et al. [94] adopt a different approach: First, a sentence is removed from an input document according to pre-defined importance criteria, and then the removed sentence is generated based on the remaining context sentences. This strategy performs auto-regression at the sentence level and prompts whole-document understanding and summary-like generation. Experiments on 12 downstream summarization tasks demonstrate SOTA results, showing the effectiveness of the gap-sentence pre-training method.

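To make the seq2seq denoising objectives above concrete, the sketch below builds a MASS/BART-style training pair by masking a contiguous fragment on the source side and using the masked fragment as the decoder target. It is a toy illustration on whitespace tokens with a hypothetical mask symbol; real systems operate on subword IDs and differ in whether they use one mask token per position (closer to MASS) or a single sentinel for the whole span (closer to BART's text infilling).

```python
import random

def make_infilling_example(tokens, mask_symbol="[MASK]", span_ratio=0.3):
    """Mask a contiguous fragment of the input; the decoder learns to reconstruct it
    auto-regressively, conditioned on the corrupted source seen by the encoder."""
    span_len = max(1, int(len(tokens) * span_ratio))
    start = random.randint(0, len(tokens) - span_len)
    target_span = tokens[start:start + span_len]
    source = tokens[:start] + [mask_symbol] * span_len + tokens[start + span_len:]
    return source, target_span

tokens = "the movie was surprisingly good and well acted".split()
source, target = make_infilling_example(tokens)
# source -> encoder input; target -> decoder output (predicted token by token)
```

Generating source/target pairs on the fly like this is what lets encoder–decoder PTMs be trained purely on unannotated monolingual text before being fine-tuned on translation, summarization, or dialogue data.
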
2.2. Scaling up PTMs

Recent advances in NLP have demonstrated a promising trend toward scaling up PTMs with billions of parameters. OpenAI researchers trained a model called GPT-3, which has 175 billion parameters [26]. GPT-3 achieves strong performance on many NLP datasets, including question answering, machine translation, and three-digit arithmetic. GPT-3 demonstrates that scaling up language models significantly improves task-agnostic and few-shot performances, sometimes even achieving better results than prior SOTA fine-tuning approaches [26]. Although large PTMs are a promising direction, training large-scale PTMs is a challenging task, which requires massive training data and graphics processing unit (GPU) resources. Thus, efficient model training algorithms play a crucial role in scaling up PTMs. The following section introduces the prevalent large-scale PTMs as well as the training methods used to achieve them.

2.2.1. PTMs at scale

Table 1 [24–28,39,95–102] summarizes the mainstream large-scale PTMs. The size of PTMs has become increasingly larger in recent years, ranging from 2.6 billion to even 175 billion parameters. Large-scale pre-trained language models embrace a potpourri of training recipes, including exponentially increased trainable parameters, pre-training architectures, knowledge enhancement, language-specific corpora, and different pre-training tasks, to support the billion-level training of PTMs. Although training methods differ among these models, all the PTMs use transformers [9] as the standard backbone due to the latter's efficient parallel computing performance. Since training large-scale models requires massive unsupervised data, research on scaling up PTMs focuses on high-resource languages such as English and Chinese.

Table 1. Summary of large-scale pre-trained language models.

Model | Number of parameters | Model architecture | Knowledge learning | Language | Pre-training data | Training strategy | Training platform | Reference
DeBERTa1.5B | 1.5 billion | Encoder only | — | English | English data (78 GB) | — | PyTorch | [24]
T5 | 11 billion | Encoder–decoder (seq2seq) | — | English | C4 (750 GB) | Model/data parallelism | TensorFlow | [25]
GPT-3 | 175 billion | Decoder only | — | English | Cleaned CommonCrawl, WebText | Model parallelism | — | [26]
CPM | 2.6 billion | Decoder only | — | Chinese | Chinese corpus (100 GB) | — | PyTorch | [27]
PanGu-α | 200 billion | Decoder only | — | Chinese | Chinese data (1.1 TB, 250 billion tokens) | MindSpore auto-parallel | MindSpore | [28]
ERNIE 3.0 | 10 billion | Encoder–decoder (unified) | √ | Chinese, English | Chinese data (4 TB), English data | Model/pipeline/tensor parallelism | PaddlePaddle | [39]
Turing-NLG | 17 billion | Decoder only | — | English | English data | DeepSpeed/ZeRO | — | [95]
HyperCLOVA | 204 billion | Decoder only | — | Korean | Korean data | — | — | [96]
CPM-2 | 11 billion | Encoder–decoder (seq2seq) | — | Chinese, English | WuDao corpus (2.3 TB Chinese + 300 GB English) | — | PyTorch | [97]
CPM-2-MoE | 198 billion | Encoder–decoder (seq2seq) | — | Chinese, English | WuDao corpus (2.3 TB Chinese + 300 GB English) | Mixture of Experts (MoE) | PyTorch | [98]
Switch transformers | 1751 billion | Encoder–decoder (seq2seq) | — | English | C4 (750 GB) | MoE | TensorFlow | [99]
Yuan 1.0 | 245 billion | Encoder–decoder (unified) | — | Chinese | Chinese data (5 TB) | Model/pipeline/tensor parallelism | — | [100]
GLaM | 1.2 trillion | Encoder only | — | English | English data (1.6 trillion tokens) | MoE/model parallelism | TensorFlow | [101]
Gopher | 280 billion | Decoder only | — | English | English data (10.5 TB) | Model/data parallelism | Jax | [102]

According to the different designs used in pre-training architectures, large-scale PTMs can be generally classified into three classes (as in Section 2.1): encoder only, decoder only, and encoder–decoder. The majority of large PTMs leverage the decoder-only or the encoder–decoder architecture, whereas only a few large models adopt an encoder-only design. This is because encoder-only models cannot perform well on generation tasks, such as text summarization and dialogue generation, while decoder-only models that are designed for language generation can shed light on not only NLG but also language understanding tasks via prevalent prompting techniques, as in GPT-3 [26].

• Encoder-only models at scale employ a bidirectional transformer encoder to learn contextual representations; they demonstrate impressive performance on NLU tasks. For example, DeBERTa1.5B [24], which consists of 48 transformer layers with 1.5 billion parameters, applied a disentangled attention mechanism and an enhanced mask decoder to surpass human performance on the SuperGLUE [30] benchmark. Since its bidirectional nature prevents the model from being directly used in NLG tasks, DeBERTa also trained a unified encoder–decoder version to adapt to NLG tasks.

• Decoder-only models use transformer decoders by applying autoregressive masks to prevent the current token from attending to future tokens. Examples include GPT-3 [26], CPM [27], and PanGu-α [28]. This line of PTMs aims at generating human-like texts. Turing-NLG [95] is a 17-billion-parameter language model that has achieved strong performance in language model benchmarks. GPT-3, with 175 billion parameters, can strikingly write samples that deceive human readers, demonstrating that large-scale language models can dramatically advance few-shot learning scenarios with in-context learning. In addition to English large-scale monolingual PTMs, there are also models for other languages such as Chinese and Korean. CPM [27] (2.6 billion parameters) and PanGu-α [28] (200 billion parameters) are two Chinese variants of GPT-3, while HyperCLOVA [96] is a 204-billion-parameter Korean variant.

• Encoder–decoder models can be further categorized into two classes: ① conventional seq2seq encoder–decoders and ② unified encoder–decoders. Conventional seq2seq encoder–decoders adopt the classic transformer encoder–decoder architecture for pre-training. Recent work includes T5 [25], the multilingual T5 (mT5) [97], and the large-scale cost-effective pre-trained language model (CPM-2) [98]. T5 [25], which has up to 11 billion parameters, unifies the NLP tasks in one framework by casting the language understanding and generation tasks in a text-to-text manner. As the multilingual variant of T5, mT5 [97], which has up to 13 billion parameters, has extended the monolingual data to 101 human languages and outperformed the previous SOTA results on a variety of multilingual benchmarks. CPM-2 [98], with 11 billion parameters, is a bilingual model trained on Chinese and English, whose mixture-of-experts (MoE) version, denoted as CPM-2-MoE, has 198 billion parameters. This model has demonstrated excellent general language intelligence via fine-tuning and prompting. Another kind of encoder–decoder model is the unified encoder–decoder framework, in which the encoder–decoder architecture shares the same module and applies different mask strategies for MLM and autoregressive language modeling. ERNIE 3.0 [39] jointly learns language understanding and generation by designing two separate heads for understanding and generation, which share a task-agnostic representation. As the third-generation PTM (with ten billion parameters) in the ERNIE series, ERNIE 3.0 combines the merits of both autoregressive causal language models and autoencoding models to train large-scale knowledge-enhanced PTMs. It has outranked the SOTA performance on a variety of NLP benchmarks, including SuperGLUE [30]. These methods have demonstrated superior performance because they all tend to unify multiple NLP tasks in one model and use different kinds of corpora or knowledge to enhance the performance.

Most of the above-mentioned large-scale models are trained on plain texts without integrating knowledge. Therefore, some researchers have attempted to incorporate knowledge such as linguistic knowledge and world knowledge into PTMs. ERNIE 3.0 pre-trained transformers on massive unstructured texts and knowledge graphs to learn lexical, syntactic, and semantic information. It enriched the PTMs through knowledge integration, phrase masking, and named-entity masking.

The dramatic progress in language PTMs has attracted research interest in multimodal pre-training [72,103–107]. Table 2 [69,103,104,107] lists the details of large-scale multimodal PTMs. DALL-E [69] is a 12-billion-parameter variant of GPT-3 that was trained on 250 million English text–image pairs to generate images according to language descriptions, thereby improving the zero-shot learning performance. ERNIE-ViLG [107] uses a unified GPT framework for bidirectional image–text generation, formulating both the image and text generation as autoregressive generative tasks. As a result, it outperforms previous methods on generative tasks such as text-to-image generation and image captioning with a ten-billion-parameter model pre-trained on 145 million high-quality Chinese text–image pairs. Moreover, the multi-modality-to-multi-modality multi-task mega-transformer (M6) [104] is a 100-billion-parameter transformer encoder, which is trained on over 1.9 TB of images and 292 GB of Chinese texts. M6 achieved strong performance in visual question answering, image captioning, and Chinese image–text matching. In addition to their improvements on multimodal tasks, these models can improve the performance of monomodal tasks, such as text classification, inference, summarization, and question generation [105]. These results show that multimodal pre-training can leverage multimodal information to enhance both image representation and text representation, which in turn improves the performance of both multimodal tasks and NLP tasks.

Table 2. Large-scale multimodal PTMs.

Model | Number of parameters | Denoising auto-encoder | Causal language model | Pre-training data | Training parallelism | Training platform | Reference
DALL-E | 12 billion | — | √ | 250 million English text–image pairs | Mixed-precision training | PyTorch | [69]
CogView | 4 billion | — | √ | 30 million English text–image pairs | — | PyTorch | [103]
M6 | 100 billion | √ | — | 1.9 TB images + 292 GB Chinese texts | MoE | — | [104]
ERNIE-ViLG | 10 billion | √ | √ | 145 million Chinese text–image pairs | Mixed-precision training | PaddlePaddle | [107]

2.2.2. Efficient training of large-scale models

The exponential increase in the size of PTMs has posed a great challenge for efficient training due to limited GPU memory and unaffordable training time. Therefore, it is non-trivial to leverage efficient training techniques to speed up large-scale model training.

2.2.2.1. Dense models. Data parallelism is a simple solution that allocates different data partitions to multiple workers and duplicates identical parameters at all workers. However, it usually suffers from a small per-GPU batch size. Another solution is model parallelism, in which model parameters are partitioned over different workers. However, conventional optimization algorithms require extra memory per parameter to store intermediate states, which hinders large models from being updated efficiently. Pipeline parallelism combines the merits of both model parallelism and data parallelism to reduce time costs. GPipe [108] uses a novel batch-splitting pipelining algorithm by first partitioning a mini-batch of training samples into smaller micro-batches and then aggregating the gradient updates simultaneously at the end. Megatron-LM [109] is an intra-layer model parallel approach for transformer networks, which adds a few synchronization primitives on the self-attention and multi-layer perceptron blocks. PTD-P [110] combines pipeline, tensor, and data parallelism across multi-GPU servers with a novel interleaved pipelining scheduling strategy, increasing the throughput by more than 10%. Recently, Colossal-AI [111] implemented a combination of data, pipeline, sequence, and multiple tensor parallelism for large-scale model training, which can be a good option for training dense models.

Table 3. SOTA performance with and without pre-training on NLU tasks.

NLU task | Sentiment analysis: SST-2 binary classification (accuracy) | Natural language inference: OCNLI (F1) | Nested named entity recognition: GENIA (F1) | Machine reading comprehension: DRCD (F1)
SOTA w/o pre-training | 93.2 | 59.80 | 74.80 | 78.03
SOTA w/ pre-training | 97.5 | 82.75 | 83.75 | 95.84

Results are from Refs. [39,116,117,119,120]. w/: with; w/o: without; SST-2: Stanford Sentiment Treebank v2; OCNLI: Original Chinese Natural Language Inference; DRCD: Delta Reading Comprehension Dataset.

role in improving the performance of NLG tasks. Large-scale PTMs automatically learn word combinations and sentence expressions from unannotated data, which significantly improves the models' ability in language generation in terms of fluency, coherence, and informativeness. ERNIE-GEN [93] uses an enhanced multi-flow seq2seq pre-training and fine-tuning framework and incorporates a span-by-span generation task to generate consecutive entities, which has achieved new SOTA results on five typical NLG tasks. Researchers and practitioners also pre-train task-specific transformer models on generation tasks, such as MASS [88] and PEGASUS [94]. More specifically, MASS adopts the encoder–decoder framework to reconstruct a sentence fragment, given the remaining part of the sentence, and achieves significant improvements over baselines without pre-training on machine translation. PEGASUS was used to pre-train a large-scale encoder–decoder model with a well-designed pre-training objective, which achieved a SOTA performance on all 12 text-summarization tasks. With the growth of the model size, PTMs gradually show notable ability in creative writing. Models such as GPT-3, HyperCLOVA, and ERNIE 3.0 are capable of generating articles, questions and answers, novels, and program code via only zero-shot learning. The quality of the generated texts is sometimes comparable with that of human-written texts. For example, humans only achieve 52% accuracy in distinguishing real news from fake news generated by GPT-3.

Table 4. SOTA performance with and without pre-training on NLG tasks.

NLG task | Text summarization: ESLC (ROUGE-L) | Dialogue generation: KdConv-film (BLEU-4) | Question generation: SQuAD 1.1 (BLEU-4) | Data-to-text generation: WebNLG (BLEU)
SOTA w/o pre-training | 23.44 | 5.40 | 15.87 | 63.69
SOTA w/ pre-training | 36.51 | 74.44 | 25.41 | 66.07

Results are from Refs. [94,122–125]. w/: with; w/o: without; ESLC: English Skills Learning Center; BLEU: bilingual evaluation understudy; ROUGE-L: recall-oriented understudy for gisting evaluation-longest common subsequence.

3.1.3. Dialogue

In the past few years, several representative dialogue-generation models have been pre-trained with human-like conversations collected from social media, including Twitter, Reddit, Weibo, and Baidu Tieba. Based on the general language model GPT-2 [83], DialoGPT [126] has been trained for response generation using Reddit comments. Meena [127] scales up the network to 2.6 billion parameters and employs more social media conversations in the training process, resulting in a significant improvement in response quality. To mitigate undesirable toxic or biased traits in large corpora, Blender [128] further fine-tunes the PTM with human-annotated datasets and emphasizes the desirable conversational skills of engagingness, empathy, and personality. In addition, to alleviate the safe-response problem in open-domain chitchat, PLATO [129] encodes the discrete latent variable into transformers for diverse response generation. Moreover, PLATO-2 [130] further scales up PLATO via curriculum learning for both Chinese and English response generation. The Ninth Dialog System Technology Challenge (DSTC-9) [131] revealed that PLATO-2 delivers a superior performance in multiple conversational tasks, including open-domain chitchat, knowledge-grounded dialogue, and task-oriented conversation. Recently, PLATO-XL [132] was scaled up to 11 billion parameters, with multi-party-aware pre-training being carried out to better distinguish roles in social media conversations. Other Chinese dialogue PTMs that have been developed on a modest scale include Cdial-GPT [133], ProphetNet-X [134], and EVA [135].

With these large-scale dialogue PTMs, some of the problems that plague traditional end-to-end neural approaches [136,137] are alleviated significantly, including deficiencies in response fluency and context relevance. Moreover, in comparison with existing chatbots that rely on complex frameworks, such as Mitsuku [138] and XiaoIce [139], these dialogue PTMs demonstrate superior performance in multi-turn conversations, especially in terms of engagingness and humanness.

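As a small illustration of how such dialogue PTMs are used at inference time, the sketch below runs multi-turn response generation with the openly released DialoGPT checkpoint through the HuggingFace transformers library. The decoding settings and turns are illustrative, not the configuration used by any of the systems above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

history = None
for turn in ["Hello, how are you?", "Any good movie to recommend?"]:
    # Append the user turn (terminated by EOS) to the running conversation context.
    new_ids = tokenizer.encode(turn + tokenizer.eos_token, return_tensors="pt")
    input_ids = new_ids if history is None else torch.cat([history, new_ids], dim=-1)
    history = model.generate(
        input_ids,
        max_length=200,
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_p=0.9,
    )
    reply = tokenizer.decode(history[0, input_ids.shape[-1]:], skip_special_tokens=True)
    print(reply)
```

Keeping the generated reply in the context window is what gives these models their multi-turn coherence, in contrast to earlier single-turn retrieval or template systems.
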
3.2. Key research challenges

Although PTMs have significantly improved the performance of NLP tasks, there are still some key challenges for PTM applications, such as interpretability, robustness, reasoning capability, and the deployment of large-scale PTMs. This section describes these challenges in the hope that additional future efforts can be devoted to these directions.

3.2.1. Deployability

One trend in PTMs is the substantial increase in capacity. Since the release of GPT [22] and BERT [23], PTMs have scaled exponentially with respect to both the number of parameters and the size of the pre-training data. For example, the largest version of GPT-3 [26] requires a total training computation of 3.64 × 10³ petaflop-days, resulting in a total of around 3.14 × 10²³ floating-point operations and costing millions of dollars. The rapid growth in model size raises concerns regarding the tradeoff between scale and deployability. Two types of strategy have been proposed to tackle this issue: ① Large-scale PTMs are only used as the foundation model via application programming interface (API) calls, similar to the way in which the GPT-3 model is used. This strategy enables the efficient use of PTMs and evades model deployment on each device, but significantly limits the model's application scope. ② Large models are compressed to smaller ones [140] for potential deployment. Typical techniques include model compression and knowledge distillation. Unfortunately, existing compressing techniques are unable to compress super-large PTMs (e.g., GPT-3) to a suitable size for deployment on a single GPU or a terminal device such as a laptop or cell phone. Advanced research in model compression is thus imperative in order to make large PTMs available to more users. Another promising direction is to use parameter-efficient techniques, such as prompt tuning [141–146], to reduce the memory budget of deployment; this remains a large area for further exploration.

3.2.2. Model trustworthiness

Another challenge of PTMs is their trustworthiness, which mainly involves their interpretability [147] and robustness [148]. Although PTMs have achieved SOTA performances across various tasks, how they make decisions is sometimes obscure to humans, which makes PTM models difficult to apply in fields where model interpretability is essential, such as healthcare and law [149]. Consequently, there is a growing interest in interpreting deep neural models [150]. In particular, many studies aim to understand what PTMs have learned in their representations [151].

Some studies have been published on the trustworthiness of deep neural models. These include: linguistic structural analyses of PTMs [152], which aim to analyze the linguistic knowledge that is learned by pre-trained language models and to understand the reason for their success; model behavioral analyses [153], which evaluate model robustness and reliability with multiple test sets; and post-hoc explanation analyses [154], which aim to provide understandable explanations for the predictions of deep neural models.

Despite the research that has already been done in this field, the following challenges must be addressed in order to build trustworthy systems: ① general interpretation methods for NLP tasks (existing interpretation methods are designed for classification tasks); ② causal analysis between model predictions and learned knowledge or extracted explanations; and ③ a comprehensive evaluation platform for interpretability, including evaluation data and metrics.

3.2.3. Commonsense knowledge and reasoning

Large-scale PTMs have been found to encode some commonsense knowledge [155]. Nevertheless, appropriate probing tasks need to be designed in order to mine the commonsense knowledge learned in PTMs, such as formulating a relational knowledge-extraction task as the completion of fill-in-the-blank statements, so as to examine the knowledge-learning ability of PTMs [156]. Although PTMs learn some knowledge from texts, there is still a large amount of knowledge that cannot be obtained from texts alone. One possible direction is to have models learn this kind of knowledge from both visual inputs and text inputs.

In addition to commonsense knowledge, other studies are questioning whether PTMs are endowed with reasoning abilities. For example, Talmor et al. [157] design different tasks to evaluate the reasoning abilities of PTMs. The researchers disentangle pre-training from fine-tuning and find that the reasoning capabilities are poor for most PTMs, revealing that existing PTMs lack the ability to reason. To alleviate this problem, one possible direction could be to integrate prior knowledge into the PTMs in order to guide the models to learn reasoning rules implicitly.

3.2.4. Model security

One severe issue with PTMs is their vulnerability to adversarial examples, which can mislead the model into producing a specific wrong prediction when perturbations are injected into the input [158]. This susceptibility exposes PTMs to safety concerns: The models can be easily attacked with adversarial patterns by third parties, resulting in irreparable loss in real-world applications. In addition to adversarial attacks, another form of attack, namely backdoor attacks, is a threat to PTMs. Unlike adversarial attacks, which usually act during the inference process of a neural model, backdoor attacks hack the model during training [159]. If a model is deliberately trained on backdoor data, it will be extremely dangerous for users to use this model in applications involving privacy and security concerns. Future work could aim to improve the robustness of PTMs toward adversarial attacks. To deal with backdoor attacks, a model should be able to detect in the input the triggers that can activate the backdoor attack and remove them, thus enhancing model security.

4. Applications of PTMs

4.1. Platforms and toolkits for applications

Due to their universality, PTMs have become foundation models in NLP. Many researchers have developed a series of open-source toolkits and platforms to make better use of PTMs. These toolkits and platforms usually contain various PTMs, fine-tuning tools, and model-compression tools.

4.1.1. Toolkits

When researchers propose a new pre-trained language model, they often open-source a corresponding toolkit for developers. Such toolkits usually provide code for downstream task development based on the specific model, and therefore lack generality. Typical toolkits include google-research/bert [160], PaddlePaddle/ERNIE [161], and PCL-Platform.Intelligence/PanGu-α [162]. These toolkits provide a series of open-sourced PTMs, such as BERT, ERNIE, and PanGu-α, along with source code and training data. For example, the ERNIE toolkit provides not only the source code, training data, and PTM of ERNIE but also a number of enhanced ERNIE series models, such as ERNIE-Doc [163] and ERNIE-ViL [70]. In order to deploy the ERNIE model to online services, the ERNIE toolkit also provides a model-compression tool.

With so many PTMs being released, using these models through a unified toolkit has become an urgent need. Given this background, toolkits for general NLP applications have been developed. Typical toolkits include HuggingFace/Transformers [164], Fairseq [165], and PaddleNLP [166]. PTMs are integrated in a user-friendly way into such general-purpose toolkits. Taking HuggingFace as an example, this toolkit integrates the code for different kinds of PTMs and code for downstream application development, including classification, generation, summarization, translation, question answering, and so forth.

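As an illustration of how such general-purpose toolkits expose PTMs, the snippet below loads a pre-trained checkpoint with the HuggingFace transformers library and also runs a ready-made downstream task through the high-level pipeline API. The checkpoint names are examples of publicly available models, not a recommendation of any particular one.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Low-level access: load a checkpoint and attach a classification head for fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
inputs = tokenizer("Pre-trained models make fine-tuning easy.", return_tensors="pt")
logits = model(**inputs).logits

# High-level access: pipelines bundle a default PTM for common downstream applications.
classifier = pipeline("sentiment-analysis")
summarizer = pipeline("summarization")
print(classifier("This toolkit is easy to use."))
```

The same two levels of access (raw model plus tokenizer for custom fine-tuning, and task pipelines for quick deployment) are the pattern most of the toolkits listed above follow.
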
question answering, advertisement generation, and product-name laborious tasks. Microsoft has also demonstrated that the pre-
generation. trained generation model Turing-NLG is beneficial for autosuggest
recommendations [95]. Moreover, many researchers have built
4.2. Applications various demo applications based on GPT-3, including applications
for ad generation, AI copywriting, book writing, code generation,
PTMs have been widely deployed in real applications, including customer service, and so forth. As for visual content creation,
document intelligence, content creation, virtual assistant, and pre-trained multimodal generative models such as DALL-E [69],
intelligent search engines. Below, we describe how PTMs are CogView [103], and ERNIE-ViLG [107] have greatly improved the
applied in each field. quality and fidelity of generated images. The results from CogView
have demonstrated this model’s capability to generate high-quality
4.2.1. Document intelligence images in a single domain such as industrial fashion design, so this
One widely studied application for PTMs is document intelli- model has been deployed in online fashion production.
gence, which includes sentiment analysis, news classification, In addition to these industrial applications, researchers have
anti-spam detection, and information extraction. Sentiment analy- shown the potential ability of PTMs for creative writing, including
sis is widely used to identify sentiment polarity, such as public poem generation [179], lyrics generation [27], e-mail auto comple-
opinion, for market research, brand reputation analysis, and social tion [180], to-do generation [181], auto-completion for sentences
media influence. Garg and Chatterjee [170] propose analyzing the and paragraphs, and even a long novel generation [22]. Although
sentiment of Twitter feeds using a PTM and classifying them into PTMs exhibit strong generative capabilities, an increasing number
three categories: positive, negative, and neutral. AlQahtani [171] of concerns have arisen regarding generative models, including pri-
proposes analyzing customer reviews on products by combining vacy and copyright.
data-mining techniques with PTMs. Recently, Singh et al. [172]
analyzed public sentiment on the impact of the coronavirus on 4.2.3. Virtual assistants
social life using a PTM. Chen and Sokolova [173] propose analyzing Virtual assistants are adopted in many applications nowadays.
the sentiments in the coronavirus disease 2019 (COVID-19)-related Typical applications include smart speakers, such as Alexa [182]
messages in a popular social media platform, where users share from Amazon and Xiaodu [129] from Baidu. Such applications have
their stories to seek support from other users, especially during used PTMs and have shown that PTMs can provide excellent lan-
the COVID-19 pandemic. Experimental results show that PTMs guage understanding ability for spoken language and voice recogni-
can achieve significant performance gain in classifying sentiment tion [183] in smart speakers. With the benefit brought by PTMs,
polarities, demonstrating the effectiveness of PTMs. these smart speakers can respond to weather forecast queries, sing
News classification and anti-spam detection can also be modeled as classification tasks. Ding et al. [163] apply PTMs to classify news into extreme left-wing or right-wing standpoints. Liu et al. [174] classify papers published on arXiv.org into 11 categories, including mathematics, computer science, and so forth. Jwa et al. [175] use BERT to detect fake news by analyzing the relationship between the headline and the body text of news articles.
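The headline–body consistency idea behind such BERT-based fake-news detection can be sketched as sentence-pair classification: the two texts are packed into a single input and the model scores their relationship. This is a hedged illustration, not the exBAKE configuration; the checkpoint and the two labels are placeholders, and a real system would first be fine-tuned on labeled headline–body pairs.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

headline = "City council approves new park budget"
body = "The council voted 7-2 on Tuesday to fund the downtown park expansion over three years."

# BERT-style encoders accept two segments; the tokenizer inserts [SEP] and sets the
# token type ids so the model can reason about the relationship between the two texts.
inputs = tokenizer(headline, body, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]
print({"consistent": float(probs[0]), "inconsistent": float(probs[1])})
```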
Document information extraction is widely used in industry. Many AI cloud services provide tools for information extraction [176], such as Google AI Cloud, Baidu AI Cloud, and Alibaba AI Cloud. Among these services, Baidu has built a PTM-based platform, TextMind, for document information-extraction applications, including receipt analysis for expense reimbursement, information extraction from resumes, financial statement analysis, contract analysis, and legal judgment analysis. Wayfair, one of the world's largest online home retailers, also applies BERT to extract information from customer messages.
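Framed as token classification, PTM-based document information extraction looks roughly like the sketch below: each token is tagged with a field type such as vendor, date, or amount. The tag set, checkpoint, and example text are hypothetical and only illustrate the interface; the industrial platforms mentioned above use their own domain-specific schemas and fine-tuned models.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

TAGS = ["O", "B-VENDOR", "I-VENDOR", "B-DATE", "I-DATE", "B-AMOUNT", "I-AMOUNT"]
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=len(TAGS))
model.eval()  # the tagging head must be fine-tuned on annotated receipts or resumes before use

text = "Paid 42.50 USD to Acme Office Supplies on 2021-10-03."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
print([(tok, TAGS[i]) for tok, i in zip(tokens, pred)])
```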
Document image understanding is another important research topic in document intelligence, which aims to automatically read, understand, and analyze business documents. A series of multimodal document PTMs [177] has been proposed to jointly model the interactions among text, image, and layout information in business documents for many document image understanding tasks, such as receipt understanding, document image classification, and document information extraction. Applica proposes a solution that takes layout, graphics, and text into consideration in order to extract precise answers for complex business processes in financial services, insurance services, life sciences, and so on.
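The core modeling idea behind such multimodal document PTMs, fusing each token with its 2D position on the page, can be sketched as follows. This is a conceptual toy in PyTorch rather than the architecture of any specific published model: token embeddings are summed with embeddings of the token's normalized bounding-box coordinates before a standard transformer encoder.

```python
import torch
import torch.nn as nn

class TextLayoutEmbedding(nn.Module):
    """Sum token embeddings with embeddings of each token's bounding box (x0, y0, x1, y1)."""
    def __init__(self, vocab_size=30522, hidden=256, max_coord=1001):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.coord = nn.ModuleList(nn.Embedding(max_coord, hidden) for _ in range(4))

    def forward(self, token_ids, boxes):  # boxes: (batch, seq, 4), integers in [0, 1000]
        emb = self.tok(token_ids)
        for i, table in enumerate(self.coord):
            emb = emb + table(boxes[..., i])
        return emb

embed = TextLayoutEmbedding()
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

ids = torch.randint(0, 30522, (1, 16))        # OCR tokens of a document region
boxes = torch.randint(0, 1001, (1, 16, 4))    # their normalized page coordinates
hidden_states = encoder(embed(ids, boxes))    # (1, 16, 256), ready for a task-specific head
print(hidden_states.shape)
```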
4.2.2. Content creation
Content creation tasks are usually designed to verify the performance of recently proposed large-scale models [22]. For example, Narrativa applies GPT-2 to content automation, generating high-quality advertisement content from just a few words provided by customers [178]. GPT-2 has also demonstrated its ability to generate content for e-commerce in order to relieve humans from customer service, and so forth. As for visual content creation, pre-trained multimodal generative models such as DALL-E [69], CogView [103], and ERNIE-ViLG [107] have greatly improved the quality and fidelity of generated images. The results from CogView have demonstrated this model's capability to generate high-quality images in a single domain, such as industrial fashion design, so the model has been deployed in online fashion production.

In addition to these industrial applications, researchers have shown the potential of PTMs for creative writing, including poem generation [179], lyrics generation [27], e-mail auto-completion [180], to-do generation [181], auto-completion for sentences and paragraphs, and even long novel generation [22]. Although PTMs exhibit strong generative capabilities, an increasing number of concerns have arisen regarding generative models, including privacy and copyright.
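These text-generation workflows, whether ad copy from a short brief, auto-completion, or creative writing, share the same core operation: sampling a continuation from a causal pre-trained language model. The snippet below is a minimal sketch using the public GPT-2 checkpoint; the prompt and sampling settings are illustrative assumptions, not any production pipeline.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Introducing the Aurora desk lamp:"   # the "few words provided by the customer"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, top_p=0.92,
                         temperature=0.8, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```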
4.2.3. Virtual assistants
Virtual assistants are adopted in many applications nowadays. Typical applications include smart speakers, such as Alexa [182] from Amazon and Xiaodu [129] from Baidu. Such applications use PTMs and have shown that PTMs can provide excellent language understanding for spoken language and voice recognition [183] in smart speakers. With the benefits brought by PTMs, these smart speakers can respond to weather forecast queries, sing songs on demand, and vocally control smart home devices. Moreover, smart speakers can chat with humans on a broad range of topics and thus establish a closer and more stable relationship between users and the system. In addition to their usage in smart speakers, PTMs have been deployed in mobile-phone-based virtual assistants, such as Siri and Google Assistant. For example, NDTV [184] reports that PTMs can improve interaction quality, while Vincent [185] suggests that PTMs can be used in intelligent customer service robots to recognize customer sentiment.

As PTMs are applied more and more widely in virtual assistants, the responses generated by chatbots are becoming more human-like. For example, Microsoft has proposed a PLM-based model called DialoGPT that learns from the comment history of Reddit and can fluently reply to users. Google has also used PLMs to develop a chatbot application that can "chat about anything" [127]. To make robots more human-like, Facebook applied PLMs to a series of dialogue chatbots named Blender and Blender 2.0 [128]. Shortly afterwards, Baidu proposed PLATO-XL [132], a PLM-based model, to further push chatbot performance and reach the state of the art (SOTA) in terms of both human evaluation and automatic evaluation metrics. Thanks to the performance improvements brought by PTMs, these applications can be very robust in interactions with users [186].
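As a concrete illustration of such PLM-based chatbots, the publicly released DialoGPT checkpoint can be queried in a few lines. The loop below follows the EOS-separated history format described on the model card; the sampling settings are illustrative, and generated replies will naturally vary from run to run.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

history = None
for user_turn in ["Hi there!", "Can you recommend a good sci-fi novel?"]:
    # append the user turn, terminated by the end-of-sequence token, to the dialogue history
    new_ids = tokenizer.encode(user_turn + tokenizer.eos_token, return_tensors="pt")
    history = new_ids if history is None else torch.cat([history, new_ids], dim=-1)
    prompt_len = history.shape[-1]
    history = model.generate(history, max_length=prompt_len + 60, do_sample=True, top_k=50,
                             pad_token_id=tokenizer.eos_token_id)
    print("Bot:", tokenizer.decode(history[0, prompt_len:], skip_special_tokens=True))
```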
4.2.4. Intelligent search
Aside from the applications mentioned above, PTMs are widely used in search engines. Google has already applied PTMs in Google Search and achieved significant improvements [187]. Baidu has also applied PTMs, ERNIE 2.0 [188] and ERNIE 3.0 [39], as the backbone to support semantic matching, encoding text into dense representations for better retrieval performance in Baidu Search [189]. Facebook [190] has revealed a unified embedding framework for personalized systems and noted that their future work will include PTMs.

To address the surging demand for multimedia content search, the performance of image and video search engines can be enhanced through the utilization of multimodal PTMs.
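The semantic-matching idea behind these systems can be sketched with any off-the-shelf encoder: queries and documents are mapped to dense vectors and ranked by similarity. The mean-pooling recipe and checkpoint below are simplifying assumptions; production systems such as the ERNIE-based retrieval cited above train dedicated dual encoders and serve them with approximate nearest-neighbor indexes rather than the brute-force scoring shown here.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

def embed(texts):
    """Mean-pool the encoder states of real (non-padding) tokens into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state            # (n, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

docs = ["How to reset a forgotten email password",
        "Weekend weather forecast for Beijing",
        "Best budget smartphones of the year"]
scores = embed(["I cannot log into my mail account"]) @ embed(docs).T   # cosine similarities
print(docs[int(scores.argmax())])
```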
Neural Information Processing Systems (NeurIPS 2020); 2020 Dec 7–12; online. 2020. p. 1877–901.
[27] Zhang Z, Han X, Zhou H, Ke P, Gu Y, Ye D, et al. CPM: a large-scale generative Chinese pre-trained language model. AI Open 2021;2:93–9.
[28] Zeng W, Ren X, Su T, Wang H, Liao Y, Wang Z, et al. PanGu-α: large-scale autoregressive pretrained Chinese language models with auto-parallel computation. 2021. arXiv:2104.12369.
[29] Wang S, Sun Y, Xiang Y, Wu Z, Ding S, Gong W, et al. ERNIE 3.0 Titan: exploring larger-scale knowledge enhanced pre-training for language understanding and generation. 2021. arXiv:2112.12731.
[30] Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 3266–80.
[31] Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, et al. ERNIE: enhanced representation through knowledge integration. 2019. arXiv:1904.09223.
[32] Xiong W, Du J, Wang WY, Stoyanov V. Pretrained encyclopedia: weakly supervised knowledge-pretrained language model. In: Proceedings of the 8th International Conference on Learning Representations (ICLR 2020); 2020 Apr 26–30; Addis Ababa, Ethiopia; 2020.
[33] Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, et al. K-BERT: enabling language representation with knowledge graph. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence; 2020 Feb 7–12; New York City, NY, USA. Palo Alto: AAAI Press; 2020. p. 2901–8.
[34] Sun T, Shao Y, Qiu X, Guo Q, Hu Y, Huang X, et al. CoLAKE: contextualized language and knowledge embedding. In: Proceedings of the 28th International Conference on Computational Linguistics; 2020 Dec 8–13; online. 2020. p. 3660–70.
[35] Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q. ERNIE: enhanced language representation with informative entities. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy. 2019. p. 1441–51.
[36] Peters ME, Neumann M, Logan IV RL, Schwartz R, Joshi V, Singh S, et al. Knowledge enhanced contextual word representations. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 43–54.
[37] Levine Y, Lenz B, Dagan O, Ram O, Padnos D, Sharir O, et al. SenseBERT: driving some sense into BERT. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 4656–67.
[38] Wang X, Gao T, Zhu Z, Zhang Z, Liu Z, Li J, et al. KEPLER: a unified model for knowledge embedding and pre-trained language representation. Trans Assoc Comput Linguist 2021;9:176–94.
[39] Sun Y, Wang S, Feng S, Ding S, Pang C, Shang J, et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation. 2021. arXiv:2107.02137.
[40] Wang R, Tang D, Duan N, Wei Z, Huang X, Ji J, et al. K-Adapter: infusing knowledge into pre-trained models with adapters. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 1405–18.
[41] Wu S, Dredze M. Beto, Bentz, Becas: the surprising cross-lingual effectiveness of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 833–44.
[42] Conneau A, Lample G. Cross-lingual language model pretraining. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 8–14; Vancouver, BC, Canada. 2019. p. 7057–67.
[43] Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, et al. Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 8440–51.
[44] Chi Z, Dong L, Wei F, Yang N, Singhal S, Wang W, et al. InfoXLM: an information-theoretic framework for cross-lingual language model pre-training. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021 Jun 6–11; online. 2021. p. 3576–88.
[45] Ouyang X, Wang S, Pang C, Sun Y, Tian H, Wu H, et al. ERNIE-M: enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. 27–38.
[46] Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, et al. DeCAF: a deep convolutional activation feature for generic visual recognition. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014); 2014 Jun 21–26; Beijing, China. 2014. p. 647–55.
[47] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2014 Jun 23–28; Columbus, OH, USA. 2014. p. 580–7.
[48] Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. 2017. p. 843–52.
[49] Schneider S, Baevski A, Collobert R, Auli M. Wav2vec: unsupervised pre-training for speech recognition. In: Proceedings of the 20th Annual Conference of the International Speech Communication Association (InterSpeech 2019); 2019 Sep 15–19; Graz, Austria. 2019. p. 3465–9.
[50] Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2009 Jun 20–25; Miami, FL, USA. 2009. p. 248–55.
[51] Mahajan D, Girshick R, Ramanathan V, He K, Paluri M, Li Y, et al. Exploring the limits of weakly supervised pretraining. In: Proceedings of the European Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany. 2018. p. 181–96.
[52] Zhai X, Kolesnikov A, Houlsby N, Beyer L. Scaling vision transformers. 2021. arXiv:2106.04560.
[53] Doersch C, Gupta A, Efros AA. Unsupervised visual representation learning by context prediction. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2015 Dec 7–13; Santiago, Chile. 2015. p. 1422–30.
[54] Noroozi M, Favaro P. Unsupervised learning of visual representations by solving jigsaw puzzles. In: Proceedings of the European Conference on Computer Vision (ECCV); 2016 Oct 8–16; Amsterdam, The Netherlands. 2016. p. 69–84.
[55] Misra I, van der Maaten L. Self-supervised learning of pretext-invariant representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 14–19; online. 2020. p. 6707–17.
[56] Gidaris S, Singh P, Komodakis N. Unsupervised representation learning by predicting image rotations. In: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018); 2018 Apr 30–May 3; Vancouver, BC, Canada; 2018.
[57] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16×16 words: transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations (ICLR 2021); 2021 May 3–7; Vienna, Austria; 2021.
[58] Van den Oord A, Li Y, Vinyals O. Representation learning with contrastive predictive coding. 2018. arXiv:1807.03748.
[59] He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020 Jun 14–19; online. 2020. p. 9729–38.
[60] Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020); 2020 Jul 12–18; online. 2020. p. 1597–607.
[61] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 8748–63.
[62] Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling up visual and vision–language representation learning with noisy text supervision. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 4904–16.
[63] Lu J, Batra D, Parikh D, Lee S. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 13–23.
[64] Tan H, Bansal M. LXMERT: learning cross-modality encoder representations from transformers. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China; 2019.
[65] Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW. VisualBERT: a simple and performant baseline for vision and language. 2019. arXiv:1908.03557.
[66] Sun C, Myers A, Vondrick C, Murphy K, Schmid C. VideoBERT: a joint model for video and language representation learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019 Oct 27–Nov 2; Seoul, Republic of Korea. 2019. p. 7464–73.
[67] Sun C, Baradel F, Murphy K, Schmid C. Learning video representations using contrastive bidirectional transformer. 2019. arXiv:1906.05743.
[68] Chuang YS, Liu CL, Lee H, Lee L. SpeechBERT: an audio-and-text jointly learned language model for end-to-end spoken question answering. In: Proceedings of the 21st Annual Conference of the International Speech Communication Association (Interspeech 2020); 2020 Oct 25–29; Shanghai, China. 2020. p. 4168–72.
[69] Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, et al. Zero-shot text-to-image generation. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 8821–31.
[70] Yu F, Tang J, Yin W, Sun Y, Tian H, Wu H, et al. ERNIE-ViL: knowledge enhanced vision–language representations through scene graphs. In: Proceedings of the 35th AAAI Conference on Artificial Intelligence; 2021 Feb 2–9; online. Palo Alto: AAAI Press; 2021. p. 3208–16.
[71] Gan Z, Chen YC, Li L, Zhu C, Cheng Y, Liu J. Large-scale adversarial training for vision-and-language representation learning. In: Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020); 2020 Dec 7–12; online. 2020. p. 6616–28.
[72] Cho J, Lei J, Tan H, Bansal M. Unifying vision-and-language tasks via text generation. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 1931–42.
[73] Kalyan KS, Rajasekharan A, Sangeetha S. AMMUS: a survey of transformer-based pretrained models in natural language processing. 2021. arXiv:2108.05542.
[74] Kaliyar RK. A multi-layer bidirectional transformer encoder for pre-trained word embedding: a survey of BERT. In: Proceedings of the 2020 10th International Conference on Cloud Computing, Data Science & Engineering (Confluence); 2020 Jan 29–31; Noida, India. 2020. p. 336–40.
[75] Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. 2021. arXiv:2107.13586.
[76] Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, et al. Recent advances in natural language processing via large pre-trained language models: a survey. 2021. arXiv:2111.01243.
[77] Li J, Tang T, Zhao WX, Wen JR. Pretrained language models for text generation: a survey. 2021. arXiv:2105.10311.
[78] Zaib M, Sheng QZ, Zhang W. A short survey of pre-trained language models for conversational AI—a new age in NLP. In: Proceedings of the Australasian Computer Science Week Multiconference (ACSW'20); 2020 Feb 3–7; Melbourne, VIC, Australia. 2020.
[79] Ramponi A, Plank B. Neural unsupervised domain adaptation in NLP—a survey. In: Proceedings of the 28th International Conference on Computational Linguistics; 2020 Dec 8–13; online. 2020. p. 6838–55.
[80] Qiu XP, Sun TX, Xu YG, Shao YF, Dai N, Huang XJ. Pre-trained models for natural language processing: a survey. Sci China Technol Sci 2020;63(10):1872–97.
[81] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the opportunities and risks of foundation models. 2021. arXiv:2108.07258.
[82] Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, et al. Pre-trained models: past, present and future. AI Open 2021;2:225–50.
[83] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are unsupervised multitask learners. San Francisco: OpenAI; 2019.
[84] Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 5754–64.
[85] Taylor WL. "Cloze procedure": a new tool for measuring readability. J Mass Commun Q 1953;30(4):415–33.
[86] Du Z, Qian Y, Liu X, Ding M, Qiu J, Yang Z, et al. GLM: general language model pretraining with autoregressive blank infilling. 2021. arXiv:2103.10360.
[87] Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O. SpanBERT: improving pre-training by representing and predicting spans. Trans Assoc Comput Linguist 2020;8:64–77.
[88] Song K, Tan X, Qin T, Lu J, Liu TY. MASS: masked sequence to sequence pre-training for language generation. In: Proceedings of the 36th International Conference on Machine Learning (ICML 2019); 2019 Jun 9–15; Long Beach, CA, USA. 2019. p. 5926–36.
[89] Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 7871–80.
[90] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: a robustly optimized BERT pretraining approach. 2019. arXiv:1907.11692.
[91] Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, et al. Unified language model pre-training for natural language understanding and generation. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 13042–54.
[92] Bao H, Dong L, Wei F, Wang W, Yang N, Liu X, et al. UniLMv2: pseudo-masked language models for unified language model pre-training. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020); 2020 Jul 12–18; online. 2020. p. 642–52.
[93] Xiao D, Zhang H, Li Y, Sun Y, Tian H, Wu H, et al. ERNIE-GEN: an enhanced multi-flow pre-training and fine-tuning framework for natural language generation. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI); 2021 Jan 7–15; Yokohama, Japan. 2021. p. 3997–4003.
[94] Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: pre-training with extracted gap-sentences for abstractive summarization. In: Proceedings of the 37th International Conference on Machine Learning (ICML 2020); 2020 Jul 12–18; online. 2020. p. 11328–39.
[95] Rosset C. Turing-NLG: a 17-billion-parameter language model by Microsoft [Internet]. Redmond: Microsoft; 2020 Feb 13 [cited 2021 Nov 4]. Available from: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/.
[96] Kim B, Kim HS, Lee SW, Lee G, Kwak D, Hyeon JD, et al. What changes can large-scale language models bring? Intensive study on HyperCLOVA: billions-scale Korean generative pretrained transformers. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. 3405–24.
[97] Xue L, Constant N, Roberts A, Kale M, Al-Rfou R, Siddhant A, et al. mT5: a massively multilingual pre-trained text-to-text transformer. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2021 Jun 6–11; online. 2021. p. 483–98.
[98] Zhang Z, Gu Y, Han X, Chen S, Xiao C, Sun Z, et al. CPM-2: large-scale cost-effective pre-trained language models. 2021. arXiv:2106.10715.
[99] Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. 2021. arXiv:2101.03961.
[100] Wu S, Zhao X, Yu T, Zhang R, Shen C, Liu H, et al. Yuan 1.0: large-scale pre-trained language model in zero-shot and few-shot learning. 2021. arXiv:2110.04725.
[101] Du N, Huang Y, Dai AM, Tong S, Lepikhin D, Xu Y, et al. GLaM: efficient scaling of language models with mixture-of-experts. 2021. arXiv:2112.06905.
[102] Rae JW, Borgeaud S, Cai T, Millican K, Hoffmann J, Song F, et al. Scaling language models: methods, analysis & insights from training Gopher. 2021. arXiv:2112.11446.
[103] Ding M, Yan Z, Hong W, Zheng W, Zhou C, Yin D, et al. CogView: mastering text-to-image generation via transformers. 2021. arXiv:2105.13290.
[104] Lin J, Men R, Yang A, Zhou C, Ding M, Zhang Y, et al. M6: a Chinese multimodal pretrainer. 2021. arXiv:2103.00823.
[105] Li W, Gao C, Niu G, Xiao X, Liu H, Liu J, et al. UNIMO: towards unified-modal understanding and generation via cross-modal contrastive learning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 2592–607.
[106] Huo Y, Zhang M, Liu G, Lu H, Gao Y, Yang G, et al. WenLan: bridging vision and language by large-scale multi-modal pre-training. 2021. arXiv:2103.06561.
[107] Zhang H, Yin W, Fang Y, Li L, Duan B, Wu Z, et al. ERNIE-ViLG: unified generative pre-training for bidirectional vision-language generation. 2021. arXiv:2112.15283.
[108] Huang Y, Cheng Y, Bapna A, Firat O, Chen D, Chen M, et al. GPipe: efficient training of giant neural networks using pipeline parallelism. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 103–12.
[109] Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: training multi-billion parameter language models using model parallelism. 2019. arXiv:1909.08053.
[110] Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 21); 2021 Nov 14–19; St. Louis, MO, USA; 2021.
[111] Bian Z, Liu H, Wang B, Huang H, Li Y, Wang C, et al. Colossal-AI: a unified deep learning system for large-scale parallel training. 2021. arXiv:2110.14883.
[112] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, et al. Outrageously large neural networks: the sparsely-gated mixture-of-experts layer. In: Proceedings of the 5th International Conference on Learning Representations (ICLR 2017); 2017 Apr 24–26; Toulon, France; 2017.
[113] Narang S, Diamos G, Elsen E, Micikevicius P, Alben J, Garcia D, et al. Mixed precision training. In: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018); 2018 Apr 30–May 3; Vancouver, BC, Canada; 2018.
[114] Rajbhandari S, Rasley J, Ruwase O, He Y. ZeRO: memory optimizations toward training trillion parameter models. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 20); 2020 Nov 9–19; Atlanta, GA, USA; 2020.
[115] Kim Y. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014 Oct 25–29; Doha, Qatar. 2014. p. 1746–51.
[116] Hu H, Richardson K, Xu L, Li L, Kübler S, Moss L. OCNLI: original Chinese natural language inference. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2020 Nov 16–20; online. 2020. p. 3512–26.
[117] Shao CC, Liu T, Lai Y, Tseng Y, Tsai S. DRCD: a Chinese machine reading comprehension dataset. 2018. arXiv:1806.00920.
[118] Schick T, Schütze H. Exploiting cloze-questions for few-shot text classification and natural language inference. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume; 2021 Apr 19–23; online. 2021. p. 255–69.
[119] Gray S, Radford A, Kingma DP. GPU kernels for block-sparse weights. 2017. arXiv:1711.09224.
[120] Lin H, Lu Y, Han X, Sun L. Sequence-to-nuggets: nested entity mention detection via anchor-region networks. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy. 2019. p. 5182–92.
[121] Lin Y, Meng Y, Sun X, Han Q, Kuang K, Li J, et al. BertGCN: transductive text classification by combining GCN and BERT. 2021. arXiv:2105.05727.
[122] Zhang R, Tetreault J. This email could save your life: introducing the task of email subject line generation. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy. 2019. p. 446–56.
[123] Zhou H, Zheng C, Huang K, Huang M, Zhu X. KdConv: a Chinese multi-domain dialogue dataset towards multi-turn knowledge-driven conversation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 7098–108.
[124] Cho J, Seo M, Hajishirzi H, et al. Mixture content selection for diverse sequence generation. 2019. arXiv:1909.01953.
[125] Ribeiro LFR, Zhang Y, Gardent C, Gurevych I. Modeling global and local node contexts for text generation from knowledge graphs. Trans Assoc Comput Linguist 2020;8:589–604.
[126] Zhang Y, Sun S, Galley M, Chen YC, Brockett C, Gao X, et al. DialoGPT: large-scale generative pre-training for conversational response generation. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations (ACL 2020); 2020 Jul 5–10; online. 2020. p. 270–8.
[127] Adiwardana D, Luong MT, So DR, Hall J, Fiedel N, Thoppilan R, et al. Towards a human-like open-domain chatbot. 2020. arXiv:2001.09977.
[128] Roller S, Dinan E, Goyal N, Ju D, Williamson M, Liu Y, et al. Recipes for building an open-domain chatbot. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume; 2021 Apr 19–23; online. 2021. p. 300–25.
[129] DuerOS [Internet]. Beijing: Baidu; c2017 [cited 2021 Nov 4]. Available from: https://dueros.baidu.com/en/index.html.
[130] Bao S, He H, Wang F, Wu H, Wang H, Wu W, et al. PLATO-2: towards building an open-domain chatbot via curriculum learning. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 2513–25.
[131] Gunasekara C, Kim S, D'Haro LF, Rastogi A, Chen YN, Eric M, et al. Overview of the ninth dialog system technology challenge: DSTC9. 2020. arXiv:2011.06486.
[132] Bao S, He H, Wang F, Wu H, Wang H, Wu W, et al. PLATO-XL: exploring the large-scale pre-training of dialogue generation. 2021. arXiv:2109.09519.
[133] Wang Y, Ke P, Zheng Y, Huang K, Jiang Y, Zhu X, et al. A large-scale Chinese short-text conversation dataset. In: Proceedings of the 9th CCF International Conference on Natural Language Processing and Chinese Computing (NLPCC 2020); 2020 Oct 14–18; Zhengzhou, China. 2020. p. 91–103.
[134] Qi W, Gong Y, Yan Y, Xu C, Yao B, Zhou B, et al. ProphetNet-X: large-scale pre-training models for English, Chinese, multi-lingual, dialog, and code generation. 2021. arXiv:2104.08006.
[135] Zhou H, Ke P, Zhang Z, Gu Y, Zheng Y, Zheng C, et al. EVA: an open-domain Chinese dialogue system with large-scale generative pre-training. 2021. arXiv:2108.01547.
[136] Vinyals O, Le Q. A neural conversational model. 2015. arXiv:1506.05869.
[137] Serban I, Sordoni A, Bengio Y, Courville A, Pineau J. Building end-to-end dialogue systems using generative hierarchical neural network models. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence; 2016 Feb 12–17; Phoenix, AZ, USA. Palo Alto: AAAI Press; 2016. p. 3776–83.
[138] Worswick S. "Mitsuku wins Loebner Prize 2018!" [Internet]. Medium; 2018 Sep 13 [cited 2021 Nov 4]. Available from: https://medium.com/pandorabots-blog/mitsuku-wins-loebner-prize-2018-3e8d98c5f2a7.
[139] Zhou L, Gao J, Li D, Shum HY. The design and implementation of XiaoIce, an empathetic social chatbot. Comput Linguist 2020;46(1):53–93.
[140] Xin J, Tang R, Lee J, Yu Y, Lin J. DeeBERT: dynamic early exiting for accelerating BERT inference. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 2246–51.
[141] Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, Laroussilhe QD, Gesmundo A, et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the 36th International Conference on Machine Learning (ICML 2019); 2019 Jun 9–15; Long Beach, CA, USA. 2019. p. 2790–9.
[142] Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 4582–97.
[143] Gao T, Fisch A, Chen D. Making pre-trained language models better few-shot learners. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 3816–30.
[144] Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. 3045–59.
[145] Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, et al. GPT understands, too. 2021. arXiv:2103.10385.
[146] Han X, Zhao W, Ding N, Liu Z, Sun M. PTR: prompt tuning with rules for text classification. 2021. arXiv:2105.11259.
[147] Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine learning. 2017. arXiv:1702.08608.
[148] Wallace E, Feng S, Kandpal N, Gardner M, Singh S. Universal adversarial triggers for attacking and analyzing NLP. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 2153–62.
[149] Fort K, Couillault A. Yes, we care! Results of the ethics and natural language processing surveys. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016); 2016 May 23–28; Portorož, Slovenia. 2016. p. 1593–600.
[150] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks: visualising image classification models and saliency maps. In: Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014); 2014 Apr 14–16; Banff, AB, Canada; 2014.
[151] Hewitt J, Manning CD. A structural probe for finding syntax in word representations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2019 Jun 2–7; Minneapolis, MN, USA. 2019. p. 4129–38.
[152] Jawahar G, Sagot B, Seddah D. What does BERT learn about the structure of language? In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy. 2019. p. 3651–7.
[153] Linzen T, Dupoux E, Goldberg Y. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Trans Assoc Comput Linguist 2016;4:521–35.
[154] Ribeiro MT, Singh S, Guestrin C. "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; 2016 Jun 12–17; San Diego, CA, USA. 2016. p. 1135–44.
[155] Davison J, Feldman J, Rush AM. Commonsense knowledge mining from pretrained models. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 1173–8.
[156] Petroni F, Rocktäschel T, Riedel S, Lewis P, Bakhtin A, Wu Y, et al. Language models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 2463–73.
[157] Talmor A, Elazar Y, Goldberg Y, Berant J. oLMpics-on what language model pre-training captures. Trans Assoc Comput Linguist 2020;8:743–58.
[158] Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y. TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. 2020. arXiv:2005.05909.
[159] Jia J, Liu Y, Gong NZ. BadEncoder: backdoor attacks to pre-trained encoders in self-supervised learning. 2021. arXiv:2108.00352.
[160] Devlin J. Google-research/bert [Internet]. GitHub; 2018 Oct 11 [cited 2021 Nov 4]. Available from: https://github.com/google-research/bert.
[161] Baidu Ernie Team. Paddlepaddle/ernie [Internet]. GitHub; 2019 Apr 19 [cited 2021 Nov 4]. Available from: https://github.com/PaddlePaddle/ERNIE.
[162] Huawei. Pcl-platform.intelligence/pangu-alpha [Internet]. San Francisco: OpenAI; 2021 Apr 26 [cited 2021 Nov 4]. Available from: https://git.openi.org.cn/PCL-Platform.Intelligence/PanGu-Alpha.
[163] Ding S, Shang J, Wang S, Sun Y, Tian H, Wu H, et al. ERNIE-Doc: a retrospective long-document modeling transformer. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 2914–27.
[164] Huggingface [Internet]. Hugging Face; 2020 Apr 26 [cited 2021 Nov 4]. Available from: https://huggingface.co.
[165] Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, et al. FAIRSEQ: a fast, extensible toolkit for sequence modeling. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Demonstrations); 2019 Jun 2–7; Minneapolis, MN, USA. 2019. p. 48–53.
[166] Baidu PaddlePaddle Team. Paddlepaddle/paddlenlp [Internet]. GitHub; 2020 Nov 16 [cited 2021 Nov 4]. Available from: https://github.com/PaddlePaddle/PaddleNLP.
[167] Wenxin ernie [Internet]. Beijing: Baidu; c2021 [cited 2021 Nov 4]. Available from: https://wenxin.baidu.com.
[168] Alibaba Damo Academy. AliceMind [Internet]. Aliyuncs; c2021 [cited 2021 Nov 4]. Available from: https://alicemind.aliyuncs.com.
[169] Openai API [Internet]. San Francisco: OpenAI; c2021 [cited 2021 Nov 4]. Available from: https://openai.com/api.
[170] Garg Y, Chatterjee N. Sentiment analysis of twitter feeds. In: Proceedings of the 3rd International Conference on Big Data Analytics (BDA 2014); 2014 Dec 20–23; New Delhi, India. 2014. p. 33–52.
[171] AlQahtani ASM. Product sentiment analysis for amazon reviews. Int J Comput Sci Inf Technol 2021;13(3):15–30.
[172] Singh M, Jakhar AK, Pandey S. Sentiment analysis on the impact of coronavirus in social life using the BERT model. Soc Netw Anal Min 2021;11:33.
[173] Chen Z, Sokolova M. Sentiment analysis of the COVID-related r/Depression posts. 2021. arXiv:2108.06215.
[174] Liu Y, Liu J, Chen L, Lu Y, Feng S, Feng Z, et al. ERNIE-SPARSE: learning hierarchical efficient transformer through regularized self-attention. 2022. arXiv:2203.12276.
[175] Jwa H, Oh D, Park K, Kang JM, Lim H. exBAKE: automatic fake news detection model based on bidirectional encoder representations from transformers (BERT). Appl Sci 2019;9(19):4062.
[176] Soares LB, FitzGerald N, Ling J, Kwiatkowski T. Matching the blanks: distributional similarity for relation learning. 2019. arXiv:1906.03158.
[177] Wang Z, Xu Y, Cui L, Shang J, Wei F. LayoutReader: pre-training of text and layout for reading order detection. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. 4735–44.
[178] gpt-2-for-the-advertising-industry [Internet]. San Francisco: OpenAI; 2017 Aug 1 [cited 2021 Nov 4]. Available from: https://www.narrativa.com/gpt-2-for-the-advertising-industry.
[179] Agarwal R, Kann K. Acrostic poem generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2020 Nov 16–20; online. 2020. p. 1230–40.
[180] Lee DH, Hu Z, Lee RKW. Improving text auto-completion with next phrase prediction. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. 4434–8.
[181] Mukherjee S, Mukherjee S, Hasegawa M, Awadallah AH, White R. Smart to-do: automatic generation of to-do items from emails. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 8680–9.
[182] What are Alexa Built-in Devices? [Internet]. Seattle: Amazon; c2010–2023 [cited 2021 Nov 4]. Available from: https://developer.amazon.com/alexa-voice-service.
[183] Mari A. Voice commerce: understanding shopping-related voice assistants and their effect on brands. In: Proceedings of the International Media Management Academic Association Annual Conference; 2019 Oct 4–6; Doha, Qatar; 2019.
[184] Google assistant update speech recognition name pronunciation BERT smart speakers [Internet]. NDTV; 2021 Apr 29 [cited 2021 Nov 4]. Available from: https://gadgets.ndtv.com/apps/news/google-assistant-update-speech-recognition-name-pronunciation-bert-smart-speak.
[185] Vincent J. The future of AI is a conversation with a computer [Internet]. New York City: The Verge; 2021 Nov 1 [cited 2021 Nov 4]. Available from: https://www.theverge.com/22734662/ai-language-artificial-intelligence-future-models-gpt-3-limitations-bias/.
[186] Meet the AI powering today's smartest smartphones [Internet]. San Francisco: Wired; 2017 Aug 1 [cited 2021 Nov 4]. Available from: https://www.wired.com/sponsored/story/meet-the-ai-powering-todays-smartest-smartphones.
[187] Nayak P. Understanding searches better than ever before [Internet]. Google; [cited 2021 Nov 4]. Available from: https://blog.google/products/search/search-language-understanding-bert/.
[188] Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, et al. ERNIE 2.0: a continual pre-training framework for language understanding. In: Proceedings of the 34th AAAI Conference on Artificial Intelligence; 2020 Feb 7–12; New York City, NY, USA. Palo Alto: AAAI Press; 2020. p. 8968–75.
[189] Liu Y, Lu W, Cheng S, Shi D, Wang S, Cheng Z, et al. Pre-trained language model for web-scale retrieval in Baidu Search. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 21); 2021 Aug 14–18; online. 2021. p. 3365–75.
[190] Huang JT, Sharma A, Sun S, Xia L, Zhang D, Pronin P, et al. Embedding-based retrieval in Facebook Search. In: Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 20); 2020 Jul 6–10; online. 2020. p. 2553–61.
[191] Yu P, Fei H, Li P. Cross-lingual language model pretraining for retrieval. In: Proceedings of the Web Conference; 2021 Apr 19–23; online. 2021. p. 1029–39.
[192] Ni M, Huang H, Su L, Cui E, Bharti T, Wang L, et al. M3P: learning universal representations via multitask multilingual multimodal pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021 Jun 19–25; online. 2021. p. 3977–86.
[193] Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 2019. arXiv:1910.01108.
[194] Gordon MA, Duh K, Andrews N. Compressing BERT: studying the effects of weight pruning on transfer learning. In: Proceedings of the 5th Workshop on Representation Learning for NLP; 2020 Jul 9; Seattle, WA, USA. 2020. p. 143–55.
[195] Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K. I-BERT: integer-only BERT quantization. In: Proceedings of the 38th International Conference on Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 5506–18.