0% found this document useful (0 votes)

43 views

Pre Trained Models For NLP

Machine learning notes

Uploaded by

Muhammad Junaid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

43 views

Pre Trained Models For NLP

Machine learning notes

Uploaded by

Muhammad Junaid

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 15

Engineering 25 (2023) 51–65

Contents lists available at ScienceDirect

Engineering
journal homepage: www.elsevier.com/locate/eng

Research
Artificial Intelligence—Review

Pre-Trained Language Models and Their Applications

Haifeng Wang a,⇑, Jiwei Li b, Hua Wu a, Eduard Hovy c, Yu Sun a
a
Baidu Inc., Beijing 100193, China
b
College of Computer Science and Technology, Zhejiang University, Hangzhou 310058, China
c
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

a r t i c l e i n f o a b s t r a c t

Article history: Pre-trained language models have achieved striking success in natural language processing (NLP), leading
Received 10 November 2021 to a paradigm shift from supervised learning to pre-training followed by fine-tuning. The NLP community
Revised 8 March 2022 has witnessed a surge of research interest in improving pre-trained models. This article presents a com-
Accepted 5 April 2022
prehensive review of representative work and recent progress in the NLP field and introduces the taxon-
Available online 7 September 2022
omy of pre-trained models. We first give a brief introduction of pre-trained models, followed by
characteristic methods and frameworks. We then introduce and analyze the impact and challenges of
Keywords:
pre-trained models and their downstream applications. Finally, we briefly conclude and address future
Pre-trained models
Natural language processing
research directions in this field.
Ó 2022 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and
Higher Education Press Limited Company. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. A brief history of pre-trained models gating mechanism. With the emergence of the model known as
transformer [9], considerable efforts have been devoted to building
The concept of pre-training is related to transfer learning [1]. stronger and more efficient language models based on the trans-
The idea of transfer learning is to reuse the knowledge learned former architecture [10–14]. In neural language modeling, dis-
from one or more tasks and apply it to new tasks. Traditional trans- tributed word representations named ‘‘word embeddings” that
fer learning employs annotated data for supervised training, which are learned with models such as Word2Vec [15] and GloVe [16]
has been the common practice for at least a decade. Within deep have become common initializations for the word vectors of deep
learning, pre-training with self-supervised learning on massive learning models, significantly improving the performance of down-
unannotated data has become the dominant transfer learning stream tasks such as named-entity recognition [16], part-of-speech
approach. The difference is that pre-training methods use unanno- tagging [17], and question answering [18].
tated data for self-supervised training and can be applied to vari- Although methods that leverage static word embeddings for
ous downstream tasks via fine-tuning or few-shot learning. warm startup can improve the performance of downstream NLP
In natural language processing (NLP), model pre-training is tasks, they lack the ability to represent different meanings of
based on the task of language modeling. The goal of language mod- words in context. To solve this problem, context–aware language
eling is to predict the next token, given a history of unannotated models were proposed to incorporate the complete context infor-
texts [2–4]. The first milestone of neural language modeling mation into the training procedure. Dai and Le [19] introduced
appears in Ref. [5], which models n-gram probabilities through dis- context–aware language modeling, which uses unannotated data
tributed representations of words and feed-forward neural net- to improve sequence learning with recurrent networks. This
works. Since then, deep learning methods have begun to achieves significant performance improvement in sentiment anal-
dominate the training paradigm of language modeling. In early ysis, text classification, and object classification tasks. In 2017, con-
methods for neural language modeling, recurrent neural networks textualized word vectors were proposed, which are derived from
(RNNs) were widely used [6,7]. Among the RNN family, long short- an encoder that is pre-trained on machine translation and then
term memory (LSTM) [8] stands out due to its advantage of being transferred to a variety of downstream NLP tasks [20]. However,
less prone to the gradient vanishing problem via its well-designed these studies use a small amount of data for pre-training and do
not achieve consistent performance improvement across all NLP
⇑ Corresponding author. tasks. Nonetheless, these pioneering studies greatly motivated
E-mail address: [email protected] (H. Wang). follow-up pre-training methods for context modeling.

https://doi.org/10.1016/j.eng.2022.04.024
2095-8099/Ó 2022 THE AUTHORS. Published by Elsevier LTD on behalf of Chinese Academy of Engineering and Higher Education Press Limited Company.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

In another pioneering study on pre-trained models (PTMs), knowledge, greatly promoting the capacity of knowledge memo-
embeddings from language models were proposed to leverage rization and reasoning in PTMs [39].
bidirectional LSTMs in order to learn contextual word representa- However, the aforementioned models only focus on rich-
tions, and the pre-trained contextual embeddings were then resource languages, such as English and Chinese, and thus may
applied to downstream tasks [21]. This method demonstrated overlook numerous low-resource languages. Recent work on mul-
great improvements in a broad range of NLP tasks, including ques- tilingual models is aiming to transfer knowledge from rich-
tion answering, textual entailment, sentiment analysis, semantic resource languages to low-resource languages by modeling the
role labeling, coreference resolution, and named-entity extraction. semantic representation of disparate languages in a unified vector
Since then, numerous PTMs within the ‘‘pre-training then fine- space. Inspirited by BERT, multilingual BERT (mBERT) was devel-
tuning” paradigm have started to emerge. Generative pre-training oped and released; this model is trained via multilingual masked
(GPT) [22] was the first model to use unidirectional transformers language modeling (MMLM) on multilingual corpora [41]. From
as the backbone for the GPT of language models, thereby illustrat- an intuitive perspective, the use of parallel corpora is conducive
ing the dramatic potential of pre-training methods for diverse to learning cross-lingual representations in different languages.
downstream tasks. Following GPT [23], the first model to leverage Therefore, cross-lingual language model (XLM) [42] leverages
bidirectional transformers was called Bidirectional Encoder Repre- bilingual sentence pairs to perform translation language modeling
sentations from Transformers (BERT); this model learns bidirec- (TLM), which encourages models to align the representations of
tional contexts by means of conditioning on both the left and the two languages together. Researchers have also released more mul-
right contexts in deep stacked layers. BERT introduced a denoising tilingual language models, such as XLM-RoBERTa (XLM-R) [43],
autoencoding pre-training task, termed masked language model- InfoXLM [44], and ERNIE-M [45], by improving MMLM or TLM.
ing (MLM), to recover the corrupted tokens of input sentences These studies have demonstrated that pre-trained multilingual
according to their contexts, in what was akin to a cloze task. This language models can significantly improve performance of multi-
approach greatly boosted the performance gain of downstream lingual NLP tasks or low-resource language tasks.
natural language understanding (NLU) tasks. In this type of pre- Given the success of PTMs in NLP, these models have quickly
training, which is also known as self-supervised learning, the been extended to other fields such as computer vision [46–48]
pre-training labels are derived from unannotated data. By resorting and speech processing [49]. Although self-supervised pre-training
to web-scale unannotated data from the Internet, PTMs can auto- has been the most successful transfer learning method in NLP,
matically learn syntactic and semantic representations. the PTMs used for computer vision tasks are diversified. The dom-
The great success of PTMs has attracted a wide range of interest inant method in computer vision tasks is still supervised learning.
in scaling them up and exploring the boundaries of pre-training Sun et al. [48] show that representation learning holds promise for
techniques; examples include decoding-enhanced BERT with dis- advancing model performance based on large-scale (noisy) anno-
entangled attention (DeBERTa) [24], text-to-text transfer trans- tated datasets, such as ImageNet [50] or JTF300M [48]. These
formers (T5) [25], GPT-3 [26], large-scale generative Chinese pre- methods learn visual representations and significantly improve
trained language model (CPM) [27], PanGu-a [28], and ERNIE 3.0 the performance of various downstream vision tasks [48]. Self-
Titan [29]. Large-scale PTMs, such as GPT-3, have now demon- supervised pre-training have also been explored in computer
strated the powerful capabilities of zero-shot and few-shot learn- vision [51–56]. Doersch et al. [53] propose various prediction tasks
ing. With dozens of examples, GPT-3 achieved a performance as propse tasks to learn visual representations. Dosovitskiy et al.
similar to that of BERT, being fine-tuned with tens of thousands [57] explore the masked patch prediction task using transformer
of pieces of data on SuperGLUE [30]. GPT-3 can also generate architecture for images and demonstrates that pre-trained trans-
high-quality creative texts so that even humans cannot determine formers achieve excellent results compared with convolutional
whether or not the texts are written by a human. The success of neural networks (CNNs).
GPT-3 makes it possible to use this model for general-purpose text Recently, contrastive learning has been successfully utilized for
generation, which was considered to be impossible in the past visual self-supervised pre-training. Contrastive predictive coding
decades. [58] has achieved strong results in various scenarios, including
Another line of pre-training methods has attempted to incor- speech, image, and text. These methods [58–60] attempt to maxi-
porate knowledge in order to enhance the representation capabil- mize the similarity of two augmentations of an image and mini-
ity of PTM [31]. Some studies employ linguistic knowledge to mize the similarity of different images with contrastive loss.
design entity-related tasks with weak supervision. For example, More recently, pre-training methods have been advanced by utiliz-
they corrupt entity spans in texts and use knowledge-masking ing language supervision for visual representation learning [61],
strategies such as entity-level or phrase-level masking [31] and achieving a strong performance in image classification tasks and
entity replacement prediction [32] to better learn lexical, syntac- other vision tasks.
tic, and semantic information from texts. Another direction of Pre-training methods have also been applied to multimodal
research integrates structured knowledge together with plain applications, in which texts are combined with other modalities,
texts into pre-training, such as knowledge-enabled BERT (K- such as images [62–65], videos [66,67], and speech [68], enabling
BERT) [33], contextualized language and knowledge embedding a broad application scope of PTMs. Such methods [63] significantly
(CoLAKE) [34], enhanced language representation with informa- improve the performance of various multimodal tasks by jointly
tive entities (ERNIE-THU) [35], knowledge-enhanced BERT learning task-agnostic representations of images and texts. Based
(KnowBERT) [36], SenseBERT [37], knowledge embedding and on the transformer architecture, PTMs build cross-modal semantic
pre-trained language representation (KEPLER) [38], and ERNIE alignments from large-scale image-text pairs. For image genera-
3.0 [39]. ERNIE 3.0, which powers PTMs with knowledge, has tion, DALL-E [69] and CLIP-guided generation [61] leverage multi-
achieved new state-of-the-art (SOTA) performances across 54 modal language and vision input to render compelling visual
Chinese NLP benchmarks, as well as some English benchmarks, scenes. Although the most commonly used pre-training tasks for
including SuperGLUE [30]. Moreover, K-Adapter [40] uses multi- multimodal context are MLM and masked region prediction, Yu
ple adapters for different tasks independently in order to better et al. [70] propose knowledge-enhanced scene graph prediction
fuse various knowledge sources and mitigate catastrophic forget- to capture the alignments of more detailed semantics. Gan et al.
ting. Knowledge-based incorporation has dramatically improved [71] incorporate adversarial training into pre-training and achieves
knowledge sharing between unstructured text and structured higher performance. Cho et al. [72] formulate multimodal
52
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

pre-training as a unified language modeling task based on multi- former encoders only; and transformer decoder–encoders. A brief
modal context. This demonstrates that PTMs are playing a critical description of each category is given below, and more detail is pro-
role in the artificial intelligence (AI) community and will poten- vided in the subsections that follow.
tially promote the unification of the pre-training framework across Transformer decoders only frameworks use a unidirectional
research fields such as speech, computer vision, and NLP. (left-to-right) transformer decoder as the pre-training back-
There are some existing reviews on PTMs. Some focus on partic- bone and predict tokens in a unidirectional autoregressive
ular types and applications of PTMs, such as transformer-based fashion. Here, ‘‘auto-regression” refers to predicting the cur-
pre-trained language models [73], BERT-based training techniques rent token based on historical tokens—that is, the partial
[74], prompted-based learning [75], data augmentation [76], text sequence on the left of the current token. More specifically,
generation [77], and conversational agent design [78]. Another line given the text sequence x ¼ ðx1 ; x2 ; x3 ; . . . ; xT Þ (where x is
provides a panoramic perspective of the whole progress of PTMs. the original sentence, xt (t = 1, 2, . . ., T) is the tth token, and
For example, Ramponi and Plank [79] provide an overview from T is the sequence length), an autoregressive model factorizes
early traditional non-neural methods to PTMs in NLP. Qiu et al. the likelihood of the input text sequence as
Q
[80] systematically categorize existing PTMs from four different pðxÞ ¼ Tt¼1 pðxt jx<t Þ , where p is the likelihood of the input
perspectives and outlines some potential directions of PTMs for text sequence.
future research. Bommasani et al. [81] propose the concept of Transformer encoder only frameworks leverage a bidirec-
foundation models to unify PTMs in different subfields such as tional transformer encoder and aim to recover corrupted
NLP, computer vision, and speech, and analyzes their opportunities tokens, given the input sentences with randomly masked
and challenges in various AI domains. Han et al. [82] take a deep tokens.
look into the history of PTMs to reveal the crucial position of PTMs Transformer encoder–decoder frameworks aim at pre-
in the AI development spectrum. In our review, we mainly focus on training a sequence-to-sequence (seq2seq) generation model
the PTMs in NLP: We first provide a detailed analysis of different by masking tokens on the source side and recovering them on
PTMs and trends in PTMs at scale, discussing their impact on the the target side. These frameworks consist of two classes:
field of NLP and the main challenges of PTMs; we then focus on ① seq2seq encoder–decoders, which consist of a bidirec-
our observations of and practices in the industrial applications of tional transformer encoder and a unidirectional decoder with
PTMs. separate parameters; and ② unified encoder–decoders, in
In this paper, we will first summarize the methods and taxon- which a bidirectional transformer encoder and a left-to-
omy of pre-trained language models in Section 2, followed by a dis- right decoder are simultaneously pre-trained with shared
cussion of the impact and challenges of pre-trained language model parameters.
models in Section 3. Next, we will introduce the industrial applica-
tions of pre-training techniques in Section 4. Finally, we will con- 2.1.1. Transformer decoders only
clude and address potential future work in this area. The objective for language modeling is to predict the next token
auto-regressively, given its history. The nature of auto-regression
entails the future invisibility of input tokens at each position; that
2. Methods of PTMs is, each token can only attend to the preceding words. GPT [22] was
the first model to use the transformer decoder architecture as its
2.1. Different frameworks and extensions of PTMs backbone. Given a sequence of words as context, GPT computes
the probability distribution of the next word with the masked
When working with PTMs, it is essential to design efficient multi-head self-attention of the transformer. In the fine-tuning
training methods that can fully use unannotated data and assist phase, the pre-trained parameters are set as the initialization of
downstream fine-tuning. In this section, we briefly introduce some the model for downstream tasks. GPT is pre-trained on the
widely used pre-training frameworks to date. Fig. 1 summarizes BooksCorpus dataset, which is nearly the same size as the 1B Word
the existing prevalent pre-training frameworks, which can be clas- Benchmark. It has hundreds of millions of parameters and
sified into three categories: transformer decoders only; trans- improves SOTA results on nine out of 12 NLP datasets, showing

Fig. 1. An illustration of the existing prevalent pre-training frameworks, where x is the original sentence, xt (t = 1, 2, . . ., T) is the tth token, T is the sequence length, and M(x) is
the set of masked tokens in x. S denotes the start token embedding of a sequence. p1, p2, p3, and p4 denote the position embeddings of the first to fourth tokens. P is the
conditional probability. i and j indicate the start and the end indices of input tokens of the encoder, respectively.

53
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

the potential of large-scale PTMs. GPT-2 [83] follows the unidirec- BERT with single-sequence training. SpanBERT outperforms BERT
tional framework with a transformer decoder that was trained on span-related tasks such as question answering and coreference
with a larger corpus, namely, WebText, and 1.5 billion model resolution. Like SpanBERT, which uses lexical analysis and chunk-
parameters. GPT-2 achieves SOTA results on seven out of eight ing tools to locate the span boundary, enhanced representation
tested language modeling datasets in a zero-shot setting. GPT-3 through knowledge integration (ERNIE) [31] uses a Chinese tok-
[26] further increases the parameters of the transformer to 175 bil- enizer to obtain phrase information and then replaces the random
lion and introduces in-context learning. Both GPT-2 and GPT-3 can token masking in BERT with the entity or phrase masking. ERNIE
be applied to downstream tasks without fine-tuning. They achieve also utilizes a named-entity recognition toolkit to identify the
a strong performance by scaling up the model size and dataset size. entity boundary and randomly masks tokens at the entity level,
Unidirectional language modeling lacks attention on its full thus enabling the integration of external knowledge into model
contexts on both sides, which may degrade its performance on pre-training.
downstream tasks. To tackle this problem, Yang et al. [84] propose
the use of permuted language modeling (PLM), which performs 2.1.3. Transformer encoder–decoders
autoregressive modeling on permuting input tokens. For example, Transformer encoder–decoder architecture is dedicated to nat-
a permutation of the sentence ‘‘I love the movie” can be ‘‘I the ural language generation (NLG) tasks. Unlike NLU, which focuses
movie love.” Once the permutation is chosen, the last few tokens on comprehending texts, NLG aims to generate a coherent, mean-
of the permuted sentence are the target to predict. In the above ingful, and human-like natural language expression according to
example, the token ‘‘love” is the target, depending on the visible specific inputs. For example, the goal of machine translation is to
context ‘‘I the movie.” An advantage of PLM is that it can fully generate a sentence in the target language with the same meaning
leverage the contextual information for different masked tokens, as the given source language input; for text summarization, the
thus building dependent context relationships with both preceding goal is to generate a short version of the input document that cap-
and successive words. To enable PLM, Yang et al. [84] propose a tures the core meanings and opinions. The critical point is to model
novel two-stream self-attention mechanism, with one query two sequences simultaneously—one for the input and the other for
stream to compute the query vectors and another content stream the output.
to compute the key/context vectors. The two-stream self- Song et al. [88] proposes Masked Sequence-to-Sequence Learn-
attention approach evades the leakage of visible context to the ing (MASS) for language generation, in order to pre-train a seq2seq
masked positions. model. The basic idea of MASS is to take a sentence with a masked
fragment (i.e., several consecutive tokens) as input and predict the
2.1.2. Transformer encoders only masked fragment conditioned on the encoder representations. In
Pre-trained transformer encoders, such as BERT [23], have this way, MASS successfully transforms the transformer encoder
become the standard in NLP systems. BERT uses an MLM frame- framework into an autoregressive framework by masking on the
work with a transformer as the backbone. In the pre-training stage, source side and predicting on the target side. MASS uses monolin-
BERT randomly replaces tokens with a special token [MASK] and gual data from the News Crawl Datasets of Workshop on Machine
tries to recover corrupted words based on their contextual repre- Translation (WMT) to pre-train the model, and shows substantial
sentations. It also adopts an objective of next-sentence prediction improvement on machine translation quality in comparison with
(NSP) to capture the discourse relations between two sentences, models directly trained on annotated data.
which is helpful for sentence-level tasks, such as question answer- Pre-training on both a transformer encoder and a transformer
ing. Devlin et al. [23] refer to this procedure as a cloze task, accord- decoder results in a unified model that can simultaneously deal
ing to Ref. [85]. BERT was pre-trained on a combination of the with both language understanding and language generation. One
BooksCorpus (800 M words) and English Wikipedia (2500 M member of this class is the standard transformer encoder–decoder
words), and achieved great improvements on 17 NLP tasks, attain- model that does not share unified encoder and decoder compo-
ing a level even better than a human performance on some of the nents. Bidirectional and Auto-Regressive Transformers (BART)
downstream tasks. However, BERT’s shortcomings are also obvi- [89] proposes a similar objective as MASS, but differs in that MASS
ous: Because the [MASK] token does not appear in real data during masks a consecutive series of tokens—that is, n-grams of the
fine-tuning, it creates a mismatch between pre-training and fine- input—while BART corrupts text with an arbitrary noising func-
tuning. To amend this discrepancy, BERT uses a novel method to tion—that is, masking/deleting/replacing/exchanging random
mask tokens: Among the 15% of the random positions that would tokens in different positions. BART can be viewed as a combination
have to be masked, only 80% are replaced by the [MASK] token, of the above two architectures: The random masking strategy on
while 10% are kept as the original tokens, and 10% are replaced the source side enables the model to deal with NLU tasks, and
by random tokens in the training process. This masking strategy the overall seq2seq pre-training framework enables the model to
causes the model to take more steps to converge, since only 15% be generalized to NLG tasks. Pre-trained on 160 GB data of news,
of the tokens in the training batch are predicted. Another problem books, stories, and web text, BART achieves comparable results to
with BERT is that it predicts tokens independently without consid- RoBERTa [90] and new SOTA results on dialogue and abstractive
ering other masked tokens. The model proposed in Ref. [86], a uni- text summarization. Another member of this category unifies the
fied encoder–decoder model, tends to solve this problem by encoder and decoder as identical transformer blocks. Dong et al.
blanking out text spans of input sentences and predicting the [91] and Bao et al. [92] also propose a unified language model
masked span auto-regressively, which mitigates the independent pre-training framework for NLU and generation. These studies par-
assumption of masked tokens within the same span in the pre- tition the self-attention matrix into three components: the bidirec-
training of masked language models. tional component, the unidirectional component, and the seq2seq
Following the success of BERT, an enormous amount of research component, which respectively stand for unidirectional, bidirec-
effort has gone into MLM. SpanBERT [87] is designed to predict tional, and seq2seq language models. Their experiments show per-
spans of text. It chooses to mask random contiguous spans instead formance gains over using a single pre-training objective. Du et al.
of random tokens, and a span boundary prediction objective is [86] propose a variant of the model reported in Ref. [91], putting
introduced to force the model to predict masked spans according the masked tokens on the right of the unmasked tokens and con-
to the structural information of the span boundaries. It also ducting autoregressive blank filling. Xiao et al. [93] mask multiple
achieves better performance by replacing the NSP objective in segments at different granularities to encourage the decoder to
54
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

rely more on the encoder representations, thus enhancing the cor- of training recipes including exponentially increased trainable
relation between the encoder and the decoder. Zhang et al. [94] parameters, pre-training architectures, knowledge enhancement,
adopt a different approach: First, a sentence is removed according language-specific corpora, and different pre-trained tasks to
to the pre-defined importance criteria from an input document, support the billion-level training of PTMs. Although training
and then the removed sentence is generated based on the remain- methods differ among these models, all the PTMs use transformers
ing context sentences. This strategy performs auto-regression at [9] as the standard backbone due to the latter’s efficient parallel
the sentence level and prompts whole-document understanding computing performance. Since training large-scale models requires
and summary-like generation. Experiments on 12 downstream massive unsupervised data, research on scaling up PTMs focuses on
summarization tasks demonstrate SOTA results, showing the effec- high-resource languages such as English and Chinese.
tiveness of the gap-sentence pre-training method. According to the different designs used in pre-training architec-
tures, large-scale PTMs can be generally classified into three
2.2. Scaling up PTMs classes (as in Section 2.1): encoder only, decoder only, and
encoder–decoder. The majority of large PTMs leverage the decoder
Recent advances in NLP have demonstrated a promising trend only or the encoder–decoder architecture, whereas only a few large
toward scaling up PTMs with billions of parameters. OpenAI models adopt an encoder-only design. This is because encoder-
researchers trained a model called GPT-3, which has 175 billion only models cannot perform well on generation tasks, such as text
parameters [26]. GPT-3 achieves strong performance on many summarization and dialogue generation, while decoder-only mod-
NLP datasets, including question answering, machine translation, els that are designed for language generation can shed light on not
and three-digit arithmetic. GPT-3 demonstrates that scaling up only NLG but also language understanding tasks via prevalent
language models significantly improves task-agnostic and few- prompting techniques such as GPT-3 [26].
shot performances, sometimes even achieving better results than Encoder-only models at scale employ a bidirectional trans-
prior SOTA fine-tuning approaches [26]. Although large PTMs are former encoder to learn contextual representations; they
a promising direction, training large-scale PTMs is a challenging demonstrate impressive performance on NLU tasks. For exam-
task, which requires massive training data and graphics processing ple, DeBERTa1.5B [24], which consists of 48 transformer layers
unit (GPU) resources. Thus, efficient model training algorithms with 1.5 billion parameters, applied a disentangled attention
play a crucial role in scaling up PTMs. The following section intro- mechanism and enhanced the mask decoder to surpass human
duces the prevalent large-scale PTMs as well as the training meth- performance on the SuperGLUE [30] benchmark. Since a bidi-
ods used to achieve them. rectional nature makes the model unable to be directly used
in NLG tasks, DeBERTa trained another version of a unified
2.2.1. PTMs at scale encoder–decoder to adapt to NLG tasks.
Table 1 [24–28,39,95–102] summarizes the mainstream large- Decoder-only models use transformer decoders by applying
scale PTMs. The size of PTMs has become increasingly larger in autoregressive masks to prevent the current token from
recent years, ranging from 2.6 billion to even 175 billion parame- attending to future tokens. Examples include GPT-3 [26],
ters. Large-scale pre-trained language models embrace a potpourri CPM [27], and PanGu-a [28]. This line of PTMs aims at

Table 1
Summary of large-scale pre-trained language models.

Model Number of Model Knowledge Language Pre-training data Training strategy Training Reference
parameters architecture learning platform
DeBERTa1.5B 1.5 billion Encoder only — English English data (78 GB) — PyTorch [24]
T5 11 billion Encoder– — English C4 (750 GB) Model/data TensorFlow [25]
decoder parallelism
(seq2seq)
GPT-3 175 billion Decoder only — English Cleaned CommonCrawl, Model parallelism — [26]
WebText
CPM 2.6 billion Decoder only — Chinese Chinese corpus (100 GB) — PyTorch [27]
PanGu-a 200 billion Decoder only — Chinese Chinese data MindSpore auto- MindSpore [28]
(1.1 TB, 250 billion tokens) parallel
p
ERNIE 3.0 10 billion Encoder– Chinese, Chinese data (4 TB), Model/pipeline/ PaddlePaddle [39]
decoder English English data tensor parallelism
(unified)
Turing-NLG 17 billion Decoder only — English English data DeepSpeed/ZeRO — [95]
HyperCLOVA 204 billion Decoder only — Korean Korean data — — [96]
CPM-2 11 billion Encoder– — Chinese, WuDao corpus — PyTorch [97]
decoder English (2.3 TB Chinese +
(seq2seq) 300 GB English)
CPM-2-MoE 198 billion Encoder– — Chinese, WuDao corpus Mixture of Experts PyTorch [98]
decoder English (2.3 TB Chinese + (MoE)
(seq2seq) 300 GB English)
Switch 1751 billion Encoder– — English C4 MoE TensorFlow [99]
transformers decoder (750 GB)
(seq2seq)
Yuan 1.0 245 billion Encoder– — Chinese Chinese data Model/pipeline/ — [100]
decoder (5 TB) tensor parallelism
(unified)
GLaM 1.2 trillion Encoder only — English English data (1.6 trillion tokens) MoE/model TensorFlow [101]
parallelism
Gopher 280 billion Decoder only — English English data (10.5 TB) Model/data Jax [102]
parallelism

ZeRO: zero redundancy optimizer; MoE: mixture-to-expert.

55
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

generating human-like texts. Turing-NLG [95] is a 17-billion- The dramatic progress in language PTMs has attracted research
parameter language model that has achieved strong perfor- interest on multimodal pre-training [72,103–107]. Table 2
mance in language model benchmarks. GPT-3, with 175 bil- [69,103,104,107] lists the details of large-scale multimodal PTMs.
lion parameters, can strikingly write samples that deceive DALL-E [69] is a 12-billion variant of GPT-3 that was trained on
human readers, demonstrating that large-scale language 250 million English text–image pairs to generate images according
models can dramatically advance few-shot learning scenarios to language descriptions, thereby improving the zero-shot learning
with in-context learning. In addition to English large-scale performance. ERNIE-ViLG [107] uses a unified GPT framework for
monolingual PTMs, there are also models for other languages bidirectional image–text generation, formulating both the image
such as Chinese and Korean. CPM [27] (2.6 billion parame- and text generation as autoregressive generative tasks. As a result,
ters) and PanGu-a [28] (200 billion parameters) are two it outperforms previous methods on generative tasks such as text-
Chinese variants of GPT-3, while HyperCLOVA [96] is a 204- to-image generation and image captioning with a ten-billion
billion-parameter Korean variant. parameter model pre-trained on 145 million high-quality Chinese
Encoder–decoder models can be further categorized into two text–image pairs. Moreover, the multi-modality-to-multi-modality
classes: ① conventional seq2seq encoder–decoders and ② multi-task mega-transformer (M6) [104] is a 100-billion-
unified encoder–decoders. Conventional seq2seq encoder– parameter transformer encoder, which is trained on over 1.9 TB
decoders adopt the classic transformer encoder–decoder images and 292 GB Chinese texts. M6 achieved strong performance
architecture for pre-training. Recent work includes the T5 in visual question answering, image captioning, and Chinese
[25], the multilingual T5 (mT5) [97], and the large-scale image–text matching. In addition to their improvements on multi-
cost-effective pre-trained language model (CPM-2) [98]. T5 modal tasks, these models can improve the performance of mono-
[25], which has up to 11 billion parameters, unifies the NLP modal tasks, such as text classification, inference, summarization,
tasks in one framework by casting the language understand- and question generation [105]. These results show that multimodal
ing and generation tasks in a text-to-text manner. As the pre-training can leverage multimodal information to enhance both
multilingual variant of T5, mT5 [97], which has up to 13 bil- image representation and text representation, which in turn
lion parameters, has extended the monolingual data to 101 improves the performance of both multimodal tasks and NLP tasks.
human languages and outperformed the previous SOTA
results on a variety of multilingual benchmarks. CPM-2 2.2.2. Efficient training of large-scale models
[98], with 11 billion parameters, is a bilingual model trained The exponential increment of the PTMs’ size has posed a great
on Chinese and English, whose mixture-of-expert (MoE) ver- challenge for efficient training due to the limited GPU memory
sion, denoted as CPM-2-MoE, has 198 billion parameters. and unaffordable training time. Therefore, it is non-trivial to lever-
This model has demonstrated excellent general language age efficient training techniques to speed up large-scale model
intelligence via fine-tuning and prompting. Another kind of training.
encoder–decoder model is the unified encoder–decoder
framework, in which the encoder–decoder architecture 2.2.2.1. Dense models. Data parallelism is a simple solution that
shares the same module and applies different mask strategies allocates different data partitions to multiple workers and dupli-
for MLM and autoregressive language modeling. ERNIE 3.0 cates identical parameters at all workers. However, it usually suf-
[39] jointly learns language understanding and generation fers from a small per-GPU batch size. Another solution is model
by designing two separate heads for understanding and gen- parallelism, in which model parameters are partitioned over differ-
eration, which share a task-agnostic representation. As the ent workers. However, conventional optimization algorithms
third-generation PTMs (with ten billion parameters) in the require extra memory per parameter to store intermediate states,
ERNIE series, ERNIE 3.0 combines the merits of both autore- which hinders the model size from being updated efficiently. Pipe-
gressive causal language models and autoencoding models line parallelism combines the merits of both model parallelism and
to train large-scale knowledge-enhanced PTMs. It has out- data parallelism to reduce time costs. GPipe [108] uses a novel
ranked the SOTA performance on a variety of NLP bench- batch-splitting pipelining algorithm by first partitioning a mini-
marks, including SuperGLUE [30]. These methods have batch of training samples into smaller micro-batches and then
demonstrated superior performance because they all tend aggregating the gradient update simultaneously at the end.
to unify multiple NLP tasks in one model and use different Megatron-LM [109] is an intra-layer model parallel approach for
kinds of corpora or knowledge to enhance the performance. transformer networks, which adds a few synchronization primi-
Most of the above-mentioned large-scale models are trained on tives on the self-attention and multi-layer perceptron blocks.
plain texts without integrating knowledge. Therefore, some PTD-P [110] combines pipeline, tensor, and data parallelism across
researchers have attempted to incorporate knowledge such as lin- multi-GPU servers with a novel interleaved pipelining scheduling
guistic knowledge and world knowledge into PTMs. ERNIE 3.0 pre- strategy, increasing the throughput by more than 10%. Recently,
trained transformers on massive unstructured texts and knowl- Colossal-AI [111] implemented a combination of various data,
edge graphs to learn lexical, syntactic, and semantic information. pipeline, sequence, and multiple tensor parallelism for large-scale
It enriched the PTMs through knowledge integration, phrase mask- model training, which can be a good option for training dense
ing, and named-entity masking. models.

Table 2
Large-scale multimodal PTMs.

Model Number of Pre-training paradigm Pre-training Data Training parallelism Training Reference
parameters platform
Denosing Causal language
auto-encoder model
p
DALL-E 12 billion 250 million English text–image pairs Mixed-precision training PyTorch [69]
p
CogView 4 billion 30 million English text–image pairs — PyTorch [103]
p
M6 100 billion 1.9 TB images + 292 GB Chinese MoE — [104]
p p
ERNIE-ViLG 10 billion 145 million Chinese text–image pairs Mixed-precision training PaddlePaddle [107]

56
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

2.2.2.2. Sparse models. The sparsely gated MoE model [112]

achieved more than 1000 times the increment in model capacity
using a sparsely gated combination of multiple expert sub-
networks. By leveraging the ensemble mechanism, MoE employs
the gated unit to determine which top-k sub-networks should be
activated for prediction.
Switch transformers [91] have advanced the scale of PTMs with
up to trillions of parameters by simplifying the sparse routing and
replacing the feed-forward fully connected layers with switch
routing, in which each sample is routed to only a single expert.

2.2.2.3. Other efficient training strategies. Recent techniques for

memory-efficient optimization include mixed-precision training
[113] and memory-efficient adaptive optimization. Mixed-
precision training utilizes half-precision floating-point numbers
Fig. 2. The evolution shift of representation techniques on various NLP bench-
without losing model accuracy, which nearly halves the memory marks. Results are from Refs. [23,39,116,117]. SuperGLUE is an NLU leaderboard
requirements. Other studies have aimed at memory-efficient adap- consisting of a set of difficult language understanding tasks; an original Chinese
tive optimization. For example, the zero redundancy optimizer natural language inference dataset (OCNLI), a Chinese machine reading compre-
(ZeRO) [114], which is the catalyst that powers Turing-NLG, con- hension dataset (DRCD), a large scale Chinese short text summarization dataset
(LCSTS), and a Chinese multi-domain dialogue dataset towards multi-turn
sists of ZeRO-data parallelism (DP) and ZeRO-residual (R) algo- knowledge-driven conversation (KdConv) are evaluation corpora for natural
rithms that aim at reducing the memory footprint of the model language inference, machine reading comprehension, text summarization, and
states and the residual memory consumption, respectively. First, dialogue generation, respectively. w/o: without.
ZeRO-DP optimizes the optimizer states, gradients, and parameters
by performing optimizer state partitioning, adding gradient parti- models to adapt to few-shot or zero-shot scenarios; as a result, it
tioning, and adding parameter partitioning. Then, ZeRO-R opti- has attracted wide attention in the NLP community. We generally
mizes the residual memory through the removal of activation describe the impact of PLMs in the following three aspects: NLU,
replication, pre-definition of appropriate temporary buffer size, NLG, and dialogue. For dialogue, PTMs focus on response genera-
and proactive memory management. tion. Here, we take dialogue as a separate category due to its large
amount of related work.
3. Impact and challenges of PTMs
3.1.1. Natural language understanding
3.1. Impact of PTMs in NLP NLU is a broad topic in NLP that contains many tasks, such as
named-entity recognition, sentiment analysis, document classifica-
The emergence of PTMs has enabled a significant breakthrough tion, reading comprehension, semantic matching, natural language
in the field of NLP. Before PTMs, many studies focused on designing inference, and information extraction. Table 3
specialized models for specific NLP tasks, which usually could not [39,116,117,119,120] compares the performance of models with
be used for other tasks. For example, Kim [115] proposes and without pre-training techniques on four different NLU tasks.
the TextCNN model for text classification, and Hochreiter and It can be seen that models with pre-training outperform those
Schmidhuber [8] propose the LSTM model for language generation. without pre-training by a clear margin. Thus, PTMs have become
Since their emergence, PTMs have started to serve as foundation the standard backbone in NLU tasks. Numerous researchers have
models in NLP due to their impressive capabilities in representa- employed PTMs to provide task-agnostic representations and then
tion learning. This has opened up a new ‘‘pre-training then fine- design task-specific architectures or objectives to enhance the NLU
tuning” paradigm for NLP. This paradigm can fully exploit unanno- performance. For example, BertGCN [121] combines the represen-
tated data to train a foundation model and then fine-tune it with tative capacity of BERT and transductive learning from graph con-
limited task-specific annotated data. Even with limited annotated volutional networks to advance its performance of text
data, the performance of the downstream NLP tasks is greatly classification, which increases its accuracy by around 4%.
improved. Fig. 2 [23,39,116,117] demonstrates the evolution of To compare the performances of the PTMs on NLU tasks,
SOTA results on five NLP benchmarks from supervised models researchers uploaded their results on two benchmarks, GLUE and
without pre-training to PTMs such as BERT and ERNIE 3.0. It is evi- SuperGLUE. These PTMs now outperform humans on these two
dent that PTMs significantly outperform the previous non-PTMs, leaderboards. In addition, multilingual models such as mBERT
and the knowledge-enhanced ERNIE 3.0 has steadily exceeded [41], XLM [42], mT5 [97], and ERNIE-M [45] use a unified model
BERT on many NLP tasks. Another important trend is to adopt to represent various languages such that the learned information
PTMs to unify almost all NLP tasks. For example, T5 [25] casts both can be shared among different languages. This technology allevi-
language understanding and generation tasks in a text-to-text ates the data sparseness problem in low-resource languages and
manner and tackles all NLP tasks using a seq2seq PTM. Thus, the reduces the demand to train specialized language models for each
NLP community has witnessed the emerging trend of task specific language. This new paradigm is changing the focus of
unification. research on NLP from designing specialized models for multilin-
GPT-3 [26] has shown a promising performance in zero-shot gual tasks to studying how PTMs can be used in these tasks.
learning or few-shot learning. Along with GPT-3, a new prompt-
exploiting training [118] has been proposed to reformulate the task 3.1.2. Natural language generation
paradigm. Thus, pre-training then prompt tuning has initiated a NLG tasks, such as text summarization, question generation,
new trend to better leverage PTMs. Instead of adapting PTMs to and data-to-text generation, are very challenging in NLP. Due to
downstream tasks with fine-tuning, downstream tasks are pre- the huge search space, it is difficult for methods before the PTM
defined as ‘‘slot-filling” tasks: Given a human-designed template era, which suffer from insufficient annotation data and limited
with slots, let the PTMs learn to fill out these templates. This model parameters, to generate fluent, coherent, and informative
framework has been proven powerful, as it enables language text. As shown in Table 4 [94, 122–125], PTMs have played a key
57
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

Table 3
SOTA performance with and without pre-training on NLU tasks.

NLU task Sentiment analysis SST-2 binary Natural language inference Nested named entity recognition Machine reading comprehension
classification OCNLI GENIA DRCD
(accuracy) (F1) (F1) (F1)
SOTA w/o pre- 93.2 59.80 74.80 78.03
training
SOTA w/ pre- 97.5 82.75 83.75 95.84
training

Results are from Refs. [39,116,117,119,120]. w/: with; SST-2: Stanford Sentiment Treebank v2; OCNLI: Original Chinese Natural Language Inference; DRCD: Delta Reading
Comprehension Dataset.

role in improving the performance of NLG tasks. Large-scale PTMs Technology Challenge (DSTC-9) [131] revealed that PLATO-2 deliv-
automatically learn word combinations and sentence expressions ers a superior performance in multiple conversational tasks,
from unannotated data, which significantly improves the models’ including open-domain chitchat, knowledge-grounded dialogue,
ability in language generation in terms of fluency, coherence, and and task-oriented conversation. Recently, PLATO-XL [132] was
informativeness. ERNIE-GEN [93] uses an enhanced multi-flow scaled up to 11 billion parameters, with multi-party-aware pre-
seq2seq pre-training and fine-tuning framework and incorporates training being carried out to better distinguish roles in social media
a span-by-span generation task to generate consecutive entities, conversations. Other Chinese dialogue PTMs that have been devel-
which has achieved new SOTA results on five typical NLG tasks. oped on a modest scale include Cdial-GPT [133], ProphetNet-X
Researchers and practitioners also pre-train task-specific trans- [134], and EVA [135].
former models on generation tasks, such as MASS [88] and PEGA- With these large-scale dialogue PTMs, some of the problems
SUS [94]. More specifically, MASS adopts the encoder–decoder that plague traditional end-to-end neural approaches [136,137]
framework to reconstruct a sentence fragment, given the remain- are alleviated significantly, including deficiencies in response flu-
ing part of the sentence, and achieves significant improvements ency and context relevance. Moreover, in comparison with existing
over baselines without pre-training on machine translation. PEGA- chatbots that rely on complex frameworks, such as Mitsuku [138]
SUS was used to pre-train a large-scale encoder-decoder model and XiaoIce [139], these dialogue PTMs demonstrate superior per-
with a well-designed pre-training objective, which achieved a formance in multi-turn conversations, especially in terms of
SOTA performance on all 12 text-summarization tasks. With the engagingness and humanness.
growth of the model size, PTMs gradually show notable ability in
creative writing. Models such as GPT-3, HyperCLOVA, and ENRIE 3.2. Key research challenges
3.0 are capable of generating articles, questions and answers, nov-
els, and program codes via only zero-shot learning. The quality of Although PTMs have significantly improved the performance of
the generated texts is sometimes comparable with that of NLP tasks, there are still some key challenges for PTM applications,
human-written texts. For example, humans only achieve 52% such as interpretability, robustness, reasoning capability, and the
accuracy in distinguishing real news from fake news generated deployment of large-scale PTMs. This section describes these chal-
by GPT-3. lenges in the hope that additional future efforts can be devoted to
these directions.
3.1.3. Dialogue
In the past few years, several representative dialogue- 3.2.1. Deployability
generation models have been pre-trained with human-like conver- One trend in PTMs is the substantial increase in capacity. Since
sations collected from social media, including Twitter, Reddit, the release of GPT [22] and BERT [23], PTMs have scaled exponen-
Weibo, and Baidu Tieba. Based on the general language model tially with respect to both the number of parameters and the size
GPT-2 [83], DialoGPT [126] has been trained for response genera- of the pre-training data. For example, the largest version of GPT-3
tion using Reddit comments. Meena [127] scales up the network [26] requires a total training computation of 3.64 103 petaflop-
to 2.6 billion parameters and employs more social media conversa- days, resulting in a total number of around 3.14 1023 flops and
tions in the training process, resulting in a significant improvement costing millions of dollars. The rapid growth in model size raises
in response quality. To mitigate undesirable toxic or bias traits in concerns regarding the tradeoff between scale and deployability.
large corpora, Blender [128] further fine-tunes the PTM with Two types of strategy have been proposed to tackle this issue:
human-annotated datasets and emphasizes the desirable conver- ① Large-scale PTMs are only used as the foundation model via
sational skills of engagingness, empathy, and personality. In addi- application programming interface (API) calls, similar to the way
tion, to alleviate the safe-response problem in open-domain in which the GPT-3 model is used. This strategy enables the effi-
chitchat, PLATO [129] encodes the discrete latent variable into cient use of PTMs and evades model deployment on each device,
transformers for diverse response generation. Moreover, PLATO-2 but significantly limits the model’s application scope. ② Large
[130] further scales up PLATO via curriculum learning for both models are compressed to smaller ones [140] for potential deploy-
Chinese and English response generation. The Ninth Dialog System ment. Typical compressing techniques include model compression

Table 4
SOTA performance with and without pre-training on NLG tasks.

NLG task Text summarization ESLC Dialogue generation Question generation SQuAD 1.1 Data-to-text generation WebNLG
(ROUGE-L) KdConv-film (BLEU-4) (BLEU)
(BLEU-4)
SOTA w/o pre-training 23.44 5.40 15.87 63.69
SOTA w/ pre-training 36.51 74.44 25.41 66.07

Results are from Refs. [94,122–125]. ESLC: English Skills Learning Center; BLEU: bilingual evaluation understudy; ROUGE-L: recall-oriented understudy for gisting evaluation-
longest common subsequence.

58
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

and knowledge distillation. Unfortunately, existing compressing models can be easily attacked with adversarial patterns by third
techniques are unable to compress super-large PTMs (e.g., GPT-3) parties, resulting in irreparable loss in real-world applications. In
to a suitable size for deployment on a single GPU or terminal addition to adversarial attacks, another form of attack—namely,
device such as a laptop or cell phone. Advanced research in model backdoor attacks—is a threat to PTMs. Unlike adversarial attacks,
compression is thus imperative in order to make large PTMs avail- which usually act during the inference process of a neural model,
able to more users. Another promising direction is to use backdoor attacks hack the model during training [159]. If a model
parameter-efficient techniques, such as prompt tuning [141– is deliberately trained on backdoor data, it will be extremely dan-
146], to reduce the memory budget of deployment; this remains gerous for users to use this model in applications involving privacy
as a large area for further exploration. and security concerns. Future work could aim to improve the
robustness of PTMs toward adversarial attacks. To deal with back-
3.2.2. Model trustworthiness door attacks, a model should be able to detect in the input the trig-
Another challenge of PTMs is their trustworthiness, which gers that can activate the backdoor attack and remove the triggers,
mainly involves their interpretability [147] and robustness [148]. thus enhancing model security.
Although PTMs have achieved SOTA performances across various
tasks, how they make decisions are sometimes obscure to humans,
which makes PTM models difficult to be applied in fields where 4. Applications of PTMs
model interpretability is essential, such as health-care and law
[149]. Consequently, there is a growing interest in interpreting 4.1. Platforms and toolkits for applications
deep neural models [150]. In particular, many studies aim
to understand what PTMs have learned in their representations Due to their universality, PTMs have become foundation models
[151]. in NLP. Many researchers have developed a series of open-source
Some studies have been published on the trustworthiness of toolkits and platforms to make better use of PTMs. These toolkits
deep neural models. These include: linguistic structural analyses and platforms usually contain various PTMs, fine-turning tools,
on PTMs [152], which aim to analyze the linguistic knowledge that and model-compression tools.
is learned by pre-trained language models and to understand the
reason for their success; model behavioral analyses [153], which 4.1.1. Toolkits
evaluate model robustness and reliability with multiple test sets; When researchers propose a new pre-trained language model,
and post-hoc explanation analyses [154], which aim to provide they often open-source a corresponding toolkit for developers.
understandable explanations for the predictions of deep neural Such toolkits usually provide codes for downstream task develop-
models. ment based on the specific model, and therefore lack generality.
Despite the research that has already been done in this field, the Typical toolkits include google-research/bert [160], PaddlePaddle/
following challenges must be addressed in order to build trustwor- ERNIE [161], and PCL-Platform.Intelligence/PanGu-a [162]. These
thy systems: ① general interpretation methods for NLP tasks toolkits provide a series of open-sourced PTMs, such as BERT,
(existing interpretation methods are designed for classification ERNIE, and PanGu-a, along with source code and training data.
tasks); ② causal analysis between model prediction and learned For example, the ERNIE toolkit provides not only the source code,
knowledge or extracted explanations; and ③ a comprehensive training data, and PTM of ERNIE but also a couple of enhanced
evaluation platform for interpretability, including evaluation data ERNIE series models, such as ERNIE-Doc [163] and ERNIE-ViL
and metrics. [70]. In order to deploy the ERNIE model to online service, the
ERNIE toolkit also provides a model-compression tool.
3.2.3. Commonsense knowledge and reasoning With the intensive publish of PTMs, knowing how to use these
Large-scale PTMs have been found to encode some common- models in a unified toolkit has become an urgent need. Given this
sense knowledge [155]. Nevertheless, appropriate probing tasks background, toolkits for general NLP applications have been devel-
need to be designed in order to mine the commonsense knowledge oped. Typical toolkits include HuggingFace/Transformers [164],
learned in PTMs—such as formulating a relational knowledge- Fairseq [165], and PaddleNLP [166]. PTMs are integrated in a
extraction task as the completion of fill-in-the-blank state- user-friendly way into such general-purpose toolkits. Taking Hug-
ments—so as to examine the knowledge-learning ability of PTMs gingFace as an example, this toolkit integrates the codes for differ-
[156]. Although PTMs learn some knowledge from texts, there is ent kinds of PTMs and codes for downstream application
still a large amount of knowledge that cannot be obtained from developments, including classification, generation, summarization,
texts alone. One possible direction is to have models learn this kind translation, question answering, and so forth.
of knowledge from both visual inputs and text inputs.
In addition to commonsense knowledge, other studies are
questioning whether PTMs are endowed with reasoning abili- 4.1.2. Platforms
ties. For example, Talmor et al. [157] design different tasks to Besides toolkits, platforms provide users with PTM services for
evaluate the reasoning abilities of PTMs. The researchers disen- customization. These platforms can provide facilities for
tangle pre-training from fine-tuning and find that the reasoning developers to build models and deploy them to online services.
capabilities are poor for most PTMs, revealing that existing For example, Baidu Wenxin [167] is a platform to facilitate the
PTMs lack the ability to reason. To alleviate this problem, one use of PTMs. This platform meets the needs of both experienced
possible direction could be to integrate prior knowledge into developers and junior developers. It enables developers to easily
the PTMs in order to guide the models to learn reasoning rules build their models with data and model configuration only. It also
implicitly. provides experienced developers with toolkits to train their
models that are tailored for applications. Other platforms such
3.2.4. Model security as AliceMind [168] provide similar services with no significant
One severe issue with PTMs is their vulnerability to adversarial differences. OpenAI API [169] is another kind of platform that is
examples, which can mislead the model into producing a specific used to develop applications based only on PTMs. OpenAI API is
wrong prediction when perturbations are injected into the input based on GPT-3 [26]; it provides specific high-level functions,
[158]. This susceptibility exposes PTMs to safety concerns: The such as English-to-French translation, grammar correction,
59
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

question answering, advertisement generation, and product-name laborious tasks. Microsoft has also demonstrated that the pre-
generation. trained generation model Turing-NLG is beneficial for autosuggest
recommendations [95]. Moreover, many researchers have built
4.2. Applications various demo applications based on GPT-3, including applications
for ad generation, AI copywriting, book writing, code generation,
PTMs have been widely deployed in real applications, including customer service, and so forth. As for visual content creation,
document intelligence, content creation, virtual assistant, and pre-trained multimodal generative models such as DALL-E [69],
intelligent search engines. Below, we describe how PTMs are CogView [103], and ERNIE-ViLG [107] have greatly improved the
applied in each field. quality and fidelity of generated images. The results from CogView
have demonstrated this model’s capability to generate high-quality
4.2.1. Document intelligence images in a single domain such as industrial fashion design, so this
One widely studied application for PTMs is document intelli- model has been deployed in online fashion production.
gence, which includes sentiment analysis, news classification, In addition to these industrial applications, researchers have
anti-spam detection, and information extraction. Sentiment analy- shown the potential ability of PTMs for creative writing, including
sis is widely used to identify sentiment polarity, such as public poem generation [179], lyrics generation [27], e-mail auto comple-
opinion, for market research, brand reputation analysis, and social tion [180], to-do generation [181], auto-completion for sentences
media influence. Garg and Chatterjee [170] propose analyzing the and paragraphs, and even a long novel generation [22]. Although
sentiment of Twitter feeds using a PTM and classifying them into PTMs exhibit strong generative capabilities, an increasing number
three categories: positive, negative, and neutral. AlQahtani [171] of concerns have arisen regarding generative models, including pri-
proposes analyzing customer reviews on products by combining vacy and copyright.
data-mining techniques with PTMs. Recently, Singh et al. [172]
analyzed public sentiment on the impact of the coronavirus on 4.2.3. Virtual assistants
social life using a PTM. Chen and Sokolova [173] propose analyzing Virtual assistants are adopted in many applications nowadays.
the sentiments in the coronavirus disease 2019 (COVID-19)-related Typical applications include smart speakers, such as Alexa [182]
messages in a popular social media platform, where users share from Amazon and Xiaodu [129] from Baidu. Such applications have
their stories to seek support from other users, especially during used PTMs and have shown that PTMs can provide excellent lan-
the COVID-19 pandemic. Experimental results show that PTMs guage understanding ability for spoken language and voice recogni-
can achieve significant performance gain in classifying sentiment tion [183] in smart speakers. With the benefit brought by PTMs,
polarities, demonstrating the effectiveness of PTMs. these smart speakers can respond to weather forecast queries, sing
News classification and anti-spam detection can also be mod- songs on demand, and vocally control smart home devices. More-
eled as classification tasks. Ding et al. [163] apply PTMs to classify over, smart speakers can chat with humans on a broad range of
news into extreme left-wing or right-wing standpoints. Liu et al. topics and thus establish a closer and more stable relationship
[174] propose classifying the papers published in Arxiv.org into between users and the system. In addition to the usage of PTMs in
11 categories, including math, computer science, and so forth. smart speakers, PTMs have been deployed in mobile-phone-based
Jwa et al. [175] use BERT to detect fake news by analyzing the rela- virtual assistants, such as Siri and Google Assistant. For example,
tionship between the headline and the body text in news. NDTV [184] proposes that PTMs can improve the interaction quality,
Document information extraction is widely used in industry. while Vincent [185] proposes that PTMs can be used in intelligent
Many AI cloud services contain tools for information extraction customer service robots to recognize customer sentiments.
[176], such as Google AI Cloud, Baidu AI Cloud, and Alibaba AI As PTMs are applied more and more widely in virtual assistants,
Cloud. Among these services, Baidu has built a PTM-based plat- the responses generated by chatting bots are becoming more
form, TextMind, for document information-extraction applications, human-like. For example, Microsoft has proposed a PLM-based
including receipt analysis for expense reimbursements, informa- model called DialoGPT that learns from the comment history of
tion extraction from resumes, financial statement analysis, con- Reddit and can fluently reply to users. Google has also suggested
tract analysis, and legal judgment analysis. One of the world’s the use of PLMs to develop a chatbot application that can ‘‘chat
largest online home retailers, Wayfair, also applies BERT to extract about anything” [127]. To make the robots more human-like, Face-
information from customer messages. book applied PLM to a series of dialogue chatbots named Blender
Document image understanding is another important research and Blender 2.0 [128]. Shortly afterwards, Baidu proposed
topic in document intelligence for automatically reading, under- PLATO-XL [132], a PLM-based model, to further push the perfor-
standing, and analyzing business documents. A series of multi- mance of a chatbot and reach the SOTA in terms of both human
modal document PTMs [177] has been proposed to jointly model evaluation and automatic evaluation metrics. Thanks to the perfor-
interactions between text, image, and layout information in busi- mance improvement brought by PTMs, these applications can be
ness documents for many document image understanding tasks, very robust in interactions with users [186].
such as receipt understanding, document image classification,
and document information extraction. Applica proposes a solution 4.2.4. Intelligent search
to take into consideration layout, graphics, and text in order to Aside from the applications mentioned above, PTMs are widely
enable the extraction of precise answers for complex business pro- used in search engines. Google has already applied PTMs in its
cesses in financial services, insurance services, life sciences, and so Google Search and achieved significant improvements [187]. Baidu
on. has also applied PTMs, ERNIE 2.0 [188] and ERNIE 3.0 [39], as its
backbone to support semantic matching by encoding text into
4.2.2. Content creation dense representations for better retrieval performance in Baidu
Content creation tasks are usually designed to verify the perfor- Search [189]. Facebook [190] has revealed a unified embedding
mance of recently proposed large-scale models [22]. For example, framework for personalized systems and noted that their future
Narrativa applies GPT-2 for content automation from just a few work will contain PTMs.
words provided by customers and generates high-quality adver- To address the surging demand for multimedia content
tisement content [178]. GPT-2 has demonstrated its ability to gen- searches, the performance of image and video search engines can
erate content for e-commerce in order to relieve humans from be enhanced through the utilization of multimodal PTMs. For
60
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

example, WenLan [106] developed two real-world applications References

based on image–text matching, thereby demonstrating the power
of multimodal pre-training. [1] Bahl LR, Jelinek F, Mercer RL. A maximum likelihood approach to continuous
speech recognition. IEEE Trans Pattern Anal Mach Intell 1983;PAMI-5
To further improve the performance of search engines, (2):179–90.
researchers have recently paid an increasing amount of attention [2] Thrun S, Pratt L. Learning to learn. Cham: Springer; 1998.
to multilingual search engine models. Multilingual models are [3] Nadas A. Estimation of probabilities in the language model of the IBM speech
recognition system. IEEE Trans Acoust Speech Signal Process 1984;32
pre-trained with a multilingual corpus to learn cross-language (4):859–61.
information [191]. The most significant advantage of multilingual [4] Chen SF, Goodman J. An empirical study of smoothing techniques for
models is their transferability across languages, which improves language modeling. Comput Speech Lang 1999;13(4):359–94.
[5] Bengio Y, Ducharme R, Vincent P, Janvin C. A neural probabilistic language
their performance on low-resource languages.
model. J Mach Learn Res 2003;3:1137–55.
[6] Sundermeyer M, Schlüter R, Ney H. LSTM neural networks for language
modeling. In: Proceedings of the 13th Annual Conference of the International
Speech Communication Association (Interspeech 2012); 2012 Sep 9–13;
5. Conclusions and future work Portland, OR, USA. 2012. p. 194–7.
[7] Mikolov T, Zweig G. Context dependent recurrent neural network language
PTMs can fully exploit unannotated data for self-supervised model. In: Proceedings of 2012 IEEE Spoken Language Technology Workshop
(SLT); 2012 Dec 2–5; Miami, FL, USA. 2012. p. 234–9.
learning and have become the foundation models in NLP, signifi- [8] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput
cantly improving the performance of downstream NLP tasks. The 1997;9(8):1735–80.
emergence of PTMs opens up a new ‘‘pre-training then fine- [9] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al.
Attention is all you need. In: Proceedings of the 31st Conference on Neural
tuning” paradigm for NLP. With the increase of model parameters,
Information Processing Systems (NIPS 2017); 2017 Dec 4–9; Long Beach, CA,
PTMs show promising performance in zero-shot learning or few- USA. 2017. p. 5998–6008.
shot learning. Their success in NLP is triggering more research [10] Shazeer N, Cheng Y, Parmar N, Tran D, Vaswani A, Koanantakool P, et al.
devoted to PTMs in other fields such as computer vision, speech Mesh-TensorFlow: deep learning for supercomputers. In: Proceedings of the
32nd Conference on Neural Information Processing Systems (NIPS 2018);
processing, and multimodal understanding and generation, reveal- 2018 Dec 3–8; Montréal, QC, Canada; 2018.
ing their potential to act as foundation models in these fields. [11] Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R. Transformer-XL:
Despite the dramatic success of PTMs in NLP, there is still a long attentive language models beyond a fixed-length context. In: Proceedings of
the 57th Annual Meeting of the Association for Computational Linguistics
way to go to achieve artificial general intelligence. First, PTMs are (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy. 2019. p. 2978–88.
black boxes that are poorly understood. Their interpretability and [12] Beltagy I, Peters ME, Cohan A. Longformer: the long-document transformer.
robustness have yet to be explored due to the nonlinearity of trans- 2020. arXiv:2004.05150.
[13] Press O, Smith NA, Lewis M. Shortformer: better language modeling using
former models. Thus, it is difficult to use PTMs to make reliable shorter inputs. 2020. arXiv:2012.15832.
decisions and reasoning before we fully understand their princi- [14] Press O, Smith NA, Levy O. Improving transformer models by reordering their
ples. It is worth devoting a great deal of effort to researching the sublayers. In: Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 2996–
uncertainty of PTMs. Furthermore, current multimodal and multi- 3005.
lingual pre-training [192] is still in the early stage. Unifying multi- [15] Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed
modal and multilingual pre-training will emerge as an exciting representations of words and phrases and their compositionality. In:
Proceedings of the 27th Conference on Neural Information Processing
trend for further exploration, which may improve the performance
Systems (NIPS 2013); 2013 Dec 5–10; Lake Tahoe, NV, USA. 2013. p. 3111–9.
of these low-resource tasks. Another promising direction is to [16] Pennington J, Socher R, Manning CD. GloVe: global vectors for word
incorporate prior knowledge into PTMs to improve their reasoning representation. In: Proceedings of the 2014 Conference on Empirical
abilities and efficiency. Existing work on knowledge pre-training, Methods in Natural Language Processing (EMNLP); 2014 Oct 25–29; Doha,
Qatar; 2014. p.1532–43.
such as K-BERT [33] and ERNIE 3.0 [39], has injected knowledge [17] Collobert R, Weston J, Bottou L, Karlen M, Kavukcuoglu K, Kuksa P. Natural
triplets into pre-training or fine-tuning. However, PTMs have language processing (almost) from scratch. J Mach Learn Res
demonstrated limited capability for commonsense awareness and 2011;12:2493–537.
[18] Xiong C, Zhong V, Socher R. DCN+: mixed objective and deep residual
reasoning, which require further improvement. Although large- coattention for question answering. In: Proceedings of the 6th International
scale PTMs have demonstrated strong generalization capabilities, Conference on Learning Representations (ICLR 2018); 2018 Apr 30–May 3;
efficiently deploying them is still an open question. For applica- Vancouver, BC, Canada; 2018.
[19] Dai AM, Le QV. Semi-supervised sequence learning. In: Proceedings of the
tions that require low latency, model compression of PTMs 29th Conference on Neural Information Processing Systems (NIPS 2015);
remains a promising direction. Existing model-compression meth- 2015 Dec 7–12; Montréal, QC, Canada. 2015. p. 3079–87.
ods consist of distillation [193], pruning [194], quantization [195], [20] McCann B, Bradbury J, Xiong C, Socher R. Learned in translation:
contextualized word vectors. In: Proceedings of the 31st Conference on
and so forth. However, how to efficiently build large-scale PTMs Neural Information Processing Systems (NIPS 2017); 2017 Dec 4–9; Long
with a deployable inference time remains an ongoing challenge. Beach, CA, USA; 2017.
In addition, designing more efficient architecture in place of or to [21] Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, et al. Deep
contextualized word representations. In: Proceedings of the 2018 Conference
improve transformers remains an open problem.
of the North American Chapter of the Association for Computational
In summary, there is still a long way to go for PTMs to be able to Linguistics: Human Language Technologies; 2018 Jun 1–6; New Orleans,
make reliable decisions and carry out reliable planning, which are LA, USA; 2018. p. 2227–37.
essential elements of AI. More efficient and powerful neural net- [22] Radford A, Narasimhan K, Salimans T, Sutskever I. Improving language
understanding by generative pre-training. San Francisco: OpenAI; 2018.
works need to be proposed and developed. Fortunately, the use [23] Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep
of PTMs in real applications continues to provide an increased bidirectional transformers for language understanding. In: Proceedings of
amount of data and address new challenges, potentially promoting the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies; 2019 Jun 2–7;
the rapid development of new pre-trained methods. Minneapolis, MN, USA; 2019. p. 4171–86.
[24] He P, Liu X, Gao J, Chen W. DeBERTa: decoding-enhanced BERT with
disentangled attention. In: Proceedings of the 9th International Conference
on Learning Representations (ICLR 2021); 2021 May 3–7; Vienna, Austria;
Compliance with ethics guidelines 2021.
[25] Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the
Haifeng Wang, Jiwei Li, Hua Wu, Eduard Hovy, and Yu Sun limits of transfer learning with a unified text-to-text transformer. J Mach
Learn Res 2019;21(140):1–67.
declare that they have no conflict of interest or financial conflicts [26] Brown TB, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language
to disclose. models are few-shot learners. In: Proceedings of the 34th Conference on

61
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

Neural Information Processing Systems (NeurIPS 2020); 2020 Dec 7–12; Conference on Computer Vision (ICCV); 2017 Oct 22–29; Venice, Italy. 2017.
online. 2020. p. 1877–901. p. 843–52.
[27] Zhang Z, Han X, Zhou H, Ke P, Gu Y, Ye D, et al. CPM: a large-scale generative [49] Schneider S, Baevski A, Collobert R, Auli M. Wav2vec: unsupervised pre-
Chinese pre-trained language model. AI Open 2021;2:93–9. training for speech recognition. In: Proceedings of the 20th Annual
[28] Zeng W, Ren X, Su T, Wang H, Liao Y, Wang Z, et al. PanGu-a: large-scale Conference of the International Speech Communication Association
autoregressive pretrained Chinese language models with auto-parallel (InterSpeech 2019); 2019 Sep 15–19; Graz, Austria. 2019. p. 3465–9.
computation. 2021. arXiv:2104.12369. [50] Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: a large-scale hierarchical
[29] Wang S, Sun Y, Xiang Y, Wu Z, Ding S, Gong W, et al. ERNIE 3.0 Titan: image database. In: Proceedings of the IEEE Conference on Computer Vision
exploring larger-scale knowledge enhanced pre-training for language and Pattern Recognition (CVPR); 2009 Jun 20–25; Miami, FL, USA. 2009. p.
understanding and generation. 2021. arXiv:2112.12731. 248–55.
[30] Wang A, Pruksachatkun Y, Nangia N, Singh A, Michael J, Hill F, et al. [51] Mahajan D, Girshick R, Ramanathan V, He K, Paluri M, Li Y, et al. Exploring the
SuperGLUE: a stickier benchmark for general-purpose language limits of weakly supervised pretraining. In: Proceedings of the European
understanding systems. In: Proceedings of the 33rd Conference on Neural Conference on Computer Vision (ECCV); 2018 Sep 8–14; Munich, Germany.
Information Processing Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, 2018. p. 181–96.
BC, Canada. 2019. p. 3266–80. [52] Zhai X, Kolesnikov A, Houlsby N, Beyer L. Scaling vision transformers. 2021.
[31] Sun Y, Wang S, Li Y, Feng S, Chen X, Zhang H, et al. ERNIE: enhanced arXiv:2106.04560.
representation through knowledge integration. 2019. arXiv:1904.09223. [53] Doersch C, Gupta A, Efros AA. Unsupervised visual representation learning
[32] Xiong W, Du J, Wang WY, Stoyanov V. Pretrained encyclopedia: weakly by context prediction. In: Proceedings of the IEEE International Conference
supervised knowledge-pretrained language model. In: Proceedings of the 8th on Computer Vision (ICCV); 2015 Dec 7–13; Santiago, Chile. 2015. p. 1422–
International Conference on Learning Representations (ICLR 2020); 2020 Apr 30.
26–30; Addis Ababa, Ethiopia; 2020. [54] Noroozi M, Favaro P. Unsupervised learning of visual representations by
[33] Liu W, Zhou P, Zhao Z, Wang Z, Ju Q, Deng H, et al. K-BERT: enabling language solving jigsaw puzzles. In: Proceedings of the European Conference on
representation with knowledge graph. In: Proceedings of the 34th AAAI Computer Vision (ECCV); 2016 Oct 8–16; Amsterdam, The Netherlands. 2016.
Conference on Artificial Intelligence; 2020 Feb 7–12; New York City, NY, USA. p. 69–84.
Palo Alto: AAAI Press; 2020. p. 2901–8. [55] Misra I, van der Maaten L. Self-supervised learning of pretext-invariant
[34] Sun T, Shao Y, Qiu X, Guo Q, Hu Y, Huang X, et al. CoLAKE: contextualized representations. In: Proceedings of the IEEE/CVF Conference on Computer
language and knowledge embedding. In: Proceedings of the 28th Vision and Pattern Recognition (CVPR); 2020 Jun 14–19; online. 2020. p.
International Conference on Computational Linguistics; 2020 Dec 8–13; 6707–17.
online. 2020. p. 3660–70. [56] Gidaris S, Singh P, Komodakis N. Unsupervised representation learning by
[35] Zhang Z, Han X, Liu Z, Jiang X, Sun M, Liu Q. ERNIE: enhanced language predicting image rotations. In: Proceedings of the 6th International
representation with informative entities. In: Proceedings of the 57th Annual Conference on Learning Representations (ICLR 2018); 2018 Apr 30–May 3;
Meeting of the Association for Computational Linguistics (ACL 2019); 2019 Vancouver, BC, Canada; 2018.
Jul 28–Aug 2; Florence, Italy. 2019. p. 1441–51. [57] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T,
[36] Peters ME, Neumann M, Logan IV RL, Schwartz R, Joshi V, Singh S, et al. et al. An image is worth 16 16 words: transformers for image recognition at
Knowledge enhanced contextual word representations. In: Proceedings of the scale. In: Proceedings of the 9th International Conference on Learning
2019 Conference on Empirical Methods in Natural Language Processing and Representations (ICLR 2021); 2021 May 3–7; Vienna, Austria; 2021.
the 9th International Joint Conference on Natural Language Processing [58] Van den Oord A, Li Y, Vinyals O. Representation learning with contrastive
(EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 43–54. predictive coding. 2018. arXiv:1807.03748.
[37] Levine Y, Lenz B, Dagan O, Ram O, Padnos D, Sharir O, et al. SenseBERT: [59] He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised
driving some sense into BERT. In: Proceedings of the 58th Annual Meeting of visual representation learning. In: Proceedings of the IEEE/CVF Conference on
the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; Computer Vision and Pattern Recognition (CVPR); 2020 Jun 14–19; online.
online. 2020. p. 4656–67. 2020. p. 9729–38.
[38] Wang X, Gao T, Zhu Z, Zhang Z, Liu Z, Li J, et al. KEPLER: a unified model for [60] Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive
knowledge embedding and pre-trained language representation. Trans Assoc learning of visual representations. In: Proceedings of the 37th International
Comput Linguist 2021;9:176–94. Conference on Machine Learning (ICML 2020); 2020 Jul 12–18; online. 2020.
[39] Sun Y, Wang S, Feng S, Ding S, Pang C, Shang J, et al. ERNIE 3.0: large-scale p. 1597–607.
knowledge enhanced pre-training for language understanding and [61] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning
generation. 2021. arXiv:2107.02137. transferable visual models from natural language supervision. In:
[40] Wang R, Tang D, Duan N, Wei Z, Huang X, Ji J, et al. K-Adapter: infusing Proceedings of the 38th International Conference on Machine Learning
knowledge into pre-trained models with adapters. In: Proceedings of the 59th (ICML 2021); 2021 Jul 18–24; online. 2021. p. 8748–63.
Annual Meeting of the Association for Computational Linguistics and the 11th [62] Jia C, Yang Y, Xia Y, Chen YT, Parekh Z, Pham H, et al. Scaling up visual and
International Joint Conference on Natural Language Processing (ACL-IJCNLP vision–language representation learning with noisy text supervision. In:
2021); 2021 Aug 1–6; online. 2021. p. 1405–18. Proceedings of the 38th International Conference on Machine Learning (ICML
[41] Wu S, Dredze M. Beto, Bentz, Becas: the surprising cross-lingual effectiveness 2021); 2021 Jul 18–24; online. 2021. p. 4904–16.
of BERT. In: Proceedings of the 2019 Conference on Empirical Methods in [63] Lu J, Batra D, Parikh D, Lee S. ViLBERT: pretraining task-agnostic
Natural Language Processing and the 9th International Joint Conference on visiolinguistic representations for vision-and-language tasks. In:
Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7; Hong Kong, Proceedings of the 33rd Conference on Neural Information Processing
China. 2019. p. 833–44. Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p.
[42] Conneau A, Lample G. Cross-lingual language model pretraining. In: 13–23.
Proceedings of the 33rd Conference on Neural Information Processing [64] Tan H, Bansal M. LXMERT: learning cross-modality encoder representations
Systems (NeurIPS 2019); 2019 Dec 8–14; Vancouver, BC, Canada. 2019. p. from transformers. In: Proceedings of the 2019 Conference on Empirical
7057–67. Methods in Natural Language Processing and the 9th International Joint
[43] Conneau A, Khandelwal K, Goyal N, Chaudhary V, Wenzek G, Guzmán F, et al. Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7;
Unsupervised cross-lingual representation learning at scale. In: Proceedings Hong Kong, China; 2019.
of the 58th Annual Meeting of the Association for Computational Linguistics [65] Li LH, Yatskar M, Yin D, Hsieh CJ, Chang KW. VisualBERT: a simple and
(ACL 2020); 2020 Jul 5–10; online. 2020. p. 8440–51. performant baseline for vision and language. 2019. arXiv:1908.03557.
[44] Chi Z, Dong L, Wei F, Yang N, Singhal S, Wang W, et al. InfoXLM: an [66] Sun C, Myers A, Vondrick C, Murphy K, Schmid C. VideoBERT: a joint model
information-theoretic framework for cross-lingual language model pre- for video and language representation learning. In: Proceedings of the IEEE/
training. In: Proceedings of the 2021 Conference of the North American CVF International Conference on Computer Vision (ICCV); 2019 Oct 27–Nov
Chapter of the Association for Computational Linguistics: Human Language 2; Seoul, Republic of Korea. 2019. p. 7464–73.
Technologies; 2021 Jun 6–11; online. 2021. p. 3576–88. [67] Sun C, Baradel F, Murphy K, Schmid C. Learning video representations using
[45] Ouyang X, Wang S, Pang C, Sun Y, Tian H, Wu H, et al. ERNIE-M: enhanced contrastive bidirectional transformer. 2019. arXiv:1906.05743.
multilingual representation by aligning cross-lingual semantics with [68] Chuang YS, Liu CL, Lee H, Lee L. SpeechBERT: an audio-and-text jointly
monolingual corpora. In: Proceedings of the 2021 Conference on Empirical learned language model for end-to-end spoken question answering. In:
Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. Proceedings of the 21st Annual Conference of the International Speech
2021. p. 27–38. Communication Association (Interspeech 2020); 2020 Oct 25–29; Shanghai,
[46] Donahue J, Jia Y, Vinyals O, Hoffman J, Zhang N, Tzeng E, et al. DeCAF: a deep China. 2020. p. 4168–72.
convolutional activation feature for generic visual recognition. In: [69] Ramesh A, Pavlov M, Goh G, Gray S, Voss C, Radford A, et al. Zero-shot text-to-
Proceedings of the 31st International Conference on Machine Learning image generation. In: Proceedings of the 38th International Conference on
(ICML 2014); 2014 Jun 21–26; Beijing, China. 2014. p. 647–55. Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 8821–31.
[47] Girshick R, Donahue J, Darrell T, Malik J. Rich feature hierarchies for accurate [70] Yu F, Tang J, Yin W, Sun Y, Tian H, Wu H, et al. ERNIE-ViL: knowledge
object detection and semantic segmentation. In: Proceedings of the IEEE enhanced vision–language representations through scene graphs. In:
Conference on Computer Vision and Pattern Recognition (CVPR); 2014 Jun Proceedings of the 35th AAAI Conference on Artificial Intelligence; 2021
23–28; Columbus, OH, USA. 2014. p. 580–7. Feb 2–9; online. Palo Alto: AAAI Press; 2021. p. 3208–16.
[48] Sun C, Shrivastava A, Singh S, Gupta A. Revisiting unreasonable effectiveness [71] Gan Z, Chen YC, Li L, Zhu C, Cheng Y, Liu J. Large-scale adversarial training for
of data in deep learning era. In: Proceedings of the IEEE International vision-and-language representation learning. In: Proceedings of the 34th

62
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

Conference on Neural Information Processing Systems (NeurIPS 2020); 2020 [97] Xue L, Constant N, Roberts A, Kale M, Al-Rfou R, Siddhant A, et al. mT5: a
Dec 7–12; online. 2020. p. 6616-28. massively multilingual pre-trained text-to-text transformer. In: Proceedings
[72] Cho J, Lei J, Tan H, Bansal M. Unifying vision-and-language tasks via text of the 2021 Conference of the North American Chapter of the Association for
generation. In: Proceedings of the 38th International Conference on Machine Computational Linguistics: Human Language Technologies; 2021 Jun 6–11;
Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 1931–42. online. 2021. p. 483–98.
[73] Kalyan KS, Rajasekharan A, Sangeetha S. AMMUS: a survey of transformer- [98] Zhang Z, Gu Y, Han X, Chen S, Xiao C, Sun Z, et al. CPM-2: large-scale cost-
based pretrained models in natural language processing. 2021. effective pre-trained language models. 2021. arXiv:2106.10715.
arXiv:2108.05542. [99] Fedus W, Zoph B, Shazeer N. Switch transformers: scaling to trillion
[74] Kaliyar RK. A multi-layer bidirectional transformer encoder for pre-trained parameter models with simple and efficient sparsity. 2021. arXiv:
word embedding: a survey of BERT. In: Proceedings of 2020 10th 2101.03961.
International Conference on Cloud Computing, Data Science & Engineering [100] Wu S, Zhao X, Yu T, Zhang R, Shen C, Liu H, et al. Yuan 1.0: large-scale pre-
(Confluence); 2020 Jan 29–31; Noida, India. 2020. p. 336–40. trained language model in zero-shot and few-shot learning. 2021. arXiv:
[75] Liu P, Yuan W, Fu J, Jiang Z, Hayashi H, Neubig G. Pre-train, prompt, and 2110.04725.
predict: a systematic survey of prompting methods in natural language [101] Du N, Huang Y, Dai AM, Tong S. Lepikhin D, Xu Y, et al. GLaM: efficient scaling
processing. 2021. arXiv:2107.13586. of language models with mixture-of-experts. 2021. arXiv: 2112.06905.
[76] Min B, Ross H, Sulem E, Veyseh APB, Nguyen TH, Sainz O, et al. Recent [102] Rae JW, Borgeaud S, Cai T, Millican K, Hoffmann J, Song F, et al. Scaling
advances in natural language processing via large pre-trained language language models: methods, analysis & insights from training gopher. 2021.
models: a survey. 2021. arXiv:2111.01243. arXiv: 2112.11446.
[77] Li J, Tang T, Zhao WX, Wen JR. Pretrained language models for text [103] Ding M, Yan Z , Hong W, Zheng W, Zhou C, Yin D, et al. CogView: mastering
generation: a survey. 2021. arXiv:2105.10311. text-to-image generation via transformers. 2021. arXiv: 2105.13290.
[78] Zaib M, Sheng QZ, Zhang W. A short survey of pre-trained language models [104] Lin J, Men R, Yang A, Zhou C, Ding M, Zhang Y, et al. M6: a Chinese
for conversational AI—a new age in NLP. In: Proceedings of the Australasian multimodal pretrainer. 2021. arXiv:2103.00823.
Computer Science Week Multiconference (ACSW’20); 2020 Feb 3–7; [105] Li W, Gao C, Niu G, Xiao X, Liu H, Liu J, et al. UNIMO: towards unified-modal
Melbourne, VIC, Australia. 2020. understanding and generation via cross-modal contrastive learning. In:
[79] Ramponi A, Plank B. Neural unsupervised domain adaptation in NLP—a Proceedings of the 59th Annual Meeting of the Association for Computational
survey. In: Proceedings of the 28th International Conference on Linguistics and the 11th International Joint Conference on Natural Language
Computational Linguistics; 2020 Dec 8–13; onine. 2020. p. 6838–55. Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 2592–607.
[80] Qiu XP, Sun TX, Xu YG, Shao YF, Dai N, Huang XJ. Pre-trained models for [106] Huo Y, Zhang M, Liu G, Lu H, Gao Y, Yang G, et al. WenLan: bridging vision and
natural language processing: a survey. Sci China Technol Sci 2020;63 language by large-scale multi-modal pre-training. 2021. arXiv:2103.06561.
(10):1872–97. [107] Zhang H, Yin W, Fang Y, Li L, Duan B, Wu Z, et al. ERNIE-ViLG: unified
[81] Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, von Arx S, et al. On the generative pre-training for bidirectional vision-language generation. 2021.
opportunities and risks of foundation models. 2021. arXiv:2108.07258. arXiv:2112.15283.
[82] Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, et al. Pre-trained models: past, [108] Huang Y, Cheng Y, Bapna A, Firat O, Chen D, Chen M, et al. GPipe: efficient
present and future. AI Open 2021;2:225–50. training of giant neural networks using pipeline parallelism. In: Proceedings
[83] Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. Language models are of the 33rd Conference on Neural Information Processing Systems (NeurIPS
unsupervised multitask learners. San Francisco: OpenAI; 2019. 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. 103–12.
[84] Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: [109] Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-
generalized autoregressive pretraining for language understanding. In: LM: training multi-billion parameter language models using model
Proceedings of the 33rd Conference on Neural Information Processing parallelism. 2019. arXiv:1909.08053.
Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. [110] Narayanan D, Shoeybi M, Casper J, LeGresley P, Patwary M, Korthikanti V,
5754–64. et al. Efficient large-scale language model training on GPU clusters using
[85] Taylor WL. ‘‘Cloze procedure”: a new tool for measuring readability. J Mass megatron-LM. In: Proceedings of the International Conference for High
Commun Q 1953;30(4):415–33. Performance Computing, Networking, Storage and Analysis (SC 21); 2021
[86] Du Z, Qian Y, Liu X, Ding M, Qiu J, Yang Z, et al. GLM: general language model Nov 14–19; St. Louis, MO, USA; 2021.
pretraining with autoregressive blank infilling. 2021. arXiv:2103.10360. [111] Bian Z, Liu H, Wang B, Huang H, Li Y, Wang C, et al. Colossal-AI: a unified deep
[87] Joshi M, Chen D, Liu Y, Weld DS, Zettlemoyer L, Levy O. SpanBERT: improving learning system for large-scale parallel training. 2021. arXiv:2110.14883.
pre-training by representing and predicting spans. Trans Assoc Comput [112] Shazeer N, Mirhoseini A, Maziarz K, Davis A, Le Q, Hinton G, et al.
Linguist 2020;8:64–77. Outrageously large neural networks: the sparsely-gated mixture-of-experts
[88] Song K, Tan X, Qin T, Lu J, Liu TY. MASS: masked sequence to sequence pre- layer. In: Proceedings of the 5th International Conference on Learning
training for language generation. In: Proceedings of the 36th International Representations (ICLR 2017); 2017 Apr 24–26; Toulon, France; 2017.
Conference on Machine Learning (ICML 2019); 2019 Jun 9–15; Long Beach, [113] Narang S, Diamos G, Elsen E, Micikevicius P, Alben J, Garcia D, et al. Mixed
CA, USA. 2019. p. 5926–36. precision training. In: Proceedings of the 6th International Conference on
[89] Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. BART: Learning Representations (ICLR 2018); 2018 Apr 30–May 3; Vancouver, BC,
denoising sequence-to-sequence pre-training for natural language Canada; 2018.
generation, translation, and comprehension. In: Proceedings of the 58th [114] Rajbhandari S, Rasley J, Ruwase O, He Y. ZeRO: memory optimizations toward
Annual Meeting of the Association for Computational Linguistics (ACL 2020); training trillion parameter models. In: Proceedings of the International
2020 Jul 5–10; online. 2020. p. 7871–80. Conference for High Performance Computing, Networking, Storage and
[90] Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: a robustly Analysis (SC 20); 2020 Nov 9–19; Atlanta, GA, USA; 2020.
optimized BERT pretraining approach. 2019. arXiv:1907.11692. [115] Kim Y. Convolutional neural networks for sentence classification. In:
[91] Dong L, Yang N, Wang W, Wei F, Liu X, Wang Y, et al. Unified language model Proceedings of the 2014 Conference on Empirical Methods in Natural
pre-training for natural language understanding and generation. In: Language Processing (EMNLP); 2014 Oct 25–29; Doha, Qatar. 2014. p.
Proceedings of the 33rd Conference on Neural Information Processing 1746–51.
Systems (NeurIPS 2019); 2019 Dec 9–14; Vancouver, BC, Canada. 2019. p. [116] Hu H, Richardson K, Xu L, Li L, Kübler S, Moss L. OCNLI: original Chinese
13042–54. natural language inference. In: Proceedings of the 2020 Conference on
[92] Bao H, Dong L, Wei F, Wang W, Yang N, Liu X, et al. UniLMv2: pseudo-masked Empirical Methods in Natural Language Processing (EMNLP); 2020 Nov 16–
language models for unified language model pre-training. In: Proceedings of 20; online. 2020. p. 3512–26.
the 37th International Conference on Machine Learning (ICML 2020); 2020 [117] Shao CC, Liu T, Lai Y, Tseng Y, Tsai S. DRCD: a Chinese machine reading
Jul 12–18; online. 2020. p. 642–52. comprehension dataset. 2018. arXiv:1806.00920..
[93] Xiao D, Zhang H, Li Y, Sun Y, Tian H, Wu H, et al. ERNIE-GEN: an enhanced [118] Schick T, Schütze H. Exploiting cloze-questions for few-shot text
multi-flow pre-training and fine-tuning framework for natural language classification and natural language inference. In: Proceedings of the 16th
generation. In: Proceedings of the 29th International Joint Conference on Conference of the European Chapter of the Association for Computational
Artificial Intelligence (IJCAI); 2021 Jan 7–15; Yokohama, Japan. 2021. p. Linguistics: Main Volume; 2021 Apr 19–23; online. 2021. p. 255–69.
3997–4003. [119] Gray S, Radford A, Kingma DP. GPU kernels for block-sparse weights. 2017.
[94] Zhang J, Zhao Y, Saleh M, Liu P. PEGASUS: pre-training with extracted gap- arXiv:1711.09224.
sentences for abstractive summarization. In: Proceedings of the 37th [120] Lin H, Lu Y, Han X, Sun L. Sequence-to-nuggets: nested entity mention
International Conference on Machine Learning (ICML 2020); 2020 Jul 12– detection via anchor-region networks. In: Proceedings of the 57th Annual
18; online. 2020. p. 11328–39. Meeting of the Association for Computational Linguistics (ACL 2019); 2019
[95] Rosset C. Turing-NLG: a 17-billion-parameter language model by Microsoft Jul 28–Aug 2; Florence, Italy. 2019. p. 5182–92.
[Internet]. Redmond: Microsoft; 2020 Feb 13 [cited 2021 Nov 4]. Available [121] Lin Y, Meng Y, Sun X, Han Q, Kuang K, Li J, et al. BertGCN: transductive text
from: https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17- classification by combining GCN and BERT. 2021. arXiv: 2105.05727.
billion-parameter-language-model-by-microsoft/. [122] Zhang R, Tetreault J. This email could save your life: introducing the task of
[96] Kim B, Kim HS, Lee SW, Lee G, Kwak D, Hyeon JD, et al. What changes can email subject line generation. In: Proceedings of the 57th Annual Meeting of
large-scale language models bring? Intensive study on HyperCLOVA: billions- the Association for Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2;
scale Korean generative pretrained transformers. In: Proceedings of the 2021 Florence, Italy. 2019. p. 446–56.
Conference on Empirical Methods in Natural Language Processing (EMNLP); [123] Zhou H, Zheng C, Huang K, Huang M, Zhu X. KdConv: a Chinese multi-domain
2021 Nov 7–11; online. 2021. p. 3405–24. dialogue dataset towards multi-turn knowledge-driven conversation. In:

63
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

Proceedings of the 58th Annual Meeting of the Association for Computational Language Resources and Evaluation (LREC 2016); 2016 May 23–28; Portorož,
Linguistics (ACL 2020); 2020 Jul 5–10; online. 2020. p. 7098–108. Slovenia. 2016. p. 1593–600.
[124] Cho J, Seo M, Hajishirzi H, et al. Mixture content selection for diverse [150] Simonyan K, Vedaldi A, Zisserman A. Deep inside convolutional networks:
sequence generation. 2019. arXiv:1909.01953. visualising image classification models and saliency maps. In: Proceedings of
[125] Ribeiro LFR, Zhang Y, Gardent C, Gurevych I. Modeling global and local node the 2nd International Conference on Learning Representations (ICLR 2014);
contexts for text generation from knowledge graphs. Trans Assoc Comput 2014 Apr 14–16; Banff, AB, Canada; 2014.
Linguist 2020;8:589–604. [151] Hewitt J, Manning CD. A structural probe for finding syntax in word
[126] Zhang Y, Sun S, Galley M, Chen YC, Brockett C, Gao X, et al. DialoGPT: large- representations. In: Proceedings of the 2019 Conference of the North
scale generative pre-training for conversational response generation. In: American Chapter of the Association for Computational Linguistics: Human
Proceedings of the 58th Annual Meeting of the Association for Computational Language Technologies; 2019 Jun 2–7; Minneapolis, MN, USA. 2019. p. 4129–38.
Linguistics: System Demonstrations (ACL 2020); 2020 Jul 5–10; online. 2020. [152] Jawahar G, Sagot B, Seddah D. What does BERT learn about the structure of
p. 270–8. language? In: Proceedings of the 57th Annual Meeting of the Association for
[127] Adiwardana D, Luong MT, So DR, Hall J, Fiedel N, Thoppilan R, et al. Towards a Computational Linguistics (ACL 2019); 2019 Jul 28–Aug 2; Florence, Italy.
human-like open-domain chatbot. 2020. arXiv:2001.09977. 2019. p. 3651–7.
[128] Roller S, Dinan E, Goyal N, Ju D, Williamson M, Liu Y, et al. Recipes for [153] Linzen T, Dupoux E, Goldberg Y. Assessing the ability of LSTMs to learn
building an open-domain chatbot. In: Proceedings of the 16th Conference of syntax-sensitive dependencies. Trans Assoc Comput Linguist 2016;4:521–35.
the European Chapter of the Association for Computational Linguistics: Main [154] Ribeiro MT, Singh S, Guestrin C. ‘‘Why should I trust you?” explaining the
Volume; 2021 Apr 19–23; online. 2021. p. 300–25. predictions of any classifier. In: Proceedings of the 2016 Conference of the
[129] DuerOS [Internet]. Beijing: Baidu; c2017 [cited 2021 Nov 4]. Available from: North American Chapter of the Association for Computational Linguistics:
https://dueros.baidu.com/en/index.html. Human Language Technologies; 2016 Jun 12–17; San Diego, CA, USA. 2016. p.
[130] Bao S, He H, Wang F, Wu H, Wang H, Wu W, et al. PLATO-2: towards building 1135–44.
an open-domain chatbot via curriculum learning. In: Proceedings of the 59th [155] Davison J, Feldman J, Rush AM. Commonsense knowledge mining from
Annual Meeting of the Association for Computational Linguistics and the pretrained models. In: Proceedings of the 2019 Conference on Empirical
11th International Joint Conference on Natural Language Processing (ACL- Methods in Natural Language Processing and the 9th International Joint
IJCNLP 2021); 2021 Aug 1–6; online. 2021. p. 2513–25. Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov 3–7;
[131] Gunasekara C, Kim S, D’Haro LF, Rastogi A, Chen YN, Eric M, et al. Overview of Hong Kong, China. 2019. p. 1173–8.
the ninth dialog system technology challenge: DSTC9. 2020. [156] Petroni F, Rocktäschel T, Riedel S, Lewis P, Bakhtin A, Wu Y, et al. Language
arXiv:2011.06486. models as knowledge bases? In: Proceedings of the 2019 Conference on
[132] Bao S, He H, Wang F, Wu H, Wang H, Wu W, et al. PLATO-XL: exploring the Empirical Methods in Natural Language Processing and the 9th International
large-scale pre-training of dialogue generation. 2021. arXiv:2109.09519. Joint Conference on Natural Language Processing (EMNLP-IJCNLP); 2019 Nov
[133] Wang Y, Ke P, Zheng Y, Huang K, Jiang Y, Zhu X, et al. A large-scale Chinese 3–7; Hong Kong, China. 2019. p. 2463–73.
short-text conversation dataset. In: Proceedings of the 9th CCF International [157] Talmor A, Elazar Y, Goldberg Y, Berant J. oLMpics-on what language model
Conference on Natural Language Processing and Chinese Computing (NLPCC pre-training captures. Trans Assoc Comput Linguist 2020;8:743–58.
2020); 2020 Oct 14–18; Zhengzhou, China. 2020. p. 91–103. [158] Morris JX, Lifland E, Yoo JY, Grigsby J, Jin D, Qi Y. TextAttack: a framework for
[134] Qi W, Gong Y, Yan Y, Xu C, Yao B, Zhou B, et al. ProphetNet-X: large-scale pre- adversarial attacks, data augmentation, and adversarial training in NLP. 2020.
training models for English, Chinese, multi-lingual, dialog, and code arXiv:2005.05909.
generation. 2021. arXiv:2104.08006. [159] Jia J, Liu Y, Gong NZ. BadEncoder: backdoor attacks to pre-trained encoders in
[135] Zhou H, Ke P, Zhang Z, Gu Y, Zheng Y, Zheng C, et al. EVA: an open-domain self-supervised learning. 2021. arXiv:2108.00352.
Chinese dialogue system with large-scale generative pre-training. 2021. [160] Devlin J. Google-research/bert [Internet]. GitHub; 2018 Oct 11 [cited 2021
arXiv:2108.01547. Nov 4]. Available from: https://github.com/google-research/bert.
[136] Vinyals O, Le Q. A neural conversational model. 2015. arXiv:1506.05869. [161] Baidu Ernie Team. Paddlepaddle/ernie [Internet]. GitHub; 2019 Apr 19 [cited
[137] Serban I, Sordoni A, Bengio Y, Courville A, Pineau J. Building end-to-end 2021 Nov 4]. Available from: https://github.com/PaddlePaddle/ERNIE.
dialogue systems using generative hierarchical neural network models. [162] Huawei. Pcl-platform.intelligence/pangu-alpha [Internet]. San Francisco:
In: Proceedings of the 30th AAAI Conference on Artificial Intelligence; OpenAI; 2021 Apr 26 [cited 2021 Nov 4]. Available from: https://git.openi.
2016 Feb 12–17; Phoenix, AZ, USA. Palo Alto: AAAI Press; 2016. p. org.cn/PCL-Platform.Intelligence/PanGu-Alpha.
3776–83. [163] Ding S, Shang J, Wang S, Sun Y, Tian H, Wu H, et al. ERNIE-Doc: a retrospective
[138] Worswick S. ‘‘Mitsuku wins loebner prize 2018!” [Internet]. Medium; 2018 long-document modeling transformer. In: Proceedings of the 59th Annual
Sep 13 [cited 2021 Nov 4]. Available from: https://medium.com/ Meeting of the Association for Computational Linguistics and the 11th
pandorabots-blog/mitsuku-wins-loebner-prize-2018-3e8d98c5f2a7. International Joint Conference on Natural Language Processing (ACL-IJCNLP
[139] Zhou L, Gao J, Li D, Shum HY. The design and implementation of XiaoIce, an 2021); 2021 Aug 1–6; online. 2021. p. 2914–27.
empathetic social chatbot. Comput Linguist 2020;46(1):53–93. [164] Huggingface [Internet]. Hugging Face; 2020 Apr 26 [cited 2021 Nov 4].
[140] Xin J, Tang R, Lee J, Yu Y, Lin J. DeeBERT: dynamic early exiting for Available from: https://huggingface.co.
accelerating BERT inference. In: Proceedings of the 58th Annual Meeting of [165] Ott M, Edunov S, Baevski A, Fan A, Gross S, Ng N, et al. FAIRSEQ: a fast,
the Association for Computational Linguistics (ACL 2020); 2020 Jul 5–10; extensible toolkit for sequence modeling. In: Proceedings of the 2019
online. 2020. p. 2246–51. Conference of the North American Chapter of the Association for
[141] Houlsby N, Giurgiu A, Jastrzebski S, Morrone B, Laroussilhe QD, Gesmundo A, Computational Linguistics: Human Language Technologies
et al. Parameter-efficient transfer learning for NLP. In: Proceedings of the (Demonstrations); 2019 Jun 2–7; Minneapolis, MN, USA. 2019. p. 48–53.
36th International Conference on Machine Learning (ICML 2019); 2019 Jun [166] Baidu PaddlePaddle Team. Paddlepaddle/paddlenlp [Internet]. GitHub; 2020
9–15; Long Beach, CA, USA. 2019. p. 2790–9. Nov 16 [cited 2021 Nov 4]. Available from: https://github.com/PaddlePaddle/
[142] Li XL, Liang P. Prefix-tuning: optimizing continuous prompts for generation. PaddleNLP.
In: Proceedings of the 59th Annual Meeting of the Association for [167] Wenxin ernie [Internet]. Beijing: Baidu; c2021 [cited 2021 Nov 4]. Available
Computational Linguistics and the 11th International Joint Conference on from: https://wenxin.baidu.com.
Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. [168] Alibaba Damo Academy. AliceMind [Internet]. Aliyuncs; c2021 [cited 2021
p. 4582–97. Nov 4]. Available from: https://alicemind.aliyuncs.com.
[143] Gao T, Fisch A, Chen D. Making pre-trained language models better few-shot [169] Openai API [Internet]. San Francisco: OpenAI; c2021 [cited 2021 Nov 4].
learners. In: Proceedings of the 59th Annual Meeting of the Association for Available from: https://openai.com/api.
Computational Linguistics and the 11th International Joint Conference on [170] Garg Y, Chatterjee N. Sentiment analysis of twitter feeds. In: Proceedings of
Natural Language Processing (ACL-IJCNLP 2021); 2021 Aug 1–6; online. 2021. the 3rd International Conference on Big Data Analytics (BDA 2014); 2014 Dec
p. 3816–30. 20–23; New Delhi, India. 2014. p. 33–52.
[144] Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient [171] AlQahtani ASM. Product sentiment analysis for amazon reviews. Int J Comput
prompt tuning. In: Proceedings of the 2021 Conference on Empirical Methods Sci Inf Technol 2021;13(3):15–30.
in Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. [172] Singh M, Jakhar AK, Pandey S. Sentiment analysis on the impact of
3045–59. coronavirus in social life using the BERT model. Soc Netw Anal Min
[145] Liu X, Zheng Y, Du Z, Ding M, Qian Y, Yang Z, et al. GPT understands, too. 2021;11:33.
2021. arXiv:2103.10385. [173] Chen Z, Sokolova M. Sentiment analysis of the COVID-related r/Depression
[146] Han X, Zhao W, Ding N, Liu Z, Sun M. PTR: prompt tuning with rules for text posts. 2021. arXiv:2108.06215.
classification. 2021. arXiv:2105.11259. [174] Liu Y, Liu J, Chen L, Lu Y, Feng S, Feng Z, et al. ERNIE-SPARSE: learning
[147] Doshi-Velez F, Kim B. Towards a rigorous science of interpretable machine hierarchical efficient transformer through regularized self-attention. 2022.
learning. 2017. arXiv:1702.08608. arXiv:2203.12276.
[148] Wallace E, Feng S, Kandpal N, Gardner M, Singh S. Universal adversarial [175] Jwa H, Oh D, Park K, Kang JM, Lim H. exBAKE: automatic fake news detection
triggers for attacking and analyzing NLP. In: Proceedings of the 2019 model based on bidirectional encoder representations from transformers
Conference on Empirical Methods in Natural Language Processing and the (BERT). Appl Sci 2019;9(19):4062.
9th International Joint Conference on Natural Language Processing (EMNLP- [176] Soares LB, FitzGerald N, Ling J, Kwiatkowski T. Matching the blanks:
IJCNLP); 2019 Nov 3–7; Hong Kong, China. 2019. p. 2153–62. distributional similarity for relation learning. 2019. arXiv:1906.03158.
[149] Fort K, Couillault A. Yes, we care! Results of the ethics and natural language [177] Wang Z, Xu Y, Cui L, Shang J, Wei F. LayoutReader: pre-training of text and
processing surveys. In: Proceedings of the 10th International Conference on layout for reading order detection. In: Proceedings of the 2021 Conference on

64
H. Wang, J. Li, H. Wu et al. Engineering 25 (2023) 51–65

Empirical Methods in Natural Language Processing (EMNLP); 2021 Nov 7–11; www.wired.com/sponsored/story/meet-the-ai-powering-todays-smartest-
online. 2021. p. 4735–44. smartphones.
[178] gpt-2-for-the-advertising-industry [Internet]. San Francisco: OpenAI; 2017 [187] Nayak P. Understanding searches better than ever before [Internet]. Google;
Aug 1 [cited 2021 Nov 4]. Available from: https://www.narrativa.com/gpt-2- [cited 2021 Nov 4]. Available from: https://blog.google/products/search/
for-the-advertising-industry. search-language-understanding-bert/.
[179] Agarwal R, Kann K. Acrostic poem generation. In: Proceedings of the 2020 [188] Sun Y, Wang S, Li Y, Feng S, Tian H, Wu H, et al. ERNIE 2.0: a continual pre-
Conference on Empirical Methods in Natural Language Processing (EMNLP); training framework for language understanding. In: Proceedings of the 34th
2020 Nov 16–20; online. 2020. p. 1230–40. AAAI Conference on Artificial Intelligence; 2020 Feb 7–12; New York City, NY,
[180] Lee DH, Hu Z, Lee RKW. Improving text auto-completion with next phrase USA. Palo Alto: AAAI Press; 2020. p. 8968–75.
prediction. In: Proceedings of the 2021 Conference on Empirical Methods in [189] Liu Y, Lu W, Cheng S, Shi D, Wang S, Cheng Z, et al. Pre-trained language
Natural Language Processing (EMNLP); 2021 Nov 7–11; online. 2021. p. model for web-scale retrieval in Baidu Search. In: Proceedings of the 27th
4434–8. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD
[181] Mukherjee S, Mukherjee S, Hasegawa M, Awadallah AH, White R. Smart to- 21); 2021 Aug 14–18; online. 2021. p. 3365–75.
do: automatic generation of to-do items from emails. In: Proceedings of the [190] Huang JT, Sharma A, Sun S, Xia L, Zhang D, Pronin P, et al. Embedding-based
58th Annual Meeting of the Association for Computational Linguistics (ACL retrieval in Facebook Search. In: Proceedings of the 26th ACM SIGKDD
2020); 2020 Jul 5–10; online. 2020. p. 8680–9. Conference on Knowledge Discovery and Data Mining (KDD 20); 2020 Jul 6–
[182] What are Alexa Built-in Devices? [Internet]. Seattle: Amazon; c2010–2023 10; online. 2020. p. 2553–61.
[cited 2021 Nov 4]. Available from: https://developer.amazon.com/alexa- [191] Yu P, Fei H, Li P. Cross-lingual language model pretraining for retrieval. In:
voice-service. Proceedings of the Web Conference; 2021 Apr 19–23; online. 2021. p.1029–
[183] Mari A. Voice commerce: understanding shopping-related voice assistants 39.
and their effect on brands. In: Proceedings of the International Media [192] Ni M, Huang H, Su L, Cui E, Bharti T, Wang L, et al. M3P: learning universal
Management Academic Association Annual Conference; 2019 Oct 4–6; Doha, representations via multitask multilingual multimodal pre-training. In:
Qatar; 2019. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
[184] Google assistant update speech recognition name pronunciation BERT smart Recognition (CVPR); 2021 Jun 19–25; online. 2021. p. 3977–86.
speakers [Internet]. NDTV; 2021 Apr 29 [cited 2021 Nov 4]. Available [193] Sanh V, Debut L, Chaumond J, Wolf T. DistilBERT, a distilled version of BERT:
from: https://gadgets.ndtv.com/apps/news/google-assistant-update-speech- smaller, faster, cheaper and lighter. 2019. arXiv:1910.01108.
recognition-name-pronunciation-bert-smart-speak. [194] Gordon MA, Duh K, Andrews N. Compressing BERT: studying the effects of
[185] Vincent J. The future of AI is a conversation with a computer [Internet]. New weight pruning on transfer learning. In: Proceedings of the 5th Workshop on
York City: The Verge; 2021 Nov 1 [cited 2021 Nov 4]. Available from: https:// Representation Learning for NLP; 2020 Jul 9; Seattle, WA, USA. 2020. p. 143–
www.theverge.com/22734662/ai-language-artificial-intelligence-future- 55.
models-gpt-3-limitations-bias/. [195] Kim S, Gholami A, Yao Z, Mahoney MW, Keutzer K. I-BERT: integer-only BERT
[186] Meet the AI powering today’s smartest smartphones [Internet]. San quantization. In: Proceedings of the 38th International Conference on
Francisco: Wired; 2017 Aug 1 [cited 2021 Nov 4]. Available from: https:// Machine Learning (ICML 2021); 2021 Jul 18–24; online. 2021. p. 5506–18.

1 s2.0 S2095809922006324 Main
No ratings yet
1 s2.0 S2095809922006324 Main
20 pages
Advancement in NLP Paper
No ratings yet
Advancement in NLP Paper
49 pages
Recent Advances in Natural Language Processing Via Large Pre-Trained Language Models-A Survey
No ratings yet
Recent Advances in Natural Language Processing Via Large Pre-Trained Language Models-A Survey
40 pages
Pre-Trained Models For Natural Language Processing: A Survey
No ratings yet
Pre-Trained Models For Natural Language Processing: A Survey
31 pages
Qiu et al. - 2020 - Pre-trained Models for Natural Language Processing
No ratings yet
Qiu et al. - 2020 - Pre-trained Models for Natural Language Processing
28 pages
Downloed Papers
No ratings yet
Downloed Papers
700 pages
Kalyan 1 s2.0 S2949719123000456 Main
No ratings yet
Kalyan 1 s2.0 S2949719123000456 Main
48 pages
(2303.18223) A Survey of Large Language Models
No ratings yet
(2303.18223) A Survey of Large Language Models
115 pages
duan2020
No ratings yet
duan2020
6 pages
Trend
No ratings yet
Trend
47 pages
paper-1
No ratings yet
paper-1
44 pages
Rishabh Sharma (Anantika Johari)
No ratings yet
Rishabh Sharma (Anantika Johari)
8 pages
2108.05542
No ratings yet
2108.05542
42 pages
Transfer Learning in Natural Language Processing PDF
0% (1)
Transfer Learning in Natural Language Processing PDF
238 pages
(IJETA-V11I3P37) :anantika Johari, Rishabh Sharma, Aanchal Meena, Vansh Tiwari
No ratings yet
(IJETA-V11I3P37) :anantika Johari, Rishabh Sharma, Aanchal Meena, Vansh Tiwari
9 pages
LLM_book_8_42
No ratings yet
LLM_book_8_42
35 pages
LLM_test_v1_p8_12
No ratings yet
LLM_test_v1_p8_12
5 pages
14-LookingForward
No ratings yet
14-LookingForward
48 pages
A Survey Large Language Models
No ratings yet
A Survey Large Language Models
58 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
58 pages
Survey On Large Language Models
No ratings yet
Survey On Large Language Models
52 pages
2503.01159v1
No ratings yet
2503.01159v1
55 pages
Large Language Models: A Survey
No ratings yet
Large Language Models: A Survey
43 pages
LLM Survey
100% (1)
LLM Survey
43 pages
Perspectives in Business Ethics
No ratings yet
Perspectives in Business Ethics
113 pages
Investigating Masking-Based Data Generation in Language Models
No ratings yet
Investigating Masking-Based Data Generation in Language Models
8 pages
2023 07 28 Evolution of Language Models
No ratings yet
2023 07 28 Evolution of Language Models
73 pages
ChatGPT in The Age of Generative AI and Large Lang
No ratings yet
ChatGPT in The Age of Generative AI and Large Lang
60 pages
N19-1213
No ratings yet
N19-1213
7 pages
Overview of The Transformer-Based Models For NLP Tasks
No ratings yet
Overview of The Transformer-Based Models For NLP Tasks
5 pages
1719720399971
No ratings yet
1719720399971
51 pages
Overview of Training LLMs
No ratings yet
Overview of Training LLMs
31 pages
Bert
No ratings yet
Bert
20 pages
Pars BERT
No ratings yet
Pars BERT
10 pages
Chapter 12
No ratings yet
Chapter 12
16 pages
2005 14165v3 PDF
No ratings yet
2005 14165v3 PDF
74 pages
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
No ratings yet
Augmenting LLMs With Knowledge - A Survey On Hallucination Prevention
11 pages
Bert
No ratings yet
Bert
10 pages
2022-foundations-tutorial3-sunwang-deeplearning4nlp
No ratings yet
2022-foundations-tutorial3-sunwang-deeplearning4nlp
103 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
140 pages
A Simple Survey of Pre-Trained Language Models
No ratings yet
A Simple Survey of Pre-Trained Language Models
7 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
140 pages
25636-1454-21112-2-10-20230927
No ratings yet
25636-1454-21112-2-10-20230927
4 pages
Large Language Model
0% (1)
Large Language Model
38 pages
Recent Advances of Foundation Language Models-Based Continual Learning - A Survey
No ratings yet
Recent Advances of Foundation Language Models-Based Continual Learning - A Survey
40 pages
A Survey of Large Language Models
No ratings yet
A Survey of Large Language Models
144 pages
paper_review
No ratings yet
paper_review
6 pages
Survey On LLM
No ratings yet
Survey On LLM
9 pages
Foundations of Large Language Models 1738142777
No ratings yet
Foundations of Large Language Models 1738142777
101 pages
Foundations of LLM
No ratings yet
Foundations of LLM
231 pages
LLM_book_43-102
No ratings yet
LLM_book_43-102
60 pages
ChatGPT KZ Feb2023 PDF
No ratings yet
ChatGPT KZ Feb2023 PDF
7 pages
All NLP Tasks Are Generation Tasks: A General Pretraining Framework
No ratings yet
All NLP Tasks Are Generation Tasks: A General Pretraining Framework
14 pages
cs224n 2023 Lecture9 Pretraining
No ratings yet
cs224n 2023 Lecture9 Pretraining
54 pages
Large Language Model Using Tensorflow: A Complete TensorFlow Implementation Guide for Modern AI Development
From Everand
Large Language Model Using Tensorflow: A Complete TensorFlow Implementation Guide for Modern AI Development
Aarav Joshi
No ratings yet
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Constrained Conditional Model: Fundamentals and Applications
From Everand
Constrained Conditional Model: Fundamentals and Applications
Fouad Sabry
No ratings yet
Advanced Deep Learning Techniques for Natural Language Understanding: A Comprehensive Guide
From Everand
Advanced Deep Learning Techniques for Natural Language Understanding: A Comprehensive Guide
Adam Jones
No ratings yet
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
From Everand
Gensim for Natural Language Processing: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet

Pre Trained Models For NLP

Uploaded by

Pre Trained Models For NLP

Uploaded by

Engineering 25 (2023) 51–65

Contents lists available at ScienceDirect

Pre-Trained Language Models and Their Applications

ZeRO: zero redundancy optimizer; MoE: mixture-to-expert.

2.2.2.2. Sparse models. The sparsely gated MoE model [112]

2.2.2.3. Other efficient training strategies. Recent techniques for

example, WenLan [106] developed two real-world applications References

You might also like