0% found this document useful (0 votes)

8 views

A Survey On Transformers in NLP With Focus On Efficiency

la receta de mi polla organizada entre mi cojon izquierdo y mi cojon derecho

Uploaded by

miidmrcrkcmix

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

8 views

A Survey On Transformers in NLP With Focus On Efficiency

la receta de mi polla organizada entre mi cojon izquierdo y mi cojon derecho

Uploaded by

miidmrcrkcmix

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

A S URVEY ON T RANSFORMERS IN NLP WITH F OCUS ON

E FFICIENCY

Wazib Ansar Saptarsi Goswami Amlan Chakrabarti

A. K. Choudhury School of IT Department of Computer Science A. K. Choudhury School of IT
University of Calcutta, Kolkata, India Bangabasi Morning College University of Calcutta
arXiv:2406.16893v1 [cs.CL] 15 May 2024

[email protected] Kolkata, India Kolkata, India

[email protected] [email protected]

A BSTRACT
The advent of transformers with attention mechanisms and associated pre-trained models have
revolutionized the field of Natural Language Processing (NLP). However, such models are resource-
intensive due to highly complex architecture. This limits their application to resource-constrained
environments. While choosing an appropriate NLP model, a major trade-off exists over choosing
accuracy over efficiency and vice versa. This paper presents a commentary on the evolution of NLP
and its applications with emphasis on their accuracy as-well-as efficiency. Following this, a survey
of research contributions towards enhancing the efficiency of transformer-based models at various
stages of model development along with hardware considerations has been conducted. The goal of
this survey is to determine how current NLP techniques contribute towards a sustainable society and
to establish a foundation for future research.

Keywords Attention Mechanism · Efficiency Considerations · LLM · Natural Language Processing (NLP) ·
Transformers

1 Introduction

In recent years, there has been phenomenal evolution in the field of Natural Language Processing (NLP). This has
been based upon a lot of research works over time upon tasks ranging from sentiment analysis [1], misinformation
detection [2, 20, 27], machine translation [3, 4], text summarization [5] to question-answering [6]. These works have
contributed towards addressing the limitations posed by preceding works as-well-as increasing their performance. The
driving force behind this progress has been deep learning techniques, particularly Transformers [7] and associated pre-
trained models like Bidirectional Encoder Representations from Transformers (BERT) [8], XLNet [9], Bidirectional and
Auto-Regressive Transformers (BART) [10], Generative-Pre-trained Transformer (GPT) [11] along with its successors
i.e. GPT-2 [12] and GPT-3 [13]. These advancements have empowered NLP models to perform complex linguistic
tasks relating to understanding natural language and even generating responses as a human would provide [14].
The current research accomplishments in NLP have been enabled through the availability of voluminous textual
data, sophisticated deep learning models, and high-end computing resources. As the complexity of the models rises,
computing resources requirement for such models surge exponentially [15]. With the apparent deceleration of Moore’s
Law1 , increasing the performance of algorithms comes at the cost of straining the computing resources along with
faltering efficiency. This leads to high energy requirements translating into a hike in carbon emissions [16,17]. Therefore,
the need of the hour is to think out of the box and devise sustainable methodologies that can keep the performance
growth rate steady while at the same time being efficient enough to be practically applicable to resource-constrained
environments like mobile and edge devices [18].
The term "efficiency" of a deep learning model in NLP can be generically defined as the trade-off between the
performance and the cost factors. Thus, the goal of efficient modeling lies in achieving pareto-improvement by reducing
1
https://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
the training as-well-as inference cost for a model to achieve a benchmark level of performance [108]. The cost
factors include the number of Floating-point Operations (FlOps) [19], inference time [141], model size [18], speed-up
ratio [141], number of model parameters [18], energy consumption [16], and carbon emissions [17]. The efforts to
enhance efficiency can be directed at various stages of model development, i.e. data curation, text representation, model
design, and model compression. Efficiency can also be achieved by designing optimal hardware and maximizing its
utilization. Thus, achieving efficiency improvement of a language model in NLP is a nuanced task full of challenges.
Despite the challenges, plenty of developments towards efficiency improvement of models in NLP have taken place.
Besides, numerous surveys have tried to summarize these developments to serve as a stepping stone for prospective
researchers who wish to contribute to this domain. Bannour and Ligozat [151] performed a systematic review of the
carbon footprint by NLP models. They identified the tools and studied their accuracy as-well-as applicability for
assessing the energy consumption and carbon emissions of contemporary NLP models. However, the study was limited
only to assessing the environmental impact of NLP models for a single task of named entity recognition. Khadivi
and Sato [150] performed bibliometric analysis on NLP papers published between 2002 to 2021 based on factors like
growth-rate, doubling-time, and collaboration among authors. Given the developments, they predicted the research
trend and future directions. Koubaa et al. [153] performed a critical review on ChatGPT by discussing the supporting
concepts, competing technologies along with its applications. Treviso et al. [19] performed a literature review on
efficient approaches in NLP focusing on data processing, model design as-well-as hardware utilization. The paper
primarily confers the theoretical narratives with a comparative analysis of results from the viewpoint of efficiency is
missing. Xu and McAuley [141] presented a review of model compression and acceleration techniques with a discussion
on associated metrics for efficiency evaluation. Their study was limited only to pre-trained models and did not account
for data-efficiency, parameter-efficiency, or hardware-design considerations. Tay et al. [155] presented a taxonomy
of efficient transformer models in NLP in the form of a literature review. Even though, the survey is extensive, a
comprehensive coverage of the works in the given domain is lacking. Xipeng et al. [119] survey pre-trained models in
NLP with emphasis on model categories, pre-training objectives, fine-tuning, and downstream tasks. However, they do
not focus on the efficiency considerations of the models.
To address the shortcomings of the previous surveys, we augment the body of knowledge with a first-of-its-kind
systematic literature review on efficient transformer-based models in NLP. Firstly, it presents a primer on NLP, its
applications, and the evolution of NLP techniques. Then, it performs an extensive as-well-as exhaustive study focusing
on all stages of model development to achieve efficiency ranging from data curation to model design involving pre-
training, fine-tuning, prompt engineering to inferencing. Not only does it explore software improvements but also
efficient hardware developments accompanied with software-hardware co-designing approaches. It furnishes qualitative
as-well-as quantitative evolution of NLP models and weighs the efficacy of the models in terms of their efficiency.
Lastly, it analyzes the trend of research and presents the future directions. This survey paper targets researchers,
professionals as well as scholars interested in the NLP particularly transformer-based Large Language Models (LLMs),
and wish to design efficient leaner models to bring down the overall computational budget or for deployment on devices
with low computational resources. The key contributions of this paper are as follows:

1. We conduct a comprehensive study on transformer-based models in NLP emphasizing all stages of model
development.
2. This paper presents a qualitative and quantitative analysis of software as-well-as hardware-based contributions
related to transformers in NLP from the perspective of efficiency.
3. Finally, based on the review of existing works, we perceive the trend of developments in NLP and present a
road-map to achieve pareto-optimality.

The remainder of this paper has been organized in the following manner. Section 2 puts forth the methodology adopted
for this survey. Section 3 enunciates the domain of NLP, its applications, and the evolution of NLP based on the
programming paradigms. Section 4 elucidates the concepts associated with transformers and various modeling stages
in transformer-based LLMs. Section 5 showcases the developments towards efficient modeling. Section 6 presents
the results of this survey in the form of statistical insights, trends of research, and ushers the future scope. Finally, in
section 7, the conclusions are drawn.

2 Survey Methodology
The foundation of research lies in the critical review of existing literature and analysis of the previous results obtained
through related works. It can serve several purposes, including presenting the information that is currently available about
a term or concept, mapping the history of developments, determining connections between related concepts, assessing
the evidence supporting any proposition, or demonstrating why a problem merits more investigation [147]. Irrespective

2
Figure 1: PRISMA for the survey methodology

of the field of study, there have been various typologies of surveys distinguished by certain characteristics [149].
Bibliometric analysis is a kind of survey utilizing article details like journal/ conference name, publication date, and
citations as-well-as author details like name, affiliation, and collaborations to assess the developments and trends in a
field of study from a statistical perspective [150]. A systematic review gathers and summarizes the findings of research
works on a given subject that satisfy the standards of scientific credibility and pertinence to form a set of research
questions and answer them [151]. The Preferred reporting items for systematic reviews and meta-analyses, i.e. the
PRISMA statement comprises a 27-item checklist for systematic reviews [148]. A systematic mapping describes and
catalogs the existing information on a topic or question of interest rather than attempting to provide a response to a
particular question [152]. A literature review seeks to uncover important ideas, hypotheses, and research findings as
well as knowledge gaps. It makes an effort to go over the claims and conclusions from earlier research in a narrative,
chronological order [154]. A critical review examines certain concepts, themes, or theoretical viewpoints found in the

3
body of contemporary works. It provides more of a reflection and critique of the concept under consideration [153].
However, it often inculcates bias due to the contextualization of the previous works by authors concerning their
propositions. After examining the merits and demerits of the survey typologies, we conclude that given the topic of
our survey and the developments in the given field, a combination of literature review and systematic review would be
the best option. Hence, we adopt a systematic literature review as the survey methodology for this paper. This would
enable us to elucidate the vital concepts related to transformers NLP with associated developments from the perspective
of efficiency in a systematic manner. We formulate the following research questions and attempt to address them in this
survey:

RQ1: What are the applications of NLP, and what kinds of techniques are used to perform such applications?
RQ2: What are transformer-based models and how transformers are utilized in NLP LLMs?
RQ3: What is the efficiency vs efficacy trade-off in transformer-based LLMs?
RQ4: What efficiency measures are present for NLP models?
RQ5: What efficiency considerations are there for transformer-based NLP models?
RQ6: Which stages of model development can be targeted for efficiency enhancement?
RQ7: What is the current research trend in NLP and to what extent efficiency considerations are going to be prevalent
in the near future?

The survey comprised of original as-well-as review articles written in the English language published in digital
libraries like Google Scholar2 , ACM Digital Library3 , IEEE Xplore, Semantic Scholar4 and Science Direct5 . The
articles were retrieved using keywords related to NLP and its associated terms like "NLP", "pre-trained models",
"LLM", "transformers", "embedding", "pre-training", "fine-tuning", "prompt engineering", "sustainability in NLP" and
"efficiency modeling in NLP". Articles published between 2000 to 2023 were included with a prime focus on articles
published since 2016. The papers that were recently published were given preference. Besides, some pioneering works
were included irrespective of the year in which they were published. Journal publications were chosen over conference
papers where two or more articles were found to share the same subject or methodology. When choosing the papers, the
journal’s impact factor and citations were taken into account. The duplicate articles were removed if it was observed that
the writers had written similar works. At first, 3,210 articles were retrieved. Out of which 214 duplicates were removed.
A spreadsheet application was utilized to store and process article metadata like "title", "abstract", "reference", "author
list", "year of publication", "name of journal or conference" and "number of citations". Then, 1,990 articles were
filtered out through statistical examination based on the above-mentioned criteria. From the remaining 1,006 articles,
312 articles were short-listed after going through the title and abstract. Finally, the full-text screening of the short-listed
articles was performed leading to 151 articles to be included in the survey for this paper. To ensure a scientific and
systematic structure of content, the study has been separated into several coherent pieces depending on significance.

3 Overview of NLP
NLP is a research domain concerned with providing computing devices the ability to comprehend and process input text
in natural language understood by human beings. Using various NLP approaches one can extract meaningful information
from an unstructured text corpus and even synthesize outputs in natural language [14]. The two complimentary facets
of NLP are Natural Language Understanding (NLU) and Natural Language Generation (NLG) as illustrated in Figure 2.
NLU is the process of enabling computers to understand and derive meaning from natural language. By bridging the
gap between unstructured text data and representations that are understood by machines, NLU enables machines to
comprehend and process natural language input. Instances include sentiment analysis [1], opinion spam classification [2],
fake news detection [20] and rumor verification [21]. On the other hand, NLG enables computers to produce natural
language from structured data or other unstructured text inputs. The primary objective of NLG is to communicate
information in a way that is comprehensible to human beings and appropriate as per the given situation. Instances
include question answering [22], machine translation [3] and text summarization [23, 24].

3.1 Applications of NLP

Some of the renowned applications of NLP have been elucidated as follows:

2
https://scholar.google.co.in
3
https://dl.acm.org
4
https://www.semanticscholar.org
5
https://www.sciencedirect.com

4
Figure 2: Applications of NLP

3.1.1 Sentiment Analysis

Sentiment analysis is an application of NLP concerned with the extraction and evaluation of expressions, feelings,
and orientations of people regarding a certain physical or abstract subject [1]. It has evolved over a period of time
with primarily three tiers of analysis- document-based, sentence-based, and aspect-based. While Document-Based
methods provide the overall sentiment for the entire document, they fail to capture the sentiments expressed in individual
sentiments [25, 58]. Sentence-based methods provide the sentiment polarity associated with individual sentences in
a document. They are an improvement over Document-based methods but falter to capture sentiments associated
with the aspects present in a sentence [26]. Aspect-Based Sentiment Analysis (ABSA) redresses the impediments of
Document-Based methods as-well-as Sentence-Based methods with its ability to associate sentiments to individual
aspects [1, 55].

3.1.2 Misinformation Detection

Misinformation detection deals with identifying fake, biased, or propaganda-based content posted through online
platforms. In contrast to classifying the polarity of opinions as in sentiment analysis, it detects fraudulent opinions.
The detection methods may be based upon the content, the meta-data, or through learning some patterns present in
the content [2]. It can be extended to fake news as-well-as rumor verification tasks. On one hand, fake news consists
of news articles with delusive content to misinform the readers [20]. On the other hand, a rumor can be attributed to
information that is rapidly disseminated without ascertaining its authenticity. Thus, a rumor might be true, false, or
even unverified [21]. The approaches to detect fake news may exploit information present in the content of the post,
user profile as-well-as social context [27].

3.1.3 Machine Translation

With increasing globalization, the entire world is becoming a single community. Therefore, it is becoming increasingly
important to overcome linguistic barriers so that seamless transmission of knowledge and information can take place.
Machine translation enables automatic conversion of a given piece of text from one language to another. This field
is full of challenges due to multiple possible translations of a word depending upon the context and difficulty in
understanding idiomatic phrases [28]. The advent of neural networks and encoder-decoder architectures for sequence-
to-sequence models [3, 4] mitigated the impediments to a large extent. Subsequent transformer-based approaches [115]
and associated LLMs [18, 120] have helped achieve SOTA performance.

5
3.1.4 Question Answering
Question Answering (QA) is an application of NLP that focuses on inventing and developing models and algorithms to
automatically produce human-like responses to user queries or questions [6]. The objective is to make it possible for
computers to comprehend natural language input and produce pertinent, correct responses in a conversational style. The
existing QA systems can be grouped into extractive QA and generative QA. The former selects the span of text from a
document termed context which serves as the answer to a given question [29]. While the latter produces automatically
generated nuanced answers on the basis of the comprehended information [22].

3.1.5 Text Summarization

Text summarization is another acclaimed application of NLP. It follows a similar approach as machine translation, with
a difference in the fact that instead of producing output text for each sentence, it condenses the input to a concise form
representing the vital information present in the text. The two main branches of text summarization are - extractive-
summarization and abstractive-summarization. The earlier methods focused upon extractive-summarization, in which
the entire sentences are extracted based upon their significance [23, 30]. Recent methods focus on the more critical
approach of summarizing text by paraphrasing the information learned without using sentences from the original text.
Such methods generally deploy deep learning-based sequence-to-sequence models [24, 31].

3.2 Evolution of NLP

Over the years, NLP has made considerable strides as a result of ground-breaking research, rising processing capacity,
and the creation of complex language models. The development of NLP is evidence of the persistent effort to close
the language and artificial intelligence gap. The major advancements from rule-based systems, conventional machine
learning techniques, deep learning, and pre-trained language models have been outlined in this section with Table 1
presenting a comparative review of the notable contributions.

3.2.1 Rule-Based Approaches

The rule-based approaches utilize the grammatical rules behind sentence construction to extract the features and process
a text. Even though these approaches are currently losing their charm, they are somewhat suitable for less intensive
applications as they are unsupervised, domain-independent, and efficient with considerable accuracy. However, the
accuracy of the rule-based approaches is dependent upon the grammatical construction of text [40]. The rule-based
approaches can also be combined with other methods such as Convolutional Neural Networks (CNN) [55] or even
transformer-based models like BERT [56]to give the dual benefit of satisfactory performance with efficiency.

3.2.2 Traditional Machine Learning Approaches

In traditional machine learning approaches, after pre-processing operations, the features are defined followed by
extraction of the features. After that, using traditional machine learning techniques the processing is carried out.
Some of the notable machine learning techniques are Naive Bayes (NB) Classifier [57], Support Vector Machines
(SVM) [58], Logistic Regression (LR) [2], Maximum Entropy (ME) Classifier [25], k-Nearest Neighbour (k-NN) [58]
and Conditional Random Fields (CRF) [32]. The efficiency of such models is considerably less than the rule-based
approaches due to the requirement of a huge training-set resulting in exorbitant training time with accuracy governed by
the choice of features [55].

3.2.3 Deep Learning Approaches

The advent of deep learning approaches enabled a model to learn the features by itself which was earlier not possible
using traditional machine learning approaches. This led to the development of models that could understand the context
better than traditional machine learning approaches like Recurrent Neural Networks (RNN) [39] and Convolutional
Neural Networks (CNN) [43, 55]. However, RNNs failed to capture the relationship among words beyond a certain
length [15] due to inherent “vanishing-gradient” and “exploding-gradient” issues [59]. The use of Long Short Term
Memory (LSTM) ameliorated this issue with a gated mechanism having the gradients pruned leading to improved
accuracy [60]. Further developments on LSTM led to pre-trained models like Embeddings from Language-Models
(ELMo) [45] and Universal Language Model Fine-tuning (ULMFit) [44]. Subsequent developments led to transformer-
based models as discussed in Section 4. Concerning sustainability, a point to ponder is that these deep learning models
have better accuracy compared to traditional machine learning approaches, but are domain-dependent, and training
them from scratch requires considerable computing resources [15].

6
Table 1: An Overview of Notable Contributions in NLP

Year Task Description Methods Used Author(s)

2008 Analysis of opinion spamming LR Jindal et al. [2]
2010 Sentiment analysis based on dependency-tree CRF Nakagawa et al. [32]
2013 CBoW and SG for vector representation of text CBoW and SG Mikolov et al. [35]
2014 GloVe for word-vector representation GloVe Pennington et al. [36]
2014 Doc2Vec representation of text Doc2Vec Le and Mikolov [38]
2014 Machine translation using encoder-decoder architecture RNN Cho et al. [3]
2014 Machine translation using neural networks RNN Bahdanau et al. [4]
2014 Aspect-extraction and mining of opinions RNN Irsoy and Cardie [39]
2015 Aspect-extraction and opinion mining Rule Based Liu et al. [40]
2015 Targeting efficiency through Knowledge Distillation Neural Networks Hinton, Geoffrey [41]
2016 FastText for word-vector representation FastText Joulin et al. [42]
2016 Aspect extraction and opinion mining using multi-layer CNN CNN Poria et al. [43]
2016 Abstractive text summarization using sequence-to-sequence ar- RNN Nallapati et al. [24]
chitecture
2017 Extractive text summarization using deep learning RNN Nallapati et al. [30]
2017 Abstractive text summarization using pointer-generator networks LSTM See et al. [31]
2017 A survey of fake news detection techniques - Shu et al. [20]
2018 Universal Language Model Fine-tuning (ULMFit) LSTM Howard and Ruder [44]
2018 Embeddings from Language-Models (ELMo) LSTM Peters et al. [45]
2018 Generative-Pre-trained Transformer (GPT) Transformers Radford et al. [11]
2018 Bidirectional Encoder Representations from Transformers Transformers Devlin et al. [8]
(BERT)
2019 Autoregressive pre-training using XLNet Transformers Yang et al. [9]
2019 GPT-2 for unsupervised multi-task learning Transformers Radford et al. [12]
2019 Recurrence-based chunking approach (Transformer-XL) Transformers Dai et al. [46]
2019 Cross-layer parameter sharing with matrix decomposition Transformers Lan et al. [47]
2020 Few-shot learning using GPT-3 Transformers Brown et al. [13]
2020 Utilizing Hash similarity for token clustering (Reformer) Transformers Kitaev et al. [48]
2020 Low-rank approximations of the self-attention matrix (Linformer) Transformers Wang et al. [49]
2021 Utilizing K-means clustering for capturing patterns (Routing Transformers Roy et al. [50]
Transformer)
2021 Parameter sharing with downsampling (Perceiver) Transformers Jaegle et al. [51]
2021 Representing model parameters with reduced precision through Transformers Radford et al. [52]
Quantization
2022 Effective data curation for efficient pre-trained models Transformers Zhang et al. [53]
2023 Structured pruning for efficient pre-trained models Transformers Sajjad et al. [54]

7
(a) Rule-Based

(b) Traditional Machine Learning

(c) Deep Learning

(d) Transfer Learning
Figure 3: Prevalent Modeling Approaches in NLP

4 Transformers in NLP
The Transformer-based approaches come under the purview of deep learning. However, due to the revolution in NLP
brought about by them and the immense developments carried out, they deserve to be discussed separately in this
section. The evolution of transformers, the concepts associated with it accompanied with the stages of modeling have
been enunciated herein-below.

4.1 The Evolution of Transformers

A series of developments paved the way for the transformers. Earlier works on NLG tasks like machine translation
devised sequence-to-sequence models comprising two RNN blocks namely, the encoder and the decoder [3, 110]. Given
an input sequence X = (x1 , x2 , ..., xn ), the RNN-based encoder derives a hidden representation H = (h1 , h2 , ..., hn ).
Subsequently, a few other non-linear functions can also be applied to obtain the final H. For tth time-step, ht is
calculated from xt and ht−1 as shown in equation (1).

ht = RN N (xt , ht−1 ) (1)

The decoder predicts one output token at each time-step on the basis of the previously predicted tokens y1 , y2 , ..., yt−1
and H as a joint probability distribution shown in equation (2).

yt = p(yt |y1 , y2 , ..., yt−1 , H) (2)

8
Figure 4: Illustration of the transformer architecture

However, the above approach leads to loss of information as the length of the input sequence grows due to compression
of information into a fixed-length vector. To ameliorate this issue, Bahdanau et al. [4] deployed a soft-search mechanism
for identifying the significant tokens from the input sequence for the prediction of the output at a given time-step. For
this, they introduce the term context vector ct derived from H which weighs the significance of the token hidden states
as shown in equation (3).
Pn
ct = i=1 αti · hi (3)

where αti is a distribution (often a softmax) function as follows:

αti = Pnexp(eti ) (4)

j=1 exp(etj )

9
given that,

exp(eti ) = fa (st−1 , hi ) (5)

Here, exp(eti ) evaluates the alignment between the output at position t and the input tokens around position i. The
fa (∗) function takes the previous hidden state st−1 of the RNN decoder and ith time-step hidden representation hi .
Finally, the decoder applies a non-linear function fd (∗) to generate the output yt for time-step t as follows:

yt = fd (yt−1 , st , ct ) (6)

This led to the foundation of the attention mechanism, an indispensable component of modern transformer architecture.
To compute attention, the input is transformed into an embedded sequence Z ∈ RL × RD comprising token and
positional embeddings where L is the sequence length and D is embedding dimension. Then, key Ks , query Qs , and
value Vs are calculated through linear transformations on the sequence Z as follows:

Qs , Ks , Vs = W q Z, W k Z, W v Z (7)

D
where, W q , W k and W v ∈ RD× H denote the weight matrices corresponding to Ks , Qs and Vs . The key K represents
the input features. These features might be at character-level, word-level, document-level, or a combination of multiple
features. Qs is the vector whose relationship with K is computed during attention computation. This is accomplished
through a compatibility function fc (∗) as follows:

ea = fc (Qs , Ks ) (8)

One might notice the similarity between equation (8) and the alignment function in equation (5) wherein the alignment
between the previous decoded token and the hidden states is computed. Furthermore, the fc (∗) can have varied forms as
summarized in Table 2. Following this, the attention weights aw are obtained after being fed into a distribution function
fδ (∗) to normalize the alignment scores and transform it into a probability distribution as follows:

aw = fδ (ea ) (9)

Here too, the distribution functions can have varied forms with softmax activation being the most widely used [7]. To
obtain the attention-weighted representation of the input Z ′ , pairwise inner product between Vs and aw is computed as
follows:

Z ′ = aw · Vs (10)

Vs represents the sequence vector upon which the attention weights are applied to determine the significant tokens. In
most of the studies, Vs is considered identical to ks . Finally, the attention-based context vector Ca is obtained as the
element-wise sum of Z ′ such that elements with higher attention weights have more significance compared to lower
attention weights as shown in equation (11).
P ′
Ca = zj , ∀zj′ ∈ Z ′ (11)

Table 2: Cost Behind Training Transformer-based Models

Type Representation Reference
Similarity fc (Qs , Ks ) = sim(Qs , Ks ) Graves et al. [114]
Multiplicative or Dot fc (Qs , Ks ) = QTs · Ks Luong et al. [115]
QT ·K
s
Scaled Multiplicative fc (Qs , Ks ) = √ s
D
Vaswani et al. [7]
T
Bilinear fc (Qs , Ks ) = Qs · W · Ks Luong et al. [115]
Additive fc (Qs , Ks ) = W T g(W1 Ks + W2 Qs + b) Bahdanau et al. [4]
Source: Based on a study by Galassi et al. [113]
Note: W, W1 , W2 and b are learnable parameters

Often, there is only one input sequence and attention is computed solely based on it. It gave rise to self-attention or
intra-attention, a concept refined in many later works [111, 112]. It is achieved by having the same vector for both

10
Ks and Qs . In this manner, it helps to capture the relevance of a particular token in a sequence concerning other
tokens in it. Furthermore, to accommodate parallel computation of attention at diverse positions, Multi-Head Attention
(MHA) am w was devised which concatenates aw computations from all the Dh attention heads and projects them through
W o ∈ RD × RD as depicted hereinbelow.

am o
w = Concatenate(aw [i])W , ∀i ∈ Dh (12)

A milestone achievement was the transformer architecture with multi-head scaled inner-product attention mechanism
by Vaswani et al. [7] as shown in Figure 4. This was the first time a sequence-to-sequence model entirely based on
self-attention without any CNN or RNN units was proposed. The transformers with attention mechanism provide high
performance with exceptional sequence representation abilities and support parallel training unlike the LSTM-based
sequential methods [7]. Moreover, the genesis of transformer-based pre-trained models or LLMs has transformed the
field of NLP providing relief from training the model from scratch. These models are pre-trained on large data-sets and
just need to be fine-tuned as per the application. This helps to provide high accuracy with computational efficiency and
robustness when applied in various domains thereby making them an apt choice in the current scenario [7]. One of the
foundational LLM was OpenAI’s Generative-Pre-trained Transformer (OpenAI GPT) based upon transformer-decoder
architecture with unidirectional context parsing. To overcome this limitation, Bidirectional Encoder Representations
from Transformers (BERT) [8] adopted bidirectional context-parsing deploying a transformer-encoder architecture.
However, BERT suffers from drawbacks like the exclusion of a "Mask" token during fine-tuning and parallel predictions
without dependency consideration. These drawbacks have been resolved by its successor XLNet through "permutation
language modeling" in which the prediction tokens are permuted randomly [9]. The successors of OpenAI GPT i.e.
GPT-2 [12] and GPT-3 [13] further enhance the performance, efficiency, and reusability with the concept of "in-context
learning". This feature further eliminates the need to fine-tune the model and the model just needs to be conditioned
with the instances or description of the application. Apart from this, LLMs have been devised utilizing the entire
transformer encoder-decoder architecture. The T5 transformer [120] is one such LLM that is pre-trained by predicting
a span of tokens corresponding to a mask. Another variation in the form of PEGASUS [146] enforces masking of
entire sentences as a pre-training objective and termed it Gap-Sentence Generation. Similarly, BART [10] comprises
encoder-decoder blocks and applies noise to corrupt the input text and then attempts reconstruction through denoising.
These are just a few examples and the rest of the paper presents several other transformer-based models supported with
an interpretation of their efficiency.

4.2 Stages of Modeling

4.2.1 Pre-Training
Creating an LLM does not only revolve around devising a complex architecture with millions of parameters. Rather,
models need to be trained on data-sets proportionate to the model size to deliver optimum performance [116]. Thus,
large models need large data-sets. But, high-quality annotated data-sets are scarcely available for training a model in
a supervised fashion. This is due to annotation being expensive, and requiring expertise in understanding the syntax,
semantics as well as domain knowledge. However, there exists plenty of unannotated textual content that can be utilized
to make LLMs learn vital representations through unsupervised or self-supervised learning. Previously training or
Pre-Training LLMs on these tasks groom the model towards discerning linguistic intricacies, significantly enhancing
the performance at downstream tasks with to faster convergence even with limited data. The inception of pre-training
can be attributed to the surge in the development of deep convolutional models following the ImageNet6 challenge in
the early 2010s. In NLP, Collobert et al. [117] first demonstrated the concept of pre-trained word embeddings generated
from large unannotated corpora. Subsequently, the pre-trained versions of word embeddings like GloVe [36] and
Word2Vec [35] were devised. In context to the Language Model, Dai and Le [118] became the torchbearer followed
by other models like ELMo [45], ULMFit [44], GPT [11] and BERT [8]. Since then, a plethora of LLMs have been
developed with an upward trend in associated research. There exist quite a few strategies for pre-training LLMs [119].
Out of them, a few significant ones have been mentioned herein-below.

• Causal Language Modeling (CLM): It relies on self-supervised language modeling to predict the next token
in a sequence maximizing the likelihood of the conditional probability distribution over all the unique tokens
based on the context. CLM works in a unidirectional manner, i.e. left-to-right manner. This implies that the
context only includes the tokens to its left. CLM is more suited for NLG applications. A prominent example
of an LLM using CLM is GPT [11]. For a given sequence X = (x1 , ..., x2 , xn ), the loss function of CLM is
computed as follows:
6
https://image-net.org/challenges/LSVRC/

11
PT
LCLM = − t=1 logp(xt |X<t ) (12)
• Masked Language Modeling (MLM): To ameliorate the limitation of CLM to attend only to tokens leftwards,
MLM was devised where the context was constructed in a bidirectional fashion, i.e. allowing it to infer from
tokens present in both right as well as left direction. This makes MLM the apt choice for NLU applications. An
MLM usually works by masking out some random percentage of tokens in the sequence and then predicting
those tokens based on the context. One of the famous LLM utilizing MLM is BERT [8]. For a given sequence
X = (x1 , ..., x2 , xn ), the loss function of MLM is computed as follows:
LM LM = − x′ ∈m(X) logp(x′ |X\m(X) )
P
(13)

where, m(X), X\m(X) denote the masked tokens, and the remaining tokens in the sequence X respectively.
Vanilla MLM deals with replacing single tokens which can reduce their effectiveness at sequence-to-sequence
NLG tasks. A sequence-to-sequence variation of MLM solves this by predicting a span of tokens corresponding
to a mask as can be seen in T5 transformer [120]. Subsequently, even entire sentences have been masked in
LLMs like PEGASUS [146] to make the pre-training objective related to the downstream task of abstractive
summarization. LLMs like BART [10] apply noise to corrupt the input text and then perform denoising
by reconstructing the span of text. This allows pre-training on shorter sequences with equivalent efficacy
contributing towards enhanced efficiency. A limitation of MLM is that the masked tokens are restricted to
pre-training and are not available at the fine-tuning phase leading to a discrepancy.
• Permutation Language Modeling (PLM): To mitigate the drawback of MLM related to the unavailability of
the mask token during the fine-tuning stage, PLM was proposed [9]. PLM generates a random permutation of
the input sequence wherein a permutation defines the order of token predictions (not to be confused with the
order of tokens in the sequence). During pre-training, the model tries to predict some of the tokens selected
as the target considering its position and the remaining tokens. To achieve faster convergence, the endmost
tokens are often predicted. A popular LLM formulated on this pre-training objective is XLNet [9]. Given an
input sequence X with S being its random permutation sequence, the equation for the loss function of PLM is
as follows:
PT
LP LM = − t=1 logp(st |S<t ) (14)
• Contrastive Learning (CL): Contrastive learning aims to capture linguistic contextual information by
distinguishing (contrasting) between valid and invalid samples by means of similarity evaluation. Next-
Sentence Prediction (NSP) is an example of CL utilized in BERT [8]. Here, the objective is to identify whether
a pair of sentences are next to each other given a set of contiguous and non-contiguous sentences. However, a
few works have stated that although NSP focuses on the topic as well as coherence prediction, it is found to be
ineffective and unreliable in coherence prediction even demonstrating performance drop due to NSP [121]. To
resolve this issue Sentence-Order Prediction (SOP) was proposed to predict the order of sentences instead of
predicting whether a given sentence is the next sentence to another sentence. The LLM ALBERT showcases
superior performance by modeling the inter-sentence coherence through SOP [47]. The loss functions for both
SOP and NSP aim to determine the constructiveness of two sentences X and Y as follows:
LN SP/SOP = −logp(k|X, Y ), ∀p ∈ 0, 1 (15)
Regarding efficiency considerations of LLMs, it can be said that pre-training requires the maximum com-
putational resources among all the stages of modeling. Although the pre-training strategies contribute to a
great extent towards the performance, the model design along with the quality and size of the data upon which
pre-training is performed plays a crucial role in efficiency [19]. The efficient data curation as well as model
design considerations have been discussed in Section 5.2.1 and Section 5.2.3.

4.2.2 Fine-Tuning
As seen above, pre-training an LLM serves as an effective model initialization strategy and aids in generalization
with faster convergence on limited annotated data. However, to make a pre-trained model excel at a domain-specific
task, additional training effort is required to exploit annotated samples specific to the downstream task. It is known as
fine-tuning. It underlies the concept of transfer learning wherein a model pre-trained on a certain task having large data
is trained again (fine-tuned) on a related task with significantly fewer data. There are various fine-tuning approaches.
The first approach is to unfreeze a few layers of the model and retain the weights of the other layers calculated during
pre-training. Usually, the output layer is customized as per the output representation format and fine-tuned with a few
other unfrozen layers upon the task-specific data. The second approach is to fine-tune the frozen model with limited
data during initiation and unfreeze other layers in due course.

12
The efficiency considerations for fine-tuning lie in minimizing the number of layers to unfreeze, i.e. number of
parameters of the pre-trained LLM to fine-tune. Unfreezing more layers increases the computational requirements
of fine-tuning but can enhance the accuracy of the downstream task. This applies only if abundant data is available
to perform FT. In most cases, fine-tuning only the last few layers can obtain desirable results [122]. This is due
to the fact the lower layers capture low-level, local features primarily related to the syntax. Whereas, the higher
layers capture the global information involving high-level semantic abstractions specific to the task at hand. The
efficiency can also be improved through adapter modules, i.e. an isolated network that is fine-tuned and combined with
the pre-trained model having all the parameters intact [123]. Further variations include utilizing Kronecker product
of low-rank matrices for the construction of parameter matrices for the adapter [124]. Another variation involves
reparameterization to low-dimensional subspaces for fine-tuning, enhancing efficiency by reducing the number of
parameter updates [125]. There lies one drawback of the adapter approach- it raises the overall model parameters
leading to more computations during inference. This hindrance was resolved through Adaptable Adapters which applies
differing activations specific to each layer as well as data-sets accompanied with a switch trained to select appropriate
layers of the adapter module [126]. Furthermore, AdaMix combined various parameter-efficient adapters to provide
SOTA results with an efficiency equivalent to fine-tuning with a single adapter module [127].

Figure 5: Illustration of prompt engineering process

4.2.3 Prompt Engineering

GPT-2 [12] first demonstrated the multi-task learning ability of generative LLMs. It was capable of performing
various tasks out of the box minimizing manual effort during inference. Its subsequent version, GPT-3 [13] was
further able to perform few-shot or in-context learning, i.e. it could generate the required predictions just by providing
the task description along with priming with a few use-cases. This led to the emergence of the term Prompting
or Prompt Engineering associated with a series of developments. Schick and Schutze [128] utilized a pre-trained
LLM for in-context learning which excelled at tasks such as predicting the next sentence given the first sentence,
generating the second sentence replacing "__" in the prompt or even generating both the sentences through providing
the description. Reynolds and McDonell [129] demonstrated the mathematical reasoning ability of pre-trained LLMs
where given a mathematical problem, the LLM could generate the solution with detailed steps. Wei et al. [130] further
demonstrated that an LLM’s multi-task learning abilities can be enhanced by prompt-learning on several supervised
data-sets concerning various tasks. From the current state of developments [130–135], it can be inferred that the
approaches for prompt engineering can be primarily clustered into the following categories:

• Instruction-based Learning: Also known as Priming, it involves providing the instructions related to the
task description optionally with a few samples of the inputs and their corresponding outputs [130, 132]. For
instance, providing the instruction to perform translation accompanied with a few examples in the prompt to
prime the LLM to generate a translation for any new sentence.
• Template-based Learning: It deals with exploiting predefined structures, known as templates to construct
prompts. The templates can be designed as cloze styled- inserting placeholders in the prompt text and
attempting to fill in the blanks [133], multiple-choice type- providing multiple hypotheses in the template and
asking the model to choose the correct one [134] or prefix-type- adding special prefixes before the input to
denote the task to be performed on the input [131, 132].
• Proxy-Task-based Learning: It involves probing an LLM with a proxy-task, i.e. a related task sharing
some attributes of the original task, to obtain the output of the original task through transferring the inference
to the desired form. This enhances the efficiency and eases inference due to utilizing simpler tasks closer
to those upon which the model has been previously trained to obtain outputs for tasks leveraging rigorous
linguistic comprehension. Instances include applying textual entailment for topic detection [135] or achieving
coreference resolution through question answering [136].

13
Regarding the efficiency of prompt engineering approaches, it can be commented that in-context learning significantly
reduces the computational complexity due to zero parameter updates in the pre-trained LLM. For a multi-task LLM,
prompting can yield results at par with fine-tuning the model with several data samples [131]. Apart from these, certain
prompt engineering practices also enhance efficiency. Firstly, optimizing the length of the prompt and its textual
complexity improves the response-time by requiring less computations. Secondly, designing prompts considering
the resources available and allowing batch processing can improve efficiency. Thirdly, caching the intermediate
outputs can reduce the amount of processing required leading to faster response. Finally, the selection of the LLM for
prompt-engineering plays a crucial role. The selection must be done considering the desired performance given the
availability of computational resources.

5 Efficient Modeling Considerations

The capability of transformer-based models to deliver high accuracy at tasks might seem undisputed. But, they are
highly complex requiring humongous parameters. The first factor behind this is the O(L2 · D) complexity of attention
computation [7]. The second factor is the number of attention heads involved in MHA. Even if MHA supports parallel
computation, it raises the complexity of the model and requires sophisticated hardware for implementation. The third
factor is the number of transformer blocks involved. This increases the number of sequential layers scaling up the
execution time. The fourth factor is the number of embedding dimensions D. It can be observed that self-attention
is directly proportional to D and thus determining the optimal embedding dimension is very crucial as very high
embedding dimensions might incur redundancy while lower values of D might lead to information loss. The fifth factor
is the size of the data being trained on. The potency of current SOTA models to offer zero-shot or few-shot learning is
due to the humongous data they are being trained upon. This results in gigantic model sizes. A single training instance
might be as expensive as $40,000 [16] or even more. The remainder of this section discusses the efficiency measures
along with the developments to improve the efficiency of transformer-based models both from a software as-well-as
hardware perspective.

5.1 Efficiency Measures

The developments to achieve better performance at tasks are at the cost of increased model complexity translating
to escalated training costs and carbon emissions. Given the complexity of SOTA NLP models, the cost of training
might even exceed the annual energy requirements of certain cities. Strubell et al. [16] performed a study in which they
calculated the power consumption, carbon emissions along with the monetary cost associated with the training of a set
of NLP models. In their study, it was found that training a NLP model cost as much as a trans-Atlantic flight. They also
reflected on the percentage of energy coming from renewable sources from countries all over the world. To measure the
efficiency η, the trade-off between the model performance and cost factors needs to be calculated as shown in equation
(1). The cost factors can be defined concerning various metrics as follows:

1. Floating-point Operations (FlOps) define the number of floating-point operations needed for a single instance
computation [19]. This can serve as a consistent benchmark irrespective of the hardware of the application.
However, existing HPCs with support for parallel processing might lead to non-uniform execution times even
with the same number of FlOps.
2. Inference Time denotes the time required by the model to process a test input and generate a suitable response
[141]. Unlike FlOps, it is hardware-dependent, i.e. it depends upon the configuration of the HPC and support
for parallel execution. From the evaluation perspective, it enables a real-time measure of various algorithms
based on execution upon identical HPC.
3. Speed-up Ratio helps to perform comparison of a model concerning another model [141]. Here, one model is
taken as the baseline and the improvement in efficiency of the other model is measured compared to it. In
context to transformer-based models, speed-up can be calculated based on the number of transformer blocks,
attention heads, or overall number of layers in the model.
4. Model Size and Number of Parameters are internal indicators of the computational requirements [18]. Some
models might be more efficient despite the same or even more number of layers and FlOps due to the sharing
of parameters [47]. In such cases, the number of model parameters provides an indicator to the overall model
size and serves as an efficiency evaluation metric.
5. Carbon Footprint is the most significant indicator of the environmental impact due to an LLM. However,
it is an uphill task to precisely report the carbon emissions due to the involvement of multiple factors for
its computation [16, 18]. The preliminary approaches involve tools to calculate the energy consumption
and carbon footprint relying on the execution time, number of cores, memory requirements, and platform

14
information supplied by the user [105, 106]. Further developments led to packages being deployed on systems
to directly access the CPU, GPU and DRAM statistics and calculate power consumption7 [107]. However,
most of these studies only account for the computing resources and do not consider the cooling, networking
and other operational costs.

P erf ormance
η= CostF actors (1)

For performance, it is necessary to discover the pareto-improvement by comparing it with a benchmark, i.e. attaining
higher accuracy at lower cost [108]. Schwarts et al. [18] formulated cost factors proportional to the time and resources
for execution on a single sample Es , data size Ds and the number of epochs n required for training as depicted in
equation (2).

CostF actors ∝ Es · Ds · n (2)

Despite the research developments, the current approaches for measuring efficiency are not fool-proof. There lies a
disparity in the carbon emissions reported by various monitoring applications. The majority of the studies focus on
only model training or do not differentiate between fine-tuning or prompt-engineering stages. Furthermore, the cost of
production of hardware and infrastructure for deployment of these models is often unaccounted for. A study by Gupta
et al. [109] reveals that the environmental impact due to setting up infrastructure and hardware equipment is maximum
compared to other life-cycle stages for data-centers.

5.2 Software Designing

To achieve efficiency in NLP models, numerous software design considerations have been devised to target various
stages of model development as highlighted in Figure 6. In this section, commentary on such techniques based on the
modeling stages such as data curation, text representation, model design, and model compression have been presented.

Figure 6: Efficiency considerations through software designing

7
https://github.com/epfl-iglobalhealth/cumulator

15
5.2.1 Data Curation
Data curation plays a vital role in determining the efficiency of the Language Model (LM). A data set with reduced
sequence lengths or less number of training samples minimizes the model complexity and reduces the training effort
significantly [61]. Duplicate removal from the data set can enhance the efficiency of LM and might also enhance or
improve its performance compared to the entire corpus [62]. In the case of pre-trained LMs, such filtering can be
applied both during the pre-training [53] as-well-as the fine-tuning stages [63]. Although filtering eliminates biases
inherent in the data set, their application is restricted to cases with abundant data as the performance reduces when
insufficient data is available [64].
While duplicate removal applies to already available data sets, Active Learning comes into play while collecting data.
It aims to reduce the training data while retaining model performance by labeling the most informative samples and
selecting them for training [65]. For the identification of informative samples, various approaches have been adopted
such as selecting samples with high uncertainty [66], maximum diversity [67] or both [68]. However, determining
the usefulness of the samples and annotating them is a challenging task [69]. Its efficacy in diverse downstream tasks
cannot be ascertained and can include outliers [70, 71].
Another perspective to data curation can be to order the samples in the data-set to improve utilization also known as
Curriculum Learning. The ordering approach deploys heuristics capturing the complexity of sequences and determines
a pace to progressively move from simpler sequences to complex sequences [72]. However, the pace has to be monitored
to guarantee efficiency, and automation of the pace serves to be beneficial [73].
Establishing a balance between the size of training data and the model parameters is also important to achieve pareto-
improvement as mentioned in section 5.1. Hoffmann et al. [116] state that the number of model parameters and the size
of the training set must be in the same ratio. They showcased that their model named Chinchilla based on this theory
outperformed several SOTA models with a significantly higher number of parameters.
Thus it can be inferred that determining the quality of samples in the corpus and selecting high-quality samples devoid
of repetitive information, outliers and incorrectly ordered sequences can boost the modeling efficiency. Moreover, this
can be extended to decomposing the individual text sequences into smaller sub-sequences with essential information
and discarding the irrelevant portions leading to efficient representation of context [56]. This significantly enhances the
efficiency of transformer-based models with attention mechanisms having complexity quadratically proportional to
sequence length [7].

5.2.2 Representation of Text

After data curation, the next step is to target an efficient representation of text. Earlier methods like Bag-of-Words [33]
produced sparse vectors with enormous dimensions equal to the vocabulary size and did not account for statistical
relationships and ordinal information [33]. The quest for the advancement of text representation and amelioration of
the shortcomings of previous works led to the bifurcation of approaches into categories like matrix-factorization and
context-window-based models. Among the matrix-factorization approaches, Latent Semantic Analysis (LSA) [34] could
comprehend the statistical information but failed to capture term relationships. On the other hand, Global Vectors for
Word Representation (GloVe) [36] used log-bilinear regression and addressed the inability to capture term relationships
by LSA. The context-window family of approaches can be attributed to Neural-Network Language Model (NNLM) [37].
It was succeeded by Word2Vec approaches like Continuous Bag-of-Words (CBoW) and Skip-Gram (SG) [35] which
effectively captured local-context information and its features like hierarchical-softmax, negative-sampling and frequent
word sub-sampling boosted its efficiency. This was further improved by the fastText approach [42] with a manifold rise
in efficiency using a constraint upon rank and fast-loss estimation.
While these embeddings were successful in capturing semantic information, they failed to do so for context-specific
meaning. When meeting polysemous words, or words whose meanings change depending on the situation, this issue is
exacerbated. As a result, contextualized word representations such as Context2Vec [74], ELMo [45], BERT [8], and
others were developed. Although these representations are state-of-the-art (SOTA), each token in the sequence has
multiple dimensions, which contributes to the "curse of dimensionality" problem with a high memory cost and more
complex model parameters.
Sentence-level representations have been found to achieve a better performance vs efficiency tradeoff compared to
word-level embeddings [75]. The Doc2Vec is one of the pioneering approaches in this domain that attempted to
represent a document as a vector through supervised learning [38]. The Universal Sentence Encoder consisting of a deep
averaging-network version and a transformer-based model exhibited appreciable ability for embedding sentences at the
sentence level [76]. Sentence-BERT, which has expanded on the BERT architecture, is another noteworthy strategy [77].
Cross-lingual variations leveraging BERT [78] were produced as a result of further developments. Compared to word
embeddings, these methods produce optimized fixed-length representations with minimal memory usage.

16
To ameliorate the curse of dimensionality, a few studies have been conducted to determine the optimal embedding
dimensions to reduce the excessive memory consumption retaining the semantic and syntactic characteristics in the
data [79]. Instances include determining the embedding dimensions based on corpus statistics like count of pairwise
equidistant words [80], reducing the dimensionality of embedding vector applying Principal Component Analysis
(PCA) [81] and compressed image representations equivalent to a given text [82].

5.2.3 Model Design

To achieve desirable performance on the text representations obtained, the model design should be such that it delivers
optimal results with minimum complexity. The contributions towards model design can be grouped under the following
categories:

• Chunking: It deals with chunking into several blocks, processing each block individually, and connecting the
representations of these blocks through recurrence or by some other mechanism. ABSA BERT [56] breaks
down each sequence based on significant phrases contained in it while filtering out irrelevant chunks of tokens
before being fed into the BERT model. An extension to the chunking approach has been proposed in the case
of Transformer-XL [46] wherein multiple blocks are connected through a recurrence mechanism. This helps to
efficiently compute attention for long sequences by breaking them down into multiple blocks.

(a) Global (b) Band (c) Dilated (d) Random (e) Block

Figure 7: Common types of sparse attention patterns

• Sparse Attention: A few contributions attempt sparsification of the attention matrix to reduce the complexity
of computing attention in transformer-based models. This implies limiting the count of keys to be attended
by queries based either on certain pre-defined patterns or input-conditioned connections. Some common
patterns might be global attention, band attention, dilated attention, random attention, and block attention
as illustrated in Figure 7. This technique exploits the inherent sparsity in the attention matrix in real-life
applications even after computing attention on all possible query-key pairs. Sparse Transformer [87] factorizes
the attention matrix to attain sparse patterns where connectivity
√ is established between a pre-defined set of
tokens. This reduces the complexity of attention to O(n n). Longformer [84] employs attention at fixed
intervals in a strided fashion. It adopts a blend of band attention, dilated attention, and global attention to
achieve a near linear scaling factor with respect to the sequence length. Extended Transformer Construction
(ETC) [156] follows a similar approach agglomerating global attention and local band attention with relative
positional encoding. Additionally, it employs masking through Contrastive Predictive Coding as a pre-training
objective. BigBird [157] builds upon the ETC model by applying random patterns of sparse-attention. It can
handle sequences 8 times the length and achieve linear complexity compared to the conventional attention
mechanism. Selective Learn Forget Network (SLFN) [86] adopts a gated mechanism upon multi-head attention
in a single-block transformer architecture for selective retention of attention weights. This aided in filtering
out insignificant information while retaining long range dependencies. Memory Compressed Transformer [85]
reduces the number of query-key pairs applying strided convolution. BlockBERT [83] proposes an efficient
version of BERT by incorporating block-wise patterns in the attention matrix for sparsity.
• Mixture-of-Experts (MOE): The concept of sparsification for efficient computation has been taken forward
with the notion of Mixture-of-Experts (MOE). In this, the input is routed through multiple sub-networks
replacing the single feed-forward layer. Models such as GLaM [88] demonstrate that it helps to attain high
accuracy along with efficient use of resources. FasterMoE [102] further tackled the load-imbalance in MOE
models through fine-grained concurrent scheduling for distributed computing.
• Low-Rank Approximation: To reduce the computational complexity of the attention mechanism, low-rank
approximation aims to approximate the attention matrix with a lower-rank matrix. Recently, techniques like
Linformer [49] have been devised to perform low-rank approximations of the self-attention matrix to enhance
efficiency. Similarly, the application of kernels for approximation of the computation of self-attention has

17
gained popularity as it reduces the effort required to compute self-attention for the entire sequence matrix. A
prominent example of this is the Performers [89].
• Clustering: It refers to grouping related elements, features in a sequence, or even attention heads to achieve
efficient computation of attention. Some other works learn patterns in the data by capturing relevant tokens and
clustering them together into buckets. Based on the similarity metric applied for clustering, various models
have been devised. For instance, the Reformer [48] utilizes a hashing-based similarity measure while the
Routing Transformer [50] deploys a K-means clustering algorithm.
• Parameter Sharing: The complexity of a model is proportional to the number of parameters present. Hence
reducing the number of parameters can be beneficial towards model efficiency. This can be achieved through
the sharing of parameters across the layers in the transformer network. Perceiver [51] is one such model which
performs downsampling apart from sharing weights among layers for efficient computation. ALBERT [47]
on the other hand applies matrix decomposition upon the embedding layer along with cross-layer parameter
sharing.

5.2.4 Model Compression

Once the model designing process is over, the model can be further compressed to reduce the computational requirements
for attaining efficiency. A variety of techniques have been discussed herein for model compression. The first technique
is Pruning [90] which deals with removing the undesired weight parameters from the transformer-based neural network.
This brings down the computational complexity, and memory footprint as-well-as the bandwidth leading to a rise in
model efficiency. It also serves as a regularizer and prevents overfitting. Pruning can be applied during pre-training [91],
fine-tuning [92], or at the time of inference [93]. Pruning is usually structured or unstructured. While the former
prunes a section of the neural network such as entire hidden layers or attention heads [54], the latter prunes individual
weights [91]. Although structured pruning significantly contributes towards enhancing efficiency, unstructured pruning
retains the model performance in a better way. Pruning although beneficial can even have adverse effects on model
performance if done excessively.
Knowledge Distillation is another concept based on condensing the knowledge from a highly complex model termed
as "teacher model" to a less complex model termed as "student model" based on a custom loss function [41]. The
predictions from the teacher model are converted into a probabilistic distribution having soft labels and fed into the
student model. The soft labels accelerate the learning process of the student model due to more variance imbibed in the
soft labels. A temperature T is applied to obtain the soft labels from the logits of the teacher model. The higher the value
of T, the softer the probability distribution. Whereas, the loss between the soft labels and the hard targets is computed
and minimized. Although knowledge distillation enhances efficiency, it hinders the overall model performance and
generalizability and hyperparameter tuning of the student model increases computational cost [94].
Quantization is the process of representing model parameters in memory with reduced precision to bring down the
memory requirements of the model and subsequently enable efficient computation. Instances include converting values
in 32-bit floating-point notation ’float-32’ to 8-bit integer notation ’int-8’ [52, 82]. Despite enhancing the model’s
efficiency, it might lead to a loss of information degrading the model’s performance. It has been observed that while
quantization performs well in the case of linear functions, for non-linear functions reducing the precision provides
unsatisfactory results [52]. Hence, a mixed quantization approach specific to the model components can prove effective.

5.3 Hardware Designing

Apart from software design, hardware considerations for efficiency in deploying LLMs are a vital yet comparatively
less explored domain. Figure 8 summarizes the developments in efficient hardware design whereas the remainder of
this section explains them in detail.

5.3.1 Customized Hardware

Hardware accelerators with parallel processing can speed up the processing of LLMs. Instances include Graphical
Processing Units (GPU) having several Arithmetic Logic Units (ALU) and high-bandwidth memory. Besides, Tensor
Processing Units (TPU) with matrix processors supporting mixed-precision computing for neural network computations
contribute to fast and efficient processing with reduced bottlenecks. Specialized HPCs utilizing Field Programmable
Gate Arrays (FPGA) tailor-made as per the application enhances the performance along with efficiency. Xilinx AI [96]
is one such FPGA hardware with separate a Artificial Intelligence (AI) compute unit. The Intel Stratix [95] has tensor
cores for AI-related computing integrated inside the FPGA to accelerate the throughput. Lu et al. [99] put forth a
hardware accelerator for a transformer with MHA. Liu et al. [100] further implemented a hardware accelerator of the
quantized version of BERT on FPGA. However, the majority of these developments are limited to certain applications

18
Figure 8: Efficiency considerations through hardware designing

only is that they are not suited to handle sparse data, high-precision arithmetic operations, or certain linear algebra
problems. Moreover, re-configurable hardware like FPGA is noted to have higher FlOps compared to fixed hardware.
Nevertheless, this can be seen as a viable option considering the production costs of fixed hardware for short-term
applications [101].

5.3.2 Software-Hardware Co-Design

Co-design is a school of thought realizing the contribution of hardware for the efficacy of a given software and deals
with joint optimization of hardware with the software to reap improved efficiency along with performance [101]. Ham
et al. [98] presented a lightweight co-design mechanism for approximation of attention computation by filtering out
insignificant relations. Wang et al. [97] presented a co-design architecture with token pruning as well and head pruning
to remove the unnecessary tokens as well as attention heads along with a quantization approach to extract the most
significant bits (MSB) based on confidence scores. Qu et al. [104] devised a detector module for identifying and
eliminating weak attention connections to reduce the computational overhead and jointly optimized it with the main
transformer module. The detector module applied low-rank transformations with low-precision computation to make
the overall architecture lightweight. Rajbhandari et al. [103] attempted model compression along with an optimized
inference strategy to reduce latency and the model size through an MoE-based co-design approach.
Apart from developing High-Performance Computing (HPC) architectures, distributed frameworks like DeepSpeed8
can accommodate training of LLMs in an efficient manner. The distributed computing versions of popular libraries like
PyTorch Distributed9 and TensorFlow10 have also been developed to facilitate training of LLMs. Besides, parallelism in
data can also lead to higher throughput by splitting the data into smaller chunks and distributing it among several HPCs
to train the model using tools like Horovod11 .

6 Results and Discussion

6.1 Statistical Insights

Figure 9 presents the year-wise distribution of surveyed papers. As we have focused more on recently published works,
it can be perceived that the major share of papers is from the last five years. The maximum number of papers is between
2019 and 2021 following a rising trend. However, for the years 2022 and 2023 a declining trend is observed. This
can be attributed to the fact that to assess the quality of papers the number of citations is considered to be one of the
important indicators. But, it is tough for a paper to get cited by a significant number of papers in such a short span of
time.
8
https://www.microsoft.com/en-us/research/project/deepspeed/
9
https://pytorch.org/tutorials/beginner/dist_overview.html
10
https://www.tensorflow.org/guide/distributed_training
11
https://horovod.ai/

19
Figure 9: Year-wise distribution of papers

Figure 10: Percentage share of article-types

Apart from observing the year-wise distribution of papers, the share of various article types, i.e. journals, conferences,
books and pre-print papers. Moreover, the journals have been segregated into regular papers and review papers. From
the pie-chart shown in Figure 10, it can be observed that conference papers account for 56% of the total share of
articles. While journals have a 31% share further subdivided into 23% being regular papers and 8% being review
papers. The reason behind this is the presence of a variety of prestigious conferences on NLP which are considered to
be more reputed than several journals. Thus, such conferences are preferred over journals by prominent researchers in
NLP. Interestingly, 11% share of articles are from pre-print platforms like arXiv12 . Further review of the high-quality
pre-print articles shows that a major share of such articles are milestone achievements authored by eminent researchers
and scientists belonging to reputed institutions. As pre-prints offer recognition for contributions in a couple of days,
it has become the apt avenue to claim authorship for a novel contribution. Lastly, 2% of articles reviewed are books.
This is because NLP is a rapidly evolving field of research while books are typically considered permanent sources of
knowledge presenting persistent concepts that remain relevant for many years. Hence, a book on NLP might lose its
significance in just a few years due to rapid technological developments.
For a more detailed analysis, the distribution of articles comprising of significant terms related to NLP has been
illustrated in Figure 11. The top-noted terms belong to the following categories13 in descending order "Transformers
NLP", "Efficiency Considerations", "Pre-Trained Models", "Deep Learning", "Hardware Design" and "Machine
Learning". This ascertains that the topics discussed in the reviewed papers align with the objective of this survey. It is
to be noted that, although transformers are a subset of deep learning techniques in NLP, it has been treated separately
for more transparency given the magnanimous volume of articles based on transformers. From the low share of articles
on NLP focusing on the efficient use of hardware, it can be inferred that the current research trend majorly emphasizes
on software to formulate pareto-optimal solutions with almost no consideration for hardware. Whereas, the state of

12
https://arxiv.org/
13
The related terms have been grouped and categorized into topics as shown in Figure 11.

20
Figure 11: Category-wise distribution of articles

developments necessitates the inclusion of hardware design considerations while formulating new models to achieve
optimal efficacy accompanied by efficiency.

Figure 12: Comparison of the trend of NLP vs Transformers. Source: Google Trends

To further validate our research, we compare the trend of transformer with NLP based on the number of web searches
by people throughout the world over the last five years utilizing Google Trends14 . Figure 12, portrays an overall rising
trend for both the terms, i.e. "Transformers" and "NLP". The popularity of transformers started rising after 2019
and since then there has been a steady growth. Furthermore, the growth-rate for both "Transformers" and "NLP" are
almost similar with "Transformers" having a slightly steeper growth rate in recent years. This shows the significance of
transformer-based models in the evolution of NLP. Moreover, the trend confirms the statistical analysis of the surveyed
papers mentioned above.

6.2 Trend of Research

From the recent developments, it is evident that transformer-based pre-trained models have excelled in terms of accuracy
compared to other conventional machine learning and deep learning algorithms. Table 3 presents a comparison of
various renowned LLMs based on their year of release, number of parameters, accuracy and pre-training data. This
shows that the current trend is towards designing powerful pre-trained models that only need to be fine-tuned as per
the requirements of a particular task. However, these models have tremendous computational complexity and for
each new task, they need to be fine-tuned on a sufficiently large data-set. Recently, some efforts have been directed
towards "task agnostic models" as in subsequent versions of GPT promoting few-shot or even zero-shot learning through
"prompting". However, such models could be termed as "multi-task learners" rather than task agnostic models as
their generalizability is significantly inferior to human cognition. Moreover, to achieve multi-task generalizability,
fine-tuning upon several tasks (i.e. various large-scale data-sets) is required. Overall, this limits research on such
models in resource-constrained environments and also aggravates carbon emissions. This can be visualized from Table
14
https://trends.google.com/trends/

21
Table 3: Evaluation of High Performance Models
Model Year Pre-training Dataset #Parameters GLUE LAMBADA PTWL
BERT_large [8] 2018 WikiEn+Book Corpus 340M 81.9 31.3
GPT [11] 2018 BookCorpus 117M 72.8 - -
ROBERTA [121] 2019 BookCorpus + CC-News + OpenWeb- 340M 88.5 - -
Text + STORIES
XLNET [9] 2019 WikiEn + BookCorpus + Giga5 + 340M 90.5 - -
ClueWeb + Common Crawl
GPT-2 [12] 2019 Web Crawl Text 1.5B - - 35.76
BART [10] 2019 BookCorpus + CC-News + OpenWeb- 370M 88.4 - -
Text + STORIES
Transformer- 2019 Wikipedia 24M - - 54.55
XL [46]
GPT-3 [13] 2020 Web Crawl Text + Book Corpus 175B 86.4 20.5
T5 [120] 2020 Colossal Clean Crawled Corpus (C4) 11B 89.7 - -
XLM-R [145] 2020 CommonCrawl 10.7B 91.8 - -
XLM-R [145] 2020 CommonCrawl 10.7B 91.8 - -
Megatron Turing 2022 CommonCrawl + Realnews + Github + 530B - 87.2 -
NLG [142] Wikipedia+ Gutenberg + Books3 + ArXiv
+ PubMed Abstracts + Stack Exchange +
Pile-CC + OpenWebText2
PaLM [143] 2022 Public Forums + Source Codes + WikiEn 540B - 89.7 -
+ Web Documents + News + Books
Turing ULRv6 2022 CommonCrawl 4.6B 91.3 - -
[144]
Chinchilla [116] 2022 MassiveText 70B - 77.7 -
LLaMA [137] 2023 CommonCrawl + C4 + Github + 65B - 84 -
Wikipedia + Gutenberg + Books3 +
ArXiv + Stack Exchange
Note: #Parameters- No. of model parameters, LAMBADA- LAMBADA (Accuracy), PTWL- Penn Treebank
(Word Level Perplexity), ’-’ indicates non-avalability of data

4 which shows that training SOTA NLP models can lead to several tonnes of carbon emissions. Figure 13, captures this
relationship among the model parameters and the size of the data used for pre-training the model. It can be observed
that there has been an overall rising trend in the size of the pre-training data as-well-as the model parameter count.
The earlier LLMs incrementally raised the model parameters and pre-training corpus obeying a linear relationship
among both. Subsequent LLMs focused majorly on increasing the model parameters without much rise in the size of
the pre-training data. Besides, the recent LLMs are focusing on striking a balance between the size of data and the
model size if not reducing the model size compared to the volume of pre-training data. This signals a ray of hope that
the awareness of efficient LLMs in the NLP research community is proliferating.

6.3 Future Avenues

To make transformer-based pre-trained models efficient, a significant effort has been directed toward reducing the
complexity of the attention mechanism as-well-as decreasing the number of model parameters. In this paper, several such
techniques as-well-as associated models have been discussed. From these studies, it can be inferred that efficiency can
be achieved at multiple stages of model development. However, the goal of formulating an efficient pre-trained model
is still far from being achieved. For this, efforts towards devising efficient model design need to be consolidated with
efficient pre-training and fine-tuning strategies along with effective prompt-based learning approaches (if applicable).
Also, the data-sets play a major role in the trade-off between performance and efficiency. Determining the optimal
size of the data-sets, distribution of sequence length distribution as-well-as the quality of training samples is of utmost
importance to restrict training costs and prevent over-training.

22
Table 4: Cost Behind Training Models
Model GPU/ TPU GPU/ TPU Hours Energy Emissions Source
Transformerbase GPU-P100 x 8 96 1.416 0.0117 Strubell et al. [16]
Transformerbig GPU-P100 x 8 672 1.515 0.0864 Strubell et al. [16]
ELMo GPU-P100 x 3 1008 0.51766 0.118 Strubell et al. [16]
BERTbase GPU-V100 x 64 5056 12.04151 0.647 Strubell et al. [16]
GPT-2 TPU-v3 x 32 5376 - - Strubell et al. [16]
Gopher GPU-A100 x 16 5725 1,066 352 Luccioni et al. [139]
GPT-3 GPU-A100 x 16 6912 1,287 502 Luccioni et al. [139]
NAS TPU-v2 x 1 32623 - - Strubell et al. [16]
GShard TPU-v3 x 1024 76,185.600 24.100 4.8 Patterson et al. [17]
LLaMA-7B GPU-A100 82,432 36 14 Touvron et al. [137]
LLaMA-13B GPU-A100 135,168 59 23 Touvron et al. [137]
T5 TPU-v3 x 512 245,760 85.7 46.7 Patterson et al. [17]
XLM GPU-V100 x 512 250,675.2 167.443 39 Faiz et al. [140]
LLaMA-33B GPU-A100 530,432 233 90 Touvron et al. [137]
Switch TPU-v3 x 1024 663,552 179 72.200 Patterson et al. [17]
OPT-175B GPU-A100 809,472 356 137 Touvron et al. [137]
LLaMA-65B GPU-A100 1,022,362 449 173 Touvron et al. [137]
BLOOM-175B GPU-A100 1,082,880 475 183 Touvron et al. [137]
LaMDA TPU-V3 1418035 451 25.200 Thoppilan et al. [138]
GPT3 GPU-V100 x 10000 3,552,000 1,287 552.100 Patterson et al. [17]
Note: Emissions- Co2 emitted (metric tons), Energy- Power Consumption (MWh)

Figure 13: Relationship between the size of pre-training data and number of model parameters in LLMs

7 Conclusion

NLP empowers computing devices to decipher and process natural language text. Various applications in NLP include
sentiment analysis, misinformation detection, machine translation, and text summarization. It can be observed that NLP
has evolved considerably from rule-based approaches, followed by machine learning and deep learning models to the
advent of transformer-based pre-trained models. Although the efficacy of NLP models at tasks has increased manifold
over time, the sustainability of the models based on factors like efficiency, task-agnosticism, and domain-independence
is a matter of concern. It motivates directing research towards formulating sustainable NLP models. In this paper, a
survey of research works aimed at enhancing the efficiency of NLP models has been conducted. These works have
been systematically presented targeting the various stages of model development. It highlights the efforts towards
devising practical models with appreciable performance for implementation in resource-constrained environments with
a significantly low carbon footprint. It ushers a paradigm shift in devising NLP models keeping sustainability in mind.

23
References
[1] Liu, Bing. Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press, 2015.
[2] Jindal, Nitin, and Bing Liu. “Opinion spam and analysis." In Proceedings of the 2008 international conference on
web search and data mining, pp. 219-230. 2008.
[3] Cho, Kyunghyun, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. "On the Properties of Neural
Machine Translation: Encoder–Decoder Approaches." In Proceedings of SSST-8, Eighth Workshop on Syntax,
Semantics and Structure in Statistical Translation, pp. 103-111. 2014.
[4] Bahdanau, Dzmitry, Kyung Hyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to
align and translate." In 3rd International Conference on Learning Representations, ICLR 2015. 2015.
[5] El-Kassas, Wafaa S., Cherif R. Salama, Ahmed A. Rafea, and Hoda K. Mohamed. "Automatic text summarization:
A comprehensive survey." Expert systems with applications 165 (2021): 113679.
[6] Soares, Marco Antonio Calijorne, and Fernando Silva Parreiras. "A literature review on question answering
techniques, paradigms and systems." Journal of King Saud University-Computer and Information Sciences 32, no.
6 (2020): 635-646.
[7] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser,
and Illia Polosukhin. “Attention is all you need." In Advances in neural information processing systems, pp.
5998-6008. 2017.
[8] Kenton, Jacob Devlin Ming-Wei Chang, and Lee Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding." In Proceedings of NAACL-HLT, pp. 4171-4186. 2019.
[9] Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. “Xlnet:
Generalized autoregressive pretraining for language understanding." In Advances in neural information processing
systems, pp. 5753-5763. 2019.
[10] Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin
Stoyanov, and Luke Zettlemoyer. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension." In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 7871-7880. 2020.
[11] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. “Improving language
understanding by generative pre-training." URL https://s3-us-west-2.amazonaws.com/openai-
assets/researchcovers/languageunsupervised/language understanding paper. pdf (2018).
[12] Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. “Language models are
unsupervised multitask learners." OpenAI Blog 1, no. 8 (2019): 9.
[13] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan et al. "Language models are few-shot learners." In Proceedings of the 34th International Conference
on Neural Information Processing Systems, pp. 1877-1901. 2020.
[14] Chowdhary, K. R. "Natural language processing." In Fundamentals of Artificial Intelligence, pp. 603-649. Springer,
New Delhi, 2020.
[15] Gers, Felix A., and Jürgen Schmidhuber. "Recurrent nets that time and count." In Proceedings of the IEEE-INNS-
ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and
Perspectives for the New Millennium, vol. 3, pp. 189-194. IEEE, 2000.
[16] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning
in NLP." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.
3645-3650. 2019.
[17] Patterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David
So, Maud Texier, and Jeff Dean. "Carbon emissions and large neural network training." arXiv preprint
arXiv:2104.10350 (2021).
[18] Schwartz, Roy, Jesse Dodge, Noah A. Smith, and Oren Etzioni. "Green ai." Communications of the ACM 63, no.
12 (2020): 54-63.
[19] Treviso, Marcos, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid et al.
"Efficient methods for natural language processing: A survey." Transactions of the Association for Computational
Linguistics 11 (2023): 826-860.
[20] Shu, Kai, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. “Fake news detection on social media: A data
mining perspective." ACM SIGKDD Explorations Newsletter 19, no. 1 (2017): 22-36.

24
[21] Qazvinian, Vahed, Emily Rosengren, Dragomir Radev, and Qiaozhu Mei. "Rumor has it: Identifying misinforma-
tion in microblogs." In Proceedings of the 2011 conference on empirical methods in natural language processing,
pp. 1589-1599. 2011.
[22] Lewis, Mike, and Angela Fan. "Generative question answering: Learning to answer the whole question." In
International Conference on Learning Representations. 2018.
[23] Kupiec, Julian, Jan Pedersen, and Francine Chen. "A trainable document summarizer." In Proceedings of the 18th
annual international ACM SIGIR conference on Research and development in information retrieval, pp. 68-73.
1995.
[24] Nallapati, Ramesh, Bing Xiang, and Bowen Zhou. "Sequence-to-sequence rnns for text summarization." (2016).
[25] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. “Thumbs up?: sentiment classification using machine
learning techniques." In Proceedings of the ACL-02 conference on Empirical methods in natural language
processing-Volume 10, pp. 79-86. Association for Computational Linguistics, 2002.
[26] Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. “Recognizing contextual polarity in phrase-level sentiment
analysis." In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in
Natural Language Processing. 2005.
[27] Ansar, Wazib, and Saptarsi Goswami. "Combating the menace: A survey on characterization and detection of fake
news from a data science perspective." International Journal of Information Management Data Insights 1, no. 2
(2021): 100052.
[28] Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the
natural language toolkit. “ O’Reilly Media, Inc.", 2009.
[29] Xu, Peng, Davis Liang, Zhiheng Huang, and Bing Xiang. "Attention-guided generative models for extractive
question answering." arXiv preprint arXiv:2110.06393 (2021).
[30] Nallapati, Ramesh, Feifei Zhai, and Bowen Zhou. "Summarunner: A recurrent neural network based sequence
model for extractive summarization of documents." In Thirty-first AAAI conference on artificial intelligence.
2017.
[31] See, Abigail, Peter J. Liu, and Christopher D. Manning. "Get To The Point: Summarization with Pointer-Generator
Networks." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 1073-1083. 2017.
[32] Nakagawa, Tetsuji, Kentaro Inui, and Sadao Kurohashi. "Dependency tree-based sentiment classification using
CRFs with hidden variables." In Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pp. 786-794. Association for Computational
Linguistics, 2010.
[33] Harris, Zellig S. "Distributional structure." Word 10, no. 2-3 (1954): 146-162.
[34] Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. “Indexing
by latent semantic analysis." Journal of the American society for information science 41, no. 6 (1990): 391-407.
[35] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words
and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119.
2013.
[36] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation."
In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.
1532-1543. 2014.
[37] Y. Bengio, R. Ducharme, P. Vincent. A neural probabilistic language model. Journal of Machine Learning
Research, 3:1137-1155, 2003.
[38] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." In International
conference on machine learning, pp. 1188-1196. PMLR, 2014.
[39] Irsoy, Ozan, and Claire Cardie. "Opinion mining with deep recurrent neural networks." In Proceedings of the 2014
conference on empirical methods in natural language processing (EMNLP), pp. 720-728. 2014.
[40] Liu, Qian, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. “Automated rule selection for aspect extraction in opinion
mining." In Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
[41] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." stat 1050 (2015):
9.

25
[42] Joulin, Armand, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. "Bag of Tricks for Efficient Text Classifi-
cation." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pp. 427-431. 2017.
[43] Poria, Soujanya, Erik Cambria, and Alexander Gelbukh. “Aspect extraction for opinion mining with a deep
convolutional neural network." Knowledge-Based Systems 108 (2016): 42-49.
[44] Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pp. 328-339. 2018.
[45] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. "Deep contextualized word representations." In Proceedings of NAACL-HLT, pp. 2227-2237. 2018.
[46] Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. "Transformer-
XL: Attentive Language Models beyond a Fixed-Length Context." In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pp. 2978-2988. 2019.
[47] Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT:
A Lite BERT for Self-supervised Learning of Language Representations." In International Conference on Learning
Representations. 2019.
[48] Kitaev, Nikita, Lukasz Kaiser, and Anselm Levskaya. "Reformer: The Efficient Transformer." In International
Conference on Learning Representations. 2019.
[49] Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-attention with linear
complexity." arXiv preprint arXiv:2006.04768 (2020).
[50] Roy, Aurko, Mohammad Saffar, Ashish Vaswani, and David Grangier. "Efficient content-based sparse attention
with routing transformers." Transactions of the Association for Computational Linguistics 9 (2021): 53-68.
[51] Jaegle, Andrew, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. "Perceiver:
General perception with iterative attention." In International conference on machine learning, pp. 4651-4664.
PMLR, 2021.
[52] Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry
et al. "Learning transferable visual models from natural language supervision." In International conference on
machine learning, pp. 8748-8763. PMLR, 2021.
[53] Zhang, Susan, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan et al.
"Opt: Open pre-trained transformer language models." arXiv preprint arXiv:2205.01068 (2022).
[54] Sajjad, Hassan, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. "On the effect of dropping layers of pre-trained
transformer models." Computer Speech & Language 77 (2023): 101429.
[55] Ray, Paramita, and Amlan Chakrabarti. "A Mixed approach of Deep Learning method and Rule-Based method to
improve Aspect Level Sentiment Analysis." Applied Computing and Informatics (2019).
[56] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "An efficient methodology for
aspect-based sentiment analysis using BERT through refined aspect extraction." Journal of Intelligent & Fuzzy
Systems 40, no. 5 (2021): 9627-9644.
[57] Malik, Vikas, and Amit Kumar. "Sentiment Analysis of Twitter Data Using Naive Bayes Algorithm." International
Journal on Recent and Innovation Trends in Computing and Communication 6, no. 4 (2018): 120-125.
[58] Huq, Mohammad Rezwanul, Ahmad Ali, and Anika Rahman. “Sentiment analysis on Twitter data using KNN and
SVM." IJACSA) International Journal of Advanced Computer Science and Applications 8, no. 6 (2017): 19-25.
[59] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is
difficult." IEEE transactions on neural networks 5, no. 2 (1994): 157-166.
[60] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997):
1735-1780.
[61] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas et al. "An empirical analysis of compute-optimal large language model training." Advances in Neural
Information Processing Systems 35 (2022): 30016-30030.
[62] Lee, Katherine, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and
Nicholas Carlini. "Deduplicating Training Data Makes Language Models Better." In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8424-8445. 2022.

26
[63] Mishra, Swaroop, and Bhavdeep Singh Sachdeva. "Do we need to create big datasets to learn a task?." In
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pp. 169-173. 2020.
[64] Le Bras, Ronan, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal,
and Yejin Choi. "Adversarial filters of dataset biases." In International conference on machine learning, pp.
1078-1088. PMLR, 2020.
[65] Ren, Pengzhen, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin
Wang. "A survey of deep active learning." ACM computing surveys (CSUR) 54, no. 9 (2021): 1-40.
[66] Yuan, Michelle, Hsuan-Tien Lin, and Jordan Boyd-Graber. "Cold-start Active Learning through Self-supervised
Language Modeling." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 7935-7948. 2020.
[67] Sener, Ozan, and Silvio Savarese. "Active Learning for Convolutional Neural Networks: A Core-Set Approach."
In International Conference on Learning Representations. 2018.
[68] Margatina, Katerina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. "Active Learning by Acquiring
Contrastive Examples." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 650-663. 2021.
[69] Settles, Burr, Mark Craven, and Lewis Friedland. "Active learning with real annotation costs." In Proceedings of
the NIPS workshop on cost-sensitive learning, vol. 1. 2008.
[70] Lowell, David, Zachary C. Lipton, and Byron C. Wallace. "Practical Obstacles to Deploying Active Learning."
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 21-30. 2019.
[71] Karamcheti, Siddharth, Ranjay Krishna, Li Fei-Fei, and Christopher D. Manning. "Mind Your Outliers! Inves-
tigating the Negative Impact of Outliers on Active Learning for Visual Question Answering." In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7265-7281. 2021.
[72] Press, Ofir, Noah A. Smith, and Mike Lewis. "Shortformer: Better Language Modeling using Shorter Inputs."
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493-5505. 2021.
[73] Kumar, M., Benjamin Packer, and Daphne Koller. "Self-paced learning for latent variable models." Advances in
neural information processing systems 23 (2010).
[74] Melamud, Oren, Jacob Goldberger, and Ido Dagan. "context2vec: Learning generic context embedding with
bidirectional lstm." In Proceedings of the 20th SIGNLL conference on computational natural language learning,
pp. 51-61. 2016.
[75] Conneau, A., D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. "Supervised learning of universal sentence
representations from natural language inference data." In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pp. 670-680. Association for Computational Linguistics, 2017.
[76] Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant et al.
"Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
[77] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks."
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational
Linguistics, 2019.
[78] Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. "Language-agnostic BERT Sen-
tence Embedding." In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 878-891. 2022.
[79] Del Giudice, Marco. "Effective dimensionality: A tutorial." Multivariate behavioral research 56, no. 3 (2021):
527-542.
[80] Patel, Kevin, and Pushpak Bhattacharyya. "Towards lower bounds on number of dimensions for word embeddings."
In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short
Papers), pp. 31-36. 2017.
[81] Raunak, Vikas, Vivek Gupta, and Florian Metze. "Effective dimensionality reduction for word embeddings." In
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 235-243. 2019.

27
[82] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "TexIm: A Novel Text-to-Image
Encoding Technique Using BERT." In Computer Vision and Machine Intelligence: Proceedings of CVMI 2022,
pp. 123-139. Singapore: Springer Nature Singapore, 2023.
[83] Qiu, Jiezhong, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. "Blockwise Self-Attention for
Long Document Understanding." In Findings of the Association for Computational Linguistics: EMNLP 2020,
pp. 2555-2565. 2020.
[84] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint
arXiv:2004.05150 (2020).
[85] Liu, Peter J., Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer.
"Generating Wikipedia by Summarizing Long Sequences." In International Conference on Learning Representa-
tions. 2018.
[86] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "A novel selective learning based
transformer encoder architecture with enhanced word representation." Applied Intelligence 53, no. 8 (2023):
9424-9443.
[87] Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating long sequences with sparse transformers."
arXiv preprint arXiv:1904.10509 (2019).
[88] Du, Nan, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun et al.
"Glam: Efficient scaling of language models with mixture-of-experts." In International Conference on Machine
Learning, pp. 5547-5569. PMLR, 2022.
[89] Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter
Hawkins et al. "Masked language modeling for proteins via linearly scalable long-context transformers." arXiv
preprint arXiv:2006.03555 (2020).
[90] LeCun, Yann, John Denker, and Sara Solla. "Optimal brain damage." Advances in neural information processing
systems 2 (1989).
[91] Louizos, Christos, Max Welling, and Diederik P. Kingma. "Learning Sparse Neural Networks through L0
Regularization." In International Conference on Learning Representations. 2018.
[92] Sanh, Victor, Thomas Wolf, and Alexander Rush. "Movement pruning: Adaptive sparsity by fine-tuning."
Advances in Neural Information Processing Systems 33 (2020): 20378-20389.
[93] Fan, Angela, Edouard Grave, and Armand Joulin. "Reducing Transformer Depth on Demand with Structured
Dropout." In International Conference on Learning Representations. 2019.
[94] Stanton, Samuel, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew G. Wilson. "Does
knowledge distillation really work?." Advances in Neural Information Processing Systems 34 (2021): 6906-6919.
[95] Boutros, Andrew, Eriko Nurvitadhi, Rui Ma, Sergey Gribok, Zhipeng Zhao, James C. Hoe, Vaughn Betz, and
Martin Langhammer. "Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and
GPUs." In 2020 International Conference on Field-Programmable Technology (ICFPT), pp. 10-19. IEEE, 2020.
[96] Gaide, Brian, Dinesh Gaitonde, Chirag Ravishankar, and Trevor Bauer. "Xilinx adaptive compute acceleration
platform: VersalTM architecture." In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 84-93. 2019.
[97] Wang, Hanrui, Zhekai Zhang, and Song Han. "Spatten: Efficient sparse attention architecture with cascade token
and head pruning." In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA),
pp. 97-110. IEEE, 2021.
[98] Ham, Tae Jun, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W. Lee. "ELSA:
Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks." In 2021
ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 692-705. IEEE, 2021.
[99] Lu, Siyuan, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. "Hardware accelerator for multi-head
attention and position-wise feed-forward in the transformer." In 2020 IEEE 33rd International System-on-Chip
Conference (SOCC), pp. 84-89. IEEE, 2020.
[100] Liu, Zejian, Gang Li, and Jian Cheng. "Hardware acceleration of fully quantized bert for efficient natural
language processing." In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.
513-516. IEEE, 2021.
[101] Hooker, Sara. "The hardware lottery." Communications of the ACM 64, no. 12 (2021): 58-65.

28
[102] He, Jiaao, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. "FasterMoE:
modeling and optimizing training of large-scale dynamic pre-trained models." In Proceedings of the 27th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 120-134. 2022.
[103] Rajbhandari, Samyam, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad
Awan, Jeff Rasley, and Yuxiong He. "Deepspeed-moe: Advancing mixture-of-experts inference and training to
power next-generation ai scale." In International Conference on Machine Learning, pp. 18332-18346. PMLR,
2022.
[104] Qu, Zheng, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. "Dota: detect and omit weak
attentions for scalable transformer acceleration." In Proceedings of the 27th ACM International Conference on
Architectural Support for Programming Languages and Operating Systems, pp. 14-26. 2022.
[105] Lannelongue, Loïc, Jason Grealey, and Michael Inouye. "Green algorithms: quantifying the carbon footprint of
computation." Advanced science 8, no. 12 (2021): 2100707.
[106] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. "Quantifying the carbon emis-
sions of machine learning." In Climate Change workshop, NeurIPS 2019. 2019.
[107] Lasse F. Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. "Carbontracker: Tracking and predicting
the carbon footprint of training deep learning models". In ICML Workshop on "Challenges in Deploying and
monitoring Machine Learning Systems". 2020.
[108] Dürlich, Luise, Evangelia Gogoulou, and Joakim Nivre. "On the Concept of Resource-Efficiency in NLP." In
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 135-145. 2023.
[109] Gupta, Udit, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S. Lee, Gu-Yeon Wei, David Brooks, and
Carole-Jean Wu. "Chasing carbon: The elusive environmental footprint of computing." In 2021 IEEE International
Symposium on High-Performance Computer Architecture (HPCA), pp. 854-867. IEEE, 2021.
[110] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances
in neural information processing systems 27 (2014).
[111] Lin, Zhouhan, Minwei Feng, Cicero dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio.
"A structured self-attentive sentence embedding." In International Conference on Learning Representations.
International Conference on Learning Representations, ICLR, 2017.
[112] Kim, Yoon, Carl Denton, Luong Hoang, and Alexander M. Rush. "Structured Attention Networks." In Interna-
tional Conference on Learning Representations. 2016.
[113] Galassi, Andrea, Marco Lippi, and Paolo Torroni. "Attention in natural language processing." IEEE transactions
on neural networks and learning systems 32, no. 10 (2020): 4291-4308.
[114] Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401
(2014).
[115] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based
Neural Machine Translation." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, pp. 1412-1421. 2015.(2015).
[116] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
[117] Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. "Natural
language processing (almost) from scratch." Journal of machine learning research 12, no. ARTICLE (2011):
2493-2537.
[118] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural information
processing systems 28 (2015).
[119] Qiu, Xipeng, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. "Pre-trained models for
natural language processing: A survey." Science China Technological Sciences 63, no. 10 (2020): 1872-1897.
[120] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified text-to-text transformer." The Journal
of Machine Learning Research 21, no. 1 (2020): 5485-5551.
[121] Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint
arXiv:1907.11692 (2019).
[122] Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. "A primer in BERTology: What we know about how BERT
works." Transactions of the Association for Computational Linguistics 8 (2021): 842-866.

29
[123] Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges-
mundo, Mona Attariyan, and Sylvain Gelly. "Parameter-efficient transfer learning for NLP." In International
Conference on Machine Learning, pp. 2790-2799. PMLR, 2019.
[124] Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-rank hypercom-
plex adapter layers." Advances in Neural Information Processing Systems 34 (2021): 1022-1035.
[125] Aghajanyan, Armen, Sonal Gupta, and Luke Zettlemoyer. "Intrinsic Dimensionality Explains the Effectiveness of
Language Model Fine-Tuning." In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 7319-7328. 2021.
[126] Moosavi, Nafise Sadat, Quentin Delfosse, Kristian Kersting, and Iryna Gurevych. "Adaptable Adapters." In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 3742-3753. 2022.
[127] Wang, Yaqing, and Sahaj Agarwal. "AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning." In
The 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
[128] Schick, Timo, and Hinrich Schütze. "Generating Datasets with Pretrained Language Models." In Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6943-6951. 2021.
[129] Reynolds, Laria, and Kyle McDonell. "Prompt programming for large language models: Beyond the few-shot
paradigm." In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-7.
2021.
[130] Wei, Jason, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai,
and Quoc V. Le. "Finetuned Language Models are Zero-Shot Learners." In International Conference on Learning
Representations. 2021.
[131] Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. "Pre-train, prompt,
and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys
55, no. 9 (2023): 1-35.
[132] Schick, Timo, and Hinrich Schütze. "Few-shot text generation with natural language instructions." In Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 390-402. 2021.
[133] Schick, Timo, and Hinrich Schütze. "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural
Language Inference." In Proceedings of the 16th Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume, pp. 255-269. 2021.
[134] Trinh, Trieu H., and Quoc V. Le. "A simple method for commonsense reasoning." arXiv preprint
arXiv:1806.02847 (2018).
[135] Yin, Wenpeng, Jamaal Hay, and Dan Roth. "Benchmarking Zero-shot Text Classification: Datasets, Evaluation
and Entailment Approach." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.
3914-3923. 2019.
[136] Wu, Wei, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. "CorefQA: Coreference resolution as query-based
span prediction." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.
6953-6963. 2020.
[137] Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971
(2023).
[138] Thoppilan, Romal, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia
Jin et al. "Lamda: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).
[139] Luccioni, Alexandra Sasha, Sylvain Viguier, and Anne-Laure Ligozat. "Estimating the carbon footprint of bloom,
a 176b parameter language model." Journal of Machine Learning Research 24, no. 253 (2023): 1-15.
[140] Faiz, Ahmad, Sotaro Kaneda, Ruhan Wang, Rita Osi, Parteek Sharma, Fan Chen, and Lei Jiang. "LLMCarbon:
Modeling the end-to-end Carbon Footprint of Large Language Models." arXiv preprint arXiv:2309.14393 (2023).
[141] Xu, Canwen, and Julian McAuley. "A survey on model compression and acceleration for pretrained language
models." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 10566-10575. 2023.
[142] Smith, Shaden, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun
Liu et al. "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model." arXiv preprint arXiv:2201.11990 (2022).

30
[143] Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
Barham et al. "Palm: Scaling language modeling with pathways." Journal of Machine Learning Research 24, no.
240 (2023): 1-113.
[144] Patra, Barun, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, and Xia
Song. "Beyond english-centric bitexts for better multilingual language representation learning." arXiv preprint
arXiv:2210.14867 (2022).
[145] Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. "Unsupervised Cross-lingual Repre-
sentation Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 8440-8451. 2020.
[146] Zhang, Jingqing, Yao Zhao, Mohammad Saleh, and Peter Liu. "Pegasus: Pre-training with extracted gap-
sentences for abstractive summarization." In International Conference on Machine Learning, pp. 11328-11339.
PMLR, 2020.
[147] Aromataris, Edoardo, and Alan Pearson. "The systematic review: an overview." AJN The American Journal of
Nursing 114, no. 3 (2014): 53-58.
[148] Moher, David, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and Prisma Group. "Preferred reporting
items for systematic reviews and meta-analyses: the PRISMA statement." International journal of surgery 8, no. 5
(2010): 336-341.
[149] Grant, Maria J., and Andrew Booth. "A typology of reviews: an analysis of 14 review types and associated
methodologies." Health information & libraries journal 26, no. 2 (2009): 91-108.
[150] Khadivi, Nasim, and Sho Sato. "A Bibliometric Study of Natural Language Processing Using Dimensions
Database: Development, Research Trend, and Future Research Directions." Journal of Data Science, Informetrics,
and Citation Studies 2, no. 2 (2023): 77-89.
[151] Bannour, Nesrine, Sahar Ghannay, Aurélie Névéol, and Anne-Laure Ligozat. "Evaluating the carbon footprint of
NLP methods: a survey and analysis of existing tools." In Proceedings of the Second Workshop on Simple and
Efficient Natural Language Processing, pp. 11-21. 2021.
[152] Petersen, Kai, Sairam Vakkalanka, and Ludwik Kuzniarz. "Guidelines for conducting systematic mapping studies
in software engineering: An update." Information and software technology 64 (2015): 1-18.
[153] Koubaa, Anis, Wadii Boulila, Lahouari Ghouti, Ayyub Alzahem, and Shahid Latif. "Exploring ChatGPT
Capabilities and Limitations: A Survey." IEEE Access (2023).
[154] Denney, Andrew S., and Richard Tewksbury. "How to write a literature review." Journal of criminal justice
education 24, no. 2 (2013): 218-234.
[155] Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. "Efficient Transformers: A Survey." ACM
Computing Surveys 55, no. 6 (2023): 1-28. https://doi.org/10.1145/3530811
[156] Ainslie, Joshua, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula,
Sumit Sanghai, Qifan Wang, and Li Yang. "ETC: Encoding Long and Structured Inputs in Transformers." In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
268-284. 2020.
[157] Zaheer, Manzil, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon,
Philip Pham et al. "Big bird: Transformers for longer sequences." Advances in neural information processing
systems 33 (2020): 17283-17297.

Nicole Koenigstein - Transformers in Action (MEAP v7) 2024 (2024, Manning Publications Co.) - Libgen.li
No ratings yet
Nicole Koenigstein - Transformers in Action (MEAP v7) 2024 (2024, Manning Publications Co.) - Libgen.li
272 pages
Quick Start Guide To LLMs by Sinan Ozdemir 1703540700
100% (1)
Quick Start Guide To LLMs by Sinan Ozdemir 1703540700
275 pages
Sinan Ozdemir - Quick Start Guide to Large Language Models, Second Edition-Addison-Wesley (2024)
No ratings yet
Sinan Ozdemir - Quick Start Guide to Large Language Models, Second Edition-Addison-Wesley (2024)
279 pages
(EARLY RELEASE) Quick Start Guide To Large Language Models Strategies and Best Practices For Using ChatGPT and Other LLMs (Sinan Ozdemir) (Z-Library)
100% (14)
(EARLY RELEASE) Quick Start Guide To Large Language Models Strategies and Best Practices For Using ChatGPT and Other LLMs (Sinan Ozdemir) (Z-Library)
132 pages
Computer Theory Book 3rd Edition (English Medium)
No ratings yet
Computer Theory Book 3rd Edition (English Medium)
253 pages
Sinan Ozdemir - Quick Start Guide To Large Language Models - Strategies and Best Practices For Using ChatGPT and Other LLMs-Addison-Wesley Professional (2023)
100% (4)
Sinan Ozdemir - Quick Start Guide To Large Language Models - Strategies and Best Practices For Using ChatGPT and Other LLMs-Addison-Wesley Professional (2023)
326 pages
Config Train
100% (1)
Config Train
12 pages
Husqvarna Huskylock 460D Sewing Machine
100% (1)
Husqvarna Huskylock 460D Sewing Machine
33 pages
Cosmic Ascension - Joshua David Stone
100% (4)
Cosmic Ascension - Joshua David Stone
348 pages
NLP Paper 5
No ratings yet
NLP Paper 5
33 pages
AI-Driven Natural Language Processing Using Transformer Models
No ratings yet
AI-Driven Natural Language Processing Using Transformer Models
3 pages
Transformers in Action MEAP V06 Nicole Koenigstein - Instantly access the full ebook content in just a few seconds
100% (1)
Transformers in Action MEAP V06 Nicole Koenigstein - Instantly access the full ebook content in just a few seconds
47 pages
Transformers: State-of-the-Art Natural Language Processing
No ratings yet
Transformers: State-of-the-Art Natural Language Processing
8 pages
Unit 4 LLM
No ratings yet
Unit 4 LLM
11 pages
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Performance Analysis and Comparison of LLMS Based On Transformer Technology
No ratings yet
Performance Analysis and Comparison of LLMS Based On Transformer Technology
12 pages
[FREE PDF SAMPLE] Transformers in Action MEAP V06 Nicole Koenigstein ebook full chapters
100% (2)
[FREE PDF SAMPLE] Transformers in Action MEAP V06 Nicole Koenigstein ebook full chapters
58 pages
ChatGPT KZ Feb2023 PDF
No ratings yet
ChatGPT KZ Feb2023 PDF
7 pages
Generative AI in the Era of Transformers
No ratings yet
Generative AI in the Era of Transformers
8 pages
2021.sustainlp-1.0
No ratings yet
2021.sustainlp-1.0
10 pages
The NLP Cookbook Modern Recipes For Transformer Ba
No ratings yet
The NLP Cookbook Modern Recipes For Transformer Ba
29 pages
Code, Et Tu - LLM, Transformer, RAG AI - Mastering Large Language Models, Transformer Models, and Retrieval-Augmented Generation (RAG) Technology (2024)
100% (1)
Code, Et Tu - LLM, Transformer, RAG AI - Mastering Large Language Models, Transformer Models, and Retrieval-Augmented Generation (RAG) Technology (2024)
317 pages
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
No ratings yet
Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
40 pages
All you should kno about LLM'S
No ratings yet
All you should kno about LLM'S
10 pages
Dokumen - Pub Quick Start Guide To Large Language Models Strategies and Best Practices For Using Chatgpt and Other Llms 9780138199425
No ratings yet
Dokumen - Pub Quick Start Guide To Large Language Models Strategies and Best Practices For Using Chatgpt and Other Llms 9780138199425
325 pages
Transformer-Based Regression Models For Assessing Reading Passage Complexity: A Deep Learning Approach in Natural Language Processing
No ratings yet
Transformer-Based Regression Models For Assessing Reading Passage Complexity: A Deep Learning Approach in Natural Language Processing
14 pages
Transformers
No ratings yet
Transformers
27 pages
Information 14 00242
No ratings yet
Information 14 00242
17 pages
Understanding LLMS: A Comprehensive Overview From Training To Inference
No ratings yet
Understanding LLMS: A Comprehensive Overview From Training To Inference
30 pages
Chowdhery Et Al. - 2022 - PaLM Scaling Language Modeling With Pathways
No ratings yet
Chowdhery Et Al. - 2022 - PaLM Scaling Language Modeling With Pathways
83 pages
2024 - Efficiency Optimization of Large-Scale Language
No ratings yet
2024 - Efficiency Optimization of Large-Scale Language
8 pages
Unit - 3
No ratings yet
Unit - 3
55 pages
pNLP-Mixer: An Efficient all-MLP Architecture For Language
No ratings yet
pNLP-Mixer: An Efficient all-MLP Architecture For Language
11 pages
Efficient Large Language Models- A Survey
No ratings yet
Efficient Large Language Models- A Survey
67 pages
W03 Benchmarking
No ratings yet
W03 Benchmarking
25 pages
Survery On Fpga and LLM
No ratings yet
Survery On Fpga and LLM
16 pages
Sinan Ozdemir Quick Start Guide To Large Language Models Strategies
No ratings yet
Sinan Ozdemir Quick Start Guide To Large Language Models Strategies
285 pages
Understanding LLMs: A Comprehensive Overview from Training to Inference
No ratings yet
Understanding LLMs: A Comprehensive Overview from Training to Inference
30 pages
Instant download [EARLY RELEASE] Quick Start Guide to Large Language Models: Strategies and Best Practices for using ChatGPT and Other LLMs Sinan Ozdemir pdf all chapter
100% (4)
Instant download [EARLY RELEASE] Quick Start Guide to Large Language Models: Strategies and Best Practices for using ChatGPT and Other LLMs Sinan Ozdemir pdf all chapter
66 pages
Chen et al. - An Agile Framework for Efficient LLM Accelerator Development and Model Inference
No ratings yet
Chen et al. - An Agile Framework for Efficient LLM Accelerator Development and Model Inference
9 pages
Chapter Four_ NLP
No ratings yet
Chapter Four_ NLP
15 pages
aa
No ratings yet
aa
11 pages
1722153544703
No ratings yet
1722153544703
16 pages
Rishabh Sharma (Anantika Johari)
No ratings yet
Rishabh Sharma (Anantika Johari)
8 pages
Overview of The Transformer-Based Models For NLP Tasks
No ratings yet
Overview of The Transformer-Based Models For NLP Tasks
5 pages
Efficient Prompting Methods For Large Language Models - A Survey
100% (1)
Efficient Prompting Methods For Large Language Models - A Survey
18 pages
Tranformrerz
No ratings yet
Tranformrerz
62 pages
good note - Transformer
No ratings yet
good note - Transformer
16 pages
Analysis of The Evolution of Advanced Transformer-Based Language Models: Experiments On Opinion Mining
No ratings yet
Analysis of The Evolution of Advanced Transformer-Based Language Models: Experiments On Opinion Mining
16 pages
Toward_a_Holistic_Performance_Evaluation_of_Large_Language_Models_Across_Diverse_AI_Accelerators
No ratings yet
Toward_a_Holistic_Performance_Evaluation_of_Large_Language_Models_Across_Diverse_AI_Accelerators
10 pages
10.48550 Arxiv.2204.02311
No ratings yet
10.48550 Arxiv.2204.02311
87 pages
A Survey of Small Language Models
No ratings yet
A Survey of Small Language Models
20 pages
14-LookingForward
No ratings yet
14-LookingForward
48 pages
A Survey On Efficient Inference For Large Language Models
No ratings yet
A Survey On Efficient Inference For Large Language Models
35 pages
Survey On Efficient Inference For LLMs 1721657409
No ratings yet
Survey On Efficient Inference For LLMs 1721657409
36 pages
Generative AI Interview Questions and Answers
No ratings yet
Generative AI Interview Questions and Answers
7 pages
Overview of Training LLMs
No ratings yet
Overview of Training LLMs
31 pages
LLM_Review
No ratings yet
LLM_Review
16 pages
Investigating Masking-Based Data Generation in Language Models
No ratings yet
Investigating Masking-Based Data Generation in Language Models
8 pages
2404.14294v1
No ratings yet
2404.14294v1
34 pages
Foundations of LLM
No ratings yet
Foundations of LLM
231 pages
Foundations of Large Language Models 1738142777
No ratings yet
Foundations of Large Language Models 1738142777
101 pages
Whitepaper_Foundational Large Language Models & Text Generation_v2
100% (1)
Whitepaper_Foundational Large Language Models & Text Generation_v2
86 pages
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
From Everand
Mastering Partial Least Squares Structural Equation Modeling (Pls-Sem) with Smartpls in 38 Hours
Ken Kwong-Kay Wong
3/5 (1)
la copulacion de los pasos de girafa
No ratings yet
la copulacion de los pasos de girafa
20 pages
una poya os doy a dar articulos gratis
No ratings yet
una poya os doy a dar articulos gratis
102 pages
el aparamiento de los bromedarios
No ratings yet
el aparamiento de los bromedarios
8 pages
The Printing Press as an Agent of Change
No ratings yet
The Printing Press as an Agent of Change
82 pages
How To Augment Laachine Translationddsscribdesunaestafaning
No ratings yet
How To Augment Laachine Translationddsscribdesunaestafaning
56 pages
PG - M.Sc. - Computer Science - 341 14 - Lab Advanced Java Programming - Binder
No ratings yet
PG - M.Sc. - Computer Science - 341 14 - Lab Advanced Java Programming - Binder
72 pages
.... DI Based On Percentage and Ratio.... Questions Compiled by ANKUR MUKHERJEE
No ratings yet
.... DI Based On Percentage and Ratio.... Questions Compiled by ANKUR MUKHERJEE
2 pages
SKF SNL 30 and 31 Series Housings New - 6101 - EN
No ratings yet
SKF SNL 30 and 31 Series Housings New - 6101 - EN
112 pages
Selective Problems For Practice
No ratings yet
Selective Problems For Practice
5 pages
Circle and Triangle Olympiad Lamoen
No ratings yet
Circle and Triangle Olympiad Lamoen
4 pages
Sscsa-Dwg-0007 TR 50 Carier Planetary
No ratings yet
Sscsa-Dwg-0007 TR 50 Carier Planetary
54 pages
Logan Delphi
No ratings yet
Logan Delphi
5 pages
Lecture 39
No ratings yet
Lecture 39
9 pages
4 20ma Conversion
No ratings yet
4 20ma Conversion
9 pages
Solution
No ratings yet
Solution
6 pages
Effect of Cleaning Point of Uniclean Machine in Blow Room On Cleaning Efficiency and Yarn Quality
No ratings yet
Effect of Cleaning Point of Uniclean Machine in Blow Room On Cleaning Efficiency and Yarn Quality
21 pages
Humatrol P Lote 0004
75% (4)
Humatrol P Lote 0004
2 pages
Electrochemical Impedance Spectros
100% (1)
Electrochemical Impedance Spectros
14 pages
Ballistic Transport and Velocity Overshoot in Semiconductors: Part I-Uniform Field1 Effects
No ratings yet
Ballistic Transport and Velocity Overshoot in Semiconductors: Part I-Uniform Field1 Effects
4 pages
4 Ge 144 Plane Table Survey
No ratings yet
4 Ge 144 Plane Table Survey
36 pages
Stock Management System
No ratings yet
Stock Management System
20 pages
Xi-C Worksheet Alcohol and Carboxylic Acid
No ratings yet
Xi-C Worksheet Alcohol and Carboxylic Acid
4 pages
Mental Time Travel
100% (1)
Mental Time Travel
13 pages
Function Manual Basic Positioner en-US
No ratings yet
Function Manual Basic Positioner en-US
92 pages
Tos Math 3
No ratings yet
Tos Math 3
3 pages
52 Python Interview Questions Answers PDF
No ratings yet
52 Python Interview Questions Answers PDF
74 pages
Yaralandim - Nalan Klarnet
No ratings yet
Yaralandim - Nalan Klarnet
2 pages
Mathematics: Teacher's Resource Book
No ratings yet
Mathematics: Teacher's Resource Book
10 pages
DRS (Data Recovery System) : 1 Key Features
No ratings yet
DRS (Data Recovery System) : 1 Key Features
3 pages
Conic Section-Circles
No ratings yet
Conic Section-Circles
15 pages
Autoclaves
No ratings yet
Autoclaves
13 pages

A Survey On Transformers in NLP With Focus On Efficiency

Uploaded by

A Survey On Transformers in NLP With Focus On Efficiency

Uploaded by

A S URVEY ON T RANSFORMERS IN NLP WITH F OCUS ON

Wazib Ansar Saptarsi Goswami Amlan Chakrabarti

[email protected] Kolkata, India Kolkata, India

3.1 Applications of NLP

Some of the renowned applications of NLP have been elucidated as follows:

3.1.1 Sentiment Analysis

3.1.2 Misinformation Detection

3.1.3 Machine Translation

3.1.5 Text Summarization

3.2 Evolution of NLP

3.2.1 Rule-Based Approaches

3.2.2 Traditional Machine Learning Approaches

3.2.3 Deep Learning Approaches

Year Task Description Methods Used Author(s)

(b) Traditional Machine Learning

(c) Deep Learning

4.1 The Evolution of Transformers

ht = RN N (xt , ht−1 ) (1)

yt = p(yt |y1 , y2 , ..., yt−1 , H) (2)

where αti is a distribution (often a softmax) function as follows:

αti = Pnexp(eti ) (4)

exp(eti ) = fa (st−1 , hi ) (5)

Table 2: Cost Behind Training Transformer-based Models

4.2 Stages of Modeling

Figure 5: Illustration of prompt engineering process

4.2.3 Prompt Engineering

5 Efficient Modeling Considerations

5.1 Efficiency Measures

CostF actors ∝ Es · Ds · n (2)

5.2 Software Designing

Figure 6: Efficiency considerations through software designing

5.2.2 Representation of Text

5.2.3 Model Design

Figure 7: Common types of sparse attention patterns

5.2.4 Model Compression

5.3 Hardware Designing

5.3.1 Customized Hardware

5.3.2 Software-Hardware Co-Design

6 Results and Discussion

6.1 Statistical Insights

Figure 10: Percentage share of article-types

6.2 Trend of Research

6.3 Future Avenues

You might also like