A Survey On Transformers in NLP With Focus On Efficiency
Abstract
The advent of transformers with attention mechanisms and associated pre-trained models has revolutionized the field of Natural Language Processing (NLP). However, such models are resource-intensive due to their highly complex architectures. This limits their application in resource-constrained environments. When choosing an appropriate NLP model, a major trade-off exists between accuracy and efficiency. This paper presents a commentary on the evolution of NLP and its applications, with emphasis on their accuracy as well as efficiency. Following this, a survey
of research contributions towards enhancing the efficiency of transformer-based models at various
stages of model development along with hardware considerations has been conducted. The goal of
this survey is to determine how current NLP techniques contribute towards a sustainable society and
to establish a foundation for future research.
Keywords Attention Mechanism · Efficiency Considerations · LLM · Natural Language Processing (NLP) ·
Transformers
1 Introduction
In recent years, there has been a phenomenal evolution in the field of Natural Language Processing (NLP). This has been built upon a large body of research over time on tasks ranging from sentiment analysis [1], misinformation detection [2, 20, 27], machine translation [3, 4], and text summarization [5] to question-answering [6]. These works have contributed towards addressing the limitations posed by preceding works as well as improving their performance. The
driving force behind this progress has been deep learning techniques, particularly Transformers [7] and associated pre-
trained models like Bidirectional Encoder Representations from Transformers (BERT) [8], XLNet [9], Bidirectional and
Auto-Regressive Transformers (BART) [10], Generative-Pre-trained Transformer (GPT) [11] along with its successors
i.e. GPT-2 [12] and GPT-3 [13]. These advancements have empowered NLP models to perform complex linguistic
tasks relating to understanding natural language and even generating responses as a human would provide [14].
The current research accomplishments in NLP have been enabled through the availability of voluminous textual
data, sophisticated deep learning models, and high-end computing resources. As the complexity of the models rises,
the computing-resource requirements of such models surge exponentially [15]. With the apparent deceleration of Moore's Law1, increasing the performance of algorithms comes at the cost of straining the computing resources along with faltering efficiency. This leads to high energy requirements, translating into a hike in carbon emissions [16, 17]. Therefore,
the need of the hour is to think out of the box and devise sustainable methodologies that can keep the performance
growth rate steady while at the same time being efficient enough to be practically applicable to resource-constrained
environments like mobile and edge devices [18].
The term "efficiency" of a deep learning model in NLP can be generically defined as the trade-off between the
performance and the cost factors. Thus, the goal of efficient modeling lies in achieving pareto-improvement by reducing
1
https://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
the training as-well-as inference cost for a model to achieve a benchmark level of performance [108]. The cost
factors include the number of Floating-point Operations (FlOps) [19], inference time [141], model size [18], speed-up
ratio [141], number of model parameters [18], energy consumption [16], and carbon emissions [17]. The efforts to
enhance efficiency can be directed at various stages of model development, i.e. data curation, text representation, model
design, and model compression. Efficiency can also be achieved by designing optimal hardware and maximizing its
utilization. Thus, achieving efficiency improvement of a language model in NLP is a nuanced task full of challenges.
Despite the challenges, plenty of developments towards efficiency improvement of models in NLP have taken place.
Besides, numerous surveys have tried to summarize these developments to serve as a stepping stone for prospective
researchers who wish to contribute to this domain. Bannour and Ligozat [151] performed a systematic review of the carbon footprint of NLP models. They identified the tools and studied their accuracy as well as their applicability for assessing the energy consumption and carbon emissions of contemporary NLP models. However, the study was limited only to assessing the environmental impact of NLP models for a single task of named entity recognition. Khadivi and Sato [150] performed a bibliometric analysis of NLP papers published between 2002 and 2021 based on factors like
growth-rate, doubling-time, and collaboration among authors. Given the developments, they predicted the research
trend and future directions. Koubaa et al. [153] performed a critical review on ChatGPT by discussing the supporting
concepts, competing technologies along with its applications. Treviso et al. [19] performed a literature review on efficient approaches in NLP focusing on data processing, model design as well as hardware utilization. The paper primarily confers theoretical narratives; a comparative analysis of results from the viewpoint of efficiency is
missing. Xu and McAuley [141] presented a review of model compression and acceleration techniques with a discussion
on associated metrics for efficiency evaluation. Their study was limited only to pre-trained models and did not account
for data-efficiency, parameter-efficiency, or hardware-design considerations. Tay et al. [155] presented a taxonomy
of efficient transformer models in NLP in the form of a literature review. Even though the survey is extensive, a comprehensive coverage of the works in the given domain is lacking. Xipeng et al. [119] surveyed pre-trained models in
NLP with emphasis on model categories, pre-training objectives, fine-tuning, and downstream tasks. However, they do
not focus on the efficiency considerations of the models.
To address the shortcomings of the previous surveys, we augment the body of knowledge with a first-of-its-kind
systematic literature review on efficient transformer-based models in NLP. Firstly, it presents a primer on NLP, its
applications, and the evolution of NLP techniques. Then, it performs an extensive as well as exhaustive study focusing on all stages of model development to achieve efficiency, ranging from data curation to model design and involving pre-training, fine-tuning, prompt engineering, and inferencing. Not only does it explore software improvements but also efficient hardware developments accompanied by software-hardware co-designing approaches. It furnishes the qualitative as well as quantitative evolution of NLP models and weighs the efficacy of the models in terms of their efficiency.
Lastly, it analyzes the trend of research and presents the future directions. This survey paper targets researchers, professionals, and scholars interested in NLP, particularly transformer-based Large Language Models (LLMs), who wish to design leaner, more efficient models to bring down the overall computational budget or to enable deployment on devices with low computational resources. The key contributions of this paper are as follows:
1. We conduct a comprehensive study on transformer-based models in NLP emphasizing all stages of model
development.
2. This paper presents a qualitative and quantitative analysis of software as-well-as hardware-based contributions
related to transformers in NLP from the perspective of efficiency.
3. Finally, based on the review of existing works, we perceive the trend of developments in NLP and present a
road-map to achieve pareto-optimality.
The remainder of this paper has been organized in the following manner. Section 2 puts forth the methodology adopted
for this survey. Section 3 enunciates the domain of NLP, its applications, and the evolution of NLP based on the
programming paradigms. Section 4 elucidates the concepts associated with transformers and various modeling stages
in transformer-based LLMs. Section 5 showcases the developments towards efficient modeling. Section 6 presents
the results of this survey in the form of statistical insights and research trends, and ushers in the future scope. Finally, in
section 7, the conclusions are drawn.
2 Survey Methodology
The foundation of research lies in the critical review of existing literature and analysis of the previous results obtained
through related works. It can serve several purposes, including presenting the information that is currently available about
a term or concept, mapping the history of developments, determining connections between related concepts, assessing
the evidence supporting any proposition, or demonstrating why a problem merits more investigation [147]. Irrespective
Figure 1: PRISMA for the survey methodology
of the field of study, there have been various typologies of surveys distinguished by certain characteristics [149].
Bibliometric analysis is a kind of survey utilizing article details like journal/ conference name, publication date, and
citations as-well-as author details like name, affiliation, and collaborations to assess the developments and trends in a
field of study from a statistical perspective [150]. A systematic review gathers and summarizes the findings of research
works on a given subject that satisfy the standards of scientific credibility and pertinence to form a set of research
questions and answer them [151]. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement comprises a 27-item checklist for systematic reviews [148]. A systematic mapping describes and
catalogs the existing information on a topic or question of interest rather than attempting to provide a response to a
particular question [152]. A literature review seeks to uncover important ideas, hypotheses, and research findings as
well as knowledge gaps. It makes an effort to go over the claims and conclusions from earlier research in a narrative,
chronological order [154]. A critical review examines certain concepts, themes, or theoretical viewpoints found in the
body of contemporary works. It provides more of a reflection and critique of the concept under consideration [153].
However, it often introduces bias due to the authors' contextualization of previous works with respect to their own propositions. After examining the merits and demerits of the survey typologies, we conclude that, given the topic of
our survey and the developments in the given field, a combination of literature review and systematic review would be
the best option. Hence, we adopt a systematic literature review as the survey methodology for this paper. This would
enable us to elucidate the vital concepts related to transformers in NLP with associated developments from the perspective
of efficiency in a systematic manner. We formulate the following research questions and attempt to address them in this
survey:
RQ1: What are the applications of NLP, and what kinds of techniques are used to perform such applications?
RQ2: What are transformer-based models and how are transformers utilized in LLMs for NLP?
RQ3: What is the efficiency vs efficacy trade-off in transformer-based LLMs?
RQ4: What efficiency measures are present for NLP models?
RQ5: What efficiency considerations are there for transformer-based NLP models?
RQ6: Which stages of model development can be targeted for efficiency enhancement?
RQ7: What is the current research trend in NLP, and to what extent will efficiency considerations be prevalent in the near future?
The survey comprised original as well as review articles written in the English language, published in digital
libraries like Google Scholar2 , ACM Digital Library3 , IEEE Xplore, Semantic Scholar4 and Science Direct5 . The
articles were retrieved using keywords related to NLP and its associated terms like "NLP", "pre-trained models",
"LLM", "transformers", "embedding", "pre-training", "fine-tuning", "prompt engineering", "sustainability in NLP" and
"efficiency modeling in NLP". Articles published between 2000 to 2023 were included with a prime focus on articles
published since 2016. The papers that were recently published were given preference. Besides, some pioneering works
were included irrespective of the year in which they were published. Journal publications were chosen over conference
papers where two or more articles were found to share the same subject or methodology. When choosing the papers, the
journal's impact factor and citations were taken into account. Duplicate articles were removed where it was observed that the authors had published similar works. At first, 3,210 articles were retrieved, out of which 214 duplicates were removed.
A spreadsheet application was utilized to store and process article metadata like "title", "abstract", "reference", "author
list", "year of publication", "name of journal or conference" and "number of citations". Then, 1,990 articles were
filtered out through statistical examination based on the above-mentioned criteria. From the remaining 1,006 articles,
312 articles were short-listed after going through the title and abstract. Finally, the full-text screening of the short-listed
articles was performed, leading to 151 articles being included in the survey for this paper. To ensure a scientific and systematic structure of content, the study has been organized into several coherent sections based on significance.
3 Overview of NLP
NLP is a research domain concerned with providing computing devices the ability to comprehend and process input text
in natural language understood by human beings. Using various NLP approaches one can extract meaningful information
from an unstructured text corpus and even synthesize outputs in natural language [14]. The two complementary facets
of NLP are Natural Language Understanding (NLU) and Natural Language Generation (NLG) as illustrated in Figure 2.
NLU is the process of enabling computers to understand and derive meaning from natural language. By bridging the
gap between unstructured text data and representations that are understood by machines, NLU enables machines to
comprehend and process natural language input. Instances include sentiment analysis [1], opinion spam classification [2],
fake news detection [20] and rumor verification [21]. On the other hand, NLG enables computers to produce natural
language from structured data or other unstructured text inputs. The primary objective of NLG is to communicate
information in a way that is comprehensible to human beings and appropriate as per the given situation. Instances
include question answering [22], machine translation [3] and text summarization [23, 24].
Figure 2: Applications of NLP
3.1.1 Sentiment Analysis
Sentiment analysis is an application of NLP concerned with the extraction and evaluation of expressions, feelings,
and orientations of people regarding a certain physical or abstract subject [1]. It has evolved over a period of time
with primarily three tiers of analysis: document-based, sentence-based, and aspect-based. While Document-Based methods provide the overall sentiment for the entire document, they fail to capture the sentiments expressed in individual sentences [25, 58]. Sentence-Based methods provide the sentiment polarity associated with individual sentences in a document. They are an improvement over Document-Based methods but falter in capturing sentiments associated
with the aspects present in a sentence [26]. Aspect-Based Sentiment Analysis (ABSA) redresses the impediments of
Document-Based methods as well as Sentence-Based methods with its ability to associate sentiments with individual
aspects [1, 55].
3.1.2 Misinformation Detection
Misinformation detection deals with identifying fake, biased, or propaganda-based content posted through online
platforms. In contrast to classifying the polarity of opinions as in sentiment analysis, it detects fraudulent opinions.
The detection methods may be based upon the content, the meta-data, or through learning some patterns present in
the content [2]. It can be extended to fake news as-well-as rumor verification tasks. On one hand, fake news consists
of news articles with delusive content to misinform the readers [20]. On the other hand, a rumor can be attributed to
information that is rapidly disseminated without ascertaining its authenticity. Thus, a rumor might be true, false, or
even unverified [21]. The approaches to detect fake news may exploit information present in the content of the post,
user profile as-well-as social context [27].
3.1.3 Machine Translation
With increasing globalization, the entire world is becoming a single community. Therefore, it is becoming increasingly
important to overcome linguistic barriers so that seamless transmission of knowledge and information can take place.
Machine translation enables automatic conversion of a given piece of text from one language to another. This field
is full of challenges due to multiple possible translations of a word depending upon the context and difficulty in
understanding idiomatic phrases [28]. The advent of neural networks and encoder-decoder architectures for sequence-
to-sequence models [3, 4] mitigated the impediments to a large extent. Subsequent transformer-based approaches [115]
and associated LLMs [18, 120] have helped achieve SOTA performance.
3.1.4 Question Answering
Question Answering (QA) is an application of NLP that focuses on inventing and developing models and algorithms to
automatically produce human-like responses to user queries or questions [6]. The objective is to make it possible for
computers to comprehend natural language input and produce pertinent, correct responses in a conversational style. The
existing QA systems can be grouped into extractive QA and generative QA. The former selects a span of text from a document, termed the context, which serves as the answer to a given question [29], while the latter produces automatically generated, nuanced answers on the basis of the comprehended information [22].
Over the years, NLP has made considerable strides as a result of ground-breaking research, rising processing capacity,
and the creation of complex language models. The development of NLP is evidence of the persistent effort to close the gap between human language and artificial intelligence. The major advancements, from rule-based systems and conventional machine learning techniques to deep learning and pre-trained language models, have been outlined in this section, with Table 1 presenting a comparative review of the notable contributions.
Table 1: An Overview of Notable Contributions in NLP
4 Transformers in NLP
The Transformer-based approaches come under the purview of deep learning. However, due to the revolution in NLP
brought about by them and the immense developments carried out, they deserve to be discussed separately in this
section. The evolution of transformers and the concepts associated with them, accompanied by the stages of modeling, have been enunciated herein-below.
A series of developments paved the way for the transformers. Earlier works on NLG tasks like machine translation
devised sequence-to-sequence models comprising two RNN blocks namely, the encoder and the decoder [3, 110]. Given
an input sequence X = (x_1, x_2, ..., x_n), the RNN-based encoder derives a hidden representation H = (h_1, h_2, ..., h_n). Subsequently, a few other non-linear functions can also be applied to obtain the final H. For the t-th time-step, h_t is calculated from x_t and h_{t−1} through a non-linear encoder function f_e(∗), as shown in equation (1).
h_t = f_e(x_t, h_{t−1})    (1)
The decoder predicts one output token at each time-step on the basis of the previously predicted tokens y_1, y_2, ..., y_{t−1} and H as a joint probability distribution, shown in equation (2).
p(Y) = ∏_t p(y_t | {y_1, ..., y_{t−1}}, H)    (2)
Figure 4: Illustration of the transformer architecture
However, the above approach leads to loss of information as the length of the input sequence grows due to compression
of information into a fixed-length vector. To ameliorate this issue, Bahdanau et al. [4] deployed a soft-search mechanism
for identifying the significant tokens from the input sequence for the prediction of the output at a given time-step. For
this, they introduce the term context vector ct derived from H which weighs the significance of the token hidden states
as shown in equation (3).
c_t = Σ_{i=1}^{n} α_{ti} · h_i    (3)
given that,
α_{ti} = exp(e_{ti}) / Σ_{k=1}^{n} exp(e_{tk})    (4)
e_{ti} = f_a(s_{t−1}, h_i)    (5)
Here, e_{ti} evaluates the alignment between the output at position t and the input tokens around position i. The f_a(∗) function takes the previous hidden state s_{t−1} of the RNN decoder and the i-th time-step hidden representation h_i. Finally, the decoder applies a non-linear function f_d(∗) to generate the output y_t for time-step t as follows:
y_t = f_d(y_{t−1}, s_t, c_t)    (6)
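To make the soft-search computation concrete, the following NumPy sketch (our own illustration, not code from [4]; the additive form of f_a and all dimensions are arbitrary assumptions) computes the alignment scores, attention weights, and context vector of equations (3)-(5) for a toy encoder output.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_alignment(s_prev, h_i, W_s, W_h, v):
    # f_a(s_{t-1}, h_i): a small feed-forward scoring function (additive attention)
    return v @ np.tanh(W_s @ s_prev + W_h @ h_i)

def context_vector(s_prev, H, W_s, W_h, v):
    # e_{ti} for every encoder position i (eq. 5), normalized into alpha_{ti} (eq. 4)
    e = np.array([additive_alignment(s_prev, h_i, W_s, W_h, v) for h_i in H])
    alpha = softmax(e)
    # c_t = sum_i alpha_{ti} * h_i (eq. 3)
    return alpha @ H, alpha

# Toy example: 4 encoder hidden states of dimension 8 and a decoder state of dimension 8
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))          # encoder hidden states h_1..h_n
s_prev = rng.normal(size=8)          # previous decoder state s_{t-1}
W_s = rng.normal(size=(16, 8))
W_h = rng.normal(size=(16, 8))
v = rng.normal(size=16)

c_t, alpha = context_vector(s_prev, H, W_s, W_h, v)
print(alpha.round(3), c_t.shape)     # weights sum to 1; c_t has dimension 8
```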
This led to the foundation of the attention mechanism, an indispensable component of modern transformer architecture.
To compute attention, the input is transformed into an embedded sequence Z ∈ R^(L×D) comprising token and positional embeddings, where L is the sequence length and D is the embedding dimension. Then, the key K_s, query Q_s, and value V_s are calculated through linear transformations on the sequence Z as follows:
Q_s, K_s, V_s = W^q·Z, W^k·Z, W^v·Z    (7)
where W^q, W^k and W^v ∈ R^(D×D/H) denote the weight matrices corresponding to Q_s, K_s and V_s, with H being the number of attention heads. The key K_s represents the input features. These features might be at character-level, word-level, document-level, or a combination of multiple features. Q_s is the vector whose relationship with K_s is computed during attention computation. This is accomplished through a compatibility function f_c(∗) as follows:
e_a = f_c(Q_s, K_s)    (8)
One might notice the similarity between equation (8) and the alignment function in equation (5) wherein the alignment
between the previous decoded token and the hidden states is computed. Furthermore, the fc (∗) can have varied forms as
summarized in Table 2. Following this, the attention weights aw are obtained after being fed into a distribution function
fδ (∗) to normalize the alignment scores and transform it into a probability distribution as follows:
a_w = f_δ(e_a)    (9)
Here too, the distribution functions can have varied forms with softmax activation being the most widely used [7]. To
obtain the attention-weighted representation of the input Z ′ , pairwise inner product between Vs and aw is computed as
follows:
Z′ = a_w · V_s    (10)
V_s represents the sequence vector upon which the attention weights are applied to determine the significant tokens. In most of the studies, V_s is considered identical to K_s. Finally, the attention-based context vector C_a is obtained as the element-wise sum over Z′ such that elements with higher attention weights have more significance than those with lower attention weights, as shown in equation (11).
C_a = Σ_j z′_j, ∀ z′_j ∈ Z′    (11)
Often, there is only one input sequence and attention is computed solely based on it. It gave rise to self-attention or
intra-attention, a concept refined in many later works [111, 112]. It is achieved by having the same vector for both
Ks and Qs . In this manner, it helps to capture the relevance of a particular token in a sequence concerning other
tokens in it. Furthermore, to accommodate parallel computation of attention at diverse positions, Multi-Head Attention (MHA) a^m_w was devised, which concatenates the a_w computations from all the D_h attention heads and projects them through W^o ∈ R^(D×D), as depicted herein-below.
a^m_w = Concatenate(a_w[i]) · W^o, ∀ i ∈ D_h    (12)
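As a minimal illustration of equations (7)-(12), the NumPy sketch below (our own, with arbitrary toy dimensions) computes self-attention for each head using the scaled inner-product form of the compatibility function f_c and softmax as the distribution function f_δ, then concatenates the heads into the multi-head output.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(Z, Wq, Wk, Wv, Wo, n_heads):
    L, D = Z.shape
    d = D // n_heads                      # per-head dimension D/H
    heads = []
    for i in range(n_heads):
        # Per-head projections (eq. 7): slice out the i-th head's weight block
        Q = Z @ Wq[:, i * d:(i + 1) * d]
        K = Z @ Wk[:, i * d:(i + 1) * d]
        V = Z @ Wv[:, i * d:(i + 1) * d]
        # Compatibility function f_c (eq. 8): scaled inner product
        e_a = Q @ K.T / np.sqrt(d)
        # Distribution function f_delta (eq. 9): softmax over keys
        a_w = softmax(e_a, axis=-1)
        # Attention-weighted representation (eq. 10)
        heads.append(a_w @ V)
    # Concatenate all heads and project (eq. 12)
    return np.concatenate(heads, axis=-1) @ Wo

# Toy example: sequence length L=5, embedding dimension D=16, 4 heads
rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 16))
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_self_attention(Z, Wq, Wk, Wv, Wo, n_heads=4)
print(out.shape)   # (5, 16): one attended representation per input token
```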
A milestone achievement was the transformer architecture with multi-head scaled inner-product attention mechanism
by Vaswani et al. [7] as shown in Figure 4. This was the first time a sequence-to-sequence model entirely based on
self-attention without any CNN or RNN units was proposed. The transformers with attention mechanism provide high
performance with exceptional sequence representation abilities and support parallel training unlike the LSTM-based
sequential methods [7]. Moreover, the genesis of transformer-based pre-trained models or LLMs has transformed the
field of NLP providing relief from training the model from scratch. These models are pre-trained on large data-sets and
just need to be fine-tuned as per the application. This helps to provide high accuracy with computational efficiency and
robustness when applied in various domains, thereby making them an apt choice in the current scenario [7]. One of the foundational LLMs was OpenAI's Generative-Pre-trained Transformer (OpenAI GPT), based upon a transformer-decoder architecture with unidirectional context parsing. To overcome this limitation, Bidirectional Encoder Representations
from Transformers (BERT) [8] adopted bidirectional context-parsing deploying a transformer-encoder architecture.
However, BERT suffers from drawbacks like the exclusion of a "Mask" token during fine-tuning and parallel predictions
without dependency consideration. These drawbacks have been resolved by its successor XLNet through "permutation
language modeling" in which the prediction tokens are permuted randomly [9]. The successors of OpenAI GPT i.e.
GPT-2 [12] and GPT-3 [13] further enhance the performance, efficiency, and reusability with the concept of "in-context
learning". This feature further eliminates the need to fine-tune the model and the model just needs to be conditioned
with the instances or description of the application. Apart from this, LLMs have been devised utilizing the entire
transformer encoder-decoder architecture. The T5 transformer [120] is one such LLM that is pre-trained by predicting
a span of tokens corresponding to a mask. Another variation, PEGASUS [146], enforces masking of entire sentences as a pre-training objective, terming it Gap-Sentence Generation. Similarly, BART [10] comprises
encoder-decoder blocks and applies noise to corrupt the input text and then attempts reconstruction through denoising.
These are just a few examples and the rest of the paper presents several other transformer-based models supported with
an interpretation of their efficiency.
4.2.1 Pre-Training
Creating an LLM does not only revolve around devising a complex architecture with millions of parameters. Rather,
models need to be trained on data-sets proportionate to the model size to deliver optimum performance [116]. Thus,
large models need large data-sets. But, high-quality annotated data-sets are scarcely available for training a model in
a supervised fashion. This is due to annotation being expensive, and requiring expertise in understanding the syntax,
semantics as well as domain knowledge. However, there exists plenty of unannotated textual content that can be utilized
to make LLMs learn vital representations through unsupervised or self-supervised learning. Training LLMs on such objectives beforehand, i.e. Pre-Training, grooms the model towards discerning linguistic intricacies, significantly enhancing the performance at downstream tasks with faster convergence even with limited data. The inception of pre-training
can be attributed to the surge in the development of deep convolutional models following the ImageNet6 challenge in
the early 2010s. In NLP, Collobert et al. [117] first demonstrated the concept of pre-trained word embeddings generated
from large unannotated corpora. Subsequently, the pre-trained versions of word embeddings like GloVe [36] and
Word2Vec [35] were devised. In context to the Language Model, Dai and Le [118] became the torchbearer followed
by other models like ELMo [45], ULMFit [44], GPT [11] and BERT [8]. Since then, a plethora of LLMs have been
developed with an upward trend in associated research. There exist quite a few strategies for pre-training LLMs [119].
Out of them, a few significant ones have been mentioned herein-below.
• Causal Language Modeling (CLM): It relies on self-supervised language modeling to predict the next token
in a sequence maximizing the likelihood of the conditional probability distribution over all the unique tokens
based on the context. CLM works in a unidirectional manner, i.e. left-to-right manner. This implies that the
context only includes the tokens to its left. CLM is more suited for NLG applications. A prominent example
of an LLM using CLM is GPT [11]. For a given sequence X = (x_1, x_2, ..., x_n), the loss function of CLM is computed as follows (a minimal code sketch contrasting the CLM and MLM objectives is given at the end of this subsection):
6
https://image-net.org/challenges/LSVRC/
L_CLM = − Σ_{t=1}^{T} log p(x_t | X_{<t})    (12)
• Masked Language Modeling (MLM): To ameliorate the limitation of CLM to attend only to tokens leftwards,
MLM was devised where the context was constructed in a bidirectional fashion, i.e. allowing it to infer from
tokens present in both right as well as left direction. This makes MLM the apt choice for NLU applications. An
MLM usually works by masking out some random percentage of tokens in the sequence and then predicting
those tokens based on the context. One of the famous LLMs utilizing MLM is BERT [8]. For a given sequence X = (x_1, x_2, ..., x_n), the loss function of MLM is computed as follows:
L_MLM = − Σ_{x′ ∈ m(X)} log p(x′ | X_{\m(X)})    (13)
where, m(X), X\m(X) denote the masked tokens, and the remaining tokens in the sequence X respectively.
Vanilla MLM deals with replacing single tokens, which can reduce its effectiveness at sequence-to-sequence
NLG tasks. A sequence-to-sequence variation of MLM solves this by predicting a span of tokens corresponding
to a mask as can be seen in T5 transformer [120]. Subsequently, even entire sentences have been masked in
LLMs like PEGASUS [146] to make the pre-training objective related to the downstream task of abstractive
summarization. LLMs like BART [10] apply noise to corrupt the input text and then perform denoising
by reconstructing the span of text. This allows pre-training on shorter sequences with equivalent efficacy
contributing towards enhanced efficiency. A limitation of MLM is that the masked tokens are restricted to
pre-training and are not available at the fine-tuning phase leading to a discrepancy.
• Permutation Language Modeling (PLM): To mitigate the drawback of MLM related to the unavailability of
the mask token during the fine-tuning stage, PLM was proposed [9]. PLM generates a random permutation of
the input sequence wherein a permutation defines the order of token predictions (not to be confused with the
order of tokens in the sequence). During pre-training, the model tries to predict some of the tokens selected
as the target considering its position and the remaining tokens. To achieve faster convergence, the endmost
tokens are often predicted. A popular LLM formulated on this pre-training objective is XLNet [9]. Given an
input sequence X with S being its random permutation sequence, the equation for the loss function of PLM is
as follows:
L_PLM = − Σ_{t=1}^{T} log p(s_t | S_{<t})    (14)
• Contrastive Learning (CL): Contrastive learning aims to capture linguistic contextual information by
distinguishing (contrasting) between valid and invalid samples by means of similarity evaluation. Next-
Sentence Prediction (NSP) is an example of CL utilized in BERT [8]. Here, the objective is to identify whether
a pair of sentences are next to each other given a set of contiguous and non-contiguous sentences. However, a
few works have stated that although NSP focuses on the topic as well as coherence prediction, it is found to be
ineffective and unreliable in coherence prediction even demonstrating performance drop due to NSP [121]. To
resolve this issue Sentence-Order Prediction (SOP) was proposed to predict the order of sentences instead of
predicting whether a given sentence is the next sentence to another sentence. The LLM ALBERT showcases
superior performance by modeling the inter-sentence coherence through SOP [47]. The loss functions for both
SOP and NSP aim to determine the consecutiveness of two sentences X and Y as follows:
L_NSP/SOP = − log p(k | X, Y), k ∈ {0, 1}    (15)
Regarding efficiency considerations of LLMs, it can be said that pre-training requires the maximum com-
putational resources among all the stages of modeling. Although the pre-training strategies contribute to a
great extent towards the performance, the model design along with the quality and size of the data upon which
pre-training is performed plays a crucial role in efficiency [19]. The efficient data curation as well as model
design considerations have been discussed in Section 5.2.1 and Section 5.2.3.
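As referenced in the CLM bullet above, the following sketch contrasts how the CLM loss of equation (12) and the MLM loss of equation (13) are computed; a toy vocabulary and random logits stand in for a real LLM, so the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, T = 50, 6
tokens = rng.integers(0, vocab_size, size=T)      # a toy input sequence x_1..x_T

def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Stand-in for a model: random per-position logits over the vocabulary
# (a real LLM would condition these on the left context or on the unmasked tokens)
log_probs = log_softmax(rng.normal(size=(T, vocab_size)))

# CLM (eq. 12): negative log-likelihood of each token that has a left context
clm_loss = -sum(log_probs[t, tokens[t]] for t in range(1, T))

# MLM (eq. 13): mask roughly 15% of positions and score only the masked tokens
masked = rng.random(T) < 0.15
mlm_loss = -log_probs[masked, tokens[masked]].sum()

print(round(float(clm_loss), 2), round(float(mlm_loss), 2))
```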
4.2.2 Fine-Tuning
As seen above, pre-training an LLM serves as an effective model initialization strategy and aids in generalization
with faster convergence on limited annotated data. However, to make a pre-trained model excel at a domain-specific
task, additional training effort is required to exploit annotated samples specific to the downstream task. This is known as fine-tuning. It builds upon the concept of transfer learning, wherein a model pre-trained on a certain task with large data is trained again (fine-tuned) on a related task with significantly less data. There are various fine-tuning approaches.
The first approach is to unfreeze a few layers of the model and retain the weights of the other layers calculated during
pre-training. Usually, the output layer is customized as per the output representation format and fine-tuned with a few
other unfrozen layers upon the task-specific data. The second approach is to fine-tune the frozen model with limited
data during initiation and unfreeze other layers in due course.
The efficiency considerations for fine-tuning lie in minimizing the number of layers to unfreeze, i.e. the number of parameters of the pre-trained LLM to fine-tune. Unfreezing more layers increases the computational requirements of fine-tuning but can enhance the accuracy of the downstream task; this holds only if abundant data is available for fine-tuning. In most cases, fine-tuning only the last few layers can obtain desirable results [122]. This is due to the fact that the lower layers capture low-level, local features primarily related to the syntax, whereas the higher layers capture global information involving high-level semantic abstractions specific to the task at hand. The
efficiency can also be improved through adapter modules, i.e. an isolated network that is fine-tuned and combined with
the pre-trained model having all the parameters intact [123]. Further variations include utilizing Kronecker product
of low-rank matrices for the construction of parameter matrices for the adapter [124]. Another variation involves
reparameterization to low-dimensional subspaces for fine-tuning, enhancing efficiency by reducing the number of
parameter updates [125]. There lies one drawback of the adapter approach: it raises the overall number of model parameters, leading to more computations during inference. This hindrance was resolved through Adaptable Adapters, which apply differing activations specific to each layer and data-set, accompanied by a switch trained to select appropriate layers of the adapter module [126]. Furthermore, AdaMix combines various parameter-efficient adapters to provide SOTA results with an efficiency equivalent to fine-tuning with a single adapter module [127].
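A minimal sketch of the first fine-tuning approach, i.e. unfreezing only the last few layers, is given below in PyTorch; the stand-in encoder, the choice of two unfrozen blocks, and the learning rate are hypothetical and would differ for a real pre-trained LLM.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained encoder with 6 transformer blocks
encoder = nn.Sequential(*[nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(6)])
classifier = nn.Linear(64, 2)            # task-specific output head, always trainable

# Freeze every pre-trained parameter, then unfreeze only the last two blocks
for p in encoder.parameters():
    p.requires_grad = False
for block in list(encoder)[-2:]:
    for p in block.parameters():
        p.requires_grad = True

trainable = [p for p in list(encoder.parameters()) + list(classifier.parameters())
             if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-5)   # only unfrozen parameters are updated
print(sum(p.numel() for p in trainable), "trainable parameters")
```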
4.2.3 Prompt Engineering
• Instruction-based Learning: Also known as Priming, it involves providing the instructions related to the
task description optionally with a few samples of the inputs and their corresponding outputs [130, 132]. For
instance, providing the instruction to perform translation accompanied with a few examples in the prompt to
prime the LLM to generate a translation for any new sentence.
• Template-based Learning: It deals with exploiting predefined structures, known as templates to construct
prompts. The templates can be designed as cloze styled- inserting placeholders in the prompt text and
attempting to fill in the blanks [133], multiple-choice type- providing multiple hypotheses in the template and
asking the model to choose the correct one [134] or prefix-type- adding special prefixes before the input to
denote the task to be performed on the input [131, 132].
• Proxy-Task-based Learning: It involves probing an LLM with a proxy-task, i.e. a related task sharing some attributes of the original task, and transferring the inference to the desired form to obtain the output of the original task. This enhances efficiency and eases inference, since simpler tasks closer to those upon which the model has previously been trained are used to obtain outputs for tasks that would otherwise demand rigorous linguistic comprehension. Instances include applying textual entailment for topic detection [135] or achieving coreference resolution through question answering [136].
Regarding the efficiency of prompt engineering approaches, it can be commented that in-context learning significantly
reduces the computational complexity due to zero parameter updates in the pre-trained LLM. For a multi-task LLM,
prompting can yield results at par with fine-tuning the model with several data samples [131]. Apart from these, certain
prompt engineering practices also enhance efficiency. Firstly, optimizing the length of the prompt and its textual complexity improves the response time by requiring fewer computations. Secondly, designing prompts considering the resources available and allowing batch processing can improve efficiency. Thirdly, caching the intermediate
outputs can reduce the amount of processing required leading to faster response. Finally, the selection of the LLM for
prompt-engineering plays a crucial role. The selection must be done considering the desired performance given the
availability of computational resources.
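For illustration, the snippet below assembles an instruction-based (primed) few-shot prompt and a cloze-style template as plain strings; the task wording, the examples, and the send_to_llm stub are hypothetical and independent of any particular LLM API.

```python
# Instruction-based (few-shot) prompt: task description plus a few input/output pairs
examples = [("The movie was wonderful.", "positive"),
            ("The plot made no sense.", "negative")]
instruction = "Classify the sentiment of each sentence as positive or negative."
shots = "\n".join(f"Sentence: {x}\nSentiment: {y}" for x, y in examples)
query = "Sentence: The acting felt flat.\nSentiment:"
few_shot_prompt = f"{instruction}\n\n{shots}\n\n{query}"

# Cloze-style template: insert a placeholder and let the model fill in the blank
cloze_prompt = "The capital of France is ____."

def send_to_llm(prompt: str) -> str:
    # Hypothetical stub: in practice this would call whichever LLM is being prompted
    return "<model completion>"

print(few_shot_prompt)
print(send_to_llm(cloze_prompt))
```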
The developments to achieve better performance at tasks come at the cost of increased model complexity, translating to escalated training costs and carbon emissions. Given the complexity of SOTA NLP models, the energy consumed in training might even exceed the annual energy requirements of certain cities. Strubell et al. [16] performed a study in which they calculated the power consumption and carbon emissions along with the monetary cost associated with the training of a set of NLP models. In their study, it was found that training an NLP model could cost as much as a trans-Atlantic flight. They also reflected on the percentage of energy coming from renewable sources in countries all over the world. To measure the efficiency η, the trade-off between the model performance and the cost factors needs to be calculated as shown in equation (1). The cost factors can be defined concerning various metrics as follows:
1. Floating-point Operations (FlOps) define the number of floating-point operations needed for a single instance
computation [19]. This can serve as a consistent benchmark irrespective of the hardware of the application.
However, existing High-Performance Computing (HPC) systems with support for parallel processing might lead to non-uniform execution times even with the same number of FlOps.
2. Inference Time denotes the time required by the model to process a test input and generate a suitable response
[141]. Unlike FlOps, it is hardware-dependent, i.e. it depends upon the configuration of the HPC and support
for parallel execution. From the evaluation perspective, it enables a real-time measure of various algorithms
based on execution upon identical HPC.
3. Speed-up Ratio helps to perform comparison of a model concerning another model [141]. Here, one model is
taken as the baseline and the improvement in efficiency of the other model is measured compared to it. In
context to transformer-based models, speed-up can be calculated based on the number of transformer blocks,
attention heads, or overall number of layers in the model.
4. Model Size and Number of Parameters are internal indicators of the computational requirements [18]. Some
models might be more efficient despite the same or even more number of layers and FlOps due to the sharing
of parameters [47]. In such cases, the number of model parameters provides an indicator to the overall model
size and serves as an efficiency evaluation metric.
5. Carbon Footprint is the most significant indicator of the environmental impact due to an LLM. However,
it is an uphill task to precisely report the carbon emissions due to the involvement of multiple factors for
its computation [16, 18]. The preliminary approaches involve tools to calculate the energy consumption
and carbon footprint relying on the execution time, number of cores, memory requirements, and platform
information supplied by the user [105, 106]. Further developments led to packages being deployed on systems
to directly access the CPU, GPU and DRAM statistics and calculate power consumption7 [107]. However,
most of these studies only account for the computing resources and do not consider the cooling, networking
and other operational costs.
η = Performance / Cost Factors    (1)
For performance, it is necessary to discover the pareto-improvement by comparing it with a benchmark, i.e. attaining higher accuracy at lower cost [108]. Schwartz et al. [18] formulated the cost factors as proportional to the time and resources for execution on a single sample E_s, the data size D_s, and the number of epochs n required for training, as depicted in equation (2).
Cost ∝ E_s · D_s · n    (2)
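The following sketch, with purely hypothetical numbers, shows how the cost factor of equation (2) and the efficiency η of equation (1) could be compared for two candidate models.

```python
def training_cost(e_single, data_size, epochs):
    # Cost proportional to per-sample cost E_s, data size D_s and epochs n (eq. 2)
    return e_single * data_size * epochs

def efficiency(performance, cost):
    # eta = performance / cost factors (eq. 1)
    return performance / cost

# Hypothetical models: (accuracy, per-sample FLOPs, training-set size, epochs)
models = {"large": (0.92, 4.0e9, 1_000_000, 3),
          "compact": (0.89, 0.6e9, 1_000_000, 3)}

for name, (acc, flops, d, n) in models.items():
    eta = efficiency(acc, training_cost(flops, d, n))
    print(f"{name}: accuracy={acc}, efficiency={eta:.3e}")
```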
Despite the research developments, the current approaches for measuring efficiency are not fool-proof. There is a disparity in the carbon emissions reported by various monitoring applications. The majority of the studies focus only on model training or do not differentiate between the fine-tuning and prompt-engineering stages. Furthermore, the cost of producing the hardware and infrastructure for deploying these models is often unaccounted for. A study by Gupta et al. [109] reveals that the environmental impact of setting up infrastructure and hardware equipment is the highest among all life-cycle stages for data-centers.
To achieve efficiency in NLP models, numerous software design considerations have been devised to target various stages of model development, as highlighted in Figure 6. In this section, a commentary on such techniques, organized by modeling stage, i.e. data curation, text representation, model design, and model compression, is presented.
7
https://github.com/epfl-iglobalhealth/cumulator
5.2.1 Data Curation
Data curation plays a vital role in determining the efficiency of the Language Model (LM). A data-set with reduced sequence lengths or fewer training samples minimizes the model complexity and reduces the training effort significantly [61]. Duplicate removal from the data-set can enhance the efficiency of an LM and might also improve its performance compared to training on the entire corpus [62]. In the case of pre-trained LMs, such filtering can be applied both during the pre-training [53] as well as the fine-tuning stages [63]. Although filtering eliminates biases inherent in the data-set, its application is restricted to cases with abundant data, as the performance reduces when insufficient data is available [64].
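A minimal form of such duplicate filtering is sketched below using exact hashing of normalized text; production pipelines typically rely on near-duplicate detection such as MinHash, which is omitted here for brevity.

```python
import hashlib

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial variants hash identically
    return " ".join(text.lower().split())

def deduplicate(corpus):
    seen, unique = set(), []
    for doc in corpus:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["Transformers are efficient.",
          "transformers   are efficient.",   # duplicate after normalization
          "Attention has quadratic complexity."]
print(deduplicate(corpus))   # the second entry is dropped
```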
While duplicate removal applies to already available data sets, Active Learning comes into play while collecting data.
It aims to reduce the training data while retaining model performance by labeling the most informative samples and
selecting them for training [65]. For the identification of informative samples, various approaches have been adopted
such as selecting samples with high uncertainty [66], maximum diversity [67] or both [68]. However, determining
the usefulness of the samples and annotating them is a challenging task [69]. Its efficacy across diverse downstream tasks cannot be ascertained, and the selected samples can include outliers [70, 71].
Another perspective on data curation is to order the samples in the data-set to improve their utilization, an approach known as Curriculum Learning. The ordering approach deploys heuristics capturing the complexity of sequences and determines a pace to progressively move from simpler sequences to complex sequences [72]. However, the pace has to be monitored to guarantee efficiency, and automation of the pace proves to be beneficial [73].
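The sketch below illustrates one simple curriculum: samples are ordered by a difficulty heuristic (here, token count) and exposed to training in progressively larger fractions according to a linear pacing function; both the heuristic and the schedule are arbitrary choices for illustration.

```python
def difficulty(sample: str) -> int:
    # Heuristic difficulty: longer sequences are assumed harder
    return len(sample.split())

def curriculum_batches(samples, steps):
    ordered = sorted(samples, key=difficulty)
    for step in range(1, steps + 1):
        # Linear pacing: at step k, train on the easiest k/steps fraction of the data
        cutoff = max(1, int(len(ordered) * step / steps))
        yield ordered[:cutoff]

data = ["short text", "a slightly longer training sentence",
        "an even longer and therefore presumably more difficult training sequence", "tiny"]
for k, subset in enumerate(curriculum_batches(data, steps=3), start=1):
    print(f"step {k}: {len(subset)} samples")
```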
Establishing a balance between the size of training data and the model parameters is also important to achieve pareto-
improvement as mentioned in section 5.1. Hoffmann et al. [116] state that the number of model parameters and the size of the training set should be scaled in equal proportion. They showcased that their model named Chinchilla, based on this principle, outperformed several SOTA models having a significantly higher number of parameters.
Thus it can be inferred that determining the quality of samples in the corpus and selecting high-quality samples devoid
of repetitive information, outliers and incorrectly ordered sequences can boost the modeling efficiency. Moreover, this
can be extended to decomposing the individual text sequences into smaller sub-sequences with essential information
and discarding the irrelevant portions leading to efficient representation of context [56]. This significantly enhances the
efficiency of transformer-based models with attention mechanisms having complexity quadratically proportional to
sequence length [7].
To ameliorate the curse of dimensionality, a few studies have been conducted to determine the optimal embedding dimensions, reducing the excessive memory consumption while retaining the semantic and syntactic characteristics of the data [79]. Instances include determining the embedding dimensions based on corpus statistics like the count of pairwise
equidistant words [80], reducing the dimensionality of embedding vector applying Principal Component Analysis
(PCA) [81] and compressed image representations equivalent to a given text [82].
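As a minimal illustration of the PCA-based variant [81], the sketch below (assuming scikit-learn is available; the embedding values and target dimensionality are arbitrary) projects a toy embedding matrix onto a smaller number of principal components.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(1000, 300))      # toy vocabulary: 1000 words x 300 dims

pca = PCA(n_components=100)                    # compress 300-d vectors down to 100-d
reduced = pca.fit_transform(embeddings)

print(reduced.shape)                                        # (1000, 100)
print(round(pca.explained_variance_ratio_.sum(), 3))        # variance retained
```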
• Chunking: It deals with breaking the input sequence into several blocks, processing each block individually, and connecting the representations of these blocks through recurrence or some other mechanism. ABSA BERT [56] breaks down each sequence based on the significant phrases contained in it while filtering out irrelevant chunks of tokens before feeding them into the BERT model. An extension to the chunking approach has been proposed in the case of Transformer-XL [46], wherein multiple blocks are connected through a recurrence mechanism. This helps to efficiently compute attention for long sequences by breaking them down into multiple blocks.
Figure 7: Common sparse attention patterns: (a) Global (b) Band (c) Dilated (d) Random (e) Block
• Sparse Attention: A few contributions attempt sparsification of the attention matrix to reduce the complexity of computing attention in transformer-based models. This implies limiting the number of keys to be attended to by the queries, based either on certain pre-defined patterns or on input-conditioned connections. Some common patterns are global attention, band attention, dilated attention, random attention, and block attention, as illustrated in Figure 7; a minimal mask-construction sketch is provided after this list. This technique exploits the sparsity that is inherently observed in the attention matrices of real-life applications even when attention is computed on all possible query-key pairs. Sparse Transformer [87] factorizes the attention matrix to attain sparse patterns where connectivity is established between a pre-defined set of tokens. This reduces the complexity of attention to O(n√n). Longformer [84] employs attention at fixed
intervals in a strided fashion. It adopts a blend of band attention, dilated attention, and global attention to
achieve a near linear scaling factor with respect to the sequence length. Extended Transformer Construction
(ETC) [156] follows a similar approach agglomerating global attention and local band attention with relative
positional encoding. Additionally, it employs masking through Contrastive Predictive Coding as a pre-training
objective. BigBird [157] builds upon the ETC model by applying random patterns of sparse-attention. It can
handle sequences 8 times the length and achieve linear complexity compared to the conventional attention
mechanism. Selective Learn Forget Network (SLFN) [86] adopts a gated mechanism upon multi-head attention
in a single-block transformer architecture for selective retention of attention weights. This aided in filtering
out insignificant information while retaining long range dependencies. Memory Compressed Transformer [85]
reduces the number of query-key pairs applying strided convolution. BlockBERT [83] proposes an efficient
version of BERT by incorporating block-wise patterns in the attention matrix for sparsity.
• Mixture-of-Experts (MOE): The concept of sparsification for efficient computation has been taken forward
with the notion of Mixture-of-Experts (MOE). In this, the input is routed through multiple sub-networks
replacing the single feed-forward layer. Models such as GLaM [88] demonstrate that it helps to attain high
accuracy along with efficient use of resources. FasterMoE [102] further tackled the load-imbalance in MOE
models through fine-grained concurrent scheduling for distributed computing.
• Low-Rank Approximation: To reduce the computational complexity of the attention mechanism, low-rank
approximation aims to approximate the attention matrix with a lower-rank matrix. Recently, techniques like
Linformer [49] have been devised to perform low-rank approximations of the self-attention matrix to enhance
efficiency. Similarly, the application of kernels for approximation of the computation of self-attention has
gained popularity as it reduces the effort required to compute self-attention for the entire sequence matrix. A
prominent example of this is the Performers [89].
• Clustering: It refers to grouping related elements, features in a sequence, or even attention heads to achieve
efficient computation of attention. Some other works learn patterns in the data by capturing relevant tokens and
clustering them together into buckets. Based on the similarity metric applied for clustering, various models
have been devised. For instance, the Reformer [48] utilizes a hashing-based similarity measure while the
Routing Transformer [50] deploys a K-means clustering algorithm.
• Parameter Sharing: The complexity of a model is proportional to the number of parameters present. Hence
reducing the number of parameters can be beneficial towards model efficiency. This can be achieved through
the sharing of parameters across the layers in the transformer network. Perceiver [51] is one such model which
performs downsampling apart from sharing weights among layers for efficient computation. ALBERT [47]
on the other hand applies matrix decomposition upon the embedding layer along with cross-layer parameter
sharing.
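Referring back to the sparse attention patterns of Figure 7, the sketch below builds a combined band-plus-global boolean mask and applies it before the softmax; this is the basic mechanism behind models such as Longformer and BigBird, although the window size, global positions, and dimensions used here are arbitrary illustrative choices.

```python
import numpy as np

def band_global_mask(L, window, global_positions):
    mask = np.zeros((L, L), dtype=bool)
    for i in range(L):
        lo, hi = max(0, i - window), min(L, i + window + 1)
        mask[i, lo:hi] = True                    # band (local window) attention
    mask[global_positions, :] = True             # global tokens attend everywhere
    mask[:, global_positions] = True             # and are attended to by every token
    return mask

def masked_attention(Q, K, V, mask):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)        # disallowed pairs get ~zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(4)
L, D = 12, 8
Q, K, V = (rng.normal(size=(L, D)) for _ in range(3))
mask = band_global_mask(L, window=2, global_positions=[0])
out = masked_attention(Q, K, V, mask)
print(mask.sum(), "allowed query-key pairs out of", L * L, "; output shape", out.shape)
```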
Apart from software design, hardware considerations for efficiency in deploying LLMs are a vital yet comparatively
less explored domain. Figure 8 summarizes the developments in efficient hardware design whereas the remainder of
this section explains them in detail.
Figure 8: Efficiency considerations through hardware designing
A drawback of such fixed hardware accelerators is that they are not suited to handle sparse data, high-precision arithmetic operations, or certain linear algebra
problems. Moreover, re-configurable hardware like FPGA is noted to have higher FlOps compared to fixed hardware.
Nevertheless, this can be seen as a viable option considering the production costs of fixed hardware for short-term
applications [101].
Figure 9 presents the year-wise distribution of surveyed papers. As we have focused more on recently published works, the major share of papers is from the last five years. The maximum number of papers lies between 2019 and 2021, following a rising trend. However, for the years 2022 and 2023 a declining trend is observed. This can be attributed to the fact that the number of citations is considered one of the important indicators for assessing the quality of papers, and it is tough for a paper to be cited a significant number of times in such a short span of time.
8
https://www.microsoft.com/en-us/research/project/deepspeed/
9
https://pytorch.org/tutorials/beginner/dist_overview.html
10
https://www.tensorflow.org/guide/distributed_training
11
https://horovod.ai/
Figure 9: Year-wise distribution of papers
Apart from observing the year-wise distribution of papers, the share of various article types, i.e. journals, conferences, books and pre-print papers, has also been analyzed. Moreover, the journal articles have been segregated into regular papers and review papers. From the pie-chart shown in Figure 10, it can be observed that conference papers account for 56% of the total share of articles, while journals have a 31% share, further subdivided into 23% regular papers and 8% review papers. The reason behind this is the presence of a variety of prestigious conferences on NLP which are considered more reputed than several journals. Thus, such conferences are preferred over journals by prominent researchers in NLP. Interestingly, an 11% share of articles is from pre-print platforms like arXiv12. Further review of the high-quality pre-print articles shows that a major share of such articles are milestone achievements authored by eminent researchers and scientists belonging to reputed institutions. As pre-prints offer recognition for contributions within a couple of days, they have become an apt avenue to claim authorship for a novel contribution. Lastly, 2% of the articles reviewed are books. This is because NLP is a rapidly evolving field of research, while books are typically considered permanent sources of knowledge presenting persistent concepts that remain relevant for many years. Hence, a book on NLP might lose its significance in just a few years due to rapid technological developments.
For a more detailed analysis, the distribution of articles comprising significant terms related to NLP has been illustrated in Figure 11. The top-noted terms belong to the following categories13, in descending order: "Transformers NLP", "Efficiency Considerations", "Pre-Trained Models", "Deep Learning", "Hardware Design" and "Machine Learning". This ascertains that the topics discussed in the reviewed papers align with the objective of this survey. It is to be noted that, although transformers are a subset of deep learning techniques in NLP, they have been treated separately for more transparency given the enormous volume of articles based on transformers. From the low share of articles on NLP focusing on the efficient use of hardware, it can be inferred that the current research trend majorly emphasizes software to formulate pareto-optimal solutions with almost no consideration for hardware, whereas the state of
12
https://arxiv.org/
13
The related terms have been grouped and categorized into topics as shown in Figure 11.
Figure 11: Category-wise distribution of articles
developments necessitates the inclusion of hardware design considerations while formulating new models to achieve
optimal efficacy accompanied by efficiency.
Figure 12: Comparison of the trend of NLP vs Transformers. Source: Google Trends
To further validate our research, we compare the search trend of "Transformers" with that of "NLP" based on the number of web searches by people throughout the world over the last five years, using Google Trends14. Figure 12 portrays an overall rising trend for both terms, i.e. "Transformers" and "NLP". The popularity of transformers started rising after 2019 and since then there has been steady growth. Furthermore, the growth rates of "Transformers" and "NLP" are almost similar, with "Transformers" having a slightly steeper growth rate in recent years. This shows the significance of transformer-based models in the evolution of NLP. Moreover, the trend confirms the statistical analysis of the surveyed papers mentioned above.
From the recent developments, it is evident that transformer-based pre-trained models have excelled in terms of accuracy
compared to other conventional machine learning and deep learning algorithms. Table 3 presents a comparison of
various renowned LLMs based on their year of release, number of parameters, accuracy and pre-training data. This
shows that the current trend is towards designing powerful pre-trained models that only need to be fine-tuned as per
the requirements of a particular task. However, these models have tremendous computational complexity and for
each new task, they need to be fine-tuned on a sufficiently large data-set. Recently, some efforts have been directed
towards "task agnostic models" as in subsequent versions of GPT promoting few-shot or even zero-shot learning through
"prompting". However, such models could be termed as "multi-task learners" rather than task agnostic models as
their generalizability is significantly inferior to human cognition. Moreover, to achieve multi-task generalizability,
fine-tuning upon several tasks (i.e. various large-scale data-sets) is required. Overall, this limits research on such
models in resource-constrained environments and also aggravates carbon emissions. This can be visualized from Table
14 https://trends.google.com/trends/
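To make the notion of prompting concrete, the following is a minimal sketch of GPT-style few-shot prompting using the Hugging Face transformers library; the checkpoint, prompt, and labels are illustrative assumptions rather than a setup drawn from the surveyed works.

```python
# A minimal sketch of GPT-style few-shot prompting: the task is specified purely
# through the prompt, with no gradient updates. The gpt2 checkpoint is used only
# because it is small and public; reliable few-shot behaviour generally requires
# far larger models.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: The plot was gripping from start to finish. Sentiment: Positive\n"
    "Review: The battery died after one hour. Sentiment: Negative\n"
    "Review: The screen is sharp and the speakers are superb. Sentiment:"
)

# Greedy decoding of a couple of tokens; the continuation should contain the label.
output = generator(prompt, max_new_tokens=2, do_sample=False)
print(output[0]["generated_text"])
```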
Table 3: Evaluation of High Performance Models
Model | Year | Pre-training Dataset | #Parameters | GLUE | LAMBADA | PTWL
BERT_large [8] | 2018 | WikiEn + BookCorpus | 340M | 81.9 | - | 31.3
GPT [11] | 2018 | BookCorpus | 117M | 72.8 | - | -
RoBERTa [121] | 2019 | BookCorpus + CC-News + OpenWebText + STORIES | 340M | 88.5 | - | -
XLNet [9] | 2019 | WikiEn + BookCorpus + Giga5 + ClueWeb + Common Crawl | 340M | 90.5 | - | -
GPT-2 [12] | 2019 | Web Crawl Text | 1.5B | - | - | 35.76
BART [10] | 2019 | BookCorpus + CC-News + OpenWebText + STORIES | 370M | 88.4 | - | -
Transformer-XL [46] | 2019 | Wikipedia | 24M | - | - | 54.55
GPT-3 [13] | 2020 | Web Crawl Text + BookCorpus | 175B | - | 86.4 | 20.5
T5 [120] | 2020 | Colossal Clean Crawled Corpus (C4) | 11B | 89.7 | - | -
XLM-R [145] | 2020 | CommonCrawl | 10.7B | 91.8 | - | -
Megatron-Turing NLG [142] | 2022 | CommonCrawl + RealNews + GitHub + Wikipedia + Gutenberg + Books3 + ArXiv + PubMed Abstracts + Stack Exchange + Pile-CC + OpenWebText2 | 530B | - | 87.2 | -
PaLM [143] | 2022 | Public Forums + Source Codes + WikiEn + Web Documents + News + Books | 540B | - | 89.7 | -
Turing ULRv6 [144] | 2022 | CommonCrawl | 4.6B | 91.3 | - | -
Chinchilla [116] | 2022 | MassiveText | 70B | - | 77.7 | -
LLaMA [137] | 2023 | CommonCrawl + C4 + GitHub + Wikipedia + Gutenberg + Books3 + ArXiv + Stack Exchange | 65B | - | 84 | -
Note: #Parameters- No. of model parameters, LAMBADA- LAMBADA (Accuracy), PTWL- Penn Treebank (Word Level Perplexity), ’-’ indicates non-availability of data
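For reference, the word-level perplexity (PTWL) reported in Table 3 is simply the exponential of a language model's average per-word negative log-likelihood, whereas LAMBADA is a plain accuracy score; the sketch below shows this relation with placeholder numbers.

```python
import math

# Word-level perplexity is exp(average negative log-likelihood per word);
# lower values indicate that the language model assigns higher probability
# to the held-out text.
def perplexity(total_neg_log_likelihood, num_words):
    return math.exp(total_neg_log_likelihood / num_words)

# Placeholder values chosen only for illustration (roughly GPT-2's PTB range):
print(round(perplexity(3576.0, 1000), 2))  # about 35.7
```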
Figure 13 captures the relationship between the number of model parameters and the size of the data used for pre-training. It can be observed that there has been an overall rising trend in both the size of the pre-training data and the model parameter count. The earlier LLMs raised the model parameters and the pre-training corpus incrementally, following a roughly linear relationship between the two. Subsequent LLMs focused mainly on increasing the model parameters without a commensurate rise in the size of the pre-training data. In contrast, the most recent LLMs strive to strike a balance between the size of the data and the model size, if not to reduce the model size relative to the volume of pre-training data. This suggests that awareness of efficient LLM design is spreading within the NLP research community.
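As a rough illustration of this balance, the sketch below applies the compute-optimal heuristic reported in the Chinchilla study [116], which suggests on the order of 20 pre-training tokens per model parameter; the parameter counts are taken from Table 3, while the resulting token budgets are indicative only.

```python
# A rough sketch of the compute-optimal data/parameter balance suggested by the
# Chinchilla study [116]: roughly 20 pre-training tokens per model parameter.
# The ratio and the resulting token budgets are indicative only.
TOKENS_PER_PARAM = 20

def compute_optimal_tokens(num_parameters):
    """Approximate number of pre-training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * num_parameters

for params in (70e9, 175e9, 540e9):  # Chinchilla-, GPT-3- and PaLM-scale models
    print(f"{params / 1e9:.0f}B params -> ~{compute_optimal_tokens(params) / 1e12:.1f}T tokens")
```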
To make transformer-based pre-trained models efficient, significant effort has been directed toward reducing the complexity of the attention mechanism as-well-as decreasing the number of model parameters. In this paper, several such techniques as-well-as the associated models have been discussed. From these studies, it can be inferred that efficiency can be achieved at multiple stages of model development. However, the goal of formulating an efficient pre-trained model is still far from being achieved. For this, efforts towards efficient model design need to be consolidated with efficient pre-training and fine-tuning strategies, along with effective prompt-based learning approaches (where applicable). Also, the data-sets play a major role in the trade-off between performance and efficiency. Determining the optimal size of the data-sets, the distribution of sequence lengths, as-well-as the quality of training samples is of utmost importance to restrict training costs and prevent over-training.
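As a concrete illustration of reducing attention complexity, the sketch below contrasts standard self-attention, whose score matrix grows quadratically with sequence length, against a Linformer-style low-rank projection of keys and values [49]; the dimensions and random tensors are assumptions for demonstration and do not reproduce any specific model.

```python
import torch

n, d, k = 4096, 64, 256  # sequence length, head dimension, projected length (k << n)
q = torch.randn(n, d)
key = torch.randn(n, d)
value = torch.randn(n, d)

# Standard self-attention: the score matrix is n x n, i.e. O(n^2) time and memory.
full_scores = torch.softmax(q @ key.T / d**0.5, dim=-1)   # shape (n, n)
full_out = full_scores @ value                             # shape (n, d)

# Linformer-style approximation: project keys and values along the sequence
# dimension down to length k before attention, so the score matrix is only n x k.
# In the actual model the projection E is learned; here it is random.
E = torch.randn(k, n) / n**0.5
low_scores = torch.softmax(q @ (E @ key).T / d**0.5, dim=-1)  # shape (n, k)
low_out = low_scores @ (E @ value)                            # shape (n, d)

print(full_scores.shape, low_scores.shape)  # (4096, 4096) vs (4096, 256)
```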
Table 4: Cost Behind Training Models
Model | GPU/TPU | GPU/TPU Hours | Energy | Emissions | Source
Transformer-base | GPU-P100 x 8 | 96 | 1.416 | 0.0117 | Strubell et al. [16]
Transformer-big | GPU-P100 x 8 | 672 | 1.515 | 0.0864 | Strubell et al. [16]
ELMo | GPU-P100 x 3 | 1,008 | 0.51766 | 0.118 | Strubell et al. [16]
BERT-base | GPU-V100 x 64 | 5,056 | 12.04151 | 0.647 | Strubell et al. [16]
GPT-2 | TPU-v3 x 32 | 5,376 | - | - | Strubell et al. [16]
Gopher | GPU-A100 x 16 | 5,725 | 1,066 | 352 | Luccioni et al. [139]
GPT-3 | GPU-A100 x 16 | 6,912 | 1,287 | 502 | Luccioni et al. [139]
NAS | TPU-v2 x 1 | 32,623 | - | - | Strubell et al. [16]
GShard | TPU-v3 x 1024 | 76,185.6 | 24.1 | 4.8 | Patterson et al. [17]
LLaMA-7B | GPU-A100 | 82,432 | 36 | 14 | Touvron et al. [137]
LLaMA-13B | GPU-A100 | 135,168 | 59 | 23 | Touvron et al. [137]
T5 | TPU-v3 x 512 | 245,760 | 85.7 | 46.7 | Patterson et al. [17]
XLM | GPU-V100 x 512 | 250,675.2 | 167.443 | 39 | Faiz et al. [140]
LLaMA-33B | GPU-A100 | 530,432 | 233 | 90 | Touvron et al. [137]
Switch | TPU-v3 x 1024 | 663,552 | 179 | 72.2 | Patterson et al. [17]
OPT-175B | GPU-A100 | 809,472 | 356 | 137 | Touvron et al. [137]
LLaMA-65B | GPU-A100 | 1,022,362 | 449 | 173 | Touvron et al. [137]
BLOOM-175B | GPU-A100 | 1,082,880 | 475 | 183 | Touvron et al. [137]
LaMDA | TPU-v3 | 1,418,035 | 451 | 25.2 | Thoppilan et al. [138]
GPT-3 | GPU-V100 x 10000 | 3,552,000 | 1,287 | 552.1 | Patterson et al. [17]
Note: Energy- Power Consumption (MWh), Emissions- CO2 emitted (metric tons), ’-’ indicates non-availability of data
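The emission figures in Table 4 are generally derived from energy consumption and grid carbon intensity; the sketch below reproduces that arithmetic with assumed illustrative constants, loosely following the general methodology of Strubell et al. [16] and Patterson et al. [17] rather than reproducing any published calculation.

```python
# A rough sketch of how training emissions are commonly estimated:
#   energy (MWh)     = device power (kW) x device count x hours x PUE / 1000
#   emissions (tCO2) = energy (MWh) x grid carbon intensity (tCO2/MWh)
# Every constant below is an assumed, illustrative value.

def training_emissions(device_power_kw, num_devices, hours, pue=1.1, carbon_intensity=0.4):
    """Return (energy in MWh, emissions in metric tons of CO2)."""
    energy_mwh = device_power_kw * num_devices * hours * pue / 1000
    return energy_mwh, energy_mwh * carbon_intensity

# Example: 64 accelerators drawing ~0.3 kW each, running for ~5,000 hours.
energy, co2 = training_emissions(device_power_kw=0.3, num_devices=64, hours=5000)
print(f"~{energy:.0f} MWh, ~{co2:.0f} tCO2")
```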
Figure 13: Relationship between the size of pre-training data and number of model parameters in LLMs
7 Conclusion
NLP empowers computing devices to decipher and process natural language text. Applications of NLP include sentiment analysis, misinformation detection, machine translation, and text summarization. NLP has evolved considerably, from rule-based approaches through machine learning and deep learning models to the advent of transformer-based pre-trained models. Although the efficacy of NLP models has increased manifold over time, the sustainability of these models, judged on factors like efficiency, task-agnosticism, and domain-independence, remains a matter of concern. This motivates directing research towards formulating sustainable NLP models. In this paper, a survey of research works aimed at enhancing the efficiency of NLP models has been conducted, with the contributions presented systematically according to the stage of model development they target. The survey highlights efforts towards devising practical models with appreciable performance that can be deployed in resource-constrained environments with a significantly low carbon footprint, and it ushers in a paradigm shift towards devising NLP models with sustainability in mind.
References
[1] Liu, Bing. Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press, 2015.
[2] Jindal, Nitin, and Bing Liu. “Opinion spam and analysis." In Proceedings of the 2008 international conference on
web search and data mining, pp. 219-230. 2008.
[3] Cho, Kyunghyun, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. "On the Properties of Neural
Machine Translation: Encoder–Decoder Approaches." In Proceedings of SSST-8, Eighth Workshop on Syntax,
Semantics and Structure in Statistical Translation, pp. 103-111. 2014.
[4] Bahdanau, Dzmitry, Kyung Hyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to
align and translate." In 3rd International Conference on Learning Representations, ICLR 2015. 2015.
[5] El-Kassas, Wafaa S., Cherif R. Salama, Ahmed A. Rafea, and Hoda K. Mohamed. "Automatic text summarization:
A comprehensive survey." Expert systems with applications 165 (2021): 113679.
[6] Soares, Marco Antonio Calijorne, and Fernando Silva Parreiras. "A literature review on question answering
techniques, paradigms and systems." Journal of King Saud University-Computer and Information Sciences 32, no.
6 (2020): 635-646.
[7] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser,
and Illia Polosukhin. “Attention is all you need." In Advances in neural information processing systems, pp.
5998-6008. 2017.
[8] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." In Proceedings of NAACL-HLT, pp. 4171-4186. 2019.
[9] Yang, Zhilin, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, and Quoc V. Le. “Xlnet:
Generalized autoregressive pretraining for language understanding." In Advances in neural information processing
systems, pp. 5753-5763. 2019.
[10] Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin
Stoyanov, and Luke Zettlemoyer. "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension." In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pp. 7871-7880. 2020.
[11] Radford, Alec, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. "Improving language understanding by generative pre-training." URL https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf (2018).
[12] Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. “Language models are
unsupervised multitask learners." OpenAI Blog 1, no. 8 (2019): 9.
[13] Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
Neelakantan et al. "Language models are few-shot learners." In Proceedings of the 34th International Conference
on Neural Information Processing Systems, pp. 1877-1901. 2020.
[14] Chowdhary, K. R. "Natural language processing." In Fundamentals of Artificial Intelligence, pp. 603-649. Springer,
New Delhi, 2020.
[15] Gers, Felix A., and Jürgen Schmidhuber. "Recurrent nets that time and count." In Proceedings of the IEEE-INNS-
ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and
Perspectives for the New Millennium, vol. 3, pp. 189-194. IEEE, 2000.
[16] Strubell, Emma, Ananya Ganesh, and Andrew McCallum. "Energy and Policy Considerations for Deep Learning
in NLP." In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp.
3645-3650. 2019.
[17] Patterson, David, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David
So, Maud Texier, and Jeff Dean. "Carbon emissions and large neural network training." arXiv preprint
arXiv:2104.10350 (2021).
[18] Schwartz, Roy, Jesse Dodge, Noah A. Smith, and Oren Etzioni. "Green ai." Communications of the ACM 63, no.
12 (2020): 54-63.
[19] Treviso, Marcos, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid et al.
"Efficient methods for natural language processing: A survey." Transactions of the Association for Computational
Linguistics 11 (2023): 826-860.
[20] Shu, Kai, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. “Fake news detection on social media: A data
mining perspective." ACM SIGKDD Explorations Newsletter 19, no. 1 (2017): 22-36.
[21] Qazvinian, Vahed, Emily Rosengren, Dragomir Radev, and Qiaozhu Mei. "Rumor has it: Identifying misinforma-
tion in microblogs." In Proceedings of the 2011 conference on empirical methods in natural language processing,
pp. 1589-1599. 2011.
[22] Lewis, Mike, and Angela Fan. "Generative question answering: Learning to answer the whole question." In
International Conference on Learning Representations. 2018.
[23] Kupiec, Julian, Jan Pedersen, and Francine Chen. "A trainable document summarizer." In Proceedings of the 18th
annual international ACM SIGIR conference on Research and development in information retrieval, pp. 68-73.
1995.
[24] Nallapati, Ramesh, Bing Xiang, and Bowen Zhou. "Sequence-to-sequence rnns for text summarization." (2016).
[25] Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. “Thumbs up?: sentiment classification using machine
learning techniques." In Proceedings of the ACL-02 conference on Empirical methods in natural language
processing-Volume 10, pp. 79-86. Association for Computational Linguistics, 2002.
[26] Wilson, Theresa, Janyce Wiebe, and Paul Hoffmann. “Recognizing contextual polarity in phrase-level sentiment
analysis." In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in
Natural Language Processing. 2005.
[27] Ansar, Wazib, and Saptarsi Goswami. "Combating the menace: A survey on characterization and detection of fake
news from a data science perspective." International Journal of Information Management Data Insights 1, no. 2
(2021): 100052.
[28] Bird, Steven, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the
natural language toolkit. O’Reilly Media, Inc., 2009.
[29] Xu, Peng, Davis Liang, Zhiheng Huang, and Bing Xiang. "Attention-guided generative models for extractive
question answering." arXiv preprint arXiv:2110.06393 (2021).
[30] Nallapati, Ramesh, Feifei Zhai, and Bowen Zhou. "Summarunner: A recurrent neural network based sequence
model for extractive summarization of documents." In Thirty-first AAAI conference on artificial intelligence.
2017.
[31] See, Abigail, Peter J. Liu, and Christopher D. Manning. "Get To The Point: Summarization with Pointer-Generator
Networks." In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pp. 1073-1083. 2017.
[32] Nakagawa, Tetsuji, Kentaro Inui, and Sadao Kurohashi. "Dependency tree-based sentiment classification using
CRFs with hidden variables." In Human Language Technologies: The 2010 Annual Conference of the North
American Chapter of the Association for Computational Linguistics, pp. 786-794. Association for Computational
Linguistics, 2010.
[33] Harris, Zellig S. "Distributional structure." Word 10, no. 2-3 (1954): 146-162.
[34] Deerwester, Scott, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. “Indexing
by latent semantic analysis." Journal of the American society for information science 41, no. 6 (1990): 391-407.
[35] Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. "Distributed representations of words
and phrases and their compositionality." In Advances in neural information processing systems, pp. 3111-3119.
2013.
[36] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “Glove: Global vectors for word representation."
In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.
1532-1543. 2014.
[37] Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. "A neural probabilistic language model." Journal of Machine Learning Research 3 (2003): 1137-1155.
[38] Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." In International
conference on machine learning, pp. 1188-1196. PMLR, 2014.
[39] Irsoy, Ozan, and Claire Cardie. "Opinion mining with deep recurrent neural networks." In Proceedings of the 2014
conference on empirical methods in natural language processing (EMNLP), pp. 720-728. 2014.
[40] Liu, Qian, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. “Automated rule selection for aspect extraction in opinion
mining." In Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
[41] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." stat 1050 (2015):
9.
[42] Joulin, Armand, Édouard Grave, Piotr Bojanowski, and Tomáš Mikolov. "Bag of Tricks for Efficient Text Classifi-
cation." In Proceedings of the 15th Conference of the European Chapter of the Association for Computational
Linguistics: Volume 2, Short Papers, pp. 427-431. 2017.
[43] Poria, Soujanya, Erik Cambria, and Alexander Gelbukh. “Aspect extraction for opinion mining with a deep
convolutional neural network." Knowledge-Based Systems 108 (2016): 42-49.
[44] Howard, Jeremy, and Sebastian Ruder. "Universal Language Model Fine-tuning for Text Classification." In
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pp. 328-339. 2018.
[45] Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke
Zettlemoyer. "Deep contextualized word representations." In Proceedings of NAACL-HLT, pp. 2227-2237. 2018.
[46] Dai, Zihang, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Le, and Ruslan Salakhutdinov. "Transformer-
XL: Attentive Language Models beyond a Fixed-Length Context." In Proceedings of the 57th Annual Meeting of
the Association for Computational Linguistics, pp. 2978-2988. 2019.
[47] Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. "ALBERT:
A Lite BERT for Self-supervised Learning of Language Representations." In International Conference on Learning
Representations. 2019.
[48] Kitaev, Nikita, Lukasz Kaiser, and Anselm Levskaya. "Reformer: The Efficient Transformer." In International
Conference on Learning Representations. 2019.
[49] Wang, Sinong, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. "Linformer: Self-attention with linear
complexity." arXiv preprint arXiv:2006.04768 (2020).
[50] Roy, Aurko, Mohammad Saffar, Ashish Vaswani, and David Grangier. "Efficient content-based sparse attention
with routing transformers." Transactions of the Association for Computational Linguistics 9 (2021): 53-68.
[51] Jaegle, Andrew, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. "Perceiver:
General perception with iterative attention." In International conference on machine learning, pp. 4651-4664.
PMLR, 2021.
[52] Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry
et al. "Learning transferable visual models from natural language supervision." In International conference on
machine learning, pp. 8748-8763. PMLR, 2021.
[53] Zhang, Susan, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan et al.
"Opt: Open pre-trained transformer language models." arXiv preprint arXiv:2205.01068 (2022).
[54] Sajjad, Hassan, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. "On the effect of dropping layers of pre-trained
transformer models." Computer Speech & Language 77 (2023): 101429.
[55] Ray, Paramita, and Amlan Chakrabarti. "A Mixed approach of Deep Learning method and Rule-Based method to
improve Aspect Level Sentiment Analysis." Applied Computing and Informatics (2019).
[56] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "An efficient methodology for
aspect-based sentiment analysis using BERT through refined aspect extraction." Journal of Intelligent & Fuzzy
Systems 40, no. 5 (2021): 9627-9644.
[57] Malik, Vikas, and Amit Kumar. "Sentiment Analysis of Twitter Data Using Naive Bayes Algorithm." International
Journal on Recent and Innovation Trends in Computing and Communication 6, no. 4 (2018): 120-125.
[58] Huq, Mohammad Rezwanul, Ahmad Ali, and Anika Rahman. “Sentiment analysis on Twitter data using KNN and
SVM." International Journal of Advanced Computer Science and Applications (IJACSA) 8, no. 6 (2017): 19-25.
[59] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is
difficult." IEEE transactions on neural networks 5, no. 2 (1994): 157-166.
[60] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997):
1735-1780.
[61] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas et al. "An empirical analysis of compute-optimal large language model training." Advances in Neural
Information Processing Systems 35 (2022): 30016-30030.
[62] Lee, Katherine, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and
Nicholas Carlini. "Deduplicating Training Data Makes Language Models Better." In Proceedings of the 60th
Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8424-8445. 2022.
[63] Mishra, Swaroop, and Bhavdeep Singh Sachdeva. "Do we need to create big datasets to learn a task?." In
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pp. 169-173. 2020.
[64] Le Bras, Ronan, Swabha Swayamdipta, Chandra Bhagavatula, Rowan Zellers, Matthew Peters, Ashish Sabharwal,
and Yejin Choi. "Adversarial filters of dataset biases." In International conference on machine learning, pp.
1078-1088. PMLR, 2020.
[65] Ren, Pengzhen, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin
Wang. "A survey of deep active learning." ACM computing surveys (CSUR) 54, no. 9 (2021): 1-40.
[66] Yuan, Michelle, Hsuan-Tien Lin, and Jordan Boyd-Graber. "Cold-start Active Learning through Self-supervised
Language Modeling." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP), pp. 7935-7948. 2020.
[67] Sener, Ozan, and Silvio Savarese. "Active Learning for Convolutional Neural Networks: A Core-Set Approach."
In International Conference on Learning Representations. 2018.
[68] Margatina, Katerina, Giorgos Vernikos, Loïc Barrault, and Nikolaos Aletras. "Active Learning by Acquiring
Contrastive Examples." In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pp. 650-663. 2021.
[69] Settles, Burr, Mark Craven, and Lewis Friedland. "Active learning with real annotation costs." In Proceedings of
the NIPS workshop on cost-sensitive learning, vol. 1. 2008.
[70] Lowell, David, Zachary C. Lipton, and Byron C. Wallace. "Practical Obstacles to Deploying Active Learning."
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 21-30. 2019.
[71] Karamcheti, Siddharth, Ranjay Krishna, Li Fei-Fei, and Christopher D. Manning. "Mind Your Outliers! Inves-
tigating the Negative Impact of Outliers on Active Learning for Visual Question Answering." In Proceedings
of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers), pp. 7265-7281. 2021.
[72] Press, Ofir, Noah A. Smith, and Mike Lewis. "Shortformer: Better Language Modeling using Shorter Inputs."
In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th
International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 5493-5505. 2021.
[73] Kumar, M., Benjamin Packer, and Daphne Koller. "Self-paced learning for latent variable models." Advances in
neural information processing systems 23 (2010).
[74] Melamud, Oren, Jacob Goldberger, and Ido Dagan. "context2vec: Learning generic context embedding with
bidirectional lstm." In Proceedings of the 20th SIGNLL conference on computational natural language learning,
pp. 51-61. 2016.
[75] Conneau, A., D. Kiela, H. Schwenk, L. Barrault, and A. Bordes. "Supervised learning of universal sentence
representations from natural language inference data." In Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing, pp. 670-680. Association for Computational Linguistics, 2017.
[76] Cer, Daniel, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant et al.
"Universal sentence encoder." arXiv preprint arXiv:1803.11175 (2018).
[77] Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks."
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational
Linguistics, 2019.
[78] Feng, Fangxiaoyu, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. "Language-agnostic BERT Sen-
tence Embedding." In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), pp. 878-891. 2022.
[79] Del Giudice, Marco. "Effective dimensionality: A tutorial." Multivariate behavioral research 56, no. 3 (2021):
527-542.
[80] Patel, Kevin, and Pushpak Bhattacharyya. "Towards lower bounds on number of dimensions for word embeddings."
In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short
Papers), pp. 31-36. 2017.
[81] Raunak, Vikas, Vivek Gupta, and Florian Metze. "Effective dimensionality reduction for word embeddings." In
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pp. 235-243. 2019.
[82] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "TexIm: A Novel Text-to-Image
Encoding Technique Using BERT." In Computer Vision and Machine Intelligence: Proceedings of CVMI 2022,
pp. 123-139. Singapore: Springer Nature Singapore, 2023.
[83] Qiu, Jiezhong, Hao Ma, Omer Levy, Wen-tau Yih, Sinong Wang, and Jie Tang. "Blockwise Self-Attention for
Long Document Understanding." In Findings of the Association for Computational Linguistics: EMNLP 2020,
pp. 2555-2565. 2020.
[84] Beltagy, Iz, Matthew E. Peters, and Arman Cohan. "Longformer: The long-document transformer." arXiv preprint
arXiv:2004.05150 (2020).
[85] Liu, Peter J., Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer.
"Generating Wikipedia by Summarizing Long Sequences." In International Conference on Learning Representa-
tions. 2018.
[86] Ansar, Wazib, Saptarsi Goswami, Amlan Chakrabarti, and Basabi Chakraborty. "A novel selective learning based
transformer encoder architecture with enhanced word representation." Applied Intelligence 53, no. 8 (2023):
9424-9443.
[87] Child, Rewon, Scott Gray, Alec Radford, and Ilya Sutskever. "Generating long sequences with sparse transformers."
arXiv preprint arXiv:1904.10509 (2019).
[88] Du, Nan, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun et al.
"Glam: Efficient scaling of language models with mixture-of-experts." In International Conference on Machine
Learning, pp. 5547-5569. PMLR, 2022.
[89] Choromanski, Krzysztof, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter
Hawkins et al. "Masked language modeling for proteins via linearly scalable long-context transformers." arXiv
preprint arXiv:2006.03555 (2020).
[90] LeCun, Yann, John Denker, and Sara Solla. "Optimal brain damage." Advances in neural information processing
systems 2 (1989).
[91] Louizos, Christos, Max Welling, and Diederik P. Kingma. "Learning Sparse Neural Networks through L0
Regularization." In International Conference on Learning Representations. 2018.
[92] Sanh, Victor, Thomas Wolf, and Alexander Rush. "Movement pruning: Adaptive sparsity by fine-tuning."
Advances in Neural Information Processing Systems 33 (2020): 20378-20389.
[93] Fan, Angela, Edouard Grave, and Armand Joulin. "Reducing Transformer Depth on Demand with Structured
Dropout." In International Conference on Learning Representations. 2019.
[94] Stanton, Samuel, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi, and Andrew G. Wilson. "Does
knowledge distillation really work?." Advances in Neural Information Processing Systems 34 (2021): 6906-6919.
[95] Boutros, Andrew, Eriko Nurvitadhi, Rui Ma, Sergey Gribok, Zhipeng Zhao, James C. Hoe, Vaughn Betz, and
Martin Langhammer. "Beyond peak performance: Comparing the real performance of AI-optimized FPGAs and
GPUs." In 2020 International Conference on Field-Programmable Technology (ICFPT), pp. 10-19. IEEE, 2020.
[96] Gaide, Brian, Dinesh Gaitonde, Chirag Ravishankar, and Trevor Bauer. "Xilinx adaptive compute acceleration
platform: VersalTM architecture." In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, pp. 84-93. 2019.
[97] Wang, Hanrui, Zhekai Zhang, and Song Han. "Spatten: Efficient sparse attention architecture with cascade token
and head pruning." In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA),
pp. 97-110. IEEE, 2021.
[98] Ham, Tae Jun, Yejin Lee, Seong Hoon Seo, Soosung Kim, Hyunji Choi, Sung Jun Jung, and Jae W. Lee. "ELSA:
Hardware-software co-design for efficient, lightweight self-attention mechanism in neural networks." In 2021
ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 692-705. IEEE, 2021.
[99] Lu, Siyuan, Meiqi Wang, Shuang Liang, Jun Lin, and Zhongfeng Wang. "Hardware accelerator for multi-head
attention and position-wise feed-forward in the transformer." In 2020 IEEE 33rd International System-on-Chip
Conference (SOCC), pp. 84-89. IEEE, 2020.
[100] Liu, Zejian, Gang Li, and Jian Cheng. "Hardware acceleration of fully quantized bert for efficient natural
language processing." In 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp.
513-516. IEEE, 2021.
[101] Hooker, Sara. "The hardware lottery." Communications of the ACM 64, no. 12 (2021): 58-65.
[102] He, Jiaao, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. "FasterMoE:
modeling and optimizing training of large-scale dynamic pre-trained models." In Proceedings of the 27th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 120-134. 2022.
[103] Rajbhandari, Samyam, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad
Awan, Jeff Rasley, and Yuxiong He. "Deepspeed-moe: Advancing mixture-of-experts inference and training to
power next-generation ai scale." In International Conference on Machine Learning, pp. 18332-18346. PMLR,
2022.
[104] Qu, Zheng, Liu Liu, Fengbin Tu, Zhaodong Chen, Yufei Ding, and Yuan Xie. "Dota: detect and omit weak
attentions for scalable transformer acceleration." In Proceedings of the 27th ACM International Conference on
Architectural Support for Programming Languages and Operating Systems, pp. 14-26. 2022.
[105] Lannelongue, Loïc, Jason Grealey, and Michael Inouye. "Green algorithms: quantifying the carbon footprint of
computation." Advanced science 8, no. 12 (2021): 2100707.
[106] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. "Quantifying the carbon emis-
sions of machine learning." In Climate Change workshop, NeurIPS 2019. 2019.
[107] Lasse F. Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. "Carbontracker: Tracking and predicting
the carbon footprint of training deep learning models". In ICML Workshop on "Challenges in Deploying and
monitoring Machine Learning Systems". 2020.
[108] Dürlich, Luise, Evangelia Gogoulou, and Joakim Nivre. "On the Concept of Resource-Efficiency in NLP." In
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pp. 135-145. 2023.
[109] Gupta, Udit, Young Geun Kim, Sylvia Lee, Jordan Tse, Hsien-Hsin S. Lee, Gu-Yeon Wei, David Brooks, and
Carole-Jean Wu. "Chasing carbon: The elusive environmental footprint of computing." In 2021 IEEE International
Symposium on High-Performance Computer Architecture (HPCA), pp. 854-867. IEEE, 2021.
[110] Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances
in neural information processing systems 27 (2014).
[111] Lin, Zhouhan, Minwei Feng, Cicero dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio.
"A structured self-attentive sentence embedding." In International Conference on Learning Representations.
International Conference on Learning Representations, ICLR, 2017.
[112] Kim, Yoon, Carl Denton, Luong Hoang, and Alexander M. Rush. "Structured Attention Networks." In Interna-
tional Conference on Learning Representations. 2016.
[113] Galassi, Andrea, Marco Lippi, and Paolo Torroni. "Attention in natural language processing." IEEE transactions
on neural networks and learning systems 32, no. 10 (2020): 4291-4308.
[114] Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401
(2014).
[115] Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective Approaches to Attention-based
Neural Machine Translation." In Proceedings of the 2015 Conference on Empirical Methods in Natural Language
Processing, pp. 1412-1421. 2015.
[116] Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego
de Las Casas et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022).
[117] Collobert, Ronan, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. "Natural
language processing (almost) from scratch." Journal of Machine Learning Research 12 (2011):
2493-2537.
[118] Dai, Andrew M., and Quoc V. Le. "Semi-supervised sequence learning." Advances in neural information
processing systems 28 (2015).
[119] Qiu, Xipeng, Tianxiang Sun, Yige Xu, Yunfan Shao, Ning Dai, and Xuanjing Huang. "Pre-trained models for
natural language processing: A survey." Science China Technological Sciences 63, no. 10 (2020): 1872-1897.
[120] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J. Liu. "Exploring the limits of transfer learning with a unified text-to-text transformer." The Journal
of Machine Learning Research 21, no. 1 (2020): 5485-5551.
[121] Liu, Yinhan, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke
Zettlemoyer, and Veselin Stoyanov. "Roberta: A robustly optimized bert pretraining approach." arXiv preprint
arXiv:1907.11692 (2019).
[122] Rogers, Anna, Olga Kovaleva, and Anna Rumshisky. "A primer in BERTology: What we know about how BERT
works." Transactions of the Association for Computational Linguistics 8 (2021): 842-866.
[123] Houlsby, Neil, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Ges-
mundo, Mona Attariyan, and Sylvain Gelly. "Parameter-efficient transfer learning for NLP." In International
Conference on Machine Learning, pp. 2790-2799. PMLR, 2019.
[124] Karimi Mahabadi, Rabeeh, James Henderson, and Sebastian Ruder. "Compacter: Efficient low-rank hypercom-
plex adapter layers." Advances in Neural Information Processing Systems 34 (2021): 1022-1035.
[125] Aghajanyan, Armen, Sonal Gupta, and Luke Zettlemoyer. "Intrinsic Dimensionality Explains the Effectiveness of
Language Model Fine-Tuning." In Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers),
pp. 7319-7328. 2021.
[126] Moosavi, Nafise Sadat, Quentin Delfosse, Kristian Kersting, and Iryna Gurevych. "Adaptable Adapters." In
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, pp. 3742-3753. 2022.
[127] Wang, Yaqing, and Sahaj Agarwal. "AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning." In
The 2022 Conference on Empirical Methods in Natural Language Processing. 2022.
[128] Schick, Timo, and Hinrich Schütze. "Generating Datasets with Pretrained Language Models." In Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6943-6951. 2021.
[129] Reynolds, Laria, and Kyle McDonell. "Prompt programming for large language models: Beyond the few-shot
paradigm." In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-7.
2021.
[130] Wei, Jason, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai,
and Quoc V. Le. "Finetuned Language Models are Zero-Shot Learners." In International Conference on Learning
Representations. 2021.
[131] Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. "Pre-train, prompt,
and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys
55, no. 9 (2023): 1-35.
[132] Schick, Timo, and Hinrich Schütze. "Few-shot text generation with natural language instructions." In Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 390-402. 2021.
[133] Schick, Timo, and Hinrich Schütze. "Exploiting Cloze-Questions for Few-Shot Text Classification and Natural
Language Inference." In Proceedings of the 16th Conference of the European Chapter of the Association for
Computational Linguistics: Main Volume, pp. 255-269. 2021.
[134] Trinh, Trieu H., and Quoc V. Le. "A simple method for commonsense reasoning." arXiv preprint
arXiv:1806.02847 (2018).
[135] Yin, Wenpeng, Jamaal Hay, and Dan Roth. "Benchmarking Zero-shot Text Classification: Datasets, Evaluation
and Entailment Approach." In Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.
3914-3923. 2019.
[136] Wu, Wei, Fei Wang, Arianna Yuan, Fei Wu, and Jiwei Li. "CorefQA: Coreference resolution as query-based
span prediction." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp.
6953-6963. 2020.
[137] Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971
(2023).
[138] Thoppilan, Romal, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia
Jin et al. "Lamda: Language models for dialog applications." arXiv preprint arXiv:2201.08239 (2022).
[139] Luccioni, Alexandra Sasha, Sylvain Viguier, and Anne-Laure Ligozat. "Estimating the carbon footprint of bloom,
a 176b parameter language model." Journal of Machine Learning Research 24, no. 253 (2023): 1-15.
[140] Faiz, Ahmad, Sotaro Kaneda, Ruhan Wang, Rita Osi, Parteek Sharma, Fan Chen, and Lei Jiang. "LLMCarbon:
Modeling the end-to-end Carbon Footprint of Large Language Models." arXiv preprint arXiv:2309.14393 (2023).
[141] Xu, Canwen, and Julian McAuley. "A survey on model compression and acceleration for pretrained language
models." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 9, pp. 10566-10575. 2023.
[142] Smith, Shaden, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun
Liu et al. "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language
model." arXiv preprint arXiv:2201.11990 (2022).
[143] Chowdhery, Aakanksha, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul
Barham et al. "Palm: Scaling language modeling with pathways." Journal of Machine Learning Research 24, no.
240 (2023): 1-113.
[144] Patra, Barun, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, and Xia
Song. "Beyond english-centric bitexts for better multilingual language representation learning." arXiv preprint
arXiv:2210.14867 (2022).
[145] Conneau, Alexis, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco
Guzmán, Édouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. "Unsupervised Cross-lingual Repre-
sentation Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pp. 8440-8451. 2020.
[146] Zhang, Jingqing, Yao Zhao, Mohammad Saleh, and Peter Liu. "Pegasus: Pre-training with extracted gap-
sentences for abstractive summarization." In International Conference on Machine Learning, pp. 11328-11339.
PMLR, 2020.
[147] Aromataris, Edoardo, and Alan Pearson. "The systematic review: an overview." AJN The American Journal of
Nursing 114, no. 3 (2014): 53-58.
[148] Moher, David, Alessandro Liberati, Jennifer Tetzlaff, Douglas G. Altman, and Prisma Group. "Preferred reporting
items for systematic reviews and meta-analyses: the PRISMA statement." International journal of surgery 8, no. 5
(2010): 336-341.
[149] Grant, Maria J., and Andrew Booth. "A typology of reviews: an analysis of 14 review types and associated
methodologies." Health information & libraries journal 26, no. 2 (2009): 91-108.
[150] Khadivi, Nasim, and Sho Sato. "A Bibliometric Study of Natural Language Processing Using Dimensions
Database: Development, Research Trend, and Future Research Directions." Journal of Data Science, Informetrics,
and Citation Studies 2, no. 2 (2023): 77-89.
[151] Bannour, Nesrine, Sahar Ghannay, Aurélie Névéol, and Anne-Laure Ligozat. "Evaluating the carbon footprint of
NLP methods: a survey and analysis of existing tools." In Proceedings of the Second Workshop on Simple and
Efficient Natural Language Processing, pp. 11-21. 2021.
[152] Petersen, Kai, Sairam Vakkalanka, and Ludwik Kuzniarz. "Guidelines for conducting systematic mapping studies
in software engineering: An update." Information and software technology 64 (2015): 1-18.
[153] Koubaa, Anis, Wadii Boulila, Lahouari Ghouti, Ayyub Alzahem, and Shahid Latif. "Exploring ChatGPT
Capabilities and Limitations: A Survey." IEEE Access (2023).
[154] Denney, Andrew S., and Richard Tewksbury. "How to write a literature review." Journal of criminal justice
education 24, no. 2 (2013): 218-234.
[155] Tay, Yi, Mostafa Dehghani, Dara Bahri, and Donald Metzler. "Efficient Transformers: A Survey." ACM
Computing Surveys 55, no. 6 (2023): 1-28. https://doi.org/10.1145/3530811
[156] Ainslie, Joshua, Santiago Ontanon, Chris Alberti, Vaclav Cvicek, Zachary Fisher, Philip Pham, Anirudh Ravula,
Sumit Sanghai, Qifan Wang, and Li Yang. "ETC: Encoding Long and Structured Inputs in Transformers." In
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
268-284. 2020.
[157] Zaheer, Manzil, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon,
Philip Pham et al. "Big bird: Transformers for longer sequences." Advances in neural information processing
systems 33 (2020): 17283-17297.