
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2024.3435537

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2017.DOI

Fine-Tuning of Distil-BERT for Continual Learning in Text Classification: An Experimental Analysis

SAHAR SHAH¹, SARA LUCIA MANZONI¹, FAROOQ ZAMAN², FATIMA ES SABERY³, FRANCESCO EPIFANIA⁴, and ITALO FRANCESCO ZOPPIS¹

¹ Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy (e-mail: [email protected], [email protected], [email protected])
² Department of Computer Science, University of Information Technology, Lahore, Pakistan (e-mail: [email protected])
³ Department of Economics and Management Sciences, Faculty of Law, Economics and Social Sciences, Hassan II University of Casablanca, Mohammedia, Morocco (e-mail: [email protected])
⁴ Social Things, Milan, Italy (e-mail: [email protected])

Corresponding author: Sahar Shah (e-mail: [email protected])

This research is supported by the Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy.

ABSTRACT Continual learning (CL) with the bidirectional encoder representations from transformers (BERT) model and its variant, Distil-BERT, has shown remarkable performance in various natural language processing (NLP) tasks, such as text classification (TC). However, model-degrading factors such as catastrophic forgetting (CF), limited accuracy, and task-dependent architectures have undermined its suitability for complex and intelligent tasks. This research article proposes an innovative approach to address the challenges of CL in TC tasks. The objectives are to enable the model to learn continuously without forgetting previously acquired knowledge, thereby avoiding CF. To achieve this, a task-independent model architecture is introduced, allowing multiple tasks to be trained on the same model and improving overall performance in CL scenarios. The framework incorporates two auxiliary tasks, namely next sentence prediction and task identifier prediction, to capture both task-generic and task-specific contextual information. The Distil-BERT model, enhanced with two linear layers, categorizes the output representation into a task-generic space and a task-specific space. The proposed methodology is evaluated on diverse sets of TC tasks, including Yahoo, Yelp, Amazon, DB-Pedia, and AG-News. The experimental results demonstrate impressive performance across multiple tasks in terms of F1 score, model accuracy, model evaluation loss, learning rate, and training loss. For the Yahoo task, the proposed model achieved an F1 score of 96.84%, an accuracy of 95.85%, an evaluation loss of 0.06, and a learning rate of 0.00003144. In the Yelp task, our model achieved an F1 score of 96.66%, an accuracy of 97.66%, and an evaluation loss of 0.06, and similarly minimized training losses with a learning rate of 0.00003189. For the Amazon task, the F1 score was 95.82%, the observed accuracy was 97.83%, the evaluation loss was 0.06, and training losses were effectively minimized with a learning rate of 0.00003144. In the DB-Pedia task, we achieved an F1 score of 96.20%, an accuracy of 95.21%, and an evaluation loss of 0.08, with a learning rate of 0.0001972, and rapidly minimized training losses due to the limited number of epochs and instances. In the AG-News task, our model obtained an F1 score of 94.78%, an accuracy of 92.76%, and an evaluation loss of 0.06, with the learning rate fixed at 0.0001511. These results highlight the exceptional performance of our model in various TC tasks, with a gradual reduction in training losses over time, indicating effective learning and retention of knowledge.

INDEX TERMS Continual Learning, Natural Language Processing, Text Classification, Fine-Tuning, Distil-BERT

I. INTRODUCTION

Sentiment Analysis (SA), the automated process of detecting sentiment or emotion in text data, has gained significant attention and finds many applications in various domains such as social media analysis, customer feedback analysis, and market research [1]. One of the fundamental tasks in SA is to categorize text documents into predefined sentiment categories such as positive, negative, or neutral [2]. However, ensuring the accuracy and adaptability of SA models over time is challenging due to the dynamic nature of textual data [3].

TC is a critical area within NLP that holds immense significance across various domains, including SA, document categorization, and information retrieval. In TC, SA techniques are employed to determine the sentiment or emotion expressed in textual data [4] [5]. Document categorization involves organizing text documents into predefined categories based on their content, facilitating efficient information organization and retrieval. These applications demonstrate the broad impact and relevance of TC methods in enabling effective analysis and management of textual data across different fields [6]. With the increasing size and complexity of textual data, the need for scalable and adaptable TC models becomes crucial. However, several challenges hinder the development of effective CL approaches in this domain [7]. One significant challenge is the scalability of the proposed methods. TC tasks typically involve large datasets, requiring efficient handling of incremental updates to accommodate the continuous influx of new data. Managing the scalability of models is essential to ensure their effectiveness and efficiency in handling the ever-growing volume of textual data [8] [9].

CL, which enables models to learn from new data while retaining previous knowledge, is of utmost importance in TC and SA tasks. In these domains, sentiment and text patterns constantly evolve, necessitating models to consistently adapt and refine their understanding of sentiments and text. This dynamic nature of SA underscores the need for CL approaches that can effectively capture evolving sentiment trends and maintain model performance over time [10]. Indeed, the scalability of TC models is a crucial consideration for real-world applications dealing with large datasets, especially when one wants to operate in a CL setting. In practical applications, where massive amounts of data need to be processed accurately and efficiently, the scalability of the model plays a significant role in ensuring its effectiveness and practicality. Future research should aim to address these scalability challenges to develop TC models that can handle large datasets and numerous classes while maintaining high performance and efficiency [11]. CF is a significant challenge in CL, as it can lead to the loss of previously acquired knowledge when learning new information. The objective of CL is to strike a balance between acquiring new knowledge and retaining the knowledge of earlier tasks. Mitigating CF is crucial to ensure that the model maintains performance on earlier tasks while adapting to new ones [12]. Developing effective strategies, such as regularization techniques, rehearsal methods, or knowledge distillation, can help alleviate this issue and enable CL models to preserve and transfer knowledge across tasks effectively. Addressing CF is a vital aspect of CL for achieving robust and adaptive TC models. Explicitly addressing how a TC model handles CF is crucial to ensuring its ability to retain and apply knowledge from past tasks in a CL scenario [13].

Distil-BERT is a compressed version of the BERT model, developed by researchers at Hugging Face to offer similar performance with a smaller size and faster speed. The left-side component of its architecture diagram serves as an educational tool to explain the complex internal workings of the transformer layer in the BERT model, which is foundational for understanding how the entire model processes and transforms the input data to generate meaningful representations. This component includes several key elements. First, the multi-head attention mechanism starts with the input being projected into three different vectors: Query (Q), Key (K), and Value (V). This allows the model to focus on different parts of the input sequence simultaneously, capturing various relationships and dependencies within the data. The attention scores for each head are computed and then concatenated to form the final result. After the attention mechanism, the result is added back to the original input through a residual connection, which helps preserve the original information and stabilize the training process. Layer normalization is then applied to ensure the outputs have a stable mean and variance, which speeds up training and improves overall model performance. Next, the feed-forward neural network (FFNN) consists of two fully connected layers with a ReLU activation in between them. This part of the model applies non-linear transformations to the input data, enabling the capture of more complex patterns. The output of the FFNN is also added back to the input of this sub-layer through another residual connection and then normalized. The purpose of this component is to illustrate the structure of a single transformer layer within the BERT model, showing the detailed architecture and highlighting the key processes that occur within each layer. It provides a clear understanding of how the multi-head attention mechanism works together with the FFNN and residual connections. Additionally, it emphasizes the importance of layer normalization in stabilizing and improving the training process. Distil-BERT employs a transformer encoder architecture, featuring self-attention mechanisms and FFNNs. It uses Word-Piece embeddings to handle out-of-vocabulary words by breaking them into subwords. The model undergoes pretraining on unlabeled text and fine-tuning on labeled data for specific NLP tasks like TC [14]. It captures contextual information from both the left and right contexts of words and can be applied to tasks such as TC and many more. Moreover, the computational efficiency of the Distil-BERT model is a vital consideration when deploying it for TC in real-world applications.
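To make the layer structure described above concrete, the following is a minimal PyTorch sketch of one such transformer encoder layer: multi-head self-attention with a residual connection and layer normalization, followed by a two-layer ReLU FFNN with its own residual connection and normalization. It is an illustration, not Distil-BERT's actual implementation; the hidden size (768), head count (12), and FFNN width (3072) are the commonly reported Distil-BERT values.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One transformer encoder layer as described above (a sketch, not Distil-BERT's exact code)."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        # Multi-head self-attention: the input is projected to Q, K, and V internally.
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Two fully connected layers with a ReLU activation in between (as the text describes).
        self.ffnn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Attention sub-layer: residual connection, then layer normalization.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + self.dropout(attn_out))
        # FFNN sub-layer: residual connection, then layer normalization.
        x = self.norm2(x + self.dropout(self.ffnn(x)))
        return x

# Distil-BERT stacks six such layers over the token and position embeddings.
layer = EncoderLayer()
tokens = torch.randn(2, 16, 768)  # (batch, sequence length, hidden size)
print(layer(tokens).shape)        # torch.Size([2, 16, 768])
```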

The programmer should consider the model parameters that impact the practicality and scalability of the model, such as training time, memory requirements, and inference speed [15]. These considerations are crucial for practical implementations of TC, where efficient training and inference processes are essential for real-time or resource-constrained environments. In the context of Distil-BERT, TC refers to training the model to classify text documents into different classes/categories through the fine-tuning process.

The first main goal of this paper is to integrate the memory-based CL technique with the Distil-BERT model for TC. This integration performs continual learning in such a way that the model does not forget previously learned tasks and avoids CF. The second main goal of the paper is the design of a task-independent model through which several different tasks can be trained on the same model. Achieving these goals involves several modifications and steps that yield impressive performance in diverse NLP tasks, specifically in TC. Our research centers on the application of pretrained Distil-BERT to the realm of CL in TC. In our proposed methodology, we extend the Distil-BERT model by adding two linear layers on its head. We categorize the output representation of Distil-BERT into two spaces: a task-generic space and a task-specific space. This is accomplished through the utilization of two auxiliary tasks: next sentence prediction, which facilitates the acquisition of task-generic information, and task identifier prediction, which aids in acquiring task-specific representations. By following this approach, Distil-BERT captures contextual information from both preceding and subsequent words. We train the model on five different tasks, i.e., Yahoo, Yelp, Amazon, DB-Pedia, and AG-News. The Distil-BERT model is trained on each task individually, and the best model is systematically saved in memory, ensuring the preservation of learned information and avoiding CF. For each task, the two added layers are individually initialized from scratch with random weights; this is what we call the task-specific space, while the task-generic space remains the same for each task. These two linear layers take 768-dimensional vectors from each input task and then classify the provided input into the respective number of classes.

The rest of the paper is organized as follows. Next to the Introduction is section II, namely the Literature review. We divide the literature review into different TC topics, covering different transformer-based models combined with CL techniques. In this section, we capture the overall theme of the state of the art (SOTA) in a summary table. In section III, we explain the proposed methodology, showing how we designed this novel approach and what methodologies and strategies we adopted to achieve our goals. In section IV, we discuss dataset statistics, i.e., the epochs, classes, batch sizes, and other specific details used. Section V is the experimental discussion, in which the results achieved by the proposed model are discussed in detail. The second-to-last section of the paper, VI, results comparison with SOTA, presents the experimental comparison, based on the accuracy parameter, between the presented approach and approaches already existing in the SOTA. The last section, VII, namely conclusion and future directions, concludes the overall proposed work and suggests some future directions for further advancements.

Contributions:
In this paper, we make several significant and notable contributions in the field of TC combined with a CL approach using the Distil-BERT model, by tackling important and key challenges. By presenting this novel approach, these contributions enhance the existing body of knowledge and address critical challenges in the field of TC and CL using the Distil-BERT model. The key contributions of our proposed work can be summarized as follows.
∙ We integrated the memory-based CL technique with the Distil-BERT model for TC. This integration ensures that the model retains previously learned tasks and prevents CF: by storing the best model for each task in memory, our approach preserves learned information.
∙ We utilized a pre-trained Distil-BERT model to extract meaningful representations (features) from text data. Distil-BERT's transformer layers capture contextual information from input text, enabling our model to classify text documents into different categories based on their contents.
∙ We proposed a novel task-independent architecture that effectively handles any sequence of input tasks, ensuring better overall performance in CL scenarios. Our model can adapt to new tasks while retaining previously acquired knowledge.
∙ We optimized the framework by significantly reducing the computation running time and training losses and by tuning the learning rate of the proposed model. Through careful analysis and an efficient algorithm, we achieved substantial improvements in the speed of our approach and its learning rate; these factors make the model more practical and scalable for real-world applications.
∙ We draw a comparison between the proposed design and previously existing models in the SOTA based on a key model performance parameter, i.e., accuracy.
∙ Our proposed model presents significant advancements over previously designed models in terms of accuracy, while the other important and key model performance parameters, i.e., F1 score, evaluation loss, and learning rate, with optimal results, are considered only by the proposed model.

II. LITERATURE REVIEW
Numerous studies in the field of CL for TC and SA using transformer-based models have been conducted. These studies have provided valuable insights into lifelong learning, incremental learning, and transfer learning approaches to TC based on transformer models. Techniques such as regularization, rehearsal, and distillation have been proposed to mitigate CF and retain knowledge from past tasks. The research findings contribute to the advancement of TC techniques that can handle evolving sentiment patterns and dynamic textual data in real-world applications. This section is divided into different topics to enhance readability and clarity and to define the approaches used for each specific task. In addition, this article incorporates a brief table 1 that summarizes the proposed techniques, achievements, applications, limitations, and key parameters for each topic. By incorporating this well-organized table, the readability of the section is enhanced, and researchers can conveniently access the valuable information for their own research with respect to the models and techniques used, and gain a better understanding for further advancements.

A. CLASSIC MODEL FOR TC IN DIL SETTINGS
The authors, in [16], proposed a novel model called continual and contrastive learning of aspect sentiment classification tasks (CLASSIC), for TC in domain incremental learning (DIL). CLASSIC operates in a DIL setting, eliminating the need for task information during testing. It leverages Adapter-BERT to incorporate pre-trained BERT without fine-tuning and addresses CF. The model introduces contrastive CL for knowledge transfer (KT) between tasks and distillation from old to new tasks, enhancing classification accuracy. CLASSIC consists of three sub-systems: contrastive ensemble distillation (CED), contrastive knowledge sharing (CKS), and contrastive supervised learning (CSL). The architecture is designed for aspect sentiment classification (ASC) in DIL, utilizing Adapter-BERT. During training, CLASSIC takes hidden states and a task ID, but during testing, no task ID is required. The model's outputs are task-specific features used for constructing a classifier. The framework follows contrastive learning principles and is termed contrastive CL.

B. TC IN IOT
The Internet of Things (IoT) connects smart devices through the internet, generating vast amounts of textual data [17]. TC is a challenge in this context, and language models like BERT and Distil-BERT have shown promise in handling it. This study compares BERT and Distil-BERT for TC in English and Brazilian Portuguese using different datasets. It highlights the importance of dataset balance and its impact on model performance, with unbalanced datasets showing lower accuracy. Additionally, the use of lightweight models like Distil-BERT allows for efficient execution on low computational resources while maintaining performance comparable to larger models. The experimental results indicate that Distil-BERT is 40% smaller, 45% faster, and retains 96% of the language comprehension skills of BERT. The study highlights the effectiveness of Distil-BERT in different languages and emphasizes the importance of dataset quality.

C. CL IN ASC TASKS VIA BERT BASED MODEL
The paper [18] focuses on the topic of CL in the context of ASC tasks. While previous CL techniques have been proposed for document sentiment classification (SC), research specifically targeting CL in ASC is limited. The authors introduce BERT-based CL (B-CL), a new CL system that addresses two key challenges in ASC with CL. These challenges include transferring knowledge from previous tasks to facilitate better model learning, and maintaining performance on previous tasks to prevent forgetting or degradation. B-CL is a capsule network-based model that incorporates forward knowledge transfer (FKT) and backward knowledge transfer (BKT) mechanisms. It effectively utilizes knowledge gained from previous tasks to improve performance on both new and old tasks. The model utilizes a novel building block called the continual learning adapter (CLA), inspired by Adapter-BERT. The CLA employs capsules and dynamic routing to identify similarities between previous and new tasks, facilitating the transfer of shared knowledge. Task masks are employed to protect task-specific knowledge and prevent CF. Extensive experiments are conducted to validate the effectiveness of B-CL, comparing it with various baselines. The results demonstrate the superior performance of B-CL in ASC with CL scenarios. This paper contributes by highlighting the need for CL approaches in ASC and proposing the B-CL model, which incorporates the CLA into a pre-trained BERT model, enabling effective ASC in CL.

D. SUPERVISED CONTRASTIVE LEARNING FRAMEWORK FOR ABSA BASED ON BERT
In [19], the authors introduce a novel approach to aspect-based sentiment analysis (ABSA) by focusing on improving sentiment prediction for unknown testing aspects. They address this challenge by leveraging sentiment features and propose a BERT-based supervised contrastive learning framework.

The main contributions of this research are twofold. Firstly, they approach ABSA from a new perspective, emphasizing the enhancement of SA for unknown testing aspects by leveraging sentiment features. Secondly, they introduce the BERT-SCon framework, which utilizes supervised contrastive learning to distinguish sentiment features based on sentiment polarity and pattern. The proposed BERT-SCon framework achieves SOTA performance on five benchmark datasets, demonstrating the effectiveness of the approach. The architecture of the framework consists of four components: data augmentation, a feature extractor using a pre-trained BERT model, SC with a softmax function, and contrastive learning to bring together representations of the same sentiment polarity/pattern and differentiate representations from different classes.

E. NLP AND ML FOR CLASSIFICATION OF FURNITURE TIP-OVER INCIDENTS
This paper proposes an improved method for classifying furniture tip-over incidents using a combination of NLP techniques and machine learning (ML) algorithms. The proposed model architecture is based on a pretrained RoBERTa model, enhanced with layer normalization, dropout layers, and a linear classifier [20]. The study compares the proposed model with other transformer-based models like BERT, RoBERTa, DeBERTa, ALBERT, DistilBERT, and MPNet. The models were trained on injury narratives from the United States Consumer Product Safety Commission (U.S. CPSC) dataset, addressing challenges such as imbalanced classes and domain-specific jargon. The text data was preprocessed, tokenized, and encoded for input into the models. The computational complexity of the proposed model is estimated based on the number of attention layers and linear operations. The study found that the proposed model achieved improved classification results compared to the default transformer-based models, showcasing its potential for streamlining classification tasks in various datasets. The experimental analysis demonstrated that the use of machine learning techniques can reduce human effort and enhance efficiency in reviewing and classifying incident reports.

F. TRANSFER LEARNING IN TC FOR COVID-19
TC is a widely studied problem in information retrieval and data mining. It has applications in various domains such as healthcare, marketing, entertainment, and content filtering. Researchers have recently focused on developing automated systems for TC using NLP and data mining techniques. NLP enables the categorization of documents with different types of texts, and social media platforms generate vast amounts of data for experimentation. Transfer learning, a technique that leverages knowledge from unlabeled data for tasks with limited labeled data, has gained attention in TC. In this study [21], transfer learning classification models are applied to coronavirus disease (COVID-19) fake news and extremist/non-extremist datasets. The researchers emphasize the potential of transfer learning to improve accuracy with less human supervision compared to active and supervised learning. NLP transformers, particularly the attention-based transformer models, have shown promising accuracy in various applications. The study demonstrates the effectiveness of transformers in predicting real and fake news related to COVID-19. Fake news dissemination through social media platforms is a significant concern, and distinguishing between real and fake news is challenging. Existing approaches have limitations, motivating the development of hybrid methods. The study applies nine transfer learning models to COVID-19 datasets and evaluates their performance using several metrics. Reliable repositories are used for data collection, and the results highlight the effectiveness of transfer learning models in binary TC.

G. MTL MODELS FOR PEER REVIEW COMMENTS
This paper introduces two multitask learning (MTL) models, leveraging BERT and Distil-BERT, for evaluating peer-review comments [22]. The models detect multiple features simultaneously, improving performance and reducing model size compared to previous methods. MTL enhances data efficiency and can be seen as a form of inductive transfer learning. BERT and Distil-BERT, pre-trained language models, are effective tools for NLP tasks. The study demonstrates the superiority of BERT-based models over GloVe-based ones and suggests deploying the MTL-BERT model for high accuracy or the MTL-Distil-BERT model for resource efficiency on peer-review platforms. The MTL models proposed in this study consider three features of high-quality peer reviews: containing suggestions, mentioning problems, and using a positive tone. The comparison between BERT-based single task learning (STL) models and the previous GloVe-based method demonstrates significant improvements in detecting a single feature. The study acknowledges limitations, such as the need to explore additional tasks, consider alternative parameter sharing approaches, and evaluate the model in real-world settings. These findings lay the groundwork for ongoing work to comprehensively evaluate peer review comments and enhance peer assessment.

H. MULTIMODEL-DEEP LEARNING FOR SHORT-TEXT MULTI-CLASS CLASSIFICATION
This paper presents a multimodel-based deep learning framework for short-text multi-class classification with an imbalanced and minimal dataset [23]. The framework consists of an encoder layer using Distil-BERT for dynamic word embeddings, followed by word-level and sentence-level long short term memory (LSTM) networks to extract deep semantic information. A max-pooling layer reduces dimensionality, and a softmax layer performs multi-class classification. The proposed approach achieves SOTA performance while being faster and lighter for deployment on mobile devices. Distil-BERT reduces complexity, and the bidirectional LSTM (Bi-LSTM) enhances the model's ability to handle polysemous words. The framework addresses data imbalance, small text, and multi-classification tasks and is applicable to real-world scenarios, particularly mobile devices. The smaller model size makes it suitable for Artificial Intelligence (AI) technology in smart devices, while BERT-based models are more suited for large-scale cloud computing and research institutes.

I. CSIC MODEL FOR SC TASKS
The research article introduces a novel approach called contrastive supervised learning with iterative combination (CSIC) for SC tasks [24]. The objective of CSIC is to overcome the limitations of static models that cannot adapt to new domains due to storage constraints or privacy concerns. The proposed CSIC method combines the original network, trained on old tasks, with a fine-tuned network trained on new tasks using knowledge distillation. This iterative combination allows for CL without increasing the network's size. BERT, a highly performant model in SC, is chosen as the backbone model for CSIC. To address CF, where the network's performance on old tasks deteriorates when learning new tasks, CSIC linearly combines the original network and the fine-tuned network using knowledge distillation. Importantly, after training, the combined network can be converted back to the standard BERT structure, eliminating the need for additional parameters or structures for old and new tasks. CSIC consists of four network components: the original network for old tasks, the fine-tuned network for new tasks, the middle network that integrates knowledge from both networks, and the final combined network converted from the middle network. All networks utilize BERT as the backbone. During the learning of a new task, CSIC involves three phases: linearly combining the original and fine-tuned networks to create the middle network, performing an additional retraining phase for the middle network to prevent CF, and converting the middle network into the final combined network with the same size as the original network. The effectiveness and efficiency of the CSIC approach are evaluated through extensive experiments on 16 popular review datasets. The results demonstrate that CSIC outperforms strong baseline methods for continual SC.

J. CL FOR TC IN BERT MODEL
The article titled "Addressing CL Challenges in TC through Information Disentanglement based Regularization" addresses the notable issue of CL in the context of TC tasks [25]. CL entails a model's ability to learn from an ongoing stream of data while retaining previously acquired knowledge, without experiencing CF. To tackle this challenge, the researchers propose an innovative approach that utilizes regularization based on information disentanglement. The primary goal is to enhance the model's CL capabilities, enabling effective learning from new data classes while preserving the knowledge of previously learned classes. To accomplish this, the proposed method introduces a regularization term into the loss function of the TC model. This regularization term encourages the model to disentangle shared information, which is relevant across different tasks, from task-specific information, which pertains to each specific classification task. By disentangling this information, the model can better isolate and retain crucial knowledge for specific tasks, preventing interference or forgetting when learning new tasks. To evaluate the effectiveness of the proposed approach, the researchers conducted experiments on various well-established benchmark datasets. The results demonstrate that the information disentanglement based regularization significantly enhances the model's performance in CL scenarios. By mitigating CF, the approach enables the model to maintain its performance on previously learned tasks while adapting well to new tasks. Furthermore, the proposed method achieves competitive accuracy compared to SOTA CL methods, highlighting its effectiveness in the domain of TC.

K. NEURAL LANGUAGE MODELS (NLMS)
In [26], the authors explore how NLMs, such as Transformers, convolutional neural networks (CNNs), and LSTMs, understand and process verb argument structures in German, a language where word order is flexible in subordinate clauses. The researchers developed a new testing method using minimal variation sets. These sets include sentences with correct and incorrect verb structures to see if the models can tell the difference between grammatically right and wrong sentences. Transformers generally scored higher than LSTMs and even humans but tended to overgeneralize, sometimes accepting sentences that were not very plausible. LSTMs had trouble, especially with frequent argument structures like double nominatives.

Human evaluations were also part of the study. Annotators rated the naturalness of sentences on a scale, providing a benchmark for the models. These ratings confirmed that grammatical rules, such as nominative and dative alignment, play a role in how sentences are judged. The study showed that while NLMs can understand some syntactic rules, their performance is affected by biases and overgeneralization. In their experiments, the team created sentences by changing the positions and case assignments of arguments in template sentences, resulting in both acceptable and unacceptable variations. They found that both LSTMs and Transformers performed better than random guessing in identifying correct verb structures. This suggests a need for better models or training methods that more closely mimic human language understanding.

L. CORRECTNESS OF SENTENCE VIA LANGUAGE MODELS
The authors introduced the ItaCoLA corpus in [27], a collection of almost 10,000 Italian sentences, each labeled as acceptable or unacceptable. The aim is to help researchers study how well language models can judge the grammatical correctness of sentences in languages other than English. The authors explain how they created the ItaCoLA corpus by manually transcribing sentences from various linguistic sources and labeling them based on acceptability. About 30% of these sentences are also tagged with additional linguistic features to capture specific grammatical phenomena. The study also explores using a multilingual model, XLM-RoBERTa, which is trained on multiple languages. This model benefits from the multilingual training but is still not as effective as models trained specifically on one language. The authors describe several experiments to test the performance of neural language models like BERT on this new corpus. They compare how well these models perform on Italian sentences versus English sentences from a similar corpus (CoLA). The experiments include tasks within the same domain (in-domain) and tasks with different data (out-of-domain). The findings show that BERT-based models work well for Italian sentences, almost as well as for English, though there are some differences in handling certain grammatical structures.

M. DATASET FOR LANGUAGE MODELS
A new dataset called EsCoLA, which contains 11,174 Spanish sentences labeled as either acceptable or unacceptable, is presented in [28]. This dataset is designed to help researchers test how well language models can understand and generate grammatically correct Spanish sentences. The sentences were taken from well-known Spanish grammar books and include annotations for various grammatical features, such as sentence structure, verb use, and specific Spanish language phenomena like the use of 'ser' and 'estar'. Human experts also evaluated the sentences to provide a benchmark for the models' performance. Their judgments matched the dataset labels most of the time, but there were some disagreements, which the researchers noted. This human evaluation helps show the upper limit of how well we can expect the models to perform. The researchers ran several experiments to see how well different language models (IXABERTesv2, RoBERTa-large-bne, XLM-RoBERTa-large, and mDeBERTa-v3) performed when fine-tuned on the EsCoLA dataset. They tested these models on both sentences from the same sources used to create the dataset (in-domain) and sentences from different sources (out-of-domain). The mDeBERTa-v3 model performed the best, followed by RoBERTa-large-bne. The other models, XLM-RoBERTa-large and IXABERTesv2, did not perform as well.

Table 1: Comparative Analysis.

Keywords | Proposed Techniques | Achievements | Applications | Limitations
CL, ASC, DIL, KT, CED, 2021, [16] | The CLASSIC model is used to address SC in the context of DIL for ASC. | Can learn and adapt from new data without forgetting the previously learned data. | Can be used whenever the objective is to enhance the classification accuracy. | Dependence on contrastive learning can reduce its effectiveness in handling new tasks.
Distil-BERT, Transformer, Big Data, 2022, [17] | Proposed two language models (English, Brazilian Portuguese) for TC using various datasets. | Offers comparable performance to BERT while being significantly smaller and faster. | Distil-BERT for TC tasks in the context of the IoT. | It faces challenges with memory limitations on devices with very low resources.
Adapter-BERT, CL, B-CL, ASC, Transfer learning, 2021, [18] | Continually learns in ASC by transferring knowledge from previous tasks to avoid degradation and forgetting. | Transfers knowledge in an efficient and effective way without forgetting the learned data. | Fits where enhancing model learning in ASC with CL systems is required. | The model performance drops for very different tasks.
Sentiment prediction, SC, BERT model, 2021, [19] | ABSA leveraging supervised contrastive learning based on BERT is proposed to extract sentiment characteristics from sentences related to specific aspects. | Improved sentiment prediction for unknown testing aspects, integration of the BERT model, enhanced ABSA performance. | Needed where enhancing SA performance by leveraging sentiment features and the BERT-SCon framework is required. | Using supervised contrastive learning needs a lot of labeled data, which can be hard to get.
Text Analysis, Transformers, Pattern Classification, NLP, 2022, [20] | An enhanced method for furniture tip-over incident classification, utilizing NLP techniques and ML algorithms, is proposed. | Outperforms transformer-based models in classifying furniture tip-over incidents. | Broader applicability in automating classification tasks for various domains, improving productivity and accuracy. | The model does not work well for tasks outside of furniture tip-over incidents because it is trained specifically for that domain.
TC, Transfer Learning, Accuracy, 2022, [21] | Explores the application of transfer learning classification models to COVID-19 fake news and extremist/non-extremist datasets. | Improves TC accuracy with less human supervision compared to active and supervised learning approaches. | Applications in domains such as healthcare, marketing, entertainment, and content filtering. | This method does not work well for very specific tasks or when there is a lot of noisy data.
Peer Assessment, Text Analysis, Data Mining, 2021, [22] | Two multitask learning (MTL) models, utilizing BERT and Distil-BERT, for evaluating peer-review comments are proposed. | Demonstrates the superiority of BERT-based models over GloVe-based ones. | The MTL-BERT model can be deployed on peer-review platforms for high accuracy in evaluating peer-review comments. | The multitask learning method can be less effective when tasks are not closely related, leading to negative transfer.
Distil-BERT, Multi-Classification Tasks, LSTM, 2022, [23] | A multimodel-based deep learning framework is presented for short-text multi-class classification with an imbalanced and minimal dataset. | Effective in addressing data imbalance, small text, and multi-classification tasks. | Best for real-world scenarios, particularly on mobile devices, where efficient short-TC is required. | The model struggles with very imbalanced datasets and still needs adjustments to fit specific hardware limitations.


Continual SC, Fine-tuned Network, BERT, CF, 2021, [24] | CSIC addresses the limitations of static models in SC, enabling adaptability to new domains. It combines an original network with a fine-tuned network using knowledge distillation for CL. | Adaptability to new domains, CL without increasing the size of the network, mitigation of CF, utilization of the BERT model. | CSIC can be applied in scenarios where storage constraints or privacy concerns prevent the use of large static models. | Combining networks repeatedly can make training complicated and possibly unstable. The success of this method depends on the quality of the knowledge distillation process.
CL, TC, CF, Knowledge Retention, Accuracy, 2021, [25] | Addresses the challenge of CL in TC tasks by leveraging information disentanglement. | The approach enhances the model's ability to learn from new classes while retaining knowledge of previously learned ones. | Has various potential applications in domains where models need to continuously learn from new data. | This model makes the training process more complex and slow. This method also needs careful adjustment to keep old knowledge while learning new information.
NLMs, Transformers, LSTMs, 2019, [26] | Developed a testing method using minimal variation sets, creating sentences with both correct and incorrect verb structures by altering the positions and case assignments of arguments. | Transformers generally performed better than LSTMs and humans in recognizing correct verb structures. | Valuable for analyzing how different neural models understand and process syntactic rules in languages with flexible word order, such as German. | Transformers tended to overgeneralize, sometimes accepting sentences that were not grammatically correct. LSTMs, on the other hand, had difficulties with frequent argument structures like double nominatives.
ItaCoLA corpus, Italian sentences, acceptability judgment, NLMs, 2021, [27] | Developed the ItaCoLA corpus, a collection of nearly 10,000 Italian sentences labeled for grammatical acceptability. | Demonstrates that BERT-based models can perform well on Italian sentences, almost matching their performance on English sentences. | This resource is beneficial for linguistic research, particularly in the area of syntactic and grammatical analysis. | Multilingual models like XLM-RoBERTa, although beneficial, do not yet match the effectiveness of models trained on a single language.
EsCoLA, Spanish sentences, grammatical acceptability, language models, 2024, [28] | The EsCoLA dataset is created by sourcing 11,174 Spanish sentences from well-known grammar books, annotated for grammatical features and labeled as acceptable or unacceptable. | mDeBERTa-v3 achieved the best performance, followed by RoBERTa-large-bne, demonstrating the effectiveness of these models. | Useful for linguistic research, especially in the syntactic and grammatical analysis of the Spanish language. | Improvements are needed in training and fine-tuning language models to better capture the nuances of Spanish grammar.


III. PROPOSED METHODOLOGY
We fine-tuned a pre-trained Distil-BERT model for TC tasks, with our primary focus on CL. This section is divided into subsections to describe the proposed methodology with a clear vision. The architecture of Distil-BERT and its components is depicted in figure 1, while the overall proposed model architecture and flow chart are shown in figures 2 and 3, respectively.

1) Pre-processing on Datasets
The datasets, or tasks, are not directly ready to use. Rather, they contain extra characters and punctuation, and they require padding, truncation, tokenization, and the handling of special characters, steps that must be completed before training. So pre-processing is the very first step performed on the task data to convert it into our desired format. In our case, we have five different tasks, i.e., Yahoo (tweets), Yelp (tweets), Amazon (tweets), DB-Pedia (tweets), and AG-News (tweets). Pre-processing typically involves cleaning, tokenization, lemmatization or stemming, handling abbreviations and acronyms, handling rare words, and, at last, encoding and vectorization. In text cleaning, we removed unnecessary characters or symbols, such as special characters, punctuation marks, and trailing white spaces. In the tokenization step, we split the text into individual words or subwords called tokens. In lemmatization or stemming, we reduce words to their base or root form. For abbreviations and acronyms, we expanded the words to their full forms to ensure consistent representation and improve understanding. Next, we removed or replaced rare words with a special token, which helps in reducing noise and improving model performance. At last, in the encoding and vectorization stage, the textual data is converted to a numerical representation through a word embedding technique (Word2Vec). In short, in the pre-processing step we removed all the unnecessary characters and punctuation and performed the replacements described above, so that the task data is ready for the next steps. A minimal sketch of these steps, together with data splitting, is given after the next subsection.

2) Data Splitting
The pre-processed data is then fed into the data splitting block. This block divides the pre-processed data into two types of datasets, i.e., a training set and a test set. The training set is the portion of the considered datasets that contains labeled samples used to train the Distil-BERT model on tweets with their associated labels in order to estimate the model's parameters. The Distil-BERT model learns from the training set by adjusting the weights/coefficients of its neurons. The goal of training the Distil-BERT model is to minimize the difference between the predicted outputs and the true labels of the text used for training.

The second category of data splitting is the testing set. This set assesses the performance and generalization capability of the Distil-BERT model trained on the tweets present in the datasets. The testing set contains labeled samples that the Distil-BERT model has not seen during the training phase. It is used to evaluate the model's performance on unseen data and estimate how well it performs in TC. The goal is to assess the model's ability to generalize its learned patterns and make accurate predictions on new tweets.

As output, the training metrics block represents the data generated during the training phase, including metrics for F1 score, accuracy, and learning rate used to monitor the model's progress. Similarly, the test metrics block contains evaluation metrics obtained when the trained model is applied to the test set, providing insights into the model's performance.
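As a compact illustration of subsections 1) and 2), the sketch below cleans the text, replaces rare words with a special token, splits the data into training and test sets, and tokenizes with padding and truncation. It is a simplified stand-in for the pipeline described above, not the authors' exact code: the cleaning rules, the rare-word threshold, the 80/20 split, and the use of the Hugging Face Distil-BERT tokenizer (rather than a separate Word2Vec stage) are our assumptions, and lemmatization and abbreviation expansion are omitted for brevity.

```python
import re
from collections import Counter
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizerFast

def clean(text: str) -> str:
    # Remove special characters/punctuation and collapse/trim white space.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def replace_rare(texts, min_count=2, token="<rare>"):
    # Replace words seen fewer than min_count times with a special token
    # (min_count is an assumed threshold, not taken from the paper).
    counts = Counter(w for t in texts for w in t.split())
    return [" ".join(w if counts[w] >= min_count else token for w in t.split())
            for t in texts]

# Tiny illustrative samples; in practice these come from the task datasets.
texts = ["Great phone, the battery lasts ALL day!!!",
         "The battery died after a week...",
         "Great value, would buy again :)",
         "Would not buy again, the battery overheats."]
labels = [1, 0, 1, 0]

cleaned = replace_rare([clean(t) for t in texts])

# Data splitting: hold out a test set the model never sees during training.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    cleaned, labels, test_size=0.2, random_state=42)

# Tokenization with padding and truncation, as described in subsection 1).
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
train_enc = tokenizer(train_texts, padding=True, truncation=True, max_length=128)
```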

Figure 1: Distil-BERT Model Architecture and Components

3) Features Extraction and Text Classification
Feature extraction involves utilizing a pre-trained Distil-BERT model to extract meaningful representations (features) from text data without further training the model on a specific task. TC is a specific task in NLP that involves categorizing a given piece of text, that is, assigning it predefined labels or categories. In our case, we have five different tasks, i.e., Yahoo, Yelp, Amazon, DB-Pedia, and AG-News, on which we performed feature extraction and TC. The general block diagram of TC with feature engineering is shown in figure 4.

The goal is to automatically classify text documents into different classes or categories based on their content. The Distil-BERT model, specifically, consists of six transformer layers that perform computations to extract features and capture contextual information from the input data. To begin, the integer IDs are transformed into embeddings, dense vector representations that encode the meaning of words in the input text. These embeddings carry semantic information, reflecting the relationships between words within the sentence context. These embeddings, along with the position embeddings that indicate word positions within the input sequence, are passed through the transformer layers. Each transformer layer consists of two sub-layers: the multi-head self-attention mechanism and a FFNN. The self-attention mechanism enables the model to assign varying levels of importance to words in the input sequence by considering their relevance to one another. It produces weighted representations of the words, emphasizing the most contextually significant words for each position in the sequence. Subsequently, the FFNN processes the self-attention outputs. It applies non-linear transformations to the representations, capturing higher-level features and interactions among the different words in the sequence. This process of feeding the input through the transformer layers is repeated iteratively, six times for Distil-BERT. With each iteration, the model refines and extracts increasingly complex features from the input data. The two added linear layers on the head of Distil-BERT are the classification layers, responsible for mapping the representations generated by the transformer layers to the specific number of classes relevant to the TC task. By utilizing these extracted features, the two layers make predictions about the class labels associated with the input text. For the input datasets, i.e., Yahoo, Yelp, Amazon, DB-Pedia, and AG-News, our Distil-BERT model classifies 10, 5, 5, 14, and 4 classes, respectively.

y = W_1 z + W_2 x + b    (1)

In equation 1, the variable y represents the task-specific space representation, the term z represents the task-generic space representation, and x represents the input features, while W_1 and W_2 are the weight matrices that transform the task-generic space representation and the input features, respectively. Moreover, b represents the bias vector, and f(z, y; θ) represents the output of the model, which is a function that takes the task-generic space representation z, the task-specific space representation y, and the model parameters θ as inputs, and produces an output.

Figure 2: Proposed Model Architecture

in this approach. Two auxiliary tasks are incorporated during training: next sentence prediction and task identifier prediction. The next sentence prediction task helps the model capture task-generic information by training it to predict the logical sequence of sentences. The task identifier prediction task aids in capturing task-specific information by requiring the model to identify which task the input data belongs to. By training the model on these auxiliary tasks alongside the main text classification tasks, the model can simultaneously learn task-generic and task-specific contextual information. During the training process, the task-generic layers remain consistent, ensuring that the model retains the knowledge acquired from previous tasks. The task-specific layers, however, are updated for each new task, allowing the model to adapt to new challenges while preserving the integrity of previously learned information. This strategy ensures that the model can perform accurately on previously learned tasks even when new tasks are introduced, effectively preventing catastrophic forgetting. The systematic saving of the best models and the incorporation of auxiliary tasks enable the model to maintain high performance and continually adapt to new tasks without losing previously acquired knowledge.

θ_new = θ_old − η ∇_{θ_old} L(θ_old, D_old) − η ∇_θ L(θ, D_new)    (2)

Equation 2 formalizes the memory-based CL approach as an update of the model's parameters for new tasks while preserving knowledge from previously learned tasks. θ represents all parameters of the model, and θ_old represents the parameters of the model before updating with new tasks. D_new represents the dataset for the new task, and D_old represents the datasets for previously learned tasks. L(θ_old, D_old) represents the loss function computed on the datasets for previously learned tasks using the parameters θ_old, and L(θ, D_new) represents the loss function computed on the dataset for the new task using the updated parameters θ. η represents the learning rate, which controls the size of the parameter updates.
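A minimal sketch of this sequential, memory-based loop follows. The helper functions train_one_epoch and evaluate are assumed placeholders, and all names are illustrative rather than the original code.

import copy
import torch

def train_continually(model, tasks, lr=2e-4):
    # Sketch of the memory-based CL loop: visit tasks sequentially and keep
    # the best-performing checkpoint per task in a memory dictionary.
    memory = {}
    for task_name, (train_loader, eval_loader, num_epochs) in tasks.items():
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        best_accuracy = 0.0
        for _ in range(num_epochs):
            train_one_epoch(model, train_loader, optimizer)  # assumed helper; applies the updates of Equation (2)
            accuracy = evaluate(model, eval_loader)          # assumed helper
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                memory[task_name] = copy.deepcopy(model.state_dict())  # systematically save the best model
    return memory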
5) Fine-tuning
Fine-tuning refers to the process of taking a pre-trained language model, in our case Distil-BERT, and further training it on a specific task or domain. For each new task, we employ the same Distil-BERT backbone, complemented by two additional linear layers serving as the classification head. The second-to-last layer has 768 input features and 768 output features, while the final layer has 768 input features, with the output features configured according to the number of classes specific to the task. The addition of these two extra linear layers at the top of the Distil-BERT model divides the output representation into a task-generic space and a task-specific space. To achieve this division, auxiliary tasks are incorporated during training: next sentence prediction, which captures task-generic information, and task identifier prediction, which captures task-specific representations. By training the model on these auxiliary tasks alongside the main TC tasks, the model can capture both task-generic and task-specific contextual information.
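The described setup corresponds to a backbone-plus-head module along the following lines; this is a sketch under stated assumptions, where the checkpoint name ("distilbert-base-uncased") and the use of the first-token vector for pooling are our choices, not details given in the paper.

import torch
import torch.nn as nn
from transformers import DistilBertModel

class DistilBertClassifier(nn.Module):
    # Shared Distil-BERT backbone with the two-layer classification head
    # (768 -> 768 -> num_classes) described above.
    def __init__(self, num_classes: int):
        super().__init__()
        self.backbone = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.fc1 = nn.Linear(768, 768)          # second-to-last layer: 768 in, 768 out
        self.fc2 = nn.Linear(768, num_classes)  # final layer: 768 in, task-specific out

    def forward(self, input_ids, attention_mask):
        output = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        pooled = output.last_hidden_state[:, 0]  # representation at the first token position
        return self.fc2(torch.relu(self.fc1(pooled)))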
IV. DATASETS STATISTICS
In this paper, we have considered tweets from five different tasks, i.e., Yahoo, Yelp, Amazon, DB-Pedia, and AG-News, for experimental analysis. The considered statistics, i.e., classes, epochs, batch size,

Figure 3: Flowchart of the Proposed Work

types of tweets, and the train and test sets of each dataset used for the experiments are explained below and shown in Table 2. The flexibility of our approach allows tasks to be added in any order, demonstrating robustness to the sequence of introduction.

A. YAHOO
The Yahoo dataset is a collection of user-generated reviews and ratings for various products and services. It covers a wide range of domains, including electronics, movies, books, and more. The dataset consists of textual reviews along with associated ratings or sentiment labels, indicating the sentiment expressed in the review (positive, negative, or neutral). It is a popular dataset used for SA and opinion mining tasks. In our case, the Yahoo dataset consists of 10 classes, representing different categories or topics in tweet format. The considered number of epochs for Yahoo is 14, with batch size 16. It includes a training set with 20,000 instances, which are used to train the model, and a test set with 7,600 instances, which are used to evaluate the model's performance.

B. YELP
The Yelp dataset comprises user-generated reviews for local businesses, such as restaurants, bars, and hotels. The dataset includes classes or labels that represent different aspects or sentiment categories, such as food quality, service, ambiance, etc. It consists of separate training and test sets, where the training set is used for model training, and the test set is used for evaluation. This dataset is trained for 14 epochs with 5 classes, representing different sentiment tweets, such as positive, negative, or neutral tweets. It consists of a training set with 10,000 instances for model training and a test set with 7,600 instances for model evaluation.

C. AMAZON
The Amazon dataset contains customer reviews and associated ratings for products sold on the Amazon platform. It includes separate training and test sets, where the training set is used for model training, and the test set is used for evaluation. This dataset consists of 5 classes, representing different review sentiments, and is trained for 14 epochs. It includes a training set with 10,000 instances and a test set with 7,600 instances used for model training and evaluation.

D. DB-PEDIA
The DB-Pedia dataset is derived from Wikipedia and consists of textual descriptions and associated labels or categories. The dataset typically includes a set of predefined classes representing different topics or concepts in tweets, such as people, places, organizations, etc. In our experiments, we have considered the DB-Pedia dataset that consists of 14 classes and 2 epochs, representing various topics or concepts derived from Wikipedia. It includes a training set with 28,000 instances used for model training and a test

Figure 4: Text Classification with Feature Engineering

Table 2: Dataset Statistics


Dataset    Classes  Epochs  Batch Size  Type  Train Set  Test Set
Yahoo      10       14      16          Text  20,000     7,600
Yelp       5        14      16          Text  10,000     7,600
Amazon     5        14      16          Text  10,000     7,600
DB-Pedia   14       2       16          Text  28,000     7,600
AG-News    4        3       16          Text  8,000      76,000

set with 7,600 instances for evaluation.

E. AG-NEWS
The AG-News dataset comprises news articles categorized into different topics or classes, such as sports, business, technology, and world news. This dataset includes predefined classes representing the different news categories with text. It is trained for 3 epochs with 4 classes, representing different news categories. It comprises a training set with 8,000 instances for model training and a larger test set with 76,000 instances for the evaluation of tweets.

The benefits achieved using diverse datasets are significant. The selected datasets cover a wide range of domains, including reviews, news articles, and product ratings, which helps create a robust model capable of handling various types of textual data. Datasets like Yelp and Amazon, which focus on user reviews and sentiments, enable the model to excel in SA tasks, improving its ability to detect and classify sentiments accurately. Additionally, using datasets with different classes and topics, such as DB-Pedia and AG-News, ensures that the model generalizes well across various types of content, making it more versatile and effective in real-world applications.

The division of each dataset into training and test sets allows for rigorous evaluation of the model's performance, aiding in fine-tuning and enhancing the accuracy and generalization capabilities of the model. By leveraging these diverse datasets, the research ensures that the proposed model is effective across multiple domains and robust enough to handle different types of textual data, thus enhancing its overall applicability and performance in various NLP tasks.
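Before turning to the results, the per-task settings of Table 2 can be gathered into a single configuration mapping; the structure and names below are ours, for illustration only.

# Per-task settings mirroring Table 2.
TASK_CONFIG = {
    "yahoo":   {"classes": 10, "epochs": 14, "batch_size": 16, "train": 20_000, "test": 7_600},
    "yelp":    {"classes": 5,  "epochs": 14, "batch_size": 16, "train": 10_000, "test": 7_600},
    "amazon":  {"classes": 5,  "epochs": 14, "batch_size": 16, "train": 10_000, "test": 7_600},
    "dbpedia": {"classes": 14, "epochs": 2,  "batch_size": 16, "train": 28_000, "test": 7_600},
    "ag_news": {"classes": 4,  "epochs": 3,  "batch_size": 16, "train": 8_000,  "test": 76_000},
}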
V. RESULTS DISCUSSION
In our proposed methodology, we have enhanced the Distil-BERT model by improving its performance parameters, such as F1 score, accuracy, and learning rate, while reducing evaluation loss and training loss. These improvements are achieved by integrating a memory-based CL technique and designing a task-independent architecture with two additional linear layers placed at the head of the Distil-BERT model. We divide the output representation of Distil-BERT into two spaces: a task-generic space and a task-specific space. The comprehensive results of the proposed model are detailed in Table 3.
We conducted a series of experiments with various hyperparameters and settings, including the number of layers, learning rate, batch size, and epochs. Our analysis results indicate that despite the shared language features among tasks, there is no significant performance degradation in individual tasks. This suggests that our tasks remain largely independent regarding task-specific features. Notably, our novel methodology demonstrates significant advancements over the current SOTA models in the field of CL, especially in TC. These improvements highlight the effectiveness of our approach in enhancing performance across the diverse text-based tasks.
It is important to note that the x-axis values represent the global steps toward the completion of

Table 3: Results of Proposed Pipeline


Task      F1-Score  Accuracy  Evaluation Loss  Initial Learning Rate
Yahoo     96.84     97.84     0.06             0.00003144
Yelp      96.66     97.98     0.06             0.00003189
Amazon    95.82     98.02     0.06             0.00003144
DB-Pedia  96.20     98.32     0.08             0.0001972
AG-News   94.78     96.22     0.06             0.0001511

instances for each task in each performance parameter, and this value is consistent across all tasks, being 4381. Meanwhile, the y-axis values vary for each task and performance parameter.

A. MODEL F1 SCORE
The F1 score is a metric commonly used in classification tasks to evaluate model performance. It combines precision and recall into a single measure to provide a balanced assessment of the model's accuracy. The x-axis coordinates represent the steps required to fully train the model, while the y-axis coordinates show the model's performance in terms of the F1 score, with 1 indicating optimal performance and 0 indicating poor performance. Figure 5 illustrates the F1 scores obtained by the proposed model at different evaluation points or epochs during training across various tasks.

Figure 5: Model F1 Score

The starting evaluation point for the model is step 313, with initial F1 scores for the Yahoo, Yelp, Amazon, DB-Pedia, and AG-News tasks being 0.7358, 0.5626, 0.5168, 0.9874, and 0.9199, respectively. The graph begins after the completion of one epoch. DB-Pedia, being the quickest task, completes its F1 evaluation by step 626 with a score of 0.9620, as it has only 2 epochs. However, during this interval (steps 313 to 626), the F1 score for the Yelp task decreases from 0.5626 to 0.5346. This degradation occurs because the model encounters complex instances that initially pose a challenge for its performance. Over time, as the model learns from these instances, its performance improves.
The AG-News task completes its F1 evaluation by step 939 with a score of 0.9478, as it has 3 epochs. By step 1252, the F1 scores for the Yahoo, Amazon, and Yelp tasks have improved to 0.8928, 0.8776, and 0.749, respectively. From steps 2504 to 4382, the F1 scores for the Yahoo, Amazon, and Yelp tasks stabilize at approximately 0.9684, 0.9582, and 0.9666, respectively. This stabilization indicates that the model has fully learned from the instances and epochs, resulting in consistent performance.
The F1 score helps visualize how the model's performance evolves over time, providing insights into its ability to balance precision and recall across different classes, instances, or categories.

F1 = 2 × (precision × recall) / (precision + recall)    (3)

In equation 3, the F1 score is expressed as a measure of the test's accuracy.
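As a quick illustration of Equation 3, the F1 score can be computed with scikit-learn; macro averaging is our assumption here, since the paper reports a single F1 value per multi-class task.

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1]  # toy ground-truth labels
y_pred = [0, 1, 2, 0, 1]  # toy predictions
print(f1_score(y_true, y_pred, average="macro"))  # per-class F1, averaged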

Figure 6: Model Accuracy

Figure 7: Model Evaluation Loss

B. MODEL ACCURACY
In this subsection, we analyze the performance of the proposed model by examining its prediction accuracy. The accuracy graph in Figure 6 shows the model's accuracy percentage on the evaluation set at different instances during various training epochs. The figure tracks the model's improvements in accuracy over time, indicating the convergence or stability of its performance. The x-axis represents the training progress, showing the number of steps necessary to fully train the model, while the y-axis represents the model's accuracy, ranging from 0 to 1, with 1 indicating optimal performance in terms of accuracy and 0 indicating poor accuracy.
The tasks DB-Pedia and AG-News concluded their evaluations earlier due to having fewer instances and epochs. These tasks begin providing results after the first epoch. The model's accuracy for DB-Pedia ranges from 0.9374 to 0.9832 between steps 313 and 626. Similarly, for AG-News, the accuracy ranges from 0.9198 to 0.9622 between steps 313 and 939.
For the remaining tasks, the model's accuracy gradually increases over time due to the implemented novel strategies. We observed that the accuracy for the Yelp and Amazon tasks shows a steady response. For the Yelp task, from step 939 to 1252, the accuracy ranges from 0.7502 to 0.7592, indicating consistent performance. Similarly, for the Amazon task, from step 1252 to 1565, the accuracy ranges from 0.88 to 0.8882, showing stable performance.
Finally, the model's accuracy reaches its final magnitudes from step 2500 to 4382, achieving 0.9784 for Yahoo, 0.9798 for Yelp, and 0.9802 for Amazon. These values indicate that the model has fully learned from the instances, resulting in consistent and high performance. The accuracy graph helps in visualizing the model's performance changes over time, providing insights into its ability to correctly classify different instances or categories.

Accuracy = Σ CorrectPredictions / Σ AllPredictions    (4)

Equation 4 represents a widely used metric in classification tasks, measuring the ratio of correctly predicted instances to the total number of instances. It serves as a crucial measure of model performance, indicating the effectiveness of classification algorithms in accurately assigning labels to input data.
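Equation 4 amounts to a one-line computation; the following sketch uses toy labels for illustration.

def accuracy(y_true, y_pred):
    # Equation (4): correct predictions divided by all predictions.
    correct = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

print(accuracy([0, 1, 2, 2], [0, 1, 2, 0]))  # 0.75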
C. MODEL EVALUATION LOSS
Model evaluation loss, also known as validation loss, represents the error or discrepancy between the model's predictions and the ground truth labels on the evaluation set. It measures how well the model performs on unseen data. In the evaluation loss graph shown in Figure 7, the horizontal axis indicates the progression of model training, measured in steps, while the vertical axis represents the model's evaluation loss, with values ranging from 0 to 1. A value of 1 indicates maximum loss, while 0 signifies no loss. We monitored the evaluation loss of our model for each task individually and then conducted a combined analysis for all tasks. Our findings indicate that the evaluation loss for the model gradually decreases over time, ultimately approaching zero.
The analysis starts at step 313, following the first epoch. At this initial point, the evaluation loss values for Yahoo, Yelp, Amazon, DB-Pedia, and AG-News are 0.9121, 0.9845, 1.004, 0.09761, and 0.3077, respectively. DB-Pedia and AG-News concluded their evaluations earlier, at steps 626 and 939, respectively, due to fewer epochs. During their analysis periods, the evaluation loss for DB-Pedia decreased from 0.09761 to 0.08177, and for AG-News, it decreased from 0.3077 to 0.0663.
For the Yelp task, between steps 939 and 1252, the evaluation loss remained steady, indicating the model had stabilized, and no significant changes were observed. For the remaining tasks, i.e., Yahoo, Yelp, and Amazon, the evaluation loss gradually decreased over time. However, between steps 1878 and 2191, the evaluation loss for the Amazon task temporarily increased from 0.1559 to 0.1736, likely due to the introduction of new instances.
Finally, from step 3443 to 4382, the evaluation loss for all remaining tasks, i.e., Yahoo, Yelp, and Amazon, decreased, approaching zero, i.e., 0.0600. This trend indicates that our proposed methodology effectively reduces evaluation loss over time, enhancing the model's performance on unseen data.

Evaluation Loss = (1/N) Σ_{i=1}^{N} (GroundTruth_i − Predicted_i)²    (5)

Equation 5 represents the evaluation loss, calculated as the mean squared difference between the predicted and actual values across all samples in the dataset. In Equation 5, N represents the number of instances or samples evaluated, GroundTruth_i represents the ground truth label for the i-th instance, and Predicted_i represents the predicted label for the i-th instance. This equation calculates the mean squared error (MSE) between the ground truth and predicted labels over all instances.
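A direct reading of Equation 5 as mean squared error can be sketched as follows; the toy values are illustrative only.

import numpy as np

def evaluation_loss(ground_truth, predicted):
    # Equation (5): mean squared error over the N evaluated instances.
    g = np.asarray(ground_truth, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.mean((g - p) ** 2))

print(evaluation_loss([1, 0, 1], [0.9, 0.1, 0.8]))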
D. LEARNING RATE OF THE MODEL
The learning rate of any model refers to the step size or the rate at which the model's parameters are

Figure 8: Training Losses of the Model

Figure 9: Model Learning Rate

updated during training. It determines how quickly or slowly the model learns from the training data, significantly impacting the model's convergence, stability, and overall performance. Adjusting the learning rate is crucial for finding the optimal value that leads to faster convergence and better generalization.
In Figure 9, the horizontal axis represents the progression of model training, measured in steps, while the vertical axis represents the learning rate on a logarithmic scale. The values on the y-axis start from 0, followed by 5e-5, 0.0001, and so on, up to 0.0002, indicating the decreasing magnitude of the learning rate over time.
Initially, the learning rate is high to allow the model to quickly learn from the provided instances. As training progresses, the learning rate decreases to enable more precise adjustments. The tasks DB-Pedia, AG-News, and Yahoo start from step 1, while Yelp starts from step 3, and Amazon from step 4.
At the beginning, the learning rate gradually rises, reaching a peak value of 0.0002 for all tasks within the first 1000 steps. After reaching this peak, the learning rate for each task decreases exponentially, reaching its final value at different points. DB-Pedia and Amazon conclude their learning at steps 626 and 644, with final recorded learning rates of 0.0001972 and 0.00003144, respectively, indicating they have fully learned from the provided instances. AG-News concludes its learning at step 939, with its learning rate dropping to 0.0001511, while Yelp and Yahoo end their learning at step 4382, with learning rates dropping to 0.00003189 and 0.00003144, respectively.
This analysis shows that the model's learning rate decreases over time, ensuring a balance between rapid learning and precise adjustments for optimal performance.

Learning Rate(t) = a / (1 + b·t)    (6)

Equation 6 represents a learning rate schedule where the learning rate decreases over time according to a specified schedule. In Equation 6, t represents the global step or iteration during training, while a and b are parameters that control the initial learning rate and the decay rate, respectively.
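A schedule of the form in Equation 6 can be expressed with PyTorch's LambdaLR, which scales a base learning rate by a step-dependent factor; the values of a and b below are assumed for illustration.

import torch

model = torch.nn.Linear(768, 2)                             # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)  # a: initial learning rate (assumed)
b = 1e-3                                                    # decay constant (assumed)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda t: 1.0 / (1.0 + b * t))
for step in range(3):
    optimizer.step()
    scheduler.step()
    print(scheduler.get_last_lr())  # current rate: a / (1 + b t)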
E. TRAINING LOSSES OF THE MODEL
Training loss is the counterpart of evaluation loss: it represents the error or discrepancy between the model's predictions and the actual labels on the training set. It measures how well the model is fitting the training data. The goal during training is to minimize the training loss, indicating that the model is learning to make accurate predictions on the training data. Our model successfully minimized the training losses throughout the training process.
Figure 8 illustrates the training losses of our proposed model. This figure shows that the magnitude of the losses gradually decreases over time. The tasks DB-Pedia and AG-News calculated the model losses quickly and ended their contributions early due to having only 2 and 3 epochs, respectively.
During the initial training steps (0 to 949), the training losses were higher because the model was still in the learning phase and capturing the provided instances. However, once the model analyzed and learned from these instances, it retained this knowledge for future predictions. From step 949 to 4381, the training losses decreased for the remaining tasks: Yahoo, Yelp, and Amazon.
We observed some spikes in training loss, notably at step 2192 for the Yahoo task and at step 3565 for the Yelp task. These spikes are due to the introduction

of complex and new instances that the model had not encountered before.
Overall, the model's training losses show a clear downward trend, demonstrating effective learning and adaptation over time. This analysis confirms that the model can successfully learn from the training data and reduce errors as training progresses.

VI. RESULTS COMPARISON WITH SOTA
In this section, we present a comprehensive comparison of our proposed model's performance with existing SOTA models [29], [30], focusing specifically on the accuracy metric to evaluate the effectiveness of our approach in TC. For more clarity, we compared the performance metrics for each task across three different models, as shown in Tables 4 to 8. In contrast to our proposed approach, the existing models only report accuracy in their experimental discussions, neglecting other critical performance metrics. Notably, these models do not consider or discuss the F1 score, which is a vital parameter. Additionally, they fail to address key factors such as model evaluation loss, learning rate, and training losses, all of which are essential for a comprehensive assessment of any model's performance. To facilitate a visual representation of this comparison, we have included a bar graph, as depicted in Figure 10. The x-axis lists the compared models along with the datasets used, while the y-axis shows the corresponding accuracy of each model for these datasets. This graph provides a clear visualization of the distinctions between our proposed model and the compared SOTA models.

Table 4: Performance Metrics Comparison for Yahoo Task.

Model                F1 Score  Accuracy  Evaluation Loss  Learning Rate
Proposed Model       96.84     97.84     0.06             0.00003144
LSTM Model           —         92.5      —                —
Character-Level CNN  —         67.1      —                —

To ensure a fair and unbiased comparison, we carefully selected two relevant articles from the SOTA. These articles were chosen because they conducted experiments on the same datasets used in our evaluation, but with different models aiming for the same objectives.
In [29], the authors introduced a method to enhance TC performance by utilizing LSTM networks combined with word embedding techniques. Leveraging LSTM's ability to capture long-range dependencies and representing words as dense vectors through word embeddings, the authors aimed to improve the accuracy of TC tasks. Their experiments on the Yahoo, Yelp, Amazon, DB-Pedia, and AG-News datasets reported accuracy scores of 92.5, 94.4, 94.3, 91.67, and 91.43, respectively.

Table 5: Performance Metrics Comparison for Yelp Task.

Model                F1 Score  Accuracy  Evaluation Loss  Learning Rate
Proposed Model       96.66     97.98     0.06             0.00003189
LSTM Model           —         94.4      —                —
Character-Level CNN  —         88        —                —

Similarly, in [30], the authors addressed the task of binarizing word embeddings with minimal information loss using a technique called near-lossless binarization (NLB), which combines quantization and clustering methods. NLB aims to preserve semantic information by maintaining similarities between word embeddings in the binary space. The paper evaluated NLB on various downstream tasks such as SA and named entity recognition (NER), demonstrating its effectiveness in compressing word representations while preserving their quality. The specific model used for generating the word embeddings was a character-level CNN. Their experiments on the Yahoo, Yelp, Amazon, DB-Pedia, and AG-News datasets reported accuracy scores of 67.1, 88, 84.5, 97.4, and 88.1, respectively.
In our proposed approach, we conducted experiments on the same datasets, i.e., Yahoo, Yelp, Amazon, DB-Pedia, and AG-News, using the Distil-BERT model. We observed several performance parameters, but here we focus on comparing the accuracy scores. Our model achieved accuracy scores of 97.84, 97.98, 98.02, 98.32, and 96.22, respectively.
By comparing our proposed model with these SOTA results, we aim to provide a comprehensive understanding of the model's performance in terms of accuracy. This analysis offers valuable insights into the advancements and improvements achieved by our proposed model, reinforcing its potential contribution to the field of TC.

Table 6: Performance Metrics Comparison for Amazon Task.

Model                F1 Score  Accuracy  Evaluation Loss  Learning Rate
Proposed Model       95.82     98.02     0.06             0.00003144
LSTM Model           —         94.3      —                —
Character-Level CNN  —         84.5      —                —

Figure 10: Results Comparison

Table 7: Performance Metrics Comparison for DB-Pedia Task.

Model                F1 Score  Accuracy  Evaluation Loss  Learning Rate
Proposed Model       96.20     98.32     0.08             0.0001972
LSTM Model           —         91.67     —                —
Character-Level CNN  —         97.4      —                —

Table 8: Performance Metrics Comparison for AG-News Task.

Model                F1 Score  Accuracy  Evaluation Loss  Learning Rate
Proposed Model       94.78     96.22     0.06             0.0001511
LSTM Model           —         91.43     —                —
Character-Level CNN  —         88.1      —                —

VII. CONCLUSION AND FUTURE DIRECTIONS
Our research highlights the effectiveness of integrating pre-trained Distil-BERT with a memory-based CL technique for TC tasks. By incorporating task-specific linear layers and employing a dynamic setup, our approach allows for the seamless addition of new tasks, ensuring scalability and adaptability. We systematically save the best model after each training epoch to address the challenge of CF, preserving the performance of previously learned tasks. Our experiments, which involved various hyperparameters and settings, confirmed the reliability of our methodology. Notably, our approach maintains task independence, with no significant decline in performance across individual tasks despite shared language features. This indicates that the tasks remain largely unaffected by each other's specific features. These results underscore the potential of our approach to effectively support CL across a wide range of text-based tasks.
As a future direction, we aim to explore the transferability of our approach to other NLP tasks beyond TC, such as SA and machine translation. Additionally, we plan to investigate the impact of alternative pre-training techniques and model architectures on the performance of our approach. We also intend to optimize the computational efficiency of our methodology, particularly for large-scale datasets. Through these avenues, we anticipate further improving the performance and applicability of our approach within the field of NLP.

References
[1] Nandwani, P., Verma, R. (2021). A review on sentiment analysis and emotion detection from text. Social Network Analysis and Mining, 11(1), 81.
[2] Bordoloi, M., Biswas, S. K. (2023). Sentiment analysis: A survey on design framework, applications and future scopes. Artificial Intelligence Review, 56(11), 12505-12560.
[3] Wankhade, M., Rao, A. C. S., Kulkarni, C. (2022). A survey on sentiment analysis methods, applications, and challenges. Artificial Intelligence Review, 55(7), 5731-5780.
[4] Khurana, D., Koli, A., Khatter, K., Singh, S. (2023). Natural language processing: State of the art, current trends and challenges. Multimedia Tools and Applications, 82(3), 3713-3744.
[5] Es-Sabery, F., Es-Sabery, K., Qadir, J., Sainz-De-Abajo, B., Hair, A., García-Zapirain, B., De La Torre-Díez, I. (2021). A MapReduce opinion mining for COVID-19-related tweets classification using enhanced ID3 decision tree classifier. IEEE Access, 9, 58706-58739.


[6] Gunasekaran, K. P. (2023). Exploring sentiment analysis techniques in natural language processing: A comprehensive review. arXiv preprint arXiv:2305.14842.
[7] Hartmann, J., Huppertz, J., Schamp, C., Heitmann, M. (2019). Comparing automated text classification methods. International Journal of Research in Marketing, 36(1), 20-38.
[8] Es-sabery, F., Es-sabery, K., Garmani, H., Qadir, J., Hair, A. (2022). Evaluation of different extractors of features at the level of sentiment analysis. Infocommunications Journal, 14(2), 85-96.
[9] Duarte, J. M., Berton, L. (2023). A review of semi-supervised learning for text classification. Artificial Intelligence Review, 56(9), 9401-9469.
[10] Xu, Q. A., Chang, V., Jayne, C. (2022). A systematic review of social media-based sentiment analysis: Emerging trends and challenges. Decision Analytics Journal, 3, 100073.
[11] Hurtado, J., Salvati, D., Semola, R., Bosio, M., Lomonaco, V. (2023). Continual learning for predictive maintenance: Overview and challenges. Intelligent Systems with Applications, 200251.
[12] Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., Wermter, S. (2019). Continual lifelong learning with neural networks: A review. Neural Networks, 113, 54-71.
[13] Doan, H. G., Luong, H. Q., Ha, T. O., Pham, T. T. T. (2023). An efficient strategy for catastrophic forgetting reduction in incremental learning. Electronics, 12(10), 2265.
[14] Althnian, A., AlSaeed, D., Al-Baity, H., Samha, A., Dris, A. B., Alzakari, N., Kurdi, H. (2021). Impact of dataset size on classification performance: An empirical evaluation in the medical domain. Applied Sciences, 11(2), 796.
[15] Yao, X., Huang, T., Wu, C., Zhang, R. X., Sun, L. (2019). Adversarial feature alignment: Avoid catastrophic forgetting in incremental task lifelong learning. Neural Computation, 31(11), 2266-2291.
[16] Ke, Z., Liu, B., Xu, H., Shu, L. (2021). CLASSIC: Continual and contrastive learning of aspect sentiment classification tasks. arXiv preprint arXiv:2112.02714.
[17] Silva Barbon, R., Akabane, A. T. (2022). Towards transfer learning techniques—BERT, DistilBERT, BERTimbau, and DistilBERTimbau for automatic text classification from different languages: A case study. Sensors, 22(21), 8184.
[18] Ke, Z., Xu, H., Liu, B. (2021). Adapting BERT for continual learning of a sequence of aspect sentiment classification tasks. arXiv preprint arXiv:2112.03271.
[19] Liang, B., Luo, W., Li, X., Gui, L., Yang, M., Yu, X., Xu, R. (2021, October). Enhancing aspect-based sentiment analysis with supervised contrastive learning. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management (pp. 3242-3247).
[20] Rodrawangpai, B., Daungjaiboon, W. (2022). Improving text classification with transformers and layer normalization. Machine Learning with Applications, 10, 100403.
[21] Qasim, R., Bangyal, W. H., Alqarni, M. A., Almazroi, A. A. (2022). A fine-tuned BERT-based transfer learning approach for text classification. Journal of Healthcare Engineering, 2022.
[22] Jia, Q., Cui, J., Xiao, Y., Liu, C., Rashid, P., Gehringer, E. F. (2021). All-in-one: Multi-task learning BERT models for evaluating peer assessments. arXiv preprint arXiv:2110.03895.
[23] Tong, J., Wang, Z., Rui, X. (2022). A multimodel-based deep learning framework for short text multiclass classification with the imbalanced and minimal data set. Computational Intelligence and Neuroscience, 2022.
[24] Wang, S., Liu, J. (2021, December). Continual learning for sentiment classification by iterative networks combination. In Proceedings of the 2021 5th International Conference on Computer Science and Artificial Intelligence (pp. 150-155).
[25] Huang, Y., Zhang, Y., Chen, J., Wang, X., Yang, D. (2021). Continual learning for text classification with information disentanglement based regularization. arXiv preprint arXiv:2104.05489.
[26] Rochereau, C., Sagot, B., Dupoux, E. (2019). Neural language modeling of free word order argument structure. arXiv preprint arXiv:1912.00239.
[27] Trotta, D., Guarasci, R., Leonardelli, E., Tonelli, S. (2021). Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus. arXiv preprint arXiv:2109.12053.
[28] Bel, N., Punsola, M., Ruíz-Fernández, V. (2024, May). EsCoLA: Spanish Corpus of Linguistic Acceptability. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 6268-6277).
[29] Adamuthe, A. C. (2020). Improved text classification using long short-term memory and word embedding technique. International Journal of Hybrid Information Technology, 13(1), 19-32.
[30] Tissier, J., Gravier, C., Habrard, A. (2019, July). Near-lossless binarization of word embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 7104-7111).

SAHAR SHAH received his M.Sc. degree in Electronics from the Department of Electronics, University of Peshawar, Pakistan, and the M.Phil. degree in Electronics from the Department of Electronics, Quaid-i-Azam University, Islamabad, Pakistan, in 2016 and 2019, respectively. He is currently pursuing his doctoral degree in Computer Science at the Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca, Italy. He is serving as a visiting doctoral student at the Social Things Srl company, located in Milan, working as a developer of innovative ICT solutions and algorithms in the fields of Artificial Intelligence (AI), the Internet of Things (IoT), and Human-Computer Interaction. He is also a research collaborator with the School of Computing at the National University of Computer and Emerging Sciences (NUCES), Lahore, Pakistan, and with the Department of Computer Science, University of Hassan II, Casablanca, Morocco. Mr. Shah served as a lecturer in Electronics at the Higher Education Department of Pakistan and taught several Master-level subjects, such as Digital Logic Design, Computer Networking, and Data Communication. He has supervised and designed many projects, e.g., a digital voting machine, smart home appliances, a robotic arm to improve solar panel efficiency, and more. He has published several research articles in prestigious international journals, such as MDPI Symmetry, Micromachines, Hindawi Big Data, Scientific Programming, Industrial Internet of Things, and Wireless Communications and Mobile Computing, as well as in many conferences. His current research interests include Artificial Intelligence, Text Classification, Sentiment Analysis, Transformer Models, Fuzzy Systems, Continual Learning Techniques, Underwater Wireless Sensor Networks, and Cloud Computing. He is a reviewer for a number of prestigious international publishers, such as Springer, MDPI Sensors, and the International Journal of Distributed Sensor Networks (IJDSN).

SARA LUCIA MANZONI is an Associate Professor at the Department of Informatics, Systems and Communication (DISCo), University of Milano-Bicocca. Her main research results are in technology transfer projects in several areas of Artificial Intelligence, primarily knowledge-based reasoning in decision support systems and multi-agent systems approaches to the study of complex systems dynamics. She was Head of the AI lab at DISCo from 2007 to 2017 and is co-founder of CROWDYXITY SRL, a University of Milano-Bicocca spin-off.


FAROOQ ZAMAN is a Ph.D. scholar in the AI Lab, Information Technology University, Lahore. He received the M.Phil. degree (2017) in Computer Science from the Department of Computer Science, Quaid-i-Azam University, Islamabad, Pakistan. Prior to joining the Scientometrics Lab, he served as visiting faculty at Quaid-i-Azam University, Islamabad, Pakistan. His research interests are in the areas of text summarization, text simplification, and machine translation.

FATIMA ES SABERY received the Technical University degree from the Department of Computer Science, Higher School of Technology, Casablanca, Morocco, in 2013, the professional license in IT development from the Department of Computer Sciences, Faculty of Science, Casablanca, in 2014, and the master's degree in business intelligence from the Department of Computer Sciences, Sultan Moulay Sliman University, Beni Mellal, Morocco, in 2016. She has published several research papers in international conferences and journals, e.g., Fuzzy Information and Engineering, the International Journal of Informatics and Communication Technology, the 3rd International Conference on Networking, Information Systems and Security, and the 2019 International Conference on Intelligent Systems and Advanced Computing Sciences. Her general research interests include data mining, big data, wireless sensor networks, fuzzy systems, machine learning, deep learning, and the Internet of Things.

FRANCESCO EPIFANIA is CEO and founder of Social Things srl and Whoteach. He was a research fellow and professor at the Department of Computer Science of the University of Milan, where he obtained a Ph.D. in Computer Science together with three degrees: a bachelor's degree in Digital Communication and two master's degrees in Computer Science. He has carried out consultancy activities in the ICT field and has taught degree courses in Computer Science at the University of Milan, such as Foundations of Digital Communication, Systems for Computer-Aided Design, Multimedia Publishing, and Informatics laboratory courses. He has published more than 30 articles in national and international conferences and journals and has supervised more than 200 student theses in the computer science field. His research interests concern Human-Machine Interaction and Artificial Intelligence, in particular the evaluation, design, and development of intelligent, multichannel, and multimedia interactive systems for knowledge enrichment, as well as the evaluation of Recommender Systems based on user data.

ITALO FRANCESCO ZOPPIS holds a Ph.D. in Computer Science from Università degli Studi di Milano. He is currently an Associate Professor of Computer Science at the University of Milano-Bicocca. His research focuses on translational knowledge discovery from biological and clinical datasets, mainly applying machine learning techniques to enhance medical decisions for patient well-being.
