
Comparison of BERT and XLNet accuracy with classical methods and algorithms in text classification

1st Neli Arabadzhieva-Kalcheva
Department of Software and Internet Technologies
Technical University of Varna
Varna, Bulgaria
[email protected]

2nd Ivelin Kovachev
Technical University of Varna
Varna, Bulgaria
[email protected]

2021 International Conference on Biomedical Innovations and Applications (BIA), ©2022 IEEE, DOI: 10.1109/BIA52594.2022.9831281

Abstract—The aim of this publication is to compare the accuracy of the Bidirectional Encoder Representations from Transformers (BERT) and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models in text classification with the accuracy of classical machine learning methods and algorithms. The following are analyzed: Bidirectional Encoder Representations from Transformers (BERT), Generalized Autoregressive Pretraining for Language Understanding (XLNet), the Bernoulli Naive Bayes classifier, the Gaussian Naive Bayes classifier, the Multinomial Naive Bayes classifier, and Support Vector Machines. The results show that when classifying 50,000 reviews in English, XLNet achieves the highest accuracy, 96%, which is nearly 8% more than the best-performing classical classifier, Support Vector Machines.

Keywords—Bidirectional Encoder Representations from Transformers, BERT, Generalized Autoregressive Pretraining for Language Understanding, XLNet, Bernoulli Naive Bayes classifier, Gaussian Naive Bayes classifier, Multinomial Naive Bayes classifier, Support Vector Machines, text classification, machine learning

I. INTRODUCTION

The widespread use of social media has provided a means for consumers to post opinions, reviews and ratings of goods and services. The growing volume of comments has made it impossible to process them manually. Another current practical problem is the variety of online news sites that often copy their news from each other, as well as from unreliable external sources. Not infrequently, this news is of obscure origin, manipulated, or false. An important task is therefore to be able to filter fake news out of the Internet. The use of machine learning algorithms for text classification reduces the time needed to research and analyze user opinion and to assess the reliability of information from news sites.

The task of text classification is to determine the affiliation of a text to a class, based on the analysis of a set of features that characterize the text.

II. CLASSIFIERS IN MACHINE LEARNING

The importance of machine learning has been thoroughly covered over the past decade. It is a vital topic of research, and its use in practical applications is steadily increasing. In most studies, the classical method with the best results on the selected metrics is Support Vector Machines, often followed by the Naive Bayes classifier [1]; this motivated the authors to use them in this study as the representatives of classical methods and algorithms: the Bernoulli Naive Bayes classifier, the Gaussian Naive Bayes classifier, the Multinomial Naive Bayes classifier, and Support Vector Machines.

A. BERT

BERT is a creation of a Google NLP team and was published in 2018 by Jacob Devlin and his co-researchers. It was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. This makes the pre-trained BERT model highly adaptable: with the addition of another output layer, it can be plugged into a variety of applications. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, from language translation to text classification [2, 3].
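
As an illustration of that fine-tuning step, here is a minimal sketch using the Hugging Face transformers library with PyTorch; this is an assumption made for illustration, since the paper itself only reports using tensorflow.org libraries, and the review texts and labels below are invented:

```python
# Minimal fine-tuning sketch (assumes the transformers and torch packages;
# not the paper's own implementation).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 attaches the single additional output layer described above.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

batch = tokenizer(["a wonderful, moving film", "dull and predictable"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive review, 0 = negative review

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy computed internally
loss.backward()
optimizer.step()
```
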
of selected metrics among the classical methods is that of
support vectors, often followed by the Naive Bayesian
𝑐𝑜𝑝𝑡 = 𝑎𝑟𝑔𝑚𝑎𝑥 𝑃(𝑥|𝑦 = 𝑐)𝑃(𝑦 = 𝑐) (8)
B. XLNet

XLNet uses autoregressive pre-training that facilitates learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order; thanks to its autoregressive nature, it manages to ameliorate the deficiencies of BERT. A distinctive feature of XLNet is that it integrates ideas from the autoregressive Transformer-XL model into its pretraining [4]. Another advantage over BERT is its permutation language modelling, which takes into account the interdependency between tokens and not just the context of the previous token [5].
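
The permutation idea can be shown with a toy sketch (a conceptual illustration only, not the actual XLNet training code): each token is predicted from the tokens that precede it in a randomly sampled factorization order, rather than in left-to-right position.

```python
import numpy as np

# Toy sketch of permutation language modeling: sample a factorization order
# and condition each prediction on the tokens already seen in that order.
tokens = ["the", "film", "was", "great"]
rng = np.random.default_rng(0)
order = rng.permutation(len(tokens))

for step, pos in enumerate(order):
    context = [tokens[p] for p in order[:step]]
    print(f"predict {tokens[pos]!r} given {context}")
```
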

C. Naive Bayes classifier [6]

A foundational algorithm in machine learning is the Naive Bayes classifier. It is derived from Bayes' theorem, which gives the posterior probability of an event. We are trying to find the class to which an object $x$ most probably belongs, so out of all the classes we choose the one with the largest probability $P(y = c \mid x)$:

$$c_{opt} = \arg\max_{c \in C} P(x \mid y = c)\, P(y = c) \qquad (8)$$
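
In code, rule (8) is simply an argmax over log-probabilities; the prior and likelihood values below are hypothetical:

```python
import numpy as np

# Eq. (8) as an argmax over log P(x | y=c) + log P(y=c) for two classes.
log_prior = np.log(np.array([0.5, 0.5]))         # hypothetical P(y=c)
log_likelihood = np.log(np.array([0.02, 0.08]))  # hypothetical P(x | y=c)

c_opt = int(np.argmax(log_likelihood + log_prior))
print(c_opt)  # -> 1: the class with the largest joint probability
```
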
The Naive Bayes classifier has a number of variants, each making a different assumption about how the features are distributed.

a) Bernoulli Naive Bayes classifier [7]

The Bernoulli Naive Bayes classifier represents a document as a binary vector over the vocabulary: when a word appears at least once in the document, it is signified by the value one. The naive assumption is that the appearance of each word is independent of the occurrence of the other words.

Let $x \in \{0, 1\}$ be a binary variable that occurs with probability $\theta$. Then:

$$P(x \mid \theta) = \begin{cases} \theta, & \text{if } x = 1 \\ 1 - \theta, & \text{if } x = 0 \end{cases} \qquad (9)$$
b) Multinomial Naive Bayes classifier

The assumption in this classifier is a multinomial distribution of the features. Let $x_i \in \{1, \dots, K\}$ have emission probabilities $\theta_1, \dots, \theta_K$. Then the probability of an event $x$ occurring, given $\theta$, is:

$$P(x \mid \theta) = \frac{n!}{x_1! \cdots x_K!} \prod_{i=1}^{K} \theta_i^{x_i} \qquad (10)$$

where:

$$n = \sum_{i=1}^{K} x_i \qquad (11)$$

We make the naive assumption that the context and position of a word have no effect on its overall likelihood of occurring in the text; we therefore model word frequencies with a multinomial Naive Bayes classifier [8].
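
A corresponding scikit-learn sketch, mirroring the "Multinomial tf-idf n-grams" configuration named in Section IV; the exact n-gram range is not stated in the paper, so (1, 2) is an assumption, and the toy data is invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["a wonderful film", "a dreadful film",
        "wonderful acting", "dreadful pacing"]
labels = [1, 0, 1, 0]  # hypothetical sentiment labels

# tf-idf over word uni- and bigrams feeding a multinomial Naive Bayes model.
mnb = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
mnb.fit(docs, labels)
print(mnb.predict(["wonderful pacing"]))
```
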
c) Gaussian Naive Bayes classifier

The Gaussian Naive Bayes classifier is based on the assumption that the features have a normal distribution [9]. The probability density function of a normally distributed random variable is:

$$P(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \qquad (12)$$

where $\mu$ is the mean:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad (13)$$

and $\sigma$ is the standard deviation:

$$\sigma^2 = \frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \mu)^2 \qquad (14)$$
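
A minimal sketch with scikit-learn's GaussianNB; note that it expects dense, real-valued feature vectors, so sparse tf-idf matrices would have to be densified first. The toy feature rows below are invented:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy dense rows standing in for tf-idf vectors; GaussianNB fits a mean and
# variance per feature and class, in the spirit of eqs. (12)-(14).
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
y = np.array([1, 1, 0, 0])

gnb = GaussianNB().fit(X, y)
print(gnb.predict([[0.85, 0.15]]))  # -> [1]
```
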
D. Support Vector Machine [10]

The SVM classification algorithm is another foundational machine learning algorithm, in which the (linearly separable) training examples are points in n-dimensional space. The goal is to divide the data into two classes using a hyperplane that lies at the greatest possible distance from the two distinct classes. For this to happen, the documents must already be labeled. The linear classification function $f(x)$ looks like this [11]:

$$f(x) = w^T x + b \qquad (15)$$

where $w$ is a weight vector and $b$ is a bias. The goal is to find the values of $w$ and $b$ that determine the classifier: the distance from the hyperplane to the closest points (the support vectors) must be maximized.
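
A minimal sketch of this setup with scikit-learn, mirroring the "Support Vector Machines tf-idf n-grams" configuration named in Section IV; the toy data and the (1, 2) n-gram range are assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["wonderful and moving", "tedious and predictable",
        "a wonderful surprise", "predictable and dull"]
labels = [1, 0, 1, 0]  # hypothetical sentiment labels

# Linear decision function f(x) = w^T x + b of eq. (15) over tf-idf n-grams.
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
svm.fit(docs, labels)
print(svm.predict(["a moving surprise"]))  # -> [1]
```
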

III. ASSESSMENT OF THE ACCURACY OF THE CLASSIFICATION

When data is classified, it is divided into two sets, one for training and one for testing. The former is used to fit the model, the latter to verify it. The ratio of correctly classified examples to the total number of objects in the test set, or Accuracy, is used to evaluate the training:

$$Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \qquad (16)$$

where:

True Positive (TP) – an object of the Positive class correctly classified as Positive;
True Negative (TN) – an object of the Negative class correctly classified as Negative;
False Positive (FP) – an object of the Negative class incorrectly classified as Positive;
False Negative (FN) – an object of the Positive class incorrectly classified as Negative.
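
Equation (16) maps directly onto a confusion matrix; a short sketch with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical test-set labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # hypothetical classifier output

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + fp + tn + fn)  # eq. (16)
assert accuracy == accuracy_score(y_true, y_pred)
print(accuracy)  # 0.75
```
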
Vector Machines classifies movie reviews with the highest
accuracy compared to other classic algorithms, but with a ACKNOWLEDGEMENT
difference of 8.5% compared to the best. Varieties of the
This paper is supported by the Scientific Project
Naive Bayesian Classifier have small differences in accuracy.
“Classification of a text in Bulgarian using machine learning
The multinomial Naive Bays Classifier has an advantage of
methods”, Technical University - Varna, 2022, financed by
less than 1% of the Gaussian Naive Bays Classifier. The
the Ministry of Education and Science.
algorithm was classified with the lowest accuracy by
Bernoulli's Naive Bays Classifier, which has a difference of REFERENCES
13.4% compared to the first one. [1] N. Kalcheva, “Overview of publications with comparative analysis of
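
A minimal sketch of loading this corpus with scikit-learn: the Stanford archive unpacks into train/ and test/ folders with pos/ and neg/ subfolders of 12,500 reviews each, and the local path below is hypothetical.

```python
from sklearn.datasets import load_files

# load_files maps the pos/ and neg/ folder names to class labels; restricting
# categories skips the unlabeled "unsup" folder shipped with the training set.
train = load_files("aclImdb/train", categories=["neg", "pos"], encoding="utf-8")
test = load_files("aclImdb/test", categories=["neg", "pos"], encoding="utf-8")
print(len(train.data), len(test.data))  # 25000 25000
```
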
methods and algorithms for classification of text in machine training”,
The results (Fig. 1) show that the XLNet model classifies with the highest accuracy, with a 2% advantage over the next algorithm. The second most accurate is AlBERT-XLarge, followed by BERT-Large, with a small difference of 1% between them. BERT-Base is in fourth place, 1.5% behind third place and approximately 4.3% behind the best performer. Support Vector Machines classifies the movie reviews with the highest accuracy among the classical algorithms, but trails the best model by 8.5%. The varieties of the Naive Bayes classifier show small differences in accuracy: the Multinomial Naive Bayes classifier has an advantage of less than 1% over the Gaussian Naive Bayes classifier. The lowest accuracy is achieved by the Bernoulli Naive Bayes classifier, 13.4% below the first-ranked model.

[Fig. 1: bar chart of classifier accuracy. Reported values: 96.00%, 94.14%, 92.98%, 91.62%, 87.53%, 85.29%, 84.70%, 82.60%, corresponding (per the text above) to XLNet, AlBERT-XLarge, BERT-Large, BERT-Base, SVM, MNB, GNB and BNB.]

Fig. 1. Accuracy of machine learning classifiers in classifying reviews

V. CONCLUSION

The classification of short texts is a challenging task in natural language processing. Existing classical methods and algorithms have achieved very good results, but accuracy still needs to be improved. The research shows that when classifying 25,000 user reviews in English, the accuracy of the Generalized Autoregressive Pretraining for Language Understanding (XLNet) model is 96%, the highest among the surveyed models, algorithms and methods, namely: Bidirectional Encoder Representations from Transformers (BERT), the Bernoulli Naive Bayes classifier, the Gaussian Naive Bayes classifier, the Multinomial Naive Bayes classifier and Support Vector Machines. The advantage of the neural network methods using the Transformer architecture comes from their consideration of the context of a word: the words that surround it and their positions relative to it. The advantage of XLNet over BERT is due to the much deeper context (up to 512 × 24 = 12,288 words for XLNet vs. 512 words for BERT), which it uses to assess the meaning of words numerically, as well as its more effective training method, Permutation Language Modeling. This study provides a guide to selecting machine learning techniques for text classification.

ACKNOWLEDGEMENT

This paper is supported by the Scientific Project "Classification of a text in Bulgarian using machine learning methods", Technical University of Varna, 2022, financed by the Ministry of Education and Science.

REFERENCES

[1] N. Kalcheva, "Overview of publications with comparative analysis of methods and algorithms for classification of text in machine training", Computer Science and Technologies Journal, Technical University of Varna, Bulgaria, vol. 1, 2020, pp. 118-125.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, MN, USA, Association for Computational Linguistics, 2019, pp. 4171-4186.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need", Advances in Neural Information Processing Systems, 2017.
[4] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding", Advances in Neural Information Processing Systems, 2019.
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context", arXiv preprint arXiv:1901.02860, 2019.
[6] N. Kalcheva, M. Todorova, and G. Marinova, "Naive Bayes classifier, Decision Tree and AdaBoost ensemble algorithm. Advantages and disadvantages", 6th International Conference ERAZ 2020 – Knowledge Based Sustainable Development, online/virtual, Association of Economists and Managers of the Balkans, Belgrade, Serbia, May 21, 2020.
[7] K. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
[8] A. McCallum and K. Nigam, "A comparison of event models for Naive Bayes text classification", Papers from the 1998 AAAI Workshop, 1998.
[9] S. Raschka, "Naive Bayes and text classification I – Introduction and theory", Cornell University Library, Ithaca, 2014.
[10] N. Kalcheva, M. Karova, and I. Penev, "Comparison of the accuracy and the execution time of classification algorithms for Bulgarian literary works", International Conference "Automatics and Informatics 2020", Varna, Bulgaria, 2020.
[11] C. Aggarwal and C. Zhai, "A survey of text classification algorithms", in Mining Text Data, Springer, Boston, MA, 2012, pp. 163-222.
