Comparison of BERT and XLNet accuracy with classical methods and algorithms in text classification
Abstract—The aim of this publication is to compare the accuracy of the Bidirectional Encoder Representations from Transformers (BERT) and Generalized Autoregressive Pretraining for Language Understanding (XLNet) models in text classification with the accuracy of classical machine learning methods and algorithms. Analyzed: Bidirectional Encoder Representations from Transformers (BERT), Generalized Autoregressive Pretraining for Language Understanding (XLNet), Bernoulli Naive Bayes classifier, Gaussian Naive Bayes classifier, Multinomial Naive Bayes classifier, Support Vector Machines. The results show that when classifying 50,000 reviews in English, XLNet ranks with the highest accuracy, 96%, which is about 8.5% more than the best-performing classical classifier, Support Vector Machines.

Keywords—Bidirectional Encoder Representations from Transformers, BERT, Generalized Autoregressive Pretraining for Language Understanding, XLNet, Bernoulli Naive Bayes classifier, Gaussian Naive Bayes classifier, Multinomial Naive Bayes classifier, Support Vector Machines, text classification, machine learning

I. INTRODUCTION

The widespread use of social media has provided a means for consumers to post opinions, reviews and ratings of goods and services. The growing volume of comments has made it impossible to process them manually. Another current practical problem is the variety of online news sites that often copy their news from each other, as well as from unreliable external sources. Not infrequently, this news is of obscure origin, manipulated, or false. An important task is to be able to filter out fake news from the Internet. The use of machine learning algorithms for text classification reduces the time needed to research and analyze user opinion and to assess the reliability of information from news sites.

The task of text classification is to determine the affiliation of a text to a class based on the analysis of a set of features that characterize the text.

II. CLASSIFIERS IN MACHINE LEARNING

The question of the importance of machine learning has been thoroughly covered over the past decade. It is a vital topic of research, with its use in practical applications steadily increasing. In most studies, the best results on the selected metrics among the classical methods are achieved by Support Vector Machines, often followed by the Naive Bayes classifier [1], which motivated the authors of this study to use those as representatives of classical methods and algorithms: Bernoulli Naive Bayes classifier, Gaussian Naive Bayes classifier, Multinomial Naive Bayes classifier and Support Vector Machines.

A. BERT

BERT is a creation of a Google NLP team and was published in 2018 by Jacob Devlin and his co-researchers. It was designed "to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers." This makes the pre-trained BERT model highly adaptable: with the addition of another output layer, it can be plugged into a variety of computational applications. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, from language translation to text classification [2, 3].

Another model to consider is ALBERT. It has a specific way of solving memory bottlenecks (GPU/TPU) and long training time problems: it incorporates techniques that minimize memory usage and shorten BERT training, since the performance of downstream tasks can be affected by the size of the model. The most important change is that the same transformer block is used at each model layer.

B. XLNet

XLNet is an autoregressive pre-training method that facilitates learning bidirectional contexts by optimizing the expected likelihood over all permutations of the factorization order; because of its autoregressive nature, it manages to ameliorate the deficiencies of BERT. A feature of XLNet is its integration of ideas from the autoregressive model Transformer-XL into pretraining [4]. Another advantage over BERT is its incorporation of permutation language modelling, which takes into account the interdependency between tokens and not just the context of the previous token [5].

C. Naive Bayes classifier [6]

A foundational algorithm in machine learning is the Naive Bayes classifier. It was developed out of Bayes' theorem, which helps find the posterior probability of an event.

We are trying to find which class the object x most probably belongs to, so out of all the classes we choose the one with the largest probability p(y = c | x):

    c_{opt} = \arg\max_{c \in C} P(x \mid y = c)\, P(y = c)    (8)
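To make decision rule (8) concrete, the following is a minimal sketch, in which the toy corpus, labels and variable names are purely illustrative and not taken from this study, of how a Multinomial Naive Bayes classifier applies the rule to text using scikit-learn:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Tiny illustrative corpus (not the paper's dataset)
    train_texts = ["great movie, loved it",
                   "terrible film, a waste of time",
                   "wonderful acting and story",
                   "boring and badly made"]
    train_labels = ["positive", "negative", "positive", "negative"]

    vectorizer = CountVectorizer()            # word counts form the feature vector x
    X_train = vectorizer.fit_transform(train_texts)

    clf = MultinomialNB()                     # estimates P(y = c) and P(x | y = c)
    clf.fit(X_train, train_labels)

    X_new = vectorizer.transform(["loved the story"])
    print(clf.predict(X_new))                 # the class c maximizing P(x|y=c) P(y=c)

Internally, predict() evaluates the log of the product in (8) for each class and returns the argmax, exactly as the decision rule prescribes.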
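As a hedged sketch of the kind of evaluation reported below, the following shows how the classical classifiers could be scored on a held-out half of the review data with scikit-learn. The function name, the TF-IDF feature choice and the use of LinearSVC are assumptions made for illustration, not the authors' actual setup:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import BernoulliNB, MultinomialNB
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    def evaluate(train_texts, train_labels, test_texts, test_labels):
        # Turn the raw review texts into sparse TF-IDF feature vectors
        vectorizer = TfidfVectorizer()
        X_train = vectorizer.fit_transform(train_texts)
        X_test = vectorizer.transform(test_texts)
        # GaussianNB is omitted here because it requires dense input,
        # which is costly for 25,000 high-dimensional review vectors.
        for name, clf in [("Bernoulli NB", BernoulliNB()),
                          ("Multinomial NB", MultinomialNB()),
                          ("Linear SVM", LinearSVC())]:
            clf.fit(X_train, train_labels)
            accuracy = accuracy_score(test_labels, clf.predict(X_test))
            print(f"{name}: {accuracy:.2%}")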
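Similarly, the fine-tuning recipe described in Section II, a pre-trained encoder plus one added output layer, might be instantiated with the Hugging Face transformers library as follows; the checkpoint names and the two-class head are assumptions, not the exact configuration used in this study:

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    # Load a pre-trained encoder and add a single classification head on top;
    # "bert-base-uncased" is the standard public checkpoint, not necessarily
    # the configuration the authors trained.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)

    inputs = tokenizer(["A wonderful, moving film."], return_tensors="pt",
                       truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits   # scores for the two sentiment classes
    print(logits.argmax(dim=-1))
    # The same pattern applies to XLNet via the "xlnet-base-cased" checkpoint.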
Half of the dataset, 25,000 reviews (12,500 positive and 12,500 negative), was used for training and the other half for testing. Data taken from Stanford University (https://ai.stanford.edu/~amaas/data/sentiment/).

The results show (Fig. 1) that the XLNet model classifies with the highest accuracy, with a 2% advantage over the next algorithm. The second most accurate is ALBERT-XLarge, followed by BERT-Large with a small difference of 1% between them. BERT-Base is in fourth place, 1.5% behind third place and approximately 4.3% below the best performer. Support Vector Machines classifies the movie reviews with the highest accuracy among the classical algorithms, but with a difference of 8.5% compared to the best. The varieties of the Naive Bayes classifier differ little in accuracy: the Multinomial Naive Bayes classifier has an advantage of less than 1% over the Gaussian Naive Bayes classifier. The lowest accuracy is that of the Bernoulli Naive Bayes classifier, which is 13.4% below the first-ranked model.

[Fig. 1. Accuracy of machine learning classifiers in classifying reviews: XLNet 96.00%, ALBERT-XLarge 94.14%, BERT-Large 92.98%, BERT-Base 91.62%, Support Vector Machines 87.53%, Multinomial Naive Bayes 85.29%, Gaussian Naive Bayes 84.70%, Bernoulli Naive Bayes 82.60%.]

V. CONCLUSION

The classification of short texts is a challenging task in natural language processing. Existing classical methods and algorithms have achieved very good results, but accuracy still needs to be improved. This research shows that when classifying 25,000 user reviews in English, the accuracy of the Generalized Autoregressive Pretraining for Language Understanding (XLNet) model is 96%, the highest among the surveyed models, algorithms and methods, namely: Bidirectional Encoder Representations from Transformers (BERT), Bernoulli Naive Bayes classifier, Gaussian Naive Bayes classifier, Multinomial Naive Bayes classifier and Support Vector Machines. The advantage of neural network methods using the "Transformer" architecture is due to their consideration of the context of a word - the words that surround it and the positions they occupy relative to it. The advantage of XLNet over BERT is due to the much deeper context (up to 512 × 24 = 12,288 words for XLNet vs. 512 words for BERT), which it uses to assess the meaning of words numerically, as well as to its more effective training method, "Permutation Language Modeling". This study provides a guide to selecting machine learning techniques for text classification.

ACKNOWLEDGEMENT

This paper is supported by the Scientific Project "Classification of a text in Bulgarian using machine learning methods", Technical University - Varna, 2022, financed by the Ministry of Education and Science.

REFERENCES

[1] N. Kalcheva, "Overview of publications with comparative analysis of methods and algorithms for classification of text in machine training", Computer Science and Technologies Journal, Technical University of Varna, Bulgaria, vol. 1, 2020, pp. 118-125.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding", in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, Minneapolis, MN, USA, Association for Computational Linguistics, 2019, pp. 4171-4186.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need", Advances in Neural Information Processing Systems, 2017.
[4] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le, "XLNet: Generalized autoregressive pretraining for language understanding", Advances in Neural Information Processing Systems, 2019.
[5] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, "Transformer-XL: Attentive language models beyond a fixed-length context", arXiv preprint arXiv:1901.02860, 2019.
[6] N. Kalcheva, M. Todorova, and G. Marinova, "Naive Bayes Classifier, Decision Tree and AdaBoost Ensemble Algorithm. Advantages and disadvantages", 6th International Conference - ERAZ 2020 - Knowledge Based Sustainable Development, online/virtual, Association of Economists and Managers of the Balkans, Belgrade, Serbia, May 21, 2020.
[7] K. Murphy, "Machine Learning: A Probabilistic Perspective", 2012.
[8] A. McCallum and K. Nigam, "A comparison of event models for Naive Bayes text classification", Papers from the 1998 AAAI Workshop, 1998.
[9] S. Raschka, "Naive Bayes and text classification I - Introduction and theory", Cornell University Library, Ithaca, 2014.
[10] N. Kalcheva, M. Karova, and I. Penev, "Comparison of the accuracy and the execution time of classification algorithms for Bulgarian literary works", International Conference "Automatics and Informatics'2020", Varna, Bulgaria, 2020.
[11] C. Aggarwal and C. Zhai, "A survey of text classification algorithms", Mining Text Data, Springer, Boston, MA, 2012, pp. 163-222.