A Hybrid Transformer Model for Fake News Detection Leveraging Bayesian Optimization and Bidirectional Recurrent Unit
Abstract—In this paper, we propose an optimized Transformer model that integrates a Bayesian algorithm with a bidirectional gated recurrent unit (BiGRU), and we apply it to fake news classification for the first time. First, we employ the TF-IDF method to extract features from news texts and transform them into numeric representations to facilitate subsequent machine learning tasks. Two sets of experiments are then conducted for fake news detection and classification: one using a Transformer model optimized only with the BiGRU, and the other incorporating the Bayesian algorithm into the BiGRU-based Transformer. Experimental results show that the BiGRU-optimized Transformer achieves 100% accuracy on the training set and 99.67% on the test set, while the addition of the Bayesian algorithm maintains 100% accuracy on the training set and slightly improves test-set accuracy to 99.73%. This indicates that the Bayesian algorithm boosts test accuracy by 0.06 percentage points, further enhancing the detection capability for fake news. Moreover, the proposed algorithm converges rapidly, reaching nearly 100% accuracy by around the 10th training epoch, demonstrating both its effectiveness and its fast classification ability. Overall, the optimized Transformer model, enhanced by the Bayesian algorithm and the BiGRU, exhibits excellent continuous learning and detection performance, offering a robust technical means to combat the spread of fake news in the current era of information overload.

Index Terms—Bayesian algorithm; fake news detection; Transformer; BiGRU.

∗ Corresponding author: [email protected]
I. INTRODUCTION

The rapid expansion of the Internet and social media has significantly accelerated the spread of fake news, posing serious challenges across social, political, and economic domains [1]. Defined as misleading or fabricated content designed to attract attention, manipulate opinions, or serve specific agendas, fake news propagates rapidly through digital communication networks, often leading to misinformation crises. For instance, during elections, fake news influences voter decision-making and undermines democratic integrity. Consequently, automated fake news detection has become a critical research focus in information science, data science, and computational social science [2].

Among the various approaches explored, machine learning (ML) has emerged as a key solution, enabling automated analysis and classification of misinformation. ML models extract distinguishing features from historical and spatiotemporal data [3], including linguistic patterns (e.g., sentiment analysis, word frequency), user interaction metrics (e.g., engagement levels, virality), and source credibility [4]. Common ML-based classifiers include support vector machines (SVM), decision trees, random forests, and neural networks, while deep learning architectures such as convolutional neural networks (CNN), recurrent neural networks (RNN), RoBERTa, DeBERTa, and T5 have demonstrated superior performance in handling complex textual data [5]. Compared to traditional approaches, these models offer enhanced accuracy and robustness in identifying misinformation.

To address the data scarcity challenge in fake news detection, semi-supervised learning and transfer learning techniques have been employed to leverage both labeled and unlabeled data. Additionally, recent advances in large language models (LLMs) have significantly improved detection capabilities by integrating multimodal learning, adversarial training, and chain-of-reasoning techniques [6]–[9]. Retrieval-Augmented Generation (RAG) has been proposed to further improve the performance of LLMs [10]. Pre-trained LLMs such as GreenPLM not only show strong performance but are also inexpensive to train [11]. However, challenges remain in adapting to evolving misinformation trends and in managing the uncertainty of large language models [12].
In this paper, we propose an optimized Transformer-based model that incorporates Bayesian inference and bidirectional gated recurrent units (BiGRUs) to enhance fake news classification accuracy. To the best of our knowledge, this is the first application of this combination to misinformation detection [13].
II. DATA SOURCES

The dataset selected in this paper comes from an open-source Kaggle dataset containing 5,000 rows of data in two categories: real news and fake news. The dataset has been tested by numerous experimenters on Kaggle and allows a clear comparison of the strengths and weaknesses of different algorithms. A sample of the data is shown in Table I.
TABLE I
SOME OF THE DATA

1) Text: Trump says healthcare reform push may need additional money. WASHINGTON (Reuters) - President Donald Trump on Tuesday said that the Republican push to repeal Obamacare may require additional money for healthcare, but he did not specify how much more funding would be needed or how it might be used. Trump told Republican Senators joining him for lunch at the White House that their planned healthcare reform bill would need to be “generous” and “kind.” “That may be adding additional money into it,” Trump said, without offering further details. [14]
   Type: Real

2) Text: China’s Xi, Trump discuss ‘global hot-spot issues’: Xinhua. BEIJING (Reuters) - Chinese President Xi Jinping and U.S. President Donald Trump on Saturday discussed “global hot-spot issues” on the sidelines of the G20 summit in the German city of Hamburg, state news agency Xinhua said. It did not immediately give any other details.
   Type: Real

3) Text: Trump has talked to top lawmakers about immigration reform: White House. WASHINGTON (Reuters) - U.S. President Donald Trump has spoken to congressional leaders about immigration reform and is confident that Congress will take action to deal with the status of illegal immigrants who have grown up in the United States, the White House said on Tuesday. “We have confidence that Congress is going to step up and do their job,” White House spokeswoman Sarah Sanders told a briefing shortly after the administration scrapped a program that protected from deportation some 800,000 young people who grew up in the United States. “This is something that needs to be fixed legislatively and we have confidence that they’re going to do that,” Sanders said, adding that Trump was willing to work with lawmakers on immigration reform, which she said should include several “big fixes,” not just one tweak to the system.
   Type: Real
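For concreteness, such a dataset can be loaded and labeled in a few lines of Python; this is a hypothetical sketch, since the paper does not name the exact Kaggle file or its column headers:

    import pandas as pd

    # Hypothetical file and column names; the paper does not specify them.
    df = pd.read_csv("fake_news.csv")            # columns assumed: "text", "type"
    texts = df["text"].tolist()
    labels = (df["type"] == "Fake").astype(int)  # 1 = fake, 0 = real
    print(len(texts), "news items loaded")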
III. TEXT FEATURE EXTRACTION

Term Frequency-Inverse Document Frequency (TF-IDF) is probably the most common feature extraction method applied to text in both natural language processing and information retrieval. TF-IDF measures the importance of a word in a particular document relative to its prevalence across the document set. More precisely, TF (term frequency) measures how often a word occurs within a document, while IDF (inverse document frequency) measures the rarity of a word across the full set of documents [15]. The TF-IDF score of each word is the product of the two, indicating how relevant that word is to the given document. The process consists of data preprocessing (such as word segmentation, stop-word removal, and stemming), followed by computing the term frequency and inverse document frequency for each document, and finally producing a sparse matrix of TF-IDF values for every word across all documents [16]. These values, converted to numeric form, serve as input features for the subsequent machine learning classification.
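As an illustration only (the authors report running their experiments in MATLAB; see Section VI), this pipeline can be sketched in Python with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = ["Trump says healthcare reform push may need additional money ...",
              "China's Xi, Trump discuss 'global hot-spot issues': Xinhua ..."]

    # Tokenization, stop-word removal, and TF-IDF weighting in one step;
    # the result is a sparse document-term matrix of TF-IDF scores.
    vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)
    X = vectorizer.fit_transform(corpus)         # shape: (n_documents, n_terms)
    print(X.shape)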
IV. METHOD

A. Bayesian algorithm

The Bayesian algorithm is a statistical inference method based on Bayes’ theorem, used for classification and probabilistic model construction. Its core idea is to evaluate the probability of an event by combining prior information with newly observed data. The principle of the Bayesian algorithm is illustrated in Fig. 1. The Bayesian approach, which adjusts beliefs as new evidence is observed during reasoning, is flexible and efficient.

Fig. 1. The principle diagram of the Bayes algorithm.

In a Bayesian framework, we usually start from prior knowledge, which may be derived from historical data or the experience of domain experts. A prior probability quantifies our initial belief that an event will occur in the absence of observational data. When new observations arrive, the Bayesian algorithm updates this belief, producing a posterior probability. This updating process weights the information provided by the data, which means that even if the prior knowledge is poor, the predictive ability of the model gradually improves as more data are added [17].

Bayesian reasoning is particularly suited to problems involving uncertainty and complexity. Most practical problems involve incomplete knowledge of existing and observational data. Bayesian algorithms remain strong in such settings: by exploiting prior distributions, they can maintain accurate performance even with small amounts of data. This flexibility makes Bayesian methods applicable across widely varying domains, including but not limited to medicine, finance, machine learning, and natural language processing.
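To make the update concrete, here is a small self-contained Python example of a single Bayes’-rule update; all the probabilities are invented for illustration and do not come from the paper’s dataset:

    # P(fake | word) from a prior and two likelihoods (all values hypothetical).
    prior_fake = 0.5                   # P(fake): belief before seeing evidence
    p_word_given_fake = 0.08           # P(word | fake)
    p_word_given_real = 0.02           # P(word | real)

    # Total probability of the evidence, then the posterior via Bayes' theorem.
    p_word = p_word_given_fake * prior_fake + p_word_given_real * (1 - prior_fake)
    posterior_fake = p_word_given_fake * prior_fake / p_word
    print(f"P(fake | word) = {posterior_fake:.2f}")   # 0.80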
B. Bidirectional gated recurrent unit

The bidirectional gated recurrent unit (BiGRU) is an improved recurrent neural network (RNN) structure for processing sequence data, such as in natural language processing and time-series prediction. Fig. 2 shows the schematic diagram of the bidirectional gated recurrent unit. Unlike a traditional one-way recurrent neural network, the bidirectional gated recurrent structure improves the model’s understanding of context by considering both the forward and backward information of the sequence [18].
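A minimal PyTorch sketch of a BiGRU text encoder follows; the layer sizes and classification head are illustrative assumptions, not the paper’s reported configuration:

    import torch
    import torch.nn as nn

    class BiGRUEncoder(nn.Module):
        """Encode a feature sequence with a bidirectional GRU, then classify."""
        def __init__(self, input_size=128, hidden_size=64, num_classes=2):
            super().__init__()
            # bidirectional=True runs the GRU forward and backward over the
            # sequence and concatenates both states (hence 2 * hidden_size).
            self.bigru = nn.GRU(input_size, hidden_size,
                                batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * hidden_size, num_classes)

        def forward(self, x):                    # x: (batch, seq_len, input_size)
            out, _ = self.bigru(x)               # out: (batch, seq_len, 2*hidden)
            return self.classifier(out[:, -1])   # classify from last time step

    x = torch.randn(4, 20, 128)                  # dummy batch of 4 sequences
    print(BiGRUEncoder()(x).shape)               # torch.Size([4, 2])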
V. TRANSFORMER

The Transformer is a deep learning model for processing sequence data, first proposed in 2017 by Vaswani et al. Since its introduction, it has fundamentally changed model design in the field of natural language processing (NLP), especially for machine translation tasks. Unlike traditional recurrent neural networks (RNNs), the Transformer is based entirely on self-attention mechanisms and dispenses with the constraint of sequential processing [13], [20]. This makes parallel processing possible and significantly improves training efficiency. A schematic diagram of the Transformer is presented in Fig. 3.
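At the heart of the model is scaled dot-product self-attention; a compact PyTorch sketch (with illustrative dimensions, not the paper’s configuration) is:

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
        d_k = q.size(-1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, seq, seq)
        weights = torch.softmax(scores, dim=-1)            # attention weights
        return weights @ v                                 # (batch, seq, d_k)

    x = torch.randn(2, 10, 64)     # self-attention: Q, K, V all derive from x
    print(scaled_dot_product_attention(x, x, x).shape)     # torch.Size([2, 10, 64])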
VI. RESULTS

In terms of parameter settings, the Adam optimizer was used in the experiments, the maximum number of training epochs was set to 200, the batch size was set to 256, the initial learning rate was 0.001, the learning-rate decay factor was set to 0.1, and the gradient clipping threshold was set to 10. An NVIDIA 4090 GPU was used to run the experiments in MATLAB R2024a.

For the division of the data, this experiment split the training set and the testing set at a ratio of 7:3. The class proportions of the binary classification dataset are balanced.
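Since the experiments were run in MATLAB, the following is purely an illustrative mapping of the stated hyperparameters onto a PyTorch training loop; the placeholder model, the dummy data, and the learning-rate step size are assumptions not given in the paper:

    import torch
    import torch.nn as nn
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy tensors standing in for TF-IDF features; batch size 256 as stated.
    data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 2, (1024,)))
    loader = DataLoader(data, batch_size=256, shuffle=True)

    model = nn.Linear(128, 2)     # placeholder for the hybrid Transformer model
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # initial LR
    # LR decay factor 0.1 as stated; the step size (100) is an assumption.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)

    for epoch in range(200):                   # maximum of 200 training epochs
        for inputs, labels in loader:
            optimizer.zero_grad()
            loss = nn.functional.cross_entropy(model(inputs), labels)
            loss.backward()
            # Gradient clipping threshold of 10, as stated in the paper.
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=10)
            optimizer.step()
        scheduler.step()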
This paper presents experiments using two variants of the Transformer algorithm: one optimized with a bidirectional gated recurrent unit, and another optimized with a bidirectional gated recurrent unit combined with Bayesian optimization. To compare their performance, we analyze the confusion matrices generated for both the training and testing datasets. Fig. 5 shows the confusion matrix of the Transformer model optimized with the bidirectional gated recurrent unit, while Fig. 6 shows the confusion matrix of the model with Bayesian optimization and the bidirectional gated recurrent unit.

Fig. 6. The experiment of fake news detection and classification based on the Transformer optimized by the Bayesian algorithm and the bidirectional gated recurrent unit.
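For reference, confusion matrices of the kind shown in Figs. 5 and 6 can be computed with scikit-learn; the labels below are dummies, not the paper’s predictions:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = fake, 0 = real (dummy labels)
    y_pred = [1, 0, 1, 0, 0, 0, 1, 0]   # hypothetical model outputs
    # Rows are true classes, columns are predicted classes.
    print(confusion_matrix(y_true, y_pred))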
TABLE II
THE ACCURACY OF THE TWO ALGORITHMS ON THE TRAINING AND TESTING SETS

Method                                                               Training accuracy (%)   Testing accuracy (%)
Bidirectional gated recurrent unit optimized Transformer                     100                   99.67
Bidirectional gated recurrent unit with Bayesian
algorithm optimization Transformer                                           100                   99.73