
Received 5 June 2023, accepted 24 June 2023, date of publication 12 July 2023, date of current version 19 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3294613

Constructing a User-Centered Fake News Detection Model by Using Classification Algorithms in Machine Learning Techniques

MINJUNG PARK AND SANGMI CHAI
Ewha School of Business, Ewha Womans University, Seodaemun-gu, Seoul 03760, South Korea
Corresponding author: Sangmi Chai ([email protected])
This work was supported in part by the Ministry of Education of the Republic of Korea and in part by the National Research Foundation of Korea under Grant NRF-2020S1A5A2A01046634.

ABSTRACT As fake news spreads rapidly on social media, detection technologies that automatically identify fake news are being actively developed. However, most of them focus only on the linguistic and compositional characteristics of fake news (e.g., indication of sources or authors, length of a message, frequency of negative words). In contrast, this study proposes a machine-learning-based fake news detection model that reflects the characteristics of users, news content, and social networks, grounded in social capital theory. To comprehensively capture the characteristics related to the spread of fake news, this study applied the XGBoost model to estimate the feature importance of each variable and derive the priority factors that most strongly affect fake news detection. Based on the derived variables, we established SVM, RF, LR, CART, and NNET models, which are representative machine learning classification models, and compared their fake news detection performance. To generalize the established models (i.e., to avoid overfitting or underfitting), this study performed a cross-validation step and compared the predictive accuracy of the established models. As a result, the RF model showed the highest prediction rate at about 94%, while NNET had the lowest performance rate at about 92.1%. The results of this study are expected to contribute to improving fake news detection systems in preparation for the increasingly sophisticated generation and spread of fake news.

INDEX TERMS Classification algorithms, fake news, fake news detection, feature selection, prediction
algorithms, predictive models, XGBoost.

I. INTRODUCTION
The content-based recommender systems developed to provide customized content to users improved user satisfaction; however, they have recently become a decisive vehicle for spreading fake news [1]. These systems are designed to continuously show content in the feed similar to what the user has previously seen or engaged with through ''Likes'' or comments, regardless of whether the news is true [2]. In other words, once a user encounters fake news, the system has no choice but to keep recommending similar content to that user. They even intentionally adjust the appearance of unwanted content in users' social media feeds [3].

Most prior studies have focused on detecting or identifying fake news based on its linguistic or compositional characteristics. They have identified fake news based on whether the article has a clear author and source or whether the article is long enough [4], [5], [6], [7]. This approach assumed differences in linguistic or compositional features between fake and factual news. It can hardly reflect the characteristics of users who accept or spread fake news or the features of the social media networks through which fake news spreads.

With the advent of ChatGPT (Generative Pre-trained Transformer), which can produce text that does not look awkward, unlike text written by earlier low-level AI, it is no longer possible to guarantee the accuracy of detecting fake news in the previous way. Recently, it has become so easy to create fake news that looks like real news that users mistake

The associate editor coordinating the review of this manuscript and approving it for publication was Geng-Ming Jiang.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

news written by ChatGPT in a few seconds for news written by a professional journalist [48]. Therefore, it is necessary to identify fake news differently than before. In this study, we propose a detection model that comprehensively considers not only the surface features of the content but also the characteristics of the users who generate and share fake news and of the networks through which fake news spreads.

Moreover, fake news can be generated more sophisticatedly because technology that can automatically make it look identical to real news in a short time by using AI (Artificial Intelligence) is spreading today. AI-powered bots on Twitter (i.e., AI Twitterbots) can populate thousands of user accounts that support or oppose whatever content the bot controllers target, and their posts look the same as real news even when they are fake [8]. Therefore, it has become increasingly difficult to detect precisely manipulated fake news based only on its superficial features. It is necessary to approach the user characteristics and networks of social media from a more diverse viewpoint to overcome the existing fake news detection methods that focus on linguistic characteristics.

This study aims to improve the prediction performance of fake news detection by overcoming the limitation that previous studies did not consider the characteristics of information recipients. Therefore, we establish a fake news detection model by considering content features, the users on social media, and the network where fake news is generated and propagated. Among the different explanatory variables for detecting fake news, the priority of each explanatory variable is first derived through feature selection with XGBoost (Extreme Gradient Boosting). By constructing an optimal fake news detection model from the explanatory variables selected by XGBoost, we aim to increase the predictive performance. The model is constructed by applying five machine learning techniques: Logistic Regression (LR), Neural Network (NNET), Random Forest (RF), Support Vector Machine (SVM), and Classification and Regression Trees (CART). The model with the highest prediction performance for detecting fake news is finally derived by comparing their performance rates.
II. RELATED WORKS
A. FAKE NEWS DETECTION
Recent studies demonstrate the diverse range of approaches that researchers are taking to develop fake news detection models using machine learning, and the potential of these models to improve the accuracy of news verification. However, like any technology, fake news detection systems are hardly perfect and can sometimes make errors in identifying fake news. If fake news is not detected appropriately by the system, it can be shared widely on social media in a short time, leading to a significant impact on public opinion and behavior. For example, hundreds of people died in Iran after drinking methanol to cure COVID-19 because of fake news that had initially been accepted as fact [47]. Furthermore, errors in fake news detection systems can lead to false accusations or misidentifications of individuals or groups. If a system mistakenly identifies a legitimate news story as fake news, it could lead to accusations of bias or censorship against the news outlet that published it [5].

To summarize the current state of the art in fake news detection systems, most previous research assumes that the linguistic and compositional features of content are the main criteria for distinguishing fake news from real news [9]. Fake news detection systems typically rely on linguistic and structural features of news articles, but they often fail to capture the context of the news, such as the history of the news source or the socio-political environment in which the news is circulated. Methods that could not capture the semantic meaning and context of the words picked up from fake news have shown low accuracy [10].

Content-oriented fake news detection, the most common approach, applies natural language processing (NLP) to identify fake news by concentrating on the characteristics of the text. NLP techniques process news content based on language pattern detection and word occurrences common to satire, irony, sentiment, and topicality [11]. Examining how a news item deemphasizes the source or highlights the article headline through its design is also a way of identifying fake news by paying attention to its textual characteristics [12]. This content-oriented approach assumes that fake and real news have different linguistic and compositional structures. One study proposes a hybrid fake news detection algorithm that combines a linguistic approach with network cues and provides operational guidelines for a feasible fake news detecting system [13]. In addition, based on grammatical characteristics obtained through syntax parsing with a Probabilistic Context Free Grammar (PCFG) and the differences between keywords used in fake and real news, semantic characteristics, rhetorical structure, and discourse analysis results were selected as explanatory variables to determine whether news is fake [14]. In fake news detection targeting Facebook posts and various articles, Term Frequency - Inverse Document Frequency (TF-IDF), which is frequently used to represent text characteristics in text analysis, was used as a criterion for classifying fake news [5].

An automated fake news detection model has also been constructed by extracting linguistic features from the text of online newspaper articles [15]. Using the LIWC (Linguistic Inquiry and Word Count), the ratio of punctuation marks (e.g., periods, commas, question marks, exclamation points) was calculated. In addition, the number of positive and negative words mentioned in each document was computed based on the LIWC lexicon to extract the proportions of observations that fall into psycholinguistic categories and, finally, readability metrics were constructed for the fake news detection model. Research applying a naive Bayesian classifier to news attributes [16], combining news text with clickbait [17], and building hybrid models that combine the spread patterns of fake news with semantic analysis to develop automated fake news detection is being actively conducted [18], [19], [20], [21].
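As an illustration of the TF-IDF weighting mentioned above, here is a minimal pure-Python sketch (the two example texts are invented for illustration and are not from any dataset used in the studies cited):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a small corpus.
    tf = term count / doc length; idf = log(N / df)."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        for term in set(toks):
            df[term] += 1
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["shocking secret cure exposed",
        "council approves library budget"]
w = tfidf(docs)
# Terms that appear in only one of the two documents get idf = log(2) > 0.
print(round(w[0]["shocking"], 3))  # → 0.173
```

Terms common to many documents receive low weights, so TF-IDF highlights the vocabulary that distinguishes one article from the rest of the corpus.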
There are several datasets available in the literature, such as the LIAR dataset or FakeNewsNet, that can be used to train and evaluate machine learning models for the task of detecting fake news. The LIAR dataset, for example, is a widely used dataset for fake news detection, consisting of statements labeled as true, mostly true, half true, barely true, or false [51]. The FakeNewsNet dataset, on the other hand, contains news articles labeled as fake, bias, or conspiracy [52]. Both datasets have been labeled by human annotators: the LIAR dataset by human fact-checkers and the FakeNewsNet dataset by crowdsourced workers. The annotators determined fake news based on its surface-level linguistic patterns. It is difficult to apply such patterns these days, when fake news looks the same as real news and uses professional expressions as if written by a professional journalist. Therefore, as the amount of sophisticated fake news has increased, the need has been raised to reflect not only linguistic characteristics but also other approaches, including users' network relationships and the context of fake news.

As fake news detection systems have recently become more popular, models for spreading fake news have emerged that bypass the algorithmic identification methods of well-known detection systems to avoid being identified. Fake news detection systems can be targeted by individuals who deliberately create fake news designed to bypass them [22]. These adversarial attacks make it difficult to develop accurate and reliable fake news detection systems. While the technology used to spread fake news continues to evolve, current fake news detection systems still focus on the linguistic and compositional features of fake news, as in the past. Therefore, we aim to improve existing systems, which have difficulty distinguishing the increasingly sophisticated fake news that looks like factual news. Recent fake news detection systems have some limitations, which have made it easier for fake news to spread. The limitations that have recently been identified include the following. First, existing fake news detection systems mainly rely on keywords. Systems that depend on keywords to identify fake news can be easily manipulated by those spreading false information [49]. In other words, fake news can be designed to reach a wider audience by exploiting weaknesses in these systems. Second, recent fake news detection systems have had difficulty detecting new forms of fake news. As the methods used to spread fake news evolve, the existing detection systems may not be able to keep up. For example, deepfake videos or manipulated images may be difficult to detect using current systems [50]. Third, many fake news detection systems rely on analyzing individual pieces of content in isolation, without considering the broader context in which the content was shared [22]. This can make it difficult to determine whether a piece of information is intentionally false or simply a mistake or misunderstanding. In other words, it is hard to determine whether content is factual by depending only on existing fake news detection systems, which lack this context.

Finally, we include the context of fake news and the latest fake news generation styles by reflecting word sentiment, word similarity, and users' network relationships to overcome the limitations of the existing fake news systems presented above.
B. SOCIAL CAPITAL THEORY
Social influence can be explained as a structure of social relationships built by exchanging interactions between users on a social network. The main background enabling this mutual exchange of social influence is each individual's social capital in the social network [23]. It has been found that the type of social network formed by users differs according to the social capital they possess. At the same time, the will to create a social relationship and the degree of persistence of the relationship differ [24]. Social capital arises from different attributes within three dimensions: structural, relational, and cognitive [24]. The structural dimension is the key to whether or not a network connection is established between actors in the network and includes the overall strength of the connection [25]. The relational dimension of social capital refers to personal relationships between actors formed through interactions between individuals [26]. Finally, the cognitive dimension refers to shared representations, interpretations, semantic content, and systems among agents [27].

Based on the three dimensions of social capital presented above, determinants were selected to detect fake news spread on Twitter. The aim is to comprehensively consider the characteristics of the Twitter network and of the users affecting the individual acceptance and spread of fake news.

1) NETWORK FEATURES OF STRUCTURAL DIMENSION
The network features of Twitter can be defined as factors through which users can affect each other on Twitter. In this study, three factors were adopted to estimate the network features of social network structures: the number of followers, the number of followings, and the degree of centrality in the network. First, the numbers of followers and followings are representative indirect proxy variables that show a user's willingness to interact with others on Twitter [28], [29]. It can be inferred that a user with many followers has high expectations for establishing relationships with others based on the structural network features of Twitter. A user with many followings can be seen as having a higher willingness to engage in networking activities with others than a user with few followings. As the degree of centrality in the network can be measured differently depending on the way users communicate and interact with others on Twitter, all of these measures (i.e., in-degree, out-degree, and betweenness centrality) were regarded as features of the network. In-degree centrality indicates how many users follow a given user in Twitter's limited network [30]. In other words, it is an index of influence in which the direction of exchange within the Twitter network runs from others to oneself. In contrast, out-degree centrality refers


to how much information or attention a user provides to other users on the Twitter network. Therefore, it is estimated as an indicator of outward connection centrality within the Twitter network, and it corresponds to the number of arrows directed away from each user to the others [30]. The in-degree centrality and out-degree centrality of node $i$, denoted respectively by $C_{I,i}$ and $C_{O,i}$, can be defined as:

$$C_{I,i} = \frac{\sum_{j=1, j \neq i}^{N} l_{ji}}{N-1} \tag{1}$$

$$C_{O,i} = \frac{\sum_{j=1, j \neq i}^{N} l_{ij}}{N-1} \tag{2}$$

where $l_{ij}$ indicates a directed link from node $i$ to node $j$ and $N$ is the number of nodes in the network.

Betweenness centrality refers to how information is mediated between different users in Twitter's network. It is often used to find nodes that serve as a bridge from one part of a network to another. It is an indicator showing how much a user can transmit information among users [32]. The betweenness centrality of node $i$, denoted as $BC_i$, can be defined as:

$$BC_i = \frac{2 \sum_{j}^{N} \sum_{k}^{N} g_{jk}(i)/g_{jk}}{N^2 - 3N + 2}, \quad j \neq k \neq i \tag{3}$$

where $g_{jk}$ is the number of shortest paths between nodes $j$ and $k$, and $g_{jk}(i)$ is the number of those paths that pass through node $i$.

Therefore, in this study, the degree centrality index on the Twitter network is divided into the three network influence indicators above and measured. A fake news detection model is then constructed based on the centrality influence level of each.
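The in- and out-degree centralities of Eqs. (1)-(2) can be sketched on a toy follower graph (the nodes and edges below are invented for illustration; an edge (u, v) means u directs a link toward v):

```python
def degree_centrality(nodes, edges):
    """In- and out-degree centrality per Eqs. (1)-(2):
    degree counts normalized by the maximum possible degree, N - 1."""
    n = len(nodes)
    indeg = {v: 0 for v in nodes}
    outdeg = {v: 0 for v in nodes}
    for u, v in edges:
        outdeg[u] += 1   # link leaving u
        indeg[v] += 1    # link arriving at v
    c_in = {v: indeg[v] / (n - 1) for v in nodes}
    c_out = {v: outdeg[v] / (n - 1) for v in nodes}
    return c_in, c_out

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("b", "a")]
c_in, c_out = degree_centrality(nodes, edges)
print(c_in["b"])   # 3 incoming links / (4 - 1) = 1.0
```

Node "b" is followed by every other node, so its in-degree centrality reaches the maximum of 1.0, matching the "influence directed from others to oneself" reading above.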
2) USER FEATURES OF RELATIONAL DIMENSION
The relational dimension on Twitter examines how much and what kind of relational actions users have taken on Twitter. It is estimated based on the total number of tweets posted by each user and the total numbers of 'Likes' and 'Retweets' received from others. In addition to tweet posts, 'Retweets' and 'Likes' are essential for inferring user characteristics in the network, as they are the most representative communication methods on Twitter [23].

A user's network activity on social media indicates how each user influences other users, based on the numbers of that user's mentions and retweets [30]. Studies have empirically evaluated the effects of influencers based on the probability that a tweet is retweeted on Twitter [33], [34]. It has been confirmed that when a tweet from a user with fewer than 1,000 followers is retweeted, that tweet is delivered to thousands of additional users. A retweet can thus be explained as having an additive impact on many Twitter users, regardless of the poster's number of followers [35]. Combining the previous studies presented above, this study measures the characteristics of Twitter users based on the total numbers of tweets, retweets, and likes created by users to evaluate the relational dimension of social capital on Twitter.
3) CONTENT FEATURES OF COGNITIVE DIMENSION
Twitter, the subject of this research, is a representative social medium where information is shared mainly through short text messages of 140 characters or less. Since the content of a tweet is compressed into the limited 140 characters, it is difficult for the information recipient to sufficiently acquire the information through the tweet message alone. Accordingly, it is not easy for a user to determine the authenticity of information on Twitter. In this study, the text characteristics of tweet messages were judged to be influential factors for identifying fake news, and they were examined in terms of 'word similarity' and 'word sentiment.'

Word similarity evaluates how similar two words, phrases, or documents are. Words used with meanings similar to specific words in a sentence are converted into numerical values to check the similarity between the words composing the entire sentence. The similarity is calculated through word embedding, which is a method of quantifying the individual words constituting a sentence. A word is expressed as a vector, and the distance between a specific term and a similar word is calculated to derive the similarity between the words. Therefore, word similarity analysis estimates the semantics of word meaning based on context according to the word embedding method used to calculate the distance between words. When a tweet message frequently uses vocabulary with similar purposes, the value of word similarity increases. Accordingly, general informational messages that convey only facts, without arbitrarily shaping the tone of the message, commonly record relatively low word similarity values, mainly using non-emotional words or neutral expressions without duplication or repetition. However, this study predicted that fake news would compose tweet messages by repeatedly and intentionally using the same or similar words to incite users who encounter the information to accept and spread it. Therefore, word similarity was considered a vital antecedent factor for detecting fake news in this study and was included as an explanatory variable to distinguish fake news from general information.
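The embedding-distance idea described above can be sketched with plain cosine similarity. The three-dimensional vectors below are invented for illustration; the paper does not specify which embedding model it uses:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: two emotionally loaded near-synonyms vs. a neutral word.
emb = {
    "shocking":   [0.9, 0.1, 0.0],
    "outrageous": [0.8, 0.2, 0.0],
    "budget":     [0.0, 0.1, 0.9],
}

print(cosine_similarity(emb["shocking"], emb["outrageous"]) >
      cosine_similarity(emb["shocking"], emb["budget"]))  # True
```

A tweet that keeps repeating near-synonyms like the first pair would accumulate high pairwise similarities, which is exactly the signal the study treats as suspicious.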
SentiWordNet (SWN) is a lexicon that adds sentiment scores to each word. It gives positivity, negativity, and objectivity scores, which measure how much sentiment a word carries. SWN automatically sets the positive, negative, and objective sentiment values for each synonym set of WordNet. Various methods for sentiment analysis have been developed; however, SWN differs in that it can calculate an adequate emotional intensity level because the sentiment value is applied differently for each part of speech of the word used.

Fake news is expected to show high emotional intensity because it uses more provocative vocabulary and negative phrases that appeal to emotions. An analysis of about 320,000 news articles published in the New York Times from 2012 to 2014 using SWN revealed that, out of 754 days, positive and negative news was recorded on 322 days and 432 days, respectively. Analysis of positive and negative news based on SWN showed a stock price prediction performance improved by about 4% or more compared with technical analysis alone [36]. Parts of speech in online product reviews were classified into adjectives, adverbs, and


verbs, and it was found that SWN improved performance compared with previous sentiment analysis methods [37]. Furthermore, SWN demonstrated improved performance in identifying the tone of each document, even for relatively crude texts such as reviews [38]. Combining these previous studies, this study judged that deriving the linguistic characteristics of fake news based on SWN would have a significant effect on the identification of fake news.
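A minimal sketch of SWN-style scoring follows. The per-word (positive, negative, objective) triples below are invented for illustration and are not actual SentiWordNet values; in practice the scores are looked up per synset and part of speech in the SWN lexicon:

```python
# Toy SWN-style lexicon: word -> (positive, negative, objective) scores.
# Real SentiWordNet assigns scores per (synset, part of speech); these
# numbers are illustrative only.
LEXICON = {
    "miracle":  (0.75, 0.00, 0.25),
    "deadly":   (0.00, 0.75, 0.25),
    "shocking": (0.00, 0.62, 0.38),
    "budget":   (0.00, 0.00, 1.00),
}

def emotional_intensity(text):
    """Mean (positive + negative) mass of the known words in a text:
    high values suggest emotionally charged wording."""
    words = [w for w in text.lower().split() if w in LEXICON]
    if not words:
        return 0.0
    return sum(LEXICON[w][0] + LEXICON[w][1] for w in words) / len(words)

print(emotional_intensity("shocking deadly miracle"))  # (0.62+0.75+0.75)/3
print(emotional_intensity("budget"))                   # 0.0
```

The provocative wording scores high while the neutral word scores zero, mirroring the study's expectation that fake news carries higher emotional intensity.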
III. RESEARCH METHOD
This study conducts the following research analysis procedure to identify fake news. First, XGBoost is applied to derive the priority of the variables that have a significant effect on fake news detection. Second, we establish models to distinguish fake news with five representative classification algorithms (i.e., LR, NNET, RF, SVM, CART) among machine learning models, based on the factors derived from XGBoost. Third, we adopt k-fold cross-validation to improve the performance and generalizability of each established model, and we also perform ablation studies to increase the robustness of the models.
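The k-fold cross-validation step of this procedure can be sketched in pure Python. The fold count of 5 below is illustrative (the paper does not state k here); in practice each fold would train and validate one of the five classifiers:

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k roughly equal, non-overlapping folds;
    each fold serves once as the validation set."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    return [(
        [idx for j, f in enumerate(folds) if j != i for idx in f],  # train
        folds[i],                                                   # validate
    ) for i in range(k)]

# 402 tweets (as in Section IV), 5 folds: every sample validates exactly once.
splits = k_fold_indices(402, 5)
print(len(splits))                         # 5
print(sum(len(val) for _, val in splits))  # 402
```

Averaging a model's accuracy over all k validation folds gives the generalized performance estimate used to compare the five classifiers.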
Machine learning techniques have recently been widely used for prediction in various research fields. This study uses supervised classification algorithms to identify fake news from the given data. Recently, ensemble learning, which uses multiple models together, has been employed to improve performance beyond that of a single model [39]. XGBoost is a representative ensemble model, which usually shows superior classification performance compared with single classification algorithms [40]. The existing Gradient Boosting Machine has the limitation that analysis is significantly slow because the learning weights are increased sequentially. In contrast, XGBoost is faster than the existing gradient boosting technique because it can learn in parallel on the CPU. In addition, it is robust against overfitting because it provides regularization [40], [41].

When the XGBoost model is formed, feature importance can be identified by reflecting together the gain each variable contributes to accuracy and the frequency with which the variable appears across the entire ensemble of trees [41]. As the split criterion used and the accuracy contribution of each split are derived through pruning, the direction of each variable's effect can be grasped. It is thus possible to identify which variables contributed, and how much, to the ''important decisions'' in the decision-making process among the many variables input to construct the model. Therefore, the importance of each input variable is calculated, and the variables can be sorted by rank.

In general, constructing a model with many input variables has the advantage that various conditions can be considered. On the other hand, the model's noise increases, and the model fit inevitably deteriorates as a result. This becomes an obstacle to constructing the fake news detection model by lowering the accuracy of the prediction model. Feature selection through XGBoost is expected to enable the construction of a highly accurate model while preventing overfitting. Accordingly, in this study, the main factors affecting the identification of fake-news-related tweet messages are first derived, and an optimal model is built based on them.
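The gain-based importance described above can be illustrated in miniature: for a single split, the "gain" of a feature is the reduction in impurity it achieves. This pure-Python sketch uses Gini impurity and invented toy data; the paper itself relies on the full XGBoost implementation over many trees, not this simplification:

```python
def gini(labels):
    """Gini impurity of a set of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def split_gain(feature_values, labels, threshold):
    """Impurity reduction from splitting the data at `threshold`."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy data: word_sentiment separates fake (1) from real (0) news
# perfectly; follower_count does not.
labels = [1, 1, 1, 0, 0, 0]
word_sentiment = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]
follower_count = [5, 900, 20, 10, 800, 30]

print(split_gain(word_sentiment, labels, 0.5))   # 0.5: a perfect split
print(split_gain(follower_count, labels, 100))   # no impurity reduction
```

Summing such gains over every split in which a feature is used, across all trees, yields the ranking that feature selection relies on.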
IV. DATA COLLECTION
We collected a total of 23,592 tweets over a period of 595 days (from Mar. 5, 2019 to Oct. 19, 2020). We first extracted the topics of popular and representative fake news cases that spread globally. Only news items that were clearly determined to be fake by authoritative media outlets (e.g., The Wall Street Journal, CNN) were included in the final data analysis as fake. The selected fake news was collected from various fields: medicine, politics, economy, IT, entertainment, and international news. For example, 'Drinking alcohol-including beverages with high percentages of alcohol-offers protection from COVID 19. . . ', a popular piece of fake news in medicine, was collected. Data preprocessing was performed as follows. First, considering the characteristics of fake news, tweets covering the same topics or similar content were regarded as the same news and excluded from the final analysis. In other words, tweets and retweets about the same content were considered duplicates and removed, except for the first tweet posted. Second, tweets in which users unilaterally expressed emotions such as anger, joy, or sadness, including agreement or disapproval of a specific tweet, were considered to deliver only personal emotions and were removed from the final data set. We also excluded tweets in which users evaluated whether another tweet was fake or factual. Finally, 402 tweets, comprising 202 fake news tweets and 200 true news tweets, were used to establish the fake news detection model.

In this study, each tweet's 'word sentiment' and 'word similarity' were calculated as content features in the cognitive dimension of social capital. The degree of centrality of users (i.e., in-degree, out-degree, betweenness centrality) in the Twitter network and the numbers of followers and followings from each user's account information were also collected as network features in the structural dimension of social capital. Additionally, the total number of tweet messages posted by each ID and the total numbers of likes and retweets received by each account were obtained to evaluate users' relational features on Twitter.
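The duplicate-removal step of the preprocessing above can be sketched as follows. The tweet records are invented for illustration, and the notion of "same content" is simplified to an exact match on normalized text, whereas the study also treats tweets on similar topics as duplicates:

```python
def drop_duplicate_tweets(tweets):
    """Keep only the earliest tweet for each distinct content;
    later tweets/retweets with the same content are removed."""
    seen = set()
    kept = []
    for tweet in sorted(tweets, key=lambda t: t["posted_at"]):
        key = " ".join(tweet["text"].lower().split())  # normalized text
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept

tweets = [
    {"text": "Alcohol protects from COVID",  "posted_at": 2},
    {"text": "alcohol  protects from covid", "posted_at": 1},  # earlier copy
    {"text": "Council approves budget",      "posted_at": 3},
]
kept = drop_duplicate_tweets(tweets)
print(len(kept))             # 2
print(kept[0]["posted_at"])  # 1 (the first posting survives)
```

Keeping only the first posting prevents a widely retweeted story from dominating the 402-tweet analysis set.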
V. RESEARCH ANALYSIS
A. EXTRACTING PRIORITY FACTORS USING XGBoost
This study aims to determine which factors have a significant influence in discriminating between fake news and true news by identifying the feature importance of each variable through XGBoost. The importance of a variable is the result of adding up the gains that the variable contributed to the model's accuracy while constructing the XGBoost model. As a result of performing XGBoost based on a total of 10 explanatory


variables input, the importance of the derived variables is as follows in FIGURE 1 and TABLE 1.

FIGURE 1. Feature importance of each variable.

TABLE 1. Feature importance of each variable.

As shown in FIGURE 1, the most influential factor in identifying fake news among tweets is 'word sentiment.' The more one-sided a tweet's tone is, whether overly positive or overly negative, the more it should be suspected of being fake news. It was found that Twitter users' in-degree centrality, word similarity, and the total number of tweets sequentially influenced the construction of the fake news detection model. As the in-degree centrality formed by Twitter users appears as an explanatory variable with high importance, it can be confirmed that how actively individuals form relationships and interact with other users in the Twitter network is an essential factor in determining fake news. Word similarity is also a significant criterion for detecting fake news. A high word similarity value is estimated when a tweet message repeatedly uses the same vocabulary and sentences with similar meanings. Therefore, general information messages that deliver only facts, without arbitrarily shaping the tone of the message, commonly record relatively low word similarity because they mainly use neutral, non-emotional expressions without overlapping or repeating words. However, when word similarity is high, there is a high probability that the writer composed the message without considering the context, or repeatedly used words that appeal to users' emotions in order to mislead the reader. Furthermore, with the advent of AI bots that automatically write tweets, word similarity is high even when similar words are repeated without paraphrasing. We can also suspect that an individual or a small group is intentionally spreading information when the number of tweets on a specific topic on Twitter is excessive over a certain period. In that case, it can be inferred that the information is spread intentionally for the profit of a few individuals, not to provide factual information. In contrast, a Twitter user's number of followers and followings showed relatively low importance. Therefore, this study establishes a fake news detection model based on machine learning algorithms using the four explanatory factors (i.e., word sentiment, in-degree centrality, word similarity, and the total number of tweets) extracted by XGBoost as the most important variables.

B. ESTABLISHING FAKE NEWS DETECTION MODELS USING MACHINE LEARNING
A detection model based on a machine learning algorithm is constructed using a categorical binary variable as the dependent variable to determine whether a tweet message is fake news or not. Additionally, the best-performing fake news detection model is selected by evaluating the performance of classification models based on various machine learning algorithms. Fake news detection models are established with five machine learning algorithms. First, the Logistic Regression (LR) model was built with a stepwise selection method to overcome the limitation that model complexity increases when all variables are input. As a result, the best-performing model was constructed at an AIC value of 540.1684. For the Classification and Regression Tree (CART), the deviance value was adjusted over 100 repetitions to improve prediction performance. As a result, the optimal model was established when the number of nodes was 6. To find the optimal deviance value, tree pruning was performed repeatedly. This iterative pruning process prevents the tree from generating more nodes than necessary and also prevents performance degradation caused by too few nodes. Finally, the size with the smallest deviance, 6 (deviance = 7.75), was fixed as the number of terminal nodes. For the Neural Network (NNET), the tuned parameters were size and decay. The optimal model was searched for by controlling the number of hidden units, and as a result, the three-layer network (size = 1, decay = 0.1) showed the best performance. The Support Vector Machine (SVM) was analyzed using the nonlinear Radial Basis Function (RBF) Gaussian kernel. The tuned parameters were Sigma and C: C is used in the soft margin cost function, which trades error penalty for stability, while Sigma is the standard


deviation. Through this search, Sigma = 0.1 and C = 10, which yielded the smallest difference between the training results and the evaluation results, were selected for the final model. To establish the Random Forest (RF), models constructed with different parameter values were applied to the evaluation data to assess performance. As a result, the number of trees generated in the random forest was set to 100 (ntree = 100), and the tuning parameter 'mtry', which controls the number of variables considered at each node split, was optimized to 3. Accordingly, the prediction accuracy according to the misclassification rate, which is the final classification performance of each model, is presented in the following TABLE 2.

TABLE 2. Comparisons of prediction performance rates.

C. MODEL EVALUATION
As presented above, fake news detection models were finally established using five machine learning techniques. We applied the k-fold cross-validation method to optimize model construction. Recently, data resampling methods such as k-fold cross-validation and bootstrapping have been applied to reduce the uncertainty of input dataset partitioning [42]. As previously noted, high performance can be achieved on the given data when a model is built using the entire data set, but the prediction accuracy may drop when new data are added. A representative way to verify whether such overfitting occurs, and to solve the overfitting problem, is to use part of the given data as training data to build the model and the rest as a test dataset [43]. In this study, the 402 data sets were divided into seven folds by setting the k value to 7 for each of the five machine learning models.

We compare the accuracy, precision, recall, and F1-score of each proposed model. Multiple evaluation metrics, including the accuracy, precision, F1-score, recall, and specificity, were adopted to evaluate the performance of the established models. The accuracy is the ratio of the number of correctly classified samples to the total number of samples in a given test dataset. The precision is the ratio of the true positive samples to the sum of the true positive and false positive samples. The recall is the ratio of the true positive samples to the sum of the true positive and false negative samples. The F1-score is the harmonic mean of precision and recall and is widely used to evaluate the success of machine learning algorithms [44]. The specificity is the true negative rate. These metrics are defined as follows, where TP, TN, FP, and FN represent the numbers of true positive, true negative, false positive, and false negative samples in the confusion matrix, respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (4)
Precision = TP / (TP + FP) (5)
Recall = TP / (TP + FN) (6)
F1-score = 2 × (Precision × Recall) / (Precision + Recall) (7)
Specificity = TN / (FP + TN) (8)

TABLE 3. Results of evaluation metrics of five machine learning classifiers.

TABLE 3 shows the accuracy, precision, recall, F1-score, and specificity of each proposed model.

D. COMPARE THE PERFORMANCE RATE
We simulated 1000 iterations of each model to analyze the effect of data partitioning and model learning. This measures how frequently each classification model misclassifies across 1000 different simulations. In other words, "misclassification rates in
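The 7-fold partition of the 402 samples can be sketched as a simple fold assignment; shuffling and the actual model fitting, which would follow in practice, are omitted here:

```python
def k_fold_indices(n_samples: int, k: int):
    """Split range(n_samples) into k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

# 402 samples with k = 7, as in this study: three folds of 58 and four of 57.
folds = k_fold_indices(402, 7)
# Each fold serves once as the test set while the remaining six train the model.
```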
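Equations (4)–(8) translate directly into code. A minimal helper, evaluated on illustrative confusion-matrix counts (not the study's actual results), might look like:

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Eqs. (4)-(8) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)      # Eq. (4)
    precision = tp / (tp + fp)                      # Eq. (5)
    recall = tp / (tp + fn)                         # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
    specificity = tn / (fp + tn)                    # Eq. (8)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}

# Illustrative counts only, e.g., 40 fake tweets correctly flagged.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```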


1000 simulations" refers to a metric that shows how accurately a specific classification model performs. This type of simulation helps evaluate the accuracy of the classification model and can be used to improve the model's performance.

As a result of these iterations, the accuracy of the RF was about 94.1%, the best performance rate among the five models, while the accuracy of the NNET was the lowest at about 92.1%. In addition, the LR and CART showed performance rates of approximately 93.1% and 92.8%, respectively. As described above, the average misclassification rate over 1000 iterations of each model, together with the performance according to the final prediction accuracy, is presented in TABLE 4 and FIGURE 2.

TABLE 4. Misclassification rates in 1000 simulations.

FIGURE 2. Misclassification rates in 1000 simulations.

E. ABLATION STUDY
As an additional experiment to identify the impact of the different features, we performed an ablation study on the RF, the best-performing model, for each feature group. As with the prior model analysis, understanding the effect of each feature and which features are redundant is important for future model development [45]. An ablation study can show the effect of removing specific features. Furthermore, such a study is needed to confirm the results obtained from the established XGBoost model. In particular, we organized feature groups based on feature importance, called "Feature Group A," "Feature Group B," and "Feature Group C," respectively. "Feature Group A" includes the top four features ranked by the XGBoost model: word sentiment, in-degree centrality, word similarity, and the number of total tweets. "Feature Group B" adds the number of followings to "Feature Group A," and "Feature Group C" adds the number of total retweets to "Feature Group B." The performance of the optimal feature group derived by the XGBoost model was then compared with that of the other feature groups, and the results are shown in TABLE 5. The results of the ablation study confirmed that the combination of features in "Feature Group A," as mentioned above, gives the best performance.

TABLE 5. Results of ablation study on the RF model of each feature group.

VI. CONCLUSION
We constructed five classification machine learning models to identify fake news spread and shared through Twitter and compared their performance rates. The main findings of the study can be summarized as follows. First, we derived the feature importance of various explanatory variables estimated to impact the identification of fake news spreading on Twitter. Four major explanatory factors affecting fake news detection were finally extracted from among various factors, and models for each machine learning algorithm were constructed based on those derived factors. These variables significantly contribute to the construction of a fake news detection model in the following order: word sentiment, in-degree centrality, word similarity, and the total number of tweets.

Fake news detection models were established based on five machine learning algorithms, LR, NNET, CART, SVM, and RF, with the top four derived variables. Second, the CART model and the NNET model showed the highest performance rate, about 94.6%, among the five classification machine learning models, while the LR model and the SVM model indicated about a 91.3% performance rate, the lowest prediction rate. Third, the performance of the constructed models was evaluated with the misclassification rate. As the primary purpose of this study is to identify the optimal model for detecting fake news with the highest prediction rate, additional analysis was performed to compare the performance of each model. We established the evaluation models in the following way to solve the data imbalance problem that occurs in constructing the optimal version of each model. For cross-validation, the entire data set was divided into training and test sets and input into the model establishment process. The data imbalance problem was alleviated through an oversampling technique. Based on the configured data set, a simulation was performed in which each model was
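The repeated-simulation procedure can be sketched as follows: each iteration re-draws a random train/test split and records the test misclassification rate, which is then averaged. A toy majority-class "model" stands in for the fitted classifiers here, and the split fraction and seed are assumptions, not the paper's settings:

```python
import random

def misclassification_rate(y_true, y_pred):
    """Fraction of samples where prediction disagrees with the label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def simulate(labels, n_iter=1000, test_frac=0.3, seed=42):
    """Average test misclassification rate over n_iter random splits,
    using a majority-class predictor as a stand-in classifier."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_iter):
        data = labels[:]
        rng.shuffle(data)
        cut = int(len(data) * (1 - test_frac))
        train, test = data[:cut], data[cut:]
        majority = max(set(train), key=train.count)  # 'fit' step
        rates.append(misclassification_rate(test, [majority] * len(test)))
    return sum(rates) / len(rates)

# Toy imbalanced label set of 402 samples (n_iter reduced for speed).
avg_rate = simulate([1] * 300 + [0] * 102, n_iter=200)
```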


repeated 1000 times. As a result, the RF model showed the highest performance rate of 94.1% in identifying fake news among fake and factual news, whereas the NNET model showed the lowest prediction rate of about 92.1%.

Based on the research results presented above, this study makes the following contributions. First, this study is meaningful because it derived various variables that had not previously been considered as factors for detecting fake news, including the word sentiment of fake news content. In particular, previous studies have shown that fake news uses more words that provoke negative emotions in readers than general information news does, so attention has been paid to the negative tone of fake news [38]. It can be inferred that fake news appeals to readers' negative emotions and urges their acceptance, whereas factual news uses neutral words that are relatively unbiased. This is because fake news instills in the reader a sense of crisis through negative sentiments such as fear, anger, and anxiety. Similarly, this study found that word sentiment, which measures whether a message has a strongly negative or positive tone, has the most significant influence on establishing a model for classifying fake news. Therefore, it can be inferred that the word sentiment of a tweet is a significant factor in identifying fake news when the tweet is overly biased in one direction. This means that it is necessary to doubt a message that contains too much positive meaning or too many negative words. Fake news related to healthy food, for example, often uses too many positive words to entice the reader with a variety of positive bodily effects that have not been scientifically verified.

Second, this study identifies the relative feature importance of variables, whereas previous studies only suggested various factors for determining fake news. Today, attempts to develop fake news detection systems and automated algorithms for classifying fake news are actively pursued. However, only a few studies concentrate on users' attitudes toward fake news when constructing detection models. Therefore, this study contributes to extending future related studies in that the importance of variables among various factors was derived. This study can be a foundation for developing the fake news detection systems currently being introduced by various social media platforms. If all aspects are considered in designing a system, it has the advantage of being able to predict in various situations, but the prediction performance is inevitably lowered due to the model fitness problem. Accordingly, the results of this study are presented in order of their relative importance, while at the same time deriving various factors that have not been dealt with before and reflecting them in the model. This is expected to help companies select the factors that should be prioritized in designing a system for detecting fake news on social media platforms in the future.

Third, the results of this study address the need for further research on ensemble models for detecting fake news and on XAI (Explainable AI) to extend the reliability of each model. The RF model showed the highest accuracy among the five models in this study, and it was verified that an ensemble model generally performs better than a single model. Ensemble models outperform single models because they combine multiple models to improve the performance of an algorithm rather than relying on a single model. However, ensemble models can be more complex and difficult to interpret than single models, because the final prediction is generated by combining the outputs of multiple models, which can make it harder to understand how individual models contribute to the final result. Therefore, model selection techniques such as grid search or Bayesian optimization can be considered, both to increase the accuracy of fake news detection with ensemble models such as the RF proposed in this study and to increase the explainability of a single model with a high contribution. This can help avoid including redundant or poorly performing models in the ensemble and identify the best combination of models and hyperparameters. Finally, it needs to be investigated how an autonomous fake news detection system can be established with not only higher accuracy but also an explainable AI model.

This study suggests the following contributions by integrating the main findings. This study identified that the total number of tweets of each social media account, which has not been considered in past studies, is a significant factor in fake news detection. It can be inferred that users who upload more tweets than necessary may be AI bots intentionally created to generate a large amount of content automatically in a short time. Therefore, when developing a system to identify fake news in the future, the number of contents uploaded by a user account needs to be considered comprehensively to improve detection accuracy.

This study established a model of fake news spread through Twitter. Twitter is different from other social media because information spreads mainly through short text messages, so differences from other social media in spreading fake news can be expected. Therefore, in future research, we plan to collect data related to fake news distributed through other platforms to reduce the differences between social media platforms and generalize the research results. This study also could not consider various cultures, because it collected fake news messages written only in English. It has been found that culture affects users' behaviors on social media, such as the frequency of posting and the intention of sharing messages [46]. It can be inferred that there can be differences in the strength of relationships (i.e., the degree of centrality of users in social media) and in the behaviors of accepting or spreading fake news according to users' cultural norms. Therefore, this study provides future directions for improving existing fake news detection systems.

While various algorithms for detecting fake news are being actively advanced, this study constructed AI models based


on the derived priority factors from the perspective of social capital theory. We proposed an optimized model for detecting fake news that reflects the features of the information receiver and the social network. Finally, it suggests the need to develop an algorithm with an excellent prediction rate that fully reflects the social network and the characteristics of the participants who maintain it, in order to develop an automated fake news detection system.

REFERENCES
[1] C. Wu, F. Wu, Y. Huang, and X. Xie, "Personalized news recommendation: Methods and challenges," ACM Trans. Inf. Syst., vol. 41, no. 1, pp. 1–50, Jan. 2023.
[2] X. Su, G. Sperlì, V. Moscato, A. Picariello, C. Esposito, and C. Choi, "An edge intelligence empowered recommender system enabling cultural heritage applications," IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 4266–4275, Jul. 2019.
[3] F. Zhou, X. Xu, G. Trajcevski, and K. Zhang, "A survey of information cascade analysis: Models, predictions, and recent advances," 2020, arXiv:2005.11041.
[4] S. Gaillard, Z. A. Oláh, S. Venmans, and M. Burke, "Countering the cognitive, linguistic, and psychological underpinnings behind susceptibility to fake news: A review of current literature with special focus on the role of age and digital literacy," Front. Commun., vol. 6, Jul. 2021, Art. no. 661801.
[5] M. L. D. Vedova, E. Tacchini, S. Moret, G. Ballarin, M. DiPierro, and L. de Alfaro, "Automatic online fake news detection combining content and social signals," in Proc. 22nd Conf. Open Innov. Assoc. (FRUCT), May 2018, pp. 272–279.
[6] E. C. Tandoc, Z. W. Lim, and R. Ling, "Defining 'fake news': A typology of scholarly definitions," Digit. Journalism, vol. 6, no. 2, pp. 137–153, Feb. 2018.
[7] H. Kim, "An exploratory study on fake news using topic modeling: Focused on fake news published in the online journalism," M.S. thesis, School Manag. Inf. Syst., Kookmin Univ., Seoul, South Korea, 2017.
[8] A. K. Cybenko and G. Cybenko, "AI and fake news," IEEE Intell. Syst., vol. 33, no. 5, pp. 1–5, Sep. 2018.
[9] N. Seddari, A. Derhab, M. Belaoued, W. Halboob, J. Al-Muhtadi, and A. Bouras, "A hybrid linguistic and knowledge-based analysis approach for fake news detection on social media," IEEE Access, vol. 10, pp. 62097–62109, 2022.
[10] A. Galli, E. Masciari, V. Moscato, and G. Sperlí, "A comprehensive benchmark for fake news detection," J. Intell. Inf. Syst., vol. 59, no. 1, pp. 237–261, Aug. 2022.
[11] A. P. Salazar, "AI tools on fake news detection: An overview and comparative study," Graduate School Technol. Univ. Philippines, Manila, Philippines, Tech. Rep., 2020.
[12] A. Kim and A. R. Dennis, "Says who? The effects of presentation format and source rating on fake news in social media," MIS Quart., vol. 43, no. 3, pp. 1025–1039, Jan. 2019.
[13] N. K. Conroy, V. L. Rubin, and Y. Chen, "Automatic deception detection: Methods for finding fake news," Proc. Assoc. Inf. Sci. Technol., vol. 52, no. 1, pp. 1–4, Jan. 2015.
[14] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi, "Truth of varying shades: Analyzing language in fake news and political fact-checking," in Proc. Conf. Empirical Methods Natural Lang. Process., 2017, pp. 2931–2937.
[15] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, and R. Mihalcea, "Automatic detection of fake news," 2017, arXiv:1708.07104.
[16] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," in Proc. IEEE 1st Ukraine Conf. Electr. Comput. Eng. (UKRCON), May 2017, pp. 900–903.
[17] Y. Chen, N. J. Conroy, and V. L. Rubin, "Misleading online content: Recognizing clickbait as 'false news,'" in Proc. ACM Workshop Multimodal Deception Detection, Nov. 2015, pp. 15–19.
[18] C. Chen, K. Wu, V. Srinivasan, and X. Zhang, "Battling the Internet water army: Detection of hidden paid posters," in Proc. IEEE/ACM Int. Conf. Adv. Social Netw. Anal. Mining (ASONAM), Aug. 2013, pp. 116–120.
[19] Z. Jin, J. Cao, Y. Jiang, and Y. Zhang, "News credibility evaluation on Microblog with a hierarchical propagation model," in Proc. IEEE Int. Conf. Data Mining, Dec. 2014, pp. 230–239.
[20] V. L. Rubin, Y. Chen, and N. K. Conroy, "Deception detection for news: Three types of fakes," Proc. Assoc. Inf. Sci. Technol., vol. 52, no. 1, pp. 1–4, 2015.
[21] R. J. Sethi, "Spotting fake news: A social argumentation framework for scrutinizing alternative facts," in Proc. IEEE Int. Conf. Web Services (ICWS), Jun. 2017, pp. 866–869.
[22] M. Carter, M. Tsikerdekis, and S. Zeadally, "Approaches for fake content detection: Strengths and weaknesses to adversarial attacks," IEEE Internet Comput., vol. 25, no. 2, pp. 73–83, Mar. 2021.
[23] M. M. Wasko and S. Faraj, "Why should I share? Examining social capital and knowledge contribution in electronic networks of practice," MIS Quart., vol. 29, no. 1, pp. 35–57, Mar. 2005.
[24] J. Nahapiet and S. Ghoshal, "Social capital, intellectual capital, and the organizational advantage," Acad. Manage. Rev., vol. 23, no. 2, pp. 242–266, Apr. 1998.
[25] P. Dawson, J. Scott, J. L. Thompson, and D. Preece, "The dynamics of innovation and social capital in social enterprises: A relational sense-making perspective," in Proc. Massey Univ. Social Innov. Entrepreneurship Conf., 2011, pp. 177–191.
[26] N. Lin, "Building a network theory of social capital," Connections, vol. 22, no. 1, pp. 28–51, 1999.
[27] C.-M. Chiu, M.-H. Hsu, and E. T. G. Wang, "Understanding knowledge sharing in virtual communities: An integration of social capital and social cognitive theories," Decis. Support Syst., vol. 42, no. 3, pp. 1872–1888, Dec. 2006.
[28] D. R. Bild, Y. Liu, R. P. Dick, Z. M. Mao, and D. S. Wallach, "Aggregate characterization of user behavior in Twitter and analysis of the retweet graph," ACM Trans. Internet Technol., vol. 15, no. 1, pp. 1–24, Mar. 2015.
[29] B. Gonçalves, N. Perra, and A. Vespignani, "Modeling users' activity on Twitter networks: Validation of Dunbar's number," PLoS ONE, vol. 6, no. 8, Aug. 2011, Art. no. e22656.
[30] M. Cha, H. Haddadi, F. Benevenuto, and P. K. Gummadi, "Measuring user influence in Twitter: The million follower fallacy," in Proc. ICWSM, vol. 10, nos. 10–17, 2010, p. 30.
[31] X. Chen, Z. Chong, P. Giudici, and B. Huang, "Network centrality effects in peer to peer lending," Phys. A, Stat. Mech. Appl., vol. 600, Aug. 2022, Art. no. 127546.
[32] R. Garcia-Gavilanes, D. Quercia, and A. Jaimes, "Cultural dimensions in Twitter: Time, individualism and power," in Proc. 7th Int. AAAI Conf. Weblogs Social Media, 2013, pp. 195–204.
[33] D. Gayo-Avello, "Nepotistic relationships in Twitter and their impact on rank prestige algorithms," Inf. Process. Manage., vol. 49, no. 6, pp. 1250–1280, Nov. 2013.
[34] E. Tsourougianni and N. Ampazis, "Recommending who to follow on Twitter based on tweet contents and social connections," Social Netw., vol. 2, no. 4, pp. 165–173, 2013.
[35] H. Kwak, C. Lee, H. Park, and S. Moon, "What is Twitter, a social network or a news media?" in Proc. 19th Int. Conf. World Wide Web, Apr. 2010, pp. 591–600.
[36] D. Kim and Y. Lee, "News based stock market sentiment lexicon acquisition using Word2Vec," Korea J. BigData, vol. 3, no. 1, pp. 13–20, Aug. 2018.
[37] Y. Dang, Y. Zhang, and H. Chen, "A lexicon-enhanced method for sentiment classification: An experiment on online product reviews," IEEE Intell. Syst., vol. 25, no. 4, pp. 46–53, Jul. 2010.
[38] D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, and J. L. Zittrain, "The science of fake news," Science, vol. 359, no. 6380, pp. 1094–1096, 2018.
[39] A. Dey, "Machine learning algorithms: A review," Int. J. Comput. Sci. Inf. Technol., vol. 7, no. 3, pp. 1174–1179, 2016.
[40] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2016, pp. 785–794.
[41] L. Torlay, M. Perrone-Bertolotti, E. Thomas, and M. Baciu, "Machine learning–XGBoost analysis of language networks to classify patients with epilepsy," Brain Inform., vol. 4, no. 3, pp. 159–169, Sep. 2017.
[42] X. C. Nguyen, T. T. H. Nguyen, D. D. La, G. Kumar, E. R. Rene, D. D. Nguyen, S. W. Chang, W. J. Chung, X. H. Nguyen, and V. K. Nguyen, "Development of machine learning–based models to forecast solid waste generation in residential areas: A case study from Vietnam," Resour., Conservation Recycling, vol. 167, Apr. 2021, Art. no. 105381.
[43] Y. Bengio and Y. Grandvalet, "No unbiased estimator of the variance of K-fold cross-validation," J. Mach. Learn. Res., vol. 5, pp. 1089–1105, Dec. 2004.
[44] F. Khan, I. Tarimer, H. S. Alwageed, B. C. Karadağ, M. Fayaz, A. B. Abdusalomov, and Y.-I. Cho, "Effect of feature selection on the accuracy of music popularity classification using machine learning algorithms," Electronics, vol. 11, no. 21, p. 3518, Oct. 2022.
[45] D. Kauchak, O. Mouradi, C. Pentoney, and G. Leroy, "Text simplification tools: Using machine learning to discover features that identify difficult text," in Proc. 47th Hawaii Int. Conf. Syst. Sci., Jan. 2014, pp. 2616–2625.
[46] I. Pentina, L. Zhang, and O. Basmanova, "Antecedents and consequences of trust in a social media brand: A cross-cultural study of Twitter," Comput. Hum. Behav., vol. 29, no. 4, pp. 1546–1555, Jul. 2013.
[47] N. Karimi and J. Gambrell. (Mar. 27, 2020). Hundreds Die of Poisoning in Iran as Fake News Suggests Methanol Cure for Virus. The Times of Israel, Jerusalem, Israel. [Online]. Available: https://url.kr/w3c8hd
[48] K. Miller. (Feb. 13, 2023). Human Writer or AI? Scholars Build a Detection Tool. Stanford Univ. Hum.-Centered Artif. Intell., Stanford, CA, USA. [Online]. Available: https://url.kr/w3c8hd
[49] A. A. A. Ahmed, A. Aljabouh, P. K. Donepudi, and M. S. Choi, "Detecting fake news using machine learning: A systematic literature review," 2021, arXiv:2102.04458.
[50] J. Botha and H. Pieterse, "Fake news and deepfakes: A dangerous threat for 21st century information security," in Proc. 15th Int. Conf. Cyber Warfare Secur. (ICCWS). New York, NY, USA: Academic Conferences and Publishing Limited, Mar. 2020, pp. 1–10.
[51] W. Y. Wang, "'Liar, liar pants on fire': A new benchmark dataset for fake news detection," 2017, arXiv:1705.00648.
[52] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, and H. Liu, "FakeNewsNet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media," Big Data, vol. 8, no. 3, pp. 171–188, Jun. 2020.

MINJUNG PARK received the M.S. and Ph.D. degrees in data analytics and business administration from Ewha Womans University, Seoul, South Korea, in 2016 and 2021, respectively. Her current research interests include artificial intelligence (AI), big data analytics, blockchain, information security, and privacy.

SANGMI CHAI received the M.S. degree in business administration from Seoul National University, South Korea, and the Ph.D. degree in management information systems from the School of Management, The State University of New York at Buffalo. She is currently a Professor with the School of Business, Ewha Womans University, South Korea. She has published her articles in Journal of Management Information Systems, International Journal of Production Economics, International Journal of Logistics Management, Decision Support Systems, IEEE TRANSACTIONS ON PROFESSIONAL COMMUNICATION, International Journal of Information Management, and Information Systems Frontiers. Her current research interests include artificial intelligence (AI), blockchain, privacy, and information security.