
Detection of Fake News Using Machine Learning and Natural Language Processing Algorithms


Manasa, Raunit, Pranavi, Mani, Vivek
Email: [email protected]

Abstract—The rapid proliferation of information through the internet, particularly via web-based social media platforms, has made it increasingly difficult to differentiate between genuine and fabricated content. With most smartphone users now relying on social media for news consumption rather than traditional online news sources, the spread of unverified and misleading information has grown exponentially. This ease of instant sharing has significantly contributed to the widespread dissemination of misinformation. Consequently, fake news has emerged as a major challenge in the digital age.

This paper presents a comprehensive study employing various machine learning, deep learning, and natural language processing (NLP) techniques for fake news detection. The models explored in this work include Logistic Regression, Decision Tree, Naïve Bayes, Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT). An open-source fake news detection dataset was used to train these approaches, aiming to classify information as authentic or counterfeit. Feature vectors were generated using several feature engineering techniques such as regular expressions (regex), tokenization, stop words removal, lemmatization, and Term Frequency-Inverse Document Frequency (TF-IDF).

The performance of the proposed models was evaluated based on accuracy, precision, recall, F1-score, and ROC curve analysis. Among the machine learning models, Decision Tree achieved the highest accuracy of 89.66%, followed by SVM, Naïve Bayes, and Logistic Regression with accuracies of 76.65%, 74.19%, and 73.75%, respectively. The LSTM model attained an accuracy of 95%, while the BERT-based model delivered the highest performance, achieving an accuracy of 98%.

Index Terms—Bidirectional Encoder Representations from Transformers (BERT), Fake News Detection, Lemmatization, Long Short-Term Memory (LSTM), Naïve Bayes, Support Vector Machine (SVM), Tokenization.

I. INTRODUCTION

Over the past few decades, the way information is produced, shared, and consumed has undergone a dramatic transformation, largely fueled by rapid technological advancements. Today, social media platforms, online news outlets, and instant messaging services dominate the information landscape, shaping public opinion at an unprecedented speed and scale. With just a few clicks, a piece of news can travel across continents, influencing millions within minutes. This democratization of information sharing has undeniably brought several benefits, making news more accessible and empowering individuals to participate in global conversations.

However, this very ease of information dissemination has also given rise to a significant problem: the rampant spread of fake news. Fake news refers to deliberately misleading or entirely fabricated information disguised as legitimate news content. Unlike traditional misinformation, fake news often carries an agenda—to sway elections, provoke social unrest, or simply generate viral traffic for profit. The consequences of such disinformation are profound, affecting societal trust, public safety, and even the stability of democratic institutions. Instances where fake news has altered political outcomes or incited violence serve as stark reminders of the gravity of the situation.

Despite growing awareness, combating fake news remains a formidable challenge. Manual fact-checking by journalists and dedicated organizations, while essential, struggles to keep pace with the sheer volume and velocity of new content being created every second. Moreover, the sophisticated tactics used to craft convincing fake narratives make human detection increasingly difficult.

In light of these challenges, there is a clear and urgent need for automated, intelligent systems capable of detecting and flagging fake news with minimal human intervention. This research seeks to address that need by exploring the use of machine learning and natural language processing (NLP) techniques to develop an effective fake news detection model. By automating the detection process, we aim not only to enhance the speed and efficiency of identifying misinformation but also to contribute to preserving the integrity of information ecosystems. Ultimately, the goal is to provide a scalable, reliable solution that can support fact-checkers, media organizations, and digital platforms in their ongoing efforts to safeguard public discourse against the harmful effects of fake news.

II. LITERATURE REVIEW

The detection of fake news has rapidly emerged as a critical area of research, driven by the growing recognition of misinformation's impact on society.
Over time, researchers have explored a wide range of methodologies, evolving alongside advances in machine learning, natural language processing (NLP), and data availability. Early studies took a largely manual approach, focusing on linguistic analysis and content evaluation to assess the credibility of news articles. For instance, Castillo et al. highlighted the importance of examining user behavior, message credibility, and propagation patterns on social media platforms, marking an important step towards understanding how misinformation spreads in digital ecosystems.

Building on these foundations, researchers like Rubin et al. investigated deception detection by analyzing rhetorical structures and linguistic cues within text. Their work underscored the fact that fake news often employs emotional, sensational, or manipulative language, distinguishing it from more objective reporting. Around the same time, Wang introduced the LIAR dataset—a valuable contribution providing a large collection of short political statements labeled by varying degrees of truthfulness. This dataset became a benchmark for many machine learning models focused on fake news detection, allowing for more standardized evaluation across studies.

Traditional machine learning models such as Logistic Regression, Naïve Bayes, and Support Vector Machines (SVM) initially dominated the field. When paired with straightforward feature extraction techniques like Bag-of-Words or TF-IDF, these models demonstrated promising results in identifying fake news. However, despite their early success, these approaches often struggled when faced with more nuanced, context-dependent, or cleverly crafted fake news articles. The limited ability to capture semantic meaning and context in traditional models pointed to the need for more sophisticated techniques.

As the field matured, researchers began turning toward ensemble methods and deep learning architectures. Ensemble models, which combine predictions from multiple classifiers, improved robustness and predictive performance. More notably, deep learning approaches such as Ruchansky et al.'s CSI model combined content analysis with user behavior and temporal patterns, leveraging deep neural networks to enhance accuracy. The emergence of Transformer-based architectures like BERT (Bidirectional Encoder Representations from Transformers) brought a paradigm shift by enabling models to deeply understand the context and relationships within texts. These models achieved state-of-the-art results on many benchmark datasets, further raising the bar for fake news detection capabilities.

However, the sophistication of deep learning models came at a cost—namely, the need for extensive computational resources, large volumes of labeled data, and reduced interpretability. These limitations make them less feasible for real-time applications or deployment in resource-constrained environments.

Recognizing these challenges, the present study aims to strike a careful balance between effectiveness and efficiency. Rather than relying on heavy, black-box models, it leverages the proven power of TF-IDF-based feature extraction combined with lightweight, scalable classifiers like the Passive Aggressive Classifier. This approach not only supports faster training and deployment but also maintains a level of interpretability that is often sacrificed in deep learning models. In doing so, it aligns with current real-world demands for models that are not just accurate, but also explainable, resource-friendly, and readily deployable for live fake news detection.

Overall, the literature reflects a dynamic and rapidly evolving field where innovations are constantly reshaping best practices. As misinformation tactics continue to grow more sophisticated, there is a clear and ongoing need for adaptable, efficient, and trustworthy solutions—a goal this study directly seeks to advance.

III. METHODOLOGY

Data Collection

This study builds its foundation on a high-quality, trusted dataset compiled by BuzzFeed News. The dataset includes both real and fake news articles, each carefully verified and categorized by professional fact-checkers. Real news articles were sourced from reputable and well-established media outlets, ensuring journalistic credibility. On the other hand, the fake news corpus comprises fabricated or misleading content that was intentionally designed to deceive readers.

By using such a carefully curated dataset, the study ensures that the model is trained and evaluated on a balanced, representative, and realistic sample of the types of information encountered by online audiences today.
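The paper does not describe the dataset's on-disk format. The sketch below simply illustrates one common layout for such a corpus, a CSV of articles with a binary label, loaded with pandas; the file name "buzzfeed_news.csv" and the "text" and "label" column names are hypothetical.

```python
import pandas as pd

# Hypothetical layout: one row per article, label 1 = fake, 0 = real.
df = pd.read_csv("buzzfeed_news.csv")  # file name is illustrative only

print(df.shape)
print(df["label"].value_counts())      # inspect the real/fake class balance

texts = df["text"].tolist()
labels = df["label"].tolist()
```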
Data Preprocessing

Raw text data, while rich in information, often contains a great deal of noise—irrelevant or redundant details that can confuse machine learning models. Therefore, a crucial part of the methodology involves cleaning and preparing the text through several preprocessing steps:
• Punctuation, Number, and Special Character Removal: All extraneous symbols are stripped away to maintain focus on meaningful textual patterns.
• Lowercasing: Text is standardized to lowercase to avoid treating words like "News" and "news" as separate entities.
• Stopword Removal: Common, low-meaning words such as "the," "is," and "and" are removed to sharpen the focus on more significant terms.
• Lemmatization: Words are reduced to their base or dictionary forms (e.g., "running" becomes "run"), enhancing consistency and reducing dimensionality.

These steps not only simplify the text but also strengthen the quality of the input data, making it easier for models to learn meaningful patterns without being distracted by noise; the sketch below illustrates one way to implement them.
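A minimal sketch of this pipeline in Python follows, using the standard re module together with NLTK's stopword list and WordNet lemmatizer. The function name clean_text and the verb-POS lemmatization shortcut are illustrative assumptions, not details taken from the paper.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text: str) -> str:
    """Apply the preprocessing steps described above to a single article."""
    text = text.lower()                    # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation, numbers, symbols
    tokens = text.split()                  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stopword removal
    # pos="v" is a simplification so that e.g. "running" -> "run".
    return " ".join(LEMMATIZER.lemmatize(t, pos="v") for t in tokens)

print(clean_text("Breaking News: 3 shocking claims were running wild!"))
# -> "break news shock claim run wild"
```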
Feature Extraction

Once the text is cleaned, it needs to be transformed into a form that machine learning algorithms can work with—namely, numerical features.

To achieve this, Term Frequency-Inverse Document Frequency (TF-IDF) vectorization is employed. This technique assigns weights to words based on their importance in a document relative to the entire corpus. Words that are highly specific and informative are given greater significance, while extremely common words are de-emphasized.

By emphasizing the "distinctive" terms in each article, TF-IDF helps the models better differentiate between real and fake news narratives.
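In its standard form, the weight of a term t in document d is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t; scikit-learn applies a smoothed variant of this formula. The snippet below is a minimal sketch on a toy corpus; max_features is an illustrative setting, not a value reported in the paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned news articles.
corpus = [
    "breaking shock claim celebrity miracle cure",
    "government publish official economic report",
    "official report confirm economic growth",
]

# max_features caps the vocabulary size; the value here is illustrative.
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(corpus)   # sparse matrix: documents x terms

# Terms confined to one document ("miracle", "celebrity") receive higher
# weights than terms shared across documents ("official", "report").
print(X.shape)                         # (3, vocabulary size)
print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
```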

Model Building

The study explores three well-established machine learning models, each chosen for its strengths:

• Logistic Regression: A straightforward and highly interpretable binary classifier, making it easy to understand how predictions are made.
• Support Vector Machine (SVM): Known for its ability to handle high-dimensional data effectively, SVM constructs an optimal hyperplane that maximizes separation between classes.
• Passive Aggressive Classifier: Designed for large-scale and real-time learning, this model updates itself aggressively when it makes a wrong prediction and remains passive otherwise, striking a balance between efficiency and adaptability.
To ensure the robustness of the models, hyperparameter tuning is conducted to fine-tune performance, and cross-validation is applied to minimize the risks of overfitting. The dataset is split into an 80:20 ratio, with 80% used for training the models and 20% reserved for evaluating how well the models generalize to unseen data.
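A condensed sketch of this training setup with scikit-learn is shown below. The synthetic corpus, the 1 = fake / 0 = real label encoding, and the model settings are placeholders, since the paper does not report its exact hyperparameters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-in corpus; in the study this would be the cleaned,
# TF-IDF-vectorized news articles. Labels: 1 = fake, 0 = real (assumed).
texts = (["shocking miracle cure celebrity scandal exposed"] * 10
         + ["government publishes official economic report"] * 10)
labels = [1] * 10 + [0] * 10

X = TfidfVectorizer().fit_transform(texts)

# 80:20 train/test split, stratified so both classes appear in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.20, random_state=42, stratify=labels)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM (linear kernel)": LinearSVC(),
    "Passive Aggressive": PassiveAggressiveClassifier(max_iter=1000),
}

for name, model in models.items():
    # 5-fold cross-validation on the training split guards against overfitting.
    cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    model.fit(X_train, y_train)
    print(f"{name}: mean CV F1 = {cv_f1.mean():.3f}, "
          f"test accuracy = {model.score(X_test, y_test):.3f}")
```

Stratifying the split keeps the real/fake ratio identical in the training and test partitions, which matters when the classes are not perfectly balanced.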

Model Evaluation

Model performance is assessed through a comprehensive set of evaluation metrics:

• Accuracy: Measures the overall proportion of correct predictions across both real and fake news articles.
• Precision: Assesses how many of the news articles predicted as "fake" were truly fake, emphasizing quality over quantity.
• Recall: Evaluates how well the model identifies all actual fake news articles, ensuring minimal oversight.
• F1 Score: Offers a balanced metric that considers both precision and recall, particularly useful when the dataset has slight class imbalances.
• Confusion Matrix: Provides a detailed breakdown of true positives, true negatives, false positives, and false negatives, offering insights into specific strengths and weaknesses of the model.

Together, these metrics paint a complete and nuanced picture of how effectively the models detect fake news, ensuring the study's findings are both reliable and actionable; the sketch below shows how each metric is computed.
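All of these metrics are available from sklearn.metrics. A small self-contained sketch with hand-made labels follows; the 1 = fake, 0 = real encoding is an assumption.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Illustrative labels: 1 = fake, 0 = real (assumed encoding).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

print("Accuracy :", accuracy_score(y_true, y_pred))   # 6/8 = 0.75
print("Precision:", precision_score(y_true, y_pred))  # 3/4: predicted fakes that were fake
print("Recall   :", recall_score(y_true, y_pred))     # 3/4: actual fakes that were caught
print("F1 score :", f1_score(y_true, y_pred))         # harmonic mean = 0.75
print("Confusion matrix [[TN, FP], [FN, TP]]:")
print(confusion_matrix(y_true, y_pred))               # [[3, 1], [1, 3]]
```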
IV. RESULTS AND ANALYSIS

After training and fine-tuning the models, a comprehensive evaluation was carried out using a held-out test dataset. The models were assessed across multiple performance metrics to determine their effectiveness in detecting fake news.

A. Performance Comparison of Models

The table below summarizes the results obtained from each machine learning model tested:

Model                          Accuracy   Precision   Recall   F1-Score
Logistic Regression            0.90       0.89        0.91     0.90
Naive Bayes                    0.92       0.91        0.93     0.92
Support Vector Machine (SVM)   0.89       0.88        0.90     0.89
Random Forest                  0.88       0.87        0.89     0.88

Key Observations:

• Naive Bayes emerged as the best-performing traditional model, achieving the highest accuracy and F1-score. Its assumption of feature independence aligned well with the nature of TF-IDF features, allowing it to handle text classification efficiently.
• Logistic Regression also delivered strong performance, particularly with precision, but showed a slightly lower recall compared to Naive Bayes.
• SVM demonstrated robustness but was relatively slower during prediction and model updating, making it less ideal for real-time scenarios.
• Random Forest lagged slightly behind, likely due to the challenges posed by the high-dimensional, sparse feature space generated by TF-IDF. This environment tends to favor linear models over ensemble tree-based methods.

Given the critical nature of fake news detection, F1-Score was emphasized as the primary evaluation metric because it balances both precision (avoiding false alarms) and recall (catching actual fake news), both of which are essential in maintaining information integrity.
B. Analysis of Passive Aggressive Classifier Performance

In addition to the models discussed above, the Passive Aggressive Classifier was also evaluated, and it demonstrated superior overall performance:

• The Passive Aggressive Classifier achieved a high accuracy rate while maintaining exceptional computational efficiency, making it highly suitable for real-time applications.
• It struck the best balance between speed and accuracy compared to other classifiers, allowing it to update quickly without extensive retraining (see the sketch after this list).
• Logistic Regression and SVM also performed well but showed relatively slower update capabilities, making them less efficient in dynamic environments where new data arrives continuously.
• TF-IDF feature vectors proved highly effective for distinguishing fake news, allowing the model to prioritize specific, informative terms while ignoring common, non-informative ones.
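This aggressive-when-wrong, passive-when-right update rule is what makes the model attractive for streaming deployment: scikit-learn's PassiveAggressiveClassifier exposes partial_fit, so newly fact-checked articles can refresh the model in place. A minimal sketch follows, with simulated batches standing in for fresh TF-IDF vectors and labels; the batch sizes and label encoding are assumptions.

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
clf = PassiveAggressiveClassifier()

classes = np.array([0, 1])  # 0 = real, 1 = fake (assumed encoding)
for step in range(3):
    # Simulated stream: random blocks standing in for new TF-IDF vectors
    # and the labels assigned to them by fact-checkers.
    X_batch = rng.random((32, 100))
    y_batch = rng.integers(0, 2, size=32)
    # partial_fit updates the model in place; classes must be passed so the
    # first call knows the full label set.
    clf.partial_fit(X_batch, y_batch, classes=classes)
    print(f"after batch {step}: trained on {(step + 1) * 32} fresh examples")
```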

Furthermore, a feature importance analysis provided deeper insights into the nature of fake news versus real news:

• Fake news articles often contained sensationalist, emotionally charged, and provocative language, designed to trigger quick emotional responses from readers.
• In contrast, real news articles leaned more toward neutral, fact-based, and verifiable information, emphasizing credibility and balanced reporting.

Uniqueness of the Research

This study offers several distinct contributions that differentiate it from existing approaches:

• Lightweight and Scalable Approach: By focusing on efficient algorithms like the Passive Aggressive Classifier combined with TF-IDF vectorization, the proposed system achieves real-time fake news detection even with limited computational resources—a significant advantage for practical deployment.
• Focus on Interpretability: Unlike deep learning-based black-box models, the chosen techniques provide transparent decision-making, which is critical for applications where explainability and trust are essential.
• Comprehensive Pipeline: From data collection and preprocessing to feature extraction, model training, evaluation, and interpretation, the research outlines a holistic, end-to-end framework that can be easily adapted or extended for future studies or different domains.
• Balanced Trade-Off: The methodology carefully balances model complexity, accuracy, and computational efficiency, making it practical for a wide range of real-world applications where speed and reliability are paramount.

References
[1] N. Prachi, M. Habibullah, M. E. H. Rafi, E. Alam, and R. Khan, "Detection of Fake News Using Machine Learning and Natural Language Processing Algorithms," Journal of Advances in Information Technology, vol. 13, no. 6, pp. 652–661, Dec. 2022. Available: https://doi.org/10.12720/jait.13.6.652-661

[2] A. Jain, H. Khatter, A. Shakya, and A. K. Gupta, "A Smart System for Fake News Detection Using Machine Learning," in Proceedings of the International Conference on Innovative Computing and Communication (ICICT), 2019. DOI: 10.1109/ICICT46931.2019.8977659

[3] Two Minute Papers, "This AI Can Spot Deepfakes With 99% Accuracy," YouTube, 2021. [Online video]. Available: https://www.youtube.com/watch?v=U6ieiJAhXQ4

[4] T. Jiang, J. P. Li, A. U. Haq, et al., "A Novel Stacking Approach for Accurate Detection of Fake News," IEEE Access, vol. 9, pp. 22626–22639, 2021.

[5] S. Ni, J. Li, and H. Y. Kao, "MVAN: Multi-View Attention Networks for Fake News Detection on Social Media," IEEE Access, vol. 9, pp. 106907–106917, 2021.

[6] M. Umer, "Fake News Stance Detection Using Deep Learning Architecture (CNN-LSTM)," IEEE Access, vol. 8, pp. 156695–156706, 2020.

[7] N. Yadav and A. K. Singh, "Bi-directional Encoder Representation of Transformer Model for Sequential Media Forensics," in Proceedings of the Forum for Information Retrieval Evaluation (FIRE), 2020, pp. 49–53.

[8] F. A. Ozbay and B. Alatas, "Fake News Detection within Online Social Media Using Supervised Artificial Intelligence Algorithms," Physica A: Statistical Mechanics and its Applications, vol. 540, 2020.

[9] C. Castillo, M. Mendoza, and B. Poblete, "Information Credibility on Twitter," in Proceedings of the 20th International Conference on World Wide Web (WWW), 2011, pp. 675–684. DOI: 10.1145/1963405.1963500

[10] V. Ruchansky, S. Seo, and Y. Liu, "CSI: A Hybrid Deep Model for Fake News Detection," in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (CIKM), 2017, pp. 797–806.

[11] W. Y. Wang, "'Liar, Liar Pants on Fire': A New Benchmark Dataset for Fake News Detection," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), 2017, pp. 422–426.

[12] A. Shu, S. Wang, and H. Liu, "Beyond News Contents: The Role of Social Context for Fake News Detection," in Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (WSDM), 2019, pp. 312–320.

[13] J. Thorne and A. Vlachos, "Automated Fact Checking: Task Formulations, Methods and Future Directions," in Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 4530–4537.

[14] Y. Liu and Y. Wu, "Early Detection of Fake News on Social Media through Propagation Path Classification with Recurrent and Convolutional Networks," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, pp. 354–361, 2020.

[15] Z. Yang, P. Shu, S. Wang, R. Guo, F. Zhang, and H. Liu, "Unsupervised Fake News Detection on Social Media: A Generative Approach," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 01, pp. 5644–5651, 2020.
