Fake News Detection System
Ambition College
Submitted by: Rupesh Thakuri, Rajesh Chaudhary, Ishwor Nepal
March, 2024
Date: 2024/03/06
AMBITION COLLEGE
Mid-Baneshwor, Kathmandu
SUPERVISOR’S RECOMMENDATION
I hereby recommend that this project, prepared under my supervision by the team of Rupesh
Thakuri, Rajesh Chaudhary, and Ishwor Nepal, entitled “Fake News Detection System”, be
accepted as partially fulfilling the requirements for the degree of Bachelor of Science in
Computer Science and Information Technology. To the best of my knowledge, this is an
original work in Computer Science by them.
…………………………….
Supervisor
Ambition College
Mid-Baneshwor, Kathmandu
Tribhuvan University
Institute of Science and Technology
AMBITION COLLEGE
Letter of Approval
This is to certify that this project, prepared by the team of Rupesh Thakuri, Rajesh
Chaudhary, and Ishwor Nepal, entitled “Fake News Detection System” in partial fulfillment
of the requirements for the degree of Bachelor of Science in Computer Science and
Information Technology, has been well studied and prepared. In our opinion, it is
satisfactory in scope and quality as a project for the required degree.
Evaluation Committee
External
ACKNOWLEDGEMENT
We extend our sincere appreciation to Ambition College for providing us with the platform
to undertake this project. The college has been instrumental in shaping our learning
experience, offering the necessary resources and support for the successful completion of
this endeavor. We express our profound gratitude to Mr. Ramesh Kumar Chaudhary, our
Head of Department as well as our supervisor, for his unwavering guidance and support. His
expertise and experience were invaluable in developing this project, and his encouragement
and constructive feedback helped us overcome challenges along the way. We are very
thankful for his dedication and support, which were crucial to the project's completion. We
recognize the challenges we faced and thank all those who assisted us. This project would
not have been possible without the support of our college, Head of Department, supervisor,
and everyone who helped us along the journey.
Sincerely,
Rupesh Thakuri
Rajesh Chaudhary
Ishwor Nepal
ABSTRACT
The Fake News Detection System using the Random Forest algorithm offers an innovative
solution to combat the proliferation of misinformation. Leveraging the Random Forest
algorithm, the system automates the analysis of textual content to identify patterns
indicative of fake news. By extracting relevant features and employing ensemble learning
techniques, it improves accuracy and robustness in classification. Through repeated
training, the model learns to distinguish real news from fake news, helping people judge
whether something is trustworthy. Key features include the extraction of linguistic cues
and contextual information for informed decision-making.
Table of Contents
Acknowledgement .............................................................................................................. iv
Abstract……… .................................................................................................................... v
3.2.1 Technical Feasibility ......................................................................................... 11
3.4 Analysis.................................................................................................................... 15
5.1.2 Implemented programming languages .............................................................. 21
5.2 Testing...................................................................................................................... 25
References
Appendix
LIST OF FIGURES
Figure 1.1: Development Methodology Model ................................................................... 3
Figure 3.1: Use-Case Diagram of Fake News Detection system ....................................... 10
Figure 3.2: Working mechanism of the system ................................................................. 15
Figure 3.3: ER Diagram of fake news detection system .................................................... 16
Figure 3.4: Level 0 DFD .................................................................................................... 16
Figure 3.5: Level 1 DFD .................................................................................................... 17
Figure 4.1: Block Diagram ................................................................................................ 18
Figure 5.1: Confusion Matrix on Training Dataset ............................................................ 27
Figure 5.2: Confusion Matrix on Testing Dataset ............................................................. 27
LIST OF TABLES
Table 5.1: Testing for Pre-processing ................................................................................ 25
Table 5.2: Testing for Vectorization .................................................................................. 25
Table 5.3: ML Model Accuracy......................................................................................... 26
Table 5.4: Classification Report ........................................................................................ 28
LIST OF ABBREVIATIONS
BERT : Bidirectional Encoder Representations from Transformers
CDSSM : Convolution Dynamic Semantic Structural Model
CNN : Convolutional Neural Network
CSS : Cascading Style Sheet
D-CNN : Deep Convolutional Neural Networks
DFD : Data Flow Diagram
ERDs : Entity Relationship Diagrams
FNDS : Fake News Detection System
HTML : Hyper Text Markup Language
JS : JavaScript
LDA : Linear Discriminant Analysis
LSTM : Long Short-Term Memory
RFA : Random Forest Algorithm
SLD-CNN : Semi-Supervised Linear Discriminant Convolutional Neural Network
TF-IDF : Term Frequency - Inverse Document Frequency
UX : User Experience
CHAPTER 1 :
INTRODUCTION
1.1 Introduction
The Fake News Detection System (FNDS) is a system that helps to determine whether a given
piece of news is true or false (fake). Nowadays there are many news items, and many
different sources for them, that are hard to believe. Hence, this system plays a vital role
in determining the validity of news and supporting its trustworthiness. It filters the real
from the fake, saving people from receiving and believing wrong information circulated
through websites that have their own prime motives, such as misdirecting citizens or
causing tremendous economic changes in areas like shares, and many more.
The system takes valid news that has been validated through different sources via websites
like Kaggle. After the machine is trained with this data, a user can test news through the
website simply by entering the news title. The website is freely accessible, so anyone can
use it to test news; this also helps bring the right news to people free of cost, making it
cost-efficient. These systems are in constant evolution, yet they encounter challenges due
to the ever-changing landscape of misinformation, which calls for a comprehensive approach
uniting technology and the media. The system's main purpose is to spread good and correct
news to the user and establish an environment free of corrupt or fake news.
To merge the website and the system, the model prepared using the machine learning
algorithm (Random Forest Algorithm) is used inside a website built with tools like
Bootstrap and Tailwind, with the help of Flask. The machine is trained using the RFA in
Python (Jupyter) and tested for its accuracy. The algorithm helps to determine or conclude
a final result, giving a final decision based on the majority decision made by all the
binary decision trees. The website provides a good user interface with a search area where
the user can enter a news title to check the validity of the news. This process entails a
thorough analysis, leveraging pre-trained models, extensive datasets, and sophisticated
algorithms to ensure precise assessments. This implementation seamlessly combines powerful
machine learning techniques with a user-centric web interface, providing both accessibility
and dependability. The fake news detection system works by using a set of data to decide
whether news is true or fake, and it is good at separating fake news from real news on the
basis of this data. Most of the time, our system tends to label news as true, which is one
of its default behaviors.
1.2 Problem Statement
The rampant spread of fake news has emerged as a critical challenge, threatening the
integrity of information and societal cohesion. Conventional approaches to identifying
misinformation lack scalability and efficiency, necessitating innovative solutions. This
project aims to develop a robust system employing machine learning algorithms for
accurate fake news detection. Key objectives include achieving high accuracy, optimizing
computational efficiency, ensuring scalability, generalizing across fake news types, and
maintaining cost-effectiveness. The primary inquiry revolves around detecting fake news
through machine learning models while minimizing memory and storage costs. Given the
recent surge in fake news, this endeavor has gained urgency. Through rigorous analysis,
the project aims to shed light on machine learning's potential in combating misinformation.
By addressing pressing challenges, it aims to contribute valuable insights to the discourse
on information integrity.
1.3 Objectives
• To employ fact-checking techniques to verify the accuracy of claims made in news
articles and other content.
• To assess the credibility and reliability of the sources from which the information
originates.
1.4 Scope and Limitations
• Limited Access to Data
Access to comprehensive and diverse datasets for training and testing fake news
detection systems may be limited, hindering their ability to generalize across different
sources and types of misinformation.
• Privacy Concern
Fake news detection systems using user-generated data for training and assessment
could raise privacy issues related to gathering and retaining sensitive information.
1.5 Development Methodology
The agile software development process is a development approach in which the system is
developed quickly, collaborating with customers frequently, and adapting to changes
rapidly. Individuals and interactions are valued over processes and tools. It takes the
working software itself as the reference rather than documents, and quickly brings in any
changes needed, which eases the development of the system.
• Concept
The scope of the project is determined by identifying the most critical tasks that might
occur in future development of the project. Based on past research and studies, the plan
for the progress of the system and the features most needed by users was laid out. A
feasibility test is performed, which is a key part of determining the cost and system
requirements for effective and efficient running of the project in the market.
• Inception
The team members are gathered after all the decision-making about future prospects and
the challenges that may appear. The team needs to build a user-interface mock-up and lay
down the project architecture, with regular guidance and feedback from the supervisor and
the stand-in users (consider the classmates or other team members).
• Iteration
It is the building phase, where system development starts. It takes most of the project's
time; here the designer meets with the UI/UX developer to work out the layout of the
project. At the conclusion of the initial iteration or sprint, the objective is to establish
the fundamental functionality of the product. Subsequent iterations can then incorporate
additional features and adjustments. This phase is pivotal in Agile software
development as it enables developers to rapidly create functional software and make
adaptations to fulfill the client's requirements.
• Release
At this point, the product is nearing its release. Here, the quality assurance team needs
to conduct various tests to verify the complete functionality of the software. Initially,
the team will perform system testing to ensure the readiness of the code for release.
Crucially, any potential bugs identified by testers will be promptly addressed by the
developers. Once all these tasks are completed, the product's final iteration will enter
the production phase.
• Maintenance
As part of this stage, the software development team will offer continuous support to
ensure the proper functioning of the system and address any new issues. Additionally,
the team will deliver additional training to users and confirm their comprehension of
the product's usage. Developers may gradually introduce new iterations to enhance the
product with advanced features over time.
• Retirement
A product reaches the retirement phase due to two primary reasons:
1. It is replaced by new software.
2. The system becomes outdated or incompatible with the organization over time.
During this phase, the software development team will initially inform the users about
the decommissioning of the software. Subsequently, if the company identifies a
replacement, users will transition to the new system. Finally, the developers will
finalize any remaining end-of-life tasks and cease support for the current product.
1.6 Report Organization
This document is categorized into several chapters, further divided into sub-chapters
covering all the details of the project.
Chapter 1: It introduces the whole report. It includes a short introduction of the system,
its scope and limitations, and the objectives of the system.
Chapter 2: It covers the research methodologies used in the project. Background study and
literature review are covered.
Chapter 3: It is all about system analysis. It also includes the feasibility study and
requirement analysis.
Chapter 4: It describes the design of the system, including the working of the decision
tree and the Random Forest algorithm.
Chapter 5: It is about the implementation and testing procedures. It contains details
about the tools required to build the system. In the testing section, the different
testing processes are included.
Chapter 6: It includes the conclusion of the whole project. It also provides information
about what can further be achieved from this project.
CHAPTER 2 :
BACKGROUND STUDY AND LITERATURE REVIEW
The rise of fake news in the digital era has fundamentally altered the landscape of
information dissemination, posing unprecedented challenges to public discourse,
democratic processes, and societal cohesion. Enabled by the rapid expansion of digital
technologies and social media platforms, the spread of misinformation has become
pervasive, undermining the credibility of traditional news sources and eroding public trust
in the information ecosystem.
In response to this pervasive threat, the development of fake news detection systems has
emerged as a critical line of defense against the proliferation of false and misleading
information. These systems leverage a diverse array of technological tools and
methodologies, ranging from natural language processing and machine learning algorithms
to network analysis and data mining techniques. By analyzing textual content, user
behavior patterns, and information propagation dynamics, these systems aim to identify
and mitigate the dissemination of fake news across online platforms.
One of the primary challenges faced by fake news detection systems is the dynamic nature
of misinformation itself. As purveyors of fake news continuously adapt their tactics and
strategies to evade detection, detection systems must evolve and adapt in tandem to
effectively combat this evolving threat landscape. This necessitates ongoing research and
development efforts aimed at enhancing the robustness, accuracy, and scalability of
detection algorithms and methodologies.
The ethical and societal implications of fake news detection are complex and multifaceted.
Content moderation and censorship practices raise questions about freedom of speech,
editorial independence, and the role of technology platforms as arbiters of truth and
information dissemination. Striking a balance between combating fake news and
safeguarding fundamental democratic principles requires careful consideration of these
ethical dilemmas and the development of transparent and accountable governance
frameworks. Furthermore, the proliferation of algorithmic biases and discriminatory
practices in fake news detection algorithms underscores the importance of diversity, equity,
and inclusion in algorithmic decision-making processes. Ensuring that detection systems
are free from biases and prejudices requires proactive measures to address data biases,
algorithmic fairness, and the representation of diverse voices and perspectives within the
development and deployment of detection technologies.
A research article titled “Exploiting Network Structure to Detect Fake News,” authored by
three students from Stanford University, introduces a Neural Network approach for
identifying fake news. Their approach involves considering not only the article-specific
aspects like title and content but also incorporates the social context to enhance prediction
accuracy. This innovative strategy offers an avenue for refining prediction accuracy
without relying solely on advancing natural language processing techniques. [1]
A separate research paper titled “Fake News Detection: Deep Learning Approach” explored
the utilization of three distinct neural network models. The focus was on comparing these
models, primarily differing in how they processed the article's content and title. This
comparison highlights that the methodology used to handle text within an article
significantly impacts a model's performance. This observation is logical, given that the
content of an article typically stands as the primary basis for authenticating its credibility,
emphasizing the importance of text processing methodologies. [2]
In this study, the proposed method integrates a Hybrid Deep Neural Network Model,
incorporating both Convolution Dynamic Semantic Structural Model (C-DSSM) and Deep
Convolutional Neural Networks (D-CNN). In this combined architecture, the preliminary
layers of the C-DSSM extract salient features, while the subsequent D-CNN layers carry
out the categorization or classification phase. The hybrid model leverages the strengths
of both architectures to enhance performance and accuracy in the task at hand.
Experimental results demonstrate
that the proposed model achieved an impressive accuracy level of 92.60%. This success
emphasizes the efficacy of the hybrid model in delivering accurate categorization or
classification outcomes, highlighting its potential for various applications. [3]
In this study, the proposed method involves employing various algorithms, namely Naïve
Bayes, logistic regression, and Long Short-Term Memory (LSTM), to discern fake news.
The aim was to compare and contrast the outcomes generated by these distinct algorithms
in identifying deceptive information. Assessing the scope and accuracy of these algorithms,
it was observed that the LSTM algorithm notably stood out, showcasing the highest
accuracy level at 92.36%. This exceptional accuracy rate underscores the effectiveness of
LSTM in distinguishing fake news among the algorithms examined in this study. [4]
In this study, the proposed method employs the Bidirectional Encoder Representations
from Transformers (BERT), a sophisticated deep neural network model. BERT, being
inherently intricate, operates on deep neural networks. Its performance potential
significantly improves when handling large datasets, showcasing enhanced efficiency.
Specifically in this study, the BERT model utilized achieved an accuracy level of 52%.
This highlights the model's capability within the context of this research, signifying its
efficacy in analyzing and processing the given dataset. [5]
2.3.1 Snopes
Snopes is a highly esteemed fact-checking platform renowned for its role in authenticating
and debunking misinformation and fake news proliferating across the internet. Staffed with
a team of diligent fact-checkers, Snopes rigorously investigates claims and narratives
spanning diverse subjects and fields. Through meticulously crafted articles, Snopes
transparently presents the evidence and sources utilized in assessing the credibility of
assertions, thereby aiding readers in discerning truth from falsehood.
2.3.2 FactCheck.org
FactCheck.org stands as a distinguished entity within the realm of fact-checking, known
for its meticulous scrutiny of political claims, statements, and news articles to ensure
accuracy. While its primary focus lies in political discourse, FactCheck.org also extends its
coverage to encompass a broad spectrum of topics and issues. Through its comprehensive
articles and analyses, FactCheck.org systematically debunks false information, offering
readers valuable insights into the veracity of various assertions. Its dedication to thorough
investigation and evidence-based reporting reinforces its reputation as a trusted source for
reliable information, essential for fostering informed decision-making in today's media
landscape.
2.3.3 PolitiFact
PolitiFact emerges as a prominent player in the fact-checking arena, specializing in
scrutinizing the accuracy of assertions put forth by politicians and public figures.
Employing a unique "Truth-O-Meter" scale, PolitiFact categorizes statements along a
spectrum, ranging from "True" to "Pants on Fire," contingent upon the evidence and
credibility of the claims. Beyond mere rating, PolitiFact supplements its assessments with
in-depth explanations and evidence, furnishing readers with comprehensive insights into
the validity of the statements under scrutiny. This commitment to transparency and
evidence-based analysis enhances PolitiFact's credibility as a reliable source for evaluating
the accuracy of political discourse and aiding the public in navigating the complexities of
contemporary media.
CHAPTER 3 :
SYSTEM ANALYSIS
3.1.1 Functional Requirement
• The system must analyze textual content from diverse sources to identify potential
fake news instances.
• It should extract relevant linguistic and contextual features to aid in classification.
• Incorporate models like decision trees or deep learning algorithms to classify text
as authentic or fake.
• Generate detailed reports and visualizations summarizing detected fake news
instances for user interpretation and analysis.
3.1.2 Non-Functional Requirement
• Ensuring high accuracy in detecting fake news to prevent the spread of misinformation.
• Ability to handle increasing data volumes and user loads for effective detection.
• Implementing robust measures to protect user data and prevent unauthorized access.
• Providing clear explanations of classification decisions and detection processes.
• Adhering to legal and regulatory requirements related to data handling and privacy in
fake news detection.
3.2 Feasibility Study
The system is checked to ensure its successful and efficient operation in the current
environment, taking into account various factors such as changes in economic status,
technical and operational feasibility, and the scheduling feasibility of all parties involved
in system development, testing, and utilization. Assessing the system's adaptability to
dynamic environments and economic conditions is crucial for its sustained functionality
and relevance. Additionally, evaluating technical and operational feasibility helps identify
potential challenges and ensures smooth system integration and performance. Moreover,
adhering to a feasible schedule ensures timely delivery and adoption of the system,
enhancing its overall effectiveness and user satisfaction.
3.2.1 Technical Feasibility
The project stands as technically feasible, aligning with current technology standards
encompassing both hardware and software components. The outlined technical
requirements for this project include a laptop with a minimum of 4GB RAM equipped with
GPU and a high-speed internet connection. This application is compatible with most
contemporary personal computers, meeting the specified hardware and software
prerequisites. The system architecture is designed to be scalable, allowing for future
enhancements and integration with additional features.
3.2.2 Operational Feasibility
This project can be executed with minimal human resources; two developers are engaged in
the project, which surpasses the required manpower. The project's objective is to develop
a Fake News Detection system that identifies fake news within the provided dataset. The
system aims to employ cutting-edge machine learning algorithms to analyze textual
content, metadata, and user interactions to determine the credibility of news articles.
Through rigorous data preprocessing, feature engineering, and model training, the system
endeavors to accurately differentiate between genuine and fake news articles, thereby
fostering media literacy and combating the spread of misinformation.
3.2.3 Schedule Feasibility
The making of the system, i.e. the whole project, starts in month 1 and will take
approximately 5 months to complete. The first task, defining the requirements, will take
about 1 month; the second task, the prototype, will run from the last weeks of month 1 to
the first weeks of the 5th month. Feedback will be received throughout the period of
system development, and the software will be finalized by the later weeks of the 5th
month.
3.3 Methodology
The Fake News Detection System (FNDS) helps to build a model that can judge a news item
from its title, determining whether the news is true or fake. The system uses the Random
Forest Algorithm (RFA), which uses binary decision trees to produce the result from
different sets of input data. Some of the processes involved in the working of the system
are as follows:
3.3.1 Data Collection
In this phase the data is gathered from a trusted source. The data exists in
comma-separated values (.csv) format; this data is later used to train the model.
Different trusted sources like Kaggle can be used to get the data. It consists of
different attributes that define a record. For the data of the FNDS, attributes like news
title, text, subject, date, and label seem most appropriate, where the label attribute is
considered the critical attribute: it contains the result, true or false, on which the
model is trained.
3.3.2 Data Pre-processing
Data preprocessing is the way of reducing the extra amount of data. It helps to reduce
the training time of the model. Different processes like tokenization, lowercasing,
stopword removal, and lemmatization or stemming can be used for data pre-processing.
Tokenization: It is the way of breaking down an entire collection of sentences into an
array of words; it can also be described as splitting a string or input text into a list
of tokens. Tokens help in understanding the context and in interpreting the meaning of
the text by analyzing the sequence of words.
Make Lowercase: Some data may create a different impression despite having the same
actual word combination because of lower- and upper-case differences. For example, ‘The’
and ‘the’ could be treated as different words because of the difference in ‘T’. Hence,
making all the words lowercase helps remove this ambiguity in the sentence.
Remove Stopwords: Stopwords are words that carry no specific meaning on their own, such
as articles (a, an, the). Hence, removing the stopwords helps reduce the noise and the
dimension of the feature set during pre-processing.
Stemming and Lemmatization: Stemming is used to normalize words into their base or root
form. Sometimes it changes a word into a root form that has no meaning, which can cause
problems; this is solved by lemmatization, which groups the different inflected forms of
a word (called a lemma) and produces base words that do have meaning.
For example, for the word ‘lazy’ the stemming process produces ‘lazi’, which has no
meaning in the English dictionary, whereas lemmatization produces the exact word ‘lazy’.
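As an illustration of this difference, the following minimal sketch (assuming the NLTK
library, which this report does not name) reproduces the ‘lazy’ example:

# Sketch of stemming vs. lemmatization using NLTK (an assumed library choice).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)          # lemmatization needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("lazy"))                   # 'lazi' -- not a dictionary word
print(lemmatizer.lemmatize("lazy", pos="a"))  # 'lazy' -- a valid lemma (adjective)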
3.3.3 Vectorization
Vectorization is the process in which the text data is converted to vectors that can
later be processed easily by the algorithm. It can be done using a bag of words
(CountVectorizer) or TF-IDF (Term Frequency - Inverse Document Frequency). Converting the
words into matrix form helps in reducing dimensionality and in feature extraction. In the
resulting matrix, each row corresponds to a document and each column corresponds to a
word or term.
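A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer (the report does
not name a specific library):

# Sketch of TF-IDF vectorization with scikit-learn (assumed library).
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "government announces new economic policy",
    "celebrity spotted with alien at secret base",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(titles)       # rows = documents, columns = terms

print(X.shape)                             # (2, number of unique terms)
print(vectorizer.get_feature_names_out())  # the terms behind the columns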
3.3.4 Model Building
Before training the model, the data is first split into train and test sets after
applying a vectorization method. Then the main algorithm, i.e., the Random Forest
Algorithm, is implemented to build the model. This algorithm works on the basis of
decision trees, where the decision of the majority of the decision trees is considered
the final result or output; the combination of decision trees makes up the Random Forest.
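A minimal sketch of this step, assuming scikit-learn, with X as the TF-IDF matrix and y
as the 0/1 labels from the dataset (the 80/20 split follows the scheme described in
Chapter 5):

# Sketch of model building with scikit-learn (assumed library).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: TF-IDF feature matrix, y: 0/1 labels -- produced by the previous steps.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Each of the 100 trees votes; the majority vote is the final prediction.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)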
3.3.5 Model Evaluation
The preciseness of the model is checked for any possible errors. The accuracy score,
confusion matrix, and classification report can be used to check the accuracy and to
catch any errors occurring in the building of the model. Here the confusion matrix is a
2×2 matrix, where the off-diagonal cells C12 and C21 show the number of errors that occur
while evaluating the model.
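A minimal sketch of these checks, again assuming scikit-learn and the names from the
previous sketch:

# Sketch of model evaluation with scikit-learn (assumed library).
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

print(accuracy_score(y_test, predictions))    # ratio of correct predictions

# 2x2 matrix; the off-diagonal cells count the misclassifications.
print(confusion_matrix(y_test, predictions))

# Per-class precision, recall and F1-score.
print(classification_report(y_test, predictions))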
The model is then attached to the website, the actual user interface, where the user
tests the truthfulness of news using the news title. The synchronization of the model and
the website is made possible using Flask. The system uses a prediction pipeline which
performs all the data preprocessing and prepares a method; this method can be called with
an input (the news title), giving the result, true or fake, for the news.
Figure 3.2: Working Mechanism of the System
3.4 Analysis
The development of this system follows a Structured Approach, beginning with the analysis
phase where a Conceptual Model is created through structured design principles. Structured
Design encompasses crafting the process model of the system utilizing DFD diagrams and
outlining the fundamental flowchart of the system's operations. This methodical approach
ensures a systematic and organized framework for system development, aiding in the clear
delineation of processes and functionalities within the system.
3.4.1 Data Modelling
In the context of a fake news detection system, data modeling plays a crucial role in
conceptualizing data elements and constructing a structured data model for storage within
the system's database. It helps in visually organizing data and enforcing compliance with
regulatory standards, business rules, and other relevant guidelines. Techniques like Entity-
Relationship Diagrams (ERDs) are commonly used to represent relationships between
different data elements and illustrate the flow of information within the system. This
graphical representation enhances comprehension of the system's architecture and
facilitates the implementation of data validation, consistency checks, and security protocols
to uphold the integrity and credibility of the data utilized for fake news detection.
3.4.1.1 ER Diagram
3.4.2 Process Modelling
In order to represent the process model, DFDs are used. The processes used in the system
and their corresponding flows are shown in the DFDs.
3.4.2.2 Level 1 DFD
Datasets for fake news are easily available on platforms like Kaggle and UCI. For our
project we used the Kaggle “FAKE NEWS” dataset, which indicates whether each news item is
real or fake. It has 4460 rows and 5 attributes, with some null or missing values. The
attributes Id, Title, Author, and Text are the features of the dataset, and the Label
attribute is our target attribute, i.e., the output based on these features.
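A minimal sketch of loading and inspecting this dataset with pandas (the file name
train.csv is an illustrative assumption):

# Sketch of loading the Kaggle dataset with pandas (file name assumed).
import pandas as pd

df = pd.read_csv("train.csv")
print(df.shape)              # expected: (4460, 5)
print(df.columns.tolist())   # ['id', 'title', 'author', 'text', 'label']
print(df.isnull().sum())     # count the null / missing values per attribute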
CHAPTER 4 :
SYSTEM DESIGN
4.1 Design
To implement the system, the Random Forest Algorithm (RFA) is used, which helps to
determine or conclude a final result, giving a final decision based on the majority
decision made by all the binary decision trees. Hence, to implement this algorithm, an
understanding and implementation of the decision tree is necessary first, since it is the
combination of trees (decision trees) that gives the random forest its name.
Decision Tree
From the set of data provided, and according to the target attribute and the remaining
attributes, a decision tree can be built in the following way:
Step 1: Choose a target attribute within the attributes of the data given to train the model.
Step 2: Then the information gain is found out as:

IG = -\left[ \frac{P}{P+N}\log_2\left(\frac{P}{P+N}\right) + \frac{N}{P+N}\log_2\left(\frac{N}{P+N}\right) \right]

Step 3: Then the entropy is determined using the remaining attributes. One of the
remaining attributes will be the root of our decision tree:

\text{Entropy} = \sum_{i=1}^{n} \frac{P_i + N_i}{P + N} \cdot I(P_i, N_i)
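As a toy illustration of the Step 2 formula, for a split with P = 9 positive and N = 5
negative examples (illustrative values, not from the report's dataset):

# Toy computation of the information measure for P = 9, N = 5.
import math

def information(p, n):
    total = p + n
    fp, fn = p / total, n / total
    return -(fp * math.log2(fp) + fn * math.log2(fn))

print(round(information(9, 5), 3))   # 0.94 bits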
The majority vote over the n decision trees then gives the final result, as in the
following pseudocode:

for i in 1 .. n:                        # n = number of decision trees
    if decision_tree_result[i] == true:
        true_count++
    else:
        false_count++
if true_count > false_count:
    result = true
else:
    result = false

Random Forest
Step 1: The data that is given by the admin is considered to be the observed data.
Step 2: From the observed data set a bootstrap data set is taken.
A bootstrap data set is a collection of records picked at random, with replacement, from
the observed dataset. The same record or event from the observed data may be repeated
more than once, or may not appear at all, when drawing the bootstrap data set. But the
less repetition, the better.
Step 3: Then a decision tree is built from the data in the bootstrap data set.
While making the decision tree, a subset of the variables is used to make the node at
each step. The one with the highest entropy is chosen as the root node from any two
randomly selected variables (attributes).
Step 4: Then a random sample is taken again, leaving the target value as unknown.
Step 5: Now the sample value is passed through the decision trees, letting them decide
the value for the target attribute.
Step 6: The majority decision made by all the decision trees is considered to be the
final value for the target attribute.
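A minimal sketch of steps 1-6, using scikit-learn's DecisionTreeClassifier as the base
tree (an assumed choice; in practice scikit-learn's RandomForestClassifier performs the
same procedure internally):

# Sketch of the bootstrap-and-vote procedure behind the random forest.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_predict(X_train, y_train, x_sample, n_trees=100, seed=42):
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    votes = []
    for _ in range(n_trees):
        # Step 2: draw a bootstrap sample (with replacement) from the observed data.
        idx = rng.integers(0, n, size=n)
        # Step 3: build a decision tree on the bootstrap sample, choosing splits
        # from a random subset of the attributes.
        tree = DecisionTreeClassifier(max_features="sqrt")
        tree.fit(X_train[idx], y_train[idx])
        # Step 5: pass the unseen sample through the tree and record its decision.
        votes.append(tree.predict(x_sample.reshape(1, -1))[0])
    # Step 6: the majority decision of all trees is the final value.
    return max(set(votes), key=votes.count)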
CHAPTER 5 :
IMPLEMENTATION AND TESTING
5.1 Implementation
After the completion of the analysis and design of the whole system, the implementation
is carried out, covering how the system was developed and how it works. Various tools
which are already available have been used to develop the system; the front-end,
back-end, and other tools which have been used are discussed in this chapter.
• Visual Studio Code: Visual Studio Code stands as a nimble yet robust source code editor
that operates on desktop platforms, including Windows, macOS, and Linux. It boasts native
backing for JavaScript, TypeScript, and Node.js, accompanied by a diverse range of
extensions supporting additional languages and runtimes.
• Jupyter: Jupyter represents an open-source web platform enabling users to generate and
distribute documents featuring live code, equations, visualizations, and explanatory text.
Its flexibility extends to accommodating various programming languages, including
Python, rendering it an adaptable instrument for tasks spanning data analysis, scientific
computations, and machine learning endeavors.
5.1.2 Implemented programming languages
Front-end Tools
The "Fake News Detection System" utilizes a combination of front-end tools including
HTML, CSS, and JavaScript to create an interactive and user-friendly interface for users to
interact with the system.
HTML is used to define the structure of the system's user interface. It helps organize
content, create forms for user input, and establish the overall layout of the application.
CSS is employed to enhance the visual appeal of the user interface. It ensures a consistent
and attractive design, making the system user-friendly. It's crucial for creating a responsive
and aesthetically pleasing layout.
JavaScript is utilized for implementing interactive features in the system, including
real-time updates, dynamic content loading, and handling user interactions. It is also
used to provide instant feedback on news credibility and to facilitate user engagement.
Back-end Tools
In a fake news detection system built using Python, packages like NumPy and Pandas are
employed to perform mathematical operations on the dataset and facilitate model training.
These packages are pivotal for handling numerical computations and organizing data
efficiently.
Flask is utilized to integrate the front-end and back-end components of the system. Flask
acts as a communication bridge between the user interface and the underlying functionality,
ensuring smooth interaction between the user and the system.
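A minimal sketch of this bridge (route, template, and parameter names here are
illustrative assumptions; the input arrives via the GET method, as described in Chapter
6):

# Sketch of the Flask bridge between the web page and the model.
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route("/predict", methods=["GET"])
def predict():
    title = request.args.get("news_title", "")   # text typed into the search area
    result = prediction_pipeline(title)          # assumed helper returning "True" or "Fake"
    return render_template("result.html", result=result)

if __name__ == "__main__":
    app.run(debug=True)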
Database
In the Fake News Detection System, databases play a vital role in storing, managing, and
retrieving relevant data for analysis and model training. Databases store a corpus of news
articles along with associated metadata such as publication date, source, and article content.
Databases also store user as well as admin information, such as login details and user
interactions within the system, which helps to improve the accuracy of the fake news
detection algorithms and personalize the user experience.
Login and Registration
Login and registration are the main methods of identifying the right user and giving the
defined access to the defined users. The login section is the default section of the
system, hence it is the first page to show up when the system starts. Upon accessing the
login section, the system first checks whether a user or an admin is logging in. This is
done with the help of the role assigned to the user and admin; the role is checked
against the user and admin database tables, which verifies whether the account belongs to
an admin or a general user. If a user logs in, they are directed to the user landing
page, where they can check whether a news item is fake or real; if an admin logs in, the
admin page opens, where the admin can insert, delete, or modify the news. If the provided
credentials are not found in the database, users are prompted to sign up for an account.
After completing the signup process, users are redirected back to the login section to
proceed with logging in using their newly created credentials. This flow ensures smooth
navigation and clear guidance for users and administrators interacting with the system,
while also providing authentication and verification of newly registered, previously
anonymous, users.
Data Pre-processing
In this model the system loads the dataset and cleans the data: it fills the null values,
removes noise, and makes the data suitable so that there is less chance of overfitting or
underfitting on the datasets. In the preprocessing phase, we incorporate a series of
essential techniques to refine our text data. Firstly, tokenization divides the text from
the datasets into individual tokens or words, facilitating further analysis; it also
helps in recognizing text patterns. Next, lowercasing standardizes the text by converting
all characters to lowercase, ensuring uniformity and consistency in our dataset and
removing the ambiguity between otherwise identical words such as ‘The’ and ‘the’.
Stopword removal eliminates common but insignificant words that may obscure the
underlying meaning of the text; it uses the English dictionary to identify stopwords that
carry no specific meaning in a sentence, like a, an, and the. Stemming and lemmatization
then further refine the words by reducing them to their root or base forms, respectively.
This process enhances the efficiency and effectiveness of our model by minimizing
variations and focusing on the essential semantic content of the text, giving the text a
precise meaning while maintaining semantic and grammatical correctness. By integrating
tokenization, lowercasing, stopword removal, stemming, and lemmatization into our
preprocessing pipeline, we optimize the quality and relevance of our text data for
subsequent analyses and modeling tasks.
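A minimal sketch of such a pipeline, assuming NLTK (the report does not name the library
actually used):

# Sketch of the preprocessing pipeline described above (NLTK assumed).
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                   # lowercasing
    text = re.sub(r"[^a-z\s]", " ", text)                 # strip noise / punctuation
    tokens = nltk.word_tokenize(text)                     # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]    # stopword removal
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # lemmatization

print(preprocess("The lazy dogs were sleeping on the mats"))
# -> lazy dog sleeping mat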
Data Vectorization
After preprocessing the data, the next step involves extracting features by converting
textual data into numerical form using the TF-IDF vectorization process. TF-IDF stands
for "Term Frequency-Inverse Document Frequency," a method that assigns numerical
values to words based on their importance in a document. This process helps in creating
feature vectors that represent the unique contribution of words to the text data. In TF-IDF,
the Term Frequency (TF) measures how often a word appears in a document, while the
Inverse Document Frequency (IDF) evaluates the importance of a word by considering how
common it is across all documents. By combining TF and IDF, TF-IDF assigns weights to
words that reflect their significance in a document beyond mere frequency.
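In symbols, one common form of this weighting (the report does not pin down the exact
variant) is, in LaTeX notation:

\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}

where tf(t, d) is the frequency of term t in document d, N is the total number of
documents, and df(t) is the number of documents containing t.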
After completing the TF-IDF vectorization process, the data is split into test and train
sets with proportions of 20% and 80%, respectively. To ensure a balanced split, the
target variable is stratified. This approach maintains similar proportions of the
different classes within the training and testing datasets, which is crucial for model
training and evaluation.
After training the Random Forest and Decision Tree models, along with the TF-IDF
vectorizer, it is crucial to save them for future deployment. Using Python's pickle
library, these components are serialized, typically in .pkl format. This preserves the
parameters and structures of both models, ensuring consistency in predictions, and saving
the TF-IDF vectorizer guarantees uniform text preprocessing across new data.
Subsequently, loading these saved objects with pickle.load() allows seamless integration
for predictions without retraining. This streamlined approach expedites deployment in
diverse environments and fosters efficient model reuse.
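A minimal sketch of this persistence step (the file names are illustrative assumptions):

# Sketch of saving and reloading the trained artifacts with pickle.
import pickle

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)            # the trained Random Forest model
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)       # the fitted TF-IDF vectorizer

# Later, at deployment time, reload without retraining:
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)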
Prediction pipeline
The prediction pipeline outlined in the provided Python class `Prediction` offers a
systematic approach to classifying news headlines as "True" or "Fake" using a trained
model and TF-IDF vectorization. Upon initialization, the class processes input data through
text preprocessing, transforms it into numerical form with a TF-IDF vectorizer, and then
utilizes the model for predictions. The pipeline interprets the model's output, determining
whether a news headline is classified as "True" or "Fake based on the prediction result. This
structured workflow ensures efficient handling of data, feature transformation, and accurate
classification, providing a clear and concise method for analyzing news headline
authenticity.
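A minimal sketch of such a class, under the assumption that the saved model and
vectorizer follow the interfaces sketched earlier (the report's exact class body is not
shown, and the label-to-string mapping is assumed):

# Sketch of the described Prediction class (names and mapping assumed).
class Prediction:
    def __init__(self, model, vectorizer):
        self.model = model                  # trained classifier loaded via pickle
        self.vectorizer = vectorizer        # fitted TF-IDF vectorizer

    def predict(self, title):
        cleaned = preprocess(title)                       # text preprocessing step
        features = self.vectorizer.transform([cleaned])   # numerical TF-IDF form
        label = self.model.predict(features)[0]
        return "True" if label == 1 else "Fake"           # assumed 1 = true news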
5.2 Testing
Testing involves the systematic examination of various aspects of the system to identify
potential defects, errors, or areas of improvement. This process ensures that the system
functions according to specifications and meets the desired requirements. Result analysis
entails the interpretation and evaluation of the outcomes obtained from the testing phase.
This helps to ensure the system is functioning well and meets the user’s needs.
Testing excerpt (row 3): the headline “TEXAS CHURCH SHOOTER: Years Before ‘Soft Target’
Attack, Gunman Tried to Carry Out Death Threats on CIA Linked Air Force Base”, vectorized
as [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], was classified as Fake.
Accuracy
The accuracy of our model is calculated as the ratio of correct predictions to the total
number of predictions. After evaluating our model, we achieved an accuracy of 84% on the
test data and 81% on the training dataset.
Confusion Matrix
The confusion matrix provides a detailed breakdown of the model's predictions compared
to the actual values. Here is the confusion matrix for our model:
Figure 5.1: Confusion Matrix on Training Dataset
True Negatives (TN): 436
Precision assesses the accuracy of the model's positive predictions by computing the ratio
of true positive predictions to the total positive predictions. In our scenario, precision is
determined as 99% for Fake and 82% for True.
Recall gauges the model's capacity to detect all pertinent instances by calculating the ratio
of true positive predictions to the total actual positives. Our model achieved recall scores
of 0.5 for Fake and 1 for True.
The F1-score, a measure derived from the harmonic mean of precision and recall, offers a
balanced evaluation, particularly useful in scenarios with class imbalances. Following our
model assessment, we obtained F1-scores of 0.68 and 0.90 for Fake and True respectively.
CHAPTER 6 :
CONCLUSION AND FUTURE RECOMMENDATION
6.1 Conclusion
We implemented the system successfully using decision trees and a random forest: a text
(the title of a news item) is taken as input from the website using the GET method,
different preprocessing steps are applied to the input text using the preprocessing
pipeline, and the news is then predicted using our trained model after transforming the
preprocessed news title. The system was thus able to show the result on the website,
where the user can interact with it. For further clarification, accuracy-related charts
such as the confusion matrix and a support line chart were also displayed on the
dashboard.
The system also possesses a login system with proper verification of the user. It
contains roles that help to check whether an account belongs to an admin or a user and
redirect them to their specific pages. The user is able to check the news, whereas the
admin is able to add, delete, or update the news content. Therefore, a fake news
detection system was successfully developed using a random forest model, with user and
admin roles and a proper login verification system.
The accuracy was evaluated using the test data, where the result was determined to be
around 84%. From the evaluation of the confusion matrix, the major problem was seen in
predicting fake news, which was often predicted as true. This problem is expected to be
reduced by an increase in input data and an increase in the tree height, although this
affects the training time of the model.
Looking ahead, there are several ways we can improve fake news detection systems. We
can explore more advanced ways to understand language. Instead of just looking for
specific words, we can use smarter techniques to understand the true meaning behind the
words. We can teach these systems to recognize different types of fake news, not just the
obvious ones. They could learn to identify more subtle lies or tricky stories. We should
expand the scope of these systems to include images and videos, not just written text.
Sometimes, false information is spread through pictures or videos too. It would also be
helpful to customize these systems for different situations or users. This means they could
work better for different languages, cultures, or the ways people use information.
REFERENCES
[1] M. Rao, “Exploiting Network Structure to Detect Fake News”, Stanford University,
School of Computer Science, 2018.
[2] A. Thota, “Fake News Detection: A Deep Learning Approach”, SMU Data Science Review,
2018.
[3] R. Roshan Karwa, R. Sunil Gupta, “Automated hybrid Deep Neural Network model for fake
news identification and classification in social networks”, Journal of Integrated Science
and Technology, Volume 10, No 2, 2022.
[4] S. Kumar, T. Doren Singh, “Fake News detection on Hindi news dataset”, Global
Transitions Proceedings, 2022, doi: https://doi.org/10.1016/j.gltp.2022.03.014.
[5] E. Maity, A. Tomar, R. Peter, “Fake News Detection System: In Hindi Data Set Using
BERT”, International Research Journal of Modernization in Engineering Technology and
Science, Volume 4, May 2022.
[6] R. Mansouri, M. Naderan-Tahan, M. Javad Rashti, “A Semi-supervised Learning Method
for Fake News Detection in Social Media”, Iranian Conference on Electrical Engineering
(ICEE), 2020, doi: 10.1109/ICEE50131.2020.9261053.
Appendix
Add News
Home Page
About Section