Fake News - Machine Learning
ABSTRACT
Due to recent events in American politics, fake news, or maliciously fabricated media, has taken a central role in American political discourse. A litany of verticals spanning national security, education, and social media are currently scrambling to find better ways to tag and identify fake news, with the goal of protecting the public from misinformation. Our goal is to develop a reliable model that classifies a given news article as either fake or true. Our model is designed to emulate the functionality of the BS Detector, a popular Chrome extension that automatically flags articles, websites, and content as BS. Our simple models achieve accuracy of up to 82%.
1 INTRODUCTION
With the advent of the internet, information is freely available to everyone. This is an advancement in human history, but at the same time it blurs the line between true media and maliciously fabricated media. A freely available tool to verify the trustworthiness of a news article is needed to filter the information we receive every day. With this motivation, and building on our knowledge of Python and machine learning, we developed our final project. Given the text of a news article and its headline as input, we have developed an algorithm that can distinguish between fake and true news with 83 percent accuracy. The project consists of several binary classification (true/fake) algorithms. We use a combination of Python and MATLAB to create our models. Taking 8,071 true news samples (New York Times, CNN, BBC) and 4,094 fake news samples, we convert the words into numerical features using NLP tools in Python. We pre-processed and cleaned this large dataset and tested it with Naive Bayes, SVM, and logistic regression. Based on the indicative tokens identified through Naive Bayes, we then used one-layer and two-layer neural networks for fake news identification.
2 RELATED WORK
2.1 TEXT PROCESSING
Text processing has evolved over time from simple tokenization to contextual clustering. The most common approach, simply tokenizing the data and applying classification algorithms, is computationally cheap in preprocessing but increases the feature space significantly (Mihalcea & Strapparava, 2009). Centering resonance analysis (CRA) is one alternative: it identifies the key nodes (dominant words) and the most important words correlated with each node (Papacharissi & Oliveira, 2012). For our project we use simple tokenization with removal of stop words, and also apply lemmatization. This serves as a middle ground between the two approaches described above. The context of the words, however, is not taken into account. Third-party tools such as Stanford GloVe (Pennington, Socher, & Manning, 2014) help with creating contextual word tokens.
The simplest and most popular classification algorithms are Naive Bayes and Support Vector Machines (SVM). A Naive Bayes classifier with Laplace smoothing helps reveal the basic nature of given data (Oraby et al., 2015), and we use this method to identify the causal features in our project. With SVM, one can experiment with different kernels such as the polynomial, RBF, sigmoid, and Gaussian kernels (Zhang et al., 2012). Both of the above methods are helpful for classification, so we used them to analyze which features are most causal. Beyond that, we experiment with deep learning activation functions to obtain our classification model.
3 DATASET AND FEATURES

We gathered our fake news data of 4,094 samples (headlines plus bodies) from Kaggle [7]. This fake data was collected via a Chrome extension called BS Detector, which allows Chrome users to tag a news article as BS while reading it. We obtained our true news data of 8,071 samples from Kaggle [8] as well; these are news articles from trusted news sources. The data was in Excel format and required substantial manual as well as syntactic preprocessing.
We have a total of 12,165 samples, which we split into train, validation, and test sets in a 70:25:5 ratio. Each sample corresponds to a news article headline and body. We used NLTK in Python to tokenize the body and title. After removing the stop words (referencing the NLTK stop-word list), we lemmatized the rest of the data. To obtain the labeled token list for each sample, we apply the following processing steps (a code sketch of this pipeline follows below):
1. Tokenize the body and headline with the Punkt sentence tokenizer from the NLTK NLP library. This tokenizer runs an unsupervised machine learning algorithm pre-trained on a general English corpus, and can distinguish between sentence punctuation marks and the position of words in a statement.
2. Remove stop words (per the NLTK stop-word list) and lemmatize the remaining tokens.
3. Tag each sample with the tokens obtained from the entire headline set and body set.
For titles, we kept only the tokens with a frequency of more than 10 over the entire title dataset, and for bodies, we kept only the tokens with a frequency of more than 200 over the entire dataset. This leaves us with a total of 5,261 tokens. To remove the symbols that came along with the data (for example, xd, *, dxl), we kept only tokens with a string length greater than 3.
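A minimal sketch of this preprocessing pipeline using NLTK is shown below. The frequency thresholds follow the report, but the function names, the filtering order, and the use of word_tokenize (which applies the pre-trained Punkt model for sentence splitting) are our own assumptions.

```python
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the pre-trained Punkt model and word lists.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    """Tokenize, drop stop words and short/symbol tokens, lemmatize."""
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens
              if t.isalpha() and len(t) > 3 and t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

def frequent_tokens(docs, min_count):
    """Keep tokens whose corpus-wide frequency exceeds min_count
    (more than 10 for titles, more than 200 for bodies)."""
    counts = Counter(t for doc in docs for t in preprocess(doc))
    return {t for t, c in counts.items() if c > min_count}
```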
We used Naive Bayes to obtain the tokens with the highest posterior probabilities, which we then used for deep learning and logistic regression.
4 METHODS
Two models were developed independently in order to achieve our desired objective: an Average-Hypothesis model and a neural network. In this section, the methodology and theory behind each model are discussed in detail.
4.1 AVERAGE-HYPOTHESIS MODEL
Our Average-Hypothesis model combines the hypotheses obtained from Naive Bayes, logistic regression, and SVM by averaging the output probabilities obtained from each model. The aim of averaging is to obtain a model that is less susceptible to over-fitting than a model that uses only one of the constituent methods. Given our large feature set of 5,078 features, certain judgment calls were made and validated to integrate these models. Within the Average-Hypothesis model, the Naive Bayes algorithm (which includes Laplace smoothing) and the SVM algorithm were run using all 5,078 tokens, while logistic regression was performed using only the 20 tokens determined to be most indicative of a sample's classification. The following sections delineate the theory used in our implementations of these three learning algorithms.
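As a sketch, assuming each constituent model outputs a per-sample probability of an article being fake (the names and the 0.5 threshold are our assumptions, not specified in the report), the averaging step could look like:

```python
import numpy as np

def average_hypothesis(p_nb, p_svm, p_lr, threshold=0.5):
    """Average the fake-news probabilities from Naive Bayes, SVM, and
    logistic regression, then threshold the mean for a fake/true label."""
    p_avg = (p_nb + p_svm + p_lr) / 3.0
    return (p_avg >= threshold).astype(int)

# Example: probabilities from the three models for four articles.
p_nb  = np.array([0.9, 0.2, 0.6, 0.1])
p_svm = np.array([0.8, 0.3, 0.4, 0.2])
p_lr  = np.array([0.7, 0.1, 0.5, 0.3])
print(average_hypothesis(p_nb, p_svm, p_lr))  # [1 0 1 0]
```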
4.1.1 NAIVE BAYES

Given the size of our feature space, we determined that Naive Bayes was an appropriate method to begin our analysis.
Drawing from the lecture notes, the maximum-likelihood estimates for the model parameters are:
$$\phi_{j|y=1} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 1\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 1\} + 2}\,; \qquad \phi_{j|y=0} = \frac{\sum_{i=1}^{m} 1\{x_j^{(i)} = 1 \wedge y^{(i)} = 0\} + 1}{\sum_{i=1}^{m} 1\{y^{(i)} = 0\} + 2}$$
Using our Naive Bayes algorithm, we identified the top-k tokens found to be most indicative of an example's classification. This was computed by finding the k/2 tokens with the highest posterior probability of appearing in fake news, and the k/2 tokens with the lowest posterior probability of appearing in fake news. The following expression was used to rank the tokens by how strongly they indicate fake news:
$$\text{Token Rank} = \frac{\exp(\phi_{j|y=1})}{\exp(\phi_{j|y=0})}$$
The k/2 most indicative tokens for each class were used to form a new feature space for our logistic regression model. These tokens were also examined heuristically to ensure they pass the eye test, given our team's knowledge of contemporaneous fake news.
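A minimal sketch of the parameter estimates and token ranking above (the array layout and function name are our assumptions):

```python
import numpy as np

def top_indicative_tokens(X, y, k):
    """X: (m, n) binary token-occurrence matrix; y: length-m labels
    (1 = fake, 0 = true). Returns the indices of the k/2 tokens most
    indicative of fake news and the k/2 most indicative of true news."""
    fake, true = (y == 1), (y == 0)
    # Laplace-smoothed estimates of phi_{j|y=1} and phi_{j|y=0}.
    phi_fake = (X[fake].sum(axis=0) + 1) / (fake.sum() + 2)
    phi_true = (X[true].sum(axis=0) + 1) / (true.sum() + 2)
    rank = np.exp(phi_fake) / np.exp(phi_true)  # Token Rank from above
    order = np.argsort(rank)                    # ascending by rank
    return order[-(k // 2):], order[:k // 2]
```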
4.1.2 SVM
Due to its robustness, a support vector machine (SVM) was used as the second algorithm in our Average-Hypothesis model. The SVM uses a hinge loss that seeks to maximize the margin between the two classes of data, together with a second-order Gaussian kernel that operates on the full 5,078-token feature space. The kernel is given by the following expression:
$$G(x; \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{x^2}{2\sigma^2}\right)$$
Note that this expression is given for the 1-D case. In retrospect, the selection of this high-order kernel seems rather naive, since it may have caused the SVM model to overfit the training set.
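For illustration, a comparable setup can be reproduced with scikit-learn's RBF kernel, which equals the Gaussian kernel up to the parameterization gamma = 1/(2*sigma^2). The report's implementation was hand-written, so this is a stand-in rather than the original code, and the toy data is ours:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in data; in the project X would be the (m, 5078) binary
# token matrix and y the fake (1) / true (0) labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50)).astype(float)
y = rng.integers(0, 2, size=200)

svm = SVC(kernel="rbf", gamma="scale", probability=True)
svm.fit(X, y)
p_svm = svm.predict_proba(X)[:, 1]  # per-sample fake-news probability
```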
4.1.3 LOGISTIC REGRESSION

Due to its simplicity and elegance, logistic regression (LR) was used as the third algorithm within the Average-Hypothesis model. The LR model uses gradient descent, $\theta := \theta - \alpha \nabla_\theta J(\theta)$, to converge onto the optimal set of weights $\theta$ for the training set, where $J$ is the loss function and $\alpha$ is the learning rate. For our model, the hypothesis used is the sigmoid function:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}$$
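A minimal sketch of batch gradient descent with this hypothesis (the learning rate, iteration count, and cross-entropy loss gradient are standard choices, not taken from the report):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.1, iters=1000):
    """X: (m, 20) matrix over the top-20 indicative tokens; y: 0/1
    labels. Minimizes the logistic loss J(theta) by batch gradient
    descent and returns the learned weights."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)        # hypothesis h_theta(x)
        grad = X.T @ (h - y) / m      # gradient of J(theta)
        theta -= alpha * grad         # theta := theta - alpha * grad
    return theta
```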
4.2 NEURAL NETWORK
A one-hidden-layer neural network was trained on the 80 tokens identified as most causal to a source's classification. The hidden-layer neurons use the sigmoid activation function, and the output layer uses the softmax activation. ReLU and tanh were also tested as the hidden-layer activation function. Although the results with sigmoid were not good enough compared to the other models discussed above, sigmoid performed better than the ReLU and tanh activation functions.
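A forward-pass sketch of such a network (the hidden-layer width of 16 and the weight initialization are hypothetical; only the 80-token input and the sigmoid/softmax activations come from the report):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    """One hidden layer: sigmoid activations, then a softmax over the
    two output classes (fake/true)."""
    h = sigmoid(X @ W1 + b1)
    return softmax(h @ W2 + b2)

# Hypothetical shapes: 80 input tokens, 16 hidden units, 2 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(scale=0.1, size=(80, 16)), np.zeros(16)
W2, b2 = rng.normal(scale=0.1, size=(16, 2)), np.zeros(2)
probs = forward(rng.random((4, 80)), W1, b1, W2, b2)  # 4 sample articles
```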
5 RESULTS

Using the Average-Hypothesis model, we observed an accuracy of 83 percent on our training set. As mentioned in the Methods section, the Average-Hypothesis model comprises three separate learning algorithms: Naive Bayes, SVM, and logistic regression. Due to the poor performance of logistic regression on the dev set, its hypothesis was omitted from the averaged model.
5.2 DISCUSSION
To determine the sensitivity of our model to sample size, we generated a loss curve for the model. The false positives are higher in SVM and highest in the neural network; we need to reduce them further by adding constraints in the future.
6 FUTURE WORK
Many of our results circle back to the need for more data. Generally speaking, simple algorithms perform better on smaller (less varied) datasets. Since we had limited data, SVM and Naive Bayes outperformed the neural networks, and logistic regression did not perform well. Given enough time to acquire more fake news data and gain experience with Python, we would like to better process the data using n-grams and revisit our deep-learning algorithm. We wrote our own implementations for this project, and the algorithms were relatively slow; to tune all the knobs of the various algorithms, we will use robust off-the-shelf packages in the future.
7 CONTRIBUTIONS
Overall, work was well distributed among the team members throughout the project. Since he had experience with Python, Ayush led the effort to process the data using NLTK. Devyani modified these algorithms to work for a larger set of examples, and Sohan helped debug the data processing algorithms. For modeling, Sohan and Ayush worked on the Average-Hypothesis model, which Devyani debugged. The neural network was developed by Ayush and Devyani, and the poster was formatted by Sohan. All team members were involved in the preparation of this report.
REFERENCES
[1] Papacharissi, Z. & Oliveira, M. (2012). The Rhythms of News Storytelling on Egypt. Journal of Communication, 62, pp. 266–282.
[2] Mihalcea, R. & Strapparava, C. (2009). The Lie Detector: Explorations in the Automatic Recognition of Deceptive Language. Proceedings of the ACL–IJCNLP Conference Short Papers, pp. 309–312.
[3] Chen, Y., Conroy, N. J., & Rubin, V. L. (2015). News in an Online World: The Need for an Automatic Crap Detector. Proceedings of the Association for Information Science and Technology Annual Meeting (ASIST 2015), Nov. 6–10, St. Louis.
[4] Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation.
[5] Zhang, H., Fan, Z., Zeng, J., & Liu, Q. (2012). An Improving Deception Detection Method in Computer-Mediated Communication. Journal of Networks, 7(11).
[6] Conroy, N. J., Rubin, V. L., & Chen, Y. (2015). Automatic Deception Detection: Methods for Finding Fake News. Proceedings of the Association for Information Science and Technology, 52(1), 1–4.
[7] Kaggle: Fake News NLP Stuff. https://www.kaggle.com/rksriram312/fake-news-nlp-stuff/notebook
[8] Kaggle: All the News. https://www.kaggle.com/snapcrack/all-the-news