s134450
Fake News Detection Using Machine Learning
University of Liège
Faculty of Applied Sciences
Belgium
Academic Year 2018-2019
Contents

1 Introduction
1.1 What are fake news?
1.1.1 Definition
1.1.2 Fake News Characterization
1.2 Feature Extraction
1.2.1 News Content Features
1.2.2 Social Context Features
1.3 News Content Models
1.3.1 Knowledge-based models
1.3.2 Style-Based Model
1.4 Social Context Models
1.5 Related Works
1.5.1 Fake news detection
1.5.2 State of the Art Text classification
1.6 Conclusion
2 Related Work
2.1 Introduction
2.2 Supervised Learning for Fake News Detection[12]
2.3 CSI: A Hybrid Deep Model for Fake News Detection
2.4 Some Like it Hoax: Automated Fake News Detection in Social Networks[16]
2.5 Fake News Detection using Stacked Ensemble of Classifiers
2.6 Convolutional Neural Networks for Fake News Detection[19]
2.7 Conclusion
3 Data Exploration
3.1 Introduction
3.2 Datasets
3.2.1 Fake News Corpus
3.2.2 Liar, Liar Pants on Fire
3.3 Dataset statistics
3.3.1 Fake News Corpus
3.3.2 Liar-Liar Corpus
3.4 Visualization With t-SNE
3.5 Conclusion
4 Machine Learning Techniques
5 Attention Mechanism
5.1 Introduction
5.2 Text to Vectors
5.2.1 Word2Vec
5.3 LSTM
5.4 Attention Mechanism
5.5 Results
5.5.1 Methodology
5.5.2 Liar-Liar dataset results
5.5.3 Attention Mechanism
5.5.4 Result Analysis
5.5.5 Testing
5.6 Attention Mechanism on fake news corpus
5.6.1 Model Selection
5.7 Conclusion
6 Conclusion
6.1 Result analysis
6.2 Future works
A
A.1 TF-IDF max features raw results on liar-liar corpus
A.1.1 Weighted Average Metrics
A.1.2 Per Class Metrics
A.2 TF-IDF max features raw results for fake news corpus without SMOTE
B
B.1 Training plot for attention mechanism
Master thesis
Fake news detection using machine learning
Simon Lorent
Acknowledgement
I would like to start by saying thanks to my family, who have always been supportive and who have always believed in me.
I would also like to thank Professor Itoo for his help and for the opportunity he gave me to work on this very interesting subject.
In addition, I would like to thank all the professors of the Faculty of Applied Sciences for what they taught me during these five years at the University of Liège.
Abstract
For some years, mostly since the rise of social media, fake news has become a societal problem, on some occasions spreading further and faster than the true information. In this paper I evaluate the performance of the attention mechanism for fake news detection on two datasets: one containing traditional online news articles and the other containing news from various sources. I compare the results on both datasets, and the results of the attention mechanism against LSTMs and traditional machine learning methods. It shows that the attention mechanism does not work as well as expected. In addition, I made a change to the original attention mechanism paper[1] by using word2vec embeddings, which proves to work better in this particular case.
Chapter 1
Introduction
In this paper I experiment with the possibility of detecting fake news based only on textual information, by applying traditional machine learning techniques[5, 6, 7] as well as a bidirectional LSTM[8] and an attention mechanism[1] on two different datasets that contain different kinds of news.
In order to work on fake news detection, it is important to understand what fake news is and how it is characterized. The following is based on Fake News Detection on Social Media: A Data Mining Perspective[9].
The problem has two aspects: the first is characterization, or what fake news is, and the second is detection. In order to build detection models, one has to start with characterization: indeed, it is necessary to understand what fake news is before trying to detect it.
Definition 1. Fake news is a news article that is intentionally and verifiably false.
• Source: Where does the news come from, who wrote it, and is this source reliable or not.
• Headline: A short summary of the news content that tries to attract the reader.
• Body Text: The main text of the article.
• Image/Video: Visual content that is sometimes provided along with the text.
Features will be extracted from these four basic components, with the main features being linguistic-based and visual-based. As explained before, fake news is used to influence the consumer, and in order to do that, it often uses a specific language register to attract the readers. On the other hand, non-fake news will mostly stick to a different, more formal register. These are the linguistic-based features, to which can be added lexical features such as the total number of words, the frequency of large words, or the number of unique words.
The second kind of features that needs to be taken into account is visual features. Indeed, modified images are often used to add more weight to the textual information. For example, Figure 1.2 is supposed to show the progress of deforestation, but the two images actually come from the same original one, and in addition the WWF logo makes it look as if it comes from a trusted source.
Figure 1.2: The two images provided to show deforestation between two dates are from
the same image taken at the same time. [10]
of trusting or sharing false information. For instance, this metadata can be the user's centre of interest, its number of followers, or anything that relates to it.
The post-based aspect is similar in a sense to the user-based one: it can use post metadata in order to provide useful information, but in addition to metadata, the actual content can be used. It is also possible to extract features from the content using latent Dirichlet allocation (LDA)[11].
These methods all have pros and cons: hiring experts might be costly, and experts are limited in number and might not be able to treat all the news that is produced. In the case of crowdsourcing, the system can easily be fooled if enough bad annotators try to break it, and automatic fact checking might not have the necessary accuracy.
The second method is called objectivity-oriented approaches and tries to capture the objectivity of the texts or headlines. This kind of style is mostly used by partisan articles or yellow journalism, that is, websites that rely on eye-catching headlines without reporting any useful information. An example of this kind of headline could be:
This kind of headline plays on the curiosity of the reader, who would click to read the news.
Current research focuses mostly on using social features and speaker information in order to improve the quality of classifications.
Ruchansky et al.[15] proposed a hybrid deep model for fake news detection making use of multiple kinds of features, such as the temporal engagement between n users and m news articles over time, and producing a label for fake news categorization as well as a score for suspicious users.
Tacchini et al.[16] proposed a method based on social network information, such as likes and users, in order to find hoax information.
Granik and Mesyura[18] used a Naïve Bayes classifier in order to classify news from BuzzFeed datasets.
In addition to texts and social features, Yang et al.[19] used visual features such as images with a convolutional neural network.
Wang et al.[20] also used visual features for classifying fake news, but with adversarial neural networks.
1.6 Conclusion
As has been shown in Section 1.2 and Section 1.3, multiple approaches can be used in order to extract features and use them in models. This work focuses on textual news content features. Indeed, other features related to social media are difficult to acquire: for example, user information is difficult to obtain on Facebook, as is post information. In addition, the different datasets presented in Section 3.2 do not provide any information other than textual content.
Looking at Figure 1.3, it can be seen that the main focus will be on unsupervised and supervised learning models using textual news content. It should be noted that machine learning models usually come with a trade-off between precision and recall, and thus a model which is very good at detecting fake news might have a high false positive rate, as opposed to a model with a low false positive rate which might not be good at detecting them. This raises ethical questions, such as automatic censorship, that will not be discussed here.
Chapter 2
Related Work
2.1 Introduction
In this chapter I will detail in more depth some related works that are worth investigating.

2.2 Supervised Learning for Fake News Detection[12]

In order to feed this network, they used a lot of hand-crafted features such as:
• Language Features: bag-of-words, POS tagging and others, for a total of 31 different features,
• Lexical Features: number of unique words and their frequencies, pronouns, etc.,
• Psychological Features[14]: built using Linguistic Inquiry and Word Count (LIWC), a specific dictionary built by a text mining software.
Many other features were also used, based on the source and social metadata.
2.3 CSI: A Hybrid Deep Model for Fake News Detection

They tested their model on two datasets, one from Twitter and the other from Weibo, which is a Chinese equivalent of Twitter. Compared to simpler models, CSI performs better, with a 6% improvement over simple GRU networks (Figure 2.3).
2.4 Some Like it Hoax: Automated Fake News Detection in Social Networks[16]

For the training they used cross-validation, dividing the dataset into 80% for training and 20% for testing and performing 5-fold cross-validation, reaching 99% accuracy in both cases.
In addition they used a leave-one-page-out setting, using posts from a single page as test data, or using half of the pages for training and the other half for testing. This still leads to good results, with the harmonic algorithm outperforming logistic regression. Results are shown at Figures 2.4 and 2.5.
2.6 Convolutional Neural Networks for Fake News Detection[19]

Their network is made of two branches: one text branch and one image branch (Figure 2.7). The textual branch is itself divided into two sub-branches: a textual-explicit sub-branch, using information derived from the text such as the length of the news, and a textual-latent sub-branch, which is the embedding of the text, limited to 1,000 words.
The image branch is also made of two sub-branches, one containing information such as the image resolution or the number of people present in the image, while the second uses a CNN on the image itself. The full details of the network are given at Figure 2.8, and the results at Figure 2.9 show that using images does indeed work better.
2.7 Conclusion
We have seen in the previous sections that most of the related works focus on improving the prediction quality by adding additional features. The fact is that these features are not always available; for instance, some articles may not contain images. There is also the fact that using social media information is problematic, because it is easy to create a new account on these media and fool the detection system. That is why I chose to focus on the article body only and see whether it is possible to accurately detect fake news.
1 https://www.kaggle.com/mrisdal/fake-news
Chapter 3
Data Exploration
3.1 Introduction
A good starting point for the analysis is to do some exploration of the datasets. The first thing to do is statistical analysis, such as counting the number of texts per class or counting the number of words per sentence. Then it is possible to try to get an insight into the data distribution by performing dimensionality reduction and plotting the data in 2D.
3.2 Datasets
3.2.1 Fake News Corpus
This work uses multiple corpora in order to train and test the different models. The main corpus used for training is called Fake News Corpus[29]. This corpus has been automatically crawled using opensources.co labels. In other words, domains have been labelled with one or more of the following labels:
• Fake News
• Satire
• Extreme Bias
• Conspiracy Theory
• Junk Science
• Hate News
• Clickbait
• Political
• Credible
20
CHAPTER 3. DATA EXPLORATION 21
These annotations have been provided by crowdsourcing, which means that they might not be exactly accurate, but they are expected to be close to the reality. Because this work focuses on fake news detection against reliable news, only the news labelled as fake and credible have been used.
3.2.2 Liar, Liar Pants on Fire

The second dataset is the Liar-Liar corpus, in which each statement is labelled with one of six classes:
• pants-fire
• false
• barely-true
• half-true
• mostly-true
• true
This set will be used as a second test set. Because in this case there are six classes against two in the other cases, a threshold has to be chosen to fix which labels will be considered as true or false, so that the results can be compared with the other dataset.
It should be noted that this dataset differs from the others in that it is composed only of short sentences, and thus very good results should not be expected on it for models trained on the Fake News Corpus, which is made of full texts. In addition, the texts from this dataset are more politically oriented than the ones from the first one.
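As an illustration, here is a minimal sketch of such a binarization; placing the threshold between barely-true and half-true is an assumption made for the example, not a choice prescribed by the corpus.

```python
# A possible binarization of the six Liar-Liar labels; the threshold
# position (between "barely-true" and "half-true") is an assumption.
FAKE_LABELS = {"pants-fire", "false", "barely-true"}

def binarize(label: str) -> str:
    """Map a six-way Liar-Liar label onto the fake/reliable dichotomy."""
    return "fake" if label in FAKE_LABELS else "reliable"
```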
Because the dataset has been cleaned, both the numbers provided by the dataset creators and the numbers computed after cleaning are reported; we found the values given at Table 3.1. It shows that the number of fake news items is smaller by a small factor with respect to the number of reliable ones, but given the total number of items it should not cause any problems. It will still be taken into account later on.
In addition to the numbers provided at Table 3.1, there are also two more categories
that are in the dataset but for which no description is provided:
• Unknown: 231301
• Rumour: 376815
In addition, the number of words per text and the average number of words per sentence have been computed for each text category. Figure 3.2 shows the boxplots of these values. It can be seen that there is no significant difference that might be used in order to make class predictions.
Before counting the number of words and sentences, the texts are preprocessed using gensim[31] and NLTK[32]. The first step consists of splitting the text into an array of sentences on stop punctuation such as dots or question marks, but not on commas. The second step consists of filtering the words contained in these sentences: stop words (words such as 'a', 'an', 'the'), punctuation, words of size less than or equal to three, non-alphanumeric words, numeric values and tags (such as HTML tags) are removed. Finally, the number of words still present is counted.
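As a rough sketch, the pipeline just described could look as follows; the function name is illustrative and the exact filters used in this work may differ slightly.

```python
# A minimal sketch of the preprocessing described above, using NLTK's
# sentence splitter (requires the 'punkt' data) and gensim's filters.
from nltk.tokenize import sent_tokenize
from gensim.parsing.preprocessing import (
    strip_tags, strip_punctuation, strip_numeric,
    remove_stopwords, strip_short,
)

def count_words(text: str) -> int:
    """Split on stop punctuation, filter each sentence, count what remains."""
    total = 0
    for sentence in sent_tokenize(text):   # splits on '.', '?', '!' but not ','
        s = strip_tags(sentence)           # remove tags such as HTML tags
        s = strip_punctuation(s)
        s = strip_numeric(s)               # remove numeric values
        s = remove_stopwords(s)            # remove 'a', 'an', 'the', ...
        s = strip_short(s, minsize=4)      # remove words of size <= 3
        total += len([w for w in s.split() if w.isalnum()])
    return total
```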
An interesting feature to look at is the distribution of news sources with respect to their categories. It shows that in some cases, some sources are predominant. For instance, looking at Figure 3.3 shows that most of the reliable news comes from nytimes.com and, in the same way, most of the fake news comes from beforeitsnews.com. That has to be taken into account when training and testing models, as the goal is not to distinguish between these two sources but between fake news and reliable news.

Figure 3.1: Histogram of text distribution along their categories, based on the computed numbers.

Figure 3.2: (a) Boxplot of average sentence length for each category. (b) Boxplot of the number of sentences for each category.

Figure 3.3: Distribution of news domains per category (panels: (a) Clickbait, (b) Hate, (d) Fake, (e) Reliable, (f) Political, (g) Satire, (h) Unknown).

Another important feature to look at is the distribution of the number of words in the texts. Indeed, at some point it will be needed to fix a constant length for the texts: using too small a length would mean a lot of cutting, while using too long a size would mean too much padding. It is thus needed to investigate the length of the texts in order to choose the right one. It can be seen at Figure 3.4 that reliable news has slightly more words than fake news.
Figure 3.4: Boxplot of the number of words per text for each category.
Figure: Distribution of word count per text, over the full range and limited to texts of less than 1,250 words.
Comparing Figure 3.5 and Figure 3.6 shows that even after removing the nytimes.com and beforeitsnews.com news from the dataset, enough texts remain to train the different models without the risk of learning something unwanted, such as only separating these two sources. One drawback, however, is that the ratio between fake news and reliable news goes from around one half to around one fourth.
Figure 3.5: Summary statistics for the non-downsampled fake and reliable news.
Figure 3.6: Summary statistics for the downsampled dataset of fake and reliable news.
And in order to have a low-dimensional probability distribution as close as possible to the high-dimensional one, it minimizes the Kullback-Leibler divergence of the two distributions:

$$\mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\frac{p_{ij}}{q_{ij}}$$
Figure 3.7: Number of words distributions for the liar-liar dataset. On the first and the third plots, a few outliers with length greater than 50 have been removed in order to make the plots more readable.
Increasing the number of PCA components to 1750 gives the results at Figure 3.9 and does not show more clustering, even though it explains 75% of the variance. This shows that classifying the points might not be easy, but it should be remembered that this is a dimensionality reduction and that there is a loss of information: some of the original data dimensions may give a better separation of the classes.
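For reference, a sketch of this kind of visualization is given below; the matrix name `X` and the PCA component count are illustrative placeholders, not the exact settings used here.

```python
# A sketch of the t-SNE visualization with a PCA preprocessing step;
# `X` (a TF-IDF matrix) is assumed to exist.
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_pca = PCA(n_components=50).fit_transform(X.toarray())
X_2d = TSNE(n_components=2).fit_transform(X_pca)  # cost grows quadratically
```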
It is not possible to use t-SNE on the Fake News Corpus because the algorithm is quadratic with respect to the number of samples, which makes it impossible to compute for that corpus, which is much larger than the liar-liar corpus. But it is still possible to make some visualization using truncated singular value decomposition. In this case, even with only a 2D projection, we can already see some kind of separation between the two classes, so we can expect that the Fake News Corpus will be easier to deal with.
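A hedged sketch of this projection follows, assuming the TF-IDF matrix `X` and a `labels` list from earlier; unlike PCA, TruncatedSVD works directly on sparse input, which is what makes it usable on the full corpus.

```python
# A sketch of the LSA (truncated SVD) projection of the corpus.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

X_2d = TruncatedSVD(n_components=2).fit_transform(X)  # sparse-friendly
labels = np.asarray(labels)
for label, colour in [("fake", "red"), ("reliable", "blue")]:
    mask = labels == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], s=1, c=colour, label=label)
plt.legend()
plt.show()
```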
3.5 Conclusion
Data exploration has shown that there are no real statistical differences in text metadata between fake and reliable news, which makes such metadata uninteresting for classifying new texts. In addition, dimensionality reduction does not show any sign of helpfulness for the classification.
Figure 3.10: First two LSA components for fake news corpus
Chapter 4
Machine Learning Techniques
4.1 Introduction
In this chapter, we will focus on the more traditional methods used in natural language processing, such as Naïve Bayes, decision trees, linear SVM and others. These will serve as a baseline for comparing the performances of the two more advanced models that will be analysed later on: LSTM and attention mechanism. The first thing to do when working with text is word and text embedding: indeed, in order to use machine learning algorithms on texts, a mathematical representation of these texts is required.
In order to overcome this problem, term frequency might be used: rather than setting $M_{ij}$ to 0 or 1, we set it to the number of times the word appears in the text.
It is possible to use an even better text embedding, called term frequency-inverse document frequency (tf-idf). The main idea is that a word that appears often in all the documents is not helpful for classifying the documents. For example, if the task is to classify books of biology and physics, the words atoms, cells or light are more useful than today or tomorrow.
The computation of tf-idf is separated into two parts, the first one being the term frequency and the second one the inverse document frequency. The term frequency $tf_{ij}$ is the number of times the word $j$ appears in the document $i$. Secondly, we have

$$idf_j = \log\frac{\#D}{\#\{D_i \mid W_j \in D_i\}}$$
that is, the log of the total number of documents over the number of documents that contain the word $j$. Finally, the tf-idf value is computed as $tfidf_{ij} = tf_{ij} \cdot idf_j$. This is the text embedding method that will be used in this chapter.
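In practice this embedding can be built with scikit-learn, as in the sketch below; note that scikit-learn's default idf is a smoothed variant of the formula above, and the variable names are illustrative.

```python
# A minimal sketch of building the TF-IDF matrix; `train_texts` is an
# assumed list of document strings.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=100_000)  # keep most frequent words
X_train = vectorizer.fit_transform(train_texts)     # sparse (n_docs, n_words)
```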
4.3 Methodology
All the methods presented will be tested in two different ways:
• on the liar-liar corpus,
• on the fake news corpus, excluding the news from beforeitsnews.com and nytimes.com.
To be more precise, in the first case, the models will be trained on a training set, tuned using a validation set and finally tested using a test set. In the second case, the same methodology will be used, but the dataset has been split by choosing 60% of the texts from each domain for training, and 20% each for validation and testing. This way of splitting has been chosen because of the uneven representation of each domain in the dataset, in order to ensure representation of all the domains in the three subsets.
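A minimal sketch of this per-domain split, assuming a pandas DataFrame with a 'domain' column (illustrative names, not the thesis code):

```python
# A sketch of the per-domain 60/20/20 split described above.
import pandas as pd

def split_by_domain(df: pd.DataFrame, seed: int = 0):
    train, valid, test = [], [], []
    for _, group in df.groupby("domain"):
        group = group.sample(frac=1.0, random_state=seed)  # shuffle the domain
        a, b = int(0.6 * len(group)), int(0.8 * len(group))
        train.append(group.iloc[:a])    # 60% for training
        valid.append(group.iloc[a:b])   # 20% for validation
        test.append(group.iloc[b:])     # 20% for testing
    return pd.concat(train), pd.concat(valid), pd.concat(test)
```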
The first parameter to tune is the max number of features used by tf-idf. This is the
maximum number of words that will be kept to create the text encoding. The words that
are kept are the most frequent words.
4.4 Models
Four models have been used in order to classify the texts represented as a TF-IDF matrix: Multinomial Naïve Bayes, Linear SVM, Ridge Classifier and Decision Tree. I will start with a very brief recap of each model and how it works.
4.4.1 Naïve Bayes[7]
The basic idea of the Naïve Bayes model is that all features are independent of each other. This is a particularly strong hypothesis in the case of text classification, because it supposes that words are not related to each other. But it is known to work well despite this hypothesis. Given an element of class $y$ and a vector of features $X = (x_1, ..., x_n)$, the probability of the class given that vector is defined as

$$P(y|X) = \frac{P(y)\,P(X|y)}{P(X)} \qquad (4.6)$$
where $\xi$ is a loss function, in this case the L2 loss, and $C > 0$ a penalty parameter. Classes of new examples are assigned by looking at the value of $w^T x$: the class 1 is assigned if $w^T x \geq 0$ and the class $-1$ if $w^T x < 0$.
The Gini impurity of a branch is

$$G = \sum_i p(i)\,(1 - p(i))$$

where $p(i)$ is the probability of class $i$ in the current branch. The best split is chosen as the one that decreases the impurity the most. For instance, beginning from the root, the Gini impurity is computed on the complete dataset, then the impurity of each branch is computed over all features, weighting it by the number of elements in each branch. The chosen feature is the one that yields the largest impurity decrease.
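For concreteness, here are the four models as they would be instantiated with scikit-learn; the hyperparameter values shown are placeholders rather than the tuned ones, and `X_train`, `y_train`, `X_valid`, `y_valid` are assumed from the TF-IDF step.

```python
# A sketch of the four classifiers used in this chapter.
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier

models = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(C=1.0),                     # penalty parameter
    "RidgeClassifier": RidgeClassifier(alpha=1.0),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)        # X_train: TF-IDF matrix from above
    print(name, model.score(X_valid, y_valid))
```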
Figure 4.3: Average metrics (recall, precision and f1-score) for the ridge classifier with respect to the penalty parameter.
Figure 4.4: Weighted average of f1-score, precision and recall of each class.
The results are slightly different if the goal is to optimize the precision: while the best value stays the same for Linear SVM and Ridge Classifier, Naïve Bayes works better when using the maximum number of features, and the same goes for recall. Based on Figure 4.4, we can say that when it comes to precision and recall, Naïve Bayes is the one that performs the best.
Raw results for the max features selection are available at Appendix A. Things go differently when we focus on a single class. For example, the precision for fake detection is at its maximum for Linear SVM and Ridge Classifier when only 10 features are used, but at the same time it is at its minimum for the reliable class.
Figure 4.5: Precision of the model for each class; the x axis is a log scale of the number of features.
Figure 4.6: Recall of the model for each class; the x axis is a log scale of the number of features.
It shows that when trying to optimize the overall model, and not only a single class, it is better to look at the weighted average than at the value for a single class. But it is still important to look at the metrics for a single class, because they indicate how the model behaves for that class. For instance, in the case of automatic fake news detection, it is important to minimize the number of reliable news misclassified, in order to avoid what could be called censorship.
Figure 4.7: Metrics values with respect to the penalty parameter for the ridge classifier.
new elements are created on the segment joining the sample and its neighbour. Algorithms 1 and 2 show how it works: the first one computes the k-nearest neighbours and the second one computes a new element by randomly choosing one of these neighbours.
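A compact sketch of that procedure is given below; it illustrates SMOTE's core idea rather than the exact Algorithms 1 and 2, and `X_minority` is an assumed dense array of minority-class samples. In practice the imbalanced-learn package offers a ready-made implementation.

```python
# A minimal SMOTE-style sampler: pick a minority sample, find its k
# nearest neighbours, and interpolate between the two.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority: np.ndarray, k: int = 5) -> np.ndarray:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    i = np.random.randint(len(X_minority))
    _, idx = nn.kneighbors(X_minority[i:i + 1])   # idx[0][0] is i itself
    j = np.random.choice(idx[0][1:])              # one of the k neighbours
    gap = np.random.rand()                        # position on the segment
    return X_minority[i] + gap * (X_minority[j] - X_minority[i])
```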
The optimal penalty parameter for the ridge classifier is clearly 1. Likewise, for the decision tree trained on the liar-liar dataset, the optimal maximum depth is 1000. And finally, the optimal value for the penalty parameter of the SVM is also 1.
By looking at Figures 4.5, 4.6 and 4.4, we can find optimal values for the number of features used in TF-IDF. It shows that the linear SVM and ridge classifier are the ones that
Figure 4.10: Average recall, precision and f1-score with respect to the maximum number of features.
Figure 4.11: Precision for the fake and reliable classes for each model with respect to the maximum number of features.
perform the best, having an average precision of slightly more than 94% for the linear SVM and 94% for the ridge classifier. They achieve these performances from 50,000 features onward, and the performance does not decrease afterwards. On the other hand, Naïve Bayes reaches a peak at 100,000 features and greatly decreases afterward.
Figure 4.4 shows why it is important to look at all the metrics: Naïve Bayes reaches a recall of 1 for the reliable class and close to 0 for the fake class, which means that almost all texts are classified as reliable. This can be verified by looking at Figure 4.21: only a small proportion of the true fakes is actually classified as such.
Figure 4.12: Recall for the fake and reliable classes for each model with respect to the maximum number of features.
Figure 4.13: Confusion matrices for each model using 364,070 features.
It also has a huge impact on how Naïve Bayes behaves, as it removes overfitting when using a larger number of features in TF-IDF, leading to a few percent of accuracy increase.
In conclusion, for the SMOTE method we can say that it does help models that do not have a regularization parameter, or when the regularization parameter is low. Thus it does help prevent overfitting.
For the liar-liar corpus, the following settings will be used:
• Linear SVM with a regularization parameter of 0.1 and a maximum of 500 TF-IDF features,
• Decision Tree with a maximum depth of 1000 and the maximum number of features for TF-IDF,
Figure 4.14: Metrics values with respect to the penalty parameter for the ridge classifier when using SMOTE.
Figure 4.16: Optimal penalty parameter for the linear SVM when using SMOTE.
Figure 4.17: Average recall, precision and f1-score with respect to the maximum number of features.
Figure 4.18: Precision for the fake and reliable classes for each model with respect to the maximum number of features.
Figure 4.19: Recall for the fake and reliable classes for each model with respect to the maximum number of features.
• Naïve Bayes will also use the maximum number of features for TF-IDF.
For the Fake News Corpus, the following settings will be used: all models will be trained using 100,000 features for TF-IDF. For the Fake News Corpus with SMOTE, the same parameters will be used, except that the maximum number of features for TF-IDF will be used.
All the models will be trained on the train and validation sets and tested on the test set.
4.7.2 Results
Liar-Liar Corpus
Looking at the raw results, based on average accuracy, Naïve Bayes, Linear SVM and ridge classifier perform very closely; but looking at the recall per class shows that Naïve Bayes is bad at detecting fake news and classifies most of the texts as reliable, while Linear SVM and Ridge Classifier are more balanced. Finally, it is possible to look at the ROC curve at Figure 4.20. One more time, it shows that Naïve Bayes, linear SVM and ridge classifier have similar performance, but in this case NB has a small advantage, with a slightly larger AUC. There is only one point for the decision tree, as it does not output probabilities for each class.
Fake News Corpus
When it comes to the Fake News Corpus, linear models are still the ones that perform the best, with the linear SVM reaching an accuracy of 94.7% and the ridge classifier 93.98%. Surprisingly, the decision tree outperforms Naïve Bayes in this case, with an accuracy of 89.4% while Naïve Bayes only gets 85.3%.
In this case, the ROC curve (Figure 4.22) shows almost the same ranking of models, except for the Decision Tree, which is last, with Naïve Bayes just above it. The confusion matrices (Figure 4.23) show that Naïve Bayes has a tendency to classify fake news as reliable. On the other hand, the ridge classifier is the one that makes the fewest misclassifications for reliable news, which is a good point.
Finally, there are the results of the models trained with SMOTE data augmentation. Using it shows little to no improvement; the only benefit is to slightly balance the recall of Naïve Bayes on fake and reliable news.
4.8 Conclusion
In this chapter we have analysed how traditional machine learning algorithms work on two different datasets; the second one being imbalanced, a data augmentation technique called SMOTE has been used in order to see whether it improves the results.
Figure 4.20: ROC curves on the liar-liar corpus (Linear SVC AUC 0.622, Decision Tree AUC 0.546, Ridge Classifier AUC 0.626, Naïve Bayes AUC 0.638).
Figure 4.21: Confusion matrices for each model on the liar-liar corpus.
Figure 4.22: ROC curves on the Fake News Corpus (Linear SVC AUC 0.984, Decision Tree AUC 0.839, Ridge Classifier AUC 0.981, Naïve Bayes AUC 0.919).
Figure 4.23: Confusion matrices for each model on the Fake News Corpus.
Table 4.3: Results on Fake News Corpus when training with SMOTE.
Figure: ROC curves on the Fake News Corpus when training with SMOTE (Linear SVC AUC 0.983, Decision Tree AUC 0.841, Ridge Classifier AUC 0.980, Naïve Bayes AUC 0.920).

Figure: Confusion matrices for each model when training with SMOTE.
We can conclude that in all cases the linear models are the ones that work the best, with a top accuracy of 61.7% on the liar-liar corpus using the Ridge Classifier, and a top accuracy of 94.7% on the Fake News Corpus using the linear SVM. In the end, the results obtained on the second dataset are really good, while those obtained on the first one are mixed.
As explained earlier, it might be important to choose the model that yields the smallest misclassification rate on reliable news, in order to avoid possible censorship, and the confusion matrices show that in both cases the Ridge Classifier is the one that makes the fewest errors of that kind.
In addition, we have shown that the Synthetic Minority Oversampling Technique acts as a regularizer, as it improves performance when the penalization term is small on linear models.
In the next chapter, the focus will be put on trying to improve the results on the Liar-Liar corpus, as there is room for improvement, while the second dataset already has very good results. Models will still be tried on it for comparison.
Chapter 5
Attention Mechanism
5.1 Introduction
In this chapter, we will focus on the deep learning models: the first one is a bidirectional LSTM, and the second one adds an attention layer to this LSTM. But another text embedding is needed in order to work with LSTMs. Indeed, tf-idf creates a sparse matrix with each row corresponding to a value for a given word, which means that the order of the words is lost. In order to solve this, word2vec[38] is used. It allows matching words to continuous vectors of a given size with interesting properties. Another method, which consists in making the word embeddings tunable parameters, will also be used.
It is possible to visualize these relationships by using t-SNE to project the high-dimensional word vectors into a 2D space. The results for various relationships can be seen at Figure 5.1.
Figure 5.1: Relationships between different words with t-SNE dimensionality reduction.
Figure 5.2: A simple CBOW model with only one word in the context
where $W_{V \times N}$ is the weight matrix to optimize over. The output layer values are computed as

$$Y = W'^T h \qquad (5.2)$$

As before, $W'_{N \times V}$ is also a weight matrix to optimize. The loss can be computed as the softmax cross-entropy.
It is also possible to do the opposite: predicting the context given a single input word. This is the skip-gram model. In this case the loss becomes Equation 5.3:

$$E = -\sum_{c=1}^{C} u_{j_c^*} + C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'}) \qquad (5.3)$$

where $j_c^*$ is the index of the $c$-th output context word and $u_{j_c^*}$ is the score of the $j$-th word in the vocabulary for the $c$-th context word. Finally, the embedding used for a given word is the value of the hidden layer produced for that word.
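A minimal sketch of obtaining such embeddings with gensim (version 4 API); the toy corpus and the parameter values are illustrative, not the settings used in this work.

```python
# Train skip-gram (sg=1) word2vec vectors on a toy corpus.
from gensim.models import Word2Vec

sentences = [["fake", "news", "spreads", "fast"],
             ["reliable", "news", "gets", "verified"]]
model = Word2Vec(sentences, vector_size=300, window=5, sg=1, min_count=1)
vec = model.wv["news"]          # a 300-dimensional word vector
```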
5.3 LSTM
LSTM, or Long Short-Term Memory[8], is a kind of recurrent neural network that fits well to temporal or sequential inputs such as texts. An RNN is a type of neural network where the hidden state is fed back in a loop along with the sequential inputs. RNNs are usually shown in their unrolled version (Figure 5.4), each of the $X_i$ being one value in the sequence.
In this case, the $X_i$ values are word vectors. There are two possibilities: either use pre-trained vectors from word2vec, or make the $X_i$ inputs parameters to learn, in the same way as in the word2vec algorithm, with a one-hot encoding of the word and a matrix of weights to tune. Both methods will be used.
Recurrent neural networks do not work very well with long-term dependencies; that is why LSTMs have been introduced. An LSTM is made of an input gate, an output gate and a forget gate, which are combined in Equation 5.4.
Figure 5.5 shows how it works. A bidirectional LSTM works the same way, but the input is fed in the two directions, from the start to the end and from the end to the start. The output sequences of the LSTM are summed element-wise in order to merge them: we have that $h_i = \overrightarrow{h_i} + \overleftarrow{h_i}$, where $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the outputs $i$ of the sequence in each direction, as shown at Figure 5.6.
Let $H$ be the matrix of the concatenation of all the $h_i$. Then

$$M = \tanh(H) \qquad (5.10)$$
$$\alpha = \mathrm{softmax}(w^T M) \qquad (5.11)$$
$$r = H \alpha^T \qquad (5.12)$$
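Below is a PyTorch sketch of these three equations on top of the summed bidirectional LSTM outputs; it is an illustration under assumed shapes and names, not the implementation actually used in this work.

```python
# Attention layer implementing equations 5.10-5.12.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, hidden: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(hidden))

    def forward(self, H: torch.Tensor) -> torch.Tensor:
        # H: (batch, seq_len, hidden), the summed bi-LSTM outputs
        M = torch.tanh(H)                           # (5.10)
        alpha = torch.softmax(M @ self.w, dim=1)    # (5.11)
        return (H * alpha.unsqueeze(-1)).sum(1)     # (5.12): r = H alpha^T

hidden = 50
lstm = nn.LSTM(300, hidden, batch_first=True, bidirectional=True)
x = torch.randn(8, 20, 300)                 # a batch of 20-word sequences
out, _ = lstm(x)                            # (8, 20, 2 * hidden)
h = out[..., :hidden] + out[..., hidden:]   # element-wise sum of directions
r = Attention(hidden)(h)                    # sentence representation (8, 50)
```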
5.5 Results
5.5.1 Methodology
In order to train the models and perform hyperparameter optimization, grid search has been used when it was possible (on the liar-liar dataset), and the knowledge acquired there has been used in order to tune the parameters of the networks on the Fake News Corpus. In addition, in order to find the best parameters among all those tested with grid search, for each metric the training epochs having the highest validation value for that metric have been chosen.
All the models have been trained using the Adam optimizer and initialized using a normal distribution for the weights.
As SMOTE cannot be used on the Fake News Corpus due to the size of the corpus, the minority class has been oversampled by feeding the same inputs multiple times, looping through them, in order to rebalance the dataset.
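A one-line sketch of that oversampling-by-looping, with illustrative list names:

```python
# Repeat the minority-class texts cyclically until both classes match.
from itertools import cycle, islice

balanced_minority = list(islice(cycle(minority_texts), len(majority_texts)))
```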
LSTM
When it comes to the LSTM trained on the liar-liar dataset, it simply does not work: it classifies almost all the texts as being from the same class. Although it reaches a good score on the training data, it does not manage to generalize correctly. Figure 5.7 shows the recall, precision, f1-score and loss for the training and testing sets of the best model for the LSTM using word2vec. We can see that even if the training score increases, the testing
values oscillate.

Figure 5.7: Loss, recall, f1-score and precision for the training and validation sets with respect to the epoch, for the LSTM using word2vec.

Figure 5.8: Loss, recall, f1-score and precision for the training and validation sets with respect to the epoch, for the LSTM with tunable embedding.
When training the models with word embedding as tunable parameters, the results slightly
improve, with an average precision between 55% and 60%. This can be seen at Figure
5.8.
The training was stopped after 200 iterations because the validation score was not im-
proving anymore.
Figure 5.9: 95% confidence intervals of the precision with each parameter (sequence length, number of hidden units, number of layers, dropout) fixed in turn.
Since grid search trains many models with different parameters, it is interesting to look at the distribution of the results of the best epochs by fixing parameters one by one. Figure 5.9 shows the 95% confidence interval of the precision for each fixed parameter.
It shows that it is better to use fewer hidden units in the model, and only a single layer.
The sequence length has a very small impact on the precision. Actually, the best model
uses a sequence length of 20.
The precision of the different models ranges from 53% to 63% (Figure 5.10). The training plot of the model that reaches the maximum precision can be seen at Figure 5.11. It shows that after the 25th iteration, the validation values start to decrease, which is a sign of overfitting.
Finally, there are the models where the embedding is a tunable parameter. Figure 5.12 shows that in this case, the longer the sequence the better, and that, as before, using fewer hidden units performs better. In this case, the variation has a wider range than when using word2vec.
Figure 5.10: Distribution of the precision of the best epochs for all the models trained with word2vec embedding.
There are a few models that have a top precision higher than 75%, but looking at the training plot (Appendix B, Figure B.1) shows that these models do not perform well. Because in this particular case precision is not a good indicator of how well a model performs, the f1-score will be used instead, as it is a balance between precision and recall.
The best f1-score obtained is 0.55384, which is quite a bit smaller than the 0.63 of the model using word2vec. The training plot is at Figure 5.13. We can see that there is still room for improvement; the next step is to see what happens when training on more epochs. Training on 1000 epochs rather than 200 does not improve the validation score, but it does improve the training score (Figure 5.14).
• Using word2vec rather than training the embedding gives better results.
Figure 5.11: Training and validation of the model with top precision trained with word2vec embedding.

Figure 5.12: 95% confidence intervals of the precision with each parameter (sequence length, number of hidden units, embedding dimension, dropout) fixed in turn, for the models with tunable embedding.

Figure 5.13: Training and validation of the model with top f1-score.

Figure 5.14: Training and validation when training on 1000 epochs.

It also shows that despite reaching very good precision, recall and f1-score on the training set, the model does not perform well on the validation set. This is a sign of overfitting. In order to avoid this, multiple methods have been applied, without showing any improvement:
• Dropout[40],
• batch normalization[41],
• reducing the network capacity (fewer hidden layers, lower embedding dimensions, fewer training parameters with word2vec).

The highest gain came from using the word2vec embedding, which significantly reduces the number of training parameters; secondly, dropout also helped a little.
5.5.5 Testing
In the same way as in Chapter 4, the models use the parameters that produced the best results during validation; they are trained on the training and validation sets, and tested on the testing set.
The parameters used for training are given at Table 5.1. The results for all four models
Model                 Embedding size   Sequence length   Hidden units   Dropout   Early stop
LSTM                  300              10                50             0.75      126
LSTM + word2vec       300              10                50             0.0       160
Attention             10               20                10             0.75      400
Attention + word2vec  300              20                5              0.75      25

Table 5.1: Parameters used for the final training of the four models.
are given at Table 5.2. It shows that the model that works best is the attention network using word2vec embedding, with an accuracy of 61%, which is equivalent to the ridge classifier and the linear SVM. The three other models do not perform well, all having an average precision around 55%, which is close to a random classifier.

Table 5.2: Results for the different models trained with the parameters given at Table 5.1.

5.6 Attention Mechanism on fake news corpus

It turns out that in this case LSTMs work better than the attention mechanism, but, as in the previous section, they do not reach the machine learning results.
The best LSTM obtained uses sequences of 200 words and 200 hidden units, with an average precision of 0.929376 on the validation set. The training plots of this particular model can be seen at Figure 5.15.
Figure 5.15: Training plots of the best LSTM using word2vec embedding.
[Figure: loss, recall, F1-score, and precision for the train and validation sets with respect to the epoch.]
Figure 5.16: Training plots of the best attention network using the word2vec embedding.
In the case of the attention mechanism, training on the Fake News Corpus has proven harder than on the Liar-Liar Corpus: too large a learning rate leads to an oscillating loss, while too small a learning rate causes the loss to stop decreasing. This can be seen in Appendix B.2.
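One standard way to navigate this trade-off, sketched below with a dummy PyTorch model and batch, is to start from a moderate learning rate and let a scheduler shrink it when the loss plateaus; this is an illustration, not one of the configurations tested here.

```python
import torch

model = torch.nn.Linear(300, 2)             # stand-in for the attention network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate whenever the loss has not improved for 5 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)

x = torch.randn(32, 300)                    # dummy batch of document vectors
y = torch.randint(0, 2, (32,))
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())             # feed the monitored quantity
```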
The same parameters as for the LSTM will be used for training the attention network. Its training plots are available in Figure 5.16, and the final results are given in Table 5.3. For the same parameters, the LSTM works better than the attention network on this particular dataset: in terms of accuracy, the LSTM places below the linear SVM and the ridge classifier, and above the decision tree and naïve Bayes. It is likely possible for the attention network to reach results as good as the LSTM or even better, but due to technical and time constraints I was not able to experiment further. For instance, using a longer sequence length and more hidden units with a smaller learning rate might have overcome this problem.

Table 5.3: Final results on the testing set for the LSTM and the attention network using word2vec.
5.7 Conclusion
In this chapter I have investigated how state-of-the-art deep learning models perform on fake news detection; for this particular task, they do not outperform traditional machine learning methods. I have also made an addition to the original model that improves performance by a few percent: replacing the tunable word embedding with a constant one computed with word2vec. This turns out to help reduce overfitting and to improve results on the testing set.
A hypothesis to explain why these two deep learning methods do not work as well as the machine learning methods is that they require all texts to be the same length: some texts must be padded while others are shrunk, and in the latter case information is lost, as the toy example below illustrates.
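A minimal sketch of this effect, with hypothetical names and a sequence length of 10:

```python
SEQ_LEN, PAD = 10, 0

def to_fixed_length(token_ids, seq_len=SEQ_LEN, pad=PAD):
    if len(token_ids) >= seq_len:
        return token_ids[:seq_len]                           # truncation: words lost
    return token_ids + [pad] * (seq_len - len(token_ids))    # padding: no loss

long_doc = list(range(1, 25))        # a 24-token article
print(to_fixed_length(long_doc))     # only the first 10 tokens survive
print(to_fixed_length([1, 2, 3]))    # short text padded with zeros
```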
In addition, this confirms that the Liar-Liar Corpus is hard to work on, with around 60% precision, whereas the Fake News Corpus still gives good results.
Chapter 6
Conclusion
Bibliography
[1] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu.
Attention-based bidirectional long short-term memory networks for relation classi-
fication. In Proceedings of the 54th Annual Meeting of the Association for Com-
putational Linguistics (Volume 2: Short Papers), pages 207–212, Berlin, Germany,
August 2016. Association for Computational Linguistics.
[2] Hunt Allcott and Matthew Gentzkow. Social media and fake news in the 2016 election. Journal of Economic Perspectives, 31, 2017.
[3] Jeffrey Gottfried and Elisa Shearer. News Use Across Social Media Platforms 2016.
Pew Research Center, 2016.
[4] Craig Silverman and Lawrence Alexander. How teens in the balkans are duping
trump supporters with fake news. Buzzfeed News, 3, 2016.
[5] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin.
Liblinear: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874,
2008.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Com-
putation, 9:1735–1780, 1997.
[9] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection
on social media: A data mining perspective. ACM SIGKDD Explorations Newsletter,
19(1):22–36, 2017.
[11] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
[12] Julio CS Reis, André Correia, Fabrício Murai, Adriano Veloso, Fabrício Benevenuto,
and Erik Cambria. Supervised learning for fake news detection. IEEE Intelligent
Systems, 34(2):76–81, 2019.
[13] Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. Automatic detection of fake news.
[14] James W Pennebaker, Martha E Francis, and Roger J Booth. Linguistic inquiry
and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001,
2001.
[15] Natali Ruchansky, Sungyong Seo, and Yan Liu. Csi: A hybrid deep model for fake
news detection. In Proceedings of the 2017 ACM on Conference on Information and
Knowledge Management, pages 797–806. ACM, 2017.
[16] Eugenio Tacchini, Gabriele Ballarin, Marco L. Della Vedova, Stefano Moret, and Luca
de Alfaro. Some like it hoax: Automated fake news detection in social networks.
[17] James Thorne, Mingjie Chen, Giorgos Myrianthous, Jiashu Pu, Xiaoxuan Wang, and
Andreas Vlachos. Fake news stance detection using stacked ensemble of classifiers.
In Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets
Journalism, pages 80–83, 2017.
[18] Mykhailo Granik and Volodymyr Mesyura. Fake news detection using naive bayes
classifier. In 2017 IEEE First Ukraine Conference on Electrical and Computer En-
gineering (UKRCON), pages 900–903. IEEE, 2017.
[19] Yang Yang, Lei Zheng, Jiawei Zhang, Qingcai Cui, Zhoujun Li, and Philip S. Yu.
Ti-cnn: Convolutional neural networks for fake news detection.
[20] Yaqing Wang, Fenglong Ma, Zhiwei Jin, Ye Yuan, Guangxu Xun, Kishlay Jha, Lu Su,
and Jing Gao. Eann: Event adversarial neural networks for multi-modal fake news
detection. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 849–857. ACM, 2018.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N.
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
[22] Shuai Li, Wanqing Li, Chris Cook, Ce Zhu, and Yanbo Gao. Independently recurrent
neural network (indrnn): Building a longer and deeper rnn.
[23] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy.
Hierarchical Attention Networks for Document Classification. In Proceedings of the
2016 Conference of the North American Chapter of the Association for Computa-
tional Linguistics: Human Language Technologies, pages 1480–1489, San Diego, Cal-
ifornia, 2016. Association for Computational Linguistics.
[24] Takeru Miyato, Andrew M. Dai, and Ian Goodfellow. Adversarial Training Meth-
ods for Semi-Supervised Text Classification. arXiv:1605.07725 [cs, stat], May 2016.
arXiv: 1605.07725.
[26] Kamran Kowsari, Mojtaba Heidarysafa, Donald E. Brown, Kiana Jafari Meimandi,
and Laura E. Barnes. RMDL: Random Multimodel Deep Learning for Classification.
Proceedings of the 2nd International Conference on Information System and Data
Mining - ICISDM ’18, pages 19–28, 2018. arXiv: 1805.01890.
[27] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and docu-
ments.
[28] David R. Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable
crowdsourcing systems. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira,
and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
24, pages 1953–1961. Curran Associates, Inc., 2011.
[30] William Yang Wang. “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection.
[31] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large
Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP
Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA. http://is.muni.cz/publication/884893/en.
[32] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with
Python. O’Reilly Media, 2009.
[33] Karen Spärck Jones. A statistical interpretation of term specificity and its application
in retrieval. Journal of documentation, 2004.
[34] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[36] Lior Rokach and Oded Maimon. Data Mining With Decision Trees: Theory and
Applications. World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2nd edition,
2014.
[38] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space.
[40] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1):1929–1958, 2014.
[41] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network
training by reducing internal covariate shift.
Appendix A
Appendix B

[Figure: loss, recall, F1-score, and precision for the train and validation sets with respect to the epoch.]
Figure B.1: There is a spike for the precision, but that does not mean that the model performs well.
[Figure: loss, recall, F1-score, and precision for the train and validation sets with respect to the epoch.]