CSI: A Hybrid Deep Model For Fake News Detection
ABSTRACT
The topic of fake news has drawn attention both from the public and the academic communities. Such
misinformation has the potential of affecting public opinion, providing an opportunity for malicious parties to
manipulate the outcomes of public events such as elections. Because such high stakes are at play,
automatically detecting fake news is an important, yet challenging problem that is not yet well understood.
Nevertheless, there are three generally agreed upon characteristics of fake news: the text of an article, the
user response it receives, and the source users promoting it. Existing work has largely focused on tailoring
solutions to one particular characteristic, which has limited their success and generality.
In this work, we propose a model that combines all three characteristics for a more accurate and automated
prediction. Specifically, we incorporate the behavior of both parties, users and articles, and the group behavior of
users who propagate fake news. Motivated by the three characteristics, we propose a model called CSI which is
composed of three modules: Capture, Score, and Integrate. The first module is based on the response and
text; it uses a Recurrent Neural Network to capture the temporal pattern of user activity on a given article. The
second module learns the source characteristic based on the behavior of users, and the two are integrated
with the third module to classify an article as fake or not. Experimental analysis on real-world data
demonstrates that CSI achieves higher accuracy than existing models, and extracts meaningful latent
representations of both users and articles.
KEYWORDS
Fake news detection, Neural networks, Deep learning, Social networks, Group anomaly detection,
Temporal analysis.
1 INTRODUCTION
Fake news on social media has experienced a resurgence of interest due to the recent political climate and the
growing concern around its negative effect. For example, in January 2017, a spokesman for the German
government stated that they “are dealing with a phenomenon of a dimension that [they] have not seen
before”, referring to the proliferation of fake news [3]. Not only does it provide a source of spam in our lives,
but fake news also has the potential to manipulate public perception and awareness in a major way.
Detecting misinformation on social media is an extremely important but also a technically challenging problem.
The difficulty comes in part from the fact that even the human eye cannot accurately distinguish true from
false news; for example, one study found that when shown a fake news article, respondents found it “‘somewhat’
or ‘very’ accurate 75% of the time”, and another found that 80% of high school students had a hard time
determining whether an article was fake [2, 9]. In an attempt to combat the growing misinformation and
confusion, several fact-checking websites have been deployed to expose or confirm stories (e.g.
snopes.com). These websites play a crucial role in combating fake news, but they require expert analysis
which inhibits a timely response. In response, numerous articles and blogs have been written to raise public
awareness and provide tips on differentiating truth from falsehood [29]. While each author provides a
different set of signals to look out for, there are several characteristics that are generally agreed upon, relating
to the text of an article, the response it receives, and its source.
The most natural characteristic is the text of an article. Advice in the media varies from evaluating whether
the headline matches the body of the article, to judging the consistency and quality of the language. Attempts to
automate the evaluation of text have manifested in sophisticated natural language processing and machine
learning techniques that rely on hand-crafted, data-specific textual features to classify a piece of text as
true or false [11, 13, 24, 27, 28, 34]. These approaches are limited by the fact that the linguistic
characteristics of fake news are not yet fully understood. Further, the characteristics vary across different
types of fake news, topics, and media platforms.
A second characteristic is the response that a news article is meant to elicit. Advice columns encourage
readers to consider how a story makes them feel: does it provoke anger or another emotional response?
The advice stems from the observation that fake news often contains opinionated and inflammatory language,
crafted as clickbait or to incite confusion. For example, the New York Times cited examples of people profiting
from publishing fake stories online; the more provoking, the greater the response, and the larger the profit.
Efforts to automate response detection typically model the spread of fake news as an epidemic on a social
graph [12, 16, 17, 35], or use hand-crafted features that are social-network dependent, such as the number
of Facebook likes, combined with a traditional classifier [6, 18, 25, 27, 41, 45]. Unfortunately, access to a
social graph is not always feasible in practice, and manual selection of features is labor intensive.
A final characteristic is the source of the article. Advice here ranges from checking the structure of the URL, to the credibility of the media source, to the profile of the journalist who authored it; in
fact, Google has recently banned nearly 200 publishers to aid this task [37]. In the interest of exposure to a large audience, a set of loyal
promoters may be deployed to publicize and disseminate the content. In fact, several small-scale analyses have observed that there
are often groups of users that heavily publicize fake news, particularly just after its publication [1, 22].
For example, Figure 1 shows an example of three Twitter users who consistently promote the same fake news stories. Approaches
here typically focus on data-dependent user behaviors, or identifying the source of an epidemic, and disregard the fake news articles
themselves [31, 40]. Each of the three characteristics mentioned above has ambiguities that make it challenging to successfully
automate fake news detection based on just one of them. Linguistic characteristics are not fully understood, hand-crafted features are
data-specific and arduous, and source identification does not trivially lead to fake news detection. In this work, we build a more
accurate automated fake news detection by utilizing all three characteristics at once: text, response, and source. Instead of relying on
manual feature selection, the CSI model that we propose is built upon deep neural networks, which can automatically select
important features. Neural networks also enable CSI to exploit information from different domains and capture temporal
dependencies in users' engagement with articles. A key property of CSI is that it explicitly outputs information both on articles and
users, and does not require the existence of a social graph, domain knowledge, or assumptions on the types and distribution of
behaviors that occur in the data.
Specifically, CSI is composed of one module for each side of the activity, user and article; Figure 3b illustrates the intuition. The first
module, called Capture, exploits the temporal pattern of user activity, including text, to capture the response a given article received.
Capture is constructed as a Recurrent Neural Network (more precisely an LSTM) which receives article-specific information such as the
temporal spacing of user activity on the article and a doc2vec [19] representation of the text generated in this activity (such as a tweet).
The second module, which we call Score, uses a neural network and an implicit user graph to extract a representation and assign a
score to each user that is indicative of their propensity to participate in a source promotion group. Finally, the third module, Integrate,
combines the response, text, and source information from the first two modules to classify each article as fake or not. The three-module
composition of CSI allows it to independently learn characteristics from both sides of the activity, combine them for a more
accurate prediction, and output feedback both on the articles (as a falsehood classification) and on the users (as a suspiciousness score).
Experiments on two real-world datasets demonstrate that by incorporating text, response, and source, the CSI model achieves
significantly higher classification accuracy than existing models. In addition, we demonstrate that both the Capture and Score modules
provide meaningful information on each side of the activity. Capture generates low-dimensional representations of news articles and
users that can be used for tasks other than classification, and Score rates users by their participation in group behavior. The main
contributions can be summarized as:
(1) To the best of our knowledge, we propose the first model that explicitly captures the three common characteristics of fake
news (text, response, and source), and that identifies misinformation both on the article and on the user side.
(2) The proposed model, which we call CSI, evades the cost of manual feature selection by incorporating neural networks. The
features we use capture the temporal behavior and textual content in a general way that does not depend on the data context nor
require distributional assumptions.
(3) Experiments on real-world datasets demonstrate that CSI is more accurate in fake news classification than previous work,
while requiring fewer parameters and less training.
2 RELATED WORK
The task of detecting fake news has gone by a variety of labels, from misinformation, to rumor, to spam. Just as each individual may
have their own intuitive definition of such related concepts, each paper adopts its own definition of these terms, which may conflict
or overlap with the terms and definitions of other papers. For this reason, we specify that the target of our study is detecting news content
that is fabricated, that is, fake. Given the disparity in terminology, we overview existing work grouped loosely according to which of the
three characteristics (text, response, and source) it considers.
There has been a large body of work surrounding text analysis
of fake news and similar topics such as rumors or spam. This work has focused on mining particular linguistic cues, for example, by
finding anomalous patterns of pronouns, conjunctions, and words associated with negative emotional word usage [10, 28]. For
example, Gupta et al. [13] found that fake news often contains an inflated number of swear words and personal pronouns. Branching
off of the core linguistic analysis, many have combined the approach with traditional classifiers to label an article as true or false [6, 11,
18, 25, 27, 41, 45]. Unfortunately, the linguistic indicators of fake news across topic and media platform are not yet well understood;
Rubin et al. [34] explained that there are many types of fake news, each with different potential textual indicators. Thus, existing works
design hand-crafted features, which is not only laborious but also highly dependent on the specific dataset and on the availability of domain
knowledge. To expand beyond the specificity of hand-crafted features, Ma et al. [24] proposed a model
based on recurrent neural networks that uses mainly linguistic features. In contrast to [24], the CSI model we propose captures all
three characteristics, is able to isolate suspicious users, and requires fewer parameters for a more accurate classification.
The response characteristic has also received attention in existing work. Outside of the fake news domain, Castillo et al. [5] showed that
the temporal pattern of user response to news articles plays an important role in understanding the properties of the content itself.
From a slightly different point of view, one popular approach has been to measure the response an article received by studying its
propagation on a social graph [12, 16, 17, 35]. The epidemic approach requires access to a graph, which is infeasible in many scenarios.
Another approach has been to utilize hand-crafted, social-network-dependent behaviors, such as the number of Facebook likes, as
features in a classifier [6, 18, 25, 27, 41, 45]. As with the linguistic features, these works require feature engineering, which is laborious
and lacks generality.
The final characteristic, source, has been studied as the task of identifying the source of an epidemic on a graph [23, 40, 46], or isolating
bots based on certain documented behaviors [7, 38]. Another approach identifies group anomalies. Early work in group anomaly
detection assumed that the groups were known a priori, and the goal was to detect which of them were anomalous [31]. Such
information is rarely available in practice, hence later works propose variants of mixture models for the data, where the learned
parameters are used to identify the anomalous groups [42, 43]. Muandet et al. [30] took a similar approach by combining kernel
embedding with an SVM classifier. Most recently, Yu et al. [44] proposed a unified hierarchical Bayes model to infer the groups and
detect group anomalies simultaneously. There has also been a strong line of work surrounding the detection of suspicious user behavior of
various types; a nice overview is given in [15]. Of this line, the most related is the CopyCatch model proposed in [4], which identifies
temporal bipartite cores of user activity on pages. In contrast to existing works, the CSI model we propose can identify group anomalies
as well as the core behaviors they are responsible for (fake news). The model does not require group information as input, does not
make assumptions about a particular distribution, and learns a representation and score for each user.
In contrast to the vast array of work highlighted here, the CSI model we propose does not rely on hand-crafted features, domain
knowledge, or distributional assumptions, offering a more general modeling of the data. Further, CSI captures all three characteristics
and outputs a classification of articles, a scoring of users, and representations of both users and articles that can be used in
separate analyses.
3 PROBLEM
In this section we first lay out preliminaries, and then discuss the context of fake news which we address.
Preliminaries: We consider a series of temporal engagements that occurred between n users and m news
articles over a time window [1, T]. Each engagement between a user u_i and an article a_j at time t is represented as
e_ij^(t) = (u_i, a_j, t). In particular, in our setting, an engagement is composed of textual information relayed by the user u_i
about article a_j at time t; for example, a tweet or a Facebook post. Figure 2 illustrates the setting. In
addition, we assume that each news article is associated with a label L(a_j) = 0 if the news is true, and L(a_j) = 1
if it is false. Throughout we will use italic characters x for scalars, bold characters h for vectors, and capital
bold characters W for matrices.
Goal: While the overarching theme of this work is fake news detection, the goal is twofold: (1) accurately
classify fake news, and (2) identify groups of suspicious users. In particular, given a temporal sequence of
engagements E = {e_ij^(t) = (u_i, a_j, t)}, our
goal is to produce a label L̂(a_j) ∈ {0, 1} for each article, and a suspiciousness score s_i for each user. To do this we
encapsulate the text, response, and source characteristics in a model and capture the temporal behavior of
both parties, users and articles, as well as textual information exchanged in the activity. We make no
assumptions on the distribution of user behavior, nor on the context of the engagement activity.
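To make the setting concrete, here is a minimal Python sketch of how the engagement stream might be represented; the class name, fields, and values are illustrative, not prescribed by the paper:

```python
from dataclasses import dataclass

@dataclass
class Engagement:
    """One engagement e_ij^(t): user u_i relays text about article a_j at time t."""
    user_id: int     # index i of user u_i, with 0 <= i < n
    article_id: int  # index j of article a_j, with 0 <= j < m
    time: float      # timestamp t in the window [1, T]
    text: str        # textual content of the engagement, e.g. a tweet

# A dataset is then a list of engagements plus one ground-truth label per
# article: labels[j] = 0 if article a_j is true, and 1 if it is fake.
engagements = [Engagement(user_id=0, article_id=3, time=12.5, text="breaking news ...")]
labels = {3: 1}  # hypothetical: article 3 is fake
```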
4 MODEL
In this section, we give the details of the proposed model, which we call CSI. The model consists of two main
parts: a module for extracting temporal representations of news articles, and a module for representing and
scoring the behavior of users. The former captures the response characteristic described in Section 1 while
incorporating text, and the latter captures the source characteristic. Specifically, CSI is composed of the
following three parts, the specification and intuition of which are shown in Figure 3:
(1) Capture: To extract temporal representations of articles we use a Recurrent Neural Network (RNN).
Temporal engagements are stored as vectors and fed into the RNN, which produces as output a
representation vector v_j.
(2) Score: To compute a score s_i and representation ỹ_i, user features are fed into a fully connected
layer, and a weight vector
is applied to produce the vector s of user scores.
(3) Integrate: The outputs of the two modules are concatenated, and the resultant vector is used for
classification.
With the first two modules, Capture and Score, the CSI model extracts representations of both users and
articles as low-dimensional vectors; these representations are important for the fake news task, but can also
be used for independent analysis of users and articles.
In addition, Score produces a score for each user as a compact version of the vector. The Integrate module
then combines the article representations with the user scores for an ultimate prediction of the veracity of an
article. In the sections that follow, we discuss the details of each module.
4.1 Capture temporal article representations
For each article a_j, the sequence of engagements it receives is summarized by feature vectors x_t composed of
four parts: η, ∆t, x_u, and x_τ. The first two variables, η and ∆t, capture the temporal pattern of engagement an article receives with two
simple, yet powerful quantities: the number of engagements η, and the time between engagements ∆t.
Together, η and ∆t provide a general measure of the frequency and distribution of the response an article
received. Next, we incorporate source by adding a user feature vector x_u that is global and not specific to a
given article. In line with existing literature on information retrieval and recommender systems [21], we
construct the binary incidence matrix of which articles a user engaged with, and apply the Singular Value
Decomposition (SVD) to extract a lower-dimensional representation for each u_i. Finally, a vector x_τ is
included which carries the text characteristic of an engagement with a given article a_j. To avoid hand-crafted
textual feature selection for x_τ, we use doc2vec [19] on the text of each engagement. Further technical details
will be explained in Section 5.
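As an illustration, both global feature extractors can be sketched with off-the-shelf libraries: gensim's Doc2Vec for the text vectors x_τ, and a truncated SVD of the user-article incidence matrix for the user vectors x_u. Dimensions and variable names are our own, and the Engagement records, n, and m are carried over from the sketch in Section 3:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Text characteristic x_tau: a doc2vec embedding of each engagement's text.
docs = [TaggedDocument(words=e.text.lower().split(), tags=[k])
        for k, e in enumerate(engagements)]
d2v = Doc2Vec(docs, vector_size=100, min_count=1, epochs=20)
x_tau = np.stack([d2v.dv[k] for k in range(len(engagements))])

# Source characteristic x_u: truncated SVD of the binary n x m incidence
# matrix recording which articles each user engaged with.
rows = [e.user_id for e in engagements]
cols = [e.article_id for e in engagements]
incidence = csr_matrix((np.ones(len(engagements)), (rows, cols)), shape=(n, m))
incidence.data[:] = 1.0          # clip repeated engagements to a binary matrix
U, S, _ = svds(incidence, k=20)  # rank-20 truncated SVD (rank is illustrative)
x_u = U * S                      # row i is the 20-dim feature of user u_i
```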
Since the temporal and textual features come from different domains, it is not desirable to incorporate
them into the RNN as raw input. To standardize the input features, we insert an embedding layer between the
raw features x_t and the inputs x̃_t of the RNN. This embedding layer is a fully connected layer:

x̃_t = tanh(W_a x_t + b_a)

where W_a is a weight matrix applied to the raw features x_t at time t and b_a is a bias vector; both W_a and b_a
are shared across all x_t. To capture the temporal response of users to an article, we construct the Capture
module using a Long Short-Term Memory (LSTM) model because of its propensity for capturing long-term
dependencies and its flexibility in processing inputs of variable lengths. For the sake of brevity we do not
discuss the well-established LSTM model here, but refer the interested reader to [14] for more detail.
What is important for our discussion is that in the final step of the LSTM, x̃_T is fed as input and the last
hidden state h_T is passed to a fully connected layer. The result is a vector:

v_j = tanh(W_r h_T + b_r)

This vector serves as a low-dimensional representation of the temporal pattern of engagements a given article a_j
received, capturing both the response and textual characteristics. The vectors v_j will be fed to the Integrate
module for article classification, but can also be used for stand-alone analysis of articles.
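A minimal PyTorch sketch of Capture under these equations follows; layer sizes and names are illustrative, since the paper does not prescribe an implementation:

```python
import torch
import torch.nn as nn

class Capture(nn.Module):
    """Embedding layer + LSTM + fully connected output (Section 4.1)."""
    def __init__(self, feat_dim, embed_dim=100, hidden_dim=50, rep_dim=50):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)  # x~_t = tanh(W_a x_t + b_a)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, rep_dim)    # v_j = tanh(W_r h_T + b_r)

    def forward(self, x):                 # x: (batch, T, feat_dim)
        x_tilde = torch.tanh(self.embed(x))
        _, (h_T, _) = self.lstm(x_tilde)  # last hidden state, (1, batch, hidden_dim)
        return torch.tanh(self.out(h_T.squeeze(0)))  # v_j: (batch, rep_dim)
```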
Partitioning: In principle, the feature vector x_t associated with each engagement can be considered as an
input into a cell; however, this would be highly inefficient for large data. A more efficient approach is to
partition a given sequence by changing the granularity, and using an aggregate of each partition (such as an
average) as input to a cell. Specifically, the feature vector for article a_j at partition t has the following form:

x_t = [η, ∆t, x_u, x_τ]

where η is the number of engagements that occurred in partition t, ∆t holds the time between the current and
previous non-empty partitions, x_u is the average of user features over users u_i that engaged with a_j during t,
and x_τ is the textual content exchanged during t.
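A sketch of this partitioning step, assuming each engagement for an article arrives as a (time, user-feature, text-feature) triple and a fixed bin width (both assumptions are ours):

```python
import numpy as np

def partition_features(engs, bin_width=3600.0):
    """Aggregate one article's engagements into per-partition vectors x_t.

    engs: list of (time, user_vec, text_vec) triples for a single article.
    Returns an array of x_t = [eta, dt, mean user_vec, mean text_vec].
    """
    bins = {}
    for t, xu, xtau in engs:
        bins.setdefault(int(t // bin_width), []).append((xu, xtau))
    xs, prev = [], None
    for b in sorted(bins):
        group = bins[b]
        eta = float(len(group))                        # engagements in partition
        dt = 0.0 if prev is None else float(b - prev)  # gap to previous non-empty bin
        xu = np.mean([g[0] for g in group], axis=0)    # average user feature x_u
        xtau = np.mean([g[1] for g in group], axis=0)  # average text feature x_tau
        xs.append(np.concatenate(([eta, dt], xu, xtau)))
        prev = b
    return np.stack(xs)                                # (num_partitions, feat_dim)
```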
4.2 Score users
In the second module, we wish to capture the source characteristic present in the behavior of users. To do
this, we seek a compact representation that will have the same (small) dimension for every article (since it will
ultimately be used in the Integrate module). Given a set of user features, we first apply a fully connected
layer to extract vector representations of each user as follows:
ỹ_i = tanh(W_u y_i + b_u)

where W_u is the weight matrix and b_u is the bias; L2-regularization is used on W_u with parameter λ. This results
in a vector representation ỹ_i for each user u_i that is learned jointly with the Capture module. To aggregate
this information, we apply a weight vector w_s and bias b_s in a fully connected layer with the sigmoid
function σ:

s_i = σ(w_s^T ỹ_i + b_s)

The set of s_i forms the vector s of user scores.
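A matching PyTorch sketch of Score, with our own naming and sizes; the L2 penalty on W_u is noted in a comment and applied at training time:

```python
import torch
import torch.nn as nn

class Score(nn.Module):
    """y~_i = tanh(W_u y_i + b_u) and s_i = sigma(w_s^T y~_i + b_s) (Section 4.2)."""
    def __init__(self, user_dim, rep_dim=50):
        super().__init__()
        self.embed = nn.Linear(user_dim, rep_dim)  # W_u, b_u; L2 on W_u is applied
        self.score = nn.Linear(rep_dim, 1)         # via weight_decay when optimizing

    def forward(self, y):                          # y: (n_users, user_dim)
        y_tilde = torch.tanh(self.embed(y))        # user representations y~_i
        s = torch.sigmoid(self.score(y_tilde)).squeeze(-1)  # user scores s_i
        return y_tilde, s
```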
In principle, user features can be constructed using information from a user's social-network profile. Since we
wish to capture the source characteristic, we instead construct a weighted user graph where an edge weight denotes the
number of articles with which two users have both engaged. Users who engage in group behavior will
correspond to dense blocks in the adjacency matrix. Following the literature, we apply the SVD to the
adjacency matrix and extract a lower-dimensional feature y_i for each user, ultimately obtaining (s_i, ỹ_i) for
each user u_i.
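One way to realize this construction, reusing the binary incidence matrix from the earlier sketch (our assumption, as is the choice to drop self-loops):

```python
from scipy.sparse.linalg import svds

# Weighted user graph: entry (i, i') counts the articles that both u_i and
# u_i' engaged with; with the binary incidence matrix C this is simply C C^T.
adjacency = (incidence @ incidence.T).tolil()
adjacency.setdiag(0)                        # drop self-loops (our choice)
U, S, _ = svds(adjacency.tocsr().asfptype(), k=20)
y = U * S                                   # row i is the user feature y_i
```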
By constructing the Score module in this way, CSI is able to jointly learn from the two sides of the
engagements while extracting information that is meaningful to the source characteristic. As with the Capture
module, the vector ỹ_i can be used for stand-alone analysis of the users.
4.3 Integrate
Each of the Capture and Score modules outputs information on articles and users with respect to the
three characteristics of interest. In order to incorporate the two sources of information, we propose a third
module as the final step of CSI, in which article representations v_j are combined with the user scores s_i to
produce a label prediction L̂_j for each article.
To integrate the two modules, we apply a mask m_j to the vector s
that selects only the entries s_i whose corresponding user u_i engaged with a given article a_j. These values are
averaged to produce p_j, which captures the suspiciousness of the users that engaged with the specific
article a_j. The overall score p_j is concatenated with v_j from Capture, and the resultant vector c_j is fed into
a last fully connected layer to predict the label L̂_j of article a_j:

L̂_j = σ(w_c^T c_j + b_c)
This integration step enables the modules to work together to form a more accurate prediction. By jointly
training CSI with the Capture and Score modules, the model learns both user and article information
simultaneously. At the same time, the CSI model generates information on articles and users that captures
different important characteristics of the fake news problem, and combines the information for an ultimate
prediction.
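A sketch of Integrate consistent with the mask-and-average description above (shapes and names are ours):

```python
import torch
import torch.nn as nn

class Integrate(nn.Module):
    """p_j = masked mean of user scores; L^_j = sigma(w_c^T [v_j; p_j] + b_c)."""
    def __init__(self, rep_dim=50):
        super().__init__()
        self.classify = nn.Linear(rep_dim + 1, 1)  # w_c, b_c

    def forward(self, v, s, mask):
        # v: (batch, rep_dim) article vectors; s: (n_users,) user scores;
        # mask: (batch, n_users), 1 where user u_i engaged with article a_j.
        p = (mask * s).sum(dim=1, keepdim=True) / mask.sum(dim=1, keepdim=True)
        c = torch.cat([v, p], dim=1)               # concatenated vector c_j
        return torch.sigmoid(self.classify(c)).squeeze(-1)  # predicted L^_j
```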
CSI is trained by minimizing the cross-entropy loss over labeled articles,

L = − Σ_j [ L_j log L̂_j + (1 − L_j) log(1 − L̂_j) ]

where L_j is the ground-truth label. To reduce overfitting in CSI, random units in W_a and W_r are dropped
out during training. Under these constraints, the parameters in Capture, Score, and Integrate are jointly trained
by back-propagation.
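Putting the three modules together, joint training might look as follows; the optimizer, learning rate, λ value, feature dimensions, and data loader are all our assumptions, and dropout on the Capture weights is elided for brevity:

```python
import torch

# Modules from the earlier sketches; feat_dim = 2 (eta, dt) + 20 (x_u) + 100 (x_tau).
capture, score, integrate = Capture(feat_dim=122), Score(user_dim=20), Integrate()
opt = torch.optim.Adam([
    {"params": capture.parameters()},
    {"params": integrate.parameters()},
    {"params": score.embed.parameters(), "weight_decay": 1e-4},  # L2 on W_u (lambda)
    {"params": score.score.parameters()},
], lr=1e-3)
loss_fn = torch.nn.BCELoss()  # the cross-entropy loss above

for x_seq, y_users, mask, y_true in loader:  # one mini-batch of articles
    v = capture(x_seq)                       # article representations v_j
    _, s = score(y_users)                    # user scores s_i
    pred = integrate(v, s, mask)             # predicted labels L^_j
    loss = loss_fn(pred, y_true.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
```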
4.4 Generality
We have presented the CSI model in the context of fake news; however, our model can easily be
generalized to other datasets. Consider a set of engagements between an actor q_i and a target r_j over time t ∈
[0, T]; in other words, the article in Figure 3b is a target and each user is an actor. The Capture module can be
used to capture the temporal patterns of engagements exhibited on targets by actors, and Score can be used
to extract a score and representation of each actor q_i that captures their participation in group behavior. Finally,
Integrate combines the first two modules to enhance the prediction quality on targets. For example, consider
users accessing a set of databases: the Capture module can identify databases which received an unusual
pattern of access, and Score can highlight users that were likely responsible. In addition, the flexibility of CSI
allows for integration of additional domain knowledge.
5 EXPERIMENTS
In this section, we demonstrate the quality of CSI on two real-world datasets. In the main set of
experiments, we evaluate the accuracy of the classification produced by CSI. In addition, we investigate the
quality of the scores and representations produced by the Score module and show that they are highly related
to the source characteristic. Finally, we show the robustness of our model when labeled data is limited and
investigate temporal behaviors of suspicious users.
Datasets: In order to have a fair comparison, we use two real-world social media datasets that have been
used in previous work, TWITTER and WEIBO [24]. To date, these are the only publicly available datasets that
include all three characteristics: response, text, and user information. Each dataset has a number of articles
with labels L(a_j); in TWITTER the articles are news stories, and in WEIBO they are discussion topics. Each article
also has a set of engagements (tweets) made by a user u_i at time t. A summary of the statistics is listed in Table
1.
Table 2 shows the classification results using 80% of the data for training, 5% for parameter tuning,
and the remaining 15% for testing; we use 5-fold cross-validation. This division is chosen following previous
work for fair comparison, and will be studied in later sections. We see that CSI outperforms the other models in
both accuracy and F-score. Specifically, CI (a variant of CSI that uses only the textual features in Capture) shows
performance similar to GRU-2, which is a more complex 2-layer stacked network. This performance validates our
choice of capturing fundamental temporal behavior, and demonstrates how a simpler structure can benefit from
better features and partitioning. Further, it shows the benefit of utilizing doc2vec over simple tf-idf.
Next, we see that CI-t, which adds the temporal features η and ∆t, exhibits an improvement of more than 1% in
both accuracy and F-score over CI. This demonstrates that while linguistic features may carry some temporal
properties, the frequency and distribution of engagements carries useful information for capturing the difference
between true and fake news.
Finally, CSI gives the best performance over all comparison models and versions. We see that integrating
user features boosts the overall numbers by up to 4.3% over GRU-2. Put together, these results demonstrate
that CSI successfully captures and leverages all three characteristics of text, response, and source, for
accurately classifying fake news.
5.3 Model complexity
In practice, the availability of labeled examples of true and fake news may be limited; hence, in this section,
we study the usability of CSI in terms of the number of parameters and the amount of labeled training samples it
requires.
Although CSI is based on deep neural networks, the compact set of features that Capture utilizes results in
fewer required parameters than other models. Furthermore, the user relations in Score deliver condensed
representations which cannot be captured by an RNN, allowing CSI to have fewer parameters than other RNN-
based models. In particular, the model has on the order of 52K parameters, whereas GRU-2 has 621K
parameters.
To study the number of labeled samples CSI relies on, we study the accuracy as a function of the training set
size. Figure 4 shows that even when only 10% of training samples are available, CSI achieves performance
comparable to GRU-2; thus, the CSI model is lighter and can be trained more easily with fewer training
samples.
5.4 Interpreting user representations
In this section, we analyze the output of Score, which is a score s_i and a representation ỹ_i for every user. Since
the available data does not have ground-truth labels on users, we perform a qualitative evaluation of the
information contained in (s_i, ỹ_i) with respect to the source characteristic of fake news.
Although we lack user labels, the dataset still contains information that can be used as a proxy. In particular,
we want to evaluate whether (s_i, ỹ_i) captures the suspicious behavior of users in terms of promotion of fake
news and group behavior. For the former, a reasonable proxy is the fraction of fake news a user engages
with, denoted f_i ∈ [0, 1], with 0.0 meaning the user has never reacted to fake news, and 1.0 meaning the
engagements are exclusively with fake news. In addition, we consider the corresponding scores for articles
as the average over users: namely, p_j is the average of s_i, and λ_j is the average of f_i, over users u_i that
engaged with a_j.
To test the extent to which (s_i, ỹ_i) capture f_i, we compute the correlation between the two measures across
users; Table 3 shows the Pearson correlation coefficient and significance. For both datasets and on both sides of
the user-article engagement, we find a statistically significant positive relationship between the two scores.
Results are consistent for the Spearman coefficient and for ordinary least squares (OLS) regression. In addition,
Figures 5a and 5b show the distribution of f_i among a subset of users with the highest and lowest s_i. Most of the
users who were assigned a high s_i by CSI (marked as most suspicious) have f_i close to 1, while those with
low s_i have low f_i. Altogether, the results demonstrate that s_i and p_j hold meaningful information with
respect to user levels of engagement with fake news.
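The correlation check itself is a one-liner with scipy; here s and f are assumed to be the arrays of per-user scores s_i and fake-news fractions f_i from above:

```python
from scipy.stats import pearsonr, spearmanr

r, p = pearsonr(s, f)       # Pearson coefficient and significance (as in Table 3)
rho, p_s = spearmanr(s, f)  # Spearman, reported as a consistency check
print(f"Pearson r = {r:.3f} (p = {p:.2g}); Spearman rho = {rho:.3f}")
```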
To investigate the relation of ỹ_i to f_i, we regress the cosine distance between ỹ_i and ỹ_i' against
the difference between f_i and f_i' for each pair of users (i, i'). Consistent with the results for s_i, we find a positive
correlation of 0.631 for TWITTER and 0.867 for WEIBO, both of which are statistically significant at the 1% level.
Further, we visualize the space of user representations by projecting a sample of the vectors ỹ_i onto the first
and second singular vectors µ_1 and µ_2 of the matrix of ỹ_i's. Figure 6 shows the projection for both datasets,
where each point corresponds to a user u_i and is colored according to f_i. We see that the space exhibits a
strong separation between users with extreme f_i, suggesting that the vectors ỹ_i offer a good latent
representation of user behavior with respect to fake news and can be used for deeper user analysis.
Next, we analyze the propensity of (s_i, ỹ_i) to capture group behavior. We construct an implicit user
graph by adding an edge between users who have engaged with the same article, and analyze the
clustering of users in the graph. We apply the BiMax algorithm proposed by Prelić et al. [32] to search for
biclusters in the adjacency matrix. We find that for both datasets, users with large f_i participate in more
and larger biclusters than those with low f_i. Further, biclusters for users with large f_i are formed largely around
fake news articles, while those for low f_i are largely around true news.
This suggests that suspicious users exhibit the source characteristic with respect to fake news. In addition, for
each pair of users (u_i, u_i') we compute the Jaccard distance between the sets of articles they interacted with. We
compute the correlation between this quantity and |s_i − s_i'|, as well as the cosine distance between ỹ_i
and ỹ_i'. For the former we find a correlation of 0.36 for TWITTER
and 0.21 for WEIBO, and for the latter we find 0.30 for TWITTER
and 0.16 for WEIBO. All results are significant at the 1% level, with Spearman correlation and OLS giving consistent
results.
Overall, despite the lack of ground-truth labels on users, our analysis demonstrates that the Score module captures
meaningful information with respect to the source characteristic. The user score s_i provides the model
with an indication of the suspiciousness of user u_i with respect to group behavior and fake news engagement.
Further, the ỹ_i vector provides a representation of each user that can be used for deeper analysis of user
behavior in the data.
5.5 Characterizing user behavior
In this section, we ask whether the users marked as suspicious by CSI have any characteristic behavior. Using
the s_i scores, we select approximately 25 users from the most suspicious group, and the same
number from the least suspicious group.
We consider two properties of user behavior: (1) the lag and (2) the activity. To measure lag, for each user
we compute the time between an article's publication and when the user first engaged
with it. We then plot the distribution of user lags separated by most and least suspicious users, and by true and fake
news. Figure 7 shows the CDF of the results. Immediately we see that the most suspicious users in each
dataset are some of the first to promote the fake content, supporting the source characteristic. In contrast,
both types of users act similarly on real news.
Next, we measure user activity as the time between engagements user u_i had with a particular article a_j.
Figure 8 shows the CDF of user activity. We see that on both datasets, suspicious users often have bursts of quick
engagements with a given article; this behavior differs more significantly from the least suspicious users on fake
news than it does on true news. Interestingly, the behavior of suspicious users on TWITTER is similar on fake and
true news, which may indicate a sophistication in fake-content promotion techniques. Overall, these
distributions show that the combination of temporal, textual, and user features in x_t provides meaningful
information to capture the three key characteristics, and for CSI to distinguish suspicious users.
5.6 Utilizing temporal article representations
In this section, we investigate the vector v_j that is the output of Capture for each article a_j. Intuitively,
these vectors are a low-dimensional representation of the temporal and textual response an article has
received, as well as the types of users the response has come from. In a general sense, the output of an
LSTM has been used for a variety of tasks such as machine translation [36], question answering [39], and
text classification [20]. Hence, in the context of this work it is natural to wonder whether these vectors
can be used for deeper insight into the space of articles.
As an example, we apply Spectral Clustering for a more fine-grained partition than two classes. We
consider the set of v_j associated with the test sets of TWITTER and WEIBO articles, and set k = 5 clusters according
to the elbow curve. Figure 9 shows the results in the space of the first two singular vectors (µ_1 and µ_2) of the
matrix formed by the vectors v_j for each respective dataset, with one color per cluster.
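A sketch of this analysis with scikit-learn, where V is assumed to stack the article vectors v_j of the test set; the affinity choice and random seed are ours:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

clusters = SpectralClustering(n_clusters=5, affinity="nearest_neighbors",
                              random_state=0).fit_predict(V)

# 2-D view as in Figure 9: project onto the first two singular vectors of V.
U, S, _ = np.linalg.svd(V - V.mean(axis=0), full_matrices=False)
coords = U[:, :2] * S[:2]   # one (mu_1, mu_2) coordinate pair per article
```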
Table 4 shows the breakdown of true and false articles in each cluster. We can see that the results give a
natural division among both true and fake articles. For example, on the TWITTER dataset, while both C2 and C4 are
composed mostly of fake news, we can see that the projections of their temporal representations are quite
separated. This separation suggests that there may be different types of fake news which exhibit slightly
different signals in the text, response, and source characteristics, for example, satire and spam. The WEIBO data
shows two poles: C1 in the top left corresponds largely to true news, while C2 and C4 capture different types
of fake news. Meanwhile, C3 and C5, which are spread across the middle, have more mixed membership.
In the context of the general framework described in Section 4, the results show that the v_j vectors
produced by the Capture module offer insight into the population of articles with respect to the response
they receive. Aside from the classification output of the model, the representations can be used stand-
alone for gaining insight about targets (articles) in the data.
6 CONCLUSION
In this work, we study the timely problem of fake news detection. While existing work has typically
addressed the problem by focusing on either the text, the response an article receives, or the users who
source it, we argue that it is important to incorporate all three. We propose the CSI model which is
composed of three modules. The first module, Capture, models the abstract temporal behavior of user
encounters with articles, as well as temporal textual and user features, to measure the response as well as the
text. The second module, Score, estimates a source suspiciousness score for every user, which is then
combined with the first module by Integrate to produce a predicted label for each article.
The separation into modules allows CSI to output predictions separately on users and articles, incorporating
each of the three characteristics while combining the information for classification. Experiments on two
real-world datasets demonstrate the accuracy of CSI in classifying fake news articles. Aside from accurate
prediction, the CSI model also produces latent representations of both users and articles that can be used for
separate analysis; we demonstrate the utility of both the extracted representations and the computed user
scores.
The CSI model is general in that it makes no assumptions on the distribution of user behavior, on the
particular textual context of the data, nor on the underlying structure of the data. Further, by utilizing the
power of neural networks, we incorporate different sources of information, and capture the temporal
evolution of engagements from both parties, users and articles. At the same time, the model allows for easy
incorporation of richer data, such as user profile information or advanced text libraries. Overall, our work
demonstrates the value in modeling the three intuitive and powerful characteristics of fake news.
Despite encouraging results, fake news detection remains a challenging problem with many open questions.
One particularly interesting direction would be to build models that incorporate concepts from
reinforcement learning and crowdsourcing. Including humans in the learning process could lead to more
accurate and, in particular, more timely predictions.