
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24)

Natural Language-centered Inference Network for Multi-modal Fake News Detection

Qiang Zhang, Jiawei Liu∗, Fanrui Zhang, Jingyi Xie and Zheng-Jun Zha
University of Science and Technology of China, China
{zq 126, zfr888, hsfzxjy}@mail.ustc.edu.cn, {jwliu6, zhazj}@ustc.edu.cn

∗Corresponding author

Abstract

The proliferation of fake news with image and text on the internet has triggered widespread concern. Existing research has made important contributions to cross-modal information interaction and fusion, but fails to fundamentally address the modality gap among news image, text, and news-related external knowledge representations. In this paper, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection by aligning multi-modal news content with the natural language space and introducing an encoder-decoder architecture to fully comprehend the news in-context. Specifically, we first unify multi-modal news content into the textual modality by converting news images and news-related external knowledge into plain textual content. Then, we design a multi-modal feature reasoning module, which consists of a multi-modal encoder, a unified-modal context encoder and an inference decoder with a prompt phrase. This framework not only fully extracts the latent representation of cross-modal news content, but also utilizes the prompt phrase to stimulate the powerful in-context learning ability of the pre-trained large language model to reason about the truthfulness of the news content. In addition, to support research in the field of multi-modal fake news detection, we produce a challenging large-scale, multi-platform, multi-domain multi-modal Chinese Fake News Detection (CFND) dataset. Extensive experiments show that our CFND dataset is challenging and the proposed NLIN outperforms state-of-the-art methods.

[Figure 1: Comparison with previous paradigms. (a) Traditional MFND follows the paradigm of fusing news image and text features with a multi-modal fusion encoder, then performing fake news detection. (b) The MFND-with-news-knowledge paradigm employs news entities represented in the knowledge graph as news background knowledge and fuses them into the cross-modal space. (c) The proposed paradigm unifies news content and background knowledge into the natural language space and reasons about the authenticity of news through an encoder-decoder architecture equipped with prompt phrases.]

1 Introduction

Fake news is spreading online at an ever-increasing rate, posing a serious challenge to the credibility of news media platforms [Abdelnabi et al., 2022; Liu et al., 2023]. At the same time, the widespread dissemination of fake news may cause mass panic and social instability; e.g., some unscrupulous individuals have exploited fake news to mislead health care practices, undermine the credibility of governments and interfere with presidential elections [Narayan et al., 2022; Zhang et al., 2023a]. In addition, with the swift development of multimedia technology, fake news makers are increasingly turning to multi-modal content, such as attractive images, to capture and mislead the general public, thus making fabricated stories more credible and disseminating them more speedily. Therefore, in order to reduce the harmful effects of fake news dissemination, automatic Multi-modal Fake News Detection (MFND) has received more and more research attention.

To address this problem, a series of multi-modal fake news detectors have been proposed to explore cross-modal information interaction and fusion for identifying anomalies in fake news.

Most of the previous work follows the paradigm of fusing news image and text features with a multi-modal encoder and subsequently performing fake news detection (as shown in Figure 1 (a)). These approaches simply utilize concatenation operations [Singhal et al., 2019a; Wang et al., 2021], attention mechanisms [Jin et al., 2017; Zhou et al., 2022] or auxiliary tasks [Chen et al., 2022a; Khattar et al., 2019] to capture the underlying semantic correlations between image and text features. However, these methods fail to obtain persuasive and interpretable results as they lack the link to external facts. Therefore, in order to verify the authenticity of news at the knowledge level, a few approaches excavate background knowledge about the news to serve as a source of objective evidence by extracting news entities and linking them to a knowledge graph, as shown in Figure 1 (b). For example, some of them consider news entities and their contexts as external knowledge and utilize attention mechanisms to assess the weights of news entity representations [Zhang et al., 2023b; Tseng et al., 2022] or discover inconsistent semantic information at the knowledge level [Sun et al., 2021]. However, these methods neglect or do not fundamentally address the intractable problem of the modality gap among the image, text, and knowledge representations of news, resulting in the model's inability to adequately extract cross-modal correlation features of news content. Therefore, in order to overcome this dilemma, a potentially effective solution is to unify multi-modal news content into the textual modality and utilize the powerful in-context learning ability and rich implicit knowledge of a large language model (LLM) to extract the cross-modal correlation features of the news and reason about the authenticity of its content.

Although fake news crosses geographical and linguistic boundaries, most of the work and datasets available in the field are concentrated in the English domain [Boididou et al., 2018; Zubiaga et al., 2017], with limited content in other language domains. Taking the Chinese domain as an example, the only existing multi-modal fake news detection datasets are Weibo [Jin et al., 2017] and Weibo-21 [Nan et al., 2021], and their data sources are limited to a single web platform, which makes their representation of fake news rather homogeneous. At the same time, their data size is restricted and their news content is outdated, which is not conducive to detecting current news content and brings certain limitations to the development of Chinese fake news detection. Therefore, a large-scale, multi-platform, multi-domain Chinese fake news detection dataset is urgently needed.

In order to support research in the field of multi-modal fake news detection, we produce a challenging multi-modal Chinese Fake News Detection (CFND) dataset, which is collected from multiple platforms and contains news from multiple domains. CFND consists of 26,665 news samples, significantly more than previous Chinese datasets. Additionally, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection, whose general structure is shown in Figure 1 (c). This detection paradigm is mainly divided into three phases. (1) In the news pre-processing phase, we extract the visual context using three approaches, namely image captioning, dense labeling and OCR, so that we achieve image-to-text conversion with minimal information loss. In terms of news-related external knowledge, we first extract a visual entity set from the visual context and a textual entity set from the news text. The news entities are then linked to the Wikidata [Vrandečić and Krötzsch, 2014] database to obtain the contextual description of each entity and construct the news background knowledge. (2) In the encoding phase, we employ the LLM encoder finetuned with LoRA [Hu et al., 2021] to embed the news content, which has been unified into the natural language space, obtaining the news visual context feature and textual context feature. Meanwhile, in order to prevent the loss of the original news information, we extract the news multi-modal features with the CLIP model [Radford et al., 2021] and map them to the textual embedding space with a multi-modal mapping network. (3) In the decoding phase, we introduce an inference decoder with a prompt phrase. This module utilizes the prompt phrase to enable the decoder from the LLM to reason about the authenticity of the news content based on the previously obtained news visual context feature, textual context feature and multi-modal feature, while transforming the generative problem into a classification problem, i.e., the location-specific output from the decoder serves as the input feature for the final classification. Extensive experiments demonstrate that our CFND dataset is challenging and the proposed NLIN outperforms state-of-the-art methods.

The main contributions of this paper are as follows:
• We unify all the news-related information (image, text and news-related external knowledge) into the natural language space, thus fundamentally addressing the effects of the modality gap.
• We introduce an encoder-decoder architecture equipped with prompt phrases for fully comprehending the news context and inferring its authenticity.
• We produce a challenging large-scale, multi-platform, multi-domain multi-modal Chinese Fake News Detection (CFND) dataset.

2 Related Works

2.1 Text-based Fake News Detection
Traditional fake news detection is mainly based on analysing the semantic features of news textual content to determine its authenticity [Ma et al., 2015]. Early work focused on designing complex handcrafted features, such as lexical and syntactic features [Pérez-Rosas et al., 2017] based on news text, user comments or user interactions on social networks [Wu et al., 2015]. Recently, deep learning models have achieved promising results in detecting fake news. For example, Tseng et al. [Tseng et al., 2022] introduced sub-event segmentation algorithms to aggregate user comments, modelling the news content and user comments at various degrees of semantic granularity. Han et al. [Han et al., 2021] extracted news entities and relationships to form a single knowledge graph, transforming fake news detection into a subgraph classification task.

2.2 Multi-modal Fake News Detection
Recently, multi-modal fake news detection has received considerable attention because news content tends to coexist with multi-modal information such as images and text.


Khattar et al. [Khattar et al., 2019] performed fake news detection by using a multi-modal variational autoencoder to learn cross-modal correlations and shared representations of multi-modal content. Chen et al. [Chen et al., 2022a] applied a contrastive learning approach to transform unimodal features into the same feature space and aggregated unimodal and multi-modal features to varying degrees by quantifying the ambiguity between text and image. Wu et al. [Wu et al., 2021] applied the frequency-domain information of the image as a complement and fused it with image and text features for fake news detection. Sun et al. [Sun et al., 2021] detected fake news by capturing inconsistent semantic information of news content at the cross-modal and knowledge levels. However, these models neglect or do not fundamentally address the intractable problem of the modality gap among the image, text and knowledge representations of news, which in turn limits the performance of fake news detection systems.

2.3 Multi-modal Fake News Detection Datasets
The field of multi-modal fake news detection has constructed many datasets for research purposes. In the English domain, the Pheme [Zubiaga et al., 2017] dataset consists of tweets from the Twitter platform based on five breaking news stories. The Politifact and GossipCop datasets were collected from the political and entertainment domains of the FakeNewsNet [Shu et al., 2020] database respectively. The Twitter [Boididou et al., 2018] dataset was created for the MediaEval Verifying Multimedia Use task. In the Chinese domain, Weibo [Jin et al., 2017] and Weibo-21 [Nan et al., 2021] are two widely used Chinese multi-modal fake news detection datasets. They are both collected from the Sina Weibo platform, while the Weibo-21 dataset contains multiple categories of news. However, these datasets tend to be confined to a single platform with small data volume, which brings about some limitations.

3 CFND Dataset Construction

3.1 Data Collection
For the fake news data, we utilize six active Chinese fact-checking websites as data sources, whose news content is typically derived from public announcements, news articles and social media platforms such as Sina Weibo and TikTok. The authenticity of the news content on these websites is evaluated by expert fact-checking personnel. For the real news data, we apply four official news websites as data sources. During the data collection process, we selected news content from different categories in proportion to ensure coverage across various domains. Additionally, for each news item, we select the news title and its corresponding image as the text and image of a sample in the CFND dataset.

3.2 Data Pre-processing
When crawling news content from the web, the results often differ from the standard dataset format. Therefore, we manually check all the crawled data. The checks are described below:
• Since some crawled data may contain irrelevant information, such as advertisements or irrelevant images, we perform noise data identification and removal. Additionally, we check the quality of the images and remove images with too low resolution or blurred images to ensure that the model inputs are of high quality.
• In some cases, the same news data may exist on multiple platforms, so we apply de-duplication to remove these similar data and reduce redundancy in the dataset.
• On Chinese fact-checking websites, news headlines often tend to use rhetorical questions to express non-factual claims. Therefore, in order to alleviate the doubtfulness of headlines, we transform rhetorical questions into declarative phrases.
• To determine news domains, we reference and synthesize the categorization methods of various news websites and select five domains, while hiring 5 experts to manually label the news domains. Initially, each expert annotated independently, followed by a cross-checking process. If at least 4 experts agreed, the final domain labeling was determined.

After pre-processing the dataset, we obtain a total of 10,271 fake news items and 16,394 real news items, and make sure that each news item has a corresponding image. Furthermore, we divide the whole dataset into training, validation and test sets with a ratio of 60%, 20% and 20% respectively, maintaining the proportion of real and fake news in each set, as sketched below.
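A minimal sketch of this label-preserving 60/20/20 split, assuming the dataset is held as parallel lists of samples and binary labels (this is an illustration, not the authors' released tooling):

```python
from sklearn.model_selection import train_test_split

def split_cfnd(samples, labels, seed=42):
    """Stratified 60/20/20 split preserving the real/fake ratio in each set."""
    # Carve off the 60% training portion, stratifying on the label.
    train_x, rest_x, train_y, rest_y = train_test_split(
        samples, labels, train_size=0.6, stratify=labels, random_state=seed)
    # Split the remaining 40% in half: 20% validation, 20% test.
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, train_size=0.5, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```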
[Figure 2: The statistics of the CFND dataset. (a) Real news: Politics 29.4%, Health 27.7%, Epidemic 19.0%, Disasters 14.1%, Society 9.8%. (b) Fake news: Health 28.9%, Epidemic 23.1%, Disasters 21.3%, Society 15.4%, Politics 11.3%.]

3.3 Data Analysis
We analyze the news content in the CFND dataset, as shown in Figure 2. Its news content involves five domains: epidemic, health, politics, disasters and society. In the fake news data, content about health and the epidemic is the most frequent, because these two types of information are closely related to people's daily life, and rumor mongers usually spread rumors in these fields to gain public attention. In the real news data, content about politics is the most frequent, followed by the health and epidemic domains. Meanwhile, we compare the CFND dataset with the existing multi-modal fake news detection datasets in Table 1. It can be observed that, compared to previous datasets, CFND has a larger data volume and simultaneously encompasses news content from multiple platforms and domains.


Dataset | Real | Fake | Image | Multi-domain | Language
Pheme [Zubiaga et al., 2017] | 3,830 | 1,972 | 5,802 | ✗ | English
Twitter [Boididou et al., 2018] | 6,225 | 9,404 | 411 | ✗ | English
Politifact [Shu et al., 2020] | 624 | 432 | 783 | ✗ | English
GossipCop [Shu et al., 2020] | 16,817 | 5,323 | 18,417 | ✗ | English
Weibo [Jin et al., 2017] | 4,779 | 4,779 | 9,558 | ✗ | Chinese
Weibo-21 [Nan et al., 2021] | 4,640 | 4,488 | 9,128 | ✓ | Chinese
CFND | 16,394 | 10,271 | 26,665 | ✓ | Chinese

Table 1: Comparison with existing multi-modal fake news detection datasets.

4 Method

4.1 Overview
Multi-modal fake news detection is commonly described as a binary classification problem, aiming to determine the veracity of a given post with both news text T and image I. To address this problem, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection, whose architecture is shown in Figure 3. Specifically, we first convert the news image and background knowledge into plain textual content. Then, we employ a unified-modal context encoder to extract features from the news content, which has been unified into the natural language space, thus obtaining both the visual and textual context features of the news. Meanwhile, to prevent the loss of original news information, we extract the news multi-modal feature through the CLIP model and map it to the text embedding space. Finally, we concatenate the news visual context feature, textual context feature and multi-modal feature and input them together into the inference decoder, while using a prompt phrase to transform the generation problem into a classification problem.

4.2 News Pre-processing
In this section, we outline how news images and background knowledge are translated into the natural language space.

Visual Description. In order to maximize the extraction of the semantic information embedded in the image, we perform the following three transformations on the image:
• Obtain the global semantic information of the image with the BLIP [Li et al., 2022] model, generating the image caption C.
• Obtain the local semantic information of the image with the VinVL [Zhang et al., 2021] model, acquiring the dense labelling L.
• Utilize the EasyOCR [EasyOCR, ] model to detect the embedded text O in the image.

At this point, we have obtained the visual context information of the image, VC = (C, L, O), which can be expressed as follows:

C = M_{BLIP}(I)
L = M_{VinVL}(I) = \{l_1, \dots, l_i\}, \quad l_i = \{w_0^{attr}, \dots, w_n^{attr}, w^{obj}\}    (1)
O = M_{EasyOCR}(I) = \{w_0^{ocr}, \dots, w_j^{ocr}\}
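A minimal sketch of this image-to-text step is given below. It uses public BLIP captioning and EasyOCR packages; the checkpoint name is an assumption, and the VinVL dense labels are omitted because VinVL has no comparable one-line interface, so this is an illustration rather than the authors' exact pipeline.

```python
import easyocr
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; the paper does not specify which BLIP weights were used.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")
ocr_reader = easyocr.Reader(["en"])  # add "ch_sim" for the Chinese CFND images

def visual_context(image_path: str) -> dict:
    """Convert a news image into plain text: caption C and embedded text O."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption_ids = captioner.generate(**inputs, max_new_tokens=30)
    C = processor.decode(caption_ids[0], skip_special_tokens=True)
    O = " ".join(ocr_reader.readtext(image_path, detail=0))
    return {"caption": C, "ocr": O}  # dense labels L would come from VinVL
```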
Background Knowledge Description. News background knowledge can supply rich evidential information, which is beneficial both for understanding the news content and for improving the interpretability of a fake news detection model. Therefore, in order to construct the news background knowledge, we first extract news entities from the news content, after which we perform an entity search in the Wikidata [Vrandečić and Krötzsch, 2014] database to obtain the specific description of each entity. The details of the process are as follows:

Regarding visual entities, we choose to extract them from the visual context with the entity linking tool TagMe. Since the image object labels L extracted by the VinVL model tend to be redundant, and their main labels are often the same as the entities in the image caption, we extract the visual entities using only the content of the image caption C and the image-embedded text O. Thus, we obtain the set of visual entities {E_V^v}. Regarding textual entities, we also employ the TagMe tool to obtain the textual entity set {E_T^t} from the news text.

After obtaining the news visual entity set {E_V^v} and the textual entity set {E_T^t}, we search the Wikidata database for each entity individually to acquire its corresponding entity description. Meanwhile, we form a complete sentence from the entity and its contextual description, e.g., "Tony Abbott : prime minister of Australia from 2013 to 2015.", and treat this as one news background knowledge item. After searching for all entities, we complete the construction of the news visual background knowledge V_b and textual background knowledge T_b.
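The entity-to-description lookup can be approximated with Wikidata's public search API, as sketched below. This is an illustration of the lookup rather than the authors' code, and it assumes TagMe has already produced the entity strings.

```python
import requests

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def entity_knowledge(entity: str) -> str:
    """Look up an entity on Wikidata and form an 'entity : description' sentence."""
    params = {
        "action": "wbsearchentities",
        "search": entity,
        "language": "en",
        "format": "json",
        "limit": 1,
    }
    hits = requests.get(WIKIDATA_API, params=params).json().get("search", [])
    if not hits or "description" not in hits[0]:
        return ""  # entity not found: contributes no background knowledge
    return f"{entity} : {hits[0]['description']}."

# e.g. entity_knowledge("Tony Abbott")
# -> "Tony Abbott : former Prime Minister of Australia..." (description may vary)
```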
4.3 Multi-modal Feature Reasoning
We utilize the Flan-T5 [Chung et al., 2022] model as the backbone of the multi-modal feature reasoning module, i.e., we employ its text encoder and decoder as our unified-modal context encoder and inference decoder. Additionally, the CLIP [Radford et al., 2021] model is applied to extract the complementary multi-modal features of the news. To avoid excessive training parameters and ensure that the pre-trained language model effectively adapts to the multi-modal fake news detection task, we employ low-rank adaptation (LoRA) [Hu et al., 2021] to efficiently fine-tune the Flan-T5 model; LoRA freezes the pre-trained Flan-T5 model weights and injects trainable rank decomposition matrices into each layer of the transformer architecture, significantly reducing the number of trainable parameters. The forward propagation process of the modified model can be expressed as follows:

h = W_0 x + \Delta W x = W_0 x + B A x    (2)


Figure 3: The overall architecture of the proposed NLIN. It consists of two modules: a news pre-processing module and a multi-modal feature reasoning module. The news pre-processing module is responsible for unifying the news content into the natural language space, and the multi-modal feature reasoning module is used to reason about the authenticity of the news content.

where W_0 represents the frozen parameters of the original pre-trained Flan-T5 model, A is a parameter matrix initialized using random Gaussian initialization, and B is a parameter matrix initialized with zeros.
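As a concrete reading of Eq. (2), the following is a minimal LoRA linear layer. Real experiments would typically use the PEFT library rather than hand-rolling this, and the rank and scaling values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Implements h = W0*x + B*A*x (Eq. 2): W0 is frozen, only A and B train."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights W0
        # A is Gaussian-initialized and B starts at zero, so delta_W = B*A is
        # initially zero and fine-tuning starts exactly at the pre-trained model.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank  # standard LoRA scaling, not part of Eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```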
Unified-modal Context Encoder. We concatenate the visual context and visual background knowledge as the visual text input T_v, denoted as <Caption C> + <Dense labelling L> + <OCR O> + <Visual background knowledge V_b>. Additionally, the news text and textual background knowledge are concatenated as the news text input T_t, denoted by <News text T> + <Textual background knowledge T_b>. Apart from this, to make the pre-trained Flan-T5 model comprehend the purpose of the multi-modal fake news detection task and maximise its potential, we design a prompt phrase T_p, i.e., "Identify the following news content as true or false:". Then, we encode each of the above three contexts with the unified-modal context encoder to obtain the prompt phrase feature f_P ∈ R^{l×d}, the visual context feature f_V ∈ R^{m×d} and the textual context feature f_T ∈ R^{n×d}, where l, m and n correspond to their text lengths and d refers to the embedding dimension.
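To make the input construction concrete, a small sketch using the public Flan-T5 encoder follows. The checkpoint size, the plain-string concatenation, and the example field values (taken from the Figure 3 example) are assumptions about details the paper leaves open.

```python
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")  # assumed size
encoder = T5EncoderModel.from_pretrained("google/flan-t5-base")

def encode(text: str):
    ids = tokenizer(text, truncation=True, max_length=200, return_tensors="pt")
    return encoder(**ids).last_hidden_state  # shape (1, seq_len, d)

# Example field values in the spirit of Figure 3.
caption = "A man in a suit and tie sitting in front of a christmas tree."
labels = "talking man in a suit, christmas tree, yellow background"
ocr = "sydney siege, tony abbott media conference in canberra"
visual_bg = "sydney : capital city of New South Wales, Australia."
news_text = "Australian PM Abbott: motivation of perpetrator in Sydney hostage situation"
textual_bg = "Tony Abbott : prime minister of Australia from 2013 to 2015."

T_p = "Identify the following news content as true or false:"
T_v = f"{caption} {labels} {ocr} {visual_bg}"  # <C> + <L> + <O> + <Vb>
T_t = f"{news_text} {textual_bg}"              # <T> + <Tb>

f_P, f_V, f_T = encode(T_p), encode(T_v), encode(T_t)
```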
Multi-modal Encoder. In order to prevent the loss of original news information, we utilize the CLIP model to extract a multi-modal feature from the news image and text, and employ a multi-modal mapping network that maps it to the same space as the text embedding, thus obtaining the multi-modal embedding of the news. The details are as follows: we first apply the image encoder and text encoder of the CLIP model to extract the news image and text features, denoted as f_{clip-V} and f_{clip-T} respectively. After that, we construct the news multi-modal feature f_{clip} as the concatenation of the image and text features.
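A sketch of this extraction with the open CLIP interface follows; the checkpoint name and the pooled-feature concatenation are assumptions about unspecified details.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # assumed
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_feature(image: Image.Image, text: str) -> torch.Tensor:
    """Return f_clip, the concatenation of the CLIP image and text features."""
    inputs = clip_proc(text=[text], images=image, return_tensors="pt",
                       truncation=True)
    with torch.no_grad():
        f_v = clip.get_image_features(pixel_values=inputs["pixel_values"])
        f_t = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return torch.cat([f_v, f_t], dim=-1)  # f_clip = [f_clip-V : f_clip-T]
```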
Multi-modal Mapping Network. In order to map the news multi-modal feature f_{clip} to the text embedding space, we employ a multi-modal mapping network, Map_m, to reconstruct the original multi-modal feature representation. Within the multi-modal mapping network, we apply a set of M learnable prompts P_m to facilitate the mapping, which are used as input along with f_{clip} to obtain the news multi-modal embedding f_M:

f_M = Map_m(P_m, f_{clip})    (3)

Specifically, the multi-modal mapping network consists of L blocks, each containing a cross-attention layer and a self-attention layer. Each block is computed as follows:

f_M^l = SA^l(CA^l(P_m, f_M^{l-1}))    (4)

where l = 1, ..., L, and for l = 1 the input is f_M^{l-1} = f_{clip}. Each block first maps the multi-modal feature to an intermediate feature via the cross-attention CA^l, and then obtains f_M^l via the self-attention SA^l. In the cross-attention, P_m serves as the query, and f_M^{l-1} serves as the key and value. Thus, we iteratively extract information from the news multi-modal feature f_{clip} into the latent features and finally output the news multi-modal embedding f_M, as sketched below.
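A minimal PyTorch reading of Eqs. (3)-(4) follows. It assumes f_clip has already been projected into a short sequence of d-dimensional tokens, and the bare use of nn.MultiheadAttention (no residuals or normalization) is a simplification of details the paper does not specify. The prompt length of 16 and block count of 2 match the implementation details in Section 5.1.

```python
import torch
import torch.nn as nn

class MappingBlock(nn.Module):
    """One block of Eq. (4): cross-attention followed by self-attention."""
    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, prompts: torch.Tensor, f_prev: torch.Tensor) -> torch.Tensor:
        # P_m is the query; f_M^{l-1} is the key and value.
        x, _ = self.cross(prompts, f_prev, f_prev)
        x, _ = self.self_attn(x, x, x)
        return x

class MappingNetwork(nn.Module):
    """Map_m of Eq. (3): num_blocks blocks driven by M learnable prompts P_m."""
    def __init__(self, d: int, num_prompts: int = 16, num_blocks: int = 2):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, d))  # P_m
        self.blocks = nn.ModuleList(MappingBlock(d) for _ in range(num_blocks))

    def forward(self, f_clip: torch.Tensor) -> torch.Tensor:
        f_m = f_clip  # f_M^0 = f_clip, shape (batch, seq, d)
        p = self.prompts.expand(f_clip.size(0), -1, -1)
        for block in self.blocks:
            f_m = block(p, f_m)  # Eq. (4)
        return f_m  # news multi-modal embedding f_M
```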

Inference Decoder. To further construct the global representation of the news, we concatenate the features previously extracted by the unified-modal context encoder and the multi-modal encoder to obtain the news global feature f_N, and input it into the inference decoder:

f_N = Concat[f_P : f_V : f_M : f_T]    (5)

Additionally, existing research [Chen et al., 2022b] has shown that prompt engineering has a significant impact on model performance. Therefore, we adopt the prompt phrase "The news is <M>" as the input to the inference decoder, which echoes the prompt phrase T_p of the unified-modal context encoder. Based on these prompt phrases, we can stimulate the powerful in-context learning ability of the Flan-T5 model to infer the truthfulness of the news content.

Method | Pheme (Acc/Prec/Rec/F1) | CFND (Acc/Prec/Rec/F1) | Weibo (Acc/Prec/Rec/F1)
MVAE [Khattar et al., 2019] | 0.776/0.735/0.723/0.728 | 0.812/0.807/0.811/0.806 | 0.824/0.828/0.822/0.823
SAFE [Zhou et al., 2020] | 0.807/0.787/0.789/0.791 | 0.795/0.789/0.804/0.796 | 0.851/0.849/0.849/0.849
SpotFake [Singhal et al., 2019b] | 0.845/0.809/0.836/0.822 | 0.830/0.825/0.841/0.833 | 0.873/0.873/0.874/0.873
CAFE [Chen et al., 2022a] | 0.832/0.796/0.794/0.795 | 0.826/0.827/0.846/0.837 | 0.840/0.840/0.841/0.840
MCAN [Wu et al., 2021] | 0.861/0.830/0.840/0.835 | 0.845/0.831/0.784/0.807 | 0.899/0.899/0.899/0.899
KDIN [Sun et al., 2021] | 0.846/0.815/0.804/0.809 | 0.847/0.813/0.846/0.830 | 0.893/0.894/0.892/0.893
LIIMR [Singhal et al., 2022] | 0.870/0.848/0.831/0.839 | 0.852/0.817/0.834/0.826 | 0.900/0.882/0.823/0.847
BMR [Ying et al., 2023] | 0.884/0.872/0.840/0.855 | 0.859/0.834/0.815/0.824 | 0.918/0.912/0.909/0.910
NLIN | 0.903/0.875/0.883/0.879 | 0.874/0.848/0.841/0.844 | 0.922/0.917/0.922/0.919

Table 2: Performance comparison to state-of-the-art methods on the Pheme, CFND and Weibo datasets.

4.4 Model Optimization
We transform the generative problem into a binary classification problem by taking the output H_{<M>} of the inference decoder at the <M> token as the final feature used for classification. Specifically, we convert this embedding feature H_{<M>} through a fully-connected network layer with a sigmoid activation function to predict the probability of fake news:

\hat{y} = \sigma(W_c H_{<M>} + b_c)    (6)

where W_c and b_c are the parameters of the fully-connected network layer, and σ refers to the sigmoid function. After that, we utilize the cross-entropy loss function as the loss of the whole model:

L_p = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})    (7)
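Equations (6)-(7) correspond to the standard sigmoid head with binary cross-entropy sketched below; the hidden size and the numerically stable fused loss are assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

d_model = 768                        # assumed decoder hidden size (Flan-T5-base)
classifier = nn.Linear(d_model, 1)   # W_c and b_c of Eq. (6)
criterion = nn.BCEWithLogitsLoss()   # sigmoid fused with Eq. (7) for stability

def detection_loss(h_m: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """h_m: (batch, d_model) decoder output at the <M> position; labels in {0, 1}."""
    logits = classifier(h_m).squeeze(-1)      # y_hat = sigmoid(logits), Eq. (6)
    return criterion(logits, labels.float())  # L_p of Eq. (7)
```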
5 Experiments

5.1 Experimental Settings
Dataset. To evaluate the effectiveness of the proposed NLIN, we compare it with state-of-the-art methods on the CFND, Pheme [Zubiaga et al., 2017] and Weibo [Jin et al., 2017] datasets. In particular, the CFND dataset is described in Section 3. The Pheme dataset is constructed from tweets related to five breaking news stories on the Twitter platform, each containing labeled text and corresponding images. The Weibo dataset is from the Xinhua News Agency and the Weibo platform, and contains a large number of labeled texts and images. For the Pheme and Weibo datasets, we divide them into training, validation and testing sets according to the ratio of 6:2:2.

Implementation Details. For the unified-modal context encoder, the input lengths for both the visual text input T_v and the news text input T_t are set to a maximum of 200 words. For the multi-modal mapping network, the number of blocks is 2 and the length of the learnable prompt P_m is 16. For the other hyperparameters, we set the batch size to 16, the number of epochs to 60 and the learning rate to 5e-4, employ the Adam optimizer, and apply a learning rate warm-up strategy. We implement all models in PyTorch and experiment on a Tesla V100 GPU.

Evaluation Metrics. We utilize accuracy (Acc) as our main evaluation metric. Moreover, considering the category imbalance problem, we apply precision, recall and F1 score as supplementary evaluation metrics.

5.2 Results and Discussion
The results of the comparison are shown in Table 2. On all three datasets, our method NLIN outperforms the other comparative methods on all evaluation metrics, demonstrating its superior performance.

Among the comparative methods, MVAE and SAFE both apply multi-modal information such as image and text, but they perform worse than the other methods, which may be due to the fact that they apply Text-CNN and Bi-LSTM with weakly learned textual representations; this suggests that textual representation plays an important role in multi-modal fake news detection. In addition, KDIN achieves good performance, which suggests that the introduction of external knowledge information is effective for multi-modal fake news detection. The BMR model achieves the second-best results on all datasets, which shows that both unimodal and multi-modal perspectives on news content contribute to the detection of fake news. NLIN outperforms the other methods, demonstrating that unifying news content into the textual modality can avoid the gap between the different modal representations of news, and that utilizing the powerful in-context learning capability of the LLM to infer news authenticity enables the NLIN model to achieve stronger detection performance with fewer training parameters.

Methods | Acc | Prec | Rec | F1
w/o I-T | 0.882 | 0.851 | 0.870 | 0.860
w/o K-T | 0.886 | 0.856 | 0.873 | 0.864
w/o L-E | 0.879 | 0.849 | 0.863 | 0.856
NLIN | 0.903 | 0.875 | 0.883 | 0.879

Table 3: Ablation study of different components of NLIN's news pre-processing module on the Pheme dataset.

5.3 Ablation Studies
The ablation experiments in Table 3 investigate the effect of different components of NLIN's news pre-processing module on fake news detection performance. Specifically, w/o I-T means that the visual text input T_v is removed from the input of the unified-modal context encoder. w/o K-T refers to not transforming the news background knowledge into the textual modality, i.e., the input removes the news visual and textual background knowledge. w/o L-E means that, in addition to the contents of the image caption and OCR text, the image object labels L extracted by the VinVL model are also used to extract visual entities.

Figure 4: Four examples where the NLIN model makes correct predictions: (a) and (b) are from the Pheme dataset, and (c) and (d) are from the CFND dataset. In the visual context, C represents the image caption, DL the dense labelling, and O the OCR text.

Experimental results of these variants of NLIN illustrate that: (1) unifying the news images and background knowledge into the textual modality avoids the gap among the different modality representations of news, and news background knowledge contributes to a more comprehensive understanding of the news content by the model; (2) when there are more descriptions of irrelevant news entities, they introduce noise that corrupts the semantics of the original information, which has a greater impact on model performance than losing some of the image information.

Methods | Acc | Prec | Rec | F1
w/o Clip | 0.875 | 0.845 | 0.861 | 0.853
w/o Map | 0.895 | 0.868 | 0.880 | 0.874
w/o LoRA | 0.893 | 0.865 | 0.882 | 0.873
w/o Prompt | 0.845 | 0.810 | 0.840 | 0.825
NLIN | 0.903 | 0.875 | 0.883 | 0.879

Table 4: Ablation study of different components of NLIN's multi-modal feature reasoning module on the Pheme dataset.

The ablation experiments in Table 4 examine the impact of NLIN's multi-modal feature reasoning module on the performance of fake news detection. Specifically, w/o Clip refers to not using CLIP to supplement the multi-modal feature. w/o Map means not mapping the news feature f_{clip} to the text space with the multi-modal mapping network. w/o LoRA refers to not using LoRA to fine-tune the pre-trained language model. w/o Prompt represents the non-use of prompt phrases; instead, the model applies the text generation pattern directly to produce words such as "true" or "false". These variants perform significantly worse than the original NLIN, and w/o Prompt performs the worst among them. This indicates that employing prompt phrases can significantly tap into the potential of the pre-trained Flan-T5 model, while utilizing the CLIP model to extract multi-modal features can effectively complement the raw information present in the news content.

5.4 Visualization Results
Figure 4 gives four examples of correct recognition from the Pheme and CFND datasets and shows the extracted visual context, visual background knowledge and textual background knowledge. We can observe that the visual context comprehensively presents the global and local semantic information embedded in the image and effectively provides background knowledge for the news. For instance, in example (a), the news content is relatively obscure, but through the descriptions of entities such as 'jihadist' and 'sydney', the model can fully understand the news content, which in turn facilitates recognizing fake news. In addition, referring to example (b), by transforming the image and news background knowledge into the textual modality, the discrepancy between the image and textual content is strengthened, i.e., the inconsistency between the image and text is more clearly reflected, which undoubtedly benefits fake news detection.

6 Conclusion
In this paper, we propose a novel Natural Language-centered Inference Network (NLIN) for multi-modal fake news detection. The proposed NLIN not only radically handles the huge gap among the different modality representations of news, but also introduces an encoder-decoder architecture equipped with prompt phrases for fully comprehending the news in-context and inferring its authenticity. In addition, we produce a challenging large-scale, multi-platform, multi-domain multi-modal Chinese Fake News Detection (CFND) dataset. Extensive experiments show that our dataset is challenging and our NLIN outperforms the state-of-the-art methods.


Acknowledgements
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants 62106245, 62225207 and U19B2038.

References
[Abdelnabi et al., 2022] Sahar Abdelnabi, Rakibul Hasan, and Mario Fritz. Open-domain, content-based, multi-modal fact-checking of out-of-context images via online resources. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14940–14949, 2022.
[Boididou et al., 2018] Christina Boididou, Symeon Papadopoulos, Markos Zampoglou, Lazaros Apostolidis, Olga Papadopoulou, and Yiannis Kompatsiaris. Detection and visualization of misleading content on twitter. International Journal of Multimedia Information Retrieval, 7(1):71–86, 2018.
[Chen et al., 2022a] Yixuan Chen, Dongsheng Li, Peng Zhang, Jie Sui, Qin Lv, Lu Tun, and Li Shang. Cross-modal ambiguity learning for multimodal fake news detection. In Proceedings of the ACM Web Conference 2022, pages 2897–2905, 2022.
[Chen et al., 2022b] Zhuo Chen, Yufeng Huang, Jiaoyan Chen, Yuxia Geng, Yin Fang, Jeff Z Pan, Ningyu Zhang, and Wen Zhang. Lako: Knowledge-driven visual question answering via late knowledge-to-text injection. In Proceedings of the 11th International Joint Conference on Knowledge Graphs, pages 20–29, 2022.
[Chung et al., 2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
[EasyOCR, ] EasyOCR. https://github.com/JaidedAI/EasyOCR.
[Han et al., 2021] Yi Han, Amila Silva, Ling Luo, Shanika Karunasekera, and Christopher Leckie. Knowledge enhanced multi-modal fake news detection. arXiv preprint arXiv:2108.04418, 2021.
[Hu et al., 2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[Jin et al., 2017] Zhiwei Jin, Juan Cao, Han Guo, Yongdong Zhang, and Jiebo Luo. Multimodal fusion with recurrent neural networks for rumor detection on microblogs. In Proceedings of the 25th ACM International Conference on Multimedia, pages 795–816, 2017.
[Khattar et al., 2019] Dhruv Khattar, Jaipal Singh Goud, Manish Gupta, and Vasudeva Varma. Mvae: Multimodal variational autoencoder for fake news detection. In The World Wide Web Conference, pages 2915–2921, 2019.
[Li et al., 2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[Liu et al., 2023] Jiawei Liu, Jingyi Xie, Yang Wang, and Zheng-Jun Zha. Adaptive texture and spectrum clue mining for generalizable face forgery detection. IEEE Transactions on Information Forensics and Security, 2023.
[Ma et al., 2015] Jing Ma, Wei Gao, Zhongyu Wei, Yueming Lu, and Kam-Fai Wong. Detect rumors using time series of social context information on microblogging websites. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1751–1754, 2015.
[Nan et al., 2021] Qiong Nan, Juan Cao, Yongchun Zhu, Yanyan Wang, and Jintao Li. Mdfend: Multi-domain fake news detection. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 3343–3347, 2021.
[Narayan et al., 2022] Kartik Narayan, Harsh Agarwal, Surbhi Mittal, Kartik Thakral, Suman Kundu, Mayank Vatsa, and Richa Singh. Desi: Deepfake source identifier for social media. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2858–2867, 2022.
[Pérez-Rosas et al., 2017] Verónica Pérez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. Automatic detection of fake news. arXiv preprint arXiv:1708.07104, 2017.
[Radford et al., 2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[Shu et al., 2020] Kai Shu, Deepak Mahudeswaran, Suhang Wang, Dongwon Lee, and Huan Liu. Fakenewsnet: A data repository with news content, social context, and spatiotemporal information for studying fake news on social media. Big Data, 8(3):171–188, 2020.
[Singhal et al., 2019a] Shivangi Singhal, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin'ichi Satoh. Spotfake: A multi-modal framework for fake news detection. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pages 39–47. IEEE, 2019.
[Singhal et al., 2019b] Shivangi Singhal, Rajiv Ratn Shah, Tanmoy Chakraborty, Ponnurangam Kumaraguru, and Shin'ichi Satoh. Spotfake: A multi-modal framework for fake news detection. In 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM), pages 39–47, 2019.
[Singhal et al., 2022] Shivangi Singhal, Tanisha Pandey, Saksham Mrig, Rajiv Ratn Shah, and Ponnurangam Kumaraguru. Leveraging intra and inter modality relationship for multimodal fake news detection. In Companion Proceedings of the Web Conference 2022, pages 726–734, 2022.
[Sun et al., 2021] Mengzhu Sun, Xi Zhang, Jianqiang Ma, and Yazheng Liu. Inconsistency matters: A knowledge-guided dual-inconsistency network for multi-modal rumor detection. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1412–1423, 2021.
[Tseng et al., 2022] Yu-Wun Tseng, Hui-Kuo Yang, Wei-Yao Wang, and Wen-Chih Peng. Kahan: Knowledge-aware hierarchical attention network for fake news detection on social media. In Companion Proceedings of the Web Conference 2022, pages 868–875, 2022.
[Vrandečić and Krötzsch, 2014] Denny Vrandečić and Markus Krötzsch. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85, 2014.
[Wang et al., 2021] Yaqing Wang, Fenglong Ma, Haoyu Wang, Kishlay Jha, and Jing Gao. Multimodal emergent fake news detection via meta neural process networks. In Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3708–3716, 2021.
[Wu et al., 2015] Ke Wu, Song Yang, and Kenny Q Zhu. False rumors detection on sina weibo by propagation structures. In 2015 IEEE 31st International Conference on Data Engineering, pages 651–662. IEEE, 2015.
[Wu et al., 2021] Yang Wu, Pengwei Zhan, Yunjian Zhang, Liming Wang, and Zhen Xu. Multimodal fusion with co-attention networks for fake news detection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2560–2569, 2021.
[Ying et al., 2023] Qichao Ying, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge. Bootstrapping multi-view representations for fake news detection. 2023.
[Zhang et al., 2021] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588, 2021.
[Zhang et al., 2023a] Fanrui Zhang, Jiawei Liu, Qiang Zhang, Esther Sun, Jingyi Xie, and Zheng-Jun Zha. Ecenet: Explainable and context-enhanced network for muti-modal fact verification. In Proceedings of the 31st ACM International Conference on Multimedia, pages 1231–1240, 2023.
[Zhang et al., 2023b] Qiang Zhang, Jiawei Liu, Fanrui Zhang, Jingyi Xie, and Zheng-Jun Zha. Hierarchical semantic enhancement network for multimodal fake news detection. In Proceedings of the 31st ACM International Conference on Multimedia, pages 3424–3433, 2023.
[Zhou et al., 2020] Xinyi Zhou, Jindi Wu, and Reza Zafarani. Safe: Similarity-aware multi-modal fake news detection. In Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11–14, 2020, Proceedings, Part II, pages 354–367. Springer, 2020.
[Zhou et al., 2022] Yangming Zhou, Qichao Ying, Zhenxing Qian, Sheng Li, and Xinpeng Zhang. Multimodal fake news detection via clip-guided learning. arXiv preprint arXiv:2205.14304, 2022.
[Zubiaga et al., 2017] Arkaitz Zubiaga, Maria Liakata, and Rob Procter. Exploiting context for rumour detection in social media. In Social Informatics: 9th International Conference, SocInfo 2017, Oxford, UK, September 13–15, 2017, Proceedings, Part I, pages 109–123. Springer, 2017.
