
Information Systems 123 (2024) 102378

An inter-modal attention-based deep learning framework using unified modality for multimodal fake news, hate speech and offensive language detection

Eniafe Festus Ayetiran a,b,∗, Özlem Özgöbek a

a Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway
b Department of Computer Science, Achievers University, Owo, Nigeria

ARTICLE INFO

Keywords: Inter-modal attention, Unified modality, Multimodal fusion, BiLSTM-CNN, Multimodal content understanding, Fake news, Hate speech, Offensive language

ABSTRACT

Fake news, hate speech and offensive language are related evil triplets currently affecting modern societies. The text modality has been widely used for the computational detection of these phenomena. In recent times, multimodal studies in this direction have attracted a lot of interest because of the potential offered by other modalities in contributing to the detection of these menaces. However, a major problem in multimodal content understanding is how to effectively model the complementarity of the different modalities, given their diverse characteristics and features. From a multimodal point of view, the three tasks have been studied mainly using the image and text modalities. Improving the effectiveness of the diverse multimodal approaches is still an open research topic. In addition to the traditional text and image modalities, we consider image–texts, which are rarely used in previous studies but which contain useful information for enhancing the effectiveness of a prediction model. In order to ease multimodal content understanding and enhance prediction, we leverage recent advances in computer vision and deep learning for these tasks. First, we unify the modalities by creating a text representation of the images and image–texts, in addition to the main text. Secondly, we propose a multi-layer deep neural network with an inter-modal attention mechanism to model the complementarity among these modalities. We conduct extensive experiments involving three standard datasets covering the three tasks. Experimental results show that the detection of fake news, hate speech and offensive language can benefit from this approach. Furthermore, we conduct robust ablation experiments to show the effectiveness of our approach. Our model predominantly outperforms prior works across the datasets.

1. Introduction

The exponential growth of the world wide web (WWW) and social media has fueled the menaces of fake news, hate speech and offensive language in recent times. Fake news is a form of disinformation, fabricated to deceive readers into believing it is real by imitating mainstream news. The main goal of any form of disinformation is to intentionally mislead people through the creation and spread of false information. In some cases, fake news takes genuine parts of mainstream news and modifies them by injecting some form of falsehood into them. Such modification and injection affect not only the text modality but also images. The main difference between disinformation and misinformation is that in disinformation the piece of information is deliberately created to mislead people, while misinformation is an unintentional propagation of false information. Hate speech is more complex to define because what constitutes hate is relative and differs across jurisdictions. However, in order to provide a general perspective, the United Nations Strategy and Plan of Action on Hate Speech1 defines hate speech as ‘‘any kind of communication in speech, writing or behavior, that attacks or uses pejorative or discriminatory language with reference to a person or a group on the basis of who they are, in other words, based on their religion, ethnicity, nationality, race, color, descent, gender or other identity factor’’. Hate speech and offensive language share a common characteristic as a form of abuse or attack but are different in a sense, though some works in the literature use the two terms interchangeably. Offensive language is simply a statement which upsets another person. Hence, hate speech is considered more severe, as it may lead to extreme action(s) and constitute severe harm to the target. Furthermore, the common effect of the three menaces of fake news, hate speech and offensive language is emotional harm to their targets.

∗ Corresponding author.
E-mail addresses: [email protected] (E.F. Ayetiran), [email protected] (Ö. Özgöbek).
1 https://www.un.org/en/hate-speech/understanding-hate-speech/what-is-hate-speech (accessed 14th March, 2023).

https://doi.org/10.1016/j.is.2024.102378
Received 9 June 2023; Received in revised form 29 November 2023; Accepted 14 March 2024
Available online 16 March 2024
0306-4379/© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).

Giachanou and Rosso [1] identify this nexus among the phenomena and refer to them as harmful information. A greater percentage of fake news, hate speech and offensive language happens online, particularly on social media. These three tasks have been studied individually. However, there are connections among them, as one may lead to the other. For instance, fake news may lead to hate speech or offensive language. On the other hand, hatred or animosity towards someone or a community of people may make them a target of fake news. In most cases, this is especially applicable to public figures or celebrities. In fact, some fake news doubles as hate speech and/or offensive language. The European Foundation for South Asian Studies identifies and discusses some roles played by fake news in promoting hate speech and extremism online [2]. Both the actions and reactions pertaining to fake news, hate speech and offensive language are strongly tied to intention and harm. Till now, the detection of the three phenomena in digital contents has mostly been studied individually with different approaches. Early studies focused mainly on the text modality, but as digital media and repositories continue to grow in multimedia contents (i.e. text, videos, images and audio), the interest in researching the prospects of these modalities has risen over the years. Besides text, image is the most commonly used because it is the next most readily available in online digital contents. With the aim of exploring the prospects of other modalities, studies based on multimodality are gaining ground [3–8] but are still under-explored [9]. Multimodal content understanding aims at recognizing and localizing objects, determining the attributes of entities, characterizing the relationships among entities and describing the common semantic features among different modalities [10]. Chen et al. [10] specifically identified two major gaps in deep multimodal content understanding. First is the ‘‘heterogeneity gap’’, which arises as a result of the differences and uniqueness of the features of images and texts. These characteristics are directly related to the second challenge, the ‘‘semantic gap’’, caused by the peculiarities of individual modalities, thereby leading to different abstract representations. Extracted features from individual modalities are not directly comparable and are inconsistently distributed. Techniques to mitigate these problems rely mainly on embedding individual unimodal features into a common latent space with the help of mapping functions in order to make them comparable and consistent. At present, these techniques still do not fully resolve these problems and advances are still being explored. In view of these current challenges, and inspired by advances in computer vision, precisely image captioning and Optical Character Recognition (OCR), we develop a unified modality-based deep learning framework which presents the advantage of direct comparison and consistency across modalities. Image captioning and OCR enable the unification of the modalities for comparison and consistency. Our deep learning framework comprises a Bidirectional Long Short-Term Memory (BiLSTM) layer [11] and a Convolutional Neural Network (CNN) layer with an inter-modal attention mechanism, among other layers/modules. The model can be trained on datasets involving any language with appropriate preprocessing. The important contributions of this paper are as follows:

• We develop a unified modality for multimedia contents to resolve the barriers in multimodal content understanding
• We propose an inter-modal attention mechanism for complementarity among modalities in order to improve multimodal content understanding
• We develop a deep learning framework based on the inter-modal attention mechanism for fake news, hate speech and offensive language detection
• Our deep learning framework achieves state-of-the-art performance on three benchmark datasets covering the three tasks.

The rest of the paper is structured as follows: Section 2 reviews the relevant related works covering the three tasks. Section 3 describes the deep learning framework, while Section 4 discusses the experiments and model implementation. In Section 5, we discuss the model evaluation and results. Section 6 concludes the paper.

2. Related works

Unimodal content understanding has been widely studied for a wide range of tasks. On the other hand, studies on multimodal content understanding are currently limited [9,12], with some inherent challenges [13,14]. Some of the challenges, as discussed in Section 1, include the heterogeneity and semantic gaps. In the following subsections, we discuss prior multimodal works on fake news, hate speech and offensive language detection.

2.1. Multimodal fake news detection

One of the earliest works on multimodal fake news detection is the work of Wang et al. [15]. They proposed an end-to-end framework based on neural networks named EANN, comprising three primary components: a multimodal feature extractor, a fake news detector and an event discriminator. The multimodal feature extractor comprises two sub-components: a text features extractor and a visual features extractor. The components cooperate for the task of multimodal fake news detection. Experiments on two benchmark datasets show performance improvements over baselines. Khattar et al. [3] proposed MVAE, a work similar to EANN which, as the name suggests, used a variational autoencoder for classifying multimodal news contents as real or fake. The components of MVAE are an encoder, a decoder and a fake news detector. The encoder and the decoder each comprise a text and a visual extractor. While the encoder encodes the multimodal inputs and outputs a shared representation of learnt features as latent vectors, the decoder reconstructs the latent vectors. The encoded representations serve as inputs to the decoder and the fake news detector component. The fake news detector classifies news content based on the encoded representations and the sum of the reconstruction and Kullback–Leibler divergence losses. The evaluation of MVAE was carried out on the same datasets as EANN, with significant performance improvements over EANN and other baselines. With the goal of achieving a pure classifier without any subtask, SpotFake [16] employs pretrained transformers to incorporate contextualized information and image recognition into multimodal fake news classification. Precisely, they employ Bidirectional Encoder Representations from Transformers (BERT) [17] and VGG-19 [18] to extract textual and visual features, which were fused for the classification. The evaluation of SpotFake on the same datasets as EANN and MVAE shows that it outperforms both on only one of the datasets. In order to help identify fake news based on irrelevant images in news content, Zhou et al. [5] introduced SAFE. For textual and visual feature extraction, SAFE extended a method based on a convolutional neural network (CNN). Images are first processed as text using image captioning. The main crux of SAFE is the computation of similarity between text and image features, which is used to optimize the model learning parameters. Giachanou et al. [19] combine textual, visual and semantic information for fake news detection using a neural network classifier. Textual features include embeddings of posts and sentiments, while visual features comprise image tags and local binary patterns (LBP). The model was evaluated on three datasets. In a follow-up work, Giachanou et al. [20] extended the visual features to include multi-image information and, in contrast to their earlier work, use BERT [17] and VGG-16 [18] for the extraction of text and image features respectively. The main underlying idea in both works is the computation of semantic similarity between textual and visual features. The Multimodal Consistency Neural Network (MCNN) [21] is a network-based approach which consists of five subnetworks, namely: a text feature extraction module, a visual semantic feature extraction module, a visual tampering feature extraction module, a similarity measurement module and a multimodal fusion module. MCNN experiments on four datasets show improvements over baselines. In another work, Multimodal Fusion with Co-attention Networks (MCAN) [22] was proposed. MCAN includes a co-attention block, a co-attention layer and multiple co-attention stacking on spatial-domain, frequency-domain


and textual features. Experimental evaluation of MCAN on two domain datasets improves over baselines. A cross-modal ambiguity learning model (named CAFE) was proposed by Chen et al. [23]. CAFE comprises three modules, namely: a cross-modal alignment module, a cross-modal ambiguity learning module and a cross-modal fusion module. The main goal of CAFE is adaptive aggregation of unimodal features and cross-modal correlations. The evaluation of CAFE was carried out on two benchmark datasets, with improvements over baselines. Zhang et al. [24] proposed a model (named SceneFND) with a different approach from prior works by incorporating contextual scene information in addition to textual and visual information. The scene features were obtained from the images by calculating the probabilities of each scene category with different scene recognition methods. They presented results for several variants of the model and SpotFake [16]. Similar to some of the previous works, TRIMOON [25] uses BERT and VGG-19 to extract text and image features respectively, followed by a fusion module. The fusion module consists of two co-attention blocks and a gate-based fusion component. Experiments on two real-world datasets show improvements over baselines.

2.2. Multimodal hate speech detection

The work of Hosseinmardi et al. [26] is one of the earliest works on multimodal hate speech detection, in which they focused on cyberbullying using both textual and image features. Their model is based on a logistic regression classifier trained with a forward feature selection method. They experimented on a dataset collected from Instagram for the purpose of validating the model. Automated hate speech detection was explored by Yang et al. [27] with modalities involving text and image. They experimented with quite a number of multimodal fusion approaches, including concatenation and addition with an attention mechanism. Evaluation reports on the experiments did not show any tangible gain in fusing the two modalities. As part of the hateful memes challenge competition, Kiela et al. [28] developed a dataset of multimodal memes for the task of identifying whether the memes are hateful or not. They presented a number of models evaluated based on defined benchmarks. What can be referred to as a truly standard benchmark dataset for multimodal hate speech classification was developed by Gomez et al. [6], which they named MMHS150K. It was collected from Twitter and annotated on a large scale. In contrast to other datasets, it also leverages image–texts in addition to the main text and image modalities. They experimented widely with diverse models and, similar to Kiela et al. [28], also reported that multimodality did not result in tangible gain when compared with using a single or two modalities. Maity et al. [29] introduce a model for detecting cyberbullying in multimodal memes, taking into account sentiment, emotion and sarcasm. This led to the development of a dataset on which the model was evaluated. A recent work by Yang et al. [30] explores transfer learning for hate speech detection. The authors opine that there is a high correlation between hate speech and sarcasm and therefore designate them as primary and auxiliary tasks for the purpose of cross-task transfer learning. The model consists mainly of adaptation modules, namely: semantic, definition and domain adaptation modules. A joint objective for the modules is optimized for learning the parameters. Experiments show the efficacy of the approach across benchmark datasets. Besides traditional hate speech, research on misogyny detection using multimodal contents is now generating interest [31,32]. Misogyny is a type of hate targeted at women. Fersini et al. [31] specifically organized a task on this problem using multimedia contents, while Rizzi et al. [32] proposed to answer some open questions on the topic, which include but are not limited to determining which modality contributes most to misogyny detection.

2.3. Multimodal offensive language detection

To the best of our knowledge, the work of [4] is the first work to experiment on truly multimedia contents for offensive language detection. They developed a dataset (MultiOFF) for this purpose using an existing meme data collection and experimented with different known classifiers. In their study, multimodal experiments show very little improvement over unimodal experiments when the same algorithms are used. Curiously, some unimodal experiments with different algorithms outperform multimodal experiments. Lee et al. [33] proposed a method called DisMultiHate to disentangle target entities in multimodal memes for hate detection. Their proposed method consists of three modules: data pre-processing, text representation learning and visual representation learning modules. DisMultiHate uses a regression layer to generate the probability of a multimedia content being hate or not, and experimental evaluation of the method on MultiOFF improved performance over compared baselines. Pramanick et al. [34] developed a framework (MOMENTA) for the detection of harmful memes and the target entities. It uses Google’s Vision API to extract image–texts. The extracted text and images are then encoded with a pre-trained visual-linguistic model and VGG-19 respectively. A key component of MOMENTA is intra-modal and cross-modal attention fusion. It outperforms the majority of the baselines. MeBERT is another work [35] that uses an external knowledge base to enhance semantic representation for meme classification. It fuses texts and images based on an attention mechanism for the classification task. Experiments on two public datasets show the effectiveness of the method. A recent work on multimodal offensive language is MemeFier [8], a deep learning framework for classifying offensive memes. It incorporates external knowledge into feature encoding. A key component of MemeFier is alignment-aware fusion of modalities. Experiments on three datasets reveal that MemeFier outperforms baselines on two of the three datasets.

3. Methodology

We define the problem and describe the unified framework for multimodal content classification for fake news, hate speech and offensive language. The general architecture of the unified deep learning framework is presented in Fig. 1. It consists of a modality unification module, an embedding layer, a BiLSTM layer, a CNN layer, an inter-modal attention module, a fusion module and a prediction module (a dense layer with sigmoid activation).

3.1. Problem formulation

Let M denote a multimedia data sample comprising a text T, an image X and an image–text Y, belonging to a binary class C. Given a set of multimedia data M_i^k, for each m_i ∈ M_i^k comprising a text t_i, an image x_i and an image–text y_i, the problem is to determine the class c_i ∈ C_i^{n=2} to which m_i belongs, where C_i^{n=2} is a set of predefined binary classes. We adapt this formulation to fake news, hate speech and offensive language detection, where m_i is a news article, a hateful or non-hateful content, or an offensive or non-offensive content respectively. Furthermore, c_i represents the fake or real, hateful or non-hateful, and offensive or non-offensive classes for the three tasks respectively.

3.2. Modality unification module

For each sample m_i in a multimedia content, we obtain the caption x of the image and the image–text y. We use LAVIS [36], a deep learning library for LAnguage-and-VISion intelligence research and applications, to retrieve image captions. LAVIS consists of over thirty state-of-the-art language–vision models, including but not limited to Contrastive Language-Image Pre-training (CLIP) [37] and Bootstrapping Language-Image Pre-training (BLIP) for Unified Vision-Language


Fig. 1. Architecture of the unified inter-modal attention framework.

Understanding and Generation [38]. Sample images with captions on top are presented in Fig. 2. We use EasyOCR2 to retrieve texts inserted within the images. Sample images with inserted texts are shown in Fig. 3. We therefore have the text representations for the original texts, images and image–texts, denoted t, x and y respectively. EasyOCR and LAVIS have been chosen for OCR and image captioning respectively because of their state-of-the-art efficacy and easy-to-use Application Programming Interfaces (APIs). In the few cases where the outputs of the OCR are not 100% perfect, the results are still useful, as some of the recognized texts are still accurate. The same situation applies to the generated image captions. When the captions do not fully describe the scene, they still capture it to a reasonable extent, useful for understanding the contents.

Fig. 2. Sample images with captions on top.

Fig. 3. Sample images with inserted texts.

2 https://github.com/JaidedAI/EasyOCR
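As an illustration of this unification step, the sketch below retrieves a caption with LAVIS (BLIP) and the embedded image–text with EasyOCR. It is a minimal example, not the authors' code: the choice of the BLIP captioning checkpoint, the English-only OCR reader and the helper function are assumptions of the sketch.

```python
# Minimal sketch of the modality unification step (assumptions: BLIP captioning
# checkpoint from the LAVIS model zoo, English-language OCR, illustrative paths).
import torch
from PIL import Image
import easyocr
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Image captioning model (LAVIS/BLIP) and OCR reader (EasyOCR).
caption_model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)
ocr_reader = easyocr.Reader(["en"])

def unify_modalities(image_path: str, main_text: str) -> dict:
    """Return the three text views of one sample: original text, caption, image-text."""
    raw_image = Image.open(image_path).convert("RGB")
    image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
    caption = caption_model.generate({"image": image})[0]             # image -> caption text
    image_text = " ".join(ocr_reader.readtext(image_path, detail=0))  # text inserted in the image
    return {"text": main_text, "caption": caption, "image_text": image_text}
```

In this way every sample is reduced to three comparable text strings before embedding.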


3.3. Embedding module

For each word in the unified modality, that is w ∈ {t, x, y}, we obtain its embedding e_w from an embedding matrix E ∈ R^{V×d}, where V is the vocabulary size of the embedding matrix and d the dimension. Specifically, Eqs. (1) to (3) present the word embeddings for a text in each modality as follows:

e_t = {e_{w_1}, e_{w_2}, ..., e_{w_n}}    (1)

e_x = {e_{w_1}, e_{w_2}, ..., e_{w_n}}    (2)

e_y = {e_{w_1}, e_{w_2}, ..., e_{w_n}}    (3)

where n is the number of words in each of t, x and y. The embedded texts are fed to the Bidirectional Long Short-Term Memory (BiLSTM) layer.

3.4. Bidirectional Long Short-Term Memory (BiLSTM) layer

The original Long Short-Term Memory (LSTM) [39] was developed to address the exploding and vanishing gradient problems in feed-forward neural networks. The LSTM architecture comprises three gates: an input gate i_t, a forget gate f_t and an output gate o_t. It also has a memory cell c_t with the capability to learn long-term dependencies in sequences, and a hidden state h_t. The transition equations of the LSTM are presented in Eqs. (4) to (8):

i_t = σ(w_i [h_{t-1}, x_t] + b_i)    (4)

f_t = σ(w_f [h_{t-1}, x_t] + b_f)    (5)

o_t = σ(w_o [h_{t-1}, x_t] + b_o)    (6)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(w_c [h_{t-1}, x_t] + b_c)    (7)

h_t = o_t ⊙ tanh(c_t)    (8)

where w_i, w_f and w_o are the weights of the neurons, and b_i, b_f and b_o are the biases to be learned during training. σ denotes a logistic sigmoid function and ⊙ denotes element-wise multiplication. tanh() is a hyperbolic tangent function. The Bidirectional LSTM [11] is a variant of the conventional LSTM which consists of two LSTMs that are run forward and backward simultaneously on the input sequence. The backward LSTM is used to capture the past contextual information while the forward LSTM is used to capture future contextual information. The BiLSTM is used to capture the sequential and contextual information in the input sequences. The outputs are hidden representations of the inputs, as presented in Eq. (9):

[h_1, ..., h_n] = BiLSTM([e_{w_1}, ..., e_{w_n}], θ_bilstm)    (9)

where the input sequences e_w are the embeddings from the embedding layer and θ_bilstm is a trainable parameter. The final hidden state is therefore obtained using Eq. (10):

h_t = μ(h→_t, h←_t)    (10)

where μ is the average of the hidden states of the forward and the backward LSTMs.
to address the exploding and vanishing gradient problems in feed- for each possible window of words 𝑗 in a text (𝑡𝑖 , 𝑥𝑖 and 𝑦𝑖 for the three
forward neural networks. The LSTM architecture comprises three gates; modalities), the filter is applied to produce a set of feature vector for a
an input gate 𝑖𝑡 , a forget gate 𝑓𝑡 and an output gate 𝑜𝑡 . It also has text in each modality as presented in Eqs. (12) to (14):
a memory cell 𝑐𝑡 with capability to learn long-term dependencies in
sequences and a hidden state ℎ𝑡 . The transition equations of the LSTM 𝑔𝑡 = {𝑔1 , 𝑔2 , … ., 𝑔𝑛−𝑗+1 } (12)
are presented in Eqs. (4) to (8):
( [ ] ) 𝑔𝑥 = {𝑔1 , 𝑔2 , … ., 𝑔𝑛−𝑗+1 } (13)
𝑖𝑡 = 𝜎 𝑤𝑖 ℎ𝑡−1 , 𝑥𝑡 + 𝑏𝑖 (4)

( [ ] ) 𝑔𝑦 = {𝑔1 , 𝑔2 , … ., 𝑔𝑛−𝑗+1 } (14)


𝑓𝑡 = 𝜎 𝑤𝑓 ℎ𝑡−1 , 𝑥𝑡 + 𝑏𝑓 (5)

( [ ) 3.6. Inter-modal attention module


]
𝑜𝑡 = 𝜎 𝑤𝑜 ℎ𝑡−1 , 𝑥𝑡 + 𝑏𝑜 (6)
Given the set of feature maps of words for each modal texts 𝑡𝑖 , 𝑥𝑖
( [ ] ) and 𝑦𝑖 , we propose an inter-modal attention layer to assign weights to
𝑐𝑡 = 𝑓𝑡 ⊙ 𝑐𝑡−1 + 𝑖𝑡 ⊙ tanh 𝑤𝑐 ℎ𝑡−1 , 𝑥𝑡 + 𝑏𝑐 (7) each word in the text. This attention mechanism is based on the variant
proposed by Luong et al. [42] and applied in [43]. At a time, we take
( )
ℎ𝑡 = 𝑜𝑡 ⊙ tanh 𝑐𝑡 (8) either feature map 𝑔𝑡 , 𝑔𝑥 or 𝑔𝑦 as the source and another text of different
modality as the target to produce an alignment vector 𝑎. For instance,
where 𝑤𝑖 , 𝑤𝑓 and 𝑤𝑜 are the weights of the neurons. 𝑏𝑖 , 𝑏𝑓 and taking 𝑔𝑥 as the source and 𝑔𝑡 as the target, the alignment vector 𝑎𝑡 (𝑥)
𝑏𝑜 are the biases to be learned during training. 𝜎 denotes a logistic is given by Eq. (15):
sigmoid function and ⊙ denotes element-wise multiplication. tanh() is a ( ( ))
hyperbolic tangent function. Bidirectional LSTM [11] is a variant of the 𝑒𝑥𝑝 𝑠𝑐𝑜𝑟𝑒 𝑔𝑡𝐓 , 𝑔𝑥
conventional LSTM which consists of two LSTMs that are run forward 𝑎𝑡 (𝑥) = ∑ ( ( 𝐓 )) (15)
and backward simultaneously on the input sequence. The backward 𝑥′ 𝑒𝑥𝑝 𝑠𝑐𝑜𝑟𝑒 𝑔𝑡 , 𝑔𝑥′

LSTM is used to capture the past contextual information while the where 𝑠𝑐𝑜𝑟𝑒 is a function which computes the semantic relationship
forward LSTM is used to capture future contextual information. The among words in the source and target; dot product in this case. The re-
BiLSTM is used to capture the sequential and contextual information sulting alignment vector is passed through a softmax activation to pre-
in the input sequences. The outputs are hidden representations of the dict the probabilities (attention weights) 𝛼 of each word. The weights
inputs as presented in Eq. (9): are given by Eq. (16):
( )
[ℎ1 , … ., ℎ𝑛 ] = 𝐵𝑖𝐿𝑆𝑇 𝑀([𝑒𝑤1 , … ., 𝑒𝑤𝑛 ], 𝜃𝑏𝑖𝑙𝑠𝑡𝑚 ) (9) 𝛼𝑡 (𝑥) = 𝑠𝑜𝑓 𝑡𝑚𝑎𝑥 𝑎𝑡 (𝑥) (16)


When the three modalities are considered, the final attention weights for a modality are obtained by computing the average [44] of the attention weights obtained from using the other two modalities as targets. For instance, the weight α_t(x, y) is the average of α_t(x) and α_t(y), given by Eq. (17):

α_t(x, y) = (1/2) (α_t(x) + α_t(y))    (17)

If only two modalities are involved or considered, the weighted representation of a modality is computed by simply using one of the two modalities as the source and the other as the target, and vice versa.
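To make the mechanism more concrete, the sketch below computes the attention weights of Eqs. (15)–(17) for the text modality, with the caption and image–text feature maps as targets. It is an illustrative PyTorch reading of the equations rather than the authors' implementation: the dot-product score follows the text, but the reduction over source positions inside the helper is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def inter_modal_weights(g_t: torch.Tensor, g_x: torch.Tensor, g_y: torch.Tensor) -> torch.Tensor:
    """g_t, g_x, g_y: (batch, n, d) CNN feature maps for text, caption and image-text.
    Returns alpha_t(x, y): (batch, n) attention weights over the text words."""

    def weights(source: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Dot-product score between every target word and every source word.
        scores = torch.bmm(target, source.transpose(1, 2))   # (batch, n_target, n_source)
        # Eq. (15): normalise the scores over the source positions.
        align = F.softmax(scores, dim=-1)
        # Collapse the source dimension to one relevance value per target word
        # (this max reduction is an assumption of the sketch, not stated in the paper).
        relevance = align.max(dim=-1).values                  # (batch, n_target)
        # Eq. (16): softmax over the target words gives per-word attention weights.
        return F.softmax(relevance, dim=-1)

    alpha_tx = weights(g_x, g_t)          # caption as source, text as target
    alpha_ty = weights(g_y, g_t)          # image-text as source, text as target
    return 0.5 * (alpha_tx + alpha_ty)    # Eq. (17): average over the two targets
```

The same routine, with the roles of the modalities exchanged, yields the weights for the caption and image–text streams.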
3.7. Fusion module

The final attention-weighted vector representation of a text t is presented in Eq. (18):

v_t = α_t g_t    (18)

The same weighted representation is applicable to the image and the image–text, denoted v_x and v_y respectively. To obtain a multimodal representation of a multimedia content, the weighted multimodal representation v_m is derived through concatenation of the individual weighted representations v_t, v_x and v_y, as shown in Eq. (19):

v_m = v_t ⊕ v_x ⊕ v_y    (19)

where v_t, v_x and v_y are the weighted vectors for the original text, the caption and the image–text respectively.

3.8. Prediction module

The resulting multimodal fused representation from the fusion module is fed into an output layer to predict the probability p of a data sample of a multimedia content belonging to a particular class, as shown in Eq. (20):

p = σ(v_m)    (20)

where σ is a sigmoid activation function. The objective loss function which the model seeks to minimize is a cross-entropy function given by Eq. (21), with specific application to our binary classification problem as given in Eq. (22):

L = − Σ_{i=1}^{c} y_i log p_i    (21)

L_b = −(y_i log(p_i) + (1 − y_i) log(1 − p_i))    (22)

where y is the true class label and p is the predicted probability of a data sample belonging to a particular category between the binary categories.
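Read together, Eqs. (18)–(22) amount to weighting each modality's feature map by its attention vector, concatenating the three weighted vectors and passing the result through a sigmoid output trained with binary cross-entropy. A compact PyTorch sketch of such a head is shown below; pooling the weighted feature maps over word positions and the layer size are assumptions of the sketch, not details stated in the paper.

```python
import torch
import torch.nn as nn

class FusionPredictionHead(nn.Module):
    """Attention-weighted fusion (Eqs. (18)-(19)) followed by a sigmoid
    prediction layer trained with binary cross-entropy (Eqs. (20)-(22))."""

    def __init__(self, feat_dim: int = 300):
        super().__init__()
        self.out = nn.Linear(3 * feat_dim, 1)   # dense layer over the fused vector
        self.loss_fn = nn.BCEWithLogitsLoss()   # numerically stable sigmoid + Eq. (22)

    def forward(self, g_t, g_x, g_y, alpha_t, alpha_x, alpha_y, labels=None):
        # v_m = (alpha_t * g_t) concat (alpha_x * g_x) concat (alpha_y * g_y),
        # with the weighted word vectors pooled into one vector per modality.
        v_t = (alpha_t.unsqueeze(-1) * g_t).sum(dim=1)
        v_x = (alpha_x.unsqueeze(-1) * g_x).sum(dim=1)
        v_y = (alpha_y.unsqueeze(-1) * g_y).sum(dim=1)
        v_m = torch.cat([v_t, v_x, v_y], dim=-1)
        logits = self.out(v_m).squeeze(-1)
        p = torch.sigmoid(logits)               # Eq. (20)
        loss = self.loss_fn(logits, labels.float()) if labels is not None else None
        return p, loss
```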
4. Experiments and model implementation

In this section, we discuss the experiments and model implementation details.

4.1. Datasets description

We describe the datasets used for the experiments as per the tasks. We present a summary of the statistics of the datasets for fake news, hate speech and offensive language in Tables 1 to 3 respectively.

4.1.1. Fake news dataset

• PolitiFact: The PolitiFact dataset is part of FakeNewsNet [45], a repository of news contents which fact-checks political reports and issues. It has been collected from the website3 of the organization. It consists of three contexts: news content, social context and spatio-temporal information. Like most prior works, we used the news content. Annotations were done by human annotators as part of the dataset development. The news content comprises mainly the news headline and body. PolitiFact consists of news articles that were published from May 2002 to July 2018. It comprises 1,056 news articles, with 624 real news and 432 fake news. Other statistics on textual and visual contents are shown in Table 1. In our experiment, we have treated the headline and body as the original text modality but applied them separately when dealing with other modalities in the model. For instance, in computing attention weights, both are weighted separately using captions and image–texts.

3 https://www.politifact.com/

Table 1
Statistics of PolitiFact dataset.
                 Real   Fake   Total
News contents    624    432    1,056
 - with text     528    420    948
 - with image    447    336    783

4.1.2. Hate speech dataset

• MMHS150K: MMHS150K [6] is a large-scale collection of tweets for the hate speech classification task. The raw MMHS150K consists of 150,000 samples, on which annotations were done by human annotators based on majority voting. The annotated data consist of 112,845 NotHate samples and 36,978 Hate samples. The annotations fall into six classes: ‘‘No attacks to any community’’, ‘‘racist’’, ‘‘sexist’’, ‘‘homophobic’’, ‘‘religion based attacks’’ and ‘‘attacks to other communities’’. The five labels apart from ‘‘No attacks to any community’’ are hate categories. The dataset was further split into a test set consisting of 10,000 samples and a validation set consisting of 5,000 samples, while the remaining samples were set aside as the training set. Each data sample has a text and an associated image, and the majority of the images have texts inserted within them. Following the authors of MMHS150K, in our experiment the dataset was treated as binary by taking all hateful categories as the ‘‘Hate’’ label and ‘‘No attacks to any community’’ as the ‘‘NotHate’’ label. Full statistical details about the dataset are presented in Table 2.

Table 2
Statistics of MMHS150K dataset.
          Train     Validation   Test
Hate      29,447    2,500        5,001
NotHate   105,346   2,500        4,999
Total     134,823   5,000        10,000

4.1.3. Offensive language dataset

• MultiOFF: MultiOFF [4] was developed from a collection of memes from social media such as Facebook and Twitter, annotated for offensiveness or otherwise. It is an extension of an existing dataset about the 2016 U.S. Presidential Election. In all, MultiOFF consists of 743 samples split into training, validation and test sets. It consists only of the memes (images) and texts (full details in Table 3). The text modality constitutes the image–texts already extracted by the authors. Therefore, the dataset does not have separate text and image–text modalities. Experiments on MultiOFF are carried out using the two modalities.

Table 3
Statistics of MultiOFF dataset.
                Train   Validation   Test
Offensive       187     59           59
Non-offensive   258     90           90
Total           445     149          149

4.2. Data pre-processing

Traditional pre-processing of the text data, such as data cleaning, noise removal and lower casing, among others, was performed according to the peculiarity of the data. For instance, the hate speech dataset was collected from Twitter, which contains slang and informal writing styles. All numeric tokens are represented as ‘‘<number>’’, all uniform resource locators (URLs) as ‘‘<url>’’, and emojis are interpreted, among other pre-processing activities.
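A hedged sketch of this kind of normalisation is shown below. The placeholder tokens follow the ones quoted above, while the exact regular expressions and the emoji handling are illustrative assumptions.

```python
import re

try:
    import emoji  # optional: used only to spell out emojis as text
except ImportError:
    emoji = None

URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUM_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")

def preprocess(text: str) -> str:
    """Lower-case, replace URLs and numbers with placeholder tokens and
    interpret emojis, mirroring the pre-processing described in Section 4.2."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)
    text = NUM_RE.sub("<number>", text)
    if emoji is not None:
        text = emoji.demojize(text, delimiters=(" ", " "))  # spell emojis out as words
    return re.sub(r"\s+", " ", text).strip()
```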
4.3. Model implementation

In the following subsections, we present a brief description of the task-specific details of the experiments and a summary of the hyperparameters for all models in Table 4. In all experiments, 300-dimensional GloVe embeddings [46] (trained on 840 billion word tokens) were used to initialize the Embedding layer. Furthermore, for datasets which suffer from class imbalance, we adopt sample weights from each class in computing the loss.

Table 4
Hyperparameters for the task-specific models.
Task             Fake news   Hate speech   Offens lang
LSTM neurons     300         100           300
CNN filters      300         100           300
Batch size       32          128           32
Embedding size   300         300           300
Optimizer        ADAM        NADAM         ADAM
Learning rate    1e−3        1e−3          1e−3
Dropout          0.2         0.2           0.2
Batch norm       NA          Yes           NA

Keys: NA — Not applied, Offens lang — Offensive language.

4.3.1. Fake news detection experiments
We split the fake news dataset in the ratio 7:1:2 for training, validation and test respectively. In training the model, we use a batch size of 32 with ADAM [47] as the optimization algorithm and a learning rate of 1e−3. The model was regularized with a Dropout [48] probability of 0.2, applied after the concatenation layer. The number of LSTM neurons is 300, while the CNN filters and kernel windows are 300 and 3 respectively.

4.3.2. Hate speech detection experiments
The MMHS150K dataset used for the experiment is already split into training, validation and test sets, as shown in Table 2. Batch normalization [49] was applied to standardize inputs to the BiLSTM layer. In training the model, we use a batch size of 128 with NADAM [50] as the optimization algorithm and a learning rate of 1e−3. A Dropout [48] probability of 0.2 was applied after the concatenation layer to regularize the model. The number of LSTM neurons and CNN filters is 100, while the kernel window size is 4.

4.3.3. Offensive language detection experiments
Like MMHS150K, the MultiOFF dataset is also already split into training, validation and test sets, as shown in Table 3. A batch size of 32 is adopted in training the model, while ADAM [47] is used as the optimization algorithm with a learning rate of 1e−3. A Dropout [48] probability of 0.2, applied after the concatenation layer, is used to regularize the model. The number of LSTM neurons is 300, while the CNN filters and kernel windows are 300 and 3 respectively.

5. Model evaluation and results discussion

In the following subsections, we evaluate each of the models, report their performance and that of the ablation studies, and then compare with state-of-the-art models.

5.1. Evaluation results on fake news detection task

Table 5 shows the results of the inter-modal attention model on the PolitiFact dataset with different modalities in terms of Accuracy, Precision, Recall and F1. Table 6 shows the results of the ablation experiments which use the BiLSTM-CNN directly, without the inter-modal attention module, for all combinations of modalities.

Table 5
Performance of inter-modal attention models with different combinations of modalities on PolitiFact dataset. Best results in bold.
Modality       Accuracy   Precision   Recall   F1
TT & IM        0.893      0.898       0.894    0.896
TT & IT        0.893      0.893       0.903    0.898
IM & IT        0.570      0.568       0.546    0.557
TT & IM & IT   0.940      0.939       0.940    0.939

Keys: TT — Text, IM — Images and IT — Image–text.

Table 6
Performance of BiLSTM-CNN models with different combinations of modalities on PolitiFact dataset. Best results in bold.
Modality       Accuracy   Precision   Recall   F1
TT             0.893      0.898       0.895    0.897
IM             0.631      0.744       0.658    0.698
IT             0.557      0.776       0.515    0.619
TT & IM        0.893      0.892       0.895    0.894
TT & IT        0.866      0.867       0.876    0.871
IM & IT        0.651      0.694       0.626    0.658
TT & IM & IT   0.879      0.882       0.884    0.883

Keys: TT — Text, IM — Images and IT — Image–text.

The results in Table 5 show that the inter-modal attention model produces the best result when the three modalities are used. Text used with image and text used with image–text have comparable performance. Image used with image–text does not seem to be helpful in detecting fake news, as the combination performs well below the other combinations. The results of the ablation studies, which use direct concatenation of BiLSTM-CNN features without the attention layer, are presented in Table 6. Analysis of the ablation results also reveals that the combination of the three modalities best detects fake news. The performance of the different combinations of modalities follows the same pattern as the main inter-modal attention model. However, the impact of the inter-modal attention mechanism is noteworthy. It enhances performance by 6.1%, 5.7%, 5.6% and 5.6% for Accuracy, Precision, Recall and F1 respectively.

Fig. 4 shows the Receiver Operating Characteristic (ROC) curve of our inter-attention model. The ROC curve tilts towards the True Positive Rate and away from the random curve, which confirms the quality of the model.

Fig. 4. The Receiver Operating Characteristic (ROC) curve for fake news detection task.
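ROC curves of this kind can be reproduced from the predicted probabilities with scikit-learn and matplotlib, as in the short sketch below; the variable names are placeholders for the model outputs and gold labels.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(y_true, y_prob, title: str = "ROC curve") -> None:
    """Plot the ROC curve of a binary classifier against the random baseline."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    auc = roc_auc_score(y_true, y_prob)
    plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="random")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(title)
    plt.legend()
    plt.show()
```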


5.1.1. Performance comparison with baselines on fake news detection

Performance comparison of the inter-modal attention model with state-of-the-art models is presented in Table 7. Brief descriptions of the baselines are as follows:

• Text-tags-LBP-similarity: Text-tags-LBP-similarity [19] is a model based on a neural network classifier which computes the similarity of text, tags and LBP features.
• SAFE: SAFE [5] uses an extension of a method based on a Convolutional Neural Network (CNN) for textual and visual feature extraction. The main component of SAFE is the computation of similarity between text and image features, which was used to optimize the model learning parameters.
• MCNN: MCNN [21] is a network-based approach which consists of five sub-networks, namely: a text feature extraction module, a visual semantic feature extraction module, a visual tampering feature extraction module, a similarity measurement module and a multimodal fusion module.

Table 7
Performance comparison of the inter-modal attention model with state-of-the-art models on PolitiFact dataset. Best results in bold.
Model          Accuracy   Precision   Recall   F1
LBP-sim [19]   –          –           –        0.925
SAFE [5]       0.874      0.889       0.903    0.896
MCNN [21]      0.884      0.973       0.867    0.917
Our model      0.940      0.939       0.940    0.939

Comparison of the performance of our model with the baselines shows that our inter-modal attention model outperforms the other baselines across all the metrics except Precision, where MCNN has a better performance.

5.2. Evaluation results on hate speech detection task

Table 8 shows the results of the inter-modal attention model on the MMHS150K dataset with different modalities in terms of Accuracy, Precision, Recall, F1 and Area Under Curve (AUC). Table 9 shows the results of the ablation experiments which use the BiLSTM-CNN directly, without the inter-modal attention module, for all combinations of modalities.

Table 8
Performance of the inter-modal attention models with different combinations of modalities on MMHS150K dataset. Best results in bold.
Modality       Acc     Prec    Rec     F1      AUC
TT & IM        0.684   0.656   0.763   0.705   0.684
TT & IT        0.680   0.653   0.764   0.704   0.680
IM & IT        0.514   0.516   0.934   0.665   0.514
TT & IM & IT   0.687   0.659   0.761   0.706   0.687

Keys: Acc — Accuracy, Prec — Precision, Rec — Recall, TT — Text, IM — Images and IT — Image–text.

Table 9
Performance of BiLSTM-CNN models with different combinations of modalities on MMHS150K dataset. Best results in bold.
Modality       Acc     Prec    Rec     F1      AUC
TT             0.678   0.653   0.762   0.703   0.678
IM             0.524   0.516   0.892   0.654   0.524
IT             0.506   0.508   0.965   0.666   0.506
TT & IM        0.677   0.653   0.762   0.703   0.677
TT & IT        0.677   0.652   0.763   0.703   0.677
IM & IT        0.524   0.520   0.883   0.654   0.524
TT & IM & IT   0.675   0.652   0.761   0.702   0.675

Keys: Acc — Accuracy, Prec — Precision, Rec — Recall, TT — Text, IM — Images and IT — Image–text.

The combination of the three modalities results in better performance across all the metrics except Recall, where the use of image and image–text results in better performance with a score of 93.4%. The combination of text and image, compared with text and image–text, produces comparable performance. The combination of image and image–text results in considerably lower performance in general when compared to other combinations, except in Recall, where it produces a significantly higher performance.

In contrast to the results produced by the inter-modal attention model, the combination of text with image or image–text performs best using the BiLSTM-CNN model. Still on the BiLSTM-CNN models, the combination of text and either image or image–text produces better performance than using the three modalities. More surprising is the fact that text only outperforms any other combination of modalities except in Recall, where image–text only has the best performance.

We also qualitatively evaluate the inter-attention model on hate speech detection using the ROC curve presented in Fig. 5. The position of the ROC curve, as shown in the figure, confirms the quantitative performance.

Fig. 5. The Receiver Operating Characteristic (ROC) curve for hate speech detection task.

5.2.1. Performance comparison with baselines on hate speech detection

Table 10 depicts the performance of the inter-modal attention model against state-of-the-art models. The descriptions of the baseline models are as follows:

• FCM: The Features Concatenation Model (FCM) [6] is a concatenation of features from each of the modalities, in which a CNN-based pretrained model (Inception v3) was used for image feature representation with Average Pooling, while an LSTM was used for the representation of text and image–text features.
• SCM: The Spatial Concatenation Model (SCM) [6] is the same as FCM but with a change in the feature vectors of Inception v3.

• TKM: The Textual Kernels Model (TKM) [6] aims to boost interactions among modalities by learning dependent text kernels for texts and image–texts.

Table 10
Performance comparison of the inter-modal attention model with state-of-the-art models on MMHS150K dataset. Best results in bold.
Model       Acc     Precision   Recall   F1      AUC
FCM [6]     0.684   –           –        0.704   0.734
SCM [6]     0.685   –           –        0.702   0.732
TKM [6]     0.682   –           –        0.701   0.731
Our model   0.687   0.659       0.761    0.706   0.687

Keys: Acc — Accuracy.

Comparison of the performance of our inter-modal attention model with the baseline models, as presented in Table 10, shows that it outperforms the baselines on Accuracy and F1, with scores of 68.7% and 70.6% respectively. FCM is the best on the AUC metric. Our model achieves 65.9% and 76.1% in Precision and Recall respectively. The baselines do not present evaluation results for Precision and Recall.

5.3. Evaluation results on offensive language detection task

Since the MultiOFF dataset consists of two modalities, we present the results of the inter-modal attention model together with those of the BiLSTM-CNN models in Table 11.

Table 11
Performance of inter-modal attention and BiLSTM-CNN models with individual and combined modalities on MultiOFF dataset. Best results in bold.
Modality                Acc     Prec    Rec     F1
BiLSTM-CNN (TT)         0.658   0.641   0.598   0.619
BiLSTM-CNN (IM)         0.537   0.614   0.635   0.624
BiLSTM-CNN (TT & IM)    0.664   0.655   0.643   0.649
Inter-att (TT & IM)     0.718   0.703   0.700   0.702

Keys: Acc — Accuracy, Prec — Precision, Rec — Recall, TT — Text, IM — Images, Inter-att — Inter-modal attention.

The results in Table 11 confirm that the combined usage of text and image is beneficial for offensive language detection. On all the evaluation metrics, multimodality improves performance. For the unimodal approaches, using text only is better for Accuracy and Precision, while the image modality is better in Recall and F1. The inter-modal attention model outperforms the BiLSTM-CNN model by a significant margin, which confirms its effectiveness.

The ROC curve for our inter-attention model on offensive language is presented in Fig. 6. The positioning of the curve in relation to the True Positive Rate and the random curve confirms the quality of the quantitative performance of the model.

Fig. 6. The Receiver Operating Characteristic (ROC) curve for offensive language detection task.

5.3.1. Performance comparison with baselines on offensive language detection

Performance comparison of our inter-modal attention model with state-of-the-art models is presented in Table 12. Brief descriptions of the baselines are as follows:

• Stacked LSTM + VGG16: Stacked LSTM + VGG16 [4] uses a stacked LSTM and VGG16 for text and image representation respectively in a neural classifier.
• BiLSTM + VGG16: BiLSTM + VGG16 [4] uses a BiLSTM to represent the texts, combined with image features extracted with VGG16.
• CNNText + VGG16: CNNText + VGG16 [4] employs VGG16 for image feature extraction, while a traditional CNN was used for textual feature representation.
• DisMultiHate: DisMultiHate [33] extracts target entities for hate detection in multimodal memes.
• MeBERT: MeBERT [35] fuses texts and images enhanced with external knowledge for semantic representation.
• MemeFier: MemeFier [8] is a deep learning framework for classifying memes. It also incorporates external knowledge into feature encoding. The fusion of modalities is based on alignment among the multimodal features.

Table 12
Performance comparison of the inter-modal attention model with state-of-the-art models on MultiOFF dataset. Best results in bold.
Model                    Acc     Prec    Rec     F1
Stacked LSTM+VGG16 [4]   –       0.400   0.660   0.500
BiLSTM+VGG16 [4]         –       0.400   0.440   0.410
CNNText+VGG16 [4]        –       0.380   0.670   0.480
DisMultiHate [33]        –       0.645   0.651   0.646
MeBERT [35]              –       0.670   0.671   0.671
MemeFier [8]             0.685   –       –       0.625
Our model                0.718   0.703   0.700   0.702

Keys: Acc — Accuracy, Prec — Precision, Rec — Recall.

Our inter-modal attention model, with 70.3%, 70.0% and 70.2% on Precision, Recall and F1 respectively, is the best performing model among the compared baselines. It also achieves better Accuracy when compared with MemeFier, the only baseline which evaluated on Accuracy.

5.4. Further note

Result analysis across the tasks reveals that the combination of the three modalities mostly leads to the best performance on both the inter-modal attention and BiLSTM-CNN models, with the exception of the hate speech dataset (MMHS150K), where the text modality alone leads to the best performance on the BiLSTM-CNN model. In general, the efficacy of the inter-modal attention model is evident across all the tasks.

6. Conclusion

Multimodal content understanding is a challenging and still open research area due to the heterogeneity and semantic gaps in the modalities involved. The majority of prior works in multimodal content understanding for fake news, hate speech and offensive language detection do not take into account how the modalities involved complement one another, due to issues caused by the aforementioned gaps. In this work, we introduce an additional modality and fill the gaps by leveraging advances in computer vision to unify the diverse modalities. We further develop a unified deep learning framework based on an inter-modal attention mechanism over the unified modalities. Our

framework consists of several modules/layers based mainly on neural networks. We conduct extensive experiments on three public benchmark datasets covering fake news, hate speech and offensive language. Our model significantly enhances prediction and achieves state-of-the-art performance on most of the datasets. We further conduct ablation experiments covering the three tasks to show the effectiveness of our unified inter-modal attention approach.

CRediT authorship contribution statement

Eniafe Festus Ayetiran: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Özlem Özgöbek: Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Visualization, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

The data are publicly available online.

Acknowledgment

This work was carried out during the tenure of the first author as an ERCIM ‘‘Alain Bensoussan’’ fellow.

References

[1] A. Giachanou, P. Rosso, The battle against online harmful information: The cases of fake news and hate speech, in: M. d’Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 3503–3504, http://dx.doi.org/10.1145/3340531.3412169.
[2] European Foundation for South Asian Studies, The role of fake news in fueling hate speech and extremism online; promoting adequate measures for tackling the phenomenon, 2021, https://www.efsas.org/publications/study-papers/the-role-of-fake-news-in-fueling-hate-speech-and-extremism-online/.
[3] D. Khattar, J.S. Goud, M. Gupta, V. Varma, MVAE: multimodal variational autoencoder for fake news detection, in: L. Liu, R.W. White, A. Mantrach, F. Silvestri, J.J. McAuley, R. Baeza-Yates, L. Zia (Eds.), The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, May 13-17, 2019, ACM, 2019, pp. 2915–2921, http://dx.doi.org/10.1145/3308558.3313552.
[4] S. Suryawanshi, B.R. Chakravarthi, M. Arcan, P. Buitelaar, Multimodal meme dataset (MultiOFF) for identifying offensive content in image and text, in: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 32–41, URL https://aclanthology.org/2020.trac-1.6.
[5] X. Zhou, J. Wu, R. Zafarani, SAFE: similarity-aware multi-modal fake news detection, in: H.W. Lauw, R.C. Wong, A. Ntoulas, E. Lim, S. Ng, S.J. Pan (Eds.), Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11-14, 2020, Proceedings, Part II, in: Lecture Notes in Computer Science, vol. 12085, Springer, 2020, pp. 354–367, http://dx.doi.org/10.1007/978-3-030-47436-2_27.
[6] R. Gomez, J. Gibert, L. Gómez, D. Karatzas, Exploring hate speech detection in multimodal publications, in: IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1-5, 2020, IEEE, 2020, pp. 1459–1467, http://dx.doi.org/10.1109/WACV45572.2020.9093414.
[7] C. Yang, F. Zhu, G. Liu, J. Han, S. Hu, Multimodal hate speech detection via
[8] C. Koutlis, M. Schinas, S. Papadopoulos, MemeFier: Dual-stage modality fusion for image meme classification, in: I. Kompatsiaris, J. Luo, N. Sebe, A. Yao, V. Mazaris, S. Papadopoulos, A. Popescu, Z.H. Huang (Eds.), Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, ICMR 2023, Thessaloniki, Greece, June 12-15, 2023, ACM, 2023, pp. 586–591, http://dx.doi.org/10.1145/3591106.3592254.
[9] L. Bozarth, C. Budak, Toward a better performance evaluation framework for fake news classification, in: M.D. Choudhury, R. Chunara, A. Culotta, B.F. Welles (Eds.), Proceedings of the Fourteenth International AAAI Conference on Web and Social Media, ICWSM 2020, Held Virtually, Original Venue: Atlanta, Georgia, USA, June 8-11, 2020, AAAI Press, 2020, pp. 60–71, URL https://ojs.aaai.org/index.php/ICWSM/article/view/7279.
[10] W. Chen, W. Wang, L. Liu, M.S. Lew, New ideas and trends in deep multimodal content understanding: A review, Neurocomputing 426 (2021) 195–215, http://dx.doi.org/10.1016/j.neucom.2020.10.042.
[11] A. Graves, N. Jaitly, A. Mohamed, Hybrid speech recognition with deep bidirectional LSTM, in: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, Olomouc, Czech Republic, December 8-12, 2013, IEEE, 2013, pp. 273–278, http://dx.doi.org/10.1109/ASRU.2013.6707742.
[12] I. Segura-Bedmar, S. Alonso-Bartolome, Multimodal fake news detection, Inf. 13 (6) (2022) 284, http://dx.doi.org/10.3390/info13060284.
[13] D. Lahat, T. Adali, C. Jutten, Challenges in multimodal data fusion, in: 22nd European Signal Processing Conference, EUSIPCO 2014, Lisbon, Portugal, September 1-5, 2014, IEEE, 2014, pp. 101–105, URL https://ieeexplore.ieee.org/document/6951999/.
[14] F. Alam, S. Cresci, T. Chakraborty, F. Silvestri, D. Dimitrov, G.D.S. Martino, S. Shaar, H. Firooz, P. Nakov, A survey on multimodal disinformation detection, in: Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 6625–6643, URL https://aclanthology.org/2022.coling-1.576.
[15] Y. Wang, F. Ma, Z. Jin, Y. Yuan, G. Xun, K. Jha, L. Su, J. Gao, EANN: event adversarial neural networks for multi-modal fake news detection, in: Y. Guo, F. Farooq (Eds.), Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, ACM, 2018, pp. 849–857, http://dx.doi.org/10.1145/3219819.3219903.
[16] S. Singhal, R.R. Shah, T. Chakraborty, P. Kumaraguru, S. Satoh, SpotFake: A multi-modal framework for fake news detection, in: Fifth IEEE International Conference on Multimedia Big Data, BigMM 2019, Singapore, September 11-13, 2019, IEEE, 2019, pp. 39–47, http://dx.doi.org/10.1109/BigMM.2019.00-44.
[17] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186, http://dx.doi.org/10.18653/v1/n19-1423.
[18] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[19] A. Giachanou, G. Zhang, P. Rosso, Multimodal fake news detection with textual, visual and semantic information, in: P. Sojka, I. Kopecek, K. Pala, A. Horák (Eds.), Text, Speech, and Dialogue - 23rd International Conference, TSD 2020, Brno, Czech Republic, September 8-11, 2020, Proceedings, in: Lecture Notes in Computer Science, vol. 12284, Springer, 2020, pp. 30–38, http://dx.doi.org/10.1007/978-3-030-58323-1_3.
[20] A. Giachanou, G. Zhang, P. Rosso, Multimodal multi-image fake news detection, in: G.I. Webb, Z. Zhang, V.S. Tseng, G. Williams, M. Vlachos, L. Cao (Eds.), 7th IEEE International Conference on Data Science and Advanced Analytics, DSAA 2020, Sydney, Australia, October 6-9, 2020, IEEE, 2020, pp. 647–654, http://dx.doi.org/10.1109/DSAA49011.2020.00091.
[21] J. Xue, Y. Wang, Y. Tian, Y. Li, L. Shi, L. Wei, Detecting fake news by exploring the consistency of multimodal data, Inf. Process. Manag. 58 (5) (2021) 102610, http://dx.doi.org/10.1016/j.ipm.2021.102610.
[22] Y. Wu, P. Zhan, Y. Zhang, L. Wang, Z. Xu, Multimodal fusion with co-attention networks for fake news detection, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021, in: Findings of ACL, ACL/IJCNLP 2021, Association for Computational Linguistics, 2021, pp. 2560–2569, http://dx.doi.org/10.18653/v1/2021.findings-acl.226.
[23] Y. Chen, D. Li, P. Zhang, J. Sui, Q. Lv, T. Lu, L. Shang, Cross-modal ambiguity learning for multimodal fake news detection, in: F. Laforest, R. Troncy, E. Simperl, D. Agarwal, A. Gionis, I. Herman, L. Médini (Eds.), WWW ’22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25 - 29, 2022, ACM, 2022, pp. 2897–2905, http://dx.doi.org/10.1145/3485447.3511968.
cross-domain knowledge transfer, in: J. ao Magalhães, A.D. Bimbo, S. Satoh, N. [24] G. Zhang, A. Giachanou, P. Rosso, Scenefnd: Multimodal fake news
Sebe, X. Alameda-Pineda, Q. Jin, V. Oria, L. Toni (Eds.), MM ’22: The 30th ACM detection by modelling scene context information, J. Inf. Sci. (2024)
International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, 01655515221087683, http://dx.doi.org/10.1177/01655515221087683, arXiv:
ACM, 2022, pp. 4505–4514, http://dx.doi.org/10.1145/3503161.3548255. 10.1177/01655515221087683.

10
E.F. Ayetiran and Ö. Özgöbek Information Systems 123 (2024) 102378

[25] S. Xiong, G. Zhang, V. Batra, L. Xi, L. Shi, L. Liu, TRIMOON: two-round [37] A. Radford, J.W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry,
inconsistency-based multi-modal fusion network for fake news detection, Inf. A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning transferable
Fusion 93 (2023) 150–158, http://dx.doi.org/10.1016/j.inffus.2022.12.016. visual models from natural language supervision, in: M. Meila, T. Zhang (Eds.),
[26] H. Hosseinmardi, R.I. Rafiq, R. Han, Q. Lv, S. Mishra, Prediction of cyberbullying Proceedings of the 38th International Conference on Machine Learning, ICML
incidents in a media-based social network, in: R. Kumar, J. Caverlee, H. Tong 2021, 18-24 July 2021, Virtual Event, in: Proceedings of Machine Learning
(Eds.), 2016 IEEE/ACM International Conference on Advances in Social Networks Research, vol. 139, PMLR, 2021, pp. 8748–8763, URL http://proceedings.mlr.
Analysis and Mining, ASONAM 2016, San Francisco, CA, USA, August 18-21, press/v139/radford21a.html.
2016, IEEE Computer Society, 2016, pp. 186–192, http://dx.doi.org/10.1109/ [38] J. Li, D. Li, C. Xiong, S.C.H. Hoi, BLIP: bootstrapping language-image pre-training
ASONAM.2016.7752233. for unified vision-language understanding and generation, in: K. Chaudhuri, S.
[27] F. Yang, X. Peng, G. Ghosh, R. Shilon, H. Ma, E. Moore, G. Predovic, Exploring Jegelka, L. Song, C. Szepesvári, G. Niu, S. Sabato (Eds.), International Conference
deep multimodal fusion of text and photo for hate speech classification, in: on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA,
Proceedings of the Third Workshop on Abusive Language Online, Association in: Proceedings of Machine Learning Research, vol. 162, PMLR, 2022, pp.
for Computational Linguistics, Florence, Italy, 2019, pp. 11–18, http://dx.doi. 12888–12900, URL https://proceedings.mlr.press/v162/li22n.html.
org/10.18653/v1/W19-3502, URL https://aclanthology.org/W19-3502. [39] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8)
[28] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, D. Testuggine, (1997) 1735–1780, http://dx.doi.org/10.1162/neco.1997.9.8.1735.
The hateful memes challenge: Detecting hate speech in multimodal memes, in: H. [40] K. Fukushima, S. Miyake, Neocognitron: A new algorithm for pattern recognition
Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural tolerant of deformations and shifts in position, Pattern Recognit. 15 (6) (1982)
Information Processing Systems 33: Annual Conference on Neural Information 455–469, http://dx.doi.org/10.1016/0031-3203(82)90024-3.
Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, Virtual, 2020. [41] Y. Kim, Convolutional neural networks for sentence classification, in: A. Mos-
[29] K. Maity, P. Jha, S. Saha, P. Bhattacharyya, A multitask framework for sentiment, chitti, B. Pang, W. Daelemans (Eds.), Proceedings of the 2014 Conference on
emotion and sarcasm aware cyberbullying detection from multi-modal code- Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-
mixed memes, in: E. Amigó, P. Castells, J. Gonzalo, B. Carterette, J.S. Culpepper, 29, 2014, Doha, Qatar, a Meeting of SIGDAT, a Special Interest Group of the
G. Kazai (Eds.), SIGIR ’22: The 45th International ACM SIGIR Conference on ACL, ACL, 2014, pp. 1746–1751, http://dx.doi.org/10.3115/v1/d14-1181.
Research and Development in Information Retrieval, Madrid, Spain, July 11 - 15, [42] T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural
2022, ACM, 2022, pp. 1739–1749, http://dx.doi.org/10.1145/3477495.3531925. machine translation, in: L. Màrquez, C. Callison-Burch, J. Su, D. Pighin, Y.
[30] C. Yang, F. Zhu, G. Liu, J. Han, S. Hu, Multimodal hate speech detection via Marton (Eds.), Proceedings of the 2015 Conference on Empirical Methods in
cross-domain knowledge transfer, in: J. ao Magalhães, A.D. Bimbo, S. Satoh, N. Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-
Sebe, X. Alameda-Pineda, Q. Jin, V. Oria, L. Toni (Eds.), MM ’22: The 30th ACM 21, 2015, The Association for Computational Linguistics, 2015, pp. 1412–1421,
International Conference on Multimedia, Lisboa, Portugal, October 10 - 14, 2022, http://dx.doi.org/10.18653/v1/d15-1166.
ACM, 2022, pp. 4505–4514, http://dx.doi.org/10.1145/3503161.3548255. [43] E.F. Ayetiran, Attention-based aspect sentiment classification using enhanced
[31] E. Fersini, F. Gasparini, G. Rizzi, A. Saibene, B. Chulvi, P. Rosso, A. Lees, J. learning through CNN-BiLSTM networks, Knowl.-Based Syst. 252 (2022) 109409,
Sorensen, SemEval-2022 task 5: Multimedia automatic misogyny identification, http://dx.doi.org/10.1016/j.knosys.2022.109409.
in: G. Emerson, N. Schluter, G. Stanovsky, R. Kumar, A. Palmer, N. Schneider, [44] E.F. Ayetiran, P. Sojka, V. Novotný, EDS-MEMBED: multi-sense embeddings
S. Singh, S. Ratan (Eds.), Proceedings of the 16th International Workshop on based on enhanced distributional semantic structures via a graph walk over
Semantic Evaluation, SemEval@NAACL 2022, Seattle, Washington, United States, word senses, Knowl.-Based Syst. 219 (2021) 106902, http://dx.doi.org/10.1016/
July 14-15, 2022, Association for Computational Linguistics, 2022, pp. 533–549, j.knosys.2021.106902.
http://dx.doi.org/10.18653/V1/2022.SEMEVAL-1.74. [45] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, FakeNewsNet: A data
[32] G. Rizzi, F. Gasparini, A. Saibene, P. Rosso, E. Fersini, Recognizing misogynous repository with news content, social context, and spatiotemporal information
memes: Biased models and tricky archetypes, Inf. Process. Manag. 60 (5) (2023) for studying fake news on social media, Big Data 8 (3) (2020) 171–188, http:
103474, http://dx.doi.org/10.1016/J.IPM.2023.103474. //dx.doi.org/10.1089/big.2020.0062.
[33] R.K. Lee, R. Cao, Z. Fan, J. Jiang, W. Chong, Disentangling hate in online memes, [46] J. Pennington, R. Socher, C.D. Manning, Glove: Global vectors for word represen-
in: MM ’21: ACM Multimedia Conference, Virtual Event, China, October 20 - 24, tation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural
2021, ACM, 2021, pp. 5138–5147, http://dx.doi.org/10.1145/3474085.3475625. Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, a Meeting
[34] S. Pramanick, S. Sharma, D. Dimitrov, M.S. Akhtar, P. Nakov, T. Chakraborty, of SIGDAT, a Special Interest Group of the ACL, ACL, 2014, pp. 1532–1543,
MOMENTA: A multimodal framework for detecting harmful memes and their http://dx.doi.org/10.3115/v1/d14-1162.
targets, in: M. Moens, X. Huang, L. Specia, S.W. Yih (Eds.), Findings of the Asso- [47] D.P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio,
ciation for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR
Dominican Republic, 16-20 November, 2021, Association for Computational 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
Linguistics, 2021, pp. 4439–4455, http://dx.doi.org/10.18653/v1/2021.findings- [48] N. Srivastava, G.E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov,
emnlp.379. Dropout: A simple way to prevent neural networks from overfitting, J. Mach.
[35] Q. Zhong, Q. Wang, J. Liu, Combining knowledge and multi-modal fusion for Learn. Res. 15 (1) (2014) 1929–1958, URL http://dl.acm.org/citation.cfm?id=
meme classification, in: MultiMedia Modeling - 28th International Conference, 2670313.
MMM 2022, Phu Quoc, Vietnam, June 6-10, 2022, Proceedings, Part I, in: [49] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training
Lecture Notes in Computer Science, vol. 13141, Springer, 2022, pp. 599–611, by reducing internal covariate shift, in: F.R. Bach, D.M. Blei (Eds.), Proceedings
http://dx.doi.org/10.1007/978-3-030-98358-1_47. of the 32nd International Conference on Machine Learning, ICML 2015, Lille,
[36] D. Li, J. Li, H. Le, G. Wang, S. Savarese, S.C.H. Hoi, LAVIS: A library for France, 6-11 July 2015, in: JMLR Workshop and Conference Proceedings, vol. 37,
language-vision intelligence, 2022, arXiv:2209.09019. JMLR.org, 2015, pp. 448–456, URL http://proceedings.mlr.press/v37/ioffe15.
html.
[50] T. Dozat, Incorporating nesterov momentum into adam, in: 4th International
Conference on Learning Representations, ICLR 2016 Workshop Track, San Juan,
Puerto Rico, USA, May 2-4, 2016, Conference Track Proceedings, 2016.

11

You might also like