Multimodal_Hate_Speech_Detection_in_Memes_Using_Contrastive_Language-Image_Pre-Training
Corresponding authors: Mohammad Kamrul Hasan, Nurhizam Safie, Ashish Bagwari, and Shayla Islam
This work was supported in part by the Universiti Kebangsaan Malaysia under Grant GUP-2023-010.
ABSTRACT In contemporary society, the proliferation of online hateful messages has emerged as a
pressing concern, inflicting deleterious consequences on both societal fabric and individual well-being.
The automatic detection of such malevolent content using models designed to recognize it holds
promise in mitigating its harmful impact. However, the advent of "Hateful Memes" poses fresh challenges
to the detection paradigm, particularly within the realm of deep learning models. These memes consist
of a textual element paired with an image; each element may be innocuous on its own, but their
combination produces a detrimental effect. Consequently, entities responsible for disseminating information via web browsers
are compelled to institute mechanisms that regulate and automatically filter out such injurious content.
Effectively identifying hateful memes demands algorithms and models endowed with robust vision and
language fusion capabilities, capable of reasoning across diverse modalities. This research introduces a novel
approach by leveraging the multimodal Contrastive Language-Image Pre-Training (CLIP) model, fine-tuned
through the incorporation of prompt engineering. This methodology achieves an
accuracy of 87.42%. Additional metrics such as loss, AUROC, and F1 score are also
computed, corroborating the efficacy of the proposed strategy. Our findings suggest that this approach
presents an efficient means to regulate the dissemination of hate speech in the form of viral meme content
across social networking platforms, thereby contributing to a safer online environment.
INDEX TERMS CLIP, Facebook hateful meme dataset, multimodal, contrastive learning, zero-shot
prediction, InfoNCE contrastive loss, prompt engineering, cosine similarity matrix.
The associate editor coordinating the review of this manuscript and approving it for publication
was Tai Fei.
I. INTRODUCTION
The pervasive use of hateful memes as a vehicle for spreading animosity on online platforms has
become an alarming trend. The term "hate speech" has solidified its presence as a ubiquitous
phenomenon in the realm of the internet. Memes, which can be any shareable content encompassing
photographs or videos that are spread, altered, and repeated over time [1], serve as conduits for
the transmission of such hate among individuals. A speech exhibiting hostility or violence
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
VOLUME 12, 2024 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ 22359
G. Arya et al.: Multimodal Hate Speech Detection in Memes Using CLIP
empirical results demonstrate a noteworthy improvement in classification accuracy as compared to
other state-of-the-art models, which is directly attributed to the integration of accurate prompts
into the CLIP model. Furthermore, this research extends its contribution to the broader context of
online content regulation. The insights gained from this work can inform the development of content
filtering systems by social networking companies and platforms, ultimately creating safer online
environments. By offering a robust solution that effectively addresses the fusion of text and
images in hateful content, this research takes a meaningful step towards mitigating the societal
impacts of online hate speech and promoting a healthier digital ecosystem for all users.
The remaining sections are organized as follows: Section II discusses recent work on hate speech
detection in multimodal memes using various techniques. Section III presents the problem
definition, the proposed solution, and the motivation behind its selection, together with a
detailed description of the proposed framework, including its working, architecture, and technical
procedure. Section IV discusses the results obtained using the proposed model. Finally, Section V
outlines the future scope of the method and Section VI concludes the paper.
II. LITERATURE SURVEY
The field of hate speech identification has seen a lot of effort, but relatively little of it has
focused on multimodal hate speech detection. Both NLP and network science have studied hate speech
in depth. Hate speech has long been detected using language analysis. Social media is one of the
key places where hate speech is directed at a range of targets. The typical process for hate speech
detection is to obtain a sentence embedding and feed it into a binary classifier for hate speech
prediction. A variety of language-based hate speech identification datasets have been made
available for research on hate speech detection. According to Yang et al., adding image embedding
information to text immediately improves hate speech recognition. With the aid of CrowdFlower
workers, Hosseinmardi et al. categorize a dataset of Instagram photographs and the comments that go
with them. The workers were asked two questions: first, does the case represent cyberaggression,
and second, does it represent cyberbullying. They demonstrate that incorporating image features
enhances classification performance. The dataset included 998 cases, 90% of which had high
confidence ratings, and 52% of which were labeled as bullying [5]. Zhong et al. compiled a dataset
of 3000 samples of Instagram posts and comments in a similar manner. Two Mechanical Turk workers
were asked: Is there any bullying in the comments? If so, can the bullying be linked to the image's
subject matter? 560 incidents of bullying were discovered. They evaluate several features and
simple classifiers for automatically identifying bullying [6].
In the study by Yuyang Chen and Feng Pan, "Multimodal detection of hateful memes by applying a
vision-language pre-training model," a triplet is created by stacking the visual features, object
tags, and text features of memes, produced by the object detection model known as Visual features
in Vision-Language (VinVL) and by optical character recognition (OCR) technology, to perform
cross-modal meme learning. After being fine-tuned and coupled with a random forest (RF) classifier,
their model (OSCAR+RF) outperformed eleven (11) other published baselines on the task of
identifying hateful memes, reaching average accuracy and AUROC of 0.684 and 0.768, respectively, on
a public test set. The study concludes that VL-PTMs with anchor point additions can improve the
efficiency of deep learning-based hateful meme identification by including a more robust deep
learning model [7].
The study by Yi Zhou, Zhenhao Chen, and Huiyuan Yang, "Multimodal learning for hateful memes
detection," focuses on the identification of hate speech using a model based on Visual Question
Answering [8].
The research on "The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes" by Douwe
Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide
Testuggine proposes a new challenge set for multimodal classification that focuses on detecting
hate speech in multimodal memes [9]. Problematic instances are included in the dataset to make it
harder to rely on unimodal signals and to demonstrate the superiority of multimodal models. Despite
requiring complex reasoning, the task can be readily framed as a binary classification problem.
III. METHODOLOGY
The detection of hateful memes is a binary classification problem that seeks to determine whether a
meme is offensive or hateful through the analysis of multimodal data comprising both text and image
signals. Given the inherent complexity of memes, characterized by the coexistence of two
modalities, and recognizing the formidable obstacle of detection accuracy in this task, a
multi-task learning technique is deemed essential for drawing meaningful statistical conclusions
[10], [11].
In this study, the categorization of multimodal hateful memes is framed as a classification problem
in which a model predicts the label of a multimodal meme (hateful or non-hateful) based on the
associated image and text. To achieve this, models must predict a probability vector y ∈ R^2 over
the two classes. In greater detail, y0 denotes the predicted probability that the meme is
non-hateful, while y1 represents the probability that the meme is hateful. If y1 > y0, the meme is
classified as hateful; otherwise, it is categorized as non-hateful. Leveraging the Contrastive
Language-Image Pre-training (CLIP) model, we classify memes within this framework, assigning them
to the hateful or
non-hateful classes based on the highest cosine similarity score in contrastive learning.
The technique employs a pre-trained model for feature extraction from a meme in a transfer learning
context, combined with a downstream classification model that utilizes these features [12]. To
operate without additional data or manual labeling, a model is required that can balance the
unambiguous information from the combined modalities against the ambiguous information from each
individual modality while minimizing generalization error. Consequently, a strategy that integrates
statistical theory with state-of-the-art neural networks and optimization methods was developed to
discern the offensive memes.
CLIP is one of the most significant advancements in computer vision, as it bridges the domains of
computer vision and natural language processing. The challenges of huge datasets and subpar
real-world results in conventional vision models are also addressed by the CLIP network. It
distinguishes itself by allowing a single method to handle a diverse range of applications without
necessitating the construction of extensive custom datasets [13]. This departure from conventional
approaches, such as training models like ResNet that require vast labeled image datasets, positions
CLIP as a widely preferred model in the field of computer vision.
A. CONTRASTIVE LANGUAGE–IMAGE PRE-TRAINING
CLIP is a robust and scalable state-of-the-art multimodal vision and language model introduced by
OpenAI. This neural network is versatile, as it can be applied to various visual classification
benchmarks and learns visual concepts through natural language supervision. Unlike traditional
models, CLIP exhibits remarkable "zero-shot" capabilities akin to GPT-2 and GPT-3: given an image
and only the names of the visual categories to be recognized, it can predict the most pertinent
text snippet without being directly optimized for the task.
Central to CLIP's prowess is its training methodology employing contrastive learning, which aims to
map images and text descriptions into a shared latent space. This approach enables CLIP to discern
whether an image and a textual description match, thereby facilitating tasks like image
classification through text-image similarity [14], [15]. This means that CLIP can predict which
captions correspond to which images without domain-specific training, making it particularly potent
for out-of-the-box text and image search applications [16]. Beyond its fundamental capabilities,
CLIP finds application in a myriad of domains, including image generation, image similarity search,
image ranking, object tracking, robotics control, image captioning, geo-localization, and more.
Its versatility arises from its comprehensive understanding of the intricate relationships between
visual data and the corresponding linguistic representations. This understanding is cultivated
through training on an extensive corpus of natural language data, encompassing a distinctive
dataset composed of 400 million training images paired with their text descriptions, sourced from
the internet.
In essence, CLIP is an enhanced image classification model characterized by heightened accuracy and
efficiency, heralding a transformative era in the field of multimodal learning.
1) ZERO-SHOT LEARNING USING CLIP
In the proposed solution for meme detection and classification as hateful or non-hateful, the
concept of "zero-shot" image classification has been leveraged. This approach is particularly
advantageous as it allows us to generalize and make predictions on unseen labels without the
necessity of specific training for each class. Traditional machine learning models are typically
confined to learning and excelling at a single pre-defined task. For example, an image classifier
trained exclusively on categorizing dogs and cats may perform well within that specific scope.
However, models like CLIP distinguish themselves by possessing the capability to excel at tasks for
which they have not undergone explicit training. This phenomenon is encapsulated by the term
"zero-shot learning." Here, the model employs generalization to predict a class that has not been
encountered in the training data.
This makes it an ideal candidate for the proposed solution, since the specific nature of hateful
and non-hateful memes may vary widely and evolve over time. This adaptability contributes to the
robustness of our meme detection framework.
2) APPROACH
Scaling an elementary pre-training task is necessary to attain competitive zero-shot performance on
a wide range of image classification datasets. To achieve this, the CLIP model must be trained to
recognize a wide range of visual concepts in images and connect them to their names. This is done
by applying contrastive learning to a large dataset of image-text pairs. Specifically, given an
image, the model is trained to determine which of 32,768 randomly chosen text descriptions was
actually paired with it in the pre-training dataset.
a: CONTRASTIVE REPRESENTATION LEARNING
The novelty of the CLIP model lies in its utilization of a contrastive training strategy, a
paradigm that leverages positive (image-text pairs) and negative (other images and text) samples to
train a scoring function, thereby generating meaningful representations of the data. Through this
technique, CLIP is trained to understand that similar representations should converge in the latent
space, while dissimilar ones should exhibit considerable separation [17].
Within the model's architecture, encompassing both image and text encoders, the process of
contrastive training involves the labeling of image-text pairs, followed by their embedding with
various "objects" to learn abstractions in the data. This
facilitates the training of a zero-shot classifier on the resulting image and text embeddings.
Deep learning has made extensive use of contrastive pre-training. One explanation for this is that
contrastive pre-training followed by supervised fine-tuning is a paradigm that is more
label-efficient and enhances the effectiveness of labeled data. During pre-training, unlabeled
images are effectively clustered together in the latent space, resulting in precise decision
boundaries between distinct classes. Subsequent supervised fine-tuning, based on this clustering,
consistently outperforms random initialization. It is also a better approach because it not only
captures shared information from multiple sources, such as images and text, but also maximizes
mutual information [18]. Therefore, the adoption of a model rooted in the contrastive approach
aligns seamlessly with our research objectives.
b: COSINE SIMILARITY
In an ideal world, the vector representations of a text and its corresponding image should be
equal. The similarity between the embedded representations for each text and each image represents
the "goodness" of our model. Similarly, the "badness" is measured by the dissimilarity between
them. An optimal model has maximized goodness and minimized badness.
To assess this "goodness" we require a method of computing the distance between the image and text
vectors. The Euclidean distance, often known as the straight-line distance, is a good option for
determining the separation between two points in 2 or 3 dimensions. All points, however, tend to be
far apart by the Euclidean measure in a high-dimensional space. Hence, the angle between vectors is
a more useful metric in higher dimensions. Thus we use the cosine similarity, which calculates the
cosine of the angle between two vectors [19], to determine the similarity between them in our
model, as can be seen in Figure 1.
FIGURE 1. Cosine similarity.
A greater cosine similarity is produced by more comparable vectors. The dot product of the vectors
is used for computation. When not using unit vectors, we must either normalize the vectors first or
divide the dot product by the product of the vector norms. Equation (1) gives the cosine similarity
value (cos θ) between given input vectors a and b.
\[ \cos(\theta) = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}} \tag{1} \]
c: SOFTMAX FUNCTION
When performing supervised categorization in the contrastive training of the CLIP model, the
InfoNCE loss is applied after a softmax function has been applied to the network outputs. Given a
vector of real values, the softmax function restricts their range to lie between 0 and 1, with the
total of all the numbers equaling 1. Another characteristic of softmax is that it tends to make one
of the values much larger than the others; consequently, we get a positive example (closest vector)
whose value is significantly larger than those of the random ones and can be easily identified
[20]. Therefore, we first take the softmax of the values and then the negative log of the labeled
category to calculate the categorical cross-entropy loss. Refer to (2) for the softmax function.
\[ \sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K, \text{ and } z = (z_1, \ldots, z_K) \in \mathbb{R}^K \tag{2} \]
where,
\sigma(z)_i = softmax function
z = input vector
e^{z_i} = standard exponential function for input vector
K = number of classes in the multi-class classifier
e^{z_j} = standard exponential function for output vector
d: InfoNCE CONTRASTIVE LOSS FUNCTION
The contrastive loss function compares the distance between a sample and the network's output for a
positive example of the same class to its distance from a negative example. If positive samples are
encoded to similar (closer) representations and negative samples to dissimilar (farther)
representations, the loss is low. This is achieved by taking the cosine similarities between the
vectors and treating the resulting values, after the softmax, as the prediction probabilities of a
standard categorization network. The loss function we use for contrastive learning in this paper is
InfoNCE (NCE is an acronym for Noise-Contrastive Estimation), which is a variant of the
cross-entropy loss function [21]. The similarity of positive pairs is maximized while that of
negative pairs is minimized using this function. In (3), z_a, z_p, and z_n represent the anchor,
positive, and negative embeddings. As is typical in self-supervised learning, we have one positive
sample and many negatives (N). Through the cos_sim function, the vector cosine similarity is
assessed. We aim to maximize the cosine similarity between z_a and z_p, bringing them closer
together by using this function. The reverse is true for z_a
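The three ingredients described in subsections b to d can be illustrated together with a minimal sketch in plain Python. It is not the implementation used in this work: the embeddings are hand-picked toy vectors, the function name `info_nce` and the `temperature` parameter are assumptions introduced for illustration, and the loss follows the commonly used single-anchor form of InfoNCE (softmax over the anchor's cosine similarities, followed by the negative log of the positive entry).

```python
import math


def cos_sim(a, b):
    """Cosine similarity as in Eq. (1): dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(z):
    """Softmax as in Eq. (2); subtracting max(z) only improves numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def info_nce(z_a, z_p, z_negs, temperature=0.1):
    """Single-anchor InfoNCE sketch (temperature value is an assumption).

    The positive similarity is placed at index 0, so the loss is the
    negative log of the softmax probability assigned to the positive.
    """
    logits = [cos_sim(z_a, z_p) / temperature]
    logits += [cos_sim(z_a, z_n) / temperature for z_n in z_negs]
    return -math.log(softmax(logits)[0])


# Toy embeddings (illustrative values only).
anchor = [0.05, 0.25, 0.7]
positive = [0.0, 0.2, 0.8]
negatives = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.1]]

# Well-aligned positive: low loss.
loss_good = info_nce(anchor, positive, negatives)
# Mismatched "positive" (a dissimilar vector treated as the positive): high loss.
loss_bad = info_nce(anchor, negatives[0], [positive, negatives[1]])
print(f"aligned loss = {loss_good:.4f}, mismatched loss = {loss_bad:.4f}")
```

The aligned case yields a much smaller loss than the mismatched case, which is the behavior the loss is meant to reward: the anchor is pulled towards its positive and pushed away from the negatives.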
a: CONTRASTIVE PRE-TRAINING
During this phase, a batch of N (32,768) images paired with their respective descriptions, e.g.,
<image1, text1>, <image2, text2>, ..., <imageN, textN>, is processed through the image and text
encoders simultaneously to obtain their vector representations (embeddings).
A series of purple text cards is fed into the text encoder in the example image. Each card's output
is a list of numbers. "Pepper the Aussie dog", for instance, would enter the text encoder and
emerge as a series of numbers like (0, 0.2, 0.8). The same thing takes place with the images: each
image enters the image encoder and comes out as a series of numbers. The image of Pepper the
Australian dog will appear as (0.05, 0.25, 0.7).
The CLIP model is then trained to predict which image embedding belongs to which text embedding in
a batch. To achieve this, the contrastive pre-training method computes the cosine similarity
between every pair of image embeddings (I1, I2, ..., IN) and text embeddings (T1, T2, ..., TN).
Over the calculated similarity scores, an optimization is executed by applying a symmetric
cross-entropy loss to maximize the cosine similarity between the embeddings of real pairs in the
batch while minimizing the cosine similarity between the embeddings of incorrect pairings.
FIGURE 5. Contrastive pre-training phase.
The step-by-step procedure is:
• N image-text pairs in a batch are fed into the model.
• The image encoder computes an image vector for each image in the batch. I1 is the vector for the
first image, I2 for the second, and so on. Each vector has size d, where d is the size of the
latent dimension. As a result, an N × d matrix is the outcome of this stage.
• Similarly, the text descriptions are transformed into text embeddings (T1, T2, ..., TN),
producing an N × d matrix.
• Finally, we multiply those matrices and calculate the cosine similarities for every pair of image
and text description. This produces an N × N matrix, as shown in Figure 5.
• The light blue squares in Figure 5 stand for the pairs where the text and image coincide. As an
illustration, T1 and I1 are the embedded forms of the first text and first picture, respectively.
The highest cosine similarity between I1 and T1 is what we are aiming for. The same is desired for
I2 and T2, and for all other light blue squares. The greater these cosine similarities, the more
"goodness" our model possesses.
• The cosine similarities of the off-diagonal elements (where i ≠ j), which are dissimilar pairs
<I1, T2>, <I1, T3>, ..., <Ii, Tj>, are minimized in a contrastive manner, separating the actual
image from all the other incorrect text descriptions (e.g., image I1 is described by T1 and not by
T2, T3, etc.).
• The grey squares in Figure 5 show where the text and image are out of alignment. For instance, T1
might be the text "Pepper the Aussie pup" while I2 might be a picture of a raccoon. Since this pair
measures "badness," the cosine similarity between this image (I2) and the words "Pepper the Aussie
pup" should be quite low.
• The model then uses the symmetric cross-entropy loss as its optimization objective, which
corresponds to the InfoNCE loss. This loss is minimized in both the image-to-text direction and the
text-to-image direction, as the contrastive loss matrix keeps both the <I1, T2> and <I2, T1> cosine
similarities.
b: CREATE DATASET CLASSIFIER FROM LABEL TEXT
This step encodes all the labels/objects in the following context format: "a photo of a {object}".
The vector representation of each context is generated by the text encoder.
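The contrastive pre-training steps listed in subsection a (normalize the embeddings, build the N × N cosine similarity matrix, and apply a cross-entropy loss along both rows and columns) can be sketched as follows. This is a simplified illustration in plain Python rather than the PyTorch code used for CLIP; the toy embeddings, the function names, and the `temperature` value of 0.07 are assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def similarity_matrix(image_embs, text_embs):
    """Return the N x N matrix whose entry [i][j] is cos_sim(I_i, T_j)."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    sims = similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # transpose: text-to-image
    return (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t)) / 2


# Toy batch with N = 3 pairs and latent dimension d = 3 (illustrative values).
images = [[0.05, 0.25, 0.7], [0.9, 0.1, 0.0], [0.1, 0.9, 0.2]]
texts = [[0.0, 0.2, 0.8], [0.8, 0.2, 0.1], [0.2, 0.8, 0.1]]

sims = similarity_matrix(images, texts)
aligned_loss = symmetric_contrastive_loss(images, texts)
# Rotating the texts breaks every pairing, so the loss should rise sharply.
shuffled_loss = symmetric_contrastive_loss(images, texts[1:] + texts[:1])
print(f"aligned = {aligned_loss:.4f}, shuffled = {shuffled_loss:.4f}")
```

In this sketch the diagonal of `sims` plays the role of the light blue squares and the off-diagonal entries play the role of the grey squares; the loss is small only when the diagonal dominates each row and each column.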
B. TECHNICAL PROCEDURE
The proposed solution uses the CLIP multimodal model, implemented in Python with the PyTorch
machine learning framework and the Torchvision library, to better understand and classify
multimodal hateful memes involving images and text. It makes use of a pre-trained CLIP model to
create a custom classifier without any training required. The generated hateful
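As a rough illustration of how a classifier can be built from label text without training, the sketch below scores a meme embedding against two prompt embeddings and applies the decision rule from Section III (hateful if y1 > y0). Everything concrete here is an assumption: the two label strings, the toy embedding table standing in for the CLIP text and image encoders, and the `temperature` value. Only the prompt format "a photo of a {object}" and the y1 > y0 rule come from the text.

```python
import math

PROMPT_TEMPLATE = "a photo of a {}"              # context format from the text
LABELS = ["non-hateful meme", "hateful meme"]    # hypothetical label wording

# Hypothetical stand-in for the text encoder: fixed toy vectors per prompt.
TOY_TEXT_EMBEDDINGS = {
    "a photo of a non-hateful meme": [0.9, 0.1, 0.2],
    "a photo of a hateful meme": [0.1, 0.9, 0.3],
}


def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(meme_embedding, temperature=0.01):
    """Return (label, y) where y = (y0, y1) is the probability vector."""
    prompts = [PROMPT_TEMPLATE.format(label) for label in LABELS]
    sims = [cos_sim(meme_embedding, TOY_TEXT_EMBEDDINGS[p]) for p in prompts]
    y = softmax([s / temperature for s in sims])
    label = "hateful" if y[1] > y[0] else "non-hateful"
    return label, y


# Toy meme embeddings standing in for the output of the image encoder.
label_a, y_a = classify([0.2, 0.8, 0.4])
label_b, y_b = classify([0.8, 0.2, 0.1])
print(label_a, label_b)
```

With a real pre-trained model, the toy table would be replaced by the encoder outputs, and the choice of prompt wording becomes the prompt engineering step discussed in this paper.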
FIGURE 24. Training cross entropy loss.
C. COMPARATIVE ANALYSIS
The classification of hateful memes can be quite a challenging task due to the dual nature of the
data that needs to be extracted from the input images. For the successful and accurate prediction
of hateful memes, both the image and text features need to be extracted from the input meme, which
allows us to combine the contributions of the text and image embeddings towards detecting the
hatefulness of the input. However, far fewer studies focus on the multimodal representation of
data, namely information that consists of multiple channels, since most classification tasks are
"unimodal" or can only extract and learn in one mode, either text or images. Hence, the
conventional methods of classification of hateful memes rely on unimodal models, which prove to be
much less effective, as training a model using only one modality out of images, text, or video is
an extremely difficult endeavor. Thus, this task requires a multimodal model in which text and
visuals are trained jointly for accurate results.
There have been a few previous works that have taken this direction and implemented multimodal
study of hateful memes using methods such as visual question answering, vision-language
pre-training models, and encoders based on convolutional neural networks, apart from making use of
unimodal techniques.
Zhong et al. propose a new model that combines multimodal features with rules, achieving the
highest accuracy of 86.8% [28]. It leverages a specific dataset developed by Facebook with over
10,000 memes. The specimens encompass memes in the percentages of 10% unimodal hate, 20% benign
image confounder, 20% benign text confounder, and 10% random non-hateful. Here, a clustering
technique based on perceptual hash is used to group the meme images together. By using a
straightforward comparison on their strings, the memes are sorted into groups, including "3-tuple,"
"2-tuple," "unimodal hate," etc. The "3-tuple," for example, is made up of 3 memes, with the first
meme having an image like the second meme and text equivalent to the third meme. The second meme
and the third meme, however, are not connected. The labels for a "3-tuple" consist of 1, 0, and 0,
where 1 denotes hate and 0 denotes non-hate, while the labels for a "2-tuple" consist of 1 and 0.
From this analysis, rules were formulated, such as Rule 1, where the hateful probabilities for
samples in a "3-tuple" were set to (1, 0, 0), and Rule 2, where for samples in a "2-tuple" the
hateful probabilities were set to (1, 0), with the larger hateful probability being adjusted to 1.
Alternatively, Ahmed et al. explore the use of unimodal text and image models, such as BERT, LSTM,
VGG16, ResNet50, SE-ResNet50, and XSE-ResNet architectures, and combine them into multimodal models
for predicting hateful memes with evaluation metrics such as the AUC-ROC score, F1 score, and
accuracy score [29]. The dataset selected for the endeavor was also the "Hateful Memes Challenge,"
released by Facebook AI, but it yielded a maximum prediction accuracy of 66.3% in classifying memes
as hateful or non-hateful.
Fan et al. utilize data from Meta's Hateful Meme Detection Challenge and build three models, with
their best model, VisualBERT with external feature extraction, achieving 62.4% accuracy [30]. Kiela
et al. highlight the difficulty of the hateful meme detection task, with state-of-the-art methods
performing poorly compared to humans (64.73% vs. 84.7% accuracy) [9]. The study then evaluates a
variety of models on the hateful memes dataset, both unimodal models and multimodal models that
were unimodally pretrained (a BERT model combined with a ResNet), and concludes that the random and
majority-class baselines lie at 50 AUROC for the unimodal text-only or visual-only classifiers.
Multimodal models such as ViLBERT are also able to achieve a maximum of 72 AUROC.
Lastly, Sethi et al. investigate the classification of hateful memes using pre-trained models like
VGG19 and Xception, combined with machine learning models like support vector machines and Naïve
Bayes [31]. They achieve the highest F1 score of 0.584 using an integrated stacked model technique.
Table 3 shows the results of a comparison between the evaluation metrics (AUROC and accuracy score)
achieved by various models utilized in several state-of-the-art approaches for the classification
of hateful memes [32], [33], [34], [35].
TABLE 3. Performance comparison of proposed and existing models.
The results of this proposed study demonstrate that the CLIP model, fine-tuned with prompt
engineering, can achieve an accuracy rate of 87.42% and an AUROC of 88.35 in the classification of
hateful memes when implemented on the Facebook Hateful Memes Dataset. This represents a substantial
improvement compared to previous studies that employed machine learning methods for the detection
of hate speech in memes.
As shown in Table 3, the proposed CLIP model outperforms all other models in terms of accuracy.
In contrast with previous works, the implementation of the CLIP model also presents us with an
easy, feasible option for the classification process, since this model is already pre-trained and
does not need to go through a highly time-consuming process of training over large sets of data.
This not only saves time but also saves the effort of a large data accumulation process. The
proposed model also does not require a complex combination of rules for attaining high accuracy and
is thus a simpler method of achieving efficiency.
The approach taken in this study is also able to provide better results than previous works since
the dataset chosen,
the Facebook Hateful Memes Dataset, is a more difficult dataset containing several false positive
examples and benign confounders and is paired with prompt engineering. Incorporation of prompt
engineering is a deliberate and systematic curation of accurate prompts to serve as classes for
increasing the accuracy in meme classification. The strategic pairing of this approach with the
intricacies of the Facebook Hateful Memes Dataset aims to mitigate the impact of false positives
and confounding variables, ultimately elevating the discriminatory capabilities of the proposed
model and making it challenging to rely on unimodal signals, resulting in the success of multimodal
models. This makes the proposed strategy more efficient and precise as compared to others.
V. FUTURE SCOPE
For detecting hateful memes, our model, in which text and visuals are handled jointly, was able to
provide successful results with precision. Its accuracy can be further improved by training the
model specifically for a particular dataset, since the model presently used is pre-trained
contrastively on a general dataset. However, training on multimodal data is an extremely cumbersome
process, and there is a lack of easy access to the relevant hardware and software required for it.
Social media's development over the past few decades has made a wealth of information readily
accessible online, which makes a multimodal dataset such as the one used extremely complex and very
large. The training process will thus require a lot of GPU hours and memory before the model can be
used to successfully classify hate speech. In the future, this can be made possible with access to
better technology and time to develop, test, and improve models used for hate speech detection.
VI. CONCLUSION
This paper aims to develop a novel and efficient architecture for detecting and classifying
multimodal hate speech in memes circulating through social media. The suggested
the user is protected. The perpetrators spreading hate online through such memes can then be
identified and punished. Additionally, further research into this suggested method may also aid in
improving the accuracy of hate speech detection in memes, by exploring the possibilities of
advanced machine learning algorithms in managing multimodal data.
ACKNOWLEDGMENT
The authors thank the Universiti Kebangsaan Malaysia for supporting this work under GUP 2023-010,
and the Advanced and Innovative Research Laboratory, India, for the technical support.
REFERENCES
[1] R. Richard and G. Giorgi, "What is a meme, technically speaking?" Inf., Commun. Soc.,
pp. 1–19, Feb. 2023.
[2] Z. Mansur, N. Omar, and S. Tiun, "Twitter hate speech detection: A systematic review of
methods, taxonomy analysis, challenges, and opportunities," IEEE Access, vol. 11,
pp. 16226–16249, 2023, doi: 10.1109/ACCESS.2023.3239375.
[3] E. K. Boahen, B. E. Bouya-Moko, F. Qamar, and C. Wang, "A deep learning approach to online
social network account compromisation," IEEE Trans. Computat. Social Syst., vol. 10, no. 6,
pp. 3204–3216, Dec. 2023, doi: 10.1109/TCSS.2022.3199080.
[4] K. Abbas, M. K. Hasan, A. Abbasi, U. A. Mokhtar, A. Khan, S. N. H. S. Abdullah, S. Dong,
S. Islam, D. Alboaneen, and F. R. A. Ahmed, "Predicting the future popularity of academic
publications using deep learning by considering it as temporal citation networks," IEEE Access,
vol. 11, pp. 83052–83068, 2023.
[5] H. Hosseinmardi, S. Arredondo Mattson, R. Ibn Rafiq, R. Han, Q. Lv, and S. Mishra, "Detection
of cyberbullying incidents on the Instagram social network," 2015, arXiv:1503.03909.
[6] H. Zhong, H. Li, A. C. Squicciarini, S. M. Rajtmajer, C. Griffin, D. J. Miller, and
C. Caragea, "Content-driven detection of cyberbullying on the Instagram social network," in Proc.
IJCAI, vol. 16, 2016, pp. 3952–3958.
[7] Y. Chen and F. Pan, "Multimodal detection of hateful memes by applying a vision-language
pre-training model," Plos One, vol. 17, no. 9, 2022, Art. no. e0274300.
[8] Y. Zhou, Z. Chen, and H. Yang, "Multimodal learning for hateful memes detection," in Proc.
IEEE Int. Conf. Multimedia Expo. Workshops (ICMEW), Jul. 2021, pp. 1–6,
doi: 10.1109/ICMEW53276.2021.9455994.
[9] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine, "The
hateful memes challenge: Detecting hate speech in multimodal memes," 2020, arXiv:2005.04790.
[10] T. Deshpande and N. Mani, "An interpretable approach to hateful meme
strategy for this is making use of OpenAI’s latest multimodal detection,’’ in Proc. Int. Conf. Multimodal Interact., New York, NY, USA,
model - CLIP, to better understand multimodal hate speech in Oct. 2021, pp. 723–727, doi: 10.1145/3462244.3479949.
[11] A. A. Ahmed, M. K. Hasan, M. M. Jaber, S. M. Al-Ghuribi, D. H. Abd,
memes that contain both visual images and text captions. The W. Khan, A. T. Sadiq, and A. Hussain, ‘‘Arabic text detection using
CLIP model analyses the image and its accompanying text to rough set theory: Designing a novel approach,’’ IEEE Access, vol. 11,
determine whether the two modalities taken together are hate- pp. 68428–68438, 2023.
[12] A. K. Thakur, F. Ilievski, H. Sandlin, Z. Sourati, L. Luceri, R. Tommasini,
ful or not. The ‘‘Facebook Hateful Meme Dataset,’’ which and A. Mermoud, ‘‘Multimodal and explainable internet meme classifica-
consists of 10,000 examples of new multimodal memes (text tion,’’ Dec. 2022, arXiv:2212.05612.
+ image) created by Facebook AI, is utilized as the dataset [13] A. Radford, J. Wook Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal,
G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever,
for the proposed method. The implemented model has been ‘‘Learning transferable visual models from natural language supervision,’’
able to achieve an accuracy of 87.42% in recognizing hateful 2021, arXiv:2103.00020.
[14] Y. Qu, X. He, S. Pierson, M. Backes, Y. Zhang, and S. Zannettou,
memes. This scope of this study can be further extended to ‘‘On the evolution of (Hateful) memes by means of multimodal contrastive
filter out these memes from the social media platforms to learning,’’ in Proc. IEEE Symp. Secur. Privacy (SP), San Francisco, CA,
keep a check on the hatred spreading through such content USA, May 2023, pp. 293–310, doi: 10.1109/sp46215.2023.10179315.
[15] K. Abbas, M. K. Hasan, A. Abbasi, S. Dong, T. M. Ghazal,
online. This will help control the hate spread against minority S. N. H. S. Abdullah, A. Khan, D. Alboaneen, F. R. A. Ahmed,
communities and diminish any form of discrimination such T. E. Ahmed, and S. Islam, ‘‘Co-evolving popularity prediction in temporal
as racism or sexism through cyber platforms. It will also curb bipartite networks: A heuristics based model,’’ IEEE Access, vol. 11,
pp. 37546–37559, 2023.
cyber bullying and hate speech on social media generated by [16] A. Bhandari, ‘‘Bias in AI: A comprehensive examination of factors and
trolls using offensive memes. Amidst the incoming network improvement strategies,’’ Int. J. Comput. Sci. Eng., vol. 10, no. 6, pp. 9–14,
traffic, such hate speech will be recognized and routed so that Jun. 2023, doi: 10.14445/23488387/ijcse-v10i6p102.
GREESHMA ARYA received the B.Tech. and M.Tech. degrees (Hons.) from Dr. A. P. J. Abdul Kalam
University, Lucknow, India, and the Ph.D. degree in electronics and communication engineering from
Uttarakhand Technical University, Dehradun, India. She is currently an Associate Professor with the
Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University
for Women (IGDTUW) (State Government University), Delhi, India. She has more than 17.5 years of
experience in academics and research. She has published more than 35 research articles in various
international journals (including SCI, ESCI, Scopus, and ISI indexed). Her areas of interest include
wireless sensor networks, wireless communication, renewable energy sources, network security, the
Internet of Things, 5G network technology, artificial intelligence, and deep learning.

MOHAMMAD KAMRUL HASAN (Senior Member, IEEE) received the Ph.D. degree in electrical and
communication engineering from the Faculty of Engineering, International Islamic University,
Malaysia, in 2016. He is currently an Associate Professor and the Head of the Network and
Communication Technology Laboratory, Faculty of Information Science and Technology, Center for
Cyber Security, Universiti Kebangsaan Malaysia (UKM). He specializes in elements pertaining to
cutting-edge information-centric networks, computer networks, data communication and security,
mobile network and privacy protection, cyber-physical systems, industrial IoT, transparent AI, and
electric vehicle networks. He has published more than 230 indexed papers in ranked journals and
conference proceedings. He is a member of the Institution of Engineering and Technology and the
Internet Society. He is a Certified Professional Technologist in Malaysia. He has actively
participated in many events/workshops/trainings for the IEEE and IEEE Humanity Programs in Malaysia.
He served as the Chair for IEEE Student Branch, from 2014 to 2016. He is the general chair, the
co-chair, and a speaker of conferences and workshops for the sake of society and academy knowledge
building, sharing, and learning. He has been contributing and working as a volunteer for
underprivileged people for the welfare of society. He is an editorial member in many prestigious
high-impact journals, such as IEEE, IET, Elsevier, Frontier, and MDPI.

ASHISH BAGWARI (Senior Member, IEEE) received the B.Tech. (Hons.), M.Tech. (Hons.), and Ph.D.
degrees in electronics and communication engineering. He is currently the Head of the Department of
Electronics and Communication Engineering, Women Institute of Technology (WIT) (Institute of State
Government), Affiliating Institution of Uttarakhand Technical University, Dehradun, India. He has
more than 14.5 years of experience in industry, academics, and research. He has published more than
170 research articles in various international journals and conferences, including IEEE
international conferences. His areas of interest are cognitive radio networks, mobile communication,
sensor networks, wireless and 5G communication, digital communication, and mobile ad-hoc networks.
He is an active member of various professional societies, such as IEEE, USA. He is also a Senior
Member of the Institute of Electronics and Telecommunication Engineers (IETE), India; a Lifetime
Member and a Professional Member of the Association for Computing Machinery (ACM); and a member of
the Machine Intelligence Research Laboratory Society. He was a Gold Medalist during his master's
study. He also received the Best WIT Faculty Award, in 2013 and 2015; the Best Project Guide Award,
in 2015; and the Corps of Electrical and Mechanical Engineers Prize from the Institution of
Engineers, India (IEI), in December 2015, for his research work. Also, he received the Outstanding
Scientist Award 2021 from VDGOOD Technology, Chennai, India, in November 2021; the Dr. A. P. J.
Abdul Kalam Life Time Achievement National Award 2022 from National Institute for Socio Economic
Development (NISED), Bangalore, India, in June 2022; and the Best Teacher Award-2023 from Veer Madho
Singh Bhandari Uttarakhand Technical University (State Government Technical University), Dehradun,
in September 2023. He was named in Who's Who in the World 2016 (33rd Edition) and 2017 (34th
Edition).
NURHIZAM SAFIE (Member, IEEE) received the master's degree in information technology from UKM, in
1999, the M.B.A. degree from Anglia Ruskin University, U.K., in 2019, and the Ph.D. degree in
management information systems (MIS). He is an Associate Professor and the Dean of the Faculty of
Information Science and Technology. Before this position, he was a Research Fellow with United
Nations University, a United Nations academic arm. He was conferred the Professional Technologist
[Ts./P.Tech.(IT)] credential from the Malaysian Board of Technology (MBoT), in 2018. During the
Ph.D. study, he received the National Science Fellowship (NSF) Scholarship from the Malaysian
Ministry of Science, Technology, and Innovation (MoSTI).

AAISHANI DE is pursuing the B.Tech. degree with the Department of Electronics and Communication
Engineering, Indira Gandhi Delhi Technical University for Women. She is a Researcher. Her technology
stack includes Java, Spring Boot, and AI/ML development in Python. She has published more than 10
research articles in various international journals (including SCI, ESCI, and Scopus indexed). Her
research work primarily focuses on the topics of computer vision, natural language processing, and
deep learning. Her areas of interest include web development, machine learning, data science, and
artificial intelligence. She has received the Incentive Award for Excellence in Research from Indira
Gandhi Delhi Technical University for her work.