
Received 29 December 2023, accepted 17 January 2024, date of publication 1 February 2024, date of current version 15 February 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3361322

Multimodal Hate Speech Detection in Memes Using Contrastive Language-Image Pre-Training
GREESHMA ARYA1, MOHAMMAD KAMRUL HASAN2, (Senior Member, IEEE), ASHISH BAGWARI3, (Senior Member, IEEE), NURHIZAM SAFIE2, (Member, IEEE), SHAYLA ISLAM4, (Senior Member, IEEE), FATIMA RAYAN AWAD AHMED5, AAISHANI DE1, MUHAMMAD ATTIQUE KHAN6,7, (Senior Member, IEEE), AND TAHER M. GHAZAL8,9,10, (Senior Member, IEEE)
1 Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University for Women, New Delhi 110006, India
2 Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor 43600, Malaysia
3 Department of Electronics and Communication Engineering, Uttarakhand Technical University, Dehradun 248007, India
4 Institute of Computer Science and Digital Innovation, UCSI University Malaysia, Kuala Lumpur 56000, Malaysia
5 Computer Science Department, College of Computer Engineering and Science, Prince Sattam Bin Abdulaziz University, Al-Kharj 16273, Saudi Arabia
6 Department of Computer Science, HITEC University, Taxila 47080, Pakistan
7 Department of CS and Mathematics, Lebanese American University, Beirut 1102 2801, Lebanon
8 Centre for Cyber Physical Systems, Computer Science Department, Khalifa University, Abu Dhabi, United Arab Emirates
9 Center for Cyber Security, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia (UKM), Bangi, Selangor 43600, Malaysia
10 Applied Science Research Center, Applied Science Private University, Amman 11937, Jordan

Corresponding authors: Mohammad Kamrul Hasan ([email protected]), Nurhizam Safie ([email protected]), Ashish Bagwari
([email protected]), and Shayla Islam ([email protected])
This work was supported in part by the Universiti Kebangsaan Malaysia under Grant GUP-2023-010.

ABSTRACT In contemporary society, the proliferation of online hateful messages has emerged as a pressing concern, inflicting deleterious consequences on both societal fabric and individual well-being. The automatic detection of such malevolent content online, using models designed to recognize it, holds promise in mitigating its harmful impact. However, the advent of ‘‘Hateful Memes’’ poses fresh challenges to the detection paradigm, particularly within the realm of deep learning models. These memes, consisting of a textual element paired with an image, are individually innocuous, but their combination causes a detrimental effect. Consequently, entities responsible for disseminating information via web browsers are compelled to institute mechanisms that regulate and automatically filter out such injurious content. Effectively identifying hateful memes demands algorithms and models endowed with robust vision and language fusion capabilities, capable of reasoning across diverse modalities. This research introduces a novel approach by leveraging the multimodal Contrastive Language-Image Pre-Training (CLIP) model, fine-tuned through the incorporation of prompt engineering. This methodology achieves an accuracy of 87.42%. Comprehensive metrics such as loss, AUROC, and F1 score are also computed, corroborating the efficacy of the proposed strategy. Our findings suggest that this approach presents an efficient means to regulate the dissemination of hate speech in the form of viral meme content across social networking platforms, thereby contributing to a safer online environment.

INDEX TERMS CLIP, facebook hateful meme dataset, multimodal, contrastive learning, zero-shot
prediction, InfoNCE contrastive loss, prompt engineering, cosine similarity matrix.
I. INTRODUCTION
The pervasive use of hateful memes as a vehicle for spreading animosity on online platforms has become an alarming trend. The term ‘‘hate speech’’ has solidified its presence as a ubiquitous phenomenon in the realm of the internet. Memes, which can be any shareable content encompassing photographs or videos that are spread, altered, and repeated over time [1], serve as conduits for the transmission of such hate among individuals. A speech exhibiting hostility or violence towards individuals who have protected traits safeguarded by societal norms or constitutional provisions is categorized as hateful speech. The direct and automatic identification of hateful memes assumes paramount significance in fostering a healthy social networking environment. Proactively filtering out content that disseminates hatred before it reaches online platforms plays a pivotal role in curbing the proliferation of social media hate.

The associate editor coordinating the review of this manuscript and approving it for publication was Tai Fei.

Hateful memes are characterized by a blend of background images and text captions, which encapsulate the intentions and sentiments of users. While these images or captions may appear harmless when viewed in isolation, their combination elicits a disturbing effect. This hidden meaning in the combination is readily visible to the human mind but is often concealed from conventional scanners, as it is challenging for machines to recognize this difference. As a result, hateful users intentionally post hate speech in the form of such offensive memes to get past traditional scanners. Traditional automatic text detection techniques and visual feature analysis work in isolation, disregarding visual and text features respectively, rendering it challenging to identify multimodal hate speech. To address this, a model must transcend the limitations of processing individual modalities and combine the processing of two or more modalities to comprehend the complexities inherent in the amalgamation of text and images.

Consider phrases such as ‘‘look how many people love you’’ or ‘‘you look beautiful today.’’ Paired with seemingly innocuous images of a skunk or tumbleweed, these otherwise benign statements transform into destructive and mean-spirited expressions. This perceptual nuance, easily discerned by humans, poses a formidable challenge for AI systems. Despite recent initiatives to detect hate speech in text and images, there is a notable scarcity of models adept at recognizing hate speech in multimodal contexts that encompass both text and images.

This study draws upon the ‘‘Facebook Hateful Meme Dataset,’’ a rich repository of over 10,000 freshly curated multimodal examples (text + image) generated by Facebook AI, as its primary data source. The proposed methodology employs the CLIP (Contrastive Language-Image Pre-training) model to analyze the accompanying text and images, to determine whether their combined expression qualifies as hateful. Beyond mere detection, the approach holds the potential to proactively filter out harmful memes, contributing to the overarching goal of mitigating the spread of hate on social media platforms.

The implementation process involves training the model on the dataset developed by Facebook and subsequently subjecting it to rigorous testing using a diverse collection of images, with the model's accuracy serving as a quantitative benchmark. Notably, the CLIP model exhibits high accuracy in identifying hostile memes, marking a significant stride in the ongoing efforts to fortify the digital landscape against the deleterious impact of hate speech online.

A. MOTIVATION
The rise of online hateful messages in recent times has emerged as a pressing societal concern, posing significant harm to both individuals and the community at large [2]. These ‘‘Hateful Memes’’ represent a particularly vexing challenge. The motivation behind addressing this issue lies in the imperative to curb the adverse consequences of such content. Following the Covid-19 pandemic, the use of social media has exploded, and hate speech in the form of memes is being widely used for cyberbullying as well as discrimination against minority communities in a variety of ways, including racism, sexism, sexual harassment, and discrimination based on one's sexual orientation or religious background. Hence, it becomes vital to regulate such hostile content to safeguard users as well as to establish clean and secure social networking platforms [3], [4]. The unique nature of hateful memes, where seemingly innocuous text combines with images to create harmful effects, underscores the need for automated detection mechanisms. It becomes evident that internet platforms and companies responsible for delivering content to users must play an active role in filtering out these harmful messages. To accomplish this, the key lies in developing algorithms and models that exhibit a robust fusion of vision and language capabilities, allowing them to reason effectively across diverse modalities and thereby contribute to the mitigation of online hate and its far-reaching consequences.

B. CONTRIBUTION
This research makes a substantial contribution by addressing the pressing issue of online hate through the development and application of advanced deep learning models. The crux of the novelty in this research resides in the introduction and execution of a pioneering methodology that revolves around the integration of the multimodal CLIP model with a strategic application of prompt engineering. This fusion forms the basis of a unique approach to the classification of hateful and non-hateful memes. The technique capitalizes on the cutting-edge capabilities of the CLIP model, through which the study aims to harness the synergistic relationship between language and image features, enhancing the overall discriminative power of the proposed meme classification system. Central to this innovation is the recognition of the pivotal role played by accurate textual prompts, meticulously designed to serve as explicit classes for accurate and nuanced meme categorization. The precision and relevance of these prompts are systematically crafted to counter challenges such as false positives, thereby significantly elevating the discriminatory accuracy of the proposed method. Thus, this paper delves into the intricacies of this novel synergy between the multimodal model and prompt engineering, which facilitates enhanced accuracy and efficacy in meme classification. Through the proposed methodology, an accuracy rate of 87.42% is attained in the detection of hateful memes.


These empirical results demonstrate a noteworthy improvement in classification accuracy as compared to other state-of-the-art models, which is directly attributed to the integration of accurate prompts into the CLIP model. Furthermore, this research extends its contribution to the broader context of online content regulation. The insights gained from this work can inform the development of content filtering systems by social networking companies and platforms, ultimately creating safer online environments. By offering a robust solution that effectively addresses the fusion of text and images in hateful content, this research takes a meaningful step towards mitigating the societal impacts of online hate speech and promoting a healthier digital ecosystem for all users.

The remaining sections are organized as follows: Section II discusses recent works related to hate speech detection in multimodal memes using various techniques and their learnings. Section III presents the problem definition, the proposed solution, and the motivation behind its selection, along with a detailed description of the proposed framework, including its working, architecture, and technical procedure. Section IV deliberates the results and predicted outcomes obtained using the proposed model. Finally, Section V discusses the future scope of the method, and Section VI concludes the project.

II. LITERATURE SURVEY
The field of hate speech identification has seen a lot of effort, but relatively little of it has focused on multimodal hate speech detection. Both NLP and network science have studied hate speech in depth. Hate speech has long been detected using language analysis, and social media is one of the key places where hate speech with a range of targets appears. The typical process for hate speech detection is to obtain a sentence embedding and feed it into a binary classifier. A variety of language-based hate speech identification datasets have been made available for research. According to Yang et al., adding image embedding information to text immediately improves hate speech recognition. With the aid of Crowdflower employees, Hosseinmardi et al. categorized a dataset of Instagram photographs and the comments that go with them. Two inquiries were made to the staff: first, does the case represent cyberaggression, and second, does it represent cyberbullying? They demonstrate that incorporating picture characteristics enhances classification efficiency. The dataset included 998 cases, 90% of which had high confidence ratings and 52% of which were labeled as bullying [5]. Zhong et al. compiled a dataset of 3000 samples of Instagram posts and comments in a similar manner. Two Mechanical Turk workers were asked: is there any bullying in the comments, and if so, can the bullying be linked to the image's subject matter? 560 incidents of bullying were discovered. They evaluate several features and simple classifiers for automatically identifying bullying [6].

In the study by Yuyang Chen and Feng Pan, ‘‘Multimodal detection of hateful memes by applying a vision-language pre-training model,’’ a triplet is created by stacking the visual features, object tags, and text features of memes produced by the Visual features in Vision-Language (VinVL) object detection model and optical character recognition (OCR) technology, in order to perform cross-modal meme learning. After being fine-tuned and coupled with a random forest (RF) classifier, their model (OSCAR+RF) outperformed eleven published baselines on the task of identifying hateful memes, reaching an average accuracy and AUROC of 0.684 and 0.768, respectively, on a public test set. Their study demonstrated that VL-PTMs with anchor point additions can improve the efficiency of deep learning-based hateful meme identification by including a more robust deep learning model [7].

The study by Yi Zhou, Zhenhao Chen, and Huiyuan Yang on multimodal learning for hateful meme identification focuses on the identification of hate speech using a model based on Visual Question Answering [8].

The research on ‘‘The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes’’ by Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine suggests a new challenge set for multimodal classification that focuses on spotting offensive language in multimodal memes [9]. Problematic instances are included in the dataset to make it harder to rely on unimodal signals and to demonstrate the superiority of multimodal models. Despite requiring complex reasoning, the task can be quickly reduced to a binary classification problem.

III. METHODOLOGY
The detection of hateful memes represents a binary classification challenge, seeking to ascertain whether a meme is offensive or hateful through the analysis of multimodal data constituting both text and image signals. Given the inherent complexity of memes, characterized by the coexistence of two modalities, and recognizing the formidable obstacle of detection accuracy in this task, a multi-task learning technique is deemed essential for drawing meaningful statistical conclusions [10], [11].

In this study, the categorization of multimodal hostile memes is characterized as a classification model that predicts the label of a multimodal meme (hateful or non-hateful) based on the associated image and text. To achieve this, models must predict a probability vector y ∈ R² over the two classes. In greater detail, y0 denotes the projected probability that the meme is non-hateful, while y1 represents the probability that the meme is hateful. If y1 > y0, the meme is classified as hateful; otherwise, it is categorized as non-hateful. Leveraging the Contrastive Language-Image Pre-training (CLIP) model, we successfully classify memes within this framework, assigning them to the hateful or non-hateful classes based on the highest cosine similarity score in contrastive learning.
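To make this decision rule concrete, the following is a minimal sketch (illustrative only; the probability values are hypothetical, not outputs of the paper's model):

```python
import torch

# Hypothetical two-class score vector; softmax yields y = [y0, y1],
# where y0 = P(non-hateful) and y1 = P(hateful).
y = torch.softmax(torch.tensor([1.3, 0.4]), dim=0)

# Classify as hateful iff y1 > y0.
label = "hateful" if y[1] > y[0] else "non-hateful"
print(label, y.tolist())
```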


The technique employs a pre-trained model for feature extraction from a meme in a transfer learning context, combined with a downstream classification model that utilizes these features [12]. To operate without the addition of new data or manual labeling, a model is required that can balance the unambiguous information from the multimodal combination and the fuzzy information from the individual modalities while minimizing generalization errors. Consequently, a strategy that integrates statistical theory with state-of-the-art neural networks and optimization methods was developed to discern the offensive memes.

CLIP is one of the most significant advancements in computer vision, as it bridges the domains of computer vision and natural language processing. The challenges of huge datasets and subpar real-world results in conventional vision models are also addressed by the CLIP network. It distinguishes itself by allowing a single method to handle a diverse range of applications without necessitating the construction of extensive custom datasets [13]. This departure from conventional approaches, such as training models like ResNet on vast labeled image datasets, positions CLIP as a widely preferred model in the field of computer vision.

A. CONTRASTIVE LANGUAGE–IMAGE PRE-TRAINING
CLIP is a robust and scalable state-of-the-art multimodal vision and language model introduced by OpenAI. This neural network boasts versatility, as it can be applied to various visual classification benchmarks and adeptly learns visual concepts through natural language supervision. Unlike traditional models, CLIP exhibits remarkable ‘‘zero-shot’’ capabilities akin to GPT-2 and GPT-3, enabling it to perform tasks such as predicting the most pertinent text snippet for a given image, given the names of the visual categories to be recognized, without direct optimization.

Central to CLIP's prowess is its training methodology employing contrastive learning, which aims to map images and text descriptions into a shared latent space. This unique approach enables CLIP to discern whether an image and a textual description match, thereby facilitating tasks like image classification through text-image similarity [14], [15]. This means that CLIP can successfully predict which captions correspond to which images without domain-specific training, making it particularly potent for out-of-the-box text and image search applications [16]. Beyond its fundamental capabilities, CLIP finds application in a myriad of domains, including image generation, image similarity search, image ranking, object tracking, robotics control, image captioning, geo-localization, and more.

Its versatility arises from its comprehensive understanding of the intricate relationships between visual data and the corresponding linguistic representations. This profound understanding is cultivated through training on an extensive corpus of natural language data, encompassing a distinctive dataset composed of 400 million training images paired with their text descriptions, sourced abundantly from the internet. In essence, CLIP is an enhanced image classification model characterized by heightened accuracy and efficiency, heralding a transformative era in the field of multimodal learning.

1) ZERO-SHOT LEARNING USING CLIP
In the proposed solution for meme detection and classification as hateful or non-hateful, the concept of ‘‘zero-shot’’ image classification has been leveraged. This approach is particularly advantageous as it allows us to generalize and make predictions on unseen labels without the necessity of specific training for each class. Traditional machine learning models are typically confined to learning and excelling at a single pre-defined task. For example, an image classifier trained exclusively on categorizing dogs and cats may perform well within that specific scope. However, models like CLIP distinguish themselves by possessing the capability to excel at tasks for which they haven't undergone explicit training. This phenomenon is encapsulated by the term ‘‘zero-shot learning.’’ Here, the model employs generalization to predict a class that has not been encountered in the training data.

This makes it an ideal candidate for the proposed solution, since the specific nature of hateful and non-hateful memes may vary widely and evolve over time. This adaptability contributes to the robustness of our meme detection framework.

2) APPROACH
Scaling an elementary pre-training task is necessary to attain competitive zero-shot performance on a wide range of image classification datasets. To achieve this, the CLIP model must be trained to recognize a wide range of visual concepts in images and connect them to their names. This is done by applying contrastive learning to a large dataset of image-text pairs. This information is utilized to determine which of 32,768 randomly chosen text descriptions a given image was accurately paired with in the dataset.

a: CONTRASTIVE REPRESENTATION LEARNING
The novelty of the CLIP model lies in its utilization of a contrastive training strategy, a paradigm that leverages positive (image-text pairs) and negative (other images and text) samples to train a scoring function, thereby generating meaningful representations of the data. Through this innovative technique, CLIP is trained to understand that similar representations should converge in the latent space, while dissimilar ones should exhibit considerable separation [17]. Within the model's architecture, encompassing both image and text encoders, the process of contrastive training involves the labeling of image-text pairs, followed by their embedding with various ‘‘objects’’ to learn abstractions in the data.

This facilitates the training of a zero-shot classifier on the resulting image and text embeddings.

Deep learning has extensively used contrastive pre-training. One explanation for this is that contrastive pre-training followed by supervised fine-tuning is a paradigm that is more label-efficient and enhances the effectiveness of labeled data. During pre-training, unlabeled images are effectively clustered together in the latent space, resulting in precise decision boundaries between distinct classes. Subsequent supervised fine-tuning, based on this clustering, consistently outperforms random initialization. It is also a better approach because it not only captures shared information from multiple sources, such as images and text, but also maximizes mutual information [18]. Therefore, the adoption of a model rooted in the contrastive approach aligns seamlessly with our research objectives.

b: COSINE SIMILARITY
In an ideal world, the vector representations of a text and its corresponding image should be equal. The similarity between the embedded representations of each text and each image represents the ‘‘goodness’’ of our model; likewise, the ‘‘badness’’ is measured by the dissimilarity between them. An optimal model has maximized goodness and minimized badness.

To assess this ‘‘goodness’’ we require a method of computing the distance between the image and text vectors. The Euclidean distance, often known as the straight-line distance, is a fine option for determining the separation between two points in 2 or 3 dimensions. All points, however, tend to be far apart by the Euclidean measure in a high-dimensional space. Hence, the angle between vectors is a more useful metric in higher dimensions. Thus we use the cosine similarity, which calculates the cosine of the angle between two vectors [19], to determine the similarity between them in our model, as can be seen in Figure 1.

FIGURE 1. Cosine similarity.

A greater value of the cosine similarity is produced by more comparable vectors. The dot product of the vectors is used for the computation; when not using unit vectors, we must either normalize the vectors or divide the dot product by the vector norms. Equation (1) gives the cosine similarity value cos(θ) between given input vectors a and b:

\cos(\theta) = \frac{a \cdot b}{\|a\|\,\|b\|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}   (1)

c: SOFTMAX FUNCTION
When performing supervised categorization in the contrastive training of the CLIP model, the InfoNCE loss optimization function is often applied after a softmax function has been applied to the network outputs. Given a vector of real values, the softmax function restricts their range to lie between 0 and 1, with the total of all the values equaling 1. Another characteristic of softmax is that it tends to make one of the values much larger than the others; consequently, we get a positive example (the closest vector) that is significantly larger than the random ones and can be easily identified [20]. Therefore, we first take the softmax of the values and then the negative log of the labeled category to calculate the loss for categorical cross-entropy. Refer to (2) for the softmax function:

\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \quad \text{for } i = 1, \ldots, K, \qquad z = (z_1, \ldots, z_K) \in \mathbb{R}^K   (2)

where
σ = the softmax function,
z = the input vector,
e^{z_i} = the standard exponential function applied to element i of the input vector,
K = the number of classes in the multi-class classifier,
e^{z_j} = the standard exponential function applied to element j of the input vector.

d: InfoNCE CONTRASTIVE LOSS FUNCTION
The contrastive loss function compares the distance between a sample and the network's output for a positive example of the same class to its distance from a negative example. If positive samples are encoded to similar (closer) representations and negative samples to dissimilar (farther) representations, the loss is low. This is achieved by taking the cosine distances between the vectors and treating the resulting distances as the prediction probabilities of a standard categorization network. The loss function we use for contrastive learning in this paper is InfoNCE (NCE is an acronym for Noise-Contrastive Estimation), which is an altered variant of the cross-entropy loss function [21]. The similarity of positive pairs is maximized while that of negative pairs is minimized using this function:

L_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{cos\_sim}(z_a, z_p)/\tau)}{\exp(\mathrm{cos\_sim}(z_a, z_p)/\tau) + \sum_{n \in N} \exp(\mathrm{cos\_sim}(z_a, z_n)/\tau)}   (3)

In (3), z_a, z_p, and z_n represent the anchor, positive, and negative embeddings. Following self-supervised learning practice, we have one positive sample and many negatives (N). The cos_sim function assesses the vector cosine similarity. We aim to maximize the cosine similarity between z_a and z_p, bringing them closer together, while the reverse holds for z_a and z_n. Additionally, there is a temperature hyperparameter, designated by τ, which regulates the degree of penalty for harder negative samples: harder negatives incur an increased penalty at lower temperatures.

The numerator here is essentially the output of a positive pair, and the denominator is the sum over all positive and negative pairs. Ultimately, this simple loss forces the positive pairs to have a greater value, closer to 1 (as pushing the log term to 1 will make −log(1) = 0, which is the optimal loss), and the negative pairs further apart (closer to 0). This loss function can also be interpreted geometrically. Since z_a, z_p, and z_n are high-dimensional, normalized latent vectors, they can simply be seen as points on a hypersphere, and the cosine similarity between any two of them is directly related to the Euclidean distance between them, as shown in Eqn. (4):

\mathrm{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}   (4)

As seen in (4), for two normalized vectors A and B, since the cosine decreases as the angle grows, the bigger their similarity, the smaller the angle, i.e., the nearer they are. Thus, as shown in Figure 2, if the two blue vectors form a positive pair, they are drawn nearer (like the red one) through learning; otherwise, if they form a negative pair, they are pushed apart (like the black ones) through learning.

FIGURE 2. Similarity between vectors.
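As a concrete illustration, the following is a minimal PyTorch sketch of the InfoNCE loss in (3) (our own illustrative implementation, not the authors' code; the embeddings and temperature value below are hypothetical):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_p, z_n, tau=0.07):
    """InfoNCE loss for one anchor: z_a and z_p are (d,) embeddings,
    z_n is (num_negatives, d). All vectors are L2-normalized first,
    so a plain dot product equals the cosine similarity."""
    z_a = F.normalize(z_a, dim=-1)
    z_p = F.normalize(z_p, dim=-1)
    z_n = F.normalize(z_n, dim=-1)

    pos = torch.exp(z_a @ z_p / tau)        # positive-pair term (numerator)
    neg = torch.exp(z_n @ z_a / tau).sum()  # sum over the N negatives
    return -torch.log(pos / (pos + neg))

# Toy usage with random 512-dim embeddings (CLIP's latent size).
z_a, z_p = torch.randn(512), torch.randn(512)
z_n = torch.randn(8, 512)                   # 8 negative samples
print(info_nce_loss(z_a, z_p, z_n).item())
```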


e: InfoNCE NETWORK ARCHITECTURE
The InfoNCE contrastive learning method has a network architecture as illustrated in Figure 3. The input x is augmented to xi and xj with different augmentation operators. They are then passed forward through the neural network f to get representations hi and hj, on which a nonlinear transformation g is performed to get zi (text vector) and zj (image vector). Finally, the contrastive loss is evaluated on zi and zj to optimize f and g. During training, xi and xj are used as a positive pair if they come from the same input x, and as a negative pair if they come from different inputs.

FIGURE 3. InfoNCE network architecture.

3) CLIP MODEL ARCHITECTURE
Multimodal architectures leverage more than one domain to learn a specific task. CLIP is a novel architecture that integrates computer vision and natural language processing. Its architecture is designed so that both the visual image and the text caption in a multimodal meme can be analyzed simultaneously to extract the text and image embeddings [22], [23], [24]. A text encoder and an image encoder are the two primary parts of its architecture. To predict the most suitable pairings in a batch of training (image, text) examples, these two encoders are trained jointly:
• The core of the text encoder is a Transformer model; its base size has 63 million parameters, 12 layers, and a 512-wide model with 8 attention heads, used to obtain the text features.
• The image encoder uses either a Vision Transformer (ViT) or a ResNet50 as its backbone, responsible for generating the feature representation of the image.

4) WORKING OF THE CLIP MODEL
The image and text pairs need to be embedded for them to be linked to one another. If we had one cat and two dogs, for instance, we might represent that information as a dot on a graph, embedding the data on the X-Y grid (Euclidean space), as depicted in Figure 4. Both the text and the images work in a similar manner.

Of the two sub-models composing CLIP, the image encoder embeds images into a mathematical space while the text encoder embeds words into one. Then, using contrastive pre-training, CLIP is trained to predict how probable it is that the image corresponds to the text. When compared to other approaches, CLIP is four times more effective at this zero-shot image classification.


FIGURE 4. Embedding of ‘‘1 cat, 2 dogs.’’

The working of the model consists of three steps: 1) contrastive pre-training, 2) creation of a dataset classifier from label text, and 3) application of zero-shot prediction.

a: CONTRASTIVE PRE-TRAINING
During this phase, a batch of N (32,768) images paired with their respective descriptions, e.g., <image1, text1>, <image2, text2>, ..., <imageN, textN>, is processed through the image and text encoders simultaneously to obtain their vector representations (embeddings).

In the example image, a series of purple text cards is delivered into the text encoder, and each card's output is a list of numbers. ‘‘Pepper the Aussie pup’’, for instance, would enter the text encoder and emerge as a series of digits like (0, 0.2, 0.8). The same thing takes place with the images: each image enters the image encoder and comes out as a series of numbers; the image of Pepper the Aussie pup might appear as (0.05, 0.25, 0.7).

The CLIP model is then trained to predict which image embedding belongs to which text embedding in a batch. To achieve this, the contrastive pre-training method computes the cosine similarity between every pair of image embeddings (I1, I2, ..., IN) and text embeddings (T1, T2, ..., TN). Over the calculated similarity scores, an optimization is executed by applying a symmetric cross-entropy loss to maximize the cosine similarity between the embeddings of real pairs in the batch while minimizing the cosine similarity between the embeddings of incorrect pairings.

The step-by-step procedure is as follows (a code sketch of the optimization objective is given after this list):
• N ‘‘image-text’’ pairs in a batch are sent into the model.
• The Image Encoder computes an image vector for each image in the batch. The first image is represented by the vector I1, the second by I2, and so on. Each vector has the size of the latent dimension, so this stage produces an N × d matrix, where d is the latent dimension.
• Similarly, the text descriptions are transformed into text embeddings (T1, T2, ..., TN), producing an N × d matrix.
• Finally, we multiply those matrices and calculate the cosine similarities for every single pair of image and text description, producing an N × N matrix.
• The objective is to achieve the highest possible cosine similarity along the diagonal, which corresponds to the correct image-text embedding pairs (the actual image-text pairs that are maximally near, where i = j), such as <I1,T1> and <I2,T2>.

FIGURE 5. Contrastive pre-training phase.

• The light blue squares in Figure 5 stand for these pairs where the text and image coincide. As an illustration, T1 and I1 are the embedded forms of the first text and first picture, respectively. The highest cosine similarity between I1 and T1 is what we're aiming for. The same is desired for I2 and T2, and all other light blue squares. The greater these cosine similarities, the more ‘‘goodness’’ our model possesses.
• The cosine similarities of the off-diagonal elements (where i ≠ j), i.e., the dissimilar pairs <I1,T2>, <I1,T3>, ..., <Ii,Tj>, are minimized in a contrastive manner, separating the actual image from all the other incorrect text descriptions (e.g., image I1 is described by T1 and not by T2, T3, etc.).
• The grey squares in Figure 5 show where the text and image are out of alignment. For instance, T1 might be the text ‘‘Pepper the Aussie pup’’ while I2 might be a picture of a raccoon. Since this measures ‘‘badness,’’ the cosine similarity between this image (I2) and the words ‘‘Pepper the Aussie pup’’ should be quite low.
• The model then uses the symmetric cross-entropy loss as its optimization objective, which corresponds to the InfoNCE loss. This type of loss minimizes in both the image-to-text direction and the text-to-image direction, as the contrastive loss matrix keeps both the <I1,T2> and <I2,T1> cosine similarities.
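The following minimal PyTorch sketch illustrates this symmetric objective over the N × N cosine similarity matrix (an illustrative reconstruction under our own assumptions, such as random embeddings and a fixed temperature, not the authors' training code):

```python
import torch
import torch.nn.functional as F

# Hypothetical batch of N=4 image and text embeddings (d=512, CLIP's latent size).
N, d = 4, 512
img_emb = F.normalize(torch.randn(N, d), dim=-1)
txt_emb = F.normalize(torch.randn(N, d), dim=-1)

# N x N cosine similarity matrix, scaled by a temperature.
tau = 0.07
logits = img_emb @ txt_emb.t() / tau

# The correct pairs lie on the diagonal: image i matches text i.
targets = torch.arange(N)

# Symmetric cross-entropy: image-to-text and text-to-image directions.
loss_i2t = F.cross_entropy(logits, targets)
loss_t2i = F.cross_entropy(logits.t(), targets)
loss = (loss_i2t + loss_t2i) / 2
print(loss.item())
```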


b: CREATE DATASET CLASSIFIER FROM LABEL TEXT
This step encodes all the labels/objects in the context format ‘‘a photo of a {object}’’. The vector representation of each context is generated from the text encoder.

It is difficult to find robust datasets with paired image-textual descriptions. Most public datasets, such as CIFAR, are images with single-word labels; these labels are the target class. But CLIP was created to use full textual descriptions. To overcome this discrepancy, some feature engineering is used: single-word labels, such as ‘‘bird’’ or ‘‘car’’, are converted to sentences. If we have dog, car, and plane as the classes of the dataset, we output the following context representations:
• a photo of a dog
• a photo of a car
• a photo of a plane

c: USE OF ZERO-SHOT PREDICTION
To perform zero-shot classification, the image is sent to the encoder, which then conducts a similarity search to determine which text matches the image from the entire batch. For example, the text encoder will contain a batch of ‘‘a photo of a dog,’’ ‘‘a photo of a car,’’ etc., and CLIP uses these names of all the classes in the dataset (the output of the previous step) as text pairings to predict which image vector corresponds to which text vector, i.e., the most probable (text, image) pair. The most similar text prompt is selected as the prediction after computing the pairwise cosine similarities between the image and the text embeddings, as can be seen in Figure 6.

FIGURE 6. Zero-shot classification.

The CLIP model may be demonstrated as in Figure 7: it accurately predicts the dog by maximizing the similarity between the word ‘‘dog’’ and the visual data.

FIGURE 7. Image classification using CLIP.
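A minimal sketch of this zero-shot step, assuming OpenAI's standard clip package (the class names and image path here are hypothetical placeholders):

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Convert single-word labels into full textual prompts.
classes = ["dog", "car", "plane"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    # logits_per_image holds the scaled cosine similarities (1 x num_classes).
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print(classes[probs.argmax().item()])
```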
B. TECHNICAL PROCEDURE
The proposed solution utilizes the CLIP multimodal model in the Python language, using the PyTorch machine learning framework and the Torchvision library, to understand and classify multimodal hateful memes involving images and text. It makes use of a pre-trained CLIP model to create a custom classifier without any training required. The generated hateful memes detector achieves results competitive with supervised model baselines using zero-shot classification.

1) DATASET
The dataset chosen to implement the multimodal hateful memes detection is the Facebook Hateful Meme Dataset (https://ai.meta.com/blog/hateful-memes-challenge-and-data-set/). This dataset was produced by Facebook AI with the express purpose of assisting in the creation of new methods to detect multimodal hate speech. It is challenging for machines to comprehend this content since it integrates multiple modalities: text and images. The dataset includes 10,000+ novel, multimodal examples of memes, each of which includes an image and the OCR sentence in it (refer to Figure 9). Memes can be classified into two categories for the purposes of this challenge: non-hateful and hateful. Fully balanced, the validation and test sets contain 5% and 10% of the data, respectively (Table 1). The remaining data, which consists of 36% hateful memes and 64% non-hateful memes, is used as the train set. The images in the dataset are all licensed from Getty Images and span a wide range of both attacks (such as encouraging violence or depicting groups as criminals) and protected categories (such as religion, gender, and sexual orientation). The memes in this dataset were chosen in a way that only multimodal models can successfully classify them, making it difficult for strictly unimodal classifiers to do so. When taken separately, the text phrase and the image in each meme are harmless, but when taken together, the meme's semantic content becomes offensive.

TABLE 1. Facebook Hateful Memes dataset splits.


FIGURE 8. Flowchart of the technical procedure.

FIGURE 9. Facebook Hateful Meme dataset samples.

FIGURE 10. Benign confounders in the dataset.

The dataset is specifically made to address issues that are frequently encountered in AI research, namely the lack of examples that would enable machines to learn to avoid false positives. This is accomplished by introducing memes into the dataset that resemble offensive examples but are innocuous. These challenging cases, referred to as benign confounders (Figure 10), are included to address potential biases in classification systems and to aid the development of systems that avoid false positives.

2) LOADING THE CLIP MODEL
After installing and importing the CLIP model, its weights, tokenizer, image processor, and related libraries from OpenAI, we load the model and the torchvision transformation pipeline that it requires. The image encoder is either a Vision Transformer (ViT) or a ResNet version like ResNet50, whereas the text encoder is a Transformer. For the goal of identifying hateful memes, we use ViT-B/32 as the image encoder. The command clip.available_models(), as shown in Figure 11, can be used to view the image encoders that are offered.

FIGURE 11. Loading the CLIP model.
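A minimal sketch of this loading step, mirroring what Figure 11 depicts (assuming OpenAI's clip package is installed):

```python
import torch
import clip

# List the available encoder backbones, e.g. ['RN50', ..., 'ViT-B/32', ...].
print(clip.available_models())

# Load the ViT-B/32 model and its torchvision preprocessing pipeline.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
```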


3) EXTRACTING IMAGE EMBEDDINGS
Each image goes through preprocessing before being fed into the Image Encoder. First, the dataset's mean and standard deviation are used to normalize the input images' pixel intensity. The images are then center cropped, normalized, and resized to comply with the image resolution that the image encoder requires. The image is preprocessed (refer to Figure 12 for pseudocode) and then sent to the Image Encoder, which produces a 1 × 512 image embedding tensor as its output [8], [25]. This preprocessing is executed by a torchvision transform function.

FIGURE 12. Extracting the image features.

4) EXTRACTING TEXT EMBEDDINGS
First, a case-insensitive text tokenizer invoked with clip.tokenize() processes the text labels, converting the label words into numeric values. To meet the requirements of the Text Encoder, the outputs are by default padded to a length of 77 tokens. As a result, a padded tensor of size N × 77 is created (Figure 13), where N is the number of classes, which equals 2 in binary classification (giving a 2 × 77 tensor); this is used as input for the Text Encoder. Following that, the Text Encoder converts the tensor into an N × 512 tensor of text embeddings, where each class is represented by a single vector. The model.encode_text() method can be used to encode text and retrieve the embeddings.

FIGURE 13. Extracting the text features and the padded tensor.

5) CALCULATING AND PLOTTING THE COSINE SIMILARITY MATRIX
We must first identify the relationship between the text features and the image features before we can label a meme as hateful. With a potent multimodal model like CLIP, we employ the cosine similarity distance metric to compute the degree of similarity between the various modalities, i.e., each text and image encoding.

The model is fed 8 example images and their associated texts; we compute the dot products for each pair and compare the similarity between the corresponding features. By passing the preprocessed image and text inputs through the image and text encoders, the model calculates the cosine similarity (refer to Figure 14) between the corresponding image and text features and multiplies it by 100 to generate image_logits. The logits are then normalized into a list of probability distributions for each class. The class with the highest probability (thus the highest similarity score) is then set as the predicted class.

FIGURE 14. Cosine similarity matrix pseudocode.
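The following is a minimal sketch of steps 3–5, consistent with the pseudocode in Figures 12–14 but not the authors' exact code (the file names, captions, and variable names are hypothetical placeholders):

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# 8 example memes and their associated texts (paths/captions are placeholders).
paths = [f"meme_{i}.png" for i in range(8)]
texts = [f"caption {i}" for i in range(8)]

images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
tokens = clip.tokenize(texts).to(device)              # 8 x 77 padded tensor

with torch.no_grad():
    img_feat = F.normalize(model.encode_image(images), dim=-1)  # 8 x 512
    txt_feat = F.normalize(model.encode_text(tokens), dim=-1)   # 8 x 512

# 8 x 8 cosine similarity matrix; scaling by 100 gives the logits described above.
similarity = img_feat @ txt_feat.t()
logits = 100.0 * similarity
probs = logits.softmax(dim=-1)
```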
6) CLASSIFYING THE MEME AS HATEFUL OR NON-HATEFUL
We set the threshold for the cosine similarity between the image and its text to 0.2. This means that if the cosine similarity between the image and its text falls below 0.2, we directly classify the meme as non-hateful. This is because, even if the model pairs the image with the hateful description, the values obtained in the cosine similarity matrix show that for similarity scores below 0.2 the text features do not relate in context to the hateful features of the image, and thus the combination cannot be hateful.

If the cosine similarity between the image and its text is greater than 0.2, then we classify the meme based on whether the model associates the image with the good-meme or hateful-meme description, as shown in Figure 15. This approach allows us to combine the contributions of both the text and image features towards detecting the hatefulness of the input meme.


FIGURE 15. Multimodal meme classification.

7) PROMPT ENGINEERING
Without any training or fine-tuning, a powerful model like CLIP can produce zero-shot predictions. To achieve that, we offer the model some text prompts [26], [27]. These text labels or prompts are encoded by the CLIP classifier into a learned latent space, and their similarity to the image latent space is assessed. The classifier's performance may be altered by changing the language of the prompts, because different text embeddings can have a different effect. We produce written descriptions for a ‘‘good meme’’ and a ‘‘hateful meme,’’ as can be seen in Figure 16, that serve as classes for our dataset to classify an input meme as hateful or non-hateful.

FIGURE 16. Text prompts.
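A minimal sketch combining the 0.2 threshold rule of step 6 with these engineered prompt classes (the two prompts are quoted from the results section below; the helper and file names are hypothetical):

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# The two engineered prompt classes used in this study.
PROMPTS = ["A non-hateful meme that is good.",
           "A hateful meme containing racism, sexism, nationality, "
           "religion, and disability."]

def classify_meme(image_path, threshold=0.2):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(PROMPTS).to(device)
    with torch.no_grad():
        img_feat = F.normalize(model.encode_image(image), dim=-1)
        txt_feat = F.normalize(model.encode_text(tokens), dim=-1)
        sims = (img_feat @ txt_feat.t()).squeeze(0)  # cosine similarities

    # Step 6's rule: weak image-text relation => directly non-hateful.
    if sims.max().item() < threshold:
        return "non-hateful"
    # Otherwise pick whichever prompt class the meme is closer to.
    return "hateful" if sims.argmax().item() == 1 else "non-hateful"

print(classify_meme("meme.png"))
```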
8) CALCULATING THE ACCURACY OF THE MODEL
To evaluate the model, we compare its performance on a validation dataset, a set of data on which it has not been trained. The most popular evaluation metric, ‘‘accuracy,’’ is calculated as the proportion of correctly classified images to all the images in the data set. We find the baseline accuracy of this model to be 57.8%, which is quite high for a model that is pre-trained and predicts results by generalizing over unseen labels in zero-shot classification. We can improve the accuracy of the model through prompt engineering, by changing the text descriptions for the labels being fed into the model.

9) TESTING THE MODEL AND MAKING PREDICTIONS
After successfully implementing the CLIP model, it is ready to predict outcomes for unseen data sets. Hence, we feed into it the test data, which consists of only images and texts and no labels. The model returns the expected output: all the input images are successfully classified into their predicted labels according to their probability scores.

IV. RESULTS AND OUTPUT
In this paper we used the Facebook Hateful Meme dataset (https://ai.meta.com/blog/hateful-memes-challenge-and-data-set/) to detect hate speech in the multimodal image-text combinations of memes by implementing the CLIP model. On feeding random unseen memes as input to the model, we can see that the CLIP model gave highly accurate results, correctly predicting and classifying each combination of text and image that falls into the category of hate speech as a ‘‘Hateful’’ meme. Also, no non-hateful meme out of all the inputs is falsely classified as hateful, i.e., there are no false-positive outputs in the result. Therefore, the model is seen to be accurate and to work successfully by predicting correct outcomes.

A. PREDICTED OUTCOMES AND ANALYSIS
The cosine similarity matrix obtained while applying the CLIP model to the chosen dataset can be observed in Figure 17. This similarity matrix illustrates a visual representation of the relationship between the texts and images in the hateful memes dataset, calculated from the cosine similarity between the image and text features. As seen in the figure, the highest possible cosine similarity between the image and text pairs is achieved along the diagonal, the yellow squares, since they have the highest dot product values. In other words, these image-text pairs are the closest to each other in what they are describing, or have the maximum correlation. On the other hand, the blue and purple squares signify that the corresponding image-text pair is completely out of alignment, or that the given statement is completely different from what is being shown in the picture.

The accurately predicted outcomes on the validation set can be seen below, as the memes containing hate speech are correctly identified and classified as ‘‘Hateful Memes’’ in Figures 18–23. A total of 358 memes are classified accurately as hateful out of the 1000 unknown input images in the test dataset. The first 20 of these memes are shown in a grid in Figure 18. These precise classification outcomes are a result of setting the threshold for the cosine similarity value accurately and fine-tuning the CLIP model using prompt engineering. In the proposed solution, the threshold of cosine similarity between the image and its text is set at 0.2 for precise predictions. As a result, all image-text pairs with a cosine similarity value below 0.2 are automatically classified as non-hateful. For values above 0.2, the model performs the classification process according to the given text prompts for a good and a hateful meme. In this case, the prompts provided are, for a Good Meme: ‘‘A non-hateful meme that is good.’’ and, for a Hateful Meme: ‘‘A hateful meme containing racism, sexism, nationality, religion, and disability.’’ As seen from the results, the CLIP model classified the unknown dataset images with precision based on the key words of the given prompts.


• Some of the memes classified as hateful are shown in Figures 19–23.

FIGURE 17. Cosine similarity matrix.

FIGURE 18. Accurately predicted outcomes.

FIGURE 19. ‘‘Hateful Meme.’’

FIGURE 20. ‘‘Hateful Meme.’’

FIGURE 21. ‘‘Hateful Meme.’’

FIGURE 22. ‘‘Hateful Meme.’’

FIGURE 23. ‘‘Hateful Meme.’’

B. MODEL EVALUATION PARAMETERS
The evaluation metrics, including accuracy, AUROC, loss, and F1-score, were computed for the proposed CLIP model, and the results are tabulated in Table 2. The model's accuracy and loss graphs have also been plotted (refer to Figures 24 to 27). The model accuracy plot, with the validation accuracy line being an increasing curve, shows that the model performs accurately with a score of 87.42%. The model's loss plot, with the graph dipping, shows that the loss calculated during its performance is low, with a baseline of 0.357. Since the model loss is low, the model accuracy is high, as they are inversely related, making the proposed model an optimal and precise solution for the classification of memes as hate speech. The high AUROC value (the model's ability to differentiate between positive and false-positive samples) and F1 score also indicate the high efficiency of the implemented model.

TABLE 2. Evaluation parameters of the CLIP model.

FIGURE 24. Training cross-entropy loss.

FIGURE 25. Validation cross-entropy loss.

FIGURE 26. Training accuracy.

FIGURE 27. Validation accuracy.
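As a sketch of how such metrics can be computed (a generic illustration using scikit-learn, not the authors' evaluation script; y_true and y_prob below are hypothetical arrays of gold labels and predicted hateful-class probabilities):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, log_loss

# Hypothetical validation labels (1 = hateful) and predicted probabilities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.8, 0.3, 0.1, 0.6, 0.4])
y_pred = (y_prob >= 0.5).astype(int)

print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUROC:   ", roc_auc_score(y_true, y_prob))
print("F1 score:", f1_score(y_true, y_pred))
print("Loss:    ", log_loss(y_true, y_prob))  # cross-entropy loss
```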

C. COMPARATIVE ANALYSIS
The classification of hateful memes can be quite a challeng-
ing task due to the dual nature of the data that needs to
be extracted from the input images. For the successful and
accurate prediction of hateful memes, both the image and text
features need to be extracted from the input meme, which will
allow us to combine both the contributions of the text and
image embeddings towards detecting the hatefulness of the
input. However, far fewer studies focus on the multimodal
representation of data, namely the information that consists
FIGURE 24. Training cross entropy loss. of multiple channels, since most classification tasks are

VOLUME 12, 2024 22371


G. Arya et al.: Multimodal Hate Speech Detection in Memes Using CLIP

‘‘unimodal’’ or can only extract and learn in one mode, the random and majority-class baselines lie at 50 AUROC for
either texts or images. Hence, the conventional methods of the unimodal text-only or visual-only classifiers. Multimodal
classification of hateful memes rely on unimodal models, models such as VilBERT are also able to achieve a maximum
which prove to be very less effective as training a model of 72 AUROC.
using only one dimension out of images, text, or video is Lastly, Sethi et al. investigates the classification of hateful
an extremely difficult endeavor. Thus, this task requires a memes using pre-trained models like VGG19 and Xception,
multimodal model that includes text and visuals that are combined with machine learning models like support vector
trained successfully and simultaneously for accurate results. machines and Naïve Bayes [31]. They achieve the highest f1-
There have been a few previous works that have taken score of 0.584 using an integrated stacked model technique.
this direction and implemented multimodal study of hateful Table 3 shows the results of a comparison between the
memes using methods such as visual question answering, evaluation metrics (AUROC and accuracy score) achieved by
vision language pre-training models, and encoders based on various models utilized in several state-of-the-art approaches
convolutional neural networks, apart from making use of for the classification of hateful memes [32], [33], [34], [35].
unimodal techniques.
Zhong et al. proposes a new model that combines mul- TABLE 3. Performance comparison of proposed and existing models.
timodal features with rules, achieving the highest accuracy
of 86.8% [28]. It leverages a specific dataset developed
by Facebook with over 10,000 memes. The specimens
encompass memes in the percentages of 10% unimodal hate,
20% benign image confounder, 20% benign text confounder,
and 10% random non-hateful. Here, a clustering technique
based on perceptual hash is used to group the meme images
together. By using a straightforward comparison on their
strings, the memes are grouped into groups, including ‘‘3-
tuple,’’‘‘2-tuple,’’‘‘unimodal hate,’’ etc. The ‘‘3-tuple,’’ for
example, is made up of 3 memes, with the first meme having
an image like the second meme and text equivalent to the
third meme. The second meme and the third meme, however,
are not connected. The labels for a ‘‘3-tuple’’ consist of 1, 0,
and 0, where 1 denotes hate and 0 denotes non-hatred, while
the labels for a ‘‘2-tuple’’ consist of 1 and 0. From the analysis
above, there were rules formulated, such as Rule 1, where the
hatred probability for samples in a ‘‘3-tuple’’ was set to (1, 0,
0). and Rule 2, where in the case of samples in ‘‘2-tuple,’’ the
hateful probabilities were set to (1,0), with the larger hateful
probability being adjusted to 1. The results of this proposed study demonstrate that
Alternatively, Ahmed et al. explore the use of unimodal text and image models, such as the BERT, LSTM, VGG16, ResNet50, SE-ResNet50, and XSE-ResNet architectures, and combine them into multimodal models for predicting hateful memes, with evaluation metrics such as the AUC-ROC score, F1 score, and accuracy score [29]. The dataset selected for the endeavor was also the "Hateful Memes Challenge" released by Facebook AI, but the approach yielded a maximum prediction accuracy of only 66.3% in classifying memes as hateful or non-hateful.

Fan et al. utilize data from Meta's Hateful Meme Detection Challenge and build three models, with their best model, VisualBERT with external feature extraction, achieving 62.4% accuracy [30]. Kiela et al. highlight the difficulty of the hateful meme detection task, with state-of-the-art methods performing poorly compared to humans (64.73% vs. 84.7% accuracy) [9]. They then evaluate a variety of models, both unimodal and unimodally pretrained multimodal ones (a BERT model combined with a ResNet), on the hateful memes dataset and conclude that multimodal models outperform the unimodal text-only or visual-only classifiers. Multimodal models such as ViLBERT are also able to achieve a maximum AUROC of 72.

Lastly, Sethi et al. investigate the classification of hateful memes using pre-trained models like VGG19 and Xception, combined with machine learning models like support vector machines and Naïve Bayes [31]. They achieve their highest F1-score of 0.584 using an integrated stacked model technique.

Table 3 compares the evaluation metrics (AUROC and accuracy score) achieved by various models utilized in several state-of-the-art approaches for the classification of hateful memes [32], [33], [34], [35].

TABLE 3. Performance comparison of proposed and existing models.

The results of this proposed study demonstrate that the CLIP model, fine-tuned with prompt engineering, can achieve an accuracy rate of 87.42% and an AUROC of 88.35 in the classification of hateful memes when implemented on the Facebook Hateful Memes Dataset. This represents a substantial improvement compared to previous studies that employed machine learning methods for the detection of hate speech in memes. As shown in Table 3, the proposed CLIP model outperforms all other models in terms of accuracy.
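For reference, the two metrics reported in Table 3 can be computed as in the short sketch below. It assumes scikit-learn is available, and the labels and scores are made-up values for illustration only, not results from any of the cited studies.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Ground truth (1 = hateful, 0 = non-hateful) and model scores;
# illustrative values only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.91, 0.12, 0.78, 0.66, 0.31, 0.08, 0.43, 0.25]
y_pred = [int(p >= 0.5) for p in y_prob]  # hard labels at a 0.5 threshold

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
# AUROC is threshold-free: it scores how well the raw probabilities
# rank hateful memes above non-hateful ones.
print(f"AUROC:    {roc_auc_score(y_true, y_prob):.4f}")
```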


In contrast with previous works, the implementation of the CLIP model also presents an easy, feasible option for the classification process, since this model is already pre-trained and does not need to go through a highly time-consuming process of training over large sets of data. This not only saves time but also spares the effort of a large data accumulation process. The proposed model likewise does not require a complex combination of rules to attain high accuracy and is thus a simpler way of achieving efficiency.

The approach taken in this study is also able to provide better results than previous works because the dataset chosen, the Facebook Hateful Memes Dataset, is a more difficult dataset containing several false positive examples and benign confounders, and it is paired with prompt engineering. The incorporation of prompt engineering is a deliberate and systematic curation of accurate prompts to serve as classes, increasing the accuracy of meme classification. The strategic pairing of this approach with the intricacies of the Facebook Hateful Memes Dataset aims to mitigate the impact of false positives and confounding variables, ultimately elevating the discriminatory capabilities of the proposed model: the dataset makes it hard to rely on unimodal signals, which is exactly the setting in which multimodal models succeed. This makes the proposed strategy more efficient and precise compared to others.
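To illustrate what prompt-based classification with CLIP looks like in practice, the following is a minimal sketch using the Hugging Face transformers implementation of CLIP. The checkpoint name is a commonly used public one, the image path is hypothetical, and the two prompt strings are illustrative examples of engineered class prompts rather than the exact prompts used in this study.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical engineered prompts acting as the two classes.
prompts = [
    "a hateful meme that attacks a person or a group",
    "a harmless, benign meme",
]

image = Image.open("meme.png")  # illustrative path
inputs = processor(text=prompts, images=image,
                   return_tensors="pt", padding=True)

# CLIP scores each (image, prompt) pair by the similarity of their
# embeddings; a softmax over the prompts turns the scores into
# per-class probabilities.
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(prompts, probs[0].tolist())))
```

Because the image and text encoders are already pre-trained contrastively, classification reduces to writing good prompts, which is what makes the approach inexpensive compared with training a multimodal model from scratch.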
V. FUTURE SCOPE
For detecting hateful memes, our model, in which texts and visuals are trained simultaneously, was able to provide successful results with precision. Its accuracy can be improved further, to give extremely accurate results, by training the model specifically for a particular dataset, since the model utilized at present is pre-trained contrastively on a general dataset. However, training on multimodal data is an extremely cumbersome process, and easy access to the relevant hardware and software required for it is lacking. Social media's development over the past few decades has made a wealth of information readily accessible online, which makes a multimodal dataset such as the one used here extremely complex and very large. The training process will thus require many GPU hours and a great deal of memory before the model can be used to successfully classify hate speech. In the future, this can be made possible with access to better technology and the time to develop, test, and improve models used for hate speech detection.

VI. CONCLUSION
This paper aims to develop a novel and efficient architecture for detecting and classifying multimodal hate speech in memes circulating through social media. The suggested strategy makes use of OpenAI's multimodal model CLIP to better understand multimodal hate speech in memes that contain both visual images and text captions. The CLIP model analyzes the image and its accompanying text to determine whether the two modalities taken together are hateful or not. The "Facebook Hateful Memes Dataset," which consists of 10,000 examples of new multimodal memes (text + image) created by Facebook AI, is utilized as the dataset for the proposed method. The implemented model has been able to achieve an accuracy of 87.42% in recognizing hateful memes. The scope of this study can be further extended to filter such memes out of social media platforms to keep a check on the hatred spreading through this content online. This will help control the hate spread against minority communities and diminish discrimination such as racism or sexism on cyber platforms. It will also curb cyberbullying and hate speech on social media generated by trolls using offensive memes. Amidst the incoming network traffic, such hate speech will be recognized and routed so that the user is protected. The perpetrators spreading hate online through such memes can then be identified and punished. Additionally, further research into this suggested method may also aid in improving the accuracy of hate speech detection in memes by exploring the possibilities of advanced machine learning algorithms in managing multimodal data.

ACKNOWLEDGMENT
The authors acknowledge Universiti Kebangsaan Malaysia for supporting this work under Grant GUP-2023-010, and the Advanced and Innovative Research Laboratory, India, for the technical support.

REFERENCES
[1] R. Richard and G. Giorgi, "What is a meme, technically speaking?" Inf., Commun. Soc., pp. 1–19, Feb. 2023.
[2] Z. Mansur, N. Omar, and S. Tiun, "Twitter hate speech detection: A systematic review of methods, taxonomy analysis, challenges, and opportunities," IEEE Access, vol. 11, pp. 16226–16249, 2023, doi: 10.1109/ACCESS.2023.3239375.
[3] E. K. Boahen, B. E. Bouya-Moko, F. Qamar, and C. Wang, "A deep learning approach to online social network account compromisation," IEEE Trans. Computat. Social Syst., vol. 10, no. 6, pp. 3204–3216, Dec. 2023, doi: 10.1109/TCSS.2022.3199080.
[4] K. Abbas, M. K. Hasan, A. Abbasi, U. A. Mokhtar, A. Khan, S. N. H. S. Abdullah, S. Dong, S. Islam, D. Alboaneen, and F. R. A. Ahmed, "Predicting the future popularity of academic publications using deep learning by considering it as temporal citation networks," IEEE Access, vol. 11, pp. 83052–83068, 2023.
[5] H. Hosseinmardi, S. Arredondo Mattson, R. Ibn Rafiq, R. Han, Q. Lv, and S. Mishra, "Detection of cyberbullying incidents on the Instagram social network," 2015, arXiv:1503.03909.
[6] H. Zhong, H. Li, A. C. Squicciarini, S. M. Rajtmajer, C. Griffin, D. J. Miller, and C. Caragea, "Content-driven detection of cyberbullying on the Instagram social network," in Proc. IJCAI, vol. 16, 2016, pp. 3952–3958.
[7] Y. Chen and F. Pan, "Multimodal detection of hateful memes by applying a vision-language pre-training model," PLoS One, vol. 17, no. 9, 2022, Art. no. e0274300.
[8] Y. Zhou, Z. Chen, and H. Yang, "Multimodal learning for hateful memes detection," in Proc. IEEE Int. Conf. Multimedia Expo Workshops (ICMEW), Jul. 2021, pp. 1–6, doi: 10.1109/ICMEW53276.2021.9455994.
[9] D. Kiela, H. Firooz, A. Mohan, V. Goswami, A. Singh, P. Ringshia, and D. Testuggine, "The hateful memes challenge: Detecting hate speech in multimodal memes," 2020, arXiv:2005.04790.
[10] T. Deshpande and N. Mani, "An interpretable approach to hateful meme detection," in Proc. Int. Conf. Multimodal Interact., New York, NY, USA, Oct. 2021, pp. 723–727, doi: 10.1145/3462244.3479949.
[11] A. A. Ahmed, M. K. Hasan, M. M. Jaber, S. M. Al-Ghuribi, D. H. Abd, W. Khan, A. T. Sadiq, and A. Hussain, "Arabic text detection using rough set theory: Designing a novel approach," IEEE Access, vol. 11, pp. 68428–68438, 2023.
[12] A. K. Thakur, F. Ilievski, H. Sandlin, Z. Sourati, L. Luceri, R. Tommasini, and A. Mermoud, "Multimodal and explainable internet meme classification," Dec. 2022, arXiv:2212.05612.
[13] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," 2021, arXiv:2103.00020.
[14] Y. Qu, X. He, S. Pierson, M. Backes, Y. Zhang, and S. Zannettou, "On the evolution of (hateful) memes by means of multimodal contrastive learning," in Proc. IEEE Symp. Secur. Privacy (SP), San Francisco, CA, USA, May 2023, pp. 293–310, doi: 10.1109/SP46215.2023.10179315.
[15] K. Abbas, M. K. Hasan, A. Abbasi, S. Dong, T. M. Ghazal, S. N. H. S. Abdullah, A. Khan, D. Alboaneen, F. R. A. Ahmed, T. E. Ahmed, and S. Islam, "Co-evolving popularity prediction in temporal bipartite networks: A heuristics based model," IEEE Access, vol. 11, pp. 37546–37559, 2023.
[16] A. Bhandari, "Bias in AI: A comprehensive examination of factors and improvement strategies," Int. J. Comput. Sci. Eng., vol. 10, no. 6, pp. 9–14, Jun. 2023, doi: 10.14445/23488387/ijcse-v10i6p102.


[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Int. Conf. Mach. Learn., 2021, pp. 8748–8763.
[18] J. Badour and J. A. Brown, "Hateful memes classification using machine learning," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), Orlando, FL, USA, Jun. 2021, pp. 1–8, doi: 10.1109/SSCI50451.2021.9659896.
[19] S. Prabhakaran. Cosine Similarity—Understanding the Math and How It Works (With Python Codes). Accessed: Oct. 20, 2023. [Online]. Available: https://www.machinelearningplus.com/nlp/cosine-similarity/
[20] K. E. Koech. Softmax Activation Function—How It Actually Works. Accessed: Oct. 5, 2023. [Online]. Available: https://towardsdatascience.com/softmax-activation-function-how-it-actually-works-d292d335bd78
[21] A. van den Oord, Y. Li, and O. Vinyals, "Representation learning with contrastive predictive coding," 2018, arXiv:1807.03748.
[22] M. S. Hee, R. K.-W. Lee, and W.-H. Chong, "On explaining multimodal hateful meme detection models," in Proc. ACM Web Conf., New York, NY, USA, Apr. 2022, pp. 3651–3655, doi: 10.1145/3485447.3512260.
[23] P. Lippe, N. Holla, S. Chandra, S. Rajamanickam, G. Antoniou, E. Shutova, and H. Yannakoudakis, "A multimodal framework for the detection of hateful memes," 2020, arXiv:2012.12871.
[24] M. A. Latiffi and M. R. Yaakub, "Sentiment analysis: An enhancement of ontological-based using hybrid machine learning techniques," Asian J. Inf. Technol., vol. 7, pp. 61–69, Dec. 2018.
[25] A. Mukred, D. Singh, and N. S. Mohd Satar, "Examining the influence of perceived need on the adoption of information system in public hospitals in Yemen," Asia–Pacific J. Inf. Technol. Multimedia, vol. 9, no. 2, pp. 35–49, Dec. 2020.
[26] R. Cao, R. Ka-Wei Lee, W.-H. Chong, and J. Jiang, "Prompting for multimodal hateful meme classification," 2023, arXiv:2302.04156.
[27] I. Memon, R. A. Shaikh, M. K. Hasan, R. Hassan, A. U. Haq, and K. A. Zainol, "Protect mobile travelers information in sensitive region based on fuzzy logic in IoT technology," Secur. Commun. Netw., vol. 2020, pp. 1–12, Nov. 2020.
[28] X. Zhong, "Classification of multimodal hate speech—The winning solution of hateful memes challenge," 2020, arXiv:2012.01002.
[29] Md. R. Ahmed, N. Bhadani, and I. Chakraborty, "Hateful meme prediction model using multimodal deep learning," in Proc. Int. Conf. Comput., Commun. Green Eng. (CCGE), Sep. 2021, pp. 1–5.
[30] A. Fan and Y. Wu. Identifying Hateful Memes With Multimodal Classification. Accessed: Sep. 10, 2023. [Online]. Available: http://cs231n.stanford.edu/reports/2022/pdfs/66.pdf
[31] A. Sethi, U. Kuchhal, and R. Katarya, "Study of various techniques for the classification of hateful memes," in Proc. Int. Conf. Recent Trends Electron., Inf., Commun. Technol. (RTEICT), Bangalore, India, Aug. 2021, pp. 675–680, doi: 10.1109/RTEICT52294.2021.9573926.
[32] A. Gao, B. Wang, J. Yin, and Y. Tian, "Hateful memes challenge: An enhanced multimodal framework," 2021, arXiv:2112.11244.
[33] Y. Chen and F. Pan, "Multimodal detection of hateful memes by applying a vision-language pre-training model," PLoS One, vol. 17, no. 9, 2022, Art. no. e0274300, doi: 10.1371/journal.pone.0274300.
[34] Z. Ma, S. Yao, L. Wu, S. Gao, and Y. Zhang, "Hateful memes detection based on multi-task learning," Mathematics, vol. 10, no. 23, p. 4525, Nov. 2022, doi: 10.3390/math10234525.
[35] M. G. Constantin, D.-S. Parvu, C. Stanciu, D. Ionascu, and B. Ionescu, "Hateful meme detection with multimodal deep neural networks," in Proc. Int. Symp. Signals, Circuits Syst. (ISSCS), Iasi, Romania, Jul. 2021, pp. 1–4, doi: 10.1109/ISSCS52333.2021.9497374.

GREESHMA ARYA received the B.Tech. and M.Tech. degrees (Hons.) from Dr. A. P. J. Abdul Kalam University, Lucknow, India, and the Ph.D. degree in electronics and communication engineering from Uttarakhand Technical University, Dehradun, India. She is currently an Associate Professor with the Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University for Women (IGDTUW) (State Government University), Delhi, India. She has more than 17.5 years of experience in academics and research. She has published more than 35 research articles in various international journals (including SCI, ESCI, Scopus, and ISI indexed). Her areas of interest include wireless sensor networks, wireless communication, renewable energy sources, network security, the Internet of Things, 5G network technology, artificial intelligence, and deep learning.

MOHAMMAD KAMRUL HASAN (Senior Member, IEEE) received the Ph.D. degree in electrical and communication engineering from the Faculty of Engineering, International Islamic University, Malaysia, in 2016. He is currently an Associate Professor and the Head of the Network and Communication Technology Laboratory, Faculty of Information Science and Technology, Center for Cyber Security, Universiti Kebangsaan Malaysia (UKM). He specializes in elements pertaining to cutting-edge information-centric networks, computer networks, data communication and security, mobile networks and privacy protection, cyber-physical systems, industrial IoT, transparent AI, and electric vehicle networks. He has published more than 230 indexed papers in ranked journals and conference proceedings. He is a member of the Institution of Engineering and Technology and the Internet Society. He is a Certified Professional Technologist in Malaysia. He has actively participated in many events, workshops, and trainings for the IEEE and IEEE Humanity Programs in Malaysia. He served as the Chair of the IEEE Student Branch, from 2014 to 2016. He is a general chair, co-chair, and speaker of conferences and workshops for the sake of societal and academic knowledge building, sharing, and learning. He has been contributing and working as a volunteer for underprivileged people and the welfare of society. He is an editorial member of many prestigious high-impact journals, such as those of IEEE, IET, Elsevier, Frontiers, and MDPI.

ASHISH BAGWARI (Senior Member, IEEE) received the B.Tech. (Hons.), M.Tech. (Hons.), and Ph.D. degrees in electronics and communication engineering. He is currently the Head of the Department of Electronics and Communication Engineering, Women Institute of Technology (WIT) (Institute of State Government), an affiliating institution of Uttarakhand Technical University, Dehradun, India. He has more than 14.5 years of experience in industry, academics, and research. He has published more than 170 research articles in various international journals and IEEE international conferences. His areas of interest are cognitive radio networks, mobile communication, sensor networks, wireless and 5G communication, digital communication, and mobile ad hoc networks. He is an active member of various professional societies, such as IEEE, USA. He is also a Senior Member of the Institute of Electronics and Telecommunication Engineers (IETE), India; a Lifetime Member and a Professional Member of the Association for Computing Machinery (ACM); and a member of the Machine Intelligence Research Laboratory Society. He was a Gold Medalist during his master's studies. He also received the Best WIT Faculty Award, in 2013 and 2015; the Best Project Guide Award, in 2015; and the Corps of Electrical and Mechanical Engineers Prize from the Institution of Engineers, India (IEI), in December 2015, for his research work. He also received the Outstanding Scientist Award 2021 from VDGOOD Technology, Chennai, India, in November 2021; the Dr. A. P. J. Abdul Kalam Life Time Achievement National Award 2022 from the National Institute for Socio Economic Development (NISED), Bangalore, India, in June 2022; and the Best Teacher Award 2023 from Veer Madho Singh Bhandari Uttarakhand Technical University (State Government Technical University), Dehradun, in September 2023. He was named in Who's Who in the World 2016 (33rd Edition) and 2017 (34th Edition).
NURHIZAM SAFIE (Member, IEEE) received the master's degree in information technology from UKM, in 1999, the M.B.A. degree from Anglia Ruskin University, U.K., in 2019, and the Ph.D. degree in management information systems (MIS). He is an Associate Professor and the Dean of the Faculty of Information Science and Technology. Before this position, he was a Research Fellow with United Nations University, a United Nations academic arm. He was conferred the Professional Technologist [Ts./P.Tech.(IT)] credential by the Malaysian Board of Technology (MBoT), in 2018. During his Ph.D. study, he received the National Science Fellowship (NSF) Scholarship from the Malaysian Ministry of Science, Technology, and Innovation (MoSTI).

SHAYLA ISLAM (Senior Member, IEEE) received the B.Sc. degree in computer science and engineering from International Islamic University Chittagong, Bangladesh, and the M.Sc. and Ph.D. degrees from the Department of Electrical and Computer Engineering, International Islamic University Malaysia (IIUM), in 2012 and 2016, respectively. She is currently an Assistant Professor with UCSI University, Malaysia. She received the Malaysian International Scholarship for her Ph.D. study. She received the Silver Medal for her research work with International Islamic University Malaysia. In consequence, she also received the Young Scientist Award for the contribution of a research paper at the Second International Conference on Green Computing and Engineering Technologies 2016 (ICGCET'16), organized by the Department of Energy Technology, Aalborg University, Esbjerg, Denmark.

FATIMA RAYAN AWAD AHMED received the B.S. degree in computer science from the Sudan University of Science and Technology, in 2004, and the M.S. and Ph.D. degrees in computer science from Al Neelain University, Sudan, in 2007 and 2012, respectively. She joined Prince Sattam Bin Abdulaziz University, Saudi Arabia, in 2013, as an Assistant Professor with the Information Systems Department, from 2013 to 2016, and has been with the Computer Science Department, since 2017. In 2004, she joined the Sudanese Company for Telecommunications, Sudan, as a Computer Programmer with the IT Department, where she analyzed, designed, and programmed a set of systems. Her current research interests include artificial intelligence, systems and algorithms analysis and design, web applications, and e-learning.

AAISHANI DE is pursuing the B.Tech. degree with the Department of Electronics and Communication Engineering, Indira Gandhi Delhi Technical University for Women, where she is a researcher. Her technology stack includes Java, Spring Boot, and AI/ML development in Python. She has published more than ten research articles in various international journals (including SCI, ESCI, and Scopus indexed). Her research work primarily focuses on computer vision, natural language processing, and deep learning. Her areas of interest include web development, machine learning, data science, and artificial intelligence. She received the Incentive Award for Excellence in Research from Indira Gandhi Delhi Technical University for Women for her work.

MUHAMMAD ATTIQUE KHAN (Senior Member, IEEE) received the master's and Ph.D. degrees in human activity recognition for application in video surveillance and skin lesion classification using deep learning from COMSATS University Islamabad, Pakistan. He is currently a Lecturer with the Computer Science Department, HITEC University, Taxila, Pakistan. He has more than 190 publications, which have received over 6500 citations, with a cumulative impact factor of more than 600 and an H-index of 50. His research interests include medical imaging, COVID-19, MRI analysis, video surveillance, human gait recognition, and agricultural plants. He is a reviewer for several reputed journals, such as IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE TRANSACTIONS ON NEURAL NETWORKS, Pattern Recognition Letters, Multimedia Tools and Applications, Computers and Electronics in Agriculture, IET Image Processing, Biomedical Signal Processing and Control, IET Computer Vision, EURASIP Journal on Image and Video Processing, IEEE ACCESS, Sensors (MDPI), Electronics (MDPI), Applied Sciences (MDPI), Diagnostics (MDPI), and Cancers (MDPI).

TAHER M. GHAZAL (Senior Member, IEEE) received the B.Sc. degree in software engineering from Al Ain University, in 2011, the M.Sc. degree in information technology management from The British University in Dubai (associated with The University of Manchester and The University of Edinburgh), in 2013, the first Ph.D. degree in IT/software engineering from Damascus University, in 2019, and the second Ph.D. degree in information science and technology from Universiti Kebangsaan Malaysia, in 2023. With over a decade of extensive and diverse experience, he has been an instructor, a tutor, a researcher, a teacher, an IT support/specialist engineer, and a business/systems analyst. He has served in various departments, including engineering, computer science, and ICT. He was the Head of STEM and Innovation, and has also been involved in quality assurance, accreditation, and data analysis in several governmental and private educational institutions under KHDA, the Ministry of Education, and the Ministry of Higher Education and Scientific Research, United Arab Emirates. He is actively involved in community services in the projects and research field. His research interests include the IoT, IT, artificial intelligence, information systems, software engineering, web development, building information modeling, quality of education, management, big data, quality of software, and project management.
