A Deep Learning-Based Framework For Offensive Text Detection in Unstructured Data For Heterogeneous Social Media
ABSTRACT Social media platforms such as Facebook, Instagram, and Twitter are powerful and essential platforms where people express and share their ideas, knowledge, talents, and abilities with others. Users on social media also share harmful content, such as posts targeting gender, religion, or race, and trolling. These posts may be in the form of tweets, videos, images, and memes. A meme is a medium on social media that consists of an image with text embedded in it. Memes convey various views, including humor or offensiveness, which may be a personal attack, hate speech, or racial abuse. Such posts need to be filtered out of social media immediately. This paper presents a framework that detects offensive text in memes and prevents such nuisance from being posted on social media, using the collected KAU-Memes dataset of 2582 memes. The latter combines the ‘‘2016 U.S. Election’’ dataset with newly generated memes built from a series of offensive and non-offensive tweet datasets. The KAU-Memes dataset, containing symbolic images and the corresponding text, is used to validate the proposed model. We compare the performance of three deep-learning algorithms trained to detect offensive text in memes. To the best of the authors’ knowledge and literature review, this is the first approach based on You Only Look Once (YOLO) for offensive text detection in memes. The framework uses YOLOv4, YOLOv5, and SSD MobileNetV2 to compare model performance on the newly labeled KAU-Memes dataset. The results show that the proposed model achieved an mAP of 81.74% and an F1-score of 84.1% with SSD-MobileNetV2, and an mAP of 85.20% and an F1-score of 84.0% with YOLOv4. YOLOv5 performed best, achieving the highest mAP, F1-score, precision, and recall of 88.50%, 88.8%, 90.2%, and 87.5%, respectively.
INDEX TERMS Cyberbully, unstructured data, deep learning, YOLO, social media, offensive,
MobileNet-SSD, image processing.
image [3]. On social media, hate speech is one of the common contents [4]. This is one of the important reasons to understand the meaning and intention of memes and to identify whether they are offensive or non-offensive. Memes can spread hatred in society via social media: a legitimate concern justifying the need to filter such content automatically and immediately.

A meme can be a racial, religious, or personal attack, or an attack on an entire community. The literature revealed several interesting works on memes: emotion analysis in [5], sarcastic meme detection in [6], and hateful meme detection in [7]. These works discussed the multimodal nature of memes, which makes them very difficult to understand and classify. It is also difficult for a machine learning model to classify whether a meme is offensive or non-offensive, because memes depend on context and combine image and text. Without relevant knowledge of the context in which a meme was created, it is rather risky to speculate on whether it is offensive. Similarly, it is hard for a standard OCR to extract and detach texts from a meme, because memes can be noisy. Another critical factor is that since the text in a meme is overlaid on top of the image, the text needs to be extracted using OCR, which can result in errors that require additional manual post-editing [8].

The deeper meaning of a meme can be funny for one person but offensive for another. Memes are usually spread on social media such as Facebook, Instagram, Twitter, and Pinterest. However, some people use them to target a person, a specific religion, or an entire community. Such memes can elicit depressive behaviors and should be filtered out of social media. Some political campaign managers have even turned to memes on social media in their quest to directly or indirectly influence election results, because people see those memes and accept the ideas they promote. Many researchers are trying to solve this problem by identifying offensive memes, but the millions of memes on social media are hard to remove manually. According to one report,1 an average of 95 million images are uploaded daily. On Twitter, for instance, nearly 40% of posts have visual content.2 Tweets with images can also get 150% more retweets than tweets without images.3

There are multiple approaches to meme classification; for example, the OCR technique [9] extracts the text from images. However, OCR text extraction extracts all the text in an image, including watermarks and implicit and explicit entities, which can lead to incorrect classification of the meme. Typographic text extraction from memes using optical character recognition (OCR) is explained in [6] for sarcasm detection in memes.

Offensive memes can be dangerous and insult people [10]. A meme can be aggressive [11], trolling [12], or cyberbullying [13]. Figure 1 shows examples of offensive and non-offensive memes: Figure 1 (a) and (b) are memes containing no offensive text that would make them offensive, while Figure 1 (c), (d), and (e) are images where offensive text is used and makes the memes offensive. Many images include offensive text, and the text associated with an image can make clear whether the meme is offensive or non-offensive. That is why this framework focuses on the text and on detecting offensive content in unstructured data.

Therefore, to address such problems and overcome the error rate, the proposed approach is based on YOLO to detect offensive text inside memes on social media. Accordingly, this paper proposes a new dataset that combines an existing dataset on the 2016 U.S. Election with the offensive and non-offensive tweets dataset from [14].

The contributions of this paper are as follows:
1) A new framework based on a computer vision model is presented in this study for the detection of offensive content in unstructured data.
2) This paper studies text detection in unstructured data and formulates two kinds of text detection, i.e., offensive and non-offensive.
3) A new KAU-Memes dataset is generated, consisting of 2582 memes and labeled for the YOLO and SSD-MobileNet algorithms individually.
4) This paper presents a performance comparison of the YOLOv4, YOLOv5, and SSD MobileNet-V2 algorithms based on training, detection time, mAP, F1-score, precision vs. recall curve, and confusion matrix.
5) Extensive experiments on the 2016 U.S. Election and KAU-Memes datasets show that the algorithms’ performance improves with a higher number of memes.

The paper is organized as follows: related work on offensive and hateful memes is discussed in Section II. The proposed model is described in Section III. Results and discussion are presented in Section IV. Section V concludes the paper and outlines the future work plan.

1 https://www.wired.co.uk/article/instagram-doubles-to-half-billion-users
2 https://unionmetrics.com/blog/2017/11/include-image-video-tweets/
3 https://blog.hubspot.com/marketing/visual-content-marketing-strategy

II. LITERATURE REVIEW
Different approaches have been used for offensive, cyberbullying, toxic-comment, and hateful-speech classification and detection. Bad behavior has become a big issue on social media platforms [15]. On social media, people share rumors [16], hateful content [17], and cyberbullying [18] content. Memes play a big role in such situations on social media. Some approaches have been proposed to overcome these problems of hate speech and offensive content. For example, troll meme classification has been developed based on pre-trained models, i.e., EffNet, VGG16, and ResNet [19]. Two models are proposed by [20]: one works as a text feature extractor and the second extracts image-based features, before sending the memes to the transformer
TABLE 1. Brief summary of offensive memes classification models and performance results. Where A = Accuracy, F = F1-Score, WF = Weighted F1-score,
and R = Recall.
model. However, VGG16 has been used for feature extraction from the memes. A framework by [21] is based on deep learning to automatically detect harmful speech in memes based on the fusion of the visual and linguistic contents of the memes. To simultaneously classify memes into five different categories, namely offensiveness, sarcasm, sentiment, motivational, and humor, a multi-task framework via BERT and ResNet is proposed by [22]. A model based on a visual-linguistic transformer, integrated with pre-trained visual and linguistic features to detect abusiveness in memes, is explained in [23]. To enhance the performance on hateful memes, [24] developed an ensemble learning approach by combining the classification results from multiple classifiers.

The DisMultiHate model is proposed in [25] for the classification of multimodal hateful content; to improve hateful content classification and explainability, it targets the entities in memes. A combination of a Feature Concatenation Model (FCM), a Textual Kernels Model (TKM), and a Spatial Concatenation Model (SCM) can be used to boost multimodal meme classification [26]. A framework named deep learning-based Analogy-aware Offensive Meme Detection (AOMD) by [27] is proposed
FIGURE 2. Proposed framework for offensive and non-offensive text detection in Memes.
which learns the implicit analogy from memes to detect offensive analogy memes. The KnowMeme model, based on a knowledge-enriched graph neural network that uses factual information from human commonsense, can accurately detect offensive memes [28]. Reference [29] proposed that convolutional neural networks (CNN), VGG16, and bidirectional long short-term memory (BiLSTM) can be used for offensive and non-offensive classification of multimodal memes. Reference [30] proposed a joint model to classify undesired memes based on counteractive unimodal and multimodal features; to build the constituent modules of the framework, they employed multilingual-BERT, multilingual-DistilBERT, and XLM-R for the textual part and VGG19, VGG16, and ResNet50 for the visual part. Textual, visual, and info-graphic cyberbullying is detected by a deep neural architecture that includes a Capsule network with dynamic routing for textual bullying content detection, a CNN for visual bullying content prediction, and discretization of the info-graphic content by separating image and text from each other with Google Lens [31]. A deep learning-based framework for bully or non-bully identification based on residual BiLSTM and RCNN architectures is discussed in [32]. Reference [33] explained that hate speech detection can be improved by augmenting text with image-embedding information.

A new approach by [34], named WELFake, is suggested. They used 20 linguistic features, combined these features with word embeddings, and implemented voting classification. This model is based on a count vectorizer and TF-IDF word embeddings and uses a machine learning classifier. For unbiased dataset creation, they merged four existing datasets, named Kaggle, McIntire, Reuters, and BuzzFeed.

A dataset of images with their comments was collected from Instagram and labeled with the help of CrowdFlower workers, where the criteria for labeling were: i) does the example create cyber-aggression, meaning the image intentionally harms someone, or ii) does it create cyberbullying, meaning there is aggressiveness against a person who cannot defend herself or himself [35]. This dataset is also used by [36] for the detection of cyberbullying. Another dataset, from [37], was collected from Instagram posts and their comments and consists of 3000 examples. To label the dataset, they asked two questions: i) do the comments contain any bullying, and ii) if yes, is the bullying due to the contents of the image?

A summary of some state-of-the-art papers is given in Table 1, which shows how each of the models performs on offensive meme classification.

III. PROPOSED FRAMEWORK FOR OFFENSIVE MEMES FILTERING ON SOCIAL MEDIA
The proposed model for offensive and non-offensive text detection in memes is depicted in Figure 2. The goal is to train the model with the training dataset and then test it with the test dataset to compare the performance of the YOLOv4, YOLOv5, and SSD MobileNet-V2 models. This platform can be used as a plug-in for heterogeneous social media to filter out offensive memes, as millions of memes on social media cannot be filtered out manually. This approach can help to stem the spread of offensive memes that are already posted or will be posted on social
media. After the data preprocessing, the YOLOv4, YOLOv5, and SSD MobileNet-V2 models are used to detect the offensive and non-offensive text in memes. A model trains over the dataset and generates weights and checkpoints. When the YOLO model is trained over the labeled image dataset, it generates weights files, usually named yolov-final.weights with the .weights extension. The weights file can be used as a plug-in for any social media in the future. Plug-ins, also known as extensions or add-ons, are computer software that can be added to a host program to add new functions without making any changes to the host program; in our case, it can be added to Facebook, Twitter, Instagram, etc. It enables programmers to update a main program while keeping the user within the program’s environment. So, the model will discard memes before upload to social media when they contain offensive content.

Algorithm 1 Algorithm for Detecting Offensive Text in Images
1: Images ← ImagesInDatabase
2: for image in Images do
3:   Offensive ← Checkpoints(image)
4:   if Offensive == ‘‘offensive’’ then
5:     Delete image
6:   else
7:     Keep image
8:   end if
9: end for

Consider an image that contains blood, a gun, private parts of the body, or something similar. If someone uploads such an image, the Facebook algorithm discards it or shows the warning ‘‘this photo may show violent or graphic content,’’ as everyone has experienced while using social media. Now consider an image that contains offensive text. If someone uploads any of the images from Figure 1, in which, as we can see, there is offensive text targeting politicians, no one has experienced Facebook or any other social media doing the same for images or videos that contain offensive text. The proposed models have been trained on the labeled (bounding boxes) KAU-Memes image dataset. When the training process is finished, the YOLO or SSD models generate a final weight or a checkpoint; consider these weights or checkpoints a trained AI model. Now assume an image with some text is passed through the trained AI model (weight or checkpoint). The trained AI model produces a bounding box on the text inside the image and decides whether the text is offensive or not. Since there are thousands of images in any social media database, this can be applied by simply executing a loop over the database and passing the images one by one through the trained AI model, which makes a decision on each image as a labeled bounding box: if the bounding box label is offensive, the image is deleted; otherwise it is kept in the database. Social media databases are filled with such images, so this trained AI model helps delete images containing offensive text from the database of any social media, rather than checking the millions of uploaded images one by one manually. This is explained in Algorithm 1. The plugin can be installed on any social media, and every future image should be passed through it; if there is offensive text, the image is discarded and not allowed to be uploaded to social media.

YOLOv4, YOLOv5, and SSD MobileNet are famous for their robustness, accuracy, and real-time object detection. Here these models are used for the first time to detect offensive text inside images. YOLO for automatic COVID-19 detection from X-ray images is explained in [45]. YOLO is used for electrical component recognition in real time [46]. Real-time pedestrian detection at night is explained in [47]. By using these models, images with offensive text can be detected immediately and accurately before they go viral. This trained AI model can even be installed in a camera, and the camera can be fitted to a two-wheeled vehicle. There are many offensive texts in the streets, as can be seen on this website [48]. This can be used in a smart city: when the camera detects offensive text on street walls, some action can be taken to clean those offensive words off the wall.

A. DATA GENERATION
This section explains the KAU-Memes dataset, which contains images with text embedded in them; the embedded text makes the memes offensive or non-offensive. Before the data generation, the algorithms were first tested on the 738-meme ‘‘2016 U.S. Election’’ dataset; although they achieved good scores there, the dataset consisted of few memes, so the model performance was poor overall. To improve the performance, this approach generates memes via a third-party website.4 For meme generation, a text dataset is needed that can be embedded in images, so this approach used the offensive tweets dataset from [14], embedded the tweets on famous images, and generated the new KAU-Memes dataset. The tweets dataset consists of 24802 labeled tweets; however, only a few of them were used to generate 2850 memes. While generating the memes, the text was embedded on images in different colors, fonts, and angles, so that the model can filter every kind of offensive meme on social media.

4 https://imgflip.com/memegenerator

B. TEXT VARIATION IN MEMES
There are hundreds of text fonts, colors, and orientations in memes on social media; memes can have any form of text and background image. This section explains the different types of text variation in memes. Figure 3 (a) shows the most challenging and common variation of text found in the dataset. While generating the data, text in different orientations was embedded over images to make the model more robust and accurate. This model also tries to take care of image background clutter, because every meme can have a different background image, which is shown in Figure 3 (b). The text position in the image can be seen in Figure 3 (c): sometimes the text is in the center of the meme, below the image, or to the left or right side of the image. The size of the text also varies in memes, so the KAU-Memes dataset also contains this kind of text variation, as shown in Figure 3 (d). Last but not least, the dataset also contains different formats and colors of text as well as blurred text, which can be seen in Figure 3 (e) and (f), respectively. There are yellow, black, white, etc., color formats for offensive and non-offensive text.

To remove duplicate images from the dataset, the duplicate-removal tool5 [49] is used. This is an up-to-date repository based on CNN, perceptual hashing (PHash), difference hashing (DHash), wavelet hashing (WHash), and average hashing (AHash). Also, memes that consisted only of text with no image in the background were removed from the dataset manually.

D. DATA ANNOTATION AND LABELING
For the annotation procedure, the dataset in [4] and the newly generated memes are labeled according to the tweets dataset of [14]. For the manual data annotation, the labeling tool Roboflow6 is used. The bounding boxes around the text in the memes are drawn in a manner allowing users to decide whether that text is offensive or non-offensive. This bounding box helps the model because it localizes the area for YOLO and SSD MobileNet. The Roboflow tool generates a text file for each meme with the same file name as the image: it writes the coordinates in the form (x1, y1) and (x2, y2), with the label 0 if offensive and 1 if non-offensive, into the text file.
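The exported label files can be converted into the normalized format most YOLO trainers expect. A minimal sketch, assuming the corner-coordinate convention (x1, y1), (x2, y2) with class id 0 = offensive and 1 = non-offensive described above; the function name and example values are illustrative, not part of the Roboflow export:

```python
def corners_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert absolute corner coordinates (x1, y1), (x2, y2) into
    YOLO's normalized (center_x, center_y, width, height) format."""
    cx = (x1 + x2) / 2.0 / img_w
    cy = (y1 + y2) / 2.0 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return cx, cy, w, h

# Example: an offensive-text box (class id 0) in a 500 x 400 meme.
cls = 0
cx, cy, w, h = corners_to_yolo(50, 60, 150, 100, 500, 400)
line = f"{cls} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"  # one line per box in the .txt file
```

Each meme then gets a text file whose lines pair a class id with one normalized box, mirroring the per-image label files described above.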
FIGURE 4. A generalized illustration of the YOLO pipeline for offensive and non-offensive text detection in memes.
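The database-filtering loop of Algorithm 1 can be sketched in a few lines. This is a hedged illustration: `run_detector` is a hypothetical stub standing in for inference with the trained weights/checkpoint, not the actual YOLO or SSD API.

```python
def run_detector(image_path):
    """Hypothetical stand-in for the trained AI model (weight/checkpoint).
    Returns one label per detected text box in the image."""
    # A real implementation would load the checkpoint and run inference.
    return ["offensive"] if "offensive" in image_path else ["non-offensive"]

def filter_database(image_paths):
    """Algorithm 1: drop an image if any detected text box is offensive."""
    kept = []
    for path in image_paths:
        if "offensive" in run_detector(path):
            continue  # delete from the database (e.g., os.remove / a DB delete)
        kept.append(path)
    return kept
```

Deployed as a plug-in, the same check would run on every newly uploaded image before it is published, as described in Section III.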
Because of this new network, the model can keep the accuracy and reduce the computation. Also, the path aggregation network (PANet) is used in YOLOv4, which helps the model boost the information flow in the network [51].

2) YOU ONLY LOOK ONCE VERSION 5 (YOLOV5)
On the other side, YOLOv5 is built on PyTorch. Due to the application features of PyTorch, the model has high productivity and flexibility. YOLOv5 uses the same CSPDarknet and PANet, as can be seen in Table 2. For the activation function, YOLOv5 uses a sigmoid function rather than the Mish function of YOLOv4 [52].

YOLO algorithms are robust in real-time object detection and were introduced by Redmon in 2016 [50]. The 4th version of YOLO was released in 2020 [53]; compared to the old version of YOLO, the mAP and FPS were improved by 10% and 12%, respectively. Many changes were made in the architecture of the YOLO models, but the major ones are the adjustment of the network structure and an increasing number of applied tricks. YOLOv4 changed the backbone from the old Darknet53 to CSPDarknet53. Some data augmentation techniques were also adopted, i.e., Cutout, Grid Mask, Random Erase, Hide and Seek, class label smoothing, MixUp, Self-Adversarial Training, CutMix, and Mosaic data augmentation.

A few months later, another company, named Ultralytics, released a new version of YOLO named YOLOv5. Instead of publishing research and a comparison with other YOLO models, the company simply released YOLOv5’s source code on GitHub [54]. The main changes in architecture between YOLOv4 and v5 and the advancements in YOLOv5 are presented in Figures 5 and 6, respectively. In YOLOv5, leaky ReLU is adopted as the activation function (CBL module) in the hidden layers, while in YOLOv4 there are two modules with leaky ReLU and Mish activation functions (CBL and CBM). Secondly, in the backbone, YOLOv5 adopted a new module at the beginning named Focus. Focus makes four slices of an input image and concatenates all of them for the convolution operation; for example, an image of 608 × 608 × 3 is divided into four small images of 304 × 304 × 3, concatenated into a 304 × 304 × 12 tensor. Third, for the backbone and neck, YOLOv5 designed two CSPNet modules. To maintain processing accuracy and reduce computation power, CSPNet combines feature maps from the start and the end of a network stage [55]. Compared to the standard convolution module in YOLOv4, YOLOv5 adopted the CSPNet module, i.e., CSP2_x in the neck, to strengthen the network feature fusion. Besides the structure adjustment, YOLOv5 adopted an algorithm to automatically learn bounding box anchors in the input stage, which helps calculate the anchor box size for other image sizes and improves the detection quality. In addition, YOLOv5 uses the Generalized Intersection over Union (GIoU) loss, shown in Equation 1 [56], as the bounding box regression loss function, instead of the Complete Intersection over Union (CIoU) loss used in YOLOv4 and shown in Equation 2. GIoU solves the imperfect handling of non-overlapping bounding boxes that remains in the earlier Intersection over Union (IoU) loss function. CIoU incorporates all three geometric factors, namely distance, aspect ratio, and overlapping area; by better handling difficult regression cases, CIoU enhances accuracy and speed. YOLOv5 is built in a new environment, PyTorch [56], which makes the training procedure more user-friendly than Darknet.

L_GIoU = 1 − IoU + |C − (B ∪ B^gt)| / |C|    (1)

where B^gt represents the ground truth box, B is the predicted box, C is the smallest box that covers both B and B^gt, and IoU = |B ∩ B^gt| / |B ∪ B^gt| is the intersection over union.

L_CIoU = 1 − IoU + ρ²(p, p^gt) / c² + αv    (2)

where p and p^gt are the central points of boxes B and B^gt, ρ(·) is the Euclidean distance, c is the diagonal length of the smallest enclosing box C, and v, weighted by α, measures the consistency of the aspect ratio.

TABLE 2. Architecture comparison of YOLOv4 and YOLOv5.
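The GIoU loss of Equation 1 can be checked numerically for axis-aligned boxes given as (x1, y1, x2, y2). This is an illustrative re-implementation for intuition, not the YOLOv5 source code:

```python
def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def iou(b, g):
    """Intersection over union of predicted box b and ground-truth box g."""
    iw = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    ih = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    inter = iw * ih
    return inter / (box_area(b) + box_area(g) - inter)

def giou_loss(b, g):
    """Equation 1: L_GIoU = 1 - IoU + |C - (B U Bgt)| / |C|,
    where C is the smallest box enclosing both b and g."""
    c = (min(b[0], g[0]), min(b[1], g[1]), max(b[2], g[2]), max(b[3], g[3]))
    iw = max(0.0, min(b[2], g[2]) - max(b[0], g[0]))
    ih = max(0.0, min(b[3], g[3]) - max(b[1], g[1]))
    union = box_area(b) + box_area(g) - iw * ih
    return 1.0 - iou(b, g) + (box_area(c) - union) / box_area(c)
```

For a perfect prediction the loss is 0, and for disjoint boxes the enclosing-box term still varies with distance, which is exactly the non-overlap weakness of the plain IoU loss that the text mentions.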
F. SINGLE SHOT DETECTOR (SSD)
In the field of computer vision, models become more complex and deeper for more accurate results and performance. However, this advancement makes the model latency and size bigger, which is unusable in systems with computational constraints. SSD-MobileNet can help with such challenges: this model is basically designed for situations that require high speed. MobileNetV2 provides an inverted residual structure for better modularity. MobileNet eliminates the non-linearities in the narrow layers, resulting in higher performance than previous applications. The MobileNet-SSD detector inherits the design of VGG16-SSD, and the front-end MobileNet-v2 network provides six feature maps of different dimensions for the back-end detection network to perform multi-scale object detection. Since the backbone network model is changed from VGG-16 to MobileNet-v2, the MobileNet-SSD detector can achieve real-time performance and is faster than other existing object detection networks.

G. EVALUATION METRICS
Usually, the basic metric intersection over union (IoU) is used to evaluate the performance of object detection models, which can be seen in Figure 7. IoU is the overlap of the detection box (D) and the ground truth box (G), and can be calculated using Equation 3 [57]. Once we obtain the IoU, we use the confusion matrix entries, i.e., False Positive (FP), True Positive (TP), False Negative (FN), and True Negative (TN), for accuracy measurement. For a TP, the class of a ground truth must be the class of the detection, and the IoU must be greater than 50%; a TP is thus a correct detection of the class. In case the detection has the same class as the ground truth but the IoU is less than 50%, it is considered an FP, which means the detection is not correct. If the model makes no detection where there is a ground truth, it is considered an FN, meaning that the instance is not detected. In many cases, the background has no ground truth and also no detection, which is classified as a TN.

IoU = Intersection / Union = (G ∩ P) / (G ∪ P)    (3)

For the performance comparison of the YOLOv4, YOLOv5, and SSD MobileNet-V2 algorithms, mAP, F1-score, precision, and recall can be used as criteria. mAP [58] is the mean average precision, i.e., the mean of the average precision (AP), as shown in Equation 4, where n is the number of classes and AP_k is the average precision for a given class k. mAP returns a score after comparing the ground truth bounding box with the detected box; after taking the mean of the AP values, we get the mAP, which can be used to measure the accuracy of machine learning algorithms. The F1-score [59] measures a model’s accuracy over the dataset and can be used to evaluate binary classification problems; Equation 5 can be used for the F1-score calculation using precision and recall. The highest possible value of the F1-score is 1 and the lowest is 0. Precision is the ratio of true predictions to the total number of predictions, while recall is the ratio of true predictions to the total number of objects in the image [60]; they are shown in Equation 6 and Equation 7, respectively.

mAP = (1/n) Σ_{k=1}^{n} AP_k    (4)

F1-Score = 2 · (precision · recall) / (precision + recall)    (5)

Precision = TruePositive / (TruePositive + FalsePositive)    (6)

Recall = TruePositive / (TruePositive + FalseNegative)    (7)

IV. RESULTS AND DISCUSSION
A. EXPERIMENTAL SETUPS
All the models were trained on Colab Pro+, and the resources used for the training are shown in Table 3. To train the models properly, YOLOv4, YOLOv5, and SSD MobileNet were trained with different parameters to achieve the highest possible mAP. The parameters for each of the models are shown in Table 4.

TABLE 3. Colab specification used for the training of YOLOv4, YOLOv5 and SSD MobileNet.
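Equations 4–7 translate directly into code; a minimal sketch (the counts below are invented examples, not results from the paper):

```python
def precision(tp, fp):
    return tp / (tp + fp)                    # Equation 6

def recall(tp, fn):
    return tp / (tp + fn)                    # Equation 7

def f1_score(p, r):
    return 2 * p * r / (p + r)               # Equation 5

def mean_average_precision(aps):
    return sum(aps) / len(aps)               # Equation 4: mean over per-class AP

p = precision(tp=8, fp=2)                    # = 8 / 10
r = recall(tp=8, fn=2)                       # = 8 / 10
f1 = f1_score(p, r)
m = mean_average_precision([0.9, 0.7])
```

With only two classes (offensive and non-offensive), the mAP is simply the average of the two per-class AP values.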
B. PERFORMANCE BASED ON EVALUATION METRICS
The ML models achieved the highest possible results for the public online dataset in [4], consisting of 743 memes. Using the KAU-Memes dataset, this approach performs three different experiments. In the first experiment, the data is split into 90% training and 10% validation sets; in the second experiment, 80% and 20%; and in the third experiment, 70% and 30% training and validation.

The results of the models using the KAU-Memes dataset for offensive text detection in memes can be seen in Table 7. It is clear from the table that YOLOv5 shows a higher mAP (91.40%), precision (86.2%), recall (91.9%), and F1-score (88.4%) than YOLOv4 when the dataset is split into 90% training and 10% validation. Also, the YOLOv5 has

D. DETECTION RESULTS
Several offensive and non-offensive text detections were performed to assess the models’ performance. In Figure 9 (a), YOLOv5 predicts the offensive text with a high confidence of 0.96; for non-offensive text, the confidence value is almost 0.93. Similarly, Figure 9 (b) shows the prediction of offensive and non-offensive text using the YOLOv4 model, with confidences of 0.86 and 0.80, respectively. Other than YOLOv4 and v5, the performance of SSD MobileNet-V2 is also good for the detection of offensive memes. The SSD-MobileNet V2 detects the offensive text with a confidence of 0.83 for the offensive meme and 0.78 for the non-offensive one, which can be seen in Figure 9 (c).

E. MODELS PERFORMANCE LOSS
To explore the performance of the algorithms in more detail, it is necessary to examine their incorrect detections, which
TABLE 7. Results of YOLOv4, YOLOv5, and SSD MobileNet V2 algorithms for offensive and non-offensive memes text detection with a train-validation
split of 90%-10%, 80%-20%, and 70%-30%.
FIGURE 10. Offensive small text detection by YOLOv5, YOLOv4, and SSD-MobileNet.
can help in future research for improvement. YOLOv5s is matrix for YOLOv5s is shown in Figure 11 (a) where the
the best detection among other models. However, when it offensive text is detected 251 times correctly, but the model
comes to detecting the offensive text in a meme that has confused 35 times with non-offensive text. Similarly, the
small size text, the performance goes down for each of the non-offensive text is detected 202 times correctly, but it is
models. In Figure 10, all the models detect the offensive text confused with offensive text around 25 times. In the YOLOv5
with a different confidence score. Among all the models, confusion matrix, the False Positive (FP) is divided into two
YOLOv5s still performs better for small text detection than parts based on the value of IOU. If IOU = 0, the false
YOLOv4 and SSD-MobileNet. YOLOv5 detects small text positive prediction is far from the ground truth. Also, if IOU is
with a confidence score of 0.88, YOLOv4 achieves 0.81, between 0 and 0.5 then the overlap between the ground truth
and SSD MobileNet achieves a 0.75 confidence score. The and prediction is not enough to decide it as a true positive. For
performance can be improved by adding more images having YOLOv4, the offensive text is detected 221 times and non-
offensive text in small sizes. offensive text 210 times correctly, but 48 times the offensive
text is confused with non-offensive text, and 42 times the
F. ANALYSIS BASED ON CONFUSION MATRIX non-offensive text is confused with offensive text detection
A confusion matrix can be used for the performance of differ- as shown in Figure 11 (b). Similarly, for SSD-MobileNet
ent models. The confusion matrix also provides information V2, the model detected offensive text 213 times but confused
on the type and source of errors. Where the elements on 58 times with non-offensive and non-offensive text confused
the diagonal represent all the correct classes. The confusion with offensive text 56 times, as shown in Figure 11 (c).
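As a sketch of this analysis (not the authors' code), the snippet below illustrates the two ideas above: splitting false positives by IoU with a 0.5 threshold, and deriving per-class metrics from the YOLOv5s offensive-text counts quoted in the text (251 correct, 35 missed as non-offensive, 25 non-offensive taken as offensive). The box coordinates are hypothetical; the computed precision and recall come out near the paper's reported 90.2% and 87.5%.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def categorize(pred_box, gt_box, thr=0.5):
    """Split detections as described for the YOLOv5 confusion matrix:
    IoU >= thr -> true positive; 0 < IoU < thr -> FP with insufficient
    overlap; IoU == 0 -> FP far from the ground truth (background)."""
    v = iou(pred_box, gt_box)
    if v >= thr:
        return "TP"
    return "FP (poor overlap)" if v > 0 else "FP (background)"

# Per-class metrics for "offensive" from the YOLOv5s counts in the text.
tp, fn, fp = 251, 35, 25
precision = tp / (tp + fp)                      # ~0.909
recall = tp / (tp + fn)                         # ~0.878
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

The per-class figures differ slightly from the paper's averaged scores, since the reported mAP and F1 aggregate over both classes.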
JAMSHID BACHA received the B.Sc. degree in computer systems engineering from UET Peshawar, Pakistan, in 2020, and the master's degree in computer information systems and networks from Korea Aerospace University, South Korea. He is currently pursuing the Ph.D. degree with Technische Universität Berlin. His current research interests include machine learning, deep learning, computer vision, and wireless communication.
FARMAN ULLAH received the M.S. degree in computer engineering from CASE, Islamabad, Pakistan, in 2010, and the Ph.D. degree from Korea Aerospace University, South Korea, in 2016. He worked and collaborated on various projects funded by the Ministry of Economy, the Korea Research Foundation, and ETRI, South Korea. In 2007, he joined AERO, Pakistan, as an Assistant Manager of telemetry. He is currently an Assistant Professor with the College of IT, United Arab Emirates University (UAEU), Abu Dhabi, Al Ain, United Arab Emirates. Before joining UAEU, he was an Assistant Professor with the Department of Electrical and Computer Engineering, COMSATS University Islamabad, Attock Campus, Pakistan, and a Postdoctoral Researcher with the High Processing Computing Laboratory, Jeonbuk National University, South Korea. He has authored/coauthored more than 40 peer-reviewed publications. His current research interests include embedded, wearable, and IoT applications; intelligent resource management for high-performance computing; and artificial intelligence and machine learning.
ABDUL WASAY SARDAR received the B.Sc. degree in computer engineering from COMSATS University Islamabad, Pakistan, in 2020, and the master's degree in computer information systems and networks from Korea Aerospace University, Goyang, South Korea. His current research interests include artificial intelligence, machine learning, deep learning, and computer vision.
JEBRAN KHAN received the B.Sc. and M.Sc. degrees in computer systems engineering from the University of Engineering and Technology at Peshawar, Peshawar, Pakistan, and the Ph.D. degree in electronics and information engineering from Korea Aerospace University, Goyang, South Korea. He is currently a Postdoctoral Researcher with Ajou University, South Korea. His current research interests include social network analysis, modeling, frameworks, and their applications.
SUNGCHANG LEE (Member, IEEE) received the B.S. degree from Kyungpook National University, in 1983, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), in 1985, and the Ph.D. degree in electrical engineering from Texas A&M University, in 1991. From 1985 to 1987, he was with KAIST as a Researcher, where he worked on image processing and pattern recognition projects. From 1992 to 1993, he was a Senior Researcher with the Electronics and Telecommunications Research Institute (ETRI), South Korea, and he was the Director of the Government Project on Intelligent Smart Home Security and Automation Service Technology, from 2004 to 2009. In 2009, he was the Vice President of the Institute of Electronics and Information Engineers (IEIE), South Korea, and also the Director of the Telecommunications Society, South Korea. Since 1993, he has been on the faculty of Korea Aerospace University, Goyang, South Korea, where he is currently a Professor with the School of Electronics, Telecommunication and Computer Engineering.