A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos
ABSTRACT The exponential growth of videos on YouTube has attracted billions of viewers among which
the majority belongs to a young demographic. Malicious uploaders also find this platform as an opportunity
to spread upsetting visual content, such as using animated cartoon videos to share inappropriate content with
children. Therefore, an automatic real-time video content filtering mechanism is highly suggested to be inte-
grated into social media platforms. In this study, a novel deep learning-based architecture is proposed for the
detection and classification of inappropriate content in videos. For this, the proposed framework employs an
ImageNet pre-trained convolutional neural network (CNN) model known as EfficientNet-B7 to extract video
descriptors, which are then fed to a bidirectional long short-term memory (BiLSTM) network to learn effective
video representations and perform multiclass video classification. An attention mechanism is also integrated
after BiLSTM to apply attention probability distribution in the network. These models are evaluated on a
manually annotated dataset of 111,561 cartoon clips collected from YouTube videos. Experimental results
demonstrated that the EfficientNet-BiLSTM framework (accuracy = 95.66%) performs better than the attention
mechanism-based EfficientNet-BiLSTM framework (accuracy = 95.30%). Secondly, the traditional machine learning
classifiers perform relatively poorly compared with the deep learning classifiers. Overall, the architecture of EfficientNet and
BiLSTM with 128 hidden units yielded state-of-the-art performance (f1 score = 0.9267). Furthermore, the
performance comparison against existing state-of-the-art approaches verified that BiLSTM on top of CNN
captures better contextual information of video descriptors in network architecture, and hence achieved better
results in child inappropriate video content detection and classification.
INDEX TERMS Deep learning, social media analysis, video classification, bidirectional LSTM, CNN,
EfficientNet.
Unlike television, children can be presented with any type of content on the Internet due to a lack of regulations. Exposing children to disturbing content is considered as one among other internet safety threats (like cyberbullying, cyber predators, hate etc.) [7]. Bushman and Huesmann [8] confirmed that frequent exposure to disturbing video content may have a short-term or long-term impact on children's behavior, emotions and cognition. Many reports [9]–[12] identified the trend of distributing inappropriate content in children's videos. This trend got people's attention when mainstream media reported about the Elsagate controversy [13], [14], where such video material was found on YouTube featuring famous childhood cartoon characters (i.e., Disney characters, superheroes, etc.) portrayed in disturbing scenes; for instance, performing mild violence, stealing, drinking alcohol and involving in nudity or sexual activities.

In an attempt to provide a safe online platform, laws like the children's online privacy protection act (COPPA) impose certain requirements on websites to adopt safety mechanisms for children under the age of 13. YouTube has also included a ''safety mode'' option to filter out unsafe content. Apart from that, YouTube developed the YouTube Kids application to allow parental control over videos that are approved as safe for a certain age group of children [15].

Regardless of YouTube's efforts in controlling the unsafe content phenomena, disturbing videos still appear [16]–[19], even in YouTube Kids [20], due to the difficulty in identifying such content. An explanation for this may be that the rate at which videos are uploaded every minute makes YouTube vulnerable to unwanted content. Besides, the decision-making algorithms of YouTube rely heavily on the metadata of videos (i.e., video title, video description, view count, rating, tags, comments, and community flags). Hence, filtering videos based on the metadata and community flagging is not sufficient to assure the safety of children [21]. Many cases exist on YouTube where safe video titles and thumbnails are used for disturbing content to trick children and their parents. The sparse inclusion of child inappropriate content in videos is another common technique followed by malicious uploaders. Fig. 1 displays an example among such cases where the video title and most video clips are safe for children (as shown in Fig. 1(a)) but inappropriate scenes are included in the video (as shown in Fig. 1(b) and Fig. 1(c)). The concerning thing about this example, including many similar cases, is that these videos have millions of views with more likes than dislikes, and have been available for years. Many other cases (as shown in Fig. 1(d)) are also identified where the video or the YouTube channel is not popular, yet contains child unsafe content, especially in the form of animated cartoons. It is evident from these examples that the problem persists irrespective of channel or video popularity. Furthermore, YouTube has disabled the dislike feature of videos, which resulted in viewers being incapable of getting indirect video content feedback from statistics. Since the YouTube metadata can be easily manipulated, it is suggested to better use video features for detection of inappropriate content than metadata features associated with videos [22].

Prior techniques [23]–[28] addressed the challenge of identifying disturbing content (i.e., violence, pornography, etc.) from videos by using traditional hand-crafted features on frame-level data. In recent years, the state-of-the-art performance of deep learning has motivated researchers to employ it in image and video processing. The most frequent applications of image/video classification employed the convolutional neural networks [29]–[31]. Apart from that, the long short-term memory (LSTM), a special type of recurrent neural network (RNN) architecture, has proven to be an effective deep learning model in time-series data analysis [32]. Hence, this study targets the YouTube multiclass video classification problem by leveraging a CNN (EfficientNet-B7) and LSTM to learn effective video representations for the detection and classification of inappropriate content. We targeted two types of objectionable content geared towards young viewers: one which contains violence, and the second which includes sexual-nudity connotations.

The main contributions of this study are threefold:
1. We propose a novel CNN (EfficientNet-B7) and BiLSTM-based deep learning framework for inappropriate video content detection and classification.
2. We present a manually annotated ground truth video dataset of 1860 minutes (111,561 seconds) of cartoon videos for young children (under the age of 13). All videos are collected from YouTube using famous cartoon names as search keywords. Each video clip is annotated for either safe or unsafe class. For the unsafe category, fantasy violence and sexual-nudity explicit content are monitored in videos. We also intend to make this dataset publicly available for the research community.
3. We evaluate the performance of our proposed CNN-BiLSTM framework. Our multiclass video classifier achieved a validation accuracy of 95.66%. Several other state-of-the-art machine learning and deep learning architectures are also evaluated and compared for the task of inappropriate video content detection.

To summarize, this work can assist any video sharing platform to either remove unsafe videos or blur/hide any portion of a video involving unsafe content. Secondly, it may also help in the development of parental control solutions on the web via plugins or browser extensions where child inappropriate content can be filtered automatically. The upcoming sections of the article are outlined as follows: Section II covers the related work in this research area. The methodology of our proposed system is explained in Section III. The experimental setup of the proposed system is presented in Section IV. The results obtained from the experimental setup are analyzed and discussed in Section V, and finally, Section VI concludes the work and directs some future scope for improvements.
FIGURE 1. Examples of YouTube cartoon videos depicting different scenarios (a) video showing safe content with 8.9 million views, uploaded
since 2018 by AOK channel of 1.05 million subscribers, (b) video showing sexual-nudity content with 8.9 million views, uploaded since 2018 by AOK
channel of 1.05 million subscribers, (c) video showing fantasy violence content with 32 million views, uploaded since 2016 by WB Kids channel of
19.9 million subscribers, and (d) video showing sexual-nudity content with 14.6k views, uploaded since 2021 by Lovely Simpsons channel of 421
subscribers.
II. RELATED WORK
The explosion of multimedia data on YouTube has presented a lot of opportunities for researchers [33]. However, the challenging task is to find an optimal technique to understand the context of videos. In both computer vision and machine learning, video classification is studied extensively as one of the fundamental approaches for video understanding [34]. The problem of detecting inappropriate video falls in the category of video classification or event detection problem.

A. MACHINE LEARNING METHOD
Most of the earlier studies used hand-crafted features on images or video-level data in identifying the discriminative patterns of inappropriate content. The skin (i.e., skin color) [35] and motion information based features were used for nudity or pornography detection [26], [36]. The multimodal approach was also followed by fusing different modalities of data (i.e., audio, video) with skin and motion-based features. Rea et al. [37] proposed a periodicity-based audio feature extraction method which was later combined with visual features for illicit content detection in videos.

The machine learning algorithms are usually employed as classifiers. Liu et al. [38] classified the periodicity-based audio and visual segmentation features through the support vector machine (SVM) algorithm with a Gaussian radial basis function (RBF) kernel. Later on, they extended the framework [39] by applying the energy envelope (EE) and bag-of-words (BoW)-based audio representations and visual features. Ulges et al. [23] used MPEG motion vectors and Mel-frequency cepstral coefficient (MFCC) audio features with skin color and visual words. Each feature representation is processed through an individual SVM classifier and combined in a weighted sum of late fusion. Ochoa et al. [40] performed binary video genre classification for adult content detection by processing the spatiotemporal features with two types of SVM algorithms: sequential minimal optimization (SMO) and LibSVM. Jung et al. [41] worked with the one-dimensional signal of spatiotemporal motion trajectory and skin color. Tang et al. [42] proposed a pornography detection system, PornProbe, based on a hierarchical latent Dirichlet allocation (LDA) and SVM algorithm.
This system combined unsupervised clustering in LDA and supervised learning in SVM, and achieved higher efficiency than a single SVM classifier. Lee et al. [43] presented a multilevel hierarchical framework by taking the multiple features of different temporal domains. Lopes et al. [44] worked with the bag-of-visual features (BoVF) for obscenity detection. Kaushal et al. [21] performed supervised learning to identify the child unsafe content and content uploaders by feeding the machine learning classifiers (i.e., random forest, K-nearest neighbor, and decision tree) with video-level, user-level and comment-level metadata of YouTube. Reddy et al. [45] handled the explicit content problem of videos through text classification of YouTube comments. They applied bigram collocation and fed the features to the naïve Bayes classifier for final classification.

B. DEEP LEARNING METHOD
In contrast to machine learning algorithms, there is a growing trend of using deep learning architectures to learn the video-based feature representations in video classification. The study of Ngiam et al. [46] is the first deep learning-based research that processed different data modalities such as video, image, audio and text, and performed greedy layer-wise training of a restricted Boltzmann machine (RBM) model. Karpathy et al. [29] showed multiclass video classification results on a large-scale video dataset (ImageNet) by using the convolutional neural network. This study is followed by the research of Yue-Hei Ng et al. [32] in which information across the full-length duration of videos is examined. The LSTM model is employed on top of frame-level CNN activations for better video classification. Some other studies have also taken the benefits of the CNN-LSTM model to capture the context of a given sequence of video frames [47]. Wu et al. [31] presented the CNN-LSTM hybrid model for obtaining the short-term spatial-motion patterns through CNNs and long-term temporal clues by employing LSTMs. Simonyan and Zisserman [30] built a two-stream CNN architecture for action detection from videos. Wehrmann et al. [48] reported better accuracy results with CNN-LSTM than other baselines on the NPDI pornography dataset [24]. Perez et al. [49] classified pornography videos by deploying CNNs with static and motion information (i.e., MPEG motion vectors). This study yielded the same accuracy scores when static and motion information are combined by late fusion or mid-level fusion (96.4%), but the performance degraded in early fusion (90.5%). Aldahoul et al. [50] explored and compared different state-of-the-art deep learning approaches and found that the pre-trained CNN model (EfficientNet-B7) based features with an SVM classifier performed better in unsafe video detection.

In the literature, many studies explored different YouTube data modalities (i.e., text, audio, image, and video) individually for inappropriate content detection. Yenala et al. [51] proposed a novel CNN-BiLSTM model for automatic detection and filtering of the inappropriate text in query suggestions of YouTube search. Trana et al. [52] investigated naïve Bayes, SVM and CNN classifiers for harassment detection in YouTube comments. Dadvar and Eckert [53] identified cyberbullying in YouTube comments by experimenting with four deep learning models (i.e., CNN, LSTM, BiLSTM and BiLSTM with attention) using GloVe, SSWE and random word embeddings. In this study, the BiLSTM with an attention mechanism performed better than conventional machine learning algorithms. Mohaouchane et al. [54] performed automatic detection of YouTube offensive comments using the CNN-BiLSTM model. Alshamrani [55] used an ensemble classification model to detect the age-inappropriate comments posted on YouTube videos. Later on, they extended the work [17] by leveraging natural language processing in an ensemble classifier. They detect five age-inappropriate remarks (toxic, absence, insult, threat and identity hate) which appear in YouTube comments. Alshamrani et al. [56] applied a neural network model on YouTube comments for toxicity detection and an LDA model on video captions for topic modeling.

The combination of different YouTube modalities is also explored in deep learning architectures. Mariconti et al. [57] showed an ensemble method for the detection of proactive remarks in YouTube videos. Features from three different sources (video metadata, audio transcripts and thumbnails) are extracted and fed to individual classifiers: metadata to SVM with a linear kernel, thumbnails to the random forest, and audio transcripts to an RNN-based gated recurrent unit (GRU) classifier. Hou et al. [10] recognized bloody videos by combining the audio-visual features and passing them to the CNN-LSTM model. Ali and Senan [11] also explored the audio-visual features of YouTube videos with the DNN model for violence classification. Sumon et al. [58] proposed the CNN-LSTM model for identifying the violent crowd flow in YouTube videos. Alghowinem [59] showed a multimodal approach for inappropriate content filtering of the YouTube Kids application. Papadamou et al. [20] studied the Elsagate phenomenon in the YouTube Kids application by using YouTube video metadata such as video title, tags, statistics (likes/dislikes, view count, comments count), style features (same as proposed in [21]) and thumbnails. The authors deployed a fully connected neural network model for statistics and style features, a CNN for thumbnail features, and LSTMs for video title and tags features. Vitorino et al. [60] explored transfer learning and fine-tuning with CNN for sexually exploitative imagery of children (SEIC) detection. Ishikawa et al. [61] proposed an end-to-end pipeline for combating the Elsagate content in YouTube videos. They evaluated the classification performances of different mobile-platform-based deep learning architectures such as GoogLeNet, SqueezeNet, NASNet, and MobileNetV2. Tahir et al. [22] processed multimodal data with the CNN-LSTM framework for disturbing and fake embedded content in videos. Lastly, Singh et al. [62] performed a fine-grained approach for child unsafe video content detection. They employed the LSTM autoencoder to learn video representations from CNN (VGG-16) feature descriptors.
FIGURE 2. The proposed framework architecture works into three stages for children inappropriate video content detection and classification. In the first
stage, video clips are preprocessed to discard irrelevant video frames and transform the selected frames into fixed-size images (224,224,3). Next, the
frames are fed to EfficientNet-B7 to get feature vectors. All feature vectors are reshaped and passed into the two-layer stack of BiLSTMs for video
representation. A fully connected layer followed by an output layer with softmax activation is integrated to return the probabilities of each video clip
against the three classes including safe (class 0), fantasy violence (class 1) and sexual-nudity (class 2).
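To make the three processing stages concrete, the following is a minimal Keras sketch of the classifier described above, assuming the EfficientNet-B7 frame features are precomputed and stacked into a fixed-length sequence per clip. The clip length, optimizer and loss are illustrative assumptions; the layer sizes (two BiLSTM layers, a 4096-unit ReLU dense layer, 0.3 dropout, and a three-way softmax output) follow the values reported later in Section V.

```python
# Illustrative sketch only, not the authors' released code.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES = 30     # assumed number of frames sampled per clip
FEATURE_DIM = 2560  # EfficientNet-B7 pooled feature size
NUM_CLASSES = 3     # safe, fantasy violence, sexual-nudity

def build_classifier(hidden_units=128):
    """Two stacked BiLSTM layers over precomputed EfficientNet-B7 frame
    features, followed by the dense head described in the text."""
    inputs = layers.Input(shape=(NUM_FRAMES, FEATURE_DIM))
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(inputs)
    x = layers.Bidirectional(layers.LSTM(hidden_units))(x)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",                        # assumed optimizer
                  loss="sparse_categorical_crossentropy",  # integer class labels 0-2
                  metrics=["accuracy"])
    return model
```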
state h_{t-1}, and the last memory cell state c_{t-1}, the following equations are used to implement an LSTM model:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)  (1)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)  (2)
c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)  (3)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)  (4)
h_t = o_t \cdot \tanh(c_t)  (5)

where \sigma represents the sigmoid activation function; i, f, c, and o denote the input gate, forget gate, memory cell state and output gate at time t, respectively. W and b denote the weight matrices and bias vectors. Considering the video classification problem, one potential drawback of LSTM is that it captures the past context only. For getting the full context of any video, it is important to consider both directions, i.e., the past and future context of the video. Therefore, the bidirectional LSTM appears to be a suitable option in video classification as it preserves the information in both directions, as shown in Fig. 3.

In BiLSTM, there are two distinct hidden layers referred to as the forward hidden layer (h_t^f) and the backward hidden layer (h_t^b). The forward hidden layer h_t^f considers the input vector x_t in ascending order, i.e., t = 1, 2, 3, ..., T, and the backward hidden layer h_t^b in descending order, i.e., t = T, T-1, T-2, ..., 1. Lastly, the output y_t is generated by combining the results of h_t^f and h_t^b. The following equations are used to implement the BiLSTM model:

h_t^f = \tanh(W_{xh}^f x_t + W_{hh}^f h_{t-1}^f + b_h^f)  (6)
h_t^b = \tanh(W_{xh}^b x_t + W_{hh}^b h_{t+1}^b + b_h^b)  (7)
y_t = W_{hy}^f h_t^f + W_{hy}^b h_t^b + b_y  (8)

It is noticed that adding an excessive number of BiLSTM layers increases the network complexity and slows down the training process. Hence, this work employed two layers of BiLSTM to understand the video representations.

2) BiLSTM NETWORK WITH AN ATTENTION MECHANISM
A neural network architecture with an attention mechanism decides when to look into data (or in this case, segments of videos) by automatically giving a higher level of focus to the feature vectors with the most valuable information than to the feature vectors with less valuable information. The architecture of BiLSTM with an attention mechanism is depicted in Fig. 4. Consider the final hidden state of the i-th BiLSTM as h_{it}, which is computed as:

h_{it} = [h_t^f, h_t^b]  (9)

Then, the attention mechanism is computed by using the following equations:

e_{it} = \tanh(W_a h_{it} + b_a)  (10)
a_{it} = \frac{\exp(e_{it})}{\sum_{j=1}^{T} \exp(e_{jt})}  (11)
v_t = \sum_{t=1}^{T} a_{it} \cdot h_{it}  (12)

The attention mechanism assigns the attention weight a_{it} to the i-th BiLSTM output vector at time t, as calculated in equation (11). W_a and b_a represent the weight and bias of the attention layer. Finally, the attention layer generates an attention vector v_t, which is calculated as the weighted sum of the multiplication between the attention weight a_{it} and the i-th BiLSTM output vector at time t.
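As a concrete illustration of the attention computation in equations (9) to (12), the sketch below implements the attention block as a custom Keras layer that scores each BiLSTM time step, normalizes the scores with a softmax over time, and returns the weighted sum of the hidden states. It assumes a single scalar score per time step and is not the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TemporalAttention(layers.Layer):
    """e_t = tanh(W_a h_t + b_a); a_t = softmax of e_t over time;
    v = sum_t a_t * h_t (Eqs. (10)-(12), one scalar score per step)."""
    def build(self, input_shape):
        dim = int(input_shape[-1])
        self.w_a = self.add_weight(name="w_a", shape=(dim, 1),
                                   initializer="glorot_uniform", trainable=True)
        self.b_a = self.add_weight(name="b_a", shape=(1,),
                                   initializer="zeros", trainable=True)

    def call(self, hidden_states):                                   # (batch, T, 2*units)
        e = tf.tanh(tf.matmul(hidden_states, self.w_a) + self.b_a)   # (batch, T, 1)
        a = tf.nn.softmax(e, axis=1)                                 # attention weights
        return tf.reduce_sum(a * hidden_states, axis=1)              # (batch, 2*units)

# Assumed usage after a bidirectional layer, as described in Section V-C:
# x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(features)
# x = TemporalAttention()(x)
```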
The manual annotation process results in a total of 111,561 video clips, including 57908 clips belonging to the safe class, 27003 clips in the sexual-nudity class, and 26650 clips in the fantasy violence class. Overall, there is a balanced distribution of safe and unsafe video clips (1.08:1). We also intend to make this dataset publicly available for the research community. Table 1 summarizes the overall distribution of manually annotated cartoon videos according to the three classes.

TABLE 1. Cartoon dataset distribution.
with an 80:20 split such that 80% of the data is allocated for training and 20% for evaluation and testing of models.

D. EVALUATION METRICS
The performances of the multiclass video classification models are evaluated by calculating the accuracy, precision, recall and f1 score using confusion matrices. Accuracy is the ratio of the number of correct predictions for each class to the total number of predictions of all classes, and is calculated as:

Accuracy = \frac{1}{N} \sum_{c=0}^{N-1} \frac{TP_c + TN_c}{TP_c + TN_c + FP_c + FN_c} \times 100\%  (15)

In equation (15), c represents a particular class index from the N classes, TP denotes the true positives, TN the true negatives, FP the false positives and FN the false negatives. Precision is the ratio of the total number of correct predictions of positive instances to the total number of predictions with positive instances. It is calculated as:

Precision = \frac{1}{N} \sum_{c=0}^{N-1} \frac{TP_c}{TP_c + FP_c} \times 100\%  (16)

The recall (also known as sensitivity) is the ratio of the total number of correct predictions of positive instances to the total number of instances in an actual class. The recall and f1 score are calculated by using equations (17) and (18):

Recall = \frac{1}{N} \sum_{c=0}^{N-1} \frac{TP_c}{TP_c + FN_c} \times 100\%  (17)

F1\ Score = 2 \times \frac{Precision \times Recall}{Precision + Recall}  (18)
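The sketch below shows how equations (15) to (18) can be computed from a multiclass confusion matrix, assuming rows hold the actual classes and columns the predicted classes; it is an illustrative macro-averaged implementation rather than the exact evaluation script used in the experiments.

```python
import numpy as np

def macro_metrics(conf_matrix: np.ndarray):
    """Macro-averaged accuracy, precision, recall and F1 (Eqs. (15)-(18))."""
    n = conf_matrix.shape[0]          # number of classes
    total = conf_matrix.sum()
    acc, prec, rec = [], [], []
    for c in range(n):
        tp = conf_matrix[c, c]
        fp = conf_matrix[:, c].sum() - tp
        fn = conf_matrix[c, :].sum() - tp
        tn = total - tp - fp - fn
        acc.append((tp + tn) / (tp + tn + fp + fn))
        prec.append(tp / (tp + fp) if (tp + fp) else 0.0)
        rec.append(tp / (tp + fn) if (tp + fn) else 0.0)
    accuracy = 100.0 * float(np.mean(acc))                  # Eq. (15)
    precision = 100.0 * float(np.mean(prec))                # Eq. (16)
    recall = 100.0 * float(np.mean(rec))                    # Eq. (17)
    f1 = 2 * precision * recall / (precision + recall)      # Eq. (18)
    return accuracy, precision, recall, f1
```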
V. RESULTS AND DISCUSSION
In this section, the results obtained through experimental evaluations of different machine learning and deep learning approaches for video classification are presented and discussed. Afterwards, the best-proposed approach is compared with existing state-of-the-art methods from the literature.

A. ANALYSIS OF PRE-TRAINED CNN MODEL VARIANTS
At first, three pre-trained convolutional neural network models including Inception-V3, VGG-19 and EfficientNet-B7 are employed as video classifiers to determine the performances of these ImageNet pre-trained CNN models on our multiclass video classification problem. For each model, the last three layers of the pipeline are discarded and replaced with a fully connected layer with a softmax activation function using three output nodes.

The transfer learning approach is implemented in a manner where the weights of all layers in the model are fixed except the last fully connected layer. After training each pre-trained convolutional neural network model using the transfer learning approach, as shown in Table 3, it is analyzed that the EfficientNet-B7 model performs comparatively better than VGG-19 and Inception-V3 on the YouTube cartoon video dataset. It has achieved the highest recall score, which means that the EfficientNet-B7 model retrieves more relevant instances than the remaining two pre-trained CNN models. Hence, further experiments are carried out with EfficientNet-B7 as a base classifier.
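The following is a hedged sketch of the transfer learning setup just described: the pre-trained EfficientNet-B7 backbone is frozen and only a newly attached three-unit softmax layer is trained, while setting fine_tune=True corresponds to the fine-tuning variant compared later in Table 4. Dropping the original top layers via include_top=False stands in for discarding the last three layers; the optimizer and loss are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_efficientnet_classifier(fine_tune=False):
    """EfficientNet-B7 used directly as a frame classifier with a new
    three-unit softmax output layer replacing the original top layers."""
    base = tf.keras.applications.EfficientNetB7(
        include_top=False, weights="imagenet", pooling="avg",
        input_shape=(224, 224, 3))      # include_top=False drops the original head
    base.trainable = fine_tune          # False = transfer learning, True = fine-tuning
    model = models.Sequential([
        base,
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam",                         # assumed
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```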
B. ANALYSIS OF EFFICIENT-NET FEATURES WITH DIFFERENT CLASSIFIER VARIANTS
In this section, the performances of different classifiers trained on EfficientNet visual features are evaluated. For this purpose, some machine learning algorithms are considered for the video classification task. The experimental evaluation of Xu et al. [76] also presented that even a simple machine learning algorithm can play an effective role in video classification, considering the features are distinctive enough.

This study applied three machine learning algorithms, namely SVM, KNN and random forest, as video classifiers
by training on EfficientNet features. The evaluation results of Table 4 show that among the three machine learning classifiers, SVM with RBF kernel achieved the highest accuracy of 72.48% on EfficientNet visual features. It is followed by the random forest and KNN (neighbors = 3) classifiers with accuracy values of 68.69% and 60.12%, respectively. The other evaluation metrics, i.e., precision, recall and f1 score, of SVM with RBF kernel outperformed the random forest and KNN classifiers. Apart from machine learning classifiers, an experiment is performed where EfficientNet-B7 itself is treated as a sole classifier by replacing the last three layers of the architecture with a fully connected layer (units = 512) followed by an output layer (activation = softmax) of three units. Two main methods, transfer learning and fine-tuning of EfficientNet-B7, are implemented for the video classification task. In transfer learning, the weights of all layers of EfficientNet are fixed except the last fully connected layer, while fine-tuning updates the weights of the entire model. From Table 4, it can be observed that the performances of all machine learning algorithms are relatively poor compared with transfer learning or fine-tuning of the EfficientNet-B7 model. In comparison, the EfficientNet model using the transfer learning approach performed slightly better (accuracy = 89.07%) than the fine-tuned model (accuracy = 87.89%). The main reason for such model behavior is that the ImageNet classification dataset is much larger in scope (14 million images) than our self-curated cartoon video dataset (2.5 million video frames) used for fine-tuning of the EfficientNet model. By further examining the evaluation results of other classifier variants, it is noticed that although the EfficientNet with transfer learning method yields better results than the other classifiers, it still has a high ratio of false negatives (recall = 79.54%). Hence, model training by using a single fully connected layer with a pre-trained CNN architecture is not sufficient. It requires some deep neural network classifier to effectively understand the hidden sequences of video representations by returning high precision-recall values for video classification. Thus, further experiments are conducted using EfficientNet-B7 as a feature extractor with other neural network models to not let any child inappropriate content go undetected in video classification.

C. ANALYSIS OF EFFICIENT-NET WITH BiLSTM AND ATTENTION-BASED BiLSTM CLASSIFIER VARIANTS
The experiments of the previous sections revealed that the ImageNet pre-trained EfficientNet-B7 works better as a feature extractor and that this architecture, in conjunction with any deep learning algorithm, can successfully detect and classify unsafe video content. The bidirectional LSTM, a supervised deep learning algorithm, is opted for developing a deep learning-based framework because it preserves the contextual information in both directions of time-series data, which appears to be a suitable choice in our video classification problem. The experiments are conducted using the two-layer stack of BiLSTMs followed by fully connected (units = 4096, activation = ReLU), dropout (value = 0.3) and softmax (output units = 3) layers. Details of the complete architecture are mentioned in Section III. This study implemented and evaluated different hidden units (i.e., 64, 128, 256, and 512) in each BiLSTM layer. For simplicity and consistency, the same number of hidden units is used in both layers of the BiLSTM networks. You and Korhonen [77] reported that adding an attention
TABLE 4. Evaluation results in terms of accuracy, precision, recall and f1 score of using EfficientNet features with different classifier variants.
TABLE 5. Evaluation results in terms of accuracy, precision, recall and f1 score of using EfficientNet-B7 with BiLSTM and attention-based BiLSTM classifier
variants.
mechanism after the BiLSTM layer boosts the performance of deep neural networks for video classification. Hence, an attention mechanism-based BiLSTM model is also examined by integrating an attention block after each bidirectional layer, followed by fully connected (units = 4096, activation = ReLU), dropout (value = 0.3) and softmax (output units = 3) layers. In all experiments, models are trained and evaluated for 20 epochs with an 80:20 split in which 80% of the YouTube cartoon dataset is used for training purposes and 20% for the evaluation and testing of the model. The trained model from the last epoch (epoch = 20) is tested for obtaining the final video classification scores. Table 5 demonstrates the experimental results of the attention and without-attention mechanism-based EfficientNet-BiLSTM models by working with different numbers of hidden units (i.e., 64, 128, 256, and 512) in each bidirectional LSTM layer of the proposed framework.
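A short sketch of this experiment loop is given below; it reuses the build_classifier head sketched after Fig. 2, and the placeholder arrays stand in for the precomputed features of the 80:20 split. The hidden-unit values and the 20 epochs come from the text, while the batch size is an assumption.

```python
import numpy as np

# Placeholder arrays standing in for the precomputed EfficientNet-B7 features
# and labels of the 80:20 train/validation split described above.
train_features = np.zeros((8, 30, 2560), dtype="float32")
train_labels = np.zeros((8,), dtype="int32")
val_features = np.zeros((2, 30, 2560), dtype="float32")
val_labels = np.zeros((2,), dtype="int32")

results = {}
for units in (64, 128, 256, 512):
    model = build_classifier(hidden_units=units)   # head sketched after Fig. 2
    history = model.fit(train_features, train_labels,
                        validation_data=(val_features, val_labels),
                        epochs=20, batch_size=32)  # batch size assumed
    results[units] = history.history["val_accuracy"][-1]
```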
The first observation from all evaluation results, as mentioned in Table 5, is that all EfficientNet-BiLSTM networks perform comparatively better than the attention-based EfficientNet-BiLSTM networks. For attention-based models, the f1 scores are improved by updating the hidden units from 64 to 128 and 256 in each BiLSTM network as it affects the network trainable parameters during backpropagation. However, it is also found that adding an excessive number of hidden units (i.e., units = 512) gradually decreases the overall network performance. Secondly, the overall behavior of attention mechanism-based neural network models is different from models with no attention blocks. In the EfficientNet-BiLSTM network, upgrading the hidden units in BiLSTMs from 64 to 128 immediately resulted in the best performing model of all experiments by showing the highest f1 score (0.9267). However, it drastically decreases the performance by adding more hidden units in BiLSTMs
TABLE 6. Performance comparison between the proposed EfficientNet-BiLSTM model with existing state-of-the-art video classification techniques.
(i.e., 256 and 512). A detailed performance comparison between the EfficientNet with attention and without attention mechanism-based BiLSTM models is presented through confusion matrices for all three classes. The diagonal values represent the correctly classified number of instances in each class, while anything off the diagonal indicates incorrectly classified instances. The evaluation results of the EfficientNet-BiLSTM model with and without attention blocks for the YouTube cartoon video dataset are illustrated in Fig. 5. Overall, the EfficientNet and BiLSTM network with 128 hidden units in each bidirectional layer achieved the highest validation (f1 score = 0.9274) and testing scores (f1 score = 0.9267).

D. PERFORMANCE COMPARISON WITH EXISTING STATE-OF-THE-ART CLASSIFICATION METHODS
We compare the performance of the proposed EfficientNet-BiLSTM model with existing state-of-the-art models and methods employed for inappropriate content classification using different YouTube data modalities.

Table 6 summarizes the results and quality scores of the existing and proposed classification methods. It is worth noting that existing studies explored different YouTube modalities (i.e., text, audio, video, and metadata) for different classifications. The most common strategy in existing studies, for unsafe content classification, is using pre-trained CNN models with either LSTM-based classifiers [10], [20], [22], [48], [62] or machine learning-based classifiers [50], [60], [61]. Compared with the approaches that use pre-trained CNN features with machine learning classifiers, our EfficientNet-BiLSTM classifier method yielded higher accuracy than the GoogLeNet-SVM [60], fine-tuned NASNet-SVM [61], and EfficientNet-SVM [50] approaches by significant margins of 3.06%, 7.79%, and 10.1%, respectively. In comparison with base models using pre-trained CNNs and LSTMs, the Inception-V3 with LSTM approach [20] reported an f1 score of 0.828, which is much lower than our with-attention (f1 score = 0.9195) and without-attention (f1 score = 0.9267) BiLSTM classifier variants. It is also worth mentioning that the ResNet-LSTM model in existing studies [10], [48] attained comparable accuracy results to our proposed technique. It can be explained by the fact that the studies reporting these approaches performed binary video classification, which is much simpler than multiclass video classification. Note that the proposed model still outperformed some existing approaches of multiclass video classification using VGG-LSTM-based models [22], [62], which shows that BiLSTM has high robustness on time-series data modeling. In addition, some studies [11], [17], [21] used simple convolutional neural networks and reported the lowest classification accuracy and f1 scores. Hence, it is deduced that simple CNNs are not sufficient to understand the complexities of YouTube data modalities. Overall, the performance comparison showed that the proposed EfficientNet using BiLSTM (hidden units = 128) surpassed the existing studies in inappropriate video content detection and classification.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel deep learning-based framework is proposed for child inappropriate video content detection and classification. Transfer learning using the EfficientNet-B7 architecture is employed to extract the features of videos.
The extracted video features are processed through the BiLSTM network, where the model learns the effective video representations and performs multiclass video classification. All evaluation experiments are performed by using a manually annotated cartoon video dataset of 111,561 video clips collected from YouTube. The evaluation results indicated that the proposed framework of EfficientNet-BiLSTM (with hidden units = 128) exhibits higher performance (accuracy = 95.66%) than the other experimented models, including EfficientNet-FC, EfficientNet-SVM, EfficientNet-KNN, EfficientNet-Random Forest, and the attention mechanism-based EfficientNet-BiLSTM models (with hidden units = 64, 128, 256, and 512). Moreover, the performance comparison with existing state-of-the-art models also demonstrated that our BiLSTM-based framework surpassed other existing models and methods by achieving the highest recall score of 92.22%. The advantages of the proposed deep learning-based children inappropriate video content detection system are as follows:
1) It works by considering the real-time conditions by processing the video with a speed of 22 fps using the EfficientNet-B7 and BiLSTM-based deep learning framework, which helps in filtering the live-captured videos.
2) It can assist any video sharing platform to either remove the video containing unsafe clips or blur/hide any portion with unsettling frames.
3) It may also help in the development of parental control solutions on the Internet through plugins or browser extensions where child unsafe content can be filtered automatically.
Furthermore, our methodology to detect inappropriate children content from YouTube is independent of YouTube video metadata, which can easily be altered by malicious uploaders to deceive the audiences. In the future, we intend to combine the temporal stream using optical flow frames with the spatial stream of the RGB frames to further improve the model performance by better understanding the global representations of videos. We also aim to increase the classification labels to target the different types of inappropriate children content of YouTube videos.

ACKNOWLEDGMENT
The authors are appreciative of Prof. Dr. Hafiz Adnan Habib (Head of Department of Computer Engineering, University of Engineering and Technology, Taxila) for providing valuable advice and suggestions in this study.

REFERENCES
[1] L. Ceci. YouTube Usage Penetration in the United States 2020, by Age Group. Accessed: Nov. 1, 2021. [Online]. Available: https://www.statista.com/statistics/296227/us-youtube-reach-age-gender/
[2] P. Covington, J. Adams, and E. Sargin, ''Deep neural networks for YouTube recommendations,'' in Proc. 10th ACM Conf. Recommender Syst., Sep. 2016, pp. 191–198, doi: 10.1145/2959100.2959190.
[3] M. M. Neumann and C. Herodotou, ''Evaluating YouTube videos for young children,'' Educ. Inf. Technol., vol. 25, no. 5, pp. 4459–4475, Sep. 2020, doi: 10.1007/s10639-020-10183-7.
[4] J. Marsh, L. Law, J. Lahmar, D. Yamada-Rice, B. Parry, and F. Scott, Social Media, Television and Children. Sheffield, U.K.: Univ. Sheffield, 2019. [Online]. Available: https://www.stac-study.org/downloads/STAC_Full_Report.pdf
[5] L. Ceci. YouTube—Statistics & Facts. Accessed: Sep. 1, 2021. [Online]. Available: https://www.statista.com/topics/2019/youtube/
[6] M. M. Neumann and C. Herodotou, ''Young children and YouTube: A global phenomenon,'' Childhood Educ., vol. 96, no. 4, pp. 72–77, Jul. 2020, doi: 10.1080/00094056.2020.1796459.
[7] S. Livingstone, L. Haddon, A. Görzig, and K. Ólafsson, Risks and Safety on the Internet: The Perspective of European Children: Full Findings and Policy Implications From the EU Kids Online Survey of 9-16 Year Olds and Their Parents in 25 Countries. London, U.K.: EU Kids Online, 2011. [Online]. Available: http://eprints.lse.ac.uk/id/eprint/33731
[8] B. J. Bushman and L. R. Huesmann, ''Short-term and long-term effects of violent media on aggression in children and adults,'' Arch. Pediatrics Adolescent Med., vol. 160, no. 4, pp. 348–352, 2006, doi: 10.1001/archpedi.160.4.348.
[9] S. Maheshwari. (2017). On YouTube Kids, Startling Videos Slip Past Filters. The New York Times. [Online]. Available: https://www.nytimes.com/2017/11/04/business/media/youtube-kids-paw-patrol.html
[10] C. Hou, X. Wu, and G. Wang, ''End-to-end bloody video recognition by audio-visual feature fusion,'' in Proc. Chin. Conf. Pattern Recognit. Comput. Vis. (PRCV), 2018, pp. 501–510, doi: 10.1007/978-3-030-03398-9_43.
[11] A. Ali and N. Senan, ''Violence video classification performance using deep neural networks,'' in Proc. Int. Conf. Soft Comput. Data Mining, 2018, pp. 225–233, doi: 10.1007/978-3-319-72550-5_22.
[12] H.-E. Lee, T. Ermakova, V. Ververis, and B. Fabian, ''Detecting child sexual abuse material: A comprehensive survey,'' Forensic Sci. Int., Digit. Invest., vol. 34, Sep. 2020, Art. no. 301022, doi: 10.1016/j.fsidi.2020.301022.
[13] R. Brandom. (2017). Inside Elsagate, The Conspiracy Fueled War on Creepy YouTube Kids Videos. [Online]. Available: https://www.theverge.com/2017/12/8/16751206/elsagate-youtube-kids-creepy-conspiracy-theory
[14] Reddit. What is ElsaGate? Accessed: Dec. 14, 2020. [Online]. Available: https://www.reddit.com/r/ElsaGate/comments/6o6baf/
[15] B. Burroughs, ''YouTube kids: The app economy and mobile parenting,'' Social Media + Society, vol. 3, May 2017, Art. no. 2056305117707189, doi: 10.1177/2056305117707189.
[16] H. Wilson, ''YouTube is unsafe for children: YouTube's safeguards and the current legal framework are inadequate to protect children from disturbing content,'' Seattle J. Technol., Environ. Innov. Law, vol. 10, no. 1, p. 8, 2020. [Online]. Available: https://digitalcommons.law.seattleu.edu/sjteil/vol10/iss1/8
[17] S. Alshamrani, A. Abusnaina, M. Abuhamad, D. Nyang, and D. Mohaisen, ''Hate, obscenity, and insults: Measuring the exposure of children to inappropriate comments in YouTube,'' in Proc. Companion Proc. Web Conf., Apr. 2021, pp. 508–515, doi: 10.1145/3442442.3452314.
[18] N. Elias and I. Sulkin, ''YouTube viewers in diapers: An exploration of factors associated with amount of toddlers' online viewing,'' Cyberpsychol., J. Psychosoc. Res. Cyberspace, vol. 11, no. 3, p. 2, Nov. 2017, doi: 10.5817/cp2017-3-2.
[19] D. Craig and S. Cunningham, ''Toy unboxing: Living in a(n unregulated) material world,'' Media Int. Aust., vol. 163, no. 1, pp. 77–86, May 2017, doi: 10.1177/1329878X17693700.
[20] K. Papadamou, A. Papasavva, S. Zannettou, J. Blackburn, N. Kourtellis, I. Leontiadis, G. Stringhini, and M. Sirivianos, ''Disturbed YouTube for kids: Characterizing and detecting inappropriate videos targeting young children,'' in Proc. Int. AAAI Conf. Web Soc. Media, 2020, pp. 522–533. [Online]. Available: https://ojs.aaai.org/index.php/ICWSM/article/view/7320/7174
[21] R. Kaushal, S. Saha, P. Bajaj, and P. Kumaraguru, ''KidsTube: Detection, characterization and analysis of child unsafe content & promoters on YouTube,'' in Proc. 14th Annu. Conf. Privacy, Secur. Trust (PST), Dec. 2016, pp. 157–164, doi: 10.1109/pst.2016.7906950.
[22] R. Tahir, F. Ahmed, H. Saeed, S. Ali, F. Zaffar, and C. Wilson, ''Bringing the kid back into YouTube kids: Detecting inappropriate content on video streaming platforms,'' in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Mining, Aug. 2019, pp. 464–469, doi: 10.1145/3341161.3342913.
[23] A. Ulges, C. Schulze, D. Borth, and A. Stahl, ''Pornography detection in video benefits (a lot) from a multi-modal approach,'' in Proc. ACM Int. Workshop Audio Multimedia Methods Large-Scale Video Anal., 2012, pp. 21–26, doi: 10.1145/2390214.2390222.
[24] C. Caetano, S. Avila, S. Guimaraes, and A. D. A. Araújo, ''Pornography detection using BossaNova video descriptor,'' in Proc. 22nd Eur. Signal Process. Conf., 2014, pp. 1681–1685. [Online]. Available: https://ieeexplore.ieee.org/document/6952616
[25] L. Duan, G. Cui, W. Gao, and H. Zhang, ''Adult image detection method base-on skin color model and support vector machine,'' in Proc. Asian Conf. Comput. Vis., 2002, pp. 797–800. [Online]. Available: http://aprs.dictaconference.org/accv2002/accv2002_proceedings/Duan797.pdf
[26] C. Jansohn, A. Ulges, and T. M. Breuel, ''Detecting pornographic video content by combining image features with motion information,'' in Proc. 17th ACM Int. Conf. Multimedia, 2009, pp. 601–604, doi: 10.1145/1631272.1631366.
[27] P. Zhou, Q. Ding, H. Luo, and X. Hou, ''Violence detection in surveillance video using low-level features,'' PLoS ONE, vol. 13, no. 10, Oct. 2018, Art. no. e0203668, doi: 10.1371/journal.pone.0203668.
[28] M. B. Garcia, T. F. Revano, B. G. M. Habal, J. O. Contreras, and J. B. R. Enriquez, ''A pornographic image and video filtering application using optimized nudity recognition and detection algorithm,'' in Proc. IEEE 10th Int. Conf. Humanoid, Nanotechnol., Inf. Technol., Commun. Control, Environ. Manage. (HNICEM), Nov. 2018, pp. 1–5, doi: 10.1109/HNICEM.2018.8666227.
[29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, ''Large-scale video classification with convolutional neural networks,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1725–1732, doi: 10.1109/cvpr.2014.223.
[30] K. Simonyan and A. Zisserman, ''Two-stream convolutional networks for action recognition in videos,'' in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 568–576. [Online]. Available: https://dl.acm.org/doi/10.5555/2968826.2968890
[31] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, ''Modeling spatial-temporal clues in a hybrid deep learning framework for video classification,'' in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, pp. 461–470, doi: 10.1145/2733373.2806222.
[32] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, ''Beyond short snippets: Deep networks for video classification,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4694–4702, doi: 10.1109/CVPR.2015.7299101.
[33] J. P. Verma and S. Agrawal, ''Big data analytics: Challenges and applications for text, audio, video, and social media data,'' Int. J. Soft Comput., Artif. Intell. Appl., vol. 5, no. 1, pp. 41–51, Feb. 2016, doi: 10.5121/ijscai.2016.5105.
[34] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, ''Exploiting feature and class relationships in video categorization with regularized deep neural networks,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2017, doi: 10.1109/TPAMI.2017.2670560.
[35] M. J. Jones and J. M. Rehg, ''Statistical color models with application to skin detection,'' in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1999, pp. 274–280, doi: 10.1109/CVPR.1999.786951.
[36] T. Endeshaw, J. Garcia, and A. Jakobsson, ''Classification of indecent videos by low complexity repetitive motion detection,'' in Proc. 37th IEEE Appl. Imag. Pattern Recognit. Workshop, Oct. 2008, pp. 1–7, doi: 10.1109/AIPR.2008.4906438.
[37] N. Rea, G. Lacey, R. Dahyot, and C. Lambe, ''Multimodal periodicity analysis for illicit content detection in videos,'' in Proc. 3rd Eur. Conf. Vis. Media Prod., 2006, pp. 106–114, doi: 10.1049/cp:20061978.
[38] Y. Liu, X. Wang, Y. Zhang, and S. Tang, ''Fusing audio-words with visual features for pornographic video detection,'' in Proc. IEEE 10th Int. Conf. Trust, Secur. Privacy Comput. Commun., Nov. 2011, pp. 1488–1493, doi: 10.1109/TRUSTCOM.2011.205.
[39] Y. Liu, Y. Yang, H. Xie, and S. Tang, ''Fusing audio vocabulary with visual features for pornographic video detection,'' Future Gener. Comput. Syst., vol. 31, pp. 69–76, Feb. 2014, doi: 10.1016/j.future.2012.08.012.
[40] V. M. T. Ochoa, S. Y. Yayilgan, and F. A. Cheikh, ''Adult video content detection using machine learning techniques,'' in Proc. 8th Int. Conf. Signal Image Technol. Internet Based Syst., Nov. 2012, pp. 967–974, doi: 10.1109/sitis.2012.143.
[41] S. Jung, J. Youn, and S. Sull, ''A real-time system for detecting indecent videos based on spatiotemporal patterns,'' IEEE Trans. Consum. Electron., vol. 60, no. 4, pp. 696–701, Nov. 2014, doi: 10.1109/TCE.2014.7027345.
[42] S. Tang, T.-S. Chua, J. Li, Y. Zhang, C. Xie, M. Li, Y. Liu, X. Hua, Y.-T. Zheng, and J. Tang, ''Pornprobe: An LDA-SVM based pornography detection system,'' in Proc. 17th ACM Int. Conf. Multimedia, 2009, pp. 1003–1004, doi: 10.1145/1631272.1631490.
[43] S. Lee, W. Shim, and S. Kim, ''Hierarchical system for objectionable video detection,'' IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 677–684, May 2009, doi: 10.1109/TCE.2009.5174439.
[44] A. P. B. Lopes, S. E. F. D. Avila, A. N. A. Peixoto, R. S. Oliveira, M. D. M. Coelho, and A. D. A. Araújo, ''Nude detection in video using bag-of-visual-features,'' in Proc. XXII Brazilian Symp. Comput. Graph. Image Process., Oct. 2009, pp. 224–231, doi: 10.1109/sibgrapi.2009.32.
[45] S. Reddy, N. Srikanth, and G. Sharvani, ''Development of kid-friendly YouTube access model using deep learning,'' in Data Science and Security. Singapore: Springer, 2021, pp. 243–250, doi: 10.1007/978-981-15-5309-7_26.
[46] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, ''Multimodal deep learning,'' in Proc. 28th Int. Conf. Mach. Learn. (ICML), 2011, pp. 689–696. [Online]. Available: https://dl.acm.org/doi/10.5555/3104482.3104569
[47] I. Sutskever, O. Vinyals, and Q. V. Le, ''Sequence to sequence learning with neural networks,'' in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, pp. 3104–3112. [Online]. Available: https://dl.acm.org/doi/10.5555/2969033.2969173
[48] J. Wehrmann, G. S. Simōes, R. C. Barros, and V. F. Cavalcante, ''Adult content detection in videos with convolutional and recurrent neural networks,'' Neurocomputing, vol. 272, pp. 432–438, Jan. 2018, doi: 10.1016/j.neucom.2017.07.012.
[49] M. Perez, S. Avila, D. Moreira, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, ''Video pornography detection through deep learning techniques and motion information,'' Neurocomputing, vol. 230, pp. 279–293, Mar. 2017, doi: 10.1016/j.neucom.2016.12.017.
[50] N. Aldahoul, H. A. Karim, M. H. L. Abdullah, and A. S. Ba Wazir, ''An evaluation of traditional and CNN-based feature descriptors for cartoon pornography detection,'' IEEE Access, vol. 9, pp. 39910–39925, 2021, doi: 10.1109/ACCESS.2021.3064392.
[51] H. Yenala, A. Jhanwar, M. K. Chinnakotla, and J. Goyal, ''Deep learning for detecting inappropriate content in text,'' Int. J. Data Sci. Analytics, vol. 6, no. 4, pp. 273–286, Dec. 2018, doi: 10.1007/s41060-017-0088-4.
[52] R. E. Trana, C. E. Gomez, and R. F. Adler, ''Fighting cyberbullying: An analysis of algorithms used to detect harassing text found on YouTube,'' in Proc. Int. Conf. Appl. Hum. Factors Ergonom., 2020, pp. 9–15, doi: 10.1007/978-3-030-51328-3_2.
[53] M. Dadvar and K. Eckert, ''Cyberbullying detection in social networks using deep learning based models,'' in Proc. Int. Conf. Big Data Analytics Knowl. Discovery, 2020, pp. 245–255, doi: 10.1201/9781003134527-11.
[54] H. Mohaouchane, A. Mourhir, and N. S. Nikolov, ''Detecting offensive language on Arabic social media using deep learning,'' in Proc. 6th Int. Conf. Soc. Netw. Anal., Manage. Secur. (SNAMS), Oct. 2019, pp. 466–471, doi: 10.1109/snams.2019.8931839.
[55] S. Alshamrani, ''Detecting and measuring the exposure of children and adolescents to inappropriate comments in YouTube,'' in Proc. 29th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2020, pp. 3213–3216, doi: 10.1145/3340531.3418511.
[56] S. Alshamrani, M. Abuhamad, A. Abusnaina, and D. A. Mohaisen, ''Investigating online toxicity in users interactions with the mainstream media channels on YouTube,'' in Proc. CIKM Workshops, 2020, pp. 1–6. [Online]. Available: http://ceur-ws.org/Vol-2699/paper39.pdf
[57] E. Mariconti, G. Suarez-Tangil, J. Blackburn, E. De Cristofaro, N. Kourtellis, and I. Leontiadis, '''You know what to do' proactive detection YouTube videos targeted by coordinated hate attacks,'' in Proc. ACM Hum.-Comput. Interact., vol. 3, pp. 1–21, Nov. 2019, doi: 10.1145/3359309.
[58] M. Gao, J. Jiang, L. Ma, S. Zhou, G. Zou, J. Pan, and Z. Liu, ''Violent crowd behavior detection using deep learning and compressive sensing,'' in Proc. Chin. Control Decis. Conf. (CCDC), Jun. 2019, pp. 613–625, doi: 10.1109/ccdc.2019.8832598.
[59] S. Alghowinem, ''A safer YouTube kids: An extra layer of content filtering using automated multimodal analysis,'' in Proc. SAI Intell. Syst. Conf., 2018, pp. 294–308, doi: 10.1007/978-3-030-01054-6_21.
[60] P. Vitorino, S. Avila, M. Perez, and A. Rocha, ''Leveraging deep neural networks to fight child pornography in the age of social media,'' J. Vis. Commun. Image Represent., vol. 50, pp. 303–313, Jan. 2018, doi: 10.1016/j.jvcir.2017.12.005.
[61] A. Ishikawa, E. Bollis, and S. Avila, ''Combating the elsagate phenomenon: Deep learning architectures for disturbing cartoons,'' in Proc. 7th Int. Workshop Biometrics Forensics (IWBF), May 2019, pp. 1–6, doi: 10.1109/iwbf.2019.8739202.
[62] S. Singh, R. Kaushal, A. B. Buduru, and P. Kumaraguru, ''KidsGUARD: Fine grained approach for child unsafe video representation and detection,'' in Proc. 34th ACM/SIGAPP Symp. Appl. Comput., Apr. 2019, pp. 2104–2111, doi: 10.1145/3297280.3297487.
[63] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ''ImageNet: A large-scale hierarchical image database,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[64] M. Tan and Q. V. Le, ''EfficientNet: Rethinking model scaling for convolutional neural networks,'' 2019, arXiv:1905.11946.
[65] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, ''YouTube-8M: A large-scale video classification benchmark,'' 2016, arXiv:1609.08675.
[66] K. Soomro, A. Roshan Zamir, and M. Shah, ''UCF101: A dataset of 101 human actions classes from videos in the wild,'' 2012, arXiv:1212.0402.
[67] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, ''HMDB: A large video database for human motion recognition,'' in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2556–2563, doi: 10.1109/iccv.2011.6126543.
[68] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, ''The kinetics human action video dataset,'' 2017, arXiv:1705.06950.
[69] L. Wolf, T. Hassner, and I. Maoz, ''Face recognition in unconstrained videos with matched background similarity,'' in Proc. CVPR, Jun. 2011, pp. 529–534, doi: 10.1109/cvpr.2011.5995566.
[70] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, ''Face tracking and recognition with visual constraints in real-world videos,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8, doi: 10.1109/cvpr.2008.4587572.
[71] A. Bermingham, M. Conway, L. McInerney, N. O'Hare, and A. F. Smeaton, ''Combining social network analysis and sentiment analysis to explore the potential for online radicalisation,'' in Proc. Int. Conf. Adv. Soc. Netw. Anal. Mining, Jul. 2009, pp. 231–236, doi: 10.1109/asonam.2009.31.
[72] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, ''YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition,'' in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2712–2719, doi: 10.1109/iccv.2013.337.
[73] J. Xu, T. Mei, T. Yao, and Y. Rui, ''MSR-VTT: A large video description dataset for bridging video and language,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5288–5296, doi: 10.1109/cvpr.2016.571.
[74] N. Ketkar, ''Introduction to Keras,'' in Deep Learning With Python. Berkeley, CA, USA: Springer, 2017, pp. 97–111, doi: 10.1007/978-1-4842-2766-4_7.
[75] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, and J. Dean, ''TensorFlow: A system for large-scale machine learning,'' in Proc. 12th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[76] Z. Xu, Y. Yang, and A. G. Hauptmann, ''A discriminative CNN video representation for event detection,'' in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1798–1807, doi: 10.1109/cvpr.2015.7298789.
[77] J. You and J. Korhonen, ''Attention boosted deep networks for video classification,'' in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 1761–1765, doi: 10.1109/ICIP40778.2020.9190996.

KANWAL YOUSAF received the B.Sc. (Hons.) and M.Sc. degrees in Software Engineering from the University of Engineering and Technology (UET), Taxila, in 2010 and 2013, respectively, where she is currently pursuing the Ph.D. degree. She is also working as a Lecturer at UET, Taxila. Her research interests include deep learning, artificial neural networks and machine learning.

TABASSAM NAWAZ received the Ph.D. degree from the University of Engineering and Technology (UET), Taxila. He is currently serving as a Professor and the Head of the Software Engineering Department, UET, Taxila. His research interests include advanced databases, and object-oriented design and analysis.