
Received December 27, 2021, accepted January 25, 2022, date of publication January 28, 2022, date of current version February 14, 2022.


Digital Object Identifier 10.1109/ACCESS.2022.3147519

A Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos
KANWAL YOUSAF AND TABASSAM NAWAZ
Department of Software Engineering, University of Engineering and Technology, Taxila 47050, Pakistan
Corresponding author: Kanwal Yousaf ([email protected])

ABSTRACT The exponential growth of videos on YouTube has attracted billions of viewers among which
the majority belongs to a young demographic. Malicious uploaders also see this platform as an opportunity
to spread upsetting visual content, for example by using animated cartoon videos to share inappropriate content with
children. Therefore, an automatic real-time video content filtering mechanism should be inte-
grated into social media platforms. In this study, a novel deep learning-based architecture is proposed for the
detection and classification of inappropriate content in videos. For this, the proposed framework employs an
ImageNet pre-trained convolutional neural network (CNN) model known as EfficientNet-B7 to extract video
descriptors, which are then fed to a bidirectional long short-term memory (BiLSTM) network to learn effective
video representations and perform multiclass video classification. An attention mechanism is also integrated
after BiLSTM to apply attention probability distribution in the network. These models are evaluated on a
manually annotated dataset of 111,156 cartoon clips collected from YouTube videos. Experimental results
demonstrated that EfficientNet-BiLSTM (accuracy = 95.66%) performs better than attention mechanism-
based EfficientNet-BiLSTM (accuracy = 95.30%) framework. Secondly, the traditional machine learning
classifiers perform relatively poorly compared with deep learning classifiers. Overall, the architecture of EfficientNet and
BiLSTM with 128 hidden units yielded state-of-the-art performance (f1 score = 0.9267). Furthermore, the
performance comparison against existing state-of-the-art approaches verified that BiLSTM on top of CNN
captures better contextual information of video descriptors in network architecture, and hence achieved better
results in child inappropriate video content detection and classification.

INDEX TERMS Deep learning, social media analysis, video classification, bidirectional LSTM, CNN,
EfficientNet.

I. INTRODUCTION

The creation and consumption of videos on social media platforms have grown drastically over the past few years. Among the social media sites, YouTube predominates as a video sharing platform with a plethora of videos from diverse categories. According to YouTube statistics [1], the global user base of YouTube is over 2 billion registered users and more than 500 hours of video content is uploaded every minute. Consequently, billions of hours of videos are available where users of all age groups can explore generic as well as personalized content [2]. Considering such a large-scale crowdsourced database, it is extremely challenging to monitor and regulate the uploaded content as per platform guidelines. This creates opportunities for malicious users to indulge in spamming activities by misleading the audiences with falsely advertised content (i.e., video, audio or text). The most disruptive behavior by malicious users is to expose young audiences to disturbing content, particularly when it is fabricated as safe for them. Children today spend most of their time on the Internet, and for them the YouTube platform has distinctly established itself as an alternative to traditional screen media (e.g., television) [3], [4]. The YouTube press release [5] also confirmed the high popularity of this social media site among younger audiences compared to other age groups, and the reason for this high level of approval is due to fewer restrictions [6].

The associate editor coordinating the review of this manuscript and approving it for publication was Aasia Khanum.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

Unlike television, children can be presented with any type of content on the Internet due to the lack of regulations. Exposing children to disturbing content is considered one among other internet safety threats (like cyberbullying, cyber predators, hate etc.) [7]. Bushman and Huesmann [8] confirmed that frequent exposure to disturbing video content may have a short-term or long-term impact on children's behavior, emotions and cognition. Many reports [9]–[12] identified the trend of distributing inappropriate content in children's videos. This trend got people's attention when mainstream media reported about the Elsagate controversy [13], [14], where such video material was found on YouTube featuring famous childhood cartoon characters (i.e., Disney characters, superheroes, etc.) portrayed in disturbing scenes; for instance, performing mild violence, stealing, drinking alcohol and involving in nudity or sexual activities.

In an attempt to provide a safe online platform, laws like the children's online privacy protection act (COPPA) impose certain requirements on websites to adopt safety mechanisms for children under the age of 13. YouTube has also included a ''safety mode'' option to filter out unsafe content. Apart from that, YouTube developed the YouTube Kids application to allow parental control over videos that are approved as safe for a certain age group of children [15]. Regardless of YouTube's efforts in controlling the unsafe content phenomena, disturbing videos still appear [16]–[19] even in YouTube Kids [20] due to the difficulty in identifying such content. An explanation for this may be that the rate at which videos are uploaded every minute makes YouTube vulnerable to unwanted content. Besides, the decision-making algorithms of YouTube rely heavily on the metadata of a video (i.e., video title, video description, view count, rating, tags, comments, and community flags). Hence, filtering videos based on the metadata and community flagging is not sufficient to assure the safety of children [21]. Many cases exist on YouTube where safe video titles and thumbnails are used for disturbing content to trick children and their parents. The sparse inclusion of child inappropriate content in videos is another common technique followed by malicious uploaders. Fig. 1 displays an example among such cases where the video title and video clips are safe for children (as shown in Fig. 1(a)) but inappropriate scenes are included in this video (as shown in Fig. 1(b) and Fig. 1(c)). The concerning thing about this example, including many similar cases, is that these videos have millions of views with more likes than dislikes, and have been available for years. Many other cases (as shown in Fig. 1(d)) are also identified where the video or the YouTube channel is not popular, yet contains child unsafe content, especially in the form of animated cartoons. It is evident from these examples that the problem persists irrespective of channel or video popularity. Furthermore, YouTube has disabled the dislike feature of videos, which results in viewers being incapable of getting indirect video content feedback from statistics. Since the YouTube metadata can be easily manipulated, it is suggested to better use video features for detection of inappropriate content than metadata features associated with videos [22].

Prior techniques [23]–[28] addressed the challenge of identifying disturbing content (i.e., violence, pornography, etc.) from videos by using traditional hand-crafted features on frame-level data. In recent years, the state-of-the-art performance of deep learning has motivated researchers to employ it in image and video processing. The most frequent applications of image/video classification employed convolutional neural networks [29]–[31]. Apart from that, the long short-term memory (LSTM), a special type of recurrent neural network (RNN) architecture, has proven to be an effective deep learning model in time-series data analysis [32]. Hence, this study targets the YouTube multiclass video classification problem by leveraging CNN (EfficientNet-B7) and LSTM to learn effective video representations for detection and classification of inappropriate content. We targeted two types of objectionable content geared towards young viewers: one which contains violence, and the second which includes sexual nudity connotations.

The main contributions of this study are threefold:
1. We propose a novel CNN (EfficientNet-B7) and BiLSTM-based deep learning framework for inappropriate video content detection and classification.
2. We present a manually annotated ground truth video dataset of 1860 minutes (111,561 seconds) of cartoon videos for young children (under the age of 13). All videos are collected from YouTube using famous cartoon names as search keywords. Each video clip is annotated for either safe or unsafe class. For the unsafe category, fantasy violence and sexual-nudity explicit content are monitored in videos. We also intend to make this dataset publicly available for the research community.
3. We evaluate the performance of our proposed CNN-BiLSTM framework. Our multiclass video classifier achieved the validation accuracy of 95.66%. Several other state-of-the-art machine learning and deep learning architectures are also evaluated and compared for the task of inappropriate video content detection.

To summarize, this work can assist any video sharing platform to either remove an unsafe video or blur/hide any portion of a video involving unsafe content. Secondly, it may also help in the development of parental control solutions on the web via plugins or browser extensions where child-inappropriate content is filtered automatically. The upcoming sections of the article are outlined as follows: Section II covers the related work in this research area. The methodology of our proposed system is explained in Section III. The experimental setup of the proposed system is presented in Section IV. The results obtained from the experimental setup are analyzed and discussed in Section V, and finally, Section VI concludes the work and directs some future scope for improvements.


FIGURE 1. Examples of YouTube cartoon videos depicting different scenarios (a) video showing safe content with 8.9 million views, uploaded
since 2018 by AOK channel of 1.05 million subscribers, (b) video showing sexual-nudity content with 8.9 million views, uploaded since 2018 by AOK
channel of 1.05 million subscribers, (c) video showing fantasy violence content with 32 million views, uploaded since 2016 by WB Kids channel of
19.9 million subscribers, and (d) video showing sexual-nudity content with 14.6k views, uploaded since 2021 by Lovely Simpsons channel of 421
subscribers.

II. RELATED WORK
The explosion of multimedia data on YouTube has presented a lot of opportunities for researchers [33]. However, the challenging task is to find an optimal technique to understand the context of videos. In both computer vision and machine learning, video classification is studied extensively as one of the fundamental approaches for video understanding [34]. The problem of detecting inappropriate video falls in the category of video classification or event detection problems.

A. MACHINE LEARNING METHOD
Most of the earlier studies used hand-crafted features on images or video-level data in identifying the discriminative patterns of inappropriate content. The skin (i.e., skin color) [35] and motion information based features were used for nudity or pornography detection [26], [36]. The multimodal approach was also followed by fusing different modalities of data (i.e., audio, video) with skin and motion-based features. Rea et al. [37] proposed a periodicity-based audio feature extraction method which was later combined with visual features for illicit content detection in videos.

The machine learning algorithms are usually employed as classifiers. Liu et al. [38] classified the periodicity-based audio and visual segmentation features through a support vector machine (SVM) algorithm with a Gaussian radial basis function (RBF) kernel. Later on, they extended the framework [39] by applying the energy envelope (EE) and bag-of-words (BoW)-based audio representations and visual features. Ulges et al. [23] used MPEG motion vectors and Mel-frequency cepstral coefficient (MFCC) audio features with skin color and visual words. Each feature representation is processed through an individual SVM classifier and combined in a weighted sum of late fusion. Ochoa et al. [40] performed binary video genre classification for adult content detection by processing the spatiotemporal features with two types of SVM algorithms: sequential minimal optimization (SMO) and LibSVM. Jung et al. [41] worked with the one-dimensional signal of spatiotemporal motion trajectory and skin color. Tang et al. [42] proposed a pornography detection system—PornProbe, based on a hierarchical latent Dirichlet allocation (LDA) and SVM algorithm. This system combined an unsupervised clustering in LDA and supervised


learning in SVM, and achieved higher efficiency than a single SVM classifier. Lee et al. [43] presented a multilevel hierarchical framework by taking multiple features of different temporal domains. Lopes et al. [44] worked with the bag-of-visual features (BoVF) for obscenity detection. Kaushal et al. [21] performed supervised learning to identify the child unsafe content and content uploaders by feeding the machine learning classifiers (i.e., random forest, K-nearest neighbor, and decision tree) with video-level, user-level and comment-level metadata of YouTube. Reddy et al. [45] handled the explicit content problem of videos through text classification of YouTube comments. They applied bigram collocation and fed the features to the naïve Bayes classifier for final classification.

B. DEEP LEARNING METHOD
In contrast to machine learning algorithms, there is a growing trend of using deep learning architectures to learn the video-based feature representations in video classification. The study of Ngiam et al. [46] is the first deep learning-based research that processed different data modalities such as video, image, audio and text, and performed greedy layer-wise training of the restricted Boltzmann machine (RBM) model. Karpathy et al. [29] showed multiclass video classification results on a large-scale video dataset (ImageNet) by using the convolutional neural network. This study is followed by the research of Yue-Hei Ng et al. [32] in which information across the full-length duration of videos is examined. The LSTM model is employed on top of frame-level CNN activations for better video classification. Some other studies have also taken the benefits of the CNN-LSTM model to capture the context of a given sequence of video frames [47]. Wu et al. [31] presented the CNN-LSTM hybrid model for obtaining the short-term spatial-motion patterns through CNNs and long-term temporal clues by employing LSTMs. Simonyan and Zisserman [30] built a two-stream CNN architecture for action detection from videos. Wehrmann et al. [48] reported better accuracy results with CNN-LSTM than other baselines on the NPDI pornography dataset [24]. Perez et al. [49] classified pornography videos by deploying CNNs with static and motion information (i.e., MPEG motion vectors). This study yielded the same accuracy scores when static and motion information are combined by late fusion or mid-level fusion (96.4%), but the performance degraded in early fusion (90.5%). Aldahoul et al. [50] explored and compared different state-of-the-art deep learning approaches and found that the pre-trained CNN model (EfficientNet-B7) based features with an SVM classifier performed better in unsafe video detection.

In the literature, many studies explored different YouTube data modalities (i.e., text, audio, image, and video) individually for inappropriate content detection. Yenala et al. [51] proposed a novel CNN-BiLSTM model for automatic detection and filtering of inappropriate text in query suggestions of YouTube search. Trana et al. [52] investigated naïve Bayes, SVM and CNN classifiers for harassment detection in YouTube comments. Dadvar and Eckert [53] identified cyberbullying in YouTube comments by experimenting with four deep learning models (i.e., CNN, LSTM, BiLSTM and BiLSTM with attention) using GloVe, SSWE and random word embeddings. In this study, the BiLSTM with an attention mechanism performed better than conventional machine learning algorithms. Mohaouchane et al. [54] performed an automatic detection of YouTube offensive comments using the CNN-BiLSTM model. Alshamrani [55] used an ensemble classification model to detect the age-inappropriate comments posted on YouTube videos. Later on, they extended the work [17] by leveraging natural language processing in an ensemble classifier. They detect five age-inappropriate remarks (toxic, absence, insult, threat and identity hate) which appear in YouTube comments. Alshamrani et al. [56] applied a neural network model on YouTube comments for toxicity detection and an LDA model on video captions for topic modeling.

The combination of different YouTube modalities is also explored in deep learning architectures. Mariconti et al. [57] showed an ensemble method for the detection of proactive remarks in YouTube videos. Features from three different sources (video metadata, audio transcripts and thumbnails) are extracted and fed to individual classifiers: metadata to SVM with a linear kernel, thumbnails to the random forest, and audio transcripts to an RNN-based gated recurrent unit (GRU) classifier. Hou et al. [10] recognized bloody videos by combining the audio-visual features and passing them to the CNN-LSTM model. Ali and Senan [11] also explored the audio-visual features of YouTube videos with the DNN model for violence classification. Sumon et al. [58] proposed the CNN-LSTM model for identifying violent crowd flow in YouTube videos. Alghowinem [59] showed a multimodal approach for inappropriate content filtering of the YouTube Kids application. Papadamou et al. [20] studied the Elsagate phenomenon in the YouTube Kids application by using YouTube video metadata such as video title, tags, statistics (likes/dislikes, view count, comments count), style features (same as proposed in [21]) and thumbnails. The authors deployed a fully connected neural network model for statistics and style features, a CNN for thumbnail features, and LSTMs for video title and tags features. Vitorino et al. [60] explored transfer learning and fine-tuning with CNN for sexually exploitative imagery of children (SEIC) detection. Ishikawa et al. [61] proposed an end-to-end pipeline for combating the Elsagate content in YouTube videos. They evaluated the classification performances of different mobile-platform based deep learning architectures such as GoogLeNet, SqueezeNet, NASNet, and MobileNetV2. Tahir et al. [22] processed multimodal data with the CNN-LSTM framework for disturbing and fake embedded content in videos. Lastly, Singh et al. [62] performed a fine-grained approach for child unsafe video content detection. They employed the LSTM autoencoder to learn video representations from CNN (VGG-16) feature descriptors.


An abundance of literature is available for inappropriate video content detection by using hand-crafted or deep learning feature extraction techniques. These studies have performed binary classification to classify the videos as safe or unsafe. However, research on detecting or classifying different categories of disturbing content in real-time YouTube videos, particularly in children-oriented videos, is lacking. Secondly, the existing studies have not explored the BiLSTM network for inappropriate video content detection. This paper addresses the aforementioned problems by working with an EfficientNet-B7 and BiLSTM-based deep learning framework for children inappropriate video content detection and classification.

III. PROPOSED METHODOLOGY
The proposed methodology provides a system to resolve the problem of disturbing content in videos. This work employs a deep learning architecture which has already been applied successfully in several applications for video classification problems. As shown in Fig. 2, the proposed system is divided into three main modules, namely (1) video preprocessing, (2) deep feature extraction, and (3) video representation and classification. In the video preprocessing stage, the collected YouTube videos are preprocessed to remove all irrelevant or missing video information. This stage also rescales the extracted frames of each video clip into fixed dimensions (224 × 224). The preprocessed video frames of each video clip are forwarded as an input to an ImageNet pre-trained EfficientNet-B7 model for feature extraction. The extracted features are experimented with the BiLSTM network to learn effective video representations, which are subsequently passed to the fully connected and softmax layers for final video classification. Furthermore, the detailed descriptions of each step are explained in the following subsections.

FIGURE 2. The proposed framework architecture works in three stages for children inappropriate video content detection and classification. In the first stage, video clips are preprocessed to discard irrelevant video frames and transform the selected frames into fixed-size images (224, 224, 3). Next, the frames are fed to EfficientNet-B7 to get feature vectors. All feature vectors are reshaped and passed into the two-layer stack of BiLSTMs for video representation. A fully connected layer followed by an output layer with softmax activation is integrated to return the probabilities of each video clip against the three classes including safe (class 0), fantasy violence (class 1) and sexual-nudity (class 2).

A. VIDEO PREPROCESSING
Video preprocessing plays an important role in deep learning techniques, as it helps in acquiring the relevant features for better video classification. In this work, a video (V_i) is first represented as N small segments of videos referred to as video clips (c_{i1}, c_{i2}, ..., c_{iN}) with one-second length each. These video clips are labeled through a manual annotation process where all clips with incomplete information or video context are ignored. After splitting and labeling of video clips, it is noticed that the initial 3-4 frames of each video clip have a piece of information from the previous clip of the same video. Therefore, considering an average video frame rate of 23-24 frames per second (fps), each jth video clip (c_{ij}) is sampled at a frame rate of 22 fps by ignoring some initial frames. The last frame is duplicated for all video clips containing fewer frames than the average frame rate of a video. Overall, the frames are represented as f_1^{i,j}, f_2^{i,j}, ..., f_{22}^{i,j}, where f_k^{i,j} means the kth frame of video clip c_{ij}. Finally, the selected frames of each video clip are rescaled to a fixed resolution of 224 × 224 pixels that corresponds to the input size of the pre-trained convolutional neural network model.

B. DEEP FEATURE EXTRACTION
In this module, the deep features of preprocessed video frames are extracted by using a deep learning model with an advanced architecture. Instead of training an entire CNN model from scratch, this study employed a pre-trained CNN architecture known as EfficientNet for the extraction of visual representations from video frames.

1) EFFICIENT-NET
The EfficientNet model is a convolutional neural network model and scaling method that uniformly scales network depth, width and resolution through a compound coefficient. It is trained on the large-scale ImageNet dataset with 1.3 million images from 1000 object classes [63]. Tan and Le [64] reported state-of-the-art accuracy of EfficientNet on the ImageNet dataset with much smaller and faster inference than the best existing CNN models. The baseline network of EfficientNet is referred to as B0, whereas the other scaling networks include B1, B2, B3, B4, B5, B6 and B7, respectively. In general, all scaling networks show better accuracy but at the cost of FLOPS.

The proposed framework included EfficientNet-B7 by working with the preprocessed extracted frames (f_1^{i,j}, f_2^{i,j}, ..., f_{22}^{i,j}) of each video clip c_{ij} as an input. The EfficientNet module performed feature extraction with the transfer learning technique, in which each input frame with dimensions of 224 × 224 × 3 (image_width x image_height x RGB_channel) is processed through a stack of 813 layers. The last three layers, including the fully connected layer generating 1000 ImageNet output labels, are discarded, which makes EfficientNet-B7 generate an output of feature descriptors X_k^{i,j} with 7 × 7 × 2560 dimensions for each frame f_k^{i,j}. These feature descriptors are used as an input to the BiLSTM model for video representation and classification.

C. VIDEO REPRESENTATION AND CLASSIFICATION
The third stage of the pipeline trains a bidirectional LSTM network with supervised learning so that effective video representations can be learnt from the feature descriptors of video clips. Subsequently, the proposed system is added with two fully connected layers for acquiring the final video classification results.

1) BiLSTM NETWORK
Recurrent neural networks produce good network performance in modeling the hidden sequential patterns of time-series data. However, the vanishing gradient problem hampers the update of network parameters during the backpropagation process. It is usually resolved by using two variations of RNNs, which are: LSTM and the gated recurrent unit (GRU). Conceptually, the network structure of LSTM is the same as RNN, but a special unit ''memory cell'' is introduced in LSTM to replace the update process of RNN. The memory cell of LSTM maintains information for a longer duration of time. Considering the current input vector x_t, the last hidden

state h_{t−1}, and the last memory cell state c_{t−1}, the following equations are used to implement an LSTM model:

i_t = σ(W_{xi} x_t + W_{hi} h_{t−1} + W_{ci} c_{t−1} + b_i)   (1)
f_t = σ(W_{xf} x_t + W_{hf} h_{t−1} + W_{cf} c_{t−1} + b_f)   (2)
c_t = f_t c_{t−1} + i_t tanh(W_{xc} x_t + W_{hc} h_{t−1} + b_c)   (3)
o_t = σ(W_{xo} x_t + W_{ho} h_{t−1} + W_{co} c_t + b_o)   (4)
h_t = o_t · tanh(c_t)   (5)

where σ represents the sigmoid activation function; i, f, c, and o denote the input gate, forget gate, memory cell state and output gate at time t, respectively. W and b denote the weights and bias vector. Considering the video classification problem, one potential drawback of LSTM is that it captures the past context only. For getting the full context of any video, it is important to consider both directions, i.e., the past and future context of the video. Therefore, the bidirectional LSTM appears to be a suitable option in video classification as it preserves the information in both directions, as shown in Fig. 3.

In BiLSTM, there are two distinct hidden layers referred to as the forward hidden layer (h_t^f) and the backward hidden layer (h_t^b). The forward hidden layer h_t^f considers the input vector x_t in ascending order, i.e., t = 1, 2, 3, ..., T, and the backward hidden layer h_t^b in descending order, i.e., t = T, T−1, T−2, ..., 1. Lastly, the output y_t is generated by combining the results of h_t^f and h_t^b. The following equations are used to implement the BiLSTM model:

h_t^f = tanh(W_{xh}^f x_t + W_{hh}^f h_{t−1}^f + b_h^f)   (6)
h_t^b = tanh(W_{xh}^b x_t + W_{hh}^b h_{t+1}^b + b_h^b)   (7)
y_t = W_{hy}^f h_t^f + W_{hy}^b h_t^b + b_y   (8)

It is noticed that adding an excessive number of BiLSTM layers increases the network complexity and slows down the training process. Hence, this work employed two layers of BiLSTM to understand the video representations.

2) BiLSTM NETWORK WITH AN ATTENTION MECHANISM
A neural network architecture with an attention mechanism decides when to look into the data (or in this case, segments of videos) by automatically giving a higher level of focus to feature vectors with the most valuable information than to feature vectors with less valuable information. The architecture of BiLSTM with an attention mechanism is depicted in Fig. 4. Consider the final hidden state of the i-th BiLSTM as h_{it}, which is computed as:

h_{it} = [h_t^f, h_t^b]   (9)

Then, the attention mechanism is computed by using the following equations:

e_{it} = tanh(W_a h_{it} + b_a)   (10)
a_{it} = exp(e_{it}) / Σ_{j=1}^{T} exp(e_{jt})   (11)
v_t = Σ_{t=1}^{T} a_{it} · h_{it}   (12)

The attention mechanism assigns the attention weight a_{it} to the i-th BiLSTM output vector at time t, as calculated in equation (11). W_a and b_a represent the weight and bias from the attention layer. Finally, the output from the attention layers generates an attention vector v_t, which is calculated as a weighted sum of the multiplication between the attention weight a_{it} and the i-th BiLSTM output vector at time t.
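To make the architecture above concrete, the following Keras sketch wires a frozen EfficientNet-B7 backbone (producing 7 × 7 × 2560 descriptors per frame) to the two-layer BiLSTM classifier, with an optional attention block implementing equations (10)-(12). The stated sizes (22 frames, 128 hidden units, a 4096-unit dense layer, dropout of 0.3, three softmax outputs, Adam with learning rate 1e-5) follow this paper; the helper names and the choice to flatten each 7 × 7 × 2560 map into one vector per time step are illustrative assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import EfficientNetB7
from tensorflow.keras.applications.efficientnet import preprocess_input

FRAMES, FEAT_DIM, NUM_CLASSES = 22, 7 * 7 * 2560, 3

# Frozen ImageNet backbone; include_top=False drops the 1000-way classifier,
# so each 224x224x3 frame yields a 7x7x2560 feature map.
backbone = EfficientNetB7(include_top=False, weights="imagenet",
                          input_shape=(224, 224, 3))
backbone.trainable = False

def extract_clip_descriptors(frames):
    """frames: array of shape (22, 224, 224, 3) for one video clip."""
    maps = backbone.predict(preprocess_input(frames.astype("float32")), verbose=0)
    return maps.reshape(FRAMES, -1)          # (22, 125440) sequence for the BiLSTM

def build_classifier(hidden_units=128, use_attention=False):
    seq = layers.Input(shape=(FRAMES, FEAT_DIM))
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(seq)
    x = layers.Bidirectional(layers.LSTM(hidden_units, return_sequences=True))(x)
    if use_attention:
        # e_t = tanh(W_a h_t + b_a); a_t = softmax over the 22 time steps;
        # v = sum_t a_t * h_t  (Eqs. 10-12)
        e = layers.Dense(1, activation="tanh")(x)
        a = layers.Softmax(axis=1)(e)
        x = layers.Lambda(lambda z: tf.reduce_sum(z[0] * z[1], axis=1))([a, x])
    else:
        x = layers.Flatten()(x)
    x = layers.Dense(4096, activation="relu")(x)
    x = layers.Dropout(0.3)(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = models.Model(seq, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

In this sketch, the descriptors would be computed once per clip with extract_clip_descriptors() and the classifier fitted on the resulting (22, 125440) sequences.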


FIGURE 3. The architecture of the BiLSTM model.

FIGURE 4. The architecture of the BiLSTM model with an attention mechanism.

3) SOFTMAX CLASSIFIER
The softmax activation function in the output layer of any deep learning model is considered as a softmax classifier. To classify each video clip into one of three classes (i.e., fantasy violence, sexual-nudity and safe), the proposed model integrated a softmax activation function in the last fully connected layer to determine the relative probability of the three output units. The softmax activation function (σ) is calculated as:

σ(z_i) = exp(z_i) / Σ_{c=0}^{N−1} exp(z_c)   (13)

D. OVERALL PROPOSED FRAMEWORK
The general architecture of the proposed framework for multiclass video classification is illustrated in Fig. 2. The input is a sequence of 22 frames of fixed resolution with 224 × 224 × 3 pixels. These frames are processed through EfficientNet-B7 for extracting features and generating a feature vector of 22 × 7 × 7 × 2560 shape. The high-level features are reshaped and directed towards the two-layer stack of the BiLSTM network. A flattening layer is added to transform the feature representations into a 1-dimensional vector. Subsequently, a fully connected layer (or dense layer) of 4096 neurons with the rectified linear unit (ReLU) activation function is added. As a fully connected layer generates a wide range of probabilities by connecting all inputs of one layer to every activation unit of the next layer, a dropout of 0.3 is carried out to prevent the model from an overfitting problem. Finally, the softmax output layer with 3 neurons gives the final classification scores. The algorithmic steps for training and validation of the proposed framework are presented in Algorithm 1.

IV. EXPERIMENTAL SETUP
A. DATASET
YouTube has a huge collection of videos and metadata of videos (i.e., likes, dislikes, view count, comments, etc.) that has been explored in numerous research studies. Google released the YouTube-8M benchmark dataset of more than 8 million video IDs with corresponding labels from 4716 classes [65]. Apart from it, there exist other video benchmarks of specific categories like sports (Sports-1M [29], UCF-101 [66]), action recognition (HMDB51 [67], Kinetics [68]), face recognition (YTF [69], YouTube Celebrities [70]), sentiment analysis ([71]), and video captioning (MSVD [72], MSR-VTT [73]). However, none of these existing benchmarks aims for the proposed video classification problem. The datasets closely related to our problem are the NPDI cartoon dataset [50], the Elsagate dataset [61], and the dataset of Singh et al. [62]. Comparatively, the NPDI dataset is the smallest with 900 images only and is not suitable to perform our deep learning-based video classification task. The Elsagate dataset is a publicly available dataset of cartoon videos from sensitive and non-sensitive classes. In this dataset, the whole video is considered either safe or unsafe, where the clean frames of a video are also labeled as unsafe. Secondly, it lacks the complex behaviors of sensitive content. The videos in this dataset are targeted for toddlers. Lastly, the dataset of Singh et al. [62] included Japanese anime videos dedicated to mature viewers. For this reason, a manually annotated video dataset is presented for identifying the disturbing content. Because the intention is to focus on content for children, this study included cartoon videos only. The videos are searched and collected using the four popular cartoon names (including Tom and Jerry, Gravity Falls, Simpsons and Sponge Bob) and ''cartoon'' as keywords through the YouTube Data API. Once the list of videos is obtained and downloaded, the next step involved video filtering in which all irrelevant videos (like non-cartoon) are discarded. The process of filtering resulted in 1126 videos with durations ranging between 2 to 600 seconds. All collected videos are split into one-second duration clips using FFmpeg. Each video clip is manually annotated as belonging to either safe, fantasy violence or sexual-nudity class. The clips with any other act (i.e., extreme bloodshed or violence, smoking, use of drugs, frightening or horror scenes, etc.) are not included in this dataset.
VOLUME 10, 2022 16289


K. Yousaf, T. Nawaz: Deep Learning-Based Approach for Inappropriate Content Detection and Classification of YouTube Videos

Algorithm 1 Training and Validation of the EfficientNet-BiLSTM Video Classification Algorithm

Input: Training set T = {(c_i, θ_i), i = 1, 2, 3, ..., n}, Validation set V = {(c'_i, θ'_i), i = 1, 2, 3, ..., n}, epoch_num = number of epochs
Output: Trained model Model_epoch, Accuracy accuracy_epoch, Precision precision_epoch, Recall recall_epoch, F1 Score f1-score_epoch

Basic Idea:
1. Divide the training instances T into m mini-batches t_k, each containing a fixed number of unique video clips c_i, such that T = {(t_k, θ_k), c_i ∈ t_k, k = 1, 2, 3, ..., m, i = 1, 2, ..., n1}, where t_1 ∩ t_2 ... ∩ t_m = ∅.
2. Extract features from each mini-batch t_k, where k = 1, 2, 3, ..., m.
3. Train the model with the extracted features of mini-batch t_k and assigned labels θ_k.
4. For validation, apply the same procedure as STEP 1 such that the validation instances V = {(v_k, θ'_k), c'_i ∈ v_k, k = 1, 2, 3, ..., m, i = 1, 2, ..., n2}, where v_1 ∩ v_2 ... ∩ v_m = ∅.
5. Validate the model Model_epoch.
6. Calculate accuracy, precision, recall and F1 score using the confusion matrix conf_matrix_epoch.

1  for epoch ← 0 to epoch_num do
2    for all t_k ∈ T (k = 1, 2, 3, ..., m) do
3      f_tk = [], θ_tk = []
4      for all c_i ∈ t_k (i = 1, 2, ..., n1) do
5        f_{ci,tk} ← feature_extraction(c_i)
6        f_tk.append(f_{ci,tk})
7        θ_tk.append(θ_{ci,tk})
8      end for
9      X_tk ← np.array(f_tk)
10     Y_tk ← Transform(θ_tk)   // transform labels into one-hot encoding
11     Model_epoch ← Train_Classifier(X_tk, Y_tk)
12   end for
13   for all v_k ∈ V (k = 1, 2, 3, ..., m) do
14     f_vk = [], θ_vk = [], result_epoch = []
15     for all c'_i ∈ v_k (i = 1, 2, ..., n2) do
16       f_{c'i,vk} ← feature_extraction(c'_i)
17       f_vk.append(f_{c'i,vk})
18       θ_vk.append(θ_{c'i,vk})
19     end for
20     X_vk ← np.array(f_vk)
21     Y_vk_pred ← Predict(X_vk)
22     result_epoch.append(epoch, [Y_vk_pred = Inverse_Transform(Y_vk_pred)], [Y_vk_actual = θ_vk])
23   end for
24   conf_matrix_epoch = Confusion_Matrix(result_epoch.predicted, result_epoch.actual)
25   accuracy_epoch = (1/N) Σ_{c=0}^{N−1} conf_matrix_epoch(TP_c + TN_c) / conf_matrix_epoch(TP_c + TN_c + FP_c + FN_c)
26   precision_epoch = (1/N) Σ_{c=0}^{N−1} conf_matrix_epoch(TP_c) / conf_matrix_epoch(TP_c + FP_c)
27   recall_epoch = (1/N) Σ_{c=0}^{N−1} conf_matrix_epoch(TP_c) / conf_matrix_epoch(TP_c + FN_c)
28   f1-score_epoch = (1/N) Σ_{class=0}^{N−1} 2 · (precision_class · recall_class) / (precision_class + recall_class)
29 end for
30 return Model_epoch, accuracy_epoch, precision_epoch, recall_epoch, f1-score_epoch
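A compact Python rendering of Algorithm 1 could look like the sketch below, where extract_fn stands for the EfficientNet-B7 descriptor step, model is a compiled Keras classifier, binarizer is a fitted LabelBinarizer for the three classes, and the mini-batch iterables are placeholders; none of these names come from the authors' code.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def run_epoch(model, extract_fn, train_batches, val_batches, binarizer):
    # Training: one gradient update per mini-batch of extracted clip features.
    for clips, labels in train_batches:                        # (t_k, theta_k)
        X = np.array([extract_fn(c) for c in clips])           # (batch, 22, feat_dim)
        Y = binarizer.transform(labels)                        # one-hot labels
        model.train_on_batch(X, Y)

    # Validation: collect predictions, then score from the confusion matrix.
    y_true, y_pred = [], []
    for clips, labels in val_batches:                          # (v_k, theta'_k)
        X = np.array([extract_fn(c) for c in clips])
        probs = model.predict(X, verbose=0)
        y_pred.extend(binarizer.classes_[probs.argmax(axis=1)])
        y_true.extend(labels)
    cm = confusion_matrix(y_true, y_pred, labels=list(binarizer.classes_))

    # Macro-averaged metrics over the classes (lines 25-28 of Algorithm 1).
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    accuracy = np.mean((tp + tn) / (tp + tn + fp + fn))
    prec_c, rec_c = tp / (tp + fp), tp / (tp + fn)
    f1 = float(np.mean(2 * prec_c * rec_c / (prec_c + rec_c)))
    return accuracy, prec_c.mean(), rec_c.mean(), f1
```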

The manual annotation process results in a total of 111,561 video clips, including 57,908 clips belonging to the safe class, 27,003 clips in the sexual-nudity class, and 26,650 clips in the fantasy violence class. Overall, there is a balanced distribution of safe and unsafe video clips (1.08:1). We also intend to make this dataset publicly available for the research community. Table 1 summarizes the overall distribution of manually annotated cartoon videos according to the three classes.

TABLE 1. Cartoon dataset distribution.

B. NETWORK HYPERPARAMETERS TUNING
1) FRAME SAMPLING
Due to the frame overlap issue, each video clip is sampled at a frame rate of 22 fps by ignoring some of the starting frames. Video clips containing fewer frames than the average frame rate (23-24 fps) are padded with the last frame of the same clip. The frame sampling rate of 22 fps is adopted in all training, validation and testing experiments of the neural network models.

2) BILSTM PARAMETER
Considering that an ImageNet pre-trained CNN model is employed for video frame feature extraction using the transfer learning approach, hyperparameter tuning in the proposed framework is required for the BiLSTM and subsequent fully connected (dense) layers. Various design choices of bidirectional LSTM layers are evaluated, but it is confirmed through experiments that adding two simultaneous layers of bidirectional LSTMs performs better in video classification than working with a single or multiple layers of bidirectional LSTM networks. Apart from that, different numbers of hidden units (i.e., 64, 128, 256, 512) in each bidirectional LSTM layer are embedded. For consistency, the same number of hidden units is used in both bidirectional layers. Experiments showed that the highest validation accuracy is achieved by using 128 hidden units in the two BiLSTM network layers. It is also noticed that a fully connected (FC) layer of 4096 units with ReLU activation function helps more in selecting the most relevant and appropriate labels in the classification layer than a fully connected layer of 2048 units with the same activation function. To avoid the problem of overfitting, a dropout layer (value = 0.3) is added before the last fully connected output layer (dense layer). Finally, the softmax classifier is applied in an output layer of 3 units to get the final probability scores.

Table 2 enlists the complete layer configuration of our best-proposed model used in inappropriate video content detection and classification. This model has nine layers, each containing a different output size and number of learnable parameters. Overall, the proposed model is trained with 152 million parameters (number of neurons) that are updated during the backpropagation process. All the parameters of the pre-trained EfficientNet-B7 model are non-trainable parameters, which means that these parameters are not optimized during model training.

3) COST FUNCTION AND OPTIMIZER PARAMETERS
The cost function measures an error between predicted and actual values. The optimizer function is responsible for reducing the error or overall loss of the neural network model to improve the model accuracy. In this study, the categorical cross-entropy loss function is used for multiclass video classification, which is calculated as:

L_cross_entropy(ŷ, y) = − Σ_{c=0}^{N−1} y_{i,c} · log(ŷ_{i,c})   (14)

In equation (14), c represents a particular class index from N number of classes, y_{i,c} is the binary indicator (0 or 1) which indicates whether c is the actual class for instance i, and ŷ_{i,c} represents the predicted probability of instance i for class c. The Adam optimizer, an extension of the stochastic gradient descent (SGD) algorithm, is used with a learning rate of 1e-5 to minimize the error of the cost function in the proposed model.

4) TRAINING PARAMETER
Because of memory and computational constraints, it is found convenient to load the training dataset in memory by taking small subsets of data for model training. In this research, a subset of 1000 samples (or video clips) of the training dataset is processed in each iteration of one epoch. Training of the proposed EfficientNet-BiLSTM model is performed through the mini-batch gradient descent optimization method, which further divides each subset of the training dataset into n mini-batches. After performing experiments using different mini-batch sizes (i.e., 8, 16 and 32), it is observed that the mini-batch size of 16 converges faster and performs better in model accuracy than the other batch sizes. Overall, the weights of the proposed EfficientNet-BiLSTM model are updated in 90 iterations per epoch. Each iteration processes its chunk of the training dataset in 63 batches (mini-batch size = 16).

C. EXPERIMENTAL ENVIRONMENT
The proposed model is implemented in Python using Keras, which is a high-level deep learning application programming interface (API) that runs on top of Google's TensorFlow open-source library [74], [75]. The Google Colab integrated development environment (IDE) is used, which offers a cloud-based Jupyter notebook to write and run the Python code for deep learning architectures. For all experiments, Google Colab Pro+ is used to train the model, which offers 50 GB of T4 or P100 GPU and 255 GB of disk. Moreover, Google Drive with 2 TB of space is used to store the YouTube cartoon dataset. This dataset is partitioned


with an 80:20 split such that 80% of the data is allocated for training and 20% for evaluation and testing of the models.

TABLE 2. Layer configuration of proposed EfficientNet-BiLSTM model for video classification.

D. EVALUATION METRICS
The performances of the multiclass video classification models are evaluated by calculating the accuracy, precision, recall and f1 score using confusion matrices. Accuracy is the ratio of the number of correct predictions for each class to the total number of predictions of all classes, and is calculated as:

Accuracy = (1/N) Σ_{c=0}^{N−1} (TP_c + TN_c) / (TP_c + TN_c + FP_c + FN_c) × 100%   (15)

In equation (15), c represents a particular class index from N number of classes, TP denotes the true positives, TN denotes the true negatives, FP denotes the false positives and FN denotes the false negatives. Precision is the ratio of the total number of correct predictions of positive instances to the total number of predictions with positive instances. It is calculated as:

Precision = (1/N) Σ_{c=0}^{N−1} TP_c / (TP_c + FP_c) × 100%   (16)

The recall (also known as sensitivity) is the ratio of the total number of correct predictions of positive instances to the total number of instances in an actual class. The recall and f1 score are calculated by using equations (17) and (18):

Recall = (1/N) Σ_{c=0}^{N−1} TP_c / (TP_c + FN_c) × 100%   (17)
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)   (18)

V. RESULTS AND DISCUSSION
In this section, the results obtained through experimental evaluations of different machine learning and deep learning approaches for video classification are presented and discussed. Afterwards, the best-proposed approach is compared with existing state-of-the-art methods from the literature.

A. ANALYSIS OF PRE-TRAINED CNN MODEL VARIANTS
At first, three pre-trained convolutional neural network models, including Inception-V3, VGG-19 and EfficientNet-B7, are employed as video classifiers to determine the performances of these ImageNet pre-trained CNN models in our multiclass video classification problem. For each model, the last three layers of the pipeline are discarded and replaced with a fully connected layer with a softmax activation function using three output nodes.

The transfer learning approach is implemented in a manner where the weights of all layers in the model are fixed except the last fully connected layer. After training each pre-trained convolutional neural network model using the transfer learning approach, as shown in Table 3, it is analyzed that the EfficientNet-B7 model performs comparatively better than VGG-19 and Inception-V3 on the YouTube cartoon video dataset. It has achieved the highest recall score, which means that the EfficientNet-B7 model retrieves more relevant instances than the remaining two pre-trained CNN models. Hence, further experiments are carried out with EfficientNet-B7 as a base classifier.

B. ANALYSIS OF EFFICIENT-NET FEATURES WITH DIFFERENT CLASSIFIER VARIANTS
In this section, the performances of different classifiers trained on EfficientNet visual features are evaluated. For this purpose, some machine learning algorithms are considered for the video classification task. The experimental evaluation of Xu et al. [76] also presented that even a simple machine learning algorithm can play an effective role in video classification, considering the features are distinctive enough.

This study applied three machine learning algorithms, namely SVM, KNN and random forest, as video classifiers


by training on EfficientNet features. The evaluation results of Table 4 show that among the three machine learning classifiers, SVM with RBF kernel achieved the highest accuracy of 72.48% on EfficientNet visual features. It is followed by the random forest and KNN (neighbors = 3) classifiers with accuracy values of 68.69% and 60.12%, respectively. The other evaluation metrics, i.e., precision, recall and f1 score, of SVM with RBF kernel outperformed the random forest and KNN classifiers. Apart from machine learning classifiers, an experiment is performed where EfficientNet-B7 itself is treated as a sole classifier by replacing the last three layers of the architecture with a fully connected layer (units = 512) followed by an output layer (activation = softmax) of three units. Two main methods, namely transfer learning and fine-tuning of EfficientNet-B7, are implemented for the video classification task. In transfer learning, the weights of all layers of EfficientNet are fixed except the last fully connected layer, and fine-tuning updates the weights of the entire model. From Table 4, it can be observed that the performances of all machine learning algorithms are relatively poor compared with transfer learning or fine-tuning of the EfficientNet-B7 model. In comparison, the EfficientNet model using the transfer learning approach performed slightly better (accuracy = 89.07%) than the fine-tuned model (accuracy = 87.89%). The main reason for such model behavior is that the ImageNet classification dataset is much larger in scope (14 million images) than our self-curated cartoon video dataset (2.5 million video frames) used for fine-tuning of the EfficientNet model. By further examining the evaluation results of other classifier variants, it is noticed that although the EfficientNet with transfer learning method yields the best results among these classifiers, it still has a high ratio of false negatives (recall = 79.54%). Hence, model training by using a single fully connected layer with a pre-trained CNN architecture is not sufficient. It requires some deep neural network classifier to effectively understand the hidden sequences of video representations by returning high precision-recall values for video classification. Thus, further experiments are conducted using EfficientNet-B7 as a feature extractor with other neural network models so as not to let any child inappropriate content go undetected in video classification.

TABLE 3. Transfer learning using ImageNet pre-trained CNN models.

TABLE 4. Evaluation results in terms of accuracy, precision, recall and f1 score of using EfficientNet features with different classifier variants.

C. ANALYSIS OF EFFICIENT-NET WITH BiLSTM AND ATTENTION-BASED BiLSTM CLASSIFIER VARIANTS
The experiments of the previous sections revealed that ImageNet pre-trained EfficientNet-B7 works better as a feature extractor, and this architecture in conjunction with a deep learning algorithm can successfully detect and classify unsafe video content. The bidirectional LSTM, a supervised deep learning algorithm, is opted for developing the deep learning-based framework because it preserves the contextual information in both directions of time-series data, which appears to be a suitable choice in our video classification problem. The experiments are conducted using the two-layer stack of BiLSTMs followed by fully connected (units = 4096, activation = ReLU), dropout (value = 0.3) and softmax (output units = 3) layers. Details of the complete architecture are mentioned in Section III. This study implemented and evaluated different hidden units (i.e., 64, 128, 256, and 512) in each BiLSTM layer. For simplicity and consistency, the same number of hidden units is used in both layers of the BiLSTM networks. You and Korhonen [77] reported that adding an attention


mechanism after the BiLSTM layer boosts the performance of deep neural networks for video classification. Hence, an attention mechanism-based BiLSTM model is also examined by integrating an attention block after each bidirectional layer, followed by fully connected (units = 4096, activation = ReLU), dropout (value = 0.3) and softmax (output units = 3) layers. In all experiments, the models are trained and evaluated for 20 epochs with an 80:20 split in which 80% of the YouTube cartoon dataset is used for training purposes and 20% for the evaluation and testing of the model. The trained model from the last epoch (epoch = 20) is tested for obtaining the final video classification scores. Table 5 demonstrates the experimental results of the attention and without-attention mechanism-based EfficientNet-BiLSTM models working with different numbers of hidden units (i.e., 64, 128, 256, and 512) in each bidirectional LSTM layer of the proposed framework.

TABLE 5. Evaluation results in terms of accuracy, precision, recall and f1 score of using EfficientNet-B7 with BiLSTM and attention-based BiLSTM classifier variants.

FIGURE 5. Confusion matrix results of EfficientNet-BiLSTM and attention-based EfficientNet-BiLSTM networks.

The first observation from all evaluation results, as mentioned in Table 5, is that all EfficientNet-BiLSTM networks perform comparatively better than the attention-based EfficientNet-BiLSTM networks. For the attention-based models, the f1 scores are improved by updating the hidden units from 64 to 128 and 256 in each BiLSTM network, as it affects the network trainable parameters during backpropagation. However, it is also found that adding an excessive number of hidden units (i.e., units = 512) gradually decreases the overall network performance. Secondly, the overall behavior of the attention mechanism-based neural network models is different from the models with no attention blocks. In the EfficientNet-BiLSTM network, upgrading the hidden units in the BiLSTMs from 64 to 128 immediately resulted in the best performing model of all experiments by showing the highest f1 score (0.9267). However, the performance drastically decreases by adding more hidden units in the BiLSTMs


TABLE 6. Performance comparison of the proposed EfficientNet-BiLSTM model with existing state-of-the-art video classification techniques.

(i.e., 256 and 512). A detailed performance comparison between the EfficientNet with attention and without attention mechanism-based BiLSTM models is presented through confusion matrices for all three classes. The diagonal values represent the correctly classified number of instances in each class, whereas anything off the diagonal indicates incorrectly classified instances. The evaluation results of the EfficientNet-BiLSTM model with and without attention blocks for the YouTube cartoon video dataset are illustrated in Fig. 5. Overall, the EfficientNet and BiLSTM network with 128 hidden units in each bidirectional layer achieved the highest validation (f1 score = 0.9274) and testing scores (f1 score = 0.9267).

D. PERFORMANCE COMPARISON WITH EXISTING STATE-OF-THE-ART CLASSIFICATION METHODS
We compare the performance of the proposed EfficientNet-BiLSTM model with existing state-of-the-art models and methods employed for inappropriate content classification using different YouTube data modalities.

Table 6 summarizes the results and quality scores of the existing and proposed classification methods. It is worth noting that existing studies explored different YouTube modalities (i.e., text, audio, video, and metadata) for different classifications. The most common strategy in existing studies, for unsafe content classification, is using pre-trained CNN models with either LSTM-based classifiers [10], [20], [22], [48], [62] or machine learning-based classifiers [50], [60], [61]. Compared with the approaches that use pre-trained CNN features with machine learning classifiers, our EfficientNet-BiLSTM classifier method yielded higher accuracy than the GoogLeNet-SVM [60], fine-tuned NASNet-SVM [61], and EfficientNet-SVM [50] approaches by significant margins of 3.06%, 7.79%, and 10.1%, respectively. In comparison with base models using pre-trained CNNs and LSTMs, the Inception-V3 with LSTM approach [20] reported an f1 score of 0.828, which is much lower than our with-attention (f1 score = 0.9195) and without-attention (f1 score = 0.9267) BiLSTM classifier variants. It is also worth mentioning that the ResNet-LSTM model in existing studies [10], [48] attained comparable accuracy results to our proposed technique. This can be explained by the fact that the studies reporting these approaches performed binary video classification, which is much simpler than multiclass video classification. Note that the proposed model still outperformed some existing approaches of multiclass video classification using VGG-LSTM-based models [22], [62], which shows that BiLSTM has high robustness on time-series data modeling. In addition, some studies [11], [17], [21] used simple convolutional neural networks and reported the lowest classification accuracy and f1 scores. Hence, it is deduced that simple CNNs are not sufficient to understand the complexities of YouTube data modalities. Overall, the performance comparison showed that the proposed EfficientNet using BiLSTM (hidden units = 128) surpassed the existing studies in inappropriate video content detection and classification.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel deep learning-based framework is proposed for child inappropriate video content detection and classification. Transfer learning using the EfficientNet-B7 architecture is employed to extract the features of videos.

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel deep learning-based framework is proposed for child-inappropriate video content detection and classification. Transfer learning with the EfficientNet-B7 architecture is employed to extract video features. The extracted video features are processed through the BiLSTM network, where the model learns effective video representations and performs multiclass video classification. All evaluation experiments are performed on a manually annotated cartoon video dataset of 111,156 video clips collected from YouTube. The evaluation results indicated that the proposed EfficientNet-BiLSTM framework (with hidden units = 128) exhibits higher performance (accuracy = 95.66%) than the other experimented models, including EfficientNet-FC, EfficientNet-SVM, EfficientNet-KNN, EfficientNet-Random Forest, and the attention mechanism-based EfficientNet-BiLSTM models (with hidden units = 64, 128, 256, and 512). Moreover, the performance comparison with existing state-of-the-art models also demonstrated that our BiLSTM-based framework surpassed other existing models and methods by achieving the highest recall score of 92.22%. The advantages of the proposed deep learning-based child-inappropriate video content detection system are as follows:
1) It works under real-time conditions by processing video at a speed of 22 fps using the EfficientNet-B7 and BiLSTM-based deep learning framework, which helps in filtering live-captured videos (a rough throughput check is sketched after this list).
2) It can assist any video-sharing platform in either removing a video containing unsafe clips or blurring/hiding any portion with unsettling frames.
3) It may also help in the development of parental control solutions on the Internet through plugins or browser extensions, where child-unsafe content can be filtered automatically.
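To make the throughput figure in point 1 concrete, a rough way to measure the end-to-end frames-per-second of such a pipeline is sketched below. The function run_pipeline_on_frame is a hypothetical stand-in for the EfficientNet-B7 feature extraction and BiLSTM classification stages (stubbed here with a short sleep so the script runs on its own); it is not a function from this work's implementation.

import time

def run_pipeline_on_frame(frame):
    # Placeholder for per-frame feature extraction + classification work.
    time.sleep(0.005)
    return "safe"

frames = [object()] * 200  # stand-ins for decoded video frames
start = time.perf_counter()
for frame in frames:
    run_pipeline_on_frame(frame)
elapsed = time.perf_counter() - start
print(f"throughput: {len(frames) / elapsed:.1f} fps")  # compare against the reported 22 fps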
Furthermore, our methodology for detecting inappropriate children's content on YouTube is independent of YouTube video metadata, which can easily be altered by malicious uploaders to deceive audiences. In the future, we intend to combine a temporal stream based on optical flow frames with the spatial stream of RGB frames to further improve model performance through a better understanding of the global representations of videos. We also aim to increase the number of classification labels to target different types of inappropriate children's content in YouTube videos.
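The planned two-stream extension can be pictured with the following Keras sketch, which late-fuses an RGB (spatial) descriptor sequence with an optical-flow (temporal) descriptor sequence before the softmax classifier; the input shapes, layer sizes, and fusion strategy are illustrative assumptions, not a design committed to in this work.

from tensorflow.keras import layers, models

NUM_FRAMES, RGB_DIM, FLOW_DIM, NUM_CLASSES = 30, 2560, 1024, 3  # assumed shapes

rgb_in = layers.Input(shape=(NUM_FRAMES, RGB_DIM), name="rgb_features")
flow_in = layers.Input(shape=(NUM_FRAMES, FLOW_DIM), name="flow_features")

rgb = layers.Bidirectional(layers.LSTM(128))(rgb_in)    # spatial stream
flow = layers.Bidirectional(layers.LSTM(128))(flow_in)  # temporal stream

fused = layers.Concatenate()([rgb, flow])               # late fusion of the two streams
fused = layers.Dense(256, activation="relu")(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

two_stream = models.Model([rgb_in, flow_in], out)
two_stream.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
two_stream.summary()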
ACKNOWLEDGMENT
The authors are appreciative of Prof. Dr. Hafiz Adnan Habib (Head of the Department of Computer Engineering, University of Engineering and Technology, Taxila) for providing valuable advice and suggestions in this study.
REFERENCES
[1] L. Ceci. YouTube Usage Penetration in the United States 2020, by Age Group. Accessed: Nov. 1, 2021. [Online]. Available: https://www.statista.com/statistics/296227/us-youtube-reach-age-gender/
[2] P. Covington, J. Adams, and E. Sargin, "Deep neural networks for YouTube recommendations," in Proc. 10th ACM Conf. Recommender Syst., Sep. 2016, pp. 191–198, doi: 10.1145/2959100.2959190.
[3] M. M. Neumann and C. Herodotou, "Evaluating YouTube videos for young children," Educ. Inf. Technol., vol. 25, no. 5, pp. 4459–4475, Sep. 2020, doi: 10.1007/s10639-020-10183-7.
[4] J. Marsh, L. Law, J. Lahmar, D. Yamada-Rice, B. Parry, and F. Scott, Social Media, Television and Children. Sheffield, U.K.: Univ. Sheffield, 2019. [Online]. Available: https://www.stac-study.org/downloads/STAC_Full_Report.pdf
[5] L. Ceci. YouTube—Statistics & Facts. Accessed: Sep. 01, 2021. [Online]. Available: https://www.statista.com/topics/2019/youtube/
[6] M. M. Neumann and C. Herodotou, "Young children and YouTube: A global phenomenon," Childhood Educ., vol. 96, no. 4, pp. 72–77, Jul. 2020, doi: 10.1080/00094056.2020.1796459.
[7] S. Livingstone, L. Haddon, A. Görzig, and K. Ólafsson, Risks and Safety on the Internet: The Perspective of European Children: Full Findings and Policy Implications From the EU Kids Online Survey of 9-16 Year Olds and Their Parents in 25 Countries. London, U.K.: EU Kids Online, 2011. [Online]. Available: http://eprints.lse.ac.uk/id/eprint/33731
[8] B. J. Bushman and L. R. Huesmann, "Short-term and long-term effects of violent media on aggression in children and adults," Arch. Pediatrics Adolescent Med., vol. 160, no. 4, pp. 348–352, 2006, doi: 10.1001/archpedi.160.4.348.
[9] S. Maheshwari. (2017). On YouTube Kids, Startling Videos Slip Past Filters. The New York Times. [Online]. Available: https://www.nytimes.com/2017/11/04/business/media/youtube-kids-paw-patrol.html
[10] C. Hou, X. Wu, and G. Wang, "End-to-end bloody video recognition by audio-visual feature fusion," in Proc. Chin. Conf. Pattern Recognit. Comput. Vis. (PRCV), 2018, pp. 501–510, doi: 10.1007/978-3-030-03398-9_43.
[11] A. Ali and N. Senan, "Violence video classification performance using deep neural networks," in Proc. Int. Conf. Soft Comput. Data Mining, 2018, pp. 225–233, doi: 10.1007/978-3-319-72550-5_22.
[12] H.-E. Lee, T. Ermakova, V. Ververis, and B. Fabian, "Detecting child sexual abuse material: A comprehensive survey," Forensic Sci. Int., Digit. Invest., vol. 34, Sep. 2020, Art. no. 301022, doi: 10.1016/j.fsidi.2020.301022.
[13] R. Brandom. (2017). Inside Elsagate, The Conspiracy Fueled War on Creepy YouTube Kids Videos. [Online]. Available: https://www.theverge.com/2017/12/8/16751206/elsagate-youtube-kids-creepy-conspiracy-theory
[14] Reddit. What is ElsaGate? Accessed: Dec. 14, 2020. [Online]. Available: https://www.reddit.com/r/ElsaGate/comments/6o6baf/
[15] B. Burroughs, "YouTube kids: The app economy and mobile parenting," Soc. Media + Soc., vol. 3, May 2017, Art. no. 2056305117707189, doi: 10.1177/2056305117707189.
[16] H. Wilson, "YouTube is unsafe for children: YouTube's safeguards and the current legal framework are inadequate to protect children from disturbing content," Seattle J. Technol., Environ. Innov. Law, vol. 10, no. 1, p. 8, 2020. [Online]. Available: https://digitalcommons.law.seattleu.edu/sjteil/vol10/iss1/8
[17] S. Alshamrani, A. Abusnaina, M. Abuhamad, D. Nyang, and D. Mohaisen, "Hate, obscenity, and insults: Measuring the exposure of children to inappropriate comments in YouTube," in Proc. Companion Proc. Web Conf., Apr. 2021, pp. 508–515, doi: 10.1145/3442442.3452314.
[18] N. Elias and I. Sulkin, "YouTube viewers in diapers: An exploration of factors associated with amount of toddlers' online viewing," Cyberpsychol., J. Psychosoc. Res. Cyberspace, vol. 11, no. 3, p. 2, Nov. 2017, doi: 10.5817/cp2017-3-2.
[19] D. Craig and S. Cunningham, "Toy unboxing: Living in a(n unregulated) material world," Media Int. Aust., vol. 163, no. 1, pp. 77–86, May 2017, doi: 10.1177/1329878X17693700.
[20] K. Papadamou, A. Papasavva, S. Zannettou, J. Blackburn, N. Kourtellis, I. Leontiadis, G. Stringhini, and M. Sirivianos, "Disturbed YouTube for kids: Characterizing and detecting inappropriate videos targeting young children," in Proc. Int. AAAI Conf. Web Soc. Media, 2020, pp. 522–533. [Online]. Available: https://ojs.aaai.org/index.php/ICWSM/article/view/7320/7174
[21] R. Kaushal, S. Saha, P. Bajaj, and P. Kumaraguru, "KidsTube: Detection, characterization and analysis of child unsafe content & promoters on YouTube," in Proc. 14th Annu. Conf. Privacy, Secur. Trust (PST), Dec. 2016, pp. 157–164, doi: 10.1109/pst.2016.7906950.
[22] R. Tahir, F. Ahmed, H. Saeed, S. Ali, F. Zaffar, and C. Wilson, "Bringing the kid back into YouTube kids: Detecting inappropriate content on video streaming platforms," in Proc. IEEE/ACM Int. Conf. Adv. Soc. Netw. Anal. Mining, Aug. 2019, pp. 464–469, doi: 10.1145/3341161.3342913.

[23] A. Ulges, C. Schulze, D. Borth, and A. Stahl, "Pornography detection in video benefits (a lot) from a multi-modal approach," in Proc. ACM Int. Workshop Audio Multimedia Methods Large-Scale Video Anal., 2012, pp. 21–26, doi: 10.1145/2390214.2390222.
[24] C. Caetano, S. Avila, S. Guimaraes, and A. D. A. Araújo, "Pornography detection using BossaNova video descriptor," in Proc. 22nd Eur. Signal Process. Conf., 2014, pp. 1681–1685. [Online]. Available: https://ieeexplore.ieee.org/document/6952616
[25] L. Duan, G. Cui, W. Gao, and H. Zhang, "Adult image detection method base-on skin color model and support vector machine," in Proc. Asian Conf. Comput. Vis., 2002, pp. 797–800. [Online]. Available: http://aprs.dictaconference.org/accv2002/accv2002_proceedings/Duan797.pdf
[26] C. Jansohn, A. Ulges, and T. M. Breuel, "Detecting pornographic video content by combining image features with motion information," in Proc. 17th ACM Int. Conf. Multimedia, 2009, pp. 601–604, doi: 10.1145/1631272.1631366.
[27] P. Zhou, Q. Ding, H. Luo, and X. Hou, "Violence detection in surveillance video using low-level features," PLoS ONE, vol. 13, no. 10, Oct. 2018, Art. no. e0203668, doi: 10.1371/journal.pone.0203668.
[28] M. B. Garcia, T. F. Revano, B. G. M. Habal, J. O. Contreras, and J. B. R. Enriquez, "A pornographic image and video filtering application using optimized nudity recognition and detection algorithm," in Proc. IEEE 10th Int. Conf. Humanoid, Nanotechnol., Inf. Technol., Commun. Control, Environ. Manage. (HNICEM), Nov. 2018, pp. 1–5, doi: 10.1109/HNICEM.2018.8666227.
[29] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 1725–1732, doi: 10.1109/cvpr.2014.223.
[30] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. 27th Int. Conf. Neural Inf. Process. Syst. (NIPS), 2014, pp. 568–576. [Online]. Available: https://dl.acm.org/doi/10.5555/2968826.2968890
[31] Z. Wu, X. Wang, Y.-G. Jiang, H. Ye, and X. Xue, "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification," in Proc. 23rd ACM Int. Conf. Multimedia, Oct. 2015, pp. 461–470, doi: 10.1145/2733373.2806222.
[32] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, "Beyond short snippets: Deep networks for video classification," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 4694–4702, doi: 10.1109/CVPR.2015.7299101.
[33] J. P. Verma and S. Agrawal, "Big data analytics: Challenges and applications for text, audio, video, and social media data," Int. J. Soft Comput., Artif. Intell. Appl., vol. 5, no. 1, pp. 41–51, Feb. 2016, doi: 10.5121/ijscai.2016.5105.
[34] Y.-G. Jiang, Z. Wu, J. Wang, X. Xue, and S.-F. Chang, "Exploiting feature and class relationships in video categorization with regularized deep neural networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 2, pp. 352–364, Feb. 2017, doi: 10.1109/TPAMI.2017.2670560.
[35] M. J. Jones and J. M. Rehg, "Statistical color models with application to skin detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jun. 1999, pp. 274–280, doi: 10.1109/CVPR.1999.786951.
[36] T. Endeshaw, J. Garcia, and A. Jakobsson, "Classification of indecent videos by low complexity repetitive motion detection," in Proc. 37th IEEE Appl. Imag. Pattern Recognit. Workshop, Oct. 2008, pp. 1–7, doi: 10.1109/AIPR.2008.4906438.
[37] N. Rea, G. Lacey, R. Dahyot, and C. Lambe, "Multimodal periodicity analysis for illicit content detection in videos," in Proc. 3rd Eur. Conf. Vis. Media Prod., 2006, pp. 106–114, doi: 10.1049/cp:20061978.
[38] Y. Liu, X. Wang, Y. Zhang, and S. Tang, "Fusing audio-words with visual features for pornographic video detection," in Proc. IEEE 10th Int. Conf. Trust, Secur. Privacy Comput. Commun., Nov. 2011, pp. 1488–1493, doi: 10.1109/TRUSTCOM.2011.205.
[39] Y. Liu, Y. Yang, H. Xie, and S. Tang, "Fusing audio vocabulary with visual features for pornographic video detection," Future Gener. Comput. Syst., vol. 31, pp. 69–76, Feb. 2014, doi: 10.1016/j.future.2012.08.012.
[40] V. M. T. Ochoa, S. Y. Yayilgan, and F. A. Cheikh, "Adult video content detection using machine learning techniques," in Proc. 8th Int. Conf. Signal Image Technol. Internet Based Syst., Nov. 2012, pp. 967–974, doi: 10.1109/sitis.2012.143.
[41] S. Jung, J. Youn, and S. Sull, "A real-time system for detecting indecent videos based on spatiotemporal patterns," IEEE Trans. Consum. Electron., vol. 60, no. 4, pp. 696–701, Nov. 2014, doi: 10.1109/TCE.2014.7027345.
[42] S. Tang, T.-S. Chua, J. Li, Y. Zhang, C. Xie, M. Li, Y. Liu, X. Hua, Y.-T. Zheng, and J. Tang, "Pornprobe: An LDA-SVM based pornography detection system," in Proc. 17th ACM Int. Conf. Multimedia, 2009, pp. 1003–1004, doi: 10.1145/1631272.1631490.
[43] S. Lee, W. Shim, and S. Kim, "Hierarchical system for objectionable video detection," IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 677–684, May 2009, doi: 10.1109/TCE.2009.5174439.
[44] A. P. B. Lopes, S. E. F. D. Avila, A. N. A. Peixoto, R. S. Oliveira, M. D. M. Coelho, and A. D. A. Araújo, "Nude detection in video using bag-of-visual-features," in Proc. XXII Brazilian Symp. Comput. Graph. Image Process., Oct. 2009, pp. 224–231, doi: 10.1109/sibgrapi.2009.32.
[45] S. Reddy, N. Srikanth, and G. Sharvani, "Development of kid-friendly YouTube access model using deep learning," in Data Science and Security. Singapore: Springer, 2021, pp. 243–250, doi: 10.1007/978-981-15-5309-7_26.
[46] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, "Multimodal deep learning," in Proc. 28th Int. Conf. Int. Conf. Mach. Learn. (ICML), 2011, pp. 689–696. [Online]. Available: https://dl.acm.org/doi/10.5555/3104482.3104569
[47] I. Sutskever, O. Vinyals, and Q. V. Le, "Sequence to sequence learning with neural networks," in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, pp. 3104–3112. [Online]. Available: https://dl.acm.org/doi/10.5555/2969033.2969173
[48] J. Wehrmann, G. S. Simões, R. C. Barros, and V. F. Cavalcante, "Adult content detection in videos with convolutional and recurrent neural networks," Neurocomputing, vol. 272, pp. 432–438, Jan. 2018, doi: 10.1016/j.neucom.2017.07.012.
[49] M. Perez, S. Avila, D. Moreira, D. Moraes, V. Testoni, E. Valle, S. Goldenstein, and A. Rocha, "Video pornography detection through deep learning techniques and motion information," Neurocomputing, vol. 230, pp. 279–293, Mar. 2017, doi: 10.1016/j.neucom.2016.12.017.
[50] N. Aldahoul, H. A. Karim, M. H. L. Abdullah, and A. S. Ba Wazir, "An evaluation of traditional and CNN-based feature descriptors for cartoon pornography detection," IEEE Access, vol. 9, pp. 39910–39925, 2021, doi: 10.1109/ACCESS.2021.3064392.
[51] H. Yenala, A. Jhanwar, M. K. Chinnakotla, and J. Goyal, "Deep learning for detecting inappropriate content in text," Int. J. Data Sci. Analytics, vol. 6, no. 4, pp. 273–286, Dec. 2018, doi: 10.1007/s41060-017-0088-4.
[52] R. E. Trana, C. E. Gomez, and R. F. Adler, "Fighting cyberbullying: An analysis of algorithms used to detect harassing text found on YouTube," in Proc. Int. Conf. Appl. Hum. Factors Ergonom., 2020, pp. 9–15, doi: 10.1007/978-3-030-51328-3_2.
[53] M. Dadvar and K. Eckert, "Cyberbullying detection in social networks using deep learning based models," in Proc. Int. Conf. Big Data Analytics Knowl. Discovery, 2020, pp. 245–255, doi: 10.1201/9781003134527-11.
[54] H. Mohaouchane, A. Mourhir, and N. S. Nikolov, "Detecting offensive language on Arabic social media using deep learning," in Proc. 6th Int. Conf. Soc. Netw. Anal., Manage. Secur. (SNAMS), Oct. 2019, pp. 466–471, doi: 10.1109/snams.2019.8931839.
[55] S. Alshamrani, "Detecting and measuring the exposure of children and adolescents to inappropriate comments in YouTube," in Proc. 29th ACM Int. Conf. Inf. Knowl. Manage., Oct. 2020, pp. 3213–3216, doi: 10.1145/3340531.3418511.
[56] S. Alshamrani, M. Abuhamad, A. Abusnaina, and D. A. Mohaisen, "Investigating online toxicity in users interactions with the mainstream media channels on YouTube," in Proc. CIKM Workshops, 2020, pp. 1–6. [Online]. Available: http://ceur-ws.org/Vol-2699/paper39.pdf
[57] E. Mariconti, G. Suarez-Tangil, J. Blackburn, E. De Cristofaro, N. Kourtellis, and I. Leontiadis, "'You know what to do': Proactive detection of YouTube videos targeted by coordinated hate attacks," in Proc. ACM Hum.-Comput. Interact., vol. 3, pp. 1–21, Nov. 2019, doi: 10.1145/3359309.
[58] M. Gao, J. Jiang, L. Ma, S. Zhou, G. Zou, J. Pan, and Z. Liu, "Violent crowd behavior detection using deep learning and compressive sensing," in Proc. Chin. Control Decis. Conf. (CCDC), Jun. 2019, pp. 613–625, doi: 10.1109/ccdc.2019.8832598.
[59] S. Alghowinem, "A safer YouTube kids: An extra layer of content filtering using automated multimodal analysis," in Proc. SAI Intell. Syst. Conf., 2018, pp. 294–308, doi: 10.1007/978-3-030-01054-6_21.
[60] P. Vitorino, S. Avila, M. Perez, and A. Rocha, "Leveraging deep neural networks to fight child pornography in the age of social media," J. Vis. Commun. Image Represent., vol. 50, pp. 303–313, Jan. 2018, doi: 10.1016/j.jvcir.2017.12.005.

[61] A. Ishikawa, E. Bollis, and S. Avila, "Combating the elsagate phenomenon: Deep learning architectures for disturbing cartoons," in Proc. 7th Int. Workshop Biometrics Forensics (IWBF), May 2019, pp. 1–6, doi: 10.1109/iwbf.2019.8739202.
[62] S. Singh, R. Kaushal, A. B. Buduru, and P. Kumaraguru, "KidsGUARD: Fine grained approach for child unsafe video representation and detection," in Proc. 34th ACM/SIGAPP Symp. Appl. Comput., Apr. 2019, pp. 2104–2111, doi: 10.1145/3297280.3297487.
[63] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255, doi: 10.1109/CVPR.2009.5206848.
[64] M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," 2019, arXiv:1905.11946.
[65] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, "YouTube-8M: A large-scale video classification benchmark," 2016, arXiv:1609.08675.
[66] K. Soomro, A. Roshan Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," 2012, arXiv:1212.0402.
[67] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, "HMDB: A large video database for human motion recognition," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2556–2563, doi: 10.1109/iccv.2011.6126543.
[68] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman, "The kinetics human action video dataset," 2017, arXiv:1705.06950.
[69] L. Wolf, T. Hassner, and I. Maoz, "Face recognition in unconstrained videos with matched background similarity," in Proc. CVPR, Jun. 2011, pp. 529–534, doi: 10.1109/cvpr.2011.5995566.
[70] M. Kim, S. Kumar, V. Pavlovic, and H. Rowley, "Face tracking and recognition with visual constraints in real-world videos," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8, doi: 10.1109/cvpr.2008.4587572.
[71] A. Bermingham, M. Conway, L. McInerney, N. O'Hare, and A. F. Smeaton, "Combining social network analysis and sentiment analysis to explore the potential for online radicalisation," in Proc. Int. Conf. Adv. Soc. Netw. Anal. Mining, Jul. 2009, pp. 231–236, doi: 10.1109/asonam.2009.31.
[72] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko, "YouTube2Text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2712–2719, doi: 10.1109/iccv.2013.337.
[73] J. Xu, T. Mei, T. Yao, and Y. Rui, "MSR-VTT: A large video description dataset for bridging video and language," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 5288–5296, doi: 10.1109/cvpr.2016.571.
[74] N. Ketkar, "Introduction to Keras," in Deep Learning With Python. Berkeley, CA, USA: Springer, 2017, pp. 97–111, doi: 10.1007/978-1-4842-2766-4_7.
[75] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, and J. Dean, "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2016, pp. 265–283. [Online]. Available: https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf
[76] Z. Xu, Y. Yang, and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 1798–1807, doi: 10.1109/cvpr.2015.7298789.
[77] J. You and J. Korhonen, "Attention boosted deep networks for video classification," in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2020, pp. 1761–1765, doi: 10.1109/ICIP40778.2020.9190996.

KANWAL YOUSAF received the B.Sc. (Hons.) and M.Sc. degrees in Software Engineering from the University of Engineering and Technology (UET), Taxila, in 2010 and 2013, respectively, where she is currently pursuing the Ph.D. degree. She is also working as a Lecturer at UET, Taxila. Her research interests include deep learning, artificial neural networks, and machine learning.

TABASSAM NAWAZ received the Ph.D. degree from the University of Engineering and Technology (UET), Taxila. He is currently serving as a Professor and the Head of the Software Engineering Department, UET, Taxila. His research interests include advanced databases, and object-oriented design and analysis.