Violence Detection Using Deep Learning Techniques

Department of Computer Science and Engineering
Visvesvaraya National Institute of Technology
Nagpur, India
[email protected] (ORCID: 0000-0001-7927-8643)
Abstract—In recent years, human action recognition has improved considerably. Violence recognition remains one of the most challenging research topics in computer vision. One specific application is detecting violence in footage from surveillance cameras in public and private places, where immediate control of violent incidents is desired. A human operator is needed to monitor the surveillance screen, which often leads to mistakes and failures to pinpoint unusual events; this motivates the search for automated violence detection systems. This paper discusses this research problem and explores LSTM- and BiLSTM-based solutions to it. In addition, an attention layer is used, together with a new dataset collected from surveillance cameras and recorded videos available on YouTube, Facebook, etc. This dataset is publicly available. Comprehensive tests were conducted on the Hockey Fight dataset and on a dataset containing crowd and normal public scenes. The results indicate that the BiLSTM model is more accurate than the LSTM model in fight situations.
Index Terms—Violence detection, Surveillance, LSTM, BiLSTM, CNN, VGG16.
I. INTRODUCTION

In various parts of the world, including India, violence is taking its toll on humanity. Violence is widespread in the lives of many people all over the world. No nation is untouched by violence, and we are all affected to various degrees. Many people protect themselves by locking their houses and avoiding unsafe places. For others, there is no escape from danger: violence exists behind closed doors, far away from our eyes. The economic and political structures of these areas have failed to reduce violence. In the name of race, religion, and sometimes nationality, violence is happening all around the world, and it is very difficult to stop. However, some advanced technologies have been applied to detect violent activity in live video. Violence detection has been receiving more attention as a research topic because it has many useful applications. Unfortunately, violent scenes in movies and the media are very common, and the new generation can easily access such content. A major use case is detecting violence in community places such as roads, buses, underground parking, hospitals, and institutions, and then automatically alerting public officials so that immediate action can be taken. Violent activity covers a wide range of actions, for example fighting, explosions, and vandalism.

A fighting event is defined as two or more people fighting with an intensity that warrants intervention. Violence in other parts of the world often involves heavy weapons such as machine guns and knives, whereas violence in India is quite different, as it frequently occurs between various social and religious groups. This violence typically does not involve lethal weapons; attackers usually punch or strike with heavy metal rods and sticks. As a general assumption, to detect the movement of violent mobs, the database needs to be evenly distributed.

The goal of this project is to contribute to society by working on this social problem. The proposed method offers a solution that helps prevent violence in society. Deep learning can be helpful in detecting such violence. Deep learning is an advanced technology, a branch of machine learning based on artificial neural networks. We propose a comprehensive deep-learning-based solution that can identify violent behaviour in real-time video surveillance. The proposed method is able to detect such incidents using deep learning strategies, classify the surveillance video into two classes, Violent and Non-violent, and thereby help provide immediate relief to victims.

II. RELATED WORK

Early work in this field identified violent and non-violent fights using cues such as blood, flames, frame rate, and the sounds accompanying violent events. Important work was carried out by combining both video and audio features to detect local violence. For example, a weakly supervised method was used to integrate visual and audio classifications. While adding audio features to the analysis can be very effective, audio is rarely available for public surveillance video [1]. Researchers have therefore progressively addressed this problem using violence detection frameworks that do not require any audio features. In addition, violence is widespread and is not limited to interpersonal violence; it also includes other types such as fire, physical violence, sports violence, mob violence, and gunfire. For mob violence detection, Support Vector Machines (SVMs) and Latent Dirichlet Allocation (LDA) can be used [2].
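Reference [2] does not spell out implementation details here; the following is a minimal sketch, under the assumption that each clip is represented by a bag-of-visual-words histogram, of how LDA topic features could feed an SVM classifier using scikit-learn. All variable names, data shapes, and hyperparameters are illustrative, not taken from the paper.

    # Illustrative sketch (not from [2]): LDA topic features over bag-of-visual-words
    # histograms, followed by an SVM classifier for mob-violence detection.
    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import SVC
    from sklearn.pipeline import make_pipeline

    # X: one row per clip, each row a histogram of visual-word counts (synthetic data).
    rng = np.random.default_rng(0)
    X = rng.integers(0, 20, size=(200, 500))   # 200 clips, 500-word visual vocabulary
    y = rng.integers(0, 2, size=200)           # 1 = violent, 0 = non-violent

    model = make_pipeline(
        LatentDirichletAllocation(n_components=20, random_state=0),  # topic mixtures
        SVC(kernel="rbf", C=1.0),                                     # binary classifier
    )
    model.fit(X, y)
    print("train accuracy:", model.score(X, y))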
Some researchers have focused on the problem of detecting violence in real time and have obtained good results. They use surveillance cameras as a source of real-time data, compute a violent-flow descriptor from changes in the optical-flow vector magnitudes, and then collect statistics over short frame windows. Using this method, they obtained 82.9% accuracy [3].

For detecting violence, some researchers use a 3D CNN instead of a 2D CNN, with nine layers. The model uses a single output node, so the result takes one of two values, true or false. The activation function and optimizer used are sigmoid and stochastic gradient descent, which provide an accuracy of 91% on the Hockey Fight dataset [4].
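The exact architecture of [4] is only summarized above; as an illustration of the general idea rather than the authors' network, a minimal 3D-CNN binary classifier with a sigmoid output node trained with SGD could look as follows in Keras. Layer counts, filter sizes, and the input resolution are assumptions.

    # Illustrative 3D-CNN sketch (layer sizes are assumptions, not the exact model of [4]).
    from tensorflow.keras import layers, models, optimizers

    def build_3d_cnn(frames=16, height=112, width=112, channels=3):
        model = models.Sequential([
            layers.Input(shape=(frames, height, width, channels)),
            layers.Conv3D(32, kernel_size=3, activation="relu", padding="same"),
            layers.MaxPooling3D(pool_size=2),
            layers.Conv3D(64, kernel_size=3, activation="relu", padding="same"),
            layers.MaxPooling3D(pool_size=2),
            layers.Conv3D(128, kernel_size=3, activation="relu", padding="same"),
            layers.GlobalAveragePooling3D(),
            layers.Dense(64, activation="relu"),
            layers.Dense(1, activation="sigmoid"),   # single node: violent / non-violent
        ])
        model.compile(optimizer=optimizers.SGD(learning_rate=0.01),
                      loss="binary_crossentropy", metrics=["accuracy"])
        return model

    model = build_3d_cnn()
    model.summary()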
Several datasets are used in this area. The Peliculas dataset includes fighting and non-fighting videos downloaded from YouTube or taken from movies [4]. Another dataset is taken from ice hockey games and contains fight and non-fight clips. A further dataset contains various crime scenes covering different kinds of violence; the UCF dataset, for instance, includes arson, robbery, burglary, and other violent scenes. A dataset published in 2019 contains security-camera videos with examples of fights [5].

Many existing strategies use changes within the frame sequence to detect violence, capturing the patterns of rapid movement that are characteristic of violent actions. The acceleration of motion across adjacent frames is used as an indicator of rapid movement between sequences. Many methods follow strategies such as body-pose estimation and tracking to identify local interest points and extract features at those points. These include the Harris detector and the Motion Scale-Invariant Feature Transform (MoSIFT) [6]. MoSIFT descriptors usually consist of two parts: the first is a Histogram of Oriented Gradients (HoG), which describes the appearance of the region, and the second is a Histogram of Optical Flow (HoF), which describes the motion at the feature point.
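As a rough illustration of this two-part appearance/motion descriptor, and not the exact MoSIFT implementation of [6], the following sketch computes an HoG vector for a patch in the current frame and a magnitude-weighted histogram of dense optical-flow orientations between two consecutive frames, using OpenCV and scikit-image. Patch size, bin counts, and flow parameters are assumptions.

    # Illustrative appearance (HoG) + motion (HoF) descriptor around a patch;
    # a simplification of the MoSIFT idea in [6], not its exact implementation.
    import cv2
    import numpy as np
    from skimage.feature import hog

    def patch_descriptor(prev_gray, curr_gray, x, y, size=32, bins=8):
        # Appearance part: HoG over the patch in the current frame.
        patch = curr_gray[y:y + size, x:x + size]
        hog_vec = hog(patch, orientations=bins, pixels_per_cell=(8, 8),
                      cells_per_block=(2, 2), feature_vector=True)

        # Motion part: histogram of dense optical-flow orientations, weighted by magnitude.
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx = flow[y:y + size, x:x + size, 0]
        fy = flow[y:y + size, x:x + size, 1]
        mag, ang = cv2.cartToPolar(fx, fy)
        hof_vec, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)

        return np.concatenate([hog_vec, hof_vec])

Local descriptors of this kind are typically quantized into a bag-of-visual-words representation before being passed to a classifier.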
LSTMs have also been used for video analysis. Initially, several types of features are collected using SIFT and Bag-of-Words (BoW) methods. The features are then passed to an LSTM-RNN model to take advantage of temporal information and are finally used for classification, achieving an accuracy of 90% on a soccer database [7].

Fenil et al. [8] used football match videos for violence detection, extracted Histogram of Oriented Gradients (HoG) features from the video, and fed the resulting vectors to a BiLSTM network.

Ullah et al. used different CNN methods for feature extraction from the frames [9]. The extracted features are then fed to a final bidirectional layer that performs the classification. Another proposed model combines a CNN with an LSTM: the CNN extracts the features and feeds them to the LSTM. In this model a total of 256 filters are used with ReLU as the activation function. Using the Hockey Fight dataset, they achieved an accuracy of 94.6% [10].

Hanson et al. [11] explored the problem of detecting violent activity and introduced a spatiotemporal encoder built on a bidirectional convolutional LSTM. Here the video frames are encoded by a forward pass through a VGG13 network into a collection of feature maps. They achieved a comparable result of 97.3% (at epoch 63) on the Hockey Fights dataset.

Other researchers tried to build an improved violence detector using two steps. In the first step they design a feature extractor that mainly focuses on changes in motion magnitude; in the second they combine the extracted features with a multi-classifier ensemble. They applied this model to two different datasets and achieved a good accuracy of 94.84% [12].

A lightweight model was introduced by Halder et al. [13] for smart surveillance. They use a CNN-based BiLSTM model to detect violent activity and achieve 99.27% accuracy on the Hockey Fights dataset. Their model compares the current frame with the previous and upcoming frames to identify the sequential flow of events, and the classifier then decides whether the action is violent or not. However, they did not evaluate on the Crowd dataset, on which it is difficult to achieve good accuracy.

III. PROPOSED APPROACH

For the classification of violent and non-violent actions, our model makes predictions from sequences of consecutive frames. The frames are resized to 224 x 224 pixels and given as input to a pretrained model, from which we extract spatial features; these features are then fed into an LSTM as input sequences. The LSTM uses these features and classifies the output as violent or non-violent. We extend this concept further by using VGG16 with a BiLSTM model, which encodes the sequence in both directions. We assume that access to both past and future inputs from the current state allows the BiLSTM to understand the context of the current input more clearly, which in turn allows it to classify better on complex and heterogeneous datasets. The functionality of the proposed network is tested on the Hockey Fights, Crowd, and Violent Flows datasets, whose videos are collected from different sources.
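A concrete sketch of this pipeline in Keras is given below: per-frame VGG16 spatial features followed by an LSTM or BiLSTM classification head. The sequence length of 20 frames and the 224 x 224 input size follow the paper; hidden sizes, dropout, and the optimizer are assumptions, since the paper does not list them.

    # Sketch of the proposed pipeline: VGG16 spatial features per frame, then an
    # LSTM or BiLSTM temporal classifier. Hidden sizes and optimizer are assumptions.
    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    SEQ_LEN, IMG_SIZE, FEAT_DIM = 20, 224, 512   # 20 frames of 224x224 per clip

    # Frozen VGG16 backbone used only as a spatial feature extractor (512-d per frame).
    backbone = VGG16(weights="imagenet", include_top=False,
                     pooling="avg", input_shape=(IMG_SIZE, IMG_SIZE, 3))
    backbone.trainable = False

    def build_classifier(bidirectional=True):
        inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))      # precomputed VGG16 features
        rnn = layers.LSTM(128, dropout=0.3)
        x = layers.Bidirectional(rnn)(inputs) if bidirectional else rnn(inputs)
        x = layers.Dense(64, activation="relu")(x)
        outputs = layers.Dense(1, activation="sigmoid")(x)    # violent vs. non-violent
        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        return model

    lstm_model = build_classifier(bidirectional=False)
    bilstm_model = build_classifier(bidirectional=True)

In practice, the per-frame features would be precomputed by running the frozen backbone over each clip's 20 frames (see the preprocessing sketch after the Dataset section) and cached before training the recurrent heads.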
IV. DATASET

To test the effectiveness of the proposed methods, multiple test datasets have been designed for research purposes. The commonly used datasets include Violent Flows, Hockey Fight, and the Crowd dataset.

Hockey Fight dataset: This dataset was created by collecting short video clips of hockey matches containing both fight and non-fight scenes. It is a collection of 1000 videos, equally distributed between fight and non-fight clips. The average clip length is 1 second.

Crowd dataset: This dataset was created by collecting videos of crowd scenes from many public places, containing both fight and non-fight scenes. It is a collection of 700 videos. Unlike the Hockey Fight dataset, the crowd clips have varied backgrounds and contents. The average clip length is 3 seconds.
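A minimal sketch of the preprocessing implied by Sections III and IV, namely sampling 20 evenly spaced frames per clip, resizing them to 224 x 224, and extracting VGG16 features for the LSTM/BiLSTM models sketched above, could look like this. The helper names and padding behaviour are illustrative, not taken from the paper.

    # Sketch: sample 20 evenly spaced frames per clip, resize to 224x224, and
    # extract 512-d VGG16 features to feed the LSTM/BiLSTM models above.
    import cv2
    import numpy as np
    from tensorflow.keras.applications.vgg16 import preprocess_input

    SEQ_LEN, IMG_SIZE = 20, 224

    def sample_frames(video_path, seq_len=SEQ_LEN, img_size=IMG_SIZE):
        cap = cv2.VideoCapture(video_path)
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        idxs = np.linspace(0, max(total - 1, 0), seq_len).astype(int)
        frames = []
        for i in idxs:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
            ok, frame = cap.read()
            if not ok:
                frame = np.zeros((img_size, img_size, 3), dtype=np.uint8)  # pad short clips
            frame = cv2.resize(frame, (img_size, img_size))
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
        return preprocess_input(np.array(frames, dtype=np.float32))

    def clip_features(video_path, backbone):
        frames = sample_frames(video_path)          # (20, 224, 224, 3)
        return backbone.predict(frames, verbose=0)  # (20, 512) sequence for the RNN head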
CONCLUSION

This study explores the use of features extracted from video frames for the detection of violence in videos. We experimented with two models, LSTM and BiLSTM, on top of the pretrained VGG16 model. The features extracted from 20 frames are fed into the LSTM and BiLSTM models as input sequences. The features extracted by the pretrained VGG16 combined with the BiLSTM model proved to be more effective than those used with the LSTM model. The framework was evaluated in a number of ways using the challenging Violent Flows database and the Crowd dataset. The test results of the proposed framework show that violence in crowds and real-time violence can be recognised with high accuracy.

Despite the satisfactory performance of the proposed model, it requires continued evaluation on more standard datasets, including the identification of violent activities involving weapons, which are hard to detect.
REFERENCES

[1] J. Nam, M. Alghoniemy, and A. H. Tewfik, "Audio-visual content-based violent scene characterization," in Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No. 98CB36269), vol. 1. IEEE, 1998, pp. 353–357.