
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3059170, IEEE Access.

Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.
Digital Object Identifier 10.1109/ACCESS.2021.Doi Number

Weapon Detection in Real-Time CCTV Videos using Deep Learning

Muhammad Tahir Bhatti1, Muhammad Gufran Khan1 (Senior Member, IEEE), Masood Aslam2, Muhammad Junaid Fiaz1
1 Department of Electrical Engineering, National University of Computer and Emerging Sciences, CFD Campus, Punjab, Pakistan
2 Department of Electrical Engineering, COMSATS University Islamabad, Pakistan

Corresponding author: Muhammad Tahir Bhatti ([email protected]).

This work was sponsored by the Higher Education Commission (HEC) of Pakistan under the Technology Development Fund (TDF) with project code TDF-02-161.

ABSTRACT Security and safety are major concerns in today's world. For a country to be economically strong, it must ensure a safe and secure environment for investors and tourists. Closed-circuit television (CCTV) cameras are widely used for surveillance and for monitoring activities such as robberies, but these cameras still require human supervision and intervention. A system is needed that can detect such illegal activities automatically. Despite state-of-the-art deep learning algorithms, fast processing hardware, and advanced CCTV cameras, weapon detection in real time is still a serious challenge. Differences in viewing angle and occlusion by the carrier of the firearm or by persons around it further increase the difficulty. This work focuses on securing public places by using CCTV footage as the source and applying state-of-the-art open-source deep learning algorithms to detect harmful weapons. We implement binary classification with the pistol class as the reference class, and we introduce the concept of including relevant confusion objects to reduce false positives and false negatives. Since no standard dataset was available for the real-time scenario, we built our own dataset by taking weapon photos with our own camera, manually collecting images from the internet, extracting frames from YouTube CCTV videos, and gathering data from GitHub repositories, the University of Granada research group, and the Internet Movies Firearms Database (IMFDB, imfdb.org). Two approaches are used: sliding window/classification and region proposal/object detection. The algorithms evaluated include VGG16, Inception-V3, Inception-ResNetV2, SSD-MobileNetV1, Faster-RCNN Inception-ResNetV2 (FRIRv2), YOLOv3, and YOLOv4. Because precision and recall matter more than plain accuracy for object detection, all of these algorithms were evaluated in terms of those metrics. YOLOv4 stands out among all the algorithms, giving an F1-score of 91% along with a mean average precision of 91.73%, higher than previously achieved.

INDEX TERMS Gun Detection, Deep Learning, Object Detection, Artificial Intelligence, Computer Vision

I. INTRODUCTION

The crime rate across the globe has increased mainly because of the frequent use of handheld weapons during violent activity. For a country to progress, the law-and-order situation must be under control. Whether we want to attract investors or to generate revenue through the tourism industry, all of this needs a peaceful and safe environment. The crime ratio due to guns is very critical in numerous parts of the world, mainly in countries where it is legal to keep a firearm. The world is a global village now, and what we speak or write has an impact on people. Even if a piece of news is crafted with no truth in it, once it goes viral within a few hours because of the media, and especially social media, the damage is done. People now suffer more from depression and have less control over their anger, and hate speech can push them to lose their minds. People can be brainwashed, and psychological studies show that if a person has a weapon in such a situation, he may lose his senses and commit a violent act.

Several serious incidents involving harmful weapons in public areas were recorded in the past few years. Starting with last year's attacks on two mosques in New Zealand: on March 15, 2019 at 1:40 pm, the attacker attacked the Christchurch Al-Noor Mosque during Friday prayer, killing almost 44 innocent and unarmed worshippers.


On the same day, just 15 minutes later at 1:55 PM, another attack happened, killing seven more civilians [1]. Active shooter incidents have also occurred in the USA and in Europe. The most significant cases were those at Columbine High School (USA, 37 victims), Anders Breivik's assault on Utøya Island (Norway, 179 victims), and the Charlie Hebdo newspaper attack, which killed 23. According to statistics provided by the UNODC, the rate of gun-related crime per 100,000 people is very high in several countries, e.g. 1.6 in Belgium, 4.7 in the United States, and 21.5 in Mexico [2].

CCTV cameras play an important role in overcoming this problem and are considered one of the most important requirements for security [3]. CCTVs are installed in almost every public place today and are mainly used for providing safety, crime investigation, and other security and detection measures. CCTV footage is among the most important evidence in courts; after a crime is committed, law enforcement agencies arrive at the scene and take the recorded footage with them [4]. If we look at the surveillance systems of different countries around the world, the UK has about 4.5 million cameras used for surveillance, and Sweden had about 50,000 cameras installed by around 2010. The government of Poland was able to reduce drug cases by 60% and street fights by 40% by installing just 450 cameras in the city of Poznan [5]. China has the world's biggest surveillance system, with 170 million cameras around the nation, and these were expected to expand roughly three times, with an additional 400 million to be connected by 2020. It took only seven minutes for Chinese officials to find and apprehend BBC reporter John Sudworth using their extensive CCTV camera network and facial recognition technology [6].

In previous years, even with surveillance cameras installed, using them for security purposes was not an easy or dependable method: a human has to be present all the time to monitor the screens. A CCTV operator typically has to monitor 20-25 screens for 10 hours, looking, observing, identifying, and controlling situations that could be harmful to people and property. As the number of screens increases, the operator's concentration on each screen decreases considerably over time; it is impossible for the person monitoring the screens to keep the same level of attention all the time [7].

The solution to the aforementioned problem is to install surveillance cameras with the ability to automatically detect weapons and raise an alarm to alert the operators or security personnel. However, not much work has been done on algorithms for weapon detection in surveillance cameras, and related studies often consider concealed weapon detection (CWD), mostly using X-ray or millimeter-wave images and traditional machine learning techniques [8-12]. In the past few years, deep learning, in particular the convolutional neural network (CNN), has given groundbreaking results in object categorization and detection. It has achieved the finest results so far in classical image processing problems such as classification, detection, and localization. Instead of features being selected manually, a CNN automatically learns features from the given data.

This article presents an automatic detection and classification method for weapons in the real-time scenario using state-of-the-art deep learning models. For the real-time implementation of the problem addressed in this work, "detecting weapons in real time for potential robbers/terrorists using deep learning", detection and classification were done for pistols, revolvers, and other short handheld weapons grouped into a single class called pistol, and for related confusion objects such as cell phones, metal detectors, wallets, and selfie sticks grouped into the not-pistol class. A major reason behind this was our study of the weapons used in robbery cases, which motivated us to choose the pistol and revolver as our target objects. We went through several CCTV-captured robbery videos on YouTube and found that almost 95% of cases involved a pistol or revolver as the weapon used. With the implementation of this system, many robbery crimes, and incidents like last year's attack on New Zealand's Christchurch mosque, could be controlled using an early alarm system that alerts the operator and the concerned authorities so action can be taken immediately.

Gun detection in real time is a very challenging task. Our desired object has a small size, so detecting it in an image is also very challenging in the presence of other objects, especially those that can be confused with it. The deep learning models faced several challenges for the detection and classification task:
• The first and main problem is the data through which the CNN learns the features to be used later for classification and detection.
• No standard dataset was available for weapons.
• For real-time scenarios, making a novel dataset manually was a very long and time-consuming process.
• Labeling the desired database is not an easy task, as all data needs to be labeled manually.
• Different detection algorithms were used, so a labeled dataset for one algorithm cannot be utilized for another.
• Every algorithm requires different labeling and pre-processing operations for the same labeled database.
• For real-time implementation, detection systems require the exact location of the weapon, so gun blocking or occlusion is also a problem that arises frequently; it can occur because of self, inter-object, or background blocking.

Different approaches are used in this work for weapon classification and detection, but all of them have deep learning and CNN architectures behind them because of their state-of-the-art performance.


Training from scratch would have taken a very long time, so the transfer learning approach was used, with ImageNet and COCO (Common Objects in Context) pre-trained models. Different datasets were made for classification and detection. For real-time purposes, we built our dataset by taking weapon photos with our own camera; data were also extracted manually from robbery CCTV videos, downloaded from IMFDB (the Internet Movie Firearms Database), taken from the dataset released by the University of Granada, and gathered from other online repositories. All of this work has been done to achieve results in real time.

The main contributions of this work are:
• presentation of a first detailed and comprehensive work on weapon detection that achieves detection in videos from real-time CCTV and works well even at low resolution and brightness; most earlier work was done on high-definition training images, but the real-time scenario needs real-time training data as well for better results;
• identification of the most suitable and appropriate CNN-based object detector for the application of weapon detection in real-time CCTV video streams;
• construction of a new dataset, because real-time detection also needs real-time training data; we built a new database of 8327 images and pre-processed it using different OpenCV filters, i.e. equalized, grayscale, and CLAHE, which helped in detecting objects in low-brightness and low-resolution images;
• introduction of the concept of related confusion classes to reduce false positives and negatives;
• training and testing of our novel database on the latest state-of-the-art deep learning classification and detection models; among them YOLOv4 performed best in terms of both speed and accuracy, and our selected trained model predicts objects at almost every orientation, angle, and view, achieving the highest mean average precision of 91.73% along with an F1-score of 91%.

The rest of the paper is organized as follows: related work is discussed in Section II. The implementation methodology based on deep learning algorithms is explained in Section III. The dataset construction, annotation, and pre-processing using different filters are discussed in Section IV, followed by the experiments and results in Section V. Finally, the conclusion and future work are discussed in Section VI.

II. RELATED WORK
The problem of detection and classification of objects in real time started after major developments in the CCTV field, processing hardware, and deep learning models. Very little work has been done in this field before, and most of the previous effort was related to concealed weapon detection (CWD).

Starting with concealed weapon detection (CWD): before its use in weapon detection, it was used for luggage control and other security purposes at airports and was based on imaging techniques like millimeter-wave and infrared imaging [13]. Sheen et al. suggested a CWD method based on three-dimensional millimeter (mm) wave imaging for detecting weapons hidden on the body at airports and other secure locations [14]. Z. Xue et al. suggested a CWD technique based on fusion by multi-scale decomposition, which combines a color visual picture with an infrared (IR) picture [15]. R. Blum et al. suggested a CWD method based on fusing the visual picture with an IR or mm-wave picture using a multi-resolution mosaic technique to highlight the hidden weapon in the target picture [16].

E. M. Upadhyay et al. suggested a CWD technique using image fusion. They fused IR and visual images to detect hidden weapons in scenes containing over- and under-exposed areas; their methodology was to apply a homomorphic filter to visual and IR pictures captured under distinct exposure conditions [17]. Current techniques attain high precision by using various combinations of extractors and detectors, either by using simple intensity descriptors, boundary detection, and pattern matching [18] or by using more complicated techniques such as cascade classifiers with boosting.

CWD worked for some cases, but it had many limitations. These systems were based on metal detection, so non-metallic guns cannot be detected. They were costly to deploy in many locations because they need to be coupled with X-ray scanners and conveyor belts, and they respond to all metallic objects, so they were not accurate. Economic cost and health risks limited the practical implementation of such methods. Furthermore, video-based firearm detection is a preventive measure complementary to acoustic gunshot detection and can be combined with it in an implementation [19-20].

The idea of automated image processing for public security purposes has been well recognized and studied in many fields. CCTV was the essential enabler for this kind of work to progress. CCTV was first used back in 1946 in Germany, when cameras were installed to observe the launch of the V2 rocket [21]. Although it had been used earlier, the major improvements happened in the last two decades. With the advancement in CCTV technology, visual object recognition and detection for surveillance, control, and security were performed. In 1973, the charge-coupled device (CCD) was developed, which made the deployment of surveillance cameras practical by 1980 [22]. A bit later, a company named Axis Communications developed the first-ever network camera, which enabled the transformation of surveillance cameras from analog to digital [22]. This transformation from analog to digital video made it possible for everyone to apply image processing, machine learning, and computer vision techniques to videos recorded from surveillance cameras. In 2003, Royal Palm Middle School in Phoenix used facial recognition for the first time for tracking missing children.

Several object detection algorithms were proposed in the field of computer vision to make surveillance systems better.


Object detection algorithms have been used in several sectors, such as anomaly detection, deterrence, human detection, and traffic monitoring [23]. R. Chellappa et al. briefly discussed object tracking and detection in surveillance cameras [24]; the authors explained the tracking of an object using multiple surveillance cameras. Other authors addressed techniques for detecting objects that come into contact with another object and are occluded. They also discussed mean-shift segmentation and outlined how it can help detect objects, and they used a Bayesian Kalman filter with a simplified Gaussian mixture (BKF-SGM) algorithm to track the detected object [25]. J. S. Marques proposed distinct techniques for evaluating the efficiency of different object recognition algorithms [26]. B. Triggs et al. described the histogram of oriented gradients (HOG), which became a novel architecture for feature extraction and was used mostly in applications involving human detection [27]. In 2005, the sliding window technique was proposed for the recognition of number plates [28]; the authors used a sliding window for segmentation and a neural network for character recognition on the number plate.

As described above, object detection in computer vision was used for applications with large objects to identify, like a person, transport, or traffic monitoring. A literature review on weapon detection shows that, regardless of the many object detection algorithms available, the algorithms proposed for weapon detection are very few. Eventually, the idea of firearm detection in images and videos was proposed; false alarms were reduced by classifying with neural networks using region-based descriptors, determining the region of interest (ROI) using the sliding window technique, and then training the neural network classifier on image pixels [29].

With the development of CCTV, object detection for different real-time computer vision problems was performed, and the idea of detecting firearms was introduced first by L. Ward et al. in 2007 [30]; a surveillance system was also implemented by them a year later, in 2008 [31]. In the aforementioned work, the authors created an accurate pistol detection model for RGB pictures; however, their method did not detect multiple pistols in the same scene [31-33]. The approach used comprises first removing non-related items from the segmented picture using the K-means clustering algorithm and then applying the SURF (Speeded-Up Robust Features) method to detect points of interest. Darker introduced the concept of a SIFT-based weapon detection algorithm and used the motion segmentation method for ROI estimation [34]. The SIFT algorithm is prone to false alarms, so for estimating the ROI the authors used motion segmentation rather than applying SIFT to the complete image; once the ROI was determined, SIFT was applied to detect firearms.

Different approaches were then used for weapon detection using sliding window and region proposal algorithms. HOG (histogram of oriented gradients) models were used to predict the objects in the frame. Significant HOG-based work used low-level features, discriminative learning, and pictorial structures along with SVMs [35-37]. These algorithms were slow for real-time scenarios, taking about 14 s per image. Although these classifiers gave good accuracies, the slowness of the sliding window method was a big problem, especially for real-time implementation.

This work focuses on state-of-the-art deep learning networks rather than SIFT and HOG features, which use handcrafted rules for feature extraction, selection, and detection, for the real-time visual scenario using CCTV cameras. X. Zhang et al. reached an important conclusion that helped our work: automatic feature representation gave better results than manual features [38]. Not only were the learned features better in performance, they also captured a deep representation of the data, reduced a lot of manual work, and saved time and energy.

Rohith Vajhala et al. proposed a technique for knife and gun detection in surveillance systems. They used HOG as a feature extractor along with a backpropagation-trained artificial neural network for classification. Detection was performed in different scenarios, first with the weapon only and then using HOG and background subtraction to detect the human before the desired object, and they claimed an accuracy of 83% [39]. The aforementioned line of work also used a CNN, with ReLU non-linearity, convolutional layers, fully connected layers, and a dropout layer, to perform detection with multiple classes, implemented on the open-source TensorFlow platform; that system achieved a test accuracy of 90.2% on its dataset [40]. Michał Grega et al. proposed knife and firearm detection in CCTV images. They applied MPEG-7 features and principal component analysis along with the sliding window approach, which made their work slower for real-time scenarios, although they claimed to achieve good accuracy on their test dataset [41].

Verma et al. also used deep learning to detect weapons, with the Faster RCNN model. The work was performed on IMFDB, which in our opinion is not suitable for training a model for the real-time case. They claimed an accuracy of 93.1% on that dataset, but in the case of weapon detection, achieving high accuracy alone is not enough; precision and recall must be considered [42]. The work of Siham Tabik et al. was closely related to the real-time scenario. They used Faster RCNN to detect weapons in real time using both the sliding window and region proposal methods; the best results were obtained with the region proposal technique. The sliding window was very time-consuming, taking 14 s/image, whereas the region proposal method processed an image in 140 ms at 7 fps [43]. They trained the network on Faster RCNN using only one class, focusing on reducing false positives.


A recent object detection work applied to firearms was proposed in 2019, when a group of researchers, Javed Iqbal et al., proposed orientation-aware detection of the object. This system is more suitable for long and thin objects like rifles; the predicted bounding box in their case was aligned with the object and had less unnecessary area to deal with. Images of very high quality were used for training and testing, which may make it less suitable for real-time scenarios [44]. The work of Jose Luis Salazar González et al. was closely related to achieving real-time results; they carried out extensive experimentation with different datasets, trained Faster RCNN using a Feature Pyramid Network with ResNet50, and improved the previous state of the art by 3.91% [45].

III. METHODOLOGY
Deep learning is a branch of machine learning inspired by the functionality and structure of the human brain, also called an artificial neural network. The methodology adopted in this work features state-of-the-art deep learning, especially convolutional neural networks, due to their exceptional performance in this field [46]. The aforementioned techniques are used both for classifying and for localizing the specific object in a frame, so both object classification and object detection algorithms were used; because our object is small and appears with other objects in the background, we found the best algorithm for our case only after experimentation. Sliding window/classification and region proposal/object detection algorithms were used; these techniques are discussed later in this section.

We started by doing classification using different deep learning models and achieved good precision, but for real-time scenarios the low frame rate of classification models was the real issue in implementation. Oxford VGG [47-48], Google Inception-v3 [49], and Inception-ResNetv2 [50-51] were trained using this approach.

To achieve high precision, increase the number of frames per second, and improve localization, we moved to the object detection and region proposal methods. Different state-of-the-art deep learning models for object detection were used, and the results were compared in terms of precision, speed, and the standard metric of F1-score. The state-of-the-art deep learning based SSD-MobileNetv1 [52-54], YOLOv3 [55], FasterRCNN-InceptionResnetv2 [56-58], and YOLOv4 [59] were trained and tested.

Different datasets were made keeping in mind the classification and detection problems, as both have separate requirements for achieving high accuracy, mean average precision, and frames per second for real-time implementation. To understand object classification and detection, let us first briefly understand object recognition, as both of the aforementioned types come under its umbrella, and combined classification and localization make detection possible for any kind of detection problem, giving the class name as well as the region where our desired object is in the frame.

A. OBJECT RECOGNITION
As the name suggests, this is the process of predicting the real class or category of an image by making the probability high only for that particular class. CNNs are used to perform this process efficiently; many state-of-the-art classification and detection algorithms use a CNN as a backend to perform their tasks.

FIGURE 1. Object Recognition to Detection Hierarchy

Fig. 1 depicts that classification and localization come under the category of recognition, and combined classification and localization are performed to do object detection. Let us have a brief overview of object classification, localization, and detection.

1) IMAGE CLASSIFICATION
The classification model takes an image and slides a kernel/filter over the whole image to get the feature maps. From the extracted features, it then predicts the label based on the probability.

2) OBJECT LOCALIZATION
This method outputs the actual location of an object in an image by giving the associated height and width along with its coordinates.

3) OBJECT DETECTION
This task uses the properties of both aforementioned algorithms. The detection algorithm outputs a bounding box having x and y coordinates with an associated width and height, along with the class label. Non-max suppression is used to keep only the boxes above our desired threshold [60]. This process gives the following results altogether:
• Bounding box
• Probability

In the past, object detection was very limited because of scarce data and the low processing power of computers, but with the passage of time the computing power of computers increased and the world moved from CPUs to graphics processing units (GPUs). GPUs were first made for increasing the graphics quality of systems and for gaming, but later GPUs were used extensively for deep learning. ImageNet competitions started, containing about 1000 classes [61]. This was the evolution of machine learning and deep learning. In the beginning, the models were not very deep, meaning there were not as many layers as there are now in a typical algorithm.


Because of the aforementioned developments, in 2012 A. Krizhevsky presented a model called AlexNet, trained on ImageNet, which took first place in that competition. This was the beginning of object detection in deep learning. It opened a path for researchers, and since then new algorithms and models have kept coming every year. All these algorithms contain layers that work on the principle of the convolutional neural network (CNN).

B. CLASSIFICATION AND DETECTION APPROACH
There are many ways to generate region proposals, but the simplest is the sliding window approach. The sliding window method is slow because the filter slides over the entire frame, and it has limitations that were tackled by the region proposal approach, so the following two approaches are used in our work for the classification and detection models:
• Sliding window/classification models
• Region proposal/object detection models

1) SLIDING WINDOW/CLASSIFICATION MODELS
In the sliding window method, a box or window is moved over a picture to select an area, and the object recognition model is used to classify each frame patch covered by the window. It is an exhaustive search over the whole picture for objects. Not only do we need to search all feasible locations in the picture, we also need to search at distinct scales, because models are usually trained at a particular size range. The outcome is tens of thousands (10^4) of picture patches being classified [62]. The sliding window method is computationally very costly because of the search with various aspect ratios, especially when the stride or step value is small (a minimal code sketch of this patch-scanning idea is given after subsection D below).

2) REGION PROPOSAL/OBJECT DETECTION MODELS
This technique takes an image as input and outputs bounding-box proposals corresponding to all areas in the picture most likely to contain an object. These region proposals may be noisy, may overlap, and may not cover the object flawlessly, but among them there is a proposal corresponding to the original target object, so the method proposes the region with the maximum score as the location of an object. Instead of considering all possible regions of the input frame as candidates, this method uses detection proposal techniques to select regions [63]. The region-based CNN (R-CNN) was the first detection model to introduce CNNs under this approach [64]. The selective search method of this approach produces 2000 boxes having maximum likelihood. Selective search is a widely used proposal generation method because it is very fast and has a good recall value; it relies on the hierarchical grouping of candidate areas based on the compatibility of color, texture, size, and shape [65].

The YOLO series is among the state-of-the-art object detection models. Unlike the other region proposal-based methods, it divides the input image into an SxS grid and then simultaneously predicts the probabilities and bounding boxes for an object whose center falls into a grid cell [55][59].

C. TRAINING MECHANISM
Fig. 2 describes the general methodology used in training and optimization. It starts with defining a problem, finding the required dataset, and applying pre-processing methods, and then finally training and evaluating on the dataset. If the evaluation is satisfactory, we save those weights as a classifier; if it is not, the backpropagation algorithm is applied along with the gradient descent algorithm [66]. In backpropagation, weights are optimized by subtracting the partial derivative of the cost function J(θ), multiplied by the learning rate α, from the old or previous weight value, i.e. w_new = w_old − α · ∂J(θ)/∂w. Gradient descent is the main weight optimization algorithm; it is used as the base of all optimizers used for modeling, and it helps the model converge and reach the minimum where we get the best and desired weight values.

FIGURE 2. Training and Optimization Flow Diagram

D. CONFUSION OBJECT INCLUSION
We have formulated the problem to reduce the number of false positives and negatives by adding relevant confusion objects. The weapon category includes all the handheld weapons such as the pistol, revolver, and shotgun, while the other-than-weapon category includes the objects that can most easily be confused with the pistol class, e.g. mobile phones, metal detectors, selfie sticks, purses, etc.
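As referenced in subsection B.1 above, the patch-scanning idea behind the sliding window approach can be sketched in a few lines. This is a minimal illustration only: the `classify_patch` callable is a hypothetical stand-in for any of the trained classifiers, and the window size, scales, stride, and threshold are illustrative values, not the settings used in this work.

```python
import cv2

def sliding_window_detect(image, classify_patch, win=128, stride=32,
                          scales=(1.0, 0.75, 0.5), threshold=0.5):
    """Exhaustively scan an image with a fixed-size window at several scales.

    classify_patch: callable taking a (win, win, 3) BGR patch and returning the
    probability that it contains a pistol (hypothetical classifier).
    Returns a list of (x, y, w, h, score) boxes in original-image coordinates.
    """
    detections = []
    for scale in scales:
        resized = cv2.resize(image, None, fx=scale, fy=scale)
        h, w = resized.shape[:2]
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                patch = resized[y:y + win, x:x + win]
                score = classify_patch(patch)
                if score >= threshold:
                    # Map the box back to the original image scale.
                    detections.append((int(x / scale), int(y / scale),
                                       int(win / scale), int(win / scale),
                                       float(score)))
    return detections
```

Even this small sketch makes the cost of the approach visible: the number of classified patches grows with every additional scale and with smaller strides, which is exactly why the classification models discussed in Section V could not reach real-time frame rates.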


By understanding the differences between classification and detection algorithms and between the sliding window and region proposal methods, let us now look at the algorithms used for both approaches.

E. CLASSIFIERS AND OBJECT DETECTORS
The classifiers used under the sliding window approach are:
• VGG16
• InceptionV3
• Inception-ResNetV2
The object detectors used for real-time detection are:
• SSD MobileNetV1
• YOLOv3
• Faster RCNN-Inception ResNetV2
• YOLOv4

Three databases, named database 1, database 2, and database 3, were created one by one after experimentation on different algorithms with diverse images, first for classification and then for object detection. Although the results obtained from the classification algorithms were not bad, the frames per second were very low for real-time implementation. Details for each database are discussed in the next section.

IV. DATASET CONSTRUCTION, ANNOTATION AND PRE-PROCESSING (D-CAP)
Data plays a key role in the development of any deep learning model, as the model learns and extracts features from it. For a real-time model to detect weapons with minimized processing time and high precision, the importance of accurate and relevant data increases further, as all other processes depend on it.

When we studied the statistics and went through almost 50-60 robbery videos available from online resources, we found that 95 percent of the videos have a revolver or pistol as the weapon, so we focused on binary classification with the pistol and revolver placed in a single class called pistol. Besides, to make the system more precise and to reduce the false positive and false negative values, we added objects that can be confused with a weapon, such as a wallet, cell phone, or metal detector, and put them in a separate class named not-pistol.

Let us now discuss the datasets used in our case: in a supervised learning setting, the network learns a representation of the input data from the given true answers, so the data must be clean, pre-processed, and properly annotated to make the network learn and predict better.

A. DATASET CONSTRUCTION AND SELECTION
The task of dataset construction and collection was very important and tough as well, because there was no benchmark dataset available for this. The dataset for real-time detection was collected and constructed in different phases; data were collected from the internet, extracted from YouTube CCTV videos (a short frame-extraction sketch is given below, after the class category lists), gathered through GitHub repositories, taken from the data released by the University of Granada research group, and taken from the Internet Movie Firearms Database, imfdb.org.

1) WEAPON DATASET CLASSES
The weapon dataset for real-time weapon detection is divided into the following two classes:
• Pistol
• Not-Pistol

2) WEAPON DATASET CATEGORIES FOR PISTOL CLASS
The dataset for this class includes weapon samples of the following categories:
• Pistol
• Revolver
• Other short handheld weapons

3) REASON FOR CHOOSING DATA CATEGORIES OF PISTOL CLASS
The reason we chose the pistol and revolver for the pistol class is our study and analysis after watching many robbery and shooting incident CCTV videos. We concluded that almost 95% of the weapons used in those cases were either pistols or revolvers. Fig. 3 shows some sample real-time images from the collected dataset of the pistol class.

FIGURE 3. Dataset samples for pistol class - Top left to bottom right [a-d]: (a) CCTV image (b) Medium-resolution image (c) Image with dark background and low resolution (d) Filtered image

4) WEAPON DATASET CATEGORIES FOR NOT-PISTOL CLASS
The dataset for this class includes objects that can most likely be confused with pistol-class objects. The following are some sample categories for the not-pistol class:
• Wallet
• Metal detector
• Cell phone
• Selfie stick
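As mentioned in Section IV-A, part of the dataset was built by extracting frames from collected CCTV and robbery video clips. The snippet below is a minimal sketch of that kind of frame sampling with OpenCV; the folder names and the sampling rate are illustrative assumptions, not the exact procedure used in this work.

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, every_n_seconds=1.0):
    """Sample frames from a collected CCTV/robbery clip for dataset building."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0           # fall back if FPS metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))  # frames to skip between samples
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(str(out / f"{Path(video_path).stem}_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example usage (paths are illustrative):
# extract_frames("cctv_clips/robbery_01.mp4", "dataset_raw/pistol", every_n_seconds=0.5)
```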


5) REASON FOR CHOOSING DATA CATEGORIES OF NOT-PISTOL CLASS
We introduced this relevant confusion object concept because these are the objects that can most easily be confused with our desired weapon object, so predicting them correctly reduces the number of false positives and false negatives and hence increases overall accuracy and precision. Some previous work did use objects other than weapons for the background or for the non-weapon class, but the samples were things like cars, airplanes, and cats, and there is very little chance of those being confused with our desired weapon, which is very small compared to them. Since our pistol-class objects are small, there are many chances for them to be confused with other objects that share some of their features. Fig. 4 shows some sample images from the collected dataset of the not-pistol class, which helps in reducing false positives and negatives.

FIGURE 4. Dataset samples for not-pistol class - Top left to bottom right [a-d]: (a) Cell phone (b) Metal detector (c) Selfie stick (d) Wallet

B. DATASETS FOR REAL-TIME DETECTION
This work deals with binary classification for a real-time scenario, so two classes were made: pistol and revolver images were included in the pistol class, and the not-pistol class includes confusion categories like mobile phone, metal detector, selfie stick, wallet, purse, etc. For the pistol and not-pistol classes, we made three datasets, which are explained below.

1) DATASET 1
This was the initial dataset used when starting this work. In this dataset, we had 1732 images in total, with 750 images in the pistol class and 950 in the not-pistol class. The dataset was divided by the train/test separation criteria described in Table 1. Images were collected from online sources and the IMFDB database, and the sliding window classification algorithms were trained and tested on it.

2) DATASET 2
This was the second dataset made for the real-time scenario. It contains 5254 images, and both classification and object detection algorithms were trained on it. Images with the desired object in hand were extracted for the real-time scenario from online sources, the IMFDB database, and the ImageNet website. The dataset was divided by the train/test separation criteria explained in Table 1.

3) DATASET 3
This was the third dataset constructed for the real-time scenario, and the object detection algorithms were trained and evaluated on it. This database was made by enhancing dataset 2 and overcoming the shortcomings and problems of the previous dataset. The need for this dataset arose because, although we got reasonable accuracy from the classification models, the frames per second were very few; and to detect objects in CCTV videos, similar kinds of training data must be included, so we made our own dataset to tackle this issue.

This dataset contains 8327 images divided into the pistol and not-pistol classes. In this case, the related confusion data concept was introduced to reduce false positives and false negatives in real-time detection. Dataset images were extracted from several online sources and from CCTV videos of the particular robbery scenario; we took our own images with a weapon in hand for diverse scenarios, performed data augmentation, and finally separated the data into train and test sets.

C. DATA DISTRIBUTION
Each of the aforementioned datasets was divided into the following categories, with the split size defining the percentage of the total data separated out for testing (a small illustrative split sketch is given after the pre-processing overview below).

TABLE 1. Data Distribution

Sr. No. | Category  | Total Data | Training Data | Test Data | Split Size
1.      | Dataset 1 | 1732       | 1251          | 260       | 15%
2.      | Dataset 2 | 5254       | 3797          | 784       | 15%
3.      | Dataset 3 | 8327       | 7328          | 999       | 12%

D. DATA PRE-PROCESSING AND ANNOTATION
Many things affect the performance of a machine learning (ML) model for a given job. First, the representation and quality of the data are essential: if there are many irrelevant and redundant data, or noisy and unreliable data, it is harder to discover a good representation during the training stage. Data preparation and filtering steps take significant processing time in ML problems [67]. The pre-processing stage involves data cleaning, standardization, processing, extraction and selection of features, etc. The final training dataset is the result of the pre-processing steps applied to the collected dataset.
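To make the train/test separation summarized in Table 1 concrete, the following is a minimal sketch of a random holdout split over per-class image folders. The directory layout and the use of scikit-learn's train_test_split are illustrative assumptions, not the exact tooling used for this work.

```python
import shutil
from pathlib import Path
from sklearn.model_selection import train_test_split

def holdout_split(dataset_dir, out_dir, test_size=0.15, seed=42):
    """Split a dataset laid out as <dataset_dir>/<class>/*.jpg into train/test folders."""
    for class_dir in Path(dataset_dir).iterdir():
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpg"))
        train_imgs, test_imgs = train_test_split(
            images, test_size=test_size, random_state=seed)
        for split, subset in (("train", train_imgs), ("test", test_imgs)):
            dest = Path(out_dir) / split / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for img in subset:
                shutil.copy(img, dest / img.name)

# Datasets 1 and 2 used roughly a 15% test split, dataset 3 about 12% (Table 1):
# holdout_split("dataset3", "dataset3_split", test_size=0.12)
```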


Pre-processing is necessary for better training of a model, so the first step is to bring the dataset to the same size or resolution. The next step is to apply mean normalization. The third step is drawing bounding boxes on these images, which is also called annotation, localization, or labeling: a bounding box is made on each image, and the x, y coordinates together with the width and height of the labeled object are stored in XML, CSV, or TXT format. The following are the four main steps of data pre-processing:
• Image scaling
• Data augmentation
• Image labeling
• Image filtering using OpenCV
  • RGB to grayscale
  • Equalized
  • CLAHE
Fig. 5, 6, and 7 show the results after applying the aforementioned pre-processing techniques.

FIGURE 5. Image Augmentation and Scaling

FIGURE 6. Image Annotation and Labelling

FIGURE 7. Image Filtration using OpenCV Filters - (a) Original Image (b) Equalized Filter Result (c) Gray Scale Filter Result (d) CLAHE Filter Result
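The OpenCV filtering step listed above can be sketched as follows. This is a minimal illustration of the grayscale, histogram equalization, and CLAHE conversions shown in Fig. 7; the CLAHE parameters are common defaults, not necessarily the exact values used in this work.

```python
import cv2

def apply_filters(image_path):
    """Return grayscale, equalized, and CLAHE versions of an image (cf. Fig. 7)."""
    bgr = cv2.imread(image_path)
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)       # RGB/BGR to grayscale
    equalized = cv2.equalizeHist(gray)                 # global histogram equalization
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    clahe_img = clahe.apply(gray)                      # contrast-limited adaptive equalization
    return gray, equalized, clahe_img

# Example: generate the filtered variants added to the real-time dataset
# (file names are illustrative):
# gray, eq, cl = apply_filters("dataset3/pistol/img_00001.jpg")
# cv2.imwrite("dataset3/pistol/img_00001_clahe.jpg", cl)
```

Adding such low-contrast-friendly variants to the training set is what the authors credit for the improved detections in dark and low-resolution CCTV frames.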

V. EXPERIMENTS, RESULTS AND ANALYSIS

We have detected weapons in real-time CCTV streams at low resolution and in dim light, at real-time frames per second. Most of the earlier work was about detecting objects in high-quality images and videos, and because those models were trained on high-quality datasets, they cannot then detect a low-resolution object in real time. The results are analyzed after training and testing the models on the datasets listed in Table 1.

As described in the methodology section, the results for the different approaches are evaluated. Our main problem statement is real-time detection, and since almost 95% of the weapons used in the studied robbery cases were pistols or revolvers, the results on the different datasets have been evaluated here for both the sliding window and the region proposal approach.

The performance of these models was analyzed by comparing them in terms of the standard metrics of F1-score and frames per second, along with mean average precision (mAP) for the best-performing model; these terms are calculated using equations (1), (2), and (3) below. The F1-score is the harmonic mean of precision and recall.


Precision = True Positives / (True Positives + False Positives)     (1)

Recall = True Positives / (True Positives + False Negatives)     (2)

F1-score = (2 × Precision × Recall) / (Precision + Recall)     (3)

A. DATASET-1 EXPERIMENTATION AND RESULTS
Dataset 1 contains 1732 images distributed between the two classes of pistol and not-pistol, with 750 and 982 images respectively. Experimentation on dataset 1 was performed using the sliding window/classification models VGG16, Inceptionv3, and InceptionResNetv2. After experimentation, we observed that the results obtained are not good, because most of the images in this dataset have a white or very similar background, which leads the model to start learning the background as its region of interest (ROI); in real-time footage the background varies, so a new dataset was required to train and test the models on images with diverse cases and backgrounds. Table 2 shows the results for the aforementioned models on this dataset, giving precision, recall, and F1-score.

TABLE 2. Sliding Window Results Comparison, Dataset 1

Sr. No. | Algorithms         | Precision | Recall | F1-score
1       | VGG16              | 71%       | 66.66% | 69.09%
2       | Inceptionv3        | 74.11%    | 96.18% | 83.71%
3       | Inception-ResNetV2 | 79.24%    | 89.54% | 84.07%

B. DATASET-2 EXPERIMENTATION AND RESULTS
This dataset contains the two classes of pistol and not-pistol, with 3000 and 2254 images respectively. Experimentation on dataset 2 was performed using the sliding window/classification models VGG16, Inceptionv3, and InceptionResNetv2. The results show that, although we get reasonable accuracy from the classification models on this dataset, the frames per second were very few, which was a big problem for building a real-time weapon detector. Among these classification models, InceptionResNetV2 performed best and achieved the best results. Table 3 shows the results under the sliding window method using dataset 2, and Fig. 8, 9, and 10 show the accuracy, loss, and confusion matrix, respectively, for the best classification model under the sliding window approach.

TABLE 3. Sliding Window Results Comparison, Dataset 2

Sr. No. | Algorithms         | Precision | Recall | F1-score
1       | VGG16              | 80.00%    | 83.47% | 81.69%
2       | Inceptionv3        | 84.36%    | 84.36% | 84.36%
3       | Inception-ResNetV2 | 85.52%    | 85.98% | 85.74%

FIGURE 8. Best sliding window model accuracy graph: InceptionResNetv2

FIGURE 9. Best sliding window model loss graph: InceptionResNetv2

FIGURE 10. Best sliding window model confusion matrix: InceptionResNetv2
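The classifiers above were trained by transfer learning from ImageNet weights, as described in Section III. The snippet below is a minimal sketch of that setup for the best-performing classifier, InceptionResNetV2, using Keras; the input size, frozen backbone, single sigmoid output, and optimizer settings are reasonable assumptions for a pistol/not-pistol classifier, not the authors' exact training configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionResNetV2

def build_pistol_classifier(input_shape=(299, 299, 3)):
    """Binary pistol / not-pistol classifier fine-tuned from ImageNet weights."""
    base = InceptionResNetV2(weights="imagenet", include_top=False,
                             input_shape=input_shape)
    base.trainable = False                      # transfer learning: freeze the backbone
    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # pistol vs. not-pistol
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                  loss="binary_crossentropy",
                  metrics=["accuracy",
                           tf.keras.metrics.Precision(),
                           tf.keras.metrics.Recall()])
    return model

# Typical usage with a directory-based dataset (paths are illustrative):
# train_ds = tf.keras.utils.image_dataset_from_directory(
#     "dataset2_split/train", image_size=(299, 299),
#     batch_size=32, label_mode="binary")
# model = build_pistol_classifier()
# model.fit(train_ds, epochs=10)
```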


C. DATASET-3 EXPERIMENTATION AND RESULTS
After experimentation on the previous two datasets did not give satisfactory results for the real-time case, a new dataset was made. Images collected from robbery videos, our own images of people holding a weapon in different scenarios, images with a dark background and low resolution, and images produced by applying the different OpenCV filters were added to make real-time detection possible. A total of 8327 images are used in this case. The following object detection models were trained and evaluated using this dataset:
• SSD MobileNetV1
• YOLOv3
• Faster RCNN-Inception ResNetV2
• YOLOv4

Each model has its pros and cons. SSD-MobileNet is good in terms of processing frames per second. FasterRCNN-InceptionResNetv2 has good precision and recall but not processing speed. The YOLO family is a series of models with a different approach to detection: unlike the other region proposal-based methods, it divides the input image into an SxS grid and then simultaneously predicts the probabilities and bounding boxes for an object whose center falls into a grid cell. We trained the latest state-of-the-art YOLOv3 and YOLOv4 on our own weapon dataset 3 for real-time detection, and the best results were obtained with YOLOv4 in terms of both processing speed and precision. Table 4 below shows the results of the aforementioned detection models on this dataset at the standard threshold of 50%.

TABLE 4. Region Proposal/Object Detection Models, Dataset 3 (IoU Threshold = 50%)

Sr. No. | Models                       | Precision | Recall | F1-score
1       | SSD-MobileNet-v1             | 62.79%    | 60.23% | 59%
2       | Yolov3                       | 85.86%    | 87.34% | 86%
3       | FasterRCNN-InceptionResNetV2 | 86.38%    | 89.25% | 87%
4       | Yolov4                       | 93%       | 88%    | 91%

YOLOv4 performs best among all the models of both the sliding window and the region proposal approach. The performance graph for YOLOv4 in terms of loss and mean average precision (mAP) on a validation dataset is shown in Fig. 11. We can see how smooth the model loss curve is and how precisely it converges to the best level, giving a very good loss score of 1.062 and a mean average precision of 91.73%. The mean average precision is the mean of the average precision values over all the relevant classes. The average precision (AP) values for the pistol and not-pistol classes used in the calculation of the mean average precision are given in Table 5.

TABLE 5. Best Performed Model (YOLOv4) mAP Calculation

IoU Threshold | AP Pistol (APP) | AP Not-Pistol (APNP) | Mean Average Precision (mAP)
0.35          | 96.91%          | 92.41%               | 94.66%
0.50          | 94.00%          | 89.47%               | 91.73%
0.75          | 60.17%          | 67.41%               | 63.79%

The mean average precision value is reported for the YOLOv4 model because it performs best in all scenarios and accurately detected the desired object even when the object has a very small presence in the frame and there are lots of other objects in the background as well.

FIGURE 11. Best Object Detection Model - YOLOv4: loss vs mAP
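As a quick sanity check of equations (1)-(3) and of the mAP definition just given, the small script below recomputes the F1-score of the best sliding window model from Table 3 and the mAP of YOLOv4 at the 0.50 IoU threshold from Table 5. It is purely illustrative arithmetic on the numbers already reported above.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as in equation (3)."""
    return 2 * precision * recall / (precision + recall)

def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class average precision values."""
    return sum(per_class_ap) / len(per_class_ap)

# InceptionResNetV2 on dataset 2 (Table 3): precision 85.52%, recall 85.98%
print(f1_score(85.52, 85.98))                      # ~85.75, matching the 85.74% in Table 3 (up to rounding)

# YOLOv4 at IoU 0.50 (Table 5): AP pistol 94.00%, AP not-pistol 89.47%
print(mean_average_precision([94.00, 89.47]))      # 91.735, matching the 91.73% mAP in Table 5
```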


▶ Yolov4 performs best amongst all models with a mean


average precision and F1-score of 91.73% and 91%
respectively with detection confidence of 99% in the
majority of cases.
▶ Comparison in terms of test accuracy vs F1-score for
the best-performed models of both classification and
detection approaches is shown in the Fig.13. Accuracy
and F1-score for VGG, Inceptionv3,
InceptionResNetv2, SSDMobileNet, FasterRCNN-
InceptionResNetv2, Yolov3 and yolov4 are 78.20%,
85.20%, 92.20%, 79%, 96%, 94%, 99% and 81.69%,
84.36%, 85.74%, 59%, 87%, 86% and 91%
respectively.
▶ Figs. 14-19 show the inference (detection) results of our model for the pistol and not-pistol classes on images, videos, and real-time CCTV streams.
▶ The hyperparameters used to train the best-performing detector, Yolov4, are listed in Table 6 (a corresponding configuration sketch is given at the end of the Analysis and Discussion subsection).

FIGURE 12. Object Detection models Performance/Comparison Graph

FIGURE 13. Best performed models comparison: Accuracy vs F1-score

TABLE 6. Yolov4 Hyperparameters

Sr. No.   Hyperparameter              Value
1.        Learning rate               0.001
2.        Optimizer                   SGD
3.        Decay                       0.0005
4.        Momentum                    0.949
5.        Activation function         Mish
6.        Batch size                  64
7.        Max batches / iterations    6000

D. ANALYSIS AND DISCUSSION

▶ Tables 2, 3, and 4 above compare the classification and object detection models using the standard evaluation metrics of precision, recall, and F1-score.
▶ Some classification models showed good results, but they were not suitable for a real-time scenario: they were slower and less accurate than the object detection models, which performed very well and achieved high precision and recall.
▶ The reason some classification models have a good F1-score is that they were trained and evaluated on the initial datasets we built when starting this work; after further experimentation, we found that these models are not suitable for real-time scenarios containing background objects.
▶ The object detection models performed well in the real-time scenario; a performance comparison of the detection models in terms of speed and F1-score is shown in Fig. 12. Inference results were obtained on an NVIDIA RTX 2080 Ti for each model.
▶ The standard metrics of mean average precision (mAP), recall, and F1-score were calculated, and all models were compared at a benchmark IoU threshold of 0.50 (50%), as illustrated in the sketch following this list.
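To make the evaluation criterion concrete, the following minimal sketch (illustrative only, not the exact evaluation code used in this work; box coordinates are hypothetical) shows how a predicted box is matched to a ground-truth box at the 0.50 IoU threshold and how precision, recall, and F1-score follow from the resulting counts.

# Minimal sketch: IoU matching at a 0.50 threshold and precision/recall/F1.
# Boxes are (x1, y1, x2, y2) in pixels; the example values are illustrative only.

def iou(box_a, box_b):
    # Intersection-over-Union of two axis-aligned boxes.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def evaluate(predictions, ground_truths, iou_thr=0.50):
    # A prediction is a true positive if it overlaps an unmatched ground-truth
    # box with IoU >= iou_thr; otherwise it counts as a false positive.
    matched = set()
    tp = fp = 0
    for pred in predictions:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(ground_truths):
            overlap = iou(pred, gt)
            if j not in matched and overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_iou >= iou_thr:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(ground_truths) - len(matched)  # missed weapons
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical boxes for one frame: high overlap, so it counts as a true positive.
preds = [(48, 60, 120, 140)]
gts = [(50, 60, 118, 138)]
print(evaluate(preds, gts))

In the full evaluation, predictions are additionally ranked by confidence and precision is averaged over recall levels per class to obtain the mAP values reported here.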


TABLE 7. Comparison with some existing studies

Study                                      Algorithm               Precision   mAP
Javed Iqbal et al., 2019                   OOAD                    N/A         85.40%
Roberto Olmos et al., 2017                 Faster RCNN             84.20%      N/A
Jose Luis Salazar González et al., 2020    Faster RCNN using FPN   88.23%      N/A
Ours                                       Yolov4                  93%         91.73%

It is very hard to compare with studies conducted previously on this subject because each study has its own dataset, models, and metrics used to evaluate performance. It should also be noted that to achieve real-time detection we also need a real-time dataset for training, because with only high-quality training images we cannot achieve results in real time. Each study also has different testing conditions, testing either just on images, on videos, or on high-quality images, whereas our approach from the start was to achieve real-time results. In some studies the performance metric used is accuracy, while others report precision or mean average precision (mAP); since mAP is mostly used as the standard, we have given comparison results in terms of mAP and precision at a standard IoU threshold of 50%, whichever was available.
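For context, the hyperparameters in Table 6 correspond to the [net] section of a Darknet-style Yolov4 configuration file. The snippet below is a minimal illustrative sketch: the input size, subdivisions, burn-in, and step schedule are assumptions, not values reported in this study.

# Sketch of how the Table 6 settings typically appear in a Darknet-style yolov4 .cfg file.
# Values not listed in Table 6 (width, height, subdivisions, burn_in, steps) are assumed.
[net]
batch=64
subdivisions=16
width=416
height=416
channels=3
momentum=0.949
decay=0.0005
learning_rate=0.001
burn_in=1000
max_batches=6000
policy=steps
steps=4800,5400
scales=.1,.1
# The Mish activation from Table 6 appears as "activation=mish" in the
# convolutional layers of the CSPDarknet53 backbone in the same file.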
E. DETECTION RESULTS - PISTOL CLASS WITHOUT BACKGROUND

FIGURE 14. Detection Results - Only the weapon in the whole frame, without any background, at different angles, brightness, sharpness, and quality

F. DETECTION RESULTS - PISTOL CLASS WITH BACKGROUND

FIGURE 15. Detection Results - Top left to bottom right (a-i): (a) Image with front and side view, (b) Image with vertical view, (c) Image with dark background and low resolution, fully tilted side view, (d) Low-brightness image, side view slightly tilted, (e) Image with the back view, (f) Full front view, (g) Small CCTV object, (h) Very small object with side view, (i) Image with full side view


G. DETECTION RESULTS - PISTOL CLASS IN VIDEOS

FIGURE 16. Detection Results - Top left to bottom right (a-f), video 1 inference (a-c), video 2 inference (d-f): (a) Small object, side view tilted, (b) Small object with side view, (c) Small object, front view, (d) Side view, (e) Top view, double object, (f) Small object with front and side view

H. DETECTION RESULTS - PISTOL CLASS IN REALTIME CCTV STREAMS:

FIGURE 17. Detection Results - Top left to bottom right (a-i), CCTV stream 1 (a-c), CCTV stream 2 (d-f), CCTV stream 3 (g-i): (a) Small object in low resolution, (b) Tilted object, (c) Low-resolution vertical object, (d) Daylight side view, slightly tilted, (e) Daylight side view, (f) Daylight side view, flipped, (g) Small object, medium resolution, (h) Vertical view, (i) Side view
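For the real-time CCTV results above, a minimal sketch of how a trained Yolov4 detector of this kind could be run on a live CCTV/RTSP stream with OpenCV's DNN module is shown below. This is an illustration only, not the exact pipeline used in this work; it assumes OpenCV >= 4.4 (for Mish support), and the configuration/weight file names and stream URL are hypothetical.

# Minimal sketch: running a trained Yolov4 weapon detector on a CCTV/RTSP stream.
import cv2

# Hypothetical file names; a CUDA-enabled OpenCV build is assumed for the GPU backend.
net = cv2.dnn.readNetFromDarknet("yolov4-pistol.cfg", "yolov4-pistol.weights")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

cap = cv2.VideoCapture("rtsp://user:pass@camera-ip:554/stream")  # hypothetical CCTV stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Detect objects in the frame; thresholds are illustrative.
    class_ids, scores, boxes = model.detect(frame, confThreshold=0.5, nmsThreshold=0.4)
    for class_id, score, box in zip(class_ids, scores, boxes):
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
        cv2.putText(frame, f"pistol {float(score):.2f}", (x, y - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 0, 255), 2)
    cv2.imshow("weapon detection", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()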


I. DETECTION RESULTS – NOT PISTOL CLASS

FIGURE 18. Detection Results - Top left to bottom right (a-d): (a) Cell phone, (b) Metal detector, (c) Wallet, (d) Selfie stick

J. MISDETECTIONS

FIGURE 19. Misdetections: False positives and Negatives

VI. CONCLUSION AND FUTURE WORK

For both monitoring and control purposes, this work has presented a novel automatic weapon detection system that operates in real time. This work will help improve the security and law-and-order situation for the betterment and safety of humanity, especially for countries that have suffered greatly from such violent activities. It will also have a positive impact on the economy by attracting investors and tourists, as security and safety are their primary needs. We have focused on detecting weapons in live CCTV streams while at the same time reducing false negatives and false positives. To achieve high precision and recall, we constructed a new training database for the real-time scenario, then trained and evaluated it on the latest state-of-the-art deep learning models using two approaches, i.e. sliding window/classification and region proposal/object detection. Different algorithms were investigated to obtain good precision and recall.
Through a series of experiments, we concluded that object detection algorithms with ROI (Region of Interest) perform better than algorithms without ROI. We tested many models, but among all of them the state-of-the-art Yolov4, trained on our new database, gave very few false positives and false negatives and hence achieved the most successful results. It gave 91.73% mean average precision (mAP) and an F1-score of 91%, with an almost 99% confidence score on all types of images and videos. We can say that it satisfactorily qualifies as an automatic real-time weapon detector.


Looking at the results, we obtained the highest mean average precision (mAP) and F1-score compared to the research done before for real-time scenarios.
Future work includes reducing the false positives and negatives even further, as there is still a need for improvement. We may also try to increase the number of classes or objects in the future, but the priority is to further improve precision and recall.

REFERENCES
[1] "Christchurch mosque shootings", En.wikipedia.org, 2019. [Online]. Available: https://en.wikipedia.org/wiki/Christchurch_mosque_shootings. [Accessed: 10-Jul-2019].
[2] Unodc.org. (2019). Global study on homicide. [Online]. Available: https://www.unodc.org/unodc/en/data-and-analysis/global-study-on-homicide.html. [Accessed: 10-Jul-2019].
[3] Deisman, Wade. "CCTV: Literature review and bibliography." In Research and Evaluation Branch, Community, Contract and Aboriginal Policing Services Directorate. Ottawa: Royal Canadian Mounted Police, 2003.
[4] Ratcliffe, Jerry. Video surveillance of public places. Washington, DC: US Department of Justice, Office of Community Oriented Policing Services, 2006.
[5] Grega, Michał, Andrzej Matiolański, Piotr Guzik, and Mikołaj Leszczuk. "Automated detection of firearms and knives in a CCTV image." Sensors 16, no. 1 (2016): 47.
[6] "China's CCTV surveillance network took just 7 minutes to capture BBC reporter – TechCrunch", TechCrunch, 2019. [Online]. Available: https://techcrunch.com/2017/12/13/china-cctv-bbc-reporter/. [Accessed: 15-Jul-2019].
[7] Cohen, Neil, Jay Gattuso, and Ken MacLennan-Brown. CCTV operational requirements manual 2009. Home Office Scientific Development Branch, 2009.
[8] Flitton, Greg, Toby P. Breckon, and Najla Megherbi. "A comparison of 3D interest point descriptors with application to airport baggage object detection in complex CT imagery." Pattern Recognition 46, no. 9 (2013): 2420-2436.
[9] Gesick, Richard, Caner Saritac, and Chih-Cheng Hung. "Automatic image analysis process for the detection of concealed weapons." In Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research: Cyber Security and Information Intelligence Challenges and Strategies, p. 20. ACM, 2009.
[10] Tiwari, Rohit Kumar, and Gyanendra K. Verma. "A computer vision based framework for visual gun detection using harris interest point detector." Procedia Computer Science 54 (2015): 703-712.
[11] Tiwari, Rohit Kumar, and Gyanendra K. Verma. "A computer vision based framework for visual gun detection using SURF." In 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), pp. 1-5. 2015.
[12] Xiao, Zelong, Xuan Lu, Jiangjiang Yan, Li Wu, and Luyao Ren. "Automatic detection of concealed pistols using passive millimeter wave imaging." In 2015 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1-4. IEEE, 2015.
[13] Flitton, Greg, Toby P. Breckon, and Najla Megherbi. "A comparison of 3D interest point descriptors with application to airport baggage object detection in complex CT imagery." Pattern Recognition 46, no. 9 (2013): 2420-2436.
[14] Sheen, David M., Douglas L. McMakin, and Thomas E. Hall. "Three-dimensional millimeter-wave imaging for concealed weapon detection." IEEE Transactions on Microwave Theory and Techniques 49, no. 9 (2001): 1581-1592.
[15] Xue, Zhiyun, Rick S. Blum, and Y. Li. "Fusion of visual and IR images for concealed weapon detection." In Proceedings of the Fifth International Conference on Information Fusion. FUSION 2002. (IEEE Cat. No. 02EX5997), vol. 2, pp. 1198-1205. IEEE, 2002.
[16] Blum, Rick, Zhiyun Xue, Zheng Liu, and David S. Forsyth. "Multisensor concealed weapon detection by using a multiresolution mosaic approach." In IEEE 60th Vehicular Technology Conference, 2004. VTC2004-Fall, vol. 7, pp. 4597-4601. IEEE, 2004.
[17] Upadhyay, Ekta M., and N. K. Rana. "Exposure fusion for concealed weapon detection." In 2014 2nd International Conference on Devices, Circuits and Systems (ICDCS), pp. 1-6. IEEE, 2014.


[18] Gesick, Richard, Caner Saritac, and Chih-Cheng Hung. "Automatic image analysis process for the detection of concealed weapons." In Proceedings of the 5th Annual Workshop on Cyber Security and Information Intelligence Research: Cyber Security and Information Intelligence Challenges and Strategies, p. 20. ACM, 2009.
[19] Maher, Robert C. "Modeling and signal processing of acoustic gunshot recordings." In 2006 IEEE 12th Digital Signal Processing Workshop & 4th IEEE Signal Processing Education Workshop, pp. 257-261. IEEE, 2006.
[20] Chacon-Rodriguez, A., P. Julian, L. Castro, P. Alvarado, and N. Hernandez. "Evaluation of gunshot detection algorithms." IEEE Transactions on Circuits and Systems I: Regular Papers 58 (2011): 363-373.
[21] 2019. [Online]. Available: https://www.business2community.com/tech-gadgets/from-edison-to-internet-a-history-of-video-surveillance-0578308. [Accessed: 13-Jun-2019].
[22] "Infographic: History of Video Surveillance - IFSEC Global | Security and Fire News and Resources", IFSEC Global, 2019. [Online]. Available: https://www.ifsecglobal.com/video-surveillance/infographic-history-of-video-surveillance/. [Accessed: 15-Sep-2019].
[23] Hu, Weiming, Tieniu Tan, Liang Wang, and Steve Maybank. "A survey on visual surveillance of object motion and behaviors." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 34, no. 3 (2004): 334-352.
[24] Sankaranarayanan, Aswin C., Ashok Veeraraghavan, and Rama Chellappa. "Object detection, tracking and recognition for multiple smart cameras." Proceedings of the IEEE 96, no. 10 (2008): 1606-1624.
[25] Zhang, Shuai, Chong Wang, Shing-Chow Chan, Xiguang Wei, and Check-Hei Ho. "New object detection, tracking, and recognition approaches for video surveillance over camera network." IEEE Sensors Journal 15, no. 5 (2014): 2679-2691.
[26] Nascimento, Jacinto C., and Jorge S. Marques. "Performance evaluation of object detection algorithms for video surveillance." IEEE Transactions on Multimedia 8, no. 4 (2006): 761-774.
[27] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005.
[28] Anagnostopoulos, C., I. Anagnostopoulos, G. Tsekouras, G. Kouzas, V. Loumos, and E. Kayafas. "Using sliding concentric windows for license plate segmentation and processing." In IEEE Workshop on Signal Processing Systems Design and Implementation, 2005, pp. 337-342. IEEE, 2005.
[29] Grega, Michał, Seweryn Łach, and Radosław Sieradzki. "Automated recognition of firearms in surveillance video." In 2013 IEEE International Multi-Disciplinary Conference on Cognitive Methods in Situation Awareness and Decision Support (CogSIMA), pp. 45-50. IEEE, 2013.
[30] Darker, Iain, Alastair Gale, Leila Ward, and Anastassia Blechko. "Can CCTV reliably detect gun crime?" In 2007 41st Annual IEEE International Carnahan Conference on Security Technology, pp. 264-271. IEEE, 2007.
[31] Darker, Iain T., Alastair G. Gale, and Anastassia Blechko. "CCTV as an automated sensor for firearms detection: Human-derived performance as a precursor to automatic recognition." In Unmanned/Unattended Sensors and Sensor Networks V, vol. 7112, p. 71120V. International Society for Optics and Photonics, 2008.
[32] Tiwari, Rohit Kumar, and Gyanendra K. Verma. "A computer vision based framework for visual gun detection using harris interest point detector." Procedia Computer Science 54 (2015): 703-712.
[33] Tiwari, Rohit Kumar, and Gyanendra K. Verma. "A computer vision based framework for visual gun detection using SURF." In 2015 International Conference on Electrical, Electronics, Signals, Communication and Optimization (EESCO), pp. 1-5. 2015.
[34] Darker, Iain T., Paul Kuo, Ming Yuan Yang, Anastassia Blechko, Christos Grecos, Dimitrios Makris, Jean-Christophe Nebel, and Alastair G. Gale. "Automation of the CCTV-mediated detection of individuals illegally carrying firearms: combining psychological and technological approaches." In Visual Information Processing XVIII, vol. 7341, p. 73410P. International Society for Optics and Photonics, 2009.


[35] Al-Rfou, Rami, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien et al. "Theano: A Python framework for fast computation of mathematical expressions." arXiv preprint arXiv:1605.02688 (2016).
[36] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005.
[37] Chollet, F. (2019). fchollet - Overview. [Online]. GitHub. Available: https://github.com/fchollet. [Accessed: 10-Apr-2019].
[38] He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778. 2016.
[39] Diva-portal.org, 2019. [Online]. Available: http://www.diva-portal.org/smash/get/diva2:1054902/FULLTEXT02. [Accessed: 5-May-2019].
[40] Nakib, Mohammad, Rozin Tanvir Khan, Md Sakibul Hasan, and Jia Uddin. "Crime Scene Prediction by Detecting Threatening Objects Using Convolutional Neural Network." In 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), pp. 1-4. IEEE, 2018.
[41] Grega, Michał, Andrzej Matiolański, Piotr Guzik, and Mikołaj Leszczuk. "Automated detection of firearms and knives in a CCTV image." Sensors 16, no. 1 (2016): 47.
[42] Verma, Gyanendra K., and Anamika Dhillon. "A Handheld Gun Detection using Faster R-CNN Deep Learning." In Proceedings of the 7th International Conference on Computer and Communication Technology, pp. 84-88. ACM, 2017.
[43] Olmos, Roberto, Siham Tabik, and Francisco Herrera. "Automatic handgun detection alarm in videos using deep learning." Neurocomputing 275 (2018): 66-72.
[44] Iqbal, Javed, Muhammad Akhtar Munir, Arif Mahmood, Afsheen Rafaqat Ali, and Mohsen Ali. "Orientation Aware Object Detection with Application to Firearms." arXiv preprint arXiv:1904.10032 (2019).
[45] González, Jose L. Salazar, Carlos Zaccaro, Juan A. Álvarez-García, Luis M. Soria Morillo, and Fernando Sancho Caparrini. "Real-time gun detection in CCTV: An open problem." Neural Networks 132 (2020): 297-308.
[46] Convolutional Neural Networks, 2017. [Online]. Available: http://cs231n.github.io/convolutional-networks/. [Accessed: 15-Aug-2018].
[47] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[48] "VGG16 - Convolutional Network for Classification and Detection", Neurohive.io, 2019. [Online]. Available: https://neurohive.io/en/popular-networks/vgg16/. [Accessed: 19-Dec-2018].
[49] Szegedy, Christian, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. "Rethinking the inception architecture for computer vision." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826. 2016.
[50] Szegedy, Christian, Sergey Ioffe, Vincent Vanhoucke, and Alexander A. Alemi. "Inception-v4, Inception-ResNet and the impact of residual connections on learning." In Thirty-First AAAI Conference on Artificial Intelligence. 2017.
[51] "A Simple Guide to the Versions of the Inception Network", Medium, 2019. [Online]. Available: https://towardsdatascience.com/a-simple-guide-to-the-versions-of-the-inception-network-7fc52b863202. [Accessed: 27-Jul-2019].
[52] Liu, Wei, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. "SSD: Single shot multibox detector." In European Conference on Computer Vision, pp. 21-37. Springer, Cham, 2016.
[53] "Understanding SSD MultiBox — Real-Time Object Detection In Deep Learning", Medium, 2019. [Online]. Available: https://towardsdatascience.com/understanding-ssd-multibox-real-time-object-detection-in-deep-learning-495ef744fab. [Accessed: 19-Aug-2019].
[54] Howard, Andrew G., Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. "MobileNets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).


[55] Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).
[56] Ren, Shaoqing, Kaiming He, Ross Girshick, and Jian Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." In Advances in Neural Information Processing Systems, pp. 91-99. 2015.
[57] 2019. [Online]. Available: https://medium.com/@smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f. [Accessed: 25-Aug-2019].
[58] "Faster RCNN Object detection", Medium, 2019. [Online]. Available: https://towardsdatascience.com/faster-rcnn-object-detection-f865e5ed7fc4. [Accessed: 27-Aug-2019].
[59] Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).
[60] GeeksforGeeks. 2020. Object Detection vs Object Recognition vs Image Segmentation. [Online]. Available: https://www.geeksforgeeks.org/object-detection-vs-object-recognition-vs-image-segmentation/. [Accessed: 28-Dec-2020].
[61] "ImageNet", Image-net.org, 2019. [Online]. Available: http://www.image-net.org/. [Accessed: 05-Jun-2019].
[62] Laguna, Javier Ortiz, Angel García Olaya, and Daniel Borrajo. "A dynamic sliding window approach for activity recognition." In International Conference on User Modeling, Adaptation, and Personalization, pp. 219-230. Springer, Berlin, Heidelberg, 2011.
[63] Hosang, Jan, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. "What makes for effective detection proposals?" IEEE Transactions on Pattern Analysis and Machine Intelligence 38, no. 4 (2015): 814-830.
[64] Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587. 2014.
[65] A. Consulting, "Selective Search for Object Detection (C++ / Python) | Learn OpenCV", Learnopencv.com, 2019. [Online]. Available: https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/. [Accessed: 25-May-2019].
[66] LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.
[67] Kotsiantis, S. B., Dimitris Kanellopoulos, and P. E. Pintelas. "Data preprocessing for supervised learning." International Journal of Computer Science 1, no. 2 (2006): 111-117.

MUHAMMAD TAHIR BHATTI received the B.Sc. degree in electrical and electronics engineering from Air University, Islamabad, Pakistan, in 2013, a 1.5-year certified artificial intelligence engineer certification from the Saylani Mass Training Program, Faisalabad, Punjab, Pakistan, in 2018, and the M.S. degree in electrical engineering, securing a bronze medal, with specialization and research in the field of artificial intelligence and machine learning, from the National University of Computer and Emerging Sciences, Faisalabad, Punjab, Pakistan, in 2019.
He is currently working as a Research Assistant in the field of artificial intelligence at the National University of Computer and Emerging Sciences, Faisalabad, Pakistan. He has previously worked as an Electrical and Electronics Engineer at National Silk and Rayan Pvt Ltd.

Dr. MUHAMMAD GUFRAN KHAN is an Associate Professor and Head of the Department of Electrical Engineering at FAST NUCES, Chiniot-Faisalabad Campus. He received the B.Sc. degree in electrical engineering from the University of Engineering and Technology, Lahore, Punjab, Pakistan, in 2003, the M.Sc. degree in electrical engineering with specialization in signal processing from the Blekinge Institute of Technology, Sweden, in 2005, and the Ph.D. degree in electrical engineering with specialization in wireless communication from the Blekinge Institute of Technology, Sweden, in 2011.
He is also the Chairperson of the IEEE Faisalabad Subsection and a Senior Member of IEEE. Before joining FAST, he worked at Volvo Car Corporation, Sweden, as an Analysis Engineer. He has conducted research in the areas of signal processing, computer vision, and machine learning. He has also worked on different funded research projects in the area of


embedded systems and robotics. Currently, he is actively involved in the applications of AI and IoT technology to solve real-world problems.

MASOOD ASLAM received the B.S. degree in


electrical engineering from The University of
Faisalabad, Faisalabad, Pakistan, in 2013, and the
M.S. degree in electrical engineering from
FASTNUCES, Islamabad, Pakistan, in 2018.
He is currently working as Research Associate
with the Visual Computing Technology (VC-
Tech) Lab, Islamabad. Before joining VC-Tech,
he worked as a Research Assistant with FAST-
NUCES. His research interests include computer
vision and image processing.

MUHAMMAD JUNAID FIAZ received the B.S.


degree in computer science from Government
College University Faisalabad, Punjab, Pakistan
in 2019.
He is currently working as a Research
Assistant in the field of artificial intelligence at the
National University of Computer and Emerging
Sciences, Faisalabad, Pakistan. He has worked
previously as an Android developer.
