Automatic Handgun Detection Alarm in Videos Using Deep Learning
Abstract
Current surveillance and control systems still require human supervision and
intervention. This work presents a novel automatic handgun detection system
in videos, appropriate for both surveillance and control purposes. We
reformulate this detection problem into the problem
of minimizing false positives and solve it by building the key training
dataset guided by the results of a deep Convolutional Neural Network
(CNN) classifier, then assessing the best classification model under two
approaches: the sliding window approach and the region proposal approach.
The most promising results are obtained by a Faster R-CNN based model
trained on our new database. The best detector shows high potential even
in low quality YouTube videos and provides satisfactory results as an auto-
matic alarm system. In 27 of 30 scenes, it successfully activates the alarm
after five successive true positives in less than 0.2 seconds.
We also define a new metric, Alarm Activation Time per Interval (AATpI), to
assess the performance of a detection model as an automatic detection
system in videos.
Index terms— Classification, Detection, Deep learning, Convolutional Neu-
ral Networks (CNNs), Faster R-CNN, VGG-16, Alarm Activation Time per Interval (AATpI)
1 Introduction
The crime rates caused by guns are very concerning in many places in the world,
especially in countries where the possession of guns is legal or was legal for a
period of time. The latest statistics reported by the United Nations Office on
Drugs and Crime (UNODC) reveal that the number of crimes involving guns
per 100,000 inhabitants is very high in many countries, e.g., 21.5 in Mexico,
4.7 in the United States and 1.6 in Belgium [19]. In addition, several psychological
studies have demonstrated that the simple fact of having access to a gun
drastically increases the probability of violent behavior.
One way to reduce this kind of violence is prevention via early detection
so that the security agents or policemen can act. In particular, one innovative
solution to this problem is to equip surveillance or control cameras with an
accurate automatic handgun detection alert system. Related studies address
the detection of guns but only on X-ray or millimetric wave images and only
using traditional machine learning methods [6, 7, 28, 25, 26].
In the last five years, deep learning in general and Convolutional Neural
Networks (CNNs) in particular have achieved superior results to all the classical
machine learning methods in image classification, detection and segmentation
in several applications [18, 13, 22, 8, 29, 23]. Instead of manually selecting
features, deep learning CNNs automatically discover increasingly higher level
features from data [17, 11]. We aim to develop an accurate gun detector for
videos using CNNs.
A proper training of deep CNNs, which contain millions of parameters, re-
quires very large datasets, in the order of millions of samples, as well as High
Performance Computing (HPC) resources, e.g., multi-processor systems accel-
erated with GPUs. Transfer learning through fine-tuning is becoming a widely
accepted alternative to overcome these constraints. It consists of re-utilizing the
knowledge learnt from one problem to another related one [20]. Applying trans-
fer learning with deep CNNs depends on the similarities between the original
and new problem and also on the size of the new training set.
In general, fine-tuning the entire network, i.e., updating all the weights, is
only used when the new dataset is large enough; otherwise the model could
suffer from overfitting, especially in the first layers of the network. Since these layers
extract low-level features, e.g., edges and color, they do not change significantly
and can be utilized for several visual recognition tasks. The last layers of the
CNN are gradually adjusted to the particularities of the problem and extract
high level features, which are not readable by the human eye. In this work we
used a VGG-16 based classification model pre-trained on the ImageNet dataset
(around 1.28 million images over 1,000 generic object classes) [24] and fine-tuned
on our own dataset of 3000 images of guns taken in a variety of contexts.
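The fine-tuning scheme described above (early layers frozen, last block and classifier head retrained) can be sketched in Keras. The layer split, head size and optimizer settings below are illustrative assumptions, not the paper's exact configuration; `weights=None` avoids the checkpoint download here, whereas in practice `weights="imagenet"` would be used:

```python
# Hedged sketch (not the paper's exact setup): fine-tune a VGG-16 for
# pistol vs. background classification, freezing the early convolutional
# blocks whose low-level features (edges, colors) transfer across tasks.
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

# weights=None builds the architecture without downloading ImageNet
# weights; use weights="imagenet" for actual transfer learning.
base = VGG16(weights=None, include_top=False, input_shape=(224, 224, 3))

# Keep everything up to block4 frozen; adjust only the last block.
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # illustrative head size
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # pistol vs. background
])
model.compile(optimizer=optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="binary_crossentropy")
```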
Using CNNs to automatically detect pistols in videos faces several challenges:
• Pistols can be handled with one or two hands in different ways and thus
a large part of the pistol can be occluded.
• The process of designing a new dataset is manual and time consuming.
• The alarm must be triggered in real time and only when the system is
confident about the existence of a pistol in the scene.
• Automatic detection alarm systems require an accurate location of the
pistol in the monitored scene.
As far as we know, this work presents the first automatic gun detection alarm
system that uses deep CNN-based detection models. We focus on the pistol,
the type of handgun most used in crimes [30], which includes revolvers,
automatic and semi-automatic pistols, six-gun shooters, horse pistols and
derringers. To guide
the design of the new dataset and to find the best detector we consider the
following steps:
• Designing a new labeled database that enables the learning model to
achieve high detection quality. Our experience in building the new dataset
and detector can serve as a guide for developing solutions to other
detection problems.
• Finding the most appropriate CNN-based detector that achieves real-time
pistol detection in videos.
• Introducing a new metric, AATpI, to assess the suitability of the proposed
detector as an automatic detection alarm system.
From the experiments we found that the most promising results are ob-
tained by a Faster R-CNN based model trained on our new database. The best
performing model shows high potential even in low quality YouTube videos
and provides satisfactory results as an automatic alarm system. In 27 of the
30 scenes, it successfully activates the alarm, after five successive true
positives, within a time interval smaller than 0.2 seconds.
This paper is organized as follows. Section 2 gives a brief analysis of the most
related papers. Section 3 provides an overview of the CNN model used in this
work. Section 4 describes the procedure we have used to find the best detector
that reaches good precision and a low false-positive rate. Section 5 analyzes
the performance of the built detector using seven videos and introduces a new
metric to assess the performance of the detector as automatic detection system.
Finally, the conclusions are summarized in Section 6.
2 Related works
The problem of handgun detection in videos using deep learning is related in
part to two broad research areas. The first addresses gun detection using clas-
sical methods and the second focuses on improving the performance of object
detection using deep CNNs.
All the above cited systems are slow, cannot be used for constant monitoring,
require the supervision of an operator and cannot be used in open areas.
This work addresses a new solution to the problem of a real-time pistol
detection alarm system using a deep learning CNN-based detector. We develop,
evaluate and compare a CNN-based classifier on different new datasets within
the sliding window and region proposals detection methods.
performance of the classifier in combination with two detection methods, the
sliding window (Section 4.1) and the region proposals (Section 4.2).
Due to the differences between these two approaches, different optimization
models, based on databases with different characteristics, sizes and classes,
are applied in each case. In the sliding window approach, we address reducing the
number of false positives by increasing the number of classes and thus building
four databases, Database-1, -2, -3 and -4. The characteristics of all the databases
built in this work are summarized in Table 1. In the region proposals approach,
the detector is directly trained on region proposals of a new database, Database-
5, with richer contexts.
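A minimal sketch of the sliding window approach mentioned above: a fixed-size window is scanned over each frame and the classifier is applied to every resulting patch. The window size and stride below are illustrative assumptions, not the values used in the paper:

```python
# Hedged sketch of a sliding window scanner; win and stride are
# illustrative. In a detector, the classifier would be run on every
# yielded patch to decide whether it contains a pistol.
import numpy as np

def sliding_windows(image, win=64, stride=32):
    """Yield (x, y, patch) crops scanned over the image left to right,
    top to bottom."""
    h, w = image.shape[:2]
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            yield x, y, image[y:y + win, x:x + win]

# Toy usage: a 128x128 frame yields a 3x3 grid of 64x64 windows.
frame = np.zeros((128, 128, 3), dtype=np.uint8)
patches = list(sliding_windows(frame))
```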
Table 1: Characteristics of the new training and test sets. The training sets
are labeled Database-1, -2, -3, -4 and -5.
Figure 1: Examples from Database-2; the top three images represent the pistol
class and the bottom three images represent the background class.
different images of hands holding different objects other than pistols, e.g., cell
phone and pen, as illustrated in Figure 1. On the test set, the binomial
classification model trained on Database-2 obtained 11 false positives, a high
number of false negatives (206), a precision of 89.91%, a recall of 32.24% and
an F1 measure of 47.46%, which are still below our expectations. By analyzing
the false positives we found that most of them consider the white background
as part of the pistol, which is due to the presence of a white background in
most training examples.
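These figures follow the standard precision, recall and F1 definitions. As a sanity check, they can be reproduced from the raw counts; note that the true-positive count of 98 used below is inferred from the reported precision and recall, not stated in the text:

```python
# Sanity check of the reported figures from the raw counts. The
# true-positive count (98) is an inference, not given in the text.
def prf1(tp, fp, fn):
    """Precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = prf1(tp=98, fp=11, fn=206)
# p -> 89.91%, r -> 32.24%, f1 -> 47.46% (after rounding)
```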
Table 3: The results obtained by the classification model under the region
proposals approach on the test set.
The design of a new training dataset for this approach is also manual and
cannot re-use the databases from the previous approach. We have built
Database-5 using 3000 images that contain pistols in different contexts and
scenarios, downloaded from diverse websites. Figure 2 provides three examples
from Database-5. We considered a two-class model and labeled the pistols by
providing their localization, i.e., bounding box, in each individual training
image. The rest of the objects in the image are considered background.
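The text does not spell out how a predicted bounding box is matched to a labeled one, but the usual criterion for this is intersection over union (IoU); a minimal sketch, not claimed to be the paper's exact matching rule:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as
    (x1, y1, x2, y2); 1.0 for identical boxes, 0.0 for disjoint ones."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # overlap area
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A detection is then typically counted as a true positive when its IoU with a ground-truth box exceeds a threshold such as 0.5.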
In general, as can be seen from Table 3, Faster R-CNN trained on Database-5
obtains the highest performance of all the previously analyzed models. It
provides the highest numbers of true positives and true negatives and,
consequently, the highest recall (100%) and F1 score (91.43%). However, it
produces more false positives, 57, and consequently a lower precision, 84.21%.
In Section 5, we address this issue in the context of automatic alarm systems
by activating the alarm only when at least five successive positive detections
occur in five successive frames. Next we analyze the speed of this detection model.
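The activation rule just described, firing only after positive detections in k successive frames, can be sketched as follows; this is a simplified reading, with k = 5 as in the text:

```python
def alarm_frame(detections, k=5):
    """Return the index of the first frame at which k successive
    positive detections have occurred, or None if the alarm never
    fires. Isolated (spurious) detections are thereby suppressed."""
    run = 0
    for i, detected in enumerate(detections):
        run = run + 1 if detected else 0  # reset the run on a miss
        if run >= k:
            return i
    return None

# A lone spurious detection never triggers the alarm:
assert alarm_frame([0, 1, 0, 0, 1, 0, 0]) is None
```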
Table 4: The total number of true positives #TP, total number of ground-truth
positives #GTP, and total number of false positives #FP in the considered
seven videos, labeled video 1 to 7.
85.21%, a reasonable false-positive count, and can be used for real-time
detection. This makes it a good candidate for detecting pistols in a sequence
of frames, as shown in Section 5.
Figure 3: An example of an accurate detection of four pistols.
Figure 4: An illustrative example of false negatives, i.e., the two pistols in
the background.
In this section we used k = 5.
For the experiments, we selected 30 scenes from the previously used videos
with the following requirements: each scene is made up of at least 5 frames,
filmed in a fixed scenario, i.e., in the same place, and the pistols are
clearly visible to a human viewer. These scenes can be found in a public
repository on GitHub 2 .
The model successfully detects the pistol in 27 scenes with an average time
interval AATpI = 0.2 seconds, which is good enough for an alarm system. The
detector fails to detect pistols in only three scenes. This is due to the same
reasons highlighted previously: low contrast and luminosity in the frames,
fast motion of the pistol, or the pistol not being in the foreground.
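One plausible reading of the AATpI computation, assuming the activation time of each scene is the frame index at which the alarm fires divided by the frame rate (the function and argument names below are illustrative, not from the paper):

```python
# Hedged sketch: AATpI as the mean activation time over the scenes in
# which the alarm actually fired. Scenes where it never fired (None)
# are excluded, matching the 27-of-30 count reported above.
def aatpi(activation_frames, fps):
    """Average alarm activation time (seconds) over activated scenes,
    or None if the alarm never fired in any scene."""
    times = [f / fps for f in activation_frames if f is not None]
    return sum(times) / len(times) if times else None
```

For example, `aatpi([5, 5, 6, None], fps=30)` averages over the three activated scenes and ignores the scene where the alarm never fired.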
In summary, although we have used low quality videos for the evaluation,
the proposed model has shown good performance and proved appropriate for
automatic pistol detection alarm systems.
2 https://github.com/SihamTabik/Pistol-Detection-in-Videos.git
6 Conclusions and future work
This work presented a novel automatic pistol detection system in videos
appropriate for both surveillance and control purposes. We reformulated this
detection problem into the problem of minimizing false positives and solved
it by building the key training dataset guided by the results of a VGG-16
based classifier, then assessing the best classification model under two
approaches: the sliding window approach and the region proposal approach.
The most promising results have been obtained with a Faster R-CNN based
model, trained on our new database, providing zero false negatives, 100%
recall, a high number of true negatives and a good precision of 84.21%. The
best detector has shown high potential even in low quality YouTube videos
and provides very satisfactory results as an automatic alarm system. In 27
of the 30 scenes, it successfully activates the alarm after five successive
true positives within a time interval smaller than 0.2 seconds.
As present and future work, we are evaluating ways of reducing the number of
false positives of the Faster R-CNN based detector by preprocessing the
videos, i.e., increasing their contrast and luminosity, and also by enriching
the training set with pistols in motion. We will also evaluate different
CNN-based classifiers, such as GoogLeNet, and consider a higher number of
classes.
Acknowledgments
This work was partially supported by the Spanish Ministry of Science and Tech-
nology under the project TIN2014-57251-P. Siham Tabik was supported by the
Ramon y Cajal Programme (RYC-2015-18136).
References
[1] Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller,
Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Ana-
toly Belikov, Alexander Belopolsky, et al. Theano: A python frame-
work for fast computation of mathematical expressions. arXiv preprint
arXiv:1605.02688, 2016.
[2] François Chollet. Keras: Theano-based deep learning library. Code:
https://github.com/fchollet. Documentation: http://keras.io, 2015.
[3] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human
detection. In 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
[4] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ra-
manan. Object detection with discriminatively trained part-based mod-
els. IEEE transactions on pattern analysis and machine intelligence,
32(9):1627–1645, 2010.
[5] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for
object recognition. International Journal of Computer Vision, 61(1):55–79,
2005.
[6] Greg Flitton, Toby P Breckon, and Najla Megherbi. A comparison of
3d interest point descriptors with application to airport baggage object
detection in complex ct imagery. Pattern Recognition, 46(9):2420–2436,
2013.
[7] Richard Gesick, Caner Saritac, and Chih-Cheng Hung. Automatic image
analysis process for the detection of concealed weapons. In Proceedings
of the 5th Annual Workshop on Cyber Security and Information Intelli-
gence Research: Cyber Security and Information Intelligence Challenges
and Strategies, page 20. ACM, 2009.
[8] Mostafa Mehdipour Ghazi, Berrin Yanikoglu, and Erchan Aptoula. Plant
identification using deep neural networks via optimization of transfer learn-
ing parameters. Neurocomputing, 2017, in press.
[9] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Con-
ference on Computer Vision, pages 1440–1448, 2015.
[10] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich
feature hierarchies for accurate object detection and semantic segmenta-
tion. In Proceedings of the IEEE conference on computer vision and pattern
recognition, pages 580–587, 2014.
[11] Yanming Guo, Yu Liu, Ard Oerlemans, Songyang Lao, Song Wu, and
Michael S Lew. Deep learning for visual understanding: A review. Neuro-
computing, 187:27–48, 2016.
[12] Nadhir Ben Halima and Osama Hosam. Bag of words based surveillance
system using support vector machines. International Journal of Security
and Its Applications, 10(4):331–346, 2016.
[13] Geoffrey Hinton, Li Deng, Dong Yu, George Dahl, Abdel-rahman Mo-
hamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick
Nguyen, Tara Sainath, and Brian Kingsbury. Deep Neural Networks for
Acoustic Modeling in Speech Recognition: The Shared Views of Four Re-
search Groups. IEEE Signal Process. Mag., 29(6):82–97, 2012.
[14] Osama Hosam and Abdulaziz Alraddadi. K-means clustering and support
vector machines approach for detecting fire weapons in cluttered scenes.
Life Science Journal, 11(9), 2014.
[15] Jan Hosang, Rodrigo Benenson, Piotr Dollár, and Bernt Schiele. What
makes for effective detection proposals? IEEE transactions on pattern
analysis and machine intelligence, 38(4):814–830, 2016.
[16] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan
Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe:
Convolutional architecture for fast feature embedding. arXiv preprint
arXiv:1408.5093, 2014.
[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet clas-
sification with deep convolutional neural networks. In Advances in neural
information processing systems, pages 1097–1105, 2012.
[18] Quoc V. Le. Building high-level features using large scale unsupervised
learning. In 2013 IEEE Int. Conf. Acoust. Speech Signal Process., pages
8595–8598, 2013.
[19] United Nations Office on Drugs and Crime (UNODC). Global study on
homicide 2013. Data: UNODC Homicide Statistics 2013, 2013.
[20] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE
Transactions on knowledge and data engineering, 22(10):1345–1359, 2010.
[21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn:
Towards real-time object detection with region proposal networks. In Ad-
vances in neural information processing systems, pages 91–99, 2015.
[22] Tara N. Sainath, Abdel-rahman Mohamed, Brian Kingsbury, and Bhuvana
Ramabhadran. Deep convolutional neural networks for LVCSR. In 2013
IEEE Int. Conf. Acoust. Speech Signal Process., pages 8614–8618, 2013.
[23] Xiangbo Shu, Yunfei Cai, Liu Yang, Liyan Zhang, and Jinhui Tang. Com-
putational face reader based on facial attribute estimation. Neurocomput-
ing, 2016, in press.
[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks
for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] Rohit Kumar Tiwari and Gyanendra K Verma. A computer vision based
framework for visual gun detection using harris interest point detector.
Procedia Computer Science, 54:703–712, 2015.
[26] Rohit Kumar Tiwari and Gyanendra K Verma. A computer vision based
framework for visual gun detection using surf. In Electrical, Electronics,
Signals, Communication and Optimization (EESCO), 2015 International
Conference on, pages 1–5. IEEE, 2015.
[27] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM
Smeulders. Selective search for object recognition. International journal of
computer vision, 104(2):154–171, 2013.
[28] Zelong Xiao, Xuan Lu, Jiangjiang Yan, Li Wu, and Luyao Ren. Automatic
detection of concealed pistols using passive millimeter wave imaging. In
2015 IEEE International Conference on Imaging Systems and Techniques
(IST), pages 1–4. IEEE, 2015.
[29] Wei Yu, Kuiyuan Yang, Hongxun Yao, Xiaoshuai Sun, and Pengfei Xu.
Exploiting the complementary strengths of multi-layer CNN features for
image retrieval. Neurocomputing, 2016, in press.
[30] Marianne W Zawitz. Guns used in crime. Washington, DC: US Department
of Justice: Bureau of Justice Statistics Selected Findings, publication NCJ-
148201, 1995.