3DVSD: An end-to-end 3D convolutional object detection network for video smoke detection
Keywords: 3D convolutional; Video frame sequence; Object detection; End-to-end; Video smoke detection

Abstract

In addition to static features, dynamic features are also important for smoke recognition. 3D convolution can extract temporal and spatial information from video sequences. Currently, for video smoke detection, 3D convolution is usually used as a tool for secondary judgment of the detection results of single-frame approaches. In this work, an end-to-end object detection neural network based on 3D convolution for video smoke detection, named 3DVSD, is proposed for the first time. The network first captures moving objects from the input video sequences with a dynamic feature extraction part and then feeds the resulting feature tensor to a static feature extraction part for recognition and localization, which makes full use of the spatiotemporal features of smoke and improves the reliability of the algorithm. In addition, a time-series smoke video dataset for network training is presented. The proposed algorithm is compared with other related studies. The experimental results demonstrate that 3DVSD is promising, with an accuracy rate of 99.54%, a false alarm rate of 1.11% and a missed detection rate of 0.14%, and it meets the requirements of real-time detection.
1. Introduction

Fire has always been one of the main threats to human life and property. If a fire can be detected in time, precious time is gained for escape and extinguishing, avoiding casualties and property losses. Compared with traditional point detectors, fire monitoring using surveillance video can detect wildfires and fires in buildings with large spaces in time. Because a large amount of smoke is often produced before the appearance of open fire, and smoke is more likely to appear in a surveillance picture, video smoke detection is a very valuable research direction.

Recently, some scholars have performed much research on related problems. Traditional video smoke detection technology is developed based on manually extracted smoke features, such as color [1,2], texture [3–5], shape [6], and self-similarity [7]. However, the results of smoke detection using only the features provided by a single frame image are not reliable, because smoke is a moving object, and its features, such as color, texture and shape, are easily affected by the combustion state and the external environment. Although smoke movement causes complexity in the smoke image, it also gives the smoke rich dynamic features. The design of a video smoke detection algorithm based on the static and dynamic features of smoke can improve the reliability of the algorithm [8,9]. Töreyin et al. [10,11] proposed a video smoke detection algorithm based on wavelet transform, which uses features such as motion flicker, blurred edge and color to detect smoke. Xu [12] simultaneously used static and dynamic features of smoke for smoke detection, extracting growth, flicker, self-similarity and wavelet energy and then inputting them into a BP neural network in the form of a joint vector for training to obtain a smoke pixel classifier. Truong and Jong [13] proposed a smoke detection algorithm based on spatiotemporal features. They trained SVM classifiers with spatiotemporal features such as surface roughness and motion vectors as input parameters. In the detection part, the approximate median method was used to segment the motion area first. Then, fuzzy C-means was used to cluster the moving regions to find the candidate smoke regions. Finally, the SVM classifier was used for judgment. Kwak et al. [14] used spatiotemporal features and pattern classification technology to detect wildfire smoke, setting thresholds to segment the moving areas in two continuous video frames and then using morphology to cluster the suspected smoke areas. However, due to the large feature vectors generated by this algorithm, the practicality is poor. Gunay et al. [15] proposed an online adaptive decision fusion framework based on an entropy function for video wildfire smoke detection, which includes detecting slow-moving objects, smoke-colored regions, region smoothness detection and shadow
detection based on wavelet transform. Töreyin [16] proposed a video smoke detection technology based on the Markov model and wavelet transform. The moving area in the MJPEG2000 compressed video is first detected, and then the high-subband energy of the current image frame is analyzed to determine the smoke.

The results of the above traditional video image smoke detection technologies mainly depend on the smoke features extracted by researchers. However, it is very difficult to design a feature extractor, and often only shallow features can be extracted. Shallow features are only effective in particular cases, so the robustness of traditional video smoke detection technology is generally poor. Deep learning solves the above problems: the original smoke image data, which contain all the information, can be taken directly as the input of the neural network and passed through its multiple layers. After thousands of training iterations, the network model can automatically extract deep features of smoke images to improve smoke detection accuracy. Zeng et al. [17] used RCNN, SSD and R-FCN for video smoke detection. They used TensorFlow to construct an object detection network, and Inception V2, Inception ResNet V2, ResNet V2 and MobileNet were tested as backbone networks. The results showed that SSD with MobileNet is the fastest but has the lowest accuracy, while Faster RCNN with Inception ResNet V2 has the highest accuracy but is the slowest. Xu et al. [18] proposed a video smoke detection system based on SSD. To solve the problem of the lack of training data, they used synthetic smoke images for training. In addition, they also used an adversarial strategy and a domain adaptation method to reduce the difference between real images and synthetic images. Wu et al. [19] combined motion detection with deep learning. First, the ViBe algorithm was used to extract the background from the video, and the frame difference method was used to update the moving area. Then, deep learning was used to extract the static features of flame and smoke. Finally, all the clues were combined to realize fire detection. Maksymiv et al. [20] combined traditional and DCNN-based methods for flame and smoke detection. They used the frame difference method for motion detection and the HSV color model for segmentation of flame and smoke, and finally used morphological operations such as dilation and erosion to eliminate noise caused by frame differences. Aslan et al. [21] first used the Farneback algorithm for motion estimation of video frames and then input the transformed image into a deep convolutional generative adversarial network for smoke detection. Lin et al. [22] developed a joint RCNN and 3D CNN framework to solve the problem of smoke detection and location. RCNN uses static spatial information to locate suspected smoke targets, while 3D CNN uses dynamic features to determine whether the target is smoke. Mira et al. [23] combined YOLO and an LSTM classifier for video smoke detection and location. First, YOLO was used to detect individual video frames. After a suspected smoke target was found, the previous 30 consecutive frames were input into ResNet50 to convert them into 1024-dimensional feature vectors. Then, these 30 1024-dimensional feature vectors were input into the LSTM classifier for the final judgment. Kim et al. [24] proposed a video sequence fire detection method based on Faster RCNN and LSTM. First, the suspected fire area was detected by Faster RCNN. Then, the summarized features within the bounding boxes in successive frames were accumulated by LSTM to classify whether there was a fire in a short-term period. In their subsequent work [25], they introduced Bayesian networks, which combined environmental information with visual information and successfully improved the accuracy of fire detection. Wang et al. [26] proposed a time-domain neural network for video smoke detection. They used 3D convolution to simultaneously extract dynamic and static features of the smoke in the video and used these features to classify the input video clips.

At present, there are four main types of research on video smoke detection using deep learning methods. (1) Each video frame is detected independently, and real-time detection is realized by using high detection speed, as shown in Fig. 1(a), e.g., Refs. [17,18]. This method does not make use of the timing information contained in the continuous video frames, so there are inevitably serious missed positives and false positives. (2) A traditional motion detection algorithm is used to extract the moving area, and a DCNN is used to detect the moving area, as shown in Fig. 1(b), e.g., Refs. [19–21]. This method only makes use of shallow time-series information. Although it can eliminate the false positives caused by some stationary objects, it can do nothing about the interference of moving objects or about omissions. (3) First, independent detection is carried out for each video frame. When a suspected target is detected, a sequential network is used to make a judgment, as shown in Fig. 1(c), e.g., Refs. [22–25]. Although this method extracts effective deep dynamic features, their role is only to check the detection results of the object detection network; it can eliminate false alarms, but there is no way to deal with missed detections. In addition, these algorithms are not end-to-end, so they tend to run slowly. (4) A classifier for video clips is constructed by using a time-series network, as shown in Fig. 1(d), e.g., Ref. [26]. This method can fully extract the dynamic and static features contained in the video, but these features are only used for classification without positioning the smoke target.
Fig. 1. Main methods for video smoke detection using deep learning methods.
To solve the above problems, we propose, for the first time, an end-to-end video smoke detection algorithm based on 3D convolution, which takes continuous video frames as input to detect and locate the smoke on the last frame of the sequence. The main contributions of this paper are as follows:

(1) A training method for an object detection algorithm based on a video frame sequence is proposed;
(2) An end-to-end deep object detection network based on 3D convolution is constructed;
(3) Video datasets for smoke detection are collected and fabricated;
(4) The effect of the time span and time step of the video frame sequence on the detection results is analyzed, which provides a reference for subsequent related research.

The rest of this article is arranged as follows. In the second part, we introduce the relevant research adopted in this study. In the third part, we describe the approach proposed in this paper. The fourth part contains the experimental results and analysis. The last part is the conclusion of this study.

2. Related work

2.1. 3D convolution

2D convolution, with its powerful ability to extract image features, is suitable for processing image data and has been widely used in image processing algorithms. Unfortunately, 2D convolution cannot capture the motion information contained in video frames. To solve this problem, researchers proposed the 3D convolution structure, which extends the 2D convolution kernel by one more dimension and enables it to convolve in three directions. 3D convolution is used to process multidimensional data with spatial-temporal features, and effective features can be extracted from time and space simultaneously. The operation of a 3D convolution layer is shown in Formula (1):

$$ v_{ij}^{xyt} = f\left( \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(t+r)} + b_{ij} \right) \quad (1) $$

In the formula, $v_{ij}^{xyt}$ represents the value at position (x, y, t) in the j-th feature map of the i-th layer, f(·) is the activation function, m indexes the feature maps of layer (i - 1), $P_i$ and $Q_i$ are the spatial dimensions of the 3D convolution kernel of the i-th layer, $R_i$ is the temporal dimension of the 3D convolution kernel of the i-th layer, $w_{ijm}^{pqr}$ is the convolution kernel weight connected to the m-th feature map of layer (i - 1), and $b_{ij}$ is the bias of the j-th feature map of the i-th layer. Fig. 2 shows how 3D convolution works.

Fig. 2. 3D convolution working process.

Tran et al. [27] proposed the C3D network with 3D convolution for action recognition, with an accuracy of 85.2% on the UCF-101 dataset. However, the C3D network can only realize action recognition and classification without making full use of spatial-temporal features to realize a positioning function, and the network has only six layers, so it is difficult to extract deep features. Therefore, the C3D network is not suitable for smoke detection and positioning. Currently, for video smoke detection, 3D convolution is usually used as a tool for secondary judgment of the detection results of single-frame approaches, for example [22]. The temporal-spatial feature extraction ability of 3D convolution is not fully utilized for smoke recognition and localization.

2.2. YOLO layer

The YOLO layer [28] is the final layer of the YOLO series of object detection networks, and its function is to decode the input tensor. The idea is to divide the input picture into an S × S grid, where each cell is responsible for detecting the targets whose center points fall within the cell. Each cell predicts the geometric parameters of three bounding boxes and the confidence score of each bounding box. The four geometric parameters adjust the preset anchor box to obtain the size and position of the bounding box. The confidence represents the probability that the bounding box contains an object. In addition, k class probability values are given for each bounding box, which represent the probability that the object in the bounding box belongs to each class. Therefore, the input of the YOLO layer is a tensor of size S × S × 3(4 + 1 + k), and Fig. 3 shows the decoding process of the YOLO layer.

We set the YOLO layer as the final layer of the video smoke detection network in this study, and the spatiotemporal features extracted by the deep convolutional neural network are input into three YOLO layers of different scales for decoding. In addition, the K-means clustering algorithm is used to obtain more appropriate anchor box sizes for our smoke dataset, which makes the network more suitable for detecting and locating smoke.

3. Proposed method

Smoke is translucent; when the early fire smoke concentration is low, there is often no clear boundary between the smoke and the background in the image. In this situation, even human eyes often have difficulty detecting smoke in static images in time. It is even more difficult for object detection networks based on a single image input to extract effective features from the background to accurately locate smoke. If the human eye receives a continuous sequence of images, however, moving smoke against a constant background will attract our attention, allowing us to quickly identify and locate the smoke. Inspired by this, we propose an object detection network based on the input of an image sequence. The network first extracts the dynamic features of the input sequence and preliminarily locates the moving area, then continues to extract the static features in the feature map, and combines the static and dynamic features for accurate identification and location of smoke.

3.1. Network architecture

The proposed video smoke detection network takes a video frame sequence of length n as input, fully captures its space-time information, and marks the suspected smoke objects in the last frame of the sequence; it realizes smoke detection and location over the whole video by deleting the first video frame and adding a new one. In use, the value of n is fixed, and its effect is discussed in detail below. The value of n depends on two aspects: one is the time span of the input sequence, and the other is the time step between every two frames. This algorithm makes judgments by using dynamic features accumulated within a short period of time, similar to the human brain, so we chose time spans of 1 s, 2 s, 3 s and 4 s for comparison. When the time span is 5 s or above, too much data are input, and the large number of computations makes it difficult for this algorithm to meet the real-time detection requirements. We did not extract all the video frames when generating the video frame sequence. Ince et al. [29] showed that the extraction of smoke video frames at fixed intervals can enhance turbulence characteristics, so we set a fixed time step.
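As a concrete illustration of Formula (1) in Section 2.1, the following minimal sketch shows a single 3D convolution layer consuming such a frame sequence. PyTorch is assumed here (the paper does not state which framework was used), and the layer hyperparameters are chosen only so that the output reproduces the first-layer tensor size later reported in Table 1 for the 12/4 s group:

    # Minimal sketch (PyTorch assumed; the authors' framework is not stated).
    # One 3D convolution layer consuming a 12-frame RGB sequence: it convolves
    # over time and space simultaneously, as in Formula (1) of Section 2.1.
    import torch
    import torch.nn as nn

    clip = torch.randn(1, 3, 12, 416, 416)   # batch x channels x frames x height x width

    # Temporal stride 2 halves the frame dimension while keeping the spatial
    # resolution, cf. tensor T1 of the 12/4 s group in Table 1 (6 x 416 x 416 x 16).
    conv1 = nn.Conv3d(in_channels=3, out_channels=16,
                      kernel_size=3, stride=(2, 1, 1), padding=1)
    print(conv1(clip).shape)  # torch.Size([1, 16, 6, 416, 416])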
To study the effect of different time steps on the test results, we selected extraction rates of one frame per second and three frames per second for comparison. Combining these two factors, we obtained the following eight experimental groups: 1/1 s, 2/2 s, 3/3 s, 4/4 s, 3/1 s, 6/2 s, 9/3 s and 12/4 s, where n/t denotes n video frames extracted uniformly within t seconds.
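The n/t sampling rule can be sketched as follows. This is a hedged illustration rather than the authors' code; OpenCV is a reasonable choice because it is listed in their operating environment, and the function name is ours:

    # Hedged sketch of "n/t" sequence sampling: n frames taken uniformly over
    # the t seconds ending at end_s (assumes end_s >= t).
    import cv2
    import numpy as np

    def sample_sequence(video_path, n, t, end_s):
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        # Fixed time step of t/n seconds, e.g., 12 frames over 4 s -> 3 frames/s.
        times = end_s - t + np.arange(1, n + 1) * (t / n)
        frames = []
        for ts in times:
            cap.set(cv2.CAP_PROP_POS_FRAMES, int(round(ts * fps)))
            ok, frame = cap.read()
            if ok:
                frames.append(cv2.resize(frame, (416, 416)))
        cap.release()
        return np.stack(frames)   # (n, 416, 416, 3), the network input shape

    # Example: the 12/4 s experimental group.
    # clip = sample_sequence("smoke.mp4", n=12, t=4, end_s=10.0)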
The overall structure of the network is shown in Fig. 4. Since dynamic features are generated by the changes of static features in the time dimension, we chose 3D convolution, which can simultaneously extract features in three dimensions, to extract the dynamic features of smoke in video frame sequences. The dynamic feature extraction part of the network is composed of four 3D convolution layers in series. The size of the input video sequence is n × 416 × 416 × 3, which becomes 1 × 52 × 52 × 128 after the four 3D convolution layers and 52 × 52 × 128 after one dimension reduction. Each time the data tensor passes through a 3D convolution layer, its length and width are reduced by half. When n is large, the value of the first dimension also shrinks by half; when n is small, to extract more dynamic features, the value of the first dimension shrinks at a slower rate. The detailed sizes of tensors T1, T2, T3 and T4 in each experimental group are shown in Table 1. The static feature extraction part of the network is composed of the CSP-Res-SConv11 modules, the SPP module and the PANet module. Tensor T4 is subsampled to 13 × 13 × 512 through three CSP-Res-SConv11 modules, as shown in the green box in Fig. 3. This module is composed of 11 2D depth-separable convolutions [30] in series; compared with standard convolution, depth-separable convolution can significantly reduce the number of parameters. We introduce the ResNet structure [31] and the CSPNet structure [32]: the ResNet structure is used to solve the gradient vanishing problem when the network is too deep, and the CSPNet structure can strengthen the learning ability of the network. To enable the network to effectively detect smoke of different sizes, we selected three scales, 52 × 52, 26 × 26 and 13 × 13, for output decoding. The T5, T6 and T7 tensors are processed by the SPP module [33], which strengthens regional features and enhances the sparse features of light smoke regions in the original image. Then, the PANet structure [34] is used for feature fusion among the scales: the small-scale feature tensor is upsampled and fused with the mesoscale and large-scale feature tensors in turn, and then the large-scale features are subsampled and fused with the mesoscale and small-scale feature tensors. This structure can shorten the information path and enhance the feature pyramid by using the precise positioning signals existing in the large-scale feature tensor. Finally, three tensors with sizes of 52 × 52 × 18, 26 × 26 × 18 and 13 × 13 × 18 are obtained and input into the corresponding YOLO layers for decoding, yielding the detection results. The total number of network parameters is 8,600,694, of which 8,563,990 are trainable, and the final model size is 33.3 MB.

Table 1
The size of the feature tensor at the main nodes of each experimental group network.

Group | Input | T1 | T2 | T3 | T4
1/1 s | 1 × 416 × 416 × 3 | 1 × 416 × 416 × 16 | 1 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
2/2 s | 2 × 416 × 416 × 3 | 2 × 416 × 416 × 16 | 1 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
3/3 s | 3 × 416 × 416 × 3 | 3 × 416 × 416 × 16 | 2 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
4/4 s | 4 × 416 × 416 × 3 | 3 × 416 × 416 × 16 | 2 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
3/1 s | 3 × 416 × 416 × 3 | 3 × 416 × 416 × 16 | 2 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
6/2 s | 6 × 416 × 416 × 3 | 3 × 416 × 416 × 16 | 2 × 208 × 208 × 32 | 1 × 104 × 104 × 64 | 52 × 52 × 128
9/3 s | 9 × 416 × 416 × 3 | 5 × 416 × 416 × 16 | 3 × 208 × 208 × 32 | 2 × 104 × 104 × 64 | 52 × 52 × 128
12/4 s | 12 × 416 × 416 × 3 | 6 × 416 × 416 × 16 | 3 × 208 × 208 × 32 | 2 × 104 × 104 × 64 | 52 × 52 × 128

3.2. Dataset

Video-based deep learning algorithms require considerable data. Since the background of the input image sequence is required to be consistent in this study, videos shot by fixed cameras needed to be collected. We found 44 videos meeting the requirements in public fire smoke video image databases [35–38], with scenes including classrooms, city roads, parks and forests; 32 videos with smoke were used for making positive samples and 12 videos without smoke for making negative samples. In addition, we shot 28 videos as supplements, with scenes including standard combustion rooms, offices, campuses and rural areas. Among them, 21 videos contained smoke, and the fuels were tobacco cake, cotton rope and n-heptane; the other seven were smoke-free videos with pedestrians or other distractions. We divided the data into three categories: indoor, outdoor near and outdoor far. Because the detection network has a smoke localization function, corresponding label files were needed during training, so the dataset was composed of pictures. We extracted images from the above videos at a fixed rate of 3 frames per second, designed the dataset structure in the form of a table, and made every 100 images a sequence. Finally, we sorted 147 video frame sequences, for a total of 14,700 images: 115 positive-sample sequences and 32 negative-sample sequences. Each positive sample in the dataset has a corresponding label file, which was produced with the LabelImg software. Fig. 5 shows the structure of the dataset and some of the samples. The dataset has been uploaded to the website http://smoke.ustc.edu.cn/.

Fig. 5. The structure of the dataset and the display of some samples. This dataset contains 147 image sequences with a length of 100, with a total of 14,700 images.

In network training, we selected 132 video sequences (13,200 images in total) as the training set and the remaining 15 sequences as the verification set. Because training requires continuously reading n (n < 100) pictures to input into the network, and to ensure that the n images of each read come from the same sequence, we designed a reading rule for this dataset. When reading the training data, two random integers 'a' and 'b' are first generated, where 'a' ranges from 0 to 132 and 'b' ranges from 0 to 101 - n. The variable 'i' is set to 'a × 100 + b', and reading starts from the i-th picture and continues for n pictures in turn. After the pictures are read, the label file corresponding to the last picture is read as the label of this training step to calculate the loss value. The same rule applies when reading validation data, except that 'a' ranges from 0 to 15.

We integrated and adjusted the data enhancement methods commonly used in object detection so that they can be applied to video frame sequences, including image flipping, cutting, size transformation, translation transformation and gamut adjustment. Because our network takes a sequence of multiple video frames as input, the computational cost on the equipment is high; therefore, it is difficult to carry out batch training, which leads to a long training cycle and poor model robustness. To solve this problem, we borrowed the idea of mosaic enhancement [43], and four video frame sequences are processed at a time. During training, each input video frame sequence has an equal chance of coming from a single-sequence transformation or from the mosaic transformation of multiple video frame sequences.

4. Experiment and evaluation

4.1. Training

The operating environment configuration for this study is as follows. The operating system is Windows 10 Pro 64-bit. The CPU is an Intel i5-8600K. The GPU is an NVIDIA GeForce RTX 2080 Ti. CUDA 10.0 and CUDNN 7.4.1 are installed in the system. The RAM is 8 GB, and the OpenCV version is 4.1.1.

In training, we set 100 epochs with 1500 iterations per epoch; therefore, each experimental model was trained for 150,000 iterations. The initial learning rate was 0.0001. When the validation loss obtained in two consecutive epochs did not decrease, the learning rate was halved. Fig. 6 shows the loss curve of each group during training. As shown in the figure, the loss value of the 1/1 s group is slightly higher than that of the remaining 7 groups, and its curve is relatively unstable. The length of its input sequence is 1, which means that this group amounts to object detection based on a single image, so the dynamic features between successive video frames are not utilized. Therefore, it can be seen that the combination of dynamic features and spatial features is beneficial to the training of the model.
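A hedged sketch of this training schedule follows. The framework (PyTorch), the optimizer type and the helper placeholders are our assumptions, not the authors' published code:

    # Hedged sketch of the schedule above: initial learning rate 1e-4,
    # halved when the validation loss does not decrease for two consecutive epochs.
    import torch

    model = torch.nn.Linear(1, 1)        # placeholder standing in for the 3DVSD network
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # optimizer type assumed
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=1)

    for epoch in range(100):             # 100 epochs of 1500 iterations each
        for _ in range(1500):
            pass                         # one training iteration of the real network
        val_loss = 1.0                   # stand-in for the epoch's validation loss
        scheduler.step(val_loss)         # halves the LR after two epochs without improvement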
4.2. Evaluation methods

To compare the influence of time parameters on the recognition results, each test video was processed in a sliding-window manner: after each detection, delete the first frame, add a newly sampled frame, and enter the detection network again for calculation, to realize the detection of the whole video.
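This sliding-window loop can be sketched as follows (illustrative only; the detect method is a hypothetical stand-in for a trained 3DVSD model):

    # Hedged sketch of sliding-window detection over a whole video.
    from collections import deque
    import cv2

    def detect_video(video_path, model, n=12, sample_every=8):
        # sample_every: raw frames between two sampled frames; a 24 fps video
        # sampled every 8 frames gives the 3 frames/s rate used in the tests.
        window = deque(maxlen=n)         # appending drops the oldest (first) frame
        cap = cv2.VideoCapture(video_path)
        results, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % sample_every == 0:
                window.append(cv2.resize(frame, (416, 416)))
                if len(window) == n:
                    results.append(model.detect(list(window)))  # hypothetical call;
                    # returns boxes marked on the last frame of the window
            idx += 1
        cap.release()
        return results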
Eight experimental models were used to detect the test videos, and all the test results were saved in the form of pictures. The number of positive samples in which the smoke was correctly detected is denoted TP, the number of negative samples without false positives is denoted TN, the number of negative samples with false positives is denoted FP, and the number of missed positive samples is denoted FN. The accuracy rate (A), the false alarm rate (FA), the missing detection rate (MD) and the processing time were used as indicators to evaluate the eight models. These indicators are calculated as follows:

$$ A = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \quad (2) $$

$$ FA = \frac{FP}{TN + FP} \times 100\% \quad (3) $$
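Eqs. (2) and (3) translate directly into code. Note that the MD definition used below (missed positive samples over all positive samples) is our assumption, since its formula is not reproduced in the text above:

    # Sketch of the evaluation indicators.
    def evaluate(tp: int, tn: int, fp: int, fn: int) -> dict:
        a = (tp + tn) / (tp + tn + fp + fn) * 100  # accuracy rate A, Eq. (2)
        fa = fp / (tn + fp) * 100                  # false alarm rate FA, Eq. (3)
        md = fn / (tp + fn) * 100                  # missing detection rate MD (assumed)
        return {"A(%)": a, "FA(%)": fa, "MD(%)": md}

    # Example: the abstract reports A = 99.54%, FA = 1.11%, MD = 0.14%.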
Fig. 7. Preview of the test videos. There were a total of 18 videos with a length of approximately 60 s, covering three scenes. Each scene had 6 videos, of which the first four were positive samples with smoke and the last two were negative samples without smoke.
Table 2
Processing time.

Group | 1/1 s | 2/2 s | 3/3 s | 4/4 s | 3/1 s | 6/2 s | 9/3 s | 12/4 s
Processing time | 0.024 s | 0.029 s | 0.033 s | 0.038 s | 0.033 s | 0.046 s | 0.057 s | 0.063 s
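These times can be related to the real-time requirement. Assuming the network runs once per newly sampled frame, it is invoked at most three times per second for the 3-frames-per-second groups; even the slowest model then needs roughly 3 × 0.063 s ≈ 0.19 s of computation per second of video, which is consistent with the real-time claim made later in the comparison discussion.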
Table 3
Comparison of the detection results of the 8 groups of models in three scenarios.

Group | Scene | Positive samples | Negative samples | TP | TN | FP | FN | A(%) | FA(%) | MD(%)
The extraction rates of the 1/1 s, 2/2 s, 3/3 s and 4/4 s groups were 1 frame per second, while those of the 3/1 s, 6/2 s, 9/3 s and 12/4 s groups were 3 frames per second. Therefore, under the same test video, the number of results produced by the first four experimental groups was only one third of that of the last four experimental groups. The detection result of the 12/4 s model was optimal in all three scenarios, which means that increasing the length of the input sequence can improve the final detection result. Fig. 8 shows some detection results of the 12/4 s model. However, due to the limitations of computing device performance, we cannot indefinitely increase the length of the input sequence. The length of a sequence is determined by two factors: the time span and the time step. It is necessary to analyze the influence of these two factors on the detection results to provide a reference for the selection of time parameters of video smoke detection in different scenes.

It should be noted that the proposed algorithm is only applicable to surveillance videos with fixed lenses. When the camera is violently shaken or shifted, a large number of complex and irregular temporal features will be generated in the input video frame sequence, and the above detection model will not be able to effectively identify the smoke features, resulting in false positives or missed positives.
4.3.1. Influence of different time spans on the detection results

We took the time span as the abscissa and grouped the results according to scenario and time step, so that each index yields 6 curves changing with the time span. We further analyzed the accuracy rate (A), the false alarm rate (FA) and the missing detection rate (MD).

Fig. 9 shows the change in accuracy rate with time span in each scene, where "indoor (3/s)" represents the average extraction of 3 frames per second in the indoor scene, and the definitions of the other curves are similar. As shown in Fig. 9, the accuracy rate in each scene increases with an increasing time span. However, except for the "outdoor far (1/s)" curve, which continued to rise significantly, the other curves showed a convex trend; that is, in these scenes, the effect of the time span on the accuracy rate gradually decreased. In addition, the increase in accuracy rate between 3 s and 4 s in these scenarios was small, almost zero in the indoor scenario. Fig. 10 shows the change in the false alarm rate with the time span in each scene. The false alarm rate in each scene decreases with an increasing time span, but the influence of the time span on the false alarm rate differs among scenes. In the outdoor far scene, the false alarm rate decreases significantly with an increasing time span. In the outdoor near scene, the false alarm rate still decreases, but the decrease is significantly smaller than in the outdoor far scene. In the indoor scene, the decrease in the false alarm rate is smaller still. Fig. 11 shows the change in the missing detection rate with the time span in each scene; the missing detection rate in each scene decreases with an increasing time span. Corresponding to Fig. 9, in Fig. 11, except for the "outdoor far (1/s)" curve, which continued to decline significantly, the other curves showed a concave trend.

Fig. 9. Accuracy rate changes with the time span.

Fig. 10. The false alarm rate changes with the time span.

These phenomena are related to the accumulation of dynamic features. When the scene is indoor or outdoor near, the combustion source is close to the camera, and the smoke moves at a higher speed in
Fig. 13. Comparison of the false alarm rates at different time steps.
Fig. 14. Comparison of the missing detection rates at different time steps.
Therefore, this algorithm has a fast running speed but a high missing detection rate in smoke video detection. FGFA [45] is a development of the DFF [44] algorithm, which sacrifices running speed to detect all video frames and uses the information of the preceding and following frames to enhance the features of the current frame. Its missing detection rate decreased, but its false alarm rate on negative samples increased. Because the algorithm uses image information after the current frame in the detection, it cannot achieve real-time smoke detection on a monitoring video stream. RDN [46] completed the detection by multistage inference on suspected smoke areas and by integrating the features associated with the candidate areas in three frames. However, the algorithm had difficulty distinguishing smoke from interferers through the proposal features extracted from the three frames, so its false alarm rate was very high. MEGA [47] integrated a large amount of global and local temporal information to assist keyframe detection. The experimental results showed that the algorithm was very sensitive to smoke in image sequences and achieved a very low missing detection rate. However, smoke has no fixed shape, and when the video time span was long, the smoke image changed greatly; the global temporal information contained in the video frames was then very complex, and using this information for training made the target of the network model unclear. The experimental results showed that this algorithm has a high false alarm rate in smoke detection. Our method has no advantage in processing time, but it only needs to process three times within 1 s in actual use, which meets the requirements of real-time detection. The size of our model is only 33.3 MB, which is far smaller than that of the other models in the comparison except for EfficientDet-D0.

5. Conclusion

In this paper, we proposed an end-to-end object detection network based on 3D convolution for video smoke detection. In addition, based on public smoke video datasets and our experimental videos, we produced a sequential smoke dataset for network training and designed a data enhancement algorithm for sequential network training. From the test results of each experimental group and the comparison with other object detection networks, we draw the following conclusions:

(1) The combination of dynamic features and static features can effectively improve the reliability of a video smoke detection algorithm. The more dynamic features are extracted, the more accurate the detection results will be. However, the accumulation of motion features shows the law of diminishing marginal utility.

(2) More dynamic features can be obtained by increasing the time span of the detection sequence. When the distance is close, the smoke moves faster in the video frame, and enough motion features can be accumulated in a short time, so a smaller time span can be selected. When the distance is far, the smoke moves slowly in the video frame, and it takes a long time to accumulate enough dynamic features; therefore, a large time span should be selected under the conditions allowed by the computing equipment.

(3) More dynamic features can be obtained by reducing the time step of the detection sequence. When the time span of the detection sequence is small, reducing the time step can significantly improve the reliability of the algorithm. When the time span of the detection sequence is large, because of the law of diminishing marginal utility of the motion characteristics, reducing the time step does not significantly improve the detection effect, and it significantly increases the amount of calculation.

(4) Compared with existing image-based object detection algorithms, the proposed algorithm achieves better results in false alarm rate and missed detection rate, which proves that object detection algorithms based on image sequences have more advantages and are worthy of further research.

It is worth noting that the algorithm proposed in this paper can only be used with stationary surveillance cameras and cannot work properly when the camera moves. Therefore, in our future work, we will focus on the application of the attention model in video smoke detection and try to apply the improved algorithm to UAV-based fire detection.

Author statement

Yinuo Huo (First Author): Conceptualization, Data Curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - Original Draft. Qixing Zhang (Corresponding Author): Conceptualization, Funding Acquisition, Resources, Supervision, Writing - Review & Editing. Yongming Zhang: Supervision, Validation. Jiping Zhu: Funding Acquisition, Project Administration. Jinjun Wang: Investigation, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgment

This work was supported by the National Key Research and Development Plan under Grant No. 2020YFC1522800, the Anhui Provincial Science and Technology Major Project under Grant No. 202203a07020017, and the Research Plan of the Fire and Rescue Department, Ministry of Emergency Management under Grant No. 2020XFZD13. The authors gratefully acknowledge all of this support.

References

[1] S. Verstockt, P. Lambert, R. Van de Walle, et al., State of the art in vision-based fire and smoke detection[C]//14th International Conference on Automatic Fire Detection, vol. 2, University of Duisburg-Essen, Department of Communication Systems, 2009, pp. 285–292.
[2] D. Krstinić, D. Stipaničev, T. Jakovčević, Histogram-based smoke segmentation in forest fire detection system[J], Inf. Technol. Control 38 (3) (2009).
[3] J. Yang, F. Chen, W. Zhang, Visual-based smoke detection using support vector machine[C]//2008 Fourth International Conference on Natural Computation, IEEE 4 (2008) 301–305.
[4] Y. Chunyu, Z. Yongming, F. Jun, et al., Texture analysis of smoke for real-time fire detection[C]//2009 Second International Workshop on Computer Science and Engineering, IEEE 2 (2009) 511–515.
[5] F. Yuan, Video-based smoke detection with histogram sequence of LBP and LBPV pyramids[J], Fire Saf. J. 46 (3) (2011) 132–139.
[6] F. Yuan, A double mapping framework for extraction of shape-invariant features based on multi-scale partitions with AdaBoost for video smoke detection[J], Pattern Recogn. 45 (12) (2012) 4326–4336.
[7] N. Fujiwara, K. Terada, Extraction of a smoke region using fractal coding[C]//IEEE International Symposium on Communications and Information Technology (ISCIT 2004), vol. 2, IEEE, 2004, pp. 659–662.
[8] A. Enis Çetin, et al., Video fire detection - review, Digit. Signal Process. 23 (6) (2013) 1827–1843.
[9] A. Enis Çetin, B. Merci, O. Günay, et al., Methods and Techniques for Fire Detection: Signal, Image and Video Processing Perspectives[J], 2016.
[10] B. Uğur Töreyin, Yiğithan Dedeoğlu, A. Enis Cetin, Wavelet based real-time smoke detection in video[C]//2005 13th European Signal Processing Conference, IEEE, 2005.
[11] B. Ugur Toreyin, Yigithan Dedeoglu, A. Enis Cetin, Contour based smoke detection in video using wavelets[C]//2006 14th European Signal Processing Conference, IEEE, 2006.
[12] Z. Xu, J. Xu, Automatic fire smoke detection based on image visual features[C]//2007 International Conference on Computational Intelligence and Security Workshops (CISW 2007), IEEE, 2007, pp. 316–319.
[13] T.X. Tung, J.M. Kim, An effective four-stage smoke-detection algorithm using video images for early fire-alarm systems[J], Fire Saf. J. 46 (5) (2011) 276–282.
[14] J.Y. Kwak, B.C. Ko, J.Y. Nam, Forest smoke detection using CCD camera and spatial-temporal variation of smoke visual patterns[C]//2011 Eighth International Conference on Computer Graphics, Imaging and Visualization, IEEE, 2011, pp. 141–144.
International Conference Computer Graphics, Imaging and Visualization, IEEE, [31] K. He, X. Zhang, S. Ren, et al., in: Deep Residual Learning for Image recognition
2011, pp. 141–144. [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern
[15] Osman Gunay, et al., Entropy-functional-based online adaptive decision fusion Recognition, 2016, pp. 770–778.
framework with application to wildfire detection in video, IEEE Trans. Image [32] C.Y. Wang, H.Y.M. Liao, Y.H. Wu, et al., in: CSPNet: A New Backbone that Can
Process. 21 (5) (2012) 2853–2865. Enhance Learning Capability of CNN[C]//Proceedings of the IEEE/CVF Conference
[16] Behçet Uğur Töreyin, Smoke Detection in Compressed video." Applications of on Computer Vision and Pattern Recognition Workshops, 2020, pp. 390–391.
Digital Image Processing XLI, vol. 10752, International Society for Optics and [33] K. He, X. Zhang, S. Ren, et al., Spatial pyramid pooling in deep convolutional
Photonics, 2018. networks for visual recognition[J], IEEE Trans. Pattern Anal. Mach. Intell. 37 (9)
[17] J. Zeng, Z. Lin, C. Qi, et al., in: An Improved Object Detection Method Based on (2015) 1904–1916.
Deep Convolution Neural Network for Smoke detection[C]//2018 International [34] S. Liu, L. Qi, H. Qin, et al., in: Path Aggregation Network for Instance segmentation
Conference on Machine Learning and Cybernetics (ICMLC), vol. 1, IEEE, 2018, [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern
pp. 184–189. Recognition, 2018, pp. 8759–8768.
[18] G. Xu, Q. Zhang, D. Liu, et al., Adversarial adaptation from synthesis to reality in [35] http://signal.ee.bilkent.edu.tr/VisiFire/Demo. (Accessed 5 May 2018).
fast detector for smoke detection[J], IEEE Access 7 (2019) 29471–29483. [36] http://cvpr.kmu.ac.kr/. (Accessed 5 May 2018).
[19] X. Wu, X. Lu, H. Leung, in: An Adaptive Threshold Deep Learning Method for Fire [37] L. Shuai, W. Bo, D. Ranran, et al., in: A Novel Smoke Detection Algorithm Based on
and Smoke detection[C]//2017 IEEE International Conference on Systems, Man, Fast Self-Tuning Background subtraction[C]//2016 Chinese Control and Decision
and Cybernetics (SMC), IEEE, 2017, pp. 1954–1959. Conference (CCDC), IEEE, 2016, pp. 3539–3543.
[20] O. Maksymiv, T. Rak, O. Menshikova, in: Deep Convolutional Network for [38] http://smoke.ustc.edu.cn/.
Detecting Probable Emergency situations[C]//2016 IEEE First International [39] X. Zhou, D. Wang, P. Krähenbühl, Objects as points[J]. arXiv Preprint arXiv:
Conference on Data Stream Mining & Processing (DSMP), IEEE, 2016, 1904.07850, 2019.
pp. 199–202. [40] M. Tan, R. Pang, Q.V. Le, in: Efficientdet: Scalable and Efficient Object detection
[21] Süleyman Aslan, et al., Early wildfire smoke detection based on motion-based [C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
geometric image transformation and deep convolutional generative adversarial Recognition, 2020, pp. 10781–10790.
networks, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, [41] Q. Zhao, T. Sheng, Y. Wang, et al., in: M2det: A Single-Shot Object Detector Based
Speech and Signal Processing (ICASSP), IEEE, 2019. on Multi-Level Feature Pyramid network[C]//Proceedings of the AAAI Conference
[22] G. Lin, Y. Zhang, G. Xu, et al., Smoke detection on video sequences using 3D on Artificial Intelligence, vol. 33, 2019, pp. 9259–9266, 01.
convolutional neural networks[J], Fire Technol. 55 (5) (2019) 1827–1847. [42] S. Liu, D. Huang, in: Receptive Field Block Net for Accurate and Fast Object
[23] M. Jeong, M.J. Park, J. Nam, et al., Light-weight student LSTM for real-time detection[C]//Proceedings of the European Conference on Computer Vision
wildfire smoke detection[J], Sensors 20 (19) (2020) 5508. (ECCV), 2018, pp. 385–400.
[24] B. Kim, J. Lee, A video-based fire detection using deep learning models[J], Appl. [43] A. Bochkovskiy, C.Y. Wang, H.Y.M. Liao, Yolov4: Optimal Speed and Accuracy of
Sci. 9 (14) (2019) 2862. Object detection[J], 2020 arXiv preprint arXiv:2004.10934.
[25] B. Kim, J. Lee, A bayesian network-based information fusion combined with DNNs [44] X. Zhu, Y. Xiong, J. Dai, et al., in: Deep Feature Flow for Video Recognition[C]//
for robust video fire detection[J], Appl. Sci. 11 (16) (2021) 7624. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,
[26] D. Wang, S. Luo, L. Zhao, et al., Smoke recognition network based on dynamic 2017.
characteristics[J], Int. J. Adv. Rob. Syst. 17 (3) (2020), 1729881420925662. [45] X. Zhu, Y. Wang, J. Dai, et al., in: Flow-guided Feature Aggregation for Video
[27] D. Tran, L. Bourdev, R. Fergus, et al., in: Learning Spatiotemporal Features with 3d Object detection[C]//Proceedings of the IEEE International Conference on
Convolutional networks[C]//Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 408–417.
Computer Vision, 2015, pp. 4489–4497. [46] J. Deng, Y. Pan, T. Yao, et al., in: Relation Distillation Networks for Video Object
[28] J. Redmon, A. Farhadi, Yolov3: an Incremental improvement[J]. arXiv Preprint detection[C]//Proceedings of the IEEE/CVF International Conference on Computer
arXiv:1804.02767, 2018. Vision, 2019, pp. 7023–7032.
[29] I.F. Ince, M.E. Yildirim, Y.B. Salman, et al., Fast video fire detection using luminous [47] Y. Chen, Y. Cao, H. Hu, et al., in: Memory Enhanced Global-Local Aggregation for
smoke and textured flame features[J], KSII Transac. Internet Inform. Syst. 10 (12) Video Object detection[C]//Proceedings of the IEEE/CVF Conference on Computer
(2016) 5485–5506. Vision and Pattern Recognition, 2020, pp. 10337–10346.
[30] A.G. Howard, M. Zhu, B. Chen, et al., Mobilenets: Efficient Convolutional Neural
Networks for Mobile Vision applications[J], 2017 arXiv preprint arXiv:
1704.04861.
12