Article
Multi-Object Detection in Security Screening Scene Based on
Convolutional Neural Network
Fan Sun, Xiangfeng Zhang, Yunzhong Liu and Hong Jiang *
College of Intelligent Manufacturing and Industrial Modernization, Xinjiang University, Urumchi 830017, China
* Correspondence: [email protected]
Abstract: The technique for target detection based on a convolutional neural network has been
widely implemented in the industry. However, the detection accuracy of X-ray images in security
screening scenarios still requires improvement. This paper proposes a coupled multi-scale feature
extraction and multi-scale attention architecture. We integrate this architecture into the Single Shot
MultiBox Detector (SSD) algorithm and find that it can significantly improve the effectiveness of
target detection. Firstly, ResNet is used as the backbone network to replace the original VGG network
to improve the feature extraction capability of the convolutional neural network for images. Secondly,
a multi-scale feature extraction (MSE) structure is designed to enrich the information contained in the
multi-stage prediction feature layer. Finally, the multi-scale attention architecture (MSA) is fused onto
the prediction feature layer to eliminate the redundant features’ interference and extract effective
contextual information. In addition, a combination of Adaptive-NMS and Soft-NMS is used to output
the final prediction anchor boxes when performing non-maximum suppression. The results of the
experiments show that the improved method improves the mean average precision (mAP) value by
7.4% compared to the original approach. The new modules substantially improve detection accuracy while keeping the detection speed unchanged.
Keywords: security screening scenes; convolutional neural networks; multi-scale feature extraction; attentional mechanisms
1. Introduction
X-ray security screening is now widely used in subways, airports, train stations, campuses, superstores, and other scenarios. At the same time, with the continuous development of artificial intelligence technology, intelligent security solutions are gradually gaining widespread attention in the industrial sector. However, contraband detection under X-ray security machines still relies primarily on manual discrimination. Because the X-ray image background is complicated, targets are numerous, color contrast is low, and occlusion is severe [1], security personnel must spend a long time on the detection process.
Krizhevsky et al. introduced the AlexNet [2] model, which employs deep convolutional neural networks to solve image classification problems and is significantly more effective than conventional image processing algorithms, demonstrating the superior performance of deep neural networks in the field of computer vision. Since then, the application of deep convolutional neural networks to image recognition, image segmentation, and target detection problems has become an essential topic of study. Currently, target detectors are categorized into one-stage and multi-stage detectors based on the configuration of their networks.
Girshick et al. [3] initially presented the R-CNN network model, which extracts features with convolutional neural networks as an alternative to classic target detection approaches based on hand-crafted feature extraction and description. After that, convolutional neural network technology experienced rapid growth.
Earlier studies improved the adaptation and localization of small-size targets by utilizing a small anchor-point strategy and an ROI pooling method. Literature [25] suggested a cascaded structure tensor technique for
acquiring contour-based proposals, removing clutter and occlusion from X-ray images, and then passing the candidate proposals to a single feed-forward convolutional neural network for recognition. In the literature [26], to address the occlusion problem in X-ray image detection, a de-occlusion attention module (DOAM) exploits the visually distinct colors and textures of different materials to build an attention tensor and enhance the detector's feature maps. Literature [27] suggests a lateral inhibition module (LIM) that maximally inhibits the flow of noisy information via a bidirectional propagation (BP) module and activates the most attractive boundaries from four directions via a boundary activation (BA) module. Literature [28] employs adversarial domain adaptation to match the background distribution of a large number of unlabeled SoC samples, which helps train the network to detect items in the SoC dataset using only a small labeled dataset. The previous work focused on solving the problems of dense stacking and mutual occlusion in X-ray images or on building better X-ray image datasets. However, it does not consider the limitations of fixed-size convolution kernels in handling objects that appear at widely different scales.
In recent years, NVIDIA has continued to launch higher-performance GPU products, and the problem of computing power limiting the development of deep learning has been alleviated to some extent. In industrial scenarios, SSD already meets real-time requirements, while, compared with the continuously updated YOLO series (up to YOLOv7), updates and optimization of the SSD algorithm have been relatively limited. However, background feature redundancy has a substantial effect on model detection performance, and the feature pyramid structure has been demonstrated to play a crucial role in multi-scale and small-object detection. The prediction feature layers of the SSD architecture essentially act as a divide-and-conquer over different scales [29]. Hence, this article focuses on SSD-based architectures, with the following contributions:
(1) Replacing the VGG [30] architecture in the original literature with the ResNet [31]
architecture to enhance the backbone network model extraction and characterization
capabilities.
(2) Based on the divide-and-conquer concept, we create a multi-scale extraction module
to enrich the information inside the multi-stage prediction feature layer.
(3) On the feature prediction layers, a multi-scale attention mechanism is fused to eliminate redundant feature interference and extract useful contextual information.
The remainder of this paper is organized as follows: Section 2 introduces the basic idea and network structure of the classical SSD algorithm; Section 3 illustrates the improvement strategy of the improved SSD in detail; Section 4 presents the experimental environment, dataset distribution, analysis of experimental results, ablation experiments, and algorithm comparisons. Finally, Section 5 provides the conclusions, and Section 6 discusses follow-up work and directions for improvement.
2. SSD Algorithm
The SSD approach employs the VGG-16 network architecture as the backbone network but retains only the first 15 convolution layers of its feature extraction part, while the final MaxPooling5 downsampling is adjusted from a stride of 2 with a 2 × 2 window in the original network to a stride of 1 with a 3 × 3 window. After this adjustment, the sizes of the Conv4 and Conv5 feature maps remain unchanged. Finally, the VGG-16 model's fully connected layers FC6 and FC7 are replaced by a 3 × 3 convolution layer (Conv6) and a 1 × 1 convolution layer (Conv7). The Conv4_3 convolution layer serves as the first feature prediction layer, and Conv6_2 serves as the second. Figure 2 depicts the architecture of the SSD backbone network.
Four extra feature prediction layers are introduced as additional layers in the SSD approach to detect targets at varied scales. Each additional layer employs one 1 × 1 convolution layer to reduce the channel dimension, followed by one 3 × 3 convolution layer to refine semantic context. All convolutional operations are followed by batch normalization and a ReLU activation function. The additional feature prediction layers have 256 output channels, except for Conv8_2, which has 512 output channels. The feature map sizes of the additional feature prediction layers are 10 × 10, 5 × 5, 3 × 3, and 1 × 1, respectively. The classification and regression networks of the feature prediction layers Conv7 to Conv9 are made up of four 3 × 3 convolution layers, and those of Conv10 of four 1 × 1 convolution layers. The classification and regression networks then receive data from the feature prediction layers. For the classification network, the number of output channels equals the number of predefined anchors at each feature map cell multiplied by the number of prediction categories; for the regression network, it equals the number of predefined anchors at each cell multiplied by the four box offsets.
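As an illustration of how the prediction-head channel counts follow from the anchors and categories, the following PyTorch sketch builds per-layer classification and regression heads. The feature channel counts, anchors per cell, and class count are assumed values for demonstration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative per-layer SSD prediction heads; all sizes below are assumptions.
feature_channels = [512, 1024, 512, 256, 256, 256]  # one entry per prediction feature layer
anchors_per_cell = [4, 6, 6, 6, 4, 4]                # predefined anchors at each feature map cell
num_classes = 11                                     # object classes + background (assumed)

cls_heads = nn.ModuleList()
reg_heads = nn.ModuleList()
for c, a in zip(feature_channels, anchors_per_cell):
    # classification head: one score per (anchor, class) at every cell
    cls_heads.append(nn.Conv2d(c, a * num_classes, kernel_size=3, padding=1))
    # regression head: four box offsets per anchor at every cell
    reg_heads.append(nn.Conv2d(c, a * 4, kernel_size=3, padding=1))

# Example forward pass over the first prediction feature map
feat = torch.randn(1, feature_channels[0], 38, 38)
cls_out = cls_heads[0](feat)   # [1, anchors * num_classes, 38, 38]
reg_out = reg_heads[0](feat)   # [1, anchors * 4, 38, 38]
print(cls_out.shape, reg_out.shape)
```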
Literature [32] suggests that positive and negative sample matching is one of the key factors causing the performance gap between different target detection network models, and that a severe imbalance between positive and negative samples can significantly hinder the model's final performance. The SSD model matches positive samples based on the predefined anchor boxes and mostly follows the two rules below.
(1) Calculate the intersection over union (IoU) between the ground truth boxes and the default anchor boxes. For each ground truth box, the default anchor with the highest IoU is chosen as a positive sample.
(2) Among the remaining default anchors, mark as positive samples those whose IoU with any ground truth box is greater than 0.5.
Based on the aforementioned matching principle, numerous predefined anchor boxes can be allocated to each ground truth box, which facilitates selecting the anchor box with the highest score for regression in the prediction stage. Algorithm 1 shows the algorithm flow in detail. The remaining anchor boxes are designated as negative samples. Typically, the number of negative samples is significantly larger than the number of positive samples, and this imbalance is a major factor that hurts network performance during learning; it can cause a significant drop in the network's detection ability. Therefore, the SSD algorithm performs hard negative mining: only the negative default anchor boxes most confidently misclassified as positives (i.e., those with the highest scores) are selected, at a positive-to-negative ratio of 1:3, to compute the classification loss. All other default anchor boxes are discarded, and negative samples are not used to compute the regression loss.
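The two matching rules above can be sketched as follows. This is a minimal illustration assuming axis-aligned boxes in (x1, y1, x2, y2) format; the function names are hypothetical and not the authors' implementation.

```python
import torch

def iou_matrix(gt_boxes, anchors):
    """Pairwise IoU between ground-truth boxes [G, 4] and default anchors [A, 4]."""
    lt = torch.max(gt_boxes[:, None, :2], anchors[None, :, :2])
    rb = torch.min(gt_boxes[:, None, 2:], anchors[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    area_a = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    return inter / (area_g[:, None] + area_a[None, :] - inter)

def match_anchors(gt_boxes, anchors, pos_iou=0.5):
    """Return, for each anchor, the index of the matched ground truth, or -1 for negatives."""
    ious = iou_matrix(gt_boxes, anchors)                      # [G, A]
    matched = torch.full((anchors.size(0),), -1, dtype=torch.long)
    # Rule (2): anchors whose IoU with any ground truth exceeds the threshold become positives.
    best_gt_iou, best_gt_idx = ious.max(dim=0)
    matched[best_gt_iou > pos_iou] = best_gt_idx[best_gt_iou > pos_iou]
    # Rule (1): every ground truth keeps at least its best-matching anchor as a positive.
    best_anchor_idx = ious.argmax(dim=1)
    matched[best_anchor_idx] = torch.arange(gt_boxes.size(0))
    return matched
```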
L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha L_{loc}(x, l, g)\right)    (2)

L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left(l_i^{m} - \hat{g}_j^{m}\right)

\hat{g}_j^{cx} = \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, \quad \hat{g}_j^{cy} = \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \quad \hat{g}_j^{w} = \log\!\left(\frac{g_j^{w}}{d_i^{w}}\right), \quad \hat{g}_j^{h} = \log\!\left(\frac{g_j^{h}}{d_i^{h}}\right)    (3)

L_{conf}(x, c) = -\sum_{i \in Pos}^{N} x_{ij}^{p} \log\left(\hat{c}_i^{p}\right) - \sum_{i \in Neg} \log\left(\hat{c}_i^{0}\right), \quad \text{where } \hat{c}_i^{p} = \frac{\exp\left(c_i^{p}\right)}{\sum_{p} \exp\left(c_i^{p}\right)}    (4)
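A compact sketch of the loss in Equations (2)–(4), including the 1:3 hard negative mining described above, is given below. The tensor shapes and argument names are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ssd_loss(cls_logits, loc_preds, cls_targets, loc_targets, alpha=1.0, neg_pos_ratio=3):
    """Sketch of Equations (2)-(4) with 3:1 hard negative mining.

    cls_logits:  [A, C] class scores per default anchor (class 0 is background)
    loc_preds:   [A, 4] predicted offsets
    cls_targets: [A]    matched class labels (long), 0 for negatives
    loc_targets: [A, 4] encoded offsets (Equation (3)) for positive anchors
    """
    pos = cls_targets > 0
    num_pos = pos.sum().clamp(min=1)

    # Localization loss (Equation (3)): smooth L1 over positive anchors only.
    loc_loss = F.smooth_l1_loss(loc_preds[pos], loc_targets[pos], reduction="sum")

    # Confidence loss (Equation (4)): softmax cross-entropy per anchor.
    conf_loss_all = F.cross_entropy(cls_logits, cls_targets, reduction="none")

    # Hard negative mining: keep only the highest-loss negatives at a 3:1 ratio.
    neg_loss = conf_loss_all.clone()
    neg_loss[pos] = 0
    num_neg = min(int(neg_pos_ratio * num_pos), int((~pos).sum()))
    hard_neg = torch.topk(neg_loss, num_neg).indices

    conf_loss = conf_loss_all[pos].sum() + conf_loss_all[hard_neg].sum()
    return (conf_loss + alpha * loc_loss) / num_pos
```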
3. Improved Approach
3.1. Improved Backbone
The literature [33] indicates that the overall depth of the model is very important for performance. Shallow neural network layers focus more on the image's edge contours and color information; as the network deepens, the layers focus more on texture information, and deeper still, on class-specific information, object pose, and general object class information. Shallower layers often contain more precise location information, whereas deeper layers contain semantic information about the target. Therefore, a deeper backbone network can greatly improve target detection accuracy. Compared with VGG-16, ResNet-50 builds a deeper convolutional neural network with 82.97% fewer parameters, while its residual connection module avoids the performance degradation problem of deep models. Figure 3 depicts the basic module of ResNet. The stride of the 3 × 3 convolution in Conv4_1 is changed to 1, as is the stride of the 1 × 1 convolution in its residual connection. Consequently, the Conv4 feature map has the same size as Conv3, and five additional feature prediction modules comprised of a 1 × 1 convolution layer and a 3 × 3 convolution layer are added. The output of every convolution layer is batch normalized and nonlinearly activated by the ReLU function. The stride of all 1 × 1 convolution layers is set to 1, the strides of the 3 × 3 convolution layers are {2, 2, 2, 1, 1}, the padding values are {1, 1, 1, 0, 0}, and the output channel dimensions are {512, 512, 256, 256, 256}. Table 2 displays the complete backbone network parameters.
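The backbone modification and the additional prediction modules described above could be assembled as in the following sketch. It assumes the paper's Conv4 stage corresponds to torchvision's layer3 (conv4_x) and uses an assumed 1 × 1 bottleneck width, so it is an illustration rather than the authors' exact code.

```python
import torch.nn as nn
from torchvision.models import resnet50

# --- Backbone: keep the Conv4 stage at the same spatial size as Conv3 ---
# Assumption: the paper's Conv4 corresponds to torchvision's layer3 (conv4_x).
backbone = resnet50()  # no pretrained weights loaded here
backbone.layer3[0].conv2.stride = (1, 1)          # 3x3 convolution of Conv4_1
backbone.layer3[0].downsample[0].stride = (1, 1)  # 1x1 convolution on the residual branch

# --- Five additional prediction modules (1x1 conv followed by 3x3 conv) ---
def conv_bn_relu(in_ch, out_ch, k, stride, padding):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def build_extra_layers(in_channels=1024):
    """3x3 strides {2,2,2,1,1}, paddings {1,1,1,0,0}, output channels {512,512,256,256,256};
    the 1x1 bottleneck width (half the output channels) is an assumption."""
    strides, paddings = [2, 2, 2, 1, 1], [1, 1, 1, 0, 0]
    out_channels = [512, 512, 256, 256, 256]
    layers, c_in = nn.ModuleList(), in_channels
    for s, p, c_out in zip(strides, paddings, out_channels):
        layers.append(nn.Sequential(
            conv_bn_relu(c_in, c_out // 2, k=1, stride=1, padding=0),
            conv_bn_relu(c_out // 2, c_out, k=3, stride=s, padding=p),
        ))
        c_in = c_out
    return layers
```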
We design a multi-scale attention mechanism that lets feature information interact across adjacent channels and spatial positions. In addition, when constructing the attention mechanism, the number of parameters must be considered, since an excessive number of attention modules slows down the model's inference speed when integrated into the target detection framework. Because the multi-scale attention mechanism is applied to the feature prediction layers, and the feature maps of the last two prediction layers, i.e., additional layers 4 and 5, are downsampled to 3 × 3 and 1 × 1, respectively, these maps no longer carry accurate location information; therefore, only the squeeze-and-excitation (SE) module is used to model channel interactions in these two layers.
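For reference, a minimal squeeze-and-excitation block of the kind applied to the last two prediction layers is sketched below; the reduction ratio is an assumed value, not stated in the paper.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pooling followed by two 1x1 convolutions,
    used where spatial attention is no longer meaningful (3x3 and 1x1 prediction layers)."""
    def __init__(self, channels, reduction=16):  # reduction ratio is an assumption
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: H x W -> 1 x 1
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # excitation weights in [0, 1]
        )

    def forward(self, x):
        return x * self.fc(x)                              # channel-wise recalibration
```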
The previous feature prediction layer X_{i-1} is resized to the scale of the following feature prediction layer X_i using a set of 1 × 1 and 3 × 3 convolutions, so that the attention mechanism can adapt to the sizes of the different feature prediction layers. A multi-scale convolution kernel then partitions the input feature X_i into S parts, i.e., [X_i^0, X_i^1, . . . , X_i^{S-1}]. Grouped convolution handles the input features with different convolution kernel scales without increasing the computational effort. The relationship between the number of groups and the convolution kernel size follows Equation (7) of the literature [39], where G represents the number of groups and k the convolution kernel size. The adopted convolution kernel sizes are [3, 5, 7, 9], with corresponding group numbers [1, 4, 8, 16]. The convolution process is shown in Equation (8), where k_i represents the convolution kernel size of the i-th group and X_i represents the current feature prediction layer covering the same receptive field. The channel interaction process is shown in Equation (9), where C_0 and C_1 represent two 1 × 1 convolution layers, σ represents the sigmoid nonlinear activation function, and H and W represent the feature map's height and width. Lastly, the channel attention vectors of all scales are concatenated and recalibrated across scales by a Softmax operation, as shown in Equation (10).
F_i = \mathrm{Conv}(k_i, X_i), \quad i = 0, 1, 2, \ldots, S - 1    (8)

G_i = \sigma\!\left(C_1\!\left(\sigma\!\left(C_0\!\left(\frac{1}{H \times W}\sum_{u}^{H}\sum_{v}^{W} F_i(u, v)\right)\right)\right)\right)    (9)

G = \mathrm{Cat}\left([G_0, G_1, G_2, \ldots, G_{S-1}]\right)

att_i = \mathrm{Softmax}(G_i) = \frac{\exp(G_i)}{\sum_{i=0}^{S-1}\exp(G_i)}    (10)
Figure 6. (a) Multi-scale attention mechanism (channel). (b) Multi-scale attention mechanism (space).
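A sketch of the channel branch in Figure 6a following Equations (8)–(10) is shown below. The channel split, SE reduction ratio, and module name are illustrative assumptions; the sketch is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiScaleChannelAttention(nn.Module):
    """Channel branch sketch: split the input into S parts, apply grouped convolutions with
    different kernel sizes (Equation (8)), squeeze-and-excite each branch (Equation (9)),
    and recalibrate across branches with a Softmax (Equation (10))."""
    def __init__(self, channels, kernels=(3, 5, 7, 9), groups=(1, 4, 8, 16), reduction=4):
        super().__init__()
        self.s = len(kernels)
        branch_ch = channels // self.s
        self.convs = nn.ModuleList([
            nn.Conv2d(branch_ch, branch_ch, k, padding=k // 2, groups=g)
            for k, g in zip(kernels, groups)
        ])
        # Equation (9): global average pooling, then 1x1 convolutions C0 and C1,
        # each followed by a sigmoid activation.
        self.se = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(branch_ch, branch_ch // reduction, 1), nn.Sigmoid(),
                nn.Conv2d(branch_ch // reduction, branch_ch, 1), nn.Sigmoid(),
            )
            for _ in kernels
        ])
        self.softmax = nn.Softmax(dim=1)  # Equation (10): normalize across the S branches

    def forward(self, x):
        feats = torch.chunk(x, self.s, dim=1)                                   # split into S parts
        feats = [conv(f) for conv, f in zip(self.convs, feats)]                 # Equation (8)
        weights = torch.stack([se(f) for se, f in zip(self.se, feats)], dim=1)  # [B, S, C/S, 1, 1]
        weights = self.softmax(weights)                                         # Equation (10)
        out = [f * w for f, w in zip(feats, weights.unbind(dim=1))]
        return torch.cat(out, dim=1)
```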
After recalibration by the channel-level attention vector, the output tensor generates a spatial attention map utilizing the spatial relationships between features. The spatial attention information is obtained by concatenating the channel-wise mean and maximum values of the corresponding feature layers; as [22] states, informative regions can be effectively highlighted by applying pooling operations along the channel axis. The average-pooled and max-pooled maps of each scale are then fused by a convolution with a 7 × 7 window, padding of 3, and stride of 1, which preserves the feature map size. A spatial attention output tensor is generated for each scale of the feature map, as depicted in Equation (12), and the spatial attention information is then recalibrated using a sigmoid operation, as illustrated by Equation (13). Experiments show that using the multi-scale attention method increases the model's detection mAP value by about 2.1% on the dataset.
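The spatial branch can be sketched as below, following the description of channel-wise mean/max pooling and the 7 × 7 fusion convolution. This is a minimal illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial branch sketch (Figure 6b): channel-wise mean and max maps are concatenated,
    fused by a 7x7 convolution (padding 3, stride 1), and passed through a sigmoid to
    recalibrate the feature map spatially."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, stride=1, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)     # average over channels
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # maximum over channels
        attn = self.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                   # spatial recalibration
```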
Figure 7. (a) Improved target detection framework. (b) Objective classification network with regression network (prediction head).
4. Experiment
4.1. Experimental Environment and Hyperparameter Setting
All experiments with this technique were conducted on an Nvidia RTX3070 graphics card (12 GB of graphics memory) with a Windows 10 operating system, 32 GB of RAM,
the CUDA 11.6 accelerated computing library, and the PyTorch framework. The batch size was set to 22. To prevent instability of the model gradient during initial training, a stochastic gradient descent (SGD) optimizer with momentum was used for 350 rounds of warm-up training, with the initial learning rate set to 8 × 10−6, the momentum set to 0.9, and the weight decay rate set to 0.0005. After that, the Adam optimizer with an initial learning rate of 0.01 was used to continue training the model; after 2500 epochs, the learning rate automatically drops to 0.1 times the initial rate. The smoothing constants are 0.90 and 0.999, respectively, and the weight decay rate is 0.0005. The total number of training epochs is 15,000.
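The optimizer schedule described above might be set up as in the following sketch; the placeholder module and the use of a step scheduler to implement the learning-rate drop are assumptions.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder module standing in for the detector

# Warm-up phase: SGD with momentum, using the values stated above.
warmup_opt = torch.optim.SGD(model.parameters(), lr=8e-6, momentum=0.9, weight_decay=5e-4)

# Main phase: Adam with the stated smoothing constants and weight decay; the learning
# rate is dropped to 0.1x after 2500 epochs (assumed here to be a StepLR schedule).
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, betas=(0.90, 0.999), weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2500, gamma=0.1)
```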
Applying data augmentation techniques improves the robustness and generalization of the model for input images with various target sizes and shapes. The improved SSD algorithm uses data augmentation techniques such as randomly flipping each training image with a probability of 50% and randomly altering the image color with a brightness factor of 0.125, contrast of 0.5, saturation of 0.5, and hue of 0.05. Randomly cropping the image helps improve the detection precision of small-scale objects; Figure 8 depicts the process of the random cropping algorithm.
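A minimal sketch of the color and flip augmentations using torchvision is shown below. Box-aware random cropping (Figure 8) is omitted here because it must also transform the bounding-box annotations.

```python
from torchvision import transforms

# Sketch of the augmentation pipeline described above (image-only transforms).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                       # flip with 50% probability
    transforms.ColorJitter(brightness=0.125, contrast=0.5,
                           saturation=0.5, hue=0.05),             # random color perturbation
    transforms.ToTensor(),
])
```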
In total, 14,218 images were randomly selected and divided in a ratio of 4:1 into a training set of 11,375 images, designated EDS_10K, and a test set of 2843 images. Five thousand images were taken from the training set and named EDS_5K to facilitate the ablation experiments. Figure 9 shows the sample distribution of the dataset. In general, the target samples are relatively evenly distributed; a uniformly distributed number of targets enables the target detection network to learn meaningful feature representations without overly favoring certain data, improving model generalization. In addition, the detection targets in the images are classified into three classes based on the COCO dataset classification criteria: detection targets smaller than 32 pixels are classified as "small targets", targets larger than 32 pixels but smaller than 96 pixels as "medium targets", and targets larger than 96 pixels as "large targets". The specific statistics are shown in Figure 10. Since targets of different scales exist, it is possible to verify the improved algorithm's detection accuracy for multi-scale targets.
\mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}    (14)

\mathrm{Precision} = \frac{TP}{TP + FP}    (15)

\mathrm{Recall} = \frac{TP}{TP + FN}    (16)
Figure 9. (Left) Number of target instances in EDS_5k. (Right) Number of target instances in
EDS_10k.
Figure 10. (Left) Distribution of targets by scale for EDS_5k. (Right) Distribution of targets by scale
for EDS_10k.
The improved model obtains the highest mAP value of 80.6% at an input image size of 300 × 300 while maintaining a high inference speed. The original model can achieve comparable detection accuracy at an input image size of 512 × 512 (mAP value of 80.3%), but its inference time rises to 196 ms because of the sharp increase in default anchor boxes (98,112) caused by the larger input image size.
Multiple sets of ablation experiments were conducted on the EDS_5K dataset to determine the effectiveness of each module. Replacing the backbone network improves the detection accuracy by 0.8%, the multi-scale feature extraction module improves it by 0.9%, and the multi-scale attention module improves it by 0.7%. Table 4 shows the results of the ablation experiments. The class activation heat maps also confirm the effectiveness of the multi-scale attention module by visualizing the prediction results. The images are depicted in Figure 12, where the first row shows the actual image and its label (in the orange box), the second row shows the class activation heat map without the multi-scale attention module, and the third row shows the improved class activation heat map. The prediction results are depicted in Figure 13. When plotting the prediction results, targets with confidence below 0.4 after non-maximum suppression are filtered out, indicating that the model can recognize all types of objects with high confidence.
Approach                  mAP0.5   mAP0.75   mAPs    mAPm    mAPl    Inference Speed (ms)
Fast-RCNN                 67.2     45.7      57.8    71.5    72.3    396
Faster-RCNN               69.9     46.3      58.3    74.9    76.5    142
G-RCNN                    77.3     55.1      61.4    83.9    88.4    89
YOLOv4                    75.1     53.5      49.7    88.4    87.2    33
YOLOv5                    75.5     55.5      49.7    87.8    89.3    28
YOLOv7                    77.2     55.5      52.3    88.7    90.6    37
SSD300                    74.4     51.9      47.5    85.3    87.4    52
DSSD321                   75.9     53.3      55.1    85.1    87.5    67
Improved SSD300           78.7     56.0      52.6    90.3    93.2    47
Improved SSD300 (10k)     80.6     56.3      52.6    94.1    95.1    47
X-ray images are frequently densely packed with objects and heavily occluded. Traditional non-maximum suppression sets the confidence of predicted anchor boxes whose intersection-over-union with a higher-scoring box exceeds a threshold to 0, which frequently results in the loss of overlapping targets. Therefore, redundant prediction anchor boxes are filtered out using a combination of Adaptive-NMS [40] and Soft-NMS [41] in the post-processing phase. Equation (17) is used to calculate the target density, where the density of object i is defined as the maximum bounding-box IoU with the other objects in the set of ground truths G. Equations (18) and (19) are then used to perform Soft-NMS, reducing the confidence scores of the predicted target boxes, where N_M represents the adaptive NMS threshold for box M and d_M is the density of the object that M covers.
d_i := \max_{b_j \in G,\, j \neq i} \mathrm{IoU}(b_i, b_j)    (17)

N_M := \max(N_t, d_M)    (18)

s_i = \begin{cases} s_i, & \mathrm{IoU}(M, b_i) < N_M \\ s_i\, f(\mathrm{IoU}(M, b_i)), & \mathrm{IoU}(M, b_i) \geq N_M \end{cases}    (19)
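Combining Adaptive-NMS with Soft-NMS as in Equations (18) and (19) could look like the following sketch; the Gaussian decay function f, the base threshold, and the score threshold are assumed hyperparameters rather than values from the paper.

```python
import torch

def box_iou(box, boxes):
    """IoU between one box and a set of boxes, all in (x1, y1, x2, y2) format."""
    lt = torch.max(box[:2], boxes[:, :2])
    rb = torch.min(box[2:], boxes[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area1 = (box[2] - box[0]) * (box[3] - box[1])
    area2 = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area1 + area2 - inter)

def adaptive_soft_nms(boxes, scores, densities, base_thresh=0.5, sigma=0.5, score_thresh=0.001):
    """Sketch of Adaptive-NMS combined with Gaussian Soft-NMS (Equations (18) and (19))."""
    keep = []
    while scores.numel() > 0:
        m = int(torch.argmax(scores))
        keep.append(boxes[m])
        n_m = max(base_thresh, float(densities[m]))              # Equation (18)
        ious = box_iou(boxes[m], boxes)
        # Equation (19): scores of boxes whose IoU with M exceeds N_M are decayed, not zeroed.
        decay = torch.where(ious < n_m, torch.ones_like(ious), torch.exp(-(ious ** 2) / sigma))
        scores = scores * decay
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[m] = False
        mask &= scores > score_thresh
        boxes, scores, densities = boxes[mask], scores[mask], densities[mask]
    return torch.stack(keep) if keep else boxes.new_zeros((0, 4))
```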
The feature layer extracted by the backbone network is first reduced using a 1 × 1 convolution, and the result is concatenated with the output tensors of the classification network and the anchor-box regression network. A 5 × 5 convolution kernel is used in the last layer of the density network to take the surrounding information into account. Finally, the results are mapped to the [0, 1] interval by a Softmax operation, and the final output dimension is kept the same as the output dimension of the anchor-box regression network. In particular, since the density prediction network needs 5 × 5 convolutional kernels to generate the final results, only the first four feature prediction layers are connected to the density prediction network. The density sub-network is shown in Figure 14. It improves the overall detection accuracy of the model by 1.2 mAP points; the specific results are shown in Table 5.
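A rough sketch of such a density head is given below. The reduced channel width is assumed, and the output is simplified to one density value per anchor with a sigmoid, so it should be read as an illustration of the wiring rather than the exact sub-network.

```python
import torch
import torch.nn as nn

class DensityHead(nn.Module):
    """Density sub-network sketch (Figure 14): reduce the prediction feature layer with a
    1x1 convolution, concatenate it with the classification and regression outputs, and
    predict a density per anchor with a 5x5 convolution."""
    def __init__(self, feat_ch, cls_ch, reg_ch, anchors_per_cell, reduced_ch=64):
        super().__init__()
        self.reduce = nn.Conv2d(feat_ch, reduced_ch, kernel_size=1)
        self.predict = nn.Conv2d(reduced_ch + cls_ch + reg_ch,
                                 anchors_per_cell, kernel_size=5, padding=2)

    def forward(self, feat, cls_out, reg_out):
        x = torch.cat([self.reduce(feat), cls_out, reg_out], dim=1)
        # The paper maps the result to [0, 1] with a Softmax; a per-anchor sigmoid is used
        # here as a simplification for this sketch.
        return torch.sigmoid(self.predict(x))
```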
The momentum and weight decay were set to 0.9 and 0.0005, respectively. We evaluated the mean average precision (mAP) of the target detection to measure the performance of the models, with the IoU threshold set to 0.5.
On the OPIXray dataset, the literature [26] proposes the de-occlusion attention module (DOAM) to capture edge information and region information separately. SSD, YOLOv3, and FCOS network architectures were selected as baselines into which the DOAM module was incorporated to verify the improvements. The same baselines were chosen to fuse the coupled MSA and MSE modules when performing the comparison tests. On the DBF dataset, the original paper focused on transfer learning of X-ray images and used Faster-RCNN and R-FCN networks as baselines; in the same way, we used these two network architectures as baselines to evaluate our proposed modules. The experimental comparisons demonstrate that the proposed modules significantly improve the detection precision of all categories on the DBF6 dataset; the detection accuracy of every category exceeded 90% except for Knife, a tiny target. On the OPIXray dataset, the improved SSD model outperformed the DOAM architecture reported in the literature [26] and attained the best detection accuracy of 82.9% mAP. Integrating the multi-scale feature extraction and multi-scale attention modules into YOLOv3 and FCOS improved their detection accuracy by 0.9% and 0.2%, respectively. The experimental results on the DBF6 dataset are shown in Table 6 and those on the OPIXray dataset in Table 7. The detection results are shown in Figure 15.
5. Conclusions
In this research, we propose an enhanced multi-scale target detection method based
on the SSD algorithm, which seeks to increase the detection accuracy of the model for
each scale target while ensuring the model’s detection speed. We use the EDS dataset
as our primary training dataset. Running at approximately 22 images per second, the proposed model achieves the highest detection accuracy among the current mainstream models. The modified ResNet-50 network architecture replaces the original VGG network to improve the feature extraction capability
of the backbone network while reducing the number of network parameters. A multi-scale
feature extraction module is designed with stacked convolutional kernels of different sizes.
The algorithm’s performance is further enhanced by maximizing the feature expression
potential of the output feature map of the backbone network. In addition, a multi-scale
attention mechanism is designed to improve detection by concentrating feature
information at various scales in both channel and spatial dimensions. Ablation experiments
demonstrate that adding multi-scale feature extraction and multi-scale attention coupling
mechanisms into a model improves its detection accuracy compared to employing only
one of the modules. Finally, we combine the newly developed multi-scale feature extraction and attention coupling modules with other target detection models on different X-ray datasets. The experimental results show that the new architecture also generalizes well to other datasets.
6. Discussion
This paper provides a reliable solution for intelligent security screening. The proposed method achieves better results than the YOLO series, which detects quickly but with lower accuracy, and than the original SSD algorithm, which has a slower detection speed. The main idea is that real images generally contain targets of various scales, and a fixed-size convolution kernel often has difficulty adapting to targets of different scales. This study uses ResNet-50 as the backbone network to keep the computational overhead of network inference as low as possible. It is reasonable to speculate that a deeper neural network such as ResNet-101 or Res2Net-101 as the backbone could further improve the detection accuracy of the model. In addition, the model can still be improved in terms of the number of parameters and detection speed, and a more lightweight model still needs to be researched in the future. In this paper, the contrast of the X-ray images is increased during data augmentation so that the existing network can directly predict targets in different color gamuts; however, network architectures that detect targets in different color gamuts separately still require more research.
Author Contributions: Supervision, H.J.; writing—original draft, F.S.; writing—review and editing,
Y.L.; validation, X.Z. All authors have read and agreed to the published version of the manuscript.
Funding: This work is supported by the National Natural Science Foundation of China (No. 51865054)
and Xinjiang Uygur Autonomous Region Foundation Project (No. 2022D01A117).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Cao, Y.; Zhang, L.; Meng, J.; Song, Q.; Zhang, L. A multi-objective prohibited item identification algorithm in the X-ray security
scene. Laser Optoelectron. 2021, 5, 1–14. (In Chinese)
2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014;
pp. 580–587. [CrossRef]
4. Girshick, R. Fast R-CNN. In Proceedings of the International Conference on Computer Vision, Santiago, Chile, 7–13 December
2015; IEEE Computer Society: Washington, DC, USA.
5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans.
Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [CrossRef] [PubMed]
6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the
2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
[CrossRef]
7. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision & Pattern
Recognition, Honolulu, HI, USA, 21–27 July 2017; pp. 6517–6525.
8. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
9. Bochkovskiy, A.; Wang, C.Y.; Liao, H. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
10. Wang, C.Y.; Bochkovskiy, A.; Liao, H. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors.
arXiv 2022, arXiv:2207.02696.
11. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In
Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 448–456.
12. Lin, T.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings
of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
13. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans.
Pattern Anal. Mach. Intell. 2014, 37, 1904–1916. [CrossRef] [PubMed]
14. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the 2019 IEEE/CVF
International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
15. Lin, T.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2017,
99, 2999–3007.
16. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of
the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
17. Fu, M.; Deng, M.; Zhang, D. Survey on Deep Neural Network Image Target Detection Algorithms. Comput. Appl. Syst. 2022,
7, 35–45. [CrossRef]
18. Qiao, J.; Zhang, L. X-Ray Object Detection Based on Pyramid Convolution and Strip Pooling. Prog. Laser Optoelectron. 2022,
4, 217–228.
19. Zhang, Z.; Li, H.; Li, M. Research on YOLO Algorithm in Abnormal Security Images. Comput. Eng. Appl. 2020, 56, 187–193.
20. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708.
21. Wu, H.; Wei, X.; Liu, M.; Wang, A.; Liu, H. Improved YOLOv4 for dangerous goods detection in X-ray inspection combined with
atrous convolution and transfer learning. China Opt. 2021, 14, 1417–1425.
22. Akcay, S.; Kundegorski, M.E.; Willcocks, C.G.; Breckon, T.P. Using deep convolutional neural network architectures for object
classification and detection within x-ray baggage security imagery. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2203–2215. [CrossRef]
23. Galvez, R.; Dadios, E.; Bandala, A.; Vicerra, R.R.P. Threat object classification in X-ray images using transfer learning. In
Proceedings of the IEEE 10th International Conference on Humanoid, Nanotechnology, Information Technology, Communication
and Control, Environment and Management (HNICEM), Baguio City, Philippines, 29 November–2 December 2018; pp. 1–5.
24. Gong, Y.; Luo, J.; Shao, H.; Li, Z. A transfer learning object detection model for defects detection in X-ray images of spacecraft
composite structures. Compos. Struct. 2022, 284, 115136. [CrossRef]
25. Hassan, T.; Bettayeb, M.; Akçay, S.; Khan, S.; Bennamoun, M.; Werghi, N. Detecting prohibited items in X-ray images: A contour
proposal learning approach. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Online, 25–28
October 2020; pp. 2016–2020.
26. Wei, Y.; Tao, R.; Wu, Z.; Ma, Y.; Zhang, L.; Liu, X. Occluded prohibited items detection: An x-ray security inspection benchmark
and de-occlusion attention module. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA,
12–16 October 2020; pp. 138–146.
27. Tao, R.; Wei, Y.; Jiang, X.; Li, H.; Qin, H.; Wang, J.; Liu, X. Towards real-world X-ray security inspection: A high-quality benchmark
and lateral inhibition module for prohibited items detection. In Proceedings of the IEEE/CVF International Conference on
Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10923–10932.
28. Sigman, J.B.; Spell, G.P.; Liang, K.J.; Carin, L. Background adaptive faster R-CNN for semi-supervised convolutional object
detection of threats in x-ray images. In Proceedings of the Anomaly Detection and Imaging with X-Rays (ADIX), Online, 26 May
2020; Volume 11404, pp. 12–21.
29. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-level Feature. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13039–13048.
30. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
31. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
32. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the Gap Between Anchor-Based and Anchor-Free Detection via Adaptive
Training Sample Selection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), Seattle, WA, USA, 13–19 June 2020.
33. Zeiler, M.D.; Fergus, R. Visualizing and understanding convolutional networks. In Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; Volume 8689, pp. 818–833.
34. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141.
35. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional Block Attention Module; Springer: Cham, Switzerland, 2018.
36. Guo, J.; Ma, X.; Sansom, A.; Mcguire, M.; Fu, S. Spanet: Spatial Pyramid Attention Network for Enhanced Image Recognition. In
Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), Online, 6–10 July 2020.
37. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
38. Zhang, H.; Zu, K.; Lu, J.; Zou, Y.; Meng, D. Epsanet: An efficient pyramid squeeze attention block on convolutional neural
network. arXiv 2021, arXiv:2105.14447.
39. Ioannou, Y.; Robertson, D.; Cipolla, R.; Criminisi, A. Deep roots: Improving cnn efficiency with hierarchical filter groups.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017;
pp. 1231–1240.
40. Liu, S.; Huang, D.; Wang, Y. Adaptive NMS: Refining Pedestrian Detection in a Crowd. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6452–6461.
41. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving Object Detection with One Line of Code. In Proceedings of
the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5562–5570.