
This article has been accepted for publication in IEEE Geoscience and Remote Sensing Letters. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/LGRS.2023.3312148

Object Detection in Thermal Infrared Image Based on Improved YOLOX

Ruijie Gao and Zhanchuan Cai, Senior Member, IEEE

Abstract—Infrared images have received much attention, but their weak object features and heavy noise make object detection difficult. In this letter, an improved YOLOX called YOLOX-IRI is proposed to improve detection accuracy on infrared images. First, an improved CBAM is proposed to make the network focus on the object area. This module enriches the feature information by properly mixing three kinds of pooling methods, which helps the network distinguish between background and object. Second, class-balanced loss is introduced to suppress the adverse effects caused by the unbalanced sample distribution. This loss function balances the contribution of each class to the total loss by assigning weights, thereby improving the classification result. Experimental results indicate that our method is superior to other object detection algorithms.

Index Terms—Infrared image, object detection, YOLOX, attention mechanism.

This work was supported in part by the Science and Technology Development Fund of Macau under Grant 0059/2020/A2 and Grant 0052/2020/AFJ, and in part by the Zhuhai Industry-University-Research Collaboration Program (ZH22017002210011PWC). (Corresponding author: Zhanchuan Cai.) The authors are with the School of Computer Science and Engineering, Macau University of Science and Technology, Macau 999078, China (e-mail: rj [email protected]; [email protected]).

I. INTRODUCTION

Object detection [1]–[3] is an important part of computer vision and pattern recognition, with a wide range of applications in transportation, public security, military, medical, and other fields. Research on object detection in visible-light images has matured steadily, and many efficient and stable methods are already used in practice. However, the limitations of visible images are obvious: in extreme environments such as night and fog, object detection methods for visible images fail. Photos taken with thermal infrared cameras are hardly affected by light intensity, so object detection in infrared images has gradually received more attention.

Traditional object detection methods are divided into three parts: region selection, feature extraction, and classification. Region selection locates the position of the object, often with the sliding window method, whose disadvantage is obvious, namely high time complexity. The quality of the features directly affects the accuracy of the classifier; commonly used features in the feature extraction stage are Haar [4], HOG [5], and ACF [6], while classifiers are often implemented with AdaBoost [7] and SVM [8]. Although traditional methods achieve object detection, they only perform well in specific scenarios.

With the improvement of hardware computing power, deep learning methods have made great progress in object detection. Object detectors based on deep learning are divided into two categories: two-stage detectors, represented by the R-CNN family, and one-stage detectors, which mainly include the YOLO series and the SSD series.

In infrared images, there is more background noise and there are fewer object features, which brings new challenges to object detection. Some researchers have made efforts in this field. Geng et al. [9] proposed an improved YOLOV3 network containing three DenseNet blocks to detect human gestures in infrared images. Li et al. [10] improved the detection head structure and CSP module of YOLOV5 to obtain higher accuracy. Wan et al. [11] used a lightweight backbone to reduce the computational complexity of object detection on SAR images. Zhou et al. [12] inserted an ECA module into CenterNet, which improved its capability for efficient feature extraction.

In this letter, an improved YOLOX called YOLOX-IRI is proposed. The network achieves good results on the BIRDSAI dataset. Our contributions are summarized as follows:

1) To make the network focus on the object area, an improved CBAM is proposed. This module enriches the feature information by properly mixing three kinds of pooling methods, which helps the network distinguish between background and object. In YOLOX-IRI, the improved CBAM is placed between the backbone and the neck.

2) To suppress the adverse influence of the unbalanced sample distribution, a new loss function scheme is proposed. Class-balanced loss is introduced as the classification loss of YOLOX. This loss function balances the contribution of each class to the total loss by assigning weights, thereby improving the classification result.

II. METHODOLOGY

A. YOLOX

YOLOX [13] is a YOLO network using the anchor-free mechanism, which avoids the burden caused by the calculation of anchor boxes. As shown in Fig. 1, YOLOX is composed of three parts: backbone, neck, and head. The backbone of YOLOX is CSPDarknet53, which is improved from YOLOV3-SPP. The CSP module keeps more gradient information in the output feature map and maintains the performance of the network while reducing the amount of computation. The neck of YOLOX uses FPN and PAN for feature fusion.


The feature map generated by the backbone is input to the FPN. The FPN transmits and fuses feature information through up-sampling, while the PAN fuses the features again through down-sampling. The head of YOLOX is a decoupled head that uses three branches to predict the classification, regression, and IoU parameters, respectively.

Fig. 1. The structure of YOLOX-IRI. "cls" refers to classification and "reg" refers to regression.

Fig. 2. Convolutional Block Attention Module (CBAM).

Fig. 3. Channel Attention Module (CAM).

Fig. 4. Spatial Attention Module (SAM).

Fig. 5. Improved CBAM. From top to bottom: the structure of the entire module, the detailed structure of the CAM, and the detailed structure of the SAM.
B. Improved CBAM

Infrared images contain clutter and noise, which interfere with the detection of the model. To alleviate this problem, an attention mechanism is introduced into YOLOX. As shown in Fig. 2, CBAM [14] is a lightweight attention module that combines channel and spatial attention. The channel attention module (CAM) keeps the channel dimension unchanged and compresses the spatial dimension; it focuses on what the effective information of the feature map is, which is helpful for classification. The spatial attention module (SAM) keeps the spatial dimension unchanged and compresses the channel dimension; it focuses on where the effective information of the feature map is, which is helpful for localization. The detailed structures of the CAM and SAM are shown in Fig. 3 and Fig. 4, respectively. According to Fig. 3, Fig. 4, and Fig. 5, the whole process of CBAM can be defined as

F' = M_c(F) ⊗ F,   F'' = M_s(F') ⊗ F',   (1)

where ⊗ represents element-wise multiplication, F is the input feature map, M_c is the channel attention, and M_s is the spatial attention. F' is an intermediate feature processed by the channel attention module, known as the channel-refined feature. The final output is F''.

Both max and average pooling are used in CBAM, so we suppose that introducing another pooling method may improve the detection accuracy of the model. Therefore, soft pooling [15] is introduced. Soft pooling preserves the basic properties of the input through a soft-max weighting method while amplifying the stronger feature activations. It reduces the information loss caused by the pooling layer, which helps to improve classification accuracy. We designed three pooling structures — (1) using all three pooling methods simultaneously, (2) using max pooling and soft pooling, and (3) using average pooling and soft pooling — and applied each to the CAM alone, the SAM alone, or the entire CBAM module, giving nine schemes in total. According to the results presented later, max pooling and soft pooling are chosen for the CAM module. The structure of our improved CBAM is shown in Fig. 5.

The input feature map is first subjected to parallel max pooling and soft pooling to obtain the max-pooled feature F^c_max and the soft-pooled feature F^c_soft. These two features are then sent to a shared multi-layer perceptron to generate the channel attention map M_c. The process of generating channel attention can be summarized as

M_c(F) = σ(MLP(MaxPool(F)) + MLP(SoftPool(F))),   (2)

where σ denotes the sigmoid function.

Spatial attention is complementary to channel attention, and its process can be concluded as

M_s(F) = σ(f^{7×7}([AvgPool(F); MaxPool(F)])),   (3)

where σ denotes the sigmoid function and f^{7×7} represents a convolution operation with a filter size of 7 × 7.
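To make the modified attention concrete, the following minimal PyTorch sketch implements Eqs. (1)–(3) with the chosen CAM variant. It is our reading, not the authors' released code; the module names, the reduction ratio of 16, and the 1×1-convolution MLP follow common CBAM defaults and are assumptions here.

import torch
import torch.nn as nn


def global_soft_pool(x: torch.Tensor) -> torch.Tensor:
    """Global SoftPool [15]: a soft-max-weighted average over all spatial
    positions, so strong activations dominate but none are discarded.
    (B, C, H, W) -> (B, C, 1, 1)."""
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)
    weights = torch.softmax(flat, dim=-1)
    return (weights * flat).sum(dim=-1).view(b, c, 1, 1)


class ChannelAttention(nn.Module):
    """Modified CAM, Eq. (2): max-pooling and soft-pooling branches
    share one MLP; their outputs are summed and passed through sigmoid."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        max_feat = torch.amax(x, dim=(2, 3), keepdim=True)  # F^c_max
        soft_feat = global_soft_pool(x)                      # F^c_soft
        return torch.sigmoid(self.mlp(max_feat) + self.mlp(soft_feat))


class SpatialAttention(nn.Module):
    """Original SAM, Eq. (3): channel-wise avg/max maps, concatenated,
    convolved with a 7x7 filter, then passed through sigmoid."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = torch.amax(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))


class ImprovedCBAM(nn.Module):
    """Eq. (1): F' = M_c(F) ⊗ F, then F'' = M_s(F') ⊗ F'."""

    def __init__(self, channels: int):
        super().__init__()
        self.cam = ChannelAttention(channels)
        self.sam = SpatialAttention()

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        f = self.cam(f) * f      # channel-refined feature F'
        return self.sam(f) * f   # final output F''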


In YOLOX, we place the improved CBAM between the backbone and the neck. As shown in Fig. 1, the feature maps extracted from the backbone are sent to the neck after passing through the improved CBAM.
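As a sketch of this placement, reusing the ImprovedCBAM module from the sketch above: the per-scale arrangement (one attention block per backbone output) and the channel widths are our assumptions based on Fig. 1, not values stated in the text.

from torch import nn

class FeaturesWithCBAM(nn.Module):
    """Hypothetical wrapper: one improved CBAM per backbone output scale,
    applied before the FPN/PAN neck. All names here are ours."""

    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 channels=(128, 256, 512)):   # assumed YOLOX-s scale widths
        super().__init__()
        self.backbone = backbone   # CSPDarknet53: outputs [P3, P4, P5]
        self.attn = nn.ModuleList(ImprovedCBAM(c) for c in channels)
        self.neck = neck           # FPN + PAN feature fusion

    def forward(self, x):
        feats = self.backbone(x)
        feats = [att(f) for att, f in zip(self.attn, feats)]
        return self.neck(feats)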
C. Class-balanced loss

In our dataset, the number of samples is not balanced among the three classes of humans, elephants, and giraffes: human samples are the fewest, while giraffe samples are the most numerous. To suppress the influence of sample imbalance on model accuracy, we introduce class-balanced loss [16]. At present, there are two ways to deal with unbalanced samples: one is to expand the number of samples through data augmentation, and the other is to add a weight coefficient to the loss value of each class. In the second method, the weight coefficient α is defined as the inverse of the number of samples of a certain class:

α = 1/n,   (4)

where n is the number of samples. In fact, it is unreasonable to choose the weight coefficient this way, because not all samples are helpful for accuracy. Among the many samples of a certain class, there may be feature overlap; in other words, some samples can be obtained by certain transformations of other samples, which is somewhat similar to the concept of a maximal linearly independent set. Only new features can improve the performance of the network; repeated features do not help training. Therefore, we need to determine the number of these non-overlapping, unique samples. The effective number of samples E is defined as

E = (1 − β^n) / (1 − β),   (5)

where n is the number of samples, β = (N − 1)/N, and N is the total sample volume.
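For intuition, Eq. (5) can be checked numerically; the sample counts below are illustrative, not the dataset's actual class sizes.

def effective_number(n: int, beta: float) -> float:
    """Effective number of samples E = (1 - beta**n) / (1 - beta), Eq. (5)."""
    return (1.0 - beta ** n) / (1.0 - beta)

# With beta = 0.999, E saturates near 1 / (1 - beta) = 1000: a class with
# ten times more samples gains little extra effective weight, which is the
# point of weighting by 1/E instead of 1/n.
print(effective_number(1_000, 0.999))    # ~632.3
print(effective_number(10_000, 0.999))   # ~1000.0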
In our improved YOLOX, we use the class-balanced sigmoid cross-entropy loss as the classification loss. The original sigmoid cross-entropy loss is

CE_sigmoid(z, y) = −∑_{i=1}^{C} log(sigmoid(z_i^t)) = −∑_{i=1}^{C} log(1 / (1 + exp(−z_i^t))).   (6)

The class-balanced sigmoid cross-entropy loss is

CB_sigmoid(z, y) = (1 / E_{n_y}) CE_sigmoid(z, y) = −((1 − β) / (1 − β^{n_y})) ∑_{i=1}^{C} log(1 / (1 + exp(−z_i^t))),   (7)

where n_y is the number of samples of the ground-truth class y.
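A minimal PyTorch sketch of Eqs. (5)–(7), assuming one-hot targets; the value of β and the renormalization of the weights to sum to C mirror the reference implementation of [16] and are assumptions here, as are all names.

import torch
import torch.nn.functional as F

def class_balanced_sigmoid_ce(logits: torch.Tensor,
                              targets: torch.Tensor,
                              samples_per_class: torch.Tensor,
                              beta: float = 0.9999) -> torch.Tensor:
    """Class-balanced sigmoid cross-entropy, Eqs. (5)-(7).

    logits, targets: (N, C) with one-hot targets; samples_per_class: (C,).
    Each sample is weighted by (1 - beta) / (1 - beta ** n_y), the inverse
    effective number of samples of its ground-truth class y.
    """
    n = samples_per_class.float()
    effective_num = (1.0 - beta ** n) / (1.0 - beta)       # Eq. (5)
    weights = 1.0 / effective_num
    weights = weights / weights.sum() * weights.numel()    # keep loss scale
    w_y = (targets * weights).sum(dim=1, keepdim=True)     # weight of class y
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w_y * ce).sum(dim=1).mean()                    # Eq. (7)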
III. EXPERIMENTS AND RESULTS

A. Dataset and Experimental Setup

The image data used in the experiments come from the BIRDSAI [17] dataset. The images, taken by UAVs with TIR cameras, document people and animals in African nature reserves. We selected 3300 images, of which 2700 were used for training and 600 for testing. The labels cover three species, namely Human, Elephant, and Giraffe.

We implement the models with PyTorch 1.9.0 and Python 3.7, and train and test them on a Tesla V100 GPU. YOLOX-s is the base model, and the learning rate is set to 0.01. Stochastic gradient descent (SGD) is chosen as the optimizer, with a weight decay of 0.0005 and a momentum of 0.9. The total number of epochs and the batch size are 100 and 16, respectively.
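The optimizer setup described above can be sketched as follows; `build_optimizer` and its argument are placeholder names, not the authors' code.

import torch
from torch import nn

def build_optimizer(model: nn.Module) -> torch.optim.SGD:
    """SGD as stated: lr 0.01, momentum 0.9, weight decay 5e-4."""
    return torch.optim.SGD(model.parameters(), lr=0.01,
                           momentum=0.9, weight_decay=0.0005)

EPOCHS, BATCH_SIZE = 100, 16   # total epochs and batch size from the text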

B. Qualitative Result

As mentioned earlier, we designed nine schemes with soft pooling. As shown in Table I, the CAM with max and soft pooling, as well as the entire CBAM with average and soft pooling, achieve the best mAP of 0.444. We then combine these two schemes with class-balanced loss; the ablation results are shown in Table II. YOLOX combined with CAM-SOFT-MAX and class-balanced loss performs better, reaching an mAP of 0.459, and we name this model YOLOX-IRI. However, the mAP drops to 0.429 when YOLOX is combined with CBAM-SOFT-AVG and class-balanced loss; the reason may be that CBAM-SOFT-AVG conflicts with class-balanced loss. Compared with the original YOLOX, YOLOX-IRI improves the mAP by 2.5 points. Besides, we also compare our method with other common detectors. Based on the results in Table III, our YOLOX-IRI achieves the best performance on this dataset.

TABLE I
DIFFERENT SCHEMES OF IMPROVED CBAM. "✓" INDICATES THAT THE MODULE HAS BEEN MODIFIED, AND "—" INDICATES THAT IT REMAINS ORIGINAL.

Scheme          CAM   SAM   mAP
MAX-AVG-SOFT    ✓     —     0.430
MAX-AVG-SOFT    —     ✓     0.439
MAX-AVG-SOFT    ✓     ✓     0.433
MAX-SOFT        ✓     —     0.444
MAX-SOFT        —     ✓     0.430
MAX-SOFT        ✓     ✓     0.430
AVG-SOFT        ✓     —     0.428
AVG-SOFT        —     ✓     0.441
AVG-SOFT        ✓     ✓     0.444
Baseline        —     —     0.434

TABLE II
ABLATION STUDY.

Model        CAM-SOFT-MAX   CBAM-SOFT-AVG   CBLoss   Recall   Precision   mAP
YOLOX        —              —               —        0.432    0.581       0.434
             ✓              —               —        0.439    0.585       0.444
             —              ✓               —        0.437    0.591       0.444
YOLOX-IRI    ✓              —               ✓        0.451    0.604       0.459
             —              ✓               ✓        0.431    0.556       0.429

Some detection results are shown in Fig. 6. Each row is an image and each column is a method; the first, second, and third rows show humans, elephants, and giraffes, respectively, and the first column is the ground truth. These results show that our YOLOX-IRI has better performance.

Fig. 6. Detection results.

TABLE III
COMPARISON WITH OTHER METHODS.

Model                Backbone                       mAP
Cascade R-CNN [18]   ResNet50                       0.191
Libra R-CNN [19]     ResNet50                       0.415
TOOD [20]            ResNet50                       0.252
ATSS [21]            ResNet50                       0.184
PVT [22]             Pyramid-Vision-Transformer-S   0.258
DETR [23]            ResNet50                       0.164
YOLOV5 [24]          CSPDarkNet53                   0.266
YOLOX [13]           CSPDarkNet53                   0.434
YOLOX-IRI (Ours)     CSPDarkNet53                   0.459

IV. CONCLUSION

In this letter, an improved YOLOX called YOLOX-IRI is proposed for infrared image object detection. To increase detection accuracy, YOLOX is combined with an improved CBAM and class-balanced loss. The improved CBAM introduces soft pooling into CBAM, which makes the network focus on the object regions, while class-balanced loss suppresses the impact of the unbalanced sample distribution. The experimental results indicate that our YOLOX-IRI achieves a higher mAP than other common detectors on our dataset.

REFERENCES

[1] Y. Zhao, L. Zhao, C. Li, and G. Kuang, "Pyramid attention dilated network for aircraft detection in SAR images," IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 4, pp. 662–666, 2021.
[2] H. Fang, M. Xia, G. Zhou, Y. Chang, and L. Yan, "Infrared small UAV target detection based on residual image prediction via global and local dilated residual networks," IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
[3] X. Dong, R. Fu, Y. Gao, Y. Qin, Y. Ye, and B. Li, "Remote sensing object detection based on receptive field expansion block," IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.
[4] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
[5] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 1. IEEE, 2005, pp. 886–893.
[6] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian detection: An evaluation of the state of the art," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743–761, 2011.
[7] Y. Freund, R. E. Schapire et al., "Experiments with a new boosting algorithm," in ICML, vol. 96. Citeseer, 1996, pp. 148–156.
[8] P.-H. Chen, C.-J. Lin, and B. Schölkopf, "A tutorial on ν-support vector machines," Applied Stochastic Models in Business and Industry, vol. 21, no. 2, pp. 111–136, 2005.
[9] K. Geng and G. Yin, "Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles," IEEE Access, vol. 8, pp. 88227–88240, 2020.
[10] S. Li, Y. Li, Y. Li, M. Li, and X. Xu, "YOLO-FIRI: Improved YOLOv5 for infrared image object detection," IEEE Access, vol. 9, pp. 141861–141875, 2021.
[11] H. Wan, J. Chen, Z. Huang, R. Xia, B. Wu, L. Sun, B. Yao, X. Liu, and M. Xing, "AFSar: An anchor-free SAR target detection algorithm based on multiscale enhancement representation learning," IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
[12] G. Zhou, F. Wu, J. He, H. Li, C. Zhang, and G. Yang, "Fast thermal infrared image ground object detection method based on deep learning algorithm," in 2021 6th International Conference on Communication, Image and Signal Processing (CCISP), 2021, pp. 59–63.
[13] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," arXiv preprint arXiv:2107.08430, 2021.
[14] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional block attention module," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[15] A. Stergiou, R. Poppe, and G. Kalliatakis, "Refining activation downsampling with SoftPool," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10357–10366.
[16] Y. Cui, M. Jia, T.-Y. Lin, Y. Song, and S. Belongie, "Class-balanced loss based on effective number of samples," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9260–9269.
[17] E. Bondi, R. Jain, P. Aggrawal, S. Anand, R. Hannaford, A. Kapoor, J. Piavis, S. Shah, L. Joppa, B. Dilkina, and M. Tambe, "BIRDSAI: A dataset for detection and tracking in aerial thermal infrared videos," in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 1736–1745.
[18] Z. Cai and N. Vasconcelos, "Cascade R-CNN: Delving into high quality object detection," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6154–6162.
[19] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, "Libra R-CNN: Towards balanced learning for object detection," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 821–830.
[20] C. Feng, Y. Zhong, Y. Gao, M. R. Scott, and W. Huang, "TOOD: Task-aligned one-stage object detection," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 3490–3499.
[21] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li, "Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9756–9765.
[22] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, "Pyramid vision transformer: A versatile backbone for dense prediction without convolutions," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 548–558.
[23] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, "End-to-end object detection with transformers," in European Conference on Computer Vision. Springer, 2020, pp. 213–229.
[24] G. Jocher, "YOLOv5 by Ultralytics," May 2020. [Online]. Available: https://github.com/ultralytics/yolov5
