Moving Traffic Object Detection Based on Bayesian Theory Fusion
https://doi.org/10.1007/s42154-023-00245-0
Received: 14 November 2022 / Accepted: 20 July 2023 / Published online: 19 July 2024
© China Society of Automotive Engineers (China SAE) 2024
Abstract
In order to improve the performance of object detection algorithms in dynamic traffic scenarios, a moving traffic object detection method based on Bayesian theory fusion is proposed. To obtain initial object detection results, an adaptive coordinate attention YOLO (ACA-YOLO) network with high accuracy and a multi-scale optical flow (MSOF) method with high sensitivity to dynamic objects are applied, respectively. To enhance the detection performance of the YOLOv5 network, the adaptive coordinate attention (ACA) mechanism is applied to obtain more accurate location and identification of objects of interest. To address the issue of constant loss values when an inclusion relationship between truth boxes and prediction boxes occurs, the loss function utilizes efficient-IoU instead of generalized-IoU. Fusion weights are obtained by calculating the posterior probabilities of the ACA-YOLO network and the MSOF method separately using the Bayesian formula, after matching object detection regions based on intersection over union (IoU). The fusion detection results are determined by these posterior probabilities. The proposed detection method was tested on the KITTI dataset and on a self-built continuous moving traffic objects dataset consisting of real continuous dynamic traffic scenarios. The experimental results indicate that the proposed moving traffic object detection method based on Bayesian theory fusion has outstanding performance. The mean average precision, recall, and precision of the proposed method on the KITTI dataset reach 0.954, 0.912, and 0.944, which are 5.5%, 9.9%, and 1.9% higher than those of the traditional YOLOv5 network.
gradient (HOG), scale invariant feature transform (SIFT), and oriented FAST and rotated BRIEF (ORB) [5]. The classifiers include support vector machines (SVM) [6] and random forest [7]. Recently, algorithms based on optical flow [8–10] have been proposed. Unlike other algorithms, they use the optical flow information of objects as features instead of appearance information and are therefore highly sensitive to moving objects. Traditional detection algorithms suffer from low detection efficiency, accuracy, and robustness. Compared with traditional detection algorithms, deep learning algorithms [11, 12] exhibit superior feature extraction ability and detection accuracy. Thus, deep learning algorithms have become the prevailing research trend in the field of object detection.

Deep learning algorithms can be divided into two-stage algorithms and one-stage algorithms according to whether candidate regions are generated in advance. The two-stage algorithms first generate a large number of candidate regions, then classify the candidate regions using classifiers, and perform regression correction on the candidate regions. Representative two-stage algorithms are the region-based CNN (R-CNN) series. Currently, the state-of-the-art algorithms [13–16] are mainly based on the network in Ref. [17]. Inspired by Faster R-CNN, Cai et al. [18] proposed a 3D object detection algorithm that fuses lidar and camera and generates 3D proposal regions for objects. The one-stage algorithms do not need to generate candidate regions and regard both positioning and classification as regression problems. Compared with the two-stage algorithms based on candidate regions, the one-stage algorithms have higher detection efficiency and slightly lower detection accuracy. Representative one-stage algorithms are the YOLO series, which integrate classification, positioning, detection, and other functions into one network. Currently, the state-of-the-art algorithms shown in Ref. [19–26] are mainly based on YOLOv1-3 [27–29]. To better extract object features, the algorithms shown in Ref. [30–33] have been proposed. To enhance the robustness of the loss function and to obtain more accurate detection results, the algorithms in Ref. [34–37] have been proposed. The loss function based on Distance-IoU [34] considers the center point distance between prediction boxes and truth boxes. The loss function based on Complete-IoU [34] considers the aspect ratio of boxes and improves the regression accuracy. The loss function based on EIoU [35] considers the true difference of the length and width of boxes. The loss function based on Scylla-IoU [36] considers the vector angle between truth boxes and prediction boxes. The loss function based on Wise-IoU [37] focuses on reducing the impact of low-quality examples.

The deep learning algorithms mentioned above are mostly based on convolutional neural networks and extract the spatial features of objects in images to achieve detection, whereas their temporal connection is weak. To establish a temporal connection, these algorithms are combined with other methods, such as LSTM and Deep SORT [38]. Teng et al. [39] proposed the clip-LSTM algorithm and applied it to detect objects in remote sensing images. Fan et al. [9] proposed an object detection framework based on optical flow features which focuses on foreground objects. Jie et al. [40] and Gai et al. [41] adopted YOLO and Deep SORT to achieve multi-object tracking.

Bayesian theory, proposed by the British mathematician Thomas Bayes, has been widely applied in artificial intelligence and machine learning. Compared with traditional estimation, Bayesian theory starts with a subjective prior probability and can constantly correct this probability. Currently, the algorithms shown in Ref. [42–44] have been proposed, which enhance detection performance by fusing features based on Bayesian theory.

YOLO series algorithms have become the prevailing object detection algorithms at present, due to their high detection accuracy and real-time performance. However, their detection results may be affected by changes of color, brightness, and other parameters, and their recognition accuracy for small and medium objects in images is limited. Optical flow based algorithms can effectively alleviate these issues, benefiting from high sensitivity to moving traffic objects and no dependency on large-scale datasets. As traditional algorithms, however, optical flow based algorithms still have the problems of poor robustness and low detection accuracy. Gao et al. [45] proposed a hybrid strategy for traffic light detection, which combines traditional algorithms and deep learning algorithms. In this paper, a moving traffic object detection method based on Bayesian theory fusion is proposed. The initial object detection results are obtained by the ACA-YOLO network and the MSOF method separately. Then, the fusion results are obtained by calculating the posterior probabilities.

The main contributions of this paper are: (1) To enhance the feature extraction ability for objects with different scales, an ACA-YOLO network is proposed by introducing the ACA mechanism into the original YOLOv5. (2) To improve the detection accuracy of moving objects, an MSOF algorithm is proposed, which has high sensitivity to moving traffic objects. (3) An object detection method is proposed to enhance the detection performance in dynamic traffic scenarios, which fuses the detection results of the ACA-YOLO network and the MSOF method by calculating the Bayesian posterior probabilities.

The structure of the rest of this paper is as follows: The second section introduces the moving traffic object detection method based on Bayesian theory fusion, as well as the ACA-YOLO network and MSOF algorithm in detail. The third section introduces the experimental results and analysis, including the self-built moving traffic objects dataset, the detection performance of the proposed algorithm in dynamic traffic scenarios, an ablation study, and a comparison
with state-of-the-art algorithms on the KITTI dataset. The conclusions are presented in the fourth section.

2 Moving Traffic Object Detection Algorithm Based on Bayesian Theory Fusion

The flow chart of the fusion algorithm based on Bayesian theory is shown in Fig. 1. For the input continuous images, moving traffic objects are detected by the ACA-YOLO network and the MSOF algorithm separately. Then, the detection results of the two algorithms are fused based on Bayesian theory.

Fig. 1 Flow chart of moving traffic object detection based on Bayesian theory fusion

2.1 ACA-YOLO Network

Although the YOLOv5 network has good detection performance, there are still some deficiencies: (1) Detection results of the YOLOv5 network for small objects depend intensively on shallow features. However, the large number of convolution operations in the backbone easily causes information loss during the process of feature extraction. (2) If there is an inclusion relationship between prediction boxes and truth boxes, the loss function of the regression boxes is constant, which means that further optimization and high-precision positioning cannot be achieved.

To enhance the feature extraction ability for objects with different scales, a novel ACA mechanism is introduced into the backbone. Compared with the traditional mechanism, the ACA mechanism enhances the feature extraction ability in the x and y directions. In addition, the loss function of boundary box regression utilizes EIoU instead of GIoU.

The output of the ACA mechanism is fed into the Spatial Pyramid Pooling–Fast (SPPF) module, which is the last layer of the backbone. Thus, the backbone network captures not only cross-channel information but also direction-aware and location-aware information, which helps the model to locate and identify the objects of interest more accurately. The overview of the ACA mechanism is presented in Table 1.

Table 1 Overview architecture of ACA mechanism

Name               Patch size/stride   Output size
AdaptiveAvgPool_h  None × 1            512 × 15 × 1
AdaptiveAvgPool_w  1 × None            512 × 20 × 1
Concat                                 512 × 35 × 1
Conv1              1 × 1/1             16 × 35 × 1
BatchNorm                              16 × 35 × 1
Conv_h             1 × 1/1             512 × 15 × 1
Conv_w             1 × 1/1             512 × 1 × 20
Parameter_h                            512 × 15 × 1
Parameter_w                            512 × 1 × 20
Output                                 512 × 15 × 20
First of all, each channel of the feature vector X with input dimension C × H × W is coded separately in the horizontal and vertical directions through average pooling. The average pooling sizes are H × 1 and 1 × W respectively, and two feature graphs z^h and z^w are obtained. The z^h and z^w are spliced through Concat, and the intermediate feature graph f is generated by the 1 × 1 convolution transform function F_1, which can be described as

f = \delta(F_1([z^h, z^w]))   (1)

where δ is a nonlinear activation function.

The feature graph f is divided into two tensors f^h and f^w along the spatial dimensions, and a 1 × 1 convolution transformation is performed separately. To obtain the attention weights g^h and g^w for the input feature vector X, the dimensions of f^h and f^w are transformed into the dimensions of the feature vector X, which can be described as

\begin{cases} g^h = \sigma(\alpha_h F_h(f^h)) \\ g^w = \sigma(\alpha_w F_w(f^w)) \end{cases}   (2)

where F_h and F_w are 1 × 1 convolution transformations, α_h and α_w are adaptive learning weights in the w and h directions with initial values of 1, and σ is the sigmoid activation function. Then, the output y can be obtained as follows:

y_c(i, j) = x_c(i, j) \times g^h_c(i) \times g^w_c(j)   (3)

where c represents the number of channels, i represents the height, and j represents the width.
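For reference, the following is a minimal PyTorch sketch of an attention block that follows Eqs. (1)–(3) and the layer layout of Table 1. The reduction ratio, the choice of ReLU for the nonlinearity δ, and all module and parameter names are assumptions of this sketch, not details released by the authors.

```python
import torch
import torch.nn as nn

class AdaptiveCoordinateAttention(nn.Module):
    """Sketch of an ACA-style block following Eqs. (1)-(3) and Table 1."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # X -> C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # X -> C x 1 x W
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                # delta in Eq. (1), assumed
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h in Eq. (2)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w in Eq. (2)
        self.alpha_h = nn.Parameter(torch.ones(1))      # adaptive weight, initialised to 1
        self.alpha_w = nn.Parameter(torch.ones(1))      # adaptive weight, initialised to 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z_h = self.pool_h(x)                          # B x C x H x 1
        z_w = self.pool_w(x).permute(0, 1, 3, 2)      # B x C x W x 1
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # Eq. (1)
        f_h, f_w = torch.split(f, [h, w], dim=2)      # split back along the spatial dim
        f_w = f_w.permute(0, 1, 3, 2)                 # B x mid x 1 x W
        g_h = torch.sigmoid(self.alpha_h * self.conv_h(f_h))   # Eq. (2)
        g_w = torch.sigmoid(self.alpha_w * self.conv_w(f_w))
        return x * g_h * g_w                          # Eq. (3), broadcast over H and W

# Example: the 512 x 15 x 20 feature map of Table 1
if __name__ == "__main__":
    aca = AdaptiveCoordinateAttention(512)
    print(aca(torch.randn(1, 512, 15, 20)).shape)     # torch.Size([1, 512, 15, 20])
```

In a YOLOv5-style backbone, this block would sit directly before the SPPF layer, as described above.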
To address the issue of a zero loss value when prediction boxes and truth boxes do not intersect, the minimum bounding rectangle has been introduced by GIoU. However, the issue of constant loss values has not been addressed when an inclusion relationship between truth boxes and prediction boxes occurs. To improve the regression results of prediction boxes, the loss function based on EIoU considers the difference of overlap area, distance of center points, height, and width between prediction boxes and truth boxes. The calculation formula is as follows:

\mathrm{EIoU} = \mathrm{IoU} - \frac{d^2}{c^2} - \frac{(h_d - h_t)^2}{h_c^2} - \frac{(w_d - w_t)^2}{w_c^2}   (4)

where IoU is the intersection ratio between prediction boxes and truth boxes, d is the distance between the center points of prediction boxes and truth boxes, h_d and w_d are the height and width of prediction boxes, and h_t and w_t are the height and width of truth boxes.
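As an illustrative reference, the sketch below evaluates Eq. (4) for a single box pair. It assumes the usual EIoU convention of Ref. [35], i.e., that c, h_c, and w_c are the diagonal, height, and width of the smallest rectangle enclosing both boxes, and that boxes are given as (x1, y1, x2, y2); neither assumption is spelled out in the excerpt above.

```python
def eiou(box_p, box_t):
    """EIoU of Eq. (4) for a prediction box and a truth box in (x1, y1, x2, y2) form.
    Assumption: c, h_c, w_c come from the smallest enclosing rectangle (Ref. [35])."""
    px1, py1, px2, py2 = box_p
    tx1, ty1, tx2, ty2 = box_t

    # intersection and union -> IoU
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / (union + 1e-9)

    # squared centre-point distance d^2
    d2 = ((px1 + px2 - tx1 - tx2) / 2) ** 2 + ((py1 + py2 - ty1 - ty2) / 2) ** 2

    # smallest enclosing box: width w_c, height h_c, squared diagonal c^2
    wc = max(px2, tx2) - min(px1, tx1)
    hc = max(py2, ty2) - min(py1, ty1)
    c2 = wc ** 2 + hc ** 2

    wd, hd = px2 - px1, py2 - py1
    wt, ht = tx2 - tx1, ty2 - ty1
    return iou - d2 / (c2 + 1e-9) - (hd - ht) ** 2 / (hc ** 2 + 1e-9) - (wd - wt) ** 2 / (wc ** 2 + 1e-9)

# The corresponding regression loss is typically taken as 1 - EIoU.
```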
2.2 MSOF Algorithm

The ViBe algorithm is applied for foreground detection. Specifically, the background model is initialized first, and each pixel is assigned a dataset that stores its previous pixel values and the pixel values of its neighbors. In the model, the number of pixel values stored in the dataset allocated to each pixel is set to 20, and its neighborhood is the 9 adjacent pixels.

For a new frame, the new value of each pixel point is compared with its dataset. If the difference between the current pixel value and a certain pixel value in the dataset is less than 20, the current pixel value is considered to be close to that pixel value. When there are more than two values in the dataset close to the current pixel value, the pixel point is judged as a background point; otherwise, it is judged as a foreground point. An algorithm for updating the background model is proposed, as shown in Table 2. If the pixel point was previously a background point, the difference between it and its dataset is used to judge whether the pixel point becomes a foreground point. If the pixel point was a foreground point, it is judged as a foreground point directly. When a foreground point has been detected for 50 consecutive times, the point is reset as a background point. In this model, only the background points are updated randomly, and the update probability is set as 6.25%.

Table 2 Background model update algorithm

Algorithm 1: Background model update
Input: pixel point p, dataset D, background point set B, foreground point set F, detection time N
Output: D, B, F and N
Process:
1: if p ∈ B: compute the difference diff = |p − D_i| for each sample D_i in D; if fewer than the required number of samples satisfy diff < threshold, move p to F and set N(p) = 1; otherwise keep p in B and randomly update D with probability 6.25%
2: if p ∈ F: keep p in F and set N(p) = N(p) + 1
3: if N(p) ≥ 50: move p back to B and reset N(p) = 0
4: return D, B, F and N
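The following is a minimal per-pixel sketch of the update rule of Table 2. The numerical parameters (20 samples, matching threshold 20, at least two matches, 6.25% update probability, reset after 50 consecutive foreground detections) are taken from the text above; the data layout and function name are assumptions of this sketch.

```python
import numpy as np

# ViBe-style per-pixel background test and update, following Table 2.
N_SAMPLES, MATCH_THR, MIN_MATCHES, UPDATE_PROB, RESET_COUNT = 20, 20, 2, 0.0625, 50

def update_pixel(value, samples, is_background, fg_count, rng=np.random.default_rng()):
    """value: current pixel intensity; samples: array of N_SAMPLES stored values.
    Returns (is_background, fg_count) after processing the current frame."""
    if is_background:
        matches = np.count_nonzero(np.abs(samples.astype(int) - int(value)) < MATCH_THR)
        if matches >= MIN_MATCHES:
            # still a background point: update the model randomly with probability 6.25 %
            if rng.random() < UPDATE_PROB:
                samples[rng.integers(N_SAMPLES)] = value
            return True, 0
        return False, 1            # becomes a foreground point
    # previously a foreground point: stays foreground until detected 50 consecutive times
    fg_count += 1
    if fg_count >= RESET_COUNT:
        return True, 0             # reset as a background point
    return False, fg_count
```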
All methods of optical flow assume that the pixels of dynamic objects maintain the same brightness between image frames and that the motion of pixels between adjacent image frames has a small intensity. Thus, the optical flow values of pixel points can be determined as follows:

I_x u + I_y v = -I_t   (6)

where I_x, I_y, and I_t are the partial derivatives of pixel points with respect to the directions x, y, and time t separately, and u and v are the velocities of pixel points in the x and y directions.

The assumption of spatial consistency is added by the Lucas–Kanade (LK) optical flow method, which indicates that each pixel in the image frame moves in a similar way to its neighbor pixels. For a sufficiently small window n × n, it can be considered that all pixels in the window have the same value of optical flow. Then, the optical flow can be obtained by the least square method. The detailed equations are:

\begin{bmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn^2} & I_{yn^2} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} I_{t1} \\ I_{t2} \\ \vdots \\ I_{tn^2} \end{bmatrix}   (7)

\begin{bmatrix} \sum I_{xi}^2 & \sum I_{xi} I_{yi} \\ \sum I_{yi} I_{xi} & \sum I_{yi}^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = -\begin{bmatrix} \sum I_{xi} I_{ti} \\ \sum I_{yi} I_{ti} \end{bmatrix}   (8)
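A NumPy sketch of solving Eq. (8) for a single small window is given below. The central-difference gradient estimate, the conditioning check, and the window size in the usage comment are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def lk_flow_window(prev_win: np.ndarray, curr_win: np.ndarray) -> np.ndarray:
    """Solve the least-squares system of Eq. (8) for one n x n window (float arrays)."""
    Ix = np.gradient(prev_win, axis=1)          # spatial derivative in x
    Iy = np.gradient(prev_win, axis=0)          # spatial derivative in y
    It = curr_win - prev_win                    # temporal derivative

    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # left-hand matrix of Eq. (7)
    b = -It.ravel()
    G = A.T @ A                                 # normal equations of Eq. (8)
    if np.linalg.cond(G) > 1e6:                 # ill-conditioned window: no reliable flow
        return np.zeros(2)
    return np.linalg.solve(G, A.T @ b)          # [u, v]

# Example: flow of a 5 x 5 patch centred at (x, y) between two grayscale frames
# uv = lk_flow_window(prev[y-2:y+3, x-2:x+3].astype(float),
#                     curr[y-2:y+3, x-2:x+3].astype(float))
```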
The LK optical flow method assumes that the motion intensity of pixels in motion between adjacent image frames is small. However, the calculated values of optical flow may have a large error for pixels with a large motion speed. To address this issue, MSOF is proposed to calculate the value of optical flow, as shown in Table 3. To meet the assumption of small pixel motion, an image pyramid is constructed by scaling down the original image, as shown in Fig. 2. In the highest layer, the values of optical flow are calculated. Then, the values of the upper layer are taken as initial values, and the values in the next layer are iteratively calculated by the LK optical flow method. In this way, the values of optical flow in the bottom image are obtained.

Table 3 Multi-scale optical flow algorithm

Input: previous frame image I, current frame image J
Output: optical flow value d
Process:
1: construct image pyramids for I and J: {I^L}_{L=0,1,2,3,4}, {J^L}_{L=0,1,2,3,4}
2: initialize the optical flow at the highest layer: g^4 = [0 0]^T
3: for L = 4; L ≥ 0; L−−:
     calculate the position of pixel u in I^L: u^L = [p_x p_y]^T = u / 2^L
     calculate the gradient of I^L in the x direction: I_x(x, y) = (I^L(x+1, y) − I^L(x−1, y)) / 2
     calculate the gradient of I^L in the y direction: I_y(x, y) = (I^L(x, y+1) − I^L(x, y−1)) / 2
     calculate the spatial gradient matrix: G = Σ_x Σ_y [I_x^2, I_x I_y; I_x I_y, I_y^2]
     set the initial iteration value: ν^0 = [0 0]^T
     for k = 1; k ≤ K or ‖η‖ < threshold; k++:
        calculate the image difference: δI(x, y) = I^L(x, y) − J^L(x + g_x^L + ν_x^{k−1}, y + g_y^L + ν_y^{k−1})
        calculate the mismatch vector: b = Σ_x Σ_y [δI·I_x; δI·I_y]
        calculate the optical flow increment: η = G^{−1} b
        provide the initial value for the next iteration: ν^k = ν^{k−1} + η
     end
     optical flow optimization value in layer L: d^L = ν^K
     optical flow initial value in layer L−1: g^{L−1} = 2(g^L + d^L)
   end
4: calculate the final optical flow value: d = g^0 + d^0
5: return d
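For a compact illustration of the coarse-to-fine scheme of Table 3, the sketch below uses OpenCV's pyramidal LK tracker as a stand-in for the listing above; the 5-level pyramid (maxLevel=4), the 15 × 15 window, and the termination criteria are assumptions of this sketch rather than values given in the paper.

```python
import cv2
import numpy as np

def multiscale_flow(prev_gray: np.ndarray, curr_gray: np.ndarray, points: np.ndarray):
    """Coarse-to-fine LK flow at the given pixel positions, in the spirit of Table 3.
    points: float32 array of shape (N, 1, 2) holding (x, y) coordinates."""
    new_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, points, None,
        winSize=(15, 15), maxLevel=4,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    flow = (new_pts - points).reshape(-1, 2)     # per-point (u, v)
    flow[status.ravel() == 0] = 0.0              # untracked points get zero flow
    return flow

# Example: flow of every 8th pixel inside one dynamic object region (x0, y0, x1, y1)
# ys, xs = np.mgrid[y0:y1:8, x0:x1:8]
# pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32).reshape(-1, 1, 2)
# uv = multiscale_flow(prev_gray, curr_gray, pts)
```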
To reduce the computational complexity of the optical flow based object detection algorithm, each dynamic object region extracted by the ViBe algorithm is represented by a rectangular object box that contains all of its pixels. Specifically, the dynamic object region is searched for the smallest x and y coordinates, named x0 and y0, and the largest coordinates, named x1 and y1. Then (x0, y0) is set as the upper left point of the rectangular region and (x1, y1) as the bottom right point of the rectangular region. To avoid the interference of noise, regions which contain fewer than 10 pixels are determined to be noise and discarded.

Considering the influence of object motion, the object boxes are appropriately enlarged about the centers of the boxes, and the magnification factor is set to 1.5. Then, the image information of the corresponding region is extracted from the original image, which is applied to calculate the optical flow values of the dynamic object regions.

The calculated values of optical flow in an image are shown as green dots and green lines in Fig. 3. When only green dots are displayed, the optical flow values of the pixel points are zero. When green lines are displayed, the pixels have nonzero values of optical flow. The length of the green lines is related to the magnitude of the optical flow, and the direction of the green lines is related to the direction of the optical flow. The average value of optical flow in each dynamic object region is taken as the optical flow feature, and the object box is taken as the detection result for the region. The optical flow feature of each region is applied as the input of an SVM to classify the category of the region. In particular, a single SVM can only deal with binary classification, which is difficult to meet the requirement of multiple classes in this paper. Therefore, multiple SVMs are applied for object classification.
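A sketch of the region-to-box step and the subsequent classification is given below, under several assumptions: cv2.connectedComponentsWithStats is used as a stand-in for the region extraction, the flow field is assumed to be available as a dense H × W × 2 array, the mean (u, v) flow is used as the feature vector, and scikit-learn's one-vs-rest SVM plays the role of the "multiple SVMs" mentioned above. The 10-pixel noise filter and the 1.5× enlargement follow the text.

```python
import cv2
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

def regions_to_boxes(fg_mask: np.ndarray, min_pixels: int = 10, scale: float = 1.5):
    """Turn a ViBe foreground mask into enlarged object boxes (x0, y0, x1, y1).
    Regions with fewer than 10 pixels are discarded as noise; boxes are enlarged
    about their centre by a factor of 1.5, as described in the text."""
    h, w = fg_mask.shape
    n, _labels, stats, _ = cv2.connectedComponentsWithStats(fg_mask.astype(np.uint8))
    boxes = []
    for i in range(1, n):                                   # label 0 is the background
        x, y, bw, bh, area = stats[i]
        if area < min_pixels:
            continue
        cx, cy = x + bw / 2, y + bh / 2
        hw, hh = bw * scale / 2, bh * scale / 2
        boxes.append((max(0, int(cx - hw)), max(0, int(cy - hh)),
                      min(w - 1, int(cx + hw)), min(h - 1, int(cy + hh))))
    return boxes

def classify_regions(boxes, flow_uv, clf):
    """Mean (u, v) flow inside each box is used as the feature for a multi-class SVM.
    The exact feature layout is an assumption of this sketch."""
    feats = [flow_uv[y0:y1 + 1, x0:x1 + 1].reshape(-1, 2).mean(axis=0)
             for x0, y0, x1, y1 in boxes]
    return clf.predict(np.asarray(feats)) if feats else np.array([])

# One-vs-rest SVMs, trained on labelled flow features (hypothetical arrays):
# clf = OneVsRestClassifier(SVC()).fit(train_features, train_labels)
```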
2.3 Object Detection Based on Bayesian Theory Fusion

Different from other estimation methods, Bayesian theory starts with a subjective prior probability and constantly modifies the probability as new evidence is observed.
For a complete set of events {B_i}, the total probability of an event B is

P(B) = \sum_{i=1}^{\infty} P(B_i) P(B \mid B_i)   (11)
Fig. 4 Object region fusion based on Bayesian theory

Similarly, the posterior probability P(A_{A2} | D) based on the MSOF algorithm prior can be obtained as:

P(A_{A2} \mid D) = \frac{P_O \times P(D \mid A_{A2})}{P_O \times P(D \mid A_{A2}) + (1 - P_O) \times P(D \mid N_{A2})}   (16)

where P_O is the confidence of region A_O, P(D | A_{A2}) is the probability that the pixels in A_D are detected as an object by the MSOF algorithm, and P(D | N_{A2}) is the probability that the pixels in A_D are not detected as an object by the MSOF algorithm. P(A_{A1} | O) is taken as the fusion weight of A_D, and P(A_{A2} | D) is taken as the fusion weight of A_O. The fusion region A_f can be obtained as follows:

A_f = \begin{cases} A_D, & P(A_{A1} \mid O) \ge P(A_{A2} \mid D) \\ A_O, & P(A_{A1} \mid O) < P(A_{A2} \mid D) \end{cases}   (17)

If P(A_{A1} | O) is bigger, it indicates that the detection performance of the ACA-YOLO network is better and region A_D is chosen as the fusion region. If P(A_{A1} | O) is smaller, it indicates that the detection performance of the MSOF algorithm is better and region A_O is chosen as the fusion region.
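The sketch below illustrates the IoU-based region matching and the weight comparison of Eqs. (16)–(17). How the likelihood terms are estimated here (from pixel agreement between the two detectors' outputs) is an assumption made purely for illustration, since the full derivation of the posteriors is not reproduced in this excerpt; the 0.5 matching threshold in the comment is likewise an assumption.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes (x0, y0, x1, y1), used to match ACA-YOLO and MSOF regions."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def posterior(prior, p_evidence, p_evidence_given_not):
    """Bayes formula in the form of Eq. (16)."""
    num = prior * p_evidence
    return num / (num + (1.0 - prior) * p_evidence_given_not + 1e-9)

def fuse(box_d, conf_d, box_o, conf_o, fg_mask, det_mask):
    """Choose the fusion region A_f according to Eq. (17).
    Assumption: the likelihoods are estimated from pixel agreement, i.e. the
    fraction of A_D pixels flagged by the MSOF/ViBe mask and the fraction of
    A_O pixels covered by ACA-YOLO detections (det_mask)."""
    def frac(mask, box):
        x0, y0, x1, y1 = map(int, box)
        patch = mask[y0:y1 + 1, x0:x1 + 1]
        return float(patch.mean()) if patch.size else 0.0

    like_d = frac(fg_mask, box_d)    # MSOF evidence inside A_D
    like_o = frac(det_mask, box_o)   # ACA-YOLO evidence inside A_O
    p_a1_given_o = posterior(conf_d, like_d, 1.0 - like_d)   # fusion weight of A_D
    p_a2_given_d = posterior(conf_o, like_o, 1.0 - like_o)   # fusion weight of A_O, Eq. (16)
    return box_d if p_a1_given_o >= p_a2_given_d else box_o

# A detection pair is only fused when iou(box_d, box_o) exceeds a threshold (e.g. 0.5).
```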
3.1 Evaluation Index

\mathrm{Precision} = \frac{TP}{TP + FP}   (19)

\mathrm{mAP} = \frac{\sum AP}{num}   (21)

where TP represents the number of correctly detected objects of a category, FN represents the number of undetected objects of the category, and FP represents the number of incorrectly detected objects of the category. PR represents the precision–recall function, and num is the number of categories. In addition, the detection threshold is set to 0.5: when the IoU values between prediction boxes and truth boxes are smaller than 0.5, the detection boxes are eliminated.

3.2 Experimental Results in Dynamic Traffic Scenarios

To verify the detection performance of the proposed method based on Bayesian theory fusion in dynamic traffic scenarios, the CMTO dataset was constructed, which consists of continuous real traffic scenes in different environments. The acquisition platform of the CMTO dataset is shown in Fig. 5, and the camera model is TC480HD. To provide a good acquisition field for the camera, a cantilever beam was designed to carry the camera. The height of the cantilever beam above the ground can be adjusted, with four gears of 2.5 m, 3.5 m, 4.5 m, and 7 m. In addition, a lidar (LSC32) and a millimeter wave radar (DTAM D29) were also deployed on this platform for future research.

A total of 10,000 images were collected, labeled with four categories: pedestrian, cyclist, car, and large vehicle. The dataset covers moving traffic objects under a variety of scenarios, seasons, weather conditions, temperatures, light intensities, and other environmental variables. The 69,203 instance labels were manually annotated. Statistics of the labeled data are shown in Fig. 6, including 18,838 pedestrians, 11,338 cyclists, 34,299 cars, and 4728 large vehicles. Some image samples are shown in Fig. 7. The image resolution of the CMTO dataset is 640 × 480, and the format is JPG. The download link of the CMTO dataset is: https://drive.google.com/file/d/1m1BFzco3zueDMYL2mpVFjPYh0hl1OlwB/view?usp=share_link.

Fig. 6 Statistical chart of annotation in CMTO dataset

In total, 81% of the images are used for training, 9% for validation, and 10% for testing. In addition, rotation, translation, mirroring, splicing, and other image augmentation methods are applied to expand the training data.
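A small sketch reproducing the 81/9/10 split is shown below; the random seed and the shuffling strategy are assumptions, and the augmentation step is only indicated in the closing comment.

```python
import numpy as np

def split_indices(n_images: int, seed: int = 0):
    """81 % / 9 % / 10 % train/validation/test split of the image indices."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_images)
    n_train, n_val = int(0.81 * n_images), int(0.09 * n_images)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = split_indices(10000)
# Augmentation (rotation, translation, mirroring, splicing) would then be applied to the
# training subset only, e.g. with cv2.flip / cv2.warpAffine or a mosaic-style routine.
```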
Curves of precision and recall versus confidence are shown in Fig. 8(a) and (b), respectively. The recall reaches 0.895 and the precision reaches 0.933. The curve of the PR function is shown in Fig. 8(c). It indicates that the AP values of pedestrian, cyclist, car, and large vehicle are 0.838, 0.972, 0.969, and 0.989, respectively, with an mAP of 0.942. The detection performance in different dynamic traffic scenarios is shown in Fig. 9.
3.3 Ablation Study
detection performance, it is not ideal for some objects, such as small objects. Table 5 indicates that the detection method based on Bayesian theory fusion effectively reduces the impact of this issue by fusing the MSOF method, which is highly sensitive to moving objects.

3.4 Comparison with State-of-the-Art Object Detection Algorithms

The proposed method has been compared with state-of-the-art object detection algorithms, such as YOLOR, YOLOX, YOLOv6, and YOLOv7, on the KITTI dataset. The results are shown in Table 6. All models were trained for 100 epochs, and the input resolution was 640 × 640. Table 6 indicates that the proposed method has the best recall and mAP, and its precision is also at a high level. Compared with YOLOv7, which exhibits excellent detection performance, the proposed method has higher precision and mAP, which are increased by 0.1% and 1.1%, respectively. In particular, recall shows a more significant increase of 1.7% due to the high sensitivity to moving objects. Table 6 indicates that the detection method based on Bayesian theory fusion exhibits excellent detection performance.
Table 6 Comparison with state-of-the-art object detection algorithms on the KITTI dataset

Algorithm             Precision   Recall   mAP
YOLOR [22]            0.750       0.907    0.918
YOLOX [23]            0.955       0.904    0.908
YOLOv6 [25]           0.899       0.861    0.903
YOLOv7 [24]           0.943       0.895    0.943
The proposed method   0.944       0.912    0.954
4 Conclusions

In this paper, to enhance the feature extraction ability for objects with different scales, the ACA-YOLO network is proposed by applying the ACA mechanism and EIoU to YOLOv5. To achieve high sensitivity to moving traffic objects, the MSOF method is proposed; compared with traditional methods, the accuracy is increased and the calculation cost of optical flow is reduced. To enhance detection performance in dynamic traffic scenarios, a detection method based on Bayesian theory fusion is proposed, which combines the detection results of the ACA-YOLO network and the MSOF method to obtain fusion results. The experimental results indicate that the proposed detection method based on Bayesian theory fusion exhibits excellent performance. The mAP, recall, and precision are 5.5%, 9.9%, and 1.9% higher than those of the traditional YOLOv5 network on the KITTI dataset.

In the future, to include more scenarios, it is necessary to extend the CMTO dataset. Compared with public datasets such as COCO and KITTI, the current CMTO dataset is relatively small and lacks universality. In addition, the fusion process can be further optimized. Instead of selecting the better detection box for fusion, pixel-level fusion is expected to obtain better detection performance. Specifically, posterior probabilities are used as fusion weights of the pixels in the detection boxes of the ACA-YOLO network and the MSOF algorithm separately, and the pixel values of the two detection boxes are superimposed. A set of pixels that exceed a preset threshold is fitted to a detection box as the fusion result. Since lidar and radar are installed on the data acquisition platform, multi-sensor fusion will also be considered in future research.

Acknowledgements This work was supported in part by the National Natural Science Foundation of China (Grant nos. 52272414 and 51905095).

Declarations

Conflict of interest On behalf of all the authors, the corresponding author states that there is no conflict of interest.
References

1. Mahadevkar, S.V., Khemani, B., Patil, S., et al.: A review on machine learning styles in computer vision—techniques and future directions. IEEE Access 10, 107293–107329 (2022)
2. Ghedia, N.S., Vithalani, C.H.: Outdoor object detection for surveillance based on modified GMM and adaptive thresholding. Int. J. Inf. Technol. 13(1), 185–193 (2021)
3. Houhou, I., Zitouni, A., Ruichek, Y., et al.: Improving ViBe-based background subtraction techniques using RGBD information. Paper presented at 2022 7th International Conference on Image and Signal Processing and their Applications, IEEE, Mostaganem, 8–9 May 2022
4. Wang, Y., Lu, H., Gao, R., et al.: V-Vibe: a robust ROI extraction method based on background subtraction for vein images collected by infrared device. Infrar. Phys. Technol. 123, 104175 (2022)
5. Bansal, M., Kumar, M., Kumar, M.: 2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors. Multimed. Tools Appl. 80, 18839–18857 (2021)
6. Abdullah, D.M., Abdulazeez, A.M.: Machine learning applications based on SVM classification: a review. Qubahan Acad. J. 1(2), 81–90 (2021)
7. Chen, Y., Zheng, W., Li, W., et al.: Large group activity security risk assessment and risk early warning based on random forest algorithm. Pattern Recogn. Lett. 144, 1–5 (2021)
8. Caldelli, R., Galteri, L., Amerini, I., et al.: Optical flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recogn. Lett. 146, 31–37 (2021)
9. Fan, L., Zhang, T., Du, W.: Optical-flow-based framework to boost video object detection performance with object enhancement. Expert Syst. Appl. 170, 114544 (2021)
10. Ahn, H., Cho, H.J.: Research of multi-object detection and tracking using machine learning based on knowledge for video surveillance system. Pers. Ubiq. Comput. 66, 1–10 (2022)
11. Zaidi, S.S.A., Ansari, M.S., Aslam, A., et al.: A survey of modern deep learning based object detection models. Digit. Signal Process. 126, 103514 (2022)
12. Kang, J., Tariq, S., Oh, H., et al.: A survey of deep learning-based object detection methods and datasets for overhead imagery. IEEE Access 10, 20118–20134 (2022)
13. Sun, P., Zhang, R., Jiang, Y., et al.: Sparse R-CNN: end-to-end object detection with learnable proposals. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
14. Qiao, L., Zhao, Y., Li, Z., et al.: DeFRCN: decoupled faster R-CNN for few-shot object detection. Paper presented at the IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, 10–17 October 2021
15. Xie, X., Cheng, G., Wang, J., et al.: Oriented R-CNN for object detection. Paper presented at the IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, 10–17 October 2021
16. Avola, D., Cinque, L., Diko, A., et al.: MS-Faster R-CNN: multi-stream backbone for improved faster R-CNN object detection and aerial tracking from UAV images. Remote Sens. 13(9), 1670 (2021)
17. Girshick, R., Donahue, J., Darrell, T., et al.: Rich feature hierarchies for accurate object detection and semantic segmentation. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, 23–28 June 2014
18. Cai, Y., Zhang, T., Wang, H., et al.: 3D vehicle detection based on LiDAR and camera fusion. Automot. Innov. 2(4), 276–283 (2019). https://doi.org/10.1007/s42154-019-00083-z
19. Peng, L., Wang, H., Li, J.: Uncertainty evaluation of object detection algorithms for autonomous vehicles. Automot. Innov. 4(3), 241–252 (2021)
20. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
21. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-YOLOv4: scaling cross stage partial network. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
22. Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: You only learn one representation: unified network for multiple tasks. arXiv preprint arXiv:2105.04206 (2021)
23. Ge, Z., Liu, S., Wang, F., et al.: YOLOX: exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
24. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Vancouver, 17–24 June 2023
25. Li, C., Li, L., Jiang, H., et al.: YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022)
26. Xu, S., Wang, X., Lv, W., et al.: PP-YOLOE: an evolved version of YOLO. arXiv preprint arXiv:2203.16250 (2022)
27. Redmon, J., Divvala, S., Girshick, R., et al.: You only look once: unified, real-time object detection. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, 26 June–1 July 2016
28. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Hawaii, 21–26 July 2017
29. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
30. Woo, S., Park, J., Lee, J.Y., et al.: CBAM: convolutional block attention module. Paper presented at the European Conference on Computer Vision, Springer, Munich, 8–14 September 2018
31. Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
32. Qiao, S., Chen, L.C., Yuille, A.: DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
33. Hu, M., Li, Y., Fang, L., et al.: A2-FPN: attention aggregation based feature pyramid network for instance segmentation. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
34. Zheng, Z., Wang, P., Liu, W., et al.: Distance-IoU loss: faster and better learning for bounding box regression. Paper presented at the AAAI Conference on Artificial Intelligence, AAAI, New York, 7–12 February 2020
35. Zhang, Y.F., Ren, W., Zhang, Z., et al.: Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 506, 146–157 (2022)
36. Gevorgyan, Z.: SIoU loss: more powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740 (2022)
37. Tong, Z., Chen, Y., Xu, Z., et al.: Wise-IoU: bounding box regression loss with dynamic focusing mechanism. arXiv preprint arXiv:2301.10051 (2023)
38. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. Paper presented at 2017 IEEE International Conference on Image Processing, IEEE, Beijing, 17–20 September 2017
39. Teng, Z., Duan, Y., Liu, Y., et al.: Global to local: clip-LSTM-based object detection from remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–13 (2021)
40. Jie, Y., Leonidas, L., Mumtaz, F., et al.: Ship detection and tracking in inland waterways using improved YOLOv3 and Deep SORT. Symmetry 13(2), 308 (2021)
41. Gai, Y., He, W., Zhou, Z.: Pedestrian target tracking based on Deep SORT with YOLOv5. Paper presented at 2021 2nd International Conference on Computer Engineering and Intelligent Control, IEEE, Chongqing, 12–14 November 2021
42. Zhao, Z., Xu, S., Zhang, C., et al.: Bayesian fusion for infrared and visible images. Signal Process. 177, 107734 (2020)
43. Zhou, H., Dong, C., Wu, R., et al.: Feature fusion based on Bayesian decision theory for radar deception jamming recognition. IEEE Access 9, 16296–16304 (2021)
44. Wu, T., Hu, J., Ye, L., et al.: A pedestrian detection algorithm based on score fusion for multi-LiDAR systems. Sensors 21(4), 1159 (2021)
45. Gao, F., Wang, C.: Hybrid strategy for traffic light detection by combining classical and self-learning detectors. IET Intell. Transp. Syst. 14(7), 735–741 (2020)

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.