
Automotive Innovation (2024) 7:418–430

https://doi.org/10.1007/s42154-023-00245-0

Moving Traffic Object Detection Based on Bayesian Theory Fusion


Yuxiao Sun1 · Keke Geng1 · Weichao Zhuang1 · Guodong Yin1 · Xiaolong Chen1 · Jinhu Wang1 · Pengbo Ding1

Received: 14 November 2022 / Accepted: 20 July 2023 / Published online: 19 July 2024
© China Society of Automotive Engineers (China SAE) 2024

Abstract
In order to improve the performance of object detection algorithms in dynamic traffic scenarios, a moving traffic object detection method based on Bayesian theory fusion is proposed. To obtain initial object detection results, an adaptive coordinate attention YOLO (ACA-YOLO) network with high accuracy and a multi-scale optical flow (MSOF) method with high sensitivity to dynamic objects are applied, respectively. To enhance the detection performance of the YOLOv5 network, an adaptive coordinate attention (ACA) mechanism is applied to obtain more accurate location and identification of objects of interest. To address the issue of constant loss values when an inclusion relationship between truth boxes and prediction boxes occurs, the loss function uses efficient-IoU instead of generalized-IoU. Fusion weights are obtained by calculating the posterior probabilities of the ACA-YOLO network and the MSOF method separately with the Bayesian formula, after matching the object detection regions based on intersection over union (IoU). The fusion detection results are then determined by these posterior probabilities. The proposed detection method was tested on the KITTI dataset and on a self-built continuous moving traffic objects dataset that consists of real continuous dynamic traffic scenarios. The experimental results indicate that the proposed method for detecting moving traffic objects based on Bayesian theory fusion has outstanding performance. The mean average precision, recall, and precision of the proposed method on the KITTI dataset reach 0.954, 0.912, and 0.944, which are 5.5%, 9.9%, and 1.9% higher than those of the traditional YOLOv5 network.

Keywords Object detection · Bayesian theory · YOLOv5 · Optical flow method

Abbreviations
ACA          Adaptive coordinate attention
ACA-YOLO     Adaptive coordinate attention YOLO
AP           Average precision
CMTO         Continuous moving traffic objects
CNN          Convolutional neural network
EIoU         Efficient intersection over union
IoU          Intersection over union
LK           Lucas–Kanade
LSTM         Long short-term memory
mAP          Mean average precision
MSOF         Multi-scale optical flow
R-CNN        Region-based CNN
SPPF         Spatial pyramid pooling–Fast
SVM          Support vector machines
ViBe         Visual background extractor

* Corresponding authors: Keke Geng; Guodong Yin
1 School of Mechanical Engineering, Southeast University, Nanjing 211189, China

1 Introduction

With the development of modern society, the automobile, as an integral part of human life, is advancing towards intelligence. Detecting moving traffic objects is a crucial aspect of automobile intelligence. The primary objective of moving traffic object detection is to locate traffic participants, such as pedestrians, vehicles, and cyclists, by using the on-board sensing system, and to accurately classify their categories.

Object detection methods can be divided into traditional and deep learning algorithms. Traditional algorithms typically use sliding window or image segmentation technology to generate candidate regions, extract image features of each region, and classify them using classifiers [1]. Commonly used candidate region generation methods include the Gaussian mixture model [2] and the ViBe algorithm [3, 4].

Methods for extracting image features include the histogram of oriented gradient (HOG), the scale invariant feature transform (SIFT), and oriented fast and rotated brief (ORB) [5]. The classifiers include support vector machines (SVM) [6] and random forest [7]. Recently, algorithms based on optical flow [8–10] have been proposed. Unlike other algorithms, these algorithms use the optical flow information of objects as features instead of appearance information, which makes them highly sensitive to moving objects. Traditional detection algorithms suffer from low detection efficiency, accuracy, and robustness. Compared with traditional detection algorithms, deep learning algorithms [11, 12] exhibit superior feature extraction ability and detection accuracy. Thus, deep learning algorithms have become the prevailing research trend in the field of object detection.

Deep learning algorithms can be divided into two-stage algorithms and one-stage algorithms based on whether candidate regions are generated in advance. The two-stage algorithms first generate a large number of candidate regions, then classify the candidate regions using classifiers, and perform regression correction on the candidate regions. Representative two-stage algorithms are the region-based CNN (R-CNN) series. Currently, the state-of-the-art algorithms [13–16] are mainly based on the network of Ref. [17]. Inspired by Faster R-CNN, Cai et al. [18] proposed a 3D object detection algorithm fusing lidar and camera, which generates 3D proposal regions for objects. The one-stage algorithms do not need to generate candidate regions, and they regard both positioning and classification as regression problems. Compared with the two-stage algorithms based on candidate regions, the one-stage algorithms have higher detection efficiency and slightly lower detection accuracy. Representative one-stage algorithms are the YOLO series, which integrate classification, positioning, detection, and other functions into a single network. Currently, the state-of-the-art algorithms shown in Refs. [19–26] are mainly based on YOLOv1-3 [27–29]. To better extract object features, the algorithms shown in Refs. [30–33] have been proposed. To enhance the robustness of the loss function and to obtain more accurate detection results, the algorithms in Refs. [34–37] have been proposed. The loss function based on Distance-IoU [34] considers the center point distance between prediction boxes and truth boxes. The loss function based on Complete-IoU [34] considers the aspect ratio of boxes and improves the regression accuracy. The loss function based on EIoU [35] considers the true difference of the length and width of boxes. The loss function based on Scylla-IoU [36] considers the vector angle between truth boxes and prediction boxes. The loss function based on Wise-IoU [37] focuses on reducing the impact of low-quality examples.

The deep learning algorithms mentioned above mostly use convolutional neural networks to extract the spatial features of objects in images to obtain detection results. However, the temporal connection of these algorithms is weak. To establish a temporal connection, these algorithms are combined with other methods, such as LSTM and Deep Sort [38]. Teng et al. [39] proposed the clip-LSTM algorithm and applied it to detect objects from remote sensing images. Fan et al. [9] proposed an object detection framework based on optical flow features that focuses on foreground objects. Jie et al. [40] and Gai et al. [41] adopted YOLO and Deep Sort to achieve multi-object tracking.

Bayesian theory, proposed by the British mathematician Bayes, has been widely applied in artificial intelligence and machine learning. Compared with traditional estimation, Bayesian theory starts with a subjective prior probability and can constantly correct the probability. Currently, the algorithms shown in Refs. [42–44] have been proposed, which enhance detection performance by fusing features based on Bayesian theory.

YOLO series algorithms have become the prevailing object detection algorithms at present, due to their high detection accuracy and real-time performance. However, the detection results may be affected by changes in color, brightness, and other parameters. For small and medium objects in images, the recognition accuracy of YOLO algorithms is not good. Optical flow based algorithms can effectively address this issue, benefiting from high sensitivity to moving traffic objects and no dependency on large-scale datasets. As traditional algorithms, however, optical flow based algorithms still suffer from poor robustness and low detection accuracy. Gao et al. [45] proposed a hybrid strategy for traffic light detection, which combines traditional algorithms and deep learning algorithms. A moving traffic object detection method based on Bayesian theory fusion is proposed in this paper. The initial object detection results are obtained by the ACA-YOLO network and the MSOF method separately. Then, the fusion results are obtained by calculating the posterior probabilities.

The main contributions of this paper are: (1) To enhance feature extraction ability for objects with different scales, an ACA-YOLO network is proposed by introducing the ACA mechanism into the original YOLOv5. (2) To improve the detection accuracy of moving objects, an MSOF algorithm is proposed, which has high sensitivity to moving traffic objects. (3) An object detection method is proposed to enhance detection performance in dynamic traffic scenarios, which fuses the detection results of the ACA-YOLO network and the MSOF method by calculating the Bayesian posterior probabilities.

The structure of the rest of this paper is as follows: The second section introduces the moving traffic object detection method based on Bayesian theory fusion, as well as the ACA-YOLO network and the MSOF algorithm in detail. The third section presents the experimental results and analysis, including the self-built moving traffic objects dataset, the detection performance of the proposed algorithm in dynamic traffic scenarios, an ablation study, and a comparison with state-of-the-art algorithms on the KITTI dataset. The conclusions are presented in the fourth section.

2 Moving Traffic Object Detection Algorithm Based on Bayesian Theory Fusion

The flow chart of the fusion algorithm based on Bayesian theory is shown in Fig. 1. For the input continuous images, moving traffic objects are detected based on the ACA-YOLO network and the MSOF algorithm separately. Then, the detection results of the two algorithms are fused based on Bayesian theory.
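For orientation, the following minimal Python sketch summarizes the per-frame flow of Fig. 1. The callables aca_yolo, msof_detector, and bayes_fuse are hypothetical placeholders for the components detailed in Sects. 2.1-2.3, not the authors' released code.

```python
# Minimal per-frame sketch of the fusion pipeline of Fig. 1 (illustrative only;
# aca_yolo, msof_detector and bayes_fuse are hypothetical placeholders).

def detect_frame(prev_frame, frame, aca_yolo, msof_detector, bayes_fuse):
    """Run both detectors on one frame and fuse their boxes."""
    yolo_boxes = aca_yolo(frame)                    # appearance-based boxes (Sect. 2.1)
    flow_boxes = msof_detector(prev_frame, frame)   # motion-based boxes from optical flow (Sect. 2.2)
    return bayes_fuse(yolo_boxes, flow_boxes)       # posterior-weighted fusion (Sect. 2.3)

def run(video_frames, aca_yolo, msof_detector, bayes_fuse):
    results = []
    prev = video_frames[0]
    for frame in video_frames[1:]:                  # MSOF needs consecutive frames
        results.append(detect_frame(prev, frame, aca_yolo, msof_detector, bayes_fuse))
        prev = frame
    return results
```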
2.1 ACA-YOLO Network

Although the YOLOv5 network has good detection performance, there are still some deficiencies: (1) The detection results of the YOLOv5 network for small objects depend intensively on shallow features. However, the large number of convolution operations in the backbone easily causes information loss during the process of feature extraction. (2) If there is an inclusion relationship between prediction boxes and truth boxes, the loss function of the regression boxes is constant, which means that further optimization and high-precision positioning cannot be achieved.

To enhance the feature extraction ability for objects with different scales, a novel ACA mechanism is introduced into the backbone. Compared with the traditional mechanism, the ACA mechanism enhances the feature extraction ability in the x and y directions. In addition, the loss function of bounding box regression utilizes EIoU instead of GIoU.

The output of the ACA mechanism is fed into the Spatial Pyramid Pooling–Fast (SPPF) module, which is the last layer of the backbone. Thus, the backbone network captures not only cross-channel information but also direction-aware and location-aware information, which helps the model locate and identify the objects of interest more accurately. An overview of the ACA mechanism is presented in Table 1.

Table 1  Overview architecture of ACA mechanism

Name                 Patch size/stride    Output size
AdaptiveAvgPool_h    None × 1             512 × 15 × 1
AdaptiveAvgPool_w    1 × None             512 × 20 × 1
Concat               –                    512 × 35 × 1
Conv1                1 × 1/1              16 × 35 × 1
BatchNorm            –                    16 × 35 × 1
Conv_h               1 × 1/1              512 × 15 × 1
Conv_w               1 × 1/1              512 × 1 × 20
Parameter_h          –                    512 × 15 × 1
Parameter_w          –                    512 × 1 × 20
Output               –                    512 × 15 × 20

Fig. 1  Flow chart of moving traffic object detection based on Bayesian theory fusion

First of all, each channel of the feature vector X with input dimension C × H × W is coded separately in the horizontal and vertical directions through average pooling. The average pooling sizes are H × 1 and 1 × W, respectively, and two feature graphs z^h and z^w are obtained. The z^h and z^w are spliced through Concat, and the intermediate feature graph f is generated by the 1 × 1 convolution transform function F_1, which can be described as

f = \delta(F_1([z^h, z^w]))    (1)

where \delta is a nonlinear activation function.

The feature graph f is divided into two tensors f^h and f^w along the spatial dimensions, and 1 × 1 convolution transformations are performed separately. To obtain the attention weights g^h and g^w for the input feature vector X, the dimensions of f^h and f^w are transformed into the dimensions of the feature vector X, which can be described as

g^h = \sigma(\alpha_h F_h(f^h)), \quad g^w = \sigma(\alpha_w F_w(f^w))    (2)

where F_h and F_w are 1 × 1 convolution transformations, \alpha_h and \alpha_w are adaptive learning weights in the w and h directions with initial values of 1, and \sigma is the sigmoid activation function.

Then, the output y can be obtained as follows:

y_c(i, j) = x_c(i, j) \times g^h_c(i) \times g^w_c(j)    (3)

where c represents the channel index, i represents the height, and j represents the width.
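As a concrete illustration of Table 1 and Eqs. (1)-(3), a minimal PyTorch sketch of such an adaptive coordinate attention block is given below. The reduction ratio, the choice of Hardswish for \delta, and the module interface are assumptions made here for illustration and are not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ACA(nn.Module):
    """Sketch of an adaptive coordinate attention block following Eqs. (1)-(3)."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        hidden = max(8, channels // reduction)          # 512 -> 16, as in Table 1 (assumed ratio)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # N x C x H x 1
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # N x C x 1 x W
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)   # F1 in Eq. (1)
        self.bn1 = nn.BatchNorm2d(hidden)
        self.act = nn.Hardswish()                       # delta in Eq. (1), assumed choice
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)  # F_h in Eq. (2)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)  # F_w in Eq. (2)
        # adaptive learning weights alpha_h, alpha_w initialised to 1
        self.alpha_h = nn.Parameter(torch.ones(1))
        self.alpha_w = nn.Parameter(torch.ones(1))

    def forward(self, x):
        n, c, h, w = x.shape
        zh = self.pool_h(x)                             # N x C x H x 1
        zw = self.pool_w(x).permute(0, 1, 3, 2)         # N x C x W x 1
        f = self.act(self.bn1(self.conv1(torch.cat([zh, zw], dim=2))))       # Eq. (1)
        fh, fw = torch.split(f, [h, w], dim=2)
        gh = torch.sigmoid(self.alpha_h * self.conv_h(fh))                   # Eq. (2), N x C x H x 1
        gw = torch.sigmoid(self.alpha_w * self.conv_w(fw.permute(0, 1, 3, 2)))  # Eq. (2), N x C x 1 x W
        return x * gh * gw                              # Eq. (3), broadcast over H and W
```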
To address the issue of a zero loss value when prediction boxes and truth boxes do not intersect, the minimum bounding rectangle was introduced by GIoU. However, the issue of constant loss values is still not addressed when an inclusion relationship between truth boxes and prediction boxes occurs. To improve the regression results of prediction boxes, the loss function based on EIoU considers the differences in overlap area, center point distance, height, and width between prediction boxes and truth boxes. The calculation formula is as follows:

\mathrm{EIoU} = \mathrm{IoU} - \frac{d^2}{c^2} - \frac{(h_d - h_t)^2}{h_c^2} - \frac{(w_d - w_t)^2}{w_c^2}    (4)

where IoU is the intersection ratio between prediction boxes and truth boxes, d is the center point distance between prediction boxes and truth boxes, h_d and w_d are the height and width of the prediction boxes, h_t and w_t are the height and width of the truth boxes, and c, h_c, and w_c are the diagonal distance, height, and width of the minimum bounding rectangle of the prediction box and the truth box.

Then, the loss function based on EIoU can be described as

\mathrm{EIoU}_{\mathrm{loss}} = 1 - \mathrm{EIoU}    (5)
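A minimal PyTorch sketch of Eqs. (4)-(5) for corner-format boxes is given below. It is an illustrative re-implementation based on the formulas above, not the training code used in the paper.

```python
import torch

def eiou_loss(pred, target, eps=1e-7):
    """EIoU loss of Eqs. (4)-(5) for boxes given as (x1, y1, x2, y2)."""
    # intersection and union -> IoU
    lt = torch.max(pred[..., :2], target[..., :2])
    rb = torch.min(pred[..., 2:], target[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared centre distance d^2 and enclosing-rectangle width, height, diagonal
    cxp = (pred[..., 0] + pred[..., 2]) / 2
    cyp = (pred[..., 1] + pred[..., 3]) / 2
    cxt = (target[..., 0] + target[..., 2]) / 2
    cyt = (target[..., 1] + target[..., 3]) / 2
    d2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    enc_lt = torch.min(pred[..., :2], target[..., :2])
    enc_rb = torch.max(pred[..., 2:], target[..., 2:])
    cw = enc_rb[..., 0] - enc_lt[..., 0]              # w_c
    ch = enc_rb[..., 1] - enc_lt[..., 1]              # h_c
    c2 = cw ** 2 + ch ** 2 + eps

    # width/height difference terms of Eq. (4)
    wd = pred[..., 2] - pred[..., 0]
    hd = pred[..., 3] - pred[..., 1]
    wt = target[..., 2] - target[..., 0]
    ht = target[..., 3] - target[..., 1]

    eiou = iou - d2 / c2 - (hd - ht) ** 2 / (ch ** 2 + eps) - (wd - wt) ** 2 / (cw ** 2 + eps)
    return 1.0 - eiou                                  # Eq. (5)
```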
3: update B and F

4: return D, B, F and N
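The per-pixel logic of Table 2 can be sketched as follows. The function signature and the explicit Python loop are illustrative simplifications (a practical implementation would vectorize the test over the whole frame); the thresholds are taken from the text above.

```python
import random

# Sketch of the Table 2 update rule for a single pixel: 20 stored samples,
# distance threshold 20, at least 2 close samples for background, reset after
# 50 consecutive foreground frames, 6.25% random refresh of background samples.

def update_pixel(value, samples, is_background, fg_count,
                 dist_thresh=20, min_matches=2, max_fg=50, update_prob=0.0625):
    if is_background:
        # a previously-background pixel is re-tested against its sample set
        matches = sum(1 for s in samples if abs(int(value) - int(s)) < dist_thresh)
        if matches < min_matches:
            is_background, fg_count = False, 1                   # becomes foreground
        elif random.random() < update_prob:
            samples[random.randrange(len(samples))] = value      # refresh background model
    else:
        # a previously-foreground pixel stays foreground until the counter expires
        fg_count += 1
        if fg_count > max_fg:
            is_background, fg_count = True, 0
            samples[random.randrange(len(samples))] = value
    return is_background, fg_count
```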

All optical flow methods assume that the pixels of dynamic objects maintain the same brightness between image frames and that the motion of pixels between adjacent image frames has a small intensity. Thus, the optical flow values of pixel points can be determined as follows:

I_x u + I_y v = -I_t    (6)

where I_x, I_y, and I_t are the partial derivatives of the pixel points with respect to the directions x, y, and time t, respectively, and u and v are the velocities of the pixel points in the x and y directions.

The assumption of spatial consistency is added by the Lucas–Kanade (LK) optical flow method, which states that each pixel in the image frame moves in a similar way to its neighbor pixels. For a sufficiently small window n × n, it can be considered that all pixels in the window have the same optical flow value. Then, the optical flow can be obtained by the least squares method. The detailed equations are:

\begin{bmatrix} I_{x1} & I_{y1} \\ I_{x2} & I_{y2} \\ \vdots & \vdots \\ I_{xn^2} & I_{yn^2} \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} I_{t1} \\ I_{t2} \\ \vdots \\ I_{tn^2} \end{bmatrix}    (7)

\begin{bmatrix} \sum I_{xi}^2 & \sum I_{xi} I_{yi} \\ \sum I_{xi} I_{yi} & \sum I_{yi}^2 \end{bmatrix} \begin{bmatrix} u \\ v \end{bmatrix} = - \begin{bmatrix} \sum I_{xi} I_{ti} \\ \sum I_{yi} I_{ti} \end{bmatrix}    (8)

The LK optical flow method assumes that the motion intensity of moving pixels between adjacent image frames is small. However, the calculated optical flow values may have a large error for pixels with large motion speed. To address this issue, MSOF is proposed to calculate the optical flow value, as shown in Table 3. To meet the assumption of small pixel motion, an image pyramid is constructed by scaling down the original image, as shown in Fig. 2. In the highest layer, the optical flow values are calculated. Then, the values of the upper layer are taken as initial values, and the values in the next layer are iteratively calculated by the LK optical flow method. In this way, the optical flow values in the bottom image are obtained.

Table 3  Multi-scale optical flow algorithm

Input: previous frame image I, current image J
Output: optical flow value d
Process:
1: construct image pyramids for I, J: {I^L}_{L=0,1,2,3,4}, {J^L}_{L=0,1,2,3,4}
2: initialize the optical flow value of the highest layer: g^4 = [0 0]^T
3: for L = 4; L ≥ 0; L−−:
       calculate the position of pixel u in I^L: u^L = [p_x p_y]^T = u / 2^L
       calculate the gradient of I^L in the x direction: I_x(x, y) = (I^L(x+1, y) − I^L(x−1, y)) / 2
       calculate the gradient of I^L in the y direction: I_y(x, y) = (I^L(x, y+1) − I^L(x, y−1)) / 2
       calculate the spatial gradient matrix: G = Σ Σ [I_x^2, I_x I_y; I_x I_y, I_y^2]
       initial iteration value: v^0 = [0 0]^T
       for k = 1; k ≤ K or ‖η^k‖ < threshold; k++:
           calculate the image difference: δI_k(x, y) = I^L(x, y) − J^L(x + g_x^L + v_x^{k−1}, y + g_y^L + v_y^{k−1})
           calculate the mismatch vector: b_k = Σ Σ [δI_k I_x; δI_k I_y]
           calculate the optical flow increment: η^k = G^{−1} b_k
           provide the initial value for the next iteration: v^k = v^{k−1} + η^k
       end
       optical flow optimization value in layer L: d^L = v^k
       optical flow initial value for layer L − 1: g^{L−1} = 2(g^L + d^L)
   end
4: calculate the final optical flow value: d = g^0 + d^0
5: return d

To reduce the computational complexity of optical flow based object detection in the image, each dynamic object region extracted by the ViBe algorithm is represented by a rectangular object box that contains all of its pixels. Specifically, the dynamic object region is searched for the smallest x and y coordinates, named x0 and y0, and the largest coordinates, named x1 and y1. (x0, y0) is set as the upper-left point of the rectangular region and (x1, y1) is set as the bottom-right point of the rectangular region. To avoid the interference of noise, regions that contain fewer than 10 pixels are determined to be noise and discarded.

Considering the influence of object motion, the object boxes are appropriately enlarged according to the center of the boxes, and the magnification factor is set to 1.5. Then, the image information of the corresponding region is extracted from the original image, and it is used to calculate the optical flow values of the dynamic object regions.

The calculated optical flow values in the image are shown as green dots and green lines in Fig. 3. When only green dots are displayed, the optical flow values of the pixel points are zero. When green lines are displayed, the pixels have non-zero optical flow values. The length of the green lines is related to the magnitude of the optical flow, and their directions are related to the directions of the optical flow. The average optical flow value in each dynamic object region is taken as the optical flow feature, and the object box is taken as the detection result for the region. The optical flow feature of each region is applied as the input of an SVM to classify the category of the region. In particular, a single SVM can only deal with binary classification, which cannot meet the requirement of multiple classes in this paper. Therefore, multiple SVMs are applied for object classification.

2.3 Object Detection Based on Bayesian Theory Fusion

Different from other estimation methods, Bayesian theory starts with a subjective prior probability, constantly modifies the probability, and greatly improves the accuracy.

Fig. 2  Multi-scale optical flow algorithm schematic diagram

Fig. 3  Schematic diagram of optical flow values of picture pixels, where the length and directions of the green lines represent the values and directions of the optical flow, respectively

If there are events A and B, it is expected to obtain the probability P(A|B) that event A occurs under the condition that event B occurs. The probability that both event A and event B occur simultaneously, P(AB), can be obtained as follows:

P(AB) = P(A)P(B|A)    (9)

where P(A) is the probability that event A occurs and P(B|A) is the probability that event B occurs under the condition that event A occurs.

Then P(A|B) can be expressed as

P(A|B) = \frac{P(AB)}{P(B)}    (10)

where the total probability P(B) satisfies

P(B) = \sum_{i=1}^{\infty} P(B_i) P(B|B_i)    (11)

Here, P(B) is the probability that event B occurs, P(B_i) is the probability that event B_i occurs, and P(B|B_i) is the probability that event B occurs under the condition that event B_i occurs.

Combining Eqs. (9), (10), and (11), the Bayesian equation can be expressed to obtain the posterior probability of event A as follows:

P(A|B) = \frac{P(A)P(B|A)}{\sum_{i=1}^{\infty} P(B_i)P(B|B_i)}    (12)

Bayesian theory is applied for object fusion, and the posterior probability is applied as the fusion weight of the detection result, which can improve the accuracy of the detection region. The fusion process of the object region based on Bayesian theory is shown in Fig. 4. Object matching is performed on the detection results of the ACA-YOLO network and the MSOF algorithm. Firstly, the detection results of the ACA-YOLO network are taken as the benchmark to calculate IoU values with each detection result of the MSOF algorithm. The IoU values are applied as the matching confidence, and the group with the highest matching confidence is judged as the matching result. To avoid false matching, results with a confidence level lower than 0.5 are eliminated.

Aiming at the posterior probability under the ACA-YOLO network prior, the training accuracy P_D of the ACA-YOLO network is taken as the prior probability. Region A_D is the region where the object is detected by the ACA-YOLO network, and region A_O is the region where the object is detected by the MSOF algorithm. The probability P(O|A_{A1}) and the probability P(O|N_{A1}) can be obtained as:

P(O|A_{A1}) = \frac{n_{Ad}}{N_{Ad}}    (13)

P(O|N_{A1}) = \frac{n_{Nd}}{N_{Ad}}    (14)

where P(O|A_{A1}) is the probability that the pixel points in the A_O region are detected as object by the ACA-YOLO network, P(O|N_{A1}) is the probability that the pixels in the A_O region are not detected as object by the ACA-YOLO network, n_{Ad} is the number of pixels of region A_O located in region A_D, N_{Ad} is the total number of pixels in region A_D, and n_{Nd} is the number of pixels of A_O that do not belong to A_D.

Then, the posterior probability P(A_{A1}|O) under the ACA-YOLO network prior can be obtained as:

P(A_{A1}|O) = \frac{P_D \times P(O|A_{A1})}{P_D \times P(O|A_{A1}) + (1 - P_D) \times P(O|N_{A1})}    (15)

where P_D is the confidence of region A_D.
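A minimal Python sketch of this matching-and-fusion step is given below. It implements Eqs. (13)-(15) with pixel counts expressed as box areas (equivalent for axis-aligned rectangular regions); the symmetric posterior under the MSOF prior and the final region selection follow Eqs. (16) and (17) introduced next. The function names and interface are assumptions for illustration only.

```python
# Sketch of the region matching and Bayesian fusion of Sect. 2.3.
# Boxes are (x1, y1, x2, y2) in pixels; p_d and p_o are the ACA-YOLO and MSOF
# confidences used as priors.

def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def inter_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def fuse_regions(box_d, p_d, box_o, p_o, iou_thresh=0.5):
    """Return the fused region for one matched ACA-YOLO / MSOF box pair."""
    inter = inter_area(box_d, box_o)
    union = box_area(box_d) + box_area(box_o) - inter
    if union <= 0 or inter / union < iou_thresh:       # matching confidence below 0.5
        return None                                    # treated as a false match and discarded

    # Eq. (13)/(14): fraction of the MSOF region covered / not covered by the YOLO region
    p_obs_given_obj = inter / box_area(box_d)
    p_obs_given_not = (box_area(box_o) - inter) / box_area(box_d)
    w_d = p_d * p_obs_given_obj / (
        p_d * p_obs_given_obj + (1 - p_d) * p_obs_given_not + 1e-9)    # Eq. (15)

    # symmetric posterior under the MSOF prior (mirrors Eq. (16))
    q_obs_given_obj = inter / box_area(box_o)
    q_obs_given_not = (box_area(box_d) - inter) / box_area(box_o)
    w_o = p_o * q_obs_given_obj / (
        p_o * q_obs_given_obj + (1 - p_o) * q_obs_given_not + 1e-9)

    return box_d if w_d >= w_o else box_o              # Eq. (17): keep the better-supported box
```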

Fig. 4  Object region fusion based on Bayesian theory

Similarly, the posterior probability P(A_{A2}|D) under the MSOF algorithm prior can be obtained as:

P(A_{A2}|D) = \frac{P_O \times P(D|A_{A2})}{P_O \times P(D|A_{A2}) + (1 - P_O) \times P(D|N_{A2})}    (16)

where P_O is the confidence of region A_O, P(D|A_{A2}) is the probability that the pixels in A_D are detected as object by the MSOF algorithm, and P(D|N_{A2}) is the probability that the pixels in A_D are not detected as object by the MSOF algorithm.

P(A_{A1}|O) is taken as the fusion weight of A_D, and P(A_{A2}|D) is taken as the fusion weight of A_O. The fusion region A_f can be obtained as follows:

A_f = \begin{cases} A_D, & P(A_{A1}|O) \ge P(A_{A2}|D) \\ A_O, & P(A_{A1}|O) < P(A_{A2}|D) \end{cases}    (17)

If P(A_{A1}|O) is larger, it indicates that the detection performance of the ACA-YOLO network is better, and region A_D is chosen as the fusion region. If P(A_{A1}|O) is smaller, it indicates that the detection performance of the MSOF algorithm is better, and region A_O is chosen as the fusion region.
3 Experiment and Result Analysis

The experimental environment of this paper is the Ubuntu 18.04 LTS operating system, an NVIDIA GTX 2080Ti GPU, and an Intel(R) Core(TM) i9-9900K CPU. The programming language is Python 3.7. PyTorch 1.7.0 is chosen as the deep learning framework for neural network programming with GPU acceleration.
3.1 Evaluation Index

Recall, precision, average precision (AP), and mean average precision (mAP) are selected as evaluation indexes, which can be expressed by the following equations:

\mathrm{Recall} = \frac{TP}{TP + FN}    (18)

\mathrm{Precision} = \frac{TP}{TP + FP}    (19)

AP = \int_0^1 P(R)\,\mathrm{d}R    (20)

\mathrm{mAP} = \frac{\sum AP}{num}    (21)

where TP represents the number of correctly detected objects of a category, FN represents the number of undetected objects of that category, FP represents the number of incorrectly detected objects of that category, P(R) represents the precision-recall function, and num is the number of categories. In addition, the detection threshold is set to 0.5. When the IoU values between prediction boxes and truth boxes are smaller than 0.5, the detection boxes are eliminated.
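A minimal sketch of Eqs. (18)-(21) is given below. It assumes the per-class counts and the points of the precision-recall curve are already available, and it uses a trapezoidal approximation of the integral in Eq. (20), which is an illustrative choice rather than the exact protocol used in the paper.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # Eq. (18)
    precision = tp / (tp + fp) if (tp + fp) else 0.0     # Eq. (19)
    return precision, recall

def average_precision(recalls, precisions):
    """Eq. (20): area under the precision-recall curve (trapezoidal approximation)."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    return sum(ap_per_class) / len(ap_per_class)          # Eq. (21)
```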

3.2 Experimental Results in Dynamic Traffic Scenarios

To verify the detection performance of the proposed method based on Bayesian theory fusion in dynamic traffic scenarios, the CMTO dataset was constructed, which consists of continuous real traffic scenes in different environments. The acquisition platform of the CMTO dataset is shown in Fig. 5, and the camera model is TC480HD. To give the camera a good acquisition field, a cantilever beam was designed to carry the camera. The height of the cantilever beam above the ground can be adjusted, with four gears of 2.5 m, 3.5 m, 4.5 m, and 7 m. In addition, a lidar (LSC32) and a millimeter wave radar (DTAM D29) were also deployed on this platform for future research.

Fig. 5  Movable acquisition platform and equipment

A total of 10,000 images were collected, labeled with four categories: pedestrian, cyclist, car, and large vehicle. This dataset covers moving traffic objects under a variety of scenarios, seasons, weather conditions, temperatures, light intensities, and other environmental variables. The 69,203 instance labels were manually annotated. Statistics of the labeled data are shown in Fig. 6, comprising 18,838 pedestrians, 11,338 cyclists, 34,299 cars, and 4728 large vehicles. Some image samples are shown in Fig. 7. The image resolution of the CMTO dataset is 640 × 480, and the format is JPG. The download link of the CMTO dataset is: https://drive.google.com/file/d/1m1BFzco3zueDMYL2mpVFjPYh0hl1OlwB/view?usp=share_link.

Fig. 6  Statistical chart of annotation in CMTO dataset

Fig. 7  Samples captured in different traffic scenarios

81% of the images are used for training, 9% for validation, and 10% for testing. In addition, rotation, translation, mirroring, splicing, and other image augmentation methods are applied to expand the training data.

Curves of precision and recall with confidence are shown in Fig. 8(a) and (b), respectively. The recall reaches 0.895 and the precision reaches 0.933. The curve of the PR function is shown in Fig. 8(c). It indicates that the AP values of pedestrian, cyclist, car, and large vehicle are 0.838, 0.972, 0.969, and 0.989, respectively, with an mAP of 0.942. Detection performance in different dynamic traffic scenarios is shown in Fig. 9, which indicates that moving traffic objects can be detected well by the proposed method.

Fig. 8  Curves of detection method based on Bayesian theory fusion
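A minimal sketch of the 81% / 9% / 10% split described above is given below. The shuffling, the fixed seed, and the function interface are assumptions for illustration; the listed geometric augmentations would then be applied to the training portion only.

```python
import random

def split_dataset(image_paths, seed=0):
    """Split image paths into roughly 81% train, 9% validation, 10% test."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)               # assumed random split
    n = len(paths)
    n_train, n_val = int(0.81 * n), int(0.09 * n)
    train = paths[:n_train]
    val = paths[n_train:n_train + n_val]
    test = paths[n_train + n_val:]                   # remaining ~10%
    return train, val, test
```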



Figures 10 and 11 display the detection results and IoU curves of the ACA-YOLO network, the MSOF algorithm, and the detection method based on Bayesian theory fusion in a continuous traffic scene. Fig. 10 indicates that small objects may be missed by the ACA-YOLO network due to their few and inconspicuous features. Thus, the objects in frames 1, 37, and 160 in Fig. 11 were not detected, which leads to IoU values of zero.

Specific differences in the detection results of the three methods are shown in Fig. 11. In the first 30 frames, compared with the ACA-YOLO network, the MSOF algorithm has higher IoU values and better detection performance, because the objects are far away from the camera and belong to small objects. In addition, due to the high sensitivity of the MSOF algorithm to moving objects, the issue of missed detections (as shown in Fig. 10(a)) is avoided. After about 30 frames, the detection performance of the ACA-YOLO network is better than that of the MSOF algorithm due to its strong feature extraction ability and high detection accuracy. In the majority of frames, the detection method based on Bayesian theory fusion has the highest IoU values among the three methods, which indicates its outstanding detection performance. Frames 1, 37, and 160 in Fig. 11 also demonstrate that the detection method based on Bayesian theory fusion can still obtain high-precision boxes by relying on the other method even if a single method fails to detect the objects.

3.3 Ablation Study

To further verify the improvement of the detection performance of the ACA-YOLO network, an ablation study is performed on the COCO, KITTI, and CMTO datasets, as shown in Table 4. It indicates that both the EIoU and ACA modules effectively improve the mAP of YOLOv5 and enhance detection performance. The ACA module effectively enhances the ability to extract salient features of objects. The loss function based on EIoU effectively improves detection box regression. The mAP values of the ACA-YOLO network are increased by 2.8%, 2.3%, and 2.3% on the COCO, KITTI, and CMTO datasets, respectively. Table 4 indicates that ACA-YOLO can enhance detection performance effectively.

To further verify the improvement in the detection performance for moving traffic objects by the detection method based on Bayesian theory fusion, an ablation study is performed as shown in Table 5. Since continuous frames are required by MSOF to calculate optical flow and the COCO dataset images are discontinuous, this ablation study is only performed on the KITTI and CMTO datasets. Compared with ACA-YOLO and MSOF, the mAP values of the detection method based on Bayesian theory fusion on the KITTI and CMTO datasets are increased, especially on the KITTI dataset, where the mAP is 3.2% and 21.7% higher than that of ACA-YOLO and MSOF, respectively.

Fig. 9  Detection samples

Although the ACA-YOLO network has good detection performance, it is not ideal for some objects, such as small objects. Table 5 indicates that the detection method based on Bayesian theory fusion effectively reduces the impact of this issue by fusing the MSOF method, which is highly sensitive to moving objects.

3.4 Comparison with State-of-the-Art Object Detection Algorithms

The proposed method has been compared with state-of-the-art object detection algorithms, such as YOLOR, YOLOX, YOLOv6, and YOLOv7, on the KITTI dataset. The results are shown in Table 6. All models were trained for 100 epochs, and the input resolution was 640 × 640. Table 6 indicates that the proposed method has the best recall and mAP, and its precision is also at a high level. Compared with YOLOv7, which exhibits excellent detection performance, the proposed method has higher precision and mAP, which are increased by 0.1% and 1.1%, respectively. In particular, recall shows a more significant increase of 1.7% due to the high sensitivity to moving objects. Table 6 indicates that the detection method based on Bayesian theory fusion exhibits excellent detection performance.

Fig. 10  Continuous frame detection result

Table 4  Comparison of mAP among ablation study on ACA-YOLO network

Algorithm     COCO     KITTI    CMTO
YOLOv5        0.568    0.899    0.912
Only EIoU     0.578    0.908    0.915
Only ACA      0.589    0.916    0.921
ACA-YOLO      0.596    0.922    0.935

Table 5  Comparison of mAP among ablation study on Bayesian theory fusion detection method

Algorithm              KITTI    CMTO
ACA-YOLO               0.922    0.935
MSOF                   0.737    0.671
The proposed method    0.954    0.942

Fig. 11  IoU curves, where the green curve represents the ACA-YOLO network, the red curve represents the MSOF algorithm, and the blue curve represents the detection method based on Bayesian theory fusion

Table 6  Comparison with state-of-the-art object detection algorithms on KITTI dataset

Algorithm              Precision    Recall    mAP
YOLOR [22]             0.750        0.907     0.918
YOLOX [23]             0.955        0.904     0.908
YOLOv6 [25]            0.899        0.861     0.903
YOLOv7 [24]            0.943        0.895     0.943
The proposed method    0.944        0.912     0.954

4 Conclusions

In this paper, to enhance feature extraction ability for objects with different scales, the ACA-YOLO network is proposed by applying the ACA mechanism and EIoU to YOLOv5. To have high sensitivity to moving traffic objects, the MSOF method is proposed. Compared with traditional methods, the accuracy is increased and the calculation cost of optical flow is reduced. To enhance detection performance in dynamic traffic scenarios, a detection method based on Bayesian theory fusion is proposed, which combines the detection results of the ACA-YOLO network and the MSOF method to obtain fusion results. The experimental results indicate that the proposed detection method based on Bayesian theory fusion exhibits excellent performance. The mAP, recall, and precision are 5.5%, 9.9%, and 1.9% higher than those of the traditional YOLOv5 network on the KITTI dataset.

In the future, to include more scenarios, it is necessary to extend the CMTO dataset. Compared with public datasets such as COCO and KITTI, the current CMTO dataset is relatively small and lacks universality. In addition, the fusion process can be further optimized. Instead of selecting the better detection box for fusion, pixel-level fusion is expected to obtain better detection performance. Specifically, posterior probabilities are used as fusion weights of the pixels in the detection boxes of the ACA-YOLO network and the MSOF algorithm, separately, and the pixel values of the two detection boxes are superimposed. A set of pixels that exceed a preset threshold is fitted to a detection box as the fusion result. Since lidar and radar are installed on the data acquisition platform, multi-sensor fusion is considered in future research.

Acknowledgements This work was supported in part by National Natural Science Foundation of China (Grant no. 52272414 and 51905095).

Declarations

Conflict of interest On behalf of all the authors, the corresponding author states that there is no conflict of interest.

References

1. Mahadevkar, S.V., Khemani, B., Patil, S., et al.: A review on machine learning styles in computer vision-techniques and future directions. IEEE Access 10, 107293–107329 (2022)
2. Ghedia, N.S., Vithalani, C.H.: Outdoor object detection for surveillance based on modified GMM and adaptive thresholding. Int. J. Inf. Technol. 13(1), 185–193 (2021)
3. Houhou, I., Zitouni, A., Ruichek, Y., et al.: Improving ViBe-based background subtraction techniques using RGBD information. Paper presented at 2022 7th International Conference on Image and Signal Processing and their Applications, IEEE, Mostaganem, 8–9 May 2022
4. Wang, Y., Lu, H., Gao, R., et al.: V-Vibe: a robust ROI extraction method based on background subtraction for vein images collected by infrared device. Infrar. Phys. Technol. 123, 104175 (2022)
5. Bansal, M., Kumar, M., Kumar, M.: 2D object recognition: a comparative analysis of SIFT, SURF and ORB feature descriptors. Multimed. Tools Appl. 80, 18839–18857 (2021)
6. Abdullah, D.M., Abdulazeez, A.M.: Machine learning applications based on SVM classification a review. Qubahan Acad. Jurnal. 1(2), 81–90 (2021)
7. Chen, Y., Zheng, W., Li, W., et al.: Large group activity security risk assessment and risk early warning based on random forest algorithm. Pattern Recogn. Lett. 144, 1–5 (2021)
8. Caldelli, R., Galteri, L., Amerini, I., et al.: Optical flow based CNN for detection of unlearnt deepfake manipulations. Pattern Recogn. Lett. 146, 31–37 (2021)
9. Fan, L., Zhang, T., Du, W.: Optical-flow-based framework to boost video object detection performance with object enhancement. Expert Syst. Appl. 170, 114544 (2021)
10. Ahn, H., Cho, H.J.: Research of multi-object detection and tracking using machine learning based on knowledge for video surveillance system. Pers. Ubiq. Comput. 66, 1–10 (2022)
11. Zaidi, S.S.A., Ansari, M.S., Aslam, A., et al.: A survey of modern deep learning based object detection models. Digit. Signal Process. 126, 103514 (2022)
12. Kang, J., Tariq, S., Oh, H., et al.: A survey of deep learning-based object detection methods and datasets for overhead imagery. IEEE Access 10, 20118–20134 (2022)
13. Sun, P., Zhang, R., Jiang, Y., et al.: Sparse R-CNN: end-to-end object detection with learnable proposals. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
14. Qiao, L., Zhao, Y., Li, Z., et al.: Defrcn: decoupled faster R-CNN for few-shot object detection. Paper presented at the IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, 10–17 October 2021
15. Xie, X., Cheng, G., Wang, J., et al.: Oriented R-CNN for object detection. Paper presented at the IEEE/CVF International Conference on Computer Vision, IEEE, Montreal, 10–17 October 2021
16. Avola, D., Cinque, L., Diko, A., et al.: MS-Faster R-CNN: multi-stream backbone for improved faster R-CNN object detection and aerial tracking from UAV images. Remote Sens. 13(9), 1670 (2021)
17. Girshick, R., Donahue, J., Darrell, T., et al.: Rich feature hierarchies for accurate object detection and semantic segmentation. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Columbus, 23–28 June 2014
18. Cai, Y., Zhang, T., Wang, H., et al.: 3D vehicle detection based on LiDAR and camera fusion. Automot. Innov. 2(4), 276–283 (2019). https://doi.org/10.1007/s42154-019-00083-z

19. Peng, L., Wang, H., Li, J.: Uncertainty evaluation of object detection algorithms for autonomous vehicles. Automot. Innov. 4(3), 241–252 (2021)
20. Bochkovskiy, A., Wang, C.Y., Liao, H.Y.M.: YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934 (2020)
21. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-YOLOv4: scaling cross stage partial network. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
22. Wang, C.Y., Yeh, I.H., Liao, H.Y.M.: You only learn one representation: unified network for multiple tasks. arXiv preprint arXiv:2105.04206 (2021)
23. Ge, Z., Liu, S., Wang, F., et al.: YOLOX: exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021)
24. Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Vancouver, 17–24 June 2023
25. Li, C., Li, L., Jiang, H., et al.: YOLOv6: a single-stage object detection framework for industrial applications. arXiv preprint arXiv:2209.02976 (2022)
26. Xu, S., Wang, X., Lv, W., et al.: PP-YOLOE: an evolved version of YOLO. arXiv preprint arXiv:2203.16250 (2022)
27. Redmon, J., Divvala, S., Girshick, R., et al.: You only look once: unified, real-time object detection. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Las Vegas, 26 June–1 July 2016
28. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. Paper presented at the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Hawaii, 21–26 July 2017
29. Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
30. Woo, S., Park, J., Lee, J.Y., et al.: CBAM: convolutional block attention module. Paper presented at the European Conference on Computer Vision, Springer, Munich, 8–14 September 2018
31. Hou, Q., Zhou, D., Feng, J.: Coordinate attention for efficient mobile network design. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
32. Qiao, S., Chen, L.C., Yuille, A.: Detectors: detecting objects with recursive feature pyramid and switchable atrous convolution. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
33. Hu, M., Li, Y., Fang, L., et al.: A2-FPN: attention aggregation based feature pyramid network for instance segmentation. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Online, 19–25 June 2021
34. Zheng, Z., Wang, P., Liu, W., et al.: Distance-IoU loss: faster and better learning for bounding box regression. Paper presented at the AAAI Conference on Artificial Intelligence, AAAI, New York, 7–12 February 2020
35. Zhang, Y.F., Ren, W., Zhang, Z., et al.: Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 506, 146–157 (2022)
36. Gevorgyan, Z.: SIoU loss: more powerful learning for bounding box regression. arXiv preprint arXiv:2205.12740 (2022)
37. Tong, Z., Chen, Y., Xu, Z., et al.: Wise-IoU: bounding box regression loss with dynamic focusing mechanism. arXiv preprint arXiv:2301.10051 (2023)
38. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. Paper presented at 2017 IEEE International Conference on Image Processing, IEEE, Beijing, 17–20 September 2017
39. Teng, Z., Duan, Y., Liu, Y., et al.: Global to local: clip-LSTM-based object detection from remote sensing images. IEEE Trans. Geosci. Remote Sens. 60, 1–13 (2021)
40. Jie, Y., Leonidas, L., Mumtaz, F., et al.: Ship detection and tracking in inland waterways using improved YOLOv3 and Deep SORT. Symmetry 13(2), 308 (2021)
41. Gai, Y., He, W., Zhou, Z.: Pedestrian target tracking based on Deep SORT with YOLOv5. Paper presented at 2021 2nd International Conference on Computer Engineering and Intelligent Control, IEEE, Chongqing, 12–14 November 2021
42. Zhao, Z., Xu, S., Zhang, C., et al.: Bayesian fusion for infrared and visible images. Signal Process. 177, 107734 (2020)
43. Zhou, H., Dong, C., Wu, R., et al.: Feature fusion based on Bayesian decision theory for radar deception jamming recognition. IEEE Access 9, 16296–16304 (2021)
44. Wu, T., Hu, J., Ye, L., et al.: A pedestrian detection algorithm based on score fusion for multi-LiDAR systems. Sensors 21(4), 1159 (2021)
45. Gao, F., Wang, C.: Hybrid strategy for traffic light detection by combining classical and self-learning detectors. IET Intell. Transp. Syst. 14(7), 735–741 (2020)

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
