
Hindawi
Computational Intelligence and Neuroscience
Volume 2023, Article ID 7974201, 11 pages
https://doi.org/10.1155/2023/7974201

Research Article
Realtime Vehicle Tracking Method Based on YOLOv5 + DeepSORT

Lixiong Lin,1 Hongqin He,2 Zhiping Xu,1 and Dongjie Wu3

1 School of Ocean Information Engineering, Jimei University, Xiamen 361021, China
2 Zhejiang Dahua Technology Co., Ltd., Hangzhou 310051, China
3 Department of Automation, Xiamen University, Xiamen, Fujian 361005, China

Correspondence should be addressed to Lixiong Lin; [email protected]

Received 3 January 2023; Revised 20 March 2023; Accepted 22 March 2023; Published 15 June 2023

Academic Editor: S. P. Raja

Copyright © 2023 Lixiong Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In actual traffic scenarios, the environment is complex and constantly changing, with many vehicles that have substantial similarities, posing significant challenges to vehicle tracking research based on deep learning. To address these challenges, this article investigates the application of the DeepSORT (simple online and realtime tracking with a deep association metric) multitarget tracking algorithm in vehicle tracking. Because the DeepSORT algorithm depends strongly on target detection, a YOLOv5s_DSC vehicle detection algorithm based on the YOLOv5s algorithm is proposed, which provides accurate and fast vehicle detection data to the DeepSORT algorithm. Compared to YOLOv5s, YOLOv5s_DSC differs by no more than 1% in optimal mAP0.5 (mean average precision), precision, and recall, while reducing the number of parameters by 23.5%, the amount of computation by 32.3%, and the size of the weight file by 20%, and increasing the average processing speed of each image by 18.8%. After integrating the DeepSORT algorithm, the processing speed of YOLOv5s_DSC + DeepSORT reaches up to 25 FPS, and the system exhibits better robustness to occlusion.

1. Introduction

The increasing number of vehicles has caused great difficulties in traffic management. Vehicle tracking is an application of target tracking in the field of transportation, which can serve to alleviate the pressure of traffic management [1–3]. At present, the mainstream target tracking method is the discriminative tracking method, which adds a target detection step and makes the tracking more accurate. Discriminative tracking methods mainly include tracking methods based on sparse representation [4–6], tracking methods based on correlation filtering [7–9], and tracking methods based on deep learning. Li and Huang [10] proposed the TOD (tracking object based on detector) algorithm, which used YOLOv3 for target detection and tracked the target according to LBP (local binary pattern) features and a color histogram. Bertinetto et al. [11] proposed the SiamFC (Siamese fully convolutional) algorithm, which took the target object in the first frame as one input of the SiameseNet and the search area in the subsequent frames as another input and then found the area closest to the target object to realize the target tracking. However, target loss can easily happen when the target size changes. Zhu et al. [12] adopted a distractor recognition model to update the tracking template online, which could deal well with the problems of serious occlusion and appearance change of the target. Li et al. [13] introduced the deep network into the Siamese Net framework and exploited the deep network through multilayer aggregation.

Multitarget tracking is harder than single-target tracking. Problems such as appearance similarity among targets, occlusion, and the start and end of single-target tracking tasks pose significant challenges in the field of multitarget tracking. Bewley et al. [14] proposed the SORT (simple online and realtime tracking) algorithm, which used the Kalman filter to predict the tracking frame information of the tracked object in the next frame and performed data association with the detection frame information in the next frame to achieve multitarget tracking. The algorithm had a small memory footprint and high speed, but the accuracy was very low when the target was occluded. Wojke et al. [15] proposed the DeepSORT algorithm based on the idea of SORT. The algorithm considered both motion information and appearance information in the tracking process and resolved the problem of target occlusion. At present, detection-based tracking algorithms still have many problems, such as a lack of datasets, inaccurate target detection, and insufficient realtime performance.

Traditional target detection algorithms rely on image features and classifiers such as SVM (support vector machine) [16], Adaboost [17], Random Forest [18], artificially designed color features [19], gradient features [20], and pattern features [21]. Target detection algorithms based on deep learning have stronger adaptability to complex scenes; they include target detection methods based on candidate regions and target detection methods based on regression. The representative algorithms based on candidate regions are the R-CNN (Region-CNN) series [22–24]. Owing to the need to process a large number of candidate frames, such methods face the problem of low efficiency and do not have the ability for realtime detection. The regression-based target detection method removes the step of generating candidate regions and improves the speed significantly, so it has been widely used for developing realtime target detection systems. The YOLO (you only look once) algorithm [25] proposed in 2016 used a grid to divide an image and generated a series of initial anchor boxes in each grid of the image. By learning to fine tune the initial box, the predicted box was generated to be closer to the actual box. The YOLOv2 algorithm introduced batch normalization and used DarkNet-19 as the backbone network, which could dynamically adjust the input and achieve better precision for small targets [26]. On this basis, the YOLOv3 algorithm used DarkNet-53 as the backbone network, introduced the FPN (feature pyramid network) structure to obtain feature maps at different scales, and used a logistic classifier to predict the category of targets [27]. The YOLOv4 algorithm added data enhancement and self-adversarial training methods at the input end [28]. Its backbone network used CSPDarkNet53 and improved the loss function of the output layer, which greatly improved the speed and accuracy. YOLOv5 has the same performance as YOLOv4; however, YOLOv5 is faster, with a detection speed of 140 FPS on a Tesla P100. Sasagawa and Nagahara [29] used YOLO to locate and identify objects and proposed a method for detecting objects under low illumination by utilizing the power of transfer learning. Krišto et al. [30] used thermal images with YOLO to improve target detection performance in challenging conditions such as adverse weather, night time, and dense areas. Xiao et al. [31] fused the context information in the YOLO backbone network to avoid the loss of low-level context features, retain lower spatial features, and solve the problem of difficult detection of targets under dim light. Guo et al. [32] designed an improved SSD (single shot multibox detector) detector, which used single-sample data deformation and data amplification to transform the color gamut and affine properties of the original data and could detect targets that were close to each other. To improve feature fusion for small tassel detection, Liu et al. [33] proposed a novel algorithm referred to as YOLOv5-tassel. To enrich feature information and improve the feature extraction ability, Bie et al. [34] proposed an improved YOLOv5 algorithm based on a bidirectional feature pyramid network for multiscale feature fusion. Wang et al. [35] proposed a novel vehicle detection and tracking method for small target vehicles that achieves high detection and tracking accuracy based on the attention mechanism. In summary, research based on the improved YOLOv5 algorithm mainly focuses on the accuracy of small object detection, while research on detection speed and occlusion robustness in vehicle tracking still has great research value. The main contributions of this article are as follows:

(1) To solve the problems of a large number of vehicles, fast-moving vehicles, substantial similarity of vehicle appearance, and vehicle occlusion in actual urban traffic scenes, the DeepSORT algorithm is used for vehicle tracking, which has better realtime performance and tracking robustness than traditional vision-based vehicle tracking methods.

(2) To reduce the calculation amount of YOLOv5s, reduce its inference time, and improve the operation speed, a YOLOv5s_DSC algorithm with faster inference speed is proposed.

(3) Combining YOLOv5s_DSC with the DeepSORT algorithm, the occlusion robustness of the proposed algorithm is verified, and the realtime performance of the algorithm is tested in the cases of vehicles being occluded by foreign objects or vehicles being occluded by each other.

2. Algorithmic Framework

2.1. Overall Framework. The DeepSORT algorithm adopts a two-stage idea of detection and tracking, using the Kalman filter and the Hungarian algorithm to track the target and introducing a deep convolutional neural network to extract the appearance information of the tracked target for data association, which solves the problem that an occluded target is difficult to track accurately. A stable and accurate vehicle detection result is an important guarantee for the DeepSORT algorithm in the vehicle tracking task. Considering the realtime requirements of realistic application scenarios, the YOLOv5 target detection algorithm is studied in this article. To further reduce the memory and computing resources occupied by the algorithm, a DSC structure with residual is introduced into YOLOv5s, and the YOLOv5s_DSC algorithm with a smaller model and faster speed is proposed. YOLOv5s_DSC is used as the detector of the DeepSORT algorithm, and its excellent detection accuracy can make tracking more accurate and provide better realtime performance.

2.2. The DeepSORT Algorithm. Figure 1 is a framework diagram of the DeepSORT algorithm. First, the Kalman filter is used to predict the tracking frame information of the tracked target in the next frame, and all the detection frame information is obtained by the target detection algorithm in the subsequent frame.
Figure 1: Block diagram of the DeepSORT algorithm.

Then, the Hungarian algorithm is used to find an optimal allocation with minimum cost between all the detection frames and the tracking prediction frames. The cost matrix used in this step contains not only the Mahalanobis distance but also the cosine distance of the appearance features extracted by the deep convolutional neural network. After solving with the Hungarian algorithm, the optimal combination of prediction frames and detection frames is obtained. The DeepSORT algorithm uses cascade matching: the smaller the number of frames since the last successful match, the higher the priority in the current matching. The tracking frame information is updated according to the detection frame information after the matching succeeds, and the tracking frame information of the tracked target in the next frame continues to be predicted from the tracking information. For the samples that fail to match, the cost matrix is constructed again from the IOU between the remaining tracking frames and the prediction frames and is then passed to the Hungarian algorithm for solution. After the matching succeeds, the tracking frame is updated according to the detection frame information, and the tracking frame information of the tracked target in the next frame is continuously predicted from the tracking frame information. Whether a match is confirmed is recorded by marking "true" and "false." For a detection frame that fails to match, a flag "false" is added, and three subsequent rounds of investigation are conducted (age counts the rounds, and max age = 3). If all three rounds of matching are successful, the flag is changed to "true." For tracking frames that fail to match, if they are marked "false," the tracking task is stopped; if they are marked "true," a lifespan is set. Within the lifespan, the same three rounds of investigation are conducted, and if all three rounds of matching are successful, the mark is changed to "true."

State estimation methods mainly include state observers and various linear and nonlinear discrete estimators based on the Kalman filter. Liu et al. [36] proposed a novel vehicle sideslip angle estimation algorithm fusing a dynamic model and vision for vehicle dynamic control. A vehicle attitude angle observer based on the square-root cubature Kalman filter (SCKF) is designed in [37] to estimate the roll and pitch and reject the gravity components induced by the vehicle roll and pitch. For simplicity, this article uses the Kalman filter for state estimation. The prediction equation of the Kalman filter is as follows:

\hat{x}_k^- = F_k \hat{x}_{k-1}^+,
P_k^- = F_k P_{k-1}^+ F_k^T + Q_k,    (1)

where \hat{x}_k^- is the state estimation at time k, \hat{x}_{k-1}^+ is the state estimation at time k - 1, P_k^- is the covariance matrix of the state estimation, F_k is the state transition matrix, and Q_k is the covariance matrix of the process noise. The measurement update equation of the Kalman filter is as follows:
K_k = P_k^- H_k^T (H_k P_k^- H_k^T + R_k)^{-1},
\hat{x}_k^+ = \hat{x}_k^- + K_k (y_k - H_k \hat{x}_k^-),    (2)
P_k^+ = (I - K_k H_k) P_k^-,

where y_k is the measurement vector at time k, H_k is the measurement matrix, R_k is the covariance matrix of the measurement noise, K_k is the Kalman gain used to correct the state estimation, and I is the identity matrix.
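As a concrete reference for equations (1) and (2), the following minimal NumPy sketch runs one predict/update cycle for the 8-dimensional constant-velocity state used by DeepSORT (the state layout is given in equation (3) below); the time step dt and the noise covariances Q and R are illustrative assumptions rather than the paper's tuned values.

```python
import numpy as np

def kalman_predict(x, P, F, Q):
    # Equation (1): project the state and its covariance one step forward.
    return F @ x, F @ P @ F.T + Q

def kalman_update(x_pred, P_pred, y, H, R):
    # Equation (2): correct the prediction with the measurement y.
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new

# 8-D constant-velocity state [u, v, r, h, du, dv, dr, dh]; see equation (3).
dt = 1.0 / 25.0                                  # assumed inter-frame interval
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)                       # position += velocity * dt
H = np.hstack([np.eye(4), np.zeros((4, 4))])     # only [u, v, r, h] is measured
Q, R = 1e-2 * np.eye(8), 1e-1 * np.eye(4)        # assumed noise covariances

x, P = np.zeros(8), np.eye(8)
x, P = kalman_update(*kalman_predict(x, P, F, Q),
                     np.array([0.5, 0.5, 0.4, 0.2]), H, R)
print(np.round(x[:4], 3))                        # corrected box estimate
```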
The state vector of the DeepSORT algorithm can be described as follows:

x = [u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h}]^T,    (3)

where u, v, r, and h represent the target box center coordinates x and y, the aspect ratio, and the height, respectively, and \dot{u}, \dot{v}, \dot{r}, and \dot{h} represent the corresponding values in the next frame predicted with Kalman filtering. The DeepSORT algorithm uses a cost matrix constructed from the Mahalanobis distance and the cosine distance of appearance features in the first data association. The Mahalanobis distance correlation metric is calculated as follows:

d^{(1)}(i, j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i),    (4)

where d_j = [u_j, v_j, r_j, h_j]^T represents the jth detection state, y_i represents the ith tracking target, whose state in the current frame is predicted from the state of the previous frame, and S_i is the covariance between the detected state and the predicted state. The cosine distance measurement formula of appearance features is as follows:

d^{(2)}(i, j) = \min \{ 1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in R_i \},    (5)

where r_j corresponds to the feature vector of the jth detection frame, r_k^{(i)} corresponds to a feature vector of the ith tracking frame, and R_i is the set of features from the last k successful matches of the track. The DeepSORT algorithm constructs a deep convolutional neural network to extract the appearance features of the tracking target and uses L2 standardization to project the features. The network structure is shown in Table 1.

Table 1: DeepSORT deep convolutional neural network.

Layer                        C_in   C_out   Kernel_size   Stride
Conv                         3      32      3             1
Conv                         32     32      3             1
MaxPool                      32     32      3             2
Residual                     32     32      3             1
Residual                     32     32      3             1
Residual                     32     64      3             2
Residual                     64     64      3             1
Residual                     64     128     3             2
Residual                     128    128     3             1
Dense                        —      128     —             —
Batch and L2 normalization   —      128     —             —

Due to the nonconstant update frequency of image frames, we use the time difference between two frames as the time step of the Kalman filter during the discretization process. This approach allows us to dynamically adjust the state update rate of the Kalman filter based on the actual situation, which helps to better track the target.
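The sketch below illustrates how the two metrics of equations (4) and (5) can be blended into a single cost matrix and solved with the Hungarian algorithm via SciPy's linear_sum_assignment; the weighting factor lam and the toy data are assumptions for illustration, and the gating thresholds of the full cascade matching are omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def association_cost(track_means, track_covs, track_galleries,
                     det_boxes, det_features, lam=0.5):
    """Blend equation (4) (Mahalanobis) with equation (5) (appearance cosine)."""
    cost = np.zeros((len(track_means), len(det_boxes)))
    for i, (y_i, S_i, gallery) in enumerate(
            zip(track_means, track_covs, track_galleries)):
        S_inv = np.linalg.inv(S_i)
        for j, (d_j, r_j) in enumerate(zip(det_boxes, det_features)):
            diff = d_j - y_i
            d_mah = float(diff @ S_inv @ diff)                       # equation (4)
            d_cos = min(1.0 - float(r_j @ r_k) for r_k in gallery)   # equation (5)
            cost[i, j] = lam * d_mah + (1.0 - lam) * d_cos
    return cost

# Toy example: 3 tracks, 4 detections, 128-D L2-normalized appearance features.
rng = np.random.default_rng(0)
norm = lambda a: a / np.linalg.norm(a, axis=-1, keepdims=True)
means, covs = rng.random((3, 4)), np.stack([np.eye(4)] * 3)
galleries = [norm(rng.random((5, 128))) for _ in range(3)]
boxes, feats = rng.random((4, 4)), norm(rng.random((4, 128)))
rows, cols = linear_sum_assignment(
    association_cost(means, covs, galleries, boxes, feats))
print(list(zip(rows, cols)))  # matched (track, detection) index pairs
```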
3. Improved YOLOv5 Vehicle Detection Method

To achieve high-precision vehicle tracking, the vehicle detection algorithm is studied in this section. To further improve the realtime performance of vehicle detection, the DSC structure with residual is introduced, and the YOLOv5s_DSC vehicle detection algorithm is proposed, which has fewer parameters, less calculation, and faster detection speed.

3.1. Depthwise Separable Convolution. With the help of grouped convolution, DSC uses point-by-point convolution to fuse the feature information of different channels, which can achieve the purpose of a lightweight deep learning network while ensuring feature extraction. It divides into the following two steps:

(1) Channel-by-channel (depthwise) convolution: the input image is H_in × W_in × C_in. Each channel is convolved independently by a K × K × 1 convolution kernel to obtain C_in characteristic maps of size H_out × W_out. The parameter number of the convolution kernels is K × K × C_in. As shown in Figure 2, if a 3-channel image is the input, 3 single-channel maps are obtained.

(2) Pointwise convolution: a 1 × 1 × C_in × C_out convolution kernel performs a convolution operation on the output of step (1) to obtain the characteristic map of size H_out × W_out × C_out. As shown in Figure 3, the number of parameters of this step is 1 × 1 × C_in × C_out.

Figure 2: Depthwise convolution.

Figure 3: Pointwise convolution.

The number of parameters of the whole DSC is as follows:

P = K \times K \times C_{in} + 1 \times 1 \times C_{in} \times C_{out}.    (6)

This is similar to grouped convolution with C_in groups. The difference is that the result of grouped convolution is the concatenation of each group's result, while the result of DSC is the weighted combination of each group's result by point-by-point convolution, which can make full use of the characteristic information of each channel at the same position.
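The following PyTorch sketch shows one way to realize the DSC structure with residual described above (depthwise convolution with groups equal to the channel count, a 1 × 1 pointwise convolution, batch normalization, and an additive skip connection, matching Figure 5 below); the layer hyperparameters are illustrative assumptions. With 64 channels, its weight count is 576 + 4,096 = 4,672, matching the calculation in Section 3.2.

```python
import torch
import torch.nn as nn

class DSCResidual(nn.Module):
    """Depthwise separable convolution with a residual connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # Step (1): one K x K kernel per channel -> K*K*C_in weights.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=channels, bias=False)
        # Step (2): 1 x 1 fusion across channels -> 1*1*C_in*C_out weights.
        self.pointwise = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = self.bn(self.pointwise(self.depthwise(x)))
        return out + x  # residual connection to avoid network degradation

block = DSCResidual(64)
x = torch.randn(1, 64, 80, 80)
print(block(x).shape)  # torch.Size([1, 64, 80, 80])
print(sum(p.numel() for p in block.parameters() if p.dim() > 1))  # 4672
```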
3.2. YOLOv5s Improvement Strategy. The YOLOv5s model has 283 layers in total, the number of parameters is 7,071,633, and the amount of calculation is 16.4 GFLOPS. To further simplify the network structure, reduce the amount of calculation, and reduce the inference time of the model, the DSC structure is introduced to replace the C3 structure of the backbone part in YOLOv5s. As shown in Figure 4, the first C3 structure in the YOLOv5s network contains five convolutions, and their parameters are shown in Table 2.

Figure 4: The first C3 structure of YOLOv5s.

Table 2: The convolution parameters of the first C3.

Layer   C_in   C_out   Kernel_size
Conv1   64     32      1
Conv2   64     32      1
Conv3   64     64      1
Conv4   32     32      1
Conv5   32     32      3

From Table 2, it can be calculated that the number of parameters for Conv1 and for Conv2 is 2,048 each, the number of parameters for Conv3 is 4,096, the number of parameters for Conv4 is 1,024, and the number of parameters for Conv5 is 9,216. Therefore, the number of parameters of the first C3 structure in the YOLOv5s network amounts to 18,432. DSC performs two convolutions: the first convolution obtains the features of each channel, and the second convolution fuses the position information of each channel. To match the first C3 structure in the YOLOv5s network, the input and output channels of the DSC are also set to 64, and the size of the channel-by-channel convolution is 3 × 3. Then, the number of parameters of the channel-by-channel convolution is 576, and the size of the point-by-point convolution is 1 × 1, so the number of parameters of the pointwise convolution is 4,096. Thus, the number of parameters of the DSC structure amounts to 4,672, which is 13,760 lower than that of the first C3 structure in the YOLOv5s network.
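The parameter counts quoted above can be reproduced with a short weights-only calculation (bias and batch-normalization terms are ignored here):

```python
# First C3 structure of YOLOv5s (Table 2): five 1x1/3x3 convolutions.
c3 = 64*32*1*1 + 64*32*1*1 + 64*64*1*1 + 32*32*1*1 + 32*32*3*3
# DSC replacement: 3x3 depthwise over 64 channels + 1x1 pointwise 64 -> 64.
dsc = 3*3*64 + 1*1*64*64
print(c3, dsc, c3 - dsc)  # 18432 4672 13760
```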

The backbone of the YOLOv5s network contains four C3 structures, which are replaced by DSC structures in turn. To avoid network degradation caused by the replacement, a residual structure is introduced, as shown in Figure 5.

Figure 5: DSC structure with residual.

The introduction of DSC can effectively reduce the number of parameters and make the network model smaller. The comparison of the number of parameters before and after replacement is shown in Table 3. The parameter count of each structure includes the parameters of convolution, bias, and batch normalization in the structure. The improved network framework is presented in Table 4.

Table 3: Number of parameters before and after replacement.

Network structure   Parameter quantity
C3_1                18,816
C3_2                156,928
C3_3                625,152
C3_4                1,182,720
DSC_1               4,928
DSC_2               54,144
DSC_3               206,592
DSC_4               268,800

Table 4: The network structure of YOLOv5s_DSC.

Layer               C_in   C_out   Kernel_size   Stride   Padding
Focus               3      32      —             —        —
Conv                32     64      3             2        1
DSC_1               64     64      —             —        —
Conv                64     128     3             2        1
DSC_2               128    128     —             —        —
Conv                128    256     3             2        1
DSC_3               256    256     —             —        —
Conv                256    512     3             2        1
SPP                 512    512     —             —        —
DSC_4               512    512     —             —        —
Conv                512    256     1             1        0
Upsample + Concat   —      —       —             —        —
C3                  512    256     —             —        —
Conv                256    128     1             1        0
Upsample + Concat   —      —       —             —        —
C3                  256    128     —             —        —
Conv                128    128     3             2        1
Concat              —      —       —             —        —
C3                  256    256     —             —        —
Conv                256    256     3             2        1
Concat              —      —       —             —        —
C3                  512    512     —             —        —

4. Experimental Results and Analysis

4.1. Dataset Preparation. The VeRi dataset [38] is a large vehicle rerecognition dataset, which contains vehicle images from multiple angles and under different light intensities. It is suitable for related research on vehicle rerecognition. As shown in Figure 6, each folder contains pictures of the same vehicle taken from different angles, with a total of 776 folders. The training set and the test set are split in the proportion of 8 : 1.

Figure 6: VeRi dataset.

UA-DETRAC [39] is a vehicle dataset collected from the real traffic environment of Beijing and Tianjin, labeled with the four vehicle categories "Bus," "Car," "Van," and "Others," including vehicle images from different angles and periods and covering most traffic conditions. The UA-DETRAC dataset contains a total of 60 image folders collected from different road sections and periods, and each folder corresponds to an XML tag file. We use code to strip the tag corresponding to each image in the XML file and convert all the XML tag files into TXT format, as sketched below. The training set and the test set are split in the proportion of 9 : 1. There are 73,876 training images and 8,209 test images in total. The data structure of images and labels is shown in Figure 7.

Figure 7: UA-DETRAC dataset.
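A minimal conversion sketch is given below. It assumes the standard UA-DETRAC annotation layout (a sequence of frame elements, each containing target boxes with left/top/width/height attributes and a vehicle_type attribute) and the 960 × 540 frame size; the file naming and class indices are illustrative choices, not the exact code used in this work.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

CLASSES = {"Bus": 0, "Car": 1, "Van": 2, "Others": 3}
IMG_W, IMG_H = 960, 540  # UA-DETRAC frame size

def convert_sequence(xml_path, out_dir):
    """Write one YOLO-format TXT label file per annotated frame."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    root = ET.parse(xml_path).getroot()
    for frame in root.iter("frame"):
        lines = []
        for target in frame.iter("target"):
            box = target.find("box")
            left, top = float(box.get("left")), float(box.get("top"))
            w, h = float(box.get("width")), float(box.get("height"))
            vtype = target.find("attribute").get("vehicle_type", "others")
            cls = CLASSES.get(vtype.capitalize(), 3)
            # YOLO format: class x_center y_center width height, normalized.
            lines.append(f"{cls} {(left + w / 2) / IMG_W:.6f} "
                         f"{(top + h / 2) / IMG_H:.6f} "
                         f"{w / IMG_W:.6f} {h / IMG_H:.6f}")
        name = f"img{int(frame.get('num')):05d}.txt"
        (out_dir / name).write_text("\n".join(lines))
```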
4.2. Training of the DeepSORT Deep Convolutional Neural Network. The vehicle rerecognition dataset is used to train the DeepSORT deep convolutional neural network so that it can correctly extract the appearance features of the vehicle for the calculation of the cosine distance of appearance features. Since the task requirement is vehicle tracking, the input of the network is set to 128 (h) × 64 (w), according to the aspect ratio of the vehicle image. The network model is built under the PyTorch framework. The initial learning rate is set at 0.1 and is reduced to 0.1 times its value every 40 epochs, as in the sketch below. The training loss curve is shown in Figure 8. After reaching 100 epochs, the loss tends to be stable, and the accuracy on the test set reaches 88%.

Figure 8: Training loss graph of the DeepSORT deep convolutional neural network.
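The schedule described above maps directly onto PyTorch's StepLR, as in the minimal sketch below; the tiny embedding head is a placeholder rather than the actual appearance network of Table 1.

```python
import torch
from torch import nn, optim

# Placeholder embedding head (the real network is given in Table 1).
model = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 128))
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Initial learning rate 0.1, multiplied by 0.1 every 40 epochs.
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(120):
    # ... one training pass over the 128(h) x 64(w) vehicle crops goes here ...
    scheduler.step()
    if epoch in (39, 79, 119):
        print(epoch + 1, scheduler.get_last_lr())  # 0.01, 0.001, 0.0001
```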
4.3. YOLOv5s_DSC Network Training and Result Analysis. The YOLOv5s network model and the YOLOv5s_DSC network model are constructed under the PyTorch framework, respectively, and the UA-DETRAC dataset is used for training. The batch size is 128, and 50 epochs are trained. The training loss is shown in Figure 9. The YOLOv5s_DSC network decreases as fast as YOLOv5s in the regression loss, the classification loss, and the target loss; the lowest values of the three losses of YOLOv5s are 0.01722, 0.0011741, and 0.02758, respectively, while the lowest values of the three losses of YOLOv5s_DSC are 0.01835, 0.0013954, and 0.02933, respectively, which indicates that the introduction of the DSC structure with residual does not bring much impact on the training difficulty of the network.

Figure 9: Comparison of training loss between YOLOv5s and YOLOv5s_DSC.

Comparing the performance of YOLOv5s and YOLOv5s_DSC in mAP, precision, and recall, the curves of the two networks in Figure 10 are almost coincident, which indicates that the introduction of the DSC structure with residuals reduces the number of network parameters but does not cause a decrease in the accuracy of the network. The YOLOv5s_DSC with KF (Kalman filter) is smoother than the YOLOv5s_DSC alone, and the YOLOv5s with KF is smoother than the YOLOv5s alone. This indicates that KF can dynamically adjust its update rate, which helps to better track the targets.

Figure 10: mAP, precision, and recall of YOLOv5s and YOLOv5s_DSC.

In Table 5, a higher mAP (mean average precision) indicates better performance of the detector. Except for the optimal mAP0.5:0.95, the difference between the optimal mAP0.5, precision, and recall of YOLOv5s_DSC and YOLOv5s is not more than 1%, while the number of parameters is reduced by 23.5%, the amount of calculation is reduced by 32.3%, and the size of the weight file is decreased by 20%. In a hardware environment where the graphics card is an NVIDIA GeForce RTX 3080 and the CPU is an Intel(R) Xeon(R) CPU E5-2670 v3, the average processing speed of each image is improved by 18.8%, which proves that the proposed algorithm is faster while ensuring the accuracy.

Table 5: Data comparison of YOLOv5s_DSC and YOLOv5s.

Model         mAP0.5   mAP0.5:0.95   Precision   Recall   Parameters   Calculation (GFLOPS)   Size (M)   Image processing speed (s)
YOLOv5s       0.993    0.865         0.981       0.983    7,071,633    16.4                   14.4       0.016
YOLOv5s_DSC   0.993    0.851         0.980       0.976    5,622,481    12.3                   11.5       0.013

4.4. Verification Experiment. A video of traffic flow captured from the front of the intersection is selected as the input. As shown in Figure 11, the YOLOv5s_DSC vehicle detection algorithm can effectively detect vehicles and correctly classify vehicles in this view. Each detection box contains two pieces of information: the vehicle category name and the category confidence. In the hardware environment shown in Table 6, the detection speed of the algorithm reaches 77 FPS.

Table 6: Hardware environment configuration.

Configuration   Model
CPU             Intel(R) Xeon(R) CPU E5-2670 v3
GPU             NVIDIA GeForce RTX 3080
CUDA            11.0
PyTorch         1.7.1
Python          3.8

Figure 11: YOLOv5s_DSC vehicle detection at the frontal windows.

Next, a video of traffic flow captured from the oblique side of the intersection is selected as the input, and the YOLOv5s_DSC vehicle detection algorithm can also effectively detect and correctly classify the vehicles in this view, as shown in Figure 12. YOLOv5s_DSC can accurately detect vehicles from different angles. As shown in Figure 12(b), local mutual occlusion between vehicles does not affect the detection effect of the algorithm. Therefore, the advantages of the YOLOv5s_DSC algorithm for vehicle detection can provide realtime and accurate vehicle detection information for vehicle tracking.

Figure 12: YOLOv5s_DSC vehicle detection at the oblique side of the intersection.

To test the tracking effect and the occlusion robustness of the algorithm, YOLOv5s_DSC is used as the detector connected to the DeepSORT algorithm. As shown in Figure 13, the tracking boxes of different types of vehicles have different colors, and each tracking box includes a tracking ID in addition to the category and category confidence information of the vehicle. In the hardware environment shown in Table 6, the YOLOv5s_DSC + DeepSORT algorithm achieves a processing speed of 25 FPS. A minimal sketch of this detection-plus-tracking loop is given below.

Figure 13: YOLOv5s_DSC vehicle detection effect in front of an intersection.
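In the sketch, StubDetector and StubTracker are hypothetical stand-ins for the trained YOLOv5s_DSC model and a DeepSORT implementation; OpenCV is used only for video I/O and drawing.

```python
import cv2

class StubDetector:
    """Hypothetical stand-in for the trained YOLOv5s_DSC model."""
    def detect(self, frame):
        # Would return [([x1, y1, x2, y2], confidence, class_name), ...].
        return []

class StubTracker:
    """Hypothetical stand-in for a DeepSORT implementation."""
    def update(self, detections, frame):
        # Would return track objects with .box, .track_id, and .cls.
        return []

detector, tracker = StubDetector(), StubTracker()
cap = cv2.VideoCapture("traffic.mp4")  # assumed input video path
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for t in tracker.update(detector.detect(frame), frame):
        x1, y1, x2, y2 = map(int, t.box)
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, f"{t.cls} id={t.track_id}", (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("YOLOv5s_DSC + DeepSORT", frame)
    if cv2.waitKey(1) == 27:  # stop on Esc
        break
cap.release()
cv2.destroyAllWindows()
```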
Next, the robustness of the proposed algorithm is verified in occlusion scenes. We consider the tracking performance in two occlusion situations: (1) the target is occluded by foreign objects and (2) the targets occlude each other. First, the robustness of the proposed algorithm is verified when the target is occluded by external objects, testing the effect of rerecognition and retracking after the target disappears. A traffic video in which vehicles are blocked by a pillar is selected to verify the algorithm. Figure 14 shows four consecutive images. The dark car with tracking ID 3 reappears after being blocked by a pillar and can still be tracked by the algorithm. This result shows that the algorithm exhibits strong robustness and accuracy in occluded scenes, providing strong support for target tracking in practical applications.

Figure 14: YOLOv5s_DSC + DeepSORT recognition and tracking effect.

We also evaluate the algorithm's performance in scenarios where targets are occluded by other targets. Specifically, we test the algorithm's ability to track targets that are partially occluded by other targets. A video sequence in which a bus partially occludes a car that is being tracked is selected. As shown in Figure 15, the vehicle with tracking ID 4 is partially blocked by the bus with tracking ID 1. Despite the occlusion, the tracking ID of the car remains unchanged. This shows that the proposed algorithm is capable of handling partial occlusions between targets.

Figure 15: Tracking effect of YOLOv5s_DSC + DeepSORT under local mutual occlusion of the target.

These results further demonstrate the robustness and effectiveness of the algorithm in occluded scenes, which is crucial for the practical application of target tracking.

5. Conclusions

This article investigates the application of the DeepSORT algorithm in vehicle tracking, using vehicle flow videos from different scenarios to verify the effectiveness and robustness of the YOLOv5s_DSC vehicle detection algorithm. The YOLOv5s_DSC + DeepSORT algorithm is validated on traffic flow videos in which vehicles occlude each other or disappear and reappear. It is shown that the algorithm has good rerecognition and retracking ability and robustness against partial occlusion of targets. However, the algorithm in this article does not take into account the detection effect in different weather environments such as rainy days and foggy weather, or under vehicle video blurring. In future work, the application of model compression methods will be studied to further compress the network models while maintaining a certain accuracy and improving the speed of network inference, and algorithms such as environment optimization will be combined to achieve more scene applications.

Data Availability

The data used to support the findings of this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors' Contributions

Lixiong Lin conceptualized the study, proposed the methodology, validated the study, and wrote and prepared the original draft; Hongqin He and Dongjie Wu performed formal analysis and investigation, provided resources, and performed data curation; Zhiping Xu wrote, reviewed, and edited the manuscript, visualized the study, and performed project administration. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

The research was supported by the Jimei University Startup Research Project, China, under grant ZQ2022002; by the Education Department Foundation of Fujian Province, China, under grant JAT220169; by the Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application, China, under grant B18208; and by the Xiamen Ocean and Fishery Development Special Fund Project, China, under grant 21CZB013HJ15.

References

[1] H. Chen and J. Guan, "Teacher–student behavior recognition in classroom teaching based on improved YOLO-v4 and internet of things technology," Electronics, vol. 11, no. 23, p. 3998, 2022.
[2] M. Li, L. Zhang, L. Li, and W. Song, "YOLO-based traffic sign recognition algorithm," Computational Intelligence and Neuroscience, vol. 2022, Article ID 2682921, 10 pages, 2022.
[3] K. Du, Y. Ju, Y. Jin, G. Li, Y. Li, and S. Qian, "Object tracking based on improved Mean Shift and SIFT," in Proceedings of the 2012 2nd International Conference on Consumer Electronics, Communications, and Networks, pp. 2716–2719, Yichang, China, April 2012.
[4] L. Liu, C. Ke, H. Lin, and H. Xu, "Research on pedestrian detection algorithm based on MobileNet-YoLo," Computational Intelligence and Neuroscience, vol. 2022, Article ID 8924027, 12 pages, 2022.
[5] X. Mei, H. Ling, Y. Wu, E. P. Blasch, and L. Bai, "Efficient minimum error bounded particle resampling L1 tracker with occlusion detection," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2661–2675, 2013.
[6] G. Yan, G. Qu, and M. Yu, "Real-time L1-tracker based on Haar-like features," Computer Science, vol. 44, pp. 300–306, 2017.
[7] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
[8] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, "Beyond correlation filters: learning continuous convolution operators for visual tracking," in Proceedings of the European Conference on Computer Vision, pp. 472–488, Amsterdam, The Netherlands, October 2016.
[9] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, "ECO: efficient convolution operators for tracking," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6638–6646, Honolulu, HI, USA, July 2017.
[10] J. Li and S. Huang, "YOLOv3 based object tracking method," Electronics Optics and Control, vol. 26, p. 87, 2019.
[11] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. Torr, "Fully-convolutional siamese networks for object tracking," in Proceedings of the European Conference on Computer Vision, pp. 850–865, Amsterdam, The Netherlands, October 2016.
[12] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware siamese networks for visual object tracking," in Proceedings of the European Conference on Computer Vision, pp. 101–117, Munich, Germany, September 2018.
[13] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, "SiamRPN++: evolution of siamese visual tracking with very deep networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4282–4291, Seoul, South Korea, October 2019.
[14] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and real-time tracking," in Proceedings of the IEEE International Conference on Image Processing, pp. 3464–3468, Amsterdam, The Netherlands, October 2016.
[15] N. Wojke, A. Bewley, and D. Paulus, "Simple online and real time tracking with a deep association metric," in Proceedings of the IEEE International Conference on Image Processing, pp. 3645–3649, Honolulu, HI, USA, July 2017.
[16] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, "Machine learning: a review of classification and combining techniques," Artificial Intelligence Review, vol. 26, no. 3, pp. 159–190, 2006.
[17] P. Wang, C. Shen, N. Barnes, and H. Zheng, "Fast and robust object detection using asymmetric totally corrective boosting," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 33–46, 2012.
[18] S. B. Meshram and S. M. Shinde, "A survey on ensemble methods for high dimensional data classification in biomedicine field," International Journal of Computer Application, vol. 111, no. 11, pp. 5–7, 2015.
[19] G. Díaz-San Martín, L. Reyes-González, S. Sainz-Ruiz, L. Rodriguez-Cobo, and J. M. Lopez-Higuera, "Automatic ankle angle detection by integrated RGB and depth camera system," Sensors, vol. 21, no. 5, p. 1909, 2021.
[20] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[21] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face recognition with local binary patterns," in Proceedings of the European Conference on Computer Vision, pp. 469–481, Prague, Czech Republic, May 2004.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Zurich, Switzerland, September 2014.
[23] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[24] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Amsterdam, The Netherlands, October 2016.
[26] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, Honolulu, HI, USA, July 2017.
[27] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[28] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao, "YOLOv4: optimal speed and accuracy of object detection," 2020, https://arxiv.org/abs/2004.10934.
[29] Y. Sasagawa and H. Nagahara, "YOLO in the dark-domain adaptation method for merging multiple models," in Proceedings of the European Conference on Computer Vision, pp. 345–359, Glasgow, UK, August 2020.
[30] M. Krišto, M. Ivasic-Kos, and M. Pobar, "Thermal object detection in difficult weather conditions using YOLO," IEEE Access, vol. 8, pp. 125459–125476, 2020.
[31] Y. Xiao, A. Jiang, J. Ye, and M. Wang, "Making of night vision: object detection under low-illumination," IEEE Access, vol. 8, pp. 123075–123086, 2020.
[32] K. Guo, X. Li, M. Zhang, Q. Bao, and M. Yang, "Real-time vehicle object detection method based on multi-scale feature fusion," IEEE Access, vol. 9, pp. 115126–115134, 2021.
[33] W. Liu, K. Quijano, and M. M. Crawford, "YOLOv5-Tassel: detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 8085–8094, 2022.
[34] M. Bie, Y. Liu, G. Li, J. Hong, and J. Li, "Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once (YOLOv5n-L) approach," Expert Systems with Applications, vol. 213, Article ID 119108, 2023.
[35] J. Wang, Y. Dong, S. Zhao, and Z. Zhang, "A high-precision vehicle detection and tracking method based on the attention mechanism," Sensors, vol. 23, no. 2, p. 724, 2023.
[36] W. Liu, L. Xiong, X. Xia, Y. Lu, L. Gao, and S. Song, "Vision-aided intelligent vehicle sideslip angle estimation based on a dynamic model," IET Intelligent Transport Systems, vol. 14, no. 10, pp. 1183–1189, 2020.
[37] W. Liu, X. Xia, L. Xiong, Y. Lu, L. Gao, and Z. Yu, "Automated vehicle sideslip angle estimation considering signal measurement characteristic," IEEE Sensors Journal, vol. 21, no. 19, pp. 21675–21687, 2021.
[38] X. Liu and Z. Zheng, "VeRi dataset," https://github.com/JDAI-CV/VeRidataset.
[39] L. Wen, D. Du, Z. Cai et al., "UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking," Computer Vision and Image Understanding, vol. 193, pp. 1–20, 2020, https://detrac-db.rit.albany.edu/.
