Research Article
Realtime Vehicle Tracking Method Based on YOLOv5 + DeepSORT
1 School of Ocean Information Engineering, Jimei University, Xiamen 361021, China
2 Zhejiang Dahua Technology Co., Ltd., Hangzhou 310051, China
3 Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
Received 3 January 2023; Revised 20 March 2023; Accepted 22 March 2023; Published 15 June 2023
Copyright © 2023 Lixiong Lin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
In actual traffic scenarios, the environment is complex and constantly changing, with many vehicles that have substantial similarities, posing significant challenges to vehicle tracking research based on deep learning. To address these challenges, this article investigates the application of the DeepSORT (simple online and realtime tracking with a deep association metric) multitarget tracking algorithm in vehicle tracking. Due to the strong dependence of the DeepSORT algorithm on target detection, a YOLOv5s_DSC vehicle detection algorithm based on the YOLOv5s algorithm is proposed, which provides accurate and fast vehicle detection data to the DeepSORT algorithm. Compared to YOLOv5s, YOLOv5s_DSC has no more than a 1% difference in optimal mAP0.5 (mean average precision), precision rate, and recall rate, while reducing the number of parameters by 23.5%, the amount of computation by 32.3%, and the size of the weight file by 20%, and increasing the average processing speed of each image by 18.8%. After integrating the DeepSORT algorithm, the processing speed of YOLOv5s_DSC + DeepSORT reaches up to 25 FPS, and the system exhibits better robustness to occlusion.
SORT. The algorithm considered the motion information and appearance information in the tracking process and resolved the problem of target occlusion. At present, detection-based tracking algorithms still have many problems, such as a lack of datasets, inaccurate target detection, and insufficient realtime performance.

Traditional target detection algorithms rely on image features and classifiers such as SVM (support vector machine) [16], Adaboost [17], Random Forest [18], artificially designed color features [19], gradient features [20], and pattern features [21]. Target detection algorithms based on deep learning have stronger adaptability to complex scenes, including target detection methods based on candidate regions and target detection methods based on regression. The representative algorithm based on candidate regions is the R-CNN (Region-CNN) series [22–24]. Owing to the need to process a large number of candidate frames, such methods face the problem of low efficiency and lack the ability for realtime detection. The regression-based target detection method removes the step of generating candidate regions and improves the speed significantly. It has been widely used for developing realtime target detection systems. The YOLO (you only look once) algorithm [25] proposed in 2016 used a grid to divide an image and generated a series of initial anchor boxes in each grid of the image. By learning to fine tune the initial box, the predicted box was generated to be closer to the actual box. The YOLOv2 algorithm introduced batch normalization and used DarkNet-19 as the backbone network, which could dynamically adjust the input and achieve better precision for small targets [26]. On this basis, the YOLOv3 algorithm used DarkNet-53 as the backbone network, introduced the FPN (feature pyramid network) structure to obtain feature maps at different scales, and used a logistic classifier to predict the category of targets [27]. The YOLOv4 algorithm added data augmentation and self-adversarial training methods at the input end [28]. The backbone network used CSPDarkNet53 and improved the loss function of the output layer, which greatly improved the speed and accuracy. YOLOv5 has the same performance as YOLOv4, but YOLOv5 is faster, with a detection speed of 140 FPS on a Tesla P100. Sasagawa and Nagahara [29] used YOLO to locate and identify objects and proposed a method for detecting objects under low illumination by utilizing the power of transfer learning. Krišto et al. [30] used thermal images with YOLO to improve target detection performance in challenging conditions such as adverse weather, night time, and dense areas. Xiao et al. [31] fused the context information in the YOLO backbone network to avoid the loss of low-level context features, retain lower spatial features, and solve the problem of difficult detection of targets under dim light. Guo et al. [32] designed an improved SSD (single shot multibox detector) detector, which used the method of single data deformation data amplification to transform the color gamut and affine of the original data and could detect targets that were close to each other. To improve feature fusion for small tassel detection, Liu et al. [33] proposed a novel algorithm referred to as YOLOv5-tassel to detect tassels. To enrich feature information and improve the feature extraction ability, Bie et al. [34] proposed an improved YOLOv5 algorithm based on a bidirectional feature pyramid network for multiscale feature fusion. Wang et al. [35] proposed a novel vehicle detection and tracking method for small target vehicles to achieve high detection and tracking accuracy based on the attention mechanism. In summary, research based on the improved YOLOv5 algorithm mainly focuses on the accuracy of small object detection, while research on detection speed and occlusion robustness in vehicle tracking still has great research value. The main contributions of this article are as follows:

(1) To address the large number of vehicles, the fast movement of vehicles, the substantial similarity of vehicle appearance, and vehicle occlusion in actual urban traffic scenes, the DeepSORT algorithm is used for vehicle tracking, which has better realtime performance and tracking robustness than traditional vision-based vehicle tracking methods.

(2) To reduce the calculation amount of YOLOv5s, reduce its inference time, and improve the operation speed, a YOLOv5s_DSC algorithm with faster inference speed is proposed.

(3) Combining YOLOv5s_DSC with the DeepSORT algorithm, the robustness to occlusion of the proposed algorithm is verified, and the realtime performance of the algorithm is tested in the cases of vehicles being occluded by foreign objects or vehicles being occluded by each other.

2. Algorithmic Framework

2.1. Overall Framework. The DeepSORT algorithm adopts a two-stage idea of detection and tracking, using the Kalman filter and the Hungarian algorithm to track the target and introducing a deep convolutional neural network to extract the appearance information of the tracked target for data association, which solves the problem that an occluded target is difficult to track accurately. A stable and accurate vehicle detection result is an important guarantee for the DeepSORT algorithm in the vehicle tracking task. Considering the realtime requirements of realistic application scenarios, the YOLOv5 target detection algorithm is studied in this article. To further reduce the memory and computing resources occupied by the algorithm, a DSC (depthwise separable convolution) structure with a residual connection is introduced into YOLOv5s, and the YOLOv5s_DSC algorithm, with a smaller model and faster speed, is proposed. YOLOv5s_DSC is used as the detector of the DeepSORT algorithm, and its excellent detection accuracy can make tracking more accurate and provide better realtime performance.

2.2. The DeepSORT Algorithm. Figure 1 is a framework diagram of the DeepSORT algorithm. First, the Kalman filter is used to predict the tracking frame information of the tracked target in the next frame, and all the detection frame information is obtained by the target detection algorithm in the subsequent frame.
Figure 1: Framework of the DeepSORT algorithm: detections from the target detector pass through cascade matching and IOU data association, successful matches update the Kalman filter, and tracks are dropped once age > max age.
Then, the Hungarian algorithm is used to find an optimal allocation with minimum cost between all the detection frames and the tracking prediction frames. The cost matrix used in this step contains not only the Mahalanobis distance but also the cosine distance of the appearance features extracted by the deep convolutional neural network. After solving by the Hungarian algorithm, the optimal combination of the prediction frame and the detection frame can be obtained. The DeepSORT algorithm uses cascade matching: the smaller the number of frames since the last successful match, the higher the priority in the current matching. The tracking frame information is updated according to the detection frame information after the matching succeeds, and the tracking frame information of the tracked target in the next frame continues to be predicted according to the tracking information. For the samples that fail to match, the cost matrix is constructed again from the IOU calculation results of the remaining tracking frames and the prediction frames and is then passed to the Hungarian algorithm for solution. After the matching succeeds, the tracking frame is updated according to the detection frame information, and the tracking frame information of the tracked target in the next frame is continuously predicted according to the tracking frame information. Whether the match is successful or not is recorded by marking "true" and "false." A detection frame that fails to match is flagged "false," and three subsequent rounds of investigation are conducted (age counts the rounds and max age = 3). If all three rounds of matching are successful, the flag is changed to "true." For tracking frames that fail to match, if they are marked "false," the tracking task is stopped; if they are marked "true," a lifespan is set. Within the lifespan, the same three rounds of investigation are conducted, and if all three rounds of matching are successful, the mark remains "true."

State estimation methods mainly include state observers and various linear and nonlinear discrete estimators based on the Kalman filter. Liu et al. [36] proposed a novel vehicle sideslip angle estimation algorithm fusing a dynamic model and vision for vehicle dynamic control. A vehicle attitude angle observer based on the square-root cubature Kalman filter (SCKF) is designed in [37] to estimate the roll and pitch and to reject the gravity components induced by the vehicle roll and pitch. For simplicity, this article uses the Kalman filter for state estimation. The prediction equation of the Kalman filter is as follows:

$$\hat{x}_k^- = F_k \hat{x}_{k-1}^+, \qquad P_k^- = F_k P_{k-1}^+ F_k^T + Q_k, \tag{1}$$

where $\hat{x}_k^-$ is the (prior) state estimate at time $k$, $\hat{x}_{k-1}^+$ is the (posterior) state estimate at time $k-1$, $P_k^-$ is the covariance matrix of the state estimate, $F_k$ is the state transition matrix, and $Q_k$ is the process noise covariance. The measurement update equation of the Kalman filter is as follows:
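(The update below is the standard textbook form, with notation assumed consistent with equation (1).)

$$K_k = P_k^- H_k^T \left(H_k P_k^- H_k^T + R_k\right)^{-1}, \qquad \hat{x}_k^+ = \hat{x}_k^- + K_k\left(z_k - H_k \hat{x}_k^-\right), \qquad P_k^+ = \left(I - K_k H_k\right) P_k^-, \tag{2}$$

where $z_k$ is the measurement (the matched detection) at time $k$, $H_k$ is the measurement matrix, $R_k$ is the measurement noise covariance, and $K_k$ is the Kalman gain.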
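Returning to the data association step of Section 2.2, a minimal Python sketch of the cost construction and Hungarian assignment is given below. The blending weight `lam` and the chi-square gating threshold follow the published DeepSORT formulation; the `associate` helper and the toy distance matrices are illustrative assumptions, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(d_maha, d_cos, lam=0.02, gate=9.4877):
    """Match tracks (rows) to detections (cols) from two distance matrices.

    lam blends motion and appearance: cost = lam*d_maha + (1-lam)*d_cos.
    Pairs whose Mahalanobis distance exceeds the chi-square 0.95 quantile
    for 4 degrees of freedom (9.4877) are gated out as implausible.
    """
    cost = lam * d_maha + (1.0 - lam) * d_cos
    cost[d_maha > gate] = 1e5                 # gate implausible pairs
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]

# Toy example: three predicted tracks, two detections.
d_maha = np.array([[1.2, 8.0], [0.5, 3.0], [12.0, 0.9]])
d_cos  = np.array([[0.1, 0.7], [0.6, 0.2], [0.9, 0.15]])
print(associate(d_maha, d_cos))  # [(0, 0), (2, 1)]
```

Tracks and detections left unmatched here would then fall through to the IOU-based association described above.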
Figure 2: Depthwise convolution (3-channel input, 3 filters, 3 output maps).

Pointwise convolution: 3 input maps, 4 filters, 4 output maps.

The kernel size of the pointwise convolution is 1 × 1. Then, the number of parameters of the pointwise convolution is 4,096. Thus, the number of parameters of the DSC structure amounts to 4,672, which is 13,760 lower than that of the first C3 structure in the YOLOv5s network. The backbone of the YOLOv5s network contains four C3 structures, which are replaced by DSC structures in turn. To avoid network degradation caused by replacing with DSC, a residual structure is introduced, as shown in Figure 5.

The introduction of DSC can effectively reduce the number of parameters and make the network model smaller. The comparison of the number of parameters after replacement is shown in Table 3. The number of parameters of each structure includes the parameters of the convolution, bias, and batch normalization in the structure. The improved network framework is presented in Table 4.
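Below is a minimal PyTorch sketch of such a DSC block with a residual connection. The 64-channel width and the 3 × 3 depthwise kernel are assumptions chosen to reproduce the 576 + 4,096 = 4,672 convolution-weight count above, and the SiLU activation follows the YOLOv5 convention rather than anything stated in the text.

```python
import torch
import torch.nn as nn

class DSC(nn.Module):
    """Depthwise separable convolution with a residual connection (sketch)."""

    def __init__(self, channels=64):
        super().__init__()
        # Depthwise: one 3x3 filter per channel -> 3*3*64 = 576 weights.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        # Pointwise: 1x1 convolution mixing channels -> 64*64 = 4,096 weights.
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        # The residual path guards against degradation after replacing C3.
        return self.act(self.bn(self.pointwise(self.depthwise(x))) + x)

block = DSC()
conv_params = sum(p.numel() for m in (block.depthwise, block.pointwise)
                  for p in m.parameters())
print(conv_params)  # 576 + 4096 = 4672, matching the count in the text
```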
Figure 9: Comparison of training loss between YOLOv5s and YOLOv5s_DSC over 50 epochs.
4.4. Verification Experiment. Select a video of traffic flow captured from the front of the intersection as the input. As shown in Figure 11, the YOLOv5s_DSC vehicle detection algorithm can effectively detect vehicles and correctly classify vehicles under this view. Each detection frame contains two pieces of information: the vehicle category name and the category confidence. In the hardware environment shown in Table 6, the detection speed of the algorithm reaches 77 FPS. Select a video of traffic flow captured from the oblique side of the intersection as the input, and the YOLOv5s_DSC vehicle detection algorithm can also effectively detect and correctly classify the vehicles under this view, as shown in Figure 12. YOLOv5s_DSC can accurately detect vehicles from different angles. As shown in Figure 12(b), local mutual occlusion between vehicles does not affect the detection effect of the algorithm. Therefore, the advantages of the YOLOv5s_DSC algorithm for vehicle detection can provide realtime and accurate vehicle detection information for vehicle tracking.

To test the tracking effect and the occlusion robustness of the algorithm, YOLOv5s_DSC is used as the detector and connected to the DeepSORT algorithm. As shown in Figure 13, the tracking boxes of different types of vehicles have different colors, and each tracking box includes a tracking ID in addition to the category and category confidence information of the vehicle. In the hardware environment shown in Table 6, the YOLOv5s_DSC + DeepSORT algorithm achieves a processing speed of 25 FPS.

Next, the robustness of the proposed algorithm is verified in occlusion scenes. Consider the tracking performance in two occlusion situations: (1) the target is occluded by foreign objects and (2) the targets are occluded by each other. First, the robustness of the proposed algorithm is verified when the target is occluded by external objects. The effect of rerecognition and retracking after the target disappears is tested. A traffic video in which vehicles are blocked by a pillar is selected to verify the algorithm. Figure 14 shows four consecutive images. The dark car with tracking ID 3 reappears after being blocked by a pillar and can still be tracked by the algorithm. This result shows that the algorithm exhibits strong robustness and accuracy in occluded scenes, providing strong support for target tracking in practical applications.

We also evaluate the algorithm's performance in scenarios where targets are occluded by other targets. Specifically, we test the algorithm's ability to track targets that are partially occluded by other targets. A video sequence in which a bus partially occludes a car that is being tracked is selected. As shown in Figure 15, the vehicle with tracking ID 4 is partially blocked by the bus with tracking ID 1. Despite the occlusion, the tracking ID of the car remains unchanged. This shows that the proposed algorithm is capable of handling partial occlusions between targets.
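The 25 FPS figure is an end-to-end measurement over the detect-then-track loop. A sketch of such a loop is shown below; since the authors' code is not public, `YOLOv5sDSC` and `DeepSort` are assumed wrapper classes standing in for the modified detector and a typical DeepSORT implementation, and only the loop structure and the FPS bookkeeping are the point.

```python
import time
import cv2

from detector import YOLOv5sDSC   # assumed wrapper around the modified YOLOv5s
from deep_sort import DeepSort    # assumed DeepSORT implementation

detector = YOLOv5sDSC(weights="yolov5s_dsc.pt")
tracker = DeepSort(model_path="ckpt.t7")   # appearance-embedding CNN weights

cap = cv2.VideoCapture("traffic.mp4")
frames, t0 = 0, time.time()
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Detector returns per-vehicle boxes (x, y, w, h), confidences, and classes.
    boxes, scores, classes = detector(frame)
    # DeepSORT associates the detections with existing tracks and assigns IDs.
    tracks = tracker.update(boxes, scores, classes, frame)
    frames += 1

print(f"average processing speed: {frames / (time.time() - t0):.1f} FPS")
```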
mAP0.5, mAP0.9, precision, and recall curves over 50 training epochs.
Figure 12: YOLOv5s_DSC vehicle detection at the oblique side of the intersection.
Figure 15: Tracking effect of YOLOv5s_DSC + DeepSORT under local mutual occlusion of the target.
These results further demonstrate the robustness and effectiveness of the algorithm in occluded scenes, which is crucial for the practical application of target tracking.

5. Conclusions

Acknowledgments

China, under grant JAT220169, by the Xiamen Key Laboratory of Marine Intelligent Terminal R&D and Application, China, under grant B18208, and by the Xiamen Ocean and Fishery Development Special Fund Project, China, under grant 21CZB013HJ15.
References

[12] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, "Distractor-aware siamese networks for visual object tracking," in Proceedings of the European Conference on Computer Vision, pp. 101–117, Munich, Germany, September 2018.
[13] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, "SiamRPN++: evolution of siamese visual tracking with very deep networks," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4282–4291, Seoul, South Korea, October 2019.
[14] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and real-time tracking," in Proceedings of the IEEE International Conference on Image Processing, pp. 3464–3468, Amsterdam, The Netherlands, October 2016.
[15] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in Proceedings of the IEEE International Conference on Image Processing, pp. 3645–3649, Honolulu, HI, USA, July 2017.
[16] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas, "Machine learning: a review of classification and combining techniques," Artificial Intelligence Review, vol. 26, no. 3, pp. 159–190, 2006.
[17] P. Wang, C. Shen, N. Barnes, and H. Zheng, "Fast and robust object detection using asymmetric totally corrective boosting," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 1, pp. 33–46, 2012.
[18] S. B. Meshram and S. M. Shinde, "A survey on ensemble methods for high dimensional data classification in biomedicine field," International Journal of Computer Application, vol. 111, no. 11, pp. 5–7, 2015.
[19] G. Díaz-San Martín, L. Reyes-González, S. Sainz-Ruiz, L. Rodriguez-Cobo, and J. M. Lopez-Higuera, "Automatic ankle angle detection by integrated RGB and depth camera system," Sensors, vol. 21, no. 5, p. 1909, 2021.
[20] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[21] T. Ahonen, A. Hadid, and M. Pietikäinen, "Face recognition with local binary patterns," in Proceedings of the European Conference on Computer Vision, pp. 469–481, Prague, Czech Republic, May 2004.
[22] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587, Zurich, Switzerland, September 2014.
[23] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448, Santiago, Chile, December 2015.
[24] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, 2015.
[25] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, Amsterdam, The Netherlands, October 2016.
[26] J. Redmon and A. Farhadi, "YOLO9000: better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271, Honolulu, HI, USA, July 2017.
[27] J. Redmon and A. Farhadi, "YOLOv3: an incremental improvement," 2018, https://arxiv.org/abs/1804.02767.
[28] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao, "YOLOv4: optimal speed and accuracy of object detection," 2020, https://arxiv.org/abs/2004.10934.
[29] Y. Sasagawa and H. Nagahara, "YOLO in the dark: domain adaptation method for merging multiple models," in Proceedings of the European Conference on Computer Vision, pp. 345–359, Glasgow, UK, August 2020.
[30] M. Krišto, M. Ivasic-Kos, and M. Pobar, "Thermal object detection in difficult weather conditions using YOLO," IEEE Access, vol. 8, pp. 125459–125476, 2020.
[31] Y. Xiao, A. Jiang, J. Ye, and M. Wang, "Making of night vision: object detection under low-illumination," IEEE Access, vol. 8, pp. 123075–123086, 2020.
[32] K. Guo, X. Li, M. Zhang, Q. Bao, and M. Yang, "Real-time vehicle object detection method based on multi-scale feature fusion," IEEE Access, vol. 9, pp. 115126–115134, 2021.
[33] W. Liu, K. Quijano, and M. M. Crawford, "YOLOv5-Tassel: detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 8085–8094, 2022.
[34] M. Bie, Y. Liu, G. Li, J. Hong, and J. Li, "Real-time vehicle detection algorithm based on a lightweight You-Only-Look-Once (YOLOv5n-L) approach," Expert Systems with Applications, vol. 213, Article ID 119108, 2023.
[35] J. Wang, Y. Dong, S. Zhao, and Z. Zhang, "A high-precision vehicle detection and tracking method based on the attention mechanism," Sensors, vol. 23, no. 2, p. 724, 2023.
[36] W. Liu, L. Xiong, X. Xia, Y. Lu, L. Gao, and S. Song, "Vision-aided intelligent vehicle sideslip angle estimation based on a dynamic model," IET Intelligent Transport Systems, vol. 14, no. 10, pp. 1183–1189, 2020.
[37] W. Liu, X. Xia, L. Xiong, Y. Lu, L. Gao, and Z. Yu, "Automated vehicle sideslip angle estimation considering signal measurement characteristic," IEEE Sensors Journal, vol. 21, no. 19, pp. 21675–21687, 2021.
[38] X. Liu and Z. Zheng, "VeRi dataset," https://github.com/JDAI-CV/VeRidataset.
[39] L. Wen, D. Du, Z. Cai et al., "UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking," Computer Vision and Image Understanding, vol. 193, pp. 1–20, 2020, https://detrac-db.rit.albany.edu/.