Batch - 5 Base Paper
Highlights
YOLOv5 model was used for detection, and the DeepSORT model was used for tracking to study the vehicle occlusion problem.
The power of quantum computing with the alternating direction method of multipliers (ADMM) optimizer was leveraged.
The multiple object tracking accuracy (MOTA) was 16% higher than that of the regular YOLOv5-DeepSORT.
A 6% multiple object tracking precision (MOTP) increase and a 6% identification metrics (F1) score increase were observed.
Article history:
Received 15 August 2022
Received in revised form 13 May 2023
Accepted 15 May 2023
Available online 24 January 2024

Keywords:
Traffic classification
Traffic counting
DeepSORT
YOLOv5
Quantum computing

Abstract:
Inaccuracies of traffic sensors during traffic counting and vehicle classification have persisted as transportation agencies have been prompted to calibrate sensors periodically. Detection of multiple objects, heavy occlusions, and similar appearances in congested places are some causes of computer vision model inaccuracies. This paper used the YOLOv5 model for detection and the DeepSORT model for tracking objects. Due to the nature of the reported problem caused by many misses and mismatches, the power of quantum computing with the alternating direction method of multipliers (ADMM) optimizer was leveraged. A basic Kalman filter and the Hungarian algorithm features were used in combination with a quantum optimizer to present robust multiple object tracking (MOT) algorithms. This hybrid combination of the classical and quantum models has sped up the learning of occlusions during frame matching of tracks and detections by generating the minimum quantum cost function value. Comparisons with the existing models indicated a significant increase in the primary MOT metric, multiple object tracking accuracy (MOTA), which was 16% higher than that of the regular YOLOv5-DeepSORT model when using a quantum optimizer. Also, a 6% multiple object tracking precision (MOTP) increase and a 6% identification metrics (F1) score increase were observed using the quantum optimizer, with identity switching reduced from 6 to 4. This model is expected to assist transportation officials in improving the accuracy of traffic counts and vehicle classification and reduce the need for regular computer vision software calibration.
* Corresponding author.
E-mail addresses: [email protected] (F. Ngeni), [email protected] (J. Mwakalonge), [email protected] (S. Siuhi).
Peer review under responsibility of Periodical Offices of Chang'an University.
https://doi.org/10.1016/j.jtte.2023.05.006
2095-7564/© 2024 Periodical Offices of Chang'an University. Publishing services by Elsevier B.V. on behalf of KeAi Communications Co. Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
J. Traffic Transp. Eng. (Engl. Ed.) 2024; 11 (1): 1-15
Fig. 1 - Vehicle tracking. (a) Complex scenario of tracking vehicles in congested roadway condition. (b) Detection using
bounding boxes. (c) Movements of vehicles across different frames and what the machine detects during losses. (d) The
expectation for the machine to detect across the frames.
selective region search in the fixed image size but region-based fully convolutional networks (R-FCN), feature pyramid network (FPN), and mask RCNN have improved the feature extraction methods, selection, and classification capabilities of CNN in various ways (Girshick et al., 2014; He et al., 2017). These are all termed two-stage detection methods. The one-stage methods include the single shot multibox detector (SSD) and the famous YOLO model frameworks (Liu et al., 2016; Redmon et al., 2016).

Most of the multi-object tracking models have employed detection-based tracking (DBT) and detection-free tracking (DFT) for object initialization (Song et al., 2019). The more prominent DBT method utilizes background modeling to detect moving objects in video frames before tracking starts because it considers the problems of similarity of inter-frame objects and intra-frame objects. Based on this, multiple object tracking (MOT) has been simplified and classified into two, which are online and offline tracking. Online tracking processes two frames at a time and has good real-time application, but it is difficult to recover from occlusions. Offline tracking processes a batch of frames, and it is robust to recover from short-term occlusions and suitable for video analysis. The latter is not suitable for real-time applications, but most MOT methods depend on it for initialization (Milan et al., 2017). Simple online and real-time tracking (SORT) is a representative of the online method, which is based on rudimentary data association and state estimation techniques such as the Kalman filter (KF) and the Hungarian algorithm for the tracking (Bewley et al., 2016). Moreover, Milan et al. (2016) proposed the first online MOT algorithm based on deep learning with high performance on the benchmarks (Milan et al., 2016).

In solving the occlusion problem, Han et al. (2009) used Kanade-Lucas-Tomasi (KLT) feature to represent an object and a trajectory estimation algorithm by considering the weighting function of tracked features. This function achieved desirable results by removing tracking errors for partially occluded objects only (Han et al., 2009). Later, Shu et al. (2012) developed a more discriminant and robust model against appearance changes and occlusions by excluding parts of the background within the detection window; however, it failed with total occlusions also (Shu et al., 2012). Another study by Gabriel et al. (2003) approached the problem by splitting it into two groups namely merge-split (MS) and straight-through (ST). In the MS approach, as the blobs are declared occluded, the system merges them into a new blob characterized by new attributes until they split again (Gabriel et al., 2003). This creates the problem of identifying the objects again after splitting, as MS does not address the problem. In another case, the ST approach builds a model for each object, and it does not suffer from MS problems since they classify any pixel in the occlusion region as it belongs to one of the occluded objects. However, studies suggest it is still insufficient (Gabriel et al., 2003).

In further study to solve the occlusion problem, Huang and Essa (2005) presented an idea on object permanence to reason about occlusions by using region-level and object-level tracking. At a region-level, a genetic algorithm is used to find optimal region tracks by associating foreground (FG) regions from respective frames. At the same time, at the object-level, every object is located based on adaptive appearance models, spatial distributions, and inter-occlusion relationships (Huang and Essa, 2005). One of the disadvantages of the model is there are no motion, shape, or size assumptions, but it can perceive the persistence of object occlusions even when they re-emerge from the occludes. The ignored motion, shape, and size assumptions are insufficient (Huang and Essa, 2005). Later the idea of object permanence was extended by Papadourakis and Argyros (2010) model that did not require prior training to account for the shape, size, or motion differences. The model autonomically and dynamically builds appropriate object representations. It answered how to model objects and generated powerful data association mechanisms to be employed as it also answered how to handle long-term occlusions (Papadourakis and Argyros, 2010).

During the last decade, detection has improved significantly using the SORT algorithm (Kline et al., 2019). Still, the DeepSORT has made it better where the cost used during the first matching step on frames is set as a combination of the Mahalanobis and the cosine distances functions (Hou et al., 2019; Wojke et al., 2017). For the persisting occlusion problem in multi-object tracking, Wojke et al. (2017) proposed the DeepSORT algorithm by introducing cascade matching based on SORT and matching the association process between the prediction and detection frame of the targets by the Hungarian algorithm. Cascading reduced the occlusion problem, but low light intensity still led to misses during detection (Wojke et al., 2017). Further improvements on DeepSORT have led to the introduction of acceleration parameter components and global trajectory generation mechanisms, but only a slight improvement has been reported (Chen et al., 2020). The current performance comparison between YOLOv3 and YOLOv5 indicates an improved performance in the tracking of objects using the DeepSORT algorithm and MOT-17 datasets (Gai et al., 2021). Table 1 shows the performance metrics comparison between YOLOv3 with DeepSORT and YOLOv5 with DeepSORT.

Using quantum computing, scholars have studied solving the occlusion problem using state-of-the-art non-maximum suppression (NMS) to enable the image-retrieving process while removing redundant objects. The process is based on eliminating false positives by keeping frames with the highest detection scores. However, it suffers during detection under occlusion, where true positives with lower detections are suppressed (Li and Ghosh, 2020). Quantum computing can be used to remove redundant detections in the quadratic unconstrained binary optimization (QUBO) framework with scores from every bounding box and overlap ratio between pairs of bounding boxes optimized (Hu and Ni, 2019; Li and Ghosh, 2020). The generated QUBO optimization problem was

Table 1 - Performance metrics comparison between YOLOv3 and YOLOv5.
Clear MOT metrics    MOTA (%)   IDF1 (%)   FP     FN     IDS
YOLOv3-DeepSORT      50.2       69.5       1971   1413   29
YOLOv5-DeepSORT      60.1       76.0       156    227    6
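The QUBO view of redundant-detection removal described above can be illustrated with a small sketch. The objective used here (reward each kept box by its score, penalize each kept pair by its overlap) and the `penalty` weight are assumptions for illustration, not the formulation of the cited studies, and the exhaustive search only stands in for a quantum or annealing solver.

```python
from itertools import product


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def qubo_energy(x, scores, boxes, penalty):
    """Assumed objective: -sum(score_i * x_i) + penalty * sum(IoU_ij * x_i * x_j)."""
    reward = sum(s * xi for s, xi in zip(scores, x))
    overlap = sum(
        iou(boxes[i], boxes[j]) * x[i] * x[j]
        for i in range(len(x))
        for j in range(i + 1, len(x))
    )
    return -reward + penalty * overlap


def solve_qubo_brute_force(scores, boxes, penalty=2.0):
    """Try every binary keep/drop vector; feasible only for a handful of boxes."""
    best = min(
        product((0, 1), repeat=len(scores)),
        key=lambda x: qubo_energy(x, scores, boxes, penalty),
    )
    return list(best)


# Two heavily overlapping boxes and one isolated box (hypothetical data).
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.6, 0.8]
keep = solve_qubo_brute_force(scores, boxes)
print(keep)
```

With these hypothetical inputs the lower-scoring box of the overlapping pair is dropped while the isolated box is kept, which is the behaviour the QUBO framing is meant to encode.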
generating a powerful image augmentation interface. Fig. 4 shows how the three-axle truck image was augmented.

3.2. Model training

The dataset was split into two sets, with 80% for training and 20% for testing or validation. An additional 300 images were used as background images without their labels to reduce the effect of false negatives (FN) and false positives (FP) and so increase the model's accuracy. This set was also split into 80% and 20%; the first was included in the training set and the latter in the validation set. The model was trained on a graphics processing unit (GPU) machine with NVIDIA GeForce GTX 1650 and CUDA version 11.6. To achieve desirable results, the model was trained at different parameter settings with a default image size of 416 pixels, a batch size of 40, 100 epochs, and a 0.01 learning rate. These parameters were adjusted to obtain optimal values with the stochastic gradient descent (SGD) optimizer.

3.3. The YOLOv5 detection model

The YOLOv5 system architecture used for the detection consists of three parts, namely, the backbone (cross stage partial darknet (CSPDarknet)), the neck (path aggregation network (PANet)), and the head (YOLO layer) (Gai et al., 2021). Data were fed to the CSPDarknet for feature extraction through a cross-stage hierarchy, then the PANet for feature fusion by enhancing the instance segmentation process by preserving spatial information. Finally, the YOLO layer produces detection results: class, score, location, and size.

The model input adopted mosaic data enhancement with adaptive anchor frame calculation and adaptive image scaling for different datasets for easy feed in the future parts of the model. The backbone has a focused structure with two CSPs designed whereby one is applied in the backbone and the other in the neck. The model has a neck to simplify the collection of feature maps by connecting the head and backbone.

3.4. The DeepSORT target-tracking algorithm

The DeepSORT tracking algorithm is a detection-based, multi-target tracking algorithm with better robustness than the SORT algorithm, though it needs powerful computers when the model is extensive. It uses either a YOLO layer or a faster region-convolutional neural network (R-CNN) as a detector for target detection in each frame (Gai et al., 2021). However, faster R-CNN is a two-stage detection process that needs to extract the region of interest using the region extraction technique and then detect the target for the specific region. These steps make it tedious, leading to a slower detection rate and a slower tracking process. DeepSORT nevertheless has two improvements over SORT. It combines motion information with apparent target features as the matching criteria for the same target. It uses a combination of cascade matching and intersection over union (IoU) matching to reduce tracking errors due to occlusion. The DeepSORT is equipped with the following vital algorithms.

3.4.1. Kalman filtering
The Kalman filter (KF) is also called linear quadratic estimation (LQE). The algorithm uses a series of prior measurements observed over time (statistical noise included) to produce more accurate estimates of unknown variables (Busu and Busu, 2021; Kalman, 1960). The KF has been incorporated in the DeepSORT algorithm to accurately predict the tracked target position based on the target's initial motion state through optimal estimation of the overall system state.

3.4.2. Hungarian algorithm (Kuhn-Munkres algorithm)
For matching predictions from the KF and detection targets, the Kuhn-Munkres (KM) algorithm was used. The KF generates optimal states and predicts bounding boxes of the target state that are later matched with the detection target derived from the detection algorithm. This matching process is in two stages (Gai et al., 2021). Initially, the cost matrix is calculated, which is defined as the weighted value of IoU distance and appearance similarity distance between the predicted and detected targets. Then the total matching cost is minimized, and a matching matrix is returned containing the flags of the predicted and detected targets that have been matched and that have failed to match. This is done by searching for the optimal solution to an assignment bipartite graph. From the matching stage in the Hungarian algorithm, the cost function optimization was passed to a quantum optimization model to minimize the total matching cost.
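The two-stage matching described for the Hungarian algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: the weight of 0.5, the gating threshold `max_cost`, and all function names are hypothetical, and an exhaustive search over assignments is used in place of the Kuhn-Munkres algorithm (it reaches the same minimum-cost assignment, but only scales to a few tracks).

```python
from itertools import permutations


def combined_cost(iou_dist, app_dist, weight=0.5):
    """Stage 1: weighted sum of IoU distance and appearance distance."""
    return [
        [weight * i + (1.0 - weight) * a for i, a in zip(iou_row, app_row)]
        for iou_row, app_row in zip(iou_dist, app_dist)
    ]


def min_cost_assignment(cost):
    """Stage 2: minimum total cost one-to-one assignment by exhaustive search."""
    n_rows, n_cols = len(cost), len(cost[0])
    if n_rows > n_cols:
        # More tracks than detections: solve the transposed problem.
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in min_cost_assignment(transposed))
    best_cols = min(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(cost[r][c] for r, c in enumerate(cols)),
    )
    return list(enumerate(best_cols))


def associate(cost, max_cost=0.7):
    """Return matched (track, detection) pairs plus the unmatched flags."""
    pairs = [(t, d) for t, d in min_cost_assignment(cost) if cost[t][d] <= max_cost]
    matched_tracks = {t for t, _ in pairs}
    matched_dets = {d for _, d in pairs}
    unmatched_tracks = [t for t in range(len(cost)) if t not in matched_tracks]
    unmatched_dets = [d for d in range(len(cost[0])) if d not in matched_dets]
    return pairs, unmatched_tracks, unmatched_dets


# Two predicted tracks, three detections (hypothetical distances in [0, 1]).
iou_dist = [[0.1, 0.9, 0.8], [0.85, 0.2, 0.9]]
app_dist = [[0.2, 0.7, 0.9], [0.9, 0.1, 0.8]]
cost = combined_cost(iou_dist, app_dist)
pairs, lost_tracks, new_dets = associate(cost)
print(pairs, lost_tracks, new_dets)
```

In this toy case both tracks are matched and the third detection is left unmatched, so a tracker would typically start a new tentative track for it.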
3.5. Quantum optimization systems

3.5.1. Cost function
The cost function assigns a numerical score to each prospective solution so that solutions can be compared and the most suitable one, the optimal solution, chosen; this is typically the one with the lowest cost value. In quantum computing optimization, the laws of quantum physics apply: the Hamiltonian function takes the role of the cost function, and its cost value is termed the system energy. Each solution chosen is termed a state, and the lowest-energy state is called the ground state. Usually, the cost function is defined as a mathematical expression of the problem's parameters and variables. During optimization, constraints are used such that relationships between multiple variables must be satisfied for a solution to be valid. It is normal for solutions that violate the pre-defined constraints to be assigned higher cost values or penalties by the cost function or, in other cases, be excluded explicitly by the optimization solver.

3.5.2. Optimization models
The discussed power of quantum computing provides the potential to solve problems that are practically infeasible on classical computers or, in other cases, to speed up the regular classical solutions. A quantum combinatorial optimization (QCO) algorithm was used to find an optimal object from a finite set of available alternative solutions. The problem was phrased as the minimization of an objective expressed as a sum of functions. Such problems are usually non-deterministic polynomial-time hard (NP-hard), so approximate optimization is employed to find an approximate solution.

Because of the computational errors that prevail on noisy intermediate-scale quantum (NISQ) devices, particularly when many gates are used, a hybrid classical-quantum model was developed. The quantum processing unit (QPU) computed the system energy for a given set of parameters from the classical computer, and the later steps were done on a classical computer.

For the given constrained optimization problem, the alternating direction method of multipliers (ADMM) convex optimizer with an operator splitting algorithm was used. It is known to have residual, objective, and dual variable convergence properties if convexity assumptions hold. Such problems can also be solved using a QUBO quantum device via variational algorithms. One of the positive parts of ADMM is its continuous convex-constrained subproblem, which can effectively be solved with both quantum and classical optimizers.

3.5.2.1. Model constraints. The variable constraints of the regular DeepSORT model are usually set by trial and error to achieve an optimum combination in state-of-art models. In this model, the variables were incorporated into the model as equations. Before these constraints were passed through the quantum optimization algorithm, normal DeepSORT was allowed to proceed so that the full cost matrix could be captured for optimization. Below are the constraints and their descriptions.

(1) Maximum matching threshold distance (u)

A new bounding box ($B_b$) from the detection algorithm will be assigned to a track if the cost is minimal, as in Eq. (1). The cost function is computed as the sum of the linear combination of the Euclidean distance between centroids ($C_{b-1}$) of the previous bounding box assigned to the track ($T_{b-1}$) and the detected centroids ($C_b$) and the absolute area difference of the detected bounding boxes area ($P_b$) and the previous bounding box area ($P_{b-1}$) assigned to the track ($T_{b-1}$).

$$B_b(C_b, P_b) : b = \underset{i}{\arg\min} \sum_{i \in F} \left( u\, d_2(C_{b-1}, C_i) + (1 - u)\,\lvert P_i - P_{b-1} \rvert \right), \quad u \in [0, 1],\; d_2(C_{b-1}, C_i) < T \tag{1}$$

where $T$ is the distance between centroids in consecutive frames, $u$ is the adjustable parameter that determines the displacement's relative influence and the bounding box area change in consecutive frames, and $F$ is the set of all bounding boxes within the current frame. This distance controls the maximum allowable distance between a detected bounding box and the previous bounding box assigned to a track. It negatively influences the association matrix because as it increases, the association probability decreases; hence larger cost values are predicted.

(2) Maximum IoU threshold (v)

This value refers to the threshold value used to determine the extent to which the bounding boxes should overlap when determining the identities of the unassigned tracks. It is contrary to the IoU value that ranges from 0 to 1 used to specify the extent of overlap between the predicted and ground truth bounding box during object tracking. It is expected that the greater the number, the lesser the probability that association can be achieved. Lower probabilities are usually assigned larger cost values; hence keeping the value optimal is necessary. After assigning the extent the bounding boxes should overlap, the linear assignment by matching cascade is formulated to compute the cost matrix between each detected bounding box $D_i,\ i \in \{1, 2, \dots, N\}$ and all predicted bounding boxes $P_i,\ i \in \{1, 2, \dots, M\}$ within a frame with the IoU as a metric shown in Eq. (2).

$$\mathrm{IoU}(D, P) = \begin{bmatrix} \mathrm{IoU}(D_1, P_1) & \cdots & \mathrm{IoU}(D_1, P_M) \\ \mathrm{IoU}(D_2, P_1) & \cdots & \mathrm{IoU}(D_2, P_M) \\ \vdots & & \vdots \\ \mathrm{IoU}(D_N, P_1) & \cdots & \mathrm{IoU}(D_N, P_M) \end{bmatrix} \tag{2}$$
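The first two constraints can be sketched numerically as follows. This is an illustrative reading, not the authors' implementation: Eq. (1) is interpreted as picking the box in the current frame with the smallest gated cost, boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and the values of `u` and `max_dist` (standing in for $T$) are hypothetical.

```python
import math


def centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def best_box_for_track(prev_box, frame_boxes, u=0.5, max_dist=50.0):
    """Eq. (1) as read here: argmin over boxes of u*d2 + (1-u)*|area change|,
    considering only boxes whose centroid distance d2 is below max_dist."""
    c_prev, p_prev = centroid(prev_box), area(prev_box)
    best_index, best_cost = None, math.inf
    for i, box in enumerate(frame_boxes):
        d2 = math.dist(c_prev, centroid(box))
        if d2 >= max_dist:
            continue  # gated out by the distance threshold
        cost = u * d2 + (1.0 - u) * abs(area(box) - p_prev)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


def iou(a, b):
    """Intersection area divided by union area of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def iou_matrix(detections, predictions):
    """Eq. (2): N x M matrix with entry [i][j] = IoU(D_i, P_j)."""
    return [[iou(d, p) for p in predictions] for d in detections]


prev_box = (0, 0, 10, 10)
frame_boxes = [(2, 0, 12, 10), (100, 100, 110, 110), (0, 0, 20, 20)]
index, cost = best_box_for_track(prev_box, frame_boxes)

detections = [(0, 0, 10, 10), (20, 20, 30, 30)]
predictions = [(0, 0, 10, 10), (5, 0, 15, 10), (20, 20, 30, 30)]
matrix = iou_matrix(detections, predictions)
print(index, cost, matrix)
```

Note that Eq. (1) adds a distance to an area difference, so in practice the two terms would likely need comparable scales; the sketch leaves them unscaled, as in the equation.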
The IoU between the detected and predicted bounding box is given by Eq. (3).

$$\mathrm{IoU}(D_i, P_i) = \frac{D_i \cap P_i}{D_i \cup P_i} \tag{3}$$

(3) Maximum number of misses before a track is deleted (x)

This refers to the maximum aging of the detected tracks or the maximum number of consecutive misses before the track state is set to be deleted. Newly created tracks that go missing are usually classified as tentative until enough evidence has been collected to delete them. Then, the track state is changed to confirmed, and tracks that are no longer alive are classified as deleted to mark them for removal from the set of active tracks. The loss of tracking targets and missed matches are prone to happen when trackers perform poorly, or the target number is large. This is more predominant during occlusion events in a congested area. Hence, it has a negative effect on association, and a penalty is given to the cost function.

(4) Number of frames that a track remains in the initialization phase (y)

The detection-based tracking requires manual initialization of a fixed number of objects in the first frame before localizing them in subsequent frames. This initialization stage is fundamental to the tracking algorithm, and the larger the number of frames a track remains uninitialized, the less the accuracy. However, it has a minimal effect because detection in the next frame can compensate for the losses. Hence the association will be reduced to some extent but not the prevalent case.

(5) Maximum size of the appearance descriptors gallery (z)

A vector that can describe all the features of a given image in DeepSORT is achieved by building a classifier over a predefined dataset, training, and then utilizing the final classification layer. The dense layer capable of producing a feature vector for classification, called the object appearance descriptor, is generated. Once training was done, we passed all the crops of the detected bounding box from the image to this network and obtained a definite dimensional feature vector. It directly influences the Mahalanobis distance, so that the updated distance metric becomes Eq. (4).

$$D = \lambda D_k + (1 - \lambda) D_a \tag{4}$$

where $D_k$ is the Mahalanobis distance, $D_a$ is the cosine distance between the appearance feature vectors, and $\lambda$ is the weighting factor.

This incorporation in DeepSORT has enabled tracking objects through more extended periods of occlusions, effectively reducing the number of identity switches (Wojke et al., 2017). It indicates that the larger the value, the higher probability of association, and it directly affects the cost function's value.

3.5.2.2. Objective function. After the formulation of a full cost matrix in the regular DeepSORT, the tracks and detections were fed to the quantum optimization algorithms to minimize the cost function value defined by $f(\mathrm{Qopt})_i$ in Eq. (5), where $i$ denotes the ith frame in the matching process. These variables were assumed to have equal weights.

$$f(\mathrm{Qopt})_i = u_i + v_i + x_i + y_i + z_i \tag{5}$$

where $u_i$ is the maximum matching threshold distance, $v_i$ is the maximum IoU threshold or gating threshold, $x_i$ is the maximum number of misses before a track is deleted, $y_i$ is the number of frames that a track remains in the initialization phase, and $z_i$ is the maximum size of the appearance descriptors gallery.

Fig. 5 shows the consolidated diagram of the model combining the regular DeepSORT model and quantum optimization algorithm to speed up learning of the occlusions while utilizing the powers of the hybrid quantum model.

4. Vehicle counting and classification analysis

4.1. Analysis of a short-duration video

The initial output of the model after incorporating training weights was the counting and classification of vehicles according to the FHWA (2014) classes essential in the design of highway pavements. The counting of the vehicles on the 3-min video was analyzed by establishing the ground truth counts using manual counts from the video.

The model counts without quantum optimization and with quantum optimizer were recorded to compare the counts in terms of misses (FN), mismatches (FP), and identity switches (ID switches). Table 2 shows that the number of misses was reduced by comparing the southbound (SB) and northbound (NB) counts in the model with and without a quantum optimizer. Furthermore, the mismatches on the buses were removed as the model was able to differentiate them from trucks. The identity switching was also reduced, especially in the number of cars, from five cars to only three cars along the NB.

4.2. Analysis of a long-duration video

This study used a 34-min video recording to check the model's effectiveness over a long duration for vehicle counting and classification. The video was analyzed similarly to the previous short video by establishing the ground truth counts using manual counts. Fig. 6 shows the model counts compared to established ground truth counts. It shows a slight difference in the number of counts compared to the ground truth as the number of identity switches increases with traffic. The northbound traffic was higher than the southbound traffic, explaining the higher difference. The number of cars has been scaled down by 100 to simplify visualization. Further data analysis was carried out to study the extent of the difference in traffic counts.

Fig. 7 shows the difference between the model and ground-truth counts observed. It was observed that the counts for cars, buses, and 6-axle trucks differ more than those of other classes, but this can be explained by their composition in the total traffic. There were a total of 2422 vehicles southbound and 2551 vehicles northbound. The model generated a total of 2286 vehicles southbound and 3018 vehicles northbound. The positive value on the graph indicates the number of counts not observed by the model, and the negative value shows the increase in the number of counts due to switching other identities. Further observation in traffic data indicated there was switching of identities of 5-axle trucks and 6-axle trucks to buses as the number of buses increased. Similarly, it was observed that cars were switching identities between themselves as the car volume increased.

According to FHWA (2014), the traffic data measurements become unacceptable when the percentage of unclassified vehicles exceeds 4% for a single lane. Table 3 shows the differences between the ground truth and the model counts. The roadway had three lanes on each side, and on average, the error in counts was 2% per class, with the error in the number of cars along the 3 lanes reaching 17%, equivalent to a 5% increase per lane in the NB direction. This error is due to the identity-switching problem reported.

5. Performance comparison with other models

The model performance was tested on the MOT datasets and compared with other tracking models. MOT has different metrics that are useful during comparison. MOTP, MOTA, F1
score, number of frames, number of matches, number of track switches, number of false positives (false alarms), and number of misses are some of them. The following is the description of the main MOT performance metrics according to the classification of events, activities, and relationships (CLEAR).

Fig. 6 - Vehicle counts on a model with the quantum optimizer and ground truth.

5.1. MOT metrics

The classification of events, activities, and relationships (CLEAR) MOT metrics are used to summarize other metrics listed earlier and they include MOTP, MOTA, and F1 score. With occlusion and cascading as the main challenges for the MOT models, association, detection, and localization errors are expected. Cascading, sometimes called fragmentation, is defined as the loss of tracking by the model between consecutive frames and once the detection re-emerges, it is assigned a new identity. It can be seen in Fig. 8(a) where a pedestrian was assigned ID-24 in the first frame, is lost in the second frame, then emerges in the third frame with ID-47. Fig. 8(b) shows how occlusion is experienced during vehicle tracking on a congested road where only a few vehicles can be detected.

These localization and association errors are estimated for every single frame at a time (t) in the series of frames. Then the final MOT metric value. These values are defined as follows.

(1) Multiple object tracking accuracy (MOTA)

It is a primary measure of the overall accuracy that considers both detection and association errors. MOTA deals with both tracker output and detection output and is computed at time t as in Eq. (6).

$$\mathrm{MOTA} = 1 - \frac{\sum_{t=1}^{N} \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t \right)}{\sum_{t=1}^{N} \mathrm{GT}_t} \tag{6}$$

(2) Multiple object tracking precision (MOTP)

where $d_t^i = 1 - \mathrm{IoU}$ is the distance between the localization of an object in the ground truth and the detection at time t, and $\mathrm{TP}_t$ are the total matches made between ground truth and the detection (number of true positives) at time t.

(3) Identification metrics (IDF1) or F1 score

This also has good attributes in measuring the association accuracy rather than detection, hence it is considered the secondary metric. It is described as the ratio of correctly identified detections to the average of ground truth and generated detections with the Hungarian algorithm involved in selecting trajectories and can be computed as in Eq. (10). It usually combines the ID_precision (Eq. (9)) and ID_recall (Eq. (8)).
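As a worked illustration of the MOTA and IDF1 definitions above, the following sketch computes both from hypothetical counts. The per-frame dictionary layout and the argument names `idtp`, `idfp`, and `idfn` (identity true positives, false positives, and false negatives) are assumptions made for the example; the IDF1 function follows the verbal definition given in the text, since Eqs. (8) to (10) are not reproduced here.

```python
def mota(frames):
    """Eq. (6): 1 - sum_t(FN_t + FP_t + IDS_t) / sum_t(GT_t)."""
    errors = sum(f["fn"] + f["fp"] + f["ids"] for f in frames)
    ground_truth = sum(f["gt"] for f in frames)
    if ground_truth == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - errors / ground_truth


def idf1(idtp, idfp, idfn):
    """Correctly identified detections divided by the average of the
    ground-truth count (idtp + idfn) and the generated count (idtp + idfp)."""
    num_ground_truth = idtp + idfn
    num_generated = idtp + idfp
    return idtp / ((num_ground_truth + num_generated) / 2.0)


# Hypothetical two-frame sequence with ten ground-truth objects per frame.
frames = [
    {"fn": 1, "fp": 0, "ids": 0, "gt": 10},
    {"fn": 0, "fp": 1, "ids": 1, "gt": 10},
]
mota_value = mota(frames)   # 1 - 3/20
idf1_value = idf1(idtp=8, idfp=2, idfn=2)
print(mota_value, idf1_value)
```

Because the error counts are summed over all frames before dividing, a long sequence with a few bad frames can still score well, and MOTA can become negative when the errors outnumber the ground-truth objects.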
Fig. 8 - Challenges for the MOT models. (a) Cascading problem during pedestrian tracking. (b) Vehicle occlusion problem
during tracking.
The MOTP value of 91.97% compared to the GT in Table 4 indicates the system is superior in the average localization considering the area occupied on average by a person. However, it is essential to note the influence of the threshold value when discussing the results. It is well noted that once the threshold is set to a higher value, the impacts will be felt on the MOTA value in measuring the correct

Table 5 - Results of the quantum model in comparison with other models.
Model                          MOTA (%)   MOTP (%)   IDF1 (%)   IDS
YOLOv3-DeepSORT                50.32      65.46      69.50      29
YOLOv5-DeepSORT                60.18      86.26      76.00      6
YOLOv5-DeepSORT and quantum    76.21      91.97      82.09      4
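The headline gains can be checked directly from the Table 5 values. The short sketch below reads the reported 16% and 6% improvements as percentage-point differences against the regular YOLOv5-DeepSORT row; the dictionary layout and function name are illustrative only.

```python
# Values copied from Table 5.
table5 = {
    "YOLOv3-DeepSORT": {"MOTA": 50.32, "MOTP": 65.46, "IDF1": 69.50, "IDS": 29},
    "YOLOv5-DeepSORT": {"MOTA": 60.18, "MOTP": 86.26, "IDF1": 76.00, "IDS": 6},
    "YOLOv5-DeepSORT and quantum": {"MOTA": 76.21, "MOTP": 91.97, "IDF1": 82.09, "IDS": 4},
}


def point_gain(metric, base="YOLOv5-DeepSORT", new="YOLOv5-DeepSORT and quantum"):
    """Difference in percentage points between two rows of the table."""
    return round(table5[new][metric] - table5[base][metric], 2)


gains = {m: point_gain(m) for m in ("MOTA", "MOTP", "IDF1")}
relative_mota = 100.0 * gains["MOTA"] / table5["YOLOv5-DeepSORT"]["MOTA"]
print(gains, round(relative_mota, 2))
```

The differences come out at roughly 16, 6, and 6 points for MOTA, MOTP, and IDF1, matching the stated figures; expressed relative to the baseline MOTA, the same change is closer to 27%.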
object tracking, such as pedestrians and vehicles. The quantum optimizer was used as a novelty to speed up the learning of occlusions during frame interpretation. The study also compared the outputs of YOLOv3-DeepSORT, regular YOLOv5-DeepSORT, and YOLOv5-DeepSORT with a quantum optimizer to check how the model overcame identity switching, misses, and false detections. These errors usually accompany the occlusion problem and therefore need to be reduced. In summary, the following results were observed.

(1) The model showed a decrease in misses and mismatches. This decrease indicates the effectiveness of the quantum optimizer in improving tracking accuracy and in releasing the quantum cost function value faster, before new identities are assigned to vehicles, compared to regular YOLO models. The decrease is considerable in a short video, but as the traffic increased, the number of identity switches increased while the number of mismatches and misses remained proportional.
(2) The optimization stage addressed one persistent problem, the tentative choice of the intersection over union (IoU) value, by choosing its threshold autonomously. This controls the values of the multiple object tracking accuracy (MOTA) and multiple object tracking precision (MOTP), which may be regarded as useless if the threshold is not optimal.
(3) Comparisons of the model with selected state-of-the-art models indicated a significant increase in the primary classification of events, activities, and relationships (CLEAR) multiple object tracking metric, with MOTA reaching 76% when using a quantum optimizer. The regular DeepSORT model with YOLOv3 reached 50%, while the regular DeepSORT model with YOLOv5 reached 60%. This result indicates an increase of 16 percentage points in the MOTA value over the regular YOLOv5-DeepSORT model.
(4) During comparisons, the MOTP value reached 92% after adding a quantum optimizer, significantly higher than other state-of-the-art models. This metric is affected considerably by the IoU threshold value set in the regular DeepSORT model. Applying the optimization removed the need to set this value manually, since it is chosen based on other parameters.
(5) The study observed a higher value in the secondary metric, the F1 score, when using the quantum optimizer on a DeepSORT model with the pedestrian MOT17 datasets, with identity switches reduced from six to four.

(1) The accuracy of computer vision applications in traffic counting and classification can be enhanced by incorporating quantum computing, owing to its faster learning of occlusions and to the DeepSORT tendency to store identities in the maximum age variable, which reduces mismatches, misses, and identity switches.
(2) The comparison of the model with state-of-the-art models revealed a lack of datasets with established ground truth ready for deriving the MOT metrics, particularly vehicle datasets. The only available datasets are pedestrian datasets, which have advantages such as suitability for analysis in different environmental complexities but lack vehicle characteristics, especially the vehicle profiles that led to identity switching. The model exhibited extensive identity switching among five-axle trucks, six-axle trucks, and buses. Further analysis is therefore needed to solve this problem.
(3) For future studies, other quantum optimizers, such as adiabatic quantum computation (AQC), can be considered. AQC minimizes an objective function by interpolating between two Hamiltonians, which will need to be defined based on the problem, and its accuracy can later be compared to the state-of-the-art models.
(4) Another tracking evaluation metric, such as higher order tracking accuracy (HOTA), can be utilized to check accuracy. It combines three IoU scores covering detection, association, and localization. The study shows it has better performance and explanatory parameters for MOT.

Author contributions

The authors confirm their contribution to the paper as follows. Study conception and design: F. Ngeni, J. Mwakalonge, S. Siuhi; model architecture: F. Ngeni, J. Mwakalonge, S. Siuhi; model results analysis and interpretation: F. Ngeni, J. Mwakalonge, S. Siuhi; draft manuscript preparation: F. Ngeni, J. Mwakalonge, S. Siuhi. All authors reviewed the results and approved the final version of the manuscript.

Conflict of interest
Frank Ngeni is a civil and transportation engineering professional specializing in transportation engineering. He has over seven years of experience in planning, design, supervision, engineering, and budgeting for transportation networks. He is a dedicated, resourceful, and innovative transportation engineering researcher with more than three years of experience with interest in intelligent transportation systems (ITS), quantum computing, artificial intelligence (AI), multimodal mobility, connected and automated vehicles (CAVs), transportation planning, travel demand modeling, transportation systems analysis, transportation economics, and traffic safety. Furthermore, he has worked with clients, contractors, and consulting firms in the construction industry for more than 4 years and supervised construction projects as a project manager, project planner, and materials engineer.

Dr. Judith Mwakalonge is a professor in the Department of Engineering at South Carolina State University with a specialty in transportation engineering. She has more than 13 years of experience planning and modeling transportation networks, analysis of traffic operational efficiency and safety, and evaluating self-driving in the connected vehicle environment. She has published and presented numerous papers in various international journals and proceedings.

Dr. Saidi Siuhi is an assistant professor in the Department of Engineering at South Carolina State University specializing in civil and transportation engineering. He has more than 14 years of experience in transportation planning, travel demand modeling, transportation systems analysis, transportation economics, traffic safety, and evaluation of self-driving in the connected vehicle environment. He has published and presented numerous papers in various international journals and conferences.