
Improving Multi-Target Multi-Camera Tracking by Track Refinement and Completion

Andreas Specker 1,2,3   Lucas Florin 2,3   Mickael Cormier 1,2,3   Jürgen Beyerer 2,1,3
1 Karlsruhe Institute of Technology   2 Fraunhofer IOSB   3 Fraunhofer Center for Machine Learning
{andreas.specker,lucas.florin,mickael.cormier,juergen.beyerer}@iosb.fraunhofer.de

Abstract

Multi-camera tracking of vehicles on a city-wide level is a core component of modern traffic monitoring systems. For this task, single-camera tracking failures are the most common source of errors in automatic multi-target multi-camera tracking systems. To address these problems, we propose several modules that aim at improving single-camera tracklets, e.g., appearance-based tracklet splitting, single-camera clustering, and track completion. After these track refinement steps, hierarchical clustering is used to associate the enhanced single-camera tracklets. During this stage, we leverage vehicle re-identification features as well as prior knowledge about the scene's topology. Last, the proposed track completion strategy is adopted for the cross-camera association task to obtain the final multi-camera tracks. Our method proves itself competitive: With it, we achieved 4th place in track 1 of the 2022 AI City Challenge.

Figure 1. Challenges of tracking vehicles within a camera view – Tracking vehicles in real-world scenarios is challenging due to heavy occlusions, e.g., when vehicles wait at a traffic light. Some vehicles are not detected while they are occluded, which often leads to track fragmentation or identity switches.

1. Introduction

Multi-Target Multi-Camera Tracking (MTMCT) systems aim at tracking multiple targets, in our case vehicles, as they move through a scene captured by numerous cameras. This means localizing and tracking targets in each camera feed and identifying instances of the same target in the other camera feeds. Applications include traffic flow analysis and traffic signal time planning.

An MTMCT system consists of two core components: First, a single-camera tracking pipeline that localizes all relevant objects in each video frame and connects these detections across time into tracklets. Second, an inter-camera association module matches tracks belonging to the same vehicle across different cameras. In a complex real-world traffic scene, the distance and orientation of the objects w.r.t. the camera vary enormously between cameras. Also, different cameras have different technical characteristics. These properties make both sub-tasks of MTMCT especially challenging in this context. To solve these problems, most of the Single-Camera Tracking (SCT) methods follow the tracking-by-detection paradigm [2–4, 49, 51, 58, 60]: First, a set of detections is generated for each video frame independently. Afterward, these detections are linked together to form tracks based on a similarity metric. Usually, this similarity metric considers visual features extracted by a re-identification (re-ID) model together with position information.

This has been proven to be a powerful approach in ideal environments where targets are visible in their entirety. However, in real-world environments, this is often not the case: Vehicles occlude each other, particularly in crowded scenes (see Fig. 1) at traffic lights or when vehicles overtake each other. As a result, single-camera tracklets are divided into multiple fragments or the tracklet switches from one vehicle to another. To solve this, we develop a track refinement module consisting of several mechanisms to further improve single-camera tracklets obtained by the JDE tracker [57]. Besides filtering methods such as background filtering and track filtering, splitting approaches are leveraged to reduce the number of identity switches. For instance, we propose to employ K-Means clustering to split tracklets in which multiple vehicles occur based on the visual appearance.

Visual information is also used to reconnect multiple track fragments belonging to the same vehicle and therefore tackle the problem of fragmentation. Moreover, a so-called track completion component finalizes tracklets based on the knowledge that cars cannot suddenly appear or disappear in the middle of the camera view.

Following many works from related literature [25, 28, 44, 47], visual features are the core component of our Multi-Camera Tracking (MCT) approach since they efficiently re-identify vehicles across cameras. We use a background subtraction model similar to [47] to handle occlusions from static objects. This model removes detections occluded by static objects such as traffic lights. It also discards tracks located entirely in the background, such as parked vehicles. Besides visual features, we also consider the structure of the scene. Our scene model includes information about the topology of the traffic camera network as well as temporal information. This way, we prevent implausible matches of tracks across cameras. Finally, we also adopt the proposed track completion approach for the multi-camera tracking task.

Our main contributions can be summarized as follows:

• We develop a robust MTMCT system that leverages topological and temporal information and is easily extendable.

• We address the primary error source, i.e., single-camera tracking errors caused by occlusions, through our track refinement module.

• We propose an explicit track completion mechanism that improves single-camera tracking results and is also applicable and beneficial for the cross-camera tracking task.

2. Related Work

2.1. Vehicle Detection

Vehicle detection is a domain-specific sub-task of object detection, which is often associated with autonomous driving and smart cities. In recent years, the contribution of several large datasets for vehicle detection and tracking [6, 10, 11, 17, 67] has facilitated the adoption of object detection architectures such as SSD [29], YOLO [5, 40], and Faster R-CNN [42] to the vehicle detection task. In fact, several challenges [35, 36, 67] have been proposed, which often resulted in the adoption of variants of YOLO-based models [18, 30, 54, 55], which offer a favorable trade-off between accuracy and computational efficiency. While adopting a large offline ensemble of detectors in such challenges is a widely used approach [47], the winner of track 3 of the 2021 AI City Challenge [28] used a single YOLOv5 [23] detector. Therefore, we follow this practice and adopt YOLOv5 as our sole detector in this work.

2.2. Vehicle Re-identification

Vehicle re-ID has attracted increasing attention in the computer vision community in the context of intelligent transportation systems. Although re-ID has been researched for a long time, the vehicle re-ID task is still extremely challenging. Similar to person re-ID, high intra-class and small inter-class variances are prominent problems in vehicle re-ID due to different camera perspectives and similar vehicle appearances [22]. Several earlier works attempted to design invariant features for vehicles in different scenes. In [16], a descriptor with a global feature invariant to affine transformations and global illumination changes is designed. In [12], Local Binary Patterns (LBP) and Local Variance (VAR) are applied to local grid cells of the image for extremely low-resolution vehicle re-ID. More recent works apply deep learning and achieve competitive results by learning global features using a bag of tricks [32, 33, 64]. An Identity Unrelated Information Decoupling paradigm is proposed in [31] to learn invariant features of the vehicle with the same ID in different scenes, using camera perspective and background information as two kinds of identity-unrelated information. Global feature learning does not rely on prior knowledge such as a specific structure, i.e., a body structure. Therefore, methods from the widely active field of person re-ID may be adapted to vehicle re-ID. While complex methods aim at making use of the particular structure of the domain with attention mechanisms [8, 65] or using auxiliary high-level semantic attributes [27, 46], similar concepts are available for the task of vehicle re-ID as well [9, 14, 24]. In this work, we rely on a global feature learning approach for our vehicle re-ID component, as in [19, 28, 36, 47].

2.3. Single-camera Tracking

Most SCT approaches use the tracking-by-detection paradigm, i.e., the task is divided into a detection step using an efficient detector followed by an association step, in which detections of the same targets are matched based on a similarity measure [2–4, 41, 45, 49, 51, 58–61, 63]. Most methods combine position and motion information [2, 3], and additionally re-ID features [4, 49, 51, 58] or other cues such as pose information [51, 60]. While these approaches are highly effective, the temporal context available in videos often receives less consideration. In [38], the TPM algorithm is proposed, which efficiently combines multiple short sub-trajectories into a long trajectory and, using trajectory context, mitigates missing detections. TNT [56] uses a graph-based model to incorporate temporal and appearance information for tracking simultaneously.

Nonetheless, a recent line of works tightens the link between detection and tracking by extending object detectors to trackers [1, 66], incorporating tracking results as prior knowledge for detection [15, 66], or using 3D CNNs to detect tracklets [37].

[Figure 2 – pipeline overview. Single-camera tracking: vehicle detection (YOLOv5), NMS, a vehicle re-ID model ensemble for visual features, forward and backward tracking with the JDE tracker, and forward-backward track combination yield tracklets. Track refinement module: background filtering, time gap splitting, turn splitting, visual appearance splitting, single-camera clustering, track completion, and track filtering, supported by foreground masks and a scene model with camera zones. Multi-camera tracking: two clustering steps on a visual distance matrix (mean feature distance combined with best match distance) followed by track completion produce the multi-camera tracks.]

Figure 2. System diagram of the proposed MTMCT system – First, we generate tracklets by looking at vehicle detections in each video feed. Then, these tracklets are filtered, split, clustered, and completed, resulting in high-quality single-camera tracks. Finally, these tracks are joined across cameras into multi-camera tracks.

TrackFormer [34] leverages recent advances in vision transformers to address occluded, missing, or noisy detections. In addition, TransTrack [50] further improves performance by combining the object query from DETR [7] and a track query.

2.4. Multi-camera Tracking

Typical MTMCT pipelines include detection, MTSCT, and cross-camera clustering [21, 25, 44, 53]. Recent works [20, 21, 25, 39] use additional external information about the camera setup to improve their pipeline. In [21, 28, 47], the scene topology is used to prevent infeasible cross-camera transitions; camera adjacency is exploited in [28, 39, 47]; movement directions are used in [20] to determine the plausibility of camera transitions; and camera-specific areas are defined in [21] to determine whether tracks can appear in multiple cameras. Specker et al. [47] propose an occlusion-aware approach to discard obstacle-occluded bounding boxes and overlapping tracks. Last year's winners of the AI City Challenge [35] especially focused on spatio-temporal information and traffic rules [28, 30, 63]. Inspired by these recent trends, we adapt a clustering approach to cluster tracks from adjacent cameras and leverage information about the scene's topology.

3. Methods

3.1. Overview

Our multi-target multi-camera vehicle tracking system with its components is depicted in Figure 2. In general, it consists of three main parts.

The first one is the single-camera tracking stage, which takes videos as input and generates a set of single-camera tracklets for each input video. Specifically, vehicles are detected, and appearance features are extracted using a vehicle re-identification approach. Subsequently, a single-camera tracker is applied in order to group detections into single-camera tracklets.

The second part is the Track Refinement Module (TRM), which contains several post-processing steps that aim at reducing the number of identity switches and fragmentations.

Finally, multi-camera tracking is performed by associating the single-camera tracklets provided by the previous stages. Two rounds of clustering are applied, followed by our track completion module to finalize tracks that must have predecessors or successors since they start or end in the middle of camera views. All the modules and components are described in detail in the following sections.

3.2. Detection

To be able to track vehicles' routes within a camera network accurately, it is essential to detect them reliably. The detection stage should not miss vehicles, since subsequent tracking and filtering stages can suppress false positives or multiple detections for the same instance. In contrast, it is more challenging to interpolate missing bounding boxes afterward. We employ the YOLOv5 [23] single-stage detector due to its good trade-off between speed and accuracy. It is applied to each video frame to obtain the bounding boxes of the occurring vehicles and a confidence score that expresses the detector's certainty. To avoid false negatives, i.e., missing vehicles due to the reasons mentioned above, we rely on a low detection confidence threshold of 0.1. Moreover, we found that using the off-the-shelf YOLOv5x6 model pre-trained on the COCO [26] dataset is sufficient to achieve promising performance and detect all vehicles. Training on external data or fine-tuning the model on the challenge dataset is not required. Since the COCO [26] dataset considers a variety of classes which are not relevant to the multi-camera tracking task, classes such as person are discarded, and only detections for different types of vehicles are kept. Using a low detection threshold leads to many double detections, so non-maximum suppression (NMS) is applied directly after detection as a first filtering step. Strongly overlapping bounding boxes, measured by the intersection-over-union (IoU) in the video frames, are filtered to reduce the number of bounding boxes to one per vehicle instance. Subsequently, the bounding boxes for each frame in the videos are forwarded to a vehicle re-ID network to extract meaningful appearance descriptors.
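To make this filtering chain concrete, the following minimal sketch (not the released challenge code) applies the low confidence threshold, keeps only COCO vehicle classes, and runs IoU-based NMS per frame with torchvision; the exact vehicle class set and the NMS IoU threshold are assumptions not stated in the paper.

```python
import torch
from torchvision.ops import nms

# COCO indices for car, motorcycle, bus, and truck; an assumption, the paper only
# states that non-vehicle classes such as "person" are discarded.
VEHICLE_CLASSES = {2, 3, 5, 7}
CONF_THRESHOLD = 0.1     # low threshold from the paper to avoid false negatives
NMS_IOU_THRESHOLD = 0.5  # assumed value, not given in the paper

def filter_frame_detections(boxes, scores, classes):
    """boxes: (N, 4) xyxy tensor, scores: (N,), classes: (N,) integer class ids."""
    keep = (scores >= CONF_THRESHOLD) & torch.tensor(
        [int(c) in VEHICLE_CLASSES for c in classes.tolist()], dtype=torch.bool)
    boxes, scores = boxes[keep], scores[keep]
    # Suppress strongly overlapping duplicates so each vehicle keeps a single box.
    kept = nms(boxes, scores, NMS_IOU_THRESHOLD)
    return boxes[kept], scores[kept]
```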
3.3. Vehicle Re-identification

Vehicle re-ID is the task of extracting meaningful feature representations of vehicles' visual appearances to assess the similarity between different vehicles based on a distance measure. This pipeline stage constitutes an integral part of single-camera tracking and multi-camera association. Many approaches and best practices can be transferred from the widely studied topic of person re-ID.

However, vehicle re-ID raises additional challenges since multiple, almost identical cars of the same make, model, and color may appear in the scene. As a result, models have to be able to distinguish vehicles based on small-scale visual features, e.g., scratches, dirt, or special equipment. Data augmentation and synthetic data are essential since the challenge dataset is limited concerning the number of cars and trucks and therefore lacks diversity. Similar to the strong baseline for re-ID [32], we train different models using real-world data, synthetic data, as well as image data that was transferred from the synthetic to the real-world domain using a generative adversarial network [13]. This procedure leads to diverse appearance representations which generalize well when used as an ensemble. In general, we rely on a global approach, which means that we aim at learning one global feature vector instead of extracting multiple embeddings for different parts of the vehicles. Such approaches are lightweight, do not tend to overfit, and deliver robust results.

In detail, we train ResNet-101 IBN-A and ResNeXt-101 IBN-A models with an input image size of 384 × 384. A fully-connected classification layer is appended to these backbone networks with as many output neurons as identities in the training dataset. During inference, i.e., feature extraction, the classification layer is omitted. As done in many works, the stride parameter of the last pooling layer is set to 1 to keep fine-grained details. Analogous to current state-of-the-art approaches to re-identification, a combination of the cross-entropy classification loss function and the metric learning triplet loss function is used. While the former aims to identify the vehicles, the latter helps to learn features that are close in the embedding space for samples of the same class and far apart for similar vehicles originating from different classes. We extract 2048-dimensional feature vectors for each bounding box extracted in the previous detection stage. Bounding boxes and corresponding appearance representations serve as input for the subsequent single-camera tracking.
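The following sketch illustrates this setup under simplifying assumptions: a plain torchvision ResNet-101 stands in for the ResNet-101 IBN-A and ResNeXt-101 IBN-A backbones, and the triplet margin and equal loss weighting are assumptions, as the paper does not state them.

```python
import torch.nn as nn
from torchvision.models import resnet101

class GlobalReIDModel(nn.Module):
    """Global re-ID model: 2048-d descriptor plus an identity classifier that is
    dropped at inference time."""
    def __init__(self, num_identities):
        super().__init__()
        backbone = resnet101(weights=None)                 # stand-in for ResNet-101 IBN-A
        backbone.layer4[0].conv2.stride = (1, 1)           # last-stage stride 1 keeps
        backbone.layer4[0].downsample[0].stride = (1, 1)   # fine-grained details
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.classifier = nn.Linear(2048, num_identities)

    def forward(self, images):
        feats = self.backbone(images)                      # global appearance descriptors
        return feats, self.classifier(feats)

ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)            # margin is an assumption

def reid_loss(logits, labels, anchor, positive, negative):
    # Identity classification plus metric learning on the same embeddings.
    return ce_loss(logits, labels) + triplet_loss(anchor, positive, negative)
```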
3.4. Single-camera Tracking

In this processing step, detections within the video frames are combined to form so-called tracklets. Tracklets represent the spatial-temporal trajectories of vehicles crossing a single camera view. Our method uses the JDE tracker [57] that associates detections to tracklets building on motion tracking using a Kalman filter and visual similarity based on the extracted re-identification embeddings. The tracker outputs a set of tracklets containing the corresponding bounding boxes and feature vectors for each time step the tracklet is visible.

Situations in which vehicles lower their velocity and come to a stop at traffic lights are a significant challenge in the dataset. The motion estimates of the Kalman filter get worse, and due to heavy occlusions, many tracks get fragmented. To reduce the negative impact of such situations, we perform single-camera tracking in both temporal directions [48]. First, the video is processed from start to end and then from end to start. By searching for the best overlap between tracklets from the forward and backward tracking and keeping the larger one, robustness against fragmentation is greatly increased.
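One plausible way to realize this forward-backward combination is sketched below; the overlap measure (mean IoU over shared frames) and the minimum-overlap threshold are assumptions, since the paper does not spell them out.

```python
def frame_iou(track_a, track_b):
    """Mean IoU over shared frames; a tracklet maps frame index -> (x1, y1, x2, y2)."""
    shared = set(track_a) & set(track_b)
    if not shared:
        return 0.0
    ious = []
    for f in shared:
        (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = track_a[f], track_b[f]
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        ious.append(inter / union if union > 0 else 0.0)
    return sum(ious) / len(ious)

def combine_forward_backward(forward_tracks, backward_tracks, min_overlap=0.5):
    """Match each forward tracklet to its best backward counterpart and keep the larger one."""
    combined = []
    for ft in forward_tracks:
        best = max(backward_tracks, key=lambda bt: frame_iou(ft, bt), default=None)
        if best is not None and frame_iou(ft, best) >= min_overlap:
            combined.append(ft if len(ft) >= len(best) else best)
        else:
            combined.append(ft)
    return combined
```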

3.5. Single-camera Track Refinement

Although the JDE single-camera tracker delivers primarily promising results, some systematic error sources exist. On the one hand, many tracks are fragmented due to occlusions caused by other vehicles or obstacles. On the other hand, identity switches occur, especially in scenes where cars stop at traffic lights or overtake each other. We have developed and combined several methods to post-process and greatly enhance single-camera tracklets in our TRM. Each of them is explained in the following.

Background filtering

Analogous to the work of Specker et al. [47], we filter false positive detections based on a background-foreground segmentation model. First, foreground regions are determined by computing areas that do not change during the video. Detections or even entire tracklets are discarded if they overlap with the static background by more than 50%.

Time gap splitting

Time gaps between associated detections in single-camera tracklets may indicate that the vehicle was lost and re-identified later, i.e., it was occluded in between. Since this may lead to identity switches, we examine such time gaps within single-camera tracklets. If the gap is sufficiently large and the re-ID features representing the visual appearance change too much, the tracklet will be split into two new ones.

Turn splitting

Another common source of error is identity switches that occur when a vehicle leaves a camera view and a second one enters it nearby. In some cases, the tracklet of the leaving vehicle is not finished, but instead, it is resumed by the entering one. This problem is straightforward to remedy by examining the direction of travel and splitting the tracklet if it is suddenly reversed. In contrast to, e.g., people, it is improbable that a vehicle performs such a maneuver due to traffic rules.

Visual appearance splitting

To further reduce the number of identity switches, we propose a visual appearance splitting mechanism. For this, K-Means clustering is applied to the embeddings of a tracklet to group them into two clusters. If two different vehicles appear within the same tracklet, the respective detections should be assigned to different clusters. Subsequently, the distance between the cluster centers is leveraged to assess whether the tracklet shows multiple vehicles. If the Cosine distance between the cluster centers exceeds 0.68, the tracklet will be split at the point after which all detections were assigned to the same cluster. The high threshold ensures that correct tracklets are not divided.
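A minimal sketch of this splitting rule with scikit-learn is given below; it assumes the tracklet embeddings are the per-detection re-ID features described above and that the split point is the start of the final run of identical cluster labels.

```python
import numpy as np
from sklearn.cluster import KMeans

SPLIT_THRESHOLD = 0.68  # cosine distance between cluster centers, as in the paper

def appearance_split_point(embeddings):
    """embeddings: (T, D) array of per-detection re-ID features, ordered by time.
    Returns the index at which to split the tracklet, or None if it stays intact."""
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
    c0, c1 = kmeans.cluster_centers_
    cos_dist = 1.0 - float(np.dot(c0, c1) / (np.linalg.norm(c0) * np.linalg.norm(c1)))
    if cos_dist <= SPLIT_THRESHOLD:
        return None  # the tracklet most likely shows a single vehicle
    labels = kmeans.labels_
    # Split at the point after which all detections belong to the same cluster.
    split_idx = len(labels)
    while split_idx > 0 and labels[split_idx - 1] == labels[-1]:
        split_idx -= 1
    return split_idx
```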
Single-camera clustering

After splitting tracklets to correct identity switches, we use two rounds of Agglomerative Clustering to merge fragments showing the same vehicle. To do this, a distance matrix of size Nc × Nc is constructed, where Nc equals the number of single-camera tracklets found in camera c. Each element at position (x, y) of the matrix corresponds to the Cosine distance between the mean feature vectors of two tracklets Tx and Ty. In the first round, no constraints are applied to combine tracklets that overlap in time and position. This leads to great improvement, especially in waiting situations with heavy occlusions near traffic lights. A low distance threshold of 0.1 is used for the first clustering to avoid false-positive combinations. Afterward, tracklet clusters are merged. If some of the tracklets overlap in time, detections of the longer tracklet are kept, and detections of the shorter one are discarded. Before the second clustering, the distance matrix is re-computed using the newly created tracklets. In contrast to the first round, constraints w.r.t. time overlap, time gaps between fragments, and direction of travel are applied. This allows leveraging a higher distance threshold of 0.3 to reduce the number of fragments further.
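A compressed sketch of one such clustering round is shown below (assuming a recent scikit-learn where a precomputed distance matrix is passed via the metric argument); the constraints of the second round are only mimicked here by setting forbidden tracklet pairs to a very large distance.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import AgglomerativeClustering

def cluster_tracklets(mean_features, distance_threshold, forbidden_pairs=()):
    """mean_features: (Nc, D) mean re-ID feature per tracklet of one camera.
    forbidden_pairs: (i, j) index pairs that must not be merged, e.g., because they
    violate the time-overlap or travel-direction constraints of the second round."""
    dist = cdist(mean_features, mean_features, metric="cosine")
    for i, j in forbidden_pairs:
        dist[i, j] = dist[j, i] = 1e6  # effectively infinite distance
    clustering = AgglomerativeClustering(
        n_clusters=None,
        metric="precomputed",
        linkage="average",
        distance_threshold=distance_threshold,
    )
    return clustering.fit_predict(dist)  # one cluster label per tracklet

# Round one: unconstrained, conservative threshold of 0.1.
# labels = cluster_tracklets(features, distance_threshold=0.1)
# Round two: re-computed features, constraints applied, threshold of 0.3.
# labels = cluster_tracklets(new_features, 0.3, forbidden_pairs=invalid_pairs)
```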
Single-camera track completion

We propose a track completion module since vehicles follow strict traffic rules and are unlikely to suddenly disappear in the middle of the camera view. As long as vehicles do not reach the frame boundaries or the video ends, there must be a successor tracklet. We aim at connecting single-camera tracklets based on these prior assumptions. For each tracklet, the following actions are performed (a schematic sketch follows the list):

1. Check if the tracklet is already finished, i.e., it enters and leaves the camera or starts/ends with the start or end of the video.

2. If not, search for possible predecessors and successors based on direction, time gap, distance, and visual similarity.

3. Merge tracklets with the best matching predecessor and/or successor.
plied to the embeddings of a tracklet to group them into two camera tracking errors are corrected, and the number of

After the track completion refinement step, most single-camera tracking errors are corrected, and the number of identity switches and fragmentations is greatly lowered. We conclude the refinement pipeline by applying several filtering methods. This includes omitting short tracklets with less than five detections, tracklets that cannot appear in another camera, as well as static tracks that do not change position.

3.6. Cross-camera Association

After the single-camera tracklets have been finalized, the next step is the association of tracklets from different cameras that show the same vehicle to obtain multi-camera tracks.

Multi-camera clustering

Many works [25, 28, 47] regarding multi-camera tracking solve the task by hierarchical clustering, so we also rely on this approach. Like many other works [21, 28, 47], we make use of so-called zones. In detail, we leverage four different zones: one for each possible direction vehicles can come from or go to, respectively. This scene model is helpful to constrain impossible transitions of vehicles between cameras. Inspired by [28], two rounds of clustering are applied. In the first one, tracklets are clustered separately for each possible transition between the cameras, and in the second one, all tracks from adjacent cameras are clustered. In contrast to [28], we modify the distance metric and do not solely rely on the Cosine distance between the tracks' mean features for clustering. [47] indicates that it is beneficial to consider the distance between the most similar detection pair of two tracks in addition to the distance of the mean features. So, we build the distance matrix by multiplying both aforementioned distances. The idea is that tracklets may be merged when either the visual appearances of vehicles across whole single-camera tracklets are very similar or when there is one strong agreement between detections of the tracklets. We set distance values between tracks that cannot belong to the same vehicle, due to invalid zone transitions or impossible transition durations, to infinity. This limits the search space, and thus false-positive associations are avoided. After the second clustering step, tracklets within the same clusters are merged to form multi-camera tracks.
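A sketch of this combined distance is given below; it assumes L2-normalized per-detection features so that cosine distances follow from dot products, and it delegates the zone and transition-duration rules to a caller-supplied predicate.

```python
import numpy as np

def combined_distance_matrix(tracks, transition_valid):
    """tracks: list of (T_i, D) arrays of L2-normalized per-detection re-ID features.
    transition_valid(i, j): False if zone transitions or transition durations rule out
    that tracks i and j belong to the same vehicle."""
    n = len(tracks)
    means = [t.mean(axis=0) / np.linalg.norm(t.mean(axis=0)) for t in tracks]
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    for i in range(n):
        for j in range(i + 1, n):
            if not transition_valid(i, j):
                continue  # stays infinite, so this pair can never be merged
            mean_dist = 1.0 - float(means[i] @ means[j])
            # Cosine distance of the most similar detection pair between both tracks.
            best_match_dist = 1.0 - float((tracks[i] @ tracks[j].T).max())
            dist[i, j] = dist[j, i] = mean_dist * best_match_dist
    return dist
```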
Multi-camera track completion

Similar to the single-camera track refinement stage, resulting multi-camera tracks are post-processed by our track completion module. For instance, some tracks may not be matched by the previous clustering step since they do not start or end in transition zones. The track completion algorithm is identical to the single-camera version, but instead of searching for candidates within the camera, the search space is composed of tracks from adjacent cameras.
Approach                             IDF1    IDP     IDR
Baseline                             76.42   78.80   74.19
+ TRM                                82.07   86.08   78.41
+ multiplying best match distance    83.12   86.45   80.04
+ multi-camera track completion      83.48   86.74   80.46

Table 1. Ablation Study – Comparison of the influence of different modules on the overall multi-camera tracking performance. The proposed track refinement module leads to the most significant improvement. The use of the combined distance and of the multi-camera track completion component brings lower but still significant improvements.

4. Evaluation

Experimental results are presented in this section. After briefly introducing the CityFlowV2 [52] dataset and evaluation metrics, we provide an ablation study, qualitative results, and the final challenge ranking.

4.1. Datasets

We used the two datasets allowed for track 1 of the AI City Challenge 2022: the real-world CityFlowV2 [52] dataset and the synthetic VehicleX [62] dataset.

CityFlowV2

This dataset is a benchmark for city-scale MTMCT. The training and validation sets consist of a high number of video feeds covering intersections of a road network of a U.S. city. Different city areas are represented, such as residential areas and highways. Some intersections are covered by multiple overlapping video feeds. However, the test set only covers intersections along a single stretch of a highway, with only one camera for each intersection. This makes the tracking task somewhat simpler, but the training and validation sets can hardly be used to predict tracking performance on the test set.

VehicleX

The VehicleX dataset is a synthetic dataset built from rendered 3D models of vehicles. The synthetic images are placed on background images taken from a real-world dataset. It can be used to augment a smaller, real-world dataset such as CityFlowV2.

4.2. Evaluation Metrics

The challenge ranking is based on the IDF1 score. This is a comprehensive metric that measures both SCT and MCT performance. In our ablation study, we also provide the IDP (identity precision) and IDR (identity recall) metrics [43].
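For reference, these identity metrics follow the standard definitions of [43], based on identity true positives (IDTP), false positives (IDFP), and false negatives (IDFN) obtained from an optimal one-to-one matching between predicted and ground-truth identities:

$$\mathrm{IDP} = \frac{\mathrm{IDTP}}{\mathrm{IDTP}+\mathrm{IDFP}}, \qquad \mathrm{IDR} = \frac{\mathrm{IDTP}}{\mathrm{IDTP}+\mathrm{IDFN}}, \qquad \mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP}+\mathrm{IDFP}+\mathrm{IDFN}}.$$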

4.3. Results & Discussion

This section provides qualitative and quantitative results for different stages of our pipeline.

Figure 3. Track refinement module results – Qualitative evaluation of our TRM: (a) turn splitting, (b) visual appearance splitting, (c) single-camera clustering, (d) single-camera track completion. Figure 3a visualizes a case where the tracklet transitions from a vehicle leaving the scene to one driving into the scene at the frame boundary. By analyzing the direction of travel, our TRM separates the tracklets. In Figure 3b, each row of images stands for one tracklet after the split. One can observe that our clustering approach is able to determine and correct the identity switch. In Figures 3c and 3d, the last row visualizes the merged tracklet and the rows above visualize separate track fragments. The resulting tracklets are greatly enhanced by combining several sub-tracklets showing the same vehicle.

Figure 4. Cross-camera track completion – Qualitative results of our multi-camera track completion component. It is capable of connecting track fragments even if the visual appearance differs due to varying lighting conditions and viewing angles.

Rank  Team ID  IDF1     Rank  Team ID  IDF1
1     28       84.86    11    114      81.27
2     59       84.37    12    57       80.95
3     37       83.71    13    5        79.55
4     Ours     83.48    14    18       78.79
5     70       82.51    15    38       75.53
6     36       82.18    16    49       74.57
7     15       81.71    17    109      72.62
8     118      81.66    18    4        72.55
9     110      81.40    19    141      62.12
10    94       81.29    20    16       60.94

Table 2. Challenge results – Challenge results on the official test set.

TRM

Figure 3 visualizes qualitative examples for selected components of the proposed TRM. One can observe that the turn splitting component (see Figure 3a) is able to detect sudden changes of direction and split the tracklets accordingly. Furthermore, the proposed clustering approach to correct identity switches is working as expected, as shown in Figure 3b. Detections belonging to different vehicles are assigned to different clusters and are subsequently divided into separate tracklets. To reduce the number of fragmented tracklets, we use single-camera clustering (see Figure 3c) and the introduced track completion module (see Figure 3d). The single-camera clustering example shows a partly occluded vehicle that appears mainly in the background with a small size. As a result, the single-camera tracklet is divided into multiple fragments. Our clustering approach is capable of reuniting the fragments based on visual similarity. The single-camera tracklet completion module handles cases where the visual similarity is lower, e.g., due to different distances from the camera. The sample pictured in Figure 3d visualizes such a case. It shows that, using the information that a tracklet is not completed yet and thus a successor must exist, tracklets can be merged accurately based on movement information.

Multi-camera track completion

Regarding multi-camera track completion, we present an example of a vehicle performing a U-turn and driving back the way it was coming from in Figure 4. Different colors stand for different multi-camera tracks before the module is applied. Due to different lighting conditions and viewing angles, the tracks were not merged during the cross-camera clustering. Leveraging movement information and less strict constraints allows the connection of the track fragments and, therefore, the improvement of the resulting multi-camera tracks.

Ablation

Table 1 presents the impact of the TRM module, the multiplied distances, and the multi-camera track completion on the multi-camera tracking accuracy. The results indicate that single-camera track refinement is the most crucial component for increasing performance. Single-camera tracklets constitute the base of the whole multi-camera tracking pipeline, and errors in this stage impede robust cross-camera association. Moreover, the results show that post-processing the resulting multi-camera tracks using prior knowledge about traffic rules further enhances the results.

Figure 5. Qualitative Multi-camera Tracking Results – Each row shows a single-camera tracklet from a different camera. The upper track (a) shows the effectiveness of our TRM. The lower track (b) shows that with our approach, even with variation in viewing angles and lighting conditions, vehicles can be tracked through multiple cameras.

Challenge results

Table 2 compares our approach with the other challenge participants. We achieved fourth place with an IDF1 score of 83.48%.

Final results

Last but not least, we give some final qualitative results of our tracking approach in Figure 5. Each row represents a single-camera sub-tracklet from the multi-camera track. The first example shows a track that benefits from our single-camera clustering and track completion. The car is almost entirely overlapped by another vehicle in the top row, leading to fragmentations after the single-camera tracking stage. Our track refinement strategy corrects the error before cross-camera clustering. The example shown in Figure 5b proves the capability of our approach to robustly track the routes of vehicles across multiple cameras and thus allow applications such as automatic traffic monitoring.

5. Conclusion

In this work, we have proposed a multi-target multi-camera vehicle tracking system that focuses on improving single-camera tracklets to enhance the overall performance. In addition, the featured track completion strategy is also applied to the cross-camera association task. Experimental validation proved the effectiveness of the proposed approaches. In summary, the tracking system achieves an IDF1 score of 83.48%, which corresponds to the fourth position in track 1 of the AI City Challenge 2022.

References

[1] P. Bergmann, T. Meinhardt, and L. Leal-Taixé. Tracking without bells and whistles. In Int. Conf. Comput. Vis., pages 941–951, 2019.
[2] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In IEEE Int. Conf. Image Process., pages 3464–3468, 2016.
[3] E. Bochinski, V. Eiselein, and T. Sikora. High-speed tracking-by-detection without using image information. In IEEE Int. Conf. Adv. Video Sign. Surv., 2017.
[4] E. Bochinski, T. Senst, and T. Sikora. Extending iou based multi-object tracking by visual information. In IEEE Int. Conf. Adv. Video Sign. Surv., 2018.
[5] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
[6] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
[8] T. Chen, S. Ding, J. Xie, Y. Yuan, W. Chen, Y. Yang, Z. Ren, and Z. Wang. Abd-net: Attentive but diverse person re-identification. In Int. Conf. Comput. Vis., pages 8351–8361, 2019.

[9] T.-S. Chen, C.-T. Liu, C.-W. Wu, and S.-Y. Chien. Orientation-aware vehicle re-identification with semantics-guided part attention network. In Eur. Conf. Comput. Vis., pages 330–346, 2020.
[10] Yukyung Choi, Namil Kim, Soonmin Hwang, Kibaek Park, Jae Shin Yoon, Kyounghwan An, and In So Kweon. Kaist multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems, 19(3):934–948, 2018.
[11] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
[12] Mickael Cormier, Lars Wilko Sommer, and Michael Teutsch. Low resolution vehicle re-identification based on appearance features for wide area motion imagery. In 2016 IEEE Winter Applications of Computer Vision Workshops (WACVW), pages 1–7. IEEE, 2016.
[13] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[14] V. Eckstein, A. Schumann, and A. Specker. Large scale vehicle re-identification by knowledge transfer from simulated data and temporal attention. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 616–617, 2020.
[15] C. Feichtenhofer, A. Pinz, and A. Zisserman. Detect to track and track to detect. In Int. Conf. Comput. Vis., pages 3057–3065, 2017.
[16] Andres Frias-Velazquez, Peter Van Hese, Aleksandra Pižurica, and Wilfried Philips. Split-and-match: A bayesian framework for vehicle re-identification in road tunnels. Engineering Applications of Artificial Intelligence, 45:220–233, 2015.
[17] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
[18] Synh Viet-Uyen Ha, Nhat Minh Chung, Tien-Cuong Nguyen, and Hung Ngoc Phan. Tiny-pirate: A tiny model with parallelized intelligence for real-time analysis as a traffic counter. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4119–4128, 2021.
[19] S. He, H. Luo, W. Chen, M. Zhang, Y. Zhang, F. Wang, H. Li, and W. Jiang. Multi-domain learning and identity mining for vehicle re-identification. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 582–583, 2020.
[20] Y. He, J. Han, W. Yu, X. Hong, X. Wei, and Y. Gong. City-scale multi-camera vehicle tracking by semantic attribute parsing and cross-camera tracklet matching. In IEEE Conf. Comput. Vis. Pattern Recog., pages 576–577, 2020.
[21] H.-M. Hsu, T.-W. Huang, G. Wang, J. Cai, Z. Lei, and J.-N. Hwang. Multi-camera tracking of vehicles based on deep features re-id and trajectory-based camera link models. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 416–424, 2019.
[22] Yi Jin, Chenning Li, Yidong Li, Peixi Peng, and George A Giannopoulos. Model latent views with multi-center metric learning for vehicle re-identification. IEEE Transactions on Intelligent Transportation Systems, 22(3):1919–1931, 2021.
[23] Glenn Jocher, Ayush Chaurasia, Alex Stoken, Jirka Borovec, NanoCode012, Yonghye Kwon, TaoXie, Jiacong Fang, imyhxy, Kalen Michael, Lorna, Abhiram V, Diego Montes, Jebastin Nadar, Laughing, tkianai, yxNONG, Piotr Skalski, Zhiqiang Wang, Adam Hogan, Cristi Fati, Lorenzo Mammana, AlexWang1900, Deep Patel, Ding Yiwei, Felix You, Jan Hajek, Laurentiu Diaconu, and Mai Thanh Minh. ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference, Feb. 2022.
[24] P. Khorramshahi, N. Peri, J.-c. Chen, and R. Chellappa. The devil is in the details: Self-supervised attention for vehicle re-identification. In Eur. Conf. Comput. Vis., pages 369–386, 2020.
[25] P. Köhl, A. Specker, A. Schumann, and J. Beyerer. The mta dataset for multi-target multi-camera pedestrian tracking by weighted distance aggregation. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 1042–1043, 2020.
[26] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
[27] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, Z. Hu, C. Yan, and Y. Yang. Improving person re-identification by attribute and identity learning. Pattern Recognition, 95:151–161, 2019.
[28] Chong Liu, Yuqi Zhang, Hao Luo, Jiasheng Tang, Weihua Chen, Xianzhe Xu, Fan Wang, Hao Li, and Yi-Dong Shen. City-scale multi-camera vehicle tracking guided by crossroad zones. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4129–4137, 2021.
[29] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, et al. Ssd: Single shot multibox detector. In ECCV, 2016.
[30] Jincheng Lu, Meng Xia, Xu Gao, Xipeng Yang, Tianran Tao, Hao Meng, Wei Zhang, Xiao Tan, Yifeng Shi, Guanbin Li, et al. Robust and online vehicle counting at crowded intersections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4002–4008, 2021.
[31] Zefeng Lu, Ronghao Lin, Xulei Lou, Lifeng Zheng, and Haifeng Hu. Identity-unrelated information decoupling model for vehicle re-identification. IEEE Transactions on Intelligent Transportation Systems, 2022.
[32] Hao Luo, Weihua Chen, Xu Xianzhe, Gu Jianyang, Yuqi Zhang, Chong Liu, Jiang Qiyi, Shuting He, Fan Wang, and Hao Li. An empirical study of vehicle re-identification on the ai city challenge. In Proc. CVPR Workshops, 2021.

[33] H. Luo, Y. Gu, X. Liao, S. Lai, and W. Jiang. Bag of tricks and a strong baseline for deep person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., 2019.
[34] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. Trackformer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[35] Milind Naphade, Shuo Wang, David C. Anastasiu, Zheng Tang, Ming-Ching Chang, Xiaodong Yang, Yue Yao, Liang Zheng, Pranamesh Chakraborty, Christian E. Lopez, Anuj Sharma, Qi Feng, Vitaly Ablavsky, and Stan Sclaroff. The 5th ai city challenge. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2021.
[36] M. Naphade, S. Wang, D. C. Anastasiu, Z. Tang, M.-C. Chang, X. Yang, L. Zheng, A. Sharma, R. Chellappa, and P. Chakraborty. The 4th ai city challenge. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 626–627, 2020.
[37] B. Pang, Y. Li, Y. Zhang, M. Li, and C. Lu. Tubetk: Adopting tubes to track multi-object in a one-step training model. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6308–6318, 2020.
[38] Jinlong Peng, Tao Wang, Weiyao Lin, Jian Wang, John See, Shilei Wen, and Erui Ding. Tpm: Multiple object tracking with tracklet-plane matching. Pattern Recognition, 107:107480, 2020.
[39] Y. Qian, L. Yu, W. Liu, and A. G. Hauptmann. Electricity: An efficient multi-camera vehicle tracking system for intelligent city. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 588–589, 2020.
[40] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[41] Pengfei Ren, Kang Lu, Yu Yang, Yun Yang, Guangze Sun, Wei Wang, Gang Wang, Junliang Cao, Zhifeng Zhao, and Wei Liu. Multi-camera vehicle tracking system based on spatial-temporal filtering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4213–4219, 2021.
[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[43] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Eur. Conf. Comput. Vis., pages 17–35, 2016.
[44] E. Ristani and C. Tomasi. Features for multi-target multi-camera tracking and re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., pages 6036–6046, 2018.
[45] Kyujin Shim, Sungjoon Yoon, Kangwook Ko, and Changick Kim. Multi-target multi-camera vehicle tracking for city-scale traffic management. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4193–4200, 2021.
[46] A. Specker, A. Schumann, and J. Beyerer. A multitask model for person re-identification and attribute recognition using semantic regions. In Art. Intell. and Mach. Learn. in Def. Appl., 2020.
[47] Andreas Specker, Daniel Stadler, Lucas Florin, and Jürgen Beyerer. An occlusion-aware multi-target multi-camera tracking system. In IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pages 4173–4182, 2021.
[48] D. Stadler and J. Beyerer. Improving multiple pedestrian tracking by track management and occlusion handling. In IEEE Conf. Comput. Vis. Pattern Recog., 2021.
[49] D. Stadler, L. W. Sommer, and J. Beyerer. Pas tracker: Position-, appearance- and size-aware multi-object tracking in drone videos. In Eur. Conf. Comput. Vis. Worksh., pages 604–620, 2020.
[50] Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[51] S. Tang, M. Andriluka, B. Andres, and B. Schiele. Multiple people tracking by lifted multicut and person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3701–3710, 2017.
[52] Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang. Cityflow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8797–8806, 2019.
[53] Y. T. Tesfaye, E. Zemene, A. Prati, M. Pelillo, and M. Shah. Multi-target tracking in multiple non-overlapping cameras using constrained dominant sets. arXiv:1706.06196, 2017.
[54] Duong Nguyen-Ngoc Tran, Long Hoang Pham, Huy-Hung Nguyen, Tai Huu-Phuong Tran, Hyung-Joon Jeon, and Jae Wook Jeon. A region-and-trajectory movement matching for multiple turn-counts at road intersection on edge device. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4087–4094, 2021.
[55] Vu-Hoang Tran, Le-Hoai-Hieu Dang, Chinh-Nghiep Nguyen, Ngoc-Hoang-Lam Le, Khanh-Phong Bui, Lam-Truong Dam, Quang-Thang Le, and Dinh-Hiep Huynh. Real-time and robust system for counting movement-specific vehicle at crowded intersections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4228–4235, 2021.
[56] Gaoang Wang, Yizhou Wang, Haotian Zhang, Renshu Gu, and Jenq-Neng Hwang. Exploit the connectivity: Multi-object tracking with trackletnet. In Proceedings of the 27th ACM International Conference on Multimedia, pages 482–490, 2019.
[57] Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang. Towards real-time multi-object tracking. In Eur. Conf. Comput. Vis., pages 107–122, 2020.
[58] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In IEEE Int. Conf. Image Process., pages 3645–3649, 2017.
[59] Minghu Wu, Yeqiang Qian, Chunxiang Wang, and Ming Yang. A multi-camera vehicle tracking system based on city-scale vehicle re-id and spatial-temporal information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4077–4086, 2021.

[60] B. Xiao, H. Wu, and Y. Wei. Simple baselines for human pose estimation and tracking. In Eur. Conf. Comput. Vis., pages 472–487, 2018.
[61] Kai-Siang Yang, Yu-Kai Chen, Tsai-Shien Chen, Chih-Ting Liu, and Shao-Yi Chien. Tracklet-refined multi-camera tracking based on balanced cross-domain re-identification for vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3983–3992, 2021.
[62] Y. Yao, L. Zheng, X. Yang, M. Naphade, and T. Gedeon. Simulating content consistent vehicle datasets with attribute descent. In Eur. Conf. Comput. Vis., pages 775–791, 2020.
[63] Jin Ye, Xipeng Yang, Shuai Kang, Yue He, Weiming Zhang, Leping Huang, Minyue Jiang, Wei Zhang, Yifeng Shi, Meng Xia, et al. A robust mtmc tracking system for ai-city challenge 2021. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4044–4053, 2021.
[64] M. Ye, J. Shen, G. Lin, T. Xiang, L. Shao, and S. Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell., 2021.
[65] Z. Zhang, C. Lan, W. Zeng, X. Jin, and Z. Chen. Relation-aware global attention for person re-identification. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3186–3195, 2020.
[66] X. Zhou, V. Koltun, and P. Krähenbühl. Tracking objects as points. In Eur. Conf. Comput. Vis., pages 474–490, 2020.
[67] Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, Qinghua Hu, and Haibin Ling. Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.
