arXiv:2202.13514v2 [cs.CV] 2 Feb 2023

StrongSORT: Make DeepSORT Great Again

Yunhao Du, Zhicheng Zhao, Yang Song, Yanyun Zhao, Fei Su, Tao Gong, Hongying Meng

Yunhao Du, Zhicheng Zhao, Yang Song, Yanyun Zhao and Fei Su are with the Key Laboratory of Interactive Technology and Experience System, Ministry of Culture and Tourism, Beijing Key Laboratory of Network System and Network Culture, and School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China. Tao Gong is with Shanghai AI Laboratory. Hongying Meng is with the College of Engineering, Design, and Physical Sciences, Brunel University London, Uxbridge, United Kingdom.

Abstract—Recently, multi-object tracking (MOT) has attracted increasing attention, and accordingly, remarkable progress has been achieved. However, the existing methods tend to use various basic models (e.g., detector and embedding models) and different training or inference tricks. As a result, the construction of a good baseline for a fair comparison is essential. In this paper, a classic tracker, i.e., DeepSORT, is first revisited, and then is significantly improved from multiple perspectives such as object detection, feature embedding, and trajectory association. The proposed tracker, named StrongSORT, contributes a strong and fair baseline to the MOT community. Moreover, two lightweight and plug-and-play algorithms are proposed to address two inherent "missing" problems of MOT: missing association and missing detection. Specifically, unlike most methods, which associate short tracklets into complete trajectories at high computational complexity, we propose an appearance-free link model (AFLink) to perform global association without appearance information, and achieve a good balance between speed and accuracy. Furthermore, we propose Gaussian-smoothed interpolation (GSI), based on Gaussian process regression, to relieve the missing detection problem. AFLink and GSI can be plugged into various trackers with a negligible extra computational cost (1.7 ms and 7.1 ms per image, respectively, on MOT17). Finally, by fusing StrongSORT with AFLink and GSI, the final tracker (StrongSORT++) achieves state-of-the-art results on multiple public benchmarks, i.e., MOT17, MOT20, DanceTrack and KITTI. Codes are available at https://github.com/dyhBUPT/StrongSORT and https://github.com/open-mmlab/mmtracking.

Index Terms—Multi-Object Tracking, Baseline, AFLink, GSI

I. INTRODUCTION

MULTI-OBJECT TRACKING (MOT) aims to detect and track all specific classes of objects frame by frame, which plays an essential role in video understanding. In the past few years, the MOT task has been dominated by the tracking-by-detection (TBD) paradigm [60, 3, 55, 4, 32], which performs per-frame detection and formulates the MOT problem as a data association task. TBD methods tend to extract appearance and/or motion embeddings first and then perform bipartite graph matching. Benefiting from high-performing object detection models, TBD methods have gained favour due to their excellent performance.

As MOT is a downstream task of object detection and object re-identification (ReID), recent works tend to use various detectors and ReID models to increase MOT performance [15, 39], which makes it difficult to construct a fair comparison between them. Another problem preventing fair comparison is the usage of various external datasets for training [64, 63].
Moreover, some training and inference tricks are also used to improve the tracking performance. To solve the above problems, this paper presents a simple but effective MOT baseline called StrongSORT. We revisit the classic TBD tracker DeepSORT [55], which is among the earliest methods that apply a deep learning model to the MOT task. We choose DeepSORT because of its simplicity, expansibility and effectiveness. It is claimed that DeepSORT underperforms compared with state-of-the-art methods because of its outdated techniques, rather than its tracking paradigm. To be specific, we first equip DeepSORT with a strong detector [18] following [63] and a strong embedding model [30]. Then, we collect some inference tricks from recent works to further improve its performance. Simply equipping DeepSORT with these advanced components results in the proposed StrongSORT, and it is shown that it can achieve SOTA results on the popular benchmarks MOT17 [31] and MOT20 [9].

The motivations of StrongSORT can be summarized as follows:
- It can serve as a baseline for fair comparison between different tracking methods, especially for tracking-by-detection trackers.
- Compared to weak baselines, a stronger baseline can better demonstrate the effectiveness of methods.
- The elaborately collected inference tricks can be applied to other trackers without the need to retrain the model. This can benefit some tasks in academia and industry.

There are two "missing" problems in the MOT task, i.e., missing association and missing detection. Missing association means the same object is spread over more than one tracklet. This problem is particularly common in online trackers because they lack global information during association. Missing detection, also known as false negatives, refers to recognizing an object as background, which is usually caused by occlusion and low resolution.

First, for the missing association problem, several methods propose to associate short tracklets into trajectories using a global link model [11, 47, 50, 35, 58]. They usually first generate accurate but incomplete tracklets and then associate them with global information in an offline manner. Although these methods improve tracking performance significantly, they rely on computation-intensive models, especially appearance embeddings. In contrast, we propose an appearance-free link model (AFLink), which only utilizes spatiotemporal information to predict whether the two input tracklets belong to the same ID. Without the appearance model, AFLink achieves a better trade-off between speed and accuracy.

Second, linear interpolation is widely used to compensate for missing detections [36, 22, 35, 37, 63, 11]. However, it ignores motion information during interpolation, which limits the accuracy of the interpolated positions. To solve this problem, we propose the Gaussian-smoothed interpolation algorithm (GSI), which fixes the interpolated bounding boxes using the Gaussian process regression algorithm [54]. GSI is also a kind of detection noise filter that can produce more accurate and stable localizations.
AFLink and GSI are both lightweight, plug-and-play, model-independent and appearance-free models, which are beneficial and suitable for this study. Extensive experiments demonstrate that they can create notable improvements in StrongSORT and other state-of-the-art trackers, e.g., CenterTrack [66], TransTrack [45] and FairMOT [64], with running speeds of 1.7 ms and 7.1 ms per image, respectively, on MOT17. In particular, by applying AFLink and GSI to StrongSORT, we obtain a stronger tracker called StrongSORT++. It achieves SOTA results on various benchmarks, i.e., MOT17, MOT20, DanceTrack [44] and KITTI [19]. Figure 1 presents the IDF1-MOTA-HOTA comparisons of state-of-the-art trackers with our proposed StrongSORT and StrongSORT++ on the MOT17 and MOT20 test sets.

Fig. 1: IDF1-MOTA-HOTA comparisons of state-of-the-art trackers with our proposed StrongSORT and StrongSORT++ on the MOT17 and MOT20 test sets. The horizontal axis is MOTA, the vertical axis is IDF1, and the radius of the circle is HOTA. "*" represents our reproduced version. Our StrongSORT++ achieves the best IDF1 and HOTA and comparable MOTA performance.

The contributions of our work are summarized as follows:
- We propose StrongSORT, which equips DeepSORT with advanced modules (i.e., detector and embedding model) and some inference tricks. It can serve as a strong and fair baseline for other MOT methods, which is valuable to both academia and industry.
- We propose two novel and lightweight algorithms, AFLink and GSI, which can be plugged into various trackers to improve their performance with a negligible extra computational cost.
- Extensive experiments are designed to demonstrate the effectiveness of the proposed methods. Furthermore, the proposed StrongSORT and StrongSORT++ achieve SOTA performance on multiple benchmarks, including MOT17, MOT20, DanceTrack and KITTI.

II. RELATED WORK

A. Separate and Joint Trackers

MOT methods can be classified into separate and joint trackers. Separate trackers [60, 3, 55, 4, 32, 21] follow the tracking-by-detection paradigm, which localizes targets first and then associates them with information on appearance, motion, etc. Benefiting from the rapid development of object detection [39, 38, 18], separate trackers have been widely applied in MOT tasks. Recently, several joint tracking methods [57, 59, 28, 51] have been proposed to jointly train detection and other components, such as motion, embedding and association models. The main advantages of these trackers are low computational cost and comparable performance.

Meanwhile, several recent studies [12, 43, 63, 7] have abandoned appearance information and relied only on high-performance detectors and motion information, which achieve high running speed and state-of-the-art performance on MOTChallenge benchmarks [31, 9]. However, abandoning appearance features would lead to poor robustness in more complex scenes. In this paper, we adopt the DeepSORT-like [55] paradigm and equip it with advanced techniques from various aspects to confirm the effectiveness of this classic framework.

B. Global Link in MOT

Missing association is an essential problem in MOT tasks. To exploit rich global information, several methods refine the tracking results with a global link model [11, 47, 50, 35, 58]. They first generate accurate but incomplete tracklets using spatiotemporal and/or appearance information. Then, these tracklets are linked by exploring global information in an offline manner. TNT [50] is designed with a multiscale TrackletNet to measure the connectivity between two tracklets. It encodes motion and appearance information in a unified network using multiscale convolution kernels. TPM [35] is presented with a tracklet-plane matching process to push easily confusable tracklets into different tracklet-planes, which helps reduce the confusion in the tracklet matching step.
ReMOT [58] splits imperfect trajectories into tracklets and then merges them with appearance features. GIAOTracker [11] proposes a complex global link algorithm that encodes tracklet appearance features using an improved ResNet50-TP model [16] and associates tracklets together with spatial and temporal costs. Although these methods yield notable improvements, they rely on appearance features, which bring a high computational cost. In contrast, the proposed AFLink model exploits only motion information to predict the link confidence between two tracklets. By designing an appropriate model framework and training process, AFLink benefits various state-of-the-art trackers with a negligible extra cost.

AFLink shares similar motivations with LGMTracker [48], which also associates tracklets with motion information. LGMTracker is designed with an interesting but complex reconstruct-to-embed strategy to perform tracklet association based on GCN and TGC modules, which aims to solve the problem of latent space dissimilarity. However, AFLink shows that by carefully designing the framework and training strategy, a much simpler and more lightweight module can still work well. In particular, AFLink takes only 10+ seconds for training and 10 seconds for testing on MOT17.

C. Interpolation in MOT

Linear interpolation is widely used to fill the gaps in recovered trajectories for missing detections [36, 22, 35, 37, 63, 11]. Despite its simplicity and effectiveness, linear interpolation ignores motion information, which limits the accuracy of the restored bounding boxes. To solve this problem, several strategies have been proposed to utilize spatiotemporal information effectively. VIOUTracker [5] extends IOUTracker [4] by falling back to single-object tracking while missing detection occurs. MAT [20] smooths linearly interpolated trajectories nonlinearly by adopting a cyclic pseudo-observation trajectory filling strategy; an extra camera motion compensation (CMC) model [13] and a Kalman filter [24] are needed to predict missing positions. MAATrack [43] simplifies it by applying only the CMC model. All these methods apply extra models, i.e., a single-object tracker, CMC, or a Kalman filter, in exchange for performance gains. Instead, we propose modeling nonlinear motion on the basis of the Gaussian process regression (GPR) algorithm [54]. Without additional time-consuming components, our proposed GSI algorithm achieves a good trade-off between accuracy and efficiency.

The most similar work to our GSI is [67], which uses the GPR algorithm to smooth the uninterpolated tracklets for accurate velocity predictions. However, it addresses the event detection task in surveillance videos. In contrast, we study the MOT task and adopt GPR to refine the interpolated localizations. Moreover, we present an adaptive smoothness factor instead of presetting a hyperparameter as done in [67].

III. STRONGSORT

In this section, we present various approaches to upgrade DeepSORT [55] to StrongSORT. Specifically, we review DeepSORT in Section III-A and introduce StrongSORT in Section III-B. Notably, we do not claim any algorithmic novelty in this section. Instead, our contributions here lie in giving a clear understanding of DeepSORT and equipping it with various advanced techniques to present a strong MOT baseline.

A. Review of DeepSORT

We briefly summarize DeepSORT as a two-branch framework, that is, with an appearance branch and a motion branch, as shown in the top half of Figure 2.

In the appearance branch, given detections in each frame, the deep appearance descriptor (a simple CNN), which is pretrained on the person re-identification dataset MARS [65], is applied to extract their appearance features. It utilizes a feature bank mechanism to store the features of the last 100 frames for each tracklet. As new detections come, the smallest cosine distance between the feature bank $B_i$ of the $i$-th tracklet and the feature $f_j$ of the $j$-th detection is computed as

$d(i, j) = \min\{\, 1 - f_j^{T} f_k^{(i)} \mid f_k^{(i)} \in B_i \,\}, \quad (1)$

The distance is used as the matching cost during the association procedure.
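As a concrete reading of Equation (1), a minimal NumPy sketch is given below. It assumes L2-normalized embeddings (so the cosine distance reduces to one minus an inner product); the function name is ours, not taken from the official implementation.

```python
import numpy as np

def appearance_distance(bank, det_feat):
    """Smallest cosine distance between a tracklet's feature bank and one
    detection feature, as in Eq. (1). Assumes every feature is L2-normalized,
    so cosine distance is 1 minus the inner product."""
    feats = np.stack(bank)              # (K, D): features of up to the last 100 frames
    return float(np.min(1.0 - feats @ det_feat))
```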
In the motion branch, the Kalman filter algorithm [24] accounts for predicting the positions of tracklets in the current frame. It works by a two-phase process, i.e., state prediction and state update. In the state prediction step, it predicts the current state as

$\hat{x}_k^- = F_k \hat{x}_{k-1}, \quad (2)$

$P_k^- = F_k P_{k-1} F_k^{T} + Q_k, \quad (3)$

where $\hat{x}_{k-1}$ and $P_{k-1}$ are the mean and covariance of the state at time step $k-1$, $\hat{x}_k^-$ and $P_k^-$ are the estimated states at time step $k$, $F_k$ is the state transition model, and $Q_k$ is the covariance of the process noise. In the state update step, the Kalman gain is calculated based on the covariance of the estimated state $P_k^-$ and the observation noise $R_k$ as

$K = P_k^- H_k^{T} \left( H_k P_k^- H_k^{T} + R_k \right)^{-1}, \quad (4)$

where $H_k$ is the observation model, which maps the state from the estimation space to the observation space. Then, the Kalman gain $K$ is used to update the final state:

$\hat{x}_k = \hat{x}_k^- + K \left( z_k - H_k \hat{x}_k^- \right), \quad (5)$

$P_k = (I - K H_k) P_k^-, \quad (6)$

where $z_k$ is the measurement at time step $k$. Given the motion states of tracklets and new-coming detections, the Mahalanobis distance is used to measure the spatiotemporal dissimilarity between them. DeepSORT takes this motion distance as a gate to filter out unlikely associations.
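For clarity, the two-phase process of Equations (2)-(6) is sketched below in NumPy; the variable names mirror the symbols in the equations, and the function names are ours.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """State prediction, Eqs. (2)-(3)."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_update(x_pred, P_pred, z, H, R):
    """State update, Eqs. (4)-(6)."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain, Eq. (4)
    x = x_pred + K @ (z - H @ x_pred)        # Eq. (5)
    P = (np.eye(P_pred.shape[0]) - K @ H) @ P_pred  # Eq. (6)
    return x, P
```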
Fig. 2: Framework and performance comparison between DeepSORT and StrongSORT. Performance is evaluated on the MOT17 validation set based on detections predicted by YOLOX [18]. DeepSORT: IDF1 77.3, MOTA 76.7, HOTA 66.3; StrongSORT: IDF1 82.3 (+5.0), MOTA 77.1 (+0.4), HOTA 69.6 (+3.3).

Afterwards, the matching cascade algorithm is proposed to solve the association task as a series of subproblems instead of a global assignment problem. The core idea is to give greater matching priority to more frequently seen objects. Each association subproblem is solved using the Hungarian algorithm [27].

B. StrongSORT

Our improvements over DeepSORT include advanced modules and some inference tricks, as shown in the bottom half of Figure 2.

Advanced modules. DeepSORT uses the optimized Faster R-CNN [39] presented in [60] as the detector and trains a simple CNN as the embedding model. Instead, we replace the detector with YOLOX-X [18] following [63], which is not presented in Figure 2 for clarity. In addition, a stronger appearance feature extractor, BoT [30], is applied to replace the original simple CNN, which can extract much more discriminative features.

EMA. Although the feature bank mechanism in DeepSORT can preserve long-term information, it is sensitive to detection noise [11]. To solve this problem, we replace the feature bank mechanism with the feature updating strategy proposed in [52], which updates the appearance state $e_i^t$ for the $i$-th tracklet at frame $t$ in an exponential moving average (EMA) manner as follows:

$e_i^t = \alpha\, e_i^{t-1} + (1 - \alpha)\, f_i^t, \quad (7)$

where $f_i^t$ is the appearance embedding of the current matched detection and $\alpha = 0.9$ is a momentum term. The EMA updating strategy leverages the information of inter-frame feature changes and can depress detection noise. Experiments show that it not only enhances the matching quality but also reduces the time consumption.

ECC. Camera movements exist in multiple benchmarks [31, 44, 19]. Similar to [20, 43, 25, 21], we adopt the enhanced correlation coefficient maximization (ECC) [13] model for camera motion compensation. It is a technique for parametric image alignment that can estimate the global rotation and translation between adjacent frames. Specifically, it is based on the following criterion to quantify the performance of the warping transformation:

$E_{ECC}(\mathbf{p}) = \left\| \frac{\bar{i}_r}{\|\bar{i}_r\|} - \frac{\bar{i}_w(\mathbf{p})}{\|\bar{i}_w(\mathbf{p})\|} \right\|^2, \quad (8)$

where $\mathbf{p}$ is the warp parameter vector, and $\bar{i}_r$ and $\bar{i}_w(\mathbf{p})$ are the zero-mean versions of the reference (template) image $i_r$ and the warped image $i_w(\mathbf{p})$. Then, the image alignment problem is solved by minimizing $E_{ECC}(\mathbf{p})$ with the proposed forward additive iterative algorithm or inverse compositional iterative algorithm. Due to its efficiency and effectiveness, ECC is widely used to compensate for the motion noise caused by camera movement in MOT tasks.

NSA Kalman. The vanilla Kalman filter is vulnerable to low-quality detections [43] and ignores the information on the scales of detection noise [11]. To solve this problem, we borrow the NSA Kalman algorithm from GIAOTracker [11], which proposes a formula to adaptively calculate the noise covariance $\tilde{R}_k$:

$\tilde{R}_k = (1 - c_k)\, R_k, \quad (9)$

where $R_k$ is the preset constant measurement noise covariance and $c_k$ is the detection confidence score at state $k$. Intuitively, the detection has a higher score $c_k$ when it has less noise, which results in a low $\tilde{R}_k$. According to Equations (4)-(6), a lower $\tilde{R}_k$ means that the detection will have a higher weight in the state update step, and vice versa. This can help improve the accuracy of the updated states.

Motion Cost. DeepSORT only employs the appearance feature distance as the matching cost during the first association stage, in which the motion distance is only used as a gate. Instead, we solve the assignment problem with both appearance and motion information, similar to [52, 64]. The cost matrix $C$ is a weighted sum of the appearance cost $A_a$ and the motion cost $A_m$ as follows:

$C = \lambda A_a + (1 - \lambda) A_m, \quad (10)$

where the weight factor $\lambda$ is set to 0.98, as in [52, 64].

Vanilla Matching. An interesting finding is that although the matching cascade algorithm is not trivial in DeepSORT, it limits the performance as the tracker becomes more powerful. The reason is that as the tracker becomes stronger, it becomes more robust to confusing associations. Therefore, additional prior constraints limit the matching accuracy. We solve this problem by simply replacing the matching cascade with vanilla global linear assignment.
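Taken together, the EMA, NSA and motion-cost tricks, plus vanilla matching, amount to only a few lines of code. Below is a minimal NumPy/SciPy sketch of Equations (7), (9) and (10) under our own naming; the re-normalization after the EMA update is our choice, not something the paper specifies.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

ALPHA = 0.9    # EMA momentum term (Eq. 7)
LAMBDA = 0.98  # appearance-cost weight (Eq. 10)

def ema_update(e_prev, f_cur):
    """Exponential moving average of a tracklet's appearance state, Eq. (7)."""
    e = ALPHA * e_prev + (1.0 - ALPHA) * f_cur
    return e / np.linalg.norm(e)   # re-normalize (our choice, keeps cosine costs valid)

def nsa_noise(R, conf):
    """Confidence-scaled measurement noise of the NSA Kalman filter, Eq. (9)."""
    return (1.0 - conf) * R

def associate(A_app, A_motion):
    """Fuse the appearance and motion costs (Eq. 10) and solve one global
    linear assignment, i.e., the vanilla matching that replaces the cascade."""
    C = LAMBDA * A_app + (1.0 - LAMBDA) * A_motion
    rows, cols = linear_sum_assignment(C)
    return list(zip(rows.tolist(), cols.tolist()))
```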
In this sec- tion, we introduce two lightweight, plug-and-play, model- independent, appearance-free algorithms, namely, AFLink and GSI, to further solve the problems of missing association and iissing detection, We call the final method StrongSORT++, which integrates StongSORT with these two algorithms, A. AFLink The global link for tracklets is used in several works to pursue highly accurate associations, However, they gen- ‘rally rely on computationally expensive components and ‘have numerous hyperparameters to fine-tune, For example, the link algorithm in GIAOTracker [|] utilizes an improved ResNetS0-TP [16] to extract tracklet 3D features and performs association with additional spatial and temporal distances. Tt thas six hyperparameters to be set, ie, three thresholds and tee weight factors, which incurs heavy tuning experiments and poor robustness. Moreover, overteliance on appearance features can be vulnerable to occlusion. Motivated by this, we design an appearance-fiee model, AFLink, to predict the connectivity between two wacklets by relying only on spatiotemporal information, Figure 3 shows the two-branch framework of the AFLink model. It adopts two tacklets T; and 7, as the input, where Te = (fi,2f.yg}EL}Y consists of the frame id ff and positions (rf, yz) of the most recent N= 30 frames. Zero padding is used for tracklets that is shorter than 30 frames. A temporal module is applied to extract features by convolving along the temporal dimension with 7 > 1 kernels, which consists of four "Conv-BN-ReLU” layers. Then, the fusion module, which is a single 1 x 3 convolution layer with BN and ReLU, is used to integrate the information from different feature dimensions, namely f, x and y. The two resulting feature maps are pooled and squeezed to feature vectors and then concatenated, which includes rich spatiotemporal information. Finally, an MLP is used to predict a confidence score for association. Note that the weights ofthe two branches in the temporal and fusion modules are not shared. Dring taining, the association procedure is formulated as a binary classification task. Then, itis optimized with the binary exossentropy loss as follows: pee 4 ~ Wale a (1 valog(h Ds where Z, € [0,1] is the predicted probability of association for sample pair n, andy, € {0,1} is the ground truth During association, we filter out unreasonable tracklet pairs with spatiotemporal constraints. Then, the global link is solved asa linear assignment task (27] with the predicted connectivity =a B. GST Interpolation is widely used to fil the gaps in trajectories caused by missing detections. Linear interpolation is popular ddue to its simplicity; however, its accuracy is limited because it does not use motion information. Although several strategies Tracked+GSI Fig. 4: Tlustration of the difference between linear interpo- lation (LI and the proposed Gaussian-smoothed interpolation Gs hhave been proposed to solve this problem, they generally introduce additional time-consuming modules, e.g., a single- object tracker, a Kalman filter, and ECC. In contrast, we present a lightweight interpolation algorithm that employs Gaussian process regression (5] to model nonlinear motion. We formulate the GSI model for the i-th trajectory as follows: B= f+ (12) where £ € Fis the frame id, € P is the position coordinate variable at frame & (i.e., z,y, w,h) and ¢~ N(0,0%) is Gaus- sian noise. 
B. GSI

Interpolation is widely used to fill the gaps in trajectories caused by missing detections. Linear interpolation is popular due to its simplicity; however, its accuracy is limited because it does not use motion information. Although several strategies have been proposed to solve this problem, they generally introduce additional time-consuming modules, e.g., a single-object tracker, a Kalman filter, or ECC. In contrast, we present a lightweight interpolation algorithm that employs Gaussian process regression [54] to model nonlinear motion. We formulate the GSI model for the $i$-th trajectory as

$p_t^{(i)} = f^{(i)}(t) + \epsilon, \quad (12)$

where $t \in F$ is the frame id, $p \in P$ is the position coordinate variable at frame $t$ (i.e., $x, y, w, h$) and $\epsilon \sim N(0, \sigma^2)$ is Gaussian noise. Given the tracked and linearly interpolated trajectory $S^{(i)} = \{t^{(i)}, p_t^{(i)}\}$ with length $l$, the task of nonlinear motion modeling is solved by fitting the function $f^{(i)}$. We assume that it obeys a Gaussian process

$f^{(i)} \sim GP(0, k(\cdot, \cdot)), \quad (13)$

where $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\lambda^2}\right)$ is a radial basis function kernel. On the basis of the properties of the Gaussian process, given a new frame set $F^*$, its smoothed position $P^*$ is predicted by

$P^* = K(F^*, F) \left( K(F, F) + \sigma^2 I \right)^{-1} P, \quad (14)$

where $K(\cdot, \cdot)$ is a covariance function based on $k(\cdot, \cdot)$.

Moreover, the hyperparameter $\lambda$ controls the smoothness of the trajectory, which should be related to its length. We simply design it as a function adaptive to the length $l$ as follows:

$\lambda = \tau \cdot \log(\tau^3 / l), \quad (15)$

where $\tau$ is set to 10 based on the ablation experiment.

Fig. 4: Illustration of the difference between linear interpolation (LI) and the proposed Gaussian-smoothed interpolation (GSI).

Figure 4 illustrates an example of the difference between GSI and linear interpolation (LI). The raw tracked results (in orange) generally include noisy jitter, and LI (in blue) ignores motion information. Our GSI (in red) solves both problems simultaneously by smoothing the entire trajectory with an adaptive smoothness factor.
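Equations (12)-(15) can be realized with an off-the-shelf GP regressor. The sketch below uses scikit-learn's GaussianProcessRegressor with an RBF kernel, fitted per coordinate. Treating the noise variance $\sigma^2$ as the regressor's alpha parameter, and the default value chosen for it here, are our assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

TAU = 10  # hyperparameter from the ablation study

def gsi_smooth(frames, coords, noise=1.0):
    """Gaussian-smoothed interpolation for one trajectory and one coordinate
    (x, y, w or h), following Eqs. (12)-(15). `frames` holds the frame ids of
    the tracked plus linearly interpolated boxes; `coords` the values.
    `noise` stands in for the sigma^2 of Eq. (12); its value is our assumption."""
    length = len(frames)
    lam = TAU * np.log(TAU ** 3 / length)      # adaptive smoothness, Eq. (15); >0 for l < 1000
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=lam),
                                   alpha=noise,   # sigma^2 * I term of Eq. (14)
                                   optimizer=None)  # keep the adaptive lambda fixed
    t = np.asarray(frames, dtype=float).reshape(-1, 1)
    gpr.fit(t, np.asarray(coords, dtype=float))
    return gpr.predict(t)                      # smoothed positions, Eq. (14)
```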
V. EXPERIMENTS

A. Setting

Datasets. We conduct experiments on the MOT17 [31] and MOT20 [9] datasets under the "private detection" protocol. MOT17 is a popular dataset for MOT, which consists of 7 sequences and 5,316 frames for training and 7 sequences and 5,919 frames for testing. MOT20 is a dataset of highly crowded challenging scenes, with 4 sequences and 8,931 frames for training and 4 sequences and 4,479 frames for testing. For ablation studies, we take the first half of each sequence in the MOT17 training set for training and the last half for validation, following [66, 63]. We use DukeMTMC [40] to pretrain our appearance feature extractor. We train the detector on the CrowdHuman dataset [41] and the MOT17 half training set for ablation, following [66, 63, 45, 56, 61]. We add CityPersons [62] and ETHZ [12] for testing as in [63, 52, 64, 28]. We also test StrongSORT++ on KITTI [19] and DanceTrack [44]. KITTI is a popular dataset related to autonomous driving tasks. It can be used for pedestrian and car tracking, and consists of 21 training sequences and 29 test sequences with a relatively low frame rate of 10 FPS. DanceTrack is a recently proposed dataset for multi-human tracking, which encourages more MOT algorithms that rely less on visual discrimination and depend more on motion analysis. It consists of 100 group dancing videos, where humans have similar appearances but diverse motion patterns.

Metrics. We use the metrics MOTA, IDs, IDF1, HOTA, AssA, DetA and FPS to evaluate tracking performance [2, 40, 29]. MOTA is computed based on FP, FN and IDs and focuses more on detection performance. By comparison, IDF1 better measures the consistency of ID matching. HOTA is an explicit combination of the detection score DetA and the association score AssA, which balances the effects of performing accurate detection and association in a single unified metric. Moreover, it evaluates at a number of distinct detection similarity values (0.05 to 0.95 in 0.05 intervals) between predicted and GT bounding boxes, instead of setting a single value (i.e., 0.5) as in MOTA and IDF1, and thus better takes localization accuracy into account.

Implementation Details. We present the default implementation details in this section. For detection, we adopt YOLOX-X [18] as our detector for an improved time-accuracy trade-off. The training schedule is similar to that in [63]. In inference, a threshold of 0.8 is set for non-maximum suppression (NMS) and a threshold of 0.6 for detection confidence. For StrongSORT, the matching distance threshold is 0.45, the warp mode for ECC is MOTION_EUCLIDEAN, the momentum term $\alpha$ in EMA is 0.9 and the weight factor for the appearance cost $\lambda$ is 0.98. For GSI, the maximum gap allowed for interpolation is 20 frames, and the hyperparameter $\tau$ is 10.

For AFLink, the temporal module consists of four convolution layers with $7 \times 1$ kernels and {32, 64, 128, 256} output channels. Each convolution is followed by a BN layer and a ReLU activation layer. The fusion module includes a $1 \times 3$ convolution, a BN and a ReLU. It does not change the number of channels. The classifier is an MLP with two fully connected layers and a ReLU layer inserted in between. The training data are generated by cutting annotated trajectories into tracklets with random spatiotemporal noise, at a 1:3 ratio of positive to negative samples. We use Adam as the optimizer [26] and cross-entropy loss as the objective function, and train for 20 epochs with a cosine annealing learning rate schedule. The overall training process takes just over 10 seconds. In inference, a temporal distance threshold of 30 frames and a spatial distance threshold of 75 pixels are used to filter out unreasonable association pairs. Finally, an association is accepted if its prediction score is larger than 0.95. All experiments are conducted on a server machine with a single V100 GPU.
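A minimal training loop matching the AFLink recipe above might look as follows; the learning rate and the loader's (t_i, t_j, label) batch format are our assumptions.

```python
import torch
import torch.nn as nn

def train_aflink(model, loader, epochs=20, device="cuda"):
    """Train AFLink with Adam, binary cross-entropy on association labels
    (Eq. 11), and a cosine-annealed learning rate over 20 epochs.
    The learning rate (1e-3) is our assumption, not from the paper."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    bce = nn.BCELoss()
    for _ in range(epochs):
        for t_i, t_j, label in loader:  # assumed batch format: two tracklets + 0/1 label
            t_i, t_j = t_i.to(device), t_j.to(device)
            loss = bce(model(t_i, t_j), label.float().to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
```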
TABLE I: Ablation study on the MOT17 validation set for basic strategies, i.e., stronger feature extractor (BoT), camera motion compensation (ECC), NSA Kalman filter (NSA), EMA feature updating mechanism (EMA), matching with motion cost (MC) and abandoning matching cascade (woC). (best in bold)

TABLE II: Results of applying AFLink and GSI to various MOT methods. All experiments are performed on the MOT17 validation set with a single GPU. (best in bold)

B. Ablation Studies

Ablation study for StrongSORT. Table I summarizes the path from DeepSORT to StrongSORT:
1) BoT: Replacing the original feature extractor with BoT leads to a significant improvement in IDF1 (+2.2), indicating that association quality benefits from more discriminative appearance features.
2) ECC: The CMC model results in a slight increase in IDF1 (+0.2) and MOTA (+0.3), implying that it helps extract more precise motion information.
3) NSA: The NSA Kalman filter improves HOTA (+0.4) but not MOTA and IDF1. This means that it enhances positioning accuracy.
4) EMA: The EMA feature updating mechanism brings not only superior association (+0.4 IDF1) but also faster speed (+1.2 FPS).
5) MC: Matching with both appearance and motion cost aids association (+0.8 IDF1).
6) woC: For the stronger tracker, the matching cascade algorithm with its redundant prior information limits the tracking accuracy. By simply employing a vanilla matching method, IDF1 is improved by a large margin (+1.4).

Ablation study for AFLink and GSI. We apply AFLink and GSI to six different trackers, i.e., three versions of StrongSORT and three state-of-the-art trackers (CenterTrack [66], TransTrack [45] and FairMOT [64]). Their results are shown in Table II. The first line of the results for each tracker is the original performance. The application of AFLink (the second line) brings different levels of improvement for the different trackers. Specifically, poorer trackers tend to benefit more from AFLink due to more missing associations. In particular, the IDF1 of CenterTrack is improved by 3.7. The third line of the results for each tracker proves the effectiveness of GSI for both detection and association. Different from AFLink, GSI works better on stronger trackers, but it can be confused by the large number of false associations in poor trackers.

Ablation study for vanilla matching. We present the comparison between the matching cascade algorithm and vanilla matching on different baselines in Table III. It is shown that the matching cascade algorithm greatly benefits DeepSORT. However, with the gradual enhancement of the baseline tracker, it has increasingly smaller advantages and is even harmful to tracking accuracy. Specifically, for StrongSORTv5, replacing the matching cascade with vanilla matching brings a gain of 1.4 on IDF1. This leads us to the following interesting conclusion: although the a priori assumption in the matching cascade can reduce confusing associations in poor trackers, this additional constraint limits the performance of stronger trackers instead.

Additional analysis of GSI. Speed estimation is essential for some downstream tasks, e.g., action analysis [10], and benefits the construction of intelligent transportation systems (ITSs) [14]. To measure the performance of different interpolation algorithms on the speed estimation task, we compare the normalized velocity between trajectories after applying linear interpolation (LI) and Gaussian-smoothed interpolation (GSI) in Figure 5. Specifically, six trajectories from DeepSORT on the MOT17 validation set are sampled. The x-coordinate and y-coordinate represent the frame id and normalized velocity, respectively. It is shown that the velocity of trajectories with LI jitters wildly (in red), mainly due to detection noise. Instead, trajectories with GSI have more stable velocities (in blue). This gives us another perspective from which to understand GSI: GSI is a kind of detection noise filter that can produce more accurate and stable localizations. This feature is beneficial to speed estimation and other related tasks.

Fig. 5: Comparison of normalized velocity between the trajectories after applying linear interpolation (LI, in red) and Gaussian-smoothed interpolation (GSI, in blue). The x-coordinate represents the frame id, and the y-coordinate is the normalized velocity.
TABLE III: Ablation study on the MOT17 validation set for the matching cascade algorithm and vanilla matching.

C. Main Results

We compare StrongSORT, StrongSORT+ (StrongSORT + AFLink) and StrongSORT++ (StrongSORT + AFLink + GSI) with state-of-the-art trackers on the test sets of MOT17, MOT20, DanceTrack and KITTI, as shown in Tables IV, V, VI and VII, respectively. Notably, comparing FPS fairly is difficult, because the speed claimed by each method depends on the device on which it is implemented, and the time spent on detections is generally excluded for tracking-by-detection trackers.

TABLE IV: Comparison with state-of-the-art MOT methods on the MOT17 test set. "*" represents our reproduced version. "(w/o LI)" means abandoning the offline linear interpolation procedure. The two best results for each metric are bolded and highlighted in red and blue.

MOT17. StrongSORT++ ranks first on MOT17 for the metrics HOTA, IDF1, AssA and DetA, and ranks second for MOTA and IDs. In particular, it yields accurate associations and outperforms the second-best tracker by a large margin (i.e., +2.1 IDF1 and +2.1 AssA). We use the same hyperparameters as in the ablation study and do not carefully tune them for each sequence as in [63]. The steady improvements on the test set prove the robustness of our methods.
"” represents our reproduced version, ‘Qw/o LI)” means abandoning the offline linear interpolation procedure. The two best resulls for each metric are bolded and highlighted in red and blue. ‘mode Method Ret | HOTA) | IDFLT) | MOTAG) | ASAT) | DetACD | Isit) | FPS) SORT DT TPIS] Pa] 5 7) 4a10- 373 ‘Tracklor++ (1) acevaiy | ai | 327 | sas | s20 | 423 | ress | 12 Strack [25] ‘rap2022 sso | sso | 656 | seo | sez | sive | 4s ParMOT (6"] nevaor | sas | sis | ois | sav | 5e7 | S20 | 32 omine | CeowdTeack [2 sso | 682 70, s2s | si7 | save | 95 | RelationTrack [59] ses | ms | 12 | sea | sos | 4203 | 43 OC-SORT* (w/o LD [7] wos | ma | ma | eos | os | 1307 | - ByteTrack? (wo L2) [5] os | m9 | 7 | 599 | oxo | i3e7 | ars ‘DeepSORT* [55 33 ws | ms | 35s | soo | sas | 32 as | 9 | m2 | 62 | so9 | 1066 | 15 aS] 33d] 8] a as] 6d aT MPNTrack (6) cver220 | aos | sor | sis | 473 | ass | iz | 6s Maatreck (13) | wacvw2022 | 373 | m2 | m9 | ssa | so7 | 3a | ser ReMOT [5s] wou | a2 | 73 ma | ser | oso | i789 | oa oc-soRt [7] anno «| 21 | 759 | 95s : > fos] - yteTrack* [63] az | 7. ws | soo | o26 | tia | as StrongSORT+ a6 | m3 | m2 | ese | soo | ross | 15 StrongSORT++ ours as | mo | ne | oso | ots | a | ia ‘TABLE VI: Comparison with state-of-the-art MOT methods on the DanceTrack test set, The two best results for each metric ae bolded and highlighted in red and blue. Method | HOTA) CenterTrack [oo] | ECCVIOD FaMOT [6s] | -CV2021 ‘TransTrack [15] | arxiv2020 TraDes [56] cvPR2021 ByteTrack [63] | ECCV2022 MOTE [61] ECCV2022 oc-sort (7) | aniv2022 StrongSORT:++ | ours TABLE VI: Comparison with state-of-the-art MOT methods bolded and highlighted in red and blue. ‘MOTA(T) AssA(T) DetA(D) fon the KITTI test set, The two best results for each metric are Car Pedestian vera] [MOTA [ASAT IDA | HOTTY | MOTA | ASAT [DAD ABSD ST a 33 |e ‘MPNTiek [6] : = | 4526 | 4623 | az28 | 397 CenterTrack (65) sss | 7120 | ass | 43s | sas | 3693 | ars qpspt >] | ream? | 7277 | ssoa | zza9 | 206 | aos | sim | 38s2 | 717 Qptrack (35) | cvpr2o2t | oss | s493 | 6549 | 3s | 4ni2 | ssss | 3si0 | 487 LGMTracker [1s] | tecv2021 | 7314 | s760 | mar | as : : - Permatrack [35] | 1ceva021 | 7742 | oss | 7765 | 275 | 474s | 650s | 4306 | 48s 0C-SORT [7 acw2022 | 7654 | 9028 | 7639 | 250 | seo | saa | soos | 205 StrongSORT++ ous m7s_|_ 903s | 7820 | a4 | sas | 673s | 573i | ams set prove the robustness of our methods. It is worth noting that ‘our reproduced version of DecpSORT (with a stronger detector YOLOX and several tuned hyperparameters) also. performs well on the benchmark, which demonstrates the effectiveness of the DeepSORTlike tracking paradigm. MOT20, The data in MOT20 is taken from more crowded scenarios. High occlusion means a high risk of missing de- tections and associations. StrongSORT++ still ranks first for the metrics HOTA, IDFI and AssA. It achieves significantly fewer IDs than other trackers, Note that we use exactly the same byperparameters as in MOTI7, which implies the generalization capability of our method. Its detection perfor- mance (MOTA and DetA) is slightly poor compared to that ‘of several trackers, We think this is because we use the same detection score threshold as in MOTI7, which results in many missing detections. Specifically, the metric FN (number of false negatives) of our StrongSORT++ is 117,920, whereas that of ByteTrack [63] is only 87,594, DanceTrack. Our StrongSORT++ also achieves the best re- sults on the DanceTrack benchmark for most metrics. 
Because this dataset focuses less attention on appearance features, we abandon the appearance-related optimizations here, i.e., BoT and EMA. The NMS threshold is set to 0.7, the matching distance is 0.3, the AFLink prediction threshold is 0.9, and the GSI interpolation threshold is 5 frames. For a fair comparison, we use the same detections as ByteTrack [63] and achieve much better results, which demonstrates the superiority of our method.

KITTI. On the KITTI dataset, we use the same detection results as PermaTrack [46] and OC-SORT [7] for a fair comparison. The results show that StrongSORT++ achieves comparable results for cars and superior performance for pedestrians compared to PermaTrack. For simplicity, we only apply two tricks (i.e., ECC and NSA Kalman) and the two proposed algorithms (i.e., AFLink and GSI) here.

D. Qualitative Results

Figure 6 visualizes several tracking results of StrongSORT++ on the test sets of MOT17, MOT20, DanceTrack and KITTI. The results of MOT17-01 show the effectiveness of our method in normal scenarios. From the results of MOT17-08, we can see correct associations after occlusion. The results of MOT17-14 show that our method can work well while the camera is moving. Moreover, the results of MOT20-04 show the excellent performance of StrongSORT++ in scenarios with severe occlusion. The results of DanceTrack and KITTI demonstrate the effectiveness of StrongSORT++ when facing the problems of complex motion patterns and low frame rates.

Fig. 6: Example tracking results visualization of StrongSORT++ on the test sets of MOT17, MOT20, DanceTrack and KITTI. The box color corresponds to the ID.

E. Limitations

StrongSORT and StrongSORT++ still have several limitations. One concern is their relatively low running speed compared to joint trackers and several appearance-free separate trackers. This problem is mainly caused by the DeepSORT-like paradigm, which requires an extra detector and appearance model, whereas the proposed AFLink and GSI are both lightweight algorithms. Moreover, although our method performs well on the IDF1 and HOTA metrics, it has a slightly lower MOTA on MOT17 and MOT20, which is mainly caused by many missing detections due to the high threshold of the detection score. We believe an elaborate threshold strategy for the association algorithm would help. For AFLink, although it performs well in restoring missing associations, it is helpless against false association problems. Specifically, AFLink cannot split mixed-up ID trajectories into accurate tracklets. Future work is needed to develop stronger and more flexible global link strategies.

VI. CONCLUSION

In this paper, we revisit the classic tracker DeepSORT and upgrade it with new modules and several inference tricks. The resulting new tracker, StrongSORT, can serve as a new strong baseline for the MOT task. We also propose two lightweight and appearance-free algorithms, AFLink and GSI, to solve the missing association and missing detection problems.
Experiments show that they can be applied to and benefit various state-of-the-art trackers with a negligible extra computational cost. By integrating StrongSORT with AFLink and GSI, the resulting tracker StrongSORT++ achieves state-of-the-art results on multiple benchmarks, i.e., MOT17, MOT20, DanceTrack and KITTI.

ACKNOWLEDGMENTS

This work is supported by the Chinese National Natural Science Foundation under Grants (62076033, U1931202) and the BUPT Excellent Ph.D. Students Foundation (CX2022145).

REFERENCES

[1] Bergmann, P., Meinhardt, T., Leal-Taixé, L.: Tracking without bells and whistles. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941-951 (2019)
[2] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, 1-10 (2008)
[3] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464-3468. IEEE (2016)
[4] Bochinski, E., Eiselein, V., Sikora, T.: High-speed tracking-by-detection without using image information. In: 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6. IEEE (2017)
[5] Bochinski, E., Senst, T., Sikora, T.: Extending IOU based multi-object tracking by visual information. In: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-6. IEEE (2018)
[6] Brasó, G., Leal-Taixé, L.: Learning a neural solver for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6247-6257 (2020)
[7] Cao, J., Weng, X., Khirodkar, R., Pang, J., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. arXiv preprint arXiv:2203.14360 (2022)
[8] Dai, P., Wang, X., Zhang, W., Chen, J.: Instance segmentation enabled hybrid data association and discriminative hashing for online multi-object tracking. IEEE Transactions on Multimedia 21(7), 1709-1723 (2018)
[9] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: MOT20: A benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003 (2020)
[10] Du, Y., Tong, Z., Wan, J., Zhang, B., Zhao, Y.: PAMI-AD: An activity detector exploiting part-attention and motion information in surveillance videos. In: 2022 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), pp. 1-6. IEEE (2022)
[11] Du, Y., Wan, J., Zhao, Y., Zhang, B., Tong, Z., Dong, J.: GIAOTracker: A comprehensive framework for MCMOT with global information and optimizing strategies in VisDrone 2021. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2809-2819 (2021)
[12] Ess, A., Leibe, B., Schindler, K., Van Gool, L.: A mobile vision system for robust multi-person tracking. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8. IEEE (2008)
[13] Evangelidis, G.D., Psarakis, E.Z.: Parametric image alignment using enhanced correlation coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(10), 1858-1865 (2008)
[14] Fernández Llorca, D., Hernández Martínez, A., García Daza, I.: Vision-based vehicle speed estimation: A survey. IET Intelligent Transport Systems 15(8), 987-1005 (2021)
[15] Fu, Z., Angelini, F., Chambers, J., Naqvi, S.M.: Multi-level cooperative fusion of GM-PHD filters for online multiple human tracking. IEEE Transactions on Multimedia 21(9), 2277-2291 (2019)
[16] Gao, J., Nevatia, R.: Revisiting temporal modeling for video-based person ReID. arXiv preprint arXiv:1805.02104 (2018)
[17] Gao, T., Pan, H., Wang, Z., Gao, H.: A CRF-based framework for tracklet inactivation in online multi-object tracking. IEEE Transactions on Multimedia 24, 995-1007 (2021)
[18] Ge, Z., Liu, S., Wang, F., Li, Z., Sun, J.: YOLOX: Exceeding YOLO series in 2021. arXiv preprint arXiv:2107.08430 (2021)
[19] Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research 32(11), 1231-1237 (2013)
[20] Han, S., Huang, P., Wang, H., Yu, E., Liu, D., Pan, X.: MAT: Motion-aware multi-object tracking. Neurocomputing (2022)
[21] He, J., Huang, Z., Wang, N., Zhang, Z.: Learnable graph matching: Incorporating graph partitioning with deep feature learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5299-5309 (2021)
[22] Hofmann, M., Haag, M., Rigoll, G.: Unified hierarchical multi-object tracking using global data association. In: 2013 IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (PETS), pp. 22-28. IEEE (2013)
[23] Hu, H.N., Yang, Y.H., Fischer, T., Darrell, T., Yu, F., Sun, M.: Monocular quasi-dense 3D object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
[24] Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1), 35-45 (1960)
[25] Khurana, T., Dave, A., Ramanan, D.: Detecting invisible people. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3174-3184 (2021)
[26] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: International Conference on Learning Representations (2014)
[27] Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), 83-97 (1955)
[28] Liang, C., Zhang, Z., Zhou, X., Li, B., Zhu, S., Hu, W.: Rethinking the competition between detection and ReID in multi-object tracking. IEEE Transactions on Image Processing 31, 3182-3196 (2022)
[29] Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-Taixé, L., Leibe, B.: HOTA: A higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129(2), 548-578 (2021)
[30] Luo, H., Jiang, W., Gu, Y., Liu, F., Liao, X., Lai, S., Gu, J.: A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia 22(10), 2597-2609 (2019)
[31] Milan, A., Leal-Taixé, L., Reid, I., Roth, S., Schindler, K.: MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831 (2016)
[32] Naiel, M.A., Ahmad, M.O., Swamy, M., Lim, J., Yang, M.H.: Online multi-object tracking via robust collaborative model and sample selection. Computer Vision and Image Understanding 154, 94-107 (2017)
[33] Pang, B., Li, Y., Zhang, Y., Li, M., Lu, C.: TubeTK: Adopting tubes to track multi-object in a one-step training model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6308-6318 (2020)
[34] Pang, J., Qiu, L., Li, X., Chen, H., Li, Q., Darrell, T., Yu, F.: Quasi-dense similarity learning for multiple object tracking. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 164-173 (2021)
[35] Peng, J., Wang, T., Lin, W., Wang, J., See, J., Wen, S., Ding, E.: TPM: Multiple object tracking with tracklet-plane matching. Pattern Recognition 107, 107480 (2020)
[36] Perera, A.A., Srinivas, C., Hoogs, A., Brooksby, G., Hu, W.: Multi-object tracking through simultaneous long occlusions and split-merge conditions. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1, pp. 666-673. IEEE (2006)
[37] Possegger, H., Mauthner, T., Roth, P.M., Bischof, H.: Occlusion geodesics for online multi-object tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1306-1313 (2014)
[38] Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
[39] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28 (2015)
[40] Ristani, E., Solera, F., Zou, R., Cucchiara, R., Tomasi, C.: Performance measures and a data set for multi-target, multi-camera tracking. In: European Conference on Computer Vision, pp. 17-35. Springer (2016)
[41] Shao, S., Zhao, Z., Li, B., Xiao, T., Yu, G., Zhang, X., Sun, J.: CrowdHuman: A benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123 (2018)
[42] Stadler, D., Beyerer, J.: On the performance of crowd-specific detectors in multi-pedestrian tracking. In: 2021 17th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1-12. IEEE (2021)
[43] Stadler, D., Beyerer, J.: Modelling ambiguous assignments for multi-person tracking in crowds. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, pp. 133-142 (2022)
[44] Sun, P., Cao, J., Jiang, Y., Yuan, Z., Bai, S., Kitani, K., Luo, P.: DanceTrack: Multi-object tracking in uniform appearance and diverse motion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20993-21002 (2022)
[45] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: TransTrack: Multiple object tracking with transformer. arXiv preprint arXiv:2012.15460 (2020)
[46] Tokmakov, P., Li, J., Burgard, W., Gaidon, A.: Learning to track with object permanence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10860-10869 (2021)
[47] Wang, B., Wang, G., Chan, K.L., Wang, L.: Tracklet association by online target-specific metric learning and coherent dynamics estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(3), 589-602 (2016)
[48] Wang, G., Gu, R., Liu, Z., He, W., Song, M., Hwang, J.N.: Track without appearance: Learn box and tracklet embedding with local and global motion patterns for vehicle tracking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9876-9886 (2021)
[49] Wang, G., Wang, Y., Gu, R., Hu, W., Hwang, J.N.: Split and connect: A universal tracklet booster for multi-object tracking. IEEE Transactions on Multimedia (2022). https://doi.org/10.1109/TMM.2022.3140919
[50] Wang, G., Wang, Y., Zhang, H., Gu, R., Hwang, J.N.: Exploit the connectivity: Multi-object tracking with TrackletNet. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 482-490 (2019)
[51] Wang, Q., Zheng, Y., Pan, P., Xu, Y.: Multiple object tracking with correlation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3876-3886 (2021)
[52] Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: European Conference on Computer Vision, pp. 107-122. Springer (2020)
[53] Weng, X., Wang, J., Held, D., Kitani, K.: 3D multi-object tracking: A baseline and new evaluation metrics. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10359-10366. IEEE (2020)
[54] Williams, C., Rasmussen, C.: Gaussian processes for regression. Advances in Neural Information Processing Systems 8 (1995)
[55] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645-3649. IEEE (2017)
[56] Wu, J., Cao, J., Song, L., Wang, Y., Yang, M., Yuan, J.: Track to detect and segment: An online multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12352-12361 (2021)
[57] Xu, Y., Osep, A., Ban, Y., Horaud, R., Leal-Taixé, L., Alameda-Pineda, X.: How to train your deep multi-object tracker. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6787-6796 (2020)
[58] Yang, F., Chang, X., Sakti, S., Wu, Y., Nakamura, S.: ReMOT: A model-agnostic refinement for multiple object tracking. Image and Vision Computing 106, 104091 (2021)
[59] Yu, E., Li, Z., Han, S., Wang, H.: RelationTrack: Relation-aware multiple object tracking with decoupled representation. IEEE Transactions on Multimedia (2022). https://doi.org/10.1109/TMM.2022.3150169
[60] Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., Yan, J.: POI: Multiple object tracking with high performance detection and appearance feature. In: European Conference on Computer Vision, pp. 36-42. Springer (2016)
[61] Zeng, F., Dong, B., Zhang, Y., Wang, T., Zhang, X., Wei, Y.: MOTR: End-to-end multiple-object tracking with transformer. In: Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVII, pp. 659-675. Springer (2022)
[62] Zhang, S., Benenson, R., Schiele, B.: CityPersons: A diverse dataset for pedestrian detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213-3221 (2017)
[63] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: ByteTrack: Multi-object tracking by associating every detection box. In: Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII, pp. 1-21. Springer (2022)
[64] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129(11), 3069-3087 (2021)
[65] Zheng, L., Bie, Z., Sun, Y., Wang, J., Su, C., Wang, S., Tian, Q.: MARS: A video benchmark for large-scale person re-identification. In: European Conference on Computer Vision, pp. 868-884. Springer (2016)
[66] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: European Conference on Computer Vision, pp. 474-490. Springer (2020)
[67] Zou, Y., Zhou, K., Wang, M., Zhao, Y., Zhao, Z.: A comprehensive solution for detecting events in complex surveillance videos. Multimedia Tools and Applications 78(1), 817-838 (2019)
