Multi-Object Tracking and Segmentation With Embedding Mask-Based Affinity Fusion in Hierarchical Data Association
ABSTRACT In this paper, we propose a highly feasible, fully online multi-object tracking and
segmentation (MOTS) method that uses instance segmentation results as an input. The proposed method is
based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association
(HDA), and a mask-based affinity fusion (MAF) model to achieve high-performance online tracking. The
HDA consists of two associations: segment-to-track and track-to-track associations. One affinity, for position
and motion, is computed by using the GMPHD filter, and the other affinity, for appearance, is computed
by using the responses from single object trackers such as the kernelized correlation filter (KCF), SiamRPN, and
DaSiamRPN. These two affinities are simply fused by using a score-level fusion method with min-max
normalization, referred to as MAF. In addition, to reduce the number of false positive segments, we adopt
mask IoU-based merging (mask merging). The proposed MOTS framework with the key modules HDA,
MAF, and mask merging is easily extensible to simultaneously track multiple types of objects with CPU-only
execution in parallel processing. In addition, the developed framework requires only simple parameter tuning,
unlike many existing MOTS methods that need intensive hyperparameter optimization. In the experiments
on the two popular MOTS datasets, the key modules show clear improvements. For instance, ID-switch
decreases by more than half compared to a baseline method on the training sets. In conclusion, our tracker
achieves state-of-the-art MOTS performance on the test sets.
INDEX TERMS Multi-object tracking, instance segmentation, tracking by segmentation, online approach,
Gaussian mixture probability hypothesis density filter, affinity fusion.
FIGURE 3. Detailed processing pipeline of MAF_HDA with input (images and instance segmentation results) and output (MOTS results). The key
components are hierarchical data association (HDA), mask merging, and mask-based affinity fusion (MAF). HDA has two association steps: S2TA and
T2TA. MAF executes each affinity fusion in each association step while mask merging runs once between S2TA and T2TA.
refining the segmentation quality by fusing two different Mask R-CNN implementations offline.

Many state-of-the-art MOTS studies [13], [14], [16], [19] employ their own techniques for improving pixelwise segmentation quality. Unlike them, we propose an easily feasible MOTS method, without intensive or additional learning techniques, that can give high applicability to the research community. Our proposed method, named MAF_HDA, exploits the tracking-by-instance-segmentation paradigm, which performs the MOTS task by using two popular filtering methods: the GMPHD filter [22] and the KCF [23]. We build a two-step hierarchical data association (HDA) strategy to handle tracklet loss and ID switches. In each association step, the position and motion affinity is calculated by the GMPHD filter, and the appearance affinity is calculated by the KCF. To appropriately consider these two affinities, we devise a mask-based affinity fusion (MAF) model. The parameters of these key modules are simply tuned by adjusting values in the 0.0 to 1.0 range. Moreover, to show our final method's efficiency, we compare four MAF_HDA settings with a conventional KCF, a modified KCF for MOT (KCF2), and two state-of-the-art Siamese network-based SOT methods: SiamRPN [24] and DaSiamRPN [25]. As a result, the four MAF_HDA methods show competitive performance on the two popular KITTI-MOTS [2] and MOTS20 [3] datasets against state-of-the-art MOTS methods [12]–[20], and KCF2 shows the best performance among the four SOT methods.

III. PROPOSED METHOD
In this section, we introduce the proposed online multi-object tracking and segmentation (MOTS) framework in terms of input/output interfaces (I/O) and key modules in detail. Following the tracking-by-segmentation paradigm, the MOTS method receives image sequences and instance segmentation results as inputs and gives MOTS results as outputs, as shown in Figures 1 and 3. Each instance has an object type, a pixelwise segment, and a confidence score but does not include time series information. Through the MOTS method, we can assign tracking IDs to the object segments and turn them into time series information, i.e., MOTS results.

The proposed MOTS framework is not only built on an HDA strategy consisting of segment-to-track association (S2TA) and track-to-track association (T2TA) but is also implemented as a fully online process using only information at the present time t and the past times 0 to t − 1. In each observation-to-state association step, affinities between states and observations are calculated considering position, motion, and appearance. The ''position and motion'' and ''appearance'' affinities are computed by using a GMPHD filter [22] and a single object tracking (SOT) method such as KCF [23], SiamRPN [24], and DaSiamRPN [25], respectively. Since these two types of affinities have different filtering domains, one affinity can be of a much higher magnitude than the other. To appropriately consider position, motion, and appearance information in HDA, we devise a MAF method. Additionally, to reduce false positive segments, we adopt a mask intersection-over-union (IoU)-based merging technique between S2TA and T2TA.

In summary, the proposed MOTS framework follows the order of (1) S2TA, (2) mask merging, and then (3) T2TA, in which the affinities of each association are computed by exploiting the GMPHD filter and KCF and are fused by using MAF. In what follows, we use MAF_HDA as the abbreviation for the proposed framework (see Figure 3).

A. GMPHD FILTERING THEORY
The main steps of GMPHD filtering-based tracking include initialization, prediction, and update. The set of states (segment tracks) and the set of observations (instance segmentations) at time t, $X_t$ and $Z_t$, are represented as follows:

$X_t = \{x_t^1, \ldots, x_t^{N_t}\}$, (1)
$Z_t = \{z_t^1, \ldots, z_t^{M_t}\}$, (2)

where a state vector $x_t$ is composed of $\{c_x, c_y, v_x, v_y\}$ with a tracking ID and a segment mask. $c_x$, $c_y$ and $v_x$, $v_y$ indicate the center coordinates of the mask's 2D box and the velocities of the object in the x and y directions, respectively. An observation vector $z_t$ is composed of the center point $\{c_x, c_y\}$ of a segment mask $s_t^k$ with a confidence score $\delta_t^k$. The Gaussian model $\mathcal{N}$ representing $x_t$ is initialized by $z_t$, predicted to $x_{t+1|t}$, and updated to $x_{t+1}$ by $z_{t+1}$.

1) INITIALIZATION
The Gaussian mixture model $g_t$ is initialized by using the initial observations from the detection responses. In addition, when an observation fails to find an association pair, i.e., fails to update a target state, the observation initializes a new Gaussian model. We call this birth (a kind of initialization). Each Gaussian $\mathcal{N}$ represents a state model with weight $w$, mean vector $x$, input observation vector $z$, and covariance matrix $P$, as follows:

$g_t(z) = \sum_{i=1}^{N_t} w_t^i \, \mathcal{N}(z; x_t^i, P_t^i)$, (3)

where $N_t$ is the number of Gaussian models. At this step, we set the initial velocities of the mean vector to zero. Each weight is set to the normalized confidence value of the corresponding detection response: the confidence score $\delta$ given by the instance segmentation module. Additionally, the method of setting the covariance matrix $P$ is shown in Section IV-B2.

2) PREDICTION
We assume that the Gaussian mixture $g_{t-1}$ of the target states at the previous frame t − 1 already exists, as shown in (4). Then, we can predict the state at time t using Kalman filtering. In (5), $x_{t|t-1}^i$ is derived by using the velocity at time t − 1, and the covariance $P$ is also predicted by the Kalman filtering method in (6):

$g_{t-1}(z) = \sum_{i=1}^{N_{t-1}} w_{t-1}^i \, \mathcal{N}(z; x_{t-1}^i, P_{t-1}^i)$, (4)
$x_{t|t-1}^i = F x_{t-1}^i$, (5)
$P_{t|t-1}^i = Q + F P_{t-1}^i F^T$, (6)

where $F$ is the state transition matrix and $Q$ is the process noise covariance matrix. Those two matrices are constant in our tracker.

3) UPDATE
The goal of the update step is to derive (7). First, we should find an optimal observation $z$ at time t to update the Gaussian model. The optimal $z$ in the observation set $Z$ makes $q_t$ the maximum value in (8):

$g_{t|t}(z) = \sum_{i=1}^{N_{t|t}} w_t^i(z) \, \mathcal{N}(z; x_{t|t}^i, P_{t|t}^i)$, (7)
$q_t^i(z) = \mathcal{N}(z; H x_{t|t-1}^i, R + H P_{t|t-1}^i H^T)$. (8)

From the perspective of the application, the update step involves data association. Finding the optimal observations and updating the state models is equivalent to finding the association pairs. $R$ is the observation noise covariance. $H$ is the observation matrix used to transform a state vector into an observation vector. Both matrices are constant in our application. After finding the optimal $z$, the Gaussian mixture is updated in the order of (9), (10), (11), and (12):

$w_t^i(z) = \dfrac{w_{t|t-1}^i \, q_t^i(z)}{\sum_{l=1}^{N_{t|t-1}} w_{t|t-1}^l \, q_t^l(z)}$, (9)
$x_{t|t}^i(z) = x_{t|t-1}^i + K_t^i (z - H x_{t|t-1}^i)$, (10)
$P_{t|t}^i = [I - K_t^i H] P_{t|t-1}^i$, (11)
$K_t^i = P_{t|t-1}^i H^T (H P_{t|t-1}^i H^T + R)^{-1}$, (12)

where the set of $w_{t|t-1}$ includes the weights of the targets at the previous frame and the weights of newly born targets. Likewise, $N_{t|t-1}$ is the sum of $N_{t-1}$ and the number of newly born targets.

B. HIERARCHICAL DATA ASSOCIATION (HDA)
To compensate for the imperfection of the framewise one-step online propagation of the GMPHD filtering process, we extend the GMPHD filter-based online MOT with a hierarchical data association (HDA) strategy that has two association steps: S2TA and T2TA (see Figure 4). Each association has different states and observations as inputs, which are used to compute the position and motion affinity ($affinity_{pm}$) and the appearance affinity ($affinity_{appr}$) (see Figure 5). Song et al. [28] proposed a GMPHD filter-based hierarchical data association strategy. They adjust the minimum number of consecutive frames for initialization in detection-to-track association and the minimum track length for track-to-track association, since they use only the GMPHD filter-based position and motion affinity for real-time speed; by using only reliable tracks, false associations between tracks that are merely close to each other can be prevented. In our work, however, a track state is initialized in a single frame as soon as S2TA succeeds, and the minimum track length for T2TA is 1 for a fully online process; false associations are instead prevented by using the appearance affinity.

To build the proposed HDA strategy, we define some online MOT processing units at the present time t. $S_t$ indicates the instance segmentation results, and the k-th segment is denoted by $s_t^k$. $T$ indicates a set of tracks. These units are defined in detail as:

$S_t = \{s_t^1, \ldots, s_t^k\}$, (13)
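Under the constant-velocity state model above, the prediction step in (5)-(6) and the per-component update in (10)-(12) reduce to ordinary Kalman filtering. The following is a minimal sketch for a single Gaussian component; the concrete values of F, H, Q, and R are illustrative assumptions (the paper only states that these matrices are constant and defers their settings to Section IV-B2):

```python
import numpy as np

# Assumed constant-velocity model matrices for state [cx, cy, vx, vy]
# and observation [cx, cy]; the noise levels are placeholders.
dt = 1.0
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)   # state transition, eq. (5)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # state -> observation
Q = np.eye(4) * 0.5                          # process noise covariance
R = np.eye(2) * 1.0                          # observation noise covariance

def predict(x, P):
    """Eqs. (5)-(6): x_{t|t-1} = F x_{t-1}, P_{t|t-1} = Q + F P_{t-1} F^T."""
    return F @ x, Q + F @ P @ F.T

def likelihood(z, x_pred, P_pred):
    """Eq. (8): q_t^i(z) = N(z; H x_{t|t-1}, R + H P_{t|t-1} H^T) in 2D."""
    S = R + H @ P_pred @ H.T
    d = z - H @ x_pred
    return np.exp(-0.5 * d @ np.linalg.solve(S, d)) / (2 * np.pi * np.sqrt(np.linalg.det(S)))

def update(z, x_pred, P_pred):
    """Eqs. (10)-(12): Kalman gain, then mean and covariance update."""
    S = R + H @ P_pred @ H.T
    K = P_pred @ H.T @ np.linalg.inv(S)       # gain, eq. (12)
    x_upd = x_pred + K @ (z - H @ x_pred)     # mean, eq. (10)
    P_upd = (np.eye(4) - K @ H) @ P_pred      # covariance, eq. (11)
    return x_upd, P_upd

# Birth at (10, 20) with zero velocity, then one predict/update cycle.
x, P = np.array([10., 20., 0., 0.]), np.eye(4)
x_pred, P_pred = predict(x, P)
z = np.array([11., 21.])                      # an associated observation
x_new, P_new = update(z, x_pred, P_pred)
```

In the full GMPHD filter, the per-component likelihoods from `likelihood` would additionally renormalize the mixture weights as in (9).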
3) MIN-MAX NORMALIZATION
In our experiments, $A_{pm}$ and $A_{appr}$ have quite different magnitudes, e.g., $A_{pm} = \{10^{-9}, \ldots, 10^{-3}\}$ and $A_{appr} = \{0.4, \ldots, 1.0\}$ (see Figure 5). To fuse the two affinities, we apply min-max normalization to them as follows:

$\bar{A}^{(i,j)} = \dfrac{A^{(i,j)} - \min_{1 \le i \le N,\, 1 \le j \le M} A^{(i,j)}}{\max_{1 \le i \le N,\, 1 \le j \le M} A^{(i,j)} - \min_{1 \le i \le N,\, 1 \le j \le M} A^{(i,j)}}$, (34)

where $A_{appr}$ and $A_{pm}$ are normalized into $\bar{A}_{appr}$ and $\bar{A}_{pm}$, respectively. Although $A_{appr}$ and $A_{pm}$ have quite different magnitudes, they are normalized into the 0.0 to 1.0 range in (34): the minimum value is first subtracted from the original affinity value $A$, and the result is divided by the difference between the maximum and minimum values. Then, we finally propose a MAF model represented by:

$A_{maf}^{(i,j)} = \bar{A}_{pm}^{(i,j)} \cdot \bar{A}_{appr}^{(i,j)}$. (35)

Figures 6 and 7 show the probabilistic distributions before MAF and after MAF. From this fused affinity, we can compute the final cost between states and observations as follows:

$Cost(x_{t|t-1}^i, z_t^j) = -\alpha \cdot \ln A_{maf}^{(i,j)}$, (36)

where $\alpha$ is a scale factor empirically set to 100. If one of the affinities is close to zero, such as $10^{-39}$, the cost is set to 10000 to prevent the final cost from becoming an infinite value. Then, the final cost ranges from 0 to 10000.

From the different states and observations (inputs) in S2TA and T2TA, two cost matrices are computed in every frame, and we utilize the Hungarian algorithm [44], which has $O(n^3)$ time complexity, to solve the cost matrices, as shown in Figure 5. Then, observations that succeed in S2TA or T2TA are assigned to the associated states for the update, and observations that fail in both S2TA and T2TA initialize new states.

4) ANALYSIS OF AFFINITY DATA
Figures 6(a)-(c) and 7(a)-(c) show that the position and motion affinity $A_{gmphd}$ and the appearance affinity $A_{kcf}$ have quite different data magnitudes and distributions. In our experiments, $A_{pm} = \{10^{-9}, \ldots, 10^{-3}\}$ and $A_{appr} = \{0.4, \ldots, 1.0\}$ are observed. Figure 6(a) shows that the cars have a more concentrated distribution, with mean $m_{kcf} \approx 0.944$ for the appearance affinity, than the pedestrians, with $m_{kcf} \approx 0.905$ in Figure 7(a). On the other hand, for the GMPHD affinity, pedestrians have a more concentrated distribution, as seen in Figures 6(b) and 7(b). These facts are interpreted as follows: cars can be well discriminated by position and motion, while pedestrians can be well discriminated by appearance. To consider these two characteristics, we propose MAF; in the distributions of the normalized affinities $\bar{A}_{gmphd}$ and $\bar{A}_{kcf}$ in Figures 6(d) and 7(d), the gaps are much closer than before, and the two affinities are fused into $A_{maf}$ by using MAF.

FIGURE 6. Normalized distributions of the affinities between cars in KITTI-MOTS training sequence 0019. KCF and GMPHD represent ''appearance affinity'' and ''position and motion affinity'', respectively. (a) and (b) show the distributions with each average m and standard deviation σ, and (c) shows that (a) and (b) are very different from each other. (d) The proposed mask-based affinity fusion (MAF) method can determine the scale difference between the KCF and GMPHD affinities and then normalize the two affinities and fuse (multiply) them. m̄ and σ̄ denote the normalized values in (34).

D. MASK MERGING
As shown in Figure 3, for mask merging, i.e., track merging, we can utilize bounding box-based IoU or segment mask-based IoU (mask IoU) measures that calculate boxwise or pixelwise overlapping ratios between two objects, respectively. The two measures are represented by:

$IoU(A, B) = \dfrac{bbox(A) \cap bbox(B)}{bbox(A) \cup bbox(B)}$, (37)

$Mask\ IoU(A, B) = \dfrac{mask(A) \cap mask(B)}{mask(A) \cup mask(B)}$. (38)

If the value of the selected measure is greater than or equal to the threshold $t_m$, the two objects are merged into one object. Mask merging is applied only between tracking objects, i.e., states, not observations, after S2TA.

E. PARALLEL PROCESSING
We assume that data association runs only between objects of the same class. For example, if the instance segmentation module provides two or more object classes, e.g., car and pedestrian classes, our proposed framework is easily extensible (see Figure 1). In this paper, we implement the MOTS module with two parallel processes because the datasets used for our experiments produce car and pedestrian segments. Then, the time complexity $O((|car| + |pedestrian|)^3)$ decreases to the slower of $O(|car|^3)$ and $O(|pedestrian|^3)$.
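The fusion-and-assignment pipeline of one association step — min-max normalization (34), multiplicative fusion (35), the log-cost (36) with its 10000 cap, and the Hungarian solver [44] — can be sketched as follows. The toy 3×3 affinity matrices are illustrative assumptions (with the GMPHD affinities orders of magnitude smaller than the KCF ones, as observed in the paper), and SciPy's `linear_sum_assignment` stands in as the Hungarian algorithm implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def min_max_norm(A):
    """Eq. (34): normalize an affinity matrix into the [0, 1] range."""
    lo, hi = A.min(), A.max()
    return (A - lo) / (hi - lo) if hi > lo else np.zeros_like(A)

def maf_cost(A_pm, A_appr, alpha=100.0, max_cost=10000.0):
    """Eqs. (35)-(36): fuse normalized affinities, then turn them into costs."""
    A_maf = min_max_norm(A_pm) * min_max_norm(A_appr)     # eq. (35)
    with np.errstate(divide="ignore"):
        cost = -alpha * np.log(A_maf)                     # eq. (36)
    return np.minimum(cost, max_cost)  # cap near-zero affinities at 10000

# Toy S2TA step: 3 track states x 3 observed segments.
A_pm = np.array([[1e-3, 1e-8, 1e-9],      # GMPHD position/motion affinity
                 [1e-7, 5e-4, 1e-8],
                 [1e-9, 1e-7, 2e-4]])
A_appr = np.array([[0.95, 0.45, 0.40],    # KCF appearance affinity
                   [0.50, 0.90, 0.48],
                   [0.42, 0.52, 0.88]])
cost = maf_cost(A_pm, A_appr)
rows, cols = linear_sum_assignment(cost)  # optimal pairing, O(n^3)
```

Here both affinity matrices agree, so the solver recovers the diagonal pairing; entries where either normalized affinity is zero hit the 10000 cap, mirroring the infinite-cost guard described above.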
$\widetilde{TP} = \sum_{h \in TP} Mask\ IoU(h, gt(h))$, (40)

$sMOTSA = \dfrac{\widetilde{TP} - |FP| - |IDS|}{|M|}$, (41)

where $M$ is a set of ground truth (GT) pixel masks, $h$ is a track hypothesis mask, and $gt(h)$ is the most overlapping mask among all GTs. In multi-object tracking and segmentation accuracy (MOTSA), a mask-based variant of the original multi-object tracking accuracy (MOTA), a case is only counted as a true positive (TP) when the mask IoU value between $h$ and $gt(h)$ is greater than or equal to 0.5, but in soft multi-object tracking and segmentation accuracy (sMOTSA), $\widetilde{TP}$, a soft version of TP, is used. Other details of the measures are displayed in Table 2.

FIGURE 8. Experimental studies for the parameters in the MOTS20 training set. The best sMOTSA and IDS scores are shown when β, t_m, and f_appr are (a) 0.4, (b) 0.3, and (c) 0.5 for pedestrians. The same values are set for the test, which is presented in Tables 4 and 6.

TABLE 4. Evaluation results on the MOTS20 training set. p1 is the baseline method and p6 is selected as the final model.
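A per-frame sketch of the soft-TP sum in (40) and the sMOTSA score in (41), using the mask IoU of (38) on boolean masks. The greedy best-overlap matching below is a simplification for illustration; the official evaluation matches hypotheses to GTs over whole sequences:

```python
import numpy as np

def mask_iou(a, b):
    """Eq. (38): pixelwise IoU between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def soft_tp_and_smotsa(hyps, gts, n_idsw):
    """Eqs. (40)-(41): a hypothesis counts as TP when its best mask IoU
    >= 0.5; soft TP accumulates the IoU values instead of counting 1."""
    soft_tp, fp = 0.0, 0
    for h in hyps:
        best = max((mask_iou(h, g) for g in gts), default=0.0)
        if best >= 0.5:
            soft_tp += best                               # eq. (40)
        else:
            fp += 1
    return soft_tp, (soft_tp - fp - n_idsw) / len(gts)    # eq. (41)

# Two 4x4 toy masks: one good hypothesis and one stray false positive.
gt = np.zeros((4, 4), bool); gt[0:2, 0:2] = True
hyp_good = np.zeros((4, 4), bool); hyp_good[0:2, 0:3] = True  # IoU = 4/6
hyp_fp = np.zeros((4, 4), bool); hyp_fp[3, 3] = True          # IoU = 0
soft_tp, score = soft_tp_and_smotsa([hyp_good, hyp_fp], [gt], n_idsw=0)
```

The false positive and any ID switches are subtracted whole from the soft-TP sum, so imperfect masks and tracking errors both pull the score down.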
B. IMPLEMENTATION DETAILS
1) DEVELOPMENT ENVIRONMENTS
All experiments are conducted on an Intel i7-7700K CPU @
4.20GHz, DDR4 32.0GB RAM, and Nvidia GTX 1080 Ti.
We implement MAF_HDA by using OpenCV image processing libraries, written in Visual C++. The official code implementation is available at the GitHub repository.¹
FIGURE 9. Experimental studies for parameters β, t_m, and f_appr in the KITTI-MOTS training set. The best sMOTSA and IDS scores are shown when β is (a) 0.4 and (b) 0.5 for cars and pedestrians, respectively. The best scores are observed when setting t_m to (c) 0.3 and (d) 0.4 with the Mask IoU measure and setting f_appr to (g) 0.85 and (h) 0.85. The same values are set for the test, which is presented in Tables 5 and 7.
TABLE 5. Evaluation results on the KITTI-MOTS training set. p1 is the baseline method without any proposed modules. KCF2 indicates a simplified version for MOT which uses a fixed-size window instead of the multi-scale windows used in the referenced version of KCF. We select p6 as the final model.
1) KEY MODULES
As discussed in Section III, our method includes three key modules: HDA, mask merging, and MAF. HDA consists of S2TA and T2TA, in that order. Then, we can rearrange these modules as ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'' considering the serial processes, as described in Table 5. Additionally, either IoU (37) or Mask IoU (38) can be selected for ''Mask Merging''.

2) EFFECTIVENESS OF THE KEY MODULES
As seen in Tables 5 and 4, when the key modules ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'' are added to the baseline method p1 one by one, our method shows incremental and remarkable improvements. Comparing p1 and p2, p1 exploits one-step GMPHD filtering, computing only the position and motion affinity, whereas p2 considers the position-motion affinity from the GMPHD filter and the appearance affinity from the KCF in ''MAF in S2TA''. The remarkable improvements in IDS and FM indicate that the proposed affinity fusion method works effectively. Comparing p2 and p3 in both tables, because the results are advanced only in KITTI-MOTS, ''Mask Merging'' may or may not merge two or more segments of one object into one segment. However, from the results of p3 and p4, we can see that at least Mask IoU works better than IoU in the merging. In p5 and p6, ''MAF in T2TA'' is applied to our method, where KCF extracts the appearance affinities by using multi-scale windows, while KCF2 uses a fixed-size window that relies on the object sizes from the instance segmentation responses. In addition, we apply two more state-of-the-art SOT methods, SiamRPN [24] and DaSiamRPN [25], in the proposed appearance affinity model. In MOTS20 (Table 4), p6, p7, and p8 show comparable performance, but in KITTI-MOTS (Table 5), p7 and p8 perform worse than p6, and even worse than p5 and p6 for cars. Figure 10 shows the reason: SiamRPN shows a biased correlation, and DaSiamRPN shows too wide a correlation in the appearance affinity space for cars. We think this is because cars are hard to discriminate, especially given the relatively low resolution of KITTI-MOTS images and the tiny sizes of objects. On the other hand, KCF's moderate correlation is appropriate to fuse with the GMPHD filter, as seen from the better performance. Moreover, since [24], [25] exploit Siamese networks, which require GPU processing, those methods cannot extract appearance affinities for dozens of objects in the data association steps in parallel with one single GPU. Thus, even though they reported 100 FPS in SOT, p6 and p7 run at 1.0-2.0 FPS. Comparing the settings without T2TA, from
TABLE 6. Evaluation results on the MOTS20 test set. Proposed methods are denoted by MAF_HDA. The red and blue results indicate the first and second
best scores among online processing (proc.) approaches. The bold results indicate the best scores among offline proc. approaches. ‘‘not available (n/a)’’
FPS indicates the case that only total FPS is provided. ‘-’ denotes unpublished results in their paper and the MOTS20 leaderboard
at https://motchallenge.net/results/MOTS/.
TABLE 7. Evaluation results on the KITTI-MOTS test set. Proposed methods are denoted by MAF_HDA. The red and blue results indicate the first and
second best scores among online processing (proc.) approaches. The bold results indicate the best scores among offline proc. approaches. ‘‘not available
(n/a)’’ FPS indicates the cases that only total FPS is provided. All entries are available at http://www.cvlibs.net/datasets/kitti/old_eval_mots.php.
p2 to p4, and with T2TA, p5, p6, p7, and p8, the results show that HDA with MAF reduces IDS very effectively in both datasets, and KCF2, the simplified version of KCF, shows a faster FPS and better performance in terms of sMOTSA, MOTSA, and IDS than the conventional KCF and the state-of-the-art SOT methods [24], [25].

Numerically, when adding the key modules ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'' one by one, as shown in Tables 5 and 4 and Figure 12, our MOTS method shows incremental improvements from p1 to p6. The baseline method p1 is numerically improved as follows: for the KITTI-MOTS Cars training set, sMOTSA changes from 73.7 to 78.5 and IDS changes from 1,322 to 212; for the KITTI-MOTS Pedestrians training set, sMOTSA changes from 56.4 to 62.3 and IDS changes from 800 to 140; and for the MOTS20 training set, sMOTSA changes from 64.5 to 65.8 and IDS changes from 686 to 234. Thus, p6 (MAF_HDAKCF2) is selected as our final model.

FIGURE 10. Correlation maps between appearance affinities: KCF, SiamRPN, and DaSiamRPN, and position-motion affinity: GMPHD, in KITTI-MOTS.
FIGURE 11. Comparisons of speed (FPS) vs. MOTS accuracy (sMOTSA and IDS) against state-of-the-art methods in KITTI-MOTS and MOTS test sets.
FIGURE 12. Visualization of the segmentation and MOTS results on KITTI-MOTS test sequence 0018. (b), (c), and (d) are the results of the three different settings of the proposed method, which are based on the same segmentation results (a) from MaskRCNN [11]. Comparing (b) the baseline model p1 and (c) the model p4 with S2TA and mask merging (without T2TA), in (b), the IDs ''0, 2, 4, 6, 8'' of the five pedestrians at the right side of the scene are switched except for the person w/ ID 2, but in (c), only the pedestrian w/ ID 0 is switched to ID 24. In (d), the final model p6, the five IDs are preserved since T2TA can find the IDs after the occlusion with trees at the right side. In addition, the car w/ ID 7 at frame 0 is recovered at frame 15, while (b) and (c) do not recover the car w/ ID 7, which is switched to ID 17.
only tracking speed but also detection and segmentation speed in Tables 6 and 7. The speed of the public segmentation, Mask R-CNN X152, and the speed of the private segmentation, PointTrack, are measured in our environment presented in Subsection IV-B. Other speeds are referenced from their papers. PointTrack [13], SORTS+RReID [15], and EagerMOT [17] report fast tracking speeds of 22.2, 36.4, and 90.9 FPS, respectively, but when detection and segmentation are included, the speeds drop to 4.3, 2.3, and 1.96 FPS. Likewise, the multi-detector fusion-based methods EagerMOT and MOTSFusion [16] show similar speeds of 1.96 and 2.3 FPS. Among those state-of-the-art methods, the one-stage methods CPPNet [19] and TraDeS [20] show faster speeds of 7.0 FPS and 11.5 FPS, respectively, but these are still not enough to achieve the real-time speeds, which are 30 FPS for MOTS20 and 10 FPS for KITTI-MOTS. Among the proposed methods, MAF_HDAKCF2 achieves 4.6 FPS in MOTS20 and 10.9 FPS in KITTI-MOTS with tracking only, and 1.6 FPS in MOTS20 and 2.8 FPS in KITTI-MOTS with segmentation and tracking. Compared to the others, MAF_HDAKCF2 shows moderate speeds (see Figure 11(a)-(b)). Therefore, the efficiency of MOTS methods in terms of speed versus accuracy is still a challenging issue.

2) ACCURACY COMPARISON W/ SOTA METHODS
Some state-of-the-art methods [13], [16], [17], [19] have tackled raising the detection and segmentation quality: EagerMOT and MOTSFusion utilize the fusion of multiple detectors from multiple domains, and PointTrack and CPPNet focus on learning a segmentation model from scratch on the MOTS20 and KITTI-MOTS training sets. The former approach shows promising performance since multi-source detectors can complement each other; however, it inevitably needs heavy computing resources. The latter can achieve fine performance, as seen in Table 7, if fine data, such as the over 8,000 images with uniform resolutions of the KITTI-MOTS training set, are given, as seen in Table 1. However, on the 2,862 images of the MOTS20 training set with various resolutions, such as 1920 × 1080 and 640 × 480, their MOTS accuracy drops sharply in contrast to Mask R-CNN-based methods such as [14] and MAF_HDA (see Figure 11(a) and (c)).

To summarize numerically, we refer to Tables 6 and 7. First, comparing the variants of MAF_HDA with KCF, KCF2, SiamRPN, and DaSiamRPN, MAF_HDAKCF2 shows the best performance in terms of speed and accuracy. MAF_HDASiam and MAF_HDADaSiam
show drastic speed drop compared to MAF_HDAKCF and [6] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,
MAF_HDAKCF2 Those results follow the evaluation results ‘‘PointPillars: Fast encoders for object detection from point clouds,’’ in
Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
in the training sets (see Tables 5 and 4). Against state-of-the- pp. 12697–12705.
art MOTS methods [12]–[20], our proposed method named [7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, ‘‘You only look once:
MAF_HDAKCF2 ranks 2nd sMOTSA score (1st among Unified, real-time object detection,’’ in Proc. IEEE Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
the online approaches), 69.9, in the MOTS20 test set. [8] S. Ren, K. He, R. Girshick, and J. Sun, ‘‘Faster R-CNN: Towards real-time
In addition, MAF_HDAKCF2 ranks 3rd sMOTSA score, 65.0, object detection with region proposal networks,’’ in Proc. 28th Int. Conf.
for pedestrians and 3rd sMOTSA score, 77.2, for cars in the Neural Inf. Process. Syst. (NIPS), Dec. 2015, pp. 91–99.
KITTI-MOTS test set. [9] S. Shi, X. Wang, and H. Li, ‘‘PointRCNN: 3D Object proposal generation
and detection from point cloud,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), Jun. 2019, pp. 770–779.
V. CONCLUSION [10] W. Shi and R. Rajkumar, ‘‘Point-GNN: Graph neural network for 3D object
detection in a point cloud,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
In this paper, we propose a highly feasible MOTS method Recognit. (CVPR), Jun. 2020, pp. 1711–1719.
named MAF_HDA, which is an easily reproducible reassem- [11] K. He, G. Gkioxari, P. Dollar, and R. Girshick, ‘‘Mask R-CNN,’’ in Proc.
bly of four key modules: a GMPHD filter, HDA, mask IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
merging, and MAF. These key modules can operate in the [12] P. Voigtlaender, M. Krause, A. Os̆ep, J. Luiten, B. B. G. Sekar, A. Geiger,
and B. Leibe, ‘‘MOTS: Multi-object tracking and segmentation,’’ in
proposed fully online MOTS framework which tracks cars Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
and pedestrians in parallel CPU-only processes. In addition, pp. 7942–7951.
the key parameters can be simply tuned through experimental [13] Z. Xu, W. Zhang, Z. Tan, W. Yang, H. Huang, S. Wen, E. Ding, and
L. Huang, ‘‘Segment as points for efficient online multi-object tracking
studies adjusting the values in 0.0 to 1.0 ranges, and and segmentation,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020,
these modules show remarkable improvements in evaluation pp. 264–281.
on the training sets of MOTS20 and KITTI-MOTS in [14] F. Yang, X. Chang, C. Dang, Z. Zheng, S. Sakti, S. Nakamura, and
T. Wu, ‘‘ReMOTS: Self-supervised refining multi-object tracking and
terms of MOTS measures such as sMOTSA and IDS. segmentation,’’ 2020, arXiv:2007.03200.
In the test sets of the two popular datasets, MAF_HDA [15] M. Ahrnbom, M. Nilsson, and H. Ardö, ‘‘Real-time and online segmenta-
achieves very competitive performance against the state- tion multi-target tracking with track revival re-identification,’’ in Proc. Int.
Conf. Comput. Vis. Theory Appl. (VISAPP), Feb. 2021, pp. 777–784.
of-the-art MOTS methods. In future work, we expect
[16] J. Luiten, T. Fischer, and B. Leibe, ‘‘Track to reconstruct and reconstruct to
that the proposed MOTS method will be reproduced and track,’’ IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1803–1810, Apr. 2020.
extended in research community with a more precise [17] A. Kim, A. Osep, and L. Leal-Taixe, ‘‘EagerMOT: 3D multi-object
and simpler position and motion filtering model and tracking via sensor fusion,’’ in Proc. IEEE Int. Conf. Robot. Autom. (ICRA),
May 2021, pp. 11315–11321.
more rapid and sophisticated appearance feature extrac- [18] Z. Wang, H. Zhao, Y.-L. Li, S. Wang, P. H. Torr, and L. Bertinetto, ‘‘Do
tors such as deep neural network-based re-identification different tracking tasks require different appearance models?’’ in Proc.
techniques. Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2021, pp. 15323–15332.
[19] Z. Xu, A. Meng, Z. Shi, W. Yang, Z. Chen, and L. Huang, ‘‘Continuous
copy-paste for one-stage multi-object tracking and segmentation,’’ in Proc.
ACKNOWLEDGMENT IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 15323–15332.
This work was performed based on the cooperation with [20] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, ‘‘Track to detect
and segment: An online multi-object tracker,’’ in Proc. IEEE/CVF Conf.
GIST-LIG Nex1 Cooperation and was supported by Institute Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12352–12361.
of Information and Communications Technology Planning [21] A. Hu, A. Kendall, and R. Cipolla, ‘‘Learning a spatio-temporal embedding
and Evaluation (IITP) grant funded by the Korea govern- for video instance segmentation,’’ 2019, arXiv:1912.08969.
ment (MSIT) (No.2014-3-00077-008, Development of global [22] B. N. Vo and W. K. Ma, ‘‘The Gaussian mixture probability hypothesis den-
sity filter,’’ IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4091–4104,
multi-target tracking and event prediction techniques based Nov. 2006.
on real-time large-scale video analysis). [23] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, ‘‘High-speed tracking
YOUNG-CHUL YOON received the B.S. degree in electronics and communications engineering from Kwangwoon University, Seoul, South Korea, and the M.S. degree in electrical engineering and computer science from the Gwangju Institute of Science and Technology, Gwangju, South Korea, in 2019. He is currently a Software Engineer at the Robotics Laboratory, Hyundai Motor Company. His research interests include multi-object tracking and video analysis.
KWANGJIN YOON received the Ph.D. degree in electrical engineering and computer science from the Gwangju Institute of Science and Technology, in 2019. He is currently a Research Scientist with SI Analytics Company Ltd. His research interests include multi-object tracking, computer vision, and deep learning.

HYUNSUNG JANG is currently pursuing the Ph.D. degree in electrical and electronic engineering with Yonsei University, Seoul, South Korea. He is also a Chief Research Engineer with the Department of EO/IR Systems Research and Development, LIG Nex1. He was officially certified as a Software Architect at LG Electronics in 2020, and completed the Software Architect Course at Carnegie Mellon University. His current research interests include computer vision, deep learning, and quantum computing.
NAMKOO HA received the Ph.D. degree in computer engineering from Kyungpook National University, South Korea, in 2008. He is currently a Chief Research Engineer with the Department of EO/IR Systems Research and Development, LIG Nex1. His current research interests include developing intelligent EO/IR systems through the use of deep learning.
YOUNG-MIN SONG (Graduate Student Member, IEEE) received the B.S. degree in computer science and engineering from Chungnam National University, Daejeon, South Korea, in 2013, and the M.S. degree in information and communications from the Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2015, where he is currently pursuing the Ph.D. degree in electrical engineering and computer science. His research interests include multi-object tracking and data fusion.

MOONGU JEON (Senior Member, IEEE) received the B.S. degree in architectural engineering from Korea University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in computer science and scientific computation from the University of Minnesota, Minneapolis, MN, USA, in 1999 and 2001, respectively.

As a Postgraduate Researcher, he worked on optimal control problems at the University of California at Santa Barbara, Santa Barbara, CA, USA, from 2001 to 2003, and then moved to the National Research Council of Canada, where he worked on the sparse representation of high-dimensional data and image processing until 2005. In 2005, he joined the Gwangju Institute of Science and Technology, Gwangju, South Korea, where he is currently a Full Professor with the School of Electrical Engineering and Computer Science. His current research interests include machine learning, computer vision, and artificial intelligence.