
Received 8 March 2022, accepted 22 April 2022, date of publication 2 May 2022, date of current version 13 June 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3171565

Multi-Object Tracking and Segmentation With Embedding Mask-Based Affinity Fusion in Hierarchical Data Association

YOUNG-MIN SONG 1, (Graduate Student Member, IEEE), YOUNG-CHUL YOON 2, KWANGJIN YOON 3, HYUNSUNG JANG 4, NAMKOO HA 4, AND MOONGU JEON 1, (Senior Member, IEEE)
1 School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, South Korea
2 Hyundai Motor Company, Uiwang-si 16082, South Korea
3 SI Analytics Company Ltd., Daejeon 34051, South Korea
4 LIG Nex1 Company Ltd., Yongin-si 16911, South Korea

Corresponding author: Moongu Jeon ([email protected])

ABSTRACT In this paper, we propose a highly feasible fully online multi-object tracking and
segmentation (MOTS) method that uses instance segmentation results as an input. The proposed method is
based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association
(HDA), and a mask-based affinity fusion (MAF) model to achieve high-performance online tracking. The
HDA consists of two association steps: segment-to-track and track-to-track association. One affinity, for position and motion, is computed by using the GMPHD filter, and the other affinity, for appearance, is computed by using the responses from single object trackers such as the kernelized correlation filter (KCF), SiamRPN, and DaSiamRPN. These two affinities are fused by a simple score-level fusion method based on min-max normalization, referred to as MAF. In addition, to reduce the number of false positive segments, we adopt
mask IoU-based merging (mask merging). The proposed MOTS framework with the key modules: HDA,
MAF, and mask merging, is easily extensible to simultaneously track multiple types of objects with CPU-only
execution in parallel processing. In addition, the developed framework only requires simple parameter tuning
unlike many existing MOTS methods that need intensive hyperparameter optimization. In the experiments
on the two popular MOTS datasets, the key modules show some improvements. For instance, ID-switch
decreases by more than half compared to a baseline method in the training sets. In conclusion, our tracker
achieves state-of-the-art MOTS performance in the test sets.

INDEX TERMS Multi-object tracking, instance segmentation, tracking by segmentation, online approach, Gaussian mixture probability hypothesis density filter, affinity fusion.

The associate editor coordinating the review of this manuscript and approving it for publication was Davide Patti.

I. INTRODUCTION
Multi-object tracking and segmentation (MOTS) has recently become one of the most challenging research fields; it has been proposed for pixelwise intelligent systems that go beyond 2D bounding boxes. This new vision task extends the conventional multi-object tracking (MOT) task with segmentation. Since the MOT benchmark datasets [1]–[3] were released, the tracking-by-detection paradigm has been the mainstream, and breakthroughs [4], [5] in object detection have been achieved by many deep neural network (DNN)-based detectors [6]–[10] from various sensor domains, such as color cameras (2D images) and LiDAR (3D point clouds). For instance, the detection responses of [7], [8] are 2D bounding boxes and those of [6], [9], [10] are 3D boxes. In addition, He et al. [11] introduced a pixelwise classification and detection method represented by instance segmentation, which has motivated many segmentation-based studies including MOTS.

The MOTS task was first introduced by Voigtlaender et al. [12] with a new baseline method, new evaluation measures, and a new MOTS dataset extended from the MOTChallenge [3] and KITTI [2] image sequences.

FIGURE 1. Flow chart of parallel multi-object tracking and segmentation (MOTS) processing, which receives two classes of objects, i.e., cars and pedestrians. (a) Modules (input, detection, tracking, and output) are implemented as (b) processes (image sequencing, instance segmentation, MOTS, and MOTS results). Our proposed MOTS framework is denoted by MAF_HDA.

To solve this task, most state-of-the-art methods [13]–[18] have exploited multi-stage approaches that separate the detection (instance segmentation) and tracking modules, while one-stage methods [12], [19], [20] have rarely been proposed. Luiten et al. [16] and Kim et al. [17] proposed MOTS methods that use a fusion of 2D box detection, 3D box detection, and instance segmentation results. Xu et al. [13] exploited spatial embedding [21] to raise instance segmentation quality; it performs instance segmentation without bounding box (bbox) proposals and thus runs faster than bbox proposal-based instance segmentation such as Mask RCNN [11]. They also devised a point-wise representative feature extraction from input segments which can consider foreground and background information. Yang et al. [14] focused on refining the segmentation quality by fusing two different Mask RCNN implementations offline. These state-of-the-art methods [13], [14], [16], [17], [19], [20] have improved MOTS performance by raising the detection quality, i.e., refinement or fusion of multi-type detections, and they necessarily involve exhaustive additional learning steps. Different from those state-of-the-art MOTS works, in this paper we propose an easily feasible online MOTS method without such exhaustive learning, and our method still achieves competitive MOTS performance.

Our contributions are summarized as follows:
1) We propose an easily feasible online MOTS method consisting of (a) two-step hierarchical data association (HDA), (b) mask-based affinity fusion (MAF), and (c) mask merging. These key modules can run with CPU-only execution.
2) Particularly among the key modules, (b) MAF effectively fuses the ''position and motion'' affinity from a Gaussian mixture probability hypothesis density (GMPHD) filter [22] and the ''appearance'' affinity from a single object tracker such as KCF [23], SiamRPN [24], or DaSiamRPN [25], to improve the MOTS performance compared to a baseline method using only one-step GMPHD filter association.
3) Additionally, the tracking part of the proposed method can run as a CPU-only process when KCF is used, so that it can run in parallel to simultaneously track multiple types of objects, i.e., cars and pedestrians in this paper (see Figure 1).
4) Finally, we evaluate the proposed method on state-of-the-art datasets [2], [3], [12]. The results on the training sets show that the developed key modules efficiently improve MOTS performance compared to the baseline method. In the results on the test sets, our method not only shows competitive performance against state-of-the-art published methods but also achieves state-of-the-art level performance against state-of-the-art unpublished methods that are available on the leaderboards of the MOTS20 and KITTI-MOTS websites.

The proposed method has high applicability due to a feasible combination of existing models [22]–[25] and simple parameter tuning, unlike many state-of-the-art DNN-based tracking methods [12]–[20]. In addition, in the experiments, our method shows state-of-the-art level performance. We present the works related to the proposed method in Section II, and the details of our method are covered in Section III. Additionally, we discuss the experimental results in Section IV and conclude the paper in Section V. In what follows, we use MAF_HDA as the abbreviation for our proposed method.

II. RELATED WORKS
A. TRACKING MODELS WITH A PHD FILTER
The PHD filter [22], [26], [27] was originally designed to deal with radar and sonar data-based MOT systems. Mahler [26] proposed recursive Bayes filter equations for the PHD filter that optimize multi-target tracking processes where states and observations are defined with a random finite set. Following this theory, Vo et al. [27] implemented the PHD filter by using a sequential Monte Carlo method with particle filtering and clustering, named the SMCPHD filter, and proposed the governing equations by using a Gaussian mixture model with a closed-form recursion method, named the Gaussian mixture probability hypothesis density (GMPHD) filter. Since the GMPHD filter is tractable for implementing online and real-time trackers, it has recently been extended and exploited as a popular tracking model in video-based systems. While radar and sonar sensors receive a massive number of false positives but rarely miss any observations, visual object detectors yield many fewer false positives and more missed detections. Thus, in video-based tracking, the noise control processes designed for the original domains are simplified and many additional techniques for visual objects have been developed.

1) POSITION AND MOTION MODELS
Song et al. [28] combined the GMPHD filter and data association processes with a two-step hierarchy to recover lost tracks' IDs.


They designed an affinity model considering position and linear motion between the tracks in the second-step association. This approach is an intuitive implementation of the GMPHD filter to reconnect lost tracks. In addition, they presented an energy minimization model based on occluded object groups to correct the false associations that had already occurred in the first-step association between detections and tracks. Sanchez-Matilla et al. [29] proposed detection confidence-based data association schemes with a PHD filter. Strong (high-confidence) detections initiate and propagate tracks, but weak (low-confidence) detections propagate only existing tracks. This scheme works well when the detection results are reliable. However, the tracking performance depends on the detection performance and is especially weak for long-term missed detections. Sanchez-Matilla et al. [30] utilized long short-term memory (LSTM) models to design a global motion model for MOT.

2) APPEARANCE MODELS
More intensive tracking solutions [31], [32] have been proposed with appearance models. Kutschbach et al. [31] combined a naive GMPHD filtering process and kernelized correlation filters (KCF) [23] that can update appearance online and discriminate occluded objects. They proposed robust online appearance learning to re-find the IDs of lost tracks. However, updating the appearance of every object in every frame requires heavy computing resources and inevitably increases the runtime. In Fu et al. [32], the GMPHD filter is equipped with an online group-structured dictionary for appearance learning and an adaptive gating technique, which is an advanced tracking process suitable for video-based MOT.

These online MOT methods based on the PHD filter have successfully improved tracking performance by using motion and appearance learning models. We exploit the GMPHD filter in the hierarchical data association of [28] to consider temporal information. In addition, to efficiently apply single object tracking (SOT) as an appearance affinity in MOT, unlike [31], which simply used KCF [23] to propagate tracks with detection losses, we devise a simple and efficient affinity fusion model which fuses the GMPHD filter-based position and motion affinity and the SOT-based appearance affinity. As shown in Figure 2, three representative SOT methods, KCF, SiamRPN [24], and DaSiamRPN [25], can discriminate a falsely classified pedestrian (true class: car) from a true pedestrian thanks to their class-independent feature extraction abilities.

FIGURE 2. Demonstration of the class-independent discrimination ability of a single object tracker (KCF [23]) in the KITTI-MOTS test 0028 sequence.

B. STATE-OF-THE-ART MOTS METHODS
Conventional MOT methods [28]–[36] have exploited the tracking-by-detection paradigm, where the detectors [7], [8], [37]–[39] generate 2D bounding box (bbox) results and the trackers assign tracking identities (IDs) to the bounding boxes via data association. Refs. [28]–[32] successfully extended the GMPHD filter into visual object tracking, as discussed in the previous Subsection II-A. Li et al. [34] proposed a dissimilarity cost computed as a weighted sum of appearance, structure, motion, and size based on 2D bboxes. Muresan and Nedevschi [33] used a sophisticated MOT method that aggregates features of the 2D image and 3D point cloud spaces by projecting 2D pixels into a 3D point cloud map. Unlike MOT, MOTS uses pixelwise instance segmentation results as the tracking input beyond 2D bbox results. Voigtlaender et al. [12] first introduced the MOTS task. They extended the popular MOT datasets such as MOTChallenge [3] and KITTI [2] with instance segmentation results by using a fine-tuned MaskRCNN [11] on the same image sequences, and proposed a new soft multi-object tracking and segmentation accuracy (sMOTSA) measure that can be used to evaluate MOTS methods. In addition, they presented a new MOTS baseline method named TrackRCNN, which was extended from MaskRCNN with 3D convolutions to deal with temporal information. Inspired by the new task, state-of-the-art MOTS methods [13], [14], [16], [19] have been proposed very recently. MOTSFusion [16] proposes a fusion-based MOTS method exploiting bounding box detection [40] and instance segmentation [41]. It estimates a segmentation mask for each bounding box and builds up short tracklets using 2D optical flow, then fuses these 2D short tracklets into dynamic 3D object reconstructions hierarchically. The precisely reconstructed 3D object motion is used to recover objects missed due to occlusion in 2D coordinates. PointTrack [13] devises a new feature extractor based on PointNet [42] to appropriately consider both foreground and background features. This is motivated by the fact that the inherent receptive field of convolution-based feature extraction inevitably mixes up the foreground and background features. PointNet is used to randomly sample feature points considering the offsets between foreground and background regions, the colors of those regions, and the categories of segments. Then the context-aware embedding vectors for association are built after concatenation with the separately computed position difference vectors. In addition, PointTrack exploited spatial embedding [21] to raise instance segmentation quality, performing instance segmentation without bounding boxes. Similarly, CPPNet [19] trained its segmentation model by using [21] and proposed a copy-paste data augmentation technique.


ReMOTS [14] focused on refining the segmentation quality by fusing two different Mask RCNN implementations offline.

Many state-of-the-art MOTS studies [13], [14], [16], [19] rely on their own techniques for increasing pixelwise segmentation quality. Different from them, we propose an easily feasible MOTS method without intensive or additional learning techniques, which can give high applicability to the research community. Our proposed method, named MAF_HDA, exploits the tracking-by-instance-segmentation paradigm, performing the MOTS task by using two popular filtering methods: the GMPHD filter [22] and the KCF [23]. We build a two-step hierarchical data association (HDA) strategy to handle tracklet loss and ID switches. In each association step, the position and motion affinity is calculated by the GMPHD filter, and the appearance affinity is calculated by the KCF. To appropriately consider these two affinities, we devise a mask-based affinity fusion (MAF) model. The parameters of these key modules are simply tuned by adjusting values in the 0.0 to 1.0 range. Moreover, to show the efficiency of our final method, we compare four MAF_HDA settings with a conventional KCF, a modified KCF for MOT (KCF2), and two state-of-the-art Siamese network-based SOT methods: SiamRPN [24] and DaSiamRPN [25]. As a result, the four MAF_HDA variants show competitive performance on the two popular KITTI-MOTS [2] and MOTS20 [3] datasets against state-of-the-art MOTS methods [12]–[20], and KCF2 shows the best performance among the four SOTs.

FIGURE 3. Detailed processing pipeline of MAF_HDA with input (images and instance segmentation results) and output (MOTS results). The key components are hierarchical data association (HDA), mask merging, and mask-based affinity fusion (MAF). HDA has two association steps: S2TA and T2TA. MAF executes an affinity fusion in each association step, while mask merging runs once between S2TA and T2TA.

III. PROPOSED METHOD
In this section, we introduce the proposed online multi-object tracking and segmentation (MOTS) framework in terms of input/output interfaces (I/O) and key modules in detail. Following the tracking-by-segmentation paradigm, the MOTS method receives image sequences and instance segmentation results as inputs and gives MOTS results as outputs, as shown in Figures 1 and 3. Each instance has an object type, a pixelwise segment, and a confidence score but does not include time series information. Through the MOTS method, we can assign tracking IDs to the object segments and turn them into time series information, i.e., MOTS results.

The proposed MOTS framework is not only built on an HDA strategy consisting of segment-to-track association (S2TA) and track-to-track association (T2TA) but is also implemented as a fully online process using only information at the present time t and the past times 0 to t − 1. In each observation-to-state association step, affinities between states and observations are calculated considering position, motion, and appearance. The ''position and motion'' and ''appearance'' affinities are computed by using a GMPHD filter [22] and a single object tracking (SOT) method such as KCF [23], SiamRPN [24], or DaSiamRPN [25], respectively. Since these two types of affinities come from different filtering domains, one affinity can be of a much higher magnitude than the other. To appropriately consider position, motion, and appearance information in HDA, we devise a MAF method. Additionally, to reduce false positive segments, we adopt a mask intersection-over-union (IoU)-based merging technique between S2TA and T2TA.

In summary, the proposed MOTS framework follows the order of (1) S2TA, (2) mask merging, and then (3) T2TA, in which the affinities of each association are computed by exploiting the GMPHD filter and KCF, and are fused by using MAF. In what follows, we use MAF_HDA as the abbreviation for the proposed framework (see Figure 3).


A. GMPHD FILTERING THEORY
The main steps of GMPHD filtering-based tracking are initialization, prediction, and update. The set of states (segment tracks) and the set of observations (instance segmentations) at time t are X_t and Z_t, represented as follows:

X_t = \{x_t^1, \ldots, x_t^{N_t}\},   (1)
Z_t = \{z_t^1, \ldots, z_t^{M_t}\},   (2)

where a state vector x_t is composed of \{c_x, c_y, v_x, v_y\} together with a tracking ID and a segment mask. c_x, c_y and v_x, v_y indicate the center coordinates of the mask's 2D box and the velocities of the object in the x and y directions, respectively. An observation vector z_t is composed of the center point \{c_x, c_y\} of a segment mask s_t^k with a confidence score \delta_t^k. The Gaussian model \mathcal{N} representing x_t is initialized by z_t, predicted to x_{t+1|t}, and updated to x_{t+1} by z_{t+1}.

1) INITIALIZATION
The Gaussian mixture model g_t is initialized by using the initial observations from the detection responses. In addition, when an observation fails to find an association pair, i.e., to update a target state, the observation initializes a new Gaussian model. We call this birth (a kind of initialization). Each Gaussian \mathcal{N} represents a state model with weight w, mean vector x, input observation vector z, and covariance matrix P, as follows:

g_t(z) = \sum_{i=1}^{N_t} w_t^i \, \mathcal{N}(z; x_t^i, P_t^i),   (3)

where N_t is the number of Gaussian models. At this step, we set the initial velocities of the mean vector to zero. Each weight is set to the normalized confidence value of the corresponding detection response, i.e., the confidence score \delta given by the instance segmentation module. Additionally, the method of setting the covariance matrix P is shown in Section IV-B2.

2) PREDICTION
We assume that the Gaussian mixture g_{t-1} of the target states at the previous frame t − 1 already exists, as shown in (4). Then, we can predict the state at time t using Kalman filtering. In (5), x_{t|t-1}^i is derived by using the velocity at time t − 1, and the covariance P is also predicted by the Kalman filtering method in (6):

g_{t-1}(z) = \sum_{i=1}^{N_{t-1}} w_{t-1}^i \, \mathcal{N}(z; x_{t-1}^i, P_{t-1}^i),   (4)
x_{t|t-1}^i = F x_{t-1}^i,   (5)
P_{t|t-1}^i = Q + F P_{t-1}^i F^T,   (6)

where F is the state transition matrix and Q is the process noise covariance matrix. These two matrices are constant in our tracker.

3) UPDATE
The goal of the update step is to derive (7). First, we should find an optimal observation z at time t to update the Gaussian model. The optimal z in the observation set Z maximizes q_t in (8):

g_{t|t}(z) = \sum_{i=1}^{N_{t|t}} w_t^i(z) \, \mathcal{N}(z; x_{t|t}^i, P_{t|t}^i),   (7)
q_t^i(z) = \mathcal{N}(z; H x_{t|t-1}^i, R + H P_{t|t-1}^i H^T).   (8)

From the application perspective, the update step involves data association. Finding the optimal observations and updating the state models is equivalent to finding the association pairs. R is the observation noise covariance. H is the observation matrix used to transform a state vector into an observation vector. Both matrices are constant in our application. After finding the optimal z, the Gaussian mixture is updated in the order of (9), (10), (11), and (12):

w_t^i(z) = \frac{w_{t|t-1}^i \, q_t^i(z)}{\sum_{l=1}^{N_{t|t-1}} w_{t|t-1}^l \, q_t^l(z)},   (9)
x_{t|t}^i(z) = x_{t|t-1}^i + K_t^i (z - H x_{t|t-1}^i),   (10)
P_{t|t}^i = [I - K_t^i H] P_{t|t-1}^i,   (11)
K_t^i = P_{t|t-1}^i H^T (H P_{t|t-1}^i H^T + R)^{-1},   (12)

where the set of weights w_{t|t-1} includes the weights of the targets from the previous frame and the weights of newly born targets. Likewise, N_{t|t-1} is the sum of N_{t-1} and the number of newly born targets.
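To make the prediction and update recursion above concrete, the following is a minimal Python sketch of one GMPHD iteration over a list of Gaussian components. The matrix values here are illustrative placeholders (the settings used in the paper are given in Section IV-B2), and the pruning/merging of components that practical GMPHD implementations apply is omitted.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Constant-velocity model; values are illustrative, see Section IV-B2 for the paper's settings.
F = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
Q = np.diag([25.0, 100.0, 25.0, 100.0])   # process noise
R = np.diag([25.0, 100.0])                # observation noise

def predict(components):
    """Eqs. (5)-(6): propagate each Gaussian component (w, x, P) one frame ahead."""
    return [(w, F @ x, Q + F @ P @ F.T) for w, x, P in components]

def update(components, z):
    """Eqs. (8)-(12): update all components with one observation z = (cx, cy)."""
    updated = []
    for w, x, P in components:
        S = H @ P @ H.T + R                                  # innovation covariance
        q = multivariate_normal.pdf(z, mean=H @ x, cov=S)    # likelihood q_t^i(z), eq. (8)
        K = P @ H.T @ np.linalg.inv(S)                       # Kalman gain, eq. (12)
        updated.append([w * q,                               # unnormalized weight, eq. (9)
                        x + K @ (z - H @ x),                 # mean update, eq. (10)
                        (np.eye(4) - K @ H) @ P])            # covariance update, eq. (11)
    total = sum(c[0] for c in updated) + 1e-12
    for c in updated:
        c[0] /= total                                        # weight normalization, eq. (9)
    return updated

# one frame of recursion: components = update(predict(components), z)
```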
B. HIERARCHICAL DATA ASSOCIATION (HDA)
To compensate for the imperfection of the framewise one-step online propagation of the GMPHD filtering process, we extend GMPHD filter-based online MOT with a hierarchical data association (HDA) strategy that has two association steps: S2TA and T2TA (see Figure 4). Each association has different states and observations as inputs, which are used to compute the position and motion affinity (affinity_pm) and the appearance affinity (affinity_appr) (see Figure 5). Song et al. [28] proposed a GMPHD filter-based hierarchical data association strategy. They adjust the minimum number of consecutive frames for initialization in detection-to-track association and the minimum track length for track-to-track association, since they use only the GMPHD filter-based position and motion affinity for real-time speed; false associations between nearby tracks are then prevented by using only reliable tracks. In our work, however, a track state is initialized in a single frame as soon as S2TA succeeds, and the minimum track length for T2TA is 1 for a fully online process; false associations are instead prevented by using the appearance affinity.

To build the proposed HDA strategy, we define some online MOT processing units at the present time t. S_t indicates the instance segmentation results, and the k-th segment is denoted by s_t^k. T indicates a set of tracks. These units are defined in detail as:


S_t = \{s_t^1, \ldots, s_t^k\},   (13)
T_t^{live} = \{\tau_{1,t}^{live}, \ldots, \tau_{i,t}^{live}\},   (14)
\tau_{i,t}^{live} = \{x_{t_b}^i, \ldots, x_{t_l}^i\}, \quad 0 \le t_b < t_l, \; t_l = t,   (15)
T_t^{lost} = \{\tau_{1,t}^{lost}, \ldots, \tau_{j,t}^{lost}\},   (16)
\tau_{j,t}^{lost} = \{x_{t_b}^j, \ldots, x_{t_l}^j\}, \quad 0 \le t_b < t_l < t,   (17)
x_{t_b} = \{c_{x,t_b}, c_{y,t_b}, v_{x,t_b}, v_{y,t_b}\}^T,   (18)
x_{t_l} = \{c_{x,t_l}, c_{y,t_l}, v_{x,t_l}, v_{y,t_l}\}^T,   (19)

where the two attributes ''live'' and ''lost'' indicate success and failure in tracking at time t, respectively; they are not compatible, and thus T_t^{lost} \cup T_t^{live} = T_t^{all} and T_t^{lost} \cap T_t^{live} = \emptyset are satisfied. T_t is composed of tracks \tau_{i,t} with identity i, each of which is a set of state vectors from the birth time t_b to the last tracking time t_l. In the case of \tau_{i,t}^{live}, t_l is identical to the present time t; in the case of \tau_{j,t}^{lost}, t_l is less than t. Regardless of the time t, a state vector x has the center point \{c_x, c_y\} of the segment bounding box, the velocities \{v_x, v_y\} in the x- and y-axis directions, an identity (ID), and a segment mask (see (18) and (19)).

FIGURE 4. Demonstration of the hierarchical data association (HDA) process with (a) S2TA and (b) T2TA. Track \tau_{1,t-1}^{live} can be associated with an observation z_{k,t}^{S2TA} from segment k. If S2TA fails, \tau_{1,t}^{live} becomes \tau_{1,t}^{lost}, and it can succeed in T2TA with an observation z_{3,t}^{T2TA} from the live track \tau_{3,t}^{live}.

1) SEGMENT-TO-TRACK ASSOCIATION (S2TA)
In S2TA, the observations denoted by Z_t^{S2TA} are the frame-by-frame instance segmentation results S_t. If there are no track states, the states X_t^{S2TA} are initialized from Z_t^{S2TA}; otherwise, \bar{X}_t^{S2TA} is predicted from T_{t-1}^{live} and updated by using the GMPHD filter with the following processing units:

z_{k,t}^{S2TA} = \{c_{x,t}^k, c_{y,t}^k\}^T \text{ from } s_t^k,   (20)
x_{i,t-1}^{S2TA} = \{c_{x,t-1}^i, c_{y,t-1}^i, v_{x,t-1}^i, v_{y,t-1}^i\}^T \text{ from } \tau_{i,t-1}^{live},   (21)
F^{S2TA} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},   (22)
\bar{x}_{i,t}^{S2TA} = F^{S2TA} x_{i,t-1}^{S2TA},   (23)
\bar{x}_{i,t}^{S2TA} = \{\bar{c}_{x,t}^i, \bar{c}_{y,t}^i, v_{x,t-1}^i, v_{y,t-1}^i\}^T,   (24)

where the matrix F^{S2TA} in (22) turns (21) into (24) through the multiplication (23), which is identical to (5) from the Prediction step, and \bar{c}_t^i is equal to c_{t-1}^i + v_{t-1}^i. In (22), the 1 at the 1st row, 3rd column and the 1 at the 2nd row, 4th column represent the frame difference between t − 1 and t. An example is ID2 in Figure 4(a).

The general Kalman filter was designed for predicting a single object in a space, so it can easily drift with the initial velocity toward that one object. In MOT, however, such drifting can cause false associations with other observations. Depending on whether the state x finds an associated observation z, is born, or neither, the framewise motion v is updated as follows:

v_t^i = \begin{cases} \beta \cdot v_{t-1}^i + (1.0 - \beta) \cdot \{c_{x,t}^k - \bar{c}_{x,t}^i, \; c_{y,t}^k - \bar{c}_{y,t}^i\}^T, & \text{if } z_t^k \text{ is assigned to } \bar{x}_{i,t}^{S2TA} \\ \{0, 0\}^T, & \text{else if } x_{i,t}^{S2TA} \text{ is born} \\ v_{t-1}^i, & \text{otherwise,} \end{cases}   (25)

where \beta can be set differently according to the scene context and frame rate. The impact of \beta is presented in detail in Figures 8 and 9.
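As a rough illustration of (25), the sketch below blends the previous velocity with the displacement observed in S2TA. The function name and the default value of beta are ours (beta is the tunable ratio studied in Section IV-C); this is not the released implementation.

```python
import numpy as np

def update_framewise_velocity(v_prev, c_pred, c_obs, assigned, born, beta=0.4):
    """Eq. (25): framewise motion update after S2TA.

    v_prev : previous velocity (vx, vy) of the track
    c_pred : predicted center (cx_bar, cy_bar) from eq. (24)
    c_obs  : center of the segment assigned in S2TA (None if no assignment)
    """
    if assigned:
        # blend the old velocity with the newly observed displacement
        return beta * np.asarray(v_prev) + (1.0 - beta) * (np.asarray(c_obs) - np.asarray(c_pred))
    if born:
        return np.zeros(2)           # newly born track starts with zero velocity
    return np.asarray(v_prev)        # missed this frame: keep the last velocity
```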


2) TRACK-TO-TRACK ASSOCIATION (T2TA)
In T2TA, the observations Z_t^{T2TA} and states X_t^{T2TA} (inputs) are built from the live track set T_t^{live} and the lost track set T_t^{lost}, respectively. Each of T_t^{live} and T_t^{lost} consists of the track vectors of \tau_{i,t}^{live} and \tau_{j,t}^{lost} with their identities (see (14) and (16)). The track vectors carry temporal information with the birth time t_b and loss time t_l. A live track's t_l is identical to the current time t, which means the track is not yet lost, while a lost track's t_l is less than t, which means the track was lost before time t (see (15) and (17)).

Since the general Kalman filter only considers frame-by-frame prediction, unlike the Prediction step of S2TA, which uses the framewise motion from time t − 1 to t, we devise a simple trackwise motion model considering temporal information. The trackwise motion analysis is used in T2TA as follows:

z_{i,t}^{T2TA} = \{c_{x,t}^i, c_{y,t}^i\}^T \text{ from } \tau_{i,t}^{live},   (26)
x_{j,t-1}^{T2TA} = \{c_{x,t_l}^j, c_{y,t_l}^j, \tilde{v}_{x,t}^j, \tilde{v}_{y,t}^j\}^T \text{ from } \tau_{j,t}^{lost},   (27)
F^{T2TA} = \begin{bmatrix} 1 & 0 & d_f & 0 \\ 0 & 1 & 0 & d_f \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},   (28)
\bar{x}_{j,t}^{T2TA} = F^{T2TA} x_{j,t-1}^{T2TA},   (29)
\bar{x}_{j,t}^{T2TA} = \{\bar{c}_{x,t}^j, \bar{c}_{y,t}^j, v_{x,t}^j, v_{y,t}^j\}^T,   (30)

where d_f(i, j) in (28) is the frame difference between the first element x_{t_b}^i (15) of \tau_{i,t}^{live} and the last element x_{t_l}^j (17) of \tau_{j,t}^{lost}. The trackwise motion vector \tilde{v}_t^j in (27) has two linearly averaged velocities \tilde{v}_{x,t}^j and \tilde{v}_{y,t}^j of a track, in the x-axis and y-axis directions, respectively:

\tilde{v}_t^j = \{\tilde{v}_{x,t}^j, \tilde{v}_{y,t}^j\}^T = \left\{ \frac{c_{x,t_l}^j - c_{x,t_b}^j}{t_l - t_b}, \; \frac{c_{y,t_l}^j - c_{y,t_b}^j}{t_l - t_b} \right\}^T,   (31)

where the velocities are calculated by subtracting the center position of the first object state x_{t_b}^j from that of the last state x_{t_l}^j and dividing by t_l − t_b, which is the frame difference and is equivalent to the length of the track \tau_{j,t}^{lost}. A related example is shown for ID1 in Figure 4(b).

In terms of temporal motion analysis, S2TA has the same time interval ''1'' between states and observations in the transition matrix F, whereas T2TA has a varying time interval (frame difference) between states and observations. The variable d_f depends on which state of the lost track and which observation of the live track are paired. Equations (20)-(25) of S2TA constitute the prediction step with framewise motion analysis and the update, while (26)-(31) of T2TA contain the prediction step with trackwise linear motion analysis. Detailed examples are shown in Figure 4.
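The trackwise prediction used in T2TA can be sketched as follows: a lost track's linearly averaged velocity (31) is extrapolated over the frame gap d_f of (28). These function names are our own minimal rendering of the equations, not the authors' implementation.

```python
import numpy as np

def trackwise_velocity(track_centers, t_birth, t_last):
    """Eq. (31): linearly averaged velocity of a lost track over its lifetime."""
    c_first = np.asarray(track_centers[0], dtype=float)
    c_last = np.asarray(track_centers[-1], dtype=float)
    return (c_last - c_first) / max(t_last - t_birth, 1)

def predict_lost_track_center(c_last, v_track, frame_gap):
    """Eqs. (28)-(30): extrapolate the lost track's last center by d_f frames."""
    return np.asarray(c_last, dtype=float) + frame_gap * np.asarray(v_track, dtype=float)
```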
Recent works [16], [33] exploit both the Kalman filter and optical flow in their motion models; we exclude optical flow because of its intensive computation. Instead, we devise the framewise motion (25) and the trackwise motion (31) in HDA. Following the proposed HDA strategy, for S2TA and T2TA, two cost matrices can be filled by using the affinities between the differently defined states and observations. In the next subsection, we present an efficient mask-based affinity calculation method considering position, motion, and appearance for multi-object tracking and segmentation.

C. MASK-BASED AFFINITY FUSION (MAF)
We adopt a simple score-level fusion method to adequately consider position, motion, and appearance between states and observations. Fusing affinities obtained from different domains requires a normalization step that can balance the different affinities and avoid bias toward one affinity, which may have a much higher magnitude than the others.

1) POSITION AND MOTION AFFINITY
The GMPHD filter includes Kalman filtering in its Prediction step, (4)-(6), designed with a linear motion model with noise Q. Additionally, we present two different linear motion models for the two-step hierarchical data association, S2TA and T2TA, as described in (25) and (31). Therefore, the position and motion affinity between the i-th state and the j-th observation is the probabilistic value w · q(z) obtained by the GMPHD filter as follows:

A_{pm}^{(i,j)} = w^i \cdot q^i(z^j),   (32)

which is acquired from (8) and (9) of the Update step.

FIGURE 5. Detailed examples of the proposed mask-based affinity fusion (MAF) method with the hierarchical data association (HDA): (a) S2TA and (b) T2TA.

2) APPEARANCE AFFINITY
We exploit single object tracking (SOT) methods [23]–[25] to compute the appearance affinity between the i-th state and the j-th observation, since instance segmentation results do not provide appearance features to discriminate between objects belonging to the same class, pedestrians or cars. SOT has no class dependency because it was originally designed for single-object tracking challenges such as the VOT benchmark [43]. It can therefore be applied to multi-class tracking, and we utilize it to calculate the appearance similarity by matching object templates in this paper. Before applying the SOT method, the state and observation image templates are preprocessed by setting the background pixels to zero in the RGB channels' 0 to 255 range. This preprocessing step ensures that the appearance affinity pays attention to the foreground pixels based on the segment mask. The SOT-based affinity is derived as follows:

A_{appr}^{(i,j)} = 1 - \frac{\sum_{c=x_j}^{width_j} \sum_{r=y_j}^{height_j} \bar{s}_{SOT}^{(i,j)}(r, c)}{width_j \cdot height_j},   (33)

where \bar{s}(\cdot) indicates the normalized SOT similarity value, which varies from 0.0 to 1.0 at each pixel.


To verify that single-object tracking works efficiently in computing this appearance affinity, one conventional method, KCF [23], and two state-of-the-art methods, SiamRPN [24] and DaSiamRPN [25], are adopted in our work.
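The masking-based preprocessing and (33) can be sketched as below. The per-pixel similarity map would come from whichever SOT backend (KCF, SiamRPN, or DaSiamRPN) is plugged in, so the `sot_response` argument is an assumed placeholder rather than a specific library call.

```python
import numpy as np

def masked_template(image_bgr, mask):
    """Zero out background pixels so the SOT response focuses on the segment."""
    template = image_bgr.copy()
    template[mask == 0] = 0
    return template

def appearance_affinity(sot_response):
    """Eq. (33): convert a normalized per-pixel SOT similarity map (values in [0, 1])
    over the observation template into a scalar appearance affinity."""
    h, w = sot_response.shape
    return 1.0 - float(np.sum(sot_response)) / (w * h)
```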

3) MIN-MAX NORMALIZATION
In our experiments, A_pm and A_appr have quite different magnitudes, e.g., A_pm = \{10^{-9}, \ldots, 10^{-3}\} and A_appr = \{0.4, \ldots, 1.0\} (see Figure 5). To fuse the two affinities, we apply min-max normalization to them as follows:

\bar{A}^{(i,j)} = \frac{A^{(i,j)} - \min_{1 \le i \le N, \, 1 \le j \le M} A^{(i,j)}}{\max_{1 \le i \le N, \, 1 \le j \le M} A^{(i,j)} - \min_{1 \le i \le N, \, 1 \le j \le M} A^{(i,j)}},   (34)

where A_appr and A_pm are normalized into \bar{A}_appr and \bar{A}_pm, respectively. Although A_appr and A_pm have quite different magnitudes, they are both mapped to the 0.0 to 1.0 range by (34): the minimum value is first subtracted from the original affinity value A, and the result is divided by the difference between the maximum and minimum values. Then, we propose the MAF model represented by:

A_{maf}^{(i,j)} = \bar{A}_{pm}^{(i,j)} \, \bar{A}_{appr}^{(i,j)}.   (35)

Figures 6 and 7 show the probabilistic distributions before and after MAF. From this fused affinity, we can compute the final cost between states and observations as follows:

Cost(x_{t|t-1}^i, z_t^j) = -\alpha \cdot \ln A_{maf}^{(i,j)},   (36)

where \alpha is a scale factor empirically set to 100. If one of the affinities is close to zero, such as 10^{-39}, the cost is set to 10000 to prevent the final cost from becoming infinite. The final cost therefore ranges from 0 to 10000.

From the different states and observations (inputs) in S2TA and T2TA, two cost matrices are computed in every frame, and we utilize the Hungarian algorithm [44], which has O(n^3) time complexity, to solve the cost matrices, as shown in Figure 5. Observations succeeding in S2TA or T2TA are assigned to the associated states for the Update, and observations failing in both S2TA and T2TA initialize new states.
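A compact sketch of (34)-(36) plus the assignment step is given below. It assumes the two raw affinity matrices have already been computed for one association step, and it uses SciPy's Hungarian solver in place of the authors' own implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

ALPHA = 100.0       # scale factor in eq. (36)
COST_CAP = 10000.0  # cost used when a fused affinity collapses to ~0

def min_max_normalize(A):
    """Eq. (34): rescale an affinity matrix to the [0, 1] range."""
    a_min, a_max = A.min(), A.max()
    return np.zeros_like(A) if a_max == a_min else (A - a_min) / (a_max - a_min)

def maf_costs(A_pm, A_appr):
    """Eqs. (35)-(36): fuse the two normalized affinities and convert them to costs."""
    A_maf = min_max_normalize(A_pm) * min_max_normalize(A_appr)
    with np.errstate(divide="ignore"):
        cost = -ALPHA * np.log(A_maf)
    return np.where(np.isfinite(cost), np.minimum(cost, COST_CAP), COST_CAP)

def associate(A_pm, A_appr):
    """Solve the resulting cost matrix with the Hungarian algorithm [44]."""
    cost = maf_costs(A_pm, A_appr)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols)), cost
```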
4) ANALYSIS OF AFFINITY DATA
Figures 6(a)-(c) and 7(a)-(c) show that the position and motion affinity A_gmphd and the appearance affinity A_kcf have quite different data magnitudes and distributions. In our experiments, A_pm = \{10^{-9}, \ldots, 10^{-3}\} and A_appr = \{0.4, \ldots, 1.0\} are observed. Figure 6(a) shows that cars have more concentrated distributions for the appearance affinity, with mean m_kcf ≈ 0.944, than pedestrians, with m_kcf ≈ 0.905 in Figure 7(a). On the other hand, for the GMPHD affinity, pedestrians have more concentrated distributions, as seen in Figures 6(b) and 7(b). These facts are interpreted as follows: cars can be well discriminated by position and motion, while pedestrians can be well discriminated by appearance. To consider both characteristics, we propose MAF; in the distributions of the normalized affinities \bar{A}_gmphd and \bar{A}_kcf in Figures 6(d) and 7(d), the gaps become much smaller, and the two affinities are fused into A_maf by using MAF.

FIGURE 6. Normalized distributions of the affinities between cars in KITTI-MOTS training sequence 0019. KCF and GMPHD represent ''appearance affinity'' and ''position and motion affinity'', respectively. (a) and (b) show the distributions with each average m and standard deviation σ, and (c) shows that (a) and (b) are very different from each other. (d) The proposed mask-based affinity fusion (MAF) method can determine the scale difference between the KCF and GMPHD affinities and then normalize the two affinities and fuse (multiply) them. m̄ and σ̄ denote the normalized values in (34).

D. MASK MERGING
As shown in Figure 3, for mask merging, i.e., track merging, we can utilize a bounding box-based IoU or a segment mask-based IoU (mask IoU) measure, which calculate the boxwise or pixel-wise overlap ratio between two objects, respectively. The two measures are represented by:

IoU(A, B) = \frac{bbox(A) \cap bbox(B)}{bbox(A) \cup bbox(B)},   (37)
Mask\,IoU(A, B) = \frac{mask(A) \cap mask(B)}{mask(A) \cup mask(B)}.   (38)

If the value of the selected measure is greater than or equal to the threshold t_m, the two objects are merged into one object. Mask merging is applied only between tracking objects, i.e., states, not observations, after S2TA.
{0.4, . . . , 1.0} are observed. Figure 6(a) shows that the cars E. PARALLEL PROCESSING
have more concentrated distributions, with mean mkcf ≈ We assume that data association runs only between the same
0.944 for appearance affinity than the pedestrians, with class of objects. For example, if the instance segmentation
mkcf ≈ 0.905 in Figure 7(a). On the other hand, for module provides two or more object classes, e.g., car
the GMPHD affinity, pedestrians have more concentrated and pedestrian classes, our proposed framework is easily
distributions as seen in Figures 6(b) and Figure 7(b). expansible (see Figure 1). In this paper, we implement
These facts are interpreted as follows: cars can be well the MOTS module with two parallel processes because
discriminated by position and motion while pedestrians the datasets used for our experiments produce car and
can be well discriminated by appearance. To considering pedestrian segments. Then, the time complexity O((|car| +
these two characteristics, we propose MAF; from the |pedestrian|)3 ) decreases to the slower of O(|car|3 ) and
distributions of normalized affinities Āgmphd and Ākcf in O(|pedestrian|3 ).

60650 VOLUME 10, 2022


Y.-M. Song et al.: MOTS With Embedding MAF in HDA

FIGURE 7. Normalized distributions of the affinities between pedestrians in KITTI-MOTS training sequence 0019. KCF and GMPHD represent ''appearance affinity'' and ''position and motion affinity'', respectively. (a) and (b) show the distributions with each average m and standard deviation σ, and (c) shows that (a) and (b) are very different from each other. (d) The proposed mask-based affinity fusion (MAF) method can determine the scale difference between the KCF and GMPHD affinities and then normalize the two affinities and fuse (multiply) them. m̄ and σ̄ denote the normalized values in (34).

TABLE 1. Dataset specifications for MOTS20 and KITTI-MOTS.

TABLE 2. Evaluation measures. sMOTSA is mainly used as the key measure of tracking performance.

IV. EXPERIMENTS
In this section, we present experimental studies of the proposed MOTS method, named MAF_HDA, in detail. In IV-A, we note that MAF_HDA is studied with the state-of-the-art MOTS20 [12] and KITTI-MOTS [2] datasets and the new evaluation measures. In IV-B, the implementation details of our method are addressed in terms of development environments and parameter settings. In IV-C, experimental studies on the key parameters are addressed. In IV-D, we determine the effectiveness of the key modules through ablation studies on the datasets' training subsets. The ablation studies show that the proposed key modules comprehensively and remarkably improve the baseline model p1 in terms of IDS. In particular, we compare the effectiveness of KCF, SiamRPN, and DaSiamRPN as the ''appearance'' affinity by correlating each with the ''position and motion'' affinity (GMPHD), as shown in Figure 10. Finally, in IV-E, we show that the final proposed model p6 achieves competitive performance on the test sequences of the datasets in terms of the sMOTSA, MOTSP, and IDS measures.

A. DATASETS AND MEASURES
MAF_HDA is evaluated on MOTS20 [12] and KITTI-MOTS [2], which are the most popular datasets for MOTS. Voigtlaender et al. [12] proposed new MOTS measures and two MOTS datasets that were extended from image sequences of MOT16 [3] and KITTI [2]. These sequences have been widely used for multi-object tracking with 2D bounding box-based detection results, but instance segmentation results on the same image sequences were provided for MOTS, created by the Mask RCNN [11] X152 model of Detectron2 [45]. Table 1 describes the MOTS20 and KITTI-MOTS benchmark datasets in terms of training and test sequences, frames per second (FPS), resolution, and the number of frames (Frame). MOTS20 provides six high-resolution 1920 × 1080 image sequences at 30 FPS and two low-resolution 640 × 480 image sequences at 14 FPS, containing only pedestrians. MOTS20-01, MOTS20-02, and MOTS20-09 are taken by static CCTV, and the rest of the sequences are taken by a human holding a moving camera. The MOTS20 set is divided into 4 training sequences with 2,862 images and 4 test sequences with 3,044 images. On the other hand, KITTI-MOTS provides image sequences taken by a camera on a vehicle with 1224 × 370, 1238 × 374, and 1242 × 375 resolutions at 10 FPS, divided into 21 training sequences with 8,008 images and 29 test sequences with 11,095 images. Pedestrians and cars appear in the KITTI-MOTS scenes. The ablation studies and experimental results using these datasets are presented in Tables 4, 5, 6, and 7 of Section IV. For evaluation, sMOTSA and IDS are mainly used in this paper.

These measures are mask-based variants of the original CLEAR MOT measures [47], as follows:

MOTSA = \frac{|TP| - |FP| - |IDS|}{|M|},   (39)
\widetilde{TP} = \sum_{h \in TP} \text{Mask IoU}(h, gt(h)),   (40)
sMOTSA = \frac{\widetilde{TP} - |FP| - |IDS|}{|M|},   (41)

where M is the set of ground truth (GT) pixel masks, h is a track hypothesis mask, and gt(h) is the most overlapping mask among all GTs. In multi-object tracking and segmentation accuracy (MOTSA), a mask-based variant of the original multi-object tracking accuracy (MOTA), a case is only counted as a true positive (TP) when the mask IoU value between h and gt(h) is greater than or equal to 0.5, but in soft multi-object tracking and segmentation accuracy (sMOTSA), \widetilde{TP} is used, which is a soft version of TP. Other details of the measures are displayed in Table 2.
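The two accuracy measures can be written directly from (39)-(41); the counts of false positives and ID switches are assumed to have been accumulated by the usual CLEAR-MOT style matching, which is not reproduced here.

```python
def motsa(num_tp, num_fp, num_ids, num_gt_masks):
    """Eq. (39): mask-based MOTSA."""
    return (num_tp - num_fp - num_ids) / num_gt_masks

def smotsa(tp_mask_ious, num_fp, num_ids, num_gt_masks):
    """Eqs. (40)-(41): soft MOTSA, where each true positive contributes its
    mask IoU with the best-overlapping ground-truth mask instead of 1."""
    soft_tp = sum(tp_mask_ious)          # soft TP in eq. (40)
    return (soft_tp - num_fp - num_ids) / num_gt_masks
```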

B. IMPLEMENTATION DETAILS
1) DEVELOPMENT ENVIRONMENTS
All experiments are conducted on an Intel i7-7700K CPU @ 4.20 GHz, DDR4 32.0 GB RAM, and an Nvidia GTX 1080 Ti. We implement MAF_HDA by using OpenCV image processing libraries in Visual C++. The official code implementation is available at the GitHub repository: https://github.com/SonginCV/MAF_HDA.

2) PARAMETER SETTINGS
The matrices F, Q, P, R, and H are used in Initialization, Prediction, and Update of the GMPHD filter's tracking process. Experimentally, the parameter matrices are set as:

F = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad Q = \frac{1}{2}\,\mathrm{diag}(5^2, 10^2, 5^2, 10^2), \quad P = \mathrm{diag}(5^2, 10^2, 5^2, 10^2), \quad R = \begin{bmatrix} 5^2 & 0 \\ 0 & 10^2 \end{bmatrix}, \quad H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.

We uniformly truncate the segmentation results under threshold values, which are 0.6 for cars and 0.7 for pedestrians.

TABLE 3. Threshold settings for ''Mask Merging'' and ''Mask-Based Affinity Fusion (MAF)''.

C. EXPERIMENTAL STUDIES ON KEY PARAMETERS
Figures 8 and 9 address experimental studies on the key parameters of our method and show that parameters such as β (the framewise motion update ratio in S2TA), t_m (the upper threshold for mask merging with mask IoU), and f_appr (the upper threshold for A_appr in MAF) can be tuned by simple numerical studies. In Figure 9, comparing (c) to (e) and (d) to (f), mask IoU is less sensitive to parameter settings and shows better sMOTSA and IDS than IoU. The final parameter settings are summarized in Table 3; their values are learned on the MOTS20 and KITTI-MOTS training sets and are set identically for evaluation on the training and test sets, as shown in Tables 4, 5, 6, and 7.

FIGURE 8. Experimental studies for the parameters in the MOTS20 training set. The best sMOTSA and IDS scores are shown when β, t_m, and f_appr are (a) 0.4, (b) 0.3, and (c) 0.5 for pedestrians. The same values are set for the test, which is presented in Tables 4 and 6.

TABLE 4. Evaluation results on the MOTS20 training set. p1 is the baseline method and p6 is selected as the final model.

D. ABLATION STUDIES
For the ablation studies, MAF_HDA is evaluated on the training sequences of MOTS20 and KITTI-MOTS.

FIGURE 9. Experimental studies for parameters β, t_m, and f_appr in the KITTI-MOTS training set. The best sMOTSA and IDS scores are shown when β is (a) 0.4 and (b) 0.5 for cars and pedestrians, respectively. The best scores are observed setting t_m to (c) 0.3 and (d) 0.4 with the Mask IoU measure and setting f_appr to (g) 0.85 and (h) 0.85. The same values are set for the test, which is presented in Tables 5 and 7.

TABLE 5. Evaluation results on the KITTI-MOTS training set. p1 is the baseline method without any proposed modules. KCF2 indicates a simplified version for MOT which uses a fixed-size window instead of the multi-scale windows used in the referenced version of KCF. We select p6 as the final model.

1) KEY MODULES
As discussed in Section III, our method includes three key modules: HDA, mask merging, and MAF. HDA consists of S2TA and T2TA in order. We can therefore rearrange these modules as ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'', considering the serial processes as described in Table 5. Additionally, either IoU (37) or Mask IoU (38) can be selected for ''Mask Merging''.

2) EFFECTIVENESS OF THE KEY MODULES
As seen in Tables 4 and 5, when the key modules ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'' are added to the baseline method p1 one by one, our method shows incremental and remarkable improvements. Comparing p1 and p2, p1 exploits one-step GMPHD filtering computing only the position and motion affinity, but p2 considers the position-motion affinity from the GMPHD filter and the appearance affinity from the KCF in ''MAF in S2TA''. The remarkable improvements in IDS and FM indicate that the proposed affinity fusion method works effectively. Comparing p2 and p3 in both tables, because the results improve only in KITTI-MOTS, ''Mask Merging'' may or may not merge more than two segments of one object into one segment. However, we can see that at least Mask IoU works better than IoU when comparing the merging results of p3 and p4. In p5 and p6, ''MAF in T2TA'' is applied to our method, where KCF extracts the appearance affinities using multi-scale windows, but KCF2 uses a fixed-size window that relies on the object sizes from the instance segmentation responses. In addition, we apply two more state-of-the-art SOT methods, SiamRPN [24] and DaSiamRPN [25], in the proposed appearance affinity model. In MOTS20 (Table 4), p6, p7, and p8 show comparable performance, but in KITTI-MOTS (Table 5), p7 and p8 perform worse than p6, and even worse than p5 and p6 for cars. Figure 10 shows the reason: SiamRPN shows a biased correlation and DaSiamRPN shows too wide a correlation in the appearance affinity space for cars. We think this is because cars are hard to discriminate, especially given the relatively low resolution of KITTI-MOTS images and the tiny size of the objects. On the other hand, KCF's moderate correlation is appropriate for fusion with the GMPHD filter, as seen from the better performance. Moreover, since [24], [25] exploit Siamese networks which require GPU processing, those methods cannot extract appearance affinities for dozens of objects in the data association steps in parallel with one single GPU. Thus, even though they report 100 FPS in SOT, p6 and p7 run at 1.0-2.0 FPS.


Comparing the settings without T2TA, from p2 to p4, and with T2TA, p5, p6, p7, and p8, the results show that HDA with MAF reduces IDS very effectively on both datasets, and KCF2, the simplified version of KCF, shows faster FPS and better performance in terms of sMOTSA, MOTSA, and IDS than the conventional KCF and the state-of-the-art SOT methods [24], [25].

Numerically, when adding the key modules ''MAF in S2TA'', ''Mask Merging'', and ''MAF in T2TA'' one by one, as shown in Tables 4 and 5 and Figure 12, our MOTS method shows incremental improvements from p1 to p6. The baseline method p1 is numerically improved as follows: for the KITTI-MOTS Cars training set, sMOTSA changes from 73.7 to 78.5 and IDS changes from 1,322 to 212; for the KITTI-MOTS Pedestrians training set, sMOTSA changes from 56.4 to 62.3 and IDS changes from 800 to 140; and for the MOTS20 training set, sMOTSA changes from 64.5 to 65.8 and IDS changes from 686 to 234. Thus, p6:MAF_HDA_KCF2 is selected as our final model.

FIGURE 10. Correlation maps between the appearance affinities (KCF, SiamRPN, and DaSiamRPN) and the position-motion affinity (GMPHD) in KITTI-MOTS.

E. TEST RESULTS
We evaluate the proposed MOTS method against state-of-the-art MOTS methods [12]–[20] on the test sets of the MOTS20 and KITTI-MOTS benchmarks. Tables 6 and 7 show the evaluation results, and Figure 11 describes comparisons of speed (FPS) vs. MOTS accuracy (sMOTSA and IDS), where our method is denoted by MAF_HDA.

TABLE 6. Evaluation results on the MOTS20 test set. Proposed methods are denoted by MAF_HDA. The red and blue results indicate the first and second best scores among online processing (proc.) approaches. The bold results indicate the best scores among offline proc. approaches. ''not available (n/a)'' FPS indicates the case that only total FPS is provided. '-' denotes results unpublished in their paper and on the MOTS20 leaderboard at https://motchallenge.net/results/MOTS/.

TABLE 7. Evaluation results on the KITTI-MOTS test set. Proposed methods are denoted by MAF_HDA. The red and blue results indicate the first and second best scores among online processing (proc.) approaches. The bold results indicate the best scores among offline proc. approaches. ''not available (n/a)'' FPS indicates the cases that only total FPS is provided. All entries are available at http://www.cvlibs.net/datasets/kitti/old_eval_mots.php.


1) SPEED COMPARISON W/SEGMENTATION
For a fair comparison of one-stage MOTS methods [12], [19], [20] and multi-stage methods [13]–[18], we present not only the tracking speed but also the detection and segmentation speed in Tables 6 and 7. The speed of the public segmentation, Mask RCNN X152, and the speed of the private segmentation, PointTrack, are measured in our environment presented in Subsection IV-B. Other speeds are referenced from their papers. PointTrack [13], SORTS+RReID [15], and EagerMOT [17] report fast tracking speeds of 22.2, 36.4, and 90.9 FPS, respectively, but including detection and segmentation, the speeds drop to 4.3, 2.3, and 1.96 FPS. Likewise, the multi-detector fusion-based methods EagerMOT and MOTSFusion [16] show similar speeds of 1.96 and 2.3 FPS. Among those state-of-the-art methods, the one-stage methods CPPNet [19] and TraDeS [20] show faster speeds of 7.0 FPS and 11.5 FPS, respectively, but these are still not enough to achieve real-time speeds, which are 30 FPS for MOTS20 and 10 FPS for KITTI-MOTS. Among the proposed methods, MAF_HDA_KCF2 achieves 4.6 FPS in MOTS20 and 10.9 FPS in KITTI-MOTS with tracking only, and 1.6 FPS in MOTS20 and 2.8 FPS in KITTI-MOTS with segmentation and tracking. Compared to the others, MAF_HDA_KCF2 shows moderate speeds (see Figure 11(a)-(b)). Therefore, the efficiency of MOTS methods in terms of speed versus accuracy is still a challenging issue.

FIGURE 11. Comparisons of speed (FPS) vs. MOTS accuracy (sMOTSA and IDS) against state-of-the-art methods in the KITTI-MOTS and MOTS20 test sets.

FIGURE 12. Visualization of the segmentation and MOTS results on KITTI-MOTS test sequence 0018. (b), (c), and (d) are the results of three different settings of the proposed method, which are based on the same segmentation results (a) from MaskRCNN [11]. Comparing (b) the baseline model p1 and (c) the model p4 with S2TA and mask merging (without T2TA), in (b), the IDs ''0, 2, 4, 6, 8'' of the five pedestrians at the right side of the scene are switched except for the person w/ ID 2, but in (c), only the pedestrian w/ ID 0 gets switched to ID 24. In (d), the final model p6, the five IDs are preserved since T2TA can recover the IDs after occlusion by trees at the right side. In addition, the car w/ ID 7 at frame 0 is recovered at frame 15, while (b) and (c) do not recover the car w/ ID 7, which is switched to ID 17.

2) ACCURACY COMPARISON W/SOTA METHODS
Some state-of-the-art methods [13], [16], [17], [19] have tackled raising the detection and segmentation quality: EagerMOT and MOTSFusion utilized fusion of multiple detectors from multiple domains, while PointTrack and CPPNet focus on learning segmentation models from scratch on the MOTS20 and KITTI-MOTS training sets. The former shows promising performance since multi-source detectors can complement each other; however, it inevitably needs heavy computing resources. The latter can achieve fine performance, as seen in Table 7, if fine data such as the more than 8,000 uniform-resolution images of the KITTI-MOTS training set are given, as seen in Table 1. However, on the 2,862 images of the MOTS20 training set with various resolutions such as 1920 × 1080 and 640 × 480, their MOTS accuracy drops sharply in contrast to Mask RCNN-based methods such as [14] and MAF_HDA (see Figure 11(a) and (c)).

To summarize numerically, we refer to Tables 6 and 7. First, in particular, comparing the variants of MAF_HDA with KCF, KCF2, SiamRPN, and DaSiamRPN, MAF_HDA_KCF2 shows the best performance in terms of speed and accuracy.


show a drastic speed drop compared to MAF_HDAKCF and MAF_HDAKCF2. These results are consistent with the evaluation results on the training sets (see Tables 4 and 5). Against the state-of-the-art MOTS methods [12]–[20], our proposed method MAF_HDAKCF2 ranks 2nd in sMOTSA score (1st among the online approaches) with 69.9 on the MOTS20 test set. In addition, MAF_HDAKCF2 ranks 3rd in sMOTSA score with 65.0 for pedestrians and 3rd with 77.2 for cars on the KITTI-MOTS test set.

V. CONCLUSION
In this paper, we propose a highly feasible MOTS method named MAF_HDA, which is an easily reproducible reassembly of four key modules: a GMPHD filter, HDA, mask merging, and MAF. These key modules operate within the proposed fully online MOTS framework, which tracks cars and pedestrians in parallel CPU-only processes. In addition, the key parameters can be tuned simply through experimental studies by adjusting their values within the 0.0 to 1.0 range, and these modules show remarkable improvements in the evaluation on the MOTS20 and KITTI-MOTS training sets in terms of MOTS measures such as sMOTSA and IDS. On the test sets of these two popular datasets, MAF_HDA achieves very competitive performance against the state-of-the-art MOTS methods. In future work, we expect that the proposed MOTS method will be reproduced and extended by the research community with a more precise and simpler position and motion filtering model and with faster and more sophisticated appearance feature extractors, such as deep neural network-based re-identification techniques.
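For reference, a minimal sketch of the score-level affinity fusion at the core of MAF is given below, assuming a Python/NumPy environment. The helper names fuse_affinities, associate, and min_score are hypothetical and introduced only for illustration, the element-wise product is just one possible way to combine the two normalized affinities, and SciPy's Hungarian solver stands in for whichever linear assignment routine is used in the full framework; this is an illustrative sketch under those assumptions, not the reference implementation of MAF_HDA.

import numpy as np
from scipy.optimize import linear_sum_assignment

def min_max_normalize(aff, eps=1e-12):
    # Rescale an affinity matrix to the [0, 1] range (score-level normalization).
    lo, hi = aff.min(), aff.max()
    return (aff - lo) / (hi - lo + eps)

def fuse_affinities(motion_aff, appearance_aff):
    # Fuse a position/motion affinity (e.g., derived from the GMPHD filter) with an
    # appearance affinity (e.g., a single-object-tracker response) after min-max
    # normalization; the element-wise product here is an assumed combination rule.
    return min_max_normalize(motion_aff) * min_max_normalize(appearance_aff)

def associate(fused_aff, min_score=0.5):
    # Solve one segment-to-track assignment on the fused (tracks x segments) matrix.
    # linear_sum_assignment minimizes cost, so the negated affinity is passed;
    # min_score is a hypothetical gating threshold in the 0.0 to 1.0 range.
    rows, cols = linear_sum_assignment(-fused_aff)
    return [(r, c) for r, c in zip(rows, cols) if fused_aff[r, c] >= min_score]

def mask_iou(mask_a, mask_b):
    # IoU between two boolean instance masks; a high value between overlapping
    # segments can trigger mask merging to suppress duplicated false positives.
    union = np.logical_or(mask_a, mask_b).sum()
    return np.logical_and(mask_a, mask_b).sum() / union if union > 0 else 0.0

In this sketch, min_score plays the role of one of the 0.0-to-1.0 parameters mentioned above; the actual affinity definitions, fusion rule, and thresholds are those specified in the main body of the paper.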
ACKNOWLEDGMENT
This work was performed under the GIST-LIG Nex1 Cooperation and was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2014-3-00077-008, Development of global multi-target tracking and event prediction techniques based on real-time large-scale video analysis).

REFERENCES
[1] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, "MOTChallenge 2015: Towards a benchmark for multi-target tracking," 2015, arXiv:1504.01942.
[2] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Sep. 2012, pp. 3354–3361.
[3] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler, "MOT16: A benchmark for multi-object tracking," 2016, arXiv:1603.00831.
[4] L. Chen, S. Lin, X. Lu, D. Cao, H. Wu, C. Guo, C. Liu, and F.-Y. Wang, "Deep neural network based vehicle and pedestrian detection for autonomous driving: A survey," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 6, pp. 3234–3246, Jun. 2021.
[5] D. Feng, C. Haase-Schutz, L. Rosenbaum, H. Hertlein, C. Glaser, F. Timm, W. Wiesbeck, and K. Dietmayer, "Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges," IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341–1360, Mar. 2021.
[6] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, "PointPillars: Fast encoders for object detection from point clouds," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12697–12705.
[7] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 779–788.
[8] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Proc. 28th Int. Conf. Neural Inf. Process. Syst. (NIPS), Dec. 2015, pp. 91–99.
[9] S. Shi, X. Wang, and H. Li, "PointRCNN: 3D object proposal generation and detection from point cloud," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 770–779.
[10] W. Shi and R. Rajkumar, "Point-GNN: Graph neural network for 3D object detection in a point cloud," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 1711–1719.
[11] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2961–2969.
[12] P. Voigtlaender, M. Krause, A. Ošep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe, "MOTS: Multi-object tracking and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 7942–7951.
[13] Z. Xu, W. Zhang, Z. Tan, W. Yang, H. Huang, S. Wen, E. Ding, and L. Huang, "Segment as points for efficient online multi-object tracking and segmentation," in Proc. Eur. Conf. Comput. Vis. (ECCV), Aug. 2020, pp. 264–281.
[14] F. Yang, X. Chang, C. Dang, Z. Zheng, S. Sakti, S. Nakamura, and T. Wu, "ReMOTS: Self-supervised refining multi-object tracking and segmentation," 2020, arXiv:2007.03200.
[15] M. Ahrnbom, M. Nilsson, and H. Ardö, "Real-time and online segmentation multi-target tracking with track revival re-identification," in Proc. Int. Conf. Comput. Vis. Theory Appl. (VISAPP), Feb. 2021, pp. 777–784.
[16] J. Luiten, T. Fischer, and B. Leibe, "Track to reconstruct and reconstruct to track," IEEE Robot. Autom. Lett., vol. 5, no. 2, pp. 1803–1810, Apr. 2020.
[17] A. Kim, A. Osep, and L. Leal-Taixe, "EagerMOT: 3D multi-object tracking via sensor fusion," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2021, pp. 11315–11321.
[18] Z. Wang, H. Zhao, Y.-L. Li, S. Wang, P. H. Torr, and L. Bertinetto, "Do different tracking tasks require different appearance models?" in Proc. Conf. Neural Inf. Process. Syst. (NeurIPS), Dec. 2021, pp. 15323–15332.
[19] Z. Xu, A. Meng, Z. Shi, W. Yang, Z. Chen, and L. Huang, "Continuous copy-paste for one-stage multi-object tracking and segmentation," in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 15323–15332.
[20] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan, "Track to detect and segment: An online multi-object tracker," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 12352–12361.
[21] A. Hu, A. Kendall, and R. Cipolla, "Learning a spatio-temporal embedding for video instance segmentation," 2019, arXiv:1912.08969.
[22] B. N. Vo and W. K. Ma, "The Gaussian mixture probability hypothesis density filter," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4091–4104, Nov. 2006.
[23] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, "High-speed tracking with kernelized correlation filters," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, Mar. 2015.
[24] B. Li, W. Wu, Z. Zhu, and J. Yan, "High performance visual tracking with Siamese region proposal network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 8971–8980.
[25] Z. Zhu, Q. Wang, and B. Li, "Distractor-aware Siamese networks for visual object tracking," in Proc. Eur. Conf. Comput. Vis. (ECCV), Dec. 2018, pp. 101–117.
[26] R. P. S. Mahler, "Multitarget Bayes filtering via first-order multitarget moments," IEEE Trans. Aerosp. Electron. Syst., vol. 39, no. 4, pp. 1152–1178, Oct. 2003.
[27] B.-N. Vo, S. Singh, and A. Doucet, "Sequential Monte Carlo implementation of the PHD filter for multi-target tracking," in Proc. 6th Int. Conf. Inf. Fusion (ICIF), Jul. 2003, pp. 792–799.
[28] Y.-M. Song, K. Yoon, Y.-C. Yoon, K. C. Yow, and M. Jeon, "Online multi-object tracking with GMPHD filter and occlusion group management," IEEE Access, vol. 7, pp. 165103–165121, 2019.
[29] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro, "Multi-target tracking with strong and weak detections," in Proc. Eur. Conf. Comput. Vis. Workshops (ECCVW), Oct. 2016, pp. 84–99.
[30] R. Sanchez-Matilla and A. Cavallaro, "A predictor of moving objects for first-person vision," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2019, pp. 2189–2193.
[31] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora, "Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data," in Proc. 14th IEEE Int. Conf. Adv. Video Signal Based Surveill. (AVSS), Aug. 2017, pp. 1–6.
[32] Z. Fu, P. Feng, F. Angelini, J. Chambers, and S. M. Naqvi, "Particle PHD filter based multiple human tracking using online group-structured dictionary learning," IEEE Access, vol. 6, pp. 14764–14778, 2018.
[33] M. P. Muresan and S. Nedevschi, "Multi-object tracking of 3D cuboids using aggregated features," in Proc. IEEE 15th Int. Conf. Intell. Comput. Commun. Process. (ICCP), Sep. 2019, pp. 11–18.
[34] W. Li, J. Mu, and G. Liu, "Multiple object tracking with motion and appearance cues," IEEE Access, vol. 7, pp. 104423–104434, 2019.
[35] Y.-C. Yoon, D. Y. Kim, Y.-M. Song, K. Yoon, and M. Jeon, "Online multiple pedestrians tracking using deep temporal appearance matching association," Inf. Sci., vol. 561, pp. 326–351, Jun. 2021.
[36] K. Yoon, D. Y. Kim, Y.-C. Yoon, and M. Jeon, "Data association for multi-object tracking via deep neural networks," Sensors, vol. 19, pp. 1–15, Jan. 2019.
[37] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627–1645, Sep. 2010.
[38] F. Yang, W. Choi, and Y. Lin, "Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2129–2137.
[39] X. Wang, M. Yang, S. Zhu, and Y. Lin, "Regionlets for generic object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 10, pp. 2071–2084, Oct. 2015.
[40] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y. Tai, and L. Xu, "Accurate single stage detector using recurrent rolling convolution," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5420–5428.
[41] J. Luiten, P. Voigtlaender, and B. Leibe, "PReMVOS: Proposal-generation, refinement and merging for video object segmentation," in Proc. Asian Conf. Comput. Vis. (ACCV), Dec. 2018, pp. 565–580.
[42] R. Q. Charles, H. Su, M. Kaichun, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 652–660.
[43] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, "The visual object tracking VOT2015 challenge results," in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCVW), Dec. 2015, pp. 1–23.
[44] R. Jonker and A. Volgenant, "A shortest augmenting path algorithm for dense and sparse linear assignment problems," Computing, vol. 38, no. 4, pp. 325–340, Nov. 1987.
[45] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick. (2019). Detectron2. [Online]. Available: https://github.com/facebookresearch/detectron2
[46] Y. Li, C. Huang, and R. Nevatia, "Learning to associate: HybridBoosted multi-target tracker for crowded scene," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 2953–2960.
[47] K. Bernardin and R. Stiefelhagen, "Evaluating multiple object tracking performance: The CLEAR MOT metrics," EURASIP J. Image Video Process., vol. 2008, pp. 1–10, May 2008.
[48] (2020). Costa St Tracker. [Online]. Available: https://motchallenge.net/method/MOTS=87&chl=17

YOUNG-MIN SONG (Graduate Student Member, IEEE) received the B.S. degree in computer science and engineering from Chungnam National University, Daejeon, South Korea, in 2013, and the M.S. degree in information and communications from the Gwangju Institute of Science and Technology (GIST), Gwangju, South Korea, in 2015, where he is currently pursuing the Ph.D. degree in electrical engineering and computer science. His research interests are multi-object tracking and data fusion.

YOUNG-CHUL YOON received the B.S. degree in electronics and communications engineering from Kwangwoon University, Seoul, South Korea, and the M.S. degree in electrical engineering and computer science from the Gwangju Institute of Science and Technology, Gwangju, South Korea, in 2019. He is currently a Software Engineer at the Robotics Laboratory, Hyundai Motor Company. His research interests include multi-object tracking and video analysis.

KWANGJIN YOON received the Ph.D. degree in electrical engineering and computer science from the Gwangju Institute of Science and Technology, in 2019. He is currently a Research Scientist with SI Analytics Company Ltd. His research interests include multi-object tracking, computer vision, and deep learning.

HYUNSUNG JANG is currently pursuing the Ph.D. degree in electrical and electronic engineering with Yonsei University, Seoul, South Korea. He is also a Chief Research Engineer with the Department of EO/IR Systems Research and Development, LIG Nex1. He was officially certified as a Software Architect at LG Electronics, in 2020, and completed the Software Architect Course at Carnegie Mellon University. His current research interests include computer vision, deep learning, and quantum computing.

NAMKOO HA received the Ph.D. degree in computer engineering from Kyungpook National University, South Korea, in 2008. He is currently a Chief Research Engineer with the Department of EO/IR Systems Research and Development, LIG Nex1. His current research interest includes developing intelligent EO/IR systems through the use of deep learning.

MOONGU JEON (Senior Member, IEEE) received the B.S. degree in architectural engineering from Korea University, Seoul, South Korea, in 1988, and the M.S. and Ph.D. degrees in computer science and scientific computation from the University of Minnesota, Minneapolis, MN, USA, in 1999 and 2001, respectively. As a Postgraduate Researcher, he worked on optimal control problems at the University of California at Santa Barbara, Santa Barbara, CA, USA, from 2001 to 2003, and then moved to the National Research Council of Canada, where he worked on the sparse representation of high-dimensional data and image processing, until 2005. In 2005, he joined the Gwangju Institute of Science and Technology, Gwangju, South Korea, where he is currently a Full Professor with the School of Electrical Engineering and Computer Science. His current research interests include machine learning, computer vision, and artificial intelligence.