Multi-modal Object Tracking: A Survey and Experimental Comparison

1. Introduction
The introduction of existing datasets, challenges, and corresponding evaluation metrics is described in section 4. In section 5, we report the experimental results on several datasets and different challenges. Finally, we discuss the future directions of multi-modal tracking in section 6. All the collected materials and analysis will be released at https://github.com/zhang-pengyu/Multimodal_tracking_survey.
2. Background
Visual object tracking aims to estimate the coordinates and scale of a specific target throughout a given video. In general, tracking methods can be divided into two types according to the information used: (1) single-modal tracking and (2) multi-modal tracking. Single-modal tracking locates a target captured by a single sensor, such as a laser, visible, or infrared camera, to name a few. In recent years, tracking with RGB images, which are computationally efficient to process, easily accessible, and of high quality, has become increasingly popular, and numerous methods have been proposed to improve tracking accuracy and speed. In RGB tracking, several frameworks, including the Kalman filter (KF) [12, 13], particle filter (PF) [14, 15], sparse learning (SL) [16, 17], correlation filter (CF) [18, 19], and CNNs [20, 21], have been employed.
speed. In 2010, Bolme et al. [18] proposed a CF-based method called MOSSE,
which achieves high-speed tracking with reasonable performance. Thereafter,
many researchers have aimed to develop the CF framework to achieve state-
of-the-art performance. Li et al. [19] add scale estimation and multiple-feature integration to the CF framework. Danelljan et al. [22] alleviate the boundary effect by adding a spatial regularization term to the learned filter, at the cost of a decrease in speed. Galoogahi et al. [23] provide another efficient solution to
the boundary-effect problem, thereby maintaining a real-time speed. Another popular framework is the Siamese network, first introduced to tracking by Bertinetto et al. [20]. Subsequently, deeper and wider networks have been utilized to improve target representation. Zhang et al. [21] find that the padding operation in deeper networks induces a position bias that suppresses the capability of powerful networks. They address the position bias problem and improve the tracking performance
significantly. Some methods perform better scale estimation by predicting seg-
mentation masks rather than bounding boxes [24, 25]. Overall, many efforts have been devoted to this field. However, target appearance, as the main cue from visible images, is not reliable for tracking when the target suffers from extreme scenarios, including low illumination, out-of-view movement, and heavy occlusion. To this end, more complementary cues are added to handle these challenges: a visible camera is assisted by other sensors, such as laser [26], depth [7], thermal [10], radar [27], and audio [28], to satisfy different requirements.
Since 2005, a series of methods using various types of multi-modal information have been proposed. Song et al. [26] conduct multiple object tracking by us-
ing visible and laser data. Kim et al. [27] exploit the traditional Kalman filter
method for multiple object tracking with radar and visible images. Megherbi
et al. [28] propose a tracking method by combining vision and audio informa-
tion using belief theory. In particular, tracking with RGB-D and RGB-T data has attracted the most attention, owing to portable and affordable binocular cameras. Thermal data can provide a powerful supplement to RGB images in some challenging scenes, including night, fog, and rain. Besides, a depth map can provide an additional constraint to avoid tracking failure caused by heavy oc-
clusion and model drift. Lan et al. [29] apply the sparse learning method to
RGB-T tracking, thereby removing the cross-modality discrepancy. Li et al. [11]
extend an RGB tracker to the RGB-T domain, which achieves promising results.
Zhang et al. [10] jointly model motion and appearance information to achieve
accurate and robust tracking. Kart et al. [7] introduce an effective constraint
using a depth map to guide model learning. Liu et al. [30] transform the target position into 3D coordinates using RGB and depth images, and then perform
tracking using the mean shift method.
2.2. Previous Surveys and Reviews
As shown in Table 1, we introduce existing surveys related to multi-modal processing, such as image fusion, object tracking, and multi-
Table 1: Summary of existing surveys in related fields.
Figure 1: Structure of three classification methods and algorithms in each category.
posed in 2016. Finally, compared with the literature [37], which only focuses on RGB-T tracking, our study provides a more substantial and comprehensive survey with a larger scope, covering both RGB-D and RGB-T tracking.
Figure 2: Workflows of early fusion (EF) and late fusion (LF). EF-based methods conduct feature fusion and model the modalities jointly, while LF-based methods model each modality individually and then combine their decisions.

3.1. Auxiliary Modality Purpose
We first discuss the purpose of the auxiliary modality in multi-modal tracking. There are three main categories: (a) feature learning, where feature representations of the auxiliary-modality image are extracted to help locate the target; (b) pre-processing, where the information from the auxiliary modality is used before target modeling; and (c) post-processing, where the information from the auxiliary modality is used to improve the model or refine the bounding box.

3.1.1. Feature Learning
Early Fusion (EF). In EF-based methods, the features extracted from both modal-
ities are first aggregated as a larger feature vector and then sent to the model
to locate the target. The workflow of EF-based trackers is shown in the left
part of Figure 2. For most trackers, EF is the primary choice for the multi-modal tracking task, and the visible and auxiliary modalities are treated alike with the same feature extraction methods. Camplani et al. [43] utilize the HOG feature for both visible and depth maps. Kart et al. [47] extract multiple
features to build a robust tracker for RGB-D tracking. Similar methods exist
in [44, 48, 49, 42, 54, 56, 58, 2, 60, 3]. However, the auxiliary modality often conveys different information from the visible image. For example, thermal and depth images contain temperature and depth data, respectively. The aforementioned trackers apply feature fusion while ignoring the modality discrepancy, which decreases the tracking accuracy and causes the tracker to drift easily. To this end, some trackers differentiate the heterogeneous modalities by applying different feature extraction methods. In [45], the gradient feature is extracted from the depth
map, while the average color feature is used to represent the target in the vis-
ible modality. Meshgi et al. [52] use the raw depth information and many fea-
ture methods (HOG, LBP, and LoG) for RGB images. In [29, 57, 64], the HOG
and intensity features are used for visible and thermal modalities, respectively.
Due to the increasing cost involved in feature concatenation and the misalign-
ment of multi-modal data, some methods tune the feature representation af-
ter feature extraction by the pruning [67] or re-weighting operation [50, 72],
which can compress the feature space and exploit the cross-modal correlation.
In DAFNet [67], a feature pruning module is proposed to eliminate noisy and
redundant information. Liu et al. [50] introduce a spatial weight to highlight
the foreground area. Zhu et al. [72] exploit modality importance using the pro-
posed multi-modal aggregation network.
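To make the distinction between the two fusion schemes concrete, the following sketch contrasts feature-level (EF) and decision-level (LF) fusion in NumPy. The flatten-and-normalize feature extractor, the modality weights, and the weighted-sum response fusion are illustrative assumptions rather than the implementation of any particular tracker discussed above.

```python
import numpy as np

def extract_features(patch):
    # Placeholder feature extractor (HOG or CNN features in practice):
    # flatten the patch and L2-normalize it.
    f = patch.astype(np.float32).ravel()
    return f / (np.linalg.norm(f) + 1e-8)

def early_fusion(rgb_patch, aux_patch, w_rgb=0.5, w_aux=0.5):
    """EF: concatenate (optionally re-weighted) per-modality features into
    one vector, which is then fed to a single appearance model."""
    return np.concatenate([w_rgb * extract_features(rgb_patch),
                           w_aux * extract_features(aux_patch)])

def late_fusion(response_rgb, response_aux, w_rgb=0.5, w_aux=0.5):
    """LF: each modality is tracked independently and their response maps
    (decisions) are combined, here by a weighted sum."""
    fused = w_rgb * response_rgb + w_aux * response_aux
    return np.unravel_index(np.argmax(fused), fused.shape)

# Toy usage with random patches and response maps.
rgb, depth = np.random.rand(32, 32, 3), np.random.rand(32, 32)
joint_feature = early_fusion(rgb, depth)                 # input to one model
position = late_fusion(np.random.rand(64, 64), np.random.rand(64, 64))
```

The re-weighting and pruning variants discussed above correspond to making such modality weights learned, per-channel, or spatially varying quantities.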
fuse two single-modal trackers via an adaptive weight map. In MCBT [75],
data from multiple sources are used stepwise to locate the target. A rough
target position is first estimated by optical flow in the visible domain, and the
final result is determined by a part-based matching method with RGB-D data.
3.1.2. Pre-Processing
Owing to the availability of the depth map, the second purpose of the auxiliary modality is to transform the target into 3D space before target modeling via RGB-D data. Instead of tracking in the image plane, these methods model the target in world coordinates, and 3D trackers are designed accordingly [38, 39, 7, 30, 40, 41]. Liu et al. [30] extend the classical mean shift tracker to 3D. In OTR [7], the dynamic spatial constraint generated by the 3D target model enhances the discrimination of DCF trackers in dealing with out-of-plane rotation and heavy occlusion. Although significant performance is achieved, the computational cost of 3D reconstruction cannot be neglected. Furthermore, the performance is highly dependent on the quality of the depth data and the accessibility of the mapping functions between the 2D and 3D spaces.
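As a concrete illustration of the mapping between the 2D and 3D spaces mentioned above, the sketch below back-projects a tracked pixel with its depth value into 3D camera coordinates using the standard pinhole model. The intrinsic parameters are hypothetical, and the cited 3D trackers build richer representations (point clouds, reconstructed target models) on top of such a mapping.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Map a pixel (u, v) with depth (in metres) to 3D camera coordinates
    using the pinhole camera model."""
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Hypothetical RGB-D intrinsics and a tracked target centre at 1.8 m.
fx, fy, cx, cy = 525.0, 525.0, 319.5, 239.5
target_3d = backproject(u=400, v=260, depth=1.8, fx=fx, fy=fy, cx=cx, cy=cy)
# target_3d can now be modeled and tracked in camera/world coordinates
# instead of the image plane.
```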
3.1.3. Post-processing
Compared with the RGB image, which provides more detailed content, the depth image highlights object contours, which can be used to segment the target from its surroundings via depth variance. Inspired by this property of the depth map, many RGB-D trackers utilize depth information to determine whether occlusion occurs and to estimate the target scale [43, 46, 49, 79].
togram is recorded to examine whether occlusion occurs. If occlusion is detected, the tracker locates the occluder and searches for candidates around it. In [10], Zhang et al. propose a tracker switcher to detect occlusion based on template matching and tracking reliability. The tracker can dynamically select whether appearance or motion cues are used for tracking, thereby significantly improving its robustness.
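A minimal sketch of this kind of depth-based occlusion reasoning is given below: the depth values inside the target box are compared with a reference target depth, and a sudden shift toward the camera is flagged as occlusion. The threshold values and histogram settings are illustrative assumptions, not the exact rules of the cited trackers.

```python
import numpy as np

def detect_occlusion(depth_roi, ref_depth, jump_thresh=0.3, min_ratio=0.35):
    """Flag occlusion when a large fraction of pixels inside the target box
    becomes significantly closer to the camera than the reference depth."""
    valid = depth_roi[depth_roi > 0]                  # ignore missing depth
    if valid.size == 0:
        return False, ref_depth
    hist, edges = np.histogram(valid, bins=32)
    mode_depth = 0.5 * (edges[np.argmax(hist)] + edges[np.argmax(hist) + 1])
    closer_ratio = np.mean(valid < ref_depth - jump_thresh)
    occluded = closer_ratio > min_ratio               # an occluder pops up in front
    return occluded, mode_depth

# Toy usage: target previously observed at ~2.0 m; an occluder appears at ~1.2 m.
roi = np.full((40, 30), 2.0)
roi[:, :15] = 1.2
occluded, new_depth = detect_occlusion(roi, ref_depth=2.0)
```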
Figure 3: Framework of OAPF [52]. The particle filter method is applied with occlusion handling, in which an occlusion model is constructed in addition to the template model. When the target is occluded, the occlusion model is used to predict the position without updating the template model.
sampling manners are exploited, e.g., the sliding window [50], particle filter [38, 45], and Gaussian sampling [11]. Furthermore, a crucial task is utilizing powerful features to represent the target. Thanks to emerging convolutional networks, more trackers have been built with efficient CNNs. We introduce the various frameworks in the following paragraphs.
Mean Shift (MS). MS-based methods maximize the similarity between the his-
tograms of candidates and the target template, and conduct fast local search
using the mean shift technique. These methods usually assume that the ob-
ject overlaps itself in consecutive frames [77]. In [39, 30], the authors extend
Figure 4: Workflow of JMMAC [10]. The CF-based tracker is used to model the appearance cue, while both camera and target motion are also considered, thereby achieving substantial performance.
Figure 5: Framework of MANet [11]. The generic adapter (GA) is used to extract common information from RGB-T images, the modality adapter (MA) aims to exploit the different properties of the heterogeneous modalities, and the instance adapter (IA) models the appearance properties and temporal variations of a certain object.
w_t^i \propto p\left(z_t \mid x_t = x_t^i\right). \qquad (2)
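Eq. (2) states that each particle's weight is proportional to the observation likelihood of its hypothesized state. The sketch below illustrates this weighting together with normalization and multinomial resampling; the Gaussian appearance likelihood is a placeholder assumption rather than the observation model of any specific tracker.

```python
import numpy as np

def likelihood(observation, state, sigma=0.5):
    # Placeholder observation model p(z_t | x_t = x_t^i): a Gaussian on the
    # distance between the observed and the predicted descriptors.
    d = np.linalg.norm(observation - state)
    return np.exp(-0.5 * (d / sigma) ** 2)

def pf_update(particles, observation):
    """One particle-filter step: weight each particle by its likelihood
    (Eq. 2), normalize, and resample in proportion to the weights."""
    w = np.array([likelihood(observation, p) for p in particles])
    w = w / w.sum()                              # normalize the weights
    idx = np.random.choice(len(particles), size=len(particles), p=w)
    return particles[idx], w

# Toy usage: 100 particles over a 2-D state (e.g., the target centre).
particles = np.random.randn(100, 2)
resampled, weights = pf_update(particles, observation=np.array([0.2, -0.1]))
```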
equipping discriminative features [100], to increase the tracking performance.
Owing to the advantages of CF-based trackers, many researchers aim to build multi-modal trackers within the CF framework. Zhai et al. [65] introduce a low-rank constraint to learn the filters of both modalities collaboratively, thereby exploiting the relationship between RGB and thermal data. Hannuna et al. [46] ef-
fectively handle the scale change with the guidance of the depth map. Kart et al.
propose a long-term RGB-D tracker [7], which is designed based on CSRDCF [101]
and applies online 3D target reconstruction to facilitate learning robust filters.
The spatial constraint is learned from the 3D model of the target. When the
target is occluded, view-specific DCFs are used to robustly localize the target.
Camplani et al. [43] improve the CF method in scale estimation and occlusion
handling, while maintaining a real-time speed.
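For reference, the sketch below shows the closed-form learning and detection steps of a basic single-channel MOSSE-style correlation filter [18], which the multi-modal CF trackers above extend with multi-channel features, spatial constraints, or depth-guided regularization. The Gaussian label width, the regularization weight, and the omission of windowing and online updates are simplifications.

```python
import numpy as np

def train_filter(patch, sigma=2.0, lam=1e-3):
    """Learn a correlation filter in the Fourier domain:
    H* = (G . conj(F)) / (F . conj(F) + lambda)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    F, G = np.fft.fft2(patch), np.fft.fft2(g)     # frequency-domain patch and label
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(H_conj, patch):
    """Correlate the learned filter with a search patch and return the peak."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)

# Toy usage on a 64x64 grayscale patch; the response peak shifts with the target.
template = np.random.rand(64, 64)
H = train_filter(template)
peak = detect(H, np.roll(template, shift=(3, 5), axis=(0, 1)))
```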
Table 2: Summary of multi-modal tracking datasets.

Type    Name          Seq. Num.   Total Frames   Min. Frame   Max. Frame   Attr.   Resolution   Metrics           Year
RGB-D   PTB [86]      100         21.5K          0.04K        0.90K        11      640 × 480    CPE, SR           2013
RGB-D   STC [4]       36          18.4K          0.13K        0.72K        10      640 × 480    SR, Acc., Fail.   2018
RGB-D   CDTB [87]     80          101.9K         0.4K         2.5K         13      640 × 360    F-score, Pr, Re   2019
RGB-T   OTCBVS [88]   6           7.2K           0.6K         2.3K         –       320 × 240    –                 2007
RGB-T   LITIV [89]    9           6.3K           0.3K         1.2K         –       320 × 240    –                 2012
Ding et al. [44] learn a Bayesian classifier and consider the candidate with the maximal score as the target location, which can reduce model drift. In [83], a structured SVM [105] is learned by maximizing a classification score, which can prevent labeling ambiguity in the training process.
4. Datasets
PTB. On the PTB dataset, the per-frame tracking result is scored by the indicator

u_i = \begin{cases} 1, & \mathrm{IoU}(bb_i, gt_i) > t_{sr}, \\ 0, & \text{otherwise}, \end{cases} \qquad (3)

where IoU(·, ·) denotes the intersection over union between the bounding box bb_i and the ground truth gt_i in the i-th frame; if the IoU is larger than the threshold t_sr, the target is considered successfully tracked. The final rank of a tracker is determined by the Avg. Rank, which is defined as the average ranking of its success rate (SR) over all attributes. The STC dataset [4] consists of 36 RGB-D sequences and covers
some extreme tracking circumstances, such as outdoor and night scenes. This
dataset is captured by still and moving ASUS Xtion RGB-D cameras to evaluate
the tracking performance under conditions of arbitrary camera motion. A total
of 10 attributes are labeled to thoroughly analyze the dataset bias. A detailed introduction of each attribute is provided in the supplementary file.
The trackers are measured using both the SR and VOT protocols. The VOT protocol evaluates the tracking performance in terms of two aspects: accuracy and failure. Accuracy (Acc.) considers the IoU between the ground truth and the bounding box, while failure (Fail.) counts the number of times the overlap drops to zero, after which the tracker is re-initialized with the ground truth and continues to track. CDTB [87] is the latest RGB-D tracking dataset and contains 80 short-term and long-term videos. The target frequently goes out of view and is occluded, which requires the tracker to handle both tracking and re-detection. The metrics are Precision (Pr.), Recall (Re.), and the overall F-score [106]. The preci-
sion and recall are defined as follows:

Pr = \frac{\sum_{i=1}^{N} u_i}{\sum_{i=1}^{N} s_i}, \quad s_i = \begin{cases} 1, & bb_i \text{ exists}, \\ 0, & \text{otherwise}, \end{cases} \qquad (4)

Re = \frac{\sum_{i=1}^{N} u_i}{\sum_{i=1}^{N} g_i}, \quad g_i = \begin{cases} 1, & gt_i \text{ exists}, \\ 0, & \text{otherwise}, \end{cases} \qquad (5)

where u_i is defined in Eq. 3. The F-score combines both precision and recall through

F\text{-}score = \frac{2\, Pr \times Re}{Pr + Re}. \qquad (6)
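A compact sketch of these metrics (Eqs. 3–6) is given below, assuming that per-frame predictions and ground truths are stored as [x, y, w, h] boxes, with None marking frames in which the prediction or the annotation is absent; the variable names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two [x, y, w, h] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[0] + a[2], b[0] + b[2]), min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def pr_re_fscore(preds, gts, t_sr=0.5):
    """Precision, recall, and F-score following Eqs. (3)-(6):
    u_i = 1 iff IoU(bb_i, gt_i) > t_sr; s_i/g_i flag existing bb_i/gt_i."""
    u = sum(1 for bb, gt in zip(preds, gts)
            if bb is not None and gt is not None and iou(bb, gt) > t_sr)
    s = sum(1 for bb in preds if bb is not None)   # frames with a prediction
    g = sum(1 for gt in gts if gt is not None)     # frames with a ground truth
    pr = u / s if s else 0.0
    re = u / g if g else 0.0
    f = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f

# Toy usage: three frames; the target is absent in the second one.
preds = [[10, 10, 40, 40], None, [12, 11, 40, 40]]
gts   = [[12, 12, 40, 40], None, [50, 60, 40, 40]]
print(pr_re_fscore(preds, gts))   # (0.5, 0.5, 0.5)
```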
Figure 7: Examples and corresponding attributes in the GTOT (BlackCar, BlackSwan1, Gathering, GarageHover) and RGBT234 (baginhand, kite4, carLight, car10) tracking datasets.
Since 2019, both RGB-D and RGB-T challenges have been held by the VOT committee [6, 5]. For the RGB-D challenge, trackers are evaluated on the CDTB dataset [87] with the same evaluation metrics. All the sequences are annotated with 5 attributes, namely, occlusion, dynamics change, motion change, size change, and camera motion. The RGB-T challenge constructs its dataset as a subset of RGBT234 with slight changes in the ground truth, consisting of 60 public RGB-T videos and 60 sequestered videos. Compared with RGBT234, VOT-RGBT utilizes a different evaluation metric, i.e., the expected average overlap (EAO), to measure trackers. In VOT2019-RGBT, trackers are re-initialized when a tracking failure is detected (the overlap between the bounding box and the ground truth is zero). VOT2020-RGBT replaces this re-initialization mechanism with a new anchor mechanism to avoid a causal correlation between the first reset and later ones [5].
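The sketch below illustrates the reset-based protocol described above: whenever the overlap between the prediction and the ground truth drops to zero, a failure is counted and the tracker is re-initialized from the ground truth. The tracker interface, re-initializing on the very next frame, and reporting accuracy as the plain mean overlap are simplifying assumptions with respect to the full VOT toolkit.

```python
def run_reset_protocol(tracker, frames, gts, overlap_fn):
    """Supervised (reset-based) evaluation in the VOT2019-RGBT style:
    count failures and re-initialize whenever the overlap reaches zero."""
    failures, overlaps = 0, []
    tracker.init(frames[0], gts[0])               # hypothetical tracker API
    for frame, gt in zip(frames[1:], gts[1:]):
        box = tracker.update(frame)
        o = overlap_fn(box, gt)                   # e.g., the iou() defined earlier
        if o == 0:                                # tracking failure detected
            failures += 1
            tracker.init(frame, gt)               # re-initialize from ground truth
            overlaps.append(0.0)
        else:
            overlaps.append(o)
    accuracy = sum(overlaps) / len(overlaps) if overlaps else 0.0
    return accuracy, failures
```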
5. Experiments
Table 3: Experimental results on the PTB dataset. Numbers in parentheses indicate their ranks.
The top three results are in red, blue, and green fonts.
                Avg.    Target type              Target size      Movement        Occlusion      Motion type
Algorithm       Rank    Human   Animal   Rigid    Large   Small    Slow    Fast    Yes    No      Passive  Active
OTR [7] 2.91 77.3(3) 68.3(6) 81.3(3) 76.5(4) 77.3(1) 81.2(2) 75.3(1) 71.3(2) 84.7(6) 85.1(1) 73.9(3)
WCO [50] 3.91 78.0(2) 67.0(7) 80.0(4) 76.0(5) 75.0(2) 78.0(7) 73.0(5) 66.0(6) 86.0(2) 85.0(2) 82.0(1)
TACF [48] 4.64 76.9(5) 64.7(9) 79.5(5) 77.2(3) 74.0(4) 78.5(5) 74.1(3) 68.3(5) 85.1(3) 83.6(4) 72.3(5)
CA3DMS [30] 6 66.3(12) 74.3(2) 82.0(1) 73.0(10) 74.2(3) 79.6(3) 71.4(8) 63.2(13) 88.1(1) 82.8(5) 70.3(8)
CSR-rgbd [9] 6.36 76.6(6) 65.2(8) 75.9(9) 75.4(6) 73.0(6) 79.6(3) 71.8(6) 70.1(3) 79.4(10) 79.1(7) 72.1(6)
3DT [38] 6.55 81.4(1) 64.2(11) 73.3(11) 79.9(2) 71.2(9) 75.1(12) 74.9(2) 72.5(1) 78.3(11) 79.0(8) 73.5(4)
DLST [42] 7 77.0(4) 69.0(5) 73.0(12) 80.0(1) 70.0(12) 73.0(13) 74.0(4) 66.0(6) 85.0(5) 72.0(13) 75.0(2)
OAPF [52] 7.27 64.2(13) 84.8(1) 77.2(6) 72.7(11) 73.4(5) 85.1(1) 68.4(12) 64.4(11) 85.1(3) 77.7(10) 71.4(7)
CCF [80] 7.55 69.7(10) 64.5(10) 81.4(2) 73.1(9) 72.9(7) 78.4(6) 70.8(9) 65.2(8) 83.7(7) 84.4(3) 68.7(12)
OTOD [40] 8.91 72.0(8) 71.0(3) 73.0(12) 74.0(7) 71.0(10) 76.0(9) 70.0(11) 65.0(9) 82.0(8) 77.0(12) 70.0(9)
DMKCF [47] 9 76.0(7) 58.0(13) 76.7(7) 72.4(12) 72.8(8) 75.2(11) 71.6(7) 69.1(4) 77.5(13) 82.5(6) 68.9(11)
DSKCF [46] 9.09 70.9(9) 70.8(4) 73.6(10) 73.9(8) 70.3(11) 76.2(8) 70.1(10) 64.9(10) 81.4(9) 77.4(11) 69.8(10)
DSOH [43] 11.45 67.0(11) 61.2(12) 76.4(8) 68.8(13) 69.7(13) 75.4(10) 66.9(13) 63.3(12) 77.6(12) 78.8(9) 65.7(13)
DOHR [44] 14 45.0(14) 49.0(14) 42.0(14) 48.0(14) 42.0(14) 50.0(14) 43.0(14) 38.0(14) 54.0(14) 54.0(14) 41.0(14)
1 http://tracking.cs.princeton.edu/
Table 4: The speed analysis of RGB-D trackers.
effectiveness of CNNs and obtains the 10th rank on the PTB dataset.
Speed Analysis. The speeds of the RGB-D trackers are listed in Table 4. Most of the trackers cannot meet the requirement of real-time tracking. Trackers based on improved CF frameworks (OTR [7], DMKCF [47], CCF [80], WCO [50], and TACF [48]) are limited by their speed. Only two real-time trackers (DSKCF [46] and DSOH [43]) are based on the original CF architecture.
Table 5: Experimental results on the GTOT and RGBT234 datasets.
               GTOT             RGBT234
Tracker        SR      PR       SR      PR       Speed (fps)   Device   Platform         Setting
CMPP [81] 73.8 92.6 57.5 82.3 1.3 GPU PyTorch RTX 2080Ti
JMMAC [10] 73.2 90.1 57.3 79.0 4.0 GPU MatConvNet RTX 2080Ti
CAT [121] 71.7 88.9 56.1 80.4 20.0 GPU PyTorch RTX 2080Ti
MaCNet [62] 71.4 88.0 55.4 79.0 0.8 GPU PyTorch GTX 1080Ti
TODA [69] 67.7 84.3 54.5 78.7 0.3 GPU PyTorch GTX 1080Ti
DAFNet [66] 71.2 89.1 54.4 79.6 23.0 GPU PyTorch RTX 2080Ti
MANet [11] 72.4 89.4 53.9 77.7 1.1 GPU PyTorch GTX 1080Ti
DAPNet [67] 70.7 88.2 53.7 76.6 – GPU PyTorch GTX 1080Ti
FANet [72] 69.8 88.5 53.2 76.4 1.3 GPU PyTorch GTX 1080Ti
CMR [59] 64.3 82.7 48.6 71.1 8.0 CPU C++ –
SGT [2] 62.8 85.1 47.2 72.0 5.0 CPU C++ –
mfDiMP [71] 49.0 59.4 42.8 64.6 18.6 GPU PyTorch RTX 2080Ti
CSR [1] – – 32.8 46.3 1.6 CPU Matlab & C++ –
L1-PF [54] 42.7 55.1 28.7 43.1 – – – –
JSR [64] 43.0 55.3 23.4 34.3 0.8 CPU Matlab –
Overall Comparison. All the high-performance trackers are equipped with learned deep features, and most of them are based on MDNet variants (CMPP, MaCNet, TODA, DAFNet, MANet, DAPNet, and FANet), which achieve satisfactory results. The CF-based tracker JMMAC, which combines appearance and motion cues, obtains the second rank on the GTOT and RGBT234 datasets. Compared with CF trackers, MDNet-based trackers can provide a more precise target position with higher PR, but are inferior to the CF framework in scale estimation, which is reflected in SR. Trackers with the sparse learning technique (CSR, SGT) perform better than L1-PF, which is based on the PF method. Although mfDiMP utilizes a state-of-the-art backbone, its performance is not satisfactory. The main reason may be that mfDiMP uses different training data, generated by an image translation method [120], which introduces a gap with respect to existing real RGB-T data.
Figure 8: Attribute-based precision and success plots comparing CMPP, TODA, MaCNet, JMMAC, MANet, SGT, DAPNet, CMR, mfDiMP, L1+PF, CSR, and JSR.
ily in fast-moving targets. This condition may result from CF-based trackers
having a fixed search region. When the target moves outside the region, the
target cannot be detected, thereby causing tracking failure. CMPP, which ex-
ploits inter-modal and cross-modal correlations, shows great improvement on low illumination, low resolution, and thermal crossover. In these attributes, one modality is often unreliable or unavailable, and CMPP can bridge the gap between the heterogeneous modalities. The detailed attribute-based comparison figure can be found in the supplementary file.
Speed Analysis. For tracking speed, we list the platform and setting for fair
comparison in Table 5. DAFNet, which is based on a real-time MDNet variant, achieves fast tracking at 23.0 fps. Although mfDiMP is equipped with ResNet-101, it is the second fastest tracker because most parts of the network are trained offline without online tuning. The other trackers are constrained by their low speed and cannot be utilized in real-time applications.
We list the challenge results in Table 6. Both original RGB trackers, which do not utilize depth information, and RGB-D trackers are merged for evaluation. The trackers that obtain the top three ranks on F-score, precision, and recall are designed with the same components and framework. Unlike on the PTB dataset, DL-based methods show strong performance on VOT-RGBD19, which results from these trackers utilizing large-scale visual datasets for offline training and
Table 6: Challenge results on VOT2019-RGBD dataset.
equipping deeper networks. For instance, the original RGB tracker with a DL framework also achieves excellent performance. Furthermore, occlusion handling is another necessary component of a high-performance tracker: VOT2019-RGBD focuses on long-term tracking with frequent target disappearance and reappearance, and most of the trackers are equipped with a re-detection mechanism. The CF-based trackers (FuCoLoT, OTR, CSR-RGBD, and ECO) do not perform well, which may stem from online updating with occluded patches that degrades the discriminability of the model.
Table 7: Challenge results on the VOT2019-RGBT dataset.
6. Further Prospects
network architecture, e.g., VGGNet and ResNet, and extract features at the same level (layer). A crucial task is to design a network dedicated to processing multi-modal data. Since 2017, AutoML methods, especially neural architecture search (NAS), have become popular; they design architectures automatically and obtain highly competitive results in many areas, such as image classification [125] and recognition [126]. However, researchers have paid little attention to NAS for multi-modal tracking, which is a promising direction to explore.
Large-scale Dataset for Training. With the emergence of deep neural net-
works, more powerful methods are equipped with CNNs to achieve accurate and robust performance. However, the existing datasets focus on testing and provide no training subset. For instance, most DL-based trackers use the GTOT dataset as the training set when testing on RGBT234, yet GTOT contains only a small amount of data with limited scenes. The effectiveness of DL-based methods has therefore not been fully exploited. Zhang et al. [71] generate synthetic thermal data from the numerous existing visible datasets by using an image translation method [120]. However, this data augmentation does not bring a significant performance improvement. Above all, constructing a large-scale training dataset is a major direction for multi-modal tracking.
Figure 9: Mis-registration examples in the RGBT234 dataset. We show the ground truth of the visible modality in both images. The coarse bounding box degrades the discriminability of the model.
Metrics for Robustness Evaluation. In some extreme scenes and weather con-
ditions, such as rain, low illumination, and hot sunny days, visible or thermal sensors cannot provide meaningful data. The depth camera cannot obtain pre-
cise distance estimation when the object is far from the sensor. Therefore, a
robust tracker needs to avoid tracking failure when any of the modality data is
unavailable during a certain period. To handle this case, both complementary
and discriminative features have to be applied in localization. However, none
of the datasets measures the tracking robustness with missing data. Thus, a
new evaluation metric for tracking robustness needs to be considered.
7. Conclusion
field, several possible directions are identified to facilitate the improvement of
multi-modal tracking. The comparison results and analysis will be available at
https://github.com/zhang-pengyu/Multimodal_tracking_survey.
References
[3] C. Li, X. Liang, Y. Lu, N. Zhao, J. Tang, RGB-T object tracking: Benchmark
and baseline, Pattern Recognition 96 (12) (2019) 106977.
[4] J. Xiao, R. Stolkin, Y. Gao, A. Leonardis, Robust fusion of color and depth
data for RGB-D target tracking using adaptive range-invariant depth
models and spatio-temporal consistency constraints, IEEE Transactions
on Cybernetics 48 (8) (2018) 2485–2499.
[8] N. Cvejic, S. G. Nikolov, H. D. Knowles, A. Loza, A. Achim,
C. N. Canagarajah, The effect of pixel-level fusion on object tracking in
multi-sensor surveillance video, in: IEEE Conference on Computer Vi-
sion and Pattern Recognition, 2007.
[9] U. Kart, J.-K. Kamarainen, J. Matas, How to make an RGBD tracker?, in:
European Conference on Computer Vision Workshop, 2018.
[10] P. Zhang, J. Zhao, D. Wang, H. Lu, X. Yang, Jointly modeling motion and
appearance cues for robust RGB-T tracking, CoRR abs/2007.02041.
[11] C. Li, A. Lu, A. Zheng, Z. Tu, J. Tang, Multi-adapter RGBT tracking, in:
IEEE International Conference on Computer Vision Workshop, 2019.
[12] S.-K. Weng, C.-M. Kuo, S.-K. Tu, Video object tracking using adaptive
Kalman filter, Journal of Visual Communication and Image Representa-
tion 17 (6) (2006) 1190–1208.
[14] C. Yang, R. Duraiswami, L. Davis, Fast multiple object tracking via a hi-
erarchical particle filter, in: IEEE International Conference on Computer
Vision, 2005.
[17] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-
task sparse learning, in: IEEE Conference on Computer Vision and Pat-
tern Recognition, 2012.
[18] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object track-
ing using adaptive correlation filters, in: IEEE Conference on Computer
Vision and Pattern Recognition, 2010.
[19] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature
integration, in: European Conference on Computer Vision Workshop,
2014.
[21] Z. Zhang, H. Peng, Deeper and wider siamese networks for real-time
visual tracking, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2019.
[27] D. Y. Kim, M. Jeon, Data fusion of radar and image measurements for
multi-object tracking via Kalman filtering, Information Sciences 278 (2014) 641–652.
[33] Z. Cai, J. Han, L. Liu, L. Shao, RGB-D datasets using Microsoft Kinect or
similar sensors: A survey, Multimedia Tools and Applications 76 (2016)
4313–4355.
[36] J. Ma, Y. Ma, C. Li, Infrared and visible image fusion methods and appli-
cations: A survey, Information Fusion 45 (2019) 153–178.
[40] Y. Xie, Y. Lu, S. Gu, RGB-D object tracking with occlusion detection,
in: International Conference on Computational Intelligence and Security,
2019.
[42] N. An, X.-G. Zhao, Z.-G. Hou, Online RGB-D tracking via detection-
learning-segmentation, in: International Conference on Pattern Recog-
nition, 2016.
[44] P. Ding, Y. Song, Robust object tracking using color and depth images
with a depth based occlusion handling and recovery, in: International
Conference on Fuzzy Systems and Knowledge Discovery, 2015.
[45] G. M. Garcia, D. A. Klein, J. Stuckler, Adaptive multi-cue 3D tracking of
arbitrary objects, in: Joint DAGM and OAGM Symposium, 2012.
[49] J. Leng, Y. Liu, Real-time RGB-D visual tracking with scale estimation
and occlusion handling, Access 6 (2018) 24256–24263.
[50] W. Liu, X. Tang, C. Zhao, Robust RGBD tracking via weighted convolu-
tion operators, Sensors 20 (8) (2020) 4496–4503.
[51] Z. Ma, Z. Xiang, Robust object tracking with RGBD-based sparse learn-
ing, Frontiers of Information Technology and Electronic Engineering
18 (7) (2017) 989–1001.
[54] Y. Wu, E. Blasch, G. Chen, L. Bai, H. Ling, Multiple source data fusion via
sparse representation for robust visual tracking, in: International Confer-
ence on Information Fusion, 2011.
[55] Y. Wang, C. Li, J. Tang, D. Sun, Learning collaborative sparse correlation
filter for real-time multispectral object tracking, in: International Confer-
ence on Brain Inspired Cognitive Systems, 2018.
[59] C. Li, C. Zhu, Y. Huang, J. Tang, L. Wang, Cross-modal ranking with soft
consistency and noisy labels for robust RGB-T tracking, in: European
Conference on Computer Vision, 2018.
[64] H. Liu, F. Sun, Fusion tracking in color and infrared images using joint
sparse representation, Science China Information Sciences 55 (3) (2012) 590–599.
[65] S. Zhai, P. Shao, X. Liang, X. Wang, Fast RGB-T tracking via cross-modal
correlation filters, Neurocomputing 334 (2019) 172–181.
[66] Y. Gao, C. Li, Y. Zhu, J. Tang, T. He, F. Wang, Deep adaptive fusion net-
work for high performance RGBT tracking, in: IEEE International Con-
ference on Computer Vision Workshop, 2019.
[67] Y. Zhu, C. Li, B. Luo, J. Tang, X. Wang, Dense feature aggregation and
pruning for RGBT tracking, in: ACM International Conference on Multi-
media, 2019.
[72] Y. Zhu, C. Li, Y. Lu, L. Lin, B. Luo, J. Tang, FANet: Quality-aware feature
aggregation network for RGB-T tracking, CoRR abs/1811.09855.
[75] Q. Wang, J. Fang, Y. Yuan, Multi-cue based tracking, Neurocomputing
131 (2014) 227–236.
[76] H. Zhang, M. Cai, J. Li, A real-time RGB-D tracker based on KCF, in:
Chinese Control And Decision Conference, 2018.
[78] C. Luo, B. Sun, K. Yang, T. Lu, W.-C. Yeh, Thermal infrared and visible
sequences fusion tracking based on a hybrid tracking framework with
adaptive weighting scheme, Infrared Physics and Technology 99 (2019) 265–276.
[80] G. Li, L. Huang, P. Zhang, Q. Li, Y. Huo, Depth information aided con-
strained correlation filter for visual tracking, in: International Conference
on Geo-Spatial Knowledge and Intelligence, 2019.
[82] Y. Chen, Y. Shen, X. Liu, B. Zhong, 3D object tracking via image sets and
depth-based occlusion detection, Signal Processing 112 (2015) 146–153.
[84] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The KITTI
dataset, International Journal of Robotics Research 32 (11) (2013) 1231–
1237.
[85] J. Liu, Y. Liu, Y. Cui, Y. Chen, Real-time human detection and tracking
in complex environments using single RGBD camera, in: IEEE Interna-
tional Conference on Image Processing, 2013.
[86] S. Song, J. Xiao, Tracking revisited using RGBD camera: Unified bench-
mark and baselines, in: IEEE International Conference on Computer Vi-
sion, 2013.
[91] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature
integration, in: European Conference on Computer Vision, 2014.
[94] S. Shojaeilangari, W.-Y. Yau, K. Nandakumar, J. Li, E. K. Teoh, Robust
representation and recognition of facial emotions using extreme sparse
learning, IEEE Transactions on Image Processing 24 (7).
[104] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni-
tion, in: IEEE Conference on Computer Vision and Pattern Recognition,
2016.
[107] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: IEEE
Conference on Computer Vision and Pattern Recognition, 2013.
[108] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions
on Pattern Analysis and Machine Intelligence 37 (9).
[114] B. K. P. Horn, B. G. Schunck, Determining optical flow, Artificial Intelligence 17 (1–3) (1981) 185–203.
[121] C. Li, L. Liu, A. Lu, Q. Ji, J. Tang, Challenge-aware RGBT tracking, in:
European Conference on Computer Vision, 2020.
[124] H. Xu, T.-S. Chua, Fusion of AV features and external information sources for event detection in team sports video, ACM Transactions on Multimedia Computing, Communications, and Applications 2 (1).