
Multi-modal Visual Tracking: Review and

Experimental Comparison

Pengyu Zhang, Dong Wang, Huchuan Lu


arXiv:2012.04176v1 [cs.CV] 8 Dec 2020

Abstract

Visual object tracking, as a fundamental task in computer vision, has drawn


much attention in recent years. To extend trackers to a wider range of applications, researchers have introduced information from multiple modalities to handle specific scenes, and multi-modal tracking has become a promising research direction with emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, we first summarize multi-modal tracking algorithms, especially visible-depth (RGB-D) and visible-thermal (RGB-T) tracking, in a unified taxonomy from different aspects. Second, we provide a detailed description of the related benchmarks and challenges. Furthermore, we conduct extensive experiments to analyze the effectiveness of trackers on five datasets: PTB, VOT19-RGBD, GTOT, RGBT234, and VOT19-RGBT. Finally, we discuss various future directions from different perspectives, including model design and dataset construction, for further research.
Keywords: Visual tracking, Object tracking, Multi-modal fusion, RGB-T
tracking, RGB-D tracking

1. Introduction

Visual object tracking is a fundamental task in computer vision, which is


widely applied in many areas, such as smart surveillance, autonomous driving,
and human-computer interaction. Traditional tracking methods are mainly
based on visible (RGB) images captured by a monocular camera. When the
target suffers long-term occlusion or appears in low-illumination scenes, an RGB tracker can hardly work well and may fail. With easily accessible binocular cameras, tracking with multi-modal information (e.g., visible-depth, visible-thermal, visible-radar, and visible-laser) is a promising research direction that has become popular in recent years. Many datasets and challenges have been presented [1, 2, 3, 4, 5, 6]. Motivated by these developments, trackers exploiting multi-modal cues have been proposed, offering improved accuracy and robustness in extreme tracking scenarios [7, 8, 9, 10, 11].
Despite the emergence of many multi-modal trackers, a comprehensive and in-depth survey has not yet been conducted. To this end, we revisit existing methods from a
unified view and evaluate them on popular datasets. The contributions of this
work can be summarized as follows.

• Substantial review of multi-modal tracking methods from various as-


pects in a unified view. We exploit the similarity between RGB-D and RGB-T tracking and classify them in a unified framework. We categorize 56 existing multi-modal tracking methods based on the auxiliary modality purpose and tracking framework, and summarize the related datasets with their corresponding metrics. This taxonomy with detailed analysis covers the main knowledge in the field and provides an in-depth introduction to multi-modal tracking models.

• A comprehensive and fair evaluation of popular trackers on several


datasets. We collect 29 methods, consisting of 14 RGB-D and 15 RGB-T trackers, and evaluate them on five datasets in terms of accuracy and speed for various applications. We further analyze the advantages and drawbacks of different frameworks through qualitative and quantitative experiments.

• A prospective discussion for multi-modal tracking. We present the po-


tential directions of multi-modal tracking in model design and dataset construction, which can provide guidance to researchers.

The rest of the paper is organized as follows. Section 2 introduces basic concepts and previous related surveys. Section 3 provides a taxonomical review of multi-modal tracking. Section 4 introduces existing datasets, challenges, and the corresponding evaluation metrics. Section 5 reports experimental results on several datasets and challenges. Finally, Section 6 discusses future directions of multi-modal tracking. All the collected materials and analyses will be released at https://github.com/zhang-pengyu/Multimodal_tracking_survey.

2. Background

2.1. Visual Object Tracking

Visual object tracking aims to estimate the coordinates and scales of a spe-
cific target throughout the given video. In general, tracking methods can be
divided into two types according to used information: (1) single-modal track-
ing and (2) multi-modal tracking. Single-modal tracking locates the target cap-
tured by a single sensor, such as laser, visible and infrared cameras, to name
a few. In recent years, tracking with RGB images, which are computationally efficient, easily accessible, and of high quality, has become increasingly popular, and numerous methods have been proposed to improve tracking accuracy and speed. In RGB tracking, several frameworks, including the Kalman filter (KF) [12, 13], particle filter (PF) [14, 15], sparse learning (SL) [16, 17], correlation filter (CF) [18, 19], and CNNs [20, 21], have been employed. In 2010, Bolme et al. [18] proposed a CF-based method called MOSSE,
which achieves high-speed tracking with reasonable performance. Thereafter,
many researchers have aimed to develop the CF framework to achieve state-
of-the-art performance. Li et al. [19] achieve scale estimation and multiple fea-
ture integration on the CF framework. Martin et al. [22] eliminate the bound-
ary effect by adding a spatial regularization to the learned filter at the cost of
speed decrease. Galoogahi et al. [23] provide another efficient solution to solve
the boundary effect, thereby maintaining a real-time speed. Another popular
framework is Siamese-based network, which is first introduced by Bertinetto
et al. [20]. Then, deeper and wider networks are utilized to improve target rep-
resentation. Zhang et al. [21] find that the padding operation in deeper networks induces a position bias, suppressing the capability of the powerful network. They address the position-bias problem and improve the tracking performance
significantly. Some methods perform better scale estimation by predicting seg-
mentation masks rather than bounding boxes [24, 25]. In summary, many efforts have been made in this field. However, target appearance, the main cue from visible images, is not reliable for tracking when the target encounters extreme scenarios, including low illumination, out-of-view motion, and heavy occlusion. To this end, complementary cues are added to handle these challenges: a visible camera is assisted by other sensors, such as laser [26], depth [7], thermal [10], radar [27], and audio [28], to satisfy different requirements.
Since 2005, a series of methods has been proposed using various types of multi-
modal information. Song et al. [26] conduct multiple object tracking by us-
ing visible and laser data. Kim et al. [27] exploit the traditional Kalman filter
method for multiple object tracking with radar and visible images. Megherbi
et al. [28] propose a tracking method by combining vision and audio informa-
tion using belief theory. In particular, tracking with RGB-D and RGB-T data
has been the focus of attention using a portable and affordable binocular cam-
era. Thermal data can provide a powerful supplement to RGB images in some
challenging scenes, including night, fog, and rain. Besides, a depth map can provide an additional constraint to avoid tracking failure caused by heavy oc-
clusion and model drift. Lan et al. [29] apply the sparse learning method to
RGB-T tracking, thereby removing the cross-modality discrepancy. Li et al. [11]
extend an RGB tracker to the RGB-T domain, which achieves promising results.
Zhang et al. [10] jointly model motion and appearance information to achieve
accurate and robust tracking. Kart et al. [7] introduce an effective constraint
using a depth map to guide model learning. Liu et al. [30] transform the tar-
get position to 3D coordinate using RGB and depth images, and then perform
tracking using the mean shift method.
2.2. Previous Surveys and Reviews
Table 1 summarizes existing surveys related to multi-modal processing, covering areas such as image fusion, object tracking, and multi-

Table 1: Summary of existing surveys in related fields.

Index | Year | Reference | Area | Description | Publication
1 | 2010 | [31] | Multi-modal fusion | Provides an overview on how to fuse multimodal data. | MS
2 | 2016 | [32] | Multi-modal object tracking | Provides a general review of both single-modal and multi-modal tracking methods. | AIR
3 | 2016 | [33] | RGB-D datasets | Collects popular RGB-D datasets for different applications and provides an analysis of their popularity and difficulty. | MTA
4 | 2017 | [34] | RGB-D multiple human tracking | Surveys the existing multiple human tracking methods with RGB-D data from two aspects. | IET CV
5 | 2019 | [35] | Multimodal machine learning | A general survey covering how to represent, translate and fuse multimodal data according to various tasks. | TPAMI
6 | 2019 | [36] | RGB-T image fusion | Gives a detailed survey on existing methods and applications for RGB-T image fusion. | IF
7 | 2020 | [37] | RGB-T object tracking | A survey of the existing RGB-T tracking methods. | IF

modal machine learning. Some of them focus on specific multi-modal infor-


mation or single tasks. Cai et al. [33] collect the datasets captured by RGB-D
sensors, which are used in many different applications, such as object recogni-
tion, scene classification, hand gesture recognition, 3D-simultaneous localiza-
tion and mapping, and pose estimation. Camplani et al. [34] focus on multiple
human tracking with RGB-D data and conduct an in-depth review of different
aspects. A comprehensive and detailed survey by Ma et al. [36] is presented to
summarize the methods regarding RGB-T image fusion. Recently, a survey on
RGB-T object tracking [37] is presented, which analyzes various RGB-T track-
ers and conducts quantitative analysis on several datasets.
Other surveys aim to give a general introduction on how to utilize and rep-
resent multi-modal information among a series of tasks. Atrey et al. [31] present
a brief introduction on multi-modal fusion methods and analyze different fu-
sion types in 2010. Walia et al. [32] introduce a general survey on tracking with
multiple modality data in 2016. Baltrusaitis et al. [35] provide a detailed review
of the machine learning method using multi-modal information.
Our survey differs from the most related works [32, 37] in several respects. First, we aim to conduct a general survey on how to utilize multi-modal information in visual object tracking, especially RGB-D and RGB-T tracking, in a unified view. Furthermore, different from [32], we pay much attention to recent deep-learning-based methods, which had not yet been proposed in 2016.
Figure 1: Structure of the three classification aspects and the algorithms in each category.

- Auxiliary Modality Purpose
  - Feature Learning: LF [73, 74, 75, 4, 76, 77, 8, 78, 10]; EF [42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 29, 56, 57, 58, 59, 2, 60, 3, 61, 62, 63, 64, 1, 65, 66, 11, 67, 68, 69, 70, 71, 72]
  - Pre-processing: [38, 39, 30, 40, 41, 7]
  - Post-processing: SE [43, 46, 49, 79, 80]; OR [43, 82, 44, 46, 47, 48, 30, 51, 74, 79, 76, 41, 9, 80, 10]
- Tracking Framework
  - Generative: SL [51, 57, 29, 56, 58, 60, 61, 63, 64, 1, 2]; MS [39, 30, 77]; Others [53, 73]
  - Discriminative: PF [38, 45, 51, 52, 54, 8, 64]; CF [42, 43, 46, 7, 47, 48, 49, 50, 4, 76, 78, 79, 55, 68, 65, 9, 80, 10]; DL [40, 66, 11, 67, 69, 70, 71, 72, 81, 62]; Others [82, 74, 75, 41, 59, 44, 83]
- Dataset
  - Public Datasets: RGB-D [45, 84, 85, 86, 87, 4]; RGB-T [88, 89, 3, 1, 2]
  - Challenges: [5, 6]

Finally, compared with [37], which focuses only on RGB-T tracking, our study provides a more substantial and comprehensive survey with a larger scope, covering both RGB-D and RGB-T tracking.

3. Multi-modal Visual Tracking

This section provides an overview of multi-modal tracking from two aspects: (1) auxiliary modality purpose, i.e., how the information of the auxiliary modality is utilized to improve tracking performance; and (2) tracking framework, i.e., the type of framework a tracker belongs to. Note that, in this study, we mainly focus on visible-thermal (RGB-T) and visible-depth (RGB-D) tracking, and we consider the visible modality as the main modality and the other sources (i.e., thermal and depth) as auxiliary modalities. The taxonomic structure is shown in Figure 1.

6
Figure 2: Workflows of early fusion (EF) and late fusion (LF). EF-based methods conduct feature fusion and model the modalities jointly, while LF-based methods model each modality individually and then combine their decisions.

3.1. Auxiliary Modality Purpose
We first discuss the auxiliary modality purpose in multi-modal tracking.
There are three main categories: (a) feature learning, where the feature repre-
sentations of auxiliary modality image are extracted to help locate the target;
(b) pre-processing, where the information from auxiliary modality is used be-
fore the target modeling; and (c) post-processing, where the information from
auxiliary modality aims to improve the model or refine the bounding box.

3.1.1. Feature Learning


Methods based on feature learning extract information from auxiliary modal-
ity through various feature methods, and then adopt modality fusion to com-
bine the data from different sources. Feature learning is an explicit way to
utilize multi-modal information, and most of the corresponding methods treat the image of the auxiliary modality as an extra channel of the model. According to the fusion strategy, as shown in Figure 2, these methods can be further cate-
gorized as methods based on early fusion (EF) and late fusion (LF) [31, 90].
EF-based methods combine multi-modal information in the feature level us-
ing concatenation and summation approaches; while LF-based methods model
each modality individually and obtain the final result by considering both de-
cisions of modalities.
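To make the two pipelines concrete, the following NumPy sketch (our own illustration, not code from any surveyed tracker) contrasts them: early fusion concatenates per-modality feature vectors before a single model, while late fusion combines per-modality response maps (decisions) by weighted summation; the fusion weights here are fixed placeholders.

```python
import numpy as np

def early_fusion(feat_rgb, feat_aux):
    """Early fusion: aggregate per-modality features into one vector
    before handing them to a single tracking model."""
    return np.concatenate([feat_rgb, feat_aux], axis=-1)

def late_fusion(response_rgb, response_aux, w_rgb=0.5, w_aux=0.5):
    """Late fusion: each modality produces its own response map
    (decision); the final decision is a weighted combination."""
    fused = w_rgb * response_rgb + w_aux * response_aux
    return np.unravel_index(np.argmax(fused), fused.shape)  # peak = target position
```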

Early Fusion (EF). In EF-based methods, the features extracted from both modal-
ities are first aggregated as a larger feature vector and then sent to the model
to locate the target. The workflow of EF-based trackers is shown in the left
part of Figure 2. For most of the trackers, EF is the primary choice in the

7
multi-modal tracking task, while visible and auxiliary modalities are treated
alike with the same feature extraction methods. Camplani et al. [43] utilize
HOG feature for both visible and depth maps. Kart et al. [47] extract multiple
features to build a robust tracker for RGB-D tracking. Similar methods exist
in [44, 48, 49, 42, 54, 56, 58, 2, 60, 3]. However, the auxiliary modality often conveys information different from that of the visible image. For example, thermal and
depth images contain temperature and depth data, respectively. The afore-
mentioned trackers apply feature fusion, ignoring the modality discrepancy,
which decreases the tracking accuracy and causes the tracker to drift easily. To
this end, some trackers differentiate the heterogeneous modalities by applying
different feature methods. In [45], the gradient feature is extracted in a depth
map, while the average color feature is used to represent the target in the vis-
ible modality. Meshgi et al. [52] use the raw depth information and many fea-
ture methods (HOG, LBP, and LoG) for RGB images. In [29, 57, 64], the HOG
and intensity features are used for visible and thermal modalities, respectively.
Due to the increasing cost involved in feature concatenation and the misalign-
ment of multi-modal data, some methods tune the feature representation af-
ter feature extraction by the pruning [67] or re-weighting operation [50, 72],
which can compress the feature space and exploit the cross-modal correlation.
In DAFNet [67], a feature pruning module is proposed to eliminate noisy and
redundant information. Liu et al. [50] introduce a spatial weight to highlight
the foreground area. Zhu et al. [72] exploit modality importance using the pro-
posed multi-modal aggregation network.

Late fusion (LF). LF-based methods process the two modalities in parallel, and an independent model is built for each modality to make decisions.
Then, the decisions are combined by using weighted summation [78, 74, 4, 76],
calculating joint distribution function [73, 8, 77], and conducting multi-step
localization [75]. Conaire et al. [73] assume the independence between multi-
modal data, and then obtain the result by multiplying the target’s likelihoods
in both modalities. A similar method is adopted in literature [77]. Xiao et al. [4]

8
fuse two single-modal trackers via an adaptive weight map. In MCBT [75],
data from multiple sources are used stepwise to locate the target. A rough
target position is first estimated by optical flow in the visible domain, and the
final result is determined by part-based matching method with RGB-D data.

3.1.2. Pre-Processing
Due to the available depth map, the second purpose of auxiliary modality
is to transform the target into 3D space before target modeling via RGB-D data.
Instead of tracking in the image plane, these types of methods model the target
in world coordinates, and 3D trackers are designed accordingly [38, 39, 7, 30, 40, 41]. Liu et al. [30] extend the classical mean shift tracker to 3D. In OTR [7],
the dynamic spatial constraint generated by the 3D target model enhances the
discrimination of DCF trackers in dealing with out-of-view rotation and heavy
occlusion. Although significant performance is achieved, the computational cost of 3D reconstruction cannot be neglected. Furthermore, the performance is highly dependent on the quality of the depth data and the availability of mapping functions between the 2D and 3D spaces.
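The pre-processing step typically back-projects image pixels into 3D camera coordinates using the depth map and the camera intrinsics. The sketch below shows this standard pinhole back-projection; the intrinsic parameters fx, fy, cx, cy are placeholders, and the code illustrates the general idea rather than the pipeline of any specific tracker.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Map every pixel (u, v) with depth d to a 3D point
    (X, Y, Z) in camera coordinates via the pinhole model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # (h, w, 3) point map
```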

3.1.3. Post-processing
Compared with the RGB image, which provides more detailed content, the depth image highlights object contours, allowing the target to be segmented from its surroundings via depth variance. Inspired by this property of the depth map, many
RGB-D trackers utilize the depth information to determine whether the occlu-
sion occurs and estimate the target scale [43, 46, 49, 79].

Occlusion Reasoning (OR). Occlusion is a traditional challenge in the track-


ing task because the dramatic appearance variation leads to model drift.
Depth cue is a powerful feature to detect target occlusion; thus, the tracker can
apply a global search strategy or model updating mechanism to avoid learning
from the occluded target. In [43], occlusion is detected when the depth variance
is large. Then, the tracker enlarges the search region to detect the re-appeared tar-
get. Ding et al. [44] propose an occlusion recovery method, where a depth his-

9
togram is recorded to examine whether the occlusion occurs. If the occlusion
is detected, the tracker locates the occluder and searches the candidate around.
In [10], Zhang et al. propose a tracker switcher to detect occlusion based on the
template matching method and tracking reliability. The tracker can dynam-
ically select which information is used for tracking between appearance and
motion cues, thereby improving the robustness of the tracker significantly.
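As a rough illustration of depth-based occlusion reasoning (not the exact tests used in [43], [44], or [10]), one can compare the depth statistics inside the current bounding box with the modeled target depth: a large fraction of pixels that are much closer to the camera, or a sudden jump in depth variance, suggests an occluder in front of the target. The thresholds below are arbitrary placeholders.

```python
import numpy as np

def occlusion_detected(depth_roi, target_depth, depth_tol=0.3, var_tol=4.0, ratio_tol=0.4):
    """Flag occlusion when many pixels in the box are much closer to the
    camera than the modeled target depth, or when the depth variance
    inside the box jumps (all thresholds are illustrative)."""
    closer = depth_roi < (target_depth - depth_tol)
    occluder_ratio = np.mean(closer)
    var_jump = np.var(depth_roi) > var_tol
    return occluder_ratio > ratio_tol or var_jump
```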

Scale Estimation (SE). SE is an important module in tracking task, which can


obtain a tight bounding box and avoid drift. CF-based trackers estimate the
target scale by sampling the search region at multiple resolutions [91] or learn-
ing a filter for scale estimation [92], neither of which can effectively adapt to the tar-
get’s scale change [49]. Both thermal and depth maps provide clear contour
information and a coarse pixel-wise target segmentation map. With such in-
formation, the target shape can be effectively estimated. In [46], the number
of scales is adaptively changed to fit the scale variation. SEOH [49] uses space
continuity-of-depth information to achieve accurate scale estimation with mi-
nor time cost. The pixels belonging to the target are clustered by the K-means
method in the depth map, and the sizes of the target and search regions are
determined by the clustering result.
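The clustering-based scale estimation can be sketched as follows. This is a simplified illustration of the idea behind SEOH [49], assuming scikit-learn is available and assuming that the depth cluster nearest the camera corresponds to the target (which only holds when no occluder is present); it is not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def estimate_scale(depth_roi, n_clusters=2):
    """Cluster depths inside the search region; pixels in the cluster
    closest to the camera are assumed to belong to the target, and a
    bounding box (x, y, w, h) is fitted to that mask."""
    d = depth_roi.reshape(-1, 1).astype(np.float32)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(d)
    centers = [d[labels == k].mean() for k in range(n_clusters)]
    target_mask = (labels == int(np.argmin(centers))).reshape(depth_roi.shape)
    ys, xs = np.nonzero(target_mask)
    return xs.min(), ys.min(), xs.max() - xs.min() + 1, ys.max() - ys.min() + 1
```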

3.2. Tracking Framework

In this section, multi-modal trackers are categorized based on the methods


used in target modeling, including generative and discriminative. The gener-
ative framework focuses on directly modeling the representation of the target.
During tracking, the target is captured by matching the data distribution in the
incoming frame. However, generative methods only learn the representations
for the foreground information while ignoring the influence of surroundings,
suffering from background cluttering or distractions [93]. In comparison, the
discriminative models construct an effective classifier to distinguish the object
against the surroundings. The tracker outputs the confidence score of sampled
candidates and chooses the best-matching patch as the target. Various patch sampling schemes are exploited, e.g., sliding window [50], particle filter [38, 45], and Gaussian sampling [11]. Furthermore, a crucial task is utilizing powerful features to represent the target. Thanks to emerging convolutional networks, more trackers have been built with efficient CNNs. We introduce the various frameworks in the following paragraphs.

Figure 3: Framework of OAPF [52]. The particle filter method is applied with occlusion handling, in which an occlusion model is constructed alongside the template model. When the target is occluded, the occlusion model is used to predict the position without updating the template model.

3.2.1. Generative Methods


Sparse Learning (SL). SL has been popular in many tasks including image
recognition [94] and classification [95], object tracking [96], and others. In SL-
based RGB-T trackers, the tracking task can be formulated as a minimization
problem for the reconstruction error with the learned sparse dictionary [57, 29,
56, 58, 60, 63, 64, 1]. Lan et al. [29] propose a unified learning paradigm to
learn the target representation, modality-wise reliability and classifier, collab-
oratively. Similar methods are also applied in the RGB-D tracking task. Ma
et al. [51] construct an augmented dictionary consisting of target and occlusion
templates, which achieves accurate tracking even in heavy occlusion. SL-based
trackers achieve promising results at the expense of computation cost. These
trackers cannot meet the requirements of real-time tracking.
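The core of SL-based tracking is to reconstruct each candidate from a dictionary of target templates under an ℓ1 constraint and to rank candidates by reconstruction error. The sketch below uses scikit-learn's Lasso as a generic ℓ1 solver and an equal weighting of the RGB and thermal dictionaries; it illustrates the principle only and is not the formulation of any specific tracker cited above.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sl_score(candidate, dictionary, alpha=0.01):
    """Score a candidate feature vector by its sparse reconstruction error
    w.r.t. a template dictionary (columns = templates); lower is better."""
    solver = Lasso(alpha=alpha, max_iter=1000, positive=True)
    solver.fit(dictionary, candidate)
    residual = candidate - dictionary @ solver.coef_
    return np.linalg.norm(residual)

def track_sl(candidates, dict_rgb, dict_t):
    """Pick the candidate with the smallest joint reconstruction error
    over the RGB and thermal dictionaries (equal weighting assumed)."""
    errs = [sl_score(c_rgb, dict_rgb) + sl_score(c_t, dict_t)
            for c_rgb, c_t in candidates]
    return int(np.argmin(errs))
```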

Mean Shift (MS). MS-based methods maximize the similarity between the his-
tograms of candidates and the target template, and conduct fast local search
using the mean shift technique. These methods usually assume that the ob-
ject overlaps itself in consecutive frames [77]. In [39, 30], the authors extend

the 2D MS method to 3D with RGB-D data.

Figure 4: Workflow of JMMAC [10]. A CF-based tracker is used to model the appearance cue, while both camera motion and target motion are considered, thereby achieving substantial performance.

Conaire et al. [77] propose an MS tracker using a spatiogram instead of a histogram. Compared with discriminative methods, MS-based trackers directly regress the offset of the target, which omits dense sampling. These methods with lightweight features can achieve real-time performance, whereas their performance advantage is not obvious.
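For reference, the classical 2D mean-shift localization step can be written with OpenCV as below; the 3D extensions in [39, 30] replace the color back-projection with a likelihood computed over the RGB-D point cloud, which is not shown here.

```python
import cv2
import numpy as np

def meanshift_step(frame_hsv, target_hist, window):
    """One mean-shift localization: back-project the target's hue histogram
    onto the frame and shift the window toward the mode of that map."""
    backproj = cv2.calcBackProject([frame_hsv], [0], target_hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    _, window = cv2.meanShift(backproj, window, criteria)
    return window  # (x, y, w, h)
```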

Other Frameworks. Other generative methods have been applied to tracking


tasks. Coraire et al. [73] model the tracked object via Gaussian distribution and
select the best-matched patch via similarity measure. Chen et al. [53] model
the statistics of each individual modality and the relationship between RGB
and thermal data using the expectation maximization algorithm. These meth-
ods can model individual or complementary modalities, thereby achieving a
flexible framework for different scenes.

3.2.2. Discriminative Methods


Particle Filter (PF). The PF framework is a Bayesian sequential importance
sampling technique [97]. It consists of two steps, i.e., prediction and updat-
ing. In the prediction step, given the state observations z1:t = {z1 , z2 , ..., zt }
during the previous t frames, the posterior distribution of the state xt is pre-
dicted using Bayesian rule as follows:

\[
p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}, \tag{1}
\]

where p(x_t | z_{1:t-1}) is estimated by a set of N particles, each carrying a weight w_t^i. In the updating step, w_t^i is updated as

\[
w_t^i \propto p\left(z_t \mid x_t = x_t^i\right). \tag{2}
\]

Figure 5: Framework of MANet [11]. The generic adapter (GA) is used to extract common information from the RGB-T images, the modality adapter (MA) aims to exploit the different properties of the heterogeneous modalities, and the instance adapter (IA) models the appearance properties and temporal variations of a certain object.

In the PF framework, the restrictions of linearity and Gaussianity imposed by


the Kalman filter are relaxed, thereby leading to accurate and robust tracking [8]. Sev-
eral works improve the PF method for multi-modal tracking task. Bibi et al. [38]
formulate the PF framework in 3D, which considers both representation and
motion models and propose a particle pruning method to boost the tracking
speed. Meshgi et al. [52] consider occlusion in approximation step to improve
PF in occlusion handling. Liu et al. [64] propose a new likelihood function for
PF to determine the goodness of particles, thereby promoting the performance.
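A minimal bootstrap particle filter following Eqs. (1) and (2) can be sketched as below; the random-walk motion model, the noise level, and the likelihood function are illustrative placeholders that each tracker defines differently (e.g., [64] designs its own likelihood for RGB-T data).

```python
import numpy as np

def particle_filter_step(particles, weights, observation, likelihood, motion_std=5.0):
    """One prediction/update cycle: propagate particles with a random-walk
    motion model, re-weight them by p(z_t | x_t = x_t^i) (Eq. 2), and resample."""
    n = len(particles)
    # Prediction: sample from p(x_t | x_{t-1}) with Gaussian motion noise.
    particles = particles + np.random.normal(0.0, motion_std, particles.shape)
    # Update: weights proportional to the observation likelihood.
    weights = weights * np.array([likelihood(observation, p) for p in particles])
    weights = weights / (weights.sum() + 1e-12)
    # Estimate the state as the weighted posterior mean, then resample.
    state = np.average(particles, axis=0, weights=weights)
    idx = np.random.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n), state
```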

Correlation Filter (CF). A CF-based tracker learns a discriminative template, the correlation filter, to represent the target. The online-learned filter is then used to detect the object in the next frame. As circular convolution can be accelerated in the Fourier domain, these trackers maintain competitive accuracy at high speed. In recent years, many CF-based variants have been proposed, such
as adding spatial regularization [98], introducing temporal constraint [99], and

equipping discriminative features [100], to increase the tracking performance.
Due to the advantage of CF-based trackers, many researchers aim to build
multi-modal trackers with the CF framework. Zhai et al. [65] introduce low-
rank constraint to learn the filters of both modalities collaboratively, thereby ex-
ploiting the relationship between RGB and thermal data. Hannuna et al. [46] ef-
fectively handle the scale change with the guidance of the depth map. Kart et al.
propose a long-term RGB-D tracker [7], which is designed based on CSRDCF [101]
and applies online 3D target reconstruction to facilitate learning robust filters.
The spatial constraint is learned from the 3D model of the target. When the
target is occluded, view-specific DCFs are used to robustly localize the target.
Camplani et al. [43] improve the CF method in scale estimation and occlusion
handling, while maintaining a real-time speed.
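As a reference point for the CF family, the sketch below follows the single-channel MOSSE formulation [18]: the filter is learned in the Fourier domain as a ratio of accumulated cross- and auto-spectra, and detection reduces to an element-wise multiplication. Multi-modal CF trackers typically feed stacked visible and auxiliary feature channels into such a filter, which is not shown here.

```python
import numpy as np

def train_mosse(patches, gauss_label, lam=1e-2):
    """Learn H* = sum(G * conj(F)) / (sum(F * conj(F)) + lambda) over training patches."""
    G = np.fft.fft2(gauss_label)
    A = np.zeros_like(G)
    B = np.zeros_like(G)
    for p in patches:
        F = np.fft.fft2(p)
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A / (B + lam)

def detect(H_conj, patch):
    """Correlate the learned filter with a new patch; the response peak gives the target shift."""
    response = np.real(np.fft.ifft2(H_conj * np.fft.fft2(patch)))
    return np.unravel_index(np.argmax(response), response.shape)
```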

Deep Learning (DL). Due to their discriminative ability in feature representation, CNNs are widely used in the tracking task. Using deep features from various networks as a powerful alternative to traditional hand-crafted features is the simplest way to utilize a CNN. Liu et al. [50] extract deep features from VG-
GNet [102] and hand-crafted features to learn a robust representation. Li et
al.[68] concatenate deep features from visible and thermal images, and then
adaptively fuse them using the proposed FusionNet to achieve robust fea-
ture representation. Furthermore, some methods aim to learn an end-to-end
network for multi-modal tracking. In [11, 67, 69], a similar framework bor-
rowed from MDNet [103] is applied for tracking with different structures to
fuse the cross-modal data. These trackers achieve an obvious performance gain, but their speed is poor. Zhang et al. [71] propose an end-to-end RGB-T
tracking framework with real-time speed and balanced accuracy. They apply
ResNet [104] as the feature extractor and fuse RGB and thermal information in
the feature level, which are used for target localization and box estimation.
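A feature-level fusion head of the kind described above can be sketched in PyTorch as follows; the channel size and the 1x1-convolution re-weighting are our own illustrative choices and do not reproduce the architecture of mfDiMP [71] or FusionNet [68].

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Concatenate RGB and thermal feature maps, then let a 1x1 convolution
    learn per-channel weights for the fused representation."""
    def __init__(self, channels=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, feat_rgb, feat_t):
        return self.fuse(torch.cat([feat_rgb, feat_t], dim=1))

# Usage sketch: fuse two 256-channel feature maps of spatial size 22x22.
# fused = FeatureFusion(256)(torch.randn(1, 256, 22, 22), torch.randn(1, 256, 22, 22))
```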

Other Frameworks. Some methods use an explicit template matching method


to localize the object. These methods find the best-matched candidate with
the target captured in frames through a pre-defined matching function [75, 41].

Table 2: Summary of multi-modal tracking datasets.

Type | Name | Seq. Num. | Total Frames | Min. Frame | Max. Frame | Attr. | Resolution | Metrics | Year
RGB-D | PTB [86] | 100 | 21.5K | 0.04K | 0.90K | 11 | 640 × 480 | CPE, SR | 2013
RGB-D | STC [4] | 36 | 18.4K | 0.13K | 0.72K | 10 | 640 × 480 | SR, Acc., Fail. | 2018
RGB-D | CDTB [87] | 80 | 101.9K | 0.4K | 2.5K | 13 | 640 × 360 | F-score, Pr, Re | 2019
RGB-T | OTCBVS [88] | 6 | 7.2K | 0.6K | 2.3K | – | 320 × 240 | – | 2007
RGB-T | LITIV [89] | 9 | 6.3K | 0.3K | 1.2K | – | 320 × 240 | – | 2012
RGB-T | GTOT [1] | 50 | 7.8K | 0.04K | 0.37K | 7 | 384 × 288 | SR, PR | 2016
RGB-T | RGBT210 [2] | 210 | 104.7K | 0.04K | 4.1K | 12 | 630 × 460 | SR, PR | 2017
RGB-T | RGBT234 [3] | 234 | 116.6K | 0.04K | 8.1K | 12 | 630 × 460 | SR, PR, EAO | 2019
RGB-T | VOT-RGBT [5, 6] | 60 | 20.0K | 0.04K | 1.3K | 5 | 630 × 460 | EAO | 2019

Ding et al. [44] learn a Bayesian classifier and consider the candidate with max-
imal score as the target location, which can reduce the model drift. In [83], a
structured SVM [105] is learned by maximizing a classification score, which
can prevent the labeling ambiguity in the training process.

4. Datasets

With the emergence of multi-modal tracking methods, several datasets and


challenges for RGB-D and RGB-T tracking have been released. We summarize the
available datasets in Table 2.

4.1. Public dataset

4.1.1. RGB-D dataset


In 2012, a small-scale dataset called BoBoT-D [45] is constructed, consisting
of five RGB-D video sequences captured by a Kinect V1 sensor. Both overlap and hit rate are used for evaluation; they indicate the mean overlap between the result and the ground truth, and the percentage of frames in which the overlap is larger than 0.33, respectively. Song et al. [86] propose the well-known Princeton tracking bench-
mark (PTB) of 100 high-diversity RGB-D videos, five of which are used for
validation and others without available ground truth are used for testing. The
PTB dataset contains 11 attribute annotations, which are grouped into 5 categories in-
cluding target type, target size, movement, occlusion, and motion type. Two
metrics are conducted to evaluate the tracking performance: center position
error (CPE) and success rate (SR). CPE measures the Euclidean distance be-
tween the centers of the result and the ground truth, and SR is the average intersection over union (IoU) over all frames, which is defined as

\[
SR = \frac{1}{N}\sum_{i=1}^{N} u_i, \qquad
u_i =
\begin{cases}
1, & \mathrm{IoU}(bb_i, gt_i) > t_{sr} \\
0, & \text{otherwise}
\end{cases} \tag{3}
\]

where IoU(·, ·) denotes the overlap between the bounding box bb_i and the ground truth gt_i in the i-th frame. If the IoU is larger than the threshold t_sr, we consider the target to be successfully tracked. The final rank of a tracker is determined by its Avg. Rank, which is defined as the average ranking of SR over all attributes.

Figure 6: Examples in RGB-D tracking datasets (PTB: bag1, box_no_occ, computerbar1, cf_occ2; STC: athlete_move, athlete_static, bin_move, bin_static).

The STC dataset [4] consists of 36 RGB-D sequences and covers
some extreme tracking circumstances, such as outdoor and night scenes. This
dataset is captured by still and moving ASUS Xtion RGB-D cameras to evaluate
the tracking performance under conditions of arbitrary camera motion. A total
of 10 attributes are labeled to thoroughly analyze the dataset bias. The detailed
introduction of each attribute is given in the supplementary file.
The trackers are measured by using both SR and VOT protocols. The VOT
protocol evaluates the tracking performance in terms of two aspects: accuracy
and failure. Accuracy (Acc.) considers the IoU between the ground truth and
bounding box, and failure (Fail.) measures the times when the overlap is zero
and the tracker is set to re-initialize using the ground truth and continues to
track. CDTB [87] is the latest RGB-D tracking dataset, which contains 80 short-term and long-term videos. The target frequently goes out of view and is occluded, which requires the tracker to handle both tracking and re-detection. The
metrics are Precision (Pr.), Recall (Re.) and the overall F-score [106]. The preci-

sion and recall are defined as follows:

\[
Pr = \frac{\sum_{i=1}^{N} u_i}{\sum_{i=1}^{N} t_i}, \qquad
t_i =
\begin{cases}
1, & bb_i \text{ exists} \\
0, & \text{otherwise}
\end{cases} \tag{4}
\]

\[
Re = \frac{\sum_{i=1}^{N} u_i}{\sum_{i=1}^{N} g_i}, \qquad
g_i =
\begin{cases}
1, & gt_i \text{ exists} \\
0, & \text{otherwise}
\end{cases} \tag{5}
\]

where u_i is defined in Eq. 3. The F-score combines both precision and recall through

\[
F\text{-}score = \frac{2\, Pr \times Re}{Pr + Re}. \tag{6}
\]
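For clarity, the metrics of Eqs. (3)-(6) can be computed as in the sketch below. The convention that a box is `None` when the tracker reports the target absent (or when no ground truth exists) and the default thresholds are our own illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_rate(pred_boxes, gt_boxes, t_sr=0.5):
    """Eq. (3): fraction of frames whose overlap exceeds the threshold t_sr."""
    hits = [iou(b, g) > t_sr for b, g in zip(pred_boxes, gt_boxes)]
    return sum(hits) / len(hits)

def precision_recall_f(pred_boxes, gt_boxes, t=0.5):
    """Eqs. (4)-(6): precision over frames with a prediction, recall over
    frames with a ground truth, and their harmonic mean (F-score)."""
    u = [b is not None and g is not None and iou(b, g) > t
         for b, g in zip(pred_boxes, gt_boxes)]
    pr = sum(u) / max(1, sum(b is not None for b in pred_boxes))
    re = sum(u) / max(1, sum(g is not None for g in gt_boxes))
    f = 2 * pr * re / (pr + re) if pr + re > 0 else 0.0
    return pr, re, f
```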

4.1.2. RGB-T Dataset


In previous years, two RGB-T people detection datasets are used for track-
ing. The OTCBVS dataset [88] has six grayscale-thermal video clips captured
from two outdoor scenes. The LITIV dataset [89] contains nine sequences, con-
sidering the illumination influence and being captured indoors. These datasets
with limited sequences and low diversity have been deprecated. In 2016, Li
et al. construct the GTOT dataset for RGB-T tracking, which consists of 50
grayscale-thermal sequences under different scenarios and conditions. A new
attribute for RGB-T tracking is labeled as thermal crossover (TC), which indi-
cates that the target has similar temperature with the background. Inspired
by [107, 108], GTOT adopts success rate (SR) and precision rate (PR) for evalu-
ation. PR denotes the percentage of frames whose CPE is smaller than a thresh-
old tpr , which is set to 5 in GTOT to evaluate small targets. Li et al. [2] propose
a large-scale RGB-T tracking dataset, namely RGBT210, which contains 210
videos and 104.7k image pairs. This dataset also extends the number of at-
tributes to 12. A detailed description of the attributes can be found in the supplementary
file. The metrics are the same as for GTOT, except that t_pr is normally set to 20. In 2019,
the researchers enlarge the RGBT210 dataset and propose RGBT234 [3], which
provides individual ground truth for each modality. Furthermore, besides SR
and PR, expected average overlap (EAO) is used for evaluation, combining the
accuracy and failures in a principled manner.

Figure 7: Examples and corresponding attributes in the GTOT and RGBT234 tracking datasets (GTOT: BlackCar, BlackSwan1, Gathering, GarageHover; RGBT234: baginhand, kite4, carLight, car10).

4.2. Challenges for Multi-modal Tracking

Since 2019, both RGB-D and RGB-T challenges have been held by the VOT Committee [6, 5]. For the RGB-D challenge, trackers are evaluated on the CDTB dataset [87] with the same evaluation metrics. All the sequences are annotated with 5 attributes, namely, occlusion, dynamics change, motion change, size change, and camera motion. The RGB-T challenge constructs its dataset as a subset of RGBT234 with slight changes in the ground truth; it consists of 60 public RGB-T videos and 60 sequestered videos. Compared with RGBT234, VOT-RGBT utilizes a different evaluation metric, i.e., EAO, to measure trackers. In VOT2019-RGBT, trackers are re-initialized when a tracking failure is detected (the overlap between the bounding box and the ground truth is zero). In addition, VOT2020-RGBT replaces the re-initialization mechanism with a new anchor mechanism to avoid a causal correlation between the first reset and later ones [5].

5. Experiments

In this section, we analyze trackers on both public datasets and challenges in terms of overall comparison, attribute-based comparison, and speed. For a fair speed comparison, we report the device used (CPU or GPU), the platform (M: Matlab, MCN: MatConvNet, P: Python, and PT: PyTorch), and the setting (detailed CPU and GPU information). The available codes and detailed descriptions of the trackers are collected and listed in the supplementary files.

Table 3: Experimental results (SR) on the PTB dataset. Numbers in parentheses indicate their ranks. The top three results are in red, blue, and green fonts.

Algorithm | Avg. Rank | Target type (Human / Animal / Rigid) | Target size (Large / Small) | Movement (Slow / Fast) | Occlusion (Yes / No) | Motion type (Passive / Active)
OTR [7] 2.91 77.3(3) 68.3(6) 81.3(3) 76.5(4) 77.3(1) 81.2(2) 75.3(1) 71.3(2) 84.7(6) 85.1(1) 73.9(3)
WCO [50] 3.91 78.0(2) 67.0(7) 80.0(4) 76.0(5) 75.0(2) 78.0(7) 73.0(5) 66.0(6) 86.0(2) 85.0(2) 82.0(1)
TACF [48] 4.64 76.9(5) 64.7(9) 79.5(5) 77.2(3) 74.0(4) 78.5(5) 74.1(3) 68.3(5) 85.1(3) 83.6(4) 72.3(5)
CA3DMS [30] 6 66.3(12) 74.3(2) 82.0(1) 73.0(10) 74.2(3) 79.6(3) 71.4(8) 63.2(13) 88.1(1) 82.8(5) 70.3(8)
CSR-rgbd [9] 6.36 76.6(6) 65.2(8) 75.9(9) 75.4(6) 73.0(6) 79.6(3) 71.8(6) 70.1(3) 79.4(10) 79.1(7) 72.1(6)
3DT [38] 6.55 81.4(1) 64.2(11) 73.3(11) 79.9(2) 71.2(9) 75.1(12) 74.9(2) 72.5(1) 78.3(11) 79.0(8) 73.5(4)
DLST [42] 7 77.0(4) 69.0(5) 73.0(12) 80.0(1) 70.0(12) 73.0(13) 74.0(4) 66.0(6) 85.0(5) 72.0(13) 75.0(2)
OAPF [52] 7.27 64.2(13) 84.8(1) 77.2(6) 72.7(11) 73.4(5) 85.1(1) 68.4(12) 64.4(11) 85.1(3) 77.7(10) 71.4(7)
CCF [80] 7.55 69.7(10) 64.5(10) 81.4(2) 73.1(9) 72.9(7) 78.4(6) 70.8(9) 65.2(8) 83.7(7) 84.4(3) 68.7(12)
OTOD [40] 8.91 72.0(8) 71.0(3) 73.0(12) 74.0(7) 71.0(10) 76.0(9) 70.0(11) 65.0(9) 82.0(8) 77.0(12) 70.0(9)
DMKCF [47] 9 76.0(7) 58.0(13) 76.7(7) 72.4(12) 72.8(8) 75.2(11) 71.6(7) 69.1(4) 77.5(13) 82.5(6) 68.9(11)
DSKCF [46] 9.09 70.9(9) 70.8(4) 73.6(10) 73.9(8) 70.3(11) 76.2(8) 70.1(10) 64.9(10) 81.4(9) 77.4(11) 69.8(10)
DSOH [43] 11.45 67.0(11) 61.2(12) 76.4(8) 68.8(13) 69.7(13) 75.4(10) 66.9(13) 63.3(12) 77.6(12) 78.8(9) 65.7(13)
DOHR [44] 14 45.0(14) 49.0(14) 42.0(14) 48.0(14) 42.0(14) 50.0(14) 43.0(14) 38.0(14) 54.0(14) 54.0(14) 41.0(14)

5.1. Experimental Comparison on RGB-D Datasets

Overall Comparison. PTB provides a website1 for comprehensively evaluating RGB and RGB-D methods in an online manner. We collect the results of 14
RGB-D trackers on the website and sort them based on rank. The results are
shown in Table 3. We list the Avg. Rank, SR and corresponding rank of each
attribute. The Avg. Rank is calculated by averaging the rankings of all at-
tributes. According to Table 3, OTR achieves the best performance among all
the competitors, which is based on the CF framework without deep features.
The reason for this promising result is that the 3D reconstruction provides a useful constraint for filter learning. The same conclusion can be drawn from CA3DMS and 3DT, which construct a 3D model to locate the target via mean-shift and sparse learning methods, respectively. These trackers with traditional features are competitive with the deep trackers. DL-based trackers (WCO, TACF, and CSR-RGBD)
achieve substantial performance, which indicates the discrimination of deep
features. CF-based trackers achieve various results and are the most widely-
applied framework. Trackers based on original CF methods (DMKCF, DSKCF
and DSOH) perform significantly worse than those developed on improved
CF (OTR, WCO, and TACF). OTOD, which is based on point clouds, does not exploit the

1 http://tracking.cs.princeton.edu/

Table 4: The speed analysis of RGB-D trackers.

Trackers Speed Device Platform Setting


OTR [7] 2.0 CPU Matlab [email protected]
WCO [50] 9.5 GPU M & MCN [email protected], GTX TITAN
TACF [48] 13.1 GPU M & MCN [email protected]
CA3DMS [30] 63 CPU C++ [email protected]
DLST [42] 4.8 CPU M I5- [email protected]
OAPF [52] 0.9 CPU M –
CCF [80] 6.3 CPU M [email protected]
DMKCF [47] 8.3 CPU M [email protected]
DSKCF [46] 40 CPU M & C++ [email protected]
DSOH [43] 40 CPU M [email protected]

effectiveness of CNN and obtains the 10th rank on the PTB dataset.

Attribute-based Comparison. PTB provides 11 attributes from five aspects for comparison. CF-based trackers, including OTR, WCO, TACF, CSR-RGBD, and CCF, do not perform well when tracking animals. As animals move fast and irregularly, these trackers with online learning are prone to drift. When the target is small, CF-based trackers can provide precise tracking results. The occlusion handling mechanism contributes greatly to videos with target occlusion. The 3D mean shift method shows an obvious advantage in tracking rigid targets without occlusion. OAPF obtains above-average performance on tracking small objects, indicating the effectiveness of its scale estimation strategy.

Speed Analysis. The speeds of the RGB-D trackers are listed in Table 4. Most of the trackers cannot meet the requirement of real-time tracking. Trackers based on improved CF frameworks (OTR [7], DMKCF [47], CCF [80], WCO [50], and TACF [48]) are limited by their speed. Only two real-time trackers (DSKCF [46] and DSOH [43]) are based on the original CF architecture.

5.2. Experimental Comparison of RGB-T Datasets

We select 14 trackers as our baseline to perform overall comparison of the


GTOT and RGBT234 datasets. As only some of the trackers (JMMAC, MANet, and mfDiMP) release their code, we run these trackers on the two datasets and take the performance of the other trackers from their original papers. The overall results are shown in Table 5.

Table 5: Experimental results on the GTOT and RGBT234 datasets.

Tracker | GTOT SR | GTOT PR | RGBT234 SR | RGBT234 PR | Speed (fps) | Device | Platform | Setting
CMPP [81] 73.8 92.6 57.5 82.3 1.3 GPU PyTorch RTX 2080Ti
JMMAC [10] 73.2 90.1 57.3 79.0 4.0 GPU MatConvNet RTX 2080Ti
CAT [121] 71.7 88.9 56.1 80.4 20.0 GPU PyTorch GTX 2080Ti
MaCNet [62] 71.4 88.0 55.4 79.0 0.8 GPU PyTorch GTX 1080Ti
TODA [69] 67.7 84.3 54.5 78.7 0.3 GPU PyTorch GTX1080Ti
DAFNet [66] 71.2 89.1 54.4 79.6 23.0 GPU PyTorch RTX 2080Ti
MANet [11] 72.4 89.4 53.9 77.7 1.1 GPU PyTorch GTX1080Ti
DAPNet [67] 70.7 88.2 53.7 76.6 – GPU PyTorch GTX1080Ti
FANet [72] 69.8 88.5 53.2 76.4 1.3 GPU PyTorch GTX1080Ti
CMR [59] 64.3 82.7 48.6 71.1 8.0 CPU C++ –
SGT [2] 62.8 85.1 47.2 72.0 5.0 CPU C++ –
mfDiMP [71] 49.0 59.4 42.8 64.6 18.6 GPU PyTorch RTX 2080Ti
CSR [1] – – 32.8 46.3 1.6 CPU Matlab & C++ –
L1-PF [54] 42.7 55.1 28.7 43.1 – – – –
JSR [64] 43.0 55.3 23.4 34.3 0.8 CPU Matlab –

Overall Comparison. All the high-performance trackers are equipped with the
learned deep features and most of them are based on MDNet variants (CMPP,
MaCNet, TODA, DAFNet, MANet, DAPNet, and FANet), which achieve sat-
isfactory results. The CF-based tracker (JMMAC) obtains the second rank on
GTOT and RGBT234 datasets, which combines appearance and motion cues.
Compared with CF trackers, MDNet-based trackers can provide a precise target position with higher PR, but are inferior to the CF framework in scale estimation, which is reflected in SR. Trackers using the sparse learning technique (CSR, SGT) outperform L1-PF, which is based on the PF method. Although mfDiMP utilizes a state-of-the-art backbone, its performance is not satisfactory. The main reason may be that mfDiMP is trained on different data, generated by an image translation method [120], which introduces a gap with real RGB-T data.

Attribute-based Comparison. We conduct attribute-based comparison on RGBT234,


as shown in Figure 8. Improved MDNet-based trackers achieve satisfactory
performance on low-resolution, deformation, background clutter, fast motion,
and thermal crossover.

Figure 8: Attribute-based comparison on RGBT234. Precision and success are reported over the attributes deformation, background clutter, low resolution, motion blur, low illumination, heavy occlusion, fast motion, scale variation, no occlusion, thermal crossover, partial occlusion, and camera moving, for the trackers CMPP, TODA, MaCNet, JMMAC, MANet, SGT, DAPNet, CMR, mfDiMP, L1+PF, CSR, and JSR.

Since it models both camera motion and target motion, JMMAC is strong under camera movement and partial occlusion, but fails easily on fast-moving targets. This condition may result from CF-based trackers
having a fixed search region. When the target moves outside the region, the
target cannot be detected, thereby causing tracking failure. CMPP, which ex-
ploits inter-modal and cross-modal correlations, shows a large improvement on low illumination, low resolution, and thermal crossover. In these attributes, one modality is unreliable, and CMPP can eliminate the gap between heterogeneous modalities. A detailed figure for the attribute-based comparison can be found in the supplementary file.

Speed Analysis. For tracking speed, we list the platform and setting for fair
comparison in Table 5. DAFNet based on a real-time MDNet variant achieves
fast tracking with 23.0 fps. Although mfDiMP is equipped with ResNet-101,
it is the second fastest tracker because most parts of the network are trained offline without online tuning. The other trackers are constrained by their low speed and cannot be used in real-time applications.

5.3. Challenge Results on VOT2019-RGBD

We list the challenge results in Table 6. Both original RGB trackers, which do not utilize depth information, and RGB-D trackers are merged for evaluation. The trackers that obtain the top three ranks in F-score, precision, and recall are designed with similar components and frameworks. Unlike on the PTB dataset, DL-based methods show strong performance on VOT2019-RGBD, which results from these trackers utilizing large-scale visual datasets for offline training and

Table 6: Challenge results on the VOT2019-RGBD dataset.

Tracker | Rank | Modality | Type | F-score | Precision | Recall


SiamDW-D [21] 1 RGB-D OR, DL 0.681 0.677 0.685
ATCAIS [6] 2 RGB-D OR, DL 0.676 0.643 0.712
LTDSE-D [6] 3 RGB-D OR, DL 0.658 0.674 0.643
SiamM-Ds [24] 4 RGB-D SE, DL 0.455 0.463 0.447
MDNet [103] 5 RGB DL 0.455 0.463 0.447
MBMD [6] 6 RGB OR,DL 0.441 0.454 0.429
FuCoLoT [122] 7 RGB OR, CF 0.391 0.459 0.340
OTR [7] 8 RGB-D CF, EF, Pre 0.336 0.364 0.312
SiamFC [20] 9 RGB DL 0.333 0.356 0.312
CSR-rgbd [9] 10 RGB-D CF, EF, OR 0.332 0.375 0.397
ECO [100] 11 RGB CF 0.329 0.317 0.342
CA3DMS [30] 12 RGB MS, OR, Pre 0.271 0.284 0.259

equipping deeper networks. For instance, original RGB trackers with a DL framework also achieve excellent performance. Furthermore, occlusion handling is another necessary component of a high-performance tracker: VOT2019-RGBD focuses on long-term tracking with frequent target reappearance and out-of-view motion, and most of the top trackers are equipped with a re-detection mechanism. The CF-based trackers (FuCoLoT, OTR, CSR-RGBD, and ECO) do not perform well, which may stem from online updating on occluded patches, which degrades the discriminability of the model.

5.4. Challenge Results on VOT2019-RGBT

On the VOT2019-RGBT dataset, shown in Table 7, JMMAC, which exploits both appearance and motion cues, shows high accuracy and robustness and obtains the highest EAO by a large margin. Early fusion is the primary manner of RGB-T fusion, while the late fusion method (JMMAC) has great potential for improving tracking accuracy and robustness, which has not been fully exploited. All of the top six trackers are equipped with a CNN feature extractor, indicating the power of CNNs. SiamDW, using a Siamese network, is a general method that performs well in both RGB-D and RGB-T tasks. ATOM

Table 7: Challenge results on the VOT2019-RGBT dataset.

Tracker Rank Modality Type EAO Accuracy Robustness


JMMAC [10] 1 RGBT CF, LF 0.4826 0.6649 0.8211
SiamDW-T [21] 2 RGBT DL, EF 0.3925 0.6158 0.7839
mfDiMP [71] 3 RGBT DL, EF 0.3879 0.6019 0.8036
FSRPN [6] 4 RGBT DL, EF 0.3553 0.6362 0.7069
MANet [11] 5 RGBT DL, EF 0.3463 0.5823 0.7010
MPAT [6] 6 RGB DL 0.3180 0.5723 0.7242
CISRDCF [6] 7 RGBT CF, EF 0.2923 0.5215 0.6904
GESBTT [6] 8 RGBT Lucas–Kanade 0.2896 0.6163 0.6350

variants (mfDiMP and MPAT) are used to handle RGB-T tracking.

6. Further Prospects

6.1. Model Design

Multi-modal fusion. Compared with tracking with unimodal data, multi-modal tracking can exploit a powerful data fusion mechanism. Existing methods mainly focus on feature fusion, whereas the effectiveness of other fusion types has not been fully explored. Compared with early fusion, late fusion eliminates the bias caused by heterogeneous features learned from different modalities. Another advantage of late fusion is that various methods can be used to model each modality independently. Hybrid fusion, which combines the early and late fusion strategies, has been used in image segmentation [123] and sports video analysis [124] and is also a promising choice for multi-modal tracking.

Specific Network for Auxiliary Modality. As a gap between different modalities exists and their semantic information is heterogeneous, traditional methods use different features to extract more useful data [57, 64, 45]. Although much work has been done on network structures for visible image analysis, specific architectures for depth and thermal maps have not been deeply explored. Thus, DL-based methods [11, 66, 67, 71] treat the data of the auxiliary modality as an additional dimension of the RGB image, use the same network architecture, e.g., VGGNet and ResNet, and extract features at the same level (layer). A crucial task is to design a network dedicated to processing multi-modal data. Since 2017, AutoML methods, especially neural architecture search (NAS), have become popular; they design architectures automatically and obtain highly competitive results in many areas, such as image classification [125] and recognition [126]. However, researchers have paid little attention to NAS for multi-modal tracking, which is a good direction to explore.

Multi-modal Tracking with Real-time Speed. The additional modality multiplies the computation, which makes it difficult for existing tracking frameworks to meet real-time requirements. Speed-up mechanisms need to be designed, such as feature selection [67] and knowledge distillation. Furthermore, Huang et al. [127] propose a trade-off method in which an agent decides which layer is more suitable for accurate localization, thereby providing a speed boost of up to 100 times.

6.2. Dataset Construction

Large-scale Dataset for Training. With the emergence of deep neural net-
works, more powerful methods are equipped with CNN to achieve accurate
and robust performance. However, the existing datasets focus on testing with
no training subset. For instance, most DL-based trackers use the GTOT dataset, which contains a small amount of data with limited scenes, as the training set when testing on RGBT234. The effectiveness of DL-based methods has therefore not been fully exploited. Zhang et al. [71] generate synthetic thermal data from numerous existing visible datasets using an image translation method [120]. However, this data augmentation does not bring a significant performance improvement. Above all, constructing a large-scale training dataset is a main direction for multi-modal tracking.

Modality Registration. As multi-modal data is captured by different sensors


and the binocular camera has a parallax error that cannot be ignored when
the target is small and has low resolution, registering the data in the spatial and temporal aspects is essential.

Figure 9: Examples of unregistered frames in the RGBT234 dataset (sequences Biketwo, Child1, and Bike). We show the ground truth of the visible modality in both images; the coarse bounding box degrades the discriminability of the model.

As shown in Figure 9, the target is out of the


box and the model is degraded by learning meaningless background informa-
tion. In the VOT-RGBT challenge, the dataset ensures the precise annotation
in infrared modality and the misalignment of the RGB image is required to
be handled by the tracker. We state that the image pre-registration process is
necessary during dataset construction by cropping the shared visual field and
applying image registration method.

Metrics for Robustness Evaluation. In some extreme scenes and weather con-
ditions, such as rain, low illumination, and hot sunny days, visible or thermal
sensors cannot provide meaningful data. The depth camera cannot obtain pre-
cise distance estimation when the object is far from the sensor. Therefore, a
robust tracker needs to avoid tracking failure when any of the modality data is
unavailable during a certain period. To handle this case, both complementary
and discriminative features have to be applied in localization. However, none
of the datasets measures the tracking robustness with missing data. Thus, a
new evaluation metric for tracking robustness needs to be considered.

7. Conclusion

In this study, we provide an in-depth review of multi-modal tracking. First,


we summarize multi-modal trackers in a unified framework, and analyze them
from different perspectives, including auxiliary modality purpose and track-
ing framework. Then, we present a detailed introduction on the datasets for
multi-modal tracking and the corresponding metrics. Furthermore, a comprehensive comparison on five popular datasets is conducted, and the effectiveness of various types of trackers is analyzed in terms of overall performance, attribute-based performance, and speed. Finally, as an emerging

field, several possible directions are identified to facilitate the improvement of
multi-modal tracking. The comparison results and analysis will be available at
https://github.com/zhang-pengyu/Multimodal_tracking_survey.

References

[1] C. Li, H. Cheng, S. Hu, X. Liu, J. Tang, L. Lin, Learning collaborative


sparse representation for grayscale-thermal tracking, IEEE Transactions
on Image Processing 25 (12) (2016) 5743–5756.

[2] C. Li, N. Zhao, Y. Lu, C. Zhu, J. Tang, Weighted sparse representation


regularized graph learning for RGB-T object tracking, in: ACM Interna-
tional Conference on Multimedia, 2017.

[3] C. Li, X. Liang, Y. Lu, N. Zhao, J. Tang, RGB-T object tracking: Benchmark
and baseline, Pattern Recognition 96 (12) (2019) 106977.

[4] J. Xiao, R. Stolkin, Y. Gao, A. Leonardis, Robust fusion of color and depth
data for RGB-D target tracking using adaptive range-invariant depth
models and spatio-temporal consistency constraints, IEEE Transactions
on Cybernetics 48 (8) (2018) 2485–2499.

[5] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, et al., The eighth visual


object tracking VOT2020 challenge results, in: European Conference on
Computer Vision Workshop, 2020.

[6] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, et al., The seventh visual


object tracking VOT2019 challenge results, in: IEEE International Con-
ference on Computer Vision Workshop, 2019.

[7] U. Kart, A. Lukezic, M. Kristan, J.-K. Kamarainen, J. Matas, Object track-


ing by reconstruction with view-specific discriminative correlation fil-
ters, in: IEEE Conference on Computer Vision and Pattern Recognition,
2019.

[8] N. Cvejic, S. G. Nikolov, H. D. Knowles, A. Loza, A. Achim, A. Achim,
C. N. Canagarajah, The effect of pixel-level fusion on object tracking in
multi-sensor surveillance video, in: IEEE Conference on Computer Vi-
sion and Pattern Recognition, 2007.

[9] U. Kart, J.-K. Kamarainen, J. Matas, How to make an RGBD tracker?, in:
European Conference on Computer Vision Workshop, 2018.

[10] P. Zhang, J. Zhao, D. Wang, H. Lu, X. Yang, Jointly modeling motion and
appearance cues for robust rgb-t tracking, CoRR abs/2007.02041.

[11] C. Li, A. Lu, A. Zheng, Z. Tu, J. Tang, Multi-adapter RGBT tracking, in:
IEEE International Conference on Computer Vision Workshop, 2019.

[12] S.-K. Weng, C.-M. Kuo, S.-K. Tu, Video object tracking using adaptive
kalman filter, Journal of Visual Communication and Image Representa-
tion 17 (6) (2006) 1190–1208.

[13] G. Y. Kulikov, M. V. Kulikova, The accurate continuous-discrete extended


kalman filter for radar tracking, IEEE Transactions on Signal Processing
64 (4) (2016) 948–958.

[14] C. Yang, R. Duraiswami, L. Davis, Fast multiple object tracking via a hi-
erarchical particle filter, in: IEEE International Conference on Computer
Vision, 2005.

[15] K. Okuma, A. Taleghani, N. de Freitas, J. J. Little, D. G. Lowe, A boosted


particle filter: Multitarget detection and tracking, in: European Confer-
ence on Computer Vision, 2004.

[16] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Low-rank sparse learning for


robust visual tracking, in: European Conference on Computer Vision,
2012.

[17] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-
task sparse learning, in: IEEE Conference on Computer Vision and Pat-
tern Recognition, 2012.

[18] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object track-
ing using adaptive correlation filters, in: IEEE Conference on Computer
Vision and Pattern Recognition, 2010.

[19] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature
integration, in: European Conference on Computer Vision Workshop,
2014.

[20] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, P. H. S. Torr, Fully-


convolutional siamese networks for object tracking, in: European Con-
ference on Computer Vision Workshop, 2016.

[21] Z. Zhang, H. Peng, Deeper and wider siamese networks for real-time
visual tracking, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2019.

[22] M. Danelljan, G. Hager, F. S. Khan, M. Felsberg, Learning spatially regu-


larized correlation filters for visual tracking, in: IEEE International Con-
ference on Computer Vision, 2015.

[23] H. K. Galoogahi, A. Fagg, S. Lucey, Learning background-aware corre-


lation filters for visual tracking, in: IEEE International Conference on
Computer Vision, 2017.

[24] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, P. H. Torr, Fast online object


tracking and segmentation: A unifying approach, in: IEEE Conference
on Computer Vision and Pattern Recognition, 2019.

[25] A. Lukezic, J. Matas, M. Kristan, D3S – a discriminative single shot seg-


mentation tracker, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2020.

[26] X. Song, H. Zhao, J. Cui, X. Shao, R. Shibasaki, H. Zha, An online sys-


tem for multiple interacting targets tracking: Fusion of laser and vi-
sion, tracking and learning, ACM Transactions on Intelligent Systems
and Technology 4 (1) (2013) 1–21.

[27] D. Y. Kim, M. Jeon, Data fusion of radar and image measurements for multi-object tracking via kalman filtering, Information Sciences 278 (2014) 641–652.

[28] N. Megherbi, S. Ambellouis, O. Colot, F. Cabestaing, Joint audio-video


people tracking using belief theory, in: IEEE Conference on Advanced
Video and Signal based Surveillance, 2005.

[29] X. Lan, M. Ye, S. Zhang, P. C. Yuen, Robust collaborative discriminative


learning for RGB-infrared tracking, in: AAAI Conference on Artificial
Intelligence, 2018.

[30] Y. Liu, X.-Y. Jing, J. Nie, H. Gao, J. Liu, Context-aware three-dimensional


mean-shift with occlusion handling for robust object tracking in RGB-D
videos, IEEE Transactions on Multimedia 21 (3) (2019) 664–677.

[31] P. K. Atrey, M. A. Hossain, A. E. Saddik, M. S. Kankanhalli, Multimodal


fusion for multimedia analysis: a survey, Multimedia Systems 16 (2010)
345–379.

[32] G. S. Walia, R. Kapoor, Recent advances on multicue object tracking: A


survey, Artificial Intelligence Review 46 (1) (2016) 1–39.

[33] Z. Cai, J. Han, L. Liu, L. Shao, RGB-D datasets using Microsoft Kinect or similar sensors: A survey, Multimedia Tools and Applications 76 (2016) 4313–4355.

[34] M. Camplani, A. Paiement, M. Mirmehdi, D. Damen, S. Hannuna,


T. Burghardt, L. Tao, Multiple human tracking in rgb-depth data: A sur-
vey, IET Computer Vision 11 (4) (2016) 265–285.

[35] T. Baltrusaitis, C. Ahuja, L.-P. Morency, Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2) (2019) 423–443.

[36] J. Ma, Y. Ma, C. Li, Infrared and visible image fusion methods and appli-
cations: A survey, Information Fusion 45 (2019) 153–178.

[37] X. Zhang, P. Ye, H. Leung, K. Gong, G. Xiao, Object fusion tracking based on visible and infrared images: A comprehensive review, Information Fusion 63 (2020) 166–187.

[38] A. Bibi, T. Zhang, B. Ghanem, 3D part-based sparse tracker with auto-


matic synchronization and registration, in: IEEE Conference on Com-
puter Vision and Pattern Recognition, 2016.

[39] A. Gutev, C. J. Debono, Exploiting depth information to increase object


tracking robustness, in: International Conference on Smart Technologies,
2019.

[40] Y. Xie, Y. Lu, S. Gu, RGB-D object tracking with occlusion detection,
in: International Conference on Computational Intelligence and Security,
2019.

[41] B. Zhong, Y. Shen, Y. Chen, W. Xie, Z. Cui, H. Zhang, D. Chen, T. Wang,


X. Liu, S. Peng, J. Gou, J. Du, J. Wang, W. Zheng, Online learning 3D
context for robust visual tracking, Neurocomputing 151 (2015) 710–718.

[42] N. An, X.-G. Zhao, Z.-G. Hou, Online RGB-D tracking via detection-
learning-segmentation, in: International Conference on Pattern Recog-
nition, 2016.

[43] M. Camplani, S. Hannuna, M. Mirmehdi, D. Damen, A. Paiement, L. Tao,


T. Burghardt, Real-time RGB-D tracking with depth scaling kernelised
correlation filters and occlusion handling, in: British Machine Vision
Conference, 2015.

[44] P. Ding, Y. Song, Robust object tracking using color and depth images
with a depth based occlusion handling and recovery, in: International
Conference on Fuzzy Systems and Knowledge Discovery, 2015.

[45] G. M. Garcia, D. A. Klein, J. Stuckler, Adaptive multi-cue 3D tracking of
arbitrary objects, in: Joint DAGM and OAGM Symposium, 2012.

[46] S. Hannuna, M. Camplani, J. Hall, M. Mirmehdi, D. Damen,


T. Burghardt, A. Paiement, L. Tao, DS-KCF: a real-time tracker for RGB-D
data, Journal of Real-Time Image Processing 16 (2019) 1439–1458.

[47] U. Kart, J.-K. Kamarainen, J. Matas, L. Fan, F. Cricri, Depth masked


discriminative correlation filter, in: International Conference on Pattern
Recognition, 2018.

[48] Y. Kuai, G. Wen, D. Li, J. Xiao, Target-aware correlation filter tracking in


RGBD videos, Sensors 19 (20) (2019) 9522–9531.

[49] J. Leng, Y. Liu, Real-time RGB-D visual tracking with scale estimation
and occlusion handling, Access 6 (2018) 24256–24263.

[50] W. Liu, X. Tang, C. Zhao, Robust RGBD tracking via weighted convolu-
tion operators, Sensors 20 (8) (2020) 4496–4503.

[51] Z. Ma, Z. Xiang, Robust object tracking with RGBD-based sparse learn-
ing, Frontiers of Information Technology and Electronic Engineering
18 (7) (2017) 989–1001.

[52] K. Meshgi, S.-i. Maeda, S. Oba, H. Skibbe, Y.-z. Li, S. Ishii, An occlusion-aware particle filter tracker to handle complex and persistent occlusions, Computer Vision and Image Understanding 150 (2016) 81–94.

[53] S. Chen, W. Zhu, H. Leung, Thermo-visual video fusion using probabilis-


tic graphical model for human tracking, in: International Symposium on
Circuits and Systems, 2008.

[54] Y. Wu, E. Blasch, G. Chen, L. Bai, H. Ling, Multiple source data fusion via
sparse representation for robust visual tracking, in: International Confer-
ence on Information Fusion, 2011.

[55] Y. Wang, C. Li, J. Tang, D. Sun, Learning collaborative sparse correlation
filter for real-time multispectral object tracking, in: International Confer-
ence on Brain Inspired Cognitive Systems, 2018.

[56] X. Lan, M. Ye, R. Shao, B. Zhong, Online non-negative multi-modality


feature template learning for RGB-assisted infrared tracking, Access 7
(2019) 67761–67771.

[57] X. Lan, M. Ye, S. Zhang, H. Zhou, P. C. Yuen, Modality-correlation-aware


sparse representation for RGB-infrared object tracking, Pattern Recogni-
tion Letters.

[58] X. Lan, M. Ye, R. Shao, B. Zhong, P. C. Yuen, H. Zhou, Learning modality-


consistency feature templates: A robust RGB-Infrared tracking system,
IEEE Transactions on Industrial Electronics 66 (12) 9887–9897.

[59] C. Li, C. Zhu, Y. Huang, J. Tang, L. Wang, Cross-modal ranking with soft
consistency and noisy labels for robust RGB-T tracking, in: European
Conference on Computer Vision, 2018.

[60] C. Li, S. Hu, S. Gao, J. Tang, Real-time grayscale-thermal tracking via Laplacian sparse representation, 2016.

[61] C. Li, C. Zhu, J. Zhang, B. Luo, X. Wu, J. Tang, Learning local-global


multi-graph descriptors for RGB-T object tracking, IEEE Transactions on
Circuits and Systems for Video Technology 29 (10) (2018) 2913–2926.

[62] H. Zhang, L. Zhang, L. Zhuo, J. Zhang, Object tracking in RGB-T videos


using modal-aware attention network and competitive learning, Sensors
20 (2).

[63] C. Li, X. Sun, X. Wang, L. Zhang, J. Tang, Grayscale-thermal object tracking via multitask Laplacian sparse representation, IEEE Transactions on Systems, Man, and Cybernetics: Systems 47 (4) (2017) 673–681.

[64] H. Liu, F. Sun, Fusion tracking in color and infrared images using joint
sparse representation, Information Sciences 55 (3) (2012) 590–599.

[65] S. Zhai, P. Shao, X. Liang, X. Wang, Fast RGB-T tracking via cross-modal
correlation filters, Neurocomputing 334 172–181.

[66] Y. Gao, C. Li, Y. Zhu, J. Tang, T. He, F. Wang, Deep adaptive fusion net-
work for high performance RGBT tracking, in: IEEE International Con-
ference on Computer Vision Workshop, 2019.

[67] Y. Zhu, C. Li, B. Luo, J. Tang, X. Wang, Dense feature aggregation and
pruning for RGBT tracking, in: ACM International Conference on Multi-
media, 2019.

[68] C. Li, X. Wu, N. Zhao, X. Cao, J. Tang, Fusing two-stream convolutional


neural networks for RGB-T object tracking, Neurocomputing 281 (2018)
78–85.

[69] R. Yang, Y. Zhu, X. Wang, C. Li, J. Tang, Learning target-oriented dual


attention for robust RGB-T tracking, in: IEEE International Conference
on Image Processing, 2019.

[70] X. Zhang, X. Zhang, X. Du, X. Zhou, J. Yin, Learning multi-domain con-


volutional network for RGB-T visual tracking, 2018.

[71] L. Zhang, M. Danelljan, A. Gonzalez-Garcia1, J. van de Weijer, F. S. Khan,


Multi-modal fusion for end-to-end RGB-T tracking, in: IEEE Interna-
tional Conference on Computer Vision Workshop, 2019.

[72] Y. Zhu, C. Li, Y. Lu, L. Lin, B. Luo, J. Tang, FANet: Quality-aware feature
aggregation network for RGB-T tracking, CoRR abs/1811.09855.

[73] C. O. Conaire, N. E. O’Connor, E. Cooke, A. F. Smeaton, Comparison of


fusion methods for thermo-visual surveillance tracking, in: International
Conference on Information Fusion, 2006.

[74] H. Shi, C. Gao, N. Sang, Using consistency of depth gradient to improve


visual tracking in RGB-D sequences, in: Chinese Automation Congress,
2015.

[75] Q. Wang, J. Fang, Y. Yuan, Multi-cue based tracking, Neurocomputing
131 (2014) 227–236.

[76] H. Zhang, M. Cai, J. Li, A real-time RGB-D tracker based on KCF, in:
Chinese Control And Decision Conference, 2018.

[77] C. O. Conaire, N. E. O’Connor, A. F. Smeaton, Thermo-visual feature fu-


sion for object tracking using multiple spatiogram trackers, Machine Vi-
sion and Applications 19 (5) (2008) 483–494.

[78] C. Luo, B. Sun, K. Yang, T. Lu, W.-C. Yeh, Thermal infrared and visible
sequences fusion tracking based on a hybrid tracking framework with
adaptive weighting scheme, Infrared Physics and Technology 99 265–
276.

[79] Y. Zhai, P. Song, Z. Mou, X. Chen, X. Liu, Occlusion-aware correlation


particle filter target tracking based on rgbd data, Access 6 (2018) 50752–
50764.

[80] G. Li, L. Huang, P. Zhang, Q. Li, Y. Huo, Depth information aided con-
strained correlation filter for visual tracking, in: International Conference
on Geo-Spatial Knowledge and Intelligence, 2019.

[81] C. Wang, C. Xu, Z. Cui, L. Zhou, T. Zhang, X. Zhang, J. Yang, Cross-


modal pattern-propagation for RGB-T tracking, in: IEEE Conference on
Computer Vision and Pattern Recognition, 2020.

[82] Y. Chen, Y. Shen, X. Liu, B. Zhong, 3d object tracking via image sets and
depth-based occlusion detection, Signal Processing 112 (2015) 146–153.

[83] C. Li, C. Zhu, S. Zheng, B. Luo, J. Tang, Two-stage modality-graphs reg-


ularized manifold ranking for RGB-T tracking, Signal Processing: Image
Communication 68 207–217.

[84] A. Geiger, P. Lenz, C. Stiller, R. Urtasun, Vision meets robotics: The KITTI
dataset, International Journal of Robotics Research 32 (11) (2013) 1231–
1237.

[85] J. Liu, Y. Liu, Y. Cui, Y. Chen, Real-time human detection and tracking
in complex environments using single RGBD camera, in: IEEE Interna-
tional Conference on Image Processing, 2013.

[86] S. Song, J. Xiao, Tracking revisited using RGBD camera: Unified bench-
mark and baselines, in: IEEE International Conference on Computer Vi-
sion, 2013.

[87] A. Lukezic, U. Kart, J. Kapyla, A. Durmush, J.-K. Kamarainen, J. Matas,


M. Kristan, Cdtb: A color and depth visual object tracking dataset
and benchmark, in: IEEE International Conference on Computer Vision,
2019.

[88] J. W. Davis, V. Sharma, Background-subtraction using contour-based fusion of thermal and visible imagery, Computer Vision and Image Understanding 106 (2-3) 162–182.

[89] A. Torabi, G. Massé, G.-A. Bilodeau, An iterative integrated framework


for thermal-visible image registration, sensor fusion, and people tracking
for video surveillance applications, Computer Vision and Image Under-
standing 116 (2) 210–221.

[90] D. Ramachandram, G. W. Taylor, Deep multimodal learning: A survey on recent advances and trends, IEEE Signal Processing Magazine 34 (6) (2017) 96–108.

[91] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature
integration, in: European Conference on Computer Vision, 2014.

[92] M. Danelljan, G. Hager, F. S. Khan, M. Felsberg, Discriminative scale


space tracking, IEEE Transactions on Pattern Analysis and Machine In-
telligence 39 (8) (2017) 1561–1575.

[93] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, A survey of appearance models


in visual object tracking, ACM Transactions on Intelligent Systems and
Technology 58 (4).

[94] S. Shojaeilangari, W.-Y. Yau, K. Nandakumar, J. Li, E. K. Teoh, Robust
representation and recognition of facial emotions using extreme sparse
learning, IEEE Transactions on Image Processing 24 (7).

[95] M. Yang, L. Zhang, X. Feng, D. Zhang, Sparse representation based fisher


discrimination dictionary learning for image classification, International
Journal of Computer Vision 109.

[96] Y. Xie, W. Zhang, C. Li, S. Lin, Y. Qu, Y. Zhang, Discriminative object


tracking via sparse representation and online dictionary learning, IEEE
Transactions on Cybernetics 44 (4).

[97] M. Isard, A. Blake, Condensation—conditional density propagation for


visual tracking, International Journal of Computer Vision 29.

[98] M. Danelljan, G. Häger, F. S. Khan, M. Felsberg, Learning spatially regu-


larized correlation filters for visual tracking, in: IEEE International Con-
ference on Computer Vision, 2015.

[99] F. Li, C. Tian, W. Zuo, L. Zhang, M.-H. Yang, Learning spatial-temporal


regularized correlation filters for visual tracking, in: IEEE Conference on
Computer Vision and Pattern Recognition, 2018.

[100] M. Danelljan, G. Bhat, F. S. Khan, M. Felsberg, ECO: Efficient convolution


operators for tracking, in: IEEE Conference on Computer Vision and Pat-
tern Recognition, 2017.

[101] A. Lukežič, T. Vojı́ř, L. Čehovin Zajc, J. Matas, M. Kristan, Discriminative


correlation filter tracker with channel and spatial reliability, International
Journal of Computer Vision 126.

[102] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.

[103] H. Nam, B. Han, Learning multi-domain convolutional neural networks


for visual tracking, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2016.

[104] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recogni-
tion, in: IEEE Conference on Computer Vision and Pattern Recognition,
2016.

[105] S. Hare, A. Saffari, P. H. S. Torr, Struck: Structured output tracking with kernels, in: IEEE International Conference on Computer Vision, 2011.

[106] A. Lukežič, L. Čehovin Zajc, T. Vojı́ř, J. Matas, M. Kristan, Now you


see me: evaluating performance in long-term visual tracking, CoRR
abs/1804.07056.

[107] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: IEEE
Conference on Computer Vision and Pattern Recognition, 2013.

[108] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions
on Pattern Analysis and Machine Intelligence 37 (9).

[109] D. A. Klein, A. B. Cremers, Boosting scalable gradient features for adap-


tive real-time tracking, in: International Conference on Robotics and Au-
tomation, 2011.

[110] T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow


estimation based on a theory for warping, in: European Conference on
Computer Vision, 2004.

[111] F. M. Porikli, Integral histogram: a fast way to extract histograms in


cartesian spaces, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2005.

[112] N. Dalal, B. Triggs, Histograms of oriented gradients for human detec-


tion, in: IEEE Conference on Computer Vision and Pattern Recognition,
2005.

[113] P. Viola, M. Jones, Rapid object detection using a boosted cascade of


simple features, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2001.

[114] B. K. P. Horn, B. G. Schunck, Determining optical flow, Artificial Intelligence 17 (1981) 185–203.

[115] H. Bay, A. Ess, T. Tuytelaars, L. V. Gool, Speeded-up robust features,


Computer Vision and Image Understanding 110 346–359.

[116] J. van de Weijer, C. Schmid, J. Verbeek, D. Larlus, Learning color names for real-world applications, IEEE Transactions on Image Processing 18 (7) (2009) 1512–1523.

[117] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with


deep convolutional neural networks, in: Advances in Neural Informa-
tion Processing Systems, 2012.

[118] S. Birchfield, S. Rangarajan, Spatiograms versus histograms for region-


based tracking, in: IEEE Conference on Computer Vision and Pattern
Recognition, 2005.

[119] J. Canny, A computational approach to edge detection, IEEE Transactions


on Pattern Analysis and Machine Intelligence 8 (6) 679–698.

[120] L. Zhang, A. Gonzalez-Garcia, J. van de Weijer, M. Danelljan, F. S. Khan,


Synthetic data generation for end-to-end thermal infrared tracking, IEEE
Transactions on Image Processing 28 (4) (2018) 1837–1850.

[121] C. Li, L. Liu, A. Lu, Q. Ji, J. Tang, Challenge-aware rgbt tracking, in:
European Conference on Computer Vision, 2020.

[122] A. Lukezic, L. C. Zajc, T. Vojir, J. Matas, M. Kristan, FuCoLoT – a fully-


correlational long-term tracker, in: Asian Conference on Computer Vi-
sion, 2018.

[123] A. Bendjebbour, Y. Delignon, L. Fouque, V. Samson, W. Pieczynski, Multisensor image segmentation using Dempster–Shafer fusion in Markov fields context, IEEE Transactions on Geoscience and Remote Sensing 39 (8).

[124] H. Xu, T.-S. Chua, Fusion of av features and external information sources
for event detection in team sports video, ACM Transactions on Multime-
dia Computing Communications and Applications 2 (1).

[125] H. Liu, K. Simonyan, Y. Yang, DARTS: Differentiable architecture search,


2019.

[126] B. Zoph, V. Vasudevan, J. Shlens, Q. V. Le, Learning transferable architec-


tures for scalable image recognition, in: IEEE Conference on Computer
Vision and Pattern Recognition, 2018.

[127] C. Huang, S. Lucey, D. Ramanan, Learning policies for adaptive tracking


with deep feature cascades, in: IEEE International Conference on Com-
puter Vision, 2017.
