
Learning Discriminative Model Prediction for Tracking

Goutam Bhat∗ Martin Danelljan∗ Luc Van Gool Radu Timofte


CVL, ETH Zürich, Switzerland

∗ Both authors contributed equally.

arXiv:1904.07220v2 [cs.CV] 8 Jun 2020

Abstract
The current strive towards end-to-end trainable computer vision systems imposes major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires the learning of a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Due to the imposed challenges, the popular Siamese paradigm simply predicts a target feature template, while ignoring the background appearance information during inference. Consequently, the predicted model possesses limited target-background discriminability.

We develop an end-to-end tracking architecture, capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that is capable of predicting a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state-of-the-art on 6 tracking benchmarks, achieving an EAO score of 0.440 on VOT2018, while running at over 40 FPS. The code and models are available at https://github.com/visionml/pytracking.

Figure 1. Confidence maps of the target object (red box) provided by the target model obtained using i) a Siamese approach (middle), and ii) our approach (right). The model predicted in a Siamese fashion, using only target appearance, struggles to distinguish the target from distractor objects in the background. In contrast, our model prediction architecture also integrates background appearance, providing superior discriminative power.

1. Introduction

Generic object tracking is the task of estimating the state of an arbitrary target in each frame of a video sequence. In the most general setting, the target is only defined by its initial state in the sequence. Most current approaches address the tracking problem by constructing a target model, capable of differentiating between the target and background appearance. Since target-specific information is only available at test-time, the target model cannot be learned in an offline training phase, as in for instance object detection. Instead, the target model must be constructed during the inference stage itself by exploiting the target information given at test-time. This unconventional nature of the visual tracking problem imposes significant challenges when pursuing an end-to-end learning solution.

The aforementioned problems have been most successfully addressed by the Siamese learning paradigm [2, 23]. These approaches first learn a feature embedding, where the similarity between two image regions is computed by a simple cross-correlation. Tracking is then performed by finding the image region most similar to the target template. In this setting, the target model simply corresponds to the template features extracted from the target region. Consequently, the tracker can easily be trained end-to-end using pairs of annotated images.

Despite its recent success, the Siamese learning framework suffers from severe limitations. Firstly, Siamese trackers only utilize the target appearance when inferring the model. This completely ignores background appearance information, which is crucial for discriminating the target from similar objects in the scene (see figure 1). Secondly, the learned similarity measure is not necessarily reliable for objects that are not included in the offline training set, leading to poor generalization. Thirdly, the Siamese formulation does not provide a powerful model update strategy. Instead, state-of-the-art approaches resort to simple template averaging [46]. These limitations result in inferior robustness [20] compared to other state-of-the-art tracking approaches.

In this work, we introduce an alternative tracking architecture, trained in an end-to-end manner, that directly addresses all aforementioned limitations. In our design, we take inspiration from the discriminative online learning procedures that have been successfully applied in recent trackers [6, 9, 30]. Our approach is based on a target model prediction network, which is derived from a discriminative learning loss by applying an iterative optimization procedure. The architecture is carefully designed to enable effective end-to-end training, while maximizing the discriminative ability of the predicted model. This is achieved by ensuring a minimal number of optimization steps through two key design choices. First, we employ a steepest descent based methodology that computes an optimal step length in each iteration. Second, we integrate a module that effectively initializes the target model. Furthermore, we introduce significant flexibility into our final architecture by learning the discriminative learning loss itself.

Our entire tracking architecture, along with the backbone feature extractor, is trained using annotated tracking sequences by minimizing the prediction error on future frames. We perform comprehensive experiments on 7 tracking benchmarks: VOT2018 [20], LaSOT [10], TrackingNet [27], GOT10k [16], NFS [12], OTB-100 [43], and UAV123 [26]. Our approach achieves state-of-the-art results on all 7 datasets, while running at over 40 FPS. We also provide an extensive experimental analysis of the proposed architecture, showing the impact of each component.

2. Related Work

Generic object tracking has undergone astonishing progress in recent years, with the development of a variety of approaches. Recently, methods based on Siamese networks [2, 23, 39] have received much attention due to their end-to-end training capabilities and high efficiency. The name derives from the deployment of a Siamese network architecture in order to learn a similarity metric offline. Bertinetto et al. [2] utilize a fully-convolutional architecture for similarity prediction, thereby attaining high tracking speeds of over 100 FPS. Wang et al. [42] learn a residual attention mechanism to adapt the tracking model to the current target. Li et al. [23] employ a region proposal network [34] to obtain accurate bounding boxes.

A key limitation in Siamese approaches is their inability to incorporate information from the background region or previously tracked frames into the model prediction. A few recent attempts aim to address these issues. Guo et al. [13] learn a feature transformation to handle target appearance changes and to suppress background. Zhu et al. [46] handle background distractors by subtracting corresponding image features from the target template during online tracking. Despite these attempts, the Siamese trackers are yet to reach the high level of robustness attained by state-of-the-art trackers employing online learning [20].

In contrast to Siamese methods, another family of trackers [6, 7, 30] learn a discriminative classifier online to distinguish the target object from the background. These approaches can effectively utilize background information, thereby achieving impressive robustness on multiple tracking benchmarks [20, 43]. However, such methods rely on more complicated online learning procedures that cannot be easily formulated in an end-to-end learning framework. Thus, these approaches are often restricted to features extracted from deep networks pre-trained for image classification [9, 25] or hand-crafted alternatives [8].

A few recent works aim to formulate existing discriminative online learning based trackers as a neural network component in order to benefit from end-to-end training. Valmadre et al. [41] integrate the single-sample closed-form solution of the correlation filter (CF) [15] into a deep network. Yao et al. [45] unroll the ADMM iterations of the BACF [18] tracker to learn the feature extractor and a few tracking hyper-parameters in a complex multi-stage training procedure. The BACF model learning is however restricted to the single-sample variant of the Fourier-domain CF formulation, which cannot exploit multiple samples, requiring an ad-hoc linear combination of filters for model adaption.

The problem of learning to predict a target model using only a few images is closely related to meta-learning [11, 28, 29, 33, 35, 36, 40]. A few works have already pursued this direction for tracking. Bertinetto et al. [1] meta-train a network to predict the parameters of the tracking model. Choi et al. [5] utilize a meta-learner to predict a target-specific feature space to complement the general target-independent feature space used for estimating the similarity in Siamese trackers. Park et al. [32] develop a meta-learning framework employing an initial target-independent model, which is then refined using gradient descent with learned step-lengths. However, constant step-lengths are only suitable for fast initial adaption of the model and do not provide optimal convergence when applied iteratively.

3. Method

In this work, we develop a discriminative model prediction architecture for tracking. As in Siamese trackers, our approach benefits from end-to-end training. However, unlike Siamese, our architecture can fully exploit background information and provides natural and powerful means of updating the target model with new data. Our model prediction network is derived from two main principles: (i) a discriminative learning loss promoting robustness in the learned target model; and (ii) a powerful optimization strategy ensuring rapid convergence. By such careful design, our architecture can predict the target model in only a few iterations, without compromising its discriminative power.
Figure 2. An overview of the target classification branch in our tracking architecture. Given an annotated training set (top left), we extract deep feature maps using a backbone network followed by an additional convolutional block (Cls Feat). The feature maps are then input to the model predictor D, consisting of the initializer and the recurrent optimizer module. The model predictor outputs the weights of the convolutional layer which performs target classification on the feature map extracted from the test frame.

In our framework, the target model constitutes the weights of a convolutional layer, providing target classification scores as output. Our model prediction architecture computes these weights by taking a set of bounding-box annotated image samples as input. The model predictor includes an initializer network that efficiently provides an initial estimate of the model weights, using only the target appearance. These weights are then processed by the optimizer module, taking both target and background appearance into account. By design, our optimizer module possesses few learnable parameters in order to avoid overfitting to certain classes and scenes during offline training. Our model predictor can thus generalize to unseen objects, which is crucial in generic object tracking.

Our final tracking architecture consists of two branches: a target classification branch (see figure 2) for distinguishing the target from the background, and a bounding box estimation branch for predicting an accurate target box. Both branches input deep features from a common backbone network. The target classification branch contains a convolutional block, extracting features on which the classifier operates. Given a training set of samples and corresponding target boxes, the model predictor generates the weights of the target classifier. These weights are then applied to features extracted from the test frame, in order to compute the target confidence scores. For the bounding box estimation branch, we utilize the overlap maximization based architecture introduced in [6]. The entire tracking network, including the target classification, bounding box estimation and backbone modules, is trained offline on tracking datasets.

3.1. Discriminative Learning Loss

In this section, we describe the discriminative learning loss used to derive our model prediction architecture. The input to our model predictor D consists of a training set Strain = {(x_j, c_j)}_{j=1}^n of deep feature maps x_j ∈ X generated by the feature extractor network F. Each sample is paired with the corresponding target center coordinate c_j ∈ R^2. Given this data, our aim is to predict a target model f = D(Strain). The model f is defined as the filter weights of a convolutional layer tasked with discriminating between target and background appearance in the feature space X. We gather inspiration from the least-squares-based regression formulation of the tracking problem, which has seen tremendous success in recent years [6, 7, 15]. However, in this work we generalize the conventional least-squares loss applied for tracking in several directions, allowing the final tracking network to learn the optimal loss from data. In general, we consider a loss of the form,

L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \| r(x * f, c) \|^2 + \| \lambda f \|^2 .    (1)

Here, ∗ denotes convolution and λ is a regularization factor. The function r(s, c) computes the residual at every spatial location based on the target confidence scores s = x ∗ f and the ground-truth target center coordinate c. The most common choice is r(s, c) = s − y_c, where y_c are the desired target scores at each location, popularly set to a Gaussian function centered at c [4]. However, simply taking the difference forces the model to regress calibrated confidence scores, usually zero, for all negative samples. This requires substantial model capacity, forcing the learning to focus on the negative data samples instead of achieving the best discriminative abilities. Furthermore, taking the naïve difference does not address the problem of data imbalance between target and background.

To alleviate the latter issue of data imbalance, we use a spatial weight function v_c. The subscript c indicates the dependence on the center location of the target, as detailed in section 3.4. To accommodate the first issue, we modify the loss following the philosophy of Support Vector Machines. We employ a hinge-like loss in r, clipping the scores at zero as max(0, s) in the background region. The model is thus free to predict large negative values for easy samples in the background without increasing the loss. For the target region, on the other hand, we found it disadvantageous to add an analogous hinge loss max(0, 1 − s). Although contradictory at first glance, this behavior can be attributed to the fundamental asymmetry between the target and background class, partially due to the numerical imbalance. Moreover, accurately calibrated target confidences are indeed advantageous in the tracking scenario, e.g. for detecting target loss. We therefore desire the properties of standard least-squares regression in the target neighborhood.

To accommodate the advantages of both least-squares regression and the hinge loss, we define the residual function,

r(s, c) = v_c \cdot \big( m_c s + (1 - m_c) \max(0, s) - y_c \big) .    (2)

The target region is defined by the mask m_c, having values in the interval m_c(t) ∈ [0, 1] at each spatial location t ∈ R^2. Again, the subscript c indicates the dependence on the target center coordinate. The formulation in (2) is capable of continuously changing the behavior of the loss from standard least-squares regression to a hinge loss depending on the image location relative to the target center c. Setting m_c ≈ 1 at the target and m_c ≈ 0 in the background region yields the desired behavior described above. However, how to optimally set m_c is not clear, in particular in the transition region between target and background. While the classical strategy is to manually set the mask parameters using trial and error, our end-to-end formulation allows us to learn the mask in a data-driven manner. In fact, as detailed in section 3.4, our approach learns all free parameters in the loss: the target mask m_c, the spatial weight v_c, the regularization factor λ, and even the regression target y_c itself.
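For concreteness, the following is a minimal PyTorch sketch of the residual (2) and loss (1), not the authors' released code. It assumes the label y_c, mask m_c and weight v_c have already been sampled on the same grid as the score maps; all names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def residual(scores, y_c, m_c, v_c):
    # Eq. (2): least-squares behaviour near the target (m_c ~ 1), hinge-like
    # clipping max(0, s) in the background (m_c ~ 0), weighted spatially by v_c.
    return v_c * (m_c * scores + (1 - m_c) * torch.clamp(scores, min=0) - y_c)

def model_loss(filt, feats, y_c, m_c, v_c, reg_lambda):
    # Eq. (1): mean squared residual over the training samples plus regularization.
    # filt: (1, C, K, K) target model; feats: (n, C, H, W) training feature maps;
    # y_c, m_c, v_c: label, mask and weight maps matching the score-map grid.
    scores = F.conv2d(feats, filt, padding=filt.shape[-1] // 2)
    r = residual(scores, y_c, m_c, v_c)
    return (r ** 2).sum() / feats.shape[0] + ((reg_lambda * filt) ** 2).sum()
```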
3.2. Optimization-Based Architecture

Here, we derive the network architecture D that predicts the filter f = D(Strain) by implicitly minimizing the error (1). The network is designed by formulating an optimization procedure. From eqs. (1) and (2) we can easily derive a closed-form expression for the gradient ∇L of the loss with respect to the filter f (see the supplementary material, section S1, for details). The straight-forward option is to then employ gradient descent using a step length α,

f^{(i+1)} = f^{(i)} - \alpha \nabla L(f^{(i)}) .    (3)

However, we found this simple approach to be insufficient, even if the learning rate α (either a scalar or coefficient-specific) is learned by the network itself (see section 4.1). It experiences slow adaption of the filter parameters f, requiring a vast increase in the number of iterations. This harms efficiency and complicates offline learning.

The slow convergence of gradient descent is largely due to the constant step length α, which does not depend on the data or the current model estimate. We solve this issue by deriving a more elaborate optimization approach, requiring only a handful of iterations to predict a strong discriminative filter f. The core idea is to compute the step length α based on the steepest descent methodology, which is a common optimization technique [31, 37]. We first approximate the loss with a quadratic function at the current estimate f^(i),

L(f) \approx \tilde{L}(f) = \frac{1}{2}(f - f^{(i)})^T Q^{(i)} (f - f^{(i)}) + (f - f^{(i)})^T \nabla L(f^{(i)}) + L(f^{(i)}) .    (4)

Here, the filter variables f and f^(i) are seen as vectors and Q^(i) is a positive definite square matrix. The steepest descent then proceeds by finding the step length α that minimizes the approximate loss (4) in the gradient direction (3). This is found by solving \frac{d}{d\alpha} \tilde{L}\big(f^{(i)} - \alpha \nabla L(f^{(i)})\big) = 0, as

\alpha = \frac{\nabla L(f^{(i)})^T \nabla L(f^{(i)})}{\nabla L(f^{(i)})^T Q^{(i)} \nabla L(f^{(i)})} .    (5)

In steepest descent, the formula (5) is used to compute the scalar step length α in each iteration of the filter update (3). The quadratic model (4), and consequently the resulting step length (5), depends on the choice of Q^(i). For example, by using a scaled identity matrix Q^(i) = (1/β) I we retrieve the standard gradient descent algorithm with a fixed step length α = β. On the other hand, we can now integrate second-order information into the optimization procedure. The most obvious choice is setting Q^(i) = \frac{\partial^2 L}{\partial f^2}(f^{(i)}) to the Hessian of the loss (1), which corresponds to a second-order Taylor approximation (4). For our least-squares formulation (1) however, the Gauss-Newton method [31] provides a powerful alternative, with significant computational benefits since it only involves first-order derivatives. We thus set Q^(i) = (J^{(i)})^T J^{(i)}, where J^(i) is the Jacobian of the residuals at f^(i). In fact, neither the matrix Q^(i) nor the Jacobian J^(i) needs to be constructed explicitly; they are rather implemented as a sequence of neural network operations. See the supplementary material (section S2) for details.

Algorithm 1 describes our target model predictor D. Note that our optimizer module can easily be employed for online model adaption as well. This is achieved by continuously extending the training set Strain with new samples from the previously tracked frames. The optimizer module is then applied on this extended training set, using the current target model as the initialization f^(0).
Algorithm 1 Target model predictor D.
Input: Samples Strain = {(x_j, c_j)}_{j=1}^{n}, iterations N_iter
1: f^(0) ← ModelInit(Strain)                        # Initialize filter (sec. 3.3)
2: for i = 0, . . . , N_iter − 1 do                  # Optimizer module loop
3:     ∇L(f^(i)) ← FiltGrad(f^(i), Strain)           # Using (1)-(2)
4:     h ← J^(i) ∇L(f^(i))                           # Apply Jacobian of (2)
5:     α ← ‖∇L(f^(i))‖^2 / ‖h‖^2                     # Compute step length (5)
6:     f^(i+1) ← f^(i) − α ∇L(f^(i))                 # Update filter
7: end for

Figure 3. Plot of the learned regression label (y_c), target mask (m_c), and spatial weight (v_c) as functions of the distance from the target center. The markers show the knot locations. The initialization of each quantity is shown in dotted lines.
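Below is a hedged PyTorch sketch of a single optimizer recursion (steps 3-6 of Algorithm 1), shown in inference mode; during offline training the same computation would be kept differentiable. The gradient is obtained via autograd, which is equivalent to the closed form derived in the supplementary material. Names and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def steepest_descent_step(filt, feats, y_c, m_c, v_c, reg_lambda):
    # One optimizer recursion: loss gradient, Gauss-Newton step length (5), filter update (3).
    filt = filt.detach().requires_grad_(True)
    scores = F.conv2d(feats, filt, padding=filt.shape[-1] // 2)
    r = v_c * (m_c * scores + (1 - m_c) * torch.clamp(scores, min=0) - y_c)    # residual, eq. (2)
    loss = (r ** 2).sum() / feats.shape[0] + ((reg_lambda * filt) ** 2).sum()  # loss, eq. (1)
    (grad,) = torch.autograd.grad(loss, filt)

    # Squared norm of h = J grad (step 4), exploiting the Gauss-Newton structure.
    q_c = v_c * (m_c + (1 - m_c) * (scores > 0).float())       # point-wise Jacobian factor
    grad_scores = F.conv2d(feats, grad, padding=grad.shape[-1] // 2)           # (ds/df) grad
    h_sq = ((q_c * grad_scores) ** 2).sum() / feats.shape[0] + ((reg_lambda * grad) ** 2).sum()

    alpha = (grad ** 2).sum() / h_sq.clamp(min=1e-8)            # step length, eq. (5)
    return (filt - alpha * grad).detach()                       # update, eq. (3)
```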

3.3. Initial Filter Prediction

To further reduce the number of optimization recursions required in D, we introduce a small network module that predicts an initial model estimate f^(0). Our initializer network consists of a convolutional layer followed by a precise ROI pooling [17]. The latter extracts features from the target region and pools them to the same size as the target model f. The pooled feature maps are then averaged over all the samples in Strain to obtain the initial model f^(0). As in Siamese trackers, this approach only utilizes the target appearance. However, rather than predicting the final model, our initializer network is tasked with only providing a reasonable initial estimate, which is then processed by the optimizer module to provide the final model.

3.4. Learning the Discriminative Learning Loss

Here, we describe how the free parameters in the residual function (2), defining the loss (1), are learned. Our residual function includes the label confidence scores y_c, the spatial weight function v_c and the target mask m_c. While such variables are constructed by hand in current discriminative online learning based trackers, our approach in fact learns these functions from data. We parametrize them based on the distance from the target center. This is motivated by the radial symmetry of the problem, where the direction to the sample location relative to the target is of little significance. In contrast, the distance to the sample location plays a crucial role, especially in the transition from target to background. Thus, we parameterize y_c, m_c and v_c using radial basis functions ρ_k and learn their coefficients φ_k. For instance, the label y_c at position t ∈ R^2 is given by

y_c(t) = \sum_{k=0}^{N-1} \phi_k^y \, \rho_k(\| t - c \|) .    (6)

We use triangular basis functions ρ_k, defined as

\rho_k(d) = \begin{cases} \max\big(0, \, 1 - \frac{|d - k\Delta|}{\Delta}\big), & k < N - 1 \\ \max\big(0, \, \min\big(1, \, 1 + \frac{d - k\Delta}{\Delta}\big)\big), & k = N - 1 \end{cases}    (7)

The above formulation corresponds to a continuous piecewise linear function with a knot displacement of Δ. Note that the final case k = N − 1 represents all locations that are far away from the target center and thus can be treated identically. We use a small Δ to enable an accurate representation of the regression label at the target-background transition. The functions v_c and m_c are parameterized analogously, using coefficients φ_k^v and φ_k^m respectively in (6). For the target mask m_c, we constrain the values to the interval [0, 1] by passing the output from (6) through a Sigmoid function.

We use N = 100 basis functions and set the knot displacement to Δ = 0.1 in the resolution of the deep feature space X. For offline training, the regression label y_c is initialized to the same Gaussian z_c used in the offline classification loss, described in section 3.6. The weight function v_c is initialized to the constant v_c(t) = 1. Lastly, we initialize the target mask m_c using a scaled tanh function. The coefficients φ_k, along with λ, are learned as part of the model prediction network D (see section 3.6). The initial and learned values for y_c, m_c and v_c are visualized in figure 3. Notably, our network learns to increase the weight v_c at the target center and reduce it in the ambiguous transition region. A code sketch of this parametrization is given below.
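As a concrete illustration of (6)-(7), the following sketch evaluates the triangular basis functions and assembles a label map from a coefficient vector. In the actual model the coefficients φ_k would be learnable parameters trained end-to-end; the helper names here are hypothetical.

```python
import torch

def triangular_basis(dist, num_bins=100, bin_size=0.1):
    # Eq. (7): triangular basis functions rho_k evaluated at distances `dist` from the center.
    k = torch.arange(num_bins, dtype=dist.dtype, device=dist.device).view(-1, 1)
    d = dist.reshape(1, -1)
    rho = torch.clamp(1.0 - (d - k * bin_size).abs() / bin_size, min=0.0)
    # The last basis function covers all locations far from the target (the k = N-1 case).
    far = 1.0 + (d[0] - (num_bins - 1) * bin_size) / bin_size
    rho[-1] = torch.clamp(far, min=0.0, max=1.0)
    return rho                                     # shape: (num_bins, num_locations)

def label_map(dist, coeff, bin_size=0.1):
    # Eq. (6): y_c(t) = sum_k phi_k rho_k(||t - c||), with learned coefficients phi_k.
    rho = triangular_basis(dist, num_bins=coeff.numel(), bin_size=bin_size)
    return coeff.reshape(1, -1) @ rho              # shape: (1, num_locations)
```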
Uniquely, our training allows the model predictor D to learn how to better utilize multiple samples. The sets are constructed by sampling a random segment of length Tss in the sequence. We then construct Mtrain and Mtest by sampling N_frames frames each from the first and second halves of the segment respectively.

Given the pair (Mtrain, Mtest), we first pass the images through the backbone feature extractor to construct the train Strain and test Stest samples for our target model. Formally, the train set is obtained as Strain = {(F(I_j), c_j) : (I_j, b_j) ∈ Mtrain}, where c_j is the center coordinate of the box b_j. This is input to the target predictor f = D(Strain). The aim is to predict a model f that is discriminative and that generalizes well to future unseen frames. We therefore only evaluate the predicted model f on the test samples Stest, obtained analogously using Mtest. Following the discussion in section 3.1, we compute the regression errors using a hinge for the background samples,

\ell(s, z) = \begin{cases} s - z, & z > T \\ \max(0, s), & z \le T \end{cases}    (8)

Here, the threshold T defines the target and background region based on the label confidence value z. For the target region z > T we take the difference between the predicted confidence score s and the label z, while we only penalize positive confidence values for the background z ≤ T.

The total target classification loss is computed as the mean squared error (8) over all test samples. However, instead of only evaluating the final target model f, we average the loss over the estimates f^(i) obtained in each iteration i by the optimizer (see alg. 1). This introduces intermediate supervision to the target prediction module, benefiting training convergence. Furthermore, we do not aim to train for a specific number of recursions, but rather want to be free to set the desired number of optimization recursions online. It is thus natural to evaluate each iterate f^(i) equally. The target classification loss used for offline training is given by,

L_{cls} = \frac{1}{N_{iter}} \sum_{i=0}^{N_{iter}} \sum_{(x,c) \in S_{test}} \big\| \ell\big(x * f^{(i)}, z_c\big) \big\|^2 .    (9)

Here, the regression label z_c is set to a Gaussian function centered at the target c. Note that the output f^(0) from the filter initializer (section 3.3) is also included in the above loss. Although not denoted explicitly to avoid clutter, both x and f^(i) in (9) depend on the parameters of the feature extraction network F. The model iterates f^(i) additionally depend on the parameters in the model predictor network D.
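A small sketch of the hinge regression error (8) and the classification loss (9), assuming the score maps for each filter iterate f^(i) have already been computed; the function names are illustrative and not part of the released code.

```python
import torch

def hinge_regression_error(scores, labels, threshold=0.05):
    # Eq. (8): plain difference inside the target region (z > T),
    # hinge on positive scores in the background region (z <= T).
    target_region = labels > threshold
    return torch.where(target_region, scores - labels, torch.clamp(scores, min=0))

def classification_loss(score_maps, labels, threshold=0.05):
    # Eq. (9): mean squared hinge error, averaged over the filter iterates f^(0), ..., f^(Niter).
    # score_maps: list of score tensors, one per optimizer iterate, each (num_test, 1, H, W).
    losses = [hinge_regression_error(s, labels, threshold).pow(2).mean() for s in score_maps]
    return sum(losses) / len(losses)
```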
For bounding box estimation, we extend the training procedure in [6] to image sets by computing the modulation vector on the first frame in Mtrain and sampling candidate boxes from all images in Mtest. The bounding box estimation loss Lbb is computed as the mean squared error between the predicted IoU overlaps in Mtest and the ground truth. We train the full tracking architecture by combining this with the target classification loss (9) as Ltot = βLcls + Lbb.

Training details: We use the training splits of the TrackingNet [27], LaSOT [10], GOT10k [16] and COCO [24] datasets. The backbone network is initialized with the ImageNet weights. We train for 50 epochs by sampling 20,000 videos per epoch, giving a total training time of less than 24 hours on a single Nvidia TITAN X GPU. We use ADAM [19] with a learning rate decay of 0.2 every 15th epoch. The target classification loss weight is set to β = 10^2 and we use N_iter = 5 optimizer module recursions in (9) during training. The image patches in (Mtrain, Mtest) are extracted by sampling a random translation and scale relative to the target annotation. We set the base scale to 5 times the target size to incorporate significant background information. For each sequence, we sample N_frames = 3 test and train frames, using a segment length of Tss = 60. The label scores z_c are constructed using a standard deviation of 1/4 relative to the base target size, and we use T = 0.05 for the regression error (8). We employ the ResNet architecture for the backbone. For the model predictor D, we use features extracted from the third block, having a spatial stride of 16. We set the kernel size of the target model f to 4 × 4.

3.7. Online Tracking

Given the first frame with annotation, we employ data augmentation strategies [3] to construct an initial set Strain containing 15 samples. The target model is then obtained using our discriminative model prediction architecture f = D(Strain). For the first frame, we employ 10 steepest descent recursions after the initializer module. Our approach allows the target model to be easily updated by adding a new training sample to Strain whenever the target is predicted with sufficient confidence. We ensure a maximum memory size of 50 by discarding the oldest sample. During tracking, we refine the target model f by performing two optimizer recursions every 20 frames, or a single recursion whenever a distractor peak is detected. Bounding box estimation is performed using the same settings as in [6].
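The sample-memory bookkeeping and update schedule described above could be sketched as follows; the class and function names are hypothetical and only illustrate the logic, not the released tracker code.

```python
import torch

class SampleMemory:
    """Keep at most `capacity` training samples, discarding the oldest (section 3.7)."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.samples = []                 # list of (feature_map, target_center) tensors

    def add(self, feat, center):
        self.samples.append((feat, center))
        if len(self.samples) > self.capacity:
            self.samples.pop(0)           # discard the oldest sample

    def training_set(self):
        feats = torch.stack([f for f, _ in self.samples])
        centers = torch.stack([c for _, c in self.samples])
        return feats, centers

def recursions_this_frame(frame_idx, distractor_detected, interval=20):
    # Two optimizer recursions every `interval` frames, a single recursion
    # as soon as a distractor peak is detected, otherwise no refinement.
    if distractor_detected:
        return 1
    return 2 if frame_idx % interval == 0 else 0
```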
4. Experiments

Our approach is implemented in Python using PyTorch, and operates at 57 FPS with a ResNet-18 backbone and 43 FPS with ResNet-50 on a single Nvidia GTX 1080 GPU. Detailed results are provided in the supplementary material (sections S3-S6).

4.1. Analysis of our Approach

Here, we perform an extensive analysis of the proposed model prediction architecture. Experiments are performed on a combined dataset containing the entire OTB-100 [43], NFS (30 FPS version) [12] and UAV123 [26] datasets. This pooled dataset contains 323 diverse videos to enable thorough analysis. The trackers are evaluated using the AUC [43] metric. Due to the stochastic nature of the tracker, we always report the average AUC score over 5 runs. We employ ResNet-18 as the backbone network for this analysis.

Impact of optimizer module: We compare our proposed method, utilizing the steepest descent (SD) based architecture, with two alternative approaches. Init: Here, we only use the initializer module to predict the final target model, which corresponds to removing the optimizer module in our approach. Thus, similar to the Siamese approaches, only target appearance information is used for model prediction, while background information is discarded. GD: In this approach, we replace steepest descent with the gradient descent (GD) algorithm using learned coefficient-wise step-lengths α in (3). All networks are trained using the same settings. The results for this analysis are shown in table 1.

           Init   GD    SD
AUC (%)    58.2  61.6  63.8

Table 1. Analysis of different model prediction architectures on the combined OTB-100, NFS and UAV123 datasets. The architecture using only the target information for model prediction (Init) achieves an AUC score of 58.2%. The proposed steepest descent based architecture (SD) provides the best results, outperforming the gradient descent method (GD) by over 2.2% AUC score.

The model predicted by the initializer network, which uses only target information, achieves an AUC score of 58.2%. The gradient descent approach, which can exploit background information, provides a substantial improvement, achieving an AUC score of 61.6%. This highlights the importance of employing discriminative learning for model prediction. Our steepest descent approach obtains the best results, outperforming GD by 2.2%. This is due to the superior convergence properties of steepest descent, important for offline learning and fast online tracking.

Analysis of model prediction architecture: Here, we analyze the impact of key aspects of the proposed discriminative online learning architecture, by incrementally adding them one at a time. The results are shown in table 2. The baseline SD constitutes our steepest descent based optimizer module along with a fixed ResNet-18 network trained on ImageNet. That is, similar to the current state-of-the-art discriminative approaches, we do not fine-tune the backbone. Instead of learning the discriminative loss, we employ the regression error (8) in the optimizer module. This baseline approach achieves an AUC score of 58.7%. By adding the model initializer module (+Init), we achieve a significant gain of 1.3% in AUC score. Further training the entire network, including the backbone feature extractor (+FT), leads to a major improvement of 2.6% in AUC score. This demonstrates the advantages of learning specialized features suitable for tracking through end-to-end learning. Using an additional convolutional block to extract classification-specific features (+Cls) yields a further improvement of 0.7% AUC score. Finally, learning the discriminative loss (2) itself (+Loss), as described in section 3.4, improves the AUC score by another 0.5%. This shows the benefit of learning the implicit online loss by maximizing the generalization capabilities of the model on future frames.

           SD    +Init  +FT   +Cls  +Loss
AUC (%)   58.7   60.0   62.6  63.3  63.8

Table 2. Analysis of the impact of the initializer module (+Init), training the backbone (+FT), using an extra conv. block (+Cls) and offline learning of the loss (+Loss), by incrementally adding them one at a time. The baseline SD constitutes our steepest descent based optimizer module along with a ResNet-18 trained on ImageNet.

Impact of online model update: Here, we analyze the impact of updating the target model online, using information from previously tracked frames. We compare three different model update strategies. i) No update: The model is not updated during tracking. Instead, the model predicted in the first frame by our model predictor D is employed for the entire sequence. ii) Model averaging: In each frame, the target model is updated using a linear combination of the current and newly predicted model, as commonly employed in tracking [15, 18, 41]. iii) Ours: The target model is obtained using the training set constructed online, as described in section 3.7. The naïve model averaging fails to improve over the baseline method with no updates (see table 3). In contrast, our approach obtains a significant gain of about 2% in AUC score over both methods, indicating that our approach can effectively adapt the target model online.

           No update  Model averaging  Ours
AUC (%)      61.7          61.7        63.8

Table 3. Comparison of different model update strategies on the combined OTB-100, NFS and UAV123 datasets.

4.2. State-of-the-art Comparison

We compare our proposed approach DiMP with the state-of-the-art methods on seven challenging tracking benchmarks. Results for two versions of our approach are shown: DiMP-18 and DiMP-50, employing ResNet-18 and ResNet-50 respectively as the backbone network.

VOT2018 [20]: We evaluate our approach on the 2018 version of the Visual Object Tracking (VOT) challenge, consisting of 60 challenging videos. Trackers are evaluated using the measures accuracy (average overlap over successfully tracked frames) and robustness (failure rate).
Both these measures are combined to obtain the EAO (Expected Average Overlap) score used to rank trackers. The results are shown in table 4. Among previous approaches, SiamRPN++ achieves the best accuracy and EAO. However, it attains much inferior robustness compared to the discriminative learning based approaches, such as MFT and LADCF. Similar to the aforementioned approaches, SiamRPN++ employs ResNet-50 for feature extraction. Our approach DiMP-50, employing the same backbone network, significantly outperforms SiamRPN++ with a relative gain of 6.3% in terms of EAO. Further, compared to SiamRPN++, our approach has a 34% lower failure rate, while achieving similar accuracy. This shows that discriminative model prediction is crucial for robust tracking.

             DRT [38]  RCO [20]  UPDT [3]  DaSiamRPN [46]  MFT [20]  LADCF [44]  ATOM [6]  SiamRPN++ [22]  DiMP-18  DiMP-50
EAO           0.356     0.376     0.378       0.383         0.385      0.389      0.401       0.414        0.402    0.440
Robustness    0.201     0.155     0.184       0.276         0.140      0.159      0.204       0.234        0.182    0.153
Accuracy      0.519     0.507     0.536       0.586         0.505      0.503      0.590       0.600        0.594    0.597

Table 4. State-of-the-art comparison on the VOT2018 dataset in terms of expected average overlap (EAO), accuracy and robustness.

LaSOT [10]: We evaluate our approach on the test set consisting of 280 videos. The success plots are shown in figure 4. Compared to other datasets, LaSOT has longer sequences, with an average of 2500 frames per sequence. Thus, online model adaption is crucial for this dataset. The previous best approach ATOM [6] employs online discriminative learning with pre-trained ResNet-18 features. Our end-to-end trained approach, using the same backbone architecture, outperforms ATOM with a relative gain of 3.3%, showing the impact of end-to-end training. DiMP-50 further improves the results with an AUC score of 56.9%. These results demonstrate the powerful model adaption capabilities of our method on long sequences.

Figure 4. Success plot on the LaSOT dataset. AUC scores shown in the legend: DiMP-50 [56.9], DiMP-18 [53.2], ATOM [51.5], SiamRPN++ [49.6], MDNet [39.7], VITAL [39.0], SiamFC [33.6], StructSiam [33.5], DSiam [33.3], ECO [32.4].

TrackingNet [27]: We evaluate our approach on the test set of the large-scale TrackingNet dataset. The results are shown in table 5. SiamRPN++ achieves an impressive AUC score of 73.3%. Our approach, with the same ResNet-50 backbone as in SiamRPN++, outperforms all previous methods by achieving an AUC score of 74.0%.

                    ECO [7]  SiamFC [2]  CFNet [41]  MDNet [30]  UPDT [3]  DaSiamRPN [46]  ATOM [6]  SiamRPN++ [22]  DiMP-18  DiMP-50
Precision (%)         49.2     53.3        53.3        56.5        55.7        59.1          64.8        69.4         66.6     68.7
Norm. Prec. (%)       61.8     66.6        65.4        70.5        70.2        73.3          77.1        80.0         78.5     80.1
Success (AUC) (%)     55.4     57.1        57.8        60.6        61.1        63.8          70.3        73.3         72.3     74.0

Table 5. State-of-the-art comparison on the TrackingNet test set in terms of precision, normalized precision, and success.

GOT10k [16]: This is a large-scale dataset containing over 10,000 videos, 180 of which form the test set used for evaluation. Interestingly, there is no overlap in object classes between the train and test splits, promoting the importance of generalization to unseen object classes. To ensure fair evaluation, the trackers are forbidden from using external datasets for training. We follow this protocol by retraining our trackers using only the GOT10k train split. Results are shown in table 6. ATOM achieves an average overlap (AO) score of 55.6%. Our ResNet-18 version outperforms ATOM with a relative gain of 4.1%. Our ResNet-50 version achieves the best AO score of 61.1%, verifying the strong generalization abilities of our tracker.

             MDNet [30]  CF2 [25]  ECO [7]  CCOT [9]  GOTURN [14]  SiamFC [2]  SiamFCv2 [41]  ATOM [6]  DiMP-18  DiMP-50
SR0.50 (%)      30.3       29.7     30.9      32.8        37.5        35.3         40.4         63.4      67.2     71.7
SR0.75 (%)       9.9        8.8     11.1      10.7        12.4         9.8         14.4         40.2      44.6     49.2
AO (%)          29.9       31.5     31.6      32.5        34.7        34.8         37.4         55.6      57.9     61.1

Table 6. State-of-the-art comparison on the GOT10k test set in terms of average overlap (AO), and success rates (SR) at overlap thresholds 0.5 and 0.75.

            ECOhc [7]  DaSiamRPN [46]  ATOM [6]  CCOT [9]  MDNet [30]  ECO [7]  SiamRPN++ [22]  UPDT [3]  DiMP-18  DiMP-50
NFS             -           -            58.4      48.8       41.9       46.6         -           53.6      61.0     61.9
OTB-100       64.3         65.8          66.3      68.2       67.8       69.1        69.6         70.4      66.0     68.4
UAV123        51.2         57.7          64.2      51.3        -         53.2         -           54.5      64.3     65.3

Table 7. State-of-the-art comparison on the NFS, OTB-100 and UAV123 datasets in terms of AUC score.

Need for Speed [12]: We evaluate our approach on the 30 FPS version of the dataset, containing challenging videos with fast-moving objects. The AUC scores over all the 100 videos are shown in table 7. The previous best method ATOM achieves an AUC score of 58.4%. Our approach outperforms ATOM with relative gains of 4.4% and 6.0% using ResNet-18 and ResNet-50 respectively.

OTB-100 [43]: Table 7 shows the AUC scores over all the 100 videos in the dataset. Among the compared methods, UPDT achieves the best results with an AUC score of 70.4%. Our DiMP-50 achieves an AUC score of 68.4%, competitive with the other state-of-the-art approaches.

UAV123 [26]: This dataset consists of 123 low altitude aerial videos captured from a UAV. Results in terms of AUC are shown in table 7. Among previous methods, ATOM achieves an AUC score of 64.2%. Both DiMP-18 and DiMP-50 outperform ATOM, achieving AUC scores of 64.3% and 65.3%, respectively.

5. Conclusions

We propose a tracking architecture that is trained offline in an end-to-end manner. Our approach is derived from a discriminative learning loss by applying an iterative optimization procedure. By employing a steepest descent based optimizer and an effective model initializer, our approach can predict a powerful model in only a few optimization steps. Further, our approach learns the discriminative loss during offline training by minimizing the prediction error on unseen test frames. Our approach sets a new state-of-the-art on 6 tracking benchmarks, while operating at over 40 FPS.

Acknowledgments: This work was supported by the ETH General Fund (OK), and Nvidia through a hardware grant.
References

[1] L. Bertinetto, J. F. Henriques, J. Valmadre, P. H. S. Torr, and A. Vedaldi. Learning feed-forward one-shot learners. In NIPS, 2016. 2
[2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV workshop, 2016. 1, 2, 8
[3] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg. Unveiling the power of deep tracking. In ECCV, 2018. 6, 7, 8
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010. 3
[5] J. Choi, J. Kwon, and K. M. Lee. Deep meta learning for real-time visual tracking based on target-specific feature space. CoRR, abs/1712.09153, 2017. 2
[6] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. ATOM: Accurate tracking by overlap maximization. In CVPR, 2019. 2, 3, 5, 6, 7, 8, 12
[7] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg. ECO: efficient convolution operators for tracking. In CVPR, 2017. 2, 3, 8
[8] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015. 2
[9] M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016. 2, 8
[10] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling. Lasot: A high-quality benchmark for large-scale single object tracking. CoRR, abs/1809.07845, 2018. 2, 6, 8, 12
[11] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017. 2
[12] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey. Need for speed: A benchmark for higher frame rate object tracking. In ICCV, 2017. 2, 6, 8, 12
[13] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning dynamic siamese network for visual object tracking. In ICCV, 2017. 2
[14] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016. 8
[15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. TPAMI, 37(3):583–596, 2015. 2, 3, 7
[16] L. Huang, X. Zhao, and K. Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. arXiv preprint arXiv:1810.11981, 2018. 2, 6, 8, 12
[17] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018. 4
[18] H. Kiani Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. In ICCV, 2017. 2, 7
[19] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014. 6
[20] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al. The sixth visual object tracking vot2018 challenge results. In ECCV workshop, 2018. 2, 7, 12
[21] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Čehovin, G. Fernández, T. Vojı́r, G. Nebehay, R. Pflugfelder, and G. Häger. The visual object tracking vot2015 challenge results. In ICCV workshop, 2015. 12
[22] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In CVPR, 2019. 7, 8, 12
[23] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu. High performance visual tracking with siamese region proposal network. In CVPR, 2018. 1, 2
[24] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014. 6, 12
[25] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015. 2, 8
[26] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for uav tracking. In ECCV, 2016. 2, 6, 8, 12
[27] M. Müller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In ECCV, 2018. 2, 6, 8, 12
[28] T. Munkhdalai and H. Yu. Meta networks. Proceedings of Machine Learning Research, 70:2554–2563, 2017. 2
[29] D. B. Naik and R. J. Mammone. Meta-neural networks that learn by learning. In [Proceedings 1992] IJCNN International Joint Conference on Neural Networks, 1:437–442 vol.1, 1992. 2
[30] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016. 2, 8
[31] J. Nocedal and S. J. Wright. Numerical Optimization. Springer, 2nd edition, 2006. 4
[32] E. Park and A. C. Berg. Meta-tracker: Fast and robust online adaptation for visual object trackers. In ECCV, 2018. 2
[33] S. Ravi and H. Larochelle. Optimization as a model for few-shot learning. In ICLR, 2017. 2
[34] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, 2015. 2
[35] J. Schmidhuber. Evolutionary principles in self-referential learning. On learning now to learn: The meta-meta-meta...-hook. Diploma thesis, Technische Universitat Munchen, Germany, 14 May 1987. 2
[36] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Comput., 4(1):131–139, Jan. 1992. 2
[37] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, 1994. 4
[38] C. Sun, D. Wang, H. Lu, and M. Yang. Correlation tracking
via joint discrimination and reliability learning. In CVPR,
2018. 7
[39] R. Tao, E. Gavves, and A. W. M. Smeulders. Siamese in-
stance search for tracking. In CVPR, 2016. 2
[40] S. Thrun and L. Pratt, editors. Learning to Learn. Kluwer
Academic Publishers, Norwell, MA, USA, 1998. 2
[41] J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and
P. H. S. Torr. End-to-end representation learning for correla-
tion filter based tracking. In CVPR, 2017. 2, 7, 8
[42] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. J. Maybank.
Learning attentions: Residual attentional siamese network
for high performance online visual tracking. In CVPR, 2018.
2
[43] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark.
TPAMI, 37(9):1834–1848, 2015. 2, 6, 7, 8, 12
[44] T. Xu, Z. Feng, X. Wu, and J. Kittler. Learning adaptive dis-
criminative correlation filters via temporal consistency pre-
serving spatial feature selection for robust visual tracking.
CoRR, abs/1807.11348, 2018. 7
[45] Y. Yao, X. Wu, S. Shan, and W. Zuo. Joint representation
and truncated inference learning for correlation filter based
tracking. In ECCV, 2018. 2
[46] Z. Zhu, Q. Wang, L. Bo, W. Wu, J. Yan, and W. Hu.
Distractor-aware siamese networks for visual object track-
ing. In ECCV, 2018. 2, 7, 8
Supplementary Material

This supplementary material provides additional details and results. Section S1 derives the closed-form expression of the filter gradient, employed in the optimizer module. In section S2 we derive the application of the Jacobian in order to compute the quantity h, employed in algorithm 1 in the paper. In section S3 we provide detailed results on the VOT2018 dataset, while in section S4 we provide detailed results on the LaSOT dataset. We also provide additional details on the NFS, OTB-100 and UAV123 datasets in section S5. We analyze the impact of training with less data in section S6. Finally, we provide a 2D visualization of the learned functions parametrizing the discriminative loss in section S7.

S1. Closed-Form Expression for ∇L

Here, we derive a closed-form expression for the gradient of the loss (1) in the main paper, also restated here,

L(f) = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \| r(s, c) \|^2 + \| \lambda f \|^2 .    (S1)

Here, s = x ∗ f is the score map obtained after convolving the deep feature map x with the target model f. The training set is given by Strain = {(x_j, c_j)}_{j=1}^n. The residual function r(s, c) is defined as (also eq. (2) in the paper),

r(s, c) = v_c \cdot \big( m_c s + (1 - m_c) \max(0, s) - y_c \big) .    (S2)

The gradient ∇L(f) of the loss (S1) w.r.t. the filter coefficients f is then computed as,

\nabla L(f) = \frac{2}{|S_{train}|} \sum_{(x,c) \in S_{train}} \left( \frac{\partial r_{s,c}}{\partial f} \right)^T r_{s,c} + 2 \lambda^2 f .    (S3)

Here, we have defined r_{s,c} = r(s, c), and ∂r_{s,c}/∂f corresponds to the Jacobian of the residual function (S2) w.r.t. the filter coefficients f. Using eq. (S2) we obtain,

\frac{\partial r_{s,c}}{\partial f} = \mathrm{diag}(v_c m_c) \frac{\partial s}{\partial f} + \mathrm{diag}\big( v_c (1 - m_c) \cdot \mathbb{1}_{s>0} \big) \frac{\partial s}{\partial f} = \mathrm{diag}(q_c) \frac{\partial s}{\partial f} .    (S4)

Here, diag(q_c) denotes a diagonal matrix containing the elements in q_c. Further, q_c = v_c m_c + v_c (1 − m_c) · 1_{s>0} is computed using only point-wise operations, where 1_{s>0} is 1 for positive s and 0 otherwise. Using eqs. (S3) and (S4) we finally obtain,

\nabla L(f) = \frac{2}{|S_{train}|} \sum_{(x,c) \in S_{train}} \left( \frac{\partial s}{\partial f} \right)^T (q_c \cdot r_{s,c}) + 2 \lambda^2 f .    (S5)

Here, · denotes the element-wise product. The multiplication with the transposed Jacobian (∂s/∂f)^T corresponds to backpropagation of the input q_c · r_{s,c} through the convolution layer f ↦ x ∗ f. This is implemented as a transposed convolution with x. The closed-form expression (S5) is thus easily implemented using standard operations in a deep learning library like PyTorch.

S2. Calculation of h in Algorithm 1

In this section, we show the calculation of h = J^(i) ∇L(f^(i)), used when determining the optimal step length α in Algorithm 1 in the main paper. Since we only need the squared L2 norm of h in the step length calculation, we directly derive an expression for ‖h‖² = ‖J^(i) ∇L(f^(i))‖². Here, J^(i) = ∂ξ/∂f |_{f^(i)} is the Jacobian of the residual vector ξ of the loss (S1), evaluated at the filter estimate f^(i). Not to be confused with the residual function (S2), the residual vector ξ is obtained as the concatenation of the individual residuals ξ_j = r(x_j ∗ f, c_j)/√n for j ∈ {1, . . . , n} and ξ_j = λf for j = n + 1. Here, n = |S_train| is the number of samples in S_train. Consequently, we get,

\| h \|^2 = \big\| J^{(i)} \nabla L(f^{(i)}) \big\|^2    (S6)
          = \sum_{j=1}^{n+1} \Big\| \frac{\partial \xi_j}{\partial f}\Big|_{f^{(i)}} \nabla L(f^{(i)}) \Big\|^2
          = \sum_{j=1}^{n} \Big\| \frac{1}{\sqrt{n}} \frac{\partial r(x_j * f, c_j)}{\partial f}\Big|_{f^{(i)}} \nabla L(f^{(i)}) \Big\|^2 + \big\| \lambda \nabla L(f^{(i)}) \big\|^2
          = \frac{1}{n} \sum_{(x,c) \in S_{train}} \Big\| \frac{\partial r_{s,c}}{\partial f}\Big|_{f^{(i)}} \nabla L(f^{(i)}) \Big\|^2 + \big\| \lambda \nabla L(f^{(i)}) \big\|^2 .

Using eqs. (S6) and (S4) we finally obtain,

\| h \|^2 = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \Big\| q_c \cdot \Big( \frac{\partial s}{\partial f}\Big|_{f^{(i)}} \nabla L(f^{(i)}) \Big) \Big\|^2 + \big\| \lambda \nabla L(f^{(i)}) \big\|^2
          = \frac{1}{|S_{train}|} \sum_{(x,c) \in S_{train}} \big\| q_c \cdot \big( x * \nabla L(f^{(i)}) \big) \big\|^2 + \big\| \lambda \nabla L(f^{(i)}) \big\|^2 .

As described in section S1, ∇L(f^(i)) is computed using the closed-form expression (S5). The term ∂s/∂f |_{f^(i)} ∇L(f^(i)) corresponds to the convolution of x with ∇L(f^(i)), i.e. ∂s/∂f |_{f^(i)} ∇L(f^(i)) = x ∗ ∇L(f^(i)). Thus, ‖h‖² is easily computed using standard operations from deep learning libraries.
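As a sanity check of (S3)-(S5), the sketch below computes the filter gradient, delegating the multiplication with the transposed Jacobian (∂s/∂f)^T to a one-step autograd vector-Jacobian product; an explicit implementation would use a transposed (weight-gradient) convolution as described in the text. Names and shapes are assumptions, not the released code.

```python
import torch
import torch.nn.functional as F

def filter_gradient(filt, feats, y_c, m_c, v_c, reg_lambda):
    # Closed-form gradient (S5) of the loss (S1) w.r.t. the filter coefficients.
    filt = filt.detach().requires_grad_(True)
    scores = F.conv2d(feats, filt, padding=filt.shape[-1] // 2)                    # s = x * f
    r = v_c * (m_c * scores + (1 - m_c) * torch.clamp(scores, min=0) - y_c)        # (S2)
    q_c = v_c * (m_c + (1 - m_c) * (scores > 0).float())                           # (S4)
    # (ds/df)^T (q_c * r), summed over the training samples, computed via autograd.
    (jtv,) = torch.autograd.grad(scores, filt, grad_outputs=q_c * r)
    return 2.0 / feats.shape[0] * jtv + 2.0 * reg_lambda ** 2 * filt.detach()      # (S5)
```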
Figure S1. Expected average overlap curve on the VOT2018 dataset, showing the expected overlap between tracker prediction and ground truth for different sequence lengths. The EAO measure, computed as the average of the expected average overlap over typical sequence lengths (grey region in the plot), is shown in the legend: DiMP-50 [0.440], SiamRPN++ [0.415], DiMP-18 [0.402], ATOM [0.401], LADCF [0.389], MFT [0.385], DaSiamRPN [0.383], UPDT [0.378], RCO [0.376], DRT [0.356]. Our approach achieves the best EAO score, outperforming the previous best approach SiamRPN++ [22] with a relative gain of 6.3% in terms of EAO.

Figure S2. Normalized precision plot on the LaSOT dataset, with the area under the curve shown in the legend: DiMP-50 [65.0], DiMP-18 [61.0], ATOM [57.6], SiamRPN++ [56.9], MDNet [46.0], VITAL [45.3], SiamFC [42.0], StructSiam [41.8], DSiam [40.5], ECO [33.8]. Both our ResNet-18 and ResNet-50 versions outperform all previous methods by significant margins.
S3. Detailed Results on VOT2018

In this section, we provide detailed results on the VOT2018 [20] dataset. The VOT protocol evaluates the expected average overlap (EAO) between the tracker predictions and the ground truth bounding boxes for different sequence lengths. The trackers are then ranked using the EAO measure, which computes the average of the expected average overlaps over typical sequence lengths. We refer to [21] for further details about the EAO computation. Figure S1 plots the expected average overlap for different sequence lengths on the VOT2018 dataset. Our approach DiMP-50 achieves the best EAO score of 0.440.

S4. Detailed Results on LaSOT

Here, we provide the normalized precision plots on the LaSOT [10] dataset. These are obtained in the following manner. First, the normalized precision score Pnorm is computed as the percentage of frames in which the distance between the target location predicted by the tracker and the ground truth, relative to the target size, is less than a certain threshold. The normalized precision scores over all the videos are then plotted over a range of thresholds [0, 0.5] to obtain the normalized precision plots. The trackers are ranked using the area under the resulting curve. Figure S2 shows the normalized precision plots over all 280 videos in the LaSOT dataset. Both our ResNet-18 (DiMP-18) and ResNet-50 (DiMP-50) versions outperform all previous methods, achieving relative gains of 5.9% and 12.8% respectively over the previous best method, ATOM [6].

S5. Detailed Results on NFS, OTB-100, and UAV123

Here, we provide detailed results on the NFS [12], OTB-100 [43], and UAV123 [26] datasets. We use the overlap precision (OP) metric for evaluating the trackers. The OP score denotes the percentage of frames in a video for which the intersection-over-union (IoU) overlap between the tracker prediction and the ground truth bounding box exceeds a certain threshold. The mean OP scores over all the videos in a dataset are plotted over a range of thresholds [0, 1] to obtain the success plot. The area under this plot provides the AUC score, which is used to rank the trackers. We refer to [43] for further details. The success plots over the entire NFS, OTB-100, and UAV123 datasets are shown in figure S3. Our tracker using the ResNet-50 backbone, denoted DiMP-50, achieves the best results on both the NFS and UAV123 datasets, while obtaining results competitive with the state-of-the-art on the, now saturated, OTB-100 dataset. On the challenging NFS dataset, our approach achieves an absolute gain of 3.5% AUC score over the previous best method ATOM [6].

S6. Impact of Training Data

Here, we investigate the impact of the number of videos used for training on the tracking performance. We train different versions of our tracker using the same datasets as in the main paper, i.e. TrackingNet [27], LaSOT [10], GOT10k [16], and COCO [24], but using only a subset of videos from each dataset. The results on the combined OTB-100, NFS, and UAV123 datasets are shown in figure S4.
Figure S3. Success plots on the NFS (a), OTB-100 (b), and UAV123 (c) datasets. The area-under-the-curve (AUC) scores are shown in the legends: NFS: DiMP50 [61.9], DiMP18 [61.0], ATOM [58.4], UPDT [53.6], CCOT [48.8], ECO [46.6], MDNet [41.9]; OTB-100: UPDT [70.4], SiamRPN++ [69.6], ECO [69.1], DiMP50 [68.4], CCOT [68.2], MDNet [67.8], ATOM [66.3], DiMP18 [66.0], DaSiamRPN [65.8], ECOhc [64.3]; UAV123: DiMP50 [65.3], DiMP18 [64.3], ATOM [64.2], DaSiamRPN [57.7], UPDT [54.5], ECO [53.2], CCOT [51.3], ECOhc [51.2]. Our approach achieves the best scores on both the NFS and UAV123 datasets.

Figure S4. Impact of the percentage of total videos used for offline training (log x-axis). Results are shown on the combined OTB-100, NFS, and UAV123 datasets.

Observe that the performance degrades by only 1.5% when the model is trained with only 10% of the total videos. Even when using only 1% of the videos, our approach still obtains a respectable AUC score of around 58%.

S7. Visualizations of learned yc , mc , and vc


A 2D visualization of the learned regression label (y_c), target mask (m_c), and spatial weight (v_c) is provided in figure S5. Note that each of these quantities is in fact continuous and is here sampled at the discrete feature grid points. In this example, the target (red box) is centered in the image patch. From the figure, we can see that the network learns to give the samples in the target-background transition region less weight due to their ambiguous nature.
Figure S5. Visualization of the learned label y_c, spatial weight v_c, and target mask m_c (panels: Image, Label y_c, Spatial Weight v_c, Target Mask m_c). The red box denotes the target object.
