Wang et al. — SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking (CVPR 2019 paper)

* This work was done while Guangting Wang was an intern at MSRA.
Abstract

The greatest challenge facing visual object tracking is the simultaneous requirements on robustness and discrimination power. In this paper, we propose a SiamFC-based tracker, named SPM-Tracker, to tackle this challenge. The basic idea is to address the two requirements in two separate matching stages. Robustness is strengthened in the coarse matching (CM) stage through generalized training, while discrimination power is enhanced in the fine matching (FM) stage through a distance learning network. The two stages are connected in series, as the input proposals of the FM stage are generated by the CM stage. They are also connected in parallel, as the matching scores and box location refinements are fused to generate the final results. This innovative series-parallel structure takes advantage of both stages and results in superior performance. The proposed SPM-Tracker, running at 120 fps on a GPU, achieves an AUC of 0.687 on OTB-100 and an EAO of 0.434 on VOT-16, exceeding other real-time trackers by a notable margin.

Figure 1. The series-parallel structure which connects the coarse matching and fine matching stages in the proposed SPM-Tracker.

1. Introduction

Visual object tracking is one of the fundamental research problems in computer vision and video analytics. Given the bounding box of a target object in the first frame of a video, a tracker is expected to locate the target object in all subsequent frames. The greatest challenge of visual tracking can be attributed to the simultaneous requirements on robustness and discrimination power. The robustness requirement demands that a tracker not lose the target when its appearance changes due to illumination, motion, view angle, or object deformation. Meanwhile, a tracker is required to have the power to discriminate the target object from cluttered background or similar surrounding objects. These two requirements are sometimes contradictory and hard to fulfill at the same time.

Intuitively, both requirements need to be handled through online training. A tracker keeps collecting positive and negative samples along the tracking process. For generative trackers, positive samples help to model the appearance variation of the target. For discriminative trackers, more positive and negative samples help to find a more precise decision boundary that separates the target from the background. For quite a long time, online training has been an indispensable part of tracker design. Recently, with the advancement of deep learning and convolutional neural networks, deep features have been widely adopted in object trackers [34, 9, 39, 15, 7, 30]. However, online training with deep features is extremely time consuming. Without much surprise, the deep versions of many high-performance trackers [9, 7, 3, 39, 34, 48, 53] can no longer run in real time, even on modern GPUs.

While the excessive volume of deep features brings speed issues to online training, their strong representational power also opens up the possibility to completely give up online training. The idea is, under a given distance measure, to learn an embedding space, through offline training, that can maximize the interclass inertia between different objects and minimize the intraclass inertia for the same object [58]. Note that maximizing the interclass inertia corresponds to the discrimination power, and minimizing the intraclass inertia corresponds to the robustness. The pioneering work along this research line is SiamFC [2]. In addition to the offline training, SiamFC uses a cross-correlation operation to efficiently measure the distance between the target patch and all surrounding patches. As a result, SiamFC can operate at 86 fps on a GPU.
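For readers unfamiliar with this operation, the cross-correlation step can be illustrated in a few lines of PyTorch: the template embedding is used as a convolution kernel and slid over the search-region embedding. This is our illustrative sketch, not the authors' code; the tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_correlation(template_feat, search_feat):
    """SiamFC-style dense matching: slide the template embedding over the
    search-region embedding and return a response map of similarity scores.

    template_feat: (1, C, h, w)  embedding of the target patch
    search_feat:   (1, C, H, W)  embedding of the search region
    returns:       (1, 1, H-h+1, W-w+1) response map
    """
    # Using the template as a convolution kernel computes the inner product
    # between the template and every same-sized window of the search region.
    return F.conv2d(search_feat, template_feat)

# Illustrative shapes only: a 6x6 template embedding against a 22x22 search
# embedding yields a 17x17 response map; the peak marks the predicted location.
z = torch.randn(1, 256, 6, 6)
x = torch.randn(1, 256, 22, 22)
response = cross_correlation(z, x)
assert response.shape == (1, 1, 17, 17)
```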
By design, the SiamFC framework faces challenges in balancing the robustness and the discrimination power of the embedding, and in handling the scale and aspect ratio changes of the target object. Recently, SiamRPN [26] was proposed to address the second challenge. It consists of a Siamese subnetwork for feature extraction and a region proposal subnetwork for similarity matching and box regression. In a follow-up work called DaSiamRPN [58], distractor-aware training is adopted to promote the generalization and discriminative ability of the embedding. In these two pieces of work, visual object tracking is formulated as a local one-shot detection task.

In this paper, we design a two-stage SiamFC-based network for visual object tracking, aiming to address both challenges mentioned above. The two stages are the coarse matching (CM) stage, which focuses on enhancing the robustness, and the fine matching (FM) stage, which focuses on improving the discrimination power. By decomposing these two equally important but somewhat contradictory requirements, our proposed network is expected to achieve better performance. Moreover, both the CM and FM stages perform similarity matching and bounding box regression. Thanks to the two-stage box refinement, our tracker achieves high localization precision without a multi-scale test.

The key innovation in this work is the series-parallel structure that is used to connect the two stages. The schematic diagram is shown in Fig. 1. Similar to the series structure which is widely adopted in two-stage object detection, the input of the second FM stage relies on the output of the first CM stage. In this sense, the CM stage is a proposal stage. Similar to the parallel structure, the final matching score as well as the box location are the fused results from both stages. This series-parallel structure brings a number of advantages, which will be detailed in Section 3. In addition, we propose generalized training (where objects from the same category are all treated as the same object) to boost the robustness of the CM stage. The discrimination power of the FM stage is promoted by replacing the cross-correlation layer with a distance learning subnetwork. With these three innovations, the resulting tracker achieves superior performance on major benchmark datasets. It achieves an AUC of 0.687 on OTB-100 and EAOs of 0.434 and 0.338 on VOT-16 and VOT-17, respectively. More importantly, the inference speed is 120 fps on an NVIDIA P100 GPU.

The rest of the paper is organized as follows. We discuss related work in Section 2. The proposed series-parallel framework is presented in Section 3. After describing the implementation details of SPM-Tracker in Section 4, we provide extensive experimental results in Section 5. Finally, we conclude the paper with some discussions in Section 6.

2. Related Work

Object trackers have conventionally been classified into generative trackers and discriminative trackers [24], and most modern trackers belong to the latter. A common approach of discriminative trackers is to build a binary classifier that represents the decision boundary between the object and its background [24]. It is generally believed that adaptive discriminative trackers, which continuously update the classifier during tracking, are more powerful than their static counterparts.

Correlation Filter (CF) based trackers are among the most successful and representative adaptive discriminative trackers. Bolme et al. [4] first proposed the MOSSE filter, which is capable of producing stable CFs from a single frame and then being continuously improved during tracking. The MOSSE filter has aroused a great deal of interest, and there is a large body of follow-up work. For example, kernel tricks [19, 20, 10] were introduced to extend CF. DSST [10] and SAMF [27] enabled scale estimation in CF. SRDCF [8] was proposed to alleviate the periodic effect of convolution boundaries.

More recently, with the advancement of deep learning, the rich representative power of deep features has become widely acknowledged. There is a trend to utilize deep features in CF-based trackers [31, 9, 7, 3]. However, this creates a dilemma: online training is an indispensable part of CF-based trackers, but online training with deep features is extremely slow.

In many real-world applications, being real-time is mandatory for a tracker. Facing the above-mentioned dilemma, many researchers resorted to another choice: static discriminative trackers. With the highly expressive deep features, it becomes possible to build high-performance static trackers. This idea was successfully realized by SiamFC [2]. SiamFC employs Siamese convolutional neural networks (CNNs) to extract features, and then uses a simple cross-correlation layer to perform dense and efficient sliding-window evaluation in the search region. Every patch of the same size as the target gets a similarity score, and the one with the highest score is identified as the new target location. There is also a great number of follow-up works [15, 52, 49], among which SA-Siam [17, 16] and SiamRPN [26, 58] are most related to ours.

The main challenge in SiamFC-based methods is to find an embedding space, through offline training, that is both robust and discriminative. Zhu et al. [58] propose distractor-aware training to emphasize these two aspects. They use diverse categories of positive still-image pairs to promote the robustness, and use semantic negative pairs to improve the discriminative ability. However, it is difficult to attend to both requirements in a single network. SA-Siam [17] and Siam-BM [16] adopt a two-branch network to encode images into two embedding spaces, one for semantic similarity (more robust) and the other for appearance similarity (more discriminative). This typical parallel structure does not take advantage of the innate proposal capability of the semantic branch.
Figure 2. Details of the proposed series-parallel matching framework. We employ Siamese AlexNet [25] for feature extraction. The CM stage adopts the network structure of SiamRPN [26]. RoI Align [18] is used to generate fixed-length regional features for each proposal. The FM stage implements a relation network [50] for distance learning. Finally, results from both stages are fused for decision making.

Another challenge in SiamFC-based methods is how to handle scale and shape changes. Almost all SiamFC-based trackers adopt an awkward multi-scale test for scale adjustment, but the aspect ratio of the bounding boxes remains unchanged throughout the tracking process. SiamRPN [26] addresses this issue with an elegant region proposal network (RPN). The capability to do box refinement also allows it to discard the multi-scale test. In this work, we follow SiamRPN and use an RPN for bounding box size adjustment. The two-stage refinement allows our SPM-Tracker to achieve an even more precise box location.

SiamRPN and DaSiamRPN [58] pose the tracking problem as a local single-stage object detection problem. Some recent empirical studies [22] on object detection show that a two-stage design is often more powerful than a one-stage design. This may be related to hard example mining [28] and regional feature alignment [18]. In the tracking community, Zhang et al. [55] adopt a two-stage design for long-term tracking. However, the series structure they adopted demands a very powerful second stage. They use MDNet [34] for the second stage, which greatly slows down the inference speed to 2 fps.

3. Our Approach

3.1. Series-Parallel Matching Framework

We propose a framework for robust and discriminative visual object tracking. The proposed SPM framework is depicted in Fig. 2. We employ a Siamese network to extract features from the target patch and the local search region. This is followed by two matching stages, namely the coarse matching stage and the fine matching stage, organized in a series-parallel structure.

Both the CM and FM stages produce similarity scores of proposals and box location deltas. We let the CM stage focus on the robustness, i.e. on minimizing the intraclass inertia for the same object. It is expected to propose the target object even when it is experiencing huge appearance changes. A number of proposals which get the top matching scores in the CM stage are then passed to the FM stage, and fixed-size regional features are extracted through RoI Align [18]. The FM stage is designed to focus on discrimination, i.e. on maximizing the interclass inertia between different objects. It is expected to discriminate the true target from background or surrounding similar objects. Eventually, the matching scores and box locations from both matching stages are fused to make the final decision.
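The data flow just described can be summarized in a short sketch. This is our reading of Fig. 2 rather than the released implementation; cm_stage, fm_stage and fuse are hypothetical stand-ins for the components detailed in Sections 3.2, 3.3 and 4.3.

```python
import torch

def spm_track_step(cm_stage, fm_stage, fuse, template_feat, search_feat, k=9):
    """One tracking step of the series-parallel matching scheme.

    Series part: the FM stage only evaluates proposals produced by the CM
    stage. Parallel part: scores and box deltas of BOTH stages are fused.
    """
    # Coarse matching: robust proposals plus a first round of box regression.
    cm_scores, cm_boxes = cm_stage(template_feat, search_feat)

    # Series connection: keep only the top-k coarse proposals.
    top = torch.topk(cm_scores, k).indices
    fm_scores, fm_boxes = fm_stage(template_feat, search_feat, cm_boxes[top])

    # Parallel connection: fuse the scores and box locations of both stages.
    return fuse(cm_scores[top], fm_scores, cm_boxes[top], fm_boxes)

# Toy stand-ins, only to make the data flow runnable end-to-end.
cm = lambda z, x: (torch.rand(200), torch.rand(200, 4))
fm = lambda z, x, boxes: (torch.rand(len(boxes)), boxes + 0.01)
fz = lambda sc, sf, bc, bf: (0.5 * (sc + sf), 0.5 * (bc + bf))
final_scores, final_boxes = spm_track_step(cm, fm, fz, None, None)
```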
The proposed SPM framework brings a number of advantages, as outlined below.

• The robustness and the discrimination requirements are decomposed and emphasized in separate stages. It is easier to train two networks to achieve their respective goals than to train a single network that simultaneously achieves the goals for both requirements.

• The input proposals of the FM stage are all high-score candidates from the CM stage. FM stage training benefits from a balanced positive-negative ratio and hard negative mining to enhance the discrimination power.

• Box regression in the CM stage allows the FM stage to evaluate aligned patches with a different scale or even a different aspect ratio from the target object. Fusion of the two-stage box regressions leads to a higher precision.

• Since only a few proposals are passed to the FM stage, it is not necessary to use the cross-correlation operation to compute distance. We can adopt a trainable distance measure for the FM stage.

In the following two subsections, we will discuss the CM and FM stages in more detail.

3.2. Coarse Matching Stage

The coarse matching stage looks in the search region for candidate patches which are similar to the target patch. It is expected to be very robust, such that the target object will not be missed even when it is experiencing drastic appearance changes due to intrinsic or extrinsic factors. We adopt the region proposal subnetwork introduced in SiamRPN [26] for this stage. Given the features extracted by a Siamese network, pair-wise correlation feature maps are computed for the classification branch and the regression branch. The classification branch produces the similarity scores for the candidate boxes, while the regression branch generates the box deltas. Similar to SiamRPN, we can discard the multi-scale test since the proposal network handles scale and shape changes in a graceful manner.

Figure 3. Illustration of the generalized training (GT) strategy for the CM stage. Given a template as shown on the left, the green blocks in search image 1 indicate the positive samples used in conventional training. The red blocks are the locations of other objects of the same category. GT takes both green and red blocks as positive samples. (The blue blocks indicate the ignored region.) Best viewed in color.

For the CM stage, we propose generalized training (GT) to improve the robustness. Conventionally, image pairs of the same object drawn from two frames of a video are used as positive samples. In DaSiamRPN [58], still images from detection datasets are used to generate positive image pairs through data augmentation. In this work, we additionally treat some image pairs containing different objects as positive samples when the two objects belong to the same category. Fig. 3 illustrates the classification labels used in our CM stage and in other SiamFC-based trackers. This training strategy leads to exceedingly generalized embeddings which capture high-level semantic information and are therefore insensitive to object appearance changes.
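The label assignment behind Fig. 3 can be sketched as follows. The IoU thresholds are borrowed from the CM training rule in Section 4.2; treating them as the thresholds used for the GT labels, and the helper itself, are our assumptions.

```python
import numpy as np

def iou(boxes, ref):
    """IoU between an (N, 4) array of [x1, y1, x2, y2] boxes and one box."""
    x1 = np.maximum(boxes[:, 0], ref[0]); y1 = np.maximum(boxes[:, 1], ref[1])
    x2 = np.minimum(boxes[:, 2], ref[2]); y2 = np.minimum(boxes[:, 3], ref[3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    area_r = (ref[2] - ref[0]) * (ref[3] - ref[1])
    return inter / (area_b + area_r - inter)

def generalized_labels(anchors, target_box, same_category_boxes,
                       hi=0.6, lo=0.3):
    """CM-stage classification labels under generalized training (Fig. 3):
    anchors overlapping the target OR any same-category object are positive
    (green and red blocks); low-overlap anchors are negative; everything in
    between is ignored (blue blocks)."""
    best = iou(anchors, target_box)
    for box in same_category_boxes:      # GT: same category is also positive
        best = np.maximum(best, iou(anchors, box))
    labels = np.full(len(anchors), -1)   # -1 = ignored
    labels[best >= hi] = 1               # positive samples
    labels[best <= lo] = 0               # negative samples
    return labels
```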
Figure 4. Visualization of the cross-correlation response maps generated by SiamFC [2], SiamRPN [26], and the CM stage of our tracker. Our tracker can robustly highlight the target object even when it has severe deformation. Best viewed in color.

Fig. 4 shows the response map of the CM stage and compares it with those of SiamFC and SiamRPN (with distractor-aware training). It is observed that our tracker is able to generate strong responses even when the target object has significant deformation. By contrast, SiamRPN [26, 58] barely produces any response, and SiamFC does not achieve a precise localization.

3.3. Fine Matching Stage

The fine matching stage is expected to capture fine-grained appearance information so that the true target can be distinguished from background or similar surrounding objects. The FM stage only evaluates the top K highest-score patches from the CM stage.

As illustrated in Fig. 2, the FM stage shares features with the CM stage. For each proposal, the regional features are directly cropped from the shared feature maps. Considering the fact that shallow features contain detailed appearance information and also result in high localization precision, we take both deep and shallow features and fuse them by concatenation. Then, the RoI Align operation [18] creates fixed-size feature maps for each proposal.
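A minimal sketch of this regional feature extraction, using torchvision's RoI Align, is given below. The channel counts (384 + 256 = 640) follow Section 4.1; the spatial_scale value and the assumption that conv2 and conv4 are spatially aligned before concatenation are ours.

```python
import torch
from torchvision.ops import roi_align

def regional_features(conv2, conv4, boxes, stride=8, out_size=6):
    """Fixed-size regional features for each proposal (Section 3.3).

    conv2: (1, 384, H, W) shallow features (appearance detail)
    conv4: (1, 256, H, W) deep features (semantics), assumed here to be
           spatially aligned with conv2 so they can be concatenated.
    boxes: (N, 4) proposals as [x1, y1, x2, y2] in image coordinates.
    """
    fused = torch.cat([conv2, conv4], dim=1)                  # (1, 640, H, W)
    # Prepend the batch index expected by torchvision's RoI Align.
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)
    # spatial_scale maps image coordinates onto the feature map; the stride
    # value is an assumption that depends on the backbone configuration.
    return roi_align(fused, rois, output_size=out_size,       # (N, 640, 6, 6)
                     spatial_scale=1.0 / stride, sampling_ratio=2)
```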
Since there are only a limited number of patches to be evaluated in this stage, we can afford to use a more powerful distance learning network, instead of the cross-correlation layer, to measure the similarity. Additionally, such a network could be trained to generate a score complementary to the CM similarity scores. We adopt a lightweight relation network, as proposed in [50], for the FM stage. The input of the relation network is the concatenated feature from the image pairs. A 1 × 1 convolution layer is followed by two fully connected layers which generate the feature embedding for classification and box regression.

Finally, the similarity scores and the box deltas from the two stages are fused by weighted sum. The candidate box with the highest similarity score is identified as the target object. Fig. 5 shows the top-K candidates and their similarity scores output by different trackers. Our tracker is associated with two scores corresponding to the CM and FM stages. The high C-scores for all the foreground objects suggest the robustness of SPM-Tracker, and the low F-scores for non-target objects demonstrate the discrimination power.

Figure 5. Visualization of the top-K matched boxes and their similarity scores output by SiamFC [2], SiamRPN [26], and our SPM-Tracker. Our tracker generates two scores, corresponding to the CM stage (C) and the FM stage (F). Objects of the same category get high C-scores, but only the true target gets high F-scores. This shows that SPM-Tracker achieves the design goal.
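A minimal sketch of such a relation head is shown below, assuming 6 × 6 × 640 regional features for both the template and each proposal (Section 4.1) and following the 1 × 1 conv plus two 256-d FC layers described above; every other detail is an assumption.

```python
import torch
import torch.nn as nn

class RelationHead(nn.Module):
    """Lightweight relation-network head for the FM stage, per the text:
    concatenated template/proposal features -> 1x1 conv -> two FC layers,
    with separate outputs for classification and box regression.
    Channel counts other than the 256-d FC layers are assumptions."""

    def __init__(self, in_ch=1280, mid_ch=256, roi=6):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(mid_ch * roi * roi, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 256), nn.ReLU(inplace=True),
        )
        self.cls = nn.Linear(256, 2)    # matched / not matched
        self.box = nn.Linear(256, 4)    # box deltas

    def forward(self, template_feat, proposal_feat):
        # A relation network compares a PAIR by concatenating its embeddings.
        pair = torch.cat([template_feat, proposal_feat], dim=1)
        h = self.fc(torch.relu(self.reduce(pair)))
        return self.cls(h), self.box(h)
```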
4. Implementation

4.1. Network Structure and Parameters

The CNN backbone used for feature extraction is the standard AlexNet [25]. It is pre-trained on the ImageNet dataset. Unlike other SiamFC-based trackers, we keep the padding operations in the backbone network. This is because the RoI Align operation needs pixel alignment between feature maps and source images. The CM stage still uses the central features without padding. In our implementation, the target image has a size of 127 × 127 × 3. The size of its last-conv-layer feature map with padding is 16 × 16 × 256. Only the central 6 × 6 × 256 features are used for the CM stage, which is consistent with the original SiamFC. The FM stage extracts regional features from the conv2 (384 channels) and conv4 (256 channels) layers and concatenates them. We use the RoI Align operation to pool regional features of size 6 × 6 × 640 for each proposal, where 6 × 6 is the spatial size and 640 is the number of channels. The two fully-connected layers in the FM stage are lightweight, with only 256 neurons per layer.
4.2. Training

The entire network can be trained end-to-end. The overall loss function is composed of four parts: the classification loss and the box regression loss in both the CM stage and the FM stage. For the CM stage, an anchor box is assigned a positive (or negative) label when its intersection-over-union (IoU) overlap with the ground-truth box is greater than 0.6 (or less than 0.3). Other patches, whose IoU overlap falls in between, are ignored. For the FM stage, positive (or negative) labels are assigned to candidate boxes whose IoU overlaps are greater (or less) than 0.5. As in the Faster R-CNN object detection framework [37], the box regression loss is applied to positive samples in both stages. We adopt the cross-entropy loss for classification and the smooth L1 loss [14] for box regression. The overall loss function can be written as

    L = \lambda_1 L_{cls}^{cm} + \lambda_2 L_{b}^{cm} + \lambda_3 L_{cls}^{fm} + \lambda_4 L_{b}^{fm},    (1)

where L_cls denotes the classification loss and L_b denotes the box regression loss. We set λ2 = 2 and λ1 = λ3 = λ4 = 1, since the box regression loss of the CM module is much smaller than the others.

The training image pairs are extracted from both videos and still images. The video datasets include VID [38] and the training set of Youtube-BB [35]. Following DaSiamRPN [58], we also make use of still image datasets, including COCO [29], ImageNet DET [38], Citypersons [56] and WiderFace [51]. The sampling ratio between videos and still images is 4:1. There are three types of image pairs, denoted same-instance, same-category, and different-category. They are sampled at a ratio of 2:1:1.

The standard SGD optimizer is adopted for training. In each step, the CM stage produces hundreds of candidate boxes, among which 48 boxes are selected to train the FM stage. The positive-negative ratio is set to 1:1. The learning rate is decreased from 10^-2 to 10^-4. We train the network for 50 epochs, and 160,000 image pairs are sampled in each epoch.
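The stated 4:1 and 2:1:1 sampling ratios can be realized with a simple weighted choice. A sketch, where the pair-list layout is a hypothetical structure rather than the authors' data pipeline:

```python
import random

# Sampling ratios stated in Section 4.2; the pair-list layout is assumed.
PAIR_TYPES   = ["same-instance", "same-category", "different-category"]
TYPE_WEIGHTS = [2, 1, 1]               # pair types sampled at a 2:1:1 ratio
SOURCES      = ["video", "still"]
SRC_WEIGHTS  = [4, 1]                  # videos : still images = 4:1

def sample_training_pair(pairs):
    """pairs[source][pair_type] is assumed to be a list of image pairs."""
    source = random.choices(SOURCES, weights=SRC_WEIGHTS, k=1)[0]
    ptype = random.choices(PAIR_TYPES, weights=TYPE_WEIGHTS, k=1)[0]
    return random.choice(pairs[source][ptype])
```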
4.3. Inference

During inference, we crop the template image patch from the first frame and feed it to the feature extraction network. The template features are cached, so we do not need to compute them for subsequent frames.

Given the tracking box in the last frame, a search image patch surrounding the box location is cropped and resized to 271 × 271. The CM stage takes the search image as input and outputs a number of boxes. The candidate box that has the largest overlap with the tracking box in the previous frame is always reserved, to increase stability. The other boxes go through the standard proposal processing of an RPN [37]. First, boxes with low scores are filtered out. Then non-maximum suppression (NMS) is applied, with an NMS threshold of 0.5. Finally, the K candidate boxes with top scores are selected and passed to the FM stage. In this step, we do not add shape penalties or cosine window penalties, in order to propose boxes aggressively. The number of candidate boxes K is set to 9, which is further analyzed in Section 5.2. We use five anchors whose aspect ratios are [0.33, 0.5, 1.0, 2.0, 3.0].
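This proposal processing maps to a few tensor operations. A sketch using torchvision's NMS; the 0.5 threshold and K = 9 follow the text, while the score threshold and how the reserved box counts toward K are assumptions:

```python
import torch
from torchvision.ops import box_iou, nms

def select_candidates(boxes, scores, prev_box, k=9,
                      nms_thr=0.5, min_score=0.1):
    """CM -> FM proposal processing (Section 4.3), an illustrative sketch."""
    # The proposal overlapping the previous tracking box the most is always
    # reserved, which the paper does to increase stability.
    reserved = box_iou(boxes, prev_box.unsqueeze(0)).squeeze(1).argmax()

    keep = torch.nonzero(scores >= min_score).squeeze(1)   # 1) score filter
    keep = keep[nms(boxes[keep], scores[keep], nms_thr)]   # 2) NMS at 0.5
    keep = keep[:k - 1]                 # 3) top scores (NMS output is
                                        #    already sorted by score)
    keep = torch.unique(torch.cat([keep, reserved.unsqueeze(0)]))
    return boxes[keep], scores[keep]
```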
In the FM stage, similarity scores and refined box positions are predicted by the classification head and the box regression head. Let u_c, u_f be the scores predicted by the CM and FM stages, and B_c, B_f be the box locations after the adjustment of the CM and FM stages, respectively. The final score and box coordinates are the weighted sums of the results from the two modules:

    u = (1 - W_{cls}) u_c + W_{cls} u_f
    B = \frac{u_c}{W_{box} u_f + u_c} B_c + \frac{W_{box} u_f}{W_{box} u_f + u_c} B_f,    (2)

where W_cls and W_box are the weights of the FM module for the similarity score and the box coordinates. We find that good tracking results are usually achieved when W_cls takes a value around 0.5 and W_box takes a value of 2 or 3.
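Eq. (2) translates directly into code. A small sketch with the weight values suggested in the text; scalars are used here for clarity, though in practice the box terms apply per coordinate:

```python
def fuse(u_c, u_f, box_c, box_f, w_cls=0.5, w_box=2.0):
    """Weighted fusion of CM/FM similarity scores and box locations,
    a direct transcription of Eq. (2)."""
    u = (1.0 - w_cls) * u_c + w_cls * u_f
    denom = w_box * u_f + u_c
    box = (u_c / denom) * box_c + (w_box * u_f / denom) * box_f
    return u, box

# Example: CM is confident (0.8), FM less so (0.6); the fused coordinate
# still leans toward the FM result because w_box up-weights the FM stage.
score, box = fuse(0.8, 0.6, box_c=100.0, box_f=104.0)   # -> 0.7, 102.4
```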
After applying cosine windows [2], the candidate box with the highest score is selected, and its size is updated by linear interpolation with the result in the previous frame.

Our tracker runs inference at 120 fps with a single NVIDIA P100 GPU and an Intel Xeon E5-2690 CPU.

5. Experiments

The three main contributions of this work are: 1) using a series-parallel structure to connect two matching stages; 2) adopting generalized training for the CM stage; and 3) adopting a relation network for distance measurement in the FM stage. In this section, we first perform ablation analyses which support our contributions, and then carry out comparison studies with the state-of-the-art trackers on major benchmark datasets.

5.1. Analysis of the Series-Parallel Structure

We corroborate the effectiveness of the series-parallel structure by comparing it with two alternatives. The baseline scheme, denoted by "CM only" in Table 1, is essentially SiamRPN [26]. Our implementation achieves a slightly better performance than reported in the original paper (0.279 vs. 0.244 EAO on the VOT-17 benchmark) because we have included additional still images in the training. When the FM stage is added in series, the performance (denoted by "CM+FM Series" in Table 1) is similar to the baseline (better on VOT-17 and worse on OTB-100 and VOT-16).

                  CM Only   CM+FM Series   CM+FM Series-Parallel
  OTB-100 (AUC)    0.643       0.632             0.670
  VOT-17 (EAO)     0.279       0.296             0.323
  VOT-16 (EAO)     0.359       0.343             0.391

Table 1. Ablation analysis of different architectures. Results on the three benchmark datasets are consistent, and they demonstrate the advantage of the series-parallel structure.

The proposed "CM+FM Series-Parallel" method, which performs two-stage fusion, significantly outperforms the other two schemes, as Table 1 shows. The reason why fusion plays such an important role is that the two stages pay attention to different aspects of tracker capability: robustness at the CM stage and discrimination power at the FM stage. The matching score produced by one stage does not reflect the other capability. The idea of fusion has been practiced in many trackers [47, 13, 6, 17] and has shown effectiveness.

5.2. Analysis of the CM Stage

Generalized Training Strategy: To make the CM module more robust to object appearance changes, we propose to take image pairs of the same category as positive samples during training. This is referred to as the generalized training (GT) strategy. We compare the performance of SPM-Tracker when it is trained with or without the GT strategy for the CM stage. Improvements achieved on all three benchmark datasets, as shown in Table 2, confirm the effectiveness of this strategy. Some visualization results have already been presented in Fig. 4 to show that the GT strategy helps to locate objects with large deformation.

                    OTB-100 AUC   VOT-17 EAO   VOT-16 EAO
  S-P model            0.670         0.323        0.391
  S-P model + GT       0.687         0.338        0.434

Table 2. Generalized training (GT) for the CM stage significantly improves the performance.

Number of Candidate Boxes: During inference, the CM stage passes the K top-scored candidate boxes to the FM stage. On the one hand, the larger K is, the higher the probability that the true target is included in the final evaluation. On the other hand, a larger K means that more false positives will be evaluated in the FM stage, which reduces the speed and might decrease the accuracy as well. In order to determine K, we investigate the relationship between the tracking performance and the number of candidate boxes. Fig. 6(a) shows how the AUC on OTB-100 changes with K. We find that when K is larger than 7, the performance tends to flatten. Therefore, we choose K = 9 in our experiments.

Figure 6. CM module analysis: (a) AUC score vs. number of candidate boxes; (b) recall rate vs. overlap threshold (the values in brackets indicate the mean recalls over thresholds 0.5:0.05:0.7: 1 candidate 0.825, 3 candidates 0.877, 5 candidates 0.888, 7 candidates 0.893, 9 candidates 0.895, 9 candidates + GT 0.901). All experiments are carried out on the OTB-100 dataset.

Recall: The recall of the candidate boxes can be used to measure the robustness. We use recall to further validate the GT strategy and the selection of K in the CM stage. To ensure fairness, we crop the template from the first frame, and the search region in the current frame is generated according to the ground-truth box in the previous frame. Fig. 6(b) shows the recall vs. the overlap threshold. The mean recall over the overlap thresholds [0.5, 0.7] is also computed and listed in brackets. It is obvious that using a single candidate box results in significantly lower recall than using multiple candidates. As the number of candidate boxes increases, the recall also increases, before it saturates at around K = 9. At the saturation point, applying the GT strategy can still boost recall. This doubly confirms the power of the GT strategy.
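The mean recall reported in brackets in Fig. 6(b) averages the recall over IoU thresholds 0.5:0.05:0.7. A small sketch, assuming the per-frame IoU of the best candidate box with the ground truth is available as input:

```python
import numpy as np

def mean_recall(best_ious, thresholds=np.arange(0.50, 0.7001, 0.05)):
    """Mean recall over IoU thresholds 0.5:0.05:0.7, as in Fig. 6(b).

    best_ious: per-frame IoU of the best candidate box with the ground
    truth (an assumed input; one value per evaluated frame).
    """
    best_ious = np.asarray(best_ious)
    recalls = [(best_ious >= t).mean() for t in thresholds]
    return float(np.mean(recalls))
```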
  Tracker           OTB-2013   OTB-50   OTB-100   Speed (FPS)
                         (AUC score, OPE)
  CF-based trackers:
  LCT [32]            0.628     0.492    0.568        27
  Staple [1]          0.593     0.516    0.582        80
  LMCF [46]           0.628     0.533    0.580        85
  CFNet [44]          0.611     0.530    0.568        75
  BACF [12]           0.656     0.570    0.621        35
  ECO-hc [7]          0.652     0.592    0.643        60
  PTAV [11]           0.663     0.581    0.635        25
  ACT [5]             0.657      -       0.625        30
  RT-MDNet [23]        -         -       0.650        46
  Ours                0.693     0.653    0.687       120

Table 3. Comparison with state-of-the-art real-time trackers on the OTB benchmarks. Trackers are grouped into CF-based methods, SiamFC-based methods and miscellaneous. Numbers in red and blue in the original table are the best and the second best results, respectively.

[Figure 7: Success and precision plots of OPE on OTB-100. Legend scores (AUC / precision): Ours 0.687 / 0.899, Siam-BM 0.662 / 0.855, DaSiamRPN 0.658 / 0.880, SA-Siam 0.656 / 0.864, ECO-hc 0.643 / 0.856, SiamRPN 0.637 / 0.851, PTAV 0.635 / 0.849, SiamFC-3s 0.582 / 0.771, Staple 0.578 / 0.784.]
5.3. Analysis of the FM Stage

Multi-Layer Feature Fusion: The FM stage takes regional features cropped from the shared backbone network as inputs. Generally speaking, deep features are rich in high-level semantic information, while shallow features are rich in low-level appearance information. As suggested by many previous works [43, 45, 39, 3], multi-layer features can be fused to achieve better performance. We follow this common practice and use conv2 + conv4 features for the FM stage. To demonstrate the advantage of multi-layer feature fusion, we compare the performance of SPM-Tracker with alternative implementations which only use single-layer features. We train and test models which use conv2, conv3, or conv4 only. On the OTB-100 benchmark, these three models achieve AUC scores of 0.666, 0.675, and 0.676, respectively, while our final model using conv2 + conv4 achieves 0.687.

Relation Network: The CM stage uses the cross-correlation operation for similarity matching and box regression. But in the FM stage, there are much fewer candidate boxes scattered in the search region, and there is not much advantage in using the cross-correlation operation. Therefore, we replace the cross-correlation layer with a more powerful relation network, as described in [50]. Experimental results verify our design choice. When models are trained without the GT strategy, using a cross-correlation layer in the FM stage results in an AUC of 0.647 on OTB-100, which is slightly better than the single-stage baseline SiamRPN (0.643), but notably inferior to the relation-network-based model (0.670). In addition, when GT is adopted, the AUC score of the cross-correlation-based model is 0.655, while that of the relation-network-based model is 0.687.

5.4. Comparison with the State-of-the-Art

Evaluation on OTB: Our SPM-Tracker is first compared with the state-of-the-art real-time trackers on the OTB-2013/50/100 benchmarks. The detailed AUC scores are summarized in Table 3. Due to space limitations, we only show the success plot and the precision plot of one-pass evaluation (OPE) on OTB-100 in Fig. 7. SPM-Tracker outperforms the other real-time trackers on all three OTB benchmarks by a large margin.

We also compare SPM-Tracker with some non-real-time top-performing trackers, including C-COT [9], ECO [7], MDNet [34], ADNet [54], TCNN [33], LSART [41], VITAL [40], RTINet [53], and DRL-IS [36]. The AUC score vs. speed curve on OTB-100 is shown in Fig. 8. SPM-Tracker strikes a very good balance between tracking performance and inference speed.

Evaluation on VOT: SPM-Tracker is evaluated on two VOT benchmark datasets, VOT-16 and VOT-17. Table 4 shows the comparison with almost all the top-performing trackers, regardless of their speed. Among the real-time trackers, SPM-Tracker is by far the best performing one, with superior accuracy and EAO. Even when compared with non-real-time trackers, SPM-Tracker achieves the best accuracy, and its EAO performance is among the best.
Figure 8. Performance-speed trade-off of top-performing trackers on the OTB-100 benchmark. The speed axis is logarithmic. (Trackers shown include MDNet, ECO, C-COT, ADNet, TCNN, LSART, DRL-IS, RTINet, VITAL, DaSiamRPN, SiamRPN, SA-Siam, Siam-BM, RASNet, PTAV, RT-MDNet, MCCT-H, ECO-hc, and ours.)

  Tracker       VOT-16 (A / R / EAO)     VOT-17 (A / R / EAO)    FPS
  CREST         0.51 / 0.25 / 0.283       -   /  -   /  -          1
  MDNet         0.54 / 0.34 / 0.257       -   /  -   /  -          1
  C-COT         0.54 / 0.24 / 0.331       -   /  -   /  -          0.3
  LSART          -   /  -   /  -         0.49 / 0.22 / 0.323       1
  ECO           0.55 / 0.20 / 0.375      0.48 / 0.27 / 0.280       8
  UPDT           -   /  -   /  -         0.53 / 0.18 / 0.378       -
  SiamFC        0.53 / 0.46 / 0.235      0.50 / 0.59 / 0.188      86
  Staple        0.54 / 0.38 / 0.295      0.52 / 0.69 / 0.169      80
  ECO-hc        0.54 / 0.30 / 0.322      0.49 / 0.44 / 0.238      60
  SA-Siam       0.54 /  -   / 0.291      0.50 / 0.46 / 0.236      50
  Siam-BM        -   /  -   /  -         0.56 / 0.26 / 0.335      32
  SiamRPN       0.56 / 0.26 / 0.344      0.49 / 0.46 / 0.244     200
  DaSiamRPN     0.61 / 0.22 / 0.411      0.56 / 0.34 / 0.326     160
  Ours          0.62 / 0.21 / 0.434      0.58 / 0.30 / 0.338     120

Table 4. Comparison with state-of-the-art trackers on the VOT benchmarks. Both non-real-time methods (top rows) and real-time methods (bottom rows) are included. "A" and "R" denote accuracy and robustness; EAO stands for expected average overlap. The numbers in red and blue in the original table indicate the best and the second best results.

Excluding extra data: Compared with DaSiamRPN [58], our tracker uses two more datasets (Citypersons [56] and WiderFace [51]) in training. For a fair comparison, we have also trained a model excluding these two datasets. The AUC on OTB-100 slightly drops to 0.671, but still outperforms DaSiamRPN and Siam-BM. The EAO on VOT-16 becomes 0.432, and that on VOT-17 slightly increases to 0.347.

5.5. Qualitative Results

Figure 9. Visualization of three successful tracking sequences from OTB-100.

Successful Cases: In Fig. 9, we visualize three successful tracking cases, including the very challenging jump and diving sequences. Owing to the robustness of the CM stage, our tracker is able to detect targets with huge deformation. The region proposal branch allows SPM-Tracker to fit to the varying object shapes. In these two sequences, some of the best trackers, such as ECO [7] and MDNet [34], also fail. DaSiamRPN [58] barely follows the target, but its box locations are less precise. This demonstrates the advantage of our two-stage box refinement.

Figure 10. Visualization of failure cases. The green box is the ground truth and the red box is our tracking result.

Failure Cases: We observe two types of failures in SPM-Tracker, as shown in Fig. 10. In the walking2 and liquor sequences, when the target is occluded by a similar object, the tracking box may drift. The other type of failure occurs when the ground-truth target is only a part of an object, as in the sequences bird1 and dog. SPM-Tracker seems to have a strong sense of objectness and tends to track the entire object even when the template only contains a part of it.

6. Conclusion

We have presented the design and implementation of a static discriminative tracker named SPM-Tracker. SPM-Tracker adopts a novel series-parallel structure for two-stage matching. Evaluations on the OTB and VOT benchmarks show its superior tracking performance. In the future, we plan to explore solutions to the drifting problem that occurs when the target is occluded by similar objects. Possible choices include template updating and forward-backward verification. We believe that the series-parallel matching framework has great potential and is worthy of further investigation.

Acknowledgement

We acknowledge funding from the National Key R&D Program of China under Grant 2017YFA0700800.
References

[1] Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik, and Philip HS Torr. Staple: Complementary learners for real-time tracking. In CVPR, pages 1401–1409, 2016.
[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In ECCV, pages 850–865, 2016.
[3] Goutam Bhat, Joakim Johnander, Martin Danelljan, Fahad Shahbaz Khan, and Michael Felsberg. Unveiling the power of deep tracking. In ECCV, 2018.
[4] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In CVPR, pages 2544–2550, 2010.
[5] Boyu Chen, Dong Wang, Peixia Li, Shuang Wang, and Huchuan Lu. Real-time actor-critic tracking. In ECCV, pages 328–345, 2018.
[6] Dapeng Chen, Zejian Yuan, Gang Hua, Yang Wu, and Nanning Zheng. Description-discrimination collaborative tracking. In ECCV, pages 345–360, 2014.
[7] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, Michael Felsberg, et al. ECO: Efficient convolution operators for tracking. In CVPR, pages 6931–6939, 2017.
[8] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, and Michael Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, pages 4310–4318, 2015.
[9] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, pages 472–488, 2016.
[10] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost Van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, pages 1090–1097, 2014.
[11] Heng Fan and Haibin Ling. Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV, pages 5487–5495, 2017.
[12] H Kiani Galoogahi, Ashton Fagg, and Simon Lucey. Learning background-aware correlation filters for visual tracking. In CVPR, pages 1144–1152, 2017.
[13] Jin Gao, Haibin Ling, Weiming Hu, and Junliang Xing. Transfer learning based visual tracking with gaussian processes regression. In ECCV, pages 188–203, 2014.
[14] Ross Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
[15] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In ICCV, pages 1763–1771, 2017.
[16] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. Towards a better match in siamese network based visual object tracker. In ECCV Workshop, 2018.
[17] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A twofold siamese network for real-time object tracking. In CVPR, pages 4834–4843, 2018.
[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
[19] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, pages 702–715, 2012.
[20] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. T-PAMI, 37(3):583–596, 2015.
[21] Chen Huang, Simon Lucey, and Deva Ramanan. Learning policies for adaptive tracking with deep feature cascades. In ICCV, pages 105–114, 2017.
[22] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, pages 7310–7311, 2017.
[23] Ilchae Jung, Jeany Son, Mooyeol Baek, and Bohyung Han. Real-time MDNet. In ECCV, pages 83–98, 2018.
[24] Zdenek Kalal, Krystian Mikolajczyk, Jiri Matas, et al. Tracking-learning-detection. T-PAMI, 34(7):1409, 2012.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
[26] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In CVPR, pages 8971–8980, 2018.
[27] Yang Li and Jianke Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCV, pages 254–265, 2014.
[28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[30] Xiankai Lu, Chao Ma, Bingbing Ni, Xiaokang Yang, Ian Reid, and Ming-Hsuan Yang. Deep regression tracking with shrinkage loss. In ECCV, pages 353–369, 2018.
[31] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In ICCV, pages 3074–3082, 2015.
[32] Chao Ma, Xiaokang Yang, Chongyang Zhang, and Ming-Hsuan Yang. Long-term correlation tracking. In CVPR, pages 5388–5396, 2015.
[33] Hyeonseob Nam, Mooyeol Baek, and Bohyung Han. Modeling and propagating CNNs in a tree structure for visual tracking. arXiv preprint, 2016.
[34] Hyeonseob Nam and Bohyung Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, pages 4293–4302, 2016.
[35] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video. In CVPR, pages 7464–7473, 2017.
[36] Liangliang Ren, Xin Yuan, Jiwen Lu, Ming Yang, and Jie Zhou. Deep reinforcement learning with iterative shift for visual tracking. In ECCV, pages 684–700, 2018.
[37] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
[38] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[39] Yibing Song, Chao Ma, Lijun Gong, Jiawei Zhang, Rynson WH Lau, and Ming-Hsuan Yang. CREST: Convolutional residual learning for visual tracking. In ICCV, pages 2574–2583, 2017.
[40] Yibing Song, Chao Ma, Xiaohe Wu, Lijun Gong, Linchao Bao, Wangmeng Zuo, Chunhua Shen, Rynson Lau, and Ming-Hsuan Yang. VITAL: Visual tracking via adversarial learning. In CVPR, 2018.
[41] Chong Sun, Huchuan Lu, and Ming-Hsuan Yang. Learning spatial-aware regressions for visual tracking. In CVPR, pages 8962–8970, 2018.
[42] Ming Tang, Bin Yu, Fan Zhang, and Jinqiao Wang. High-speed tracking with multi-kernel correlation filters. In CVPR, pages 4874–4883, 2018.
[43] Ran Tao, Efstratios Gavves, and Arnold W M Smeulders. Siamese instance search for tracking. In CVPR, pages 1420–1429, 2016.
[44] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, pages 5000–5008, 2017.
[45] Lijun Wang, Wanli Ouyang, Xiaogang Wang, and Huchuan Lu. Visual tracking with fully convolutional networks. In ICCV, pages 3119–3127, 2015.
[46] Mengmeng Wang, Yong Liu, and Zeyi Huang. Large margin object tracking with circulant feature maps. In CVPR, pages 21–26, 2017.
[47] Naiyan Wang, Siyi Li, Abhinav Gupta, and Dit-Yan Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv preprint, 2015.
[48] Ning Wang, Wengang Zhou, Qi Tian, Richang Hong, Meng Wang, and Houqiang Li. Multi-cue correlation filters for robust visual tracking. In CVPR, pages 4844–4853, 2018.
[49] Qiang Wang, Zhu Teng, Junliang Xing, Jin Gao, Weiming Hu, and Stephen Maybank. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In CVPR, pages 4854–4863, 2018.
[50] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
[51] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. WIDER FACE: A face detection benchmark. In CVPR, pages 5525–5533, 2016.
[52] Tianyu Yang and Antoni B Chan. Learning dynamic memory networks for object tracking. In ECCV, 2018.
[53] Yingjie Yao, Xiaohe Wu, Lei Zhang, Shiguang Shan, and Wangmeng Zuo. Joint representation and truncated inference learning for correlation filter based tracking. In ECCV, pages 552–567, 2018.
[54] Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, and Jin Young Choi. Action-decision networks for visual tracking with deep reinforcement learning. In CVPR, pages 1349–1358, 2017.
[55] Yunhua Zhang, Dong Wang, Lijun Wang, Jinqing Qi, and Huchuan Lu. Learning regression and verification networks for long-term visual tracking. arXiv preprint arXiv:1809.04320, 2018.
[56] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. CityPersons: A diverse dataset for pedestrian detection. In CVPR, pages 3213–3221, 2017.
[57] Yunhua Zhang, Lijun Wang, Jinqing Qi, Dong Wang, Mengyang Feng, and Huchuan Lu. Structured siamese network for real-time visual tracking. In ECCV, pages 351–366, 2018.
[58] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, pages 103–119, 2018.