Efficient Online Structured Output Learning for Keypoint-Based Object Tracking
3.1. RANSAC for structured prediction

Given an object model $M$ and an input image $I$, the goal of object detection is to compute a transformation $T \in \mathcal{T}$ which maps $M$ to $I$. A 3D pose or 2D homography are examples of such a transformation.

We can think of this process as one of structured output prediction, with the output space consisting of all valid transformations, along with a null transformation indicating the absence of the object. We therefore assume that there exists a function $T = f(M, I)$, and that this function can be expressed as

$$T = \operatorname*{argmax}_{T' \in \mathcal{T}} F(M, I, T'), \tag{1}$$

where $F$ is a compatibility function, scoring all possible transformations of the object given an image.

In practice, finding a solution for the prediction function Eq (1) under a specific model definition is generally infeasible because the output space is very large, and evaluating image observations under different transformations of the model will be expensive. The way that this issue is usually handled is by applying an iterative robust parameter estimation algorithm such as RANSAC [5] or PROSAC [4] to approximately solve Eq (1). These algorithms rely on a sparse representation for the model and image and use a set of correspondences between model and image points as their input.

Consider an object model $M$ which is based on a sparse set of keypoints $M = \{u_1, \ldots, u_J\}$, with each keypoint defined by a location (2D or 3D). Similarly, let the image $I$ be represented as a sparse set of keypoints $I = \{v_1, \ldots, v_K\}$. A set of correspondences $C = \{(u_j, v_k, s_{jk}) \mid u_j \in M, v_k \in I, s_{jk} \in \mathbb{R}\}$ is found between model keypoints and image keypoints, where $s_{jk}$ is a correspondence score derived from appearance information. Traditional RANSAC maximizes the number of inliers defined by

$$F(C, T) = \sum_{(u_j, v_k) \in C} \mathbb{I}(\|v_k - T(u_j)\|_2 < \tau), \tag{2}$$

where $T(u_j)$ is the location of model keypoint $u_j$ under the transformation $T$, $\tau$ is a spatial mis-alignment threshold and $\mathbb{I}(\cdot)$ is an indicator function. This maximization is performed by randomly sampling transformations which are compatible with minimal subsets of correspondences in $C$, with variants such as PROSAC biasing this sampling by using the correspondence scores $s_{jk}$.

Existing approaches have applied learning in an offline setting [9, 12, 15] as well as in an online setting [6, 13] to encourage reliable appearance-based correspondences to be found in $C$. However, in these approaches the generation of correspondences and the scoring and maximization (Eq (2)) are decoupled from each other. These approaches therefore do not perform learning which takes into account the entire transformation prediction process.

To allow learning for the entire prediction process, we propose introducing a weight vector $w_j$ for each model keypoint $u_j$. This weight vector is used to score correspondences according to $s_{jk} = \langle w_j, d_k \rangle$, where $d_k$ is a descriptor extracted around image keypoint $v_k$, normalized such that $\|d_k\|_2 = 1$. We then propose modifying the compatibility function Eq (2) to include correspondence scores, such that it can be written as a linear operator

$$F_w(C, T) = \sum_{(u_j, v_k) \in C} s_{jk}\, \mathbb{I}(\|v_k - T(u_j)\|_2 < \tau) = \langle w, \Phi(C, T) \rangle, \tag{3}$$

where $w = [w_1, \ldots, w_J]^T$ is the concatenation of model weight vectors and $\Phi(C, T) = [\phi_1(C, T), \ldots, \phi_J(C, T)]^T$ is a joint feature mapping. Each $\phi_j$ is defined as

$$\phi_j(C, T) = \begin{cases} d_k & \exists (u_j, v_k) \in C : \|v_k - T(u_j)\|_2 < \tau \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

Our goal is to learn a model parameterized by $w$ such that the behaviour of this function in the output space is close to the actual behaviour of RANSAC, but, because it includes information about appearance, in the process of learning we will discover which model points are the most discriminative and how best we can utilize them to predict transformations.

3.2. Structured output learning

Now, given a set of training examples $\{(I_i, T_i)\}_{i=1}^{N}$, $w$ can be learned in a structured output maximum margin framework [17]. For each training example $i$, this formulation tries to maximize the margin between the score of the true output $T_i$ and all alternative outputs. This can be expressed by the following optimization problem

$$\begin{aligned} \min_{w, \xi} \quad & \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{N} \xi_i \\ \text{s.t.} \quad & \forall i : \xi_i \ge 0 \\ & \forall i, \forall T \ne T_i : \delta F_w^i(T) \ge \Delta(T_i, T) - \xi_i \end{aligned} \tag{5}$$

where $\delta F_w^i(T) = F_w(C_i, T_i) - F_w(C_i, T)$, and $\lambda$ is a parameter determining the trade-off between training set accuracy and regularization. $\Delta(T_i, T)$ is a loss function which measures the penalty for choosing $T$ instead of the true transformation $T_i$. The loss function $\Delta(T_i, T)$ should measure the dissimilarity of two competing output hypotheses, and will be discussed in Section 3.3.

Because we are using RANSAC to perform the output prediction and this relies on an accurate set of correspondences, we modify this formulation to also encourage each
inlier correspondence to score higher than any other image correspondence. This can be realized as an additional set of ranking constraints and the formulation then becomes

$$\begin{aligned} \min_{w, \xi, \gamma} \quad & \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{N} \xi_i + \nu \sum_{i=1}^{N} \sum_{(u_j, v_k) \in C_i^*} \gamma_{ij} \\ \text{s.t.} \quad & \forall i : \xi_i \ge 0 \\ & \forall i, \forall T \ne T_i : \delta F_w^i(T) \ge \Delta(T_i, T) - \xi_i \\ & \forall i, \forall j : \gamma_{ij} \ge 0 \\ & \forall i, \forall (u_j, v_k), \forall k' \ne k : \langle w_j, d_k - d_{k'} \rangle \ge 1 - \gamma_{ij} \end{aligned} \tag{6}$$

where $C_i^* \subset C_i$ is the set of inlier correspondences under $T_i$, and $\nu$ is a weighting parameter.

The learning problem presented in Eq (6) allows us to train a discriminative model in a unified framework where learning the representation of model points and performing pose estimation is combined in a structured output maximum margin setting.

3.3. Loss function

Eq (6) requires a loss function $\Delta$ to be defined between two transformations. Since the compatibility function $F_w(C, T)$ sums over those correspondences in $C$ which are inliers under $T$, we desire a loss function which takes into account the fact that transformations will have different numbers of inliers. We consider two such loss functions, which we compare experimentally in Section 4:

1. Hamming distance on inliers:

$$\Delta_H(T, T') = \sum_{(u_j, v_k) \in C} \mathbb{I}\big(z(u_j, v_k, T) \ne z(u_j, v_k, T')\big) \tag{7}$$

where $z(u_j, v_k, T) = \mathbb{I}(\|v_k - T(u_j)\|_2 < \tau)$. This loss function aims to penalize transformations having different inlier sets.

2. Difference in number of inliers:

$$\Delta_I(T, T') = |F(C, T) - F(C, T')|. \tag{8}$$

This loss function aims to penalize transformations with different numbers of inliers, similar in spirit to the traditional RANSAC scoring function (Eq (2)).

3.4. Online learning

While Eq (6) can be solved offline as a batch problem, for our application we are interested in updating $w$ online, such that we can adapt the model to a given environment. The model can be initialized by setting each $w_j$ to be the descriptor one would use in a static model, and in subsequent frames can be updated by performing stochastic gradient descent. We rewrite the optimization problem of Eq (6) in unconstrained form as

$$\min_{w} \Big\{ \frac{\lambda}{2} \|w\|^2 + \sum_{i=1}^{N} \max_{T \ne T_i} \{\Delta(T_i, T) - \delta F_w^i(T)\}_+ + \nu \sum_{i=1}^{N} \sum_{(u_j, v_k) \in C_i^*} \max_{k' \ne k} \{1 - \langle w_j, d_k - d_{k'} \rangle\}_+ \Big\} \tag{9}$$

where $(\cdot)_+ = \max\{0, \cdot\}$ is the hinge function.

Given a training example $(I_t, T_t)$ at time $t$, a subgradient of Eq (9) is found with respect to $w$, and a gradient descent step is then performed according to

$$\begin{aligned} w_j^{t+1} \leftarrow{} & (1 - \eta_t \lambda)\, w_j^t \\ & + \mathbb{I}\Big(\max_{T \ne T_t} \{\Delta(T_t, T) - \delta F_w^t(T)\} > 0\Big)\, \eta_t\, \alpha_j^t \\ & + \mathbb{I}(u_j \in C_t^*)\, \mathbb{I}\Big(\max_{k' \ne k} \{1 - \langle w_j^t, d_k - d_{k'} \rangle\} > 0\Big)\, \eta_t\, \nu\, \beta_j^t, \end{aligned} \tag{10}$$

where $\eta_t = 1/(\lambda t)$ is the step size. Let $\hat{T} = \operatorname*{argmax}_{T \ne T_t} \{\Delta(T_t, T) - \delta F_w^t(T)\}$ and $\hat{k} = \operatorname*{argmax}_{k' \ne k} \{1 - \langle w_j^t, d_k - d_{k'} \rangle\}$. Then $\alpha_j^t$ and $\beta_j^t$ are defined as

$$\alpha_j^t = \phi_j(C_t, T_t) - \phi_j(C_t, \hat{T}), \tag{11}$$

and

$$\beta_j^t = d_k - d_{\hat{k}}. \tag{12}$$

To estimate $T_t$ for the current image, we use the prediction of Eq (1) using the old model representation $w^{t-1}$ and then update the model according to Eq (10). Furthermore, when performing RANSAC in our prediction function we will also be exploring and scoring other transformations, which gives us a mechanism for identifying any margin violations which have occurred, the largest of which will contribute to the gradient descent step Eq (10). In this way, our online learning can re-use the intermediate results of estimating $T_t$, and thus adds only a small amount of overhead compared to detection alone.

3.5. Binary approximation of model

An important goal of our method is to be real-time and suitable for low-powered devices, and we would therefore like to take advantage of binary descriptors. Although these descriptors are very compact when represented as bitsets, to use a linear SVM requires converting them into high-dimensional real vectors. While this is acceptable when updating the classifier, it would be very computationally expensive at the matching stage, which requires exhaustive evaluation of every model classifier with every image keypoint. To avoid this, we propose approximating each $w_j$ in
terms of a set of basis vectors

$$w_j \approx \sum_{i=1}^{N_b} \beta_i b_i \tag{13}$$
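
To make Eq (13) concrete, the following is a minimal sketch of one way such an approximation could be built and used. It rests on assumptions that are not stated in the text above: each basis vector $b_i$ is taken to have entries in {-1, +1}, the coefficients $\beta_i$ come from a simple greedy residual fit, and descriptors are treated as unscaled sign vectors. The function names `greedy_binary_basis`, `pack_bits` and `approx_score` are hypothetical and chosen only for illustration. The point of the sketch is that once $w_j$ is a short sum of sign vectors, scoring a binary descriptor needs only XOR and bit counting.

```python
def greedy_binary_basis(w, n_bases):
    """Approximate w by sum_i beta_i * b_i with b_i in {-1, +1}^D.

    Assumed scheme (not taken from the paper): each b_i is the sign pattern
    of the current residual and beta_i is its least-squares coefficient.
    """
    residual = list(w)
    dim = len(w)
    betas, bases = [], []
    for _ in range(n_bases):
        b = [1 if r >= 0 else -1 for r in residual]
        beta = sum(r * s for r, s in zip(residual, b)) / dim  # <r, b> / <b, b>
        betas.append(beta)
        bases.append(b)
        residual = [r - beta * s for r, s in zip(residual, b)]
    return betas, bases


def pack_bits(values):
    """Pack a {-1, +1} vector or a {0, 1} descriptor into an integer bitset."""
    bits = 0
    for i, value in enumerate(values):
        if value > 0:
            bits |= 1 << i
    return bits


def approx_score(betas, packed_bases, packed_desc, dim):
    """Approximate <w, d> for a sign-valued descriptor d using XOR and popcount.

    For two sign vectors of length D, <b, d> = D - 2 * Hamming(b, d).
    """
    score = 0.0
    for beta, basis in zip(betas, packed_bases):
        hamming = bin(basis ^ packed_desc).count("1")
        score += beta * (dim - 2 * hamming)
    return score


# Example: a 4-dimensional weight vector and a 4-bit descriptor.
w = [0.5, -0.2, 0.1, -0.7]
descriptor_bits = [1, 0, 0, 1]
betas, bases = greedy_binary_basis(w, n_bases=2)
score = approx_score(betas, [pack_bits(b) for b in bases],
                     pack_bits(descriptor_bits), dim=len(w))
```

With this greedy fit the squared reconstruction error cannot increase as bases are added, which is in line with the later observation that a small number of bases is enough; any constant scaling applied to the descriptors can be folded into the final score.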
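
As an illustration of Section 3, the compatibility functions of Eq (2) and Eq (3) and the two losses of Eq (7) and Eq (8) can be sketched in a few lines. This is a sketch under stated assumptions rather than the implementation used in the experiments: transformations are represented as plain Python callables on 2D points, a correspondence is a tuple `(u, v, s)` holding the model location, the image location and the score $s_{jk}$, and all function names are hypothetical.

```python
import math


def is_inlier(T, u, v, tau):
    """z(u_j, v_k, T) of Eq (7): True when T(u_j) lies within tau of v_k."""
    return math.dist(T(u), v) < tau


def inlier_count(corrs, T, tau):
    """F(C, T) of Eq (2): number of inlier correspondences under T."""
    return sum(is_inlier(T, u, v, tau) for u, v, _ in corrs)


def weighted_score(corrs, T, tau):
    """F_w(C, T) of Eq (3): sum of the scores s_jk over inliers under T."""
    return sum(s for u, v, s in corrs if is_inlier(T, u, v, tau))


def hamming_loss(corrs, T_a, T_b, tau):
    """Delta_H of Eq (7): correspondences whose inlier status differs."""
    return sum(is_inlier(T_a, u, v, tau) != is_inlier(T_b, u, v, tau)
               for u, v, _ in corrs)


def inlier_difference_loss(corrs, T_a, T_b, tau):
    """Delta_I of Eq (8): absolute difference in the number of inliers."""
    return abs(inlier_count(corrs, T_a, tau) - inlier_count(corrs, T_b, tau))


def translation(dx, dy):
    """Hypothetical stand-in for a transformation: a pure 2D translation."""
    return lambda p: (p[0] + dx, p[1] + dy)


# Three toy correspondences (u_j, v_k, s_jk) and two competing transformations.
corrs = [((0.0, 0.0), (1.0, 0.0), 0.9),
         ((1.0, 0.0), (2.0, 0.0), 0.8),
         ((0.0, 1.0), (5.0, 5.0), 0.3)]
T_true = translation(1.0, 0.0)
T_alt = translation(5.0, 4.0)
tau = 0.5
```

On this toy data `T_true` has two inliers and `T_alt` has one, so `inlier_difference_loss` is 1, while `hamming_loss` is 3 because the two inlier sets are disjoint. This is the distinction between the two losses drawn in Section 3.3.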
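
The per-keypoint update of Eq (10) in Section 3.4 can likewise be sketched for a single model keypoint. The sketch below is illustrative only and makes several assumptions: vectors are plain Python lists, the caller has already searched the RANSAC hypotheses for $\hat{T}$ and passes $\alpha_j^t$ of Eq (11) only when the structured margin is violated, and the names `dot` and `sgd_step` are hypothetical. The ranking term of Eq (12) is computed inside the function from the old weights $w_j^t$.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step(w_j, t, lam, nu, alpha_j=None, d_k=None, other_descs=()):
    """Apply one update of Eq (10) to the weight vector of model keypoint u_j.

    alpha_j     : phi_j(C_t, T_t) - phi_j(C_t, T_hat) as in Eq (11), or None
                  when the structured margin is not violated at time t.
    d_k         : descriptor matched to u_j if u_j is an inlier under T_t,
                  otherwise None.
    other_descs : descriptors of the remaining image keypoints (k' != k).
    """
    eta = 1.0 / (lam * t)  # step size eta_t = 1 / (lambda * t)
    new_w = [(1.0 - eta * lam) * x for x in w_j]  # shrinkage from the regularizer
    if alpha_j is not None:
        new_w = [x + eta * a for x, a in zip(new_w, alpha_j)]
    if d_k is not None and other_descs:
        # k_hat maximizes 1 - <w_j, d_k - d_k'>, i.e. the best-scoring competitor.
        d_hat = max(other_descs, key=lambda d: dot(w_j, d))
        if 1.0 - dot(w_j, d_k) + dot(w_j, d_hat) > 0.0:  # ranking margin violated
            beta_j = [a - b for a, b in zip(d_k, d_hat)]  # Eq (12)
            new_w = [x + eta * nu * b for x, b in zip(new_w, beta_j)]
    return new_w


# Example: the competitor [0.8, 0.6] currently outscores the true match [0.6, 0.8].
w_old = [1.0, 0.0]
w_new = sgd_step(w_old, t=2, lam=0.1, nu=1.0,
                 d_k=[0.6, 0.8], other_descs=[[0.8, 0.6], [0.0, 1.0]])
```

After this step the true match scores higher than the former best competitor under `w_new`. Note that with $\eta_t = 1/(\lambda t)$ the shrinkage factor is $1 - 1/t$, so the example uses $t = 2$ to keep part of the old weight vector.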
            BRIEF                        BRISK                        SURF
Sequence    Static  Indep.  ∆H    ∆I     Static  Indep.  ∆H    ∆I     Static  Indep.  ∆H    ∆I     Boost. [6]
barbapapa   0.19    0.94    0.94  0.94   0.92    0.93    0.94  0.94   0.89    0.45    0.92  0.92   0.88
comic       0.41    0.90    0.94  0.98   0.42    0.60    0.61  0.76   0.83    0.67    0.90  0.93   0.56
map         0.82    0.98    0.99  0.99   0.79    0.91    0.91  0.93   0.91    0.09    0.98  0.99   0.83
paper       0.06    0.68    0.77  0.85   0.04    0.40    0.51  0.54   0.03    0.01    0.03  0.03   0.04
phone       0.88    0.93    0.97  0.97   0.64    0.82    0.91  0.92   0.92    0.46    0.96  0.97   0.85

Table 1. Average detection rates in test sequences (higher is better). Each row represents a video sequence. Each set of columns shows a different combination of feature point detector and descriptor, while the last single column is the result of the boosting approach. Within a feature detector/descriptor combination, we compare the results of no learning (static), independently trained SVM classifiers, and our structured output learning framework with the two loss functions ∆H and ∆I defined in Section 3.3. The bold-face font represents the best working method for a video sequence in a chosen detector/descriptor setting.
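
The detection rates in Table 1 rest on the per-frame score of Eq (15) in Section 4.1 and its threshold of 10. A minimal sketch of that computation is given below, under the assumptions that homographies are 3x3 nested lists acting on homogeneous 2D points, that a frame with no detection is represented by `None` and counted as incorrect, and that the function names are hypothetical.

```python
import math

# Corners of the square used in Eq (15).
CORNERS = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]


def apply_homography(H, p):
    """Map the 2D point p through the 3x3 homography H."""
    x, y = p
    xs = H[0][0] * x + H[0][1] * y + H[0][2]
    ys = H[1][0] * x + H[1][1] * y + H[1][2]
    ws = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xs / ws, ys / ws)


def invert_3x3(H):
    """Invert a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("homography is singular")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def detection_score(T_true, T_pred):
    """S(T*, T) of Eq (15): mean corner displacement under T* composed with T^-1."""
    T_pred_inv = invert_3x3(T_pred)
    total = 0.0
    for c in CORNERS:
        mapped = apply_homography(T_true, apply_homography(T_pred_inv, c))
        total += math.dist(c, mapped)
    return total / len(CORNERS)


def detection_rate(frames, threshold=10.0):
    """Fraction of (T_true, T_pred) frames with S < threshold.

    Assumption: T_pred is None when no detection was returned, and such
    frames count as incorrect.
    """
    correct = sum(1 for T_true, T_pred in frames
                  if T_pred is not None and detection_score(T_true, T_pred) < threshold)
    return correct / len(frames)


def translation_h(dx, dy):
    """Hypothetical helper: a homography that is a pure translation."""
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


identity = translation_h(0.0, 0.0)
frames = [(identity, identity),
          (translation_h(3.0, 4.0), identity),
          (translation_h(30.0, 40.0), identity),
          (identity, None)]
rate = detection_rate(frames)
```

In this toy example the four frames have scores 0, 5 and 50 and one missing detection, so two of the four frames count as correct.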
j-th model keypoint using the learned SVM as $f_j(d_k) = \langle w_j, d_k \rangle$ and use this score to find the highest scoring match to construct the correspondence set for pose estimation. To update each classifier, we take each inlier returned from the geometric verification set as a positive training example, and select the next highest scoring image keypoint as a negative example. We then perform stochastic gradient descent learning to update the classifiers.

Additionally, we implemented the boosting-based classification approach used in [6], by making use of the publicly available online boosting code of the authors (http://www.vision.ee.ethz.ch/boostingTrackers/onlineBoosting.htm). We train these classifiers in the same manner as our independent SVM baseline.

The unoptimized C++ implementation of our approach as well as the annotated videos used during our experiments are publicly available to download (http://www.samhare.net/research).

4.1. Tracking by detection

To illustrate the benefit of online learning we consider the task of tracking-by-detection, in which the target object should be detected in consecutive frames of a video sequence. For this task we do not use any information about the location of the object in the previous frame when detecting the object, but we use each successful detection in order to perform an online learning step to update our object model for subsequent frames.

We apply our approach using three different combinations of interest point detector and descriptor: FAST detector with 256-bit BRIEF descriptor, BRISK detector with 512-bit BRISK descriptor and SURF detector with SURF64 descriptor. These have been chosen to illustrate that our method works with a variety of feature point detectors and descriptors, but as they each have different invariances and dimensionality, our results should not be interpreted as a comparison between different descriptor types. Therefore, we are interested in relative performance figures for a particular feature point detector and descriptor combination.

When learning with binary descriptors, we apply the feature transformation $\tilde{d} = (d - 0.5)/(0.5\sqrt{D})$, where $D$ is the dimensionality of the descriptor, which centers and normalizes the descriptors, as this is known to improve the performance of stochastic gradient descent algorithms [8]. During matching this transformation can easily be handled implicitly in the binary approximation without any overhead. We fix the SVM learning rate $\lambda = 0.1$ for all experiments. We also set $\nu = 1$ for the structured model.

For each sequence, we initialize a model using the fronto-parallel planar patch in the first frame, by detecting the 100 strongest features to define the locations of model keypoints $M$. Given these locations we initialize four models for comparison: a non-adaptive matching-based model using the descriptors for the model keypoints extracted from the first frame (to represent a traditional keypoint-based object detection approach); the baseline learned model using independent SVM classifiers; and our structured output approach with the two loss functions described in Section 3.3. We initialize the weight vector for a model keypoint by setting it to the descriptor from the first frame.

To assess detection performance, we define a scoring function between the ground-truth homography $T^*$ and the predicted homography $T$ as:

$$S(T^*, T) = \frac{1}{4} \sum_{i=1}^{4} \|c_i - (T^* T^{-1})(c_i)\|_2, \tag{15}$$

where $\{c_i\}_{i=1}^{4} = \{(-1, -1)^T, (1, -1)^T, (-1, 1)^T, (1, 1)^T\}$ define the corners of a square. This score will be 0 if the two homographies are identical. Frames for which $S(T^*, T) < 10$ are considered correct detections, and we report the average over the entire sequence.

The results of our experiments without the binary approximation described in Section 3.5 can be seen in Table 1 (the actual videos and the results of tracking-by-detection can be found in our supplementary material or at http://www.samhare.net/research). As can be seen from this table, the structured output learning framework outperforms the static model (with
no learning), as well as the model trained with independent SVM classifiers. Comparing the results of independent SVM classifiers and the static model highlights the fact that adapting an object model to a particular environment online helps considerably in practice. However, the highest detection rate is attained with the structured output framework, where the learning of the object model and geometric estimation are linked inside a unified optimization formulation. In particular, using the loss function $\Delta_I$ based on the difference in number of inliers results in the best performance. It should be noted that for SURF descriptors the independent SVMs had difficulty learning the correct object model. We suspect that this is caused by the continuous nature of the SURF descriptor and the fact that the number of generated keypoints is lower with the SURF keypoint detector. However, given the same settings, the structured learning approach is able to benefit fully from the adaptation process and improve upon the static model.

For the boosting-based online descriptor learning approach, it is only fair to compare against the models where we use the BRIEF descriptor (as both of these methods are using the same FAST corner detector). Again, comparing the boosting method with the static method shows that learning still provides an improvement. However, the boosting-based approach is not able to outperform the independent SVM learning framework, and therefore also performs worse than the structured output framework.

The most difficult video in our set of experiments is the paper sequence. This video sequence features highly repetitive local appearance structures and a simple static model fails in all cases. The learning-based approaches (except the boosting method) are able to deliver a reasonable detection rate using binary descriptors. An example frame of this sequence is shown in Figure 2 where we display the generated correspondences before geometric verification. As can be seen in the top image, because of the confusing appearance of the local image features, the static BRIEF model fails to match model keypoints consistently to the image. However, the structured learning framework, which uses the same set of descriptors extracted from the input image for matching, has learned a more discriminative object model and is able to provide more meaningful correspondences which are mainly focused on where the object is in the input frame. Another observation is that although the structured learning model creates some mis-matches (lines matching an object point to some background locations), they all have very small matching scores (indicated by their dark color).

Figure 2. (a) Static BRIEF model. (b) Learned BRIEF model using our structured output formulation. Example frames from the paper sequence showing the top correspondence for each model point. The model is displayed in a green box on the left of this image. The brightness of each line indicates the correspondence score, before any geometric verification has taken place (the brighter the higher the score). The learned model has adapted to discriminate against the many confusing keypoints in the image, resulting in a successful detection, while no detection is found with the static model.

4.2. Binary approximation

To verify that the binary approximation proposed in Section 3.5 is reasonable when using binary descriptors such as BRIEF and BRISK, we repeat our experiments for the BRIEF descriptor model learned in our structured output framework and approximate the model keypoint weight vectors $w_j$ with varying numbers of binary bases $N_b$. As can be seen in Figure 3, in general the binary approximation produces detection performance comparable to the original classifier with $N_b \ge 2$ bases, and for the less challenging sequences even a single basis suffices. In terms of detection time, which includes the stages of generating correspondences between model and image, performing geometric verification, and updating the learner, we see that the binary approximation provides significant performance gains (approximately 4 times faster detection with our unoptimized implementation). This means that our approach is suitable for use even on low-powered devices.

Figure 3. (a) Detection rate. (b) Detection time. Behaviour of the learned BRIEF model using our structured output formulation when employing a binary approximation of each $w_j$ as described in Section 3.5. For $N_b \ge 2$ the detection performance is almost equivalent to the original model, whilst being approximately four times faster with our unoptimized implementation.

5. Conclusions

In this paper, we presented a novel approach to learning for real-time keypoint-based object detection and tracking. Our formulation generalizes previous methods by combining the feature matching, learning, and object pose estimation into a coherent structured output learning framework. We showed how such a model can be trained online, and presented an approximation to create an efficient way of using binary descriptors at runtime. During our experiments we showed that learning in the unified structured output learning formulation plays an important role in improving the detection rate compared to state-of-the-art static and learning-based feature matching techniques.

While we did not perform feature selection explicitly, our formulation is implicitly able to down-weight the less discriminative features, and therefore provides a good starting platform for further research into automatic online feature selection.

Acknowledgements. This work is supported by EPSRC CASE and KTP research grants and the IST Programme of the European Community, under the PASCAL2 Network of Excellence, IST-2007-216886. P. H. S. Torr is in receipt of Royal Society Wolfson Research Merit Award.

References

[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). CVIU, 110(3):346–359, June 2008.
[2] M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, volume 5302 of Lecture Notes in Computer Science, pages 2–15, Berlin, Heidelberg, Oct. 2008. Springer Berlin Heidelberg.
[3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent Elementary Features. In ECCV, 2010.
[4] O. Chum and J. Matas. Matching with PROSAC Progressive Sample Consensus. In CVPR, 2005.
[5] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[6] M. Grabner, H. Grabner, and H. Bischof. Learning Features for Tracking. In CVPR, 2007.
[7] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured Output Tracking with Kernels. In ICCV, 2011.
[8] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient BackProp. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, page 546. Springer Berlin / Heidelberg, 1998.
[9] V. Lepetit and P. Fua. Keypoint Recognition Using Randomized Trees. PAMI, 28(9):1465–79, Sept. 2006.
[10] S. Leutenegger, M. Chli, and R. Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
[11] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov. 2004.
[12] M. Özuysal, M. Calonder, V. Lepetit, and P. Fua. Fast Keypoint Recognition Using Random Ferns. PAMI, 32(3):448–61, Mar. 2010.
[13] M. Özuysal, V. Lepetit, F. Fleuret, and P. Fua. Feature Harvesting for Tracking-by-Detection. In ECCV, 2006.
[14] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.
[15] S. Taylor and T. Drummond. Multiple Target Localisation at over 100 FPS. In BMVC, 2009.
[16] P. H. S. Torr and A. Zisserman. MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. Computer Vision and Image Understanding, 78(1):138–156, Apr. 2000.
[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 6:1453–1484, Dec. 2005.