Efficient Online Structured Output Learning for Keypoint-Based Object Tracking

Sam Hare¹, Amir Saffari¹,², Philip H. S. Torr¹

¹ Oxford Brookes University, Oxford, UK
² Sony Computer Entertainment Europe, London, UK
{sam.hare,philiptorr}@brookes.ac.uk [email protected]

978-1-4673-1228-8/12/$31.00 ©2012 IEEE

Abstract

Efficient keypoint-based object detection methods are used in many real-time computer vision applications. These approaches often model an object as a collection of keypoints and associated descriptors, and detection then involves first constructing a set of correspondences between object and image keypoints via descriptor matching, and subsequently using these correspondences as input to a robust geometric estimation algorithm such as RANSAC to find the transformation of the object in the image. In such approaches, the object model is generally constructed offline, and does not adapt to a given environment at runtime. Furthermore, the feature matching and transformation estimation stages are treated entirely separately. In this paper, we introduce a new approach to address these problems by combining the overall pipeline of correspondence generation and transformation estimation into a single structured output learning framework.

Following the recent trend of using efficient binary descriptors for feature matching, we also introduce an approach to approximate the learned object model as a collection of binary basis functions which can be evaluated very efficiently at runtime. Experiments on challenging video sequences show that our algorithm significantly improves over state-of-the-art descriptor matching techniques using a range of descriptors, as well as recent online learning based approaches.

1. Introduction

Keypoint-based object detection has become a cornerstone of modern computer vision, enabling great advances in areas such as augmented reality (AR) and simultaneous localization and mapping (SLAM). These object detection approaches model an object as a set of keypoints, which are matched independently in an input image. Robust estimation procedures based on RANSAC [4, 5, 16] are then used to determine geometrically consistent sets of matches which can be used to infer the presence and transformation of the object.

There has been a great deal of progress in making these approaches suitable for real-time applications, and there are now a range of methods available for use on a desktop PC [1, 9, 12]. Recently, there has been significant interest in developing approaches suitable for low-powered mobile devices such as smartphones and tablets, which are becoming increasingly popular platforms for computer vision applications [3, 10, 14, 15]. These approaches focus on making the matching stage as efficient as possible, since this is generally the most time-consuming part of the detection pipeline. To achieve this, they design image descriptors which can be represented as binary vectors, allowing matching to be performed very efficiently by measuring Hamming distance between descriptors, which can be implemented using binary CPU instructions.

The object models built by traditional approaches are static, usually constructed offline for a particular object. For certain applications like AR and SLAM, however, we want to detect the object repeatedly in a dynamic environment. Additionally, some applications require on-the-fly learning and detection to build an instantaneous model from only a single snapshot of the object. Therefore it is desirable to be able to learn an object model efficiently online and adapt it to a particular environment, which is not typically addressed by traditional approaches. This process of adapting or learning the model should not add significant overhead to the detection pipeline, and should still be suitable for real-time detection on low-powered devices. These requirements create a very challenging problem for a learning algorithm.

The approach we propose in this paper frames the entire object detection procedure as structured output prediction, such that overall detection performance can be optimized given a set of training images. Our formulation combines feature learning, matching, and pose estimation into a single unified framework. Furthermore, because we use a linear structured SVM to perform learning, we are able to perform training online, which allows us to quickly adapt our model to a given environment. Additionally, we show that we can accurately approximate our model during evaluation in such a way that we can take advantage of binary descriptors and the efficiency they provide. As a result, our algorithm adds a relatively small amount of computational overhead compared to static models, while improving the detection rate significantly.

2. Motivation and related work

Keypoint-based methods for geometric object detection generally follow a two stage approach:

1. Finding a set of 2D correspondences between an object model and an input image.

2. Estimating the transformation of the object in the image using a robust geometric verification method based on hypotheses generated from the correspondences (e.g. RANSAC and its variants).

Generally these two steps are considered as separate problems, and many algorithms focus on improving the object detection quality by employing robust methods for each of these steps individually.

To find the appearance-based 2D correspondences, there are two approaches: matching and classification. Matching-based approaches [1, 3, 10, 11] use descriptors to store a signature for each model keypoint in a database. These descriptors are designed to be invariant to various geometric and photometric transformations, and can then be matched given a suitable distance metric to keypoints in an image in a nearest-neighbour fashion.

Classification-based approaches [9, 12, 15] treat matching as multi-class classification, in which the task is to classify each image keypoint as either background or a particular keypoint from the model. These classifiers are learned offline from training examples of the object observed under various geometric and photometric transformations (usually generated synthetically), and are therefore tuned to the specific object and how individual keypoints might appear under various illumination levels and new view-points. The training algorithm and the number of training examples determine the computational complexity of the learning stage.

Since classification-based approaches rely on the availability of a 2D/3D object model at training time, these approaches cannot easily be used for on-the-fly object detection and tracking. In other words, these algorithms are not suitable for detection and tracking of arbitrary unknown objects. This particular problem of the classification-based approaches limits their applicability in practice.

In [13], the authors propose an approach for learning a classification-based model at runtime, by using online random forests to reduce training time. However, this approach is still too computationally expensive to be useful on low-powered devices, and also does not continue to adapt the model after the initial training phase. The method presented in [6] is most related to our work, in which the authors learn keypoint classifiers online by using Haar features and an online boosting algorithm. This approach relies on the fact that the geometric verification step can be used in order to provide labels for updating the classifiers in an online manner, allowing for adaptive tracking by detection.

To the best of our knowledge, all previous methods involving learning treat the generation of correspondences and estimation of object transformation separately. In this paper, we propose a novel approach which combines these two steps into a coherent structured output learning framework. In this formulation, correspondence generation, learning, and transformation estimation are all working together in a unified optimization formulation with the goal of performing object detection robustly. Our approach proposes an alternative view on keypoint-based object detection where the transformation estimation algorithm operates as the maximization step of a structured output learning framework. Unlike the online boosting approach of [6], our formulation is also capable of incorporating any kind of keypoint descriptor into its learning process and is specifically targeted towards low-powered devices.

Structured output prediction was introduced to the computer vision community in [2] for the task of 2D sliding-window object detection. In [7], the authors use a similar approach with online learning to perform adaptive 2D tracking by detection. Our work differs from these approaches in that we are interested in object detection and tracking under a much larger class of transformations such as 3D pose or homography, and as a result we propose using RANSAC in order to perform structured output prediction.

There has recently been significant research interest focusing on object detection for low-powered portable platforms such as smartphones. In particular, highly efficient methods such as BRIEF [3] and BRISK [10] have been developed for descriptor matching. Both of these methods perform simple binary pixel-based tests on keypoints in order to build binary descriptors. By representing these descriptors as bitsets and measuring similarity using the Hamming distance, matching can be performed extremely efficiently using bitwise operations which are well-supported by modern CPUs. We show how the internal representation of our algorithm can be approximated to take advantage of these binary descriptors, making our approach also suitable for low-powered devices.

3. Structured output formulation

In this section, we describe our formulation of keypoint-based object detection as a structured output problem.

3.1. RANSAC for structured prediction

Given an object model $M$ and an input image $I$, the goal of object detection is to compute a transformation $T \in \mathcal{T}$ which maps $M$ to $I$. A 3D pose or 2D homography are examples of such a transformation.

We can think of this process as one of structured output prediction, with the output space consisting of all valid transformations, along with a null transformation indicating the absence of the object. We therefore assume that there exists a function $T = f(M, I)$, and that this function can be expressed as

$$T = \operatorname*{argmax}_{T' \in \mathcal{T}} F(M, I, T'), \tag{1}$$

where $F$ is a compatibility function, scoring all possible transformations of the object given an image.

In practice, finding a solution for the prediction function Eq (1) under a specific model definition is generally infeasible because the output space is very large, and evaluating image observations under different transformations of the model will be expensive. The way that this issue is usually handled is by applying an iterative robust parameter estimation algorithm such as RANSAC [5] or PROSAC [4] to approximately solve Eq (1). These algorithms rely on a sparse representation for the model and image and use a set of correspondences between model and image points as their input.

Consider an object model $M$ which is based on a sparse set of keypoints $M = \{u_1, \ldots, u_J\}$, with each keypoint defined by a location (2D or 3D). Similarly, let the image $I$ be represented as a sparse set of keypoints $I = \{v_1, \ldots, v_K\}$. A set of correspondences $C = \{(u_j, v_k, s_{jk}) \mid u_j \in M, v_k \in I, s_{jk} \in \mathbb{R}\}$ is found between model keypoints and image keypoints, where $s_{jk}$ is a correspondence score derived from appearance information. Traditional RANSAC maximizes the number of inliers defined by

$$F(C, T) = \sum_{(u_j, v_k) \in C} \mathbb{I}\big(\|v_k - T(u_j)\|_2 < \tau\big), \tag{2}$$

where $T(u_j)$ is the location of model keypoint $u_j$ under the transformation $T$, $\tau$ is a spatial mis-alignment threshold and $\mathbb{I}(\cdot)$ is an indicator function. This maximization is performed by randomly sampling transformations which are compatible with minimal subsets of correspondences in $C$, with variants such as PROSAC biasing this sampling by using the correspondence scores $s_{jk}$.

Existing approaches have applied learning in an offline setting [9, 12, 15] as well as in an online setting [6, 13] to encourage reliable appearance-based correspondences to be found in $C$. However, in these approaches the generation of correspondences and the scoring and maximization (Eq (2)) are decoupled from each other. These approaches therefore do not perform learning which takes into account the entire transformation prediction process.

To allow learning for the entire prediction process, we propose introducing a weight vector $w_j$ for each model keypoint $u_j$. This weight vector is used to score correspondences according to $s_{jk} = \langle w_j, d_k \rangle$, where $d_k$ is a descriptor extracted around image keypoint $v_k$, normalized such that $\|d_k\|_2 = 1$. We then propose modifying the compatibility function Eq (2) to include correspondence scores, such that it can be written as a linear operator

$$F_w(C, T) = \sum_{(u_j, v_k) \in C} s_{jk}\, \mathbb{I}\big(\|v_k - T(u_j)\|_2 < \tau\big) = \langle w, \Phi(C, T) \rangle, \tag{3}$$

where $w = [w_1, \ldots, w_J]^T$ is the concatenation of model weight vectors and $\Phi(C, T) = [\phi_1(C, T), \ldots, \phi_J(C, T)]^T$ is a joint feature mapping. Each $\phi_j$ is defined as

$$\phi_j(C, T) = \begin{cases} d_k & \exists (u_j, v_k) \in C : \|v_k - T(u_j)\|_2 < \tau \\ 0 & \text{otherwise.} \end{cases} \tag{4}$$

Our goal is to learn a model parameterized by $w$ such that the behaviour of this function in the output space is close to the actual behaviour of RANSAC, but, because it includes information about appearance, in the process of learning we will discover which model points are the most discriminative and how best we can utilize them to predict transformations.

3.2. Structured output learning

Now, given a set of training examples $\{(I_i, T_i)\}_{i=1}^{N}$, $w$ can be learned in a structured output maximum margin framework [17]. For each training example $i$, this formulation tries to maximize the margin between the score of the true output $T_i$ and all alternative outputs. This can be expressed by the following optimization problem

$$\begin{aligned} \min_{w, \xi} \quad & \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{N} \xi_i \\ \text{s.t.} \quad & \forall i : \xi_i \ge 0 \\ & \forall i, \forall T \ne T_i : \delta F_w^i(T) \ge \Delta(T_i, T) - \xi_i \end{aligned} \tag{5}$$

where $\delta F_w^i(T) = F_w(C_i, T_i) - F_w(C_i, T)$, and $\lambda$ is a parameter determining the trade-off between training set accuracy and regularization. $\Delta(T_i, T)$ is a loss function which measures the penalty for choosing $T$ instead of the true transformation $T_i$. The loss function $\Delta(T_i, T)$ should measure the dissimilarity of two competing output hypotheses, and will be discussed in Section 3.3.

Because we are using RANSAC to perform the output prediction and this relies on an accurate set of correspondences, we modify this formulation to also encourage each inlier correspondence to score higher than any other image correspondence. This can be realized as an additional set of ranking constraints and the formulation then becomes

$$\begin{aligned} \min_{w, \xi, \gamma} \quad & \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{N} \xi_i + \nu \sum_{i=1}^{N} \sum_{(u_j, v_k) \in C_i^*} \gamma_{ij} \\ \text{s.t.} \quad & \forall i : \xi_i \ge 0 \\ & \forall i, \forall T \ne T_i : \delta F_w^i(T) \ge \Delta(T_i, T) - \xi_i \\ & \forall i, \forall j : \gamma_{ij} \ge 0 \\ & \forall i, \forall (u_j, v_k), \forall k' \ne k : \langle w_j, d_k - d_{k'} \rangle \ge 1 - \gamma_{ij} \end{aligned} \tag{6}$$

where $C_i^* \subset C_i$ is the set of inlier correspondences under $T_i$, and $\nu$ is a weighting parameter.

The learning problem presented in Eq (6) allows us to train a discriminative model in a unified framework where learning the representation of model points and performing pose estimation is combined in a structured output maximum margin setting.

3.3. Loss function

Eq (6) requires a loss function $\Delta$ to be defined between two transformations. Since the compatibility function $F_w(C, T)$ sums over those correspondences in $C$ which are inliers under $T$, we desire a loss function which takes into account the fact that transformations will have different numbers of inliers. We consider two such loss functions, which we compare experimentally in Section 4:

1. Hamming distance on inliers:

$$\Delta_H(T, T') = \sum_{(u_j, v_k) \in C} \mathbb{I}\big(z(u_j, v_k, T) \ne z(u_j, v_k, T')\big) \tag{7}$$

where $z(u_j, v_k, T) = \mathbb{I}(\|v_k - T(u_j)\|_2 < \tau)$. This loss function aims to penalize transformations having different inlier sets.

2. Difference in number of inliers:

$$\Delta_I(T, T') = |F(C, T) - F(C, T')|. \tag{8}$$

This loss function aims to penalize transformations with different numbers of inliers, similar in spirit to the traditional RANSAC scoring function (Eq (2)).

3.4. Online learning

While Eq (6) can be solved offline as a batch problem, for our application we are interested in updating $w$ online, such that we can adapt the model to a given environment. The model can be initialized by setting each $w_j$ to be the descriptor one would use in a static model, and in subsequent frames can be updated by performing stochastic gradient descent. We rewrite the optimization problem of Eq (6) in unconstrained form as

$$\min_{w} \Bigg\{ \frac{\lambda}{2}\|w\|^2 + \sum_{i=1}^{N} \Big( \max_{T \ne T_i} \{\Delta(T_i, T) - \delta F_w^i(T)\} \Big)_+ + \nu \sum_{i=1}^{N} \sum_{(u_j, v_k) \in C_i^*} \Big( \max_{k' \ne k} \{1 - \langle w_j, d_k - d_{k'} \rangle\} \Big)_+ \Bigg\} \tag{9}$$

where $(\cdot)_+ = \max\{0, \cdot\}$ is the hinge function.

Given a training example $(I_t, T_t)$ at time $t$, a subgradient of Eq (9) is found with respect to $w$, and a gradient descent step is then performed according to

$$\begin{aligned} w_j^{t+1} \leftarrow\ & (1 - \eta_t \lambda)\, w_j^t \\ & + \mathbb{I}\Big(\max_{T \ne T_t} \{\Delta(T_t, T) - \delta F_w^t(T)\} > 0\Big)\, \eta_t\, \alpha_j^t \\ & + \mathbb{I}(u_j \in C_t^*)\, \mathbb{I}\Big(\max_{k' \ne k} \{1 - \langle w_j^t, d_k - d_{k'} \rangle\} > 0\Big)\, \eta_t\, \nu\, \beta_j^t, \end{aligned} \tag{10}$$

where $\eta_t = 1/(\lambda t)$ is the step size. Let $\hat{T} = \operatorname*{argmax}_{T \ne T_t} \{\Delta(T_t, T) - \delta F_w^t(T)\}$ and $\hat{k} = \operatorname*{argmax}_{k' \ne k} \{1 - \langle w_j^t, d_k - d_{k'} \rangle\}$. Then $\alpha_j^t$ and $\beta_j^t$ are defined as

$$\alpha_j^t = \phi_j(C_t, T_t) - \phi_j(C_t, \hat{T}), \tag{11}$$

and

$$\beta_j^t = d_k - d_{\hat{k}}. \tag{12}$$

To estimate $T_t$ for the current image, we use the prediction of Eq (1) using the old model representation $w^{t-1}$ and then update the model according to Eq (10). Furthermore, when performing RANSAC in our prediction function we will also be exploring and scoring other transformations, which gives us a mechanism for identifying any margin violations which have occurred, the largest of which will contribute to the gradient descent step Eq (10). In this way, our online learning can re-use the intermediate results of estimating $T_t$, and thus adds only a small amount of overhead compared to detection alone.

3.5. Binary approximation of model

An important goal of our method is to be real-time and suitable for low-powered devices, and we would therefore like to take advantage of binary descriptors. Although these descriptors are very compact when represented as bitsets, to use a linear SVM requires converting them into high-dimensional real vectors. While this is acceptable when updating the classifier, it would be very computationally expensive at the matching stage, which requires exhaustive evaluation of every model classifier with every image keypoint. To avoid this, we propose approximating each $w_j$ in terms of a set of basis vectors

$$w_j \approx \sum_{i=1}^{N_b} \beta_i b_i \tag{13}$$

where $b_i \in \{-1, 1\}^D$, and $D$ is the dimensionality of the descriptor. This approximation must be updated each time $w_j$ changes, so we choose to use a simple greedy method as described in Algorithm 1.

Algorithm 1 Binary approximation of $w_j$

Require: $w_j$, $N_b$
  $r = w_j$ (initialize residual)
  for $i = 1$ to $N_b$ do
    $b_i = \operatorname{sign}(r)$
    $\beta_i = \langle b_i, r \rangle / \|b_i\|^2$ (project $r$ onto $b_i$)
    $r \leftarrow r - \beta_i b_i$ (update residual)
  end for
  return $\{\beta_i\}_{i=1}^{N_b}$, $\{b_i\}_{i=1}^{N_b}$

(a) barbapapa (b) comic (c) map (d) paper (e) phone

Figure 1. Example frames from our test sequences, which also show the detection results for the BRIEF model learned using our structured output framework (Section 4.1). These sequences are challenging for keypoint-based matching approaches due to the presence of many similar features in the scene.

Using this approximation, we can efficiently compute the dot-product $\langle w_j, d \rangle$ using only bitwise operations. To do so, we represent each $b_i$ using a binary vector and its complement: $b_i = b_i^+ - \bar{b}_i^+$, where $b_i^+ \in \{0, 1\}^D$. We then rewrite

$$\langle w_j, d \rangle \approx \sum_{i=1}^{N_b} \beta_i \big( \langle b_i^+, d \rangle - \langle \bar{b}_i^+, d \rangle \big), \tag{14}$$

and note that each dot-product inside the summation can be computed very efficiently using a bitwise AND followed by a bit-count. This can be computed even more efficiently if we have precomputed the bit-count of $d$, since $\langle b_i^+, d \rangle - \langle \bar{b}_i^+, d \rangle = 2\langle b_i^+, d \rangle - |d|$. This means that by approximating $w_j$ with $N_b$ components, our correspondence score is roughly $N_b$ times more expensive to evaluate than a binary Hamming distance. In practice, we find it sufficient to set $N_b = 2$, see Section 4 for experimental results.
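The greedy procedure of Algorithm 1 and the bitwise evaluation of Eq. (14) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the authors' C++ code: the function names are hypothetical, `sign(0)` is taken to be +1 (the text does not specify this), bit `i` of a mask corresponds to element `i` of the vector, and the descriptor is used as raw 0/1 bits without the centering transform discussed in Section 4.1.

```python
def sign(x: float) -> int:
    """Sign in {-1, +1}; zero is mapped to +1 (an assumption)."""
    return 1 if x >= 0 else -1


def binary_approx(w, n_b):
    """Greedy approximation of w by n_b scaled {-1, +1} vectors (Algorithm 1)."""
    r = list(w)  # residual, initialised to w
    betas, bases = [], []
    for _ in range(n_b):
        b = [sign(x) for x in r]
        # <b, r> / ||b||^2, where ||b||^2 equals the dimensionality D
        beta = sum(bi * ri for bi, ri in zip(b, r)) / len(b)
        r = [ri - beta * bi for ri, bi in zip(r, b)]
        betas.append(beta)
        bases.append(b)
    return betas, bases


def pack_positive(b):
    """Pack the +1 positions of b into an integer bitmask (b^+ in the text)."""
    mask = 0
    for i, bi in enumerate(b):
        if bi > 0:
            mask |= 1 << i
    return mask


def popcount(x: int) -> int:
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def approx_dot(betas, masks, d_bits):
    """Eq. (14) via AND plus bit-count: sum_i beta_i * (2 <b_i^+, d> - |d|)."""
    d_count = popcount(d_bits)  # |d|, computed once per descriptor
    return sum(beta * (2 * popcount(m & d_bits) - d_count)
               for beta, m in zip(betas, masks))


# Toy example with D = 4 and N_b = 2.
w = [0.5, -0.25, 0.75, -1.0]
betas, bases = binary_approx(w, 2)
masks = [pack_positive(b) for b in bases]
d_bits = 0b1101  # descriptor bits d = [1, 0, 1, 1], element 0 in the lowest bit
score = approx_dot(betas, masks, d_bits)
```

In this toy case the exact dot product is 0.25 and the two-basis approximation gives 0.375; each additional basis costs one more AND and bit-count per correspondence, which is the trade-off described above.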
4. Experiments

We performed a number of experiments in order to validate the approach described in this paper. Our method is applicable to general object models and transformations, but for the purposes of our experiments we consider the case of a planar object model detected in an image under a homography transformation.

We recorded a number of video sequences of a static scene observed from a moving camera, using a SLAM system to track the 3D camera pose in each frame (example frames can be seen in Figure 1). Each sequence begins with a fronto-parallel view of a planar patch, which is used in our experiments to define the object model. Using the known camera pose, we computed a ground-truth homography for the object in each video frame, which is then used for evaluating the quality of the homography estimates produced during object detection in our experiments.

In order to evaluate the effectiveness of our approach and to observe the contribution of learning and the combined structured output framework, we also implemented a baseline of independently online trained SVM classifiers for each model keypoint. In this framework, we take away the coupling between model points that comes from our model and train each SVM classifier independently of one another. At run-time, we compute a matching score for the j-th model keypoint using the learned SVM as $f_j(d_k) = \langle w_j, d_k \rangle$ and use this score to find the highest scoring match to construct the correspondence set for pose estimation. To update each classifier, we take each inlier returned from the geometric verification set as a positive training example, and select the next highest scoring image keypoint as a negative example. We then perform stochastic gradient descent learning to update the classifiers.

Additionally, we implemented the boosting-based classification approach used in [6], by making use of the publicly available online boosting code of the authors¹. We train these classifiers in the same manner as our independent SVM baseline.

The unoptimized C++ implementation of our approach as well as the annotated videos used during our experiments are publicly available to download².

¹ http://www.vision.ee.ethz.ch/boostingTrackers/onlineBoosting.htm
² http://www.samhare.net/research

4.1. Tracking by detection

To illustrate the benefit of online learning we consider the task of tracking-by-detection, in which the target object should be detected in consecutive frames of a video sequence. For this task we do not use any information about the location of the object in the previous frame when detecting the object, but we use each successful detection in order to perform an online learning step to update our object model for subsequent frames.

We apply our approach using three different combinations of interest point detector and descriptor: FAST detector with 256-bit BRIEF descriptor, BRISK detector with 512-bit BRISK descriptor and SURF detector with SURF64 descriptor. These have been chosen to illustrate that our method works with a variety of feature point detectors and descriptors, but as they each have different invariances and dimensionality, our results should not be interpreted as a comparison between different descriptor types. Therefore, we are interested in relative performance figures for a particular feature point detector and descriptor combination.

When learning with binary descriptors, we apply the feature transformation $\tilde{d} = (d - 0.5)/(0.5\sqrt{D})$, where $D$ is the dimensionality of the descriptor, which centers and normalizes the descriptors, as this is known to improve the performance of stochastic gradient descent algorithms [8]. During matching this transformation can easily be handled implicitly in the binary approximation without any overhead. We fix the SVM learning rate $\lambda = 0.1$ for all experiments. We also set $\nu = 1$ for the structured model.

For each sequence, we initialize a model using the fronto-parallel planar patch in the first frame, by detecting the 100 strongest features to define the locations of model keypoints $M$. Given these locations we initialize four models for comparison: a non-adaptive matching-based model using the descriptors for the model keypoints extracted from the first frame (to represent a traditional keypoint-based object detection approach); the baseline learned model using independent SVM classifiers; and our structured output approach with the two loss functions described in Section 3.3. We initialize the weight vector for a model keypoint by setting it to the descriptor from the first frame.

To assess detection performance, we define a scoring function between the ground-truth homography $T^*$ and the predicted homography $T$ as:

$$S(T^*, T) = \frac{1}{4} \sum_{i=1}^{4} \|c_i - (T^* T^{-1})(c_i)\|_2, \tag{15}$$

where $\{c_i\}_{i=1}^{4} = \{(-1, -1)^T, (1, -1)^T, (-1, 1)^T, (1, 1)^T\}$ define the corners of a square. This score will be 0 if the two homographies are identical. Frames for which $S(T^*, T) < 10$ are considered correct detections, and we report the average over the entire sequence.

           BRIEF                       BRISK                       SURF
Sequence   Static Indep. ΔH     ΔI     Static Indep. ΔH     ΔI     Static Indep. ΔH     ΔI     Boost. [6]
barbapapa  0.19   0.94   0.94   0.94   0.92   0.93   0.94   0.94   0.89   0.45   0.92   0.92   0.88
comic      0.41   0.90   0.94   0.98   0.42   0.60   0.61   0.76   0.83   0.67   0.90   0.93   0.56
map        0.82   0.98   0.99   0.99   0.79   0.91   0.91   0.93   0.91   0.09   0.98   0.99   0.83
paper      0.06   0.68   0.77   0.85   0.04   0.40   0.51   0.54   0.03   0.01   0.03   0.03   0.04
phone      0.88   0.93   0.97   0.97   0.64   0.82   0.91   0.92   0.92   0.46   0.96   0.97   0.85

Table 1. Average detection rates in test sequences (higher is better). Each row represents a video sequence. Each set of columns shows a different combination of feature point detector and descriptor, while the last single column is the result of the boosting approach. Within a feature detector/descriptor combination, we compare the results of no learning (static), independently trained SVM classifiers, and our structured output learning framework with the two loss functions $\Delta_H$ and $\Delta_I$ defined in Section 3.3. The bold-face font represents the best working method for a video sequence in a chosen detector/descriptor setting.

The results of our experiments without the binary approximation described in Section 3.5 can be seen in Table 1³. As can be seen from this table, the structured output learning framework outperforms the static model (with no learning), as well as the model trained with independent SVM classifiers. Comparing the results of independent SVM classifiers and the static model highlights the fact that adapting an object model to a particular environment online helps a lot in practice. However, the highest detection rate is attained when we use the structured output framework, where the learning of the object model and geometric estimation are linked inside a unified optimization formulation. In particular, using the loss function $\Delta_I$ based on the difference in number of inliers results in the best performance. It should be noted that for SURF descriptors the independent SVMs had difficulty learning the correct object model. We suspect that this is caused by the continuous nature of the SURF descriptor and the fact that the number of generated keypoints is lower with the SURF keypoint detector. However, given the same settings, the structured learning approach is able to benefit fully from the adaptation process and improve upon the static model.

³ The actual videos and the results of tracking-by-detection can be found in our supplementary material or at http://www.samhare.net/research.
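The detection rates reported in Table 1 follow from the score of Eq. (15) and the threshold $S(T^*, T) < 10$. A minimal sketch of that evaluation is given below; it is not the authors' evaluation code. Homographies are assumed to be 3x3 nested lists, a missed detection is assumed to be represented by `None`, and all helper names are hypothetical.

```python
import math

# Corners of the reference square from Eq. (15).
CORNERS = [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]


def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate; raises ValueError if singular."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("singular homography")
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]


def apply_h(hom, p):
    """Apply a homography to a 2D point, including the perspective division."""
    x, y = p
    u = hom[0][0] * x + hom[0][1] * y + hom[0][2]
    v = hom[1][0] * x + hom[1][1] * y + hom[1][2]
    w = hom[2][0] * x + hom[2][1] * y + hom[2][2]
    return (u / w, v / w)


def score(t_true, t_pred):
    """Eq. (15): mean displacement of the square corners under T* T^{-1}."""
    m = matmul3(t_true, inv3(t_pred))
    total = 0.0
    for c in CORNERS:
        x, y = apply_h(m, c)
        total += math.hypot(c[0] - x, c[1] - y)
    return total / 4.0


def detection_rate(frames, threshold=10.0):
    """Fraction of frames whose prediction scores below the threshold."""
    hits = 0
    for t_true, t_pred in frames:
        # None marks a frame with no detection (an assumption of this sketch).
        if t_pred is not None and score(t_true, t_pred) < threshold:
            hits += 1
    return hits / len(frames)


def translation(tx, ty):
    """Pure-translation homography, used only to build a toy example."""
    return [[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]]


IDENTITY = translation(0.0, 0.0)
frames = [
    (IDENTITY, IDENTITY),                 # perfect prediction, score 0
    (IDENTITY, translation(3.0, 4.0)),    # corners move by 5, counted as correct
    (IDENTITY, translation(30.0, 40.0)),  # corners move by 50, counted as a miss
    (IDENTITY, None),                     # no detection
]
rate = detection_rate(frames)
```

With these four toy frames, two satisfy the threshold, so the sketch reports a rate of 0.5.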
For the boosting-based online descriptor learning ap-
proach, it is only fair to compare against the models where
we use the BRIEF descriptor (as both of these methods are (b) Learned BRIEF model using our structured output formulation
using the same FAST corner detector). Again one can see
that comparing the boosting method with the static method, Figure 2. Example frames from the paper sequence showing the
learning still provides an improvement. However, the boost- top correspondence for each model point. The model is displayed
ing based approach is not able to outperform the indepen- in a green box on the left of this image. The brightness of each
dent SVM learning framework, and is therefore also per- line indicates the correspondence score, before any geometric ver-
forming worse than the structured output framework. ification has taken place (the brighter the higher the score). The
The most difficult video in our set of experiments is the learned model has adapted to discriminate against the many con-
paper sequence. This video sequence features highly repet- fusing keypoints in the image, resulting in a successful detection,
itive local appearance structures and a simple static model while no detection is found with the static model.
fails in all cases. The learning based approaches (except the boosting method) are able
to deliver a reasonable detection rate using binary descriptors. An example frame of this
sequence is shown in Figure 2 where we display the generated correspondences before
geometric verification. As can be seen in the top image, because of the confusing
appearance of the local image features, the static BRIEF model fails to match model
keypoints consistently to the image. However, the structured learning framework which
uses the same set of descriptors extracted from the input image for matching has learned
a more discriminative object model and is able to provide more meaningful correspondences
which are mainly focused on where the object is in the input frame. Another observation
is that although the structured learning model creates some mis-matches (lines matching
an object point to some background locations), they all contain very small matching
scores (indicated by their dark color).

4.2. Binary approximation

To verify that the binary approximation proposed in Section 3.5 is reasonable when using
binary descriptors such as BRIEF and BRISK, we repeat our experiments for the BRIEF
descriptor model learned in our structured output framework and approximate the model
keypoint weight vectors w_j with varying numbers of binary bases N_b. As can be seen in
Figure 3, in general the binary approximation produces detection performance comparable
to the original classifier with N_b ≥ 2 bases, and for the less challenging sequences
even a single basis suffices. In terms of detection time, which includes the stages of
generating correspondences between model and image, performing geometric verification,
and updating the learner, we see that the binary approximation provides significant
performance gains (approximately 4 times faster detection with our unoptimized
implementation). This means that our approach is suitable for use even on low-powered
devices.

5. Conclusions

In this paper, we presented a novel approach to learning for real-time keypoint-based
object detection and tracking. Our formulation generalizes previous methods by combining
the feature matching, learning, and object pose estimation into a coherent structured
output learning framework. We showed how such a model can be trained on-

(a) Detection rate (b) Detection time

Figure 3. Behaviour of the learned BRIEF model using our structured output formulation when employing a binary approximation of
each w_j as described in Section 3.5. For N_b ≥ 2 the detection performance is almost equivalent to the original model, whilst being
approximately four times faster with our unoptimized implementation.
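
To make the binary approximation evaluated in Figure 3 concrete, the following is a minimal sketch and not the authors' implementation. Section 3.5 is not reproduced here, so the greedy fitting scheme is an assumption: each basis is taken as the sign pattern of the current residual, with the least-squares optimal scale for that pattern. The function names (`binary_bases`, `approx_score`, `exact_score`) and the packing of descriptors into Python integers are hypothetical choices made for illustration.

```python
def binary_bases(w, num_bases):
    """Greedily approximate w by sum_i beta_i * b_i with b_i in {-1, +1}^D.

    Assumed scheme (not taken from the paper): each basis is the sign
    pattern of the current residual. A basis is returned as (beta, mask),
    where bit k of mask is set when component k of b_i is +1.
    """
    dim = len(w)
    residual = list(w)
    bases = []
    for _ in range(num_bases):
        signs = [1 if r >= 0 else -1 for r in residual]
        # For b = sign(residual), <b, residual> = sum(|residual|); dividing
        # by D gives the least-squares optimal scale for that b.
        beta = sum(abs(r) for r in residual) / dim
        mask = sum(1 << k for k, s in enumerate(signs) if s > 0)
        bases.append((beta, mask))
        residual = [r - beta * s for r, s in zip(residual, signs)]
    return bases


def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def approx_score(bases, x_bits):
    """Approximate <w, x> for a binary descriptor x packed into an int."""
    ones = popcount(x_bits)
    # <b, x> = (#bits with b=+1 and x=1) - (#bits with b=-1 and x=1)
    #        = 2 * popcount(mask & x) - popcount(x)
    return sum(beta * (2 * popcount(mask & x_bits) - ones)
               for beta, mask in bases)


def exact_score(w, x_bits):
    """Reference value of <w, x>, summing w_k over the set bits of x."""
    return sum(wk for k, wk in enumerate(w) if (x_bits >> k) & 1)
```

Under these assumptions, scoring a descriptor against one model keypoint costs N_b bitwise ANDs and bit counts in place of D floating-point multiply-adds, which is the kind of saving that the detection times in Figure 3(b) measure. Comparing `approx_score(binary_bases(w, n), x)` with `exact_score(w, x)` for increasing `n` gives a quick check of how many bases a given weight vector needs.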

line, and presented an approximation to create an efficient way of using binary
descriptors at runtime. During our experiments we showed that learning in the unified
structured output learning formulation plays an important role in improving the detection
rate compared to state-of-the-art static and learning-based feature matching techniques.

While we did not perform feature selection explicitly, our formulation is implicitly able
to down-weight the less discriminative features, and therefore provides a good starting
platform for further research into automatic online feature selection.

Acknowledgements This work is supported by EPSRC CASE and KTP research grants and the IST
Programme of the European Community, under the PASCAL2 Network of Excellence,
IST-2007-216886. P. H. S. Torr is in receipt of a Royal Society Wolfson Research Merit
Award.

References
[1] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-Up Robust Features (SURF). CVIU, 110(3):346–359, June 2008.
[2] M. B. Blaschko and C. H. Lampert. Learning to Localize Objects with Structured Output Regression. In D. Forsyth, P. Torr, and A. Zisserman, editors, ECCV, volume 5302 of Lecture Notes in Computer Science, pages 2–15, Berlin, Heidelberg, Oct. 2008. Springer Berlin Heidelberg.
[3] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Binary Robust Independent Elementary Features. In ECCV, 2010.
[4] O. Chum and J. Matas. Matching with PROSAC Progressive Sample Consensus. In CVPR, 2005.
[5] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, June 1981.
[6] M. Grabner, H. Grabner, and H. Bischof. Learning Features for Tracking. In CVPR, 2007.
[7] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured Output Tracking with Kernels. In ICCV, 2011.
[8] Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient BackProp. In G. Orr and K.-R. Müller, editors, Neural Networks: Tricks of the Trade, volume 1524 of Lecture Notes in Computer Science, page 546. Springer Berlin / Heidelberg, 1998.
[9] V. Lepetit and P. Fua. Keypoint Recognition Using Randomized Trees. PAMI, 28(9):1465–79, Sept. 2006.
[10] S. Leutenegger, M. Chli, and R. Siegwart. BRISK: Binary Robust Invariant Scalable Keypoints. In ICCV, 2011.
[11] D. G. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2):91–110, Nov. 2004.
[12] M. Özuysal, M. Calonder, V. Lepetit, and P. Fua. Fast Keypoint Recognition Using Random Ferns. PAMI, 32(3):448–61, Mar. 2010.
[13] M. Özuysal, V. Lepetit, F. Fleuret, and P. Fua. Feature Harvesting for Tracking-by-Detection. In ECCV, 2006.
[14] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In European Conference on Computer Vision, volume 1, pages 430–443, May 2006.
[15] S. Taylor and T. Drummond. Multiple Target Localisation at over 100 FPS. In BMVC, 2009.
[16] P. H. S. Torr and A. Zisserman. MLESAC: A New Robust Estimator with Application to Estimating Image Geometry. Computer Vision and Image Understanding, 78(1):138–156, Apr. 2000.
[17] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large Margin Methods for Structured and Interdependent Output Variables. JMLR, 6:1453–1484, Dec. 2005.
