Articulated Body Motion Capture by Stochastic Search

The document describes a method for markerless motion capture of articulated bodies using multiple cameras and image processing techniques. The key aspects of the method are: 1. It uses an articulated body model and edge/background subtraction from camera images to estimate body pose over time. 2. It develops a novel particle filtering approach called "annealed particle filtering" which uses a continuation principle based on simulated annealing to gradually introduce the influence of narrow peaks in the fitness function. This allows efficient searching of the high-dimensional configuration space. 3. It shows the annealed particle filter can effectively track rapid, arbitrary human motions in real-time using only standard video cameras without any attached markers or sensors.


International Journal of Computer Vision 61(2), 185–205, 2005


© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Articulated Body Motion Capture by Stochastic Search

JONATHAN DEUTSCHER AND IAN REID


Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ, United Kingdom
[email protected]
[email protected]

Received August 19, 2003; Revised April 9, 2004; Accepted April 9, 2004

First online version published in October, 2004

Abstract. We develop a modified particle filter which is shown to be effective at searching the high-dimensional configuration spaces (c. 30+ dimensions) encountered in visual tracking of articulated body motion. The algorithm uses a continuation principle, based on annealing, to introduce the influence of narrow peaks in the fitness function gradually. The new algorithm, termed annealed particle filtering, is shown to be capable of recovering full articulated body motion efficiently. A mechanism for achieving a soft partitioning of the search space is described and implemented, and shown to improve the algorithm’s performance. Likewise, the introduction of a crossover operator is shown to improve the effectiveness of the search for kinematic trees (such as a human body). Results are given for a variety of agile motions such as walking, running and jumping.

Keywords: human motion capture, visual tracking, particle filtering, genetic algorithms

1. Introduction

A popular form of motion capture, for tasks such as gait analysis and computer animation, involves attaching a number of retro-reflective markers to a subject’s body and viewing the motion of the markers over time using a set of calibrated cameras. The easily-recovered image positions of the markers are transformed into 3D trajectories via triangulation of the measurements, and a parameterised representation of the subject’s movements can be calculated.

The use of markers is intrusive and restrictive, and necessitates the use of potentially expensive, specialised capture hardware. The goal of markerless motion capture is to reproduce the performance of marker-based methods in a system using conventional cameras and without the use of special apparel or other equipment. For this reason recent years have seen a huge growth in research in the computer vision community with the aim of recovering motion data directly from images, without markers. However, full-body tracking from standard images is a challenging problem, and markerless systems presented to date rarely achieve the following combination of capabilities of current marker-based systems: full 3D motion recovery; robust tracking of rapid, arbitrary movement; high accuracy; easy application to new scenarios; on-line model acquisition; real-time, or near real-time processing.

A major problem which confronts all attempts to satisfy these criteria is the high dimensionality of the configuration space, and the exponentially increasing computational cost that results. A realistic articulated model (see Fig. 4) of the human body usually has at least 25 DOF. The model used in this paper for example has between 29 and 34 DOF, and models employed for commercial character animation usually have over 40.

In this paper we describe a multi-camera system for markerless human motion capture which goes some way to achieving the goals above. The work described combines and extends our previous efforts published
in short form in Deutscher et al. (2000, 2001). Our approach is characterised by the following: 1. articulated body model, 2. weak dynamical modelling, 3. edge and background subtraction image measurements, and 4. a particle-based stochastic search algorithm. The key contributions comprise:

• The development of a novel, particle-based stochastic search algorithm, called annealed particle filtering. The method uses a continuation principle, based on annealing, to introduce the influence of narrow peaks in the fitness function gradually. This is introduced in Section 3.
• The annealed particle filter is applied to the problem of markerless human motion capture, and shown to be more effective and efficient than, for example, Condensation (Blake and Isard, 1998), at localising the pose. Section 4 discusses implementation issues and Section 5 shows results.
• We demonstrate how adaptive selection of the variances/covariances which control the diffusion during annealing can lead to what can be thought of as a “soft” hierarchical partitioning of the configuration space, and hence to further gains in efficiency.
• We introduce a crossover operator, analogous to that found in Genetic Algorithms, into the particle filtering framework. We demonstrate that this operator improves the ability of the algorithm to search the configuration spaces of objects whose articulations can be modelled as a kinematic tree. In particular we show results for reliable and efficient tracking for walking, running and jumping with no special training of the dynamics of any activity.

These latter two developments are discussed in Section 6.

We begin with a review of relevant literature in Section 2, including a detailed discussion of the two most closely associated technologies: particle filtering and simulated annealing.

2. Background

2.1. Visual Tracking and Particle Filters

Full-body motion capture is an example of model-based tracking; i.e. the process of sequentially estimating the parameters of a model of a target over time from visual data. Typically a priori knowledge about the target’s observable properties (such as its geometry) is compared with the visual data from an image stream, to estimate a best fit for each frame in the scene. Thus the principal components usually comprise a target model, an image search method, and a dynamical model.

The system of Plänkers and Fua (2003) represents one of the best examples of this paradigm in its simplest form. Their system does not have a (strong) model of the person’s dynamics (in contrast to Sidenbladh et al. (2000), e.g., see below) or have a sophisticated multi-modal search algorithm such as we describe. Rather, the key to the success of their system is in much more careful modelling of the shape and appearance than in most other work, and in the use of binocular disparity maps as well as silhouette data. Unlike most other work (including our own) their system estimates the size of the body as well as the pose parameters.

In considering the role of the other two system components, search and dynamics, it is useful to discuss the influential work of Harris (1992). He showed how rapid motions can be tracked by constraining the search area via a predicted motion of the object. Harris used rigid polyhedral models (and simple surfaces of revolution) and sought the 6 DOF pose of the object. Given a predicted location, the system searches from predicted edge locations along 1D profiles to find “actual” edges. These 1D measurements are then combined to obtain a pose update.

Drummond and Cipolla (2001) showed how many of the ideas in Harris’ system can be extended to articulated objects by effectively tracking body parts using Harris’ method but enforcing global consistency via kinematic constraints.

A second and arguably more important innovation in Harris (1992) was to place the tracking system within the framework of a Kalman Filter, a provably optimal recursive estimator for linear systems which can be thought of as an algorithm for sequential propagation of Gaussian probability densities.

A natural step would be to consider the use of a Kalman Filter (or its extension to non-linear measurements and/or dynamics, the Extended Kalman Filter or EKF) for articulated body tracking. Wachter and Nagel (1999) demonstrated this for single view tracking using image motion and edges (though the results show only motion parallel to the image plane). More recently Mikic et al. (2001) demonstrated the extraction and filtering of pose parameters from a volumetric model obtained by “carving” space using silhouettes from multiple cameras.

While Gaussian uncertainty is sufficient for modelling many motion and measurement noise sources, the Kalman Filter has been shown to fail catastrophically in cases where the true probability function has a very different shape. In particular attempts to track objects moving against a very cluttered background, where measurement densities include the chance of detecting erroneous image features and are therefore multi-modal, lead to tracking failure for Harris’ algorithm and many of its ilk.

An alternative, more general approach is particle filtering, in which arbitrary densities are approximated. This was introduced in the context of visual tracking in the form of the Condensation algorithm (Isard and Blake, 1996). A posterior density $p(X \mid Z_t)$ representing current knowledge about the model state $X$ after incorporation of all measurements up to the current time-step $t$, $Z_t$, is represented by a finite set of normalised weighted particles, or samples,

$$\left\{ \left(s_t^{(0)}, \pi_t^{(0)}\right) \cdots \left(s_t^{(N)}, \pi_t^{(N)}\right) \right\}.$$

An estimate of the state $X_t$ at each time-step $t$ can easily be obtained from the sample mean of the posterior density, $p(X \mid Z_t)$,

$$X_t = E[X] = \sum_{n=1}^{N} \pi_t^{(n)} s_t^{(n)} \tag{1}$$

or the mode

$$X_t = M[X] = s_t^{(j)}, \quad \pi_t^{(j)} = \max\left(\pi_t^{(n)}\right). \tag{2}$$

Essentially, a smooth probability density function is approximated by a finite collection of weighted sample points, and it can be shown that as the number of points tends to infinity the behaviour of the particle set is indistinguishable from that of the smooth function.

Tracking with a particle filter works by:

1. Resampling, in which a weighted particle set is transformed into a set of evenly weighted particles distributed with concentration dependent on probability density;
2. Stochastic movement and dispersion of the particle set in accordance with a motion model to represent the growth of uncertainty during movement of the tracked object;
3. Measurement, in which the likelihood function is evaluated at each particle site, producing a new weight for each particle proportional to how well it fits image data. The weighted particle set produced represents the new (posterior) probability density after movement and measurement.

Particle filtering works well for tracking in clutter because it can represent arbitrary functional shapes and propagate multiple hypotheses. Less likely model configurations will not be thrown away immediately but given a chance to prove themselves later on, resulting in more robust tracking. In its original implementation, Condensation demonstrated robust tracking in low-dimensional configuration spaces (up to about 10 DOF) in the presence of significant clutter. Even in the absence of a cluttered background, the complicated nature of the observation process during human motion capture causes the posterior density to be non-Gaussian and multi-modal as shown experimentally by Deutscher et al. (1999). Condensation has indeed been implemented successfully for short human motion capture sequences (see Deutscher et al. (2000) and Sidenbladh et al. (2000)). However, in the high-dimensional configuration spaces occurring in human motion capture and other domains, there are serious problems with Condensation arising from the inability of a manageable size particle set (of, say, a few thousand particles) adequately to populate the space and represent an arbitrary density. In fact it has been shown by MacCormick and Isard (2000) that $N \geq D_{\min} / \alpha^{d}$, where $N$ is the number of particles required and $d$ is the number of dimensions. The survival diagnostic $D_{\min}$ and the particle survival rate $\alpha$ are both constants with $\alpha \ll 1$. Clearly when $d$ is large normal particle filtering becomes infeasible.

Cham and Rehg (1999) proposed the use of a multiple hypothesis tracker which represented the posterior distribution as a piecewise Gaussian. As only local modes are propagated between frames, the solution is computationally much cheaper than Condensation, but they avoid the pitfalls of a single hypothesis tracker. Unlike our work, in which we explicitly model the joint angles and overall pose degrees of freedom, they use a so-called scaled prismatic model which explicitly models 2D in-plane translation and rotation, but models out of plane rotation via a per-link independent scaling.

Partitioned sampling was developed by MacCormick and Blake (1999) as a variation on Condensation to reduce the number of particles needed to track more than one object, and applied by MacCormick and Isard (2000) to the problem
of tracking articulated objects. Using partitioned sampling reduces the number of particles required to $N \geq D_{\min} / \alpha$, making the problem tractable. However, this assumes that the configuration space can be sliced so that one can construct an observation density $p(Z_t \mid x_i)$ for each dimension $x_i$ of the model configuration vector $X = \{x_0 \ldots x_d\}$. This assumption, that it is possible independently to localise separate parts of an articulated model, is similar to that made by Gavrila and Davis (1996) to enable a hierarchical search.

Another variation on the standard particle filter used to reduce the number of particles needed effectively to represent a posterior density has been developed by Sullivan et al. (1999). Called layered sampling, it is centred around the concept of importance resampling. The technique we present in this paper bears some similarity to layered sampling, but experimental evidence suggests our technique is more effective at reducing the number of particles required when the dimensionality of the search space approaches 30.

Two successful recent approaches which use particle filtering are due to Sminchisescu and Triggs (2001) and Sidenbladh et al. (2000). Both are concerned with monocular tracking (in some important ways more difficult than the multi-camera case) but in other respects the problem is essentially the same: how can a high dimensional space be adequately populated with a particle set of manageable size? Their approaches to this problem are quite different and in some ways complementary.

The former introduces the idea of covariance sampling, spreading particles in areas where there is least confidence in the localisation. This idea is very closely related to our soft partitioning approach developed in Section 6.1. More recently they have extended this work explicitly to take into account the particular ambiguities that arise from human kinematics, “scattering” particles into areas of potential ambiguity and therefore making better use of the particle set (Sminchisescu and Triggs, 2003).

The latter work (Sidenbladh et al., 2000, 2002), on the other hand, takes the approach that dynamical modelling can be used to obtain strong, predictive priors, reducing the search space to manageable proportions. Indeed in Sidenbladh et al. (2000) tracking was restricted, via the learnt dynamics, to the case of walking people. More recently, however, Sidenbladh et al. (2002) showed how a database of motions could be constructed and efficiently indexed in order to obtain predictions over a wide class of motions.

In addition to the problems of representing PDFs via particle sets in high dimensional spaces, a second difficulty is associated with constructing a valid observation model $p(Z_t \mid X)$ as a normalised probability density distribution. Even if such a likelihood model can be constructed the cost of evaluating it can be prohibitive.¹ Often an intuitive weighting function $w(Z_t, X)$ can be constructed that approximates the probabilistic likelihood $p(Z_t \mid X)$ but which requires much less computational effort to evaluate.

In this paper we reduce the problem from propagating the conditional density $p(X \mid Z_t)$ using $p(Z \mid X)$, simply to finding the configuration $X_t$ which returns the maximum value from a simple and efficient weighting function $w(Z_t, X)$ at each time $t$, given $X_{t-1}$. By doing this gains are made on two fronts: (i) it is possible to make do with fewer likelihood (or weighting function) evaluations because the function $p(X \mid Z_t)$ no longer has to be fully represented; and (ii) an evaluation of a simple weighting function $w(Z_t, X)$ requires less computational effort when compared to an evaluation of the observation model $p(Z_t \mid X)$. The main disadvantage is that we no longer work within a truly Bayesian framework.

We retain the use of a particle based stochastic framework because of its ability to handle multi-modal likelihoods, or in the case of a weighting function, one with many local maxima. In order most effectively to optimise the non-convex weighting function we use an approach similar to that of simulated annealing.

2.2. Simulated Annealing

The Markov chain based method of simulated annealing was proposed by Kirkpatrick et al. (1983) as a means to optimise a multi-modal objective function $U(x)$. It proceeds by defining a distribution over the function values

$$P(x) = \text{const}\; e^{-\lambda U(x)}$$

The aim is then to generate samples $x_i$ from this distribution, in the knowledge that as $\lambda \to \infty$, the probability mass concentrates on the minimum of $U$, and hence the samples $x_i$ will cluster around the minimum value state.

Samples from the distribution are generated in a straightforward fashion using the Metropolis-Hastings algorithm (Metropolis et al., 1953) which generates a Markov sequence of points whose distribution will
converge to $P$: a new candidate point $x'$ in a sequence is generated “at random”, and accepted with probability:

$$\min\left(1, \frac{P(x')}{P(x)}\right)$$

i.e. the candidate point is accepted if it improves $U$ or with probability $e^{-\lambda[U(x') - U(x)]}$.

Simply using a large value of $\lambda$ and generating a sequence starting at a random $x_0$ yields poor results if $U$ has isolated minima since the sequence can easily become trapped in a local mode of $P$ (e.g. the closest to $x_0$).

The annealing process is a heuristic for avoiding this. The initial value of $\lambda$ is set to be small (or in more physical language, the temperature, which is inversely proportional to $\lambda$, is initially high). This results in a broad distribution $P$ and hence allows free exploration of the search space. Samples are generated from this distribution, and then the value of $\lambda$ is increased. Samples are then generated from the new distribution starting from the final state of the previous sequence, and so on. Each increase of $\lambda$ successively excludes (in a probabilistic sense) regions that contain little of the probability mass of the distribution.

The set of values for $\lambda = \lambda_M \ldots \lambda_0$ is known as the annealing schedule. This schedule needs to be designed as a compromise between speed and efficacy: slow annealing is more likely to find a globally optimal solution, but is also prohibitively expensive.

The similarity with particle-based methods arises when we view this process as one of generating samples from a sequence of distributions, $P_{\lambda_M} \ldots P_{\lambda_0}$, where $P_{\lambda_m}(x) \propto P_{\lambda_0}(x)^{\beta_m}$, for $1 = \beta_0 > \beta_1 > \cdots > \beta_M$, and where $\beta_m = \lambda_m / \lambda_0$ (as described by Neal (2001) whose algorithm ours resembles). Moreover the algorithm exhibits exactly the kind of behaviour needed for our purposes: one wants to move towards the global maximum of the weighting function $w(Z_t, X)$, using the overall trend of the matching function as a guide, without becoming misguided by local maxima as seen in Fig. 1. The idea of annealing for optimisation is now adapted to perform a particle based stochastic search within the framework of an annealed particle filter.

Figure 1. Illustration of the annealed particle filter with $M = 1$. Even though a large number of particles are used (so that an equivalent number of weighting function evaluations are made as in Fig. 2), the search is misdirected by local maxima. From the resulting weighted set it is very hard to tell where the global maximum of $w_0$ lies.

3. Annealed Particle Filter

A series of weighting functions $w_0(Z, X)$ to $w_M(Z, X)$ is employed in which each $w_m$ differs only slightly from $w_{m-1}$ (see Fig. 2, where $M = 3$). The function $w_M$ is designed to be very broad, representing the overall trend of the search space while $w_0$ should be very peaked, emphasising local features. This is achieved by setting

$$w_m(Z, X) = w(Z, X)^{\beta_m}, \tag{3}$$

for $\beta_0 > \beta_1 > \cdots > \beta_M$, where $w(Z, X)$ is the original weighting function, as suggested by the discussion in Section 2.2. Because it is not the aim to sample from $w(Z, X)$, but only to find its maximum, it is not required that $\beta_0 = 1$.

A large $\beta_m$ produces a peaked weighting function $w_m$ resulting in a high rate of annealing. Small values of $\beta_m$ will have the opposite effect. If the rate of annealing is too high the influence of local maxima will distort the estimate of $X_t$ as seen in Fig. 1. If the rate is too low $X_t$ will not be determined with enough resolution (unless more layers are used, wasting computational resources). The manner in which the rate of annealing is influenced by the sequence $\beta_M, \ldots, \beta_0$ is discussed in Section 3.1.

One annealing run is performed at each time $t$ using image observations $Z_t$. The state of the tracker after each layer $m$ of an annealing run is represented by a set of $N$ weighted particles

$$S_{t,m}^{\pi} = \left\{ \left(s_{t,m}^{(0)}, \pi_{t,m}^{(0)}\right) \cdots \left(s_{t,m}^{(N)}, \pi_{t,m}^{(N)}\right) \right\}. \tag{4}$$

An unweighted set of particles will be denoted

$$S_{t,m} = \left\{ s_{t,m}^{(0)} \cdots s_{t,m}^{(N)} \right\}. \tag{5}$$

Each particle in the set $S_{t,m}^{\pi}$ is considered as an $(s_{t,m}^{(i)}, \pi_{t,m}^{(i)})$ pair in which $s_{t,m}^{(i)}$ is an instance of the multi-variate model configuration $X$, and $\pi_{t,m}^{(i)}$ is the
corresponding particle weighting. Each annealing run can be broken down as follows (the process is illustrated with a 1D schematic in Fig. 2 and with 2D data extracted from an actual human motion tracking experiment in Fig. 3).

Figure 2. Illustration of the annealed particle filter with $M = 3$. With a multi-layered search the sparse particle set is able to migrate gradually towards the global maximum without being distracted by local maxima. The final set $S_{t,0}^{\pi}$ provides a good indication of the weighting function’s global maximum.

Figure 3. Annealed particle filter in progress. The sets $S_{t,m}$ are plotted here, taken while tracking the walking person as seen in Fig. 9. Only the horizontal translation components $x_0$ and $x_1$ of the model configuration vector $X$ are shown. Starting with $S_{t-1,0}$ from the previous time step the particles are diffused to form $S_{t,9}$ which easily covers the expected range of translational movement of the subject. The particles are then slowly annealed over 10 layers (the sets $S_{t,6}$ to $S_{t,4}$ are omitted for brevity) to produce $S_{t,0}$ which is clustered around the maximum of the weighting function.

1. For every time step $t$ an annealing run is started at layer $M$, with $m = M$.
2. Each layer of an annealing run is initialised by a set of un-weighted particles $S_{t,m}$.
3. Each of these particles is then assigned a weight
   $$\pi_{t,m}^{(i)} \propto w_m\left(Z_t, s_{t,m}^{(i)}\right) \tag{6}$$
   which are normalised so that $\sum_{N} \pi_{t,m}^{(i)} = 1$. The set of weighted particles $S_{t,m}^{\pi}$ has now been formed.
4. $N$ particles are drawn randomly from $S_{t,m}^{\pi}$ with replacement and with a probability equal to their weighting $\pi_{t,m}^{(i)}$. As the $n$th particle $s_{t,m}^{(n)}$ is chosen it is used to produce the particle $s_{t,m-1}^{(n)}$ using
   $$s_{t,m-1}^{(n)} = s_{t,m}^{(n)} + B_m \tag{7}$$
   where $B_m$ is a multi-variate Gaussian random variable with covariance $P_m$ and mean 0.
5. The set $S_{t,m-1}$ has now been produced which can be used to initialise layer $m - 1$. The process is repeated until we arrive at the set $S_{t,0}^{\pi}$.
6. $S_{t,0}^{\pi}$ is used to estimate the optimal model configuration $X_t$ using
   $$X_t = \sum_{i=1}^{N} s_{t,0}^{(i)} \pi_{t,0}^{(i)}. \tag{8}$$
7. The set $S_{t+1,M}$ is then produced from $S_{t,0}^{\pi}$ using
   $$s_{t+1,M}^{(n)} = s_{t,0}^{(n)} + B_0. \tag{9}$$
   This set is then used to initialise layer $M$ of the next annealing run at time $t + 1$.

Note that Step 7, where the particle set for the next time-step is generated, incorporates no dynamic model. There is nothing in the algorithm that precludes the use of dynamics: simply replace Eq. (9) with the more general

$$s_{t+1,M}^{(n)} = f\left(s_{t,0}^{(n)}\right) + B_0 \tag{10}$$

where the function $f$ represents the dynamical model. We have not done so since our focus is on tracking previously unseen/unmodelled agile motions. While a dynamical model is certainly beneficial during “steady state” tracking, it can be a hindrance if the model is poor (as it often is for agile motions). The price we pay for this is a less economical use of particles than would be ideal, and the potential for jittery trajectories. The latter could be addressed by smoothing the recovered pose/joint trajectories.

3.1. Setting the Tracking Parameters

It remains to consider how best to set the free parameters of the algorithm, and in particular, to consider how to influence the annealing schedule. In Eq. (3) it is the value of $\beta_m^k$ that determines the rate of annealing at each layer.

To see how and why this is so, first note that the equivalent of temperature in our particle-based framework is the particle survival rate: the ratio of effective particles to total number of particles. If the probability mass is all concentrated in a few particles then the number of effective particles is low, and conversely, an even distribution of probability mass amongst particles signals a large number of effective particles. A sensible measure of the effective number of particles that will be chosen for propagation to the next layer is the survival diagnostic $D$ (taken from MacCormick (2000)) where

$$D = \left( \sum_{n=1}^{N} \left(\pi^{(n)}\right)^2 \right)^{-1} \tag{11}$$

and from this the particle survival rate $\alpha$ can be estimated (MacCormick, 2000)

$$\alpha = \frac{D}{N}. \tag{12}$$

In the case of traditional annealing, the temperature acts like a barrier, restricting the movement of samples: the cooler the temperature, the fewer the number of samples with a low function value $U(x)$ (energy) that will be generated. In the context of a particle set, a high survival rate corresponds to an evenly spread probability mass, while a low one suggests the mass is concentrated in a few particles. Hence decreasing the survival rate has the same effect as cooling the temperature in traditional annealing.

Now $D$ is clearly a monotonically decreasing function of $\beta_m^k$. At a given layer, we therefore adjust the value of $\beta_m^k$ to change the value of $D(\beta_m^k)$ so that $\alpha = D/N$ approaches a desired value. This is trivially done by searching over $\beta_m^k$ (using the value from the previous time step $\beta_m^{k-1}$ as the starting point) to find the value that solves the equation

$$\alpha_{\text{desired}} = \frac{D\left(\beta_m^k\right)}{N}$$

i.e. produces the desired rate of annealing.

Note that this does not mean the weights have to be completely re-evaluated each time $\beta_m^k$ is adjusted during the search. Since $w_m(Z, X) = w(Z, X)^{\beta_m}$ the values $w(Z, X = s_{t,m}^{(i)})$, $i = 1 \ldots N$, can be stored for each set $S_{k,m}$ and $\beta_m^k$ applied to each individual weight as appropriate to produce $S_{t,m}^{\pi}$.

How then are the appropriate values for $\alpha_0 \ldots \alpha_M$ determined? There are also a number of other tracking parameters that need to be set before tracking can begin, including the number of particles $N$, the number of annealing layers $M$ and the diffusion covariance matrices $P_M \ldots P_0$. A tentative framework has been developed to allocate values to these parameters although it is acknowledged that more work needs to be done in this area.

1. The first step is to decide on how many anneal-


ing layers are needed. It was found that doubling
the number of annealing layers reduces the number
of particles needed for successful tracking by more
than half. This will only work up to a point how-
ever as there seems to be a minimum number (N )
of particles needed for tracking no matter how many
layers are used. Using a 30 DOF model it was found
that setting M = 10 with N ≥ 200 worked well.
2. Each diagonal element in $P_0$ is allocated a value equal to half the maximum expected movement of the corresponding model configuration parameter over one time step. In this way the set $S_{t+1,M}$ should cover all possible movements of the subject between time t and t + 1. The amount of diffusion added to each successive annealing layer should decrease at the same rate as the resolution of the set $S_{t,m}$ increases. Our early experiments used

$$P_m = \alpha_M \times \cdots \times \alpha_{m-1} \times P_0 \qquad (13)$$

and produced decent results, but a better, adaptive method for setting the P is described in Section 6.1.
3. The appropriate rates of annealing $\alpha_M \ldots \alpha_1$ are influenced by the number of annealing layers used. With a higher number of annealing layers a lower rate of annealing can be used to obtain the desired resolution. It was found that while using 10 annealing layers setting $\alpha_M = \ldots = \alpha_1 = 0.5$ provided sufficient resolution of $X_t$.

4. Implementation

4.1. The Model

The articulated model of the human body used in this paper is built around the framework of a kinematic tree, as seen in Fig. 4. Each limb is fleshed out using cones with elliptical cross-sections. Such a model has a number of advantages including computational simplicity, high-level interpretation of output and compact representation.

Figure 4. The model is based on a kinematic tree consisting of 17 segments (a). Six degrees of freedom are given to base translation and rotation. The shoulder and hip joints are treated as sockets with 3 degrees of freedom, the clavicle joints are given 2 degrees of freedom (they are not allowed to rotate about their own axis and are assumed to be coupled) and the remaining joints are modelled as hinges requiring only one. This results in a model with 29 degrees of freedom and a configuration vector $X = \{x_1 \ldots x_{29}\}$. The model is fleshed out by cones with elliptical cross-sections (b).

4.2. The Weighting Function

The basic annealed particle filter is a general optimisation tool and can be used for a variety of purposes (for another application see Deutscher et al. (2002)) with different weighting functions. In the present work we have constructed the weighting function on the basis of two image features (edges and foreground silhouette) chosen for their joint virtues of simplicity (i.e. easy and efficient to extract), and a degree of invariance to imaging conditions. While these features are not fully general (in particular the silhouette relies on a knowledge of the background which may not be available in more general environments) they suffice for our purposes. Even without a large degree of background clutter distracting edge measurements, there remains a challenging, multi-modal search problem because of self occlusions and foreground clutter (i.e. unmodelled markings on the target). Other features such as optic flow could equally be used.

4.2.1. Edges. The strongest continuous edges produced by a human subject in an image usually provide a good outline of visible arms and legs and are mostly invariant to colour, clothing texture, lighting and pose. In severely cluttered environments or when the subject is wearing very baggy clothes edges may lose some of their usefulness, however in most situations they provide a good basis for a weighting function. A gradient based edge detection mask is used to detect edges. The result is thresholded to eliminate spurious edges, smoothed with a Gaussian mask and remapped between 0 and 1. This produces a pixel map (Fig. 5(b)) in which each pixel is assigned a value related to its proximity to an edge.
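The edge feature extraction just described (gradient mask, threshold, Gaussian smoothing, remapping to the range 0 to 1) can be sketched with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_kernel` and `edge_pixel_map`, the use of `np.gradient` as the gradient mask, and the `threshold` and `sigma` values are all hypothetical choices made for the example.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Return a 1-D Gaussian kernel normalised to sum to 1 (hypothetical helper)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def edge_pixel_map(image, threshold=0.1, sigma=2.0):
    """Gradient magnitude -> threshold -> Gaussian smoothing -> remap to [0, 1].

    threshold and sigma are assumed values, not taken from the paper.
    """
    gy, gx = np.gradient(image.astype(float))
    magnitude = np.hypot(gx, gy)
    # Discard weak (spurious) edge responses.
    edges = np.where(magnitude >= threshold, magnitude, 0.0)
    k = gaussian_kernel(sigma)
    # Separable Gaussian smoothing: filter along rows, then along columns.
    smoothed = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, edges)
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, smoothed)
    peak = smoothed.max()
    return smoothed / peak if peak > 0 else smoothed

# Synthetic test image: a vertical step edge between columns 15 and 16.
image = np.zeros((32, 32))
image[:, 16:] = 1.0
pmap = edge_pixel_map(image)
```

In the resulting map, values fall off smoothly with distance from the step edge, so a model contour that is slightly misplaced still receives partial credit.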

A sum-squared difference (SSD) function $\Sigma^{e}(X, Z)$ is then computed using

$$\Sigma^{e}(X, Z) = \frac{1}{N} \sum_{i=1}^{N} \left(1 - p_i^{e}(X, Z)\right)^2 \qquad (14)$$

where X is the model's configuration vector and Z is the image from which the pixel map is derived. $p_i^{e}(X, Z)$ are the values of the edge pixel map at the N sampling points taken along the model's silhouette as seen in Fig. 6(a).

Figure 5. Feature extraction. A gradient based edge detection mask is used to find edges. The result is thresholded to eliminate spurious edges and smoothed using a Gaussian mask to produce a pixel map (b) in which the value of each pixel is related to its proximity to an edge. The foreground is segmented using thresholded background subtraction to produce the pixel map (c) used in the weighting function.

4.2.2. Silhouette. The second feature extraction performed on the image is foreground-background segmentation. Thresholded background subtraction was used here to separate the subject from the background and typical results can be seen in Fig. 5(c). This may be inappropriate in some environments with a lot of background movement where more sophisticated methods may have to be employed. Most foreground segmentation techniques are largely invariant to clothing, lighting, pose motion and environment and as such provide an excellent image feature for a general human motion capture system. Once again a pixel map is constructed, this time with foreground pixels set to 1 and background to 0 (Fig. 5(c)), and an SSD is computed

$$\Sigma^{r}(X, Z) = \frac{1}{N} \sum_{i=1}^{N} \left(1 - p_i^{r}(X, Z)\right)^2 \qquad (15)$$

where $p_i^{r}(X, Z)$ are the values of the foreground pixel map at the N sampling points taken from the interior of the truncated cones as seen in Fig. 6(b).
To combine the edge and region measurements the two SSDs are added together and the result exponentiated to give

$$w(X, Z) = \exp -\left(\Sigma^{e}(X, Z) + \Sigma^{r}(X, Z)\right). \qquad (16)$$

An equal weighting to each component was determined empirically.
When there is more than one camera the measurements are combined in a similar way, giving

$$w(X, Z) = \exp -\sum_{i=1}^{C} \left(\Sigma_i^{e}(X, Z) + \Sigma_i^{r}(X, Z)\right) \qquad (17)$$

where C is the number of cameras and $\Sigma_i^{*}(X)$ is from camera i. An example of the output of this weighting function can be seen in Fig. 7.
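As a rough illustration of Eqs. (14) to (17), the weighting function can be written as a mean squared shortfall of the pixel map values at the sampling points, summed over features and cameras and then exponentiated. The sketch below assumes the sampling points have already been obtained by projecting the model into each image and are supplied as integer (row, column) arrays; the names `ssd` and `weight` and the tuple layout of `views` are hypothetical.

```python
import numpy as np

def ssd(pixel_map, points):
    """Eqs. (14)/(15): mean of (1 - p_i)^2 over the N sampling points.

    points is an (N, 2) integer array of (row, col) image coordinates.
    """
    p = pixel_map[points[:, 0], points[:, 1]]
    return np.mean((1.0 - p) ** 2)

def weight(views):
    """Eq. (17): exp of minus the summed edge and region SSDs over all cameras.

    views is a list with one (edge_map, edge_points, fg_map, region_points)
    tuple per camera; with a single camera this reduces to Eq. (16).
    """
    total = sum(ssd(e, ep) + ssd(f, rp) for e, ep, f, rp in views)
    return np.exp(-total)

# Toy 4x4 pixel maps: an edge along row 1 and foreground in rows 2-3.
edge_map = np.zeros((4, 4))
edge_map[1, :] = 1.0
fg_map = np.zeros((4, 4))
fg_map[2:, :] = 1.0

on_edge = np.array([[1, 0], [1, 3]])      # sampling points lying on the edge
inside = np.array([[2, 1], [3, 2]])       # sampling points inside the silhouette
off_target = np.array([[0, 0], [0, 3]])   # sampling points that miss both features

good = weight([(edge_map, on_edge, fg_map, inside)])
bad = weight([(edge_map, off_target, fg_map, off_target)])
```

A configuration whose sampling points all land on edges and foreground gets the maximum weight of 1, while one that misses both features entirely gets exp(-2) per camera.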

Figure 6. Configurations of the pixel map sampling points $p_i(X, Z)$ for the edge based measurements (a) and the foreground segmentation measurements (b). The sampling points for the edge measurements are located along the occluding contours of the model's truncated cones that have been projected into the image. The sampling points for the foreground segmentation measurements are taken from a grid within these occluding contours.

5. Results

Two examples illustrate the system: in the first a subject walks in a circle as seen in Fig. 9; in the second the subject steps over a box, turns 180° on the spot before stepping over it again as seen in Fig. 10.
Three cameras were used to capture the motion and all three views can be seen in the corresponding figures. The same tracking parameters were used in all three
194 Deutscher and Reid

sequences, which demonstrate the tracker's ability to follow a wide range of human movement.

Figure 7. Example output of the weighting function obtained by varying only component $x_{15}$ of X (the right knee angle) using the image and model configuration seen in (a). The function is highly peaked around the correct angle of −0.7 radians (b).

Figure 8. A comparison of condensation with the annealed particle filter. At top the results of tracking with 4000 particles using standard condensation can be seen. Tracking gradually deteriorates until terminal failure after 1.2 seconds. Experiments with 40000 particles were carried out taking over 30 hours to process just 4 seconds of video, still with negative results. An annealed search using 4000 particles with one layer fares little better (middle), also suffering terminal failure after 1.2 seconds. An annealed search using 400 particles and 10 layers (i.e., 4000 weighting function evaluations per frame) tracks very well.

Figure 9. Walking in a circle. Using three cameras (arranged here from top to bottom) a person is tracked over 4 seconds while walking in a circle. The tracker maintains an accurate lock throughout. 10 annealing layers were used with 200 particles for this sequence.

A comparison of the annealed particle filter with standard Condensation can be seen in Fig. 8. Direct comparison is complicated by the fact that in Annealed Particle Filtering we use a simplified weighting function rather than a "correct" likelihood taking expected clutter into account (such as is derived in Blake and Isard (1998)). For this experiment the likelihood for Condensation comprised the edge based likelihood of Blake and Isard (1998), fused with a silhouette observation. The pose shown in each frame is the sample mean of the particle set. The one layer annealed search represents a similar experiment. It differs from Condensation in using the simplified weighting function (exactly the same as for the full Annealed Particle Filter experiment), and in propagating only the mode of the distribution between frames. The former difference accounts, remarkably, for a four-fold increase in speed

of execution. The final part of the experiment shows tracking using the full Annealed Particle Filter.
Each algorithm used a total of 4000 likelihood evaluations. In the final case this was divided as 400 particles over 10 layers. It was found in practice that good results on this sequence could be achieved with as few as 100 particles. While not being a strictly fair comparison between Condensation and the Annealed Particle Filter, the experiment gives an indication of the improved tracking performance of the APF given equivalent computational resources.

Figure 10. Stepping over a box. Using three cameras (arranged here from top to bottom) a person is tracked over 5 seconds while stepping over a box, turning around and stepping over the box again. The tracker maintains an accurate lock throughout. 10 annealing layers were used with 200 particles for this sequence.

6. Algorithm Extensions: Hierarchical Search

The Annealed particle filter (APF), introduced in the previous sections, directly addresses the problem of searching high-dimensional configuration spaces, and has been demonstrated to be an effective and robust tracking framework for Human Motion Capture. However it remains a computationally intensive technique. The promise of further improvements is held out by the fact that the model is structured as a kinematic tree.
One way to reduce the effective volume of the configuration space is to perform a hierarchical search. If one part of an articulated model can be localised independently then it can be used as a constraint for restricting the rest of the search. This straightforward idea has been applied in a heuristic fashion by (among others) Gavrila and Davis (1996), who localised the torso using colour cues and used this information to constrain the search for the limbs, and more recently by Mikic et al. (2001), who first locate the head in order to limit their subsequent search. Although this approach is usually sound, without the assistance of colour cues (or other labelling cues) it is often very hard to localise specific body parts independently and reliably in realistic scenarios. Furthermore, failure of the first heuristic search can easily lead to catastrophic, unrecoverable failure.
A more formal approach to hierarchical search was proposed by MacCormick and Isard (2000). That work applied partitioned sampling to tracking articulated objects, but crucially assumed that the configuration space can be sliced so that one can construct an observation density for each dimension of the model configuration vector, which effectively assumes that it is possible to localise separate parts of an articulated model independently.
When using all but the simplest kinematic models, the optimal partitioning may not be obvious and it may indeed change over time as the degree of interaction between different segments of a model changes, such as when the legs cross during walking. Rather than impose a hierarchy on the search, we seek instead to develop a method for soft or fuzzy partitioning in which there is no need to commit to a particular hierarchy. Cham and Rehg (1999) capture this spirit in describing a search which is sequential in the degrees of freedom of the body. Their crucial innovation is to allow the order to be flexible: the search for body parts is conducted on a "best"-first basis, where best is defined as the component which can be found with minimum effort, usually meaning minimum variance.
While motivated by similar desires, our solution is rather different from theirs. Our approach improves upon and extends the APF in two ways. First we introduce a means to make the diffusion step in the APF adaptive, so that effort is not wasted in those places where the algorithm is already confident of doing well, and is concentrated on localising parts whose location

is uncertain. The effect of this can be interpreted as a hierarchical search strategy which automatically partitions the search space in a soft way, without any explicit representation of the partitions (Section 6.1). Second, we introduce a crossover operator (similar to that found in Genetic Algorithms) which improves the ability of the tracker to search different partitions in parallel (Section 6.2).
We present results for simple examples to demonstrate the new algorithm's implementation and effectiveness, and show that these measures together increase the tracker efficiency by a factor of 4 and increase agility of the motion that can be tracked.
We apply the tracker to the complex problem of Human Motion Capture with 34 degrees of freedom. Extra degrees of freedom have been added to the model in Fig. 4 in the back (2) to allow arching that would not normally be encountered in everyday walking (and was not necessary in our earlier experiments), in the neck (1) to account for head nodding, and the clavicles are given independent motion (2 each).

6.1. Adaptive Diffusion and Hierarchical Partitioning

Consider the simple task of tracking an articulated arm as seen in Fig. 11. The arm consists of four segments, each joined by a swivelling joint with one end rooted on the spot. A configuration of the arm is described by an instance of the state variable $x = (x_1, x_2, x_3, x_4)$. The weighting function w(z, x) required for the APF is computed by a Sum of Squared Differences (SSD) measure between a model template and a silhouette image (the regional correlation portion of the observation model in Eq. (15)).
The set $S_{t,m}$ is initialised with particles uniformly distributed over a range of x that we know to contain the actual position of the arm. This results in a large and similar variance for each parameter of x over all the particles in $S_{t,m}$ as can be seen in Fig. 12(a).

Figure 11. A planar articulated arm with 4 DOF is shown (a). It consists of four links connected by swivelling joints and rooted at O. The configuration of the arm is described by $x = (x_1, x_2, x_3, x_4)$ as seen in (b).

Figure 12. Parameter variance over one annealing layer: new APF vs. old APF. On the left graphs a, b and c plot the variance of each parameter of $x = (x_1, x_2, x_3, x_4)$ through the first annealing run of the APF when tracking the articulated arm seen in Fig. 11. Graphs d, e and f show the same information for the improved APF as described in Section 6.1. Graphs a and d show the variances of the initial set $S_{t,m}$, displaying equal variances for each parameter. Graphs b and e show the variances of the set $S_{t,m-1}$ before the addition of diffusion noise. Note that in both b and e, $x_1$ has a very small variance indicating advanced localisation, however the variance of $x_2$, $x_3$ and $x_4$ has been reduced only a little. Up to this point the algorithms are the same and any differences between b and e are random. After the addition of noise in the original APF the localisation of $x_1$ has been greatly degraded as seen in graph c, however when noise is added in proportion to each parameter's variance the localisation of $x_1$ is preserved as seen in graph f.

After calculating a weight $\pi_{t,m}^{(i)}$ for each particle using $w_m(z_t, s_{t,m}^{(i)})$ we then proceed to Step 4 of the APF and draw N particles from $S_{t,m}^{\pi}$ with replacement and probability proportional to each particle's weight.
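The selection step described here, drawing N particles with replacement and with probability proportional to weight, can be sketched as follows. This is a generic weighted resampling routine written for illustration; the function name `resample` and the toy particle values are assumptions, and the weights are taken as already computed.

```python
import numpy as np

def resample(particles, weights, rng):
    """Draw len(particles) rows with replacement, with probability
    proportional to each particle's weight."""
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(particles), size=len(particles), replace=True, p=probs)
    return particles[idx]

rng = np.random.default_rng(0)
# Four toy particles in a 2-parameter state space.
particles = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
# Degenerate weights: only the third particle has any support.
weights = np.array([0.0, 0.0, 1.0, 0.0])
selected = resample(particles, weights, rng)
```

With these degenerate weights every selected particle is a copy of the third one, which is the collapse of variance that the subsequent diffusion step is meant to counteract.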
Consider the set St,m so produced before the addition
of any noise. In a typical annealing run the individual
parameters of each particle were found to have variance
as detailed in Fig. 12(b). Note here that the variance of
x1 has been greatly reduced while the other parame-
ters x2 , x3 and x4 have been hardly reduced at all. The
variance of any parameter can be considered (with a
number of acceptable caveats) to be directly related to
the degree to which the optimal value for that parame-
ter has been localised. Figure 12(b) shows that x1 has
been localised down to a very small area of its range
simply because it dominates the topology of the search
space whereas each particle’s values for x2 , x3 and x4
had very little influence on whether it was selected or
not. In effect we see here an automatic partitioning of
the state space into soft partitions according to each parameter's topological dominance.
The weakness of the original APF (indeed any particle filter) arises with the addition of diffusion noise to each particle upon selection. According to Eqs. (7) and (13) an equal amount of noise should be added to each parameter. This results in a parameter variance profile like that seen in Fig. 12(c) with the localisation of $x_1$ seen in Fig. 12(b) all but wiped out by the excessive addition of noise.
If instead the amount of randomness added to the parameters of each selected particle is proportional to the variance of that parameter over the entire set of particles, these gains will be protected from disruption. Instead we will arrive at the situation seen in Fig. 12(f) where enough noise has been added to each parameter to allow the thorough diffusion of the particles into the spaces between repeatedly selected particles, but not enough to increase the variance of any given parameter which would erase any localisation gains made up to that point.
If this new method for determining the elements of $P_i$ (the covariance matrix of B from Eq. (7)), is continued through all the annealing layers we can see that each parameter is localised in turn, with some degree of overlap as seen in Fig. 13. This can be compared to the pattern of variance reduction for the original APF algorithm seen in Fig. 14. This is exactly the kind of hierarchical soft partitioning that was desired and no explicit partition boundaries or functions were required.

Figure 13. Variance reduction with the improved APF. Here we see the orderly reduction of each of the four parameters' variances from most dominant ($x_1$) to least dominant ($x_4$) over 6 layers of the annealing process while tracking the simple articulated arm. Using the improved APF results in a 2-fold increase in efficiency over the classical APF. Tracker efficiency was measured by the minimum number of particles needed to successfully track the articulated arm over 40 frames.

Sminchisescu and Triggs (2001) independently arrived at a very similar idea, although in that work they were concerned with the most effective use of particles between frames in order to recover from "ambiguous" poses.
The changes to the APF are almost trivial, and can be formalised as follows. Step 4 of the APF algorithm described in Section 3 is amended so that at layer m, $P_m$ is set to be proportional to the covariance of the particles in $S_{t,m}$ as it exists before the addition of noise, i.e.

$$P_m \propto \frac{1}{N} \sum_{i=1}^{N} \left(s_{t,m}^{(i)} - s_{t,m}^{av}\right) \cdot \left(s_{t,m}^{(i)} - s_{t,m}^{av}\right)^{T}. \qquad (18)$$

where $s_{t,m}^{av}$ is the sample mean of the particle set.
Using this modification enabled successful tracking with the APF with fewer than half the number of particles; i.e. a 2-fold increase in efficiency.
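The adaptive diffusion rule of Eq. (18) can be sketched as follows: compute the sample covariance of the selected particle set, then add zero-mean Gaussian noise whose covariance is proportional to it. The function name `adaptive_diffuse` and the proportionality constant `scale` are assumptions made for this example (the text only states proportionality), and the noise is drawn through an eigen-decomposition purely as an implementation convenience.

```python
import numpy as np

def adaptive_diffuse(particles, rng, scale=1.0):
    """Add Gaussian noise with covariance scale * cov(particles), cf. Eq. (18).

    scale is a hypothetical proportionality constant.
    """
    n = len(particles)
    centred = particles - particles.mean(axis=0)
    # (1/N) * sum_i (s_i - s_av)(s_i - s_av)^T
    cov = centred.T @ centred / n
    # Sample via eigen-decomposition so a rank-deficient cov is still handled.
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)
    z = rng.standard_normal(particles.shape)
    noise = (z * np.sqrt(scale * eigvals)) @ eigvecs.T
    return particles + noise, cov

rng = np.random.default_rng(0)
# Toy particle set: parameter 0 is already well localised, parameter 1 is not.
particles = np.column_stack([
    rng.normal(0.0, 0.01, 2000),
    rng.normal(0.0, 1.0, 2000),
])
diffused, cov = adaptive_diffuse(particles, rng)
added = diffused - particles
```

In this toy set the noise added to the well-localised parameter is roughly a hundred times smaller than the noise added to the poorly localised one, which is the behaviour contrasted in Fig. 12(c) and Fig. 12(f).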

Figure 14. Variance reduction with the conventional APF. The even reduction in variance over 6 layers of the annealing process is shown in contrast to Fig. 13. There is little evidence of hierarchical partitioning and more annealing layers will be required to find the optimal configuration.

6.2. A Crossover Operator and Parallel Partitions

Now consider the articulated object found in Fig. 15 which consists of two articulated arms joined at a stationary hinge. This configuration is a much simplified version of that found in Human Motion Capture when using a model with arms and legs.

Figure 15. A pair of planar articulated arms consisting of 3 segments each and each rooted to point O (as seen in b) are used to demonstrate the effectiveness of the crossover operator. The configuration of the arms is described by $x = (x_1, \ldots, x_6)$ as seen in (b).

The soft hierarchical partitioning described in Section 6.1 provides some increase in efficiency over conventional APF when applied to tracking this assembly, localising $x_1$ and $x_4$ together, then $x_2$ and $x_5$ and finally $x_3$ and $x_6$. However if we were to decouple the search space and localise each arm independently the computational effort required for tracking would be reduced considerably.
One possibility, of course, would be to introduce a hard partition between the two arms and conduct two separate searches. However, in keeping with our philosophy of adaptive partitioning, we seek to avoid commitment to specific partitions.
Many people comment on the similarity between particle filters and Genetic Algorithms. Both employ a set (population) of particles (individuals) coded by a state vector (genetic sequence) from which the best particles (individuals) are chosen to be propagated to the next time-step (generation) in the hope of finding the maximum of some function (fittest possible individual).
One glaring difference between GA's and a typical particle filter is the lack of a crossover operator in the particle filter which in a conventional GA is meant to simulate the breeding of individuals and the sharing of genetic information. The use of the crossover operator encourages the survival of short, highly fit sections of the parameter space known in some GA literature as building blocks. This is done in the hope that when highly fit building blocks are brought together they will have a good chance of forming a very fit complete individual. These building blocks are effectively optimised in parallel without any specification of their boundaries or appropriate building block (partition)

weighting functions, exactly the kind of behaviour we are looking for.
We now describe how to incorporate the crossover operator into the framework of the APF and examine the effect via a simple example.

Figure 16. The crossover operator in action. The Sum of Squared Differences (SSD) match between model and image obtained after a set number of annealing layers is plotted against the percentage of particles generated using the crossover operator at each annealing layer. Graph (a) shows the result for the articulated arm seen in Fig. 11 where no benefit to using the crossover operator is seen although importantly no degradation in performance is seen either (i.e. the SSD does not increase). Graph (b) shows the result for the articulated arms seen in Fig. 15 where a steady improvement in tracking performance is seen when increasing the percentage of particles produced using the crossover operator. This shows that the crossover operator is able to decouple sections of the search space effectively and enables the APF to search them in parallel, improving tracker performance.

6.2.1. Inclusion of the Crossover Operator in the APF. The inclusion of the crossover operator can be formalised as follows. In Step 4 of the APF (as described in Section 3) at annealing layer m, the ith particle of $S_{t,m-1}$ is created by drawing two particles from $S_{t,m}^{\pi}$ with probability proportional to their respective weights. Two parameter indices $\gamma$ and $\epsilon$ are chosen randomly and the two selected particles $s_{t,m}^{(a)} = (x_1^{a} \ldots x_L^{a})$ and $s_{t,m}^{(b)} = (x_1^{b} \ldots x_L^{b})$ are combined to form the new particle $s_{t,m-1}^{(i)}$ where

$$s_{t,m-1}^{(i)} = \left(x_1^{a}, \ldots, x_\gamma^{a}, x_{\gamma+1}^{b}, \ldots, x_\epsilon^{b}, x_{\epsilon+1}^{a}, \ldots, x_L^{a}\right). \qquad (19)$$

Noise is then added to each particle as detailed in Section 6.1.
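The crossover step can be sketched as a two-point crossover on the configuration vectors: two parents are drawn in proportion to their weights, two cut indices are chosen at random, and the child takes the segment between the cuts from the second parent and the rest from the first. The function names `crossover` and `breed`, the uniform choice of cut indices, and the toy values are assumptions for illustration; the diffusion noise that follows is omitted here.

```python
import numpy as np

def crossover(parent_a, parent_b, rng):
    """Two-point crossover: child = a[:lo] + b[lo:hi] + a[hi:]."""
    length = len(parent_a)
    # Two random cut positions, sorted so that lo <= hi.
    lo, hi = np.sort(rng.integers(0, length + 1, size=2))
    child = parent_a.copy()
    child[lo:hi] = parent_b[lo:hi]
    return child

def breed(particles, weights, rng):
    """Draw two parents with probability proportional to weight and cross them."""
    probs = weights / weights.sum()
    a, b = rng.choice(len(particles), size=2, p=probs)
    return crossover(particles[a], particles[b], rng)

rng = np.random.default_rng(1)
# Parents chosen so the origin of every child element is visible.
parent_a = np.zeros(6)
parent_b = np.ones(6)
child = crossover(parent_a, parent_b, rng)

particles = np.arange(12.0).reshape(2, 6)
weights = np.array([0.5, 0.5])
child2 = breed(particles, weights, rng)
```

Because the segment taken from the second parent is contiguous, parameters that are adjacent in the configuration vector (for example those of one arm) tend to be inherited together.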

6.2.2. Testing the Crossover Operator. To assess the benefit of the crossover operator two articulated objects were tracked: the first (Fig. 11), used in the experiment from Section 6.1, is an un-branched articulated arm; the second, as seen in Fig. 15, is two articulated arms rooted to the same position.
As seen in Fig. 16, the object consisting of branched arms was more effectively localised by the APF that employed the crossover operator whereas there was no difference when it was applied to the non-branched object. A good graphical illustration of what the crossover operator is actually doing (i.e. partitioning sections of the search space which can be tracked in parallel) is evident in Figs. 17 and 18 where the parameters localised best first are those closest to the root of the tree.
A good indication of the increased speed provided by the crossover operator when tracking branched objects is again the number of particles needed for successful tracking. This number was reduced by a factor of 2 with the introduction of the crossover operator.

Figure 17. Variance reduction for the parallel arms. When the APF with crossover operator is applied to the articulated arms seen in Fig. 15 we get the pattern of variance reduction seen above. The graphs show the parameters describing each arm ($x_1, x_2, x_3$ and $x_4, x_5, x_6$) being localised in order of decreasing topological dominance, from the fixed point of the articulated arms, progressing outward.

6.3. Results for Full-Body Tracking

Although less clear-cut than the results for the "toy" example in the previous section, Figs. 19 and 20

show a similar process of variance reduction when the PAPF with crossover is applied to full-body tracking.
The algorithm was applied to a variety of challenging sequences of human movement including walking with turns (Fig. 21), running around in a random fashion (Fig. 22) and handstands (Fig. 23). The sequences for these experiments were generated using three evenly spaced cameras, calibrated and hardware synchronised.
We define successful tracking qualitatively as occurring when the algorithm locks onto the body and limbs for the duration of the sequence, returning sensible values (i.e. ones that can be used for re-animation, for example) for the pose and articulation parameters. Our tests measured the number of particles needed to achieve such successful tracking. This number represents a sensible measure of algorithm speed since the number of likelihood evaluations dominates the processing time.

Figure 18. Particle distribution for the branched articulated arm over 8 annealing layers. The entire set of particles is drawn at each annealing layer. The hierarchical localisation of each model segment from the hinge joint outwards is clearly seen.

Figure 19. Annealed particle filter particle variance for a full-body model. The difference in rates of variance reduction for each parameter can clearly be seen. As expected a more complicated pattern of reduction than that seen for the simple articulated arm is evident.

Figure 20. Particle distribution for a full-body model. The entire set of particles is drawn at each annealing layer for one frame. The hierarchical reduction of each parameter from torso rotation and translation out to the limb joint angles is evident.

Figure 21. Tracking a walking person.

We observed an improvement by a factor of 4 when comparing the new PAPF to the original APF (i.e. successful tracking achieved with one quarter the number of particles). As a result the PAPF required on average 15 seconds to process one frame whereas the APF required around 60 seconds when run on a single processor 1 GHz pIII Linux box.

Figure 22. Tracking a running person.

We have also built a parallel implementation, in which particles are farmed out to independent processors which compute the weight/likelihood function. This achieves the sort of speed-ups that are to be expected, with processing time decreasing linearly in the number of processors (with a constant of proportionality around 0.8).

Figure 23. Tracking a person performing a hand-stand.

7. Discussion and Conclusion

We have developed a general algorithm for searching large configuration spaces which is more efficient than traditional particle filters, but which retains a number of their significant advantages. The algorithm has been applied to the problem of visually tracking a person in multiple cameras. In this context we have demonstrated

reliable tracking of complex human motion using simple image features, and without the need for a strong dynamic model of the motion.
We have also introduced two novel improvements to the algorithm, soft hierarchical partitioning, and a crossover operator, which have the combined effect of improving performance and increasing efficiency.
The results, especially Figs. 21–23, show a robustness of tracking human motion achieved by very few other algorithms. Of particular note in the sequences shown are the points where the subject turns rapidly on the spot (shown in both Figs. 21 and 22), and the unusual and rapid motion of a handstand.
Our primary effort has been concentrated on the search technique. It seems clear that improvements in the modelling process, such as published in Plänkers and Fua (2003), and in dynamic modelling (Sidenbladh et al., 2000), would improve tracking reliability and applicability further.
Though in the experiments shown the background lacks a large degree of clutter (but is not entirely clean either), tracking agile motions, even with multiple cameras, remains a difficult problem. We have performed experiments with other sequences with a greater amount of clutter with similar results, but the exact degree of clutter that can be tolerated is an open question. No doubt the use of background subtraction to obtain silhouette information assists in this significantly. The algorithm exhibits some robustness to errors in this data, but in cases where poor contrast results in poor silhouettes and a lack of edges we have observed tracker failure.
Our results to date have made use of 3 cameras, and tracking using a single camera raises issues with regard to ambiguity. Experiments with using the APF monocularly (Lyons, 2002) suggest that in the monocular case further sophistication in the placement of particles is required to overcome the inherent ambiguities and avoid all associated local minima. Some progress in this respect has been made recently by Sminchisescu and Triggs (2002, 2003).

Acknowledgments

This work was supported by Vicon Systems Ltd. and EPSRC grant GR/M15262. We would also like to thank Andrew Blake, Ben North, Andrew Davison and David Murray, for many useful discussions and the anonymous referees for insightful comments.

Note

1. Note, for example, that although (Blake and Isard, 1998) derives the full multi-modal likelihood model for edge-normal observations in the presence of clutter, the implementation makes a much simplified assumption of a unimodal likelihood for each individual observation.

References

Blake, A. and Isard, M. 1998. Active Contours. Springer.
Cham, T.-J. and Rehg, J.M. 1999. Dynamic feature ordering for efficient registration. In Proc. 7th Int'l Conf. on Computer Vision, Corfu, vol. 2, pp. 1084–1091.
Cham, T.-J. and Rehg, J.M. 1999. A multiple hypothesis approach to figure tracking. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 239–245.
Deutscher, J., Blake, A., North, B., and Bascle, B. 1999. Tracking through singularities and discontinuities by random sampling. In Proc. 7th Int. Conf. on Computer Vision, vol. 2, pp. 1144–1149.
Deutscher, J., Blake, A., and Reid, I.D. 2000. Articulated body motion capture by annealed particle filtering. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 126–133.
Deutscher, J., Davison, A.J., and Reid, I.D. 2001. Automatic partitioning of high dimensional search spaces associated with articulated body motion capture. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2, pp. 669–676.
Deutscher, J., Isard, M., and MacCormick, J. 2002. Automatic camera calibration from a single manhattan image. In Proc. 7th European Conf. on Computer Vision, Copenhagen, vol. 4, pp. 175–188.
Drummond, T. and Cipolla, R. 2001. Real-time tracking of highly articulated structures in the presence of noisy measurements. In Proc. 8th Int'l Conf. on Computer Vision, Vancouver, pp. 315–320.
Gavrila, D. and Davis, L.S. 1996. 3d model-based tracking of humans in action: A multi-view approach. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 73–80.
Harris, C.G. 1992. Tracking with rigid models. In Active Vision. A. Blake and A. Yuille (Eds.), MIT Press: Cambridge, MA.
Isard, M.A. and Blake, A. 1996. Visual tracking by stochastic propagation of conditional density. In Proc. 4th European Conf. on Computer Vision, Cambridge, England, pp. 343–356.
Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. 1983. Optimisation by simulated annealing. Science, 220(4598):671–680.
Lyons, D. 2002. A qualitative approach to computer sign language recognition. Master's thesis, University of Oxford.
MacCormick, J. 2000. Probabilistic models and stochastic algorithms for visual tracking. PhD thesis, University of Oxford.
MacCormick, J. and Blake, A. 1999. A probabilistic exclusion principle for tracking multiple objects. In Proc. 7th Int. Conf. on Computer Vision, vol. 1, pp. 572–578.
MacCormick, J. and Isard, M. 2000. Partitioned sampling, articulated objects and interface-quality hand tracking. In Proc. 6th European Conf. on Computer Vision, Dublin, vol. 2, pp. 3–19.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E. 1953. Equations of state calculations by fast computing machine. J. Chem. Phys., 21:1087–1091.
Mikic, I., Trivedi, M., Hunter, E., and Cosman, P. 2001. Articulated body posture estimation from multi-camera voxel data. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 455–462.
Neal, R.M. 2001. Annealed importance sampling. Statistics and Computing, (11):125–139.
Plänkers, R. and Fua, P. 2003. Articulated soft objects for multi-view shape and motion capture. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1182–1187.
Sidenbladh, H., Black, M.J., and Fleet, D.J. 2000. Stochastic tracking of 3D human figures using 2D image motion. In Proc. 6th European Conf. on Computer Vision, Dublin, vol. 2, pp. 702–718.
Sminchisescu, C. and Triggs, B. 2001. Covariance scaled sampling for monocular 3d body tracking. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 447–454.
Sminchisescu, C. and Triggs, B. 2002. Hyperdynamics importance sampling. In Proc. 7th European Conf. on Computer Vision, Copenhagen, vol. 1, pp. 769–783.
Sminchisescu, C. and Triggs, B. 2003. Kinematic jump processes for monocular 3d human tracking. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 69–76.
Sullivan, J., Blake, A., Isard, M., and MacCormick, J. 1999. Object localization by bayesian correlation. In Proc. 7th Int. Conf. on Computer Vision, vol. 2, pp. 1068–1075.
Sidenbladh, H., Black, M.J., and Sigal, L. 2002. Implicit probabilistic
models of human motion for synthesis and tracking. In Proc. 7th Wachter, S. and Nagel, H. 1999. Tracking persons in monocular
European Conf. on Computer Vision, Copenhagen, vol. 1, pp. 784– image sequences. Computer Vision and Image Understanding,
800. 74(3):174–192.
