Space Time Actions (ICCV 2005)
Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, Ronen Basri
DOI: 10.1109/ICCV.2005.28
In this paper we generalize a method developed for analysis of 2D shapes [9] to deal with volumetric space-time shapes induced by human actions. This method exploits the solution to the Poisson equation to extract various shape properties that are utilized for shape representation and classification. We adopted some of the relevant properties and extend them to deal with space-time shapes (Sec. 2.1). The spatial and temporal domains are different in nature and therefore are treated differently at several stages of our method. The additional time domain gives rise to new space-time shape entities that do not exist in the spatial domain, such as a space-time “stick”, “plate” and “ball”. Each such type has different informative properties that characterize every space-time point. In addition, we extract space-time saliency at every point, which detects fast moving protruding parts of an action (Sec. 2.2).

Unlike images, where extraction of a silhouette might be a difficult segmentation problem, the extraction of a space-time shape from a video sequence can be simple in many scenarios. In video surveillance with a fixed camera, as well as in various other settings, the appearance of the background is known. In these cases, using a simple change detection algorithm usually leads to satisfactory space-time shapes.

Our method is fast and does not require prior video alignment. We demonstrate the robustness of our approach to partial occlusions, non-rigid deformations, imperfections in the extracted silhouettes and high irregularities in the performance of an action. Finally, we report the performance of our approach in the tasks of action recognition, clustering and action detection in a low quality video (Sec. 3).

2. Representing Actions as Space-Time Shapes

Below we generalize the approach in [9] to deal with volumetric space-time shapes.

2.1. The Poisson Equation and its Properties

Consider an action and its space-time shape S surrounded by a simple, closed surface. We assign every internal space-time point a value reflecting its relative position within the space-time shape. This is done by assigning each space-time point the mean time required for a particle undergoing a random-walk process starting from the point to hit the boundaries. This measure can be computed [9] by solving a Poisson equation of the form

∆U(x, y, t) = −1,   (1)

with (x, y, t) ∈ S, where the Laplacian of U is defined as ∆U = Uxx + Uyy + Utt, subject to the Dirichlet boundary conditions U(x, y, t) = 0 at the bounding surface ∂S. In order to cope with the artificial boundary at the first and last frames of the video, we impose the Neumann boundary conditions requiring Ut = 0 at those frames [19]. The induced effect is of a “mirror” in time that prevents attenuation of the solution towards the first and last frames.

Note that space and time units may have different extents, thus when discretizing the Poisson equation we utilize a space-time grid with different meshsizes in space and in time. This affects the distribution of local orientations and saliency features across the space-time shape, and thus allows us to emphasize different aspects of actions. We found that a discretization scheme that uses space-time units with spatial extent twice as long as the temporal one (the distance between frames) works best for most of the human actions that we collected.

Figure 2. The solution to the Poisson equation on space-time shapes of “jumping-jack”, “walking” and “running” actions. The values are encoded by the color spectrum from blue (low values) to red (high values). Note that points at the boundary attain zero values (Dirichlet boundary conditions).

Fig. 2 shows a spatial cross-cut of the solution to the Poisson equation obtained for several space-time shapes in Fig. 1. The level sets of U represent smoother versions of the bounding surface, with the external protrusions (fast moving limbs) disappearing already at relatively low values of U. Below we generalize the analysis in [9] to characterize actions as space-time shapes using measures that estimate locally the second order moments of a shape near any given point.

Consider first a space-time shape given by a conic, i.e., composed of the points (x, y, t) satisfying

P(x, y, t) = ax² + by² + ct² + dxy + eyt + fxt + g ≤ 0.   (2)

In this case the solution to the Poisson equation takes the form

U(x, y, t) = −P(x, y, t) / (2(a + b + c)).   (3)

The isosurfaces of U then contain a nested collection of scaled versions of the conic boundary, where the value of U increases quadratically as we approach the center. If we now consider the Hessian matrix of U we obtain at any given point exactly the same matrix, namely

H(x, y, t) = − 1/(a + b + c) · ⎛  a   d/2  f/2 ⎞
                              ⎜ d/2   b   e/2 ⎟    (4)
                              ⎝ f/2  e/2   c  ⎠

This matrix is in fact the second moment matrix of the entire 3D conic shape, scaled by a constant. The eigenvectors and eigenvalues of H then reveal the orientation of the shape and its aspect ratios.
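The closed-form solution of Eqs. (2)-(4) can be verified mechanically. The following symbolic check (our illustration, using sympy; not part of the paper) confirms that U of Eq. (3) satisfies the Poisson equation (1), vanishes on the conic boundary, and has the constant Hessian of Eq. (4):

    import sympy as sp

    x, y, t = sp.symbols('x y t')
    a, b, c, d, e, f, g = sp.symbols('a b c d e f g')

    # Conic space-time shape of Eq. (2): boundary is P(x, y, t) = 0
    P = a*x**2 + b*y**2 + c*t**2 + d*x*y + e*y*t + f*x*t + g

    # Candidate solution of Eq. (3); U = 0 wherever P = 0 (Dirichlet)
    U = -P / (2*(a + b + c))

    # U satisfies the Poisson equation (1): Uxx + Uyy + Utt = -1
    laplacian = sum(sp.diff(U, v, 2) for v in (x, y, t))
    assert sp.simplify(laplacian + 1) == 0

    # The Hessian of U is the constant matrix of Eq. (4)
    H = sp.hessian(U, (x, y, t))
    print(sp.simplify(H * (a + b + c)))
    # -> -[[a, d/2, f/2], [d/2, b, e/2], [f/2, e/2, c]]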
For general space-time shapes described by more complicated equations, the isosurfaces of U represent smoother versions of the boundaries and the Hessian varies continuously from one point to the next. The Hessian provides a measure that estimates locally the space-time shape near any space-time point inside the shape.

Numerical solutions to the Poisson equation can be obtained by various methods. We used an efficient multigrid technique to solve the equation. The time complexity of such a solver is linear in the number of space-time points. In all our experiments one multigrid “w-cycle” was sufficient to obtain an adequate solution. For more details see [19]. Finally, the solution obtained may be noisy near the boundaries due to discretization. To reduce this noise we apply, as a post-processing stage, a few relaxation sweeps enforcing ∆U = −1 inside the space-time shape and ∆U = 0 outside. This will smooth U near the boundaries and hardly affect more inner points.
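As a concrete illustration of this step, here is a minimal numerical sketch: plain Jacobi relaxation rather than the multigrid solver of [19], with the 2:1 spatial/temporal meshsize ratio suggested above. The function name, axis layout (x, y, t) and iteration count are our assumptions:

    import numpy as np

    def poisson_u(mask, hs=2.0, ht=1.0, n_iter=2000):
        """Approximate the solution of Delta U = -1 (Eq. (1)) inside a
        boolean space-time mask of shape (nx, ny, nt). Forcing U = 0
        outside the mask imposes the Dirichlet condition on the bounding
        surface; edge replication along the time axis imposes the
        Neumann condition U_t = 0 at the first and last frames. Plain
        Jacobi iteration; a sketch, not the multigrid solver of [19]."""
        U = np.zeros(mask.shape, dtype=np.float64)
        inv = 1.0 / (4.0 / hs**2 + 2.0 / ht**2)
        for _ in range(n_iter):
            P = np.pad(U, ((1, 1), (1, 1), (0, 0)))               # zeros in space
            P = np.pad(P, ((0, 0), (0, 0), (1, 1)), mode='edge')  # mirror in time
            U = inv * ((P[2:, 1:-1, 1:-1] + P[:-2, 1:-1, 1:-1] +
                        P[1:-1, 2:, 1:-1] + P[1:-1, :-2, 1:-1]) / hs**2 +
                       (P[1:-1, 1:-1, 2:] + P[1:-1, 1:-1, :-2]) / ht**2 + 1.0)
            U = np.where(mask, U, 0.0)
        return U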
2.2. Extracting Space-Time Shape Features

The solution to the Poisson equation can be used to extract a wide variety of useful local shape properties [9]. We adopted some of the relevant properties and extended them to deal with space-time shapes.

2.2.1 Local Features

Space-time saliency. Following [9], we define a local space-time saliency measure based on U and its gradient,

Φ(x, y, t) = U(x, y, t) + (3/2) ‖∇U(x, y, t)‖²,   (5)

where ∇U = (Ux, Uy, Ut). Consider a space-time sphere, which is the space-time shape of a disk growing and shrinking in time. This isotropic space-time shape has no protruding moving parts and therefore all its space-time points are equally salient. Indeed, Φ = r²/6 at all points inside the sphere, with r denoting the radius of the sphere. In space-time shapes of natural human actions Φ achieves its highest values inside the torso, and its lowest values inside the fast moving limbs. Static elongated parts or large moving parts (e.g., the head of a running person) will only attain intermediate values of Φ. We use a normalized variant of Φ,

Φ̂(x, y, t) = 1 − log(1 + Φ(x, y, t)) / max_{(x,y,t)∈S} log(1 + Φ(x, y, t)).   (6)

Figure 3. Examples of the local space-time saliency feature Φ̂. The values are encoded by the color spectrum from blue (low values) to red (high values).
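A corresponding sketch of Eqs. (5) and (6) on the discrete grid (our illustration, reusing poisson_u from the sketch above; meshsizes are again assumed):

    def saliency_hat(U, mask, hs=2.0, ht=1.0):
        """Space-time saliency Phi of Eq. (5) and its normalized
        variant Phi_hat of Eq. (6), computed from the Poisson
        solution U (e.g., the output of poisson_u above)."""
        Ux, Uy, Ut = np.gradient(U, hs, hs, ht)     # grad U = (Ux, Uy, Ut)
        Phi = U + 1.5 * (Ux**2 + Uy**2 + Ut**2)     # Eq. (5)
        logPhi = np.log1p(np.maximum(Phi, 0.0))     # log(1 + Phi)
        Phi_hat = 1.0 - logPhi / logPhi[mask].max() # Eq. (6), max over S
        return np.where(mask, Phi_hat, 0.0)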
Space-time orientations. Let λ1 ≥ λ2 ≥ λ3 denote the eigenvalues of the Hessian H at a given point. Following similar eigenvalue analyses, commonly used in perceptual grouping and 3D shape reconstruction, we distinguish between the following three types of local space-time structures:
• λ1 ≈ λ2 ≫ λ3 - corresponds to a space-time “stick” structure. For example, a small moving object generates a slanted space-time “stick”, whereas a static object has a “stick” shape in the temporal direction. The informative direction of such a structure is the direction of the “stick”, which corresponds to the third eigenvector of H.

• λ1 ≫ λ2 ≈ λ3 - corresponds to a space-time “plate” structure. For example, a fast moving limb generates a slanted space-time surface (“plate”), and a static vertical torso/limb generates a “plate” parallel to the y-t plane. The informative direction of a “plate” is its normal, which corresponds to the first eigenvector of H.

• λ1 ≈ λ2 ≈ λ3 - corresponds to a space-time “ball” structure, which does not have any principal direction.

We exploit the decomposition above to characterize each point with two types of local features. The first is related to the local shape structure, and the second relies on its most informative orientation. Using the ratio of the eigenvalues at every space-time point we define three continuous measures of “plateness” Spl(x, y, t), “stickness” Sst(x, y, t) and “ballness” Sba(x, y, t), where

Spl = e^(−α λ2/λ1)
Sst = (1 − Spl) e^(−α λ3/λ2)   (7)
Sba = (1 − Spl)(1 − e^(−α λ3/λ2)).

Note that Spl + Sst + Sba = 1 and the transition between the different types of regions is gradual.
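A sketch of Eq. (7) from a numerically estimated Hessian (our illustration; the eigenvalue ordering by absolute value and the value of α are assumptions, since the text states β = 3 for Eq. (8) but not α; grid spacing is omitted for brevity):

    def shape_measures(U, alpha=3.0, eps=1e-8):
        """Continuous 'plateness', 'stickness' and 'ballness' of Eq. (7)
        from the Hessian of U at every point. Eigenvalues are taken by
        absolute value, ordered |l1| >= |l2| >= |l3| (an assumption)."""
        Ux, Uy, Ut = np.gradient(U)
        # Hessian via repeated finite differences: shape (..., 3, 3)
        H = np.stack([np.stack(np.gradient(G), axis=-1)
                      for G in (Ux, Uy, Ut)], axis=-2)
        lam = np.sort(np.abs(np.linalg.eigvalsh(H)), axis=-1)[..., ::-1]
        l1, l2, l3 = lam[..., 0], lam[..., 1], lam[..., 2]
        S_pl = np.exp(-alpha * l2 / (l1 + eps))
        S_st = (1 - S_pl) * np.exp(-alpha * l3 / (l2 + eps))
        S_ba = (1 - S_pl) * (1 - np.exp(-alpha * l3 / (l2 + eps)))
        return S_pl, S_st, S_ba   # sums to 1 at every point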
The second type of local features identifies regions with vertical, horizontal and temporal plates and sticks. Let v(x, y, t) be the informative direction (of a plate or a stick) computed with the Hessian at each point. Then the orientation measures are defined as

D1 = e^(−β|v·e1|)
D2 = e^(−β|v·e2|)   (8)
D3 = e^(−β|v·e3|),

with e1, e2, e3 denoting the unit vectors in the direction of the principal axes x, y and t (we used β = 3).

Fig. 4 demonstrates examples of space-time shapes and their orientation measured locally at every space-time point.

Figure 4. Space-time orientations of plates and sticks for “jumping-jack” (first two rows) and “walk” (last row) actions. The first two rows illustrate three sample frames of two different persons performing the “jumping-jack” action. In the third row we show a person walking. The left three columns show a schematic representation of normals where local plates were detected. The right three columns show principal directions of local sticks. In all examples we represent with the blue, red and green colors regions with temporal, horizontal and vertical informative direction, respectively. The intensity denotes the extent to which the local shape is a plate or a stick. For example, fast moving hands of a “jumping-jack” are identified as plates with normals oriented in the temporal direction (appear in blue on the left), whereas slower moving legs are identified as vertical sticks (appear in green on the right). Note the color consistency between the same action of two different persons, despite the dissimilarity of their spatial appearance.

2.2.2 Global Features

In order to represent an action with global features we use weighted moments of the form

m_pqr = ∫∫∫ w(x, y, t) g(x, y, t) x^p y^q t^r dx dy dt,   (9)

with integration over all of space-time, where g(x, y, t) denotes the characteristic function of the space-time shape and w(x, y, t) is a weighting function. For each pair of a local shape type i and a unit vector ej, we substitute the weights w with the combined local feature

w(x, y, t) = Si(x, y, t) · Dj(x, y, t),   (10)

where i ∈ {pl, st} and j ∈ {1, 2, 3}. We have found the isotropic ball features to be redundant and therefore did not use them as global features. Note that 0 ≤ w(x, y, t) ≤ 1 for all (x, y, t).

In addition to the above six types of weighting functions we also generate space-time saliency moments using w(x, y, t) = Φ̂ of Eq. (6).

In the following section we demonstrate the utility of these features in action recognition and classification experiments.
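On a discrete grid the integral of Eq. (9) becomes a sum over voxels. A sketch (our illustration; the coordinate centering and scale normalization described in Sec. 3 are omitted for brevity):

    def weighted_moments(w, g, max_pq=5, max_r=2):
        """Discrete weighted space-time moments m_pqr of Eq. (9). g is
        the binary characteristic function of the shape; w is a weighting
        function, e.g. S_pl * D1 (Eq. (10)) or the saliency Phi_hat
        (Eq. (6)). Order limits p + q <= 5, r <= 2 follow Sec. 3."""
        x, y, t = np.indices(g.shape)
        return np.array([np.sum(w * g * x**p * y**q * t**r)
                         for p in range(max_pq + 1)
                         for q in range(max_pq + 1 - p)
                         for r in range(max_r + 1)])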
3. Results and Experiments

The local and global space-time features presented in Sec. 2.2 are used for action recognition and classification.

For the first two experiments (action classification and clustering) we collected a database of 81 low-resolution (180 × 144, 25 fps) video sequences showing nine different people, each performing nine natural actions: “running”, “walking”, “jumping-jack”, “jumping-forward-on-two-legs”, “jumping-in-place-on-two-legs”, “galloping-sideways”, “waving-two-hands”, “waving-one-hand” and “bending”. To obtain space-time shapes of the actions we subtracted the median background from each of the sequences and used a simple thresholding in color-space. The resulting silhouettes contained “leaks” and “intrusions” due to imperfect subtraction, shadows and color similarities with the background (see Fig. 5 for examples). For actions in which a human body undergoes a global motion, we compensate for the translation of the center of mass, in order to emphasize motion of parts relative to the torso, by fitting a second order polynomial to the frame centers of mass.

Figure 5. Examples of video sequences and extracted silhouettes from our database.

For each sequence we solved the Poisson equation and computed seven types of local features w(x, y, t) in Eq. (10) and Eq. (6). In order to treat both the periodic and non-periodic actions in the same framework, as well as to compensate for different lengths of periods, we used a sliding window in time to extract space-time cubes, each having 10 frames with an overlap of 5 frames between consecutive space-time cubes. We centered each space-time cube about its space-time centroid and brought it to a uniform scale in space, preserving the spatial aspect ratio. We then computed global space-time shape features with spatial moments up to order 5 and time moments up to order 2 (i.e., with p + q ≤ 5 and r ≤ 2 in Eq. (9)), giving rise to a 280-feature vector representation per space-time cube. Note that the coordinate normalization above does not involve any global video alignment/registration.

3.1. Action Classification

For every video sequence we perform a leave-one-out procedure, i.e., we remove the entire sequence (all its space-time cubes) from the database, while other actions of the same person remain. Each cube of the removed sequence is then compared to all the cubes in the database and classified using the nearest neighbor procedure (with Euclidean distance operating on normalized global features). Thus, for a space-time cube to be classified correctly, it must exhibit high similarity to a cube of a different person performing the same action. In this way, the possibility of high similarity due purely to spatial appearance is minimized.

The algorithm misclassified 1 out of 549 space-time cubes (0.36% error rate). The correct classifications originated uniformly from all other persons in the database. We also ran the same experiment with ordinary space-time shape moments (i.e., substituting w(x, y, t) = 1 in Eq. (9)) of up to order 7 in space and in time. The algorithm misclassified 17 out of 549 cubes (3.10% error rate). Further experiments with all combinations of orders between 3 and 14 yielded worse results. Note that space-time shapes of an action are very informative and rich, as is demonstrated by the relatively high classification rates achieved even with ordinary shape moments.

To demonstrate the superiority of the space-time shape information over spatial information collected separately from each frame of a sequence, we conducted an additional experiment. For each of the space-time cubes in our database we centered the silhouette in each frame about its spatial centroid and brought it to a uniform scale, preserving the spatial aspect ratio. We then computed spatial shape moments of the silhouette in each of the frames separately and concatenated these moments into one feature vector for the entire space-time cube. Next, we used these moments to perform the same leave-one-out classification procedure. We tested all combinations of orders between 3 and 8, resulting in up to 440 features. The algorithm with the best combination misclassified 35 out of 549 cubes (6.38% error rate).

To explain why the space-time approach outperforms the spatial-per-frame approach, consider for example the “run” and “walk” actions. Many successive frames from the first action may exhibit high spatial similarity to the successive frames from the second one. Ignoring the dynamics within the frames might lead to confusion between the two actions.
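The leave-one-out nearest-neighbor procedure of this section can be summarized in a few lines (our illustration; the function name, array layout and prior feature normalization are assumptions):

    import numpy as np

    def leave_one_out_errors(feats, labels, seq_id):
        """Classify every space-time cube by its Euclidean nearest
        neighbor among the cubes of all *other* sequences, in the spirit
        of Sec. 3.1. feats: (n_cubes, n_features) normalized global
        features; labels and seq_id: per-cube action and sequence ids."""
        errors = 0
        for k in range(len(feats)):
            keep = seq_id != seq_id[k]          # drop the whole sequence
            d = np.linalg.norm(feats[keep] - feats[k], axis=1)
            if labels[keep][np.argmin(d)] != labels[k]:
                errors += 1
        return errors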
3.2. Action Clustering

In this experiment we applied a common spectral clustering algorithm [13] to 81 unlabelled action sequences. We defined the distance between any two sequences to be a variant of the Median Hausdorff Distance:

D_H(s1, s2) = median_j ( min_i ‖c1_i − c2_j‖ ) + median_i ( min_j ‖c1_i − c2_j‖ ),   (11)

where c1_i and c2_j denote the space-time cubes of sequences s1 and s2.

3.3. Robustness Experiments

For each of the test sequences s we measured its Median Hausdorff Distance to each of the action types a_k, k ∈ {1, . . . , 9}, in our database.

Figure 7. Examples of sequences used in robustness experiments. We show three sample frames and their silhouettes for the following sequences (from left to right): “Diagonal walk”, “Occluded legs”, “Knees up”, “Swinging bag”, “Sleepwalking”, “Walking with a dog”.

3.4. Action Detection in a Ballet Movie

In this experiment we show how, given an example of an action, we can use space-time shape properties to identify all locations with similar actions in a given video sequence. We chose to demonstrate our method on the ballet movie example used in [17]. This is a highly compressed (111 Kbps, WMV format) 192 × 144 × 750 ballet movie with an effective frame rate of 15 fps, a moving camera and changing zoom, showing a performance of two (female and male) dancers. We manually separated the sequence into two parallel movies, each showing only one of the dancers. For both of the sequences we then solved the Poisson equation and extracted the space-time shape features described in Sec. 2.2.

Figure 9. Results of action detection in a ballet movie. The green and the red lines denote the distances between the query cube and the cubes of the female and the male dancers, respectively. The ground truth is marked with green squares for the female dancer and red squares for the male dancer. A middle frame is shown for every detected space-time cube. Correct detections are marked with a blue “v”, whereas false alarms and misses are marked with a blue “x”. Full video results can be found at http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html.
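For completeness, a sketch of the Median Hausdorff Distance of Eq. (11) between two sequences represented by their cube feature vectors (our illustration):

    import numpy as np

    def median_hausdorff(c1, c2):
        """Symmetric Median Hausdorff Distance of Eq. (11). c1: (n1, d)
        and c2: (n2, d) arrays of space-time cube feature vectors."""
        d = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=-1)  # (n1, n2)
        return np.median(d.min(axis=0)) + np.median(d.min(axis=1))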