

Actions as Space-Time Shapes

Moshe Blank Lena Gorelick Eli Shechtman Michal Irani Ronen Basri

Dept. of Computer Science and Applied Math.


Weizmann Institute of Science
Rehovot 76100, Israel

Abstract

Human action in video sequences can be seen as silhouettes of a moving torso and protruding limbs undergoing articulated motion. We regard human actions as three-dimensional shapes induced by the silhouettes in the space-time volume. We adopt a recent approach [9] for analyzing 2D shapes and generalize it to deal with volumetric space-time action shapes. Our method utilizes properties of the solution to the Poisson equation to extract space-time features such as local space-time saliency, action dynamics, shape structure and orientation. We show that these features are useful for action recognition, detection and clustering. The method is fast, does not require video alignment and is applicable in (but not limited to) many scenarios where the background is known. Moreover, we demonstrate the robustness of our method to partial occlusions, non-rigid deformations, significant changes in scale and viewpoint, high irregularities in the performance of an action and low quality video.

Figure 1. Space-time shapes of “jumping-jack”, “walking” and “running” actions.

1. Introduction

Recognizing human action is a key component in many computer vision applications, such as video surveillance, human-computer interfaces, video indexing and browsing, recognition of gestures, analysis of sports events and dance choreography. Some of the recent work in the area of action recognition [7, 21, 11, 17] has shown that it is useful to analyze actions by looking at the video sequence as a space-time intensity volume. Analyzing actions directly in the space-time volume avoids some limitations of traditional approaches that involve the computation of optical flow [2, 8] (aperture problems, smooth surfaces, singularities, etc.), feature tracking [20, 4] (self-occlusions, re-initialization, change of appearance, etc.), or key frames [6] (lack of information about the motion). Most of the above studies are based on computing local space-time gradients or other intensity-based features, and thus might be unreliable in cases of low quality video, motion discontinuities and motion aliasing.

On the other hand, studies in the field of object recognition in 2D images have demonstrated that silhouettes contain detailed information about the shape of objects, e.g., [16, 1, 9, 5]. When a silhouette is sufficiently detailed, people can readily identify the object or judge its similarity to other shapes.

Our approach is based on the observation that a human action in video generates a space-time shape in the space-time volume (see Fig. 1). These space-time shapes contain both spatial information about the pose of the human figure at any time (location and orientation of the torso and the limbs, aspect ratio of the different body parts) and dynamic information (global body motion and motion of the limbs relative to the body). Several other approaches use information that could be derived from the space-time shape of an action: [3] uses a motion-history-image representation, and [14] analyzes planar slices (such as x-t planes) of the space-time intensity volume. Note that these methods implicitly use only partial information about the space-time shape. Methods for 3D shape analysis and matching have recently been used in computer graphics (see the survey in [18]). However, in their current form they do not apply to space-time shapes, due to the non-rigidity of actions, the inherent differences between the spatial and temporal domains, and the imperfections of the extracted silhouettes.
In this paper we generalize a method developed for the analysis of 2D shapes [9] to deal with volumetric space-time shapes induced by human actions. This method exploits the solution to the Poisson equation to extract various shape properties that are utilized for shape representation and classification. We adopt some of the relevant properties and extend them to deal with space-time shapes (Sec. 2.1). The spatial and temporal domains are different in nature and therefore are treated differently at several stages of our method. The additional time domain gives rise to new space-time shape entities that do not exist in the spatial domain, such as a space-time “stick”, “plate” and “ball”. Each such type has different informative properties that characterize every space-time point. In addition, we extract space-time saliency at every point, which detects fast moving protruding parts of an action (Sec. 2.2).

Unlike images, where extraction of a silhouette might be a difficult segmentation problem, the extraction of a space-time shape from a video sequence can be simple in many scenarios. In video surveillance with a fixed camera, as well as in various other settings, the appearance of the background is known. In these cases, using a simple change detection algorithm usually leads to satisfactory space-time shapes.

Our method is fast and does not require prior video alignment. We demonstrate the robustness of our approach to partial occlusions, non-rigid deformations, imperfections in the extracted silhouettes and high irregularities in the performance of an action. Finally, we report the performance of our approach in the tasks of action recognition, clustering and action detection in a low quality video (Sec. 3).

2. Representing Actions as Space-Time Shapes

Below we generalize the approach in [9] to deal with volumetric space-time shapes.

2.1. The Poisson Equation and its Properties

Consider an action and its space-time shape S surrounded by a simple, closed surface. We assign every internal space-time point a value reflecting its relative position within the space-time shape. This is done by assigning to each space-time point the mean time required for a particle undergoing a random-walk process starting from the point to hit the boundaries. This measure can be computed [9] by solving a Poisson equation of the form

\Delta U(x, y, t) = -1,    (1)

with (x, y, t) \in S, where the Laplacian of U is defined as \Delta U = U_{xx} + U_{yy} + U_{tt}, subject to the Dirichlet boundary conditions U(x, y, t) = 0 at the bounding surface \partial S. In order to cope with the artificial boundary at the first and last frames of the video, we impose the Neumann boundary conditions requiring U_t = 0 at those frames [19]. The induced effect is of a “mirror” in time that prevents attenuation of the solution towards the first and last frames.

Figure 2. The solution to the Poisson equation on space-time shapes of “jumping-jack”, “walking” and “running” actions. The values are encoded by the color spectrum from blue (low values) to red (high values). Note that points at the boundary attain zero values (Dirichlet boundary conditions).

Note that space and time units may have different extents; thus, when discretizing the Poisson equation, we utilize a space-time grid with different mesh sizes in space and in time. This affects the distribution of local orientations and saliency features across the space-time shape, and thus allows us to emphasize different aspects of actions. We found that a discretization scheme that uses space-time units with a spatial extent twice as long as the temporal one (the distance between frames) works best for most of the human actions that we collected.

Fig. 2 shows a spatial cross-cut of the solution to the Poisson equation obtained for several space-time shapes from Fig. 1. The level sets of U represent smoother versions of the bounding surface, with the external protrusions (fast moving limbs) disappearing already at relatively low values of U. Below we generalize the analysis in [9] to characterize actions as space-time shapes, using measures that estimate locally the second order moments of a shape near any given point.

Consider first a space-time shape given by a conic, i.e., composed of the points (x, y, t) satisfying

P(x, y, t) = ax^2 + by^2 + ct^2 + dxy + eyt + fxt + g \le 0.    (2)

In this case the solution to the Poisson equation takes the form

U(x, y, t) = -\frac{P(x, y, t)}{2(a + b + c)}.    (3)

The isosurfaces of U then contain a nested collection of scaled versions of the conic boundary, where the value of U increases quadratically as we approach the center. If we now consider the Hessian matrix of U, we obtain at any given point exactly the same matrix, namely

H(x, y, t) = -\frac{1}{a + b + c} \begin{pmatrix} a & d/2 & f/2 \\ d/2 & b & e/2 \\ f/2 & e/2 & c \end{pmatrix}.    (4)
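A quick verification (our addition, immediate from differentiating Eq. (3)): since P_{xx} + P_{yy} + P_{tt} = 2(a + b + c), we have

\Delta U = -\frac{P_{xx} + P_{yy} + P_{tt}}{2(a + b + c)} = -\frac{2a + 2b + 2c}{2(a + b + c)} = -1,

and U vanishes exactly where P = 0, so both the Poisson equation and the Dirichlet boundary condition of Eq. (1) are satisfied.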
This matrix is in fact the second moment matrix of the entire 3D conic shape, scaled by a constant. The eigenvectors and eigenvalues of H then reveal the orientation of the shape and its aspect ratios.

For general space-time shapes described by more complicated equations, the isosurfaces of U represent smoother versions of the boundaries, and the Hessian varies continuously from one point to the next. The Hessian thus provides a measure that estimates locally the space-time shape near any space-time point inside the shape.

Numerical solutions to the Poisson equation can be obtained by various methods. We used an efficient multigrid technique to solve the equation. The time complexity of such a solver is linear in the number of space-time points. In all our experiments one multigrid “w-cycle” was sufficient to obtain an adequate solution. For more details see [19]. Finally, the solution obtained may be noisy near the boundaries due to discretization. To reduce this noise we apply, as a post-processing stage, a few relaxation sweeps enforcing \Delta U = -1 inside the space-time shape and \Delta U = 0 outside. This smooths U near the boundaries and hardly affects more inner points.
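The multigrid solver of [19] is beyond a short listing, but the discretized problem itself is compact. Below is a minimal (and much slower) Jacobi-relaxation sketch of Eq. (1) on a binary space-time mask; the function and parameter names are ours, and the default 2:1 space-to-time mesh ratio follows the discretization discussed above:

```python
import numpy as np

def solve_poisson(mask, h_xy=2.0, h_t=1.0, iters=2000):
    """Jacobi relaxation for Eq. (1), Uxx + Uyy + Utt = -1, inside a
    binary (x, y, t) mask.  U = 0 outside the shape (Dirichlet); the
    volume is mirror-padded in time so that Ut ~ 0 at the first and
    last frames (the Neumann "mirror in time" of Sec. 2.1).  Assumes
    the silhouette does not touch the spatial borders of the array."""
    m = np.pad(mask.astype(bool), ((0, 0), (0, 0), (1, 1)), mode="edge")
    U = np.zeros(m.shape)
    wxy, wt = 1.0 / h_xy**2, 1.0 / h_t**2        # anisotropic mesh sizes
    denom = 4.0 * wxy + 2.0 * wt
    for _ in range(iters):
        # Weighted sum of the six neighbours (U is 0 outside the mask).
        nb = (wxy * (np.roll(U, 1, 0) + np.roll(U, -1, 0) +
                     np.roll(U, 1, 1) + np.roll(U, -1, 1)) +
              wt * (np.roll(U, 1, 2) + np.roll(U, -1, 2)))
        U = np.where(m, (nb + 1.0) / denom, 0.0)  # "+1" is the RHS of Eq. (1)
        U[:, :, 0], U[:, :, -1] = U[:, :, 1], U[:, :, -2]  # re-impose mirror
    return U[:, :, 1:-1]                          # drop the temporal padding
```

Note that np.roll wraps around the array borders, hence the assumption that the silhouette stays inside them; a production implementation would use the multigrid solver of [19], which converges in far fewer sweeps.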
2.2. Extracting Space-Time Shape Features

The solution to the Poisson equation can be used to extract a wide variety of useful local shape properties [9]. We adopt some of the relevant properties and extend them to deal with space-time shapes. The additional time domain gives rise to new space-time shape entities that do not exist in the spatial domain. We first show how the Poisson equation can be used to characterize space-time points by identifying the space-time saliency of moving parts and by locally judging the orientation and rough aspect ratios of the space-time shape. Next we describe how these local properties can be integrated into a compact vector of global features to represent an action.

2.2.1 Local Features

Space-Time Saliency

Human action can often be described as a moving torso and a collection of parts undergoing articulated motion [4, 10]. Below we describe how we can identify portions of a space-time shape that are salient both in space and in time.

In a space-time shape induced by a human action, the highest values of U are obtained within the human torso. Using an appropriate threshold we can identify the central part of a human body. However, the remaining space-time region includes both the moving parts and portions of the torso that are near the boundaries, where U has low values. Those portions of the boundary can be excluded by noticing that they have high gradient values. Following [9] we define

\Phi = U + \frac{3}{2}\,\|\nabla U\|^2,    (5)

where \nabla U = (U_x, U_y, U_t).

Consider a space-time sphere, which is the space-time shape of a disk growing and shrinking in time. This isotropic space-time shape has no protruding moving parts, and therefore all its space-time points are equally salient. Indeed, \Phi = r^2/6 at all points inside the sphere, with r denoting the radius of the sphere. In space-time shapes of natural human actions, \Phi achieves its highest values inside the torso and its lowest values inside the fast moving limbs. Static elongated parts or large moving parts (e.g., the head of a running person) attain only intermediate values of \Phi. We use a normalized variant of \Phi,

\hat{\Phi}(x, y, t) = 1 - \frac{\log(1 + \Phi(x, y, t))}{\max_{(x,y,t) \in S} \log(1 + \Phi(x, y, t))},    (6)

which emphasizes fast moving parts. Fig. 3 illustrates the space-time saliency function \hat{\Phi} computed on the space-time shapes of Fig. 1.

Figure 3. Examples of the local space-time saliency feature \hat{\Phi}. The values are encoded by the color spectrum from blue (low values) to red (high values).

For actions in which a human body undergoes a global motion (e.g., a walking person), we compensate for the global translation of the body in order to emphasize the motion of parts relative to the torso. This is done by fitting a smooth function to the centers of mass collected from the entire sequence and considering only the deviations from this function (similar to the figure-centric stabilization in [8]).
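For concreteness, a short sketch (ours) of Eqs. (5)-(6), given a solution U from the solver sketched earlier; np.gradient supplies the central-difference derivatives with the same anisotropic spacings:

```python
import numpy as np

def space_time_saliency(U, mask, h_xy=2.0, h_t=1.0):
    """Phi = U + (3/2)||grad U||^2 (Eq. 5) and the normalized variant
    Phi_hat (Eq. 6), which emphasizes fast moving parts."""
    Ux, Uy, Ut = np.gradient(U, h_xy, h_xy, h_t)   # central differences
    phi = U + 1.5 * (Ux**2 + Uy**2 + Ut**2)
    log_phi = np.log1p(phi)                        # log(1 + Phi)
    phi_hat = 1.0 - log_phi / log_phi[mask].max()  # max over the shape S
    return phi, np.where(mask, phi_hat, 0.0)
```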
Space-Time Orientations

The Poisson equation can also be used to estimate the local orientation and aspect ratio of different space-time parts. This is done by constructing the Hessian H of Eq. (4), which approximates locally the second order shape moments at any given point; its eigenvectors correspond to the local principal directions. The eigenvalues of H are related to the local curvature in the directions of the corresponding eigenvectors, and are therefore inversely proportional to the corresponding lengths.

Let \lambda_1 \ge \lambda_2 \ge \lambda_3 be the eigenvalues of H. Then the first principal eigenvector corresponds to the shortest direction of the local space-time shape, and the third eigenvector corresponds to the most elongated direction.
Inspired by earlier works [15, 12] in the areas of perceptual grouping and 3D shape reconstruction, we distinguish between the following three types of local space-time structures:

- \lambda_1 \approx \lambda_2 \gg \lambda_3 corresponds to a space-time “stick” structure. For example, a small moving object generates a slanted space-time “stick”, whereas a static object has a “stick” shape in the temporal direction. The informative direction of such a structure is the direction of the “stick”, which corresponds to the third eigenvector of H.

- \lambda_1 \gg \lambda_2 \approx \lambda_3 corresponds to a space-time “plate” structure. For example, a fast moving limb generates a slanted space-time surface (“plate”), and a static vertical torso/limb generates a “plate” parallel to the y-t plane. The informative direction of a “plate” is its normal, which corresponds to the first eigenvector of H.

- \lambda_1 \approx \lambda_2 \approx \lambda_3 corresponds to a space-time “ball” structure, which does not have any principal direction.

We exploit the decomposition above to characterize each point with two types of local features. The first is related to the local shape structure, and the second relies on its most informative orientation. Using the ratios of the eigenvalues at every space-time point, we define three continuous measures of “plateness” S_{pl}(x, y, t), “stickness” S_{st}(x, y, t) and “ballness” S_{ba}(x, y, t), where

S_{pl} = e^{-\alpha \lambda_2 / \lambda_1}
S_{st} = (1 - S_{pl})\, e^{-\alpha \lambda_3 / \lambda_2}    (7)
S_{ba} = (1 - S_{pl})\,(1 - e^{-\alpha \lambda_3 / \lambda_2}).

Note that S_{pl} + S_{st} + S_{ba} = 1, and the transition between the different types of regions is gradual.

The second type of local features identifies regions with vertical, horizontal and temporal plates and sticks. Let v(x, y, t) be the informative direction (of a plate or a stick) computed with the Hessian at each point. Then the orientation measures are defined as

D_1 = e^{-\beta |v \cdot e_1|}
D_2 = e^{-\beta |v \cdot e_2|}    (8)
D_3 = e^{-\beta |v \cdot e_3|},

with e_1, e_2, e_3 denoting the unit vectors in the directions of the principal axes x, y and t (we used \beta = 3).
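A per-voxel sketch (ours) of Eqs. (7)-(8). The excerpt does not state the value of \alpha or the eigenvalue sign convention, so sorting the eigenvalues by magnitude and taking alpha = 1 below are placeholder assumptions:

```python
import numpy as np

def structure_and_orientation(U, alpha=1.0, beta=3.0, h_xy=2.0, h_t=1.0):
    """Eigen-analysis of the Hessian of U at every voxel (Eq. 4), giving
    the plateness/stickness/ballness measures of Eq. (7) and the per-axis
    orientation measures of Eq. (8)."""
    Ux, Uy, Ut = np.gradient(U, h_xy, h_xy, h_t)
    rows = [np.gradient(g, h_xy, h_xy, h_t) for g in (Ux, Uy, Ut)]
    H = np.stack([np.stack(r, axis=-1) for r in rows], axis=-2)  # (...,3,3)
    lam, vec = np.linalg.eigh(H)        # eigh reads the lower triangle only
    order = np.argsort(-np.abs(lam), axis=-1)         # |l1| >= |l2| >= |l3|
    lam = np.abs(np.take_along_axis(lam, order, axis=-1))
    vec = np.take_along_axis(vec, order[..., None, :], axis=-1)
    eps = 1e-8
    S_pl = np.exp(-alpha * lam[..., 1] / (lam[..., 0] + eps))
    S_st = (1 - S_pl) * np.exp(-alpha * lam[..., 2] / (lam[..., 1] + eps))
    S_ba = (1 - S_pl) * (1 - np.exp(-alpha * lam[..., 2] / (lam[..., 1] + eps)))
    # Informative directions: a plate's normal is the first eigenvector,
    # a stick's axis the third; D_j = exp(-beta |v . e_j|) per Eq. (8),
    # and v . e_j is simply the j-th component of v for the unit axes.
    D_plate = np.exp(-beta * np.abs(vec[..., :, 0]))
    D_stick = np.exp(-beta * np.abs(vec[..., :, 2]))
    return S_pl, S_st, S_ba, D_plate, D_stick
```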
Fig. 4 demonstrates examples of space-time shapes and their orientations measured locally at every space-time point.

Figure 4. Space-time orientations of plates and sticks for “jumping-jack” (first two rows) and “walk” (last row) actions; the panels show the degree of “plateness” and the degree of “stickness”. The first two rows illustrate three sample frames of two different persons performing the “jumping-jack” action. In the third row we show a person walking. The left three columns show a schematic representation of normals where local plates were detected. The right three columns show principal directions of local sticks. In all examples, blue, red and green denote regions whose informative direction is temporal, horizontal and vertical, accordingly. The intensity denotes the extent to which the local shape is a plate or a stick. For example, the fast moving hands of a “jumping-jack” are identified as plates with normals oriented in the temporal direction (appearing in blue on the left), whereas the slower moving legs are identified as vertical sticks (appearing in green on the right). Note the color consistency between the same action performed by two different persons, despite the dissimilarity of their spatial appearance.

2.2.2 Global Features

In order to represent an action with global features, we use weighted moments of the form

m_{pqr} = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} w(x, y, t)\, g(x, y, t)\, x^p y^q t^r \, dx\, dy\, dt,    (9)

where g(x, y, t) denotes the characteristic function of the space-time shape and w(x, y, t) is a weighting function. For each pair of a local shape type i and a unit vector e_j, we substitute the weights w with the combined local feature

w(x, y, t) = S_i(x, y, t) \cdot D_j(x, y, t),    (10)

where i \in \{pl, st\} and j \in \{1, 2, 3\}. We found the isotropic ball features to be redundant and therefore did not use them as global features. Note that 0 \le w(x, y, t) \le 1 for all (x, y, t).

In addition to the above six types of weighting functions, we also generate space-time saliency moments using w(x, y, t) = \hat{\Phi} of Eq. (6).

In the following section we demonstrate the utility of these features in action recognition and classification experiments.
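A direct, unoptimized sketch (ours) of the weighted moments of Eq. (9) for one space-time cube, with coordinates centered on the shape's space-time centroid (the normalization used later in Sec. 3); the exact index set that yields the 280-dimensional vector reported there is not fully specified in the excerpt:

```python
import numpy as np

def weighted_moments(g, w, p_max=5, r_max=2):
    """Weighted space-time moments m_pqr of Eq. (9): g is the binary
    characteristic function of the cube, w a weighting function (one of
    the six S_i * D_j features of Eq. (10), or Phi_hat of Eq. (6))."""
    grids = np.meshgrid(*(np.arange(n, dtype=float) for n in g.shape),
                        indexing="ij")
    mass = g.sum()
    X, Y, T = (c - (c * g).sum() / mass for c in grids)  # centre coordinates
    return np.array([(w * g * X**p * Y**q * T**r).sum()
                     for p in range(p_max + 1)
                     for q in range(p_max + 1 - p)       # p + q <= p_max
                     for r in range(r_max + 1)])
```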
3. Results and Experiments

The local and global space-time features presented in Sec. 2.2 are used for action recognition and classification.

For the first two experiments (action classification and clustering) we collected a database of 81 low-resolution (180 × 144, 25 fps) video sequences showing nine different people, each performing nine natural actions: “running”, “walking”, “jumping-jack”, “jumping-forward-on-two-legs”, “jumping-in-place-on-two-legs”, “galloping-sideways”, “waving-two-hands”, “waving-one-hand” and “bending”. To obtain the space-time shapes of the actions, we subtracted the median background from each of the sequences and used simple thresholding in color-space. The resulting silhouettes contained “leaks” and “intrusions” due to imperfect subtraction, shadows and color similarities with the background (see Fig. 5 for examples).

Figure 5. Examples of video sequences and extracted silhouettes from our database.

For actions in which a human body undergoes a global motion, we compensate for the translation of the center of mass, in order to emphasize the motion of parts relative to the torso, by fitting a second order polynomial to the frame centers of mass.

For each sequence we solved the Poisson equation and computed the seven types of local features w(x, y, t) of Eq. (10) and Eq. (6). In order to treat both periodic and non-periodic actions in the same framework, as well as to compensate for different period lengths, we used a sliding window in time to extract space-time cubes, each having 10 frames with an overlap of 5 frames between consecutive space-time cubes. We centered each space-time cube about its space-time centroid and brought it to a uniform scale in space, preserving the spatial aspect ratio. We then computed global space-time shape features with spatial moments up to order 5 and time moments up to order 2 (i.e., with p + q ≤ 5 and r ≤ 2 in Eq. (9)), giving rise to a 280-dimensional feature vector per space-time cube. Note that the coordinate normalization above does not involve any global video alignment/registration.

3.1. Action Classification

For every video sequence we perform a leave-one-out procedure, i.e., we remove the entire sequence (all its space-time cubes) from the database, while other actions of the same person remain. Each cube of the removed sequence is then compared to all the cubes in the database and classified using the nearest neighbor procedure (with Euclidean distance operating on normalized global features). Thus, for a space-time cube to be classified correctly, it must exhibit high similarity to a cube of a different person performing the same action. This way, the possibility of high similarity due purely to spatial appearance is minimized.
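A compact sketch of this leave-one-out, nearest-neighbor protocol (our illustration; names are ours, and we assume the global feature vectors are already normalized, since the normalization scheme is not detailed in the excerpt):

```python
import numpy as np
from scipy.spatial.distance import cdist

def leave_one_out_error(feats, seq_ids, labels):
    """feats: (n_cubes, d) array of normalized global features;
    seq_ids[i]: which video sequence cube i came from;
    labels[i]: the action class of cube i."""
    dist = cdist(feats, feats)                 # pairwise Euclidean distances
    errors = 0
    for i in range(len(feats)):
        valid = seq_ids != seq_ids[i]          # drop the cube's whole sequence
        nn = np.argmin(np.where(valid, dist[i], np.inf))
        errors += labels[nn] != labels[i]
    return errors / len(feats)
```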
The algorithm misclassified 1 out of 549 space-time cubes (a 0.36% error rate). The correct classifications originated uniformly from all the other persons in the database. We also ran the same experiment with ordinary space-time shape moments (i.e., substituting w(x, y, t) = 1 in Eq. (9)) of up to order 7 in space and in time. The algorithm misclassified 17 out of 549 cubes (a 3.10% error rate). Further experiments with all combinations of orders between 3 and 14 yielded worse results. Note that space-time shapes of an action are very informative and rich, as demonstrated by the relatively high classification rates achieved even with ordinary shape moments.

To demonstrate the superiority of the space-time shape information over spatial information collected separately from each frame of a sequence, we conducted an additional experiment. For each of the space-time cubes in our database, we centered the silhouette in each frame about its spatial centroid and brought it to a uniform scale, preserving the spatial aspect ratio. We then computed spatial shape moments of the silhouette in each of the frames separately and concatenated these moments into one feature vector for the entire space-time cube. Next, we used these moments to perform the same leave-one-out classification procedure. We tested all combinations of orders between 3 and 8, resulting in up to 440 features. The algorithm with the best combination misclassified 35 out of 549 cubes (a 6.38% error rate).

To explain why the space-time approach outperforms the spatial-per-frame approach, consider for example the “run” and “walk” actions. Many successive frames from the first action may exhibit high spatial similarity to successive frames from the second one. Ignoring the dynamics within the frames might lead to confusion between the two actions.
3.2. Action Clustering

In this experiment we applied a common spectral clustering algorithm [13] to 81 unlabelled action sequences. We defined the distance between any two sequences to be a variant of the Median Hausdorff Distance:

D_H(s_1, s_2) = \operatorname{median}_j \left( \min_i \| c^1_i - c^2_j \| \right) + \operatorname{median}_i \left( \min_j \| c^1_i - c^2_j \| \right),    (11)

where \{c^1_i\} and \{c^2_j\} denote the space-time cubes belonging to the sequences s_1 and s_2 accordingly. As a result we obtained nine separate clusters of the nine different actions, with only one “walk” sequence erroneously clustered with the “run” sequences. Fig. 6 shows the resulting distance matrix.

Figure 6. Results of spectral clustering: the distance matrix, reordered using the results of spectral clustering. We obtained nine separate clusters of the nine different actions. The row of the erroneously clustered “walk” sequence is marked with an arrow.
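A minimal sketch of this distance (our illustration); with symmetric=False it reduces to the one-sided sequence-to-action distance of Eq. (12) used in the robustness experiment below:

```python
import numpy as np
from scipy.spatial.distance import cdist

def median_hausdorff(cubes1, cubes2, symmetric=True):
    """Median Hausdorff Distance between two sets of space-time-cube
    feature vectors (Eq. 11); one-sided variant as in Eq. (12)."""
    d = cdist(cubes1, cubes2)                    # d[i, j] = ||c1_i - c2_j||
    one_sided = np.median(d.min(axis=1))         # median_i min_j
    if not symmetric:
        return one_sided
    return one_sided + np.median(d.min(axis=0))  # + median_j min_i
```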

3.3. Robustness

In this experiment we demonstrate the robustness of our method to high irregularities in the performance of an action. We collected ten test video sequences of people walking in various difficult scenarios in front of different non-uniform backgrounds (see Fig. 7 for a few examples). We show that our approach has relatively low sensitivity to partial occlusions, non-rigid deformations and other defects in the extracted space-time shape. In addition, it is partially robust to changes in viewpoint, as demonstrated by the “diagonal walk” example (30-40 degrees, see Fig. 7, upper left).

For each of the test sequences s, we measured its Median Hausdorff Distance to each of the action types a_k, k \in \{1 \dots 9\}, in our database:

D_H(s, a_k) = \operatorname{median}_i \left( \min_j \| c_i - c_j \| \right),    (12)

where c_i \in s is a space-time cube belonging to the test sequence and c_j \in a_k denotes a space-time cube belonging to one of the training sequences of the action a_k. We then classified each test sequence as the action with the smallest distance. All the test sequences except for one were classified correctly as the “walk” action. Fig. 8 shows, for each of the test sequences, the first and second best choices and their distances, as well as the median distance to all the actions. The test sequences are sorted by the distance to their first best chosen action. Note that in the misclassified sequence the difference between the first and the second (the correct) choices is small (w.r.t. the median distance), compared to the differences in the other sequences.

Test Sequence        | 1st best   | 2nd best    | Med.
Normal walk          | walk 7.8   | run 11.5    | 15.9
Walking in a skirt   | walk 8.8   | run 11.6    | 16.0
Carrying briefcase   | walk 10.0  | gallop 13.5 | 16.7
Knees up             | walk 10.5  | jump 14.0   | 14.9
Diagonal walk        | walk 11.4  | gallop 13.6 | 15.1
Limping man          | walk 12.8  | gallop 15.9 | 16.8
Occluded legs        | walk 13.4  | pjump 15.0  | 15.8
Swinging bag         | walk 14.9  | jack 17.3   | 19.7
Sleepwalking         | walk 15.2  | run 16.8    | 19.9
Walking with a dog   | run 17.7   | walk 18.4   | 22.2

Figure 8. Robustness experiment results. The leftmost column describes the test action performed. For each of the test sequences, the closest two actions with the corresponding distances are reported in the second and third columns. The median distance to all the actions in the database appears in the rightmost column. Abbreviations: pjump = “jumping-in-place-on-two-legs”, jump = “jumping-forward-on-two-legs”, jack = “jumping-jack”, gallop = “galloping-sideways”.

3.4. Action Detection in a Ballet Movie

In this experiment we show how, given an example of an action, we can use space-time shape properties to identify all locations with similar actions in a given video sequence.

We chose to demonstrate our method on the ballet movie example used in [17]. This is a highly compressed (111 Kbps, wmv format) 192 × 144 × 750 ballet movie with an effective frame rate of 15 fps, a moving camera and changing zoom, showing a performance of two (female and male) dancers. We manually separated the sequence into two parallel movies, each showing only one of the dancers. For both of the sequences we then solved the Poisson equation and computed the same global features as in the previous experiment for each space-time cube.
Figure 7. Examples of sequences used in the robustness experiments. We show three sample frames and their silhouettes for the following sequences (from left to right): “Diagonal walk”, “Occluded legs”, “Knees up”, “Swinging bag”, “Sleepwalking”, “Walking with a dog”.

We selected a cube with the male dancer performing a “cabriole” pas (beating the feet together at an angle in the air) and used it as a query to find all the locations in the two movies where a similar movement was performed by either a male or a female dancer. Fig. 9 demonstrates the results of the action detection, obtained by simply thresholding the Euclidean distances computed with normalized global features. The green and the red lines denote the distances between the query cube and the cubes of the female and the male dancers accordingly. The ground truth is marked with green squares for the female dancer and red squares for the male dancer. A middle frame is shown for every detected space-time cube. The algorithm detected all locations with actions similar to the query, except for one false alarm of the female dancer and two misses (male and female), all marked with a blue “x”. The two misses can be explained by the difference in the hand movement, and the false alarm by the high similarity between the hand movement of the female dancer and the query. An additional “cabriole” pas of the male dancer was completely occluded by the female dancer, and was therefore ignored in our experiment. These results are comparable to the results reported in [17]. Accompanying video material can be found at http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html.
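The detection rule itself is simple; a sketch (our illustration, with a hypothetical threshold parameter, since no value is given in the text):

```python
import numpy as np

def detect_similar_actions(query_feat, cube_feats, threshold):
    """Flag every sliding-window space-time cube whose normalized global
    feature vector lies within a Euclidean distance threshold of the
    query cube (the thresholding used in Sec. 3.4)."""
    d = np.linalg.norm(cube_feats - query_feat, axis=1)
    return np.flatnonzero(d < threshold), d
```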
4. Conclusion

In this paper we represent actions as space-time shapes and show that such a representation contains rich and descriptive information about the action performed. The quality of the extracted features is demonstrated by the success of the relatively simple classification scheme used (nearest-neighbor classification with Euclidean distance). In many situations the information contained in a single space-time cube is rich enough for a reliable classification to be performed, as was demonstrated in the first classification experiment. In real-life applications, reliable performance can be achieved by integrating information coming from the entire input sequence (all its space-time cubes), as was demonstrated by the robustness experiments.

Our approach has several advantages. First, it does not require video alignment. Second, it is linear in the number of space-time points in the shape. The overall processing time (solving the Poisson equation and extracting features) in Matlab for a 110 × 70 × 50 pre-segmented video is less than 30 seconds on a Pentium 4, 3.0 GHz. Third, it has the potential to cope with low quality video data, where other methods that are based on intensity features only (e.g., gradients) might encounter difficulties. On the other hand, by looking at the space-time shape only, we ignore the intensity information inside the shape. In the future this method can be combined with intensity-based features to further improve performance. It is also possible to broaden the range of space-time features extracted with the Poisson equation in order to deal with more challenging tasks, such as human gait recognition. Finally, this approach can also be applied with very little change to general 3D shape representation and matching.
Figure 9. Results of action detection in a ballet movie. The green and the red lines denote the distances between the query cube and the cubes of the female and the male dancers accordingly. The ground truth is marked with green squares for the female dancer and red squares for the male dancer. A middle frame is shown for every detected space-time cube. Correct detections are marked with a blue “v”, whereas false alarms and misses are marked with a blue “x”. Full video results can be found at http://www.wisdom.weizmann.ac.il/~vision/SpaceTimeActions.html.

Acknowledgements

This work was supported in part by the Israel Science Foundation Grant No. 267/02, by the European Commission Project IST-2002-506766 Aim Shape, and by the Binational Science Foundation Grant No. 2002/254. The research was conducted at the Moross Laboratory for Vision and Motor Control at the Weizmann Institute of Science.

References

[1] S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. PAMI, 24(4):509–522, 2002.
[2] M. J. Black. Explaining optical flow events with parameterized spatio-temporal models. CVPR, 1:1326–1332, 1999.
[3] A. Bobick and J. Davis. The recognition of human movement using temporal templates. PAMI, 23(3):257–267, 2001.
[4] C. Bregler. Learning and recognizing human dynamics in video sequences. CVPR, June 1997.
[5] S. Carlsson. Order structure, correspondence and shape based categories. International Workshop on Shape, Contour and Grouping, Springer Lecture Notes in Computer Science, page 1681, 1999.
[6] S. Carlsson and J. Sullivan. Action recognition by shape matching to key frames. Workshop on Models versus Exemplars in Computer Vision, December 2001.
[7] O. Chomat and J. L. Crowley. Probabilistic sensor for the perception of activities. ECCV, 2000.
[8] A. A. Efros, A. C. Berg, G. Mori, and J. Malik. Recognizing action at a distance. ICCV, October 2003.
[9] L. Gorelick, M. Galun, E. Sharon, A. Brandt, and R. Basri. Shape representation and recognition using the Poisson equation. CVPR, 2:61–67, 2004.
[10] S. X. Ju, M. J. Black, and Y. Yacoob. Cardboard people: A parametrized model of articulated image motion. 2nd Int. Conf. on Automatic Face and Gesture Recognition, pages 38–44, October 1996.
[11] I. Laptev and T. Lindeberg. Space-time interest points. ICCV, 2003.
[12] G. Medioni and C. Tang. Tensor voting: Theory and applications. Proceedings of RFIA, Paris, France, 2000.
[13] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS, pages 849–856, 2001.
[14] S. A. Niyogi and E. H. Adelson. Analyzing and recognizing walking figures in xyt. CVPR, June 1994.
[15] E. Rivlin, S. Dickinson, and A. Rosenfeld. Recognition by functional parts. CVPR, pages 267–274, 1994.
[16] T. Sebastian, P. Klein, and B. Kimia. Shock-based indexing into large shape databases. ECCV (3), pages 731–746, 2002.
[17] E. Shechtman and M. Irani. Space-time behavior based correlation. In Proceedings of CVPR, June 2005.
[18] J. Tangelder and R. Veltkamp. A survey of content based 3D shape retrieval methods. Proceedings Shape Modeling International, pages 145–156, 2004.
[19] U. Trottenberg, C. Oosterlee, and A. Schuller. Multigrid. Academic Press, 2001.
[20] Y. Yacoob and M. J. Black. Parametrized modeling and recognition of activities. CVIU, 73(2):232–247, 1999.
[21] L. Zelnik-Manor and M. Irani. Event-based analysis of video. CVPR, pages 123–130, September 2001.
