MST-CSS (Multi-Spectro-Temporal Curvature Scale Space), A Novel Spatio-Temporal Representation For Content-Based Video Retrieval

Abstract—We present a novel spatio-temporal descriptor to efficiently represent a video object for the purpose of content-based video retrieval. Features from spatial along with temporal information are integrated in a unified framework for the purpose of retrieval of similar video shots. A sequence of orthogonal processing, using a pair of 1-D multiscale and multispectral filters, on the space-time volume (STV) of a video object (VOB) produces a gradually evolving (smoother) surface. Zero-crossing contours (2-D) computed using the mean curvature on this evolving surface are stacked in layers to yield a hilly (3-D) surface, for a joint multispectro-temporal curvature scale space (MST-CSS) representation of the video object. Peaks and valleys (saddle points) are detected on the MST-CSS surface for feature representation and matching. Computation of the cost function for matching a query video shot with a model involves matching a pair of 3-D point sets, with their attributes (local curvature), and the 3-D orientations of the finally smoothed STV surfaces. Experiments have been performed with simulated and real-world video shots using the precision-recall metric for our performance study. The system is compared with a few state-of-the-art methods, which use shape and motion trajectory for VOB representation. Our unified approach has shown better performance than other approaches that use combined match-costs obtained with separate shape and motion trajectory representations, and than our previous work on a simple joint spatio-temporal descriptor (3-D-CSS).

Index Terms—Content-based video retrieval, curvature scale space, Gabor filter, multispectral features, multispectro-temporal curvature scale space (MST-CSS), spatio-temporal descriptor, video object representation.

I. Introduction

Query by image content (QBIC) is a commercially available system developed by Flickner et al. [1] at the IBM research center. Other examples of content-based image retrieval (CBIR) include CHABOT [2], VisualSeek [3], Photobook [4], etc. The idea of representing an image based on its content has been extended to video, which has resulted in an important application area of research: content-based video retrieval (CBVR). Extensive research efforts have been made to represent and retrieve videos based on their visual content. Video retrieval systems which retrieve videos using low-level features such as color, texture, shape, motion, etc. include JACOB [5], VideoQ [6], Violone [7], Netra-V [8], VisualGREP [9], and InsightVideo [10]. Most of these systems extract features from key-frames detected from the frame sequence. Features are also extracted from the compressed domain for video retrieval [10]–[12].

Shape and motion are two important features for CBVR. Shape descriptors (spatial domain) are classified into two broad categories: region based and contour based. A contour-based descriptor encapsulates the shape properties of the object's outline (silhouette) [13]–[16]. MPEG-7 standardized curvature scale space (CSS), developed by Mokhtarian et al. [16], for multiscale representation of shapes. CSS emulates well the human perception of visualizing objects. The method is invariant to rotation, translation, and scale, and it is also robust with respect to noise.

Motion descriptors can be categorized as frame-based and object-based descriptors. Camera motion, motion activity, and warping parameters are frame-based descriptors, while motion trajectory is an object-based descriptor.
In VideoQ [6], shape is represented using a PCA-based approach, and the motion trail of video objects is matched using position differences. In this system, other low-level features are also used for video retrieval. Dao et al. [30] proposed an object-based video retrieval system based on shape and motion trajectory, where an edge potential function is used for representing both the shape and the motion trajectory. Recently, Basharat et al. [31] used interest point trajectories to generate video volumes. A set of features including color, texture, motion, and scale invariant feature transform (CSIFT) descriptors is extracted to represent the volume, and an earth mover's distance based approach is employed for the comparison of volume features. Experiments have shown promising results using the databases: TRECVID 2005 explore BBC rushes [32], online video archives [33], and the BBC motion gallery [34].

Spatial along with temporal information is a vital feature for content-based analysis of a video. Space-time analysis is widely used for video object segmentation [35]–[39], human action recognition [40]–[45], and other video processing applications. In most video retrieval techniques, video sequences are treated as collections of still images, extracting relevant key-frames and comparing their low-level features. Laptev [46] extended the notion of spatial interest points into the spatio-temporal domain and showed how the resulting features often reflect interesting events that can be used for a compact representation of video data as well as for interpretation of spatio-temporal events. LucentVision [47] is an instantly indexed multimedia database system developed for the sport of tennis. This system analyzes video from multiple cameras in real time and captures the activity of the players and the ball in the form of motion trajectories. The system stores these trajectories in a database along with the video, 3-D models of the environment, scores, and other domain-specific information. LucentVision has been used to enhance live television and Internet broadcasts with game analysis and virtual replays in more than 250 international tennis matches. Chatzis et al. [48] proposed a method for video representation and retrieval exploiting the spatio-temporal adjacencies and relations between regions segmented from the 3-D space-time volume of pixels. Dementhon and Doermann [39] proposed a method for retrieval of near-duplicate videos and recognition of actions in surveillance video. Very little research work has been done on integrating spatial and temporal information for CBVR. A detailed review of related literature is given in the next section.

In our work, we propose a novel unified spatio-temporal representation for the purpose of content-based video retrieval. Our method is a video object based representation, in which the shape contour and motion trajectory of a video object are integrated with the help of multiscale and multispectral processing. This is the novelty of our proposed approach—an integrated representation of shape and motion features used for CBVR. The video object is analyzed in space-time to reveal the curvature properties of an evolving surface of the video object volume, when processed with two efficient filters: a 1-D Gaussian filter (applied along the direction of the spatial parameter) and a 1-D derivative of Gabor filter (applied along the direction of the temporal parameter). The choice of these filters is motivated by the concept of curvature scale space [49], which uses the Gaussian function for efficient representation of shapes at multiple levels of smoothing, and by the multispectral approach using Gabor filters [50] for efficient representation of trajectories [23]. The video object is represented using salient features of a surface built from zero-crossing contours (ZCCs) obtained by computing the mean curvature on the evolving surface (of the video object). We have compared our unified approach with state-of-the-art methods which use shape and motion trajectory for CBVR [30], [51] and with our earlier proposed methods [23], [52]. The proposed unified representation [multispectro-temporal curvature scale space (MST-CSS)] has shown better performance than the other methods, which use combined match-costs for retrieval, and also than our previous unified representation (3-D-CSS) for CBVR.

Our proposed method works under the following assumptions. Videos are segmented into shots and made available in the database. A predominantly large object moves in the foreground of the video shot being analyzed. The background is static, as the scene is shot (real-world video) using a static camera with fixed viewing direction, pan, tilt, and zoom (this constraint can be relaxed using modern tools for compensating camera motion [53]–[55]). Our focus is on the representation of video objects and the design of an efficient similarity measure, rather than on processing of the video shots to extract the moving foreground object under complex situations. The rest of this paper is organized as follows. The methods related to our proposed approach are discussed in the next section. Section III elaborates on our proposed approach. Section IV presents our experimental results and analyzes the performance of our system compared with state-of-the-art methods. Section V summarizes and concludes the paper.

II. Brief Review of Related Work

Low-level features widely used for CBVR applications for measuring similarity between video sequences are color, texture, shape, motion intensity, and so on. These are image-specific features extracted from key-frames, whose match-costs are combined and used for retrieval (as in CBIR). With the development of the MPEG-4 standard for video-object coding, and its complementary standard (MPEG-7) for content-based description of video (objects), video object based representation is an emerging research area for CBVR. MPEG-7 has standardized CSS for representation of the shape feature of a video object. A CSS image is a multiscale representation of the inflection points (or curvature zero-crossing points) of the contour as it evolves. Contour evolution is achieved by first parameterizing the contour function using arc length (u). The two coordinate functions [x(u) and y(u)] are then convolved with a Gaussian filter of increasing width or standard deviation. The evolution process results in a fine-to-coarse description of the planar contour shape. Curvature zero-crossing points are detected from the curvature of the evolved contour and mapped to the CSS image, in which the horizontal axis represents the arc length parameter (u) on the original contour and the vertical axis represents the standard deviation of the Gaussian filter.
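As a concrete illustration of this contour evolution, the sketch below (not the authors' implementation; the function name and the σ schedule are our assumptions) smooths a closed contour with Gaussians of increasing standard deviation and records the curvature zero-crossings that form the CSS image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_image_points(x, y, sigmas):
    """Return (sigma, u-index) pairs of curvature zero-crossings of the
    closed contour (x(u), y(u)) smoothed at each sigma -- the CSS image."""
    points = []
    for s in sigmas:
        xs = gaussian_filter1d(x, s, mode="wrap")   # wrap-around: the contour is closed
        ys = gaussian_filter1d(y, s, mode="wrap")
        xu, yu = np.gradient(xs), np.gradient(ys)
        xuu, yuu = np.gradient(xu), np.gradient(yu)
        kappa = (xu * yuu - yu * xuu) / (xu**2 + yu**2) ** 1.5
        for u in np.where(np.diff(np.sign(kappa)) != 0)[0]:
            points.append((s, u))
    return points

# usage: an ellipse has no inflection points, so no CSS points are produced
t = np.linspace(0, 2 * np.pi, 256, endpoint=False)
print(len(css_image_points(2 * np.cos(t), np.sin(t), sigmas=range(1, 12))))
```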
The rest of this section of the literature review elaborately discusses three main categories of relevant work done in the area of CBVR: 1) motion trajectory based representation; 2) matching shape and motion features of query and model separately, for a combined cost function for video retrieval; and 3) elementary methods of unifying shape and motion features into a joint representation, used for CBVR.

Motion trajectory is an important feature for representing video shots for the purpose of motion-based video retrieval. Bashir et al. [27] and Chen et al. [28] have shown experimentally that segmented trajectory representations allow more accurate trajectory retrieval than global trajectory representations. Among the different methods of global trajectory representation, polynomial [19] and Fourier based descriptors [56] are considered to be efficient in terms of applicability and use. Polynomial representation has the drawback of being unable to represent sharp changes in a trajectory; it only provides an approximation by segmenting the curve at junctions of sharp changes and then modeling it piecewise. Fourier descriptors cannot represent the temporal occurrence of a frequency change in a signal—they represent well only the frequency content. Global trajectory representation has the disadvantage that it is unable to handle partial trajectory matching, which is overcome by segmented trajectory representation methods.

During recent years, segmented trajectory representation methods [24], [25], [27] have gained importance. In [24], a hybrid motion-based video retrieval system which includes both a sketch-based scheme and a string-based one is proposed. Bashir et al. [27] segment the object trajectories using the points with high curvature and then represent the segments by their PCA coefficients; spectral clustering and string-based approaches are used for matching and retrieval. In [25], geometric and syntactic features are used for representation, and edit distance is used for matching. In our previous work on motion-based video retrieval [23], a trajectory is represented using multi-channel spectral features along with their semantic meaning. This (global) representation is able to capture significant changes in the path of the trajectory using a Gabor filter bank. Salient points are detected on the path using the spectral response from the Gabor filter bank, which identifies positions of sharp changes. The positions and spectral features of the salient points are the key features used to represent the trajectory. Edit distance (string) matching was configured for matching and retrieval. The method has shown better performance than the segmented representations [25], [27] as well as the global representation methods [19], [56], using large synthetic and real-world datasets.

VideoQ is the first online object-based video search engine, developed by Chang et al. [6]. VideoQ retrieves videos based on color, texture, shape, and motion of video objects. Violone, developed by Yoshitaka et al. [7], is a video retrieval system which uses a chain code to represent the object's motion trajectory. Netra-V is a video object based retrieval system proposed by Deng et al. [8], which segments video objects and represents the color, texture, shape, and motion features of the video object. A color histogram, a texture descriptor using a Gabor filter bank, affine motion parameters, and fast Fourier transform (FFT) features for shape are used to represent and retrieve the video objects. Dao et al. [30] proposed an object-based video retrieval system based on shape and motion trajectory. An edge potential function was used for representing both shape and motion trajectory. A weighted sum of the partial costs obtained from matching shape and trajectory features of query and model provides the final "match-cost." Choice of optimal weights is a critical issue in this approach. The method was tested using real-world video clips, which contain short racing clips. Precision and recall measures are used to evaluate the system using queries of different types: 1) only trajectory; 2) key-video object plane (VOP) and trajectory; and 3) video clip. It is reported that query by trajectory gives the best result for a real-world video database. Anjulan and Canagarajah [57] proposed an object retrieval system in which video segmentation, feature extraction, feature clustering, and object retrieval are combined using a single feature, based on local invariant region descriptors. The method is claimed to be highly robust to camera and object motions and can withstand severe illumination changes. The performance of the system is evaluated using precision and recall values, based on the number of falsely and correctly retrieved clusters.

Simultaneous spatial and temporal processing has been applied in video object segmentation methods, which consider video as a spatio-temporal block of pixels [58]. Spatio-temporal descriptors are used to represent the position and motion of regions in [48], [59]. The position, motion, and color describe the region in a high-dimensional space; K-nearest neighbor retrieval was used in [59]. In [48], a graph-based description was used to describe the relations between regions: average position, motion velocity, and color are used as spatio-temporal descriptors to describe each region, and the spatio-temporal relations between regions are represented using a graph structure. In [37], unsupervised clustering via a Gaussian mixture model (GMM) was used to extract space-time blobs. The modeling and the segmentation are combined to enable the extraction of video-regions that represent coherent regions across the video sequence. In general, temporal and spatial characteristics have not been adequately addressed in most video retrieval systems despite their obvious importance [60]. Very little research work has been done on representing a video object by a unified descriptor based on its low-level but significant features (shape and motion trajectory). In our previous work [52], as a first attempt at a unified spatio-temporal representation (3-D-CSS) for CBVR, we extended the idea of CSS [16]. A 2-D Gaussian filter (with parameter σ) was used to iteratively convolve the STV surface (of the VOB), yielding a sequence of evolved surfaces. ZCCs on the evolving surface are stacked as 2-D layers to form the feature space (termed 3-D-CSS). Peak points on the 3-D-CSS surface (u, v, σ) are used as features to represent the VOB. This method has shown promising results for CBVR, but it is mostly sensitive to variations in shape and not to those in trajectory. Motion trajectories are not closed curves, unlike shape contours. Hence, smoothing along the temporal axis for the 3-D-CSS representation is not suitable for capturing the sharp changes (vital information) of the motion trajectory.

The literature review suggests that a joint (unified) spatio-temporal representation of a video object volume (or STV) has not been exploited thoroughly for CBVR. Methods based on combining the cost functions of shape and trajectory rely on a weighted combination of separately computed match-costs, with the attendant problem of choosing optimal weights.
A. Overall Framework

The main focus of our system is to exploit the spatio-temporal features present in a video shot and use them for retrieving video shots from the database (gallery of video shots), rank-ordered by similarity in content. The overall framework of our system is shown in Fig. 1. There are two phases, namely database population and database querying. During database population, features are extracted from model video shots and the corresponding spatio-temporal descriptors (MST-CSS representation) are stored in the database. While querying the video database, the query can be of two types: by example and by sketch. Features are extracted from the query and represented using the spatio-temporal descriptor. The represented features of the query are matched with the stored feature descriptors from the database. The model video shots are ordered based on the ranks assigned in accordance with their similarity to the query and then retrieved.

We use the shape and motion trajectory of a moving video object to build a unified spatio-temporal feature representing a video shot (either a model or a query video shot). These two distinct features are generally used for CBVR [6], [30] with separate descriptors, and video shots are retrieved based on a weighted combination of match-costs obtained from matching the corresponding shape and motion features. The choice of optimal weights for different queries is a critical issue in this approach. In our work, we represent the space-time volume (STV) using a unified (joint) spatio-temporal descriptor. The STV is obtained from the segmented foreground moving object extracted from a video shot. This is used to efficiently represent the features over space and time, to retrieve similar videos based on their content. We use existing algorithms to segment the moving video object from a video shot: a GMM is used to model the background, as described in [62], and the code available in the "open source computer vision library" [63] is used for the implementation of video object segmentation. The median frame of the video shot is chosen to extract the shape of the moving foreground object. Fig. 2(a) shows an example of a frame from a video shot and its corresponding extracted shape contour [Fig. 2(b)]. The centroid of the foreground object is tracked to retrieve the motion trajectory of the VOB [Fig. 2(c)]. The space-time volume of the video object is constructed by sweeping (translating in 3-D) the shape contour along the motion trajectory, which forms a sequence of 2-D contours and is visualized as a 3-D surface in (x, y, t) space. The multiscale and multispectral features extracted from this STV are used to represent the moving VOB segmented from the video shot, or a virtual VOB (query by sketch). The algorithm used for this representation is discussed in the following.

B. Spatio-Temporal Processing

The space-time volume (STV) is defined using a parametric surface, formed by the movement of the contour of the video object over time. The parametric surface (STV) is represented using a pair of parameters [spatial (u) and temporal (v)] as

r(u, v) = (x(u, v), y(u, v), v).    (1)

The STV, which consists of both spatial and temporal information, is processed using multiscale and multispectral filters to represent the video object. Multiscale processing is performed along the spatial parameter (shape contour) of the STV.
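A minimal sketch of assembling the STV of (1) from an extracted contour and a tracked centroid trajectory follows; the helper name, array shapes, and the sample data are our assumptions, not the authors' code.

```python
import numpy as np

def build_stv(contour_xy, trajectory_xy):
    """Space-time volume r(u, v) = (x(u, v), y(u, v), v).
    contour_xy    : (Nu, 2) sampled closed shape contour
    trajectory_xy : (Nv, 2) tracked centroid position, one entry per frame
    Returns X, Y of shape (Nu, Nv), the coordinate functions of the surface."""
    c = contour_xy - contour_xy.mean(axis=0)                 # centre the contour on its centroid
    X = c[:, 0][:, None] + trajectory_xy[:, 0][None, :]      # translate the contour along the path
    Y = c[:, 1][:, None] + trajectory_xy[:, 1][None, :]
    return X, Y

# usage: a circular contour (200 samples) swept along a 50-frame straight path
theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
contour = np.c_[np.cos(theta), np.sin(theta)]
path = np.c_[np.linspace(0, 40, 50), np.linspace(0, 10, 50)]
X, Y = build_stv(contour, path)                              # each column is one frame's contour
```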
This multiscale processing is similar to the curvature scale space approach for shapes proposed by Mokhtarian et al. [16], which simulates human perception of shapes very well. On the other hand, multiscale approaches have not shown better performance for trajectories [22], [52]. In [52], smoothing at multiple scales along the trajectory dimension of the STV could not capture the sharp changes (vital information) of the motion trajectory. This is due to the fact that motion trajectories are not always closed curves, unlike shape contours. The method of 2-D-CSS [22], generally used for representing shape contours, has been applied to represent motion trajectories, but the reported performance was lower than that of PCA-based approaches. The multispectral approach for trajectories has shown better performance than PCA-based approaches in our earlier work [23]. Hence, for our unified representation (MST-CSS), we simultaneously perform multispectral processing along the temporal parameter and multiscale processing along the spatial parameter.

For the MST-CSS representation, we apply orthogonal 1-D processing of the STV surface: smoothing along the spatial parameter (u) and multi-channel processing along the temporal parameter (v). This process is applied iteratively to obtain a sequence of evolved surfaces (of the STV) which become gradually smoother. ZCCs are detected on the sequence of evolving surfaces as features to represent the VOB. This yields a unified spatio-temporal descriptor for the VOB. The overall processing of the space-time volume for extracting features to represent the VOB is described in Algorithm 1.

Algorithm 1 Extracting the MST-CSS surface from a VOB
Input: Space-time volume (STV); Output: MST-CSS surface.
1) Initialize σu for the 1-D Gaussian filter to a minimum value (σmin); ωv for the 1-D Gabor filter to a maximum value (ωmax).
2) Convolve the space-time volume with 1-D orthogonal filters [the Gaussian filter (g) along the direction of the spatial parameter and the derivative of Gabor filter (gab) along the direction of the temporal parameter], which yields a deformed surface

   {(X(u, v; σu, ωv), Y(u, v; σu, ωv)) | u ∈ [0, 1], v ∈ [0, 1]}    (2)

   where

   X(u, v; σu, ωv) = (x(u, v) ⊗ g(u; σu)) ⊗ gab(v; ωv)
   Y(u, v; σu, ωv) = (y(u, v) ⊗ g(u; σu)) ⊗ gab(v; ωv)    (3a)

   g(u; σu) = (1 / (σu √(2π))) exp(−u² / (2σu²))    (3b)

   is a 1-D Gaussian function, and

   gab(v; σv, ωv) = (1 / (σv √(2π))) (−v/σv² + jωv) exp(−v² / (2σv²) + jωv v)    (3c)

   is the derivative of the 1-D Gabor filter, with the constraint σv · ωv = 1.6.
3) Compute the mean curvature of the evolved surface as

   κ(u, v) = (κu(u, v) + κv(u, v)) / 2    (4)

   where

   κu = [(∂uX)(∂uuY) − (∂uY)(∂uuX)] / [(∂uX)² + (∂uY)²]^(3/2)
   κv = [(∂vX)(∂vvY) − (∂vY)(∂vvX)] / [(∂vX)² + (∂vY)²]^(3/2)    (5)

4) Obtain ZCCs by identifying locations where κ(u, v) = 0, to locate the ZCCs in the 3-D space (u, v, σu).
5) Increment σu = σu + Δσ and decrement ωv = ωv − Δω.
6) Repeat steps 2–5 until no ZCCs are detected.
7) The sets of ZCCs (2-D) obtained with the increasing values of σu and decreasing values of ωv are stacked as layers to form the MST-CSS feature surface.

The STV is iteratively convolved using a set of 1-D Gaussian functions along the u dimension and 1-D derivatives of Gabor filters along the v dimension. The significance of using the derivative of the 1-D Gabor filter as a spectral operator is based on the facts that: 1) the Gabor function is a tuned bandpass filter which satisfies the lower bound of the uncertainty principle [64]; and 2) the derivative function produces an attenuated response at locations of sharp changes in the coordinates (x or y) of the trajectory (as used in [23]). After each iteration, we increment the standard deviation of the Gaussian function and decrement the frequency of the derivative of the Gabor filter (3). This sequence of orthogonal processing of the STV surface yields a family of evolved surfaces. The evolved surface (of the STV) gradually becomes smoother, forming an ellipsoidal cylindrical structure. The Gaussian function in both filters, applied along orthogonal directions [see (3) in step 2 of Algorithm 1], helps in this smoothing process.

Mean curvature is computed (4) on each of the evolved surfaces (of the STV) formed as a result of the sequence of orthogonal filtering. Mean curvature is an extrinsic measure which describes the local curvature of a surface. Zero-crossing curvature contours, also called parabolic curves [65], are found from the mean curvature at each iteration. ZCCs are formed at the boundaries between elliptic (convex) and hyperbolic (concave) regions. The process (steps 2–5 in Algorithm 1) is repeated until no ZCCs are detected. The sets of ZCCs of the evolving VOB surface are then stacked as layers to form a hilly surface called the MST-CSS surface, which is used to represent the VOB.

The feature extraction process is now illustrated using an example in Fig. 3. Fig. 3(c) shows the wireframe diagram obtained by sweeping (3-D translation) the contour of the object shape "teddy" (from the dataset [66]), shown in Fig. 3(a), along the trajectory shown in Fig. 3(b). The contour of the teddy is represented by its x and y coordinates, and the teddy is moved along the trajectory over time. The change in the contour position is depicted along the v-axis. The surface thus formed, as a stack of 2-D contours, is termed the space-time volume (STV), as shown in Fig. 3(c). The surface can be interpreted as a set of normalized 2-D contours, where the z-coordinate is v.
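To make one iteration of Algorithm 1 concrete, the sketch below applies the orthogonal 1-D filtering of (2)–(3) and the mean curvature of (4)–(5) to the STV coordinate functions. It is only an interpretation under stated assumptions: the derivative-of-Gabor kernel is sampled directly from (3c) with a finite half-length, and taking the magnitude of the complex temporal response as the deformed coordinate is our reading, not necessarily the authors' implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d, convolve1d

def gabor_derivative_kernel(omega, bandwidth=1.6, half_len=32):
    """Sampled 1-D derivative-of-Gabor kernel of (3c), with sigma_v * omega = bandwidth."""
    sigma = bandwidth / omega
    v = np.arange(-half_len, half_len + 1, dtype=float)
    return ((-v / sigma**2 + 1j * omega)
            * np.exp(-v**2 / (2 * sigma**2) + 1j * omega * v)
            / (sigma * np.sqrt(2 * np.pi)))

def evolve_step(X, Y, sigma_u, omega_v):
    """One step of Algorithm 1: Gaussian along u (rows), Gabor derivative along v (columns)."""
    k = gabor_derivative_kernel(omega_v)
    def filt(Z):
        Zs = gaussian_filter1d(Z, sigma_u, axis=0, mode="wrap")   # closed contour -> wrap along u
        Zr = convolve1d(Zs, k.real, axis=1, mode="nearest")       # temporal multi-channel response
        Zi = convolve1d(Zs, k.imag, axis=1, mode="nearest")
        return np.hypot(Zr, Zi)                                   # magnitude of the complex response
    return filt(X), filt(Y)

def mean_curvature(X, Y):
    """kappa = (kappa_u + kappa_v) / 2, as in (4)-(5)."""
    def k(axis):
        Xd, Yd = np.gradient(X, axis=axis), np.gradient(Y, axis=axis)
        Xdd, Ydd = np.gradient(Xd, axis=axis), np.gradient(Yd, axis=axis)
        return (Xd * Ydd - Yd * Xdd) / (Xd**2 + Yd**2 + 1e-12) ** 1.5
    return 0.5 * (k(0) + k(1))
```

Zero-crossings of the returned κ(u, v), collected over iterations with increasing σu and decreasing ωv, would then be stacked to form the MST-CSS surface (step 7 of Algorithm 1).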
Fig. 3. (a) Shape contour. (b) Trajectory. (c) Space-time volume (3-D VOB) formed for the shape in (a) and its trajectory in (b). (d) Evolved surface at σ=7
and ωv = 0.23. (e) Plot of mean curvature computed on the evolved surface shown in (d). (f) Location of zero-crossing contours in 3-D space (u, v, σ) for
the evolved surface in (d) detected from the mean curvature plot in (e). (g) Detected ZCCs marked on the evolved surface in (d).
Fig. 4. Stages of multispectro-temporal and multiscale processing for the VOB in Fig. 3(c). (a), (c), (e) Evolved surfaces with ZCCs marked (with red dots)
on the evolved surfaces at ωv = 0.3, 0.15, 0.05 and corresponding σu = 2, 10, 14. (b), (d), (f) MST-CSS surfaces corresponding to the evolved surfaces in (a),
(c), and (e), respectively (σu appears as sigma). (g) Eventually transformed VOB, as an ellipsoidal cylindrical surface. (h) MST-CSS surface for the VOB in
Fig. 3(c).
This surface is processed along orthogonal directions using a pair of 1-D filters (3). As the surface evolves, it is smoothed along the spatial dimension, while the short-time frequency response is reflected (as high-curvature zones) along the temporal dimension. Fig. 3(d) shows an evolved surface obtained by a single step of orthogonal 1-D filtering, where the parameters of the filters are σu = 7 and ωv = 0.23. The mean curvature computed on this evolved surface is shown in Fig. 3(e). Zero-crossing contours are detected from the mean curvature plot (at κ = 0) and are marked in the 3-D (MST-CSS) space (u, v, σ), as shown in Fig. 3(f). Fig. 3(g) shows the ZCCs marked (in red) on the evolved surface in Fig. 3(d).

Fig. 4(a), (c), and (e) shows three typical examples of the evolved (smoothed) surface obtained at different stages (iterative steps) of Algorithm 1, from the STV, with different values of the standard deviation of the Gaussian filter (σ) and the frequency of the derivative of the Gabor filter (ω). The parameters of the filters for the examples shown in Fig. 4(a), (c), and (e) are σ = 2, 10, 14 and ωv = 0.3, 0.15, 0.05, respectively. The corresponding stages of the MST-CSS surface (being built) are shown in Fig. 4(b), (d), and (f). The ZCCs (parabolic curves) detected from the mean curvature plot are marked (for illustration) on the smoothed surfaces. After applying this sequence of processing steps for a few iterations, the STV surface eventually transforms into a smooth ellipsoidal, cylindrical structure [see Fig. 4(g)]. When viewed along the temporal axis (v), a cross-sectional slice of the smoothed structure results in an ellipse-like contour. The ZCCs are obtained at each iteration step from the evolving surface, with incrementing values of σu of the 1-D Gaussian filter and decrementing values of ωv of the 1-D derivative of the Gabor filter. These are then stacked as 2-D layers to form a hilly-terrain-like surface, which we term the MST-CSS surface [shown in Fig. 4(h)]. In the following section, the discriminating ability of the MST-CSS representation is illustrated with a few examples of MST-CSS surfaces obtained from different VOBs. This property is essential for computing the similarity measure when comparing two similar/dissimilar VOBs.

C. MST-CSS Surface

The MST-CSS surface is formed as a result of a sequence of orthogonal processing steps on the space-time volume, using a set of 1-D filters (Gaussian and derivative of Gabor). Multiscale processing along the spatial parameter (using the 1-D Gaussian function) results in smoothing of the surface along the u direction. Multi-channel processing (using the derivative of the Gabor filter) along the temporal parameter (v) deforms the STV surface based on the characteristics [23] of the trajectory of the video object. As a result of this processing, the STV surface gradually evolves to become smooth, with a certain geometrical flow pattern. This pattern of evolution is unique for a VOB and can be captured using the dynamics (time-varying nature) of the extrinsic curvature of the surface, as the STV surface evolves with its geometrical properties. The ZCCs on the evolving surface, detected using the mean curvature, are stacked as 2-D layers to form the MST-CSS surface. The dynamics of the ZCCs on the STV surface are reflected by changes in both the local curvature and the global shape of the evolving STV surface. These in turn depend on the shape contour and trajectory of the VOB.

The motivation for extracting multispectral features from the STV is based on the success achieved using the motion-trajectory representation proposed in our previous work [23]. In the MST-CSS algorithm, the processing of the temporal information is similar to the multispectral approach, which intrinsically extracts the features from the MST-CSS surface. Multi-channel processing along the temporal direction produces an attenuated response at the zones of the STV surface wherever there are deviations corresponding to changes in the trajectory/path of the video object. These bends occur in the generally smoothed STV surface depending on the (temporal) locations and amount of the deviations in the trajectory. The response of our multi-channel processing (convolution) is proportional to the amount of deviation and occurs at its location (see [23] for a similar concept, which is exploited here in 3-D).

Fig. 5 shows four MST-CSS surfaces obtained for VOBs formed by pairing two similar shapes and two similar trajectories. The video objects are formed by sweeping (2-D translation) the corresponding shape contours (see top row) along the trajectories (see left column). We obtain similar representations for all four similar video objects. This is seen from the (qualitatively) similar MST-CSS surfaces generated using the shape-trajectory pairs used to form the VOBs. Similar shapes and trajectories produce similar MST-CSS surfaces (Fig. 5). Figs. 5 and 6 illustrate the discriminating ability of the MST-CSS representation to distinguish between different combinations of trajectories and shape contours: two different shape contours along with two different trajectory paths are used in both cases. For the same trajectory with different shapes, the MST-CSS surfaces (see the horizontal pairs) have a similar formation (structure) when viewed along the u-axis (or projected on the v–σ plane). For the same object with different motions (trajectories), the MST-CSS surfaces (see the vertical pairs of surfaces) have similar structures when viewed along the v-axis (or projected on the u–σ plane). Using the examples in Figs. 5 and 6, one observes that different combinations (pairings) of trajectories and object shapes produce discriminating structures of the hilly MST-CSS surface. This exhibits the scope of uniqueness in the algorithm to represent the spatio-temporal information of a video object (in a video shot), which is vital for superior performance in CBVR. We fall short of proving the uniqueness of the MST-CSS surface for a particular/given VOB; this may form an interesting problem of study in the fields of digital geometry and pattern recognition.

D. MST-CSS Surface Descriptors

The MST-CSS surface, obtained by the method described in Algorithm 1, can be visualized as a 3-D plot of a stack of the zero-crossing curvature contours obtained at different levels of processing. The axis variables are the spatial parameter (u), the temporal parameter (v), and the scale (σu). We use the standard deviation of the Gaussian filter (σu) as an index to refer to the different stages of simultaneous multispectro-temporal and multiscale processing on the STV. For simplicity, we use the symbol σ (referring to σu) in the 3-D MST-CSS plots.
Fig. 5. MST-CSS surfaces for two similar shapes and two similar motion trajectories (similar VOBs).

Fig. 6. MST-CSS surfaces for two dissimilar shapes and two dissimilar motion trajectories (dissimilar VOBs).

Fig. 7. Two typical examples of MST-CSS surfaces, with peak and saddle points, shown in (a) and (b).

The salient features detected on the MST-CSS surface are represented as

F = {P⟨κu^P, κv^P⟩, S⟨κu^S, κv^S⟩, O}    (6)

where

P = {(ui, vi, σi)},  i = 1, 2, ..., np
S = {(uj, vj, σj)},  j = 1, 2, ..., ns

np is the number of peaks and ns is the number of saddle points on the MST-CSS surface. Our previous work [52] on a simplified joint representation of the VOB used only the peaks of the 3-D-CSS surface for feature representation. Its implementation here (for the comparative study of performance in Section IV) has been done using both the peaks and the saddle points of the 3-D-CSS surface, as in the case of the MST-CSS representation.

P and S denote the sets of peak and saddle points (detected locally) on the MST-CSS surface. The attributes κu and κv indicate the local extrinsic curvature along the u and v directions, respectively, as given in (4). The symbol O in (6) represents the angle of orientation of the eventually (finally) evolved STV surface (forming an ellipsoidal cylindrical structure) with respect to the temporal direction v. This angle (quantized in octants) mainly depends on the initial and final positions on the trajectory of the moving object.

Fig. 8 shows the feature descriptors obtained using MST-CSS processing for the two similar VOBs in Fig. 8(a) and (b). The locations of the sets of peak and saddle points in the 3-D MST-CSS domain, after alignment of the feature sets of the pair of VOBs using the generalized Hough transform (GHT) (discussed in the next section), are shown in Fig. 8(c) and (d), respectively. The GHT is used for alignment of the features, in order to make the matching invariant to translation and rotation of the VOB. For two similar VOBs, the peak/saddle points of the two VOBs overlap or lie close to each other (have a similar pattern), thus yielding a lower match-cost (discussed in the next section). It must be noted in Fig. 8(c) and (d) that in a few cases a perfect overlap of features may result in only one of the features being visible in the plot (red or blue marker only). Fig. 9(c) and (d) shows the feature descriptors obtained using MST-CSS processing (after alignment) for the two dissimilar VOBs given in Fig. 9(a) and (b).
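A rough sketch of extracting the peak and saddle points used in (6) follows. It assumes the stacked ZCC layers have been rasterized into a height map H(u, v) holding the largest σ at which a zero-crossing occurs at (u, v); detecting saddles from the sign of the Hessian determinant is our choice for illustration, not necessarily the authors' detector.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def peaks_and_saddles(H, grad_tol=1e-3):
    """Return (peak, saddle) index arrays for a hilly MST-CSS height map H(u, v)."""
    # peaks: local maxima of the height map within a 3x3 neighbourhood
    peak_mask = (H == maximum_filter(H, size=3)) & (H > 0)
    # saddles: (near-)flat points whose Hessian determinant is negative
    Hu, Hv = np.gradient(H)
    Huu, Huv = np.gradient(Hu)          # second derivatives along u, then v
    _, Hvv = np.gradient(Hv)
    det = Huu * Hvv - Huv**2
    saddle_mask = (np.abs(Hu) < grad_tol) & (np.abs(Hv) < grad_tol) & (det < 0)
    return np.argwhere(peak_mask), np.argwhere(saddle_mask)

# each detected (u, v) would be paired with its height sigma = H[u, v] and the
# local curvature attributes (kappa_u, kappa_v) to build the feature set F of (6)
```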
The sets of peak and saddle points of the dissimilar VOBs have dissimilar (distinct) patterns, with many isolated or unmatched peak/saddle points, and hence produce a much higher match-cost. Hence, similar VOBs matched at the retrieval stage will be ranked higher than the dissimilar ones.

Fig. 8. Two similar VOBs: (a) VOB1 and (b) VOB2; feature descriptors for the VOBs in (a) and (b). (c) Peak points and (d) saddle points of the pair of VOBs, overlapping in most cases in the MST-CSS domain, after using the GHT for alignment.

Fig. 9. Two dissimilar VOBs: (a) VOB3 and (b) VOB4; feature descriptors for the VOBs in (a) and (b). (c) Peak points and (d) saddle points of the pair of VOBs, mostly not overlapping in this case in the MST-CSS domain, after using the GHT for alignment.

In our system, velocity is not considered as a feature to represent the motion trajectory, as has been done in [24], [28]. A change in velocity will cause the video object volume to scale (due to a change in the number of sample points available) along the temporal (v) axis. This in turn will cause an alteration in the feature attributes, as the peaks and valleys in the MST-CSS surface will then appear shallow or pronounced (due to low or high speed, respectively) along the temporal (v) axis. Hence, the proposed representation scheme provides scope for the parameters of the model to be made sensitive to changes in velocity, although we have not used velocity variations in our experiments.

We hypothesize that the feature representation of the MST-CSS surface, given in (6), provides a complete and discriminatory representation of the STV surface of the VOB. This is verified experimentally using simulated datasets of VOBs and real-world video shots (Section IV).

E. MST-CSS Matching

Every video object model in the database is represented by the locations of its peak and saddle points (u, v, σ), along with their attributes (κu, κv). The values of the features and the attributes are first normalized in the range [0–1]. Any change in the starting position of the shape contour (2-D translation) causes a horizontal shift (with a wrap-around) along the u-axis of the MST-CSS surface. Hence, the set of peak and saddle points of the query must be aligned with the set of peak and saddle points of the model. This translation parameter is obtained using the GHT [68], which is then used for alignment of the surfaces to be matched. To align the two surfaces, transformations between all possible pairs of points (peaks) in the two scatters are considered. For every possible pair (one point taken from each set to form the pair) of point correspondences, the translation that maps one point onto the other is obtained. This value is quantized to the nearest bin corresponding to the parameter value in the GHT space, and the bin count is incremented. Finally, the bin with the highest number of votes is chosen as the translation parameter for aligning (registering) the two point sets. This concept is similar to that used for matching two (minutiae) feature sets of fingerprints [69], which we have simply adopted here.

After alignment, the set of peaks of the query [represented by triplets (u, v, σ)] is matched with the set of peaks of the model. The number of saddle points is generally smaller than that of the peaks, and hence only the peaks are used for the alignment of the features of query and model. The peaks of the query are paired with their nearest counterparts from the model. The Euclidean distance (in 3-D space) between the matched peaks, together with the difference in their attributes, contributes to the match-cost. For all the unmatched peak points, the heights of the peaks (σ) are taken to contribute to the match-cost. This factor acts as a penalty (a component in the overall cost function) for each of the unmatched peaks occurring in the pair of MST-CSS representations of the STVs being matched. The higher the value of σ of a peak, the more prominent is the peak in the MST-CSS surface representation; if such a peak is unmatched, it reveals dissimilarity (the more numerous or prominent the set of unmatched peaks, the greater the dissimilarity) between the pair of STVs being considered for matching. Hence, we provide this element as an additive component in the match-cost, to reflect the degree of dissimilarity between the pair of STVs being matched. These distances (from matched and unmatched features) in the 3-D (u, v, σ) space are summed in the final match-cost function used for matching. Similarly, the match-cost also incorporates the distance values obtained by matching the saddle points of the query with those of the model.
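A minimal sketch of the translation-voting (GHT) alignment described above, assuming normalized peak triplets (u, v, σ) and a fixed bin width; the function name and bin width are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def ght_u_shift(query_peaks, model_peaks, bin_width=0.05):
    """Vote over u-translations (with wrap-around) that map query peaks onto model
    peaks; return the shift of the most-voted bin (features normalized to [0, 1])."""
    votes = Counter()
    for uq, vq, sq in query_peaks:
        for um, vm, sm in model_peaks:
            du = (um - uq) % 1.0            # horizontal shift along u wraps around
            votes[int(round(du / bin_width))] += 1
    best_bin, _ = votes.most_common(1)[0]
    return best_bin * bin_width

# usage: apply the shift to the query peaks before nearest-neighbour pairing
query = np.array([[0.10, 0.2, 0.5], [0.40, 0.7, 0.8]])
model = np.array([[0.35, 0.2, 0.5], [0.65, 0.7, 0.8]])
print(ght_u_shift(query, model))            # -> 0.25 (up to float rounding)
```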
In addition, the difference of the encoded angles (O) between the query and the model is added to the final match-cost. Hence, the three components of the final match-cost computation are: 1) peak matching; 2) saddle point matching; and 3) the difference in orientation. The relative weights assigned to the three components of the match-cost are 0.45, 0.45, and 0.1, obtained empirically for optimal performance.

Fig. 10. Comparing match-costs for different VOBs. (a) A query VOB. (b) Table of match-costs for model VOBs constructed with different combinations of shape and trajectory pairs. The four entries in a cell of the table correspond to: (top-most) the MST-CSS representation, (left) CSS [16] combined with the multispectral approach [23], (right) the edge potential method [30], and (bottom-most) the 3-D-CSS [52] representation.

Fig. 11. Performance with the precision-recall metric averaged over 100 queries, using simulated datasets formed with the MPEG7-B [66] dataset for shapes and (a) the ASL [71] dataset and (b) Hsieh's [70] database for motion trajectories.

Similar models are retrieved according to the final match-cost: the lower the match-cost, the higher the similarity of the model to the query. The system has the property that it is invariant to translation, rotation, and scaling of shape. This is based on the principle that 2-D-CSS is invariant to rotation, translation, and scale. The system is also invariant to translation and scale of motion trajectories. It works on the principle that similar video objects should have similar joint representations of shape and motion trajectory. In the case of a closed trajectory, we obtain a contour for which the starting and ending points are identical; hence, in such a case, the angle term O does not exist (is invalid). If this occurs for any one of the STVs being matched, we switch off the third component of the cost function, which involves the angle term.

We compare the match-costs obtained using our unified approach with those of a few state-of-the-art methods, for two separate VOBs. The query VOB shown in Fig. 10(a) is matched with four similar [intersection of the top two rows with the left two columns in the table of Fig. 10(b)] and five dissimilar [the last row and last column of Fig. 10(b)] VOBs. Fig. 10(b) shows the comparison of the match-costs (in each cell of the table) obtained using: 1) the MST-CSS approach as the top entries; 2) CSS [16] combined with the multispectral approach [23] as the left entries; 3) the work by Dao et al. [30] as the right entries; and 4) 3-D-CSS [52] as the bottom entries. The match-cost values are, respectively, normalized in the range [0–1]. Similar VOBs (entries at the intersection of the first two rows and columns in the table) produce lower match-costs for all the methods. The MST-CSS representation is the most sensitive to small variations of shape and trajectory, followed by the unified representation (3-D-CSS), compared to the methods that combine match-costs. For a dissimilar trajectory (see the last column), the match-costs obtained using the MST-CSS approach are higher than those of the other methods. For dissimilar shapes (bottom row), the 3-D-CSS approach produces the highest cost, as it is sensitive and responsive to shape variation. Overall, the discriminating power of the unified MST-CSS descriptor is the best compared with representations using separate feature sets. VOBs are rank-ordered based on the match-costs and retrieved.

IV. Performance Analysis

We have analyzed the performance of our proposed approach using the precision-recall metric. Precision and recall are expressed as

Precision(N) = C_N / N,    Recall(N) = C_N / M    (7)

where N is the number of retrieved samples, C_N is the number of relevant samples among the N retrieved samples, and M is the total number of relevant samples in the database.

We have experimented with three different datasets: we have created two virtual datasets with 2000 video objects, and a real-world database consisting of 480 real-world video shots. The synthetic database used for our experiments consists of 1000 shapes from the MPEG7-B [66] dataset and 1000 trajectories available from Hsieh's [70] database. The sets of X and Y coordinates representing a shape are normalized by sampling the shape contour with 200 points. Similarly, the motion trajectory is sampled and represented by the same number of points.
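For reference, the metric in (7) written out as a small helper (illustrative names only):

```python
def precision_recall(retrieved, relevant):
    """Precision(N) = C_N / N and Recall(N) = C_N / M for a top-N retrieved list,
    where C_N counts retrieved items that are relevant and M = |relevant|."""
    c_n = sum(1 for item in retrieved if item in relevant)
    return c_n / len(retrieved), c_n / len(relevant)

# e.g., 3 relevant hits in the top 4, out of 5 relevant models -> (0.75, 0.6)
print(precision_recall(["m1", "m2", "m7", "m3"], {"m1", "m2", "m3", "m4", "m5"}))
```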
Fig. 12. Query video objects from two simulated datasets formed with MPEG7-B [66] dataset for shapes, and (a) ASL [71] and (b) Hsieh’s [70] databases
for motion trajectories. (c) and (d) Top 12 retrieved video objects indexed with decreasing order of similarity for the query video objects in (a) and (b),
respectively. The shape and motion trajectory are shown in all cases.
Fig. 14. (a) and (b) Two query video shots from the real-world database [73]. (c) and (d) Top 12 retrieved video shots indexed with decreasing order of
similarity for the query video shots in (a) and (b), respectively. The median frame and motion trajectory are shown for all cases.
TABLE II
Comparison of the Average Computational Time (in s) for Feature Representation

Methods    MST-CSS    3-D-CSS    CSS + Multispectral    Dao et al. [30]
Time       3.2        3.1        3.0                    3.0

We have also compared the average computational time (on the same machine) of our proposed methods—MST-CSS, 3-D-CSS [52], and CSS [16] with the multispectral approach [23]—and the state-of-the-art method [30]. The computational time has been obtained (as shown in Table I) by averaging the time required to represent 100 VOBs; in the case of matching (see Table II), it is averaged over 100 retrievals for each of the 50 queries, using the synthetic datasets. Table I shows that 3-D-CSS is the costliest method (although the representation of models can be done offline) due to the presence of a 2-D convolution as the filtering operation. In Table II, all four methods show similar computational times (within 5% variation) for matching a query with a model.

V. Conclusion

We have fused two important features, shape and motion trajectory, to obtain a novel joint spatio-temporal representation of video objects for the purpose of content-based video retrieval. In MST-CSS, we convolve the space-time volume with a pair of 1-D orthogonal filters (Gaussian and derivative of Gabor filter), which yields a family of evolved (deformed) surfaces. ZCCs are detected using the mean curvature computed on the evolving surface and then stacked in the form of 2-D layers to form the MST-CSS surface. This surface efficiently represents the multiscale and multispectral features of shape and motion trajectory, respectively. The system is sensitive to variations in both the shape and the motion trajectory. An efficient integration of spatial and temporal information (without any weighted combination) has been achieved in our MST-CSS approach for CBVR. Experimental results have shown better performance than state-of-the-art methods, including the MPEG-7 standardization and our previous works [23], [52].

Representation of multiple moving objects and of a non-stationary background (in the presence of camera motion: pan, zoom, tilt, etc.) will form the most vital and challenging scope of future work. In these cases, if pre-processing methods for video analysis (segmentation of multiple moving objects, with or without camera motion) are able to efficiently segment the moving video objects, our proposed MST-CSS representation may then be used to model each of the multiple moving objects separately. Matching of two such video shots, with multiple objects in motion, may then be formulated using a multi-variate optimization framework to compute a similarity measure, in order to obtain a reasonably good solution. Experimentation with the MST-CSS representation for video objects that undergo self-rotations (in the case of regular objects) or move toward/away from the camera should also be possible with the help of our proposed representation. In this case, the feature attributes of the MST-CSS representation will capture the deformation (due to time-varying shear or scale of the 2-D object contour) of the STV, occurring due to rotation of the object or camera zoom.

Other low-level features such as color and texture can also be added along with the shape and motion features to retrieve videos with similar background and texture features of the foreground moving video objects. The MST-CSS representation could be further analyzed for other applications such as action recognition, video classification, event detection, etc.

References

[1] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker, "Query by image and video content: The QBIC system," Computer, vol. 28, no. 9, pp. 23–32, 1995.
[2] V. Ogle and M. Stonebraker, "Chabot: Retrieval from a relational database of images," IEEE Comput., vol. 28, no. 9, pp. 40–48, Sep. 1995.
[3] J. R. Smith and S. F. Chang, "Visualseek: A fully automated content-based image query system," in Proc. ACM Int. Multimedia Conf. Exhib., 1996, pp. 87–98.
[4] A. Pentland, R. Picard, and S. Sclaroff, "Photobook: Content-based manipulation of image databases," Int. J. Comput. Vis., vol. 18, no. 3, pp. 233–254, 1994.
[5] E. Ardizzone and M. L. Cascia, "Automatic video database indexing and retrieval," Multimedia Tools Appl., vol. 4, no. 1, pp. 29–56, 1997.
[6] S. F. Chang, W. Chen, H. J. Meng, H. Sundaram, and D. Zhong, "VideoQ: An automated content based video search system using visual cues," in Proc. ACM Multimedia, 1997, pp. 313–324.
[7] A. Yoshitaka, Y. I. Hosoda, M. Yoshimitsu, M. Hirakawa, and T. Ichikawa, "VIOLONE: Video retrieval by motion example," J. Vis. Languages Comput., vol. 7, no. 4, pp. 423–443, 1996.
[8] Y. Deng, D. Mukherjee, and B. S. Manjunath, "NeTra-V: Toward an object-based video representation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 616–627, Sep. 1998.
[9] R. Lienhart, W. Effelsberg, and R. Jain, "VisualGREP: A systematic method to compare and retrieve video sequences," Multimedia Tools Appl., vol. 10, no. 1, pp. 47–92, Jan. 2000.
[10] X. Zhu, A. K. Elmagarmid, X. Xue, L. Wu, and A. C. Catlin, "InsightVideo: Toward hierarchical video content organization for efficient browsing, summarization and retrieval," IEEE Trans. Multimedia, vol. 7, no. 4, pp. 648–666, Aug. 2005.
[11] R. Babu and K. Ramakrishnan, "Compressed domain video retrieval using object and global motion descriptors," Multimedia Tools Appl., vol. 32, no. 1, pp. 93–113, 2007.
[12] C.-W. Su, H.-Y. Liao, H.-R. Tyan, C.-W. Lin, D.-Y. Chen, and K.-C. Fan, "Motion-flow based video retrieval," IEEE Trans. Multimedia, vol. 9, no. 6, pp. 1193–1201, Oct. 2007.
[13] S. Belongie, J. Malik, and J. Puzicha, "Matching shapes," in Proc. 8th IEEE Int. Conf. Comput. Vision, 2001, pp. 456–461.
[14] I. Kunttu, L. Lepistö, J. Rauhamaa, and A. Visa, "Multiscale Fourier descriptor for shape-based image retrieval," in Proc. 17th ICPR, vol. 2, 2004, pp. 765–768.
[15] R. Ogniewicz, "Skeleton-space: A multiscale shape description combining region and boundary information," in Proc. Comput. Vision Pattern Recognit., 1994, pp. 746–751.
[16] F. Mokhtarian, S. Abbasi, and J. Kittler, "Robust and efficient shape indexing through curvature scale space," in Proc. Brit. Mach. Vis. Conf., 1996, pp. 53–62.
[17] N. Dimitrova and F. Golshani, "Motion recovery for video content classification," ACM Trans. Inform. Syst., vol. 14, no. 13, pp. 408–439, 1995.
[18] Y. K. Jung, K. W. Lee, and Y.-S. Ho, "Content-based event retrieval using semantic scene interpretation for automated traffic surveillance," IEEE Trans. Intell. Transportation Syst., vol. 2, no. 3, pp. 151–163, Sep. 2001.
[19] S. Jeanin and A. Divakaran, "MPEG-7 visual motion descriptors," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 6, pp. 720–724, Jun. 2001.
[20] S. Dagtas, W. Al-Khatib, A. Ghafoor, and R. L. Kashyap, "Models for motion-based video indexing and retrieval," IEEE Trans. Image Process., vol. 9, no. 1, pp. 88–101, Jan. 2000.
[21] E. Sahouria and A. Zakhor, "Motion indexing of video," in Proc. IEEE Int. Conf. Image Process., 1997, pp. 526–529.
Sukhendu Das was born in Kharagpur, India, in 1962. He received the B.Tech. degree from the Department of Electrical Engineering, Indian Institute of Technology (IIT) Kharagpur, Kharagpur, in 1985, the M.Tech. degree in the area of computer technology from IIT Delhi, New Delhi, India, in 1987, and the Ph.D. degree from IIT Kharagpur in 1993.

Since 1989, he has been a member of the faculty with the Department of Computer Science and Engineering (CS&E), IIT Madras, Chennai, India. He was a Visiting Scientist with the University of Applied Sciences, Pforzheim, Germany, for post-doctoral research work from 2001 to 2003, and with the University of Western Australia, Perth, Australia, from June to August 2006 and from July to September 2008. He is currently an Associate Professor with the Department of CS&E, IIT Madras. He has guided two (currently guiding three) Ph.D. students, 19 (currently guiding eight) M.S. students, 31 (currently guiding three) M.Tech. students, and several B.Tech. students. He has completed several international and national sponsored projects and consultancies, both as a Principal Investigator and a Co-Investigator, and is currently involved in four sponsored projects/consultancies at IIT Madras. He has published nearly 100 technical papers in international and national journals and conferences. His current research interests include visual perception, computer vision, digital image processing and pattern recognition, computer graphics, artificial neural networks, computational science and engineering, and soft computing.

Dr. Das has reviewed several papers for international journals (IEEE, IET, Elsevier, Springer, and others) and has chaired several sessions at conferences. He has received two Best Paper Awards and a Best Design Contest Award.

A. Dyana received the M.Tech. degree in computer and information technology from Manonmaniam Sundaranar University, Tirunelveli, India, in 2004. Since 2004, she has been pursuing the Ph.D. degree with the Visualization and Perception Laboratory, Department of Computer Science and Engineering, Indian Institute of Technology Madras, Chennai, India.

Her current research interests include video processing, computer vision, and image/video coding.