csvt11_preprint
csvt11_preprint
Abstract—In this work a novel approach to video temporal are usually defined as Logical Story Units (LSU): higher-level
decomposition into semantic units, termed scenes, is presented. temporal segments, each covering either a single event (e.g.
In contrast to previous temporal segmentation approaches that a dialog) or several related events taking place in parallel
employ mostly low-level visual or audiovisual features, we in-
troduce a technique that jointly exploits low-level and high-level [2]. The close relation between video scenes and the real-
features automatically extracted from the visual and the auditory life events depicted in the video make scene detection a key
channel. This technique is built upon the well-known method enabling technology for advanced applications such as event-
of the Scene Transition Graph (STG), first by introducing a based video indexing; the latter has been gaining significant
new STG approximation that features reduced computational attention, as part of recent efforts towards experience- and
cost, and then by extending the unimodal STG-based temporal
segmentation technique to a method for multimodal scene seg- event-based multimedia manipulation [3]. Fig. 1(a) illustrates
mentation. The latter exploits, among others, the results of a large the relations between different temporal segments of a video.
number of TRECVID-type trained visual concept detectors and Video segmentation to shots and scenes are two differ-
audio event detectors, and is based on a probabilistic merging ent problems that are characterized by considerably differ-
process that combines multiple individual STGs while at the ent degrees of difficulty. State-of-the-art shot segmentation
same time diminishing the need for selecting and fine-tuning
several STG construction parameters. The proposed approach techniques, detecting the presence of video editing effects
is evaluated on three test datasets, comprising TRECVID docu- such as cuts and fades with the use of low-level visual
mentary films, movies, and news-related videos, respectively. The features, have been shown in large-scale experiments (e.g.
experimental results demonstrate the improved performance of TRECVID) to reach an accuracy that is close to perfect;
the proposed approach in comparison to other unimodal and this accuracy is deemed by the relevant community to be
multimodal techniques of the relevant literature and highlight the
contribution of high-level audiovisual features towards improved sufficient for any practical application [4]. On the other hand,
video segmentation to scenes. scene segmentation is still an open research problem, with
most approaches of the literature failing to take into account
the semantics of the content in performing a task that by
I. I NTRODUCTION definition is based on semantic criteria: different consecutive
Video decomposition into temporal units is an essential parts of the video are assigned to the same scene, according
pre-processing task for a wide range of video manipulation to the literature, simply because they present similar low-level
applications, such as video indexing, non-linear browsing, audiovisual properties, whereas it is much more than such low-
classification etc. Video decomposition techniques aim to level properties that make humans recognize (and request to
partition a video sequence into segments, such as shots and consume, in applications such as retrieval) different scenes in
scenes, according to semantic or structural criteria. Shots are a video.
elementary structural segments that are defined as sequences In this work a novel approach to video temporal decomposi-
of images taken without interruption by a single camera [1]. tion into scenes is presented. This builds upon the well-known
On the other hand, scenes are longer temporal segments that technique of the Scene Transition Graph (STG) [5], which it
extends, and additionally exploits recent advances in sematic
P. Sidiropoulos is with the Informatics and Telematics Institute / Centre for video analysis tasks in order to overcome the limitations
Research and Technology Hellas, 6th Km Charilaou-Thermi Road, P.O.BOX
60361, Thermi 57001, Greece, and with the Center for Vision, Speech and of existing scene segmentation approaches. Initially, a new
Signal Processing, Faculty of Engineering and Physical Sciences, University STG approximation that features reduced computational cost
of Surrey, Guildford, Surrey GU2 5XH, UK, [email protected] is introduced. This is important for ensuring the efficiency of a
V. Mezaris and I. Kompatsiaris are with the Informatics and Telematics In-
stitute / Centre for Research and Technology Hellas, 6th Km Charilaou-Thermi subsequent processing stage, which mandates the construction
Road, P.O.BOX 60361, Thermi 57001, Greece, {bmezaris, ikom}@iti.gr. of multiple STGs. Then, a Generalized STG-based (GSTG)
H. Meinedo is with the Instituto de Engenharia de Sistemas technique is proposed for multimodal scene segmentation. This
e Computadores, Investigacao e Desenvolvimento, Lisbon, Portugal,
[email protected] is based on firstly constructing multiple STGs that separately
M. Bugalho and I. Trancoso are with the Technical University of exploit different audiovisual features for segmentation, using
Lisbon, Lisbon, Portugal and with the Instituto de Engenharia de Sis- a new algorithm for the extension of STG to non-visual input,
temas e Computadores, Investigacao e Desenvolvimento, Lisbon, Portugal,
{miguel.bugalho,isabel.trancoso}@inesc-id.pt and secondly on using a probabilistic merging process to
This work was supported by the European Commission under contracts combine their results while at the same time diminishing the
FP6-045547 VIDI-Video and FP7-248984 GLOCAL. need for selecting and fine-tuning several STG construction
Copyright (c) 2011 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be parameters. In contrast to previous temporal segmentation
obtained from the IEEE by sending an email to [email protected]. approaches that employ mostly low-level visual or audiovisual
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 2
visual segment boundaries are aligned to generate a set of similarity between keyframes of video shots to construct a
candidate scene boundaries, which are accepted or discarded connected graph; the cut-edges of this graph constitute the set
by further examining the audio changes. Similar in principle of scene boundaries.
approaches, based on aligning the boundaries of visual and The STG construction starts with the generation of a seg-
audio segments, are presented in [15], [16]. In [17], scene mentation S of the video B to non-overlapping visual shots,
changes are detected by evaluating the audio dissimilarity
S = {xi }N
i=1 where xi = {fk }k=bi , bi < bi+1 ∀i,
ei
of adjacent shots only; a similar process is adopted in [18],
where however the notions of visual and audio attention are x1 ∪ x2 ∪ ... ∪ xN = B, (1)
used for guiding the shot similarity evaluation. In [19], low-
level color and audio features, together with face detection where fk is the k-th frame of the video, and bi , ei are the
results, are used for computing a table of distances between indices of the first and last frame of shot xi , respectively. Two
the shots of a video that is exploited for clustering, while a video shots are considered similar if they contain at least one
weighted combination of audio- and visual-similarity measures pair of similar frames according to similarity measure D(., .),
is used in [20]. In [21] a fuzzy k-means algorithm is introduced D(xi , xj ) = min D (fm , fn )
to segment the auditory channel into audio segments; scene m,n
breaks are identified when a visual shot boundary exists where bi ≤ m ≤ ei and bj ≤ n ≤ ej (2)
within an empirical time interval before or after an audio
segment boundary. Learning-based methods are presented in In this equation D (fm , fn ) is a measure of the similarity
[22], [23], [24]; [22] proposes a statistical framework, which of frames fm , fn ; typically, low-level features such as color
learns from a training set the probability of different shot histograms and distance measures such as L1 distance or
features taking specific values on a scene boundary, and detects histogram intersection are used. Although the similarity of all
scene boundaries at local maxima of the likelihood ratio curve. frames of both shots needs to be evaluated according to this
In [23] and [24], audiovisual features are used as input to criterion, a set of selected key-frames is often used instead,
Support Vector Machine (SVM) classifiers, which are trained for reducing computational complexity.
to differentiate between two classes: scene-boundary and non- The visual similarity values D(xi , xj ) between each pair
scene-boundary. of shots xi , xj in the video, providing that xi , xj are less
Common deficiency of the reviewed techniques is that they than an empirical time threshold τ apart, are calculated and
rely mostly on low-level audiovisual features. Although these used for grouping shots that are similar (i.e. shots for which
are to some extent useful in evaluating the similarity of shots D(., .) < Dt ) into the same cluster. This clustering criterion
for the purpose of grouping them, there is a gap between requires each shot to be similar to every other shot in the
the similarities that can be revealed by examining just low- same cluster. The order according to which the clustering
level properties of the audiovisual signal and the semantic proceeds is specified by D(xi , xj ): at any time, the most
coherence that is desired of a scene. Another deficiency is similar pair of shots is examined before all less similar ones.
that the combination of audio and visual information, which From the clusters and the temporal ordering of the shots, a
is evidently advantageous for scene segmentation, is typically scene transition graph is constructed, where nodes represent
performed either in a simplistic manner (e.g. simple temporal the shot clusters and a directed edge is drawn from a node
alignment of audiovisual segment boundaries) or with the use to another if there is a shot represented by the first node
of learning-based techniques. The latter usually require large that immediately precedes any shot represented by the second
genre-specific manually-segmented training sets. node. Finally, the “cut-edges” of the graph are identified. A
cut-edge is defined as an edge which, if removed, results in two
disconnected graphs. The collection of all cut edges constitutes
the set of scene boundaries.
Among the advantages of the STG approach is that the
evaluation of shot similarity is not limited to pairs of adjacent
shots (thus, scenes characterized by repetitive patterns, such as
dialogs, can be detected correctly), in contrast to several other
(a) (b) unimodal or multimodal techniques. Among its disadvantages,
Fig. 2. Six keyframes of shots that belong to the same scene and (a) are though, is that it exploits only low-level visual features;
characterized by pronounced visual similarities, (b) do not present significant it provides no support for combining heterogeneous feature
visual similarities, but a relation between them is indicated by non-visual
means (audio). sets; and similarly to most literature approaches it requires
the heuristic setting of certain parameters (STG construction
parameters Dt and τ ).
shots (e.g., all shots lying within a temporal window) and produces a primary set of links, i.e. avoids examining the
for evaluating the presence or not of such links (e.g. the existence of links that, given those already identified, would
shots’ HSV histogram similarity lying below a threshold). In be trivial, we can reduce the computational cost associated
this section, we use properties related to shot linking, such with the detection of scene boundaries.
as shot linking transitivity and the fact that scenes are by
definition convex sets of shots, to present an approximation
to STG-based scene segmentation. This approximation limits B. Shot linking by primary set estimation
the number of shot pairs whose possible linking needs to be Given the input video B that contains shots x1 , x2 , ...xN ,
evaluated and simplifies or renders obsolete other processing as defined in section II-B, a primary set of shot links can be
steps associated with the STG, thus allowing the faster detec- directly estimated according to Algorithm 1.
tion of scene boundaries. The proposed approximation is not
guaranteed to produce the exact same results as the original Algorithm 1 Primary set estimation
STG; nevertheless, the experiments in section VI show that
1: Initially, all pairs of shots (xi , xj ), xi <o xj and i, j ∈
the performance differences are very small.
[1, N ], are marked as valid pairs; any pair that is examined
in subsequent steps, and is not identified as linked, is
A. Definitions automatically marked as an invalid pair. d is set to N − 1
Following the definition of the scene as a Logical Story and i is set to 1.
2: d , d are set to zero.
Unit [2], any scene segmentation process can be viewed as
3: If (xi , xi+d ) is a valid pair, the presence of a link between
a clustering of shots into non-overlapping convex sets. Let
us remind that in a totally ordered space, a set of points is these two shots is examined. If it’s an invalid pair or no
convex if for every pair of points that belong to it, all points link is found: if i+d < N , this step is repeated after setting
in between (according to the total order <o of the space) also i = i + 1, otherwise is repeated after setting d = d − 1
belong to it. The shots of a video can be seen as defining a and i = 1. This continues until a shot link is found or d
totally ordered one-dimensional space according to time, and becomes zero.
4: If pair (xi , xi+d ) has been identified as linked, then
scenes are indeed non-overlapping convex sets in this space:
if two shots xi , xj belong to a single scene, then every shot starting from d = d and descending by step of 1 all valid
xm , xi <o xm <o xj also belongs to the same scene. The pairs (xi+d , xi+d+d ) are examined sequentially for shot
implication of this is that, having established a definitive link links, until a shot link is found or d becomes zero.
5: If pair (xi+d , xi+d+d ) has been identified as linked, then
between shots xi , xj , it is redundant to look for links between
any shots xm , xn if xi ≤o xm <o xn ≤o xj , because of the starting from d = d and descending by step of 1 all valid
convexity of the set that the link between shots xi , xj defines. pairs (xi+d+d , xi+d+d +d ) are examined sequentially for
shot links, until a shot link is found or d becomes zero.
Considering the transitivity of shot linking, strictly speak-
6: If pair (xi+d+d , xi+d+d +d ) has been identified as
ing, shot linking is not a transitive relation. This can be
linked, d is set equal to d + d and step 5 is repeated
seen with an example: assuming shots xi <o xm <o xj ,
(without checking again if the condition of step 5 is
D(., .) being a shot similarity measure (e.g. HSV histogram
satisfied); the algorithm oscillates between steps 5 and 6
difference) and D(., .) ≤ a being the shot linking criterion,
until no further link can be found by these two steps.
D(xi , xm ) ≤ a and D(xm , xj ) ≤ a do not necessarily
7: If d = 0, (xi , xi+d+d +d ) is added to the shot pairs
mean that D(xi , xj ) ≤ a also holds. However, viewing scene
that belong to the primary set of links; all pairs of shots
segmentation as the clustering of shots into non-overlapping
(x, y), xi ≤o x, y ≤o xi+d+d +d are marked as invalid
convex sets, D(xi , xm ) ≤ a and D(xm , xj ) ≤ a means that
pairs; i is set equal to i + d + d + 1 (see Fig. 3(c)) and
xi , xm , xj all belong to the same scene, and this is equivalent
the algorithm returns to step 2. If d = 0, the algorithm
to establishing a shot link for the pair (xi , xj ). For this, we
terminates.
will treat shot linking as a transitive relation in the sequel.
Based on the above considerations and assuming that a set L
comprising K linked pairs of shots, (xs1 , xe1 ), ..., (xsK , xeK ), It is evident that following this algorithm, no pair of shots is
has been identified for a video B according to some linking examined for the presence of a shot link more than once; also,
criteria, we proceed with the following definitions: as soon as a shot link is found (step 3), shot pairs potentially
Definition 1: A link between shots xi and xj is called a defining related trivial links are immediately excluded from
trivial link if there exists a (xsk , xek ) ∈ L such that xsk ≤o xi further consideration. Related double trivial links are then
and xek ≥o xj . looked for (steps 4-6) and, if found, are eliminated, further
Definition 2: Three shots xi , xm , xj are said to define a increasing the number of shot pairs that are excluded from
trivial double link if both (xi , xm ) and (xm , xj ) belong to subsequent processing. The resulting primary set of links L
L. essentially defines a scene transition graph, with the convex
Definition 3: The set L is named primary if both no trivial sets of shots defined by the links in L serving as the nodes
links and no trivial double links exist in it. of the graph. With respect to the maximum allowed temporal
Examples of a trivial link and a trivial double link are distance τ of linked shots, which is a parameter of the original
shown in Fig. 3. By introducing an algorithm that directly STG, this can be integrated in Algorithm 1 simply by limiting
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 5
x i+d+d'+d''+1
x sk xi xj x ek xi xm xj xi
time time x i+d x i+d'+d''+1 x i+d+d'+d'' time
accordingly the number of shot pairs that are marked as valid shot similarity measure D(., .) does not need to be computed
pairs in the first step of it. for (μk +2)(μ
2
k +1)
− 1 pairs of shots. For all K primary links in
Set L is parsed for detecting the scene boundaries as L, the number of shot pairs for which D(., .) is not computed
K
follows: All shot pairs that belong to it are ordered in tabular rises to k=1 ( (μk +2)(μ
2
k +1)
− 1), out of the N (N2−1) possible
form, as shown in Fig. 4(a). Then, starting from the top left pairs of shots in B (assuming that τ → ∞). Consequently, the
cell, proportional computational complexity gain K G from the
use of
(μ )2 +3 K μ
1) If the current cell (Fig. 4(a)) belongs to the left column, the algorithm of section III-B is: G = k=1 Nk (N −1) k=1 k .
we just move to the one of the two neighboring cells This quantity is minimized when μk = μ, ∀k ∈ [1, K], thus
(Fig. 4(b)) that corresponds to the shot that appears a lower bound for gain G is given by: Gmin = μ(μ+3)K N (N −1) .
before the other one in B (according to the total order
Assuming, for example, that out of N (N2−1) possible pairs of
<o ; e.g. in Fig. 4(b) we will move to the xe1 cell if
shots in B, non-trivial links are established for 5% of them
xe1 <o xs2 , otherwise we will move to the xs2 cell).
(i.e. K = 0.05 N (N2−1) ) and μ = 4, the lower bound for
2) If the current cell belongs to the right column (Fig. 4(c)),
gain G is 70%. This gain persists when additional limitations
we move to the cell on the left column that is one row
to the number of examined shot pairs are introduced (e.g.
below the current one (Fig. 4(d)) and shots xi of B that
by τ ∞), providing that the non-trivial links continue
lie in between the two cells considered in this step (i.e.,
to represent a reasonable portion of all the shot pairs that
in the example of Fig. 4(c), shots for which e1 ≤ i < s2)
would otherwise be examined. Experiments indicate that 70%
are added to the scene boundary list.
is indeed a typical value for G; this alone represents a speed-
When the bottom-right cell is reached, the scene boundary
up by a factor of 3.
list contains the last shot of each scene, i.e. the scene bound- Considering the clustering of the shots, this step becomes
aries. obsolete in the proposed algorithm, whereas in the STG
X s1 X e1
method this step involves, among others, the sorting of values
X s1 X e1 X s1 X e1 X s1 X e1
Scene boundary
X e1
Scene boundary
Scene boundary
X s2 X e2
list (empty)
list (empty)
X s2 X e2 X s2 X e2 X s2 X e2 X e1+1
X s3 X e3 X s3 X e3 X s3 X e3 X s3 X e3 X e1+2 shots. The latter process alone has average computational cost
... ... ... ... ... ... ... ... ... proportional to Λ log Λ, where Λ denotes the number of shot
pairs (when τ → ∞, Λ = N (N2−1) ). Finally, the parsing
X
X sn X en X sn X en X sn X en X sn X en s2-1
(a) (b) (c) (d) of the table of primary links, which is the last main step
of the proposed algorithm, has very low computational cost
Fig. 4. Primary set of link L in tabular form, and example of browsing it
so as to fill-in the list of scene boundaries. (proportional to K, K being the number of primary links in L).
Although a direct theoretic comparison with the computational
cost of algorithms for graph parsing is difficult, due to the
different parameters affecting the latter (i.e. the number of
C. Computational Complexity Analysis nodes and edges of the graph, rather than K), the proposed
The main processing steps of the STG method for detecting parsing algorithm is intuitively expected to contribute to the
scene boundaries and the corresponding steps of the proposed overall speed-up of scene boundary detection.
approximation of it are summarized in Table I. This table
indicates that the proposed approximation is expected to IV. G ENERALIZED S CENE T RANSITION G RAPH M ETHOD
deliver significant gains in computational complexity, since in (GSTG)
it each main step of the STG is either simplified or becomes The STG method for scene segmentation, regardless of
obsolete. whether the original algorithm of [5] or the fast approximation
Specifically, with respect to the calculation of visual similar- of section III are used, is a method exploiting only low-level
ity values D(., .), the algorithm of the previous section refrains visual information for both the initial decomposition of the
from checking a number of shot pairs for possible links after video stream to elementary video segments (shots) and for the
establishing a non-trivial link for shot pair (xsk , xek ). Assum- similarity-based linking of them. In this section we introduce
ing μk shots lie between shots xsk and xek , this means that i) a unimodal extension of STG to non-visual input, and ii)
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 6
TABLE I
M AIN PROCESSING STEPS OF STG AND OF THE PROPOSED FAST APPROXIMATION OF IT
STG Fast STG approximation
Calculates visual similarity D(., .) for every pair of shots that do not Uses shot linking properties to further limit the number of shot pairs for
exceed a specified temporal distance (τ ). which D(., .) needs to be calculated.
Clusters the shots (section II-B); this requires sorting the shot pairs This processing step becomes obsolete; the primary links detected at the
according to D(., .), and comparing the distances between all involved previous step directly define the shot clusters.
shot pairs for merging two clusters.
Parses the resulting graph (STG) to identify cut-edges. Parses a much simpler structure (a table, as in Fig. 4).
a method for combining unimodal STGs towards multimodal Algorithm 2 Unimodal extension of STG to non-visual input
scene segmentation. Preliminary versions of these techniques 1: Adjacent segments of S are merged according to similar-
have been introduced by the authors in [25], [26]. ity criteria set O, leading to segmentation S1 .
2: The assumption that each segment of S1 can belong to
just one scene is adopted. Based on this, adjacent shots
A. Unimodal extension of STG to non-visual input of S are merged by eliminating shot boundaries that do
not correspond to segment boundaries in S1 , resulting in
Non-visual features, e.g. low-level audio features, speaker
segmentation S1 . Evidently, if S1 and S coincide (e.g.
diarization results, audio events etc., can be used for providing
when considering visual features), this processing step has
two kinds of information in relation to the goal of video
no effect and S1 also coincides with S.
segmentation to scenes: i) information about the similarity
3: Each segment of S1 is described using appropriate features
of two elementary video segments (e.g. shots), so as to
(e.g. in the case of speaker diarization results, by mapping
allow for deciding whether the two segments are linked on
speaker IDs to the segments of S1 , so that each segment
not, and ii) binary information about the potential of a shot
is described by the histogram of speakers heard in it).
boundary to also be a scene boundary (i.e. allowed / non-
4: STG-based scene segmentation is performed (by means of
allowed). The first possibility comes from using the non-
either the algorithm of section III or that of [5]), using
visual features together with an appropriate similarity measure,
segmentation S1 instead of S as a starting point and
analogously to the use of measure D(., .) for low-level visual
replacing D(., .) with a similarity measure appropriate for
features in previous sections. The second possibility arises
the considered features.
from the fact that the extraction of non-visual features from the
audiovisual stream is typically accompanied by the definition
of an appropriate decomposition of the stream to elementary In this extended algorithm, similarity criteria set O is used
segments. This decomposition in general does not coincide for correcting any over-segmentation errors in S , e.g. by
with the decomposition of the video to shots, and cannot merging two adjacent speaker segments of S in case they
be used by the STG in place of the latter, since this would are both assigned to the same speaker. Thus, the criteria in
lead to possible violation of the basic assumption that scene O are qualitative rather than quantitative and do not involve
boundaries are a subset of the video’s shot boundaries. It any distance measures or thresholds. For the second step,
can however be used in combination with the decomposition a temporal tolerance parameter is used when evaluating the
to shots for limiting the number of shot boundaries that are correspondence of shot boundaries in S and segment bound-
treated as potential scene boundaries, with the help of simple aries in S1 , to prevent minor misalignments from triggering
semantic criteria. the elimination of shot boundaries. Following this algorithm,
For example, when performing speaker diarization for the various different STGs can be constructed for a single video,
purpose of describing each elementary video segment by a each based on different visual or non-visual features.
speaker ID, a speaker segmentation of the audio stream is
defined. The resulting speaker IDs can be mapped to the B. Combination of unimodal STGs for scene segmentation
video shots, so that each shot is described by the histogram of Despite the definition of the STG extension of section
speakers heard in it, and a suitable similarity distance can be IV-A, which in place of the typically employed low-level
defined for these shot feature vectors. The speaker segmenta- visual features can use different ones, the problem of com-
tion of the audio stream can however provide additional binary bining multiple heterogeneous features remains. At the same
information about the potential of a shot boundary to also be time, it has been experimentally found that regardless of the
a scene boundary: the absence of a speaker change across a considered features, the estimated scene boundaries depend
shot boundary, for example, could be used as evidence that the significantly on the value of parameters that are inherent to
two corresponding adjacent shots belong to the same scene. the STG construction process, namely the temporal distance τ
In order to exploit such decomposition-based information and the similarity threshold Dt . In order to combine multiple
when dealing with non-visual input, a few additional steps heterogeneous features for scene segmentation and simulta-
are introduced to the STG construction algorithm. Denoting neously reduce the dependence of the proposed approach on
S the decomposition of video into shots and S the non- parameters, we propose a probabilistic technique that involves
visual decomposition of the audiovisual stream to elementary the independent creation of multiple STGs of each type, where
segments, we proceed according to Algorithm 2. a “type” means here an STG that uses a specific set of features
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 7
The denominator in the second part of Eq. 6 denotes the max- speaker θ is active in video segment x over the total duration
imum value of the γ-th concept detector across all keyframes of the same segment. Similarly to the HSV histograms, the L1
of the examined video. distance is used as a segment similarity measure D(., .).
The definition of a shot similarity measure using the model
vectors is based on the requirement that not only the difference D. Audio event-based model vectors
of values φγ (x) between two shots, but also the absolute
values φγ (xi ) and φγ (xj ) themselves, should affect shot Audio events are the audio equivalent to visual concepts.
similarity. The rationale behind this is that, for the γ-th An audio event is defined as a semantically elementary piece
detector, two shots receiving similarly high confidence values of information that can be found in the audio stream, such
is a strong indication of their semantic similarity (i.e. they as telephone ringing, dog barking, music, child voice, traffic
are both likely to depict the γ-th concept). On the contrary, noise etc. Audio events are detected with the use of trained
the same shots receiving similarly low confidence values is an audio event detectors that rely on machine learning, as outlined
indication neither in favor nor against their semantic similarity; below:
it merely suggests that the γ-th concept (out of a large number • Classification using SVMs as described in [34] for 61
Jv of concepts) is not depicted in either of the two shots. audio events, e.g. Dog-Barking, Siren, Applause, Explo-
The commonly used L1 or other Minkowski distances do not sion.
satisfy the above requirement, since they depend only on the • Classification using Multi-layer Perceptrons or Gaussian
difference of the values. Instead of it, a variation of the Chi- Mixture Models as described in [35] for 14 audio events,
test distance is employed in this work, defined as: e.g. male speaking, speech with noise background, music.
Jv The complete list of considered audio events is given in Table
(φγ (xi ) − φγ (xj ))2
D(φ(xi ), φ(xj )) =
II.
(7)
γ=1
φγ (xi ) + φγ (xj ) Similarly to how the results of visual concept detectors are
used in this work, the responses of the audio event detectors
It should be noted that the TRECVID 2005 dataset, on (confidence values for the presence of a specific audio event
which the visual concept detectors were trained, is a concept- in a given audio segment) are used to build audio event-based
annotated dataset extensively used for concept detector train- model vectors,
ing and evaluation, and it is completely unrelated to the two
test datasets used for experimentation in section VI. ψ(x) = [ψ1 (x), ψ2 (x), ..., ψJa (x)] (9)
where x denotes again a temporal segment of segmentation
C. Typical audio features S1 , produced using the same criteria O as in the previous
Audio features typically employed for video segmentation to subsection (section V-C). For the reasons discussed in section
scenes include low-level features (e.g. short-time energy, zero- V-B, the variation of the Chi-test distance introduced in Eq. 7
crossing rate) and somewhat higher-level ones (e.g. the results is also used here for comparing audio segments according
of audio segmentation, background conditions classification, to their audio event-based model vectors. Finally, it should
speaker clustering etc.). be noted that, similarly to the visual concept detectors, the
In this work, we extract audio features by performing audio event detectors were trained on an annotated audio event
audio segmentation, classification according to background corpus ([34], [35]) completely unrelated to the two test datasets
conditions, and speaker diarization [32], [33]. Background used for experimentation.
classification considers three classes: noise, silence and music.
Speaker diarization identifies speaker homogeneous segments VI. E XPERIMENTAL R ESULTS
in the audio stream and further assigns a speaker identity to A. Datasets and evaluation measures
each, after clustering them. The result of this process is the
For experimentation, two datasets were used in all experi-
partitioning of the audiovisual stream into audio segments,
ments, while a third one was additionally used in a few exper-
each of which carries a background class label and, in case it
iments for showing the applicability of the proposed approach
also includes speech, a speaker ID as well.
to a certain type of news videos. The first dataset is made of 15
For exploiting these features, criteria set O (section IV-A)
documentary films (513 minutes in total) from the collection
is defined as two adjacent audio segments sharing the same
of the Netherlands Institute for Sound & Vision1 , also used as
background conditions and speaker ID labels; the feature
part of the TRECVID dataset in the last few years. The second
used for describing each segment of segmentation S1 (an
one is made of six movies (643 minutes in total). Application
intermediate result of the algorithm of section IV-A) is a
of the shot segmentation algorithms of [36], [37] (for abrupt
speaker identity distribution, defined as:
and gradual transition detection, respectively) to these datasets
H(x) = [H1 (x), H2 (x), ... HΘ (x)] (8) resulted in 3459 and 6665 shots; manual grouping of them to
scenes resulted in 525 and 357 ground truth scenes. For each
where x denotes in this equation a temporal segment of of these two datasets, one additional video of the same genre
segmentation S1 rather than an original shot in S, and Θ is (one documentary, one movie) was processed in the same way
the total number of speakers in the video as per the speaker
diarization results. Hθ (x) is defined as the fraction of time that 1 http://instituut.beeldengeluid.nl/
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 9
TABLE II
L IST OF AUDIO EVENTS
Airplane Engine Jet Wolf/Coyote/Dog Howling Car Animal Hiss Morse Code Typing Male Voice
Baby Whining or Crying Telephone Ringing Digital Bear Bell Electric Frog Saw Manual Rattlesnake
Bell Mechanic Non Vocal Music Big Cat Crowd Applause Music Thunder Insect Buzz
Bite Chew Eat Noise Background Bus Buzzer Speech Pig Horse Walking
Airplane Engine Propeller Voice With Background Noise Cat Meowing Donkey Vocal Music Helicopter Train
Child Voice Telephone Ringing Bell Cow Child Laughing Paper Saw Electric Hammering
Clean Background Telephone Band Birds Wind Sheep Gun Shot Heavy Water
Digital Beep Voice With Background Music Dog Barking Dolphin Siren People Talking Glass
Chicken Clucking Walk/Run/Climb Stairs (Soft) Female Voice Drink Whistle Fireworks Traffic
Elephant or Trumpet Walk/Run/Climb Stairs (Hard) Electricity Explosion Motorcycle Insect Chirp Fire
Door Open or Close Gun Shot Light Horn Vehicle Music Background Moose or Elk or Deer
(shot segmentation, manual grouping of the shots to scenes) [0, 0.2] or [0, 0.4], depending on the type of STGs. Random
and was used for automatically adjusting the parameters of selection was implemented with the use of simple random
the algorithm (weights wy and threshold T in Eqs. (3)-(4), number generators.
as well as optimal number of employed visual concept and
audio event detectors) in the relevant reported experiments.
The third, smaller, dataset was generated with the purpose
of simulating unedited news video content; this was done by
The results of GSTG are shown in Table III. The first
concatenating several news-related videos from YouTube into
column (“Index y”) indicates the types of STGs that contribute
three hour-long videos. The number of automatically detected
to GSTG in each experiment. The Coverage, Overflow and F-
shots and manually-identified ground truth scenes in the latter
Score columns report the results of GSTG when the algorithm
dataset was 1763 and 57, respectively.
of [5] is used for individual STG construction, while the F-
For evaluating the results of the scene segmentation ex-
Score values in parentheses correspond to the case where
periments, the Coverage (C), Overflow (O) and F-Score
the fast approximation of section III is used instead, as part
(F ) measures were employed. Coverage and Overflow were
of GSTG. In the first experiment, for example, y ∈ {V }
proposed in [38] for scene segmentation evaluation; Coverage
indicates that only the typical visual features of section V-A
measures to what extent frames belonging to the same scene
are employed; thus, the resulting method essentially resembles
are correctly grouped together, while Overflow evaluates the
the original STG method of [5], integrating however the proba-
quantity of frames that, although not belonging to the same
bilistic technique introduced in section IV-B that alleviates the
scene, are erroneously grouped together. The optimal values
need for experimentally setting STG construction parameters
for Coverage and Overflow are 100% and 0% respectively.
τ , Dt . In subsequent experiments of this series, STGs con-
The F-Score is defined in this work as the harmonic mean of
structed with the use of visual concept-based model vectors
C and 1 − O, to combine Coverage and Overflow in a single
2C(1−O) (V C), typical audio features (A) and audio event-based model
measure, F = C+(1−O) , were 1 − O is used in this formula
vectors (AE), as well as combinations of them, contribute to
instead of O to account for 0 being the optimal value of the
GSTG. It can be seen from this table that, among individual
latter, instead of 1.
features (first four rows of the table), the use of the typical
visual features results in the highest F-Score. Considering
B. Experimental upper bounds of performance the cases where two or more types of STGs contribute to
A first series of experiments was carried out with the GSTG, however, its is clear that the {V, V C} combination
GSTG method, using those GSTG parameter values that performs better than {V } and the {A, AE} combination
were determined by exhaustive search as being the ones that performs better than {A}. Further combining visual and audio
maximize the F-Score attained for each test dataset. This was features (y ∈ {V, V C, A} and y ∈ {V, V C, A, AE}) leads
done for experimentally estimating an upper bound for the to additional gains; the F-Score attained by the GSTG when
performance of GSTG when different audiovisual features or all audiovisual features of section V are employed is about
combinations of them are used. It is reminded that parameters 10 points higher that that of y ∈ {V }. The conclusion
of the GSTG method are the weights wy and threshold T in here is that, providing that good GSTG parameter values
Eqs. (3)-(4); the number of employed visual concept and audio can be determined, the GSTG can effectively use any single
event detectors, assuming that we consider the possibility of one of the considered audiovisual features towards improved
using just a subset of those defined in section V, is also performance, and the observed performance improvements are
treated as a parameter in this series of experiments. In this significant in both examined datasets. Furthermore, the use
and all subsequent series of experiments, in any case where of the fast approximation of section III instead of [5], as
the use of keyframes was required, three keyframes per shot part of GSTG, results in only small F-Score degradation (in
were used. The number P of STGs of each type that were most cases, F-Score differences of < 1%) in return for major
constructed using randomly selected parameters τ and Dt was computational efficiency gains (section VI-F). These F-Score
set to 1000, with the randomly selected values of τ being in the differences translate to an increase of the number of true scene
range [0, 5000] (measured in frames) and of Dt in the range boundaries that are not detected by less than 1%.
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 10
TABLE III
GSTG PERFORMANCE , USING GSTG PARAMETER VALUES THAT WERE DETERMINED BY EXHAUSTIVE SEARCH AS BEING THE ONES THAT MAXIMIZE
THE F-S CORE ATTAINED FOR EACH TEST DATASET.
Documentary Dataset Movie Dataset
Index y (types of STGs in GSTG) Coverage(%) Overflow(%) F-Score(%) Coverage(%) Overflow(%) F-Score(%)
{V } 78.33 19.06 79.61 (78.17) 74.49 24.11 75.18 (74.21)
{V C} 75.66 31.19 72.07 (71.21) 65.78 17.73 73.11 (71.63)
{A} 68.58 27.59 70.44 (70.63) 62.33 45.51 58.15 (57.40)
{AE} 72.24 34.78 68.55 (68.75) 60.28 37.21 61.51 (61.42)
{V, V C} 80.60 14.71 82.91 (81.57) 71.96 8.51 80.56 (80.32)
{A, AE} 70.10 15.46 76.65 (75.97) 66.16 32.78 66.69 (66.12)
{V, V C, A} 85.48 12.28 86.59 (86.42) 81.89 15.60 83.13 (83.47)
{V, V C, A, AE} 87.35 9.37 88.96 (88.34) 89.27 17.02 86.01 (85.55)
C. Impact of parameters on performance A similar study of the number Ja of employed audio event
detectors was also carried out, with Ja ranging from 20 to 60
Having examined the performance of GSTG when using
with a step of 10; using all 75 audio events of Table II was also
“good” GSTG parameter values, we then examined the im-
examined. The F-Score of each individual audio event detector,
pact of each of these parameters separately. Starting with
calculated on the test portion of the audio event corpus (section
the number Jv of visual concept detectors that are taken
V-D), was used as a detector goodness criterion. The results,
into account (Eq. 5), experiments were carried out with it
shown in Fig. 7, are similar to those for the visual concept
varying from 10 to 90 with a step of 10; the use of all
detectors that were discussed above.
101 visual concept detectors was also examined. Assuming
Finally, regarding the impact of weights wy and threshold
that, when selecting a subset of the available detectors, it
T when y ∈ {V, V C, A, AE}), results from varying each of
makes sense to select the best Jv detectors out of all the
wV C , wA , wAE and T separately are shown in Fig. 8. In
available ones, two different “goodness” criteria were used
varying the weights, wV was set equal to 1−wV C −wA −wAE ;
for the detectors: Average Precision (AP ) and Delta Average
thus, in Fig. 8(a) wV C varies from 0 to 100% of its maximum
Precision (ΔAP ) [39]. Both AP and ΔAP for the trained
allowed value, the latter being the one that would make wV
concept detectors were those calculated on the test portion
equal to 0 for the given (constant) values of wA and wAE ; sim-
of the TRECVID 2005 dataset (section V-B). The results
ilarly for wA and wAE in Fig. 8(b) and (c), respectively. The
presented in Fig. 6 indicate that when y ∈ {V C}, higher
results indicate that GSTG is not very sensitive to the values of
Jv values generally lead to higher F-Score. When considering
weights wy , since in all cases there is a relatively large range of
combinations of features, though, Jv values between 40 and
weight values that result in close-to-maximum F-Score, and no
80 lead to the best results; using additional concept detectors
abrupt changes in F-Score for small changes in a weight value
leads to slight performance decrease. A possible explanation
are observed. Threshold T (Fig. 8(d)) is shown to have a more
of this is that even poorly-performing concept detectors tend
significant impact on F-Score, which was however expected,
to produce similar responses for “similar” keyframes (if not
considering that its minimum and maximum values practically
semantically similar, at least visually similar). Thus, in the
mean that all and no potential scene boundaries, respectively,
absence of other features, such concept detectors provide some
are accepted as scene boundaries. Even for T though there is a
useful information to the scene boundary detection algorithm,
relatively large range of values that result in close-to-maximum
besides introducing noise due to their poor performance in
F-Score.
detecting specific concepts. When used in combination with
other features, though (specifically, low-level visual features),
visual similarity can be reliably estimated from the latter D. Results using automatically determined parameters and
features, and the poorly-performing concept detectors seem comparison with literature works
to only introduce additional noise to the representation of the An advantage of the GSTG approach, discussed in section
shots. This noise is responsible for the slight decline of the IV-B, is that weights wy of GSTG are not hard-to-optimize
F-score when increasing the value of Jv beyond an optimal STG construction parameters; on the contrary, they can be
one. In the above cases, selecting the detectors according to easily optimized using Least Squares Estimation (LSE). In this
ΔAP is advantageous, compared to using AP , although the section, we repeat the series of experiments of section VI-B,
differences between the two are generally small (< 1% in F- using however the single out-of-testset video for each dataset
Score). What is most interesting though is that regardless of that was mentioned in section VI-A in order to automatically
the value of Jv , y ∈ {V, V C} consistently performs better than select the values of weights wy as well as all other GSTG
the baseline y ∈ {V }. Furthermore, when additional features parameters (T , Jv , Ja ). For weights wy , LSE estimation is
are introduced (y ∈ {V, V C, A}, y ∈ {V, V C, A, AE}), the employed. Specifically, a value of 1 is assigned to each shot
F-Score curves as a function of Jv tend to become more boundary of the out-of-testset ground-truth-segmented video
flat, i.e. although {V C} introduces significant performance that is also a scene boundary, according to the ground-truth
improvement (particularly for the Movie dataset), GSTG is segmentation, and a value of 0 to each other shot boundary.
rather insensitive to the number of employed visual concept LSE estimates the weights wy that minimize the sum of
detectors. differences between the aforementioned values and pi (Eq. 3)
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 11
90 90
85 85
80 80
F−Score (%)
F−Score (%)
75 75
70 70
65 65
{V, VC, A, AE} {V, VC, A, AE}
60 {V} 60 {V}
{VC} {VC}
55 {V, VC} 55 {V, VC}
{V, VC, A} {V, VC, A}
50 50
0 20 40 60 80 100 0 20 40 60 80 100
Number of Visual Concepts Number of Visual Concepts
(a) (b)
90 90
85 85
80 80
F−Score (%)
F−Score (%)
75 75
70 70
65 65
{V, VC, A, AE} {V, VC, A, AE}
60 {V} 60 {V}
{VC} {VC}
55 {V, VC} 55 {V, VC}
{V, VC, A} {V, VC, A}
50 50
0 20 40 60 80 100 0 20 40 60 80 100
Number of Visual Concepts Number of Visual Concepts
(c) (d)
Fig. 6. F-Score as a function of the number Jv of visual concept detectors, (a) Documentary dataset, concepts selected according to AP , (b) Documentary
dataset, concepts selected according to ΔAP , (c) Movie dataset, concepts selected according to AP , (d) Movie dataset, concepts selected according to ΔAP .
91
70 Documentaries Documentaries
Movies Movies
90
68
89
F−Score (%)
F−Score (%)
66
88
64
62 87
60 86
58 85
20 30 40 50 60 70 80 20 30 40 50 60 70 80
Number of Audio Events Number of Audio Events
(a) (b)
Fig. 7. F-Score as a function of the number Ja of audio event detectors, (a) y ∈ {AE}, (b) y ∈ {V, V C, A, AE}.
for this video. Threshold T is then set to the value that of GSTG when the algorithm of [5] is used for individual
maximizes the F-Score attained for the same out-of-testset STG construction, while the F-Score values in parentheses
video, given the estimated weights; this value is determined correspond to the case where the fast approximation of section
by simple exhaustive search. Finally, the above optimization III is used instead, as part of GSTG. It can be seen that,
process is repeated for different selected values of Jv and Ja in comparison to the results of Table III, the F-Score in
(the same few values used for plotting Figs. 6 and 7), and almost all experiments has only been slightly reduced (F-
the set of parameters that leads to the maximum F-Score for Score differences of approx. 1%). The F-Score attained by the
the out-of-testset video is chosen. Although this may not be GSTG when all audiovisual features of section V are employed
the most elegant optimization process possible, it is a simple continues to be about 10 points higher that that of y ∈ {V },
one that requires use of just one out-of-testset ground-truth- and every one of the 4 examined types of features is shown to
segmented video for automatically estimating all parameters of have a non-negligible contribution. The conclusion here is that
GSTG. The results of using the estimated parameters on the automatic selection of GSTG parameter values using a simple
test datasets are reported in the first part of Table IV. Again, procedure and a single out-of-testset video of the same genre
the Coverage, Overflow and F-Score columns report the results is sufficient for getting very close to the upper performance
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 12
100 80
Documentaries Documentaries
95 Movies Movies
90 75
F−Score (%)
F−Score (%) 85
80 70
75
70 65
65
60 60
0 20 40 60 80 100 0 20 40 60 80 100
Weight W (%) Weight W
VC A
(a) (b)
80 100
Documentaries Documentaries
Movies 90 Movies
75 80
F−Score (%)
F−Score (%)
70
70 60
50
65 40
30
60 20
0 20 40 60 80 100 0 20 40 60 80 100
Weight WAE (%) Threshold T
(c) (d)
Fig. 8. F-Score as a function of weights wy and threshold T : (a) F-Score versus wV C , (b) F-Score versus wA , (c) F-Score versus wAE , (d) F-Score versus
T.
bounds identified in section VI-B. parameters are automatically selected with the use of an out-
For the purpose of comparing the proposed GSTG method of-testset ground-truth-segmented video that belongs to a dif-
with additional methods of the literature, besides the STG ferent genre (i.e., one documentary video is used for estimating
[5] (whose results, when it also exploits the probabilistic the GSTG parameters for the movie dataset, and similarly
technique introduced in section IV-B, are essentially those one movie video is used for the documentary dataset). For
reported above for y ∈ {V }), three additional methods are both datasets, this cross-genre automatic parameter selection
tested and their results are also reported in Table IV. These are results in F-Score differences of < 0.5%, compared to using
the very recent unimodal method of [12], which is based on an a same-genre video for this task. These results complement
elaborate sequence alignment technique, and the multimodal our previous findings about the insensitivity of the proposed
methods of [21] and [24], which similarly to GSTG combine technique to parameters (section VI-C), and indicate that the
visual and audio features. The latter method ([24]) is based GSTG can in practice be applied to different video genres
on a discriminative classifier (SVM) that realizes early fusion without using even one manually segmented video of the same
of the audio-visual features. For ensuring a fair comparison, genre, with minimal performance loss.
the same keyframes, audio segmentation results and high-
level audio features (where applicable) that are used by the
proposed approach were also used when experimenting with E. Applicability of GSTG to News videos
these three methods. It can be seen from the reported results In order to discuss the applicability of the GSTG approach
that the GSTG method significantly outperforms [12], [21] and to different video genres, most notably news-related videos,
[24]. These performance differences are caused by the use of we first need to make the distinction between two broad
a wealth of low- and high-level audiovisual features in the types of video content: loosely-structured content and tightly-
proposed approach, as opposed to just low-level features being structured one. We use the term “tightly-structured content”
used in [12], [21]. The proposed probabilistic merging process here to denote content that is known to follow a very specific
that effectively combines these features also contributes to structure. Examples of such video are the news bulletins of
improved performance, in comparison to simpler heuristics a single broadcaster: they tend to follow a structure that is
used in [21] for audiovisual feature combination, and also in characteristic of the broadcaster, e.g. each scene starts with
comparison to early fusion of low- and high-level audiovisual one anchor-person shot and is followed by external reporting
features used in [24]. shots. On the contrary, video genres such as documentaries,
Finally, in the last row of Table IV, results of the GSTG are movies, unedited news-related video etc., do not observe
reported for the case that the weights wy and all other GSTG such strict structuring rules, and consequently fall under the
IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1163-1177, August 2011 13
TABLE IV
GSTG PERFORMANCE , USING GSTG PARAMETER VALUES THAT WERE AUTOMATICALLY ESTIMATED WITH THE USE OF AN OUT- OF - TESTSET
GROUND - TRUTH - SEGMENTED VIDEO , AND COMPARISON WITH LITERATURE WORKS [12], [21], [24].
category of loosely-structured content. In the case of tightly- is reduced from about 40 minutes to less than 2. The pre-
structured content, it is evidently advantageous to develop processing and feature extraction processes excluded from
dedicated methods that exploit the knowledge about the con- the aforementioned time measurements clearly introduce some
tent’s structure (thus focusing, for example, on detecting the additional computational overhead; nevertheless, i) some of
anchor-person shots that may signify a scene change). The these processes (e.g. shot segmentation) are common to all
GSTG approach, on the other hand, similarly to most literature scene segmentation methods, ii) other processes (e.g. concept
works, is a generic approach that does not make any restrictive detection) are typically performed on the video as part of
assumptions about the structure of the video, thus is mostly its semantic analysis, and re-using their results also for the
suited for processing loosely-structured content. purpose of scene segmentation does not introduce additional
For examining how the GSTG performs on news-related computational cost, iii) real-time or near-real time imple-
content falling under the latter category, we used the third mentations for all of them generally exist (even for concept
dataset defined in section VI-A, which simulates unedited detection, e.g. [40]).
news video content. Application of GSTG to it (with the fast
STG approximation of section III being used as part of GSTG; VII. C ONCLUSIONS
y ∈ {V, V C, A, AE}) and looking for the experimental upper In this work a novel multimodal scene segmentation method,
bounds of performance (as in section VI-B) resulted in F- making use of high-level audiovisual features, was presented.
Score equal to 78.76%; automatically determining the GSTG’s As part of this method, algorithms were developed i) for a
parameters resulted in F-Scores equal to 77.91% and 77.83%, fast STG approximation, ii) for extending the STG so as to
when a documentary and a movie were used for cross-genre exploit non-visual input, and iii) for effectively combining
parameter selection, respectively (as in the last paragraph of STGs that were constructed with the use of different features,
section VI-D). In comparison, the F-Scores for the literature possibly coming from processing different modalities of the
works [12], [21] and [24] were 75.97%, 75.09% and 75.19%, audiovisual stream. New high-level features, such as model
respectively. vectors constructed with the help of large numbers of trained
F. Computational efficiency
The computational efficiency of the GSTG approach is experimentally shown to be high. Specifically, excluding the pre-processing of the audiovisual stream (e.g. shot segmentation) and feature extraction, the GSTG approach runs faster than real-time (approximately 60 frames per second) on a 3.0 GHz PC, considering y ∈ {V, VC, A, AE} and employing the method of [5] for individual STG construction. When, instead of the latter, the fast STG approximation introduced in this work is used as part of the GSTG, the frame processing rate rises to over 1200 frames per second, representing a speed-up of more than 20 times. As a result, the processing time for a 90-minute film (featuring 25 frames per second) is reduced from about 40 minutes to less than 2 minutes. The pre-processing and feature extraction processes excluded from the aforementioned time measurements clearly introduce some additional computational overhead; nevertheless, i) some of these processes (e.g. shot segmentation) are common to all scene segmentation methods, ii) other processes (e.g. concept detection) are typically performed on the video as part of its semantic analysis, and re-using their results also for the purpose of scene segmentation does not introduce additional computational cost, and iii) real-time or near-real-time implementations generally exist for all of them (even for concept detection, e.g. [40]).
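As a quick sanity check of the figures quoted above, the short script below reproduces the processing-time arithmetic for a 90-minute, 25 frames-per-second film at the two reported frame processing rates; the rates are those stated in the text, and the script itself is purely illustrative.

```python
# Processing-time arithmetic for a 90-minute film at 25 frames per second.
total_frames = 90 * 60 * 25          # 135,000 frames in total

baseline_rate = 60                   # frames/s: GSTG with the STG construction of [5]
fast_rate = 1200                     # frames/s: GSTG with the fast STG approximation

baseline_minutes = total_frames / baseline_rate / 60   # 37.5 min ("about 40 minutes")
fast_minutes = total_frames / fast_rate / 60           # 1.875 min ("less than 2")

print(f"{baseline_minutes:.1f} min -> {fast_minutes:.1f} min "
      f"(speed-up x{fast_rate // baseline_rate})")
```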
VII. CONCLUSIONS
In this work a novel multimodal scene segmentation method, making use of high-level audiovisual features, was presented. As part of this method, algorithms were developed i) for a fast STG approximation, ii) for extending the STG so as to exploit non-visual input, and iii) for effectively combining STGs that were constructed with the use of different features, possibly coming from processing different modalities of the audiovisual stream. New high-level features, such as model vectors constructed with the help of large numbers of trained visual concept detectors or audio event detectors, were presented and exploited by the proposed multimodal scene segmentation method. For training these detectors, existing annotated corpora were employed; these corpora were unrelated to the datasets used for experimentation in this work, leaving no doubt about the usefulness of the model-vector-based features on previously unseen datasets. The experimental results revealed the merit of the developed algorithms and documented the significance of introducing high-level audiovisual features in the scene segmentation task.
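To make the model-vector idea concrete, the sketch below shows one simple way such a feature could be formed: each shot (or audio segment) is represented by the vector of confidence scores returned by a bank of trained concept or event detectors, and shots can then be compared in this semantic space. The detector vocabulary, the scoring dictionaries and the distance measure are illustrative assumptions, not the implementation used in this work.

```python
import math

# Hypothetical detector vocabulary; the paper instead uses large sets of
# TRECVID-type visual concept detectors and audio event detectors.
CONCEPTS = ["person", "outdoor", "vehicle", "speech", "music"]

def model_vector(detector_scores):
    """Stack per-concept confidence scores into a fixed-length vector."""
    return [detector_scores.get(c, 0.0) for c in CONCEPTS]

def cosine_distance(u, v):
    """One possible (dis)similarity between two shots in the semantic space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v) if norm_u and norm_v else 1.0

# Illustrative (made-up) detector outputs for two shots:
shot_a = model_vector({"person": 0.9, "speech": 0.8, "outdoor": 0.1})
shot_b = model_vector({"person": 0.85, "speech": 0.7, "music": 0.2})
print(round(cosine_distance(shot_a, shot_b), 3))  # small distance: semantically similar shots
```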
REFERENCES
[1] G. Boccignone, A. Chianese, V. Moscato, and A. Picariello, “Foveated shot detection for video segmentation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, pp. 365–377, 2005.
[2] A. Hanjalic, R. L. Lagendijk, and J. Biemond, “Automated high-level movie segmentation for advanced video-retrieval systems,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 4, pp. 580–588, 1999.
[3] R. Jain, “EventWeb: Developing a human-centered computing system,” IEEE Computer, vol. 41, no. 2, pp. 42–50, 2008.
[4] A. F. Smeaton, P. Over, and A. R. Doherty, “Video shot boundary detection: Seven years of TRECVID activity,” Computer Vision and Image Understanding, vol. 114, no. 4, pp. 411–418, April 2010.
[5] M. Yeung, B.-L. Yeo, and B. Liu, “Segmentation of video by clustering and graph analysis,” Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94–109, 1998.
[6] W. Tavanapong and J. Zhou, “Shot clustering techniques for story browsing,” IEEE Transactions on Multimedia, vol. 6, no. 4, pp. 517–527, 2004.
[7] Z. Rasheed and M. Shah, “Detection and representation of scenes in videos,” IEEE Transactions on Multimedia, vol. 7, no. 6, pp. 1097–1105, 2005.
[8] Y. Zhai and M. Shah, “Video scene segmentation using Markov chain Monte Carlo,” IEEE Transactions on Multimedia, vol. 8, no. 4, pp. 686–697, 2006.
[9] C.-W. Ngo, Y.-F. Ma, and H.-J. Zhang, “Video summarization and scene detection by graph modeling,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 2, pp. 296–305, 2005.
[10] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[11] Y. Zhao, T. Wang, P. Wang, W. Hu, Y. Du, Y. Zhang, and G. Xu, “Scene segmentation and categorization using NCuts,” in Proc. CVPR, 2007, pp. 1–7.
[12] V. T. Chasanis, A. C. Likas, and N. P. Galatsanos, “Scene detection in videos using shot clustering and sequence alignment,” IEEE Transactions on Multimedia, vol. 11, no. 1, pp. 89–100, 2009.
[13] Y. Cao, W. Tavanapong, K. Kim, and J. Oh, “Audio-assisted scene segmentation for story browsing,” in Proc. ACM CIVR, 2003, pp. 446–455.
[14] Y. Zhu and D. Zhou, “Scene change detection based on audio and video content analysis,” in Proc. Fifth International Conference on Computational Intelligence and Multimedia Applications, 2003, pp. 229–234.
[15] H. Sundaram and S.-F. Chang, “Video scene segmentation using video and audio features,” in Proc. IEEE ICME, 2000, pp. 1145–1148.
[16] S. Rho and E. Hwang, “Video scene determination using audiovisual data analysis,” in Proc. 24th International Conference on Distributed Computing Systems Workshops, 2004, pp. 124–129.
[17] S.-C. Chen, M.-L. Shyu, W. Liao, and C. Zhang, “Scene change detection by audio and video clues,” in Proc. IEEE ICME, 2002, pp. 365–368.
[18] A. Chianese, V. Moscato, A. Penta, and A. Picariello, “Scene detection using visual and audio attention,” in Proc. 2008 Ambi-Sys Workshop on Ambient Media Delivery and Interactive Television, 2008.
[19] S. Pfeiffer, R. Lienhart, and W. Effelsberg, “Scene determination based on video and audio features,” Multimedia Tools and Applications, vol. 15, no. 1, pp. 59–81, 2001.
[20] A. Velivelli, C.-W. Ngo, and T. S. Huang, “Detection of documentary scene changes by audio-visual fusion,” in Proc. ACM CIVR, 2004, pp. 227–237.
[21] N. Nitanda, M. Haseyama, and H. Kitajima, “Audio signal segmentation and classification for scene-cut detection,” in Proc. IEEE International Symposium on Circuits and Systems, vol. 4, 2005, pp. 4030–4033.
[22] V. Parshin, A. Paradzinets, and L. Chen, “Multimodal data fusion for video scene segmentation,” in Proc. International Conference on Visual Information and Information Systems, 2005, pp. 279–289.
[23] N. Goela, K. Wilson, F. Niu, and A. Divakaran, “An SVM framework for genre-independent scene change detection,” in Proc. IEEE ICME, 2007, pp. 532–535.
[24] K. Wilson and A. Divakaran, “Discriminative genre-independent audio-visual scene change detection,” in Proc. SPIE Conference on Multimedia Content Access: Algorithms and Systems III, vol. 7255, 2009.
[25] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, and I. Trancoso, “Multi-modal scene segmentation using scene transition graphs,” in Proc. ACM Multimedia, 2009, pp. 665–668.
[26] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso, “On the use of audio events for improving video scene segmentation,” in Proc. WIAMIS, 2010.
[27] J. Smith, M. Naphade, and A. Natsev, “Multimedia semantic indexing using model vectors,” in Proc. IEEE ICME, Baltimore, MD, USA, July 2003, pp. 445–448.
[28] N. Rasiwasia, P. Moreno, and N. Vasconcelos, “Bridging the gap: Query by semantic example,” IEEE Transactions on Multimedia, vol. 9, no. 5, pp. 923–938, 2007.
[29] C. Snoek, M. Worring, J. van Gemert, J.-M. Geusebroek, and A. Smeulders, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in Proc. ACM Multimedia, Santa Barbara, USA, October 2006, pp. 421–430.
[30] B. Manjunath, J.-R. Ohm, V. Vasudevan, and A. Yamada, “Color and texture descriptors,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 6, pp. 703–715, June 2001.
[31] V. Mezaris, P. Sidiropoulos, A. Dimou, and I. Kompatsiaris, “On the use of visual soft semantics for video temporal decomposition to scenes,” in Proc. IEEE International Conference on Semantic Computing (ICSC), 2010.
[32] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. Neto, “A prototype system for selective dissemination of broadcast news in European Portuguese,” EURASIP Journal on Advances in Signal Processing, vol. 2007, no. 37507, May 2007.
[33] H. Meinedo, PhD Thesis, IST, Technical University of Lisbon, Portugal, March 2008.
[34] M. Bugalho, J. Portelo, I. Trancoso, T. Pellegrini, and A. Abad, “Detecting audio events for semantic video search,” in Proc. Interspeech, 2009.
[35] I. Trancoso, T. Pellegrini, J. Portelo, H. Meinedo, M. Bugalho, A. Abad, and J. Neto, “Audio contributions to semantic video search,” in Proc. IEEE ICME, 2009, pp. 630–633.
[36] G. Chavez, M. Cord, S. Philip-Foliguet, F. Precioso, and A. Araujo, “Robust scene cut detection by supervised learning,” in Proc. EUSIPCO, Florence, Italy, September 2006.
[37] E. Tsamoura, V. Mezaris, and I. Kompatsiaris, “Gradual transition detection using color coherence and other criteria in a video shot meta-segmentation framework,” in Proc. IEEE ICIP-MIR, 2008, pp. 45–48.
[38] J. Vendrig and M. Worring, “Systematic evaluation of logical story unit segmentation,” IEEE Transactions on Multimedia, vol. 4, no. 4, pp. 492–499, 2002.
[39] J. Yang and A. Hauptmann, “(Un)Reliability of video concept detection,” in Proc. ACM CIVR, Niagara Falls, Canada, July 2008.
[40] J. Uijlings, A. Smeulders, and R. Scha, “Real-time bag of words, approximately,” in Proc. ACM CIVR, 2009.