Integrated Image and Speech Analysis for Content-Based Video Indexing

Yuh-Lin Chang   Wenjun Zeng   Ibrahim Kamel   Rafael Alonso†
Matsushita Information Technology Laboratory
Panasonic Technologies, Inc.
2 Research Way
Princeton, NJ 08540-6628, USA
e-mail: {yuhlin,kevin,ibrahim,[email protected]

* A Ph.D. candidate at the Department of Electrical Engineering, Princeton University.
† Currently with David Sarnoff Research Center, SRI.

Abstract

In this paper we study an important problem in multimedia databases, namely, the automatic extraction of indexing information from raw data based on video content. The goal of our research project is to develop a prototype system for automatic indexing of sports videos. The novelty of our work is that we propose to integrate speech understanding and image analysis algorithms for extracting information. The main thrust of this work comes from the observation that in news or sports video indexing, speech analysis is usually more efficient at detecting events than image analysis. Therefore, in our system, the audio processing modules are first applied to locate candidates in the whole data set. This information is passed to the video processing modules, which further analyze the video. The final products of video analysis are pointers to the locations of interesting events in a video. Our algorithms have been tested extensively with real TV programs, and the results are presented and discussed in the paper.

1. Introduction

The content-based video indexing problem has attracted much attention recently [3, 9, 12]. The applications of such research work include digital libraries, non-linear video editing, video-on-demand services, etc. In this paper we propose a novel approach to the video information extraction problem that is based on the integration of speech understanding and image analysis algorithms. The goal of our research project is to develop a prototype system for automatic indexing of sports (in particular, football) videos. The sports video analysis problem has been studied by other researchers before [4, 6]. However, to our knowledge, our work is the first that uses knowledge from both the audio and video domains.

The main thrust of this work comes from two observations: in news or sports video indexing, a very important aspect is the detection of the occurrence of important events, and for such videos, speech analysis is usually more efficient at detecting events than image analysis. Therefore, we propose to use speech analysis to detect important events first, and then apply image analysis algorithms for further processing.

Figure 1 gives a global view of our work. There are three major components in our system: audio processing, video processing, and the demo video database. We first digitize the video and audio data from regular video tapes. We then apply the audio processing modules to locate candidates in the data. This information is passed to the video processing modules, which further analyze the video. The results of video analysis are pointers to the locations of interesting events. We put the indexed video on a LAN-based video-on-demand (VOD) server, the StarWorks VOD server, and we also developed a demo video database client that can retrieve the indexed video from a PC running MS-Windows.
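The resulting control flow is simple. Below is a minimal sketch of the two-stage pipeline; the stub functions stand in for the audio and video modules described in Sections 2 and 3, and all names and signatures here are illustrative, not the system's actual interfaces.

    # Two-stage indexing: cheap audio analysis proposes candidates,
    # expensive video analysis verifies them. All functions are
    # illustrative stubs for the modules described in Sections 2-3.
    def detect_keyword_times(audio, sr):
        return [35.0]                      # stub: candidate times (seconds)

    def cheering_near(audio, sr, t):
        return True                        # stub: cheering detector

    def touchdown_verified(frames, fps, t):
        return True                        # stub: shot model matching

    def index_video(audio, sr, frames, fps=15):
        events = []
        for t in detect_keyword_times(audio, sr):
            if cheering_near(audio, sr, t):     # fuse audio cues (Sec. 5.1)
                # Video analysis is limited to [t - 1 min, t + 2 min] (Sec. 3).
                if touchdown_verified(frames, fps, t):
                    events.append(t)            # pointer to the event
        return events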
[Figure 1. Overview of the video and audio processing modules.]

While our current work focuses on the analysis of football video, extension to other domains should not be a problem, since we adopt a "toolbox" approach. Our audio and video analysis algorithms are implemented using the tools we developed for the Khoros system [1]. Khoros provides a nice environment for both application integration and fast prototyping. Therefore, it should be easy to incorporate both new domain knowledge and new data analysis algorithms into our framework.

The rest of the paper is organized as follows. In the next section, we explain the audio processing modules, and in Section 3, we present the video processing algorithms. Section 4 then describes the implementation of a demo video database, and Section 5 discusses the results of applying our algorithms to real data. Finally, Section 6 concludes the paper.

2. Audio Signal Analysis

We first extract information from the data using audio processing, since its computation is less expensive than image processing. Two types of audio signal processing are experimented with in this paper: word spotting and cheering detection. They are explained in the following subsections.

2.1. Word Spotting

One important observation from watching TV sports programs over many years is that in such programs, the information content in the audio is highly correlated with the information content in the video. After all, a sports reporter's job is to inform viewers of what is happening on the field. Therefore, if we can detect important keywords such as "touchdown" or "fumble" in the audio stream, we can use them as a coarse filter to locate candidates for important events.

Keyword spotting is an important application of speech recognition, and it has attracted growing research interest lately [7, 8, 11]. Currently we use a simple, template-matching based approach to spotting keywords. We are aware of more sophisticated and robust algorithms for the problem [7, 8, 11], but for the preliminary implementation we chose a simpler approach, mainly because our current application differs from traditional keyword spotting in the following aspects.

- In our system, audio processing is used as "pre-processing" for video analysis; consequently, false alarms are not a major concern.
- Speaker independence is also not a major concern, since we can assume we know who the reporters are a priori.

The algorithm and part of the implementation are based on a public domain package, Lotec [10]. Figure 2 illustrates the algorithm. For a preliminary implementation, the template-based spotting algorithm works surprisingly well.

[Figure 2. The wordspotting algorithm.]

To detect keywords in an audio stream, we first extract features and then match the feature vectors against a set of pre-computed templates.
- Feature extraction. Filter banks are used to extract features from the audio data. The following procedures are involved.
  1. Noise reduction. To reduce the effect of background noise, we first collect noise statistics from the training data. This information is then used to filter out noise in the test data.
  2. Segmentation. The audio stream is split into segments of fixed size, 10 ms each.
  3. Filter banks. We first transform the data to the frequency domain by FFT. A set of eight overlapping filters is then applied to the Fourier magnitude, and the log of the total energy in each bank is computed and used as "features" to represent the audio sample. The filters we use cover frequencies from 150 to 4000 Hz.
- Feature matching. Feature vectors are matched against templates, which are obtained from the training data. Currently the normalized distance is used to measure similarity. The distance between a template and the test data is defined as the Euclidean distance between the two 8-dimensional feature vectors. The distance is then normalized by the sum of energy in each template. After matching, the best matches from all templates are sorted according to distance. We use the inverse of the distance to represent the confidence of a match. If the confidence is greater than a pre-set threshold, we declare the detection of a candidate.
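A minimal sketch of these two steps follows, assuming 22 kHz audio (Section 5). The triangular bank shape, the even bank spacing, and the frame-by-frame alignment of test features against template features are our assumptions; the paper specifies only 10 ms frames, eight overlapping banks over 150-4000 Hz, log energies, and a template-normalized Euclidean distance.

    import numpy as np

    def filterbank_features(frame, sr=22050, n_banks=8):
        """Log energy in eight overlapping triangular banks over 150-4000 Hz."""
        mag = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        edges = np.linspace(150.0, 4000.0, n_banks + 2)    # assumed even spacing
        feats = []
        for k in range(n_banks):
            lo, center, hi = edges[k], edges[k + 1], edges[k + 2]
            w = np.interp(freqs, [lo, center, hi], [0.0, 1.0, 0.0])  # overlap
            feats.append(np.log(np.sum(w * mag ** 2) + 1e-10))
        return np.array(feats)

    def feature_sequence(signal, sr=22050, frame_ms=10):
        """Split the signal into 10 ms frames and compute features per frame."""
        n = int(sr * frame_ms / 1000)
        return np.array([filterbank_features(signal[i:i + n], sr)
                         for i in range(0, len(signal) - n + 1, n)])

    def match_confidence(test_feats, template_feats):
        """Inverse of the template-normalized Euclidean distance."""
        m = min(len(test_feats), len(template_feats))
        dist = np.linalg.norm(test_feats[:m] - template_feats[:m])
        dist /= np.sum(template_feats[:m] ** 2) + 1e-10    # normalize by energy
        return 1.0 / (dist + 1e-10)                        # confidence score

A keyword candidate is declared when the best confidence over all templates exceeds the pre-set threshold (25 in Section 5).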
2.2. Cheering Detection

Crowd cheering can be a powerful tool for indexing sports videos in general because it indicates "interesting" events during the game, for example, a touchdown (or scoring), a fumble, clever passing, an exciting run, etc. Unlike word-spotting, cheering detection is general and game/speaker independent. In this subsection we describe our algorithm for detecting crowd cheering in a football game video using the audio stream only. Our main goal is to build a simple and fast cheering detection module that distinguishes crowd cheering from reporter chat. We process the audio signals in the time domain to quantify the frequency of silence spots. The basic assumption is that there are little or no silence spots in cheering segments, while there are quite a few silence spots in reporter chat.

The outline of the cheering detection scheme is as follows. The audio signal is divided into small units, 300 msec each, and processed sequentially from beginning to end. Each unit is processed by a classifier, shown in Figure 3, which marks each audio unit as either a cheering unit or a chatting unit. A cheering segment is identified only when m cheering units are detected in sequence, where m is an integer constant that is determined experimentally.

In Figure 3, the Envelope Extraction module calculates the covering envelope for the audio unit by computing the local maxima of the absolute values of the signal. The result is then smoothed using a low-pass filter. The last step in the cheering classifier calculates the peak-to-peak value of the smoothed envelope. If the peak-to-peak value is greater than a threshold θ, the audio unit is marked as regular chat; otherwise it is marked as a cheering unit. Of course, the threshold θ is a function of the microphone volume. In our experiments we determined the value of θ experimentally. Currently, we are considering a formula that calculates θ as a function of the average signal value.

[Figure 3. Block diagram for the cheering classifier: Envelope Extraction, Low-pass Filter, Peak-to-peak Estimation.]
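A minimal sketch of the classifier and the run-length rule, assuming samples normalized to [-1, 1]; the envelope window, the moving-average smoother, θ = 0.2, and m = 3 are illustrative assumptions (the paper determines θ and m experimentally).

    import numpy as np

    def is_cheering_unit(unit, env_win=64, smooth_win=8, theta=0.2):
        """Classify one 300 ms audio unit as cheering (True) or chat (False)."""
        usable = len(unit) - (len(unit) % env_win)
        # Envelope extraction: local maximum of absolute sample values.
        env = np.abs(np.asarray(unit[:usable])).reshape(-1, env_win).max(axis=1)
        # Low-pass filtering: a moving average as the smoother.
        smooth = np.convolve(env, np.ones(smooth_win) / smooth_win, mode="valid")
        # Cheering has few silence gaps, so its envelope stays flat: a small
        # peak-to-peak value means cheering, a large one means chat.
        return (smooth.max() - smooth.min()) < theta

    def cheering_segments(signal, sr=22050, m=3):
        """Start times (seconds) of runs of at least m cheering units."""
        n = int(0.3 * sr)                                  # 300 ms units
        flags = [is_cheering_unit(signal[i:i + n])
                 for i in range(0, len(signal) - n + 1, n)]
        starts, run = [], 0
        for i, is_cheer in enumerate(flags):
            run = run + 1 if is_cheer else 0
            if run == m:                                   # m consecutive units
                starts.append((i - m + 1) * 0.3)
        return starts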
3. Video Information Analysis

The candidates detected by the audio analysis modules are further examined by the video analysis modules. Assuming that a touchdown candidate is located at time t, we apply video analysis only to the region [t - 1 min, t + 2 min]. The assumption we employ here is that a touchdown event should begin and end within that time range. In video processing, the original video sequence is broken down into discrete shots. Key frames are extracted from each shot, and shot identification is then applied to them to verify the existence of a touchdown.

3.1. Shot Segmentation

We use a well-known video shot segmentation algorithm based on histogram differences [5]. Figure 4 illustrates the flowchart of operations, and Figure 5 shows the implementation in Khoros. Basically, a cut is detected if a frame's histogram is considered "substantially different" from that of its previous frame, as defined by the χ² comparison:

    \sum_{i=1}^{G} \frac{(H_t(i) - H_{t-1}(i))^2}{H_t(i)},        (1)

where H_t is the histogram for time t, and G is the total number of colors in an image.

[Figure 5. Khoros workspace for video shot segmentation.]
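A minimal sketch of Eq. (1) on grayscale frames; the bin count, the peak threshold, and the guard against empty bins are our assumptions (the Khoros module list that follows implements the same pipeline, locating peaks in the difference sequence).

    import numpy as np

    def chi2_difference(frame_t, frame_prev, bins=64):
        """Eq. (1): chi-squared difference between consecutive histograms."""
        h_t, _ = np.histogram(frame_t, bins=bins, range=(0, 256))
        h_p, _ = np.histogram(frame_prev, bins=bins, range=(0, 256))
        return np.sum((h_t - h_p) ** 2 / np.maximum(h_t, 1))  # guard empty bins

    def detect_cuts(frames, threshold=5000.0):
        """Indices where the difference sequence peaks above the threshold."""
        return [t for t in range(1, len(frames))
                if chi2_difference(frames[t], frames[t - 1]) > threshold]

    def key_frames(frames, cuts):
        """Represent each shot by its first frame (cf. Store Kframe below)."""
        return [frames[s] for s in [0] + cuts]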
To implement this algorithm, we have utilized both native Khoros modules and modules developed by us.

- Import AVI converts an AVI-encoded data stream into VIFF, and Video Histogram computes the histogram of a VIFF video.
- Translate is a Khoros function for shifting a VIFF object in time, and Subtract subtracts two VIFF objects.
- Square is a Khoros function for applying the squaring operation to a VIFF object, and Divide divides two VIFF objects.
- Statistics is a Khoros function for computing the statistics of a VIFF object.
- Shot Segment detects the shot transition boundaries by locating the peaks in the histogram difference sequence.
- Store Kframe extracts the representative frames from each shot and stores them as a new VIFF video. Currently, we use the first and/or the last frame to represent the whole shot.

[Figure 4. Flowchart of the video shot segmentation algorithm.]

3.2. Shot Identification

We propose a model-based approach to identifying the contents of key frames. In particular, for a touchdown sequence, we define an ideal model for shot transitions, as shown in Figure 6. Basically, a touchdown sequence should start with the two teams lining up on the field. The word "touchdown" is usually announced in the middle or at the end of the action shot, which is followed by some kind of commentary and replay. To conclude a touchdown sequence, the scoring team usually kicks an extra point. We note that our model may cover most but not all possible touchdown sequences. However, for a preliminary implementation, our simple model provides very satisfactory results.

[Figure 6. The ideal shot transition model for a touchdown sequence.]

Starting with the candidate location supplied by audio analysis, our system looks backward and forward a few shots to fit the model to the video data. If there is high confidence in the match, then a touchdown event is declared detected.
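A minimal sketch of this matching step, assuming shot identification yields one of a few discrete labels per shot; the label set, the ±4-shot window, and the score threshold are illustrative, since the paper does not specify its confidence measure.

    # Ideal shot-transition model of Figure 6, in temporal order.
    # The labels and scoring scheme are assumptions for illustration.
    MODEL = ["lineup", "action", "replay", "kick"]

    def fits_touchdown_model(shot_labels, candidate_shot, span=4,
                             min_score=0.75):
        """Match labeled shots around the audio candidate against MODEL."""
        lo = max(0, candidate_shot - span)
        hi = min(len(shot_labels), candidate_shot + span + 1)
        window = shot_labels[lo:hi]
        # Score: fraction of model states found, in order, in the window.
        found, i = 0, 0
        for label in window:
            if i < len(MODEL) and label == MODEL[i]:
                found += 1
                i += 1
        return found / len(MODEL) >= min_score

    # Example: labels as produced by shot identification (Section 3.2).
    labels = ["replay", "lineup", "action", "replay", "kick", "lineup"]
    print(fits_touchdown_model(labels, candidate_shot=2))   # True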
To identify shots, the features of interest on which the model is based need to be extracted. In football videos, possible features of interest are line marks, players' numbers, the end zone, goal posts, etc. In particular, for the detection of a touchdown sequence as modeled in Figure 6, our preliminary work focuses on the detection of line marks and goal posts in the lining-up and kicking shots, respectively. For the lining-up shot, the line marks usually appear as parallel lines oriented in roughly diagonal directions. On the other hand, goal posts almost always show up as strong vertical lines. Both line marks and goal posts should be relatively long with respect to the image size.

Our line extraction work is based on the Object Recognition Toolkit [2], which we modified and incorporated into the Khoros system. For each shot, we have one or two representative frames. The gradient operation is first applied to these representative frames to detect edges. The edge pixels are then converted into lists of connected pixels by Pixel Chaining. The chain lists are segmented into straight-line segments, which are further grouped into parallel lines. The parallel line pairs are then filtered by length and orientation. For example, to qualify as a goal post, the detected parallel lines should be long and vertically oriented. Similarly, the detected parallel lines should be long and diagonally oriented to be potential candidates for line marks.

Currently, we use only the intensity values of an image for line extraction. In the future, other attributes, such as color and texture, should be incorporated to improve accuracy and robustness.
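A minimal sketch of the final length/orientation filter, applied to straight-line segments produced by the chaining and grouping stages; the angle tolerances and the relative-length threshold are our assumptions.

    import numpy as np

    def angle_deg(seg):
        """Orientation of a segment (x1, y1, x2, y2) in [0, 180) degrees."""
        x1, y1, x2, y2 = seg
        return abs(np.degrees(np.arctan2(y2 - y1, x2 - x1))) % 180

    def length(seg):
        x1, y1, x2, y2 = seg
        return np.hypot(x2 - x1, y2 - y1)

    def classify_segment(seg, img_h, min_rel_len=0.3):
        """Label a long segment as a goal-post or line-mark candidate."""
        if length(seg) < min_rel_len * img_h:       # must be long
            return None
        a = angle_deg(seg)
        if abs(a - 90) < 10:                        # near vertical
            return "goal_post"
        if 25 < a < 65 or 115 < a < 155:            # near diagonal
            return "line_mark"
        return None

    # Example: a tall, nearly vertical segment in a 192-pixel-high frame.
    print(classify_segment((100, 20, 102, 150), img_h=192))  # goal_post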

4. Demo Video Database

To demonstrate the results of automatic video indexing, we also built a simple demo video database system running under MS/VfW (Microsoft Video for Windows). The demo video database system has two parts, the server and the client.

- Server. We use the StarWorks VOD system (from Starlight Networks Inc.) as the server. The server runs on an EISA-bus PC-486/66 with the Lynx realtime OS and 4 GB of storage space. A PC/Windows client can connect to the server through regular 10-BaseT Ethernet. The server can guarantee realtime delivery of data (video and audio) streams of up to 12 Mbps (megabits per second) via two Ethernet segments.
- Client. We developed a video player for MS/VfW that utilizes the indexing information when retrieving AVI video data. Using our video player, a user can move directly to the next or previous shot/play/event. Such search capabilities are complementary to the traditional linear fast-forward/backward movements.

5. Examples

Our algorithms have been tested extensively with real TV programs. Table 1 summarizes the data used in the experiments. We captured in total about 45 minutes of video and audio data from two football games. Most of the data are from Super Bowl XXIX, played in Jan. 1995, while one sequence is from the Chicago vs. Minnesota game played in Oct. 1995. Both games were broadcast by ABC. The data are separated into two groups: the 1st group (from the 1st half) is used for training, and the 2nd group (from the 2nd half) is used for testing. Only data from the 1st group are used to train system parameters, and we report the results of applying our algorithms to the test group. The video resolution is 256 by 192 at 15 frames per second. The audio data rate is 22 kHz with 8 bits per sample.

    Group  Name      # of frames  Game                   TD
    1st    td1       1,297        SB 1-H                 Yes
           td2       2,262        SB 1-H                 Yes
           td3       1,694        SB 1-H                 Yes
    2nd    2ndhalf1  7,307        SB 2-H                 No
           2ndhalf2  6,919        SB 2-H                 No
           2ndhalf3  6,800        SB 2-H                 Yes
           2ndhalf4  5,592        SB 2-H                 No
           2ndhalf5  2,661        SB 2-H                 Yes
           2ndhalf6  2,774        SB 2-H                 Yes
           2ndhalf7  2,984        SB 2-H                 Yes
           newgame1  2,396        Chicago vs. Minnesota  Yes

    Table 1. Summary information for the audio/video data.
5.1. Audio Processing Results

We first discuss the results of audio processing on the eight test sets. Figure 7 shows the results of wordspotting. The graphs are arranged from left to right, and from top to bottom. In each graph, the X-axis is time, and the Y-axis indicates confidence. The higher the confidence, the more likely the existence of a touchdown. From the training data, the wordspotting threshold is set to 25. Figure 8 shows the results of cheering detection. The regions of 1's indicate the presence of cheering, while those of 0's indicate its absence. From the training data, the double thresholds used in cheering detection are set to 20 and 60. Table 2 summarizes the audio processing results. In general, our simplistic wordspotting algorithm gives quite reliable results. Of the five touchdowns in the test data, only the one in 2ndhalf7 is not detected. The miss is mainly due to the fact that in 2ndhalf7, the touchdown is announced in a way different from the three templates we use. One possible remedy is to reduce the threshold to 10, but this generates a lot of false alarms (45, to be exact). A better way is to collect more samples for the templates. An even better approach is to use more robust matching algorithms, such as dynamic time warping or HMMs (hidden Markov models). To combine the results from wordspotting and cheering detection, we apply a simple logical AND. However, we note that it is also possible to use other methods, such as a weighted sum. We shall further investigate this information fusion aspect in the future.
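A minimal sketch of that logical-AND fusion, assuming both detectors report event times in seconds; the coincidence window is our assumption.

    def fuse_and(keyword_hits, cheering_starts, window=10.0):
        """Keep keyword hits that coincide with a detected cheering segment."""
        return [t for t in keyword_hits
                if any(abs(t - c) <= window for c in cheering_starts)]

    # Example with hypothetical detection times (seconds):
    print(fuse_and([35.0, 410.5], [32.1, 600.0]))   # [35.0]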

[Figure 7. Wordspotting results for the first and the second four test sets.]

[Figure 8. Cheering detection results for the first and the second four test sets.]

    Algorithm                Correct detect  Miss detect   False alarms
    Wordspot                 4 (out of 5)    1 (out of 5)  3
    Wordspot + Cheer detect  4 (out of 5)    1 (out of 5)  1

    Table 2. Audio detection results.
5.2. Video Processing Results

We now present the results of shot segmentation. The test data 2ndhalf2 is used as an example. Only 1,471 frames are processed, because we are only interested in the region around the candidate detected by the audio processing modules. Figure 9 shows the beginning of this video and the segmentation results. The key frames extracted by the segmentation process are shown in Figure 10. These 12 frames are arranged from left to right, and from top to bottom, according to their temporal order.

[Figure 9. The first frame of 2ndhalf2 and its cut detection results.]

[Figure 10. Collection of the first frame in each shot for 2ndhalf2.]

Finally, the results of shot identification are demonstrated. Basically, if a touchdown event fits our model and the kicking shot is correctly detected by the segmentation algorithm, then the line extraction algorithm should have no problem detecting the goal post. Line mark detection is more difficult, but our line extractor works quite well nonetheless. We expect better results for extracting line marks when we incorporate color information into the edge detector. On the other hand, currently the purpose of identifying the lining-up shot is mainly to determine the start of the touchdown act. As a result, we may use only the kicking shot detection to extract a touchdown sequence.

    Algorithm      Correct detect  Miss detect   False alarms
    Shot Identify  4 (out of 5)    1 (out of 5)  0

    Table 3. Video analysis results.

Table 3 presents the video analysis results. Of the five test sets with touchdowns, 2ndhalf6 does not fit our model because its touchdown starts with a kick-off (instead of lining up) and ends with a 2-point conversion (instead of kicking an extra point). Finally, Figure 11 illustrates the lining-up and kicking shots identified by our algorithms.

[Figure 11. Lining-up and kicking shots located for 2ndhalf3, 2ndhalf5, and 2ndhalf7.]

6. Conclusion

In this paper we have presented a novel approach to automatically extracting important information from football videos. Our system integrates speech understanding and image analysis algorithms, so that we can maximize detection accuracy and minimize computation cost at the same time. Our algorithms have been tested extensively with real data captured from TV programs. The preliminary results demonstrate the feasibility of our approach. In the future, we may work on the following topics to improve the system.

- More test data and more robust wordspotting algorithms.
- More sophisticated shot segmentation algorithms with good shot transition models.
- Other shot representation schemes, such as the mosaic used in the QBIC system [3].
- Detecting other events, such as fumbles.
References

[1] Y.-L. Chang and R. Alonso. Developing a multimedia toolbox for the Khoros system. In SPIE Proceedings, Multimedia: Full-Service Impact on Business, Education, and Home, October 1995.
[2] A. Etemadi. Robust segmentation of edge data. Technical report, University of Surrey, U.K., 1992.
[3] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, 1995.
[4] Y. Gong et al. Automatic parsing of TV soccer programs. In The 2nd IEEE International Conference on Multimedia Computing, pages 167-174, May 1995.
[5] A. Hampapur and T. Weymouth. Digital video segmentation. In The 2nd ACM Int'l Conf. on Multimedia, pages 357-364, Oct. 1994.
[6] S. S. Intille and A. F. Bobick. Tracking using a local closed-world assumption: tracking in the football domain. Technical Report TR-296, M.I.T., Aug. 1994.
[7] K. M. Knill and S. J. Young. Speaker dependent keyword spotting for accessing stored speech. Technical Report TR-193, Cambridge University Engineering Department, Oct. 1994.
[8] R. C. Rose and E. M. Hofstetter. Techniques for robust word spotting in continuous speech messages. In Proc. Eurospeech, pages 1183-1186, Sep. 1991.
[9] S. W. Smoliar and H. Zhang. Content-based video indexing and retrieval. IEEE Multimedia, 1(2):62-75, 1994.
[10] N. Ward. The Lotec Speech Recognition Package. ftp.sanpo.t.u-tokyo.ac.jp:/pub/nigel/lotec, 1994.
[11] L. D. Wilcox and M. A. Bush. Training and search algorithms for an interactive wordspotting system. In Proc. ICASSP, 1992.
[12] A. Yoshitaka et al. Knowledge-assisted content-based retrieval for multimedia databases. IEEE Multimedia, 1(4):12-20, 1994.

