Integrated Image and Speech Analysis for Content-Based Video Indexing
Yuh-Lin Chang, Wenjun Zeng, Ibrahim Kamel, Rafael Alonso
Matsushita Information Technology Laboratory
Panasonic Technologies, Inc.
2 Research Way
Princeton, NJ 08540-6628, USA
e-mail: {yuhlin,kevin,ibrahim,alonso}@email.com
is greater than a pre-set threshold, we declare the detection of a candidate.

Figure 3. Block diagram for the cheering classifier.
2.2. Cheering Detection

Crowd cheering can be a powerful tool for indexing sports videos in general because it indicates "interesting" events during the game, for example, a touchdown (or other scoring), a fumble, clever passing, an exciting run, etc. Unlike word-spotting, cheering detection is general and game/speaker independent. In this subsection we describe our algorithm for detecting crowd cheering in a football game video using the audio stream only. Our main goal is to build a simple and fast cheering detection module that distinguishes crowd cheering from reporter chat. We process the audio signals in the time domain to quantify the frequency of the silence spots. The basic assumption is that

3. Video Information Analysis

The candidates detected by the audio analysis modules are further examined by the video analysis modules. Assuming that a touchdown candidate is located at time t, we apply video analysis only to the region [t - 1 min, t + 2 min]. The assumption we employ here is that a touchdown event should begin and end within that time range. In video processing, the original video sequence is broken down into discrete shots. Key frames are extracted from each shot, and shot identification is then applied to them to verify the existence of a touchdown.
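As an illustrative sketch (not the paper's Khoros implementation; function and parameter names are our own), mapping an audio candidate at time t to the frame interval [t - 1 min, t + 2 min] could look like:

```python
def candidate_frame_range(t_sec, fps, num_frames,
                          before_sec=60.0, after_sec=120.0):
    """Map an audio candidate at time t (seconds) to the frame
    interval [t - 1 min, t + 2 min], clipped to the video length."""
    start = max(0, int((t_sec - before_sec) * fps))
    end = min(num_frames - 1, int((t_sec + after_sec) * fps))
    return start, end

# Example: a candidate at t = 300 s in a 30 fps, 20-minute video.
print(candidate_frame_range(300.0, 30.0, 36000))  # -> (7200, 12600)
```

Only the frames in this window are passed on to shot segmentation and shot identification, which is what keeps the video-processing cost low.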
3.1. Shot Segmentation

We use a well-known video shot segmentation algorithm based on histogram difference [5]. Figure 4 illustrates the flowchart of operations, and Figure 5 shows the implementation in Khoros. Basically, a cut is detected if a frame's histogram is considered "substantially different" from that of its previous frame, as defined by the chi-square comparison:

\sum_{i=1}^{G} \frac{(H_t(i) - H_{t-1}(i))^2}{H_t(i)},   (1)

where H_t is the histogram for time t, and G is the total number of colors in an image.

Figure 5. Khoros workspace for video shot segmentation.
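Equation (1) and the peak-based cut decision can be sketched as follows (a minimal illustration, not the Khoros modules themselves; the small epsilon guarding empty bins is our addition, not part of Eq. (1)):

```python
import numpy as np

def chi_square_diff(h_t, h_prev, eps=1e-9):
    """Chi-square comparison of two color histograms, Eq. (1):
    sum over the G bins of (H_t(i) - H_{t-1}(i))^2 / H_t(i).
    eps guards against empty bins (an assumption we add)."""
    h_t = np.asarray(h_t, dtype=float)
    h_prev = np.asarray(h_prev, dtype=float)
    return float(np.sum((h_t - h_prev) ** 2 / (h_t + eps)))

def detect_cuts(histograms, threshold):
    """Declare a cut wherever consecutive frame histograms are
    'substantially different', i.e. the chi-square score exceeds
    a pre-set threshold."""
    return [t for t in range(1, len(histograms))
            if chi_square_diff(histograms[t], histograms[t - 1]) > threshold]

# Example: four frames whose histogram changes abruptly at frame 2.
hists = [[10, 10], [10, 10], [2, 18], [2, 18]]
print(detect_cuts(hists, threshold=10.0))  # -> [2]
```

The threshold plays the same role as the peak-location step in the Shot Segment module described below.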
To implement this algorithm, we have utilized both native Khoros modules and modules developed by us.

- Import AVI converts an AVI-encoded data stream into VIFF, and Video Histogram computes the histogram of a VIFF video.
- Translate is a Khoros function for shifting a VIFF object in time, and Subtract subtracts two VIFF objects.
- Square is a Khoros function for applying the squaring operation to a VIFF object, and Divide divides two VIFF objects.
- Statistics is a Khoros function for computing the statistics of a VIFF object.
- Shot Segment detects the shot transition boundaries by locating the peaks in the histogram difference sequence.
- Store Kframe extracts the representative frames from each shot and stores them as a new VIFF video. Currently, we use the first and/or the last frames to represent the whole shot.

3.2. Shot Identification

We propose a model-based approach to identify the contents of key frames. In particular, for a touchdown sequence, we define an ideal model for shot transitions, as shown in Figure 6. Basically, a touchdown sequence should start with the two teams lining up on the field. The word touchdown is usually announced in the middle or at the end of the action shot, which is followed by some kind of commentary and replay. To conclude a touchdown sequence, the scoring team usually kicks an extra point. We note that our model may cover most, but not all, possible touchdown sequences. However, for a preliminary implementation, our simple model provides very satisfactory results.

Figure 6. The ideal shot transition model for a touchdown sequence.
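The ideal shot transition model can be sketched as an ordered match over classified shots (an illustrative sketch only: the label strings and the assumption that an upstream key-frame classifier produces them are ours, not the paper's implementation):

```python
# States of the ideal touchdown model of Figure 6, in order:
# line-up -> action -> commentary/replay -> extra-point kick.
MODEL = ["line-up", "action", "commentary/replay", "kick"]

def matches_touchdown_model(shot_labels):
    """Return True if the model states appear in order within the
    candidate's shot sequence; unrelated shots may be interleaved.
    This mirrors the simple (not exhaustive) transition model."""
    it = iter(shot_labels)
    return all(state in it for state in MODEL)

labels = ["line-up", "action", "action", "commentary/replay", "kick"]
print(matches_touchdown_model(labels))  # -> True
```

Because the match only requires the states in order, extra shots (e.g. crowd cutaways) between model states do not break the match; sequences such as 2ndhalf6, which start with a kick-off and end with a 2-point conversion, fall outside the model, as noted in the results.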
6. Conclusion

In this paper we present a novel approach to automatically extracting important information from football videos. Our system integrates speech understanding and image analysis algorithms so that we can maximize detection accuracy and minimize computation cost at the same time. Our algorithms have been tested extensively with real data captured from TV programs. The preliminary results demonstrate the feasibility of our approach. In the future, we may work on the following topics to improve the system:

- More test data and more robust word-spotting algorithms.
- More complicated shot segmentation algorithms with good shot transition models.
- Other shot representation schemes, such as the mosaic used in the QBIC system [3].
- Detecting other events, such as fumbles.
Figure 8. Cheering detection results for the first and the second four sets.

Table 3 presents the video analysis results. Of the five test sets with touchdowns, 2ndhalf6 does not fit our model because its touchdown starts with a kick-off (instead of lining up) and ends with a 2-point conversion (instead of kicking an extra point). Finally, Figure 11 illustrates the lining-up and kicking shots identified by our algorithms.

Figure 10. Collection of the first frame in each shot for 2ndhalf2.

Figure 11. Lining-up and kicking shots located for 2ndhalf3, 2ndhalf5, and 2ndhalf7.

References

[1] Y.-L. Chang and R. Alonso. Developing a multimedia toolbox for the Khoros system. In SPIE Proceedings, Multimedia: Full-Service Impact on Business, Education, and Home, October 1995.
[2] A. Etemadi. Robust segmentation of edge data. Technical report, University of Surrey, U.K., 1992.
[3] M. Flickner et al. Query by image and video content: the QBIC system. IEEE Computer, 28(9):23-32, 1995.
[4] Y. Gong et al. Automatic parsing of TV soccer programs. In The 2nd IEEE International Conference on Multimedia Computing, pages 167-174, May 1995.
[5] A. Hampapur and T. Weymouth. Digital video segmentation. In The 2nd ACM Int'l Conf. on Multimedia, pages 357-364, Oct. 1994.
[6] S. S. Intille and A. F. Bobick. Tracking using a local closed-world assumption: tracking in the football domain. Technical Report TR-296, M.I.T., Aug. 1994.
[7] K. M. Knill and S. J. Young. Speaker dependent keyword spotting for accessing stored speech. Technical Report TR-193, Cambridge University Engineering Department, Oct. 1994.
[8] R. C. Rose and E. M. Hofstetter. Techniques for robust word spotting in continuous speech messages. In Proc. Eurospeech, pages 1183-1186, Sep. 1991.
[9] S. W. Smoliar and H. Zhang. Content-based video indexing and retrieval. IEEE Multimedia, 1(2):62-75, 1994.
[10] N. Ward. The Lotec Speech Recognition Package. ftp.sanpo.t.u-tokyo.ac.jp:/pub/nigel/lotec, 1994.
[11] L. D. Wilcox and M. A. Bush. Training and search algorithms for an interactive wordspotting system. In Proc. ICASSP, 1992.
[12] A. Yoshitaka et al. Knowledge-assisted content-based retrieval for multimedia databases. IEEE Multimedia, 1(4):12-20, 1994.