
Chapter 5

Indexing and Retrieval of Audio

5.1 INTRODUCTION

In Chapter 2 we learned that digital audio is represented as a sequence of samples (except for structured
representations such as MIDI) and is normally stored in a compressed form. The present chapter is devoted to the
automatic indexing and retrieval of audio.
Human beings have an amazing ability to distinguish different types of audio. Given any audio piece, we can instantly tell the type of audio (e.g., human voice, music, or noise), its speed (fast or slow), and its mood (happy, sad, relaxing, etc.), and determine its similarity to another piece of audio. However, a computer sees a piece of audio as a
sequence of sample values. At the moment, the most common method of accessing audio pieces is based on their
titles or file names. Due to the incompleteness and subjectivity of file names and text descriptions, it may be
hard to find audio pieces satisfying the particular requirements of applications. In addition, this retrieval technique
cannot support queries such as “find audio pieces similar to the one being played” (query by example).
To solve the above problems, content-based audio retrieval techniques are required. The simplest content-based
audio retrieval uses sample to sample comparison between the query and the stored audio pieces. This approach
does not work well because audio signals are variable and different audio pieces may be represented by different
sampling rates and may use a different number of bits for each sample. Because of this, content-based audio
retrieval is commonly based on a set of extracted audio features, such as average amplitude and frequency
distribution.
The following general approach to content-based audio indexing and retrieval is normally taken:
• Audio is classified into some common types of audio such as speech, music, and noise.
• Different audio types are processed and indexed in different ways. For example, if the audio type is speech,
speech recognition is applied and the speech is indexed
based on recognized words.
• Query audio pieces are similarly classified, processed, and indexed.
• Audio pieces are retrieved based on similarity between the query index and the audio index in the database.

The audio classification step is important for several reasons. First, different audio types require different processing
and indexing retrieval techniques. Second, different audio types have different significance to different applications.
Third, one of the most important audio types is speech and there are now quite successful speech recognition
techniques/systems available. Fourth, the audio type or class information is itself very useful to some applications.
Fifth, the search space after classification is reduced to a particular audio class during the retrieval process.
Audio classification is based on some objective or subjective audio features. Thus before we discuss audio
classification in Section 5.3, we describe a number of major audio features in Section 5.2. In our discussion, we
assume audio files are in uncompressed form.
One of the major audio types is speech. The general approach to speech indexing and retrieval is to first apply
speech recognition to convert speech to spoken words and then apply traditional IR on the recognized words. Thus
speech recognition techniques are critical to speech indexing and retrieval. Section 5.4 discusses the main speech
recognition techniques.
There are two forms of musical representation: structured and sample-based. We describe general approaches to
the indexing and retrieval in Section 5.5.
In some applications, a combination of multiple media types are used to represent information (multimedia
objects). We can use the temporal and content relationships between different media types to help with the indexing
and retrieval of multimedia objects. We briefly describe this in Section 5.6.
Section 5.7 summarizes the chapter.

5.2 MAIN AUDIO PROPERTIES AND FEATURES


In this section, we describe a number of common features of audio signals. These features are used for audio
classification and indexing in later sections. Audio perception is itself a complicated discipline. A complete
coverage of audio features and their effects on perception is beyond the scope of this book. Interested readers are
referred to [1, 2].

Audio signals are represented in the time domain (time-amplitude representation) or the frequency domain
(frequency-magnitude representation). Different features are derived or extracted from these two representations. In
the following, we describe features obtained in these two domains separately. In addition to features that can be
directly calculated in these two domains, there are other subjective features such as timbre. We briefly describe these
features, too.

5.2.1 Features Derived in the Time Domain

Time domain or time-amplitude representation is the most basic signal representation technique, where a signal is
represented as amplitude varying with time. Figure 5.1 shows a typical digital audio signal in the time domain. In
the figure, silence is represented as 0. The signal value can be positive or negative depending on whether the sound
pressure is above or below the equilibrium atmospheric pressure when there is silence. It is assumed that 16 bits are
used for representing each audio sample. Thus the signal value ranges from 32,767 (2^15 - 1) to -32,768.

Figure 5.1 Amplitude-time representation of an audio signal.

From the above representation, we can easily obtain the average energy, zero crossing rate, and silence ratio.

Average energy
The average energy indicates the loudness of the audio signal. There are many ways to calculate it. One simple
calculation is as follows:

\[ E = \frac{\sum_{n=0}^{N-1} x(n)^2}{N} \]
where E is the average energy of the audio piece, N is the total number of samples in the audio piece, and x(n) is the
sample value of sample n.
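As a concrete illustration, the average energy can be computed directly from the sample array. The following is a minimal sketch assuming the samples have already been decoded into a NumPy array; the function name and the test signal are ours, for illustration only.

import numpy as np

def average_energy(x: np.ndarray) -> float:
    """Average energy E = sum(x(n)^2) / N of an audio piece."""
    x = x.astype(np.float64)
    return float(np.sum(x ** 2) / len(x))

# Example: a one-second, 8-kHz sine wave with amplitude 0.5
sr = 8000
t = np.arange(sr) / sr
signal = 0.5 * np.sin(2 * np.pi * 440 * t)
print(average_energy(signal))  # about 0.125 (= amplitude^2 / 2)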

Zero crossing rate


The zero crossing rate indicates the frequency of signal amplitude sign change. To some extent, it indicates the
average signal frequency. The average zero crossing rate is calculated as follows:

\[ ZC = \frac{\sum_{n=1}^{N} \left| \operatorname{sgn}\, x(n) - \operatorname{sgn}\, x(n-1) \right|}{2N} \]
where sgn x(n) is the sign of x(n): it is 1 if x(n) is positive and -1 if x(n) is negative.
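A possible implementation of the average zero crossing rate is sketched below; how zero-valued samples are treated is a convention choice, and here they are simply counted as positive.

import numpy as np

def zero_crossing_rate(x: np.ndarray) -> float:
    """Average zero crossing rate: sum(|sgn x(n) - sgn x(n-1)|) / (2N)."""
    s = np.sign(x)
    s[s == 0] = 1  # treat exact zeros as positive (a convention choice)
    return float(np.sum(np.abs(np.diff(s))) / (2 * len(x)))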

Silence ratio
The silence ratio indicates the proportion of the sound piece that is silent. Silence is defined as a period within which
the absolute amplitude values of a certain number of samples are below a certain threshold. Note that there are two
thresholds in the definition. The first is the amplitude threshold. A sample is considered quiet or silent when its
amplitude is below the amplitude threshold. But an individual quiet sample is not considered as a silent period. Only
when the number of consecutive quiet samples is above a certain time threshold are these samples considered to
make up a silent period.
The silence ratio is calculated as the ratio between the sum of silent periods and the total length of the audio
piece.
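The sketch below follows the two-threshold definition above. Both thresholds are application dependent, so the values passed in would have to be chosen for the data at hand.

import numpy as np

def silence_ratio(x: np.ndarray, amp_threshold: float, min_silent_samples: int) -> float:
    """Fraction of samples that belong to silent periods.

    A silent period is a run of at least `min_silent_samples` consecutive
    samples whose absolute amplitude is below `amp_threshold`.
    """
    quiet = np.abs(x) < amp_threshold
    silent_total, run = 0, 0
    for q in quiet:
        if q:
            run += 1
        else:
            if run >= min_silent_samples:
                silent_total += run
            run = 0
    if run >= min_silent_samples:   # a silent period may end at the last sample
        silent_total += run
    return silent_total / len(x)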

5.2.2 Features Derived From the Frequency Domain

Sound Spectrum

The time domain representation does not show the frequency components and frequency distribution of a sound
signal. These are represented in the frequency domain. The frequency domain representation is derived from the time domain representation via the Fourier transform, which, loosely speaking, states that any signal can be decomposed into its frequency components. In the frequency domain, the signal is represented as
amplitude varying with frequency, indicating the amount of energy at different frequencies. The frequency domain
representation of a signal is called the spectrum of the signal. We look at an example spectrum first and then briefly
describe how the spectrum is obtained using the Fourier transform.
Figure 5.2 shows the spectrum of the sound signal of Figure 5.1. In the spectrum, frequency is shown on the
abscissa and amplitude is shown on the ordinate. From the spectrum, it is easy to see the energy distribution across
the frequency range. For example, the spectrum in Figure 5.2 shows that most energy is in the frequency range 0 to
10 kHz.

Figure 5.2 The spectrum of the sound signal in Figure 5.1.

Now let us see how to derive the signal spectrum, based on the Fourier transform. As we are interested in
digital signals, we use the discrete Fourier transform (DFT), given by the following formula:

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-jn\omega_k} \]
where ωk = 2πk/N, x(n) is a discrete signal with N samples, and k is the DFT bin number.
If the sampling rate of the signal is fs Hz, then the frequency fk of bin k in hertz is given by:

\[ f_k = \frac{\omega_k}{2\pi} f_s = \frac{k}{N} f_s \]

If x(n) is time-limited to length N, then it can be recovered completely by taking the inverse discrete Fourier
transform (IDFT) of the N frequency samples as follows:

\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{jn\omega_k} \]
The DFT and IDFT are calculated efficiently using an algorithm called the fast Fourier transform (FFT).
As stated above, the DFT operates on finite length (length N) discrete signals. In practice, many signals extend
over a long time period. It would be difficult to do a DFT on a signal with very large N. To solve this problem, the
short time Fourier transform (STFT) was introduced. In the STFT, a signal of arbitrary length is broken into blocks
called frames and the DFT is applied to each of the frames. Frames are obtained by multiplying the original signal
with a window function. We will not go into details of the STFT here. Interested readers are referred to [3-5].
Typically, a frame length of 10 to 20 ms is used in sound analysis.
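A minimal frame-based analysis along these lines can be written with NumPy's FFT routines, as sketched below. The Hamming window and the 20-ms frame length are common choices rather than requirements, and the function name is ours.

import numpy as np

def stft_magnitudes(x: np.ndarray, sr: int, frame_ms: float = 20, hop_ms: float = 10) -> np.ndarray:
    """Frame the signal, window each frame, and take the DFT magnitude."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # magnitude spectrum of one frame
    return np.array(frames)                         # shape: (n_frames, frame_len // 2 + 1)

# Bin k of an N-point DFT corresponds to the frequency f_k = k * sr / N, as in the text.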
In the following, we describe a number of features that can be derived from the signal spectrum.

Bandwidth

The bandwidth indicates the frequency range of a sound. Music normally has a higher bandwidth than speech
signals. The simplest way of calculating bandwidth is by taking the frequency difference between the highest
frequency and lowest frequency of the nonzero spectrum components. In some cases “nonzero” is defined as at least
3 dB above the silence level.
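A simple sketch of this calculation is given below. It treats the quietest level of the (dB-scaled) spectrum as the silence level, which is one possible interpretation of the definition above.

import numpy as np

def bandwidth(spectrum_db: np.ndarray, freqs: np.ndarray, threshold_db: float = 3.0) -> float:
    """Frequency span between the lowest and highest spectral components that
    are at least `threshold_db` above the quietest level in the spectrum."""
    above = np.where(spectrum_db > spectrum_db.min() + threshold_db)[0]
    if len(above) == 0:
        return 0.0
    return float(freqs[above[-1]] - freqs[above[0]])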

Energy Distribution

From the signal spectrum, it is very easy to see the signal distribution across the frequency components. For
example, we can see if the signal has significant high frequency components. This information is useful for audio
classification because music normally has more high frequency components than speech. So it is important to
calculate low and high frequency band energy. The actual definitions of “low” and “high” are application dependent.
For example, we know that the frequencies of a speech signal seldom go over 7 kHz. Thus we can divide the entire
spectrum along the 7 kHz line: frequency components below 7 kHz belong to the low band and others belong to the
high band. The total energy of each band is calculated as the sum of the power of the spectral components within the band.
One important feature that can be derived from the energy distribution is the spectral centroid, which is the
midpoint of the spectral energy distribution of a sound. Speech has a lower centroid than music. The centroid is
also called brightness.
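The band energies and the spectral centroid can be computed from one frame's magnitude spectrum as sketched below. The 7-kHz split and the power-weighted-mean definition of the centroid follow the discussion above, but other definitions are possible; the example frame is synthetic.

import numpy as np

def band_energy_and_centroid(magnitudes: np.ndarray, freqs: np.ndarray, split_hz: float = 7000.0):
    """Low/high band energies (split at `split_hz`) and the spectral centroid,
    here defined as the power-weighted mean frequency."""
    power = magnitudes.astype(float) ** 2
    low = float(power[freqs < split_hz].sum())
    high = float(power[freqs >= split_hz].sum())
    centroid = float((freqs * power).sum() / power.sum())
    return low, high, centroid

# Example with the magnitude spectrum of one 20-ms frame at a 16-kHz sampling rate:
sr, n = 16000, 320
frame = np.sin(2 * np.pi * 300 * np.arange(n) / sr)        # a low-frequency tone
mags = np.abs(np.fft.rfft(frame * np.hamming(n)))
freqs = np.fft.rfftfreq(n, d=1.0 / sr)
print(band_energy_and_centroid(mags, freqs))               # energy concentrated in the low band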

Harmonicity

The second frequency domain feature of a sound is harmonicity. In harmonic sound, the spectral components are mostly whole-number multiples of the lowest, and most often loudest, frequency. The lowest frequency is called the fundamental frequency. Music is normally more harmonic than other sounds. Whether a sound is harmonic is determined by checking whether the frequencies of its dominant components are multiples of the fundamental frequency.
For example, the sound spectrum of the flute playing the note G4 has a series of peaks at frequencies of:
400 Hz, 800 Hz, 1200 Hz, 1600 Hz, and so on.
We can write the above series as:
f, 2f, 3f, 4f, and so on.
where f = 400 Hz is the fundamental frequency of the sound. The individual components with frequencies of nf are
called harmonics of the note.
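A crude harmonicity test along these lines simply checks whether each dominant spectral peak lies near an integer multiple of the fundamental; the tolerance value below is an assumption.

import numpy as np

def is_harmonic(peak_freqs, f0: float, tolerance: float = 0.05) -> bool:
    """Check whether every dominant spectral peak lies close to an integer
    multiple of the fundamental frequency f0 (within a relative tolerance)."""
    for f in peak_freqs:
        n = round(f / f0)
        if n < 1 or abs(f - n * f0) > tolerance * f0:
            return False
    return True

print(is_harmonic([400, 800, 1200, 1600], f0=400))  # True for the flute example above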

Pitch

The third frequency domain feature is pitch. Only periodic sounds, such as those produced by musical instruments and the voice, give rise to a sensation of pitch. Sounds can be ordered according to their levels of pitch. Most percussion instruments, as well as irregular noise, do not give rise to a pitch sensation by which they could be ordered. Pitch is a
subjective feature, which is related to but not equivalent to the fundamental frequency. However, in practice, we use
the fundamental frequency as the approximation of the pitch.
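Since the fundamental frequency is used as an approximation of pitch, a very simple estimator can pick the strongest peak of a frame's autocorrelation, as sketched below. Real pitch trackers are considerably more sophisticated; the search range and test tone are illustrative.

import numpy as np

def estimate_f0(frame: np.ndarray, sr: int, fmin: float = 50.0, fmax: float = 1000.0) -> float:
    """Crude fundamental-frequency estimate from the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags 0 .. N-1
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    return sr / lag

sr = 8000
t = np.arange(sr) / sr
print(round(estimate_f0(np.sin(2 * np.pi * 220 * t), sr)))  # roughly 220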
5.2.3 Spectrogram

The amplitude-time representation and spectrum are the two simplest signal representations. Their expressive power
is limited in that the amplitude-time representation does not show the frequency components of the signal and the
spectrum does not show when the different frequency components occur. To solve this problem, a combined
representation called a spectrogram is used. The spectrogram of a signal shows the relation between three variables:
frequency content, time, and intensity. In the spectrogram, frequency content is shown along the vertical axis, and
time along the horizontal one. The intensity, or power, of different frequency components of the signal is indicated
by a gray scale, the darkest part marking the greatest amplitude/power.
Figure 5.3 shows the spectrogram of the sound signal of Figure 5.1. The spectrogram clearly illustrates the
relationships among time, frequency, and amplitude. For example, we see from Figure 5.3 that there are two strong
high frequency components of up to 8 kHz appearing at about 0.07 and 1.23 s.
We can determine the regularity of occurrence of certain frequency components from the spectrogram of a signal. A music spectrogram is typically more regular than a speech spectrogram.

Figure 5.3 Spectrogram of the sound signal of Figure 5.1.
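In practice a spectrogram is rarely computed by hand; the sketch below uses SciPy's spectrogram routine on a synthetic two-tone signal. The signal and the window length are illustrative.

import numpy as np
from scipy.signal import spectrogram

sr = 8000
t = np.arange(2 * sr) / sr
x = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 2000 * t)

freqs, times, power = spectrogram(x, fs=sr, nperseg=256)
# power[i, j] is the energy of frequency freqs[i] at time times[j]; plotting it
# on a gray scale gives the kind of picture shown in Figure 5.3.
print(power.shape)  # (frequency bins, time frames)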

5.2.4 Subjective Features


Except for pitch, all the features described above can be directly measured in either the time domain or the
frequency domain. There are other features that are normally subjective. One such feature is timbre.
Timbre relates to the quality of a sound. It is not well understood or precisely defined. It encompasses all the
distinctive qualities of a sound other than its pitch, loudness, and duration. Salient components of timbre include the
amplitude envelope, harmonicity, and spectral envelope.

5.3 AUDIO CLASSIFICATION

We have mentioned five reasons why audio classification is important in Section 5.1. In this section, we first
summarize the main characteristics of different types of sound, based on the features described in the previous
section. We broadly consider two types of sound — speech and music, although each of these sound types can be
further divided into different subtypes such as male and female speech, and different types of music. We then
present two types of classification frameworks and their classification results.

5.3.1 Main Characteristics of Different Types of Sound

In the following we summarize the main characteristics of speech and music. They are the basis for audio
classification.

Speech
The bandwidth of a speech signal is generally low compared to music. It is normally within the range 100 to 7,000
Hz. Because speech has mainly low frequency components, the spectral centroids (also called brightness) of speech
signals are usually lower than those of music.
There are frequent pauses in a speech, occurring between words and sentences. Therefore, speech signals
normally have a higher silence ratio than music.
The characteristic structure of speech is a succession of syllables composed of short periods of frication (caused by consonants) followed by longer periods for vowels [6]. It was found that during frication the average zero-crossing rate (ZCR) rises significantly. Therefore, compared to music, speech has higher variability in ZCR.

Music
Music normally has a high frequency range, from 16 to 20,000 Hz. Thus, its spectral centroid is higher than that of
speech.
Compared to speech, music has a lower silence ratio. One exception may be music produced by a solo
instrument or singing without accompanying music.
Compared to speech, music has lower variability in ZCR.
Music has regular beats that can be extracted to differentiate it from speech [7].

Table 5.1 summarizes the major characteristics of speech and music. Note that the list is not exhaustive. There
are other characteristics derived from specific characteristics of speech and music [8].

Table 5.1
Main Characteristics of Speech and Music

Features              Speech           Music
Bandwidth             0-7 kHz          0-20 kHz
Spectral centroid     Low              High
Silence ratio         High             Low
Zero-crossing rate    More variable    Less variable
Regular beat          None             Yes

5.3.2 Audio Classification Frameworks

All classification methods are based on calculated feature values. But they differ in how these features are used. In
the first group of methods, each feature is used individually in different classification steps [9, 10], while in the
second group a set of features is used together as a vector to calculate the closeness of the input to the training sets
[8, 11]. We discuss these two types of classification frameworks below.

Step-by-Step Classification
In step-by-step audio classification, each audio feature is used separately to determine if an audio piece is music or
speech. Each feature is seen as a filtering or selection criterion. At each filtering step, an audio piece is determined
as one type or another. A possible filtering process is shown in Figure 5.4. First, the centroid of all input audio
pieces is calculated. If an input has a centroid higher than a preset threshold, it is deemed to be music. Otherwise, the
input is speech or music because not all music has a high centroid. Second, the silence ratio is calculated. If the input has a low silence ratio, it is deemed to be music. Otherwise, the input is speech or solo music because solo music may have a very high silence ratio. Finally, we calculate the ZCR. If the input has a very high ZCR variability, it is
speech. Otherwise, it is solo music.
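The decision sequence of Figure 5.4 can be expressed as a chain of threshold tests, as in the sketch below. The threshold values are placeholders that would have to be tuned on training data.

def classify_step_by_step(centroid, silence_ratio, zcr_variability,
                          centroid_thr=2000.0, silence_thr=0.2, zcr_var_thr=0.05):
    """Step-by-step classification following Figure 5.4 (thresholds are placeholders)."""
    if centroid > centroid_thr:          # high centroid -> music
        return "music"
    if silence_ratio < silence_thr:      # low silence ratio -> music
        return "music"
    if zcr_variability > zcr_var_thr:    # high ZCR variability -> speech
        return "speech"
    return "solo music"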
In this classification approach, it is important to determine the order in which different features are used for
classification. The order is normally decided based on computational complexity and the differentiating power of the
different features. The less complicated feature with higher differentiating power is used first. This reduces the
number of steps that a particular input will go through and reduces the total required amount of computation.
Multiple features and steps are used to improve classification performance. In some applications, audio
classification is based on only one feature. For example, Saunders [6] used ZCR variability to discriminate broadcast
speech and music and achieved an average successful classification rate of 90%. Lu and Hankinson [12] used the
silence ratio to classify audio into music and speech with an average success rate of 82%.

Feature-Vector-Based Audio Classification


In feature-vector-based audio classification, values of a set of features are calculated and used as a feature vector.
During the training stage, the average feature vector (reference vector) is found for each class of audio. During
classification, the feature vector of an input is calculated and the vector distances between the input feature vector
and each of the reference vectors are calculated. The input is classified into the class from which the input has least
vector distance. Euclidean distance is commonly used as the feature vector distance. This approach assumes that
audio pieces of the same class are located close to each other in the feature space and audio pieces of different
classes are located far apart in the feature space. This approach can also be used for audio retrieval, discussed in
Section 5.5.
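A minimal sketch of this nearest-reference-vector rule is given below. The feature names and reference values are invented for illustration, and in practice the features would be normalized so that no single feature dominates the Euclidean distance.

import numpy as np

def classify_by_nearest_reference(feature_vector, reference_vectors):
    """Assign the input to the class whose reference (mean) vector is closest
    in Euclidean distance."""
    best_class, best_dist = None, float("inf")
    for label, ref in reference_vectors.items():
        dist = float(np.linalg.norm(np.asarray(feature_vector) - np.asarray(ref)))
        if dist < best_dist:
            best_class, best_dist = label, dist
    return best_class

refs = {"speech": [1500.0, 0.35, 0.08],   # [centroid, silence ratio, ZCR variability]
        "music":  [3500.0, 0.05, 0.02]}   # illustrative reference vectors only
print(classify_by_nearest_reference([1600.0, 0.30, 0.07], refs))  # "speech"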
(Decision flow: audio input -> high centroid? yes: music; no: speech plus music -> high silence ratio? no: music; yes: speech plus solo music -> high ZCR variability? yes: speech; no: solo music.)
Figure 5.4 A possible audio classification process.

Scheirer and Slaney [8] used 13 features including spectral centroid and ZCR for audio classification. A
successful classification rate of over 95% was achieved. Note that because different test sound files were used in
[6], [8], and [12], it is not meaningful to compare their results directly.

5.4 SPEECH RECOGNITION AND RETRIEVAL

Now that we have classified audio into speech and music, we can deal with them separately with different
techniques. This section looks at speech retrieval techniques, and the next section deals with music.
The basic approach to speech indexing and retrieval is to apply speech recognition
techniques to convert speech signals into text and then to apply IR techniques for indexing and retrieval. In addition
to actual spoken words, other information contained in speech, such as the speaker’s identity and the mood of the
speaker, can be used to enhance speech indexing and retrieval. In the following, we describe the basic speech
recognition and speaker identification techniques.

5.4.1 Speech Recognition


In general, the automatic speech recognition (ASR) problem is a pattern matching problem. An ASR system is
trained to collect models or feature vectors for all possible speech units. The smallest unit is a phoneme. Other
possible units are words and phrases. During the recognition process, the feature vector of an input speech unit is extracted and compared with each of the feature vectors collected during the training process. The speech unit whose feature vector is closest to that of the input speech unit is deemed to be the unit spoken.
In this section, we first present the basic concepts of ASR and discuss a number of factors that complicate the
ASR process. We then describe three classes of practical ASR
techniques. These classes are dynamic time warping, hidden Markov models (HMMs), and artificial neural network
(ANN) models. Among these techniques, those based on HMMs are most popular and produce the highest speech
recognition performance.

5.4.1.1 Basic Concepts of ASR

An ASR system operates in two stages: training and pattern matching. During the training stage, features of each speech unit are extracted and stored in the system. In the recognition process, features of an input speech unit are
extracted and compared with each of the stored features, and the speech unit with the best matching features is taken
as the recognized unit. Without loss of generality, we use the phoneme as the speech unit. If each phoneme could be uniquely identified by a feature vector independent of speaker, environment, and context, speech recognition would
be simple. In practice, however, speech recognition is complicated by the following factors:
• A phoneme spoken by different speakers or by the same speaker at different times produces different features in
terms of duration, amplitude, and frequency components. That is, a phoneme cannot be uniquely identified with
100% certainty.
• The above differences are exacerbated by the background or environmental noise.
• Normal speech is continuous and difficult to separate into individual phonemes because different phonemes
have different durations.
• Phonemes vary with their location in a word. The frequency components of a vowel’s pronunciation are heavily
influenced by the surrounding consonants [13].

Because of the above factors, the earlier ASR systems were speaker dependent, required a pause between
words, and could only recognize a small number of words.
The above factors also illustrate that speech recognition is a statistical process in which ordered sound
sequences are matched against the likelihood that they represent a particular string of phonemes and words. Speech
recognition must also make use of knowledge of the language, including a dictionary of the vocabulary and a
grammar of allowable word sequences.
Figure 5.5 shows a general model of ASR systems. The first stage is training (top part of Figure 5.5). In this
stage, speech sequences from a large number of speakers are collected. Although it is possible to carry out speech
recognition from the analog speech signals, digital signals are more suitable. So these speech sequences are
converted into digital format. The digitized speech sequences are divided into frames of fixed duration. The typical
frame size is 10 ms. Feature vectors are then computed for each frame. Many types of features are possible, but the most popular ones are the mel-frequency cepstral coefficients (MFCCs). MFCCs are obtained by the following process (a code sketch follows the list):

1. The spectrum of the speech signal is warped to a scale, called the mel-scale, that represents how a human ear
hears sound.
2. The logarithm of the warped spectrum is taken.
3. An inverse Fourier transform of the result of step 2 is taken to produce what is called the cepstrum.
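In practice MFCCs are rarely coded from scratch. The sketch below uses the librosa library as one possible implementation of the three steps above; the file name, window length, and number of coefficients are illustrative choices, not requirements.

import librosa

# Load a speech file (the path is illustrative) and compute 13 MFCCs per frame.
y, sr = librosa.load("speech_sample.wav", sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                             n_fft=int(0.025 * sr),        # ~25-ms analysis window
                             hop_length=int(0.010 * sr))   # ~10-ms frame step
print(mfccs.shape)  # (13, number of frames); each column is one frame's feature vector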
(Block diagram: in the training process, training speech passes through preprocessing and feature extraction to produce feature vectors, which, together with the corresponding words of the training speech, drive phonetic modeling to produce the phoneme models, dictionary, and grammar. In the recognition process, the speech input passes through preprocessing and feature extraction, and the search and matching engine uses the phoneme models, dictionary, and grammar to produce the output word sequence.)
Figure 5.5 A general ASR system (after [13]).

The phonetic modeling process uses the above obtained feature vectors, a dictionary containing all the words
and their possible pronunciations, and the statistics of grammar usage to produce a set of phoneme models or
templates. At the end of the training stage we have a recognition database consisting of the set of phoneme models,
the dictionary, and grammar.
When speech is to be recognized (bottom part of Figure 5.5), the input speech is processed in a similar way as
in the training stage to produce feature vectors. The search and matching engine finds the word sequence (from the
recognition database) that has the feature vector that best matches the feature vectors of the input speech. The word
sequence is output as recognized text.
Different techniques vary in features used, phonetic modeling, and matching methods used. In the following we
describe three techniques based on dynamic time warping, HMMs, and ANNs.

5.4.1.2 Techniques Based on Dynamic Time Warping

As we have mentioned, each speech frame is represented by a feature vector. During the recognition process, the
simplest way to find the distances between the input feature vector and those in the recognition database is to
compute the sum of frame to frame differences between feature vectors. The best match is the one with the smallest
distance. This simple method will not work in practice, as there are nonlinear variations in the timing of speeches
made by different speakers and made at different times by the same speaker. For example, the same word spoken by
different people will take a different amount of time. Therefore, we cannot directly calculate frame to frame
differences.
Dynamic time warping normalizes or scales the speech duration so as to minimize the sum of distances between the feature vectors that most likely match. Figure 5.6 shows an example of dynamic time warping. Although the spoken
words of the reference speech and the test speech are the same, these two speeches have different time durations
before time warping (Figure 5.6(a)), and it is difficult to calculate the feature differences between them. After time
warping (Figure 5.6(b)), however, they are very similar and their distance can be calculated by summing the frame
to frame or sample to sample differences.
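A textbook version of the dynamic time warping distance between two frame-by-frame feature sequences is sketched below; practical systems add path constraints and pruning that are omitted here.

import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Dynamic-time-warping distance between two feature sequences,
    each an array of shape (n_frames, n_features)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])     # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],        # stretch sequence b
                                 cost[i, j - 1],        # stretch sequence a
                                 cost[i - 1, j - 1])    # advance both
    return float(cost[n, m])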
Figure 5.6 A dynamic time warping example (feature amplitude versus time for a reference speech and a test speech): (a) before time warping; (b) after time warping.

5.4.1.3 Techniques Based on Hidden Markov Models

Techniques based on HMMs are currently the most widely used and produce the best recognition performance. A
detailed coverage of HMMs is beyond the scope of this book. The interested reader is referred to [14, 15] for details.
In the following, we describe the basic idea of using HMMs for speech recognition.
Phonemes are fundamental units of meaningful sound in speech. They are each different from all the rest, but
they are not unchanging in themselves. When one phoneme is voiced, it can be identified as similar to its previous
occurrences, although not exactly the same. In addition, a phoneme’s sound is modified by its neighbors’ sounds.
The challenge of speech recognition is how to model these variations mathematically.
We briefly describe what HMMs are and how they can be used to model and recognize phonemes.
An HMM consists of a number of states, linked by a number of possible transitions (Figure 5.7). Associated with each state are a number of possible output symbols, each with a certain occurrence probability; each transition also has an associated probability. When a state is entered, a symbol is generated. Which symbol is generated at each state is determined by the occurrence probabilities. In Figure 5.7, the HMM has three states. At each state, one of four possible symbols, x1, x2, x3, and x4, is generated with different probabilities, as shown by b1(x), b2(x), b3(x), and b4(x). The transition probabilities are shown as a11, a12, and so forth.
Figure 5.7 An example of an HMM.

In an HMM, it is not possible to identify a unique sequence of states given a sequence of output symbols.
Every sequence of states that has the same length as the output symbol sequence is possible, each with a different
probability. The sequence of states is “hidden” from the observer who sees only the output symbol sequence. This is
why the model is called the hidden Markov model.
Although it is not possible to identify the unique sequence of states for a given sequence of output symbols, it is possible to determine which sequence of states is most likely to have generated the sequence of symbols, based on the state
transition and symbol generating probabilities.
Now let us look at applications of HMMs in speech recognition. Each phoneme is divided into three
audible states: an introductory state, a middle state, and an exiting state. Each state can last for more than one frame
(normally each frame is 10 ms). During the training stage, training speech data is used to construct an HMM for each of the possible phonemes. Each HMM has the above three states and is defined by state transition probabilities and
symbol generating probabilities. In this context, symbols are feature vectors calculated for each frame. Some
transitions are not allowed as time flows forward only. For example, transitions from 2 to 1, 3 to 2 and 3 to 1 are not
allowed if the HMM in Figure 5.7 is used as a phoneme model. Transitions from a state to itself are allowed and
serve to model time variability of speech.
Thus, at the end of the training stage, each phoneme is represented by one HMM capturing the
variations of feature vectors in different frames. These variations are caused by different speakers, time variations,
and surrounding sounds.
During speech recognition, feature vectors for each input phoneme are calculated frame by frame. The
recognition problem is to find which phoneme HMM is most likely to generate the sequence of feature vectors of the
input phoneme. The corresponding phoneme of the HMM is deemed as the input phoneme. As a word has a number
of phonemes, a sequence of phonemes is normally recognized together. There are a number of algorithms, such as the forward and Viterbi algorithms, to compute the probability that an HMM generates a given sequence of feature vectors. The forward algorithm is used for recognizing isolated words and the Viterbi algorithm for recognizing continuous speech [15].
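To make the idea concrete, the sketch below runs the Viterbi algorithm on a toy HMM with discrete output symbols. Real ASR systems model continuous feature vectors with Gaussian mixture or neural emission densities, so this is an illustration of the principle only.

import numpy as np

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely state sequence for a discrete-symbol HMM.

    obs:     sequence of observed symbol indices
    start_p: initial state probabilities, shape (S,)
    trans_p: transition probabilities,    shape (S, S)
    emit_p:  emission probabilities,      shape (S, V)
    """
    start_p, trans_p, emit_p = map(np.asarray, (start_p, trans_p, emit_p))
    S, T = len(start_p), len(obs)
    delta = np.zeros((T, S))            # best path probability ending in each state
    back = np.zeros((T, S), dtype=int)  # back-pointers
    delta[0] = start_p * emit_p[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] * trans_p[:, s]
            back[t, s] = int(np.argmax(scores))
            delta[t, s] = scores[back[t, s]] * emit_p[s, obs[t]]
    # Trace back the best state path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))

The forward algorithm has the same structure, except that it sums over predecessor states instead of taking the maximum.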

5.4.1.4 Techniques Based on Artificial Neural Networks


ANNs have been widely used for pattern recognition. An ANN is an information processing system that simulates
the cognitive process of the human brain. An ANN consists of many neurons interconnected by links with weights.
Speech recognition with ANNs also consists of two stages: training and recognition. During the training stage,
feature vectors of training speech data are used to train the ANN (adjust weights on different links). During the
recognition stage, the ANN will identify the most likely phoneme based on the input feature vectors. For more
details of ANNs and their applications to ASR, the reader is referred to [16].

5.4.1.5 Speech Recognition Performance

Speech recognition performance is normally measured by recognition error rate. The lower the error rate, the higher
the performance. The performance is affected by the following factors:
1. Subject matter: this may vary from a set of digits, a newspaper article, to general news.
2. Types of speech: read or spontaneous conversation.
3. Size of the vocabulary: it ranges from dozens to tens of thousands of words.

As techniques based on HMMs perform best, we briefly list their performance when the above factors vary
(Table 5.2)

Table 5.2
Error Rates for High-Performance Speech Recognition (Based on [13])

Subject Matter          Type                        Vocabulary (No. of Words)    Word Error Rate (%)
Connected digits        Read                        10                           <0.3
Airline travel system   Spontaneous                 2,500                        2
Wall Street Journal     Read                        64,000                       7
Broadcast news          Read/spontaneous (mixed)    64,000                       30
General phone call      Conversation                10,000                       50

The above table shows that speech recognition performance varies greatly. For many specific applications, it is
quite acceptable. However, the recognition performance for general applications is still very low and unacceptable.

5.4.2 Speaker Identification

While speech recognition focuses on the content of speech, speaker identification or voice recognition attempts to
find the identity of the speaker or to extract information about an individual from his/her speech [17]. Speaker
identification is potentially very useful to multimedia information retrieval. It can determine the number of speakers
in a particular setting, whether the speaker is male or female, adult or child, a speaker’s mood, emotional state and
attitude, and other information. This information, together with the speech content (derived from speech recognition)
significantly improves information retrieval performance.
Voice recognition is complementary to speech recognition. Both use similar signal processing techniques to
some extent. However, they differ in the following aspect. Speech recognition, if it is to be speaker-independent,
must purposefully ignore any idiosyncratic speech characteristics of the speaker and focus on those parts of the
speech signal richest in linguistic information. In contrast, voice recognition must amplify those idiosyncratic speech
characteristics that individualize a person and suppress linguistic characteristics that have no bearing on the
recognition of the individual speaker. Readers are referred to [17] for details of voice recognition.

5.4.3 Summary

After an audio piece is determined to be speech, we can apply speech recognition to convert the speech into text. We
can then use the IR techniques discussed in Chapter 4 to carry out speech indexing and retrieval. The information obtained from voice recognition can be used to improve IR performance.

5.5 MUSIC INDEXING AND RETRIEVAL

We discussed speech indexing and retrieval based on speech recognition in the previous section. This section deals
with music indexing and retrieval. In general, research and development of effective techniques for music indexing
and retrieval is still at an early stage. As mentioned in Chapter 2, there are two types of music: structured (or synthetic) music and sample-based music. We briefly describe the handling of these two types of music.

5.5.1 Indexing and Retrieval of Structured Music and Sound Effects

Structured music and sound effects are represented by a set of commands or algorithms. The most common
structured music is MIDI, which represents music as a number of notes and control commands [18]. A new standard
for structured audio (music and sound effects) is MPEG-4 Structured Audio, which represents sound in algorithms
and control languages [19].
These structured sound standards and formats are developed for sound transmission, synthesis, and production.
They are not specially designed for indexing and retrieval purposes. The explicit structure and notes description
existing in these formats make the retrieval process easy, as there is no need to do feature extraction from audio
signals.
Structured music and sound effects are very suitable for queries requiring an exact match between the queries
and database sound files. The user can specify a sequence of notes as a query and it is relatively easy to find those
structured sound files that contain this sequence of notes. Although an exact match of the sequence of notes is found,
the sound produced by the sound file may not be what the user wants because the same structured sound file can be
rendered differently by different devices.
Finding similar music or sound effects to a query based on similarity instead of exact match is complicated
even with structured music and sound effects. The main problem is that it is hard to define similarity between two
sequences of notes. One possibility is to retrieve music based on the pitch changes of a sequence of notes [20]. In
this scheme, each note (except for the first one) in the query and in the database sound files is converted into pitch
change relative to its previous note. The three possible values for the pitch change are U(up), D(down), and S(same
or similar). In this way, a sequence of notes is characterized as a sequence of symbols. Then the retrieval task
becomes a string-matching process. This scheme was proposed for sample-based sound retrieval where notes must
be identified and pitch changes must be tracked with some algorithms that we will discuss in the next subsection.
But this scheme is equally applicable to structured sound retrieval, where the notes are already available and pitch
change can be easily obtained based on the notes scale.
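A sketch of this conversion is given below. The tolerance parameter (deciding when two pitches count as "same or similar") is an assumption, and the example melody is only illustrative.

def pitch_change_string(pitches, tolerance: float = 0.0) -> str:
    """Convert a sequence of note pitches into a U/D/S contour string."""
    contour = []
    for prev, curr in zip(pitches, pitches[1:]):
        if curr > prev + tolerance:
            contour.append("U")
        elif curr < prev - tolerance:
            contour.append("D")
        else:
            contour.append("S")
    return "".join(contour)

# "Twinkle, twinkle, little star" opening (MIDI note numbers): C C G G A A G
print(pitch_change_string([60, 60, 67, 67, 69, 69, 67]))  # "SUSUSD"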

5.5.2 Indexing and Retrieval of Sample-Based Music


There are two general approaches to indexing and retrieval of sample-based music. The first approach is based on a
set of extracted sound features [21], and the second is specifically based on pitches of music notes [20, 22]. We
briefly describe these two approaches separately.

Music retrieval based on a set of features

In this approach to music retrieval, a set of acoustic features is extracted for each sound (including queries). This set
of N features is represented as an N-vector. The similarity between the query and each of the stored music pieces is
calculated based on the closeness between their corresponding feature vectors. This approach can be applied to general sound, including music, speech, and sound effects.
A good example using this approach is the work carried out at Muscle Fish LLC [21]. In this work, five
features are used, namely loudness, pitch, brightness, bandwidth, and harmonicity. These features of sound vary
over time and thus are calculated for each frame. Each feature is then represented statistically by three parameters:
mean, variance, and autocorrelation. The Euclidean distance or Manhattan distance between the query vector and the
feature vector of each stored piece of music is used as the distance between them.
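A sketch of this statistical summarization and distance computation is given below; the exact parameterization used in [21] may differ, and the lag-one autocorrelation is one simple choice.

import numpy as np

def summarize_feature(trajectory) -> list:
    """Mean, variance, and lag-one autocorrelation of one per-frame feature."""
    x = np.asarray(trajectory, dtype=float)
    x0 = x - x.mean()
    denom = float(np.dot(x0, x0))
    autocorr = float(np.dot(x0[:-1], x0[1:]) / denom) if denom > 0 else 0.0
    return [float(x.mean()), float(x.var()), autocorr]

def sound_vector(per_frame_features) -> np.ndarray:
    """Concatenate the statistics of each feature (loudness, pitch, and so on)."""
    return np.concatenate([summarize_feature(f) for f in per_frame_features])

def distance(query_vec, stored_vec) -> float:
    """Euclidean distance between a query vector and a stored feature vector."""
    return float(np.linalg.norm(np.asarray(query_vec) - np.asarray(stored_vec)))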
This approach can be used for audio classification, as discussed earlier. It is based on the assumption that
perceptually similar sounds are closely located in the chosen feature space and perceptually different sounds are
located far apart in the chosen feature space. This assumption may not be true, depending on the features chosen to
represent the sound.

Music retrieval based on pitch

This approach is similar to pitch-based retrieval of structured music. The main difference is that the pitch for each
note has to be extracted or estimated in this case [20, 22]. Pitch extraction or estimation is often called pitch
tracking. Pitch tracking is a simple form of automatic music transcription that converts musical sound into a symbolic representation [23, 24].
The basic idea of this approach is quite simple. Each note of music (including the query) is represented by its
pitch. So a musical piece or segment is represented as a sequence or string of pitches. The retrieval decision is based
on the similarity between the query and candidate strings. The two major issues are pitch tracking and string similarity measurement.
Pitch is normally defined as the fundamental frequency of a sound. To find the pitch for each note, the input
music must first be segmented into individual notes. Segmentation of continuous music, especially humming and
singing, is very difficult. Therefore, it is normally assumed that music is stored as scores in the database. The pitch
of each note is known. The common query input form is humming. To improve pitch tracking performance on the
query input, a pause is normally required between consecutive notes.
There are two pitch representations. In the first method, each pitch except the first one is represented as pitch
direction (or change) relative to the previous note. The pitch direction is either U(up), D(down), or S (similar). Thus
each musical piece is represented as a string of three symbols or characters.
The second pitch representation method represents each note as a value based on a chosen reference note. The
value is assigned from a set of standard pitch values that is closest to the estimated pitch. If we represent each
allowed value as a character, each musical piece or segment is represented as a string of characters. But in this case,
the number of allowed symbols is much greater than the three that are used in the first pitch representation.
After each musical piece is represented as a string of characters, the final stage is to find a match or similarity
between the strings. Considering that humming is not exact and the user may be interested in finding similar musical
pieces instead of just the same one, approximate matching is used instead of exact matching. The approximate
matching problem is that of string matching with k mismatches. The variable k is determined by the user of the
system. The problem consists of finding all instances of a query string Q = q1q2...qm in a reference string R = r1r2...rn such that there are at most k mismatches (characters that are not the same). There are several
algorithms that were developed to address the problem of approximate string matching [21, 22].
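A brute-force sketch of string matching with k mismatches is given below; the published systems use more efficient algorithms, but the problem being solved is the same. The example strings are invented.

def k_mismatch_positions(query: str, reference: str, k: int) -> list:
    """All positions in `reference` where `query` occurs with at most k mismatches."""
    m = len(query)
    hits = []
    for start in range(len(reference) - m + 1):
        mismatches = sum(1 for a, b in zip(query, reference[start:start + m]) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits

print(k_mismatch_positions("UUDS", "SUUDUUSDSU", k=1))  # [1, 5]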
Both the systems of Muscle Fish LLC [21] and the University of Waikato [22] produced good retrieval performance.
But the performance depends on the accuracy of pitch tracking of hummed input signals. High performance is
only achieved when a pause is inserted between consecutive notes.

5.6 MULTIMEDIA INFORMATION INDEXING AND RETRIEVAL USING RELATIONSHIPS BETWEEN AUDIO AND OTHER MEDIA

So far, we have treated sound independently of other media. In some applications, sound appears as part of
a multimedia document or object. For example, a movie consists of a sound track and a video track with
fixed temporal relationships between them. Different media in a multimedia object are interrelated in their
contents as well as by time. We can use this interrelation to improve multimedia information indexing and
retrieval in the following two ways.
First, we can use knowledge or understanding about one medium to understand the contents of other
media. We have used text to index and retrieve speech through speech recognition. We can in turn use
audio classification and speech understanding to help with the indexing and retrieval of video. Figure 5.8
shows a multimedia object consisting of a video track and a sound track. The video track has 26 frames.
Now we assume the sound track has been segmented into different sound types. The first segment is speech
and corresponds to video frames 1 through 7. The second segment is loud music and corresponds to video
frames 7 through 18. The final segment is speech again and corresponds to video frames 19 through 26.
We then use the knowledge of the sound track to do the following on the video track. First, we segment
the video track according to the sound track segment boundaries. In this case, the video track is likely to
have three segments with the boundaries aligned with the sound track segment boundaries. Second, we apply
speech recognition to sound segments 1 and 3 to understand what was talked about. The corresponding
video track may very likely have similar content. Video frames may be indexed and retrieved based on the
speech content without any other processing. This is very important because in general it is difficult to
extract video content even with complicated image processing techniques.

Figure 5.8 An example multimedia object with a video track and a soundtrack.

The second way to make use of relationships between media for multimedia retrieval is during the retrieval
process. The user can use the most expressive and simple media to formulate a query, and the system will retrieve
and present relevant information to the user regardless of media types. For example, a user can issue a query
using speech to describe what information is required and the system may retrieve and present relevant
information in text, audio, video, or their combinations. Alternatively, the user can use an example image as
query and retrieve information in images, text, audio, and their combinations. This is useful because there are
different levels of difficulty in formulating queries in different media.
We discuss the indexing and retrieval of composite multimedia objects further in Chapter 9.

5.7 SUMMARY

This chapter described some common techniques and related issues for content-based audio indexing and retrieval.
The general approach is to classify audio into some common types such as speech and music, and then use different
techniques to process and retrieve the different types of audio. Speech indexing and retrieval is relatively easy, by
applying IR techniques to words identified using speech recognition. But speech recognition performance on
general topics without any vocabulary restriction is still to be improved. For music retrieval, some useful work has
been done based on audio feature vector matching and approximate pitch matching. However, more work is
needed on how music and audio in general is perceived and on similarity comparison between musical pieces. It will
also be very useful if we can further automatically classify music into different types such as pop and classical.
The classification and retrieval capability described in this chapter is potentially important and useful in many
areas, such as the press and music industry, where audio information is used. For example, a user can hum or play a
song and ask the system to find songs similar to what was hummed or played. A radio presenter can specify the
requirements of a particular occasion and ask the system to provide a selection of audio pieces meeting these
requirements. When a reporter wants to find a recorded speech, he or she can type in part of the speech to locate the
actual recorded speech. Audio and video are often used together in situations such as movies and television programs,
so audio retrieval techniques may help locate some specific video clips, and video retrieval techniques may help
locate some audio segments. These relationships should be exploited to develop integrated multimedia database
management systems.
