Audio Indexing: Feature Extraction
Figure 1. A typical architecture for a statistical audio indexing system based on a traditional bag-of-frames
approach. In a problem of automatic musical instrument recognition, each class represents an instrument or a
family of instruments.
the musical signal. Note that the knowledge of note onset positions allows for other important applications such as Audio-to-Audio alignment or Audio-to-Score alignment.

However, a number of different audio indexing tasks will share a similar architecture. In fact, a typical architecture of an audio indexing system includes two or three major components: a feature extraction module, sometimes associated with a feature selection module, and a classification or decision module. This typical “bag-of-frames” approach is depicted in Figure 1. These modules are further detailed below.

Feature Extraction

The feature extraction module aims at representing the audio signal using a reduced set of features that characterize the signal properties well. The features proposed in the literature can be roughly classified into four categories:

• Temporal features: These features are directly computed on the time domain signal. The advantage of such features is that they are usually straightforward to compute. They include, amongst others, the crest factor, temporal centroid, zero-crossing rate and envelope amplitude modulation.
• Cepstral features: Such features are widely used in speech recognition or speaker recognition due to a clear consensus on their appropriateness for these applications. This is justified by the fact that such features make it possible to estimate the contribution of the filter (or vocal tract) in a source-filter model of speech production. They are also often used in audio indexing applications since many audio sources also obey a source-filter model. The usual features include the Mel-Frequency Cepstral Coefficients (MFCC) and the Linear-Predictive Cepstral Coefficients (LPCC).
• Spectral features: These features are usually computed on the spectrum (magnitude of the Fourier Transform) of the time domain signal. They include the first four spectral statistical moments, namely the spectral centroid, the spectral width, the spectral asymmetry defined from the spectral skewness, and the spectral kurtosis describing the peakedness/flatness of the spectrum. A number of spectral features were also defined in the framework of MPEG-7, such as the MPEG-7 Audio Spectrum Flatness and Spectral Crest Factors, which are processed over a number of frequency bands (ISO, 2001). Other features