Lecture 7 - Automatic Speech Recognition
Automatic Speech Recognition: Systems (Hardware/Software) that can analyze,
classify and recognize speech signals.
Technology that allows human beings to use their voices to speak with a computer
interface in a way that, in its most sophisticated variations, resembles normal
human conversation.
Important in mobile and security applications, e.g. speech-to-text, Alexa, Siri, etc.
[Pipeline: Acoustic Processing → Feature Extraction → Classification/Recognition]
The ASR Problem
• Continuous signal
• Some harmonic components
Quefrency: the inverse of the distance between successive lines in a Fourier transform, measured in seconds.
[Cepstrum plot: quefrency (ms) on the x-axis; the low-quefrency region carries info on phonemes, the peak at the pitch period carries info on pitch.]
The Cepstrum
• One way to think about the cepstrum:
– Separating the source and the filter
– The speech waveform is created by
• A glottal source waveform
• Which passes through a vocal tract that, because of its
shape, has a particular filtering characteristic
• Articulatory facts:
– The vocal cord vibrations create harmonics
– The mouth is an amplifier
– Depending on the shape of the oral cavity, some harmonics
are amplified more than others (see the sketch below)
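To make the source/filter separation concrete, here is a minimal sketch of the real cepstrum (the inverse FFT of the log magnitude spectrum) in Python with NumPy; the sample rate, frame length, and harmonic-rich test tone are illustrative assumptions, not values from the lecture.

import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log magnitude spectrum of x."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)  # avoid log(0)
    return np.fft.ifft(log_mag).real

sr = 16000                                  # assumed sample rate (Hz)
t = np.arange(0, 0.03, 1 / sr)              # one 30 ms frame
f0 = 120.0                                  # toy glottal fundamental (Hz)
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 8))

ceps = real_cepstrum(x)
quefrency = np.arange(len(x)) / sr          # cepstral axis, in seconds
# Low-quefrency bins capture the vocal-tract (filter) envelope -- the
# phoneme information; a peak near 1/f0 ~= 8.3 ms reflects the pitch.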
Understanding the cepstrum
Example: pass the signal through a first-order finite impulse response (FIR) filter that
increases the amplitude of high-frequency bands and decreases the amplitude of lower bands.
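In code, this first-order FIR filter is a one-liner; a minimal sketch assuming NumPy, with the filter coefficient set to 0.97 (a common choice, but an assumption here):

import numpy as np

def preemphasis(x, a=0.97):
    """One-tap FIR: y[n] = x[n] - a*x[n-1]; boosts highs, attenuates lows."""
    return np.append(x[0], x[1:] - a * x[:-1])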
MFCC
Windowing
• Rectangular window: w[n] = 1, for 0 ≤ n ≤ N−1
• Hamming window: w[n] = 0.54 − 0.46 cos(2πn/(N−1)), for 0 ≤ n ≤ N−1
[Figure: the window's shape in the time domain and its response in the frequency domain]
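Both windows are easy to generate directly from their formulas; a short NumPy sketch, where the 400-sample frame (25 ms at 16 kHz) is an illustrative choice:

import numpy as np

N = 400                                      # 25 ms frame at 16 kHz (assumed)
n = np.arange(N)
rectangular = np.ones(N)                     # w[n] = 1
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)                   # stand-in for one speech frame
windowed = frame * hamming                   # taper edges, reduce leakage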
MFCC
Discrete Fourier Transform
• Input:
– Windowed signal x[n]…x[m]
• Output:
– For each of N discrete frequency bands: a complex number X[k]
representing the magnitude and phase of that frequency
component in the original signal (see the sketch below)
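A minimal NumPy sketch of this step, using a Hamming-windowed stand-in frame (the frame itself is random placeholder data):

import numpy as np

N = 400
frame = np.random.randn(N)                   # stand-in speech frame
windowed = frame * np.hamming(N)             # Hamming-windowed frame

X = np.fft.rfft(windowed)        # complex X[k], one per frequency band
magnitude = np.abs(X)            # |X[k]|: what the MFCC pipeline keeps
phase = np.angle(X)              # arg X[k]: discarded for MFCCs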
• Discrete Fourier Transform (DFT)
• Alternatives:
– Linear Prediction Coefficients (LPC)
– Linear Prediction Cepstral Coefficients (LPCC)
– Line Spectral Frequencies (LSF)
– Discrete Wavelet Transform (DWT)
– Perceptual Linear Prediction (PLP)
Step 2: Acoustic Model
• For each frame of data, we need some way of
describing the likelihood of it belonging to any of our
classes
• Two methods are commonly used (see the GMM sketch below):
− A multilayer perceptron (MLP) gives the likelihood of a class
given the data
− A Gaussian mixture model (GMM) gives the likelihood of the
data given a class
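As an illustration of the GMM option, here is a hedged sketch using scikit-learn; the phone labels, 13-dimensional features, and random training data are placeholders rather than the lecture's setup:

import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per class, fit on that class's feature frames.
train = {"ae": np.random.randn(500, 13),   # placeholder MFCC frames
         "ao": np.random.randn(500, 13)}
models = {phone: GaussianMixture(n_components=4, random_state=0).fit(x)
          for phone, x in train.items()}

frame = np.random.randn(1, 13)             # one incoming feature frame
# score_samples returns log p(data | class) -- the GMM quantity above;
# an MLP's softmax output would instead estimate p(class | data).
loglikes = {phone: gmm.score_samples(frame)[0]
            for phone, gmm in models.items()}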
Step 3: Pronunciation Model
• While the pronunciation model can be very
complex, it is typically just a dictionary (a sketch follows below)
• The dictionary contains the valid pronunciations
for each word
• Examples:
− Cat: k ae t
− Dog: d ao g
− Fox: f aa x s
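A sketch of such a dictionary in Python, using the lecture's three example entries (the pronunciations helper is a hypothetical name):

lexicon = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fox": ["f", "aa", "x", "s"],
}

def pronunciations(word):
    """Look up the valid phone sequence for a word, if present."""
    return lexicon.get(word.lower())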
Step 4: Language Model
• Now we need some way of representing the
likelihood of any given word sequence
• Many methods exist, but n-grams are the most
common
• N-gram models are trained simply by counting
the occurrences of word sequences in a
training set (see the sketch below)
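A minimal sketch of that counting procedure on a toy tokenized corpus (the corpus and function names are illustrative):

from collections import Counter

corpus = "the cat saw the dog the dog saw the fox".split()

unigrams = Counter(corpus)                  # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent word-pair counts

def p_bigram(w1, w2):
    """Maximum-likelihood estimate: count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("the", "dog"))  # 2 of the 4 "the" tokens precede "dog" -> 0.5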
N-grams
• A unigram is the probability of any word in
isolation
• A bigram is the probability of a word
given the previous word
• Higher order ngrams continue in a similar
fashion
• A backoff probability is used for any unseen
data (see the sketch below)
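Reusing the same style of counts, a hedged sketch of a simple backoff: fall back to a scaled unigram estimate when a bigram was never seen. The 0.4 weight is an assumed constant (as in "stupid backoff"), not a tuned value:

from collections import Counter

corpus = "the cat saw the dog the dog saw the fox".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_backoff(w1, w2, alpha=0.4):
    """Bigram MLE when seen; otherwise back off to a scaled unigram."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return alpha * unigrams[w2] / total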
How do we put it together?