Lecture 7 - Automatic Speech Recognition
Automatic Speech Recognition: Systems (Hardware/Software) that can analyze,
classify and recognize speech signals.
Technology that allows human beings to use their voices to speak with a computer
interface in a way that, in its most sophisticated variations, resembles normal
human conversation.
Important in mobile and security applications, e.g. speech-to-text, Alexa, Siri, etc.
[Pipeline: Acoustic Processing → Feature Extraction → Classification/Recognition]
The ASR Problem
• Continuous signal
• Some harmonic components
Quefrency: the inverse of the distance between successive lines in a Fourier transform, measured in seconds.
[Cepstrum plot: quefrency (ms) on the x-axis; the low-quefrency region carries info on phonemes, the peak at the pitch period carries info on pitch.]
The Cepstrum
• One way to think about the cepstrum:
– Separating the source and the filter
– The speech waveform is created by
• A glottal source waveform
• Which passes through a vocal tract that, because of its
shape, has a particular filtering characteristic
• Articulatory facts:
– The vocal cord vibrations create harmonics
– The mouth is an amplifier
– Depending on the shape of the oral cavity, some harmonics
are amplified more than others (see the sketch below)
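To make the source/filter separation concrete, here is a minimal sketch of the real cepstrum (the inverse FFT of the log magnitude spectrum) in Python with NumPy; the sample rate, frame length, and harmonic-rich test tone are illustrative assumptions, not values from the lecture.

import numpy as np

def real_cepstrum(x):
    """Inverse FFT of the log magnitude spectrum of x."""
    log_mag = np.log(np.abs(np.fft.fft(x)) + 1e-10)  # avoid log(0)
    return np.fft.ifft(log_mag).real

sr = 16000                                  # assumed sample rate (Hz)
t = np.arange(0, 0.03, 1 / sr)              # one 30 ms frame
f0 = 120.0                                  # toy glottal fundamental (Hz)
x = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 8))

ceps = real_cepstrum(x)
quefrency = np.arange(len(x)) / sr          # cepstral axis, in seconds
# Low-quefrency bins capture the vocal-tract (filter) envelope -- the
# phoneme information; a peak near 1/f0 ~= 8.3 ms reflects the pitch.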
Understanding the cepstrum
Example: pass the signal through a first-order finite impulse response (FIR) filter that
increases the amplitude of high-frequency bands and decreases the amplitude of lower bands.
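In code, this first-order FIR filter is a one-liner; a minimal sketch assuming NumPy, with the filter coefficient set to 0.97 (a common choice, but an assumption here):

import numpy as np

def preemphasis(x, a=0.97):
    """One-tap FIR: y[n] = x[n] - a*x[n-1]; boosts highs, attenuates lows."""
    return np.append(x[0], x[1:] - a * x[:-1])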
MFCC
Windowing
• Rectangular window: w[n] = 1, for 0 ≤ n ≤ N−1
• Hamming window: w[n] = 0.54 − 0.46 cos(2πn/(N−1)), for 0 ≤ n ≤ N−1
[Figure: the window's shape in the time domain and its response in the frequency domain]
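Both windows are easy to generate directly from their formulas; a short NumPy sketch, where the 400-sample frame (25 ms at 16 kHz) is an illustrative choice:

import numpy as np

N = 400                                      # 25 ms frame at 16 kHz (assumed)
n = np.arange(N)
rectangular = np.ones(N)                     # w[n] = 1
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

frame = np.random.randn(N)                   # stand-in for one speech frame
windowed = frame * hamming                   # taper edges, reduce leakage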
MFCC
Discrete Fourier Transform
• Input:
– Windowed signal x[n]…x[m]
• Output:
– For each of N discrete frequency bands: a complex number X[k]
representing the magnitude and phase of that frequency
component in the original signal (see the sketch below)
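A minimal NumPy sketch of this step, using a Hamming-windowed stand-in frame (the frame itself is random placeholder data):

import numpy as np

N = 400
frame = np.random.randn(N)                   # stand-in speech frame
windowed = frame * np.hamming(N)             # Hamming-windowed frame

X = np.fft.rfft(windowed)        # complex X[k], one per frequency band
magnitude = np.abs(X)            # |X[k]|: what the MFCC pipeline keeps
phase = np.angle(X)              # arg X[k]: discarded for MFCCs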
• Discrete Fourier Transform (DFT)
• Alternatives:
– Linear Prediction Coefficients (LPC)
– Linear Prediction Cepstral Coefficients (LPCC)
– Line Spectral Frequencies (LSF)
– Discrete Wavelet Transform (DWT)
– Perceptual Linear Prediction (PLP)
Step 2: Acoustic Model
• For each frame of data, we need some way of
describing the likelihood of it belonging to any of our
classes
• Two methods are commonly used (see the GMM sketch below):
− A multilayer perceptron (MLP) gives the likelihood of a class
given the data
− A Gaussian mixture model (GMM) gives the likelihood of the
data given a class
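As an illustration of the GMM option, here is a hedged sketch using scikit-learn; the phone labels, 13-dimensional features, and random training data are placeholders rather than the lecture's setup:

import numpy as np
from sklearn.mixture import GaussianMixture

# One GMM per class, fit on that class's feature frames.
train = {"ae": np.random.randn(500, 13),   # placeholder MFCC frames
         "ao": np.random.randn(500, 13)}
models = {phone: GaussianMixture(n_components=4, random_state=0).fit(x)
          for phone, x in train.items()}

frame = np.random.randn(1, 13)             # one incoming feature frame
# score_samples returns log p(data | class) -- the GMM quantity above;
# an MLP's softmax output would instead estimate p(class | data).
loglikes = {phone: gmm.score_samples(frame)[0]
            for phone, gmm in models.items()}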
Step 3: Pronunciation Model
• While the pronunciation model can be very
complex, it is typically just a dictionary (a sketch follows below)
• The dictionary contains the valid pronunciations
for each word
• Examples:
− Cat: k ae t
− Dog: d ao g
− Fox: f aa x s
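A sketch of such a dictionary in Python, using the lecture's three example entries (the pronunciations helper is a hypothetical name):

lexicon = {
    "cat": ["k", "ae", "t"],
    "dog": ["d", "ao", "g"],
    "fox": ["f", "aa", "x", "s"],
}

def pronunciations(word):
    """Look up the valid phone sequence for a word, if present."""
    return lexicon.get(word.lower())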
Step 4: Language Model
• Now we need some way of representing the
likelihood of any given word sequence
• Many methods exist, but n-grams are the most
common
• N-gram models are trained simply by counting
the occurrences of word sequences in a
training set (see the sketch below)
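A minimal sketch of that counting procedure on a toy tokenized corpus (the corpus and function names are illustrative):

from collections import Counter

corpus = "the cat saw the dog the dog saw the fox".split()

unigrams = Counter(corpus)                  # single-word counts
bigrams = Counter(zip(corpus, corpus[1:]))  # adjacent word-pair counts

def p_bigram(w1, w2):
    """Maximum-likelihood estimate: count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("the", "dog"))  # 2 of the 4 "the" tokens precede "dog" -> 0.5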
N-grams
• A unigram is the probability of any word in
isolation
• A bigram is the probability of a word
given the previous word
• Higher order ngrams continue in a similar
fashion
• A backoff probability is used for any unseen
data (see the sketch below)
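Reusing the same style of counts, a hedged sketch of a simple backoff: fall back to a scaled unigram estimate when a bigram was never seen. The 0.4 weight is an assumed constant (as in "stupid backoff"), not a tuned value:

from collections import Counter

corpus = "the cat saw the dog the dog saw the fox".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_backoff(w1, w2, alpha=0.4):
    """Bigram MLE when seen; otherwise back off to a scaled unigram."""
    if bigrams[(w1, w2)] > 0:
        return bigrams[(w1, w2)] / unigrams[w1]
    return alpha * unigrams[w2] / total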
How do we put it together?