Design and Implementation
Administrivia
Instructors:
Bhiksha Raj
GHC 6705
Office hours: Tuesday 2-3
[email protected]
Rita Singh
GHC 6703
Office hours: TBD
[email protected]
TA
TBD
Attendance
Is compulsory
Carries 10% of your points
Grading: Relative, with bounds
Anyone not completing 50% of assignments automatically
gets a C
This means everyone can get a C
Why Speech?
Most natural form of human communication
Highest bandwidth human communication as well
With modern technology (telephones etc.) people
can communicate over long distances
Voice-based IVR systems are virtually everywhere
Such automated systems can remain online 24/7
Technological Challenges
Inherent variations in speech make pattern
matching difficult
Solution must understand and represent what is
invariant
This represents the message
Search algorithms
Machine learning
Computational complexity
Computer hardware
Offline:
Transcription
Keyword spotting, Mining
Format of Course
Lectures
A series of projects/assignments of linearly increasing
complexity
Each project has a score
Projects will be performed by teams
Size 2-3 members
Projects
Project 2: A spellchecker
String matching
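As a rough illustration of the string matching behind a spellchecker, the sketch below computes the Levenshtein edit distance and picks the closest dictionary word. The function names `edit_distance` and `suggest` and the tiny word list are assumptions made for this example; the project itself may specify a different cost scheme.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def suggest(word, dictionary):
    """Return the dictionary entry with the smallest edit distance to word."""
    return min(dictionary, key=lambda w: edit_distance(word, w))


# Hypothetical word list, for illustration only.
best = suggest("speach", ["speech", "beach", "switch"])
```

Only two rows of the distance table are kept at a time, which is enough when the alignment itself is not needed.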
Project 4: HMM-based recognition of isolated words
Viterbi decoding with simple Gaussian densities
Viterbi decoding with mixture Gaussian densities
Project 7: HMM-based recognition of
continuous word strings
Continuous ASR of words
Continuous ASR of words with optional silences
Training a set of word models (carried over from
previous exercise)
Evaluation
Project 8: Grammar-based recognition of
continuous words
Building graphs from grammars
Building HMM-networks from grammars
Recognition of continuous word strings from a
grammar
Project 9: Grammar-based recognition from
Ngram models
Conversion of Ngrams to FSGs
Grammar-based recognition of continuous speech
from Ngrams
[Figure: Endpointing, Feature Computation, Pattern Matching, Linguistic Patterns]
Figure taken from the web
A simplified view:
Speech waves -> analog speech signal -> Sampling -> discrete time signal -> Digitization -> digitized discrete time signal (samples)
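The sampling and digitization steps in this simplified view can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sampling rate, 16-bit signed quantization, and a synthetic 440 Hz tone standing in for the analog signal; none of these values come from the slides.

```python
import math


def sample_and_quantize(signal, duration_s, sample_rate_hz=16000, bits=16):
    """Sample a continuous-time function, then quantize each sample to a signed integer."""
    n = round(duration_s * sample_rate_hz)
    max_level = 2 ** (bits - 1) - 1  # 32767 for 16 bits
    samples = []
    for k in range(n):
        t = k / sample_rate_hz                   # sampling: pick discrete time instants
        x = max(-1.0, min(1.0, signal(t)))       # clip to the representable range
        samples.append(round(x * max_level))     # digitization: map to integer levels
    return samples


def tone(t):
    """Stand-in for the analog signal: a 440 Hz sine wave with amplitude 1."""
    return math.sin(2 * math.pi * 440 * t)


pcm = sample_and_quantize(tone, 0.01)  # 10 ms of audio
```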
Hit to talk
Hit and speak
Continuous listening
Multi-pass endpointing
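The slides list endpointing modes without naming an algorithm. As one possible illustration, the sketch below marks speech by thresholding short-time energy; the frame length, the threshold, and the function names are all assumptions for this example, and practical endpointers are considerably more robust.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [sum(x * x for x in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def find_endpoints(samples, frame_len=160, threshold=1e-3):
    """Return (start_sample, end_sample) spanning all frames above threshold, or None."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None  # no frame exceeded the threshold
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic input: silence, a burst of "speech", silence.
audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 320
endpoints = find_endpoints(audio)
```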
Feature Extraction
Should pattern matching in speech be done directly on
audio samples?
Raw sample streams are not well suited for matching
A visual analogy: recognizing a letter inside a box
[Figure labels: template, input]
[Figure labels: AA, IY, UW]
A little diversion
Generalization of templates
16 Jul 2007
[Figure: SPEECH, the Pronunciation Model, and the Language Model shown with the Speech Recognizer (Decoder)]
Our Focus
For now, our focus will be more on the recognizer (decoder) than on the trainers
Other commonly used names for the recognizer: decoder,
recognition engine, decoding engine, etc.
The algorithm used by the recognizer is usually referred to as
the search algorithm
Given some spoken input, it searches for the correct word sequence
among all possibilities
It does this, in effect, by searching through all available templates
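Searching through all available templates can be sketched as a nearest-template lookup. The example below assumes dynamic time warping (DTW) with an absolute-difference local cost over scalar features; the slides do not name the distance at this point, and the templates here are invented toy sequences.

```python
def dtw_distance(x, y):
    """DTW alignment cost between two sequences of scalar features."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(y)  # row for the empty prefix of x
    for xi in x:
        curr = [inf]
        for j, yj in enumerate(y, start=1):
            # Best predecessor: vertical, horizontal, or diagonal step.
            best = min(prev[j], curr[j - 1], prev[j - 1])
            curr.append(abs(xi - yj) + best)
        prev = curr
    return prev[-1]


def recognize(features, templates):
    """Return the label of the template with the lowest DTW cost to the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))


# Hypothetical one-dimensional templates, for illustration only.
templates = {"one": [1, 2, 3, 2, 1], "two": [5, 5, 1, 1]}
hypothesis = recognize([1, 1, 2, 3, 3, 2, 1], templates)
```

The search here is exhaustive: every template is scored, which is only practical for small vocabularies.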
Acoustic model
[Figure labels: Feature Extraction, TOMORROW, Training data]
[Figure: 3-state HMM for the word SAM, with start and end nodes]
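To make the idea of a 3-state word HMM concrete, here is a minimal sketch of Viterbi scoring for a left-to-right HMM with one Gaussian per state. The scalar features, the means and variances, and the equal self-loop and advance probabilities are all hypothetical values chosen for the example, not parameters from the course.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def viterbi_score(obs, means, variances, log_self, log_next):
    """Best-path log score through a left-to-right HMM.

    The path must start in state 0 and end in the last state; at each step
    a state either loops on itself or advances to the next state.
    """
    n = len(means)
    neg_inf = float("-inf")
    score = [neg_inf] * n
    score[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        new = [neg_inf] * n
        for s in range(n):
            stay = score[s] + log_self
            move = score[s - 1] + log_next if s > 0 else neg_inf
            new[s] = max(stay, move) + log_gauss(x, means[s], variances[s])
        score = new
    return score[-1]


# Hypothetical models and observations, for illustration only.
obs = [0.1, -0.1, 5.0, 5.2, 9.9, 10.1]
half = math.log(0.5)
score_a = viterbi_score(obs, [0.0, 5.0, 10.0], [1.0, 1.0, 1.0], half, half)
score_b = viterbi_score(obs, [10.0, 5.0, 0.0], [1.0, 1.0, 1.0], half, half)
```

An isolated-word recognizer would compute such a score for every word model and report the word whose model scores highest.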
Pronunciation Modelling
Large (or flexible) vocabulary systems
compose models for words from smaller sound
units
Pronunciation modelling specifies how words
are composed from sub-word units
Includes hand-crafted lexicons and
automatically generated pronunciations
Will only be superficially covered
A simple pronunciation generator (if time permits)
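Composing word models from sub-word units can be pictured as a lexicon lookup followed by concatenation. In the sketch below the lexicon entries, the phone symbols, and the `states_per_phone` parameter are illustrative assumptions; a real system would take pronunciations from a hand-crafted or automatically generated dictionary.

```python
# Hypothetical lexicon: word -> sequence of sub-word units (phones).
LEXICON = {
    "SAM": ["S", "AE", "M"],
    "TOMORROW": ["T", "AH", "M", "AA", "R", "OW"],
}


def word_state_sequence(word, lexicon, states_per_phone=1):
    """Concatenate per-phone state names to form the state sequence of a word model."""
    states = []
    for phone in lexicon[word]:  # raises KeyError for out-of-vocabulary words
        for k in range(states_per_phone):
            states.append(f"{phone}_{k}")
    return states


sam_states = word_state_sequence("SAM", LEXICON)
```

With one state per phone, the hypothetical entry for SAM yields three states; adding a word to the vocabulary then requires only a new lexicon entry, not a new word-specific model.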
Language Modelling
Language models represent valid and plausible
word sequences
Help the system choose between "wreck a nice beach" and
"recognize speech"
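One common way to score word sequences is a bigram model. The sketch below assumes such a model with made-up probabilities; the numbers, the `<s>` and `</s>` boundary symbols, and the floor for unseen bigrams are all hypothetical and serve only to show how a language model can prefer one acoustically similar hypothesis over another.

```python
import math

# Hypothetical bigram log probabilities, for illustration only.
BIGRAM_LOGPROB = {
    ("<s>", "recognize"): math.log(0.01),
    ("recognize", "speech"): math.log(0.2),
    ("speech", "</s>"): math.log(0.3),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.1),
    ("a", "nice"): math.log(0.05),
    ("nice", "beach"): math.log(0.02),
    ("beach", "</s>"): math.log(0.3),
}


def sentence_logprob(words, bigrams, unseen=math.log(1e-6)):
    """Sum bigram log probabilities over a sentence, with a floor for unseen pairs."""
    total = 0.0
    for prev, word in zip(["<s>"] + words, words + ["</s>"]):
        total += bigrams.get((prev, word), unseen)
    return total


lp_recognize = sentence_logprob(["recognize", "speech"], BIGRAM_LOGPROB)
lp_wreck = sentence_logprob(["wreck", "a", "nice", "beach"], BIGRAM_LOGPROB)
```

Under these assumed numbers the shorter hypothesis scores higher; a recognizer would combine such a score with the acoustic score before choosing.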
Speech Recognizer
(Decoder)
[Figure: candidate words speech, beach, switch. Or, the recognizer can return a lattice (edges of returned graph not shown)]
Confidence Estimation
Problem: How confident are we that the recognizer's
hypothesis is correct?
The question can be posed at the word level or the utterance level
Assign a confidence value to each word in the hypothesis, or
to the entire utterance as a single unit
Training
The models used by a recognizer must be trained
Acoustic and language models are typically trained based on statistics
gathered from known training data
The templates used in template matching are one such example, although
a trivial one
Resources
Spoken Language Processing, by Huang, Acero and Hon
Extensive references to virtually all topics in speech
Cambridge HTK
The HTK Book