Tied-State HMMs + Introduction To NN-based AMs
Lecture 7
CS 753
Instructor: Preethi Jyothi
Recall: Acoustic Model
[Figure: ASR decoding pipeline. Acoustic models map acoustic indices to triphones; the context transducer maps triphones to monophones; the pronunciation model maps monophones to words; the language model scores the resulting word sequence. H, the acoustic-model transducer, is built as the union + closure of the per-triphone FSTs (b/a_b, …, x/y_z), giving the resulting FST.]
Triphone HMM Models
• Each phone is modelled in the context of its left and right neighbour phones
• If each triphone HMM has 3 states and each state generates an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40), with each component's full covariance Σ having d² parameters, then every triphone state needs roughly m(d + d²) parameters, and there is one such state for every state of every triphone
• Insufficient data to learn all triphone models reliably. What do we do? Share parameters across triphone models! (A back-of-the-envelope count follows.)
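To make the blow-up concrete, here is a back-of-the-envelope count in Python; the phone inventory size of ~50 is an assumed, illustrative figure:

# Rough parameter count for untied triphone GMM-HMMs.
# Assumed figures (illustrative): ~50 base phones, full-covariance Gaussians.
n_phones = 50
n_triphones = n_phones ** 3        # every left/right context: 125,000 models
states_per_hmm = 3                 # 3-state HMMs
m = 64                             # mixture components per state
d = 40                             # acoustic feature dimension

# Each component has a mean (d), a full covariance (d^2), and a weight.
params_per_component = d + d * d + 1
total = n_triphones * states_per_hmm * m * params_per_component

print(f"{total:.2e} parameters")   # ~3.9e+10: far too many to estimate reliably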
Parameter Sharing
• Sharing of parameters (also referred to as “parameter tying”) can be
done at any level:
• Parameters in HMMs corresponding to two triphones are said to be
tied if they are identical
[Figure: two triphone HMMs with transition probabilities t1…t5 and t′1…t′5; the transition probabilities are tied, i.e., t′i = ti.]
[Figure: toy decision tree over vegetables, with questions such as Shape? (Leafy / Cylindrical / Oval), Color? (Green), and Taste?, leading to leaves such as Spinach.]
How do we build these phone DTs?
1. What questions are used?
Linguistically-inspired binary questions: “Does the left or right phone come
from a broad class of phones such as vowels, stops, etc.?” “Is the left or
right phone [k] or [m]?”
2. What is the training data for each phone state, pj? (root node of DT)
Training data for DT nodes
• Align training data xi = (xi1, …, xiTi), i = 1…N, where xit ∈ ℝd, against a set of triphone HMMs
• Use Viterbi algorithm to find the best HMM state sequence
corresponding to each xi
• Tag each xit with ID of current phone along with left-context
and right-context
[Figure: an utterance's frames xit aligned against the triphone sequence sil/b/aa, b/aa/g, aa/g/sil.]
xit is tagged with ID aa2[b/g] i.e. xit is aligned with the second state
of the 3-state HMM corresponding to the triphone b/aa/g
• For a state j in phone p, collect all xit's that are tagged with ID pj[?/?] (a sketch of this step follows)
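A minimal sketch of this collection step, assuming the Viterbi alignment has already produced a (phone, state, left-context, right-context) tag per frame; the data structures here are illustrative, not from any toolkit:

from collections import defaultdict

def collect_training_data(aligned_frames):
    """Group aligned frames by (phone, state): this pool is the root-node
    training data for the DT of each phone state p_j.
    `aligned_frames` is assumed to be a list of (feature_vector, tag) pairs,
    where a tag like ("aa", 2, "b", "g") means the frame aligned with
    state 2 of the triphone b/aa/g."""
    root_data = defaultdict(list)
    for x, (phone, state, left, right) in aligned_frames:
        # Pool over all contexts (pj[?/?]), but keep the contexts with
        # each frame: the DT questions will ask about them.
        root_data[(phone, state)].append((x, left, right))
    return root_data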
2. What is the training data for each phone state, pj? (root node of DT)
All speech frames that align with the jth state of every triphone HMM that
has p as the middle phone
3. What criterion is used at each node to find the best question to split the
data on?
Find the question which partitions the states in the parent node so as to
give the maximum increase in log likelihood
Likelihood of a cluster of states
• For a question q that splits S into Syes and Sno, compute the gain in log likelihood (L(S) is computed in the sketch below):
Δq = L(Syes) + L(Sno) − L(S)
Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
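Here L(S) is the log likelihood of all frames in cluster S under a single Gaussian fit to those frames (Young et al. use this single-Gaussian approximation). A sketch with diagonal covariances, assuming frames are rows of a numpy array:

import numpy as np

def cluster_log_likelihood(X):
    """L(S): log likelihood of the frames in S (rows of X, shape n x d)
    under a single maximum-likelihood Gaussian with diagonal covariance:
    L(S) = -(n/2) * (d*log(2*pi) + sum_k log var_k + d)."""
    n, d = X.shape
    var = X.var(axis=0) + 1e-8        # ML variances, floored for stability
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(X, answers):
    """Delta_q = L(S_yes) + L(S_no) - L(S), where the boolean array
    `answers` gives each frame's yes/no answer to candidate question q
    (both sides assumed non-empty)."""
    return (cluster_log_likelihood(X[answers])
            + cluster_log_likelihood(X[~answers])
            - cluster_log_likelihood(X))

# At each node: evaluate split_gain for every candidate question, split on
# the argmax, and stop when the best gain falls below a threshold.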
Example: Phonetic Decision Tree (DT)
One tree is constructed for each state of each phone to cluster all the
corresponding triphone states
[Figure: DT for the center state of [ow]. The head node uses all training data tagged as ow2[?/?] (aa/ow2/f, aa/ow2/s, aa/ow2/d, h/ow2/p, aa/ow2/n, aa/ow2/g, …). Internal nodes ask questions such as "Is left ctxt a vowel?", "Is right ctxt a fricative?", "Is right ctxt nasal?", and "Is right ctxt a glide?". The leaves hold the resulting clusters of tied states, e.g., Leaf A = {aa/ow2/f, aa/ow2/s, …}, Leaf B = {aa/ow2/d, aa/ow2/g, …}, Leaf C = {h/ow2/l, b/ow2/r, …}, Leaf D = {h/ow2/p, b/ow2/k, …}, Leaf E = {aa/ow2/n, aa/ow2/m, …}.]
For an unseen triphone at test time
• Transition Matrix:
• All triphones of a given phoneme share a single transition matrix; use that common matrix
• State observation densities:
• Use the triphone identity to traverse all the way to a leaf of the decision tree
• Use the state observation probabilities associated with that leaf (a lookup sketch follows)
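A sketch of this test-time lookup with a hypothetical tree representation; the Node class and tied_state function are illustrative:

class Node:
    """Hypothetical DT node: internal nodes hold a question over the
    left/right contexts; leaves hold a tied-state (senone) id."""
    def __init__(self, question=None, yes=None, no=None, senone=None):
        self.question, self.yes, self.no, self.senone = question, yes, no, senone

def tied_state(tree, left, right):
    """Traverse the DT with the unseen triphone's contexts to a leaf."""
    node = tree
    while node.senone is None:
        node = node.yes if node.question(left, right) else node.no
    return node.senone

# e.g., for an unseen triphone z/ow/f, look up state 2 of [ow]:
#   senone = tied_state(trees[("ow", 2)], left="z", right="f")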
That’s a wrap on HMM-based acoustic models
[Figure: the ASR pipeline again, now with the acoustic model realized as one 3-state HMM for each tied-state triphone, whose parameters are estimated using the Baum-Welch algorithm; as before, H is the union + closure of the per-triphone FSTs (b/a_b, …, x/y_z).]
DNN-based acoustic models?
[Figure: the ASR pipeline once more, with the acoustic model scores replaced by phone posteriors from a deep neural network.]

Can we use deep neural networks instead of HMMs to learn mappings between acoustics and phones?

[Excerpt from Dahl et al. (2012), shown on the slide: the DNN estimates senone (tied triphone state) posterior probabilities; dividing by the prior probability of each senone, estimated from the training set, gives scaled likelihoods (p(x) is independent of the word sequence and can be ignored). The authors found this scaling important for alleviating the label bias problem, especially when training utterances contain long silence segments. CD-DNN-HMMs can be trained using the embedded Viterbi algorithm, reusing the triphone tying structure and the HMMs of the CD-GMM-HMM system; logical triphones that are effectively equivalent are clustered and represented by a physical triphone. Fig. 1 of the paper shows the hybrid architecture: the HMM models the sequential property of the speech signal, the DNN models the scaled observation likelihood of all the senones, and the same DNN is replicated over different points in time.]
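A minimal sketch of the posterior-to-scaled-likelihood conversion described in the excerpt, in the log domain; the array shapes are assumptions:

import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Hybrid DNN-HMM decoding: convert senone posteriors p(s|x) to
    scaled likelihoods p(x|s) proportional to p(s|x) / p(s); p(x) does
    not depend on the word sequence, so it can be dropped.
    log_posteriors: (T, n_senones) per-frame DNN outputs (log softmax).
    log_priors: (n_senones,) senone log priors counted from the
    training alignments."""
    return log_posteriors - log_priors   # broadcasts over the T frames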
Brief Introduction to Neural Networks
Feed-forward Neural Network
[Figure: a feed-forward network with an input layer, a hidden layer, and an output layer.]
Brain Metaphor
[Figure: a single neuron: inputs xi are weighted by wi, summed, and passed through an activation function g to produce the output yi.]
yi = g(Σi wi xi)
[Figure: a small feed-forward network: inputs x1, x2 at nodes 1 and 2, hidden nodes 3 and 4, output node 5, with weights w13, w14, w23, w24, w35, w45.]
a5 = g(w35 ⋅ a3 + w45 ⋅ a4)
   = g(w35 ⋅ g(w13 ⋅ a1 + w23 ⋅ a2) + w45 ⋅ g(w14 ⋅ a1 + w24 ⋅ a2))
Parameters of the network: all wij (and biases, not shown here)
The simplest neural network is the perceptron:
Perceptron(x) = xW + b
A 1-layer feedforward neural network has the form:
MLP(x) = g(xW1 + b1) W2 + b2
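A direct numpy transcription of the 1-layer feedforward form above; the layer sizes below are arbitrary, chosen only for illustration:

import numpy as np

def mlp(x, W1, b1, W2, b2, g=np.tanh):
    """MLP(x) = g(x W1 + b1) W2 + b2, with row-vector inputs."""
    return g(x @ W1 + b1) @ W2 + b2

# Illustrative shapes: 40-dim input, 128 hidden units, 10 outputs.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(40, 128)), np.zeros(128)
W2, b2 = rng.normal(size=(128, 10)), np.zeros(10)
y = mlp(rng.normal(size=(1, 40)), W1, b1, W2, b2)   # shape (1, 10)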
Common Activation Functions (g)
Sigmoid: σ(x) = 1/(1 + e^(−x))
Hyperbolic tangent (tanh): tanh(x) = (e^(2x) − 1)/(e^(2x) + 1)
Rectified Linear Unit (ReLU): ReLU(x) = max(0, x)
[Figure: plots of the three nonlinear activation functions (sigmoid, tanh, ReLU) for x ∈ [−10, 10].]
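The three activations transcribed into code (the tanh formula matches the slide; in practice np.tanh avoids overflow for large |x|):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))               # output in (0, 1)

def tanh(x):                                      # output in (-1, 1)
    return (np.exp(2*x) - 1) / (np.exp(2*x) + 1)  # np.tanh is the stable version

def relu(x):
    return np.maximum(0.0, x)                     # identity for x > 0, else 0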
Optimization Problem
SGD Algorithm
Inputs: function NN(x; θ), training examples x1 … xn with outputs y1 … yn, and loss function L.
Return: θ (the update loop is sketched below)
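The update loop itself, which the slide elides, is the standard SGD recipe; a sketch assuming a black-box grad(θ, x, y) that returns ∂L(NN(x; θ), y)/∂θ (e.g., computed by backpropagation, covered next):

import random

def sgd(grad, theta, xs, ys, lr=0.01, epochs=10):
    """Standard stochastic gradient descent.
    grad(theta, x, y) is assumed to return the gradient of
    L(NN(x; theta), y) with respect to theta (a numpy array)."""
    data = list(zip(xs, ys))
    for _ in range(epochs):
        random.shuffle(data)                         # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad(theta, x, y)   # step against the gradient
    return theta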
Training a Neural Network
Backpropagation
L Forward Pass
Base case: ∂L/∂L = 1
First, in a forward
For each u (top to pass, compute
bottom): v values of all nodes
For each v ∈ Γ(u): given an input
Inductively, have
u (The values of each node
will be needed during
computed ∂L/∂v backprop)
Directly compute ∂v/∂u
Compute ∂L/∂u
Compute ∂L/∂w
Where values computed in the
where ∂L/∂w = ∂L/∂u ⋅ ∂u/∂w forward pass may be needed
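A worked example of both passes on the small 2-2-1 network from earlier, with g = tanh and a toy loss L = ½a5² (both choices are assumptions for illustration); note how the backward pass reuses the cached forward-pass activations:

import numpy as np

g, dg = np.tanh, lambda a: 1 - a ** 2   # dg takes the *activation* a = g(z)

def forward_backward(x1, x2, w13, w23, w14, w24, w35, w45):
    # Forward pass: compute and cache every node's value.
    a1, a2 = x1, x2
    a3 = g(w13 * a1 + w23 * a2)
    a4 = g(w14 * a1 + w24 * a2)
    a5 = g(w35 * a3 + w45 * a4)
    L = 0.5 * a5 ** 2                    # toy loss, so dL/da5 = a5

    # Backward pass, top to bottom (base case dL/dL = 1).
    dL_da5 = a5
    dL_dz5 = dL_da5 * dg(a5)             # z5 = w35*a3 + w45*a4
    dL_dw35, dL_dw45 = dL_dz5 * a3, dL_dz5 * a4
    dL_da3, dL_da4 = dL_dz5 * w35, dL_dz5 * w45
    dL_dz3, dL_dz4 = dL_da3 * dg(a3), dL_da4 * dg(a4)
    dL_dw13, dL_dw23 = dL_dz3 * a1, dL_dz3 * a2
    dL_dw14, dL_dw24 = dL_dz4 * a1, dL_dz4 * a2
    return L, (dL_dw13, dL_dw23, dL_dw14, dL_dw24, dL_dw35, dL_dw45)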
History of Neural Networks in ASR
[M09] A. Mohamed, G. Dahl, and G. Hinton, “Deep belief networks for phone recognition,” NIPS Workshop on Deep Learning for Speech Recognition, 2009.
[D11] G. Dahl, D. Yu, L. Deng, and A. Acero, “Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition,” IEEE TASLP, 20(1):30–42, 2012.
[H12] G. Hinton, et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition”, IEEE Signal Processing Magazine, 2012.
What’s new?
• Important developments
• Vast quantities of data available for ASR training
• Fast GPU-based training
• Improvements in optimization/initialization techniques
• Deeper networks enabled by fast training
• Larger output spaces enabled by fast training and
availability of data