
Acoustic Modeling:

Tied-state HMMs & DNN-based models

Lecture 7

CS 753
Instructor: Preethi Jyothi
Recall: Acoustic Model
[Figure: the ASR pipeline. Acoustic feature indices are mapped to triphones by the Acoustic Models, triphones to monophones by the Context Transducer, monophones to words by the Pronunciation Model, and the Language Model scores the resulting word sequence. Each triphone (a/a_b, b/a_b, ..., x/y_z) has its own 3-state HMM written as an FST over acoustic feature indices (f0, f1, ...); taking the union of these per-triphone FSTs followed by closure gives the resulting acoustic-model transducer H.]
Triphone HMM Models
• Each phone is modelled in the context of its left and right neighbour phones

• The pronunciation of a phone is influenced by the preceding and succeeding phones: e.g., the phone [p] in the word "peek" (p iy k) vs. [p] in the word "pool" (p uw l)

• Number of triphones that appear in data ≈ 1000s or 10,000s

• If each triphone HMM has 3 states, and each state has an m-component GMM (m ≈ 64) over d-dimensional acoustic feature vectors (d ≈ 40) with each covariance Σ having d² parameters ...

• Hundreds of millions of parameters!


• Insufficient data to learn all triphone models reliably. What do we do? Share parameters
across triphone models!
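To make the size of the untied system concrete, here is a rough back-of-the-envelope count (a sketch with illustrative numbers taken from the bullets above; actual systems vary):

```python
# Rough parameter count for untied triphone GMM-HMMs (illustrative numbers only).
n_triphones = 10_000   # triphones observed in data (order of magnitude)
n_states    = 3        # emitting states per triphone HMM
m           = 64       # Gaussians per state
d           = 40       # acoustic feature dimension

# Per Gaussian: d means + d*d covariance entries (full covariance), plus 1 mixture weight.
params_per_gaussian = d + d * d + 1
total = n_triphones * n_states * m * params_per_gaussian
print(f"{total:,} parameters")
# Roughly 3.1 billion with these full-covariance numbers; even with diagonal
# covariances (2d + 1 parameters per Gaussian) it is still ~155 million.
```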
Parameter Sharing
• Sharing of parameters (also referred to as "parameter tying") can be done at any level:

• Parameters in HMMs corresponding to two triphones are said to be tied if they are identical

[Figure: two triphone HMMs drawn side by side, with their transition probabilities tied, i.e. t′i = ti, and their state observation densities tied.]

• More parameter tying: tying the variances of all Gaussians within a state, tying the variances of all Gaussians in all states, tying individual Gaussians, etc.
1. Tied Mixture Models
• All states share the same Gaussians (i.e. the same means and covariances)

• Mixture weights are specific to each state

[Figure: triphone HMM states with no sharing vs. tied mixture models, where every state draws on the same shared pool of Gaussians but keeps its own mixture weights.]
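A minimal sketch of what tied mixtures look like as code, assuming diagonal-covariance Gaussians and made-up shapes (nothing here corresponds to a specific toolkit): the Gaussian means and variances are stored once and shared by every state; only the mixture weights are per-state.

```python
import numpy as np

# Sketch of a tied mixture model: every state shares one codebook of Gaussians;
# only the mixture weights are state-specific. Shapes are illustrative.
rng = np.random.default_rng(0)
d, m, n_states = 2, 4, 5                  # feature dim, shared Gaussians, HMM states

means = rng.normal(size=(m, d))           # shared across all states
variances = np.ones((m, d))               # shared diagonal covariances
weights = rng.dirichlet(np.ones(m), size=n_states)   # one weight vector per state

def log_gauss_diag(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def state_loglik(x, j):
    # log p(x | state j) = log sum_k w_jk N(x; mu_k, var_k) over the shared codebook
    comp = log_gauss_diag(x, means, variances) + np.log(weights[j])
    return np.logaddexp.reduce(comp)

x = rng.normal(size=d)
print([round(state_loglik(x, j), 3) for j in range(n_states)])
```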


2. State Tying
• Observation probabilities are shared across states which generate acoustically similar data

[Figure: HMMs for the triphones b/a/k, p/a/k and b/a/g. Without sharing, each model has its own state observation densities; with state tying, acoustically similar states across the three models share a single observation density.]
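One way to picture state tying is as a lookup table from triphone states to tied-state ids, with observation-density parameters stored once per tied state. The sketch below is purely illustrative (the triphone labels follow the figure above, and the parameter contents are omitted).

```python
# Illustrative only: acoustically similar triphone states map to the same tied-state
# id, so they literally share one set of observation-density parameters.
tied_state_of = {
    ("b/a/k", 1): "a_state1_tied_0",
    ("p/a/k", 1): "a_state1_tied_0",   # tied with b/a/k state 1
    ("b/a/g", 1): "a_state1_tied_1",
}
gmm_params = {                          # one GMM per tied state (contents omitted)
    "a_state1_tied_0": None,
    "a_state1_tied_1": None,
}
print(tied_state_of[("b/a/k", 1)] == tied_state_of[("p/a/k", 1)])   # True: shared density
```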


Tied state HMMs
Four main steps in building a tied-state HMM system:

1. Create and train 3-state monophone HMMs with single-Gaussian observation probability densities.

2. Clone these monophone distributions to initialise a set of untied triphone models; train them using Baum-Welch estimation. The transition matrix remains common across all triphones of each phone.

3. For all triphones derived from the same monophone, cluster the states whose parameters should be tied together.

4. Increase the number of mixture components in each tied state and re-estimate the models using Baum-Welch.
Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Tied state HMMs

Which states should be tied together? Use decision trees.
Decision Trees

Classification using a decision tree:


Begin at the root node: test which property is satisfied. Depending on the answer, traverse to a different branch, and repeat until a leaf is reached.

[Figure: a toy decision tree that classifies vegetables (spinach, snakegourd, turnip, tomato, radish, brinjal) by asking questions about shape, colour and taste at each node.]
Decision Trees

• Given the data at a node, either declare the node to be a leaf, or find another property to split the node into branches.

• Important questions to be addressed for DTs:
1. How many splits at a node? Chosen by the user.
2. Which property should be used at a node for splitting? One which decreases the "impurity" of the nodes as much as possible (a generic sketch follows below).
3. When is a node a leaf? Set a threshold on the reduction in impurity.
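As a generic illustration of these three choices (binary splits, impurity-based question selection, and a threshold-based stopping rule), here is a small sketch in Python. It uses entropy as the impurity measure and a toy vegetable example in the spirit of the figure above, not the phonetic criterion used for HMM state tying (that comes later).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split(data, questions, min_gain=0.1):
    """data: list of (features, label); questions: list of (name, predicate)."""
    labels = [y for _, y in data]
    best = None
    for name, q in questions:
        yes = [(x, y) for x, y in data if q(x)]
        no  = [(x, y) for x, y in data if not q(x)]
        if not yes or not no:
            continue
        gain = entropy(labels) - (len(yes) * entropy([y for _, y in yes])
                                  + len(no) * entropy([y for _, y in no])) / len(data)
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None or best[0] < min_gain:          # declare this node a leaf
        return Counter(labels).most_common(1)[0][0]
    gain, name, yes, no = best
    return {name: (split(yes, questions, min_gain), split(no, questions, min_gain))}

# Toy usage: classify vegetables using two binary questions.
data = [({"shape": "leafy", "color": "green"}, "spinach"),
        ({"shape": "oval", "color": "purple"}, "brinjal"),
        ({"shape": "oval", "color": "white"}, "turnip")]
questions = [("leafy?", lambda f: f["shape"] == "leafy"),
             ("purple?", lambda f: f["color"] == "purple")]
print(split(data, questions))
```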
How do we build these phone DTs?
1. What questions are used?
Linguistically-inspired binary questions: "Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?" "Is the left or right phone [k] or [m]?"

2. What is the training data for each phone state, pj? (root node of the DT)
Training data for DT nodes
• Align the training data, xi = (xi1, …, xiTi), i = 1…N, where xit ∈ ℝd, against a set of triphone HMMs

• Use the Viterbi algorithm to find the best HMM state sequence corresponding to each xi

• Tag each xit with the ID of the current phone along with its left-context and right-context

[Figure: frames aligned against the triphone sequence sil/b/aa, b/aa/g, aa/g/sil. A frame xit aligned with the second state of the 3-state HMM for the triphone b/aa/g is tagged with the ID aa2[b/g].]

• For a state j in phone p, collect all xit’s that are tagged with ID pj[?/?]
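A small sketch of this bookkeeping step (the alignment below is made up; in practice it comes from the Viterbi pass): group every tagged frame under the (phone, state) pair that will serve as the root node of the corresponding tree.

```python
from collections import defaultdict

# Assume a Viterbi pass has already tagged every frame with the triphone and the
# HMM state it aligned to. The alignment below is invented for illustration.
alignment = [
    ([0.1, 0.2], "b/aa/g", 2),    # (frame, triphone left/center/right, state index)
    ([0.0, 0.3], "b/aa/g", 2),
    ([0.5, 0.1], "sil/b/aa", 1),
    ([0.4, 0.2], "p/aa/k", 2),
]

# For state j of base phone p, gather every frame tagged p_j[?/?]:
root_data = defaultdict(list)
for frame, triphone, state in alignment:
    left, center, right = triphone.split("/")
    root_data[(center, state)].append((frame, left, right))

# All frames for the root node of the tree for state 2 of [aa]:
print(root_data[("aa", 2)])
```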
How do we build these phone DTs?
1. What questions are used?
Linguistically-inspired binary questions: "Does the left or right phone come from a broad class of phones such as vowels, stops, etc.?" "Is the left or right phone [k] or [m]?"

2. What is the training data for each phone state, pj? (root node of the DT)
All speech frames that align with the jth state of every triphone HMM that has p as the middle phone.

3. What criterion is used at each node to find the best question to split the data on?
Find the question which partitions the states in the parent node so as to give the maximum increase in log likelihood.
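One simple way to represent such questions in code is as set-membership tests on the left or right context. The broad classes below are tiny illustrative subsets, not a real phone inventory.

```python
# Sketch of linguistically inspired binary questions as set-membership tests.
BROAD_CLASSES = {
    "vowel":     {"aa", "iy", "uw", "ow"},
    "stop":      {"p", "b", "t", "d", "k", "g"},
    "nasal":     {"m", "n", "ng"},
    "fricative": {"f", "s", "sh", "z"},
}

def make_question(side, cls):
    """Question: does the {left, right} context belong to broad class cls?"""
    members = BROAD_CLASSES[cls]
    return (f"{side}_ctxt_is_{cls}",
            lambda left, right: (left if side == "left" else right) in members)

questions = [make_question(side, cls) for side in ("left", "right") for cls in BROAD_CLASSES]
questions += [("left_is_k_or_m", lambda left, right: left in {"k", "m"})]

name, q = questions[0]
print(name, q("aa", "g"))   # e.g. the left context 'aa' is a vowel -> True
```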
Likelihood of a cluster of states

• If a cluster of HMM states S = {s1, s2, …, sM} consists of M states, and a total of K acoustic observation vectors {x1, x2, …, xK} are associated with S, then the log likelihood associated with S is:

L(S) = Σi=1..K Σs∈S log Pr(xi; μS, ΣS) γs(xi)

where γs(xi) is the posterior probability of xi being generated by state s, and (μS, ΣS) are the mean and covariance of a single Gaussian fit to all the data in S.

• For a question q that splits S into Syes and Sno, compute the following quantity:

Δq = L(Syes) + L(Sno) − L(S)

• Go through all the questions, find Δq for each question q, and choose the question for which Δq is the largest

• Terminate when the final Δq is below a threshold, or when the data associated with a split falls below a threshold
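Putting the criterion into code, here is a sketch that fits a single diagonal-covariance Gaussian to the frames in a cluster, computes L(S), and picks the question with the largest Δq. It uses hard frame-to-state assignments in place of the posteriors γs(xi), and all of the data in the example is made up.

```python
import numpy as np

def cluster_loglik(frames):
    """L(S): log likelihood of frames under a single ML-fitted diagonal Gaussian."""
    X = np.asarray(frames)
    mu, var = X.mean(axis=0), X.var(axis=0) + 1e-6
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)

def best_question(frames, contexts, questions):
    """frames[i]: a frame at this node; contexts[i] = (left, right); questions: (name, q)."""
    L_parent = cluster_loglik(frames)
    best = None
    for name, q in questions:
        yes = [f for f, (l, r) in zip(frames, contexts) if q(l, r)]
        no  = [f for f, (l, r) in zip(frames, contexts) if not q(l, r)]
        if not yes or not no:
            continue
        delta = cluster_loglik(yes) + cluster_loglik(no) - L_parent   # Δq
        if best is None or delta > best[0]:
            best = (delta, name, yes, no)
    return best   # split here only if delta exceeds a threshold

# Tiny illustrative usage (made-up frames and contexts):
frames = [[0.0, 0.1], [0.1, 0.0], [2.0, 2.1], [2.1, 1.9]]
contexts = [("b", "g"), ("b", "d"), ("f", "s"), ("v", "s")]
questions = [("right_is_fricative", lambda l, r: r in {"f", "s"})]
print(best_question(frames, contexts, questions))
```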
Likelihood criterion
• Given a phonetic question, let the initial set of untied states S be split into two partitions Syes and Sno

• Each partition is clustered to form a single Gaussian output distribution, with mean μSyes and covariance ΣSyes (and similarly μSno and ΣSno)

• Use the likelihood of the parent state and of the subsequent split states to determine which question a node should be split on

Image from: Young et al., “Tree-based state tying for high accuracy acoustic modeling”, ACL-HLT, 1994
Example: Phonetic Decision Tree (DT)
One tree is constructed for each state of each phone, to cluster all the corresponding triphone states.

[Figure: the decision tree for the center state of [ow]. The head node uses all training data tagged as ow2[?/?] (frames from triphones such as aa/ow2/f, aa/ow2/s, aa/ow2/d, h/ow2/p, aa/ow2/n, aa/ow2/g, ...). The root asks "Is the left context a vowel?"; lower nodes ask questions such as "Is the right context a fricative?", "Is the right context a nasal?" and "Is the right context a glide?". The leaves A to E are the tied states, grouping triphone states such as Leaf A: {aa/ow2/f, aa/ow2/s, ...}, Leaf B: {aa/ow2/d, aa/ow2/g, ...}, Leaf C: {h/ow2/l, b/ow2/r, ...}, Leaf D: {h/ow2/p, b/ow2/k, ...}, Leaf E: {aa/ow2/n, aa/ow2/m, ...}.]
For an unseen triphone at test time

• Transition matrix: use the transition matrix common to all triphones of that phone

• State observation densities: use the triphone identity to traverse all the way to a leaf of the decision tree, and use the state observation probabilities associated with that leaf (a sketch of this lookup follows below)
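A sketch of that lookup. The tree, questions and tied-state ids below are illustrative, loosely following the [ow] example above, and the nested-dict tree format matches the generic splitter sketch earlier.

```python
# Sketch: looking up the tied state for an unseen triphone at test time.
# Tree format: {question_name: (yes_subtree, no_subtree)}, leaves hold tied-state ids.
tree_for_ow2 = {
    "left_ctxt_is_vowel": (
        {"right_ctxt_is_fricative": ("leaf_A", "leaf_B")},   # yes branch
        "leaf_D",                                             # no branch
    )
}
QUESTIONS = {
    "left_ctxt_is_vowel":      lambda l, r: l in {"aa", "iy", "uw", "ow"},
    "right_ctxt_is_fricative": lambda l, r: r in {"f", "s", "sh", "z"},
}

def tied_state(tree, left, right):
    while isinstance(tree, dict):
        (name, (yes, no)), = tree.items()
        tree = yes if QUESTIONS[name](left, right) else no
    return tree

# An unseen triphone like zh/ow/f still reaches some leaf:
print(tied_state(tree_for_ow2, "zh", "f"))   # -> 'leaf_D' (left context not a vowel here)
```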
That’s a wrap on HMM-based acoustic models
[Figure: the same pipeline as above. There is now one 3-state HMM for each tied-state triphone, with parameters estimated using the Baum-Welch algorithm; the union and closure of the per-triphone FSTs gives the resulting transducer H.]
DNN-based acoustic models?
[Figure: the same pipeline, now asking: can we use deep neural networks instead of HMMs to learn the mapping between acoustics and phones? The acoustic model would then produce phone (senone) posteriors.]

Figure from Dahl et al., "Context-Dependent Pre-Trained Deep Neural Networks for LVSR": the hybrid architecture employs a deep neural network; the HMM models the sequential property of the speech signal, and the DNN models the scaled observation likelihood of all the senones (tied triphone states). The same DNN is replicated over different points in time. The senone posteriors estimated by the DNN are divided by the senone prior probabilities (estimated from the training set) to obtain scaled likelihoods for decoding.
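A minimal sketch of the scoring step in such a hybrid system, assuming the DNN's senone posteriors and the senone priors are already available (the numbers below are made up): dividing posteriors by priors gives the scaled likelihoods that replace the GMM scores.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    # log p(x | s) ~ log P(s | x) - log P(s), up to a term independent of s
    return log_posteriors - log_priors

posteriors = np.array([0.7, 0.2, 0.1])   # DNN output over 3 senones for one frame
priors     = np.array([0.5, 0.3, 0.2])   # senone frequencies in the training alignment
print(scaled_log_likelihoods(np.log(posteriors), np.log(priors)))
```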
Brief Introduction to Neural Networks
Feed-forward Neural Network

[Figure: a feed-forward neural network with an input layer, a hidden layer and an output layer.]
Feed-forward Neural Network

Brain Metaphor

Single neuron

[Figure: a single neuron with inputs xi, weights wi and activation function g, computing the output yi = g(Σi wi xi).]
Image from: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png


Feed-forward Neural Network

Parameterized Model

[Figure: a small feed-forward network with input nodes 1 and 2 (receiving x1 and x2), hidden nodes 3 and 4, output node 5, and connection weights w13, w14, w23, w24, w35, w45; ai denotes the output of node i.]

a5 = g(w35 ⋅ a3 + w45 ⋅ a4)
   = g(w35 ⋅ g(w13 ⋅ a1 + w23 ⋅ a2) + w45 ⋅ g(w14 ⋅ a1 + w24 ⋅ a2))

Parameters of the network: all the wij (and biases, not shown here)

If x is a 2-dimensional vector and the layer above it is a 2-dimensional vector h, a fully-connected layer is associated with:
h = xW + b
where wij in W is the weight of the connection between the ith neuron in the input row and the jth neuron in the first hidden layer, and b is the bias vector.
Feed-forward Neural Network

Parameterized Model

The simplest neural network is the perceptron:
Perceptron(x) = xW + b
A 1-layer feedforward neural network has the form:
MLP(x) = g(xW1 + b1) W2 + b2
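A small numpy sketch of these two formulas with arbitrary weights, just to make the shapes concrete (the dimensions and the choice of tanh for g are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid, d_out = 2, 4, 3
W1, b1 = rng.normal(size=(d_in, d_hid)), np.zeros(d_hid)
W2, b2 = rng.normal(size=(d_hid, d_out)), np.zeros(d_out)

def g(x):                      # nonlinearity; tanh chosen arbitrarily here
    return np.tanh(x)

def perceptron(x, W, b):       # Perceptron(x) = xW + b
    return x @ W + b

def mlp1(x):                   # MLP(x) = g(xW1 + b1) W2 + b2
    return g(x @ W1 + b1) @ W2 + b2

x = np.array([0.5, -1.0])
print(perceptron(x, W1, b1).shape, mlp1(x).shape)   # (4,) (3,)
```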
Common Activation Functions (g)
Sigmoid: σ(x) = 1/(1 + e^(-x))

[Plot: the sigmoid function, an S-shaped curve increasing from 0 to 1 over x ∈ [−10, 10].]
Common Activation Functions (g)
Sigmoid: σ(x) = 1/(1 + e^(-x))
Hyperbolic tangent (tanh): tanh(x) = (e^(2x) − 1)/(e^(2x) + 1)

[Plot: sigmoid and tanh; tanh has the same S shape but ranges from −1 to 1.]
Common Activation Functions (g)
Sigmoid: σ(x) = 1/(1 + e^(-x))
Hyperbolic tangent (tanh): tanh(x) = (e^(2x) − 1)/(e^(2x) + 1)
Rectified Linear Unit (ReLU): ReLU(x) = max(0, x)

[Plot: sigmoid, tanh and ReLU on the same axes; ReLU is 0 for x < 0 and grows linearly for x ≥ 0.]
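The same three functions written out in numpy (a minimal sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # equivalent to np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.linspace(-5, 5, 5)
print(sigmoid(x), tanh(x), relu(x), sep="\n")
```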
Optimization Problem

• To train a neural network, define a loss function L(y, ỹ): a function of the true output y and the predicted output ỹ

• L(y, ỹ) assigns a non-negative numerical score to the neural network's output ỹ

• The parameters of the network are set to minimise L over the training examples (i.e. a sum of losses over different training samples)

• L is typically minimised using a gradient-based method


Stochastic Gradient Descent (SGD)

SGD Algorithm

Inputs: function NN(x; θ); training examples x1 … xn with outputs y1 … yn; loss function L.

do until stopping criterion:
    Pick a training example xi, yi
    Compute the loss L(NN(xi; θ), yi)
    Compute the gradient ∇L of L with respect to θ
    θ ← θ − η ∇L
done

Return: θ
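Here is a runnable sketch of this loop for a toy model NN(x; θ) = θ·x with squared-error loss; the data, learning rate and step count are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta + 0.01 * rng.normal(size=100)

def nn(x, theta):
    return x @ theta

theta, eta = np.zeros(3), 0.1
for step in range(1000):                       # "do until stopping criterion"
    i = rng.integers(len(X))                   # pick a training example
    y_hat = nn(X[i], theta)
    grad = (y_hat - Y[i]) * X[i]               # gradient of 0.5*(y_hat - y)^2 w.r.t. theta
    theta -= eta * grad                        # theta <- theta - eta * grad
print(theta)                                   # close to [1.0, -2.0, 0.5]
```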
Training a Neural Network

Define the loss function to be minimised as a node L

Goal: learn weights for the neural network which minimise L

Gradient descent: find ∂L/∂w for every weight w, and update it as w ← w − η ∂L/∂w

How do we efficiently compute ∂L/∂w for all w?

We will compute ∂L/∂u for every node u in the network!

∂L/∂w = ∂L/∂u ⋅ ∂u/∂w, where u is the node which uses w

Training a Neural Network

New goal: compute ∂L/∂u for every node u in the network

Simple algorithm: Backpropagation

Key fact: Chain rule of differentiation

If L can be written as a function of variables v1, …, vn, which in turn depend (partially) on another variable u, then

∂L/∂u = Σi ∂L/∂vi ⋅ ∂vi/∂u


Backpropagation
If L can be written as a function of variables v1,…, vn, which in turn
depend (partially) on another variable u, then
∂L/∂u = Σi ∂L/∂vi ⋅ ∂vi/∂u
Consider v1, …, vn as the layer above u, Γ(u)

Then, the chain rule gives:
∂L/∂u = Σv ∈ Γ(u) ∂L/∂v ⋅ ∂v/∂u
Backpropagation
∂L/∂u = Σv ∈ Γ(u) ∂L/∂v ⋅ ∂v/∂u

Backpropagation
Forward pass: first, compute the values of all nodes given an input (the value of each node will be needed during backprop).

Backward pass:
Base case: ∂L/∂L = 1
For each u (top to bottom):
    For each v ∈ Γ(u):
        Inductively, we have already computed ∂L/∂v
        Directly compute ∂v/∂u
    Compute ∂L/∂u
Finally, compute ∂L/∂w, where ∂L/∂w = ∂L/∂u ⋅ ∂u/∂w
(Values computed in the forward pass may be needed throughout.)
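To make the forward and backward passes concrete, here is a sketch on the small example network from earlier (inputs a1, a2; hidden a3, a4; output a5), assuming g is the sigmoid and a squared-error loss; the initial weights are arbitrary.

```python
import numpy as np

def g(z):                                      # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(a1, a2, y, w):
    # Forward pass: compute and store the value of every node.
    z3 = w["w13"] * a1 + w["w23"] * a2; a3 = g(z3)
    z4 = w["w14"] * a1 + w["w24"] * a2; a4 = g(z4)
    z5 = w["w35"] * a3 + w["w45"] * a4; a5 = g(z5)
    L = 0.5 * (a5 - y) ** 2

    # Backward pass: dL/du for every node u, top to bottom, via the chain rule.
    dL_da5 = a5 - y
    dL_dz5 = dL_da5 * a5 * (1 - a5)           # sigmoid'(z5) = a5 * (1 - a5)
    dL_da3 = dL_dz5 * w["w35"]
    dL_da4 = dL_dz5 * w["w45"]
    dL_dz3 = dL_da3 * a3 * (1 - a3)
    dL_dz4 = dL_da4 * a4 * (1 - a4)

    grads = {                                  # dL/dw = dL/dz(u) * du/dw
        "w35": dL_dz5 * a3, "w45": dL_dz5 * a4,
        "w13": dL_dz3 * a1, "w23": dL_dz3 * a2,
        "w14": dL_dz4 * a1, "w24": dL_dz4 * a2,
    }
    return L, grads

w = {k: 0.1 for k in ("w13", "w23", "w14", "w24", "w35", "w45")}
print(forward_backward(a1=1.0, a2=-1.0, y=1.0, w=w))
```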
History of Neural Networks in ASR

• Neural networks for speech recognition were explored as early as 1987

• Deep neural networks for speech:
  • Beat the state of the art on the TIMIT corpus [M09]
  • Significant improvements shown on large-vocabulary systems [D11]
  • Dominant ASR paradigm [H12]

[M09] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," NIPS Workshop on Deep Learning for Speech Recognition, 2009.

[D11] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition," IEEE TASL, 20(1), 2012.

[H12] G. Hinton, et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition," IEEE Signal Processing Magazine, 2012.
What’s new?

• Why have NN-based systems come back to prominence?

• Important developments
• Vast quantities of data available for ASR training
• Fast GPU-based training
• Improvements in optimization/initialization techniques
• Deeper networks enabled by fast training
• Larger output spaces enabled by fast training and availability of data
