Acoustic Modeling Using Deep Belief Networks
Abdel-Rahman Mohamed, George E. Dahl, and Geoffrey Hinton
Abstract: Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden
Markov models for speech recognition. We show that better
phone recognition on the TIMIT dataset can be achieved by
replacing Gaussian mixture models by deep neural networks
that contain many layers of features and a very large number
of parameters. These networks are first pre-trained as a multilayer generative model of a window of spectral feature vectors
without making use of any discriminative information. Once the
generative pre-training has designed the features, we perform
discriminative fine-tuning using backpropagation to adjust the
features slightly to make them better at predicting a probability
distribution over the states of monophone hidden Markov models.
Index Terms: Acoustic modeling, deep belief networks, neural networks, phone recognition
I. INTRODUCTION
Automatic Speech Recognition (ASR) has evolved significantly over the past few decades. Early systems typically
discriminated between isolated digits or yes/no responses, whereas current systems can do quite well at recognizing telephone-quality, spontaneous speech. A huge amount of progress has been made
in improving word recognition rates, but the core acoustic
modeling has remained fairly stable, despite many attempts
to develop better alternatives.
A typical ASR system uses Hidden Markov Models
(HMMs) to model the sequential structure of speech signals,
with each HMM state using a mixture of Gaussians to model a
spectral representation of the sound wave. The most common
spectral representation is a set of Mel Frequency Cepstral
coefficients (MFCCs) derived from a window of about 25 ms
of speech. The window is typically advanced by about 10 ms
per frame, and each frame of coefficients is augmented with
differences and differences of differences with nearby frames.
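As an illustration of this standard front end (not part of the original system described here), the sketch below computes MFCCs with first and second temporal differences using the librosa library; the 25 ms window, 10 ms hop, and 13 cepstral coefficients are typical values and are our assumptions, not the exact configuration used in this work.

    # Minimal sketch of an MFCC + delta + delta-delta front end using librosa.
    import numpy as np
    import librosa

    def mfcc_with_deltas(wav_path, sr=16000, n_mfcc=13):
        y, sr = librosa.load(wav_path, sr=sr)
        n_fft = int(0.025 * sr)      # 25 ms analysis window
        hop_length = int(0.010 * sr) # advanced by 10 ms per frame
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=n_fft, hop_length=hop_length)
        # Augment each frame with differences and differences of differences.
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T   # shape: (num_frames, 3 * n_mfcc)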
One research direction involves using deeper acoustic models that contain many layers of features. The work in [1] proposes a hierarchical framework where each layer is designed to
capture a set of distinctive feature landmarks. For each feature,
a specialized acoustic representation is constructed in which
that feature is easy to detect. In [2], a probabilistic generative
model is introduced where the dynamic structure in the hidden
vocal tract resonance space is used to characterize long-span
contextual influence across phonetic units.
Feedforward neural networks have been used in many
ASR systems [3], [4], [5]. Inspired by insights from [6],
the TRAP architecture [7] models a whole second of speech
For an RBM with V visible units and H hidden units, the weights and biases define the energy of a joint configuration of a visible vector v and a hidden vector h:

E(v, h|\theta) = -\sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{i=1}^{V} b_i v_i - \sum_{j=1}^{H} a_j h_j   (1)

where \theta = (w, b, a), w_{ij} is the symmetric weight between visible unit i and hidden unit j, and b_i and a_j are their biases. The probability that the model assigns to a visible vector is obtained by summing out the hidden units:

p(v|\theta) = \frac{\sum_{h} e^{-E(v,h|\theta)}}{\sum_{u}\sum_{h} e^{-E(u,h|\theta)}}   (2)

Because there are no visible-visible or hidden-hidden connections, the conditional distributions are factorial:

p(h_j = 1|v, \theta) = \sigma\big(a_j + \sum_{i=1}^{V} w_{ij} v_i\big)   (3)

p(v_i = 1|h, \theta) = \sigma\big(b_i + \sum_{j=1}^{H} w_{ij} h_j\big)   (4)

where \sigma(x) = (1 + e^{-x})^{-1} is the logistic function. Contrastive divergence updates each weight in proportion to the difference between two pairwise correlations:

\Delta w_{ij} \propto \langle v_i h_j \rangle_{data} - \langle v_i h_j \rangle_{recon}   (5)
The hidden states are computed from the data when computing the first term on the RHS of Eq. 5, and are recomputed from a one-step reconstruction of the data when computing the second term.
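To make Eq. 5 concrete, here is a minimal numpy sketch of one CD-1 update for a Bernoulli-Bernoulli RBM; the function and variable names, the learning rate, and the mean-field reconstruction choice are illustrative assumptions rather than the original implementation.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v_data, W, a, b, lr=0.01):
        """One CD-1 step for a Bernoulli-Bernoulli RBM (Eq. 5).
        v_data: (batch, V) binary visible vectors; W: (V, H);
        a: (H,) hidden biases; b: (V,) visible biases."""
        # Positive phase: hidden probabilities driven by the data (Eq. 3).
        h_prob = sigmoid(v_data @ W + a)
        h_sample = (np.random.random(h_prob.shape) < h_prob).astype(float)
        # Reconstruction: deterministic (mean-field) visible probabilities (Eq. 4),
        # then hidden probabilities recomputed from the reconstruction.
        v_recon = sigmoid(h_sample @ W.T + b)
        h_recon = sigmoid(v_recon @ W + a)
        # Eq. 5: difference of pairwise correlations, averaged over the mini-batch.
        n = v_data.shape[0]
        W = W + lr * (v_data.T @ h_prob - v_recon.T @ h_recon) / n
        a = a + lr * (h_prob - h_recon).mean(axis=0)
        b = b + lr * (v_data - v_recon).mean(axis=0)
        return W, a, b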
The contrastive divergence learning rule does not follow the
maximum likelihood gradient. Understanding why it works at
all is much easier using the directed view, so we defer the
explanation to the next section. After learning the weights in
an RBM module, we use the states of the hidden units, when
driven by real data, as the data for training another module of
the same kind. This process can be repeated to learn as many
layers of features as we desire. Again, understanding why this
greedy approach works is much easier using the directed view.
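The greedy stacking procedure just described can be sketched as follows. It reuses the cd1_update function from the previous snippet; the layer sizes, epoch count, and batch size are placeholders, not the settings used in the experiments.

    import numpy as np

    def train_rbm(data, num_hidden, epochs=10, batch_size=128, lr=0.01):
        # Thin wrapper around the cd1_update sketch above (illustrative only).
        num_visible = data.shape[1]
        W = 0.01 * np.random.randn(num_visible, num_hidden)
        a = np.zeros(num_hidden)   # hidden biases
        b = np.zeros(num_visible)  # visible biases
        for _ in range(epochs):
            for start in range(0, data.shape[0], batch_size):
                W, a, b = cd1_update(data[start:start + batch_size], W, a, b, lr)
        return W, a

    def pretrain_dbn(data, layer_sizes=(2048, 2048, 2048)):
        """Greedy layer-wise pre-training: each new RBM is trained on the
        hidden activities of the previous one when driven by real data."""
        layers = []
        for num_hidden in layer_sizes:
            W, a = train_rbm(data, num_hidden)
            layers.append((W, a))
            # Hidden probabilities become the "data" for the next RBM.
            data = 1.0 / (1.0 + np.exp(-(data @ W + a)))
        return layers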
For Gaussian-Bernoulli RBMs, the energy of a joint configuration is:

E(v, h|\theta) = \sum_{i=1}^{V} \frac{(v_i - b_i)^2}{2} - \sum_{i=1}^{V}\sum_{j=1}^{H} w_{ij} v_i h_j - \sum_{j=1}^{H} a_j h_j   (6)

assuming the real-valued visible units have been standardized to have unit variance. The hidden units remain binary, so their conditional distribution is still given by Eq. 3, but each visible unit is now conditionally Gaussian with unit variance:

p(v_i|h, \theta) = \mathcal{N}\big(b_i + \sum_{j=1}^{H} w_{ij} h_j,\; 1\big)   (7)

In the directed (sigmoid belief net) view, the binary states of the units in layer k-1 are generated top-down from the states of layer k:

p(h_i^{(k-1)} = 1|h^{(k)}) = \sigma\big(b_i^{(k-1)} + \sum_{j} w_{ij}^{(k)} h_j^{(k)}\big)   (8)
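As an illustration of Eqs. 6 and 7 (the names and shapes below are our assumptions, not code from this work), sampling in a Gaussian-Bernoulli RBM differs from the binary case only in the visible conditional:

    import numpy as np

    def gibbs_step_gb_rbm(v, W, a, b, rng=None):
        """One Gibbs sampling step for a Gaussian-Bernoulli RBM with unit-variance
        visible units. v: (batch, V) real-valued inputs; W: (V, H); a: (H,); b: (V,)."""
        rng = rng if rng is not None else np.random.default_rng(0)
        # Hidden units are still Bernoulli (Eq. 3 applies unchanged).
        h_prob = 1.0 / (1.0 + np.exp(-(v @ W + a)))
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Visible units are Gaussian with mean b + W h and unit variance (Eq. 7).
        v_mean = b + h @ W.T
        v_new = v_mean + rng.standard_normal(v_mean.shape)
        return h, v_new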
1) Learning with tied weights: Consider a sigmoid belief
net with an infinite number of layers and with tied symmetric
weights between layers as shown in figure 2. In this net, the
posterior in the first hidden layer is factorial: The hidden units
are independent given the states of the visible units. This
occurs because the correlations created by the prior coming
from all of the layers above exactly cancel the anti-correlations
in the likelihood term coming from the layer below [11].
Moreover, the factorial posterior can be computed by simply
multiplying the visible vector by the transposed weight matrix
and then applying the logistic function to each element:
p(h_j^{(1)} = 1|v, W) = \sigma\big(b_j + \sum_{i} w_{ij} v_i\big)   (10)
Fig. 2. An infinite sigmoid belief net with tied weights. Alternate layers
must have the same number of units. The tied weights make inference much
simpler in this net than in a general sigmoid belief net.
Once the posterior has been sampled for the first hidden
layer, exactly the same process can be used for the next hidden
layer. So inference is extremely easy in this special kind of
network. Learning is a little more difficult because every copy
of the tied weight matrix gets different derivatives. However,
we know in advance that the expected derivatives will be
zero for very high-level layers. This is because the bottom-up inference process is really a Markov chain that eventually
converges to its stationary distribution in the higher layers.
When it is sampling from its stationary distribution, the current
weights are perfect for explaining the samples, so, on average,
there is no derivative. When the weights and biases are small,
this Markov chain converges rapidly, so we can approximate the derivatives for the tied weights using only the pairwise correlations measured in the first few layers:

\langle v_i h_j^{(1)} \rangle - \langle h_i^{(2)} h_j^{(3)} \rangle   (11)

which corresponds to the contrastive divergence rule of Eq. 5, with h^{(2)} and h^{(3)} playing the roles of the reconstructed visible and hidden states.
This type of generative pre-training followed by discriminative fine-tuning has been used successfully for hand-written
character recognition [11], [10], [19], dimensionality reduction
[20], 3-D object recognition [21], [22], extracting road maps
from cluttered aerial images [23], information retrieval [24],
[25] and machine transliteration [26]. As we shall see, it is
also very good for phone recognition.
III. USING DEEP BELIEF NETS FOR PHONE RECOGNITION
In order to apply DBNs with fixed input and output dimensionality to phone recognition, we use a context window of
n successive frames of speech coefficients to set the states
of the visible units of the lowest layer of the DBN. Once
it has been pre-trained as a generative model, the resulting
feedforward neural network is discriminatively trained to output
a probability distribution over the possible labels of the central
frame. To generate phone sequences, the sequence of predicted probability distributions over labels is fed into a Viterbi decoder.
Footnote 5: In a convenient abuse of the correct terminology, we sometimes use DBN to refer to a feedforward neural network that was initialized using a generatively trained deep belief net, even though the feedforward neural network is clearly very different from a belief net.
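The framing of the input just described can be sketched as follows; the helper name and the edge-padding choice are illustrative assumptions rather than details taken from this work.

    import numpy as np

    def context_windows(features, n=11):
        """Stack n successive frames (n odd) around each frame of an utterance.
        features: (num_frames, dim) array of speech coefficients. Row t of the
        result is the DBN input whose target is the label of central frame t."""
        half = n // 2
        # Repeat the edge frames so the first and last frames get full windows.
        padded = np.pad(features, ((half, half), (0, 0)), mode="edge")
        return np.stack([padded[t:t + n].reshape(-1)
                         for t in range(features.shape[0])])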
The theory used to justify the pre-training algorithm assumes that when the states of the visible units are reconstructed
from the inferred binary activities in the first hidden layer,
they are reconstructed stochastically. To reduce noise in the
learning, we actually reconstructed them deterministically and
used the real values (see [18] for more details).
For fine-tuning, we used stochastic gradient descent with
the same mini-batch size as in pre-training. The learning rate
started at 0.1. At the end of each epoch, if the substitution error
on the development set increased, the weights were returned to
their values at the beginning of the epoch and the learning rate
was halved. This continued until the learning rate fell below
0.001.
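The learning-rate schedule described above can be sketched as the loop below; train_one_epoch and substitution_error are hypothetical callables supplied by the caller, not routines from the original system.

    import copy

    def fine_tune(params, train_one_epoch, substitution_error, lr=0.1, min_lr=0.001):
        """train_one_epoch(params, lr) -> params and substitution_error(params) -> float
        are hypothetical placeholders for the actual training and scoring code."""
        best_error = substitution_error(params)
        while lr >= min_lr:
            snapshot = copy.deepcopy(params)   # weights at the start of the epoch
            params = train_one_epoch(params, lr)
            error = substitution_error(params)
            if error > best_error:
                params = snapshot              # return to the start-of-epoch weights
                lr /= 2.0                      # and halve the learning rate
            else:
                best_error = error
        return params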
During both pre-training and fine-tuning, a small weightcost of 0.0002 was used and the learning was accelerated by
using a momentum of 0.9 (except for the first epoch of finetuning which did not use momentum). [18] gives a detailed
explanation of weight-cost and momentum and sensible ways
to set them.
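Within each epoch, the per-mini-batch update with momentum and weight-cost can be sketched as a generic SGD-with-momentum step using the stated values of 0.9 and 0.0002; the variable names are ours.

    def momentum_step(w, velocity, gradient, lr, momentum=0.9, weight_cost=0.0002):
        """One mini-batch update with momentum and a small L2 weight-cost applied
        to the parameters. Pass momentum=0.0 to mimic the first fine-tuning epoch."""
        velocity = momentum * velocity - lr * (gradient + weight_cost * w)
        return w + velocity, velocity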
Figure 3 and figure 4 show the effect of varying the size
of each hidden layer and the number of hidden layers. For
simplicity we used the same size for every hidden layer in a
network. For these comparisons, the number of input frames
was fixed at 11.
Fig. 3. Phone error rate on the development set as a function of the number of layers and size of each layer (512, 1024, 2048, or 3072 hidden units), using 11 input frames.
Fig. 4. Phone error rate on the core test set as a function of the number of layers and size of each layer (512, 1024, 2048, or 3072 hidden units), using 11 input frames.

Fig. 6. Phone error rate on the core test set as a function of the number of hidden layers using randomly initialized and pretrained networks (11 input frames with 2048 units per layer and 17 input frames with 3072 units per layer).
Fig. 5. Phone error rate on the development set as a function of the number of hidden layers using randomly initialized and pretrained networks (11 input frames with 2048 units per layer and 17 input frames with 3072 units per layer).

Fig. 7. Phone error rate on the development set as a function of the number of layers, using 3072 hidden units per layer, for input context windows of 7, 17, 27, and 37 frames.
Fig. 8. Phone error rate on the core test set as a function of the number of layers, using 3072 hidden units per layer, for input context windows of 7, 17, 27, and 37 frames.

Fig. 10. Phone error rates on the development and core test sets as a function of the number of layers, using a context window of filter bank coefficients as input features to the DBN (11 input frames with 1024 or 2048 units per layer, and 15 input frames with 2048 units per layer).
TABLE I
Reported results on TIMIT core test set

Method                                                  PER
Stochastic Segmental Models [31]                        36%
Conditional Random Field [32]                           34.8%
Large-Margin GMM [33]                                   33%
CD-HMM [34]                                             27.3%
Augmented Conditional Random Fields [34]                26.6%
Recurrent Neural Nets [35]                              26.1%
Bayesian Triphone HMM [36]                              25.6%
Monophone HTMs [37]                                     24.8%
Heterogeneous Classifiers [38]                          24.4%
Triphone HMMs discriminatively trained w/ BMMI [39]     22.7%
Monophone Deep Belief Networks (DBNs) (this work)       20.7%
Fig. 9. The effect of the size of the bottleneck (128, 256, 512, 1024, or 2048 units) on the development and core test set phone error rates for a typical network with 11 input frames and 5 hidden layers of 2048 units per layer (except for the last hidden layer).
George E. Dahl received a B.A. in computer science, with highest honors, from Swarthmore College
and an M.Sc. from the University of Toronto, where
he is currently completing a Ph.D. with a research
focus in statistical machine learning. His current
main research interest is in training models that learn
many levels of rich, distributed representations from
large quantities of perceptual and linguistic data.