Learning Structured Models For Phone Recognition

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 897–905, Prague, June 2007. © 2007 Association for Computational Linguistics
demonstration that such a system is not only competitive with the traditional approach, but can indeed outperform even very recent work on some preliminary measures.

[Figure: subphone structure for the phone sequence /d ae d/. Each phone expands into begin, mid, and end substates between dedicated Start and End states; context-dependent units such as d7 = c(#-d-ae), ae3 = c(d-ae-d), and d13 = c(ae-d-#) label the corresponding triphone variants.]

2 Learning
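The approximate merge criterion developed in this section can be sketched as follows. This is a minimal illustration with assumed names and array shapes, not the authors' implementation: it scores one candidate merge of substates s1 and s2 from the forward and backward tables of a single sequence.

```python
import numpy as np

def approx_merge_loss(alpha, beta, substates, s1, s2, p1, p2):
    """log Delta(s1 s2 -> s) for one sequence: the (log) likelihood
    ratio between the model with s1, s2 merged and the split model.

    alpha, beta : (T, S) forward/backward tables for the sequence.
    substates   : for each time t, the substate indices of the phone
                  at t (with s1 and s2 among them).
    p1, p2      : relative posterior frequencies of s1 and s2.
    """
    log_ratio = 0.0
    for t in range(alpha.shape[0]):
        # P(x, y): sum of alpha * beta over the phone's substates at t
        p_split = sum(alpha[t, s] * beta[t, s] for s in substates[t])
        # merged forward score: sum of the two split scores
        a_hat = alpha[t, s1] + alpha[t, s2]
        # merged backward score: weighted sum of the split scores
        b_hat = p1 * beta[t, s1] + p2 * beta[t, s2]
        # remaining substates of the phone are left untouched
        other = sum(alpha[t, s] * beta[t, s]
                    for s in substates[t] if s not in (s1, s2))
        log_ratio += np.log(a_hat * b_hat + other) - np.log(p_split)
    return log_ratio  # near 0 => merging s1 and s2 costs little likelihood
```

A full split-merge round would then split every substate in two, retrain with EM, score each candidate merge with this quantity summed over all training sequences, and reverse the splits whose loss is closest to zero.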
states only where needed, rather than splitting them all.

We realize this goal by merging back those splits s → s1 s2 for which, if the split were reversed, the loss in data likelihood would be smallest. We approximate the loss in data likelihood for a merge s1 s2 → s with the following likelihood ratio (Petrov et al., 2006):

    Δ(s1 s2 → s) = Π_sequences Π_t  P_t(x, y) / P(x, y).

Here P(x, y) is the joint likelihood of an emission sequence x and associated state sequence y. This quantity can be recovered from the forward and backward probabilities using

    P(x, y) = Σ_{s : π(s) = y_t}  α_t(s) · β_t(s).

P_t(x, y) is an approximation to the same joint likelihood where states s1 and s2 are merged. We approximate the true loss by only considering merging states s1 and s2 at time t, a value which can be efficiently computed from the forward and backward probabilities. The forward score for the merged state s at time t is just the sum of the two split scores:

    α̂_t(s) = α_t(s1) + α_t(s2),

while the backward score is a weighted sum of the split scores:

    β̂_t(s) = p1 β_t(s1) + p2 β_t(s2),

where p1 and p2 are the relative (posterior) frequencies of the states s1 and s2.

Thus, the likelihood after merging s1 and s2 at time t can be computed from these merged forward and backward scores as:

    P_t(x, y) = α̂_t(s) · β̂_t(s) + Σ_{s'} α_t(s') · β_t(s'),

where the second sum is over the other substates of the phone at time t, i.e. {s' : π(s') = y_t, s' ∉ {s1, s2}}. This expression is an approximation because it neglects interactions between instances of the same states at multiple places in the same sequence. In particular, since phones frequently occur with multiple consecutive repetitions, this criterion may vastly overestimate the actual likelihood loss. As such, we also implemented the exact criterion, that is, for each split, we formed a new HMM with s1 and s2 merged and calculated the total data likelihood. This method is much more computationally expensive, requiring a full forward-backward pass through the data for each potential merge, and was not found to produce noticeably better performance. Therefore, all experiments use the approximate criterion.

2.4 The Automatically-Aligned Case

It is straightforward to generalize the hand-aligned case to the case where the phone transcription is known, but no frame-level labeling is available. The main difference is that the phone boundaries are not known in advance, which means that there is now additional uncertainty over the phone states. The forward and backward recursions must thus be expanded to consider all state sequences that yield the given phone transcription. We can accomplish this with standard Baum-Welch training.

3 Inference

An HMM over refined subphone states s ∈ S naturally gives posterior distributions P(s|x) over sequences of states s. We would ideally like to extract the transcription r of underlying phones which is most probable according to this posterior.¹ The transcription is two stages removed from s. First, it collapses the distinctions between states s which correspond to the same phone y = π(s). Second, it collapses the distinctions between where phone transitions exactly occur. Viterbi state sequences can easily be extracted using the basic Viterbi algorithm. On the other hand, finding the best phone sequence or transcription is intractable.

As a compromise, we extract the phone sequence (not transcription) which has highest probability in a variational approximation to the true distribution (Jordan et al., 1999). Let the true posterior distribution over phone sequences be P(y|x). We form an approximation Q(y) ≈ P(y|x), where Q is an approximation specific to the sequence x and factorizes as:

    Q(y) = Π_t q(t, y_t, y_{t+1}).

The solution can be found analytically using Lagrange multipliers:

    q(t, y, y') = P(Y_t = y, Y_{t+1} = y'|x) / P(Y_t = y|x),

where we have made the position-specific random variables Y_t explicit for clarity. This approximation depends only on our ability to calculate posteriors over phones or phone-phone pairs at individual positions t, which is easy to obtain from the state posteriors, for example:

    P(Y_t = y, Y_{t+1} = y'|x) = [ Σ_{s : π(s) = y} Σ_{s' : π(s') = y'} α_t(s) a_{ss'} b_{s'}(x_{t+1}) β_{t+1}(s') ] / P(x).

Finding the Viterbi phone sequence in the approximate distribution Q can be done with the Forward-Backward algorithm over the lattice of q values.

¹ Remember that by "transcription" we mean a sequence of phones with duplicates removed.

4 Experiments

We tested our model on the TIMIT database, using the standard setups for phone recognition and phone classification. We partitioned the TIMIT data into training, development, and (core) test sets according to standard practice (Lee and Hon, 1989; Gunawardana et al., 2005; Sha and Saul, 2006). In particular, we excluded all sa sentences and mapped the 61 phonetic labels in TIMIT down to 48 classes before training our HMMs. At evaluation, these 48 classes were further mapped down to 39 classes, again in the standard way.

MFCC coefficients were extracted from the TIMIT source as in Sha and Saul (2006), including delta and delta-delta components. For all experiments, our system and all baselines we implemented used full covariance when parameterizing emission models.² All Gaussians were endowed with weak inverse Wishart priors with zero mean and identity covariance.³

² Most of our findings also hold for diagonal covariance Gaussians, albeit the final error rates are 2-3% higher.
³ Following previous work with PCFGs (Petrov et al., 2006), we experimented with smoothing the substates towards each other to prevent overfitting, but we were unable to achieve any performance gains.

[Figure 3: Phone recognition error for models of increasing size. Curves: split only; split and merge; split and merge, automatic alignment. x-axis: Number of States (0–2000); y-axis: error rate (0.24–0.42).]

4.1 Phone Recognition

In the task of phone recognition, we fit an HMM whose output, with subsequent states collapsed, corresponds to the training transcriptions. In the TIMIT data set, each frame is manually phone-annotated, so the only uncertainty in the basic setup is the identity of the (sub)states at each frame.

We therefore began with a single state for each phone, in a fully connected HMM (except for special treatment of dedicated start and end states). We incrementally trained our model as described in Section 2, with up to 6 split-merge rounds. We found that reversing 25% of the splits yielded good overall performance while maintaining compactness of the model.

We decoded using the variational decoder described in Section 3. The output was then scored against the reference phone transcription using the standard string edit distance.

During both training and decoding, we used "flattened" emission probabilities by exponentiating to some 0 < γ < 1. We found the best setting for γ to be 0.2, as determined by tuning on the development set.
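As a sketch of this decoding setup (names and array shapes are assumptions for illustration, not the authors' code): flattening is a simple scaling of log emission scores, and the variational decode is a Viterbi pass over the lattice of q values.

```python
import numpy as np

def flatten_log_emissions(log_b, gamma=0.2):
    """'Flattened' emissions: raising b_s(x) to gamma (0 < gamma < 1)
    is a scaling in the log domain; gamma = 0.2 was tuned on the
    development set."""
    return gamma * np.asarray(log_b)

def viterbi_over_q(log_q):
    """Viterbi over the lattice of q values from Section 3.

    log_q[t, y, y2] = log q(t, y, y2), the approximate posterior score
    of moving from phone y to phone y2 after position t.  Returns the
    highest-scoring phone sequence (not transcription) under Q.
    """
    log_q = np.asarray(log_q)
    T1, Y, _ = log_q.shape              # T-1 transitions over Y phones
    score = np.zeros(Y)                 # best log-score ending in each phone
    back = np.zeros((T1, Y), dtype=int)
    for t in range(T1):
        cand = score[:, None] + log_q[t]    # candidate scores for (y, y2)
        back[t] = np.argmax(cand, axis=0)
        score = cand.max(axis=0)
    seq = [int(np.argmax(score))]           # best final phone
    for t in reversed(range(T1)):           # follow backpointers
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]
```

Here log_q would be filled from the state-level forward-backward quantities via the phone-pair posterior formula in Section 3; a real decoder would also fold in an initial-phone term.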
This flattening compensates for the non-independence of the frames, partially due to overlapping source samples and partially due to other unmodeled correlations.

Figure 3 shows the recognition error as the model grows in size. In addition to the basic setup described so far (split and merge), we also show a model in which merging was not performed (split only). As can be seen, the merging phase not only decreases the number of HMM states at each round, but also improves phone recognition error at each round.

We also compared our hierarchical split only model with a model where we directly split all states into 2^k substates, so that these models had the same number of states as a hierarchical model after k split and merge cycles. While for small k the difference was negligible, we found that the error increased by 1% absolute for k = 5. This trend is to be expected, as the possible interactions between the substates grow with the number of substates.

Also shown in Figure 3, and perhaps unsurprising, is that the error rate can be further reduced by allowing the phone boundaries to drift from the manual alignments provided in the TIMIT training data. The split and merge, automatic alignment line shows the result of allowing the EM fitting phase to reposition each phone boundary, giving absolute improvements of up to 0.6%.

We investigated how much improvement in accuracy one can gain by computing the variational approximation introduced in Section 3 versus extracting the Viterbi state sequence and projecting that sequence to its phone transcription. The gap varies, but on a model with roughly 1000 states (5 split-merge rounds), the variational decoder decreases error from 26.5% to 25.6%. The gain in accuracy comes at a cost in time: we must run a (possibly pruned) Forward-Backward pass over the full state space S, then another over the smaller phone space Y. In our experiments, the cost of variational decoding was a factor of about 3, which may or may not justify a relative error reduction of around 4%.

The performance of our best model (split and merge, automatic alignment, and variational decoding) on the test set is 26.4%. A comparison of our performance with other methods in the literature is shown in Table 1. Despite our structural simplicity, we outperform state-tied triphone systems like Young and Woodland (1994), a standard baseline for this task, by nearly 2% absolute. However, we fall short of the best current systems.

    Method                                                     Error Rate
    State-Tied Triphone HMM (Young and Woodland, 1994)         27.7%¹
    Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)    27.1%¹
    This Paper                                                 26.4%
    Bayesian Triphone HMM (Ming and Smith, 1998)               25.6%
    Heterogeneous classifiers (Halberstadt and Glass, 1998)    24.4%

    Table 1: Phone recognition error rates on the TIMIT core test, from Glass (2003).
    ¹ These results are on a slightly easier test set.

4.2 Phone Classification

Phone classification is the fairly constrained task of classifying in isolation a sequence of frames which is known to span exactly one phone. In order to quantify how much of our gains over the triphone baseline stem from modeling context-dependencies and how much from modeling the inner structure of the phones, we fit separate HMM models for each phone, using the same split and merge procedure as above (though in this case only manual alignments are reasonable because we test on manual segmentations). For each test frame sequence, we compute the likelihood of the sequence from the forward probabilities of each individual phone HMM. The phone giving highest likelihood to the input was selected. The error rate is simply the fraction of test phones classified incorrectly.

Table 2 shows a comparison of our performance with that of some other methods in the literature.

    Method                                      Error Rate
    GMM Baseline (Sha and Saul, 2006)           26.0%
    HMM Baseline (Gunawardana et al., 2005)     25.1%
    SVM (Clarkson and Moreno, 1999)             22.4%
    Hidden CRF (Gunawardana et al., 2005)       21.7%
    This Paper                                  21.4%
    Large Margin GMM (Sha and Saul, 2006)       21.1%

    Table 2: Phone classification error rates on the TIMIT core test.

A minimal comparison is to a GMM with the same number of mixtures per phone as our model's maxi-
[Figure 4: Phone confusion matrix (reference phones vs. hypothesized phones). 76% of the substitutions fall within the shown classes: vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops.]

[Figure 5: Phone contexts and subphone structure. The /l/ phone after 3 split-merge iterations is shown, with sets of previous and next phones on the context transitions.]
natural classes interact with the chain in a way which allows duration to depend on context. In further refinements, more structure is added, including a two-track path in (d) where one track captures the distinct effects on higher formants of r-coloring and nasalization. Figure 5 shows the corresponding diagram for /l/, where some merging has also occurred. Different natural classes emerge in this case, with, for example, preceding states partitioned into front/high vowels vs. rounded vowels vs. other vowels vs. consonants. Following states show a front/back distinction and a consonant distinction, and the phone /m/ is treated specially, largely because the /lm/ sequence tends to shorten the /l/ substantially. Note again how context, internal structure, and duration are simultaneously modeled. Of course, it should be emphasized that post hoc analysis of such structure is a simplification and prone to seeing what one expects; we present these examples to illustrate the broad kinds of patterns which are detected.

As a final illustration of the nature of the learned models, Table 3 shows the number of substates allocated to each phone by the split/merge process (the maximum is 32 for this stage) for the case of hand-aligned (left) as well as automatically-aligned (right) phone boundaries. Interestingly, in the hand-aligned case, the vowels absorb most of the complexity, since many consonantal cues are heavily evidenced on adjacent vowels. However, in the automatically-aligned case, many vowel frames with substantial consonant coloring are re-allocated to those adjacent consonants, giving more complex consonants but comparatively less complex vowels.

    Vowels          Consonants      (cont.)         Other
    aa  31  32      b    2  32      p    5  24      epi  2   4
    ae  32  17      ch  13  30      r   32  32      sil 32  32
    ah  31   8      d    2  14      s   32  32      vcl 29  30
    ao  32  23      dh   6  31      sh  30  32      cl  31  32
    aw  18   6      dx   2   3      t   24  32
    ax  18   3      f   32  32      th   8  11
    ay  32  28      g    2  15      v   23  11
    eh  32  16      hh   3   5      w   10  21
    el   6   4      jh   3  16      y    3   7
    en   4   3      k   30  32      z   31  32
    er  32  31      l   25  32      zh   2   2
    ey  32  30      m   25  25
    ih  32  11      n   29  32
    ix  31  16      ng   3   4
    iy  31  32
    ow  26  10
    oy   4   4
    uh   5   2
    uw  21   8

    Table 3: Number of substates allocated per phone. The left column gives the number of substates allocated when training on manually aligned training sequences, while the right column gives the number allocated when we automatically determine phone boundaries.

6 Conclusions

We have presented a minimalist, automatic approach for building an accurate acoustic model for phonetic classification and recognition. Our model does not require any a priori phonetic bias or manual specification of structure, but rather induces the structure in an automatic and streamlined fashion. Starting from a minimal monophone HMM, we automatically learn models that achieve highly competitive performance. On the TIMIT phone recognition task our model clearly outperforms standard state-tied triphone models like Young and Woodland (1994). For phone classification, our model achieves performance competitive with the state-of-the-art discriminative methods (Sha and Saul, 2006), despite being generative in nature. This result, together with our analysis of the context-dependencies and substructures that are being learned, suggests that our model is particularly well suited for modeling phone-internal structure. It does, of course, remain to be seen if and how these benefits can be scaled to larger systems.

References

P. Clarkson and P. Moreno. 1999. On the use of Support Vector Machines for phonetic classification. In ICASSP '99.

S. B. Davis and P. Mermelstein. 1980. Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4).

J. Glass. 2003. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17(2).

A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. 2005. Hidden Conditional Random Fields for phone recognition. In Eurospeech '05.

A. K. Halberstadt and J. R. Glass. 1998. Heterogeneous measurements and multiple classifiers for speech recognition. In ICSLP '98.

F. Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Learning in Graphical Models.

L. Lamel and J. Gauvain. 1993. Cross-lingual experiments with phone recognition. In ICASSP '93.

K. F. Lee and H. W. Hon. 1989. Speaker-independent phone recognition using Hidden Markov Models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11).

T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL '05.

J. Ming and F. J. Smith. 1998. Improved phone recognition using Bayesian triphone models. In ICASSP '98.

J. J. Odell. 1995. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, University of Cambridge.

S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL '06.

L. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.

A. Sankar. 1998. Experiments with a Gaussian merging-splitting algorithm for HMM training for speech recognition. In DARPA Speech Recognition Workshop '98.

F. Sha and L. K. Saul. 2006. Large margin Gaussian mixture modeling for phonetic classification and recognition. In ICASSP '06.

N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. 2000. Split and Merge EM algorithm for mixture models. Neural Computation, 12(9).

S. J. Young and P. C. Woodland. 1994. State clustering in HMM-based continuous speech recognition. Computer Speech and Language, 8(4).