
Learning Structured Models for Phone Recognition

Slav Petrov Adam Pauls Dan Klein


Computer Science Department, EECS Division
University of California at Berkeley
Berkeley, CA, 94720, USA
{petrov,adpauls,klein}@cs.berkeley.edu

Abstract

We present a maximally streamlined approach to learning HMM-based acoustic models for automatic speech recognition. In our approach, an initial monophone HMM is iteratively refined using a split-merge EM procedure which makes no assumptions about subphone structure or context-dependent structure, and which uses only a single Gaussian per HMM state. Despite the much simplified training process, our acoustic model achieves state-of-the-art results on phone classification (where it outperforms almost all other methods) and competitive performance on phone recognition (where it outperforms standard CD triphone / subphone / GMM approaches). We also present an analysis of what is and is not learned by our system.

1 Introduction

Continuous density hidden Markov models (HMMs) underlie most automatic speech recognition (ASR) systems in some form. While the basic algorithms for HMM learning and inference are quite general, acoustic models of speech standardly employ rich speech-specific structures to improve performance. For example, it is well known that a monophone HMM with one state per phone is too coarse an approximation to the true articulatory and acoustic process. The HMM state space is therefore refined in several ways. To model phone-internal dynamics, phones are split into beginning, middle, and end subphones (Jelinek, 1976). To model cross-phone coarticulation, the states of the HMM are refined by splitting the phones into context-dependent triphones. These states are then re-clustered (Odell, 1995) and the parameters of their observation distributions are tied back together (Young and Woodland, 1994). Finally, to model complex emission densities, states emit mixtures of multivariate Gaussians. This standard structure is shown schematically in Figure 1. While this rich structure is phonetically well-motivated and empirically successful, so much structural bias may be unnecessary, or even harmful. For example, in the domain of syntactic parsing with probabilistic context-free grammars (PCFGs), a surprising recent result is that automatically induced grammar refinements can outperform sophisticated methods which exploit substantial manually articulated structure (Petrov et al., 2006).

In this paper, we consider a much more automatic, data-driven approach to learning HMM structure for acoustic modeling, analogous to the approach taken by Petrov et al. (2006) for learning PCFGs. We start with a minimal monophone HMM in which there is a single state for each (context-independent) phone. Moreover, the emission model for each state is a single multivariate Gaussian (over the standard MFCC acoustic features). We then iteratively refine this minimal HMM through state splitting, adding complexity as needed. States in the refined HMMs are always substates of the original HMM and are therefore each identified with a unique base phone. States are split, estimated, and (perhaps) merged, based on a likelihood criterion. Our model never allows explicit Gaussian mixtures, though substates may develop similar distributions and thereby emulate such mixtures.

In principle, discarding the traditional structure can either help or hurt the model. Incorrect prior splits can needlessly fragment training data, and incorrect prior tying can limit the model's expressivity. On the other hand, correct assumptions can increase the efficiency of the learner.

Empirically, we show that our automatic approach outperforms classic systems on the task of phone recognition on the TIMIT data set. In particular, it outperforms standard state-tied triphone models like Young and Woodland (1994), achieving a phone error rate of 26.4% versus 27.7%. In addition, our approach gives state-of-the-art performance on the task of phone classification on the TIMIT data set, suggesting that our learned structure is particularly effective at modeling phone-internal structure. Indeed, our error rate of 21.4% is outperformed only by the recent structured margin approach of Sha and Saul (2006). It remains to be seen whether these positive results on acoustic modeling will facilitate better word recognition rates in a large-vocabulary speech recognition system.

We also consider the structures learned by the model. Subphone structure is learned, similar to, but richer than, standard begin-middle-end structures. Cross-phone coarticulation is also learned, with classic phonological classes often emerging naturally.

Many aspects of this work are intended to simplify rather than further articulate the acoustic process. It should therefore be clear that the basic techniques of splitting, merging, and learning using EM are not in themselves new for ASR. Nor is the basic latent induction method new (Matsuzaki et al., 2005; Petrov et al., 2006). What is novel in this paper is (1) the construction of an automatic system for acoustic modeling, with substantially streamlined structure, (2) the investigation of variational inference for such a task, (3) the analysis of the kinds of structures learned by such a system, and (4) the empirical demonstration that such a system is not only competitive with the traditional approach, but can indeed outperform even very recent work on some preliminary measures.

[Figure 1: Comparison of the standard model to our model (here shown with k = 4 subphones per phone) for the word dad. The dependence of subphones across phones in our model is not shown, while the context clustering in the standard model is shown only schematically.]

2 Learning

In the following, we propose a greatly simplified model that does not impose any manually specified structural constraints. Instead of specifying structure a priori, we use the Expectation-Maximization (EM) algorithm for HMMs (Baum-Welch) to automatically induce the structure in a way that maximizes data likelihood.

In general, our training data consists of sets of acoustic observation sequences and phone-level transcriptions r which specify a sequence of phones from a set of phones Y, but do not label each time frame with a phone. We refer to an observation sequence as x = x_1, ..., x_T, where x_i ∈ R^39 are standard MFCC features (Davis and Mermelstein, 1980). We wish to induce an HMM over a set of states S for which we also have a function π : S → Y that maps every state in S to a phone in Y. Note that in the usual formulation of the EM algorithm for HMMs, one is interested in learning HMM parameters θ that maximize the likelihood of the observations, P(x|θ); in contrast, we aim to maximize the joint probability of our observations and phone transcriptions, P(x, r|θ), or of our observations and phone sequences, P(x, y|θ) (see below). We now describe this relatively straightforward modification of the EM algorithm.

2.1 The Hand-Aligned Case

For clarity of exposition, we first consider a simplified scenario in which we are given hand-aligned phone labels y = y_1, ..., y_T for each time t, as is the case for the TIMIT dataset. Our procedure does not require such extensive annotation of the training data and in fact gives better performance when the exact transition points between phones are not pre-specified but learned.

We define forward and backward probabilities (Rabiner, 1989) in the following way: the forward probability is the probability of observing the sequence x_1, ..., x_t with transcription y_1, ..., y_t and ending in state s at time t:

    α_t(s) = P(x_1, ..., x_t, y_1, ..., y_t, s_t = s | λ),

and the backward probability is the probability of observing the sequence x_{t+1}, ..., x_T with transcription y_{t+1}, ..., y_T, given that we start in state s at time t:

    β_t(s) = P(x_{t+1}, ..., x_T, y_{t+1}, ..., y_T | s_t = s, λ),

where λ are the model parameters. As usual, we parameterize our HMMs with a_{ss'}, the probability of transitioning from state s to s', and b_s(x) ∼ N(μ_s, Σ_s), the probability of emitting observation x when in state s.

These probabilities can be computed using the standard forward and backward recursions (Rabiner, 1989), except that at each time t, we only consider states s_t for which π(s_t) = y_t, because we have hand-aligned labels for the observations. These quantities also allow us to compute the posterior counts necessary for the E-step of the EM algorithm.
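To make the constrained recursion concrete, here is a minimal sketch of the forward pass in log space. It is not the authors' implementation: the dense array layout, the variable names, and the use of scipy's logsumexp are all our own assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def constrained_forward(log_b, log_a, log_start, phone_of_state, y):
    """Forward pass restricted to states consistent with the hand alignment.

    log_b:          (T, S) log emission densities log b_s(x_t)
    log_a:          (S, S) log transition probabilities log a_{ss'}
    log_start:      (S,)   initial log probabilities
    phone_of_state: (S,)   the map pi(s) from substates to base phones
    y:              (T,)   hand-aligned phone label for every frame
    Returns log alpha_t(s) as a (T, S) array.
    """
    T, S = log_b.shape
    log_alpha = np.full((T, S), -np.inf)
    ok = phone_of_state == y[0]               # only states with pi(s) = y_t
    log_alpha[0, ok] = log_start[ok] + log_b[0, ok]
    for t in range(1, T):
        ok = phone_of_state == y[t]
        incoming = logsumexp(log_alpha[t - 1][:, None] + log_a, axis=0)
        log_alpha[t, ok] = incoming[ok] + log_b[t, ok]
    return log_alpha
```

The backward recursion is symmetric, and together the two passes give the posterior counts needed for the E-step.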

[Figure 2: Iterative refinement of the /ih/ phone with 1, 2, 4, 8 substates (panels (a)-(d)). Each panel plots the substate transition structure against the most frequent previous and next phone contexts.]

2.2 Splitting

One way of inducing arbitrary structural annotations would be to split each HMM state into m substates and re-estimate the parameters for the split HMM using EM. This approach has two major drawbacks: for larger m it is likely to converge to poor local optima, and it allocates substates uniformly across all states, regardless of how much annotation is required for good performance.

To avoid these problems, we apply a hierarchical parameter estimation strategy similar in spirit to the work of Sankar (1998) and Ueda et al. (2000), but here applied to HMMs rather than to GMMs. Beginning with the baseline model, where each state corresponds to one phone, we repeatedly split and re-train the HMM. This strategy ensures that each split HMM is initialized "close" to some reasonable maximum.

Concretely, each state s in the HMM is split into two new states s_1, s_2 with π(s_1) = π(s_2) = π(s). We initialize EM with the parameters of the previous HMM, splitting every previous state s in two and adding a small amount of randomness ε ≤ 1% to its transition and emission probabilities to break symmetry:

    a_{s_1 s'} ∝ a_{s s'} + ε,
    b_{s_1}(o) ∼ N(μ_s + ε, Σ_s),

and similarly for s_2. The incoming transitions are split evenly.

We then apply the EM algorithm described above to re-estimate these parameters before performing subsequent split operations.
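A hedged sketch of this split initialization is given below, assuming a dense parameterization (an S × S transition matrix plus one mean vector and full covariance per state); the function name and the exact form of the symmetry-breaking noise are ours, not the paper's.

```python
import numpy as np

def split_all_states(trans, means, covs, phone_of_state, eps=0.01, seed=0):
    """Split every state s into s1, s2 with pi(s1) = pi(s2) = pi(s)."""
    rng = np.random.default_rng(seed)
    # duplicate rows (new source states) and columns (new target states);
    # halving splits each incoming transition evenly over the two copies
    trans2 = np.repeat(np.repeat(trans, 2, axis=0), 2, axis=1) / 2.0
    # small randomness (<= 1%) breaks symmetry; renormalize rows afterwards
    trans2 *= 1.0 + eps * rng.random(trans2.shape)
    trans2 /= trans2.sum(axis=1, keepdims=True)
    # emissions: b_{s1}(o) ~ N(mu_s + eps, Sigma_s), and similarly for s2
    means2 = np.repeat(means, 2, axis=0)
    means2 = means2 + eps * rng.standard_normal(means2.shape)
    covs2 = np.repeat(covs, 2, axis=0)          # covariances are copied as-is
    phone2 = np.repeat(phone_of_state, 2)       # substates keep their phone
    return trans2, means2, covs2, phone2
```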

2.3 Merging

Since adding substates divides HMM statistics into many bins, the HMM parameters are effectively estimated from less data, which can lead to overfitting. Therefore, it would be to our advantage to split substates only where needed, rather than splitting them all.

We realize this goal by merging back those splits s → s_1 s_2 for which, if the split were reversed, the loss in data likelihood would be smallest. We approximate the loss in data likelihood for a merge s_1 s_2 → s with the following likelihood ratio (Petrov et al., 2006):

    Δ(s_1 s_2 → s) = ∏_{sequences} ∏_t  P_t(x, y) / P(x, y).

Here P(x, y) is the joint likelihood of an emission sequence x and associated state sequence y. This quantity can be recovered from the forward and backward probabilities using

    P(x, y) = Σ_{s : π(s) = y_t} α_t(s) · β_t(s).

P_t(x, y) is an approximation to the same joint likelihood where states s_1 and s_2 are merged. We approximate the true loss by only considering merging states s_1 and s_2 at time t, a value which can be efficiently computed from the forward and backward probabilities. The forward score for the merged state s at time t is just the sum of the two split scores:

    α̂_t(s) = α_t(s_1) + α_t(s_2),

while the backward score is a weighted sum of the split scores:

    β̂_t(s) = p_1 β_t(s_1) + p_2 β_t(s_2),

where p_1 and p_2 are the relative (posterior) frequencies of the states s_1 and s_2.

Thus, the likelihood after merging s_1 and s_2 at time t can be computed from these merged forward and backward scores as

    P_t(x, y) = α̂_t(s) · β̂_t(s) + Σ_{s'} α_t(s') · β_t(s'),

where the second sum is over the other substates of y_t, i.e. {s' : π(s') = y_t, s' ∉ {s_1, s_2}}. This expression is an approximation because it neglects interactions between instances of the same states at multiple places in the same sequence. In particular, since phones frequently occur with multiple consecutive repetitions, this criterion may vastly overestimate the actual likelihood loss. As such, we also implemented the exact criterion; that is, for each split, we formed a new HMM with s_1 and s_2 merged and calculated the total data likelihood. This method is much more computationally expensive, requiring a full forward-backward pass through the data for each potential merge, and was not found to produce noticeably better performance. Therefore, all experiments use the approximate criterion.
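For illustration, the per-position merged likelihood and the resulting ratio might be computed as follows. This is a sketch under the assumption that α and β are available in probability (rather than log) space for a single sequence; all names are hypothetical.

```python
import numpy as np

def merge_ratio_at_t(alpha_t, beta_t, s1, s2, p1, p2, same_phone_states):
    """Approximate ratio P_t(x, y) / P(x, y) for merging substates s1, s2.

    alpha_t, beta_t:   (S,) forward/backward scores at time t
    p1, p2:            relative posterior frequencies of s1 and s2
    same_phone_states: all states s' with pi(s') = y_t
    """
    alpha_hat = alpha_t[s1] + alpha_t[s2]            # merged forward score
    beta_hat = p1 * beta_t[s1] + p2 * beta_t[s2]     # merged backward score
    others = [s for s in same_phone_states if s not in (s1, s2)]
    p_merged = alpha_hat * beta_hat + sum(alpha_t[s] * beta_t[s] for s in others)
    p_full = sum(alpha_t[s] * beta_t[s] for s in same_phone_states)
    return p_merged / p_full
```

Accumulating the product of these ratios over all positions and sequences gives Δ(s_1 s_2 → s); the splits with the largest ratios (closest to one) are the cheapest to reverse.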
2.4 The Automatically-Aligned Case

It is straightforward to generalize the hand-aligned case to the case where the phone transcription is known but no frame-level labeling is available. The main difference is that the phone boundaries are not known in advance, which means that there is now additional uncertainty over the phone states. The forward and backward recursions must thus be expanded to consider all state sequences that yield the given phone transcription. We can accomplish this with standard Baum-Welch training.

3 Inference

An HMM over refined subphone states s ∈ S naturally gives posterior distributions P(s|x) over sequences of states s. We would ideally like to extract the transcription r of underlying phones which is most probable according to this posterior [1]. The transcription is two stages removed from s. First, it collapses the distinctions between states s which correspond to the same phone y = π(s). Second, it collapses the distinctions between where phone transitions exactly occur. Viterbi state sequences can easily be extracted using the basic Viterbi algorithm. On the other hand, finding the best phone sequence or transcription is intractable.

[1] Remember that by "transcription" we mean a sequence of phones with duplicates removed.

As a compromise, we extract the phone sequence (not transcription) which has highest probability in a variational approximation to the true distribution (Jordan et al., 1999). Let the true posterior distribution over phone sequences be P(y|x). We form an approximation Q(y) ≈ P(y|x), where Q is an approximation specific to the sequence x and factorizes as:
    Q(y) = ∏_t q(t, y_t, y_{t+1}).

We would like to fit the values q, one for each time step and phone-phone pair, so as to make Q as close to P as possible:

    min_q KL(P(y|x) || Q(y)).

The solution can be found analytically using Lagrange multipliers:

    q(t, y, y') = P(Y_t = y, Y_{t+1} = y' | x) / P(Y_t = y | x),

where we have made the position-specific random variables Y_t explicit for clarity. This approximation depends only on our ability to calculate posteriors over phones or phone-phone pairs at individual positions t, which is easy to obtain from the state posteriors, for example:

    P(Y_t = y, Y_{t+1} = y' | x) = [ Σ_{s : π(s) = y} Σ_{s' : π(s') = y'} α_t(s) a_{ss'} b_{s'}(x_{t+1}) β_{t+1}(s') ] / P(x).

Finding the Viterbi phone sequence in the approximate distribution Q can then be done with the Viterbi algorithm over the lattice of q values.
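A compact sketch of this decoder follows, assuming the phone-pair posteriors P(Y_t = y, Y_{t+1} = y' | x) have already been assembled from the state-level α and β quantities above; the array layout and names are our own choices.

```python
import numpy as np

def variational_viterbi(edge_post, floor=1e-300):
    """Best phone sequence under Q(y) = prod_t q(t, y_t, y_{t+1}).

    edge_post: (T-1, Y, Y) posteriors P(Y_t = y, Y_{t+1} = y' | x).
    """
    Tm1, Y, _ = edge_post.shape
    node_post = edge_post.sum(axis=2)                     # P(Y_t = y | x)
    log_q = np.log(edge_post + floor) - np.log(node_post + floor)[:, :, None]
    delta = np.zeros(Y)                                   # Viterbi scores
    backptr = np.zeros((Tm1, Y), dtype=int)
    for t in range(Tm1):
        scores = delta[:, None] + log_q[t]                # (prev y, next y')
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    best = [int(delta.argmax())]
    for t in range(Tm1 - 1, -1, -1):                      # follow backpointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

Note that this pass runs over the |Y| × |Y| lattice of q values, which is much smaller than the refined substate space S.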
4 Experiments

We tested our model on the TIMIT database, using the standard setups for phone recognition and phone classification. We partitioned the TIMIT data into training, development, and (core) test sets according to standard practice (Lee and Hon, 1989; Gunawardana et al., 2005; Sha and Saul, 2006). In particular, we excluded all sa sentences and mapped the 61 phonetic labels in TIMIT down to 48 classes before training our HMMs. At evaluation, these 48 classes were further mapped down to 39 classes, again in the standard way.

MFCC coefficients were extracted from the TIMIT source as in Sha and Saul (2006), including delta and delta-delta components. For all experiments, our system and all baselines we implemented used full covariance when parameterizing emission models [2]. All Gaussians were endowed with weak inverse-Wishart priors with zero mean and identity covariance [3].

[2] Most of our findings also hold for diagonal covariance Gaussians, albeit with final error rates 2-3% higher.
[3] Following previous work with PCFGs (Petrov et al., 2006), we experimented with smoothing the substates towards each other to prevent overfitting, but we were unable to achieve any performance gains.

[Figure 3: Phone recognition error for models of increasing size. Curves: split only; split and merge; split and merge with automatic alignment. The x-axis is the number of states (0-2000); the y-axis is phone recognition error (0.24-0.42).]

4.1 Phone Recognition

In the task of phone recognition, we fit an HMM whose output, with subsequent states collapsed, corresponds to the training transcriptions. In the TIMIT data set, each frame is manually phone-annotated, so the only uncertainty in the basic setup is the identity of the (sub)states at each frame.

We therefore began with a single state for each phone, in a fully connected HMM (except for special treatment of dedicated start and end states). We incrementally trained our model as described in Section 2, with up to 6 split-merge rounds. We found that reversing 25% of the splits yielded good overall performance while maintaining compactness of the model.

We decoded using the variational decoder described in Section 3. The output was then scored against the reference phone transcription using the standard string edit distance.

During both training and decoding, we used "flattened" emission probabilities, exponentiating them to a power γ with 0 < γ < 1. We found the best setting for γ to be 0.2, as determined by tuning on the development set. This flattening compensates for the non-independence of the frames, which is partially due to overlapping source samples and partially due to other unmodeled correlations.
Figure 3 shows the recognition error as the model grows in size. In addition to the basic setup described so far (split and merge), we also show a model in which merging was not performed (split only). As can be seen, the merging phase not only decreases the number of HMM states at each round, but also improves the phone recognition error at each round.

We also compared our hierarchical split-only model with a model in which we directly split all states into 2^k substates, so that these models had the same number of states as a hierarchical model after k split-merge cycles. While for small k the difference was negligible, we found that the error increased by 1% absolute for k = 5. This trend is to be expected, as the possible interactions between the substates grow with the number of substates.

Also shown in Figure 3, and perhaps unsurprising, is that the error rate can be further reduced by allowing the phone boundaries to drift from the manual alignments provided in the TIMIT training data. The "split and merge, automatic alignment" line shows the result of allowing the EM fitting phase to reposition each phone boundary, giving absolute improvements of up to 0.6%.

We investigated how much improvement in accuracy one can gain by computing the variational approximation introduced in Section 3 versus extracting the Viterbi state sequence and projecting that sequence to its phone transcription. The gap varies, but on a model with roughly 1000 states (5 split-merge rounds), the variational decoder decreases the error from 26.5% to 25.6%. The gain in accuracy comes at a cost in time: we must run a (possibly pruned) Forward-Backward pass over the full state space S, then another over the smaller phone space Y. In our experiments, the cost of variational decoding was a factor of about 3, which may or may not justify a relative error reduction of around 4%.

The performance of our best model (split and merge, automatic alignment, and variational decoding) on the test set is 26.4%. A comparison of our performance with other methods in the literature is shown in Table 1. Despite our structural simplicity, we outperform state-tied triphone systems like Young and Woodland (1994), a standard baseline for this task, by 1.3% absolute. However, we fall short of the best current systems.

Table 1: Phone recognition error rates on the TIMIT core test, from Glass (2003).

    Method                                                     Error Rate
    State-Tied Triphone HMM (Young and Woodland, 1994)         27.7% [a]
    Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)    27.1% [a]
    This Paper                                                 26.4%
    Bayesian Triphone HMM (Ming and Smith, 1998)                25.6%
    Heterogeneous classifiers (Halberstadt and Glass, 1998)     24.4%

    [a] These results are on a slightly easier test set.

4.2 Phone Classification

Phone classification is the fairly constrained task of classifying, in isolation, a sequence of frames which is known to span exactly one phone. In order to quantify how much of our gains over the triphone baseline stem from modeling context-dependencies and how much from modeling the inner structure of the phones, we fit separate HMM models for each phone, using the same split-merge procedure as above (though in this case only manual alignments are reasonable, because we test on manual segmentations). For each test frame sequence, we compute the likelihood of the sequence from the forward probabilities of each individual phone HMM. The phone giving highest likelihood to the input is selected. The error rate is simply the fraction of test phones classified incorrectly.
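This classification rule can be sketched as below: one forward pass per phone HMM, followed by an argmax over phones. Scoring in log space with scipy's logsumexp is our assumption, not a detail from the paper.

```python
import numpy as np
from scipy.special import logsumexp

def segment_loglik(log_b, log_a, log_start):
    """Log-likelihood of one segment under one phone's sub-HMM.

    log_b:     (T, S) emission scores of the frames under this phone's substates
    log_a:     (S, S) transitions; log_start: (S,) initial scores
    """
    log_alpha = log_start + log_b[0]
    for t in range(1, len(log_b)):
        log_alpha = logsumexp(log_alpha[:, None] + log_a, axis=0) + log_b[t]
    return logsumexp(log_alpha)              # sum over final states

def classify_segment(scores_by_phone):
    """Pick the phone whose HMM assigns the segment the highest likelihood."""
    return max(scores_by_phone, key=scores_by_phone.get)
```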
Table 2 shows a comparison of our performance with that of some other methods in the literature.

Table 2: Phone classification error rates on the TIMIT core test.

    Method                                          Error Rate
    GMM Baseline (Sha and Saul, 2006)               26.0%
    HMM Baseline (Gunawardana et al., 2005)         25.1%
    SVM (Clarkson and Moreno, 1999)                 22.4%
    Hidden CRF (Gunawardana et al., 2005)           21.7%
    This Paper                                      21.4%
    Large Margin GMM (Sha and Saul, 2006)           21.1%

A minimal comparison is to a GMM with the same number of mixtures per phone as our model's maximum number of substates per phone. While these models have the same number of total Gaussians, in our model the Gaussians are correlated temporally, while in the GMM they are independent. Enforcing begin-middle-end HMM structure (see HMM Baseline) increases accuracy somewhat, but our more general model clearly makes better use of the available parameters than those baselines.

Indeed, our best model achieves a surprising performance of 21.4%, greatly outperforming other generative methods and achieving performance competitive with state-of-the-art discriminative methods. Only the recent structured margin approach of Sha and Saul (2006) gives a better performance than our model. The strength of our system on the classification task suggests that perhaps it is modeling phone-internal structure more effectively than cross-phone context.

5 Analysis

While the overall phone recognition and classification numbers suggest that our system is broadly comparable to, and perhaps in certain ways superior to, classical approaches, it is illuminating to investigate what is and is not learned by the model.

[Figure 4: Phone confusion matrix. 76% of the substitutions fall within the shown classes (vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops).]

Figure 4 gives a confusion matrix over the substitution errors made by our model. The majority of the confusions are within natural classes. Some particularly frequent and reasonable confusions arise between the consonantal /r/ and the vocalic /er/ (the same confusion arises between /l/ and /el/, but the standard evaluation already collapses this distinction), the reduced vowels /ax/ and /ix/, the voiced and voiceless alveolar sibilants /z/ and /s/, and the voiced and voiceless stop pairs. Other vocalic confusions are generally between vowels and their corresponding reduced forms. Overall, 76% of the substitutions are within the broad classes shown in the figure.

We can also examine the substructure learned for the various phones. Figure 2 shows the evolution of the phone /ih/ from a single state to 8 substates during split/merge (no merges were chosen for this phone), using hand-alignment of phones to frames. These figures were simplified from the complete state transition matrices as follows: (1) adjacent phones' substates are collapsed, (2) adjacent phones are selected based on frequency and inbound probability (and forced to be the same across figures), (3) infrequent arcs are suppressed. In the first split, (b), a sonorant / non-sonorant distinction is learned over adjacent phones, along with a state chain which captures basic duration (a self-looping state gives an exponential model of duration; the sum of two such states is more expressive).

Note that the natural classes interact with the chain in a way which allows duration to depend on context. In further refinements, more structure is added, including a two-track path in (d) where one track captures the distinct effects on higher formants of r-coloring and nasalization.

[Figure 5: Phone contexts and subphone structure. The /l/ phone after 3 split-merge iterations is shown.]

Figure 5 shows the corresponding diagram for /l/, where some merging has also occurred. Different natural classes emerge in this case, with, for example, preceding states partitioned into front/high vowels vs. rounded vowels vs. other vowels vs. consonants. Following states show a front/back distinction and a consonant distinction, and the phone /m/ is treated specially, largely because the /lm/ sequence tends to shorten the /l/ substantially. Note again how context, internal structure, and duration are simultaneously modeled. Of course, it should be emphasized that post hoc analysis of such structure is a simplification and prone to seeing what one expects; we present these examples to illustrate the broad kinds of patterns which are detected.

As a final illustration of the nature of the learned models, Table 3 shows the number of substates allocated to each phone by the split/merge process (the maximum for this stage is 32) for the case of hand-aligned as well as automatically-aligned phone boundaries. Interestingly, in the hand-aligned case, the vowels absorb most of the complexity, since many consonantal cues are heavily evidenced on adjacent vowels. However, in the automatically-aligned case, many vowel frames with substantial consonant coloring are re-allocated to those adjacent consonants, giving more complex consonants but comparatively less complex vowels.

Table 3: Number of substates allocated per phone. In each pair, the first number is for training on manually aligned sequences and the second for automatically determined phone boundaries.

    Vowels:     aa 31/32, ae 32/17, ah 31/8, ao 32/23, aw 18/6, ax 18/3,
                ay 32/28, eh 32/16, el 6/4, en 4/3, er 32/31, ey 32/30,
                ih 32/11, ix 31/16, iy 31/32, ow 26/10, oy 4/4, uh 5/2, uw 21/8
    Consonants: b 2/32, ch 13/30, d 2/14, dh 6/31, dx 2/3, f 32/32, g 2/15,
                hh 3/5, jh 3/16, k 30/32, l 25/32, m 25/25, n 29/32, ng 3/4,
                p 5/24, r 32/32, s 32/32, sh 30/32, t 24/32, th 8/11,
                v 23/11, w 10/21, y 3/7, z 31/32, zh 2/2
    Other:      epi 2/4, sil 32/32, vcl 29/30, cl 31/32

6 Conclusions

We have presented a minimalist, automatic approach for building an accurate acoustic model for phonetic classification and recognition. Our model does not require any a priori phonetic bias or manual specification of structure, but rather induces the structure in an automatic and streamlined fashion. Starting from a minimal monophone HMM, we automatically learn models that achieve highly competitive performance. On the TIMIT phone recognition task our model clearly outperforms standard state-tied triphone models like Young and Woodland (1994). For phone classification, our model achieves performance competitive with the state-of-the-art discriminative methods (Sha and Saul, 2006), despite being generative in nature. This result, together with our analysis of the context-dependencies and substructures that are being learned, suggests that our model is particularly well suited for modeling phone-internal structure. It does, of course, remain to be seen if and how these benefits can be scaled to larger systems.

References

P. Clarkson and P. Moreno. 1999. On the use of Support Vector Machines for phonetic classification. In ICASSP '99.

S. B. Davis and P. Mermelstein. 1980. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4).

J. Glass. 2003. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17(2).

A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. 2005. Hidden Conditional Random Fields for phone recognition. In Eurospeech '05.

A. K. Halberstadt and J. R. Glass. 1998. Heterogeneous measurements and multiple classifiers for speech recognition. In ICSLP '98.

F. Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE.

M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Learning in Graphical Models.

L. Lamel and J. Gauvain. 1993. Cross-lingual experiments with phone recognition. In ICASSP '93.

K. F. Lee and H. W. Hon. 1989. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11).

T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL '05.

J. Ming and F. J. Smith. 1998. Improved phone recognition using Bayesian triphone models. In ICASSP '98.

J. J. Odell. 1995. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, University of Cambridge.

S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL '06.

L. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.

A. Sankar. 1998. Experiments with a Gaussian merging-splitting algorithm for HMM training for speech recognition. In DARPA Speech Recognition Workshop '98.

F. Sha and L. K. Saul. 2006. Large margin Gaussian mixture modeling for phonetic classification and recognition. In ICASSP '06.

N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. 2000. Split and merge EM algorithm for mixture models. Neural Computation, 12(9).

S. J. Young and P. C. Woodland. 1994. State clustering in HMM-based continuous speech recognition. Computer Speech and Language, 8(4).
