Learning Structured Models For Phone Recognition

Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 897–905, Prague, June 2007. © 2007 Association for Computational Linguistics
demonstration that such a system is not only competitive with the traditional approach, but can indeed outperform even very recent work on some preliminary measures.

[Figure: subphone structure for the phone sequence /d ae d/. Each phone expands into begin, mid, and end substates between dedicated Start and End states; context-dependent units such as d7 = c(#-d-ae), ae3 = c(d-ae-d), and d13 = c(ae-d-#) label the corresponding triphone variants.]

2 Learning
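The approximate merge criterion developed in this section can be sketched as follows. This is a minimal illustration with assumed names and array shapes, not the authors' implementation: it scores one candidate merge of substates s1 and s2 from the forward and backward tables of a single sequence.

```python
import numpy as np

def approx_merge_loss(alpha, beta, substates, s1, s2, p1, p2):
    """log Delta(s1 s2 -> s) for one sequence: the (log) likelihood
    ratio between the model with s1, s2 merged and the split model.

    alpha, beta : (T, S) forward/backward tables for the sequence.
    substates   : for each time t, the substate indices of the phone
                  at t (with s1 and s2 among them).
    p1, p2      : relative posterior frequencies of s1 and s2.
    """
    log_ratio = 0.0
    for t in range(alpha.shape[0]):
        # P(x, y): sum of alpha * beta over the phone's substates at t
        p_split = sum(alpha[t, s] * beta[t, s] for s in substates[t])
        # merged forward score: sum of the two split scores
        a_hat = alpha[t, s1] + alpha[t, s2]
        # merged backward score: weighted sum of the split scores
        b_hat = p1 * beta[t, s1] + p2 * beta[t, s2]
        # remaining substates of the phone are left untouched
        other = sum(alpha[t, s] * beta[t, s]
                    for s in substates[t] if s not in (s1, s2))
        log_ratio += np.log(a_hat * b_hat + other) - np.log(p_split)
    return log_ratio  # near 0 => merging s1 and s2 costs little likelihood
```

A full split-merge round would then split every substate in two, retrain with EM, score each candidate merge with this quantity summed over all training sequences, and reverse the splits whose loss is closest to zero.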
states only where needed, rather than splitting them all.

We realize this goal by merging back those splits s → s1 s2 for which, if the split were reversed, the loss in data likelihood would be smallest. We approximate the loss in data likelihood for a merge s1 s2 → s with the following likelihood ratio (Petrov et al., 2006):

    Δ(s1 s2 → s) = Π_sequences Π_t  P_t(x, y) / P(x, y).

Here P(x, y) is the joint likelihood of an emission sequence x and associated state sequence y. This quantity can be recovered from the forward and backward probabilities using

    P(x, y) = Σ_{s : π(s) = y_t}  α_t(s) · β_t(s).

P_t(x, y) is an approximation to the same joint likelihood where states s1 and s2 are merged. We approximate the true loss by only considering merging states s1 and s2 at time t, a value which can be efficiently computed from the forward and backward probabilities. The forward score for the merged state s at time t is just the sum of the two split scores:

    α̂_t(s) = α_t(s1) + α_t(s2),

while the backward score is a weighted sum of the split scores:

    β̂_t(s) = p1 β_t(s1) + p2 β_t(s2),

where p1 and p2 are the relative (posterior) frequencies of the states s1 and s2.

Thus, the likelihood after merging s1 and s2 at time t can be computed from these merged forward and backward scores as:

    P_t(x, y) = α̂_t(s) · β̂_t(s) + Σ_{s'} α_t(s') · β_t(s'),

where the second sum is over the other substates of the phone at time t, i.e. {s' : π(s') = y_t, s' ∉ {s1, s2}}. This expression is an approximation because it neglects interactions between instances of the same states at multiple places in the same sequence. In particular, since phones frequently occur with multiple consecutive repetitions, this criterion may vastly overestimate the actual likelihood loss. As such, we also implemented the exact criterion, that is, for each split, we formed a new HMM with s1 and s2 merged and calculated the total data likelihood. This method is much more computationally expensive, requiring a full forward-backward pass through the data for each potential merge, and was not found to produce noticeably better performance. Therefore, all experiments use the approximate criterion.

2.4 The Automatically-Aligned Case

It is straightforward to generalize the hand-aligned case to the case where the phone transcription is known, but no frame-level labeling is available. The main difference is that the phone boundaries are not known in advance, which means that there is now additional uncertainty over the phone states. The forward and backward recursions must thus be expanded to consider all state sequences that yield the given phone transcription. We can accomplish this with standard Baum-Welch training.

3 Inference

An HMM over refined subphone states s ∈ S naturally gives posterior distributions P(s|x) over sequences of states s. We would ideally like to extract the transcription r of underlying phones which is most probable according to this posterior.¹ The transcription is two stages removed from s. First, it collapses the distinctions between states s which correspond to the same phone y = π(s). Second, it collapses the distinctions between where phone transitions exactly occur. Viterbi state sequences can easily be extracted using the basic Viterbi algorithm. On the other hand, finding the best phone sequence or transcription is intractable.

As a compromise, we extract the phone sequence (not transcription) which has highest probability in a variational approximation to the true distribution (Jordan et al., 1999). Let the true posterior distribution over phone sequences be P(y|x). We form an approximation Q(y) ≈ P(y|x), where Q is an approximation specific to the sequence x and factorizes as:

    Q(y) = Π_t q(t, y_t, y_{t+1}).

The solution can be found analytically using Lagrange multipliers:

    q(t, y, y') = P(Y_t = y, Y_{t+1} = y'|x) / P(Y_t = y|x),

where we have made the position-specific random variables Y_t explicit for clarity. This approximation depends only on our ability to calculate posteriors over phones or phone-phone pairs at individual positions t, which is easy to obtain from the state posteriors, for example:

    P(Y_t = y, Y_{t+1} = y'|x) = [ Σ_{s : π(s) = y} Σ_{s' : π(s') = y'} α_t(s) a_{ss'} b_{s'}(x_{t+1}) β_{t+1}(s') ] / P(x).

Finding the Viterbi phone sequence in the approximate distribution Q can be done with the Forward-Backward algorithm over the lattice of q values.

¹ Remember that by "transcription" we mean a sequence of phones with duplicates removed.

4 Experiments

We tested our model on the TIMIT database, using the standard setups for phone recognition and phone classification. We partitioned the TIMIT data into training, development, and (core) test sets according to standard practice (Lee and Hon, 1989; Gunawardana et al., 2005; Sha and Saul, 2006). In particular, we excluded all sa sentences and mapped the 61 phonetic labels in TIMIT down to 48 classes before training our HMMs. At evaluation, these 48 classes were further mapped down to 39 classes, again in the standard way.

MFCC coefficients were extracted from the TIMIT source as in Sha and Saul (2006), including delta and delta-delta components. For all experiments, our system and all baselines we implemented used full covariance when parameterizing emission models.² All Gaussians were endowed with weak inverse Wishart priors with zero mean and identity covariance.³

² Most of our findings also hold for diagonal covariance Gaussians, albeit the final error rates are 2-3% higher.
³ Following previous work with PCFGs (Petrov et al., 2006), we experimented with smoothing the substates towards each other to prevent overfitting, but we were unable to achieve any performance gains.

[Figure 3: Phone recognition error for models of increasing size. Curves: split only; split and merge; split and merge, automatic alignment. x-axis: Number of States (0–2000); y-axis: error rate (0.24–0.42).]

4.1 Phone Recognition

In the task of phone recognition, we fit an HMM whose output, with subsequent states collapsed, corresponds to the training transcriptions. In the TIMIT data set, each frame is manually phone-annotated, so the only uncertainty in the basic setup is the identity of the (sub)states at each frame.

We therefore began with a single state for each phone, in a fully connected HMM (except for special treatment of dedicated start and end states). We incrementally trained our model as described in Section 2, with up to 6 split-merge rounds. We found that reversing 25% of the splits yielded good overall performance while maintaining compactness of the model.

We decoded using the variational decoder described in Section 3. The output was then scored against the reference phone transcription using the standard string edit distance.

During both training and decoding, we used "flattened" emission probabilities by exponentiating to some 0 < γ < 1. We found the best setting for γ to be 0.2, as determined by tuning on the development set.
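As a sketch of this decoding setup (names and array shapes are assumptions for illustration, not the authors' code): flattening is a simple scaling of log emission scores, and the variational decode is a Viterbi pass over the lattice of q values.

```python
import numpy as np

def flatten_log_emissions(log_b, gamma=0.2):
    """'Flattened' emissions: raising b_s(x) to gamma (0 < gamma < 1)
    is a scaling in the log domain; gamma = 0.2 was tuned on the
    development set."""
    return gamma * np.asarray(log_b)

def viterbi_over_q(log_q):
    """Viterbi over the lattice of q values from Section 3.

    log_q[t, y, y2] = log q(t, y, y2), the approximate posterior score
    of moving from phone y to phone y2 after position t.  Returns the
    highest-scoring phone sequence (not transcription) under Q.
    """
    log_q = np.asarray(log_q)
    T1, Y, _ = log_q.shape              # T-1 transitions over Y phones
    score = np.zeros(Y)                 # best log-score ending in each phone
    back = np.zeros((T1, Y), dtype=int)
    for t in range(T1):
        cand = score[:, None] + log_q[t]    # candidate scores for (y, y2)
        back[t] = np.argmax(cand, axis=0)
        score = cand.max(axis=0)
    seq = [int(np.argmax(score))]           # best final phone
    for t in reversed(range(T1)):           # follow backpointers
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]
```

Here log_q would be filled from the state-level forward-backward quantities via the phone-pair posterior formula in Section 3; a real decoder would also fold in an initial-phone term.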
This flattening compensates for the non-independence of the frames, partially due to overlapping source samples and partially due to other unmodeled correlations.

Figure 3 shows the recognition error as the model grows in size. In addition to the basic setup described so far (split and merge), we also show a model in which merging was not performed (split only). As can be seen, the merging phase not only decreases the number of HMM states at each round, but also improves phone recognition error at each round.

We also compared our hierarchical split only model with a model where we directly split all states into 2^k substates, so that these models had the same number of states as a hierarchical model after k split and merge cycles. While for small k the difference was negligible, we found that the error increased by 1% absolute for k = 5. This trend is to be expected, as the possible interactions between the substates grow with the number of substates.

Also shown in Figure 3, and perhaps unsurprising, is that the error rate can be further reduced by allowing the phone boundaries to drift from the manual alignments provided in the TIMIT training data. The split and merge, automatic alignment line shows the result of allowing the EM fitting phase to reposition each phone boundary, giving absolute improvements of up to 0.6%.

We investigated how much improvement in accuracy one can gain by computing the variational approximation introduced in Section 3 versus extracting the Viterbi state sequence and projecting that sequence to its phone transcription. The gap varies, but on a model with roughly 1000 states (5 split-merge rounds), the variational decoder decreases error from 26.5% to 25.6%. The gain in accuracy comes at a cost in time: we must run a (possibly pruned) Forward-Backward pass over the full state space S, then another over the smaller phone space Y. In our experiments, the cost of variational decoding was a factor of about 3, which may or may not justify a relative error reduction of around 4%.

The performance of our best model (split and merge, automatic alignment, and variational decoding) on the test set is 26.4%. A comparison of our performance with other methods in the literature is shown in Table 1. Despite our structural simplicity, we outperform state-tied triphone systems like Young and Woodland (1994), a standard baseline for this task, by nearly 2% absolute. However, we fall short of the best current systems.

    Method                                                     Error Rate
    State-Tied Triphone HMM (Young and Woodland, 1994)         27.7%¹
    Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)    27.1%¹
    This Paper                                                 26.4%
    Bayesian Triphone HMM (Ming and Smith, 1998)               25.6%
    Heterogeneous classifiers (Halberstadt and Glass, 1998)    24.4%

    Table 1: Phone recognition error rates on the TIMIT core test, from Glass (2003).
    ¹ These results are on a slightly easier test set.

4.2 Phone Classification

Phone classification is the fairly constrained task of classifying in isolation a sequence of frames which is known to span exactly one phone. In order to quantify how much of our gains over the triphone baseline stem from modeling context-dependencies and how much from modeling the inner structure of the phones, we fit separate HMM models for each phone, using the same split and merge procedure as above (though in this case only manual alignments are reasonable because we test on manual segmentations). For each test frame sequence, we compute the likelihood of the sequence from the forward probabilities of each individual phone HMM. The phone giving highest likelihood to the input was selected. The error rate is simply the fraction of test phones classified incorrectly.

Table 2 shows a comparison of our performance with that of some other methods in the literature.

    Method                                      Error Rate
    GMM Baseline (Sha and Saul, 2006)           26.0%
    HMM Baseline (Gunawardana et al., 2005)     25.1%
    SVM (Clarkson and Moreno, 1999)             22.4%
    Hidden CRF (Gunawardana et al., 2005)       21.7%
    This Paper                                  21.4%
    Large Margin GMM (Sha and Saul, 2006)       21.1%

    Table 2: Phone classification error rates on the TIMIT core test.

A minimal comparison is to a GMM with the same number of mixtures per phone as our model's maxi-
[Figure 4: Phone confusion matrix (reference phones vs. hypothesized phones). 76% of the substitutions fall within the shown classes: vowels/semivowels, nasals/flaps, strong fricatives, weak fricatives, and stops.]

[Figure 5: Phone contexts and subphone structure. The /l/ phone after 3 split-merge iterations is shown, with sets of previous and next phones on the context transitions.]
natural classes interact with the chain in a way which allows duration to depend on context. In further refinements, more structure is added, including a two-track path in (d) where one track captures the distinct effects on higher formants of r-coloring and nasalization. Figure 5 shows the corresponding diagram for /l/, where some merging has also occurred. Different natural classes emerge in this case, with, for example, preceding states partitioned into front/high vowels vs. rounded vowels vs. other vowels vs. consonants. Following states show a front/back distinction and a consonant distinction, and the phone /m/ is treated specially, largely because the /lm/ sequence tends to shorten the /l/ substantially. Note again how context, internal structure, and duration are simultaneously modeled. Of course, it should be emphasized that post hoc analysis of such structure is a simplification and prone to seeing what one expects; we present these examples to illustrate the broad kinds of patterns which are detected.

As a final illustration of the nature of the learned models, Table 3 shows the number of substates allocated to each phone by the split/merge process (the maximum is 32 for this stage) for the case of hand-aligned (left) as well as automatically-aligned (right) phone boundaries. Interestingly, in the hand-aligned case, the vowels absorb most of the complexity, since many consonantal cues are heavily evidenced on adjacent vowels. However, in the automatically-aligned case, many vowel frames with substantial consonant coloring are re-allocated to those adjacent consonants, giving more complex consonants but comparatively less complex vowels.

    Vowels          Consonants      (cont.)         Other
    aa  31  32      b    2  32      p    5  24      epi  2   4
    ae  32  17      ch  13  30      r   32  32      sil 32  32
    ah  31   8      d    2  14      s   32  32      vcl 29  30
    ao  32  23      dh   6  31      sh  30  32      cl  31  32
    aw  18   6      dx   2   3      t   24  32
    ax  18   3      f   32  32      th   8  11
    ay  32  28      g    2  15      v   23  11
    eh  32  16      hh   3   5      w   10  21
    el   6   4      jh   3  16      y    3   7
    en   4   3      k   30  32      z   31  32
    er  32  31      l   25  32      zh   2   2
    ey  32  30      m   25  25
    ih  32  11      n   29  32
    ix  31  16      ng   3   4
    iy  31  32
    ow  26  10
    oy   4   4
    uh   5   2
    uw  21   8

    Table 3: Number of substates allocated per phone. The left column gives the number of substates allocated when training on manually aligned training sequences, while the right column gives the number allocated when we automatically determine phone boundaries.

6 Conclusions

We have presented a minimalist, automatic approach for building an accurate acoustic model for phonetic classification and recognition. Our model does not require any a priori phonetic bias or manual specification of structure, but rather induces the structure in an automatic and streamlined fashion. Starting from a minimal monophone HMM, we automatically learn models that achieve highly competitive performance. On the TIMIT phone recognition task our model clearly outperforms standard state-tied triphone models like Young and Woodland (1994). For phone classification, our model achieves performance competitive with the state-of-the-art discriminative methods (Sha and Saul, 2006), despite being generative in nature. This result, together with our analysis of the context-dependencies and substructures that are being learned, suggests that our model is particularly well suited for modeling phone-internal structure. It does, of course, remain to be seen if and how these benefits can be scaled to larger systems.

References

P. Clarkson and P. Moreno. 1999. On the use of Support Vector Machines for phonetic classification. In ICASSP '99.

S. B. Davis and P. Mermelstein. 1980. Comparison of parametric representation for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4).

J. Glass. 2003. A probabilistic framework for segment-based speech recognition. Computer Speech and Language, 17(2).

A. Gunawardana, M. Mahajan, A. Acero, and J. Platt. 2005. Hidden Conditional Random Fields for phone recognition. In Eurospeech '05.

A. K. Halberstadt and J. R. Glass. 1998. Heterogeneous measurements and multiple classifiers for speech recognition. In ICSLP '98.

F. Jelinek. 1976. Continuous speech recognition by statistical methods. Proceedings of the IEEE.
M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Learning in Graphical Models.

L. Lamel and J. Gauvain. 1993. Cross-lingual experiments with phone recognition. In ICASSP '93.

K. F. Lee and H. W. Hon. 1989. Speaker-independent phone recognition using Hidden Markov Models. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(11).

T. Matsuzaki, Y. Miyao, and J. Tsujii. 2005. Probabilistic CFG with latent annotations. In ACL '05.

J. Ming and F. J. Smith. 1998. Improved phone recognition using Bayesian triphone models. In ICASSP '98.

J. J. Odell. 1995. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, University of Cambridge.

S. Petrov, L. Barrett, R. Thibaux, and D. Klein. 2006. Learning accurate, compact, and interpretable tree annotation. In COLING-ACL '06.

L. Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE.

A. Sankar. 1998. Experiments with a Gaussian merging-splitting algorithm for HMM training for speech recognition. In DARPA Speech Recognition Workshop '98.

F. Sha and L. K. Saul. 2006. Large margin Gaussian mixture modeling for phonetic classification and recognition. In ICASSP '06.

N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. 2000. Split and Merge EM algorithm for mixture models. Neural Computation, 12(9).

S. J. Young and P. C. Woodland. 1994. State clustering in HMM-based continuous speech recognition. Computer Speech and Language, 8(4).