Li Deng, Dynamic Speech Models. Synthesis Lectures on Speech and Audio Processing. Morgan & Claypool Publishers, 2006.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
DOI: 10.2200/S00028ED1V01Y200605SAP002
First Edition
10 9 8 7 6 5 4 3 2 1
Morgan & Claypool Publishers
ABSTRACT
Speech dynamics refer to the temporal characteristics in all stages of the human speech com-
munication process. This speech “chain” starts with the formation of a linguistic message in a
speaker’s brain and ends with the arrival of the message in a listener’s brain. Given the intri-
cacy of the dynamic speech process and its fundamental importance in human communication,
this monograph is intended to provide comprehensive material on mathematical models of
speech dynamics and to address the following issues: How do we make sense of the complex
speech process in terms of its functional role of speech communication? How do we quantify
the special role of speech timing? How do the dynamics relate to the variability of speech that
has often been said to seriously hamper automatic speech recognition? How do we put the
dynamic process of speech into a quantitative form to enable detailed analyses? And finally,
how can we incorporate the knowledge of speech dynamics into computerized speech analysis
and recognition algorithms? The answers to all these questions require building and applying
computational models for the dynamic speech process.
What are the compelling reasons for carrying out dynamic speech modeling? We pro-
vide the answer in two related aspects. First, scientific inquiry into the human speech code has
been relentlessly pursued for several decades. As an essential carrier of human intelligence and
knowledge, speech is the most natural form of human communication. Embedded in the speech
code are linguistic (as well as para-linguistic) messages, which are conveyed through four levels
of the speech chain. Underlying the robust encoding and transmission of the linguistic mes-
sages are the speech dynamics at all the four levels. Mathematical modeling of speech dynamics
provides an effective tool in the scientific methods of studying the speech chain. Such scientific
studies help understand why humans speak as they do and how humans exploit redundancy and
variability by way of multitiered dynamic processes to enhance the efficiency and effectiveness
of human speech communication. Second, advancement of human language technology, espe-
cially that in automatic recognition of natural-style human speech is also expected to benefit
from comprehensive computational modeling of speech dynamics. The limitations of current
speech recognition technology are serious and are well known. A commonly acknowledged and
frequently discussed weakness of the statistical model underlying current speech recognition
technology is the lack of adequate dynamic modeling schemes to provide correlation structure
across the temporal speech observation sequence. Unfortunately, due to a variety of reasons,
the majority of current research activities in this area favor only incremental modifications and
improvements to the existing HMM-based state-of-the-art. For example, while the dynamic
and correlation modeling is known to be an important topic, most of the systems neverthe-
less employ only an ultra-weak form of speech dynamics; e.g., differential or delta parameters.
Strong-form dynamic speech modeling, which is the focus of this monograph, may serve as an
ultimate solution to this problem.
After the introduction chapter, the main body of this monograph consists of four chapters.
They cover various aspects of theory, algorithms, and applications of dynamic speech models,
and provide a comprehensive survey of the research work in this area spanning the past 20 years.
This monograph is intended as advanced material in speech and signal processing for graduate-
level teaching, for professionals and engineering practitioners, as well as for seasoned researchers
and engineers specializing in speech processing.
KEYWORDS
Articulatory trajectories, Automatic speech recognition, Coarticulation, Discretizing hidden
dynamics, Dynamic Bayesian network, Formant tracking, Generative modeling, Speech
acoustics, Speech dynamics, Vocal tract resonance
Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 What Are Speech Dynamics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 What Are Models of Speech Dynamics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Why Modeling Speech Dynamics? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1.2 Parameter Estimation for the Basic Model: Overview . . . . . . . . . . . . . . . 41
4.1.3 EM Algorithm: The E-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.1.4 A Generalized Forward-Backward Algorithm . . . . . . . . . . . . . . . . . . . . . . 43
4.1.5 EM Algorithm: The M-Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1.6 Decoding of Discrete States by Dynamic Programming. . . . . . . . . . . . . .48
4.2 Extension of the Basic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.1 Extension from First-Order to Second-Order Dynamics . . . . . . . . . . . . . 49
4.2.2 Extension from Linear to Nonlinear Mapping . . . . . . . . . . . . . . . . . . . . . . 50
4.2.3 An Analytical Form of the Nonlinear Mapping Function . . . . . . . . . . . . 51
4.2.4 E-Step for Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2.5 M-Step for Parameter Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.6 Decoding of Discrete States by Dynamic Programming. . . . . . . . . . . . . .61
4.3 Application to Automatic Tracking of Hidden Dynamics . . . . . . . . . . . . . . . . . . . 61
4.3.1 Computation Efficiency: Exploiting Decomposability in the
Observation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Application to Phonetic Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.1 Experimental Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.4.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Acknowledgments
This book would not have been possible without the help and support from friends, family,
colleagues, and students. Some of the material in this book is the result of collaborations with
my former students and current colleagues. Special thanks go to Jeff Ma, Leo Lee, Dong Yu,
Alex Acero, Jian-Lai Zhou, and Frank Seide.
The most important acknowledgments go to my family. I also thank Microsoft Research
for providing the environment in which the research described in this book is made possible.
Finally, I thank Prof. Fred Juang and Joel Claypool for not only the initiation but also the
encouragement and help throughout the course of writing this book.
CHAPTER 1
Introduction
• Linguistic level: At this highest level of speech communication, the speaker forms the
linguistic concept or message to be conveyed to the listener. That is, the speaker decides
to say something linguistically meaningful. This process takes place in the language
center(s) of the speaker's brain. The basic form of the linguistic message is words, which are
organized into sentences according to syntactic constraints. Words are in turn composed
of syllables constructed from phonemes or segments, which are further composed of
phonological features. At this linguistic level, language is represented in a discrete or
symbolic form.
• Physiological level: Motor program and articulatory muscle movement are involved at
this level of speech generation. The speech motor program takes the instructions, spec-
ified by the segments and features formed at the linguistic level, on how the speech
sounds are to be produced by the articulatory muscle (i.e., articulators) movement
over time. Physiologically, the motor program executes itself by issuing time-varying
commands imparting continuous motion to the articulators including the lips, tongue,
primary auditory cortex. The decoding process may be aided by some type of analysis-
by-synthesis strategies that make use of general knowledge of the dynamic processes at
the physiological and acoustic levels of the speech chain as the “encoder” device for the
intended linguistic message.
At all the four levels of the speech communication process above, dynamics play a central
role in shaping the linguistic information transfer. At the linguistic level, the dynamics are
discrete and symbolic, as is the phonological representation. That is, the discrete phonological
symbols (segments or features) change their identities at various points of time in a speech
utterance, and no quantitative (numeric) degree of change and precise timing are observed.
This can be considered as a weak form of dynamics. In contrast, the articulatory dynamics at
the physiological level, and the consequent dynamics at the acoustic level, are of a strong form
in that the numerically quantifiable temporal characteristics of the articulator movements and
of the acoustic parameters are essential for the trade-off between overcoming the physiological
limitations for setting the articulators’ movement speed and efficient encoding of the phono-
logical symbols. At the auditory level, the importance of timing in the auditory nerve’s firing
patterns and in the cortical responses in coding speech has been well known. The dynamic
patterns in the aggregate auditory neural responses to speech sounds in many ways reflect the
dynamic patterns in the speech signal, e.g., time-varying spectral prominences in the speech
signal. Further, numerous types of auditory neurons are equipped with special mechanisms (e.g.,
adaptation and onset-response properties) to enhance the dynamics and information contrast
in the acoustic signal. These properties are especially useful for detecting certain special speech
events and for identifying temporal “landmarks” as a prerequisite for estimating the phonological
features relevant to consonants [2, 3].
Often, we use our intuition to appreciate speech dynamics—as we speak, we sense the
motions of speech articulators and the sounds generated from these motions as continuous flow.
When we refer to this continuous flow of the speech organs and sounds as speech dynamics, we
use the term in a narrow sense, ignoring its linguistic and perceptual aspects.
As is often said, timing is of the essence in speech. The dynamic patterns associated with
articulation, vocal tract shaping, sound acoustics, and auditory response have the key property that
the timing axis in these patterns is adaptively plastic. That is, the timing plasticity is flexible but
not arbitrary. Compression of time in certain portions of speech has a significant effect on speech
perception, but not so for other portions of speech. Some compression of time, together
with manipulation of the local or global dynamic pattern, can change the perceived style
of speaking but not the phonetic content. Other types of manipulation, on the other hand, may
cause very different effects. In speech perception, certain speech events, such as labial stop bursts,
flash by extremely quickly, lasting as little as 1–3 ms, while providing significant cues for the listener
aspects in the speech chain and are sufficiently simplified to facilitate algorithm development
and engineering system implementation for speech processing applications. It is highly desirable
that the models be developed in statistical terms, so that advanced algorithms can be developed
to automatically and optimally determine any parameters in the models from a representative
set of training data. Further, it is important that the probability for each speech utterance be
efficiently computed under any hypothesized word-sequence transcript to make the speech
decoding algorithm development feasible.
Motivated by the multiple-stage view of the dynamic speech process outlined in the
preceding section, detailed computational models, especially those for the multiple generative
stages, can be constructed from the distinctive feature-based linguistic units to acoustic and
auditory parameters of speech. These stages include the following:
The main advantage of modeling such detailed multiple-stage structure in the dynamic
human speech process is that a highly compact set of parameters can then be used to cap-
ture phonetic context and speaking rate/style variations in a unified framework. Using this
framework, many important subjects in speech science (such as acoustic/auditory correlates of
distinctive features, articulatory targets/dynamics, acoustic invariance, and phonetic reduction)
and those in speech technology (such as modeling pronunciation variation, long-span context-
dependence representation, and speaking rate/style modeling for recognizer design) that were
previously studied separately by different communities of researchers can now be investigated
in a unified fashion.
Many aspects of the above multitiered dynamic speech model class, together with its scien-
tific background, have been discussed in [9]. In particular, the feature organization/overlapping
process, as is central to a version of computational phonology, has been presented in some
detail under the heading of “computational phonology.” Also, some aspects of auditory speech
representation, limited mainly to the peripheral auditory system’s functionalities, have been
elaborated in [9] under the heading of “auditory speech processing.” This book will treat these
In this book, these three major components of dynamic speech modeling will be treated
in a much greater depth than in [9], especially in model implementation and in algorithm
development. In addition, this book will include comprehensive reviews of new research work
since the publication of [9] in 2003.
enhancement of human–human interaction in numerous ways. However, the limitations of
current speech recognition technology are serious and well known (e.g., [10–13]). A commonly
acknowledged and frequently discussed weakness of the statistical model (hidden Markov model
or HMM) underlying current speech recognition technology is the lack of adequate dynamic
modeling schemes to provide correlation structure across the temporal speech observation se-
quence [9, 13, 14]. Unfortunately, due to a variety of reasons, the majority of current research
activities in this area favor only incremental modifications and improvements to the existing
HMM-based state-of-the-art. For example, while the dynamic and correlation modeling is
known to be an important topic, most of the systems nevertheless employ only the ultra-weak
form of speech dynamics, i.e., differential or delta parameters. A strong form of dynamic speech
modeling presented in this book appears to be an ultimate solution to the problem.
It has been broadly hypothesized that new computational paradigms beyond the conven-
tional HMM as a generative framework are needed to reach the goal of all-purpose recognition
technology for unconstrained natural-style speech, and that statistical methods capitalizing
on essential properties of speech structure are beneficial in establishing such paradigms. Over
the past decade or so, there has been a popular discriminant-function-based and conditional
modeling approach to speech recognition, making use of HMMs (as a discriminant function
instead of as a generative model) or otherwise [13, 15–19]. This approach has been grounded
on the assumption that we do not have adequate knowledge about the realistic speech process,
as exemplified by the following quote from [17]: “The reason of taking a discriminant function
based approach to classifier design is due mainly to the fact that we lack complete knowledge
of the form of the data distribution and training data are inadequate.” The special difficulty of
acquiring such distributional speech knowledge lies in the sequential nature of the data with a
variable and high dimensionality. This is essentially the problem of dynamics in the speech data.
As we gradually fill in such knowledge while pursuing research in dynamic speech modeling, we
will be able to bridge the gap between the discriminative paradigm and the generative modeling
one, but with a much higher performance level than the systems at present. This dynamic speech
modeling approach can enable us to “put speech science back into speech recognition” instead of
treating speech recognition as a generic, loosely constrained pattern recognition problem. In this
way, we are able to develop models “that really model speech,” and such models can be expected to
provide an opportunity to lay a foundation of the next-generation speech recognition technology.
CHAPTER 2
The main aim of this chapter is to set up a general modeling and computational framework,
based on the modern mathematical tool called dynamic Bayesian networks (DBN) [20, 21],
and to establish general forms of the multistage dynamic speech model outlined in the preceding
chapter. The overall model presented within this framework is comprehensive in nature, covering
all major components in the speech generation chain—from the multitiered, articulation-based
phonological construct (top) to the environmentally distorted acoustic observation (bottom).
The model is formulated in specially structured DBN, in which the speech dynamics at separate
levels of the generative chain are represented by distinct (but dependent) discrete and continuous
state variables and by their characteristic temporal evolution.
Before we present the model and the associated computational framework, we first provide
a general background and literature review.
In this section, we will describe each of these components and their design in some detail.
In particular, as a general computational framework, we provide the DBN representation for
each of the above model components and for their combination.
Each of the component states can take K^(l) values. In implementing this model for American
English, we have L = 5, and the five tiers are Lips, Tongue Blade, Tongue Body, Velum, and
Larynx, respectively. For the "Lips" tier, we have K^(1) = 6 for six possible linguistically distinct Lips
configurations, i.e., those for /b/, /r/, /sh/, /u/, /w/, and /o/. Note that at this phonological level,
the difference among these Lips configurations is purely symbolic. The numerical difference is
manifested in different articulatory target values at the lower phonetic level, resulting ultimately in
different acoustic realizations. For the remaining tiers, we have K^(2) = 6, K^(3) = 17, K^(4) = 2,
and K^(5) = 2.
The state space of this factorial Markov chain consists of all K^L = K^(1) × K^(2) × K^(3) ×
K^(4) × K^(5) possible combinations of the s_t^(l) state variables. If no constraints are imposed on
the state transition structure, this would be equivalent to the conventional one-tiered Markov
chain with a total of K^L states and a K^L × K^L state transition matrix. This would be an
uninteresting case since the model complexity is exponentially (or factorially) growing in L. It would
also be unlikely to find any useful phonological structure in this huge Markov chain. Further,
since all the phonetic parameters in the lower level components of the overall model (to be
discussed shortly) are conditioned on the phonological state, the total number of model param-
eters would be unreasonably large, presenting a well-known sparseness difficulty for parameter
learning.
Fortunately, rich sources of phonological and phonetic knowledge are available to
constrain the state transitions of the above factorial Markov chain. One particularly useful set
of constraints come directly from the phonological theories that motivated the construction
of this model. Both autosegmental phonology [62] and articulatory phonology [63] treat the
different tiers in the phonological features as being largely independent of each other in their
evolving dynamics. This thus allows the a priori decoupling among the L tiers:
\[
P(\mathbf{s}_t \mid \mathbf{s}_{t-1}) = \prod_{l=1}^{L} P\big(s_t^{(l)} \mid s_{t-1}^{(l)}\big).
\]
The transition structure of this constrained (decoupled) factorial Markov chain can be
parameterized by L distinct K^(l) × K^(l) matrices. This is significantly simpler than the original
K^L × K^L matrix as in the unconstrained case.
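As a concrete illustration of this decoupled transition structure, the short Python sketch below samples from a five-tier factorial Markov chain. The tier names and sizes follow the American English configuration given above; the per-tier transition matrices are random placeholders rather than trained values.

```python
import numpy as np

# A minimal sketch of the constrained (decoupled) factorial Markov chain.
# Tier names and numbers of symbolic values follow the text (L = 5 tiers);
# the per-tier transition matrices are random placeholders, not trained values.
rng = np.random.default_rng(0)

tiers = {"Lips": 6, "TongueBlade": 6, "TongueBody": 17, "Velum": 2, "Larynx": 2}

# One K(l) x K(l) row-stochastic transition matrix per tier.
trans = {}
for name, K in tiers.items():
    A = rng.random((K, K))
    trans[name] = A / A.sum(axis=1, keepdims=True)

def step(state):
    """Advance the composite phonological state one frame.
    Each tier evolves independently: P(s_t | s_{t-1}) = prod_l P(s_t^(l) | s_{t-1}^(l))."""
    return {name: rng.choice(tiers[name], p=trans[name][idx])
            for name, idx in state.items()}

# Sample a short symbolic feature-overlap pattern.
state = {name: 0 for name in tiers}
for t in range(5):
    state = step(state)
    print(t, state)
```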
Fig. 2.1 shows a dynamic Bayesian network (DBN) for a factorial Markov chain with
the constrained transition structure. A Bayesian network is a graphical model that describes
dependencies and conditional independencies in the probabilistic distributions defined over a
set of random variables. The most interesting class of Bayesian networks, as relevant to speech
modeling, is the DBN specifically aimed at modeling time series data or symbols such as speech
acoustics, phonological units, or a combination of them. For the speech data or symbols, there
are causal dependencies between random variables in time and they are naturally suited for the
DBN representation.
In the DBN representation of Fig. 2.1 for the L-tiered phonological model, each node
represents a component phonological feature in each tier as a discrete random variable at a
particular discrete time. The fact that there is no dependency (lacking arrows) between the
where m(s) is the mean vector associated with the composite phonological state s, and
the covariance matrix Σ(s) is nondiagonal. This allows for the correlation among the ar-
ticulatory vector components. Because such a correlation is represented for the articula-
tory target (as a random vector), compensatory articulation is naturally incorporated in the
model.
Since the target distribution, as specified in Eq. (2.2), is conditioned on a specific phono-
logical unit (e.g., a bundle of overlapped features represented by the composite state s consisting
of component feature values in the factorial Markov chain of Fig. 2.1), and since the target does
not switch until the phonological unit changes, the statistics for the temporal sequence of the
target process follows that of a segmental HMM [40].
For the single-tiered (L = 1) phonological model (e.g., phone-based model), the
segmental HMM for the target process will be the same as that described in [40], except
the output is no longer the acoustic parameters. The dependency structure in this segmental
HMM as the combined one-tiered phonological model and articulatory target model can be
illustrated in the DBN of Fig. 2.2. We now elaborate on the dependencies in Fig. 2.2. The
FIGURE 2.2: DBN for a segmental HMM as a probabilistic model for the combined one-tiered phono-
logical model and articulatory target model. The output of the segmental HMM is the target vector, t,
constrained to be constant until the discrete phonological state, s , changes its value
output of this segmental HMM is the random articulatory target vector t(k) that is constrained
to be constant until the phonological state switches its value. This segmental constraint for
the dynamics of the random target vector t(k) represents the adopted articulatory control
strategy that the goal of the motor system is to try to maintain the articulatory target’s position
(for a fixed corresponding phonological state) by exerting appropriate muscle forces. That is,
although random, t(k) remains fixed until the phonological state s k switches. The switching of
target t(k) is synchronous with that of the phonological state, and only at the time of switching,
is t(k) allowed to take a new value according to its probability density function. This segmental
constraint can be described mathematically by the following conditional probability density
function:
\[
p[\mathbf{t}(k) \mid s_k, s_{k-1}, \mathbf{t}(k-1)] =
\begin{cases}
\delta[\mathbf{t}(k) - \mathbf{t}(k-1)] & \text{if } s_k = s_{k-1},\\
N\big(\mathbf{t}(k);\, \mathbf{m}(s_k),\, \boldsymbol{\Sigma}(s_k)\big) & \text{otherwise}.
\end{cases}
\]
This adds the new dependencies of random vector of t(k) on s k−1 and t(k − 1), in addition to
the obvious direct dependency on s k , as shown in Fig. 2.2.
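The following sketch illustrates the segmental constraint just described: a target vector is redrawn only when the phonological state switches and is held constant otherwise. The state sequence, target means m(s), and covariances Σ(s) below are illustrative placeholders, not values from the book.

```python
import numpy as np

# A sketch of the segmental target process: the articulatory target t(k) is
# redrawn from N(m(s), Sigma(s)) only when the phonological state s_k changes,
# and is held fixed otherwise. The state sequence, means, and covariances are
# illustrative placeholders.
rng = np.random.default_rng(1)

dim = 3                                        # articulatory target dimension (hypothetical)
means = {0: np.zeros(dim), 1: np.ones(dim), 2: -np.ones(dim)}
covs = {s: 0.01 * np.eye(dim) for s in means}  # nondiagonal in general; diagonal here for brevity

state_seq = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]     # a hypothetical phone-level state sequence

targets = []
for k, s in enumerate(state_seq):
    if k == 0 or s != state_seq[k - 1]:
        current = rng.multivariate_normal(means[s], covs[s])  # switch: draw a new target
    targets.append(current)                                   # otherwise hold the target fixed

for k, (s, t) in enumerate(zip(state_seq, targets)):
    print(k, s, np.round(t, 3))
```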
Generalizing from the one-tiered phonological model to the multitiered one as discussed
earlier, the dependency structure in the “segmental factorial HMM” as the combined multitiered
phonological model and articulatory target model has the DBN representation in Fig. 2.3. The
key conditional probability density function (PDF) is similar to the above segmental HMM
except that the conditioning phonological states are the composite states (sk and sk−1 ) consisting
of a collection of discrete component state variables:
\[
p[\mathbf{t}(k) \mid \mathbf{s}_k, \mathbf{s}_{k-1}, \mathbf{t}(k-1)] =
\begin{cases}
\delta[\mathbf{t}(k) - \mathbf{t}(k-1)] & \text{if } \mathbf{s}_k = \mathbf{s}_{k-1},\\
N\big(\mathbf{t}(k);\, \mathbf{m}(\mathbf{s}_k),\, \boldsymbol{\Sigma}(\mathbf{s}_k)\big) & \text{otherwise}.
\end{cases}
\]
Note that in Figs. 2.2 and 2.3 the target vector t(k) is defined in the same space as that of
the physical articulator vector (including jaw positions, which do not have direct phonological
FIGURE 2.3: DBN for a segmental factorial HMM as a combined multitiered phonological model and
articulatory target model
into the following mathematically tractable, linear, first-order autoregressive (AR) model:

z(k + 1) = A_s z(k) + B_s t_s + w(k),

where z is the n-dimensional real-valued articulatory-parameter vector, w is the IID and Gaus-
sian noise, ts is the HMM-state-dependent target vector expressed in the same articulatory
domain as z(k), As is the HMM-state-dependent system matrix, and Bs is a matrix that mod-
ifies the target vector. The dependence of the t_s and A_s parameters of the above dynamic system
on the phonological state is justified by the fact that the functional behavior of an articulator
depends both on the particular goal it is trying to implement, and on the other articulators with
which it is cooperating in order to produce compensatory articulation.
In order for the modeled articulatory dynamics above to exhibit realistic behaviors, e.g.,
movement along the target-directed path within each segment and not oscillating within the
segment, matrices As and Bs can be constrained appropriately. One form of the constraint gives
rise to the following articulatory dynamic model:

z(k + 1) = A_s z(k) + (I − A_s) t_s + w(k),   (2.4)

where I is the identity matrix. Other forms of the constraint will be discussed in Chapters 4
and 5 of the book for two specific implementations of the general model.
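To make the target-directed behavior of Eq. (2.4) concrete, the sketch below simulates the constrained AR dynamics with hypothetical values of A_s, t_s, and the noise level; the articulatory vector z(k) settles near the mean of the target, as discussed next.

```python
import numpy as np

# A sketch of the constrained first-order AR dynamics of Eq. (2.4),
#   z(k+1) = A_s z(k) + (I - A_s) t_s + w(k),
# showing the target-directed behavior. A_s, t_s, and the noise level are
# illustrative values, not parameters from the book.
rng = np.random.default_rng(2)

dim = 2
A_s = 0.85 * np.eye(dim)      # stable system matrix (eigenvalues inside the unit circle)
t_s = np.array([1.0, -0.5])   # segmental articulatory target (mean of the target vector)
noise_std = 0.01

z = np.zeros(dim)
for k in range(40):
    z = A_s @ z + (np.eye(dim) - A_s) @ t_s + noise_std * rng.standard_normal(dim)

# After enough frames z(k) hovers around the target mean t_s.
print("final z:", np.round(z, 3), " target:", t_s)
```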
It is easy to see that the constrained linear AR model of Eq. (2.4) has the desirable target-
directed property. That is, the articulatory vector z(k) asymptotically approaches the mean of
the target random vector t for artificially lengthened speech utterances. For natural speech, and
FIGURE 2.4: DBN for a switching, target-directed AR model driven by a segmental factorial HMM.
This is a combined model for multitiered phonology, target process, and articulatory dynamics
o(k) = h[z(k)] + v(k),   (2.6)

where o is the m-dimensional real-valued observation vector, v is the IID observation noise
vector uncorrelated with the state noise w, and h[·] is the static memoryless transformation
from the articulatory vector to its corresponding acoustic observation vector.
Including this static mapping model, the combined phonological, target, articulatory
dynamic, and the acoustic model now has the DBN representation shown in Fig. 2.5. The new
dependency for the acoustic random variables is specified, on the basis of “observation” equation
in Eq. (2.6), by the following conditional PDF:
p_o[o(k) | z(k)] = p_v[o(k) − h(z(k))].   (2.7)
There are many ways of choosing the static nonlinear function for h[z] in Eq. (2.6), such as
using a multilayer perceptron (MLP) neural network. Typically, the analytical forms of nonlinear
functions make the associated nonlinear dynamic systems difficult to analyze and make the
estimation problems difficult to solve. Simplification is frequently used to gain computational
advantages while sacrificing accuracy for approximating the nonlinear functions. One most
commonly used technique for the approximation is truncated (vector) Taylor series expansion.
If all the Taylor series terms of order two and higher are truncated, then we have the linear
Taylor series approximation that is characterized by the Jacobian matrix J and by the point of
Taylor series expansion z0 :
h(z) ≈ h(z0 ) + J(z0 )(z − z0 ). (2.8)
FIGURE 2.5: DBN for a target-directed, switching dynamic system (state–space) model driven by a seg-
mental factorial HMM. This is a combined model for multitiered phonology, target process, articulatory
dynamics, and articulatory-to-acoustic mapping
Each element of the Jacobian matrix J is the partial derivative of each vector component of the
nonlinear output with respect to each of the input vector components. That is,
\[
\mathbf{J}(\mathbf{z}_0) = \frac{\partial \mathbf{h}}{\partial \mathbf{z}_0} =
\begin{bmatrix}
\dfrac{\partial h_1(\mathbf{z}_0)}{\partial z_1} & \dfrac{\partial h_1(\mathbf{z}_0)}{\partial z_2} & \cdots & \dfrac{\partial h_1(\mathbf{z}_0)}{\partial z_n} \\
\dfrac{\partial h_2(\mathbf{z}_0)}{\partial z_1} & \dfrac{\partial h_2(\mathbf{z}_0)}{\partial z_2} & \cdots & \dfrac{\partial h_2(\mathbf{z}_0)}{\partial z_n} \\
\vdots & & \ddots & \vdots \\
\dfrac{\partial h_m(\mathbf{z}_0)}{\partial z_1} & \dfrac{\partial h_m(\mathbf{z}_0)}{\partial z_2} & \cdots & \dfrac{\partial h_m(\mathbf{z}_0)}{\partial z_n}
\end{bmatrix}. \tag{2.9}
\]
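A small numerical sketch of this linearization is given below. It forms the Jacobian of Eq. (2.9) by finite differences for a generic smooth mapping (a stand-in for the true articulatory-to-acoustic function h[·], which is not specified here) and checks the first-order Taylor approximation of Eq. (2.8) near the expansion point.

```python
import numpy as np

# A sketch of the first-order Taylor approximation in Eqs. (2.8)-(2.9):
# the Jacobian of a nonlinear articulatory-to-acoustic mapping h(z) is formed
# by finite differences, then h(z) ~ h(z0) + J(z0)(z - z0) is checked near z0.
# The one-hidden-layer mapping below is only a stand-in for the true h[.].
rng = np.random.default_rng(3)

n, m, hidden = 3, 4, 8
W1, W2 = rng.standard_normal((hidden, n)), rng.standard_normal((m, hidden))

def h(z):
    return W2 @ np.tanh(W1 @ z)          # a generic smooth nonlinear mapping

def jacobian(f, z0, eps=1e-6):
    """Numerical Jacobian: J[i, j] = d f_i / d z_j at z0."""
    f0 = f(z0)
    J = np.zeros((f0.size, z0.size))
    for j in range(z0.size):
        dz = np.zeros_like(z0)
        dz[j] = eps
        J[:, j] = (f(z0 + dz) - f0) / eps
    return J

z0 = rng.standard_normal(n)
J = jacobian(h, z0)
z = z0 + 0.01 * rng.standard_normal(n)   # a point close to the expansion point
linear = h(z0) + J @ (z - z0)
print("max linearization error:", np.abs(h(z) - linear).max())
```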
y(t) = o(t) ∗ h(t) + n(t),   (2.10)

where y(t) is the distorted speech signal sample, modeled as the convolution between the clean
speech signal sample o(t) and the distortion channel's impulse response h(t), plus the additive noise
sample n(t).
However, in the log-spectrum domain or in the cepstrum domain that is commonly used
as the input for speech recognizers, Eq. (2.10) has the equivalent form (see a derivation in [83])

y(k) = o(k) + h̄ + C log[I + exp(C^{-1}(n(k) − o(k) − h̄))].   (2.11)
In Eq. (2.11), y(k) is the cepstral vector at frame k, when C is taken as cosine transform
matrix. (When C is taken as the identity matrix, y(k) becomes a log-spectral vector.) y(k) now
becomes weakly nonlinearly related to the clean speech cepstral vector o(k), cepstral vector of
additive noise n(k), and cepstral vector of the impulse response of the distortion channel h̄. Note
that according to Eq. (2.11), the relationship between clean and noisy speech cepstral vectors
becomes linear (or affine) when the signal-to-noise ratio is either very large or very small.
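A minimal sketch of Eq. (2.11) is shown below, with C taken as the identity matrix so that the vectors are log-spectra; the clean-speech, channel, and noise values are made up. It also illustrates the near-linear behavior at the two SNR extremes noted above.

```python
import numpy as np

# A sketch of the acoustic-distortion model of Eq. (2.11) in the log-spectral
# domain (C taken as the identity matrix): y = o + h + log(1 + exp(n - o - h)),
# applied elementwise. The clean-speech, noise, and channel vectors are made up.
o = np.array([2.0, 1.5, 0.5, -1.0])      # clean log-spectra (hypothetical)
h = np.array([0.1, -0.2, 0.0, 0.3])      # channel log-spectrum (hypothetical)
n = np.array([-1.0, 0.5, 1.5, 2.0])      # additive-noise log-spectrum (hypothetical)

y = o + h + np.log1p(np.exp(n - o - h))

# At high SNR (n << o + h) y ~ o + h; at low SNR (n >> o + h) y ~ n,
# so the clean-to-noisy relation is approximately linear at both extremes.
print(np.round(y, 3))
```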
After incorporating the above acoustic distortion model, and assuming that the statistics of
the additive noise changes slowly over time as governed by a discrete-state Markov chain, Fig. 2.6
shows the DBN for the comprehensive generative model of speech from the phonological
model to distorted speech acoustics. Intermediate models include the target model, articulatory
dynamic model, and clean-speech acoustic model. For clarity, only a one-tiered, rather than
multitiered, phonological model is illustrated. [The dependency of the parameters of the
articulatory dynamic model on the phonological state is also explicitly added.] Note that in Fig.
2.6, the temporal dependency in the discrete noise states Nk gives rise to nonstationarity in the
additive noise random vectors nk . The cepstral vector h̄ for the distortion channel is assumed
not to change over the time span of the observed distorted speech utterance y1 , y2 , . . ., y K .
FIGURE 2.6: DBN for a comprehensive generative model of speech from the phonological model to
distorted speech acoustics. Intermediate models include the target model, articulatory dynamic model, and
acoustic model. For clarity, only a one-tiered, rather than multitiered, phonological model is illustrated.
Explicitly added is the dependency of the parameters of the articulatory dynamic model on the
phonological state
In Fig. 2.6, the dependency relationship for the new variables of distorted acoustic ob-
servation y(k) is specified, on the basis of the observation equation (2.11), where o(k) is specified in
Eq. (2.6), by the following conditional PDF:
p(y(k) | o(k), n(k), h̄) = p (y(k) − o(k) − h̄ − C log[I + exp(C−1 (n(k) − o(k) − h̄))]),
(2.12)
FIGURE 2.7: DBN that incorporates the Lombard effect in the comprehensive generative model of
speech. The behavior of articulation is subject to modification (e.g., articulatory target overshoot or hyper-
articulation or increased articulatory efforts by shortening time constant) under severe environmental
distortions. This is represented by the “feedback” dependency from the noise nodes to the articulator
nodes in the DBN
effectiveness and efficiency in model learning. The most straightforward method is to use a
set of linear regression functions to replace the general nonlinear mapping in Eq. (2.6), while
keeping intact the target-directed, linear state dynamics of Eq. (2.4). That is, rather than using
one single set of linear-model parameters to characterize each phonological state, multiple sets
where Ḣm = [a | Hm ] is the expanded matrix by left appending vector a to matrix Hm , and
ż(k) = [1 | z(k) ] is the expanded vector in a similar manner. In the above equations, M is the
total number of mixture components in the model for each phonological state (e.g., phone). The
state noise and measurement noise, wm (k) and vm (k), are respectively modeled by uncorrelated,
IID, zero-mean, Gaussian processes with covariance matrices Qm and Rm . o represents the
sequence of acoustic vectors, o(1), o(2), . . . , o(k). . . , and the z represents the sequence of
hidden articulatory vectors, z(1), z(2), . . . , z(k), . . . .
The full set of model parameters for each phonological state (not indexed for clarity) is
Θ = (Φ_m, t_m, Q_m, R_m, H_m), for m = 1, 2, . . . , M.
It is important to impose the following mixture-path constraint on the above dynamic
system model: for each sequence of acoustic observations associated with a phonological state,
the sequence is forced to be produced from a fixed mixture component, m, in the model. This
means that the articulatory target for each phonological state is not permitted to switch from
one mixture component to another within the duration of the same segment. The constraint
is motivated by the physical nature of the dynamic speech model—the target that is correlated
with its phonetic identity is defined at the segment level, not at the frame level. Use of the
type of segment-level mixture is intended to represent the various sources of speech variability
including speakers’ vocal tract shape differences and speaking-habit differences, etc.
In Fig. 2.8 is shown the DBN representation for the piecewise linearized dynamic speech
model as a simplified generative model of speech where the nonlinear mapping from hidden
dynamic variables to acoustic observational variables is approximated by a piecewise linear rela-
tionship. The new, discrete random variable m is introduced to provide the “region” or mixture-
component index m to the piecewise linear mapping. Both the input and output variables that
are in a nonlinear relationship have now simultaneous dependency on m. The conditional PDFs
involving this new node are
and
where k denotes the time frame and s denotes the phonological state.
FIGURE 2.8: DBN representation for a mixture linear model as a simplified generative model of
speech where the nonlinear mapping from hidden dynamic variables to acoustic observational variables
is approximated by a piecewise linear relationship. The new, discrete random variable m is introduced to
provide “region” index to the piecewise linear mapping. Both the input and output variables that are in a
nonlinear relationship have now simultaneous dependency on m
2.4 SUMMARY
After providing general motivations and model design philosophy, technical detail of a multi-
stage statistical generative model of speech dynamics and its associated computational frame-
work based on DBN is presented in this chapter. We now summarize this model description.
Equations (2.4) and (2.6) form a special version of the switching state–space model appropriate
for describing the multilevel speech dynamics. The top-level dynamics occur at the discrete-state
phonology, represented by the state transitions of s with a relatively long time scale (roughly
about the duration of phones). The next level is the target (t) dynamics; it has the same time
scale and provides systematic randomness at the segmental level. At the level of articulatory
dynamics, the time scale is significantly shortened. This level represents the continuous-state
dynamics driven by the stochastic target process as the input. The state equation (2.4) explicitly
describes the dynamics in z, with index of s (which takes discrete values) implicitly representing
the phonological process of transitions among a set of discrete states, which we call “switching.”
At the lowest level is acoustic dynamics, where there is no phonological switching process.
Since the observation equation (2.6) is static, this simplified acoustic generation model assumes
that acoustic dynamics are a direct consequence of articulatory dynamics only. Improvement
of this model component that overcomes this simplification is unlikely until better modeling
CHAPTER 3
In Chapter 2, we described a rather general modeling scheme and the DBN-based computa-
tional framework for speech dynamics. Detailed implementation of the speech dynamic models
would vary depending on the trade-offs in modeling precision and mathematical/algorithm
tractability. In fact, various types of statistical models of speech beyond the HMM have already
been in the literature for some time, although most of them have not been viewed from a unified
perspective as having varying degrees of approximation to the multistage speech chain. The
purpose of this chapter is to take this unified view in classifying and reviewing a wide variety of
current statistical speech models.
• Observable polynomial trend functions: This is the simplest trended HMM where there
is no uncertainty in the polynomial coefficients Λs (e.g., [41, 55, 56, 86]).
• Random polynomial trend functions: The trend functions gk (Λs ) in Eq. (3.1) are stochas-
tic due to the uncertainty in polynomial coefficients Λs . Λs are random vectors in one
of the two ways: (1) Λs has a discrete distribution [87, 88] and (2) Λs has a continuous
distribution. In the latter case, the model is called the segmental HMM, where the
earlier versions have a polynomial order of zero [40, 89] and the later versions have an
order of one [90] or two [91].
and the starting point of the recursion for each state s comes usually from the previous state’s
ending history.
The model expressed in Eq. (3.2) provides clear contrast to the trajectory or trended
models where the time-varying acoustic observation vectors are approximated as an explicit
temporal function of time. The sample paths of the model Eq. (3.2), on the other hand, are
piecewise, recursively defined stochastic time-varying functions. Further classification of this
model class is discussed below.
Nonlinear-predictive HMM
Several versions of nonlinear-predictive HMM have appeared in the literature, which generalize
the linear prediction in Eq. (3.2) to nonlinear prediction using neural networks (e.g., [99–101]).
In the model of [101], detailed statistical analysis was provided, proving that nonlinear prediction
with a short temporal order effectively produces a correlation structure over a significantly longer
temporal span.
• articulatory dynamic model (e.g., [46, 54, 58, 59, 78, 79, 103, 104]);
• task-dynamic model (e.g., [105, 106]);
• vocal tract resonance (VTR) dynamic model (e.g., [24, 42, 48, 49, 84, 85, 107–112]);
• model with abstract dynamics (e.g., [42, 44, 107, 113]).
The VTR dynamics are a special type of task dynamics, with the acoustic goal or “task”
of speech production in the VTR domain. Key advantages of using VTRs as the “task” are their
direct correlation with the acoustic information, and the lower dimensionality in the VTR vector
compared with the counterpart hidden vectors either in the articulatory dynamic model or in
the task-dynamic model with articulatorily defined goal or “task” such as vocal tract constriction
properties.
As an alternative classification scheme, the hidden dynamic models can also be classified,
from the computational perspective, according to whether the hidden dynamics are represented
mathematically with temporal recursion or not. Like the acoustic dynamic models, the two
types of the hidden dynamic models in this classification scheme are reviewed here.
3.4 SUMMARY
This chapter serves as a bridge between the general modeling and computational framework
for speech dynamics (Chapter 2) and Chapters 4 and 5 on detailed descriptions of two specific
implementation strategies and algorithms for hidden dynamic models. The theme of this chapter
is to move from the relatively simplistic view of dynamic speech modeling confined within the
acoustic stage to the more realistic view of multistage speech dynamics with an intermediate
hidden dynamic layer between the phonological states and the acoustic dynamics. The latter,
with appropriate constraints in the form of the dynamic function, permits a representation
of the underlying speech structure responsible for coarticulation and speaking-effort-related
CHAPTER 4
In this chapter, we focus on a special type of hidden dynamic models where the hidden dynamics
are recursively defined and where these hidden dynamic values are discretized. The discretization
or quantization of the hidden dynamics causes an approximation to the original continuous-
valued dynamics as described in the earlier chapters but it enables an implementation strategy
that can take direct advantage of the forward–backward algorithm and dynamic programming
in model parameter learning and decoding. Without discretization, the parameter learning and
decoding problems would be typically intractable (i.e., the computation cost would increase
exponentially with time). Under different kinds of model implementation schemes, other types
of approximation will be needed and one type of the approximation in this case will be detailed
in Chapter 5.
This chapter is based on the materials published in [110, 117], with reorganization,
rewriting, and expansion of these materials so that they naturally fit as an integral part of this
book.
x_t = r_s x_{t−1} + (1 − r_s) T_s + w_t,   (4.1)

where the state noise w_t ∼ N(w_t; 0, B_s) is assumed to be IID, zero-mean Gaussian with phonological
state (s)-dependent precision (inverse of variance) B_s. The linearized observation equation is

o_t = H_s x_t + h_s + v_t,   (4.2)
We also have the transition probability for the phonological states:
\[
p(s_t = s \mid s_{t-1} = s') = \pi_{s's}.
\]
In what follows, N denotes the total number of observation data points in the training set.
After discretization of the hidden dynamic variables, Eqs. (4.3) and (4.4) are approximated
accordingly. Denote the observation sequence by o_1^N = (o_1, o_2, . . . , o_t, . . . , o_N), and the
hidden variable sequences by x_1^N = (x_1, x_2, . . . , x_t, . . . , x_N) and s_1^N = (s_1, s_2, . . . , s_t, . . . , s_N).
The expectation in the EM objective is taken over the posterior probability of all hidden variable
sequences. This gives (before discretization of the hidden dynamic variables):
\[
Q = \sum_{s_1}\cdots\sum_{s_t}\cdots\sum_{s_N} \int_{x_1}\cdots\int_{x_t}\cdots\int_{x_N} p(s_1^N, x_1^N \mid o_1^N)\, \log p(s_1^N, x_1^N, o_1^N)\; dx_1 \cdots dx_t \cdots dx_N, \tag{4.7}
\]
where the summation for each phonological state s is from 1 to S (the total number of distinct
phonological units).
After discretizing x_t into x_t[i], the objective function of Eq. (4.7) is approximated by
\[
Q \approx \sum_{s_1}\cdots\sum_{s_t}\cdots\sum_{s_N}\; \sum_{i_1}\cdots\sum_{i_t}\cdots\sum_{i_N} p(s_1^N, i_1^N \mid o_1^N)\, \log p(s_1^N, i_1^N, o_1^N), \tag{4.8}
\]
and
\[
p(o_1^N \mid s_1^N, i_1^N) = \prod_{t} N\big(o_t;\, H_{s_t} x_t[i] + h_{s_t},\, D_{s_t}\big).
\]
In these equations, the discretization indices i and j denote the hidden dynamic values taken at
time frames t and t − 1, respectively; that is, i_t = i and i_{t−1} = j.
We first compute Q_o (omitting the constant −0.5 d log(2π) that is irrelevant to optimization):
\[
\begin{aligned}
Q_o &= 0.5 \sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N) \sum_{t=1}^{N}\Big[\log |D_{s_t}| - D_{s_t}\big(o_t - H_{s_t} x_t[i_t] - h_{s_t}\big)^2\Big] \\
&= 0.5 \sum_{s=1}^{S}\sum_{i=1}^{C} \sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N) \sum_{t=1}^{N}\Big[\log |D_{s_t}| - D_{s_t}\big(o_t - H_{s_t} x_t[i] - h_{s_t}\big)^2\Big]\,\delta_{s_t s}\,\delta_{i_t i} \\
&= 0.5 \sum_{s=1}^{S}\sum_{i=1}^{C}\sum_{t=1}^{N} \Big[\sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N)\,\delta_{s_t s}\,\delta_{i_t i}\Big] \Big[\log |D_s| - D_s\big(o_t - H_s x_t[i] - h_s\big)^2\Big].
\end{aligned}
\]
Noting that
\[
\sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N)\,\delta_{s_t s}\,\delta_{i_t i} = p(s_t = s,\, i_t = i \mid o_1^N) = \gamma_t(s, i),
\]
Q_o simplifies to
\[
Q_o = 0.5 \sum_{s=1}^{S}\sum_{i=1}^{C}\sum_{t=1}^{N} \gamma_t(s, i)\,\Big[\log |D_s| - D_s\big(o_t - H_s x_t[i] - h_s\big)^2\Big]. \tag{4.10}
\]
Similarly, for the state-equation term,
\[
Q_x = 0.5 \sum_{s=1}^{S}\sum_{i=1}^{C}\sum_{j=1}^{C} \sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N) \sum_{t=1}^{N}\Big[\log |B_{s_t}| - B_{s_t}\big(x_t[i] - r_{s_t} x_{t-1}[j] - (1 - r_{s_t}) T_{s_t}\big)^2\Big]\,\delta_{s_t s}\,\delta_{i_t i}\,\delta_{i_{t-1} j}.
\]
Now noting that
\[
\sum_{s_1^N}\sum_{i_1^N} p(s_1^N, i_1^N \mid o_1^N)\,\delta_{s_t s}\,\delta_{i_t i}\,\delta_{i_{t-1} j} = p(s_t = s,\, i_t = i,\, i_{t-1} = j \mid o_1^N) = \xi_t(s, i, j),
\]
Q_x simplifies to
\[
Q_x = 0.5 \sum_{s=1}^{S}\sum_{i=1}^{C}\sum_{j=1}^{C}\sum_{t=1}^{N} \xi_t(s, i, j)\,\Big[\log |B_s| - B_s\big(x_t[i] - r_s x_{t-1}[j] - (1 - r_s) T_s\big)^2\Big]. \tag{4.11}
\]
Note that a large computational saving can be achieved by limiting the summations in
Eq. (4.11) over i, j based on the relative smoothness of the hidden dynamics. That is, the range of
i, j can be limited such that |x_t[i] − x_{t−1}[j]| < Th, where Th is an empirically set threshold
that controls the trade-off between computation cost and accuracy.
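A sketch of this pruning is given below: it enumerates only the (i, j) index pairs whose quantized hidden-dynamic values differ by less than the threshold Th. The quantization grid and the threshold value are illustrative.

```python
import numpy as np

# A sketch of the pruning discussed above: for the summations over the
# discretization indices i, j in Eq. (4.11), only pairs with
# |x_t[i] - x_{t-1}[j]| < Th are retained. The quantized values and the
# threshold are illustrative.
x_levels = np.linspace(0.0, 1.0, 21)      # C = 21 quantized hidden-dynamic values
Th = 0.15                                 # empirically set smoothness threshold

pairs = [(i, j)
         for i in range(len(x_levels))
         for j in range(len(x_levels))
         if abs(x_levels[i] - x_levels[j]) < Th]

print(f"kept {len(pairs)} of {len(x_levels) ** 2} (i, j) pairs")
```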
In Eqs. (4.11) and (4.10), we used ξ_t(s, i, j) and γ_t(s, i) to denote the single-frame posteriors
\[
\xi_t(s, i, j) \equiv p(s_t = s,\, x_t[i],\, x_{t-1}[j] \mid o_1^N)
\]
and
\[
\gamma_t(s, i) \equiv p(s_t = s,\, x_t[i] \mid o_1^N).
\]
These can be computed efficiently using the generalized forward–backward algorithm (part of
the E-step), which we describe below.
\[
\alpha_t(s, i) \equiv p(o_1^t,\, s_t = s,\, i_t = i).
\]
The forward recursion for computing it is
\[
\alpha(s_{t+1}, i_{t+1}) = \sum_{s_t=1}^{S}\sum_{i_t=1}^{C} \alpha(s_t, i_t)\, p(s_{t+1}, i_{t+1} \mid s_t, i_t)\, p(o_{t+1} \mid s_{t+1}, i_{t+1}). \tag{4.12}
\]
In Eq. (4.12), p(o t+1 | s t+1 , it+1 ) is determined by the observation equation:
and p(s t+1 , it+1 | s t , it ) is determined by the state equation (with order one) and the switching
Markov chain’s transition probabilities:
\[
\begin{aligned}
\gamma(s_t, i_t) &\equiv p(s_t, i_t \mid o_1^N) \\
&= \sum_{s_{t+1}}\sum_{i_{t+1}} p(s_t, i_t, s_{t+1}, i_{t+1} \mid o_1^N) \\
&= \sum_{s_{t+1}}\sum_{i_{t+1}} p(s_t, i_t \mid s_{t+1}, i_{t+1}, o_1^N)\; p(s_{t+1}, i_{t+1} \mid o_1^N) \\
&= \sum_{s_{t+1}}\sum_{i_{t+1}} p(s_t, i_t \mid s_{t+1}, i_{t+1}, o_1^t)\; \gamma(s_{t+1}, i_{t+1}),
\end{aligned} \tag{4.15}
\]
where the last step uses conditional independence, and where α(s t , it ) and p(s t+1 , it+1 | s t , it )
on the right-hand side of Eq. (4.15) have been computed already in the forward recursion.
Initialization for the above γ recursion is γ (s N , i N ) = α(s N , i N ), which will be equal to 1 for
the left-to-right model of phonetic strings.
Given this result, ξt (s , i, j ) can be computed directly using α(s t , it ) and γ (s t , it ). Both of
them are already computed from the forward–backward recursions described above.
Alternatively, we can compute β generalized recursion (not discussed here) and then
combine αs and βs to obtain γt (s , i) and ξt (s , i, j ).
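The sketch below illustrates the generalized forward recursion over the joint space of phonological states and quantization indices, using the factored transition (phone transition probability times the discretized first-order state-equation Gaussian) and the Gaussian observation term of Eq. (4.2). All parameters and the observation sequence are made up, and no scaling or log-domain arithmetic is included, so it is a schematic of the computation rather than an implementation from the book.

```python
import numpy as np

# A sketch of the generalized forward recursion over (phonological state s,
# quantization index i): the transition term combines pi_{s's} with the
# discretized first-order state equation, and the observation term is the
# Gaussian of Eq. (4.2). All parameters and observations are illustrative.
rng = np.random.default_rng(4)

S, C = 3, 11                                  # number of phones and quantization levels
x = np.linspace(0.0, 1.0, C)                  # quantized hidden-dynamic values x[i]
pi = np.full((S, S), 0.1) + 0.7 * np.eye(S)   # phone transition matrix
pi /= pi.sum(axis=1, keepdims=True)
r = np.array([0.8, 0.9, 0.7])                 # per-phone "time constants" r_s
T = np.array([0.2, 0.8, 0.5])                 # per-phone targets T_s
B = 200.0                                     # state-noise precision
H, h, D = 1.0, 0.0, 50.0                      # observation mapping and precision
obs = rng.random(20)                          # a made-up scalar observation sequence

def gauss(v, mean, prec):
    return np.sqrt(prec / (2 * np.pi)) * np.exp(-0.5 * prec * (v - mean) ** 2)

alpha = np.full((S, C), 1.0 / (S * C)) * gauss(obs[0], H * x + h, D)
for o_t in obs[1:]:
    new = np.zeros((S, C))
    for s_new in range(S):
        # state-equation term: N(x[i]; r_s x[j] + (1 - r_s) T_s, B_s)
        dyn = gauss(x[:, None], r[s_new] * x[None, :] + (1 - r[s_new]) * T[s_new], B)
        for s_old in range(S):
            new[s_new] += pi[s_old, s_new] * (dyn * alpha[s_old][None, :]).sum(axis=1)
        new[s_new] *= gauss(o_t, H * x + h, D)
    alpha = new

# In practice scaling or log-domain arithmetic would be used to avoid underflow.
print("log-likelihood ~", np.log(alpha.sum()))
```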
\[
\frac{\partial Q_o(H_s, h_s, D_s)}{\partial h_s} = -D_s \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\,\big\{o_t - H_s x_t[i] - h_s\big\} = 0, \tag{4.16}
\]
and
\[
\frac{\partial Q_o(H_s, h_s, D_s)}{\partial H_s} = -D_s \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\,\big\{o_t - H_s x_t[i] - h_s\big\}\, x_t[i] = 0. \tag{4.17}
\]
where
\[
U = \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\, x_t[i], \tag{4.20}
\]
\[
V_1 = N, \tag{4.21}
\]
\[
C_1 = \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\, o_t, \tag{4.22}
\]
\[
V_2 = \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\, x_t^2[i], \tag{4.23}
\]
\[
C_2 = \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\, o_t\, x_t[i]. \tag{4.24}
\]
The solution is
\[
\begin{bmatrix} \hat{H}_s \\ \hat{h}_s \end{bmatrix} =
\begin{bmatrix} U & V_1 \\ V_2 & U \end{bmatrix}^{-1}
\begin{bmatrix} C_1 \\ C_2 \end{bmatrix}. \tag{4.25}
\]
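A compact sketch of this M-step update for a single phonological state is given below: it accumulates U, V_1, C_1, V_2, and C_2 from (made-up) posteriors γ_t(s, i) and solves the 2 × 2 system of Eq. (4.25).

```python
import numpy as np

# A sketch of the M-step update of Eqs. (4.20)-(4.25) for one phonological state:
# accumulate the sufficient statistics U, V1, C1, V2, C2 from the posteriors
# gamma_t(s, i) and solve the 2 x 2 linear system for H_s and h_s.
# The posteriors, quantized values, and observations are made-up scalars.
rng = np.random.default_rng(5)

N, C = 50, 11
x = np.linspace(0.0, 1.0, C)              # quantized hidden-dynamic values x[i]
gamma = rng.random((N, C))                # gamma_t(s, i) for this state
gamma /= gamma.sum(axis=1, keepdims=True)
obs = rng.random(N)                       # o_t (scalar observations)

U = (gamma * x[None, :]).sum()
V1 = N
C1 = (gamma.sum(axis=1) * obs).sum()
V2 = (gamma * x[None, :] ** 2).sum()
C2 = (gamma * x[None, :]).sum(axis=1) @ obs

H_hat, h_hat = np.linalg.solve(np.array([[U, V1], [V2, U]]),
                               np.array([C1, C2]))
print("H_s =", round(H_hat, 4), " h_s =", round(h_hat, 4))
```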
Setting ∂Q_x/∂r_s = 0 gives
\[
\hat{r}_s = \Big[\sum_{t=1}^{N}\sum_{i=1}^{C}\sum_{j=1}^{C} \xi_t(s, i, j)\,(T_s - x_{t-1}[j])^2\Big]^{-1}
\times \sum_{t=1}^{N}\sum_{i=1}^{C}\sum_{j=1}^{C} \xi_t(s, i, j)\,(T_s - x_{t-1}[j])(T_s - x_t[i]), \tag{4.27}
\]
and setting ∂Q_x/∂T_s = 0 gives
\[
\hat{T}_s = \frac{1}{1 - r_s} \sum_{t=1}^{N}\sum_{i=1}^{C}\sum_{j=1}^{C} \xi_t(s, i, j)\,\big(x_t[i] - r_s\, x_{t-1}[j]\big). \tag{4.29}
\]
Setting
\[
\frac{\partial Q_x(r_s, T_s, B_s)}{\partial B_s} = 0.5 \sum_{t=1}^{N}\sum_{i=1}^{C}\sum_{j=1}^{C} \xi_t(s, i, j)\,\Big[B_s^{-1} - \big(x_t[i] - r_s\, x_{t-1}[j] - (1 - r_s) T_s\big)^2\Big] = 0,
\]
Similarly, setting
\[
\frac{\partial Q_o(H_s, h_s, D_s)}{\partial D_s} = 0.5 \sum_{t=1}^{N}\sum_{i=1}^{C} \gamma_t(s, i)\,\Big[D_s^{-1} - \big(o_t - H_s x_t[i] - h_s\big)^2\Big] = 0,
\]
Note that each δt (s , i) defined here is associated with a node in a three-dimensional trellis
diagram. Each increment in time corresponds to reaching a new stage in dynamic programming
(DP). At the final stage t = N, we have the objective function of δ N (s , i) that is accomplished
via all the previous stages of computation for t ≤ N − 1. On the basis of the DP optimality
principle, the optimal (joint) partial likelihood at the processing stage of t + 1 can be computed
using the following DP recursion:
\[
\begin{aligned}
\delta_{t+1}(s, i) &= \max_{s', i'} \delta_t(s', i')\, p(s_{t+1} = s, i_{t+1} = i \mid s_t = s', i_t = i')\, p(o_{t+1} \mid s_{t+1} = s, i_{t+1} = i) \\
&\approx \max_{s', i'} \delta_t(s', i')\, p(s_{t+1} = s \mid s_t = s')\, p(i_{t+1} = i \mid i_t = i')\, p(o_{t+1} \mid s_{t+1} = s, i_{t+1} = i) \\
&= \max_{s', i'} \delta_t(s', i')\, \pi_{s's}\, N\big(x_{t+1}[i];\, r_s x_t[i'] + (1 - r_s) T_s,\, B_s\big)\, N\big(o_{t+1};\, H_s x_{t+1}[i] + h_s,\, D_s\big),
\end{aligned} \tag{4.33}
\]
for all states s and for all quantization indices i. Each pair of (s , i) at this processing stage is
a hypothesized “precursor” node in the global optimal path. All such nodes except one will be
eventually eliminated after the backtracking operation. The essence of DP used here is that we
only need to compute the quantities of δt+1 (s , i) as individual nodes in the trellis, removing
the need to keep track of a very large number of partial paths from the initial stage to the
current (t + 1)th stage, which would be required for the exhaustive search. The optimality is
x_t = r_s x_{t−1} + (1 − r_s) T_s + w_t(s),

is generalized to the following second-order state equation:

x_t = 2 r_s x_{t−1} − r_s² x_{t−2} + (1 − r_s)² T_s + w_t(s).   (4.34)

Here, as for the first-order state equation, the state noise w_t(s) ∼ N(w_t; 0, B_s) is assumed to be IID,
zero-mean Gaussian with state (s)-dependent precision B_s. And again, T_s is the target parameter
that serves as the “attractor” drawing the time-varying hidden dynamic variable toward it within
each phonological unit denoted by s .
It is easy to verify that this second-order state equation, as for the first-order one, has the
desirable properties of target directedness and monotonicity. However, the trajectory implied
by the second-order recursion is more realistic than that by the earlier first-order one. The
new trajectory has critically damped trajectory shaping, while the first-order trajectory has
exponential shaping. Detailed behaviors of the respective trajectories are controlled by the
parameter r s in both the cases. For analysis of such behaviors, see [33, 54].
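The noise-free sketch below contrasts the two recursions for hypothetical values of r and T: both approach the target, but the second-order path shows the critically damped shaping rather than the purely exponential one.

```python
import numpy as np

# A sketch contrasting the first-order recursion x_t = r x_{t-1} + (1 - r) T
# with the second-order recursion x_t = 2 r x_{t-1} - r^2 x_{t-2} + (1 - r)^2 T
# (noise omitted). Both are target-directed; the second-order path has the
# critically damped shaping described in the text. r and T are illustrative.
r, T, x0, steps = 0.9, 1.0, 0.0, 30

first = [x0]
for _ in range(steps):
    first.append(r * first[-1] + (1 - r) * T)

second = [x0, x0]
for _ in range(steps):
    second.append(2 * r * second[-1] - r ** 2 * second[-2] + (1 - r) ** 2 * T)

for t in range(0, steps + 1, 5):
    print(f"t={t:2d}  first-order={first[t]:.3f}  second-order={second[t + 1]:.3f}")
```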
The explicit probabilistic form of the state equation (4.34) is
\[
p(x_t \mid x_{t-1}, x_{t-2}, s_t = s) = N\big(x_t;\; 2 r_s x_{t-1} - r_s^2 x_{t-2} + (1 - r_s)^2 T_s,\; B_s\big). \tag{4.35}
\]
Note that the conditioning event is both x_{t−1} and x_{t−2}, instead of just x_{t−1} as in the first-order case.
After discretizing the hidden dynamic variable xt , the observation equation (4.38) is approxi-
mated by
p(o t | xt [i], s t = s ) ≈ N(o t ; F(xt [i]) + h s , Ds ). (4.39)
Combining this with Eq. (4.35), we have the joint probability model:
\[
\begin{aligned}
p(s_1^N, x_1^N, o_1^N) &= \prod_{t=1}^{N} \pi_{s_{t-1} s_t}\; p(x_t \mid x_{t-1}, x_{t-2}, s_t)\; p(o_t \mid x_t, s_t = s) \\
&\approx \prod_{t=1}^{N} \pi_{s_{t-1} s_t}\; N\big(x[i_t];\; 2 r_s x[i_{t-1}] - r_s^2 x[i_{t-2}] + (1 - r_s)^2 T_s,\; B_s\big)\; N\big(o_t;\; F(x[i_t]) + h_s,\; D_s\big),
\end{aligned} \tag{4.40}
\]
where
\[
\mathbf{f} = \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_P \end{pmatrix}
\quad \text{and} \quad
\mathbf{b} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_P \end{pmatrix}.
\]
o ≈ F(x).
Depending on the type of the acoustic measurements as the output in the mapping function,
closed-form computation for F(x) may be impossible, or its in-line computation may be too
expensive. To overcome these difficulties, we may quantize each dimension of x over a range
of frequencies or bandwidths, and then compute C(x) for every quantized vector value of x.
This will be made especially effective when a closed form of the nonlinear function can be
established. We will next show that when the output of the nonlinear function becomes linear
cepstra, a closed form can be easily derived.
z_p = exp(−π b_p / f_s + j 2π f_p / f_s),   (4.41)

where f_s is the sampling frequency. The transfer function with P poles and a gain of G is
\[
H(z) = G \prod_{p=1}^{P} \frac{1}{(1 - z_p z^{-1})(1 - z_p^{*} z^{-1})}. \tag{4.42}
\]
\[
c_n = \sum_{p=1}^{P} \frac{z_p^{n} + z_p^{*n}}{n}, \qquad n > 0, \tag{4.45}
\]
and c_0 = log G.
Using Eq. (4.41) to expand and simplify Eq. (4.45), we obtain the final form of the
nonlinear function (for n > 0):
\[
\begin{aligned}
c_n &= \frac{1}{n} \sum_{p=1}^{P} \Big( e^{-\pi n \frac{b_p}{f_s} + j 2\pi n \frac{f_p}{f_s}} + e^{-\pi n \frac{b_p}{f_s} - j 2\pi n \frac{f_p}{f_s}} \Big) \\
&= \frac{1}{n} \sum_{p=1}^{P} e^{-\pi n \frac{b_p}{f_s}} \Big( e^{j 2\pi n \frac{f_p}{f_s}} + e^{-j 2\pi n \frac{f_p}{f_s}} \Big) \\
&= \frac{1}{n} \sum_{p=1}^{P} e^{-\pi n \frac{b_p}{f_s}} \Big[ \cos\Big(2\pi n \frac{f_p}{f_s}\Big) + j \sin\Big(2\pi n \frac{f_p}{f_s}\Big) + \cos\Big(2\pi n \frac{f_p}{f_s}\Big) - j \sin\Big(2\pi n \frac{f_p}{f_s}\Big) \Big] \\
&= \frac{2}{n} \sum_{p=1}^{P} e^{-\pi n \frac{b_p}{f_s}} \cos\Big(2\pi n \frac{f_p}{f_s}\Big).
\end{aligned} \tag{4.46}
\]
Here, c n constitutes each of the elements in the vector-valued output of the nonlinear
function F(x).
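The closed-form mapping of Eq. (4.46) is easy to compute directly, as in the sketch below; the VTR frequencies and bandwidths used are typical illustrative values, not values from the book. The last line reproduces the peak first-order cepstral values quoted in the discussion of Fig. 4.1.

```python
import numpy as np

# A sketch of the closed-form VTR-to-cepstrum mapping of Eq. (4.46):
#   c_n = (2/n) * sum_p exp(-pi * n * b_p / f_s) * cos(2 * pi * n * f_p / f_s).
# The four resonance frequencies/bandwidths below are illustrative values.
def vtr_to_cepstrum(f, b, n_orders=15, fs=16000.0):
    """Map VTR frequencies f and bandwidths b (Hz) to linear cepstra c_1..c_{n_orders}."""
    f, b = np.asarray(f, float), np.asarray(b, float)
    n = np.arange(1, n_orders + 1)[:, None]           # cepstral orders, one per row
    return (2.0 / n[:, 0]) * (np.exp(-np.pi * n * b / fs)
                              * np.cos(2 * np.pi * n * f / fs)).sum(axis=1)

f = [500.0, 1500.0, 2500.0, 3500.0]    # F1-F4 in Hz (illustrative)
b = [60.0, 80.0, 100.0, 120.0]         # bandwidths in Hz (illustrative)
print(np.round(vtr_to_cepstrum(f, b), 4))

# Peak first-order cepstral values for a single resonance at fs = 8000 Hz,
# for bandwidths of 20 Hz and 800 Hz (compare with the text's 1.9844 and 1.4608).
print(round(2 * np.exp(-20 * np.pi / 8000), 4), round(2 * np.exp(-800 * np.pi / 8000), 4))
```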
FIGURE 4.1: First-order cepstral value of a one-pole (single-resonance) filter as a function of the
resonance frequency and bandwidth. This plots the value of one term in Eq. (4.46) vs. f p and b p with
fixed n = 1 and f s = 8000 Hz
the decomposition property of the linear cepstrum, for multiple-resonance systems, the corre-
sponding cepstrum is simply a sum of those for the single-resonance systems.
Examining Figs. 4.1–4.3, we easily observe some key properties of the (single-resonance)
cepstrum. First, the mapping function from the VTR frequency and bandwidth variables to the
cepstrum, while nonlinear, is well behaved. That is, the relationship is smooth, and there is no
sharp discontinuity. Second, for a fixed resonance bandwidth, the frequency of the sinusoidal
relation between the cepstrum and the resonance frequency increases as the cepstral order
increases. The implication is that when piecewise linear functions are to be used to approximate
the nonlinear function of Eq. (4.46), more “pieces” will be needed for the higher-order than
for the lower-order cepstra. Third, for a fixed resonance frequency, the dependence of the low-
order cepstral values on the resonance bandwidth is relatively weak. The cause of this weak
dependence is the low ratio of the bandwidth (up to 800 Hz) to the sampling frequency (e.g.,
16 000 Hz) in the exponent of the cepstral expression in Eq. (4.46). For example, as shown
in Fig. 4.1 for the first-order cepstrum, the extreme values of bandwidths from 20 to 800 Hz
P1: IML/FFX P2: IML
MOBK024-04 MOBK024-LiDeng.cls May 30, 2006 15:30
FIGURE 4.2: Second-order cepstral value of a one-pole (single-resonance) filter as a function of the resonance frequency and bandwidth (n = 2 and f_s = 8000 Hz)
reduce the peak cepstral values only from 1.9844 to 1.4608 (computed by 2 exp(−20π/8000) and
2 exp(−800π/8000), respectively). The corresponding reduction for the second-order cepstrum
is from 0.9844 to 0.5335 (computed by exp(−2 × 20π/8000) and exp(−2 × 800π/8000),
respectively). In general, the exponential decay of the cepstral value, as the resonance bandwidth
increases, becomes only slightly more rapid for the higher-order than for the lower-order cepstra
(see Fig. 4.3). This weak dependence is desirable since the VTR bandwidths are known to
be highly variable with respect to the acoustic environment [120], and to be less correlated
with the phonetic content of speech and with human speech perception than are the VTR
frequencies.
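The numbers quoted above are easy to verify. A minimal check, assuming the peak of a single term of Eq. (4.46) over the resonance frequency is (2/n)·exp(−πn b_p/f_s) with f_s = 8000 Hz, reproduces them:

```python
import math

def peak_cepstrum(n, b_hz, fs_hz=8000.0):
    # Peak of one term of Eq. (4.46) over resonance frequency: (2/n)*exp(-pi*n*b/fs)
    return (2.0 / n) * math.exp(-math.pi * n * b_hz / fs_hz)

print(round(peak_cepstrum(1, 20), 4), round(peak_cepstrum(1, 800), 4))   # 1.9844 1.4608
print(round(peak_cepstrum(2, 20), 4), round(peak_cepstrum(2, 800), 4))   # 0.9844 0.5335
```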
FIGURE 4.3: Fifth-order cepstral value of a one-pole (single-resonance) filter as a function of the resonance frequency and bandwidth (n = 5 and f_s = 8000 Hz)
In the implementation, the lowest four VTR frequencies and their bandwidths are used, as they carry the most important phonetic information of the speech signal. That is, an eight-dimensional vector
x = (f_1, f_2, f_3, f_4, b_1, b_2, b_3, b_4) is used as the input to the nonlinear function F(x). For the
output of the nonlinear function, up to 15 orders of linear cepstra are used. The zeroth order
cepstrum, c 0 , is excluded from the output vector, making the nonlinear mapping from VTRs
to cepstra independent of the energy level in the speech signal. This corresponds to setting the
gain G = 1 in the all-pole model of Eq. (4.42).
For each of the eight dimensions in the VTR vector, scalar quantization is used. Since
F(x) is relevant to all possible phones in speech, the appropriate range is chosen for each VTR
frequency and its corresponding bandwidth to cover all phones according to the considerations
discussed in [9]. Table 4.1 lists the range, from minimal to maximal frequencies in Hz, for
each of the four VTR frequencies and bandwidths. It also lists the corresponding number of
quantization levels used. Bandwidths are quantized uniformly with five levels while frequencies
are mapped to the Mel-frequency scale and then uniformly quantized with 20 levels. The
total number of quantization levels shown in Table 4.1 yields a total of 100 million (20^4 × 5^4) entries for F(x), but because of the constraint f_1 < f_2 < f_3 < f_4, the resulting number is reduced by about 25%.

TABLE 4.1: Quantization Scheme for the VTR Variables, Including the Ranges of the Four VTR Frequencies and Bandwidths and the Corresponding Numbers of Quantization Levels
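A rough sketch of this scalar-quantization scheme is given below. The per-formant frequency ranges used here are placeholders rather than the actual entries of Table 4.1; the sketch only illustrates the Mel-warped uniform quantization of the frequencies, the uniform quantization of the bandwidths, and how the ordering constraint f_1 < f_2 < f_3 < f_4 prunes the quantized entries of F(x):

```python
import numpy as np
from itertools import product

def mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def freq_levels(f_min, f_max, n_levels=20):
    # Uniform quantization on the Mel scale, mapped back to Hz
    return inv_mel(np.linspace(mel(f_min), mel(f_max), n_levels))

# Placeholder ranges (Hz); the actual per-formant ranges are those of Table 4.1
freq_grids = [freq_levels(lo, hi) for lo, hi in
              [(200, 900), (600, 2800), (1400, 3800), (2500, 4500)]]
band_grids = [np.linspace(20, 800, 5) for _ in range(4)]

total = int(np.prod([len(g) for g in freq_grids + band_grids]))   # 20^4 * 5^4
valid_freq = sum(1 for f in product(*freq_grids) if f[0] < f[1] < f[2] < f[3])
valid = valid_freq * int(np.prod([len(g) for g in band_grids]))
print(f"total entries: {total:,}, admissible after f1<f2<f3<f4: {valid:,}")
```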
and

Q_o(h_s, D_s) = 0.5 Σ_{s=1}^{S} Σ_{t=1}^{N} Σ_{i=1}^{C} γ_t(s, i) [ log |D_s| − D_s (o_t − F(x_t[i]) − h_s)^2 ].  (4.49)
and

γ_t(s, i) ≡ p(s_t = s, x_t[i] | o_1^N).
Note that ξt (s , i, j, k) has one more index k than the counterpart in the basic model. This is
due to the additional conditioning in the second-order state equation.
Similar to the basic model, in order to compute ξ_t(s, i, j, k) and γ_t(s, i), we need to compute the forward and backward probabilities by recursion. The forward recursion for α_t(s, i) ≡ p(o_1^t, s_t = s, i_t = i) is

α_{t+1}(s_{t+1}, i_{t+1}) = Σ_{s_t=1}^{S} Σ_{i_t=1}^{C} α_t(s_t, i_t) p(s_{t+1}, i_{t+1} | s_t, i_t, i_{t−1}) p(o_{t+1} | s_{t+1}, i_{t+1}),  (4.50)

and the backward recursion is

β_t(s_t, i_t) = Σ_{s_{t+1}=1}^{S} Σ_{i_{t+1}=1}^{C} β_{t+1}(s_{t+1}, i_{t+1}) p(s_{t+1}, i_{t+1} | s_t, i_t, i_{t−1}) p(o_{t+1} | s_{t+1}, i_{t+1}).  (4.51)
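A schematic implementation of these recursions is sketched below. For brevity it drops the dependence on i_{t−1} in the transition term (i.e., it is a first-order simplification of Eqs. (4.50) and (4.51)) and uses placeholder transition and observation probabilities:

```python
import numpy as np

def forward_backward(trans, obs_lik):
    """trans: [S*C, S*C] transition probabilities over the joint (s, i) index,
    obs_lik: [T, S*C] observation likelihoods p(o_t | s, i).
    First-order simplification of Eqs. (4.50)-(4.51); the i_{t-1} dependence
    of the full second-order model is dropped here for brevity."""
    T, SC = obs_lik.shape
    alpha = np.zeros((T, SC))
    beta = np.ones((T, SC))
    alpha[0] = obs_lik[0] / SC                 # uniform initial state assumed
    for t in range(1, T):                      # forward pass, cf. Eq. (4.50)
        alpha[t] = (alpha[t - 1] @ trans) * obs_lik[t]
    for t in range(T - 2, -1, -1):             # backward pass, cf. Eq. (4.51)
        beta[t] = trans @ (beta[t + 1] * obs_lik[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)  # posterior gamma_t(s, i)
    return alpha, beta, gamma

# Toy example: 2 phonological states x 3 discretization levels, 5 frames
rng = np.random.default_rng(0)
A = rng.random((6, 6)); A /= A.sum(axis=1, keepdims=True)
B = rng.random((5, 6))
print(forward_backward(A, B)[2].shape)   # (5, 6)
```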
∂Q_x(r_s, T_s, B_s)/∂r_s = −B_s Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) ⋯ = 0.  (4.52)

This can be written as a third-order polynomial equation in r_s in order to solve for r_s (assuming T_s is fixed from the previous EM iteration), with coefficients

A_3 = Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) { x_{t−2}^2[k] − 2T_s x_{t−2}[k] + T_s^2 },

A_2 = Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) { −3 x_{t−1}[j] x_{t−2}[k] + 3T_s x_{t−1}[j] + 3T_s x_{t−2}[k] − 3T_s^2 },

A_1 = Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) { 2 x_{t−1}^2[j] + x_t[i] x_{t−2}[k] − x_t[i] T_s + ⋯ }.
Analytic solutions exist for third-order algebraic equations such as the above. Among the three roots found, the constraint 0 < r_s < 1 can be used to select the appropriate one. If more than one root satisfies the constraint, we select the one that gives the largest value of Q_x.
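Numerically, the root selection can be done as in the following sketch, assuming the coefficients of the cubic in r_s (highest order first) have already been accumulated from the posteriors ξ_t(s, i, j, k); the coefficient values shown are placeholders:

```python
import numpy as np

def select_rs(coeffs, q_x=None):
    """coeffs: polynomial coefficients of the cubic in r_s, highest order first
    (accumulated from the posteriors xi_t(s,i,j,k); values below are placeholders).
    Returns the real root in (0, 1); if several qualify, the one maximizing Q_x."""
    roots = np.roots(coeffs)
    real = [r.real for r in roots if abs(r.imag) < 1e-8 and 0.0 < r.real < 1.0]
    if not real:
        raise ValueError("no admissible root in (0, 1)")
    if len(real) == 1 or q_x is None:
        return real[0]
    return max(real, key=q_x)          # pick the root with the largest Q_x

# Placeholder coefficients, for illustration only
print(select_rs([1.0, -2.1, 1.4, -0.28]))
```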
∂Q_x(r_s, T_s, B_s)/∂T_s = −B_s Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) [ x_t[i] − 2r_s x_{t−1}[j] + r_s^2 x_{t−2}[k] − (1 − r_s)^2 T_s ] (1 − r_s)^2 = 0.  (4.55)

Now fixing r_s from the previous EM iteration, we obtain an explicit solution for the reestimate of T_s:

T̂_s = [ Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) { x_t[i] − 2r_s x_{t−1}[j] + r_s^2 x_{t−2}[k] } ] / [ (1 − r_s)^2 Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) ].
∂Q_o(h_s, D_s)/∂h_s = −D_s Σ_{t=1}^{N} Σ_{i=1}^{C} γ_t(s, i) { o_t − F(x_t[i]) − h_s } = 0.  (4.56)

∂Q_x(r_s, T_s, B_s)/∂B_s = 0.5 Σ_{t=1}^{N} Σ_{i=1}^{C} Σ_{j=1}^{C} Σ_{k=1}^{C} ξ_t(s, i, j, k) [ B_s^{−1} − (x_t[i] − 2r_s x_{t−1}[j] + r_s^2 x_{t−2}[k] − (1 − r_s)^2 T_s)^2 ] = 0,  (4.58)

∂Q_o(H_s, h_s, D_s)/∂D_s = 0.5 Σ_{t=1}^{N} Σ_{i=1}^{C} γ_t(s, i) [ D_s^{−1} − (o_t − H_s x_t[i] − h_s)^2 ] = 0,
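Solving Eq. (4.56) and the corresponding D_s condition yields simple weighted averages over the posteriors. A sketch for one scalar observation dimension (using F(x_t[i]) as in Eq. (4.56), and treating D_s as a precision, i.e., an inverse variance, consistent with the notation above; all arrays are placeholders) is:

```python
import numpy as np

def reestimate_obs_params(gamma, o, F_x):
    """gamma: [T, C] posteriors gamma_t(s, i) for a fixed state s,
    o: [T] scalar observations, F_x: [C] values of F(x[i]) at the quantized levels.
    Returns (h_s, D_s) by solving the zero-gradient conditions, with D_s a precision."""
    resid = o[:, None] - F_x[None, :]          # o_t - F(x_t[i])
    w = gamma.sum()
    h_s = (gamma * resid).sum() / w            # weighted mean residual
    var = (gamma * (resid - h_s) ** 2).sum() / w
    return h_s, 1.0 / var                      # precision D_s

rng = np.random.default_rng(1)
g = rng.random((50, 8)); g /= g.sum(axis=1, keepdims=True)
print(reestimate_obs_params(g, rng.normal(size=50), rng.normal(size=8)))
```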
δ_{t+1}(s, i) = max_{s′, i′} δ_t(s′, i′) p(s_{t+1} = s, i_{t+1} = i | s_t = s′, i_t = i′) p(o_{t+1} | s_{t+1} = s, i_{t+1} = i)
            ≈ max_{s′, i′} δ_t(s′, i′) p(s_{t+1} = s | s_t = s′) p(i_{t+1} = i | i_t = i′) p(o_{t+1} | s_{t+1} = s, i_{t+1} = i)
            = max_{s′, i′, j, k} δ_t(s′, i′) π_{s′ s} N(x_{t+1}[i]; 2r_s x_t[j] − r_s^2 x_{t−1}[k] + (1 − r_s)^2 T_s, B_s)
                 × N(o_{t+1}; F(x_{t+1}[i]) + h_s, D_s).  (4.61)
Note that the VTR-to-cepstrum mapping function, which was derived to be Eq. (4.46) as the observation equation of the dynamic speech model (extended model), has this decomposable form. The greedy optimization technique proceeds as follows. First, initialize α_m, m = 1, 2, . . . , M, to reasonable values. Then, fix all α_m's except one, say α_n, and optimize α_n with respect to the new objective function

F − Σ_{m=1}^{n−1} F_m(α_m) − Σ_{m=n+1}^{M} F_m(α_m).

Next, after the low-dimensional, inexpensive search problem for α̂_n is solved, fix it and optimize a new α_m, m ≠ n. Repeat this for all α_m's. Finally, iterate the above process until all optimized α_m's become stabilized.
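In code, this greedy coordinate-wise search can be sketched as follows, with a generic decomposable objective and a per-dimension grid search standing in for the low-dimensional, inexpensive search mentioned above; the grids and the toy objective are placeholders:

```python
import numpy as np

def greedy_optimize(objective, grids, n_sweeps=3):
    """Coordinate-wise greedy maximization of objective(alpha), where each alpha_m
    is restricted to the candidate values in grids[m].  All alpha_m are fixed except
    one, which is optimized by exhaustive search; the sweep is repeated a few times
    (the text reports two to three sweeps are enough to stabilize the estimates)."""
    alpha = np.array([g[0] for g in grids], dtype=float)   # simple initialization
    for _ in range(n_sweeps):
        for m, grid in enumerate(grids):
            scores = []
            for cand in grid:
                trial = alpha.copy(); trial[m] = cand
                scores.append(objective(trial))
            alpha[m] = grid[int(np.argmax(scores))]
    return alpha

# Toy decomposable objective (sum of per-dimension terms), 4 "resonances"
obj = lambda a: -np.sum((a - np.array([500., 1500., 2500., 3500.])) ** 2)
grids = [np.linspace(200, 4500, 40)] * 4
print(greedy_optimize(obj, grids))
```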
In the implementation of this technique for VTR tracking and parameter estimation as reported in [110], each of the P = 4 resonances is treated as a separate, noninteracting variable to optimize. It was found that only two to three of the overall iterations above are sufficient to stabilize the parameter estimates. (During the training of the residual parameters, these inner iterations are embedded in each of the outer EM iterations.) Also, it was found that initializing all VTR variables to zero gives virtually the same estimates as more carefully designed initialization schemes.
With the use of the above greedy, suboptimal technique instead of full optimal search,
the computation cost of VTR tracking was reduced by over 4000-fold compared with the
brute-force implementation of the algorithms.
FIGURE 4.4: VTR tracking by setting the residual mean vector to zero
To examine the quantitative behavior of the residual parameter training, we list the log-
likelihood score as a function of the EM iteration number in Table 4.2. Three iterations of the
training appear to have reached the EM convergence. When we examine the VTR tracking
results after 5 and 20 iterations, they are found to be identical to Fig. 4.7, consistent with the
near-constant converging log-likelihood score reached after three iterations of training. Note
that the regions in the utterance where the speech energy is relatively low are where consonantal constriction or closure is formed (e.g., near the time mark of 0.1 s for the /w/ constriction and near the time mark of 0.4 s for the /d/ closure). The VTR tracker gives almost as accurate estimates for the
resonance frequencies in these regions as for the vowel regions.
4.4 SUMMARY
This chapter discusses one of the two specific types of hidden dynamic models in this book,
as example implementations of the general modeling and computational scheme introduced in
Chapter 2. The essence of the implementation described in this chapter is the discretization of the hidden dynamic variables.
TABLE 4.2: Log-Likelihood Score as a Function of the EM Iteration Number

EM ITERATION NO.    LOG-LIKELIHOOD SCORE
0                   1.7680
1                   2.0813
2                   2.0918
3                   2.1209
5                   2.1220
20                  2.1222
CHAPTER 5
The preceding chapter discussed the implementation strategy for hidden dynamic models based
on discretizing the hidden dynamic values. This permits tractable but approximate learning of
the model parameters and decoding of the discrete hidden states (both phonological states
and discretized hidden dynamic “states”). This chapter elaborates on another implementation
strategy where the continuous-valued hidden dynamics remain unchanged but a different type of
approximation is used. This implementation strategy assumes fixed discrete-state (phonological
unit) boundaries, which may be obtained initially from a simpler speech model set such as the
HMMs and then be further refined after the dynamic model is learned iteratively. We will
describe this new implementation and approximation strategy for a hidden trajectory model
(HTM) where the hidden dynamics are defined as an explicit function of time instead of by
recursion. Other types of approximation developed for the recursively defined dynamics can be
found in [84, 85, 121–123] and will not be described in this book.
This chapter extracts, reorganizes, and expands the materials published in [109,115,116,
124], fitting these materials into the general theme of dynamic speech modeling in this book.
Thus,

c(γ) ≈ (1 − γ) / (1 + γ − 2γ^{D+1}).  (5.4)
The input to the above FIR filter as a linear system is the target sequence, which is a function of discrete time and is subject to abrupt jumps at the phone segments' boundaries. Mathematically, the input is represented as a sequence of stepwise constant functions with variable durations and heights:

t(k) = Σ_{i=1}^{I} [ u(k − k^l_{s_i}) − u(k − k^r_{s_i}) ] t_{s_i},  (5.5)

where u(k) is the unit step function, k^r_s, s = s_1, s_2, . . . , s_I, is the right-boundary sequence of the segments (I in total) in the utterance, and k^l_s, s = s_1, s_2, . . . , s_I, is the left-boundary sequence. Note the constraint on these starting and end times: k^l_{s_{i+1}} = k^r_{s_i}. The difference of the two boundary sequences gives the duration sequence. t_s, s = s_1, s_2, . . . , s_I, are the random target vectors for segment s.
Given the filter’s impulse response and the input to the filter as the segmental VTR
target sequence t(k), the filter’s output as the model’s prediction for the VTR trajectories is the
convolution between these two signals. The result of the convolution within the boundaries of
home segment s is
z(k) = h_s(k) ∗ t(k) = c_γ Σ_{τ=k−D}^{k+D} γ_{s(τ)}^{|k−τ|} t_{s(τ)},  (5.6)
where the input target vector’s value and the filter’s stiffness vector’s value typically take not only
those associated with the current home segment, but also those associated with the adjacent
segments. The latter case happens when the time τ in Eq. (5.6) goes beyond the home segment’s
boundaries, i.e., when the segment s (τ ) occupied at time τ switches from the home segment to
an adjacent one.
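The following sketch implements the target sequence of Eq. (5.5) and the FIR filtering of Eq. (5.6), with the normalization c_γ of Eq. (5.4). The segment boundaries, targets, and stiffness values are made up for illustration, and the window is simply clamped at the utterance edges, which is an implementation choice not specified in the text:

```python
import numpy as np

def fir_vtr_trajectory(boundaries, targets, gammas, D):
    """boundaries: segment start frames plus the final end frame (length I+1),
    targets: [I, F] per-segment VTR targets, gammas: [I] per-segment stiffness.
    Computes z(k) = c_gamma * sum_{tau=k-D}^{k+D} gamma_{s(tau)}^{|k-tau|} t_{s(tau)},
    cf. Eq. (5.6), with the normalization c_gamma of Eq. (5.4)."""
    K = boundaries[-1]
    seg = np.zeros(K, dtype=int)               # segment index s(k) for each frame
    for i in range(len(targets)):
        seg[boundaries[i]:boundaries[i + 1]] = i
    z = np.zeros((K, targets.shape[1]))
    for k in range(K):
        g = gammas[seg[k]]
        c = (1.0 - g) / (1.0 + g - 2.0 * g ** (D + 1))   # Eq. (5.4)
        for tau in range(k - D, k + D + 1):
            tau_c = min(max(tau, 0), K - 1)    # clamp the window at utterance edges
            z[k] += c * gammas[seg[tau_c]] ** abs(k - tau) * targets[seg[tau_c]]
    return z

# /iy aa iy/ with made-up f1/f2 targets (Hz) and a shared stiffness of 0.85
bounds = [0, 30, 53, 80]
t = np.array([[300.0, 2300.0], [700.0, 1200.0], [300.0, 2300.0]])
z = fir_vtr_trajectory(bounds, t, gammas=np.array([0.85, 0.85, 0.85]), D=7)
print(z[25:35].round(1))                       # trajectory around the first boundary
```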
where L is the total number of phone-like HTM units as indexed by l, and f = 1, . . . , 8 denotes
four VTR frequencies and four corresponding bandwidths.
The covariance matrix in Eq. (5.7) can be similarly derived to be

Σ_z(k) = c_γ^2 Σ_{τ=k−D}^{k+D} γ_{s(τ)}^{2|k−τ|} Σ_{t_{s(τ)}}.

Approximating the covariance matrix by a diagonal one for each phone unit l, we represent its diagonal elements as a vector:

σ_z^2(k) = v_k · σ_T^2.  (5.10)
In Eqs. (5.8) and (5.10), a_k and v_k are frame (k)-dependent vectors. They are constructed for any given phone sequence and phone boundaries within the coarticulation range (2D + 1 frames) centered at frame k. Any phone unit beyond the 2D + 1 window contributes a zero value to these vectors.
F_q(k) = (2/q) Σ_{p=1}^{P} e^{−πq b_p(k)/f_samp} cos(2πq f_p(k)/f_samp),  (5.12)

where f_samp is the sampling frequency, P is the highest VTR order (P = 4), and q is the cepstral order.
We now introduce the cepstral prediction’s residual vector:
We model this residual vector as a Gaussian parameterized by residual mean vector μr s (k) and
covariance matrix Σr s (k) :
p(r_s(k) | z(k), s) = N(r_s(k); μ_{r_s(k)}, Σ_{r_s(k)}).  (5.13)
An alternative form of the distribution in Eq. (5.14) is the following “observation equa-
tion”:
where the components of the Jacobian matrix F′[·] can be computed in a closed form. For the VTR frequency components of z,

F′_q[f_p(k)] = −(4π/f_samp) e^{−πq b_p(k)/f_samp} sin(2πq f_p(k)/f_samp),  (5.16)

and the corresponding closed form for the VTR bandwidth components of z is the derivative of Eq. (5.12) with respect to b_p(k), namely −(2π/f_samp) e^{−πq b_p(k)/f_samp} cos(2πq f_p(k)/f_samp). In the current implementation, the Taylor series expansion point z_0(k) in Eq. (5.15) is taken as the tracked VTR values based on the HTM.
Substituting Eq. (5.15) into Eq. (5.14), we obtain the approximate conditional acoustic
observation probability where the mean vector μo s is expressed as a linear function of the VTR
vector z :
where

μ_{o_s}(k) = F′[z_0(k)] z(k) + F[z_0(k)] − F′[z_0(k)] z_0(k) + μ_{r_s(k)}.  (5.19)
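To make the linearization concrete, the sketch below computes F[z] from Eq. (5.12), the frequency part of the Jacobian from Eq. (5.16), and the linearized mean of Eq. (5.19) around an expansion point z_0. For simplicity it treats the bandwidths as fixed (so only the frequency columns of the Jacobian are used), which is an assumption of this illustration; all numerical values are placeholders:

```python
import numpy as np

def F_cepstra(f, b, fs, Q):
    """Eq. (5.12): linear cepstra of order q = 1..Q from VTR freqs f and bandwidths b."""
    q = np.arange(1, Q + 1)[:, None]
    return ((2.0 / q[:, 0]) *
            (np.exp(-np.pi * q * b / fs) * np.cos(2 * np.pi * q * f / fs)).sum(axis=1))

def F_jacobian_freq(f, b, fs, Q):
    """Eq. (5.16): dF_q/df_p = -(4*pi/fs) * exp(-pi*q*b_p/fs) * sin(2*pi*q*f_p/fs)."""
    q = np.arange(1, Q + 1)[:, None]
    return -(4 * np.pi / fs) * np.exp(-np.pi * q * b / fs) * np.sin(2 * np.pi * q * f / fs)

def linearized_mean(z, z0, b, mu_r, fs=16000.0, Q=15):
    """Eq. (5.19): mu_o(k) = F'[z0] z + F[z0] - F'[z0] z0 + mu_r (frequencies only here)."""
    J = F_jacobian_freq(z0, b, fs, Q)
    return J @ z + F_cepstra(z0, b, fs, Q) - J @ z0 + mu_r

b = np.array([100.0, 120.0, 150.0, 200.0])      # fixed bandwidths (illustrative)
z0 = np.array([500.0, 1500.0, 2500.0, 3500.0])  # expansion point (illustrative)
z = z0 + np.array([30.0, -40.0, 10.0, 0.0])     # a nearby VTR frequency vector
print(linearized_mean(z, z0, b, mu_r=np.zeros(15)).round(4))
```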
This then permits a closed-form solution for acoustic likelihood computation, which we
derive now.
The final results of Eqs. (5.20)–(5.22) are quite intuitive. For instance, when the Taylor series expansion point is set at z_0(k) = μ_z(k) = a_k · μ_T, Eq. (5.21) simplifies to μ̄_{o_s}(k) = F[μ_z(k)] + μ_{r_s}, which is the noise-free part of the cepstral prediction. Also, the covariance matrix in Eq. (5.20) is increased by the quantity F′[z_0(k)] Σ_z(k) (F′[z_0(k)])^T over the covariance matrix for the cepstral residual term Σ_{r_s(k)} alone. This magnitude of increase reflects the newly
introduced uncertainty in the hidden variable, measured by Σz(k). The variance amplification
factor F [z0 (k)] results from the local “slope” in the nonlinear function F[z] that maps from
the VTR vector z (k) to cepstral vector o(k).
It is also interesting to interpret the likelihood score of Eq. (5.20) as a probabilistic characterization of a temporally varying Gaussian process, where the time-varying mean vectors are
expressed in Eq. (5.21) and the time-varying covariance matrices are expressed in Eq. (5.22).
This may make the HTM look ostensibly like a nonstationary-state HMM (within the acoustic
dynamic model category). However, the key difference is that in HTM the dynamic structure
represented by the hidden VTR trajectory enters into the time-varying mean vector Eq. (5.21)
in two ways: (1) as the argument z0 (k) in the nonlinear function F[z0 (k)]; and (2) as the
term a k · μT = μz(k) in Eq. (5.21). Being closely related to the VTR tracks, they both capture
long-span contextual dependency, yet with mere context-independent VTR target parameters.
Similar properties apply to the time-varying covariance matrices in Eq. (5.22). In contrast, the
time-varying acoustic dynamic models do not have these desirable properties. For example, the
polynomial trajectory model [55, 56, 86] does regression fitting directly on the cepstral data,
exploiting no underlying speech structure and hence requiring context dependent polynomial
coefficients for representing coarticulation. Likewise, the more recent trajectory model [26] also
relies on a very large number of free model parameters to capture acoustic feature variations.
FIGURE 5.1: Spectrogram of three renditions of /iy aa iy/ by one author, with increasingly higher speaking rate and increasingly lower speaking effort. The horizontal axis is time, and the vertical axis is frequency
FIGURE 5.2: f 1 and f 2 formant or VTR frequency trajectories produced from the model for a slow /iy
aa iy/ followed by a fast /iy aa iy/. (a), (b) and (c) correspond to the use of the stiffness parameter values
of γ = 0.85, 0.75 and 0.65, respectively. The amount of formant undershooting or reduction during the
fast /aa/ is decreasing as the γ value decreases. The dashed lines indicate the formant target values and
their switch at the segment boundaries
In the simulation, slower /iy aa iy/ sounds (with the duration of /aa/ set at 230 ms or 23 frames) are followed by faster /iy aa iy/ sounds (with the duration of /aa/ set at 130 ms or 13 frames). The f_1 and f_2 targets for /iy/ and /aa/ are also set appropriately in the model. Comparing the three plots, we see the model's quantitative prediction that the magnitude of reduction in the faster /aa/ decreases as the γ value decreases.
In Figs. 5.3(a)–(c), we show the same model prediction as in Fig. 5.2 but for different
sounds /iy eh iy/, where the targets for /eh/ are much closer to those of the adjacent sound /iy/
than in the previous case for /aa/. As such, the absolute amount of reduction becomes smaller.
However, the same effect of the filter parameter’s value on the size of reduction is shown as for
the previous sounds /iy aa iy/.
FIGURE 5.3: Same as Fig. 5.2 except for the /iy eh iy/ sounds. Note that the f 1 and f 2 target values
for /eh/ are closer to /iy/ than those for /aa/
FIGURE 5.4: f_1 and f_2 formant trajectories produced from the model for three different durations of /aa/ in the /iy aa iy/ sounds: (a) 25 frames (250 ms), (b) 20 frames and (c) 15 frames. The same γ value of 0.85 is used. The amount of target undershooting increases as the duration is shortened or the speaking rate is increased. Symbol "x" indicates the f_1 and f_2 formant values at the central portion of the vowel /aa/
We refer to this phenomenon as "static" sound confusion induced by increased speaking rate (and/or by a greater degree of sloppiness in speaking).
[Axes: predicted formant frequencies (Hz), showing f_1 and f_2 of /a/ and /ε/]
FIGURE 5.5: Relationship, based on model prediction, between the f 1 and f 2 formant values at the
central portions of vowels and the speaking rate. Vowel /aa/ is in the carry-phrase /iy aa iy/, and vowel
/eh/ in /iy eh iy/. Note that as the speaking rate increases, the distinction between vowels /aa/ and /eh/
measured by the difference between their static formant values gradually diminishes. The same γ value
of 0.9 is used in generating all points in the figure
The increasing degree of "static" sound confusion as the speaking rate increases is clearly evident from both the measurement data (Fig. 5.6) and the prediction (Fig. 5.5).
[Axes: average measured formant frequencies (Hz), showing f_1 and f_2 of /a/ and /ε/]
FIGURE 5.6: The formant measurement data from literature are reorganized and plotted, showing
similar trends to the model prediction under similar conditions
The VTR target values are initialized based on limited context dependency by table lookup (see details in [9], Ch. 13). Then automatic and iterative target adaptation is performed for each phone-like unit based on the difference between the results of a VTR tracker (described in [126]) and the VTR prediction from the FIR filter model. These target values are provided not only for vowels, but also for consonants, for which the resonance frequency targets are used despite weak or no acoustic manifestation. The converged target values, together with the phone boundaries provided from
manifestation. The converged target values, together with the phone boundaries provided from
the TIMIT database, form the input to the FIR filter of the HTM and the output of the filter
gives the predicted VTR frequency trajectories.
Three example utterances from TIMIT (SI1039, SI1669 and SI2299) are shown in
Figs. 5.7–5.9. The stepwise dashed lines ( f 1 / f 2 / f 3 / f 4 ) are the target sequences as inputs to the
FIR filter, and the continuous lines ( f 1 / f 2 / f 3 / f 4 ) are the outputs of the filter as the predicted
VTR frequency trajectories. Parameters γ and D are fixed and not automatically learned. To
facilitate assessment of the accuracy in the prediction, the inputs and outputs are superimposed
FIGURE 5.7: The f 1 / f 2 / f 3 / f 4 VTR frequency trajectories (smooth lines) generated from the FIR model
for VTR target filtering using the phone sequence and duration of a speech utterance (SI1039) taken from
the TIMIT database. The target sequence is shown as stepwise lines, switching at the phone boundaries
labeled in the database. They are superimposed on the utterance’s spectrogram. The utterance is “He has
never, himself, done anything for which to be hated—which of us has ”
on the spectrograms of these utterances, where the true resonances are shown as the dark bands. For the majority of frames, the filter's output either coincides with or is close to the true VTR frequencies, even though no acoustic information is used. Also, comparing the input and output
of the filter, we observe only a rather mild degree of target undershooting or reduction in these
and many other TIMIT utterances we have examined but not shown here.
FIGURE 5.8: Same as Fig. 5.7 except with another utterance “Be excited and don’t identify yourself ”
(SI1669)
Figs. 5.10–5.12 show the model-predicted linear cepstra for the three example TIMIT utterances. Note that the model prediction includes residual means, which are trained from the full TIMIT data set using an HTK tool. The zero-mean random component of the residual is ignored in these figures. The residual means for the substates (three for each phone) are added sequentially to the output of the nonlinear function Eq. (5.12), assuming the three substates occupy equal-length subsegments of the entire phone segment length provided by the TIMIT database. To avoid display cluttering, only linear cepstra with orders one (C1), two (C2) and three (C3) are shown here, as the solid lines. Dashed lines are the linear cepstral data C1, C2 and C3 computed directly from the waveforms of the same utterances for comparison purposes. The data and the model prediction generally agree with each other, somewhat better for lower-order cepstra than for higher-order ones. It was found that these discrepancies are generally within the variances of the prediction residuals automatically trained from the entire TIMIT training set (using an HTK tool for monophone HMM training).
FIGURE 5.9: Same as Fig. 5.7 except with the third utterance “Sometimes, he coincided with my father’s
being at home ” (SI2299)
FIGURE 5.10: Linear cepstra with order one (C1), two (C2) and three (C3) predicted from the final
stage of the model generating the linear cepstra (solid lines) with the input from the FIR filtered results
(for utterance SI1039). Dashed lines are the linear cepstral data C1, C2 and C3 computed directly from
the waveform
Mean Vectors
To find the ML (maximum likelihood) estimate of the parameters μ_{r_s}, we set

Σ_{k=1}^{K} ∂ log p(o(k) | s) / ∂μ_{r_s} = 0,

where p(o(k) | s) is given by Eq. (5.20), and K denotes the total duration of sub-phone s in the training data. This gives

Σ_{k=1}^{K} [ o(k) − μ̄_{o_s} ] = 0,   or  (5.23)
FIGURE 5.11: Same as Fig. 5.10 except with the second utterance (SI2299)
Σ_{k=1}^{K} [ o(k) − F′[z_0(k)] μ_z(k) − { F[z_0(k)] + μ_{r_s} − F′[z_0(k)] z_0(k) } ] = 0.  (5.24)
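Solving Eq. (5.24) for μ_{r_s} gives a simple closed form: the average, over the K frames assigned to sub-phone s, of the difference between the observed cepstra and the linearized cepstral prediction evaluated at μ_z(k). A sketch with placeholder per-frame quantities is:

```python
import numpy as np

def ml_residual_mean(o, F_z0, J_z0, mu_z, z0):
    """Closed-form solution of Eq. (5.24):
    mu_r = (1/K) * sum_k [ o(k) - F[z0(k)] - F'[z0(k)] (mu_z(k) - z0(k)) ].
    o: [K, Q] cepstral observations, F_z0: [K, Q] cepstral prediction at z0,
    J_z0: [K, Q, F] Jacobians, mu_z: [K, F] prior means, z0: [K, F] expansion points."""
    correction = np.einsum('kqf,kf->kq', J_z0, mu_z - z0)
    return (o - F_z0 - correction).mean(axis=0)

# Shapes only, with random placeholders (K=100 frames, Q=15 cepstra, F=8 VTR dims)
rng = np.random.default_rng(2)
print(ml_residual_mean(rng.normal(size=(100, 15)), rng.normal(size=(100, 15)),
                       rng.normal(size=(100, 15, 8)), rng.normal(size=(100, 8)),
                       rng.normal(size=(100, 8))).shape)
```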
FIGURE 5.12: Same as Fig. 5.10 except with the third utterance (SI1669)
which gives

Σ_{k=1}^{K} [ σ_{r_s}^2 + q(k) − (o(k) − μ̄_{o_s})^2 ] / [ σ_{r_s}^2 + q(k) ]^2 = 0.  (5.26)
Due to the frame (k) dependency in the denominator in Eq. (5.26), no simple closed-form
solution is available for solving σr2s from Eq. (5.26). We have implemented three different
techniques for seeking approximate ML estimates that are outlined as follows:
where the step size at the t-th iteration is a heuristically chosen positive constant controlling the learning rate.
3. Constrained gradient ascent: Add to the standard gradient ascent technique above the constraint that the variance estimate always be positive. The constraint is enforced by the parameter transformation σ̃_{r_s}^2 = log σ_{r_s}^2, and by performing gradient ascent for σ̃_{r_s}^2 instead of for σ_{r_s}^2.
Using the chain rule, the new gradient ∇L̃ is related to the gradient ∇L before the parameter transformation in a simple manner:

∇L̃ = ∂L̃/∂σ̃_{r_s}^2 = (∂L/∂σ_{r_s}^2)(∂σ_{r_s}^2/∂σ̃_{r_s}^2) = (∇L) exp(σ̃_{r_s}^2).

At the end of the algorithm's iterations, the parameters are transformed back via σ_{r_s}^2 = exp(σ̃_{r_s}^2), which is guaranteed to be positive.
For efficiency purposes, parameter updating in the above gradient ascent techniques is
carried out after each utterance in the training, rather than after the entire batch of all utterances.
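A minimal sketch of the third (constrained) technique is given below. A generic log-likelihood gradient stands in for the actual gradient implied by Eq. (5.26), and the learning rate, iteration count, and toy data are placeholders:

```python
import numpy as np

def constrained_variance_ascent(grad_L, var_init, lr=1e-3, n_steps=200):
    """Gradient ascent on sigma_r^2 with positivity enforced by the reparameterization
    sigma_tilde = log(sigma_r^2).  By the chain rule, the gradient with respect to
    sigma_tilde equals the gradient with respect to sigma_r^2 times exp(sigma_tilde)."""
    s_tilde = np.log(var_init)
    for _ in range(n_steps):
        var = np.exp(s_tilde)
        s_tilde = s_tilde + lr * grad_L(var) * var   # ascend in the log domain
    return np.exp(s_tilde)                           # back-transform: always positive

# Toy example: maximize the log-likelihood of zero-mean data with unknown variance
data = np.random.default_rng(3).normal(scale=2.0, size=500)
grad = lambda v: 0.5 * (np.sum(data ** 2) / v ** 2 - len(data) / v)  # d logL / d v
print(constrained_variance_ascent(grad, var_init=1.0))               # approaches ~4
```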
We note that the quality of the estimates for the residual parameters discussed above plays
a crucial role in phonetic recognition performance. These parameters provide an important
mechanism for distinguishing speech sounds that belong to different manners of articulation.
This is attributed to the fact that nonlinear cepstral prediction from VTRs has different accuracy
for these different classes of sounds. Within the same manner class, the phonetic separation
is largely accomplished by distinct VTR targets, which typically induce significantly different
cepstral prediction values via the “amplification” mechanism provided by the Jacobian matrix
F [z].
Mean Vectors
To obtain a closed-form estimation solution, we assume diagonality of the prediction cepstral residual's covariance matrix Σ_{r_s}. Denoting its q-th component by σ_r^2(q) (q = 1, 2, . . . , Q), we decompose the multivariate Gaussian of Eq. (5.20) element-by-element into

p(o(k) | s(k)) = Π_{j=1}^{J} (1 / √(2π σ_{o_{s(k)}}^2(j))) exp[ −(o_k(j) − μ̄_{o_{s(k)}}(j))^2 / (2σ_{o_{s(k)}}^2(j)) ],  (5.29)
where o k ( j ) denotes the j th component (i.e., j th order) of the cepstral observation vector at
frame k.
The log-likelihood function for a training data sequence (k = 1, 2, . . . , K) relevant to the VTR mean vector μ_{T_s} becomes

P = − Σ_{k=1}^{K} Σ_{j=1}^{J} (o_k(j) − μ̄_{o_{s(k)}}(j))^2 / σ_{o_{s(k)}}^2(j)  (5.30)
  = − Σ_{k=1}^{K} Σ_{j=1}^{J} [ Σ_f F′[z_0(k), j, f] Σ_l a_k(l) μ_T(l, f) − d_k(j) ]^2 / σ_{o_{s(k)}}^2(j),

where l and f are indices to the phone and to the VTR component, respectively, and

d_k(j) = o_k(j) − F[z_0(k), j] + Σ_f F′[z_0(k), j, f] z_0(k, f) − μ_{r_{s(k)}}(j).
While the acoustic feature’s distribution is Gaussian for both HTM and HMM given
the state s , the key difference is that the mean and variance in HTM as in Eq. (5.20) are both
time-varying functions (hence trajectory model). These functions provide context dependency
(and possible target undershooting) via the smoothing of targets across phonetic units in the
utterance. This smoothing is explicitly represented in the weighted sum over all phones in the
utterance (i.e., l ) in Eq. (5.30).
Setting

∂P / ∂μ_T(l_0, f_0) = 0,
with f_0 = 1, 2, . . . , 8 for each VTR dimension, and with l_0 = 1, 2, . . . , 58 for each phone unit.
Eq. (5.31) is a 464 × 464 full-rank linear system of equations. (The dimension 464 = 58 × 8, where we have a total of 58 phones in the TIMIT database after decomposing each diphthong into two "phones", and 8 is the VTR vector dimension.) Matrix inversion gives an
ML estimate of the complete set of target mean parameters: a 464-dimensional vector formed
by concatenating all eight VTR components (four frequencies and four bandwidths) of the
58 phone units in TIMIT.
In implementing Eq. (5.31) for the ML solution to target mean vectors, we kept other
model parameters constant. Estimation of the target and residual parameters was carried out
in an iterative manner. Initialization of the parameters μT (l, f ) was provided by the values
described in [9].
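Although Eq. (5.31) itself is not reproduced above, the structure of Eq. (5.30) shows that maximizing P with respect to the stacked target-mean vector is a weighted linear least-squares problem of dimension 464. The schematic sketch below uses random placeholder design rows and right-hand sides; in the real system the rows are built from F′[z_0(k), j, f] a_k(l) and the right-hand sides from d_k(j):

```python
import numpy as np

def solve_target_means(rows, rhs, weights):
    """Weighted least squares for the stacked target-mean vector mu_T (dim 464 =
    58 phones x 8 VTR components).  Each row corresponds to one (frame k, cepstral
    order j) pair; the weights are the inverse variances 1/sigma_o^2(j)."""
    w = np.sqrt(weights)
    A = w[:, None] * rows              # solved via lstsq for numerical stability
    b = w * rhs
    mu_T, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu_T.reshape(58, 8)         # one row of 8 VTR components per phone

# Placeholder problem of the stated size (real rows/rhs come from Eq. (5.30))
rng = np.random.default_rng(4)
n_rows, dim = 5000, 58 * 8
mu = solve_target_means(rng.normal(size=(n_rows, dim)),
                        rng.normal(size=n_rows),
                        rng.random(n_rows) + 0.1)
print(mu.shape)        # (58, 8)
```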
An alternative training of the target mean parameters in a simplified version of the HTM
and its experimental evaluation are described in [112]. In that training, the VTR tracking
results obtained by the tracking algorithm described in Chapter 4 are exploited as the basis for
learning, contrasting the learning described in this section, which uses the raw cepstral acoustic
data only. Use of the VTR tracking results enables speaker-adaptive learning for the VTR target
parameters as shown in [112].
where F′_{jf} is the (j, f) element of the Jacobian matrix F′[·] in Eq. (5.27), and the second equality is due to Eq. (5.11).
Using the chain rule to compute the gradient, we obtain

∇L_T(l, f) = ∂L_T / ∂σ_T^2(l, f)  (5.35)
           = Σ_{k=1}^{K} Σ_{j=1}^{J} [ (o_k(j) − μ̄_{o_{s(k)}}(j))^2 (F′_{jf})^2 v_k(l) / (σ_{r_s}^2(j) + q(k, j))^2 − (F′_{jf})^2 v_k(l) / (σ_{r_s}^2(j) + q(k, j)) ],

for each phone l and for each element f in the diagonal VTR target covariance matrix.
5.5 SUMMARY
In this chapter, we present in detail a second specific type of hidden dynamic models, which
we call the hidden trajectory model (HTM). The unique character of the HTM is that the
hidden dynamics are represented not by temporal recursion on themselves but by explicit “tra-
jectories” or hidden trended functions constructed by FIR filtering of targets. In contrast to
the implementation strategy for the model discussed in Chapter 4 where the hidden dynamics
are discretized, the implementation strategy in the HTM maintains continuous-valued hidden
dynamics, and introduces approximations by constraining the temporal boundaries associated
with discrete phonological states. Given such constraints, rigorous algorithms for model param-
eter estimation are developed and presented without the need to approximate the continuous
hidden dynamic variables by their discretized values as done in Chapter 4.
The main portions of this chapter are devoted to formal construction of the HTM, its
computer simulation and the parameter estimation algorithm’s development. The computa-
tionally efficient decoding algorithms have not been presented, as they are still under research
and development and are hence not appropriate to describe in this book at present. In contrast,
decoding algorithms for discretized hidden dynamic models are much more straightforward to
develop, as we have presented in Chapter 4.
Although we present only two types of implementation strategies in this book (Chapters
4, 5, respectively) for dynamic speech modeling within the general computational framework
established in Chapter 2, other types of implementation strategies and approximations (such as
variational learning and decoding) are possible. We have given some related references at the
beginning of this chapter.
As a summary and conclusion of this book, we have provided the scientific background, mathematical theory, computational framework, algorithmic development, technological needs, and two selected applications for dynamic speech modeling, the theme of this book. A survey of this area of research is presented, drawing on the work of a number of (though by no means all) research groups and individual researchers worldwide. This direction of research is guided by scientific principles applied to the study of human speech communication, and is based on the desire to acquire knowledge about the realistic dynamic process in the closed-loop speech chain. It is hoped that, with the integration of this unique style of research with other powerful pattern recognition and machine learning approaches, dynamic speech models, as they become better developed, will form a foundation for the next-generation speech technology serving humankind and society.
Bibliography
[1] P. Denes and E. Pinson. The Speech Chain, 2nd edn, Worth Publishers, New York, 1993.
[2] K. Stevens. Acoustic Phonetics, MIT Press, Cambridge, MA, 1998.
[3] K. Stevens. “Toward a model for lexical access based on acoustic landmarks and distinc-
tive features,” J. Acoust. Soc. Am., Vol. 111, April 2002, pp. 1872–1891.
[4] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition, Prentice-Hall, Upper
Saddle River, NJ, 1993.
[5] X. Huang, A. Acero, and H. Hon. Spoken Language Processing, Prentice Hall, New York,
2001.
[6] V. Zue. “Notes on speech spectrogram reading,” MIT Lecture Notes, Cambridge, MA,
1991.
[7] J. Olive, A. Greenwood, and J. Coleman. Acoustics of American English Speech—A Dy-
namic Approach, Springer-Verlag, New York, 1993.
[8] C. Williams. “How to pretend that correlated variables are independent by using dif-
ference observations,” Neural Comput., Vol. 17, 2005, pp. 1–6.
[9] L. Deng and D. O’Shaughnessy. SPEECH PROCESSING—A Dynamic and
Optimization-Oriented Approach (ISBN: 0-8247-4040-8), Marcel Dekker, New York,
2003, 626 pp.
[10] L. Deng and X.D. Huang. “Challenges in adopting speech recognition,” Commun.
ACM, Vol. 47, No. 1, January 2004, pp. 69–75.
[11] M. Ostendorf. “Moving beyond the beads-on-a-string model of speech,” in Proceedings
of IEEE Workshop on Automatic Speech Recognition and Understanding, December 1999,
Keystone, CO, pp. 79–83.
[12] N. Morgan, Q. Zhu, A. Stolcke, et al. “Pushing the envelope—Aside,” IEEE Signal
Process. Mag., Vol. 22, No. 5, September 2005, pp. 81–88.
[13] F. Pereira. “Linear models for structure prediction,” in Proceedings of Interspeech, Lisbon,
September 2005, pp. 717–720.
[14] M. Ostendorf, V. Digalakis, and J. Rohlicek. “From HMMs to segment models: A
unified view of stochastic modeling for speech recognition” IEEE Trans. Speech Audio
Process., Vol. 4, 1996, pp. 360–378.
[15] B.-H. Juang and S. Katagiri. “Discriminative learning for minimum error classification,”
IEEE Trans. Signal Process., Vol. 40, No. 12, 1992, pp. 3043–3054.
[32] J. Allen. “How do humans process and recognize speech,” IEEE Trans. Speech Audio
Process., Vol. 2, 1994, pp. 567–577.
[33] L. Deng. “A dynamic, feature-based approach to the interface between phonology and
phonetics for speech modeling and recognition,” Speech Commun., Vol. 24, No. 4, 1998,
pp. 299–323.
[34] H. Bourlard, H. Hermansky, and N. Morgan. “Towards increasing speech recognition
error rates,” Speech Commun., Vol. 18, 1996, pp. 205–231.
[35] L. Deng. “Switching dynamic system models for speech articulation and acoustics,”
in M. Johnson, M. Ostendorf, S. Khudanpur, and R. Rosenfeld (eds.), Mathemati-
cal Foundations of Speech and Language Processing, Springer-Verlag, New York, 2004,
pp. 115–134.
[36] R. Lippmann. “Speech recognition by human and machines,” Speech Commun., Vol. 22,
1997, pp. 1–14.
[37] L. Pols. “Flexible human speech recognition,” in Proceedings of the IEEE Workshop on
Automatic Speech Recognition and Understanding, 1997, Santa Barbara, CA, pp. 273–283.
[38] C.-H. Lee. “From knowledge-ignorant to knowledge-rich modeling: A new speech
research paradigm for next-generation automatic speech recognition,” in Proc. ICSLP,
Jeju Island, Korea, October 2004, pp. 109–111.
[39] M. Russell. “Progress towards speech models that model speech,” in Proc. IEEE Workshop
on Automatic Speech Recognition and Understanding, 1997, Santa Barbara, CA, pp. 115–
123.
[40] M. Russell. “A segmental HMM for speech pattern matching,” IEEE Proceedings of the
ICASSP, Vol. 1, 1993, pp. 499–502.
[41] L. Deng. “A generalized hidden Markov model with state-conditioned trend functions
of time for the speech signal,” Signal Process., Vol. 27, 1992, pp. 65–78.
[42] J. Bridle, L. Deng, J. Picone, et al. “An investigation of segmental hidden dynamic
models of speech coarticulation for automatic speech recognition,” Final Report for the
1998 Workshop on Language Engineering, Center for Language and Speech Processing
at Johns Hopkins University, 1998, pp. 1–61.
[43] K. Kirchhoff. “Robust speech recognition using articulatory information,” Ph.D. thesis,
University of Bielefeld, Germany, July 1999.
[44] R. Bakis. “Coarticulation modeling with continuous-state HMMs,” in Proceedings
of the IEEE Workshop on Automatic Speech Recognition, Harriman, New York, 1991,
pp. 20–21.
[45] Y. Gao, R. Bakis, J. Huang, and B. Zhang. “Multistage coarticulation model combining
articulatory, formant and cepstral features,” Proc. ICSLP, Vol. 1, 2000, pp. 25–28.
W. Hardcastle and A. Marchal (eds.), Speech Production and Speech Modeling, Kluwer,
Norwell, MA, 1990, pp. 403–439.
[61] N. Chomsky and M. Halle. The Sound Pattern of English, Harper and Row, New York,
1968.
[62] N. Clements. “The geometry of phonological features,” Phonology Yearbook, Vol. 2, 1985,
pp. 225–252.
[63] C. Browman and L. Goldstein. “Articulatory phonology: An overview,” Phonetica,
Vol. 49, 1992, pp. 155–180.
[64] M. Randolph. “Speech analysis based on articulatory behavior,” J. Acoust. Soc. Am.,
Vol. 95, 1994, p. 195.
[65] L. Deng and H. Sameti. “Transitional speech units and their representation by the
regressive Markov states: Applications to speech recognition,” IEEE Trans. Speech Audio
Process., Vol. 4, No. 4, July 1996, pp. 301–306.
[66] J. Sun, L. Deng, and X. Jing. “Data-driven model construction for continuous speech
recognition using overlapping articulatory features,” Proc. ICSLP, Vol. 1, 2000, pp. 437–
440.
[67] Z. Ghahramani and M. Jordan. “Factorial hidden Markov models,” Machine Learn.,
Vol. 29, 1997, pp. 245–273.
[68] K. Stevens. “On the quantal nature of speech,” J. Phonetics, Vol. 17, 1989, pp. 3–45.
[69] A. Liberman and I. Mattingly. “The motor theory of speech perception revised,” Cog-
nition, Vol. 21, 1985, pp. 1–36.
[70] B. Lindblom. “Role of articulation in speech perception: Clues from production,”
J. Acoust. Soc. Am., Vol. 99, No. 3, 1996, pp. 1683–1692.
[71] P. MacNeilage. “Motor control of serial ordering in speech,” Psychol. Rev., Vol. 77, 1970,
pp. 182–196.
[72] R. Kent, G. Adams, and G. Turner. “Models of speech production,” in N. Lass (ed.),
Principles of Experimental Phonetics, Mosby, London, 1995, pp. 3–45.
[73] J. Perkell, M. Matthies, M. Svirsky, and M. Jordan. “Goal-based speech motor con-
trol: A theoretical framework and some preliminary data,” J. Phonetics, Vol. 23, 1995,
pp. 23–35.
[74] J. Perkell. “Properties of the tongue help to define vowel categories: Hypotheses based
on physiologically-oriented modeling,” J. Phonetics, Vol. 24, 1996, pp. 3–22.
[75] P. Perrier, D. Ostry, and R. Laboissière. “The equilibrium point hypothesis and its
application to speech motor control,” J. Speech Hearing Res., Vol. 39, 1996, pp. 365–378.
[76] B. Lindblom, J. Lubker, and T. Gay. “Formant frequencies of some fixed-mandible
vowels and a model of speech motor programming by predictive simulation,” J. Phonetics,
Vol. 7, 1979, pp. 146–161.
observations with applications to speech recognition,” IEEE Trans. Acoust., Speech, Signal
Process., Vol. 38, 1990, pp. 220–225.
[94] L. Deng and C. Rathinavalu. “A Markov model containing state-conditioned second-
order nonstationarity: Application to speech recognition,” Comput. Speech Language,
Vol. 9, 1995, pp. 63–86.
[95] A. Poritz. “Hidden Markov models: A guided tour,” IEEE Proc. ICASSP, Vol. 1, 1988,
pp. 7–13.
[96] H. Sheikhzadeh and L. Deng. "Waveform-based speech recognition using hidden filter
models: Parameter selection and sensitivity to power normalization,” IEEE Trans. Speech
Audio Process., Vol. 2, 1994, pp. 80–91.
[97] H. Zen, K. Tokuda, and T. Kitamura. “A Viterbi algorithm for a trajectory model derived
from HMM with explicit relationship between static and dynamic features,” IEEE Proc.
ICASSP, 2004, pp. 837–840.
[98] K. Tokuda, H. Zen, and T. Kitamura. “Trajectory modeling based on HMMs with the
explicit relationship between static and dynamic features,” Proc. Eurospeech, Vol. 2, 2003,
pp. 865–868.
[99] J. Tebelskis and A. Waibel. “Large vocabulary recognition using linked predictive neural
networks,” IEEE Proc. ICASSP, Vol. 1, 1990, pp. 437–440.
[100] E. Levin. “Word recognition using hidden control neural architecture,” IEEE Proc.
ICASSP, Vol. 1, 1990, pp. 433–436.
[101] L. Deng, K. Hassanein, and M. Elmasry. “Analysis of correlation structure for a neural
predictive model with application to speech recognition,” Neural Networks, Vol. 7, No. 2,
1994, pp. 331–339.
[102] V. Digalakis, J. Rohlicek, and M. Ostendorf. “ML estimation of a stochastic linear
system with the EM algorithm and its application to speech recognition,” IEEE Trans.
Speech Audio Process., Vol. 1, 1993, pp. 431–442.
[103] L. Deng. “Articulatory features and associated production models in statistical speech
recognition,” in K. Ponting (ed.), Computational Models of Speech Pattern Processing
(NATO ASI Series), Springer, New York, 1999, pp. 214–224.
[104] L. Lee, P. Fieguth, and L. Deng. “A functional articulatory dynamic model for speech
production,” IEEE Proc. ICASSP, Vol. 2, 2001, pp. 797–800.
[105] R. McGowan. “Recovering articulatory movement from formant frequency trajectories
using task dynamics and a genetic algorithm: Preliminary model tests,” Speech Commun.,
Vol. 14, 1994, pp. 19–48.
[106] R. McGowan and A. Faber. “Speech production parameters for automatic speech recog-
nition,” J. Acoust. Soc. Am., Vol. 101, 1997, p. 28.
[120] C. Huang and H. Wang. “Bandwidth-adjusted LPC analysis for robust speech recog-
nition,” Pattern Recognit. Lett., Vol. 24, 2003, pp. 1583–1587.
[121] L. Lee, H. Attias, and L. Deng. “Variational inference and learning for segmental
switching state space models of hidden speech dynamics,” in IEEE Proceedings of the
ICASSP, Vol. I, Hong Kong, April 2003, pp. 920–923.
[122] L. Lee, L. Deng, and H. Attias. “A multimodal variational approach to learning and
inference in switching state space models,” in IEEE Proceedings of the ICASSP, Montreal,
Canada, May 2004, Vol. I, pp. 505–508.
[123] J. Ma and L. Deng. "Efficient decoding strategies for conversational speech recognition
using a constrained nonlinear state–space model for vocal–tract–resonance dynamics,”
IEEE Trans. Speech Audio Process., Vol. 11, No. 6, 2003, pp. 590–602.
[124] L. Deng, D. Yu, and A. Acero. “A long-contextual-span model of resonance dynamics
for speech recognition: Parameter learning and recognizer evaluation,” Proceedings of
the IEEE Workshop on Automatic Speech Recognition and Understanding, Puerto Rico,
Nov. 27 – Dec 1, 2005, pp. 1–6 (CDROM).
[125] M. Pitermann. “Effect of speaking rate and contrastive stress on formant dynamics and
vowel perception,” J. Acoust. Soc. Am., Vol. 107, 2000, pp. 3425–3437.
[126] L. Deng, L. Lee, H. Attias, and A. Acero. “A structured speech model with continuous
hidden dynamics and prediction-residual training for tracking vocal tract resonances,”
IEEE Proceedings of the ICASSP, Montreal, Canada, 2004, pp. 557–560.
[127] J. Glass. "A probabilistic framework for segment-based speech recognition," Comput.
Speech Language, Vol. 17, No. 2/3, 2003, pp. 137–152.
[128] A. Oppenheim and D. Johnson. “Discrete representation of signals,” Proc. IEEE,
Vol. 60, No. 6, 1972, pp. 681–691.