Speech Recognition Algorithms for Voice Control Interfaces
Abstract
Recognition accuracy has been the primary objective of most speech recognition research, and impressive results have been obtained, e.g. less than 0.3% word error rate on a speaker-independent digit recognition task.
When it comes to real-world applications, robustness and real-time
response might be more important issues. For the first requirement we
review some of the work on robustness and discuss one specific technique, spectral normalization, in more detail. The requirement of real-time
response has to be considered in the light of the limited hardware resources
in voice control applications, which are due to the tight cost constraints. In
this paper we discuss in detail one specific means to reduce the processing
and memory demands: a clustering technique applied at various levels
within the acoustic modelling.
Keywords: automatic speech recognition; small-vocabulary systems; robustness; acoustic-phonetic modelling; state and density clustering.
1. Introduction
Automatic speech recognition has been a topic of research for many years. It
is, however, primarily in the past few years that the technology has matured
enough to be employed in a large range of applications. There are two main
reasons why this has occurred. Firstly, improvements in speech recognition
algorithms have led to more robust and reliable systems which can cope
with real-world and not only laboratory-controlled environments. Secondly,
there is a sharp decrease in cost of computation and memory, which is reflected
by the rapid growth in computing capability provided by modern digital signal
processing (DSP) chips.
Today, the cost of the speech recognition feature approaches a range which
Philips Journal of Research Vol. 49 No. 4 1995
R. Haeb-Umbach et al.
In this paper we mainly take the first point of view and discuss in detail one
specific means to reduce the processing and memory demands: a clustering
technique applied at various levels within the acoustic modelling. This will
be discussed in Section 3. The results are summarized in Section 4.
where s(k, t), n(k, t), h(k, t) and x(k, t) are the logarithms of the power spectral densities of the pure speech signal, the noise signal, the transfer function and the compound signal at the input of the recognizer, respectively; k denotes the frequency subband index and t is the discrete time index.
For most environmental variabilities it is justified to assume that the channel
transfer function varies only slowly with time compared to the rate at which
speech changes; i.e., h(k, t) can be considered a low-pass signal with respect
to the time index t.
The influence of the transfer function can then be eliminated by high-pass
filtering of the spectral subband envelopes. Furthermore, it is well known
that high-pass filtering of the subband envelopes suppresses speaker-specific
characteristics of the speech signal. Different techniques for high-pass filtering
have been applied.
Averaging the compound log-spectrum over an utterance of T frames (and assuming the noise term to be negligible) gives

x̄(k) = (1/T) Σ_{t=1}^{T} x(k, t) = h(k) + s̄(k)

where

s̄(k) = (1/T) Σ_{t=1}^{T} s(k, t).     (5)

Now it is easily seen that y(k, t) := x(k, t) − x̄(k) is independent of the channel transfer function:

y(k, t) = s(k, t) − (1/T) Σ_{t=1}^{T} s(k, t).     (6)

For real-time operation, x̄(k) can instead be estimated by low-pass filtering the subband envelope along the time axis,

x̄(k, t) = Σ_{j≥0} a(j) x(k, t − j).
One possible initialization for x̄(k, 0) is the overall mean value of the kth subband component of the training data. The filter coefficients a(j) have to be chosen such that the above represents a low-pass filter, e.g. for a first-order lowpass filter we have a(j) = a^j, 0 < a < 1. The bandwidth of the filter has
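The two normalization variants, utterance-level mean subtraction and the real-time first-order low-pass estimate, can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' implementation; the array shapes, the smoothing constant and the function names are assumptions:

```python
import numpy as np

def mean_subtraction(x):
    """Utterance-level spectral mean subtraction:
    y(k, t) = x(k, t) - mean over t of x(k, t).
    x: array of shape (T, K): T frames, K log-spectral subbands."""
    return x - x.mean(axis=0, keepdims=True)

def recursive_mean_subtraction(x, a=0.995, init=None):
    """Real-time variant: track the mean per subband with a first-order
    low-pass filter (coefficients a(j) proportional to a**j, 0 < a < 1)
    and subtract it. `init` plays the role of x_bar(k, 0), e.g. the
    overall training-data mean (a hypothetical choice here)."""
    T, K = x.shape
    x_bar = np.zeros(K) if init is None else np.asarray(init, dtype=float).copy()
    y = np.empty_like(x, dtype=float)
    for t in range(T):
        x_bar = a * x_bar + (1.0 - a) * x[t]   # low-pass estimate of the channel
        y[t] = x[t] - x_bar                    # channel-compensated envelope
    return y
```

Since the recursive estimate converges to any constant channel offset, a stationary transfer function is asymptotically removed, while the speech signal, which varies quickly compared to the filter bandwidth, passes through.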
TABLE I
Word error rates for the handset (HS) and hands-free (HF) databases, cross-tests (HS: training, HF: recognition) and the MTEL telephone database; string error rate on the TI Digits recognition task. In all cases: 32-component feature vector.
Further, it can be observed that the recursive filters perform
somewhat worse than the mean subtraction technique. In the cases of speaker-
dependent recognition and match between training and test environment
(‘HS’, ‘HF’), spectrum normalization does not yield any benefit and can
even deteriorate performance slightly. Spectrum normalization may thus be
viewed as a safeguard measure for cases where one might encounter a test
environment that differs from the training conditions.
The distance of an observation vector o to the mean vector μ_i of component density i is taken as

d(o, μ_i) = (o − μ_i) C^{-1} (o − μ_i)^T     (11)

and p_i denotes the weight of component density i. Note that −log(p_i) ≥ 0. Thus we define a distance d̃ which incorporates the term −log(p_i) into d:

d̃(o, μ_i) = d(o, μ_i) − log(p_i).     (12)

With the nearest-density approximation, the negative log-likelihood of the training data becomes

V ≈ Σ_o min_i d̃(o, μ_i).     (13)

Rewriting the double sum over all mixtures and all component densities of each mixture as a single summation over the pool of all N densities, we get:

V ≈ Σ_{j=1}^{N} Σ_{o → j} d̃(o, μ_j),   with Σ_{j ∈ m} p_j = 1 for each mixture m.     (14)
¹ To indicate that m is the model which corresponds to the observation vector o we write m_o. Furthermore we write μ ∈ m if μ is a component density mean vector of the model (mixture density) m, and we write o → μ if μ is the closest mean vector (with respect to the Euclidean distance) to o.
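The weighted distance and the resulting criterion V can be illustrated with a small numeric sketch. It assumes unit covariance (so d reduces to the squared Euclidean distance) and a hypothetical pool of three densities; all values and names are invented for illustration:

```python
import numpy as np

# Hypothetical pool of N = 3 component densities: mean vectors and weights p_i.
means = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
weights = np.array([0.5, 0.3, 0.2])          # sum to 1, so -log(p_i) >= 0

def d_tilde(o, mu, p):
    """Distance incorporating the weight term: d(o, mu) - log(p).
    Unit covariance assumed, so d is the squared Euclidean distance."""
    diff = o - mu
    return float(diff @ diff) - np.log(p)

def neg_log_likelihood(observations):
    """Nearest-density approximation of the negative log-likelihood V:
    each observation contributes the minimum d_tilde over the pool."""
    return sum(min(d_tilde(o, mu, p) for mu, p in zip(means, weights))
               for o in observations)
```

An observation close to the first mean then contributes roughly its squared distance plus −log(0.5) to V, which is the quantity the clustering procedure below seeks to keep small.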
The increase of the negative log-likelihood V caused by fusing two groups p and q of densities, with occupancy counts n_p, n_q and mean vectors μ_p, μ_q, is

Δ_{p,q} = (n_p n_q / (n_p + n_q)) ||μ_p − μ_q||² ≥ 0.     (18)

From this equation it is obvious that clustering can never lead to a decrease of the negative log-likelihood. Since

μ_f = (n_p μ_p + n_q μ_q) / (n_p + n_q),   n_f = n_p + n_q     (19)

holds, we obtain for an additional fusion of the fused group f and another group i:

Δ_{i,f} = ((n_i + n_p) Δ_{i,p} + (n_i + n_q) Δ_{i,q} − n_i Δ_{p,q}) / (n_i + n_p + n_q).     (21)
Δ_{i,f} can be interpreted as a distance between the clusters i and f; i.e., the distance (measure of dissimilarity) of two clusters is simply the increase in negative log-likelihood V if the two clusters are merged. In an implementation the terms Δ_{i,j} are computed at the beginning of the clustering. During a successive clustering step the distances of the clusters i to the new cluster f (f = fused(p, q)) can be directly computed from the distances Δ_{i,p}, Δ_{i,q} and Δ_{p,q} according to eq. (21). The iterative clustering procedure is summarized
in the following:
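A minimal sketch of such a greedy bottom-up procedure is given below. It is an illustration, not the authors' exact algorithm: it assumes a Ward-style merge cost as the increase in V and, for brevity, recomputes pair distances at every step instead of updating them incrementally via eq. (21):

```python
import numpy as np

def bottom_up_cluster(means, counts, target):
    """Greedy bottom-up clustering sketch: repeatedly fuse the pair of
    clusters whose merge gives the smallest increase in the criterion,
    here the Ward-style cost n_p*n_q/(n_p+n_q) * ||mu_p - mu_q||**2
    (an illustrative stand-in for the increase in negative log-likelihood),
    until only `target` clusters remain."""
    clusters = [(np.asarray(m, dtype=float), float(n))
                for m, n in zip(means, counts)]

    def delta(a, b):
        (mu_p, n_p), (mu_q, n_q) = a, b
        diff = mu_p - mu_q
        return n_p * n_q / (n_p + n_q) * float(diff @ diff)

    while len(clusters) > target:
        # find the cheapest fusion among all remaining pairs
        p, q = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: delta(clusters[ij[0]], clusters[ij[1]]))
        (mu_p, n_p), (mu_q, n_q) = clusters[p], clusters[q]
        # fused cluster: count-weighted mean and summed occupancy count
        merged = ((n_p * mu_p + n_q * mu_q) / (n_p + n_q), n_p + n_q)
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)] + [merged]
    return clusters
```

In a real implementation one would cache the Δ_{i,j} terms once and update only the distances to the newly fused cluster, as the text above describes.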
The k-means clustering procedure can be included after each clustering iteration of Section 3.1.2 or, to save computation time, after a certain number of iterations.
Hidden Markov models may share some or all component densities of their
mixture densities if they model acoustically similar events. This similarity can
be modelled on state level and density level. As a result of state-tying (see
Fig. l), complete states will be tied together, i.e. the tied states will share the
same inventory of component densities.
Density-tying on the other hand allows different models to share common
regions of the acoustic space (see Fig. 1). It is done across HMM states and
is independent of the previously mentioned state-tying.

Fig. 1. State-tying and density-tying: two similar states are modelled by one and the same mixture density.
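The two tying schemes can be made concrete with a toy data layout; the pool contents, mixture sizes and names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# All component densities live in one shared pool; a mixture is a list of
# (density index, weight) pairs referencing that pool.
density_pool = np.array([[0.0, 0.0],
                         [1.0, 0.5],
                         [4.0, 4.0],
                         [4.5, 3.5]])

# State-tying: two acoustically similar states reference the SAME mixture
# object, i.e. they share the full inventory of component densities.
mixture_a = [(0, 0.6), (1, 0.4)]
state_1 = mixture_a
state_2 = mixture_a            # tied state

# Density-tying: distinct mixtures share individual densities (index 1 here),
# i.e. common regions of the acoustic space, across HMM states.
mixture_b = [(1, 0.3), (2, 0.5), (3, 0.2)]
```

Both mechanisms reduce parameter count: state-tying removes duplicated mixtures outright, while density-tying stores each shared region of acoustic space only once in the pool.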
TABLE II
String error rate (SER) on TI Digits for various configurations with a small
number of densities
TABLE III
String error rate (SER) on TI Digits for various configurations with a large
number of densities
4. Summary
The reported results show that in the case of a mismatch of training and test
conditions spectrum normalization is essential, since it is able to remove the
negative effects on the error rate of changing transfer channel characteristics
and even of different noise levels of training and test. For the reported
speaker-independent recognition experiments spectrum normalization also
improves performance since it discards speaker-specific spectral characteristics
in the speech spectrum. Further, it can be observed that recursive filters, which
have to be employed to achieve real-time response, perform somewhat worse
than the mean subtraction technique. In the cases of speaker-dependent
recognition and match between training and test environment, spectrum
normalization does not yield any benefit and can even worsen performance
slightly. Spectrum normalization may thus be viewed as a safeguard measure
for cases where the test environment differs from the training conditions.
Clustering techniques have been applied to obtain a compact representation
of acoustic models. This translates directly into reduced computation and
memory demands in an implementation. This is an important factor in voice
control applications which have to live with tight cost constraints. Moreover,
clustering has another beneficial side-effect: a better utilization of the training
data. In this paper we have discussed in detail a maximum-likelihood-based
clustering procedure applied at various levels within the acoustic modelling.
At the state level, clustering allows us to avoid the duplication of acoustically
similar models. A consequence is that rarely seen acoustic events can be
modelled together with more robust ones. At the density level, clustering
allows us to model better the part of the acoustic space that is shared by
different models. A combination of the two clustering techniques leads to a
reduction of the number of parameters by a factor of up to three and to a significant error rate reduction on several small-vocabulary speech recognition
tasks.