
INTERSPEECH 2011

Deep Convex Net: A Scalable Architecture for Speech Pattern Classification


Li Deng and Dong Yu

Microsoft Research, Redmond, WA, USA


{deng, dongyu}@microsoft.com

Abstract

We recently developed the context-dependent DNN-HMM (Deep-Neural-Net/Hidden-Markov-Model) for large-vocabulary speech recognition. While achieving impressive reductions in recognition error rate, we face a seemingly insurmountable scalability problem in dealing with the virtually unlimited amount of training data available nowadays. To overcome the scalability challenge, we have designed the deep convex network (DCN) architecture. The learning problem in the DCN is convex within each module. Additional fine tuning that exploits the DCN structure further improves the quality of the DCN. The full learning in the DCN is batch-mode based instead of stochastic, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on both MNIST and TIMIT tasks evaluated thus far demonstrate superior performance of the DCN over its DBN (Deep Belief Network) counterpart, which forms the basis of the DNN. The superiority is reflected not only in training scalability and CPU-only computation, but more importantly in classification accuracy on both tasks.

Index Terms: deep learning, scalability, convex optimization, neural network, deep belief network, phone state classification, batch-mode training, parallel computing

1. Introduction

Automatic speech recognition (ASR) has been the subject of a significant amount of research and commercial development in recent years. Recent research in ASR has explored deep, layered architectures, motivated partly by the desire to capitalize on some analogous properties in the human speech generation and perception systems; e.g., [1][2]. In these studies, learning of model parameters has been one of the most prominent and difficult problems. In parallel with the development in ASR research, recent progress in learning methods from neural network research has also ignited interest in the exploration of deep-structured models; e.g., [3]. One particular advance is the development of effective learning techniques for Deep Belief Networks (DBNs), which are densely connected, directed belief networks with many hidden layers. In general, a DBN can be considered a complex nonlinear feature extractor with many layers of hidden units and at least one layer of visible units, where each layer of hidden units learns to represent features that capture higher-order correlations in the original input data [3]-[8].

While DBNs have been shown to be extremely powerful in performing recognition and classification tasks including speech recognition [4]-[7], training DBNs has proven to be computationally difficult. In particular, conventional techniques for training DBNs involve the use of a stochastic gradient descent learning algorithm. Although stochastic gradient descent has been shown to be powerful for fine-tuning the weights of a DBN, such a learning algorithm is extremely difficult to parallelize across machines, making learning at large scale difficult. It has been possible to use one single, very powerful GPU machine to train a DNN-HMM for speech recognizers with dozens to a few hundred hours of speech training data, with remarkable results [5]. To scale up this success to thousands or more hours of training data, we have been encountering seemingly insurmountable difficulty with the current DNN architecture used in our recent work [5][6][7][8].

The main thrust of the research reported in this paper is a new deep learning architecture, referred to as the Deep Convex Network (DCN), which squarely attacks the learning scalability problem. The organization of this paper is as follows. In Section 2, we provide an overview of the DCN architecture and focus on how it integrates some key ideas from the DBN, boosting, and the extreme learning machine. In Section 3, an accelerated optimization algorithm we developed recently is outlined, which "fine-tunes" the DCN weights by capitalizing on the structure in each module of the DCN. We then show experimental results on static classification tasks defined on MNIST (image) and TIMIT (speech), with the accuracy of the DCN exceeding that of the DBN on both tasks.

2. The DCN Architecture

A DCN includes a variable number of layered modules, wherein each module is a specialized neural network consisting of a single hidden layer and two trainable sets of weights. More particularly, the lowest module in the DCN comprises a first linear layer with a set of linear input units, a non-linear layer with a set of non-linear hidden units, and a second linear layer with a set of linear output units. For instance, if the DCN is utilized in connection with recognizing an image, the input units can correspond to a number of pixels (or extracted features) in the image, and can be assigned values based at least in part upon intensity values, RGB values, or the like corresponding to the respective pixels. If the DCN is utilized in connection with speech recognition, the set of input units may correspond to samples of the speech waveform, or to features extracted from speech waveforms, such as power spectra or cepstral coefficients. Note that the use of the speech waveform as the raw feature for a speech recognizer is not a far-fetched idea. An early study of an HMM-like system (i.e., the hidden filter) that models the speech waveform directly as the observation can be found in [9], and many years later the use of the more powerful Restricted Boltzmann Machine (RBM) overcame some of the difficulties encountered earlier [10].

The hidden layer of the lowest module of a DCN comprises a set of non-linear units that are mapped to the input units by way of a first, lower-layer weight matrix, which we denote by W. For instance, the weight matrix may comprise randomly generated values between zero and one, or the weights of an RBM trained separately. The non-linear units may be sigmoidal units that are configured to perform non-linear operations on weighted outputs from the input units (weighted in accordance with the first weight matrix W).

The second, linear layer in any module of a DCN includes a set of output units that represent the targets of classification. For instance, if the DCN is configured to perform digit recognition (e.g., the digits 1-10), then the
plurality of output units may be representative of the values 1, 2, 3, and so forth up to 10 with a 0-1 coding scheme. If the DCN is configured to perform ASR, then the output units may be representative of phones, HMM states of phones, or context-dependent HMM states of phones, in a way that is similar to [5][6]. The non-linear units in each module of the DCN may be mapped to a set of the linear output units by way of a second, upper-layer weight matrix, which we denote by U. This second weight matrix can be learned by way of a batch learning process, such that learning can be undertaken in parallel. Convex optimization can be employed in learning U. For instance, U can be learned based at least in part upon the first weight matrix W, the values of the coded classification targets, and the values of the input units.

Fig. 1: Block diagram showing two of many modules in a DCN and their connection; note the overlapping of the two linear layers in the two adjacent modules.

As indicated above, the DCN includes a set of serially connected, overlapping, and layered modules, wherein each module includes the aforementioned three layers -- a first linear layer that includes a set of linear input units whose number equals the dimensionality of the input features, a hidden layer that comprises a set of non-linear units whose number is a tunable hyper-parameter, and a second linear layer that comprises a plurality of linear output units whose number equals that of the target classification classes (e.g., the total number of context-dependent phones clustered by the decision tree used in [5][6]). The modules are referred to herein as being layered because the output units of a lower module are a subset of the input units of an adjacent higher module in the DCN. More specifically, in the second module, which is directly above the lowest module in the DCN, the input units can include the output units of the lower module(s). The input units can additionally include the raw training data; in other words, the output units of the lowest module can be appended to the input units in the second module, such that the input units of the second module also include the output units of the lowest module. A block diagram showing two of many modules in a DCN and their connection is shown in Fig. 1. The sharing or overlapping of two adjacent modules in a DCN is represented explicitly by the overlapping portion of the two large boxes labeled MODULE 1 and MODULE 2, respectively, in Fig. 1. This particular way of serially connecting adjacent modules in the DCN has been motivated partly by our earlier work on the deep-structured conditional random field [11].

In Fig. 2, we show the information flow within each module of the DCN that is not at the lowest layer. The part of the input units in any non-bottom module corresponding to the raw training data can be mapped to the hidden units by the first weight matrix described earlier, denoted in Fig. 2 as Wrbm. (In our experiments reported in Section 4, the use of a separately trained RBM to initialize these weights gave much better results than all other ways of initialization.) The portion of the input units in the module corresponding to the output units of the lower module can be mapped to the common set of hidden units by a new weight matrix, which may be initialized by, for example, random numbers. We denote this set of weights by Wran in Fig. 2. Thereafter, the aforementioned second weight matrix U, which connects the hidden units and the linear output units of this module, can be learned by way of convex optimization. This operation is represented by the box labeled "Learning Component" in Fig. 2.

Fig. 2: Information flow in one typical module, which is not at the lowest layer, of the DCN.

The pattern discussed above, in which the output units of a lower module are included as a portion of the input units of an adjacent higher module of the DCN and a weight matrix describing the connections between the hidden units and the linear output units is thereafter learned via convex optimization, can continue for many modules, e.g., tens to hundreds of modules in our experiments. A resultant learned DCN may then be deployed in an automatic classification task such as frame-level speech phone or state classification. Connecting the DCN's output to an HMM or any dynamic programming device enables continuous speech recognition. Details of this final step can be found in [5] and will not be dealt with in the remaining part of this paper.
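To make the above description concrete, the following NumPy sketch builds one DCN module (linear input units, sigmoidal hidden units with lower-layer weights W, linear output units with upper-layer weights U) and stacks modules by appending each module's output units to the raw input, as in Fig. 1. This is our own illustration rather than the authors' implementation: the single random W drawn over the whole concatenated input (instead of separate Wrbm and Wran blocks), the small ridge term added for numerical stability, and the array shapes are all assumptions on our part. U is obtained by a (here unweighted) convex least-squares fit; Section 3 gives the weighted form as Eq. (1).

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_module(X, T, n_hidden, rng, reg=1e-3):
    """One DCN module. X: (n_features, n_samples); T: (n_classes, n_samples),
    0-1 coded targets. W is drawn at random here (the paper prefers an
    RBM-trained W for the lowest module); U is the closed-form least-squares
    solution, so learning U is convex given W."""
    W = rng.uniform(0.0, 1.0, size=(X.shape[0], n_hidden))   # lower-layer weights
    H = sigmoid(W.T @ X)                                      # hidden-layer outputs
    # U = (H H^T + reg I)^-1 H T^T  (ridge-regularized least squares; reg is our addition)
    U = np.linalg.solve(H @ H.T + reg * np.eye(n_hidden), H @ T.T)
    return W, U

def predict_module(X, W, U):
    return U.T @ sigmoid(W.T @ X)                             # linear output units

def fit_dcn(X_raw, T, n_modules, n_hidden, seed=0):
    """Stack modules: each higher module sees the raw input with the
    immediately lower module's output units appended, as in Fig. 1."""
    rng = np.random.default_rng(seed)
    modules, X = [], X_raw
    for _ in range(n_modules):
        W, U = fit_module(X, T, n_hidden, rng)
        Y = predict_module(X, W, U)          # output of this module
        modules.append((W, U))
        X = np.vstack([X_raw, Y])            # append outputs to the raw input
    return modules

def predict_dcn(X_raw, modules):
    """Forward pass through the stack; returns the top module's class scores."""
    X = X_raw
    for W, U in modules:
        Y = predict_module(X, W, U)
        X = np.vstack([X_raw, Y])
    return Y

For a 0-1 coded target matrix T, fit_dcn(X_raw, T, n_modules=3, n_hidden=3000) returns the per-module weight pairs, and predict_dcn gives the top module's scores for frame-level classification; all hyper-parameter values here are placeholders.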
3. DCN Fine Tuning in Batch Mode

Unlike in the DBN, the "fine tuning" algorithm for the DCN weights that we developed recently is confined within each module, rather than spanning all layers globally. It is batch-mode based, rather than stochastic; hence it is naturally parallelizable. Further, it makes direct use of the DCN structure, in which a strong constraint is imposed between the upper layer's weights, U, and the lower layer's weights, W, within the same module, in the form of the weighted pseudo-inverse:

U = (H Λ Hᵀ)⁻¹ H Λ Tᵀ.   (1)
Here, H is the matrix of output vectors of the hidden units,

H = σ(Wᵀ X),   (2)

where X is the matrix of input vectors to the module and σ(·) is the element-wise sigmoid function; Λ is the weight matrix constructed to direct the optimization's search direction, and T is the matrix of classification target vectors.

We use batch-mode gradient descent to fine-tune W. With constraint (1) imposed, the mean square error E becomes a function of W alone, and its gradient is given by

∂E/∂W = 2 X [ Hᵀ ∘ (1 − H)ᵀ ∘ [ H†(H Λ Tᵀ)(T H†) − Λ Tᵀ(T H†) ] ],

where ∘ denotes element-wise multiplication and H† = Λ Hᵀ (H Λ Hᵀ)⁻¹. More detail of this DCN fine-tuning algorithm is provided in [13].
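A minimal NumPy sketch of one cycle of this fine tuning is given below, tying U to W through Eq. (1) and applying the gradient above. It is our own paraphrase of the procedure rather than the code of [13]; the diagonal representation of Λ as a per-sample weight vector, the fixed learning rate, and the number of gradient steps are assumptions.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fine_tune_W(X, T, W, Lam, lr=0.01, n_steps=10):
    """Batch-mode gradient descent on W with U tied to W through Eq. (1).
    X: (n_inputs, N) inputs, T: (n_classes, N) 0-1 coded targets,
    W: (n_inputs, n_hidden), Lam: (N,) per-sample weights (diagonal of Lambda)."""
    for _ in range(n_steps):
        H = sigmoid(W.T @ X)                      # Eq. (2): H = sigma(W^T X)
        HLam = H * Lam                            # H Lambda (Lambda diagonal)
        B = HLam @ H.T                            # H Lambda H^T
        A = HLam @ T.T                            # H Lambda T^T
        U = np.linalg.solve(B, A)                 # Eq. (1): weighted pseudo-inverse
        Hdag = np.linalg.solve(B, HLam).T         # H-dagger = Lambda H^T (H Lambda H^T)^-1
        THdag = T @ Hdag                          # T H-dagger
        inner = Hdag @ (A @ THdag) - (Lam[:, None] * T.T) @ THdag
        dW = 2.0 * X @ (H.T * (1.0 - H.T) * inner)   # gradient of E w.r.t. W
        W = W - lr * dW
    return W, U

In practice a small ridge term can be added to H Lambda H^T before solving, and Λ can simply be the identity (Lam of all ones) if no per-sample weighting is desired.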
4. Experimental Evaluation

4.1. MNIST experiments and results

Comprehensive experiments have been conducted to evaluate the performance of the DCN architecture and the related learning algorithms on the benchmark MNIST database; see [12] for details of this task. In brief, MNIST consists of binary images of handwritten digits and is one of the most common classification tasks for evaluating machine learning algorithms. We only briefly summarize our strong results on MNIST here, in Table 1.

Table 1: Classification error rate comparison: DBN vs. DCN.

| DBN [3] (Hinton's) | DBN (MSR's) | DCN (fine-tuning) | DCN (no fine-tuning) | Shallow (D)CN (fine-tuned single layer) |
| 1.20% | 1.06% | 0.83% | 0.95% | 1.10% |

4.2. TIMIT experiments

We now focus on our more recent experiments, in which we apply the same DCNs and the related learning algorithms developed on the MNIST task to the TIMIT speech database. Standard MFCC features were used, but with a longer-than-usual context window of 11 frames. This gives rise to a total of 39*11 = 429 elements in each feature vector, which we call a "super-frame", as the input to each module of the DCN. For the DCN output, we used 183 target class labels as "phone states". The 183 target labels correspond to all states of the 61 phone-like units defined in TIMIT.
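As an illustration of this feature construction, the 429-dimensional super-frames can be formed as sketched below. This is our own illustration; the handling of utterance boundaries is not specified in the paper, so the edge padding used here is an assumption.

import numpy as np

def make_super_frames(mfcc, context=5):
    """Stack each 39-dim MFCC frame with 5 frames of left and right context
    (11 frames total), giving 39*11 = 429-dim super-frames.
    mfcc: (n_frames, 39) array for one utterance; edges are padded by
    repeating the first/last frame (our assumption)."""
    padded = np.concatenate([np.repeat(mfcc[:1], context, axis=0),
                             mfcc,
                             np.repeat(mfcc[-1:], context, axis=0)], axis=0)
    windows = [padded[i:i + 2 * context + 1].reshape(-1)
               for i in range(len(mfcc))]
    return np.stack(windows)          # shape: (n_frames, 429)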
The standard training set of TIMIT, consisting of 462 speakers, was used for training the DCN. The total number of super-frames in the training data is about 1.12 million. The standard development set of 50 speakers, with a total of 122,488 super-frames, was used for cross validation. Results are reported on the standard 24-speaker core test set, consisting of 192 sentences with 7,333 phone tokens and 57,920 super-frames.

The algorithms presented in this paper are all batch-mode based. This is because the pseudo-inverse, as an instance of convex optimization with a global optimum, necessarily involves the full training set. However, in our experiments, where the full training set of TIMIT is represented by a very large 429-by-1.12M matrix, the batch-mode matrix multiplications required by the algorithms easily cause a single computer to run out of memory. (We had not implemented our learning algorithms over parallel machines at the time of carrying out the experiments reported here.) To overcome the CPU memory limitation, we block the training data into many mini-batches and use mini-batch training instead of full-batch training. After the final mini-batch is consumed in each training epoch, we then use a routine for block matrix multiplication and inversion, which incurs some undesirable but unavoidable waste of computation on a single CPU, to combine the full training data when implementing the estimation formula of Eq. (1), so as to approximate the effect of full batch-mode training as closely as we can.
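The idea behind this block-wise routine can be sketched as follows (our own reconstruction, not the authors' code): because H Λ Hᵀ and H Λ Tᵀ in Eq. (1) are sums of per-sample contributions, they can be accumulated mini-batch by mini-batch, so the 429-by-1.12M data matrix never has to be held in memory at once; Eq. (1) is then solved once from the accumulated sums.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def solve_U_blockwise(batches, W, n_hidden, n_classes):
    """Accumulate H Lambda H^T and H Lambda T^T over mini-batches, then solve
    Eq. (1) once. 'batches' yields (X_b, T_b, lam_b) tuples with
    X_b: (n_inputs, b), T_b: (n_classes, b), lam_b: (b,) sample weights."""
    B = np.zeros((n_hidden, n_hidden))     # running sum of H Lambda H^T
    A = np.zeros((n_hidden, n_classes))    # running sum of H Lambda T^T
    for X_b, T_b, lam_b in batches:
        H_b = sigmoid(W.T @ X_b)
        HL = H_b * lam_b                   # weight each sample (column)
        B += HL @ H_b.T
        A += HL @ T_b.T
    return np.linalg.solve(B, A)           # U = (H Lambda H^T)^-1 H Lambda T^T

Note that for a fixed W, accumulating the sums in this way yields the same U as a single full-batch computation, up to numerical precision.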
4.3. TIMIT results

The DCN, like the DBN, has its strength mainly as a static pattern classifier. An HMM, or dynamic programming more generally, is a convenient tool for porting the strength of a static classifier to dynamic patterns, as we recently demonstrated with the DNN [5][6]. (We are nevertheless clearly aware that the unique elasticity of the temporal dynamics of speech, as explained in [1], would require temporally-correlated models better than the HMM for the ultimate success of ASR, and that integrating such a model with the DCN to form a coherent dynamic DCN is by itself a more challenging research direction beyond the scope of this paper.) Therefore, as our first step of experimentation, we focus here on evaluating the static classification ability of the DCN. To this end, we choose the frame-level phone-state classification error rate as the main evaluation criterion. In this case, we have a total of 183 state classes, three for each of the 61 phone labels defined in the TIMIT training set. The actual state labels are obtained by HMM forced alignment. We also show frame-level phone classification error rates for the 61 phone classes, where errors in the state within the same phone are not counted.

The results in Table 2 are obtained from a typical run of the DCN program with 6,000 hidden units in each module of the DCN, where "X (Y)" in the first column denotes the Xth layer of the DCN (counted from the bottom up) and the Yth epoch of the fine-tuning optimization. The hyper-parameters are tuned using the development set defined in TIMIT. We used a single-hidden-layer RBM, trained in the same way as in [4][5], to initialize the weights W at the lowest module of the DCN before applying fine tuning as described in Section 3. We have found empirically that if random noise is used for the initialization, the error rate becomes at least 30% (relative) higher than presented in Table 2.

Fine-tuned weights from lower modules are used to initialize the weights at higher modules. They are then appended with random weights associated with the output units of the immediately lower module before fine tuning at the current module.

Table 2: Frame-level classification error rates of phones (61 classes) and states (183 classes) as a function of the number of stacked DCN modules; an RBM is used to initialize the lowest-level network weights.

| Layer (Epoch) | Train State Err % | Dev. State Err % | Test Phone Err % | Test State Err % |
| 1 (1) | 27.19 | 49.50 | 39.18 | 49.83 |
| ...   | ...   | ...   | ...   | ...   |
| 1 (8) | 21.20 | 46.00 | 36.12 | 46.30 |
| 2 (1) | 13.01 | 44.44 | 34.87 | 44.88 |
| 3 (1) |  7.96 | 44.30 | 34.64 | 44.70 |
| 4 (1) |  5.14 | 44.22 | 34.67 | 44.65 |
| 5 (1) |  3.51 | 44.11 | 34.56 | 44.53 |
| 6 (1) |  2.57 | 44.25 | 34.83 | 44.70 |
| 7 (1) |  1.95 | 44.25 | 34.74 | 44.69 |
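For clarity, the two frame-level metrics reported in Tables 2 and 3 can be computed as sketched below. This is our own illustration; the assumption that the 183 state labels are laid out as three consecutive states per phone is ours, and the further folding of the 61 phones into the standard 39 classes would use the usual TIMIT folding table, which we omit here.

import numpy as np

def frame_error_rates(pred_state, true_state, states_per_phone=3):
    """Frame-level state and phone error rates from 183-class state decisions.
    pred_state, true_state: integer arrays of shape (n_frames,).
    Assumes integer division by 3 recovers the phone index (our assumption)."""
    state_err = np.mean(pred_state != true_state)
    phone_err = np.mean(pred_state // states_per_phone
                        != true_state // states_per_phone)
    return 100.0 * state_err, 100.0 * phone_err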
The most notable observation from Table 2 is that as the layers gradually add up, the error rates for the training, development, and core test sets continue to drop until over-fitting occurs at Layer 6 in this example. There has been very little published work on frame-level phone or phone-state classification. The closest work we have been able to find reported over 70% phone-state error rate with an easier set of 132 phone-state classes (compared with our 183 state classes), but on a more difficult speech database. We ran the DBN system of [7] on the same TIMIT data and found the corresponding frame-level phone-state error rate to be 45.04% (which gave a 22% phonetic recognition error rate after running a decoder with a standard bi-gram phonetic "language" model, as reported in [7]). This frame-level error rate achieved by the DBN is slightly higher than the DCN's error rate of 44.53% shown in Table 2.

Table 3 summarizes the results obtained with different hyper-parameters from those in Table 2. It shows the dependency of the frame-level classification error rates on the number of hidden units, which is fixed for all modules of the DCN in our current implementation. We also fold the 61 classes in the original TIMIT label set into the standard 39 classes; the corresponding results are presented in Table 3 as well. These results are obtained without the use of phone-bound state alignment; that is, there is no left-to-right constraint, and decisions are made at the frame level. They are also obtained without any phone-level "language" model.

Table 3: Frame-level classification percent error rates of phones (61 or folded 39 classes) and of phone states (183 classes) as a function of the size of the hidden layer in the DCN.

| Size of Hidden Units | Test Phone Err % (39 classes) | Test Phone Err % (61 classes) | Test State Err % (183 classes) |
| 3000 | 27.11 | 35.97 | 46.08 |
| 4000 | 26.37 | 35.27 | 45.39 |
| 6000 | 25.44 | 34.12 | 44.24 |
| 7000 | 25.22 | 34.04 | 44.04 |

5. Summary and Conclusions

We recently developed a DNN-based architecture for large-vocabulary speech recognition. While achieving remarkable success with this approach, we face a scalability problem in practical applications, e.g., voice search. In this paper we present a novel DCN architecture aimed at enabling scalability. Experimental results on both the MNIST and TIMIT tasks demonstrate higher classification accuracy than the DBN. The superiority of the DCN over the DBN is particularly strong in the MNIST task when we use a much deeper DCN than could be computationally afforded by the conventional DBN architecture and learning. While the basic module of the DCN reported in this paper is similar to the extreme learning machine in the literature (e.g., [14]), any simple or weak classifier can be embedded in the DCN architecture to make it stronger.

The future directions of our work include: 1) full exploration of the rich flexibility in the architecture and module type provided by the basic DCN framework presented in this paper; 2) addition of a dynamic-programming-based decoder on top of the final layer of the DCN to enable continuous phonetic or speech recognition; 3) learning (rather than tuning) of the hyper-parameters in the DCN; 4) development of speaker and environment adaptation techniques for the DCN; and 5) development of a temporal DCN which integrates generative dynamic models of speech (e.g., [15][16]) with the DCN architecture presented in this paper.

6. Acknowledgements

We are grateful for many helpful discussions with, and valuable suggestions and encouragement from, John Platt, Geoff Hinton, Dave Wecker, and Alex Acero. We also thank G.B. Huang for discussions on many possible basic modules of the DCN.

7. References

[1] L. Deng, D. Yu, and A. Acero, "Structured speech modeling," IEEE Trans. Audio, Speech & Language Proc., vol. 14, no. 5, pp. 1492-1504, September 2006.
[2] N. Morgan, "Deep and Wide: Multilayers in Automatic Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing, 2011 (in press).
[3] G. Hinton and R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[4] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Dec. 2009.
[5] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing, 2011 (in press).
[6] D. Yu, L. Deng, and G. Dahl, "Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition," NIPS Workshop on Deep Learning and Unsupervised Feature Learning, December 2010.
[7] A. Mohamed, D. Yu, and L. Deng, "Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition," in Interspeech, September 2010.
[8] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, "Binary Coding of Speech Spectrograms Using a Deep Auto-encoder," in Interspeech, Sept. 2010.
[9] H. Sheikhzadeh and L. Deng, "Waveform-Based Speech Recognition Using Hidden Filter Models: Parameter Selection and Sensitivity to Power Normalization," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 80-91, 1994.
[10] N. Jaitly and G. Hinton, "Learning a Better Representation of Speech Sound Waves Using Restricted Boltzmann Machines," in Proc. ICASSP, Prague, 2011.
[11] D. Yu, S. Wang, and L. Deng, "Sequential Labeling Using Deep-Structured Conditional Random Fields," IEEE J. Selected Topics in Sig. Proc., vol. 4, no. 6, pp. 965-973, Dec. 2010.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, pp. 2278-2324, 1998.
[13] D. Yu and L. Deng, "Accelerated Parallelizable Neural Networks Learning Algorithms for Speech Recognition," Proc. Interspeech 2011, accepted.
[14] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, pp. 489-501, 2006.
[15] J. Baker et al., "Research Developments and Directions in Speech Recognition and Understanding," IEEE Sig. Proc. Mag., vol. 26, pp. 75-80, May 2009.
[16] L. Deng, "Computational Models for Speech Production," chapter in Computational Models of Speech Pattern Processing, pp. 199-213, Springer, 1999.
