Here, H is the matrix of output vectors of the hidden units:

H = σ(W^T X),    (2)

Λ is the weight matrix constructed to direct the optimization's search direction, and T is the matrix of classification target vectors. We use batch-mode gradient descent to fine tune W, where the gradient of the mean square error E, after constraint (1) is imposed, is given by

∂E/∂W = 2X [ H^T ∘ (1 − H)^T ∘ [ H†(HΛT^T)(TH†) − ΛT^T(TH†) ] ],

and H† = ΛH^T(HΛH^T)^{-1}. More detail of this DCN fine-tuning algorithm is provided in [13].
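To make the update concrete, the following NumPy sketch computes the closed-form upper-layer weights and one gradient step on W for a single DCN module. It assumes Eq. (1) takes the weighted least-squares form U = (HΛH^T)^{-1} HΛT^T with a diagonal Λ over training frames; the shapes, learning rate, and function names are illustrative choices rather than details from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fine_tune_step(W, X, T, Lam, lr=0.1):
    """One batch-mode gradient step on the lower-layer weights W of one DCN module.

    Illustrative shapes: X (D x N) inputs, T (C x N) targets, W (D x L),
    Lam (N x N) diagonal weighting matrix; H (L x N) hidden outputs.
    """
    H = sigmoid(W.T @ X)                               # Eq. (2)
    Hp = Lam @ H.T @ np.linalg.inv(H @ Lam @ H.T)      # weighted pseudo-inverse of H
    U = Hp.T @ T.T                                     # assumed form of Eq. (1)
    # Gradient of the mean square error E w.r.t. W with U constrained as above.
    G = H.T * (1.0 - H).T * (Hp @ (H @ Lam @ T.T) @ (T @ Hp) - Lam @ T.T @ (T @ Hp))
    grad_W = 2.0 * X @ G
    return W - lr * grad_W, U
```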
4. Experimental Evaluation

4.1. MNIST experiments and results

Comprehensive experiments have been conducted to evaluate the performance of the DCN architecture and the related learning algorithms on the benchmark MNIST database; see [12] for details of this task. In brief, MNIST consists of binary images of handwritten digits and is one of the most common classification tasks for evaluating machine learning algorithms. We only briefly summarize our strong results on MNIST here in Table 1.

Table 1: Classification error rate comparison: DBN vs. DCN

  DBN [3] (Hinton's):                        1.20%
  DBN (MSR's):                               1.06%
  DCN (Fine-tuning):                         0.83%
  DCN (no Fine-tuning):                      0.95%
  Shallow (D)CN (Fine-tuned single layer):   1.10%
4.2. TIMIT experiments

We now turn to our more recent experiments, in which we apply the same DCNs and the related learning algorithms developed on the MNIST task to the TIMIT speech database. Standard MFCC features were used, but with a longer-than-usual context window of 11 frames. This gives a total of 39*11 = 429 elements in each feature vector, which we call a "super-frame", as the input to each module of the DCN. For the DCN output, we used 183 target class labels, or "phone states". These 183 target labels correspond to all the states of the 61 phone-like units defined in TIMIT.
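As an illustration of the input construction, the sketch below stacks a context window of 11 MFCC frames (39 coefficients each) into 429-dimensional super-frames. The edge padding by repeating the first and last frame is our assumption, since the paper does not specify how utterance boundaries are handled.

```python
import numpy as np

def make_super_frames(mfcc, context=11):
    """Stack `context` consecutive frames of a T x 39 MFCC matrix into
    T x (39 * context) super-frames, one per original frame."""
    half = context // 2
    padded = np.concatenate([np.repeat(mfcc[:1], half, axis=0),
                             mfcc,
                             np.repeat(mfcc[-1:], half, axis=0)], axis=0)
    n_frames = mfcc.shape[0]
    return np.concatenate([padded[i:i + n_frames] for i in range(context)], axis=1)

# Example: 300 frames of 39-dim MFCCs -> 300 super-frames of dimension 429.
frames = np.random.randn(300, 39)
print(make_super_frames(frames).shape)   # (300, 429)
```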
The standard TIMIT training set of 462 speakers was used for training the DCN. The total number of super-frames in the training data is about 1.12 million. The standard development set of 50 speakers, with a total of 122,488 super-frames, was used for cross validation. Results are reported on the standard 24-speaker core test set, consisting of 192 sentences with 7,333 phone tokens and 57,920 super-frames.
The algorithms presented in this paper are all batch-mode based. This is because the pseudo-inverse, as an instance of convex optimization with a global optimum, necessarily involves the full training set. However, in our experiments, where the full training set of TIMIT is represented by a very large 429-by-1.12M matrix, the various batch-mode matrix multiplications required by the algorithms easily cause a single computer to run out of memory. (We had not implemented our learning algorithms on parallel machines at the time of carrying out the experiments reported here.) To overcome this CPU memory limitation, we block the training data into many mini-batches and use mini-batch training instead of full-batch training. After the final mini-batch is consumed in each training epoch, we then use a routine for block matrix multiplication and inversion, which incurs some undesirable but unavoidable waste of computation on a single CPU, to combine the full training data when implementing the estimation formula of Eq. (1), in order to approximate the effect of batch-mode training as closely as we can.
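The blockwise routine can be pictured as accumulating the two matrix products needed by the pseudo-inverse over mini-batches and performing a single inversion per epoch. The sketch below again assumes the weighted least-squares form of Eq. (1); the batch interface, per-frame weights, and absence of regularization are illustrative choices, not details from the paper.

```python
import numpy as np

def estimate_U_blockwise(W, batches, n_hidden, n_classes):
    """Accumulate H Lam H^T and H Lam T^T over mini-batches, then solve once.

    `batches` yields (X_b, T_b, lam_b): a D x Nb input block, a C x Nb target
    block, and a length-Nb vector of per-frame weights (the diagonal of Lam).
    """
    A = np.zeros((n_hidden, n_hidden))    # running sum of H Lam H^T
    B = np.zeros((n_hidden, n_classes))   # running sum of H Lam T^T
    for X_b, T_b, lam_b in batches:
        H_b = 1.0 / (1.0 + np.exp(-(W.T @ X_b)))
        A += (H_b * lam_b) @ H_b.T
        B += (H_b * lam_b) @ T_b.T
    return np.linalg.solve(A, B)           # U = (H Lam H^T)^{-1} H Lam T^T
```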
4.3. TIMIT results

The DCN, like the DBN, derives its strength mainly as a static pattern classifier. An HMM or dynamic-programming decoder is a convenient tool for porting the strength of a static classifier to dynamic patterns, as we recently demonstrated with the DNN [5][6]. (We are nevertheless aware that the unique elasticity of the temporal dynamics of speech, as explained in [1], would require temporally correlated models better than the HMM for the ultimate success of ASR; integrating such a model with the DCN to form a coherent dynamic DCN is itself a more challenging research problem beyond the scope of this paper.) Therefore, as our first step of experimentation, we focus here on evaluating the static classification ability of the DCN. To this end, we choose the frame-level phone-state classification error rate as the main evaluation criterion. In this case, we have a total of 183 state classes, three for each of the 61 phone labels defined in the TIMIT training set. The actual state labels are obtained by HMM forced alignment. We also report frame-level phone classification error rates over the 61 phone classes, in which errors among states within the same phone are not counted.
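For scoring, the state-level and phone-level error rates can be computed from the same frame-level decisions. The sketch below assumes the 183 states are numbered so that the three states of each of the 61 phones are consecutive (integer division by 3 giving the phone index); the paper does not spell out its numbering, so this is only one plausible convention.

```python
import numpy as np

def frame_error_rates(state_scores, state_labels, states_per_phone=3):
    """Frame-level state and phone classification error rates.

    state_scores: N x 183 per-frame scores; state_labels: length-N reference
    state indices from HMM forced alignment.  Phone errors ignore confusions
    among the states of the same phone.
    """
    pred = np.argmax(state_scores, axis=1)
    state_err = np.mean(pred != state_labels)
    phone_err = np.mean(pred // states_per_phone != state_labels // states_per_phone)
    return state_err, phone_err
```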
The results in Table 2 are obtained from a typical run of the DCN program with 6,000 hidden units in each module of the DCN, where "X (Y)" in the first column denotes the Xth layer of the DCN (counted from the bottom up) and the Yth epoch of the fine-tuning optimization. The hyper-parameters are tuned on the development set defined in TIMIT. We used a single-hidden-layer RBM, trained in the same way as in [4][5], to initialize the weights W at the lowest module of the DCN before applying fine tuning as described in Section 3. We have found empirically that if random noise is used for the initialization instead, the error rate becomes at least 30% relative higher than presented in Table 2.

Fine-tuned weights from lower modules are used to initialize the weights at higher modules. They are then appended with random weights associated with the output units from the immediately lower module before fine tuning at the current module.
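The stacking step described above can be sketched as follows: the next module's input augments the raw super-frames with the prediction of the module below, and its lower-layer weight matrix starts from the fine-tuned weights of that module, with randomly initialized rows appended for the new input dimensions. The random-weight scale and the helper name are our illustrative choices.

```python
import numpy as np

def init_next_module(W_lower, Y_lower, X, scale=0.1, seed=0):
    """Build the input and initial lower-layer weights for the next DCN module.

    X: D x N raw super-frames; Y_lower: C x N predictions of the module below;
    W_lower: D x L fine-tuned lower-layer weights of the module below.
    """
    rng = np.random.default_rng(seed)
    X_next = np.vstack([X, Y_lower])                       # (D + C) x N input
    W_rand = scale * rng.standard_normal((Y_lower.shape[0], W_lower.shape[1]))
    W_next = np.vstack([W_lower, W_rand])                  # (D + C) x L weights
    return X_next, W_next
```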
Table 2. Frame-level classification error rates of phones (61 classes) and states (183 classes) as a function of the number of stacked DCN modules; an RBM is used to initialize the lowest-level network weights.

Layer (Epoch)   Train State Err %   Dev. State Err %   Test Phone Err %   Test State Err %
1 (1)           27.19               49.50              39.18              49.83
…               …                   …                  …                  …
1 (8)           21.20               46.00              36.12              46.30
2 (1)           13.01               44.44              34.87              44.88
3 (1)            7.96               44.30              34.64              44.70
4 (1)            5.14               44.22              34.67              44.65
5 (1)            3.51               44.11              34.56              44.53
6 (1)            2.57               44.25              34.83              44.70
7 (1)            1.95               44.25              34.74              44.69
The most notable observation from Table 2 is that, as layers are gradually added, the error rates on the training, development, and core test sets continue to drop until over-fitting occurs at Layer 6 in this example. There has been very little published work on frame-level phone or phone-state classification. The closest work we have been able to find reported over 70% phone-state error rate on an easier set of 132 phone-state classes (compared with our 183 state classes), but on a more difficult speech database. We ran the DBN system of [7] on the same TIMIT data and found the corresponding frame-level phone-state error rate to be 45.04% (which gave a 22% phonetic recognition error rate after running a decoder with a standard bi-gram phonetic "language" model, as reported in [7]). This frame-level error rate achieved by the DBN is slightly higher than the DCN's error rate of 44.53% shown in Table 2.

Table 3 summarizes the results obtained with different hyper-parameters than those in Table 2. It shows the dependency of the frame-level classification error rates on the number of hidden units, which is fixed across all modules of the DCN in our current implementation. We also fold the 61 classes in the original TIMIT label set into the standard 39 classes; the corresponding results are also presented in Table 3. These results are obtained without the use of phone-bound state alignment; that is, there is no left-to-right constraint, and decisions are made at the frame level. These results are also obtained without any phone-level "language" model.
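Folding to the 39-class set amounts to mapping each 61-class label through a fixed many-to-one dictionary before scoring. The sketch below shows the mechanism with a few entries of the commonly used folding (e.g., collapsing closures and silence labels); the complete mapping used in the paper is not listed there, so the table here is only illustrative.

```python
# A few entries of a standard TIMIT 61 -> 39 folding, shown only to illustrate
# the mechanism; the complete mapping is not given in the paper.
FOLD_39 = {"ao": "aa", "ax": "ah", "ix": "ih", "el": "l", "zh": "sh",
           "pcl": "sil", "tcl": "sil", "kcl": "sil", "h#": "sil", "pau": "sil"}

def fold61to39(labels):
    """Map a sequence of 61-class TIMIT labels to the folded 39-class set."""
    return [FOLD_39.get(lab, lab) for lab in labels]

print(fold61to39(["ao", "zh", "pcl", "iy"]))   # ['aa', 'sh', 'sil', 'iy']
```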
Table 3. Frame-level classification percent error rates of phones (61 or folded 39 classes) and of phone states (183 classes), as a function of the number of hidden units per layer in the DCN.

Size of Hidden Units   Test Phone Err % (39 classes)   Test Phone Err % (61 classes)   Test State Err % (183 classes)
3000                   27.11                           35.97                           46.08
4000                   26.37                           35.27                           45.39
6000                   25.44                           34.12                           44.24
7000                   25.22                           34.04                           44.04
5. Summary and Conclusions

We recently developed a DNN-based architecture for large-vocabulary speech recognition. While achieving remarkable success with this approach, we face a scalability problem in practical applications, e.g., voice search. In this paper we present a novel DCN architecture aimed at enabling scalability. Experimental results on both the MNIST and TIMIT tasks demonstrate higher classification accuracy than the DBN. The superiority of the DCN over the DBN is particularly strong on the MNIST task, as long as we use a much deeper DCN than could be computationally afforded by the conventional DBN architecture and learning. While the basic module of the DCN reported in this paper is similar to the extreme learning machine in the literature (e.g., [14]), any simple or weak classifier can be embedded in the DCN architecture to make it stronger.

The future directions of our work include: 1) full exploration of the rich flexibility in architecture and module type provided by the basic DCN framework presented in this paper; 2) addition of a dynamic-programming-based decoder on top of the final layer of the DCN to enable continuous phonetic or speech recognition; 3) learning (rather than tuning) of the hyper-parameters in the DCN; 4) development of speaker and environment adaptation techniques for the DCN; and 5) development of a temporal DCN that integrates generative dynamic models of speech (e.g., [15][16]) with the DCN architecture presented in this paper.

6. Acknowledgements

We are grateful for many helpful discussions with, and valuable suggestions and encouragement from, John Platt, Geoff Hinton, Dave Wecker, and Alex Acero. We also thank G.B. Huang for discussions on many possible basic modules of the DCN.

7. References

[1] L. Deng, D. Yu, and A. Acero, "Structured speech modeling," IEEE Trans. on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1492-1504, September 2006.
[2] N. Morgan, "Deep and Wide: Multilayers in Automatic Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing, 2011 (in press).
[3] G. Hinton and R. Salakhutdinov, "Reducing the Dimensionality of Data with Neural Networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[4] A. Mohamed, G. Dahl, and G. Hinton, "Deep belief networks for phone recognition," NIPS Workshop on Deep Learning for Speech Recognition and Related Applications, Dec. 2009.
[5] G. Dahl, D. Yu, L. Deng, and A. Acero, "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition," IEEE Trans. on Audio, Speech, and Language Processing, 2011 (in press).
[6] D. Yu, L. Deng, and G. Dahl, "Roles of Pre-Training and Fine-Tuning in Context-Dependent DBN-HMMs for Real-World Speech Recognition," NIPS Workshop on Deep Learning and Unsupervised Feature Learning, December 2010.
[7] A. Mohamed, D. Yu, and L. Deng, "Investigation of Full-Sequence Training of Deep Belief Networks for Speech Recognition," in Proc. Interspeech, September 2010.
[8] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, "Binary Coding of Speech Spectrograms Using a Deep Auto-encoder," in Proc. Interspeech, Sept. 2010.
[9] H. Sheikhzadeh and L. Deng, "Waveform-Based Speech Recognition Using Hidden Filter Models: Parameter Selection and Sensitivity to Power Normalization," IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 80-91, 1994.
[10] N. Jaitly and G. Hinton, "Learning a Better Representation of Speech Sound Waves Using Restricted Boltzmann Machines," in Proc. ICASSP, Prague, 2011.
[11] D. Yu, S. Wang, and L. Deng, "Sequential Labeling Using Deep-Structured Conditional Random Fields," IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 6, pp. 965-973, Dec. 2010.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, pp. 2278-2324, 1998.
[13] D. Yu and L. Deng, "Accelerated Parallelizable Neural Networks Learning Algorithms for Speech Recognition," in Proc. Interspeech, 2011 (accepted).
[14] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme Learning Machine: Theory and Applications," Neurocomputing, vol. 70, pp. 489-501, 2006.
[15] J. Baker et al., "Research Developments and Directions in Speech Recognition and Understanding," IEEE Signal Processing Magazine, vol. 26, pp. 75-80, May 2009.
[16] L. Deng, "Computational Models for Speech Production," chapter in Computational Models of Speech Pattern Processing, pp. 199-213, Springer, 1999.