Multi-task Learning of Deep Neural Networks for Low-resource Speech Recognition
Dongpeng Chen and Brian Kan-Wing Mak
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 7, July 2015

Abstract—We propose a multitask learning (MTL) approach to improve low-resource automatic speech recognition using deep neural networks (DNNs) without requiring additional language resources. We first demonstrate that the performance of the phone models of a single low-resource language can be improved by training its grapheme models in parallel under the MTL framework. If multiple low-resource languages are trained together, we investigate learning a set of universal phones (UPS) as an additional task, again in the MTL framework, to improve the performance of the phone models of all the involved languages. In both cases, the heuristic guideline is to select a task that may exploit extra information from the training data of the primary task(s). In the first method, the extra information is the phone-to-grapheme mappings, whereas in the second method, the UPS helps to implicitly map the phones of the multiple languages among each other. In a series of experiments using three low-resource South African languages in the Lwazi corpus, the proposed MTL methods obtain significant word recognition gains when compared with single-task learning (STL) of the corresponding DNNs, or with ROVER, which combines results from several STL-trained DNNs.

Index Terms—Deep neural network (DNN), low-resource speech recognition, multitask learning, universal grapheme set, universal phone set.

I. INTRODUCTION

A semi-automatic approach to preparing a pronunciation dictionary is to first create a small primary dictionary manually, and then extend it to a large dictionary by applying grapheme-to-phoneme conversion [1]. However, the performance of the final dictionary highly depends on the quality of the primary one. A simpler solution is to abandon the phone-based models and employ graphemes as the basic acoustic units, because grapheme modeling [2], [3], [4], [5], [6] does not require a phonetic dictionary¹. Many languages that use an alphabetic writing system are suitable for grapheme-based acoustic modeling, and their grapheme set is usually selected to be the same as their alphabet.

While ways are sought to create resources for a new language more efficiently, other ways are proposed to reduce the amount of training data required for robust acoustic modeling. For systems based on hidden Markov modeling (HMM), one common solution is parameter tying or sharing [7], [8], [9], [10], [11]. Another method is the basis approach, in which a relatively small set of basis vectors or functions is computed so that other model parameters may be derived from them. Successful examples include the tied-mixture HMM [12], [13], the subspace Gaussian mixture model (SGMM) [14], the Bayesian sensing HMM [15], and the canonical state model [16]. On the other hand, transfer learning [17] has also proved effective when out-of-domain data are available. Notable efforts include cross-lingual [18], [19] and multi-lingual acoustic modeling.
In this paper, we would like to improve the estimation of the phonetic models of a low-resource language by learning other related task(s) together under the multi-task learning (MTL) [31] framework using deep neural networks² [32]. According to the theory of MTL, related tasks can be jointly learned to improve the generalization performance of each task if they share the same inputs and some internal representation; the effect is more prominent when the amount of training data is small. We believe that humans do not learn the sounds of a spoken language in isolation but together with other cues, such as their graphemes, the lexical contexts, and the language's similarities with or differences from other languages. Our major contribution in this paper is the successful identification of appropriate related tasks, without requiring additional resources, so that phonetic models of low-resource languages can be better learned under the MTL framework. The heuristic guideline is to find a secondary task, related to the primary task, that can exploit extra information from its training data. In our case, the extra information being exploited is the implicit mapping between the primary targets and the secondary targets. More specifically, we first propose learning the graphemes of the same language as the secondary task³ in Section III. Grapheme-based acoustic modeling does not require additional language resources besides those already required by phone-based acoustic modeling. We believe that the grapheme learning task exploits extra information in the acoustic training data to learn implicit phone-to-grapheme mappings of the language. Then, if several low-resource languages are to be learned together, we propose in our second method, in Section IV, to derive a UPS among the languages and use the UPS learning as an additional secondary task in the learning of the multi-lingual phonetic models. The UPS learning not only implicitly encodes an indirect mapping among the phones of all the involved languages, but also serves as a regularizer for the learning of the phonetic models of each language. Finally, we combine the above two methods and obtain a further performance gain.

²Although we mostly take the estimation of phone-based acoustic models as the primary task in the proposed MTL framework, the approach is general and flexible, and one may take another task, such as the estimation of grapheme-based acoustic models, as the primary task as well.

³The first method has been presented in our conference paper [33].

The rest of this paper is organized as follows. We first review multi-task learning in Section II. The first MTL approach, using joint phone-based and grapheme-based acoustic modeling, is described in Section III. Section IV describes how multi-lingual acoustic modeling is conducted in an MTL framework by learning universal phone/grapheme models in parallel. Experimental evaluation on three low-resource South African languages is reported in Section V, which is followed by concluding remarks in Section VI.

II. MULTI-TASK LEARNING

Multi-task learning (MTL) [31] or learning to learn [34] is a machine learning approach that aims at improving the generalization performance of a learning task by jointly learning multiple related tasks together. It is found that if the multiple tasks are related and share some internal representation, then through learning them together they are able to transfer knowledge to one another. As a result, the common internal representation thus learned helps the models generalize better on future unseen data. In [35], a statistical learning theory approach to MTL is developed and a generalization bound on the average error of MTL is derived. In [35], [36], the notion of relatedness among multiple tasks is defined in a particular way so that a tighter generalization bound for each learning task can be derived. In his thesis [31], Caruana postulated some requirements for related tasks if their joint learning in the MTL approach is to work well:
(a) related tasks must share input features, and
(b) related tasks must share hidden units to benefit each other when trained with MTL-backprop.
He also listed some task relationships that enable MTL-backprop to learn a better internal representation (or a more generalized model) of the related tasks: data amplification, eavesdropping, attribute selection, representation bias, and overfitting prevention.

MTL has been applied successfully to many speech, language, image, and vision tasks using neural networks (NNs), because the hidden layers of an NN naturally capture learned knowledge that can be readily transferred or shared across multiple tasks. For example, [37] applies MTL to a single convolutional neural network to produce state-of-the-art performance on several language processing prediction tasks; [38] improves intent classification in goal-oriented human-machine spoken dialog systems, which is particularly successful when the amount of labeled training data is limited; and in [39], the MTL approach is used to perform multi-label learning in an image annotation application.

A. Multi-Task Learning in ASR Using DNNs

In ASR, MTL has been applied to improving performance robustness using recurrent neural networks [40]. With the emergence of the recently very successful deep neural network (DNN), one expects that DNNs may be used to further improve MTL performance; we call the resulting deep neural networks MTL-DNNs. For instance, Seltzer and Droppo investigated the training of monophone models for TIMIT phone recognition together with the learning of the phone labels, state contexts, or phone contexts [41]; significant gains were reported. However, that work did not model triphone states directly, and it is not clear whether it is really better to use the triphone contexts as the secondary task in learning monophone state posteriors in the MTL framework. MTL has also been employed successfully to train multi-lingual DNNs [21], [22], [42]. In these works, during pre-training and subsequent fine-tuning, data from all training languages are fed through the common hidden layers, but each language maintains its own language-specific output layer. However, unlike the MTL work in [31] and [41], for each input only one task is being trained, and the relatedness among the tasks is exploited only by enforcing common weights in the hidden layers. We will follow the notation in [22] and call these multi-lingual DNNs with shared hidden layers SHL-MDNNs.
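As a concrete illustration of this routing, the following sketch (our own, not code from the paper; the layer sizes, language names, and senone counts are arbitrary) shows an SHL-MDNN forward pass in which every frame traverses the shared hidden layers but only its own language's softmax head produces an output:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_in, n_hid = 440, 1024                    # e.g., 40-dim features x 11 frames (arbitrary)
senones = {"afrikaans": 500, "sesotho": 600, "siswati": 550}   # illustrative counts

# Hidden layers are shared by all languages; each language owns one softmax head.
W_shared = [rng.normal(0, 0.01, (n_in, n_hid)),
            rng.normal(0, 0.01, (n_hid, n_hid))]
heads = {lang: rng.normal(0, 0.01, (n_hid, n)) for lang, n in senones.items()}

def forward(x, lang):
    h = x
    for W in W_shared:                     # common internal representation
        h = np.maximum(0.0, h @ W)         # ReLU hidden layers (illustrative)
    return softmax(h @ heads[lang])        # only this language's head fires

x = rng.normal(size=(1, n_in))
posteriors = forward(x, "sesotho")         # posteriors over Sesotho senones only
```

In the MTL-DNN formulation that follows, by contrast, two or more heads are active—and trained—for every input frame.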
B. Our MTL-DNN Formulation

We would like to apply the MTL framework to improve phone-based acoustic models for low-resource ASR using …
The target values of exactly one triphone senone output unit and one trigrapheme senone output unit will be set to 1.0 per training frame. During decoding, each senone posterior probability is converted back to a scaled likelihood by dividing it by its prior as follows:

$$p(\mathbf{x}_t \mid s) \propto \frac{P(s \mid \mathbf{x}_t)}{P(s)} \qquad (4)$$
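In code, Eq. (4) is a one-line operation, usually carried out in the log domain; the sketch below is our own illustration (the function and the toy numbers are ours), with the priors standing in for the senones' relative frequencies in the forced-aligned training data:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Eq. (4): p(x|s) is proportional to P(s|x) / P(s), done in the log domain.

    log_posteriors: (T, S) frame-level senone log posteriors from the DNN.
    log_priors:     (S,)   senone log priors, e.g. label frequencies from the
                           forced alignment of the training data.
    """
    return log_posteriors - log_priors   # division becomes subtraction in logs

# Toy example: 2 frames, 3 senones.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
loglik = scaled_log_likelihoods(np.log(post), np.log(priors))
```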
TABLE I
THE UNIVERSAL PHONE SET (UPS) AND THE PHONEMES' USAGE IN THREE SOUTH AFRICAN LANGUAGES

Fig. 2. A multi-lingual MTL-DNN system (ML-MTL-DNN-UPS) with shared hidden layers and an extra output layer of UPS states. Outputs from the two separate tasks, labelled in green, are turned "on" by an input acoustic vector.

The set of UPS monophone states is denoted as $S^{ups}$. For each input vector $\mathbf{x}_t$ from the $k$-th language, only two tasks are involved: the triphone senones of the $k$-th language and the UPS monophone states are activated using the softmax function of Eq. (2). Their corresponding per-frame cross-entropies, $E^{ph_k}(\mathbf{x}_t)$ and $E^{ups}(\mathbf{x}_t)$, the latter involving the weights of the output layer of the UPS states, are given by Eq. (3). Finally, the training objective function over all data of the multiple languages is modified from Eq. (1) as follows:

$$E = \sum_{k=1}^{K} \sum_{\mathbf{x}_t \in \mathcal{D}_k} \Big( \lambda_k\, E^{ph_k}(\mathbf{x}_t) + \lambda_{ups}\, E^{ups}(\mathbf{x}_t) \Big) \qquad (5)$$

where $\mathcal{D}_k$ is the training data of the $k$-th language, and $\lambda_k$ and $\lambda_{ups}$ are the task weights of the $k$-th language and the UPS task, respectively.

Eq. (5) shows that our multi-lingual MTL-DNN training is different from SHL-MDNN training and may be considered a regularized version of the latter—a form of regularized MTL [47]. If the language task weights are large, it will be the same as SHL-MDNN training; if the UPS task weight is large, it will be reduced to UPS training. Since UPS models are usually not as good as language-specific models [30], the learned UPS output layer will not be used in recognition; it is only used to help enforce the cross-lingual phone mappings during MTL-DNN training. The training procedure of ML-MTL-DNN-UPS is similar to that of MTL-DNN-PG in Section III.

From the perspective of regularization, we prefer a simpler regularizer, and thus we use UPS monophone states instead of UPS triphone senones as the common task. In some preliminary experiments, we also empirically found that they gave similar results.
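The per-frame computation behind Eq. (5) can be sketched as follows; this is our own illustration with our own symbol and key names, standing in for the actual softmax and cross-entropy definitions of Eqs. (2) and (3):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, target_index):
    return -np.log(p[target_index] + 1e-12)

def ml_mtl_ups_loss(frame, heads, lam_lang, lam_ups):
    """Per-frame term of Eq. (5): only two output layers contribute for a
    frame of language k -- language k's triphone head and the UPS head."""
    h, k = frame["hidden"], frame["lang"]
    ce_lang = cross_entropy(softmax(h @ heads[k]), frame["senone"])
    ce_ups = cross_entropy(softmax(h @ heads["ups"]), frame["ups_state"])
    return lam_lang[k] * ce_lang + lam_ups * ce_ups

# Toy usage with arbitrary sizes.
rng = np.random.default_rng(0)
heads = {"sesotho": rng.normal(0, 0.1, (8, 5)), "ups": rng.normal(0, 0.1, (8, 4))}
frame = {"hidden": rng.normal(size=8), "lang": "sesotho", "senone": 2, "ups_state": 1}
loss = ml_mtl_ups_loss(frame, heads, lam_lang={"sesotho": 0.5}, lam_ups=0.5)
```

Summing this term over all frames of all $K$ languages gives Eq. (5); letting $\lambda_{ups}$ go to zero recovers plain SHL-MDNN training, which is exactly the regularization view taken above.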
C. Extensions

As said in Section III, grapheme-based acoustic modeling is a viable solution for low-resource language ASR. Method 2 can be easily modified to use graphemes instead of phones as the modeling units. A universal grapheme set (UGS) may again be created by simply taking the union of the grapheme sets of all the languages under investigation. The UGS for the three languages in our experiments consists of 30 graphemes, including one that denotes silence. We will call the grapheme-based multi-lingual MTL-DNN with the extra UGS learning task ML-MTL-DNN-UGS.

Obviously, one may further combine Method 1 and Method 2 to jointly model multi-lingual phones and graphemes using the UPS and UGS as the extra learning tasks, and we will label such a network ML-MTL-DNN-UPS-UGS. If there are $K$ languages to learn simultaneously, then there will be $2K + 2$ (softmax) output layers in the model. Each input acoustic vector from the $k$-th language will activate four targets: one triphone senone and one trigrapheme senone of the $k$-th language, one universal (mono)phone state, and one universal (mono)grapheme state.
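The bookkeeping behind these extensions is simple, as the following sketch shows (our own illustration; the per-language grapheme inventories are placeholders, not the actual Lwazi grapheme sets):

```python
# Form the universal grapheme set (UGS) as the union of the per-language
# grapheme sets plus a shared silence symbol.
graphemes = {
    "afrikaans": set("abcdefghijklmnopqrstuvwxyz"),   # placeholder inventory
    "sesotho":   set("abdefghijklmnopqrstuwy"),       # placeholder inventory
    "siswati":   set("abcdefghijklmnopstuvwyz"),      # placeholder inventory
}
ugs = set.union(*graphemes.values()) | {"sil"}

# In ML-MTL-DNN-UPS-UGS, a training frame from language k activates four targets.
def targets_for_frame(k, ph_senone, gr_senone, ups_state, ugs_state):
    return {f"triphone[{k}]": ph_senone,       # language-specific phone task
            f"trigrapheme[{k}]": gr_senone,    # language-specific grapheme task
            "ups": ups_state,                  # universal phone task
            "ugs": ugs_state}                  # universal grapheme task
```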
V. EXPERIMENTAL EVALUATION

The two proposed multi-lingual MTL-DNN training methods were evaluated on three low-resource South African languages in the Lwazi project [45].

A. The Lwazi Speech Corpus

The Lwazi project was set up to develop a telephone-based, speech-driven information system in South Africa. In the project, the Lwazi ASR corpus [45] was collected over a telephone channel from approximately 200 native speakers for each of the 11 official languages in South Africa. Each speaker produced approximately 30 utterances, of which 16 are phonetically balanced read speech and the remainder are elicited short words such as answers to open questions, answers to yes/no questions, spelt words, dates, and numbers. A 5,000-word pronunciation dictionary was also created for each language, which covers only the most common words in the language. Thus, for the phone-based experiments, the DictionaryMaker software [48] was used to generate dictionary entries for the uncovered words in the corpus.

Three languages were selected from the corpus for our evaluations: Afrikaans, Sesotho, and siSwati. Afrikaans is a Low Franconian, West Germanic language that originated from Dutch; Sesotho is a Southern Bantu language, closely related to other languages in the Sotho-Tswana language group; siSwati is also a Southern Bantu language, but is more closely related to the Nguni language group. Thus, the three chosen languages come from different language families. The numbers of phones and graphemes in the three languages and the sizes of the corresponding universal phone and grapheme sets are shown in Table II. Since the corpus does not define an official training, development, and test set for each language, we followed the partitions used in [6]. In addition, in order to evaluate the efficacy of MTL in scenarios where acoustic data is scarce, smaller data sets consisting of approximately one hour of speech were further created by randomly sampling from the full training set of each language. Care had been taken to ensure that there are roughly the same number of utterances for each speaker. Details of the various data sets are listed in Table III.
B. Baseline Systems

For each language, HMM-based recognition systems were built using the two proposed MTL-DNN training methods, and they are compared with two kinds of baseline systems: GMM-HMMs and STL-DNN-HMMs.

Training of the GMM-HMM Baselines: Acoustic models of all phone-based and grapheme-based baseline systems were strictly left-to-right, 3-state, continuous-density hidden Markov models (HMMs). HMM state emission probabilities were modeled by Gaussian mixture models (GMMs) with at most 16 components. The GMM-HMMs were trained using maximum-likelihood estimation with 39-dimensional acoustic feature vectors extracted every 10 ms over a window of 25 ms from the training utterances. The acoustic features consist of the first 13 PLP coefficients, including c0, and their first- and second-order derivatives. Speaker-based cepstral mean subtraction and variance normalization were applied to the extracted features before they were used. Moreover, states were tied using phonetic decision trees, and the optimal number of tied states (senones) was determined using the development data set.

Training of the STL-DNN Baselines: Single-task learning (STL) DNNs were trained to classify the central frame of each 15-frame acoustic context window. Feature vectors in the window were concatenated and then normalized to have zero mean and unit variance over the whole training set.
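The DNN input construction just described can be sketched as follows (our own illustration; the paper does not say how utterance boundaries are padded, so the edge-repetition here is an assumption):

```python
import numpy as np

def stack_context(feats, left=7, right=7):
    """Build 15-frame context windows (7+1+7, the paper's setup).

    feats: (T, D) per-frame acoustic features; returns (T, 15*D) DNN inputs.
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    T = len(feats)
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

def global_mvn(train_windows):
    """Normalize to zero mean and unit variance over the whole training set."""
    mu = train_windows.mean(axis=0)
    sigma = train_windows.std(axis=0) + 1e-8   # guard against zero variance
    return (train_windows - mu) / sigma, (mu, sigma)
```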
C. Decoding

The standard Viterbi algorithm was used for decoding in all experiments, with a bigram language model (LM) for each language. Each LM was trained using only the transcriptions in the training set of its language. The test-set perplexities of these LMs are given in Table II. All system parameters, such as the grammar factor and insertion penalty, were tuned using the development data.

D. Evaluation 1: Joint Acoustic Modeling of Phones and Graphemes of Lwazi Languages by MTL-DNN-PG

For each language, a single MTL-DNN, labelled MTL-DNN-PG, was trained to estimate the posterior probabilities of both the triphone and the trigrapheme senones of the language.

MTL-DNN Training: The construction of the MTL-DNN-PG is very similar to that of an STL-DNN. Firstly, the weights in its hidden layers were initialized by the weights of the same DBN as the corresponding STL-DNNs. But now the output layer in the MTL-DNN-PG consists of two separate softmax layers: one for the primary task and one for the secondary task. For each training sample, two error signals—one from each task's softmax layer—were propagated back to the hidden layers. Thus, the learning rate of the hidden layers was set to half of the original one, while that of the output layer remains the same. Otherwise, the training procedure was the same as that of the STL-DNN. In addition, the task weights were set to 0.5 for both tasks, as other values did not make much difference in our preliminary experiments.
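The update just described can be condensed into the following sketch (our own simplification: a single shared hidden layer instead of the real DBN-initialized deep stack, and arbitrary names). Two error signals, one per softmax head, meet in the shared layer; the task weights are 0.5, and the hidden-layer learning rate is halved:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mtl_pg_step(x, y_ph, y_gr, W_h, W_ph, W_gr, lr=0.1, lam=0.5):
    """One MTL-DNN-PG minibatch update. x: (B, D); y_ph, y_gr: (B,) integer
    targets for the triphone-senone and trigrapheme-senone tasks."""
    h = np.maximum(0.0, x @ W_h)              # shared hidden layer (ReLU)
    p_ph, p_gr = softmax(h @ W_ph), softmax(h @ W_gr)

    # Softmax + cross-entropy gradient at each head, scaled by task weight 0.5.
    g_ph, g_gr = p_ph.copy(), p_gr.copy()
    g_ph[np.arange(len(y_ph)), y_ph] -= 1.0
    g_gr[np.arange(len(y_gr)), y_gr] -= 1.0
    g_ph *= lam / len(x)
    g_gr *= lam / len(x)

    # The two error signals merge in the shared hidden layer.
    dh = (g_ph @ W_ph.T + g_gr @ W_gr.T) * (h > 0)

    W_ph -= lr * (h.T @ g_ph)                 # full learning rate at the heads
    W_gr -= lr * (h.T @ g_gr)
    W_h -= (lr / 2) * (x.T @ dh)              # halved rate in the shared layer
```

Halving the shared-layer rate roughly compensates for those weights now receiving gradient from two tasks for every frame.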
TABLE IV
LWAZI: WERS (%) OF MONO-LINGUAL SYSTEMS TRAINED ON THE FULL TRAINING SETS. FIGURES IN () ARE #SENONES AND FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE PHONETIC GMM-HMM BASELINE

TABLE V
LWAZI: WERS (%) OF MONO-LINGUAL SYSTEMS TRAINED ON 1-HOUR SMALL TRAINING SETS. FIGURES IN () ARE #SENONES AND FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE PHONETIC GMM-HMM BASELINE

TABLE VI
NUMBER OF MODEL PARAMETERS WHEN THE MODELS WERE ESTIMATED USING THE REDUCED DATA SETS (IN MILLIONS)

Results and Discussions: The evaluation was first performed using the full training data set of each language, and then repeated with the reduced training sets to investigate the effect of a limited amount of training data on MTL. The recognition performance of the MTL-DNN-PGs is compared with the corresponding GMM-HMM baselines, the STL-DNN baselines, the ROVER integration (using maximum confidence) of the triphone and trigrapheme STL-DNNs, and the ROVER integration of the triphone models and trigrapheme models derived from the MTL-DNN-PGs; the results are listed in Table IV and Table V. We have the following observations:

• For all three languages, when the full training data sets were used for acoustic modeling, both triphone and trigrapheme GMM-HMMs give similar recognition performance. Similar findings were reported in [3] and [4], though the latter used larger amounts of training data (8–80 hours) than what are available in the Lwazi corpus (3–8 hours). Among the three languages, the GMM-HMMs perform the best in Afrikaans and the worst in Sesotho, even though the amount of training data is the smallest in Afrikaans and the largest in siSwati. The results may be partly explained by the LM perplexity being highest in Sesotho. Moreover, it probably means that the acoustic manifestations of the phones and graphemes in Afrikaans are less confusable.

• When the training data sets were reduced to about an hour, the recognition performance in all three languages drops as expected. However, the trigrapheme models start to outperform the triphone models in siSwati and Sesotho. One reason may be that there are many fewer graphemes than phones in the two languages: the ratio is 1:1.6 in these two languages but 1:1.2 in Afrikaans. Thus, the trigrapheme models were better trained than the triphone models with the smaller amount of data. In fact, the better performance disappears when the full training set was used. The finding again supports the use of graphemic acoustic models in low-resource ASR.

• All phone-based and grapheme-based STL-DNN-HMMs outperform their GMM-HMM counterparts by 9–25% relative on the full training sets, and 15–24% relative on the reduced training sets. The amount of performance gain is typical in large-vocabulary ASR (e.g., [50]), and here we show that such gains can also be obtained in low-resource ASR. This is surprising given that the number of model parameters in STL-DNNs is generally much greater than that in GMMs. Table VI shows the number of model parameters⁵ in the various kinds of state models estimated using the reduced training data sets of the three languages. It can be seen that the STL-DNNs are bigger than the GMMs by more than an order of magnitude. We attribute the robust estimation of the large number of DNN parameters to the effective initialization of the DNN weights by the corresponding pre-trained DBN and/or the effective discriminative fine-tuning of the parameters by back-propagation without overfitting them.

⁵The figures do not include HMM transition probabilities but only parameters describing HMM state probability distributions.

• After MTL was applied to jointly training the triphone and trigrapheme posteriors in a single MTL-DNN, compared with the corresponding STL-DNN, word error rates (WERs) were further reduced by 3–9% absolute on the full set and 3–5% absolute on the reduced set. A consistent performance gain is observed for both the larger and smaller training sets, and in both the primary and secondary tasks. The results show that MTL benefits the learning of not only the primary task but also the secondary task, and that it is still effective with even an hour of training speech. Furthermore, the gains are obtained with no additional language resources.

• The triphone models derived from the MTL-DNN-PGs even outperform the ROVER integration of the corresponding triphone and trigrapheme STL-DNNs (except for the case of using the reduced set in siSwati, where the trigrapheme model derived from the MTL-DNN-PG
is better). This shows that knowledge transfer between multiple learning tasks can be done more effectively by MTL than by ROVER integration. Nevertheless, ROVER may still take advantage of any complementary residual errors made by the triphone and trigrapheme models derived separately from the MTL-DNN-PGs, and it gives the best recognition performance by integrating them. In the end, the best results reduce the WERs of the GMM-HMM baselines by 16–33% relative on the full training set and 27–32% relative on the reduced training set.

To see the generalization effect of MTL-DNN-PG training, we look at the frame classification errors over both the reduced training and development data sets after each back-propagation epoch during both STL-DNN training and MTL-DNN training. The results for Sesotho are plotted in Fig. 3; similar behaviors are also found for Afrikaans and siSwati. The plots clearly show that although MTL-DNN-PG training converges to a worse local optimum than STL-DNN training on the training data, it performs better on the unseen development set. Thus, we may conclude that the extra grapheme modeling task really provides a representation bias towards a better local optimum that generalizes better for unseen data.

Fig. 3. Frame classification error rates of STL-DNN and MTL-DNN on the Lwazi training and development sets of Sesotho during back-propagation.

E. Evaluation 2: TIMIT Phone Recognition by MTL-DNN-PG

To further check the efficacy of using grapheme modeling as the secondary task in Method 1, the experiments in Evaluation 1 were repeated to recognize English phones in TIMIT [51]—in a language that is notorious for the complicated relationship between its writing and pronunciation. In fact, grapheme-based acoustic models perform much worse than phone-based acoustic models in English [5]. Because of our better understanding of the English language (and we do not understand the South African languages at all), this evaluation is also designed to verify our claim that the proposed MTL-DNN-PG method exploits extra information in the acoustic data—which is the implicit phone-to-grapheme mappings—to learn a more generalized acoustic model. The experimental setup and procedure used to build the various models were very similar to the ones in Evaluation 1; only the differences will be described below.

The TIMIT Corpus: The standard NIST training set, consisting of 3,696 utterances from 462 speakers, was used for training, whereas the standard core test set, consisting of 192 utterances spoken by 24 speakers, was used for evaluation. The development set is part of the complete test set, consisting of 192 utterances spoken by 24 speakers. Speakers in the training, development, and test sets do not overlap. We followed the standard experimentation on TIMIT and collapsed the original 61 phonetic labels in the corpus into a set of 48 phones for acoustic modeling; the latter were further collapsed into the standard set of 39 phones for error reporting. Moreover, the glottal stop [q] was ignored.
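For readers unfamiliar with the protocol, its shape is sketched below (our own partial illustration after the standard Lee–Hon folding; only a few representative entries are listed, not the full mapping used in the paper):

```python
# Partial sketch of the standard TIMIT 48-to-39 label folding (after Lee & Hon).
FOLD_48_TO_39 = {
    "ao": "aa", "ax": "ah", "ix": "ih",        # vowel mergers
    "el": "l", "en": "n", "zh": "sh",          # consonant mergers
    "cl": "sil", "vcl": "sil", "epi": "sil",   # closures and silence classes
}

def fold_for_scoring(phones):
    # The glottal stop [q] is ignored, per the standard protocol.
    return [FOLD_48_TO_39.get(p, p) for p in phones if p != "q"]
```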
GMM-HMM Baselines: In the phone-based system, there were altogether 15,546 cross-word triphone HMMs based on 48 base phones and 587 senones. Phone recognition was performed with a phone bigram LM that was trained only from the TIMIT training transcriptions; it has a perplexity of 16.44 on the core test set. The grapheme-based system made use of the 26 English letters plus the silence symbol as the graphemic labels. Its GMM-HMMs had altogether 760 senones. A grapheme bigram LM was estimated from the training transcriptions; it has a perplexity of 22.79 on the core test set—which is very high given that there are only 26 letters to recognize!

DNN Systems: The training procedure of the STL-DNN and MTL-DNN systems on TIMIT was identical to that in Evaluation 1 except that the acoustic features were filter-bank outputs instead of PLP coefficients. Moreover, the softmax output layers consisted of monophone and/or monographeme states, as it is usually found that the use of context-dependent phones does not give a performance gain in DNN-HMM systems for TIMIT.

Results and Discussions: Results on the core test set are summarized in Table VII.

• We may see that English grapheme recognition is far more difficult than English phone recognition, with a more than 10% higher error rate. This is expected in English when the estimated grapheme bigram has a perplexity of 22.79; that means the LM does not help much in TIMIT grapheme recognition.

• The STL-DNN-HMM system again outperforms the GMM-HMM system by a large margin—22.4% relative in phone recognition⁶ and 10.6% relative in grapheme recognition.

• Using grapheme modeling as a secondary task in MTL-DNN-PG training again helps improve the English phone models and lowers the PER by 2.70% relative. The PER reduction obtained in TIMIT is similar to the WER reduction obtained in Sesotho and siSwati in Evaluation 1. Thus, we conclude that grapheme modeling can be a good secondary MTL task for training phone models, even for languages in which the grapheme-to-phone mappings are complicated.

⁶Our DNN baseline result is comparable with others. For example, one Microsoft group recently reported a PER of 21.63% [41], though a stronger baseline of 20.7% was reported by Hinton's group in [32].
TABLE IX
LWAZI: WERS (%) OF MULTI-LINGUAL SYSTEMS TRAINED ON THE FULL TRAINING SETS. FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE MONO-LINGUAL PHONETIC STL-DNN-HMM BASELINE

TABLE X
LWAZI: WERS (%) OF MULTI-LINGUAL SYSTEMS TRAINED ON 1-HOUR SMALL TRAINING SETS. FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE MONO-LINGUAL PHONETIC STL-DNN-HMM BASELINE

… 1, 2, and 4 output nodes in ML-STL-DNN, SHL-MDNN, ML-MTL-DNN-UPS/-UGS, and ML-MTL-DNN-UPS-UGS, respectively. Because of the use of multiple languages and MTL, some parts of the training procedure were modified. Firstly, the use of data from multiple languages requires the training utterances to be shuffled randomly so that the fine-tuning process would not be biased toward a particular language at any time during training. Secondly, since more than one output node may be activated, the learning rate of the weights in the hidden layers was reduced by a factor equal to the number of activated output nodes. Otherwise, the training procedure is the same as that of the MTL-DNN in Evaluation 1.
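Both modifications are small bookkeeping changes; a minimal sketch (ours, with hypothetical names) follows:

```python
import random

def make_training_schedule(utterances_by_lang, seed=0):
    """Shuffle utterances of all languages into one stream so that fine-tuning
    is never biased toward any single language for long."""
    pool = [(lang, utt) for lang, utts in utterances_by_lang.items()
            for utt in utts]
    random.Random(seed).shuffle(pool)
    return pool

def hidden_layer_lr(base_lr, n_active_outputs):
    """Scale the shared hidden layers' learning rate by the number of output
    nodes a frame activates: 1 (ML-STL-DNN, SHL-MDNN), 2 (ML-MTL-DNN-UPS or
    -UGS), or 4 (ML-MTL-DNN-UPS-UGS)."""
    return base_lr / n_active_outputs
```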
Results and Discussions: Table IX and Table X summarize the recognition performance of the various systems trained on the full training sets and on the reduced training sets of all three languages, respectively. The performance of the previous mono-lingual STL-DNNs is repeated in the tables for comparison.

• The performance of the multi-lingual STL-DNN (ML-STL-DNN) of the universal phones (graphemes) is far inferior to the triphone (trigrapheme) STL-DNN baseline. A similar finding was reported in [30]. Although the UPS/UGS models may share data among the various languages, the data become impure and they may fail to model the language specificities. Moreover, co-articulatory effects were not modeled, as the targets in our ML-STL-DNNs are only monophone/monographeme states.

• On the other hand, multi-lingual models based on the SHL-MDNN outperform their STL-DNN counterparts and reduce the WER by 2–11% relative on the full training set and 4–10% relative on the reduced training set. The improvements agree fairly well with the findings in [22], where the WER reductions are 3–5% relative. It is believed that the shared internal representation captures cross-lingual knowledge among the training languages.

• The multi-lingual MTL-DNN (ML-MTL-DNN) with an extra UPS (UGS) output layer further outperforms the corresponding phonetic (graphemic) SHL-MDNN. For example, in the case of the reduced training set, the WER reduction improves from 4–10% relative for the SHL-MDNN to 7–12% relative for the ML-MTL-DNN-UPS/-UGS. In the SHL-MDNN, the benefit of MTL is achieved only by learning common weights in the hidden layers, whereas in our ML-MTL-DNN-UPS/-UGS, the learning of the weights in the output layer of each language is further regularized by the learning of the weights in the output layer of the UPS (UGS).

• Finally, based on the results of Evaluation 1 and the above, we put the learning of the phone models and grapheme models of the three languages together with the UPS and UGS, and obtained the best results, which reduce the WER by 6–22% relative on the full set and 8–13% on the reduced set over the STL-DNN baselines. The improvements obtained by the ML-MTL-DNN-UPS-UGS are about twice those from the respective SHL-MDNNs. All these are obtained without additional language resources.

VI. CONCLUSIONS

Lack of data and language resources is the largest obstacle in low-resource ASR. In this paper, we propose two methods in the multi-task learning (MTL) framework using deep neural networks (DNNs) to train phonetic models of low-resource languages without requiring additional resources. The resulting phonetic models are believed to generalize better to unseen data because the extra learning task(s) can exploit extra information from the training data to provide a representation bias to the original phonetic modeling task. This is made possible because both the inputs and the hidden layers are shared by the multiple learning tasks. More specifically, for single-language low-resource ASR, we propose using grapheme modeling as the additional learning task to learn the language's phone models using an MTL-DNN. The proposed method was shown to work well not only for three low-resource South African languages, but also equally well for TIMIT phone recognition, even though it is well known that the grapheme-to-phone mappings in English are not simple. Moreover, although the method was originally designed for low-resource ASR, we further show that it works even for the WSJ large-vocabulary ASR task, where there is an adequate amount of training data. Thus, we believe the method can be applied in other general ASR tasks. Secondly, when the phone models of multiple low-resource languages are trained together, we propose using the acoustic modeling of a set of universal phones/graphemes (UPS/UGS) as the additional
learning task. From the optimization perspective, the UPS task serves as a regularizer for the phonetic modeling of all the involved languages. From the language perspective, it forces the multi-lingual MTL-DNN to implicitly encode a mapping among the phones of all the languages. Finally, by combining the two methods, we are able to reduce the WERs of the mono-lingual STL-DNN baselines by 8–13% relative when only an hour of training data is available from each of the three South African languages, and by 7–22% relative when 3–8 hours of data are available. Additional memory and computational requirements are needed only during MTL training; during recognition, the softmax layer(s) due to any extra tasks may be discarded. Furthermore, since our multi-lingual MTL-DNN has the same architecture as the multi-lingual SHL-MDNN but performs better than the latter, and the latter has been shown to be effective in cross-lingual model adaptation [21], [22], [42], we believe that our multi-lingual MTL-DNN will also perform better in cross-lingual model adaptation.

Our proposed MTL methods aim at improving the generalization of phonetic DNNs. There are many other ways to do this, and perhaps the best-known one is the dropout method [52], which has been applied successfully to low-resource ASR [53]. Both our MTL methods and dropout are regularization methods, but they use different mechanisms: dropout prevents overfitting by efficiently and approximately combining exponentially many different neural network architectures, whereas our MTL methods exploit extra information from the data using additional learning task(s) that share some commonality with the primary learning task and provide a representation bias towards a better local optimum. Other ways, such as weight pruning [54] and large-margin optimization [55], [56], have also been proposed, and it will be interesting to see if these methods are complementary to our proposed MTL methods.

Multi-task learning can be a powerful learning method if the tasks involved are truly related. In this paper, the multiple tasks are carefully sought and their positive relationships are assumed, based on common knowledge. In the future, we would like to formulate the task relationships mathematically and make use of them in the MTL algorithm to further improve the ensuing model. In the machine learning community, this is known as multi-task relationship learning (MTRL), and MTRL for simple linear regression tasks has been investigated [57], [58]. How to do MTRL for complex tasks like speech recognition using DNNs needs further investigation.

REFERENCES

[1] S. Hunnicutt, H. M. Meng, S. Seneff, and V. W. Zue, "Reversible letter-to-sound sound-to-letter generation based on parsing word morphology," in Proc. Eurospeech, 1993, pp. 763–766.
[2] E. G. Schukat-Talamazzini, H. Niemann, W. Eckert, T. Kuhn, and S. Rieck, "Automatic speech recognition without phonemes," in Proc. Eurospeech, 1993.
[3] S. Kanthak and H. Ney, "Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition," in Proc. ICASSP, 2002, vol. 1, pp. 845–848.
[4] P. Charoenpornsawat, S. Hewavitharana, and T. Schultz, "Thai grapheme-based speech recognition," in Proc. HLT-NAACL, Companion Vol.: Short Papers, 2006, pp. 17–20.
[5] S. Stüker, "Acoustic modeling for under-resourced languages," Ph.D. dissertation, Univ. of Karlsruhe, Karlsruhe, Germany, 2009.
[6] T. Ko and B. Mak, "Eigentrigraphemes for under-resourced languages," Speech Commun., vol. 56, pp. 132–141, Jan. 2014.
[7] S. Takahashi and S. Sagayama, "Four-level tied-structure for efficient representation of acoustic modeling," in Proc. ICASSP, 1995, vol. 1, pp. 520–523.
[8] K. F. Lee, "Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 4, pp. 599–609, Apr. 1990.
[9] S. J. Young and P. C. Woodland, "The use of state tying in continuous speech recognition," in Proc. Eurospeech, 1993, vol. 3, pp. 2203–2206.
[10] M. Y. Hwang and X. D. Huang, "Shared-distribution hidden Markov model for speech recognition," IEEE Trans. Speech Audio Process., vol. 1, pp. 414–420, Jan. 1993.
[11] E. Bocchieri and B. Mak, "Subspace distribution clustering hidden Markov model," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 264–275, Mar. 2001.
[12] J. R. Bellegarda and D. Nahamoo, "Tied mixture continuous parameter modeling for speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 12, pp. 2033–2045, Dec. 1990.
[13] X. Huang and M. A. Jack, "Semi-continuous hidden Markov models for speech signals," Comput. Speech Lang., vol. 3, no. 3, pp. 239–251, Jul. 1989.
[14] D. Povey, L. Burget, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. K. Goel, M. Karafiát, A. Rastrow et al., "Subspace Gaussian mixture models for speech recognition," in Proc. ICASSP, 2010, pp. 4330–4333.
[15] G. Saon and J.-T. Chien, "Bayesian sensing hidden Markov models," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 43–54, Jan. 2012.
[16] M. J. F. Gales and K. Yu, "Canonical state models for automatic speech recognition," in Proc. Interspeech, 2010, pp. 58–61.
[17] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[18] W. Byrne, P. Beyerlein, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and T. Wang, "Towards language independent acoustic modeling," in Proc. ICASSP, 2000, vol. 2, pp. 1029–1032.
[19] V. Le and L. Besacier, "Automatic speech recognition for under-resourced languages: Application to Vietnamese language," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 8, pp. 1471–1482, Nov. 2009.
[20] J. Kohler, "Multi-lingual phoneme recognition exploiting acoustic-phonetic similarities of sounds," in Proc. ICSLP, 1996.
[21] A. Ghoshal, P. Swietojanski, and S. Renals, "Multilingual training of deep neural networks," in Proc. ICASSP, 2013, pp. 7319–7323.
[22] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in Proc. ICASSP, 2013, pp. 7304–7308.
[23] J. Kohler, "Language adaptation of multilingual phone models for vocabulary independent speech recognition tasks," in Proc. ICASSP, 1998, vol. 1, pp. 417–420.
[24] F. Grezl, M. Karafiat, and K. Vesely, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Proc. ICASSP, 2014, pp. 7654–7658.
[25] D. Imseng, H. Bourlard, J. Dines, P. Garner, and M. Magimai-Doss, "Applying multi- and cross-lingual stochastic phone space transformations to non-native speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1713–1726, Aug. 2013.
[26] N. T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, and H. Bourlard, "Multilingual deep neural network based acoustic modeling for rapid language adaptation," in Proc. ICASSP, 2014, pp. 7639–7643.
[27] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," in Proc. ICASSP, 2014, pp. 5582–5586.
[28] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, D. Povey, A. Rastrow, R. Rose, and S. Thomas, "Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models," in Proc. ICASSP, Mar. 2010, pp. 4334–4337.
[29] P. Cohen, S. Dharanipragada, J. Gros, M. Monkowski, C. Neti, S. Roukos, and T. Ward, "Towards a universal speech recognizer for multiple languages," in Proc. IEEE ASRU, 1997, pp. 591–598.
[30] H. Lin, L. Deng, D. Yu, Y.-F. Gong, A. Acero, and C.-H. Lee, "A study on multilingual acoustic modeling for large vocabulary ASR," in Proc. ICASSP, Apr. 2009, pp. 4333–4336.
[31] R. Caruana, "Multitask learning," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, USA, 1997.
[32] A. Mohamed, G. Dahl, and G. E. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 14–22, Jan. 2012.
[33] D. Chen, B. Mak, C.-C. Leung, and S. Sivadas, "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," in Proc. ICASSP, 2014, pp. 5592–5596.
[34] S. Thrun and L. Pratt, Learning to Learn. Norwell, MA, USA: Kluwer, Nov. 1997.
[35] J. Baxter, "A model of inductive bias learning," J. Artif. Intell. Res., vol. 12, pp. 149–198, 2000.
[36] S. Ben-David and R. Schuller, "Exploiting task relatedness for multiple task learning," in Proc. COLT, 2003, pp. 567–580.
[37] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. ICML, 2008, pp. 160–167.
[38] G. Tur, "Multitask learning for spoken language understanding," in Proc. ICASSP, 2006, pp. 585–588.
[39] Y. Huang, W. Wang, L. Wang, and T. Tan, "Multi-task deep neural network for multi-label learning," in Proc. ICIP, 2013, pp. 2897–2900.
[40] S. Parveen and P. D. Green, "Multitask learning in connectionist ASR using recurrent neural networks," in Proc. Eurospeech, 2003, pp. 1813–1816.
[41] M. Seltzer and J. Droppo, "Multi-task learning in deep neural networks for improved phoneme recognition," in Proc. ICASSP, 2013, pp. 6965–6968.
[42] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in Proc. ICASSP, 2013, pp. 8619–8623.
[43] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. IEEE ASRU, 1997, pp. 347–354.
[44] International Phonetic Association, Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[45] Meraka Institute, "Lwazi ASR corpus," 2009 [Online]. Available: http://www.meraka.org.za/lwazi
[46] "Lwazi phone set," 2009 [Online]. Available: ftp://hlt.mirror.ac.za/Phoneset/Lwazi.Phoneset.1.2.pdf
[47] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 109–117.
[48] M. M. Tempest, "DictionaryMaker 2.16 user manual," 2009 [Online]. Available: http://dictionarymaker.sourceforge.net/
[49] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[50] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[51] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Commun., vol. 9, no. 4, pp. 351–356, Aug. 1990.
[52] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[53] Y. Miao and F. Metze, "Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training," in Proc. ICASSP, 2013, pp. 7304–7308.
[54] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. ICASSP, 2012, pp. 4409–4412.
[55] Y. Tang, "Deep learning using linear support vector machines," arXiv:1306.0239v3 [cs.LG], 2013.
[56] R. Min, Z. Yuan, D. A. Stanley, A. Bonner, and Z. Zhang, "A deep non-linear feature mapping for large-margin kNN classification," in Proc. ICDM, 2009, pp. 357–366.
[57] Y. Zhang and D.-Y. Yeung, "A convex formulation for learning task relationships in multi-task learning," in Proc. 26th Conf. UAI, Jul. 2010.
[58] W. Zhong and J. Kwok, "Convex multitask learning with flexible task clusters," in Proc. ICML, 2012.

Dongpeng Chen received the Bachelor's degree in computer science from the University of Science and Technology of China in 2010. Since August 2010, he has been a Ph.D. candidate under the supervision of Prof. Brian Mak at the Hong Kong University of Science and Technology. His research focuses on speech recognition and machine learning.

Brian Kan-Wing Mak received the B.Sc. degree in electrical engineering from the University of Hong Kong, the M.S. degree in computer science from the University of California, Santa Barbara, USA, and the Ph.D. degree in computer science from the Oregon Graduate Institute of Science and Technology, Portland, Oregon, USA. He was a Research Programmer at the Speech Technology Laboratory of Panasonic Technologies Inc. in Santa Barbara, and a Research Consultant at AT&T Labs–Research, Florham Park, New Jersey, USA. He has also been a Visiting Researcher at Bell Laboratories and at the Advanced Telecommunications Research Institute International. Since April 1998, he has been with the Department of Computer Science at the Hong Kong University of Science and Technology, where he is now an Associate Professor. He has served or is serving on the editorial boards of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, the IEEE SIGNAL PROCESSING LETTERS, and Speech Communication. He has also served on the Speech and Language Technical Committee of the IEEE Signal Processing Society. His interests include acoustic modeling, speech recognition, spoken language understanding, computer-assisted language learning, and machine learning. He received the Best Paper Award in the area of Speech Processing from the IEEE Signal Processing Society in 2004.