Multi-task Learning of Deep Neural Networks for Low-resource Speech Recognition
Dongpeng Chen and Brian Kan-Wing Mak
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 7, July 2015

Abstract—We propose a multitask learning (MTL) approach to improve low-resource automatic speech recognition using deep neural networks (DNNs) without requiring additional language resources. We first demonstrate that the performance of the phone models of a single low-resource language can be improved by training its grapheme models in parallel under the MTL framework. If multiple low-resource languages are trained together, we investigate learning a set of universal phones (UPS) as an additional task, again in the MTL framework, to improve the performance of the phone models of all the involved languages. In both cases, the heuristic guideline is to select a task that may exploit extra information from the training data of the primary task(s). In the first method, the extra information is the phone-to-grapheme mappings, whereas in the second method, the UPS helps to implicitly map the phones of the multiple languages among each other. In a series of experiments using three low-resource South African languages in the Lwazi corpus, the proposed MTL methods obtain significant word recognition gains when compared with single-task learning (STL) of the corresponding DNNs, or with ROVER, which combines results from several STL-trained DNNs.

Index Terms—Deep neural network (DNN), low-resource speech recognition, multitask learning, universal grapheme set, universal phone set.

I. INTRODUCTION

A semi-automatic approach to preparing a pronunciation dictionary is to first create a small primary dictionary manually, and then extend it to a large dictionary by applying grapheme-to-phoneme conversion [1]. However, the performance of the final dictionary highly depends on the quality of the primary one. A simpler solution is to abandon the phone-based models and employ graphemes as the basic acoustic units, because grapheme modeling [2], [3], [4], [5], [6] does not require a phonetic dictionary¹. Many languages that use an alphabetic writing system are suitable for grapheme-based acoustic modeling, and their grapheme set is usually selected to be the same as their alphabet.

While ways are sought to create resources for a new language more efficiently, other ways are proposed to reduce the amount of training data required for robust acoustic modeling. For systems based on hidden Markov modeling (HMM), one common solution is parameter tying or sharing [7], [8], [9], [10], [11]. Another method is the basis approach, in which a relatively small set of basis vectors or functions is computed so that other model parameters may be derived from them. Successful examples include the tied-mixture HMM [12], [13], the subspace Gaussian mixture model (SGMM) [14], the Bayesian sensing HMM [15], and the canonical state model [16]. On the other hand, transfer learning [17] has also proved effective when out-of-domain data are available. Notable efforts include cross-lingual [18], [19] and multi-lingual acoustic modeling.
In this paper, we would like to improve the estimation of the phonetic models of a low-resource language by learning other related task(s) together under the multi-task learning (MTL) [31] framework using deep neural networks² [32]. According to the theory of MTL, related tasks can be jointly learned to improve the generalization performance of each task if they share the same inputs and some internal representation; the effect is more prominent when the amount of training data is small. We believe that humans do not learn the sounds of a spoken language in isolation but together with other cues, such as their graphemes, the lexical contexts, and the language's similarities with or differences from other languages. Our major contribution in this paper is the successful identification of appropriate related tasks, without requiring additional resources, so that phonetic models of low-resource languages can be better learned under the MTL framework. The heuristic guideline is to find a secondary task, related to the primary task, that can exploit extra information from its training data. In our case, the extra information being exploited is the implicit mapping between the primary targets and the secondary targets. More specifically, we first propose learning the graphemes of the same language as the secondary task³ in Section III. Grapheme-based acoustic modeling does not require additional language resources besides those already required by phone-based acoustic modeling. We believe that the grapheme learning task exploits extra information in the acoustic training data to learn implicit phone-to-grapheme mappings of the language. Then, if several low-resource languages are to be learned together, we propose in our second method, in Section IV, to derive a UPS among the languages and use the UPS learning as an additional secondary task in the learning of the multi-lingual phonetic models. The UPS learning not only implicitly encodes an indirect mapping among the phones of all the involved languages, but also serves as a regularizer for the learning of the phonetic models of each language. Finally, we combine the above two methods and obtain a further performance gain.

²Although we mostly take the estimation of phone-based acoustic models as the primary task in the proposed MTL framework, the approach is general and flexible, and one may take another task, such as the estimation of grapheme-based acoustic models, as the primary task as well.

³The first method has been presented in our conference paper [33].

The rest of this paper is organized as follows. We first review multi-task learning in Section II. The first MTL approach, using joint phone-based and grapheme-based acoustic modeling, is described in Section III. Section IV describes how multi-lingual acoustic modeling is conducted in an MTL framework by learning universal phone/grapheme models in parallel. Experimental evaluation on three low-resource South African languages is reported in Section V, which is followed by concluding remarks in Section VI.

II. MULTI-TASK LEARNING

Multi-task learning (MTL) [31] or learning to learn [34] is a machine learning approach that aims at improving the generalization performance of a learning task by jointly learning multiple related tasks together. It is found that if the multiple tasks are related and share some internal representation, then through learning them together they are able to transfer knowledge to one another. As a result, the common internal representation thus learned helps the models generalize better on future unseen data. In [35], a statistical learning theory approach to MTL is developed and a generalization bound on the average error of MTL is derived. In [35], [36], the notion of relatedness among multiple tasks is defined in a particular way so that a tighter generalization bound for each learning task can be derived. In his thesis [31], Caruana postulated some requirements for related tasks if their joint learning in the MTL approach is to work well:
(a) related tasks must share input features, and
(b) related tasks must share hidden units to benefit each other when trained with MTL-backprop.
He also listed some task relationships that enable MTL-backprop to learn a better internal representation (or a more generalized model) of the related tasks: data amplification, eavesdropping, attribute selection, representation bias, and overfitting prevention.

MTL has been applied successfully to many speech, language, image, and vision tasks using neural networks (NNs), because the hidden layers of an NN naturally capture learned knowledge that can be readily transferred or shared across multiple tasks. For example, [37] applies MTL to a single convolutional neural network to produce state-of-the-art performance on several language processing prediction tasks; [38] improves intent classification in goal-oriented human-machine spoken dialog systems, which is particularly successful when the amount of labeled training data is limited; and in [39], the MTL approach is used to perform multi-label learning in an image annotation application.

A. Multi-Task Learning in ASR Using DNNs

In ASR, MTL has been applied to improving performance robustness using recurrent neural networks [40]. With the emergence of the recently very successful deep neural network (DNN), one expects that DNNs may be used to further improve MTL performance; we call the resulting deep neural networks MTL-DNNs. For instance, Seltzer and Droppo investigated the training of monophone models for TIMIT phone recognition together with the learning of the phone labels, state contexts, or phone contexts [41]; significant gains were reported. However, that work did not model triphone states directly, and it is not clear whether it is really better to use the triphone contexts as the secondary task in learning monophone state posteriors in the MTL framework. MTL has also been employed successfully to train multi-lingual DNNs [21], [22], [42]. In these works, during pre-training and subsequent fine-tuning, data from all training languages are fed through the common hidden layers, but each language maintains its own language-specific output layer. However, unlike the MTL work in [31] and [41], for each input only one task is being trained, and the relatedness among the tasks is exploited only by enforcing common weights in the hidden layers. We will follow the notation in [22] and call these multi-lingual DNNs with shared hidden layers SHL-MDNNs.
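As a concrete illustration of this routing, the following sketch (our own, not code from the paper; the layer sizes, language names, and senone counts are arbitrary) shows an SHL-MDNN forward pass in which every frame traverses the shared hidden layers but only its own language's softmax head produces an output:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_in, n_hid = 440, 1024                    # e.g., 40-dim features x 11 frames (arbitrary)
senones = {"afrikaans": 500, "sesotho": 600, "siswati": 550}   # illustrative counts

# Hidden layers are shared by all languages; each language owns one softmax head.
W_shared = [rng.normal(0, 0.01, (n_in, n_hid)),
            rng.normal(0, 0.01, (n_hid, n_hid))]
heads = {lang: rng.normal(0, 0.01, (n_hid, n)) for lang, n in senones.items()}

def forward(x, lang):
    h = x
    for W in W_shared:                     # common internal representation
        h = np.maximum(0.0, h @ W)         # ReLU hidden layers (illustrative)
    return softmax(h @ heads[lang])        # only this language's head fires

x = rng.normal(size=(1, n_in))
posteriors = forward(x, "sesotho")         # posteriors over Sesotho senones only
```

In the MTL-DNN formulation that follows, by contrast, two or more heads are active—and trained—for every input frame.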
B. Our MTL-DNN Formulation

We would like to apply the MTL framework to improve phone-based acoustic models for low-resource ASR using …
The target values of exactly one triphone senone output unit and one trigrapheme senone output unit will be set to 1.0 per training frame. During decoding, each senone posterior probability is converted back to a scaled likelihood by dividing it by its prior as follows:

$$p(\mathbf{x}_t \mid s) \propto \frac{P(s \mid \mathbf{x}_t)}{P(s)} \qquad (4)$$
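In code, Eq. (4) is a one-line operation, usually carried out in the log domain; the sketch below is our own illustration (the function and the toy numbers are ours), with the priors standing in for the senones' relative frequencies in the forced-aligned training data:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Eq. (4): p(x|s) is proportional to P(s|x) / P(s), done in the log domain.

    log_posteriors: (T, S) frame-level senone log posteriors from the DNN.
    log_priors:     (S,)   senone log priors, e.g. label frequencies from the
                           forced alignment of the training data.
    """
    return log_posteriors - log_priors   # division becomes subtraction in logs

# Toy example: 2 frames, 3 senones.
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1]])
priors = np.array([0.5, 0.3, 0.2])
loglik = scaled_log_likelihoods(np.log(post), np.log(priors))
```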
TABLE I
THE UNIVERSAL PHONE SET (UPS) AND THE PHONEMES' USAGE IN THREE SOUTH AFRICAN LANGUAGES

Fig. 2. A multi-lingual MTL-DNN system (ML-MTL-DNN-UPS) with shared hidden layers and an extra output layer of UPS states. Outputs from the two separate tasks, labelled in green, are turned "on" by an input acoustic vector.

The set of UPS monophone states is denoted as $S^{ups}$. For each input vector $\mathbf{x}_t$ from the $k$-th language, only two tasks are involved: the triphone senones of the $k$-th language and the UPS monophone states are activated using the softmax function of Eq. (2). Their corresponding per-frame cross-entropies, $E^{ph_k}(\mathbf{x}_t)$ and $E^{ups}(\mathbf{x}_t)$, the latter involving the weights of the output layer of the UPS states, are given by Eq. (3). Finally, the training objective function over all data of the multiple languages is modified from Eq. (1) as follows:

$$E = \sum_{k=1}^{K} \sum_{\mathbf{x}_t \in \mathcal{D}_k} \Big( \lambda_k\, E^{ph_k}(\mathbf{x}_t) + \lambda_{ups}\, E^{ups}(\mathbf{x}_t) \Big) \qquad (5)$$

where $\mathcal{D}_k$ is the training data of the $k$-th language, and $\lambda_k$ and $\lambda_{ups}$ are the task weights of the $k$-th language and the UPS task, respectively.

Eq. (5) shows that our multi-lingual MTL-DNN training is different from SHL-MDNN training and may be considered a regularized version of the latter—a form of regularized MTL [47]. If the language task weights are large, it will be the same as SHL-MDNN training; if the UPS task weight is large, it will be reduced to UPS training. Since UPS models are usually not as good as language-specific models [30], the learned UPS output layer will not be used in recognition; it is only used to help enforce the cross-lingual phone mappings during MTL-DNN training. The training procedure of ML-MTL-DNN-UPS is similar to that of MTL-DNN-PG in Section III.

From the perspective of regularization, we prefer a simpler regularizer, and thus we use UPS monophone states instead of UPS triphone senones as the common task. In some preliminary experiments, we also empirically found that they gave similar results.
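The per-frame computation behind Eq. (5) can be sketched as follows; this is our own illustration with our own symbol and key names, standing in for the actual softmax and cross-entropy definitions of Eqs. (2) and (3):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, target_index):
    return -np.log(p[target_index] + 1e-12)

def ml_mtl_ups_loss(frame, heads, lam_lang, lam_ups):
    """Per-frame term of Eq. (5): only two output layers contribute for a
    frame of language k -- language k's triphone head and the UPS head."""
    h, k = frame["hidden"], frame["lang"]
    ce_lang = cross_entropy(softmax(h @ heads[k]), frame["senone"])
    ce_ups = cross_entropy(softmax(h @ heads["ups"]), frame["ups_state"])
    return lam_lang[k] * ce_lang + lam_ups * ce_ups

# Toy usage with arbitrary sizes.
rng = np.random.default_rng(0)
heads = {"sesotho": rng.normal(0, 0.1, (8, 5)), "ups": rng.normal(0, 0.1, (8, 4))}
frame = {"hidden": rng.normal(size=8), "lang": "sesotho", "senone": 2, "ups_state": 1}
loss = ml_mtl_ups_loss(frame, heads, lam_lang={"sesotho": 0.5}, lam_ups=0.5)
```

Summing this term over all frames of all $K$ languages gives Eq. (5); letting $\lambda_{ups}$ go to zero recovers plain SHL-MDNN training, which is exactly the regularization view taken above.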
C. Extensions

As said in Section III, grapheme-based acoustic modeling is a viable solution for low-resource language ASR. Method 2 can be easily modified to use graphemes instead of phones as the modeling units. A universal grapheme set (UGS) may again be created by simply taking the union of the grapheme sets of all the languages under investigation. The UGS for the three languages in our experiments consists of 30 graphemes, including one that denotes silence. We will call the grapheme-based multi-lingual MTL-DNN with the extra UGS learning task ML-MTL-DNN-UGS.

Obviously, one may further combine Method 1 and Method 2 to jointly model multi-lingual phones and graphemes using the UPS and UGS as the extra learning tasks, and we will label such a network ML-MTL-DNN-UPS-UGS. If there are $K$ languages to learn simultaneously, then there will be $2K + 2$ (softmax) output layers in the model. Each input acoustic vector from the $k$-th language will activate four targets: one triphone senone and one trigrapheme senone of the $k$-th language, one universal (mono)phone state, and one universal (mono)grapheme state.
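The bookkeeping behind these extensions is simple, as the following sketch shows (our own illustration; the per-language grapheme inventories are placeholders, not the actual Lwazi grapheme sets):

```python
# Form the universal grapheme set (UGS) as the union of the per-language
# grapheme sets plus a shared silence symbol.
graphemes = {
    "afrikaans": set("abcdefghijklmnopqrstuvwxyz"),   # placeholder inventory
    "sesotho":   set("abdefghijklmnopqrstuwy"),       # placeholder inventory
    "siswati":   set("abcdefghijklmnopstuvwyz"),      # placeholder inventory
}
ugs = set.union(*graphemes.values()) | {"sil"}

# In ML-MTL-DNN-UPS-UGS, a training frame from language k activates four targets.
def targets_for_frame(k, ph_senone, gr_senone, ups_state, ugs_state):
    return {f"triphone[{k}]": ph_senone,       # language-specific phone task
            f"trigrapheme[{k}]": gr_senone,    # language-specific grapheme task
            "ups": ups_state,                  # universal phone task
            "ugs": ugs_state}                  # universal grapheme task
```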
V. EXPERIMENTAL EVALUATION

The two proposed multi-lingual MTL-DNN training methods were evaluated on three low-resource South African languages in the Lwazi project [45].

A. The Lwazi Speech Corpus

The Lwazi project was set up to develop a telephone-based, speech-driven information system in South Africa. In the project, the Lwazi ASR corpus [45] was collected over a telephone channel from approximately 200 native speakers for each of the 11 official languages in South Africa. Each speaker produced approximately 30 utterances, of which 16 are phonetically balanced read speech and the remainder are elicited short words such as answers to open questions, answers to yes/no questions, spelt words, dates, and numbers. A 5,000-word pronunciation dictionary was also created for each language, which covers only the most common words in the language. Thus, for the phone-based experiments, the DictionaryMaker software [48] was used to generate dictionary entries for the uncovered words in the corpus.

Three languages were selected from the corpus for our evaluations: Afrikaans, Sesotho, and siSwati. Afrikaans is a Low Franconian, West Germanic language that originated from Dutch; Sesotho is a Southern Bantu language, closely related to other languages in the Sotho-Tswana language group; siSwati is also a Southern Bantu language, but is more closely related to the Nguni language group. Thus, the three chosen languages come from different language families. The numbers of phones and graphemes in the three languages and the sizes of the corresponding universal phone and grapheme sets are shown in Table II. Since the corpus does not define an official training, development, and test set for each language, we followed the partitions used in [6]. In addition, in order to evaluate the efficacy of MTL in scenarios where acoustic data is scarce, smaller data sets consisting of approximately one hour of speech were further created by randomly sampling from the full training set of each language. Care had been taken to ensure that there are roughly the same number of utterances for each speaker. Details of the various data sets are listed in Table III.
B. Baseline Systems

For each language, HMM-based recognition systems were built using the two proposed MTL-DNN training methods, and they are compared with two kinds of baseline systems: GMM-HMMs and STL-DNN-HMMs.

Training of the GMM-HMM Baselines: Acoustic models of all phone-based and grapheme-based baseline systems were strictly left-to-right, 3-state, continuous-density hidden Markov models (HMMs). HMM state emission probabilities were modeled by Gaussian mixture models (GMMs) with at most 16 components. The GMM-HMMs were trained using maximum-likelihood estimation with 39-dimensional acoustic feature vectors extracted every 10 ms over a window of 25 ms from the training utterances. The acoustic features consist of the first 13 PLP coefficients, including c0, and their first- and second-order derivatives. Speaker-based cepstral mean subtraction and variance normalization were applied to the extracted features before they were used. Moreover, states were tied using phonetic decision trees, and the optimal number of tied states (senones) was determined using the development data set.

Training of the STL-DNN Baselines: Single-task learning (STL) DNNs were trained to classify the central frame of each 15-frame acoustic context window. Feature vectors in the window were concatenated and then normalized to have zero mean and unit variance over the whole training set.
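The DNN input construction just described can be sketched as follows (our own illustration; the paper does not say how utterance boundaries are padded, so the edge-repetition here is an assumption):

```python
import numpy as np

def stack_context(feats, left=7, right=7):
    """Build 15-frame context windows (7+1+7, the paper's setup).

    feats: (T, D) per-frame acoustic features; returns (T, 15*D) DNN inputs.
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    T = len(feats)
    padded = np.concatenate([np.repeat(feats[:1], left, axis=0),
                             feats,
                             np.repeat(feats[-1:], right, axis=0)])
    return np.stack([padded[t:t + left + 1 + right].ravel() for t in range(T)])

def global_mvn(train_windows):
    """Normalize to zero mean and unit variance over the whole training set."""
    mu = train_windows.mean(axis=0)
    sigma = train_windows.std(axis=0) + 1e-8   # guard against zero variance
    return (train_windows - mu) / sigma, (mu, sigma)
```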
C. Decoding

The standard Viterbi algorithm was used for decoding in all experiments, with a bigram language model (LM) for each language. Each LM was trained using only the transcriptions in the training set of its language. The test-set perplexities of these LMs are given in Table II. All system parameters, such as the grammar factor and insertion penalty, were tuned using the development data.

D. Evaluation 1: Joint Acoustic Modeling of Phones and Graphemes of Lwazi Languages by MTL-DNN-PG

For each language, a single MTL-DNN, labelled MTL-DNN-PG, was trained to estimate the posterior probabilities of both the triphone and the trigrapheme senones of the language.

MTL-DNN Training: The construction of the MTL-DNN-PG is very similar to that of an STL-DNN. Firstly, the weights in its hidden layers were initialized by the weights of the same DBN as the corresponding STL-DNNs. But now the output layer in the MTL-DNN-PG consists of two separate softmax layers: one for the primary task and one for the secondary task. For each training sample, two error signals—one from each task's softmax layer—were propagated back to the hidden layers. Thus, the learning rate of the hidden layers was set to half of the original one, while that of the output layer remains the same. Otherwise, the training procedure was the same as that of the STL-DNN. In addition, the task weights were set to 0.5 for both tasks, as other values did not make much difference in our preliminary experiments.
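The update just described can be condensed into the following sketch (our own simplification: a single shared hidden layer instead of the real DBN-initialized deep stack, and arbitrary names). Two error signals, one per softmax head, meet in the shared layer; the task weights are 0.5, and the hidden-layer learning rate is halved:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mtl_pg_step(x, y_ph, y_gr, W_h, W_ph, W_gr, lr=0.1, lam=0.5):
    """One MTL-DNN-PG minibatch update. x: (B, D); y_ph, y_gr: (B,) integer
    targets for the triphone-senone and trigrapheme-senone tasks."""
    h = np.maximum(0.0, x @ W_h)              # shared hidden layer (ReLU)
    p_ph, p_gr = softmax(h @ W_ph), softmax(h @ W_gr)

    # Softmax + cross-entropy gradient at each head, scaled by task weight 0.5.
    g_ph, g_gr = p_ph.copy(), p_gr.copy()
    g_ph[np.arange(len(y_ph)), y_ph] -= 1.0
    g_gr[np.arange(len(y_gr)), y_gr] -= 1.0
    g_ph *= lam / len(x)
    g_gr *= lam / len(x)

    # The two error signals merge in the shared hidden layer.
    dh = (g_ph @ W_ph.T + g_gr @ W_gr.T) * (h > 0)

    W_ph -= lr * (h.T @ g_ph)                 # full learning rate at the heads
    W_gr -= lr * (h.T @ g_gr)
    W_h -= (lr / 2) * (x.T @ dh)              # halved rate in the shared layer
```

Halving the shared-layer rate roughly compensates for those weights now receiving gradient from two tasks for every frame.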
TABLE IV
LWAZI: WERS (%) OF MONO-LINGUAL SYSTEMS TRAINED ON THE FULL TRAINING SETS. FIGURES IN () ARE #SENONES AND FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE PHONETIC GMM-HMM BASELINE

TABLE V
LWAZI: WERS (%) OF MONO-LINGUAL SYSTEMS TRAINED ON 1-HOUR SMALL TRAINING SETS. FIGURES IN () ARE #SENONES AND FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE PHONETIC GMM-HMM BASELINE

TABLE VI
NUMBER OF MODEL PARAMETERS WHEN THE MODELS WERE ESTIMATED USING THE REDUCED DATA SETS (IN MILLIONS)

Results and Discussions: The evaluation was first performed using the full training data set of each language, and then repeated with the reduced training sets to investigate the effect of a limited amount of training data on MTL. The recognition performance of the MTL-DNN-PGs is compared with the corresponding GMM-HMM baselines, the STL-DNN baselines, the ROVER integration (using maximum confidence) of the triphone and trigrapheme STL-DNNs, and the ROVER integration of the triphone models and trigrapheme models derived from the MTL-DNN-PGs; the results are listed in Table IV and Table V. We have the following observations:

• For all three languages, when the full training data sets were used for acoustic modeling, both triphone and trigrapheme GMM-HMMs give similar recognition performance. Similar findings were reported in [3] and [4], though the latter used larger amounts of training data (8–80 hours) than what are available in the Lwazi corpus (3–8 hours). Among the three languages, the GMM-HMMs perform the best in Afrikaans and the worst in Sesotho, even though the amount of training data is the smallest in Afrikaans and the largest in siSwati. The results may be partly explained by the LM perplexity being highest in Sesotho. Moreover, it probably means that the acoustic manifestations of the phones and graphemes in Afrikaans are less confusable.

• When the training data sets were reduced to about an hour, the recognition performance in all three languages drops as expected. However, the trigrapheme models start to outperform the triphone models in siSwati and Sesotho. One reason may be that there are many fewer graphemes than phones in the two languages: the ratio is 1:1.6 in these two languages but 1:1.2 in Afrikaans. Thus, the trigrapheme models were better trained than the triphone models with the smaller amount of data. In fact, the better performance disappears when the full training set was used. The finding again supports the use of graphemic acoustic models in low-resource ASR.

• All phone-based and grapheme-based STL-DNN-HMMs outperform their GMM-HMM counterparts by 9–25% relative on the full training sets, and 15–24% relative on the reduced training sets. The amount of performance gain is typical in large-vocabulary ASR (e.g., [50]), and here we show that such gains can also be obtained in low-resource ASR. This is surprising given that the number of model parameters in STL-DNNs is generally much greater than that in GMMs. Table VI shows the number of model parameters⁵ in the various kinds of state models estimated using the reduced training data sets of the three languages. It can be seen that the STL-DNNs are bigger than the GMMs by more than an order of magnitude. We attribute the robust estimation of the large number of DNN parameters to the effective initialization of the DNN weights by the corresponding pre-trained DBN and/or the effective discriminative fine-tuning of the parameters by back-propagation without overfitting them.

⁵The figures do not include HMM transition probabilities but only parameters describing HMM state probability distributions.

• After MTL was applied to jointly training the triphone and trigrapheme posteriors in a single MTL-DNN, compared with the corresponding STL-DNN, word error rates (WERs) were further reduced by 3–9% absolute on the full set and 3–5% absolute on the reduced set. A consistent performance gain is observed for both the larger and smaller training sets, and in both the primary and secondary tasks. The results show that MTL benefits the learning of not only the primary task but also the secondary task, and that it is still effective with even an hour of training speech. Furthermore, the gains are obtained with no additional language resources.

• The triphone models derived from the MTL-DNN-PGs even outperform the ROVER integration of the corresponding triphone and trigrapheme STL-DNNs (except for the case of using the reduced set in siSwati, where the trigrapheme model derived from the MTL-DNN-PG
is better). This shows that knowledge transfer between multiple learning tasks can be done more effectively by MTL than by ROVER integration. Nevertheless, ROVER may still take advantage of any complementary residual errors made by the triphone and trigrapheme models derived separately from the MTL-DNN-PGs, and it gives the best recognition performance by integrating them. In the end, the best results reduce the WERs of the GMM-HMM baselines by 16–33% relative on the full training set and 27–32% relative on the reduced training set.

To see the generalization effect of MTL-DNN-PG training, we look at the frame classification errors over both the reduced training and development data sets after each back-propagation epoch during both STL-DNN training and MTL-DNN training. The results for Sesotho are plotted in Fig. 3; similar behaviors are also found for Afrikaans and siSwati. The plots clearly show that although MTL-DNN-PG training converges to a worse local optimum than STL-DNN training on the training data, it performs better on the unseen development set. Thus, we may conclude that the extra grapheme modeling task really provides a representation bias towards a better local optimum that generalizes better for unseen data.

Fig. 3. Frame classification error rates of STL-DNN and MTL-DNN on the Lwazi training and development sets of Sesotho during back-propagation.

E. Evaluation 2: TIMIT Phone Recognition by MTL-DNN-PG

To further check the efficacy of using grapheme modeling as the secondary task in Method 1, the experiments in Evaluation 1 were repeated to recognize English phones in TIMIT [51]—in a language that is notorious for the complicated relationship between its writing and pronunciation. In fact, grapheme-based acoustic models perform much worse than phone-based acoustic models in English [5]. Because of our better understanding of the English language (and we do not understand the South African languages at all), this evaluation is also designed to verify our claim that the proposed MTL-DNN-PG method exploits extra information in the acoustic data—which is the implicit phone-to-grapheme mappings—to learn a more generalized acoustic model. The experimental setup and procedure used to build the various models were very similar to the ones in Evaluation 1; only the differences will be described below.

The TIMIT Corpus: The standard NIST training set, consisting of 3,696 utterances from 462 speakers, was used for training, whereas the standard core test set, consisting of 192 utterances spoken by 24 speakers, was used for evaluation. The development set is part of the complete test set, consisting of 192 utterances spoken by 24 speakers. Speakers in the training, development, and test sets do not overlap. We followed the standard experimentation on TIMIT and collapsed the original 61 phonetic labels in the corpus into a set of 48 phones for acoustic modeling; the latter were further collapsed into the standard set of 39 phones for error reporting. Moreover, the glottal stop [q] was ignored.
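For readers unfamiliar with the protocol, its shape is sketched below (our own partial illustration after the standard Lee–Hon folding; only a few representative entries are listed, not the full mapping used in the paper):

```python
# Partial sketch of the standard TIMIT 48-to-39 label folding (after Lee & Hon).
FOLD_48_TO_39 = {
    "ao": "aa", "ax": "ah", "ix": "ih",        # vowel mergers
    "el": "l", "en": "n", "zh": "sh",          # consonant mergers
    "cl": "sil", "vcl": "sil", "epi": "sil",   # closures and silence classes
}

def fold_for_scoring(phones):
    # The glottal stop [q] is ignored, per the standard protocol.
    return [FOLD_48_TO_39.get(p, p) for p in phones if p != "q"]
```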
GMM-HMM Baselines: In the phone-based system, there were altogether 15,546 cross-word triphone HMMs based on 48 base phones and 587 senones. Phone recognition was performed with a phone bigram LM that was trained only from the TIMIT training transcriptions; it has a perplexity of 16.44 on the core test set. The grapheme-based system made use of the 26 English letters plus the silence symbol as the graphemic labels. Its GMM-HMMs had altogether 760 senones. A grapheme bigram LM was estimated from the training transcriptions; it has a perplexity of 22.79 on the core test set—which is very high given that there are only 26 letters to recognize!

DNN Systems: The training procedure of the STL-DNN and MTL-DNN systems on TIMIT was identical to that in Evaluation 1 except that the acoustic features were filter-bank outputs instead of PLP coefficients. Moreover, the softmax output layers consisted of monophone and/or monographeme states, as it is usually found that the use of context-dependent phones does not give a performance gain in DNN-HMM systems for TIMIT.

Results and Discussions: Results on the core test set are summarized in Table VII.

• We may see that English grapheme recognition is far more difficult than English phone recognition, with a more than 10% higher error rate. This is expected in English when the estimated grapheme bigram has a perplexity of 22.79; that means the LM does not help much in TIMIT grapheme recognition.

• The STL-DNN-HMM system again outperforms the GMM-HMM system by a large margin—22.4% relative in phone recognition⁶ and 10.6% relative in grapheme recognition.

• Using grapheme modeling as a secondary task in MTL-DNN-PG training again helps improve the English phone models and lowers the PER by 2.70% relative. The PER reduction obtained in TIMIT is similar to the WER reduction obtained in Sesotho and siSwati in Evaluation 1. Thus, we conclude that grapheme modeling can be a good secondary MTL task for training phone models, even for languages in which the grapheme-to-phone mappings are complicated.

⁶Our DNN baseline result is comparable with others. For example, one Microsoft group recently reported a PER of 21.63% [41], though a stronger baseline of 20.7% was reported by Hinton's group in [32].
TABLE IX
LWAZI: WERS (%) OF MULTI-LINGUAL SYSTEMS TRAINED ON THE FULL TRAINING SETS. FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE MONO-LINGUAL PHONETIC STL-DNN-HMM BASELINE

TABLE X
LWAZI: WERS (%) OF MULTI-LINGUAL SYSTEMS TRAINED ON 1-HOUR SMALL TRAINING SETS. FIGURES IN [] ARE WER REDUCTIONS (%) OVER THE MONO-LINGUAL PHONETIC STL-DNN-HMM BASELINE

… 1, 2, and 4 output nodes in ML-STL-DNN, SHL-MDNN, ML-MTL-DNN-UPS/-UGS, and ML-MTL-DNN-UPS-UGS, respectively. Because of the use of multiple languages and MTL, some parts of the training procedure were modified. Firstly, the use of data from multiple languages requires the training utterances to be shuffled randomly so that the fine-tuning process would not be biased toward a particular language at any time during training. Secondly, since more than one output node may be activated, the learning rate of the weights in the hidden layers was reduced by a factor equal to the number of activated output nodes. Otherwise, the training procedure is the same as that of the MTL-DNN in Evaluation 1.
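Both modifications are small bookkeeping changes; a minimal sketch (ours, with hypothetical names) follows:

```python
import random

def make_training_schedule(utterances_by_lang, seed=0):
    """Shuffle utterances of all languages into one stream so that fine-tuning
    is never biased toward any single language for long."""
    pool = [(lang, utt) for lang, utts in utterances_by_lang.items()
            for utt in utts]
    random.Random(seed).shuffle(pool)
    return pool

def hidden_layer_lr(base_lr, n_active_outputs):
    """Scale the shared hidden layers' learning rate by the number of output
    nodes a frame activates: 1 (ML-STL-DNN, SHL-MDNN), 2 (ML-MTL-DNN-UPS or
    -UGS), or 4 (ML-MTL-DNN-UPS-UGS)."""
    return base_lr / n_active_outputs
```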
Results and Discussions: Table IX and Table X summarize the recognition performance of the various systems trained on the full training sets and on the reduced training sets of all three languages, respectively. The performance of the previous mono-lingual STL-DNNs is repeated in the tables for comparison.

• The performance of the multi-lingual STL-DNN (ML-STL-DNN) of the universal phones (graphemes) is far inferior to the triphone (trigrapheme) STL-DNN baseline. A similar finding was reported in [30]. Although the UPS/UGS models may share data among the various languages, the data become impure and they may fail to model the language specificities. Moreover, co-articulatory effects were not modeled, as the targets in our ML-STL-DNNs are only monophone/monographeme states.

• On the other hand, multi-lingual models based on the SHL-MDNN outperform their STL-DNN counterparts and reduce the WER by 2–11% relative on the full training set and 4–10% relative on the reduced training set. The improvements agree fairly well with the findings in [22], where the WER reductions are 3–5% relative. It is believed that the shared internal representation captures cross-lingual knowledge among the training languages.

• The multi-lingual MTL-DNN (ML-MTL-DNN) with an extra UPS (UGS) output layer further outperforms the corresponding phonetic (graphemic) SHL-MDNN. For example, in the case of the reduced training set, the WER reduction improves from 4–10% relative for the SHL-MDNN to 7–12% relative for the ML-MTL-DNN-UPS/-UGS. In the SHL-MDNN, the benefit of MTL is achieved only by learning common weights in the hidden layers, whereas in our ML-MTL-DNN-UPS/-UGS, the learning of the weights in the output layer of each language is further regularized by the learning of the weights in the output layer of the UPS (UGS).

• Finally, based on the results of Evaluation 1 and the above, we put the learning of the phone models and grapheme models of the three languages together with the UPS and UGS, and obtained the best results, which reduce the WER by 6–22% relative on the full set and 8–13% on the reduced set over the STL-DNN baselines. The improvements obtained by the ML-MTL-DNN-UPS-UGS are about twice those from the respective SHL-MDNNs. All these are obtained without additional language resources.

VI. CONCLUSIONS

Lack of data and language resources is the largest obstacle in low-resource ASR. In this paper, we propose two methods in the multi-task learning (MTL) framework using deep neural networks (DNNs) to train phonetic models of low-resource languages without requiring additional resources. The resulting phonetic models are believed to generalize better to unseen data because the extra learning task(s) can exploit extra information from the training data to provide a representation bias to the original phonetic modeling task. This is made possible because both the inputs and the hidden layers are shared by the multiple learning tasks. More specifically, for single-language low-resource ASR, we propose using grapheme modeling as the additional learning task to learn the language's phone models using an MTL-DNN. The proposed method was shown to work well not only for three low-resource South African languages, but also equally well for TIMIT phone recognition, even though it is well known that the grapheme-to-phone mappings in English are not simple. Moreover, although the method was originally designed for low-resource ASR, we further show that it works even for the WSJ large-vocabulary ASR task, where there is an adequate amount of training data. Thus, we believe the method can be applied in other general ASR tasks. Secondly, when the phone models of multiple low-resource languages are trained together, we propose using the acoustic modeling of a set of universal phones/graphemes (UPS/UGS) as the additional
learning task. From the optimization perspective, the UPS task serves as a regularizer for the phonetic modeling of all the involved languages. From the language perspective, it forces the multi-lingual MTL-DNN to implicitly encode a mapping among the phones of all the languages. Finally, by combining the two methods, we are able to reduce the WERs of the mono-lingual STL-DNN baselines by 8–13% relative when only an hour of training data is available from each of the three South African languages, and by 7–22% relative when 3–8 hours of data are available. Additional memory and computational requirements are needed only during MTL training; during recognition, the softmax layer(s) due to any extra tasks may be discarded. Furthermore, since our multi-lingual MTL-DNN has the same architecture as the multi-lingual SHL-MDNN but performs better than the latter, and the latter has been shown to be effective in cross-lingual model adaptation [21], [22], [42], we believe that our multi-lingual MTL-DNN will also perform better in cross-lingual model adaptation.

Our proposed MTL methods aim at improving the generalization of phonetic DNNs. There are many other ways to do this, and perhaps the best-known one is the dropout method [52], which has been applied successfully to low-resource ASR [53]. Both our MTL methods and dropout are regularization methods, but they use different mechanisms: dropout prevents overfitting by efficiently and approximately combining exponentially many different neural network architectures, whereas our MTL methods exploit extra information from the data using additional learning task(s) that share some commonality with the primary learning task and provide a representation bias towards a better local optimum. Other ways, such as weight pruning [54] and large-margin optimization [55], [56], have also been proposed, and it will be interesting to see if these methods are complementary to our proposed MTL methods.

Multi-task learning can be a powerful learning method if the tasks involved are truly related. In this paper, the multiple tasks are carefully sought and their positive relationships are assumed, based on common knowledge. In the future, we would like to formulate the task relationships mathematically and make use of them in the MTL algorithm to further improve the ensuing model. In the machine learning community, this is known as multi-task relationship learning (MTRL), and MTRL for simple linear regression tasks has been investigated [57], [58]. How to do MTRL for complex tasks like speech recognition using DNNs needs further investigation.

REFERENCES

[1] S. Hunnicutt, H. M. Meng, S. Seneff, and V. W. Zue, "Reversible letter-to-sound sound-to-letter generation based on parsing word morphology," in Proc. Eurospeech, 1993, pp. 763–766.
[2] E. G. Schukat-Talamazzini, H. Niemann, W. Eckert, T. Kuhn, and S. Rieck, "Automatic speech recognition without phonemes," in Proc. Eurospeech, 1993.
[3] S. Kanthak and H. Ney, "Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition," in Proc. ICASSP, 2002, vol. 1, pp. 845–848.
[4] P. Charoenpornsawat, S. Hewavitharana, and T. Schultz, "Thai grapheme-based speech recognition," in Proc. HLT-NAACL, Companion Vol.: Short Papers, 2006, pp. 17–20.
[5] S. Stüker, "Acoustic modeling for under-resourced languages," Ph.D. dissertation, Univ. of Karlsruhe, Karlsruhe, Germany, 2009.
[6] T. Ko and B. Mak, "Eigentrigraphemes for under-resourced languages," Speech Commun., vol. 56, pp. 132–141, Jan. 2014.
[7] S. Takahashi and S. Sagayama, "Four-level tied-structure for efficient representation of acoustic modeling," in Proc. ICASSP, 1995, vol. 1, pp. 520–523.
[8] K. F. Lee, "Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 4, pp. 599–609, Apr. 1990.
[9] S. J. Young and P. C. Woodland, "The use of state tying in continuous speech recognition," in Proc. Eurospeech, 1993, vol. 3, pp. 2203–2206.
[10] M. Y. Hwang and X. D. Huang, "Shared-distribution hidden Markov model for speech recognition," IEEE Trans. Speech Audio Process., vol. 1, pp. 414–420, Jan. 1993.
[11] E. Bocchieri and B. Mak, "Subspace distribution clustering hidden Markov model," IEEE Trans. Speech Audio Process., vol. 9, no. 3, pp. 264–275, Mar. 2001.
[12] J. R. Bellegarda and D. Nahamoo, "Tied mixture continuous parameter modeling for speech recognition," IEEE Trans. Acoust., Speech, Signal Process., vol. 38, no. 12, pp. 2033–2045, Dec. 1990.
[13] X. Huang and M. A. Jack, "Semi-continuous hidden Markov models for speech signals," Comput. Speech Lang., vol. 3, no. 3, pp. 239–251, Jul. 1989.
[14] D. Povey, L. Burget, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. K. Goel, M. Karafiát, A. Rastrow et al., "Subspace Gaussian mixture models for speech recognition," in Proc. ICASSP, 2010, pp. 4330–4333.
[15] G. Saon and J.-T. Chien, "Bayesian sensing hidden Markov models," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 43–54, Jan. 2012.
[16] M. J. F. Gales and K. Yu, "Canonical state models for automatic speech recognition," in Proc. Interspeech, 2010, pp. 58–61.
[17] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, 2010.
[18] W. Byrne, P. Beyerlein, J. M. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and T. Wang, "Towards language independent acoustic modeling," in Proc. ICASSP, 2000, vol. 2, pp. 1029–1032.
[19] V. Le and L. Besacier, "Automatic speech recognition for under-resourced languages: Application to Vietnamese language," IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 8, pp. 1471–1482, Nov. 2009.
[20] J. Kohler, "Multi-lingual phoneme recognition exploiting acoustic-phonetic similarities of sounds," in Proc. ICSLP, 1996.
[21] A. Ghoshal, P. Swietojanski, and S. Renals, "Multilingual training of deep neural networks," in Proc. ICASSP, 2013, pp. 7319–7323.
[22] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers," in Proc. ICASSP, 2013, pp. 7304–7308.
[23] J. Kohler, "Language adaptation of multilingual phone models for vocabulary independent speech recognition tasks," in Proc. ICASSP, 1998, vol. 1, pp. 417–420.
[24] F. Grezl, M. Karafiat, and K. Vesely, "Adaptation of multilingual stacked bottle-neck neural network structure for new language," in Proc. ICASSP, 2014, pp. 7654–7658.
[25] D. Imseng, H. Bourlard, J. Dines, P. Garner, and M. Magimai-Doss, "Applying multi- and cross-lingual stochastic phone space transformations to non-native speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 8, pp. 1713–1726, Aug. 2013.
[26] N. T. Vu, D. Imseng, D. Povey, P. Motlicek, T. Schultz, and H. Bourlard, "Multilingual deep neural network based acoustic modeling for rapid language adaptation," in Proc. ICASSP, 2014, pp. 7639–7643.
[27] X. Cui, V. Goel, and B. Kingsbury, "Data augmentation for deep neural network acoustic modeling," in Proc. ICASSP, 2014, pp. 5582–5586.
[28] L. Burget, P. Schwarz, M. Agarwal, P. Akyazi, K. Feng, A. Ghoshal, O. Glembek, N. Goel, M. Karafiat, D. Povey, A. Rastrow, R. Rose, and S. Thomas, "Multilingual acoustic modeling for speech recognition based on subspace Gaussian mixture models," in Proc. ICASSP, Mar. 2010, pp. 4334–4337.
[29] P. Cohen, S. Dharanipragada, J. Gros, M. Monkowski, C. Neti, S. Roukos, and T. Ward, "Towards a universal speech recognizer for multiple languages," in Proc. IEEE ASRU, 1997, pp. 591–598.
[30] H. Lin, L. Deng, D. Yu, Y.-F. Gong, A. Acero, and C.-H. Lee, "A study on multilingual acoustic modeling for large vocabulary ASR," in Proc. ICASSP, Apr. 2009, pp. 4333–4336.
[31] R. Caruana, "Multitask learning," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, USA, 1997.
[32] A. Mohamed, G. Dahl, and G. E. Hinton, "Acoustic modeling using deep belief networks," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 14–22, Jan. 2012.
[33] D. Chen, B. Mak, C.-C. Leung, and S. Sivadas, "Joint acoustic modeling of triphones and trigraphemes by multi-task learning deep neural networks for low-resource speech recognition," in Proc. ICASSP, 2014, pp. 5592–5596.
[34] S. Thrun and L. Pratt, Learning to Learn. Norwell, MA, USA: Kluwer, Nov. 1997.
[35] J. Baxter, "A model of inductive bias learning," J. Artif. Intell. Res., vol. 12, pp. 149–198, 2000.
[36] S. Ben-David and R. Schuller, "Exploiting task relatedness for multiple task learning," in Proc. COLT, 2003, pp. 567–580.
[37] R. Collobert and J. Weston, "A unified architecture for natural language processing: Deep neural networks with multitask learning," in Proc. ICML, 2008, pp. 160–167.
[38] G. Tur, "Multitask learning for spoken language understanding," in Proc. ICASSP, 2006, pp. 585–588.
[39] Y. Huang, W. Wang, L. Wang, and T. Tan, "Multi-task deep neural network for multi-label learning," in Proc. ICIP, 2013, pp. 2897–2900.
[40] S. Parveen and P. D. Green, "Multitask learning in connectionist ASR using recurrent neural networks," in Proc. Eurospeech, 2003, pp. 1813–1816.
[41] M. Seltzer and J. Droppo, "Multi-task learning in deep neural networks for improved phoneme recognition," in Proc. ICASSP, 2013, pp. 6965–6968.
[42] G. Heigold, V. Vanhoucke, A. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in Proc. ICASSP, 2013, pp. 8619–8623.
[43] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in Proc. IEEE ASRU, 1997, pp. 347–354.
[44] International Phonetic Association, Handbook of the International Phonetic Association: A Guide to the Use of the International Phonetic Alphabet. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[45] Meraka Institute, "Lwazi ASR corpus," 2009 [Online]. Available: http://www.meraka.org.za/lwazi
[46] "Lwazi phone set," 2009 [Online]. Available: ftp://hlt.mirror.ac.za/Phoneset/Lwazi.Phoneset.1.2.pdf
[47] T. Evgeniou and M. Pontil, "Regularized multi-task learning," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2004, pp. 109–117.
[48] M. M. Tempest, "DictionaryMaker 2.16 user manual," 2009 [Online]. Available: http://dictionarymaker.sourceforge.net/
[49] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[50] G. E. Dahl, D. Yu, L. Deng, and A. Acero, "Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[51] V. Zue, S. Seneff, and J. Glass, "Speech database development at MIT: TIMIT and beyond," Speech Commun., vol. 9, no. 4, pp. 351–356, Aug. 1990.
[52] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," J. Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
[53] Y. Miao and F. Metze, "Improving low-resource CD-DNN-HMM using dropout and multilingual DNN training," in Proc. ICASSP, 2013, pp. 7304–7308.
[54] D. Yu, F. Seide, G. Li, and L. Deng, "Exploiting sparseness in deep neural networks for large vocabulary speech recognition," in Proc. ICASSP, 2012, pp. 4409–4412.
[55] Y. Tang, "Deep learning using linear support vector machines," arXiv:1306.0239v3 [cs.LG], 2013.
[56] R. Min, Z. Yuan, D. A. Stanley, A. Bonner, and Z. Zhang, "A deep non-linear feature mapping for large-margin kNN classification," in Proc. ICDM, 2009, pp. 357–366.
[57] Y. Zhang and D.-Y. Yeung, "A convex formulation for learning task relationships in multi-task learning," in Proc. 26th Conf. UAI, Jul. 2010.
[58] W. Zhong and J. Kwok, "Convex multitask learning with flexible task clusters," in Proc. ICML, 2012.

Dongpeng Chen received the Bachelor's degree in computer science from the University of Science and Technology of China in 2010. Since August 2010, he has been a Ph.D. candidate under the supervision of Prof. Brian Mak at the Hong Kong University of Science and Technology. His research focuses on speech recognition and machine learning.

Brian Kan-Wing Mak received the B.Sc. degree in electrical engineering from the University of Hong Kong, the M.S. degree in computer science from the University of California, Santa Barbara, USA, and the Ph.D. degree in computer science from the Oregon Graduate Institute of Science and Technology, Portland, Oregon, USA. He was a Research Programmer at the Speech Technology Laboratory of Panasonic Technologies Inc. in Santa Barbara, and a Research Consultant at AT&T Labs–Research, Florham Park, New Jersey, USA. He has also been a Visiting Researcher at Bell Laboratories and at the Advanced Telecommunications Research Institute International. Since April 1998, he has been with the Department of Computer Science at the Hong Kong University of Science and Technology, where he is now an Associate Professor. He has served or is serving on the editorial boards of the IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, the IEEE SIGNAL PROCESSING LETTERS, and Speech Communication. He has also served on the Speech and Language Technical Committee of the IEEE Signal Processing Society. His interests include acoustic modeling, speech recognition, spoken language understanding, computer-assisted language learning, and machine learning. He received the Best Paper Award in the area of Speech Processing from the IEEE Signal Processing Society in 2004.