0% found this document useful (0 votes)

23 views34 pages

Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview

This document provides an overview of adaptation algorithms for neural network-based speech recognition, focusing on speaker, domain, and accent adaptation. It categorizes adaptation methods into embeddings, model parameter adaptation, and data augmentation, and includes a meta-analysis of their performance based on error rate reductions. The review discusses both hybrid hidden Markov model/neural network systems and end-to-end neural network systems, highlighting the importance of adapting to different conditions of use.

Uploaded by

mustafiz rehman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

23 views34 pages

Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview

Uploaded by

mustafiz rehman

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 34

Received 11 August 2020; revised 1 December 2020; accepted 9 December 2020.

Date of publication 16 December 2020;

date of current version 19 February 2021. The review of this article was arranged by Associate Editor Dr. Karen Livescu.
Digital Object Identifier 10.1109/OJSP.2020.3045349

Adaptation Algorithms for Neural

Network-Based Speech Recognition:
An Overview
PETER BELL 1 (Associate Member, IEEE), JOACHIM FAINBERG1 (Member, IEEE),
ONDREJ KLEJCH1 (Member, IEEE), JINYU LI2 (Member, IEEE), STEVE RENALS 1 (Fellow, IEEE),
AND PAWEL SWIETOJANSKI 3,4 (Member, IEEE)
1
Centre for Speech Technology Research, University of Edinburgh, EH8 9AB Edinburgh, U.K.
2
Microsoft Corporation, Redmond, WA 98052 USA
3
School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia
4
Apple, CB2 1LQ Cambridge, U.K.
CORRESPONDING AUTHOR: PETER BELL (e-mail: [email protected]).
This work was supported in part by the EPSRC under Grant EP/R012180/1 (SpeechWave), a Ph.D. studentship funded by Bloomberg, and the EU H2020 project
ELG under Grant 825627, and in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA),
via Air Force Research Laboratory (AFRL) contract #FA8650-17-C-9117.

ABSTRACT We present a structured overview of adaptation algorithms for neural network-based speech
recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neu-
ral network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The
overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data
augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms,
based on relative error rate reductions as reported in the literature.

INDEX TERMS Accent adaptation, data augmentation, domain adaptation, regularization, semi-supervised
learning, speaker adaptation, speaker embeddings, speech recognition, structured linear transforms.

I. INTRODUCTION but there are other important adaptation targets such as the
The performance of automatic speech recognition (ASR) sys- domain of use, and the spoken accent. Much of the work
tems has improved dramatically in recent years thanks to the in the area has focused on speaker adaptation: it is the case
availability of larger training datasets, the development of that many approaches developed for speaker adaptation do not
neural network based models, and the computational power explicitly model speaker characteristics, and can be applied to
to train such models on these datasets [1]–[4]. However, the other adaptation targets. Thus our core treatment of adaptation
performance of ASR systems can still degrade rapidly when algorithms is in the context of speaker adaptation, with a later
their conditions of use (test conditions) differ from the train- discussion of particular approaches for domain adaptation
ing data. There are several causes for this, including speaker and accent adaptation. Specifically, domain adaptation in this
differences, variability in the acoustic environment, and the paper refers to the task of adapting the models to the target
domain of use. domain that has either acoustic or content mismatch from the
Adaptation algorithms attempt to alleviate the mismatch source domain in which the models were trained.
between the test data and an ASR system’s training data. This overview focuses on the adaptation of neural network
Adapting an ASR system is a challenging problem since it (NN) based speech recognition systems, although we briefly
requires the modification of large and complex models, typ- discuss earlier approaches to speaker adaptation of hidden
ically using only a small amount of target data and without Markov model (HMM) based systems. NN-based systems [1],
explicit supervision. Speaker adaptation – adapting the system [5], [6] have revolutionized the field of speech recognition,
to a target speaker – is the most common form of adaptation, and there has been intense activity in the development of

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 2, 2021 33
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 1. NN architectures used for hybrid NN/HMM and end-to-end (CTC, RNN-T, AED) speech recognition systems: (a) Scheme of NN architecture used
for NN/HMM hybrid systems and for connectionist temporal classification (CTC); (b) architecture for the RNN Transducer (RNN-T); (c) architecture for
attention based encoder-decoder (AED) end-to-end systems. Input acoustic feature vectors are denoted by xt ; hidden layers are denoted by ht , hu and
output labels by yt , yu depending on whether they are indexed by time t (in hybrid and CTC systems) or only by output label u (in parts of RNN-T and AED
systems). In practice, the encoders use a wide temporal context as input, even the whole acoustic sequence in the case of most CTC and AED models.

adaptation algorithms for such systems. Adaptation of NN- (senones) as GMM-based systems [1], [2]. This led to the de-
based speech recognition is an exciting research area for at velopment of systems surpassing the accuracy of GMM-based
least two reasons: from a practical point of view, it is im- systems. This increase in computational power also enabled
portant to be able to adapt state-of-the-art systems; and from more powerful neural network models to be employed, in
a theoretical point of view the fact that NNs require fewer particular time-delay neural networks (TDNNs) [10], [11],
constraints on the input than a Gaussian-based system, along convolutional neural networks (CNNs) [12], [13], long short-
with the gradient-based discriminative training which is at the term memory (LSTM) RNNs [14], [15], and bidirectional
heart of most NN-based speech recognition systems, opens a LSTMs [16], [17].
range of possible adaptation algorithms.
B. END-TO-END SYSTEMS
Since 2015, there has been a significant trend in the field
A. NN/HMM HYBRID SYSTEMS moving from hybrid HMM/NN systems to end-to-end (E2E)
Neural networks were first applied to speech recognition as NN modeling [4], [6], [18]–[24] for ASR. E2E systems are
so-called NN/HMM hybrid systems, in which the neural net- characterized by the use of a single model transforming the
work is used to estimate (scaled) likelihoods that act as the input acoustic feature stream to a target stream of output
HMM state observation probabilities [5] (Fig. 1(a)). During tokens, which might be constructed of characters, subwords,
the 1990 s both feed-forward networks [5] and recurrent neu- or even words. E2E models are optimized using a single
ral networks (RNNs) [7] were used in such hybrid systems objective function, rather than comprising multiple compo-
and close to state-of-the-art results were obtained [8]. These nents (acoustic model, language model, lexicon) that are op-
systems were largely context-independent, although context- timized individually. Currently, the most widely used E2E
dependent NN-based acoustic models were also explored [9]. models are connectionist temporal classification (CTC) [25],
The modeling power of neural network systems at that [26], the RNN Transducer (RNN-T) model [21], [27], and the
time was computationally limited, and they were not able attention-based encoder-decoder (AED) model [6], [18].
to achieve the precise levels of modeling obtained using CTC and the RNN-T both map an input speech feature se-
context-dependent GMM-based HMM systems which be- quence to an output label sequence, where the label sequence
came the dominant approach. However, increases in com- (typically characters) is considerably shorter than the input
putational power enabled deeper neural network models to sequence. Both of these architectures use an additional blank
be learned along with context-dependent modeling using output token to deal with the sequence length differences, with
the same number of context-dependent HMM tied states an objective function which sums over all possible alignments

34 VOLUME 2, 2021
using the forward backward algorithm [28]. CTC is an earlier, transfer [43]), often trained in a multi-task setting. Earlier
and simpler, method which assumes frame independence and adaptation approaches in NLP focused on feature adaptation
functions similarly to the acoustic model in hybrid systems (e.g. [44]), but more recently better results have been obtained
without modeling the linguistic dependency across words; its using model-based adaptation, for instance “adapter layers”
architecture is similar to that of the neural network in the [43], [45], in which trainable transform layers are inserted into
hybrid system (Fig. 1(a)). the pretrained base model.
An RNN-T (Fig. 1(b)) combines an additional prediction More broadly there has been extensive work on domain
network with the acoustic encoder. The prediction network is adaptation and transfer learning in machine learning, reviewed
an RNN modeling linguistic dependencies whose input is the by Kouw and Loog [46]. This includes work on few-shot
previously output symbol. It is possible to initialize some of its learning [47]–[49] and normalizing flows [50], [51]. Normal-
layers from an external language model trained on additional izing flows which provide a probabilistic framework for fea-
text data. The acoustic encoder and the prediction network are ture transformations, were first developed for speech recogni-
combined using a feed-forward joint network followed by a tion as Gaussianization [52], and more recently have been ap-
softmax to predict the next output token given the speech input plied to speech synthesis [53] and voice transformation [54].
and the linguistic context.
Together, the RNN-T’s prediction and joint networks may D. STRUCTURE OF THIS REVIEW
be regarded as a decoder, and we can view the RNN-T as We begin by considering the issues of identifying suitable data
a form of encoder-decoder system. The AED architecture and target labels to adapt to in Section II. After discussing
(Fig. 1(c)) enriches the encoder-decoder model with an addi- speaker adaptation of non NN-based HMM systems in Sec-
tional attention network which interfaces the acoustic encoder tion III, we present a general framework for adaptation of
with the decoder. The attention network operates on the entire NN-based speech recognition systems (both hybrid and E2E)
sequence of encoder representations for an utterance, offering in Section IV, where we organize adaptation algorithms into
the decoder considerably more flexibility. A detailed com- three general categories: embedding-based approaches (dis-
parison of popular E2E models in both streaming and non- cussed in Section V), model-based approaches (discussed in
streaming modes with large scale training data was conducted Secs. VI–VIII), and data augmentation approaches (discussed
by Li et al. [29]. It is worth noting that with the recent success in Section IX).
in machine translation, there is a trend of using the transformer As mentioned above, most of our treatment of adaptation
model [30] to replace LSTM for both the AED [31]–[33] and algorithms is in the context of speaker adaptation. In Secs. X
RNN-T models [34]–[36]. and XI we discuss specific approaches to accent adaptation
and domain adaptation respectively.
C. ADAPTATION AND TRANSFER LEARNING IN Our primary focus is on the adaptation of acoustic models
RELATED FIELDS and end-to-end models. In Section XII we provide a summary
Adaptation and transfer learning have become important and of work in language model (LM) adaptation, mentioning both
intensively researched topics in other areas related to machine n-gram and neural network language models, and the use of
learning, most notably computer vision and natural language LM adaptation in E2E systems.
processing (NLP). In both these cases the motivation is to train Finally we provide a meta analysis of experimental studies
powerful base models using large amounts of training data, using the main adaptation algorithms that we have discussed
then to adapt these to specific tasks or domains, for which (Section XIII). The meta-analysis is based on experiments
considerably less training data is available. reported in 47 papers, carried out using 38 datasets, and is
In computer vision, the base model is typically a large primarily based on the relative error rate reduction arising
convolutional network trained to perform image classifica- from adaptation approaches. In this section we analyze the
tion or object recognition using the ImageNet database [37], performance of the main adaptation algorithms across a vari-
[38]. The ImageNet model is then adapted to a lower re- ety of adaptation target types (for instance speaker, domain,
source task, such as computer-aided detection in medical and accent), in supervised and unsupervised settings, in six
imaging [39]. Kornblith et al. [40] have investigated empiri- different languages, and using six different NN model types
cally how well ImageNet models transfer to different tasks and in both hybrid and E2E settings. Raw data, aggregated results
datasets. and the corresponding scripts are available at https://github.
Transfer learning in NLP differs from computer vision, com/pswietojanski/ojsp_adaptation_review_2020.
and from the speech recognition approaches discussed in this
paper, in that the base model is trained in an unsupervised II. IDENTIFYING ADAPTATION TARGETS
fashion to perform language modeling or a related task, typ- Adaptation aims to reduce the mismatch between training and
ically using web-crawled text data. Base models used for test conditions. For an adaptation algorithm to be effective,
NLP include the bidirectional LSTM [41] and Transformers the distribution of the adaptation data should be close to that
which make use of self-attention [42], [43]. These models are encountered in test conditions. For this reason it is impor-
then trained on specific NLP tasks, with supervised training tant to ensure that the target labels adapted to form coherent
data, which is specified in a common format (e.g. text-to-text classes. For the task of acoustic adaptation this requirement

VOLUME 2, 2021 35
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

is typically satisfied by forming the adaptation data from one Depending on whether adaptation transforms are estimated
or more speech segments from known testing conditions (i.e. on held out data, or adaptation is iteratively derived from
the same speaker, accent, domain, or acoustic environment). test segments, we will refer to these as enrolment or online
While for some tasks labels ascribed to speech segments may modes, respectively. In enrolment mode, the adaptation data
exist, allowing segments to be grouped into larger adaptation would be ideally labeled with a gold-standard transcription,
clusters, it is unrealistic to assume the availability of such to enable supervised learning algorithms to be used for adap-
metadata in general. However, depending on the application tation. However, supervised data is rarely available: small
and the operating regime of the ASR system, it may be possi- amounts may be available for some domain adaptation tasks
ble to derive reasonable proxies. (for example, adapting a system trained on typical speech to
Utterance-level adaptation derives adaptation statistics us- disordered speech [68]). In the usual case, where supervised
ing a single speech segment [55]. This waives the require- adaptation data is not available, supervised training algorithms
ment to carry information about speaker identity between can still be used with “pseudo-labels” that are automatically
utterances, which may simplify deployment of recognition obtained from a seed model, a process which is a type of semi-
systems – in terms of both engineering and privacy – as one supervised training [69]. Alternatively, unsupervised training
does not need to estimate and store offline speaker-specific can be applied to learn embeddings for the different adaptation
information. On the other hand, owing to the small amounts classes, such as i-vectors [56] or bottleneck features extracted
of data available for adaptation the gains are usually lower from an auto-encoder neural network [70]. A two-pass system
than one could obtain with speaker-level clusters. While many is a special case for which the statistics are estimated from test
approaches use utterances to directly extract corresponding data using the first pass decoding with a speaker-independent
embeddings to use as an auxiliary input for the acoustic model in order to obtain adaptation labels, followed by a
model [56]–[59], one can also build a fixed inventory of second pass with the speaker-adapted model.
speakers, domains, or topic codes [60] or embeddings [61], For semi-supervised approaches, it is possible to further
[62] when learning the acoustic model or acoustic encoder, filter out regions with low-confidence to avoid the reinforce-
and then use the test utterance to select a combination of these ment of potential errors [71]–[73]. There is some evidence
at test stage. The latter approach alleviates the necessity of in the literature that, for some limited-in-capacity transforms
estimating an accurate representation from small amounts of estimated in a semi-supervised manner, the first pass tran-
data. It may be possible to relax the utterance-level constraint script quality has a small impact on the adapted accuracy as
by iteratively re-estimating adaptation statistics using a num- long as these are obtained with the corresponding speaker-
ber of preceding segment(s) [57]. Extra care usually needs independent model [74], [75]. In lattice supervision multiple
to be taken to handle silence and speech uttered by different possible transcriptions are used in a semi-supervised setting
speakers, as failing to do so may deteriorate the overall ASR by generating a lattice, rather than the one-best transcrip-
performance [62]–[64]. tion [76]–[79].
Speaker-level adaptation aggregates statistics across two or
more segments uttered by the same talker, requiring a way to III. ADAPTATION ALGORITHMS FOR HMM-BASED ASR
group adaptation utterances produced by different talkers. In Speaker adaptation of speech recognition systems has been in-
some cases – for example lecture recordings and telephony vestigated since the 1960s [80], [81]. In the mid-1990 s, the in-
– speaker information may be available. In other cases po- fluential maximum likelihood linear regression (MLLR) [82]
tentially inaccurate metadata is available, for instance in the and maximum a posteriori (MAP) [83] approaches to speaker
transcription of television or online broadcasts. In many cases adaptation for HMM/GMM systems were introduced. These
(for instance, anonymous voice search) speaker metadata is methods, described below, stimulated the field leading an in-
not available. The generic approach to this problem relies on tense activity in algorithms for the adaptation of HMM/GMM
a speaker diarization system [65], that can identify speakers systems, reviewed by Woodland [84] and Shinoda [85], as
and accordingly assign their identities to the corresponding well as in section 5 of Gales and Young’s broader review
segments in the recordings. This is often used in the of- of HMM-based speech recognition [86]. As we later discuss,
fline transcription of meetings or broadcast media. Alternative some of the algorithms developed for HMM-based systems, in
clustering approaches can be used to define the adaptation particular feature transformation approaches have been suc-
classes [66], [67]. cessfully applied to NN-based systems. In this section we
Domain-level adaptation broadens the speaker-level cluster review MAP, MLLR, and related approaches to the adaptation
by including speech produced by multiple talkers character- of HMM/GMM systems, along with earlier approaches to
ized by some common characteristic such as accent, age, speaker adaptation.
medical condition, topic, etc. . This typically results in more
adaptation material and an easier annotation process (cluster A. SPEAKER NORMALISATION
labels need to be assigned at batch rather than segment level). Many of these early approaches were designed to normalize
As such, domain adaptation can usually leverage adaptation speaker-specific characteristics, such as vocal tract length,
transforms with greater capacity, and thus offer better adapta- building on linguistic findings relating to speaker normaliza-
tion gains. tion in speech perception [87], often casting the problem as

36 VOLUME 2, 2021
one of spectral normalization. This work included formant- transforms, one for Gaussians in speech models and one for
based frequency warping approaches [80], [81], [88] and the Gaussians in non-speech models.
estimation of linear projections to normalize the spectral rep- Constrained MLLR [104], [105], is an important variant of
resentation to a speaker-independent form [89], [90]. MLLR, in which the same transform is used for both the mean
Vocal tract length normalization (VTLN) was introduced and covariance:
by Wakita [91] (and again by Andreou [92]) as a form of
μ̂s = As μ − bs (3)
frequency warping with the aim to compensate for vocal
tract length differences across speakers. VTLN was exten-
ˆs = As As (4)
sively investigated for speech recognition in the 1990 s and
2000s [93]–[96], and is discussed further in Section V. In this case, the log likelihood for a single Gaussian is given
by
B. MODEL BASED APPROACHES ˆ s ) = log N (x; As μ − bs , As As )
L cMLLR (x; μ̂s , (5)
In model based adaptation, the speech recognition model is
used to drive the adaptation. In work prefiguring subspace = log N (A−1 −1
s x + As bs ; μ, ) − log |As | (6)
models, Furui [97] showed how speaker specific models could
be estimated from small amounts of target data in a dynamic It can be seen that this transform of the model parameters is
time warping setting, learning linear transforms between pre- equivalent to applying an affine transform to the data – hence
existing speaker-dependent phonetic templates, and templates constrained MLLR is often referred to as feature-space MLLR
for a target speaker. Similar techniques were developed in (fMLLR), although it is not strictly feature-space adaptation
the 1980 s by adapting the vector quantization (VQ) used in unless a single transform is shared across all Gaussians in
discrete HMM systems. Shikano, Nakamura, and Abe [98] the system, in which case the Jacobian term − log |As | can be
showed that mappings between speaker dependent codebooks ignored. MLLR and its variants have been used extensively
could be learned to model a target speaker (a technique widely in the adaptation of Gaussian mixture model (GMM)-based
used for voice conversion [99]); Feng et al. [100] developed HMM speech recognition systems [84], [86].
a VQ-based approach in which speaker-specific mappings
were learned between codewords in a speaker-independent C. BAYESIAN METHODS
codebook, in order to maximize the likelihood of the discrete The above model-based adaptation approaches have aimed to
HMM system. Rigoll [101] introduced a related approach estimate transforms between a speaker independent model and
in which the speaker-specific transform took the form of a a model adapted to a target speaker. An alternative Bayesian
Markov model. A continuous version of this approach, re- approach attempts to perform the adaptation by using the
ferred to as probabilistic spectrum fitting, which aimed to speaker independent model to inform the prior of a speaker-
adjust the parameters of a Gaussian phonetic model was in- adapted model. If the set of parameters of a speech recognition
troduced by Hunt [102] and further developed by Cox and model are denoted by θ , then maximum likelihood estimation
Bridle [103]. sets θ to maximize the likelihood p(X | θ ). In MAP training,
These probabilistic spectral modeling approaches can be the estimation procedure maximizes the posterior of the pa-
viewed as precursors to maximum likelihood linear regres- rameters given the data X = {x1 . . . xT }:
sion (MLLR) introduced by Leggetter and Woodland [82]
P(θ | X ) ∝ p(X | θ ) p(θ )r , (7)
and generalized by Gales [104]. MLLR applies to continuous
probability density HMM systems, composed of Gaussian where p(θ ) is the prior distribution of the parameters, which
probability density functions. In MLLR, linear transforms are can be based on speaker independent models, and r is an em-
estimated to adapt the mean vectors and – in [104] – covari- pirically determined weighting factor. Gauvain and Lee [83]
ance matrices of the Gaussian components. If μ and are the presented an approach using MAP estimation as an adaptation
mean vector and covariance matrix of a particular Gaussian, approach for HMM/GMM systems. A convenient choice of
then MLLR adapts the parameters as follows: function for p(θ ) is the conjugate to the likelihood – the
function which ensures the posterior has the same form as the
μ̂s = As μ − bs (1) prior. For a GMM, if it is assumed that the mixture weights
ˆ s = Hs Hs . ci and the Gaussian parameters (μi , i ) are independent,
(2)
then the conjugate
prior may take the form of a mixture
model pD (ci ) i pW (μi , i ), where pD () is a Dirichlet distri-
The speaker-specific parameters bs , As and Hs are estimated
bution (conjugate to the multinomial) and pW () is the normal-
using maximum likelihood. MLLR is a compact adaptation
Wishart density (conjugate to the Gaussian). This results in
technique since the transforms are shared across Gaussians:
the following intuitively understandable parameter estimate
for instance all Gaussians corresponding to the same mono-
for the adapted mean of a Gaussian μ̂ ∈ Rd :
phone might share mean and covariance transforms. Very
often, especially when target data is sparse, a greater degree τ μ0 + t γ (t )xt
μ̂ = , (8)
of sharing is employed – for instance two shared adaptation τ + t γ (t )

VOLUME 2, 2021 37
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

where μ0 ∈ Rd is the unadapted (speaker-independent) mean, have been used for the adaptation of neural network based
xt ∈ Rd is the adaptation vector at time t, γ (t ) ∈ R is the com- systems (Section VI).
ponent occupation probability (responsibility) for the Gaus- In contrast to model-based methods, in feature-based adap-
sian component at time t (estimated by the forward-backward tation it is usual to adapt or normalize the acoustic features
algorithm), and τ is a positive scalar-valued parameter of the for each speaker in both the training and test sets– this may
normal-Wishart density, which is typically set to a constant be viewed as a form of speaker adaptive training. For exam-
empirically (although Gauvain and Lee [83] also discuss an ple, in the case of cepstral mean and variance normalization
empirical Bayes estimation approach for this parameter). The (CMVN), statistics are computed for each speaker and the
re-estimated means of the Gaussian components take the form features normalized accordingly, during both training and test.
of a weighted interpolation between the speaker independent Likewise, VTLN is also carried out for all speakers, trans-
mean and data from the target speaker. When there is no forming the acoustic features to a canonical form, with the
target speaker data for a Gaussian component, the parameters variation from changes in vocal tract length being normalized
remain speaker-independent; as the amount of target speaker away.
data increases, so the Gaussian parameters approach the target
speaker maximum likelihood estimate. IV. ADAPTATION ALGORITHMS FOR NN-BASED ASR
The literature describing methods for adaptation of NNs has
D. SPEAKER ADAPTIVE TRAINING tended to inherit terminology from the algorithms used to
In the model-based approaches discussed above (MLLR and adapt HMM-GMM systems, for which there is an important
MAP), we have implicitly assumed that adaptation takes place distinction between feature space and model space formula-
at test time: speaker independent models are trained using tions of MLLR-type approaches [104], as discussed in the
recordings of multiple speakers in the usual way, with only previous section. In a 2017 review of NN adaptation, Sim et
the test speakers used for adaptation. In contrast to this, it is al. [109] divide adaptation algorithms into feature normalisa-
possible to employ a model-based adaptive training approach. tion, feature augmentation and structured parameterization.
In speaker adaptive training [106], a transform is estimated for (They also use a further category termed constrained adapta-
each speaker in the training set, as well as for each speaker tion, discussed further below.)
in the test set. During training, speaker-specific transforms The task of an ASR model is to map a sequence of acous-
and a speaker-independent canonical model are updated in an tic feature vectors, X = (x1 , . . . xt , . . . , xT ), xt ∈ Rd to a se-
iterative fashion. quence of words W . Although – as we discuss below – most
Speaker space approaches represent a speaker-adapted techniques described in this paper apply equally to end-to-end
model as a weighted sum of a set of individual models models and hybrid HMM-NN models, we generally treat the
which may represent individual speakers or, more commonly, model to be adapted as an acoustic model. That is, we ignore
speaker clusters. In cluster-adaptive training (CAT) [66], the aspects of adaptation that affect only P(W ), independently of
mean for a Gaussian component for a specific speaker s is the acoustics X (LM adaptation is discussed in Section XII).
given by: Further, with only a small loss of generality, in what follows
we will assume that the model operates in a framewise man-

C ner, thus we can define the model as:
μ̂s = wc μc (9)
yt = f (xt ; θ ) (10)
c=1
where f (x; θ ) is the NN model with parameters θ and yt is the
where μc ∈ Rd is the mean of the particular Gaussian com- output label at frame t. In a hybrid HMM-NN system, for ex-
ponent for speaker cluster c, and wc ∈ R is the cluster weight. ample, yt is taken to be a vector of posterior probabilities over
This expresses the speaker-adapted mean vector as a point a senone set. In a CTC model, yt would be a vector of posterior
in a speaker space. Given a set of canonical speaker cluster probabilities over the output symbol set, plus blank symbol.
models, CAT is efficient in terms of parameters, since only Note that NN models often operate on a wider windowed
the set of cluster weights need to be estimated for a new set of input features, xt (w) = [xt−c , xt−c+1 , . . . , xt+c−1 , xt+c ]
speaker. Eigenvoices [107] are alternative way of construct- with the total window size w = 2c + 1. For reasons of no-
ing speaker spaces, with a speaker model again represented tational clarity, we generally ignore the distinction between
as a weighted sum of canonical models. In the Eigenvoices xt and xt (w), unless it is specifically relevant to a particular
technique, principal component analysis of “supervectors” topic.
(concatenated mean vectors from the set of speaker-specific In this framework, we can define feature normalisation
models) is used to create a basis of the speaker space. approaches as acting to transform the features in a speaker-
A number of variants of cluster-adaptive training have been dependent manner, on which the speaker-independent model
presented, including representing a speaker by combining operates. For each speaker s, a transformation function g :
MLLR transforms from the canonical models [66], and using
Rd → Rd computes:
sequence discriminative objective functions such as minimum
phone error (MPE) [108]. Techniques closely related to CAT xt = g(xt ; φs ) (11)

38 VOLUME 2, 2021
where φs is a set of speaker-dependent parameters. Commonly which we can write as a structured parameter transformation
the dimension of the normalised features is the identical to the of f , as defined in (12):
original (i.e. d = d ) but this is not required. This family is
θs = {θ , φs } = h({θ , φ I }; ϕs ) (17)
closely related to feature space methods used in GMM sys-
tems described above in Section III, including fMLLR (when where the transformation h( · ; ϕs ) is simply set to replace
only a single affine transform is used), VTLN, and CMVN. the parameters pertaining to g with the original normalisation
Structured parameterization approaches, in contrast, in- parameters, φs = ϕs , leaving the other parameters unchanged.
troduce a speaker-dependent transformation of the acoustic Similarly, feature augmentation approaches may be readily
model parameters: seen to be a further special case of structured adaptation. In
the simple case of input feature augmentation (13), we see
θs = h(θ ; ϕs ) (12)
that the output of the first layer, prior to the non-linearity, can
In this case, the function h would typically be structured so be written as

as to ensure that the number of speaker-dependent parameters x

ϕs is sufficiently smaller than the number of parameters of the z = Wx + b = W +b (18)
λs
original model. Such methods are closely related to model-
based adaptation of GMMs such as MLLR. where W and b are the weight and bias of the first layer
respec-

Finally, feature-augmentation approaches extend the fea- tively. By introducing a decomposition of W , W = U V
ture vector xt with a speaker-dependent embedding λs , which
we write this as
we can write as
x
xt z= U V + b = U x + b + V λs (19)
xt = (13) λs
λs
with U ∈ θ and V ∈ θ E being weight matrices pertaining to
Close variants of this approach use the embedding to augment
the input features and speaker embedding, respectively.
the input to higher layers of the network. Note that the in-
This can be expressed as a structured transformation of the
corporation of an embedding requires the addition of further
bias:
parameters to the acoustic model controlling the manner in
which the embedding acts to adapt the model, which can θs = {U , b } = h({U, b}; ϕs ) = {U, b + V λs } (20)
be written f (xt ; θ , θ E ). The embedding parameters θ E are
themselves speaker-independent. with ϕs = V λs . Similar arguments apply to embeddings used
We suggest that the distinctions described above may not in other network layers.
always be helpful when considering NN adaptation specifi- Certain types of feature normalisation approaches can be
cally, because all three approaches can be seen to be closely expressed as feature augmentation. For example, cepstral
related or even special cases of each other. As we saw in mean normalisation given by
Section III this is not the case in HMM-GMM systems, where xt = g(xt ; φs ) = xt − μs (21)
the distinction between feature-space and model adaptation
is important (as noted by Gales [104]) because in the former can be expressed as

case, different feature space transformations can be carried x
out per senone class if the appropriate scaling by a Jacobian z = W (x − μs ) + b = W W +b (22)
is performed; whilst in the latter case, it is necessary for the −μs
adapted probability density functions to be re-normalized. with augmented features λs = −μs .
As an example of the equivalence of the close relationship As we have seen, approaches to NN adaptation under the
between the three approaches to NN adaptation, the normali- traditional categorization of feature augmentation, structured
sation function g can generally be formulated as shallow NN, parameterization and feature normalization can usually be
possibly without a non-linearity. If there is a set of “identity seen as special cases of one another. Therefore, in the remain-
transform” parameters φ I such that der of this paper, we adopt an alternative categorization:
r Embedding-based approaches in which any speaker-
g(xt ; φ I ) = xt , ∀xt (14)
dependent parameters are estimated independently of the
then we have model, with the model f (xt ; θ ) itself being unchanged
between speakers, other than the possible need for addi-
yt = f (xt ; θ ) = f (g(xt ; φ I ); θ ) = f (xt ; θ , φ I ) (15) tional embedding parameters θ E ;
r Model-based approaches in which the model parameters
where f is a new network comprising of a copy of the original
network f with the layers of g prepended. Applying feature θ are directly adapted to data from the target speaker
according to the primary objective function;
normalization (11) leads to: r Data augmentation approaches which attempt to syn-
yt = f (xt ; θ ) = f (g(xt ; φs ); θ ) = f (xt ; θ , φs ) (16) thetically generate additional training data with a close

VOLUME 2, 2021 39
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

match to the target speaker, by transforming the existing attribute aware training; however, we do not believe that all
training data. multi-task learning approaches to adaptation can be labeled in
This distinction is, we believe, particularly important in this way.
speaker adaptation of NNs because in ASR it has become Data augmentation methods have proved very success-
standard to perform adaptation in a semi-supervised manner, ful in adaptation to other sources of variability, particularly
with no transcribed adaptation data for the target speaker. In those – such as background noise conditions – where the
this setting, as we will discuss, standard objective functions required model transformations are hard to explicitly estimate,
such as cross-entropy, which may be very effective in su- but where it is easy to generate realistic data. In the case
pervised training or adaptation, are particularly susceptible to of speaker adaptation, it is significantly harder to generate
transcription errors in semi-supervised settings. sufficiently good-quality synthetic data for a target speaker,
We describe the model-independent approaches as given only limited data from the speaker in question. How-
embedding-based because any set of speaker-dependent ever, there is a growing body of work in this area using, for
parameters can be viewed as an embedding. Embedding- example, techniques from the field of speech synthesis [116].
based approaches are discussed in Section V. Well-known Approaches in this area are discussed in Section IX.
examples of speaker embeddings include i-vectors [56], Most works suitable for adapting hybrid acoustic models
[110], and x-vectors [111], but can also include parameter can be leveraged to adapt acoustic encoders in E2E mod-
sets more classically viewed as normalizing transforms els. Both Kullback-Leibler divergence (KLD) regularization
such as CMVN statistics and global fMLLR transforms (see (Section VII) and multi-task learning (MTL) methods (Sec-
Section III above). However, for the reasons mentioned above, tion VIII) have been used for speaker adaptation for CTC and
we exclude from this category methods where the embedding AED models [117], [118].
is simply a subset of the primary model parameters and Sim et al. [119] updated the acoustic encoder of RNN-T
estimated according to the model’s objective function. Note models using speaker-specific adaptation data. Furthermore,
that methods using a one-hot encoding for each speaker are by generating text-to-speech (TTS) audio from the target
also excluded, since it would be impossible to use these with a speaker, more data can be used to adapt the acoustic encoder.
speaker-independent model, without each test speaker having Such data augmentation adaptation (discussed in Section IX)
been present in training data; such methods might however was shown to be an effective way for the speaker adaptation
be useful for closely related tasks such as domain adaptation, of E2E models [120] even with very limited raw data from
discussed in Section XI. the target speaker. Embeddings have also been used to train a
The primary benefit of speaker adaptive approaches over speaker-aware AED model [62], [121], [122].
simply using speaker-dependent models is the prevention of Because AED and RNN-T also have components corre-
over-fitting to the adaptation data (and its possibly errorful sponding to the language model, there are also techniques spe-
transcript). A large number of model-based adaptation tech- cific to adapting the language modeling aspect of E2E models,
niques have been proposed to achieve this; in this paper, we for instance using a text embedding instead of an acoustic
sub-divide them into: embedding to bias an E2E model in order to produce outputs
r Structured transforms: Methods in which a subset of relevant to the particular recognition context [123]–[125]. If
the parameters are adapted, with many instances struc- the new domain differs from the source domain mainly in
turing the model so as to permit a reduced number of content instead of acoustics, domain adaptation on E2E mod-
speaker-dependent parameters, as in the Learning Hid- els can be performed by either interpolating the E2E model
den Unit Contributions (LHUC) scheme [75], [112]. The with an external language model or updating language model
can be viewed as an analogy to MLLR transforms for related components inside the E2E model with the text-to-
GMMs. They are discussed in Section VI. speech audio generated from the text in the new domain [126],
r Regularization: Methods with explicit regularization of [127], discussed in Section XII.
the objective function to prevent over-fitting to the adap-
tation data, examples including the use of L2 loss or V. SPEAKER EMBEDDINGS
KL divergence terms to penalize the divergence from Speaker embeddings map speakers to a continuous space.
the speaker-independent parameters [113], [114]. Such In this section we consider embeddings that may be ex-
methods can be viewed as related to the MAP approach tracted in a manner independent of the model, and which
for GMM adaptation. They are discussed in Section VII. are also typically unsupervised with respect to the transcript.
r Variant objective functions: Methods which adopt vari- They can therefore also be useful in a standalone manner
ants of the primary objective function to overcome the for other tasks such as speaker recognition. When used with
problems of noise in the target labels, with examples an acoustic model, the model learns how to incorporate the
including the use of lattice supervision [79] or multi-task embedding information by, in effect, speaker-aware training.
learning [115]. They are discussed in Section VIII. Speaker embeddings may encode speaker-level variations that
The second two categories above are collectively termed are otherwise difficult for the AM to learn from short-term
constrained adaptation in the review by Sim et al. [109]. features [64], and may be included as auxiliary features to the
Within this, multi-task learning is labeled by Sim et al. as network. Specifically, let x ∈ Rd denote the acoustic features,

40 VOLUME 2, 2021
and λs ∈ Rk a k-dimensional speaker embedding. The speaker mismatched transforms [133], or to obtain speaker-adapted
embeddings may be concatenated with the acoustic input fea- features derived from GMM log-likelihoods [138], otherwise
tures, as previously seen in (13): known as GMM-derived features.
Another technique with a long history is VTLN [91], [92],
xt [94], [139], which was briefly introduced in Section III. To
xt = (23)
λs control for varying vocal tract lengths between speakers,
VTLN typically uses a piecewise linear warping function to
Alternatively they may be concatenated with the activations adjust the filterbank in feature extraction. This requires only
of a hidden layer. In either case the result is bias adaptation of a single warping factor parameter that can be estimated using
the next hidden layer as discussed in Section VI. As noted by any AM with a line search. Alternatively, linear-VTLN (e.g.
Delcroix et al. [128] the auxiliary features may equivalently [95]) obtains a corresponding affine transform similar to fM-
be added directly to the features using a learned projection LLR, but chooses from a fixed set of transforms at test time. A
matrix P, with the benefit that the downstream architecture related idea is that of the exponential transform [140], which
can remain unchanged: forgoes any notion of vocal tract length, but akin to VTLN
xt = xt + Pλs (24) is controlled by a single parameter. More recently, adaptation
of learnable filterbanks, operating as the first layer in a deep
There are many other ways to incorporate embeddings into network, has resulted in updates which compensate for vocal
the AM: for example, they may be used to scale neuron acti- tract length differences between speakers [141].
vations as in LHUC [75]. More generally we may consider
embeddings applied to either biases or activations through B. I-VECTORS
context-adaptive [129] or control networks [130]. It is possible Many types of embeddings stem from research in speaker
to limit connectivity from the auxiliary features to the rest verification and speaker recognition. One such approach is
of the network in order to improve robustness at test time identity vectors, or i-vectors [56], [110], [142], which are
or to better incorporate static features [131]–[133]. We will estimated using means from GMMs trained on the acoustic
further consider transformations of the features as speaker features. Specifically, the extraction of a speaker i-vector, λs ∈
embeddings, such as with fMLLR [104], [105], and they may Rk , assumes a linear relationship between the global means
also be used as label targets [134]. from a background GMM (or universal background model,
UBM), mg ∈ Rm , and the speaker-specific means, ms ∈ Rm
A. FEATURE TRANSFORMATIONS
We may consider speaker-level transformations of the acous- ms = mg + T λs (27)
tic features as speaker embeddings. These include methods where T ∈ Rm×k is a matrix that is shared across all speakers
traditionally viewed as normalisation, such as CMVN and which is sometimes called the total variability matrix from
fMLLR, which produce affine transformations of the features: its relation to joint factor analysis [143]. An i-vector thus
xs = As x + bs (25) corresponds to coordinates in the column space of T . T is
estimated iteratively using the EM algorithm. It is possible to
CMVN derives its name from the application to cepstral replace the GMM means with posteriors or alignments from
features, but corresponds to the standardization of the features the AM [131], [144], [145] although this is no longer indepen-
to zero mean and unit variance (z-score): dent of the AM and requires transcriptions. The i-vectors are
x−μ usually concatenated with the acoustic features as discussed
xs = √ (26) above, but have also been used in more elaborate architectures
σ2 +
to produce a feature mapping of the input features them-
where μ ∈ Rd is the cepstral mean, σ 2 ∈ Rd is the cepstral selves [146], [147].
variance, and is a small constant for numerical stability.
fMLLR [104] belongs to a family of speaker adaptation C. NEURAL NETWORK EMBEDDINGS
methods originally developed for HMM-GMM models, as A number of works proposed to extract low-dimensional em-
discussed in Section III. The technique has, however, later beddings from bottleneck layers in neural network models
been used with success to transform features for hybrid mod- trained to distinguish between speakers [64], [132] or across
els as well [135], [136]. While the fMLLR transforms were multiple layers followed by dimensionality reduction in a sep-
traditionally estimated using maximum likelihood and HMM- arate AM (e.g. CNN embeddings [148]). One such approach,
GMM models, the transforms may also be estimated using a using Bottleneck Speaker Vector (BSV) embeddings [64],
neural network trained to estimate fMLLR features [137] (in trains a feed-forward network to predict speaker labels (and
Section VI we will further discuss structurally similar trans- silence) from spliced MFCCs (Fig. 2(a)). Tan et al. [132]
forms estimated using the main objective function). Instead proposed to add a second objective to predict monophones in
of transforming the input features, some work has explored a multi-task setup. The bottleneck layer dimension is typically
fMLLR features as an additional, auxiliary, feature stream set to values commonly used for i-vectors. In fact, Huang and
to the standard features in order to improve robustness to Sim [64] note that if the speaker label targets are replaced with

VOLUME 2, 2021 41
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 2. (a) Bottleneck feature extraction that uses a pretrained speaker classifier. (b) Summary network extracting speaker embeddings which is
trained jointly with the acoustic model.

speaker deviations from a UBM, then the bottleneck-features objective for the x-vector network which implicitly factors
may be considered frame-level i-vectors. The extracted fea- out channel information, which might be beneficial for adap-
tures are averaged across all speech frames of a given speaker tation. The optimal objective for speaker embeddings used in
to produce speaker-level i-vectors. ASR differs from the objective used in speaker verification.
There are several more recent approaches that we may Summary networks [59], [128] produce sequence level
collectively refer to as -vectors. Like bottleneck features, summaries of the input features and are closely related to
these approaches typically extract embeddings from neural -vectors (cf. Fig. 2(b)). Auxiliary features are produced by
networks trained to discriminate between speakers, but not a neural network that takes as input the same features as the
necessarily using a low-dimensional layer. For instance, deep AM, and produces embeddings by taking the time-average of
vectors, or d-vectors [149], [150], extract embeddings from the output. By incorporating the averaging into the graph, the
feed-forward or LSTM networks trained on filterbank fea- network can be trained jointly with the AM in an end-to-end
tures to predict speaker labels. The activations from the last fashion [128]. A related approach is to produce LHUC fea-
hidden layer are averaged over time. X-vectors [111], [130] ture vectors (Section VI) from an independent network with
use TDNNs with a pooling layer that collects statistics over embedded averaging [153].
time and the embeddings are extracted following a subsequent
affine layer. A related approach called r-vectors [151] uses the D. EMBEDDINGS FOR E2E SYSTEMS
architecture of x-vectors, but predicts room impulse response The embedding method is also helpful to the adaptation of
(RIR) labels rather than speaker labels. In contrast to the E2E systems. Fan et al. [121] and Sari et al. [62] generated a
above approaches, label embeddings, or l-vectors [134], are soft embedding vector by combining a set of i-vectors from
designed to be used as soft output targets for the training of an multiple speakers with the combination weight calculated
AM. Each label embedding represents the output distribution from the attention mechanism. The soft embedding vector is
for a particular senone target. In this way they are, in effect, appended to the acoustic encoder output of the E2E model,
uncoupled from the individual data points and can be used for helping the model to normalize speaker variations. While the
domain adaptation without a requirement of parallel data. We soft embedding vectors in [62], [121] are different at each
will discuss this idea further in Section XI. For completeness frame, the speaker i-vectors are concatenated with the speech
we also mention h-vectors [152] which use a hierarchical utterance as the input of every encoder layer in [122] to form a
attention mechanism to produce utterance-level embeddings, persistent memory through the depth of encoder, hence learn-
but has only been applied to speaker recognition tasks. ing utterance-level speaker knowledge.
X-vector embeddings are not widely used for adaptating In addition to acoustic embedding, E2E models can also
ASR algorithms in practice – especially in comparison to leverage text embedding to improve their modeling accuracy.
commonly used i-vectors – as experiments have not shown For example, E2E models can be optimized to produce outputs
consistent improvements in recognition accuracy. One rea- relevant to the particular recognition context, for instance user
son for this is related to the speaker identification training contacts or device location. One solution is to add a context

42 VOLUME 2, 2021
bias encoder in addition to the original audio encoder into E2E a batch normalization layer, adapting both the scale and the
models [123]–[125]. This bias encoder takes a list of biasing offset of the hidden layer activations with mean μ ∈ Rn and
phrases as the input. The context vector of the biasing list variance σ 2 ∈ Rn :
is generated by using the attention mechanism, and is then h−μ
concatenated with the context vector of acoustic encoder and h = γs √ + βs . (32)
σ2 +
is fed into the decoder.
Mana et al. [161] showed that batch normalization layers can
VI. STRUCTURED TRANSFORMS be also updated by recomputing the statistics μ and σ 2 in
Methods to adapt the parameters θ of a neural network-based online fashion.
acoustic model f (x; θ ) can be split into two groups. The A similar approach with a low-memory footprint adapts the
first group adapts the whole acoustic model or some of its activation functions instead of the scale rs and offset bs . Zhang
layers [113], [114], [154]. The second group employs struc- and Woodland [162] proposed the use of parameterised sig-
tured transformations [109] to transform input features x, hid- moid and ReLU activation functions. With the parameterised
den activations h or outputs y of the acoustic model. Such sigmoid function, hidden activations h are computed from
transformations include the linear input network (LIN) [155], hidden pre-activations z as
linear hidden network (LHN) [156] and the linear output net- 1
work (LON) [157]. These transforms are parameterized with h = ηs , (33)
1 + e−γs z+ζs
a transformation matrix As ∈ Rn×n and a bias bs ∈ Rn . The
where ηs ∈ Rn , γs ∈ Rn and ζs ∈ Rn are speaker dependent
transformation matrix As is initialized as an identity matrix
parameters. |ηs | controls the scale of the hidden activations,
and the bias bs is initialized as a zero vector prior to speaker
γs controls the slope of the sigmoid function and ζs controls
adaptation. The adapted hidden activations then become
the midpoint of the sigmoid function. Similarly, parameterised
h = As h + bs . (28) ReLU activations were defined as
However, even a single transformation matrix As can contain αs z if z > 0
many speaker dependent parameters, making adaptation sus- h= , (34)
βs z if z ≤ 0
ceptible to overfitting to the adaptation data. It also limits its
practical usage in real world deployment because of memory where αs ∈ Rn and βs ∈ Rn are speaker dependent parame-
requirements related to storing speaker dependent parameters ters that correspond to slopes for positive and negative pre-
for each speaker. Therefore there has been considerable re- activations, respectively.
search into how to structure the matrix As and the bias bs to Other approaches factorize the transformation matrix As
reduce the number of speaker dependent parameters. into a product of low-rank matrices to obtain a compact set of
The first set of approaches restricts the adaptation matrix speaker dependent parameters. Zhao et al. [163] proposed the
As to be diagonal. If we denote the diagonal elements as rs = Low-Rank Plus Diagonal (LRPD) method, which reduces the
diag(As ), then the adapted hidden activations become number of speaker dependent parameters by approximating
the linear transformation matrix As ∈ Rn×n as
h = rs h + bs . (29)
As ≈ Ds + Ps Qs , (35)
There are several methods that belong to this set of adaptation
methods. LHUC [75], [112] adapts only the parameters rs : where the Ds ∈ Rn×n , Ps ∈ Rn×k
and Qs ∈ Rk×n
are treated
as speaker dependent matrices (k < n) and Ds is a diagonal
h = rs h. (30)
matrix. This approximation was motivated by the assumption
Speaker Codes [158], [159] prepend an adaptation neural net- that the adapted hidden activations should not be very differ-
work to an existing SI model in place of the input features. The ent from the unadapted hidden activations when only a limited
adaptation network – which operates somewhat similarly to amount of adaptation data is available; hence the adaptation
control networks, described below – uses the acoustic features linear transformation should be close to a diagonal matrix.
as inputs, as well as an auxiliary low-dimensional speaker In fact, for k = 0 LRPD reduces to LHUC adaptation. LRPD
code which essentially adapts speaker dependent biases within adaptation can be implemented by inserting two hidden linear
the adaptation network: layers and a skip connection as illustrated in Fig. 3(b).
Zhao et al. [164] later presented an extension to LRPD
h = h + bs . (31)
called Extended LRPD (eLRPD), which removed the depen-
The network and speaker codes are learned by back- dency of the number of speaker dependent parameters on the
propagating through the frozen SI network with transcribed hidden layer size by performing a different approximation of
training data. At test time the speaker codes are derived the linear transformation matrix As ,
by freezing all but the speaker code parameters and back-
As ≈ Ds + PTs Q, (36)
propagating on a small amount of adaptation data.
Similarly, Wang and Wang [160] proposed a method that where matrices Ds ∈ Rn×n
and Ts ∈ Rk×k
are treated as
adapts both rs and bs as parameters βs ∈ Rn and γs ∈ Rn of speaker dependent, and matrices P ∈ Rn×k and Q ∈ Rk×n are

VOLUME 2, 2021 43
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 3. Structured transforms of an adaptation matrix As : (a) Learning Hidden Unit Contributions (LHUC) adapts only diagonal elements of the
transformation matrix rs = diag(As ); (b) Low-Rank Plus Diagonal factorizes the adaptation matrix as As ≈ Ds + Ps Qs ; (c) Extended LRPD factorizes the
adaptation matrix as As ≈ Ds + PTs Q.

treated as speaker independent. Thus the number of speaker There have also been approaches, that further reduce the
dependent parameters is mostly dependent on k, which can be number of speaker dependent parameters by removing the de-
chosen arbitrarily. pendency on the hidden layer width by using control networks
Instead of factorizing the transformation matrix, a tech- that predict the speaker-dependent parameters
nique typically known as feature-space discriminative lin-
θs = c(λs ; φ), (40)
ear regression (fDLR) [135], [165], [166] imposes a block-
diagonal structure such that each input frame shares the same In contrast to the adaptation network used in the Speaker
linear transform. This is, in effect, a tied variation of LIN with Codes scheme, the control networks themselves are speaker-
a reduction in the number of speaker dependent parameters. independent, taking as input some lower dimensional speaker
Another set of approaches uses the speaker dependent pa- embedding λs ∈ Rk . As such, they form a link between
rameters as mixing coefficients θs = {α0 . . . αk } for a set of structured transforms and the embedding-based approaches
k speaker independent bases {B0 . . . Bk } which factorize the of Section V. The control networks c(λs , φ) can be imple-
transformation matrix As . Samarakoon and Sim [167], [168] mented as a single linear transformation or as a multi-layer
proposed to use factorized hidden layers (FHL) that allow neural network. These control networks are similar to the
both speaker-independent and speaker dependent modelling. conditional affine transformations referred to as Feature-wise
With this approach, activations of a hidden layer h with an Linear Modulation (FiLM) [170]. For example, Subspace
activation function σ are computed as LHUC [171] uses a control network to predict LHUC pa-
rameters rs from i-vectors λs , resulting in a 94% memory
k
h = σ (W + αi Bi )x + bs + b . (37) footprint reduction compared to standard LHUC adaptation.
i=0
Cui et al. [172] used auxiliary features to adapt both the
scale rs and offset bs . Other approaches adapted the scale rs
Note, that when αs = 0 and bs = 0, the activations correspond or the offset bs by leveraging the information extracted with
to a standard speaker independent model. If the bases Bi are summary networks instead of auxiliary features [173]–[175].
rank-1 matrices, Bi = γi ψiT , then this allows the reparameter- Finally, the number of speaker dependent parameters in
ization of (37) as [168]: all the aforementioned linear transformations can be reduced
h = σ (W + D T )x + bs + b , (38) by applying them to bottleneck layers that have much lower
dimensionality than the standard hidden layers. These bot-
where vectors γi and ψi are i-th columns of matrices and tleneck layers can be obtained directly by training a neural
, respectively, and the mixing coefficients αs correspond to network with bottleneck-layers or by applying Singular Value
the diagonal of matrix D. This approach is very similar to Decomposition (SVD) to the hidden layers [176], [177].
the factorization of hidden layers used for Cluster Adaptive
Training of DNN networks (CAT-DNN) [67] that uses full VII. REGULARIZATION METHODS
rank bases instead of rank-1 bases. Even with the small number of speaker dependent parameters
Similarly, Delcroix et al. [129] proposed to adapt the acti- required by structured transformations, speaker adaptation can
vations of a hidden layer using a mixture of experts [169]. The still overfit to the adaptation data. One way to prevent this
adapted hidden unit activations are then overfitting is through the use of regularization methods that
prevent the adapted model from diverging too far from the

k
h = αi Bi h. (39) original model. This can be achieved by using early stopping
i=0 and appropriate learning rates, which can be obtained with a

44 VOLUME 2, 2021
hyper-parameter grid-search or by meta-learning [178], [179].
Another way to prevent the adapted model from diverging too
far from the original can be achieved by limiting the distance
between the original and the adapted model. Liao [113] pro-
posed to use the L2 regularization loss of the distance between
the original speaker dependent parameters θs and the adapted
speaker dependent parameters θs
LL2 = |θs − θs |22 . (41)
Yu et al. [114] proposed to use Kullback-Leibler (KL) diver-
gence to measure the distance between the senone distribu-
tions of the adapted model and the original model
LKL = DKL ( f (x; θ ) || f (x; θs )). (42)
If we consider the overall adaptation loss using cross-entropy:
L = (1 − λ)Lxent + λLKL , (43)
FIGURE 4. Adversarial speaker adaptation.
we can show that this loss equals to cross-entropy with the
target distribution for a label y given the input frame xt
(1 − λ)P̂(y | xt ) + λ f (xt ; θ ), (44) estimates speaker dependent parameters as a mode of the
distribution
where P̂(y | xt ) is a distribution corresponding to the provided
θ̂s = arg max P(Y | X, θs )p(θs ), (46)
labels yadapt . Although initially proposed for adapting hybrid θs
models, the KLD regularization method may also be used for
where p(θs ) is a prior density of the speaker depen-
speaker adaptation of E2E models [117], [118], [180].
dent parameters. In order to obtain this prior density,
Meng et al. [181] noted that KL divergence is not a distance
Huang et al. [182] employed an empirical Bayes approach
metric between distributions because it is asymmetric, and
(following Gauvain and Lee [83]) and treated each speaker
therefore proposed to use adversarial learning which guar-
in the training data as a data point. They performed speaker
antees that the local minimum of the regularization term is
adaptation for each speaker and observed that the speaker pa-
reached only if the senone distributions of the speaker inde-
rameters across speakers resemble Gaussians. Therefore they
pendent and the speaker dependent models are identical. They
decided to parameterise the prior density p(θs ) as
achieve this by adversarially training a discriminator d (x; φ)
whose task is to discriminate between the speaker dependent p(θs ) = N (θs ; μ, ), (47)
deep features h and speaker independent deep features h that
where μ is the mean of adapted speaker dependent parameters
are obtained by passing the input adaptation frames through
across different speakers, and is the corresponding diagonal
speaker dependent and speaker independent feature extractor
covariance matrix. With this parameterisation the regulariza-
respectively. This process is illustrated in Fig. 4. The regular-
tion term of the prior density p(θs ) is
ization loss of the discriminator is
1
Ldisc = − log d (h; φ) − log 1 − d (h ; φ) , (45) LMAP = (θs − μ)T −1 (θs − μ), (48)
2
where h are hidden layer activations of the speaker indepen- which for the prior density p(θs ) = N (θs ; 0, I ) degenerates
dent model and h are hidden layer activations of the adapted to the L2 regularization loss. Huang et al. investigated their
model. The discriminator is trained in a minimax fashion dur- proposed MAP approach with LHN structured transforms, but
ing adaptation by minimizing Ldisc with respect to φ and max- noted that it may be used in combination with other schemes.
imizing Ldisc with respect to θs . Consequently, the distribution Xie et al. [183] proposed a fully Bayesian way of dealing
of activations of the i-th hidden layer of the speaker depen- with uncertainty inherent in speaker dependent parameters θs ,
dent model will be indistinguishable from the distribution of in the context of estimating the LHUC parameters rs (see
activations of the i-th hidden layer of the speaker independent Section VI). In this method, known as BLHUC, the posterior
model, which ought to result in more robust performance of distribution of the adapted model is approximated as:
speaker adaptation.
P(Y | X, Dadapt ) ≈ P(Y | X, E[rs | Dadapt ]), (49)
Other approaches aim to prevent overfitting by leveraging
the uncertainty of the speaker-dependent parameter space. Xie et al. propose to use a distribution q(rs ) as a variational
Huang et al. [182] proposed Maximum A Posteriori (MAP) approximation of the posterior distribution of the LHUC pa-
adaptation of neural networks, inspired by MAP adaptation rameters, p(rs |Dadapt ). For simplicity, they assume that both
of GMM-HMM models [83] (Section III). MAP adaptation q(rs ) and p(rs ) are normal, such that q(rs ) = N (rs ; μs , γs )

VOLUME 2, 2021 45
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

and p(rs ) = N (rs ; μ0 , γ0 ), which results in the expectation

for the speaker dependent parameters in (49) being given by:

E[rs | Dadapt ] = μs . (50)

The parameters are computed using gradient descent with a

Monte Carlo approximation. Similarly to MAP adaptation, the
effect is to force the adaptation to stay close to the speaker
independent model when we perform adaptation with a small
amount of adaptation data.

VIII. VARIANT OBJECTIVE FUNCTIONS

Another challenge in speaker adaptation is overfitting to
targets seen in the adaptation data and to errors in semi-
supervised transcriptions. This issue can be mitigated by an
appropriate choice of objective function.
Gemello et al. [156] proposed Conservative Training,
FIGURE 5. Multi-task learning speaker adaptation.
which modifies the target distribution to ensure that labels
not seen in the adaptation data will not be catastrophically
forgotten. Given a set of labels not seen in the adaptation data Huang et al. [115] presented an approach that used multi-
U and the reference label ŷt at a time-step t the adjusted target task learning [186] to leverage both senone and mono-
distribution P̂ is defined as phone/senone clusters targets. It worked by having multi-
⎧ ple output layers, each on top of the last hidden layer, that
⎪
⎨P(y | xt ) if y ∈ U

predicted the corresponding targets. These additional output
P̂(y | xt ) = 1 − y ∈U P(y |xt ) if y = ŷt (51)
⎪
⎩
layers were also trained after a complete training pass of the
0 otherwise. speaker independent model with its parameters fixed. Thus,
the adaptation loss was a weighted sum of individual losses,
To mitigate errors in semi-supervised transcriptions we can for example monophone and senone losses (Fig. 5). Swieto-
replace the transcriptions with a lattice of supervision, which janski et al. [187] combined these two approaches and used
encodes the uncertainty arising from the first pass decoding. multi-task learning for speaker adaptation through a struc-
Lattice supervision has previously been used in work on unsu- tured output layer, which predicts both monophone targets and
pervised adaptation [76] and training [77] of GMMs, as well senone targets. Unlike the approach by Price et al. [185], the
as discriminative [184] and semi-supervised training [78], and monophone predictions are used for the prediction of senones.
adaptation [79], of neural network models. For instance, lat- Li et al. [117] and Meng et al. [118] applied multi-task
tice supervision can be used with the MMI criterion where for learning to speaker adaptation of CTC and AED models.
a single utterance we have: These E2E models typically use subword units, such as word
p(X | Mnum ; θ ) piece units, as the output target in order to achieve high recog-
FMMI (θ ) = log , (52) nition accuracy. The number of subword units is usually at the
p(X | Mden ; θ )
scale of thousands or even more. Given very limited speaker-
where the Mnumr is a numerator lattice containing multiple hy- specific adaptation data, these units may not be fully covered.
potheses from a first pass decoding and Mdenr is a denominator Multi-task learning using both character and subword units
lattice containing all possible sequences of words. can significantly alleviate such sparseness issues.
Another family of methods prevents overfitting to adap-
tation targets by performing adaptation through the use of IX. DATA AUGMENTATION
a lower entropy task such as monophone or senone cluster Data augmentation has been proven to be an effective way
targets. This has the advantage that the unsupervised targets to decrease the acoustic mismatch between training and test-
might be less noisy and also that the targets have higher ing conditions. Data augmentation approaches supplement the
coverage even with small amounts of adaptation data. Price training data with distorted or synthetic variants of speech
et al. [185] proposed to append a new output layer predicting with characteristics resembling the target acoustic environ-
monophone targets on top of the original output layer pre- ment, for instance with reverberation or interfering sound
dicting senones. The layer can be either full rank or sparse – sources. Thanks to realistic room acoustic simulators [188]
leveraging knowledge of relationships between monophones one can generate large numbers of room impulse responses
and senones. Its parameters are trained on the training data and reuse clean corpora to create multiple copies of the same
with a fixed speaker independent model. Only the monophone sentence under different acoustic conditions [189]–[191].
targets are used for the adaptation of the speaker dependent Similar approaches have been proposed for increasing
parameters. robustness in speaker space by augmenting training data

46 VOLUME 2, 2021
with, typically label-preserving, speaker-related distortions approaches resemble approaches for acoustic adaptation of
or transforms. Examples include creating multiple copies of VQ codebooks (discussed in section III), in that they learn
clean utterances with perturbed VTL warp factors [192], an accent-specific transition matrix between the phonemic
[193], augmenting related properties such as volume or speak- symbols in the dictionary. Selection of utterances for accent
ing rate [11], [194], [195], or voice-conversion [196] inspired adaptation has been explored, with Nallasamy et al. [212]
transformations of speech uttered by one speaker into another proposing an active learning approach.
speaker using stochastic feature mapping [193], [197], [198]. Approaches to accent adaptation of neural network-based
While voice conversion does not create any new data with systems have typically employed accent-dependent output
respect to unseen acoustic / linguistic complexity (just replicas layers and shared hidden layers [213], [214], based on a
of the utterances with different voices, often from the same similar approach to the multilingual training of deep neural
dataset), recent advances in text-to-speech (TTS) allows the networks [215]–[217]. Huang et al. [213] combined this with
rapid building of new multi-speaker TTS voices [199] from KL regularization (Section VII), and Chen et al. [214] used
small amounts of data. TTS may then be used to arbitrarily ex- accent-dependent i-vectors (Section V); Yi et al. [218] used
pand the adaptation set for a given speaker, possibly to cover accent-dependent bottleneck features in place of i-vectors;
unseen acoustic domains [116], [120]. If TTS is coupled with and Turan et al. [219] used x-vector accent embeddings in a
a related natural language generation module, it is possible semi-supervised setting.
to generate speech for domain-related texts. In this way, the Multi-task learning approaches, where the secondary task is
speaker adaptation uses more data, not only from the speaker’s accent/dialect identification has been explored by a number of
original speech but also from the TTS speech. Because the researchers [220]–[224] in the context of both hybrid and end-
transcription used for TTS generation is also used for model to-end models. Improvements with multi-task training were
adaptation, this approach also circumvents the obstacle of the observed in some instances, but the evidence indicates that it
hypothesis error in unsupervised adaptation. Moreover, TTS gives a small adaptation gain. Sun et al. [225] replaced multi-
generated data can also help to adapt E2E models to a new task learning with domain adversarial learning (Section VIII),
domain which has more discrepancy in contents from the in which the objective function treated accent identification as
source domain, which will be discussed in Section XII. an adversarial task, finding that this improved accented speech
Finally, for unbalanced data sets the acoustic models may recognition over multi-task learning.
under-perform for certain demographics that are not suffi- More successfully, Li et al. [226] explored learning multi-
ciently represented in training data. There is an ongoing effort dialect sequence-to-sequence models using one-hot dialect
to address this using generative adversarial networks (GANs). information as input. Grace et al. [227] also used one-hot
For example, Hosseini-Asl et al. [200] used GANs with a dialect codes and also explored a family of cluster adaptive
cycle-consistency constraint [201] to balance the speaker ra- training and hidden layer factorization approaches. In both
tios with respect to gender representation in training set. cases using one-hot dialect codes as an input augmentation
(corresponding to bias adaptation) proved to be the best ap-
X. ACCENT ADAPTATION proach, and cluster-adaptive approaches did not result in a
Although there is significant literature on automatic dialect consistent gain. These approaches were extended by Yoo et
identification from speech (e.g. [202]), there has been less al. [228] and Viglino et al. [224] who both explored the use of
work on accent and dialect adaptive speech recognition sys- dialect embeddings for multi-accent end-to-end speech recog-
tems. The MGB–3 [203] and MGB–5 [204] evaluation chal- nition. Ghorbani et al. [229] used accent-specific teacher-
lenges have used dialectal Arabic test sets, with a modern student learning, and Jain et al. [230] explored a mixture of
standard Arabic (MSA) training set, using broadcast and in- experts (MoE) approach, using mixtures of experts both at the
ternet video data. The best results reported on these challenges phonetic and accent levels.
have used a straightforward model-based transfer learning Yoo et al. [228] also applied a method of feature-wise affine
approach in an lattice-free maximum mutual information (LF- transformations on the hidden layers (FiLM), that are depen-
MMI) framework [205], adapting MSA trained baseline sys- dent both on the network’s internal state and the dialect/accent
tems to specific Arabic dialects [206], [207]. code (discussed in Section VI). This approach, which can
Much of the reported work on accent adaptation has taken be viewed as a conditioned normalization, differs from the
approaches for speaker adaptation, and applied them using previous use of one-hot dialect codes and multi-task learning
an adaptation set of utterances from the target accent. For in that it has the goal of learning a single normalized model
instance, Vergyri et al. [208] used MAP adaptation of a rather than an implicit combination of specialist models. A
GMM/HMM system. Zheng et al. [209] used both MAP and related approach is gated accent adaptation [231], although
MLLR adaptation, together with features selected to be dis- this focused on a single transformation conditioned on accent.
criminative towards accent, with the accent adaptation con- Winata et al. [232] experimented with a meta-learning ap-
trolled using hard decisions made by an accent classifier. proach for few-shot adaptation to accented speech, where
Earlier work on accent adaptation focused on automatic the meta-learning algorithm learns a good initialization and
adaptation of the pronunciation dictionary [210], [211]. These hyperparameters for the adaptation.

VOLUME 2, 2021 47
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

XI. DOMAIN ADAPTATION sequence-level loss function as the speech signal is a sequence
The performance of automatic speech recognition (ASR) al- signal [247], [248].
ways drops significantly when the recognition model is eval- The biggest advantage of T/S learning is that it can leverage
uated in a mismatched new domain. Domain adaptation is large amounts of unlabeled data by using soft labels PT (yt =
the technology used to adapt the well-trained source domain y|xt ). This is particularly useful in industrial setups where
model to the new domain. The most straightforward way is effectively unlimited unlabeled data is available [241], [242].
to collect and label data in the new domain to fine-tune the Furthermore, soft labels produced by the teacher network
model. Most adaptation technologies discussed in this paper carry knowledge learned by the teacher on the difficulty of
can also be applied to domain adaptation [154], [233]–[236]. classifying each sample, while the hard labels do not contain
When the amount of adaptation data is limited, a common such information. Such knowledge helps the student to gener-
practice is adapting only partial layers of the network [237]. alize better, especially when adaptation data size is small.
To let the adapted model still perform well on the source E2E models tend to memorize the training data well, and
domain, Moriya et al. [238] proposed progressive neural net- therefore may not generalize well to a new domain. Meng et
works by adding an additional model column to the original al. [249] proposed T/S learning for the domain adaptation of
model for each new domain and only update the new model E2E models. The loss function is
column with the new domain data. In the following, we focus

L
N
on technologies more specific to domain adaptation. − PT (y | Y1:u−1 , X ) log PS (y | Y1:u−1 , X̂ ), (55)
l=1 y=1
A. TEACHER-STUDENT LEARNING
where X and X̂ are the source and target domain speech
While conventional adaptation techniques require large
sequence, Y is the label sequence of length L which is either
amounts of labeled data in the target domain, the teacher-
the ground truth in the supervised adaptation setup or the
student (T/S) paradigm [239], [240] can better take advantage
hypothesis generated by the decoding of the teacher model
of large amounts of unlabeled data and has been widely used
with X in the unsupervised adaptation setup. Note that in the
for industrial scale tasks [241], [242].
unsupervised case, there are two levels of knowledge transfer:
The most popular T/S learning strategy was proposed in
the teacher’s token posteriors (used as soft labels) and one-
2014 by Li et al. [239] to minimize the KL divergence be-
best predictions as decoder guidance.
tween the output posterior distributions of the teacher network
One constraint to T/S adaptation is that it requires paired
and the student network. This can also be considered as learn-
source and target domain data. While the paired data can be
ing soft targets generated by a teacher model instead of 1-hot
obtained with simulation in most cases, there are scenarios
hard targets
in which it is hard to simulate the target domain data from

T
N the source domain data. For example, simulation of children’s
− PT (y | xt ) log PS (y | xt ), (53) speech or accented speech remains challenging. In [134], a
t=1 y=1 neural label embedding scheme was proposed for domain
adaptation with unpaired data. A label embedding, l-vector,
where PT and PS are posteriors of teacher and student net- represents the output distribution of the deep network trained
works, xt and yt are the input speech and output senone at in the source domain for each output token, e.g. , senone.
time t, respectively. T is the number of speech frames in an To adapt the deep network model to the target domain, the
utterance, and N is the number of senones in the network l-vectors learned from the source domain are used as the soft
output layer. targets in the cross entropy criterion.
Later, Hinton et al. [240] proposed knowledge distillation
by introducing a temperature parameter (like chemical distil- B. ADVERSARIAL LEARNING
lation) to scale the posteriors. This has been applied to speech It is usually hard to obtain the transcription in the target do-
by e.g. Asami et al. [243]. There are also variations such main, therefore unsupervised adaptation is critical. Although
as learning the interpolation of soft and hard targets [240] the transcription can be generated by decoding the target
and conditional T/S learning [244]. Although initially pro- domain data using the source domain model, the generated
posed for model compression, T/S learning is also widely hypothesis quality is often poor given the domain mismatch.
used for model adaptation if source and target signals are Recently, adversarial training was applied to the area of un-
frame-synchronized, which can be realized by simulation. The supervised domain adaptation in a form of multi-task learn-
loss function is [245], [246] ing [250] without the need for transcription in the target do-
main. Unsupervised adaptation is achieved by learning deep

T
N
− PT (y | xt ) log PS (y | x̂t ), (54) intermediate representations that are both discriminative for
t=1 y=1 the main task on the source domain and invariant with respect
to mismatch between source and target domains. Domain
where xt is the source speech signal while x̂t is the frame- invariance is achieved by adversarial training of the domain
synchronized target signal. It can be further improved with classification objective functions using a gradient reversal

48 VOLUME 2, 2021
layer (GRL) [250]. This GRL approach has been applied to Neural network language modelling [271] has become
acoustic models for unsupervised adaptation in [251]–[253]. state-of-the-art, in particular recurrent neural network lan-
Meng et al. [254] further combine adversarial learning and guage models (RNNLMs) [272]. There has been a range of
T/S learning as adversarial T/S learning to improve the ro- work on adaptation of RNNLMs, including the use of topic
bustness against condition variability during adaptation. or genre information as auxiliary features [273], [274] or
There is also increasing interest in the use of GANs with combined as marginal distributions [275], domain specific
cycle consistency constraints for domain adaptation [255]– embeddings [276], and the use of curriculum learning and
[257]. This enables the use of non-parallel data without la- fine-tuning to take account of shifting contexts [277], [278].
bels in the target domain by learning to map the acoustic Approaches based on acoustic model adaptation, such as
features into the style of the target domain for training. The LHUC [278] and LHN [274], have also been explored.
cycle-consistency constraint also provides the possibility of There have a been a number of approaches to apply the
mapping features from the target to the source style for, in ideas of cache language model adaptation to neural network
effect, test-time adaptation or speech enhancement. language models [275], [279], [280], along with so-called
Unsupervised domain adaptation is more attractive than dynamic evaluation approaches in which the recent context
the supervised one because there is usually large amount of is used for fine tuning [275], [281].
unlabeled data in the new domain while transcribing the new E2E models are trained with paired speech and text data.
domain data usually is time consuming with large cost. T/S The amount of text data in such a paired setup is much
learning and adversarial learning both can utilize unlabeled smaller than the amount of text data used in training a separate
data well. Specifically, T/S learning has been very successful external LM. Therefore, it is popular to adjust E2E models
in industry-scale tasks. In contrast, adversarial learning was by fusing the external LM trained with a large amount of
reported successful in relatively smaller tasks. Therefore, T/S text data. The simplest and most popular approach is shal-
learning is more promising if the parallel data is available. low fusion [282]–[285], in which the external LM is inter-
However, if there is no prior knowledge about the new do- polated log-linearly with the E2E model at inference time
main, adversarial learning can be a good choice. There are only.
also other works on unsupervised domain adaptation. For ex- However, shallow fusion does not have a clear probabilis-
ample, Hsu et al. [70] use a variational autoencoder instead tic interpretation. McDermott et al. [286] proposed a density
of adversarial learning to obtain a latent representation robust ratio approach based on Bayes’ rule. An LM is built on text
to domains. However, similar to adversarial learning, such transcripts from the training set which has paired speech and
method is pending examination when large amount of unla- text data, and a second LM is built on the target domain. When
beled training data is available. decoding on the target domain, the output of the E2E model is
modified by the ratio of target/training LMs. While it is well
grounded with Bayes’ rule, the density ratio method requires
XII. LANGUAGE MODEL ADAPTATION the training of two separate LMs, from the training and target
LM adaptation typically involves updating an LM estimated data respectively. Variani et al. [287] proposed a hybrid au-
from a large general corpus, with data from a target do- toregressive transducer (HAT) model to improve the RNN-T
main. Many approaches to LM adaptation were developed model. The HAT model builds a training set LM internally
in the context of n-gram models, and are reviewed by Bel- and the label distribution is derived by normalizing the score
legarda [258]. Hybrid NN/HMM speech recognition systems functions across all labels excluding blank. Therefore, it is
still make use of n-gram language models and a finite state mathematically justified to integrate the HAT model with an
structure, at least in the first pass; it is difficult to use neu- external or target LM using the density ratio formulation.
ral network LMs (with infinite context) directly in first pass In [126], [127], RNN-T models were adapted to a new
decoding in such systems. Neural network LMs are typically domain with the TTS data generated from the domain-specific
used to rescore lattices in hybrid systems, or may be combined text. Because the prediction network in RNN-T works simi-
(in a variety of ways) in end-to-end systems. larly to a LM, adapting it without updating the acoustic en-
The main techniques for n-gram language model adapta- coder is shown to be more effective than interpolating the
tion include interpolation of multiple language models [259]– RNN-T model with an external LM trained from the domain-
[261], updating the model using a cache of recently observed specific text [127].
(decoded) text [259], [262]–[264], or merging or interpolat-
ing n-gram counts from decoded transcripts [265]. There is
also a large body of work incorporating longer scale context, XIII. META ANALYSIS
for instance modelling the topic and style of the recorded In this section we present an aggregated review of pub-
speech [266]–[269]. LM adaptation approaches making use lished results in experiments applying adaptation algorithms
of wider context have often built on approaches using unigram to speech recognition. This differs from typical experimental
statistics or bag-of-words models, and a number of approaches reporting that focuses on one-to-one system comparisons typ-
for combination with n-gram models have been proposed, for ically using a small fixed set of systems and benchmark tasks
example dynamic marginals [270]. and data. The proposed meta-analysis approach offers insights

VOLUME 2, 2021 49
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

TABLE 1. Adaptation Studies Used in the Meta-Analysis Categorized on the Level They Operate At, and System Architecture

into the performance of adaptation algorithms that are difficult The analysis spans 38 datasets (more than 50 unique
to capture from individual experiments. {train, test} pairings), 28 of which are public and 10
We divide this section into four main parts. The first, Sec- are proprietary. These cover different speaking styles,
tion XIII-A, explains the protocol and overall assumptions of domains, acoustic conditions, applications and languages
the meta-analysis, followed by a top-level summary of find- (though the study is strongly biased towards English re-
ings in Section XIII-B, with a more detailed analysis in Sec- sources). The public corpora used include the following:
tion XIII-C. The final part, Section XIII-D, aims to quantify AISHELL2 [298], AMI [299], APASCI [300], Aurora4 [301],
the adaptation performance across languages, speaking styles CASIA [302], ChildIt [303], Chime4 [304], CSJ [305],
and datasets. ETAPE [306], HKUST [307], MGB [308], RASC863 [309],
SWBD [310], TED [311], TED-LIUM [312], TED-
LIUM2 [313], TIMIT [314], WSJ [315], PF-STAR [316],
Librispeech [317], Intel Accented Mandarin Speech Recogni-
A. PROTOCOL AND LITERATURE
tion Corpus [214], UTCRSS-4EnglishAccent [295]. To save
The meta-analysis is based on 47 peer-reviewed studies se- space we do not provide detailed corpora statistics in this pa-
lected such that they cover a wide range of systems, archi- per, but make them available via a corresponding repository1
tectures, and adaptation tasks. Each study was required to alongside raw data and scripts used to perform the analysis.
compare adaptation results versus a baseline, enabling the Overall, the meta-analysis is based on ASR systems trained on
configurations of interest to be compared quantitatively. There datasets with a combined duration of over 30000 hours, while
was no fixed target for the total number of papers included, the baseline acoustic models were estimated from as little as
due to our aim to cover as many different methods as possible. 5 hours to around 10000 hours of speech. Adaptation data
Note that the meta-analysis spans several model architectures, varies from a few seconds per speaker to over 25000 hours
languages, and domains; although most studies use word error of acoustic material used for domain adaptation.
rate (WER) as the evaluation metric, some studies used char-
acter error rate (CER) or phone error rate (PER). Since we are
B. OVERALL FINDINGS
interested in the relative improvement brought by adaptation,
Fig. 6 (Top) presents the average adaptation gains for all con-
we report Relative Error Rate Reductions (RERR).
sidered systems, adaptation methods, and adaptation classes.
The meta-analysis is based on the studies shown in Table 1,
The overall RERR is 9.72%.2 Since grouping data across
with additional splits into level of operation and top-level sys-
attributes of interest may result in unbalanced (or very sparse)
tem architecture. The positions were selected such that they
sample sizes, we also report additional statistics such as the
cover most of the topics mentioned in the review. For an adap-
number of samples, datasets and studies the given statistic is
tation of end-to-end systems we included all peer-reviewed
based on. As can be seen in the right part of Fig. 6 (Top), the
works we could find (their number is relatively limited). For
results in this review were derived from 356 samples produced
the hybrid approach, the studies are shortlisted such that they
using 38 datasets reported in 47 studies. A single sample is
enable the quantification of the gains for the categories out-
defined as a 1:1 system comparison for which one can un-
lined in the preceding theoretical sections. As a generic rule,
ambiguously state the RERR. Likewise, a dataset refers to a
when choosing papers for the analysis we first included works
particular training corpus configuration. Note that there may
that introduced a specific adaptation method in the context of
be some data-level overlap between different corpora origi-
neural models, or that offered some additional experiments
nating from the same source (e.g. TED talks) and we make a
allowing the comparison of different areas of interest - such
distinction for the acoustic condition (e.g. AMI close-talking
as the impact of objective functions, the complementarity of
and distant channels are counted as two different datasets
adaptation transforms or that show behavior under different
operating regimes. In the case of certain more commonly-used
techniques, due to the laborious nature of the analysis, it was
1 [Online]. Available: https://github.com/pswietojanski/ojsp_adaptation_
not always possible to include an exhaustive set of somewhat
review_2020.
similar papers. In this situation, the papers selected were those 2 We do not report exact numbers in tabular form due to space limitations,
with higher citation counts. but they are available in the GitHub repository

50 VOLUME 2, 2021
FIGURE 6. Aggregated summary of adaptation RERR from all studies (top),
considering single method only (middle) and two or more methods
stacked (bottom). The top graph is annotated to explain the information
presented in each of the boxplot graphs in this section.

when they are used to estimate separate acoustic models). A

FIGURE 7. Comparison of feature, embedding, and model-level adaptation
study refers to a single peer-reviewed publication. approaches. Speaker (middle) and domain (bottom) adaptations are based
Depending on which property we want to measure the on {utterance, speaker} and {accent, child, domain, disordered} clusters,
analysis set can be split into smaller subsets, as the ones respectively.
shown in the lower part of Fig. 6. The majority of analyses
in this review are reported for models adapted using a single
method with some additional groupings used to better capture
further details such as complementarity of adaptation methods
or their performance in different operating regimes.
As mentioned in Section IV, adaptation methods were
historically categorized based on the level they operated
at in the speech processing pipeline. Fig. 7 (top) quanti-
fies the ASR performance along this attribute, showing that
model-based adaptation obtains the best average improve-
ments of 11.8%, followed by embedding and feature levels
at 7.2% and 5.0% RERR, respectively. This is not surprising, FIGURE 8. Adaptation results for different adaptation clusters.
as model level adaptation allows large amounts of adaptation
data to be leveraged by allowing the update of large portions
of the model (including re-training the whole model). In more was used to augment the adaptation set to better estimate
data-constrained regimes, such as utterance or speaker-level additional adaptation transforms while VTLP and SFM were
adaptation, where only a limited amount of adaptation data used to directly expand the training data, and were found par-
is typically available, differences are less pronounced and ticularly effective for low-resource training conditions. Data
model-based speaker adaptation obtains 8.9% RERR while augmentations are beyond the scope of this meta-analysis and
adapting to domains gives 15.5% RERR (cf. middle and bot- will not be further investigated in this review.
tom plots in Fig. 7). Embedding approaches stay at a sim- The results for different adaptation clusters, introduced in
ilar level for speaker adaptation, improving to 9.2% RERR Section II, are shown in Fig. 8. Models benefit more when
for domain adaptation (although based on only two studies). adapting to accent, from adult to child speech, to the domain,
Feature-space domain adaptation was used in only one study, and to disordered speech conditions (such as arising from
which reported a small deterioration of −0.3% RERR. speech motor disorders), as opposed to speaker or utterance
Fig. 7 (middle) additionally shows results for speaker- adaptation. This is expected, since domain adaptation usu-
oriented data augmentation as described in Section IX. These ally has more adaptation data, and the acoustic mismatch
were found to increase accuracy by 4.6% RERR on av- introduced by unseen domains is greater than the mismatch
erage, or by 3.3%, 3.6% and 8.2% RERR for VTL per- caused by unseen speakers – unless these are substantially
turbations (VTLP) [192], [193], stochastic feature mapping mismatched to the training data as is often the case for child
(SFM) [193], [197] and when using synthetically generated or disordered speech recognition. But in the latter case the
TTS utterances [116], respectively. Note that the TTS method adaptation is typically not carried out at the speaker level, but

VOLUME 2, 2021 51
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 10. Comparison of adaptation results for FF and RNN

architectures.
FIGURE 9. Comparison of adaptation results for hybrid and E2E systems.

TABLE 2. Amounts of Data Used to Estimate Hybrid and E2E Models for
Speaker and Domain Adaptation Clusters

FIGURE 11. Comparison of adaptation results for FF and RNN

architectures split by hybrid and E2E systems.

at the domain level (i.e. tailoring the acoustic model to better

handle dysarthric speech, not a single dysarthric speaker).
Fig. 9 aggregates the adaptation along the two main neu-
ral network-based ASR approaches - hybrid and E2E. It is
interesting to observe that E2E systems gain more from adap-
tation (12.8% RERR) than hybrid systems (9.2% RERR) in
both the overall and speaker-based regimes. This is somewhat FIGURE 12. Comparison of adaptation results for supervision modes.

expected, as hybrid systems benefit from strong inductive

biases – such as access to pronunciation dictionaries and hand
engineered modeling constraints – whereas E2E models must Hybrid models can leverage either FF or RNN architectures
learn these from data. Given limited amounts of training data while most E2E systems use some form of RNN. (Note,
one may expect that E2E may struggle to learn these as well transformer-based E2E models [30] are built from FF (CNN)
as hybrid models, as such adaptation may bring greater gains. modules, however, due to their relative novelty in ASR there
This reverses for domain adaptation, with E2E and hybrid is only one accent adaptation study included in our meta-
improving by 12.2 and 14.9% RERR, respectively. Note that analysis [232]). Fig. 10 reports similar adaptation gains of
for domain adaptation, the hybrid approach was studied more 9.8% RERR for both FF and RNN architectures. RNNs seem
often for child and disordered speech applications, which to benefit more when adapting to speakers (9.2% vs 7.4%
makes adaptation gains bigger (see also Fig. 8). Table 2 further RERR for RNN and FF, respectively), and less when adapting
reports average amounts of training data used to estimate to domain (10.4% vs 17.0% RERR for RNN and FF, respec-
hybrid and E2E models. It is interesting to notice that E2E tively). When controlling for the system paradigm (E2E vs.
systems on average leverage twice as much acoustic material Hybrid), RNNs mostly benefit through adapting E2E models
when compared to hybrid setups but still seem to substantially (cf. Fig. 11 6.6% vs 15.7% RERR for Hybrid (RNN) and E2E
benefit from adaptation. These results suggest that adaptation (RNN), respectively). We observed a similar trend for speaker
for E2E is a promising direction for future investigations, that and domain clusters separately (figure not shown).
remains under-investigated as of now based on the relatively Fig. 12 compares the RERR for unsupervised and super-
few works published to date. vised modes of adaptation. Overall, deriving the adaptation
Next we compare feed-forward (FF) and recurrent neural transform with manually annotated targets results in an av-
network (RNN) architectures in both hybrid and E2E models. erage 12.8% RERR, whereas unsupervised methods result

52 VOLUME 2, 2021
FIGURE 15. Comparison of adaptation results for acoustic models
estimated from different amounts of training data.

FIGURE 13. Comparison of adaptation results for different adaptation

targets: online adaptation, supervised and unsupervised enrollment, and
two-pass decoding.
is more powerful and gives better results than embedding or
feature-based approaches; and iii) adaptation is particularly
effective in scenarios with a large mismatch and where ob-
taining matched training data is difficult.
In Fig. 15 we further report adaptability of acoustic models
estimated from different amounts of training material. Inter-
estingly, models trained on small amounts of data (up to 50
hours) benefit from adaptation to a similar degree as models
estimated from several thousands of hours. This is somewhat
an unexpected result - if test sets are kept fixed, increasing the
training material typically results in a less mismatched model,
thus lowering gains from adaptation (and most experiments
evaluating adaptation performance as a function of data are
carried out in this way). However, when training from more
FIGURE 14. Comparison of adaptation results for different amount of
data one should proportionally increase the complexity of the
adaptation data.
testing conditions. We hypothesize that this is what implicitly
occurs across different datasets in the meta-analysis - some-
one who has access to a large training set may also sample a
in 8% RERR. Fig. 12 shows results specifically for semi-
more diverse testing set. Note that the acoustic models in this
supervised adaptation, which are captured by the 2pass and
work were trained from relatively limited amounts of data (up
enrol (Unsup.) conditions. Fig. 13 also shows further analysis
to 10 k hours), and adaptation protocols between studies may
on the modes of deriving adaptation statistics (Section II).
not be exactly comparable. However, this does not change the
Both online and two-pass adaptation are unsupervised, while
conclusion that some form of adaptation is beneficial for most
the enrollment mode may be either supervised or unsuper-
considered systems, regardless of how many hours of acoustic
vised. The supervised approach offers most accurate adapta-
data was used to train it.
tion, as expected. Unsupervised enrollment outperforms the
Since this meta-analysis combines results across many dif-
other two unsupervised methods mainly due to the T/S do-
ferent studies with many reference systems, the results should
main adaptation study [249] (Section XI) that leverages large
not necessarily be compared at the sample level, but rather
amounts of data. When considering speaker adaptation only,
in an aggregated form to outline dominant trends and typical
the two-pass approach obtains 8.2% RERR and is more effec-
data regimes that each category was tried in. Data amounts
tive than enrol (Unsup.) (7.3% RERR) and online adaptation
for some systems for the purpose of plotting were assumed
(6.5% RERR).
approximately to be at a given level: e.g. two-pass systems un-
Finally, we consider the overall trends for the considered
less shown otherwise assumed 10 minutes per speaker, while
systems and their operating regions. Fig. 14 reports results
embedding approaches 30 seconds.
obtained with different amounts of adaptation data . Fig. 16
further shows regression trends when splitting by adaptation
type, hybrid or E2E, and adaptation clusters. These are in line C. DETAILED FINDINGS
with the observations so far: i) more adaptation data brings In this subsection we investigate the effect of the specific
(on average) larger improvements; ii) model-based adaptation approach to adaptation, beyond the broad categories discussed

VOLUME 2, 2021 53
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 16. Regression analysis for the three major control variables.

As shown in Fig. 17 GMMEmb, NNEmb and NNTrans-

formEmb obtain 8.1%, 5.2% and 9.2% average RERR,
respectively.
The second group in Fig. 17 comprises model-based
approaches split into Linear Transform (LT), Activation,
and Finetuning–based methods. LT methods introduce new
speaker dependent affine transformations in the model, either
in the form of new LIN/LHN/LON layers (i.e. [135], [155],
[157], [289]) or transforms estimated using a GMM system
such as fMLLR [56], [112], [135], [319]. Finetune refers to
FIGURE 17. Comparison of results for different adaptation approaches. approaches which assume that the adaptation is carried out
by altering a subset of the existing model parameters. This is
often done in a similar manner to an LT approach by adapt-
ing an input, output and/or one or more hidden layers that
are already present in the model [113], [141], [213], [214].
above. Fig. 17 reorganizes the earlier split into feature, embed- Finally, activation methods perform adaptation by introduc-
ding, and model-level adaptation (Fig. 7) into embedding (cf. ing speaker-dependent parameters in the activation functions
Section V) and model-based transformations (cf. Section VI). of the neural network [112], [162], [320]–[322]. Note that,
For the embeddings, we introduce three sub-categories as outlined in Section VI, some of activation-based methods
referred to as GMMEmb, NNEmb and NNTransformEmb. can be expressed as constrained LT methods. The results ob-
GMMEmb comprises GMM-related embedding extractors tained by LT, Activation and Finetune–based methods score
primarily based on i-vectors [56], [57], [110] but also in- 6.7%, 9.0% and 13.9% average RERR, respectively. Fig. 18(a)
clude adaptation results for other GMM-derived (GMMD) shows the regression trends for amounts of adaptation data for
features [138]. NNEmb are neural network-based embed- each of the six considered categories.
ding extractors that estimate speaker/utterance statistics The use of embeddings implies that the acoustic model is
from speaker-independent acoustic features. Examples of trained in a speaker adaptive manner, whereas the majority of
NNEmb approaches include -vector techniques, such as d- model-based techniques are carried out in a test-only manner
vectors [149] and x-vectors [111], discussed in Section V, – meaning that speaker-level information is not used during
sentence-level embeddings [59], [128], and other bottle- training – though some methods offer SAT variants [167],
neck approaches [130], [132]. NNTransformEmb are trans- [323]. Fig. 19 shows that SAT trained systems offer a small
formed embeddings which typically rely on i-vectors as in- advantage (8% vs. 7.6% RERR) when adapted with limited
put instead of acoustic features. These have been proposed amounts of data (up to around 10 minutes). When looking
to help alleviate issues related to inconsistent DNN adap- at the average performance across all data-points, however,
tation performance when using raw i-vectors [57], [58], test-only approaches obtain 10.8% RERR, primarily because
[318]. The NNTransformEmb group includes studies do- of greater adaptation gains for larger amounts of data. See also
ing standard i-vector transformations with NNEmb [74], Fig. 18(b) for operating regions of SAT and non-SAT systems.
[147], [218] but also more recent memory-based approaches Fig. 20 quantifies gains for different adaptation objectives
in which an embedding is selected via attention from and regularization approaches – results for the online condi-
a fixed training stage embedding inventory [61], [62]. tion are given only for reference, as in this case adaptation

54 VOLUME 2, 2021
FIGURE 18. Regression analysis for adaptation families, speaker-adaptive training and adaptation losses.

FIGURE 19. Comparison of adaptation results for SAT vs Test-only modes.

FIGURE 21. Comparison of adaptation results for acoustic models trained

with CE and Sequence-level objectives.

model’s predictions such that they do not deviate too much

from the speaker independent variant by KL regularization
(CE-KL) [114]. KL regularization can be applied to either CE
or sequential objective functions [154], although most models
estimated in a sequential discriminative manner can success-
fully be adapted with a CE (or CE-KL) criterion [75], [168],
[195], [297] (see also Fig. 21). Teacher-student (T/S) [239]
FIGURE 20. Comparison of results for different adaptation loss functions. is a special case (see Section XI) where the adaptation is
carried with the targets directly produced by a teacher model,
rather than the ones obtained from first pass decodes (possibly
information is obtained via an embedding extractor (which is KL-regularized with the SI model). T/S allows the use of
usually not updated, although not always [218]). The second large amounts of unsupervised data and in this analysis was
group depicts approaches where the adaptation information is found to offer an average 28.2% RERR when adapting to
derived by adapting a GMM in model-space using an MLE domains [229], [241], [249].
or MAP criterion when extracting speaker-adapted auxiliary The final group in Fig. 20 includes objectives that try to
features for NN training [138], [324] or by estimating fMLLR leverage auxiliary information at the objective function level.
transforms with MLE under a GMM to obtain speaker adapted Meta-learning [178], [179], [232] estimates the adaptation
acoustic features [56], [135], [319]. hyper-parameters jointly with the adaptation transform while
The third group comprises methods which aim to explicitly multi-task learning [115], [118], [187], [295] leverages addi-
match the model’s output distribution to the one found in the tional phonetic priors to circumvent the (potential) sparsity
adaptation data. CE is a non-regularized frame-level cross- of senones when adapting with small amounts of data. Meta-
entropy baseline obtaining 8.7% average RERR. This can be learning and multi-task adaptation obtain 6.8% and 7.6% av-
improved to 14.8% average RERR by penalizing the adapted erage RERR, respectively. See also Fig. 18(c).

VOLUME 2, 2021 55
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

FIGURE 22. Comparison of adaptation results for different architectures.

Fig. 21 further summarizes the adaptability of acoustic

models trained in a frame-based (CE) or a sequential (Seq)
manner. The results indicate that sequential models benefit
more from adaptation when compared to frame-based systems
(11.6% vs. 9.8% average RERR). However, when controlling
for the same dataset and baseline (reference systems were ex-
pected to exist for both CE and Seq) the difference decreases FIGURE 23. Complementarity of selected adaptation techniques.
to around 0.6% RERR in favor of the frame-based systems.
Fig. 22 compares the adaptation gains obtained using var-
ious model architectures. LSTM benefits the most (15.4% Fig. 25 compares gains as obtained for different speech
average RERR). The feed-forward TDNN, DNN, and ResNet styles. At the top we report three special cases spanning disor-
architectures all improve by around 10.5% RERR. Smaller dered, children’s, and accented speech (these are similar to the
gains were observed for Transformer, CNN and BLSTM, im- adaptation clusters from Fig. 8). As expected, acoustic models
proving by 7.6, 6.5 and 4.9% average RERR, respectively. estimated largely from adult speech of healthy individuals per-
This result is somewhat expected as the last three architectures form poorly in these highly mismatched domains, especially
either normalize some of the variability by design, or have for disordered and children’s speech, and domain adaptation
access to a larger speech context during recognition. improves ASR by over 50% average RERR.
In Fig. 23 we study the complementarity of the different Performance gains from adapting models with accented
adaptation techniques. These results are based on 22 samples speech are similar to that obtained on other speech tasks. Note
and 6 studies for which there were a complete set of baseline that the presence of non-native speakers in (English) training
experiments allowing improvements to be quantified when corpora is fairly common, so the underlying acoustic models
adapting an SI model with Method1, and then measuring may learn to better normalize this variability at the training
further gains when adding Method2. Fig. 23 shows that, on av- stage. Interestingly, adaptation brings relatively larger gains
erage, stacking adaptation techniques improved the adaptation in commercial applications such as VoiceSearch and Dictation
performance by an additional 4%, from 8% to 12% RERR. tasks (14% RERR on average). This is also visible in Fig. 26
Finally, in Fig. 24 we report results for all techniques comparing performance on public and proprietary data. We
included in the meta-analysis. These are based on samples hypothesize that commercial data is more likely to contain a
where only a single method was used to adapt the acoustic mix of speech from a diverse set of speakers (including non-
model (cf. Fig. 6 (middle)), spanning results for all adaptation native and child speech) and thus benefits more from adapta-
clusters (cf. Fig. 8). These should not be directly compared tion. Another explanation could be that the public benchmarks
owing to differences in operating regions, but they offer an have been around for some time, and systems built on these
indication of the performance of the individual methods. are likely to be more over-fitted in general.
Finally, Fig. 27 summarizes the adaptation performance
for several languages. Note that speaker adaptation was per-
D. SPEECH STYLES, APPLICATIONS, LANGUAGES formed on English, French, Japanese, and Mandarin while for
In this subsection, we analyze the efficacy of adaptation meth- Korean and Italian we only report adaptation gains for disor-
ods across acoustic and linguistic dimensions by reporting dered and children’s speech recognition. The overall improve-
adaptation gains for different types of speech styles, applica- ments for non-English languages when adapting to speakers
tions (including ones with a large mismatch to the training are similar to gains obtained for English when controlling for
conditions), and languages. the adaptation method (i.e. improvements are between 6 and

56 VOLUME 2, 2021
FIGURE 24. Comparison of adaptation results for the standalone techniques.

FIGURE 26. Performance of adaptation techniques as obtained on public

and proprietary datasets.

XIV. SUMMARY AND DISCUSSION

FIGURE 25. Comparison of adaptation results for different speech styles. The rapid developments in speech recognition over the past
decade have been driven by deep neural network models of
acoustics, deployed in both hybrid and E2E systems. Com-
10% average RERR), giving some evidence that adaptation pared to the previous state-of-the-art approaches based on
helps to a similar degree for different languages, and that some GMMs, neural network-based systems have less constrained
of these primarily English-based findings generalize across and more flexible models and are open to a richer set of
languages. adaptation algorithms, compared to previous approaches

VOLUME 2, 2021 57
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

of limited memory and computation power [119]. Another

interesting direction for the adaptation of E2E models is how
to leverage unpaired data especially text only data in a new
domain. In [127], several methods have been explored in this
direction, but we are expecting more innovations there.
Adaptation algorithms are often deployed for conditions in
which there is very limited labeled data, or none at all. In this
case unsupervised and semi-supervised learning approaches
are central, and indeed many current adaptation approaches
strongly leverage such algorithms. However there are signifi-
FIGURE 27. Adaptation gains for different languages. cant open research challenges in this area, particularly relating
to unsupervised and semi-supervised training of E2E sys-
tems, using methods which are able to propagate uncertainty.
based on linear transforms of the model parameters and acous- Current approaches often do this indirectly (e.g. through T/S
tic features. training), but more direct modeling of uncertainty would be
In this overview article we have surveyed approaches to the desirable.
adaptation of neural network-based speech recognition sys- Domain adaptation has become central to work in computer
tems. We structured the field into embedding-based, model- vision and image processing, as discussed in Section I, with
based, and data augmentation adaptation approaches, arguing large scale base models (typically trained on ImageNet) being
that this organization gives a more coherent understanding adapted to specific tasks. The closest analogies to this in
of the field compared with the usual split into feature-based speech recognition are some of the domain recognition ap-
and model-based approaches. We presented these adaptation proaches discussed in Section XI and for multilingual speech
algorithms in the context of speaker adaptation, with a discus- recognition. The idea of shared multilingual representations
sion on their application to accent and domain adaptation. and language-specific or language-adaptive output layers was
A key aspect of this overview was a meta-analysis of recent proposed in 2013 [215]–[217] and has become a standard
published results for the adaptation of speech recognition sys- architectural pattern. More recently several authors have pro-
tems. The meta-analysis indicates that adaptation algorithms posed highly multilingual E2E systems, with a shared mul-
apply successfully to both hybrid and E2E systems, across tilingual output layer [325]–[328], with the potential to be
different corpora and adaptation classes. adapted to new languages.
E2E modeling is less mature than the hybrid approach, and State-of-the-art NLP systems are characterized by an unsu-
much of the research focus on E2E modeling is to improve pervised, large-scale base model [30], [42] which may then
the general modeling technology. Therefore, in this overview be adapted to specific domains and tasks [43]. An analogous
paper, many more adaptation methods were introduced in the approach for speech recognition would be based on the unsu-
context of hybrid systems. However, most adaptation tech- pervised learning of speech representations, from diverse and
nologies successfully applied to hybrid models by adapting potentially multilingual speech recordings. Initial work in this
acoustic model or language model should also work well direction includes the unsupervised learning from large-scale
for E2E models because E2E models usually contain sub- multilingual speech data [329], [330]. More generally, deep
networks corresponding to the acoustic model and language probabilistic generative modeling has become a highly active
model in hybrid models; this is supported by findings in our research area, in particular through approaches such as nor-
meta-analysis. malizing flows [50], [51], [53], [54]. Such deep generative
Different from hybrid models in which components are models offer different ways of addressing the problem of
optimized separately, E2E models are optimized using a single adaptation including powerful approaches to data augmenta-
objective function. Therefore, E2E models tend to memorize tion, and the development of rich adaptation algorithms build-
the training data more and hence the generalization or ro- ing on a base model with a joint distribution over acoustics
bustness to unseen data [191] is challenging to E2E models. and symbols. This offers the possibility of finetuning general
Consequently, adaptation to new environment or new domain encoders to specific acoustic domains, and adapting the de-
is very important to the large scale application of E2E models. coder to specific tasks (such as speech recognition, speaker
We would expect more research toward this direction as E2E identification, language recognition, or emotion recognition),
modeling becomes increasingly mainstream in ASR. noting that classic adaptation to speakers can bring further
Because the size of E2E models is much smaller than that gains [331], [332].
of hybrid models, E2E models have clear advantages when
being deployed to device. Therefore, the personalization or ACKNOWLEDGMENT
adaptation of E2E models [119], [120], [126], [127] is a The views and conclusions contained herein are those of the
rapidly growing area. While it is possible to adapt every user’s authors and should not be interpreted as necessarily repre-
model in the cloud and then push it back to each device, it is senting the official policies, either expressed or implied, of
more reasonable to adapt the model on device, which requires ODNI, IARPA, AFRL or the U.S. Government. All authors
adjusting the adaptation algorithm to overcome the challenge contributed equally to this work.
58 VOLUME 2, 2021
REFERENCES [24] J. Li, R. Zhao, H. Hu, and Y. Gong, “Improving RNN transducer
[1] G. Hinton, “Deep neural networks for acoustic modeling in speech modeling for end-to-end speech recognition,” in Proc. IEEE Workshop
recognition: The shared views of four research groups,” IEEE Signal, Autom. Speech Recognit. Understanding, 2019, pp. 114–121.
vol. 29, no. 6, pp. 82–97, Nov. 2012. [25] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connection-
[2] F. Seide, G. Li, and D. Yu, “Conversational speech transcription us- ist temporal classification: Labelling unsegmented sequence data with
ing context-dependent deep neural networks,” in Proc. Interspeech, recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2006,
Aug. 2011, pp. 27–31. pp. 369–376.
[3] G. Saon et al., “English conversational telephone speech recognition [26] A. Hannun, “Sequence modeling with CTC,” Distill, 2017. [Online].
by humans and machines,” in Proc. Interspeech, 2017, pp. 132–136. Available: https://distill.pub/2017/ctc
[4] C.-C. Chiu et al., “State-of-the-art speech recognition with sequence- [27] A. Graves, “Sequence transduction with recurrent neural networks,” in
to-sequence models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Proc. ICML Workshop Representation Learn., 2012.
Process., 2018, pp. 4774–4778. [28] L. R. Rabiner, “A tutorial on hidden markov models and selected appli-
[5] N. Morgan and H. A. Bourlard, “Neural networks for statistical cations in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286,
recognition of continuous speech,” Proc. IEEE, vol. 83, no. 5, Feb. 1989.
pp. 742–772, May 1995. [29] J. Li, Y. Wu, Y. Gaur, C. Wang, R. Zhao, and S. Liu, “On the compar-
[6] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: ison of popular end-to-end models for large scale speech recognition,”
A. neural network for large vocabulary conversational speech recogni- in Proc. Interspeech, 2020, pp. 1–5.
tion,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, [30] A. Vaswani et al., “Attention is all you need,” in Proc. Adv. Neural Inf.
pp. 4960–4964. Process. Syst., 2017, pp. 5998–6008.
[7] T. Robinson, M. Hochberg, and S. Renals, “The use of recurrent neural [31] L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-
networks in continuous speech recognition,” in Automatic Speech and recurrence sequence-to-sequence model for speech recognition,”
Speaker Recognition, C.-H. Lee, F. K. Soong, and K. K. Paliwal, Eds. in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018,
Boston, MA, USA: Springer, 1996, pp. 233–258. pp. 5884–5888.
[8] A. J. Robinson, G. D. Cook, D. P. W. Ellis, E. Fosler-Lussier, S. [32] Y. Zhao, J. Li, X. Wang, and Y. Li, “The speechtransformer for large-
J. Renals, and D. A. G. Williams, “Connectionist speech recog- scale mandarin chinese speech recognition,” in Proc. IEEE Int. Conf.
nition of broadcast news,” Speech Commun., vol. 37, pp. 27–45, Acoust., Speech Signal Process., 2019, pp. 7095–7099.
2002. [33] S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and
[9] D. J. Kershaw, A. J. Robinson, and M. Hochberg, “Context-dependent T. Nakatani, “Improving transformer based end-to-end speech recog-
classes in a hybrid recurrent network-HMM speech recognition sys- nition with connectionist temporal classification and language model
tem,” in Proc. Adv. Neural Inf. Process. Syst., 1996, pp. 750–756. integration,” in Proc. Interspeech, 2019, pp. 1408–1412.
[10] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, [34] C.-F. Yeh et al., “Transformer-transducer: End-to-end speech recogni-
“Phoneme recognition using time-delay neural networks,” IEEE tion with self-attention,” 2019, arXiv:1910.12977.
Speech, Signal Process., vol. 37, no. 3, pp. 328–339, Mar. 1989. [35] Q. Zhang et al., “Transformer transducer: A. streamable speech
[11] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural net- recognition model with transformer encoders and RNN-T loss,”
work architecture for efficient modeling of long temporal contexts,” in in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020,
Proc. Interspeech, 2015, pp. 3214–3218. pp. 7829–7833.
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based [36] X. Chen, Y. Wu, Z. Wang, S. Liu, and J. Li, “Developing real-time
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, streaming transformer transducer for speech recognition on large-scale
pp. 2278–2324, Nov. 1998. dataset,” 2020, arXiv:2010.11395.
[13] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, [37] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Im-
and D. Yu, “Convolutional neural networks for speech recogni- ageNet: A large-scale hierarchical image database,” in Proc. Conf.
tion,” IEEE/ACM Audio, Speech Lang. Process., vol. 22, no. 10, Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
pp. 1533–1545, Oct. 2014. [38] O. Russakovsky et al., “ImageNet large scale visual recognition chal-
[14] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural lenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
Comput., vol. 9, no. 8, pp. 1735–1780, 1997. [39] H.-C. Shin et al., “Deep convolutional neural networks for computer-
[15] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with aided detection: CNN architectures, dataset characteristics and transfer
deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., learning,” IEEE Trans. Med. Imag., vol. 35, no. 5, pp. 1285–1298,
Speech, Signal Process., 2013, pp. 6645–6649. May 2016.
[16] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net- [40] S. Kornblith, J. Shlens, and Q. V. Le, “Do better ImageNet mod-
works,” IEEE Trans. Signal Process., vol. 45, no. 11, pp. 2673–2681, els transfer better?,” in Proc. Conf. Comput. Vis. Pattern Recognit.,
Nov. 1997. 2019, pp. 2661–2671.
[17] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recogni- [41] M. Peters et al., “Deep contextualized word representations,”
tion with deep bidirectional LSTM,” in Proc. IEEE Workshop Autom. in North Amer. Chapter Assoc. Comput. Linguistics/HLT, 2018,
Speech Recognit. Understanding, Olomouc, 2013, pp. 273–278. pp. 2227–2237.
[18] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, [42] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-
“Attention-based models for speech recognition,” in Adv. Neural Inf. training of deep bidirectional transformers for language understand-
Process. Syst., 2015, pp. 577–585. ing,” in North Amer. Chapter Assoc. Comput. Linguistics/HLT, 2019,
[19] Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech pp. 4171–4186.
recognition using deep RNN models and WFST-based decoding,” in [43] C. Raffel et al., “Exploring the limits of transfer learning with a unified
Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2015, text-to-text transformer,” Comput. Sci., vol. 21, no. 140, pp. 1–67,
pp. 167–174. 2020.
[20] L. Lu, X. Zhang, K. Cho, and S. Renals, “A study of the recurrent neu- [44] D. Cer et al., “Universal sentence encoder for English,” in Proc. Conf.
ral network encoder-decoder for large vocabulary speech recognition,” Empirical Methods Nat. Lang. Process., 2018, pp. 169–174.
in Proc. Interspeech, 2015, pp. 3249–3253 [45] N. Houlsby et al., “Parameter-efficient transfer learning for NLP,” in
[21] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data Proc. 37th Int. Conf. Mach. Learn., 2019, pp. 2790–2799.
and units for streaming end-to-end speech recognition with RNN- [46] W. M. Kouw and M. Loog, “A review of domain adaptation without
transducer,” in Proc. IEEE Workshop Autom. Speech Recognit. Un- target labels,” IEEE Trans. Pattern Anal. Mach. Intell., 2019, pp. 1–20,
derstanding, 2017, pp. 193–199. doi: 10.1109/TPAMI.2019.2945942.
[22] E. Battenberg et al., “Exploring neural transducers for end-to-end [47] S. Ravi and H. Larochelle, “Optimization as a model for few-
speech recognition,” in Proc. IEEE Workshop Autom. Speech Recognit. shot learning,” in Proc. Int. Conf. Learn. Representations, 2016,
Understanding, 2017, pp. 206–213. pp. 1–11.
[23] Y. He et al., “Streaming end-to-end speech recognition for mobile [48] J. Snell, K. Swersky, and R. Zemel, “Prototypical networks for
devices,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., few-shot learning,” in Adv. Neural Inf. Process. Syst., 2017,
2019, pp. 6381–6385. pp. 4077–4087.

VOLUME 2, 2021 59
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

[49] X. Li, S. Dalmia, D. R. Mortensen, J. Li, A. W. Black, and F. Metze, [71] L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative train-
“Towards zero-shot learning for automatic phonemic transcription,” in ing of acoustic models applied to domains with unreliable transcripts
Assoc. Adv. Artif. Intell., 2020, pp. 8261–8268. [speech recognition applications],” in Proc. IEEE Int. Conf. Acoust.,
[50] D. Rezende and S. Mohamed, “Variational inference with normalizing Speech, Signal Process., 2005, pp. I/109–I/112.
flows,” in Proc. 37th Int. Conf. Mach. Learn., 2015, pp. 1530–1538. [72] S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data
[51] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and selection for minimum phone error training of acoustic models,” in
B. Lakshminarayanan, “Normalizing flows for probabilistic modeling Proc. IEEE Int. Conf. Multimedia Expo., 2007, pp. 348–351.
and inference,” 2019, arXiv:1912.02762. [73] S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised
[52] S. S. Chen and R. A. Gopinath, “Gaussianization,” in Adv Neural Inf. model training for unbounded conversational speech recognition,”
Process. Syst., 2001, pp. 423–429. 2017, arXiv:1705.09724.
[53] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A. flow-based [74] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep
generative network for speech synthesis,” in Proc. IEEE Int. Conf. neural network acoustic models using i-vectors,” IEEE/ACM Audio,
Acoust., Speech, Signal Process., 2019, pp. 3617–3621. Speech Lang. Process., vol. 23, no. 11, pp. 1938–1949, Nov. 2015.
[54] J. Serrà, S. Pascual, and C. S. Perales, “Blow: A single-scale hyper- [75] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contribu-
conditioned flow for non-parallel raw-audio voice conversion,” in Adv. tions for unsupervised acoustic model adaptation,” IEEE Trans. Audio,
Neural Inf. Process. Syst., 2019, pp. 6793–6803. Speech, Lang. Process., vol. 24, no. 8, pp. 1450–1463, Aug. 2016.
[55] S. Tan and K. C. Sim, “Learning utterance-level normalisation using [76] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsuper-
variational autoencoders for robust automatic speech recognition,” in vised MLLR for speaker adaptation,” in Proc. ISCA ASR2000 Work-
IEEE Sri Lanka Telecom, 2016, pp. 43–49. shop, 2000, pp. 128–132.
[56] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation [77] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsu-
of neural network acoustic models using i-vectors,” in Proc. IEEE pervised acoustic model training,” in Proc. IEEE Int. Conf. Acoust.,
Autom. Speech Recognit. Understanding Workshop, 2013, pp. 55–59. Speech, Signal Process., 2011, pp. 4656–4659.
[57] A. Senior and I. Lopez-Moreno, “Improving DNN speaker indepen- [78] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-
dence with i-vector inputs,” in Proc. IEEE Int. Conf. Acoust., Speech, supervised training of acoustic models using lattice-free MMI,”
Signal Process., 2014, pp. 225–229. in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018,
[58] P. Karanasou, Y. Wang, M. J. Gales, and P. C. Woodland, “Adaptation pp. 4844–4848.
of deep neural network acoustic models using factorised i-vectors,” in [79] O. Klejch, J. Fainberg, P. Bell, and S. Renals, “Lattice-based unsuper-
Proc. 15th Conf. Int. Speech Commun. Assoc., 2014, pp. 2180–2184. vised test-time adaptation of neural network acoustic models,” 2019,
[59] K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and arXiv:1906.11521.
J. H. Černocký, “Sequence summarizing neural network for speaker [80] H. Suzuki, H. Kasuya, and K. Kido, “The acoustic parameters for
adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vowel recognition without distinction of speakers,” in Proc. Conf.
2016, pp. 5315–5319. Speech Commun. Process., 1967, pp. 92–96.
[60] M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichlet [81] L. Gerstman, “Classification of self-normalized vowels,” IEEE Trans.
allocation based organisation of broadcast media archives for deep Audio Electroacoust., vol. 16, no. 1, pp. 78–80, Mar. 1968.
neural network adaptation,” in Proc. IEEE Autom. Speech Recognit. [82] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear re-
Understanding Workshop, 2015, pp. 130–136. gression for speaker adaptation of continuous density hidden Markov
[61] J. Pan, G. Wan, J. Du, and Z. Ye, “Online speaker adap- models,” Comput. Speech Lang., vol. 9, no. 2, pp. 171–185, 1995.
tation using memory-aware networks for speech recognition,” [83] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for
IEEE/ACM Audio, Speech, Lang. Process., to be published, multivariate Gaussian mixture observations of Markov chains,” IEEE
doi: 10.1109/TASLP.2020.2980372. Audio, Speech, Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.
[62] L. Sari, N. Moritz, T. Hori, and J. Le Roux, “Unsupervised speaker [84] P. C. Woodland, “Speaker adaptation for continuous density HMMs:
adaptation using attention-based speaker memory for end-to-end A review,” in Proc. ISCA Workshop Adapt. Methods Speech Recognit.,
ASR,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020, 2001, pp. 11–19.
pp. 7384–7388. [85] K. Shinoda, “Speaker adaptation techniques for automatic speech
[63] Z.-P. Zhang, S. Furui, and K. Ohtsuki, “On-line incremental recognition,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu.
speaker adaptation with automatic speaker change detection,” in Summit Conf., 2011.
Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2000, [86] M. Gales and S. Young, “The application of hidden Markov models in
pp. II.961–II.964. speech recognition,” Found. Trends Signal, vol. 1, no. 3, pp. 195–304,
[64] H. Huang and K. C. Sim, “An investigation of augmenting speaker rep- 2008.
resentations to improve speaker normalisation for DNN-based speech [87] K. Johnson, “Speaker normalization in speech perception,” in The
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Handbook of Speech Perception. Hoboken, NJ, USA: Wiley, 2005,
2015, pp. 4610–4613. pp. 363–389.
[65] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, [88] K. Paliwal and W. Ainsworth, “Dynamic frequency warping for
and O. Vinyals, “Speaker diarization: A review of recent research,” speaker adaptation in automatic speech recognition,” J. Phonetics,
IEEE Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 356–370, vol. 13, no. 2, pp. 123–134, 1985.
Feb. 2012. [89] Y. Grenier, “Speaker adaptation through canonical correlation analy-
[66] M. J. Gales, “Cluster adaptive training of hidden Markov models,” sis,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1980,
IEEE Trans. Speech Audio Process., vol. 8, no. 4, pp. 417–428, pp. 888–891.
Jul. 2000. [90] K. Choukri and G. Chollet, “Adaptation of automatic speech recogniz-
[67] T. Tan, Y. Qian, and K. Yu, “Cluster adaptive training for deep neural ers to new speakers using canonical correlation analysis techniques,”
network based acoustic model,” IEEE/ACM Trans. Audio, Speech, Comput. Speech Lang., vol. 1, no. 2, pp. 95–107, 1986.
Lang. Process., vol. 24, no. 3, pp. 459–468, Mar. 2016. [91] H. Wakita, “Normalization of vowels by vocal-tract length and its
[68] H. Christensen, S. Cunningham, C. Fox, P. Green, and T. Hain, “A application to vowel identification,” IEEE Speech, Signal Process.,
comparative study of adaptive, automatic recognition of disordered vol. 25, no. 2, pp. 183–192, Apr. 1977.
speech,” in Proc. Interspeech, 2012, pp. 1776–1779. [92] A. Andreou, “Experiments in vocal tract normalization,” in Proc. Cer-
[69] H. Liao, E. McDermott, and A. Senior, “Large scale deep neural tified Artif. Intell. Practitioner Workshop: Front. Speech Recognit. II,
network acoustic modeling with semi-supervised training data for 1994.
YouTube video transcription,” in Proc. IEEE Autom. Speech Recognit. [93] E. Eide and H. Gish, “A parametric approach to vocal tract length nor-
Understanding Workshop, 2013, pp. 368–373. malization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
[70] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation 1996, pp. 346–348.
for robust speech recognition via variational autoencoder-based data [94] L. Lee and R. C. Rose, “Speaker normalization using efficient fre-
augmentation,” in Proc. IEEE Autom. Speech Recognit. Understanding quency warping procedures,” in Proc. IEEE Int. Conf. Acoust., Speech,
Workshop, 2017, pp. 16–23. Signal Process., 1996, pp. 353–356.

60 VOLUME 2, 2021
[95] D. Kim, S. Umesh, M. Gales, T. Hain, and P. Woodland, “Using VTLN [117] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation
for broadcast news transcription,” in Proc. Int. Conf. Spoken Lang. for end-to-end CTC models,” in Proc. IEEE Spoken Lang. Technol.
Process., 2004, pp. 4–8. Workshop, 2018, pp. 542–549.
[96] G. Garau, S. Renals, and T. Hain, “Applying vocal tract length normal- [118] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Speaker adapta-
ization to meeting recordings,” in Proc. Interspeech, 2005. [Online]. tion for attention-based end-to-end speech recognition,” 2019,
Available: http://www.isca-speech.org/archive/interspeech_2005 arXiv:1911.03762.
[97] S. Furui, “A training procedure for isolated word recognition systems,” [119] K. C. Sim, P. Zadrazil, and F. Beaufays, “An investigation into on-
IEEE Speech, Signal Process., vol. 28, no. 2, pp. 129–136, Apr. 1980. device personalization of end-to-end automatic speech recognition
[98] K. Shikano, S. Nakamura, and M. Abe, “Speaker adaptation and voice models,” 2019, arXiv:1909.06678.
conversion by codebook mapping,” in Proc. IEEE Int. Symp. Circuits [120] Y. Huang, J. Li, L. He, W. Wei, W. Gale, and Y. Gong, “Rapid RNN-T
Syst., 1991, pp. 594–597. adaptation using personalised speech synthesis and neural language
[99] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conver- generator,” in Proc. Interspeech, 2020, pp. 1256–1260.
sion through vector quantization,” in Proc. Int. Conf. Acoust., Speech, [121] Z. Fan, J. Li, S. Zhou, and B. Xu, “Speaker-aware speech-transformer,”
Signal Process., 1990, pp. 71–76. in Proc. IEEE Autom. Speech Recognit. Understanding Workshop,
[100] M. Feng, F. Kubala, R. Schwartz, and J. Makhoul, “Improved speaker 2019, pp. 222–229.
adaption using text dependent spectral mappings,” in Proc. IEEE Int. [122] Y. Zhao, C. Ni, C.-C. Leung, S. Joty, E. S. Chng, and B. Ma, “Speech
Conf. Acoust., Speech, Signal Process., 1988, pp. 131–134. transformer with speaker aware persistent memory,” in Proc. Inter-
[101] G. Rigoll, “Speaker adaptation for large vocabulary speech recognition speech, 2020, pp. 1261–1265.
systems using speaker Markov models,” in Proc. IEEE Int. Conf. [123] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao,
Acoust., Speech, Signal Process., 1989, pp. 5–8. “Deep context: End-to-end contextual speech recognition,” in Proc.
[102] M. J. Hunt, “Speaker adaptation for word-based speech recognition IEEE Spoken Lang. Technol. Workshop, 2018, pp. 418–425.
systems,” J. Acoust. Soc. Amer., vol. 69, no. S1, pp. S 41–S 42, [124] Z. Chen, M. Jain, Y. Wang, M. L. Seltzer, and C. Fuegen, “End-to-
1981. end contextual speech recognition using class language models and
[103] S. J. Cox and J. S. Bridle, “Unsupervised speaker adaptation by prob- a token passing decoder,” in Proc. IEEE Int. Conf. Acoust., Speech
abilistic spectrum fitting,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6186–6190.
Signal Process., 1989, pp. 294–297. [125] M. Jain, G. Keren, J. Mahadeokar, and Y. Saraf, “Contextual RNN-T
[104] M. Gales, “Maximum likelihood linear transformations for HMM- for open domain ASR,” in Proc. Interspeech, 2020, pp. 11–15.
based speech recognition,” Comput. Speech Lang., vol. 12, no. 2, [126] K. C. Sim et al., “Personalization of end-to-end speech recognition
pp. 75–98, 1998. on mobile devices for named entities,” in Proc. IEEE Autom. Speech
[105] L. Neumeyer, A. Sankar, and V. Digalakis, “A comparative study of Recognit. Understanding Workshop, 2019, pp. 23–30.
speaker adaptation techniques,” in Proc. Eurospeech, 1995, pp. 1127– [127] J. Li et al., “Developing RNN-T models surpassing high-performance
1130. hybrid models with customization capability,” in Proc. Interspeech,
[106] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A 2020, pp. 3590–3594.
compact model for speaker-adaptive training,” in Proc. 4th Int. Conf. [128] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani,
Spoken Lang. Process., 1996. pp. 3–35. “Auxiliary feature based adaptation of end-to-end ASR systems,” in
[107] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker Interspeech, 2018, doi:10.21437/Interspeech.2018-1438.
adaptation in eigenvoice space,” IEEE Speech Audio Lang. Process., [129] M. Delcroix, K. Kinoshita, A. Ogawa, C. Huemmer, and T. Nakatani,
vol. 8, no. 6, pp. 695–707, Nov. 2000. “Context adaptive neural network based acoustic models for rapid
[108] K. Yu and M. J. Gales, “Discriminative cluster adaptive training,” adaptation,” IEEE/ACM Audio, Speech, Lang. Process., vol. 26, no. 5,
IEEE Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1694–1703, pp. 895–908, May 2018.
Sep. 2006. [130] J. Rownicka, P. Bell, and S. Renals, “Embeddings for DNN speaker
[109] K. C. Sim, Y. Qian, G. Mantena, L. Samarakoon, S. Kundu, and T. Tan, adaptive training,” in Proc. IEEE Autom. Speech Recognit. Under-
“Adaptation of deep neural network acoustic models for robust auto- standing Workshop, 2019, pp. 479–486.
matic speech recognition,” in New Era for Robust Speech: Exploiting [131] S. Garimella, A. Mandal, N. Strom, B. Hoffmeister, S. Matsoukas,
Deep, S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds. and S. H. K. Parthasarathi, “Robust i-vector based adaptation of DNN
Berlin, Germany: Springer, 2017, pp. 219–243. acoustic model for speech recognition,” in Proc. Interspeech, 2015,
[110] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouel- pp. 2877–2881.
let, “Front-end factor analysis for speaker verification,” IEEE [132] T. Tan et al., “Speaker-aware training of LSTM-RNNs for acoustic
Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788–798, modelling,” in IEEE Int. Conf. Acoust., Speech Signal Process., 2016,
May 2011. pp. 5280–5284.
[111] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan- [133] S. H. K. Parthasarathi, B. Hoffmeister, S. Matsoukas, A. Mandal, N.
pur, “X-vectors: Robust DNN embeddings for speaker recognition,” Strom, and S. Garimella, “fMLLR based feature-space speaker adap-
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018, tation of DNN acoustic models,” in Proc. Interspeech, 2015, pp. 3630–
pp. 5329–5333. 3634.
[112] P. Swietojanski and S. Renals, “Learning hidden unit contributions for [134] Z. Meng et al., “L-vector: Neural label embedding for domain adapta-
unsupervised speaker adaptation of neural network acoustic models,” tion,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020,
in Proc. IEEE Spoken Lang. Technol. Workshop, South Lake Tahoe, pp. 7389–7393.
2014, pp. 171–176. [135] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-
[113] H. Liao, “Speaker adaptation of context dependent deep neural net- dependent deep neural networks for conversational speech transcrip-
works,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., tion,” in Proc. IEEE Autom. Speech Recognit. Understanding Work-
2013, pp. 7947–7951. shop, 2011, pp. 24–29.
[114] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized [136] S. P. Rath, D. Povey, K. Veselý, and J. Černocký, “Improved feature
deep neural network adaptation for improved large vocabulary speech processing for deep neural networks,” in Proc. Interspeech, 2013,
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., pp. 109–113.
2013, pp. 7893–7897. [137] N. M. Joy, M. K. Baskar, S. Umesh, and B. Abraham, “DNNs for
[115] Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. unsupervised extraction of pseudo FMLLR features without explicit
Lee, “Rapid adaptation for deep neural networks through multi-task adaptation data,” in Proc. Interspeech, 2016, pp. 3479–3483.
learning,” in Proc. Interspeech, 2015, pp. 3625–3629. [138] N. Tomashenko and Y. Khokhlov, “GMM-derived features for effec-
[116] Y. Huang, L. He, W. Wei, W. Gale, J. Li, and Y. Gong, “Using per- tive unsupervised adaptation of deep neural network acoustic models,”
sonalized speech synthesis and neural language generator for rapid in Proc. Interspeech, 2015, pp. 2882–2886.
speaker adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal [139] L. F. Uebel and P. C. Woodland, “An investigation into vocal tract
Process., 2020, pp. 7399–7403. length normalisation,” in Proc. Eurospeech, 1999, pp. 2527–2530.

VOLUME 2, 2021 61
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

[140] D. Povey, G. Zweig, and A. Acero, “Speaker adaptation with an [163] Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation for
exponential transform,” in Proc. IEEE Autom. Speech Recognit. Un- deep neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
derstanding Workshop, 2011, pp. 158–163. Process., 2016, pp. 5005–5009.
[141] J. Fainberg, O. Klejch, E. Loweimi, P. Bell, and S. Renals, “Acous- [164] Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank
tic model adaptation from raw waveforms with SincNet,” in Proc. plus diagonal adaptation for deep and recurrent neural networks,”
IEEE Autom. Speech Recognit. Understanding Workshop, 2019, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017,
pp. 897–904. pp. 5040–5044.
[142] M. Karafiát, L. Burget, P. Matějka, O. Glembek, and J. Černocký, [165] V. Abrash, H. Franco, A. Sankar, and M. Cohen, “Connectionist
“iVector-based discriminative adaptation for automatic speech recog- speaker normalization and adaptation,” in Proc. Eurospeech, 1995,
nition,” in Proc. IEEE Autom. Speech Recognit. Understanding Work- pp. 2183–2186.
shop, 2011, pp. 152–157. [166] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, “Adapta-
[143] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and tion of context-dependent deep neural networks for automatic speech
session variability in GMM-based speaker verification,” IEEE Audio, recognition,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2012,
Speech, Lang. Process., vol. 15, no. 4, pp. 1448–1460, May 2007. pp. 366–369.
[144] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme [167] L. Samarakoon and K. C. Sim, “Learning factorized feature transforms
for speaker recognition using a phonetically-aware deep neural net- for speaker normalization,” in Proc. IEEE Workshop Autom. Speech
work,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, Recognit. Understanding, 2015, pp. 145–152.
pp. 1695–1699. [168] L. Samarakoon and K. C. Sim, “Factorized hidden layer adapta-
[145] P. Kenny, T. Stafylakis, P. Ouellet, V. Gupta, and M. J. Alam, “Deep tion for deep neural network based acoustic modeling,” IEEE/ACM
neural networks for extracting Baum-Welch statistics for speaker Trans. Audio, Speech, Lang. Process., vol. 24, no. 12, pp. 2241–2250,
recognition,” in Speaker Odyssey, 2014, pp. 293–298. Dec. 2016.
[146] Y. Miao, H. Zhang, and F. Metze, “Towards speaker adaptive training [169] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive
of deep neural network acoustic models,” in Proc. Interspeech, 2014, mixtures of local experts,” Neural Comput., vol. 3, no. 1, pp. 79–87,
pp. 2189–2193. 1991.
[147] Y. Miao, L. Jiang, H. Zhang, and F. Metze, “Improvements to speaker [170] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, “FiLM:
adaptive training of deep neural networks,” in Proc. IEEE Spoken Visual reasoning with a general conditioning layer,” in Proc. Assoc.
Lang. Technol. Workshop, 2014, pp. 165–170. Adv. Artif. Intell., 2018, pp. 3942–3951.
[148] J. Rownicka, P. Bell, and S. Renals, “Analyzing deep CNN-based [171] L. Samarakoon and K. C. Sim, “Subspace LHUC for fast adaptation
utterance embeddings for acoustic model adaptation,” in Proc. IEEE of deep neural network acoustic models,” in Proc. Interspeech, 2016,
Spoken Lang. Technol. Workshop, 2018, pp. 235–241. pp. 1593–1597.
[149] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez- [172] X. Cui, V. Goel, and G. Saon, “Embedding-based speaker adaptive
Dominguez, “Deep neural networks for small footprint text-dependent training of deep neural networks,” in Proc. Interspeech, 2017, pp. 122–
speaker verification,” in Proc. IEEE Spoken Lang. Technol. Workshop, 126.
2014, pp. 4052–4056. [173] T. Kim, I. Song, and Y. Bengio, “Dynamic layer normalization for
[150] X. Li and X. Wu, “Modeling speaker variability using long short-term adaptive neural acoustic modeling in speech recognition,” in Proc.
memory networks for speech recognition,” in Proc. Interspeech, 2015, Interspeech, 2017, pp. 2411–2415.
pp. 1086–1090. [174] L. Sari, S. Thomas, and M. A. Hasegawa-Johnson, “Learning speaker
[151] Y. Khokhlov et al., “R-vectors: New technique for adaptation to room aware offsets for speaker adaptation of neural networks,” in Proc.
acoustics,” in Proc. Interspeech, 2019, pp. 15–19. Interspeech, 2019, pp. 769–773.
[152] Y. Shi, Q. Huang, and T. Hain, “H-vectors: Utterance-level speaker [175] X. Xie, X. Liu, T. Lee, and L. Wang, “Fast DNN acoustic model
embedding using a hierarchical attention model,” in Proc. IEEE Int. speaker adaptation by learning hidden unit contribution features,” in
Conf. Acoust., Speech Signal Process., 2020, pp. 7579–7583. Proc. Interspeech, 2019, pp. 759–763.
[153] X. Xie, X. Liu, T. Lee, and L. Wang, “Fast DNN acoustic model [176] J. Xue, J. Li, and Y. Gong, “Restructuring of deep neural network
speaker adaptation by learning hidden unit contribution features,” in acoustic models with singular value decomposition,” in Proc. Inter-
Proc. Interspeech, 2019, pp. 759–763. speech, 2013, pp. 2365–2369.
[154] Y. Huang and Y. Gong, “Regularized sequence-level deep neural net- [177] J. Xue, J. Li, D. Yu, M. Seltzer, and Y. Gong, “Singular value decom-
work model adaptation,” in Proc. Interspeech, 2015, pp. 1081–1085. position based low-footprint speaker adaptation and personalization
[155] J. Neto et al., “Speaker-adaptation for hybrid HMM-ANN continuous for deep neural network,” in Proc. IEEE Int. Conf. Acoust., Speech
speech recognition system,” Eurospeech, 1995. [Online]. Available: Signal Process., 2014, pp. 6359–6363.
http://hdl.handle.net/1842/1274 [178] O. Klejch, J. Fainberg, and P. Bell, “Learning to adapt: A meta-
[156] R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Linear learning approach for speaker adaptation,” in Proc. Interspeech, 2018,
hidden transformations for adaptation of hybrid ANN/HMM mod- pp. 867–871.
els,” Speech Commun., vol. 49, no. 10/11, pp. 827–835, Oct./Nov. [179] O. Klejch, J. Fainberg, P. Bell, and S. Renals, “Speaker adaptive
2007. training using model agnostic meta-learning,” in Proc. IEEE Autom.
[157] B. Li and K. C. Sim, “Comparison of discriminative input and out- Speech Recognit. Understanding Workshop, 2019, pp. 881–888.
put transformations for speaker adaptation in the hybrid NN/HMM [180] F. Weninger, J. Andrés-Ferrer, X. Li, and P. Zhan, “Listen, attend,
systems,” 2010. [Online]. Available: http://scholarbank.nus.edu.sg/ spell and adapt: Speaker adapted sequence-to-sequence ASR,” in Proc.
handle/10635/41498 Interspeech, 2019, pp. 3805–3809.
[158] J. S. Bridle and S. J. Cox, “RecNorm: Simultaneous normalisation [181] Z. Meng, J. Li, and Y. Gong, “Adversarial speaker adaptation,” in Proc.
and classification applied to speech recognition,” in Adv. Neural Inf. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, pp. 5721–5725.
Process. Syst., 1991, pp. 234–240. [182] Z. Huang, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. Lee, “Max-
[159] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid imum a posteriori adaptation of network parameters in deep models,”
NN/HMM model for speech recognition based on discriminative in Proc. Interspeech, 2015, pp. 1076–1080.
learning of speaker code,” in Proc. IEEE Int. Conf. Acoust., Speech [183] X. Xie, X. Liu, T. Lee, S. Hu, and L. Wang, “BLHUC: Bayesian
Signal Process., 2013, pp. 7942–7946. learning of hidden unit contributions for deep neural network speaker
[160] Z. Q. Wang and D. Wang, “Unsupervised speaker adaptation of batch adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process.,
normalized acoustic models for robust ASR,” in Proc. IEEE Int. Conf. 2019, pp. 5711–5715.
Acoust., Speech Signal Process., 2017, pp. 4890–4894. [184] D. Povey, “Discriminative training for large vocabulary speech recog-
[161] F. Mana, F. Weninger, R. Gemello, and P. Zhan, “Online batch normal- nition,” Ph.D. dissertation, Cambridge, U.K.: Univ. Cambridge, 2005.
ization adaptation for automatic speech recognition,” in Proc. Autom. [185] R. Price, K.-i. Iso, and K. Shinoda, “Speaker adaptation of deep neural
Speech Recognit. Understanding Workshop, 2019, pp. 875–880. networks using a hierarchy of output layers,” in Proc. IEEE Spoken
[162] C. Zhang and P. C. Woodland, “Parameterised sigmoid and ReLU Lang. Technol. Workshop, 2014, pp. 153–158.
hidden activation functions for DNN acoustic modelling,” in Proc. [186] R. Caruana, “Multitask learning,” Mach. Learn., vol. 28, no. 1, pp. 41–
Interspeech, 2015, pp. 3224–3228. 75, 1997.

62 VOLUME 2, 2021
[187] P. Swietojanski, P. Bell, and S. Renals, “Structured output layer with [212] U. Nallasamy, F. Metze, and T. Schultz, “Active learning for accent
auxiliary targets for context-dependent acoustic modelling,” in Proc. adaptation in automatic speech recognition,” in Proc. IEEE Spoken
Interspeech, 2015, pp. 3605–3609. Lang. Technol. Workshop, 2012, pp. 360–365.
[188] J. B. Allen and D. A. Berkley, “Image method for efficiently simulat- [213] Y. Huang, D. Yu, C. Liu, and Y. Gong, “Multi-accent deep neu-
ing small-room acoustics,” Acoust. Amer., vol. 65, no. 4, pp. 943–950, ral network acoustic model with accent-specific top layer using
1979. the KLD-regularized model adaptation,” in Proc. Interspeech, 2014,
[189] T. Ko, V. Peddinti, D. Povey, M. L. Seltzer, and S. Khudanpur, “A pp. 2977–2981.
study on data augmentation of reverberant speech for robust speech [214] M. Chen, Z. Yang, J. Liang, Y. Li, and W. Liu, “Improving deep neu-
recognition,” in IEEE Int. Conf. Acoust., Speech Signal Process., 2017, ral networks based multi-accent Mandarin speech recognition using
pp. 5220–5224. i-vectors and accent-specific top layer,” in Proc. Interspeech, 2015,
[190] C. Kim et al., “Generation of large-scale simulated utterances pp. 3620–3624.
in virtual rooms to train deep-neural networks for far-field [215] A. Ghoshal, P. Swietojanski, and S. Renals, “Multilingual training of
speech recognition in Google Home,” in Proc. Interspeech, 2017, deep neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
doi: 10.21437/Interspeech.2017-1510. Process., 2013, pp. 7319–7323.
[191] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise- [216] G. Heigold et al., “Multilingual acoustic models using distributed deep
robust automatic speech recognition,” IEEE/ACM Audio, Speech, neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Lang. Process., vol. 22, pp. 745–777, Apr. 2014. Process., 2013, pp. 8619–8623.
[192] N. Jaitly and G. E. Hinton, “Vocal tract length perturbation (VTLP) [217] J.-T. Huang, J. Li, D. Yu, L. Deng, and Y. Gong, “Cross-language
improves speech recognition,” in Proc. ICML Workshop Deep Audio, knowledge transfer using multilingual deep neural network with
Speech, 2013. shared hidden layers,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
[193] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deep Process., 2013, pp. 7304–7308.
neural network acoustic modeling,” IEEE/ACM Audio, Speech, Lang [218] J. Yi, H. Ni, Z. Wen, and J. Tao, “Improving BLSTM RNN based Man-
Process., vol. 23, no. 9, pp. 1469–1477, Sep. 2015. darin speech recognition using accent dependent bottleneck features,”
[194] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation in Proc. IEEE Asia-Pacific Signal Inf. Process. Assoc. Annu. Summit
for speech recognition,” in Proc. Interspeech, 2015, pp 3586–3589. Conf., 2016, pp. 1–5.
[195] Y. Huang and Y. Gong, “Acoustic model adaptation for presentation [219] M. T. Turan, E. Vincent, and D. Jouvet, “Achieving multi-accent ASR
transcription and intelligent meeting assistant systems,” in Proc. IEEE via unsupervised acoustic model adaptation,” in Proc. Interspeech,
Int. Conf. Acoust., Speech Signal Process., 2020, pp. 7394–7398. 2020, pp. 1286–1290.
[196] Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic [220] M. Elfeky, M. Bastani, X. Velez, P. Moreno, and A. Waters, “Towards
transform for voice conversion,” IEEE Speech Audio Process., vol. 6, acoustic model unification across dialects,” in Proc. IEEE Spoken
no. 2, pp. 131–142, Mar. 1998. Lang. Technol. Workshop, 2016, pp. 624–628.
[197] X. Cui, V. Goel, and B. Kingsbury, “Data augmentation for deep [221] X. Yang, K. Audhkhasi, A. Rosenberg, S. Thomas, B. Ramabhadran,
convolutional neural network acoustic modeling,” in Proc. IEEE Int. and M. Hasegawa-Johnson, “Joint modeling of accents and acoustics
Conf. Acoust., Speech Signal Process., 2015, pp. 4545–4549. for multi-accent speech recognition,” in Proc. IEEE Int. Conf. Acoust.,
[198] J. Fainberg, P. Bell, M. Lincoln, and S. Renals, “Improving chil- Speech Signal Process., 2018, pp. 1–5.
dren’s speech recognition through out-of-domain data augmentation,” [222] A. Jain, M. Upreti, and P. Jyothi, “Improved accented speech recog-
in Proc. Interspeech, 2016, pp. 1598–1602. nition using accent embeddings and multi-task learning,” in Proc.
[199] Y. Jia et al., “Transfer learning from speaker verification to multi- Interspeech, 2018, doi:10.21437/Interspeech.2018-1864.
speaker text-to-speech synthesis,” in Adv. Neural Inf. Process. Syst., [223] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation
2018, pp. 4480–4490. for end-to-end CTC models,” in Proc. IEEE Spoken Lang. Technol.
[200] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “Augmented cyclic Workshop, 2018, pp. 542–549.
adversarial learning for low resource domain adaptation,” in Proc. Int. [224] T. Viglino, P. Motlicek, and M. Cernak, “End-to-end
Conf. Learn. Representations, 2019, pp. 1–14. accented speech recognition,” in Proc. Interspeech, 2019,
[201] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image doi:10.21437/interspeech.2019-2122.
translation using cycle-consistent adversarial networks,” in Proc. IEEE [225] S. Sun, C.-F. Yeh, M.-Y. Hwang, M. Ostendorf, and L. Xie,
Int. Conf. Comput. Vis., 2017, pp. 2242–2251. “Domain adversarial training for accented speech recognition,” in
[202] A. Ali et al., “Automatic dialect detection in Arabic broadcast speech,” Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018,
in Proc. Interspeech, 2016, pp. 2934–2938. pp. 4854–4858.
[203] A. Ali, S. Vogel, and S. Renals, “Speech recognition challenge in [226] B. Li et al., “Multi-dialect speech recognition with a single sequence-
the wild: Arabic MGB-3,” in Proc. IEEE Autom. Speech Recognit. to-sequence model,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
Understanding Workshop, 2017, pp. 14–18. Process., 2018, pp. 4749–4753.
[204] A. Ali et al., “The MGB-5 challenge: Recognition and dialect iden- [227] M. Grace, M. Bastani, and E. Weinstein, “Occam’s adaptation: A com-
tification of dialectal Arabic speech,” in Proc. IEEE Autom. Speech parison of interpolation of bases adaptation methods for multi-dialect
Recognit. Understanding Workshop, 2019, pp. 1026–1033. acoustic modeling with LSTMs,” in Proc. IEEE Spoken Lang. Technol.
[205] D. Povey et al., “Purely sequence-trained neural networks for ASR Workshop, 2018, pp. 174–181.
based on lattice-free MMI,” in Proc. Interspeech, 2016, doi:10.21437/ [228] S. Yoo, I. Song, and Y. Bengio, “A highly adaptive acoustic model for
Interspeech.2016-595. accurate multi-dialect speech recognition,” in Proc. IEEE Int. Conf.
[206] P. Smit, S. R. Gangireddy, S. Enarvi, S. Virpioja, and M. Kurimo, Acoust., Speech Signal Process., 2019, pp. 5716–5720.
“Aalto system for the 2017 Arabic multi-genre broadcast challenge,” [229] S. Ghorbani, A. E. Bulut, and J. H. Hansen, “Advancing multi-
in IEEE Autom. Speech Recognit. Understanding Workshop, 2017, accented LSTM-CTC speech recognition using a domain specific
pp. 338–345. student-teacher learning paradigm,” in Proc. IEEE Spoken Lang. Tech-
[207] S. Khurana, A. Ali, and J. Glass, “Darts: Dialectal arabic transcription nol. Workshop, 2018, pp. 29–35.
system,” 2019, arXiv:1909.12163. [230] A. Jain, V. P. Singh, and S. P. Rath, “A multi-accent acoustic model
[208] D. Vergyri, L. Lamel, and J.-L. Gauvain, “Automatic speech recogni- using mixture of experts for speech recognition,” in Proc. Interspeech,
tion of multiple accented English data,” in Proc. Interspeech, 2010, 2019, doi: 10.21437/Interspeech.2019-1667.
pp. 1652–1655. [231] H. Zhu, L. Wang, P. Zhang, and Y. Yan, “Multi-accent adap-
[209] Y. Zheng et al., “Accent detection and speech recognition for tation based on gate mechanism,” in Proc. Interspeech, 2019,
Shanghai-accented Mandarin,” in Proc. Interspeech, 2005, pp. 217– doi: 10.21437/Interspeech.2019-3155.
220. [232] G. I. Winata et al., “Learning fast adaptation on cross-accented speech
[210] L. W. Kat and P. Fung, “Fast accent identification and accented speech recognition,” in Proc. Interspeech, 2020, pp. 1276–1280.
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., [233] Y. Long, Y. Li, H. Ye, and H. Mao, “Domain adaptation of lattice-
1999, pp. 221–224. free MMI based TDNN models for speech recognition,” Int. J. Speech
[211] M. Liu, B. Xu, T. Hunng, Y. Deng, and C. Li, “Mandarin accent adap- Technol., 2017, pp. 171–178.
tation based on context-independent/context-dependent pronunciation [234] J. Fainberg, S. Renals, and P. Bell, “Factorised representations for
modeling,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., neural network adaptation to diverse acoustic environments,” in Proc.
2000, pp. II 1025–II 1028. Interspeech, 2017, doi: 10.21437/Interspeech.2017-1365.
VOLUME 2, 2021 63
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

[235] K. C. Sim et al., “Domain adaptation using factorized hidden layer [259] P. R. Clarkson and A. J. Robinson, “Language model adaptation using
for robust automatic speech recognition,” in Proc. Interspeech, 2018, mixtures and an exponentially decaying cache,” in Proc. IEEE Int.
pp. 892–896. Conf. Acoust., Speech Signal Process., 1997, pp. 799–802.
[236] J. Huang et al., “Cross-language transfer learning, continuous learn- [260] G. Tur and A. Stolcke, “Unsupervised language model adaptation for
ing, and domain adaptation for end-to-end automatic speech recogni- meeting recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
tion,” 2020, arXiv:2005.04290. Process., 2007, pp. IV-173–IV-176.
[237] S. Ueno et al., “Encoder transfer for attention-based acoustic-to-word [261] X. Liu, M. J. Gales, and P. C. Woodland, “Context dependent language
speech recognition,” in Proc. Interspeech, 2018, pp. 96–108. model adaptation,” in Proc. Interspeech, 2008, pp. 837–840.
[238] T. Moriya et al., “Progressive neural network-based knowledge trans- [262] R. Kuhn and R. De Mori, “A cache-based natural language model for
fer in acoustic models,” in IEEE Asia-Pacific Signal Inf. Process. speech recognition,” IEEE Pattern Anal. Mach. Intell., vol. 12, no. 6,
Assoc., 2018, pp. 998–1002. pp. 570–583, Jun. 1990.
[239] J. Li, R. Zhao, J.-T. Huang, and Y. Gong, “Learning small-size DNN [263] R. Kuhn and R. De Mori, “Corrections to “a cache-based language
with output-distribution-based criteria,” in Proc. Interspeech, 2014, model for speech recognition”,” IEEE Pattern Anal. Mach. Intell.,
pp. 1910–1914. vol. 14, no. 6, pp. 691–692, Jun. 1992.
[240] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a [264] M. Federico, “Bayesian estimation methods for n-gram language
neural network,” in Proc. NIPS Deep Learn. Representation Learn. model adaptation,” in Proc. Int. Conf. Spoken Lang. Process., 1996,
Workshop, 2015, pp. 1–9. pp. 240–243.
[241] J. Li et al., “Developing far-field speaker system via teacher-student [265] M. Bacchiani and B. Roark, “Unsupervised language model adapta-
learning,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., tion,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Hong
2018, pp. 5699–5703. Kong, 2003, pp. I–I.
[242] L. Mošner et al., “Improving noise robustness of automatic [266] K. Seymore, S. Chen, and R. Rosenfeld, “Nonlinear interpolation
speech recognition via parallel data and teacher-student learning,” of topic models for language model adaptation,” in Proc. Int. Conf.
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2019, Spoken Lang. Process., 1998.
pp. 6475–6479. [267] L. Chen, J.-L. Gauvain, L. Lamel, G. Adda, and M. Adda, “Using
[243] T. Asami, R. Masumura, Y. Yamaguchi, H. Masataki, and Y. Aono, information retrieval methods for language model adaptation,” in Proc.
“Domain adaptation of DNN acoustic models using knowledge distil- Interspeech, 2001, pp. 255–258.
lation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017, [268] B.-J. P. Hsu and J. Glass, “Style & topic language model adaptation
pp. 5185–5189. using HMM-LDA,” in Empirical Methods Nat. Lang. Process., 2006,
[244] Z. Meng, J. Li, Y. Zhao, and Y. Gong, “Conditional teacher-student pp. 373–381.
learning,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., [269] S. Huang and S. Renals, “Unsupervised language model adapta-
2019, pp. 6445–6449. tion based on topic and role information in multiparty meetings,”
[245] J. Li, M. L. Seltzer, X. Wang, R. Zhao, and Y. Gong, “Large-scale in Proc. Interspeech, 2008. [Online]. Available: http://hdl.handle.net/
domain adaptation via teacher-student learning,” in Proc. Interspeech, 1842/3835
2017, pp. 2386–2390. [270] R. Kneser, J. Peters, and D. Klakow, “Language model adaptation
[246] S. Watanabe, T. Hori, J. Le Roux, and J. R. Hershey, “Student-teacher using dynamic marginals,” in 5th EuroSpeech, 1997.
network learning with enhanced features,” in Proc. IEEE Int. Conf. [271] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural proba-
Acoust., Speech Signal Process., 2017, pp. 5275–5279. bilistic language model,” J. Mach. Learn. Res., vol. 3, pp. 1137–1155,
[247] J. H. Wong and M. J. Gales, “Sequence student-teacher train- 2003.
ing of deep neural networks,” in Proc. Interspeech, 2016, doi: [272] T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khu-
10.21437/Interspeech.2016-911. danpur, “Extensions of recurrent neural network language model,”
[248] V. Manohar, P. Ghahremani, D. Povey, and S. Khudanpur, “A teacher- in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2011,
student learning approach for unsupervised domain adaptation of pp. 5528–5531.
sequence-trained ASR models,” in Proc. IEEE Spoken Lang. Technol. [273] X. Chen et al., “Recurrent neural network language model adaptation
Workshop, 2018, pp. 250–257. for multi-genre broadcast speech recognition,” in Proc. Interspeech,
[249] Z. Meng, J. Li, Y. Gaur, and Y. Gong, “Domain adaptation via teacher- 2015, pp. 3511–3515.
student learning for end-to-end speech recognition,” in Proc. IEEE Au- [274] S. Deena, M. Hasan, M. Doulaty, O. Saz, and T. Hain, “Recurrent
tom. Speech Recognit. Understanding Workshop, 2019, pp. 268–275. neural network language model adaptation for multi-genre broadcast
[250] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation speech recognition and alignment,” IEEE/ACM Audio, Speech, Lang.
by backpropagation,” in Proc. Int. Conf. Mach. Learn., 2015, Process., vol. 27, no. 3, pp. 572–582, Mar. 2019.
pp. 1180–1189. [275] K. Li, H. Xu, Y. Wang, D. Povey, and S. Khudanpur, “Recurrent
[251] S. Sun, B. Zhang, L. Xie, and Y. Zhang, “An unsupervised deep do- neural network language model adaptation for conversational speech
main adaptation approach for robust speech recognition,” Neurocom- recognition,” in Interspeech, 2018, pp. 3373–3377.
puting, vol. 257, pp. 79–87, 2017, doi: 10.1016/j.neucom.2016.11.063. [276] T. Moriokal, N. Tawara, T. Ogawa, A. Ogawa, T. Iwata, and T.
[252] Z. Meng, Z. Chen, V. Mazalov, J. Li, and Y. Gong, “Unsupervised Kobayashi, “Language model domain adaptation via recurrent neural
adaptation with domain separation networks for robust speech recog- networks with domain-shared and domain-specific representations,”
nition,” in Proc. IEEE Autom. Speech Recognit. Understanding Work- in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018,
shop, 2017, pp. 214–221. pp. 6084–6088.
[253] P. Denisov, N. T. Vu, and M. F. Font, “Unsupervised domain adapta- [277] Y. Shi, M. Larson, and C. M. Jonker, “Recurrent neural network lan-
tion by adversarial learning for robust speech recognition,” in Speech guage model adaptation with curriculum learning,” Comput. Speech
Commun. 13th ITG-Symp., 2018, pp. 1–5. Lang., vol. 33, no. 1, pp. 136–154, 2015.
[254] Z. Meng, J. Li, Y. Gong, and B.-H. Juang, “Adversarial teacher-student [278] S. R. Gangireddy, P. Swietojanski, P. Bell, and S. Renals, “Unsuper-
learning for unsupervised domain adaptation,” in Proc. IEEE Int. Conf. vised adaptation of recurrent neural network language models,” in
Acoust., Speech Signal Process., 2018, pp. 5949–5953. Proc. Interspeech, 2016, doi: 10.21437/Interspeech.2016-1342.
[255] M. Mimura, S. Sakai, and T. Kawahara, “Cross-domain speech recog- [279] E. Grave, A. Joulin, and N. Usunier, “Improving neural language
nition using nonparallel corpora with cycle-consistent adversarial net- models with a continuous cache,” in Proc. Int. Conf. Learn. Repre-
works,” in Proc. IEEE Autom. Speech Recognit. Understanding Work- sentations, 2017, pp. 1–9.
shop, 2017, pp. 134–140. [280] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel
[256] E. Hosseini-Asl, Y. Zhou, C. Xiong, and R. Socher, “A multi- mixture models,” in Proc. Int. Conf. Learn. Representations, 2017,
discriminator CycleGAN for unsupervised non-parallel speech do- pp. 1–15.
main adaptation,” in Proc. Interspeech, 2018, pp. 3758–3762. [281] B. Krause, E. Kahembwe, I. Murray, and S. Renals, “Dynamic eval-
[257] Z. Meng, J. Li, Y. Gong, and B.-H. F. Juang, “Cycle-consistent speech uation of neural sequence models,” in Proc. Int. Conf. Mach. Learn.,
enhancement,” in Proc. Interspeech, 2018, pp. 1165–169. 2018, pp. 2766–2775.
[258] J. R. Bellegarda, “Statistical language model adaptation: Review and [282] C. Gulcehre et al., “On using monolingual corpora in neural machine
perspectives,” Speech Commun., vol. 42, no. 1, pp. 93–108, 2004. translation,” 2015, arXiv:1503.03535.

64 VOLUME 2, 2021
[283] A. Hannun, A. Maas, D. Jurafsky, and A. Ng, “First-pass large vo- [306] G. Gravier, G. Adda, N. Paulsson, M. Carré, A. Giraudel, and O.
cabulary continuous speech recognition using bi-directional recurrent Galibert, “The ETAPE corpus for the evaluation of speech-based TV
DNNs,” 2014, arXiv:1408.2873. content processing in the French language,” in Proc. Int. Conf. Lang.
[284] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prab- Resour. Eval., 2012, pp. 11–16.
havalkar, “An analysis of incorporating an external language model [307] Y. Liu, P. Fung, Y. Yang, C. Cieri, S. Huang, and D. Graff,
into a sequence-to-sequence model,” in Proc. IEEE Int. Conf. Acoust., “HKUST/MTS: A very large scale Mandarin telephone speech cor-
Speech Signal Process., 2018, pp. 1–5828. pus,” in Chinese Spoken Language Processing, Q. Huo, B. Ma, E.-S.
[285] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end- Chng, and H. Li, Eds., Berlin, Heidelberg, Germany: Springer, 2006,
to-end attention models for speech recognition,” in Proc. Interspeech, pp. 724–735.
2018, pp. 7–11. [308] P. Bell et al., “The MGB challenge: Evaluating multi-genre broadcast
[286] E. McDermott, H. Sak, and E. Variani, “A density ratio approach to media recognition,” in Proc. IEEE Autom. Speech Recognit. Under-
language model fusion in end-to-end automatic speech recognition,” in standing Workshop, 2015, pp. 687–693.
Proc. IEEE Autom. Speech Recognit. Understanding Workshop, 2019, [309] “RASC863: 863 annotated 4 regional accent speech corpus,”
pp. 434–441. 2003. [Online]. Available: http://www.chineseldc.org/doc/CLDC-
[287] E. Variani, D. Rybach, C. Allauzen, and M. Riley, “Hybrid autore- SPC-2004-005/intro.htm
gressive transducer (HAT),” in Proc. IEEE Int. Conf. Acoust., Speech [310] J. J. Godfrey, E. C. Holliman, and J. McDaniel, “SWITCH-
Signal Process., 2020, pp. 6139–6143. BOARD: Telephone speech corpus for research and develop-
[288] M. Kim, Y. Kim, J. Yoo, J. Wang, and H. Kim, “Regularized speaker ment,” in IEEE Int. Conf. Acoust., Speech Signal Process., 1992,
adaptation of KL-HMM for dysarthric speech recognition,” IEEE Neu- pp. 517–520.
ral Syst. Rehabil. Eng., vol. 25, no. 9, pp. 1581–1591, Sep. 2017. [311] M. Cettolo, C. Girardi, and M. Federico, “WIT3: Web inventory of
[289] M. Kitza, R. Schlüter, and H. Ney, “Comparison of BLSTM-layer- transcribed and translated talks,” in Eur. Assoc. Mach. Trans., 2012,
specific affine transformations for speaker adaptation,” in Proc. Inter- pp. 261–268.
speech, 2018, pp. 877–881. [312] A. Rousseau, P. Deléglise, and Y. Estève, “TED-LIUM: an automatic
[290] C. Liu, Y. Wang, K. Kumar, and Y. Gong, “Investigations on speech recognition dedicated corpus,” in Proc. Int. Conf. Lang. Re-
speaker adaptation of LSTM RNN models for speech recognition,” sourc. Eval., 2012, pp. 125–129.
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2016, [313] A. Rousseau, P. Deléglise, and Y. Estève, “Enhancing the TED-LIUM
pp. 5020–5024. corpus with selected data for language modeling and more TED talks,”
[291] H. Seki, K. Yamamoto, T. Akiba, and S. Nakagawa, “Rapid speaker in Proc. Int. Conf. Lang. Resour. Eval., 2014, pp. 3935–3939.
adaptation of neural network based filterbank layer for automatic [314] J. S. Garofolo et al., “TIMIT acoustic phonetic continuous speech
speech recognition,” in Proc. IEEE Spoken Lang. Technol. Workshop, corpus,” Linguistic Data Consortium, LDC93S1, Philadelphia, PA,
2018, pp. 574–580. USA, 1993. [Online]. Available: https://doi.org/10.35111/17gk-bn40
[292] R. Serizel and D. Giuliani, “Deep neural network adaptation for chil- [315] D. B. Paul and J. Baker, “The design for the wall street journal-based
dren’s and adults’ speech recognition,” in Proc. Ital. Conf. Comput. CSR corpus,” in Proc. Speech Nat. Workshop, 1992, pp. 357–362.
Linguistics, 2014. [316] A. Batliner et al., “The PF_STAR children’s speech corpus,” in Proc.
[293] P. Swietojanski and S. Renals, “Differentiable pooling for unsuper- Interspeech, 2005, pp. 3761–3764.
vised acoustic model adaptation,” IEEE/ACM Audio, Speech, Lang. [317] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech:
Process., vol. 24, no. 10, pp. 1773–1784, Oct. 2016. An ASR corpus based on public domain audio books,” in Proc. IEEE
[294] C. Zhang and P. C. Woodland, “DNN speaker adaptation us- Int. Conf. Acoust., Speech Signal Process., 2015, pp. 5206–5210.
ing parameterised sigmoid and ReLU hidden activation func- [318] P. Karanasou, C. Wu, M. Gales, and P. C. Woodland, “I-vectors and
tions,” in IEEE Int. Conf. Acoust., Speech Signal Process., 2016, structured neural networks for rapid adaptation of acoustic models,”
pp. 5300–5304. IEEE/ACM Audio, Speech, Lang. Process., vol. 25, no. 4, pp. 818–828,
[295] S. Ghorbani and J. H. Hansen, “Leveraging native language informa- Apr. 2017.
tion for improved accented speech recognition,” in Proc. Interspeech, [319] A.-r. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hin-
2018, pp. 2449–2453. ton, and M. A. Picheny, “Deep belief networks using discriminative
[296] V. Gupta, P. Kenny, P. Ouellet, and T. Stafylakis, “I-vector-based features for phone recognition,” in Proc. IEEE Int. Conf. Acoust.,
speaker adaptation of deep neural networks for French broadcast audio Speech Signal Process., 2011, pp. 5060–5063.
transcription,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Pro- [320] S. M. Siniscalchi, J. Li, and C.-H. Lee, “Hermitian polynomial
cess., 2014, pp. 6334–6338. for speaker adaptation of connectionist speech recognition systems,”
[297] P. C. Woodland et al., “Cambridge University transcription systems for IEEE Audio, Speech Lang. Process., vol. 21, no. 10, pp. 2152–2161,
the multi-genre broadcast challenge,” in IEEE Autom. Speech Recog- Oct. 2013.
nit. Understanding, 2015, pp. 639–646. [321] O. Abdel-hamid and H. Jiang, “Rapid and effective speaker adaptation
[298] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL-2: Transforming Man- of convolutional neural network based models for speech recognition,”
darin ASR research into industrial scale,” 2018, arXiv:1808.10583. in Proc. Interspeech, 2013, pp. 1248–1252.
[299] J. Carletta, “Unleashing the killer corpus: experiences in creating the [322] P. Swietojanski and S. Renals, “Differentiable pooling for unsuper-
multi-everything AMI meeting corpus,” Lang. Resour. Eval., vol. 41, vised speaker adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech
no. 2, pp. 181–190, 2007. Signal Process., 2015, pp. 4305–4309.
[300] B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, and M. [323] P. Swietojanski and S. Renals, “SAT-LHUC: Speaker adaptive train-
Omologo, “Speaker independent continuous speech recognition using ing for learning hidden unit contributions,” in Proc. IEEE Int. Conf.
an acoustic-phonetic Italian corpus,” in Proc. Int. Conf. Spoken Lang. Acoust., Speech Signal Process., 2016, pp. 5010–5014.
Process., 1994, pp. 1391–1394. [324] N. Tomashenko, Y. Khokhlov, A. Larcher, and Y. Estève, “Exploring
[301] N. Parihar, J. Picone, D. Pearce, and H. G. Hirsch, “Performance GMM-derived features for unsupervised adaptation of deep neural
analysis of the Aurora large vocabulary baseline system,” in Proc. Eur. network acoustic models,” Speech Comput., 2016, pp 304–311.
Signal Process. Conf., 2004, pp. 553–556. [325] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. Moreno, E. We-
[302] “CASIA Northern accent speech corpus,” 2003. [Online]. Available: instein, and K. Rao, “Multilingual speech recognition with a single
http://www.chineseldc.org/doc/CLDC-SPC-2004-015/intro.htm end-to-end model,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
[303] D. Giuliani and M. Gerosa, “Investigating recognition of children’s Process., 2018, pp. 4904–4908.
speech,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., [326] A. Kannan et al., “Large-scale multilingual speech recognition with a
2003, pp. II–137. streaming end-to-end model,” in Proc. Interspeech, 2019, pp. 2130–
[304] E. Vincent, S. Watanabe, A. A. Nugraha, J. Barker, and R. Marxer, 2134.
“An analysis of environment, microphone and data simulation mis- [327] O. Adams, M. Wiesner, S. Watanabe, and D. Yarowsky, “Mas-
matches in robust speech recognition,” Comput. Speech Lang., vol. 46, sively multilingual adversarial speech recognition,” in NAACL/HLT,
pp. 535–557, 2017. 2019.
[305] K. Maekawa, “Corpus of spontaneous Japanese: Its design and evalu- [328] V. Pratap et al., “Massively multilingual ASR: 50 languages, 1 model,
ation,” in Proc. ISCA IEEE Workshop Spontaneous Speech, 2003. 1 billion parameters,” in Proc. Interspeech, 2020, pp. 4751–4755.

VOLUME 2, 2021 65
BELL ET AL.: ADAPTATION ALGORITHMS FOR NEURAL NETWORK-BASED SPEECH RECOGNITION: AN OVERVIEW

[329] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, JINYU LI (Member, IEEE) received the Ph.D.
“Unsupervised cross-lingual representation learning for speech recog- degree from the Georgia Institute of Technol-
nition,” 2020, arXiv:2006.13979. ogy in 2008. From 2000 to 2003, he was a Re-
[330] K. Kawakami, L. Wang, C. Dyer, P. Blunsom, and A. V. Oord, “Learn- searcher with the Intel China Research Center and
ing robust and multilingual speech representations,” in Proc. Conf. Research Manager in iFlytek, China. Currently,
Empirical Methods Nat. Lang. Process., 2020, pp. 1182–1192. he is a Partner Applied Scientist with Microsoft
[331] S. Pascual, M. Ravanelli, J. Serrà, A. Bonafonte, and Y. Bengio, Corporation, leading a team to design and im-
“Learning problem-agnostic speech representations from multiple prove speech modeling algorithms and technolo-
self-supervised tasks,” in Proc. Interspeech, 2019, pp. 161–165. gies that ensure industry state-of-the-art speech
[332] M. Ravanelli et al., “Multi-task self-supervised learning for robust recognition accuracy for Microsoft products. Dr.
speech recognition,” in IEEE Int. Conf. Acoust., Speech Signal Pro- Li is a member of the IEEE Speech and Lan-
cess., 2020, pp. 6989–6993. guage Processing Technical Committee. He also serves as Associate Editor
for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE
PETER BELL (Associate Member, IEEE) received PROCESSING.
the B.A. degree in mathematics in 2002 and the
M.Phil. degree in computer speech, text and In-
ternet technology in 2005 from the University of
Cambridge, and the Ph.D. degree in automatic
speech recognition from the University of Edin-
burgh, in 2010. He is a reader in speech technol- STEVE RENALS (Fellow, IEEE) received the
ogy with the School of Informatics, University of B.Sc. degree in chemistry from the University of
Edinburgh. His research interests include domain Sheffield, in 1986 and the M.Sc. degree in artifi-
adaptation, regularization, and low-resource meth- cial intelligence in 1987 and the Ph.D. degree in
ods for acoustic modeling. neural networks and speech recognition from the
University of Edinburgh, in 1991. He is Professor
of speech technology with the School of Infor-
matics, University of Edinburgh, having previously
JOACHIM FAINBERG (Member, IEEE) received
held positions at ICSI Berkeley, the University of
the B.Mus. degree in music and sound record-
Cambridge, and the University of Sheffield. His re-
ing (Tonmeister) from the University of Surrey in
search interests include speech recognition, spoken
2014, the M.Sc. degree in artificial intelligence in language processing, neural networks, and machine learning. Dr Renals is a
2015 and the Ph.D. degree in automatic speech
fellow of ISCA (2016) and a Senior Area Editor of the IEEE OPEN JOURNAL
recognition in 2020 from the University of Edin-
OF SIGNAL PROCESSING.
burgh. He is currently with the Machine Learn-
ing Center of Excellence, JPMorgan Chase. His
research interests include domain adaptation and
training methods for acoustic modeling.

PAWEL SWIETOJANSKI (Member, IEEE) re-

ONDREJ KLEJCH (Member, IEEE) received ceived the M.Sc. degree from AGH-UST, Cra-
the B.Sc. degree in 2013, the M.Sc. degree cow, Poland and the Ph.D. degree in computer
in 2015 in computer science from the Charles science from the University of Edinburgh, U.K.
University in Prague, Czech Republic, and He is currently a Research Scientist with Apple,
the Ph.D. degree in automatic speech recog- and in the past held academic and research po-
nition in 2020 from the University of Edin- sitions, UNSW Sydney and University of Edin-
burgh. He is a Postdoctoral Researcher in au- burgh. His main research interests include machine
tomatic speech recognition with the Univer- learning and its applications in spoken language
sity of Edinburgh. His research interests include processing, with a particular focus on learning neu-
acoustic model adaptation using meta-learning ral representations for acoustic modeling in speech
approaches. recognition.

66 VOLUME 2, 2021

Survey of Deep Learning Paradigms For Speech Processing
No ratings yet
Survey of Deep Learning Paradigms For Speech Processing
37 pages
A Comparative Analysis of Domain Adaptation Techniques for Recognition of Accented Speech
No ratings yet
A Comparative Analysis of Domain Adaptation Techniques for Recognition of Accented Speech
6 pages
An Unsupervised Deep Domain Adaptation Approach For Robust Speech Recognition PDF
No ratings yet
An Unsupervised Deep Domain Adaptation Approach For Robust Speech Recognition PDF
12 pages
End-to-End Automatic Speech Recognition
No ratings yet
End-to-End Automatic Speech Recognition
19 pages
02
No ratings yet
02
8 pages
Domain Adap Asr 5
No ratings yet
Domain Adap Asr 5
6 pages
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
No ratings yet
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
10 pages
Audio Visual Speech Recognition: Advancements, Applications, and Insights
From Everand
Audio Visual Speech Recognition: Advancements, Applications, and Insights
Fouad Sabry
No ratings yet
Transformer-Transducer End-to-End Speech Recognition with Self-Attention
No ratings yet
Transformer-Transducer End-to-End Speech Recognition with Self-Attention
5 pages
Speech-to-Text Systems and Technologies: Definitive Reference for Developers and Engineers
From Everand
Speech-to-Text Systems and Technologies: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
SPEAKER ADAPTATION FOR END-TO-END SPEECH RECOGNITION SYSTEMS IN NOISY ENVIRONMENTS copy
No ratings yet
SPEAKER ADAPTATION FOR END-TO-END SPEECH RECOGNITION SYSTEMS IN NOISY ENVIRONMENTS copy
6 pages
Te Belski S 1995
No ratings yet
Te Belski S 1995
190 pages
Speech Recognition Using Neural Networks, PHD Thesis (1995)
No ratings yet
Speech Recognition Using Neural Networks, PHD Thesis (1995)
190 pages
Text-to-Speech Systems and Algorithms: Definitive Reference for Developers and Engineers
From Everand
Text-to-Speech Systems and Algorithms: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Recent Progresses in Deep Learning Based Acoustic Models: Dong Yu and Jinyu Li
No ratings yet
Recent Progresses in Deep Learning Based Acoustic Models: Dong Yu and Jinyu Li
14 pages
TASLP2339736 Proof 2
No ratings yet
TASLP2339736 Proof 2
26 pages
Linguistic-Coupled Age-to-Age Voice Translation To Improve Speech Recognition Performance in Real Environments
No ratings yet
Linguistic-Coupled Age-to-Age Voice Translation To Improve Speech Recognition Performance in Real Environments
11 pages
Survey On Speech Imitation Using Machine Learning: Rahul Kumar, Jaybrata Chakraborty and Bappaditya Chakraborty
No ratings yet
Survey On Speech Imitation Using Machine Learning: Rahul Kumar, Jaybrata Chakraborty and Bappaditya Chakraborty
5 pages
Deep Neural Networks
No ratings yet
Deep Neural Networks
16 pages
Data-Driven Neural Network Based Feature - Phd-Thesis
No ratings yet
Data-Driven Neural Network Based Feature - Phd-Thesis
155 pages
Asru 2013 PDF
No ratings yet
Asru 2013 PDF
6 pages
Speech Recognition
No ratings yet
Speech Recognition
188 pages
Progress - Report - of - Intership MD Shams Alam
No ratings yet
Progress - Report - of - Intership MD Shams Alam
4 pages
116.00000050
No ratings yet
116.00000050
64 pages
Domain Adap Asr 2
No ratings yet
Domain Adap Asr 2
5 pages
Performance - Evaluation - of - Recurrent - Neural - Networks-LSTM - and - GRU - For ASR - IC2E3
No ratings yet
Performance - Evaluation - of - Recurrent - Neural - Networks-LSTM - and - GRU - For ASR - IC2E3
6 pages
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Editor in Chief,+recurrent Neural Networks in Automatic Speech Recognition
No ratings yet
Editor in Chief,+recurrent Neural Networks in Automatic Speech Recognition
8 pages
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
No ratings yet
Thesis Attention-Based Encoder-Decoder Models for Speech Processing
219 pages
Ann LA2 Project
No ratings yet
Ann LA2 Project
23 pages
Approaches For Neural-Network Language Model Adaptation: August 2017
No ratings yet
Approaches For Neural-Network Language Model Adaptation: August 2017
6 pages
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
No ratings yet
Speech Representation Models For Speech Synthesis and Multimodal Speech Recognition
63 pages
Automatic Speech Recognition Using Deep Neural Networks
No ratings yet
Automatic Speech Recognition Using Deep Neural Networks
6 pages
Applsci 12 01091
No ratings yet
Applsci 12 01091
18 pages
10 - Recurrent Neural Network Based Speech Emotion
No ratings yet
10 - Recurrent Neural Network Based Speech Emotion
13 pages
Speech Recognition As Emerging Revolutionary Technology
No ratings yet
Speech Recognition As Emerging Revolutionary Technology
4 pages
Multi-View Self-Supervised Learning and Multi-Scale Feature Fusion For Automatic Speech Recognition
No ratings yet
Multi-View Self-Supervised Learning and Multi-Scale Feature Fusion For Automatic Speech Recognition
20 pages
Speech Recognition Using Artificial Neural Network: - A Review
100% (1)
Speech Recognition Using Artificial Neural Network: - A Review
4 pages
End-to-End Speech Recognition: A Survey
No ratings yet
End-to-End Speech Recognition: A Survey
27 pages
Adwait Naik - Antyplagiat Raport
No ratings yet
Adwait Naik - Antyplagiat Raport
31 pages
Intelligent Speech Recognition Algorithm in Multimedia Visual Interaction Via BiLSTM and Attention Mechanism
No ratings yet
Intelligent Speech Recognition Algorithm in Multimedia Visual Interaction Via BiLSTM and Attention Mechanism
13 pages
j9
No ratings yet
j9
37 pages
10.2478 - Jaiscr 2019 0006
No ratings yet
10.2478 - Jaiscr 2019 0006
11 pages
s40537-025-01090-0
No ratings yet
s40537-025-01090-0
31 pages
Incorporating Knowledge Sources Into Statistical Speech Recognition
No ratings yet
Incorporating Knowledge Sources Into Statistical Speech Recognition
20 pages
Automatic Speech Recognition Using Deep Neural Networks
No ratings yet
Automatic Speech Recognition Using Deep Neural Networks
6 pages
Malaysian Journal of Computer Science
No ratings yet
Malaysian Journal of Computer Science
14 pages
Speech Recognition Seminar
No ratings yet
Speech Recognition Seminar
19 pages
Full Text 01
No ratings yet
Full Text 01
54 pages
(IJCST-V4I2P62) :Dr.V.Ajantha Devi, Ms.V.Suganya
No ratings yet
(IJCST-V4I2P62) :Dr.V.Ajantha Devi, Ms.V.Suganya
6 pages
Visual Word: Unlocking the Power of Image Understanding
From Everand
Visual Word: Unlocking the Power of Image Understanding
Fouad Sabry
No ratings yet
Human Visual System Model: Understanding Perception and Processing
From Everand
Human Visual System Model: Understanding Perception and Processing
Fouad Sabry
No ratings yet
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Mastering Deep Learning with Keras: From Basics to Expert Proficiency
From Everand
Mastering Deep Learning with Keras: From Basics to Expert Proficiency
William Smith
No ratings yet
Machine Learning Methods for Engineering Application Development
From Everand
Machine Learning Methods for Engineering Application Development
Prasad Lokulwar
No ratings yet
Advanced Deep Learning Techniques for Natural Language Understanding: A Comprehensive Guide
From Everand
Advanced Deep Learning Techniques for Natural Language Understanding: A Comprehensive Guide
Adam Jones
No ratings yet
Foundational Models and Architectures S1: Generative AI, #1
From Everand
Foundational Models and Architectures S1: Generative AI, #1
Leaster Startx
No ratings yet
SONET Architecture and Implementation: Definitive Reference for Developers and Engineers
From Everand
SONET Architecture and Implementation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Voice Technologies and Systems: Definitive Reference for Developers and Engineers
From Everand
Voice Technologies and Systems: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Automatic Target Recognition: Fundamentals and Applications
From Everand
Automatic Target Recognition: Fundamentals and Applications
Fouad Sabry
No ratings yet
Учебное пособие
No ratings yet
Учебное пособие
147 pages
MBA Course Outline Booklet
No ratings yet
MBA Course Outline Booklet
179 pages
Tugas Ku Listening
No ratings yet
Tugas Ku Listening
11 pages
The Impact of Globalization On Cross-Cultural Communication
No ratings yet
The Impact of Globalization On Cross-Cultural Communication
13 pages
Bilingual Language Profile
No ratings yet
Bilingual Language Profile
4 pages
PTE Essay Solution Book
100% (1)
PTE Essay Solution Book
53 pages
Conditional Type 1,2,3: Group 4
100% (1)
Conditional Type 1,2,3: Group 4
21 pages
Mca Jntu-Anantapur Syllabus
No ratings yet
Mca Jntu-Anantapur Syllabus
124 pages
Muet Lesson Plan
No ratings yet
Muet Lesson Plan
2 pages
Campione! 13 - Princess Goddess of The South Seas
No ratings yet
Campione! 13 - Princess Goddess of The South Seas
314 pages
Congratulation Compliment 2
No ratings yet
Congratulation Compliment 2
11 pages
Problem Solving and Algorithms: Problems, Solutions, and Tools
No ratings yet
Problem Solving and Algorithms: Problems, Solutions, and Tools
12 pages
Yingjin Wang Resume
No ratings yet
Yingjin Wang Resume
2 pages
Sounding With Syllables: Lesson Plan
No ratings yet
Sounding With Syllables: Lesson Plan
3 pages
Straggling Through Fire An Anthology of Proemistry First Edition Ghulam Murtaza Aatir instant download
No ratings yet
Straggling Through Fire An Anthology of Proemistry First Edition Ghulam Murtaza Aatir instant download
66 pages
Sadlier Unit 5
No ratings yet
Sadlier Unit 5
2 pages
[English (auto-generated)] Learn English With Podcast Conversation Episode 1 _ English Podcast For Beginners #englishpodcast [DownSub.com]
No ratings yet
[English (auto-generated)] Learn English With Podcast Conversation Episode 1 _ English Podcast For Beginners #englishpodcast [DownSub.com]
7 pages
The Impact of Cultural Folklore On National Values: A Preliminary Study With A Focus On Bhutan
No ratings yet
The Impact of Cultural Folklore On National Values: A Preliminary Study With A Focus On Bhutan
18 pages
Literature of The Philippines: Rhyme and Rhythm
No ratings yet
Literature of The Philippines: Rhyme and Rhythm
8 pages
Tws 4 Assessment Plan
No ratings yet
Tws 4 Assessment Plan
5 pages
Exhibit 1
No ratings yet
Exhibit 1
38 pages
Animals and Habitats British English Student A1 A2
No ratings yet
Animals and Habitats British English Student A1 A2
7 pages
Quiz in Campus Journalism
No ratings yet
Quiz in Campus Journalism
2 pages
Assignment Packet Hs Spanish 1 Term3 Mrs Navarro
100% (1)
Assignment Packet Hs Spanish 1 Term3 Mrs Navarro
7 pages
2504.08543v1
No ratings yet
2504.08543v1
7 pages
source 2
No ratings yet
source 2
5 pages
Test Questions Assesment Portfolio
No ratings yet
Test Questions Assesment Portfolio
13 pages
CP Lab Manual FInal.
No ratings yet
CP Lab Manual FInal.
52 pages
Esperanto Vocabulary List
No ratings yet
Esperanto Vocabulary List
20 pages
Kisi-Kisi Soal BING Genap 2
No ratings yet
Kisi-Kisi Soal BING Genap 2
2 pages

Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview

Uploaded by

Adaptation Algorithms for Neural Network-Based Speech Recognition an Overview

Uploaded by

Received 11 August 2020; revised 1 December 2020; accepted 9 December 2020.

Date of publication 16 December 2020;

Adaptation Algorithms for Neural

and p(rs ) = N (rs ; μ0 , γ0 ), which results in the expectation

E[rs | Dadapt ] = μs . (50)

The parameters are computed using gradient descent with a

VIII. VARIANT OBJECTIVE FUNCTIONS

when they are used to estimate separate acoustic models). A

FIGURE 10. Comparison of adaptation results for FF and RNN

FIGURE 11. Comparison of adaptation results for FF and RNN

at the domain level (i.e. tailoring the acoustic model to better

expected, as hybrid systems benefit from strong inductive

FIGURE 13. Comparison of adaptation results for different adaptation

As shown in Fig. 17 GMMEmb, NNEmb and NNTrans-

FIGURE 19. Comparison of adaptation results for SAT vs Test-only modes.

FIGURE 21. Comparison of adaptation results for acoustic models trained

model’s predictions such that they do not deviate too much

FIGURE 22. Comparison of adaptation results for different architectures.

Fig. 21 further summarizes the adaptability of acoustic

FIGURE 26. Performance of adaptation techniques as obtained on public

XIV. SUMMARY AND DISCUSSION

of limited memory and computation power [119]. Another

PAWEL SWIETOJANSKI (Member, IEEE) re-

You might also like