Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
ABSTRACT We present a structured overview of adaptation algorithms for neural network-based speech
recognition, considering both hybrid hidden Markov model / neural network systems and end-to-end neu-
ral network systems, with a focus on speaker adaptation, domain adaptation, and accent adaptation. The
overview characterizes adaptation algorithms as based on embeddings, model parameter adaptation, or data
augmentation. We present a meta-analysis of the performance of speech recognition adaptation algorithms,
based on relative error rate reductions as reported in the literature.
INDEX TERMS Accent adaptation, data augmentation, domain adaptation, regularization, semi-supervised
learning, speaker adaptation, speaker embeddings, speech recognition, structured linear transforms.
I. INTRODUCTION

The performance of automatic speech recognition (ASR) systems has improved dramatically in recent years thanks to the availability of larger training datasets, the development of neural network based models, and the computational power to train such models on these datasets [1]–[4]. However, the performance of ASR systems can still degrade rapidly when their conditions of use (test conditions) differ from the training data. There are several causes for this, including speaker differences, variability in the acoustic environment, and the domain of use.

Adaptation algorithms attempt to alleviate the mismatch between the test data and an ASR system's training data. Adapting an ASR system is a challenging problem since it requires the modification of large and complex models, typically using only a small amount of target data and without explicit supervision. Speaker adaptation – adapting the system to a target speaker – is the most common form of adaptation, but there are other important adaptation targets such as the domain of use, and the spoken accent. Much of the work in the area has focused on speaker adaptation: it is the case that many approaches developed for speaker adaptation do not explicitly model speaker characteristics, and can be applied to other adaptation targets. Thus our core treatment of adaptation algorithms is in the context of speaker adaptation, with a later discussion of particular approaches for domain adaptation and accent adaptation. Specifically, domain adaptation in this paper refers to the task of adapting the models to a target domain that has either acoustic or content mismatch from the source domain in which the models were trained.

This overview focuses on the adaptation of neural network (NN) based speech recognition systems, although we briefly discuss earlier approaches to speaker adaptation of hidden Markov model (HMM) based systems. NN-based systems [1], [5], [6] have revolutionized the field of speech recognition, and there has been intense activity in the development of adaptation algorithms for such systems.
FIGURE 1. NN architectures used for hybrid NN/HMM and end-to-end (CTC, RNN-T, AED) speech recognition systems: (a) Scheme of NN architecture used
for NN/HMM hybrid systems and for connectionist temporal classification (CTC); (b) architecture for the RNN Transducer (RNN-T); (c) architecture for
attention based encoder-decoder (AED) end-to-end systems. Input acoustic feature vectors are denoted by xt ; hidden layers are denoted by ht , hu and
output labels by yt , yu depending on whether they are indexed by time t (in hybrid and CTC systems) or only by output label u (in parts of RNN-T and AED
systems). In practice, the encoders use a wide temporal context as input, even the whole acoustic sequence in the case of most CTC and AED models.
Adaptation of NN-based speech recognition is an exciting research area for at least two reasons: from a practical point of view, it is important to be able to adapt state-of-the-art systems; and from a theoretical point of view the fact that NNs require fewer constraints on the input than a Gaussian-based system, along with the gradient-based discriminative training which is at the heart of most NN-based speech recognition systems, opens a range of possible adaptation algorithms.

A. NN/HMM HYBRID SYSTEMS

Neural networks were first applied to speech recognition as so-called NN/HMM hybrid systems, in which the neural network is used to estimate (scaled) likelihoods that act as the HMM state observation probabilities [5] (Fig. 1(a)). During the 1990s both feed-forward networks [5] and recurrent neural networks (RNNs) [7] were used in such hybrid systems and close to state-of-the-art results were obtained [8]. These systems were largely context-independent, although context-dependent NN-based acoustic models were also explored [9].

The modeling power of neural network systems at that time was computationally limited, and they were not able to achieve the precise levels of modeling obtained using context-dependent GMM-based HMM systems, which became the dominant approach. However, increases in computational power enabled deeper neural network models to be learned along with context-dependent modeling using the same number of context-dependent HMM tied states (senones) as GMM-based systems [1], [2]. This led to the development of systems surpassing the accuracy of GMM-based systems. This increase in computational power also enabled more powerful neural network models to be employed, in particular time-delay neural networks (TDNNs) [10], [11], convolutional neural networks (CNNs) [12], [13], long short-term memory (LSTM) RNNs [14], [15], and bidirectional LSTMs [16], [17].

B. END-TO-END SYSTEMS

Since 2015, there has been a significant trend in the field moving from hybrid HMM/NN systems to end-to-end (E2E) NN modeling [4], [6], [18]–[24] for ASR. E2E systems are characterized by the use of a single model transforming the input acoustic feature stream to a target stream of output tokens, which might be constructed of characters, subwords, or even words. E2E models are optimized using a single objective function, rather than comprising multiple components (acoustic model, language model, lexicon) that are optimized individually. Currently, the most widely used E2E models are connectionist temporal classification (CTC) [25], [26], the RNN Transducer (RNN-T) model [21], [27], and the attention-based encoder-decoder (AED) model [6], [18].

CTC and the RNN-T both map an input speech feature sequence to an output label sequence, where the label sequence (typically characters) is considerably shorter than the input sequence. Both of these architectures use an additional blank output token to deal with the sequence length differences, with an objective function which sums over all possible alignments using the forward-backward algorithm [28]. CTC is an earlier, and simpler, method which assumes frame independence and functions similarly to the acoustic model in hybrid systems without modeling the linguistic dependency across words; its architecture is similar to that of the neural network in the hybrid system (Fig. 1(a)).
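To make the role of the blank token concrete, the following Python sketch (our own illustration; the helper name and example strings are invented) implements the standard CTC collapsing rule, which maps a frame-level alignment to a label sequence by first merging repeated symbols and then removing blanks.

# Minimal sketch of the CTC collapsing function: merge repeats, then drop blanks.
# Symbols and sequences here are illustrative, not taken from any specific system.

BLANK = "_"  # CTC blank token

def collapse_ctc(alignment):
    """Map a frame-level CTC alignment to its label sequence."""
    collapsed = []
    prev = None
    for sym in alignment:
        if sym != prev:          # merge consecutive repeats
            if sym != BLANK:     # drop blank tokens
                collapsed.append(sym)
        prev = sym
    return collapsed

if __name__ == "__main__":
    # Two different 8-frame alignments that collapse to the same 3-label output "cat":
    print(collapse_ctc(list("cc_aa_tt")))   # ['c', 'a', 't']
    print(collapse_ctc(list("_c_a__tt")))   # ['c', 'a', 't']

The CTC objective sums the probabilities of all frame-level alignments that collapse to the reference label sequence, which is what the forward-backward recursion computes efficiently.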
An RNN-T (Fig. 1(b)) combines an additional prediction network with the acoustic encoder. The prediction network is an RNN modeling linguistic dependencies whose input is the previously output symbol. It is possible to initialize some of its layers from an external language model trained on additional text data. The acoustic encoder and the prediction network are combined using a feed-forward joint network followed by a softmax to predict the next output token given the speech input and the linguistic context.

Together, the RNN-T's prediction and joint networks may be regarded as a decoder, and we can view the RNN-T as a form of encoder-decoder system. The AED architecture (Fig. 1(c)) enriches the encoder-decoder model with an additional attention network which interfaces the acoustic encoder with the decoder. The attention network operates on the entire sequence of encoder representations for an utterance, offering the decoder considerably more flexibility. A detailed comparison of popular E2E models in both streaming and non-streaming modes with large scale training data was conducted by Li et al. [29]. It is worth noting that with the recent success in machine translation, there is a trend of using the transformer model [30] to replace LSTMs for both the AED [31]–[33] and RNN-T models [34]–[36].

C. ADAPTATION AND TRANSFER LEARNING IN RELATED FIELDS

Adaptation and transfer learning have become important and intensively researched topics in other areas related to machine learning, most notably computer vision and natural language processing (NLP). In both these cases the motivation is to train powerful base models using large amounts of training data, then to adapt these to specific tasks or domains, for which considerably less training data is available.

In computer vision, the base model is typically a large convolutional network trained to perform image classification or object recognition using the ImageNet database [37], [38]. The ImageNet model is then adapted to a lower resource task, such as computer-aided detection in medical imaging [39]. Kornblith et al. [40] have investigated empirically how well ImageNet models transfer to different tasks and datasets.

Transfer learning in NLP differs from computer vision, and from the speech recognition approaches discussed in this paper, in that the base model is trained in an unsupervised fashion to perform language modeling or a related task, typically using web-crawled text data. Base models used for NLP include the bidirectional LSTM [41] and Transformers which make use of self-attention [42], [43]. These models are then trained on specific NLP tasks, with supervised training data, which is specified in a common format (e.g. text-to-text transfer [43]), often trained in a multi-task setting. Earlier adaptation approaches in NLP focused on feature adaptation (e.g. [44]), but more recently better results have been obtained using model-based adaptation, for instance "adapter layers" [43], [45], in which trainable transform layers are inserted into the pretrained base model.

More broadly there has been extensive work on domain adaptation and transfer learning in machine learning, reviewed by Kouw and Loog [46]. This includes work on few-shot learning [47]–[49] and normalizing flows [50], [51]. Normalizing flows, which provide a probabilistic framework for feature transformations, were first developed for speech recognition as Gaussianization [52], and more recently have been applied to speech synthesis [53] and voice transformation [54].

D. STRUCTURE OF THIS REVIEW

We begin by considering the issues of identifying suitable data and target labels to adapt to in Section II. After discussing speaker adaptation of non NN-based HMM systems in Section III, we present a general framework for adaptation of NN-based speech recognition systems (both hybrid and E2E) in Section IV, where we organize adaptation algorithms into three general categories: embedding-based approaches (discussed in Section V), model-based approaches (discussed in Secs. VI–VIII), and data augmentation approaches (discussed in Section IX).

As mentioned above, most of our treatment of adaptation algorithms is in the context of speaker adaptation. In Secs. X and XI we discuss specific approaches to accent adaptation and domain adaptation respectively.

Our primary focus is on the adaptation of acoustic models and end-to-end models. In Section XII we provide a summary of work in language model (LM) adaptation, mentioning both n-gram and neural network language models, and the use of LM adaptation in E2E systems.

Finally we provide a meta-analysis of experimental studies using the main adaptation algorithms that we have discussed (Section XIII). The meta-analysis is based on experiments reported in 47 papers, carried out using 38 datasets, and is primarily based on the relative error rate reduction arising from adaptation approaches. In this section we analyze the performance of the main adaptation algorithms across a variety of adaptation target types (for instance speaker, domain, and accent), in supervised and unsupervised settings, in six different languages, and using six different NN model types in both hybrid and E2E settings. Raw data, aggregated results and the corresponding scripts are available at https://github.com/pswietojanski/ojsp_adaptation_review_2020.
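For readers who want to reproduce the headline metric of the meta-analysis, the snippet below shows one conventional way to compute a relative error rate reduction from baseline and adapted word error rates; the function name and the example numbers are ours, not taken from the released scripts.

def relative_error_rate_reduction(wer_baseline, wer_adapted):
    """Relative reduction in error rate, e.g. 20.0 -> 17.0 gives 15%."""
    if wer_baseline <= 0:
        raise ValueError("Baseline WER must be positive")
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline

# Illustrative values only: a speaker-independent baseline at 20.0% WER
# and a speaker-adapted system at 17.0% WER give a 15.0% relative reduction.
print(relative_error_rate_reduction(20.0, 17.0))  # 15.0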
II. IDENTIFYING ADAPTATION TARGETS

Adaptation aims to reduce the mismatch between training and test conditions. For an adaptation algorithm to be effective, the distribution of the adaptation data should be close to that encountered in test conditions. For this reason it is important to ensure that the target labels adapted to form coherent classes. For the task of acoustic adaptation this requirement is typically satisfied by forming the adaptation data from one or more speech segments from known testing conditions (i.e. the same speaker, accent, domain, or acoustic environment). While for some tasks labels ascribed to speech segments may exist, allowing segments to be grouped into larger adaptation clusters, it is unrealistic to assume the availability of such metadata in general. However, depending on the application and the operating regime of the ASR system, it may be possible to derive reasonable proxies.

Utterance-level adaptation derives adaptation statistics using a single speech segment [55]. This waives the requirement to carry information about speaker identity between utterances, which may simplify deployment of recognition systems – in terms of both engineering and privacy – as one does not need to estimate and store offline speaker-specific information. On the other hand, owing to the small amounts of data available for adaptation, the gains are usually lower than one could obtain with speaker-level clusters. While many approaches use utterances to directly extract corresponding embeddings to use as an auxiliary input for the acoustic model [56]–[59], one can also build a fixed inventory of speakers, domains, or topic codes [60] or embeddings [61], [62] when learning the acoustic model or acoustic encoder, and then use the test utterance to select a combination of these at test stage. The latter approach alleviates the necessity of estimating an accurate representation from small amounts of data. It may be possible to relax the utterance-level constraint by iteratively re-estimating adaptation statistics using a number of preceding segment(s) [57]. Extra care usually needs to be taken to handle silence and speech uttered by different speakers, as failing to do so may deteriorate the overall ASR performance [62]–[64].

Speaker-level adaptation aggregates statistics across two or more segments uttered by the same talker, requiring a way to group adaptation utterances produced by different talkers. In some cases – for example lecture recordings and telephony – speaker information may be available. In other cases potentially inaccurate metadata is available, for instance in the transcription of television or online broadcasts. In many cases (for instance, anonymous voice search) speaker metadata is not available. The generic approach to this problem relies on a speaker diarization system [65], which can identify speakers and accordingly assign their identities to the corresponding segments in the recordings. This is often used in the offline transcription of meetings or broadcast media. Alternative clustering approaches can be used to define the adaptation classes [66], [67].

Domain-level adaptation broadens the speaker-level cluster by including speech produced by multiple talkers characterized by some common characteristic such as accent, age, medical condition, topic, etc. This typically results in more adaptation material and an easier annotation process (cluster labels need to be assigned at batch rather than segment level). As such, domain adaptation can usually leverage adaptation transforms with greater capacity, and thus offer better adaptation gains.

Depending on whether adaptation transforms are estimated on held-out data, or adaptation is iteratively derived from test segments, we will refer to these as enrolment or online modes, respectively. In enrolment mode, the adaptation data would ideally be labeled with a gold-standard transcription, to enable supervised learning algorithms to be used for adaptation. However, supervised data is rarely available: small amounts may be available for some domain adaptation tasks (for example, adapting a system trained on typical speech to disordered speech [68]). In the usual case, where supervised adaptation data is not available, supervised training algorithms can still be used with “pseudo-labels” that are automatically obtained from a seed model, a process which is a type of semi-supervised training [69]. Alternatively, unsupervised training can be applied to learn embeddings for the different adaptation classes, such as i-vectors [56] or bottleneck features extracted from an auto-encoder neural network [70]. A two-pass system is a special case for which the statistics are estimated from test data using the first pass decoding with a speaker-independent model in order to obtain adaptation labels, followed by a second pass with the speaker-adapted model.

For semi-supervised approaches, it is possible to further filter out regions with low confidence to avoid the reinforcement of potential errors [71]–[73]. There is some evidence in the literature that, for some limited-in-capacity transforms estimated in a semi-supervised manner, the first pass transcript quality has a small impact on the adapted accuracy as long as these are obtained with the corresponding speaker-independent model [74], [75]. In lattice supervision, multiple possible transcriptions are used in a semi-supervised setting by generating a lattice, rather than the one-best transcription [76]–[79].
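As a concrete illustration of confidence-based filtering for semi-supervised adaptation, the sketch below keeps only those first-pass segments whose decoder confidence exceeds a threshold before they are used as pseudo-labels; the data layout and threshold value are illustrative assumptions, not a prescription from the cited works.

# Illustrative filtering of first-pass hypotheses for semi-supervised adaptation.
# Each segment carries a decoder confidence score in [0, 1]; the threshold is a tunable choice.

def select_adaptation_segments(segments, min_confidence=0.9):
    """Keep (audio, pseudo-label) pairs whose confidence passes the threshold."""
    return [(seg["audio"], seg["hypothesis"])
            for seg in segments
            if seg["confidence"] >= min_confidence]

first_pass = [
    {"audio": "utt1.wav", "hypothesis": "turn the lights on", "confidence": 0.97},
    {"audio": "utt2.wav", "hypothesis": "play some music",    "confidence": 0.62},
]
# Only utt1 would be used to adapt the model; utt2 is discarded as too uncertain.
print(select_adaptation_segments(first_pass))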
III. ADAPTATION ALGORITHMS FOR HMM-BASED ASR

Speaker adaptation of speech recognition systems has been investigated since the 1960s [80], [81]. In the mid-1990s, the influential maximum likelihood linear regression (MLLR) [82] and maximum a posteriori (MAP) [83] approaches to speaker adaptation for HMM/GMM systems were introduced. These methods, described below, stimulated the field, leading to intense activity in algorithms for the adaptation of HMM/GMM systems, reviewed by Woodland [84] and Shinoda [85], as well as in section 5 of Gales and Young's broader review of HMM-based speech recognition [86]. As we later discuss, some of the algorithms developed for HMM-based systems, in particular feature transformation approaches, have been successfully applied to NN-based systems. In this section we review MAP, MLLR, and related approaches to the adaptation of HMM/GMM systems, along with earlier approaches to speaker adaptation.

A. SPEAKER NORMALISATION

Many of these early approaches were designed to normalize speaker-specific characteristics, such as vocal tract length, building on linguistic findings relating to speaker normalization in speech perception [87], often casting the problem as one of spectral normalization. This work included formant-based frequency warping approaches [80], [81], [88] and the estimation of linear projections to normalize the spectral representation to a speaker-independent form [89], [90].

Vocal tract length normalization (VTLN) was introduced by Wakita [91] (and again by Andreou [92]) as a form of frequency warping with the aim to compensate for vocal tract length differences across speakers. VTLN was extensively investigated for speech recognition in the 1990s and 2000s [93]–[96], and is discussed further in Section V.

B. MODEL BASED APPROACHES

In model based adaptation, the speech recognition model is used to drive the adaptation. In work prefiguring subspace models, Furui [97] showed how speaker specific models could be estimated from small amounts of target data in a dynamic time warping setting, learning linear transforms between pre-existing speaker-dependent phonetic templates, and templates for a target speaker. Similar techniques were developed in the 1980s by adapting the vector quantization (VQ) used in discrete HMM systems. Shikano, Nakamura, and Abe [98] showed that mappings between speaker dependent codebooks could be learned to model a target speaker (a technique widely used for voice conversion [99]); Feng et al. [100] developed a VQ-based approach in which speaker-specific mappings were learned between codewords in a speaker-independent codebook, in order to maximize the likelihood of the discrete HMM system. Rigoll [101] introduced a related approach in which the speaker-specific transform took the form of a Markov model. A continuous version of this approach, referred to as probabilistic spectrum fitting, which aimed to adjust the parameters of a Gaussian phonetic model, was introduced by Hunt [102] and further developed by Cox and Bridle [103].

These probabilistic spectral modeling approaches can be viewed as precursors to maximum likelihood linear regression (MLLR), introduced by Leggetter and Woodland [82] and generalized by Gales [104]. MLLR applies to continuous probability density HMM systems, composed of Gaussian probability density functions. In MLLR, linear transforms are estimated to adapt the mean vectors and – in [104] – covariance matrices of the Gaussian components. If μ and Σ are the mean vector and covariance matrix of a particular Gaussian, then MLLR adapts the parameters as follows:

μ̂s = As μ − bs   (1)
Σ̂s = Hs Σ Hsᵀ.   (2)

The speaker-specific parameters bs, As and Hs are estimated using maximum likelihood. MLLR is a compact adaptation technique since the transforms are shared across Gaussians: for instance all Gaussians corresponding to the same monophone might share mean and covariance transforms. Very often, especially when target data is sparse, a greater degree of sharing is employed – for instance two shared adaptation transforms, one for Gaussians in speech models and one for Gaussians in non-speech models.

Constrained MLLR [104], [105] is an important variant of MLLR, in which the same transform is used for both the mean and covariance:

μ̂s = As μ − bs   (3)
Σ̂s = As Σ Asᵀ.   (4)

In this case, the log likelihood for a single Gaussian is given by

L_cMLLR(x; μ̂s, Σ̂s) = log N(x; As μ − bs, As Σ Asᵀ)   (5)
                   = log N(As⁻¹x + As⁻¹bs; μ, Σ) − log |As|.   (6)

It can be seen that this transform of the model parameters is equivalent to applying an affine transform to the data – hence constrained MLLR is often referred to as feature-space MLLR (fMLLR), although it is not strictly feature-space adaptation unless a single transform is shared across all Gaussians in the system, in which case the Jacobian term − log |As| can be ignored. MLLR and its variants have been used extensively in the adaptation of Gaussian mixture model (GMM)-based HMM speech recognition systems [84], [86].

C. BAYESIAN METHODS

The above model-based adaptation approaches have aimed to estimate transforms between a speaker independent model and a model adapted to a target speaker. An alternative Bayesian approach attempts to perform the adaptation by using the speaker independent model to inform the prior of a speaker-adapted model. If the set of parameters of a speech recognition model are denoted by θ, then maximum likelihood estimation sets θ to maximize the likelihood p(X | θ). In MAP training, the estimation procedure maximizes the posterior of the parameters given the data X = {x1 . . . xT}:

P(θ | X) ∝ p(X | θ) p(θ)^r,   (7)

where p(θ) is the prior distribution of the parameters, which can be based on speaker independent models, and r is an empirically determined weighting factor. Gauvain and Lee [83] presented an approach using MAP estimation as an adaptation approach for HMM/GMM systems. A convenient choice of function for p(θ) is the conjugate to the likelihood – the function which ensures the posterior has the same form as the prior. For a GMM, if it is assumed that the mixture weights ci and the Gaussian parameters (μi, Σi) are independent, then the conjugate prior may take the form of a mixture model pD(ci) ∏i pW(μi, Σi), where pD() is a Dirichlet distribution (conjugate to the multinomial) and pW() is the normal-Wishart density (conjugate to the Gaussian). This results in the following intuitively understandable parameter estimate for the adapted mean of a Gaussian μ̂ ∈ Rd:

μ̂ = (τ μ0 + Σ_t γ(t) xt) / (τ + Σ_t γ(t)),   (8)

where μ0 ∈ Rd is the unadapted (speaker-independent) mean, xt ∈ Rd is the adaptation vector at time t, γ(t) ∈ R is the component occupation probability (responsibility) for the Gaussian component at time t (estimated by the forward-backward algorithm), and τ is a positive scalar-valued parameter of the normal-Wishart density, which is typically set to a constant empirically (although Gauvain and Lee [83] also discuss an empirical Bayes estimation approach for this parameter). The re-estimated means of the Gaussian components take the form of a weighted interpolation between the speaker independent mean and data from the target speaker. When there is no target speaker data for a Gaussian component, the parameters remain speaker-independent; as the amount of target speaker data increases, so the Gaussian parameters approach the target speaker maximum likelihood estimate.
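The following numpy sketch (our own illustration, with made-up numbers) implements the MAP mean update of (8), showing how the adapted mean interpolates between the speaker-independent mean and the adaptation data as the amount of data grows.

import numpy as np

def map_adapted_mean(mu_prior, frames, gammas, tau=10.0):
    """MAP update of a Gaussian mean as in (8).
    mu_prior: speaker-independent mean, shape (d,)
    frames:   adaptation vectors x_t, shape (T, d)
    gammas:   occupation probabilities gamma(t), shape (T,)
    tau:      prior weight of the normal-Wishart density (illustrative value)
    """
    gammas = np.asarray(gammas)
    weighted_sum = (gammas[:, None] * np.asarray(frames)).sum(axis=0)
    occ = gammas.sum()
    return (tau * mu_prior + weighted_sum) / (tau + occ)

mu0 = np.zeros(3)                         # speaker-independent mean
x = np.full((5, 3), 1.0)                  # a few adaptation frames centred at 1.0
print(map_adapted_mean(mu0, x, np.ones(5)))                      # stays close to mu0: prior dominates
print(map_adapted_mean(mu0, np.tile(x, (40, 1)), np.ones(200)))  # approaches the data mean of 1.0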
D. SPEAKER ADAPTIVE TRAINING

In the model-based approaches discussed above (MLLR and MAP), we have implicitly assumed that adaptation takes place at test time: speaker independent models are trained using recordings of multiple speakers in the usual way, with only the test speakers used for adaptation. In contrast to this, it is possible to employ a model-based adaptive training approach. In speaker adaptive training [106], a transform is estimated for each speaker in the training set, as well as for each speaker in the test set. During training, speaker-specific transforms and a speaker-independent canonical model are updated in an iterative fashion.

Speaker space approaches represent a speaker-adapted model as a weighted sum of a set of individual models which may represent individual speakers or, more commonly, speaker clusters. In cluster-adaptive training (CAT) [66], the mean for a Gaussian component for a specific speaker s is given by:

μ̂s = Σ_{c=1}^{C} wc μc   (9)

where μc ∈ Rd is the mean of the particular Gaussian component for speaker cluster c, and wc ∈ R is the cluster weight. This expresses the speaker-adapted mean vector as a point in a speaker space. Given a set of canonical speaker cluster models, CAT is efficient in terms of parameters, since only the set of cluster weights need to be estimated for a new speaker. Eigenvoices [107] are an alternative way of constructing speaker spaces, with a speaker model again represented as a weighted sum of canonical models. In the Eigenvoices technique, principal component analysis of "supervectors" (concatenated mean vectors from the set of speaker-specific models) is used to create a basis of the speaker space.

A number of variants of cluster-adaptive training have been presented, including representing a speaker by combining MLLR transforms from the canonical models [66], and using sequence discriminative objective functions such as minimum phone error (MPE) [108]. Techniques closely related to CAT have been used for the adaptation of neural network based systems (Section VI).

In contrast to model-based methods, in feature-based adaptation it is usual to adapt or normalize the acoustic features for each speaker in both the training and test sets – this may be viewed as a form of speaker adaptive training. For example, in the case of cepstral mean and variance normalization (CMVN), statistics are computed for each speaker and the features normalized accordingly, during both training and test. Likewise, VTLN is also carried out for all speakers, transforming the acoustic features to a canonical form, with the variation from changes in vocal tract length being normalized away.
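As a minimal illustration of this kind of per-speaker feature normalization (our own sketch; the variable names and data are invented), the following numpy code computes CMVN statistics per speaker and applies them to that speaker's frames, exactly as would be done for both training and test data.

import numpy as np

def cmvn_per_speaker(features_by_speaker, eps=1e-8):
    """Apply cepstral mean and variance normalization speaker by speaker.
    features_by_speaker: dict mapping speaker id -> array of frames, shape (T, d)
    Returns a dict with the same keys and zero-mean, unit-variance features.
    """
    normalized = {}
    for spk, feats in features_by_speaker.items():
        mu = feats.mean(axis=0)
        var = feats.var(axis=0)
        normalized[spk] = (feats - mu) / np.sqrt(var + eps)
    return normalized

rng = np.random.default_rng(0)
data = {"spk_a": rng.normal(3.0, 2.0, size=(100, 13)),   # invented speakers and statistics
        "spk_b": rng.normal(-1.0, 0.5, size=(80, 13))}
out = cmvn_per_speaker(data)
print(out["spk_a"].mean(axis=0).round(2))  # approximately zero after normalization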
IV. ADAPTATION ALGORITHMS FOR NN-BASED ASR

The literature describing methods for adaptation of NNs has tended to inherit terminology from the algorithms used to adapt HMM-GMM systems, for which there is an important distinction between feature space and model space formulations of MLLR-type approaches [104], as discussed in the previous section. In a 2017 review of NN adaptation, Sim et al. [109] divide adaptation algorithms into feature normalisation, feature augmentation and structured parameterization. (They also use a further category termed constrained adaptation, discussed further below.)

The task of an ASR model is to map a sequence of acoustic feature vectors, X = (x1, . . . , xt, . . . , xT), xt ∈ Rd, to a sequence of words W. Although – as we discuss below – most techniques described in this paper apply equally to end-to-end models and hybrid HMM-NN models, we generally treat the model to be adapted as an acoustic model. That is, we ignore aspects of adaptation that affect only P(W), independently of the acoustics X (LM adaptation is discussed in Section XII). Further, with only a small loss of generality, in what follows we will assume that the model operates in a framewise manner, thus we can define the model as:

yt = f(xt; θ)   (10)

where f(x; θ) is the NN model with parameters θ and yt is the output label at frame t. In a hybrid HMM-NN system, for example, yt is taken to be a vector of posterior probabilities over a senone set. In a CTC model, yt would be a vector of posterior probabilities over the output symbol set, plus the blank symbol. Note that NN models often operate on a wider window of input features, xt(w) = [xt−c, xt−c+1, . . . , xt+c−1, xt+c], with the total window size w = 2c + 1. For reasons of notational clarity, we generally ignore the distinction between xt and xt(w), unless it is specifically relevant to a particular topic.

In this framework, we can define feature normalisation approaches as acting to transform the features in a speaker-dependent manner, on which the speaker-independent model operates. For each speaker s, a transformation function g : Rd → Rd′ computes:

x′t = g(xt; φs)   (11)

where φs is a set of speaker-dependent parameters. Commonly the dimension of the normalised features is identical to the original (i.e. d = d′) but this is not required. This family is closely related to feature space methods used in GMM systems described above in Section III, including fMLLR (when only a single affine transform is used), VTLN, and CMVN.

Structured parameterization approaches, in contrast, introduce a speaker-dependent transformation of the acoustic model parameters:

θs = h(θ; ϕs)   (12)

In this case, the function h would typically be structured so as to ensure that the number of speaker-dependent parameters ϕs is sufficiently smaller than the number of parameters of the original model. Such methods are closely related to model-based adaptation of GMMs such as MLLR.

Finally, feature-augmentation approaches extend the feature vector xt with a speaker-dependent embedding λs, which we can write as

x′t = [xt; λs]   (13)

Close variants of this approach use the embedding to augment the input to higher layers of the network. Note that the incorporation of an embedding requires the addition of further parameters to the acoustic model controlling the manner in which the embedding acts to adapt the model, which can be written f(xt; θ, θE). The embedding parameters θE are themselves speaker-independent.

We suggest that the distinctions described above may not always be helpful when considering NN adaptation specifically, because all three approaches can be seen to be closely related or even special cases of each other. As we saw in Section III this is not the case in HMM-GMM systems, where the distinction between feature-space and model adaptation is important (as noted by Gales [104]) because in the former case, different feature space transformations can be carried out per senone class if the appropriate scaling by a Jacobian is performed; whilst in the latter case, it is necessary for the adapted probability density functions to be re-normalized.

As an example of the close relationship between the three approaches to NN adaptation, the normalisation function g can generally be formulated as a shallow NN, possibly without a non-linearity. If there is a set of "identity transform" parameters φI such that

g(xt; φI) = xt, ∀xt   (14)

then we have

yt = f(xt; θ) = f(g(xt; φI); θ) = f′(xt; θ, φI)   (15)

where f′ is a new network comprising a copy of the original network f with the layers of g prepended. Applying feature normalization (11) leads to:

yt = f(x′t; θ) = f(g(xt; φs); θ) = f′(xt; θ, φs)   (16)

which we can write as a structured parameter transformation of f′, as defined in (12):

θs = {θ, φs} = h({θ, φI}; ϕs)   (17)

where the transformation h(·; ϕs) is simply set to replace the parameters pertaining to g with the original normalisation parameters, φs = ϕs, leaving the other parameters unchanged.

Similarly, feature augmentation approaches may be readily seen to be a further special case of structured adaptation. In the simple case of input feature augmentation (13), we see that the output of the first layer, prior to the non-linearity, can be written as

z = W′x′ + b = W′[x; λs] + b   (18)

where W′ and b are the weight and bias of the first layer respectively. By introducing a decomposition of W′, W′ = [U V], we write this as

z = [U V][x; λs] + b = Ux + b + Vλs   (19)

with U ∈ θ and V ∈ θE being weight matrices pertaining to the input features and speaker embedding, respectively. This can be expressed as a structured transformation of the bias:

θs = {U, b′} = h({U, b}; ϕs) = {U, b + Vλs}   (20)

with ϕs = Vλs. Similar arguments apply to embeddings used in other network layers.

Certain types of feature normalisation approaches can be expressed as feature augmentation. For example, cepstral mean normalisation, given by

x′t = g(xt; φs) = xt − μs   (21)

can be expressed as

z = W(x − μs) + b = [W W][x; −μs] + b   (22)

with augmented features λs = −μs.
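To make the equivalence arguments in (18)–(22) concrete, here is a small numpy check (our own illustration, with random matrices) confirming that appending an embedding to the input and multiplying by a partitioned weight matrix is the same as adapting the bias of the first layer.

import numpy as np

rng = np.random.default_rng(1)
d, k, n = 4, 2, 5                      # feature, embedding and hidden dimensions (illustrative)
U = rng.normal(size=(n, d))            # weights acting on the acoustic features
V = rng.normal(size=(n, k))            # weights acting on the speaker embedding
b = rng.normal(size=n)
x = rng.normal(size=d)                 # one acoustic frame
lam = rng.normal(size=k)               # speaker embedding lambda_s

# Feature augmentation, as in (18)-(19): W' = [U V] applied to [x; lambda_s].
z_augmented = np.concatenate([U, V], axis=1) @ np.concatenate([x, lam]) + b

# Equivalent structured bias adaptation, as in (20): b' = b + V lambda_s.
z_bias_adapted = U @ x + (b + V @ lam)

print(np.allclose(z_augmented, z_bias_adapted))  # True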
As we have seen, approaches to NN adaptation under the traditional categorization of feature augmentation, structured parameterization and feature normalization can usually be seen as special cases of one another. Therefore, in the remainder of this paper, we adopt an alternative categorization:

- Embedding-based approaches, in which any speaker-dependent parameters are estimated independently of the model, with the model f(xt; θ) itself being unchanged between speakers, other than the possible need for additional embedding parameters θE;
- Model-based approaches, in which the model parameters θ are directly adapted to data from the target speaker according to the primary objective function;
- Data augmentation approaches, which attempt to synthetically generate additional training data with a close match to the target speaker, by transforming the existing training data.

This distinction is, we believe, particularly important in speaker adaptation of NNs because in ASR it has become standard to perform adaptation in a semi-supervised manner, with no transcribed adaptation data for the target speaker. In this setting, as we will discuss, standard objective functions such as cross-entropy, which may be very effective in supervised training or adaptation, are particularly susceptible to transcription errors in semi-supervised settings.

We describe the model-independent approaches as embedding-based because any set of speaker-dependent parameters can be viewed as an embedding. Embedding-based approaches are discussed in Section V. Well-known examples of speaker embeddings include i-vectors [56], [110], and x-vectors [111], but they can also include parameter sets more classically viewed as normalizing transforms, such as CMVN statistics and global fMLLR transforms (see Section III above). However, for the reasons mentioned above, we exclude from this category methods where the embedding is simply a subset of the primary model parameters and estimated according to the model's objective function. Note that methods using a one-hot encoding for each speaker are also excluded, since it would be impossible to use these with a speaker-independent model without each test speaker having been present in training data; such methods might however be useful for closely related tasks such as domain adaptation, discussed in Section XI.

The primary benefit of speaker adaptive approaches over simply using speaker-dependent models is the prevention of over-fitting to the adaptation data (and its possibly errorful transcript). A large number of model-based adaptation techniques have been proposed to achieve this; in this paper, we sub-divide them into:

- Structured transforms: Methods in which a subset of the parameters are adapted, with many instances structuring the model so as to permit a reduced number of speaker-dependent parameters, as in the Learning Hidden Unit Contributions (LHUC) scheme [75], [112]. These can be viewed as an analogy to MLLR transforms for GMMs. They are discussed in Section VI.
- Regularization: Methods with explicit regularization of the objective function to prevent over-fitting to the adaptation data, examples including the use of L2 loss or KL divergence terms to penalize the divergence from the speaker-independent parameters [113], [114] (a minimal sketch of such a penalty follows this list). Such methods can be viewed as related to the MAP approach for GMM adaptation. They are discussed in Section VII.
- Variant objective functions: Methods which adopt variants of the primary objective function to overcome the problems of noise in the target labels, with examples including the use of lattice supervision [79] or multi-task learning [115]. They are discussed in Section VIII.
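As a minimal sketch of the regularization idea (our own illustration; the weighting and parameter names are invented and not taken from [113], [114]), the following function adds an L2 penalty that pulls the adapted parameters back towards the speaker-independent ones.

import numpy as np

def regularized_adaptation_loss(task_loss, theta_adapted, theta_si, alpha=0.01):
    """Adaptation objective = primary loss + alpha * ||theta - theta_SI||^2.
    task_loss:     primary objective (e.g. cross-entropy) on the adaptation data
    theta_adapted: flattened adapted parameters
    theta_si:      flattened speaker-independent parameters
    alpha:         regularization weight (illustrative value)
    """
    l2_penalty = np.sum((theta_adapted - theta_si) ** 2)
    return task_loss + alpha * l2_penalty

theta_si = np.zeros(10)                 # stand-in for speaker-independent parameters
theta_ad = theta_si + 0.5               # parameters after a few adaptation updates
print(regularized_adaptation_loss(2.3, theta_ad, theta_si))  # 2.3 + 0.01 * 2.5 = 2.325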
The second two categories above are collectively termed constrained adaptation in the review by Sim et al. [109]. Within this, multi-task learning is labeled by Sim et al. as attribute aware training; however, we do not believe that all multi-task learning approaches to adaptation can be labeled in this way.

Data augmentation methods have proved very successful in adaptation to other sources of variability, particularly those – such as background noise conditions – where the required model transformations are hard to explicitly estimate, but where it is easy to generate realistic data. In the case of speaker adaptation, it is significantly harder to generate sufficiently good-quality synthetic data for a target speaker, given only limited data from the speaker in question. However, there is a growing body of work in this area using, for example, techniques from the field of speech synthesis [116]. Approaches in this area are discussed in Section IX.

Most works suitable for adapting hybrid acoustic models can be leveraged to adapt acoustic encoders in E2E models. Both Kullback-Leibler divergence (KLD) regularization (Section VII) and multi-task learning (MTL) methods (Section VIII) have been used for speaker adaptation for CTC and AED models [117], [118].

Sim et al. [119] updated the acoustic encoder of RNN-T models using speaker-specific adaptation data. Furthermore, by generating text-to-speech (TTS) audio from the target speaker, more data can be used to adapt the acoustic encoder. Such data augmentation adaptation (discussed in Section IX) was shown to be an effective way for the speaker adaptation of E2E models [120] even with very limited raw data from the target speaker. Embeddings have also been used to train a speaker-aware AED model [62], [121], [122].

Because AED and RNN-T also have components corresponding to the language model, there are also techniques specific to adapting the language modeling aspect of E2E models, for instance using a text embedding instead of an acoustic embedding to bias an E2E model in order to produce outputs relevant to the particular recognition context [123]–[125]. If the new domain differs from the source domain mainly in content instead of acoustics, domain adaptation on E2E models can be performed by either interpolating the E2E model with an external language model or updating language model related components inside the E2E model with text-to-speech audio generated from text in the new domain [126], [127], discussed in Section XII.

V. SPEAKER EMBEDDINGS

Speaker embeddings map speakers to a continuous space. In this section we consider embeddings that may be extracted in a manner independent of the model, and which are also typically unsupervised with respect to the transcript. They can therefore also be useful in a standalone manner for other tasks such as speaker recognition. When used with an acoustic model, the model learns how to incorporate the embedding information by, in effect, speaker-aware training. Speaker embeddings may encode speaker-level variations that are otherwise difficult for the AM to learn from short-term features [64], and may be included as auxiliary features to the network. Specifically, let x ∈ Rd denote the acoustic features, and λs ∈ Rk a k-dimensional speaker embedding. The speaker embeddings may be concatenated with the acoustic input features, as previously seen in (13):

x′t = [xt; λs]   (23)

Alternatively they may be concatenated with the activations of a hidden layer. In either case the result is bias adaptation of the next hidden layer, as discussed in Section VI. As noted by Delcroix et al. [128], the auxiliary features may equivalently be added directly to the features using a learned projection matrix P, with the benefit that the downstream architecture can remain unchanged:

x′t = xt + Pλs   (24)

There are many other ways to incorporate embeddings into the AM: for example, they may be used to scale neuron activations as in LHUC [75]. More generally we may consider embeddings applied to either biases or activations through context-adaptive [129] or control networks [130]. It is possible to limit connectivity from the auxiliary features to the rest of the network in order to improve robustness at test time or to better incorporate static features [131]–[133]. We will further consider transformations of the features as speaker embeddings, such as with fMLLR [104], [105], and they may also be used as label targets [134].

A. FEATURE TRANSFORMATIONS

We may consider speaker-level transformations of the acoustic features as speaker embeddings. These include methods traditionally viewed as normalisation, such as CMVN and fMLLR, which produce affine transformations of the features:

xs = As x + bs   (25)

CMVN derives its name from the application to cepstral features, but corresponds to the standardization of the features to zero mean and unit variance (z-score):

xs = (x − μ) / √(σ² + ε)   (26)

where μ ∈ Rd is the cepstral mean, σ² ∈ Rd is the cepstral variance, and ε is a small constant for numerical stability.

fMLLR [104] belongs to a family of speaker adaptation methods originally developed for HMM-GMM models, as discussed in Section III. The technique has, however, later been used with success to transform features for hybrid models as well [135], [136]. While the fMLLR transforms were traditionally estimated using maximum likelihood and HMM-GMM models, the transforms may also be estimated using a neural network trained to estimate fMLLR features [137] (in Section VI we will further discuss structurally similar transforms estimated using the main objective function). Instead of transforming the input features, some work has explored fMLLR features as an additional, auxiliary, feature stream to the standard features in order to improve robustness to mismatched transforms [133], or to obtain speaker-adapted features derived from GMM log-likelihoods [138], otherwise known as GMM-derived features.

Another technique with a long history is VTLN [91], [92], [94], [139], which was briefly introduced in Section III. To control for varying vocal tract lengths between speakers, VTLN typically uses a piecewise linear warping function to adjust the filterbank in feature extraction. This requires only a single warping factor parameter that can be estimated using any AM with a line search. Alternatively, linear-VTLN (e.g. [95]) obtains a corresponding affine transform similar to fMLLR, but chooses from a fixed set of transforms at test time. A related idea is that of the exponential transform [140], which forgoes any notion of vocal tract length, but akin to VTLN is controlled by a single parameter. More recently, adaptation of learnable filterbanks, operating as the first layer in a deep network, has resulted in updates which compensate for vocal tract length differences between speakers [141].

B. I-VECTORS

Many types of embeddings stem from research in speaker verification and speaker recognition. One such approach is identity vectors, or i-vectors [56], [110], [142], which are estimated using means from GMMs trained on the acoustic features. Specifically, the extraction of a speaker i-vector, λs ∈ Rk, assumes a linear relationship between the global means from a background GMM (or universal background model, UBM), mg ∈ Rm, and the speaker-specific means, ms ∈ Rm:

ms = mg + T λs   (27)

where T ∈ Rm×k is a matrix that is shared across all speakers, which is sometimes called the total variability matrix from its relation to joint factor analysis [143]. An i-vector thus corresponds to coordinates in the column space of T. T is estimated iteratively using the EM algorithm. It is possible to replace the GMM means with posteriors or alignments from the AM [131], [144], [145], although this is no longer independent of the AM and requires transcriptions. The i-vectors are usually concatenated with the acoustic features as discussed above, but have also been used in more elaborate architectures to produce a feature mapping of the input features themselves [146], [147].
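As a rough illustration of the relationship in (27) (a deliberately simplified sketch: real i-vector extractors estimate T and the posterior of λs with EM, whereas here we just solve a least-squares problem with made-up matrices), the code below recovers an embedding from a speaker's shifted mean supervector.

import numpy as np

rng = np.random.default_rng(2)
m, k = 12, 3                      # supervector and i-vector dimensions (illustrative)
T = rng.normal(size=(m, k))       # stand-in for the total variability matrix
m_g = rng.normal(size=m)          # UBM mean supervector
lam_true = np.array([0.5, -1.0, 2.0])

m_s = m_g + T @ lam_true          # speaker-specific means following (27)

# Least-squares estimate of the embedding given T and the mean shift.
lam_est, *_ = np.linalg.lstsq(T, m_s - m_g, rcond=None)
print(np.round(lam_est, 3))       # recovers approximately [0.5, -1.0, 2.0]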
FIGURE 2. (a) Bottleneck feature extraction that uses a pretrained speaker classifier. (b) Summary network extracting speaker embeddings which is
trained jointly with the acoustic model.
C. NEURAL NETWORK EMBEDDINGS

A number of works proposed to extract low-dimensional embeddings from bottleneck layers in neural network models trained to distinguish between speakers [64], [132], or across multiple layers followed by dimensionality reduction in a separate AM (e.g. CNN embeddings [148]). One such approach, using Bottleneck Speaker Vector (BSV) embeddings [64], trains a feed-forward network to predict speaker labels (and silence) from spliced MFCCs (Fig. 2(a)). Tan et al. [132] proposed to add a second objective to predict monophones in a multi-task setup. The bottleneck layer dimension is typically set to values commonly used for i-vectors. In fact, Huang and Sim [64] note that if the speaker label targets are replaced with speaker deviations from a UBM, then the bottleneck features may be considered frame-level i-vectors. The extracted features are averaged across all speech frames of a given speaker to produce speaker-level i-vectors.

There are several more recent approaches that we may collectively refer to as ∗-vectors. Like bottleneck features, these approaches typically extract embeddings from neural networks trained to discriminate between speakers, but not necessarily using a low-dimensional layer. For instance, deep vectors, or d-vectors [149], [150], extract embeddings from feed-forward or LSTM networks trained on filterbank features to predict speaker labels. The activations from the last hidden layer are averaged over time. X-vectors [111], [130] use TDNNs with a pooling layer that collects statistics over time, and the embeddings are extracted following a subsequent affine layer. A related approach called r-vectors [151] uses the architecture of x-vectors, but predicts room impulse response (RIR) labels rather than speaker labels. In contrast to the above approaches, label embeddings, or l-vectors [134], are designed to be used as soft output targets for the training of an AM. Each label embedding represents the output distribution for a particular senone target. In this way they are, in effect, uncoupled from the individual data points and can be used for domain adaptation without a requirement of parallel data. We will discuss this idea further in Section XI. For completeness we also mention h-vectors [152], which use a hierarchical attention mechanism to produce utterance-level embeddings, but have only been applied to speaker recognition tasks.

X-vector embeddings are not widely used for adapting ASR algorithms in practice – especially in comparison to commonly used i-vectors – as experiments have not shown consistent improvements in recognition accuracy. One reason for this is related to the speaker identification training objective for the x-vector network, which implicitly factors out channel information that might be beneficial for adaptation. The optimal objective for speaker embeddings used in ASR differs from the objective used in speaker verification.

Summary networks [59], [128] produce sequence-level summaries of the input features and are closely related to ∗-vectors (cf. Fig. 2(b)). Auxiliary features are produced by a neural network that takes as input the same features as the AM, and produces embeddings by taking the time-average of the output. By incorporating the averaging into the graph, the network can be trained jointly with the AM in an end-to-end fashion [128]. A related approach is to produce LHUC feature vectors (Section VI) from an independent network with embedded averaging [153].

D. EMBEDDINGS FOR E2E SYSTEMS

The embedding method is also helpful for the adaptation of E2E systems. Fan et al. [121] and Sari et al. [62] generated a soft embedding vector by combining a set of i-vectors from multiple speakers, with the combination weights calculated from an attention mechanism. The soft embedding vector is appended to the acoustic encoder output of the E2E model, helping the model to normalize speaker variations. While the soft embedding vectors in [62], [121] are different at each frame, the speaker i-vectors are concatenated with the speech utterance as the input of every encoder layer in [122] to form a persistent memory through the depth of the encoder, hence learning utterance-level speaker knowledge.

In addition to acoustic embeddings, E2E models can also leverage text embeddings to improve their modeling accuracy. For example, E2E models can be optimized to produce outputs relevant to the particular recognition context, for instance user contacts or device location. One solution is to add a context bias encoder in addition to the original audio encoder into E2E models [123]–[125]. This bias encoder takes a list of biasing phrases as the input. The context vector of the biasing list is generated by using the attention mechanism, and is then concatenated with the context vector of the acoustic encoder and is fed into the decoder.
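The attention-based combination used for such soft embedding and context vectors can be sketched as follows (our own minimal numpy illustration with random values; the real systems compute the query from the encoder state and train all parameters jointly):

import numpy as np

def soft_speaker_embedding(query, ivectors):
    """Combine a bank of speaker i-vectors with attention weights.
    query:    vector summarizing the current frame/encoder state, shape (k,)
    ivectors: one i-vector per training speaker, shape (S, k)
    """
    scores = ivectors @ query                      # dot-product attention scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # softmax over the speaker inventory
    return weights @ ivectors                      # weighted sum = soft embedding

rng = np.random.default_rng(3)
bank = rng.normal(size=(8, 16))                    # 8 invented speakers, 16-dim i-vectors
query = rng.normal(size=16)
print(soft_speaker_embedding(query, bank).shape)   # (16,)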
bias encoder in addition to the original audio encoder into E2E a batch normalization layer, adapting both the scale and the
models [123]–[125]. This bias encoder takes a list of biasing offset of the hidden layer activations with mean μ ∈ Rn and
phrases as the input. The context vector of the biasing list variance σ 2 ∈ Rn :
is generated by using the attention mechanism, and is then h−μ
concatenated with the context vector of acoustic encoder and h = γs √ + βs . (32)
σ2 +
is fed into the decoder.
Mana et al. [161] showed that batch normalization layers can
VI. STRUCTURED TRANSFORMS be also updated by recomputing the statistics μ and σ 2 in
Methods to adapt the parameters θ of a neural network-based online fashion.
acoustic model f (x; θ ) can be split into two groups. The A similar approach with a low-memory footprint adapts the
first group adapts the whole acoustic model or some of its activation functions instead of the scale rs and offset bs . Zhang
layers [113], [114], [154]. The second group employs struc- and Woodland [162] proposed the use of parameterised sig-
tured transformations [109] to transform input features x, hid- moid and ReLU activation functions. With the parameterised
den activations h or outputs y of the acoustic model. Such sigmoid function, hidden activations h are computed from
transformations include the linear input network (LIN) [155], hidden pre-activations z as
linear hidden network (LHN) [156] and the linear output net- 1
work (LON) [157]. These transforms are parameterized with h = ηs , (33)
1 + e−γs z+ζs
a transformation matrix As ∈ Rn×n and a bias bs ∈ Rn . The
where ηs ∈ Rn , γs ∈ Rn and ζs ∈ Rn are speaker dependent
transformation matrix As is initialized as an identity matrix
parameters. |ηs | controls the scale of the hidden activations,
and the bias bs is initialized as a zero vector prior to speaker
γs controls the slope of the sigmoid function and ζs controls
adaptation. The adapted hidden activations then become
the midpoint of the sigmoid function. Similarly, parameterised
h = As h + bs . (28) ReLU activations were defined as
However, even a single transformation matrix As can contain αs z if z > 0
many speaker dependent parameters, making adaptation sus- h= , (34)
βs z if z ≤ 0
ceptible to overfitting to the adaptation data. It also limits its
practical usage in real world deployment because of memory where αs ∈ Rn and βs ∈ Rn are speaker dependent parame-
requirements related to storing speaker dependent parameters ters that correspond to slopes for positive and negative pre-
for each speaker. Therefore there has been considerable re- activations, respectively.
search into how to structure the matrix As and the bias bs to Other approaches factorize the transformation matrix As
reduce the number of speaker dependent parameters. into a product of low-rank matrices to obtain a compact set of
The first set of approaches restricts the adaptation matrix speaker dependent parameters. Zhao et al. [163] proposed the
As to be diagonal. If we denote the diagonal elements as rs = Low-Rank Plus Diagonal (LRPD) method, which reduces the
diag(As ), then the adapted hidden activations become number of speaker dependent parameters by approximating
the linear transformation matrix As ∈ Rn×n as
h = rs h + bs . (29)
As ≈ Ds + Ps Qs , (35)
There are several methods that belong to this set of adaptation
methods. LHUC [75], [112] adapts only the parameters rs : where the Ds ∈ Rn×n , Ps ∈ Rn×k
and Qs ∈ Rk×n
are treated
as speaker dependent matrices (k < n) and Ds is a diagonal
h = rs h. (30)
Speaker Codes [158], [159] prepend an adaptation neural network to an existing SI model in place of the input features. The adaptation network – which operates somewhat similarly to the control networks described below – uses the acoustic features as inputs, as well as an auxiliary low-dimensional speaker code which essentially adapts speaker dependent biases within the adaptation network:

h = h + bs. (31)

The network and speaker codes are learned by back-propagating through the frozen SI network with transcribed training data. At test time the speaker codes are derived by freezing all but the speaker code parameters and back-propagating on a small amount of adaptation data.

Similarly, Wang and Wang [160] proposed a method that adapts both rs and bs as parameters βs ∈ Rn and γs ∈ Rn of batch normalization. Zhang and Woodland [162] proposed parameterised hidden activation functions with speaker dependent parameters: parameterised sigmoid activations take the form

h = ηs / (1 + e^(−γs z + ζs)), (33)

where ηs ∈ Rn, γs ∈ Rn and ζs ∈ Rn are speaker dependent parameters; |ηs| controls the scale of the hidden activations, γs controls the slope of the sigmoid function and ζs controls the midpoint of the sigmoid function. Similarly, parameterised ReLU activations were defined as

h = αs z if z > 0, and h = βs z if z ≤ 0, (34)

where αs ∈ Rn and βs ∈ Rn are speaker dependent parameters that correspond to the slopes for positive and negative pre-activations, respectively.

Other approaches factorize the transformation matrix As into a product of low-rank matrices to obtain a compact set of speaker dependent parameters. Zhao et al. [163] proposed the Low-Rank Plus Diagonal (LRPD) method, which reduces the number of speaker dependent parameters by approximating the linear transformation matrix As ∈ Rn×n as

As ≈ Ds + Ps Qs, (35)

where Ds ∈ Rn×n, Ps ∈ Rn×k and Qs ∈ Rk×n are treated as speaker dependent matrices (k < n) and Ds is a diagonal matrix. This approximation was motivated by the assumption that the adapted hidden activations should not be very different from the unadapted hidden activations when only a limited amount of adaptation data is available; hence the adaptation linear transformation should be close to a diagonal matrix. In fact, for k = 0 LRPD reduces to LHUC adaptation. LRPD adaptation can be implemented by inserting two hidden linear layers and a skip connection, as illustrated in Fig. 3(b).
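A minimal sketch of how the LRPD factorization in (35) could be implemented as two low-rank linear layers plus a diagonal skip path (an illustrative assumption, not the reference implementation of [163]):

import torch
import torch.nn as nn

class LRPDAdapter(nn.Module):
    """Speaker dependent transform A_s ≈ D_s + P_s Q_s applied to hidden activations."""
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        self.d_s = nn.Parameter(torch.ones(hidden_dim))     # diagonal D_s, initialised to identity
        self.q_s = nn.Linear(hidden_dim, rank, bias=False)  # Q_s in R^{k x n}
        self.p_s = nn.Linear(rank, hidden_dim, bias=False)  # P_s in R^{n x k}
        nn.init.zeros_(self.p_s.weight)                     # so A_s starts as the identity map

    def forward(self, h):
        # diagonal path plus low-rank path (the "skip connection" of Fig. 3(b))
        return self.d_s * h + self.p_s(self.q_s(h))

With a rank k much smaller than the layer width n, the number of speaker dependent parameters drops from n² to n + 2nk.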
Zhao et al. [164] later presented an extension to LRPD called Extended LRPD (eLRPD), which removed the dependency of the number of speaker dependent parameters on the hidden layer size by performing a different approximation of the linear transformation matrix As,

As ≈ Ds + P Ts Q, (36)

where the matrices Ds ∈ Rn×n and Ts ∈ Rk×k are treated as speaker dependent, and the matrices P ∈ Rn×k and Q ∈ Rk×n are treated as speaker independent. Thus the number of speaker dependent parameters is mostly dependent on k, which can be chosen arbitrarily.

FIGURE 3. Structured transforms of an adaptation matrix As: (a) Learning Hidden Unit Contributions (LHUC) adapts only the diagonal elements of the transformation matrix, rs = diag(As); (b) Low-Rank Plus Diagonal factorizes the adaptation matrix as As ≈ Ds + Ps Qs; (c) Extended LRPD factorizes the adaptation matrix as As ≈ Ds + P Ts Q.

Instead of factorizing the transformation matrix, a technique typically known as feature-space discriminative linear regression (fDLR) [135], [165], [166] imposes a block-diagonal structure such that each input frame shares the same linear transform. This is, in effect, a tied variation of LIN with a reduction in the number of speaker dependent parameters.

Another set of approaches uses the speaker dependent parameters as mixing coefficients θs = {α0, . . ., αk} for a set of k speaker independent bases {B0, . . ., Bk} which factorize the transformation matrix As. Samarakoon and Sim [167], [168] proposed to use factorized hidden layers (FHL) that allow both speaker-independent and speaker dependent modelling. With this approach, the activations of a hidden layer h with an activation function σ are computed as

h = σ((W + Σ_{i=0}^{k} αi Bi) x + bs + b). (37)

Note that when αs = 0 and bs = 0, the activations correspond to a standard speaker independent model. If the bases Bi are rank-1 matrices, Bi = γi ψiᵀ, then this allows the reparameterization of (37) as [168]:

h = σ((W + Γ D Ψᵀ) x + bs + b), (38)

where the vectors γi and ψi are the i-th columns of the matrices Γ and Ψ, respectively, and the mixing coefficients αs correspond to the diagonal of the matrix D. This approach is very similar to the factorization of hidden layers used for Cluster Adaptive Training of DNN networks (CAT-DNN) [67], which uses full-rank bases instead of rank-1 bases.

Similarly, Delcroix et al. [129] proposed to adapt the activations of a hidden layer using a mixture of experts [169]. The adapted hidden unit activations are then

h = Σ_{i=0}^{k} αi Bi h. (39)

There have also been approaches that further reduce the number of speaker dependent parameters by removing the dependency on the hidden layer width, using control networks that predict the speaker-dependent parameters,

θs = c(λs; φ). (40)

In contrast to the adaptation network used in the Speaker Codes scheme, the control networks themselves are speaker-independent, taking as input some lower-dimensional speaker embedding λs ∈ Rk. As such, they form a link between structured transforms and the embedding-based approaches of Section V. The control networks c(λs; φ) can be implemented as a single linear transformation or as a multi-layer neural network. These control networks are similar to the conditional affine transformations referred to as Feature-wise Linear Modulation (FiLM) [170]. For example, Subspace LHUC [171] uses a control network to predict LHUC parameters rs from i-vectors λs, resulting in a 94% memory footprint reduction compared to standard LHUC adaptation. Cui et al. [172] used auxiliary features to adapt both the scale rs and the offset bs. Other approaches adapted the scale rs or the offset bs by leveraging information extracted with summary networks instead of auxiliary features [173]–[175].

Finally, the number of speaker dependent parameters in all the aforementioned linear transformations can be reduced by applying them to bottleneck layers that have much lower dimensionality than the standard hidden layers. These bottleneck layers can be obtained directly by training a neural network with bottleneck layers, or by applying Singular Value Decomposition (SVD) to the hidden layers [176], [177].
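The control network idea of (40) can be sketched as a FiLM-style conditioning layer; the network below, its sizes, and the 2·sigmoid(·) range are illustrative assumptions rather than the configurations used in [170], [171]:

import torch
import torch.nn as nn

class LHUCControlNetwork(nn.Module):
    """Speaker-independent control network c(lambda_s; phi) that predicts LHUC scales
    from a low-dimensional speaker embedding (e.g. an i-vector)."""
    def __init__(self, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, 256), nn.Tanh(),
                                 nn.Linear(256, hidden_dim))

    def forward(self, h, speaker_embedding):
        # h: (batch, hidden_dim) frame activations; speaker_embedding: (batch, embed_dim)
        r_s = 2.0 * torch.sigmoid(self.net(speaker_embedding))
        return r_s * h   # feature-wise (FiLM-style) modulation, scale only

Because the control network itself is speaker independent, no per-speaker parameters need to be stored: the speaker is characterised entirely by the embedding λs.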
VII. REGULARIZATION METHODS
Even with the small number of speaker dependent parameters required by structured transforms, speaker adaptation can still overfit to the adaptation data. One way to prevent this overfitting is through the use of regularization methods that prevent the adapted model from diverging too far from the original model. This can be achieved by using early stopping and appropriate learning rates, which can be obtained with a
hyper-parameter grid-search or by meta-learning [178], [179].
Another way to prevent the adapted model from diverging too
far from the original can be achieved by limiting the distance
between the original and the adapted model. Liao [113] proposed to use an L2 regularization loss on the distance between the original model parameters θ and the adapted speaker dependent parameters θs,

LL2 = ‖θs − θ‖₂². (41)
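As a small illustrative sketch (assuming the adapted and original parameters are given as parallel lists of tensors), the penalty of (41) is simply:

import torch

def l2_adaptation_penalty(adapted_params, original_params):
    """L2 penalty ||theta_s - theta||^2 between adapted and original parameters (41)."""
    return sum(((p_s - p_0.detach()) ** 2).sum()
               for p_s, p_0 in zip(adapted_params, original_params))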
Yu et al. [114] proposed to use Kullback-Leibler (KL) diver-
gence to measure the distance between the senone distribu-
tions of the adapted model and the original model
LKL = DKL ( f (x; θ ) || f (x; θs )). (42)
If we consider the overall adaptation loss using cross-entropy:
L = (1 − λ)Lxent + λLKL , (43)
we can show that this loss is equal to the cross-entropy with the target distribution for a label y given the input frame xt,

(1 − λ) P̂(y | xt) + λ f(xt; θ), (44)

where P̂(y | xt) is the distribution corresponding to the provided labels yadapt. Although initially proposed for adapting hybrid models, the KLD regularization method may also be used for speaker adaptation of E2E models [117], [118], [180].
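A minimal sketch of KLD-regularized adaptation in the form of (43)–(44), assuming frame-level senone logits from the adapted and the original (frozen) models and hard adaptation labels; this is an illustration rather than the exact recipe of [114]:

import torch
import torch.nn.functional as F

def kld_regularized_loss(adapted_logits, si_logits, hard_labels, lam=0.5):
    """Cross-entropy against the interpolated target distribution of (44)."""
    si_post = F.softmax(si_logits, dim=-1).detach()             # f(x; theta), no gradient
    one_hot = F.one_hot(hard_labels, si_post.size(-1)).float()  # provided labels y_adapt
    target = (1.0 - lam) * one_hot + lam * si_post
    log_post = F.log_softmax(adapted_logits, dim=-1)            # log f(x; theta_s)
    return -(target * log_post).sum(dim=-1).mean()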
Meng et al. [181] noted that KL divergence is not a distance metric between distributions because it is asymmetric, and therefore proposed to use adversarial learning, which guarantees that the local minimum of the regularization term is reached only if the senone distributions of the speaker independent and the speaker dependent models are identical. They achieve this by adversarially training a discriminator d(x; φ) whose task is to discriminate between the speaker dependent and speaker independent deep features, obtained by passing the input adaptation frames through the speaker dependent and the speaker independent feature extractors, respectively. This process is illustrated in Fig. 4. The regularization loss of the discriminator is

Ldisc = − log d(h; φ) − log(1 − d(h′; φ)), (45)

where h are the hidden layer activations of the speaker independent model and h′ are the hidden layer activations of the adapted model. The discriminator is trained in a minimax fashion during adaptation by minimizing Ldisc with respect to φ and maximizing Ldisc with respect to θs. Consequently, the distribution of activations of the i-th hidden layer of the speaker dependent model will be indistinguishable from the distribution of activations of the i-th hidden layer of the speaker independent model, which ought to result in more robust speaker adaptation.

FIGURE 4. Adversarial speaker adaptation.

Other approaches aim to prevent overfitting by leveraging the uncertainty of the speaker-dependent parameter space. Huang et al. [182] proposed Maximum A Posteriori (MAP) adaptation of neural networks, inspired by MAP adaptation of GMM-HMM models [83] (Section III). MAP adaptation estimates the speaker dependent parameters as the mode of the distribution

θ̂s = arg max_{θs} P(Y | X, θs) p(θs), (46)

where p(θs) is a prior density over the speaker dependent parameters. In order to obtain this prior density, Huang et al. [182] employed an empirical Bayes approach (following Gauvain and Lee [83]) and treated each speaker in the training data as a data point. They performed speaker adaptation for each speaker and observed that the adapted speaker parameters across speakers resemble Gaussians. Therefore they decided to parameterise the prior density p(θs) as

p(θs) = N(θs; μ, Σ), (47)

where μ is the mean of the adapted speaker dependent parameters across different speakers, and Σ is the corresponding diagonal covariance matrix. With this parameterisation the regularization term derived from the prior density p(θs) is

LMAP = ½ (θs − μ)ᵀ Σ⁻¹ (θs − μ), (48)

which for the prior density p(θs) = N(θs; 0, I) degenerates to the L2 regularization loss. Huang et al. investigated their proposed MAP approach with LHN structured transforms, but noted that it may be used in combination with other schemes.
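Under the diagonal-Gaussian prior of (47), the MAP regularizer of (48) reduces to a weighted squared distance from the empirical mean; a small sketch follows (the per-parameter mean and variance are assumed to have been estimated beforehand from per-speaker adapted models):

import torch

def map_penalty(theta_s, mu, var):
    """MAP regularizer 0.5 * (theta_s - mu)^T Sigma^{-1} (theta_s - mu), diagonal Sigma (48)."""
    return 0.5 * (((theta_s - mu) ** 2) / var).sum()

# With mu = 0 and var = 1 this reduces to the L2 penalty of (41).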
Xie et al. [183] proposed a fully Bayesian way of dealing with the uncertainty inherent in the speaker dependent parameters θs, in the context of estimating the LHUC parameters rs (see Section VI). In this method, known as BLHUC, the posterior distribution of the adapted model is approximated as

P(Y | X, Dadapt) ≈ P(Y | X, E[rs | Dadapt]). (49)

Xie et al. propose to use a distribution q(rs) as a variational approximation of the posterior distribution of the LHUC parameters, p(rs | Dadapt). For simplicity, they assume that both q(rs) and p(rs) are normal, such that q(rs) = N(rs; μs, γs)
with, typically label-preserving, speaker-related distortions or transforms. Examples include creating multiple copies of clean utterances with perturbed VTL warp factors [192], [193], augmenting related properties such as volume or speaking rate [11], [194], [195], or voice-conversion [196] inspired transformations of speech uttered by one speaker into another speaker using stochastic feature mapping [193], [197], [198].

While voice conversion does not create any new data with respect to unseen acoustic / linguistic complexity (just replicas of the utterances with different voices, often from the same dataset), recent advances in text-to-speech (TTS) allow the rapid building of new multi-speaker TTS voices [199] from small amounts of data. TTS may then be used to arbitrarily expand the adaptation set for a given speaker, possibly to cover unseen acoustic domains [116], [120]. If TTS is coupled with a related natural language generation module, it is possible to generate speech for domain-related texts. In this way, speaker adaptation uses more data, not only from the speaker's original speech but also from the TTS speech. Because the transcription used for TTS generation is also used for model adaptation, this approach also circumvents the obstacle of hypothesis errors in unsupervised adaptation. Moreover, TTS-generated data can also help to adapt E2E models to a new domain whose content differs substantially from the source domain, which will be discussed in Section XII.

Finally, for unbalanced data sets the acoustic models may under-perform for certain demographics that are not sufficiently represented in the training data. There is an ongoing effort to address this using generative adversarial networks (GANs). For example, Hosseini-Asl et al. [200] used GANs with a cycle-consistency constraint [201] to balance the speaker ratios with respect to gender representation in the training set.
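As an illustration of the simplest label-preserving perturbations mentioned above (volume and speaking-rate changes), the following numpy sketch creates perturbed copies of a clean waveform; the naive linear-interpolation resampler is an assumption made for brevity, and practical systems would typically use a proper resampling or tempo-modification tool:

import numpy as np

def perturb_volume(waveform, gain_db):
    """Scale the waveform by a fixed gain (in dB); the transcription is unchanged."""
    return waveform * (10.0 ** (gain_db / 20.0))

def perturb_speed(waveform, factor):
    """Crude speed perturbation by linearly resampling the time axis."""
    n_out = int(len(waveform) / factor)
    new_idx = np.linspace(0, len(waveform) - 1, n_out)
    return np.interp(new_idx, np.arange(len(waveform)), waveform)

# e.g. several augmented copies of one clean utterance x:
# copies = [perturb_speed(perturb_volume(x, g), f) for g in (-6, 0, 6) for f in (0.9, 1.0, 1.1)]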
X. ACCENT ADAPTATION
Although there is significant literature on automatic dialect identification from speech (e.g. [202]), there has been less work on accent and dialect adaptive speech recognition systems. The MGB–3 [203] and MGB–5 [204] evaluation challenges have used dialectal Arabic test sets, with a modern standard Arabic (MSA) training set, using broadcast and internet video data. The best results reported on these challenges have used a straightforward model-based transfer learning approach in a lattice-free maximum mutual information (LF-MMI) framework [205], adapting MSA trained baseline systems to specific Arabic dialects [206], [207].

Much of the reported work on accent adaptation has taken approaches for speaker adaptation, and applied them using an adaptation set of utterances from the target accent. For instance, Vergyri et al. [208] used MAP adaptation of a GMM/HMM system. Zheng et al. [209] used both MAP and MLLR adaptation, together with features selected to be discriminative towards accent, with the accent adaptation controlled using hard decisions made by an accent classifier. Earlier work on accent adaptation focused on automatic adaptation of the pronunciation dictionary [210], [211]. These approaches resemble approaches for acoustic adaptation of VQ codebooks (discussed in Section III), in that they learn an accent-specific transition matrix between the phonemic symbols in the dictionary. Selection of utterances for accent adaptation has been explored, with Nallasamy et al. [212] proposing an active learning approach.

Approaches to accent adaptation of neural network-based systems have typically employed accent-dependent output layers and shared hidden layers [213], [214], based on a similar approach to the multilingual training of deep neural networks [215]–[217]. Huang et al. [213] combined this with KL regularization (Section VII), and Chen et al. [214] used accent-dependent i-vectors (Section V); Yi et al. [218] used accent-dependent bottleneck features in place of i-vectors; and Turan et al. [219] used x-vector accent embeddings in a semi-supervised setting.

Multi-task learning approaches, where the secondary task is accent/dialect identification, have been explored by a number of researchers [220]–[224] in the context of both hybrid and end-to-end models. Improvements with multi-task training were observed in some instances, but the evidence indicates that it gives only a small adaptation gain. Sun et al. [225] replaced multi-task learning with domain adversarial learning (Section VIII), in which the objective function treated accent identification as an adversarial task, finding that this improved accented speech recognition over multi-task learning.

More successfully, Li et al. [226] explored learning multi-dialect sequence-to-sequence models using one-hot dialect information as input. Grace et al. [227] also used one-hot dialect codes and explored a family of cluster adaptive training and hidden layer factorization approaches. In both cases using one-hot dialect codes as an input augmentation (corresponding to bias adaptation) proved to be the best approach, and cluster-adaptive approaches did not result in a consistent gain. These approaches were extended by Yoo et al. [228] and Viglino et al. [224], who both explored the use of dialect embeddings for multi-accent end-to-end speech recognition. Ghorbani et al. [229] used accent-specific teacher-student learning, and Jain et al. [230] explored a mixture of experts (MoE) approach, using mixtures of experts at both the phonetic and accent levels.

Yoo et al. [228] also applied feature-wise affine transformations (FiLM) to the hidden layers, dependent both on the network's internal state and on the dialect/accent code (discussed in Section VI). This approach, which can be viewed as a conditioned normalization, differs from the previous use of one-hot dialect codes and multi-task learning in that it has the goal of learning a single normalized model rather than an implicit combination of specialist models. A related approach is gated accent adaptation [231], although this focused on a single transformation conditioned on accent. Winata et al. [232] experimented with a meta-learning approach for few-shot adaptation to accented speech, where the meta-learning algorithm learns a good initialization and hyperparameters for the adaptation.
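To illustrate the one-hot accent/dialect code used as an input augmentation (which, for the first layer, corresponds to an accent-dependent bias), here is a hypothetical sketch; the single projection layer and its sizes are assumptions, not the architectures used in [226], [227]:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AccentConditionedLayer(nn.Module):
    """First layer of an acoustic model conditioned on a one-hot accent code."""
    def __init__(self, feat_dim: int, num_accents: int, hidden_dim: int):
        super().__init__()
        self.num_accents = num_accents
        self.proj = nn.Linear(feat_dim + num_accents, hidden_dim)

    def forward(self, feats, accent_id):
        # feats: (batch, time, feat_dim); accent_id: (batch,) integer dialect labels
        code = F.one_hot(accent_id, self.num_accents).float()
        code = code.unsqueeze(1).expand(-1, feats.size(1), -1)  # repeat over time
        return torch.relu(self.proj(torch.cat([feats, code], dim=-1)))

Replacing the one-hot code with a learned dialect embedding, or using it to predict FiLM scales and offsets, gives the embedding-based and conditioned-normalization variants discussed above.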
XI. DOMAIN ADAPTATION
The performance of automatic speech recognition (ASR) always drops significantly when the recognition model is evaluated in a mismatched new domain. Domain adaptation is the technology used to adapt a well-trained source domain model to the new domain. The most straightforward way is to collect and label data in the new domain and fine-tune the model. Most adaptation technologies discussed in this paper can also be applied to domain adaptation [154], [233]–[236]. When the amount of adaptation data is limited, a common practice is to adapt only a subset of the layers of the network [237]. To let the adapted model still perform well on the source domain, Moriya et al. [238] proposed progressive neural networks, adding an additional model column to the original model for each new domain and updating only the new model column with the new domain data. In the following, we focus on technologies more specific to domain adaptation.

A. TEACHER-STUDENT LEARNING
While conventional adaptation techniques require large amounts of labeled data in the target domain, the teacher-student (T/S) paradigm [239], [240] can better take advantage of large amounts of unlabeled data and has been widely used for industrial scale tasks [241], [242].

The most popular T/S learning strategy was proposed in 2014 by Li et al. [239] to minimize the KL divergence between the output posterior distributions of the teacher network and the student network. This can also be considered as learning soft targets generated by a teacher model instead of 1-hot hard targets,

− Σ_{t=1}^{T} Σ_{y=1}^{N} PT(y | xt) log PS(y | xt), (53)

where PT and PS are the posteriors of the teacher and student networks, xt and yt are the input speech and output senone at time t, respectively, T is the number of speech frames in an utterance, and N is the number of senones in the network output layer.

Later, Hinton et al. [240] proposed knowledge distillation by introducing a temperature parameter (as in chemical distillation) to scale the posteriors. This has been applied to speech by, e.g., Asami et al. [243]. There are also variations such as learning an interpolation of soft and hard targets [240] and conditional T/S learning [244]. Although initially proposed for model compression, T/S learning is also widely used for model adaptation if the source and target signals are frame-synchronized, which can be realized by simulation. The loss function is [245], [246]

− Σ_{t=1}^{T} Σ_{y=1}^{N} PT(y | xt) log PS(y | x̂t), (54)

where xt is the source speech signal while x̂t is the frame-synchronized target signal. It can be further improved with a sequence-level loss function, as the speech signal is a sequence signal [247], [248].
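A sketch of the frame-level T/S loss in (53)–(54), assuming teacher and student senone logits for the same (or frame-synchronized parallel) inputs; this is illustrative and not the implementation of [239]:

import torch
import torch.nn.functional as F

def ts_frame_loss(student_logits, teacher_logits):
    """Cross-entropy between teacher posteriors (soft labels) and student posteriors."""
    teacher_post = F.softmax(teacher_logits, dim=-1).detach()  # P_T(y | x_t), teacher is fixed
    student_logp = F.log_softmax(student_logits, dim=-1)       # log P_S(y | x_hat_t)
    return -(teacher_post * student_logp).sum(dim=-1).mean()   # sum over senones, average over frames

Because the targets are the teacher's posteriors rather than transcriptions, the same loss can be computed on unlabeled target-domain data.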
The biggest advantage of T/S learning is that it can leverage large amounts of unlabeled data by using soft labels PT(yt = y | xt). This is particularly useful in industrial setups where effectively unlimited unlabeled data is available [241], [242]. Furthermore, soft labels produced by the teacher network carry knowledge learned by the teacher about the difficulty of classifying each sample, which the hard labels do not contain. Such knowledge helps the student to generalize better, especially when the adaptation data size is small.

E2E models tend to memorize the training data well, and therefore may not generalize well to a new domain. Meng et al. [249] proposed T/S learning for the domain adaptation of E2E models. The loss function is

− Σ_{u=1}^{L} Σ_{y=1}^{N} PT(y | Y1:u−1, X) log PS(y | Y1:u−1, X̂), (55)

where X and X̂ are the source and target domain speech sequences, and Y is the label sequence of length L, which is either the ground truth in the supervised adaptation setup or the hypothesis generated by decoding the teacher model with X in the unsupervised adaptation setup. Note that in the unsupervised case there are two levels of knowledge transfer: the teacher's token posteriors (used as soft labels) and its one-best predictions used as decoder guidance.

One constraint of T/S adaptation is that it requires paired source and target domain data. While the paired data can be obtained by simulation in most cases, there are scenarios in which it is hard to simulate the target domain data from the source domain data; for example, simulation of children's speech or accented speech remains challenging. In [134], a neural label embedding scheme was proposed for domain adaptation with unpaired data. A label embedding, or l-vector, represents the output distribution of the deep network trained in the source domain for each output token (e.g. senone). To adapt the deep network model to the target domain, the l-vectors learned from the source domain are used as the soft targets in the cross-entropy criterion.

B. ADVERSARIAL LEARNING
It is usually hard to obtain transcriptions in the target domain, so unsupervised adaptation is critical. Although transcriptions can be generated by decoding the target domain data with the source domain model, the generated hypotheses are often of poor quality given the domain mismatch. Recently, adversarial training has been applied to unsupervised domain adaptation in the form of multi-task learning [250], without the need for transcriptions in the target domain. Unsupervised adaptation is achieved by learning deep intermediate representations that are both discriminative for the main task on the source domain and invariant with respect to the mismatch between the source and target domains. Domain invariance is achieved by adversarial training of the domain classification objective functions using a gradient reversal
layer (GRL) [250]. This GRL approach has been applied to acoustic models for unsupervised adaptation in [251]–[253]. Meng et al. [254] further combined adversarial learning and T/S learning as adversarial T/S learning to improve the robustness against condition variability during adaptation.
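The gradient reversal layer mentioned above is typically implemented as an identity function whose backward pass negates (and optionally scales) the gradient; a standard sketch follows, with the scaling factor lam treated as a hyperparameter:

import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

# usage: domain_logits = domain_classifier(GradientReversal.apply(deep_features, 0.5))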
There is also increasing interest in the use of GANs with cycle-consistency constraints for domain adaptation [255]–[257]. This enables the use of non-parallel data without labels in the target domain, by learning to map the acoustic features into the style of the target domain for training. The cycle-consistency constraint also provides the possibility of mapping features from the target to the source style for, in effect, test-time adaptation or speech enhancement.

Unsupervised domain adaptation is more attractive than supervised adaptation because there is usually a large amount of unlabeled data in the new domain, while transcribing new domain data is usually time consuming and costly. T/S learning and adversarial learning can both utilize unlabeled data well. Specifically, T/S learning has been very successful in industry-scale tasks, whereas adversarial learning has been reported successful on relatively smaller tasks. Therefore, T/S learning is more promising if parallel data is available; however, if there is no prior knowledge about the new domain, adversarial learning can be a good choice. There are also other works on unsupervised domain adaptation. For example, Hsu et al. [70] use a variational autoencoder instead of adversarial learning to obtain a latent representation robust to domains; however, similarly to adversarial learning, this method remains to be examined when a large amount of unlabeled training data is available.

XII. LANGUAGE MODEL ADAPTATION
LM adaptation typically involves updating an LM estimated from a large general corpus with data from a target domain. Many approaches to LM adaptation were developed in the context of n-gram models, and are reviewed by Bellegarda [258]. Hybrid NN/HMM speech recognition systems still make use of n-gram language models and a finite state structure, at least in the first pass; it is difficult to use neural network LMs (with infinite context) directly in first pass decoding in such systems. Neural network LMs are typically used to rescore lattices in hybrid systems, or may be combined (in a variety of ways) in end-to-end systems.

The main techniques for n-gram language model adaptation include interpolation of multiple language models [259]–[261], updating the model using a cache of recently observed (decoded) text [259], [262]–[264], or merging or interpolating n-gram counts from decoded transcripts [265]. There is also a large body of work incorporating longer-scale context, for instance modelling the topic and style of the recorded speech [266]–[269]. LM adaptation approaches making use of wider context have often built on approaches using unigram statistics or bag-of-words models, and a number of approaches for combination with n-gram models have been proposed, for example dynamic marginals [270].

Neural network language modelling [271] has become state-of-the-art, in particular recurrent neural network language models (RNNLMs) [272]. There has been a range of work on adaptation of RNNLMs, including the use of topic or genre information as auxiliary features [273], [274] or combined as marginal distributions [275], domain specific embeddings [276], and the use of curriculum learning and fine-tuning to take account of shifting contexts [277], [278]. Approaches based on acoustic model adaptation, such as LHUC [278] and LHN [274], have also been explored. There have been a number of approaches applying the ideas of cache language model adaptation to neural network language models [275], [279], [280], along with so-called dynamic evaluation approaches in which the recent context is used for fine-tuning [275], [281].

E2E models are trained with paired speech and text data. The amount of text data in such a paired setup is much smaller than the amount of text data used to train a separate external LM. Therefore, it is popular to adjust E2E models by fusing in an external LM trained on a large amount of text data. The simplest and most popular approach is shallow fusion [282]–[285], in which the external LM is interpolated log-linearly with the E2E model at inference time only.

However, shallow fusion does not have a clear probabilistic interpretation. McDermott et al. [286] proposed a density ratio approach based on Bayes' rule. An LM is built on the text transcripts of the training set, which has paired speech and text data, and a second LM is built on the target domain. When decoding on the target domain, the output of the E2E model is modified by the ratio of the target and training LMs. While it is well grounded in Bayes' rule, the density ratio method requires the training of two separate LMs, from the training and target data respectively. Variani et al. [287] proposed the hybrid autoregressive transducer (HAT) model to improve the RNN-T model. The HAT model builds a training-set LM internally, and the label distribution is derived by normalizing the score functions across all labels excluding blank. Therefore, it is mathematically justified to integrate the HAT model with an external or target LM using the density ratio formulation.
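In sketch form, shallow fusion and the density ratio method differ only in whether a source/training LM score is subtracted when scoring a partial hypothesis during beam search; the interpolation weights below are placeholders that would in practice be tuned on held-out data:

def fused_score(log_p_e2e, log_p_target_lm, log_p_source_lm=None,
                lam_target=0.3, lam_source=0.3):
    """Log-linear LM fusion of a partial hypothesis at inference time."""
    score = log_p_e2e + lam_target * log_p_target_lm      # shallow fusion
    if log_p_source_lm is not None:
        score -= lam_source * log_p_source_lm             # density ratio correction
    return score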
In [126], [127], RNN-T models were adapted to a new domain with TTS data generated from domain-specific text. Because the prediction network in RNN-T works similarly to an LM, adapting it without updating the acoustic encoder is shown to be more effective than interpolating the RNN-T model with an external LM trained on the domain-specific text [127].

XIII. META ANALYSIS
In this section we present an aggregated review of published results from experiments applying adaptation algorithms to speech recognition. This differs from typical experimental reporting, which focuses on one-to-one system comparisons, typically using a small fixed set of systems, benchmark tasks and data. The proposed meta-analysis approach offers insights
into the performance of adaptation algorithms that are difficult to capture from individual experiments.

We divide this section into four main parts. The first, Section XIII-A, explains the protocol and overall assumptions of the meta-analysis, followed by a top-level summary of findings in Section XIII-B, with a more detailed analysis in Section XIII-C. The final part, Section XIII-D, aims to quantify the adaptation performance across languages, speaking styles and datasets.

A. PROTOCOL AND LITERATURE
The meta-analysis is based on 47 peer-reviewed studies selected such that they cover a wide range of systems, architectures, and adaptation tasks. Each study was required to compare adaptation results against a baseline, enabling the configurations of interest to be compared quantitatively. There was no fixed target for the total number of papers included, due to our aim to cover as many different methods as possible. Note that the meta-analysis spans several model architectures, languages, and domains; although most studies use word error rate (WER) as the evaluation metric, some studies used character error rate (CER) or phone error rate (PER). Since we are interested in the relative improvement brought by adaptation, we report Relative Error Rate Reductions (RERR).
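RERR is computed from the baseline and adapted error rates (WER, CER or PER) as a relative reduction; for example:

def relative_error_rate_reduction(baseline_err, adapted_err):
    """Relative Error Rate Reduction in percent."""
    return 100.0 * (baseline_err - adapted_err) / baseline_err

# e.g. a baseline WER of 12.0% adapted down to 10.8% gives an RERR of 10.0%.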
TABLE 1. Adaptation Studies Used in the Meta-Analysis, Categorized by the Level They Operate At and the System Architecture

The meta-analysis is based on the studies shown in Table 1, with additional splits into level of operation and top-level system architecture. The positions were selected such that they cover most of the topics mentioned in the review. For the adaptation of end-to-end systems we included all peer-reviewed works we could find (their number is relatively limited). For the hybrid approach, the studies were shortlisted such that they enable the quantification of the gains for the categories outlined in the preceding theoretical sections. As a general rule, when choosing papers for the analysis we first included works that introduced a specific adaptation method in the context of neural models, or that offered additional experiments allowing the comparison of different areas of interest, such as the impact of objective functions, the complementarity of adaptation transforms, or behavior under different operating regimes. In the case of certain more commonly-used techniques, due to the laborious nature of the analysis, it was not always possible to include an exhaustive set of somewhat similar papers. In this situation, the papers selected were those with higher citation counts.

The analysis spans 38 datasets (more than 50 unique {train, test} pairings), 28 of which are public and 10 are proprietary. These cover different speaking styles, domains, acoustic conditions, applications and languages (though the study is strongly biased towards English resources). The public corpora used include the following: AISHELL2 [298], AMI [299], APASCI [300], Aurora4 [301], CASIA [302], ChildIt [303], Chime4 [304], CSJ [305], ETAPE [306], HKUST [307], MGB [308], RASC863 [309], SWBD [310], TED [311], TED-LIUM [312], TED-LIUM2 [313], TIMIT [314], WSJ [315], PF-STAR [316], Librispeech [317], the Intel Accented Mandarin Speech Recognition Corpus [214], and UTCRSS-4EnglishAccent [295]. To save space we do not provide detailed corpora statistics in this paper, but make them available, alongside the raw data and scripts used to perform the analysis, in a corresponding repository ([Online]. Available: https://github.com/pswietojanski/ojsp_adaptation_review_2020). Overall, the meta-analysis is based on ASR systems trained on datasets with a combined duration of over 30000 hours, while the baseline acoustic models were estimated from as little as 5 hours to around 10000 hours of speech. Adaptation data varies from a few seconds per speaker to over 25000 hours of acoustic material used for domain adaptation.

B. OVERALL FINDINGS
Fig. 6 (Top) presents the average adaptation gains for all considered systems, adaptation methods, and adaptation classes. The overall RERR is 9.72% (we do not report exact numbers in tabular form due to space limitations, but they are available in the GitHub repository). Since grouping data across attributes of interest may result in unbalanced (or very sparse) sample sizes, we also report additional statistics such as the number of samples, datasets and studies the given statistic is based on. As can be seen in the right part of Fig. 6 (Top), the results in this review were derived from 356 samples produced using 38 datasets reported in 47 studies. A single sample is defined as a 1:1 system comparison for which one can unambiguously state the RERR. Likewise, a dataset refers to a particular training corpus configuration. Note that there may be some data-level overlap between different corpora originating from the same source (e.g. TED talks) and we make a distinction for the acoustic condition (e.g. AMI close-talking and distant channels are counted as two different datasets
FIGURE 6. Aggregated summary of adaptation RERR from all studies (top),
considering single method only (middle) and two or more methods
stacked (bottom). The top graph is annotated to explain the information
presented in each of the boxplot graphs in this section.
TABLE 2. Amounts of Data Used to Estimate Hybrid and E2E Models for Speaker and Domain Adaptation Clusters

FIGURE 15. Comparison of adaptation results for acoustic models estimated from different amounts of training data.

FIGURE 16. Regression analysis for the three major control variables.

FIGURE 18. Regression analysis for adaptation families, speaker-adaptive training and adaptation losses.

FIGURE 24. Comparison of adaptation results for the standalone techniques.
[49] X. Li, S. Dalmia, D. R. Mortensen, J. Li, A. W. Black, and F. Metze, [71] L. Mathias, G. Yegnanarayanan, and J. Fritsch, “Discriminative train-
“Towards zero-shot learning for automatic phonemic transcription,” in ing of acoustic models applied to domains with unreliable transcripts
Assoc. Adv. Artif. Intell., 2020, pp. 8261–8268. [speech recognition applications],” in Proc. IEEE Int. Conf. Acoust.,
[50] D. Rezende and S. Mohamed, “Variational inference with normalizing Speech, Signal Process., 2005, pp. I/109–I/112.
flows,” in Proc. 37th Int. Conf. Mach. Learn., 2015, pp. 1530–1538. [72] S.-H. Liu, F.-H. Chu, S.-H. Lin, and B. Chen, “Investigating data
[51] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and selection for minimum phone error training of acoustic models,” in
B. Lakshminarayanan, “Normalizing flows for probabilistic modeling Proc. IEEE Int. Conf. Multimedia Expo., 2007, pp. 348–351.
and inference,” 2019, arXiv:1912.02762. [73] S. Walker, M. Pedersen, I. Orife, and J. Flaks, “Semi-supervised
[52] S. S. Chen and R. A. Gopinath, “Gaussianization,” in Adv Neural Inf. model training for unbounded conversational speech recognition,”
Process. Syst., 2001, pp. 423–429. 2017, arXiv:1705.09724.
[53] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A. flow-based [74] Y. Miao, H. Zhang, and F. Metze, “Speaker adaptive training of deep
generative network for speech synthesis,” in Proc. IEEE Int. Conf. neural network acoustic models using i-vectors,” IEEE/ACM Audio,
Acoust., Speech, Signal Process., 2019, pp. 3617–3621. Speech Lang. Process., vol. 23, no. 11, pp. 1938–1949, Nov. 2015.
[54] J. Serrà, S. Pascual, and C. S. Perales, “Blow: A single-scale hyper- [75] P. Swietojanski, J. Li, and S. Renals, “Learning hidden unit contribu-
conditioned flow for non-parallel raw-audio voice conversion,” in Adv. tions for unsupervised acoustic model adaptation,” IEEE Trans. Audio,
Neural Inf. Process. Syst., 2019, pp. 6793–6803. Speech, Lang. Process., vol. 24, no. 8, pp. 1450–1463, Aug. 2016.
[55] S. Tan and K. C. Sim, “Learning utterance-level normalisation using [76] M. Padmanabhan, G. Saon, and G. Zweig, “Lattice-based unsuper-
variational autoencoders for robust automatic speech recognition,” in vised MLLR for speaker adaptation,” in Proc. ISCA ASR2000 Work-
IEEE Sri Lanka Telecom, 2016, pp. 43–49. shop, 2000, pp. 128–132.
[56] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation [77] T. Fraga-Silva, J.-L. Gauvain, and L. Lamel, “Lattice-based unsu-
of neural network acoustic models using i-vectors,” in Proc. IEEE pervised acoustic model training,” in Proc. IEEE Int. Conf. Acoust.,
Autom. Speech Recognit. Understanding Workshop, 2013, pp. 55–59. Speech, Signal Process., 2011, pp. 4656–4659.
[57] A. Senior and I. Lopez-Moreno, “Improving DNN speaker indepen- [78] V. Manohar, H. Hadian, D. Povey, and S. Khudanpur, “Semi-
dence with i-vector inputs,” in Proc. IEEE Int. Conf. Acoust., Speech, supervised training of acoustic models using lattice-free MMI,”
Signal Process., 2014, pp. 225–229. in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018,
[58] P. Karanasou, Y. Wang, M. J. Gales, and P. C. Woodland, “Adaptation pp. 4844–4848.
of deep neural network acoustic models using factorised i-vectors,” in [79] O. Klejch, J. Fainberg, P. Bell, and S. Renals, “Lattice-based unsuper-
Proc. 15th Conf. Int. Speech Commun. Assoc., 2014, pp. 2180–2184. vised test-time adaptation of neural network acoustic models,” 2019,
[59] K. Veselý, S. Watanabe, K. Žmolíková, M. Karafiát, L. Burget, and arXiv:1906.11521.
J. H. Černocký, “Sequence summarizing neural network for speaker [80] H. Suzuki, H. Kasuya, and K. Kido, “The acoustic parameters for
adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., vowel recognition without distinction of speakers,” in Proc. Conf.
2016, pp. 5315–5319. Speech Commun. Process., 1967, pp. 92–96.
[60] M. Doulaty, O. Saz, R. W. M. Ng, and T. Hain, “Latent Dirichlet [81] L. Gerstman, “Classification of self-normalized vowels,” IEEE Trans.
allocation based organisation of broadcast media archives for deep Audio Electroacoust., vol. 16, no. 1, pp. 78–80, Mar. 1968.
neural network adaptation,” in Proc. IEEE Autom. Speech Recognit. [82] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear re-
Understanding Workshop, 2015, pp. 130–136. gression for speaker adaptation of continuous density hidden Markov
[61] J. Pan, G. Wan, J. Du, and Z. Ye, “Online speaker adap- models,” Comput. Speech Lang., vol. 9, no. 2, pp. 171–185, 1995.
tation using memory-aware networks for speech recognition,” [83] J.-L. Gauvain and C.-H. Lee, “Maximum a posteriori estimation for
IEEE/ACM Audio, Speech, Lang. Process., to be published, multivariate Gaussian mixture observations of Markov chains,” IEEE
doi: 10.1109/TASLP.2020.2980372. Audio, Speech, Audio Process., vol. 2, no. 2, pp. 291–298, Apr. 1994.
[62] L. Sari, N. Moritz, T. Hori, and J. Le Roux, “Unsupervised speaker [84] P. C. Woodland, “Speaker adaptation for continuous density HMMs:
adaptation using attention-based speaker memory for end-to-end A review,” in Proc. ISCA Workshop Adapt. Methods Speech Recognit.,
ASR,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020, 2001, pp. 11–19.
pp. 7384–7388. [85] K. Shinoda, “Speaker adaptation techniques for automatic speech
[63] Z.-P. Zhang, S. Furui, and K. Ohtsuki, “On-line incremental recognition,” in Proc. Asia-Pacific Signal Inf. Process. Assoc. Annu.
speaker adaptation with automatic speaker change detection,” in Summit Conf., 2011.
Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2000, [86] M. Gales and S. Young, “The application of hidden Markov models in
pp. II.961–II.964. speech recognition,” Found. Trends Signal, vol. 1, no. 3, pp. 195–304,
[64] H. Huang and K. C. Sim, “An investigation of augmenting speaker rep- 2008.
resentations to improve speaker normalisation for DNN-based speech [87] K. Johnson, “Speaker normalization in speech perception,” in The
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., Handbook of Speech Perception. Hoboken, NJ, USA: Wiley, 2005,
2015, pp. 4610–4613. pp. 363–389.
[65] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, [88] K. Paliwal and W. Ainsworth, “Dynamic frequency warping for
and O. Vinyals, “Speaker diarization: A review of recent research,” speaker adaptation in automatic speech recognition,” J. Phonetics,
IEEE Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 356–370, vol. 13, no. 2, pp. 123–134, 1985.
Feb. 2012. [89] Y. Grenier, “Speaker adaptation through canonical correlation analy-
[66] M. J. Gales, “Cluster adaptive training of hidden Markov models,” sis,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1980,
IEEE Trans. Speech Audio Process., vol. 8, no. 4, pp. 417–428, pp. 888–891.
Jul. 2000. [90] K. Choukri and G. Chollet, “Adaptation of automatic speech recogniz-
[67] T. Tan, Y. Qian, and K. Yu, “Cluster adaptive training for deep neural ers to new speakers using canonical correlation analysis techniques,”
network based acoustic model,” IEEE/ACM Trans. Audio, Speech, Comput. Speech Lang., vol. 1, no. 2, pp. 95–107, 1986.
Lang. Process., vol. 24, no. 3, pp. 459–468, Mar. 2016. [91] H. Wakita, “Normalization of vowels by vocal-tract length and its
[68] H. Christensen, S. Cunningham, C. Fox, P. Green, and T. Hain, “A application to vowel identification,” IEEE Speech, Signal Process.,
comparative study of adaptive, automatic recognition of disordered vol. 25, no. 2, pp. 183–192, Apr. 1977.
speech,” in Proc. Interspeech, 2012, pp. 1776–1779. [92] A. Andreou, “Experiments in vocal tract normalization,” in Proc. Cer-
[69] H. Liao, E. McDermott, and A. Senior, “Large scale deep neural tified Artif. Intell. Practitioner Workshop: Front. Speech Recognit. II,
network acoustic modeling with semi-supervised training data for 1994.
YouTube video transcription,” in Proc. IEEE Autom. Speech Recognit. [93] E. Eide and H. Gish, “A parametric approach to vocal tract length nor-
Understanding Workshop, 2013, pp. 368–373. malization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process.,
[70] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised domain adaptation 1996, pp. 346–348.
for robust speech recognition via variational autoencoder-based data [94] L. Lee and R. C. Rose, “Speaker normalization using efficient fre-
augmentation,” in Proc. IEEE Autom. Speech Recognit. Understanding quency warping procedures,” in Proc. IEEE Int. Conf. Acoust., Speech,
Workshop, 2017, pp. 16–23. Signal Process., 1996, pp. 353–356.
[95] D. Kim, S. Umesh, M. Gales, T. Hain, and P. Woodland, “Using VTLN [117] K. Li, J. Li, Y. Zhao, K. Kumar, and Y. Gong, “Speaker adaptation
for broadcast news transcription,” in Proc. Int. Conf. Spoken Lang. for end-to-end CTC models,” in Proc. IEEE Spoken Lang. Technol.
Process., 2004, pp. 4–8. Workshop, 2018, pp. 542–549.
[96] G. Garau, S. Renals, and T. Hain, “Applying vocal tract length normal- [118] Z. Meng, Y. Gaur, J. Li, and Y. Gong, “Speaker adapta-
ization to meeting recordings,” in Proc. Interspeech, 2005. [Online]. tion for attention-based end-to-end speech recognition,” 2019,
Available: http://www.isca-speech.org/archive/interspeech_2005 arXiv:1911.03762.
[97] S. Furui, “A training procedure for isolated word recognition systems,” [119] K. C. Sim, P. Zadrazil, and F. Beaufays, “An investigation into on-
IEEE Speech, Signal Process., vol. 28, no. 2, pp. 129–136, Apr. 1980. device personalization of end-to-end automatic speech recognition
[98] K. Shikano, S. Nakamura, and M. Abe, “Speaker adaptation and voice models,” 2019, arXiv:1909.06678.
conversion by codebook mapping,” in Proc. IEEE Int. Symp. Circuits [120] Y. Huang, J. Li, L. He, W. Wei, W. Gale, and Y. Gong, “Rapid RNN-T
Syst., 1991, pp. 594–597. adaptation using personalised speech synthesis and neural language
[99] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, “Voice conver- generator,” in Proc. Interspeech, 2020, pp. 1256–1260.
sion through vector quantization,” in Proc. Int. Conf. Acoust., Speech, [121] Z. Fan, J. Li, S. Zhou, and B. Xu, “Speaker-aware speech-transformer,”
Signal Process., 1990, pp. 71–76. in Proc. IEEE Autom. Speech Recognit. Understanding Workshop,
[100] M. Feng, F. Kubala, R. Schwartz, and J. Makhoul, “Improved speaker 2019, pp. 222–229.
adaption using text dependent spectral mappings,” in Proc. IEEE Int. [122] Y. Zhao, C. Ni, C.-C. Leung, S. Joty, E. S. Chng, and B. Ma, “Speech
Conf. Acoust., Speech, Signal Process., 1988, pp. 131–134. transformer with speaker aware persistent memory,” in Proc. Inter-
[101] G. Rigoll, “Speaker adaptation for large vocabulary speech recognition speech, 2020, pp. 1261–1265.
systems using speaker Markov models,” in Proc. IEEE Int. Conf. [123] G. Pundak, T. N. Sainath, R. Prabhavalkar, A. Kannan, and D. Zhao,
Acoust., Speech, Signal Process., 1989, pp. 5–8. “Deep context: End-to-end contextual speech recognition,” in Proc.
[102] M. J. Hunt, “Speaker adaptation for word-based speech recognition IEEE Spoken Lang. Technol. Workshop, 2018, pp. 418–425.
systems,” J. Acoust. Soc. Amer., vol. 69, no. S1, pp. S 41–S 42, [124] Z. Chen, M. Jain, Y. Wang, M. L. Seltzer, and C. Fuegen, “End-to-
1981. end contextual speech recognition using class language models and
[103] S. J. Cox and J. S. Bridle, “Unsupervised speaker adaptation by prob- a token passing decoder,” in Proc. IEEE Int. Conf. Acoust., Speech
abilistic spectrum fitting,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 6186–6190.
Signal Process., 1989, pp. 294–297. [125] M. Jain, G. Keren, J. Mahadeokar, and Y. Saraf, “Contextual RNN-T
[104] M. Gales, “Maximum likelihood linear transformations for HMM- for open domain ASR,” in Proc. Interspeech, 2020, pp. 11–15.
based speech recognition,” Comput. Speech Lang., vol. 12, no. 2, [126] K. C. Sim et al., “Personalization of end-to-end speech recognition
pp. 75–98, 1998. on mobile devices for named entities,” in Proc. IEEE Autom. Speech
[105] L. Neumeyer, A. Sankar, and V. Digalakis, “A comparative study of Recognit. Understanding Workshop, 2019, pp. 23–30.
speaker adaptation techniques,” in Proc. Eurospeech, 1995, pp. 1127– [127] J. Li et al., “Developing RNN-T models surpassing high-performance
1130. hybrid models with customization capability,” in Proc. Interspeech,
[106] T. Anastasakos, J. McDonough, R. Schwartz, and J. Makhoul, “A 2020, pp. 3590–3594.
compact model for speaker-adaptive training,” in Proc. 4th Int. Conf. [128] M. Delcroix, S. Watanabe, A. Ogawa, S. Karita, and T. Nakatani,
Spoken Lang. Process., 1996. pp. 3–35. “Auxiliary feature based adaptation of end-to-end ASR systems,” in
[107] R. Kuhn, J.-C. Junqua, P. Nguyen, and N. Niedzielski, “Rapid speaker Interspeech, 2018, doi:10.21437/Interspeech.2018-1438.
adaptation in eigenvoice space,” IEEE Speech Audio Lang. Process., [129] M. Delcroix, K. Kinoshita, A. Ogawa, C. Huemmer, and T. Nakatani,
vol. 8, no. 6, pp. 695–707, Nov. 2000. “Context adaptive neural network based acoustic models for rapid
[108] K. Yu and M. J. Gales, “Discriminative cluster adaptive training,” adaptation,” IEEE/ACM Audio, Speech, Lang. Process., vol. 26, no. 5,
IEEE Audio, Speech, Lang. Process., vol. 14, no. 5, pp. 1694–1703, pp. 895–908, May 2018.
Sep. 2006. [130] J. Rownicka, P. Bell, and S. Renals, “Embeddings for DNN speaker
[109] K. C. Sim, Y. Qian, G. Mantena, L. Samarakoon, S. Kundu, and T. Tan, adaptive training,” in Proc. IEEE Autom. Speech Recognit. Under-
“Adaptation of deep neural network acoustic models for robust auto- standing Workshop, 2019, pp. 479–486.
matic speech recognition,” in New Era for Robust Speech: Exploiting [131] S. Garimella, A. Mandal, N. Strom, B. Hoffmeister, S. Matsoukas,
Deep, S. Watanabe, M. Delcroix, F. Metze, and J. R. Hershey, Eds. and S. H. K. Parthasarathi, “Robust i-vector based adaptation of DNN
Berlin, Germany: Springer, 2017, pp. 219–243. acoustic model for speech recognition,” in Proc. Interspeech, 2015,
[110] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouel- pp. 2877–2881.
let, “Front-end factor analysis for speaker verification,” IEEE [132] T. Tan et al., “Speaker-aware training of LSTM-RNNs for acoustic
Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788–798, modelling,” in IEEE Int. Conf. Acoust., Speech Signal Process., 2016,
May 2011. pp. 5280–5284.
[111] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudan- [133] S. H. K. Parthasarathi, B. Hoffmeister, S. Matsoukas, A. Mandal, N.
pur, “X-vectors: Robust DNN embeddings for speaker recognition,” Strom, and S. Garimella, “fMLLR based feature-space speaker adap-
in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2018, tation of DNN acoustic models,” in Proc. Interspeech, 2015, pp. 3630–
pp. 5329–5333. 3634.
[112] P. Swietojanski and S. Renals, “Learning hidden unit contributions for [134] Z. Meng et al., “L-vector: Neural label embedding for domain adapta-
unsupervised speaker adaptation of neural network acoustic models,” tion,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2020,
in Proc. IEEE Spoken Lang. Technol. Workshop, South Lake Tahoe, pp. 7389–7393.
2014, pp. 171–176. [135] F. Seide, G. Li, X. Chen, and D. Yu, “Feature engineering in context-
[113] H. Liao, “Speaker adaptation of context dependent deep neural net- dependent deep neural networks for conversational speech transcrip-
works,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., tion,” in Proc. IEEE Autom. Speech Recognit. Understanding Work-
2013, pp. 7947–7951. shop, 2011, pp. 24–29.
[114] D. Yu, K. Yao, H. Su, G. Li, and F. Seide, “KL-divergence regularized [136] S. P. Rath, D. Povey, K. Veselý, and J. Černocký, “Improved feature
deep neural network adaptation for improved large vocabulary speech processing for deep neural networks,” in Proc. Interspeech, 2013,
recognition,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., pp. 109–113.
2013, pp. 7893–7897. [137] N. M. Joy, M. K. Baskar, S. Umesh, and B. Abraham, “DNNs for
[115] Z. Huang, J. Li, S. M. Siniscalchi, I.-F. Chen, J. Wu, and C.-H. unsupervised extraction of pseudo FMLLR features without explicit
Lee, “Rapid adaptation for deep neural networks through multi-task adaptation data,” in Proc. Interspeech, 2016, pp. 3479–3483.
learning,” in Proc. Interspeech, 2015, pp. 3625–3629. [138] N. Tomashenko and Y. Khokhlov, “GMM-derived features for effec-
[116] Y. Huang, L. He, W. Wei, W. Gale, J. Li, and Y. Gong, “Using per- tive unsupervised adaptation of deep neural network acoustic models,”
sonalized speech synthesis and neural language generator for rapid in Proc. Interspeech, 2015, pp. 2882–2886.
speaker adaptation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal [139] L. F. Uebel and P. C. Woodland, “An investigation into vocal tract
Process., 2020, pp. 7399–7403. length normalisation,” in Proc. Eurospeech, 1999, pp. 2527–2530.
[140] D. Povey, G. Zweig, and A. Acero, “Speaker adaptation with an [163] Y. Zhao, J. Li, and Y. Gong, “Low-rank plus diagonal adaptation for
exponential transform,” in Proc. IEEE Autom. Speech Recognit. Un- deep neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal
derstanding Workshop, 2011, pp. 158–163. Process., 2016, pp. 5005–5009.
[141] J. Fainberg, O. Klejch, E. Loweimi, P. Bell, and S. Renals, “Acous- [164] Y. Zhao, J. Li, K. Kumar, and Y. Gong, “Extended low-rank
tic model adaptation from raw waveforms with SincNet,” in Proc. plus diagonal adaptation for deep and recurrent neural networks,”
IEEE Autom. Speech Recognit. Understanding Workshop, 2019, in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2017,
pp. 897–904. pp. 5040–5044.
[142] M. Karafiát, L. Burget, P. Matějka, O. Glembek, and J. Černocký, [165] V. Abrash, H. Franco, A. Sankar, and M. Cohen, “Connectionist
“iVector-based discriminative adaptation for automatic speech recog- speaker normalization and adaptation,” in Proc. Eurospeech, 1995,
nition,” in Proc. IEEE Autom. Speech Recognit. Understanding Work- pp. 2183–2186.
shop, 2011, pp. 152–157. [166] K. Yao, D. Yu, F. Seide, H. Su, L. Deng, and Y. Gong, “Adapta-
[143] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, “Speaker and tion of context-dependent deep neural networks for automatic speech
session variability in GMM-based speaker verification,” IEEE Audio, recognition,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2012,
Speech, Lang. Process., vol. 15, no. 4, pp. 1448–1460, May 2007. pp. 366–369.
[144] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme [167] L. Samarakoon and K. C. Sim, “Learning factorized feature transforms
for speaker recognition using a phonetically-aware deep neural net- for speaker normalization,” in Proc. IEEE Workshop Autom. Speech
work,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, Recognit. Understanding, 2015, pp. 145–152.
pp. 1695–1699. [168] L. Samarakoon and K. C. Sim, “Factorized hidden layer adapta-
PETER BELL (Associate Member, IEEE) received the B.A. degree in mathematics in 2002 and the M.Phil. degree in computer speech, text and Internet technology in 2005 from the University of Cambridge, and the Ph.D. degree in automatic speech recognition from the University of Edinburgh, in 2010. He is a reader in speech technology with the School of Informatics, University of Edinburgh. His research interests include domain adaptation, regularization, and low-resource methods for acoustic modeling.

JOACHIM FAINBERG (Member, IEEE) received the B.Mus. degree in music and sound recording (Tonmeister) from the University of Surrey in 2014, the M.Sc. degree in artificial intelligence in 2015 and the Ph.D. degree in automatic speech recognition in 2020 from the University of Edinburgh. He is currently with the Machine Learning Center of Excellence, JPMorgan Chase. His research interests include domain adaptation and training methods for acoustic modeling.

JINYU LI (Member, IEEE) received the Ph.D. degree from the Georgia Institute of Technology in 2008. From 2000 to 2003, he was a Researcher with the Intel China Research Center and Research Manager in iFlytek, China. Currently, he is a Partner Applied Scientist with Microsoft Corporation, leading a team to design and improve speech modeling algorithms and technologies that ensure industry state-of-the-art speech recognition accuracy for Microsoft products. Dr. Li is a member of the IEEE Speech and Language Processing Technical Committee. He also serves as Associate Editor for the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING.

STEVE RENALS (Fellow, IEEE) received the B.Sc. degree in chemistry from the University of Sheffield, in 1986 and the M.Sc. degree in artificial intelligence in 1987 and the Ph.D. degree in neural networks and speech recognition from the University of Edinburgh, in 1991. He is Professor of speech technology with the School of Informatics, University of Edinburgh, having previously held positions at ICSI Berkeley, the University of Cambridge, and the University of Sheffield. His research interests include speech recognition, spoken language processing, neural networks, and machine learning. Dr Renals is a fellow of ISCA (2016) and a Senior Area Editor of the IEEE OPEN JOURNAL OF SIGNAL PROCESSING.