Survey of Deep Learning Paradigms For Speech Processing
https://doi.org/10.1007/s11277-022-09640-y
Abstract
Over the past decades, particular focus has been given to research on machine learning techniques for speech processing applications. In the past few years, however, research has concentrated on the use of deep learning for speech processing. This field has become a very attractive area of study and delivers remarkably better performance than earlier approaches across various speech processing applications. This paper presents a brief survey of the application of deep learning to various speech processing tasks such as speech separation, speech enhancement, speech recognition, speaker recognition, emotion recognition, language recognition, music recognition, speech data retrieval, etc. The survey goes on to cover the use of the Auto-Encoder, Generative Adversarial Network, Restricted Boltzmann Machine, Deep Belief Network, Deep Neural Network, Convolutional Neural Network, Recurrent Neural Network and Deep Reinforcement Learning for speech processing. Additionally, it covers the various speech databases and evaluation metrics used by deep learning algorithms for performance evaluation.
1 Introduction
Over the last few years, speech processing has become a dynamic field of research due to tremendous scientific advancements and its pervasive use in commercial products.
In the last decade, deep learning became popular for image processing applications. Later, it was adapted to other signal processing applications such as speech, music, and environmental signal processing [1]. Deep learning stacks several layers of machine learning algorithms to build progressively richer representations of the input. Recent research on speech processing has shown momentous improvement of deep learning performance over traditional speech processing models such as the Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). The scope of deep learning has further expanded towards natural language processing,
(PLP) features using Feature Minimum Phone Error (fMPE). Multilayer fMPE + MPE has
shown significant improvement over the single MPE or MLP model [13].
Morgan et al. [14] used a multi-rate coupled HMM model for speech recognition. It used short-term PLP spectral features for short-term modeling and long-term temporal features (HAT) for long-term modeling. The addition of long-term features showed a significant reduction in WER. In 2007, Frantisek Grezl et al. [15] presented bottle-neck features extracted using a multilayer neural network. They extracted features in two layers, where the first layer consists of 12th-order PLP features and energy features. These features are further given to VTLN and HLDA for speaker-variability reduction and dimension reduction, respectively. The second layer consists of TRAP-based features. These features were also abstracted using five-layer MLP neural network models and later fed to the GMM-HMM model for the meeting recognition task described in NIST RT'05. It resulted in 26.2% WER for 45 bottle-neck features.
Morgan [16] reviewed some of the existing deep learning models and argued that increasing the width of each layer's feature representation is as important as expanding the network's depth.
This paper presents a review of deep learning architectures, including unsupervised, supervised, semi-supervised and reinforcement deep learning models, for distinct speech processing applications such as speech enhancement, speech separation, automatic speech recognition (ASR), speaker recognition, emotion recognition, etc. The paper mainly focuses on the methodology used for each specific speech processing application, the database used for the deep learning model's experimentation, and its performance.
This paper is structured as follows: Sect. 2 gives an overview of the speech signal, machine learning, and deep learning; Sect. 3 describes various deep learning architectures for speech processing; Sect. 4 describes the details of the databases utilized by different deep learning models; Sect. 5 gives details about various evaluation metrics of speech processing applications; Sect. 6 provides a discussion of the results of previous work on speech processing applications; and Sect. 7 concludes the paper.
2 Background
2.1 Speech Signal
Speech is the phonetic representation of symbols known as phonemes. The number of phonemes depends on the language (the typical value is between 32 and 64 for most languages). English phonemes consist of vowels, diphthongs, glides, liquids, nasals, stops, fricatives, and affricates [17]. Speech is produced by air pressure from the lungs, which originates the utterance at the glottis in the larynx and is finally shaped by the mouth and vocal tract into different vowels and consonants. Humans can also produce speech using an airstream technique that does not involve the glottis, which is called alaryngeal speech. Alaryngeal speech signals are categorized into esophageal, buccal, and pharyngeal speech [18].
Human hearing perception spans roughly 20 Hz to 20 kHz. The human ear can respond to speech intensities up to 120–130 dB; however, sounds above 90 dB may damage the inner ear, and sounds above 120 dB may cause irreversible damage. The sound wave propagates as a continuous acoustic wave, and once it is acquired, it can be recorded, digitized, processed, coded, transmitted, and replicated. An average human being can recognize sound frequencies typically below 4 kHz and hardly above 7–8 kHz. Therefore, by the Nyquist sampling criterion (Fs >= 2*Fmax), an 8 kHz sampling rate is used for sampling the speech signal to obtain a basic level of quality and 16 kHz for a higher level of quality [19]. The speech signal is typically quantized at 8 bits per sample or more to limit the quantization noise [20]. Typically, speech signals are converted into a digital format to represent the speech in a robust and compact form.
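To make the sampling discussion concrete, the short Python sketch below synthesizes a speech-band tone, checks the Nyquist criterion for the 8 kHz and 16 kHz rates, and applies an 8-bit style quantization; the tone frequency and all values are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def nyquist_ok(f_max_hz: float, fs_hz: float) -> bool:
    """Check the Nyquist sampling criterion Fs >= 2 * Fmax."""
    return fs_hz >= 2.0 * f_max_hz

f_max = 3400.0                                   # illustrative speech-band component (Hz)
for fs in (8000.0, 16000.0):
    t = np.arange(0, 0.02, 1.0 / fs)             # 20 ms of samples at rate fs
    tone = 0.5 * np.sin(2.0 * np.pi * f_max * t)
    quantized = np.round(tone * 127) / 127       # 8-bit style quantization (256 levels)
    print(f"fs = {fs/1000:.0f} kHz, Nyquist satisfied: {nyquist_ok(f_max, fs)}, "
          f"samples in 20 ms: {tone.size}, max quantization error: "
          f"{np.max(np.abs(tone - quantized)):.4f}")
```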
and regression. Conversely, the generative model follows a top-down approach, and data flows in the reverse direction. Deep learning can be utilized for probabilistic distribution modeling and unsupervised learning tasks. Generally, discriminative learning is selected when labeled data is available, and in the case of unavailability of labeled data, the generative approach is undertaken [28].
Based on the training method, deep learning algorithms can be grouped into unsupervised, semi-supervised, supervised, and reinforcement deep learning algorithms (see Fig. 1). A supervised deep learning algorithm uses labeled data for training, whereas an unsupervised deep learning algorithm uses generative models for learning instead of labeled data. It is challenging to develop a framework to extract significant features from large labeled and unlabeled higher-dimensional data. LeCun et al. [29] presented semi-supervised learning, which combines supervised deep learning for labeled data and unsupervised learning for unlabeled data to achieve a meaningful representation of the features.
Almost all the earlier architectures of deep learning were applied to two-dimensional images. But a raw speech signal is one-dimensional time-series data, which is a pole apart from a two-dimensional image representation. Therefore, acoustic signals are usually converted into two-dimensional time–frequency representations. An image can be processed holistically or part-based, with little order restriction; however, speech signals have to be studied in sequential order [30]. Figure 2 shows the dissimilarity between machine learning and deep learning processing for speech.
The learning algorithm is the core part of deep learning; it determines the discriminative features learned at each layer of a deep learning model. The learning technique's primary aim is to discover optimal weight vector values to solve a class of problems in a domain [31].
Fig. 1 Categorization of deep learning algorithms for speech processing, including the Generative Adversarial Network (GAN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), the Recurrent Neural Network (RNN) with its Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants, and policy gradient methods
Fig. 2 Machine learning versus deep learning flow for speech processing applications (input speech signal, simple features in 1-D or 2-D representation, additional layers of more complex features, mapping from features, output)
A few standard learning algorithms are Back-propagation (BP), Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, and the Levenberg–Marquardt (LM) algorithm. These learning algorithms have several drawbacks: the vanishing gradient problem, local minima, larger training time, over-fitting, etc. [32]. The performance of deep learning can be optimized using parameter initialization methods, hyper-parameter optimization with Particle Swarm Optimization (PSO) or the Genetic Algorithm (GA), adaptive learning rates (delta-bar-delta, AdaGrad, RMSProp, Adam, etc.), batch normalization, supervised pre-training, dropout, and training speed-up with Graphical Processing Units (GPUs) and the cloud [33–35].
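As a minimal sketch of how several of these remedies fit together, the following assumes PyTorch and uses arbitrary layer sizes; it wires batch normalization, dropout, and the Adam adaptive learning rate into one small network and runs a single training step on random stand-in data.

```python
import torch
import torch.nn as nn

# Small fully connected block combining several of the remedies listed above:
# batch normalization, dropout, and the Adam adaptive learning-rate optimizer.
model = nn.Sequential(
    nn.Linear(40, 256),      # e.g., a 40-dim acoustic feature frame (illustrative)
    nn.BatchNorm1d(256),     # batch normalization stabilizes and speeds up training
    nn.ReLU(),
    nn.Dropout(p=0.3),       # dropout mitigates over-fitting
    nn.Linear(256, 10),      # 10 output classes (illustrative)
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adaptive learning rate
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random data (stands in for real features/labels).
x = torch.randn(32, 40)
y = torch.randint(0, 10, (32,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```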
3 Deep Learning Architectures for Speech Processing
This section provides a survey of various deep learning algorithms, namely Auto-Encoders (AE), the Generative Adversarial Network (GAN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN) and Deep Reinforcement Learning (DRL), for different speech technologies.
3.1 Auto‑Encoder (AE)
Auto-encoders are unsupervised deep learning architectures that reproduce the input signal at the output layer. The auto-encoder is an extension of the idea of principal component analysis (PCA); PCA depends on variances rather than covariances and correlations and transforms multidimensional data into a linear representation [36]. The auto-encoder possesses the ability to reduce non-stationary noise in the speech signal and increase the perceptual quality of noisy speech. The deep auto-encoder is widely used for noise reduction in speech enhancement [37]. The Sparse Auto-encoder (SAE), Variational Auto-encoder (VAE), and De-noising Auto-encoder (DAE) are the major implementation architectures of auto-encoders. Figure 3 shows the generalized structure of the auto-encoder.
The DAE helps to restore clean speech from noisy speech. A noisy environment and large variations in the speech pattern lead to less discrimination in the local transformation [38]. The auto-encoder has a greedy layer-wise structure trained in an unsupervised manner using back-propagation.
Fig. 3 Generalized structure of the auto-encoder: the encoder compresses the input signal into compressed data, and the decoder restores the signal from this compressed representation
It has shown noteworthy performance for noise-robust ASR [39], speech restoration [40], and learning the local and global transformations of the speech signal [41].
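A minimal sketch of the de-noising idea, assuming PyTorch and illustrative dimensions (257-bin spectral frames), is shown below; it is not the architecture of any specific cited work, only the generic encoder-decoder layout of Fig. 3 trained to map a corrupted frame back to its clean version.

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    """Encoder compresses a spectral frame; decoder reconstructs the clean frame."""
    def __init__(self, n_features: int = 257, n_hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

dae = DenoisingAutoEncoder()
optimizer = torch.optim.Adam(dae.parameters(), lr=1e-3)

clean = torch.rand(64, 257)                       # stand-in for clean spectral frames
noisy = clean + 0.1 * torch.randn_like(clean)     # corrupted input (additive noise)

loss = nn.functional.mse_loss(dae(noisy), clean)  # learn to restore clean from noisy
optimizer.zero_grad()
loss.backward()
optimizer.step()
```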
Unseen noise estimation is challenging for speech enhancement in adverse environments. To deal with this issue, the Separable Deep Auto-Encoder (SDAE) consists of two DAEs for clean speech modeling, while the residual signal is used for unseen noise estimation. The residual signal is obtained by subtracting the clean signal estimated by the DAE from the noisy speech signal. Experimental results on the TIMIT database corrupted by 20 types of unseen noise, evaluated with PESQ, SDR, and segmental SNR, have shown superior performance over traditional approaches [42]. The Deep De-noising Auto-encoder (DDAE) structure is similar to the DAE, but the DDAE has more than one hidden layer. The single-layer DDAE has several drawbacks: less contextual speech information, less generalization to unknown SNRs, and residual noise in the enhanced signal. To overcome these limitations, Safari et al. [43] suggested the modular dynamic deep de-noising auto-encoder (MD-DDAE), which consists of a stack of three DDAE layers with distinct window sizes. Purvi Agrawal et al. [44] employed a modulation filter based on a deep variational model to remove noise and reverberation introduced at recording time. A two-dimensional (2-D) unsupervised convolutional variational auto-encoder (CVAE) model is applied to the speech spectrogram. It focuses on modulations in the spectro-temporal domain and ignores the modulations due to noise or reverberation. Leglaive et al. [45] offered speech enhancement using a recurrent variational auto-encoder (RVAE), consisting of a recurrent deep generative speech model and a variational EM algorithm to estimate the distribution of the latent variables in the noisy speech samples. It is observed that the introduction of temporal dynamics shows significant improvement in speech enhancement.
Li et al. [46] recommended unsupervised mobile phone clustering based on a deep auto-encoder and a spectral clustering algorithm. It is applicable to asymmetric recordings and can be used in forensic applications. This deep learning model is less useful for capturing the individuality of mobile phones of the same brand. Qian Zhan et al. [47] used an unsupervised bottle-neck feature extraction technique to deal with the need for transcribed speech information. It is found that the adversarial auto-encoder performs better for dialects close to each other because of its ability to extract latent information, phonetic information,
and language labels from the original speech. Chorowski et al. [48] presented a comparison of the Gaussian variational auto-encoder (VAE), a dimensionality-reduction bottle-neck, and the discrete Vector Quantized VAE (VQ-VAE) for speech representation. It is observed that the VQ-VAE is speaker invariant and retains more phonetic information. Further, it used a time-jitter regularization scheme to improve the speech representation quality and limit the capacity of the latent code. MFCC features with time-jitter regularization resulted in 56.20% accuracy for phoneme classification on the LIBRISPEECH database.
3.2 Generative Adversarial Network (GAN)
The Generative Adversarial Network (GAN) is a type of unsupervised deep learning devised by Goodfellow in 2014 [49]. A GAN comprises two neural networks, namely the discriminator and the generator. The discriminator network differentiates between natural and generated samples, whereas the generator network tries to trick the discriminator. Both networks contend against each other in a zero-sum game. The GAN can be treated as an unsupervised or semi-supervised model. The discriminator and generator can consist of a stack of convolution layers, transposed convolution layers, leaky ReLU layers, and fully connected layers. Further, Mirza et al. [50] utilized the conditional GAN, in which the type of generation depends on conditional information given to the generator. Figure 4 shows the framework of the basic GAN and the conditional GAN for data augmentation for noise-robust ASR.
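The zero-sum game between the two networks can be sketched as below, assuming PyTorch; the layer sizes, the use of fully connected layers instead of the convolutional stacks mentioned above, and the random stand-in data are illustrative simplifications rather than the cited GAN or conditional GAN models.

```python
import torch
import torch.nn as nn

# Generator maps a noise vector to a fake feature frame; discriminator scores real vs. fake.
G = nn.Sequential(nn.Linear(16, 64), nn.LeakyReLU(0.2), nn.Linear(64, 40))
D = nn.Sequential(nn.Linear(40, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 40)      # stand-in for real feature frames
z = torch.randn(32, 16)         # latent noise
fake = G(z)

# Discriminator step: label real frames 1, generated frames 0.
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator label the fakes as real.
g_loss = bce(D(fake), torch.ones(32, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```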
The disparity between training and testing data is a significant challenge in noise-robust speech recognition systems. To overcome this problem, Yanmin Qian et al. [51] developed Generative Adversarial Networks (GANs) for enlarging the size and variability of the training data. The basic GAN and conditional GAN are applied for data augmentation on the Aurora4 and AMI datasets with different types of noise, reverberation, and channel distortion, showing improvements of 6% and 14%, respectively, in noisy conditions. It is observed that the conditional GAN performs better than the basic GAN.
Pascual et al. [52] described the speech enhancement GAN (SEGAN), which reconstructs the clean signal from the noisy signal to maintain the raw signal's intelligibility and quality. It uses a convolutional encoder and decoder structure that is independent of the length of the input sequence. Based on subjective and objective quality measures, whispered-to-voiced
conversion has shown improvement over baseline methods. Takuhiro Kaneko et al. [53] inspected the GAN to estimate differences between natural and synthetic speech. The post-filter-based GAN has shown that synthetically generated speech is comparable to natural speech. It used a CNN to capture the time- and frequency-domain structures. Further, they used a GAN to reconstruct the Short-Time Fourier Transform (STFT) spectrogram to generate an adequate structural representation of speech data. Application of the reconstructed spectrogram to text-to-speech (TTS) has shown a higher degree of similarity between the synthesized speech and the target speech [54].
Voice conversion is challenging in non-parallel voice conversion systems, as speakers may speak different languages or not repeat the same text. Hsu et al. [55] depicted the variational auto-encoding Wasserstein GAN (VAW-GAN) to build a voice conversion model. ASR performance in cross-domain speech recognition is abysmal when the systems are trained in noisy environments or with different speaking accents. Mirura et al. [56] proposed a GAN-based speech recognition system that has shown adaptation to changes in speaking accent and noisy speech. Mostly, GANs are used for voice conversion, speech enhancement, and speech synthesis. In recent years, a few researchers have used GANs for noise-robust ASR. Reducing the disparity between training and testing data through GAN-based data augmentation has proven more effective than manually adding noise to the original signal [57].
The GAN is not appropriate for recurrent models or sequence training because the frame-level data are generated independently. The GAN model suffers from non-convergence, unstable training, and sensitivity to hyper-parameter selection, and it can cause over-fitting due to an imbalance between the discriminator and generator networks.
3.3 Restricted Boltzmann Machine (RBM)
Fig. 5 Structure of a Restricted Boltzmann Machine: a hidden layer and a visible layer connected through weights
the subjective score. Multiple-condition training (on the TIMIT database) can deal with speech enhancement for unseen noisy data, new speakers, various SNR levels, and cross-language generalization, but it has shown poor performance on real-time environmental noise.
Speech emotion signals are generally supra-segmental, and turn-level features perform better than frame-level features, which lose local information. The RBM supports optimal and discriminative feature learning. Mohit Shah et al. [62] offered a Latent Topic Model (LTM) for emotion-salient feature extraction and a supervised replicated softmax model (sRSM) based on the RBM. It gave better performance than turn-level features for cross-corpus and spontaneous emotion recognition on the IEMOCAP and SEMAINE databases.
3.4 Deep Belief Network (DBN)
The DBN is a stack of RBMs that are trained layer-wise. It is an unsupervised, probabilistic generative model. Unlike a conventional FFNN, the DBN supports pre-training and fine-tuning. In its simplest form, it is a two-layer model consisting of two RBMs [63]. In pre-training, each RBM is trained independently, and the output of the lower RBM is fed to the higher-level RBM, as shown in Fig. 6. In the fine-tuning process, the network is transformed into a deep auto-encoder (DA) by unrolling the whole DBN, repeating the input and hidden layers, and attaching them to the output of the DBN. In this arrangement, each RBM's hidden layer acts as the visible layer of the adjacent RBM [64].
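A hedged sketch of this layer-wise layout, mirroring the two-RBM-plus-logistic-regression structure of Fig. 6, can be written with scikit-learn's BernoulliRBM; note that this greedy pipeline only approximates DBN pre-training (there is no joint fine-tuning of the whole stack), the data is random stand-in data, and BernoulliRBM expects inputs scaled to [0, 1].

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Two stacked RBMs (greedy layer-wise pre-training) followed by a logistic
# regression layer, as in the DBN layout described above.
X = np.random.rand(200, 40)          # stand-in for 40-dim acoustic features in [0, 1]
y = np.random.randint(0, 5, 200)     # stand-in for 5 phone/class labels

dbn = Pipeline([
    ("rbm1", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=10, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10, random_state=0)),
    ("clf", LogisticRegression(max_iter=500)),
])
dbn.fit(X, y)
print("training accuracy:", dbn.score(X, y))
```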
Abdel-Rahman Mohamed et al. [65] proposed the DBN for phone recognition, using discriminative training to avoid the over-fitting problem. It resulted in a PER of 23.00% on the TIMIT test set. Further, they investigated that the DBN can also be applied to the full utterance rather than a local window of frames of the speech signal [66]. They also showed that the DBN can efficiently replace the popular Gaussian Mixture Model (GMM) for speech recognition with fewer parameters [67].
The Deep Belief Network (DBN) is used in voice activity detection (VAD) to combine multiple features and describe their variations. It achieves higher robustness, due to the fusion of multiple features, and lower detection complexity, but performs poorly in non-stationary noisy environments.
Fig. 6 Structure of the DBN: input features pass through two stacked Restricted Boltzmann Machines followed by a logistic regression layer
Therefore, the authors planned to improve the performance using a stacked de-noising encoder and DBN for unsupervised online learning [68]. Sarikaya et al. [69] used the DBN for action recognition in a call routing task. The performance is compared with Support Vector Machines (SVM), boosting, and a Maximum Entropy (MaxEnt) classifier. For the call routing database, DBN-3 gives 90.8% action classification accuracy. Its training is simple and there is less possibility of over-fitting, yet it has shown poor performance for smaller databases. Their future scope consists of using the DBN for tagging for event detection in spoken language understanding.
The DBN can learn higher-level features and multiple-level representations of the speech signal. Guihui Wen et al. [70] applied a random deep belief network for emotion recognition, which can overcome dimensionality problems and performs better for larger databases.
Chien-Yao Wang et al. [71] presented noise-robust sound event recognition based on the auditory receptive-field binary pattern (ARFBP), which consists of the spectrogram image feature (SIF), cepstral features, and the human ARF model. These features are given to a hierarchical-diving deep belief network (HDDBN) classifier, which learns the distinctive properties from the physical attributes. It has shown 99.27% accuracy for clean data and 95.06% accuracy for noisy data (0 dB SNR) using the RWCP dataset. On the TUT sound event database, it has shown error rates of 0.81 and 0.73 for sound event detection in home and residential areas, respectively. The HDDBN has shown significant improvement in sound event recognition rate over the SVM.
Speech quality plays a crucial role in a telephonic conversation. Affonso et al. [72] investigated the deep belief network and radial basis function SVM (RBF-SVM) for speech quality assessment on unimpaired speech samples from a public database. They extracted 64 speech features (13 static MFCC features, 20 FFT power spectrum values, ZCR, spectral centroid, spectral roll-off, and the first and second derivatives of the static MFCC features), which are given as input to the DBN. Speech quality is classified into four categories ranging from excellent to inferior, and the proposed method resulted in 95% aggregated accuracy.
Soufiane Hourri et al. [73] recommended deep speaker features (DeepSFs), transforming MFCC features into DeepSFs using a DNN to increase the noise robustness of MFCC. In this approach, the basic MFCC features are divided into two groups using the K-means algorithm to avoid over-fitting. The weights of the DNN are initialized using a Deep Belief Network
(DBN) and given to the DNN feature classifier. The nearest cluster (NearC) is used as the scoring technique. Extensive experimentation on the THUYG-20 SRE database has shown equal error rates of 0.43% and 0.55% for the female and male corpora, respectively. They observed that the DNN can learn feature distributions and can be used for robust speaker recognition in noisy environments.
3.5 Deep Neural Network (DNN)
The DNN is a variant of the feed-forward ANN that includes multiple hidden layers between the input and output layers. A DNN can model complex data with nonlinear relationships. In a DNN, data propagates from the input to the output layer without any backward flow of data [74].
In a DNN, input data is first given to the input layer, whose output is fed to the neurons of the next (hidden) layer, and so on, to produce a result at the output layer, as shown in Fig. 7. Due to the multiple layers, DNNs have a better capability to represent nonlinear functions compared to shallow learning. The combination of feature extraction and classification layers also makes deep learning efficient. A DNN can estimate an encoding vector and reconstruct the source signal, making it suitable for source separation and speech enhancement.
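As an illustrative sketch of a DNN used this way (assuming PyTorch; the dimensions, depth, and mask-based formulation are generic choices, not those of the systems cited below), a feed-forward network can map a noisy magnitude frame to a time-frequency mask that is applied back to the noisy input:

```python
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    """Feed-forward DNN mapping a noisy spectral frame to a mask in [0, 1]."""
    def __init__(self, n_bins: int = 257, n_hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_bins, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_bins), nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        mask = self.net(noisy_mag)
        return mask * noisy_mag       # enhanced magnitude estimate

model = MaskDNN()
noisy = torch.rand(8, 257)            # stand-in for noisy magnitude frames
clean = torch.rand(8, 257)            # stand-in for clean training targets
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
```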
Tae Gyoon Kang et al. [75] used a DNN to map the data vector to the corresponding encoding vectors. The proposed method consists of three parts: non-negative matrix factorization (NMF) training, DNN training, and source separation stages. DNN-NMF outperforms the previous NMF-based techniques but has less adaptability. Shuai Nie et al. [76] presented a combination of DNN and non-negative matrix factorization (NMF) for speech separation, in which NMF learns the spectral bases of the speech signal and is used to reconstruct the magnitudes of signal and noise. Discriminative training with a sparsity constraint eliminates the noise with minimal distortion and artifacts and retains the original speech content. They used the TIMIT and NOISEX-92 datasets for the speech and noise data, and the signal-to-interference ratio (SIR), source-to-distortion ratio (SDR), source-to-artifact ratio (SAR), and PESQ as the evaluation metrics to estimate the performance of their proposed model. Naijun Zheng et al. [77] suggested deep learning for phase-aware speech enhancement, which considers the phase information of the Short-Time Fourier Transform (STFT).
Fig. 7 Structure of a DNN with an input layer, multiple hidden layers (hidden layer 1 to hidden layer N), and an output layer
They used the derivative of the phase spectrogram along the time axis, known as the instantaneous frequency deviation (IFD), and the Ideal Ratio Mask (IRM), Ideal Amplitude Mask (IAM), Phase Sensitive Filter (PSF), and Complex Ideal Ratio Mask (CIRM) as the training targets for the DNN. In terms of speech quality and intelligibility, it performs better than DNN architectures that do not consider phase information, although the unstructured architecture of the system adds complexity. Yan Zhao et al. [78] applied a two-stage DNN for de-noising and de-reverberation of the speech signal. It has shown improved speech intelligibility and quality in real-time noisy-reverberant situations, but phase changes due to noise and reverberation degrade the performance of the two-stage DNN. George E. Dahl et al. [79] inspected the context-dependent DNN-HMM (CD-DNN-HMM) model for large-vocabulary speech recognition to reduce the generalization error and improve the robustness of the system. CD-DNN-HMMs unite the representational power of DNNs and the sequential modeling capability of CD-HMMs. Experimental results on a challenging business search dataset have shown that the CD-DNN-HMM (accuracy 69.6%) outperforms the existing machine learning-based algorithms, although it is computationally expensive. Dong Yu et al. [80] utilized a deep tensor neural network (DTNN) by replacing one or more conventional layers of the DNN with a Double Projection (DP) layer, where the input feature vector is projected into two nonlinear subspaces that interact through a tensor product. It has several advantages, such as representing the covariance structure of data in hidden layers, modeling noisy data with high inconsistency, and performing effectively for smaller databases. A small number of DP layers and a bottom DP layer degrade the system's performance; therefore, in the future, they intend to increase the number of DP layers to improve the performance. It has shown 16.6% WER on the 30-hour SWB task for the Hub5'00 evaluation set.
Time-varying masking is used to separate noise and handle channel mismatch in the speech signal. Narayanan et al. [81] proposed deploying a diagonal feature discriminant linear regression (dFDLR) adaptation algorithm with a deep neural network and HMM for noise-robust speech recognition when the system is trained with clean data. dFDLR performed best when trained on noisy log-Mel spectral features, and it gave better results when trained on clean data, resulting in a WER of 4.8% for clean training with dFDLR + log-Mel features on the Aurora-4 medium-large vocabulary task. The system is trained for multiple conditions such as noisy, clean, noisy + channel mismatch, and clean + channel mismatch. The drawback of the system is that the WER is larger for the noisy + channel mismatch condition.
Wang et al. [82] exploited a regressive context-dependent DNN (CD-DNN) to address the problems of data scarcity and clustering into broad phones. It can discriminate the context state at the frame level. It reduces the word error rate by 1.3% compared to the standard CD-DNN on the Topic Detection and Tracking Phase 3 (TDT3) corpus but results in high WER for the voiced, unvoiced, and silence classes. It resulted in 15.0%, 12.1%, and 10.8% WER for the context-independent DNN (CI-DNN), CD-DNN, and regressive CD-DNN, respectively. Hue et al. [83] implemented DNN-based fast speaker adaptation for speech recognition on larger databases. The speaker adaptation is applied in three ways: nonlinear feature normalization in feature space (fSA-SC), direct model adaptation of the DNN based on speaker codes (mSA-SC), and joint speaker adaptive training with speaker codes (SAT-SC). fSA-SC and mSA-SC are speaker-independent, whereas SAT-SC is speaker-dependent and has lower training time. It resulted in a word error rate of 12.1% after speaker adaptation using sequence training on the large-vocabulary Switchboard task. Pan Zho et al. [84] presented a multiple-DNN (mDNN) model, which computes the posterior probabilities of the HMM for speech recognition. In this method, the training data is grouped into m clusters to decrease the training time. A four-cluster mDNN resulted in 14.5% WER on the Mandarin transcription task.
The performance of the mDNN is better and its training faster than the baseline DNN, although when the number of clusters is increased beyond ten the system's performance degrades. In [85], the authors suggested that the DNN can be used for bandwidth expansion of data with multiple sampling rates. It is effective and robust in real time but requires a larger training time. Wu et al. [86] proposed activation regularization to avoid network over-fitting in speech recognition. They used the Wall Street Journal, Babel language, and Broadcast News databases to evaluate the proposed algorithm. It provided a generalized network structure, which reduces the WER significantly.
Speaker recognition is challenging due to diversity in language, accent, and speech tone. Still, the DNN is considered a better option for speaker verification because of its representative power. Chen et al. [87] suggested the use of a Deep Neural Architecture (DNA) for learning speaker-specific characteristics from MFCC. The DNA includes two identical fully connected multilayer feed-forward neural network subnets having 2K − 1 hidden layers, where K > 1 (an odd number of hidden layers). They used two types of learning, pre-training and discriminative learning. In this approach, the analysis of speech information components is complicated, and a large amount of data with large variability is required during discriminative learning. It is independent of the text and language spoken, and it performed better for speaker verification and segmentation.
Score compensation, calibration, and transformation play an essential role in the speaker verification system. Zhili Tan et al. [88] examined DNN-based score calibration, where the calibrated score and score shifts are estimated from i-vectors for speaker recognition. Their proposed method reduces over-fitting and performs better in noisy environments. An experiment on NIST 2012 SRE has shown that multitask learning performs better (EER of 3.6% at 0 dB SNR) for a wide range of SNRs. Larger time complexity and the need for pairs of clean and noisy data for training are the weaknesses of DNN-based score calibration compared to conventional methods.
To improve the DNN classifier's performance for spoofing detection, Hong Yu et al. [89] combined the DNN with human log-likelihoods (DNN-HLL). The DNN-HLL classifier is trained with five dynamic filter-bank-based cepstral features and constant-Q cepstral coefficient (CQCC) features. CQCC has a variable resolution in both the time and frequency domains, which is better suited to spoofing detection. They found that the performance of DNN-HLL is ten times better than the baseline GMM-HLL on the ASVspoof 2015 database. They used spoofing-discriminant DNNs with five hidden layers. Their proposed method reduces the equal error rate to 0.045% for all types of spoofing attacks. Wang et al. [90] proposed a combination of spatial and spectral features with a deep neural network for blind speech separation. The time–frequency dominance used to find the direction of the user of interest is evaluated by a two-stage chimera++ network, which can be used for multi-speaker ASR. Experimental evaluation on the RIR database has shown that its performance degrades in environmental noise and stronger reverberation. Lotfian et al. [91] studied curriculum learning for speech emotion recognition, where multi-evaluator agreement is used to train on simple data first and keep ambiguous data for later training; the curriculum is defined using the Min–Max method. Extensive experimentation on the MSP-Podcast database resulted in better performance, but classification accuracy is affected by wrongly labeled and unreliable samples. Liu et al. [92] offered different applications of the DNN for understanding the relevance between user embeddings and candidate responses in chatbots. It gives semantic information about the post, response, and personal information in chatbots like Facebook-M, Clever-bot, and Xiaoice.
3.6 Convolutional Neural Network (CNN)
The CNN was initially proposed by Fukushima in 1988 but was limited by the computation hardware available for network training [93]. Later, in the 1990s, LeCun et al. [94] presented a successful CNN version with a gradient-descent learning technique. The CNN is inspired by biological processes, and its neuron connectivity pattern is similar to that of the animal visual cortex [95]. A CNN consists of a chain of convolution, Rectified Linear Unit (ReLU), pooling, and fully connected layers, and a final soft-max layer, as shown in Fig. 8. Convolution layer outputs are represented as feature maps. Each unit in a feature map is connected to a local region of the feature maps of the previous layer via a convolution filter bank. All units in a single feature map share a common filter bank, while different feature maps in a layer use different filter banks, which maintains local connectivity and the correlation of local regions. In a fully connected layer, by contrast, each neuron of one layer is linked to all neurons in the next layer. Because discrete convolution is used for the filtering operation, the network is called a Convolutional Neural Network. The ReLU layer removes the negative values of the convolution map to increase the nonlinear properties of the network and the decision function. The pooling layer, also called a sub-sampling layer, merges semantically similar features into one. There are two main types of pooling, maximum pooling and average pooling. The pooling layer helps to extract the dominant elements and reduces the feature map dimensions and the computation required. The maximum pooling layer can also act as a noise suppressant.
A fully connected layer converts the multidimensional feature map into a one-dimensional vector, with each neuron connected to all neurons of the adjacent layer. The output of the fully connected layer is given to the soft-max classifier. The CNN is relatively invariant to scale, shift, and distortions of the input signal. A CNN accepts a fixed-size input vector, produces a fixed-size output vector, and consists of a fixed number of processing layers.
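A minimal PyTorch sketch of the convolution, ReLU, pooling, fully connected, and soft-max chain described above, applied to a single-channel log-spectrogram, is given below; the channel counts, kernel sizes, input size, and number of classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Convolution -> ReLU -> max-pooling -> fully connected -> soft-max, applied to a
# log-spectrogram treated as a single-channel image (all sizes are illustrative).
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),   # feature maps via a filter bank
    nn.ReLU(),                                    # discard negative activations
    nn.MaxPool2d(2),                              # keep dominant local elements
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),                                 # multidimensional maps -> 1-D vector
    nn.Linear(32 * 16 * 25, 10),                  # fully connected layer, 10 classes
    nn.LogSoftmax(dim=1),                         # soft-max style output
)

spectrogram = torch.randn(4, 1, 64, 100)          # batch of 64x100 time-frequency maps
print(cnn(spectrogram).shape)                     # torch.Size([4, 10])
```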
Some typical CNN architectures that use a stack of convolution layers, max-pooling layers, fully connected layers, and a soft-max classifier are LeNet, AlexNet, VGG Net, NiN, and All Conv. Some more advanced CNN architectures are DenseNet, FractalNet, GoogLeNet with Inception units, and Residual Networks [96].
Fig. 8 Generalized framework of a CNN for speaker recognition (adapted from Jati et al. [100])
The CNN is very popular for image processing applications, and in recent years it has given promising outcomes for various speech processing applications.
Generally, speech enhancement models focus on audio information only, and very little attention is given to video data. Hou et al. [97] presented audiovisual deep CNNs (AVDCNN), which consist of separate CNN models for speech and video data that are fused into a collaborative network for speech enhancement and image reconstruction. They note that lip shape and speech have a high degree of correlation, so lip shape can be used as an auxiliary feature for voice activity detection. It is noticed that late fusion is superior to early fusion for audio–video streams.
Speech separation is challenging due to two significant issues: the order of target and masker speakers in the mixture and the number of speakers in the mixture. To address these issues, Yi Luo [98] investigated the Deep Attractor Network (DANet), which projects the time–frequency attributes of the mixture signal into a high-dimensional embedding space. The clustering of speakers depends on the attractor (reference) points, and this formulation reduces the permutation and speaker-number problems in speech separation. Tian Tan et al. [99] proposed a very deep CNN (VDCRN), which incorporates residual learning and batch normalization for noise-robust speech recognition and alleviates the training–testing database mismatch. Factor-aware training and cluster-aware training significantly improved the performance of VDCRN in noisy conditions, and it achieved a WER of 5.67% on the AURORA-4 dataset. Jati et al. [100] studied speaker-specific characteristics obtained from unsupervised neural predictive coding (NPC) along with a convolutional SIAMESE network. It can detect overlapping speech and works well in environmental noise but is less robust for larger speaker databases. In the future, they plan to use a deeper network for larger vocabularies and to introduce robustness to channel characteristics. Nguyen An et al. [101] inspected a text-independent speaker identification method for speaker separation, in which CNN variants such as residual neural networks (ResNets) and visual geometry group (VGG) nets are used to learn speaker characteristics and can handle variable-length segments. Log-Mel spectral features are given to the CNN, and a structured self-attentive layer is applied after the CNN layers, which generates fixed-length input for the subsequent layers and attends to the discriminant speaker characteristics.
Database size and recording conditions have an impact on speaker recognition. Most of the time, databases are created under constrained conditions and are therefore small, which degrades speaker recognition performance in unconstrained and noisy environments. To deal with this problem, Arsha Nagrani et al. [102] expanded the celebrity VoxCeleb database using open-source media such as YouTube. They applied a two-dimensional CNN (Thin-ResNet with a GhostVLAD layer) to the speech spectrogram for speaker verification. After speaker verification, the speaker identity is verified using face recognition of the celebrity and added to the database. It resulted in an equal error rate of 2.87%.
The CNN is capable of learning discriminative features from diverse speech expressions for emotion recognition. Shiqing Zhang et al. [103] recommended Deep Convolutional Neural Networks (DCNN) for emotion recognition to bridge the semantic gap between low-level features and subjective emotions. They provided three log-MFCC feature channels, namely the static, delta, and delta-delta coefficients, to train an AlexNet DCNN model. For the aggregation of the learned high-level features, Discriminant Temporal Pyramid Matching (DTPM) is used, and SVM is employed for emotion classification. Extensive experimentation on the EMO-DB, RML, eNTERFACE05, and BAUM-1s databases has shown promising results. It is observed that a DCNN pre-trained for image applications can also be
used for speech feature extraction. Lp-norm pooling has demonstrated significant improvement over maximum and average pooling.
Jianfeng Zhao et al. [104] presented a merged DNN that combines the features from a 1D-CNN applied to the audio clip and a 2D-CNN applied to the spectrogram for emotion recognition. A Bayesian optimization model is used for fine-tuning the merged features. Using transfer learning, the deep learning model's performance on smaller datasets can be improved by transferring knowledge from a model trained on a larger dataset. The merged CNN resulted in 89.77% and 86.36% accuracy for speaker-dependent and speaker-independent emotion recognition on the Berlin EmoDB and IEMOCAP databases.
Hossain et al. [105] used CNNs for the speech MFCC spectrum and images, and the features are fused using two consecutive extreme learning machines (ELMs), with SVM used for classification. The system's performance is measured in terms of accuracy on the eNTERFACE'05 audiovisual emotion database, which contains six emotions: anger, disgust, fear, happiness, sadness, and surprise. The ELM provides a high degree of non-linearity in the feature fusion, but because of MFCC, the system is prone to background noise. In the future, they plan to evaluate their proposed system with other deep learning architectures and cloud frameworks.
Ocquaye et al. [106] proposed Dual Exclusive Attentive Transfer (DEAT), an unsupervised CNN approach for source-target domain adaptation. To minimize the domain incongruity in the second-order statistics of the attention maps of both source and target, a correlation alignment loss (CALLoss) is used. The spectrogram is used for discriminant and salient feature learning, and raw spectrogram features are given to a 5-layer CNN. It resulted in 65.02% unweighted average recall (UAR) for the ABC corpus and 67.79% for the EmoDB database. It has several advantages, such as high UAR, computational efficiency, and simple optimization, but the feature vector length is large. Their future scope includes extending the multi-layer model to VGGNet and ResNet along with dimensionality reduction of the attention maps.
Suraj Tripathi et al. [107] examined CNNs for emotion recognition using speech features and speech transcripts. CNNs are applied to the text and to the speech MFCC features, and the outputs are collected in a fully connected layer for classification. CNN-MFCC + Text resulted in 76.1% accuracy, an almost 7% increase in performance over current benchmark methods.
Heinrich Dinkel et al. [108] investigated a joint convolutional long short-term memory deep neural network (CLDNN) for spoofing detection using raw-waveform front-end speech features. Experimental evaluation of the algorithm on BTAS2016 and ASVspoof 2015 resulted in half total error rates (HTER) of 0.19% and 0.0%, respectively. The raw waveform works better for synthetic and voice-converted spoof detection but performs poorly for sparse data.
3.7 Recurrent Neural Network (RNN)
The RNN is a neural network with recurrent (feedback) connections that is generally used to process sequential and time-series data. The most popular RNN implementations are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) [109]. RNNs are called recurrent because they perform the same operation for every element of the sequence, with the output depending on the earlier computations. The structure of a basic RNN with a loop is shown in Fig. 9. The general feed-forward neural network, in contrast, has several issues: an inability to handle sequential data, consideration of only the current input, and failure to memorize previous inputs.
Fig. 9 Structure of a basic RNN, with a recurrent loop on the hidden layer (h) fed by the input layer (x)
LSTM is generally used for temporal information processing, whereas the GRU has fewer network parameters, a simpler topology, and lower computation cost and complexity [110]. Depending on the input-output relationship and the application, RNN structures are categorized as one-to-one, many-to-one, one-to-many, and many-to-many structures. This section provides some applications of RNNs to speech processing.
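The following is a minimal many-to-one example, assuming PyTorch and illustrative dimensions: a bidirectional LSTM reads a sequence of acoustic feature frames and the representation at the final time step is classified.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Many-to-one RNN: a bidirectional LSTM over acoustic frames, then a classifier."""
    def __init__(self, n_features: int = 40, n_hidden: int = 128, n_classes: int = 5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)   # 2x for the bidirectional pass

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        outputs, _ = self.lstm(x)          # outputs: (batch, time, 2 * n_hidden)
        return self.fc(outputs[:, -1, :])  # classify the final time step (many-to-one)

model = SequenceClassifier()
frames = torch.randn(8, 120, 40)           # 8 utterances, 120 frames, 40-dim features
print(model(frames).shape)                 # torch.Size([8, 5])
```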
The combination of a bidirectional LSTM-RNN and end-to-end training can give better results for phoneme recognition on the TIMIT database (PER of 17.7%) [111]. Chu-Xiong Qin et al. [112] presented transfer learning for speech recognition, which comprised a multilingual DNN and a matrix factorization method to extract higher-level features. Further, a connectionist temporal classification (CTC) attentive model with a shallow RNN increases the robustness through joint decoding and shared training. Experimental results on the TIMIT database have shown a PER of 16.59%.
The combination of a convolutional and an LSTM recurrent network captures temporal dependencies better; the convolution-LSTM model resulted in better accuracy (85%) for speech and music recognition on the Google AudioSet database [113]. The bidirectional LSTM (BLSTM) can minimize the vanishing gradient problem in the RNN, but it is more computationally expensive. Still, the LSTM suffers from a vanishing gradient problem in its higher layers because of the bounded output, and a larger LSTM results in over-fitting of the network. To deal with these problems, Jian Kang et al. [114] combined a bidirectional RNN, GRU, and residual architecture for low-resource speech recognition. They presented the local BLSTM (LBLSTM) for modeling temporal dependencies over a local window. It has shown significant improvement over the baseline LSTM (3–8% reduction in WER) and DNN (4–10% reduction in WER). Zhiyuan Tang et al. [115] developed phonetic temporal neural (PTN) language identification using an LSTM-RNN model that accepts speech features generated by a phone-discriminative DNN. It is observed that phonetic temporal information is more important than raw speech features for discriminating languages. Kun Han et al. [116] presented a feed-forward DNN and an RNN for training on static and sequential frame-level acoustic features, respectively, which can learn temporal dynamics. It has shown that single-condition training performs better than the proposed multi-condition training on the TIMIT and NOISEX-92 noise datasets. Ke Tan et al. [117] combined a convolutional encoder-decoder (CED) and a recurrent LSTM to form a convolutional recurrent network (CRN), which is noise- and speaker-independent for real-time monaural speech enhancement. Progressive learning and a combination of DNN and LSTM resulted in improved speech quality and intelligibility for speech enhancement at low SNR, but the higher computation cost and large number of parameters make practical real-time implementation difficult. To reduce the parameters and
3.8 Deep Reinforcement Learning (DRL)
Reinforcement Learning (RL) is based on the reasonable idea that if an action is followed by an improvement in the state of affairs, then the tendency to produce that action is strengthened [125]. RL methods are categorized into value-based methods, including Q-learning approaches [126], and policy-based methods, including policy gradient methods [127]. DRL has been investigated for only a few speech processing applications, such as dialogue-based systems, speech enhancement, pre-training for ASR, and content-based speech retrieval. In recent years, DRL has mostly been investigated for dialogue-based systems in human-robot or human-machine interaction, where it is necessary to generate a response based on the user's state of conversation and action.
Policy-optimization DRL algorithms can be useful in automated dialogue systems for generating the speech response by considering the current state of the discussion with the human [128]. When the domain changes or a policy is transferred from one domain to another, the entire dialogue state space and action set change; therefore, the DRL model must differ for different domains, which is very challenging. A multi-agent dialogue policy (MADP), which consists of slot-dependent agents (S-Agents) and a slot-independent agent (G-Agent), can tackle this problem [129]. Further, Lu Chen et al. [130] proposed the Agent-Graph to make DRL-based policies sample-efficient and suitable for policy transfer between different domains.
The performance of ASR can be improved by optimizing the speech enhancement (SE) model using DRL. Yih Shen et al. [131] presented an Ideal Binary Mask-based SE system on the Mandarin Chinese broadcast news corpus (MATBN) database and showed significant improvement in noisy conditions.
DRL requires a large amount of time for training, which makes it unsuitable for real-time human-computer interaction. Rajapakshe et al. [132] explored pre-training of DRL to reduce the training time, using a Markov Decision Process (MDP) for pre-training of the DNN. This pre-training is used for speech command recognition with CNN and LSTM models and has shown significant improvement in network performance over the network without pre-training.
In supervised learning, transcribing the training speech data is a challenging and computationally expensive task. Taku Kala et al. [133] described speech recognition using a policy gradient and hypothesis selection DRL method, which has given promising results compared to unsupervised approaches. They observed that increasing the number of DRL stages increases the WER.
In Deep Q-Learning (DQN), the state is given as input to a neural network, which approximates the Q-value function and generates the Q-values of all possible actions at the output, as shown in Fig. 10.
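A minimal sketch of this setup, assuming PyTorch, is shown below: a small Q-network maps a state vector to one Q-value per action, actions are chosen epsilon-greedily, and a single temporal-difference update is applied to one illustrative transition (the state size, action count, and hyper-parameters are arbitrary).

```python
import random
import torch
import torch.nn as nn

# Q-network: the state vector goes in, one Q-value per possible action comes out.
n_state, n_actions = 20, 4
q_net = nn.Sequential(nn.Linear(n_state, 64), nn.ReLU(), nn.Linear(64, n_actions))

def select_action(state: torch.Tensor, epsilon: float = 0.1) -> int:
    if random.random() < epsilon:                 # explore with probability epsilon
        return random.randrange(n_actions)
    with torch.no_grad():                         # exploit: argmax over the Q-values
        return int(q_net(state).argmax().item())

# One illustrative temporal-difference update for a single (state, action, reward) step.
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
state, next_state = torch.randn(n_state), torch.randn(n_state)
action, reward, gamma = select_action(state), 1.0, 0.99

target = reward + gamma * q_net(next_state).max().detach()
loss = nn.functional.mse_loss(q_net(state)[action], target)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```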
Content-based spoken data retrieval has a high degree of uncertainty and noisy retrieval results, unlike content-based text retrieval. Hung-Yi Lee et al. [134] presented a Deep Q-Network (DQN) that determines the machine action without hand-crafted inputs, using the Mandarin Chinese broadcast news corpus for experimentation. Double DQN and Dueling DQN achieved better performance than the simple DQN. DQN is also used for speech volume control in humanoid robots to improve human-robot interaction [135] and for dialogue policy decisions in chat systems [136].
DRL faces many challenges when applied to real-world problems because of uncertainty in the action states. DRL performance can be improved by combining it more deeply with other AI-based techniques to obtain interpretability, generalization, and better sample complexity.
Fig. 10 Deep Reinforcement Q-Learning: the state is given to a deep network that outputs a Q-value for each of the N possible actions
4 Database
Database size, variability, and quality play a vital role in the performance of deep learning algorithms. Various standard databases for speech recognition, speaker recognition, and voice activity detection are available online in open-source, licensed, or public mode. TIMIT is considered the baseline corpus for speech and speaker recognition, consisting of 10 American English sentences (about 30 s of speech) from each of 630 speakers [137]. LibriSpeech is a 1000-hour 16 kHz English speech corpus mostly used for phoneme classification [138]. The VoxCeleb database is a large-scale text-independent speaker recognition corpus consisting of 153,486 utterances from 1251 celebrities, extracted from YouTube videos [139]. Aurora-4 consists of clean and noisy data (noise added to the Wall Street Journal (WSJ0) corpus) with different SNRs at two sampling rates, 8 kHz and 16 kHz, and is frequently used for noise-robust speech recognition [140]. The Mandarin transcription task consists of 76,843 speech samples (about 64 h of speech) from 1500 speakers, along with an independent test set containing 3720 samples (about 3 h) from an additional 50 speakers [141]. CHiME is a medium-vocabulary database that consists of an English speech corpus with transcripts (342 h) from noisy environments and 50 h of noisy audio samples [142]. REVERB-Challenge is used for reverberant speech recognition and enhancement tasks [143]. The Topic Detection and Tracking Phase 3 (TDT3) speech recognition database consists of samples from Chinese Mandarin news broadcasts [82]. The SWITCHBOARD database, which consists of 2500 conversations from 500 speakers, is often used for speech recognition and speaker recognition [144].
The performance of a speech emotion recognition system hugely depends upon the emotion database, as emotion signals are largely subject to the language, the length of the sample, and noise. Thus, a speech emotion database needs many samples from many users in different environmental conditions. EmoDB is a speech emotion database generally used for speech emotion recognition, consisting of 500 samples of happy, fearful, angry, anxious, bored, and disgusted emotions recorded from 10 actors [145]. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, multi-speaker speech emotion database that consists of 12 h of anger, happiness, sadness, and neutral emotion speech samples. eNTERFACE'05 is an audiovisual emotion database used for collaborative speech and image emotion recognition [146]. The MSP-Podcast speech emotion database consists of 22,630 samples from male and female speakers [147]. The Isophonics [148] music database includes 19 songs by Queen, 180 songs by the Beatles, and 18 songs by Zweieck. RWC-Popular [149] contains 100 American- and Japanese-style pop songs. MIR-1K is a music database consisting of 1000 music samples used for music singer recognition and singing voice separation [150]. For speech enhancement, noisy samples are required, but acquiring a large number of samples under various noisy conditions is difficult. Therefore, many researchers perform experimentation by adding standard noise signals to the available databases. NOISEX is a noise database that consists of various noises such as factory noise, voice babble, pink noise, HF radio channel noise, different military noises, white noise, and Volvo 340 car noise. It is commonly used for evaluating noise-robust speech recognition, enhancement, and separation applications, along with other speech corpora [151].
Most standard speech corpora are available only in the English language; there is a lack of large authoritative databases in other languages, making it challenging to check performance variability and language independence. The databases published so far focus on sample collection from experts, actors, or ordinary people. Still, very little attention
is given to voice samples from speakers with speech impairments, such as stammering, stuttering, and dysarthric speech. Moreover, most speech corpora are not publicly available, limiting the exploration and validation of researchers' results.
5 Evaluation Metrics
Evaluation metrics differ across speech processing applications. For speech enhancement, various subjective and objective metrics are used. The Perceptual Evaluation of Speech Quality (PESQ) predicts subjective speech quality, with scores ranging from -0.5 (extreme distortion) to 4.5 (no distortion). It can be applied under a wide range of conditions, such as background noise, variable delay, and analog filtering; the larger the PESQ score, the better the predicted speech quality [42]. The Short-Time Objective Intelligibility (STOI) measure estimates the objective intelligibility of a corrupted audio signal from the correlation between the temporal envelopes of the corrupted speech signal and its clean reference. STOI scores have been observed to correlate strongly with human speech intelligibility scores [77, 78]. Extended STOI (ESTOI) represents the objective intelligibility of a corrupted speech signal by calculating the spectral correlation coefficients of the corrupted signal and its clean reference in short time segments; unlike STOI, ESTOI does not assume that frequency bands are mutually independent. Both STOI and ESTOI scores range from 0 to 1, and higher scores indicate better predicted intelligibility [152]. Performance metrics such as the signal-to-distortion ratio (SDR), signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR) [153] characterize the quality of the speech signal after the enhancement or separation process. In [97], two further evaluation metrics are used: the hearing-aid speech quality index (HASQI) and the hearing-aid speech perception index (HASPI).
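For reference, these objective scores are commonly computed with open-source Python implementations. The sketch below assumes the third-party `pesq` and `pystoi` packages are installed, uses hypothetical file names, and assumes 16 kHz recordings for PESQ's wideband mode; it illustrates the metrics rather than any surveyed system.

```python
import soundfile as sf           # assumed available for reading WAV files
from pesq import pesq            # ITU-T P.862 implementation from the `pesq` package
from pystoi import stoi          # STOI / ESTOI implementation from the `pystoi` package

# Hypothetical file names: a clean reference and the corresponding enhanced output,
# both assumed to be mono recordings sampled at 16 kHz.
clean, fs = sf.read("clean_utterance.wav")
enhanced, _ = sf.read("enhanced_utterance.wav")

# PESQ ranges roughly from -0.5 to 4.5; "wb" selects the wideband (16 kHz) mode.
pesq_score = pesq(fs, clean, enhanced, "wb")

# STOI and ESTOI lie in [0, 1]; higher means better predicted intelligibility.
stoi_score = stoi(clean, enhanced, fs, extended=False)
estoi_score = stoi(clean, enhanced, fs, extended=True)

print(f"PESQ={pesq_score:.2f}  STOI={stoi_score:.2f}  ESTOI={estoi_score:.2f}")
```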
Word Error Rate (WER) is the standard evaluation metric for speech recognition. After aligning the hypothesis and reference word sequences, it is computed as the number of insertions, deletions, and substitutions divided by the total number of reference words [81, 82]. The performance of speaker recognition systems is evaluated using miss probability, false alarm probability, Equal Error Rate (EER), and percentage accuracy, which give statistical measures of correctly and wrongly recognized speakers. For binary classification tasks, the F1-score, which combines precision and recall, is a popular evaluation metric; the larger the F1-score, the better the algorithm performance. False Alarm Rate (FAR) and Miss Detection Rate (MDR) are also standard evaluation metrics for speech recognition, and several deep learning algorithms aim at reducing FAR and MDR. Lotfian and Busso [91] investigated the concordance correlation coefficient (CCC) as a metric for evaluating speech emotion recognition.
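As a concrete illustration of the WER definition above, the following minimal sketch aligns the two word sequences with a standard Levenshtein edit distance over words; the example transcripts are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein alignment).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# Hypothetical recognizer output vs. reference transcript: one substitution out of
# six reference words gives WER = 1/6 ~= 0.167.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```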
6 Discussion
This section provides a comparative analysis and discussion of the various deep learning frameworks applied to different speech processing applications in terms of methodology, database, and performance. It mainly focuses on speech preprocessing, speech recognition, speaker recognition, emotion recognition, and other speech processing domains.
Table 1 gives a comparative analysis of various deep learning algorithms for speech pre-processing, namely speech enhancement and speech separation.
Table 1 Comparative analysis of various deep learning algorithms for speech enhancement and speech separation

Sr. No | Authors | Application | Methodology | Database | Evaluation metrics | Performance
1 | Yong Xu et al. [61] | Speech enhancement | Regressive multiple restricted Boltzmann machines (RBMs) | TIMIT | PESQ | 2.83 (car noise); 2.47 (exhibition noise)
2 | Arun Narayanan et al. [81] | Speech separation and noisy speech recognition | Diagonal feature discriminant linear regression (dFDLR) and deep neural network (DNN) | Aurora-4 medium-large vocabulary | Word error rate (%) | 4.8% (clean training)
3 | Tae Gyoon Kang et al. [75] | Source separation, speech enhancement | Deep neural network with non-negative matrix factorization (NMF) | TIMIT and NOISEX-92 noise dataset | SDR; SIR; SAR; PESQ | SDR 8.74; SIR 11.20; SAR 13.91; PESQ 2.23
4 | Emmanuel Affonso et al. [72] | Speech quality assessment | Deep belief network (DBN) and radial basis function SVM (RBF-SVM) | ITU-T recommendation | Accuracy (%) | 95.00%
5 | Jen-Cheng Hou et al. [97] | Speech enhancement | Audio-visual deep CNNs (AVDCNN) | Taiwan Mandarin hearing in noise test (Taiwan MHINT) | PESQ; STOI; SDI; HASQI; HASPI | PESQ 2.41; STOI 0.66; SDI 0.45; HASQI 0.43; HASPI 0.99
6 | Yi Luo et al. [98] | Speech separation | Deep attractor network (DANet) | Wall Street Journal dataset | SDR | 10.4 (2 speakers); 8.5 (3 speakers)
7 | Yan Zhao et al. [78] | Speech enhancement | Deep neural network | IEEE corpus and Diverse Environments Multichannel Acoustic Noise Database (DEMAND) | STOI (%); PESQ | STOI 79.4, PESQ 2.05 (Dliving noise); STOI 68.5, PESQ 1.66 (Pcafeter noise); STOI 70.7, PESQ 1.66 (Babble noise)
Subjective and qualitative analysis based on different evaluation metrics has shown that RBMs and DBNs give better performance than the other approaches. The performance of speech enhancement is limited by the unavailability of the noise phase spectrogram, large time complexity, and poor performance for online speech enhancement.
In recent years, many deep learning frameworks have been adopted for ASR, and their performance is evaluated mostly in terms of percentage accuracy, word error rate, and hit rate. It is observed that, because of the higher correlation and representation capability of the CNN, deep architectures based on CNNs have given better performance for ASR. The performance of deep learning algorithms for speech recognition remains challenging because of cross-domain training, noisy training data, language dependency, and varying environmental conditions (Table 2).
The performance of various deep learning algorithms for speaker recognition applications is shown in Table 3. Along with speaker recognition, it covers music singer recognition and spoofing detection. The performance of speaker recognition systems is mostly measured on the basis of equal error rate (%EER), false alarm rate (FAR), and recognition accuracy. It is observed that DNN and CNN-LSTM models represent speaker-specific characteristics better and result in better performance.
Table 4 gives a detailed comparative analysis of speech emotion recognition using deep learning algorithms. Deep learning algorithms such as DCNN, DBN, and RBM have been applied successfully to speech emotion recognition. Easy training and the weight-sharing capability of deep learning algorithms have significantly improved machine learning-based speech emotion recognition systems. Various multimodal emotion recognition systems that consider audio and video data together have been suggested to increase emotion recognition performance. The performance of deep learning algorithms is often restricted by over-learning during memorization of layer-wise information, complex architectures, language dependency, and temporal variation in the input data.
Deep learning is becoming more popular in various speech processing fields because of its higher representation ability and its ability to handle complex problems and large databases. Table 5 shows the comparative analysis of miscellaneous speech processing applications such as spoken content retrieval, language identification, dialect identification, and sound event recognition. Supervised deep learning models have given superior performance for the various pattern-based recognition applications.
7 Conclusion
This paper has presented a comprehensive review of deep learning architectures and their applications to speech processing over the past few years. Various modern deep learning models in different learning groups, including unsupervised, supervised, semi-supervised, and Reinforcement Learning (RL), and their applications in different domains are reviewed. The paper presented the structure and applications of Auto-Encoders (AE), Generative Adversarial Networks (GAN), Restricted Boltzmann Machines (RBM), Deep Belief Networks (DBN), Deep Neural Networks (DNN), Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and Deep Reinforcement Learning (DRL) for various speech processing applications, focusing on major tasks such as speech enhancement, speech separation, speech recognition, speaker recognition, emotion recognition, and natural language processing.
Table 2 Comparative analysis of various deep learning algorithms for speech recognition

Sr. No | Authors | Application | Methodology | Database | Evaluation metrics | Performance (%)
1 | Shaofei Xue et al. [83] | Speech recognition | Deep neural network | TIMIT large vocabulary switchboard task | Word error rate (%) | 12.10
2 | Yuxuan Wang and DeLiang Wang [60] | Speech recognition | DNN-SVM-SEG | TIMIT, IEEE Female, IEEE Male | HIT-false alarm rate (FA) | 63.8, 65.9, 64.7
3 | Dong Yu et al. [80] | Speech recognition | Deep tensor neural network (DTNN) | 30-hour SWB task | Word error rate (%) | 16.60
4 | Wang and Sim [82] | Speech recognition | Regression-based context-dependent DNN | Topic detection and tracking | Word error rate (%) | 10.80
Table 3 Comparative analysis of various deep learning algorithms for speaker recognition

Sr. No | Authors | Application | Methodology | Database | Evaluation metrics | Performance
1 | Chen and Salman [87] | Speaker recognition | Deep neural architecture (DNA) | TIMIT, Chinese (CHN) corpus | False alarm rate (FAR), miss detection rate (MDR), F1 (mean ± STD) | TIMIT: 0.25 ± 0.09, 0.19 ± 0.09, 0.74 ± 0.12; CHN: 0.21 ± 0.06, 0.34 ± 0.09, 0.68 ± 0.08
2 | Zhili Tan et al. [88] | Speaker verification | Deep neural network | NIST 2012 SRE | Equal error rate (%) | EER of 3.6% for 0 dB SNR
3 | Hong Yu et al. [89] | Speaker verification + spoofing detection | Deep neural network and human log-likelihoods (DNN-HLL) | ASVspoof2015 database | Equal error rate (%) | 0.05%
4 | Heinrich Dinkel et al. [108] | Speech spoofing detection | Joint convolutional LSTM deep neural network (CLDNN) | BTAS2016 and ASVspoof2015 | Half total error rate (HTER) | 0.19% and 0.0%
5 | Jati and Georgiou [100] | Speaker recognition | Neural predictive coding (NPC) and convolutional Siamese network | VoxCeleb database | Equal error rate (%), minimum normalized detection cost function (minDCF) | 7.21%, 0.61
6 | Nguyen An et al. [101] | Speaker identification (text-independent) | Convolutional neural network (CNN) | VoxCeleb database | Accuracy (%) | 88.2% (VGG + self-attention layer); 90.8% (ResNet + self-attention layer)
7 | Zebang Shen et al. [120] | Singer recognition | Long short-term memory (LSTM) | Chinese SID in MIR-1K | Accuracy (%) | 88.40%
8 | Hourri et al. [73] | Speaker recognition | MFCC + DNN | THUYG-20 SRE corpus | % EER | 0.43% and 3.07% (female); 0.55% and 3.19% (male)
9 | Arsha Nagrani et al. [102] | Speaker recognition | Thin-ResNet with GhostVLAD | VoxCeleb database | % EER | 2.87%
Table 4 Comparative analysis of various deep learning algorithms for speech emotion recognition

Sr. No | Authors | Application | Methodology | Database | Evaluation metrics | Performance
1 | Shah et al. [62] | Emotion recognition | Supervised replicated softmax model (sRSM) based on RBM | Arousal IEMOCAP, Valence IEMOCAP, Arousal SEMAINE, Valence SEMAINE | Weighted average recall rate | 72.00%, 60.00%, 66.35%, 66.45%
2 | Wen et al. [70] | Emotion recognition | Random deep belief networks (RDBN) | EMODB, CASIA, SAVEE | Weighted average accuracy | 82.32%, 48.50%, 53.60%
3 | Zhang et al. [103] | Emotion recognition | DCNN and discriminant temporal pyramid matching (DTPM) | EMO-DB, RML, eNTERFACE05 and BAUM-1s | Recognition accuracy | 87.31% (EMO-DB), 69.70% (RML), 76.56% (eNTERFACE), 44.61% (BAUM-1s)
4 | Jianfeng Zhao et al. [104] | Emotion recognition | 1D-CNN | Berlin EmoDB and IEMOCAP databases | Accuracy (%) | 89.77% and 86.36%
5 | Reza Lotfian et al. [91] | Emotion recognition | Deep neural network | MSP-Podcast database | F1-score | 42.10%
6 | Elias Ocquaye et al. [106] | Emotion recognition | Dual exclusive attentive transfer (DEAT) based unsupervised CNN | ABC corpus, emo-DB database | Unweighted average recall (UAR) | 65.02% (ABC corpus) and 67.79% (emo-DB database)
7 | Tripathi et al. [107] | Emotion recognition | CNN-MFCC + TEXT | IEMOCAP data | Accuracy (%) | 76.10%
8 | Jianfeng Zhao et al. [122] | Emotion recognition | 2-D CNN LSTM | Berlin EmoDB and Interactive Emotional Dyadic Motion Capture (IEMOCAP) | Accuracy (%) | 95.33% (speaker dependent) and 95.89% (speaker independent); 89.16% (speaker dependent) and 52.14% (speaker independent)
9 | Hossain et al. [105] | Emotion recognition (speech and video) | Convolutional extreme learning machines (ELMs) | eNTERFACE'05 audio-visual emotion database | Accuracy (%) | 99.90%
Table 5 Comparative analysis of various deep learning algorithms for miscellaneous speech processing applications

Sr. No | Authors | Application | Methodology | Database | Evaluation metrics | Performance
1 | Hung Lee et al. [134] | Spoken content retrieval | Deep reinforcement learning with deep Q-network (DQN) | Stanford question answering dataset | Word error rate (%) | 22.70%
2 | Bingquan Liu et al. [92] | Speech data retrieval | Deep neural network | Baidu Tieba corpus (BTC) and Reddit corpus (RC) | Accuracy (%) | 71.60% (BTC) and 72.46% (RC)
3 | Zhiyuan Tang et al. [115] | Language identification | LSTM-RNN | Babel database and the AP16-OLR database | Equal error rate (%) | 5.70% (Babel) and 6.34% (AP16-OLR)
4 | Qian Zhang et al. [47] | Language/dialect recognition | Generative autoencoder | CHINESE corpus, PAN-ARABIC and MBG-3 | Accuracy (%) | 97.8%, 81.3%, and 65.4%
5 | Ruhi Sarikaya et al. [69] | Natural language processing | Deep belief network (DBN) | Call routing database | Accuracy (%) | 90.80%
6 | Kun Han et al. [116] | Pitch tracking | DNN and RNN | TIMIT and NOISEX-92 noise dataset | Detection rate (%) | 66.4% (DNN) and 66.2% (RNN)
7 | Chien-Yao Wang et al. [71] | Sound event recognition | Hierarchical-diving deep belief network (HDDBN) | RWCP dataset | Accuracy (%) | 99.27% (clean data) and 95.06% (noisy data, 0 dB SNR)
8 | Bui et al. (2019) | Human-robot interaction | Deep Q-network | RobotSVA | Std. error | 0.103
For speech processing, the speech signal is typically converted into a two-dimensional spectral representation and given to the deep learning architecture as input. Log-Mel spectrogram or MFCC features provide a compact two-dimensional representation of the speech signal, but they may corrupt the temporal variation properties and discard the phase information of the original signal. Therefore, raw speech is used for speech separation and enhancement applications.
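For illustration, the sketch below extracts a log-Mel spectrogram and MFCCs with the `librosa` package; the file name and parameter values (16 kHz sampling, 25 ms windows with a 10 ms hop, 64 Mel bands, 13 MFCCs) are hypothetical choices rather than settings used by the surveyed systems.

```python
import librosa

# Hypothetical input file; 16 kHz is a common rate for speech front ends.
y, sr = librosa.load("utterance.wav", sr=16000)

# Log-Mel spectrogram: a compact two-dimensional time-frequency representation.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
log_mel = librosa.power_to_db(mel)           # shape: (n_mels, n_frames)

# MFCCs: a further compressed cepstral representation (phase is discarded).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

print(log_mel.shape, mfcc.shape)
```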
Unsupervised auto-encoders (AE) have given superior performance for speech enhancement, restoration, reverberation minimization, noise estimation, and speech separation in noisy environmental conditions. Unsupervised GANs have attracted considerable attention for noise-robust speech recognition and text-to-speech conversion because of their ability to learn the internal properties of the data and to generate output similar to the input. However, the GAN model suffers from non-convergence, unstable training, sensitivity to hyper-parameter selection, and over-fitting due to imbalance between the discriminator and generator networks. RBMs and DBNs are widely used for speech separation, enhancement, and emotion recognition applications because of their optimal and discriminative feature learning capability. DNNs have better representation capability for non-linear functions because of their multiple hidden layers; they are efficient for speech enhancement, speech recognition, speaker recognition, etc., because they combine feature extraction and classification layers. CNNs have fixed receptive fields that limit the temporal context that can be considered for speech recognition, speaker recognition, and emotion recognition. RNNs can exploit unlimited temporal context, which can be learned using the adaptive LSTM model, but they need sequential processing of the input speech signal, making them slower than CNNs. Therefore, the CRNN offers a compromise, inheriting both the advantages and disadvantages of CNNs and RNNs (a minimal sketch is given after this paragraph). Deep reinforcement learning and deep Q-learning models are popular for speech enhancement and dialogue-based systems in robotics because of their ability to balance exploration and exploitation. DRL is avoided for real-time applications because of its extensive response time and the uncertainty in its action states.
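To make the CNN/RNN trade-off concrete, the following minimal CRNN sketch combines a small convolutional front end with a bidirectional LSTM, written in PyTorch with hypothetical layer sizes and an utterance-level output; it is illustrative only and does not reproduce any specific architecture from the surveyed works.

```python
import torch
import torch.nn as nn


class SimpleCRNN(nn.Module):
    """Convolutional front end for local spectral patterns + LSTM for temporal context."""

    def __init__(self, n_mels=64, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                    # pool over frequency only
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(input_size=64 * (n_mels // 4), hidden_size=128,
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):                            # x: (batch, 1, n_mels, n_frames)
        f = self.conv(x)                             # (batch, 64, n_mels // 4, n_frames)
        b, c, freq, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * freq)  # one feature vector per frame
        h, _ = self.rnn(f)                           # temporal modelling over frames
        return self.fc(h.mean(dim=1))                # utterance-level prediction


# Example: two hypothetical log-Mel spectrograms of 100 frames each.
logits = SimpleCRNN()(torch.randn(2, 1, 64, 100))
print(logits.shape)                                  # torch.Size([2, 10])
```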
Pre-trained models for audio data that can be used to learn from raw samples for large-vocabulary speech, speaker, and emotion recognition are not readily available. Compared to traditional methods, deep learning models require larger computational power and more training data. General-purpose CPUs are not well suited for deep learning model implementation; instead, general-purpose graphics processing units (GPGPUs), which are optimized for matrix operations, and application-specific integrated circuits such as proprietary tensor processing units (TPUs) are used. This restricts the deployment of deep learning models on smaller devices such as mobile phones or hearing aids.
Deep learning can be extended further to improve emerging speech processing applications such as spoofing detection, speech pathology, robotics, automation, auto-tagging, audio content retrieval on social media, hate speech detection, stress detection, autism detection, and audio conferencing. Deep learning models still fail to reveal the significance of the input speech features and their inner working principles. It is observed that the superior performance of deep learning models is achieved at the cost of network complexity, which is frequently challenging to optimize and prone to over-fitting without a huge number of training samples for the many parameters. Finally, emerging deep learning research in speech processing aims at achieving high efficiency for data-intensive applications; however, this requires a vigilant selection of models and model parameters to guarantee model robustness.
Funding None.
Data Availability Enquiries about data availability should be directed to the authors.
Declarations
Conflict of Interest The authors declare that they have no conflict of interest.
References
1. Sarker, I. H. (2021). Deep learning: A comprehensive overview on techniques, taxonomy, applica-
tions and research directions. SN Computer Science, 2(6), 1–20.
2. Otter, D. W., Medina, J. R., & Kalita, J. K. (2020). A survey of the usages of deep learning for
natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2),
604–624.
3. Alam, M., Samad, M. D., Vidyaratne, L., Glandon, A., & Iftekharuddin, K. M. (2020). Survey on
deep neural networks in speech and vision systems. Neurocomputing, 417, 302–321.
4. Watanabe, S., & Araki, S. (2019). Introduction to the issue on far-field speech processing in the era of
deep learning: speech enhancement, separation, and recognition. IEEE Journal of Selected Topics in
Signal Processing, 13(4), 785–786.
5. Raj, D., Denisov, P., Chen, Z., Erdogan, H., Huang, Z., He, M., Watanabe, S., Du, J., Yoshioka, T.,
Luo, Y., & Kanda, N. (2021). Integration of speech separation, diarization, and recognition for multi-
speaker meetings: System description, comparison, and analysis. In 2021 IEEE spoken language tech-
nology workshop (SLT), pp. 897–904. IEEE.
6. Suh, J. Y., Bennett, C. C., Weiss, B., Yoon, E., Jeong, J., & Chae, Y. (2021). Development of speech dialogue systems for social AI in cooperative game environments. In IEEE region 10 symposium (TENSYMP 2021).
7. Hanifa, R. M., Isa, K., & Mohamad, S. (2021). A review on speaker recognition: Technology and
challenges. Computers & Electrical Engineering, 90, 107005.
8. Ntalampiras, S. (2021). Speech emotion recognition via learning analogies. Pattern Recognition Let-
ters, 144, 21–26.
9. Deng, L., Hassanein, K., & Elmasry, M. (1994). Analysis of the correlation structure for a neural pre-
dictive model with application to speech recognition. Neural Networks, 7(2), 331–339.
10. Cohen, J., Kamm, T., & Andreou, A. (1995). Vocal tract normalization in speech recognition: Compensation for systematic speaker variability. The Journal of the Acoustical Society of America, 97(5), 3246–3247.
11. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on
Signal Processing, 45(11), 2673–2681. https://doi.org/10.1109/78.650093
12. Hermansky, H., Ellis, D. P. W., & Sharma, S. (2000). Tandem connectionist feature extraction for
conventional HMM systems. In 2000 IEEE international conference on acoustics, speech, and signal
processing proceedings (Cat. No.00CH37100), Istanbul, Turkey, vol. 3, pp. 1635–1638. https://doi.
org/10.1109/ICASSP.2000.862024.
13. Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., & Zweig, G. (2005). fMPE: Discriminatively trained features for speech recognition. In Proceedings IEEE ICASSP'05, pp. 961–964.
14. Morgan, N., et al. (2005). Pushing the envelope: Aside [speech recognition]. IEEE Signal Processing
Magazine, 22(5), 81–88. https://doi.org/10.1109/MSP.2005.1511826
15. Grezl, F., Karafiat, M., Kontar, S., & Cernocky, J. (2007). Probabilistic and bottle-neck features for
LVCSR of meetings. In 2007 IEEE international conference on acoustics, speech and signal process-
ing-ICASSP ’07, Honolulu, HI, pp. IV-757-IV-760. https://doi.org/10.1109/ICASSP.2007.367023.
16. Morgan, N. (2012). Deep and wide: Multiple layers in automatic speech recognition. IEEE Transac-
tions on Audio, Speech, and Language Processing, 20(1), 7–13. https://doi.org/10.1109/TASL.2011.
2116010
17. Rabiner, L. R., & Schafer, R. W. (2007). Introduction to digital speech processing. Now Publishers
Inc.
18. Van Gilse, P. H. G. (1948). Another method of speech without larynx. Acta Oto-Laryngologica,
36(sup78), 109–110.
19. Everest, F. A., & Pohlmann, K. (2009). Master handbook of acoustics. McGraw-Hill/TAB
Electronics.
20. Haneche, H., Ouahabi, A., & Boudraa, B. (2021). Compressed sensing-speech coding scheme
for mobile communications. Circuits, Systems, and Signal Processing. https://doi.org/10.1007/
s00034-021-01712-x
21. Sonawane, A., Inamdar, M. U., & Bhangale, K. B. (2017). Sound based human emotion recogni-
tion using MFCC & multiple SVM. In 2017 international conference on information, communication,
instrumentation and control (ICICIC), pp. 1–4. IEEE.
22. Bhangale, K. B., Titare, P., Pawar, R., & Bhavsar, S. (2018). Synthetic speech spoofing detection
using MFCC and radial basis function SVM. IOSR Journal of Engineering (IOSRJEN), 8(6), 55–61.
23. Bhangale, K. B., & Mohanaprasad, K. (2021). A review on speech processing using machine learning
paradigm. International Journal of Speech Technology, 24(2), 367–388.
24. Nirmal, J., Zaveri, M., Patnaik, S., & Kachare, P. (2014). Voice conversion using general regression
neural network. Applied Soft Computing, 24, 1–12.
25. Amrouche, A., Taleb-Ahmed, A., Rouvaen, J. M., & Yagoub, M. C. E. (2009). Improvement of the
speech recognition in noisy environments using a nonparametric regression. International Journal of
Parallel, Emergent and Distributed Systems, 24(1), 49–67.
26. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. https://doi.org/10.
1038/nature14539
27. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspec-
tives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. https://
doi.org/10.1109/TPAMI.2013.50
28. Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logis-
tic regression and naive bayes. In Proceedings of the 14th international conference on neural infor-
mation processing systems, Cambridge, MA, USA: MIT Press, 2001, pp. 841–848.
29. LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010). Convolutional networks and applications in
vision. In Proceedings of 2010 IEEE international symposium on circuits and systems, pp. 253–256.
30. Purwins, H., Li, Bo., Virtanen, T., Schlüter, J., Chang, S.-Y., & Sainath, T. (2019). Deep learning for
audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2), 206–219.
31. Chen, X. W., & Lin, X. (2014). Big data deep learning: Challenges and perspectives. IEEE Access, 2,
514–525.
32. Shrestha, A., & Mahmood, A. (2019). Review of deep learning algorithms and architectures. IEEE
Access, 7, 53040–53065.
33. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. In Adaptive computation and
machine learning series (p. 775). MIT Press. https://mitpress.mit.edu/books/deep-learning.
34. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A
simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research,
15(1), 1929–1958.
35. Strom, N. (2015). Scalable distributed DNN training using commodity GPU cloud computing. In Six-
teenth annual conference of the international speech communication association.
36. Jolliffe, I. T. (2002). Mathematical and statistical properties of sample principal components. In:
Principal Component Analysis. Springer Series in Statistics. Springer, New York. https://doi.org/10.
1007/0-387-22440-8_3.
37. Noda, K. (2013). Multimodal integration learning of object manipulation behaviors using deep neural
networks. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems,
pp. 1728–1733.
38. Lu, X., Matsuda, S., Hori, C., & Kashioka, H. (2012). Speech restoration based on deep learning
autoencoder with layer-wised pretraining. In 13th annual conference of the international speech com-
munication association.
39. Lu, X., Matsuda, S., Hori, C., & Kashioka, H. (2012). Speech restoration based on deep learning
autoencoder with layer-wised learning. In INTERSPEECH, Portland, Oregon, Sept. 2012.
40. Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising auto-
encoder. In Proceedings of interspeech, pp. 436–440.
41. Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2014). Ensemble modeling of denoising autoencoder for
speech spectrum restoration. In Proceedings of the annual conference of the international speech
communication association, INTERSPEECH, pp 885–889.
42. Sun, M., Zhang, X., Van Hamme, H., & Zheng, T. F. (2016). Unseen noise estimation using sepa-
rable deep auto encoder for speech enhancement. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 24(1), 93–104. https://doi.org/10.1109/TASLP.2015.2498101.
43. Safari, R., Ahadi, S. M., & Seyedin, S. (2017). Modular dynamic deep denoising autoencoder for
speech enhancement. In 2017 7th international conference on computer and knowledge engineering
(ICCKE), Mashhad, pp. 254–259. https://doi.org/10.1109/ICCKE.2017.8167886.
44. Agrawal, P., & Ganapathy, S. (2019). Modulation filter learning using deep variational networks
for robust speech recognition. IEEE Journal of Selected Topics in Signal Processing, 13(2),
244–253.
45. Leglaive, S., Alameda-Pineda, X., Girin, L., & Horaud, R. (2020). A recurrent variational autoen-
coder for speech enhancement. In ICASSP 2020–2020 IEEE international conference on acous-
tics, speech and signal processing (ICASSP), Barcelona, Spain, pp. 371–375. https://doi.org/10.
1109/ICASSP40776.2020.9053164.
46. Li, Y., Zhang, X., Li, X., Zhang, Y., Yang, J., & He, Q. (2018). Mobile phone clustering from
speech recordings using deep representation and spectral clustering. IEEE Transactions on Infor-
mation Forensics and Security, 13(4), 965–977. https://doi.org/10.1109/TIFS.2017.2774505
47. Zhang, Q., & Hansen, J. H. L. (2018). Language/dialect recognition based on unsupervised deep
learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(5), 873–882.
48. Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech repre-
sentation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(12), 2041–2053.
49. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.,
& Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing
systems, pp. 2672–2680.
50. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:
1411.1784.
51. Qian, Y., Hu, Hu., & Tan, T. (2019). Data augmentation using generative adversarial networks for
robust speech recognition. Speech Communication, 114, 1–9.
52. Pascual, S., Serra, J., & Bonafonte, A. (2019). Time-domain speech enhancement using generative
adversarial networks. Speech Communication, 114, 10–21.
53. Kaneko, T., Kameoka, H., Hojo, N., Ijima, Y., Hiramatsu, K., & Kashino, K. (2017). Generative
adversarial network-based postfilter for statistical parametric speech synthesis. In 2017 IEEE interna-
tional conference on acoustics, speech and signal processing (ICASSP), pp. 4910–4914. IEEE.
54. Kaneko, T., Takaki, S., Kameoka, H., & Yamagishi J. (2017). Generative adversarial network-
based postfilter for STFT spectrograms. In Interspeech, pp. 3389–3393.
55. Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang H. M. (2017). Voice conversion from una-
ligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv
preprint arXiv:1704.00849.
56. Mimura, M., Sakai, S., & Kawahara, T. (2017). Cross-domain speech recognition using nonparal-
lel corpora with cycle-consistent adversarial networks. In 2017 IEEE automatic speech recogni-
tion and understanding workshop (ASRU), pp. 134–140. IEEE.
57. Hu, H., Tan, T., & Qian, Y. (2018). Generative adversarial networks based data augmentation for
noise robust speech recognition. In 2018 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pp. 5044–5048. IEEE.
58. Freund, Y., & Haussler, D. (1992). Unsupervised learning of distributions on binary vectors using
two layer networks. In Advances in neural information processing systems, pp. 912–919.
59. Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on machine learning, pp. 536–543.
60. Wang, Y., & Wang, D. (2013). Towards scaling up classification-based speech separation. IEEE
Transactions on Audio, Speech, and Language Processing, 21(7), 1381–1390. https://doi.org/10.
1109/TASL.2013.2250961
61. Xu, Y., Du, J., Dai, L., & Lee, C. (2014). An experimental study on speech enhancement based on
deep neural networks. IEEE Signal Processing Letters, 21(1), 65–68. https://doi.org/10.1109/LSP.
2013.2291240
62. Shah, M., Chakrabarti, C., & Spanias, A. (2015). Within and cross-corpus speech emotion recog-
nition using latent topic model-based features. EURASIP Journal on Audio, Speech, and Music
Processing, 2015(1), 4.
63. Navamani, T. M. (2019). Efficient deep learning approaches for health informatics. In Deep learning
and parallel computing environment for bioengineering systems (pp. 503–519). Elsevier. https://doi.
org/10.1016/B978-0-12-816718-2.00014-2.
64. Rizk, Y., Hajj, N., Mitri, N., & Awad, M. (2019). Deep belief networks and cortical algorithms: A
comparative study for supervised classification. Applied Computing and Informatics, 15(2), 81–93.
65. Mohamed, A. R., Dahl, G., & Hinton, G. (2009). Deep belief networks for phone recognition. In Nips
workshop on deep learning for speech recognition and related applications, vol. 1, no. 9, p. 39.
66. Mohamed, A. R., Yu, D., & Deng L. (2010). Investigation of full-sequence training of deep belief
networks for speech recognition. In Eleventh annual conference of the international speech com-
munication association.
67. Mohamed, A.-R., Dahl, G. E., & Hinton, G. (2011). Acoustic modeling using deep belief net-
works. IEEE transactions on audio, speech, and language processing, 20(1), 14–22.
68. Zhang, X., & Wu, J. (2013). Deep belief networks based voice activity detection. IEEE Transac-
tions on Audio, Speech, and Language Processing, 21(4), 697–710. https://doi.org/10.1109/TASL.
2012.2229986
69. Sarikaya, R., Hinton, G. E., & Deoras, A. (2014). Application of deep belief networks for natural
language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(4), 778–784. https://doi.org/10.1109/TASLP.2014.2303296
70. Wen, G., Li, H., Huang, J., Li, D., & Xun, E. (2017). Random deep belief networks for recogniz-
ing emotions from speech signals. Computational Intelligence and Neuroscience. https://doi.org/
10.1155/2017/1945630
71. Wang, C., Wang, J., Santoso, A., Chiang, C., & Wu, C. (2018). Sound event recognition using
auditory-receptive-field binary pattern and hierarchical-diving deep belief network. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 26(8), 1336–1351. https://doi.org/10.
1109/TASLP.2017.2738443
72. Affonso, E. T., Rosa, R. L., & Rodríguez, D. Z. (2018). Speech quality assessment over lossy
transmission channels using deep belief networks. IEEE Signal Processing Letters, 25(1), 70–74.
https://doi.org/10.1109/LSP.2017.2773536
73. Hourri, S., & Kharroubi, J. (2020). A deep learning approach for speaker recognition. Interna-
tional Journal of Speech Technology, 23(1), 123–131.
74. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learn-
ing., 2(1), 1–127.
75. Kang, T. G., Kwon, K., Shin, J. W., & Kim, N. S. (2015). NMF-based Target source separation
using deep neural network. IEEE Signal Processing Letters, 22(2), 229–233. https://doi.org/10.
1109/LSP.2014.2354456
76. Nie, S., Liang, S., Liu, W., Zhang, X., & Tao, J. (2018). Deep learning based speech separation via
NMF-style reconstructions. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(11), 2043–2055.
77. Zheng, N., & Zhang, X. (2019). Phase-aware speech enhancement based on deep neural networks.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1), 63–76. https://doi.
org/10.1109/TASLP.2018.2870742
78. Zhao, Y., Wang, Z., & Wang, D. (2019). Two-stage deep learning for noisy-reverberant speech
enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1),
53–62.
79. Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2012). Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Lan-
guage Processing, 20(1), 30–42. https://doi.org/10.1109/TASL.2011.2134090
80. Yu, D., Deng, L., & Seide, F. (2013). The deep tensor neural network with applications to large
vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing,
21(2), 388–396. https://doi.org/10.1109/TASL.2012.2227738
81. Narayanan, A., & Wang, D. (2014). Investigation of speech separation as a front-end for noise
robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(4), 826–835. https://doi.org/10.1109/TASLP.2014.2305833
82. Wang, G., & Sim, K. C. (2014). Regression-based context-dependent modeling of deep neural net-
works for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Process-
ing, 22(11), 1660–1669. https://doi.org/10.1109/TASLP.2014.2344855
83. Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., & Liu, Q. (2014). Fast adaptation of deep neural
network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 22(12), 1713–1725. https://doi.org/10.1109/TASLP.2014.
2346313
84. Zhou, P., Jiang, H., Dai, L., Hu, Y., & Liu, Q. (2015). State-clustering based multiple deep neural
networks modeling approach for speech recognition. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 23(4), 631–642. https://doi.org/10.1109/TASLP.2015.2392944
85. Gao, J., Du, J., & Chen, E. (2019). Mixed-bandwidth cross-channel speech recognition via joint
optimization of dnn-based bandwidth expansion and acoustic modeling. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 27(3), 559–571. https://doi.org/10.1109/TASLP.
2018.2886739
86. Wu, C., Gales, M. J. F., Ragni, A., Karanasou, P., & Sim, K. C. (2018). Improving interpretability
and regularization in deep learning. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 26(2), 256–265. https://doi.org/10.1109/TASLP.2017.2774919
87. Chen, K., & Salman, A. (2011). Learning speaker-specific characteristics with a deep neural architec-
ture. IEEE Transactions on Neural Networks, 22(11), 1744–1756. https://doi.org/10.1109/TNN.2011.
2167240
88. Tan, Z., Mak, M., & Mak, B. K. (2018). DNN-based score calibration with multitask learning for
noise robust speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Process-
ing, 26(4), 700–712.
89. Yu, H., Tan, Z., Ma, Z., Martin, R., & Guo, J. (2018). Spoofing detection in automatic speaker veri-
fication systems using DNN classifiers and dynamic acoustic features. IEEE Transactions on Neural
Networks and Learning Systems, 29(10), 4633–4644.
90. Wang, Z., & Wang, D. (2019). Combining spectral and spatial features for deep learning based blind
speaker separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2),
457–468.
91. Lotfian, R., & Busso, C. (2019). Curriculum learning for speech emotion recognition from crowd-
sourced labels. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(4),
815–826.
92. Liu, B., Xu, Z., Sun, C., Wang, B., Wang, X., Wong, D. F., & Zhang, M. (2018). Content-oriented
user modeling for personalized response ranking in chatbots. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 26(1), 122–133. https://doi.org/10.1109/TASLP.2017.2763243
93. Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recog-
nition. Neural Networks, 1, 119–130.
94. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86, 2278–2324.
95. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology., 195(1), 215–243.
96. Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2021). A survey of convolutional neural networks:
Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems.
https://doi.org/10.1109/TNNLS.2021.3084827
97. Hou, J., Wang, S., Lai, Y., Tsao, Y., Chang, H., & Wang, H. (2018). Audio-visual speech enhance-
ment using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics
in Computational Intelligence, 2(2), 117–128.
98. Luo, Y., Chen, Z., & Mesgarani, N. (2018). Speaker-independent speech separation with deep attrac-
tor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4), 787–796.
99. Tan, T., Qian, Y., Hu, H., Zhou, Y., Ding, W., & Yu, K. (2018). Adaptive very deep convolutional
residual network for noise robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26(8), 1393–1405.
100. Jati, A., & Georgiou, P. (2019). Neural predictive coding using convolutional neural networks toward
unsupervised learning of speaker characteristics. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(10), 1577–1589.
101. An, N. N., Thanh, N. Q., & Liu, Y. (2019). Deep CNNs with self-attention for speaker identification.
IEEE Access, 7, 85327–85337. https://doi.org/10.1109/ACCESS.2019.2917470
102. Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). Voxceleb: Large-scale speaker verifica-
tion in the wild. Computer Speech & Language, 60, 101027.
103. Zhang, S., Zhang, S., Huang, T., & Gao, W. (2018). Speech emotion recognition using deep convolu-
tional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multime-
dia, 20(6), 1576–1590. https://doi.org/10.1109/TMM.2017.2766843
104. Zhao, J., Mao, X., & Chen, L. (2018). Learning deep features to recognise speech emotion using
merged deep CNN. IET Signal Processing, 12(6), 713–721. https://doi.org/10.1049/iet-spr.2017.0320
105. Hossain, M. S., & Muhammad, G. (2019). Emotion recognition using deep learning approach from
audio–visual emotional big data. Information Fusion, 49, 69–78.
106. Ocquaye, E. N. N., Mao, Q., Song, H., Xu, G., & Xue, Y. (2019). Dual exclusive attentive transfer for
unsupervised deep convolutional domain adaptation in speech emotion recognition. IEEE Access, 7,
93847–93857.
107. Tripathi, S., Kumar, A., Ramesh, A., Singh, C., & Yenigalla, P. (2019). Deep learning based emotion
recognition system using speech features and transcriptions. arXiv preprint arXiv:1906.05681.
108. Dinkel, H., Qian, Y., & Yu, K. (2018). Investigating raw wave deep neural networks for end-to-end
speaker spoofing detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(11), 2002–2014.
109. DiPietro, R., & Hager, G. D. (2020). Deep learning: RNNs and LSTM. In Handbook of medical
image computing and computer assisted intervention (pp. 503–519). Elsevier. https://doi.org/10.1016/
B978-0-12-816176-0.00026-0.
110. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint arXiv:1412.3555.
111. Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural net-
works. In 2013 IEEE international conference on acoustics, speech and signal processing, Vancou-
ver, BC, pp. 6645–6649. https://doi.org/10.1109/ICASSP.2013.6638947.
112. Qin, C.-X., Dan, Qu., & Zhang, L.-H. (2018). Towards end-to-end speech recognition with transfer
learning. EURASIP Journal on Audio, Speech, and Music Processing, 2018(1), 1–9.
113. de Benito-Gorron, D., Lozano-Diez, A., Toledano, D. T., & Gonzalez-Rodriguez, J. (2019). Explor-
ing convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a
large audio dataset. EURASIP Journal on Audio, Speech, and Music Processing, 2019(1), 9.
114. Kang, J., Zhang, W.-Q., Liu, W.-W., Liu, J., & Johnson, M. T. (2018). Advanced recurrent network-
based hybrid acoustic models for low resource speech recognition. EURASIP Journal on Audio,
Speech, and Music Processing, 2018(1), 6.
115. Tang, Z., Wang, D., Chen, Y., Li, L., & Abel, A. (2018). Phonetic temporal neural model for language
identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 134–144.
116. Han, K., & Wang, D. (2014). Neural network based pitch tracking in very noisy speech. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 22(12), 2158–2168. https://doi.org/10.
1109/TASLP.2014.2363410
117. Tan, K., & Wang, D. (2018). A convolutional recurrent neural network for real-time speech enhance-
ment. In Interspeech, pp. 3229–3233.
118. Li, A., Yuan, M., Zheng, C., & Li, X. (2020). Speech enhancement using progressive learning-based
convolutional recurrent neural network. Applied Acoustics, 166, 107347.
119. Vafeiadis, A., Fanioudakis, E., Potamitis, I., Votis, K., Giakoumis, D., Tzovaras, D., Chen, L., &
Hamzaoui, R. (2019). Two-dimensional convolutional recurrent neural networks for speech activity
detection. In International Speech Communication Association, pp. 2045–2049.
120. Shen, Z., Yong, B., Zhang, G., Zhou, R., & Zhou, Q. (2019). A deep learning method for Chinese
singer identification. Tsinghua Science and Technology, 24(4), 371–378. https://doi.org/10.26599/
TST.2018.9010121
121. Wu, Y., & Li, W. (2019). Automatic audio chord recognition with MIDI-trained deep feature and
BLSTM-CRF sequence decoding model. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 27(2), 355–366.
122. Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM
networks. Biomedical Signal Processing and Control, 47, 312–323.
123. Yu, Y., Si, X., Changhua, Hu., & Zhang, J. (2019). A review of recurrent neural networks: LSTM
cells and network architectures. Neural computation, 31(7), 1235–1270.
124. Goehring, T., Keshavarzi, M., Carlyon, R. P., & Moore, B. C. J. (2019). Using recurrent neural net-
works to improve the perception of speech in non-stationary noise by people with cochlear implants.
The Journal of the Acoustical Society of America, 146(1), 705–718.
125. Sutton, R. S., Barto, A. G., & Williams, R. J. (1992). Reinforcement learning is direct adaptive opti-
mal control. IEEE Control Systems, 12(2), 19–22.
126. Mnih,V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M.
(2013). Playing atari with deep reinforcement learning. In NIPS deep learning workshop.
127. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforce-
ment learning with function approximation. In Proceedings of the 12th international conference on
neural information processing systems, NIPS’99, pp. 1057–1063.
128. Weisz, G., Budzianowski, P., Su, P., & Gašić, M. (2018). Sample efficient deep reinforcement learn-
ing for dialogue systems with large action spaces. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26(11), 2083–2097. https://doi.org/10.1109/TASLP.2018.2851664
129. Chen, L., Chang, C., Chen, Z., Tan, B., Gašić, M., & Yu, K. (2018). Policy adaptation for deep rein-
forcement learning-based dialogue management. In 2018 IEEE international conference on acous-
tics, speech and signal processing (ICASSP), Calgary, AB, pp. 6074–6078. https://doi.org/10.1109/
ICASSP.2018.8462272.
130. Chen, L., Chen, Z., Tan, B., Long, S., Gašić, M., & Yu, K. (2019). AgentGraph: Toward univer-
sal dialogue management with structured deep reinforcement learning. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 27(9), 1378–1391. https://doi.org/10.1109/TASLP.2019.
2919872
131. Shen, Y. L., Huang, C. Y., Wang, S. S., Tsao, Y., Wang, H. M., & Chi, T. S. (2019). Reinforcement
learning based speech enhancement for robust speech recognition. In ICASSP 2019–2019 IEEE inter-
national conference on acoustics, speech and signal processing (ICASSP), pp. 6750–6754. IEEE.
132. Rajapakshe, T., Rana, R., Latif, S., Khalifa, S., & Schuller, B. W. (2019). Pre-training in deep
reinforcement learning for automatic speech recognition. arXiv preprint arXiv:1910.11256.
133. Kala, T., & Shinozaki, T. (2018). Reinforcement learning of speech recognition system based on
policy gradient and hypothesis selection. In 2018 IEEE international conference on acoustics,
speech and signal processing (ICASSP), Calgary, AB, pp. 5759–5763, https://doi.org/10.1109/
ICASSP.2018.8462656.
134. Lee, H., Chung, P., Wu, Y., Lin, T., & Wen, T. (2018). Interactive spoken content retrieval by deep
reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(12), 2447–2459.
135. Bui, H., & Chong, N. Y. (2019). Autonomous speech volume control for social robots in a noisy
environment using deep reinforcement learning. In 2019 IEEE international conference on robot-
ics and biomimetics (ROBIO), Dali, China, pp. 1263–1268. https://doi.org/10.1109/ROBIO49542.
2019.8961810.
136. Su, M., Wu, C., & Chen, L. (2020). Attention-based response generation using parallel double
Q-learning for dialog policy decision in a conversational system. IEEE/ACM Transactions on
Audio, Speech, and Language Processing, 28, 131–143. https://doi.org/10.1109/TASLP.2019.
2949687
137. Zue, V., Seneff, S., & Glass, J. (1990). Speech database development at MIT: TIMIT and beyond.
Speech Communication, 9(4), 351–356.
138. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An asr corpus based
on public domain audio books. In 2015 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pp. 5206–5210. IEEE.
139. Nagrani, A., Chung, J. S., & Zisserman, A. (2017). Voxceleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
140. Pearce, D., & Picone, J. (2002). Aurora working group: DSR front end LVCSR evaluation AU/384/02.
In Institute for signal & information processing, Mississippi State University, Technical Report.
141. Sinha, R., Gales, M. J., Kim, D. Y., Liu, X. A., Sim, K. C., & Woodland, P. C. (2006). The CU-
HTK mandarin broadcast news transcription system. In Proceedings of ICASSP 2006, May, 2006,
pp. 1077–1080.
142. Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). The fifth’CHiME’speech separation and
recognition challenge: Dataset, task and baselines. arXiv preprint arXiv:1803.10609.
143. Kinoshita, K., Delcroix, M., Gannot, S., Habets, E., Haeb-Umbach, R., Kellermann, W., Leutnant,
V., Maas, R., Nakatani, T., Raj, B., Sehr, A., & Yoshioka, T. (2016). A summary of the REVERB
challenge: state-of-the-art and remaining challenges in reverberant speech processing research.
EURASIP Journal on Advances in Signal Processing. https://doi.org/10.1186/s13634-016-0306-6
144. Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992) SWITCHBOARD: telephone speech corpus
for research and development. In [Proceedings] ICASSP-92: 1992 IEEE international conference
on acoustics, speech, and signal processing, San Francisco, CA, USA, vol. 1, pp. 517–520. https://
doi.org/10.1109/ICASSP.1992.225858.
145. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of
German emotional speech. In Proceedings of Interspeech.
146. Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., &
Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Jour-
nal of Language Resources and Evaluation, 42(4), 335–359.
147. Lotfian, R., & Busso, C. (2019). Building naturalistic emotionally balanced speech corpus by
retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective
Computing, 10(4), 471–483.
148. Black, D. (2014). Singing voice dataset.
149. Goto, M., Hashiguchi, H., Nishimura, T., & Oka, R. (2002). RWC music database: Popular, classi-
cal, and jazz music databases. In Proceedings of the 3rd international conference on music infor-
mation retrieval (ISMIR 2002), pp. 287–288.
150. Hsu, C., & Jang, J. R. (2010). On the improvement of singing voice separation for monaural
recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, and Language Pro-
cessing, 18(2), 310–319. https://doi.org/10.1109/TASL.2009.2026503
151. Varga, A., & Steeneken, H. J. M. (1993). Assessment for automatic speech recognition: II. NOI-
SEX-92: A database and an experiment to study the effect of additive noise on speech recognition
systems. Speech Communication, 12(3), 247–251.
152. Jensen, J., & Taal, C. H. (2016). An algorithm for predicting the intelligibility of speech masked
by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Process-
ing, 24(11), 2009–2022.
153. Vincent, E., Gribonval, R., & Fevotte, C. (2006). Performance measurement in blind audio source
separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4), 1462–1469.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.