
Wireless Personal Communications (2022) 125:1913–1949

https://doi.org/10.1007/s11277-022-09640-y

Survey of Deep Learning Paradigms for Speech Processing

Kishor Barasu Bhangale1 · Mohanaprasad Kothandaraman1

Accepted: 7 February 2022 / Published online: 4 March 2022


© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2022

Abstract
Over the past decades, particular focus has been given to research on machine learning techniques for speech processing applications. In the past few years, however, research has concentrated on using deep learning for speech processing. This new machine learning field has become a very attractive area of study and has delivered remarkably better performance than other approaches in various speech processing applications. This paper presents a brief survey of the application of deep learning to speech processing tasks such as speech separation, speech enhancement, speech recognition, speaker recognition, emotion recognition, language recognition, music recognition, speech data retrieval, etc. The survey goes on to cover the use of the Auto-Encoder, Generative Adversarial Network, Restricted Boltzmann Machine, Deep Belief Network, Deep Neural Network, Convolutional Neural Network, Recurrent Neural Network and Deep Reinforcement Learning for speech processing. Additionally, it describes the various speech databases and evaluation metrics used by deep learning algorithms for performance evaluation.

Keywords Deep learning · Speech processing · Auto-encoder (AE) · Generative adversarial network (GAN) · Restricted Boltzmann machine (RBM) · Deep belief network (DBN) · Deep neural network (DNN) · Convolutional neural network (CNN) · Recurrent neural network (RNN) · Deep reinforcement learning (DRL)

1 Introduction

Over the last few years, speech processing has become a dynamic field of research due to
tremendous scientific advancements and its pervasive commercial product use.
In the last decade, deep learning became popular for image processing applications. Later, it was adapted to other signal processing domains such as speech, music, and environmental signal processing [1]. Deep learning is a multi-layer model that stacks several machine learning operations to learn representations of the input. Recent research on speech processing has shown substantial improvement in deep learning performance over traditional speech processing models such as the Gaussian Mixture Model (GMM) and Hidden Markov Model (HMM). The scope of deep learning further widened towards natural language processing, speech processing applications, recommendation systems, drug discovery, genomics, and quantum chemistry [2].

* Mohanaprasad Kothandaraman
[email protected]
1 School of Electronics Engineering (SENSE), VIT University, Chennai, Tamil Nadu 600127, India
Speech processing deals with the study of speech signals and processing techniques of
the speech signal. Speech processing encompasses speech enhancement, speech separation,
speech analysis, speech synthesis, dialogue-based system, speech coding, compression, and
transmission [3]. Speech enhancement comes under the speech pre-processing that mini-
mizes the noise, artifacts, and reverberation in the speech signal to enhance speech signal
quality [4]. Speech separation includes segmenting the target speech from the background
interference or deriving the content from the speech data [5]. Dialogue-based systems con-
sist of significant domains such as speech recognition, speaker recognition, and emotion
recognition [6]. Speech recognition deals with recognizing the spoken content of the user
in various languages.
In contrast, automatic speaker recognition means identifying or verifying an individual's identity from his/her voice samples using artificial intelligence methods without any manual intervention [7]. Emotion recognition infers human emotion from the speech signal. Progress on dialogue-based systems is challenging because of the use of private databases for the performance evaluation of new research, database constraints, issues of channel and domain mismatch, reverberant environments, intra-speaker variability, linguistic variability, speaker dependency, variability caused by noisy environments, variability caused by context, the relationship between the spoken text and speaker characteristics, etc. [8].
Various multi-layered methods were used before deep learning for speech recognition; they provided better performance for noisy speech on smaller vocabularies, or moderate performance for high-SNR speech on more extensive vocabularies. These models are shallow neural network models. In 1994, Deng et al. [9] suggested a multi-layered neural prediction HMM model for speech recognition, capturing long-term temporal correlation by combining linear and nonlinear compressive models. Each vocabulary syllable is represented using a multilayer MLP. The jointly predictive HMM resulted in 92.9% average recognition on CV syllables for the HMM recognizer. It had low generalization capability and suffered from overtraining when a more complex structure was used.
Vocal tract length normalization (VTLN) used a statistical generative learning model to estimate the maximum-likelihood compression/expansion of the speaker's utterance spectrum. It used Heteroscedastic Linear Discriminant Analysis (HLDA) to convert cepstral information into powerful phonetic discriminants, and these features were used to train a number of Gaussians with Expectation–Maximization to generate the likelihood of a specific speech sample [10]. In 1997, Schuster and Paliwal [11] presented a bidirectional recurrent neural network (BRNN) trained in the negative and positive time directions to avoid the limitation of using input information only up to a preset future frame. BRNN combines past and future frame information at the current time frame. It performed better than existing ANN and RNN models, with recognition rates of 70.73% and 68.53% on the training and test sets of the TIMIT phoneme classification database.
In 2000, a multilayer Tandem approach presented a hybrid connectionist HMM trained using a multilayer perceptron neural network (MLPNN) with nonlinear hidden units to estimate the posterior probabilities of phones. The neural network's outputs are given to PCA to convert the features into orthogonal features, which are then fed to the Gaussian-mixture-based HTK model. In the final stage, the HTK decoder recognizes the correct word. It outperformed the baseline HTK model, giving a 35% relative error-rate reduction on the multi-condition Aurora noisy continuous digits task [12]. In 2005, features were constructed by training multiple Gaussians over Perceptual Linear Prediction (PLP) features using Feature Minimum Phone Error (fMPE). Multilayer fMPE + MPE showed significant improvement over the single MPE or MLP model [13].
Morgan et al. [14] used a multi-rate coupled HMM model for speech recognition. It used short-term PLP spectral features for short-term modeling and long-term temporal features (HAT) for long-term modeling. The addition of long-term features showed a significant reduction in WER. In 2007, Frantisek Grezl et al. [15] presented bottle-neck features extracted using a multilayer neural network. They extracted features in two layers: the first layer consists of 12th-order PLP features and energy features, which are further given to VTLN and HLDA to reduce speaker variability and dimensionality, respectively; the second layer consists of TRAP-based features. These features were processed using five-layer MLP neural network models and later fed to the GMM-HMM model for the meeting recognition task described in NIST RT'05. It resulted in 26.2% WER for 45 bottle-neck features.
Morgan [16] reviewed some of the existing deep learning models and argued that increasing the width of features in each layer is as important as expanding the network's depth.
This paper presents a review of deep learning architectures, covering unsupervised, supervised, semi-supervised and reinforcement deep learning models, for distinct speech processing applications such as speech enhancement, speech separation, automatic speech recognition (ASR), speaker recognition, emotion recognition, etc. It mainly focuses on the methodology used for each specific speech processing application, the database used for the deep learning model's experimentation, and the reported performance.
This paper is structured as follows: Sect. 2 gives an overview of the speech signal, machine learning, and deep learning; Sect. 3 describes various deep learning architectures for speech processing; Sect. 4 describes the details of the databases utilized by different deep learning models; Sect. 5 gives details about various evaluation metrics for speech processing applications; Sect. 6 provides a discussion of the results of previous work on speech processing applications; and Sect. 7 concludes the paper.

2 Background

2.1 Speech Signal

Speech is the phonetic representation of symbols known as phonemes. The number of phonemes depends on the language (the typical value is between 32 and 64 for most languages). English phonemes consist of vowels, diphthongs, glides, liquids, nasals, stops, fricatives, and affricates [17]. Speech is produced by pressure from the lungs, which originates an utterance in the glottis of the larynx that is finally shaped by the mouth and vocal tract into different vowels and consonants. Humans can also produce speech using an airstream technique without the glottis, which is called alaryngeal speech. Alaryngeal speech signals are categorized into esophageal, buccal, and pharyngeal speech [18].
Human hearing perception ranges between 20 Hz and 20 kHz. The human ear can respond to speech intensities up to 120–130 dB; however, sounds above 90 dB may damage the inner ear, and sounds above 120 dB may cause irreversible damage. The sound wave propagates as a continuous acoustic wave, and once it is acquired, it can be recorded, digitized, processed, coded, transmitted, and replicated. An average human being can recognize sound frequencies typically below 4 kHz and hardly above 7–8 kHz. Therefore, by the Nyquist sampling criterion (Fs ≥ 2 · Fmax), an 8 kHz sampling rate is used for sampling the speech signal to get a basic level of quality, and 16 kHz for a higher level of quality [19]. The speech signal is quantized at 8 bits per sample to limit the quantization noise [20]. Typically, speech signals are converted into a digital format to represent the speech in a robust and compact form.
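To make these numbers concrete, the following is a small illustrative NumPy sketch (not taken from the paper) that picks the sampling rate from the Nyquist criterion Fs ≥ 2 · Fmax and quantizes a test tone to 8 bits per sample:

```python
import numpy as np

f_max = 4000          # highest speech frequency of interest (Hz)
fs = 2 * f_max        # Nyquist criterion: Fs >= 2 * Fmax -> 8 kHz

t = np.arange(0, 0.02, 1.0 / fs)           # 20 ms of samples
tone = 0.8 * np.sin(2 * np.pi * 1000 * t)  # a 1 kHz test tone

# Uniform 8-bit quantization to the signed range [-128, 127]
quantized = np.clip(np.round(tone * 127), -128, 127).astype(np.int8)

print(fs, quantized[:8])
```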

2.2 Machine Learning Techniques for Speech Processing

Traditional machine learning algorithms use handcrafted feature extraction techniques and require several boosting methods to enhance feature quality. These features are provided to the learning algorithm, which generates the output. The one-dimensional speech signal is passed to a feature extraction technique, which mines the statistical, spectral, cepstral, model-based, or transform-domain features of the speech signal. A machine learning algorithm's performance greatly depends upon the data representation or extracted features. Some of the popular feature extraction techniques for speech processing are Principal Component Analysis (PCA), Independent Component Analysis (ICA), Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Relative Spectral Filtering (RASTA), Mel Frequency Cepstral Coefficients (MFCC), Power-Normalized Cepstral Coefficients (PNCC), Zero Crossing Rate (ZCR), Wavelet Transform (WT), etc. [21, 22]. Popular machine learning modeling techniques and classifier algorithms include the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Dynamic Time Warping (DTW), Vector Quantization (VQ), K-Nearest Neighbor classifier (KNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Naïve Bayes classifier (NB), Linear Discriminant Analysis (LDA), General Regression Neural Network, Non-parametric Regression, etc. [23–25].
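As a minimal sketch of such a hand-crafted pipeline (an illustrative example assuming the librosa and scikit-learn libraries; the random "corpus" below is a stand-in, not real data), MFCC features are averaged per utterance and fed to an SVM classifier:

```python
import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(wav, sr=16000, n_mfcc=13):
    # Hand-crafted cepstral features: mean MFCC vector over all frames
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

# Stand-in corpus: 20 random one-second "utterances" with binary labels
rng = np.random.default_rng(0)
wavs = [rng.standard_normal(16000).astype(np.float32) for _ in range(20)]
labels = rng.integers(0, 2, size=20)

X = np.stack([mfcc_features(w) for w in wavs])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.predict(X[:3]))
```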
The representation of the data needs preprocessing to make it more discriminative, which is labor-intensive and performs poorly for larger and noisier databases. Traditional machine learning techniques fail to provide a universal solution to various speech-signal-related problems such as accent change, short utterances, spontaneous speech, the inability to represent impaired speech, continuous utterances, reverberation, non-stationary environmental noise, etc. Moreover, these methods require considerable training time due to bulky feature extraction. Therefore, there is a need for a discriminative representation of the speech signal, which deep learning techniques can provide.

2.3 Overview of Deep Learning

Deep learning is a sub-branch of machine learning that brings higher-level abstraction to the data through multi-layer processing models or complex structures of various nonlinear transformations. Deep learning is also called hierarchical learning, deep machine learning, structured learning, deep representation learning, or DL. The word "Deep" refers to the multiple layers of the processing architecture. Hierarchical features describe more complex features in terms of lower-level features; this hierarchy is why the approach is referred to as deep learning [26, 27].
Deep learning acts as a junction point between artificial intelligence, machine learning,
neural network, pattern recognition, signal processing, optimization, and graphical mod-
eling. Neural network models are categorized into discriminative and generative models.
The discriminative model is a bottom-up model where data flows from the input layer to the
output layer through hidden layers. It supports supervised learning tasks like classification


and regression. Conversely, the generative model follows a top-down approach, and data flows in the reverse direction; it can be utilized for modeling probability distributions and for unsupervised learning tasks. Generally, discriminative learning is selected when labeled data are available, and in the case of unavailability of labeled data, the generative approach is undertaken [28].
Based on the training method, deep learning algorithms can be grouped into unsu-
pervised, semi-supervised, supervised, and reinforcement deep learning algorithms (see
Fig. 1). A supervised deep learning algorithm uses labeled data for training, whereas an unsupervised deep learning algorithm uses generative models for learning instead of labeled data. It is challenging to develop a framework that extracts significant features from large labeled and unlabeled higher-dimensional data. LeCun et al. [29] presented semi-supervised learning, which combines supervised deep learning on labeled data and unsupervised learning on unlabeled data to achieve a meaningful representation of the features.
Almost all earlier deep learning architectures were applied to two-dimensional images, but the raw speech signal is one-dimensional time-series data, which is quite different from a two-dimensional image representation. Therefore, acoustic signals are usually converted into two-dimensional time–frequency representations. An image can be processed holistically or part-based, with little ordering restriction; however, speech signals have to be processed in sequential order [30]. Figure 2 shows the dissimilarity between machine learning and deep learning processing pipelines for speech processing.
The learning algorithm is the core of deep learning; it determines the discriminative features learned at each layer. The primary aim of the learning technique is to discover the optimal weight values that solve a class of problems in a domain [31].
Fig. 1  Classification of deep learning architectures: unsupervised (auto-encoders, including de-noising, sparse, and contrastive variants; generative adversarial network (GAN); restricted Boltzmann machines (RBM); deep belief network (DBN)), supervised (deep neural network (DNN), convolutional neural network (CNN), residual networks, recurrent neural network (RNN) with long short-term memory (LSTM) and gated recurrent unit (GRU)), semi-supervised (GAN, RNN), and deep reinforcement learning (deep Q-learning (DQN), policy gradient methods)


Fig. 2  Machine learning versus deep learning flow for speech processing applications: the traditional machine learning flow maps the input speech signal through hand-crafted feature extraction to the output, whereas the deep learning flow maps simple 1-D or 2-D input representations through additional layers of increasingly complex features to the output

A few standard learning algorithms are back-propagation (BP), gradient descent (GD), stochastic gradient descent (SGD), momentum, and the Levenberg–Marquardt (LM) algorithm. These learning algorithms have several drawbacks: the vanishing gradient problem, local minima, long training times, over-fitting, etc. [32]. The performance of deep learning can be optimized using parameter initialization methods, hyper-parameter optimization using Particle Swarm Optimization (PSO) or Genetic Algorithms (GA), adaptive learning rates (delta-bar-delta, AdaGrad, RMSProp, Adam, etc.), batch normalization, supervised pre-training, dropout, and training speed-up with Graphics Processing Units (GPUs) and the cloud [33–35].
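The sketch below illustrates these training ingredients in PyTorch (a generic toy example, not a configuration from any cited work): a small network with batch normalization and dropout is trained by mini-batch SGD with momentum.

```python
import torch
import torch.nn as nn

# Small network with batch normalization and dropout, as discussed above
model = nn.Sequential(
    nn.Linear(39, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(256, 10),
)
# SGD with momentum; torch.optim.Adam or Adagrad are adaptive alternatives
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# A few synthetic mini-batches of 39-dim acoustic feature vectors
for _ in range(10):
    x = torch.randn(32, 39)              # batch of feature frames
    y = torch.randint(0, 10, (32,))      # class targets
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()                      # back-propagation of the error
    optimizer.step()                     # SGD-with-momentum weight update
```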

3 Deep Learning Architectures for Speech Processing

This section provides a survey of various deep learning algorithms, namely Auto-Encoders (AE), the Generative Adversarial Network (GAN), the Restricted Boltzmann Machine (RBM), the Deep Belief Network (DBN), the Deep Neural Network (DNN), the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN) and Deep Reinforcement Learning (DRL), for different speech technologies.

3.1 Auto‑Encoder (AE)

Auto-encoders are unsupervised deep learning architectures that reproduce the input signal at the output layer. The auto-encoder is an extension of the idea of principal component analysis (PCA); PCA depends on variances rather than covariances and correlations and transforms multidimensional data into a linear representation [36]. The auto-encoder has the ability to reduce non-stationary noise in the speech signal and increase the perceptual quality of noisy speech. The deep auto-encoder is particularly popular for noise reduction in speech enhancement [37]. The Sparse Auto-encoder (SAE), Variational Auto-encoder (VAE), and De-noising Auto-encoder (DAE) are the major implementation architectures of auto-encoders. Figure 3 shows the generalized structure of the auto-encoder.

Fig. 3  Generalized structure of the auto-encoder: the encoder maps the input signal at the input layer to a compressed representation, and the decoder restores the signal at the output layer
DAE helps to restore clean speech from noisy speech. The noisy environment and larger
variation in the speech pattern bring less discrimination in the local transformation [38].
Auto-encoder has a greedy layer-wise structure trained in an unsupervised manner using


back-propagation. It has shown noteworthy performance for noise-robust ASR [39], speech
restoration [40], and learning the local and global transformations of the speech signal [41].
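As a concrete illustration of the encoder-decoder idea in Fig. 3, the following is a minimal de-noising auto-encoder sketch in PyTorch (a toy example with assumed dimensions, not the architecture of any specific cited work); it learns to map noisy spectral frames back to clean ones:

```python
import torch
import torch.nn as nn

class DenoisingAutoEncoder(nn.Module):
    def __init__(self, n_bins=257, n_hidden=128):
        super().__init__()
        # Encoder compresses the (noisy) spectral frame, decoder restores it
        self.encoder = nn.Sequential(nn.Linear(n_bins, n_hidden), nn.ReLU())
        self.decoder = nn.Linear(n_hidden, n_bins)

    def forward(self, x):
        return self.decoder(self.encoder(x))

dae = DenoisingAutoEncoder()
opt = torch.optim.Adam(dae.parameters(), lr=1e-3)

clean = torch.rand(64, 257)                        # toy clean spectral frames
noisy = clean + 0.1 * torch.randn_like(clean)      # corrupted input
loss = nn.functional.mse_loss(dae(noisy), clean)   # reconstruct the clean target
loss.backward()
opt.step()
```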
Unseen noise estimation is challenging for speech enhancement in adverse environments. To deal with this issue, a Separable Deep Auto-Encoder (SDAE) uses two DAEs: one models the clean speech signal, and the residual signal, obtained by subtracting the clean signal estimated by the DAE from the noisy speech signal, is used for unseen noise estimation. Experimental results on the TIMIT database corrupted by 20 types of unseen noise, evaluated with PESQ, SDR, and segmental SNR, have shown superior performance over traditional approaches [42]. The Deep De-noising
Auto-encoder (DDAE) structure is similar to the DAE, but DDAE has more than one
hidden layer. Single-layer DDAE has several drawbacks: less contextual speech informa-
tion, less generalization in unknown SNR, and residual noise in the enhanced signal. To
overcome these limitations, Safari et al. [43] suggested the modular dynamic deep de-noising auto-encoder (MD-DDAE), which consists of a stack of three DDAE layers with distinct window sizes. Purvi Agrawal et al. [44] employed modulation filtering using a deep variational model to remove noise and reverberation introduced at recording time. A two-dimensional (2-D) unsupervised convolutional variational auto-encoder (CVAE) is applied to the speech spectrogram; it focuses on modulations in the spectro-temporal domain and ignores the modulations due to noise or reverberation. Leglaive et al. [45] offered speech enhancement using a recurrent variational auto-encoder (RVAE) consisting of a recurrent deep generative speech model and a variational EM algorithm to estimate the distribution of the latent variables in the noisy speech samples. It is observed that the introduction of temporal dynamics shows significant improvement in speech enhancement.
Li et al. [46] recommended unsupervised mobile phone clustering based on a deep auto-encoder and a spectral clustering algorithm. It is applicable to asymmetric recordings and can be used in forensic applications, but this deep learning model is less capable of capturing the individuality of mobile phones of the same brand. Qian Zhan et al. [47] used an unsupervised bottle-neck feature extraction technique to deal with the need for transcribed speech information. It is found that the adversarial auto-encoder performs better for dialects close to each other because of its ability to extract latent information, phonetic information,
and language labels from the original speech. Chorowski et al. [48] presented a comparison of the Gaussian variational auto-encoder (VAE), a dimensionality-reduction bottle-neck, and the discrete Vector Quantized VAE (VQ-VAE) for speech representation. It is observed that VQ-VAE is speaker invariant and maintains more phonetic information. Further, a time-jitter regularization scheme is used to improve the speech representation quality and limit the latent code capacity. MFCC features with time-jitter regularization resulted in 56.20% accuracy for phoneme classification on the LIBRISPEECH database.

3.2 Generative Adversarial Network (GAN)

The Generative Adversarial Network (GAN) is a type of unsupervised deep learning devised by Goodfellow et al. in 2014 [49]. A GAN comprises two neural networks, namely the discriminator and the generator. The discriminator network differentiates between natural and generated samples, whereas the generator network tries to trick the discriminator; the two networks contend against each other in a zero-sum game. GAN can be treated as an unsupervised or semi-supervised model. The discriminator and generator can consist of a stack of convolution layers, transposed convolution layers, leaky ReLU layers, and fully connected layers. Further, Mirza et al. [50] utilized the conditional GAN, in which the type of generation depends on conditional information given to the generator. Figure 4 shows the framework for the basic GAN and the conditional GAN for data augmentation for noise-robust ASR.
The disparity between training and testing data is a significant challenge in noise-robust speech recognition systems. To overcome this problem, Yanmin Qian et al. [51] developed Generative Adversarial Networks (GANs) for enlarging the size and variability of the training data. The basic GAN and conditional GAN are applied for data augmentation on the Aurora4 and AMI datasets with different types of noise, reverberation, and channel distortion, showing improvements of 6% and 14% in noisy conditions. It is observed that the conditional GAN performs better than the basic GAN.
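The adversarial game itself can be written compactly. The sketch below shows one generic PyTorch GAN training step on toy feature vectors (an illustrative example only; it is not the exact setup used in [51]):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 40))          # noise z -> generated feature
D = nn.Sequential(nn.Linear(40, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))   # feature -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, 40)      # stand-in for real acoustic feature vectors Xr
z = torch.randn(32, 16)         # noise input z

# Discriminator step: label real samples as 1 and generated samples as 0
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(G(z).detach()), torch.zeros(32, 1))
d_loss.backward()
opt_d.step()

# Generator step: try to make the discriminator label generated samples as real
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(32, 1))
g_loss.backward()
opt_g.step()
```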
Pascual et al. [52] described the speech enhancement GAN (SEGAN) to reconstruct
the clean signal from noisy signal to maintain the raw signal’s intelligibility and quality.
It used a convolutional encoder and decoder structure independent of the length of the
input sequence. Based on subjective and objective quality measures, whispered-to-voice conversion has shown improvement over baseline methods.

Fig. 4  Framework for the basic GAN and the conditional GAN (Qian et al. [51]): the generator G maps noise z (and, in the conditional GAN, a condition c) to generated samples Xg, while the discriminator D judges real samples Xr and generated samples Xg as real or fake

Takuhiro Kaneko et al. [53]
inspected the GAN to estimate differences between natural and synthetic speech. A post-filter-based GAN has shown that synthetically generated speech is comparable to natural speech. It used a CNN to capture the time and frequency domain structures. Further, they used the GAN to reconstruct the Short-Time Fourier Transform (STFT) spectrogram to generate an adequate structural representation of the speech data. Applying the reconstructed spectrogram to text-to-speech (TTS) has shown a higher degree of similarity between the synthesized speech and the target speech [54].
Voice conversion is challenging in non-parallel voice conversion systems, as speakers may speak different languages or not repeat the same text. Hsu et al. [55] depicted the variational auto-encoding Wasserstein GAN (VAW-GAN) to build a voice conversion model. ASR performance in cross-domain speech recognition is abysmal when the systems are trained in noisy environments or with different speaking accents; Mirura et al. [56] proposed a GAN-based speech recognition system that has shown adaptation to changes in speaking accent and noisy speech. Mostly, GANs are used for voice conversion, speech enhancement, and speech synthesis. In recent years, a few researchers have used GANs for noise-robust ASR. Reducing the disparity between training and testing data through GAN-based data augmentation has proven more effective than manually adding noise to the original signal [57].
GAN is not appropriate for recurrent models or sequence training because of the independence between the generated frame-level data. The GAN model suffers from non-convergence, unstable training, and sensitivity to hyper-parameter selection, and it can cause over-fitting due to imbalance between the discriminator and generator networks.

3.3 Restricted Boltzmann Machines (RBMs)

The RBM is an unsupervised, energy-based, undirected generative model that utilizes a layer of hidden variables to model a distribution over the visible-layer variables. RBMs are called restricted because there are no visible-to-visible or hidden-to-hidden connections [58, 59]. Training is performed over a two-layer network in which one layer is treated as the hidden layer and the other as the visible layer, as shown in Fig. 5.

Fig. 5  Generalized framework of the RBM: a visible layer and a hidden layer connected by weights, with no intra-layer connections
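To make the training idea concrete, the following is a minimal Bernoulli RBM sketch in PyTorch with a single contrastive-divergence (CD-1) update (an illustrative example with assumed sizes, not code from any cited work):

```python
import torch

class RBM:
    """Bernoulli RBM trained with one step of contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.01):
        self.W = 0.01 * torch.randn(n_visible, n_hidden)
        self.b_v = torch.zeros(n_visible)
        self.b_h = torch.zeros(n_hidden)
        self.lr = lr

    def sample_h(self, v):
        p = torch.sigmoid(v @ self.W + self.b_h)      # hidden unit probabilities
        return p, torch.bernoulli(p)

    def sample_v(self, h):
        p = torch.sigmoid(h @ self.W.t() + self.b_v)  # visible reconstruction probabilities
        return p, torch.bernoulli(p)

    def cd1_step(self, v0):
        ph0, h0 = self.sample_h(v0)                   # positive phase
        pv1, v1 = self.sample_v(h0)                   # one-step reconstruction
        ph1, _ = self.sample_h(v1)                    # negative phase
        self.W += self.lr * (v0.t() @ ph0 - v1.t() @ ph1) / v0.shape[0]
        self.b_v += self.lr * (v0 - v1).mean(0)
        self.b_h += self.lr * (ph0 - ph1).mean(0)

rbm = RBM(n_visible=64, n_hidden=32)
batch = torch.bernoulli(torch.rand(16, 64))           # toy binary feature frames
rbm.cd1_step(batch)
```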
Yuxuan Wang et al. [60] used Restricted Boltzmann Machine-based DNN features to train a linear SVM to overcome intractability on larger speech databases for speech separation. The DNN is used for feature extraction, and the linear SVM is used for classification. Deep learning is a hierarchical method that can capture higher-order correlations between raw features. The mini-batch gradient descent method used for training has low complexity and large scalability. They proposed the DNN-SVM-SEG method, which uses the RASTA-PLPΔΔ feature extraction method and a cross-channel correlation-based auditory segmentation method. The performance of the system depends on pitch-based features and is sensitive to noisy environments.
Xu et al. [61] explored a regression model using deep learning to enhance noisy speech signals. A regressive deep model of multiple restricted Boltzmann machines (RBMs) was trained using log-power spectral features of noisy and clean data, and the noisy data is reconstructed using the trained deep model; this was the first use of a DNN as a regression model for this task. The quality of the enhanced speech is evaluated based on segmental SNR (SegSNR in dB) and log-spectral distortion (LSD in dB). An additional measure used is the perceptual evaluation of speech quality (PESQ), which has a high correlation with the subjective score. Multi-condition training (on the TIMIT database) can deal with the
speech enhancement of unseen noisy data, new speakers, various SNR levels, and different
cross-language generalizations. It has shown poor performance in real-time environmental
noise.
Speech emotion signals are generally supra-segmental, and turn-level features perform better than frame-level features, which lose local information. The RBM supports optimal and discriminative feature learning. Mohit Shah et al. [62] offered a Latent Topic Model (LTM) for emotion-salient feature extraction and a supervised replicated softmax model (sRSM) based on the RBM. It gave better performance than turn-level features for cross-corpus and spontaneous emotion recognition on the IEMOCAP and SEMAINE databases.

3.4 Deep Belief Network (DBN)

A DBN is a stack of RBMs trained layer-wise. It is an unsupervised, probabilistic generative model. Unlike a conventional FFNN, the DBN supports pre-training and fine-tuning. In its basic form it is a two-level model consisting of two RBMs [63]. In pre-training, each RBM is trained independently, and the output of the lower RBM is fed to the higher-level RBM, as shown in Fig. 6. In the fine-tuning process, the network is transformed into a deep auto-encoder (DA) by unrolling the whole DBN, repeating the input and hidden layers, and attaching them to the output of the DBN. In this structure, each RBM's hidden layer acts as the visible layer of the adjacent RBM [64].

Fig. 6  Generalized structure of a deep belief network (DBN): input features pass through two stacked Restricted Boltzmann Machines followed by a logistic regression layer
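Greedy layer-wise pre-training can be sketched as follows (an illustrative, self-contained PyTorch toy example, not the authors' code): each RBM layer is trained with CD-1 on the hidden activations of the layer below, and the resulting stack would then be fine-tuned with a supervised output layer.

```python
import torch

def train_rbm(v_data, n_hidden, epochs=10, lr=0.01):
    """Train one Bernoulli RBM layer with CD-1; return its weights and hidden biases."""
    n_visible = v_data.shape[1]
    W = 0.01 * torch.randn(n_visible, n_hidden)
    b_v, b_h = torch.zeros(n_visible), torch.zeros(n_hidden)
    for _ in range(epochs):
        ph0 = torch.sigmoid(v_data @ W + b_h)                       # positive phase
        v1 = torch.sigmoid(torch.bernoulli(ph0) @ W.t() + b_v)      # reconstruction
        ph1 = torch.sigmoid(v1 @ W + b_h)                           # negative phase
        W += lr * (v_data.t() @ ph0 - v1.t() @ ph1) / v_data.shape[0]
        b_v += lr * (v_data - v1).mean(0)
        b_h += lr * (ph0 - ph1).mean(0)
    return W, b_h

# Greedy layer-wise pre-training on toy binary input features:
# each RBM is trained on the hidden activations of the previous one.
data = torch.bernoulli(torch.rand(128, 64))
inputs, stack = data, []
for n_hidden in (32, 16):
    W, b_h = train_rbm(inputs, n_hidden)
    stack.append((W, b_h))
    inputs = torch.sigmoid(inputs @ W + b_h)   # feed hidden activations upward

# Fine-tuning would unroll these weights into a feed-forward network with a
# logistic-regression output layer and continue training with back-propagation.
```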
Abdel-Rahman Mohamed et al. [65] proposed the DBN for phone recognition, using discriminative training to avoid the over-fitting problem; it resulted in a PER of 23.00% on the TIMIT test set. Further, they investigated applying the DBN to the full utterance rather than a local window of frames of the speech signal [66], and they showed that the DBN can efficiently replace the popular Gaussian Mixture Model (GMM) for speech recognition with fewer parameters [67].
The Deep Belief Network (DBN) has been used in voice activity detection (VAD) to combine multiple features and model their variability. The fusion of multiple features resulted in higher robustness and lower detection complexity, but performance is low in non-stationary noisy environments. Therefore, the authors
have planned to improve the performance using a stacked de-noising encoder and DBN for
unsupervised online learning [68]. Sarikaya et al. [69] used the DBN for action recognition in a call routing task, comparing its performance with Support Vector Machines (SVM), boosting, and a Maximum Entropy (MaxEnt) classifier. For the call routing database, DBN-3 gives 90.8% action classification accuracy. Its training is simple, and there is less possibility of over-fitting; however, it showed poor performance for smaller databases. Their future scope consisted of using the DBN for tagging for event detection in spoken language understanding.
DBN can learn the higher-level features and multiple-level representations of the speech
signal. Guihui Wen et al. [70] applied a random deep belief network for emotion recog-
nition, which can overcome dimensionality problems and performs better for the larger
database.
Chien-Yao Wang et al. [71] presented noise-robust sound event recognition based on the auditory receptive-field binary pattern (ARFBP), which combines the spectrogram image feature (SIF), cepstral features, and the human ARF model. These features are given to a hierarchical-diving deep belief network (HDDBN) classifier, which learns the distinctive properties from the physical attributes. It showed 99.27% accuracy for clean data and 95.06% accuracy for noisy data (0 dB SNR) on the RWCP dataset. On the TUT sound event database, it showed error rates of 0.81 and 0.73 for sound event detection in home and residential areas, respectively. HDDBN showed significant improvement in sound event recognition rate over SVM.
Speech quality plays a crucial role in telephonic conversation. Affonso et al. [72] investigated the deep belief network and the radial basis function SVM (RBF-SVM) for speech quality assessment on unimpaired speech samples from a public database. They extracted 64 speech features (13 static MFCC features, 20 FFT power spectrum values, ZCR, spectral centroid, spectral roll-off, and the first and second derivatives of the static MFCC features), which are given as input to the DBN. Speech quality is classified into four categories ranging from excellent to inferior, and the proposed method resulted in 95% aggregated accuracy.
Soufiane Hourri et al. [73] recommended deep speaker features (DeepSFs), obtained by transforming MFCC features into DeepSFs using a DNN to increase the noise robustness of MFCC. The basic MFCC features are divided into two groups using the K-means algorithm to avoid over-fitting. The weights of the DNN are initialized using a Deep Belief Network (DBN) and given to the DNN feature classifier, and the nearest cluster (NearC) is used as the scoring technique. Extensive experimentation on the THUYG-20 SRE database showed equal error rates of 0.43% and 0.55% for the female and male corpora, respectively. They observed that the DNN can learn the feature distribution and can be used for robust speaker recognition in noisy environments.

3.5 Deep Neural Network (DNN)

DNN is a variant of feed-forward ANN that includes multiple hidden layers between the
input and output layer. DNN can model complex data with a nonlinear relationship. In
DNN, data propagates from the input to the output layer without any backward flow of data
[74].
In a DNN, the input data is given to the input layer, its output is fed to the neurons of the next (hidden) layer, and so on, producing the result at the output layer, as shown in Fig. 7. Due to the multiple layers, DNNs have a better capability to represent nonlinear functions than shallow learning, and the combination of feature extraction and classification layers makes deep learning efficient. A DNN can estimate the encoding vector and reconstruct the source data signal, making it suitable for source separation and speech enhancement.

Fig. 7  Generalized framework of a feed-forward DNN: input data passes through the input layer and hidden layers 1 to N to the output layer
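As a concrete illustration of this layered flow, here is a minimal PyTorch feed-forward DNN that maps a spliced window of acoustic features to class posteriors (a generic sketch with assumed dimensions, not the configuration of any cited system):

```python
import torch
import torch.nn as nn

# 11-frame context window of 40-dim features -> three 512-unit hidden layers -> class posteriors
dnn = nn.Sequential(
    nn.Linear(11 * 40, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 1000),             # e.g. tied HMM-state (senone) classes
)

frames = torch.randn(8, 11 * 40)      # a mini-batch of spliced feature vectors
posteriors = torch.softmax(dnn(frames), dim=1)
print(posteriors.shape)               # torch.Size([8, 1000])
```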
Tae Gyoon Kang et al. [75] used DNN to map the data vector and the corresponding
encoding vectors. The proposed method consists of three parts: Non-negative matrix fac-
torization (NMF) training, DNN training, and source separation stages. DNN-NMF outper-
forms the previous NMF based techniques but has less adaptability. Shuai Nie et al. [76]
presented a combination of DNN and Nonnegative matrix factorization (NMF) for speech
separation. NMF learns the spectra of the speech signal and reconstructs the magnitudes of the signal and the noise. Discriminative training with a sparsity constraint eliminates the noise at minimal cost in distortions and artifacts and retains the original speech content. They used the TIMIT and NOISEX-92 datasets for the speech and noise data. They
have used the signal to interference ratio (SIR), source to distortion ratio (SDR), source
to artifact ratio (SAR), and PESQ as the evaluation metrics to estimate the performance
of their proposed model. Naijun Zheng et al. [77] suggested deep learning for phase-
aware speech enhancement, which considered phase information of Short-Time Fourier

Transform (STFT). They used the derivative of the phase spectrogram along the time axis, known
as instantaneous frequency deviation (IFD). They have used an Ideal Ratio Mask (IRM),
Ideal Amplitude Mask (IAM), Phase Sensitive Filter (PSF), and Complex Ideal Ratio
Mask (CIRM) as the training targets for DNN. Based on speech quality and intelligibil-
ity, it performs better than the DNN architectures not considering phase information. The
unstructured architecture of the system brings complexity. Yan Zhao et al. [78] applied
two-level DNN for de-noising and de-reverberation in the speech signal. It has shown
improved genuine speech intelligibility and quality in the real-time noisy-reverberant situ-
ation. Phase change due to noise and reverberation degrades the performance of the two-
stage DNN. George E. Dahl et al. [79] inspected the context-dependent DNN-HMM (CD-DNN-HMM) model for large vocabulary speech recognition to reduce the generalization error and improve the robustness of the system. CD-DNN-HMMs unite the representational power of the DNN and the sequential modeling capability of the CD-HMM. Experimental results on a challenging business search dataset showed that the CD-DNN-HMM (accuracy 69.6%) outperforms existing machine-learning-based algorithms, although it is computationally expensive. Dong Yu et al. [80] utilized a deep tensor neural network (DTNN) obtained by replacing one or more conventional layers of the DNN with a Double Projection (DP) layer, in which the input feature vector is projected into two nonlinear subspaces whose interactions are modeled by a tensor. It has several advantages, such as representing the covariance structure of the data in hidden layers, modeling noisy data with high inconsistency, and performing effectively for smaller databases. A small number of DP layers and a bottom DP layer degrade the system's performance; therefore, in the future, they intend to increase the DP layers to improve performance. It showed 16.6% WER on the 30-hour SWB task for the Hub5'00 evaluation set.
For separating noise and handling channel mismatch in the speech signal, time-varying masking is used. Naraynan et al. [81] proposed the diagonal feature discriminant linear regression (dFDLR) adaptation algorithm for a deep neural network and HMM for noise-robust speech recognition when the system is trained with clean data. dFDLR performed best when trained on noisy log-Mel spectral features; it also gave good results when trained on clean data, with a WER of 4.8% for clean training using dFDLR + log-Mel features on the Aurora-4 medium-large vocabulary task. The system is trained for multiple conditions such as noisy, clean, noisy + channel mismatch, and clean + channel mismatch. The drawback of the system is that the WER is larger under noisy channel-mismatch conditions.
Wang et al. [82] exploited regressive Context-Dependent DNN (CD-DNN) for
addressing the problem of data scarcity and clustering in broad phones. It can discrimi-
nate the context state at the frame level. It reduces the word error rate by 1.3% compared
to standard CD-DNN on Topic Detection and Tracking—Phase 3 (TDT3) corpus but
resulted in high WER for voiced, unvoiced, and silence classes. It has resulted in 15.0%,
12.1%, 10.8% WER for Context-Independent DNN (CI-DNN), CD-DNN, and regres-
sive CD-DNN, respectively. Hue et al. [83] implemented DNN-based fast speaker adaptation for speech recognition on larger databases. The speaker adaptation is applied in three ways: nonlinear feature normalization in feature space (fSA-SC), direct model adaptation of the DNN based on speaker codes (mSA-SC), and joint speaker-adaptive training with speaker codes (SAT-SC). fSA-SC and mSA-SC are speaker-independent, whereas SAT-SC is speaker-dependent and has lower training time. It resulted in a word error rate of 12.1% after speaker adaptation using sequence training on the TIMIT large vocabulary Switchboard task. Pan Zho et al. [84] presented a multiple DNN (mDNN)
model, which computes posterior probabilities of HMM for speech recognition. In this
method, the training data is grouped in m clusters for training to decrease the training
time. Four clustered mDNN resulted in 14.5% WER on the Mandarin transcription task.

13
1926 K. B. Bhangale, M. Kothandaraman

Performance of the mDNN is better and faster than the baseline DNN, although it is observed that if the number of clusters increases beyond ten, the system's performance degrades. In [85], the authors suggested that the DNN can be used for bandwidth expansion on data with multiple sampling rates; it is effective and robust in real time but requires a large training time. Wu et al. [86] proposed activation regularization to avoid network over-fitting in speech recognition, evaluated on the Wall Street Journal, Babel languages, and Broadcast News databases. It provided a generalized network structure, which reduces the WER significantly.
Speaker recognition is challenging due to diversity in language, accent, and speech tone. Still, the DNN is considered a good option for speaker verification because of its representative power. Chen et al. [87] suggested the use of a Deep Neural Architecture (DNA) for learning speaker-specific characteristics from MFCC. The DNA includes two identical fully connected multilayer feed-forward neural network subnets having 2K − 1 hidden layers, where K > 1 (an odd number of hidden layers). They used two types of learning: pre-training and discriminative learning. In this approach, speech information component analysis is complicated, and a large amount of data with large variability is required during discriminative learning. The method is independent of the text and language spoken, and it performed well for speaker verification and segmentation.
Score compensation, calibration, and transformation play an essential role in speaker verification systems. Zhili Tan et al. [88] examined DNN-based score calibration, where the calibrated scores and score shifts are estimated from i-vectors for speaker recognition. Their proposed method reduces over-fitting and performs better in noisy environments. Experiments on NIST 2012 SRE showed that multitask learning performs better (EER of 3.6% at 0 dB SNR) over a wide range of SNRs. Larger time complexity and the need for pairs of clean and noisy data for training are the weaknesses of DNN-based score calibration compared with conventional methods.
To improve the DNN classifier’s performance for spoofing detection, Hong Yu et al.
[89] combined DNN with Human log-likelihoods (DNN-HLL). DNN-HLL classifier
that is trained with five dynamic filter bank-based cepstral features and constant Q-cep-
stral coefficients (CQCC) features. CQCC has a variable resolution in both the time
and frequency domain that is better suitable for spoofing detection. They found that the
performance of DNN-HLL is ten times better than baseline GMM-HLL on the ASVs-
poof-2015 database. They have used Spoofing-discriminant DNNs with five hidden lay-
ers. Their proposed method reduces the equal error rate to 0.045% for all types of spoof-
ing attacks. Wang et al. [90] proposed a combination of spatial and spectral features for
a deep neural network for blind speech separation. The time–frequency dominance used to find the direction of the user of interest is evaluated by a two-stage chimera++ network, which can be used for multi-speaker ASR. Experimental evaluation on the RIR database showed that its performance degrades with environmental noise and stronger reverberation. Lotfian et al. [91] studied curriculum learning for speech emotion recognition: based on multi-class evaluator agreement, simple data are trained on first and ambiguous data are kept for later training, with the curriculum defined using a min–max method. Extensive experimentation on the MSP-Podcast database resulted in better performance, but classification accuracy is affected by wrongly labeled and unreliable samples. Liu et al. [92] offered different applications of the DNN for understanding the relevance between the user embedding and the candidate response in chatbots; it gives semantic information about the post, the response, and personal information in chatbots like Facebook-M, Cleverbot, and Xiaoice.


3.6 Convolutional Neural Network

A CNN was initially proposed by Fukushima in 1988 but had the limitation of compu-
tation hardware for network training [93]. Later, in the 1990s, LeCun et al. [94] pre-
sented a successful CNN version with a gradient descent learning technique. The bio-
logical process inspires CNN, and the connectivity pattern of neurons is similar to the
animal visual cortex [95]. A CNN consists of a chain of convolution, Rectified Linear
Unit (ReLU), pooling layer, fully connected layer, and final soft-max layer as shown
in Fig. 8. Convolution layer outputs are represented as a feature map. Each unit in the
feature map is connected with the local region of feature maps of the previous layers
via a convolution filter bank. All units in single feature maps share a common filter
bank. Different feature maps in layers share different filter banks that maintain the local
connectivity and local region’s correlation. Each neuron of one layer is linked to all
other neurons in the next layer. Discrete convolution is used for the filtering operation; therefore, it is called a Convolutional Neural Network. The ReLU layer removes the negative values of the convolution map to increase the nonlinearity of the network and its decision function. The pooling layer, also called a sub-sampling layer, merges semantically similar features into one; the two main types are maximum pooling and average pooling. The pooling layer helps extract the dominant elements and reduces the feature-map dimensions and the required computation, and the maximum pooling layer can also act as a noise suppressant.
A fully connected layer converts the multidimensional feature map into a one-dimensional vector, with each neuron connected to all neurons of the next layer. The output of the fully connected layer is given to the soft-max classifier. A CNN is largely insensitive to scale, shift, and distortions of the input signal. A CNN accepts a fixed-size input vector, produces a fixed-size output vector, and consists of a fixed number of processing layers.
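The convolution-pooling-fully-connected chain described above can be written compactly. The sketch below is an illustrative PyTorch CNN over MFCC "images" of an assumed size of 13 x 100 frames (a toy example, not the network of any cited work):

```python
import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # pooling layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # pooling layer 2
        )
        self.classifier = nn.Linear(32 * 3 * 25, n_classes)   # fully connected layer

    def forward(self, x):                          # x: (batch, 1, 13, 100) MFCC "image"
        h = self.features(x)
        return self.classifier(h.flatten(1))       # soft-max is applied inside the loss

mfcc_batch = torch.randn(4, 1, 13, 100)
logits = SpeechCNN()(mfcc_batch)
print(logits.shape)                                # torch.Size([4, 10])
```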
Some of the typical examples of CNN architectures that use a stack of convolution
layers, max-pooling layer, fully connected layer, and soft-max classifier are LeNet,
AlexNet, VGG Net, NiN, and All Conv. Some of the advanced architectures of CNN are
DenseNet, FractalNet, GoogLeNet with Inception units, and Residual Networks [96].

Fig. 8  Generalized framework of a CNN for speaker recognition (adapted from Arindam Jati et al. [100]): original speech, MFCC features, convolution layer 1, pooling layer 1, convolution layer 2, pooling layer 2, fully connected layer, and output


A CNN is very popular for image processing applications, and in recent years it has given promising outcomes for various speech processing applications.
Generally, speech enhancement models focus on audio information only, with very little attention given to video data. Hou et al. [97] presented audiovisual deep CNNs (AVDCNN), which consist of separate CNN models for speech and video data fused into a collaborative network for speech enhancement and image reconstruction. They note that lip shape and speech have a high degree of correlation, and lip shape can be used as an auxiliary feature for voice activity detection. It is noticed that late fusion is superior to early fusion for audio–video streaming.
Speech separation is challenging due to two significant issues: the order of the target and masker speakers in the mixture and the number of speakers in the mixture. To address these issues, Yi Luo [98] investigated the Deep Attractor Network (DANet), which projects the time–frequency attributes of the mixture signal into a high-dimensional embedding space. The clustering of speakers depends on the attractor (reference) points, and the handling of permutation and of the number of attractors in DANet reduces the permutation and speaker-number problems in speech separation. Tian Tan et al. [99] proposed a very deep convolutional residual network (VDCRN), which incorporates residual learning and batch normalization for noise-robust speech recognition and alleviates the training–testing database mismatch. Factor-aware training and cluster-aware training significantly improved the performance of the VDCRN in noisy conditions, achieving a WER of 5.67% on the AURORA-4
dataset. Jati et al. [100] studied speaker-specific characteristics obtained from unsupervised neural predictive coding (NPC) along with a convolutional Siamese network. It can detect overlapping speech and works well in environmental noise but is less robust for a larger speaker database; in the future, they plan to use a deeper network for larger vocabularies and to introduce robustness to channel characteristics. Nguyen An et al. [101] inspected a text-independent speaker identification method for speaker separation, in which CNN variants such as residual neural networks (ResNets) and visual geometry group (VGG) nets are used to learn speaker characteristics and can handle variable-length segments. Log-Mel spectral features are given to the CNN, and a structured self-attentive layer applied after the CNN layers generates fixed-length input for the next layers and attends to the discriminative speaker characteristics.
Database size and condition make an impact on speaker recognition. Most of the time,
the database is created in the constrained condition. Thus, it has a smaller size, which
degrades the speaker recognition performance in the unconstrained and noisy environment.
Arsha Nagrani et al. [102] expanded the celebrity VoxCeleb database using open-source
media such as YouTube to deal with this problem. They applied a two-dimensional CNN (Thin-ResNet with a GhostVLAD layer) to the speech spectrogram for speaker verification. After speaker verification, the speaker identity is confirmed using face recognition of the celebrity and added to the database. It resulted in an equal error rate of 2.87%.
CNN is capable of learning the discriminative features from diverse speech expressions
for emotion recognition. Shiqing Zhang et al. [103] recommended Deep Convolutional
Neural Networks (DCNN) for emotion recognition to bridge the semantic gap between
low-level features and subjective emotions. They provided three log-MFCC features (static, delta, and delta-delta coefficients) to train the AlexNet DCNN model. For the aggregation of the learned high-level features, Discriminant Temporal Pyramid Matching (DTPM) is used, and an SVM is employed for emotion classification. Extensive experimentation on the EMO-DB, RML, eNTERFACE05, and BAUM-1s databases has shown promising results. It is observed that a DCNN pre-trained for image applications can also be


used for speech feature extraction. LP-norm pooling has demonstrated significant improve-
ment over the maximum and average pooling.
Jianfeng Zhao et al. [104] presented a merged DNN that combines features from a 1D-CNN applied to the audio clip and a 2D-CNN applied to the spectrogram for emotion recognition. A Bayesian optimization model is used for fine-tuning the merged features. Using transfer learning, the deep learning model's performance can be improved for smaller datasets by transferring a model trained on a larger dataset. The merged CNN resulted in 89.77% and 86.36% accuracy for speaker-dependent and speaker-independent emotion recognition systems on the Berlin EmoDB and IEMOCAP databases.
Hossain et al. [105] used CNNs on the speech MFCC spectrum and on images, and the features are fused using two consecutive extreme learning machines (ELMs). They used an SVM for classification. The system's performance is measured as accuracy (%) on the eNTERFACE'05 audiovisual emotion database, consisting of six emotions: anger, disgust, fear, happiness, sadness, and surprise. The ELM provides a high degree of non-linearity in the feature fusion, but because of MFCC it is prone to background noise. In the future, they plan to evaluate their proposed system on other deep learning architectures and cloud frameworks.
Ocquaye et al. [106] proposed Dual Exclusive Attentive Transfer (DEAT) for an unsupervised CNN used for source-target domain adaptation. To minimize the domain incongruity in the second-order statistics of the attention maps of the source and target, a correlation alignment loss (CALLoss) is used. The spectrogram is used for discriminant and salient feature learning, with raw spectrogram features given to a 5-layer CNN. It resulted in an un-weighted average recall (UAR) of 65.02% for the ABC corpus and 67.79% for the emo-DB database. It has several advantages, such as high UAR, computational efficiency, and simple optimization, but the feature vector length is large. Their future scope included extending the multi-layer model to VGGNet and ResNet along with dimensionality reduction on the attention maps.
Suraj Tripathi et al. [107] examined CNNs for emotion recognition using speech features and speech transcripts. CNNs are applied to the text and to the speech MFCC features, and the outputs are combined in a fully connected layer for classification. CNN-MFCC + TEXT resulted in 76.1% accuracy, an increase of almost 7% in performance over current benchmark methods.
Heinrich Dinkel et al. [108] investigated a joint Convolutional Long Short-Term Memory deep neural network (CLDNN) for spoofing detection using raw-waveform front-end speech features. Experimental evaluation of the algorithm on BTAS2016 and ASVspoof2015 resulted in half total error rates (HTER) of 0.19% and 0.0%, respectively. The raw waveform works better for synthetic and voice-converted spoof detection but performs poorly for sparse data.

3.7 Recurrent Neural Network (RNN)

An RNN is a neural network with recurrent connections that is generally used to process sequential and time-series data. The most popular RNN implementations are Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) [109]. RNNs are called recurrent because they perform the same operation for every element of the sequence, with the output depending on earlier computations. The structure of a basic RNN with a loop is shown in Fig. 9. The general feed-forward neural network has several issues, such as the inability to handle sequential data, consideration of only the current input, and failure to memorize previous inputs.


Fig. 9  Structure of a basic RNN with a loop: an input layer x, a hidden layer h with a recurrent connection, and an output layer y

LSTM is generally used for temporal information processing. GRU has fewer network parameters, a simpler topology, and lower computational cost and complexity [110]. Depending on the input-output relationship and the application, RNN structures are categorized as one-to-one, many-to-one, one-to-many, and many-to-many. This section provides some of the applications of RNN for speech processing.
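As a minimal illustration of the many-to-one RNN configuration used by most of the classification systems surveyed below, the sketch that follows is an assumption for illustration only (not code from any cited work): it maps a sequence of MFCC frames to a single class label with a bidirectional LSTM, written here in PyTorch.

```python
import torch
import torch.nn as nn

class SpeechLSTMClassifier(nn.Module):
    """Many-to-one RNN: a sequence of acoustic frames -> one class label."""

    def __init__(self, n_features=40, hidden=128, n_classes=10):
        super().__init__()
        # Bidirectional LSTM reads the frame sequence in both directions.
        self.rnn = nn.LSTM(input_size=n_features, hidden_size=hidden,
                           num_layers=2, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, n_features), e.g. a batch of MFCC sequences.
        outputs, _ = self.rnn(x)
        # Use the representation of the last time step for classification.
        last = outputs[:, -1, :]
        return self.fc(last)

# Example: a batch of 8 utterances, 200 frames of 40 MFCCs each.
model = SpeechLSTMClassifier(n_features=40, hidden=128, n_classes=10)
logits = model(torch.randn(8, 200, 40))
print(logits.shape)  # torch.Size([8, 10])
```

Replacing nn.LSTM with nn.GRU gives the GRU variant with fewer parameters, and keeping the per-frame outputs instead of the last time step yields the many-to-many form used for enhancement and activity detection.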
The combination of bidirectional LSTM-RNN and end-to-end training can give better results for phoneme recognition on the TIMIT database (PER of 17.7%) [111]. Chu-Xiong Qin et al. [112] presented transfer learning for speech recognition, which combined a multilingual DNN and a matrix factorization method to extract higher-level features. Further, a connectionist temporal classification (CTC) attentive model with a shallow RNN increases robustness through joint decoding and shared training. Experimental results on the TIMIT database have shown a PER of 16.59%.
The combination of convolutional and LSTM recurrent networks captures temporal dependencies better. The convolution-LSTM model resulted in better accuracy (85%) for speech and music recognition on the Google AudioSet database [113]. Bidirectional LSTM (BLSTM) can reduce the vanishing gradient problem of the RNN, but it is more computationally expensive. LSTM still suffers from vanishing gradients in its higher layers because of the bounded output, and larger LSTMs tend to over-fit the network. To deal with these problems, Jian Kang et al. [114] combined bidirectional RNN, GRU, and residual architectures for low-resource speech recognition. They presented a local BLSTM for modeling temporal dependencies over a local window, which showed significant improvement over the baseline LSTM (3–8% reduction in WER) and DNN (4–10% reduction in WER). Zhiyuan Tang et al. [115] developed phonetic temporal neural (PTN) language identification using an LSTM-RNN model that accepts speech features generated by a phone-discriminative DNN; phonetic temporal information was observed to be more important than raw speech features for discriminating languages. Kun Han et al. [116] presented a feed-forward DNN and an RNN for training static and sequential frame-level acoustic features, respectively, where the RNN can learn temporal dynamics; on the TIMIT and NOISEX-92 noise datasets, single-condition training performed better than the proposed multi-condition training. Ke Tan et al. [117] combined a convolutional encoder-decoder (CED) and recurrent LSTM layers to form a convolutional recurrent network (CRN) that is noise- and speaker-independent for real-time monaural speech enhancement. Progressive learning and a combination of DNN and LSTM resulted in improved speech quality and intelligibility for speech enhancement at low SNR, but at a higher computation cost and with a large number of parameters, which makes practical real-time implementation difficult. To reduce the parameters and computation cost, Andong Li et al. [118] implemented a progressive learning-based convolutional RNN (PL-CRNN) to improve speech quality and speech intelligibility.
Anastasios Vafeiadis et al. [119] presented a CRNN for speech activity detection, which uses the ability of the RNN to classify speech and non-speech in long speech sequences. Zebang Shen et al. [120] used an LSTM model to build the relationship between successive MFCC frames for Chinese singer recognition; to overcome the vanishing gradient problem of the RNN, the LSTM uses memory cells instead of ordinary hidden neurons. The performance was evaluated on Chinese singer identification (SID) with the MIR-1K corpus and resulted in 88.4% accuracy with 400 neurons. LSTM is well suited to long voice sequences because of its relative insensitivity to gap length and its ability to capture long-term dependencies.
Yiming Wu et al. [121] proposed a CNN with a bidirectional LSTM conditional random field (BLSTM-CRF) for musical chord recognition. The input signals are converted to a log-frequency magnitude spectrogram representation via the harmonic constant-Q transform (Harmonic-CQT). They used two databases for experimentation, Isophonics and RWC-Popular; the CNN-BLSTM-CRF approach achieved cross-validation scores of 85.3% on the Isophonics database and 84.3% on the RWC database. It is well suited to regular, major, and minor chord classification with a large vocabulary but less efficient for complex chord recognition. Their future scope is to apply the system to music synchronization, structure analysis, and cover song identification. Jianfeng Zhao et al. [122] applied 1-D and 2-D CNN-LSTM networks to learn emotion features obtained from MFCCs, where the LSTM captures the long-term dependencies in the features.
RNN-based speech enhancement gives better speech quality and intelligibility in non-stationary noise than feed-forward networks, but RNN architectures need more computation for modeling complex problems and more memory [123, 124].

3.8 Deep Reinforcement Learning (DRL)

Reinforcement Learning (RL) is based on the reasonable thought that if an action is followed by an improvement in the state of affairs, then the tendency to produce that action is strengthened [125]. RL methods are categorized into value-based methods, including Q-learning approaches [126], and policy-based methods, including policy gradient methods [127]. DRL has been investigated for only a few speech processing applications, such as dialogue-based systems, speech enhancement, pre-training for ASR, and content-based speech retrieval. In recent years, DRL has mostly been investigated for dialogue-based systems in human–robot or human–machine interaction, where it is necessary to generate a response based on the user's state of conversation and action.
Policy optimization DRL algorithms can be useful in an automated dialogue system for generating the speech response by considering the current state of the discussion with the human [128]. When the domain changes or a policy is transferred from one domain to another, the entire dialogue state space and action set change, so the DRL model must differ between domains, which is very challenging. A multi-agent dialogue policy (MADP), which consists of slot-dependent agents (S-Agents) and a slot-independent agent (G-Agent), can tackle this problem [129]. Further, Lu Chen et al. [130] proposed AgentGraph to make DRL-based policies sample-efficient and suitable for policy transfer between different domains.
The performance of ASR can be improved by optimizing the speech enhancement (SE) model using DRL. Shen et al. [131] presented an ideal binary mask-based SE system on the Mandarin Chinese broadcast news corpus (MATBN) and showed significant improvement in noisy conditions.
DRL requires a large amount of training time, which makes it unsuitable for real-time human–computer interaction. Rajapakshe et al. [132] explored pre-training of DRL to reduce the training time, where a Markov Decision Process (MDP) is used for pre-training the DNN. This pre-training was applied to speech command recognition using CNN and LSTM and showed significant improvement in network performance over a network without pre-training.
In supervised learning, transcribing the training speech data is a challenging and computationally expensive task. Taku Kala et al. [133] described speech recognition using a policy gradient and hypothesis selection DRL method, which gave promising results compared to unsupervised approaches. They observed that increasing the number of DRL stages increases the WER.
In deep Q-learning (DQN), the state is given as input to a neural network that approximates the Q-value function and generates the Q-values of all possible actions at the output, as shown in Fig. 10.
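To make this mapping concrete, the minimal sketch below is an illustrative assumption rather than code from any of the cited systems: it defines a small Q-network and performs one temporal-difference update of the kind used in DQN training, with a separate target network for the bootstrapped target.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Q-network: state vector in, one Q-value per possible action out (Fig. 10).
q_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def dqn_update(state, action, reward, next_state, done):
    """One temporal-difference update on a single transition."""
    q_value = q_net(state)[0, action]                      # Q(s, a)
    with torch.no_grad():
        next_q = target_net(next_state).max(dim=1).values[0]
        target = reward + gamma * next_q * (1.0 - done)    # bootstrapped target
    loss = F.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example transition with a 16-dimensional state and 4 possible actions.
s, s2 = torch.randn(1, 16), torch.randn(1, 16)
dqn_update(s, action=2, reward=1.0, next_state=s2, done=0.0)
```

Double DQN and Dueling DQN, mentioned below, modify exactly this update and the network structure to improve the stability of the learned Q-values.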
Content-based spoken data retrieval suffers from a high degree of uncertainty and noisy retrieval results, unlike content-based text retrieval. Hung-Yi Lee et al. [134] presented a deep Q-network (DQN) that determines the machine action without hand-crafted inputs, using the Mandarin Chinese broadcast news corpus for experimentation; Double DQN and Dueling DQN achieved better performance than the simple DQN. DQN has also been used for speech volume control in humanoid robots to improve human–robot interaction [135] and for dialogue policy decisions in chat systems [136].
DRL faces many challenges when implemented for real-world problems because of uncertainty in the action states. DRL performance can be improved by combining it more deeply with other AI techniques to obtain interpretability, generalization, and better sample complexity.

Fig. 10 Generalized structure of deep reinforcement Q-learning (the state is input to a deep Q-network that outputs a Q-value for each of the N possible actions)


4 Database

Database size, variability, and quality play a vital role in the performance of deep learn-
ing algorithms. Various standard databases for speech recognition, speaker recognition,
and voice activity detection are available online in open source, licensed, or public mode.
TIMIT is considered the baseline corpus for speech and speaker recognition; it consists of 10 American English sentences (about 30 s of speech) from each of 630 speakers [137]. LibriSpeech is a 1000-h, 16 kHz English speech corpus mostly used for phoneme classification [138]. The VoxCeleb database is a large-scale text-independent speaker recognition corpus consisting of 153,486 utterances from 1251 celebrities, extracted from YouTube videos [139]. Aurora-4 consists of clean and noisy data (noise added to the Wall Street Journal (WSJ0) corpus) with different SNRs at two sampling rates, 8 kHz and 16 kHz, and is frequently used for noise-robust speech recognition [140]. The Mandarin transcription task consists of 76,843 speech samples (about 64 h of speech) from 1500 speakers, along with an independent test set containing 3,720 samples (about 3 h) from an additional 50 speakers [141]. CHiME is a medium-vocabulary database that consists of an English speech corpus with transcripts (342 h) from noisy environments and 50 h of noisy audio samples [142]. REVERB-Challenge is used for reverberant speech recognition and enhancement tasks [143]. The Topic Detection and Tracking Phase 3 (TDT3) speech recognition database consists of samples from Chinese Mandarin news broadcasts [82]. The SWITCHBOARD database, which consists of 2500 conversations from 500 speakers, is often used for speech recognition and speaker recognition [144].
The performance of a speech emotion recognition system depends heavily on the emotion database, as emotion signals are strongly affected by the language, the length of the samples, and noise. Thus, a speech emotion database needs many samples from many users under different environmental conditions. EmoDB is a speech emotion database generally used for speech emotion recognition, consisting of 500 samples of happy, fearful, angry, anxious, bored, and disgusted speech recorded from 10 actors [145]. The Interactive Emotional Dyadic Motion Capture (IEMOCAP) database is an acted, multimodal, multi-speaker speech emotion database that consists of 12 h of anger, happiness, sadness, and neutral emotion speech samples. eNTERFACE'05 is an audiovisual emotion database used for combined speech and image emotion recognition [146]. The MSP-Podcast speech emotion database consists of 22,630 samples from male and female speakers [147]. The Isophonics [148] music database includes 19 songs by Queen, 180 songs by the Beatles, and 18 songs by Zweieck. RWC-Popular [149] contains 100 American and Japanese-style pop songs. MIR-1K is a music database consisting of 1000 music clips used for singer recognition and singing voice separation [150]. For speech enhancement, noisy samples are required, but acquiring a large number of samples under various noisy conditions is difficult; therefore, many researchers perform experimentation by adding standard noise signals to an available database. NOISEX is a noise database that consists of various noises such as factory noise, voice babble, pink noise, HF radio channel noise, different military noises, white noise, and Volvo 340 car noise. It is commonly used, along with other speech corpora, for evaluating noise-robust speech recognition, enhancement, and separation applications [151].
Most standard speech corpora are available only in the English language; there is a lack of large authoritative databases in other languages, which makes it challenging to check performance variability and language independence. The databases published so far focus on sample collection from experts, actors, or ordinary people, while very little attention is given to disordered voice samples such as stammering, stuttering, and dysarthric speech. In addition, many speech corpora are not publicly available, which limits the exploration and validation of researchers' results.

5 Evaluation Metrics

Evaluation metrics differ across speech processing applications. For speech enhancement, various subjective and objective metrics are used. PESQ (perceptual evaluation of speech quality) predicts subjective speech quality on a scale from -0.5 (extremely high distortion) to 4.5 (no distortion) and can be applied under a wide range of conditions, such as background noise, variable delay, and analog filtering; the larger the PESQ score, the better the predicted speech quality [42]. STOI (short-time objective intelligibility) estimates the objective intelligibility of a corrupted audio signal from the correlation between the temporal envelopes of the corrupted speech signal and its clean reference, and STOI scores have been observed to correlate strongly with human speech intelligibility scores [77, 78]. ESTOI (extended STOI) represents the objective intelligibility of a corrupted speech signal by calculating the spectral correlation coefficients of the corrupted signal and its clean reference in short time segments; unlike STOI, ESTOI does not assume that frequency bands are mutually independent. Both STOI and ESTOI scores range from 0 to 1, and higher scores indicate better predicted intelligibility [152]. Performance metrics such as the signal-to-distortion ratio (SDR) [153], signal-to-interference ratio (SIR), and signal-to-artifacts ratio (SAR) illustrate the quality of the speech signal after the speech enhancement process. In [97], two further evaluation metrics are used: the hearing-aid speech quality index (HASQI) and the hearing-aid speech perception index (HASPI).
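As a practical illustration of how these objective scores are usually obtained in Python, the sketch below assumes the third-party soundfile, pesq, and pystoi packages and a hypothetical pair of time-aligned 16 kHz mono recordings; the SDR shown is a simple energy-ratio version computed directly with NumPy.

```python
import numpy as np
import soundfile as sf       # assumed: pip install soundfile
from pesq import pesq        # assumed: pip install pesq
from pystoi import stoi      # assumed: pip install pystoi

# Hypothetical file names; any pair of time-aligned 16 kHz mono recordings works.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

pesq_score = pesq(fs, clean, enhanced, 'wb')              # -0.5 (worst) to 4.5 (best)
stoi_score = stoi(clean, enhanced, fs, extended=False)    # 0 to 1
estoi_score = stoi(clean, enhanced, fs, extended=True)    # 0 to 1

# SDR in dB: energy of the reference over energy of the residual error.
def sdr(reference, estimate):
    noise = estimate - reference
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

print(pesq_score, stoi_score, estoi_score, sdr(clean, enhanced))
```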
Word error rate (WER) is a standard evaluation metric for the speech recognition task that measures the fraction of word errors after aligning the hypothesis and reference word sequences: the number of insertions, deletions, and substitutions divided by the total number of reference words [81, 82]. The performance of speaker recognition systems is also evaluated using miss probability, false alarm probability, equal error rate (EER), and percentage accuracy, which give statistical measures of correctly or wrongly recognized speakers. For a binary classification task, the F1-score is a popular evaluation metric based on precision and recall; the larger the F1-score, the better the algorithm performance. False alarm rate (FAR) and miss detection rate (MDR) are also standard evaluation metrics for speech recognition, and several deep learning algorithms aim to reduce FAR and MDR. Reza Lotfian [91] investigated the concordance correlation coefficient (CCC) metric for the performance evaluation of speech emotion recognition.
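For concreteness, WER can be computed from a word-level edit-distance alignment; the function below is a generic sketch using dynamic programming, not taken from any toolkit cited above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat mat"))  # ~0.33
```

In the example, the hypothesis misses two of the six reference words, so WER = 2/6, approximately 0.33.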

6 Discussion

This section provides a comparative analysis and discussion of the various deep learning frameworks applied to speech processing applications, based on methodology, database, and performance evaluation. It mainly focuses on speech preprocessing, speech recognition, speaker recognition, emotion recognition, and other speech processing domains.
Table 1 gives the comparative analysis of various deep learning algorithms for speech pre-processing, such as speech enhancement and speech separation. Subjective and qualitative analysis based on different evaluation metrics has shown that RBMs and DBN give better performance than the other approaches. The performance of speech enhancement is limited by the unavailability of the noise phase spectrogram, larger time complexity, and poor performance for online speech enhancement.

Table 1 Comparative analysis of various deep learning algorithms for speech enhancement and speech separation (each entry lists authors; application; methodology; database; evaluation metrics and performance)
1. Yong Xu et al. [61]; speech enhancement; regressive multiple restricted Boltzmann machines (RBMs); TIMIT; PESQ 2.83 (car noise), 2.47 (exhibition noise).
2. Arun Narayanan et al. [81]; speech separation and noisy speech recognition; diagonal feature discriminant linear regression (dFDLR) and deep neural network (DNN); Aurora-4 medium-large vocabulary; word error rate 4.8% (clean training).
3. Tae Gyoon Kang et al. [75]; source separation and speech enhancement; deep neural network with non-negative matrix factorization (NMF); TIMIT and NOISEX-92 noise dataset; SDR 8.74, SIR 11.20, SAR 13.91, PESQ 2.23.
4. Emmanuel Affonso et al. [72]; speech quality assessment; deep belief network (DBN) and radial basis function SVM (RBF-SVM); ITU-T recommendation; accuracy 95.00%.
5. Jen-Cheng Hou et al. [97]; speech enhancement; audiovisual deep CNNs (AVDCNN); Taiwan Mandarin hearing in noise test (Taiwan MHINT); PESQ 2.41, STOI 0.66, SDI 0.45, HASQI 0.43, HASPI 0.99.
6. Yi Luo et al. [98]; speech separation; deep attractor network (DANet); Wall Street Journal dataset; SDR 10.4 (2-speaker), 8.5 (3-speaker).
7. Yan Zhao et al. [78]; speech enhancement; deep neural network; IEEE corpus and diverse environments multi-channel acoustic noise database (DEMAND); STOI (%) 79.4 (Dliving noise), 68.5 (Pcafeter noise), 70.7 (Babble noise); PESQ 2.05 (Dliving noise), 1.66 (Pcafeter noise), 1.66 (Babble noise).
In recent years, many deep learning frameworks have been adopted for ASR, whose performance is evaluated mostly using percentage accuracy, word error rate, and hit rate. Because of the higher correlation and representation capability of CNNs, deep architectures based on CNN have given better performance for ASR. The performance of deep learning algorithms for speech recognition remains challenging because of cross-domain training, noisy training, language dependency, and varying environmental conditions (Table 2).
The performance of various deep learning algorithms for speaker recognition applications is shown in Table 3, which also covers music singer recognition and spoofing detection. The performance of speaker recognition systems is mostly measured on the basis of equal error rate (% EER), false alarm rate (FAR), and recognition accuracy. It is observed that DNN and CNN-LSTM represent speaker-specific characteristics better and result in better performance.
Table 4 gives a detailed comparative analysis of speech emotion recognition using deep learning algorithms. Deep learning algorithms such as DCNN, DBN, and RBM have been applied successfully to speech emotion recognition. Easy training and the weight-sharing capability of deep learning algorithms have significantly improved machine learning-based speech emotion recognition systems. Various multimodal emotion recognition systems that consider audio-video data have been suggested to increase emotion recognition performance. The performance of deep learning algorithms is often restricted by over-learning during memorization of layer-wise information, complex architectures, language dependency, and temporal variation in the input data.
Deep learning is becoming more popular in various speech processing fields because of its higher representation ability and its capacity to handle complex problems and large databases. Table 5 shows the comparative analysis of miscellaneous speech processing applications such as spoken content retrieval, language identification, dialect identification, and sound event recognition. Supervised deep learning models have given superior performance for the various pattern-based recognition applications.

7 Conclusion

This paper has presented a comprehensive review of deep learning architectures and their applications to speech processing over the past few years. Various modern deep learning models in different learning groups, including unsupervised, supervised, semi-supervised, and Reinforcement Learning (RL), and their applications in different domains are reviewed. The paper presented the structure and applications of the Auto-Encoder (AE), Generative Adversarial Network (GAN), Restricted Boltzmann Machine (RBM), Deep Belief Network (DBN), Deep Neural Network (DNN), Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Deep Reinforcement Learning (DRL) for various speech processing applications, focusing on major tasks such as speech enhancement, speech separation, speech recognition, speaker recognition, emotion recognition, and natural language processing.

Table 2 Comparative analysis of various deep learning algorithms for speech recognition (each entry lists authors; application; methodology; database; evaluation metric and performance)
1. Shaofei Xue et al. [83]; speech recognition; deep neural network; TIMIT large vocabulary switchboard task; word error rate 12.10%.
2. Yuxuan Wang and DeLiang Wang [60]; speech recognition; DNN-SVM-SEG; TIMIT, IEEE Female, IEEE Male; HIT-false alarm rate (FA) 63.8, 65.9, 64.7.
3. Dong Yu et al. [80]; speech recognition; deep tensor neural network (DTNN); 30-h SWB task; word error rate 16.60%.
4. Wang and Sim [82]; speech recognition; regressive context-dependent DNN (CD-DNN); Topic Detection and Tracking Phase 3 (TDT3) corpus; word error rate 10.80%.
5. Pan Zhou et al. [84]; speech recognition; multiple DNN (mDNN); Mandarin transcription task; word error rate 14.50%.
6. Tian Tan et al. [99]; speech recognition; very deep CNN (VDCRN); AURORA4 dataset; word error rate 5.67%.
7. Chunyang Wu et al. [86]; speech recognition; DNN-KL divergence method; Broadcast News; word error rate 9.80%.
8. Purvi Agrawal and Sriram Ganapathy [44]; speech recognition; convolutional variational autoencoder (CVAE); Aurora-4, REVERB Challenge, CHiME-3 Challenge; word error rate 11.2%, 16.1% and 15.3%.
9. Jianqing Gao et al. [85]; speech recognition; DNN for multiple bandwidth expansion (DNN-DM-MBE); large-scale Mandarin speech dataset; character error rate (CER) 21.20%.

Table 3 Comparative analysis of various deep learning algorithms for speaker recognition (each entry lists authors; application; methodology; database; evaluation metrics and performance)
1. Chen and Salman [87]; speaker recognition; deep neural architecture (DNA); TIMIT and Chinese (CHN) corpus; FAR, MDR, F1 (mean ± STD): TIMIT 0.25 ± 0.09, 0.19 ± 0.09, 0.74 ± 0.12; CHN 0.21 ± 0.06, 0.34 ± 0.09, 0.68 ± 0.08.
2. Zhili Tan et al. [88]; speaker verification; deep neural network; NIST 2012 SRE; EER of 3.6% at 0 dB SNR.
3. Hong Yu et al. [89]; speaker verification and spoofing detection; deep neural network and human log-likelihoods (DNN-HLL); ASVspoof2015 database; EER 0.05%.
4. Heinrich Dinkel et al. [108]; speech spoofing detection; joint convolutional LSTM deep neural network (CLDNN); BTAS2016 and ASVspoof2015; half total error rate (HTER) 0.19% and 0.0%.
5. Jati and Georgiou [100]; speaker recognition; neural predictive coding (NPC) and convolutional Siamese network; VoxCeleb database; EER 7.21%, minimum normalized detection cost function (minDCF) 0.61.
6. Nguyen An et al. [101]; speaker identification (text-independent); convolutional neural network (CNN); VoxCeleb database; accuracy 88.2% (VGG + self-attention layer) and 90.8% (ResNet + self-attention layer).
7. Zebang Shen et al. [120]; singer recognition; long short-term memory (LSTM); Chinese SID in MIR-1K; accuracy 88.40%.
8. Hourri et al. [73]; speaker recognition; MFCC + DNN; THUYG-20 SRE corpus; EER 0.43% and 3.07% (female), 0.55% and 3.19% (male).
9. Arsha Nagrani et al. [102]; speaker recognition; Thin-ResNet with GhostVLAD; VoxCeleb database; EER 2.87%.

Table 4 Comparative analysis of various deep learning algorithms for speech emotion recognition (each entry lists authors; application; methodology; database; evaluation metric and performance)
1. Shah et al. [62]; emotion recognition; supervised replicated softmax model (sRSM) based on RBM; arousal IEMOCAP, valence IEMOCAP, arousal SEMAINE, valence SEMAINE; weighted average recall rate 72.00%, 60.00%, 66.35%, 66.45%.
2. Wen et al. [70]; emotion recognition; random deep belief networks (RDBN); EMODB, CASIA, SAVEE; weighted average accuracy 82.32%, 48.50%, 53.60%.
3. Zhang et al. [103]; emotion recognition; DCNN and discriminant temporal pyramid matching (DTPM); EMO-DB, RML, eNTERFACE05 and BAUM-1s; recognition accuracy 87.31% (EMO-DB), 69.70% (RML), 76.56% (eNTERFACE), 44.61% (BAUM-1s).
4. Jianfeng Zhao et al. [104]; emotion recognition; 1D-CNN; Berlin EmoDB and IEMOCAP databases; accuracy 89.77% and 86.36%.
5. Reza Lotfian et al. [91]; emotion recognition; deep neural network; MSP-Podcast database; F1-score 42.10%.
6. Elias Ocquaye et al. [106]; emotion recognition; dual exclusive attentive transfer (DEAT) based unsupervised CNN; ABC corpus and emo-DB database; unweighted average recall (UAR) 65.02% for the ABC corpus and 67.79% for the emo-DB database.
7. Tripathi et al. [107]; emotion recognition; CNN-MFCC + TEXT; IEMOCAP data; accuracy 76.10%.
8. Jianfeng Zhao et al. [122]; emotion recognition; 2-D CNN LSTM; Berlin EmoDB and Interactive Emotional Dyadic Motion Capture (IEMOCAP); accuracy 95.33% (speaker dependent) and 95.89% (speaker independent) on EmoDB, 89.16% (speaker dependent) and 52.14% (speaker independent) on IEMOCAP.
9. Hossain et al. [105]; emotion recognition (speech and video); convolutional extreme learning machines (ELMs); eNTERFACE'05 audiovisual emotion database; accuracy 99.90%.

Table 5 Comparative analysis of various deep learning algorithms for miscellaneous speech processing applications (each entry lists authors; application; methodology; database; evaluation metric and performance)
1. Hung-Yi Lee et al. [134]; spoken content retrieval; deep reinforcement learning deep Q-network (DQN); Stanford question answering dataset; word error rate 22.70%.
2. Bingquan Liu et al. [92]; speech data retrieval; deep neural network; Baidu Tieba corpus (BTC) and Reddit corpus (RC); accuracy 71.60% (BTC) and 72.46% (RC).
3. Zhiyuan Tang et al. [115]; language identification; LSTM-RNN; Babel database and the AP16-OLR database; EER 5.70% (Babel) and 6.34% (AP16-OLR).
4. Qian Zhang et al. [47]; language/dialect recognition; generative autoencoder; CHINESE corpus, PAN-ARABIC and MBG-3; accuracy 97.8%, 81.3%, and 65.4%.
5. Ruhi Sarikaya et al. [69]; natural language processing; deep belief network (DBN); call routing database; accuracy 90.80%.
6. Kun Han et al. [116]; pitch tracking; DNN and RNN; TIMIT and NOISEX-92 noise dataset; detection rate 66.4% (DNN) and 66.2% (RNN).
7. Chien-Yao Wang et al. [71]; sound event recognition; hierarchical-diving deep belief network (HDDBN); RWCP dataset; accuracy 99.27% for clean data and 95.06% for noisy data (0 dB SNR).
8. Bui et al. (2019); human–robot interaction; deep Q-network; RobotSVA; standard error 0.103.

For speech processing, the speech signal is typically converted into a two-dimensional spectral representation that is given to the deep learning architecture as input. Log-Mel spectrogram or MFCC features provide a compact two-dimensional representation of the speech signal, but this may corrupt the temporal variation properties and discard the phase information of the original speech signal. Therefore, raw speech is used for speech separation and enhancement applications.
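As an illustration of this front-end, the snippet below is a sketch that assumes the librosa package and a hypothetical file name; it computes the log-Mel spectrogram and MFCC matrices that are typically fed to the deep models discussed above.

```python
import librosa

# Hypothetical input file; librosa resamples to 16 kHz mono here.
y, sr = librosa.load("utterance.wav", sr=16000)

# Log-Mel spectrogram: 2-D time-frequency representation (n_mels x frames).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                     hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)

# MFCCs: a more compact cepstral representation (n_mfcc x frames).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(log_mel.shape, mfcc.shape)
```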
Unsupervised auto-encoders (AE) have given superior performance for speech enhancement, restoration, reverberation reduction, noise estimation, and speech separation in noisy environmental conditions. Unsupervised GAN has attracted vast attention for noise-robust speech recognition and text-to-speech conversion because of its ability to learn the internal properties of the data and to generate output similar to the input; however, the GAN model suffers from non-convergence, unstable training, sensitivity to hyper-parameter selection, and over-fitting due to the imbalance between the discriminator and generator networks. RBMs and DBNs are widely used for speech separation, enhancement, and emotion recognition applications because of their optimal and discriminative feature learning capability. DNNs have better feature representation capability for non-linear functions because of their multiple hidden layers, and they are efficient for speech enhancement, speech recognition, speaker recognition, and similar tasks because they combine feature extraction and classification layers. CNN uses fixed receptive fields that limit the temporal context that can be considered for speech recognition, speaker recognition, and emotion recognition. RNN can exploit unlimited temporal context, which can be learned using the adaptive LSTM model, but it needs sequential processing of the input speech signal, making it slower than CNN. Therefore, CRNN offers a compromise, inheriting both the advantages and disadvantages of CNNs and RNNs. Deep reinforcement learning and deep Q-learning models are more popular for speech enhancement and dialogue-based systems in robotics because of their ability to find a balance between exploration and exploitation, but DRL is avoided for real-time applications because of its extensive response time and uncertainty in the action states.
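This compromise can be made concrete with a small convolutional recurrent sketch; the code below is an illustrative assumption rather than an implementation of any specific CRNN cited above. Convolutional layers summarize local time-frequency patterns of a log-Mel input, and an LSTM then models the longer temporal context before classification.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front-end over a log-Mel spectrogram followed by an LSTM back-end."""

    def __init__(self, n_mels=40, n_classes=10):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(input_size=32 * (n_mels // 4), hidden_size=64,
                           batch_first=True)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, 1, n_mels, time)
        f = self.conv(x)                                    # (batch, 32, n_mels/4, time)
        b, c, freq, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * freq)   # (batch, time, features)
        out, _ = self.rnn(f)
        return self.fc(out[:, -1, :])                       # last time step -> classes

model = CRNN(n_mels=40, n_classes=10)
print(model(torch.randn(8, 1, 40, 200)).shape)  # torch.Size([8, 10])
```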
A pre-trained model for audio data that can be used to learn from raw samples for large-vocabulary speech, speaker, and emotion recognition is not yet available. Compared to traditional methods, deep learning models require larger computational power and more training data. General-purpose CPUs are not well suited to deep learning model implementation; instead, general-purpose graphics processing units (GPGPUs), which are optimized for matrix operations, and application-specific integrated circuits such as proprietary tensor processing units (TPUs) are used. This also restricts the deployment of deep learning models on smaller devices such as mobile phones or hearing aids.
Deep learning can be extended further to improve recent speech processing applications such as spoofing detection, speech pathology, robotics, automation, auto-tagging, audio content retrieval on social media, hate speech detection, stress detection, autism detection, and audio conferencing. Deep learning still struggles to reveal the significance of the input speech features and its inner working principles. It is observed that the superior performance of deep learning models comes at the cost of network complexity, which is frequently challenging to optimize and prone to over-fitting without a huge number of samples to train the many parameters. Finally, emerging deep learning research in speech processing aims at high efficiency for data-intensive applications; however, it needs vigilant selection of models and model parameters to guarantee model robustness.

Funding None.


Data Availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of Interest The authors declare that they have no conflict of interest.

References
1. Sarker, I. H. (2021). Deep learning: A comprehensive overview on techniques, taxonomy, applica-
tions and research directions. SN Computer Science, 2(6), 1–20.
2. Otter, D. W., Medina, J. R., & Kalita, J. K. (2020). A survey of the usages of deep learning for
natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2),
604–624.
3. Alam, M., Samad, M. D., Vidyaratne, L., Glandon, A., & Iftekharuddin, K. M. (2020). Survey on
deep neural networks in speech and vision systems. Neurocomputing, 417, 302–321.
4. Watanabe, S., & Araki, S. (2019). Introduction to the issue on far-field speech processing in the era of
deep learning: speech enhancement, separation, and recognition. IEEE Journal of Selected Topics in
Signal Processing, 13(4), 785–786.
5. Raj, D., Denisov, P., Chen, Z., Erdogan, H., Huang, Z., He, M., Watanabe, S., Du, J., Yoshioka, T.,
Luo, Y., & Kanda, N. (2021). Integration of speech separation, diarization, and recognition for multi-
speaker meetings: System description, comparison, and analysis. In 2021 IEEE spoken language tech-
nology workshop (SLT), pp. 897–904. IEEE.
6. Suh, J. Y., Bennett, C. C., Weiss, B., Yoon, E., Jeong, J., & Chae, Y. (2021). Development of speech
dialogue systems for social ai in cooperative game evironments. In IEEE region 10 symposium (TEN-
SYMP 2021).
7. Hanifa, R. M., Isa, K., & Mohamad, S. (2021). A review on speaker recognition: Technology and
challenges. Computers & Electrical Engineering, 90, 107005.
8. Ntalampiras, S. (2021). Speech emotion recognition via learning analogies. Pattern Recognition Let-
ters, 144, 21–26.
9. Deng, L., Hassanein, K., & Elmasry, M. (1994). Analysis of the correlation structure for a neural pre-
dictive model with application to speech recognition. Neural Networks, 7(2), 331–339.
10. Cohen, J., Kamm, T., & Andreou, A. (1995). Vocal tract normalization in speech recognition: Com-
pensation for system systematic speaker variability. The Journal of the Acoustical Society of America,
97(5), 3246–3247.
11. Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on
Signal Processing, 45(11), 2673–2681. https://​doi.​org/​10.​1109/​78.​650093
12. Hermansky, H., Ellis, D. P. W., & Sharma, S. (2000). Tandem connectionist feature extraction for
conventional HMM systems. In 2000 IEEE international conference on acoustics, speech, and signal
processing proceedings (Cat. No.00CH37100), Istanbul, Turkey, vol. 3, pp. 1635–1638. https://​doi.​
org/​10.​1109/​ICASSP.​2000.​862024.
13. Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., & Zweig, G. (2005). fPME: Discrimina-
tively trained features for speech recognition. In Proceedings IEEE ICASSP’05, pp. 961–964.
14. Morgan, N., et al. (2005). Pushing the envelope: Aside [speech recognition]. IEEE Signal Processing
Magazine, 22(5), 81–88. https://​doi.​org/​10.​1109/​MSP.​2005.​15118​26
15. Grezl, F., Karafiat, M., Kontar, S., & Cernocky, J. (2007). Probabilistic and bottle-neck features for
LVCSR of meetings. In 2007 IEEE international conference on acoustics, speech and signal process-
ing-ICASSP ’07, Honolulu, HI, pp. IV-757-IV-760. https://​doi.​org/​10.​1109/​ICASSP.​2007.​367023.
16. Morgan, N. (2012). Deep and wide: Multiple layers in automatic speech recognition. IEEE Transac-
tions on Audio, Speech, and Language Processing, 20(1), 7–13. https://​doi.​org/​10.​1109/​TASL.​2011.​
21160​10
17. Rabiner, L. R., & Schafer, R. W. (2007). Introduction to digital speech processing. Now Publishers
Inc.
18. Van Gilse, P. H. G. (1948). Another method of speech without larynx. Acta Oto-Laryngologica,
36(sup78), 109–110.
19. Everest, F. A., & Pohlmann, K. (2009). Master handbook of acoustics. McGraw-Hill/TAB
Electronics.


20. Haneche, H., Ouahabi, A., & Boudraa, B. (2021). Compressed sensing-speech coding scheme
for mobile communications. Circuits, Systems, and Signal Processing. https://​doi.​org/​10.​1007/​
s00034-​021-​01712-x
21. Sonawane, A., Inamdar, M. U., & Bhangale, K. B. (2017). Sound based human emotion recogni-
tion using MFCC & multiple SVM. In 2017 international conference on information, communication,
instrumentation and control (ICICIC), pp. 1–4. IEEE.
22. Bhangale, K. B., Titare, P., Pawar, R., & Bhavsar, S. (2018). Synthetic speech spoofing detection
using MFCC and radial basis function SVM. IOSR Journal of Engineering (IOSRJEN), 8(6), 55–61.
23. Bhangale, K. B., & Mohanaprasad, K. (2021). A review on speech processing using machine learning
paradigm. International Journal of Speech Technology, 24(2), 367–388.
24. Nirmal, J., Zaveri, M., Patnaik, S., & Kachare, P. (2014). Voice conversion using general regression
neural network. Applied Soft Computing, 24, 1–12.
25. Amrouche, A., Taleb-Ahmed, A., Rouvaen, J. M., & Yagoub, M. C. E. (2009). Improvement of the
speech recognition in noisy environments using a nonparametric regression. International Journal of
Parallel, Emergent and Distributed Systems, 24(1), 49–67.
26. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521, 436–444. https://​doi.​org/​10.​
1038/​natur​e14539
27. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspec-
tives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798–1828. https://​
doi.​org/​10.​1109/​TPAMI.​2013.​50
28. Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of logis-
tic regression and naive bayes. In Proceedings of the 14th international conference on neural infor-
mation processing systems, Cambridge, MA, USA: MIT Press, 2001, pp. 841–848.
29. LeCun, Y., Kavukcuoglu, K., & Farabet, C. (2010). Convolutional networks and applications in
vision. In Proceedings of 2010 IEEE international symposium on circuits and systems, pp. 253–256.
30. Purwins, H., Li, Bo., Virtanen, T., Schlüter, J., Chang, S.-Y., & Sainath, T. (2019). Deep learning for
audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2), 206–219.
31. Chen, X. W., & Lin, X. (2014). Big data deep learning: Challenges and perspectives. IEEE Access, 2,
514–525.
32. Shrestha, A., & Mahmood, A. (2019). Review of deep learning algorithms and architectures. IEEE
Access, 7, 53040–53065.
33. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. In Adaptive computation and
machine learning series (p. 775). MIT Press. https://​mitpr​ess.​mit.​edu/​books/​deep-​learn​ing.
34. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A
simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research,
15(1), 1929–1958.
35. Strom, N. (2015). Scalable distributed DNN training using commodity GPU cloud computing. In Six-
teenth annual conference of the international speech communication association.
36. Jolliffe, I. T. (2002). Mathematical and statistical properties of sample principal components. In:
Principal Component Analysis. Springer Series in Statistics. Springer, New York. https://​doi.​org/​10.​
1007/0-​387-​22440-8_3.
37. Noda, K. (2013). Multimodal integration learning of object manipulation behaviors using deep neural
networks. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems,
pp. 1728–1733.
38. Lu, X., Matsuda, S., Hori, C., & Kashioka, H. (2012). Speech restoration based on deep learning
autoencoder with layer-wised pretraining. In 13th annual conference of the international speech com-
munication association.
39. Lu, X., Matsuda, S., Hori, C., & Kashioka, H. (2012). Speech restoration based on deep learning
autoencoder with layer-wised learning. In INTERSPEECH, Portland, Oregon, Sept. 2012.
40. Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2013). Speech enhancement based on deep denoising auto-
encoder. In Proceedings of interspeech, pp. 436–440.
41. Lu, X., Tsao, Y., Matsuda, S., & Hori, C. (2014). Ensemble modeling of denoising autoencoder for
speech spectrum restoration. In Proceedings of the annual conference of the international speech
communication association, INTERSPEECH, pp 885–889.
42. Sun, M., Zhang, X., Van Hamme, H., & Zheng, T. F. (2016). Unseen noise estimation using sepa-
rable deep auto encoder for speech enhancement. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 24(1), 93–104. https://​doi.​org/​10.​1109/​TASLP.​2015.​24981​01.
43. Safari, R., Ahadi, S. M., & Seyedin, S. (2017). Modular dynamic deep denoising autoencoder for
speech enhancement. In 2017 7th international conference on computer and knowledge engineering
(ICCKE), Mashhad, pp. 254–259. https://​doi.​org/​10.​1109/​ICCKE.​2017.​81678​86.


44. Agrawal, P., & Ganapathy, S. (2019). Modulation filter learning using deep variational networks
for robust speech recognition. IEEE Journal of Selected Topics in Signal Processing, 13(2),
244–253.
45. Leglaive, S., Alameda-Pineda, X., Girin, L., & Horaud, R. (2020). A recurrent variational autoen-
coder for speech enhancement. In ICASSP 2020–2020 IEEE international conference on acous-
tics, speech and signal processing (ICASSP), Barcelona, Spain, pp. 371–375. https://​doi.​org/​10.​
1109/​ICASS​P40776.​2020.​90531​64.
46. Li, Y., Zhang, X., Li, X., Zhang, Y., Yang, J., & He, Q. (2018). Mobile phone clustering from
speech recordings using deep representation and spectral clustering. IEEE Transactions on Infor-
mation Forensics and Security, 13(4), 965–977. https://​doi.​org/​10.​1109/​TIFS.​2017.​27745​05
47. Zhang, Q., & Hansen, J. H. L. (2018). Language/dialect recognition based on unsupervised deep
learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(5), 873–882.
48. Chorowski, J., Weiss, R. J., Bengio, S., & van den Oord, A. (2019). Unsupervised speech repre-
sentation learning using WaveNet autoencoders. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(12), 2041–2053.
49. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.,
& Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing
systems, pp. 2672–2680.
50. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:​
1411.​1784.
51. Qian, Y., Hu, Hu., & Tan, T. (2019). Data augmentation using generative adversarial networks for
robust speech recognition. Speech Communication, 114, 1–9.
52. Pascual, S., Serra, J., & Bonafonte, A. (2019). Time-domain speech enhancement using generative
adversarial networks. Speech Communication, 114, 10–21.
53. Kaneko, T., Kameoka, H., Hojo, N., Ijima, Y., Hiramatsu, K., & Kashino, K. (2017). Generative
adversarial network-based postfilter for statistical parametric speech synthesis. In 2017 IEEE interna-
tional conference on acoustics, speech and signal processing (ICASSP), pp. 4910–4914. IEEE.
54. Kaneko, T., Takaki, S., Kameoka, H., & Yamagishi J. (2017). Generative adversarial network-
based postfilter for STFT spectrograms. In Interspeech, pp. 3389–3393.
55. Hsu, C. C., Hwang, H. T., Wu, Y. C., Tsao, Y., & Wang H. M. (2017). Voice conversion from una-
ligned corpora using variational autoencoding wasserstein generative adversarial networks. arXiv
preprint arXiv:​1704.​00849.
56. Mimura, M., Sakai, S., & Kawahara, T. (2017). Cross-domain speech recognition using nonparal-
lel corpora with cycle-consistent adversarial networks. In 2017 IEEE automatic speech recogni-
tion and understanding workshop (ASRU), pp. 134–140. IEEE.
57. Hu, H., Tan, T., & Qian, Y. (2018). Generative adversarial networks based data augmentation for
noise robust speech recognition. In 2018 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pp. 5044–5048. IEEE.
58. Freund, Y., & Haussler, D. (1992). Unsupervised learning of distributions on binary vectors using
two layer networks. In Advances in neural information processing systems, pp. 912–919.
59. Larochelle, H., & Bengio, Y. (2008). Classification using discriminative restricted Boltzmann
machines. In Proceedings of the 25th international conference on machine learning, pp. 536–543.
60. Wang, Y., & Wang, D. (2013). Towards scaling up classification-based speech separation. IEEE
Transactions on Audio, Speech, and Language Processing, 21(7), 1381–1390. https://​doi.​org/​10.​
1109/​TASL.​2013.​22509​61
61. Xu, Y., Du, J., Dai, L., & Lee, C. (2014). An experimental study on speech enhancement based on
deep neural networks. IEEE Signal Processing Letters, 21(1), 65–68. https://​doi.​org/​10.​1109/​LSP.​
2013.​22912​40
62. Shah, M., Chakrabarti, C., & Spanias, A. (2015). Within and cross-corpus speech emotion recog-
nition using latent topic model-based features. EURASIP Journal on Audio, Speech, and Music
Processing, 2015(1), 4.
63. Navamani, T. M. (2019). Efficient deep learning approaches for health informatics. In Deep learning
and parallel computing environment for bioengineering systems (pp. 503–519). Elsevier. https://​doi.​
org/​10.​1016/​B978-0-​12-​816718-​2.​00014-2.
64. Rizk, Y., Hajj, N., Mitri, N., & Awad, M. (2019). Deep belief networks and cortical algorithms: A
comparative study for supervised classification. Applied Computing and Informatics, 15(2), 81–93.
65. Mohamed, A. R., Dahl, G., & Hinton, G. (2009). Deep belief networks for phone recognition. In Nips
workshop on deep learning for speech recognition and related applications, vol. 1, no. 9, p. 39.


66. Mohamed, A. R., Yu, D., & Deng L. (2010). Investigation of full-sequence training of deep belief
networks for speech recognition. In Eleventh annual conference of the international speech com-
munication association.
67. Mohamed, A.-R., Dahl, G. E., & Hinton, G. (2011). Acoustic modeling using deep belief net-
works. IEEE transactions on audio, speech, and language processing, 20(1), 14–22.
68. Zhang, X., & Wu, J. (2013). Deep belief networks based voice activity detection. IEEE Transac-
tions on Audio, Speech, and Language Processing, 21(4), 697–710. https://​doi.​org/​10.​1109/​TASL.​
2012.​22299​86
69. Sarikaya, R., Hinton, G. E., & Deoras, A. (2014). Application of deep belief networks for natural
language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(4), 778–784. https://​doi.​org/​10.​1109/​TASLP.​2014.​23032​96
70. Wen, G., Li, H., Huang, J., Li, D., & Xun, E. (2017). Random deep belief networks for recogniz-
ing emotions from speech signals. Computational Intelligence and Neuroscience. https://​doi.​org/​
10.​1155/​2017/​19456​30
71. Wang, C., Wang, J., Santoso, A., Chiang, C., & Wu, C. (2018). Sound event recognition using
auditory-receptive-field binary pattern and hierarchical-diving deep belief network. IEEE/ACM
Transactions on Audio, Speech, and Language Processing, 26(8), 1336–1351. https://​doi.​org/​10.​
1109/​TASLP.​2017.​27384​43
72. Affonso, E. T., Rosa, R. L., & Rodríguez, D. Z. (2018). Speech quality assessment over lossy
transmission channels using deep belief networks. IEEE Signal Processing Letters, 25(1), 70–74.
https://​doi.​org/​10.​1109/​LSP.​2017.​27735​36
73. Hourri, S., & Kharroubi, J. (2020). A deep learning approach for speaker recognition. Interna-
tional Journal of Speech Technology, 23(1), 123–131.
74. Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learn-
ing., 2(1), 1–127.
75. Kang, T. G., Kwon, K., Shin, J. W., & Kim, N. S. (2015). NMF-based Target source separation
using deep neural network. IEEE Signal Processing Letters, 22(2), 229–233. https://​doi.​org/​10.​
1109/​LSP.​2014.​23544​56
76. Nie, S., Liang, S., Liu, W., Zhang, X., & Tao, J. (2018). Deep learning based speech separation via
NMF-style reconstructions. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(11), 2043–2055.
77. Zheng, N., & Zhang, X. (2019). Phase-aware speech enhancement based on deep neural networks.
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1), 63–76. https://​doi.​
org/​10.​1109/​TASLP.​2018.​28707​42
78. Zhao, Y., Wang, Z., & Wang, D. (2019). Two-stage deep learning for noisy-reverberant speech
enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(1),
53–62.
79. Dahl, G. E., Yu, D., Deng, L., & Acero, A. (2012). Context-dependent pre-trained deep neural
networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Lan-
guage Processing, 20(1), 30–42. https://​doi.​org/​10.​1109/​TASL.​2011.​21340​90
80. Yu, D., Deng, L., & Seide, F. (2013). The deep tensor neural network with applications to large
vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing,
21(2), 388–396. https://​doi.​org/​10.​1109/​TASL.​2012.​22277​38
81. Narayanan, A., & Wang, D. (2014). Investigation of speech separation as a front-end for noise
robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
22(4), 826–835. https://​doi.​org/​10.​1109/​TASLP.​2014.​23058​33
82. Wang, G., & Sim, K. C. (2014). Regression-based context-dependent modeling of deep neural net-
works for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Process-
ing, 22(11), 1660–1669. https://​doi.​org/​10.​1109/​TASLP.​2014.​23448​55
83. Xue, S., Abdel-Hamid, O., Jiang, H., Dai, L., & Liu, Q. (2014). Fast adaptation of deep neural
network based on discriminant codes for speech recognition. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 22(12), 1713–1725. https://​doi.​org/​10.​1109/​TASLP.​2014.​
23463​13
84. Zhou, P., Jiang, H., Dai, L., Hu, Y., & Liu, Q. (2015). State-clustering based multiple deep neural
networks modeling approach for speech recognition. IEEE/ACM Transactions on Audio, Speech,
and Language Processing, 23(4), 631–642. https://​doi.​org/​10.​1109/​TASLP.​2015.​23929​44
85. Gao, J., Du, J., & Chen, E. (2019). Mixed-bandwidth cross-channel speech recognition via joint
optimization of dnn-based bandwidth expansion and acoustic modeling. IEEE/ACM Transactions
on Audio, Speech, and Language Processing, 27(3), 559–571. https://​doi.​org/​10.​1109/​TASLP.​
2018.​28867​39


86. Wu, C., Gales, M. J. F., Ragni, A., Karanasou, P., & Sim, K. C. (2018). Improving interpretability
and regularization in deep learning. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 26(2), 256–265. https://​doi.​org/​10.​1109/​TASLP.​2017.​27749​19
87. Chen, K., & Salman, A. (2011). Learning speaker-specific characteristics with a deep neural architec-
ture. IEEE Transactions on Neural Networks, 22(11), 1744–1756. https://​doi.​org/​10.​1109/​TNN.​2011.​
21672​40
88. Tan, Z., Mak, M., & Mak, B. K. (2018). DNN-based score calibration with multitask learning for
noise robust speaker verification. IEEE/ACM Transactions on Audio, Speech, and Language Process-
ing, 26(4), 700–712.
89. Yu, H., Tan, Z., Ma, Z., Martin, R., & Guo, J. (2018). Spoofing detection in automatic speaker veri-
fication systems using DNN classifiers and dynamic acoustic features. IEEE Transactions on Neural
Networks and Learning Systems, 29(10), 4633–4644.
90. Wang, Z., & Wang, D. (2019). Combining spectral and spatial features for deep learning based blind
speaker separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(2),
457–468.
91. Lotfian, R., & Busso, C. (2019). Curriculum learning for speech emotion recognition from crowd-
sourced labels. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(4),
815–826.
92. Liu, B., Xu, Z., Sun, C., Wang, B., Wang, X., Wong, D. F., & Zhang, M. (2018). Content-oriented
user modeling for personalized response ranking in chatbots. IEEE/ACM Transactions on Audio,
Speech, and Language Processing, 26(1), 122–133. https://​doi.​org/​10.​1109/​TASLP.​2017.​27632​43
93. Fukushima, K. (1988). Neocognitron: A hierarchical neural network capable of visual pattern recog-
nition. Neural Networks, 1, 119–130.
94. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document
recognition. Proceedings of the IEEE, 86, 2278–2324.
95. Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate
cortex. The Journal of Physiology., 195(1), 215–243.
96. Li, Z., Liu, F., Yang, W., Peng, S., & Zhou, J. (2021). A survey of convolutional neural networks:
Analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems.
https://​doi.​org/​10.​1109/​TNNLS.​2021.​30848​27
97. Hou, J., Wang, S., Lai, Y., Tsao, Y., Chang, H., & Wang, H. (2018). Audio-visual speech enhance-
ment using multimodal deep convolutional neural networks. IEEE Transactions on Emerging Topics
in Computational Intelligence, 2(2), 117–128.
98. Luo, Y., Chen, Z., & Mesgarani, N. (2018). Speaker-independent speech separation with deep attrac-
tor network. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(4), 787–796.
99. Tan, T., Qian, Y., Hu, H., Zhou, Y., Ding, W., & Yu, K. (2018). Adaptive very deep convolutional
residual network for noise robust speech recognition. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 26(8), 1393–1405.
100. Jati, A., & Georgiou, P. (2019). Neural predictive coding using convolutional neural networks toward
unsupervised learning of speaker characteristics. IEEE/ACM Transactions on Audio, Speech, and
Language Processing, 27(10), 1577–1589.
101. An, N. N., Thanh, N. Q., & Liu, Y. (2019). Deep CNNs with self-attention for speaker identification.
IEEE Access, 7, 85327–85337. https://​doi.​org/​10.​1109/​ACCESS.​2019.​29174​70
102. Nagrani, A., Chung, J. S., Xie, W., & Zisserman, A. (2020). Voxceleb: Large-scale speaker verifica-
tion in the wild. Computer Speech & Language, 60, 101027.
103. Zhang, S., Zhang, S., Huang, T., & Gao, W. (2018). Speech emotion recognition using deep convolu-
tional neural network and discriminant temporal pyramid matching. IEEE Transactions on Multime-
dia, 20(6), 1576–1590. https://​doi.​org/​10.​1109/​TMM.​2017.​27668​43
104. Zhao, J., Mao, X., & Chen, L. (2018). Learning deep features to recognise speech emotion using
merged deep CNN. IET Signal Processing, 12(6), 713–721. https://​doi.​org/​10.​1049/​iet-​spr.​2017.​0320
105. Hossain, M. S., & Muhammad, G. (2019). Emotion recognition using deep learning approach from
audio–visual emotional big data. Information Fusion, 49, 69–78.
106. Ocquaye, E. N. N., Mao, Q., Song, H., Xu, G., & Xue, Y. (2019). Dual exclusive attentive transfer for
unsupervised deep convolutional domain adaptation in speech emotion recognition. IEEE Access, 7,
93847–93857.
107. Tripathi, S., Kumar, A., Ramesh, A., Singh, C., & Yenigalla, P. (2019). Deep learning based emotion
recognition system using speech features and transcriptions. arXiv preprint arXiv:​1906.​05681.
108. Dinkel, H., Qian, Y., & Yu, K. (2018). Investigating raw wave deep neural networks for end-to-end
speaker spoofing detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(11), 2002–2014.


109. DiPietro, R., & Hager, G. D. (2020). Deep learning: RNNs and LSTM. In Handbook of medical
image computing and computer assisted intervention (pp. 503–519). Elsevier. https://​doi.​org/​10.​1016/​
B978-0-​12-​816176-​0.​00026-0.
110. Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv preprint arXiv:​1412.​3555.
111. Graves, A., Mohamed, A., & Hinton, G. (2013). Speech recognition with deep recurrent neural net-
works. In 2013 IEEE international conference on acoustics, speech and signal processing, Vancou-
ver, BC, pp. 6645–6649. https://​doi.​org/​10.​1109/​ICASSP.​2013.​66389​47.
112. Qin, C.-X., Dan, Qu., & Zhang, L.-H. (2018). Towards end-to-end speech recognition with transfer
learning. EURASIP Journal on Audio, Speech, and Music Processing, 2018(1), 1–9.
113. de Benito-Gorron, D., Lozano-Diez, A., Toledano, D. T., & Gonzalez-Rodriguez, J. (2019). Explor-
ing convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a
large audio dataset. EURASIP Journal on Audio, Speech, and Music Processing, 2019(1), 9.
114. Kang, J., Zhang, W.-Q., Liu, W.-W., Liu, J., & Johnson, M. T. (2018). Advanced recurrent network-
based hybrid acoustic models for low resource speech recognition. EURASIP Journal on Audio,
Speech, and Music Processing, 2018(1), 6.
115. Tang, Z., Wang, D., Chen, Y., Li, L., & Abel, A. (2018). Phonetic temporal neural model for language
identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(1), 134–144.
116. Han, K., & Wang, D. (2014). Neural network based pitch tracking in very noisy speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12), 2158–2168. https://doi.org/10.1109/TASLP.2014.2363410
117. Tan, K., & Wang, D. (2018). A convolutional recurrent neural network for real-time speech enhancement. In Interspeech, pp. 3229–3233.
118. Li, A., Yuan, M., Zheng, C., & Li, X. (2020). Speech enhancement using progressive learning-based
convolutional recurrent neural network. Applied Acoustics, 166, 107347.
119. Vafeiadis, A., Fanioudakis, E., Potamitis, I., Votis, K., Giakoumis, D., Tzovaras, D., Chen, L., &
Hamzaoui, R. (2019). Two-dimensional convolutional recurrent neural networks for speech activity
detection. In International Speech Communication Association, pp. 2045–2049.
120. Shen, Z., Yong, B., Zhang, G., Zhou, R., & Zhou, Q. (2019). A deep learning method for Chinese singer identification. Tsinghua Science and Technology, 24(4), 371–378. https://doi.org/10.26599/TST.2018.9010121
121. Wu, Y., & Li, W. (2019). Automatic audio chord recognition with MIDI-trained deep feature and
BLSTM-CRF sequence decoding model. IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 27(2), 355–366.
122. Zhao, J., Mao, X., & Chen, L. (2019). Speech emotion recognition using deep 1D & 2D CNN LSTM
networks. Biomedical Signal Processing and Control, 47, 312–323.
123. Yu, Y., Si, X., Hu, C., & Zhang, J. (2019). A review of recurrent neural networks: LSTM cells and network architectures. Neural Computation, 31(7), 1235–1270.
124. Goehring, T., Keshavarzi, M., Carlyon, R. P., & Moore, B. C. J. (2019). Using recurrent neural networks to improve the perception of speech in non-stationary noise by people with cochlear implants. The Journal of the Acoustical Society of America, 146(1), 705–718.
125. Sutton, R. S., Barto, A. G., & Williams, R. J. (1992). Reinforcement learning is direct adaptive optimal control. IEEE Control Systems, 12(2), 19–22.
126. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. In NIPS deep learning workshop.
127. Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th international conference on neural information processing systems, NIPS'99, pp. 1057–1063.
128. Weisz, G., Budzianowski, P., Su, P., & Gašić, M. (2018). Sample efficient deep reinforcement learning for dialogue systems with large action spaces. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(11), 2083–2097. https://doi.org/10.1109/TASLP.2018.2851664
129. Chen, L., Chang, C., Chen, Z., Tan, B., Gašić, M., & Yu, K. (2018). Policy adaptation for deep reinforcement learning-based dialogue management. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, pp. 6074–6078. https://doi.org/10.1109/ICASSP.2018.8462272
130. Chen, L., Chen, Z., Tan, B., Long, S., Gašić, M., & Yu, K. (2019). AgentGraph: Toward universal dialogue management with structured deep reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(9), 1378–1391. https://doi.org/10.1109/TASLP.2019.2919872


131. Shen, Y. L., Huang, C. Y., Wang, S. S., Tsao, Y., Wang, H. M., & Chi, T. S. (2019). Reinforcement learning based speech enhancement for robust speech recognition. In ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 6750–6754. IEEE.
132. Rajapakshe, T., Rana, R., Latif, S., Khalifa, S., & Schuller, B. W. (2019). Pre-training in deep reinforcement learning for automatic speech recognition. arXiv preprint arXiv:1910.11256.
133. Kala, T., & Shinozaki, T. (2018). Reinforcement learning of speech recognition system based on policy gradient and hypothesis selection. In 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP), Calgary, AB, pp. 5759–5763. https://doi.org/10.1109/ICASSP.2018.8462656
134. Lee, H., Chung, P., Wu, Y., Lin, T., & Wen, T. (2018). Interactive spoken content retrieval by deep
reinforcement learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
26(12), 2447–2459.
135. Bui, H., & Chong, N. Y. (2019). Autonomous speech volume control for social robots in a noisy environment using deep reinforcement learning. In 2019 IEEE international conference on robotics and biomimetics (ROBIO), Dali, China, pp. 1263–1268. https://doi.org/10.1109/ROBIO49542.2019.8961810
136. Su, M., Wu, C., & Chen, L. (2020). Attention-based response generation using parallel double Q-learning for dialog policy decision in a conversational system. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 131–143. https://doi.org/10.1109/TASLP.2019.2949687
137. Zue, V., Seneff, S., & Glass, J. (1990). Speech database development at MIT: TIMIT and beyond.
Speech Communication, 9(4), 351–356.
138. Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: An asr corpus based
on public domain audio books. In 2015 IEEE international conference on acoustics, speech and
signal processing (ICASSP), pp. 5206–5210. IEEE.
139. Nagrani, A., Chung, J. S., & Zisserman, A. (2017). Voxceleb: A large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612.
140. Pearce, D., & Picone, J. (2002). Aurora working group: DSR front end LVCSR evaluation AU/384/02.
In Institute for signal & information processing, Mississippi State University, Technical Report.
141. Sinha, R., Gales, M. J., Kim, D. Y., Liu, X. A., Sim, K. C., & Woodland, P. C. (2006). The CU-HTK Mandarin broadcast news transcription system. In Proceedings of ICASSP 2006, May 2006, pp. 1077–1080.
142. Barker, J., Watanabe, S., Vincent, E., & Trmal, J. (2018). The fifth 'CHiME' speech separation and recognition challenge: Dataset, task and baselines. arXiv preprint arXiv:1803.10609.
143. Kinoshita, K., Delcroix, M., Gannot, S., Habets, E., Haeb-Umbach, R., Kellermann, W., Leutnant, V., Maas, R., Nakatani, T., Raj, B., Sehr, A., & Yoshioka, T. (2016). A summary of the REVERB challenge: State-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing. https://doi.org/10.1186/s13634-016-0306-6
144. Godfrey, J. J., Holliman, E. C., & McDaniel, J. (1992). SWITCHBOARD: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE international conference on acoustics, speech, and signal processing, San Francisco, CA, USA, vol. 1, pp. 517–520. https://doi.org/10.1109/ICASSP.1992.225858
145. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. F., & Weiss, B. (2005). A database of
German emotional speech. In Proceedings of Interspeech.
146. Busso, C., Bulut, M., Lee, C. C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J. N., Lee, S., & Narayanan, S. S. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. Journal of Language Resources and Evaluation, 42(4), 335–359.
147. Lotfian, R., & Busso, C. (2019). Building naturalistic emotionally balanced speech corpus by
retrieving emotional speech from existing podcast recordings. IEEE Transactions on Affective
Computing, 10(4), 471–483.
148. Black, D. (2014). Singing voice dataset.
149. Goto, M., Hashiguchi, H., Nishimura, T., & Oka, R. (2002). RWC music database: Popular, classical, and jazz music databases. In Proceedings of the 3rd international conference on music information retrieval (ISMIR 2002), pp. 287–288.
150. Hsu, C., & Jang, J. R. (2010). On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, and Language Processing, 18(2), 310–319. https://doi.org/10.1109/TASL.2009.2026503
151. Varga, A., & Steeneken, H. J. M. (1993). Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3), 247–251.


152. Jensen, J., & Taal, C. H. (2016). An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(11), 2009–2022.
153. Vincent, E., Gribonval, R., & Fevotte, C. (2006). Performance measurement in blind audio source
separation. IEEE Transactions on Audio, Speech and Language Processing, 14(4), 1462–1469.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.

Kishor Barasu Bhangale obtained his BE degree from North Maharashtra University, Jalgaon, and received his Master's degree in VLSI & Embedded Systems from Pune University. He is pursuing a Ph.D. at VIT University, Chennai, and is working as an Assistant Professor at Pimpri Chinchwad College of Engineering and Research, Ravet, Pune. His research interests are in signal processing, machine learning, and deep learning.

Mohanaprasad Kothandaraman was born in 1981. He completed his post-doctoral research fellowship at UTAR, Malaysia, and received his Ph.D. in the field of speech signal processing from VIT University, Vellore, India, in 2016. He received his Master of Engineering from Anna University, Chennai, in 2006 and his Bachelor of Engineering from the University of Madras, Chennai, and has 25 international journal and conference publications. He is currently working as an Associate Professor in the School of Electronics Engineering, VIT University, Chennai. His research interests include signal processing, speech processing, and the wavelet transform.

