A Convolutional Recurrent Neural Network For Real-Time Speech Enhancement
Abstract

Many real-world applications of speech enhancement, such as hearing aids and cochlear implants, desire real-time processing with no or low latency. In this paper, we propose a novel convolutional recurrent network (CRN) to address real-time monaural speech enhancement. We incorporate a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing. Moreover, the proposed model is noise- and speaker-independent, i.e. noise types and speakers can differ between training and test. Our experiments suggest that the CRN yields consistently better objective intelligibility and perceptual quality than an existing LSTM based model, while having far fewer trainable parameters.

Index Terms: noise- and speaker-independent speech enhancement, real-time applications, convolutional encoder-decoder, long short-term memory, convolutional recurrent networks

(This research was supported in part by an NIDCD grant (R01 DC012048) and the Ohio Supercomputer Center.)

1. Introduction

Speech separation aims to separate target speech from background interference, which may include nonspeech noise, interfering speech and room reverberation [1]. Speech enhancement refers to the separation of speech and nonspeech noise. It has various real-world applications such as robust automatic speech recognition and mobile speech communication. Many such applications require real-time processing; in other words, speech enhancement must be performed with low computational complexity and provide near-instantaneous output.

In this study, we focus on monaural (single-microphone) speech enhancement that can operate in real-time applications. In digital hearing aids, for example, it has been found that a delay as low as 3 milliseconds is noticeable to listeners and a delay longer than 10 milliseconds is objectionable [2]. Such applications often require causal speech enhancement systems, in which no future information is allowed.

Inspired by the concept of time-frequency (T-F) masking in computational auditory scene analysis (CASA) [3], speech separation has been formulated as supervised learning in recent years, where a deep neural network (DNN) is employed to learn a mapping from noisy acoustic features to a T-F mask [4]. The ideal binary mask, which classifies T-F units as either speech-dominant or noise-dominant, was the first training target used in supervised speech separation. More recent training targets include the ideal ratio mask [5] and mapping-based targets corresponding to the magnitude or power spectra of target speech [6] [7]. In this study, we use the magnitude spectra of target speech as the training target.

For supervised speech enhancement, noise generalization and speaker generalization are both crucial. A simple yet effective way to deal with noise generalization is to train with many different noise types [8]. Analogously, an obvious way to address speaker generalization is to include a large number of speakers in the training set. However, it has been found that a feedforward DNN is unable to track a target speaker in the presence of many training speakers [9] [10] [11]. Typically, a DNN independently predicts a label for each time frame from a small context window around the frame. An interpretation is that such DNNs cannot leverage long-term contexts, which would be essential for tracking a target speaker. Recent studies [9] [10] suggest that it is better to formulate speech separation as a sequence-to-sequence mapping in order to leverage long-term contexts.

With such a formulation, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used for noise- and speaker-independent speech enhancement, where noise types and speakers can differ between training and test. Chen et al. [10] proposed an RNN with four hidden LSTM layers to deal with speaker generalization of noise-independent models. Their experimental results show that the LSTM model generalizes well to untrained speakers, and substantially outperforms a DNN based model in terms of short-time objective intelligibility (STOI) [12]. A more recent study [13] developed a gated residual network (GRN) based on dilated convolutions. Compared with the LSTM model in [10], the GRN exhibits higher parameter efficiency and better generalization to untrained speakers at different SNR levels. On the other hand, the GRN requires a large amount of future information for mask estimation or spectral mapping at each time frame. Hence, it cannot be used for real-time speech enhancement.

Motivated by recent works [14] [15] on CRNs, we develop a novel CRN architecture for noise- and speaker-independent speech enhancement in real time. The CRN incorporates a convolutional encoder-decoder and long short-term memory. We find that the proposed CRN leads to consistently better objective speech intelligibility and quality than the LSTM model in [10], while having far fewer trainable parameters.

The rest of this paper is organized as follows. We give a detailed description of the proposed model in Section 2. The experimental setup and results are presented in Section 3. We conclude the paper in Section 4.

2. System description

2.1. Encoder-decoder with causal convolutions

Badrinarayanan et al. first proposed a convolutional encoder-decoder network for pixel-wise image labelling [16]. […]
[Figure 1: illustration of causal convolutions along the time axis, with an output layer stacked on hidden layers.]

Note that the input can be treated as a sequence of feature vectors, while only the time dimension is illustrated in Fig. 1. In causal convolutions, the output does not depend on future inputs. With causal convolutions instead of noncausal convolutions, the encoder-decoder architecture leads to a causal system. Note that we can easily apply causal deconvolutions in the decoder, since a deconvolution is intrinsically a convolution operation.
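The paper describes causal convolutions conceptually; as an illustration only, the following minimal sketch (in PyTorch, a framework of our choosing that the paper does not name) makes a 2-D convolution over a (batch, channel, time, frequency) tensor causal along time by zero-padding only on the past side, so the output at frame t never depends on later frames. The class name, kernel size and strides are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv2d(nn.Module):
        """2-D convolution over a (batch, channel, time, freq) tensor that is
        causal along the time axis: the output at frame t depends only on
        frames <= t. Hypothetical helper, not taken from the paper."""
        def __init__(self, in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2)):
            super().__init__()
            self.time_kernel = kernel_size[0]
            # Pad the frequency axis symmetrically; handle the time axis manually.
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                                  stride=stride, padding=(0, 1))

        def forward(self, x):
            # F.pad takes (freq_left, freq_right, time_left, time_right):
            # pad (time_kernel - 1) zero frames on the past side only.
            x = F.pad(x, (0, 0, self.time_kernel - 1, 0))
            return self.conv(x)

    # Example: 1 input channel, 100 time frames, 161 frequency bins.
    x = torch.randn(4, 1, 100, 161)
    y = CausalConv2d(1, 16)(x)
    print(y.shape)  # time length preserved; frequency roughly halved by stride 2

Because the asymmetric padding supplies all the past context a frame needs, the layer can be run frame by frame at test time, which is what makes the encoder-decoder usable in a real-time system.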
[Figure 2: the CRN architecture. The encoder comprises five Conv blocks, each convolution followed by batch normalization (BN) and an ELU activation; two LSTM layers follow the encoder; the decoder comprises five Deconv blocks, each followed by BN and ELU, with a Softplus activation at the output. The time and frequency axes are denoted t and f.]

[…] is encoded into a higher-dimensional latent space […]. The proposed CRN benefits from the feature extraction capability of CNNs and the temporal modeling capability of RNNs by combining the two topologies. A more detailed description of the proposed network architecture […]

[Figure: baseline models built from stacked LSTM layers with 1024 units each (caption not recoverable).]
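Figure 2 specifies only the topology (five Conv-BN-ELU encoder blocks, two LSTM layers, five Deconv-BN-ELU decoder blocks with a Softplus output); the exact channel counts, kernel sizes and LSTM width are not recoverable here. The sketch below is therefore a structural illustration under assumed sizes, again in PyTorch, using frequency-only strides and 1-frame time kernels so that causality holds trivially; the actual model would instead use causal time kernels as sketched in Section 2.1.

    import torch
    import torch.nn as nn

    class CRN(nn.Module):
        """Structural sketch of the CRN: a convolutional encoder, two LSTM
        layers in the latent space, and a deconvolutional decoder. Channel
        counts, kernel sizes, LSTM width and the 161-bin input are
        illustrative assumptions, not the paper's exact configuration."""
        def __init__(self, freq_bins=161):
            super().__init__()
            chans = [1, 16, 32, 64, 128, 256]
            # Encoder: five Conv-BN-ELU blocks, downsampling along frequency only.
            self.encoder = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(chans[i], chans[i + 1], kernel_size=(1, 3),
                              stride=(1, 2), padding=(0, 1)),
                    nn.BatchNorm2d(chans[i + 1]),
                    nn.ELU())
                for i in range(5)])
            # After five stride-2 layers, 161 frequency bins reduce to 6.
            latent = chans[-1] * 6
            # Two LSTM layers model temporal dependencies in the latent space.
            self.rnn = nn.LSTM(latent, latent, num_layers=2, batch_first=True)
            # Decoder: five Deconv-BN-ELU blocks; Softplus keeps the output
            # (an estimated magnitude spectrum) non-negative.
            self.decoder = nn.ModuleList([
                nn.Sequential(
                    nn.ConvTranspose2d(chans[5 - i], chans[4 - i],
                                       kernel_size=(1, 3), stride=(1, 2),
                                       padding=(0, 1)),
                    nn.BatchNorm2d(chans[4 - i]),
                    nn.ELU() if i < 4 else nn.Softplus())
                for i in range(5)])

        def forward(self, x):               # x: (batch, 1, time, freq)
            for layer in self.encoder:
                x = layer(x)
            b, c, t, f = x.shape            # flatten channels x freq per frame
            x, _ = self.rnn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
            x = x.reshape(b, t, c, f).permute(0, 2, 1, 3)
            for layer in self.decoder:
                x = layer(x)
            return x                        # estimated magnitude spectrogram

    est = CRN()(torch.randn(2, 1, 100, 161))
    print(est.shape)  # (2, 1, 100, 161)

The downsampling encoder and upsampling decoder keep the per-frame feature vector fed to the LSTM compact, which is one way the CRN can use far fewer parameters than a purely recurrent model of comparable depth.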
Table 2: Model comparisons in terms of STOI and PESQ scores on trained speakers.
Table 3: Model comparisons in terms of STOI and PESQ scores on untrained speakers.
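The STOI [12] and PESQ [24] scores reported in Tables 2 and 3 (the table contents are not reproduced here) are obtained by comparing each enhanced utterance with its clean reference. As a hedged illustration, the sketch below assumes the third-party pystoi and pesq Python packages, which the paper does not use or mention; the wide-band PESQ mode and 16 kHz sampling rate are likewise assumptions.

    from pystoi import stoi   # assumed third-party package; not named in the paper
    from pesq import pesq     # assumed third-party package; not named in the paper

    def evaluate(clean, enhanced, fs=16000):
        """Return (STOI, PESQ) for one pair of 1-D waveforms sampled at fs Hz.
        STOI lies in [0, 1]; the paper reports it as a percentage."""
        s = stoi(clean, enhanced, fs, extended=False)
        p = pesq(fs, clean, enhanced, 'wb')   # 'wb' (wide-band) mode is an assumption
        return s, p

    # Usage on real test data (file names below are placeholders):
    #   clean, fs = soundfile.read('clean_utterance.wav')
    #   enhanced, _ = soundfile.read('enhanced_utterance.wav')
    #   print(evaluate(clean, enhanced, fs))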
Figure 5: Parameter efficiency comparison of different models. We compare the number of trainable parameters in different models.
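The parameter counts compared in Figure 5 can be reproduced for any model by summing the sizes of its trainable tensors; a minimal sketch, again assuming PyTorch:

    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        """Number of trainable parameters in a model."""
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Example: a single 1024-unit LSTM layer over 161-dimensional input frames,
    # roughly one layer of the LSTM baselines sketched in the figures above.
    print(count_parameters(nn.LSTM(161, 1024)))  # about 4.9 million parameters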
Figure 4: Mean square errors over training epochs for LSTM-1, LSTM-2 and CRN on the training set and the test set. All models are evaluated with a test set of six untrained speakers on the untrained babble noise.

[…] signals for trained speakers and untrained speakers, respectively. In each case, the best result is highlighted by a boldface number. As shown in Tables 2 and 3, LSTM-1 and LSTM-2 yield similar STOI and PESQ scores for both trained and untrained speakers, which implies that the use of the feature window in LSTM-1 does not improve performance. On the other hand, our proposed CRN consistently outperforms the LSTM baselines in both metrics. At the SNR of -5 dB, for example, the CRN provides about 2% STOI improvement and about 0.1 PESQ improvement over the LSTM models. Comparing the results in Table 2 with those in Table 3, we find that the CRN generalizes well to untrained speakers. In the most challenging case, where the utterances from untrained speakers are mixed with the two untrained noises at -5 dB, the CRN produces an 18.56% STOI improvement and a 0.55 PESQ improvement over the unprocessed mixtures.

The CRN takes advantage of batch normalization, which can be easily applied to convolution operations to accelerate training and improve performance. Fig. 4 compares training and test MSEs of different models over training epochs, where the models are evaluated on a test set of six untrained speakers. We observe that the CRN converges faster and achieves lower MSEs than the two LSTM models. Moreover, the CRN has fewer trainable parameters than the LSTM models, as shown in Fig. 5. This is mainly due to the use of shared weights in convolutions. With its higher parameter efficiency, the CRN is easier to train than the LSTMs.

In addition, the causal convolutions in the CRN capture local spatial patterns in the input STFT magnitude spectrum without using future information. In contrast, the LSTM models treat each input frame as a flattened feature vector and cannot sufficiently leverage the T-F structure of the STFT magnitude spectrum. On the other hand, the LSTM layers in the CRN model the temporal dependencies in a latent space, which would be important for speaker characterization in speaker-independent speech enhancement.

4. Conclusions

In this study, we have proposed a convolutional recurrent network to deal with noise- and speaker-independent speech enhancement for real-time applications. The proposed model leads to a causal speech enhancement system, in which no future information is utilized. The evaluation results suggest that the proposed CRN consistently outperforms two strong LSTM baselines for both trained and untrained speakers in terms of STOI and PESQ scores. In addition, we find that the CRN has fewer trainable parameters than the LSTMs. We believe the proposed model represents a strong speech enhancement method for real-world applications, whose desirable properties often include online operation, single-channel operation, and noise- and speaker-independent models.
5. References

[1] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: an overview," arXiv preprint arXiv:1708.07524, 2017.
[2] J. Agnew and J. M. Thornton, "Just noticeable and objectionable group delays in digital hearing aids," Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330-336, 2000.
[3] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
[4] Y. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381-1390, 2013.
[5] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 12, pp. 1849-1858, 2014.
[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7-19, 2015.
[8] J. Chen, Y. Wang, S. E. Yoho, D. L. Wang, and E. W. Healy, "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604-2612, 2016.
[9] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," in Proceedings of Interspeech, 2016, pp. 3314-3318.
[10] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705-4714, 2017.
[11] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153-167, 2017.
[12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[13] K. Tan, J. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for supervised speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, to appear.
[14] Z. Zhang, Z. Sun, J. Liu, J. Chen, Z. Huo, and X. Zhang, "Deep recurrent convolutional neural network: Improving performance for speech recognition," arXiv preprint arXiv:1611.07174, 2016.
[15] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, "Low latency sound source separation using convolutional recurrent neural networks," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 71-75.
[16] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," arXiv preprint arXiv:1505.07293, 2015.
[17] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.
[19] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[20] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[22] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. IEEE, 2001, pp. 749-752.