
Interspeech 2018
2-6 September 2018, Hyderabad
DOI: 10.21437/Interspeech.2018-1405

A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement

Ke Tan¹, DeLiang Wang¹,²

¹Department of Computer Science and Engineering, The Ohio State University, USA
²Center for Cognitive and Brain Sciences, The Ohio State University, USA
[email protected], [email protected]

Abstract

Many real-world applications of speech enhancement, such as hearing aids and cochlear implants, desire real-time processing with no or low latency. In this paper, we propose a novel convolutional recurrent network (CRN) to address real-time monaural speech enhancement. We incorporate a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing. Moreover, the proposed model is noise- and speaker-independent, i.e. noise types and speakers can be different between training and test. Our experiments suggest that the CRN leads to consistently better objective intelligibility and perceptual quality than an existing LSTM based model, while using far fewer trainable parameters.

Index Terms: noise- and speaker-independent speech enhancement, real-time applications, convolutional encoder-decoder, long short-term memory, convolutional recurrent networks

1. Introduction

Speech separation aims to separate target speech from background interference, which may include nonspeech noise, interfering speech and room reverberation [1]. Speech enhancement refers to the separation of speech and nonspeech noise. It has various real-world applications such as robust automatic speech recognition and mobile speech communication. For many such applications, real-time processing is required; in other words, speech enhancement must run with low computational complexity and provide near-instantaneous output.

In this study, we focus on monaural (single-microphone) speech enhancement that can operate in real-time applications. In digital hearing aids, for example, it has been found that a delay as low as 3 milliseconds is noticeable to listeners and a delay longer than 10 milliseconds is objectionable [2]. For such applications, causal speech enhancement systems, in which no future information is allowed, are often required.

Inspired by the concept of time-frequency (T-F) masking in computational auditory scene analysis (CASA) [3], speech separation has been formulated as supervised learning in recent years, where a deep neural network (DNN) is employed to learn a mapping from noisy acoustic features to a T-F mask [4]. The ideal binary mask, which classifies T-F units as either speech-dominant or noise-dominant, is the first training target used in supervised speech separation. More recent training targets include the ideal ratio mask [5] and mapping-based targets corresponding to the magnitude or power spectra of target speech [6] [7]. In this study, we use the magnitude spectra of target speech as the training target.

For supervised speech enhancement, noise generalization and speaker generalization are both crucial. A simple yet effective way to deal with noise generalization is to train with multiple noise types [8]. Analogously, one way to address speaker generalization is to include a large number of speakers in the training set. However, it has been found that a feedforward DNN is unable to track a target speaker in the presence of many training speakers [9] [10] [11]. Typically, a DNN independently predicts a label for each time frame from a small context window around the frame. One interpretation is that such DNNs cannot leverage long-term contexts, which are essential for tracking a target speaker. Recent studies [9] [10] suggest that it is better to formulate speech separation as a sequence-to-sequence mapping in order to leverage long-term contexts.

With such a formulation, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used for noise- and speaker-independent speech enhancement, where noise types and speakers can be different between training and test. Chen et al. [10] proposed an RNN with four hidden LSTM layers to deal with speaker generalization of noise-independent models. Their experimental results show that the LSTM model generalizes well to untrained speakers, and substantially outperforms a DNN based model in terms of short-time objective intelligibility (STOI) [12]. A more recent study [13] developed a gated residual network (GRN) based on dilated convolutions. Compared with the LSTM model in [10], the GRN exhibits higher parameter efficiency and better generalization capability for untrained speakers at different SNR levels. On the other hand, the GRN requires a large amount of future information for mask estimation or spectral mapping at each time frame; hence it cannot be used for real-time speech enhancement.

Motivated by recent works [14] [15] on CRNs, we develop a novel CRN architecture for noise- and speaker-independent speech enhancement in real time. The CRN incorporates a convolutional encoder-decoder and long short-term memory. We find that the proposed CRN leads to consistently better objective speech intelligibility and quality than the LSTM model in [10], while having far fewer trainable parameters.

The rest of this paper is organized as follows. We give a detailed description of our proposed model in Section 2. The experimental setup and results are presented in Section 3. We conclude this paper in Section 4.

(This research was supported in part by an NIDCD grant (R01 DC012048) and the Ohio Supercomputer Center.)

2. System description

2.1. Encoder-decoder with causal convolutions
Badrinarayanan et al. first proposed a convolutional encoder-decoder network for pixel-wise image labelling [16]. It comprises a convolutional encoder followed by a corresponding decoder, which feeds into a softmax classification layer. The encoder is a stack of convolutional and pooling layers that serves to extract high-level features from a raw input image. With essentially the same structure as the encoder in reverse order, the decoder maps the low-resolution feature maps at the output of the encoder to feature maps of the full input image size. The symmetric encoder-decoder architecture ensures that the output has the same shape as the input. With such an attractive property, the encoder-decoder architecture is naturally suitable for any pixel-wise dense prediction task, which aims to predict a label for each pixel of the input image.

For speech enhancement, one approach is to employ a CED to map from the magnitude spectrogram of noisy speech to that of clean speech, where the magnitude spectrograms are simply treated as images. To our knowledge, Park et al. [17] first introduced the CED for speech enhancement. They proposed a redundant CED network (R-CED), which consists of repetitions of a convolution, batch normalization (BN) [18], and a ReLU activation [19] layer. The R-CED architecture additionally incorporates skip connections, which connect each layer in the encoder to its corresponding layer in the decoder, to facilitate optimization.

In our proposed network, the encoder comprises five convolutional layers and the decoder comprises five deconvolutional layers. We apply exponential linear units (ELUs) [20] to all convolutional and deconvolutional layers except the output layer; ELUs have been demonstrated to lead to faster convergence and better generalization than ReLUs. In the output layer, we utilize the softplus activation [19], which is a smooth approximation to the ReLU function and constrains the network output to be positive. Moreover, we adopt batch normalization right after each convolution (or deconvolution) and before activation. The numbers of kernels are kept symmetric: the number of kernels is gradually increased in the encoder and gradually decreased in the decoder. To leverage a larger context along the frequency direction, we apply a stride of 2 along the frequency dimension in all convolutional (and deconvolutional) layers. In other words, the frequency dimension of the feature maps is halved layer by layer in the encoder and doubled layer by layer in the decoder, whereas the time dimension is left unchanged. To improve the flow of information and gradients throughout the network, we utilize skip connections that concatenate the output of each encoder layer to the input of the corresponding decoder layer.

To obtain a causal system for real-time speech enhancement, we impose causal convolutions upon the encoder-decoder architecture. Fig. 1 depicts an example of causal convolutions. Note that the input can be treated as a sequence of feature vectors, although only the time dimension is illustrated in Fig. 1. In causal convolutions, the output does not depend on future inputs, so the encoder-decoder architecture with causal convolutions (instead of noncausal convolutions) leads to a causal system. Note that we can easily apply causal deconvolutions in the decoder, since a deconvolution is intrinsically a convolution operation.

Figure 1: An example of causal convolutions, showing an input layer, three hidden layers and an output layer unrolled over time. The convolution output does not depend on future inputs.
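To make the causal padding concrete, here is a minimal sketch of one encoder block. It assumes PyTorch (the paper does not name a framework) and the layer settings described above: a 2 × 3 (time × frequency) kernel, a stride of 2 along frequency only, and a single zero frame padded on the past side of the time axis, so that the output at frame t depends only on frames t and t-1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    """One encoder block of the CRN (a sketch): causal Conv2d -> BN -> ELU."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # 2 x 3 kernel (time x frequency), stride 2 along frequency only.
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=(2, 3), stride=(1, 2))
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x):
        # x: (batch, channels, time, frequency)
        # Zero-pad one frame on the past side of the time axis only; no
        # frequency padding.  F.pad order: (freq_left, freq_right,
        # time_past, time_future).
        x = F.pad(x, (0, 0, 1, 0))
        return F.elu(self.bn(self.conv(x)))

# Example: the first encoder layer of Table 1 (1 -> 16 feature maps).
block = CausalConvBlock(1, 16)
y = block(torch.randn(4, 1, 100, 161))   # -> torch.Size([4, 16, 100, 80])
```

The asymmetric (0, 0, 1, 0) padding is what makes the block causal: a symmetric "same" padding would split the time padding across past and future, and each output frame would then leak one future frame.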
2.2. Temporal modeling via LSTM

In order to track a target speaker, it may be important to leverage long-term contexts, which cannot be utilized by the aforementioned convolutional encoder-decoder. The LSTM [21], a specific type of RNN that incorporates a memory cell, has been successful at temporal modeling in various applications such as acoustic modeling and video classification. To account for the temporal dynamics of speech, we insert two stacked LSTM layers between the encoder and the decoder. In this study, we use the LSTM defined by the following equations:

    i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)      (1)
    f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)      (2)
    g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)   (3)
    o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)      (4)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t                     (5)
    h_t = o_t ⊙ tanh(c_t)                               (6)

where x_t, g_t, c_t and h_t represent the input, block input, memory cell and hidden activation at time t, respectively. The W's and b's denote weights and biases, respectively, σ denotes the sigmoid nonlinearity, and ⊙ denotes element-wise multiplication.

To fit the input shape required by the LSTM, we flatten the frequency dimension and the depth dimension of the encoder output to produce a sequence of feature vectors before feeding it into the LSTM layers. The output sequence of the LSTM layers is subsequently reshaped back to fit the decoder. It is worth noting that the inclusion of the LSTM layers does not change the causality of the system.
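As a sanity check on Eqs. (1)-(6), the following sketch implements a single LSTM step with explicit parameters. The dictionary keys (W_ii, b_hi, ...) simply mirror the symbols in the equations; they are illustrative names, not identifiers from the authors' code.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (1)-(6).

    p maps parameter names to tensors, e.g. p["W_ii"] has shape
    (hidden, input) and p["b_ii"] has shape (hidden,).
    """
    sig, tanh = torch.sigmoid, torch.tanh
    i_t = sig(p["W_ii"] @ x_t + p["b_ii"] + p["W_hi"] @ h_prev + p["b_hi"])   # (1)
    f_t = sig(p["W_if"] @ x_t + p["b_if"] + p["W_hf"] @ h_prev + p["b_hf"])   # (2)
    g_t = tanh(p["W_ig"] @ x_t + p["b_ig"] + p["W_hg"] @ h_prev + p["b_hg"])  # (3)
    o_t = sig(p["W_io"] @ x_t + p["b_io"] + p["W_ho"] @ h_prev + p["b_ho"])   # (4)
    c_t = f_t * c_prev + i_t * g_t                                            # (5)
    h_t = o_t * tanh(c_t)                                                     # (6)
    return h_t, c_t
```

In practice, the two stacked 1024-unit layers between the encoder and the decoder can be expressed as torch.nn.LSTM(input_size=1024, hidden_size=1024, num_layers=2, batch_first=True), which uses the same gate formulation.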
2.3. Network architecture

In this study, we use the 161-dimensional short-time Fourier transform (STFT) magnitude spectrum of noisy speech as the input features, and that of clean speech as the training target. Our proposed CRN is shown in Fig. 2, in which the network input is encoded into a higher-dimensional latent space, and the sequence of latent feature vectors is then modeled by two LSTM layers. Subsequently, the output sequence of the LSTM layers is converted back to the original input shape by the decoder. By combining the two topologies, the proposed CRN benefits from both the feature extraction capability of CNNs and the temporal modeling capability of RNNs.

Figure 2: Network architecture of our proposed CRN. The encoder consists of five Conv-BN-ELU blocks, followed by two LSTM layers; the decoder consists of five Deconv-BN blocks with ELU activations, except for a softplus activation after the last deconvolution.

A more detailed description of our proposed network architecture is provided in Table 1. The input size and output size of each layer are specified in featureMaps × timeSteps × frequencyChannels format, and the layer hyperparameters are given in (kernelSize, strides, outChannels) format. For all convolutions and deconvolutions, we apply zero-padding along the time direction but not along the frequency direction. To perform causal convolutions, we use a kernel size of 2 × 3 (time × frequency). Note that the number of feature maps in each decoder layer is doubled by the skip connections.

Table 1: Architecture of our proposed CRN. Here T denotes the number of time frames in the STFT magnitude spectrum.

layer name  | input size   | hyperparameters    | output size
reshape_1   | T × 161      | -                  | 1 × T × 161
conv2d_1    | 1 × T × 161  | 2 × 3, (1, 2), 16  | 16 × T × 80
conv2d_2    | 16 × T × 80  | 2 × 3, (1, 2), 32  | 32 × T × 39
conv2d_3    | 32 × T × 39  | 2 × 3, (1, 2), 64  | 64 × T × 19
conv2d_4    | 64 × T × 19  | 2 × 3, (1, 2), 128 | 128 × T × 9
conv2d_5    | 128 × T × 9  | 2 × 3, (1, 2), 256 | 256 × T × 4
reshape_2   | 256 × T × 4  | -                  | T × 1024
lstm_1      | T × 1024     | 1024               | T × 1024
lstm_2      | T × 1024     | 1024               | T × 1024
reshape_3   | T × 1024     | -                  | 256 × T × 4
deconv2d_5  | 512 × T × 4  | 2 × 3, (1, 2), 128 | 128 × T × 9
deconv2d_4  | 256 × T × 9  | 2 × 3, (1, 2), 64  | 64 × T × 19
deconv2d_3  | 128 × T × 19 | 2 × 3, (1, 2), 32  | 32 × T × 39
deconv2d_2  | 64 × T × 39  | 2 × 3, (1, 2), 16  | 16 × T × 80
deconv2d_1  | 32 × T × 80  | 2 × 3, (1, 2), 1   | 1 × T × 161
reshape_4   | 1 × T × 161  | -                  | T × 161
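Putting Table 1 together, the whole CRN can be sketched as follows. This is our reading of the paper rather than released code: it assumes PyTorch, applies the causal time padding and skip concatenations described above, trims the extra (future) output frame of each deconvolution to keep the decoder causal, and picks the one output_padding value needed to recover the 39 → 80 frequency size in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRN(nn.Module):
    """Sketch of the CRN in Table 1 (input and output: T x 161 magnitudes)."""

    def __init__(self):
        super().__init__()
        enc_ch = [1, 16, 32, 64, 128, 256]
        self.enc_convs = nn.ModuleList(
            [nn.Conv2d(enc_ch[i], enc_ch[i + 1], (2, 3), stride=(1, 2))
             for i in range(5)])
        self.enc_bns = nn.ModuleList([nn.BatchNorm2d(c) for c in enc_ch[1:]])

        self.lstm = nn.LSTM(1024, 1024, num_layers=2, batch_first=True)

        dec_in = [512, 256, 128, 64, 32]      # doubled by skip connections
        dec_out = [128, 64, 32, 16, 1]
        out_pad = [(0, 0), (0, 0), (0, 0), (0, 1), (0, 0)]  # 39 -> 80 needs +1
        self.dec_convs = nn.ModuleList(
            [nn.ConvTranspose2d(dec_in[i], dec_out[i], (2, 3), stride=(1, 2),
                                output_padding=out_pad[i]) for i in range(5)])
        self.dec_bns = nn.ModuleList([nn.BatchNorm2d(c) for c in dec_out])

    def forward(self, x):
        x = x.unsqueeze(1)                        # (B, T, 161) -> (B, 1, T, 161)
        skips = []
        for conv, bn in zip(self.enc_convs, self.enc_bns):
            x = F.pad(x, (0, 0, 1, 0))            # causal padding along time
            x = F.elu(bn(conv(x)))
            skips.append(x)

        b, c, t, f = x.shape                      # (B, 256, T, 4)
        x, _ = self.lstm(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        x = x.reshape(b, t, c, f).permute(0, 2, 1, 3)

        for i, (deconv, bn) in enumerate(zip(self.dec_convs, self.dec_bns)):
            x = torch.cat([x, skips[-1 - i]], dim=1)    # skip connection
            x = bn(deconv(x)[:, :, :-1, :])             # drop the future frame
            x = F.softplus(x) if i == 4 else F.elu(x)
        return x.squeeze(1)                       # (B, T, 161)

model = CRN()
print(model(torch.randn(2, 100, 161)).shape)      # torch.Size([2, 100, 161])
print(sum(p.numel() for p in model.parameters())) # about 17.6 million
```

Instantiating this sketch gives roughly 17.6 million trainable parameters, which is consistent with the 17.58 million reported for the CRN in Fig. 5; the two 1024-unit LSTM layers account for the bulk of them.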

2.4. LSTM baselines

In our experiments, we build two LSTM baselines for comparison. In the first LSTM model, a feature window of 11 frames (10 past frames and 1 current frame) is employed to estimate one frame of the target (see Fig. 3); in other words, 11 frames of feature vectors are concatenated into a long vector as the network input at each time step. In the second LSTM model, no feature window is utilized. We denote the first LSTM model as LSTM-1 and the second one as LSTM-2. From the input layer to the output layer, LSTM-1 has 11 × 161, 1024, 1024, 1024, 1024, and 161 units, respectively; LSTM-2 has 161, 1024, 1024, 1024, 1024, and 161 units, respectively. Neither baseline uses future information, so both amount to causal systems.

Figure 3: An LSTM baseline with a feature window of 11 frames (10 past frames and 1 current frame). At each time step, the 11 input frames are concatenated into a feature vector, which is passed through four 1024-unit LSTM layers to an output layer.
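For reference, the causal feature window of LSTM-1 can be formed as in the sketch below (NumPy, our illustration): each time step t is represented by the concatenation of frames t-10 through t, giving an 11 × 161 = 1771-dimensional input vector. How the first frames of an utterance are handled is not specified in the paper; the sketch simply assumes zero frames before the start.

```python
import numpy as np

def stack_frames(mag, n_past=10):
    """Stack each frame with its n_past predecessors (a causal window).

    mag: (T, 161) STFT magnitudes -> (T, (n_past + 1) * 161) features.
    Frames before the utterance start are assumed to be zeros.
    """
    T, d = mag.shape
    padded = np.vstack([np.zeros((n_past, d), dtype=mag.dtype), mag])
    # Column block i holds frame t - (n_past - i), so the order is t-10, ..., t.
    return np.hstack([padded[i:i + T] for i in range(n_past + 1)])

features = stack_frames(np.random.rand(100, 161))   # -> shape (100, 1771)
```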
3. Experiments

3.1. Experimental setup

In our experiments, we evaluate the models on the WSJ0 SI-84 training set [22], which includes 7138 utterances from 83 speakers (42 males and 41 females). Among these speakers, 6 speakers (3 males and 3 females) are treated as untrained speakers; hence, we train the models on the 77 remaining speakers. To obtain noise-independent models, we use 10 000 noises from a sound effect library (available at https://www.sound-ideas.com) for training, with a total duration of about 126 hours. For testing, we use two challenging noises (babble and cafeteria) from an Auditec CD (available at http://www.auditec.com).

We create a training set of 320 000 mixtures with a total duration of about 500 hours. Specifically, we mix a randomly selected training utterance with a random cut from the 10 000 training noises at a signal-to-noise ratio (SNR) randomly chosen from {-5, -4, -3, -2, -1, 0} dB. To investigate speaker generalization of the models, we create two test sets for each noise, using 6 trained speakers (3 males and 3 females) and 6 untrained speakers, respectively. One test set comprises 150 mixtures created from 25 × 6 utterances of the 6 trained speakers, while the other comprises 150 mixtures created from 25 × 6 utterances of the 6 untrained speakers. Note that all test utterances are excluded from the training set. We use two SNRs for the test sets, i.e. -5 and -2 dB. All signals are sampled at 16 kHz.
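The mixing procedure can be sketched as follows (our illustration, not the authors' script): a noise cut of the same length as the utterance is scaled so that the speech-to-noise power ratio of the mixture equals the target SNR.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=np.random):
    """Mix a speech utterance with a random noise cut at the given SNR (dB)."""
    start = rng.randint(0, len(noise) - len(speech) + 1)
    cut = noise[start:start + len(speech)].astype(np.float64)
    p_speech = np.mean(speech.astype(np.float64) ** 2)
    p_noise = np.mean(cut ** 2)
    # Scale the noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    cut *= np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + cut

# Training mixtures draw the SNR from {-5, ..., 0} dB.
snr_db = np.random.choice([-5, -4, -3, -2, -1, 0])
```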
The models are trained with the Adam optimizer [23] and a learning rate of 0.0002. The mean squared error (MSE) serves as the objective function. We train the models at the utterance level with a minibatch size of 16. Within a minibatch, all training samples are padded with zeros to have the same number of time steps as the longest sample. The best models are selected by cross validation.
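A minimal training step consistent with this setup might look like the sketch below (PyTorch, under our assumptions). The paper does not say whether the zero-padded frames are excluded from the loss, so this sketch simply computes the MSE over all frames of the padded batch.

```python
import torch
import torch.nn.functional as F

model = CRN()  # the CRN sketch from Section 2.3
optimizer = torch.optim.Adam(model.parameters(), lr=0.0002)

def train_step(noisy_mag, clean_mag):
    """One utterance-level minibatch step.

    noisy_mag, clean_mag: (16, T_max, 161) tensors, zero-padded to the
    length of the longest utterance in the minibatch.
    """
    optimizer.zero_grad()
    est_mag = model(noisy_mag)
    loss = F.mse_loss(est_mag, clean_mag)   # MSE objective
    loss.backward()
    optimizer.step()
    return loss.item()
```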

3.2. Experimental results

In this study, we use STOI and the perceptual evaluation of speech quality (PESQ) [24] as the evaluation metrics. Tables 2 and 3 present the STOI and PESQ scores of unprocessed and processed signals for trained speakers and untrained speakers, respectively. In each case, the best result is achieved by the CRN. As shown in Tables 2 and 3, LSTM-1 and LSTM-2 yield similar STOI and PESQ scores for both trained and untrained speakers, which implies that the feature window in LSTM-1 does not improve performance. On the other hand, our proposed CRN consistently outperforms the LSTM baselines in both metrics. At the SNR of -5 dB, for example, the CRN provides about 2% STOI improvement and about 0.1 PESQ improvement over the LSTM models. Comparing the results in Table 2 with those in Table 3, we find that the CRN generalizes well to untrained speakers. In the most challenging case, where the utterances from untrained speakers are mixed with the two untrained noises at -5 dB, the CRN produces an 18.56% STOI improvement and a 0.55 PESQ improvement over the unprocessed mixtures.

Table 2: Model comparisons in terms of STOI and PESQ scores on trained speakers.

              STOI (in %), -5 dB       STOI (in %), -2 dB       PESQ, -5 dB             PESQ, -2 dB
              Avg.   babble cafeteria  Avg.   babble cafeteria  Avg.  babble cafeteria  Avg.  babble cafeteria
unprocessed   58.18  58.95  57.40      65.75  66.30  65.19      1.50  1.63   1.52       1.67  1.79   1.70
LSTM-1        75.81  77.29  74.32      82.00  82.62  81.38      2.05  2.06   2.04       2.33  2.36   2.30
LSTM-2        75.80  77.45  74.14      82.53  83.80  81.25      2.05  2.06   2.03       2.31  2.34   2.28
CRN           77.89  79.71  76.07      84.08  85.48  82.68      2.15  2.17   2.12       2.41  2.44   2.38

Table 3: Model comparisons in terms of STOI and PESQ scores on untrained speakers.

              STOI (in %), -5 dB       STOI (in %), -2 dB       PESQ, -5 dB             PESQ, -2 dB
              Avg.   babble cafeteria  Avg.   babble cafeteria  Avg.  babble cafeteria  Avg.  babble cafeteria
unprocessed   57.86  58.54  57.18      65.08  65.45  64.70      1.52  1.56   1.47       1.66  1.69   1.63
LSTM-1        74.33  75.21  73.44      81.75  82.65  80.84      1.96  1.94   1.97       2.25  2.26   2.24
LSTM-2        74.42  75.55  73.29      81.88  82.87  80.88      1.95  1.94   1.96       2.25  2.25   2.24
CRN           76.42  77.98  74.85      83.31  84.38  82.24      2.04  2.04   2.03       2.33  2.34   2.31

The CRN takes advantage of batch normalization, which can be easily adopted for convolution operations to accelerate training and improve performance. Fig. 4 compares the training and test MSEs of the different models over training epochs, where the models are evaluated on a test set of six untrained speakers. We observe that the CRN converges faster and achieves lower MSEs than the two LSTM models. Moreover, the CRN has far fewer trainable parameters than the LSTM models, as shown in Fig. 5; this is mainly due to the use of shared weights in the convolutions. With its higher parameter efficiency, the CRN is also easier to train than the LSTMs.

Figure 4: Mean square errors over training epochs for LSTM-1, LSTM-2 and CRN on the training set and the test set. All models are evaluated with a test set of six untrained speakers on the untrained babble noise.

Figure 5: Parameter efficiency comparison of different models. We compare the number of trainable parameters in different models (in millions: LSTM-1 36.81, LSTM-2 30.22, CRN 17.58).

In addition, the causal convolutions in the CRN capture local spatial patterns in the input STFT magnitude spectrum without using future information. In contrast, the LSTM models treat each input frame as a flattened feature vector and cannot sufficiently leverage the T-F structure of the STFT magnitude spectrum. On the other hand, the LSTM layers in the CRN model the temporal dependencies in a latent space, which is important for speaker characterization in speaker-independent speech enhancement.

4. Conclusions

In this study, we have proposed a convolutional recurrent network to deal with noise- and speaker-independent speech enhancement for real-time applications. The proposed model leads to a causal speech enhancement system, in which no future information is utilized. The evaluation results suggest that the proposed CRN consistently outperforms two strong LSTM baselines for both trained and untrained speakers in terms of STOI and PESQ scores. In addition, we find that the CRN has fewer trainable parameters than the LSTMs. We believe the proposed model represents a strong speech enhancement method for real-world applications, whose desirable properties often include online operation, single-channel operation, and noise- and speaker-independent models.
5. References

[1] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: an overview," arXiv preprint arXiv:1708.07524, 2017.

[2] J. Agnew and J. M. Thornton, "Just noticeable and objectionable group delays in digital hearing aids," Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330-336, 2000.

[3] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.

[4] Y. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381-1390, 2013.

[5] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 12, pp. 1849-1858, 2014.

[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014.

[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7-19, 2015.

[8] J. Chen, Y. Wang, S. E. Yoho, D. L. Wang, and E. W. Healy, "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604-2612, 2016.

[9] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," in Proceedings of Interspeech, 2016, pp. 3314-3318.

[10] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705-4714, 2017.

[11] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153-167, 2017.

[12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.

[13] K. Tan, J. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for supervised speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, to appear.

[14] Z. Zhang, Z. Sun, J. Liu, J. Chen, Z. Huo, and X. Zhang, "Deep recurrent convolutional neural network: Improving performance for speech recognition," arXiv preprint arXiv:1611.07174, 2016.

[15] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, "Low latency sound source separation using convolutional recurrent neural networks," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 71-75.

[16] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," arXiv preprint arXiv:1505.07293, 2015.

[17] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.

[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.

[19] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.

[20] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.

[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.

[22] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.

[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.

[24] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. IEEE, 2001, pp. 749-752.
