A Convolutional Recurrent Neural Network For Real-Time Speech Enhancement
Abstract

Many real-world applications of speech enhancement, such as hearing aids and cochlear implants, desire real-time processing with no or low latency. In this paper, we propose a novel convolutional recurrent network (CRN) to address real-time monaural speech enhancement. We incorporate a convolutional encoder-decoder (CED) and long short-term memory (LSTM) into the CRN architecture, which leads to a causal system that is naturally suitable for real-time processing. Moreover, the proposed model is noise- and speaker-independent, i.e. noise types and speakers can differ between training and test. Our experiments suggest that the CRN yields consistently better objective intelligibility and perceptual quality than an existing LSTM based model, while having far fewer trainable parameters.

Index Terms: noise- and speaker-independent speech enhancement, real-time applications, convolutional encoder-decoder, long short-term memory, convolutional recurrent networks

(This research was supported in part by an NIDCD grant (R01 DC012048) and the Ohio Supercomputer Center.)

1. Introduction

Speech separation aims to separate target speech from background interference, which may include nonspeech noise, interfering speech and room reverberation [1]. Speech enhancement refers to the separation of speech and nonspeech noise. It has various real-world applications such as robust automatic speech recognition and mobile speech communication. Many such applications require real-time processing; in other words, speech enhancement must be performed with low computational complexity and provide near-instantaneous output.

In this study, we focus on monaural (single-microphone) speech enhancement that can operate in real-time applications. In digital hearing aids, for example, it has been found that a delay as low as 3 milliseconds is noticeable to listeners and a delay longer than 10 milliseconds is objectionable [2]. Such applications often require causal speech enhancement systems, in which no future information is allowed.

Inspired by the concept of time-frequency (T-F) masking in computational auditory scene analysis (CASA) [3], speech separation has been formulated as supervised learning in recent years, where a deep neural network (DNN) is employed to learn a mapping from noisy acoustic features to a T-F mask [4]. The ideal binary mask, which classifies T-F units as either speech-dominant or noise-dominant, was the first training target used in supervised speech separation. More recent training targets include the ideal ratio mask [5] and mapping-based targets corresponding to the magnitude or power spectra of target speech [6] [7]. In this study, we use the magnitude spectra of target speech as the training target.

For supervised speech enhancement, noise generalization and speaker generalization are both crucial. A simple yet effective way to deal with noise generalization is to train with many different noise types [8]. Analogously, an obvious way to address speaker generalization is to include a large number of speakers in the training set. However, it has been found that a feedforward DNN is unable to track a target speaker in the presence of many training speakers [9] [10] [11]. Typically, a DNN independently predicts a label for each time frame from a small context window around the frame. An interpretation is that such DNNs cannot leverage long-term contexts, which would be essential for tracking a target speaker. Recent studies [9] [10] suggest that it is better to formulate speech separation as a sequence-to-sequence mapping in order to leverage long-term contexts.

With such a formulation, recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been used for noise- and speaker-independent speech enhancement, where noise types and speakers can differ between training and test. Chen et al. [10] proposed an RNN with four hidden LSTM layers to deal with speaker generalization of noise-independent models. Their experimental results show that the LSTM model generalizes well to untrained speakers, and substantially outperforms a DNN based model in terms of short-time objective intelligibility (STOI) [12]. A more recent study [13] developed a gated residual network (GRN) based on dilated convolutions. Compared with the LSTM model in [10], the GRN exhibits higher parameter efficiency and better generalization to untrained speakers at different SNR levels. On the other hand, the GRN requires a large amount of future information for mask estimation or spectral mapping at each time frame. Hence, it cannot be used for real-time speech enhancement.

Motivated by recent works [14] [15] on CRNs, we develop a novel CRN architecture for noise- and speaker-independent speech enhancement in real time. The CRN incorporates a convolutional encoder-decoder and long short-term memory. We find that the proposed CRN leads to consistently better objective speech intelligibility and quality than the LSTM model in [10], while having far fewer trainable parameters.

The rest of this paper is organized as follows. We give a detailed description of the proposed model in Section 2. The experimental setup and results are presented in Section 3. We conclude the paper in Section 4.

2. System description

2.1. Encoder-decoder with causal convolutions

Badrinarayanan et al. first proposed a convolutional encoder-decoder network for pixel-wise image labelling [16]. […]
[Figure 1: illustration of causal convolutions along the time axis, with an output layer stacked on hidden layers.]

Note that the input can be treated as a sequence of feature vectors, while only the time dimension is illustrated in Fig. 1. In causal convolutions, the output does not depend on future inputs. With causal convolutions instead of noncausal convolutions, the encoder-decoder architecture leads to a causal system. Note that we can easily apply causal deconvolutions in the decoder, since a deconvolution is intrinsically a convolution operation.
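The paper describes causal convolutions conceptually; as an illustration only, the following minimal sketch (in PyTorch, a framework of our choosing that the paper does not name) makes a 2-D convolution over a (batch, channel, time, frequency) tensor causal along time by zero-padding only on the past side, so the output at frame t never depends on later frames. The class name, kernel size and strides are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalConv2d(nn.Module):
        """2-D convolution over a (batch, channel, time, freq) tensor that is
        causal along the time axis: the output at frame t depends only on
        frames <= t. Hypothetical helper, not taken from the paper."""
        def __init__(self, in_ch, out_ch, kernel_size=(2, 3), stride=(1, 2)):
            super().__init__()
            self.time_kernel = kernel_size[0]
            # Pad the frequency axis symmetrically; handle the time axis manually.
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                                  stride=stride, padding=(0, 1))

        def forward(self, x):
            # F.pad takes (freq_left, freq_right, time_left, time_right):
            # pad (time_kernel - 1) zero frames on the past side only.
            x = F.pad(x, (0, 0, self.time_kernel - 1, 0))
            return self.conv(x)

    # Example: 1 input channel, 100 time frames, 161 frequency bins.
    x = torch.randn(4, 1, 100, 161)
    y = CausalConv2d(1, 16)(x)
    print(y.shape)  # time length preserved; frequency roughly halved by stride 2

Because the asymmetric padding supplies all the past context a frame needs, the layer can be run frame by frame at test time, which is what makes the encoder-decoder usable in a real-time system.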
[Figure 2: the CRN architecture. The encoder comprises five Conv blocks, each convolution followed by batch normalization (BN) and an ELU activation; two LSTM layers follow the encoder; the decoder comprises five Deconv blocks, each followed by BN and ELU, with a Softplus activation at the output. The time and frequency axes are denoted t and f.]

[…] is encoded into a higher-dimensional latent space […]. The proposed CRN benefits from the feature extraction capability of CNNs and the temporal modeling capability of RNNs by combining the two topologies. A more detailed description of the proposed network architecture […]

[Figure: baseline models built from stacked LSTM layers with 1024 units each (caption not recoverable).]
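Figure 2 specifies only the topology (five Conv-BN-ELU encoder blocks, two LSTM layers, five Deconv-BN-ELU decoder blocks with a Softplus output); the exact channel counts, kernel sizes and LSTM width are not recoverable here. The sketch below is therefore a structural illustration under assumed sizes, again in PyTorch, using frequency-only strides and 1-frame time kernels so that causality holds trivially; the actual model would instead use causal time kernels as sketched in Section 2.1.

    import torch
    import torch.nn as nn

    class CRN(nn.Module):
        """Structural sketch of the CRN: a convolutional encoder, two LSTM
        layers in the latent space, and a deconvolutional decoder. Channel
        counts, kernel sizes, LSTM width and the 161-bin input are
        illustrative assumptions, not the paper's exact configuration."""
        def __init__(self, freq_bins=161):
            super().__init__()
            chans = [1, 16, 32, 64, 128, 256]
            # Encoder: five Conv-BN-ELU blocks, downsampling along frequency only.
            self.encoder = nn.ModuleList([
                nn.Sequential(
                    nn.Conv2d(chans[i], chans[i + 1], kernel_size=(1, 3),
                              stride=(1, 2), padding=(0, 1)),
                    nn.BatchNorm2d(chans[i + 1]),
                    nn.ELU())
                for i in range(5)])
            # After five stride-2 layers, 161 frequency bins reduce to 6.
            latent = chans[-1] * 6
            # Two LSTM layers model temporal dependencies in the latent space.
            self.rnn = nn.LSTM(latent, latent, num_layers=2, batch_first=True)
            # Decoder: five Deconv-BN-ELU blocks; Softplus keeps the output
            # (an estimated magnitude spectrum) non-negative.
            self.decoder = nn.ModuleList([
                nn.Sequential(
                    nn.ConvTranspose2d(chans[5 - i], chans[4 - i],
                                       kernel_size=(1, 3), stride=(1, 2),
                                       padding=(0, 1)),
                    nn.BatchNorm2d(chans[4 - i]),
                    nn.ELU() if i < 4 else nn.Softplus())
                for i in range(5)])

        def forward(self, x):               # x: (batch, 1, time, freq)
            for layer in self.encoder:
                x = layer(x)
            b, c, t, f = x.shape            # flatten channels x freq per frame
            x, _ = self.rnn(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
            x = x.reshape(b, t, c, f).permute(0, 2, 1, 3)
            for layer in self.decoder:
                x = layer(x)
            return x                        # estimated magnitude spectrogram

    est = CRN()(torch.randn(2, 1, 100, 161))
    print(est.shape)  # (2, 1, 100, 161)

The downsampling encoder and upsampling decoder keep the per-frame feature vector fed to the LSTM compact, which is one way the CRN can use far fewer parameters than a purely recurrent model of comparable depth.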
Table 2: Model comparisons in terms of STOI and PESQ scores on trained speakers.
Table 3: Model comparisons in terms of STOI and PESQ scores on untrained speakers.
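The STOI [12] and PESQ [24] scores reported in Tables 2 and 3 (the table contents are not reproduced here) are obtained by comparing each enhanced utterance with its clean reference. As a hedged illustration, the sketch below assumes the third-party pystoi and pesq Python packages, which the paper does not use or mention; the wide-band PESQ mode and 16 kHz sampling rate are likewise assumptions.

    from pystoi import stoi   # assumed third-party package; not named in the paper
    from pesq import pesq     # assumed third-party package; not named in the paper

    def evaluate(clean, enhanced, fs=16000):
        """Return (STOI, PESQ) for one pair of 1-D waveforms sampled at fs Hz.
        STOI lies in [0, 1]; the paper reports it as a percentage."""
        s = stoi(clean, enhanced, fs, extended=False)
        p = pesq(fs, clean, enhanced, 'wb')   # 'wb' (wide-band) mode is an assumption
        return s, p

    # Usage on real test data (file names below are placeholders):
    #   clean, fs = soundfile.read('clean_utterance.wav')
    #   enhanced, _ = soundfile.read('enhanced_utterance.wav')
    #   print(evaluate(clean, enhanced, fs))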
Figure 5: Parameter efficiency comparison of different models. We compare the number of trainable parameters in different models.
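The parameter counts compared in Figure 5 can be reproduced for any model by summing the sizes of its trainable tensors; a minimal sketch, again assuming PyTorch:

    import torch.nn as nn

    def count_parameters(model: nn.Module) -> int:
        """Number of trainable parameters in a model."""
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    # Example: a single 1024-unit LSTM layer over 161-dimensional input frames,
    # roughly one layer of the LSTM baselines sketched in the figures above.
    print(count_parameters(nn.LSTM(161, 1024)))  # about 4.9 million parameters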
Figure 4: Mean square errors over training epochs for LSTM-1, LSTM-2 and CRN on the training set and the test set. All models are evaluated with a test set of six untrained speakers on the untrained babble noise.

[…] signals for trained speakers and untrained speakers, respectively. In each case, the best result is highlighted by a boldface number. As shown in Tables 2 and 3, LSTM-1 and LSTM-2 yield similar STOI and PESQ scores for both trained and untrained speakers, which implies that the use of the feature window in LSTM-1 does not improve performance. On the other hand, our proposed CRN consistently outperforms the LSTM baselines in both metrics. At the SNR of -5 dB, for example, the CRN provides about 2% STOI improvement and about 0.1 PESQ improvement over the LSTM models. Comparing the results in Table 2 with those in Table 3, we find that the CRN generalizes well to untrained speakers. In the most challenging case, where the utterances from untrained speakers are mixed with the two untrained noises at -5 dB, the CRN produces an 18.56% STOI improvement and a 0.55 PESQ improvement over the unprocessed mixtures.

The CRN takes advantage of batch normalization, which can be easily applied to convolution operations to accelerate training and improve performance. Fig. 4 compares training and test MSEs of different models over training epochs, where the models are evaluated on a test set of six untrained speakers. We observe that the CRN converges faster and achieves lower MSEs than the two LSTM models. Moreover, the CRN has fewer trainable parameters than the LSTM models, as shown in Fig. 5. This is mainly due to the use of shared weights in convolutions. With its higher parameter efficiency, the CRN is easier to train than the LSTMs.

In addition, the causal convolutions in the CRN capture local spatial patterns in the input STFT magnitude spectrum without using future information. In contrast, the LSTM models treat each input frame as a flattened feature vector and cannot sufficiently leverage the T-F structure of the STFT magnitude spectrum. On the other hand, the LSTM layers in the CRN model the temporal dependencies in a latent space, which would be important for speaker characterization in speaker-independent speech enhancement.

4. Conclusions

In this study, we have proposed a convolutional recurrent network to deal with noise- and speaker-independent speech enhancement for real-time applications. The proposed model leads to a causal speech enhancement system, in which no future information is utilized. The evaluation results suggest that the proposed CRN consistently outperforms two strong LSTM baselines for both trained and untrained speakers in terms of STOI and PESQ scores. In addition, we find that the CRN has fewer trainable parameters than the LSTMs. We believe the proposed model represents a strong speech enhancement method for real-world applications, whose desirable properties often include online operation, single-channel operation, and noise- and speaker-independent models.
5. References

[1] D. L. Wang and J. Chen, "Supervised speech separation based on deep learning: an overview," arXiv preprint arXiv:1708.07524, 2017.
[2] J. Agnew and J. M. Thornton, "Just noticeable and objectionable group delays in digital hearing aids," Journal of the American Academy of Audiology, vol. 11, no. 6, pp. 330-336, 2000.
[3] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Wiley-IEEE Press, 2006.
[4] Y. Wang and D. L. Wang, "Towards scaling up classification-based speech separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381-1390, 2013.
[5] Y. Wang, A. Narayanan, and D. L. Wang, "On training targets for supervised speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 22, no. 12, pp. 1849-1858, 2014.
[6] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "An experimental study on speech enhancement based on deep neural networks," IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65-68, 2014.
[7] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7-19, 2015.
[8] J. Chen, Y. Wang, S. E. Yoho, D. L. Wang, and E. W. Healy, "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604-2612, 2016.
[9] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," in Proceedings of Interspeech, 2016, pp. 3314-3318.
[10] J. Chen and D. L. Wang, "Long short-term memory for speaker generalization in supervised speech separation," The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4705-4714, 2017.
[11] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153-167, 2017.
[12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125-2136, 2011.
[13] K. Tan, J. Chen, and D. L. Wang, "Gated residual networks with dilated convolutions for supervised speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, to appear.
[14] Z. Zhang, Z. Sun, J. Liu, J. Chen, Z. Huo, and X. Zhang, "Deep recurrent convolutional neural network: Improving performance for speech recognition," arXiv preprint arXiv:1611.07174, 2016.
[15] G. Naithani, T. Barker, G. Parascandolo, L. Bramsløw, N. H. Pontoppidan, and T. Virtanen, "Low latency sound source separation using convolutional recurrent neural networks," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2017, pp. 71-75.
[16] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," arXiv preprint arXiv:1505.07293, 2015.
[17] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," arXiv preprint arXiv:1609.07132, 2016.
[18] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.
[19] X. Glorot, A. Bordes, and Y. Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[20] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[21] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[22] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 357-362.
[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[24] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2. IEEE, 2001, pp. 749-752.