
LSTMSE-Net: Long Short Term Speech Enhancement Network for Audio-visual Speech Enhancement

Arnav Jain*1, Jasmer Singh Sanjotra*1, Harshvardhan Choudhary1, Krish Agrawal1, Rupal Shah1, Rohan Jha1, M. Sajid1, Amir Hussain2, M. Tanveer1
1 Indian Institute of Technology Indore, Simrol, Indore, 453552, India
2 School of Computing, Edinburgh Napier University, EH11 4BN, Edinburgh, United Kingdom
[email protected], [email protected]
* These authors contributed equally to this work.
Abstract

In this paper, we propose the long short-term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 1.32 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 0.06 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at https://github.com/mtanveer1/AVSEC-3-Challenge.

Index Terms: Audio-visual speech enhancement, Speech recognition, Human-computer interaction, Computational paralinguistics, LRS3 dataset

1. Introduction

Speech is key to how humans interact. Speech clarity and quality are critical for domains like video conferencing, telecommunications, voice assistants, and hearing aids. However, maintaining high-quality speech in adverse acoustic conditions, such as environments with background noise, reverberation, or poor audio quality, remains a significant challenge. Speech enhancement (SE) has become a pivotal area of study and development to solve these problems and improve speech quality and intelligibility [1]. Deep learning approaches have been the driving force behind recent advances in SE. While deep learning-based SE techniques [2, 3] have shown exceptional success by focusing mainly on audio signals, it is crucial to understand that adding visual information can greatly improve SE performance in adverse sound conditions [4, 5, 6]. For comprehensive insights into speech signal processing tasks using ensemble deep learning methods, readers are referred to [7].

Time-frequency (TF) domain methods and time-domain methods are two general categories into which audio-only SE methods can be divided, depending on the type of input. Classical TF domain techniques often rely on amplitude spectrum features; however, studies show that their effectiveness may be constrained if phase information is not taken into account [3]. Some methods that make use of complex-valued features have been introduced to get around this restriction, including complex spectral mapping (CSM) [8] and complex ratio masking (CRM) [9]. Real-valued neural networks are used in the implementation of many CRM and CSM techniques, whereas complex-valued neural networks are used in other cases to handle complex-valued input. Notable examples of complex-valued neural networks for SE tasks include the deep complex convolution recurrent network (DCCRN) [10] and the deep complex U-Net (DCUNET) [11]. In this study, we employ time-domain methods and real-valued neural networks to show their effectiveness in SE tasks.

The primary idea behind audio-visual speech enhancement (AVSE) is to augment an audio-only SE system with visual input as supplemental data, with the goal of improving SE performance. The advantage of using visual input to enhance SE system performance has been demonstrated in a number of earlier studies [4, 12, 13]. Most preceding AVSE methods focused on processing audio in the TF domain [14, 6]; however, some works have explored time-domain methods for audio-visual speech separation tasks [15]. Additionally, techniques such as self-supervised learning (SSL) embeddings are used to boost AVSE performance. Lai et al. [16] presented the SSL-AVSE technique, which combines auditory and visual cues. These combined audio-visual features are analyzed by a Transformer-based SSL AV-HuBERT model to extract characteristics, which are then processed by a BLSTM-based SE model. However, such models are too large to be scalable or deployable in real-life scenarios. Therefore, we focused on developing a smaller, simpler, and more scalable model that maintains performance comparable to these larger models.

In this paper, we propose the long short-term memory speech enhancement network (LSTMSE-Net), which exemplifies a sophisticated approach to enhancing speech signals through the integration of audio and visual information. LSTMSE-Net employs a dual-pronged feature extraction strategy: visual features are extracted using a VisualFeatNet comprising a 3D convolutional frontend and a ResNet trunk [17], while audio features are processed using an audio encoder and audio decoder. A key innovation of the system is the fusion of these features to form a comprehensive representation. Visual features are interpolated using bi-linear methods to align with the temporal dimension of the audio domain. This fusion process, combined with advanced processing through a separator network featuring bi-directional LSTMs [18, 19], underscores the model's capability to effectively enhance speech quality through comprehensive multi-modal integration. The study thus explores new frontiers in AVSE research, aiming to improve intelligibility and fidelity in challenging audio environments.
The evaluation metrics for the model include perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and scale-invariant signal-to-distortion ratio (SISDR). The model has roughly 5.1M parameters, significantly fewer than the roughly 75M parameters of the baseline model of the COG-MHEAR AVSE Challenge 2024. The initial model weights are randomized, and the average inference time is approximately 0.3 seconds per video.

In summary, we have developed a strong AVSE model, LSTMSE-Net, by employing deep learning modules such as neural networks, LSTMs, and convolutional neural networks (CNNs). When trained on the dataset provided by the COG-MHEAR challenge 2024, our model achieves higher scores across all evaluation metrics despite being substantially smaller than the challenge baseline model. Its smaller size also results in a shorter inference time compared to the baseline model, which takes approximately 0.95 seconds per video.

2. Methodology

2.1. Overview

This section delves into the intricacies of the proposed LSTMSE-Net architecture, which leverages a synergistic fusion of audio and visual features to enhance speech signals. We discuss and highlight its audio and visual feature extraction, integration, and noise separation mechanisms. This is achieved using the following primary components, discussed further below: the audio encoder, the visual feature network (VFN), the noise separator, and the audio decoder. The overall architecture of our LSTMSE-Net is depicted in Fig. 1(a).

2.2. Audio Encoder

An essential part of the AVSE system, the audio encoder module is in charge of extracting and encoding audio features. It uses a Conv1d architecture consisting of a single convolutional layer with 256 output channels, a kernel size of 16, and a stride of 8, which allows robust audio features to be extracted. To add non-linearity, a rectified linear unit (ReLU) activation function is applied after the convolution step. Afterwards, the upsampled visual features and the encoded audio features are concatenated and passed into the noise separator.
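To make this configuration concrete, the following is a minimal sketch of such a 1-D convolutional encoder. It assumes a PyTorch implementation and single-channel waveform input; the class and variable names are illustrative, and only the hyperparameters (256 output channels, kernel size 16, stride 8, ReLU) come from the description above.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Sketch of a 1-D convolutional audio encoder with the stated hyperparameters."""

    def __init__(self, out_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(1, out_channels, kernel_size=kernel_size, stride=stride)
        self.relu = nn.ReLU()

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, num_samples) -> encoded features: (batch, 256, num_frames)
        return self.relu(self.conv(waveform))

# Example: a 1-second, 16 kHz mono waveform yields 1999 feature frames.
feats = AudioEncoder()(torch.randn(2, 1, 16000))
print(feats.shape)  # torch.Size([2, 256, 1999])
```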
2.3. Visual Feature Network

The VFN is a vital component of the AVSE system, tasked with extracting relevant visual features from the input video frames. The VFN architecture comprises a frontend 3-dimensional (3D) convolutional layer, a ResNet trunk [20], and fully connected layers. The 3D convolution layer processes the raw video frames, extracting relevant anatomical and visual features. The ResNet trunk comprises a series of residual blocks designed to capture spatial and temporal features from the video input. A fully connected layer then reduces the dimensionality of the extracted features to 256, which lowers the downstream computational cost and prepares the visual features for integration with the audio features.

Bi-linear interpolation is used to upsample the encoded visual features so that they match the temporal dimension of the encoded audio features, ensuring proper synchronization of the two modalities. These upsampled visual features are then concatenated, as mentioned above, with the audio features to form a joint audio-visual feature representation, which is passed through the separator to extract the relevant part of the audio signal.
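The alignment and fusion step can be sketched as follows. The tensor layouts (frames-by-features for the visual stream, channels-by-frames for the audio stream) and the use of torch.nn.functional.interpolate are assumptions for illustration; only the bi-linear upsampling to the audio temporal dimension and the subsequent concatenation come from the description above.

```python
import torch
import torch.nn.functional as F

def fuse_audio_visual(audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
    """Upsample visual features to the audio frame rate and concatenate.

    audio_feats:  (batch, 256, T_audio)  output of the audio encoder
    visual_feats: (batch, T_video, 256)  output of the VFN (layout assumed)
    Returns a joint representation of shape (batch, 512, T_audio).
    """
    t_audio = audio_feats.shape[-1]
    feat_dim = visual_feats.shape[-1]
    # Treat the (T_video, 256) feature map as an image and resize it bi-linearly
    # so that its first axis matches the number of audio frames.
    visual_up = F.interpolate(
        visual_feats.unsqueeze(1),           # (batch, 1, T_video, 256)
        size=(t_audio, feat_dim),
        mode="bilinear",
        align_corners=False,
    ).squeeze(1)                             # (batch, T_audio, 256)
    # Put channels first and concatenate along the feature dimension.
    return torch.cat([audio_feats, visual_up.transpose(1, 2)], dim=1)

# Example: 1999 audio frames and 25 video frames.
joint = fuse_audio_visual(torch.randn(2, 256, 1999), torch.randn(2, 25, 256))
print(joint.shape)  # torch.Size([2, 512, 1999])
```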
2.4. Feature Extractor and Noise Separator

2.4.1. Overview and Motivation

The Separator module is a crucial component of LSTMSE-Net, tasked with effectively integrating and processing the combined audio and visual features to isolate and enhance the speech signal. This module leverages long short-term memory (LSTM) networks to capture temporal dependencies and relationships between the audio and visual inputs. The use of LSTM in the AVSE system is further motivated by the following reasons.

Sequential data: Audio and visual features are sequential in nature, with each frame or time step building upon the previous one, and LSTM is well suited to handle such sequential data. Speech signals also exhibit long-term dependencies, with phonetic and contextual information spanning multiple time steps; LSTM's ability to learn long-term dependencies enables it to capture these relationships effectively.

Contextual information: LSTM's internal memory mechanism allows it to retain contextual information, enabling the system to make informed decisions about speech enhancement.

2.4.2. Core Functionality and Multimodal Ability

The functionality of the Separator block is based on a multi-modal fusion design. By integrating audio and visual inputs, the Separator block optimizes speech enhancement through the complementary information carried by the two modalities. The VFN captures visual cues, including lip movements, which offer important context for differentiating speech from background noise, and the temporal alignment of the visual and auditory elements makes it easier to identify the portion of the audio that corresponds to the target speaker.

The Separator block is made up of several separate units, each of which makes use of intra- and inter-LSTM layers, linear layers, and group normalization. We now elaborate on the information flow in a single unit. Group normalization layers normalize the combined features after the initial feature extraction and concatenation; these normalization steps stabilize the learning process and provide consistent feature scaling, ensuring that the auditory and visual inputs are initially given equal priority. The intra- and inter-LSTM layers allow the model to recognize complex correlations and patterns between the auditory and visual inputs: the intra-LSTM layers concentrate on local feature extraction, whilst the inter-LSTM layers capture global context. Through residual connections, the original inputs are added back to the outputs of these LSTM layers, aiding gradient flow during training and helping to preserve relevant features. This residual design lets the Separator block learn both the additive and the interactive effects of the audio-visual features, yielding more reliable speech enhancement. Fig. 1(b) shows a single unit of the Separator block.

Figure 1: (a) The workflow of the proposed LSTMSE-Net; (b) a single unit in the Separator block of the proposed LSTMSE-Net.
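To illustrate the pattern described above, here is a hedged sketch of one Separator unit. The exact way LSTMSE-Net splits local (intra) and global (inter) processing is not specified in the text, so the sketch simply applies the same group-norm, bidirectional-LSTM, linear, and residual pattern twice over the joint feature sequence; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class SeparatorUnit(nn.Module):
    """Illustrative sketch of one Separator unit: group normalization, a
    bidirectional LSTM, a linear projection, and a residual connection,
    applied twice (an intra-style pass and an inter-style pass)."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256, groups: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList()
        for _ in range(2):  # one local (intra) pass, one global (inter) pass
            self.blocks.append(nn.ModuleDict({
                "norm": nn.GroupNorm(groups, feat_dim),
                "lstm": nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True),
                "proj": nn.Linear(2 * hidden, feat_dim),
            }))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, T) joint audio-visual features
        for blk in self.blocks:
            residual = x
            y = blk["norm"](x).transpose(1, 2)   # (batch, T, feat_dim)
            y, _ = blk["lstm"](y)                # (batch, T, 2 * hidden)
            y = blk["proj"](y).transpose(1, 2)   # back to (batch, feat_dim, T)
            x = residual + y                     # residual connection
        return x

out = SeparatorUnit()(torch.randn(2, 512, 1999))  # shape preserved: (2, 512, 1999)
```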

As highlighted above, the proposed AVSE system employs a multi-modal fusion strategy, combining the strengths of both audio and visual modalities. The final output of the separator module is a mask that retains only the relevant part of the original input audio features and removes the background noise. The original input audio features are then multiplied by this mask in order to extract the components that are relevant and suppress the ones that are not needed. This generates a clean, processed audio feature map.
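The masking step itself is an element-wise multiplication. In the sketch below, the 1x1 convolution that maps the joint separator features back to the audio channel count and the sigmoid used to bound the mask are assumptions; only the multiplication of the encoded audio features by the mask comes from the description above.

```python
import torch
import torch.nn as nn

# Assumed mask head: project joint features (512 channels in the earlier sketches)
# back to the 256 audio channels before masking.
mask_head = nn.Conv1d(512, 256, kernel_size=1)

def apply_mask(audio_feats: torch.Tensor, separator_out: torch.Tensor) -> torch.Tensor:
    """audio_feats: (batch, 256, T); separator_out: (batch, 512, T)."""
    mask = torch.sigmoid(mask_head(separator_out))  # (batch, 256, T), values in [0, 1]
    return audio_feats * mask                       # element-wise masking

enhanced_feats = apply_mask(torch.randn(2, 256, 1999), torch.randn(2, 512, 1999))
```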
2.5. Audio Decoder

The audio decoder is built upon the ConvTranspose1d [21] architecture and consists of a single transposed convolution layer with a kernel size of 16, a stride of 8, and a single output channel. This design transforms the enhanced audio feature map back into a time-domain audio signal. It takes the enhanced feature map as input and returns the enhanced audio signal, which is also the final model output.
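A minimal sketch of such a decoder is given below, assuming a PyTorch implementation; the input channel count of 256 (matching the encoder output) is assumed, while the kernel size, stride, and single output channel follow the description above.

```python
import torch
import torch.nn as nn

class AudioDecoder(nn.Module):
    """Sketch of the single transposed-convolution decoder."""

    def __init__(self, in_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_channels, 1, kernel_size=kernel_size, stride=stride)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 256, num_frames) -> waveform: (batch, 1, num_samples)
        return self.deconv(feats)

# Round trip with the earlier shapes: 1999 frames map back to 16000 samples.
wave = AudioDecoder()(torch.randn(2, 256, 1999))
print(wave.shape)  # torch.Size([2, 1, 16000])
```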
3. Experiments

In this section, we begin with a detailed description of the dataset. Next, we outline the experimental setup and the evaluation metrics used. Finally, we present and discuss the experimental results.

3.1. Dataset Description

The data used for training, validation, and testing consists of videos extracted from the LRS3 dataset [22]. It contains 34,524 scenes (a total of 113 hours and 17 minutes) from 605 speakers drawn from TED and TEDx talks. For the noise, speech interferers were selected from a pool of 405 competing speakers and 7,346 noise recordings across 15 different divisions.

The videos contained in the test set differ from those used in the training and validation sets. The training set contains around 5,090 videos with a vocabulary of 51k unique words, whilst the validation and test sets contain 4,004 videos (with a 17k-word vocabulary) and 412 videos, respectively.

The dataset has two types of interferers: competing speech, taken from the LRS3 dataset (competing speakers and target speakers do not overlap), and noise, which is derived from various datasets: CEC1 [23], which contains around 7 hours of noise; the DEMAND noise dataset [24], which includes multi-channel recordings of 18 soundscapes lasting more than 1 hour; the MedleyDB dataset [25], which comprises 122 royalty-free songs; the Deep Noise Suppression challenge (DNS) dataset [26], which was released in the previous edition of the challenge and features sounds from AudioSet, Freesound, and DEMAND; and the Environmental Sound Classification (ESC-50) dataset [27], which comprises 50 noise groups falling into five categories: animal sounds, landscape and water sounds, human non-verbal sounds, indoor and outdoor domestic noises, and urban noises. Additionally, data preparation scripts are provided. The output of these scripts consists of the following: S00001_target.wav (target audio), S00001_silent.mp4 (video without audio), and S00001_interferer.wav (interferer audio).

3.2. Experimental Setup

We set up our training environment to make the best possible use of the available resources. The model was trained for 48 epochs and 211,435 steps. Using GPU acceleration, each epoch took about twenty-two minutes to finish. This training time demonstrates how quickly and efficiently the model can handle large datasets.

A single NVIDIA RTX A4500 GPU with 146 GB of shared RAM is used for all training and inference tasks. This sturdy training configuration ensured that LSTMSE-Net could be trained effectively and could withstand the high computational demands of high-quality audio-visual speech enhancement.

3.3. Evaluation Metrics

The LSTMSE-Net model was subjected to a thorough evaluation using multiple standard metrics: scale-invariant signal-to-distortion ratio (SISDR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). Each of these measures offers distinct insights into different aspects of speech enhancement quality, ensuring a comprehensive and multifaceted evaluation.

3.3.1. PESQ

PESQ is a standardised metric that compares the enhanced speech signal to a clean reference signal in order to evaluate speech quality. Scores range from −0.5 to 4.5, with larger scores denoting better perceptual quality.
3.3.2. STOI

STOI (short-time objective intelligibility) is a metric used to assess how clear and understandable speech is, particularly in environments with background noise. It measures the similarity between the temporal envelopes of the clean and enhanced speech signals, producing a score between 0 and 1, with higher scores indicating better intelligibility.

3.3.3. SISDR

SISDR is a commonly used metric that assesses the quality of speech enhancement by measuring the amount of distortion introduced by the enhancement process. Higher SISDR values indicate less distortion and better speech signal quality, making SISDR a crucial indicator of how well our model preserves the original speech characteristics.
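For reference, SISDR can be computed directly from the clean and enhanced waveforms using the standard scale-invariant SDR definition, as in the NumPy sketch below; PESQ and STOI are typically computed with off-the-shelf implementations. This sketch is illustrative and is not taken from the LSTMSE-Net codebase.

```python
import numpy as np

def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-distortion ratio in dB.

    reference: clean target waveform, shape (num_samples,)
    estimate:  enhanced waveform, shape (num_samples,)
    """
    # Zero-mean both signals, as assumed by the usual definition.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# A rescaled copy of the reference still scores very high: the metric is scale-invariant.
ref = np.random.randn(16000)
print(si_sdr(ref, 0.5 * ref))
```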
We ensure a thorough and comprehensive examination of the LSTMSE-Net model by utilising these three complementary assessment metrics: STOI gauges intelligibility, PESQ assesses perceptual quality, and SISDR concentrates on distortion and fidelity. Taken together, these measurements offer a thorough insight into the model's performance, showcasing its advantages and pinpointing areas that might need improvement. This multifaceted evaluation reflects our aim of building a high-performance speech enhancement system that excels across several crucial aspects of audio quality.

3.4. Evaluation Results

Three types of speech were included in our evaluation. First, we used the noisy speech provided in the challenge testing dataset, which also served as the audio to be enhanced by the various AVSE models. Second, we generated the enhanced speech by applying our LSTMSE-Net model to the noisy speech. Finally, we produced the enhanced speech by using the COG-MHEAR AVSE Challenge 2024 baseline model on the same noisy speech. We evaluated all three using PESQ, STOI, and SISDR, the standard evaluation metrics. Table 1 displays the final scores of the models on the evaluation metrics. Compared to noisy speech, the AVSE baseline model produced notably better quality (PESQ) and higher intelligibility (STOI). Furthermore, LSTMSE-Net outperformed the baseline model by margins of 0.06 in PESQ, 0.03 in STOI, and 1.32 in SISDR. On all evaluation criteria, LSTMSE-Net performed better than both the baseline and the noisy speech, which is strong evidence of the efficacy of our model.

Table 1: Comparison of noisy speech, the baseline speech, and LSTMSE-Net (ours) based on PESQ, STOI, and SISDR metrics. The boldface in each column denotes the performance of the best model corresponding to each metric.

Audio               PESQ       STOI       SISDR
Noisy Speech        1.467288   0.610359   -5.494292
Baseline            1.492356   0.616006   -1.204192
LSTMSE-Net (Ours)   1.547272   0.647083    0.124061

Table 2 displays the average inference time of the models on the testing dataset. Compared to the baseline model, which takes an average of 0.95 seconds per video to enhance the audio, LSTMSE-Net takes only 0.3 seconds per video on average. This significant reduction in processing time underscores the efficiency of LSTMSE-Net.

Table 2: Comparison between the inference time of the baseline and LSTMSE-Net (ours).

Model               Average inference time per video
Baseline            0.95 seconds
LSTMSE-Net (Ours)   0.3 seconds

The superior efficiency and efficacy of the proposed LSTMSE-Net not only reduce the computational load but also enable real-time processing, making it highly suitable for applications requiring low latency. Moreover, the smaller model size enhances scalability, allowing LSTMSE-Net to be deployed on a wider range of devices, including those with limited computational resources. This makes the proposed model an excellent choice for both high-performance systems and resource-constrained environments, demonstrating its versatility and practical applicability.

4. Conclusion and Future Work

This research presents LSTMSE-Net, an advanced AVSE architecture that improves speech quality by fusing audio signals with visual information from lip movements. The LSTMSE-Net architecture consists of an audio encoder, a visual feature network, a separator, and an audio decoder. Each of these components is essential to processing and refining the input signals in order to generate high-quality enhanced speech. LSTMSE-Net exhibits the capacity to efficiently capture and leverage both local and global audio-visual interdependence. Using advanced deep learning methods such as convolutions and long short-term memory networks, LSTMSE-Net improves speech enhancement significantly.

Experimental studies on the benchmark dataset, i.e., the COG-MHEAR LRS3 dataset, confirm LSTMSE-Net's superior performance. LSTMSE-Net performs much better than the baseline model on this dataset, demonstrating its effectiveness in combining visual and auditory characteristics for improved speech quality. To sum up, LSTMSE-Net is a notable advance in audio-visual speech enhancement, utilising the complementary qualities of auditory and visual input to deliver better speech quality, and it offers a scalable and efficient solution for speech enhancement.

For our future work, we have the following plans:
• We aim to extend our model to incorporate causality in its architecture, enabling real-time deployment. This enhancement will ensure that the model relies solely on past and current information for predictions.
• We plan to propose an enhanced version of LSTMSE-Net that incorporates attention mechanisms and advanced feature fusion techniques to further refine the integration of visual and audio features. Our goal is to achieve superior performance across various AVSE benchmarks.
• Additionally, we will conduct a comprehensive comparative analysis of LSTMSE-Net and other state-of-the-art AVSE variants. This analysis will focus on their performance in real-world noisy environments to identify strengths and areas for improvement.
5. Acknowledgement

The authors are grateful to the anonymous reviewers for their invaluable comments and suggestions. This project is supported by the Indian government's Science and Engineering Research Board (SERB) through the Mathematical Research Impact-Centric Support (MATRICS) scheme under grant MTR/2021/000787. Prof. Hussain acknowledges the support of the UK Engineering and Physical Sciences Research Council (EPSRC) grants Ref. EP/T021063/1 (COG-MHEAR) and EP/T024917/1 (NATGEN). The work of M. Sajid is supported by the Council of Scientific and Industrial Research (CSIR), New Delhi, through a fellowship under Grant 09/1022(13847)/2022-EMR-I.

6. References

[1] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, 2007.
[2] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in Interspeech, 2013, pp. 436–440.
[3] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 1562–1566.
[4] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, and H.-M. Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[5] B. Xu, C. Lu, Y. Guo, and J. Wang, "Discriminative multi-modality speech recognition," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020, pp. 14433–14442.
[6] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, and J. Jensen, "An overview of deep-learning-based audio-visual speech enhancement and separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1368–1396, 2021.
[7] M. Tanveer, A. Rastogi, V. Paliwal, M. Ganaie, A. K. Malik, J. Del Ser, and C.-T. Lin, "Ensemble deep learning in speech signal tasks: A review," Neurocomputing, vol. 550, p. 126436, 2023.
[8] K. Tan and D. Wang, "Complex spectral mapping with a convolutional recurrent network for monaural speech enhancement," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6865–6869.
[9] D. S. Williamson and D. Wang, "Time-frequency masking in the complex domain for speech dereverberation and denoising," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 7, pp. 1492–1501, 2017.
[10] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, and L. Xie, "DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, August 2020, pp. 2472–2476.
[11] H.-S. Choi, J. Kim, J. Huh, A. Kim, J.-W. Ha, and K. Lee, "Phase-aware speech enhancement with deep complex U-Net," in ICLR, 2019. [Online]. Available: https://openreview.net/forum?id=SkeRTsAcYm
[12] T. Afouras, J. S. Chung, and A. Zisserman, "The conversation: Deep audio-visual speech enhancement," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018, pp. 3244–3248.
[13] S.-Y. Chuang, H.-M. Wang, and Y. Tsao, "Improved lite audio-visual speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 30, pp. 1345–1359, 2022.
[14] I.-C. Chern, K.-H. Hung, Y.-T. Chen, T. Hussain, M. Gogate, A. Hussain, Y. Tsao, and J.-C. Hou, "Audio-visual speech enhancement and separation by utilizing multi-modal self-supervised embeddings," in 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW), 2023, pp. 1–5.
[15] Y. Wu, C. Li, J. Bai, Z. Wu, and Y. Qian, "Time-domain audio-visual speech separation on low quality videos," in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 256–260.
[16] R. L. Lai, J.-C. Hou, M. Gogate, K. Dashtipour, A. Hussain, and Y. Tsao, "Audio-visual speech enhancement using self-supervised learning to improve speech intelligibility in cochlear implant simulations," arXiv preprint arXiv:2307.07748, 2023.
[17] H. Wang, K. Li, and C. Xu, "[Retracted] A new generation of ResNet model based on artificial intelligence and few data driven and its construction in image recognition model," Computational Intelligence and Neuroscience, vol. 2022, no. 1, p. 5976155, 2022.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[19] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM networks," in Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, vol. 4. IEEE, 2005, pp. 2047–2052.
[20] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2016, pp. 770–778.
[21] J. Shipton, J. Fowler, C. Chalmers, S. Davis, S. Gooch, and G. Coccia, "Implementing WaveNet using Intel® Stratix® 10 NX FPGA for real-time speech synthesis," 2021, pp. 1–8.
[22] T. Afouras, J. S. Chung, and A. Zisserman, "LRS3-TED: A large-scale dataset for visual speech recognition," arXiv preprint arXiv:1809.00496, 2018.
[23] S. Graetzer, J. Barker, T. J. Cox, M. Akeroyd, J. F. Culling, G. Naylor, E. Porter, and R. Viveros Munoz, "Clarity-2021 challenges: Machine learning challenges for advancing hearing aid processing," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2, 2021, pp. 686–690.
[24] J. Thiemann, N. Ito, and E. Vincent, "The diverse environments multi-channel acoustic noise database (DEMAND): A database of multichannel environmental noise recordings," in Proceedings of Meetings on Acoustics, vol. 19, no. 1. AIP Publishing, 2013, p. 035081.
[25] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in ISMIR, vol. 14, 2014, pp. 155–160.
[26] C. K. Reddy, G. Vishak, C. Ross, B. Ebrahim, C. Roger, D. Harishchandra, M. Sergiy, A. Robert, A. Ashkan, B. Sebastian, R. Puneet, S. Sriram, and G. Johannes, "The Interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results," Interspeech 2020, Oct. 2020. [Online]. Available: https://cir.nii.ac.jp/crid/1360016869793872128
[27] K. J. Piczak, "ESC: Dataset for environmental sound classification," in MM '15. New York, NY, USA: Association for Computing Machinery, 2015, pp. 1015–1018. [Online]. Available: https://doi.org/10.1145/2733373.2806390