LSTMSE-Net: Long Short Term Speech Enhancement Network For Audio-Visual Speech Enhancement
Arnav Jain∗1, Jasmer Singh Sanjotra∗1, Harshvardhan Choudhary1, Krish Agrawal1, Rupal Shah1, Rohan Jha1, M. Sajid1, Amir Hussain2, M. Tanveer1
1 Indian Institute of Technology Indore, Simrol, Indore, 453552, India
2 School of Computing, Edinburgh Napier University, EH11 4BN, Edinburgh, United Kingdom
[email protected], [email protected]
∗ These authors contributed equally to this work.

Abstract

In this paper, we propose the long short term memory speech enhancement network (LSTMSE-Net), an audio-visual speech enhancement (AVSE) method. This innovative method leverages the complementary nature of visual and audio information to boost the quality of speech signals. Visual features are extracted with VisualFeatNet (VFN), and audio features are processed through an encoder and decoder. The system scales and concatenates visual and audio features, then processes them through a separator network for optimized speech enhancement. The architecture highlights advancements in leveraging multi-modal data and interpolation techniques for robust AVSE challenge systems. The performance of LSTMSE-Net surpasses that of the baseline model from the COG-MHEAR AVSE Challenge 2024 by a margin of 1.32 in scale-invariant signal-to-distortion ratio (SISDR), 0.03 in short-time objective intelligibility (STOI), and 0.06 in perceptual evaluation of speech quality (PESQ). The source code of the proposed LSTMSE-Net is available at https://github.com/mtanveer1/AVSEC-3-Challenge.

Index Terms: Audio-visual speech enhancement, Speech recognition, Human-computer interaction, Computational paralinguistics, LRS3 dataset

1. Introduction

Speech is key to how humans interact. Speech clarity and quality are critical for domains such as video conferencing, telecommunications, voice assistants, and hearing aids. However, maintaining high-quality speech in adverse acoustic conditions, such as environments with background noise, reverberation, or poor audio quality, remains a significant challenge. Speech enhancement (SE) has become a pivotal area of study and development to solve these problems and improve speech quality and intelligibility [1]. Deep learning approaches have been the driving force behind recent advances in SE. While deep learning-based SE techniques [2, 3] have shown exceptional success by focusing mainly on audio signals, it is crucial to understand that adding visual information can greatly improve the performance of SE systems in adverse sound conditions [4, 5, 6]. For comprehensive insights into speech signal processing tasks using ensemble deep learning methods, readers are referred to [7].

Time-frequency (TF) domain methods and time-domain methods are two general categories into which audio-only SE methods can be divided, depending on the type of input. Classical TF domain techniques often rely on amplitude spectrum features; however, studies show that their effectiveness may be constrained if phase information is not taken into account [3]. Some methods that make use of complex-valued features have been introduced to get around this restriction, including complex spectral mapping (CSM) [8] and complex ratio masking (CRM) [9]. Real-valued neural networks are used in the implementation of many CRM and CSM techniques, whereas neural networks using complex values are used in other cases to handle complex input. Notable examples of complex-valued neural networks for SE tasks include the deep complex convolution recurrent network (DCCRN) [10] and the deep complex U-NET (DCUNET) [11]. In this study, we employ time-domain methods together with real-valued neural networks to show their effectiveness in SE tasks.

The primary idea behind audio-visual speech enhancement (AVSE) is to augment an audio-only SE system with visual input as supplemental data, with the goal of improving SE performance. The advantage of using visual input to enhance SE system performance has been demonstrated in a number of earlier studies [4, 12, 13]. Most preceding AVSE methods focused on processing audio in the TF domain [14, 6]; however, some studies have explored time-domain methods for audio-visual speech separation tasks [15]. Additionally, techniques such as self-supervised learning (SSL) embeddings have been used to boost AVSE performance. Richard et al. [16] presented the SSL-AVSE technique, which combines auditory and visual cues. These combined audio-visual features are analyzed by a Transformer-based SSL AV-HuBERT model to extract characteristics, which are then processed by a BLSTM-based SE model. However, these models are too large to be scalable or deployable in real-life scenarios. Therefore, we focused on developing a smaller, simpler, and more scalable model that maintains performance comparable to these larger models.

In this paper, we propose the long short-term memory speech enhancement network (LSTMSE-Net), which exemplifies a sophisticated approach to enhancing speech signals through the integration of audio and visual information. LSTMSE-Net employs a dual-pronged feature extraction strategy: visual features are extracted using a VisualFeatNet comprising a 3D convolutional frontend and a ResNet trunk [17], while audio features are processed using an audio encoder and audio decoder. A key innovation of the system is the fusion of these features to form a comprehensive representation. Visual features are interpolated using bi-linear methods to align with the temporal dimension of the audio domain. This fusion process, combined with advanced processing through a separator network featuring bi-directional LSTMs [18, 19], underscores the model's capability to effectively enhance speech quality through comprehensive multi-modal integration. The study thus explores new frontiers in AVSE research, aiming to improve intelligibility and fidelity in challenging audio environments.
The evaluation metrics for the model include perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and scale-invariant signal-to-distortion ratio (SISDR). The model parameters total around 5.1M, significantly fewer than the roughly 75M parameters of the COG-MHEAR Challenge 2024 baseline model. The initial model weights are randomized, and the average inference time is approximately 0.3 seconds per video.

In summary, we have developed a strong AVSE model, LSTMSE-Net, by employing deep learning modules such as neural networks, LSTMs, and convolutional neural networks (CNNs). When trained on the dataset provided by the COG-MHEAR challenge 2024, our model achieves better results across all evaluation metrics despite being substantially smaller than the baseline model provided by the challenge. Its smaller size also results in a shorter inference time compared to the baseline model, which takes approximately 0.95 seconds per video.

2. Methodology

2.1. Overview

This section delves into the intricacies of the proposed LSTMSE-Net architecture, which leverages a synergistic fusion of audio and visual features to enhance speech signals. We discuss and highlight its audio and visual feature extraction, integration, and noise separation mechanisms. This is achieved using the following primary components, which are discussed in the following subsections: the audio encoder, the visual feature network (VFN), the noise separator, and the audio decoder. The overall architecture of our LSTMSE-Net is depicted in Fig. 1(a).
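To make the overall data flow concrete before the individual components are described, the following is a minimal PyTorch-style sketch of the pipeline; the module names and tensor shapes are illustrative placeholders rather than the exact classes used in the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMSENetSketch(nn.Module):
    """Illustrative end-to-end flow of the four primary components."""

    def __init__(self, audio_encoder, visual_net, separator, audio_decoder):
        super().__init__()
        self.audio_encoder = audio_encoder  # noisy waveform  -> audio features (B, 256, T_a)
        self.visual_net = visual_net        # video frames    -> visual features (B, 256, T_v)
        self.separator = separator          # fused features  -> mask over the audio features
        self.audio_decoder = audio_decoder  # masked features -> enhanced waveform

    def forward(self, noisy_wav, video_frames):
        audio_feats = self.audio_encoder(noisy_wav)
        visual_feats = self.visual_net(video_frames)
        # Align the visual features with the audio time axis (bi-linear
        # interpolation, described in Section 2.3) and concatenate the streams.
        visual_feats = F.interpolate(
            visual_feats.unsqueeze(1),
            size=(visual_feats.size(1), audio_feats.size(-1)),
            mode="bilinear", align_corners=False,
        ).squeeze(1)
        fused = torch.cat([audio_feats, visual_feats], dim=1)
        mask = self.separator(fused)
        return self.audio_decoder(audio_feats * mask)  # apply mask and decode (Section 2.5)
```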
2.2. Audio Encoder

An essential part of the AVSE system, the audio encoder module is in charge of gathering and encoding audio features. The Conv1d architecture used in this module consists of a single convolutional layer with 256 output channels, a kernel size of 16, and a stride of 8. Robust audio features can be extracted using this setup. To add non-linearity, a rectified linear unit (ReLU) activation function is applied after the convolution step. Afterwards, the upsampled visual features and the encoded audio information are combined and passed into the noise separator.
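A minimal sketch of such an encoder in PyTorch is given below, assuming a single-channel input waveform; anything beyond the stated channel count, kernel size, and stride is an assumption.

```python
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """1-D convolutional encoder: 256 output channels, kernel size 16, stride 8, ReLU."""

    def __init__(self, out_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(1, out_channels, kernel_size=kernel_size, stride=stride)
        self.relu = nn.ReLU()

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) -> add a channel axis -> (batch, 1, samples)
        return self.relu(self.conv(wav.unsqueeze(1)))  # (batch, 256, frames)
```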
2.3. Visual Feature Network

The VFN is a vital component of the AVSE system, tasked with extracting relevant visual features from input video frames. The VFN architecture comprises a frontend 3-dimensional (3D) convolutional layer, a ResNet trunk [20], and fully connected layers. The 3D convolution layer processes the raw video frames, extracting relevant anatomical and visual features. The ResNet trunk comprises a series of residual blocks designed to capture spatial and temporal features from the video input. The dimensionality of the extracted features is then reduced to 256 by a fully connected layer, which lowers computational complexity and prepares the visual features for integration with the audio features.

Bi-linear interpolation is used to upsample the encoded visual features so that they match the temporal dimension of the encoded audio features. This ensures proper synchronization of features from both modalities. These upsampled visual features are then concatenated, as mentioned above, with the audio features to form a joint audio-visual feature representation, which is passed through the separator to extract the relevant part of the audio signal.
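A condensed sketch of such a visual frontend is shown below. The exact layer sizes, the input resolution, and the small convolutional stand-in used here in place of the full ResNet trunk are assumptions; only the 256-dimensional output matches the description above.

```python
import torch
import torch.nn as nn

class VisualFeatNetSketch(nn.Module):
    """Sketch of the VFN: a 3-D convolutional frontend, a trunk over per-frame
    feature maps (a small convolutional stand-in for the ResNet trunk), and a
    fully connected projection to 256-dimensional visual features."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.frontend3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        self.trunk = nn.Sequential(  # stand-in for the ResNet trunk
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(256, feat_dim)  # fully connected layer -> 256-d features

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, 1, frames, height, width), e.g. grayscale lip-region crops
        x = self.frontend3d(video)                     # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)  # process each frame as a 2-D map
        x = self.trunk(x).flatten(1)                   # (B*T, 256)
        x = self.proj(x).reshape(b, t, -1)             # (B, T, 256)
        return x.transpose(1, 2)                       # (B, 256, T_video)
```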
2.4. Feature Extractor and Noise Separator

2.4.1. Overview and Motivation

The Separator module is a crucial component of LSTMSE-Net, tasked with effectively integrating and processing the combined audio and visual features to isolate and enhance the speech signal. This module leverages long short term memory (LSTM) networks to capture temporal dependencies and relationships between the audio and visual inputs. The use of LSTM in the AVSE system is further motivated by the following reasons.

Sequential data: Audio and visual features are sequential in nature, with each frame or time step building upon the previous one. LSTM is well-suited to handle such sequential data. Speech signals exhibit long-term dependencies, with phonetic and contextual information spanning multiple time steps. LSTM's ability to learn long-term dependencies enables it to capture these relationships effectively.

Contextual information: LSTM's internal memory mechanism allows it to retain contextual information, enabling the system to make informed decisions about speech enhancement.

2.4.2. Core functionality and Multimodal ability

The functionality of the Separator block is based on a multi-modal fusion design. Through the integration of audio and visual inputs, the Separator block optimizes speech enhancement by utilizing complementary information from both modalities. The VFN records visual cues, including lip movements, which offer important context for differentiating speech from background noise. The temporal alignment of the visual and audio features makes it easier to identify the portion of the audio that corresponds to the target speaker.

The Separator block is made up of several separate units, each of which makes use of intra- and inter-LSTM layers, linear layers, and group normalization. We now elaborate on the information flow in a single unit. Group normalization layers are used to normalize the combined features following the initial feature extraction and concatenation. These normalization steps stabilize the learning process and provide consistent feature scaling, guaranteeing that the auditory and visual inputs are initially given equal priority. The intra- and inter-LSTM layers enable the model to recognize complex correlations and patterns between the auditory and visual inputs: the inter-LSTM layers are intended for global context, whilst the intra-LSTM layers concentrate on local feature extraction. Through residual connections, the original inputs are added back to the outputs of these LSTM layers, aiding gradient flow during training and helping to preserve relevant features. This residual design allows the Separator block to learn the additive and interactive effects of the audio-visual features, resulting in more reliable speech enhancement. Fig. 1(b) shows a single unit of the Separator block.
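A single unit of this kind can be sketched as follows. This is a minimal PyTorch rendering of the described flow (group normalization, bi-directional intra- and inter-LSTM layers, linear projections, and residual connections); the dual-path chunked input layout, hidden size, and channel count are assumptions rather than the exact configuration of the released model.

```python
import torch
import torch.nn as nn

class SeparatorUnit(nn.Module):
    """One separator unit: group-normalised input, an intra-LSTM pass for local
    context, an inter-LSTM pass for global context, linear projections, and
    residual connections. The dual-path (chunked) layout is an assumption."""

    def __init__(self, channels: int = 512, hidden: int = 128):
        super().__init__()
        self.intra_norm = nn.GroupNorm(1, channels)
        self.intra_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, channels)
        self.inter_norm = nn.GroupNorm(1, channels)
        self.inter_lstm = nn.LSTM(channels, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, chunk_len, n_chunks) -- fused audio-visual features
        b, c, k, s = x.shape
        # Intra pass: model local structure within each chunk.
        y = self.intra_norm(x).permute(0, 3, 2, 1).reshape(b * s, k, c)
        y, _ = self.intra_lstm(y)
        y = self.intra_proj(y).reshape(b, s, k, c).permute(0, 3, 2, 1)
        x = x + y                                      # residual connection
        # Inter pass: model global structure across chunks.
        z = self.inter_norm(x).permute(0, 2, 3, 1).reshape(b * k, s, c)
        z, _ = self.inter_lstm(z)
        z = self.inter_proj(z).reshape(b, k, s, c).permute(0, 3, 1, 2)
        return x + z                                   # residual connection
```

In the full separator, several such units are applied in sequence and the result is used to produce the mask described below.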
As highlighted above, the proposed AVSE system employs a multi-modal fusion strategy, combining the strengths of both the audio and visual modalities. The final output of the separator module is a mask, which retains only the relevant part of the original input audio features and removes the background noise. The original input audio features are then multiplied by this mask in order to extract those that are relevant and suppress the ones that are not needed. This generates a clean and processed audio feature map.

Figure 1: (a) The workflow of the proposed LSTMSE-Net, (b) a single unit in the Separator block of the proposed LSTMSE-Net.
2.5. Audio Decoder

The audio decoder, which is built upon the ConvTranspose1d [21] architecture, consists of a single transposed convolution layer with a kernel size of 16, a stride of 8, and a single output channel. This design facilitates the transformation of the encoded audio feature map back into an enhanced audio signal. It takes the enhanced feature map as input and returns the enhanced audio signal, which is also the final output of the model.
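A minimal sketch of such a decoder, mirroring the encoder in Section 2.2, is shown below; the 256 input channels are assumed to match the encoder's output channels.

```python
import torch
import torch.nn as nn

class AudioDecoder(nn.Module):
    """Transposed 1-D convolution decoder: kernel size 16, stride 8, one output channel."""

    def __init__(self, in_channels: int = 256, kernel_size: int = 16, stride: int = 8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_channels, 1, kernel_size=kernel_size, stride=stride)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, 256, frames) masked audio features -> (batch, samples) waveform
        return self.deconv(feats).squeeze(1)
```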
3. Experiments

In this section, we begin with a detailed description of the dataset. Next, we outline the experimental setup and the evaluation metrics used. Finally, we present and discuss the experimental results.

3.1. Dataset Description

The data used for training, testing, and validation consists of videos extracted from the LRS3 dataset [22]. It contains 34524 scenes (a total of 113 hours and 17 minutes) from 605 speakers drawn from TED and TEDx talks. For the noise, speech interferers were selected from a pool of 405 competing speakers, along with 7346 noise recordings across 15 different divisions.

The videos contained in the test set differ from those used in the training and validation datasets. The train set contains around 5090 videos with a vocabulary of around 51k unique words, whilst the validation and test sets contain 4004 videos (with a vocabulary of around 17k words) and 412 videos, respectively.

The dataset has two types of interferers. The first is speech from competing speakers, taken from the LRS3 dataset (the competing speakers and target speakers do not overlap). The second is noise, which is derived from various datasets: CEC1 [23], which consists of around 7 hours of noise; the DEMAND noise dataset [24], which includes multi-channel recordings of 18 soundscapes lasting more than 1 hour; the MedleyDB dataset [25], which comprises 122 royalty-free songs; the Deep Noise Suppression (DNS) challenge dataset [26], which was released in the previous edition of the challenge and features sounds present in AudioSet, Freesound, and DEMAND; and the Environmental Sound Classification (ESC-50) dataset [27], which comprises 50 noise groups that fall into five categories: animal sounds, landscape and water sounds, human non-verbal sounds, domestic and exterior noises, and urban noises. Additionally, data preparation scripts are provided. The output of these scripts consists of the following files: S00001_target.wav (target audio), S00001_silent.mp4 (video without audio), S00001_interferer.wav (interferer audio), and S00001_interferer.wav (the audio interferer).

3.2. Experimental Setup

We set up our training environment to make the best possible use of the available resources. A rigorous training procedure that lasted 48 epochs and 211435 steps was applied to the model. By utilising GPU acceleration, each epoch took about twenty-two minutes to finish. This training length demonstrates how quickly and efficiently the model can handle large datasets.

A single NVIDIA RTX A4500 GPU with 146 GB of shared RAM is used for all training and inference tasks. Our LSTMSE-Net model is trained effectively because of its sturdy training configuration, which also guaranteed that the model could withstand the high computational demands necessary for high-quality audio-visual speech enhancement.

3.3. Evaluation Metrics

The LSTMSE-Net model was subjected to a thorough evaluation using multiple standard metrics, including scale-invariant signal-to-distortion ratio (SISDR), short-time objective intelligibility (STOI), and perceptual evaluation of speech quality (PESQ). A comprehensive and multifaceted evaluation is ensured by the distinct insights that each of these measures offers into different aspects of speech enhancement quality.
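As a reference for how these scores are typically obtained, the following sketch computes the three metrics for a pair of clean and enhanced waveforms. It assumes the third-party `pesq` and `pystoi` packages, a 16 kHz sampling rate, and implements SISDR from its standard definition; it is not taken from the official challenge evaluation code.

```python
import numpy as np
from pesq import pesq    # ITU-T P.862 implementation (pip install pesq)
from pystoi import stoi  # STOI implementation (pip install pystoi)

def sisdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant SDR in dB between a clean reference and an estimate."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-8)
    target = scale * reference          # projection of the estimate onto the reference
    noise = estimate - target           # residual treated as distortion
    return 10 * np.log10((target ** 2).sum() / ((noise ** 2).sum() + 1e-8))

def evaluate(clean: np.ndarray, enhanced: np.ndarray, sr: int = 16000) -> dict:
    return {
        "PESQ": pesq(sr, clean, enhanced, "wb"),  # wide-band mode, roughly -0.5 to 4.5
        "STOI": stoi(clean, enhanced, sr),        # range 0 to 1
        "SISDR": sisdr(clean, enhanced),          # dB, higher is better
    }
```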
3.3.1. PESQ

PESQ is a standardised metric that compares the enhanced speech signal to a clean reference signal in order to evaluate speech quality. With values ranging from −0.5 to 4.5, larger scores denote better perceptual quality.

3.3.2. STOI

STOI (short-time objective intelligibility) is a metric used to assess how clear and understandable speech is, particularly in environments with background noise. It measures the similarity between the temporal envelopes of the clean and enhanced speech signals, producing a score between 0 and 1. Higher scores correlate with improved intelligibility.

3.3.3. SISDR

Table 1: Comparison of noisy speech, the baseline, and LSTMSE-Net (ours) based on PESQ, STOI, and SISDR metrics.

Model                 PESQ       STOI       SISDR
Noisy Speech          1.467288   0.610359   -5.494292
Baseline              1.492356   0.616006   -1.204192
LSTMSE-Net (Ours)     1.547272   0.647083    0.124061

The boldface in each column denotes the performance of the best model corresponding to each metric.

Table 2: Comparison between the inference time of the baseline and LSTMSE-Net (ours).