0% found this document useful (0 votes)
6 views

001 OK AVSR based transformer

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views

001 OK AVSR based transformer

Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

Expert Systems With Applications 252 (2024) 124159

Contents lists available at ScienceDirect

Expert Systems With Applications


journal homepage: www.elsevier.com/locate/eswa

Audio–visual speech recognition based on regulated transformer and


spatio–temporal fusion strategy for driver assistive systems
Dmitry Ryumin a ,∗, Alexandr Axyonov a , Elena Ryumina a , Denis Ivanko a , Alexey Kashevnik a ,
Alexey Karpov b
a
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg, 199178, Russia
b ITMO University, St. Petersburg, 191002, Russia

ARTICLE INFO ABSTRACT

Keywords: This article presents a research methodology for audio–visual speech recognition (AVSR) in driver assistive
Audio–visual speech recognition systems. These systems necessitate ongoing interaction with drivers while driving through voice control
Spatio–temporal fusion strategy for safety reasons. The article introduces a novel audio–visual speech command recognition transformer
Classifier ensemble
(AVCRFormer) specifically designed for robust AVSR. We propose (i) a multimodal fusion strategy based on
Transformer
spatio–temporal fusion of audio and video feature matrices, (ii) a regulated transformer based on iterative
Computer vision
Driver assistive systems
model refinement module with multiple encoders, (iii) a classifier ensemble strategy based on multiple
decoders. The spatio–temporal fusion strategy preserves contextual information of both modalities and
achieves their synchronization. An iterative model refinement module can bridge the gap between acoustic
and visual data by leveraging their impact on speech recognition accuracy. The proposed multi-prediction
strategy demonstrates superior performance compared to traditional single-prediction strategy, showcasing the
model’s adaptability across diverse audio–visual contexts. The transformer proposed has achieved the highest
values of speech command recognition accuracy, reaching 98.87% and 98.81% on the RUSAVIC and LRW
corpora, respectively. This research has significant implications for advancing human–computer interaction.
The capabilities of AVCRFormer extend beyond AVSR, making it a valuable contribution to the intersection of
audio–visual processing and artificial intelligence.

1. Introduction model specifically designed for robust AVSR. The proposed multi-
prediction strategy demonstrates superior performance compared to
Audio–visual speech recognition (AVSR) represents a promising traditional single-prediction strategy, showcasing the model’s adapt-
frontier in the field of automatic speech recognition (SR), offering ability across diverse audio–visual (AV) contexts. The AVCRFormer ar-
opportunities to enhance accuracy, accessibility, adaptability, and natu- chitecture is based on the well-established encoder–decoder paradigm
ralness in human–machine communication across various domains and with a transformer architecture. The model’s uniqueness lies in its
applications. ability to bridge the gap between acoustic and visual data, enhancing
In today’s automotive landscape, the integration of advanced tech- the process of recognition through an iterative refinement step.
nologies is reshaping the driving experience. Among these technologies, This research provides the mathematical framework necessary to
SR plays a pivotal role enabling hands-free interaction with in-car develop SR systems that are not only more efficient and resilient but
systems, enhancing the driver’s convenience, and promoting safety on contextually sensitive as well. This greater comprehension is critical for
the road. However, for traditional SR systems accuracy is challenging the improvements of user experiences across a variety of applications,
in diverse and dynamic environments, for example, inside a moving including in-car communication and, more broadly, human–computer
vehicle with prevailing background noise and visual distractions. interaction (HCI) (Liang & Tang, 2022; Ouyang, Ma, & Tan, 2023).
To address these challenges and further improve the efficacy of The motivation behind this research lies in unlocking the full po-
SR systems in automotive settings, we present an audio–visual speech tential the AVSR technology to create intelligent and intuitive in-car
command recognition transformer (AVCRFormer); which an advanced communication systems that seamlessly integrate with the driver’s

∗ Corresponding author.
E-mail addresses: [email protected] (D. Ryumin), [email protected] (A. Axyonov), [email protected] (E. Ryumina), [email protected]
(D. Ivanko), [email protected] (A. Kashevnik), [email protected] (A. Karpov).

https://doi.org/10.1016/j.eswa.2024.124159
Received 27 February 2024; Received in revised form 8 April 2024; Accepted 2 May 2024
Available online 9 May 2024
0957-4174/© 2024 Elsevier Ltd. All rights reserved.
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

natural interaction patterns, ultimately contributing to safer, more Yang, 2021; Petridis et al., 2018), and transformers (Burchi & Timofte,
enjoyable, and more efficient driving. 2023; Lee, Jang, Kim, Park, & Park, 2020; Lin, Sung, Lei, Bansal, &
The scientific novelty of the article includes the following. Bertasius, 2023; Ma, Petridis, & Pantic, 2021; Song, Sun, & Li, 2022;
Yang et al., 2023). This comprehensive integration enhances recogni-
1. We have developed and presented a novel transformer-based
tion accuracy and robustness as well as advances the AVSR performance
method for AVSR based on a Transformer architecture and in-
through attention mechanisms and NN-based classifiers.
cluding a regulated cross-attention technique. The method con-
The NNs, including convolutional neural network (CNN)s, feed-
sists of feature extraction, model initialization, and interactive
forward network (FFN)s, autoencoders, and robust recurrent neural
model refinement steps.
network (RNN)s, have proven to be effective in unraveling the in-
2. We have developed the spatio–temporal fusion strategies of AV
tricacies of AV speech (Ma, Petridis, & Pantic, 2021; Makino et al.,
features and the classifier ensemble that demonstrate promising
2019; Shewalkar, Nyavanandi, & Ludwig, 2019; Zhou, Yang, Chen,
results for solving the problem of audio and video modality
Wang, & Jia, 2019). Deep NNs (DNNs) are crucial in current AVSR by
fusion during the command recognition.
assisting in feature extraction and classification. They are used in both
3. We have developed and implemented the attention visualization the front-end and back-end of AVSRs, adeptly extracting features while
approach that shows how different modalities (audio and video) encoding spatio and temporal information concurrently. While the
affect overall SR accuracy. For example, in the case of noisy significant data requirements and computational complexity of training
environments the attention shifts to the video modality, but in DL models can be challenging (Michelsanti et al., 2021), transfer DL
the case of poor lighting conditions in the vehicle or high lip emerges as a solution. This method allows pre-trained models to refine
occlusion it shifts to the audio one. their capabilities on new, similar tasks, thereby reducing computational
4. We have developed and presented a software application for
and data requirements (Gerczuk, Amiriparian, Ottl, & Schuller, 2021;
driver assistive systems that is based on the AVSR method.
Zhuang et al., 2020).
The application integrates the proposed method, strategies, and
CNNs excel at extracting features from audio and video data, elim-
approach that can be used for the navigation and multimedia
inating the need for manual feature selection. This shift has revolu-
control system in the proposed in-vehicle case study.
tionized AVSR by automating the extraction of distinctive features. DL
The remainder of this article is organized as follows: In Section 2, is integrated with modality fusion, enabling current AVSR systems to
we provide a comprehensive review of recent advancements in audio– seamlessly combine information from audio and video modalities. Such
visual speech recognition that have recently emerged. Section 3 details fusion enhances recognition accuracy and robustness. Moreover, atten-
the AVCRFormer architecture, covering feature extraction, model ini- tion mechanisms and other NN-based classifiers have further improved
tialization, iterative refinement, and the loss function. In Section 4, we the AVSR performance.
present experimental results covering encoder and decoder configura- Transformer models have demonstrated exemplary performance in
tions, refinement iterations, and the gated mechanism. Additionally, the AVSR task (Khan et al., 2022). The profound impact of Transformer
we provide a comparative analysis of multi- and single-prediction models has become clearer with their scalability to very large capacity
strategies. Section 5 considers the implementation of the proposed models. The other method is the Conformer Transducer presented
method in a driver assistive system. Finally, some concluding remarks in (Gulati et al., 2020); it blends convolutional and transformer-based
with a discussion and our vision for future research are presented in NNs for audio processing, and proves to be highly effective in dealing
Section 6. with varying lengths of sequences. This effectiveness is achieved by
employing a Conformer block consisting of two feed-forward modules
2. Related work flanking the multi-headed self-attention module and the convolution
module, inspired by the Macaron Net structure (Lu et al., 2019).
In this section, we review the state-of-the-art (SOTA) research in A significant improvement in the field of SR is the joint implementa-
the field of automatic SR based on AV information processing with the tion of connectionist temporal classification (CTC) and attention-based
focus on four different aspects: processing of audio and video signals, end-to-end (E2E) SR (Kim, Hori, & Watanabe, 2017). The method
multimodal fusion strategies, SR in car environment, and existing AV combines the efficiency of CTC and the attention mechanism in an
speech corpora. E2E method. The use of multitask learning is beneficial for making
Recently, there has been a growing interest in the incorporation SR systems stronger and faster, and for overcoming problems caused
of combined AV information into SR systems, aiming to create more by noisy environments and long inputs. This represents an important
reliable and contextually aware intelligent systems (Afouras, Chung, step towards more accurate and adaptive E2E SR systems. The results
Senior, Vinyals, & Zisserman, 2018; Chen, Hu, et al., 2023; Ma et al., encourage and offer high potential for real-world situations (Deng, Cao,
2023; Noda, Yamaguchi, Nakadai, Okuno, & Ogata, 2015; Wang, Zha, Zhang, & Ma, 2021; Miao, Cheng, Gao, Zhang & and Yan, 2020; Zhang,
Li, Chen, & Luo, 2023). Zhao, et al., 2020).
In the field of automatic SR, neural networks (NNs) have emerged The authors in Dai et al. (2023) present a novel method for AVSR by
as a driving force for transformative change, spearheading a revolution decoupling the training process into two stages and introducing a visual
in the SR technology (Bhardwaj et al., 2022; Li et al., 2022; Thomas, pre-training framework based on lip shape correlation with syllables.
Kessler, & Karout, 2022; Zhang et al., 2023). Notable improvements In work (Ma et al., 2023) the authors explore the importance of auto
have been made in creating SR systems with high recognition accuracy AVSR and the challenges associated with the accuracy of transcription
and flexibility, primarily due to the successful combination of deep labeling. The authors propose a method to leverage publicly avail-
learning (DL) and SR techniques. The research review the most recent able SR models for automatically generated transcriptions of unlabeled
state-of-the-art (SOTA) developments of SR systems featuring enhanced corpora. This method substantially expands the training set size and
accuracy, efficiency, and adaptability. boosts the AVSR performance to SOTA, especially on the LRS2 and
LRS3 corpora (Afouras, Chung, & Zisserman, 2018).
2.1. Audio–visual speech recognition In Pan et al. (2022) authors research the implementation of self-
supervised learning in AVSR by integrating pre-trained models trained
Recent AVSR methods demonstrate substantial progress by transi- on vast unlabeled single-modal data. The core method incorporates pre-
tioning from traditional handcrafted methods to such DL architectures trained models trained on extensive unlabeled data in single modalities,
as long-short term memory (LSTM) (Makino et al., 2019; Shashidhar, using the Libri-Light corpus (Kahn et al., 2020) for audio the and
Patilkulkarni, & Puneeth, 2022), gated recurrent unit (GRU) (Liu, Li, & ImageNet corpus (Deng et al., 2009) for visual data.

2
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

Table 1
Overview of AVCR methods used in driver assistive systems. GMM refers to Gaussian Mixture Model, HMM to Hidden Markov Model, SGMM to Subspace Gaussian
Mixture Model.
Work Audio Video Audio–Visual CRNN GMM-HMM CLSTM RNN-GRU CNN DNN-HMM SGMM-HMM
Tamazin, Gouda, and Khedr (2019) + − − − + − − − − −
Adeel, Gogate, and Hussain (2020) − − + − − + − − − −
Isobe, Tamura, Hayamizu, Gotoh, and Nose (2021) + + + − − − − + − −
Jaradat, Alzubaidi, and Otoom (2022) + − − − + − − − + +
Chuang, Wang, and Tsao (2022) − − + + − − − − − −
Kumar, Renuka, Rose, Wartana, et al. (2022) + + + − − − + + − −

In Shi, Hsu, and Mohamed (2022) self-supervised learning is also that conducting multi-layer fusion can enhance the performance of
employed along with several key methods to enhance AVSR. The AVSR systems.
authors introduce the audio–visual hidden unit BERT (AV-HuBERT) Unlike the above AVSR methods concatenating of the feature vec-
framework, which uses self-supervised learning, extending the Hu- tors and their subsequent aggregation using NN models (Ma et al.,
BERT (Hsu et al., 2021) framework for joint speech representation 2023; Pan et al., 2022; Ryumin, Ivanko, & Ryumina, 2023; Shi et al.,
learning from audio and lip movement cues in video recordings. Ad- 2022) or cross-modal attention (Axyonov, Ryumin, Ivanko, Kashevnik,
ditionally, noise augmentation is a pivotal technique employed at & Karpov, 2024; Dai et al., 2023), in our spatio–temporal modality
both pre-training and fine-tuning stages. This adds diverse noise types fusion strategy implies making feature matrices uniform in size and
(ambient, babble, music, and speech noise) to the audio input, thereby combining them along the temporal axis. Such fusion allows preserving
improving the model’s robustness in noisy environments. the contextual information of both modalities and achieving their
These methods collectively contribute to the significant perfor- synchronization. This integration enables the model to fuse acoustic
mance gains observed in AVSR, making the models highly effective and and visual information, providing a holistic representation capturing
adaptable in challenging acoustic conditions. To further advance the spatio–temporal information of audio and video data. This strategy en-
performance of SOTA AVSRs in driving scenarios, we propose an AVSR hances the model’s ability to comprehend multimodal input, facilitating
method based on regulated transformers and a spatio–temporal fusion feature extraction for the proposed AVCRFormer architecture.
strategy to integrate results of audio–visual modalities. In line with the
prevailing trends we have opted to employ the Transformer architec- 2.3. Automatic speech recognition for driver assistive systems
ture in our method. The proposed iterative model refinement module
can bridge the gap between audio and visual data by compensating for Driver assistive systems with SR capabilities offer hands-free control
the weaknesses of unimodal information. Our transformer achieved the of key vehicle functions, enhancing both the comfort and safety of the
highest values of recognition accuracy on benchmark corpora. driving experience. These systems may include voice-activated navi-
gation, enabling input of destination and adjustment of route settings
2.2. Multimodal fusion strategies without taking hands off the wheel and eyes on the road. Voice-
controlled infotainment allows for control over audio playback and
Numerous studies on AVSR indicate that incorporating modalities climate settings, often by simply requesting the desired music or tem-
fusion strategy can greatly improve robustness, especially in challeng- perature setting (Alvarez, Martin, Dunbar, Taiber, Wilson, & Gilbert,
ing acoustic environments. Several pioneering AV fusion strategies 2010; Biswas, Sahu, & Chandra, 2015; Zhang, Pinto, Plahl, Schuller, &
are discussed in Katsaggelos, Bahaadini, and Molina (2015), primarily Willett, 2014). Other SR systems are virtual assistants that can perform
categorized into feature fusion and decision fusion techniques. Feature a wide range of tasks, such as sending messages, setting reminders, and
fusion emerges as the prevailing method in AVSR (Chung, Senior, managing other ‘‘to-do’’ functions. They are also appearing in systems
Vinyals, & Zisserman, 2017; Petridis et al., 2018). However, simple that allow adjusting vehicle settings, for example changing the seat po-
feature concatenation fails to explicitly address the reliability of both sition or color of interior lighting (Braun, Mainz, Chadowitz, Pfleging,
signals. Additionally, Chung et al. (2017) notes that the system tends & Alt, 2019; Loh, Boey, & Hong, 2017; Mehler et al., 2015). Overall, SR
to overly depend on the audio modality, prompting the application of aims to provide a natural and intuitive way to interact with a vehicle,
a regularization technique during training, where one of the signal is reducing distractions and contributing to comfort, convenience and
randomly dropped out. safety while driving.
In Ma, Petridis, and Pantic (2021), introduced an end-to-end dual- In the specific context of a car environment, the goal is to im-
encoder hybrid CTC/Attention (Kim et al., 2017) system for AVSR. This prove SR accuracy and reliability. Engine noise, road conditions, and
system comprises a visual encoder based on ResNet (He, Zhang, Ren, & passenger conversations make the automotive environment noisy and
Sun, 2015), an audio encoder based on Conformer (Gulati et al., 2020), dynamic, which challenges effective interaction between the assis-
and a multi-layer perception (MLP) module for integrating features tive system and the driver (in-car communication). As a solution,
from different modalities. researchers develop cue-based audio and visual systems that com-
In contrast to the MLP-based fusion method, (Sterpu, Saam, & prehend spoken instructions with precision and adjust to changing
Harte, 2018) pioneered an attention-based fusion one, demonstrating conditions. The ultimate aim of integrating AVSR into modern cars is
that aligning features from different modalities enhances learned rep- to enhance the driving experience and promote road safety by reducing
resentations. Subsequently, numerous studies have embraced a cross- distractions and enabling drivers to interact with their vehicles more
attention module along with Transformers to capture inherent align- naturally and effortlessly. This could potentially lead to AVSR becoming
ments and complementary information within fully encoded AV repre- a standard feature in cars of the next generation and represent a
sentations (Sterpu et al., 2018; Wu, Li, Yang, Wu, & Qian, 2021). Some significant leap forward in automotive technology.
opt to directly concatenate raw speech and video sequences and utilize The methods used in the driver’s assistive systems is summarized in
a shared encoder with self-attention mechanisms to learn modality Table 1. Compared to traditional AVSRs, there has been little research
alignments (Hong, Kim, Choi, & Ro, 2023; Li, Li, Wu, & Qian, 2023). done on the SR systems for a vehicle, without applying transformer-
In Cheng, Wang, Wang, and Li (2022), Li, Zhou, Wang, Hong, and Li based methods in them.
(2023) researchers leveraged hidden features from various layers of In Tamazin et al. (2019), the researchers introduced a modified
audio and visual encoders to achieve more effective fusion, assuming method for calculating Power normalized cepstral coefficients (PNCC),

3
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

Table 2
Comparison of multimodal corpora. (En)glish, (Ru)ssian, (Ch)inese (include Mandarin), (Can)tonese, (Ko)rean, (Tu)rkish, (Fa)rsi. Online includes video clips of TV shows, video
content on YouTube, and so on. Sent. refers to Sentences, Cmd. to Commands.
Name # Spk. Voc. # Utt. Dur., h Resolution FPS View Audio, Env. Lang. Content Measure Result, %
kHz
LRW (Chung & 256 × 256
1K+ 500 539K 173 25 Front 16 Online En Word Acc. 65.40
Zisserman, 2016) 128 × 128
LSVSR (Shillingford 1K+ 127K 2.935M 3.9K 128 × 128 25 Front – Online En Sent. WER 40.90
et al., 2019)
LRW-1000 (Yang 2K+ 1K 718K 508 Different 25 Multi 16 Online Ch Word Acc. 38.19
et al., 2019)
LRWR (Egorov, 135 235 117.5K 350 112 × 112 25 Multi – Online Ru Word Acc. 61.10
Kostyumov, Konyk,
& Kolesnikov, 2021)
RUSAVIC (Ivanko, 20 62 12.4K 7 1920 × 1080 60 Front 48 Car Ru Cmd. Acc. 87.26
Axyonov, et al.,
2022)
CI-AVSR (Dai et al., 30 200 4.99K 8.3 640 × 480 25 Front 16 Indoor Can Cmd. CER 3.48
2022)
VLRDT (Berkol, – 10 2.33K 1 Different 30 Multi – Online Tu Word Acc. 61.70
Tümer-Sivri,
Pervan-Akman,
Çolak, & Erdem,
2023)
OLKAVS (Park 1.1K – 2.5M 5.75K 1920 × 1080 30 Multi 48 Lab Ko Sent. CER 5.64
et al., 2023)
Arman-AV 1.76K – 89K 220 224 × 224 25 Multi – Online Fa Sent. CER 13.08
(Peymanfard et al.,
2023)
MuAViC (Anwar 8K+ – – 1.2K Different 25 Multi 16 Online Multi Sent. WER 2.30
et al., 2023)
CN-CVS (Chen, 2.7K – 206.2K 308 – 25 Multi 16 Online Ch Sent. – –
Wang, & Zheng,
2023)
MAVD (Wang, Huo, 1920 × 1080 RGB
64 – 12.5K 24 30 Front 16 Lab Ch Sent. CER 8.78
et al., 2023) 640 × 576 Depth

to improve SR in noisy environments, particularly in those associ- In Chuang et al. (2022) the authors describe a novel system aimed
ated with cars. The authors compared PNCC with traditional methods. at enhancing AV speech under challenging conditions. They use a
Their experiments showed that in free-noise conditions, all methods three-stage approach, involving data pre-processing, a convolutional
performed comparably. However, when exposed to car noise at signal- recurrent neural network (CRNN)-based audio–visual speech enhance-
to-noise ratio (SNR) 0 dB, the modified PNCC outperformed such ment (AVSE) stage, and data reconstruction, and address real-world
traditional methods as Mel-frequency cepstral coefficients (MFCC) and problems such as the computational cost of processing visual data,
standard PNCC. AV asynchrony, and low-quality visual data. In particular, under chal-
In Adeel et al. (2020), the authors have introduced a speech en- lenging conditions, the model achieves a remarkable accuracy across
hancement framework, particularly relevant to the context of in-car different scenarios. The results underscore the effectiveness of their
communication. They leveraged DNNs, including LSTM and contextual- AVSR method, making it a valuable contribution to the field.
LSTM, to process and adapt both audio and visual features in varying In the context of assisting individuals with hearing impairments,
automotive noise scenarios. The novelty is the development of the an AVSR system focused on accommodating the challenging in-car
Enhanced Visually-derived Wiener Filter (EVWF) that significantly im- environment was developed (Kumar et al., 2022). This system leverages
proves the speech quality in car noise. The work was based on data both audio and visual modalities to ensure effective communication.
combining clean grid videos with automotive noises, which makes this In general, driver SR systems are gaining more popularity. They are
system highly relevant to in-car communication applications. designed to enhance the accuracy and robustness of SR for common
The authors introduced a multi-angle VSR system with three com- driver requests in real-world driving scenarios. Given the inherent
ponents: view classification, feature extraction, and recognition (Isobe challenges of SR in driving conditions, including ambient noise, head
et al., 2021). They used a 2-Dimensional (2D)CNN for view clas- movements, varying poses, distance from recording devices, and light-
sification and created 3DCNN models for feature extraction, taking ing conditions, the researchers leverage AV information to improve
into account the importance of various angles, especially in dynamic recognition results. Despite of the variety of methods used, it is worth
environments such as cars, where drivers constantly turn their heads. noting that transformers have not previously been used in such systems.
Recognition involved integrating features from angle-specific models In this work, we improve existing driver SR systems by using the SOTA
and using fully-connected layers. AVCRFormer methodology.
A SR system (Jaradat et al., 2022) has been developed to en- Due to DL techniques, including convolutional and long short-term
hance driver assistive applications by minimizing driver distraction. memory networks, AVSR has significantly advanced. These advance-
This system uses MFCC for extracting acoustic features and the SRILM ments have led to improved recognition accuracy and robustness, with
toolkit (http://www.speech.sri.com/projects/srilm/ accessed on 1 Dec. practical implications in noise-robust SR, accessibility for individuals
2023) for the Bi-gram language model analysis. The results indicate with hearing impairments. Recent studies have achieved the SOTA per-
that the DNN achieves a 5.17% WER for noise-free data and a 6.68% formance by addressing challenges related to modality variations and
WER for noisy data. cross-modal alignment. In the automotive context, the AVSR systems

4
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

Fig. 1. Pipeline of the AVCRFormer architecture.

aim to enhance SR accuracy and improve the overall driving experi- 3. Methodology
ence. The proposed techniques like EVWF and multi-angle visual SR
have demonstrated significant improvements in challenging conditions, Driver assistive systems demand ongoing interaction with the hu-
making them valuable for reducing distractions and enhancing road man driver. However, while on the road, the driver is unable to use
safety. Therefore, AVSR promises various practical applications and their hands to interact with the system. Besides, vehicle environment
continues to evolve with ongoing research and development efforts. has different kind of noise: sound noise and visual noise due to light-
ing conditions. We proposed an AVSR method based on regulated
2.4. Audio–visual speech corpora transformers as well as spatio–temporal fusion strategy to integrate
audio and video modalities results. The method allows distributing the
AV data processing has become one of the fastest-evolving research model’s attention depending on the informativeness of the modalities,
areas within artificial intelligence (AI) and CV. The availability of high thus partially solving the problem of noisy audio or video signals. In
quality corpora for the study of SR and AV signals has become a this section, we present the overall architecture of our AVCRFormer
major factor in this development. This section reviews the most recent and then discuss each component separately. Finally, we outline the
corpora designed for AVSR, presented in various languages, including training aim and inference strategy for AVCRFormer in the context of
multi-lingual corpora. The research also includes the baselines achieved the intelligent driver assistant.
by researchers. Table 2 shows the distinctive characteristics of recent
corpora. 3.1. AVCRFormer method overview
Based on the conducted analysis, we chose two AV corpora for
model training and experimental evaluation. In the context of this In line with the prevailing trends in AVSR methodology, we have
research, the selection of the Russian Audio–Visual Speech in Cars opted to employ the Transformer architecture in our proposed method.
(RUSAVIC) corpus (Ivanko, Axyonov, et al., 2022) is of great im- Fig. 1 offers a pipeline of the AVCRFormer architecture, encompassing
portance. Distinguished by its recordings made in authentic driving four primary modules:
scenarios, RUSAVIC proves to be a versatile and carefully controlled
1. Feature extraction;
corpus, offering a spectrum of SNRs influenced by dynamic elements
2. Spatio–temporal fusion;
such as window opening/closing and car motion. This corpus matches
3. Model initialization;
the focus of the research, encompassing driver commands tailored
4. Iterative model refinement.
to control car navigation and multimedia systems. The inclusion of
RUSAVIC in the experiments enhances the relevance and authenticity The foundation of the AVCRFormer method lies in obtaining a
of the research and makes a significant contribution to the advance- representation of the mapping function 𝑓 ∶ 𝑿 𝑨 , 𝑿 𝑽 → 𝑷 for each AV
ment of the AVSR research in the specific context of the automotive recording of drivers in cars. Where 𝑿 𝑨 and 𝑿 𝑽 refer to audio and video
{ }𝑇
environment. input signals, 𝑷 to class prediction. The audio 𝑿 𝑨 = 𝒙𝐴 𝑡 𝑡=1 and video
{ } 𝑇
At the same time, we also consider the LRW corpus (Chung & Zisser- 𝑿 𝑽 = 𝒙𝑉𝑡 𝑡=1 signals are considered as input within AVCRFormer
man, 2016) for our experiments. Renowned for its extensive usage in where 𝑇 refers to the audio spectrograms or the video frames. Then,
the AVSR domain, LRW serves as a benchmark corpus in our research. two different feature extractors are used. The {
first }extractor transforms
𝑇
While various corpora are used across a broader AVSR context, we opt visual input 𝑿 𝑽 into feature matrix 𝑭 = 𝒇 𝑉𝑡 𝑡=1 , while the sec-
for RUSAVIC and LRW due to their unique characteristics that align 𝑨
ond extractor transforms acoustic input 𝑿 into feature matrix 𝑺 =
{ 𝐴 }𝑇
well with the specific goals and evaluation criteria of our research. 𝒇 𝑡 𝑡=1 . These sequential feature matrices together form the spatio–
The comparison of corpora reveals significant differences, not only temporal features (STF). After the feature extraction module, the STF
in recording characteristics such as the number of speakers, duration, pass through a multimodal fusion module based on the concatenation
and resolution, but also in the languages and tasks they are intended of features along the temporal axis. The output of the module is
for. It is worth noting that corpora are primarily collected for contin- matrix 𝑶. The model initialization module then uses the 1 encoder
uous SR tasks, labeled as ‘‘Sent.’’ in Table 2. However, these corpora and 1 decoder to generate initial data representation 𝑬 0 and initial
may not be the optimal choice for driver assistive systems. The table recognition 𝒀 0 from matrix 𝑶.
includes other corpora, labeled ‘‘Word’’, which are collected for single To bridge the divide between acoustic and visual signals and con-
word recognition tasks. These corpora may be more appropriate for textual representation, we enhance the initial process of data represen-
use in driver assistive systems, but they may contain excessive vocab- tation and recognition by incorporating an iterative model refinement
ulary. While the corpora are labeled as ‘‘Cmd.’’, they are specifically module with the multiple feature representations and the classifier en-
developed for driver assistive systems as they are collected based on semble. Unlike the previous module, where the encoder only analyzes
frequently encountered driver requests. As mentioned earlier, for our matrix 𝑶 and calculates a new matrix based on the self-attention, the
research, we have chosen the RUSAVIC and LRW (Chung & Zisserman, encoder of this module analyzes the cross-attention of two matrices: the
2016) corpora. matrix 𝑶 and the matrix obtained at the previous step.

5
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

The iterative model refinement module, involving encoder 2 and a have performed better on the LRW corpus. In addition, our pre-trained
decoder 2 , takes initial data representation 𝑬 0 as input and generates model that combines ResNet18 and BiLSTM achieves the highest per-
data representation 𝑬 𝑁 and probability prediction 𝒀 𝑁 for the current formance for automatic lip-reading tasks. It is supported by recent
step by integrating it with initial AV representation 𝑶. After the 𝑁 studies (https://paperswithcode.com/sota/lipreading-on-lip-reading-
iterations of refinement, we obtain a sequence of data representations in-the-wild accessed on 28 Mar. 2024) in which ResNet18 model is con-
𝑬 = {𝑬 0 , 𝑬 1 , … , 𝑬 𝑁 } and their corresponding probability prediction sistently ranked among the top models for lip-reading tasks due to its
sequence 𝒀 = {𝒀 0 , 𝒀 1 , … , 𝒀 𝑁 }, therefore, 𝑬 → 𝒀 . The final class pre- competitive recognition accuracy and computational efficiency. Since
diction (𝑷 ) is computed using the argmax operation with the elements the model was trained on the LRW corpus for AVSR, it allows extracting
of the average of all probability prediction vectors in 𝒀 : features for 29 frames. If a video contains fewer than 29 frames, all
frames are duplicated to meet the required number. If there are more
1 ∑
𝑁−1
𝑷 = 𝑎𝑟𝑔𝑚𝑎𝑥( 𝒀 ), (1) than 29 frames, the video is segmented into 29-frame segments. Then,
𝑁 𝑛=0 𝑛 512 deep features are extracted for each frame. Processing video frames
where, 𝑁 represents the total number of refinement iterations. The through this extractor yields STF (𝑭 ), as illustrated by the following
averaging mechanism helps enhance the stability and robustness of the Eq. (3):
( { }𝑇 )
class prediction based on the refined information from all iterations.
𝑭 = VideoFE 𝒙𝑉𝑡 𝑡=1 . (3)
Hence, the feature extraction module extracts and combines fea-
tures from audio and video inputs. The model initialization module is The VideoFE feature extractor accepts segmented frames as in-
responsible for the initial representation (aggregation) of AV features puts and extracts features for each segment. The sequence of features
and the initial recognition. The iterative model refinement module (at obtained from a segmented frames forms STF (𝑭 ). The particular
each 𝑁 iteration) takes the combined matrix of AV features and the feature extractor is contingent on the task demands; in our research,
feature representation obtained in the previous iteration as an input it involves analyzing drivers’ lip regions during their visual utterance
and produces a refined feature representation and recognition as an of words while driving. The lip regions are extracted with MediaPipe,
output. Due to its iteration, the feature representation is constantly a cross-platform open-source environment (https://google.github.io/
updated and new (most informative) features are searched for in the mediapipe/ accessed on 1 Dec. 2023). Therefore, the feature extractor
original matrix of AV features. Based on the informativeness of the is essential in converting raw video frames into a compressed represen-
features, the recognition changes at each iteration, so that the final class tation, making the foundation for further analysis within the context of
prediction is based on the classifier ensemble. Unlike the AVSR method our proposed AVCRFormer architecture.
in (Chen, Hu, et al., 2023), which employs multiple models to create a
classifier ensemble, our method uses a single holistic model. 3.3. Spatio–temporal fusion strategy
The interactive model refinement module with multiple feature
representations and the classifier ensemble allows implementing a reg- The proposed strategy allows integrating audio and video modalities
ulated transformer. The attention visualization of features presented in at the feature level for AVCR. Within the strategy, we implement audio
Section 4.8 demonstrates the effectiveness of the proposed regulated and video signal pre-processing and signal segmentation for feature
transformer. extraction. Then, we propose to fuse audio features with visual ones by
the temporal axis and use the regulated transformer model to decode
3.2. Feature extraction multimodal speech. The use of modality fusion by the temporal axis
allows solving the problem of synchronizing audio and video data, and
3.2.1. Audio feature extraction also enables the transformer to analyze both modalities in one pass,
The audio feature extractor converts a series of audio spectrograms regulating their feature informativeness.
𝐴
𝑿 𝑨 ∈ R𝑆 ∈ R𝑇 ×𝐵×𝐷 , where 𝑆 𝐴 refers to the audio samples, 𝑇 to To construct a holistic model that includes both audio and visual
the number of spectrograms, 𝐵 to Mel-frequency and 𝐷 to the time features, we combine the outputs of the two feature extractors. The
duration, into spectrogram features 𝑺 ∈ R𝑇 ×𝐶 , with 𝐶 representing the 𝑺 feature matrix, from AudioFE, operates on audio spectrograms
{ 𝐴 }𝑇
feature size. The pre-trained model proposed in Ryumin et al. (2023) 𝒙𝑡 𝑡=1 to extract STF. Simultaneously, the 𝑭 feature matrix, derived
is used as a feature extractor. The model based on ResNet18 extracts { }𝑇
from VideoFE, transforms video frames 𝒙𝑉𝑡 𝑡=1 into STF.
512 deep features for each spectrogram calculated by the method Unlike existing AVSR methods that involve a concatenation of the
proposed in Ryumin et al. (2023). The spectrograms from the audio feature vectors and their subsequent aggregation using NN models (Ma
signal are facilitated using Librosa (https://librosa.org/ accessed on 1 et al., 2023; Pan et al., 2022; Ryumin et al., 2023; Shi et al., 2022)
Dec. 2023) (McFee et al., 2015). The processing of audio signal through or a cross-modal attention of the features from different modalities
this extractor results in STF (𝑺), as expressed by the following equation: (Axyonov et al., 2024; Dai et al., 2023), we make STF matrices uniform
( ({ }𝑇 ))
𝑺 = 𝐴𝑢𝑑𝑖𝑜𝐹 𝐸 𝑆𝑝𝑒𝑐𝐸 𝒙𝐴 . (2) in size and combine them along the temporal axis. Such fusion allows
𝑡 𝑡=1
preserving the contextual information of both modalities and achieving
The AudioFE feature extractor accepts segmented spectrograms their synchronization.
as inputs and extracts features for each segment. The sequence of As mentioned above, the visual feature matrix has a size of 29 × 512,
features obtained from a series of spectrograms forms STF (𝑺). In our and in practice, the number of frames in a video is not fixed. We
research, the feature extractor is applied to enhance the analysis of segment the video into 29 frames and extract features for all segments.
audio features, contributing to a more comprehensive understanding To obtain the number of spectrograms equal to the number of frames
of the AV context in the proposed AVCRFormer architecture. in the video, we segment the audio using a fixed step. Thus, feature
matrices from both modalities are obtained in the same size, containing
3.2.2. Visual feature extraction STF information. Subsequently, the features are concatenated along the
The visual feature extractor transforms sequence of video frames temporal axis and input to the model initialization encoder, 1 , and the
𝑿 𝑽 ∈ R𝑇 ×𝐻×𝑊 ×3 , where 𝑇 refers to the number of frames, 𝑊 and iterative model refinement encoder, 2 .
𝐻 to width and height, into visual features 𝑭 ∈ R𝑇 ×𝐶 , with 𝐶 This integration enables the model to fuse acoustic and visual
representing the feature size. The pre-trained model proposed in Ryu- information, providing a holistic representation capturing STF of audio
min et al. (2023) is used as a feature extractor. We use our model and video data. This method enhances the model’s ability to compre-
because there are no pre-trained models with open-source code that hend multimodal input, facilitating comprehensive feature extraction

6
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

for the proposed AVCRFormer architecture. To achieve a synergistic R2𝑇 ×𝐶 , and decoder 1 reconstructs this representation, yielding initial
representation, we introduce a SpatioTemporalFusion strategy, probability prediction 𝒀 0 . This process forms the initial step in the
denoted as 𝑶: AVCRFormer method.
Encoder 1 is composed of several layers (see Fig. 2), where 𝑬 0
𝑶 = SpatioTemporalFusion (𝑺, 𝑭 ) , (4) denotes the output. The hidden state is computed as:
where, 𝑺 and 𝑭 represent audio and video STF, respectively. ̃ 0 =𝛼 ⋅ 𝑶 + (1 − 𝛼) ⋅ attn𝑠 (𝑶) ,
𝑬 (5)
The SpatioTemporalFusion strategy merges the 𝑺 and 𝑭 fea-
tures according to Algorithm 1. The input signals are first segmented where attn𝑠 denotes the self-attention layer in Transformer; 𝛼 ∈ [0, 1]
into segment with fixed steps (𝑠𝑡𝑒𝑝𝐴 and 𝑠𝑡𝑒𝑝𝑉 ). The number of seg- is a gated mechanism hyperparameter that weights the importance
ments depends on the duration of the signal. Then, features are ex- of previous prototypes during training and inference. To further fuse
and refine the prototype, 𝑬̃ 0 is linked with a fully-connected layers
tracted using AudioFE and VideoFE extractors for each audio and video
FCL(⋅) using the residual connection. The final output of encoder 1
segment. The resulting features are recorded in the corresponding 𝑺
is formulated as:
and 𝑭 matrices. If the size of the resulting matrices is less than required ( ( ( )) ( ))
size 𝑻 𝑹 , padding is applied based on the mean feature values of the ma- 𝑬 0 = FCL FCL LN 𝑬 ̃0 + LN 𝑬 ̃0 , (6)
trices. Finally, the matrices are concatenated along the temporal axis.
where LN(⋅) is the layer normalization operation. Probability prediction
The resulting fused representation, 𝑶, captures combined information
vector 𝒀 0 is obtained by feeding the output of encoder 1 into the input
from both modalities and improves the overall understanding of the AV
of decoder 1 (see Fig. 2). The probability prediction vector in the
context.
recognizer is computed as shown in Eq. (7):
( )
Algorithm 1 SpatioTemporalFusion 𝒉0 = FCL 𝑬 0 , (7)

1: procedure SpatioTemporalFusion(𝑋 𝐴 ,𝑋 𝑉 ,𝑇 𝑅 ) ( ) ( )
𝒀 0 = FCL 𝒉0 = sof tmax 𝒉0 ⋅ 𝑾 𝟎 , (8)
2: 𝑆 𝐴 ← shape(𝑋 𝐴 ) ⊳ 𝑆 𝐴 is the audio samples
3: 𝑇 , _, _, _ ← shape(𝑋 𝑉 ) ⊳ 𝑇 is the video frames where, 𝒉0 ∈ R𝐻 represents the output of the final layer, 𝐻 is a
4: 𝑭 ← zeros(𝑇 , 𝐶) ⊳ 𝑭 is the video STF hyperparameter customized during training and used in inference, and
5: 𝑺 ← zeros(𝑇 , 𝐶) ⊳ 𝑺 is the audio STF 𝑾 ∈ R𝐻×𝑀 denotes a linear mapping to project hidden state 𝒉0 into
6: 𝑠𝑡𝑒𝑝𝐴 ← 𝑆 𝐴 𝐝𝐢𝐯 𝑇 ⊳ 𝐶 is the number of features the probability prediction for 𝑀 classes. The probability is obtained by
applying the sof tmax(⋅) function.
7: 𝑠𝑡𝑒𝑝𝑉 ← 29
Thus, through the self-attention layer, residual connections, and
8: AudioFeatureExtractor (AudioFE):
fully-connected layers, the model initialization module forms an initial
9: for 𝑡, 𝑠𝑡𝑎𝑟𝑡 from enumerate(0 to 𝑆 𝐴 − 1 step 𝑠𝑡𝑒𝑝𝐴 ) do
representation of features with adapted importance of input sequence
10: 𝑒𝑛𝑑 ← 𝑠𝑡𝑎𝑟𝑡 + 𝑠𝑡𝑒𝑝𝐴 − 1
elements, leading to initial recognition.
11: if 𝑒𝑛𝑑 ≥ 𝑆 𝐴 then
12: 𝑒𝑛𝑑 ← 𝑆 𝐴 − 1 3.5. Iterative model refinement
13: end if
14: 𝑋𝑡𝐴 ← SpecE(𝑋 𝐴 [𝑠𝑡𝑎𝑟𝑡 ∶ 𝑒𝑛𝑑]) ⊳𝑡∈𝑇 Once the initial prototype, denoted as 𝑬 0 , is obtained through
15: 𝑺[𝑡, ...] ← AudioFE(𝑋𝑡𝐴 ) the model initialization module, it is combined with the initial AV
16: end for representation, 𝑶, and fed into the iterative model refinement module.
17: VideoFeatureExtractor (VideoFE): The iterative model refinement module consists of two intertwined sub-
18: for 𝑠𝑡𝑎𝑟𝑡 from 0 to 𝑇 − 1 step 𝑠𝑡𝑒𝑝𝑉 do processes: iterative prototype aggregation within the 2 encoder and
19: 𝑒𝑛𝑑 ← 𝑠𝑡𝑎𝑟𝑡 + 𝑠𝑡𝑒𝑝𝑉 − 1 recognition generation within the 2 decoder.
20: if 𝑒𝑛𝑑 ≥ 𝑇 then Encoder 2 is composed of several layers (see Fig. 2), where 𝑬 𝑛
21: 𝑒𝑛𝑑 ← 𝑇 − 1 denotes the output of the 𝑛th iteration. The hidden state is computed
22: end if as:
( )
23: 𝑭 [𝑠𝑡𝑎𝑟𝑡 ∶ 𝑒𝑛𝑑, ...] ← VideoFE(𝑋 𝑉 [𝑠𝑡𝑎𝑟𝑡 ∶ 𝑒𝑛𝑑]) ̃ 𝑛 = 𝛼 ⋅ attn𝑠 (𝑶) + (1 − 𝛼) ⋅ attn𝑐 𝑬 𝑛 , 𝑶 ,
𝑬 (9)
24: end for
25: PaddingMatrices: where attn𝑐 denotes the cross-attention layer used in Transformer. To
̃ 𝑛 is linked with fully-connected
further fuse and refine the prototype, 𝑬
26: 𝑺 ← padding(𝑺, need_height = 𝑇 𝑅 , mode = 𝑚𝑒𝑎𝑛)
layers FCL(⋅) using the residual connection. The final output of encoder
27: 𝑭 ← padding(𝑭 , need_height = 𝑇 𝑅 , mode = 𝑚𝑒𝑎𝑛)
2 at the 𝑛th iteration is formulated as:
28: ConcatenateMatrices:
[ ] ( ( ( )) ( ))
𝑆𝑇 𝑅 ×𝐶 𝑬 𝑛 = FCL FCL LN 𝑬 ̃ 𝑛 + 𝑶 + LN 𝑬 ̃𝑛 + 𝑶 , (10)
29: 𝑶←
𝐹𝑇 𝑅 ×𝐶
30: return 𝑶 Simultaneously, refined prototype 𝑬 𝑛 is used in the decoder 2 (see
31: end procedure Fig. 2) to generate probability prediction 𝒀 𝑛 . Probability prediction 𝒀 𝑛
at the 𝑛th iteration is calculated as 𝒀 0 :
( )
3.4. Model initialization 𝒉𝑛 = FCL 𝑬 𝑛 , (11)
( ) ( )
Following the concatenation of audio and visual features, the com- 𝒀 𝑛 = FCL 𝒉𝑛 = sof tmax 𝒉𝑛 ⋅ 𝑾 𝒏 , (12)
bined feature matrix is processed in the model initialization module here, 𝒉𝑛 ∈ R𝐻 represents the output of the final layer.
of AVCRFormer, which encompasses an encoder–decoder architecture. The iterative model refinement module, unlike the model initial-
The encoder, denoted as 1 , uses the concatenated features to create an ization module, includes an additional layer of cross-modal attention.
encoded representation, while the decoder, denoted as 1 , reconstructs The module encoders take the original feature representation and the
the encoded information. This initialization step is crucial to establish previous feature representation as input. This allows re-adapting the
the model’s foundation. most informative features from the original feature representation and
Matrix 𝑶 ∈ R2𝑇 ×𝐶 represents the concatenated AV features, where taking into account the importance of the features highlighted in the
2𝑇 is the number of feature vectors, and 𝐶 is the number of features. En- previous step. This process is repeated the N iterations, each time
{ }2𝑇
coder 1 processes 𝑶 to obtain encoded representation 𝑬 0 = 𝒆0𝑡 𝑡=1 ∈ obtaining an update feature representation and a new recognition.

7
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

Fig. 2. Architectures of the regulated transformer components in AVCRFormer. 1 and 2 refer to encoders, 1 and 2 to decoders.

3.6. Loss function Table 3


RUSAVIC corpus characteristics.
Set # Classes # Samples for Each Class # Frames
Our method generates the recognition sequence based on the condi-
Train 140
tional probability provided by decoders 1 and 2 . The cross-entropy Val 62 (utterances) 30 [21, 172]
loss (CE-loss) function is used to measure the difference between the Test 30
ground truth labels and the decoder outputs for all training samples.
The primary aim while training is to maximize the log-likelihood, Table 4
equivalent to minimizing the CE-loss, defined as: LRW corpus characteristics.
Set # Classes # Samples for Each Class # Frames
𝐿𝐶𝐸,𝑛 = − log 𝑌𝑛 (𝒀̂ ∣ 𝑶), (13)
Train 800–1000
Val 500 (words) 50 29
where 𝒀̂ represents the ground truth annotation, 𝑌𝑛 is each probability Test 50
prediction. This loss is applied to the output of the model initialization
module and the 𝑁th output of the iterative model refinement module.
The overall loss function for our AV command classification system is:
4.2. Experimental setup
𝐿 = 𝐿𝐶𝐸,0 + 𝐿𝐶𝐸,𝑁 , (14)

where 𝐿𝐶𝐸,0 is CE-losses at the output of the model initialization During training, we meticulously fine-tuned several critical param-
module and 𝐿𝐶𝐸,𝑁 the 𝑁th output of the iterative model refinement eters, tailoring the architecture to the specific needs of our task. The
module. Initially, guided by the CE-loss, we train the initial generation primary optimized parameters:
module until convergence. Subsequently, the complete loss function is
• Encoder and Decoder Configurations: We varied the struc-
employed for E2E optimization of the system.
ture of both the encoder and decoder, tuning the number of
fully-connected layers in both and adjusting the units in the
4. Experimental results hidden fully-connected layer of the decoder. This was done to
find an optimal balance for feature representation and prediction
generation.
4.1. Experimental corpora
• Number of Iterations (𝑵): We explored the impact of varying
the number of iterations (𝑁) in the iterative model refinement
To evaluate the effectiveness of the proposed AVCRFormer, we to understand how many refinement steps were necessary for the
conducted extensive experiments on two different corpora: RUSAVIC model to converge to optimal performance.
(Ivanko, Axyonov, et al., 2022) and LRW (Chung & Zisserman, 2016). • Gated Mechanism (𝜶): The hyperparameter, 𝛼, controlling the
The primary aim of these experiments was to achieve the SOTA ac- weight between original prototypes and gated-based prototypes
curacy in AV utterance recognition for both corpora, with a special during training and inference, was systematically tuned to opti-
focus on the RUSAVIC corpus. This focus is due to the fact that the mize the contribution of previous iterations to the final predic-
RUSAVIC corpus contains essential commands relevant to the AV driver tion.
assistive system. Meanwhile, LRW served as a benchmark for a broader • Training Parameters: The choice of learning rate (LR), LR decay
comparison within the context of the overall SOTA. See Table 3 for the schedule, and optimizer were considered and fine-tuned to opti-
main characteristics of the RUSAVIC and Table 4 for LRW corpora. mize the convergence speed and overall performance. To identify
To comply with the experimental setup, we use data split simi- the most effective optimization strategy for our model, we ex-
lar to previous studies (Axyonov et al., 2024; Chung & Zisserman, plored popular optimizers, including Adam (Kingma & Ba, 2014),
2016). The number of frames (𝑇 ) varies in the RUSAVIC corpus (see AdamW (Loshchilov & Hutter, 2019), stochastic gradient descent
Table 3) because that both single- and multi-word commands are used (SGD) (Ruder, 2016), and the Lion optimizer (Chen, Liang, et al.,
in this corpus. We use the maximum number of frames 𝑇 𝑅 = 120, as 2023).
in Axyonov et al. (2024). In case the number of frames is less, we • Multi-Prediction Strategy vs. Single-Prediction Strategy: We
apply padding by mean feature values (see Algorithm 1). The fixed investigated whether generating multi-predictions at each itera-
step (𝑠𝑡𝑒𝑝𝐴 ) for spectrogram segmentation is set to 16 ms by 16 kHz tion provided any benefit compared to single-predictions. This
sample rate. For the LRW corpus, 𝑇 is constant and equals 𝑇 𝑅 = 29 involved assessing the trade-off between increased computation
and 𝑠𝑡𝑒𝑝𝐴 = 34 ms by 16 kHz sample rate. and potential improvement in model performance.

8
D. Ryumin et al. Expert Systems With Applications 252 (2024) 124159

Table 5
Impact of the Encoder and Decoder configurations on recognition accuracy (%) on the RUSAVIC and LRW
corpora.
Experiment Encoder Decoder Accuracy (%)
# FCLs # FCLs # Units RUSAVIC LRW
Val Test Val Test
Baseline 1 (Default) 1 1 128 85.38 85.22 86.63 85.72
Config 1 2 1 128 89.89 89.52 86.08 85.96
Config 2 3 1 128 88.12 88.71 86.89 86.31
Config 3 2 2 128 90.43 89.09 88.21 87.69
Config 4 2 3 128 89.14 88.39 88.30 88.65
Config 5 2 0 – 88.66 88.93 83.33 83.07
Config 6 2 1 256 90.91 90.22 88.47 87.93
Config 7 (Baseline 2) 2 1 512 90.27 90.75 88.73 88.61
Config 8 2 1 1024 89.84 90.11 89.31 89.52
Config 9 2 1 2048 88.23 87.42 89.01 88.86

This section details the experiments and provides insights into the Table 6
model’s training dynamics, convergence, and performance accuracy on Impact of the number of iterations (𝑁) on recognition accuracy (%) on the RUSAVIC
and LRW corpora.
these two corpora.
Experiment # Iterations (𝑵) Accuracy (%)
The following default parameters were set and used in all the
experiments: RUSAVIC LRW
Val Test Val Test
• A single fully-connected layer encoder. Baseline 2 (Default) 1 90.27 90.75 88.73 88.61
• A single layer decoder with 128 hidden units. Config 1 2 91.88 91.13 91.13 90.64
• A single model for iterative refinement. Config 2 3 92.53 91.72 91.59 90.92
Config 3 (Baseline 3) 4 92.85 92.10 91.63 91.08
• Without 𝛼 hyperparameter.
Config 4 5 92.04 91.88 91.32 90.75
• The Adam optimizer with 1 × 10−3 LR and constant LR over 100 Config 5 6 91.56 91.72 91.07 90.31
epochs.
• A single-prediction strategy.

4.3. Optimizing encoder and decoder configurations 4.4. Optimizing number of iterations

We systematically optimized the AVCRFormer method by adapting Here and below, we applied the configurations that worked best for
the architecture of its encoder and decoder components. Our aim the validation set of the RUSAVIC corpus as the new baseline.
was to identify the optimal configuration that improves feature rep- In Table 6, we analyzed the effect of varying the number of re-
resentation and sequence generation for improved accuracy in AV finement process iterations (𝑁), to identify the optimal number of
utterance recognition on both the RUSAVIC and LRW corpora. We refinement steps necessary to unleash the model’s potential.
iteratively varied the number of fully-connected layers and units, care- Table 6 shows that increasing the number of iterations enhanced
fully balancing model complexity and recognition accuracy. Previous the model’s recognition accuracy on both the RUSAVIC and LRW
studies (Ryumina, Dresvyanskiy, & Karpov, 2022) showed that effective corpora. This resulted in the attainment of maximal accuracy of 92.85%
fine-tuning of models varies depending on the experimental corpus. (Val) and 92.10% (Test) for RUSAVIC and 91.63% (Val) and 91.08%
Therefore, we do not use traditional grid search to find the optimal (Test) for LRW. Further iterations did not lead to substantial accuracy
combination of parameters. We varied only one parameter in each improvements consistently. The best accuracy for the Test set was
experiment to obtain more unbiased accuracy on two different corpora. achieved in ‘‘Config 3’’ using four 𝑁 iterations; this emphasizes the sig-
Table 5 provides a detailed representation of the recognition accuracy nificance of determining the suitable number of refinement iterations
for different configurations. First, we determined the optimal number to enhance the recognition capabilities of AVCRFormer.
of fully-connected layers in encoders. Next, we determined the optimal
number of fully-connected layers in decoders. Finally, we determined 4.5. Optimizing gated mechanism
the optimal number of hidden units in the fully-connected layer of the
decoders. As the next step of fine-tuning the model, we considered the in-
The experiments revealed that increasing the complexity of the fluence of the gated mechanism hyperparameter (𝛼) on model per-
encoder and decoder components generally improved accuracy, as formance. The 𝛼 hyperparameter determines the weight distribution
shown in Table 5. In particular, the search for the best configuration, between the original and the adapted prototypes during both train-
exemplified by ‘‘Config 7’’, achieved the highest accuracy at 90.75% on ing and inference. The detailed results of recognition accuracy under
the RUSAVIC corpus test set. Additional complexity adjustments, such different gated mechanism configurations are presented in Table 7.
as the inclusion of 1024 units in ‘‘Config 8’’, resulted in only a slight The analysis of the gated mechanism hyperparameter (𝛼) on the
4.5. Optimizing gated mechanism

As the next step of fine-tuning the model, we considered the influence of the gated mechanism hyperparameter (α) on model performance. The α hyperparameter determines the weight distribution between the original and the adapted prototypes during both training and inference. The detailed results of recognition accuracy under different gated mechanism configurations are presented in Table 7.

The analysis of the gated mechanism hyperparameter (α) revealed distinct patterns in the performance of the AVCRFormer method. Modifying the value of α led to dissimilar recognition accuracy for the RUSAVIC and LRW corpora, as presented in Table 7.

"Config 4" provided optimal results for RUSAVIC with the hyperparameter α set to 0.5, resulting in a recognition accuracy of 94.30% (Val) and 92.90% (Test). Similarly, for the LRW corpus, "Config 2" with an α value of 0.3 achieved the highest recognition accuracy of 92.11% (Val) and 92.40% (Test). It is worth noting that any further increase or decrease of α reduced the accuracy, which highlights the significant role of the gated mechanism hyperparameter in AVCRFormer performance.
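As an illustration only, the gated weighting described above can be expressed as a convex combination of the two feature streams; the exact form used inside AVCRFormer may differ, and the tensor names below are assumptions.

```python
import torch

def gated_combination(original: torch.Tensor,
                      adapted: torch.Tensor,
                      alpha: float = 0.5) -> torch.Tensor:
    # alpha controls how much weight the adapted prototypes receive
    # relative to the original ones.
    return alpha * adapted + (1.0 - alpha) * original
```

Under this reading, α = 0.5 (the best RUSAVIC setting) weighs the original and adapted prototypes equally, while α = 0.3 (the best LRW setting) favors the original prototypes.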


Table 7
Impact of gated mechanism (α) on recognition accuracy (%) on the RUSAVIC and LRW corpora.

Experiment              Gated mechanism (α)   RUSAVIC Val   RUSAVIC Test   LRW Val   LRW Test
Baseline 3 (Default)    0.1                   92.85         92.10          91.63     91.08
Config 1                0.2                   93.50         92.31          92.49     91.56
Config 2                0.3                   93.87         92.53          92.11     92.40
Config 3                0.4                   94.14         92.63          92.57     92.26
Config 4                0.5                   94.30         92.90          91.61     91.33
Config 5 (Baseline 4)   0.6                   94.57         92.74          91.18     91.05
Config 6                0.7                   94.19         92.47          91.12     90.91
Config 7                0.8                   93.60         92.37          90.89     90.70
Config 8                0.9                   92.80         92.04          90.48     90.42
Table 8
Impact of optimizers and LRs on recognition accuracy (%) on the RUSAVIC and LRW corpora.

Experiment              Optimizer   LR          RUSAVIC Val   RUSAVIC Test   LRW Val   LRW Test
Baseline 4 (Default)    Adam        1 × 10^-3   94.57         92.74          91.18     91.05
Config 1                Adam        1 × 10^-4   95.65         94.14          93.37     93.16
Config 2                Adam        1 × 10^-5   95.22         93.98          93.04     92.97
Config 3                AdamW       1 × 10^-3   94.73         93.71          92.88     93.06
Config 4 (Baseline 5)   AdamW       1 × 10^-4   96.02         96.18          93.80     93.54
Config 5                AdamW       1 × 10^-5   95.54         95.75          93.29     93.08
Config 6                SGD         1 × 10^-3   94.46         93.23          91.84     91.72
Config 7                SGD         1 × 10^-4   81.67         76.40          70.46     69.99
Config 8                SGD         1 × 10^-5   67.42         61.18          49.38     48.57
Config 9                Lion        1 × 10^-3   95.05         94.14          93.74     93.85
Config 10               Lion        1 × 10^-4   95.86         95.48          96.09     96.60
Config 11               Lion        1 × 10^-5   95.65         95.27          94.61     94.64
4.6. Optimizing training parameters

After optimizing the model architecture, refinement iterations, and gated mechanism, we focused on fine-tuning the training parameters. This stage is aimed at determining the most effective combination of LRs, optimizers, and LR schedulers to improve the convergence speed of the model and the overall AVCRFormer performance.

4.6.1. Learning rates and optimizers

We compared popular optimizers: Adam (Kingma & Ba, 2014), AdamW (Loshchilov & Hutter, 2019), SGD (Ruder, 2016), and Lion (Chen, Liang, et al., 2023). We performed a systematic evaluation of three LRs for each optimizer: 1 × 10^-3, 1 × 10^-4, and 1 × 10^-5.

The results shown in Table 8 provide valuable insights into the impact of different optimizers and LRs on the recognition accuracy for both the RUSAVIC and LRW corpora. The configurations using the AdamW and Lion optimizers with an LR of 1 × 10^-4 showed superior performance.
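For reference, a minimal sketch of the best optimizer setting from Table 8 (AdamW with an LR of 1 × 10^-4) is given below; the model is a placeholder, and the Lion optimizer would have to come from a third-party implementation since it is not part of torch.optim.

```python
import torch

# Placeholder module standing in for AVCRFormer; dimensions are arbitrary.
model = torch.nn.Linear(512, 10)

# Best-performing configuration from Table 8: AdamW with LR = 1e-4.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Lion (Chen, Liang, et al., 2023) is not included in torch.optim; a third-party
# implementation exposing the same optimizer interface would be used instead.
```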
4.6.2. Learning rate schedulers

With the optimizer selected in the previous step, we proceeded to assess the various LR schedulers presented in Table 9. We considered the following schedulers (https://pytorch.org/docs/stable/optim.html, accessed on 1 Dec. 2023): StepLR, ExponentialLR, ReduceLROnPlateau, and CosineAnnealingWarmRestarts. Each scheduler was fine-tuned with different hyperparameter values.

With the best scheduler, our single-prediction strategy produced noteworthy recognition accuracy rates of 97.63% (Val) and 98.23% (Test) for the RUSAVIC corpus, and 96.09% (Val) and 96.60% (Test) for the LRW corpus. Below, we present a thorough comparative analysis of our method, involving both single- and multi-prediction strategies, against the SOTA methods.
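The best-performing schedule in Table 9 corresponds to the standard PyTorch StepLR scheduler; a minimal sketch of this setting (assuming the AdamW optimizer and the LR of 1 × 10^-4 selected above) is shown below.

```python
import torch

model = torch.nn.Linear(512, 10)  # placeholder module
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# "Config 1 (Best)" from Table 9: decay the LR by a factor of 10 every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):
    # ... one training epoch over the fused AV features would run here ...
    scheduler.step()
```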


Table 9
Impact of LR schedulers on recognition accuracy (%) on the RUSAVIC and LRW corpora.

Experiment             LR scheduler                                                   RUSAVIC Val   RUSAVIC Test   LRW Val   LRW Test
Baseline 5 (Default)   Constant                                                       96.02         96.18          93.80     93.54
Config 1 (Best)        StepLR (step_size = 20, gamma = 0.1)                           97.63         98.23          96.09     96.60
Config 2               ExponentialLR (gamma = 0.1)                                    96.18         96.56          95.53     95.84
Config 3               ReduceLROnPlateau (factor = 0.1)                               95.59         96.08          94.62     95.01
Config 4               CosineAnnealingWarmRestarts (T_0 = 10, eta_min = 1 × 10^-5)    96.83         97.26          95.51     96.03

4.7. Multi-prediction strategy vs. single-prediction strategy

The comparison of the single- and multi-prediction strategies with the SOTA results is shown in Table 10.

Table 10
Comparison of performance (accuracy, %) of the single-prediction strategy (SP) and the multi-prediction strategy (MP) with the SOTA methods. Results are reported for the RUSAVIC and LRW corpora. A, V, and AV refer to the audio, video, and audio–visual modalities.

Method                                                    RUSAVIC, V   LRW, V
Pan et al. (2022)                                         –            85.00
Zhang, Yang, Xiao, Shan, and Chen (2020)                  –            85.02
Martinez, Ma, Petridis, and Pantic (2020)                 –            85.30
Kim, Hong, Park, and Ro (2021)                            –            85.40
Feng, Yang, Shan, and Chen (2020)                         –            88.40
Ma, Martinez, Petridis, and Pantic (2021)                 –            88.50
Kim, Yeo, and Ro (2022)                                   –            88.50
Ivanko, Ryumin, et al. (2022)                             –            88.70
Koumparoulis and Potamianos (2022)                        –            89.52
Ma, Wang, Petridis, Shen, and Pantic (2022) (ensemble)    –            94.10
Axyonov, Ryumina, Ryumin, Ivanko, and Karpov (2023)       83.46        –

Method                                                    RUSAVIC, A/V/AV       LRW, A/V/AV
Petridis et al. (2018)                                    –                     97.70/82.00/98.00
Miao, Liu, and Yang (2020)                                –                     –/82.80/98.30
Ryumin et al. (2023)                                      –                     96.07/87.16/98.76
Axyonov et al. (2024)                                     95.61/83.46/98.65     –
Ours (with SP)                                            94.89/82.31/98.23     94.45/88.36/96.60
Ours (with MP)                                            95.75/84.03/98.87     97.25/89.57/98.81

Table 10 shows that, for both corpora, the multi-prediction strategy outperforms the single-prediction strategy by 0.64% to 2.21%, depending on the corpus used. We also compare the performance of the unimodal methods, obtained by setting the features of one modality to zero, with the unimodal SOTA methods. For the RUSAVIC corpus, the audio- and visual-based methods using the multi-prediction strategy outperform the methods presented in Axyonov et al. (2024). For the LRW corpus, our audio-based method is inferior to the method of Petridis et al. (2018), while the visual-based method is inferior to the method of Ma et al. (2022), which uses an ensemble of models. At the same time, for the RUSAVIC corpus, an accuracy of 98.87% was achieved, outperforming the SOTA method by 0.22%. For the LRW corpus, an accuracy of 98.81% was achieved, outperforming the SOTA method by 0.05%. These results allow us to conclude that increasing the computational cost of the proposed model improves the accuracy of word and utterance recognition. Moreover, making several predictions for each instance of AV data at different iterations of the model results in a more robust AVCRFormer method than relying on the single prediction obtained at the final iteration. The multi-prediction strategy can be compared to the AVSR ensemble method (Chen, Hu, et al., 2023); however, unlike our method, that method averages the probabilities of two separate models.

We also report the computational cost of our models. Our models take 148 MB and have 38 million parameters. Feature extraction lasts about 16 h. Training the regulated transformer takes approximately 48 min per epoch (or about 3 days for 100 epochs) using one GeForce RTX 3090 Ti GPU with 24 GB of memory. Our method operates in real time, with a processing real-time factor of 0.11 on a GeForce RTX 3090 Ti GPU and 0.29 on an Intel i9-11900KF CPU. At the same time, about 1/3 of the method's runtime involves processing audio signals. Times are provided for LRW in comparison with the SOTA methods. Other researchers did not report all the computational costs we mention here; however, we can note that a more powerful GPU, or more than one GPU, was used to train their models. For example, the authors of Pan et al. (2022) used four NVIDIA A100 GPUs with a total of 160 GB of memory to train their models for a week. Thus, we accelerated the model training process by pre-extracting features, and only one GPU was needed to train our model, unlike other methods (Kim et al., 2022; Pan et al., 2022; Zhang, Yang, et al., 2020).
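The multi-prediction strategy can be summarized as averaging the class probabilities produced at the different refinement iterations. The sketch below is illustrative only: it assumes that softmax probabilities are averaged before taking the argmax, and the variable names are our own.

```python
import torch

def multi_prediction(logits_per_iteration: list[torch.Tensor]) -> torch.Tensor:
    # logits_per_iteration: list of (batch, num_classes) tensors, one per decoder.
    probs = [torch.softmax(logits, dim=-1) for logits in logits_per_iteration]
    mean_probs = torch.stack(probs, dim=0).mean(dim=0)   # average over iterations
    return mean_probs.argmax(dim=-1)                     # predicted command IDs
```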

4.8. Attention visualization approach

We developed the following approach to transformer attention visualization to show how the different modalities (in our case, audio and video) affect the SR results. The maximum performance on both corpora is achieved when five encoders and decoders are used. Each encoder and decoder has its own feature representation of the input signal and a probability prediction based on it. To evaluate which features (audio or video) are more informative for determining the predicted class in each individual encoder, we constructed feature heatmaps for three samples from the RUSAVIC corpus. The results for the LRW corpus exhibited similar trends; hence, we concentrate our analysis on three specific samples from the RUSAVIC corpus.

As mentioned above, the general matrix O ∈ R^(2T×C) represents the combined AV feature matrix, where 2T is the number of timesteps (feature vectors) and C is the number of acoustic and visual features. To create heatmaps, we again divide the O matrix into two feature matrices (one for audio features, the other for video features). This division helps illustrate the importance of the features within each individual matrix more effectively. The resulting heatmaps are shown in Fig. 3.

Fig. 3. Attention visualization: heatmap examples for the RUSAVIC samples. Gray shows target classes and their predictions for each heatmap.
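A minimal sketch of how such heatmaps can be produced is given below: the combined matrix O is split back into its audio and video halves along the time axis (assuming the audio block precedes the video block, which is an assumption of this sketch) and each half is rendered as a heatmap.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_modality_heatmaps(O: np.ndarray) -> None:
    # O: combined AV feature matrix of shape (2T, C).
    T = O.shape[0] // 2
    audio_features, video_features = O[:T], O[T:]   # (T, C) each
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    for ax, matrix, title in zip(axes, (audio_features, video_features),
                                 ("Audio features", "Video features")):
        ax.imshow(matrix.T, aspect="auto", cmap="hot")  # brighter = more informative
        ax.set_title(title)
        ax.set_xlabel("Timestep")
        ax.set_ylabel("Feature index")
    plt.tight_layout()
    plt.show()
```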


The resulting visualization shows that the information content of the features in the five encoders differs depending on the input signal. In the first sample, the model pays more attention to audio features (more orange and bright red pixels) than to video ones, as the information content of the video modality is degraded because of lip region occlusion resulting from head rotation. In contrast, in the second sample, the model pays more attention to video features than to audio ones because of the noisy road environment, which makes the audio modality less distinguishable. Despite the loss of useful information in one of the modalities, the model still finds informative features in the weaker modality that influence the prediction. In the cases where both modalities are not distorted by various factors, the model pays equal attention to the features of both modalities. Therefore, each encoder attends to its own features regardless of the analyzed modality, so the use of multiple encoders allows the model to make predictions based on the informativeness of different features.

To visualize how the predictions change iteratively based on feature importance, we also show in gray the target classes and their predictions in Fig. 3. For all three input samples, the E_0 features of the model initialization lead to incorrect probability predictions, while the E_1–E_5 features of the iterative model refinement module result in more accurate predictions. In addition, the model based on the latest features still makes errors, but to a lesser extent. Despite the number of errors for the presented input samples, averaging the probability predictions leads to the correct predicted command IDs. The decoding of all command IDs in the RUSAVIC corpus is available at https://smil-spcras.github.io/AVCRFormer/ (accessed on Feb. 19, 2024).

Hence, if important information is lost in one encoder, it is taken into account in another encoder, and the prediction is based on different features. We have developed a model that is capable of: (i) compensating for the weaknesses of unimodal information, i.e., adapting to the informativeness of the modalities; (ii) making multiple predictions depending on the informativeness of the features in each encoder. The proposed multimodal architecture with multiple encoders and decoders provides stability and flexibility in processing input signals recorded under uncontrolled conditions.

5. AVCRFormer application for driver assistive systems

In this section, we introduce an AVCRFormer-based multimodal interaction application designed to aid vehicle drivers, enhance their driving experience, and ultimately increase overall road safety. This mobile application provides voice control functionalities to navigate and manage the multimedia systems of smartphones, potentially extending to vehicle features such as lighting and climate control.

It is designed to enhance the accuracy and robustness of SR for common driver requests in real-world driving scenarios. Given the inherent challenges of SR in driving conditions, including ambient noise, head movements, varying poses, distance from recording devices, and lighting conditions, we leverage multimodal information and improve upon existing recognition systems, e.g., Ivanko, Ryumin, et al. (2022), by using the SOTA AVCRFormer methodology presented in this article.

Fig. 4 illustrates the general architecture of the AVCRFormer application, which comprises two primary parts. The AVCRFormer part is responsible for processing the results of speech and audio–visual command recognition; the actions part provides execution of voice commands and AV feedback to the driver.

Fig. 4. AVCRFormer application for driver assistive systems (DAS).

The software application has been developed and trained using the RUSAVIC and LRW corpora, and rigorous testing has been conducted to evaluate its functionalities and user interface. The software application has been integrated into the Drive Safely assistive system (Ivanko, Kashevnik, et al., 2022), which provides voice control capabilities for smartphone navigation/multimedia systems. It allows possible extension to control vehicle systems, e.g., lighting conditions and climate control, via voice commands. Source code, the dataset, and trained models are freely available on request via https://mobiledrivesafely.com (accessed on Feb. 19, 2024).

AVSR has made significant strides in improving human–computer interaction, particularly in environments like driver assistive systems where traditional unimodal speech recognition systems may falter due to acoustic noises. However, the challenge of noise in real-world applications remains a substantial hurdle:

1. Variability and intensity of acoustic noises: In the context of driver assistive systems, the vehicle's interior can be fraught with various noise sources, including road noise, engine sounds, conversations, and the audio output from devices.
2. Poor lighting conditions and visual occlusions: While visual information can supplement audio cues, especially in noisy conditions, visual noise and challenges such as poor lighting conditions or occlusions (e.g., the driver's mouth being partially covered) degrade the quality of visual data.
3. Non-stationary noises: The noise encountered in driving environments is predominantly non-stationary, meaning its characteristics change over time.
4. Synchronization issues: The integration of audio and visual data requires precise synchronization to ensure that the temporal alignment between spoken words and visual lip movements is maintained.

To effectively address the challenges of noise in real-world applications of AVSR systems, particularly in driver assistive technologies, a multifaceted approach is essential. This involves deploying advanced noise suppression techniques to clean audio, improving spatio-temporal fusion methods for better integration of audio and visual data, developing robust visual feature extraction algorithms for poor lighting conditions and occlusions, and implementing cross-modal compensation strategies for system resilience by leveraging strong cues from one modality to compensate for weaknesses in another.

Traditional AVSR systems address some of the noise issues described above. However, such systems often rely on processing audio and visual signals separately and then combining them at a later stage without dynamic weighting of modality importance. These systems are difficult to adapt to real-world applications because they cannot deal with varying qualities of audio and video data. This occurs because the models have not been trained to adjust attention based on the informativeness of audio and video data through their simultaneous analysis.

The proposed regulated transformer model with a cross-attention mechanism dynamically adjusts the focus between audio and visual inputs based on their contextual relevance and quality. This is clearly illustrated in Fig. 3. For example, in noisy environments the attention shifts to the video modality, whereas under poor lighting conditions or high lip occlusion it shifts to the audio one. Our method outperforms the traditional ones by:

1. Enhancing contextual understanding: Leveraging the transformer's ability to model complex relationships within and across modalities, leading to a richer understanding of the speech context.
2. Dynamic modality weighting: Automatically giving precedence to the cleaner modality (audio or visual) in real time, thus maintaining high recognition accuracy even when one modality is compromised (e.g., noisy environments or poor visibility).


3. Iterative model refinement: Iteratively accounting for important features missed in previous iterations and making predictions at each step, instead of relying solely on potentially inaccurate predictions from the last iteration.

Therefore, the proposed method improves upon existing AVSR applications by introducing the regulated transformer, the fusion strategies of AV features, and the classifier ensemble that account for temporal and spatial dimensions, as well as enhanced transparency and adaptability through attention visualization. These innovations address critical challenges in AVSR, particularly for applications in noisy and visually complex environments like those encountered in driver assistive systems, paving the way for more accurate, robust, and user-friendly SR technologies.

6. Conclusions

In this article, we proposed a novel AVSR method aimed at enhancing SR within driver assistive systems, crucial for ensuring safe interaction while driving. The proposed method introduces a novel transformer model, AVCRFormer, tailored specifically for robust AVSR. Key innovations include a multimodal fusion strategy, a regulated transformer architecture, and a classifier ensemble strategy. By preserving contextual information and leveraging the iterative refinement of models, AVCRFormer demonstrates superior performance in speech command recognition accuracy. The implementation of a multi-prediction strategy demonstrates superior adaptability in diverse AV contexts. Achieving high accuracy rates of 98.87% and 98.81% on the RUSAVIC and LRW corpora, respectively, this research significantly contributes to advancing human–computer interaction and holds promise for broader applications in audio–visual processing and artificial intelligence.

Although our method has demonstrated proficiency in semi-controlled environments, its transition to real-world applications presents significant challenges, such as latency, computational efficiency, and scalability. Our future endeavors will prioritize optimizing AVCRFormer for seamless integration into practical scenarios, with a focus on addressing challenges specific to the automotive environment. In addition, we aim to deploy AVCRFormer for different acoustic environments, speech patterns, and languages.

CRediT authorship contribution statement

Dmitry Ryumin: Conceptualization, Methodology, Software, Validation, Investigation, Writing – original draft, Project administration. Alexandr Axyonov: Conceptualization, Methodology, Visualization, Writing – original draft. Elena Ryumina: Conceptualization, Methodology, Investigation, Visualization, Writing – original draft. Denis Ivanko: Conceptualization, Resources, Writing – original draft, Writing – review & editing. Alexey Kashevnik: Conceptualization, Methodology, Resources, Writing – original draft, Writing – review & editing. Alexey Karpov: Conceptualization, Supervision, Writing – review & editing, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

This work was supported by the Analytical Center for the Government of the Russian Federation (IGK 000000D730324P540002), agreement No. 70-2021-00141.

References

Adeel, A., Gogate, M., & Hussain, A. (2020). Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Information Fusion, 59, 163–170. http://dx.doi.org/10.1016/j.inffus.2019.08.008.
Afouras, T., Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2018). Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12), 8717–8727. http://dx.doi.org/10.1109/TPAMI.2018.2889052.
Afouras, T., Chung, J. S., & Zisserman, A. (2018). LRS3-TED: A large-scale dataset for visual speech recognition. (pp. 1–2). http://dx.doi.org/10.48550/arXiv.1809.00496, arXiv preprint arXiv:1809.00496.
Alvarez, I., Martin, A. M., Dunbar, J., Taiber, J. G., Wilson, D.-M., & Gilbert, J. E. (2010). Voice interfaced vehicle user help. In International conference on automotive user interfaces and interactive vehicular applications (pp. 42–49). http://dx.doi.org/10.1145/1969773.1969782.
Anwar, M., Shi, B., Goswami, V., Hsu, W.-N., Pino, J., & Wang, C. (2023). MuAViC: A multilingual audio-visual corpus for robust speech recognition and robust speech-to-text translation. (pp. 1–8). http://dx.doi.org/10.48550/arXiv.2303.00628, ArXiv.
Axyonov, A., Ryumin, D., Ivanko, D., Kashevnik, A., & Karpov, A. (2024). Audio-visual speech recognition in-the-wild: Multi-angle vehicle cabin corpus and attention-based method. In IEEE international conference on acoustics, speech and signal processing (pp. 8195–8199). IEEE, http://dx.doi.org/10.1109/ICASSP48485.2024.10448048.
Axyonov, A., Ryumina, E., Ryumin, D., Ivanko, D., & Karpov, A. (2023). Neural network-based method for visual recognition of driver's voice commands using attention mechanism. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, 23(4), 767–775. http://dx.doi.org/10.17586/2226-1494-2023-23-4-767-775.
Berkol, A., Tümer-Sivri, T., Pervan-Akman, N., Çolak, M., & Erdem, H. (2023). Visual lip reading dataset in Turkish. Data, 8(1), 15. http://dx.doi.org/10.3390/data8010015.
Bhardwaj, V., Ben Othman, M. T., Kukreja, V., Belkhier, Y., Bajaj, M., Goud, B. S., et al. (2022). Automatic speech recognition (ASR) systems for children: A systematic literature review. Applied Sciences, 12(9), 4419. http://dx.doi.org/10.3390/app12094419.
Biswas, A., Sahu, P. K., & Chandra, M. (2015). Multiple camera in car audio–visual speech recognition using phonetic and visemic information. Computers & Electrical Engineering, 47, 35–50. http://dx.doi.org/10.1016/j.compeleceng.2015.08.009.
Braun, M., Mainz, A., Chadowitz, R., Pfleging, B., & Alt, F. (2019). At your service: Designing voice assistant personalities to improve automotive user interfaces. In Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–11). http://dx.doi.org/10.1145/3290605.3300270.
Burchi, M., & Timofte, R. (2023). Audio-visual efficient conformer for robust speech recognition. In IEEE/CVF winter conference on applications of computer vision (pp. 2258–2267). http://dx.doi.org/10.1109/WACV56688.2023.00229.
Chen, C., Hu, Y., Zhang, Q., Zou, H., Zhu, B., & Chng, E. S. (2023). Leveraging modality-specific representations for audio-visual speech recognition via reinforcement learning. In AAAI conference on artificial intelligence (pp. 12607–12615). http://dx.doi.org/10.48550/arXiv.2212.05301.
Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., et al. (2023). Symbolic discovery of optimization algorithms. (pp. 1–30). http://dx.doi.org/10.48550/arXiv.2302.06675, arXiv preprint arXiv:2302.06675.
Chen, C., Wang, D., & Zheng, T. F. (2023). CN-CVS: A mandarin audio-visual dataset for large vocabulary continuous visual to speech synthesis. In IEEE international conference on acoustics, speech and signal processing (pp. 1–5). IEEE, http://dx.doi.org/10.1109/ICASSP49357.2023.10095796.
Cheng, M., Wang, H., Wang, Y., & Li, M. (2022). The DKU audio-visual wake word spotting system for the 2021 MISP challenge. In IEEE international conference on acoustics, speech and signal processing (pp. 9256–9260). http://dx.doi.org/10.1109/icassp43922.2022.9747216.
Chuang, S.-Y., Wang, H.-M., & Tsao, Y. (2022). Improved lite audio-visual speech enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30, 1345–1359. http://dx.doi.org/10.1109/taslp.2022.3153265.
Chung, J. S., Senior, A. W., Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. In IEEE/CVF conference on computer vision and pattern recognition (pp. 3444–3453). http://dx.doi.org/10.1109/CVPR.2017.367.
Chung, J. S., & Zisserman, A. (2016). Lip reading in the wild. In Asian conference on computer vision (pp. 87–103). http://dx.doi.org/10.1007/978-3-319-54184-6_6.
Dai, W., Cahyawijaya, S., Yu, T., Barezi, E. J., Xu, P., Yiu, C. T. S., et al. (2022). CI-AVSR: A cantonese audio-visual speech dataset for in-car command recognition. In International conference on language resources and evaluation (pp. 6786–6793).
Dai, Y., Chen, H., Du, J., Ding, X., Ding, N., Jiang, F., et al. (2023). Improving audio-visual speech recognition by lip-subword correlation based visual pre-training and cross-modal fusion encoder. In IEEE international conference on multimedia and expo (pp. 2627–2632). IEEE, http://dx.doi.org/10.1109/ICME55011.2023.00447.
Deng, K., Cao, S., Zhang, Y., & Ma, L. (2021). Improving hybrid CTC/Attention end-to-end speech recognition with pretrained acoustic and language models. In IEEE automatic speech recognition and understanding workshop (pp. 76–82). IEEE, http://dx.doi.org/10.1109/ASRU51503.2021.9688009.


Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 conference on computer vision and pattern recognition (pp. 248–255). IEEE.
Egorov, E., Kostyumov, V., Konyk, M., & Kolesnikov, S. (2021). LRWR: Large-scale benchmark for lip reading in Russian language. (pp. 1–5). http://dx.doi.org/10.48550/arXiv.2109.06692, arXiv preprint arXiv:2109.06692.
Feng, D., Yang, S., Shan, S., & Chen, X. (2020). Learn an effective lip reading model without pains. (pp. 1–6). http://dx.doi.org/10.48550/arXiv.2011.07557, arXiv preprint arXiv:2011.07557.
Gerczuk, M., Amiriparian, S., Ottl, S., & Schuller, B. (2021). EmoNet: A transfer learning framework for multi-corpus speech emotion recognition. IEEE Transactions on Affective Computing, 14, 1472–1487. http://dx.doi.org/10.1109/TAFFC.2021.3135152.
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., et al. (2020). Conformer: Convolution-augmented transformer for speech recognition. In Interspeech (pp. 5036–5040). http://dx.doi.org/10.21437/Interspeech.2020-3015.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition (pp. 770–778). http://dx.doi.org/10.1109/cvpr.2016.90.
Hong, J., Kim, M., Choi, J. Y., & Ro, Y. M. (2023). Watch or listen: Robust audio-visual speech recognition with visual corruption modeling and reliability scoring. In IEEE/CVF conference on computer vision and pattern recognition (pp. 18783–18794). http://dx.doi.org/10.1109/CVPR52729.2023.01801.
Hsu, W.-N., Bolte, B., Tsai, Y.-H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460. http://dx.doi.org/10.1109/TASLP.2021.3122291.
Isobe, S., Tamura, S., Hayamizu, S., Gotoh, Y., & Nose, M. (2021). Multi-angle lipreading with angle classification-based feature extraction and its application to audio-visual speech recognition. Future Internet, 13(7), 182. http://dx.doi.org/10.3390/fi13070182.
Ivanko, D., Axyonov, A., Ryumin, D., Kashevnik, A., & Karpov, A. (2022). RUSAVIC corpus: Russian audio-visual speech in cars. In Language resources and evaluation conference (pp. 1555–1559).
Ivanko, D., Kashevnik, A., Ryumin, D., Kitenko, A., Axyonov, A., Lashkov, I., et al. (2022). MIDriveSafely: Multimodal interaction for drive safely. In Proceedings of the 2022 international conference on multimodal interaction (pp. 733–735). http://dx.doi.org/10.1145/3536221.3557037.
Ivanko, D., Ryumin, D., Kashevnik, A., Axyonov, A., & Karpov, A. (2022). Visual speech recognition in a driver assistance system. In European signal processing conference (pp. 1131–1135). IEEE, http://dx.doi.org/10.23919/eusipco55093.2022.9909819.
Ivanko, D., Ryumin, D., Kashevnik, A., Axyonov, A., Kitenko, A., Lashkov, I., et al. (2022). DAVIS: Driver's audio-visual speech recognition. In ISCA annual conference interspeech (pp. 1141–1142).
Jaradat, G. A., Alzubaidi, M. A., & Otoom, M. (2022). A novel human-vehicle interaction assistive device for Arab drivers using speech recognition. IEEE Access, 10, 127514–127529. http://dx.doi.org/10.1109/ACCESS.2022.3226539.
Kahn, J., Riviere, M., Zheng, W., Kharitonov, E., Xu, Q., Mazaré, P.-E., et al. (2020). Libri-light: A benchmark for ASR with limited or no supervision. In Proceedings of the 2020 international conference on acoustics, speech and signal processing (pp. 7669–7673). IEEE.
Katsaggelos, A. K., Bahaadini, S., & Molina, R. (2015). Audiovisual fusion: Challenges and new approaches. IEEE, 103, 1635–1653. http://dx.doi.org/10.1109/JPROC.2015.2459017.
Khan, S. H., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., & Shah, M. (2022). Transformers in vision: A survey. ACM Computing Surveys (CSUR), 54(10s), 1–41. http://dx.doi.org/10.1145/3505244.
Kim, M., Hong, J., Park, S. J., & Ro, Y. M. (2021). Multi-modality associative bridging through memory: Speech sound recollected from face video. In IEEE/CVF international conference on computer vision (pp. 296–306). http://dx.doi.org/10.1109/ICCV48922.2021.00036.
Kim, S., Hori, T., & Watanabe, S. (2017). Joint CTC-attention based end-to-end speech recognition using multi-task learning. In IEEE international conference on acoustics, speech and signal processing (pp. 4835–4839). IEEE, http://dx.doi.org/10.1109/ICASSP.2017.7953075.
Kim, M., Yeo, J. H., & Ro, Y. M. (2022). Distinguishing homophenes using multi-head visual-audio memory for lip reading. In AAAI conference on artificial intelligence (pp. 1174–1182). http://dx.doi.org/10.48550/arXiv.2204.01725.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. (pp. 1–15). http://dx.doi.org/10.48550/arXiv.1412.6980, arXiv preprint arXiv:1412.6980.
Koumparoulis, A., & Potamianos, G. (2022). Accurate and resource-efficient lipreading with Efficientnetv2 and transformers. In IEEE international conference on acoustics, speech and signal processing (pp. 8467–8471). IEEE, http://dx.doi.org/10.1109/ICASSP43922.2022.9747729.
Kumar, L. A., Renuka, D. K., Rose, S. L., Wartana, I. M., et al. (2022). Deep learning based assistive technology on audio visual speech recognition for hearing impaired. International Journal of Cognitive Computing in Engineering, 3, 24–30. http://dx.doi.org/10.1016/j.ijcce.2022.01.003.
Lee, Y.-H., Jang, D.-W., Kim, J.-B., Park, R.-H., & Park, H.-M. (2020). Audio-visual speech recognition based on dual cross-modality attentions with the transformer model. Applied Sciences, 10(20), 7263. http://dx.doi.org/10.3390/app10207263.
Li, J., Li, C., Wu, Y., & Qian, Y. (2023). Robust audio-visual ASR with unified cross-modal attention. In IEEE international conference on acoustics, speech and signal processing (pp. 1–5). http://dx.doi.org/10.1109/ICASSP49357.2023.10096893.
Li, T., Zhou, H., Wang, J., Hong, Q., & Li, L. (2023). The XMU system for audio-visual diarization and recognition in MISP challenge 2022. In IEEE international conference on acoustics, speech and signal processing (pp. 1–2). http://dx.doi.org/10.1109/ICASSP49357.2023.10095693.
Li, J., et al. (2022). Recent advances in end-to-end automatic speech recognition. APSIPA Transactions on Signal and Information Processing, 11(1), http://dx.doi.org/10.1561/116.00000050.
Liang, B., & Tang, Z. (2022). Multi-modal information analysis of automobile intelligent human-computer interaction. In International conference on multi-modal information analytics (pp. 658–666). Springer, http://dx.doi.org/10.1007/978-3-031-05237-8_81.
Lin, Y.-B., Sung, Y.-L., Lei, J., Bansal, M., & Bertasius, G. (2023). Vision transformers are parameter-efficient audio-visual learners. In IEEE/CVF conference on computer vision and pattern recognition (pp. 2299–2309). IEEE, http://dx.doi.org/10.1109/CVPR52729.2023.00228.
Liu, H., Li, W., & Yang, B. (2021). Robust audio-visual speech recognition based on hybrid fusion. In International conference on pattern recognition (pp. 7580–7586). IEEE, http://dx.doi.org/10.1109/ICPR48806.2021.9412817.
Loh, C. Y., Boey, K. L., & Hong, K. S. (2017). Speech recognition interactive system for vehicle. In 2017 IEEE 13th international colloquium on signal processing & its applications (pp. 85–88). http://dx.doi.org/10.1109/CSPA.2017.8064929.
Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In International conference on learning representations (pp. 1–18).
Lu, Y., Li, Z., He, D., Sun, Z., Dong, B., Qin, T., et al. (2019). Understanding and improving transformer from a multi-particle dynamic system point of view. (pp. 1–15). http://dx.doi.org/10.48550/arXiv.1906.02762, arXiv preprint arXiv:1906.02762.
Ma, P., Haliassos, A., Fernandez-Lopez, A., Chen, H., Petridis, S., & Pantic, M. (2023). Auto-AVSR: Audio-visual speech recognition with automatic labels. In IEEE international conference on acoustics, speech and signal processing (pp. 1–5). IEEE, http://dx.doi.org/10.1109/ICASSP49357.2023.10096889.
Ma, P., Martinez, B., Petridis, S., & Pantic, M. (2021). Towards practical lipreading with distilled and efficient models. In IEEE international conference on acoustics, speech and signal processing (pp. 7608–7612). IEEE, http://dx.doi.org/10.1109/ICASSP39728.2021.9415063.
Ma, P., Petridis, S., & Pantic, M. (2021). End-to-end audio-visual speech recognition with conformers. In IEEE international conference on acoustics, speech and signal processing (pp. 7613–7617). IEEE, http://dx.doi.org/10.1109/ICASSP39728.2021.9414567.
Ma, P., Wang, Y., Petridis, S., Shen, J., & Pantic, M. (2022). Training strategies for improved lip-reading. In IEEE international conference on acoustics, speech and signal processing (pp. 8472–8476). IEEE, http://dx.doi.org/10.1109/ICASSP43922.2022.9746706.
Makino, T., Liao, H., Assael, Y., Shillingford, B., Garcia, B., Braga, O., et al. (2019). Recurrent neural network transducer for audio-visual speech recognition. In IEEE automatic speech recognition and understanding workshop (pp. 905–912). IEEE, http://dx.doi.org/10.1109/ASRU46091.2019.9004036.
Martinez, B., Ma, P., Petridis, S., & Pantic, M. (2020). Lipreading using temporal convolutional networks. In IEEE international conference on acoustics, speech and signal processing (pp. 6319–6323). IEEE, http://dx.doi.org/10.1109/ICASSP40776.2020.9053841.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., et al. (2015). Librosa: Audio and music signal analysis in Python. In Python in science conference: vol. 8, (pp. 18–25). http://dx.doi.org/10.25080/MAJORA-7B98E3ED-003.
Mehler, B., Kidd, D. G., Reimer, B., Reagan, I. J., Dobres, J., & McCartt, A. T. (2015). Multi-modal assessment of on-road demand of voice and manual phone calling and voice navigation entry across two embedded vehicle systems. Ergonomics, 59, 344–367. http://dx.doi.org/10.1080/00140139.2015.1081412.
Miao, H., Cheng, G., Gao, C., Zhang, P., & Yan, Y. (2020). Transformer-based online CTC/Attention end-to-end speech recognition architecture. In IEEE international conference on acoustics, speech and signal processing (pp. 6084–6088). IEEE, http://dx.doi.org/10.1109/ICASSP40776.2020.9053165.
Miao, Z., Liu, H., & Yang, B. (2020). Part-based lipreading for audio-visual speech recognition. In IEEE international conference on systems, man, and cybernetics (pp. 2722–2726). IEEE, http://dx.doi.org/10.1109/smc42975.2020.9283044.
Michelsanti, D., Tan, Z.-H., Zhang, S.-X., Xu, Y., Yu, M., Yu, D., et al. (2021). An overview of deep-learning-based audio-visual speech enhancement and separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 1368–1396. http://dx.doi.org/10.1109/TASLP.2021.3066303.
Noda, K., Yamaguchi, Y., Nakadai, K., Okuno, H. G., & Ogata, T. (2015). Audio-visual speech recognition using deep learning. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Technologies, 42, 722–737. http://dx.doi.org/10.1007/s10489-014-0629-7.
Ouyang, R., Ma, C., & Tan, H. (2023). The development of China automotive human-computer interaction. In International conference on human-computer interaction (pp. 417–429). Springer, http://dx.doi.org/10.1007/978-3-031-35939-2_31.


Pan, X., Chen, P., Gong, Y., Zhou, H., Wang, X., & Lin, Z. (2022). Leveraging unimodal self-supervised learning for multimodal audio-visual speech recognition. In Annual meeting of the association for computational linguistics (pp. 4491–4503). http://dx.doi.org/10.18653/v1/2022.acl-long.308.
Park, J. G., Hwang, J.-W., Choi, K., Lee, S.-H., Ahn, J.-H., Park, R.-H., et al. (2023). OLKAVS: An open large-scale Korean audio-visual speech dataset. (pp. 1–15). http://dx.doi.org/10.48550/arXiv.2301.06375, ArXiv arXiv:2301.06375.
Petridis, S., Stafylakis, T., Ma, P., Cai, F., Tzimiropoulos, G., & Pantic, M. (2018). End-to-end audiovisual speech recognition. In IEEE international conference on acoustics, speech and signal processing (pp. 6548–6552). IEEE, http://dx.doi.org/10.1109/ICASSP.2018.8461326.
Peymanfard, J., Heydarian, S., Lashini, A., Zeinali, H., Mohammadi, M. R., & Mozayani, N. (2023). A multi-purpose audio-visual corpus for multi-modal Persian speech recognition: The Arman-AV dataset. Expert Systems with Applications, 238, Article 121648. http://dx.doi.org/10.1016/j.eswa.2023.121648.
Ruder, S. (2016). An overview of gradient descent optimization algorithms. (pp. 1–14). http://dx.doi.org/10.48550/arXiv.1609.04747, arXiv preprint arXiv:1609.04747.
Ryumin, D., Ivanko, D., & Ryumina, E. (2023). Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors, 23(4), 2284. http://dx.doi.org/10.3390/s23042284.
Ryumina, E., Dresvyanskiy, D., & Karpov, A. (2022). In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study. Neurocomputing, 514, 435–450. http://dx.doi.org/10.1016/j.neucom.2022.10.013.
Shashidhar, R., Patilkulkarni, S., & Puneeth, S. (2022). Combining audio and visual speech recognition using LSTM and deep convolutional neural network. International Journal of Information Technology, 14(7), 3425–3436. http://dx.doi.org/10.1007/s41870-022-00907-y.
Shewalkar, A., Nyavanandi, D., & Ludwig, S. A. (2019). Performance evaluation of deep neural networks applied to speech recognition: RNN, LSTM and GRU. Journal of Artificial Intelligence and Soft Computing Research, 9(4), 235–245. http://dx.doi.org/10.2478/jaiscr-2019-0006.
Shi, B., Hsu, W.-N., & Mohamed, A. (2022). Robust self-supervised audio-visual speech recognition. In Interspeech (pp. 2118–2122). http://dx.doi.org/10.21437/Interspeech.2022-99.
Shillingford, B., Assael, Y., Hoffman, M. W., Paine, T., Hughes, C., Prabhu, U., et al. (2019). Large-scale visual speech recognition. In Interspeech (pp. 4135–4139). http://dx.doi.org/10.21437/Interspeech.2019-1669.
Song, Q., Sun, B., & Li, S. (2022). Multimodal sparse transformer network for audio-visual speech recognition. IEEE Transactions on Neural Networks and Learning Systems, 10028–10038. http://dx.doi.org/10.1109/TNNLS.2022.3163771.
Sterpu, G., Saam, C., & Harte, N. (2018). Attention-based audio-visual fusion for robust automatic speech recognition. In 20th ACM international conference on multimodal interaction (pp. 1–6). http://dx.doi.org/10.1145/3242969.3243014.
Tamazin, M., Gouda, A., & Khedr, M. (2019). Enhanced automatic speech recognition system based on enhancing power-normalized cepstral coefficients. Applied Sciences, 9(10), 2166. http://dx.doi.org/10.3390/APP9102166.
Thomas, B., Kessler, S., & Karout, S. (2022). Efficient adapter transfer of self-supervised speech models for automatic speech recognition. In IEEE international conference on acoustics, speech and signal processing (pp. 7102–7106). IEEE, http://dx.doi.org/10.1109/ICASSP43922.2022.9746223.
Wang, J., Huo, Y., Liu, L., Xu, T., Li, Q., & Li, S. (2023). MAVD: The first open large-scale Mandarin audio-visual dataset with depth information. In Interspeech (pp. 2113–2117). http://dx.doi.org/10.21437/Interspeech.2023-823.
Wang, H., Zha, Z.-J., Li, L., Chen, X., & Luo, J. (2023). Context-aware proposal-boundary network with structural consistency for audiovisual event localization. IEEE Transactions on Neural Networks and Learning Systems, 1–11. http://dx.doi.org/10.1109/TNNLS.2023.3290083.
Wu, Y., Li, C., Yang, S., Wu, Z., & Qian, Y. (2021). Audio-visual multi-talker speech recognition in a cocktail party. In Interspeech (pp. 1–5). http://dx.doi.org/10.21437/interspeech.2021-2128.
Yang, W., Li, P., Yang, W., Liu, Y., He, Y., Petrosian, O., et al. (2023). Research on robust audio-visual speech recognition algorithms. Mathematics, 11(7), 1733. http://dx.doi.org/10.3390/math11071733.
Yang, S., Zhang, Y., Feng, D., Yang, M., Wang, C., Xiao, J., et al. (2019). LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In IEEE international conference on automatic face & gesture recognition (pp. 1–8). IEEE, http://dx.doi.org/10.1109/FG.2019.8756582.
Zhang, Y., Han, W., Qin, J., Wang, Y., Bapna, A., Chen, Z., et al. (2023). Google USM: Scaling automatic speech recognition beyond 100 languages. (pp. 1–20). http://dx.doi.org/10.48550/arXiv.2303.01037, arXiv preprint arXiv:2303.01037.
Zhang, Z., Pinto, J., Plahl, C., Schuller, B., & Willett, D. (2014). Channel mapping using bidirectional long short-term memory for dereverberation in hands-free voice controlled devices. IEEE Transactions on Consumer Electronics, 60, 525–533. http://dx.doi.org/10.1109/TCE.2014.6937339.
Zhang, Y., Yang, S., Xiao, J., Shan, S., & Chen, X. (2020). Can we read speech beyond the lips? Rethinking ROI selection for deep visual speech recognition. In IEEE international conference on automatic face and gesture recognition (pp. 356–363). IEEE, http://dx.doi.org/10.1109/FG47880.2020.00134.
Zhang, L., Zhao, Z., Ma, C., Shan, L., Sun, H., Jiang, L., et al. (2020). End-to-end automatic pronunciation error detection based on improved hybrid CTC/Attention architecture. Sensors, 20(7), 1809. http://dx.doi.org/10.3390/s20071809.
Zhou, P., Yang, W., Chen, W., Wang, Y., & Jia, J. (2019). Modality attention for end-to-end audio-visual speech recognition. In IEEE international conference on acoustics, speech and signal processing (pp. 6565–6569). IEEE, http://dx.doi.org/10.1109/ICASSP.2019.8683733.
Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., et al. (2020). A comprehensive survey on transfer learning. IEEE, 109(1), 43–76. http://dx.doi.org/10.1109/JPROC.2020.3004555.
