
Automatic Speech Recognition (ASR) Systems for Indic Languages

Presented By: Drishti Singh, MT23117

ENG 599 - Winter 2024 Presentation


Contents

1. Introduction to ASRs
2. Motivation to create ASRs for Indic languages
3. Challenges
4. Past Literature Review
5. Methodology
6. Results
7. Conclusion
8. Future Scope
9. References
Introduction
Automatic Speech Recognition (ASR) marks a significant milestone in the evolution of Human-Computer
Interaction. From its humble beginnings in basic digit identification, ASR has progressed to sophisticated
systems capable of transcribing continuous speech in real-time. This advancement has driven the
automation of tasks across various levels of complexity, fueling research efforts towards developing
optimal systems that effectively comprehend and analyze speech for seamless task execution.

● ASR technology has undergone substantial transformation, accelerated by deep learning and, more recently, LLMs, enabling computers to transcribe speech into text with increasing accuracy and efficiency.

● The evolution of ASR has been fueled by the growing need to automate tasks and enhance
Human-Machine interaction, driving extensive research and development in the field.

● ASR systems play a crucial role in enabling hands-free operation of devices, supporting voice interaction for everything from simple commands to complex queries.
Motivation
The linguistic diversity of countries like India presents unique challenges and opportunities for ASR
technology. In India, where numerous languages are spoken and a significant portion of the population
faces literacy challenges, the need for ASRs tailored to Indic languages is particularly pressing.

● India's vast linguistic diversity and substantial illiterate population underscore the importance of
ASRs customized for Indic languages.

● Large Language Models (LLMs) like ChatGPT and Gemini have opened new avenues for bridging
language gaps through AI, offering solutions to societal challenges faced by rural communities.

● ASR technology holds the potential to empower individuals in rural areas with limited literacy
skills, enabling them to interact effortlessly with computers and devices, thereby unlocking
opportunities for education, information access, and socio-economic empowerment.
Challenges
1. Linguistic Diversity: Indic languages exhibit significant linguistic diversity in terms of
phonetics, grammar, and vocabulary. Developing ASRs that accurately capture these
linguistic nuances across different Indic languages poses a considerable challenge.
2. Limited Data Availability: Unlike widely spoken languages such as English, data
availability for training ASRs in Indic languages is often limited. This scarcity of labeled
audio data hampers the ability to train robust and accurate ASR models.
3. Variability in Pronunciation: Indic languages often exhibit variability in
pronunciation due to regional accents, dialects, and socio-cultural factors. ASR models
must account for this variability to ensure accurate transcription of speech.
4. Code-Switching and Multilingualism: Many speakers of Indic languages engage in
code-switching, where they seamlessly switch between multiple languages within a
single conversation. ASRs need to effectively handle code-switching and multilingual
speech to accurately transcribe such interactions.
Past Literature Review

Research in Automatic Speech Recognition (ASR) has predominantly centered around three
key areas: Multilingual ASR, Monolingual ASR, and Code-Switching ASR, each posing distinct
challenges and garnering significant attention in advancing speech recognition technology.

1. Multilingual ASR - To handle many languages at the same time, a single deep neural network is
trained jointly on the combined corpus of all languages, with an output vocabulary formed from
the union of the languages' character sets.
2. Monolingual ASR - Monolingual ASR consists of various pre-trained models, each trained on one
language separately and fine-tuned accordingly. A language identification (LID) module attached
to such an ASR recognises the language and routes the audio to the corresponding model.
3. Code-Switching ASR - When bilingual speakers switch between languages depending on the
situation, it is called code-switching. This phenomenon occurs when someone speaks two
languages successively in the same conversation, and the mixed speech can differ from how
each language is spoken individually.
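The monolingual-with-LID setup above can be sketched as a simple routing step. The `detect_language` function and the per-language "models" below are hypothetical stubs for illustration, not a real ASR or LID API:

```python
# Minimal sketch of LID-based routing to per-language monolingual models.
# All names here are illustrative placeholders.

def detect_language(lid_scores):
    """Toy LID module: pick the language with the highest score.
    A real LID module would classify acoustic features directly."""
    return max(lid_scores, key=lid_scores.get)

# One fine-tuned recogniser per language (stubbed for illustration).
MONOLINGUAL_MODELS = {
    "hi": lambda utt: f"[Hindi transcript of {utt}]",
    "ta": lambda utt: f"[Tamil transcript of {utt}]",
}

def transcribe(utterance_id, lid_scores):
    lang = detect_language(lid_scores)    # 1. identify the language
    model = MONOLINGUAL_MODELS[lang]      # 2. route to that language's model
    return lang, model(utterance_id)      # 3. transcribe with it

lang, text = transcribe("utt_001", {"hi": 0.92, "ta": 0.08})
```

The design choice this illustrates: each model only ever sees one language, so the routing step, not the recogniser, bears the burden of handling linguistic diversity.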
Benchmark Datasets for ASR Training:

1. Vistaar Benchmark Set:
a. Kathbath-Hard: Diverse dataset from speakers representing 12 Indian languages, including
background noise for a challenging benchmark.
b. FLEURS: Dataset of translated Wikipedia content with broad language coverage.
c. CommonVoice: Crowdsourced speech dataset from diverse sources encompassing 8 languages.

[Figure: Vistaar dataset description]

2. MUCS Dataset: Included in the MUCS 2021 competition, featuring multilingual aspects and
code-switching transcripts. Combines manually curated gold-standard labels with weakly
curated labels from crowdsourcing platforms.

[Figure: MUCS dataset description]
Methodology

Methodologies for Indian ASR research encompass diverse approaches, including self-supervised learning
and encoder-decoder architectures, driving advancements in speech recognition technology.

● Wav2Vec 2.0:
  ● Utilizes self-supervised learning with extensive unlabeled speech data.
  ● Two stages: pretraining with masking and a contrastive task, then fine-tuning with labeled data.
  ● Extracts robust contextualized representations from speech audio for multilingual ASR.
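The masking step in the pretraining stage above can be sketched as follows. The masking probability, span length, and the zero "mask embedding" are simplifications for illustration, not the actual Wav2Vec 2.0 hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(frames, mask_prob=0.065, span=10):
    """Span masking in the style of wav2vec 2.0 pretraining: choose each
    latent frame as a span start with probability mask_prob, then mask that
    frame and the following span - 1 frames. Masked positions would be
    replaced by a learned mask embedding; here we use zeros."""
    T, _ = frames.shape
    is_masked = np.zeros(T, dtype=bool)
    starts = rng.random(T) < mask_prob
    for t in np.flatnonzero(starts):
        is_masked[t:t + span] = True
    masked = frames.copy()
    masked[is_masked] = 0.0   # stand-in for the learned mask embedding
    return masked, is_masked

frames = rng.standard_normal((200, 16))   # 200 latent frames, feature dim 16
masked, idx = mask_spans(frames)
```

During pretraining, the model is then asked (via the contrastive task) to identify the true latent at each masked position among distractors, which is what forces it to learn contextualized representations.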

● IndicWhisper Model:
  ● Builds on OpenAI's Whisper models for monolingual training and combination.
  ● Trains an individual model for each language with a Transformer-based architecture.
  ● Fine-tunes the Whisper-medium model for each language.

● Seq2Seq Model:
  ● Utilized for multilingual speech recognition.
  ● An encoder-decoder architecture maps the input audio to an intermediate representation and
    generates the output sequence from it.
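The encoder-decoder idea above can be sketched in a few lines. All dimensions, weights, and the toy state update are illustrative assumptions, not a trained model; a real ASR decoder would also condition on previously emitted tokens:

```python
import numpy as np

# Toy Seq2Seq sketch: encoder compresses a variable-length feature
# sequence into a fixed intermediate representation; decoder emits tokens.
rng = np.random.default_rng(1)
D_IN, D_CTX, VOCAB = 16, 8, 5
W_enc = rng.standard_normal((D_IN, D_CTX)) * 0.1
W_dec = rng.standard_normal((D_CTX, VOCAB)) * 0.1

def encode(audio_frames):
    """Map a (T, D_IN) feature sequence to one context vector of size D_CTX
    (here by mean-pooling a single projected layer)."""
    return np.tanh(audio_frames @ W_enc).mean(axis=0)

def decode(context, max_len=4):
    """Greedily emit one token per step from the context, updating a toy
    state so successive steps can differ."""
    state, tokens = context, []
    for _ in range(max_len):
        tok = int(np.argmax(state @ W_dec))   # pick the highest-scoring token
        tokens.append(tok)
        state = np.tanh(state + 0.1 * tok)    # toy state update
    return tokens

tokens = decode(encode(rng.standard_normal((30, D_IN))))
```

The key property being illustrated: the input length (30 frames) and output length (4 tokens) are decoupled by the intermediate representation, which is what makes the architecture suitable for speech-to-text.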
Results

● The IndicWav2Vec and IndicWhisper models generated almost the same WER (Word Error Rate)
and therefore showed similar performance.

[Figure: histogram showing that, for almost all Indian languages, the multilingual approach
outperformed the monolingual approach in terms of pre-training loss, on the Kathbath dataset.]

[Figure: diagram portraying different ways of training multilingual ASRs, compared by their
relative WER change (lower is better).]
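Since the comparisons above are reported in WER, it may help to recall how Word Error Rate is typically computed: word-level edit distance between reference and hypothesis, divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference, hypothesis):
    """Word Error Rate = (substitutions + insertions + deletions) / number
    of reference words, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the bat sat"))  # one substitution out of 3 words
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why "relative WER change" (as in the comparison above) is often a more interpretable quantity across systems.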
Conclusion

The study underscores the pivotal role of speech recognition in machine learning,
driven by humans' innate propensity for speech comprehension. It highlights the
challenges researchers face, notably the scarcity of large audio datasets,
particularly for diverse Indian languages lacking sufficient corpora and labeled
data. However, the emergence of robust benchmarks like Kathbath and Vistaar
signals a promising avenue for developing more efficient ASRs across all Indian
languages, given access to diverse datasets. Furthermore, leveraging Large
Language Models (LLMs) can facilitate the generation of synthetic datasets,
enabling weakly supervised training during the initial stages.
Future Scope

Future research endeavors in ASR should focus on addressing dataset scarcity, fostering
collaboration, and exploring novel techniques to enhance performance, paving the way for
inclusive and effective speech recognition systems.

● Address challenges of dataset scarcity and label annotation for Indian languages through
collaborative efforts between academia, industry, and government bodies.
● Foster collaboration to curate and expand existing datasets, ensuring inclusivity and
representation of all Indian languages.
● Explore novel techniques such as active learning and transfer learning to enhance ASR
performance, especially for low-resource languages.
● Advance Large Language Models (LLMs) and synthetic-data generation techniques to
improve ASR robustness and scalability.
● Develop inclusive and effective speech recognition systems for all languages spoken in
India and beyond, leveraging advancements in technology and data accessibility.
References

[1] T. Javed, K. S. Bhogale, A. Raman, A. Kunchukuttan, P. Kumar, and M. M. Khapra, "IndicSUPERB: A
Speech Processing Universal Performance Benchmark for Indian languages." arXiv, Dec. 15, 2022.
Accessed: Mar. 21, 2024. [Online]. Available: http://arxiv.org/abs/2208.11761

[2] A. Gupta et al., "CLSRIL-23: Cross Lingual Speech Representations for Indic Languages." arXiv, Jan.
13, 2022. Accessed: Mar. 21, 2024. [Online]. Available: http://arxiv.org/abs/2107.07402

[3] H. S. Chadha, P. Shah, A. Dhuriya, N. Chhimwal, A. Gupta, and V. Raghavan, "Code Switched and
Code Mixed Speech Recognition for Indic languages." arXiv, Jun. 13, 2022. Accessed: Mar. 21, 2024.
[Online]. Available: http://arxiv.org/abs/2203.16578

[4] K. S. Bhogale, S. Sundaresan, A. Raman, T. Javed, M. M. Khapra, and P. Kumar, "Vistaar: Diverse
Benchmarks and Training Sets for Indian Language ASR." arXiv, Aug. 02, 2023. Accessed: Mar. 21,
2024. [Online]. Available: http://arxiv.org/abs/2305.15386

[5] V. Pratap et al., "Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters." arXiv,
Jul. 07, 2020. Accessed: Mar. 21, 2024. [Online]. Available: http://arxiv.org/abs/2007.03001
