
Development of a Speech-To-Text Application Using Whisper ASR System

Abstract:
The objective of this project is to develop a Speech-To-Text application utilizing the Whisper
Automatic Speech Recognition (ASR) system. Whisper is an advanced ASR model trained on
680,000 hours of diverse speech data, enabling it to effectively handle various accents and
background noise. The architecture of Whisper follows an encoder-decoder Transformer
model, where input audio is split into 30-second chunks, converted into a log-Mel spectrogram,
and processed through an encoder. The decoder then predicts the corresponding text,
incorporating tasks such as language identification, phrase-level timestamps, multilingual
transcription, and translation. This project employs the medium-sized Whisper model,
balancing speed and accuracy. The workflow involves loading an audio file, performing signal
analysis, and generating a Mel spectrogram for visualization. The system then detects the
language of the audio and transcribes it into text. Additionally, the application can translate the
transcribed text into different languages, demonstrated by translating English audio to French.
The developed Speech-To-Text application showcases the effectiveness of Whisper in
accurately transcribing audio, offering potential applications in areas such as audio
transcription, dictation, voice commands, online search, and enhanced customer service.
Table of Contents

1. INTRODUCTION
   1.1 Introduction
   1.2 Objectives
   1.3 Problem Statement
2. LITERATURE REVIEW
3. PROPOSED METHOD
   3.1 Methodology and Flowcharts
   3.2 Implementation
4. RESULTS AND DISCUSSION
5. CONCLUSIONS
6. REFERENCES
1. INTRODUCTION
The primary objective of this project is to develop an advanced Speech-To-Text application
using the Whisper Automatic Speech Recognition (ASR) system. Whisper represents a cutting-
edge advancement in ASR technology, distinguished by its extensive training on a vast dataset
comprising 680,000 hours of diverse speech data sourced from the internet. This rigorous
training regimen equips Whisper with unparalleled capabilities to effectively handle a wide
range of accents, dialects, and background noise, ensuring robust performance in demanding
real-world applications that require high levels of accuracy and reliability.

Whisper adopts an advanced architecture based on an encoder-decoder Transformer model, which has proven highly effective in processing complex speech signals comprehensively.
Initially, incoming audio streams are segmented into 30-second intervals, a strategy that
optimizes processing efficiency while maintaining fidelity to the original audio content. These
segments are then transformed into log-Mel spectrograms, providing a detailed representation
of the audio's frequency content over time. This step is crucial for capturing the subtle acoustic
nuances inherent in spoken language, thereby enhancing the accuracy of subsequent
transcription processes.
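
As a rough illustration of this preprocessing stage, the openai-whisper package exposes helpers that pad or trim audio to a 30-second window and compute the log-Mel spectrogram; the file name below is only a placeholder, not part of the project's dataset:

import whisper

audio = whisper.load_audio("speech_sample.wav")   # hypothetical input file, resampled to 16 kHz
audio = whisper.pad_or_trim(audio)                # pad or trim to a single 30-second window
mel = whisper.log_mel_spectrogram(audio)          # 80-band log-Mel spectrogram fed to the encoder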

The encoder component of Whisper plays a critical role in the ASR pipeline by extracting
intricate acoustic features from the log-Mel spectrograms. These features encapsulate the
detailed acoustic characteristics essential for precise speech recognition. The encoded
representations are then passed to the decoder, which is meticulously trained to predict accurate
text captions. Beyond mere transcription, the decoder incorporates specialized tokens that
enable sophisticated functionalities such as language identification, precise timestamping,
multilingual transcription, and translation into multiple languages, including English.

Whisper's versatility and unparalleled accuracy position it as a cornerstone technology for a diverse array of applications. From providing real-time transcription services and enabling
intuitive voice-controlled interfaces to facilitating seamless multilingual communication,
Whisper ASR stands at the forefront of bridging linguistic divides and enhancing user
interactions across various industries. Its potential impact spans critical sectors such as
healthcare, education, telecommunications, and customer service, where reliable and efficient
speech recognition is essential for optimizing operational workflows and enhancing user
engagement.

In addition to these features, Whisper exhibits an exceptional capability to detect and transcribe
speech under challenging conditions. This includes handling speech in noisy environments,
recognizing overlapping speech from multiple speakers, and transcribing audio with low-
quality recordings. Whisper's model can discern speech patterns and provide accurate
transcriptions even when dealing with varying speech speeds and unclear articulation. This
robustness is achieved through the model's extensive training data and sophisticated
architecture, which allows it to generalize well across different scenarios and speaker profiles.

Furthermore, Whisper's ability to perform real-time language translation is particularly noteworthy. The model can transcribe speech in one language and translate it into another,
facilitating instant multilingual communication. This feature is invaluable in globalized
environments where interactions between speakers of different languages are common. The
model's proficiency in handling diverse languages and dialects ensures that it can provide
accurate and contextually appropriate translations, enhancing communication efficiency and
understanding.

The Whisper ASR system also incorporates advanced noise reduction algorithms that further
improve transcription accuracy. By filtering out background noise and isolating the primary
speech signal, Whisper ensures that the transcriptions remain clear and precise, even in
acoustically challenging environments. This capability is particularly beneficial for
applications in busy public spaces, outdoor settings, or industrial environments where ambient
noise levels can be high.

Moreover, Whisper is designed with scalability in mind, making it suitable for deployment in
large-scale enterprise environments. The system's architecture supports distributed processing,
enabling it to handle high volumes of data and concurrent transcription tasks without
compromising on performance. This scalability is essential for applications in call centers,
customer support services, and other high-demand settings where real-time speech processing
is critical.

The project also explores the integration of Whisper with various hardware platforms,
including mobile devices, embedded systems, and cloud-based services. By optimizing the
model for different hardware configurations, the application aims to deliver consistent
performance across a wide range of devices. This flexibility ensures that users can access
reliable speech-to-text services regardless of their device or operating environment.

To further enhance the user experience, the project includes the development of a
comprehensive user interface (UI) that is intuitive and accessible. The UI design prioritizes
ease of use, enabling users to interact with the speech-to-text application effortlessly. Features
such as voice command recognition, real-time feedback, and customizable settings are
incorporated to cater to diverse user needs and preferences. The goal is to create an interface
that not only functions effectively but also enhances user satisfaction and engagement.

Testing and evaluation are critical components of the project, ensuring that the Whisper ASR
system meets the highest standards of accuracy and reliability. The project employs a rigorous
testing framework that includes a variety of scenarios, such as different languages, accents,
speech speeds, and background noises. Performance metrics like Word Error Rate (WER),
transcription latency, and resource utilization are closely monitored to identify areas for
improvement and ensure the system's robustness in real-world applications.

In the context of healthcare, Whisper ASR can revolutionize the way medical professionals
handle patient records and documentation. By providing accurate and timely transcriptions of
medical dictations, the system can reduce administrative burdens on healthcare providers,
allowing them to focus more on patient care. Additionally, Whisper's ability to handle medical
terminologies and context-specific language ensures that the transcriptions are not only
accurate but also clinically relevant.

In educational settings, Whisper ASR can enhance learning experiences by providing real-time
transcriptions of lectures and classroom discussions. This feature is particularly beneficial for
students with hearing impairments or those who speak different languages, ensuring that they
can fully participate in and benefit from educational activities. The system's multilingual
capabilities also support the creation of inclusive learning environments, where language
barriers are minimized, and all students have access to the same information.

The project's future directions include exploring the integration of additional advanced
features, such as emotion detection, speaker identification, and context-aware transcriptions.
These features can add significant value to the speech-to-text application by providing more
nuanced and detailed transcriptions. For instance, emotion detection can help identify the
speaker's emotional state, which can be useful in customer service interactions or mental health
assessments. Speaker identification can differentiate between multiple speakers in a
conversation, ensuring that the transcriptions accurately reflect who said what.

Continuous learning and adaptation are also key areas of focus for the project's future
development. By implementing machine learning algorithms that enable the system to learn
from user feedback, the Whisper ASR system can continuously improve its performance over
time. This adaptive learning approach ensures that the system remains up-to-date with evolving
language patterns, new terminologies, and user preferences, maintaining its relevance and
effectiveness in dynamic environments.

Furthermore, the project aims to expand the range of supported languages and improve the
accuracy of transcriptions for lesser-known languages and regional dialects. Developing
specialized models for different dialects and accents will enhance recognition accuracy and
ensure inclusivity. This focus on linguistic diversity underscores the project's commitment to
creating a truly global speech recognition solution that can cater to users from various linguistic
backgrounds.

Integration with various industries is another significant aspect of the project's vision. In
customer service, Whisper ASR can automate call transcriptions, improving response times
and service quality. In healthcare, it can provide accurate medical dictations, enhancing record-
keeping and reducing administrative burdens on healthcare professionals. In educational
settings, real-time lecture transcriptions can aid students with hearing impairments or those
who speak different languages, ensuring they have equal access to educational content.

Real-time processing and scalability are crucial for the application's success. Enhancing the
system's capability to process and transcribe live audio streams in real-time will be a key focus.
This involves optimizing the model to leverage available hardware resources effectively,
potentially utilizing GPU acceleration for faster computation if supported. Ensuring
compatibility with the selected hardware specifications, such as an Intel i7 11th generation
processor and 16GB of RAM, is essential for maintaining high performance.

User interface and accessibility improvements are also prioritized in the project's development.
Developing a more intuitive and user-friendly interface that supports a wide range of devices
and platforms is crucial. Ensuring the system is accessible to users with disabilities by
incorporating features like screen reader support and voice commands will enhance its
usability. The goal is to create an inclusive application that caters to the needs of all users,
regardless of their physical abilities or technical proficiency.

Advanced features and functionalities, such as emotion detection, speaker identification, and
context-aware transcriptions, will be explored to provide more nuanced and accurate outputs.
Integrating with cloud-based storage solutions and collaboration tools will facilitate seamless
sharing and management of transcribed content across teams. These advanced features will add
significant value to the application by enhancing its functionality and user experience.

The project's commitment to continuous learning and adaptation is evident in its focus on
implementing machine learning algorithms that enable the system to learn and adapt from user
feedback. This approach will ensure that the Whisper ASR system continuously improves its
performance over time, reducing biases in speech recognition and ensuring fair and accurate
transcriptions for all users. Leveraging AI advancements to maintain the system's relevance
and effectiveness in dynamic environments is a key objective.

The development of the Whisper ASR-based Speech-To-Text application represents a significant advancement in speech recognition technology. By leveraging Whisper's robust
architecture, extensive training on diverse datasets, and advanced functionalities, the project
aims to create a highly accurate, reliable, and versatile speech-to-text solution. This application
has the potential to transform various industries, enhance user experiences, and bridge
linguistic divides, making speech recognition technology more accessible and effective in the
digital age. Through continuous innovation and adaptation, the project endeavors to showcase
how Whisper ASR can elevate the standard of speech processing technology, paving the way
for enhanced accessibility, efficiency, and user experience in the digital age.

1.2 Objectives

The objective of this project is to develop and implement a sophisticated Speech-To-Text application utilizing the Whisper Automatic Speech Recognition (ASR) system. Whisper,
distinguished by its training on a vast dataset of 680,000 hours of diverse speech data from the
internet, offers advanced capabilities in handling various accents and background noise. The
project aims to leverage Whisper's robust architecture, based on an encoder-decoder
Transformer model, to achieve high accuracy and reliability in converting spoken language
into textual transcripts.

Key objectives include:

1. Implementation of Whisper ASR

Implementing Whisper ASR involves using its medium-sized model effectively to process
audio inputs for accurate speech recognition. This starts with loading audio files and breaking
them into manageable segments. These segments are then converted into log-Mel
spectrograms, which capture the audio's frequency content over time in a format that Whisper
can analyze efficiently.
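
A minimal sketch of this first objective, assuming the openai-whisper package is installed and an audio file named sample.wav is available (the file name is illustrative), might look like this:

import whisper

model = whisper.load_model("medium")      # the medium-sized model used in this project
result = model.transcribe("sample.wav")   # chunking and log-Mel conversion happen internally
print(result["text"])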

2. Enhanced Speech Recognition

To improve speech recognition, algorithms are developed to extract detailed acoustic features
from these spectrograms using Whisper's encoder. These features include aspects like pitch and
tone that are crucial for understanding speech. The decoder part of Whisper is then trained to
convert these acoustic features into accurate text captions. Special tokens within the decoder
help in tasks like identifying languages, adding timestamps, and handling multilingual
transcriptions.

Applications and Benefits

Implementing Whisper ASR enhances various applications. In customer service, it speeds up call handling by transcribing customer queries automatically. In healthcare, it helps doctors
dictate notes accurately. Educational settings benefit from real-time transcription during
classes, aiding students in note-taking. Overall, Whisper ASR improves efficiency across
industries by making speech-to-text conversion reliable and accessible.

3. Multilingual Capabilities

Whisper's integration of multilingual capabilities represents a significant advancement in speech-to-text technology. Leveraging its robust encoder-decoder Transformer architecture,
Whisper can seamlessly translate transcriptions into multiple languages with high accuracy.
This capability is pivotal for global applications where language diversity is prevalent. By
supporting languages beyond English, Whisper enhances accessibility and usability across
international borders, catering to a diverse user base.

The development phase involves configuring Whisper to recognize and process various
languages effectively. This includes optimizing language detection algorithms to accurately
identify the spoken language within audio inputs. Once identified, Whisper's decoder module
utilizes specialized tokens to facilitate multilingual transcription and translation tasks. This functionality not only enhances the application's versatility but also underscores its potential to
foster cross-cultural communication and accessibility in multilingual environments.
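
The language-detection step described above can be exercised directly through Whisper's API; the sketch below assumes a short clip named clip.wav and the medium model:

import whisper

model = whisper.load_model("medium")
audio = whisper.pad_or_trim(whisper.load_audio("clip.wav"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)                 # probabilities over supported languages
print("Detected language:", max(probs, key=probs.get))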

4. Real-World Application

Demonstrating Whisper's efficacy in real-time transcription tasks is crucial for validating its
practical utility across diverse fields. Industries such as customer service, education, healthcare,
and telecommunications stand to benefit significantly from Whisper's ability to provide
accurate and timely transcriptions. Real-world scenarios require reliable speech recognition
systems capable of handling varied accents, dialects, and environmental noise conditions, all
of which Whisper addresses through its extensive training on diverse speech datasets.

The application's effectiveness can be showcased through case studies and use-case scenarios.
For instance, in customer service applications, Whisper enables automated transcription of
customer calls, facilitating improved service delivery and responsiveness. In educational
settings, the application supports lecture transcription and note-taking, enhancing accessibility
for students with hearing impairments or language barriers. Similarly, in healthcare, Whisper's
real-time transcription capabilities aid in medical dictation and patient record management,
streamlining administrative workflows and improving accuracy.

5. User Interface and Accessibility

Designing an intuitive user interface (UI) is essential to ensure the Speech-To-Text application's accessibility and ease of use across different platforms. The UI should be user-
friendly, featuring clear navigation, responsive design elements, and customizable settings to
accommodate diverse user preferences. Accessibility considerations, such as support for screen
readers and keyboard navigation, should be integrated to enhance usability for individuals with
disabilities.

The UI design process involves iterative prototyping and usability testing to gather feedback
and refine interface elements. Visual representations of audio waveforms and spectrograms can
aid users in visualizing speech patterns and understanding transcription processes. Integration
with cloud-based storage solutions and collaboration tools enhances the application's utility in
professional settings, allowing seamless sharing and management of transcribed content across
teams.

6. Evaluation and Validation

Conducting rigorous testing and validation is essential to assess the developed application's
performance metrics comprehensively. Evaluation criteria include transcription accuracy,
processing speed, resource utilization (CPU/GPU usage), and overall system robustness.
Performance benchmarks should be established against industry standards and competitor
solutions to gauge the application's comparative advantage.

Validation methodologies encompass both automated testing frameworks and manual validation processes. Automated tests assess system responsiveness under varying workload
conditions, while manual validation involves expert review and user feedback sessions to
evaluate transcription quality and user satisfaction. Continuous monitoring and feedback
integration enable iterative improvements and optimizations, ensuring the application meets
performance expectations and user requirements.

1.3 Problem Statement

Despite advancements in Automatic Speech Recognition (ASR) technology, existing systems often struggle with accurately transcribing diverse accents and filtering out background noise
in real-world scenarios. Current solutions either sacrifice accuracy for speed or lack robustness
across different languages and environments. This project addresses these challenges by
leveraging Whisper, an ASR system trained on a comprehensive dataset of 680,000 hours of
speech data. Whisper aims to enhance speech-to-text accuracy and reliability across various
applications, including audio transcription, dictation, voice commands, online search, and
customer service automation.

The project focuses on developing a Speech-To-Text application using Whisper ASR to demonstrate its efficacy in real-time transcription tasks. Key challenges include optimizing the
model for efficient processing of audio inputs, integrating language detection capabilities to
identify diverse languages spoken within audio files, and implementing multilingual translation
features. By tackling these challenges, the project aims to showcase Whisper's potential to
significantly improve speech recognition technologies, thereby enhancing user experience and
operational efficiency in diverse linguistic and environmental contexts.

2. LITERATURE REVIEW

Automatic Speech Recognition (ASR) systems have undergone significant evolution, driven
by advances in machine learning models and the availability of large-scale training datasets.
Whisper ASR exemplifies this progress, having been trained on a vast dataset comprising
680,000 hours of diverse speech data sourced from the internet (Dettmers et al., 2022). This
extensive training enables Whisper to effectively handle various accents, dialects, and
background noise, ensuring robust performance in real-world applications that demand high
accuracy and reliability.[1]

The architecture of Whisper ASR is based on an encoder-decoder Transformer model, a design choice that has proven highly effective in processing complex speech signals (Hu et al., 2021).
The initial preprocessing steps involve segmenting incoming audio into manageable 30-second
intervals and converting them into log-Mel spectrograms. These spectrograms capture the
frequency content of the audio over time, providing a detailed representation essential for
accurate speech recognition (Kumar & Mittal, 2019).[3]

The encoder component of Whisper ASR plays a critical role in extracting intricate acoustic
features from the log-Mel spectrograms. These features are then passed to the decoder, which
is trained to predict corresponding text captions while incorporating specialized tokens for
tasks such as language identification and timestamping (Novitasari et al., 2020). This dual
functionality not only enhances transcription accuracy but also extends Whisper's capabilities
to support multilingual speech transcription and translation into various languages (Yang et al.,
2023).[2]

Recent research has focused on optimizing ASR systems through parameter-efficient fine-
tuning techniques (Fu et al., 2023). These approaches aim to strike a balance between model
complexity and computational efficiency, thereby improving performance metrics such as
transcription accuracy and processing speed. Such optimizations are crucial for deploying ASR
systems in real-time applications across diverse sectors including customer service, healthcare,
and telecommunications (Alharbi et al., 2021; Radford et al., 2023).[5]

User interface design is another critical aspect of ASR applications, ensuring accessibility and
ease of use across different platforms (Wang et al., 2020). Intuitive interfaces not only facilitate interaction with the ASR system but also enhance user experience, making it seamless to
dictate, transcribe, and interact with spoken language.[4]

Evaluation and validation methodologies are essential for assessing the reliability and
efficiency of ASR systems (Zhang et al., 2021). Metrics such as transcription accuracy,
processing speed, and resource utilization provide insights into the performance improvements
and optimizations needed for practical deployment in various real-world scenarios (Radford et
al., 2019).[6]

Whisper’s architecture is based on an encoder-decoder Transformer model, a design choice that significantly enhances its capability to process sequential data such as speech. This model
architecture allows Whisper to understand and generate text from speech with remarkable
accuracy. The input audio is segmented into 30-second chunks and converted into log-Mel
spectrograms, a transformation that captures the temporal and spectral features of the audio
signal. This preprocessing step is crucial as it preserves the nuances of the spoken language,
which are essential for accurate transcription (Hu et al., 2021).[7]

The encoder in Whisper extracts detailed acoustic features from these spectrograms, enabling
the model to understand the complex patterns in the speech signal. These features are then fed
into the decoder, which has been trained to predict text captions. The decoder's functionality is
enhanced by the integration of specialized tokens, allowing it to perform additional tasks such
as language identification, timestamping, multilingual transcription, and translation into
various languages, including English (Novitasari et al., 2020).[10]

Recent studies have focused on optimizing ASR systems to improve their efficiency and
performance. Techniques such as parameter-efficient fine-tuning have been shown to enhance
model accuracy while reducing computational requirements. Fu et al. (2023) highlighted the
effectiveness of these techniques in fine-tuning large language models, making them more
adaptable to specific tasks without the need for extensive retraining. This approach not only
conserves computational resources but also accelerates the deployment of ASR systems in real-
world applications.[9]

Additionally, research by Alharbi et al. (2021) and Radford et al. (2023) emphasizes the
significance of large-scale weak supervision in training robust ASR models. This approach
leverages diverse and abundant unlabeled data, enhancing the model’s ability to generalize across different domains and languages. The integration of such strategies has significantly
bolstered the performance of ASR systems, making them more reliable and versatile.[8]

One of the standout features of Whisper ASR is its robust multilingual capability. The system’s
ability to handle multiple languages seamlessly makes it an invaluable tool for global
applications. Studies such as those by Novitasari et al. (2020) and Rouditchenko et al. (2023)
have explored the implementation of cross-lingual ASR systems, demonstrating their potential
to facilitate communication across different languages and dialects. These systems are
particularly beneficial in low-resource settings where training data for specific languages may
be scarce.[12]

Moreover, Whisper’s multilingual capabilities are bolstered by its use of advanced tokenization
and embedding techniques, which ensure that the model can effectively process and understand
speech in various languages. This cross-linguistic adaptability is crucial for applications in
international markets, enhancing user experience and accessibility globally (Butryna et al.,
2020).[11]

The development of intuitive and accessible user interfaces is critical for the widespread
adoption of ASR technologies. Wang et al. (2020) and Zhang et al. (2021) have underscored
the importance of user-centric design in ASR applications. A well-designed interface simplifies
interaction, making ASR functionalities such as voice commands, dictation, and transcription
services more user-friendly. Ensuring that these interfaces are accessible across different
devices and platforms is essential for enhancing user engagement and satisfaction.[13]

To ensure the reliability and effectiveness of ASR systems, rigorous evaluation methodologies
are essential. Metrics such as transcription accuracy, processing speed, and resource utilization
are commonly used to assess system performance. Joseph (2022) and Liu et al. (2022) have
emphasized the importance of comprehensive testing frameworks that evaluate both the
qualitative and quantitative aspects of ASR systems. These evaluations help in identifying
performance bottlenecks and areas for improvement, thereby enhancing the system’s overall
efficacy and user experience.[14]

The landscape of Automatic Speech Recognition (ASR) has undergone significant transformation over the years, marked by pioneering research and innovative technologies that have continually pushed the boundaries of what is possible. This literature review delves into various seminal works and recent advancements in ASR, highlighting the evolution of
techniques and models that have significantly improved the accuracy and efficiency of speech
recognition systems.[16]

Deep Recurrent Neural Networks in Speech Recognition: The work by Graves et al. (2013)
demonstrated the potential of Deep Recurrent Neural Networks (RNNs) for speech recognition,
emphasizing their ability to handle sequential data effectively. Their research, presented at the
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
showed that RNNs could significantly improve the performance of ASR systems by capturing
long-term dependencies in speech signals.[17]

Attention Mechanisms and Transformer Models: The introduction of attention mechanisms and
Transformer models marked a significant leap in ASR technology. Chan et al. (2016)
introduced the "Listen, Attend and Spell" model, which employed an attention-based neural
network for large vocabulary conversational speech recognition. This model demonstrated
superior performance by focusing on relevant parts of the input sequence, thereby improving
accuracy and reducing errors.[18]

Further advancements were made with the development of the Transformer Transducer by
Zhang et al. (2020), which integrated Transformer encoders with Recurrent Neural Network
Transducer (RNN-T) loss to create a streamable speech recognition model. This approach
combined the strengths of Transformer architectures in handling complex dependencies with
the real-time processing capabilities of RNN-T, leading to significant improvements in both
latency and accuracy.[19]

End-to-End ASR with Deep Convolutional Neural Networks: Zhang et al. (2017) explored the
use of deep convolutional neural networks (CNNs) for end-to-end speech recognition. Their
research highlighted the efficacy of CNNs in capturing local patterns in speech signals, leading
to enhanced performance in ASR tasks. This shift towards end-to-end models simplified the
ASR pipeline by eliminating the need for handcrafted features and intermediate
representations.[20]

Multilingual and Low-Resource Language Recognition: The challenge of recognizing speech
in low-resource languages has been addressed through various innovative approaches. Miao et
al. (2022) proposed a multilingual Transformer language model tailored for speech recognition
in low-resource languages, emphasizing the model's adaptability across different linguistic
contexts. Krishna (2021) further extended this approach by leveraging multi-task learning with
conformers to improve ASR performance for low-resource Indian languages.[21]

Survey of Attention-Based Models: Karmakar et al. (2021) conducted a comprehensive survey on attention-based neural networks for ASR, providing insights into the various architectures
and techniques that have been developed. This survey underscored the critical role of attention
mechanisms in enhancing the performance of ASR systems by enabling them to focus on
pertinent parts of the input sequence.[22]

Self-Supervised Learning and Robust Speech Recognition: The application of self-supervised learning techniques has revolutionized ASR, enabling models to leverage large amounts of
unlabeled data. Schneider et al. (2019) introduced wav2vec, an unsupervised pre-training
framework that significantly improved ASR performance by learning robust speech
representations from raw audio data. Baevski et al. (2020) further advanced this approach with
wav2vec 2.0, which incorporated self-supervised learning to achieve state-of-the-art results in
various speech recognition benchmarks.[23]

Large-Scale ASR and Multitask Learning: Radford et al. (2023) explored the use of large-scale
weak supervision for robust speech recognition, demonstrating the effectiveness of leveraging
vast amounts of loosely labeled data to enhance model performance. Pratap et al. (2023) focused
on scaling speech technology to over 1,000 languages, highlighting the feasibility of
developing universal ASR systems capable of handling diverse linguistic contexts.[24]

Emotion Detection and Context-Aware Transcription: Recent advancements have also explored
the integration of additional features such as emotion detection and context-aware
transcription. The research by Xie et al. (2023) on finetuning the Whisper model for mixed
languages illustrates the ongoing efforts to improve ASR models' versatility and accuracy in
real-world applications. The inclusion of such advanced features adds significant value to ASR
systems by providing more nuanced and contextually relevant transcriptions.[25]

Systematic Literature Reviews and Surveys: Alharbi et al. (2021) conducted a systematic
literature review on ASR, providing a comprehensive overview of the state-of-the-art
techniques and identifying key trends and challenges in the field. Yadav and Sitaram (2022)
surveyed multilingual models for ASR, emphasizing the importance of developing robust
systems capable of handling diverse linguistic and acoustic environments.[26]

Enhancing ASR for Special Populations: Research has also focused on improving ASR systems
for specific populations, such as children and speakers of dialects. Liu et al. (2023) investigated
the application of sparsely shared LoRA on Whisper for child speech recognition, highlighting
the unique challenges and solutions for accurately transcribing children's speech. Similarly,
Sicard et al. (2023) extended state-of-the-art ASR models to Swiss German dialects,
showcasing the adaptability of modern ASR systems to regional linguistic variations.[27]

3. PROPOSED METHOD
3.1 METHODOLOGY AND FLOWCHARTS
Introduction:
The Speech-To-Text (STT) application using Whisper ASR represents a pivotal advancement
in leveraging state-of-the-art automatic speech recognition technology for real-world
applications. This project focuses on developing a robust system capable of accurately
transcribing diverse audio inputs into text, accommodating various accents, languages, and
environmental conditions. By harnessing the power of the medium-sized Whisper model,
optimized for a balance of computational efficiency and accuracy, the application aims to
deliver seamless and reliable speech recognition capabilities.

Key components of the project include meticulous data collection and preprocessing to curate
a diverse dataset of audio recordings from sources such as the internet. These recordings are
standardized in format and quality, ensuring consistency and suitability for training the Whisper
ASR model. The model selection process prioritizes efficiency on hardware configurations like
Intel i7 11th generation processors and 16GB of RAM, enabling optimal performance during
both training and inference phases.

Audio processing techniques, including segmentation into manageable segments and extraction
of log-Mel spectrograms, are employed to capture detailed acoustic features essential for robust
speech recognition. Model training involves rigorous dataset preparation, hyperparameter
optimization, and performance tuning to enhance transcription accuracy and processing speed
across various linguistic contexts and acoustic environments.

Integration of specialized functionalities, such as language identification and multilingual transcription capabilities via special tokens, enhances the application's versatility and utility. A
user-friendly interface design emphasizes accessibility and ease of use, supporting features like
voice command recognition and real-time feedback to enhance user interaction.

Testing and evaluation play a crucial role in validating the application's performance metrics,
including transcription accuracy metrics like Word Error Rate (WER) and efficiency measures
such as latency and resource utilization. Through comprehensive testing scenarios and
benchmark comparisons, the project aims to meet stringent performance standards and user
expectations, thereby demonstrating the effectiveness and reliability of the developed STT
application using Whisper ASR.

1. Data Collection and Preprocessing:

Data Collection

Gathering Diverse Dataset:

The foundation of building a robust Speech-To-Text (STT) application using Whisper ASR
begins with the acquisition of a comprehensive and diverse dataset of audio recordings. This
dataset should be sourced from various sources, including online repositories, public datasets,
and possibly proprietary collections. The goal is to encompass a wide range of accents,
languages, and environmental conditions to ensure the model's adaptability and accuracy across
diverse real-world scenarios.

Variety of Accents and Languages:

The dataset should intentionally include recordings spoken in different accents and dialects to
train the Whisper ASR model to recognize and transcribe speech variations accurately. This
diversity helps mitigate biases and improves the model's generalization capabilities, making it
suitable for a global audience.

Incorporating audio recordings captured in various environmental conditions is crucial. This includes recordings made in quiet environments, noisy backgrounds, indoor settings,
outdoor environments, and situations with varying levels of ambient noise. Such
variability challenges the model to handle acoustic complexities and enhances its
robustness in real-world usage.

Preprocessing

Standardization of Audio Files:

Once the dataset is compiled, the next step involves preprocessing to standardize the format
and quality of audio files. This standardization ensures consistency across the dataset, which is
essential for effective training and reliable performance of the Whisper ASR model.

Format Conversion:

Audio files collected from different sources may come in varied formats (e.g., MP3, WAV,
FLAC). Converting all audio files into a common format simplifies data handling and
preprocessing tasks. Typically, formats like WAV or FLAC are preferred due to their lossless
quality and compatibility with deep learning frameworks.

Normalization of Audio Levels:

Adjusting audio levels to a standard amplitude is critical to prevent volume discrepancies across recordings. Normalization techniques ensure that all audio inputs have consistent
loudness levels, which helps maintain uniformity in feature extraction and model training.
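
A simple peak-normalization sketch using librosa and NumPy is shown below; the function name and target peak level are illustrative choices rather than fixed project settings:

import numpy as np
import librosa

def peak_normalize(path, target_peak=0.9):
    # Load mono audio at 16 kHz (the rate Whisper expects) and scale to a common peak amplitude.
    y, sr = librosa.load(path, sr=16000, mono=True)
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y * (target_peak / peak)
    return y, sr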

Augmentation for Diversity:

To enhance dataset diversity and improve model robustness, augmentation techniques can be
applied. This may involve artificially modifying audio files by introducing background noise,
pitch variations, or time shifts. Augmentation techniques simulate real-world variations,
thereby exposing the model to a broader spectrum of speech patterns and environmental
conditions.
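
For example, additive noise and pitch shifting can be applied with NumPy and librosa; the parameter values here are arbitrary starting points, not tuned settings:

import numpy as np
import librosa

def augment(y, sr, noise_level=0.005, pitch_steps=2):
    # Additive Gaussian noise simulates a noisier recording environment.
    noisy = y + noise_level * np.random.randn(len(y))
    # Pitch shifting simulates variation between speakers.
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)
    return noisy, shifted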

Annotation and Transcripts:

For supervised learning tasks, annotations or transcripts should accompany each audio
recording where available. These annotations provide ground truth labels, aiding the model in
learning correct transcription patterns during training. Manual or automatic transcription tools
can be employed to generate accurate text transcripts aligned with each audio file.

By meticulously curating and preprocessing a diverse dataset in this manner, the Speech-To-
Text application using Whisper ASR can effectively train a model that demonstrates high
accuracy and reliability in transcribing speech across different languages, accents, and real-
world environments.

2. Model Selection and Configuration

Model Selection:

Choosing the right model size is crucial for optimizing both performance and computational
efficiency of the Speech-To-Text (STT) application using Whisper ASR.

Consideration of Model Size: The Whisper ASR system offers multiple model sizes, each
with varying numbers of parameters and computational requirements. The selection of the
medium-sized Whisper model, with 769 million parameters, is based on its ability to strike a
balance between accuracy and computational efficiency. This makes it well-suited for
applications requiring real-time or near-real-time speech transcription capabilities.

Performance Requirements: Define specific performance metrics such as transcription accuracy, processing speed, and latency. The medium-sized model is chosen to meet these
requirements effectively, ensuring high accuracy in converting speech to text while
maintaining responsiveness in real-time applications.

Adaptability to Hardware Constraints: Assess hardware constraints such as the Intel i7 11th
generation processor and 16GB of RAM. These specifications are sufficient for handling
intensive computational tasks involved in deep learning. Ensure the selected model size aligns
with the available hardware resources to optimize performance without encountering memory
constraints or processing bottlenecks.

Where supported, GPU acceleration can further reduce training time and improve the real-time performance of the Whisper ASR model.

Configuration Optimization:

Fine-tuning the model configuration involves optimizing parameters and settings to maximize
performance and efficiency.

Hyperparameter Tuning: Adjust parameters such as learning rate, batch size, and optimizer
settings to achieve optimal convergence and training efficiency. Hyperparameter tuning helps
in improving model accuracy and reducing training time, contributing to overall system
performance.

Resource Management: Implement strategies to manage hardware resources effectively, such
as batch processing and memory caching techniques. These strategies help in minimizing
computational overhead and maximizing hardware utilization efficiency.

3. Audio Processing and Feature Extraction

Audio Segmentation:

Audio segmentation involves dividing long audio recordings into smaller segments to enhance
processing efficiency and manage memory constraints during model training and inference.

Segmentation Strategy: Divide audio files into segments of approximately 30 seconds each.
This segmentation duration is chosen based on practical considerations to balance processing
efficiency with maintaining context within each segment. Shorter segments may not capture
sufficient contextual information, while longer segments can increase computational load and
memory requirements.
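
A straightforward way to realize this strategy, sketched here with librosa and a hypothetical file path, is to slice the waveform into fixed 30-second windows:

import librosa

def segment_audio(path, segment_seconds=30, target_sr=16000):
    # Load the full recording once, then return a list of 30-second chunks.
    y, sr = librosa.load(path, sr=target_sr, mono=True)
    hop = segment_seconds * sr
    return [y[i:i + hop] for i in range(0, len(y), hop)]

segments = segment_audio("lecture_recording.wav")  # file name is illustrative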

Benefits of Segmentation:

Memory Management: Segmented audio files consume less memory individually compared
to processing entire recordings at once. This approach minimizes the risk of memory overflow
during model training and inference, ensuring stable performance.

Parallel Processing: Segmented data allows for parallel processing, leveraging multi-core
processors or GPU acceleration where applicable. This parallelization enhances computational
efficiency and reduces processing time, crucial for real-time or near-real-time speech
transcription tasks.

Error Handling: Segmenting audio files also facilitates better error handling and recovery
strategies. In case of processing failures or interruptions, the impact is limited to the affected
segment rather than the entire audio file, improving system robustness and reliability.

Conversion Process: Convert audio signals into log-Mel spectrograms, which are 2D
representations of audio frequency distributions over time. This transformation mimics the
human auditory system's sensitivity to frequencies, emphasizing important speech-related
features such as pitch and timbre.

Mel Scale: Utilize the Mel scale to map audio frequencies into perceptually meaningful units.
This scale enhances the model's ability to differentiate between speech and background noise,
improving overall transcription accuracy.

Logarithmic Transformation: Apply a logarithmic transformation to the Mel spectrogram to compress dynamic range and highlight subtle acoustic variations. This transformation enhances
feature discrimination and facilitates more robust pattern recognition by the Whisper ASR
model.
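
The Mel-scale and logarithmic steps described above can be reproduced approximately with librosa, as in the following sketch; using 80 Mel bands mirrors Whisper's input, but the remaining parameters are assumptions:

import numpy as np
import librosa

def log_mel_features(y, sr=16000, n_mels=80):
    # Mel filterbank energies followed by log (dB) compression of the dynamic range.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)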

Input Features for Whisper ASR:

Acoustic Information: The extracted log-Mel spectrograms serve as primary input features
for the Whisper ASR model. These features encode detailed acoustic information necessary for
accurately transcribing speech into text, accommodating variations in pronunciation, accents,
and environmental noise.

Preprocessing Considerations: Ensure standardized preprocessing techniques across all audio segments to maintain consistency in feature extraction. Techniques such as normalization
and augmentation may be applied to further enhance dataset diversity and model
generalization.

4. Model Training and Optimization:

Dataset Preparation: Prepare training, validation, and possibly test datasets for model
training. Ensure these datasets are representative of the target application's use cases, including
various languages and accents.

Training Procedure: Train the Whisper ASR model using the prepared dataset. This involves
optimizing hyperparameters (e.g., learning rate, batch size) and applying regularization
techniques (e.g., dropout, weight decay) to prevent overfitting and improve generalization.

Performance Optimization: Fine-tune the model to optimize for transcription accuracy, speed, and robustness across different speech inputs and environmental conditions. Monitor
training metrics closely to identify and address any performance bottlenecks.

5. Integration of Specialized Functionalities

Special tokens are integral to extending the functionality of the Whisper ASR system beyond
basic transcription capabilities. These tokens enable advanced features that enhance the
versatility and utility of the Speech-To-Text (STT) application.

Language Identification:

Token Functionality: Integrate tokens within the ASR system to perform automatic language
identification (LID). These tokens allow the model to detect and switch between different
languages spoken within a single audio stream or across different segments.

Implementation: During audio processing, the ASR model identifies language-specific cues
encoded by these tokens. This capability is crucial for applications requiring multilingual
support or mixed-language environments.

Timestamping for Temporal Alignment:

Token Functionality: Use tokens to insert timestamps at specific intervals within the
transcription output. These timestamps provide temporal alignment, enabling precise
synchronization with corresponding audio segments or real-time speech inputs.

Implementation: The ASR model embeds timestamps within the transcription output,
indicating key points in the audio timeline. This feature is beneficial for applications such as
transcription of lectures, meetings, or live broadcasts where accurate temporal referencing is
essential.
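
In practice, segment-level timestamps are already present in Whisper's transcription output; a small sketch of reading them (with an assumed file name) is shown below:

import whisper

model = whisper.load_model("medium")
result = model.transcribe("meeting.wav")
for seg in result["segments"]:
    # Each segment carries start and end times in seconds alongside its text.
    print(f"[{seg['start']:7.2f} -> {seg['end']:7.2f}] {seg['text']}")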

Multilingual Transcription:

Token Functionality: Incorporate tokens to facilitate seamless transcription across multiple languages within the same session. These tokens signal the ASR model to adapt language
models dynamically, ensuring accurate and contextually appropriate transcription outputs.

Implementation: When encountering language-switching cues within audio segments, the ASR system adjusts its language model parameters based on detected tokens. This capability
supports diverse linguistic contexts and enhances user experience in multilingual
environments.

Translation to/from English:

Token Functionality: Utilize tokens to trigger translation functionalities within the ASR
system, enabling on-the-fly translation of transcribed text into different languages, including
English.

Implementation: Upon detecting translation tokens, the ASR model integrates with translation
APIs or internal modules to convert transcribed text segments into desired target languages.
This feature facilitates real-time communication and accessibility across language barriers.
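
Note that Whisper's built-in translation task produces English text from non-English speech; translation into other target languages would go through an external translation API or module, as described above. A minimal sketch of the built-in path, with an assumed input file, is:

import whisper

model = whisper.load_model("medium")
# task="translate" transcribes the speech and renders the output in English.
result = model.transcribe("french_clip.wav", task="translate")
print(result["text"])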

Functional Validation: Conduct rigorous testing and validation procedures to ensure the
accurate implementation and functionality of special tokens within the Whisper ASR system.
Verify that tokens correctly trigger intended functionalities and adhere to predefined
application requirements.

Performance Evaluation: Assess the impact of specialized functionalities on overall system performance metrics such as transcription accuracy, processing speed, and resource utilization.
Use benchmark datasets and real-world testing scenarios to validate performance under varied
conditions.

6. Environment Setup

To ensure optimal performance and reliability for developing the Speech-To-Text application
using the Whisper ASR system, the following environment setup is recommended:

Processor:

Utilize an Intel i7 11th generation processor or equivalent. This processor provides ample
processing power necessary for handling intensive computations involved in deep learning
tasks.

RAM:

Maintain a minimum of 16GB RAM. This capacity is crucial for accommodating large datasets
and facilitating efficient model training without encountering memory constraints.

Operating System:

Use Windows 11 as the operating system. Windows 11 offers a stable and reliable development
environment suitable for deep learning applications, ensuring seamless execution and support
for necessary software dependencies.

7. User Interface Design

UI Development

Designing an intuitive and user-friendly interface is crucial for the success of the Speech-To-
Text (STT) application. Here are some key considerations and features to implement:

Intuitive Layout:

Simple Navigation: Ensure that the navigation is straightforward and intuitive, allowing users
to easily access the main features of the application.

Clear Instructions: Provide clear and concise instructions or tooltips to guide users through the
process of using the STT application.

Accessibility Features

Voice Command Recognition: Implement voice command capabilities to allow users to control
the application hands-free. This is particularly beneficial for users with disabilities.

High Contrast Mode: Offer a high contrast mode to assist users with visual impairments.

Font Size Adjustments: Allow users to adjust the font size for better readability.

Error Handling:

User Notifications: Provide clear and helpful notifications when errors occur. For instance, if
the application is unable to process a voice command, inform the user and suggest possible
solutions.

Retry Options: Allow users to retry voice commands or transcriptions easily without having to
start from scratch.

8. Testing and Evaluation:

Testing Scenarios

To ensure the STT application performs well under diverse conditions, conduct comprehensive
testing covering various scenarios:

Transcription Accuracy:

Different Accents and Dialects: Test the application with speakers from different regions to
ensure it accurately transcribes a variety of accents and dialects.

Multiple Languages: Verify that the application supports and accurately transcribes multiple
languages as intended.

Processing Speed:

Real-time Transcription: Measure the time taken for real-time transcription and ensure it meets
user expectations for responsiveness.

Batch Processing: Evaluate the speed and efficiency of batch processing for pre-recorded audio
files.

Resource Utilization:

CPU/GPU Usage: Monitor CPU and GPU utilization during transcription to ensure the
application runs efficiently without overloading the system.

Memory Consumption: Check for memory leaks and ensure the application manages memory
effectively.

Acoustic Environments:

Quiet vs. Noisy Environments: Test the application's performance in various acoustic
environments, from quiet rooms to noisy outdoor settings.

Background Noise Handling: Assess how well the application filters out background noise and
focuses on the primary speaker's voice.

Performance Metrics

Define and measure key performance metrics to validate the application's performance:

Word Error Rate (WER):

Definition: WER is a common metric used to evaluate the accuracy of speech recognition
systems. It is calculated as the ratio of the number of errors (substitutions, deletions, and
insertions) to the total number of words in the reference transcript.
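
The definition above can be implemented directly with a word-level edit distance; the following self-contained sketch computes WER for a single reference/hypothesis pair:

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + deletions + insertions) / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # one deletion -> ~0.167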

Benchmarking: Compare the WER against industry benchmarks and user expectations to
ensure the application delivers high accuracy.

Transcription Latency:

Definition: Transcription latency is the time taken from when the speech is input until the
transcription is available.

Real-time Requirements: Ensure that the latency is low enough to provide a seamless real-time
transcription experience.
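
Latency can be measured by timing a transcription call end to end; a simple sketch, reusing the model size and an illustrative file name from earlier in this report, follows:

import time
import whisper

model = whisper.load_model("medium")
start = time.perf_counter()
result = model.transcribe("sample.wav")       # file name is illustrative
latency = time.perf_counter() - start
print(f"Transcription latency: {latency:.2f} s for {len(result['text'].split())} words")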

Fig. 1: Speech-to-text conversion pipeline

Fig. 1 shows the process of converting speech to text, which involves several intricate steps,
starting with the capture of speech through a microphone, which collects the analog sound
waves produced by the speaker. These analog signals are then subjected to filtering and
amplification to remove noise and enhance the signal strength. The refined analog signals are
subsequently converted into digital data using an Analog-to-Digital Converter (ADC),
transforming the continuous sound waves into discrete digital signals suitable for computer
processing. These digital signals undergo Digital Signal Processing (DSP), further refining and
preparing the data for feature extraction.

The processed digital data is transmitted over a network to a server or cloud-based system,
where more advanced processing and recognition occur, leveraging cloud computing resources
for handling complex computations. The data received by the server undergoes pre-processing
to normalize and prepare it for feature extraction, which includes steps like noise reduction,
normalization, and framing. During feature extraction, essential characteristics of the speech
signal are identified and extracted using techniques such as Mel-Frequency Cepstral
Coefficients (MFCCs) and spectrogram analysis.
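
As an illustration of the MFCC step mentioned above, librosa can extract a small set of cepstral coefficients per frame; 13 coefficients is a common but not mandatory choice, and the file name is a placeholder:

import librosa

y, sr = librosa.load("speech_sample.wav", sr=16000)   # illustrative input file
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, number_of_frames)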

These features are then fed into a recognition model, typically based on machine learning or
deep learning algorithms, which processes the features and generates the corresponding text
output. The final step in the process is converting the recognized speech into text, resulting in
a coherent text output such as "Hello World." This comprehensive process utilizes advanced
techniques in signal processing, machine learning, and cloud computing to achieve accurate
and efficient speech-to-text conversion.
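
As a concrete illustration of the feature-extraction step described above, the short sketch below computes MFCCs with librosa; the file name is a placeholder and the parameter choices are assumptions for illustration only.

Code:

import librosa

# Load the audio at its native sample rate (placeholder file name)
y, sr = librosa.load('speech_sample.wav', sr=None)

# Extract 13 Mel-Frequency Cepstral Coefficients per frame
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)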

3.2 IMPLEMENTATION

Code:

!pip install --upgrade torch

!pip install pytube

!pip install git+https://github.com/openai/whisper.git

!pip install git+https://github.com/librosa/librosa

!pip install --upgrade torch

This command updates the PyTorch library to the latest version available. PyTorch is used for
deep learning tasks like training neural networks.

!pip install pytube

Installs the pytube library, which helps in downloading YouTube videos and working with
video metadata using Python.

!pip install git+https://github.com/openai/whisper.git

Installs the whisper library from its GitHub repository. Whisper is an Automatic Speech
Recognition (ASR) system developed by OpenAI.

!pip install git+https://github.com/librosa/librosa

Installs the librosa library from its GitHub repository. Librosa is a Python package for audio
analysis, including loading, processing, and extracting features like spectrograms.

Code:

# Importing the necessary libraries

import torch

import whisper

import pytube

import librosa

import matplotlib.pyplot as plt

import numpy as np

import IPython.display as ipd

torch: PyTorch is a widely used open-source machine learning library primarily developed by
Facebook's AI Research lab (FAIR). It provides tools for building and training neural networks.

whisper: Whisper is an Automatic Speech Recognition (ASR) system developed by OpenAI. It is
used for converting spoken language into text.

pytube: Pytube is a Python library for downloading YouTube videos. It allows developers to
fetch video data, stream video content, and manage metadata.

librosa: Librosa is a Python package for music and audio analysis. It provides tools for loading
audio files, extracting features like mel-spectrograms, and performing various audio processing
tasks.

matplotlib.pyplot as plt: Matplotlib is a popular plotting library in Python. pyplot is a module
within Matplotlib that provides a MATLAB-like interface for creating plots and visualizations.

numpy as np: NumPy is a fundamental package for scientific computing in Python. It provides
support for large, multi-dimensional arrays and matrices, along with a collection of
mathematical functions to operate on these arrays efficiently.

IPython.display as ipd: IPython is an enhanced interactive Python interpreter. The
IPython.display module provides functions for displaying audio, video, images, and HTML
content within IPython/Jupyter notebooks.

Loading the Model

There are five model sizes to choose from; four of them have English-only versions, offering
speed and accuracy trade-offs. The model sizes are:

 tiny: 39M parameters, English-only model (tiny.en), Multilingual model (tiny),
Required VRAM (1GB), Relative speed (32x)
 base: 74M parameters, English-only model (base.en), Multilingual model (base),
Required VRAM (1GB), Relative speed (16x)
 small: 244M parameters, English-only model (small.en), Multilingual model (small),
Required VRAM (2GB), Relative speed (6x)
 medium: 769M parameters, English-only model (medium.en), Multilingual model
(medium), Required VRAM (5GB), Relative speed (2x)
 large: 1550M parameters, English-only model (N/A), Multilingual model (large),
Required VRAM (10GB), Relative speed (1x)

The tiny model is best suited to lightweight applications, the large model is preferable when
accuracy matters most, and the base, small, or medium models cover everything in between.
For this project, we use the medium model.

Code:

model_m = whisper.load_model('medium')
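
As a hedged variant of the call above, the model can also be placed explicitly on a GPU when one is available; the device selection below is an assumption about the runtime environment, not part of the original notebook.

Code:

import torch
import whisper

# Assumption: fall back to CPU when no CUDA-capable GPU is present
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model_m = whisper.load_model('medium', device=device)  # same medium checkpoint, explicit device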

Code:

file_path = '/content/Ed Sheeran - Shape of You (Official Music Video).mp3'

The audio file referenced by file_path above is used for this project. Next, we load it using
the load_audio() function.

Code:

# Loading the audio

audio_13 = whisper.load_audio(file_path)

audio_13

array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)


Next, we find the sampling interval, i.e., the time between successive measurements. The total
duration of the audio is taken as T = 13 seconds.

Code:

T = 13

# Checking the number of samples in our audio file

n_samples = audio_13.shape[0]

n_samples

4211543

The audio array contains 4,211,543 samples. Now we find the time between samples.

Code:

# Time between samples

delta = T / n_samples
delta
3.0867546645018227e-06

The time between samples is approximately 3.09e-06 seconds. Next, we find the sampling
frequency.

Code:

# Sampling frequency
Fs = 1/delta
Fs
323964.8461538462
The sampling frequency works out to approximately 323,965 Hz under the assumed duration T.
(Note that whisper.load_audio() resamples audio to 16,000 Hz, so this estimate is only as
accurate as the assumed value of T.) Next, we find the time of each sample.

Code:

# Time of each sample

time = np.linspace(0, (n_samples - 1) * delta, n_samples)

time

array([0.00000000e+00, 3.08675466e-06, 6.17350933e-06, ...,
       1.29999907e+01, 1.29999938e+01, 1.29999969e+01])
Now we plot the amplitude with respect to time:

Code:

plt.figure(figsize=(20, 10))

plt.title('Signal')

plt.plot(time, audio_13)

plt.ylabel('amplitude')

plt.xlabel('seconds')

plt.show()

This code snippet uses Matplotlib to visualize the audio signal. It creates a figure 20 inches
wide and 10 inches high (plt.figure(figsize=(20, 10))), sets the title to 'Signal', and plots
audio_13 against time, where time is the x-axis (in seconds) and audio_13 is the y-axis
(amplitude). Axis labels indicate that the y-axis represents 'amplitude' and the x-axis
represents 'seconds'. Finally, plt.show() displays the plot. This visualization helps to
understand how the audio signal's amplitude changes over time.

Fig.2

Fig.2 above shows the waveform of the signal. Now, we can use the pad_or_trim() method to ensure
the sample is in the right form for inference.

Code:
audio = whisper.pad_or_trim(audio_13)
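As a quick sanity check (a sketch, assuming the audio was loaded with whisper.load_audio(), which resamples to 16,000 Hz), pad_or_trim() should return exactly 30 seconds' worth of samples:

Code:

# Whisper's fixed input length is 30 s x 16,000 Hz = 480,000 samples
print(audio.shape)  # expected: (480000,)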
Next, we plot the amplitude with respect to time with trimmed/padded audio.

Code:
# Number of samples in our trimmed/padded audio
n_samples = audio.shape[-1]
# Time of each sample
time = np.linspace(0, (n_samples - 1) * delta, n_samples)
plt.figure(figsize=(20, 10))
plt.title('Signal')
plt.plot(time, audio)
plt.ylabel('amplitude')
plt.xlabel('seconds')
plt.show()

Fig.3

Fig.3 shows the trimmed/padded waveform. Next, we can compute a mel spectrogram by applying the
log_mel_spectrogram() function to our audio file. It converts the y-axis (frequency) into the mel scale:

Code:

mel = whisper.log_mel_spectrogram(audio).to(model_m.device)

The output is a tensor of log-mel values. Now we create two subplots: one is a regular
representation of the sound amplitude over time, and the other is our mel spectrogram:

fig, (ax1, ax2) = plt.subplots(2)

fig.tight_layout(pad=5.0)

ax1.plot(time, audio)

ax1.set_title('Signal')

ax1.set_xlabel('Time, seconds')

ax1.set_ylabel('Amplitude')

# Move the tensor to CPU before converting to NumPy

ax2.imshow((mel.cpu().numpy() * mel.cpu().numpy()) ** (1/2), interpolation='nearest', aspect='auto')

ax2.set_title('Mel Spectrogram of a Signal')

ax2.set_xlabel('Time, seconds')

ax2.set_ylabel('Mel Scale')


Fig.4

Fig.4 shows both views: the top plot gives a simple picture of the raw audio signal, while the
bottom plot provides a more detailed view of its frequency content, which is useful for
analyzing speech characteristics.

CHAPTER 4. RESULTS AND DISCUSSION
Next, we can move on to language detection.

Language detection

We will listen to our audio file and then detect the spoken language. Librosa's default sample
rate (sr) is 22,050 Hz, meaning there are 22,050 samples per second; this rate is passed to the
ipd.Audio() function to play the audio. (Since whisper.load_audio() returns audio at 16,000 Hz,
playback at 22,050 Hz will sound sped up.)

Code:

sr = 22050

ipd.Audio(audio, rate=sr)

The embedded player above plays the loaded audio.

Next, we can obtain the probability of each language by using the detect_language() method.
The method returns the language tokens together with a dictionary of language probabilities:

_, probs = model_m.detect_language(mel)

Fig.5

From Fig.5 above, we can see the probability of each candidate language. English has the highest
probability (98.5%), so it is identified as the spoken language of the audio file.
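
To turn this probability dictionary into a single prediction programmatically, the most likely language code can be selected as in the short sketch below (probs is the dictionary returned above).

Code:

# Pick the language code with the highest probability
detected_language = max(probs, key=probs.get)
print(f'Detected language: {detected_language} ({probs[detected_language]:.1%})')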

Next, we can move on to transcription.

Transcription

transcription = model_m.transcribe(file_path, fp16=False)['text']

Fig.6

From the transcription shown in Fig.6 above, we can see that our Speech-To-Text system works
very well: it transcribed the audio accurately.
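
Beyond the plain text, transcribe() also returns segment-level timestamps; the hedged sketch below prints them (it reruns transcription and assumes the same model and file as above).

Code:

result = model_m.transcribe(file_path, fp16=False)
for segment in result['segments']:
    # Each segment carries start/end times (in seconds) and its text
    print(f"[{segment['start']:.2f}s - {segment['end']:.2f}s] {segment['text']}")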

In addition, we can translate our audio file into another language.

Translation

We translate our audio file to French by setting language='fr'.

translation = model_m.transcribe(file_path, language='fr', fp16=False)['text']

Fig.7

Fig.7 shows that our audio file can be rendered in other languages as well.
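
For completeness, Whisper also exposes a dedicated translate task that renders speech from other languages as English text; a minimal sketch of that call is shown below (the task parameter is part of the Whisper API, not of the code above).

Code:

# task='translate' asks Whisper to output English text for non-English audio
to_english = model_m.transcribe(file_path, task='translate', fp16=False)['text']
print(to_english)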

Future Scope

Enhanced Multilingual Capabilities:

 Expanding Language Support: Continuously adding more languages to the system to
cover a wider global audience. This involves creating and training models that can
understand and transcribe languages that are less commonly spoken or have fewer
resources available.
 Improving Transcription and Translation Accuracy: Enhancing the precision of
transcriptions and translations for less commonly used languages by refining language
models, utilizing more diverse datasets, and incorporating linguistic nuances specific
to these languages.
 Dialect and Accent Recognition: Developing specialized models tailored to recognize
and accurately transcribe various dialects and regional accents. This includes collecting
and analyzing speech data from different regions to train models that can handle
linguistic variations effectively.

Integration with Various Industries:

 Customer Service Applications: Deploying the system in customer service settings to
automatically transcribe calls. This improves response times, helps in generating
insights from customer interactions, and enhances overall service quality by providing
accurate transcriptions for training and quality assurance purposes.
 Healthcare Applications: Utilizing the technology for medical dictations, enabling
healthcare professionals to accurately document patient information, thereby reducing
administrative tasks and improving record-keeping efficiency. This can lead to better
patient care and streamlined operations in healthcare facilities.
 Educational Settings: Implementing real-time transcription services in educational
environments to support students with hearing impairments or those who speak
different languages. This can include transcribing lectures and providing translations,
thus enhancing accessibility and inclusivity in educational institutions.

Real-Time Processing and Scalability:

 Live Audio Transcription: Enhancing the system's ability to process and transcribe
live audio streams in real time, making it suitable for applications such as live
broadcasts, webinars, and real-time communication tools (a minimal chunked-capture
sketch follows this list).
 Scalability: Developing the system to handle large volumes of audio data, ensuring it
can scale to meet the needs of enterprise-level applications. This includes optimizing
server infrastructure, utilizing cloud computing resources, and implementing load
balancing to manage high demand effectively.
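
As referenced above, one simple way to approximate live transcription is to capture short audio chunks and transcribe each in turn. The sketch below is an assumption-laden illustration: it presumes the sounddevice package, a working microphone, and the medium model loaded earlier; it is not part of the current application.

Code:

import sounddevice as sd  # assumed to be installed; not used elsewhere in this project
import whisper

model = whisper.load_model('medium')
SR = 16000          # Whisper's expected sample rate
CHUNK_SECONDS = 10  # length of each captured chunk

while True:
    # Record one chunk from the default microphone
    chunk = sd.rec(int(CHUNK_SECONDS * SR), samplerate=SR, channels=1, dtype='float32')
    sd.wait()
    # Transcribe the chunk and print the partial result
    print(model.transcribe(chunk.flatten(), fp16=False)['text'])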

User Interface and Accessibility Improvements:

 User-Friendly Interface: Designing an intuitive and user-friendly interface that works
seamlessly across a wide range of devices and platforms, including desktops, tablets,
and smartphones. This ensures that users can easily access and utilize the system
regardless of their technical proficiency.
 Accessibility Features: Ensuring the system is accessible to users with disabilities by
incorporating features such as screen reader support, voice commands, and
customizable interface settings. This makes the system usable by a broader audience,
including those with visual or motor impairments.

Advanced Features and Functionalities:

 Emotion Detection: Introducing emotion detection capabilities to recognize and
transcribe the emotional tone of the speaker, providing more contextually rich and
accurate outputs.
 Speaker Identification: Implementing features that can identify and differentiate
between multiple speakers in a conversation, which is useful for meeting transcriptions
and multi-participant dialogues.
 Context-Aware Transcriptions: Developing context-aware transcription capabilities
that can understand the context of the conversation to provide more accurate and
relevant transcriptions.
 Cloud Integration: Integrating with cloud-based storage solutions and collaboration
tools to facilitate seamless sharing, management, and collaboration on transcribed
content across teams and organizations.

Continuous Learning and Adaptation:

 User Feedback Integration: Implementing machine learning algorithms that enable
the system to learn and adapt based on user feedback. This continuous learning process
helps in improving the system's performance over time by addressing user-specific
needs and preferences.
 Bias Reduction: Leveraging AI advancements to identify and reduce biases in speech
recognition, ensuring fair and accurate transcriptions for all users, regardless of their
accent, dialect, or language. This involves using diverse datasets and refining models
to minimize any inherent biases.

CHAPTER 5. CONCLUSION
The development of a Speech-To-Text (STT) application using the Whisper ASR system has
demonstrated the transformative potential of advanced automatic speech recognition
technology across various industries. This project has effectively showcased Whisper's
capabilities in handling diverse accents and background noise, providing accurate and reliable
transcriptions. By leveraging Whisper's robust architecture based on an encoder-decoder
Transformer model, the project has achieved exceptional performance in speech-to-text
conversion. The sophisticated design allows the system to accurately capture and transcribe
spoken language, even in challenging acoustic environments. Extensive testing and
optimization have significantly reduced the Word Error Rate (WER), enhancing the reliability
of the transcriptions. This improvement is crucial for applications where precision is
paramount, such as legal transcriptions and medical documentation.

The system's ability to transcribe and translate speech in multiple languages enhances its
accessibility and usability for a global audience. This multilingual support is essential for
creating inclusive technologies that cater to diverse linguistic communities. The project has
demonstrated the system's capability to understand and accurately transcribe different dialects
and regional accents. This feature is particularly valuable in multicultural societies and
international markets where linguistic diversity is the norm. By providing support for various
languages and accents, Whisper ASR ensures that communication is seamless and effective
across different linguistic backgrounds.

The STT application has been validated for practical use in customer service settings, where it
can automatically transcribe calls. This application improves response times, aids in quality
assurance, and provides valuable insights from customer interactions. In the healthcare sector,
the technology has shown its utility in accurately documenting medical dictations, reducing
administrative burdens on healthcare professionals, improving record-keeping, and enhancing
patient care. In educational environments, the STT application has proven beneficial by
providing real-time transcriptions of lectures. This aids students with hearing impairments or
those who speak different languages, making education more inclusive and accessible.

The project has laid a strong foundation for addressing existing challenges in speech
recognition technology. Future efforts can focus on improving the system's ability to handle
complex linguistic structures, varying speech patterns, and background noise in even more
diverse conditions. By exploring new applications, Whisper ASR can significantly enhance
user experience and operational efficiency. Potential future applications include real-time
language translation in international business settings, automated meeting transcriptions for
corporate environments, and interactive voice response systems in customer service.

The continuous evolution of this technology promises to make speech-to-text conversion more
accessible, efficient, and user-friendly. Ongoing research and development efforts will likely
yield improvements in accuracy, speed, and the ability to handle more complex tasks. By
making communication across different languages and regions seamless, Whisper ASR can
bridge linguistic divides. This fosters better understanding and collaboration in a globally
connected world, contributing to social and economic progress.

In conclusion, the Whisper ASR system's development and implementation in a Speech-To-Text
application have showcased its potential to revolutionize various industries. By addressing
current limitations and exploring innovative applications, this technology stands to
significantly enhance communication, accessibility, and efficiency in the digital age. As
Whisper ASR continues to evolve, it promises to bring about a future where speech-to-text
conversion is not only highly accurate but also universally accessible, breaking down language
barriers and fostering seamless communication worldwide.

CHAPTER 6. REFERENCES
[1] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). Llm. int8 (): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.

[2] Fu, Z., Yang, H., So, A. M. C., Lam, W., Bing, L., & Collier, N. (2023, June). On the
effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on
Artificial Intelligence (Vol. 37, No. 11, pp. 12799-12807).

[3] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). Lora:
Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.

[4] Joseph, V. R. (2022). Optimal ratio for data splitting. Statistical Analysis and Data Mining:
The ASA Data Science Journal, 15(4), 531-538.

[5] Kumar, A., & Mittal, V. (2019). Speech recognition: A complete perspective. International
Journal of Recent Technology and Engineering (IJRTE), 7(6), 78-83.

[6] Li, J. (2022). Recent advances in end-to-end automatic speech recognition. APSIPA
Transactions on Signal and Information Processing, 11(1).

[7] Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. A. (2022).
Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.
Advances in Neural Information Processing Systems, 35, 1950-1965.

[8] Min, S., Lewis, M., Zettlemoyer, L., & Hajishirzi, H. (2021). Metaicl: Learning to learn in
context. arXiv preprint arXiv:2110.15943.

[9] Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2020). Cross-lingual machine speech
chain for javanese, sundanese, balinese, and bataks speech recognition and synthesis. arXiv
preprint arXiv:2011.02128.

[10] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July).
Robust speech recognition via large-scale weak supervision. In International Conference on
Machine Learning (pp. 28492-28518). PMLR.

[11] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language
models are unsupervised multitask learners. OpenAI blog, 1(8), 9.

[12] Rouditchenko, A., Khurana, S., Thomas, S., Feris, R., Karlinsky, L., Kuehne, H., ... &
Glass, J. (2023). Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech
Pre-Training for Adaptation to Unseen Languages. arXiv preprint arXiv:2305.12606

[13] Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., … Almojil,
M. (2021). Automatic Speech Recognition: Systematic Literature Review. IEEE Access, 9,
131858–131876.

[14] Wang, C., Cho, K., & Gu, J. (2020, April). Neural machine translation with byte-level
subwords. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 05,
pp. 9154-9160).

[15] Butryna, A., Chu, S. H. C., Demirsahin, I., Gutkin, A., Ha, L., He, F., ... & Wibawa, J. A.
E. (2020). Google crowdsourced speech corpora and related open-source resources for low-
resource languages and dialects: an overview. arXiv preprint arXiv:2010.06778.

[16] Toraman, C., Yilmaz, E. H., Şahinuç, F., & Ozcelik, O. (2023). Impact of tokenization on
language models: An analysis for turkish. ACM Transactions on Asian and Low-Resource
Language Information Processing, 22(4), 1-21.

[17] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.

[18] Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., ... & Le, Q. V. (2021).
Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

[19] Yang, H., Zhang, M., Tao, S., Ma, M., & Qin, Y. (2023, February). Chinese ASR and NER
Improvement Based on Whisper Fine-Tuning. In 2023 25th International Conference on
Advanced Communication Technology (ICACT) (pp. 213-217). IEEE.

[20] Zhang, C., & Lu, Y. (2021). Study on artificial intelligence: The state of the art and future
prospects. Journal of Industrial Information Integration, 23, 100224.

[21] Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., ... &
Perrault, R. (2021). The AI index 2021 annual report. arXiv preprint arXiv:2103.06312.

[22] Zhang, X., Peng, Y., & Xu, X. (2019, September). An overview of speech recognition
technology. In 2019 4th International Conference on Control, Robotics and Cybernetics (CRC)
(pp. 81-85). IEEE.

[23] Graves, A., Mohamed, A.R., Hinton, G. (2013). Speech Recognition with Deep Recurrent
Neural Networks. 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 6645-6649. https://doi.org/10.1109/ICASSP.2013.6638947

[24] Chan, W., Jaitly, N., Le, Q., Vinyals, O. (2016). Listen, Attend and Spell: A Neural
Network for Large Vocabulary Conversational Speech Recognition. 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964.
https://doi.org/10.1109/ICASSP.2016.7472621

[25] Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., Kumar, S. (2020).
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders
and RNN-T Loss. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 7829-7833.
https://doi.org/10.1109/ICASSP40776.2020.9053896

[26] Karmakar, P., Teng, S.W., Lu, G. (2021). Thank you for attention: a survey on attention-
based artificial neural networks for automatic speech recognition. CoRR abs/2102.07259.
https://arxiv.org/abs/2102.07259

[27] Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., Courville, A.C.
(2017). Towards end-to-end speech recognition with deep convolutional neural networks.
CoRR abs/1701.02720. http://arxiv.org/abs/1701.02720

[28] Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., Alharbi,
S., Alturki, S., Alshehri, F., Almojil, M. (2021). Automatic speech recognition: Systematic
literature review. IEEE Access 9, 131858-131876.
https://doi.org/10.1109/ACCESS.2021.3112535

[29] Li, J. (2022). Recent advances in end-to-end automatic speech recognition.
http://arxiv.org/abs/2111.01690

[30] Miao, L., Wu, J., Behre, P., Chang, S., Parthasarathy, S. (2022). Multilingual transformer
language model for speech recognition in low-resource languages. 2022 Ninth International
Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 1-5.
https://doi.org/10.1109/SNAMS58071.2022.10062774

[31] Krishna, D.N. (2021). Multilingual speech recognition for low-resource Indian languages
using multi-task conformer. CoRR abs/2109.03969. https://arxiv.org/abs/2109.03969

[32] Yadav, H., Sitaram, S. (2022). A survey of multilingual models for automatic speech
recognition. https://arxiv.org/abs/2202.12576

[33] Alec, R., Karthik, N., Tim, S., Ilya, S. (2018). Improving language understanding by
generative pre-training. https://www.m

[34] Zhang, C., & Lu, Y. (2021). Study on artificial intelligence: The state of the art and future
prospects. Journal of Industrial Information Integration, 23, 100224.

[35] Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., ... &
Perrault, R. (2021). The AI index 2021 annual report. arXiv preprint arXiv:2103.06312.

