Converting Speech to Text
Converting Speech to Text
Abstract:
The objective of this project is to develop a Speech-To-Text application utilizing the Whisper
Automatic Speech Recognition (ASR) system. Whisper is an advanced ASR model trained on
680,000 hours of diverse speech data, enabling it to effectively handle various accents and
background noise. The architecture of Whisper follows an encoder-decoder Transformer
model, where input audio is split into 30-second chunks, converted into a log-Mel spectrogram,
and processed through an encoder. The decoder then predicts the corresponding text,
incorporating tasks such as language identification, phrase-level timestamps, multilingual
transcription, and translation. This project employs the medium-sized Whisper model,
balancing speed and accuracy. The workflow involves loading an audio file, performing signal
analysis, and generating a Mel spectrogram for visualization. The system then detects the
language of the audio and transcribes it into text. Additionally, the application can translate the
transcribed text into different languages, demonstrated by translating English audio to French.
The developed Speech-To-Text application showcases the effectiveness of Whisper in
accurately transcribing audio, offering potential applications in areas such as audio
transcription, dictation, voice commands, online search, and enhanced customer service.
Table of Contents
SI. No Topic Page No
1. INTRODUCTION
1.1 Introduction 1
1.2 Objectives 5
1.3 Problem Statement 8
2 LITERATURE REVIEW 9
3 PROPOSED METHOD
3.2 Implementation 27
5 CONCLUSIONS 41
6 REFERENCES 43
1.INTRODUCTION
The primary objective of this project is to develop an advanced Speech-To-Text application
using the Whisper Automatic Speech Recognition (ASR) system. Whisper represents a cutting-
edge advancement in ASR technology, distinguished by its extensive training on a vast dataset
comprising 680,000 hours of diverse speech data sourced from the internet. This rigorous
training regimen equips Whisper with unparalleled capabilities to effectively handle a wide
range of accents, dialects, and background noise, ensuring robust performance in demanding
real-world applications that require high levels of accuracy and reliability.
The encoder component of Whisper plays a critical role in the ASR pipeline by extracting
intricate acoustic features from the log-Mel spectrograms. These features encapsulate the
detailed acoustic characteristics essential for precise speech recognition. The encoded
representations are then passed to the decoder, which is meticulously trained to predict accurate
text captions. Beyond mere transcription, the decoder incorporates specialized tokens that
enable sophisticated functionalities such as language identification, precise timestamping,
multilingual transcription, and translation into multiple languages, including English.
1
In addition to these features, Whisper exhibits an exceptional capability to detect and transcribe
speech under challenging conditions. This includes handling speech in noisy environments,
recognizing overlapping speech from multiple speakers, and transcribing audio with low-
quality recordings. Whisper's model can discern speech patterns and provide accurate
transcriptions even when dealing with varying speech speeds and unclear articulation. This
robustness is achieved through the model's extensive training data and sophisticated
architecture, which allows it to generalize well across different scenarios and speaker profiles.
The Whisper ASR system also incorporates advanced noise reduction algorithms that further
improve transcription accuracy. By filtering out background noise and isolating the primary
speech signal, Whisper ensures that the transcriptions remain clear and precise, even in
acoustically challenging environments. This capability is particularly beneficial for
applications in busy public spaces, outdoor settings, or industrial environments where ambient
noise levels can be high.
Moreover, Whisper is designed with scalability in mind, making it suitable for deployment in
large-scale enterprise environments. The system's architecture supports distributed processing,
enabling it to handle high volumes of data and concurrent transcription tasks without
compromising on performance. This scalability is essential for applications in call centers,
customer support services, and other high-demand settings where real-time speech processing
is critical.
The project also explores the integration of Whisper with various hardware platforms,
including mobile devices, embedded systems, and cloud-based services. By optimizing the
model for different hardware configurations, the application aims to deliver consistent
performance across a wide range of devices. This flexibility ensures that users can access
reliable speech-to-text services regardless of their device or operating environment.
2
To further enhance the user experience, the project includes the development of a
comprehensive user interface (UI) that is intuitive and accessible. The UI design prioritizes
ease of use, enabling users to interact with the speech-to-text application effortlessly. Features
such as voice command recognition, real-time feedback, and customizable settings are
incorporated to cater to diverse user needs and preferences. The goal is to create an interface
that not only functions effectively but also enhances user satisfaction and engagement.
Testing and evaluation are critical components of the project, ensuring that the Whisper ASR
system meets the highest standards of accuracy and reliability. The project employs a rigorous
testing framework that includes a variety of scenarios, such as different languages, accents,
speech speeds, and background noises. Performance metrics like Word Error Rate (WER),
transcription latency, and resource utilization are closely monitored to identify areas for
improvement and ensure the system's robustness in real-world applications.
In the context of healthcare, Whisper ASR can revolutionize the way medical professionals
handle patient records and documentation. By providing accurate and timely transcriptions of
medical dictations, the system can reduce administrative burdens on healthcare providers,
allowing them to focus more on patient care. Additionally, Whisper's ability to handle medical
terminologies and context-specific language ensures that the transcriptions are not only
accurate but also clinically relevant.
In educational settings, Whisper ASR can enhance learning experiences by providing real-time
transcriptions of lectures and classroom discussions. This feature is particularly beneficial for
students with hearing impairments or those who speak different languages, ensuring that they
can fully participate in and benefit from educational activities. The system's multilingual
capabilities also support the creation of inclusive learning environments, where language
barriers are minimized, and all students have access to the same information.
The project's future directions include exploring the integration of additional advanced
features, such as emotion detection, speaker identification, and context-aware transcriptions.
These features can add significant value to the speech-to-text application by providing more
nuanced and detailed transcriptions. For instance, emotion detection can help identify the
speaker's emotional state, which can be useful in customer service interactions or mental health
assessments. Speaker identification can differentiate between multiple speakers in a
conversation, ensuring that the transcriptions accurately reflect who said what.
3
Continuous learning and adaptation are also key areas of focus for the project's future
development. By implementing machine learning algorithms that enable the system to learn
from user feedback, the Whisper ASR system can continuously improve its performance over
time. This adaptive learning approach ensures that the system remains up-to-date with evolving
language patterns, new terminologies, and user preferences, maintaining its relevance and
effectiveness in dynamic environments.
Furthermore, the project aims to expand the range of supported languages and improve the
accuracy of transcriptions for lesser-known languages and regional dialects. Developing
specialized models for different dialects and accents will enhance recognition accuracy and
ensure inclusivity. This focus on linguistic diversity underscores the project's commitment to
creating a truly global speech recognition solution that can cater to users from various linguistic
backgrounds.
Integration with various industries is another significant aspect of the project's vision. In
customer service, Whisper ASR can automate call transcriptions, improving response times
and service quality. In healthcare, it can provide accurate medical dictations, enhancing record-
keeping and reducing administrative burdens on healthcare professionals. In educational
settings, real-time lecture transcriptions can aid students with hearing impairments or those
who speak different languages, ensuring they have equal access to educational content.
Real-time processing and scalability are crucial for the application's success. Enhancing the
system's capability to process and transcribe live audio streams in real-time will be a key focus.
This involves optimizing the model to leverage available hardware resources effectively,
potentially utilizing GPU acceleration for faster computation if supported. Ensuring
compatibility with the selected hardware specifications, such as an Intel i7 11th generation
processor and 16GB of RAM, is essential for maintaining high performance.
User interface and accessibility improvements are also prioritized in the project's development.
Developing a more intuitive and user-friendly interface that supports a wide range of devices
and platforms is crucial. Ensuring the system is accessible to users with disabilities by
incorporating features like screen reader support and voice commands will enhance its
usability. The goal is to create an inclusive application that caters to the needs of all users,
regardless of their physical abilities or technical proficiency.
4
Advanced features and functionalities, such as emotion detection, speaker identification, and
context-aware transcriptions, will be explored to provide more nuanced and accurate outputs.
Integrating with cloud-based storage solutions and collaboration tools will facilitate seamless
sharing and management of transcribed content across teams. These advanced features will add
significant value to the application by enhancing its functionality and user experience.
The project's commitment to continuous learning and adaptation is evident in its focus on
implementing machine learning algorithms that enable the system to learn and adapt from user
feedback. This approach will ensure that the Whisper ASR system continuously improves its
performance over time, reducing biases in speech recognition and ensuring fair and accurate
transcriptions for all users. Leveraging AI advancements to maintain the system's relevance
and effectiveness in dynamic environments is a key objective.
1.2 Objective
5
Implementing Whisper ASR involves using its medium-sized model effectively to process
audio inputs for accurate speech recognition. This starts with loading audio files and breaking
them into manageable segments. These segments are then converted into log-Mel
spectrograms, which capture the audio's frequency content over time in a format that Whisper
can analyze efficiently.
To improve speech recognition, algorithms are developed to extract detailed acoustic features
from these spectrograms using Whisper's encoder. These features include aspects like pitch and
tone that are crucial for understanding speech. The decoder part of Whisper is then trained to
convert these acoustic features into accurate text captions. Special tokens within the decoder
help in tasks like identifying languages, adding timestamps, and handling multilingual
transcriptions.
3. Multilingual Capabilities
The development phase involves configuring Whisper to recognize and process various
languages effectively. This includes optimizing language detection algorithms to accurately
identify the spoken language within audio inputs. Once identified, Whisper's decoder module
utilizes specialized tokens to facilitate multilingual transcription and translation tasks. This
6
functionality not only enhances the application's versatility but also underscores its potential to
foster cross-cultural communication and accessibility in multilingual environments.
4. Real-World Application
Demonstrating Whisper's efficacy in real-time transcription tasks is crucial for validating its
practical utility across diverse fields. Industries such as customer service, education, healthcare,
and telecommunications stand to benefit significantly from Whisper's ability to provide
accurate and timely transcriptions. Real-world scenarios require reliable speech recognition
systems capable of handling varied accents, dialects, and environmental noise conditions, all
of which Whisper addresses through its extensive training on diverse speech datasets.
The application's effectiveness can be showcased through case studies and use-case scenarios.
For instance, in customer service applications, Whisper enables automated transcription of
customer calls, facilitating improved service delivery and responsiveness. In educational
settings, the application supports lecture transcription and note-taking, enhancing accessibility
for students with hearing impairments or language barriers. Similarly, in healthcare, Whisper's
real-time transcription capabilities aid in medical dictation and patient record management,
streamlining administrative workflows and improving accuracy.
The UI design process involves iterative prototyping and usability testing to gather feedback
and refine interface elements. Visual representations of audio waveforms and spectrograms can
aid users in visualizing speech patterns and understanding transcription processes. Integration
with cloud-based storage solutions and collaboration tools enhances the application's utility in
professional settings, allowing seamless sharing and management of transcribed content across
teams.
7
6. Evaluation and Validation
Conducting rigorous testing and validation is essential to assess the developed application's
performance metrics comprehensively. Evaluation criteria include transcription accuracy,
processing speed, resource utilization (CPU/GPU usage), and overall system robustness.
Performance benchmarks should be established against industry standards and competitor
solutions to gauge the application's comparative advantage.
8
2.LITERATURE REVIEW
Automatic Speech Recognition (ASR) systems have undergone significant evolution, driven
by advances in machine learning models and the availability of large-scale training datasets.
Whisper ASR exemplifies this progress, having been trained on a vast dataset comprising
680,000 hours of diverse speech data sourced from the internet (Dettmers et al., 2022). This
extensive training enables Whisper to effectively handle various accents, dialects, and
background noise, ensuring robust performance in real-world applications that demand high
accuracy and reliability.[1]
The encoder component of Whisper ASR plays a critical role in extracting intricate acoustic
features from the log-Mel spectrograms. These features are then passed to the decoder, which
is trained to predict corresponding text captions while incorporating specialized tokens for
tasks such as language identification and timestamping (Novitasari et al., 2020). This dual
functionality not only enhances transcription accuracy but also extends Whisper's capabilities
to support multilingual speech transcription and translation into various languages (Yang et al.,
2023).[2]
Recent research has focused on optimizing ASR systems through parameter-efficient fine-
tuning techniques (Fu et al., 2023). These approaches aim to strike a balance between model
complexity and computational efficiency, thereby improving performance metrics such as
transcription accuracy and processing speed. Such optimizations are crucial for deploying ASR
systems in real-time applications across diverse sectors including customer service, healthcare,
and telecommunications (Alharbi et al., 2021; Radford et al., 2023).[5]
User interface design is another critical aspect of ASR applications, ensuring accessibility and
ease of use across different platforms (Wang et al., 2020). Intuitive interfaces not only facilitate
9
interaction with the ASR system but also enhance user experience, making it seamless to
dictate, transcribe, and interact with spoken language.[4]
Evaluation and validation methodologies are essential for assessing the reliability and
efficiency of ASR systems (Zhang et al., 2021). Metrics such as transcription accuracy,
processing speed, and resource utilization provide insights into the performance improvements
and optimizations needed for practical deployment in various real-world scenarios (Radford et
al., 2019).[6]
The encoder in Whisper extracts detailed acoustic features from these spectrograms, enabling
the model to understand the complex patterns in the speech signal. These features are then fed
into the decoder, which has been trained to predict text captions. The decoder's functionality is
enhanced by the integration of specialized tokens, allowing it to perform additional tasks such
as language identification, timestamping, multilingual transcription, and translation into
various languages, including English (Novitasari et al., 2020).[10[
Recent studies have focused on optimizing ASR systems to improve their efficiency and
performance. Techniques such as parameter-efficient fine-tuning have been shown to enhance
model accuracy while reducing computational requirements. Fu et al. (2023) highlighted the
effectiveness of these techniques in fine-tuning large language models, making them more
adaptable to specific tasks without the need for extensive retraining. This approach not only
conserves computational resources but also accelerates the deployment of ASR systems in real-
world applications.[9]
Additionally, research by Alharbi et al. (2021) and Radford et al. (2023) emphasizes the
significance of large-scale weak supervision in training robust ASR models. This approach
leverages diverse and abundant unlabeled data, enhancing the model’s ability to generalize
10
across different domains and languages. The integration of such strategies has significantly
bolstered the performance of ASR systems, making them more reliable and versatile.[8]
One of the standout features of Whisper ASR is its robust multilingual capability. The system’s
ability to handle multiple languages seamlessly makes it an invaluable tool for global
applications. Studies such as those by Novitasari et al. (2020) and Rouditchenko et al. (2023)
have explored the implementation of cross-lingual ASR systems, demonstrating their potential
to facilitate communication across different languages and dialects. These systems are
particularly beneficial in low-resource settings where training data for specific languages may
be scarce.[12]
Moreover, Whisper’s multilingual capabilities are bolstered by its use of advanced tokenization
and embedding techniques, which ensure that the model can effectively process and understand
speech in various languages. This cross-linguistic adaptability is crucial for applications in
international markets, enhancing user experience and accessibility globally (Butryna et al.,
2020).[11]
he development of intuitive and accessible user interfaces is critical for the widespread
adoption of ASR technologies. Wang et al. (2020) and Zhang et al. (2021) have underscored
the importance of user-centric design in ASR applications. A well-designed interface simplifies
interaction, making ASR functionalities such as voice commands, dictation, and transcription
services more user-friendly. Ensuring that these interfaces are accessible across different
devices and platforms is essential for enhancing user engagement and satisfaction.[13]
To ensure the reliability and effectiveness of ASR systems, rigorous evaluation methodologies
are essential. Metrics such as transcription accuracy, processing speed, and resource utilization
are commonly used to assess system performance. Joseph (2022) and Liu et al. (2022) have
emphasized the importance of comprehensive testing frameworks that evaluate both the
qualitative and quantitative aspects of ASR systems. These evaluations help in identifying
performance bottlenecks and areas for improvement, thereby enhancing the system’s overall
efficacy and user experience.[14]
11
various seminal works and recent advancements in ASR, highlighting the evolution of
techniques and models that have significantly improved the accuracy and efficiency of speech
recognition systems.[16]
Deep Recurrent Neural Networks in Speech Recognition The work by Graves et al. (2013)
demonstrated the potential of Deep Recurrent Neural Networks (RNNs) for speech recognition,
emphasizing their ability to handle sequential data effectively. Their research, presented at the
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP),
showed that RNNs could significantly improve the performance of ASR systems by capturing
long-term dependencies in speech signals [17]
Attention Mechanisms and Transformer Models The introduction of attention mechanisms and
Transformer models marked a significant leap in ASR technology. Chan et al. (2016)
introduced the "Listen, Attend and Spell" model, which employed an attention-based neural
network for large vocabulary conversational speech recognition. This model demonstrated
superior performance by focusing on relevant parts of the input sequence, thereby improving
accuracy and reducing errors[18]
Further advancements were made with the development of the Transformer Transducer by
Zhang et al. (2020), which integrated Transformer encoders with Recurrent Neural Network
Transducer (RNN-T) loss to create a streamable speech recognition model. This approach
combined the strengths of Transformer architectures in handling complex dependencies with
the real-time processing capabilities of RNN-T, leading to significant improvements in both
latency and accuracy[19]
End-to-End ASR with Deep Convolutional Neural Networks Zhang et al. (2017) explored the
use of deep convolutional neural networks (CNNs) for end-to-end speech recognition. Their
research highlighted the efficacy of CNNs in capturing local patterns in speech signals, leading
to enhanced performance in ASR tasks . This shift towards end-to-end models simplified the
ASR pipeline by eliminating the need for handcrafted features and intermediate
representations.[20]
12
Multilingual and Low-Resource Language Recognition The challenge of recognizing speech
in low-resource languages has been addressed through various innovative approaches. Miao et
al. (2022) proposed a multilingual Transformer language model tailored for speech recognition
in low-resource languages, emphasizing the model's adaptability across different linguistic
contexts . Krishna (2021) further extended this approach by leveraging multi-task learning with
conformers to improve ASR performance for low-resource Indian languages.[21]
Large-Scale ASR and Multitask Learning Radford et al. (2023) explored the use of large-scale
weak supervision for robust speech recognition, demonstrating the effectiveness of leveraging
vast amounts of loosely labeled data to enhance model performance.Pratap et al. (2023) focused
on scaling speech technology to over 1,000 languages, highlighting the feasibility of
developing universal ASR systems capable of handling diverse linguistic contexts [24]
Emotion Detection and Context-Aware Transcription Recent advancements have also explored
the integration of additional features such as emotion detection and context-aware
transcription. The research by Xie et al. (2023) on finetuning the Whisper model for mixed
languages illustrates the ongoing efforts to improve ASR models' versatility and accuracy in
real-world applications . The inclusion of such advanced features adds significant value to ASR
systems by providing more nuanced and contextually relevant transcriptions.[25]
13
Systematic Literature Reviews and Surveys Alharbi et al. (2021) conducted a systematic
literature review on ASR, providing a comprehensive overview of the state-of-the-art
techniques and identifying key trends and challenges in the field . Yadav and Sitaram (2022)
surveyed multilingual models for ASR, emphasizing the importance of developing robust
systems capable of handling diverse linguistic and acoustic environments.[26]
Enhancing ASR for Special Populations Research has also focused on improving ASR systems
for specific populations, such as children and speakers of dialects. Liu et al. (2023) investigated
the application of sparsely shared LoRA on Whisper for child speech recognition, highlighting
the unique challenges and solutions for accurately transcribing children's speech . Similarly,
Sicard et al. (2023) extended state-of-the-art ASR models to Swiss German dialects,
showcasing the adaptability of modern ASR systems to regional linguistic variations [27]
14
3.PROPOSED METHOD
3.1METHODOLOGY and FLOWCHARTS
Introduction:
The Speech-To-Text (STT) application using Whisper ASR represents a pivotal advancement
in leveraging state-of-the-art automatic speech recognition technology for real-world
applications. This project focuses on developing a robust system capable of accurately
transcribing diverse audio inputs into text, accommodating various accents, languages, and
environmental conditions. By harnessing the power of the medium-sized Whisper model,
optimized for a balance of computational efficiency and accuracy, the application aims to
deliver seamless and reliable speech recognition capabilities.
Key components of the project include meticulous data collection and preprocessing to curate
a diverse dataset of audio recordings from sources such as the internet. These recordings are
standardized in format and quality, ensuring consistency and suitability for training the Whisper
ASR model. The model selection process prioritizes efficiency on hardware configurations like
Intel i7 11th generation processors and 16GB of RAM, enabling optimal performance during
both training and inference phases.
Audio processing techniques, including segmentation into manageable segments and extraction
of log-Mel spectrograms, are employed to capture detailed acoustic features essential for robust
speech recognition. Model training involves rigorous dataset preparation, hyperparameter
optimization, and performance tuning to enhance transcription accuracy and processing speed
across various linguistic contexts and acoustic environments.
Testing and evaluation play a crucial role in validating the application's performance metrics,
including transcription accuracy metrics like Word Error Rate (WER) and efficiency measures
such as latency and resource utilization. Through comprehensive testing scenarios and
benchmark comparisons, the project aims to meet stringent performance standards and user
expectations, thereby demonstrating the effectiveness and reliability of the developed STT
application using Whisper ASR.
15
1.Data Collection and Preprocessing:
Data Collection
The foundation of building a robust Speech-To-Text (STT) application using Whisper ASR
begins with the acquisition of a comprehensive and diverse dataset of audio recordings. This
dataset should be sourced from various sources, including online repositories, public datasets,
and possibly proprietary collections. The goal is to encompass a wide range of accents,
languages, and environmental conditions to ensure the model's adaptability and accuracy across
diverse real-world scenarios.
The dataset should intentionally include recordings spoken in different accents and dialects to
train the Whisper ASR model to recognize and transcribe speech variations accurately. This
diversity helps mitigate biases and improves the model's generalization capabilities, making it
suitable for a global audience.
Preprocessing
Once the dataset is compiled, the next step involves preprocessing to standardize the format
and quality of audio files. This standardization ensures consistency across the dataset, which is
essential for effective training and reliable performance of the Whisper ASR model.
Format Conversion:
16
Audio files collected from different sources may come in varied formats (e.g., MP3, WAV,
FLAC). Converting all audio files into a common format simplifies data handling and
preprocessing tasks. Typically, formats like WAV or FLAC are preferred due to their lossless
quality and compatibility with deep learning frameworks.
To enhance dataset diversity and improve model robustness, augmentation techniques can be
applied. This may involve artificially modifying audio files by introducing background noise,
pitch variations, or time shifts. Augmentation techniques simulate real-world variations,
thereby exposing the model to a broader spectrum of speech patterns and environmental
conditions.
For supervised learning tasks, annotations or transcripts should accompany each audio
recording where available. These annotations provide ground truth labels, aiding the model in
learning correct transcription patterns during training. Manual or automatic transcription tools
can be employed to generate accurate text transcripts aligned with each audio file.
By meticulously curating and preprocessing a diverse dataset in this manner, the Speech-To-
Text application using Whisper ASR can effectively train a model that demonstrates high
accuracy and reliability in transcribing speech across different languages, accents, and real-
world environments.
17
2. Model Selection and Configuration
Model Selection:
Choosing the right model size is crucial for optimizing both performance and computational
efficiency of the Speech-To-Text (STT) application using Whisper ASR.
Consideration of Model Size: The Whisper ASR system offers multiple model sizes, each
with varying numbers of parameters and computational requirements. The selection of the
medium-sized Whisper model, with 769 million parameters, is based on its ability to strike a
balance between accuracy and computational efficiency. This makes it well-suited for
applications requiring real-time or near-real-time speech transcription capabilities.
Adaptability to Hardware Constraints: Assess hardware constraints such as the Intel i7 11th
generation processor and 16GB of RAM. These specifications are sufficient for handling
intensive computational tasks involved in deep learning. Ensure the selected model size aligns
with the available hardware resources to optimize performance without encountering memory
constraints or processing bottlenecks.
training time and improving real-time performance of the Whisper ASR model.
Configuration Optimization:
Fine-tuning the model configuration involves optimizing parameters and settings to maximize
performance and efficiency.
Hyperparameter Tuning: Adjust parameters such as learning rate, batch size, and optimizer
settings to achieve optimal convergence and training efficiency. Hyperparameter tuning helps
in improving model accuracy and reducing training time, contributing to overall system
performance.
18
Resource Management: Implement strategies to manage hardware resources effectively, such
as batch processing and memory caching techniques. These strategies help in minimizing
computational overhead and maximizing hardware utilization efficiency.
Audio Segmentation:
Audio segmentation involves dividing long audio recordings into smaller segments to enhance
processing efficiency and manage memory constraints during model training and inference.
Segmentation Strategy: Divide audio files into segments of approximately 30 seconds each.
This segmentation duration is chosen based on practical considerations to balance processing
efficiency with maintaining context within each segment. Shorter segments may not capture
sufficient contextual information, while longer segments can increase computational load and
memory requirements.
Benefits of Segmentation:
Memory Management: Segmented audio files consume less memory individually compared
to processing entire recordings at once. This approach minimizes the risk of memory overflow
during model training and inference, ensuring stable performance.
Parallel Processing: Segmented data allows for parallel processing, leveraging multi-core
processors or GPU acceleration where applicable. This parallelization enhances computational
efficiency and reduces processing time, crucial for real-time or near-real-time speech
transcription tasks.
Error Handling: Segmenting audio files also facilitates better error handling and recovery
strategies. In case of processing failures or interruptions, the impact is limited to the affected
segment rather than the entire audio file, improving system robustness and reliability.
Conversion Process: Convert audio signals into log-Mel spectrograms, which are 2D
representations of audio frequency distributions over time. This transformation mimics the
human auditory system's sensitivity to frequencies, emphasizing important speech-related
features such as pitch and timbre.
19
Mel Scale: Utilize the Mel scale to map audio frequencies into perceptually meaningful units.
This scale enhances the model's ability to differentiate between speech and background noise,
improving overall transcription accuracy.
Acoustic Information: The extracted log-Mel spectrograms serve as primary input features
for the Whisper ASR model. These features encode detailed acoustic information necessary for
accurately transcribing speech into text, accommodating variations in pronunciation, accents,
and environmental noise.
Dataset Preparation: Prepare training, validation, and possibly test datasets for model
training. Ensure these datasets are representative of the target application's use cases, including
various languages and accents.
Training Procedure: Train the Whisper ASR model using the prepared dataset. This involves
optimizing hyperparameters (e.g., learning rate, batch size) and applying regularization
techniques (e.g., dropout, weight decay) to prevent overfitting and improve generalization.
20
Special tokens are integral to extending the functionality of the Whisper ASR system beyond
basic transcription capabilities. These tokens enable advanced features that enhance the
versatility and utility of the Speech-To-Text (STT) application.
Language Identification:
Token Functionality: Integrate tokens within the ASR system to perform automatic language
identification (LID). These tokens allow the model to detect and switch between different
languages spoken within a single audio stream or across different segments.
Implementation: During audio processing, the ASR model identifies language-specific cues
encoded by these tokens. This capability is crucial for applications requiring multilingual
support or mixed-language environments.
Token Functionality: Use tokens to insert timestamps at specific intervals within the
transcription output. These timestamps provide temporal alignment, enabling precise
synchronization with corresponding audio segments or real-time speech inputs.
Implementation: The ASR model embeds timestamps within the transcription output,
indicating key points in the audio timeline. This feature is beneficial for applications such as
transcription of lectures, meetings, or live broadcasts where accurate temporal referencing is
essential.
Multilingual Transcription:
21
Token Functionality: Utilize tokens to trigger translation functionalities within the ASR
system, enabling on-the-fly translation of transcribed text into different languages, including
English.
Implementation: Upon detecting translation tokens, the ASR model integrates with translation
APIs or internal modules to convert transcribed text segments into desired target languages.
This feature facilitates real-time communication and accessibility across language barriers.
Functional Validation: Conduct rigorous testing and validation procedures to ensure the
accurate implementation and functionality of special tokens within the Whisper ASR system.
Verify that tokens correctly trigger intended functionalities and adhere to predefined
application requirements.
6.Environment setup
To ensure optimal performance and reliability for developing the Speech-To-Text application
using the Whisper ASR system, the following environment setup is recommended:
Processor:
Utilize an Intel i7 11th generation processor or equivalent. This processor provides ample
processing power necessary for handling intensive computations involved in deep learning
tasks.
RAM:
Maintain a minimum of 16GB RAM. This capacity is crucial for accommodating large datasets
and facilitating efficient model training without encountering memory constraints.
22
Operating System:
Use Windows 11 as the operating system. Windows 11 offers a stable and reliable development
environment suitable for deep learning applications, ensuring seamless execution and support
for necessary software dependencies.
UI Development
Designing an intuitive and user-friendly interface is crucial for the success of the Speech-To-
Text (STT) application. Here are some key considerations and features to implement:
Intuitive Layout:
Simple Navigation: Ensure that the navigation is straightforward and intuitive, allowing users
to easily access the main features of the application.
Clear Instructions: Provide clear and concise instructions or tooltips to guide users through the
process of using the STT application.
Accessibility Features
Voice Command Recognition: Implement voice command capabilities to allow users to control
the application hands-free. This is particularly beneficial for users with disabilities.
High Contrast Mode: Offer a high contrast mode to assist users with visual impairments.
Font Size Adjustments: Allow users to adjust the font size for better readability.
Error Handling:
User Notifications: Provide clear and helpful notifications when errors occur. For instance, if
the application is unable to process a voice command, inform the user and suggest possible
solutions.
Retry Options: Allow users to retry voice commands or transcriptions easily without having to
start from scratch.
23
8.Testing and Evaluation:
Testing Scenarios
To ensure the STT application performs well under diverse conditions, conduct comprehensive
testing covering various scenarios:
Transcription Accuracy:
Different Accents and Dialects: Test the application with speakers from different regions to
ensure it accurately transcribes a variety of accents and dialects.
Multiple Languages: Verify that the application supports and accurately transcribes multiple
languages as intended.
Processing Speed:
Real-time Transcription: Measure the time taken for real-time transcription and ensure it meets
user expectations for responsiveness.
Batch Processing: Evaluate the speed and efficiency of batch processing for pre-recorded audio
files.
Resource Utilization:
CPU/GPU Usage: Monitor CPU and GPU utilization during transcription to ensure the
application runs efficiently without overloading the system.
Memory Consumption: Check for memory leaks and ensure the application manages memory
effectively.
Acoustic Environments:
Quiet vs. Noisy Environments: Test the application's performance in various acoustic
environments, from quiet rooms to noisy outdoor settings.
Background Noise Handling: Assess how well the application filters out background noise and
focuses on the primary speaker's voice.
24
Performance Metrics
Define and measure key performance metrics to validate the application's performance:
Definition: WER is a common metric used to evaluate the accuracy of speech recognition
systems. It is calculated as the ratio of the number of errors (substitutions, deletions, and
insertions) to the total number of words in the reference transcript.
Benchmarking: Compare the WER against industry benchmarks and user expectations to
ensure the application delivers high accuracy.
Transcription Latency:
Definition: Transcription latency is the time taken from when the speech is input until the
transcription is available.
Real-time Requirements: Ensure that the latency is low enough to provide a seamless real-time
transcription experience.
Fig.1
25
This Fig shows The process of converting speech to text involves several intricate steps,
starting with the capture of speech through a microphone, which collects the analog sound
waves produced by the speaker. These analog signals are then subjected to filtering and
amplification to remove noise and enhance the signal strength. The refined analog signals are
subsequently converted into digital data using an Analog-to-Digital Converter (ADC),
transforming the continuous sound waves into discrete digital signals suitable for computer
processing. These digital signals undergo Digital Signal Processing (DSP), further refining and
preparing the data for feature extraction.
The processed digital data is transmitted over a network to a server or cloud-based system,
where more advanced processing and recognition occur, leveraging cloud computing resources
for handling complex computations. The data received by the server undergoes pre-processing
to normalize and prepare it for feature extraction, which includes steps like noise reduction,
normalization, and framing. During feature extraction, essential characteristics of the speech
signal are identified and extracted using techniques such as Mel-Frequency Cepstral
Coefficients (MFCCs) and spectrogram analysis.
These features are then fed into a recognition model, typically based on machine learning or
deep learning algorithms, which processes the features and generates the corresponding text
output. The final step in the process is converting the recognized speech into text, resulting in
a coherent text output such as "Hello World." This comprehensive process utilizes advanced
techniques in signal processing, machine learning, and cloud computing to achieve accurate
and efficient speech-to-text conversion.
26
3.2 IMPLEMENTATION
Code:
This command updates the PyTorch library to the latest version available. PyTorch is used for
deep learning tasks like training neural networks.
Installs the pytube library, which helps in downloading YouTube videos and working with
video metadata using Python.
Installs the whisper library from its GitHub repository. Whisper is an Automatic Speech
Recognition (ASR) system developed by OpenAI.
Installs the librosa library from its GitHub repository. Librosa is a Python package for audio
analysis, including loading, processing, and extracting features like spectrograms.
Code:
𝑖𝑚𝑝𝑜𝑟𝑡 𝑡𝑜𝑟𝑐ℎ
27
𝑖𝑚𝑝𝑜𝑟𝑡 𝑤ℎ𝑖𝑠𝑝𝑒𝑟
𝑖𝑚𝑝𝑜𝑟𝑡 𝑝𝑦𝑡𝑢𝑏𝑒
𝑖𝑚𝑝𝑜𝑟𝑡 𝑙𝑖𝑏𝑟𝑜𝑠𝑎
𝑖𝑚𝑝𝑜𝑟𝑡 𝑛𝑢𝑚𝑝𝑦 𝑎𝑠 𝑛𝑝
torch: PyTorch is a widely used open-source machine learning library primarily developed by
Facebook's AI Research lab (FAIR). It provides tools for building and training neural networks.
pytube: Pytube is a Python library for downloading YouTube videos. It allows developers to
fetch video data, stream video content, and manage metadata.
librosa: Librosa is a Python package for music and audio analysis. It provides tools for loading
audio files, extracting features like mel-spectrograms, and performing various audio processing
tasks.
28
Loading the Model
There are five model sizes to choose from, four have English-only versions, offering speed and
accuracy trade-offs. The model sizes are:
Code:
Code:
Created a custom voice recording of myself, in file_path above to use for this project. Next
we're going to load the audio file in file_path using the load_audio() function.
29
Code:
#𝐿𝑜𝑎𝑑𝑖𝑛𝑔
𝑎𝑢𝑑𝑖𝑜_13
Code:
𝑇 = 13
𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠
4211543
There are 200448 number of samples in 13 seconds audio. Now we find the time
between samples.
Code:
Code:
#𝑆𝑎𝑚𝑝𝑙𝑖𝑛𝑔 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
𝐹𝑠 = 1/𝑑𝑒𝑙𝑡𝑎
𝐹𝑠
323964.8461538462
The sampling frequency is 15419.076923076924. Next, we find the time of each sample.
30
Code:
𝑡𝑖𝑚𝑒
Code:
𝑝𝑙𝑡. 𝑡𝑖𝑡𝑙𝑒(′𝑆𝑖𝑔𝑛𝑎𝑙′)
𝑝𝑙𝑡. 𝑦𝑙𝑎𝑏𝑒𝑙(′𝑎𝑚𝑝𝑙𝑖𝑡𝑢𝑑𝑒′)
𝑝𝑙𝑡. 𝑥𝑙𝑎𝑏𝑒𝑙(′𝑠𝑒𝑐𝑜𝑛𝑑𝑠′)
𝑝𝑙𝑡. 𝑠ℎ𝑜𝑤()
Certainly! This code snippet uses Matplotlib, a plotting library in Python, to visualize an audio
signal. It creates a figure with dimensions 20 inches wide and 10 inches high
(`plt.figure(figsize=(20,10))`). The title of the plot is set to 'Signal', and it plots the `audio_13`
data against `time`, where `time` represents the x-axis (in seconds) and `audio_13` represents
the y-axis (amplitude). Labels are added to indicate that the y-axis represents 'amplitude' and
the x-axis represents 'seconds'. Finally, `plt.show()` displays the plot on the screen. This
visualization helps to understand how the audio signal's amplitude changes over time.
31
Fig,2
The Above Fig.2 is a waveform for the signal. Now, we can use the pad_or_trim() method to ensure
the sample is in the right form for inference.
Code:
audio = whisper.pad_or_trim(audio_13)
Next, we plot the amplitude with respect to time with trimmed/padded audio.
Code:
#𝑁𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑠𝑎𝑚𝑝𝑙𝑒𝑠 𝑖𝑛 𝑜𝑢𝑟 𝑡𝑟𝑖𝑚𝑚𝑒𝑑/𝑝𝑎𝑑𝑑𝑒𝑑 𝑎𝑢𝑑𝑖𝑜
𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠 = 𝑎𝑢𝑑𝑖𝑜. 𝑠ℎ𝑎𝑝𝑒[−1]
#𝑇𝑖𝑚𝑒 𝑜𝑓 𝑒𝑎𝑐ℎ 𝑠𝑎𝑚𝑝𝑙𝑒
𝑡𝑖𝑚𝑒 = 𝑛𝑝. 𝑙𝑖𝑛𝑠𝑝𝑎𝑐𝑒(0, (𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠 − 1) ∗ 𝑑𝑒𝑙𝑡𝑎, 𝑛_𝑠𝑎𝑚𝑝𝑙𝑒𝑠)
𝑝𝑙𝑡. 𝑓𝑖𝑔𝑢𝑟𝑒(𝑓𝑖𝑔𝑠𝑖𝑧𝑒 = (20,10))
𝑝𝑙𝑡. 𝑡𝑖𝑡𝑙𝑒(′𝑆𝑖𝑔𝑛𝑎𝑙′)
𝑝𝑙𝑡. 𝑝𝑙𝑜𝑡(𝑡𝑖𝑚𝑒, 𝑎𝑢𝑑𝑖𝑜)
𝑝𝑙𝑡. 𝑦𝑙𝑎𝑏𝑒𝑙(′𝑎𝑚𝑝𝑙𝑖𝑡𝑢𝑑𝑒′)
𝑝𝑙𝑡. 𝑥𝑙𝑎𝑏𝑒𝑙(′𝑠𝑒𝑐𝑜𝑛𝑑𝑠′)
𝑝𝑙𝑡. 𝑠ℎ𝑜𝑤()
32
Fig.3
The Fig.3 shows the Next, we can start plotting a mel spectogram by applying a
log_mel_spectogram() funtion to our audio file. It converts the y-axis (frequency) into the mel scale:
Code:
The output above is a tensor of converted frequencies. Now, we plot 2 subplots, one is a
regular representation of sound amplitude over period of time, and the other is our mel
spectrogram
𝑎𝑥1. 𝑠𝑒𝑡_𝑡𝑖𝑡𝑙𝑒(′𝑆𝑖𝑔𝑛𝑎𝑙′)
𝑎𝑥1. 𝑠𝑒𝑡_𝑦𝑙𝑎𝑏𝑒𝑙(′𝐴𝑚𝑝𝑙𝑖𝑡𝑢𝑑𝑒′)
33
# 𝑀𝑜𝑣𝑒 𝑡ℎ𝑒 𝑡𝑒𝑛𝑠𝑜𝑟 𝑡𝑜 𝐶𝑃𝑈 𝑏𝑒𝑓𝑜𝑟𝑒 𝑐𝑜𝑛𝑣𝑒𝑟𝑡𝑖𝑛𝑔 𝑡𝑜 𝑁𝑢𝑚𝑃𝑦
Fig.4
This Fig.4 shows the top plot gives a simple view of the raw audio signal, while the bottom
plot provides a more detailed view of the frequency components, which is useful for analyzing
speech characteristics.
34
CHAPTER 4. RESULTS AND DISCUSSION
Next, we can move on to language detection.
Language detection
We will listen to our audio file and detect the spoken language. The sample rate (sr) by default
is 22050, which means that for every second there are 22,050 samples. We can use ipd.Audio()
function to listen to our audio file.
Code:
𝑠𝑟 = 22050
Fig.5
35
From above Fig.5, we can see the probability of each language being the spoken language.
English has the highest probability of 98.5%, therefore it is the spoken language in the audio
file.
Transcription
Fig.6
From the Fig.6 transcription above, we can see our Speech-To-Text system works very well,
it transcribed our audio perfectly.
36
Translation
Fig.7
From Fig.7 we can see Our audio file can be translated to other languages as well.
37
Future Scope
38
Live Audio Transcription: Enhancing the system's ability to process and transcribe
live audio streams in real time, making it suitable for applications such as live
broadcasts, webinars, and real-time communication tools.
Scalability: Developing the system to handle large volumes of audio data, ensuring it
can scale to meet the needs of enterprise-level applications. This includes optimizing
server infrastructure, utilizing cloud computing resources, and implementing load
balancing to manage high demand effectively.
39
User Feedback Integration: Implementing machine learning algorithms that enable
the system to learn and adapt based on user feedback. This continuous learning process
helps in improving the system's performance over time by addressing user-specific
needs and preferences.
Bias Reduction: Leveraging AI advancements to identify and reduce biases in speech
recognition, ensuring fair and accurate transcriptions for all users, regardless of their
accent, dialect, or language. This involves using diverse datasets and refining models
to minimize any inherent biases.
40
CHAPTER 5. CONCLUSION
The development of a Speech-To-Text (STT) application using the Whisper ASR system has
demonstrated the transformative potential of advanced automatic speech recognition
technology across various industries. This project has effectively showcased Whisper's
capabilities in handling diverse accents and background noise, providing accurate and reliable
transcriptions. By leveraging Whisper's robust architecture based on an encoder-decoder
Transformer model, the project has achieved exceptional performance in speech-to-text
conversion. The sophisticated design allows the system to accurately capture and transcribe
spoken language, even in challenging acoustic environments. Extensive testing and
optimization have significantly reduced the Word Error Rate (WER), enhancing the reliability
of the transcriptions. This improvement is crucial for applications where precision is
paramount, such as legal transcriptions and medical documentation.
The system's ability to transcribe and translate speech in multiple languages enhances its
accessibility and usability for a global audience. This multilingual support is essential for
creating inclusive technologies that cater to diverse linguistic communities. The project has
demonstrated the system's capability to understand and accurately transcribe different dialects
and regional accents. This feature is particularly valuable in multicultural societies and
international markets where linguistic diversity is the norm. By providing support for various
languages and accents, Whisper ASR ensures that communication is seamless and effective
across different linguistic backgrounds.
The STT application has been validated for practical use in customer service settings, where it
can automatically transcribe calls. This application improves response times, aids in quality
assurance, and provides valuable insights from customer interactions. In the healthcare sector,
the technology has shown its utility in accurately documenting medical dictations, reducing
administrative burdens on healthcare professionals, improving record-keeping, and enhancing
patient care. In educational environments, the STT application has proven beneficial by
providing real-time transcriptions of lectures. This aids students with hearing impairments or
those who speak different languages, making education more inclusive and accessible.
The project has laid a strong foundation for addressing existing challenges in speech
recognition technology. Future efforts can focus on improving the system's ability to handle
41
complex linguistic structures, varying speech patterns, and background noise in even more
diverse conditions. By exploring new applications, Whisper ASR can significantly enhance
user experience and operational efficiency. Potential future applications include real-time
language translation in international business settings, automated meeting transcriptions for
corporate environments, and interactive voice response systems in customer service.
The continuous evolution of this technology promises to make speech-to-text conversion more
accessible, efficient, and user-friendly. Ongoing research and development efforts will likely
yield improvements in accuracy, speed, and the ability to handle more complex tasks. By
making communication across different languages and regions seamless, Whisper ASR can
bridge linguistic divides. This fosters better understanding and collaboration in a globally
connected world, contributing to social and economic progress.
42
CHAPTER 6. REFERENCES
[1] Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). Llm. int8 (): 8-bit matrix
multiplication for transformers at scale. arXiv preprint arXiv:2208.07339.
[2] Fu, Z., Yang, H., So, A. M. C., Lam, W., Bing, L., & Collier, N. (2023, June). On the
effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on
Artificial Intelligence (Vol. 37, No. 11, pp. 12799-12807).
[3] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., ... & Chen, W. (2021). Lora:
Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
[4] Joseph, V. R. (2022). Optimal ratio for data splitting. Statistical Analysis and Data Mining:
The ASA Data Science Journal, 15(4), 531-538.
[5] Kumar, A., & Mittal, V. (2019). Speech recognition: A complete perspective. International
Journal of Recent Technology and Engineering (IJRTE), 7(6), 78-83.
[6] Li, J. (2022). Recent advances in end-to-end automatic speech recognition. APSIPA
Transactions on Signal and Information Processing, 11(1).
[7] Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., & Raffel, C. A. (2022).
Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning.
Advances in Neural Information Processing Systems, 35, 1950-1965.
[8] Min, S., Lewis, M., Zettlemoyer, L., & Hajishirzi, H. (2021). Metaicl: Learning to learn in
context. arXiv preprint arXiv:2110.15943.
[9] Novitasari, S., Tjandra, A., Sakti, S., & Nakamura, S. (2020). Cross-lingual machine speech
chain for javanese, sundanese, balinese, and bataks speech recognition and synthesis. arXiv
preprint arXiv:2011.02128.
[10] Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023, July).
Robust speech recognition via large-scale weak supervision. In International Conference on
Machine Learning (pp. 28492-28518). PMLR.
[11] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language
models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
43
[12] Rouditchenko, A., Khurana, S., Thomas, S., Feris, R., Karlinsky, L., Kuehne, H., ... &
Glass, J. (2023). Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech
Pre-Training for Adaptation to Unseen Languages. arXiv preprint arXiv:2305.12606
[13] Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., … Almojil,
M. (2021). Automatic Speech Recognition: Systematic Literature Review. IEEE Access, 9,
131858–131876.
[14] Wang, C., Cho, K., & Gu, J. (2020, April). Neural machine translation with byte-level
subwords. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 05,
pp. 9154-9160).
[15] Butryna, A., Chu, S. H. C., Demirsahin, I., Gutkin, A., Ha, L., He, F., ... & Wibawa, J. A.
E. (2020). Google crowdsourced speech corpora and related open-source resources for low-
resource languages and dialects: an overview. arXiv preprint arXiv:2010.06778.
[16] Toraman, C., Yilmaz, E. H., Şahinuç, F., & Ozcelik, O. (2023). Impact of tokenization on
language models: An analysis for turkish. ACM Transactions on Asian and Low-Resource
Language Information Processing, 22(4), 1-21.
[17] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... &
Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing
systems, 30.
[18] Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., ... & Le, Q. V. (2021).
Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
[19] Yang, H., Zhang, M., Tao, S., Ma, M., & Qin, Y. (2023, February). Chinese ASR and NER
Improvement Based on Whisper Fine-Tuning. In 2023 25th International Conference on
Advanced Communication Technology (ICACT) (pp. 213-217). IEEE.
[20] Zhang, C., & Lu, Y. (2021). Study on artificial intelligence: The state of the art and future
prospects. Journal of Industrial Information Integration, 23, 100224.
[21] Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., ... &
Perrault, R. (2021). The AI index 2021 annual report. arXiv preprint arXiv:2103.06312.
[22] Zhang, X., Peng, Y., & Xu, X. (2019, September). An overview of speech recognition
technology. In 2019 4th International Conference on Control, Robotics and Cybernetics (CRC)
(pp. 81-85). IEE make this as literature review
44
[23] Graves, A., Mohamed, A.R., Hinton, G. (2013). Speech Recognition with Deep Recurrent
Neural Networks. 2013 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP), pp. 6645-6649. https://doi.org/10.1109/ICASSP.2013.6638947
[24] Chan, W., Jaitly, N., Le, Q., Vinyals, O. (2016). Listen, Attend and Spell: A Neural
Network for Large Vocabulary Conversational Speech Recognition. 2016 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4960-4964.
https://doi.org/10.1109/ICASSP.2016.7472621
[25] Zhang, Q., Lu, H., Sak, H., Tripathi, A., McDermott, E., Koo, S., Kumar, S. (2020).
Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders
and RNN-T Loss. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pp. 7829-7833.
https://doi.org/10.1109/ICASSP40776.2020.9053896
[26] Karmakar, P., Teng, S.W., Lu, G. (2021). Thank you for attention: a survey on attention-
based artificial neural networks for automatic speech recognition. CoRR abs/2102.07259.
https://arxiv.org/abs/2102.07259
[27] Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., Courville, A.C.
(2017). Towards end-to-end speech recognition with deep convolutional neural networks.
CoRR abs/1701.02720. http://arxiv.org/abs/1701.02720
[28] Alharbi, S., Alrazgan, M., Alrashed, A., Alnomasi, T., Almojel, R., Alharbi, R., Alharbi,
S., Alturki, S., Alshehri, F., Almojil, M. (2021). Automatic speech recognition: Systematic
literature review. IEEE Access 9, 131858-131876.
https://doi.org/10.1109/ACCESS.2021.3112535
[30] Miao, L., Wu, J., Behre, P., Chang, S., Parthasarathy, S. (2022). Multilingual transformer
language model for speech recognition in low-resource languages. 2022 Ninth International
Conference on Social Networks Analysis, Management and Security (SNAMS), pp. 1-5.
https://doi.org/10.1109/SNAMS58071.2022.10062774
[31] Krishna, D.N. (2021). Multilingual speech recognition for low-resource Indian languages
using multi-task conformer. CoRR abs/2109.03969. https://arxiv.org/abs/2109.03969
45
[32] Yadav, H., Sitaram, S. (2022). A survey of multilingual models for automatic speech
recognition. https://arxiv.org/abs/2202.12576
[33] Alec, R., Karthik, N., Tim, S., Ilya, S. (2018). Improving language understanding by
generative pre-training. https://www.m
[34] Zhang, C., & Lu, Y. (2021). Study on artificial intelligence: The state of the art and future
prospects. Journal of Industrial Information Integration, 23, 100224.
[35] Zhang, D., Mishra, S., Brynjolfsson, E., Etchemendy, J., Ganguli, D., Grosz, B., ... &
Perrault, R. (2021). The AI index 2021 annual report. arXiv preprint arXiv:2103.06312.
46