
PROJECT REPORT

DEEP LEARNING

22AD3105R/A

PROJECT TITLE - VOICE TO TEXT CONVERSION USING PYTHON AND DEEP LEARNING

SUBMITTED BY-
2210080076 - SWAPNIKA SHETTY
2210080050 - PADMA SREEJA
2210080059 - P. VISVAGNA
2210080034 - M. JYOTHI
2210080053 - SEELAMSETTI LEELA KUMARI

CONTENT
1. Abstract
2. Introduction
3. History of speech to text
4. Problems faced in speech to text
5. Packages required
6. Steps
7. Applications of speech recognition
8. Conclusion

ABSTRACT
The advancement of deep learning technologies has significantly enhanced the capabilities of
speech recognition systems. This project focuses on developing a robust speech-to-text
conversion model utilizing deep learning techniques in Python. The primary goal is to
accurately transcribe spoken language into written text by training a neural network on a
large dataset of audio files and corresponding transcripts.
The project leverages a combination of convolutional neural networks (CNNs) and recurrent
neural networks (RNNs), specifically Long Short-Term Memory (LSTM) units, to capture the
temporal dependencies and acoustic features of speech. Additionally, the model employs a
Connectionist Temporal Classification (CTC) loss function to handle the varying lengths of
input speech and output text sequences.
Using popular Python libraries such as TensorFlow and Keras, the model is trained on diverse
speech datasets to enhance its generalization across different speakers, accents, and
environments. The implementation also integrates various preprocessing steps, including
noise reduction, feature extraction using Mel-Frequency Cepstral Coefficients (MFCCs), and
data augmentation, to improve the model's robustness.
The resulting speech-to-text system is evaluated on multiple benchmarks, demonstrating its
potential for real-world applications such as voice-activated assistants, automated
transcription services, and accessibility tools for the hearing impaired. The project concludes
with insights into the challenges faced during development, such as handling noise and
managing computational complexity, and suggests potential future improvements for
achieving even greater accuracy and efficiency.

INTRODUCTION
As our interactions with machines have evolved, the complexity of these interactions has
increased dramatically. We’ve transitioned from pressing large mechanical buttons to
seamlessly navigating touchscreens. But the advancements aren’t limited to hardware alone.
Historically, text has been the predominant method of input for computers. However, with
breakthroughs in Natural Language Processing (NLP), Machine Learning (ML), and Data
Science, we are now on the cusp of using speech as a primary medium for interacting with
our devices.
Virtual assistants exemplify this shift and are now embedded in our daily lives. Digital
assistants like Google Assistant, Siri, Alexa, and others have set a new standard for how we
can communicate with the digital world on a more personal and intuitive level.
One of the most exciting developments in modern technology is the widespread availability
of tools that allow anyone to convert spoken words into text. This capability, once limited to
specialized applications, is now accessible to a broad audience, inviting experimentation and
innovation.
When it comes to developing speech-to-text applications, Python—one of the most popular
programming languages—offers a wealth of options. From built-in libraries to powerful
frameworks, Python provides the tools necessary to create efficient and effective speech-to-
text systems, making this technology more accessible than ever before.

History of Speech to Text


Before diving into Python's speech-to-text capabilities, it is worth looking at how far
we've come in this area. Below is a condensed timeline of key events:
Audrey, 1952: The first speech recognition system, Audrey, was built by three Bell Labs
engineers in 1952. It could recognize only spoken digits.
IBM Shoebox, 1962: IBM's first voice recognition system, the Shoebox, could distinguish
16 words in addition to digits. It could also perform basic arithmetic and print the
results.

Defense Advanced Research Projects Agency (DARPA), 1970s: DARPA funded the Speech
Understanding Research program, which led to the Harpy system and its ability to
recognize 1,011 words.
Hidden Markov Model (HMM), 1980s: The HMM is a statistical model for problems involving
sequential information. It was used in the development of new voice recognition
techniques.
Voice Search by Google, 2001: In 2001, Google launched its Voice Search tool, which
allowed users to search by speaking. It was the first widely used voice-enabled app.

Siri, 2011: Siri, launched in 2011, provided a real-time and convenient way to interact
with Apple's devices.


Alexa, 2014 & Google Home, 2016: Voice-activated virtual assistants like Alexa and
Google Home, which have sold over 150 million units combined, entered the mainstream in
2014 and 2016, respectively.
Problems faced in Speech to Text
Speech-to-text conversion remains a challenging area of technology, with several unresolved
issues. Numerous technical limitations currently make it an imperfect tool. Below are some
of the most common challenges faced in voice recognition technology:
1. Inaccurate Interpretation
Speech recognition systems often struggle to accurately capture spoken words. Unlike
humans, Voice User Interfaces (VUIs) lack the ability to fully understand context, which can
significantly alter the meaning of words and phrases. As a result, these systems may have
difficulty grasping the true semantics of a statement.
2. Processing Time
Speech recognition systems can sometimes take a considerable amount of time to process
spoken input. This delay is often due to the vast range of vocal patterns among different
users. While speaking more slowly or clearly can improve recognition accuracy, it detracts
from the tool’s overall convenience and efficiency.
3. Accents and Dialects
VUIs may struggle to understand non-standard accents and dialects. Even within the same
language, people can pronounce the same words in vastly different ways, leading to potential
misinterpretations by the system.
4. Background Noise and Volume
Ideally, background noise and varying speech volumes would not be an issue for VUIs, but in
reality, these factors can significantly impact performance. In noisy environments such as
public spaces or large offices, speech recognition systems may struggle to function
effectively.
How does Speech recognition work?

A complete description of the method is beyond the scope of this report. This section
demonstrates how to convert speech to text using Python, via the SpeechRecognition
library together with PyAudio.
PACKAGES REQUIRED
 SpeechRecognition (pip install SpeechRecognition): This is the core package that
handles the most important part of the conversion process. Other engines and services,
such as apiai, assemblyai, google-cloud-speech, pocketsphinx, watson-developer-cloud,
wit, and so on, offer their own advantages and disadvantages.
pip install SpeechRecognition
 PyAudio (pip install PyAudio): used to capture audio from a microphone.
 PortAudio: the cross-platform audio I/O library that PyAudio depends on (installed
through the system package manager rather than pip).

Convert an audio file into text


Steps
1. Import the library for speech recognition.
2. Initialize the recognizer class in order to do voice recognition. We are utilizing
Google's speech recognition technology.
3. The following audio formats are supported by speech recognition: WAV, AIFF, AIFF-C,
and FLAC. This example uses a WAV file.
4. The audio clip used is from the film 'Taken' and states: "I have no idea who you are
or what you want, but if you're looking for ransom, I can tell you I don't have any
money."
5. The Google recognizer reads English by default. It supports a variety of languages;
for further information, please refer to the documentation.
Speech is, at its most basic level, nothing more than a sound wave. In acoustic terms,
amplitude, crest, trough, wavelength, cycle, and frequency are some of the
characteristics of these sound waves or audio signals.
Because these audio signals are continuous, they contain an infinite number of data
points. To convert such an audio signal into a digital signal that a computer can
process, we must take a discrete set of samples that closely approximates the
continuous signal.
Once we’ve established a suitable sample frequency (8000 Hz is a reasonable starting
point, since most of the information in speech lies below 4 kHz, which an 8000 Hz
sampling rate can capture), we can analyze the audio signals using Python packages such
as LibROSA and SciPy. On the basis of these inputs, we can then partition the data set
into two parts: one for training the model and another for validating the model's
conclusions.
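The resampling and split can be sketched with SciPy. The 440 Hz test tone and the 80/20 ratio below are illustrative assumptions standing in for real recordings:

```python
import numpy as np
from scipy.signal import resample

# Illustrative stand-in for a recorded clip: 2 s of a 440 Hz tone
# sampled at 44.1 kHz (a real project would load WAV files instead).
orig_rate = 44100
t = np.linspace(0, 2, 2 * orig_rate, endpoint=False)
signal = (0.5 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)

# Resample down to the 8000 Hz working rate discussed above.
target_rate = 8000
samples = resample(signal, int(len(signal) * target_rate / orig_rate))

# Simple 80/20 split into training and validation portions.
split = int(0.8 * len(samples))
train, val = samples[:split], samples[split:]
print(len(train), len(val))  # 12800 3200
```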
At this stage, one may use the Conv1d model architecture, a convolutional neural
network that operates along a single dimension. We can then construct a model, define
its loss function, and train the network to arrive at the best model for converting
voice to text. Combining deep learning with NLP (Natural Language Processing) enables
wider applicability and acceptance of speech-to-text systems.
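One minimal way to realize the Conv1d idea in Keras is sketched below. The layer sizes and the 30-class output are illustrative assumptions, and the simple classification head stands in for the more involved CTC setup described in the abstract:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Illustrative Conv1d acoustic-model sketch, not the exact architecture.
num_classes = 30  # e.g. the letters a-z plus a few special tokens
model = models.Sequential([
    layers.Input(shape=(8000, 1)),              # 1 s of 8 kHz audio
    layers.Conv1D(16, 13, strides=4, activation="relu"),
    layers.MaxPooling1D(3),
    layers.Conv1D(32, 11, strides=2, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# A forward pass on dummy audio shows the expected output shape.
dummy = np.zeros((2, 8000, 1), dtype="float32")
print(tuple(model(dummy).shape))  # (2, 30)
```

In a full system the dense softmax head would be replaced by per-timestep outputs trained with a CTC loss, so that input and output sequence lengths can differ.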

Applications of Speech Recognition


More tools are accessible for operating this technological breakthrough because it is
largely a software creation that does not belong to any one company. Because of this,
even developers with limited financial resources have been able to use this technology
to create innovative apps.
The following are some of the sectors in which voice recognition is gaining traction
 Evolution in search engines: Speech recognition will aid in improving search
accuracy by bridging the gap between verbal and textual communication.
 Impact on the healthcare industry: Voice recognition is becoming a more prevalent
element in the medical sector, as it speeds up the production of medical reports. As
VUIs improve their ability to comprehend medical language, clinicians will gain time
back from administrative tasks by using this technology.
 Service industry: As automation advances, it is possible that a customer will be unable
to reach a human to respond to a query; in this case, speech recognition systems can
fill the void. We will witness a quick expansion of this function at airports, public
transportation, and other locations.
 Service providers: Telecommunications companies may rely even more on speech-to-
text technology that may help determine callers’ requirements and lead them to the
proper support.

Conclusion
Speech-to-text conversion is a useful tool that is on its way to becoming commonplace.
With Python, one of the most popular programming languages in the world, it’s easy to create
applications with this tool. As we make progress in this area, we’re laying the groundwork for
a future in which digital information may be accessed not just with a fingertip but also with a
spoken command.
