ACKNOWLEDGEMENT
“Task successful” makes everyone happy. But the happiness will be gold without glitter if we do not
acknowledge the people who have supported us in making it a success.
We would like to place on record our deep sense of gratitude to the Hon’ble Chairman Sri. N. SURYA
KALYAN CHAKRAVARTHY GARU and Hon’ble Executive Vice Chairman Dr. N. SRI GAYATRI
GARU, QIS Group of Institutions, Ongole, for providing the necessary facilities to carry out the project work.
We express our gratitude to Dr. Y. V. HANUMANTHA RAO, M.Tech, Ph.D., Principal of QIS
College of Engineering & Technology, Ongole, for his valuable suggestions and advice during the B.Tech
course.
We express our gratitude to the Head of the Department of CSE (DS), Dr. G. L. V. PRASAD
GARU, M.Tech, Ph.D., QIS College of Engineering & Technology, Ongole for his constant supervision,
guidance and co-operation throughout the project.
We express our thankfulness to our project guide Dr. Y. Sowjanya Kumari, M.Tech, Ph.D., Associate
Professor, QIS College of Engineering & Technology, Ongole for her constant motivation and valuable
help throughout the project work.
We would like to express our thankfulness to CSCDE & DPSR for their constant motivation and
valuable help throughout the project.
Finally, we would like to thank our parents, family and friends for their co-operation to complete this
project.
Submitted by
ISUKAPALLI NAGA SAI 21491A4463
GANDIKOTA HEMA SANKAR 21491A4469
TALLURI BRAHMAIAH 21491A4484
NIDUMUKKALA SATYA SAI UMESH CHANDRA 21491A4490
SHAIK BAJI VALI 21491A44A3
DAGGUPATI NAGENDRA 21491A44D2
ABSTRACT
Speech Emotion Recognition (SER) is an emerging area of artificial intelligence that helps computers
understand emotions from speech. This is important for applications such as sentiment analysis, virtual
assistants, and mental health monitoring. This study examines how deep learning models, namely Long
Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM), are used to recognize emotions in
speech. The study involves several steps: collecting data, extracting important features, and classifying them.
Mel-Frequency Cepstral Coefficients (MFCC) are used to extract features from the speech. Then,
Principal Component Analysis (PCA) is used to select the best features. These features are used to train
LSTM and BiLSTM models because they can handle sequences of data and remember information
from the past. The results show that BiLSTM works better than LSTM in recognizing emotions because
it can learn from both past and future context. The study shows that deep learning methods give better
results than older machine learning methods. Also, using MFCC for feature extraction and PCA for
feature selection makes the model stronger and more robust. The research shows that deep learning-based
SER models can be used in real-life situations. Future work could include applying SER in areas like
human-robot interaction, call center analysis, and mental health diagnosis, where understanding
emotions is very useful. The system achieves an accuracy of 90% with LSTM and 91% with BiLSTM.
Keywords: Speech Emotion Recognition (SER), Deep Learning, LSTM, BiLSTM, MFCC, PCA,
Feature Selection, Sentiment Analysis, Human-Computer Interaction.
TABLE OF CONTENTS
LIST OF TABLES i
LIST OF FIGURES ii
1 INTRODUCTION 1
1.1 Introduction 1-2
1.5.4 Classification 8-9
1.9 Applications 11-13
2 LITERATURE SURVEY 14-20
3 PROPOSED WORK AND ANALYSIS 21-23
4 SYSTEM DESIGN 24-35
5.4 Coding 38-47
6 TESTING 48-49
7 RESULTS 50-52
8 CONCLUSION 53-54
9 FUTURE ENHANCEMENTS 55
REFERENCES 56-58
LIST OF TABLES
2 Testing cases 48 - 49
LIST OF FIGURES
LIST OF SYMBOLS AND ABBREVIATIONS
AI Artificial Intelligence
CHAPTER 1
INTRODUCTION
We all know how powerful speech is - it's not just about the words we say, but how we say them.
The tone, the pauses, the rise and fall of our voice - they all carry emotional meaning that we humans
pick up on instantly. But for computers? That's a whole different story. Teaching machines to
understand the emotion behind our words has become one of the most fascinating challenges in AI
research today.
This technology we call Speech Emotion Recognition (SER) is starting to change how machines
interact with us. Think about:
i. A virtual assistant that can tell when you're frustrated and adjusts its responses
ii. Mental health apps that might notice signs of depression just from how someone talks
iii. Customer service systems that can sense when a caller is getting upset
iv. Even robots that respond appropriately to human emotions
The possibilities are endless, and we're just scratching the surface of what's possible. What makes
this so exciting is that we're teaching computers one of the most human skills there is - understanding
emotion.
One of the significant advancements driving Speech Emotion Recognition (SER) is deep learning.
Previously, traditional methods relied on handcrafted features and conventional machine learning
techniques; however, deep learning has revolutionized this field by enabling models to identify
intricate emotional signals from raw audio data. Deep learning models differ from earlier techniques,
which depended on manual feature development, by being capable of autonomously detecting and
analyzing speech patterns in a manner that is more versatile and robust.
Contemporary Speech Emotion Recognition (SER) systems mainly employ technologies such as
Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks. These models
excel at processing sequential data, closely resembling the way human speech naturally unfolds over
time. Emotions are expressed not just through individual words but also through the construction of
sentences, variations in tone, and pauses that affect the overall interpretation of a dialogue. LSTMs
excel at remembering information from previous parts of speech, while BiLSTMs enhance this ability
by taking into account both prior and subsequent speech contexts at the same time. This thorough
analysis results in better emotion detection, offering greater insight into speech patterns. Prior to the
analysis by deep learning models, a crucial step is to extract significant features from the raw audio
input. One commonly employed method for this is Mel-Frequency Cepstral Coefficients (MFCCs).
MFCCs convert speech into numerical representations that capture key aspects of human vocal
expression. These features enable neural networks to detect variations in pitch, tone, and rhythm, all
vital in emotional expression. Moreover, feature selection methods like Principal Component
Analysis (PCA) and statistical techniques refine the extracted data by minimizing noise and focusing
on the most pertinent elements, which ultimately enhances model accuracy and effectiveness.
Deep learning has significantly enhanced the ability of Speech Emotion Recognition (SER)
systems to understand and respond to human emotions in real time. By utilizing LSTM and BiLSTM
networks, machines can effectively capture the nuances of emotional speech. This progress has
important consequences, ranging from improving the functionality of virtual assistants and customer
service interactions to supporting mental health evaluations. As research continues to evolve, the
integration of deep learning with SER is anticipated to produce even more sophisticated applications,
bringing machines closer to truly grasping and responding to human emotions.
Speech serves as one of the most instinctive methods for humans to express their feelings.
Regardless of whether an individual feels joy, sorrow, anger, or neutrality, their vocal tone provides
emotional hints. Utilizing technology to interpret these feelings can enhance interactions between
humans and computers. Speech Emotion Recognition (SER) is an expanding discipline dedicated to
detecting emotions from speech through artificial intelligence and deep learning techniques. This
research aims to establish a dependable SER system capable of accurately classifying emotions with
the help of advanced deep learning methods.
A primary goal of this study is to create an effective deep learning model that can identify emotions
from speech with a high degree of accuracy. The focus will be on Long Short-Term Memory (LSTM)
and Bidirectional LSTM (BiLSTM) networks, which excel in processing speech data. These models
can recognize patterns in vocal data and grasp long-term dependencies, making them suitable for
emotion detection. The overarching objective is to develop a system applicable in real-world
situations, such as customer service, virtual assistants, and monitoring mental health.
Additionally, enhancing feature extraction methods for improved accuracy is important. Raw
speech signals are rich in information, but not all of it aids in emotion identification. This research
will employ Mel Frequency Cepstral Coefficients (MFCC) to derive significant features from speech.
MFCC captures essential characteristics of human vocal patterns, making it a popular approach in
speech analysis. By choosing the appropriate features, the model can concentrate on the critical
aspects, thus boosting its overall efficiency.
To further increase accuracy, the study will emphasize feature selection techniques. Speech data
often includes unnecessary noise and redundant information that can undermine the model's
performance. This research will implement methods like Principal Component Analysis (PCA) and
statistical evaluation to minimize the number of features while keeping the most crucial ones. This
process enables the model to learn more quickly and prevent overfitting, resulting in better predictions.
An additional significant aim is to evaluate the performance of LSTM and BiLSTM models in
emotion recognition. While LSTM networks process information sequentially, BiLSTM models are
capable of handling data in both forward and backward directions, enhancing context comprehension.
By conducting experiments across various datasets, this study will ascertain which model excels in
accurately identifying emotions. The evaluation will also utilize key metrics such as accuracy,
precision, recall, and F1-score to provide a complete view of model effectiveness.
This research also intends to train and assess the SER system using authentic speech datasets. For
building deep learning models, having a reliable dataset with well-labeled emotional content is
essential. The study will incorporate emotional speech samples and apply data preprocessing methods
to guarantee clean and consistent input. Proper annotation of speech samples is vital to ensure the
model learns correctly. The aim is to evaluate the models on standardized datasets and examine how
various factors influence their performance.
Beyond technical elements, the research emphasizes the practical applications of SER. One long-
term aspiration is to make SER systems more adaptable. Many AI models perform well with specific
datasets but encounter challenges with real-world speech variations. The objective is to create a model
that functions effectively in diverse settings, enhancing its practicality for everyday use.
Another objective involves proposing future advancements in SER. While LSTM and BiLSTM
are powerful deep learning models, emerging techniques may further enhance performance. Future
investigations could explore:
Hybrid Deep Learning Models: Merging Convolutional Neural Networks (CNNs) with LSTMs to
refine feature extraction.
Attention Mechanisms: Implementing attention based models to direct focus on the most relevant
segments of speech signals.
Multilingual SER: Extending SER capabilities to accommodate various languages and accents for
broader applicability.
Additionally, this research will focus on enhancing the efficiency and scalability of SER systems.
Deep learning models often require substantial computational resources, which can make real-time
processing challenging. By optimizing model designs and utilising effective training methods, this
study seeks to decrease processing times and memory usage, making the technology more accessible
for mobile devices, embedded systems and real-time applications.
Another crucial aspect is addressing privacy and ethical considerations in SER. Given that
speech data may include personal information, it is critical to formulate secure handling methods. This
research will explore strategies to ensure data privacy and ethical practices in AI, ensuring that emotion
recognition technologies are applied responsibly.
Finally, the study aspires to develop a user-friendly SER system that can seamlessly integrate into
practical applications. The goal is to design a system that is not just accurate and efficient but also
accessible and easy to navigate. By concentrating on real-world usages, this research will further the
development of artificial intelligence and enhance human-computer interactions.
In summary, this study encompasses a range of objectives, including the creation of a deep
learning-based SER system, the optimization of feature extraction and selection, a comparative
analysis of LSTM and BiLSTM models and the exploration of real-world uses such as human-robot
interaction, call center analysis, and mental health monitoring. Additionally, it aims to improve model
generalisation, investigate hybrid architectures, and address ethical issues related to privacy. By
achieving these goals, this research will promote advancements in speech emotion recognition
technology and contribute to future innovations in AI-based human-computer interaction.
Humans communicate with one another in a natural way using speech, using the voice to
express emotions. When people speak, the tone of their voice and the pitch and speed of speaking
alter according to their emotions. Feelings help a person determine whether he or she is feeling angry,
sad, happy, or surprised. However, for machines, recognizing feelings in a person’s voice is quite
complex. Speech Emotion Recognition (SER) is an advanced technology that enables machines
to identify emotions conveyed in the voice. This technology has great applications in such sectors as
customer support, mental health, and even robotics.
In recent years, advancements in artificial intelligence (AI) using deep learning have helped machines
understand speech better. Deep learning methods like Long Short-Term Memory (LSTM) and
Bidirectional LSTM (BiLSTM) can learn and recognize speech patterns, which helps in accurately
identifying emotions. What makes these models different from traditional machine learning models is
their ability to consider the order of speech.
Most of the speech recognition systems present today can translate voice to text but lack emotion
recognition. This is a big issue because emotions are a big aspect of communication. For instance, in
a call center, a customer could say the same words one day in an upbeat way and another day in an
angry tone. The system can give an appropriate response only if it recognizes this emotion; otherwise
it cannot tell that the user is upset.
In a similar vein, emotions in speech provide cues for diagnosing mental health, for example
revealing that a person is depressed or anxious. A patient may not openly say they are sad, but
their voice may sound sad. If doctors or AI systems can detect such emotions early, they can provide
better care and support.
Another challenge is that different people express emotions in different ways. Some people have
a loud voice when they are angry, while others may speak softly. Traditional methods struggle to
detect such variations, but deep learning models can learn these differences by training on large
datasets.
CHAPTER 2
LITERATURE REVIEW
The aim of the project titled "Speech Emotion Recognition using Machine Learning" by Ajmeera
Kiran, Mudassir Khan, J. Chinna Babu, B. P. Santosh Kumar, Sheik Jamil Ahmed, and Zafar Ali Khan is to
improve human-computer interaction by using speech to detect emotions such as joy, rage, and
sadness. Pitch, tone, and MFCC coefficients were extracted from the RAVDESS dataset and used to
train a Multi-Layer Perceptron (MLP) model with an accuracy of 84.38%. The study shows the
superiority of MLP by comparing classification methods such as SVM, K-Nearest Neighbors, and
Hidden Markov Models. It tackles issues like misclassification and emotion overlap, especially with
related emotions. Virtual assistants such as Siri and Alexa are examples of possible applications. New
developments might make multilingual support, larger databases, and real-time emotion recognition
possible. According to the study's findings, audio-based emotion recognition can improve user
experiences and develop intelligent systems.[2]
This study by Apoorva Ganapathy, Senior Developer, Adobe Systems, San Jose, uses deep
learning techniques, namely Long Short-Term Memory (LSTM) networks, to analyze current
developments in Speech Emotion Recognition (SER). It evaluates the advantages and disadvantages
of the current datasets and techniques. The goal of the project is to improve knowledge about SER and
its applications in daily life. It studies the efficiency of LSTM networks with an emphasis on deep
learning-based methodologies, highlighting gains in emotion recognition accuracy and resilience. The
study offers a thorough examination of SER's use cases and situations, providing insightful information
for further study and advancement.[6]
Using Machine Learning (ML) methods, this research paper by Samaneh Madanian, Talen Chen,
Olayinka Adeleye, John Michael Templeton, Christian Poellabauer, Dave Parry, and Sandra L. Schneider
provides a comprehensive overview of Speech Emotion Recognition (SER) with a focus on the past
ten years. The three main SER steps of data processing, feature selection/extraction, and classification
are covered. Achieving high classification accuracy in speaker-independent tests is a difficulty that is
highlighted in the review. It emphasizes the importance of uniform processes and datasets and offers
criteria for SER assessment. Through the identification of major challenges and areas for development,
the paper aims to support future SER research.[7]
This paper is by Javier de Lope and Manuel Graña. The importance of Speech Emotion
Recognition (SER) in Human-Computer Interaction (HCI) is highlighted in this thorough analysis
of the topic. The use of deep learning and machine learning methods in SER is discussed in the
review, covering both conventional and current SER techniques. The article provides information
on feature extraction procedures, classification methods, and data sources. It discusses the difficulties
and possibilities in SER research, making it an invaluable resource for scholars and professionals
working in the area. The goal of the review is to aid in the creation of SER systems that are more
precise and efficient.[8]
This work studies Speech Emotion Recognition (SER) using Constant-Q Transform based
Modulation Spectral Features (CQT-MSF). The paper is by Premjeet Singh, Md Sahidullah, and
Goutam Saha. It looks at how humans process hearing and highlights the importance of recognizing
emotional cues in speech. The study demonstrates that, by exploiting the special qualities of the
Constant-Q Transform, CQT-MSF is capable of extracting emotion-specific information from speech
signals. In order to increase the accuracy and stability of emotion recognition systems, and ultimately
to improve human-computer interaction and emotional intelligence, this project investigates the
potential of CQT-MSF in SER.[9]
This paper is by Subrat Kumar Nayak, Ajit Kumar Nayak, Smitaprava Mishra, Prithviraj Mohanty,
Nrusingha Tripathy, and Kumar Surjeet Chaudhury. The problem of Speech Emotion Recognition (SER)
in low-resource tribal languages is investigated in this work, with a focus on the KUI language. The
lack of resources for this language has been addressed by the authors by creating a unique KUI speech
dataset. They then use deep learning methods to identify emotions in speech, using the dataset for
training and evaluating their models. By filling an information gap in SER research for under-
resourced languages, the project will help construct emotionally intelligent systems for a variety of
multilingual users. The results have implications for speech-based applications.[10]
The study proposes a novel speech emotion recognition (SER) method inspired by the human
brain. Researchers suggest "implicit emotional attribute classification," mimicking how brain regions
interpret emotion. The model achieved notable performance increases on the IEMOCAP dataset
by incorporating this concept through multi-task learning. This approach demonstrates the potential of
using human brain principles to develop more effective SER systems with improved accuracy and
reliability. The findings have significant implications for future research and applications in
human-computer interaction and affective computing technologies, with great potential for
success.[11]
This study is by Shaode Yu, Jiajian Meng, Wenqing Fan, Ye Chen, Bing Zhu, Hang Yu, Yaoqin Xie,
and Qiurui Sun. The goal of the Speech Emotion Recognition (SER) study is to use audio signals to
identify human emotions. It combines a Cross-Attention Fusion (CAF) module with a dual-stream
method to better capture emotional signals. The method fine-tunes a large language model and trains
a text processing network. The CAF module often achieves high accuracy, outperforming current
fusion methods, based on tests done on the EmoDB, IEMOCAP, and RAVDESS databases. This new
finding will significantly impact future SER research and the quick development of applications,
especially in education and healthcare.
This paper is by Mohammed Jawad Al Dujaili, Abbas Ebrahimi-Moghadam, and Ahmed Fatlawi. The
research addresses the challenge of finding emotions in speech, an important topic in speech processing
and human-computer interaction. It explains a system that can be divided into three primary
stages: classification, feature selection, and attribute extraction. Principal Component Analysis (PCA)
serves to extract and reduce features like energy (E), zero-crossing rate (ZCR), fundamental frequency
(F0), and Fourier parameters (FP). In order to evaluate emotional states, the system incorporates Support
Vector Machine (SVM) and K-Nearest Neighbor (KNN) classifiers, producing interesting results across
many different languages.[13]
A 2023 study in Electronics by Konstantinos Mountzouris, Isidoros Perikos, and Ioannis
Hatzilygeroudis investigates deep learning for Speech Emotion Recognition (SER). The implemented
architectures include a simple deep neural network, a Deep Belief Network, LSTM, LSTM with an
attention mechanism, and CNN with an attention mechanism. Results on the SAVEE and RAVDESS
databases reveal that attention-based models perform better than the others. The optimum accuracy is
74% (SAVEE) and 77% (RAVDESS) for CNN-ATN. SER performance is greatly improved by attention
operations, offering possibilities for use in affective computing and human-computer interaction. The
study further advances the research and development of SER.[14]
This paper is by Zhongwen Tu, Bin Liu, Wei Zhao, Raoxin Yan, and Yang Zou. In this study, a new
Speech Emotion Recognition (SER) approach involving feature fusion, feature selection, and data
augmentation is given. A fresh data augmentation approach called mix-wav handles the issues of small
and unbalanced emotional speech samples. The suggested model makes use of a Light Gradient Boosting
Machine for feature selection and a Multi-Head Attention mechanism-based Convolutional Recurrent
Neural Network (MHA-CRNN) for feature extraction. Experiments on the IEMOCAP and CHSE-DB
datasets show significant advantages over current approaches, with unweighted average test accuracies
of 66.44% and 93.47% respectively, improving robustness and performance in practical uses as
well.[15]
This paper by Ruhul Amin Khalil and Edward Jones studies the use of deep learning methods for speech
emotion recognition (SER). It covers several types of deep learning models, including Deep Boltzmann
Machines (DBMs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs),
as well as their uses in SER. The paper also addresses the disadvantages and challenges of different
approaches, as well as possible paths for further study. The benefits of deep learning techniques over
standard approaches, such as higher precision and faster feature extraction, are the main focus of the
study. It also highlights how important high-quality datasets and good assessment metrics are to the
development of effective SER systems.[16]
Speech Emotion Recognition (SER) methods, which try to determine a speaker's emotional state from
their speech, are thoroughly examined in this research by Taiba Majid Wani and Teddy Surya Gunawan.
It covers important SER topics such as databases, classification algorithms, speech processing
methods, and emotional models. In addition to studying the use of several classification methods, such
as GMMs, HMMs, and SVMs, for emotion recognition, the work analyzes several kinds of feature
extraction techniques, including prosodic features, spectral features, and voice quality
characteristics. Likewise, the report identifies the limitations and challenges present in SER systems and
presents possible directions for further study, such as improving noise resistance and creating more
efficient ways of handling naturalistic and spontaneous speech.[17]
Speech Emotion Recognition (SER) systems, which aim to detect a speaker's emotional state from
their speech, are examined in this study. It covers multiple SER methods, such as
methods based on deep learning and machine learning. Various feature extraction and classification
approaches are covered in the paper, such as the usage of Support Vector Machines (SVM), Deep
Neural Networks (DNNs), and Recurrent Neural Networks (RNNs). It also discusses potential paths in
future research and brings attention to the challenges and limitations present in SER systems. This
paper was by Anushka Sandesara, Shilpi Parikh, and Pratyay Sapovadiya.[18]
A study by Itzik Gurowiec and Nir Nissim looked at security weaknesses in Speech Emotion
Recognition (SER) systems. It examined the SER environment, evaluated the foundations of SER, and
identified likely cyberattacks. It was discovered that the security methods in place were not enough,
leaving SER systems unprotected. The authors provided recommendations for improving security and
minimizing risks. This study highlights the need for higher levels of security in order to secure SER
systems against cyberattacks and maintain consistency, since applications across a variety of sectors
and industries require SER system security.[19]
Inspired by the human brain, this study proposes an innovative speech emotion recognition (SER)
approach. Researchers proposed detecting emotions indirectly by replicating the emotional
interpretation of brain regions. By applying multi-task learning to include this idea, the model displayed
major gains in performance on the IEMOCAP dataset. This method shows how human brain concepts
may be used to create improved SER systems that are consistently more accurate and reliable. The
results have important implications for further studies and uses of affective computing and human-
computer interface technologies.[20]
The paper by Anjali I.P & Sherseena P.M analyzes improvements in speech recognition using machine
learning, statistical data mining, and pattern recognition. It offers a thorough examination of automatic
speech recognition (ASR), including vocabulary sizes and classifications. To handle voice sounds, the
methodology uses methods such as Hidden Markov Models (HMMs) and acoustic-phonetic analysis.
The results demonstrate significant improvements in real-time voice processing, which improves
communication between humans and computers. According to this study, speech recognition has the
ability to completely transform input techniques, eliminating traditional interfaces like keyboards and
mice and improving access to technology.[21]
This study by Astha Gupta, Rakesh Kumara, and Yogesh Kumar examines the design of Automatic
Speech Recognition Systems (ASRS) for both foreign and Indian languages. The authors examine deep
learning and machine learning methods for voice recognition in Hindi, Marathi, Urdu, Chinese, French,
and German, among other languages. Significant improvements in speech recognition accuracy and
application are made by the study using frameworks such as the HMM Toolkit, CMU Sphinx, and the
Kaldi Toolkit. In order to enable more efficient and readily available technology, the research advances
our understanding of ASRS by focusing on difficulties and discussing applications in various languages
and human-computer interaction.[22]
This paper by Dr. Kavitha R, Nachammai N, Ranjani R, and Shifali enhances understanding
of biometric security by presenting a voice-based identification solution for native Tamil speakers.
The authors develop a Tamil-language speech recognition system using Mel Frequency Cepstral
Coefficients (MFCC) and Dynamic Time Warping (DTW) to analyze voice authentication
methods.[23]
The authors of this paper are Wiqas Ghai and Navdeep Singh. This paper examines different
approaches, such as connectionist and acoustic-phonetic models, to study developments in automatic
speech recognition (ASR). To enhance large-vocabulary voice recognition, the authors use Hidden
Markov Models with Artificial Neural Networks. Speech variability is addressed by methods such as
Missing Data Techniques. The findings demonstrate notable advancements in ASR across a variety of
languages, improving flexibility in response to linguistic patterns. This research advances our
knowledge of ASR's development and its uses in the construction of reliable speech recognition
systems, the generation of voice corpora, and multilingual speech processing.[24]
The authors of this paper are PhD candidate Amarildo Rista and Prof. Dr. Arbana Kadriu. In this
paper, a variety of systems, including hybrid, end-to-end, and low-resource language models, are analyzed
in the context of voice recognition, a branch of natural language processing (NLP). Important methods
including Recurrent Neural Networks, Deep Neural Networks, and Hidden Markov Models are
reviewed in the paper. The findings indicate that while end-to-end designs increase efficiency, they
present difficulties, and hybrid techniques perform better than traditional models. Model generalization is
improved by transfer learning and multi-task learning, particularly for low-resource languages. The
work contributes to our understanding of speech recognition's development and future prospects by
highlighting its uses and difficulties.[25]
The authors of this paper are Taiba Majid Wani, Teddy Surya Gunawan, Asif Ahmad Aqdri, Mir
Kartiwi, and Eliathamby Ambikairrajah. Speech is a well-known communication method that carries both
linguistic and paralinguistic data. Traditional speech recognition focuses on the linguistic data, whereas
SER uses speech signals to identify the emotional state. This requires feature extraction to obtain features
such as tone and frequency from the audio data, and classifiers like Support Vector
Machines (SVM), Artificial Neural Networks (ANN), and Gaussian Mixture Models (GMM) are
commonly used for emotion classification. Recently, deep learning models have gained prominence
in SER due to their potential to improve accuracy.[26]
This paper is by Eva Lieskovka, Maros Jakubec, Roman Jarina, and Michal Chumlik. Emotion
recognition has different types of applications, such as anger detection in call centers, which is helpful
for better customer service, and conversational chatbots that identify the emotional state of a person in
real time. Better performance requires computational power, speed, and accuracy. Long Short-
Term Memory (LSTM), Bidirectional LSTM (BiLSTM), which processes data both forward and
backward, and Gated Recurrent Unit (GRU) networks are particularly prominent in modeling
time-sequence data for speech-based emotion recognition. These architectures excel in capturing
temporal dependencies crucial for analyzing speech signals.[27]
The study focuses on Speech Emotion Recognition (SER), which is useful for identifying a speaker's
emotional state from their speech output. It describes how important emotions are to mental health. Types
of Emotions: the four primary emotions identified by the study are anger, sadness, happiness, and
neutrality. The system uses speech signals to identify and accurately predict these emotions. It
specifically highlights the usage of Mel-frequency cepstral coefficients (MFCC), a critical spectral
component, in conjunction with prosodic traits including pitch, loudness, and speech intensity.[28]
This paper was by Hadhami Aouani and Yassine Ben Ayed. Speech emotion recognition (SER)
identifies human emotions from speech data. Emotions are crucial for human understanding,
decision-making, and social interaction. SER has many applications in depression diagnosis, call
centers, and online classrooms. Deep learning has played an important role in advancing SER, becoming
the mainstream method. Early use of RNNs in SER dates back to 2002; DNNs were introduced in 2014.
CNNs and LSTMs have also been applied to improve SER performance. Despite advancements,
existing network structures often borrow from other fields. Challenges include dataset scarcity and
the complexity of emotion perception. The limbic system in the brain is key to emotion perception.[29]
The paper focuses on two approaches for recognizing emotions from speech signals. The first stage,
feature extraction, focuses on extracting relevant features from the audio signals.
Specifically, it investigates two sets of features: the first set consists of a 42-dimensional vector that
includes 39 Mel Frequency Cepstral Coefficients (MFCC), along with other parameters
like Zero Crossing Rate (ZCR), Harmonic to Noise Rate (HNR), and Teager Energy Operator (TEO).
[30]
CHAPTER 3
PROPOSED WORK
Speech Emotion Recognition (SER) is very useful in human-computer interaction, enabling machines to
interpret and respond to human emotions. This proposed work focuses on developing an SER system using
deep learning techniques, particularly Long Short-Term Memory (LSTM) and Bidirectional LSTM
(BiLSTM), to enhance the accuracy of emotion classification. The system follows a structured approach that
includes data collection, feature extraction, feature selection, model training, and evaluation.
The first step is to collect speech datasets with labeled audio samples that show different emotions
like happiness, sadness, anger, surprise, fear, and neutrality. Popular datasets such as RAVDESS,
CREMA-D, and IEMOCAP are used for training and evaluation. The dataset offers a variety of speech
samples.
Feature extraction is a crucial stage where relevant speech characteristics are obtained. The system
extracts Mel-Frequency Cepstral Coefficients (MFCC), Chroma features, and Zero Crossing Rate (ZCR).
MFCC is essential for capturing the timbre and spectral properties of speech, Chroma features analyze
pitch variations, and ZCR measures the rate at which the signal changes its polarity.
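As a simple illustration, a minimal Python sketch of how these three features could be extracted with the librosa library is given below; the parameter values (for example, 40 MFCC coefficients) and the file name are illustrative assumptions rather than the project's final configuration.

import numpy as np
import librosa

def extract_features(path, n_mfcc=40):
    # Load the audio clip; librosa resamples to 22,050 Hz by default.
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # spectral envelope / timbre
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)         # pitch-class (harmonic) content
    zcr = librosa.feature.zero_crossing_rate(y)              # rate of sign changes in the waveform
    # Average each feature over time to obtain one fixed-length vector per clip.
    return np.concatenate([mfcc.mean(axis=1), chroma.mean(axis=1), zcr.mean(axis=1)])

features = extract_features("OAF_back_angry.wav")            # hypothetical TESS file name
print(features.shape)                                        # (40 + 12 + 1,) = (53,)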
Feature selection follows the extraction phase to optimize the dataset by removing unwanted and
redundant information. Principal Component Analysis (PCA) and statistical analysis techniques are
applied to retain only the most significant features, improving computational efficiency while maintaining
classification accuracy.
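A minimal sketch of this reduction step using scikit-learn is shown below; the placeholder feature matrix and the 95% variance threshold are assumptions made only for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(2800, 53)                   # placeholder feature matrix: clips x extracted features
X_scaled = StandardScaler().fit_transform(X)   # PCA is scale-sensitive, so standardize first

pca = PCA(n_components=0.95)                   # keep components explaining 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())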
LSTM and BiLSTM are the two classification models used in the deep learning stage. Speech processing
benefits greatly from the LSTM's capacity to capture long-term dependencies in sequential
input. BiLSTM improves on this and provides better contextual awareness by processing speech
sequences in both directions. Accuracy, precision, recall, and F1-score measures are used to evaluate the
trained models. While the F1-score offers a balanced evaluation of precision and recall, accuracy gauges
how correct the predictions are overall. To enhance model generalization, data augmentation methods
like pitch shifting and noise addition can be used, as sketched below.
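The following is a small illustrative sketch of such augmentation using librosa; the noise factor, the two-semitone shift, and the file name are assumed values, not the project's final settings.

import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    # Additive white noise scaled by a small factor.
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    # Shift the pitch up by two semitones without changing the duration.
    return librosa.effects.pitch_shift(y=y, sr=sr, n_steps=n_steps)

y, sr = librosa.load("sample.wav")              # hypothetical input clip
augmented_clips = [add_noise(y), shift_pitch(y, sr)]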
3.1.1 Key Components of SER Systems
A Speech Emotion Recognition system uses a methodical approach which normally includes tasks
such as data collection, extraction of features, selection, classification, and others.
The first step in SER is gathering speech samples containing various emotions. These datasets can
come from real-life conversations, scripted recordings, or acted emotions. Popular datasets for SER
research include RAVDESS, IEMOCAP, EmoDB, and TESS. These datasets contain labelled
speech recordings, so that machine learning and deep learning based models can learn the patterns
associated with different emotional states.
The dataset we used is the Toronto Emotional Speech Set (TESS); this dataset contains the speech
recording samples used in this work.
The next step after collecting the speech data is to extract relevant features from the audio signals. An
important technique used for feature extraction in SER is Mel Frequency Cepstral Coefficients
(MFCC). MFCC captures the frequency properties of speech, which are important for differentiating
emotions. Other features like Chroma features, spectral centroid, and formants may be added to improve
model performance.
Once features are extracted, the next step is the removal of extracted features that are irrelevant or
highly correlated, which can make the feature set inefficient. Feature reduction is typically done using
Principal Component Analysis (PCA) and statistical analysis to reduce the dimension of the data and
keep the most relevant features. The importance of this step is to ensure that the deep learning model
learns representations from the most informative parts of the speech signal.
3.1.5 Classification
The last stage of SER is emotion classification, where machine learning or deep learning based
modeling algorithms learn to classify emotions based on the speech features that have been extracted.
The most prevalent options are LSTM and BiLSTM networks, since they can discover the temporal
structure and dynamics that naturally exist in speech. With LSTM networks, speech
processing occurs sequentially, whereas BiLSTM networks can analyze the speech data in both the
forward and backward directions, thus enhancing the potential for emotion recognition from speech.
The main goal of this research is to develop a Speech Emotion Recognition (SER) system using deep
learning that can accurately identify emotions in speech. This system will use LSTM and BiLSTM models,
which are effective in recognizing patterns in sequential data.
For feature extraction, the system will use Mel Frequency Cepstral Coefficients (MFCC) to make the
system more accurate, since this technique captures the most distinctive details of speech. Principal
Component Analysis (PCA) and statistical analysis will be utilized for feature selection, cutting out
excess data and enhancing performance.
This study will focus on building a real-world SER system that can serve several applications. For
example:
Call Centers: The system can analyse customer voices to understand their emotions and improve customer
service.
Healthcare: Doctors and psychologists can use this technology to monitor patients' emotional health.
Human-Robot Interaction: Robots and virtual assistants can respond in a more natural and human-
like way by recognizing emotions.
People use emotions in speech to express their feelings. But computers and machines cannot
understand emotions like humans do. This is a big problem in areas where emotional understanding
is important, like human-robot interaction, call center customer service, and mental health diagnosis.
If machines could recognize emotions from speech, they could respond better to people, give more
natural conversations, and improve user experience.
Speech Emotion Recognition (SER) is a technology that helps machines detect emotions from
speech. Traditional methods for SER do not work very well because they depend on handcrafted
features and simple models. But deep learning methods like LSTM and BiLSTM can learn emotions
better by analyzing speech patterns. These models help capture long-term speech dependencies,
making them more accurate for emotion detection.
The objective of this research is to create a deep learning-based SER system that extracts features
using Mel Frequency Cepstral Coefficients (MFCC) and classifies data using LSTM/BiLSTM networks.
The objective is to create a model that can reliably recognize the emotions of neutrality, anger, sadness,
and joy from speech signals. If this technology works, it could be applied in the real world to enhance
mental health care, AI assistants, and customer service.
By helping robots better comprehend human emotions, deep learning will enhance emotion
recognition in this study, leading to more effective and human-like technological interactions.
The scope of this research is to employ deep learning techniques to create an efficient Speech
Emotion Recognition (SER) system. The major objective is to make it possible for computers to
reliably identify and decipher human emotions from speech data. This will be helpful in a number of
industries where emotional intelligence is important, including social robots, call centres, healthcare,
and human-computer interaction.
The demand for systems that can comprehend the speaker's emotions in addition to words is
growing as voice-based technology becomes more widely used. Traditional machine learning models
have been used in the past for SER, but they often struggle with accuracy and efficiency. Deep learning
models, especially LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional LSTM), offer a
better solution by capturing the sequential patterns in speech data. This project aims to explore these
models and analyse their performance in recognizing emotions from speech.
Key Areas Covered in the Project
This project involves collecting speech datasets containing various emotional
expressions like happiness, sadness, anger, and other emotions.
Preprocessing techniques such as noise reduction, normalization and feature extraction will
be applied to improve the quality of input data.
Speech signals contain important acoustic features that help in emotion recognition.
The project will use Mel Frequency Cepstral Coefficients (MFCC) for extracting speech
features.
Principal Component Analysis (PCA) and statistical methods will be used for selecting the
most important features, reducing unnecessary data, and improving model efficiency.
LSTM and BiLSTM models will be trained on the processed speech data.
These models are chosen because they are effective at handling sequential and time series data,
which is important for understanding speech patterns.
This project will also compare the performances of both models to determine which one
provides better accuracy in emotion recognition.
The system will be tested using benchmark emotional speech datasets to check its
accuracy.
Performance metrics including accuracy, precision, recall, and F1-score will be used for
evaluation.
The findings will help in understanding how well deep learning models can be applied to
SER.
This project sets the foundation for future improvements in SER, such as integrating hybrid
deep learning models, attention mechanisms, and multilingual datasets to further increase
recognition accuracy.
3.5 Applications
Speech Emotion Recognition (SER) is a growing field that uses deep learning models, such as Long
Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM), to identify emotions based on speech
features like Mel Frequency Cepstral Coefficients (MFCCs) and statistical analyses. The ability to detect
emotions through vocal cues presents a variety of applications across many sectors.
Monitoring Agent Performance allows businesses to assess interactions to train agents in
managing difficult conversations effectively. In the healthcare and mental health sector, early detection of
depression and anxiety can be enhanced by SER, which assists psychologists in identifying emotional
distress through speech patterns. Remote therapy sessions enable virtual therapists to gauge a patient's
emotional state through speech without the need for in-person meetings. In the gaming industry, games
can improve players' overall experiences by reacting to their emotional conditions. Smart home
technology utilizes AI-powered automation systems that can adjust music, lighting, or temperature in
response to a user's emotional state.
In education and e-learning, tailored learning is possible as AI tutors recognize when students are
confused or frustrated, allowing them to modify their teaching strategies accordingly. Engagement
monitoring in online classes enables SER to evaluate student participation, determining engagement levels
and improving teaching strategies. Additionally, SER supports individuals facing speech difficulties by
tracking progress and providing corrective input in speech therapy. In security and law enforcement, lie
detection and threat evaluation can be enhanced as SER assesses emotional signals in suspects' voices to
aid investigations. Emergency response systems in call centers can detect distress calls and prioritize
critical situations. Emotion recognition technology can also improve border security and surveillance by
identifying suspicious behaviors. In entertainment and media, content recommendations for films and
media can be tailored by streaming platforms using SER to suggest shows and movies based on viewers'
emotions. AI music apps can select playlists according to listeners' current emotions, thanks to
mood-based music selection. Voice acting and animations are utilized to match the vocal performances of
characters with their emotions. In the automobile industry, driver monitoring systems can provide
notifications or activate autonomous driving capabilities by detecting signs of fatigue, stress, or aggression
in drivers. AI-powered in-car assistants facilitate a more interactive and human-like experience for vehicle
operators. In workplace productivity and HR management, employee well-being assessments can be
conducted using SER during virtual meetings to evaluate employee morale, while workplace stress
monitoring through AI-driven SER systems helps human resources identify stress levels.
While raw audio signals consist of redundant and unstructured information, feature extraction is an
important stage in SER. Techniques like zero-crossing rate (ZCR), chroma features, and Mel-
Frequency Cepstral Coefficients (MFCCs) are frequently employed to extract significant patterns from
speech data. Principal Component Analysis (PCA) and statistical analysis are two feature
selection techniques that help optimize the extracted features by removing redundant or
unneeded information, improving the accuracy and efficiency of the model. Good preprocessing,
including normalization and noise reduction, guarantees improved model performance and further
improves the quality of the gathered features.
In sequence-based tasks like SER, deep learning models like Long Short-Term Memory
(LSTM) and Bidirectional LSTM (BiLSTM) have been found to be highly effective. Because LSTMs
can keep long-term patterns in sequential inputs, they are well suited for capturing temporal dependencies
in voice data. By processing the speech signal both forward and backward, BiLSTMs beat LSTMs in
terms of obtaining contextual information. These models need a large amount of labeled speech data
for training, which means they need substantial computational resources such as GPUs or TPUs for
quick processing and optimization.
Evaluating the system's performance involves determining multiple metrics, such as accuracy and
F1-score, to gauge classification efficiency. A thorough evaluation confirms that the model
performs effectively when applied to various speakers, languages, and recording environments.
Visualization tools that help analyze model predictions and discover areas that need improvement
make it quicker to refine feature extraction and classification procedures. Continuous improvement,
including data augmentation and optimization of deep learning architectures, may raise the system's
adaptability even further.
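A small sketch of how these metrics could be computed with scikit-learn is given below; the labels shown are placeholders used only to illustrate the calls.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["happy", "sad", "angry", "neutral", "happy", "sad"]       # placeholder ground truth
y_pred = ["happy", "sad", "neutral", "neutral", "happy", "angry"]   # placeholder predictions

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")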
The practicality of Speech Emotion Recognition (SER) using deep learning is influenced by real-
world applications such as customer service, healthcare, and human-computer interaction. SER can
enhance emotion-aware chatbots, artificial intelligence devices, and mental health monitoring systems
by providing real-time emotional insights. However, difficulties remain, such as the demand
for massive labeled datasets, ethical questions, and data privacy problems. To overcome such
obstacles and achieve SER's full potential, multidisciplinary cooperation and breakthroughs in deep
learning and speech processing technologies are important, continuously driving innovation forward.
Deep learning can be used to implement an audio-based emotion recognition system, but it takes
thoughtful planning, a lot of data preparation, strong computing resources, and continuous model
optimization. SER is a promising field because it combines deep learning techniques like LSTM and
BiLSTM with strong feature extraction and algorithm design. With its many applications in everyday life,
SER opens the door for increasingly intellectually and emotionally capable artificial intelligence (AI)
systems that have the potential to completely transform human-computer interaction. This technology
has huge potential to improve user experiences and interactions in fields including healthcare,
education, and customer service.
The technical feasibility section tells whether the proposed system can be successfully developed and
implemented with the available technologies.
Hardware requirements: the computational power, storage, network, and infrastructure required.
Software requirements: the programming language, frameworks, and suitable models needed.
Skill set: the development team should have the necessary technical knowledge.
Cost-benefit analysis: it covers initial development costs, operational expenses, and expected
revenue.
Budget allocation: it ensures the overall budget covers the development.
This section tells whether the proposed system meets the objectives and user requirements or not.
It consists of:
User acceptance: evaluate end-user adoption.
Training requirements: identify the importance of training employees to adapt to the new system.
Risk assessment: identify the risks and mitigation strategies.
The techniques for identifying emotions in spoken language have progressed considerably over
time. Initially, researchers utilized conventional machine learning approaches, including Support
Vector Machines (SVM), Hidden Markov Models (HMM), and Gaussian Mixture Models (GMM).
These approaches were relatively intricate, requiring the identification and extraction of specific vocal
traits, such as changes in pitch, loudness, and spectral features, to successfully train the systems.
While these methods were innovative at the time, they often encountered difficulties due to the natural
complexity and variability of human speech, resulting in inconsistent outcomes.
A significant transformation took place with the advent of deep learning. Modern Speech Emotion
Recognition (SER) systems now employ neural networks that can learn directly from raw audio inputs,
eliminating the need for manual feature extraction. In particular, Long Short-Term Memory (LSTM)
networks and their bidirectional counterparts (BiLSTM) have demonstrated remarkable effectiveness.
Their architecture enables them to process speech in sequence, allowing for the tracking of emotional
signal developments and fluctuations over time, representing a notable improvement over previous
approaches.
CHAPTER 4
SYSTEM DESIGN
The primary objective of this Speech Emotion Recognition (SER) model is to develop and evaluate
two powerful deep learning models specifically designed for SER: Long Short-Term Memory (LSTM)
and Bidirectional Long Short-Term Memory (BiLSTM). The model follows a sequential and
well-defined algorithm to ensure efficient data handling, processing, feature extraction, model
training, and evaluation.
The first step involves collecting the raw audio data from the Toronto Emotional Speech Set (TESS),
which is a collection of voice recordings categorized into seven distinct emotions. The audio files,
recorded in WAV format at a 44.1 kHz sample rate, undergo a thorough preprocessing step to extract
important features like tone and frequency.
The dataset used in this project is TESS (Toronto Emotional Speech Set), available on Kaggle;
this dataset contains high-quality audio recordings of seven different emotions: angry, disgust,
fear, happy, neutral, pleasant surprise, and sad.
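A minimal sketch of how the file list and emotion labels could be gathered from the Kaggle copy of TESS is shown below; the directory name is an assumption, and the labels are derived from the emotion suffix used in TESS file names.

import os

DATA_DIR = "TESS Toronto emotional speech set data"   # hypothetical extraction path

paths, labels = [], []
for root, _, files in os.walk(DATA_DIR):
    for name in files:
        if name.lower().endswith(".wav"):
            paths.append(os.path.join(root, name))
            # TESS file names end with the emotion, e.g. "OAF_back_angry.wav" -> "angry"
            labels.append(name.split("_")[-1].replace(".wav", "").lower())

print(len(paths), sorted(set(labels)))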
To improve the quality of feature extraction, several techniques are used, including:
Mel-Frequency Cepstral Coefficients (MFCC): this technique converts raw audio signals into the
frequency domain, capturing useful information about the speech signal. We use MFCC to extract
important characteristics from the speech signal; both machine learning and deep learning models use
these extracted characteristics to analyze and classify the speech. MFCCs reduce the dimensionality of
audio signals without losing the key information required for tasks such as:
i. Speech Recognition
Step 1: Pre-Emphasis
i. Formula: y[n] = x[n] − α × x[n−1]
ii. α is 0.97 (a minimal sketch of this step is shown below).
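The following NumPy sketch implements the pre-emphasis formula above with α = 0.97; the signal is a placeholder.

import numpy as np

def pre_emphasis(x, alpha=0.97):
    # y[n] = x[n] - alpha * x[n-1]; the first sample is passed through unchanged.
    return np.append(x[0], x[1:] - alpha * x[:-1])

signal = np.random.randn(16000)        # placeholder one-second signal at 16 kHz
emphasized = pre_emphasis(signal)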
Step 2: Framing
i. The signal is divided into small overlapping frames to capture its short-time properties (see the
sketch below).
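A small sketch of the framing step follows; the 25 ms frame length and 10 ms hop are common default values assumed for illustration, not necessarily this project's settings.

import numpy as np

def frame_signal(x, sr=16000, frame_ms=25, hop_ms=10):
    frame_len = int(sr * frame_ms / 1000)      # samples per frame
    hop_len = int(sr * hop_ms / 1000)          # samples between frame starts
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len:i * hop_len + frame_len] for i in range(n_frames)])

frames = frame_signal(np.random.randn(16000))
print(frames.shape)                            # (number_of_frames, samples_per_frame)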
2. Random Forest (RF)
Used in classification tasks, especially for small datasets.
Advantages:
i. It handles high-dimensional feature spaces.
ii. It reduces variance and prevents overfitting.
3. K-Nearest Neighbors (K-NN)
4. Logistic Regression (LR)
Used for binary or multi-class classification.
Advantages:
i. Fast and accurate.
ii. Suitable for baseline models.
Advantages of MFCCs
Captures Human Perception: Models how the human ear perceives sound frequencies.
Compact Representation: Reduces the dimensionality of audio data while preserving critical information.
Effective for Speech and Audio Tasks: Widely used in speech recognition, emotion recognition, and
speaker identification.
Fig 3: Wave plot for happy
Fig 5: Wave plot for angry
Fig 7: Wave plot for neutral
Fig 9: Wave plot for sad
Chroma Features
i. Chroma features extract pitch profiles, enabling the model to capture harmonic and melodic
characteristics of speech.
ii. They are used to capture pitch information from the audio signal; among the many features an
audio signal contains, pitch is captured here by mapping the different frequencies present at a time.
iii. Each audio frame is mapped onto 12 different musical notes; energies belonging to the same note
are mapped to the same bin, which makes this feature robust to changes in octave and timbre.
iv. ZCR (Zero Crossing Rate): it measures how often the audio waveform crosses the zero amplitude
level and indicates changes in speech energy.
Spectral Contrast
It analyzes the differences between spectral peaks and valleys, improving emotion classification
accuracy. Normalization and standardization transform the feature values to a uniform scale, which
ensures model efficiency and enhances convergence during training.
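A minimal sketch of this scaling step with scikit-learn's StandardScaler is shown below; the matrix shapes are placeholders, and the key point is that the scaler is fitted on the training data and then reused for the test data.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(2000, 52)             # placeholder training feature matrix
X_test = np.random.rand(800, 52)               # placeholder test feature matrix

scaler = StandardScaler().fit(X_train)         # learn mean and standard deviation from training data
X_train_scaled = scaler.transform(X_train)     # zero mean, unit variance per feature
X_test_scaled = scaler.transform(X_test)       # reuse the same statistics for the test set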
The speech emotion recognition model is developed by training two deep learning models - LSTM
and BiLSTM - both implemented using the Keras sequential API.
LSTM model
This model consists of multiple LSTM layers with 128 units each, followed by a Dropout layer
with a 20% dropout rate to reduce overfitting. The final dense layer uses Softmax activation to
classify the input into multiple emotion categories, and training is performed using the Adam
optimizer with a learning rate of 0.001; the loss function is set to categorical cross-entropy to handle
multi-class classification.
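A minimal Keras sketch consistent with this description is given below; the input shape (40 MFCC values treated as a sequence) and the use of two stacked LSTM layers are assumptions for illustration, not the project's exact configuration.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(40, 1)),   # assumed input: 40 MFCC values as a sequence
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(7, activation="softmax"),                           # one output per emotion class
])
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()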
Workflow of LSTM:
Fig 4: LSTM Network Architecture
Dropout Layer: dropout is a technique used in deep learning and neural networks to avoid
overfitting; a 20% dropout rate is applied.
Dense Layer: The final output layer uses softmax activation to classify the emotions into multiple
categories.
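Based on the description above, a minimal Keras sketch of such an LSTM model might look as follows; the input shape, number of stacked layers, and training settings in the commented line are placeholders chosen for illustration rather than values confirmed by the project code.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

num_features = 40    # assumed length of each feature vector
num_classes = 7      # seven TESS emotion categories

lstm_model = Sequential([
    LSTM(128, return_sequences=True, input_shape=(1, num_features)),
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(num_classes, activation='softmax'),
])
lstm_model.compile(optimizer=Adam(learning_rate=0.001),
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
# lstm_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)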
BiLSTM model
The BiLSTM model is similar to the LSTM model but uses bidirectional layers that process the data in both directions, allowing the model to capture contextual dependencies more effectively. BiLSTM processes the data in both the forward and backward directions, which allows the model to capture contextual information from both past and future parts of the sequence.
We use BiLSTM because it addresses a limitation of the LSTM. While predicting the output, an LSTM focuses only on past information, but for speech emotion recognition future context is also important. BiLSTM solves this problem by looking in both directions; the model has a forward layer and a backward layer.
1. Forward LSTM: processes the sequence from left to right and computes hidden states as
\overrightarrow{h_t} = \text{LSTM}(x_t, \overrightarrow{h_{t-1}})
2. Backward LSTM: processes the sequence from right to left and computes hidden states as
\overleftarrow{h_t} = \text{LSTM}(x_t, \overleftarrow{h_{t+1}})
At each time step the two directions are concatenated, h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}], so the model sees both past and future context.
In short, the BiLSTM uses bidirectional layers to capture contextual information in both the forward and backward directions.
Dropout and Batch Normalization: Applied to improve generalization and stability during training.
Dense Layer: The final output layer with Softmax activation predicts the emotion class.
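A corresponding Keras sketch of the BiLSTM variant is shown below; as with the LSTM sketch, the input shape and layer count are assumptions for illustration, while the bidirectional wrapping, dropout, batch normalization, and softmax output follow the description above.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dropout, Dense, BatchNormalization
from tensorflow.keras.optimizers import Adam

num_features = 40    # assumed feature-vector length
num_classes = 7

bilstm_model = Sequential([
    Bidirectional(LSTM(128, return_sequences=True), input_shape=(1, num_features)),
    Dropout(0.2),
    BatchNormalization(),
    Bidirectional(LSTM(128)),
    Dropout(0.2),
    Dense(num_classes, activation='softmax'),
])
bilstm_model.compile(optimizer=Adam(learning_rate=0.001),
                     loss='categorical_crossentropy',
                     metrics=['accuracy'])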
Fig 6: BiLSTM Neural Network Architecture
The block diagram of a Speech Emotion Recognition (SER) system using deep learning represents the complete workflow from raw audio input to the final evaluation and visualization of results. The process begins with collecting speech input, followed by feature extraction, where important speech features such as MFCC (Mel-Frequency Cepstral Coefficients), Chroma, and Zero-Crossing Rate (ZCR) are extracted. The extracted features are used for model training, where Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks classify the emotions; to assess model performance, evaluation metrics such as accuracy and F1-score are used, ensuring reliability. This structured approach provides a comprehensive view of the SER process, ensuring an efficient and systematic methodology for emotion recognition from speech.
The data flow diagram for a speech emotion recognition (SER) system shows the sequential steps the system takes to transform raw audio data into model performance evaluations. It starts with raw audio data, which goes through feature extraction to obtain the most important characteristics of the data. The extracted features are then preprocessed to further refine their quality and relevance. These features are passed to model training, where a machine learning or deep learning model is given the task of recognizing emotion from speech. Finally, the model is evaluated using metrics such as accuracy and F1-score. This structured presentation ensures a systematic flow of data through the SER system and helps in better prediction of emotions from speech.
Fig 8: Data Flow Diagram for Speech Emotion Recognition
4.4 UML Diagram
The UML class diagram below shows a high-level overview of the machine learning-based Speech Emotion Recognition system, its main components, and their relationships. It starts with data collection, gathering speech data labelled with different emotions. Mel-Frequency Cepstral Coefficients (MFCC) are then applied to extract the most salient features of the speech.
Next, feature selection is performed using PCA and statistical measures to filter and optimize the extracted features (a small PCA sketch follows). Finally, classification is performed using Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) models to detect the emotion in the speech. Following this methodical pipeline helps ensure a reliable recognition system.
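To illustrate the PCA-based feature selection step, here is a minimal scikit-learn sketch; the 95% explained-variance threshold and the placeholder feature matrix X are assumptions for illustration, not values taken from the project.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# X: (num_samples, num_features) matrix of extracted speech features (placeholder data here)
X = np.random.rand(200, 40)

X_scaled = StandardScaler().fit_transform(X)   # standardize before PCA
pca = PCA(n_components=0.95)                   # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())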
CHAPTER 5
Implementation
5.1 Software Requirements
reduction.
5.2 Hardware Requirements
Processor: Minimum Intel Core i5 / AMD Ryzen 5 (Recommended: Intel Core i7 / AMD Ryzen
7 or higher)
RAM: At least 8GB (Recommended: 16GB or more for large datasets)
GPU: NVIDIA GPU (Recommended: NVIDIA RTX 2060 or higher for deep learning
acceleration)
Storage: At least 100GB (Recommended: SSD for faster data processing)
Audio Input Device: High-quality microphone (if real-time emotion recognition is needed)
5.3 Technologies Used
Feature Extraction:
i. MFCC (Mel-Frequency Cepstral Coefficients)
ii. Chroma Features
Feature Selection:
i. PCA (Principal Component Analysis)
Evaluation Metrics:
i. Accuracy
ii. F1-score
5.4 Coding
import os
import librosa
import numpy as np
import pandas as pd
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from scipy.stats import skew, kurtosis
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout
from tensorflow.keras.optimizers import Adam
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Load data: walk the TESS dataset, extract one feature vector per audio file, and record its emotion label
features, labels = [], []
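# (Illustrative sketch, not the exact project loop.) The TESS folder layout is assumed to be
# dataset_path/<EMOTION_FOLDER>/<file>.wav, with the emotion name at the end of the folder name.
dataset_path = "TESS"  # placeholder path
for folder in os.listdir(dataset_path):
    folder_path = os.path.join(dataset_path, folder)
    if not os.path.isdir(folder_path):
        continue
    emotion = folder.split('_')[-1].lower()      # e.g., "YAF_happy" -> "happy"
    for file_name in os.listdir(folder_path):
        if not file_name.endswith(".wav"):
            continue
        audio, sr = librosa.load(os.path.join(folder_path, file_name), sr=None)
        mfcc = np.mean(librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40), axis=1)
        features.append(mfcc)
        labels.append(emotion)

X = np.array(features)   # feature matrix used below
y = np.array(labels)     # label array used below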
# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
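# (Illustrative sketch.) Reshape the features for the recurrent models, one-hot encode the labels,
# train the LSTM and BiLSTM models sketched in the architecture section, and measure test accuracy.
from tensorflow.keras.utils import to_categorical

X_seq = X_scaled.reshape(X_scaled.shape[0], 1, X_scaled.shape[1])   # (samples, timesteps=1, features)
y_onehot = to_categorical(y_encoded)

X_train, X_test, y_train, y_test = train_test_split(
    X_seq, y_onehot, test_size=0.2, random_state=42, stratify=y_encoded)

# lstm_model and bilstm_model are assumed to be built as in the architecture section above
lstm_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)
bilstm_model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1, verbose=0)

_, lstm_accuracy = lstm_model.evaluate(X_test, y_test, verbose=0)
_, bilstm_accuracy = bilstm_model.evaluate(X_test, y_test, verbose=0)

y_true = np.argmax(y_test, axis=1)
y_pred_lstm = np.argmax(lstm_model.predict(X_test), axis=1)
y_pred_bilstm = np.argmax(bilstm_model.predict(X_test), axis=1)
emotion_labels = encoder.classes_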
# Print results
print(f"LSTM Test Accuracy: {lstm_accuracy:.2%}")
print(f"BiLSTM Test Accuracy: {bilstm_accuracy:.2%}")
# Plot Confusion Matrices
plot_confusion_matrix(y_true, y_pred_lstm, "LSTM", emotion_labels)
plot_confusion_matrix(y_true, y_pred_bilstm, "BiLSTM", emotion_labels)
import os
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
def plot_spectrogram(file_path):
    y, sr = librosa.load(file_path, sr=None)              # Load audio file
    S = librosa.stft(y)                                   # Compute Short-Time Fourier Transform (STFT)
    D = librosa.amplitude_to_db(np.abs(S), ref=np.max)    # Convert amplitude to dB
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
    plt.colorbar(format='%+2.0f dB')
    plt.title(f'Spectrogram of {os.path.basename(file_path)}')
    plt.show()
# Example Usage (dataset_path is assumed to point at the root of the TESS dataset)
#sample_path = os.path.join(dataset_path, "YAF_happy", "YAF_jail_happy.wav")
sample_path = os.path.join(dataset_path, "YAF_fear", "YAF_home_fear.wav")
plot_spectrogram(sample_path)
def plot_mfcc(file_path):
    y, sr = librosa.load(file_path, sr=None)
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    plt.figure(figsize=(10, 4))
    librosa.display.specshow(mfccs, x_axis='time', sr=sr)
    plt.colorbar()
    plt.title(f'MFCC of {os.path.basename(file_path)}')
    plt.show()
# Example Usage
plot_mfcc(sample_path)
# Dummy true labels and predictions (Replace with your model outputs)
true_labels = np.array([0, 1, 1, 0, 2, 2, 1]) # Ground truth labels
predicted_labels = np.array([0, 1, 0, 0, 2, 1, 1]) # Model predictions
# Compute F1-score (weighted by class support)
from sklearn.metrics import f1_score
f1 = f1_score(true_labels, predicted_labels, average='weighted')
print("F1-Score:", f1)
import os
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio
if filtered_df.empty:
    raise ValueError(f"No speech data found for emotion '{emotion}'")
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Simulate extracted features for 100 samples (replace with real features if available)
np.random.seed(42)
features_sample = np.random.rand(100, 17) # 13 MFCCs + 4 Statistical Features
CHAPTER 6
TESTING
6.1 Testing Objectives
Accuracy Testing – Ensure the model correctly classifies emotions from speech input.
Performance Testing – Measure response time, memory usage, and GPU utilization.
Feature Extraction Testing – Validate the correctness of extracted features (MFCC, Chroma, ZCR).
Model Evaluation Metric Validation – Verify accuracy, F1-score, and confusion matrix.
Scalability Testing
Real-time Processing Testing – Check latency and correctness in live speech recognition.
Robustness Testing – Ensure the model performs well under noisy conditions (see the sketch after this list).
Integration Testing – Verify smooth interaction between modules (data collection, preprocessing, model training, and evaluation).
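For the robustness objective, a simple check could look like the sketch below; the SNR level, helper names, and the reuse of the trained lstm_model are assumptions made for illustration rather than part of the project's test suite.

import numpy as np
import librosa

def add_noise(y, snr_db=10):
    # Add white noise at the requested signal-to-noise ratio
    signal_power = np.mean(y ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return y + np.random.normal(0, np.sqrt(noise_power), size=y.shape)

def mfcc_vector(y, sr, n_mfcc=40):
    return np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)

# y, sr = librosa.load("some_test_file.wav", sr=None)                          # placeholder path
# clean_pred = lstm_model.predict(mfcc_vector(y, sr)[None, None, :])           # prediction on clean audio
# noisy_pred = lstm_model.predict(mfcc_vector(add_noise(y), sr)[None, None, :])  # prediction on noisy audio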
Test Case ID | Description | Procedure | Expected Result
TC_05 | Check F1-score | Compute F1-score using test dataset | F1-score should be within limit range
TC_07 | Response time test | Measure time taken for classification | Should be within acceptable limits (e.g., <2s)
TC_08 | Check model on different speakers | Use test data with varied speakers | Model should maintain accuracy across different voices
TC_09 | Real-time emotion recognition | Process live speech input | System should detect and classify emotion correctly
TC_10 | Integration test | Run the complete pipeline from data input to emotion classification | System should work without crashes or errors
CHAPTER 7
Results
i. Precision: How many of the predicted positive samples are actually correct?
ii. Recall: How many of the actual positive samples were identified correctly?
iii. F1-Score: The harmonic mean of precision and recall.
Support: The number of actual occurrences of each class in the dataset.
What is Precision (Positive Predictive Value)?
Precision measures the proportion of correctly predicted positive observations to the total predicted positive observations.
What is Recall?
Recall measures the proportion of correctly predicted positive observations to all observations that actually belong to the class.
What is F1-Score?
F1-Score is the harmonic mean of precision and recall.
What is Support?
Support is the number of actual occurrences of each class in the dataset. It represents how many samples belong to a specific class.
For example, if you are classifying emotions from speech and your dataset contains:
Happy: 300
Sad: 200
Angry: 100
then the support values for Happy, Sad, and Angry are 300, 200, and 100, respectively.
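These four quantities can be obtained together with scikit-learn's classification_report; the labels and predictions below are placeholders for illustration only.

from sklearn.metrics import classification_report

# Placeholder ground-truth and predicted emotion labels
y_true = ["happy", "sad", "angry", "happy", "sad", "happy"]
y_pred = ["happy", "sad", "happy", "happy", "angry", "happy"]

# Prints precision, recall, F1-score, and support for every class
print(classification_report(y_true, y_pred, zero_division=0))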
Confusion matrix – LSTM
Confusion matrix – BiLSTM
Spectrogram:
A spectrogram is a visual representation of how the frequency content of an audio signal varies over time.
Fig 10: Spectrogram for fear
Fig 12: Spectrogram for disgust
Fig 14: Spectrogram for pleasant surprise (ps)
Fig 16: Spectrogram for happy
According to the LSTM and BiLSTM models' classification reports and test accuracy findings:
2. Emotional Assessment
LSTM: With an overall macro average of 92%, LSTM achieved good precision, recall, and F1-scores across all emotions.
BiLSTM: It also did well; however, it was a little less accurate for "Happy" and "Angry."
In both models, "Happy" showed comparatively poorer recall, whereas "Disgust" had the highest precision.
3. LSTM vs BiLSTM comparison:
In terms of total accuracy, LSTM did marginally better.
For certain emotions, BiLSTM performed better, especially "Disgust" (95% F1-score).
LSTM had more balanced precision-recall values across emotions.
CHAPTER 8
CONCLUSION
Deep learning-based Speech Emotion Recognition (SER) has become a key area of artificial intelligence, allowing machines to recognize and react to human emotions from voice data. Accurately identifying emotions has several uses in affective computing, mental health analysis, customer service, and human-computer interaction. In order to categorize emotions from audio data, this study used and examined two deep learning architectures: Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM). Precision, recall, F1-score, and overall accuracy were used to assess the models' performance. The findings showed that both LSTM and BiLSTM were capable of accurately identifying emotional patterns in speech, with LSTM beating BiLSTM by a small margin: the accuracy of the LSTM and BiLSTM models in this investigation was 91.00% and 90.75%, respectively.
Both models' classification reports showed excellent precision, recall, and F1-scores for a variety of emotions such as fear, disgust, anger, sadness, and neutrality. With particularly high precision for emotions like disgust (0.98) and fear (0.93), LSTM showed outstanding recognition ability across all categories. BiLSTM also demonstrated good performance, albeit with somewhat lower accuracy for specific emotions. Despite its slightly lower accuracy, BiLSTM provided an advantage in learning temporal relationships in both the forward and backward directions, which made it useful for capturing the contextual flow of speech.
Mel-Frequency Cepstral Coefficients (MFCC), which extract important information from speech signals, were one of the fundamental speech processing techniques used to train the models. The training dataset was preprocessed to eliminate noise, normalize the audio signals, and extract pertinent frequency-based features. The LSTM and BiLSTM models were constructed and optimized using deep learning frameworks such as TensorFlow and Keras, and Librosa, a key part of the speech analysis pipeline, was used for audio feature extraction. Since deep learning architectures need a lot of processing power to train on large datasets, the models were trained on GPU-enabled hardware to improve computational efficiency; a system with an NVIDIA GPU and 16 GB of RAM was used to speed up model training and evaluation.
Testing and evaluation were two of the study's main components. A number of test cases were created to examine how well the models performed in various scenarios, such as noisy settings, different voice tones, and varying emotional intensities. The classification results demonstrated both models' robustness, showing that they maintained high accuracy across several test conditions. However, due to overlapping speech characteristics, there were a few minor misclassifications, especially between related emotions such as Happy and Neutral or Angry and Fear. These misclassifications highlight how difficult it is to identify emotions in practical applications, where they are frequently complex and affected by a variety of factors, including background noise, speaker accent, and intonation.
The findings of this work suggest that deep learning models, especially LSTM and BiLSTM, are a good fit for automated emotion recognition from speech. Their efficient processing of sequential input makes them well suited for applications that require an understanding of speech dynamics. By learning both past and future speech dependencies, BiLSTM offers extra insight, even though the LSTM model showed marginally better accuracy. This implies that BiLSTM could be further enhanced by incorporating attention mechanisms or by combining it with convolutional neural networks (CNNs) to improve feature extraction.
Notwithstanding the encouraging outcomes, this study has several limitations. Both the size and the speaker diversity of the training sample were constrained. Larger, more varied datasets that span languages, accents, and age groups could improve model generalization in future studies. Furthermore, adding multimodal emotion recognition, which blends speech with physiological signals or facial expressions, could raise the overall precision and dependability of emotion detection systems. Real-time emotion detection is another possible improvement; it calls for model optimization to process live speech inputs with minimal lag. The trained models can be deployed on edge devices or cloud-based platforms to accomplish this, enabling smooth integration into practical applications like customer care systems, virtual assistants, and interactive learning environments.
In conclusion, deep learning-based speech emotion recognition offers important advances in affective computing. The study's findings demonstrate that both the LSTM and BiLSTM models are successful in correctly identifying emotions from speech, with LSTM attaining an accuracy of 91.00% and BiLSTM 90.75%. BiLSTM had advantages in capturing contextual relationships, although LSTM had a slightly higher overall accuracy. The results of this study add to the expanding body of research on emotion-aware AI systems, which could improve mental health monitoring, human-computer interaction, and customer service. To further improve the capabilities of speech emotion recognition systems, future research should concentrate on resolving current issues such as dataset variety, real-time implementation, and multimodal fusion.
CHAPTER 9
FUTURE ENHANCEMENTS
Another key area for enhancement is the implementation of multimodal techniques that combine speech data with facial expressions and physiological signals, such as heart rate and electroencephalograms (EEG). By incorporating multiple modalities, deep learning models can gain a deeper understanding of human emotions, resulting in improved classification results. This approach would also address issues related to changes in speech caused by stress, health problems, or emotional suppression.