
“EMOTION RECOGNITION USING DEEP LEARNING NETWORKS ON

LIVE CALLS”

A Project Report submitted to

JAWAHARLAL NEHRU TECHNOLOGICAL UNIVERSITY ANANTAPUR

In Partial Fulfillment of the Requirements for the Award of the degree of

BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY

REDDYGARI SIRISHA (20121A05N6)


PUVVADI MONICA (20121A05M7)
POTHIREDDY LAKSHMI PRASANNA (20121A05M0)
PUJARI PRAVEEN KUMAR (20121A05M5)

Under the Guidance of


Dr. G. Vennila
Assistant Professor
Dept of CSE, SVEC

Department of Computer Science and Engineering


SREE VIDYANIKETHAN ENGINEERING COLLEGE
(Affiliated to JNTUA, Anantapuramu)
Sree Sainath Nagar, Tirupathi – 517 102
2020-2024
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING

VISION AND MISSION

VISION

To become a Centre of Excellence in Computer Science and Engineering by


imparting high quality education through teaching, training and research.

MISSION
 The Department of Computer Science and Engineering is established to
provide undergraduate and graduate education in Computer Science and
Engineering to students with diverse backgrounds in the foundations of
software and hardware, through a broad curriculum strongly focused on
developing advanced knowledge to become future leaders.
 Create knowledge of advanced concepts, innovative technologies and
develop research aptitude for contributing to the needs of industry and
society.
 Develop professional and soft skills for improved knowledge and
employability of students.
 Encourage students to engage in life-long learning to create awareness
of the contemporary developments in computer science and engineering
to become outstanding professionals.
 Develop an attitude of ethical and social responsibility in professional
practice at regional, national and international levels.

Program Educational Objectives (PEO’s)

1. Pursuing higher studies in Computer Science and Engineering and


related disciplines

2. Employed in reputed computer and IT organizations and Government, or have established startup companies.

3. Able to demonstrate effective communication, engage in team work,


exhibit leadership skills, ethical attitude, and achieve professional
advancement through continuing education.

Program Specific Outcomes(PSO’s)
On successful completion of the Program, the graduates of B.Tech. (CSE)
program will be able to:
1. Demonstrate knowledge in Data structures and Algorithms, Operating
Systems, Database Systems, Software Engineering, Programming
Languages, Digital systems, Theoretical Computer Science, and
Computer Networks. (PO1)

2. Analyze complex engineering problems and identify algorithms for


providing solutions (PO2)

3. Provide solutions for complex engineering problems by analysis,


interpretation of data, and development of algorithms to meet the
desired needs of industry and society. (PO3,PO4)

4. Select and apply appropriate techniques and tools to complex engineering problems in the domain of computer software and computer-based systems. (PO5)

Program Outcomes (PO’s)
1. Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering
problems (Engineering knowledge).

2. Identify, formulate, review research literature, and analyze complex


engineering problems reaching substantiated conclusions using first principles
of mathematics, natural sciences, and engineering sciences (Problem
analysis).

3. Design solutions for complex engineering problems and design system


components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and
environmental considerations (Design/development of solutions).

4. Use research-based knowledge and research methods including design of


experiments, analysis and interpretation of data, and synthesis of the
information to provide valid conclusions (Conduct investigations of
complex problems).

5. Create, select, and apply appropriate techniques, resources, and modern


engineering and IT tools including prediction and modeling to complex
engineering activities with an understanding of the limitations (Modern tool
usage)

6. Apply reasoning informed by the contextual knowledge to assess societal,


health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice (The engineer and society)

7. Understand the impact of the professional engineering solutions in societal


and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development (Environment and sustainability).

8. Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice (Ethics).

9. Function effectively as an individual, and as a member or leader in diverse


teams, and in multidisciplinary settings (Individual and team work).

10. Communicate effectively on complex engineering activities with the


engineering community and with society at large, such as, being able to
comprehend and write effective reports and design documentation, make
effective presentations, and give and receive clear instructions
(Communication).

11. Demonstrate knowledge and understanding of the engineering and


management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments
(Project management and finance).

12. Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological
change (Life-long learning).

Course Outcomes

CO1. Knowledge on the project topic (PO1)

CO2. Analytical ability exercised in the project work.(PO2)

CO3. Design skills applied on the project topic. (PO3)

CO4. Ability to investigate and solve complex engineering problems faced


during the project work.(PO4)

CO5. Ability to apply tools and techniques to complex engineering activities


with an understanding of limitations in the project work. (PO5)

CO6. Ability to provide solutions as per societal needs with consideration to


health, safety, legal and cultural issues considered in the project work. (PO6)

CO7. Understanding of the impact of the professional engineering solutions in


environmental context and need for sustainable development experienced
during the project work. (PO7)

CO8. Ability to apply ethics and norms of the engineering practice as applied
in the project work.(PO8)

CO9. Ability to function effectively as an individual as experienced during the


project work.(PO9)

CO10. Ability to present views cogently and precisely on the project work.
(PO10)

CO11. Project management skills as applied in the project work. (PO11)

CO12. Ability to engage in life-long learning as experienced during the project work. (PO12)

CO-PO Mapping

PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3 PSO4

CO1 3 3

CO2 3 3

CO3 3 3

CO4 3 3

CO5 3 3

CO6 3

CO7 3

CO8 3

CO9 3

CO10 3

CO11 3

CO12 3

(Note: 3-High, 2-Medium, 1-Low)

DECLARATION

We hereby declare that this project report titled “Emotion Recognition using Deep Learning Networks on Live Calls” is a genuine project work carried out by us in the B.Tech (Computer Science and Engineering) degree course of Jawaharlal Nehru Technological University Anantapur, and has not been submitted by us to any other course or University for the award of any degree.

Signature of the student

1. REDDYGARI SIRISHA

2. PUVVADI MONICA

3. P LAKSHMI PRASANNA

4. PUJARI PRAVEEN KUMAR

SREE VIDYANIKETHAN ENGINEERING COLLEGE
(Affiliated to Jawaharlal Nehru Technological University Anantapur)
Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102, Chittoor Dist., A.P.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the Project Work entitled

“Emotion Recognition Using Deep Learning Networks on Live Calls“

is the bonafide work done by

REDDYGARI SIRISHA (20121A05N6)


PUVVADI MONICA (20121A05M7)
POTHIREDDY LAKSHMI PRASANNA (20121A05M0)
PUJARI PRAVEEN KUMAR (20121A05M5)

in the Department of Computer Science and Engineering, Sree Vidyanikethan Engineering College, A. Rangampet (affiliated to JNTUA, Anantapuramu), in partial fulfillment of the requirements for the award of Bachelor of Technology in Computer Science and Engineering during 2020-2024.

This work has been carried out under my guidance and supervision.

The results embodied in this project report have not been submitted to any other University or Organization for the award of any degree or diploma.

Internal Guide Head

Dr. G. Vennila Dr. B. Narendra Kumar Rao


Assistant Professor Professor & Head
Dept of CSE Dept of CSE
Sree Vidyanikethan Engineering College Sree Vidyanikethan Engineering College
Tirupathi Tirupathi

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We are extremely thankful to our beloved Chairman and founder Dr. M.


Mohan Babu who took keen interest to provide us the infrastructural facilities
for carrying out the project work.

We are highly indebted to Dr. B. M. Satish, Principal of Sree


Vidyanikethan Engineering College for his valuable support and guidance in all
academic matters.

We are very much obliged to Dr. B. Narendra Kumar Rao,


Professor & Head, Department of CSE, for providing us the guidance and
encouragement in completion of this project.

We would like to express our indebtedness to the project coordinator,


Dr. J. Avanija, Professor, Department of CSE, for her valuable guidance during
the course of project work.

We would like to express our deep sense of gratitude to Dr. G. Vennila,


Assistant Professor, Department of CSE, for the constant support and
invaluable guidance provided for the successful completion of the project.

We are also thankful to all the faculty members of CSE Department, who
have cooperated in carrying out our project. We would like to thank our
parents and friends who have extended their help and encouragement either
directly or indirectly in completion of our project work.

ABSTRACT

In the realm of virtual assistants, mental health assessments, and customer
service, incorporating emotion detection in speech is essential for effective
communication between humans and machines. A variety of distinct speech
characteristics can be leveraged to extract valuable insights from audio samples.
Our objective is to develop an emotion identification system that utilizes these
detected attributes within the audio samples.
In the field of machine learning, there exist traditional models such as Support
Vector Machines (SVM), K-Nearest Neighbors (KNN), and Random Forest (RF)
classifiers that can be applied to Speech Emotion Recognition (SER) systems.
Our proposed model consists of a 1D CNN-LSTM architecture, with four Local
Feature Learning Blocks (LFLBs) followed by a primary LSTM layer. Each LFLB
comprises a single convolutional layer and a single max-pooling layer, which are
effective at capturing inherent correlations and constructing hierarchical
relationships. The LSTM layer then learns the long-term dependencies present in these local features.
The combination of Convolutional Neural Network (CNN) and Long Short-Term
Memory (LSTM) has the potential to surpass the constraints of individual
networks. This study evaluates their performance using the Berlin EmoDB
dataset. Our proposed method outperforms current conventional models, setting
a new benchmark for accuracy and effectiveness in Speech Emotion Recognition
(SER) systems.

Keywords— speech emotion recognition, machine learning, Long Short Term


Memory, Convolutional Neural Network, Berlin EmoDB.

TABLE OF CONTENTS:

S.No Topic PageNo.

Vision and Mission i


Program Educational Objectives ii
Program Specific Outcomes iii
Program Outcomes iv
Course Outcomes vi
CO-PO Mapping vii
Declaration viii
Certificate ix
Acknowledgement x
Abstract xi

1. Introduction 1-6
1.1 Statement of the Problem
1.2 Objectives
1.3 Scope
1.4 Applications
1.5 Limitations

2. Literature Survey 7-9

3. Analysis 10 - 16
3.1 System Requirements
3.1.1 Hardware Specifications
3.1.2 Software Specifications
3.2 Existing System
3.3 Proposed System
3.4 Requirement Analysis
3.4.1 Functional Requirements
3.4.2 Non-Functional Requirements

4. Design 17 - 21

4.1 UML Diagram


4.1.1 Use Case Diagram
4.1.2 Activity Diagram
4.1.3 Sequence Diagram
4.1.4 Class Diagram
5. Implementation 22 - 28

5.1 Dataset Used


5.2 Dataset Information
5.3 Preprocessing
5.4 Background of CNN and LSTM architecture

6. Execution Procedure and Testing 29 - 33


6.1 Execution Procedure
6.2 Testing
6.3 Types of Testing

7. Results and Performance Evaluation 34 - 37


7.1 Evaluation Description
7.2 Performance Evaluation
7.3 Results

8. Conclusion and Future Works 38 - 39

9. Appendix 40 - 57
Program Listing/Code
Screen shots
List of Figures
List of Abbreviations
List of Tables
References

CHAPTER-1

INTRODUCTION

In live calls, speech emotion recognition (SER) is a real-time system that instantly ascertains a speaker's emotional state. Customer service, mental health counselling, and human-robot interaction are just a few uses for this technology. The SER process for live calls includes recording the speech signal with a microphone, extracting pertinent acoustic features from the signal, and classifying the speaker's emotional state with machine learning algorithms into predefined categories such as happy, sad, angry, and neutral. One of the main obstacles in performing SER on live conversations is dealing with the noise and fluctuation in the speech signal caused by factors such as background noise, speech overlaps, and variations in speaking styles. The quality of the voice signal may be improved, and the influence of noise and other distortions reduced, by using a variety of signal processing techniques such as speaker diarization and noise reduction. To record and analyse the speech signal in real time during live calls, the trained model must be combined with a microphone and a real-time processing system. To obtain the required degree of emotion detection performance, it is important to ensure that the system can analyse the speech signal quickly and accurately. Overall, emotion recognition in live conversations is a difficult but intriguing area of study that holds great promise for real-world applications in industries like customer service and mental health. The performance of SER on live conversations is expected to keep improving thanks to developments in machine learning and signal processing, making it an increasingly important tool for enhancing human-computer engagement and communication.


1.1 STATEMENT OF THE PROBLEM

The solution should analyse the voice of the caller on live, ongoing calls being attended to in an Emergency Response System. After analysing the voice of the caller, the solution should predict the emotional/mental condition of the caller. In particular, the solution should predict/suggest the following about the caller:
 Stressful voice
 Drunk voice
 Prank voice
 Abusive voice
 Painful voice
 Or any other mental condition

Objective: To accomplish this goal, the system will be trained using a sizable dataset of audio recordings of human speech in various emotional states. The system will employ deep neural networks and other machine learning methods to extract pertinent information from the audio input and categorize the speaker's emotions. To give clients a more effective and tailored experience, the proposed system would be integrated into existing communication systems such as contact centres and customer support centres.

1.2 OBJECTIVES

This project aims to create an automated speech emotion identification


system that can identify a speaker's emotional state during a live call.
Customer service agents (CSRs), who communicate with customers over the
phone, will use the technology in a call centre setting. The technology will
assess the speech signal in real-time and give the CSR knowledge on the
customer's emotional state. The CSR may use this information to better
understand the needs of the consumer and respond accordingly.

1. Create an automated method for detecting the emotional state of the


speaker from voice signals using real-time analysis.

2. Integrate the technology into a call centre setting to give CSRs knowledge
about the customer's emotional condition during a live conversation.

3. Train the system on a sizable dataset of labelled speech signals using deep learning methods, particularly CNNs and LSTMs.

4. Compare the system's performance to that of cutting-edge speech emotion


detection systems after evaluating it on a test set of voice signals.

5. Improve the system's precision, speed, and resistance to noise and other
environmental elements that are frequently present in a contact centre
setting.

6. Create a user-friendly user interface for the system that gives the customer
service representative (CSR) real-time feedback, including a description of
the client's emotional state and suggested answers based on that state.

1.3 SCOPE

The scope of this project encompasses the development and evaluation of a


cutting-edge computational model designed for the detection and recognition
of emotion in speech audios. This includes:
 Mental Health and Well-being:
SER in live interactions can open a wide range of avenues for mental health support during real-world conversations. Professionals may use SER to study emotional states during counselling or support sessions, to intervene swiftly when needed, and to tailor treatment procedures to an individual's needs. Detecting signs of emotional distress in this way can increase the accessibility and utility of mental health services, allowing more people to be reached.
 User Experience and Human-Computer Interaction:
Access to the emotional state of the user in real time during interaction with a computer makes a more personalized relationship possible. Emotion-aware systems can adapt their interfaces, content, or recommendations according to the detected emotions, which boosts user satisfaction and affinity. This application spans a myriad of industries, such as virtual assistants, entertainment apps, and interactive platforms.
 Training and Education:
SER may help in evaluating engagement and comprehension during the live discussions that occur in remote learning within educational institutions. Teachers can gauge emotional responses from students during online classes to learn how well they are following the material or where they may need more focus. This, in turn, can enable facilitators to adapt their strategy to the needs of each student.

1.4 APPLICATIONS:

Speech emotion recognition using 1D CNN and LSTM networks finds diverse
applications across various domains. In healthcare, it aids in detecting
emotional cues for mental health assessment and speech therapy evaluation.
In customer service, it enhances sentiment analysis for understanding
customer emotions and improving service quality. In human-computer
interaction, it enables more personalized user experiences by recognizing
emotional states during interactions with devices. In education, it assists in
assessing student engagement and emotional responses in online learning
environments. In entertainment, it enhances the immersive experience by
adapting content based on user emotions in gaming and virtual reality
applications. Additionally, in security and law enforcement, it can be utilized
for emotion-based lie detection and monitoring emotional states during
critical interactions. Overall, speech emotion recognition with 1D CNN and
LSTM networks has broad implications for enhancing communication,
engagement, and understanding in various human-machine interaction
scenarios.


1.5 LIMITATIONS:

Speech emotion recognition plays a crucial role in modern telecommunication technology. However, there are several challenges and limitations that need to be addressed for effective implementation.

 Complexity and Computational Cost:

Implementing a Speech Emotion Recognition (SER) system using


Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM)
networks can be computationally expensive and complex. CNNs require
significant computational resources for training due to their deep
architectures and large numbers of parameters. LSTM networks, although
efficient in handling sequence data like speech, also add to the computational
burden, especially when dealing with long sequences. This complexity and
computational cost can pose challenges, particularly in real-time applications
or when dealing with large datasets, limiting the scalability of the system.

 Data Dependency and Generalization:

Another limitation is the system's dependency on large and diverse


datasets for training. CNNs and LSTMs are known for their ability to learn
intricate patterns and dependencies in data, but they require extensive
training on diverse datasets to generalize well across different speakers,
languages, and emotional expressions. Limited or biased training data can
lead to overfitting, where the model performs well on the training data but
fails to generalize effectively to unseen data. This issue can hinder the
system's performance in real-world scenarios with varying input conditions.

 Interpretability and Explainability:

CNN and LSTM models, especially when combined, can create highly
complex and opaque systems. The deep layers and sequential processing


make it challenging to interpret how the model arrives at its predictions. This
lack of interpretability and explainability is a significant limitation, especially
in applications where understanding the reasoning behind the system's
decisions is crucial, such as in healthcare or legal contexts. It can also limit
the user's trust and acceptance of the system, impacting its usability and
adoption.

While CNN and LSTM-based SER systems offer significant advancements in


emotion recognition from speech, they come with inherent limitations that
need to be addressed for broader deployment and usability. Future research
directions may focus on mitigating computational costs through model
optimization techniques, improving generalization with more diverse
datasets and data augmentation methods, and enhancing interpretability
through model transparency and explainable AI approaches. Overcoming
these limitations can lead to more robust and reliable SER systems that find
broader applications across various domains.


CHAPTER – 2

LITERATURE SURVEY

2.1 OVERVIEW OF EXISTING RESEARCH AND LITERATURE


This section will examine recent developments in the field of emotion
recognition. We will provide an overview of several studies conducted in this
area. Various references will be summarized and discussed to highlight the
progress made in emotion recognition research. The following are our survey
details.
In "An Affective Service based on Multi-Modal Emotion Recognition, using EEG enabled Emotion Tracking and Speech Emotion Recognition" by Danai Styliani Moschona (2021), the proposed system is evaluated using a dataset of EEG signals and speech samples, and the results show that the multi-modal approach outperforms single modalities in recognizing emotions. The
proposed system could be used in various applications such as affective
computing, mental health diagnosis, and human-robot interaction.
In "A Review on Speech Emotion Recognition Techniques" by Pandey etal.
(2021), The authors provide an overview of speech emotion recognition
techniques, including traditional machine learning and deep learning-based
approaches. They discuss various feature extraction techniques, classification
algorithms, and datasets used in the field. They also highlight the limitations
and future directions of the field, such as the need for more research into
cross language generalisation and the development of more accurate and
robust models.
In "Speech Emotion Recognition using Deep Learning" by Li et al. (2020), the paper
provided a detailed review of the recent advancements in deep learning
techniques for speech emotion recognition. The authors discussed various
deep learning models, including CNNs, LSTMs, and attention-based models.
They also discussed the challenges and limitations of current systems and
provided suggestions for future research.


In "Multilayer Network-Based CNN Model for Emotion Recognition" by Dong-Mei Lv, Ru-Mei Li, Lin-Ge Rui, and Zhuo-Yi Yang (2021), the proposed model outperforms existing methods for emotion recognition from speech signals.
The authors also performed a feature analysis to identify the most important
features for each emotion and found that the proposed model learns high-
level features related to prosody, rhythm, and pitch that are important for
emotion recognition.
In "Speech emotion recognition based on multi-feature and multi-lingual fusion" by Chunyi Wang, Ying Ren, and Na Zhang (2020), the proposed method is evaluated on the IEMOCAP and CASIA databases,
and the results show that the proposed method outperforms other existing
methods for emotion recognition from speech signals in multiple languages.
The authors also performed a feature analysis to identify the most important
features for each emotion, and found that prosodic features and MFCCs are
the most important features for emotion recognition.
In "Speech Emotion Recognition: Recent Advances and Prospects"by Hu etal.
(2018), The authors provide an overview of the current state-of-the-art in
speech emotion recognition. They cover various approaches such as feature
extraction, feature selection, and classification methods. They also discuss
the limitations and future directions of the field, such as the need for more
research into cross-corpus generalisation and the development of more
efficient and effective algorithms.
2.2 INFERENCES FROM LITERATURE SURVEY

The research review highlights several key insights regarding the
advancements and challenges in speech emotion recognition (SER) using
multi-modal and deep learning approaches. Firstly, the integration of multi-
modal data, such as EEG signals and speech samples, proves to be more
effective in emotion recognition compared to single modalities. This multi-
modal approach not only enhances the accuracy of emotion recognition but
also opens avenues for diverse applications in affective computing, mental


health diagnosis, and human-robot interaction. It signifies the importance of


leveraging complementary information from different modalities to achieve
robust and reliable emotion recognition systems.
Secondly, the evolution of deep learning techniques, including CNNs, LSTMs,
and attention-based models, has significantly improved SER performance.
These models can learn complex patterns and dependencies in speech data,
leading to more accurate emotion recognition. However, challenges such as
cross-language generalization, model interpretability, and the need for more
diverse and standardized datasets remain prominent. Future research
directions should focus on addressing these challenges, exploring feature
fusion strategies, enhancing model interpretability, and developing efficient
algorithms for real-time applications. Overall, the literature underscores the
continuous progress and potential of SER systems but also highlights the
ongoing efforts required to overcome existing limitations and advance the
field further.


CHAPTER - 3

ANALYSIS

System analysis is the detailed study of the various operations performed by the system and their relationships within and outside the system. It is the breakdown of a system into parts so that the entire system may be understood. System analysis is concerned mainly with understanding the problem, identifying the relevant variables used for decision making, and analyzing and synthesizing them to obtain optimal solutions. Another view of it is as a problem-solving technique that breaks down a system into different parts and studies how those parts interact to accomplish their purpose.

3.1 SYSTEM REQUIREMENTS:

3.1.1 Hardware specifications

● Microsoft Server enabled computers, preferably workstations

● RAM of 4 GB or above

● Processor with a frequency of 1.5 GHz or above (Intel i5 or AMD Ryzen)

● Operating system: Windows 7 or above

3.1.2 Software specifications

● Python 3.8 or higher

● Google Colab Notebook

● Visual Studio Code


3.2 Existing System:


The approach for identifying ambiguous speech emotions using multiple
classifiers and interactive learning is suggested in the paper "Multi-Classifier
Interactive Learning for Ambiguous Speech Emotion Recognition". By
tackling the difficulties brought on by ambiguous emotional expressions, the
system seeks to increase the precision of speech emotion identification. A
feature extraction module, a multiclassifier module, and an interactive
learning module make up the three primary parts of the suggested system.
The feature extraction module uses a mix of time-domain, frequency-
domain, and cepstral domain analysis approaches to extract pertinent
characteristics from speech signals. The multi-classifier module divides the
retrieved features into many emotional categories using a variety of
classifiers, such as support vector machines (SVM), k-nearest neighbours
(KNN), and random forests (RF). The final emotion label for a specific speech
signal is created by combining the output of each classifier using a weighted
average method. By allowing users to comment on the precision of the
system's predictions, the interactive learning module aims to solve the issue
of ambiguous emotional expressions. This feedback is used by the system
to modify the weights given to each classifier, enhancing the system's
performance over time. The suggested system was tested on a publicly
accessible dataset of speech emotions, and the findings revealed that it
surpassed numerous cutting-edge methods in terms of recall, accuracy, and
precision. The approach might potentially be used in fields including speech
therapy, social robotics, and human-computer interaction, according to the
authors. The suggested method has the potential to increase the precision
of emotion detection in a range of circumstances and provides a unique way
to solving the difficulties of ambiguous speech emotion recognition.

The existing methods for speech emotion recognition, while groundbreaking at their inception, come with several disadvantages that limit their effectiveness and application in real-world settings. Here's a concise overview:

 High False Positive Rate: Many current models struggle to
differentiate between similar-sounding emotional expressions in audio clips.
This leads to a high rate of false positives, necessitating further review and
testing, which can be time-consuming and misleading.
 Computational Efficiency: The training and inference processes for
deep learning models can be computationally intensive, especially
when processing large samples of high-resolution audio clips. This can
limit the practicality and scalability of deploying these models in tele-
communication systems.
 Generalization and Transferability: Models trained on specific
datasets may not perform well on data from different sources or
populations due to variations in signaling protocols and demographic
factors. This challenges the generalization and transferability of
existing models across different communication systems.
 Interpretability and Explainability: Deep learning models often
operate as "black boxes," making it difficult for technicians to
understand the rationale behind the classifications. This lack of
transparency can hinder trust and adoption in communication
systems.

3.3 Proposed System

The CNN-LSTM network is constructed by coupling four LFLBs, one LSTM layer, and one fully connected layer. To distinguish identical building blocks or layers, we use the following naming convention: 1) the number preceding a designation indicates which network the building block or layer belongs to, and 2) the number following it indicates the position of that element among the blocks or layers. The complete architecture diagram of the CNN-LSTM network is shown in the architecture section.

Additionally, early stopping was carefully used to mitigate overfitting. While iterative training improves the model's fit to the training data, halting the training early was critical to improving performance on out-of-sample data. The study outcomes clearly showed that choosing the monitored quantity and the patience value is a delicate balance. The trials show that early stopping is a valuable practice that makes the networks learn more general features and display enhanced forecasting capability.
Work Flow of Proposed Model:

Fig -: Block diagram of proposed model

3.4 REQUIREMENT ANALYSIS:

Requirement analysis is a critical phase in the development of any machine


learning system, particularly in the context of speech emotion recognition
on live calls. It involves gathering, documenting, and analyzing the needs
and constraints of stakeholders to define the system's functionalities and
characteristics. In the case of speech emotion recognition on live calls,
requirements can be categorized into two main types: functional and non-
functional.


3.4.1 FUNCTIONAL REQUIREMENTS:

1. Audio Data Acquisition:

 Allow users to input audio recording clips.
 Support various audio formats such as MP3, WAV, etc.
2. Data Preprocessing:

 From the recorded speech, extract the information that is relevant to emotion detection. These features can be acoustic, prosodic, or linguistic, depending on the type of emotion to be detected. Examples of acoustic characteristics are pitch, energy, and frequency.

 Prosodic features include speaking duration, loudness, and tempo. Linguistic features include the words used and the emotional content they convey.
3. Feature Extraction:
 Extract relevant features from preprocessed audio clips.
 Features should capture distinctive characteristics of audio
samples for effective detection.
4. 1D CNN and LSTM Networks Approach:
 Employ a 1D CNN with local feature learning blocks to extract features from the preprocessed data.
 Connect an LSTM layer to the learning blocks for emotion detection.
5. Model Training:
 Train the model using a dataset of labeled emotion classes.
 Optimize the model parameters to improve recognition accuracy.


6. Real-time Recognition:
 Perform emotion recognition in real time, allowing users to analyze audio recordings as they arrive (a minimal real-time sketch is given after this list).
7. User Interface:
 Provide a user-friendly interface for uploading, viewing and
analyzing audio recordings.
 Include features for displaying recognized emotion and their
characteristics.
8. Integration:
 Seamlessly integrate the system with other telecommunication systems or communication devices.
 Ensure compatibility with standard audio formats and protocols.
9. Model Evaluation:
 Evaluate the performance of the SER model using metrics such as
sensitivity, specificity and accuracy.
 Validate the model’s effectiveness through testing with diverse
audio datasets.
10. Output:
 Provide detailed reports on detected emotion, including their
characteristics.
 Generate visualization to aid in the analysis of audio samples.
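
The real-time recognition requirement can be illustrated with a short sketch. This is a minimal example only: it assumes a trained Keras model saved as ser_model.h5, an illustrative list of emotion labels, a 3-second analysis window at 16 kHz, and the third-party packages sounddevice and librosa for microphone capture and MFCC extraction; these names and parameters are assumptions for illustration, not the report's final implementation.

# Minimal real-time recognition sketch (assumed model file, label order,
# 16 kHz audio, 40 MFCCs over a 3-second window padded to 300 frames).
import numpy as np
import sounddevice as sd
import librosa
from tensorflow.keras.models import load_model

SR = 16000          # assumed sampling rate
WINDOW_SEC = 3      # assumed analysis window
MAX_FRAMES = 300    # assumed fixed input length expected by the model
LABELS = ["anger", "boredom", "disgust", "fear", "happiness", "sadness", "neutral"]

model = load_model("ser_model.h5")   # hypothetical trained CNN-LSTM model

def predict_window():
    # Record one analysis window from the default microphone.
    audio = sd.rec(int(WINDOW_SEC * SR), samplerate=SR, channels=1, dtype="float32")
    sd.wait()
    y = audio.flatten()
    # Frame-level MFCCs -> (time steps, 40) sequence for the 1D CNN-LSTM.
    mfcc = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=40).T
    # Pad or truncate to the fixed length the network was trained with.
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    else:
        mfcc = mfcc[:MAX_FRAMES]
    probs = model.predict(mfcc[np.newaxis, ...], verbose=0)[0]
    return LABELS[int(np.argmax(probs))]

if __name__ == "__main__":
    print("Detected emotion:", predict_window())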

3.4.2 NON-FUNCTIONAL REQUIREMENTS:

The system will have the following non-functional requirements:


Performance: The system should demonstrate high performance, with fast and accurate emotion recognition capabilities. It should process audio swiftly, providing near real-time results.


Scalability: The system should be scalable to accommodate an increasing number of users and emotion recognition requests. It should efficiently handle varying workloads without compromising performance.

Portability: The system should be highly portable, capable of running on a wide range of devices with different specifications and operating systems. It should maintain consistent performance across various platforms.

Usability: The user interface should be intuitive and easy to navigate, even
for users with limited technical expertise.

Adaptability: The system should adapt to variations in recording conditions, audio quality, and speaking styles to maintain robust performance across diverse scenarios.

Security: The system will be designed with security in mind, with user
authentication and access control measures.


CHAPTER - 4
DESIGN
The most creative and challenging part of system development is System
Design. The Design of a system can be defined as a process of applying
various techniques and principles for the purpose of defining a device,
architecture, modules, interfaces and data for the system to satisfy specified
requirements. For the creation of a new system, the system design is a
solution to “how to” approach.

4.1 UML DIAGRAM:

Unified Modeling Language diagrams are a visual representation of a


software system or a process. UML diagrams are used to model, design,
and document software systems. There are several types of UML
diagrams, each serving a different purpose.

The ultimate objective is for UML to become a standard language for creating models of object-oriented computer software. In its current form, UML comprises two notable parts: a meta-model and a notation. In the future, other kinds of methods or processes may also be added to or associated with UML.

These UML diagrams can be used in different stages of the software


development life cycle, from requirements gathering to design,
implementation, testing, and maintenance. They help in communicating
the system design and architecture to different stakeholders, including
developers, testers, project managers, and customers


4.1.1 USE CASE DIAGRAM:

A use case diagram is used to present a graphical overview of the functionality provided by a system in terms of actors, their goals, and any dependencies between those use cases. A use case diagram consists of:
Use case: A use case describes a sequence of actions that provides something of measurable value to an actor and is drawn as a horizontal ellipse.
Actor: An actor is a person, organization or external system that plays a role in one or more interactions with the system.

Fig.4.1.1.1 Use Case Diagram

4.1.2 ACTIVITY DIAGRAM:

Activity diagram is a graphical representation of workflows of stepwise


activities and actions with support for choice, iteration and concurrency.


An activity diagram shows the overall flow of control. The most important
shape types:
• Rounded rectangles represent activities.
• Diamonds represent decisions.
• Bars represent the start or end of concurrent activities.
• A black circle represents the start of the workflow.
• An encircled circle represents the end of the workflow.

Fig. 4.1.2.1 Activity Diagram


4.1.3 SEQUENCE DIAGRAM

Sequence Diagram Illustrates the sequence of messages or interactions


between objects or components over time.

Fig. 4.1.3.1 Sequence Diagram


4.1.4 CLASS DIAGRAM

Class Diagram Represents the static structure and relationships of classes


and their members within a system.

Fig. 4.1.4.1 Class Diagram


CHAPTER-5

IMPLEMENTATION
The combination of Convolutional Neural Network (CNN) and Long Short-
Term Memory (LSTM) has the potential to surpass the constraints of individual
networks. This study evaluates their performance using the Berlin EmoDB
dataset. Our proposed method outperforms current conventional models,
setting a new benchmark for accuracy and effectiveness in Speech Emotion
Recognition (SER) systems.

Fig 5.1: System Architecture


5.1 Dataset Used:

The Berlin Emotional Database (EmoDB) is a widely used dataset in the field
of speech emotion recognition (SER). It comprises audio recordings of actors
speaking scripted sentences to convey various emotions. The dataset contains a total of 535 utterances, each tagged with an emotional label. These labels include basic emotions such as happiness, sadness, anger, and fear, among others, providing a diverse set of emotional expressions for analysis.

In terms of dimensions, the audio data in EmoDB is stored as WAV files, which record the audio signal's amplitude over time. Each audio file is sampled at 16 kHz and is of short duration, typically a few seconds long. This
format allows researchers to analyze the acoustic features of speech,
including pitch, intensity, and spectral characteristics, to extract meaningful
features for emotion recognition tasks. Overall, the EmoDB serves as a
valuable resource for studying emotional speech and developing SER models
due to its size, diversity of emotions, and well-defined labeling of samples.

Dataset Name Berlin’s Emotional Database

Dataset Link https://drive.google.com/drive/folders/1bIS8ZHSiEZjrM9m-


NipodSLjIRNC0gZs?usp=sharing

Dataset Size 140 MB

No. of samples 535

No of classes 7

File Extensions .wav


5.2 Dataset Information:

Audio clips of the following classes are used: happiness, sadness, anger, fear, disgust, neutral, and boredom.

Fig 5.2:Sample Waveforms of various emotions present in EmoDB data set.
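
As an illustration of how the class labels can be read from the corpus, the sketch below assumes the standard EmoDB file naming convention, in which the sixth character of each file name encodes the emotion (for example, W for anger and N for neutral); if the downloaded copy is organised differently, this mapping would need to be adjusted.

# Sketch: derive emotion labels from EmoDB file names (assumes the
# standard naming convention, e.g. "03a01Wa.wav" -> anger).
import os

EMODB_CODES = {
    "W": "anger", "L": "boredom", "E": "disgust", "A": "fear",
    "F": "happiness", "T": "sadness", "N": "neutral",
}

def label_from_filename(path):
    code = os.path.basename(path)[5]   # sixth character holds the emotion code
    return EMODB_CODES.get(code, "unknown")

def load_labels(data_dir):
    wavs = [f for f in os.listdir(data_dir) if f.endswith(".wav")]
    return {f: label_from_filename(f) for f in wavs}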


5.3 Preprocessing:
From the recorded speech, we extract the information that is relevant to emotion detection. These features can be acoustic, prosodic, or linguistic, depending on the type of emotion to be detected. Examples of acoustic characteristics are pitch, energy, and frequency. Prosodic features include speaking duration, loudness, and tempo. Linguistic features include the words used and the emotional content they convey.
DATA CLEANSING:
The utterances should form a diverse group that covers different voices, emotions, accents, and environmental conditions; to achieve this, audio files are collected from real calls. The data is preprocessed by removing artifacts, unnecessary background noise, and unwanted segments, as they may affect the precision of the emotion detector.

Fig 5.3: Data Preprocessing
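
A minimal preprocessing sketch is given below. It assumes librosa is available and uses silence trimming plus MFCC extraction as the acoustic features; the exact feature set, sampling rate, and fixed sequence length are illustrative assumptions rather than the report's final configuration.

# Sketch: load a .wav clip, trim silence, and extract a fixed-length
# MFCC sequence suitable for a 1D CNN-LSTM (assumed parameters).
import numpy as np
import librosa

SR = 16000        # assumed target sampling rate
N_MFCC = 40       # assumed number of MFCC coefficients
MAX_FRAMES = 300  # assumed fixed sequence length

def extract_features(path, sr=SR, n_mfcc=N_MFCC, max_frames=MAX_FRAMES):
    y, _ = librosa.load(path, sr=sr)              # resample to a common rate
    y, _ = librosa.effects.trim(y, top_db=25)     # remove leading/trailing silence
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T   # (frames, n_mfcc)
    if mfcc.shape[0] < max_frames:                # pad or truncate to a fixed length
        mfcc = np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    else:
        mfcc = mfcc[:max_frames]
    return mfcc.astype(np.float32)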

5.4 Background of CNN and LSTM Architecture:

CNN serves as a cornerstone in deep learning, prominently utilized for tasks


such as image and video recognition. It employs a series of convolutional
and pooling layers to extract hierarchical features from input data, followed

by fully connected layers for classification.

Convolutional Layers: These layers convolve learnable filters or kernels


across input images, capturing spatial patterns and features through dot
product computations.

Pooling Layers: Following convolution, pooling layers aggregate


information by reducing spatial dimensions, aiding in translation invariance
and computational efficiency.

Fully Connected Layers: Subsequent fully connected layers perform


classification based on the extracted features, enabling robust recognition
capabilities.

Fig 5.4: CNN Architecture for SER System

LSTM networks offer an excellent edge in an SER scenario within real conversations because they have the innate ability to track how speech patterns change from one time point to another. The emotions implicated in utterances during a conversation are multi-dimensional and shift across time. LSTMs are more successful than plain RNNs at capturing these modest temporal changes because they preserve significant information over an extended timeline. This competence allows the model to recognize subtle changes in emotional expression, a fairly crucial ability for capturing the speaker's constantly changing feelings at any moment. It also allows the LSTM layers to take advantage of segments previous to the current segment, which helps in understanding the speaker's emotional journey across the call with more precision.
Speech patterns in live interactions are diverse, and this variability makes emotion recognition harder. The fundamental value of the LSTM is its capacity to adapt to such discrepancies. The memory cells in LSTMs allow the model to pick out important emotional signals from changes in tone, speed, or silences and pauses during speech. This provides a strong capability for detecting emotions across the wide variety of speech patterns that appear in authentic, ongoing conversations.

Additionally, LSTMs can process live speech input as it is being spoken, monitoring the emotional state of the speaker throughout the call and providing instant feedback. They retain context and make better use of the conversation history, which increases the ability to sense emotions in free-flowing communication. In this light, LSTMs are well suited to interpreting the feelings conveyed in live conversation with improved accuracy and precision.

Fig5.5: LSTM Architecture


Convolutional Neural Network with Long Short Term Memory:


The CNN-LSTM network is constructed by coupling four LFLBs, one LSTM layer, and one fully connected layer. To distinguish identical building blocks or layers, we use the following naming convention: 1) the number preceding a designation indicates which network the building block or layer belongs to, and 2) the number following it indicates the position of that element among the blocks or layers. The complete architecture diagram of the CNN-LSTM network is shown in the architecture section.

Additionally, early stopping was carefully used to mitigate overfitting. While iterative training improves the model's fit to the training data, halting the training early was critical to improving performance on out-of-sample data. The study outcomes clearly showed that choosing the monitored quantity and the patience value is a delicate balance. The trials show that early stopping is a valuable practice that makes the networks learn more general features and display enhanced forecasting capability.
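
To make the described architecture concrete, a minimal Keras sketch of the 1D CNN-LSTM is given below. It assumes MFCC sequences of shape (300, 40) as input and seven output classes; the filter counts, kernel sizes, and pool sizes are illustrative assumptions and would need to be tuned to reproduce the reported results.

# Sketch: 1D CNN-LSTM with four local feature learning blocks (LFLBs),
# one LSTM layer, and one fully connected softmax layer (assumed sizes).
from tensorflow.keras import layers, models

def build_cnn_lstm(input_shape=(300, 40), num_classes=7):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    # Four LFLBs: each is one Conv1D layer followed by one max-pooling layer.
    for filters in (64, 64, 128, 128):
        model.add(layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu"))
        model.add(layers.MaxPooling1D(pool_size=2))
    # The LSTM layer learns long-term dependencies over the local features.
    model.add(layers.LSTM(128))
    # Fully connected layer for classification into the seven EmoDB emotions.
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model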


CHAPTER-6

EXECUTION PROCEDURE AND TESTING

6.1 Execution Procedure:


The execution procedure for developing a robust speech emotion recognition model using CNN and LSTM networks involves the following steps:
 Data Gathering: Collect a diverse dataset of audio recordings with different emotional classes. The dataset should encompass variations in tone, voice, and frequency to ensure the model's robustness across different scenarios.
 Preprocessing Datasets and Data Splitting: From the recorded speech, extract the information that is relevant to emotion detection. These features can be acoustic, prosodic, or linguistic, depending on the type of emotion to be detected. Examples of acoustic characteristics are pitch, energy, and frequency; prosodic features include speaking duration, loudness, and tempo; linguistic features include the words used and the emotional content they convey. The processed data is then split into training, validation, and test sets.
 Model Building: Developing a CNN and LSTM based architecture customized for emotion recognition entails configuring the optimal number of layers, units, and other architectural parameters to enhance detection accuracy. This involves intricate decision-making to ensure the model's efficacy in identifying emotions. Moreover, integrating both networks offers a refined approach to fine-tuning hyperparameters, enhancing the model's performance specifically for emotion detection tasks. By leveraging this combination, we aim to create a robust framework capable of accurately identifying emotions, thereby advancing interaction capabilities in telecommunication systems.

 Training Model: Training the model involves employing iterative optimization techniques on the training dataset. Throughout training, parameters like learning rate and batch size are fine-tuned to optimize the model's performance. Validation-set evaluation occurs iteratively to monitor the model's progress and guide adjustments that enhance performance. This iterative process ensures the model's continual refinement, ultimately leading to improved accuracy in emotion detection (a minimal training sketch is given after this list).
 Prediction for New Data: Once the model is trained, it's deployed to
predict emotion in new audio recordings. Real-world performance
monitoring is crucial post-deployment, enabling ongoing refinement
based on feedback and evaluation. Continuous assessment ensures the
model remains effective across diverse scenarios, allowing for
necessary adjustments to maintain accuracy and reliability in detecting
emotions in communication practices.
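
A minimal training sketch tying these steps together is shown below, as referenced in the list above. It assumes the extract_features and label_from_filename helpers sketched in Chapter 5, the build_cnn_lstm sketch from Section 5.4, and a local copy of EmoDB; the split ratio, batch size, epoch count, and early-stopping patience are illustrative assumptions.

# Sketch: prepare data, split it, and train the CNN-LSTM with early stopping
# (assumes the helper functions sketched earlier in this report).
import os
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping

DATA_DIR = "emodb/wav"    # assumed local dataset folder
LABELS = ["anger", "boredom", "disgust", "fear", "happiness", "sadness", "neutral"]

files = [f for f in os.listdir(DATA_DIR) if f.endswith(".wav")]
X = np.stack([extract_features(os.path.join(DATA_DIR, f)) for f in files])
y = np.array([LABELS.index(label_from_filename(f)) for f in files])

# Stratified split keeps the class balance in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = build_cnn_lstm(input_shape=X.shape[1:], num_classes=len(LABELS))

# Early stopping monitors validation loss and restores the best weights.
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_split=0.1,
          epochs=100, batch_size=32,
          callbacks=[early_stop])

print("Test accuracy:", model.evaluate(X_test, y_test, verbose=0)[1])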

6.2 Testing:
Testing is an important part of model development and involves evaluating
the performance of the trained models on a previously unseen dataset. This
is done to ensure that the models have not overfit the training data and
can generalize well to new data. The testing dataset is used to evaluate the
performance of the models by comparing the predicted values to the actual
values. Metrics such as accuracy, precision, recall, F1-score, and others can
be used to evaluate the performance of the models. The testing dataset
should be representative of the population that the model will be used on,
and should not be used for training the model. It is important to repeat the
testing process several times using different subsets of the data to ensure
that the results are consistent and reliable.

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests, and each test type addresses a specific testing requirement.

6.3 Types of Testing:

Unit testing

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application; it is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
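
As a small illustration of unit testing in this project, the sketch below checks the feature extraction helper from Chapter 5 in isolation. It assumes the extract_features function and its assumed output shape (300 frames of 40 MFCCs), uses pytest's tmp_path fixture, and writes a synthetic sine tone so the test does not depend on the dataset.

# Sketch: pytest-style unit test for the assumed extract_features helper.
import numpy as np
import soundfile as sf

def test_extract_features_shape(tmp_path):
    # Create a one-second synthetic tone so the test needs no real data.
    sr = 16000
    t = np.linspace(0, 1, sr, endpoint=False)
    tone = 0.5 * np.sin(2 * np.pi * 220 * t)
    wav_path = tmp_path / "tone.wav"
    sf.write(wav_path, tone, sr)

    features = extract_features(str(wav_path))   # helper sketched in Chapter 5
    assert features.shape == (300, 40)           # padded/truncated to fixed length
    assert features.dtype == np.float32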

Integration testing

Integration tests are designed to test integrated software components to determine if they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.

Software integration testing is the incremental integration testing of two or


more integrated software components on a single platform to produce failures
caused by interface defects.

The task of the integration test is to check that components or software


applications, e.g. components in a software system or – one step up –
software applications at the company level – interact without error.
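
For this project, a lightweight integration check could verify that the data-preparation code and the model-building code work together, for example that the arrays produced by loadData are accepted by the emo1d network. The sketch below assumes the functions and hyper-parameters from the appendix listing and is illustrative rather than exhaustive.

import numpy as np

class Args:                      # mirrors the hyper-parameters used in the appendix listing
    num_fc = 128
    learning_rate = 5e-5

def test_data_and_model_integration():
    x_tr, y_tr, x_t, y_t, x_val, y_val = loadData()   # data pipeline from the appendix
    model = emo1d(input_shape=x_tr.shape[1:], num_classes=y_tr.shape[1], args=Args())
    preds = model.predict(x_tr[:4])                    # forward pass on a few samples
    assert preds.shape == (4, y_tr.shape[1])           # one probability per emotion class
    assert np.allclose(preds.sum(axis=1), 1.0, atol=1e-3)  # softmax rows sum to 1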

Acceptance Testing

User Acceptance Testing is a critical phase of any project and requires
significant participation by the end user. It also ensures that the system
meets the functional requirements.

Functional testing

Functional tests provide systematic demonstrations that functions tested are
available as specified by the business and technical requirements, system
documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input: identified classes of valid input must be accepted.

Invalid Input: identified classes of invalid input must be rejected.

Functions: identified functions must be exercised.

Output: identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage of
identified business process flows, data fields, predefined processes, and
successive processes must be considered for testing. Before functional testing
is complete, additional tests are identified and the effective value of current
tests is determined.

By conducting these different types of testing, we can ensure that the
algorithms of the emotion recognition project are functioning properly and can
provide accurate and reliable classifications.


CHAPTER-7

RESULTS AND PERFORMANCE EVALUATION

7.1 Evaluation Description:

CONFUSION MATRIX DESCRIPTION:


True Positive (TP): the model predicts the positive class and the actual class
is positive, i.e. the result is correctly classified.

True Negative (TN): the model predicts the negative class and the actual class
is negative, i.e. the result is correctly classified.

False Positive (FP): the model predicts the positive class but the actual class
is negative, i.e. the result is misclassified.

False Negative (FN): the model predicts the negative class but the actual class
is positive, i.e. the result is misclassified.

Accuracy Obtained:
The accuracy of the classification shows the likelihood that the model's
predictions are correct. It is calculated from the confusion matrix as:

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100

Recall:
Recall plays a significant role in the evaluation of the model's performance.
It is the percentage of relevant instances that are actually retrieved:

Recall = (TP / (TP + FN)) * 100

Precision:
The model performance evaluation also includes precision as a key factor. It
is the percentage of retrieved instances that are actually relevant, i.e. the
proportion of positive predictions that are correct:

Precision = (TP / (TP + FP)) * 100

F-Measure:
Also known as the F-score, the F-measure is the harmonic mean of precision
and recall and is used to gauge test accuracy:

F-Measure = 2 * (Precision * Recall) / (Precision + Recall)
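
As a concrete illustration, these per-class metrics can be derived directly from a confusion matrix, following the formulas above. The sketch below uses hypothetical label arrays purely for demonstration; in the project the true and predicted emotion classes of the test set would be used instead.

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted class indices, for illustration only
y_true = np.array([0, 1, 2, 2, 1, 0, 1])
y_pred = np.array([0, 1, 2, 1, 1, 0, 2])

cm = confusion_matrix(y_true, y_pred)

# Per-class TP, FP, FN, TN derived from the confusion matrix
TP = np.diag(cm)
FP = cm.sum(axis=0) - TP
FN = cm.sum(axis=1) - TP
TN = cm.sum() - (TP + FP + FN)

accuracy = (TP + TN) / (TP + TN + FP + FN) * 100
recall = TP / (TP + FN) * 100
precision = TP / (TP + FP) * 100
f_measure = 2 * precision * recall / (precision + recall)

print("Per-class accuracy :", accuracy)
print("Per-class precision:", precision)
print("Per-class recall   :", recall)
print("Per-class F-measure:", f_measure)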

7.2 Performance Evaluation:


To evaluate the performance of the deep learning algorithm for recognizing
emotions from speech, several metrics can be used, such as accuracy,
precision, recall, F1-score, and others.

Here is a brief overview of the performance evaluation results for the model:

CNN + LSTM Model Results:

We present the successful integration of the Convolutional Neural Network
(CNN) architecture with Long Short-Term Memory (LSTM) for speech emotion
recognition on the EmoDB dataset, yielding strong performance: an accuracy
of 86.3%. These results underscore the model's proficiency in accurately
identifying emotions while minimizing false positives. Through the synergistic
combination of the CNN's potent complex feature learning capabilities and the
LSTM's modelling of long-term context, we have forged a robust framework
that significantly advances emotion recognition in communication systems.
These findings represent a crucial step forward in enhancing emotion
detection, with promising implications for improving interaction between
users and telecommunication firms.


Fig. 7.2. Training and validation accuracy

7.3 RESULTS

The network's abilities were assessed on a benchmark database, revealing that
the constructed CNN LSTM networks excel at learning differentiating
characteristics and accumulating high-level abstractions of emotional input.
In contrast to prior feature representations and methodologies, the 1D CNN
LSTM network demonstrates a superior average accuracy.


Fig 7.3: Recognizing emotion of input audio

Research Work                      Accuracy (%)
Huang Yongming et al. [17]         65.5
Zhengwei Huang et al. [13]         78.3
Our work                           86.33

Table 7.1: Comparison of accuracies of existing models with our proposed model


CHAPTER-8

CONCLUSION AND FUTURE WORKS

This work applies 1D CNN LSTM networks to speech emotion recognition. The
focus is on end-to-end learning of local correlations and global contextual
information directly from raw audio recordings and log-mel spectrograms.
Local feature learning is performed by a Local Feature Learning Block (LFLB),
which is built from a convolutional layer, a batch normalization (BN) layer,
an ELU activation layer and a pooling layer. The local features produced by
the LFLBs are then fed into an LSTM layer so that the network can model their
long-term contextual dependencies. The representations learned by the
resulting networks therefore combine local features with long-term temporal
relations.
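
As a minimal sketch, an LFLB of this kind can be written in Keras as a short stack of layers; the filter counts, kernel and pool sizes below are illustrative, and the project's exact configuration is the one given in the appendix listing.

from keras.models import Sequential
from keras.layers import Conv1D, BatchNormalization, Activation, MaxPooling1D, LSTM, Dense

# Two LFLBs (convolution -> BN -> ELU -> pooling) followed by an LSTM and a softmax output
model = Sequential()
model.add(Conv1D(64, kernel_size=3, padding='same', input_shape=(16000, 1)))  # LFLB1: convolution
model.add(BatchNormalization())                                               # LFLB1: BN
model.add(Activation('elu'))                                                  # LFLB1: ELU
model.add(MaxPooling1D(pool_size=4, strides=4))                               # LFLB1: pooling
model.add(Conv1D(128, kernel_size=3, padding='same'))                         # LFLB2
model.add(BatchNormalization())
model.add(Activation('elu'))
model.add(MaxPooling1D(pool_size=4, strides=4))
model.add(LSTM(128))                          # models long-term contextual dependencies
model.add(Dense(7, activation='softmax'))     # 7 emotion classes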
The network's abilities were assessed on a benchmark database, revealing
that the constructed CNN LSTM networks excel at learning differentiating
characteristics and accumulating high-level abstractions of emotional input.
In contrast to prior feature representations and methodologies, the 1D CNN
LSTM network demonstrates a superior average accuracy.
On the other hand, although this study indicates an advantage of deep
networks for speech emotion classification, several open issues remain. In
particular, a complete description of how emotions are perceived by these
networks is still missing: how exactly the networks perceive feelings remains
a black box. Several researchers have contributed advances towards unveiling
this black-box character of deep learning, but most of their reported efforts
concentrate on deep networks used for image processing.
Speech conveys information differently from images, so further guidance is
needed on how to look inside the "black box" of deep networks designed
specifically for processing voice. Another recurring question is how higher
accuracy can be achieved in vocal emotion identification. In particular, as
network structures become more advanced and intelligent, it is necessary to
investigate new topologies or learning strategies that can exploit broader
properties or train superior prediction models. Moreover, formulating ways to
combine the deep features learned by different networks is an interesting
perspective as well.

For deployment on live calls, real-time performance could additionally benefit
from efficient hardware such as neuromorphic chips or parallel processing
methods. Addressing dynamic, real-world conditions requires integrating
temporal information from the ongoing conversation into the model, and
embracing uncertainty modeling techniques could enhance decision-making in
ambiguous scenarios.


APPENDIX:

Program Listing/Code:

import numpy as np
import os
import librosa
from keras.models import Model, Sequential
from keras import optimizers
from keras.layers import Input, Conv1D, BatchNormalization, MaxPooling1D, LSTM, Dense, Activation
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.models import load_model
from keras_self_attention import SeqSelfAttention
from keras import initializers
from keras.utils import to_categorical

datapath = '/content/drive/My Drive/Emo-db/wav2'
classes = ['W', 'L', 'E', 'A', 'F', 'T', 'N']  # 7 emotion classes
seg_len = 16000              # signal split length (in samples) in time domain
seg_ov = int(seg_len * 0.5)  # 50% overlap

def normalize(s):
    # RMS normalization
    new_s = s / np.sqrt(np.sum(np.square((np.abs(s)))) / len(s))
    return new_s

def countclasses(fnames):
    # count how many files belong to each emotion class
    dict_counts = {cls: 0 for cls in classes}
    for name in fnames:
        if name[5] in classes:
            dict_counts[name[5]] += 1
    return dict_counts

def data1d(path):
    fnames = os.listdir(path)
    dict_counts = countclasses(fnames)
    print('Total Data', dict_counts)
    num_cl = len(classes)
    train_dict = {cls: 0 for cls in classes}
    test_dict = {cls: 0 for cls in classes}
    val_dict = {cls: 0 for cls in classes}
    # per-class 80/20 train-test split, with 20% of the training part used for validation
    for cname, cnum in dict_counts.items():
        t = round(0.8 * cnum)
        test_dict[cname] = int(cnum - t)
        val_dict[cname] = int(round(0.2 * t))
        train_dict[cname] = int(t - val_dict[cname])
        print('Class:', cname, 'train:', train_dict[cname], 'val:',
              val_dict[cname], 'test:', test_dict[cname])
    x_train, y_train, x_test, y_test, x_val, y_val = [], [], [], [], [], []
    count = {cls: 0 for cls in classes}
    for name in fnames:
        if name[5] in classes:
            sig, fs = librosa.load(os.path.join(path, name), sr=16000)
            # normalize signal
            data = normalize(sig)
            if len(data) < seg_len:
                # zero-pad short signals to the segment length
                pad_len = int(seg_len - len(data))
                pad_rem = int(pad_len % 2)
                pad_len /= 2.0  # Ensure floating-point division
                signal = np.pad(data, (int(pad_len), int(pad_len + pad_rem)),
                                'constant', constant_values=0)
            elif len(data) > seg_len:
                # split long signals into overlapping segments of seg_len samples
                signal = []
                end = seg_len
                st = 0
                while end < len(data):
                    signal.append(data[st:end])
                    st = st + seg_ov
                    end = st + seg_len
                signal = np.array(signal)
                if end >= len(data):
                    num_zeros = int(end - len(data))
                    if num_zeros > 0:
                        n1 = np.array(data[st:end])
                        n2 = np.zeros([num_zeros])
                        s = np.concatenate([n1, n2], 0)
                    else:
                        s = np.array(data[int(st):int(end)])
                    signal = np.vstack([signal, s])
            else:
                signal = data
            # route the segments into the train / validation / test partitions
            if count[name[5]] < train_dict[name[5]]:
                if signal.ndim > 1:
                    for i in range(signal.shape[0]):
                        x_train.append(signal[i])
                        y_train.append(name[5])
                else:
                    x_train.append(signal)
                    y_train.append(name[5])
            else:
                if (count[name[5]] - train_dict[name[5]]) < val_dict[name[5]]:
                    if signal.ndim > 1:
                        for i in range(signal.shape[0]):
                            x_val.append(signal[i])
                            y_val.append(name[5])
                    else:
                        x_val.append(signal)
                        y_val.append(name[5])
                else:
                    if signal.ndim > 1:
                        for i in range(signal.shape[0]):
                            x_test.append(signal[i])
                            y_test.append(name[5])
                    else:
                        x_test.append(signal)
                        y_test.append(name[5])
            count[name[5]] += 1
    return np.float32(x_train), y_train, np.float32(x_test), y_test, np.float32(x_val), y_val

def string2num(y):
    # map emotion letter codes to integer class indices
    y_map = {cls: idx for idx, cls in enumerate(classes)}
    y1 = [y_map[i] for i in y]
    return np.float32(np.array(y1))

def load_data():
    x_tr, y_tr, x_t, y_t, x_v, y_v = data1d(datapath)
    y_tr = string2num(y_tr)
    y_t = string2num(y_t)
    y_v = string2num(y_v)
    return x_tr, y_tr, x_t, y_t, x_v, y_v

def emo1d(input_shape, num_classes, args):
    model = Sequential(name='Emo1D')
    # LFLB1
    model.add(Conv1D(filters=64, kernel_size=3, strides=1, padding='same',
                     input_shape=input_shape,
                     kernel_initializer=initializers.GlorotNormal(seed=42)))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=4, strides=4))
    # LFLB2
    model.add(Conv1D(filters=64, kernel_size=3, strides=1, padding='same',
                     kernel_initializer=initializers.GlorotNormal(seed=42)))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=4, strides=4))
    # LFLB3
    model.add(Conv1D(filters=128, kernel_size=3, strides=1, padding='same',
                     kernel_initializer=initializers.GlorotNormal(seed=42)))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=4, strides=4))
    # LFLB4
    model.add(Conv1D(filters=128, kernel_size=3, strides=1, padding='same',
                     kernel_initializer=initializers.GlorotNormal(seed=42)))
    model.add(BatchNormalization())
    model.add(MaxPooling1D(pool_size=4, strides=4))
    # LSTM layers with self-attention in between
    model.add(LSTM(units=args.num_fc, return_sequences=True))
    model.add(SeqSelfAttention(attention_activation='tanh'))
    model.add(LSTM(units=args.num_fc, return_sequences=False))
    # Fully connected output layer
    model.add(Dense(units=num_classes, activation='softmax'))
    # Model compilation
    opt = optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

def train(model, x_tr, y_tr, x_val, y_val, args):
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=40)
    mc = ModelCheckpoint('test_model.h5', monitor='val_accuracy', mode='max',
                         verbose=1, save_best_only=True)
    history = model.fit(x_tr, y_tr, epochs=args.num_epochs, batch_size=args.batch_size,
                        validation_data=(x_val, y_val), callbacks=[es, mc])
    # return the history as well so the training curves can be plotted afterwards
    return model, history

def test(model, x_t, y_t):
    saved_model = load_model('test_model.h5',
                             custom_objects={'SeqSelfAttention': SeqSelfAttention})
    score = saved_model.evaluate(x_t, y_t, batch_size=20)
    print(score)
    return score

def loadData():
    x_tr, y_tr, x_t, y_t, x_val, y_val = load_data()
    # reshape to (samples, seg_len, 1) and one-hot encode the labels
    x_tr = x_tr.reshape(-1, x_tr.shape[1], 1)
    x_t = x_t.reshape(-1, x_t.shape[1], 1)
    x_val = x_val.reshape(-1, x_val.shape[1], 1)
    y_tr = to_categorical(y_tr)
    y_t = to_categorical(y_t)
    y_val = to_categorical(y_val)
    return x_tr, y_tr, x_t, y_t, x_val, y_val

if __name__ == "__main__":
    class Args:
        num_fc = 128
        batch_size = 32
        num_epochs = 50
        learning_rate = 5e-5

    args = Args()
    x_tr, y_tr, x_t, y_t, x_val, y_val = loadData()
    model = emo1d(input_shape=x_tr.shape[1:],
                  num_classes=len(np.unique(np.argmax(y_tr, axis=1))), args=args)
    model, history = train(model, x_tr, y_tr, x_val, y_val, args=args)
    score = test(model, x_t, y_t)

    import matplotlib.pyplot as plt

    # Training and validation accuracy recorded by Keras during model.fit
    accuracy = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    # Epochs (x-axis)
    epochs = range(1, len(accuracy) + 1)

    # Plotting
    plt.figure(figsize=(10, 6))
    plt.plot(epochs, accuracy, label='Accuracy', marker='o')
    plt.plot(epochs, val_acc, label='Validation Accuracy', marker='s')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.title('Training and Validation Accuracy Over Epochs')
    plt.legend()
    plt.grid(True)
    plt.show()

    print(score)
    print('Final accuracy of the model :', score[1] * 100)

    # Predicting the emotion of a single audio file
    saved_model = load_model('test_model.h5',  # checkpoint saved by ModelCheckpoint above
                             custom_objects={'SeqSelfAttention': SeqSelfAttention})

    # Load the input audio file
    input_file_path = '/content/drive/MyDrive/Emo-db/wav/03a01Fa.wav'
    signal, sr = librosa.load(input_file_path, sr=16000)

    # Normalize the signal
    data = normalize(signal)

    # Pad so the signal splits into whole seg_len-sample segments for the network
    pad = (-len(data)) % seg_len
    data = np.pad(data, (0, pad), 'constant')
    x_input = data.reshape(-1, seg_len, 1)

    predictions = saved_model.predict(x_input)

    emotions = ['W', 'L', 'E', 'A', 'F', 'T', 'N']
    emotion_mapping = {
        'W': 'Anger',
        'L': 'Boredom',
        'E': 'Disgust',
        'A': 'Anxiety',
        'F': 'Happiness',
        'T': 'Sadness',
        'N': 'Neutral'
    }  # EmoDB letter codes mapped to emotion names

    predicted_emotions = [emotions[np.argmax(pred)] for pred in predictions]
    predicted_emotions_names = [emotion_mapping[emotion] for emotion in predicted_emotions]

    # Display the predicted emotion for each segment
    for i, emotion_name in enumerate(predicted_emotions_names):
        print(f"Segment {i+1}: Predicted Emotion - {emotion_name}")

SCREENSHOTS

Fig.1. Importing Required Packages

Fig.2. Loading Dataset

Fig.3. Splitting the dataset

Fig.4. Initializing Model Parameters

Fig.5. CNN+ LSTM layers

Fig.6. Visualizing Model performance

LIST OF FIGURES:

Figure No. Figure Name. Page No.

Fig 3.1 System Architecture 13

Fig.4.2.1.1 Use Case Diagram 18

Fig. 4.2.2.1 Activity Diagram 19

Fig. 4.2.3.1 Sequence Diagram 20

Fig. 4.2.5.1 Class Diagram 21

Fig.5.1 System Architecture 22

Fig.5.2 Sample waveforms 24

Fig.5.3 Data preprocessing 25

Fig.5.4 CNN Architecture 26

Fig.5.5 LSTM Architecture 27

Fig. 7.2 Training and validation accuracy 36

Fig. 7.3 Recognizing emotion of input audio 37

LIST OF ABBREVIATIONS/NOMENCLATURE

1. CNN: Convolutional Neural Networks

2. LSTM: Long Short Term Memory

LIST OF TABLES

Table No.    Title                                           Page No.

7.1          Comparison of accuracy with existing models     37

REFERENCES

[1] A. Graves, N. Jaitly, A.R. Mohamed, Hybrid speech recognition with Deep
Bidirectional LSTM, Autom. Speech Recognit. Underst. (2013) 273–278.
[2] A.B. Kandali, A. Routray, T.K. Basu, Emotion recognition from Assamese
speeches using MFCC features and GMM classifier, in: TENCON 2008 - 2008
IEEE Region 10 Conference IEEE, 2008, pp. 1–5.
[3] A. Milton, S.S. Roy, S.T. Selvi, SVM scheme for speech emotion recognition
using MFCC feature, Int. J. Comput. Appl. 69 (9) (2013) 34–39.
[4] W.Q. Zheng, J.S. Yu, Y.X. Zou, An experimental study of speech emotion
recognition based on deep convolutional neural networks, in: International
Conference on Affective Computing and Intelligent Interaction IEEE, 2015, pp.
827–831.
[5] S. Demircan, H. Kahramanl, Feature extraction from speech data for
emotion recognition, J. Adv. Comput. Netw. 2 (1) (2014) 28–30
[6] N.J. Nalini, S. Palanivel, M. Balasubramanian, Speech emotion recognition
using residual phase and MFCC features, Int. J. Eng. Technol. 5 (6) (2013)
4515–4527
[7] F. Chenchah, Z. Lachiri, Acoustic emotion recognition using linear and
nonlinear cepstral coefficients, Int. J. Adv. Comput. Sci. Appl. 6 (11) (2015).
[8] N.J. Nalini, S. Palanivel, Music emotion recognition: the combined evidence
of MFCC and residual phase, Egypt. Inf. J. 17 (1) (2015) 1–10.
[9] D. Le, E.M. Provost, Emotion recognition from spontaneous speech using
hidden markov models with deep belief networks, in: Automatic Speech
Recognition and Understanding (ASRU) IEEE, 2013, pp. 216–221.
[10] V.B. Waghmare, R.R. Deshmukh, P.P. Shrishrimal, G.B. Janvale, Emotion
recognition system from artificial marathi speech using MFCC and LDA
techniques, in: International Conference on Advances in Communication,
Network, and Computing, 2014
[11] L. Chen, X. Mao, H. Yan, Text-independent phoneme segmentation
combining EGG and speech data, IEEE/ACM Trans. Audio Speech Lang.
Process. 24 (6) (2016) 1029–1037.

[12] Y. Kim, H. Lee, E.M. Provost, Deep learning for robust feature generation
in audiovisual emotion recognition, Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on 32 (2013) 3687–3691.
[13] Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech Emotion Recognition Using
CNN, ACM Multimedia, 2014, pp. 801–804.
[14] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, B. Schuller, Deep
neural networks for acoustic emotion recognition: raising the benchmarks, in:
IEEE International Conference on Acoustics, Speech, and Signal Processing
IEEE, 2011, pp. 5688–5691
[15] J. Loughrey, P. Cunningham, Using Early Stopping to Reduce Overfitting
in Wrapper-Based Feature Weighting, Trinity College Dublin Department of
Computer Science, 2005, TCD-CS-2005-41, pp. 12.
[16] Q. Mao, M. Dong, Z. Huang, Y. Zhan, Learning salient features for speech
emotion recognition using convolutional neural networks, IEEE Trans.
Multimedia 16 (8) (2014) 2203–2213
[17] Y. Huang, A. Wu, G. Zhang, Y. Li, Extraction of adaptive wavelet packet
filter-bank-based acoustic feature for speech emotion recognition, IET Signal
Process. 9 (4) (2015) 341–348
[18] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network
training by reducing internal covariate shift, in: International Conference on
Machine Learning, 2015, pp. 448–456.
[19] E.M. Schmidt, J.J. Scott, Y.E. Kim, Feature learning in dynamic
environments: modeling the acoustic structure of musical emotion, in:
International Symposium/Conference on Music Information Retrieval, 2012,
pp. 325–330.
[20] F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, A
database of German emotional speech, INTERSPEECH 2005 - Eurospeech

