"EMOTION RECOGNITION USING DEEP LEARNING NETWORKS ON LIVE CALLS"
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
BY
VISION
The Department of Computer Science and Engineering is established to
provide undergraduate and graduate education in the field of Computer
Science and Engineering to students with diverse backgrounds in the
foundations of software and hardware, through a broad curriculum strongly
focused on developing advanced knowledge to become future leaders.
MISSION
• Create knowledge of advanced concepts and innovative technologies, and
develop research aptitude for contributing to the needs of industry and society.
• Develop professional and soft skills for improved knowledge and
employability of students.
• Encourage students to engage in life-long learning to create awareness of
contemporary developments in computer science and engineering and to
become outstanding professionals.
• Develop an attitude for ethical and social responsibilities in professional
practice at regional, national and international levels.
Program Educational Objectives (PEO’s)
Program Specific Outcomes (PSO’s)
On successful completion of the Program, the graduates of B.Tech. (CSE)
program will be able to:
1. Demonstrate knowledge in Data structures and Algorithms, Operating
Systems, Database Systems, Software Engineering, Programming
Languages, Digital systems, Theoretical Computer Science, and
Computer Networks. (PO1)
Program Outcomes (PO’s)
1. Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering
problems (Engineering knowledge).
8. Apply ethical principles and commit to professional ethics and
responsibilities and norms of the engineering practice (Ethics).
12. Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological
change (Life-long learning).
Course Outcomes
CO8. Ability to apply ethics and norms of the engineering practice as applied
in the project work. (PO8)
CO10. Ability to present views cogently and precisely on the project work.
(PO10)
CO-PO Mapping
PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO11 PSO12 PSO13 PSO14
CO1 3 3
CO2 3 3
CO3 3 3
CO4 3 3
CO5 3 3
CO6 3
CO7 3
CO8 3
CO9 3
CO10 3
CO11 3
CO12 3
DECLARATION
not been submitted to any other course or University for the award of any
degree by us.
1. REDDYGARI SIRISHA
2. PUVVADI MONICA
3. P LAKSHMI PRASANNA
SREE VIDYANIKETHAN ENGINEERING COLLEGE
(Affiliated to Jawaharlal Nehru Technological University Anantapur)
Sree Sainath Nagar, A. Rangampet, Tirupati – 517 102, Chittoor Dist., A.P.
CERTIFICATE
This work has been carried out under my guidance and supervision.
The results embodied in this Project report have not been submitted in any University or
Organization for the award of any degree or diploma.
ACKNOWLEDGEMENT
We are also thankful to all the faculty members of CSE Department, who
have cooperated in carrying out our project. We would like to thank our
parents and friends who have extended their help and encouragement either
directly or indirectly in completion of our project work.
ABSTRACT
In the realm of virtual assistants, mental health assessments, and customer
service, incorporating emotion detection in speech is essential for effective
communication between humans and machines. A variety of distinct speech
characteristics can be leveraged to extract valuable insights from audio samples.
Our objective is to develop an emotion identification system that utilizes these
detected attributes within the audio samples.
In the field of machine learning, there exist traditional models such as Support
Vector Machines (SVM), K-Nearest Neighbors (KNN), and Random Forest (RF)
classifiers that can be applied to Speech Emotion Recognition (SER) systems.
Our proposed model consists of a 1D CNN-LSTM architecture, with four Local
Feature Learning Blocks (LFLBs) followed by a primary LSTM layer. Each LFLB
comprises a single convolutional layer and a single max-pooling layer, which are
effective at capturing inherent correlations and constructing hierarchical
relationships. The LSTM layer then learns the long-term dependencies among
these local features.
The combination of Convolutional Neural Network (CNN) and Long Short-Term
Memory (LSTM) has the potential to surpass the constraints of individual
networks. This study evaluates their performance using the Berlin EmoDB
dataset. Our proposed method outperforms current conventional models, setting
a new benchmark for accuracy and effectiveness in Speech Emotion Recognition
(SER) systems.
TABLE OF CONTENTS:
1. Introduction 1-6
1.1 Statement of the Problem
1.2 Objectives
1.3 Scope
1.4 Applications
1.5 Limitations
3. Analysis 10 - 16
3.1 System Requirements
3.1.1 Hardware Specifications
3.1.2 Software Specifications
3.2 Existing System
3.3 Proposed System
3.4 Requirement Analysis
3.4.1 Functional Requirements
3.4.2 Non-Functional Requirements
4. Design 17 - 21
9. Appendix 40 - 57
Program Listing/Code
Screenshots
List of Figures
List of Abbreviations
List of Tables
References
CHAPTER-1
INTRODUCTION
1.1 STATEMENT OF THE PROBLEM
The solution should analyse the voice of the caller on live, ongoing calls being
attended in the Emergency Response System. After analysing the voice of the
caller, the solution should predict the emotional/mental condition of the caller.
In particular, it should predict or suggest whether the caller's voice is:
• Stressful voice
• Drunk voice
• Prank voice
• Abusive voice
• Painful voice
• or any other mental condition
Objective: To accomplish this goal, the system will be trained on a sizable
dataset of audio recordings of human speech in various emotional states. The
system will employ deep neural networks and other machine learning methods
to extract pertinent information from the audio input and categorize the
speaker's emotions. To give clients a more effective and tailored experience,
the suggested system would be integrated into existing communication
systems such as contact centres and customer support centres.
1.2 OBJECTIVES
2. Integrate the technology into a call centre setting to give CSRs knowledge
about the customer's emotional condition during a live conversation.
3. Train the system on a sizable dataset of labelled speech signals using deep
learning methods.
5. Improve the system's precision, speed, and resistance to noise and other
environmental elements that are frequently present in a contact centre
setting.
6. Create a user-friendly user interface for the system that gives the customer
service representative (CSR) real-time feedback, including a description of
the client's emotional state and suggested answers based on that state.
1.3 SCOPE
boosts user satisfaction and affinity. This application spans a myriad of
industries, such as virtual assistants, entertainment apps and interactive
platforms.
Training and Education:
SER may help in evaluating engagement and how well topics are understood
during the live discussions that occur in remote learning within educational
institutions. Teachers can examine the emotional responses of students during
online classes to learn how they are progressing or where they may need more
focus. This in turn enables facilitators to adapt their strategy to the needs of
each student.
1.4 APPLICATIONS:
Speech emotion recognition using 1D CNN and LSTM networks finds diverse
applications across various domains. In healthcare, it aids in detecting
emotional cues for mental health assessment and speech therapy evaluation.
In customer service, it enhances sentiment analysis for understanding
customer emotions and improving service quality. In human-computer
interaction, it enables more personalized user experiences by recognizing
emotional states during interactions with devices. In education, it assists in
assessing student engagement and emotional responses in online learning
environments. In entertainment, it enhances the immersive experience by
adapting content based on user emotions in gaming and virtual reality
applications. Additionally, in security and law enforcement, it can be utilized
for emotion-based lie detection and monitoring emotional states during
critical interactions. Overall, speech emotion recognition with 1D CNN and
LSTM networks has broad implications for enhancing communication,
engagement, and understanding in various human-machine interaction
scenarios.
1.5 LIMITATIONS:
CNN and LSTM models, especially when combined, can create highly
complex and opaque systems. The deep layers and sequential processing
make it challenging to interpret how the model arrives at its predictions. This
lack of interpretability and explainability is a significant limitation, especially
in applications where understanding the reasoning behind the system's
decisions is crucial, such as in healthcare or legal contexts. It can also limit
the user's trust and acceptance of the system, impacting its usability and
adoption.
CHAPTER – 2
LITERATURE SURVEY
The research review highlights several key insights regarding the
advancements and challenges in speech emotion recognition (SER) using
multi-modal and deep learning approaches. Firstly, the integration of multi-
modal data, such as EEG signals and speech samples, proves to be more
effective for emotion recognition than single modalities. This multi-modal
approach not only enhances the accuracy of emotion recognition but also
opens avenues for diverse applications in affective computing and mental
health assessment.
CHAPTER - 3
ANALYSIS
The CNN-LSTM network is constructed by coupling four LFLBs, one LSTM
layer, and one fully connected layer. To distinguish identical building blocks
or layers, we use the following naming conventions: 1) the number preceding
the name indicates which network the building block or layer belongs to;
2) the number following the name indicates its position among identical
building blocks or layers. The complete diagram of the CNN-LSTM network
is shown in the architecture section.
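For illustration, a minimal Keras sketch of such a stack is given below. The filter counts, kernel sizes, activation and input length are illustrative assumptions, not the exact values used in this project.

from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, BatchNormalization, Activation,
                                     MaxPooling1D, LSTM, Dense)

def build_cnn_lstm(input_length=48000, num_classes=7):
    # Four LFLBs (one convolution + one max-pooling each), then LSTM + softmax.
    model = Sequential(name='cnn_lstm_sketch')
    for i, filters in enumerate([64, 64, 128, 128]):   # assumed filter counts
        if i == 0:
            model.add(Conv1D(filters, kernel_size=3, padding='same',
                             input_shape=(input_length, 1)))
        else:
            model.add(Conv1D(filters, kernel_size=3, padding='same'))
        model.add(BatchNormalization())
        model.add(Activation('elu'))
        model.add(MaxPooling1D(pool_size=4, strides=4))
    # the LSTM layer models long-term dependencies over the learned local features
    model.add(LSTM(128))
    # fully connected softmax layer outputs the emotion probabilities
    model.add(Dense(num_classes, activation='softmax'))
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model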
Early stopping proved critical to keeping the networks from overfitting the
training data and to improving performance on out-of-sample data. By
attending to the choice of the monitored quantity and by applying an
appropriate patience value, the study outcomes clearly revealed how delicate
this balance is. The trials show that early stopping is a valuable practice that
makes the networks learn more general features and display enhanced
forecasting capability.
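As an illustration, early stopping and best-model checkpointing can be wired up with Keras callbacks as sketched below, in line with the ModelCheckpoint call in the appendix; the monitored quantity and the patience value here are assumptions.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# stop when validation accuracy has not improved for `patience` epochs,
# and keep the best weights seen so far
es = EarlyStopping(monitor='val_accuracy', mode='max',
                   patience=10, restore_best_weights=True, verbose=1)
mc = ModelCheckpoint('best_model.h5', monitor='val_accuracy',
                     mode='max', save_best_only=True, verbose=1)

# usage (model and data prepared as elsewhere in this report):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, batch_size=32, callbacks=[es, mc])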
Work Flow of Proposed Model:
6. Real-time Recognition:
Perform emotion recognition in real time, allowing users to
analyze audio recordings.
7. User Interface:
Provide a user-friendly interface for uploading, viewing and
analyzing audio recordings.
Include features for displaying recognized emotions and their
characteristics.
8. Integration:
Seamlessly integrate the system with other telecommunication
systems or communication devices.
Ensure compatibility with standard audio formats and protocols.
9. Model Evaluation:
Evaluate the performance of the SER model using metrics such as
sensitivity, specificity and accuracy.
Validate the model's effectiveness through testing with diverse
audio datasets.
10. Output:
Provide detailed reports on detected emotions, including their
characteristics.
Generate visualizations to aid in the analysis of audio samples.
Usability: The user interface should be intuitive and easy to navigate, even
for users with limited technical expertise.
Security: The system will be designed with security in mind, with user
authentication and access control measures.
CHAPTER - 4
DESIGN
The most creative and challenging part of system development is System
Design. The Design of a system can be defined as a process of applying
various techniques and principles for the purpose of defining a device,
architecture, modules, interfaces and data for the system to satisfy specified
requirements. For the creation of a new system, the system design is the
solution to the "how to" question.
An activity diagram shows the overall flow of control. The most important
shape types are:
• Rounded rectangles represent activities.
• Diamonds represent decisions.
• Bars represent the start or end of concurrent activities.
• A black circle represents the start of the workflow.
• An encircled circle represents the end of the workflow.
CHAPTER-5
IMPLEMENTATION
The combination of Convolutional Neural Network (CNN) and Long Short-
Term Memory (LSTM) has the potential to surpass the constraints of individual
networks. This study evaluates their performance using the Berlin EmoDB
dataset. Our proposed method outperforms current conventional models,
setting a new benchmark for accuracy and effectiveness in Speech Emotion
Recognition (SER) systems.
The Berlin Emotional Database (EmoDB) is a widely used dataset in the field
of speech emotion recognition (SER). It comprises audio recordings of actors
speaking scripted sentences to convey various emotions. The dataset
contains a total of 535 utterances, each tagged with an emotional label. These
labels include basic emotions such as happiness, sadness, anger, and fear,
among others, providing a diverse set of emotional expressions for analysis.
The audio data in EmoDB is stored as WAV files, which record the audio
signal's amplitude over time. Each file is sampled at 16 kHz and is of short
duration, typically a few seconds. This format allows researchers to analyze
the acoustic features of speech, including pitch, intensity, and spectral
characteristics, to extract meaningful features for emotion recognition tasks.
Overall, EmoDB serves as a valuable resource for studying emotional speech
and developing SER models due to its manageable size, diversity of emotions,
and well-defined labeling of samples.
Number of classes: 7
Audio clips of the following classes are collected: happiness, sadness, anger,
fear (anxiety), disgust, boredom and neutral.
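In EmoDB the emotion is encoded in the sixth character of each file name, which is how the appendix code reads labels (name[5]) and maps the German letter codes to emotion names. A small sketch of that convention, reusing the mapping and the example file name from the appendix:

emotion_mapping = {
    'W': 'Anger', 'L': 'Boredom', 'E': 'Disgust', 'A': 'Anxiety',
    'F': 'Happiness', 'T': 'Sadness', 'N': 'Neutral'
}

def label_from_filename(fname):
    # the 6th character of an EmoDB file name is the emotion code
    return emotion_mapping.get(fname[5], 'Unknown')

print(label_from_filename('03a01Fa.wav'))   # -> Happiness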
5.3 Preprocessing:
From the resulting speech recordings, the relevant information must be
extracted. These features can be acoustic, prosodic or linguistic, depending
on the type of emotion being detected. Examples of acoustic characteristics
are pitch, energy levels and frequency content. Prosodic features include
speaking duration, loudness and tempo. Linguistic features include the actual
words used, in particular those expressing emotion and motivation.
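A small sketch of how such acoustic features could be extracted with librosa is shown below; the synthetic signal, sampling rate and parameter values are placeholders rather than project settings.

import numpy as np
import librosa

sr = 16000
# placeholder signal: one second of a 220 Hz tone; in practice load a recording,
# e.g. signal, sr = librosa.load('path/to/call.wav', sr=16000)
signal = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32)

# pitch (fundamental frequency) estimate per frame
f0, voiced_flag, voiced_prob = librosa.pyin(signal, fmin=80, fmax=500, sr=sr)
# short-time energy (RMS) per frame
energy = librosa.feature.rms(y=signal)[0]
# 13 MFCCs per frame, a common spectral representation in SER
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

print(np.nanmean(f0), energy.mean(), mfcc.shape)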
DATA CLEANSING:
The utterances should form a diverse set that covers different voices,
emotions, accents and environmental conditions; to achieve this, audio files
are collected from real calls. The data is then preprocessed so that artifacts,
unnecessary background noise and unwanted segments are removed, since
these can degrade the precision of the emotion detector.
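A minimal sketch of such cleanup, assuming librosa for silence trimming; the trim threshold is an assumption, and the RMS scaling mirrors the normalize() helper in the appendix.

import numpy as np
import librosa

def clean_audio(path, sr=16000, top_db=30):
    data, _ = librosa.load(path, sr=sr)
    # drop leading/trailing silence quieter than top_db below the peak
    data, _ = librosa.effects.trim(data, top_db=top_db)
    # RMS normalization: scale to unit root-mean-square energy
    rms = np.sqrt(np.mean(data ** 2))
    return data / (rms + 1e-8)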
LSTMs offer an excellent edge in an SER scenario on real conversations
because they have the innate ability to track how speech patterns change
from one time point to the next. The emotions expressed in an utterance are
multi-dimensional and shift over time. LSTMs are more successful than plain
RNNs at capturing these modest temporal changes because they preserve
significant information over an extended timeline. This competence allows
the model to recognize subtle changes in emotional expression, a crucial
ability for capturing the speaker's constantly changing feelings at any
moment. It also lets the LSTM layers take advantage of segments preceding
the current segment, which helps in understanding the speaker's emotional
journey across the call with more precision.
Speech patterns in live interactions are diverse, and this variability makes
emotion recognition harder. A fundamental value of the LSTM is its capacity
to adapt to such discrepancies. The memory cells in LSTMs allow the model
to pick out important emotional signals carried by changes in tone, speed,
or silences and pauses during speech. This provides a strong capability for
detecting emotions across the wide range of speech patterns that appear in
authentic, ongoing conversations.
Additionally, LSTMs process the live speech input as it is being spoken,
monitoring the emotional state of the speaker throughout the call and
providing near-instant feedback. The model can thus retain the context and
take the conversation history into account, which increases its ability to
sense emotions in free-flowing communication. In such a setting, LSTMs are
well suited to interpreting the feelings conveyed in live conversation with
improved accuracy and precision.
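A sketch of how a live call could be analysed segment by segment is given below; the trained model, the segment length and the class ordering are assumptions made for illustration only.

import numpy as np

SEG_LEN = 48000                                       # assumed: 3-second segments at 16 kHz
CLASS_CODES = ['W', 'L', 'E', 'A', 'F', 'T', 'N']     # assumed class order

def predict_stream(chunks, model):
    # chunks: iterable of 1-D numpy arrays arriving from the live call
    buffer = np.zeros(0, dtype=np.float32)
    for chunk in chunks:
        buffer = np.concatenate([buffer, chunk.astype(np.float32)])
        while len(buffer) >= SEG_LEN:
            segment, buffer = buffer[:SEG_LEN], buffer[SEG_LEN:]
            x = segment.reshape(1, SEG_LEN, 1)         # (batch, time, channel)
            probs = model.predict(x, verbose=0)[0]
            yield CLASS_CODES[int(np.argmax(probs))]   # emotion code per segment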
CHAPTER-6
EXECUTION PROCEDURE AND TESTING
6.2 Testing:
Testing is an important part of model development and involves evaluating
the performance of the trained models on a previously unseen dataset. This
is done to ensure that the models have not overfit the training data and
can generalize well to new data. The testing dataset is used to evaluate the
performance of the models by comparing the predicted values to the actual
values. Metrics such as accuracy, precision, recall, F1-score, and others can
be used to evaluate the performance of the models. The testing dataset
should be representative of the population that the model will be used on,
and should not be used for training the model. It is important to repeat the
testing process several times using different subsets of the data to ensure
that the results are consistent and reliable.
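A minimal sketch of such an evaluation on a held-out split, using scikit-learn; the labels here are random placeholders, whereas in practice y_true and y_pred come from the one-hot test labels and from model.predict on the test set prepared as in the appendix.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

rng = np.random.default_rng(0)
y_true = rng.integers(0, 7, size=100)    # placeholder ground-truth class indices
y_pred = rng.integers(0, 7, size=100)    # placeholder predicted class indices

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3, zero_division=0))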
Testing also verifies that the software system meets its requirements and user
expectations and does not fail in an unacceptable manner. There are various
types of tests, and each test type addresses a specific testing requirement.
Unit testing
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It
is the testing of individual software units of the application. It is done after
the completion of an individual unit, before integration. This is structural
testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique
path of a business process performs accurately to the documented
specifications and contains clearly defined inputs and expected results.
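As an example of such component-level testing, a small pytest-style unit test for the RMS normalization helper (reproduced here from the appendix) might look as follows.

import numpy as np

def normalize(s):
    # RMS normalization, as in the appendix helper
    rms = np.sqrt(np.mean(s ** 2))
    return s / (rms + 1e-8)

def test_normalize_gives_unit_rms():
    signal = np.random.randn(16000).astype(np.float32)
    out = normalize(signal)
    # after normalization the RMS energy should be approximately 1
    assert np.isclose(np.sqrt(np.mean(out ** 2)), 1.0, atol=1e-3)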
Integration testing
Acceptance Testing
Functional testing
CHAPTER-7
RESULTS AND PERFORMANCE EVALUATION
True Positive (TP) refers to a positive instance that is correctly classified as positive.
True Negative (TN) refers to a negative instance that is correctly classified as negative.
False Positive (FP) refers to a negative instance that is incorrectly classified as positive.
False Negative (FN) refers to a positive instance that is incorrectly classified as negative.
Accuracy Obtained:
The accuracy of the classification shows the likelihood that the predictions
are correct. It is calculated from the confusion matrix as:
Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100
Recall:
Recall plays a significant role in evaluating the model's performance. It is the
proportion of actual positive instances that are correctly retrieved.
Recall = TP / (TP + FN) × 100
Precision:
Precision is a key factor in the model performance evaluation. It is the
proportion of correctly predicted positive instances among all instances
predicted as positive.
Precision = TP / (TP + FP) × 100
F-Measure:
It also goes by the name F-score. The F-measure combines precision and
recall to gauge test accuracy:
F-Measure = 2 × (Precision × Recall) / (Precision + Recall)
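Expressed directly in code, with illustrative confusion-matrix counts (not project results):

tp, tn, fp, fn = 42, 35, 8, 5            # example counts only

accuracy  = (tp + tn) / (tp + tn + fp + fn) * 100
recall    = tp / (tp + fn) * 100
precision = tp / (tp + fp) * 100
f_measure = 2 * precision * recall / (precision + recall)

print(f"Accuracy={accuracy:.1f}%  Recall={recall:.1f}%  "
      f"Precision={precision:.1f}%  F-Measure={f_measure:.1f}%")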
7.3 RESULTS
CHAPTER-8
CONCLUSION AND FUTURE WORKS
It remains difficult to understand what is inside the “black box” of deep
networks used for processing voice. An open question is whether increased
accuracy in vocal emotion identification is possible. In particular, as network
structures become more advanced and intelligent, it is necessary to look into
new topologies or learning strategies that can take advantage of broader
properties or train superior prediction models. Moreover, formulating ways to
unite several deep characteristics learned by different networks is an
interesting perspective as well.
APPENDIX:
Program Listing/Code:
import os
import numpy as np
import librosa
def normalize(s):
    # RMS normalization: scale the signal to unit root-mean-square energy
    rms = np.sqrt(np.mean(s ** 2))
    return s / (rms + 1e-8)
def countclasses(fnames):
    # count the number of files per emotion class; the 6th character of an
    # EmoDB file name is the emotion code
    dict_counts = {c: 0 for c in classes}
    for name in fnames:
        if name[5] in classes:
            dict_counts[name[5]] += 1
    return dict_counts
def data1d(path):
    fnames = os.listdir(path)
    dict_counts = countclasses(fnames)
    num_cl = len(classes)
    # split each class 80/20 into training and test counts
    train_dict, test_dict = {}, {}
    for cname, cnum in dict_counts.items():
        t = round(0.8 * cnum)
        train_dict[cname] = t
        test_dict[cname] = int(cnum - t)
x_train, y_train, x_test, y_test, x_val, y_val = [], [], [], [], [], []
if name[5] in classes:
# normalize signal
data = normalize(sig)
pad_rem = int(pad_len % 2)
signal = []
end = seg_len
st = 0
signal.append(data[st:end])
st = st + seg_ov
end = st + seg_len
signal = np.array(signal)
if num_zeros > 0:
n1 = np.array(data[st:end])
n2 = np.zeros([num_zeros])
s = np.concatenate([n1, n2], 0)
else:
s = np.array(data[int(st):int(end)])
else:
signal = data
if signal.ndim > 1:
for i in range(signal.shape[0]):
x_train.append(signal[i])
y_train.append(name[5])
else:
x_train.append(signal)
y_train.append(name[5])
else:
if signal.ndim > 1:
for i in range(signal.shape[0]):
x_val.append(signal[i])
y_val.append(name[5])
else:
x_val.append(signal)
y_val.append(name[5])
else:
if signal.ndim > 1:
for i in range(signal.shape[0]):
x_test.append(signal[i])
y_test.append(name[5])
else:
x_test.append(signal)
y_test.append(name[5])
count[name[5]] += 1
def string2num(y):
    # convert emotion-code characters to numeric class indices using y_map
    y1 = [y_map[i] for i in y]
    return np.float32(np.array(y1))
def load_data():
y_tr = string2num(y_tr)
y_t = string2num(y_t)
y_v = string2num(y_v)
model = Sequential(name='Emo1D')
# LFLB1
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=4, strides=4))
# LFLB2
kernel_initializer=initializers.GlorotNormal(seed=42)))
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=4, strides=4))
# LFLB3
kernel_initializer=initializers.GlorotNormal(seed=42)))
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=4, strides=4))
# LFLB4
kernel_initializer=initializers.GlorotNormal(seed=42)))
model.add(BatchNormalization())
model.add(MaxPooling1D(pool_size=4, strides=4))
# LSTM layer
model.add(LSTM(units=args.num_fc, return_sequences=True))
model.add(SeqSelfAttention(attention_activation='tanh'))
model.add(LSTM(units=args.num_fc, return_sequences=False))
# Fully connected layer
model.add(Dense(units=num_classes, activation='softmax'))
# Model compilation
opt = optimizers.Adam(learning_rate=args.learning_rate)
model.compile(optimizer=opt, loss='categorical_crossentropy',
metrics=['accuracy'])
return model
mc = ModelCheckpoint('test_model.h5', monitor='val_accuracy',
mode='max', verbose=1,
save_best_only=True)
callbacks=[es, mc])
return model
saved_model = load_model('test_model.h5',
custom_objects={'SeqSelfAttention': SeqSelfAttention})
print(score)
return score
def loadData():
x_tr = x_tr.reshape(-1, x_tr.shape[1], 1)
y_tr = to_categorical(y_tr)
y_t = to_categorical(y_t)
y_val = to_categorical(y_val)
if __name__ == "__main__":
    class Args:
        num_fc = 128
        batch_size = 32
        num_epochs = 50
        learning_rate = 5e-5

    args = Args()
    model = emo1d(input_shape=x_tr.shape[1:],
                  num_classes=len(np.unique(np.argmax(y_tr, axis=1))),
                  args=args)
# Epochs (x-axis)
# Plotting
plt.figure(figsize=(10, 6))
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
print(score)
#predicting
saved_model = load_model('best_model.h5',
custom_objects={'SeqSelfAttention': SeqSelfAttention})
input_file_path = '/content/drive/MyDrive/Emo-db/wav/03a01Fa.wav'
data = normalize(signal)
predictions = saved_model.predict(x_input)
emotion_mapping = {
    'W': 'Anger',
    'L': 'Boredom',
    'E': 'Disgust',
    'A': 'Anxiety',
    'F': 'Happiness',
    'T': 'Sadness',
    'N': 'Neutral'
}
SCREENSHOTS
Fig.4. Initializing Model Parameters
Fig.5. CNN+ LSTM layers
Fig.6. Visualizing Model performance
LIST OF FIGURES:
LIST OF ABBREVIATIONS/NOMENCLATURE
LIST OF TABLES
REFERENCES
[1] A. Graves, N. Jaitly, A.R. Mohamed, Hybrid speech recognition with Deep
Bidirectional LSTM, Autom. Speech Recognit. Underst. (2013) 273–278.
[2] A.B. Kandali, A. Routray, T.K. Basu, Emotion recognition from Assamese
speeches using MFCC features and GMM classifier, in: TENCON 2008 - 2008
IEEE Region 10 Conference IEEE, 2008, pp. 1–5.
[3] A. Milton, S.S. Roy, S.T. Selvi, SVM scheme for speech emotion recognition
using MFCC feature, Int. J. Comput. Appl. 69 (9) (2013) 34–39.
[4] W.Q. Zheng, J.S. Yu, Y.X. Zou, An experimental study of speech emotion
recognition based on deep convolutional neural networks, in: International
Conference on Affective Computing and Intelligent Interaction IEEE, 2015, pp.
827–831.
[5] S. Demircan, H. Kahramanl, Feature extraction from speech data for
emotion recognition, J. Adv. Comput. Netw. 2 (1) (2014) 28–30.
[6] N.J. Nalini, S. Palanivel, M. Balasubramanian, Speech emotion recognition
using residual phase and MFCC features, Int. J. Eng. Technol. 5 (6) (2013)
4515–4527.
[7] F. Chenchah, Z. Lachiri, Acoustic emotion recognition using linear and
nonlinear cepstral coefficients, Int. J. Adv. Comput. Sci. Appl. 6 (11) (2015).
[8] N.J. Nalini, S. Palanivel, Music emotion recognition: the combined evidence
of MFCC and residual phase, Egypt. Inf. J. 17 (1) (2015) 1–10.
[9] D. Le, E.M. Provost, Emotion recognition from spontaneous speech using
hidden markov models with deep belief networks, in: Automatic Speech
Recognition and Understanding (ASRU) IEEE, 2013, pp. 216–221.
[10] V.B. Waghmare, R.R. Deshmukh, P.P. Shrishrimal, G.B. Janvale, Emotion
recognition system from artificial marathi speech using MFCC and LDA
techniques, in: International Conference on Advances in Communication,
Network, and Computing, 2014.
[11] L. Chen, X. Mao, H. Yan, Text-independent phoneme segmentation
combining EGG and speech data, IEEE/ACM Trans. Audio Speech Lang.
Process. 24 (6) (2016) 1029–1037.
[12] Y. Kim, H. Lee, E.M. Provost, Deep learning for robust feature generation
in audiovisual emotion recognition, Acoustics, Speech and Signal Processing
(ICASSP), 2013 IEEE International Conference on 32 (2013) 3687–3691.
[13] Z. Huang, M. Dong, Q. Mao, Y. Zhan, Speech Emotion Recognition Using
CNN, ACM Multimedia, 2014, pp. 801–804.
[14] A. Stuhlsatz, C. Meyer, F. Eyben, T. Zielke, G. Meier, B. Schuller, Deep
neural networks for acoustic emotion recognition: raising the benchmarks, in:
IEEE International Conference on Acoustics, Speech, and Signal Processing
IEEE, 2011, pp. 5688–5691.
[15] J. Loughrey, P. Cunningham, Using Early Stopping to Reduce Overfitting
in Wrapper-Based Feature Weighting, Trinity College Dublin, Department of
Computer Science, 2005, TCD-CS-2005-41, pp. 12.
[16] Q. Mao, M. Dong, Z. Huang, Y. Zhan, Learning salient features for speech
emotion recognition using convolutional neural networks, IEEE Trans.
Multimedia 16 (8) (2014) 2203–2213.
[17] Y. Huang, A. Wu, G. Zhang, Y. Li, Extraction of adaptive wavelet packet
filter-bank-based acoustic feature for speech emotion recognition, IET Signal
Process. 9 (4) (2015) 341–348.
[18] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network
training by reducing internal covariate shift, in: International Conference on
Machine Learning, 2015, pp. 448–456.
[19] E.M. Schmidt, J.J. Scott, Y.E. Kim, Feature learning in dynamic
environments: modeling the acoustic structure of musical emotion, in:
International Symposium/Conference on Music Information Retrieval, 2012,
pp. 325–330.
[20] F. Burkhardt, A. Paeschke, M. Rolfes, W.F. Sendlmeier, B. Weiss, A
database of German emotional speech, INTERSPEECH 2005 - Eurospeech