Speech Emotion Recognition using ML

Submitted By: Akanksha Raj (U. Roll No. - 2019272)
Under the mentorship of: Dr. Ankit Tomar, Assistant Professor

Introduction
Speech is the most elementary form of human communication. To enrich interaction, one needs to recognize the emotion of another person and know how to react to it. Unlike machines, we humans can naturally recognize the nature and emotion behind speech. Can a machine also detect emotion from speech? This can be made possible using machine learning: machines need a specific model for detecting the emotions in speech, and such a model can be built with machine learning. A machine that detects the emotion in human speech can prove useful in various industries. A basic use of speech emotion recognition is in the health sector, where it can help detect depression, anxiety, and stress in a patient. It can also be used in areas such as crime investigation, where emotions recognized from speech can help distinguish between victims and criminals.

Machine learning is a well-established approach for predicting or classifying information to help people make important decisions. To learn from previous experience and analyse historical data, ML algorithms are trained on examples. Simply building a model is not sufficient: the model must be sufficiently refined and tuned to give accurate results, and achieving the best results requires optimization strategies such as tweaking the hyperparameters. As the model repeatedly trains on examples, it learns to detect patterns, enabling more precise decision-making. When the model is presented with new data, it applies what it has learnt and produces predictions. Using standard optimization methodologies, one can keep improving a model with respect to its latest accuracy; in this way, ML models learn to adapt to new examples and deliver better outcomes.
Problem Statement
Emotions play a fundamental role in communication, so their detection and analysis is of vital importance in today's digital world of remote communication. Emotion detection is a challenging task, since emotions are subjective. We define a Speech Emotion Recognition (SER) system as a collection of methods that process and classify speech signals to detect the emotions embedded in them. Such a system has a wide variety of applications, such as intelligent voice-based assistants and caller-agent conversation analysis. The goal of this work is to identify fundamental emotions in recorded speech by analysing the acoustic components of the audio data. In this project, we predict the emotion in a person's speech on the given dataset using CNN and deep learning algorithms. The dataset comprises 2,800 audio recordings of two female voices expressing emotions such as anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral.
Methodology

1. Import Required Libraries (the imports are sketched after this list)
• Librosa is a library used for analyzing audio. It helps in loading audio files, extracting the characteristics of the audio, and visualizing audio data.

• The os library provides functions for interacting with the operating system, allowing tasks like file management and directory manipulation in Python.

• TensorFlow is a popular deep learning framework used for building, training, and
deploying machine learning models, particularly neural networks.

• Matplotlib is a plotting library in Python used to create high-quality 2D and 3D visualizations of data and results.

• NumPy is used for numerical computing in Python and provides essential tools for
array manipulation, mathematical operations, and linear algebra.
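
A minimal sketch of the imports described above (the aliases are the conventional ones, not prescribed by the slides):

```python
import os                        # file and directory handling

import numpy as np               # numerical computing and array manipulation
import matplotlib.pyplot as plt  # 2D/3D visualizations of data and results
import librosa                   # loading audio files and extracting features
import librosa.display          # waveform and spectrogram plotting helpers
import tensorflow as tf          # building and training the neural network
```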

2. Data Collection and Preprocessing

• TESS is a dataset containing audio files of 200 target words spoken in the carrier phrase "Say the word _" by two actresses (aged 26 and 64 years); recordings were made of the set portraying each of seven emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). There are 2,800 audio files in total.
• The dataset is organized such that each of the two female actors and each of their emotions is contained within its own folder, and within each folder all 200 target-word audio files can be found. The audio files are in WAV format (a sketch of collecting the file paths follows below).
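
A minimal sketch of collecting file paths and emotion labels from this folder layout. The root folder name `TESS` and the filename pattern (emotion as the last underscore-separated token, e.g. `OAF_back_angry.wav`) are assumptions based on the commonly distributed form of the dataset, not details confirmed by the slides:

```python
import os

paths, labels = [], []
# Walk the dataset tree; 'TESS' is an assumed root folder name.
for dirname, _, filenames in os.walk('TESS'):
    for filename in filenames:
        if not filename.endswith('.wav'):
            continue
        paths.append(os.path.join(dirname, filename))
        # Assumed filename pattern: the emotion is the last
        # underscore-separated token, e.g. 'OAF_back_angry.wav' -> 'angry'.
        label = filename.split('_')[-1].split('.')[0].lower()
        labels.append(label)

print(f'{len(paths)} files found')  # expected: 2800
```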
3. Exploratory Data Analysis
Wave plots and spectrograms of sample recordings are drawn for each emotion to inspect the data.

4. Feature Extraction
MFCC features are extracted from each audio file to capture its acoustic characteristics (a sketch of both steps follows below).
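
A minimal sketch of both steps for a single file, assuming (from the model's input shape of (40, 1) described in the next section and the MFCC method named in the conclusions) that 40 MFCC coefficients are extracted and averaged over time; the exact parameters are assumptions:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

path = paths[0]             # any file collected in the sketch above
y, sr = librosa.load(path)  # load audio at librosa's default sample rate

# Exploratory plots: waveform and spectrogram.
librosa.display.waveshow(y, sr=sr)
plt.title('Waveform')
plt.show()

db = librosa.amplitude_to_db(np.abs(librosa.stft(y)))
librosa.display.specshow(db, sr=sr, x_axis='time', y_axis='hz')
plt.title('Spectrogram')
plt.show()

# Feature extraction: 40 MFCCs, averaged over time -> shape (40,).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
features = np.mean(mfcc.T, axis=0)
```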
5. Model Architecture and Training
• Sequential is used to create a linear stack of layers; Dense, LSTM, and Dropout are layer types that can be added to the model (see the sketch after this list).

• LSTM Layer: A Long Short-Term Memory (LSTM) layer with 256 units, set to return only the last output (return_sequences=False). It takes input sequences of shape (40, 1), where 40 is the sequence length and 1 is the number of features at each time step.

• Dropout Layer: A dropout layer with a dropout rate of 0.2 is added after the LSTM layer. Dropout is a
regularization technique that helps prevent overfitting by randomly setting a fraction of input units to 0 at
each update during training.

• Dense Layer (ReLU Activation): A fully connected (dense) layer with 128 units and Rectified Linear Unit
(ReLU) activation function is added. ReLU is a common activation function that introduces non-linearity.

• Dropout Layer: Another dropout layer with a rate of 0.2 is added after the dense layer.

• Dense Layer (ReLU Activation): Another fully connected layer with 64 units and ReLU activation.

• Dropout Layer: A dropout layer with a rate of 0.2 is added after the second dense layer.

• Dense Layer (Softmax Activation): The final layer is a dense layer with 7 units and softmax activation. This is commonly used in multi-class classification problems, where the network outputs a probability distribution over the different classes.
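
Putting the layers above together, a minimal sketch of the model in Keras; the compile and fit settings are illustrative assumptions, since the slides specify only the layer stack:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential([
    # LSTM over the 40-step MFCC sequence, one feature per time step.
    LSTM(256, return_sequences=False, input_shape=(40, 1)),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.2),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(7, activation='softmax'),  # one unit per emotion class
])

# Assumed training configuration, for illustration only.
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# X has shape (n_samples, 40, 1); y is one-hot over the 7 emotions.
# model.fit(X, y, validation_split=0.2, epochs=50, batch_size=64)
```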
Result and Discussion
4.1 Model Performance Metrics:
The implemented CNN-based SER model exhibited commendable performance on the provided dataset. The
model achieved an accuracy of approximately 97% on the training dataset and 94% on the testing dataset.
Evaluation of the confusion matrix showed the model to be robust in recognizing various emotions, particularly excelling at discerning 'Neutral' and 'Happy'. However, it exhibited relatively lower accuracy in classifying 'Angry' and 'Disgust', possibly due to the inherent complexity and nuance of identifying these emotions solely from speech signals.
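
A minimal sketch of how such a confusion-matrix evaluation might be produced with scikit-learn; scikit-learn is not mentioned in the slides, and `X_test`, `y_test`, and the trained `model` are assumed from the steps above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Predicted class = argmax of the softmax output for each test sample.
y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)  # assuming one-hot test labels

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))
```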
4.2 Comparative Analysis:
Comparing the model's performance against existing state-of-the-art SER approaches reveals noteworthy observations. The proposed CNN-based model yielded competitive accuracy rates compared to traditional machine learning techniques, demonstrating the efficacy of leveraging deep learning for SER tasks. Nevertheless, further analysis is required to comprehend the model's performance concerning specific emotions and the potential influence of imbalanced data distribution across emotion classes.
4.3 Strengths and Limitations:
The strengths of the CNN-based SER model lie in its ability to automatically extract intricate patterns and hierarchical features from MFCC
representations, enabling better discrimination among various emotions. The model’s adaptability to complex data and its high-dimensional feature
extraction capabilities contribute significantly to its success. However, inherent limitations exist, notably the dependency on the quality and diversity of
the dataset. The model’s performance might be influenced by imbalanced data distributions among emotion classes, potentially leading to biased
predictions. Additionally, challenges persist in accurately capturing subtle emotional nuances and cultural variations in speech, warranting further
exploration and data augmentation strategies.
Conclusion and Future Work
5.1 Conclusions
In this project, deep learning is used to analyse speech samples. To illustrate the various human emotions, the dataset is first loaded using the Librosa library and depicted in the form of various wave plots and spectrograms. Then the MFCC feature extraction method is used to analyse the acoustic characteristics of all the samples, and the sequential data obtained is organized into the 3D array form that the CNN model accepts.
Using the Matplotlib library, the data is put into graphical form; after repeated testing with various values, the model's average accuracy is found to be 94% at testing and 97% in the training phase.

5.2 Future Scope
5.2.1. Data Augmentation and Diverse Datasets:
 Augmentation Strategies: Implement advanced data augmentation techniques to address data imbalances and enrich the diversity of emotional expressions within the dataset (see the sketch after this list).
 Multilingual and Multicultural Datasets: Curate datasets encompassing diverse languages and cultural contexts to improve the model's adaptability and robustness in
recognizing emotions across different demographics.
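
A minimal sketch of two common audio augmentation techniques, noise injection and pitch shifting with librosa; these particular transforms are illustrative assumptions, not techniques specified in the slides:

```python
import numpy as np
import librosa

def add_noise(y, noise_factor=0.005):
    """Inject Gaussian noise scaled by noise_factor."""
    return y + noise_factor * np.random.randn(len(y))

def shift_pitch(y, sr, n_steps=2):
    """Shift pitch by n_steps semitones without changing duration."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

y, sr = librosa.load(paths[0])
augmented = [add_noise(y), shift_pitch(y, sr)]
```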
5.2.2. Model Refinement and Optimization:
 Architecture Refinement: Explore modifications to the CNN architecture, incorporating attention mechanisms, ensemble techniques, or deeper network structures to capture finer emotional nuances present in speech signals.
 Feature Engineering: Investigate alternative feature representations or the fusion of multimodal features (audio-visual, textual) to extract more discriminative emotional cues and improve classification accuracy.
5.2.3. Contextual Understanding and Real-Time Applications:
 Contextual Analysis: Incorporate contextual understanding by analysing the context surrounding speech to enhance emotion recognition accuracy, emphasizing temporal dynamics and sequence modelling for a more comprehensive interpretation of emotional cues.
 Real-Time Applications: Adapt the model for real-time applications, enabling its integration into interactive systems, virtual assistants, or therapeutic applications requiring accurate emotion detection in speech.
Thank you!
