SPEECH EMOTION RECOGNITION
ABSTRACT
Emotion recognition from speech is a crucial task in human-computer interaction,
psychology, and healthcare. It involves analyzing audio signals to detect the underlying
emotions conveyed by a speaker's voice. This capability has broad applications, including
improving customer service, designing empathetic virtual assistants, and enhancing mental
health diagnosis and treatment. Traditional approaches to speech emotion recognition often
rely on handcrafted features extracted from audio signals, such as pitch, intensity, and
spectral features. These features are then fed into machine learning models to classify
emotions. However, these systems often struggle with generalization across different
speakers, languages, and recording conditions. They also require extensive feature
engineering and may not capture subtle nuances in vocal expressions. The primary challenge
in speech emotion recognition is to develop robust and accurate models that can effectively
capture and interpret the complex patterns present in audio signals. This includes accounting
for variations in voice quality, speaking style, and emotional intensity across different
individuals and cultural contexts. Our proposed system aims to leverage advancements in
signal processing techniques to address the limitations of traditional speech emotion
recognition systems. We seek to automatically learn discriminative features from raw audio
data, enabling more robust and scalable emotion classification. Additionally, we plan to
explore multimodal approaches that combine speech signals with other modalities, such as
facial expressions or text, to further improve emotion recognition accuracy and robustness.
Through rigorous experimentation and evaluation on diverse datasets, we aim to develop a
state-of-the-art speech emotion recognition system capable of achieving high accuracy across
various real-world scenarios.
CHAPTER 1
INTRODUCTION
1.1 History
The quest to understand and interpret human emotions from speech dates back several
decades, rooted in the fields of psychology and linguistics. [1] Early efforts in the mid-20th
century focused on analyzing vocal characteristics and speech patterns to infer emotional
states. [2] Researchers explored fundamental acoustic features such as pitch, intensity, and
formants, seeking correlations with different emotions.
[3] In the 1970s and 1980s, advancements in signal processing and machine learning paved
the way for more systematic approaches to speech emotion recognition. [4] Researchers
began developing computational models to automatically extract relevant features from audio
signals and classify emotions using techniques like Hidden Markov Models (HMMs) and
Dynamic Time Warping (DTW). These pioneering studies laid the groundwork for
subsequent research in the field.
[5] The turn of the 21st century witnessed a surge of interest in speech emotion recognition,
driven by the proliferation of digital communication platforms and the growing importance of
human-computer interaction. [6] Researchers started exploring more sophisticated machine
learning algorithms, including Support Vector Machines (SVMs), Gaussian Mixture Models
(GMMs), and neural networks, to improve classification accuracy and robustness.
Additionally, the availability of large annotated datasets, such as the Berlin Database of
Emotional Speech (Emo-DB) and the Interactive Emotional Dyadic Motion Capture
(IEMOCAP) database, facilitated the development and evaluation of advanced emotion
recognition systems.
The motivation behind advancing speech emotion recognition systems lies in their wide-
ranging applications across various domains. [7] In human-computer interaction, the ability
to understand and respond to users' emotional states can significantly enhance the user
experience of interactive systems. Empathetic virtual assistants, capable of detecting users'
emotions and adapting their responses accordingly, can foster more engaging and
personalized interactions.
[8] In psychology and healthcare, accurate emotion recognition from speech can play a
crucial role in diagnosing and treating mental health disorders. [9] By analyzing subtle vocal
cues and patterns, clinicians can gain insights into patients' emotional well-being and tailor
interventions accordingly. [10] Furthermore, speech emotion recognition systems can assist in
therapeutic interventions, such as virtual reality-based exposure therapy for anxiety disorders,
by providing real-time feedback on emotional states.
Despite significant advancements, traditional speech emotion recognition systems still face
several challenges that limit their effectiveness in real-world applications. [11] One of the
primary challenges is the difficulty in generalizing across different speakers, languages, and
cultural contexts. [12] Existing models often struggle to adapt to variations in voice quality,
speaking style, and emotional expression, leading to reduced performance in diverse settings.
The traditional systems rely heavily on handcrafted features, requiring extensive domain
expertise and manual effort for feature engineering. This approach may overlook subtle
nuances and contextual cues present in vocal expressions, limiting the system's ability to
capture the complexity of human emotions accurately.
1.4 Applications
Speech emotion recognition has broad applications across various domains, ranging from
entertainment and education to healthcare and customer service. In the entertainment
industry, emotion-aware content recommendation systems can personalize multimedia
experiences based on users' emotional preferences, enhancing engagement and satisfaction.
In education, speech emotion recognition can facilitate adaptive learning environments that
respond dynamically to students' emotional states. For example, intelligent tutoring systems
can adjust their instructional strategies based on students' frustration or engagement levels,
optimizing learning outcomes.
In customer service, emotion-sensitive chatbots and virtual assistants can provide more
empathetic and effective support to users, leading to higher customer satisfaction and loyalty.
By understanding customers' emotions, businesses can tailor their responses and
recommendations to better meet individual needs and preferences.
CHAPTER 2
LITERATURE SURVEY
According to Qing and Zhong [13], the rise of big data handling in recent times, coupled with
the continual improvement of computers’ computational power and the ongoing improvement
of techniques, has led to significant advancements in the field. Also, with the advancement of
artificial intelligence research, people are no longer content with computers that merely match the
problem-solving abilities of the human mind; they also wish for a more humanized artificial
intelligence with emotions and character of its own. Such a system could be used in education to
recognize and analyze students' feelings in real time, and in intelligent human-computer interaction
to detect a speaker's emotional shifts as they occur. The researchers primarily investigate
Mel-Frequency Cepstral Coefficient (MFCC) parameters and the K-Nearest Neighbor (KNN) algorithm for
speech signals, implementing MFCC feature extraction in MATLAB and emotion classification with KNN.
The CASIA corpus is
utilized for training and validation, and it eventually achieved 78% accuracy. As per
Kannadaguli and Bhat [14], humans see feelings as physiological changes in the composition
of consciousness caused by various ideas, sentiments, sensations, and actions. Although
emotions vary with an individual’s familiarity, they remain consistent with attitude, color,
character, and inclination. Researchers employ Bayesian and Hidden Markov Model (HMM)
based techniques to study and assess the effectiveness of speaker-dependent emotion
identification systems. Because all emotions may not have the same prior probability, the
researchers compute the posterior probability of each class by multiplying the pattern's
class-conditional likelihood by the class prior and dividing by the pattern's overall likelihood,
obtained by summing over all classes. An emotion-based information model is constructed
using the acoustic-phonetic modeling technique applied to voice recognition. Following that, the
template classifier and pattern recognition are built using the three probabilistic
methodologies in Machine Learning.
As described by Nasrun and Setianingsih [15], emotions in daily language are often
associated with feelings of anger or rage experienced by an individual. Nevertheless, even though a
tendency toward action is regarded as a property of emotions, this does not make them any easier to
define terminologically. Speech is a significant factor in determining one's psychological
response. The Mel-Frequency Cepstral Coefficient (MFCC) approach, which involves extracting
features, is commonly used in human emotion recognition systems that are based on sound input. The
Support Vector Machine (SVM) is a data categorization approach developed in the 1990s. SVM is a
supervised Machine Learning method frequently used in various studies to categorize human voice
recordings. The Radial Basis Function (RBF) kernel is the most commonly used kernel in multi-class
SVM because it improves accuracy. The highest accuracy reported in that work was 72.5%.
According to Mohammad and Elhadef [16], emotion recognition in speech may be defined as
perceiving and recognizing emotions in human communication. In other words, speech emotion
perception enables communication of feelings between a computer and a human. The proposed
methodology comprises three major phases: signal pre-processing to remove noise and reduce signal
throughput; feature extraction using a combination of Linear Predictive Coefficients and 10-degree
polynomial curve-fitting coefficients over the periodogram power spectrum of the speech signal; and
Machine Learning, which applies various machine learning algorithms and compares their overall
accuracy to determine the best performer. One of the challenges is that the recognition approach
must select the best features so that the method is powerful enough to distinguish between
different emotions. Another factor is
the variety of languages, dialects, phrases, and speaking patterns. As per Bharti and Kekana
[17], speech conveys information and meaning via pitch, speech, emotion, and numerous
aspects of the Human Vocal System (HVS). The researchers proposed a framework that recognizes
emotions from the Speech Signal (SS) with higher average accuracy and effectiveness
when compared to techniques such as Hidden Markov Model and Support Vector Machine.
The detection step can be easily implemented on various mobile platforms with minimal
computing effort, as compared to previous approaches. The ML model has been trained
successfully using the Multi-class Support Vector Machine (MSVM) approach to distinguish
emotional categories based on selected features. In machine learning, Support Vector
Machines (SVMs) are popular models used for classification and regression analysis. They’re
especially known for their effectiveness in high-dimensional spaces. However, traditional
SVMs are inherently binary classifiers. When there are more than two classes in the dataset,
adaptations like MSVMs are used, which can handle multi-class classification problems. The
MSVM classifier was applied to Gammatone Frequency Cepstral Coefficient (GFCC) features, with
feature selection used to remove redundant elements and achieve a high success rate of 97% on the
RAVDESS data set (ALO). The GFCC is a feature extraction method used often in the field
of speech and audio processing. The GFCC features try to mimic the human auditory system,
capturing the phonetically important characteristics of speech, and are robust against noise.
When MFCC features are applied to the existing databases, the classifiers achieve an accuracy of
79.48%.
As described by Gopal and Jayakrishnan [18], emotions are a very complicated psychological
phenomenon that must be examined and categorized. Psychologists and neuroscientists have
performed extensive studies to analyze and classify human emotions over the last two
decades. Emotional prosody is used in several works. The goal of this project was to develop
a mechanism for annotating novel texts with appropriate emotion. With the SVM classifier, a
supervised method was used. The One-Against-Rest technique was utilized in a multi-class
SVM architecture. The suggested approach would categorize Malayalam phrases into several
emotion classes such as joyful, sad, angry, fear, standard, etc., using suitable level data with
an overall accuracy of 91.8%. Throughout feature vector choice, many aspects such as n-
grams, semantic orientation, POS-related features, and contextual details are analyzed to
determine whether the phrase is conversational or a question.
CHAPTER 3
EXISTING SYSTEM
Traditional speech emotion recognition systems primarily rely on handcrafted features
extracted from audio signals, such as pitch, intensity, and spectral characteristics. These
features are then input into machine learning models like Support Vector Machines (SVMs)
and Gaussian Mixture Models (GMMs) for emotion classification. The process involves
significant feature engineering, where domain experts must identify and extract relevant
characteristics from the audio data. While these approaches have made strides in emotion
recognition, they often struggle with generalization due to variations in speakers, languages,
and recording conditions. The handcrafted nature of feature extraction limits the system's
ability to capture subtle emotional nuances, leading to challenges in recognizing emotions
accurately across diverse contexts. Furthermore, traditional systems can be computationally
intensive and require substantial labeled data for training, which may not always be available.
Limitations
CHAPTER 4
PROPOSED SYSTEM
4.1 Overview
Importing Necessary Libraries: This block imports the essential libraries required for the project.
This includes libraries for numerical computations (numpy), data manipulation
(pandas), plotting (matplotlib and seaborn), audio processing (librosa), and
machine learning tasks (scikit-learn, xgboost). Additionally, joblib is imported for
saving and loading machine learning models.
Setting Dataset Paths This block sets up the file paths for the datasets (TESS and
CREMA-D) that will be used. It ensures the correct directories are accessed when
loading the audio files.
Loading and Preprocessing TESS Dataset A function named load_tess is defined to
load and preprocess the TESS dataset. It iterates through the dataset directories,
extracts emotions from the filenames, and creates a dataframe with two columns:
'Emotion' and 'File_Path'.
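A minimal sketch of such a loader is shown below. It assumes the standard TESS folder layout and the file-naming convention in which the emotion is the last underscore-separated token ('ps' standing for pleasant surprise); the parameter name tess_path is illustrative.

import os
import pandas as pd

def load_tess(tess_path):
    # Hedged reconstruction: walk each TESS sub-folder and read the emotion
    # label from the third underscore-separated token of the file name.
    rows = []
    for folder in os.listdir(tess_path):
        for wav in os.listdir(os.path.join(tess_path, folder)):
            label = wav.partition('.wav')[0].split('_')[2]
            emotion = 'surprise' if label == 'ps' else label
            rows.append({'Emotion': emotion,
                         'File_Path': os.path.join(tess_path, folder, wav)})
    return pd.DataFrame(rows)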
Loading and Preprocessing CREMA-D Dataset Similar to the TESS dataset, a
function named load_crema is defined to load and preprocess the CREMA-D dataset.
This function also iterates through the files, extracts emotion labels, and constructs a
dataframe.
Concatenating Datasets This block combines the dataframes from the TESS and
CREMA-D datasets into a single dataframe. This consolidated dataframe will be used
for further processing and feature extraction.
Visualizing Data Distribution Several blocks are dedicated to visualizing the
distribution of emotions in the combined dataset. Plots such as count plots are used to
show the number of samples for each emotion, providing insights into the dataset's
balance.
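For instance, a count plot of the emotion labels can be produced with seaborn; this is a minimal sketch that assumes the combined dataframe df with its 'Emotion' column from the previous step.

import matplotlib.pyplot as plt
import seaborn as sns

# df is the combined TESS + CREMA-D dataframe built above.
plt.figure(figsize=(10, 5))
sns.countplot(data=df, x='Emotion')
plt.title('Number of samples per emotion class')
plt.show()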
Audio Visualization Functions Two functions, wave_plot and spectogram, are
defined for visualizing the waveforms and spectrograms of the audio files. These
visualizations help in understanding the acoustic characteristics of different emotions.
Data Augmentation Techniques Functions for data augmentation are defined,
including adding noise, shifting, stretching, and pitch shifting. These techniques help
in increasing the diversity of the training data, making the model more robust.
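The four augmentation operations can be sketched as follows; the parameter values (noise factor, shift range, stretch rate, semitone steps) are illustrative defaults, not the project's exact settings.

import numpy as np
import librosa

def add_noise(data, noise_factor=0.005):
    # Additive white noise to simulate different recording environments.
    return data + noise_factor * np.random.randn(len(data))

def shift(data, max_shift=1600):
    # Random circular shift of up to max_shift samples.
    return np.roll(data, np.random.randint(-max_shift, max_shift))

def stretch(data, rate=0.8):
    # Time-stretch without changing pitch (simulates different speaking rates).
    return librosa.effects.time_stretch(y=data, rate=rate)

def pitch_shift(data, sr, n_steps=2):
    # Shift pitch by n_steps semitones without changing duration.
    return librosa.effects.pitch_shift(y=data, sr=sr, n_steps=n_steps)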
Feature Extraction Functions A key part of the code involves defining functions to
extract features from the audio data. Features such as zero-crossing rate, root mean
square energy, and MFCCs (Mel-frequency cepstral coefficients) are extracted. These
features are crucial for training the machine learning models.
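A hedged sketch of such a feature extractor, averaging each frame-level feature over time, is shown below; the number of MFCCs is an assumption.

import numpy as np
import librosa

def extract_features(data, sr):
    # Frame-level features are averaged over time to give one fixed-length vector.
    zcr = np.mean(librosa.feature.zero_crossing_rate(y=data).T, axis=0)
    rmse = np.mean(librosa.feature.rms(y=data).T, axis=0)
    mfcc = np.mean(librosa.feature.mfcc(y=data, sr=sr, n_mfcc=20).T, axis=0)  # n_mfcc assumed
    return np.hstack([zcr, rmse, mfcc])

# Example usage: load one file and compute its feature vector.
# data, sr = librosa.load('path/to/file.wav', duration=2.5, offset=0.6)
# features = extract_features(data, sr)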
Loading and Augmenting Audio Data This block iterates through the audio files,
applies the feature extraction and augmentation techniques, and compiles the features
into a dataset. This dataset is saved to a CSV file for future use.
Preprocessing the Dataset The dataset is preprocessed by filling missing values,
encoding the emotion labels, and standardizing the feature values. This step ensures
that the data is in the right format for training machine learning models.
Splitting Data into Training and Testing Sets The dataset is split into training and
testing sets using the train_test_split function. This separation allows for evaluating
the model's performance on unseen data.
Performance Metrics Function A function named performance_metrics is defined
to calculate and print various performance metrics (accuracy, precision, recall, F1
score) for the model. It also generates a classification report and confusion matrix to
visualize the model's performance.
Training and Evaluating Random Forest Classifier This block trains a Random
Forest Classifier on the training data. If a saved model exists, it loads the model using
joblib; otherwise, it trains a new model and saves it. The model's performance is then
evaluated on the test set.
Training and Evaluating XGBoost Classifier Similar to the Random Forest
Classifier, this block trains and evaluates an XGBoost Classifier. It also checks for an
existing saved model, trains a new one if necessary, and evaluates its performance.
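The load-if-saved, otherwise train-and-save behaviour described for both classifiers can be captured in one helper. This is a generic sketch; the model file names are illustrative rather than taken from the project.

import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def load_or_train(model, path, X_train, y_train):
    # Reuse a previously saved model if it exists; otherwise fit and persist it.
    if os.path.exists(path):
        return joblib.load(path)
    model.fit(X_train, y_train)
    joblib.dump(model, path)
    return model

# Illustrative usage (file names are hypothetical):
# rf = load_or_train(RandomForestClassifier(), 'rf_model.pkl', X_train, y_train)
# xgb = load_or_train(XGBClassifier(), 'xgb_model.pkl', X_train, y_train)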
Fig. 1: Block Diagram of proposed system.
Introduction
This step transforms raw audio data into a format suitable for feature extraction and model
training, addressing various challenges such as noise, variability in speech, and differences in
recording environments. Effective preprocessing ensures that the subsequent feature
extraction and classification stages can accurately capture the emotional content in speech.
1. Loading Audio Data The first step in preprocessing is loading the audio data from
various sources. Audio files can come in different formats (e.g., WAV, MP3), and the
preprocessing pipeline needs to handle these appropriately. Libraries like librosa are
commonly used for loading audio files into numerical arrays that can be manipulated
programmatically.
3. Trimming Silence Silence at the beginning or end of audio recordings can introduce
unnecessary variability. Trimming silence involves removing these silent segments,
ensuring that the audio data predominantly contains the speech signal. This step can
be particularly important in datasets where recordings have varying lengths and silent
periods.
5. Noise Reduction Background noise can significantly impact the accuracy of emotion
recognition systems. Techniques such as spectral gating, where noise is reduced by
filtering out frequencies with low energy, or more advanced methods like Wiener
filtering, are employed to enhance the clarity of the speech signal. Noise reduction
ensures that the features extracted are more representative of the speech content rather
than the background noise.
Adding Noise: Injecting random noise into the audio signal to simulate
different recording environments.
Time Shifting: Shifting the audio signal in time to create variations in the start
and end points of the speech.
Time Stretching: Speeding up or slowing down the audio without altering the
pitch to simulate different speaking rates.
Pitch Shifting: Changing the pitch of the audio to account for variations in
speaker pitch.
Augmentation increases the diversity of the training data, helping the model generalize better
to new, unseen data.
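The silence-trimming and noise-reduction steps described above can be prototyped with librosa; the spectral-gating routine below is a deliberately simplified illustration, not the exact method used in the project.

import numpy as np
import librosa

def trim_silence(data, top_db=30):
    # Remove leading/trailing segments quieter than top_db below the peak.
    trimmed, _ = librosa.effects.trim(data, top_db=top_db)
    return trimmed

def spectral_gate(data, threshold_scale=1.5):
    # Very simple spectral gating: attenuate bins whose magnitude falls below
    # a per-frequency threshold estimated from the quietest frames.
    stft = librosa.stft(data)
    mag, phase = np.abs(stft), np.angle(stft)
    noise_profile = np.percentile(mag, 10, axis=1, keepdims=True)  # rough noise floor
    mask = mag > threshold_scale * noise_profile
    return librosa.istft(mag * mask * np.exp(1j * phase))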
Feature Extraction
After preprocessing the audio data, the next step is to extract features that capture the
emotional content of the speech. These features are then used to train machine learning
models. Common features include:
1. Zero Crossing Rate (ZCR) ZCR measures how frequently the audio signal crosses
the zero amplitude line. It is an indicator of the noisiness of the signal and can be
correlated with certain emotions. For example, excited or angry speech may have a
higher ZCR.
2. Root Mean Square Energy (RMSE) RMSE provides a measure of the signal's
energy, which corresponds to the loudness of the speech. Emotions like anger or
happiness might exhibit higher energy levels, while sadness or calmness might have
lower energy.
4. Spectral Features These include features like spectral centroid, bandwidth, contrast,
and roll-off. They describe the shape of the audio spectrum and provide insights into
the distribution of energy across different frequencies, which can vary with different
emotional states.
5. Prosodic Features Prosody refers to the rhythm, stress, and intonation of speech.
Features like pitch (fundamental frequency), intensity (loudness), and duration can
provide valuable cues about the speaker's emotional state. For instance, anger might
be characterized by a higher and more variable pitch, whereas sadness might exhibit a
lower and more stable pitch.
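As an illustration of extracting one prosodic cue, the fundamental frequency can be tracked per frame and summarised; this sketch uses librosa's pYIN estimator, and the frequency range is an assumption.

import numpy as np
import librosa

def pitch_statistics(data, sr):
    # Estimate the fundamental frequency (f0) frame by frame with pYIN, then
    # summarise it as mean and standard deviation over the voiced frames.
    f0, voiced_flag, _ = librosa.pyin(data, fmin=librosa.note_to_hz('C2'),
                                      fmax=librosa.note_to_hz('C7'), sr=sr)
    voiced = f0[voiced_flag]
    if voiced.size == 0:
        return 0.0, 0.0
    return float(np.mean(voiced)), float(np.std(voiced))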
Visualizations play an important role in understanding the characteristics of audio signals and
verifying the effectiveness of preprocessing steps. Common visualizations include:
2. Spectrograms A spectrogram displays the frequency content of the audio signal over
time. It provides a visual representation of how the spectral characteristics of the
speech change, making it easier to identify patterns associated with different
emotions.
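A spectrogram of a loaded clip can be rendered as follows; this is a minimal sketch using librosa's display utilities.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def plot_spectrogram(data, sr, title='Spectrogram'):
    # Convert the STFT magnitude to decibels and display it on a time/frequency grid.
    db = librosa.amplitude_to_db(np.abs(librosa.stft(data)), ref=np.max)
    plt.figure(figsize=(12, 5))
    librosa.display.specshow(db, sr=sr, x_axis='time', y_axis='hz')
    plt.colorbar(format='%+2.0f dB')
    plt.title(title)
    plt.show()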
Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model. As the
name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction
from each tree and, based on the majority vote of those predictions, produces the final output. A
greater number of trees in the forest generally leads to higher accuracy and helps prevent
overfitting.
Fig. 4.1: Random Forest algorithm.
Step 1: In Random Forest n number of random records are taken from the data set having k
number of records.
Step 4: Final output is considered based on Majority Voting or Averaging for Classification
and regression respectively.
Immune to the curse of dimensionality- Since each tree does not consider all the
features, the feature space is reduced.
Train-Test split- In a random forest we do not necessarily have to segregate separate train and
test data, because each tree is grown on a bootstrap sample and therefore never sees roughly
one-third of the data (the out-of-bag samples), which can serve as a built-in validation set.
Stability- Stability arises because the result is based on majority voting/ averaging.
Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random Forest classifier:
There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
It predicts output with high accuracy and runs efficiently even on large datasets.
Before understanding the working of the random forest, we must look into the ensemble
technique. Ensemble simply means combining multiple models. Thus, a collection of models
is used to make predictions rather than an individual model. Ensemble uses two types of
methods:
Bagging– It creates a different training subset from sample training data with replacement &
the final output is based on majority voting. For example, Random Forest. Bagging, also
known as Bootstrap Aggregation is the ensemble technique used by random forest. Bagging
chooses a random sample from the data set. Hence each model is generated from the samples
(Bootstrap Samples) provided by the Original Data with replacement known as row
sampling. This step of row sampling with replacement is called bootstrap. Now each model is
trained independently which generates results. The final output is based on majority voting
after combining the results of all models. This step which involves combining all the results
and generating output based on majority voting is known as aggregation.
Fig. 4.2: RF Classifier analysis.
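The bootstrap-and-vote idea behind bagging can be demonstrated with scikit-learn on synthetic data; this generic sketch is not the project's configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the extracted speech features.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each tree is fitted on a bootstrap sample (row sampling with replacement);
# the ensemble aggregates the individual votes to give the final prediction.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                            bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print('Bagging accuracy:', bagging.score(X_test, y_test))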
Boosting– It combines weak learners into strong learners by creating sequential models such
that the final model has the highest accuracy. For example, ADA BOOST, XG BOOST.
Disadvantages:
Not Suitable for Imbalanced Data: Random Forest may not perform well on highly
imbalanced datasets where one class is significantly more frequent than the others. It
tends to favor the majority class, leading to biased predictions.
Training Time: Training a Random Forest model can take longer compared to
simpler algorithms like linear regression or decision trees, especially when dealing
with large datasets or a high number of trees.
Limited Extrapolation Ability: Random Forest models may struggle with making
predictions outside the range of the training data. They may not generalize well to
unseen data points that are significantly different from those in the training set.
XGBoost is a popular machine learning algorithm that belongs to the supervised learning technique.
It can be used for both Classification and Regression problems in ML. It is based on the concept of
ensemble learning through gradient boosting: rather than growing many independent trees and
averaging them, XGBoost builds decision trees sequentially, with each new tree trained to correct
the residual errors of the trees built so far. The predictions of all trees are then combined to
produce the final output, and regularization of tree complexity together with shrinkage (a learning
rate) helps prevent overfitting.
Fig. 4.4: XGBoost algorithm.
XGBoost, which stands for "Extreme Gradient Boosting," is a popular and powerful machine
learning algorithm used for both classification and regression tasks. It is known for its high
predictive accuracy and efficiency, and it has won numerous data science competitions and is
widely used in industry and academia. Here are some key characteristics and concepts related
to the XGBoost algorithm:
Tree-based Models: Decision trees are the weak learners used in XGBoost. These are
typically shallow trees (in the extreme case, single-split "stumps"), which helps prevent
overfitting.
Handling Missing Data: XGBoost has built-in capabilities to handle missing data
without requiring imputation. It does this by finding the optimal split for missing
values during tree construction.
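The built-in handling of missing values can be verified with a small synthetic example; the dataset and parameters here are illustrative only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
# Blank out roughly 5% of the entries; XGBoost learns a default split direction
# for missing values during tree construction instead of requiring imputation.
mask = np.random.RandomState(0).rand(*X.shape) < 0.05
X[mask] = np.nan

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = XGBClassifier(n_estimators=100)
model.fit(X_train, y_train)
print('Accuracy with missing values:', model.score(X_test, y_test))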
4.5 Advantages
The proposed system, which combines librosa-based feature extraction, data augmentation, and the
Random Forest and XGBoost classifiers, offers several distinct advantages:
Reduced Feature Engineering Effort: The feature extraction pipeline (zero-crossing rate, root
mean square energy, and MFCCs) is applied automatically and uniformly to every recording, so
emotions can be classified without the extensive manual feature engineering required by
traditional systems.
Improved Robustness: Data augmentation techniques such as adding noise, time shifting, time
stretching, and pitch shifting increase the diversity of the training data, helping the models
generalize better to different speakers, recording conditions, and speaking styles.
Strong Classification Performance: Ensemble learning with Random Forest and XGBoost yields high
accuracy on the combined TESS and CREMA-D dataset, with XGBoost in particular achieving the best
results in our experiments.
UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software. In its
current form UML comprises two major components: a Meta-model and a notation. In
the future, some form of method or process may also be added to, or associated with, UML.
GOALS: The Primary goals in the design of the UML are as follows:
Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.
Class Diagram
The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an “is-a”
or “has-a” relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed “methods” of the class.
Apart from this, each class may have certain “attributes” that uniquely identify the class.
Use Case Diagram
A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
Data Flow Diagram
A data flow diagram (DFD) is a graphical or visual representation using a standardized set of
symbols and notations to describe a business’s operations through data movement.
Sequence Diagram
Activity diagram: Activity diagram is another important diagram in UML to describe the
dynamic aspects of the system.
Deployment diagram
A deployment diagram in the Unified Modeling Language models the physical deployment of
artifacts on nodes. To describe a web site, for example, a deployment diagram would show
what hardware components (“nodes”) exist (e.g., a web server, an application server, and a
database server), what software components (“artifacts”) run on each node (e.g., web
application, database), and how the different pieces are connected (e.g., JDBC, REST, RMI).
The nodes appear as boxes, and the artifacts allocated to each node appear as rectangles
within the boxes. Nodes may have sub nodes, which appear as nested boxes. A single node in
a deployment diagram may conceptually represent multiple physical nodes, such as a cluster
of database servers.
Component diagram: Component diagram describes the organization and wiring of the
physical components in a system.
CHAPTER 6
SOFTWARE ENVIRONMENT
What is Python?
Programmers have to type relatively little, and the indentation requirements of the language
keep the code readable.
Python language is being used by almost all tech-giant companies like – Google,
Amazon, Facebook, Instagram, Dropbox, Uber… etc.
The biggest strength of Python is its huge collection of standard libraries, which can be used for
the following –
Machine Learning
Test frameworks
Multimedia
Advantages of Python
Python ships with an extensive standard library that contains code for various purposes like
regular expressions, documentation-generation, unit-testing, web browsers, threading,
databases, CGI, email, image manipulation, and more. So, we don’t have to write the
complete code for that manually.
2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some of
your code in languages like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complementary to extensibility, Python is embeddable as well. You can embed your Python code
in the source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.
4. Improved Productivity
The language’s simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do. Also, you need to write less code to get more done.
5. IOT Opportunities
Since Python forms the basis of new platforms like the Raspberry Pi, its future looks bright for
the Internet of Things. This is a way to connect the language with the real world.
When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code. This
is why when people pick up Python, they have a hard time adjusting to other more verbose
languages like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.
Like we said earlier, Python is freely available. But not only can you download Python for
free, but you can also download its source code, make changes to it, and even distribute it. It
downloads with an extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.
11. Interpreted
Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.
Almost every task done in Python requires less code than the same task in other languages. Python
also has awesome standard library support, so you don't have to
search for any third-party libraries to get your job done. This is the reason that many people
suggest learning Python to beginners.
2. Affordable
Python is free therefore individuals, small companies or big organizations can leverage the
free available resources to build applications. Python is popular and widely used so it gives
you better community support.
The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.
Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.
Disadvantages of Python
So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.
We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by
Python are enough to distract us from its speed limitations.
While it serves as an excellent server-side language, Python is rarely seen on the client
side. Besides that, it is rarely used to implement smartphone-based applications. One
such application is called Carbonnelle.
The reason it is not so famous despite the existence of Brython is that it isn’t that secure.
3. Design Restrictions
As you know, Python is dynamically typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.
5. Simple
Python's simplicity can itself be a problem: programmers who are used to it often find the
verbosity of languages such as Java unnecessary, which can make switching to more verbose
languages harder.
This was all about the Advantages and Disadvantages of Python Programming Language.
NumPy
It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary datatypes can be defined using NumPy which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.
Pandas
Matplotlib
For simple plotting the pyplot module provides a MATLAB-like interface, particularly when
combined with Ipython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions
familiar to MATLAB users.
Scikit-learn
There have been several updates in the Python version over the years. The question is how to
install Python? It might be confusing for the beginner who is willing to start learning Python
but this tutorial will solve your query. The latest or the newest version of Python is version
3.7.4 or in other words, it is Python 3.
Note: The python version 3.7.4 cannot be used on Windows XP or earlier devices.
Before you start with the installation process of Python. First, you need to know about your
System Requirements. Based on your system type i.e. operating system and based processor,
you must download the python version. My system type is a Windows 64-bit operating
system. So the steps below are to install python version 3.7.4 on Windows 7 device or to
install Python 3. The steps on how to install Python on
Windows 10, 8 and 7 are divided into 4 parts to help understand better.
Step 1: Go to the official site to download and install python using Google Chrome or any
other web browser. OR Click on the following link: https://www.python.org
Now, check for the latest and the correct version for your operating system.
Step 4: Scroll down the page until you find the Files option.
Step 5: Here you see a different version of python along with the operating system.
To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.
To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.
Here we will install Windows x86-64 web-based installer. Here your first part regarding
which version of python is to be downloaded is completed. Now we move ahead with the
second part in installing python i.e. Installation
Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.
Installation of Python
Step 1: Go to your Downloads folder and open the downloaded Python installer to carry out the
installation process.
Step 2: Before you click on Install Now, Make sure to put a tick on Add Python 3.7 to PATH.
Step 3: Click on Install Now. After the installation is successful, click on Close.
With these above three steps on python installation, you have successfully and correctly
installed Python. Now is the time to verify the installation.
Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.
Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program
Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save
Step 5: Name the file, set the save-as type to Python files, and click on SAVE. Here the file is
named Hey World.
Step 6: Now for e.g. enter print (“Hey World”) and Press Enter.
You will see that the command given is launched. With this, we end our tutorial on how to
install Python. You have learned how to download python for windows into your respective
operating system.
Note: Unlike Java, Python does not require semicolons at the end of statements.
CHAPTER 7
SYSTEM REQUIREMENTS
Software Requirements
The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.
The appropriation of requirements and implementation constraints gives the general overview
of the project in regard to what the areas of strength and deficit are and how to tackle them.
Jupyter Notebook (or)
Google colab
Hardware Requirements
Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.
RAM: minimum 4 GB
FUNCTIONAL REQUIREMENTS
Output Design
Outputs from computer systems are required primarily to communicate the results of
processing to users. They are also used to provide a permanent copy of the results for later
consultation. The various types of outputs in general are:
Internal Outputs, whose destination is within the organization.
Output Definition
Input Design
Input design is a part of overall system design. The main objective during the input design is
as given below:
Input Stages
Data recording
Data transcription
Data conversion
Data verification
Data control
Data transmission
Data validation
Data correction
Input Types
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
Input Media
At this stage a choice has to be made about the input media. To decide on the input media,
consideration has to be given to:
Type of input
Flexibility of format
Speed
Accuracy
Verification methods
Rejection rates
Ease of correction
Security
Easy to use
Portability
Keeping in view the above description of the input types and input media, it can be said that
most of the inputs are internal and interactive. As input data is keyed in directly by the user,
the keyboard can be considered the most suitable input device.
Error Avoidance
At this stage care is to be taken to ensure that input data remains accurate from the stage at
which it is recorded up to the stage in which the data is accepted by the system. This can be
achieved only by means of careful control each time the data is handled.
Error Detection
Even though every effort is made to avoid the occurrence of errors, a small proportion of errors
is still likely to occur. These errors can be discovered by using validations to check the input
data.
Data Validation
Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever invalid data is keyed
in, the system immediately prompts the user and the user has to again key in the data and the
system will accept the data only if the data is correct. Validations have been included where
necessary.
The system is designed to be a user friendly one. In other words the system has been
designed to communicate effectively with the user. The system has been designed with
popup menus.
It is essential to consult the system users and discuss their needs while designing the user
interface:
User-initiated interfaces: the user is in charge, controlling the progress of the
user/computer dialogue. In the computer-initiated interface, the computer selects the
next stage in the interaction.
In computer-initiated interfaces the computer guides the progress of the user/computer
dialogue. Information is displayed and, based on the user's response, the computer takes action or
displays further information.
Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.
Forms oriented interface: The user calls up an image of the form to his/her screen and
fills in the form. The forms-oriented interface is chosen because it is the best choice.
Computer-Initiated Interfaces
The menu system: the user is presented with a list of alternatives and chooses one of
them.
Question-and-answer type dialog systems, where the computer asks a question and takes
action based on the user's reply.
Right from the start the system is going to be menu driven, the opening menu displays the
available options. Choosing one option gives another popup menu with more options. In this
way every option leads the users to data entry form where the user can key in the data.
The design of error messages is an important part of the user interface design. As the user is
bound to commit some errors while using the system, the system should be designed to be helpful
by providing the user with information regarding the error he/she has committed.
This application must be able to produce output at different modules for different inputs.
Performance Requirements
The requirement specification for any system can be broadly stated as given below:
The existing system is completely dependent on the user to perform all the duties.
CHAPTER 9
SOURCE CODE
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display
import IPython
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

tess_path = r"datasets/Tess"
crema_path = r"datasets/Crema"
def load_tess():
    # Walk the TESS folders; the emotion label is the third underscore-separated
    # token of each file name ('ps' is mapped to 'surprise'). Loop reconstructed.
    tess = []
    for directory in os.listdir(tess_path):
        for wav in os.listdir(os.path.join(tess_path, directory)):
            emotion = wav.partition('.wav')[0].split('_')
            if emotion[2] == 'ps':
                label = 'surprise'
            else:
                label = emotion[2]
            tess.append({'Emotion': label, 'File_Path': os.path.join(tess_path, directory, wav)})
    tess_df = pd.DataFrame.from_dict(tess)
    return tess_df
def load_crema():
    # CREMA-D files live in a single folder; the emotion code is the third
    # underscore-separated token. The code-to-label mapping is an assumed reconstruction.
    code_map = {'SAD': 'sad', 'ANG': 'angry', 'HAP': 'happy',
                'FEA': 'fear', 'DIS': 'disgust', 'NEU': 'neutral'}
    crema = []
    for wav in os.listdir(crema_path):
        emotion = wav.partition(".wav")[0].split('_')
        crema.append({'Emotion': code_map.get(emotion[2], 'unknown'),
                      'File_Path': os.path.join(crema_path, wav)})
    crema_df = pd.DataFrame(crema)
    return crema_df

def concat_datasets(tess_df, crema_df):
    # Stack both corpora into one dataframe with 'Emotion' and 'File_Path' columns.
    df = pd.concat([tess_df, crema_df], ignore_index=True)
    return df
def wave_plot(data, sr):
    plt.figure(figsize=(12, 5))
    librosa.display.waveshow(data, sr=sr)

def spectogram(data, sr):
    audio = librosa.stft(data)
    audio_db = librosa.amplitude_to_db(abs(audio))
    plt.figure(figsize=(12, 5))
    librosa.display.specshow(audio_db, sr=sr, x_axis='time', y_axis='hz')

def extract_features(data, sr):
    # ZCR, RMS energy and MFCCs, averaged over frames (assumed reconstruction).
    result = np.array([])
    result = np.hstack((result, np.mean(librosa.feature.zero_crossing_rate(y=data).T, axis=0)))
    result = np.hstack((result, np.mean(librosa.feature.rms(y=data).T, axis=0)))
    result = np.hstack((result, np.mean(librosa.feature.mfcc(y=data, sr=sr).T, axis=0)))
    return result

def get_features(path):
    # Load a clip and return one feature row; the full pipeline also appends
    # augmented copies (noise, shift, stretch, pitch) as additional rows.
    data, sr = librosa.load(path, duration=2.5, offset=0.6)
    aud = [extract_features(data, sr)]
    audio = np.array(aud)
    return audio
tess_df = load_tess()
crema_df = load_crema()

# Concatenate datasets
df = concat_datasets(tess_df, crema_df)

# Extract features for every file (loop reconstructed from the fragments above).
X, Y = [], []
for index, (emotion, path) in enumerate(zip(df['Emotion'], df['File_Path'])):
    features = get_features(path)
    if index % 500 == 0:
        print(f'{index} files processed')
    for i in features:
        X.append(i)
        Y.append(emotion)
print('Done')

dataset = pd.DataFrame(X)
dataset['Emotion'] = Y
dataset.to_csv('processed_data.csv', index=False)
dataset = pd.read_csv('processed_data.csv')
dataset = dataset.fillna(0)

X = dataset.drop(['Emotion'], axis=1)
y = dataset['Emotion']

le = LabelEncoder()
y = le.fit_transform(y)

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

accuracy, precision, recall, fscore = [], [], [], []

def performance_metrics(algorithm, predict, testY):
    # Compute and store accuracy, precision, recall and F1-score (macro-averaged).
    testY = testY.astype('int')
    predict = predict.astype('int')
    a = accuracy_score(testY, predict) * 100
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    print(f'{algorithm} Accuracy: {a:.2f}  Precision: {p:.2f}  Recall: {r:.2f}  F1-score: {f:.2f}')
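The classifier training and evaluation blocks described in Chapter 4 are not reproduced in the listing above. The following hedged sketch continues from the variables defined earlier (X_train, X_test, y_train, y_test and performance_metrics); the model file names and hyperparameters are illustrative, not taken from the project.

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Random Forest: load a previously saved model if present, otherwise train and save it.
if os.path.exists('rf_model.pkl'):
    rf_cls = joblib.load('rf_model.pkl')
else:
    rf_cls = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_cls.fit(X_train, y_train)
    joblib.dump(rf_cls, 'rf_model.pkl')
performance_metrics('Random Forest', rf_cls.predict(X_test), y_test)

# XGBoost: same load-or-train pattern.
if os.path.exists('xgb_model.pkl'):
    xgb_cls = joblib.load('xgb_model.pkl')
else:
    xgb_cls = XGBClassifier(n_estimators=100, random_state=42)
    xgb_cls.fit(X_train, y_train)
    joblib.dump(xgb_cls, 'xgb_model.pkl')
performance_metrics('XGBoost', xgb_cls.predict(X_test), y_test)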
CHAPTER 10
Adding noise: Injecting random noise into the audio signals to mimic
environmental conditions or recording artifacts.
Shifting: Temporally shifting the audio signals to simulate variations in speech
tempo or speaking rate.
Pitch shifting: Modifying the pitch of the audio signals to simulate variations
in vocal characteristics. By augmenting the training data, the model becomes
more resilient to noise and variations in speech patterns, leading to improved
performance on unseen data.
Model Training Once the feature extraction and data augmentation steps are
completed, the next stage involves training machine learning models to classify
emotions from the extracted features. Two popular classification algorithms used for
speech emotion recognition are Random Forest and XGBoost. These algorithms are
trained on the preprocessed and augmented dataset, with the emotion labels as the
target variable. During training, the models learn to map the extracted features to the
corresponding emotion labels, thereby enabling them to make accurate predictions on
unseen data.
Model Evaluation After training the machine learning models, they are evaluated on
a separate test dataset to assess their performance and generalization ability.
Performance metrics such as accuracy, precision, recall, and F1-score are calculated to
quantify the models' effectiveness in classifying emotions. Additionally, a
classification report and confusion matrix are generated to provide detailed insights
into the models' performance across different emotion classes. By evaluating the
models on unseen data, we can ensure that they generalize well and can effectively
recognize emotions in real-world scenarios.
The dataset used in this project consists of audio recordings from two main sources: TESS
(Toronto Emotional Speech Set) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset).
Here's a detailed
description of each dataset:
TESS (Toronto Emotional Speech Set):
It consists of a total of 2800 audio files, with each file representing a unique
emotional expression.
The emotions covered in the TESS dataset include anger, disgust, fear, happiness,
sadness, surprise, and neutral.
The dataset provides a diverse range of emotional expressions, making it suitable for
training and testing emotion recognition models.
CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset):
It consists of a total of 7,442 audio files, with each file representing a unique
emotional expression.
The emotions covered in CREMA-D include anger, disgust, fear, happiness, sadness, and
neutral.
The dataset includes recordings from multiple speakers, providing variability in voice
quality, accent, and speaking style.
Combined Dataset:
The combined dataset is created by concatenating the TESS and CREMA-D datasets
into a single dataframe.
It contains a total of 10,242 audio files, encompassing a wide range of emotional
expressions.
Each audio file is associated with a specific emotion label, allowing for supervised
learning of emotion recognition models.
The dataset is divided into training and testing sets for model development and
evaluation.
Feature extraction and data augmentation techniques are applied to the audio data to
enhance model performance and generalization ability.
Figure 1 presents a bar chart depicting the total count of data samples for each emotion class
in the dataset. The x-axis represents the emotion classes, including angry, disgust, fear, happy,
neutral, surprise, and sad, while the y-axis represents the corresponding count of data
samples. This visualization provides an overview of the dataset's distribution across different
emotion categories, enabling insights into the dataset's balance and potential biases. Figure 2
illustrates a waveform plot representing the angry emotion. The plot visualizes the waveform of
audio signals associated with the angry emotion. The x-axis represents time, while the y-axis
represents the amplitude of the audio signal. This visualization provides an intuitive
understanding of the temporal dynamics of the angry emotion in the audio recordings,
highlighting patterns and fluctuations in the waveform. Figure 3 showcases a spectrogram
representing the angry emotion. The spectrogram visualizes the frequency content of the
audio signals associated with the angry emotion over time. The x-axis represents time, the y-
axis represents frequency, and the color intensity represents the magnitude of the frequency
components. This visualization offers insights into the spectral characteristics of the angry
emotion, capturing variations in frequency components over time.
Fig. 1: Presents the total data count of each class.
Figure 4 displays a waveform plot depicting the sad emotion. Similar to Figure 2, this plot
illustrates the waveform of audio signals associated with the sad emotion. The visualization
enables the examination of temporal patterns and dynamics in the audio recordings
corresponding to the sad emotion, facilitating the identification of distinctive features. Figure
5 exhibits a spectrogram representing the sad emotion. Similar to Figure 3, this visualization
illustrates the frequency content of the audio signals associated with the sad emotion over
time. By visualizing the spectral characteristics, this plot aids in understanding the variations
in frequency components and their temporal evolution in the audio recordings expressing the
sad emotion.
Fig. 5: Visual representation of the sad emotion spectrogram.
Figure 6 presents the performance metrics of the Random Forest Classifier (RFC) model. The
metrics include accuracy (85.80%), precision (87.62%), recall (86.10%), and F1-score
(86.62%), calculated for each emotion class. Additionally, the macro-averaged (86.00%) and
weighted-averaged (86.00%) metrics provide an overall assessment of the model's
performance across all classes. This visualization enables the evaluation of the RFC model's
effectiveness in classifying emotions and provides insights into its strengths and weaknesses.
Figure 7 displays the confusion matrix of the RFC model, illustrating the model's predictions
versus the actual labels for each emotion class. Each cell in the matrix represents the count of
instances where the model predicted a particular emotion class compared to the ground truth.
This visualization aids in understanding the model's classification errors and identifying any
patterns or biases in its predictions. Figure 8 showcases the performance metrics of the
XGBoost Classifier model, including accuracy (95.53%), precision (96.00%), recall
(95.65%), and F1-score (95.81%), calculated for each emotion class. Similar to Figure 6, the
macro-averaged (95.80%) and weighted-averaged (95.80%) metrics provide an overall
assessment of the XGBoost model's performance. This visualization facilitates the
comparison of the XGBoost model's performance with that of the RFC model and provides
insights into its classification capabilities. Figure 9 exhibits the confusion matrix of the
XGBoost model, illustrating its predictions versus the actual labels for each emotion class.
This matrix provides a detailed breakdown of the model's classification performance,
highlighting any discrepancies between predicted and true labels. By visualizing the model's
confusion patterns, this plot assists in diagnosing classification errors and identifying areas
for improvement.
CHAPTER 11
Future Scope