
EXPLORING SPEECH EMOTION RECOGNITION: A MULTI-EMOTION VOICE DATASET ANALYSIS FOR EMOTION DETECTION AND SENTIMENT ANALYSIS

ABSTRACT
Emotion recognition from speech is a crucial task in human-computer interaction,
psychology, and healthcare. It involves analyzing audio signals to detect the underlying
emotions conveyed by a speaker's voice. This capability has broad applications, including
improving customer service, designing empathetic virtual assistants, and enhancing mental
health diagnosis and treatment. Traditional approaches to speech emotion recognition often
rely on handcrafted features extracted from audio signals, such as pitch, intensity, and
spectral features. These features are then fed into machine learning models to classify
emotions. However, these systems often struggle with generalization across different
speakers, languages, and recording conditions. They also require extensive feature
engineering and may not capture subtle nuances in vocal expressions. The primary challenge
in speech emotion recognition is to develop robust and accurate models that can effectively
capture and interpret the complex patterns present in audio signals. This includes accounting
for variations in voice quality, speaking style, and emotional intensity across different
individuals and cultural contexts. Our proposed system aims to leverage advancements in
signal processing techniques to address the limitations of traditional speech emotion
recognition systems. We seek to automatically learn discriminative features from raw audio
data, enabling more robust and scalable emotion classification. Additionally, we plan to
explore multimodal approaches that combine speech signals with other modalities, such as
facial expressions or text, to further improve emotion recognition accuracy and robustness.
Through rigorous experimentation and evaluation on diverse datasets, we aim to develop a
state-of-the-art speech emotion recognition system capable of achieving high accuracy across
various real-world scenarios.
CHAPTER 1

INTRODUCTION

1.1 History

The quest to understand and interpret human emotions from speech dates back several
decades, rooted in the fields of psychology and linguistics. [1] Early efforts in the mid-20th
century focused on analyzing vocal characteristics and speech patterns to infer emotional
states. [2] Researchers explored fundamental acoustic features such as pitch, intensity, and
formants, seeking correlations with different emotions.

[3] In the 1970s and 1980s, advancements in signal processing and machine learning paved
the way for more systematic approaches to speech emotion recognition. [4] Researchers
began developing computational models to automatically extract relevant features from audio
signals and classify emotions using techniques like Hidden Markov Models (HMMs) and
Dynamic Time Warping (DTW). These pioneering studies laid the groundwork for
subsequent research in the field.

[5] The turn of the 21st century witnessed a surge of interest in speech emotion recognition,
driven by the proliferation of digital communication platforms and the growing importance of
human-computer interaction. [6] Researchers started exploring more sophisticated machine
learning algorithms, including Support Vector Machines (SVMs), Gaussian Mixture Models
(GMMs), and neural networks, to improve classification accuracy and robustness.
Additionally, the availability of large annotated datasets, such as the Berlin Database of
Emotional Speech (Emo-DB) and the Interactive Emotional Dyadic Motion Capture
(IEMOCAP) database, facilitated the development and evaluation of advanced emotion
recognition systems.

1.2 Research Motivation

The motivation behind advancing speech emotion recognition systems lies in their wide-
ranging applications across various domains. [7] In human-computer interaction, the ability
to understand and respond to users' emotional states can significantly enhance the user
experience of interactive systems. Empathetic virtual assistants, capable of detecting users'
emotions and adapting their responses accordingly, can foster more engaging and
personalized interactions.
[8] In psychology and healthcare, accurate emotion recognition from speech can play a
crucial role in diagnosing and treating mental health disorders. [9] By analyzing subtle vocal
cues and patterns, clinicians can gain insights into patients' emotional well-being and tailor
interventions accordingly. [10] Furthermore, speech emotion recognition systems can assist in
therapeutic interventions, such as virtual reality-based exposure therapy for anxiety disorders,
by providing real-time feedback on emotional states.

1.3 Problem Statement

Despite significant advancements, traditional speech emotion recognition systems still face
several challenges that limit their effectiveness in real-world applications. [11] One of the
primary challenges is the difficulty in generalizing across different speakers, languages, and
cultural contexts. [12] Existing models often struggle to adapt to variations in voice quality,
speaking style, and emotional expression, leading to reduced performance in diverse settings.

In addition, traditional systems rely heavily on handcrafted features, requiring extensive domain
expertise and manual effort for feature engineering. This approach may overlook subtle
nuances and contextual cues present in vocal expressions, limiting the system's ability to
capture the complexity of human emotions accurately.

1.4 Applications

Speech emotion recognition has broad applications across various domains, ranging from
entertainment and education to healthcare and customer service. In the entertainment
industry, emotion-aware content recommendation systems can personalize multimedia
experiences based on users' emotional preferences, enhancing engagement and satisfaction.

In education, speech emotion recognition can facilitate adaptive learning environments that
respond dynamically to students' emotional states. For example, intelligent tutoring systems
can adjust their instructional strategies based on students' frustration or engagement levels,
optimizing learning outcomes.

In customer service, emotion-sensitive chatbots and virtual assistants can provide more
empathetic and effective support to users, leading to higher customer satisfaction and loyalty.
By understanding customers' emotions, businesses can tailor their responses and
recommendations to better meet individual needs and preferences.
CHAPTER 2

LITERATURE SURVEY

According to Qing and Zhong [13], the rise of big data, the continual improvement of computers’ computational power, and the ongoing refinement of techniques have led to significant advancements in the field. Moreover, as artificial intelligence research advances, people are no longer content with a computer that merely matches the problem-solving abilities of the human mind; they also wish for a much more humanized artificial intelligence with emotions and character of its own. Such systems can be used in education to recognize and analyze students’ feelings in real time, and in intelligent human-computer interaction to detect a speaker’s emotional shifts in real time. The authors primarily investigate Mel-Frequency Cepstral Coefficient (MFCC) features and the K-Nearest Neighbor (KNN) algorithm for speech signals, implementing MFCC feature extraction in MATLAB and emotion classification with KNN. The CASIA corpus is used for training and validation, eventually achieving 78% accuracy. As per Kannadaguli and Bhat [14], humans perceive emotions as physiological changes in the composition of consciousness caused by various ideas, sentiments, sensations, and actions. Although emotions vary with an individual’s familiarity, they remain consistent with attitude, color, character, and inclination. The researchers employ Bayesian and Hidden Markov Model (HMM) based techniques to study and assess the effectiveness of speaker-dependent emotion identification systems. Because all emotions may not have the same prior probability, the posterior probability of each class is computed by multiplying the likelihood of the observed pattern under that class by the class’s prior and dividing by the overall likelihood of the pattern, obtained by summing over all classes. An emotion-based information model is constructed by applying the acoustic-phonetic modeling technique to voice recognition. The template classifier and pattern recognizer are then built using the three probabilistic methodologies in Machine Learning.
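The calculation described above is essentially Bayes' rule. As a compact restatement (not a formula reproduced from [14]), with x denoting the extracted feature pattern and e_k the k-th emotion class:

```latex
P(e_k \mid x) = \frac{P(x \mid e_k)\, P(e_k)}{\sum_{j} P(x \mid e_j)\, P(e_j)},
\qquad
\hat{e} = \arg\max_k \, P(e_k \mid x)
```

The predicted emotion is simply the class with the largest posterior probability.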

As described by Nasrun and Setianingsih [15], emotions in daily language are often associated with feelings of anger or rage experienced by an individual. Nevertheless, the fact that action is predisposed as a property of emotions does not necessarily make them simpler to describe terminologically. Speech is a significant factor in determining one’s psychological response. The Mel-Frequency Cepstral Coefficient (MFCC) feature extraction approach is commonly used in human emotion recognition systems that are based on sound inputs. The Support Vector Machine (SVM) is a data categorization approach developed in the 1990s; it is a supervised machine learning method frequently used in various studies to classify human voice recordings. The Radial Basis Function (RBF) kernel is the most frequently used kernel in multi-class SVM because it improves accuracy. The highest accuracy ratio reported in that work was 72.5%.

According to Mohammad and Elhadef [16], emotion recognition in speech may be defined as perceiving and recognizing emotions in human communication; in other words, speech emotion perception allows feelings to be communicated between a computer and a human. The proposed methodology comprises three major phases: signal pre-processing to remove noise and decrease signal throughput; feature extraction using a combination of Linear Predictive Rules and 10-degree polynomial curve-fitting coefficients over the periodogram power spectrum of the speech signal; and machine learning, in which various algorithms are compared to determine the best overall accuracy. One source of difficulty is selecting features that make the recognition approach powerful enough to distinguish between different emotions; another is the variety of languages, dialects, phrases, and speaking patterns. As per Bharti and Kekana [17], speech conveys information and meaning via pitch, speech, emotion, and numerous aspects of the Human Vocal System (HVS). The researchers proposed a framework that recognizes emotions from the Speech Signal (SS) with higher average accuracy and effectiveness than techniques such as the Hidden Markov Model and Support Vector Machine. Compared with previous approaches, the detection step can be implemented easily on various mobile platforms with minimal computing effort. The model was trained using the Multi-class Support Vector Machine (MSVM) approach to distinguish emotional categories based on selected features. In machine learning, Support Vector Machines (SVMs) are popular models used for classification and regression analysis and are especially known for their effectiveness in high-dimensional spaces. However, traditional SVMs are inherently binary classifiers; when there are more than two classes in the dataset, adaptations such as MSVMs are used to handle multi-class classification problems. MSVM classification with Gammatone Frequency Cepstral Coefficient (GFCC) features and feature selection (ALO) achieved a high success rate of 97% on the RAVDESS dataset. GFCC is a feature extraction method often used in speech and audio processing; GFCC features try to mimic the human auditory system, capturing the phonetically important characteristics of speech, and are robust against noise. When MFCC features are applied to the existing databases, the classifiers achieve an accuracy of 79.48%.

As described by Gopal and Jayakrishnan [18], emotions are a very complicated psychological phenomenon that must be examined and categorized. Psychologists and neuroscientists have performed extensive studies to analyze and classify human emotions over the last two decades, and emotional prosody is used in several works. The goal of this project was to develop a mechanism for annotating novel texts with the appropriate emotion. A supervised method was used with the SVM classifier, and the One-Against-Rest technique was utilized in a multi-class SVM architecture. The suggested approach categorizes Malayalam phrases into several emotion classes such as joyful, sad, angry, fear, standard, etc., using suitable level data, with an overall accuracy of 91.8%. During feature vector selection, many aspects such as n-grams, semantic orientation, POS-related features, and contextual details are analyzed to determine whether a phrase is conversational or a question.
CHAPTER 3

EXISTING SYSTEM
Traditional speech emotion recognition systems primarily rely on handcrafted features
extracted from audio signals, such as pitch, intensity, and spectral characteristics. These
features are then input into machine learning models like Support Vector Machines (SVMs)
and Gaussian Mixture Models (GMMs) for emotion classification. The process involves
significant feature engineering, where domain experts must identify and extract relevant
characteristics from the audio data. While these approaches have made strides in emotion
recognition, they often struggle with generalization due to variations in speakers, languages,
and recording conditions. The handcrafted nature of feature extraction limits the system's
ability to capture subtle emotional nuances, leading to challenges in recognizing emotions
accurately across diverse contexts. Furthermore, traditional systems can be computationally
intensive and require substantial labeled data for training, which may not always be available.

Limitations

 Struggles with generalization across different speakers and languages.

 Requires extensive feature engineering, which is time-consuming and subjective.

 May fail to capture subtle vocal nuances and emotional expressions.

 Performance is affected by variations in recording conditions.

 Computationally intensive, necessitating significant processing power.

 Limited scalability due to reliance on handcrafted features.

 Often requires substantial labeled datasets, which may be hard to obtain.


CHAPTER 4

PROPOSED SYSTEM

4.1 Overview

 Importing Necessary Libraries imports essential libraries required for the project.
This includes libraries for numerical computations (numpy), data manipulation
(pandas), plotting (matplotlib and seaborn), audio processing (librosa), and
machine learning tasks (scikit-learn, xgboost). Additionally, joblib is imported for
saving and loading machine learning models.
 Setting Dataset Paths This block sets up the file paths for the datasets (TESS and
CREMA-D) that will be used. It ensures the correct directories are accessed when
loading the audio files.
 Loading and Preprocessing TESS Dataset A function named load_tess is defined to
load and preprocess the TESS dataset. It iterates through the dataset directories,
extracts emotions from the filenames, and creates a dataframe with two columns:
'Emotion' and 'File_Path'.
 Loading and Preprocessing CREMA-D Dataset Similar to the TESS dataset, a
function named load_crema is defined to load and preprocess the CREMA-D dataset.
This function also iterates through the files, extracts emotion labels, and constructs a
dataframe.
 Concatenating Datasets This block combines the dataframes from the TESS and
CREMA-D datasets into a single dataframe. This consolidated dataframe will be used
for further processing and feature extraction.
 Visualizing Data Distribution Several blocks are dedicated to visualizing the
distribution of emotions in the combined dataset. Plots such as count plots are used to
show the number of samples for each emotion, providing insights into the dataset's
balance.
 Audio Visualization Functions Two functions, wave_plot and spectogram, are
defined for visualizing the waveforms and spectrograms of the audio files. These
visualizations help in understanding the acoustic characteristics of different emotions.
 Data Augmentation Techniques Functions for data augmentation are defined,
including adding noise, shifting, stretching, and pitch shifting. These techniques help
in increasing the diversity of the training data, making the model more robust.
 Feature Extraction Functions A key part of the code involves defining functions to
extract features from the audio data. Features such as zero-crossing rate, root mean
square energy, and MFCCs (Mel-frequency cepstral coefficients) are extracted. These
features are crucial for training the machine learning models.
 Loading and Augmenting Audio Data This block iterates through the audio files,
applies the feature extraction and augmentation techniques, and compiles the features
into a dataset. This dataset is saved to a CSV file for future use.
 Preprocessing the Dataset The dataset is preprocessed by filling missing values,
encoding the emotion labels, and standardizing the feature values. This step ensures
that the data is in the right format for training machine learning models.
 Splitting Data into Training and Testing Sets The dataset is split into training and
testing sets using the train_test_split function. This separation allows for evaluating
the model's performance on unseen data.
 Performance Metrics Function A function named performance_metrics is defined
to calculate and print various performance metrics (accuracy, precision, recall, F1
score) for the model. It also generates a classification report and confusion matrix to
visualize the model's performance.
 Training and Evaluating Random Forest Classifier This block trains a Random
Forest Classifier on the training data. If a saved model exists, it loads the model using
joblib; otherwise, it trains a new model and saves it. The model's performance is then
evaluated on the test set.
 Training and Evaluating XGBoost Classifier Similar to the Random Forest
Classifier, this block trains and evaluates an XGBoost Classifier. It also checks for an
existing saved model, trains a new one if necessary, and evaluates its performance.
Fig. 1: Block Diagram of proposed system.
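To make the above walkthrough concrete, the following minimal Python sketch illustrates the dataset-loading and concatenation steps. The directory locations and the filename conventions (the emotion appearing as the last underscore-separated token in TESS filenames and as the third token in CREMA-D filenames) are assumptions based on the public releases of these datasets, not an exact reproduction of the project code.

```python
import os
import pandas as pd

# Assumed dataset locations; adjust to the actual directory layout.
TESS_PATH = "datasets/TESS"
CREMA_PATH = "datasets/CREMA-D"

def load_tess(root):
    """Walk the TESS folders; the emotion is the last underscore token of the filename."""
    rows = []
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.lower().endswith(".wav"):
                # e.g. 'OAF_back_angry.wav' -> 'angry'
                emotion = name.rsplit("_", 1)[-1].replace(".wav", "").lower()
                rows.append({"Emotion": emotion,
                             "File_Path": os.path.join(dirpath, name)})
    return pd.DataFrame(rows)

# CREMA-D emotion codes mapped to readable labels (assumed mapping).
CREMA_MAP = {"ANG": "angry", "DIS": "disgust", "FEA": "fear",
             "HAP": "happy", "NEU": "neutral", "SAD": "sad"}

def load_crema(root):
    """CREMA-D encodes the emotion as the third underscore-separated token."""
    rows = []
    for name in os.listdir(root):
        if name.lower().endswith(".wav"):
            code = name.split("_")[2]          # e.g. '1001_DFA_ANG_XX.wav' -> 'ANG'
            rows.append({"Emotion": CREMA_MAP.get(code, "unknown"),
                         "File_Path": os.path.join(root, name)})
    return pd.DataFrame(rows)

if __name__ == "__main__":
    # Concatenate both sources into one dataframe for the later stages.
    df = pd.concat([load_tess(TESS_PATH), load_crema(CREMA_PATH)], ignore_index=True)
    print(df["Emotion"].value_counts())
```

The resulting dataframe, with its 'Emotion' and 'File_Path' columns, is what the visualization, augmentation, and feature-extraction steps below operate on.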

4.2 Audio Preprocessing

Introduction

This step transforms raw audio data into a format suitable for feature extraction and model
training, addressing various challenges such as noise, variability in speech, and differences in
recording environments. Effective preprocessing ensures that the subsequent feature
extraction and classification stages can accurately capture the emotional content in speech.

Steps in Audio Preprocessing

1. Loading Audio Data The first step in preprocessing is loading the audio data from
various sources. Audio files can come in different formats (e.g., WAV, MP3), and the
preprocessing pipeline needs to handle these appropriately. Libraries like librosa are
commonly used for loading audio files into numerical arrays that can be manipulated
programmatically.

2. Resampling Audio files might be recorded at different sampling rates. To ensure consistency, all audio data is resampled to a common sampling rate (e.g., 16 kHz or 44.1 kHz). Resampling helps in standardizing the time resolution of the audio signals, making it easier to extract uniform features across all samples.

3. Trimming Silence Silence at the beginning or end of audio recordings can introduce
unnecessary variability. Trimming silence involves removing these silent segments,
ensuring that the audio data predominantly contains the speech signal. This step can
be particularly important in datasets where recordings have varying lengths and silent
periods.

4. Normalization Audio signals can have varying amplitudes due to differences in recording equipment and speaker volume. Normalization scales the audio signals to a standard range, typically [-1, 1], to ensure that the amplitude variations do not affect the feature extraction process. This step makes the audio data more uniform and comparable across different samples.

5. Noise Reduction Background noise can significantly impact the accuracy of emotion
recognition systems. Techniques such as spectral gating, where noise is reduced by
filtering out frequencies with low energy, or more advanced methods like Wiener
filtering, are employed to enhance the clarity of the speech signal. Noise reduction
ensures that the features extracted are more representative of the speech content rather
than the background noise.

6. Data Augmentation To improve the robustness of the emotion recognition model, data augmentation techniques are applied. These techniques generate additional training samples by altering the original audio data; a short code sketch follows this subsection. Common augmentation methods include:

 Adding Noise: Injecting random noise into the audio signal to simulate
different recording environments.

 Time Shifting: Shifting the audio signal in time to create variations in the start
and end points of the speech.
 Time Stretching: Speeding up or slowing down the audio without altering the
pitch to simulate different speaking rates.

 Pitch Shifting: Changing the pitch of the audio to account for variations in
speaker pitch.

Augmentation increases the diversity of the training data, helping the model generalize better
to new, unseen data.
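As a hedged illustration of the preprocessing and augmentation steps described above, the sketch below uses librosa; the sampling rate, trim threshold, and augmentation parameters are illustrative choices rather than values taken from the project.

```python
import numpy as np
import librosa

def preprocess(path, sr=16000, top_db=30):
    """Load and resample, trim leading/trailing silence, and peak-normalize to roughly [-1, 1]."""
    y, sr = librosa.load(path, sr=sr)              # resample to a common rate
    y, _ = librosa.effects.trim(y, top_db=top_db)  # drop silent edges
    return y / (np.max(np.abs(y)) + 1e-9), sr

def add_noise(y, level=0.005):
    """Inject Gaussian noise to simulate different recording environments."""
    return y + level * np.random.randn(len(y))

def time_shift(y, max_frac=0.2):
    """Roll the signal in time to vary the start/end points of the speech."""
    shift = np.random.randint(-int(max_frac * len(y)), int(max_frac * len(y)) + 1)
    return np.roll(y, shift)

def time_stretch(y, rate=0.9):
    """Slow down (rate < 1) or speed up (rate > 1) without changing pitch."""
    return librosa.effects.time_stretch(y, rate=rate)

def pitch_shift(y, sr, n_steps=2):
    """Shift the pitch by n_steps semitones to mimic different speakers."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)

# Example: several augmented views of one (hypothetical) recording
# y, sr = preprocess("some_clip.wav")
# variants = [y, add_noise(y), time_shift(y), time_stretch(y), pitch_shift(y, sr)]
```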

Feature Extraction

After preprocessing the audio data, the next step is to extract features that capture the
emotional content of the speech. These features are then used to train machine learning
models. Common features include:

1. Zero Crossing Rate (ZCR) ZCR measures how frequently the audio signal crosses
the zero amplitude line. It is an indicator of the noisiness of the signal and can be
correlated with certain emotions. For example, excited or angry speech may have a
higher ZCR.

2. Root Mean Square Energy (RMSE) RMSE provides a measure of the signal's
energy, which corresponds to the loudness of the speech. Emotions like anger or
happiness might exhibit higher energy levels, while sadness or calmness might have
lower energy.

3. Mel-Frequency Cepstral Coefficients (MFCCs) MFCCs are widely used in speech processing and capture the power spectrum of the audio signal in a way that mimics human auditory perception. MFCCs represent the short-term power spectrum of a sound and are effective in capturing the timbral aspects of speech, which are crucial for emotion recognition.

4. Spectral Features These include features like spectral centroid, bandwidth, contrast,
and roll-off. They describe the shape of the audio spectrum and provide insights into
the distribution of energy across different frequencies, which can vary with different
emotional states.

5. Prosodic Features Prosody refers to the rhythm, stress, and intonation of speech.
Features like pitch (fundamental frequency), intensity (loudness), and duration can
provide valuable cues about the speaker's emotional state. For instance, anger might
be characterized by a higher and more variable pitch, whereas sadness might exhibit a
lower and more stable pitch.
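The features described above map fairly directly onto librosa calls. The following hedged sketch shows one way to build a fixed-length feature vector per clip; the frame parameters and pitch search range are illustrative assumptions, not the project's exact settings.

```python
import numpy as np
import librosa

def extract_features(y, sr, n_mfcc=20):
    """Return one fixed-length feature vector for a preprocessed clip."""
    zcr = librosa.feature.zero_crossing_rate(y)               # noisiness of the signal
    rmse = librosa.feature.rms(y=y)                           # energy / loudness
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # timbral envelope
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral shape
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)             # coarse pitch track (prosody)

    return np.hstack([
        zcr.mean(), rmse.mean(),
        mfcc.mean(axis=1),        # one mean value per MFCC coefficient
        centroid.mean(),
        f0.mean(), f0.std(),      # average pitch and pitch variability
    ])
```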

Visualizing Preprocessed Audio

Visualizations play an important role in understanding the characteristics of audio signals and
verifying the effectiveness of preprocessing steps. Common visualizations include:

1. Waveforms A waveform is a graphical representation of the audio signal's amplitude over time. It helps in visualizing the structure of the speech, identifying silent periods, and observing amplitude variations.

2. Spectrograms A spectrogram displays the frequency content of the audio signal over
time. It provides a visual representation of how the spectral characteristics of the
speech change, making it easier to identify patterns associated with different
emotions.

3. Mel-Spectrograms A mel-spectrogram is a spectrogram with a mel scale on the frequency axis, which aligns more closely with human auditory perception. It is particularly useful for visualizing the features used in MFCC extraction.
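A small sketch of waveform, spectrogram, and mel-spectrogram plotting with librosa and matplotlib follows; the helper names here are illustrative and are not the project's exact wave_plot/spectogram identifiers.

```python
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display   # waveshow/specshow require librosa >= 0.9

def plot_waveform(y, sr, title="Waveform"):
    plt.figure(figsize=(10, 3))
    librosa.display.waveshow(y, sr=sr)
    plt.title(title); plt.xlabel("Time (s)"); plt.ylabel("Amplitude")
    plt.tight_layout(); plt.show()

def plot_spectrogram(y, sr, title="Spectrogram (dB)"):
    D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
    plt.figure(figsize=(10, 3))
    librosa.display.specshow(D, sr=sr, x_axis="time", y_axis="hz")
    plt.colorbar(format="%+2.0f dB"); plt.title(title)
    plt.tight_layout(); plt.show()

def plot_mel_spectrogram(y, sr, title="Mel-spectrogram (dB)"):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    plt.figure(figsize=(10, 3))
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max),
                             sr=sr, x_axis="time", y_axis="mel")
    plt.colorbar(format="%+2.0f dB"); plt.title(title)
    plt.tight_layout(); plt.show()
```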

4.3 Random Forest Classifier

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model. As the
name suggests, "Random Forest is a classifier that contains a number of decision trees on
various subsets of the given dataset and takes the average to improve the predictive accuracy
of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output. A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.
Fig. 4.1: Random Forest algorithm.

4.3.1 Random Forest Algorithm

Step 1: In a Random Forest, n random records are taken from a data set having k records.

Step 2: Individual decision trees are constructed for each sample.

Step 3: Each decision tree will generate an output.

Step 4: Final output is considered based on Majority Voting or Averaging for Classification
and regression respectively.
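A hedged sketch of this majority-voting workflow with scikit-learn, including the joblib model caching mentioned in the overview, is shown below; the hyperparameters, split ratio, and file name are illustrative choices rather than the project's exact values.

```python
import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

def train_random_forest(X, y, model_path="rf_emotion.joblib"):
    """Train (or reload) a Random Forest and report test-set performance."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    if os.path.exists(model_path):
        rf = joblib.load(model_path)           # reuse a previously trained forest
    else:
        rf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=42)
        rf.fit(X_train, y_train)               # each tree sees a bootstrap sample
        joblib.dump(rf, model_path)

    y_pred = rf.predict(X_test)                # majority vote across the trees
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    return rf
```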

4.3.2 Important Features of Random Forest

 Diversity- Not all attributes/variables/features are considered while making an individual tree; each tree is different.

 Immune to the curse of dimensionality- Since each tree does not consider all the
features, the feature space is reduced.

 Parallelization- Each tree is created independently out of different data and attributes. This means that we can make full use of the CPU to build random forests.

 Train-Test split- In a random forest we don’t have to segregate the data for train and test, as there will always be a portion of the data (roughly one-third, the out-of-bag samples) that is not seen by a given decision tree.
 Stability- Stability arises because the result is based on majority voting/ averaging.

4.3.3 Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is
possible that some decision trees may predict the correct output, while others may not. But
together, all the trees predict the correct output. Therefore, below are two assumptions for a
better Random Forest classifier:

 There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.

 The predictions from each tree must have very low correlations.

Below are some points that explain why we should use the Random Forest algorithm

 It takes less training time as compared to other algorithms.

 It predicts output with high accuracy, even for the large dataset it runs efficiently.

 It can also maintain accuracy when a large proportion of data is missing.

4.3.4 Types of Ensembles

Before understanding the working of the random forest, we must look into the ensemble
technique. Ensemble simply means combining multiple models. Thus, a collection of models
is used to make predictions rather than an individual model. Ensemble uses two types of
methods:

Bagging– It creates a different training subset from sample training data with replacement &
the final output is based on majority voting. For example, Random Forest. Bagging, also
known as Bootstrap Aggregation is the ensemble technique used by random forest. Bagging
chooses a random sample from the data set. Hence each model is generated from the samples
(Bootstrap Samples) provided by the Original Data with replacement known as row
sampling. This step of row sampling with replacement is called bootstrap. Now each model is
trained independently which generates results. The final output is based on majority voting
after combining the results of all models. This step which involves combining all the results
and generating output based on majority voting is known as aggregation.
Fig. 4.2: RF Classifier analysis.
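For comparison, the bagging procedure described above can be written out explicitly with scikit-learn's BaggingClassifier wrapping decision trees. This is an illustrative sketch of the concept, not part of the proposed system; note that the estimator keyword is the scikit-learn 1.2+ name (older releases call it base_estimator).

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Each base tree is fitted on a bootstrap (row-sampled, with replacement) subset of the
# training data; predictions are then combined by majority vote (aggregation).
bagged_trees = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # 'base_estimator' in scikit-learn < 1.2
    n_estimators=100,
    bootstrap=True,                      # row sampling with replacement
    random_state=42,
)
# bagged_trees.fit(X_train, y_train)
# y_pred = bagged_trees.predict(X_test)
```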

Boosting– It combines weak learners into strong learners by creating sequential models such
that the final model has the highest accuracy. For example, ADA BOOST, XG BOOST.

Fig. 4.3: Boosting RF Classifier.

Disadvantages:

 Lack of Interpretability: Random Forest models are often considered "black boxes" because it's challenging to interpret the individual trees in the forest, especially when the forest becomes large and complex.
 Memory and Computation: Random Forest models can become memory-intensive
and computationally expensive, especially when dealing with a large number of trees
or features. This can be a limitation for real-time prediction tasks or when working
with large datasets.

 Overfitting: Although Random Forests are less prone to overfitting compared to individual decision trees, they can still overfit noisy data if not properly tuned. Tuning parameters such as the number of trees and the maximum depth of trees is essential to mitigate this risk.

 Not Suitable for Imbalanced Data: Random Forest may not perform well on highly
imbalanced datasets where one class is significantly more frequent than the others. It
tends to favor the majority class, leading to biased predictions.

 Training Time: Training a Random Forest model can take longer compared to
simpler algorithms like linear regression or decision trees, especially when dealing
with large datasets or a high number of trees.

 Limited Extrapolation Ability: Random Forest models may struggle with making
predictions outside the range of the training data. They may not generalize well to
unseen data points that are significantly different from those in the training set.

4.4 XGBoost Model

XGBoost is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both Classification and Regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. Unlike Random Forest, which builds its decision trees independently and combines them by voting or averaging, XGBoost builds its decision trees sequentially: each new tree is fitted to the residual errors of the trees built so far, and the final prediction is the weighted sum of the outputs of all the trees. Adding more trees with an appropriate learning rate and regularization generally improves accuracy while keeping overfitting under control.
Fig. 4.4: XGBoost algorithm.

XGBoost, which stands for "Extreme Gradient Boosting," is a popular and powerful machine
learning algorithm used for both classification and regression tasks. It is known for its high
predictive accuracy and efficiency, and it has won numerous data science competitions and is
widely used in industry and academia. Here are some key characteristics and concepts related
to the XGBoost algorithm:

 Gradient Boosting: XGBoost is an ensemble learning method based on the gradient boosting framework. It builds a predictive model by combining the predictions of multiple weak learners (typically decision trees) into a single, stronger model.

 Tree-based Models: Decision trees are the weak learners used in XGBoost. These are shallow trees, often referred to as "stumps," which helps prevent overfitting.

 Objective Function: XGBoost uses a specific objective function that is optimized during training. The objective function consists of two parts: a loss function that quantifies the error between predicted and actual values, and a regularization term to control model complexity and prevent overfitting. The most common loss functions are Mean Squared Error for regression and Log Loss for classification.

 Gradient Descent Optimization: XGBoost optimizes the objective function using gradient descent. It calculates the gradients of the objective function with respect to the model's predictions and updates the model iteratively to minimize the loss.

 Regularization: XGBoost provides several regularization techniques, such as L1 (Lasso) and L2 (Ridge) regularization, to control overfitting. These regularization terms are added to the objective function.

 Parallel and Distributed Computing: XGBoost is designed to be highly efficient. It can take advantage of parallel processing and distributed computing to train models quickly, making it suitable for large datasets.

 Handling Missing Data: XGBoost has built-in capabilities to handle missing data without requiring imputation. It does this by finding the optimal split for missing values during tree construction.

 Feature Importance: XGBoost provides a way to measure the importance of each feature in the model. This can help in feature selection and in understanding which features contribute the most to the predictions.

 Early Stopping: To prevent overfitting, XGBoost supports early stopping, which allows training to stop when the model's performance on a validation dataset starts to degrade.

 Versatility: XGBoost can be applied to a wide range of machine learning tasks, including classification, regression, ranking, and more.

 Python and R Libraries: XGBoost is available through libraries in Python (e.g., xgboost) and R (e.g., xgboost), making it accessible and easy to use for data scientists and machine learning practitioners.
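A minimal sketch of training and caching the XGBoost classifier on the extracted features is given below; parameter values are illustrative, labels are assumed to be integer-encoded (e.g. via LabelEncoder), and the placement of evaluation and early-stopping options varies somewhat across xgboost versions.

```python
import os
import joblib
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

def train_xgboost(X_train, y_train, X_test, y_test, model_path="xgb_emotion.joblib"):
    """Train (or reload) an XGBoost classifier; labels are assumed integer-encoded."""
    if os.path.exists(model_path):
        model = joblib.load(model_path)
    else:
        model = XGBClassifier(
            n_estimators=400,
            learning_rate=0.05,
            max_depth=6,
            subsample=0.8,
            colsample_bytree=0.8,
            objective="multi:softprob",   # multi-class emotion labels
            eval_metric="mlogloss",       # accepted in the constructor for xgboost >= 1.6
        )
        # eval_set is used only for monitoring; a separate validation split is cleaner.
        model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
        joblib.dump(model, model_path)

    y_pred = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return model
```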

4.5 Advantages

The proposed speech emotion recognition system, which combines systematic audio preprocessing, data augmentation, and ensemble classifiers (Random Forest and XGBoost), offers several distinct advantages:

 Reduced Feature Engineering: Features such as the zero-crossing rate, root mean square energy, and MFCCs are extracted through an automated pipeline, so far less manual, domain-specific feature engineering is required than in traditional handcrafted approaches.

 Robustness to Variability: Augmentation techniques (noise injection, time shifting, time stretching, and pitch shifting) expose the models to a wider range of recording conditions and speaking styles, improving generalization across speakers and environments.

 Accurate Emotion Classification: Ensemble classifiers such as Random Forest and XGBoost combine many decision trees, which yields high classification accuracy on the combined TESS and CREMA-D data while limiting overfitting.

 Scalability and Efficiency: The feature-based pipeline is computationally lightweight, and trained models are saved with joblib so they can be reloaded and reused without retraining as the system grows.

 Real-World Applicability: Accurate emotion recognition supports practical applications in human-computer interaction, customer service, education, and mental-health assessment, as outlined in the introduction.

 Extensibility: The framework can later be extended with additional datasets or with multimodal inputs such as facial expressions or text, as proposed in the abstract, to further improve accuracy and robustness.
CHAPTER 5

UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose
modeling language in the field of object-oriented software engineering. The standard is
managed, and was created by, the Object Management Group. The goal is for UML to
become a common language for creating models of object-oriented computer software. In its
current form, UML comprises two major components: a meta-model and a notation. In the future, some form of method or process may also be added to, or associated with, UML.

The Unified Modeling Language is a standard language for specifying, visualizing, constructing, and documenting the artifacts of a software system, as well as for business modeling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modeling of large and complex systems. The UML is a very important part of developing object-oriented software and the software development process. The UML uses mostly graphical notations to express the design of software projects.

GOALS: The Primary goals in the design of the UML are as follows:

 Provide users a ready-to-use, expressive visual modeling Language so that they can
develop and exchange meaningful models.

 Provide extendibility and specialization mechanisms to extend the core concepts.

 Be independent of particular programming languages and development process.

 Provide a formal basis for understanding the modeling language.

 Encourage the growth of OO tools market.

 Support higher-level development concepts such as collaborations, frameworks, patterns and components.

 Integrate best practices.

Class Diagram

The class diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an “is-a”
or “has-a” relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed “methods” of the class.
Apart from this, each class may have certain “attributes” that uniquely identify the class.

Use case diagram

A use case diagram in the Unified Modeling Language (UML) is a type of behavioral
diagram defined by and created from a Use-case analysis. Its purpose is to present a graphical
overview of the functionality provided by a system in terms of actors, their goals (represented
as use cases), and any dependencies between those use cases. The main purpose of a use case
diagram is to show what system functions are performed for which actor. Roles of the actors
in the system can be depicted.
Data Flow Diagram

A data flow diagram (DFD) is a graphical or visual representation using a standardized set of
symbols and notations to describe a business’s operations through data movement.

Sequence Diagram

A sequence diagram in Unified Modeling Language (UML) is a kind of interaction diagram that shows how processes operate with one another and in what order. It is a construct of a
Message Sequence Chart. A sequence diagram shows, as parallel vertical lines (“lifelines”),
different processes or objects that live simultaneously, and as horizontal arrows, the messages
exchanged between them, in the order in which they occur. This allows the specification of
simple runtime scenarios in a graphical manner.

Activity diagram: Activity diagram is another important diagram in UML to describe the
dynamic aspects of the system.
Deployment diagram

A deployment diagram in the Unified Modeling Language models the physical deployment of
artifacts on nodes. To describe a web site, for example, a deployment diagram would show
what hardware components (“nodes”) exist (e.g., a web server, an application server, and a
database server), what software components (“artifacts”) run on each node (e.g., web
application, database), and how the different pieces are connected (e.g., JDBC, REST, RMI).
The nodes appear as boxes, and the artifacts allocated to each node appear as rectangles
within the boxes. Nodes may have sub nodes, which appear as nested boxes. A single node in
a deployment diagram may conceptually represent multiple physical nodes, such as a cluster
of database servers.

Component diagram: Component diagram describes the organization and wiring of the
physical components in a system.
CHAPTER 6

SOFTWARE ENVIRONMENT
What is Python?

Below are some facts about Python.

 Python is currently the most widely used multi-purpose, high-level programming language.

 Python allows programming in Object-Oriented and Procedural paradigms. Python programs are generally smaller than those written in other programming languages like Java.

 Programmers have to type relatively less, and the indentation requirements of the language make programs readable all the time.

 Python language is being used by almost all tech-giant companies like – Google,
Amazon, Facebook, Instagram, Dropbox, Uber… etc.

The biggest strength of Python is its huge collection of standard libraries, which can be used for the following –

 Machine Learning

 GUI Applications (like Kivy, Tkinter, PyQt etc. )

 Web frameworks like Django (used by YouTube, Instagram, Dropbox)

 Image processing (like Opencv, Pillow)

 Web scraping (like Scrapy, BeautifulSoup, Selenium)

 Test frameworks

 Multimedia

Advantages of Python

Let’s see how Python dominates over other languages.

1. Extensive Libraries

Python ships with an extensive standard library that contains code for various purposes like regular expressions, documentation generation, unit testing, web browsers, threading, databases, CGI, email, image manipulation, and more. So, we don’t have to write the complete code for that manually.

2. Extensible

As we have seen earlier, Python can be extended with other languages. You can write some of your code in languages like C++ or C. This comes in handy, especially in projects.

3. Embeddable

Complimentary to extensibility, Python is embeddable as well. You can put your Python code
in your source code of a different language, like C++. This lets us add scripting capabilities to
our code in the other language.

4. Improved Productivity

The language’s simplicity and extensive libraries render programmers more productive than languages like Java and C++ do. Also, you need to write less code to get more things done.

5. IOT Opportunities

Since Python forms the basis of new platforms like the Raspberry Pi, its future in the Internet of Things (IoT) looks bright. This is a way to connect the language with the real world.

6. Simple and Easy

When working with Java, you may have to create a class to print ‘Hello World’. But in
Python, just a print statement will do. It is also quite easy to learn, understand, and code. This
is why when people pick up Python, they have a hard time adjusting to other more verbose
languages like Java.

7. Readable

Because it is not such a verbose language, reading Python is much like reading English. This
is the reason why it is so easy to learn, understand, and code. It also does not need curly
braces to define blocks, and indentation is mandatory. This further aids the readability of the
code.

8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms.
While functions help us with code reusability, classes and objects let us model the real world.
A class allows the encapsulation of data and functions into one.

9. Free and Open-Source

Like we said earlier, Python is freely available. But not only can you download Python for
free, but you can also download its source code, make changes to it, and even distribute it. It
downloads with an extensive collection of libraries to help you with your tasks.

10. Portable

When you code your project in a language like C++, you may need to make some changes to
it if you want to run it on another platform. But it isn’t the same with Python. Here, you need
to code only once, and you can run it anywhere. This is called Write Once Run Anywhere
(WORA). However, you need to be careful enough not to include any system-dependent
features.

11. Interpreted

Lastly, we will say that it is an interpreted language. Since statements are executed one by
one, debugging is easier than in compiled languages.


Advantages of Python Over Other Languages

1. Less Coding

Almost all of the tasks done in Python require less coding than when the same task is done in other languages. Python also has awesome standard library support, so you don’t have to
search for any third-party libraries to get your job done. This is the reason that many people
suggest learning Python to beginners.

2. Affordable

Python is free, therefore individuals, small companies, or big organizations can leverage the freely available resources to build applications. Python is popular and widely used, so it gives you better community support.
The 2019 Github annual survey showed us that Python has overtaken Java in the most
popular programming language category.

3. Python is for Everyone

Python code can run on any machine whether it is Linux, Mac or Windows. Programmers
need to learn different languages for different jobs but with Python, you can professionally
build web apps, perform data analysis and machine learning, automate things, do web
scraping and also build games and powerful visualizations. It is an all-rounder programming
language.

Disadvantages of Python

So far, we’ve seen why Python is a great choice for your project. But if you choose it, you
should be aware of its consequences as well. Let’s now see the downsides of choosing Python
over another language.

1. Speed Limitations

We have seen that Python code is executed line by line. But since Python is interpreted, it
often results in slow execution. This, however, isn’t a problem unless speed is a focal point
for the project. In other words, unless high speed is a requirement, the benefits offered by Python are enough to outweigh its speed limitations.

2. Weak in Mobile Computing and Browsers

While it serves as an excellent server-side language, Python is rarely seen on the client side. Besides that, it is rarely ever used to implement smartphone-based applications. One
such application is called Carbonnelle.

The reason it is not so famous despite the existence of Brython is that it isn’t that secure.

3. Design Restrictions

As you know, Python is dynamically typed. This means that you don’t need to declare the
type of variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it
just means that if it looks like a duck, it must be a duck. While this is easy on the
programmers during coding, it can raise run-time errors.

4. Underdeveloped Database Access Layers


Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and
ODBC (Open DataBase Connectivity), Python’s database access layers are a bit
underdeveloped. Consequently, it is less often applied in huge enterprises.

5. Simple

No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I
don’t do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity
of Java code seems unnecessary.

This was all about the Advantages and Disadvantages of Python Programming Language.

Modules Used in Project

NumPy

NumPy is a general-purpose array-processing package. It provides a high-performance multidimensional array object, and tools for working with these arrays.

It is the fundamental package for scientific computing with Python. It contains various
features including these important ones:

 A powerful N-dimensional array object

 Sophisticated (broadcasting) functions

 Tools for integrating C/C++ and Fortran code

 Useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary datatypes can be defined using NumPy which allows
NumPy to seamlessly and speedily integrate with a wide variety of databases.

Pandas

Pandas is an open-source Python library providing high-performance data manipulation and analysis tools using its powerful data structures. Python was majorly used for data munging and preparation; it had very little contribution towards data analysis. Pandas solved this problem. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of the origin of the data: load, prepare, manipulate, model, and analyze. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics.

Matplotlib

Matplotlib is a Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers, and four graphical user interface toolkits. Matplotlib tries to make easy things easy
and hard things possible. You can generate plots, histograms, power spectra, bar charts, error
charts, scatter plots, etc., with just a few lines of code. For examples, see the sample plots and
thumbnail gallery.

For simple plotting, the pyplot module provides a MATLAB-like interface, particularly when combined with IPython. For the power user, you have full control of line styles, font properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar to MATLAB users.
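As a small, self-contained illustration of the pyplot interface described above (the signal below is synthetic, not taken from the project's datasets):

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0, 1, 16000)                             # one second at 16 kHz
y = 0.5 * np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t)   # decaying 220 Hz tone

plt.plot(t, y)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Synthetic decaying tone")
plt.show()
```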

Scikit – learn

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. It is licensed under a permissive simplified BSD license and is distributed under many Linux distributions, encouraging academic and commercial use.
Python

Install Python Step-by-Step in Windows and Mac

Python, a versatile programming language, doesn’t come pre-installed on your computer. Python was first released in the year 1991 and remains a very popular high-level programming language today. Its design philosophy emphasizes code readability with its notable use of significant whitespace.

The object-oriented approach and language constructs provided by Python enable programmers to write both clear and logical code for projects. This software does not come pre-packaged with Windows.

How to Install Python on Windows and Mac

There have been several updates in the Python version over the years. The question is how to
install Python? It might be confusing for the beginner who is willing to start learning Python
but this tutorial will solve your query. At the time this report was written, the latest version of Python was 3.7.4; in other words, Python 3.

Note: The python version 3.7.4 cannot be used on Windows XP or earlier devices.

Before you start with the installation process of Python. First, you need to know about your
System Requirements. Based on your system type i.e. operating system and based processor,
you must download the python version. My system type is a Windows 64-bit operating
system. So the steps below are to install python version 3.7.4 on Windows 7 device or to
install Python 3. The steps on how to install Python on Windows 10, 8 and 7 are divided into 4 parts to help you understand better.

Download the Correct version into the system

Step 1: Go to the official site to download and install python using Google Chrome or any
other web browser, or click on the following link: https://www.python.org

Now, check for the latest and the correct version for your operating system.

Step 2: Click on the Download Tab.


Step 3: You can either select the Download Python for windows 3.7.4 button in Yellow Color
or you can scroll further down and click on download with respective to their version. Here,
we are downloading the most recent python version for windows 3.7.4

Step 4: Scroll down the page until you find the Files option.

Step 5: Here you see the different versions of Python along with the operating system.
 To download Windows 32-bit python, you can select any one from the three options:
Windows x86 embeddable zip file, Windows x86 executable installer or Windows x86
web-based installer.

 To download Windows 64-bit python, you can select any one from the three options:
Windows x86-64 embeddable zip file, Windows x86-64 executable installer or
Windows x86-64 web-based installer.

Here we will install the Windows x86-64 web-based installer. With this, the first part, regarding which version of Python is to be downloaded, is completed. Now we move ahead with the second part of installing Python, i.e., the installation itself.

Note: To know the changes or updates that are made in the version you can click on the
Release Note Option.

Installation of Python

Step 1: Go to Download and Open the downloaded python version to carry out the
installation process.
Step 2: Before you click on Install Now, Make sure to put a tick on Add Python 3.7 to PATH.

Step 3: Click on Install Now. After the installation is successful, click on Close.
With these above three steps on python installation, you have successfully and correctly
installed Python. Now is the time to verify the installation.

Note: The installation process might take a couple of minutes.

Verify the Python Installation

Step 1: Click on Start

Step 2: In the Windows Run Command, type “cmd”.

Step 3: Open the Command prompt option.

Step 4: Let us test whether Python is correctly installed. Type python -V and press Enter.

Step 5: You will get the answer as 3.7.4


Note: If you have any earlier version of Python already installed, you must first uninstall the earlier version and then install the new one.

Check how the Python IDLE works

Step 1: Click on Start

Step 2: In the Windows Run command, type “python idle”.

Step 3: Click on IDLE (Python 3.7 64-bit) and launch the program

Step 4: To go ahead with working in IDLE you must first save the file. Click on File > Click
on Save

Step 5: Name the file, and the 'Save as type' should be Python files. Click on SAVE. Here I have named the file Hey World.

Step 6: Now, for example, enter print("Hey World") and press Enter.
You will see that the command is executed. With this, we end our tutorial on how to install Python. You have learned how to download and install Python for Windows on your operating system.

Note: Unlike Java, Python does not need semicolons at the end of statements.
CHAPTER 7

SYSTEM REQUIREMENTS
Software Requirements

The functional requirements or the overall description documents include the product
perspective and features, operating system and operating environment, graphics requirements,
design constraints and user documentation.

The enumeration of requirements and implementation constraints gives a general overview of the project with regard to its areas of strength and deficiency and how to address them.

 Python IDLE 3.7 version (or)

 Anaconda 3.7 (or)

 Jupyter Notebook (or)

 Google Colab

Hardware Requirements

Minimum hardware requirements are very dependent on the particular software being
developed by a given Enthought Python / Canopy / VS Code user. Applications that need to
store large arrays/objects in memory will require more RAM, whereas applications that need
to perform numerous calculations or tasks more quickly will require a faster processor.

 Operating system : Windows, Linux

 Processor : minimum Intel i3

 RAM : minimum 4 GB

 Hard disk : minimum 250GB


CHAPTER 8

FUNCTIONAL REQUIREMENTS

Output Design

Outputs from computer systems are required primarily to communicate the results of processing to users. They are also used to provide a permanent copy of the results for later consultation. The various types of outputs in general are:

 External outputs, whose destination is outside the organization.

 Internal outputs, whose destination is within the organization; they are the user's main interface with the computer.

 Operational outputs, whose use is purely within the computer department.

 Interface outputs, which involve the user in communicating directly.

Output Definition

The outputs should be defined in terms of the following points:

 Type of the output

 Content of the output

 Format of the output

 Location of the output

 Frequency of the output

 Volume of the output

 Sequence of the output

It is not always desirable to print or display data exactly as it is held on a computer. It should be decided which form of output is the most suitable.

Input Design
Input design is a part of overall system design. The main objective during the input design is
as given below:

 To produce a cost-effective method of input.

 To achieve the highest possible level of accuracy.

 To ensure that the input is acceptable and understood by the user.

Input Stages

The main input stages can be listed as below:

 Data recording

 Data transcription

 Data conversion

 Data verification

 Data control

 Data transmission

 Data validation

 Data correction

Input Types

It is necessary to determine the various types of inputs. Inputs can be categorized as follows:

 External inputs, which are prime inputs for the system.

 Internal inputs, which are user communications with the system.

 Operational, which are the computer department's communications to the system.

 Interactive, which are inputs entered during a dialogue.

Input Media

At this stage choice has to be made about the input media. To conclude about the input media
consideration has to be given to;

 Type of input
 Flexibility of format

 Speed

 Accuracy

 Verification methods

 Rejection rates

 Ease of correction

 Storage and handling requirements

 Security

 Easy to use

 Portability

Keeping in view the above description of input types and input media, it can be said that most of the inputs are internal and interactive. As the input data is keyed in directly by the user, the keyboard can be considered the most suitable input device.

Error Avoidance

At this stage care is to be taken to ensure that the input data remains accurate from the stage at which it is recorded up to the stage at which it is accepted by the system. This can be achieved only by means of careful control each time the data is handled.

Error Detection

Even though every effort is made to avoid the occurrence of errors, a small proportion of errors is still likely to occur. These errors can be discovered by using validations to check the input data.

Data Validation

Procedures are designed to detect errors in data at a lower level of detail. Data validations
have been included in the system in almost every area where there is a possibility for the user
to commit errors. The system will not accept invalid data. Whenever an invalid data is keyed
in, the system immediately prompts the user and the user has to again key in the data and the
system will accept the data only if the data is correct. Validations have been included where
necessary.

The system is designed to be user friendly; in other words, it has been designed to communicate effectively with the user. The system has been designed with pop-up menus.

User Interface Design

It is essential to consult the system users and discuss their needs while designing the user
interface:

User Interface Systems Can Be Broadly Classified As:

 User-initiated interfaces, in which the user is in charge, controlling the progress of the user/computer dialogue.

 Computer-initiated interfaces, in which the computer selects the next stage in the interaction and guides the progress of the user/computer dialogue: information is displayed and, based on the user's response, the computer takes action or displays further information.

User Initiated Interfaces

User initiated interfaces fall into two approximate classes:

 Command driven interfaces: In this type of interface the user inputs commands or
queries which are interpreted by the computer.

 Forms-oriented interface: The user calls up an image of the form on his/her screen and fills it in. The forms-oriented interface is chosen here because it best suits this application.

Computer-Initiated Interfaces

The following computer-initiated interfaces were used:

 The menu system, in which the user is presented with a list of alternatives and chooses one of them.

 The question-answer type dialogue system, in which the computer asks a question and takes action based on the user's reply.

Right from the start the system is menu driven: the opening menu displays the available options, and choosing one option brings up another pop-up menu with more options. In this way every option leads the user to a data entry form where the data can be keyed in.

Error Message Design

The design of error messages is an important part of user interface design. As the user is bound to commit errors while operating the system, the system should be designed to be helpful by providing information regarding the error that has been committed.

This application must be able to produce output at different modules for different inputs.

Performance Requirements

Performance is measured in terms of the output provided by the application. Requirement specification plays an important part in the analysis of a system: only when the requirement specifications are properly given is it possible to design a system that fits the required environment. It rests largely with the users of the existing system to provide the requirement specifications, because they are the people who will finally use the system, and the requirements have to be known during the initial stages so that the system can be designed accordingly. It is very difficult to change a system once it has been designed, and on the other hand a system that does not cater to the requirements of the user is of no use.

The requirement specification for any system can be broadly stated as given below:

 The system should be able to interface with the existing system

 The system should be accurate

 The system should be better than the existing system

 The existing system is completely dependent on the user to perform all the duties.

CHAPTER 9

SOURCE CODE
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import librosa
import librosa.display  # needed for the waveshow and specshow plotting calls below
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import IPython
from sklearn.metrics import (precision_score, accuracy_score, f1_score, recall_score,
                             classification_report, confusion_matrix)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import joblib

# Paths to dataset folders
tess_path = r"datasets/Tess"
crema_path = r"datasets/Crema"

# Function to load and preprocess the TESS dataset
def load_tess():
    tess = []
    for folder in os.listdir(tess_path):
        for wav in os.listdir(os.path.join(tess_path, folder)):
            # TESS file names end with the emotion label, e.g. OAF_back_angry.wav
            emotion = wav.partition('.wav')[0].split('_')
            if emotion[2] == 'ps':  # 'ps' stands for pleasant surprise
                tess.append(('surprise', os.path.join(tess_path, folder, wav)))
            else:
                tess.append((emotion[2], os.path.join(tess_path, folder, wav)))
    tess_df = pd.DataFrame.from_dict(tess)
    tess_df = tess_df.rename(columns={0: 'Emotion', 1: 'File_Path'})
    return tess_df

# Function to load and preprocess the CREMA-D dataset
def load_crema():
    crema = []
    for wav in os.listdir(crema_path):
        emotion = wav.partition(".wav")[0].split('_')
        if emotion[2] == 'SAD':
            crema.append(('sad', os.path.join(crema_path, wav)))
        elif emotion[2] == 'DIS':
            crema.append(('disgust', os.path.join(crema_path, wav)))
        elif emotion[2] == 'FEA':
            crema.append(('fear', os.path.join(crema_path, wav)))
        elif emotion[2] == 'HAP':
            crema.append(('happy', os.path.join(crema_path, wav)))
        elif emotion[2] == 'NEU':
            crema.append(('neutral', os.path.join(crema_path, wav)))
        elif emotion[2] == 'ANG':
            crema.append(('angry', os.path.join(crema_path, wav)))
        else:
            crema.append(('unknown', os.path.join(crema_path, wav)))
    crema_df = pd.DataFrame(crema)
    crema_df = crema_df.rename(columns={0: 'Emotion', 1: 'File_Path'})
    return crema_df

# Concatenate the TESS and CREMA-D datasets
def concat_datasets(tess_df, crema_df):
    df = pd.concat([tess_df, crema_df], axis=0)
    return df

# Function to plot a waveplot
def wave_plot(data, sr, emotion, color):
    plt.figure(figsize=(12, 5))
    plt.title(f'{emotion} emotion for waveplot', size=17)
    librosa.display.waveshow(y=data, sr=sr, color=color)

# Function to plot a spectrogram
def spectogram(data, sr, emotion):
    audio = librosa.stft(data)
    audio_db = librosa.amplitude_to_db(abs(audio))
    plt.figure(figsize=(12, 5))
    plt.title(f'{emotion} emotion for spectogram', size=17)
    librosa.display.specshow(audio_db, sr=sr, x_axis='time', y_axis='hz')

# Function to extract features (ZCR, RMS energy and MFCCs) from audio data
def extract_features(data, sr, frame_length=2048, hop_length=512):
    result = np.array([])
    # zero_crossing_rate and rms return 2-D arrays, so they are flattened with
    # np.squeeze before being stacked into the 1-D feature vector.
    result = np.hstack((result, np.squeeze(librosa.feature.zero_crossing_rate(
        data, frame_length=frame_length, hop_length=hop_length))))
    result = np.hstack((result, np.squeeze(librosa.feature.rms(
        y=data, frame_length=frame_length, hop_length=hop_length))))
    result = np.hstack((result, np.ravel(librosa.feature.mfcc(
        y=data, sr=sr, n_fft=frame_length, hop_length=hop_length).T)))
    return result

# Function to get features from audio files
def get_features(path, duration=2.5, offset=0.6):
    data, sr = librosa.load(path, duration=duration, offset=offset)
    aud = extract_features(data, sr)
    audio = np.array(aud)

    noised_audio = add_noise(data, random=True)
    aud2 = extract_features(noised_audio, sr)
    audio = np.vstack((audio, aud2))

    pitched_audio = pitching(data, sr, random=True)
    aud3 = extract_features(pitched_audio, sr)
    audio = np.vstack((audio, aud3))

    pitched_audio1 = pitching(data, sr, random=True)
    pitched_noised_audio = add_noise(pitched_audio1, random=True)
    aud4 = extract_features(pitched_noised_audio, sr)
    audio = np.vstack((audio, aud4))

    return audio
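The augmentation helpers add_noise and pitching called by get_features are not included in the listing above. A minimal sketch of how they could be implemented with NumPy and librosa is shown below; the default noise rate and pitch-factor ranges are illustrative assumptions, not values taken from the original code.

# Hypothetical implementations of the augmentation helpers used by get_features.
def add_noise(data, random=False, rate=0.035, threshold=0.075):
    # Add white noise scaled to the maximum amplitude of the signal.
    if random:
        rate = np.random.random() * threshold
    noise_amp = rate * np.random.uniform() * np.amax(data)
    return data + noise_amp * np.random.normal(size=data.shape[0])

def pitching(data, sr, pitch_factor=0.7, random=False):
    # Shift the pitch by a (possibly random) number of semitones.
    if random:
        pitch_factor = np.random.random() * pitch_factor
    return librosa.effects.pitch_shift(data, sr=sr, n_steps=pitch_factor)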

# Load TESS dataset
tess_df = load_tess()

# Load CREMA-D dataset
crema_df = load_crema()

# Concatenate datasets
df = concat_datasets(tess_df, crema_df)

# Extract features from audio files
X, Y = [], []
for path, emotion, index in zip(df.File_Path, df.Emotion, range(df.File_Path.shape[0])):
    features = get_features(path)
    if index % 500 == 0:
        print(f'{index} audio has been processed')
    for i in features:
        X.append(i)
        Y.append(emotion)
print('Done')

# Preprocess the dataset
dataset = pd.DataFrame(X)
dataset['Emotion'] = Y
dataset.to_csv('processed_data.csv', index=False)

dataset = pd.read_csv('processed_data.csv')
dataset = dataset.fillna(0)

X = dataset.drop(['Emotion'], axis=1)
y = dataset['Emotion']

le = LabelEncoder()
y = le.fit_transform(y)

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)

# Global lists that accumulate the metrics of each evaluated model, and the
# class names (from the label encoder) used in the classification report.
accuracy, precision, recall, fscore = [], [], [], []
labels = le.classes_

# Function to calculate performance metrics
def performance_metrics(algorithm, predict, testY):
    testY = testY.astype('int')
    predict = predict.astype('int')
    p = precision_score(testY, predict, average='macro') * 100
    r = recall_score(testY, predict, average='macro') * 100
    f = f1_score(testY, predict, average='macro') * 100
    a = accuracy_score(testY, predict) * 100
    accuracy.append(a)
    precision.append(p)
    recall.append(r)
    fscore.append(f)
    print(algorithm + ' Accuracy  : ' + str(a))
    print(algorithm + ' Precision : ' + str(p))
    print(algorithm + ' Recall    : ' + str(r))
    print(algorithm + ' FSCORE    : ' + str(f))
    # Ground-truth labels first, predictions second
    report = classification_report(testY, predict, target_names=labels)
    print('\n', algorithm + " classification report\n", report)
CHAPTER 10

RESULTS AND DESCRIPTION

10.1 Implementation Description


The speech emotion recognition system encompasses several key steps, including data
preprocessing, feature extraction, model training, and evaluation. Each step plays a crucial
role in building a robust and accurate model capable of recognizing emotions from speech
signals. Below is an in-depth description of each component of the implementation:

 Data Preprocessing: The implementation process involves loading and preprocessing the audio data. The datasets used for training and testing the model are TESS (Toronto Emotional Speech Set) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset). These datasets
contain audio recordings of various emotional expressions spoken by different
individuals. During preprocessing, the audio files are loaded, and the emotion labels
are extracted from the filenames. This information is organized into a structured
format, typically a dataframe, for further processing.
 Feature Extraction: Feature extraction is a critical component of the speech emotion
recognition system, as it involves extracting relevant information from the audio
signals that can be used by the machine learning models to classify emotions
accurately. Several acoustic features are extracted from the audio signals, including:
 Zero-crossing rate: The rate at which the signal changes its sign, often
associated with speech dynamics.
 Root mean square (RMS) energy: A measure of the signal's energy, reflecting
the overall amplitude of the speech.
 Mel-frequency cepstral coefficients (MFCCs): representations of the short-term power spectrum of the audio signal, capturing spectral characteristics.
These features provide valuable insights into the underlying characteristics of the speech signals and are crucial for differentiating between the various emotional states.
 Data Augmentation: To enhance the robustness and generalization capability of the
model, data augmentation techniques are applied to the audio data. These techniques
involve introducing variations into the training data to simulate real-world scenarios
and increase the diversity of the dataset. Common data augmentation techniques
include:

 Adding noise: Injecting random noise into the audio signals to mimic
environmental conditions or recording artifacts.
 Shifting: Temporally shifting the audio signals to simulate variations in speech
tempo or speaking rate.

 Stretching: Temporally stretching or compressing the audio signals to simulate


variations in speech duration.

 Pitch shifting: Modifying the pitch of the audio signals to simulate variations in vocal characteristics.
By augmenting the training data, the model becomes more resilient to noise and variations in speech patterns, leading to improved performance on unseen data (a brief sketch of the shifting and stretching transforms is given at the end of this list).

 Model Training: Once the feature extraction and data augmentation steps are
completed, the next stage involves training machine learning models to classify
emotions from the extracted features. Two popular classification algorithms used for
speech emotion recognition are Random Forest and XGBoost. These algorithms are
trained on the preprocessed and augmented dataset, with the emotion labels as the
target variable. During training, the models learn to map the extracted features to the
corresponding emotion labels, thereby enabling them to make accurate predictions on
unseen data.
 Model Evaluation: After training the machine learning models, they are evaluated on
a separate test dataset to assess their performance and generalization ability.
Performance metrics such as accuracy, precision, recall, and F1-score are calculated to
quantify the models' effectiveness in classifying emotions. Additionally, a
classification report and confusion matrix are generated to provide detailed insights
into the models' performance across different emotion classes. By evaluating the
models on unseen data, we can ensure that they generalize well and can effectively
recognize emotions in real-world scenarios.
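As noted in the Data Augmentation item above, the time-shifting and time-stretching transforms can be implemented in a few lines with NumPy and librosa. The sketch below is illustrative only: the shift range and stretch rates are assumptions, and these helpers complement the noise-injection and pitch-shifting helpers sketched in Chapter 9.

# Hypothetical shifting and stretching augmentations (parameter ranges are assumptions).
import numpy as np
import librosa

def shift_audio(data, sr, max_shift_sec=0.2):
    # Randomly shift the waveform in time by up to max_shift_sec seconds.
    shift = np.random.randint(-int(sr * max_shift_sec), int(sr * max_shift_sec))
    return np.roll(data, shift)

def stretch_audio(data, rate=None):
    # Stretch or compress the waveform in time without changing its pitch.
    if rate is None:
        rate = np.random.uniform(0.8, 1.2)
    return librosa.effects.time_stretch(data, rate=rate)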

10.2 Dataset Description

The dataset used in this project consists of audio recordings from two main sources: TESS
(Toronto Emotional Speech Set) and CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset). Here's a detailed
description of each dataset:
TESS (Toronto Emotional Speech Set):

 The TESS dataset contains audio recordings of emotional expressions spoken by


actors in a controlled environment.

 It consists of a total of 2800 audio files, with each file representing a unique
emotional expression.

 The emotions covered in the TESS dataset include anger, disgust, fear, happiness,
sadness, surprise, and neutral.

 Each audio file is approximately 3 to 5 seconds long and is recorded at a sampling


rate of 16 kHz.

 The dataset provides a diverse range of emotional expressions, making it suitable for
training and testing emotion recognition models.

CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset):

 The CREMA-D dataset contains audio recordings of emotional expressions collected


from a diverse set of speakers.

 It consists of a total of 7,442 audio files, with each file representing a unique
emotional expression.

 The emotions covered in CREMA-D include anger, disgust, fear, happiness, sadness, and neutral; unlike TESS, it does not include a surprise class.

 Each audio file is approximately 3 to 5 seconds long and is recorded at a sampling


rate of 44.1 kHz.

 The dataset includes recordings from multiple speakers, providing variability in voice
quality, accent, and speaking style.

 CREMA-D offers a large and diverse collection of emotional expressions, making it a


valuable resource for emotion recognition research.

Combined Dataset:

 The combined dataset is created by concatenating the TESS and CREMA-D datasets
into a single dataframe.
 It contains a total of 10,242 audio files, encompassing a wide range of emotional
expressions.

 Each audio file is associated with a specific emotion label, allowing for supervised
learning of emotion recognition models.

 The dataset is divided into training and testing sets for model development and
evaluation.

 Feature extraction and data augmentation techniques are applied to the audio data to
enhance model performance and generalization ability.

10.3 Results Description

Figure 1 presents a bar chart depicting the total count of data samples for each emotion class
in the dataset. The x-axis represents the emotion classes, including angry, disgust, fear, happy,
neutral, surprise, and sad, while the y-axis represents the corresponding count of data
samples. This visualization provides an overview of the dataset's distribution across different
emotion categories, enabling insights into the dataset's balance and potential biases. Figure 2
illustrates a wavelet plot representing the angry emotion. The plot visualizes the waveform of
audio signals associated with the angry emotion. The x-axis represents time, while the y-axis
represents the amplitude of the audio signal. This visualization provides an intuitive
understanding of the temporal dynamics of the angry emotion in the audio recordings,
highlighting patterns and fluctuations in the waveform. Figure 3 showcases a spectrogram
representing the angry emotion. The spectrogram visualizes the frequency content of the
audio signals associated with the angry emotion over time. The x-axis represents time, the y-
axis represents frequency, and the color intensity represents the magnitude of the frequency
components. This visualization offers insights into the spectral characteristics of the angry
emotion, capturing variations in frequency components over time.
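The count plot and the per-emotion waveplots and spectrograms shown in Figures 1 to 5 can be reproduced with the helper functions defined in Chapter 9. The snippet below is an illustrative sketch that reuses the combined dataframe df and the wave_plot and spectogram helpers; the choice of sample is arbitrary.

# Illustrative generation of the class-count plot and the angry-emotion plots.
plt.figure(figsize=(10, 5))
plt.title('Count of each emotion class', size=16)
sns.countplot(x='Emotion', data=df)
plt.show()

# Pick one 'angry' sample from the combined dataframe (choice is arbitrary).
sample_path = df[df['Emotion'] == 'angry']['File_Path'].iloc[0]
data, sr = librosa.load(sample_path, duration=2.5, offset=0.6)
wave_plot(data, sr, 'angry', color='red')
spectogram(data, sr, 'angry')
plt.show()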
Fig. 1: Presents the total data count of each class.

Fig. 2: Visual representation of the angry emotion waveplot.


Fig. 3: Visual representation of the angry emotion spectrogram.

Fig. 4: Visual representation of the sad emotion waveplot.

Figure 4 displays a waveform plot depicting the sad emotion. Similar to Figure 2, this plot
illustrates the waveform of audio signals associated with the sad emotion. The visualization
enables the examination of temporal patterns and dynamics in the audio recordings
corresponding to the sad emotion, facilitating the identification of distinctive features. Figure
5 exhibits a spectrogram representing the sad emotion. Similar to Figure 3, this visualization
illustrates the frequency content of the audio signals associated with the sad emotion over
time. By visualizing the spectral characteristics, this plot aids in understanding the variations
in frequency components and their temporal evolution in the audio recordings expressing the
sad emotion.
Fig. 5: Visual representation of the sad emotion spectrogram.

Fig. 6: Performance metrics of RFC model.


Fig. 7: Confusion matrix of RFC model.

Fig. 8: Performance metrics of XGBoost model.


Fig. 9: Confusion matrix of XGBoost model.

Figure 6 presents the performance metrics of the Random Forest Classifier (RFC) model. The
metrics include accuracy (85.80%), precision (87.62%), recall (86.10%), and F1-score
(86.62%), calculated for each emotion class. Additionally, the macro-averaged (86.00%) and
weighted-averaged (86.00%) metrics provide an overall assessment of the model's
performance across all classes. This visualization enables the evaluation of the RFC model's
effectiveness in classifying emotions and provides insights into its strengths and weaknesses.
Figure 7 displays the confusion matrix of the RFC model, illustrating the model's predictions
versus the actual labels for each emotion class. Each cell in the matrix represents the count of
instances where the model predicted a particular emotion class compared to the ground truth.
This visualization aids in understanding the model's classification errors and identifying any
patterns or biases in its predictions. Figure 8 showcases the performance metrics of the
XGBoost Classifier model, including accuracy (95.53%), precision (96.00%), recall
(95.65%), and F1-score (95.81%), calculated for each emotion class. Similar to Figure 6, the
macro-averaged (95.80%) and weighted-averaged (95.80%) metrics provide an overall
assessment of the XGBoost model's performance. This visualization facilitates the
comparison of the XGBoost model's performance with that of the RFC model and provides
insights into its classification capabilities. Figure 9 exhibits the confusion matrix of the
XGBoost model, illustrating its predictions versus the actual labels for each emotion class.
This matrix provides a detailed breakdown of the model's classification performance,
highlighting any discrepancies between predicted and true labels. By visualizing the model's
confusion patterns, this plot assists in diagnosing classification errors and identifying areas
for improvement.
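The confusion matrices shown in Figures 7 and 9 can be generated from the trained classifiers with scikit-learn's confusion_matrix and a seaborn heatmap. The sketch below assumes the rf_model and xgb_model variables from the training sketch at the end of Chapter 9 and is illustrative rather than the exact plotting code used for the figures.

# Illustrative confusion-matrix plot for a trained classifier.
def plot_confusion_matrix(model, name):
    predictions = model.predict(X_test)
    cm = confusion_matrix(y_test, predictions)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=le.classes_, yticklabels=le.classes_)
    plt.title(f'{name} confusion matrix')
    plt.xlabel('Predicted label')
    plt.ylabel('True label')
    plt.show()

plot_confusion_matrix(rf_model, 'Random Forest Classifier')
plot_confusion_matrix(xgb_model, 'XGBoost Classifier')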
CHAPTER 11

CONCLUSION AND FUTURE SCOPE


The project explored speech emotion recognition using acoustic analysis. By leveraging
advanced signal processing techniques and machine learning models, the system achieved
accurate classification of emotions conveyed in audio recordings. The evaluation of Random
Forest Classifier (RFC) and XGBoost Classifier models demonstrated their effectiveness in
capturing and interpreting complex patterns present in speech signals. The visualization of
performance metrics and confusion matrices provided valuable insights into the models'
classification capabilities and areas for improvement.

Future Scope

 Multimodal Integration: Explore the integration of multiple modalities, such as


facial expressions and text sentiment analysis, to enhance emotion recognition
accuracy and robustness. Combining information from diverse sources can provide
complementary cues and improve the overall performance of the system.

 Deep Learning Architectures: Investigate the use of deep learning architectures,


such as convolutional neural networks (CNNs) and recurrent neural networks
(RNNs), for speech emotion recognition. Deep learning models have shown
remarkable success in capturing hierarchical features and temporal dependencies,
which could further improve classification performance.

 Real-time Applications: Adapt the developed models for real-time emotion


recognition applications, such as virtual assistants, emotion-aware systems, and
mental health monitoring tools. Optimizing the algorithms for low-latency processing
and integrating them into user-friendly interfaces would facilitate their practical
deployment.

 Cross-lingual and Cross-cultural Analysis: Extend the analysis to encompass a


broader range of languages and cultural contexts. Investigate the generalization
capabilities of the models across different linguistic and cultural backgrounds,
considering variations in speech patterns and emotional expressions.

 Longitudinal Studies: Conduct longitudinal studies to assess the stability and


reliability of the emotion recognition models over time. Tracking changes in
emotional states and speech patterns over extended periods could provide insights into
individual variability and enable personalized emotion monitoring.

 Ethical Considerations: Address ethical considerations related to privacy, consent,


and bias in emotion recognition systems. Ensure responsible and ethical deployment
of the technology by prioritizing user privacy, transparency, and fairness in
algorithmic decision-making.
