FPR+Example+1
FPR+Example+1
In Speech
September, 2022
UNIVERSITY OF HERTFORDSHIRE
School of Physics, Astronomy and Mathematics
MSc Data Science
7PAM2002-0209-2021 - Data Science Project
SEPTEMBER 4
1
MSc Final Project Declaration
This report is submitted in partial fulfillment of the requirement for the degree of Master
of Science in Data Science at the University of Hertfordshire, Hatfield (UH).
I hereby authorise the report to be posted on the university website if the source is cited
appropriately.
2
Acknowledgement
As I come close to finishing my post-graduate studies, I would like to emphasize that it
has been a wonderful learning experience, and I want to express my gratitude to all the
people who have supported me along the way.
I'd like to start by expressing my gratitude to Almighty God for never ceasing to inspire
me with His endless blessings and for giving me the confidence and valor to move
forward with assurance and self-belief.
I would like to convey my appreciation and gratitude to Carolyn Devereux, who served
as my supervisor, for her constant advice and assistance in this project. I am appreciative
of her constant support and her patience towards my inquisitiveness.
I would also like to express my gratitude towards all my professors at the University of
Hertfordshire who helped me gain knowledge and understanding of the subjects and
helped me throughout my course.
I would also like to thank my parents, my sister, and my friends for their unwavering
encouragement and support, without which this would not have been possible.
3
Abstract
The main objective of this project was to conduct comparative research on the topic of
SER. The models explored in this project were Convolutional Neural Networks, Multi-
layer Perceptron, Support Vector Machines and Long-Short Term Memory (LSTM). These
models were taken to establish comparison of performance metrics of classical
classifiers versus machine learning classifiers. The data set utilised in the project was a
combination of CREMA-D (Crowd-Sourced Emotional Multimodal Actors Dataset) and
TESS (Toronto Emotional Speech Set) along with Mel-frequency Cepstrum coefficients
(MFCC) which served as method of extraction of features or information. This project
concluded that the LSTM was the most suitable with respect to accuracy with 91%
training accuracy and 98% test accuracy whereas MLP gave the most reliable accuracy
vs testing time balance with less than a second of testing time. This project highlights
the model usability in real-time applications provided the drawbacks of the data
available in real-time and also the limitations of models.
4
Contents
1. Introduction ............................................................................................................... 8
2. Literature Review ..................................................................................................... 12
3. Methodology ........................................................................................................... 17
3.1 Brief Overview:................................................................................................... 17
3.2 Dataset Used: ..................................................................................................... 17
3.3 Data Pre-Processing: .......................................................................................... 18
3.3.1 Feature Extraction: ...................................................................................... 20
3.3.2 Data Augmentation: .................................................................................... 22
3.4 Classification Models:......................................................................................... 23
3.4.1 Convolutional Neural Network: ................................................................... 23
3.4.2 Multi-Layer Perceptron: .............................................................................. 25
3.4.3 Long Short-Term Memory (LSTM): .............................................................. 26
3.4.4 Support Vector Machines: ........................................................................... 29
4. Results ..................................................................................................................... 31
5. Discussion of Results ................................................................................................ 39
5.1 Comparison of models: ...................................................................................... 39
5.2 Comparison with Other Papers: ......................................................................... 39
5.3 Applying Models to some of the Applications: ................................................... 41
5.4 Improvements of Models: .................................................................................. 41
5.5 Limitations: ........................................................................................................ 42
6. Conclusion................................................................................................................ 43
7. References ............................................................................................................... 45
8. Appendix .................................................................................................................. 50
5
List of Figures:
6
List of Tables:
7
1. Introduction
Humans express themselves most naturally through speech. We are so reliant on it that
we understand its significance even when using other forms of communication, like
emails or messages. Hence the use of emojis in our text messages has become
predominant. Emojis have proliferated in text messaging, which is not surprising given
the potential for misunderstanding, making it essential to convey the underlying
emotions of the context through text. According to Kerkeni et al., (2020), emotion is a
key factor in our everyday interpersonal relationships. It is essential for our rational,
logical and wise/intelligent decisions. We can align and comprehend others' emotions
by communicating our own emotions with them and communicating feedback or
reciprocating their sentiments. Since emotions make it easier for people to comprehend
one another, it seems natural to apply this idea to computers as well. In the modern
world, if the same level of understanding as human-human interaction could be
achieved by human-machine interaction, it would represent an enormous advancement
in technology.
Therefore, the idea of SER (Speech Emotion recognition) came into existence. SER is
basically an attempt to recognize human emotions from speech/audio. The fundamental
principle is the fact that voice frequently conveys the underlying emotion through pitch
and tone. According to (Femia, 2020), this same method along with the facial expression
is used by animals like horses to be able to understand human emotion. It is crucial to
have an emotional detection framework, which consists of different modules that carry
out tasks like audio to text conversion, followed by feature extraction from the audio
files, then the selection of features from those extracted features, and finally
8
classification of those features to determine the underlying emotions in those audio files
(Sudhakar and Anil, 2015). Figure 1 below depicts a typical SER system.
Detection of human emotions through a machine has many applications. For example,
some of the applications according to Kerkeni et al., (2020) are, detection of emotion in
the audio surveillance, business applications, clinical trials, interfacing with robots, web-
based E-learnings, entertainment sector, emotion detection in call centers, banking
sector in detection of fraud, video games, etc. Thanks to smartphones that can
synthesize or rather, transcribe voice commands, speech recognition has already been
part of our daily lives. As an advancement to this, smartphone companies could also
employ the Speech Emotion Recognition (SER) model to be able to recognize our
emotions in real-time. (Akçay and Oğuz, 2020).
These SER methods could also be used to watch how businesses engage with their clients
in call centers. Currently, a human professional with limited skills must be included in
order to assess emotions in these talks. However, using machines to complete the
process of assessment will prove to be far more cost-effective and produce results that
are a lot more reliable. Such a strategy might also be advantageous for public services.
For instance, it would be feasible to examine the emotions expressed in the voices, or
speeches, of members of parliament or politicians. As the attitudes and sincerity of
politicians may be looked into and further analyzed, such knowledge could be of great
interest and benefit to society. Additionally, other NGOs tackling problems in civil society
could apply emotion recognition. For instance, it will be possible to follow the emotional
states and behavior of various social groups by utilizing emotion recognition in speech
(Lugovic, Dunder and Horvat, 2016). Additionally, these methods could be applied to
academic research, particularly social science research, to reach a higher level of
granularity. For instance, when conducting interviews, one could pick up on emotions in
9
speech. This method would allow further scope for the analysis of another speech
utilizing the same data collection and processing technique.
On a more societal level, emotion recognition could be used to enhance the day-to-day
interaction between artificial intelligence and humans. Siri and Alexa, two examples of
artificially intelligent devices, are at present programmed to play music or comprehend
spoken commands from humans. Speech Emotion Recognition can evaluate a person's
mood and, when used in conjunction with IOT (Internet of Things), change the music
and lighting of the place to elevate mood. If a sad/depressing emotion is identified, it
can also recommend some shows to assist people feel better. In the medical field too,
the emotional state of the patient can be tracked and monitored and based on that,
suitable medical treatment can be administered. Another application would be to teach
computers to recognize the emotional expressions made by humans, so that tailored
email answers could be sent out in response.
Over the past few years, this subject has drawn more and more attention from
researchers and recent advances in the field of artificial intelligence and deep learning
techniques have further intensified the interest in this topic. With many researchers
coming up with different algorithms and ML (Machine Learning) models to try and
achieve acceptable accuracy, there is still a lot of scope for research in this area.
Although many models have been created for SER, the accuracy and the predictive
quality of the models have not quite reached our expectations to be able to be deployed
in real-time applications.
10
and built different models in their research, the models – CNN, MLP, SVM and RNN-
LSTM were specifically chosen for this project because:
CNN: was used as it uses convolutional layers with the combination of pooling layers
which act as filters to reduce low-pixels or in this case, noise from the data. It also helps
in dimensionality reduction.
LSTM: was used due to its ability to predict sequential data.
MLP: is mainly used for classification problems, and SER being one, MLP was chosen.
SVM: is considered to be one of the best classifiers with optimal margin. This helps
classification of a new datapoint a lot easier. SVM also has the benefit of not requiring
extensive preprocessing. Hence SVM was chosen.
Further information of each model is detailed in Section 3.4 of this report.
11
2. Literature Review
SER is a widely researched topic which has garnered a piqued interest over the former
years with an increasing demand for not just research but the market as well. According
to the survey by Anderson, (2022) via the site marketandmarkets.com, an increasing
exponential trend for the demand of SER was observed. With that observation, an
estimate of $56 billion is predicted in 2024 from $21.6 billion in 2019 which is more than
double in just 5 years implying the extend of its captivity amongst people. Figure 2
depicts the predicted growth of SER over the years for different regions.
12
Fig 3: Overview of SER. (Source: Akçay and Oğuz, (2020))
The overview of SER shows the current research done and the information available
around SER. The research by Akçay and Oğuz, (2020) mainly focuses on the databases,
the preprocessing techniques and the classifiers that have been used for SER. Databases
like RAVDESS, SAVEE, CREMA-D, TESS, IEMOCAP, MES, DES etc. are widely used in
research. These datasets contain the recording of different emotions acted out by
professionals in different languages. There are many other datasets available for generic
speech recognition as well. Preprocessing techniques are used for extracting features
from the given voice recordings. The list of preprocessing techniques as listed in the
Figure. 3, range from framing, windowing to normalization. The techniques depend on
the type of features that are to be extracted. At first, researchers commonly used
prosodic features like pitch, energy and duration of the voice recordings like in Nwe,
Foo, and De Silva, (2003) and Mao et al., (2014). Spectral features have now seemingly
become widely accepted (example papers: Issa, Fatih Demirci and Yazici, (2020), Fayek,
Lech and Cavedon, (2017), Kerkeni et al., (2018)) as a result of their capacity to increase
the performance of an algorithm. This report uses spectral features for model training
and hence MFCCs are extracted using windowing. As for the classification techniques,
13
the list broadly categorizes the techniques into classical classifiers, deep learning
classifiers and enhanced deep learning classifiers. At the beginning, during the late 90s,
the classical classifiers were popular amongst the research (example paper: Nwe, Foo,
and De Silva, (2003), which used Hidden Markov Model) but after the materialization of
deep learning specially after the release of Tensorflow and Keras libraries in 2015, many
researches researching around this topic in early 2010s shifted to CNN, RNN-LSTM and
DNN (example papers: Mirsamadi et al., (2017), Sai, et al., (2022), and Fayek, Lech and
Cavedon, (2017)). There is still a huge scope for research in the enhanced deep learning
techniques which may become popular techniques for SER in the coming years.
There are many other papers that assess the current research on speech emotion
recognition. Many researchers have proposed multiple solutions using various
mathematical algorithms and deep learning models. Some of them worth mentioning
and comparing are shown in Table 1.
Papers Classifiers Dataset Used Results
Speech Emotion Recognition: Multivariate Berlin and Spanish Highest results for each
Methods and Cases Study. Linear regression, Emotional database classifier:
(Kerkeni et al., 2018) SVM, RNN MLR – Spanish – 82.41%, Berlin
– 75%
SVM – Spanish – 77.63%, Berlin
– 63.30%
RNN- Spanish – 90.05%, Berlin –
69.55%
Speech emotion recognition using LFPC to represent A user defined dataset, An average accuracy of 77.1%,
hidden Markov models. (Nwe, speech signals with voices of Burmese 89% when emotions were
Foo, and De Silva, 2003) and Hidden and mandarin speaking identified individually
Markov Models people. A total of 720
were used as voice recordings.
classifiers
Evaluating deep learning DNN – CNN + IEMOCAP The best accuracy achieved was
architectures for Speech Emotion LSTM 64.78%. Preprocessing was done
Recognition. (Fayek, Lech and with hamming window and log
Cavedon, 2017) FFT.
Speech emotion recognition: HMM, SVM, CNN No Dataset used Just comparative research on
Emotional models, databases, and many more what has been done by other
features, preprocessing methods, authors
supporting modalities, and
classifiers. (Akçay and Oğuz,
2020)
Learning salient features for CNN SAVEE, Emo-DB (Berlin), 73.6% accuracy for SAVEE
speech emotion recognition using DES (German), DS,85.2% for EMODB,79.9% for
convolutional neural networks. MES(Mandarin) DES DS 78.3% for MES DS
(Mao et al., 2014)
14
Automatic speech emotion RNN IEMOCAP 61.8% recognition rate with raw
recognition using recurrent features 63.5% recognition rate
neural networks with local with LLD features.
attention. (Mirsamadi et al.,
2017)
Emotion recognition and its Neural Networks User-defined dataset 50% accuracy of recognition
application to computer agents
with spontaneous interactive
capabilities. (Nakatsu et al.,
1999)
Adieu features? end-to-end Deep CNN with RECOLA Highest valence – 0.261
speech emotion recognition using LSTM
a deep convolutional recurrent
network. (Trigeorgis et al., 2016)
Speech emotion recognition with CNN RAVDESS, EMODB, RAVDESS - 71.61%
deep convolutional neural IEMOCAP EMODB – 86.1%
networks IEMOCAP – 64.30%
(Issa, Fatih Demirci and Yazici,
2020)
Speech Emotion Recognition RNN LSTM The dataset used was accuracy was 95% overall
using LSTM and RNN (Sai, et al., RAVDESS
2022)
From the papers in Table 1, it would be relevant to discuss the papers and their achieved
accuracy in order to incorporate their work into this project. According to this project,
the accuracy of the models heavily depends on the dataset which is being used and the
type of preprocessing done on that dataset to extract information. As shown in Table 1,
the relatively older papers used mathematical algorithms to develop SER like in paper
Speech emotion recognition using hidden Markov models (Nwe, Foo, and De Silva,
2003) whereas the latest papers focused on neural networks for it like papers Trigeorgis
et al., (2016), Sai, et al., (2022) and Issa, Fatih Demirci and Yazici, (2020). While CNN
model was used by many of the researchers, RNN proved to be delivering the expected
results (Sai, et al., 2022). SVM model, being widely used for classification algorithms
(Kerkeni et al., 2018), also provided satisfactory results for many of the datasets used.
From the preprocessing standpoint, one of the crucial processes in the creation of an
SER system is feature extraction. Due to its importance, many researchers have used
different techniques and methods to extract features with higher accuracy and better
overall performance. Some of the common techniques used according to (Akçay and
Oğuz, 2020) were Modulation spectral (MS) features (E.g., Kerkeni et al., (2018)), Mel-
Frequency Cepstrum Coefficients (MFCC) (E.g., Issa, Fatih Demirci and Yazici, (2020),
15
Kerkeni et al., (2018)), Low-Level Descriptors (LLD) features (E.g., Mirsamadi et al.,
(2017)), Linear Prediction Coefficients (LPC) features (E.g., Shen, Changjun and Chen,
(2011)) etc. These techniques helped with extracting most common features like pitch,
spectral features and energy. As observed from this project, many researchers favored
the MFCC technique of extracting the features as it yielded far better results when
compared to other techniques.
Not only was there abundant published literature found for this topic, but also this topic
has taken the masses by sense of intrigue. Many competitions were held for developing
the SER with many people coming up with different algorithms and deep learning
models to achieve a reliable accuracy. Again, the most common technique of feature
extraction in these codes was found to be MFCC. While many leaned towards CNN
model, MLP was also one of their favorite models. Then, there were few who tried
replicating the results of the published papers. The website by Meta AI,
https://paperswithcode.com/task/speech-emotion-recognition, gives the list of papers
generally referred to and the relevant code that people tried for those papers.
In this report, the overall performance of each model (CNN, MLP, SVM, RNN-LSTM)
when trained and tested against the same data with same feature extraction technique
is highlighted. In the Section. 4 of this report, more specifics about this study are
covered.
16
3. Methodology
3.1 Brief Overview:
This project was built in Google Colaboratory in Python language using various libraries
for models used. The libraries numpy, pandas, matplotlib were used for basic
mathematical, visualization and Dataframe operations. The libraries tensorflow, keras
and sklearn were used for model building. Librosa library was used for preprocessing
data using the MFCC technique. The first step was to extract the datasets CREMA-D and
TESS from the Kaggle website. The data was downloaded into the google colab
environment and a Dataframe containing the information from these two datasets was
constructed. The Dataframe was made up of the audio path and the associated
emotions. Using the data in this Dataframe the features were extracted and stored into
another Dataframe, the specifics of this process will be covered in the Section 3.3 of this
report. The next step was data preparation which included scaling of data features, one-
hot encoding the emotions which were in form of words, and splitting data into training
and test datasets. to make them more compatible with the models. Next came the
building of the models CNN, MLP, SVM and RNN-LSTM. The data was then fed into these
models for training the models. Parameters were calibrated to achieve desirable levels
of accuracy and speed and the models were then tested against the test dataset. Figure
4 shows the flowchart of the tasks carried out in this project.
17
backgrounds. The dataset has voice recordings of 48 males and 43 females. The actors
in the dataset have spoken 12 different sentences with different emotions.
18
producing incorrect, inaccurate and unreliable conclusions/results of the research.
(Mesevage, 2021). A general flow of SER systems is detailed in Figure 6.
For preprocessing my dataset, it was necessary to understand the nature of data and
the various available procedures to extract information from it. Looking at the
physiological perspective during the generation of speech, Williams and Stevens,
(1981) specified that emotions such as anger, fear, or joy cause the sympathetic nervous
system to become dominantly active. Increased blood pressure, heart rate, occasional
dry mouth, and intermittent muscle tremors are the results one goes through while
feeling those emotions. In line with this, speech is also spoken quickly, enunciated at
times, and with powerful high frequency energy. On the other hand, when the
parasympathetic nervous system is aroused, as when one feels sad, the blood pressure
and heartbeat rate drop while the rate of generation of saliva rises, resulting in speech
that is sluggish and lacking in high frequency energy. The total energy, energy
distribution across the frequency spectrum, the frequency and length of speech signal
and the pause duration of the speech, all exhibit the corresponding impacts of these
physiological changes on speech (Nwe, Foo and De Silva, 2003). But this is not just
limited to human psychology and this same concept of understanding the energy
distribution in frequency spectrum can be extended to machines as well. Hence, taking
this into consideration, preprocessing of data (which is to extract features by extracting
information of audio signals in frequency and time domain) was done in this project.
Further details of the feature extraction process in discussed in Section 3.3.1 and 3.3.2
below.
19
3.3.1 Feature Extraction:
Since our data is in audio format, preprocessing is the very first step after collection of
data that is used to train the classifier models. In order to analyze and discover
relationships between many entities, mainly the audio files, feature extraction from
these audio signals becomes a crucial step which is used by many researchers in their
feature extraction process. As shown in Figure 7, the audio signal can be disintegrated
into a three-dimensional signal. The axes of which represent amplitude, time and
frequency.
20
Fig 8: Flowchart of MFCCs. Source: https://ieeexplore.ieee.org/abstract/document/1692543
Pre-emphasis amplifies the energy in the higher frequency by a significant amount.
When we analyze the audio signal’s frequency domain, the energy at a higher frequency
is much smaller than the energy at a lower frequency for voiced segments, such as
vowels. The main objective of performing pre-emphasis is to improve the model's
performance and accuracy by increasing the energy at higher frequencies. (Kiran,
2021).
Windowing breaks the audio signal into different segments to extract the required
features. If the signal is chopped off at its edges directly during the signal splitting, it
would result in the abrupt drop in amplitude of the signals at the margins leading to the
generation of noise signals in the range of high frequency. Hence, instead of employing
a rectangular window for windowing, a Hanning window (a sinusoidal wave which
touches zero at both ends after removal of discontinuity) or a Hamming window (a
sinusoidal wave which does not touch zero at both ends and has slight discontinuity) is
used which helps in prevention of noise in the high-frequency range (Kiran, 2021).
FFT or the Fast Fourier Transform is used for the conversion of the incoming signal from
the time domain to the frequency domain. This step is crucial because for audio signals,
it is significantly simpler to analyze the data in the frequency domain than in the time
domain (Kiran, 2021).
MEL-frequency filter bank uses the Mel scale to convert the actual frequency to a
frequency that people can generally perceive. This is done because there is a difference
in the way our ears perceive the sound versus the machines perceiving the sound. Our
hearing is dependent on the fact that at lower frequencies than at higher frequencies,
human beings have better sound or acoustic resolution. (Kiran, 2021).
Cepstrum was actually used for studying echoes, but its purpose is to detect the pitch in
audio files. Cepstrum is basically a spectrum of a spectrum. In mathematical terms, it is
21
the inverse Fourier transform of the power spectrum which is obtained from the
Discrete Fourier Transform (DFT) of the speech signal (Kiran, 2021).
Performing these steps gave me the required features in the form of Mel frequency
coefficients. But before data features are extracted, another procedure of data pre-
processing is most commonly applied which is the technique of augmenting data known
as data augmentation.
As discussed previously, because the classifier models could not directly interpret or
process the data from our audio recordings, feature extraction was utilized to transform
the data into a format that could be handled by the classifier models. Therefore, in this
project, the features were extracted using MFCCs after performing the data
augmentation steps. The sequence of which is:
1. Addition of noise which is random addition of data points to the existing data.
2. Stretching of the audio signal which is to increase rate of flow of data.
3. Shifting of the audio signal in amplitude domain similar to normalization of data
signal.
22
4. Pitch shifting of the audio data by a pitch factor of 0.7 (or simply the steps of 0.7).
5. Speeding up the data in time domain by the speed factor of 1.25 (increasing the
speed of the audio signal to 1.25 times).
6. Speeding down the data in time domain by the speed factor of 0.75 (decreasing the
speed of the audio signal to 0.75 times).
23
probability. All of these layers in the given sequence—Convolutional Layers (CL), Pooling
Layers (PL), Fully Connected Layer (also known as the Dense Layer), and SoftMax
Classifier—are arranged by a CNN in a hierarchical structure (Mustaqeem and Kwon,
2019). The SoftMax function can be used to change a vector of K real values into a vector
of K real values that equal 1. The SoftMax converts input values, which may be zero,
negative, positive, or greater than one, into numbers in the range of 0 and 1, making
them understandable as probabilities. The result of the SoftMax function provides you
the likelihood that the output returned belongs to any of the classes, which is
comparable to a categorical probability distribution. Apart from these, a CNN layer must
also contain activation function for the layers its built of. An activation function in an
artificial neural network terminology, is a task that yields a reduced output for minuscule
inputs and a larger output if the inputs are greater than a tolerance level. If the inputs
are sufficiently large, the activation function "triggers"; otherwise, nothing happens. To
put it another way, an activation function functions like a barrier that verifies that if an
incoming value is greater than a threshold value.
CNNs are ideal for image recognition and classification but have proved to be classifiers
for emotion in speech as well. A basic example of CNN architecture is shown in Figure 9.
In this project, a CNN model was built with the convolutional, max pool and dense layers.
The CNN model was built by setting up the combination of convolutional layers with the
activation being ReLU (Rectified Linear Unit) along with max-pooling layers. The plot of
a ReLU graph is as shown in Figure 10. ReLU works on the formula:
24
Fig 10: ReLU function. (Source: https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-
neural-networks-584bc134c1e2)
There are four such combinations followed by a flatten and a dense layer with ‘Softmax’
activation. To avoid the problem of overfitting, Dropout regularization is used. It is a
regularization technique which excludes input and recurrent connections to LSTM/CNN
units probabilistically from weight and activation and updates during network training.
As a result, overfitting is decreased and model performance is enhanced. The
architecture of CNN used in this project is as described in Table 2 below:
Layers Units Activation Parameters
Convolutional Layer 256 ReLU 1536
Convolutional Layer 128 ReLU 163968
Dropout 0.1
Max Pooling Layer 8
Convolutional Layer 128 ReLU 82048
Convolutional Layer 128 ReLU 82048
Flatten Layer 7
Dense Layer 7 Softmax 6279
Table 2: Architecture of CNN model
Forget gates: Recurrent gates referred to as "forget gates" are typically added to LSTM.
Looking at an example, taking a sentence:
Input Gate: The input gate is in responsible of updating the cell's state with information.
Here, additional information in the aforementioned example would be that both Marley
and Riles are good people (Srivastava, 2017).
27
Output Gate: The output gate is liable of extracting pertinent information from the
present state of the cell and projecting it as output. In other terms, predicting the next
word of a sentence based on the states received from input gate and the information
stored in the memory (Srivastava, 2017).
Fig 13: LSTM Model with Input, Output and Forget gates
Figure 13 shows the LSTM model containing the output gate (O), Input gate (I) and forget
gate (F). The two hidden states are depicted by (H) and candidate layer is shown by (C).
Backpropagated errors can be avoided from exploding or vanishing using LSTM. Instead,
errors might go backwards through an endless number of simulated layers that are
dispersed in space. In other sense, according to Sai, et al., (2022), tasks requiring
memories of occasions that occurred thousands or even millions of discrete time steps
previously can be learned using LSTM. Employing RNN-LSTM has the bigger benefit of
performing well with large datasets. For emotion detection, the interdependence of
each part of the audio speech with the previous one should be taken into account. So,
in this case LSTM should ideally give a better performance.
In this project, the best version of the LSTM model was built with a simple RNN
architecture is built with the parameters as shown below in Table 3:
Layers Units Activation Parameters
LSTM 200 Tanh (Default) 161600
Dropout 0.5
Dense 512 ReLU 102912
Dense 512 ReLU 262656
Dense 7 Softmax 3591
Table 3: Architecture of RNN model
28
3.4.4 Support Vector Machines:
Support Vector Machine (SVM) is a well-known supervised machine learning method
which is utilized in both regression and classification algorithms. However, classification
algorithms tend to favor it. It essentially creates a hyperplane in n- or multidimensional
space that divides various target groups. The fundamental goal of the SVM is to establish
the optimum decision border with the greatest amount of margin between two or more
classes so that the fresh data points can be accurately classified. (Awasthi, 2020).
In machine learning algorithms, one of the best classifiers with a considerably optimal
margin are Support Vector Machines. SVMs are extensively used in numerous research
papers that focus on audio/speech emotion recognition, example, (Hu, Xu and Wu,
(2007), Chavhan, Dhore and Yesaware, (2010), Shen, Changjun and Chen, (2011)).
According to G.S. et al., (2016), in comparison to other classifiers/models, it can do
remarkably well in terms of classification, especially with little training data. According
to (Awasthi, 2020), SVMs are have a better ability of handling classification and
regression methods on both linear and non-linear data. The models usually are more
accurate when compared to other algorithms due to their ability to handle small
(typically tens to many thousands of features) and complex datasets. Another benefit of
it is it can find complex relationships amongst the provided dataset without
preprocessing it with plenty of transformations. This is handled by Kernels. Kernels help
a lot when dealing with complex datasets. They receive the data as input, which they
then transform into the necessary form. Typically used kernels are:
• The linear kernel is used to partition data linearly. Because it works well with big
datasets, this kernel is often chosen for text classification models.
Formula: F (X, Xj) = sum ( X.Xj)
• Sigmoid kernel graphs into a sigmoid when classifying data and is mostly preferred
for neural networks. This kernel closely resembles a two-layer perceptron of a neural
network and functions as an activation function for neurons.
Formula: F (X, Xj) = tanh (αxay + c)
• The Gaussian Radial Basis Function (RBF) kernel is a type of gaussian kernel, which
projects the higher dimensional data and then searches for a linear separation plane
for the data.
Formula: F (X, Xj) = exp (-gamma * ||X - Xj||^2)
29
• Polynomial kernels are used preferably for data which has been normalized. This
kernel is not often used as it does not give the most reliable results when compared
to other kernels.
Formula: F (X, Xj) = (X.Xj+1)^d
The below Figure 14 depicts the graphical representations of SVM classifications using
different kernels for some dummy/example data.
For this project, SVM model was built with RBF kernel which gave the highest accuracy
amongst the other kernels.
30
4. Results
The table below displays the comparative outcomes for each model. This table displays
the accuracy, training time, testing time and predictive quality of the best versions of
each model trained in the project.
Model Accuracy Training Time Test time Predictive quality
(TP+TN)/ (TP+FN+TN+FP)
CNN 84.47% 82mins 7.48 sec Very good
MLP 86.29% 6.56 mins 0.19 sec Very good
RNN-LSTM 98.14% 63.38 mins 2.23 sec Excellent
SVM 72.80% 6 mins ~ 1 min Average
Table 4: Results of each model
The predictive quality was measured based on the accuracy score. Approximately 17,900
(30%) feature- extracted data files were tested for CNN and MLP and 14400 (20%) data
files for RNN and SVM. As highlighted, RNN yielded the best accuracy and had excellent
predictive quality. The training and test times were measured (as start time – current
time) from the time.time() command in the program.
CNN Model: The dataset was reshaped to make it compatible with CNN model. The
RMSprop optimizer was used with an adaptive learning rate starting from 0.01
decreasing by a factor of 0.4 to the lowest of 0.0000001. 50 epochs were used to train
the CNN model having batch size of 256. CNN took approximately 82 mins to train with
training loss deprecating from 3.00 to 0.19 and testing loss from 1.31 to 0.60. While the
training accuracy was 93.25%, the validation accuracy could only reach 84.47%. Below
are the graphs representing training and testing loss and accuracy.
Fig 15: Plot of Accuracy and Loss of CNN model on training and validation data
At first, different models with varied parameters and layers were tried out but the
aforementioned model proved to give the best results. At first, just three Conv layers
with max pooling were tried out which gave an accuracy of approx. 76%. Then the
31
parameters were adjusted ranging from 64 to 256 each resulting in different accuracies
and training times. Also, the model was trained for more epochs of 60 and 75, but the
losses did not improve and the model tended to overfit the data. Different optimizers
were also tried out including Adam, Nadam, RMSprop, and SVG. but the best optimizer
was RMSprop for this classification problem.
MLP Model: The best version of MLP classifier was built with two hidden layers of 250
and 150 neurons respectively. It had an adaptive learning rate, and alpha value of 0.06.
An alpha value is used to tackle overfitting/underfitting problems by restricting the
weight size. Increasing alpha fixes high variance while decreasing its value fixes high bias.
Thus, an ideal value is to be calibrated to achieve reliant results. The epsilon of 0.1 x
10^8 (used for stability of optimizer Adam) was used. A batch size of 256 was employed.
The MLP took approximately 6.56 mins to train and 0.19 seconds to test giving an overall
accuracy of 86.29%. Other versions were built with different combinations of hidden
layers and batch sizes which affected the accuracy and training time of the models. Some
of the models worth mentioning for comparison are:
Hidden layers Batch size Training time Accuracy
(300,) 256 ~ 9 mins 74.85%
(350,) 256 ~ 14.52 mins 77.50%
(350,) 128 ~ 13 mins 75.23%
(400,) 256 ~ 14 mins 76.55%
(300, 100) 256 ~ 7.5 mins 79.56%
(250, 150) – best version 256 ~ 6.6 mins 86.29%
Table 5: Different versions of MLP models built in this project
As observed from Table 5, increasing the parameters did increase the accuracy but
heavily affected the training time. Increasing layers resulted in lesser training time and
increase in accuracy.
SVM Model: RBF (Gaussian Radial Basis Function) kernel was used to train the most
effective SVM model. The approximate training time taken by the model was 6 mins and
it took a rather long time to test of approximately 73 seconds. The overall accuracy
achieved was 72.8% with the overall predictive quality being average to low. At first,
different kernels were used for training including linear, polynomial and sigmoid. While
polynomial kernel gave around 70% accuracy with a training time of approximately 7
minutes, giving second best results, linear kernel and sigmoid kernel gave even worse
results of about 56% and 24% of accuracy respectively with an even greater training time
of about 17 mins and 7.5 mins respectively.
32
RNN Model: The RNN model was trained with the loss metric of ‘binary cross-entropy’
and ‘Nadam’ optimizer. An adaptive learning rate starting from 0.01 decreasing by a
factor of 0.4 to the lowest of 0.000001 was applied. RNN was trained with a batch size
of 128 across 60 epochs. This model took approximately 64 mins to train and gave an
accuracy of 98.13%. The validation loss captured during the training of this model
reduced from 0.27 to 0.13 while the validation increased accuracy from 0.50 to 0.91.
Below are the graphs representing training and testing loss and accuracy.
Fig 16: Plot of Accuracy and Loss of RNN model on training and validation data
Again, different parameters with a varied number of layers were tried for this classifier.
The model with same number of layers but with just 30 epochs gave an accuracy of 68%,
so the epochs were increased to 50, the accuracy improved to 72.5%. Another approach
of scaling the features before they were trained was tried but the results were not very
promising and gave lesser accuracy than before. Optimizers like Nadam, RMSprop and
Adam were tried and ‘Nadam’ proved to be the best optimizer. The model could reach
an accuracy of 78.9% with learning rate of 0.01 even when trained for 100 epochs. Then,
the learning rate was made adaptive and epochs were increased to 100. This model gave
the second-best results. To experiment further, the training data was increased to 80%
from 70%, decreasing the test dataset. This did not make much difference to the model
performance. Then, a second layer of LSTM was added with same number of parameters
(100). Though it did increase the accuracy, due to the model complexity, the training
time increased by 50 mins and test time increased by 6 secs making it unfit for real-time
deployment. Finally, only one layer was kept and parameters were increased from 100
to 150 to 200 and epochs were decreased to 60. The best model was achieved by one
LSTM layer of 200 parameters with two dense layers.
The overall effectiveness of the models in identifying certain emotions is examined using
the confusion matrix. It is used to determine how well each model correlates with each
33
emotion. Because they give a more realistic representation of a model's performance,
confusion matrices are the most common. In a regular accuracy measure of
classification, the information of incorrectly classified cases is not reported. A confusion
matrix provides information of both rightly and erroneously classified cases reported
giving us a broader understanding of the model’s performance. To understand the
confusion matrix, below are a brief definition of terms used in the matrix.
Recall: or sensitivity is described as the percentage of positive events that were correctly
identified out of all positive events. Or, which positive instances, if any, were correctly
predicted out of all of them. Recall ought to be high (ideally 1) (Suresh, 2020).
A Classification Report along with the Heatmap for each model: To further analyze the
performance of each model in terms of performance level of detection of each emotion,
the heatmaps and classification reports are shown below for each model in figures
18,19,20,21.
34
Fig 18: Classification report and heatmap of SVM Fig 19: Classification report and heatmap of MLP
Fig 20: Classification report and heatmap of RNN Fig 21: Classification report and heatmap of CNN
35
SVM Model: From Figure 18, the precision with which the "surprise" and "anger"
emotions were detected for the SVM model was equivalent to or higher than 80%,
suggesting that these emotions were detected with a higher degree of precision than
other emotions. The least precise emotions were "sad" and "disgust," which suggests
that the model failed to correctly identify these feelings. The recall rate varied from 60%
to 99% for all emotions, suggesting a moderate level of success for the model in
identifying false negatives. The overall F1 score too had similar range of 65% to 100% for
the emotions implying that though the model’s capability of identifying few emotions
was good, it could not distinguish between ‘disgust’ and ‘fear’. The heatmap of the SVM
model shows a rather detailed picture of each emotion classification whether correctly
detected or falsely classified. The correlation of each emotion with itself and with the
other emotions as well is depicted. For example, while emotion ‘angry’ was correlated,
rather detected with 80.16% accuracy, it was falsely detected or correlated as ‘happy’
7.50% of the times, ‘disgust’ as 5.84% of the times etc.
MLP Model: For the MLP model as shown in Figure 19, 'angry', 'happy', 'neutral',
'disgust', ‘sad’ and 'surprise' emotions performed extraordinarily well, with the model's
accuracy in identifying these being more than 90%. A good recall rate was also observed
for these emotions, the lowest being 79% for ‘disgust’ emotion which was far better
than SVM model. F1 score was the highest for ‘surprise’ recorded at 99% implying that
model almost always classified this emotion accurately. The aggregate F1 score
remained higher than 86 percent, suggesting that this model performed well overall. A
more graphical representation of correlation of each emotion is shown in the heatmap
of MLP model below.
The RNN model: in Figure 20, which was the most accurate model, had a precision level
of above 97% for all the emotions, with "surprise," "fear," “happy” and "anger" having
the best precision of more than 98%. All of the emotions had recall rates above 98%,
which speaks to the model's accuracy in identifying false negatives. The total F1 score
was above 98% for all emotions, demonstrating the model's strong capacity to recognize
and distinguish emotions with greater accuracy. As observed, all the emotions had the
same levels of precision and recall rate which speaks about the model’s stability and
consistency amongst all emotions.
CNN Model: All emotions for the CNN model were within the 80–90% precision range as
seen in Figure 21, preserving no biases against any emotion. Only one emotion, ‘surprise’
was 99 percent accurately represented. In comparison to previous models, a consistent
recall rate of between 80 and 90 percent was also found, with "surprise" being the only
36
emotion that showed preference over other emotions. This resulted in a pretty balanced
F1-score of all the emotions. Although, MLP had higher overall accuracies, CNN, as LSTM,
had an advantage of being consistent across all emotions whereas SVM and MLP had
few emotions with very less precision. Heatmap of CNN is as shown below, giving a more
detailed analysis of each emotion and its correlation.
To further understand the performance of models on emotion level, the graphs were
plotted for precision, F1 Score and recall of each emotion detected by the four models
below in Figure 22:
37
Fig 22: Statistics of Each Emotion
From the confusion matrices in Figures 18,19,20 and 21 and the graphs in Figure 22, all
the emotions were classified with very good accuracy in all the models, but ‘surprise’
emotion had the F1-score of almost 100% in every model. It was observed that the
models may have overfitted this emotion as ‘surprise’ emotion had very limited data in
comparison of other emotions. Emotions ‘Angry’ and ‘Happy’ was the most precise
emotions with good recall rate as well. Emotions ‘Disgust’ and ‘Fear’ were the emotions
which had less overall precision in the models, inferring that these emotions were not
classified with highest precision. The emotions ‘Sad’ and ‘Neutral’ were moderate in
terms of precision and recall rate.
38
5. Discussion of Results
5.1 Comparison of models:
From the results, it is clear that the RNN provided the maximum accuracy of 98% but
required roughly 64 minutes to train and 3 seconds to test, whereas the MLP model
provided an accuracy of 86% with a testing time of under a second, providing the best
accuracy to time trade-off implying that MLP could be more usable in real-time scenario.
Even though SVM trained the quickest in around 6 mins, it only could reach 72.8% which
did not quite meet the proper accuracy level and tested in around one minute, which
questions the credibility of the model, again inferring that the model was not quite ideal
in a real-world scenario. 84% accuracy was provided by CNN, however the testing of it
took 7.48 seconds, which is again not optimal in a real-time scenario. As seen from the
performance of detection of each emotion, CNN and LSTM were consistent across all
the emotions in terms of precision of each emotion, whereas MLP and SVM had lesser
precision for few of the emotions.
39
be seen from their research, combining MFCCs and MS features produced superior
results to using just MFCCs or MS alone.
Mao et al., (2014) talked about just CNN model and its classification accuracy when
trained on four datasets. The CNN models built by them achieved an accuracy of 73.7%
for SAVEE (Surrey Audio-Visual Expressed Emotion Database) data, 85.2% for EMODB
(Berlin dataset), 79.9% for DES (Danish Emotional Speech database) and 78.3% for MES
(Mandarin Emotional Speech). The features were extracted with their own algorithm.
Comparing the highest and the lowest accuracies for the datasets (highest for EMODB
and lowest for SAVEE), EMODB had recordings of 10 actors while SAVEE had just 4
English actors. One of the positive points of this paper was it explored the language/
culture barrier of speech and its underlying emotion. In comparison to this project, CNN
model had an accuracy of 84.47% performing better than most of the models in Mao et
al., (2014) as it used MFCC feature extraction technique which means that the CNN
model of this project was trained on spectral features (MFCCs) whereas the other CNN
models of Mao et al., (2014) were trained on prosodic features (pitch, tone etc.). As seen
from paper Shen, Changjun and Chen, (2011), using the same dataset, model achieved
more accuracy when trained with spectral features than prosodic features.
Convolutional Neural Network (CNN) was also employed in the paper Issa, Fatih Demirci,
and Yazici, (2020) to construct SER. The datasets used in this study differed from those
used in Mao et al., (2014). On the Berlin dataset, this article was able to reach an
accuracy of 86.1%, which was greater than the accuracy 85.2% obtained by Mao et al.,
(2014). Although the dataset was the same, the different accuracy levels were caused
by the feature extraction methods used in these articles. While Issa, Fatih Demirci, and
Yazici, (2020) used MFCCs for spectral feature extraction, Mao et al., (2014) employed
their own feature extraction technique which involved prosodic feature extraction.
While Issa, Fatih Demirci, and Yazici, (2020) focused on building the CNN model of 6 CNN
layers, Mao et al., (2014) focused on the feature extraction technique to make the SER
system more robust in nature.
While CNN model developed on the dataset RAVDESS (Ryerson Audio-Visual Database
of Emotional Speech and Song) employed in paper Issa, Fatih Demirci, and Yazici, (2020)
showed an accuracy of only 71.61 percent, RNN technique used to develop the SER
system in Sai, et al., (2022) on the same RAVDESS dataset showed an accuracy around
95%, suggesting that RNN could be a better deep learning algorithm than CNN for SER
systems. Despite the fact that the RNN in Sai, et al., (2022) reached an accuracy of 95%,
the RNN in this project achieved roughly 98% as a result of a far more robust and varied
40
dataset and more data than RAVDESS (RAVDESS contains only 1440 files, whereas this
project had a dataset of 10,242 files). Another differentiating factor was that the
features were extracted using Short Time Fourier Transform (STFT) by Sai, et al., (2022),
whereas this project used MFCCs indicating that the feature extraction method also has
a significant impact on the variation in model accuracy.
5.5 Limitations:
One of the major limitations of this project was that the datasets taken had the voice
recordings being acted out by professional actors in professional environment like silent
rooms or recording studios, whereas in real-time applications, the emotions must be
detected for people that have a wide range of emotional aptitude with complex or mixed
emotions and the SER systems should also be accustomed to real-world environment
which may contain noise and other characteristics. Another aspect to be considered is
the effects of language and culture on SER. For instance, there may be variances in the
way emotions are expressed in speech across different languages. According to Akçay
and Oğuz, (2020), there are many ongoing studies for development of cross-language
SER models. However, with the current models and techniques, the results are not as
promising as required. Another aspect is to attain the required accuracy. In order to
increase emotion recognition accuracy and reduce the computational intricacy of the
entire model, there are numerous obstacles in the literature and research of SER. Due
to these challenges, a right balance of accuracy vs model complexity is yet to be
originated. Generalization is also one of the challenges that SER faces, in terms of the
models giving the same accuracy and testing time with every dataset the model is being
trained and tested with. Also, the present SER models detect just basic emotions, mostly
the common emotions of sad, happy, anger, disgust, neutrality, and fear. But, in real life
scenario a human subject may emote complicated or mixed emotions as well. With the
models built currently, those emotions may either be undetected or classified
incorrectly.
42
6. Conclusion
To summarize, a Speech Emption Recognition system will have an enormous impact on
human-Artificial Intelligence (AI) interaction with an extensive range of applications. A
breakthrough to get the right model-accuracy model to be able to deploy the system in
real-world applications is still a topic for research. According to Kerkeni et al., (2020),
There are still many unknowns regarding the ideal method for categorizing emotions.
The rate at which emotions can be detected varies depending on the combination of
emotional traits. What characteristics affect the recognition of emotion in speech is still
a subject of discussion among scholars. In this project, the SER system is developed on a
dataset merged with two known datasets. The combined dataset is used to train and
test the four models for comparison. For this project, the best model was MLP classifier
as it could nearly attain the right accuracy-time balance giving over 86% accuracy and
approximately 7 mins for training time. RNN proved to be the best in terms of accuracy
but had the disadvantage of training time of over 63 mins and test time of over 2
seconds. Though the robustness of this model is still open to research, the basic
emotions were correctly classified. As for the limitations, generalization is a major
concern across the research and real-time applications. Also, classification of real-time
mixed or complex emotions is a challenge yet to be solved.
Future Scope:
▪ A possibility is to make it a more generalized version of models to be able to
overcome overfitting. This would also help in maintaining the accuracy level given
the kind of raw, real-time data generally available in organizations.
▪ The exploration of detection of changing emotions during a brief period using the SER
system can also be useful for real-world applications.
▪ To overcome the limitations of getting real-time emotions for training the models,
the possibility of getting real-life data from established organizations like call-centers,
NGOs etc. to train data can be explored.
▪ Another possibility is to research/build models which can detect complicated
emotions or mixed emotions.
▪ At present many researchers tend to develop new models based on supervised
learning techniques, further research may lead the researchers to explore the
possibility of using unsupervised training techniques.
▪ Exploration of combination facial expressions along with the speech to enhance the
recognition of emotions which will also help in interfacing AI robots.
43
▪ Another possibility of using a combination of sporadic-spectral features or
combination of MFCC-MS or MFCC-LLD features can be explored to improve the SER
system’s robustness.
▪ Combination of models, for example CNN+LSTM, maybe explored as it may prove to
be a more a far superior algorithm in building SER.
44
7. References
Abirami, S. and Chitra, P., 2020. Energy-efficient edge based real-time healthcare
support system. Advances in Computers, pp.339-368.
Akçay, M. and Oğuz, K., 2020. Speech emotion recognition: Emotional models,
databases, features, preprocessing methods, supporting modalities, and
classifiers. Speech Communication, 116, pp.56-76.
Anderson, M., 2022. The Future of Emotion Recognition in Machine Learning. [online]
Iflexion. Available at: <https://www.iflexion.com/blog/emotion-recognition-software>
[Accessed 5 July 2022].
Awasthi, S., 2020. SEVEN MOST POPULAR SVM KERNELS. [Online] Available at:
https://dataaspirant.com/svm-kernels/#t-1608054630734 [Accessed 17 December
2020].
Burnwal, S., 2020. Speech Emotion Recognition. [online] Kaggle.com. Available at:
<https://www.kaggle.com/code/shivamburnwal/speech-emotion-
recognition/notebook> [Accessed 1 January 2020].
Chavhan, Y., Dhore, M. and Yesaware, P., 2010. Speech Emotion Recognition using
Support Vector Machine. International Journal of Computer Applications, 1(20), pp.8-
11.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W. and
Taylor, J., 2001. Emotion recognition in human-computer interaction. IEEE Signal
Processing Magazine, 18(1), pp.32-80.
Dertat, A., 2017. Applied Deep Learning - Part 4: Convolutional Neural Networks. [online]
Applied Deep Learning - Part 4: Convolutional Neural Networks. Available at:
<https://towardsdatascience.com/applied-deep-learning-part-4-convolutional-neural-
networks-584bc134c1e2> [Accessed 8 November 2017].
45
Fayek, H., Lech, M. and Cavedon, L., 2017. Evaluating deep learning architectures for
Speech Emotion Recognition. Neural Networks, 92, pp.60-68.
Femia, K., 2020. Study Proves That Horses Can Understand Human Emotions. [Online]
Available at: https://ihearthorses.com/horses-understand-human-emotions/
[Accessed 25 November 2020].
G.S., D. S., P., C., and B., V. (2016). SVM Based Speech Emotion Recognition Compared
with GMM-UBM and NN. IJESC, 6.
Hochreiter, S., Younger, A.S. and Conwell, P.R., 2001, August. Learning to learn using
gradient descent. In International conference on artificial neural networks (pp. 87-94).
Springer, Berlin, Heidelberg.
Hu, H., Xu, M. and Wu, W., 2007. GMM Supervector Based SVM with Spectral Features
for Speech Emotion Recognition. 2007 IEEE International Conference on Acoustics,
Speech and Signal Processing - ICASSP '07,.
Issa, D., Fatih Demirci, M. and Yazici, A., 2020. Speech emotion recognition with deep
convolutional neural networks. Biomedical Signal Processing and Control, 59, p.101894.
Kerkeni, L., Serrestou, Y., Mbarki, M., Raoof, K. and Mahjoub, M., 2018. Speech Emotion
Recognition: Methods and Cases Study. Proceedings of the 10th International
Conference on Agents and Artificial Intelligence,.[176]
Lugovic, S., Dunder, I. and Horvat, M., 2016. Techniques and applications of emotion
recognition in speech. 2016 39th International Convention on Information and
Communication Technology, Electronics and Microelectronics (MIPRO),.
46
Mao, Q., Dong, M., Huang, Z. and Zhan, Y., 2014. Learning Salient Features for Speech
Emotion <newline/>Recognition Using Convolutional <newline/>Neural
Networks. IEEE Transactions on Multimedia, 16(8), pp.2203-2213.
Marius, H., 2020. Multiclass Classification with Support Vector Machines (SVM), Kernel
Trick & Kernel Functions. [online], Dual Problem and Kernel Functions. Available at:
<https://towardsdatascience.com/multiclass-classification-with-support-vector-
machines-svm-kernel-trick-kernel-functions-f9d5377d6f02> [Accessed 9 June 2020].
Mesevage, T. G., 2021. What Is Data Preprocessing & What Are The Steps Involved?
[Online] Available at: https://monkeylearn.com/blog/data-preprocessing/ [Accessed 24
May 2021].
MevonAI, https://tinyurl.com/ze94b7ud
Mirsamadi, S., Barsoum, E. and Zhang, C., 2017. Automatic speech emotion recognition
using recurrent neural networks with local attention. 2017 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),.
Mustaqeem and Kwon, S., 2019. A CNN-Assisted Enhanced Audio Signal Processing for
Speech Emotion Recognition. Sensors, 20(1), p.183.
Nakatsu, R., Nicholson, J. and Tosa, N., 1999. Emotion recognition and its application to
computer agents with spontaneous interactive capabilities. Proceedings of the seventh
ACM international conference on Multimedia (Part 1) - MULTIMEDIA '99,.
Nwe, T., Foo, S. and De Silva, L., 2003. Speech emotion recognition using hidden Markov
models. Speech Communication, 41(4), pp.603-623.
47
Rocha, W., Prado, C. and Blonder, N., 2020. Comparison of Chemometric Problems in
Food Analysis using Non-Linear Methods. Molecules, 25(13), p.3025.
Sai, R. N. D., Shahbaaz, S. & Prakash, U., 2022. Speech Emotion Recognition using LSTM
and RNN. Compliance Engineering Journal, 13(2), pp. 37-42.
Shen, P., Changjun, Z. and Chen, X., 2011. Automatic Speech Emotion Recognition using
Support Vector Machine. Proceedings of 2011 International Conference on Electronic
& Mechanical Engineering and Information Technology,.
Srivastava, P., 2017. Essentials of Deep Learning : Introduction to Long Short Term
Memory. [Online] Available at:
https://www.analyticsvidhya.com/blog/2017/12/fundamentals-of-deep-learning-
introduction-to-lstm/ [Accessed 18 May 2020].
Sudhakar, R. and Anil, M., 2015. Analysis of Speech Features for Emotion Detection: A
Review. 2015 International Conference on Computing Communication Control and
Automation,.
Swamy, S., 2022. RECENT TRENDS IN TIME SERIES FORECASTING-A SURVEY. [online]
Available at: <https://www.researchgate.net/figure/Rolled-out-RNN-at-different-time-
steps-Long-Short-Term-Memory-8-is-one-of-the-popular_fig1_341151347> [Accessed 1
April 2020].
Takimoglu, A., 2021. What is Data Augmentation? Techniques & Examples in 2022.
[Online] Available at: https://research.aimultiple.com/data-
augmentation/#:~:text=Data%20augmentation%20is%20useful%20to%20improve%20
performance%20and,sufficient%2C%20the%20model%20performs%20better%20and%
20more%20accurately. [Accessed 20 june 2022].
48
Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M., Schuller, B. and
Zafeiriou, S., 2016. Adieu features? End-to-end speech emotion recognition using a deep
convolutional recurrent network. 2016 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP),.
Wei Han, Cheong-Fat Chan, Chiu-Sing Choy and Kong-Pang Pun, n.d. An efficient MFCC
extraction method in speech recognition. 2006 IEEE International Symposium on Circuits
and Systems,.
Williams, C.E. and Stevens, K.N., 1981. Vocal correlates of emotional states. Speech
evaluation in psychiatry, pp.221-240.
49
8. Appendix
Below is the attached file for code.
50
from keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Dropout,
BatchNormalization, Activation, LSTM
from keras.callbacks import ModelCheckpoint
import librosa
import librosa.display
from matplotlib.pyplot import specgram
import os
import sys
import warnings
import keras
# Importing sklearn library
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
import time
# ignore warnings
if not sys.warnoptions:
warnings.simplefilter("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
In [ ]:
We first start with the CREMA-D dataset. The dataset is downloaded and then
converted to a dataframe.
Looking in indexes: https://pypi.org/simple, (https://pypi.org/simple,) https://us-
python.pkg.dev/colab-wheels/public/s
imple/ (https://us-python.pkg.dev/colab-wheels/public/simple/)
Requirement already satisfied: kaggle in /usr/local/lib/python3.7/dist-packages
(1.5.12)
Requirement already satisfied: urllib3 in /usr/local/lib/python3.7/dist-packages (from
kaggle) (1.24.3)
51
Requirement already satisfied: python-dateutil in /usr/local/lib/python3.7/dist-
packages (from kaggle) (2.8.2)
Requirement already satisfied: certifi in /usr/local/lib/python3.7/dist-packages (from
kaggle) (2022.6.15)
Requirement already satisfied: python-slugify in /usr/local/lib/python3.7/dist-packages
(from kaggle) (6.1.2)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages
(from kaggle) (2.23.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.7/dist-packages (from
kaggle) (4.64.0)
Requirement already satisfied: six>=1.10 in /usr/local/lib/python3.7/dist-packages
(from kaggle) (1.15.0)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/lib/python3.7/dist-
packages (from python-slugify->kagg
le) (1.3)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-
packages (from requests->kaggle) (3.
0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages
(from requests->kaggle) (2.10)
Choose Files No file chosen
Upload widget is only available when the cell has been executed in the current browser
session. Please rerun this cell to enable.
Saving kaggle.json to kaggle.json
The uploaded file name: "kaggle.json" with length 68 bytes
# Install Kaggle API to be able to access Kaggle datasets
!pip install kaggle
from google.colab import files
files_upload = files.upload()
for fn in files_upload.keys():
print('The uploaded file name: "{name}" with length {length} bytes'.format(
name=fn, length=len(files_upload[fn])))
52
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600
~/.kaggle/kaggle.json
In [ ]:
inflating: AudioWAV/1031_ITH_FEA_XX.wav
inflating: AudioWAV/1031_ITH_HAP_XX.wav
inflating: AudioWAV/1031_ITH_NEU_XX.wav
inflating: AudioWAV/1031_ITH_SAD_XX.wav
inflating: AudioWAV/1031_ITS_ANG_XX.wav
inflating: AudioWAV/1031_ITS_DIS_XX.wav
inflating: AudioWAV/1031_ITS_FEA_XX.wav
inflating: AudioWAV/1031_ITS_HAP_XX.wav
inflating: AudioWAV/1031_ITS_NEU_XX.wav
inflating: AudioWAV/1031_ITS_SAD_XX.wav
inflating: AudioWAV/1031_IWL_ANG_XX.wav
inflating: AudioWAV/1031_IWL_DIS_XX.wav
inflating: AudioWAV/1031_IWL_FEA_XX.wav
inflating: AudioWAV/1031_IWL_HAP_XX.wav
inflating: AudioWAV/1031_IWL_NEU_XX.wav
inflating: AudioWAV/1031_IWL_SAD_XX.wav
inflating: AudioWAV/1031_IWW_ANG_XX.wav
inflating: AudioWAV/1031_IWW_DIS_XX.wav
inflating: AudioWAV/1031 IWW FEA XX.wav
# Downloading CREMA-D dataset and unzipping to make it available in
# the google colab directory
!kaggle datasets download "ejlok1/cremad"
!unzip cremad.zip
In [ ]:
Out[27]: Emotions Path
0 fear /content/AudioWAV/1054_TIE_FEA_XX.wav
1 disgust /content/AudioWAV/1079_ITH_DIS_XX.wav
# Creating dataframe with the given data
crema = '/content/AudioWAV'
53
crema_directory_list = os.listdir(crema)
file_emotion = []
file_path = []
for file in crema_directory_list:
# storing file paths
file_path.append(crema + '/' + file)
# storing file emotions
part=file.split('_')
if part[2] == 'SAD':
file_emotion.append('sad')
elif part[2] == 'ANG':
file_emotion.append('angry')
elif part[2] == 'DIS':
file_emotion.append('disgust')
elif part[2] == 'FEA':
file_emotion.append('fear')
elif part[2] == 'HAP':
file_emotion.append('happy')
elif part[2] == 'NEU':
file_emotion.append('neutral')
else:
file_emotion.append('Unknown')
54
3 fear /content/AudioWAV/1087_WSI_FEA_XX.wav
4 angry /content/AudioWAV/1033_DFA_ANG_XX.wav
Streaming output truncated to the last 5000 lines.
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_back_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bar_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_base_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bath_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bean_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_beg_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bite_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_boat_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bone_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_book_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_bought_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_burn_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_cab_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_calm_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_came_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_cause_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_chain_angry.wav
inflating: TESS Toronto emotional speech set data/OAF_angry/OAF_chair_angry.wav
In [ ]:
Merging data into one dataframe
Out[6]: Emotions Path
0 angry /content/TESS Toronto emotional speech set dat...
1 angry /content/TESS Toronto emotional speech set dat...
2 angry /content/TESS Toronto emotional speech set dat...
55
3 angry /content/TESS Toronto emotional speech set dat...
4 angry /content/TESS Toronto emotional speech set dat...
# Creating dataframe with the given data
Tess = '/content/TESS Toronto emotional speech set data'
tess_directory_list = os.listdir(Tess)
file_emotion = []
file_path = []
for dir in tess_directory_list:
directories = os.listdir(Tess + '/'+ dir) #Tess +
for file in directories:
part = file.split('.')[0]
part = part.split('_')[2]
if part=='ps':
file_emotion.append('surprise')
else:
file_emotion.append(part)
file_path.append(Tess + '/' + dir + '/' + file)
56
2 sad /content/AudioWAV/1020_IEO_SAD_LO.wav
3 fear /content/AudioWAV/1088_TAI_FEA_XX.wav
4 neutral /content/AudioWAV/1003_WSI_NEU_XX.wav
# creating a Dataframe using the 2 dataframes created.
audiopath_data = pd.concat([Crema_df, Tess_df], axis = 0)
audiopath_data.to_csv("data_path.csv",index=False)
audiopath_data.head()
file_name = '/content/AudioWAV/1088_TAI_FEA_XX.wav'
data, sampling_rate = librosa.load(file_name)
plt.figure(figsize=(15, 5))
librosa.display.waveplot(data, sr=sampling_rate)
plt.title('Fear')
plt.ylabel('Amplitude')
Out[32]: Text(0, 0.5, 'Amplitude')
Feature Extraction
57
Now, We extract the features, from the given audio files in the dataframes
In [ ]: ## Code taken from Kaggle##
# Definitions of functions to be used for feature extraction
def noise(data):
noise_amp = 0.035*np.random.uniform()*np.amax(data)
data = data + noise_amp*np.random.normal(size=data.shape[0])
return data
def stretch(data, rate=0.8):
return librosa.effects.time_stretch(data, rate)
def shift(data):
shift_range = int(np.random.uniform(low=-5, high = 5)*1000)
return np.roll(data, shift_range)
def pitch(data, sampling_rate, pitch_factor=0.7):
return librosa.effects.pitch_shift(data, sampling_rate, pitch_factor)
def higher_speed(data, speed_factor = 1.25):
return librosa.effects.time_stretch(data, speed_factor)
def lower_speed(data, speed_factor = 0.75):
return librosa.effects.time_stretch(data, speed_factor)
In [ ]: ## Code taken from Kaggle##
# Extracting Features using MFCCs
def extract_features(data):
result = np.array([])
mfccs = librosa.feature.mfcc(y=data, sr=22050, n_mfcc=58)
mfccs_processed = np.mean(mfccs.T,axis=0)
result = np.array(mfccs_processed)
return result
def get_features(path):
# duration and offset are used to take care of the no audio in start and the ending of
each audio files as seen above
data, sample_rate = librosa.load(path, duration=3, offset=0.5, res_type='kaiser_fast')
#without augmentation
58
res1 = extract_features(data)
result = np.array(res1)
#noised
noise_data = noise(data)
res2 = extract_features(noise_data)
result = np.vstack((result, res2)) # stacking vertically
#stretched
stretch_data = stretch(data)
res3 = extract_features(stretch_data)
result = np.vstack((result, res3))
#shifted
shift_data = shift(data)
res4 = extract_features(shift_data)
result = np.vstack((result, res4))
#pitched
pitch_data = pitch(data, sample_rate)
res5 = extract_features(pitch_data)
result = np.vstack((result, res5))
#speed up
higher_speed_data = higher_speed(data)
res6 = extract_features(higher_speed_data)
Out[11]: 0 1 2 3 4 5 6 7 8 9 ... 49 50
0 -372.859833 158.552551 -5.121318 38.753338 -12.569175 10.467646 -17.361149 -
3.500664 -10.303802 -4.179052 ... -1.381174 -2.042949 1.837
1 -235.423931 50.915942 23.368699 9.832177 3.868557 -3.130746 -7.366017 -
7.309678 -5.633727 -1.876884 ... -1.490963 -0.792193 1.351
59
2 -406.562378 157.783981 -6.387405 38.337200 -13.279373 8.636255 -18.171515 -
3.022032 -11.198946 -4.556575 ... -1.518174 -2.668355 2.245
3 -371.539337 157.163910 -4.820157 38.652512 -12.831516 10.374809 -17.519386 -
3.489297 -10.154447 -4.095973 ... -1.122553 -1.963099 1.756
4 -397.103271 153.161499 -4.517867 33.218170 -14.287848 3.471455 -20.819828 -
1.789339 -13.161255 0.570050 ... -0.895595 3.226112 1.632
5 rows × 59 columns
result = np.vstack((result, res6))
#speed down
lower_speed_data = higher_speed(data)
res7 = extract_features(lower_speed_data)
result = np.vstack((result, res7))
return result
## Code taken from Kaggle##
## Calling the above defined functions for feature extraction
X_features, Y_labels = [], []
for path, emotion in zip(audiopath_data.Path, audiopath_data.Emotions):
feature = get_features(path)
for elements in feature:
X_features.append(elements)
# appending emotion 3 times as we have made 3 augmentation techniques on each
audio file.
Y_labels.append(emotion)
## Creating a dataframe with features
Features_DF = pd.DataFrame(X_features)
Features_DF['labels'] = Y_labels
Features_DF.to_csv('features.csv', index=False)
Features_DF.head()
Data Preparation
In [ ]:
60
# Rereading the saved features from the features csv
Features_DF = pd.read_csv('features.csv')
X_features = Features_DF.iloc[: ,:-1].values
Y_labels = Features_DF['labels'].values
emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Nuetral', 'Sad', 'Surprise']
# As this is a multiclass classification, using onehotencoder for our labels.
label_encoder = OneHotEncoder()
Y_labels = label_encoder.fit_transform(np.array(Y_labels)
.reshape(-1,1))\
.toarray()
# splitting the data into train and test set
x_train, x_test, y_train, y_test = train_test_split(X_features,
Y_labels,
random_state=0,
shuffle=True)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
# scaling our data
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
x_train.shape, y_train.shape, x_test.shape, y_test.shape
Now, we build the CNN model and train it using the given train data. This model will
then be validated against the test data from our dataset
In [ ]:
Out[6]: ((53770, 58, 1), (53770, 7), (17924, 58, 1), (17924, 7))
# reshaping the data for CNN model compatibility
61
x_train_cnn = np.expand_dims(x_train, axis=2)
x_test_cnn = np.expand_dims(x_test, axis=2)
y_train_cnn = y_train
y_test_cnn = y_test
x_train_cnn.shape, y_train_cnn.shape, x_test_cnn.shape, y_test_cnn.shape
In [ ]:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d (Conv1D) (None, 58, 256) 1536
62
CNN_model.add(Activation('relu'))
CNN_model.add(Flatten())
CNN_model.add(Dense(7))
CNN_model.add(Activation('softmax'))
opt = keras.optimizers.RMSprop(learning_rate=0.01, decay=1e-6)
CNN_model.compile(loss='categorical_crossentropy',
optimizer=opt,metrics=['accuracy'])
CNN_model.summary()
activation_2 (Activation) (None, 7, 128) 0
63
_accuracy: 0.8427 - lr: 0.0016
Epoch 48/50
211/211 [==============================] - 98s 464ms/step - loss: 0.1985 -
accuracy: 0.9312 - val_loss: 0.6093 - val
_accuracy: 0.8402 - lr: 0.0016
Epoch 49/50
211/211 [==============================] - 98s 464ms/step - loss: 0.1964 -
accuracy: 0.9320 - val_loss: 0.6045 - val
_accuracy: 0.8374 - lr: 0.0016
Epoch 50/50
211/211 [==============================] - 98s 464ms/step - loss: 0.1957 -
accuracy: 0.9325 - val_loss: 0.6062 - val
_accuracy: 0.8447 - lr: 0.0016
CNN model trained in 4888.570144891739 s
# Training the CNN model on our training data
learning_rate = ReduceLROnPlateau(monitor='loss',
factor=0.4,
verbose=0,
patience=2,
min_lr=0.0000001)
starting_time = time.time()
history_CNN = CNN_model.fit(x_train_cnn, y_train_cnn,
batch_size=256,
epochs=50,
validation_data=(x_test_cnn, y_test_cnn),
callbacks=[learning_rate])
print("CNN model trained in %s s " % str(time.time() - starting_time))
In [ ]: # Plotting the graphs for training and test accuracy
epochs = [i for i in range(50)]
fig , ax = plt.subplots(1,2)
train_acc = history_CNN.history['accuracy']
train_loss = history_CNN.history['loss']
test_acc = history_CNN.history['val_accuracy']
64
test_loss = history_CNN.history['val_loss']
fig.set_size_inches(20,6)
ax[0].plot(epochs , train_loss , label = 'Training Loss')
ax[0].plot(epochs , test_loss , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")
ax[1].plot(epochs , train_acc , label = 'Training Accuracy')
ax[1].plot(epochs , test_acc , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
In [ ]:
In [ ]:
CNN model tested in 7.4804136753082275 s
The actual emotions: [['angry']
['fear']
['angry']
...
['neutral']
['disgust']
['neutral']] The predicted emotions: [['angry']
['fear']
['angry']
...
['neutral']
['disgust']
['neutral']]
Accuracy: 84.47%
# Determining the predictive quality of the network
starting_time = time.time()
y_pred_cnn = CNN_model.predict(x_test_cnn)
65
print("CNN model tested in %s s " % str(time.time() - starting_time))
y_pred_cnn = label_encoder.inverse_transform(y_pred_cnn)
ytest_dec_cnn = label_encoder.inverse_transform(y_test_cnn)
print('The actual emotions:', ytest_dec_cnn, 'The predicted emotions:', y_pred_cnn)
# Determining the accuracy of our model
accuracy_CNN = accuracy_score(y_true = ytest_dec_cnn ,
y_pred = y_pred_cnn)
# Printing the accuracy of our MLP model
print("Accuracy: {:.2f}%".format(accuracy_CNN*100))
MLP (Multi-Layer Perceptron) Model
In [ ]:
In [ ]:
In [ ]:
MLP model trained in 393.80851912498474 s
# Initialising the train and test data
x_train_mlp = x_train
x_test_mlp = x_test
y_train_mlp = y_train
y_test_mlp = y_test
# Defining the Multi Layer Perceptron Classifier
MLP_model = MLPClassifier(alpha=0.06,
batch_size=256,
epsilon=1e-08,
hidden_layer_sizes=(250, 150),
learning_rate='adaptive',
max_iter=400)
# Training the MLP model against the training data
import time
starting_time = time.time()
MLP_model.fit(x_train_mlp,y_train_mlp)
print("MLP model trained in %s s " % str(time.time() - starting_time))
In [ ]:
MLP model tested in 0.19156312942504883 s
66
[[1 0 0 ... 0 0 0]
[0 0 1 ... 0 0 0]
[1 0 0 ... 0 0 0]
...
[0 0 0 ... 1 0 0]
[0 1 0 ... 0 0 0]
[0 0 0 ... 1 0 0]]
The actual emotions: [['angry']
['fear']
['angry']
...
['neutral']
['disgust']
['neutral']]
The predicted emotions: [['angry']
['fear']
['angry']
...
# Testing the predictive quality of the model:
starting_time = time.time()
y_pred_mlp = MLP_model.predict(x_test_mlp)
print("MLP model tested in %s s " % str(time.time() - starting_time))
df_temp = pd.DataFrame(y_pred_mlp)
for row in range(len(df_temp)):
if np.sum(df_temp.loc[row,:]) == 0:
df_temp.loc[row,2] = 1
y_pred_mlp = pd.DataFrame.to_numpy(df_temp)
print(y_pred_mlp)
y_pred_mlp = label_encoder.inverse_transform(y_pred_mlp)
y_test_mlp = label_encoder.inverse_transform(y_test_mlp)
print('The actual emotions:', y_test_mlp)
print('The predicted emotions:', y_pred_mlp)
In [ ]:
67
SVM (Support Vector Machine) Model
In [ ]:
In [ ]:
In [ ]:
In [ ]:
['neutral']
['disgust']
['neutral']]
Accuracy: 86.29%
Out[51]: ((57355, 58), (57355,), (14339, 58), (14339,))
# Determining the accuracy of our model
accuracy_MLP= accuracy_score(y_true = y_test_mlp,
y_pred = y_pred_mlp)
# Printing the accuracy of our MLP model
print("Accuracy: {:.2f}%".format(accuracy_MLP*100))
# Rereading the saved features from the features csv
Features_DF = pd.read_csv('features.csv')
X_features = Features_DF.iloc[: ,:-1].values
Y_labels = Features_DF['labels'].values
# Initialising the train and test data
x_train_svm, x_test_svm, y_train_svm, y_test_svm = train_test_split(X_features,
Y_labels,
test_size = 0.20)
# scaling our data
scaler = StandardScaler()
x_train_svm = scaler.fit_transform(x_train_svm)
x_test_svm = scaler.transform(x_test_svm)
x_train_svm.shape, y_train_svm.shape, x_test_svm.shape, y_test_svm.shape
from sklearn.svm import SVC
svmclassifier = SVC(kernel = 'rbf')
LSTM (Long Short Term Memory) Model
In [ ]:
Trained model in 369.50613021850586 s
68
Tested model in 73.30016803741455 s
Predicted values: ['sad' 'fear' 'disgust' 'fear' 'fear' 'disgust' 'fear' 'angry' 'disgust'
'sad']
Actual values: ['sad' 'sad' 'disgust' 'fear' 'fear' 'disgust' 'fear' 'disgust' 'disgust'
'sad']
----accuracy score 72.80842457633028 ----
import time
starting_time = time.time()
svmclassifier.fit(x_train_svm, y_train_svm)
print("Trained model in %s s " % str(time.time() - starting_time))
# Predictive model
starting_time = time.time()
y_pred_svm = svmclassifier.predict(x_test_svm)
print("Tested model in %s s " % str(time.time() - starting_time))
print('Predicted values: ',y_pred_svm[:10])
print('Actual values: ', y_test_svm[:10])
acc = float(accuracy_score(y_test_svm,y_pred_svm))*100
print("----accuracy score %s ----" % acc)
# Rereading the saved features from the features csv
Features_DF = pd.read_csv('features.csv')
X_features = Features_DF.iloc[: ,:-1].values
Y_labels = Features_DF['labels'].values
In [ ]:
In [ ]:
Out[97]: ((51619, 58, 1), (14339, 58, 1), (51619, 7), (14339, 7))
# As this is a multiclass classification, using onehotencoder for our labels.
label_encoder = OneHotEncoder()
Y_labels = label_encoder.fit_transform(np.array(Y_labels)
.reshape(-1,1))\
.toarray()
x_train_rnn, x_test_rnn, y_train_rnn, y_test_rnn = train_test_split(
X_features,
Y_labels,
69
test_size=0.2)
x_train_rnn, x_validation_rnn, y_train_rnn, y_validation_rnn = train_test_split(
x_train_rnn,
y_train_rnn,
test_size=0.1)
# making our data compatible to model.
x_train_rnn = np.expand_dims(x_train_rnn, axis=-1)
x_test_rnn = np.expand_dims(x_test_rnn, axis=-1)
x_validation_rnn = np.expand_dims(x_validation_rnn, axis=-1)
x_train_rnn.shape, x_test_rnn.shape, y_train_rnn.shape, y_test_rnn.shape
In [ ]:
In [ ]:
Model: "sequential_5"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
lstm_5 (LSTM) (None, 200) 161600
70
model_RNN.add(Dense(512))
model_RNN.add(Dense(y_train_rnn.shape[1], activation='softmax'))
model_RNN.summary()
model_RNN.compile(loss='binary_crossentropy', optimizer='Nadam',
metrics=['accuracy'])
# Defining callbacks - learning rate
learning_rate_rnn = ReduceLROnPlateau(monitor='loss',
factor=0.4,
verbose=0,
patience=2,
min_lr=0.000001)
In [ ]:
_accuracy: 0.8878 lr: 0.0010
Epoch 55/60
454/454 [==============================] - 64s 141ms/step - loss: 0.0175 -
accuracy: 0.9796 - val_loss: 0.1016 - val
_accuracy: 0.9070 - lr: 4.0000e-04
Epoch 56/60
454/454 [==============================] - 65s 143ms/step - loss: 0.0114 -
accuracy: 0.9873 - val_loss: 0.1102 - val
_accuracy: 0.9070 - lr: 4.0000e-04
Epoch 57/60
454/454 [==============================] - 64s 141ms/step - loss: 0.0117 -
accuracy: 0.9874 - val_loss: 0.1154 - val
_accuracy: 0.9033 - lr: 4.0000e-04
Epoch 58/60
454/454 [==============================] - 64s 141ms/step - loss: 0.0130 -
accuracy: 0.9859 - val_loss: 0.1111 - val
_accuracy: 0.9093 - lr: 4.0000e-04
Epoch 59/60
454/454 [==============================] - 64s 141ms/step - loss: 0.0066 -
accuracy: 0.9932 - val_loss: 0.1234 - val
_accuracy: 0.9118 - lr: 1.6000e-04
71
Epoch 60/60
454/454 [==============================] - 66s 145ms/step - loss: 0.0046 -
accuracy: 0.9954 - val_loss: 0.1325 - val
_accuracy: 0.9135 - lr: 1.6000e-04
RNN-LSTM model trained in 3803.1767585277557 s
starting_time = time.time()
history_rnn = model_RNN.fit(x_train_rnn, y_train_rnn,
validation_data=(x_validation_rnn, y_validation_rnn),
batch_size= 128,
epochs=60,
callbacks = [learning_rate_rnn])
print("RNN-LSTM model trained in %s s " % str(time.time() - starting_time))
In [ ]: # Plotting the graphs for training and test accuracy
epochs = [i for i in range(60)]
fig , ax = plt.subplots(1,2)
train_acc_rnn = history_rnn.history['accuracy']
train_loss_rnn = history_rnn.history['loss']
test_acc_rnn = history_rnn.history['val_accuracy']
test_loss_rnn = history_rnn.history['val_loss']
fig.set_size_inches(20,6)
ax[0].plot(epochs , train_loss_rnn , label = 'Training Loss')
ax[0].plot(epochs , test_loss_rnn , label = 'Testing Loss')
ax[0].set_title('Training & Testing Loss')
ax[0].legend()
ax[0].set_xlabel("Epochs")
ax[1].plot(epochs , train_acc_rnn , label = 'Training Accuracy')
ax[1].plot(epochs , test_acc_rnn , label = 'Testing Accuracy')
ax[1].set_title('Training & Testing Accuracy')
ax[1].legend()
ax[1].set_xlabel("Epochs")
plt.show()
In [ ]:
Test Accuracy: 98.13560843467712
72
test_loss, test_acc = model_RNN.evaluate(x_test_rnn, y_test_rnn, verbose=0)
print("Test Accuracy: ",test_acc*100)
In [ ]:
In [ ]:
RNN-LSTM model tested in 2.2305257320404053 s
The actual emotions: [['neutral']
['happy']
['neutral']
['happy']
['happy']
['angry']
['angry']
['disgust']
['sad']
['fear']]
The predicted emotions: [['neutral']
['happy']
['neutral']
['disgust']
['happy']
['angry']
['angry']
['disgust']
['sad']
['fear']]
Accuracy: 98.14%
starting_time = time.time()
y_pred_rnn = model_RNN.predict(x_test_rnn)
print("RNN-LSTM model tested in %s s " % str(time.time() - starting_time))
y_pred_rnn = label_encoder.inverse_transform(y_pred_rnn)
y_test_rnn = label_encoder.inverse_transform(y_test_rnn)
print('The actual emotions:', y_test_rnn[:10])
print('The predicted emotions:', y_pred_rnn[:10])
73
# Determining the accuracy of our model
accuracy_RNN = accuracy_score(y_true = y_test_rnn,
y_pred = y_pred_rnn)
# Printing the accuracy of our RNN model
print("Accuracy: {:.2f}%".format(accuracy_RNN*100))
Accuracy of the Given models and their comparision
Analysing Model Performances
In [ ]:
# CNN Model
## Classification report
print(classification_report(ytest_dec_cnn,y_pred_cnn))
class_cnn = classification_report(ytest_dec_cnn,y_pred_cnn, output_dict=True)
class_cnn = pd.DataFrame(class_cnn).transpose()
class_cnn.to_csv('Class_cnn.csv', index = True)
acc = float(accuracy_score(ytest_dec_cnn,y_pred_cnn))*100
print("----accuracy score %s ----" % acc)
## Confusion matrix and its heatmap
cm_cnn = confusion_matrix(ytest_dec_cnn,y_pred_cnn)
df_cm_cnn = pd.DataFrame(cm_cnn)
df_cm_cnn.to_csv('CNN.csv', index = False)
sn.heatmap(df_cm_cnn/np.sum(df_cm_cnn),
annot= True, fmt='.2%',
xticklabels= emotions,
yticklabels=emotions)
plt.show()
In [ ]:
# MLP Model
## Classification report
print(classification_report(y_test_mlp,y_pred_mlp))
class_mlp = classification_report(y_test_mlp,y_pred_mlp, output_dict=True)
class_mlp = pd.DataFrame(class_mlp).transpose()
class_mlp.to_csv('Class_mlp.csv', index = False)
74
acc = float(accuracy_score(y_test_mlp,y_pred_mlp))*100
print("----accuracy score %s ----" % acc)
## Confusion matrix and its heatmap
cm_mlp = confusion_matrix(y_test_mlp,y_pred_mlp)
df_cm_mlp = pd.DataFrame(cm_mlp)
df_cm_mlp.to_csv('MLP.csv', index=False)
sn.heatmap(df_cm_mlp/np.sum(df_cm_mlp),
annot= True, fmt='.2%',
xticklabels= emotions,
yticklabels=emotions)
plt.show()
In [ ]:
# SVM Model
## Classification report
print(classification_report(y_test_svm,y_pred_svm))
class_svm = classification_report(y_test_svm,y_pred_svm, output_dict=True)
class_svm = pd.DataFrame(class_svm).transpose()
class_svm.to_csv('Class_svm.csv', index = False)
acc = float(accuracy_score(y_test_svm,y_pred_svm))*100
print("----accuracy score %s ----" % acc)
## Confusion matrix and its heatmap
cm_svm = confusion_matrix(y_test_svm,y_pred_svm)
df_cm_svm = pd.DataFrame(cm_svm)
df_cm_svm.to_csv('SVM.csv', index=False)
sn.heatmap(df_cm_svm/np.sum(df_cm_svm),
annot= True, fmt='.2%',
xticklabels= emotions,
yticklabels=emotions)
plt.show()
In [ ]:
#RNN Model
75
## Classification report
emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad', 'Surprise']
print(classification_report(y_test_rnn,y_pred_rnn))
class_rnn = classification_report(y_test_rnn,y_pred_rnn, output_dict=True)
class_rnn = pd.DataFrame(class_rnn).transpose()
class_rnn.to_csv('Class_rnn.csv', index = False)
acc = float(accuracy_score(y_test_rnn,y_pred_rnn))*100
print("----accuracy score %s ----" % acc)
## Confusion matrix and its heatmap
cm_rnn = confusion_matrix(y_test_rnn,y_pred_rnn)
df_cm_rnn = pd.DataFrame(cm_rnn)
df_cm_rnn.to_csv('RNN.csv', index=False)
sn.heatmap(df_cm_rnn/np.sum(df_cm_rnn),
annot= True, fmt='.2%',
xticklabels= emotions,
yticklabels=emotions)
plt.show()
Analysing stats of each emotion
In [ ]: emotions = ['Angry', 'Disgust', 'Fear', 'Happy', 'Neutral', 'Sad', 'Surprise']
class_cnn = pd.read_csv('Class_cnn.csv')
class_mlp = pd.read_csv('Class_mlp.csv')
class_svm = pd.read_csv('Class_svm.csv')
class_rnn = pd.read_csv('Class_rnn.csv')
precision = pd.DataFrame()
precision['CNN'] = class_cnn['precision']
precision['MLP'] = class_mlp['precision']
precision['SVM'] = class_svm['precision']
precision['RNN'] = class_rnn['precision']
precision = precision.iloc[0:7,:]
precision.index = emotions
recall = pd.DataFrame()
76
recall['CNN'] = class_cnn['recall']
recall['MLP'] = class_mlp['recall']
recall['SVM'] = class_svm['recall']
recall['RNN'] = class_rnn['recall']
recall = recall.iloc[0:7,:]
recall.index = emotions
score = pd.DataFrame()
score['CNN'] = class_cnn['f1-score']
score['MLP'] = class_mlp['f1-score']
score['SVM'] = class_svm['f1-score']
score['RNN'] = class_rnn['f1-score']
score = score.iloc[0:7,:]
score.index = emotions
In [ ]:
# Plotting precision of each emotions in each models
precision.plot(kind = 'bar', title='emotions', figsize= (9,9))
plt.title('Precision of Emotions')
plt.xlabel('Emotions')
plt.ylabel('percent level')
In [ ]:
# Plotting recall of each emotions in each models
recall.plot(kind = 'bar', title='emotions', figsize= (9,9))
plt.title('Recall of Emotions')
plt.xlabel('Emotions')
plt.ylabel('percent level')
In [ ]:
# F1 score plot
score.plot(kind = 'bar', title='emotions', figsize= (9,9))
plt.title('F1 Score of Emotions')
plt.xlabel('Emotions')
77
plt.ylabel('percent level')
78