
Speech based emotion classification for human by introducing upgraded long short term memory (ULSTM)

Subhrajit Bhowmik1, Akshay Chatterjee2, Sampurna Biswas3, Reshmina Farhin4, Ghazaala Yasmin5
1,2,3,4,5 Department of Computer Science and Engineering, St Thomas' College of Engineering and Technology, Kolkata 700 023, India
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]

Abstract. Humans develop emotional intelligence through social behavior, by interacting with and imitating other humans, and they refine their ability to analyze different emotions by learning from experience in their surroundings. This raises the question of whether a machine can acquire a comparable capability through its artificial intelligence. A great deal of research has been conducted in this domain, and the ongoing exploration relies on deep learning models. These techniques enhance the learning capacity of the machine, which is crucial in human emotion recognition because one emotion can drift towards another type of emotion and become difficult to analyze. This theme motivated us to explore the problem. The proposed method categorizes human emotions using four different deep learning models: convolutional neural network (CNN), recurrent neural network (RNN), long short term memory (LSTM) and gated recurrent unit (GRU). These models are trained on well known physical and perceptual features. The system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Furthermore, the mentioned deep learning models are compared on this dataset in terms of the vanishing gradient problem. In addition, an upgraded LSTM model is proposed to achieve better accuracy and is tested against the existing LSTM model.

Keywords: Emotion recognition, Feature extraction, Classification model, Convolutional neural network (CNN), Recurrent neural network (RNN), Gated recurrent unit (GRU), Long short term memory (LSTM), Upgraded long short term memory (ULSTM).
1   Introduction

The current advancement in technology is the interaction between humans and computers, and there is a growing need to make these interactions more seamless by using emotions. The emotional feedback of users can help computers perform more accurate, user-customised operations, with wide application in mental health diagnosis, customer service, human counselling and many more areas. Emotions can play a big role in such interactions; lie detection, for example, is based on this concept. Hence assessing human emotions with machines is a topic that is springing up almost everywhere. Emotions are displayed visually, vocally, and through other physiological means. In this paper we are concerned with recognizing emotion through the analysis of speech. The tone of voice in speech has characteristics such as pitch, loudness, timbre, speech rate and pauses, which come under the umbrella of emotional prosody. The energy in speech can be changed intentionally to generate different feelings: studies have found that emotions such as fear, joy and anger are portrayed at higher intensities than emotions such as sadness.

Emotion has been detected in various ways in the research literature, but nowadays the advancement of artificial intelligence is pushing every system to become more intelligent, so that it can classify and solve problems with greater accuracy. For problems suited to deep neural networks, deep learning is therefore highly recommended. This has motivated us to implement a system that detects emotion through different deep learning techniques and to analyse the results using measures such as the vanishing gradient curve; the details are discussed in section 2.3. Emotions can be classified into many categories; most presented techniques use five major emotions, namely anger, disgust, fear, sadness and happiness, plus a neutral emotion. Neutral speech is speech that carries no remarkable emotion. Compared with neutral speech, anger is produced with a lower pitch, higher intensity and more energy, whereas disgust is produced with a lower, downward directed pitch. Fearful emotions have a higher pitch, little variation, lower energy, and a faster speech rate with more pauses. Sad emotions are produced with a higher pitch, less intensity but more vocal energy and longer duration with more pauses, whereas happiness is produced with a higher pitch and high intensity. The six major archetypal emotions are anger, surprise, fear, disgust, happiness and sadness. The most common approach used around the globe is prosodic features, which include pitch and energy; the other is spectral features, of which Mel-scale frequency cepstral coefficients are one example. By combining different types of features, such as prosodic and spectral features, emotion recognition from audio becomes possible. The proposed work covers eight different types of emotions, which are difficult to classify accurately with basic machine learning techniques. For this reason more advanced techniques are used, namely gated recurrent unit (GRU), convolutional neural network (CNN), recurrent neural network (RNN) and long short term memory (LSTM). A convolutional neural network is an artificial neural network that performs supervised learning by convolving high-dimensional data. A recurrent neural network is an upgraded neural network that contains directed cycles acting as memory. LSTM is an advanced version of RNN with more learning capacity. A gated recurrent unit follows a principle similar to LSTM; the difference is that it uses a set of gates to control the flow of information.

1.1 Related Work

A considerable amount of work has been introduced in this field, and research is still ongoing; the application of deep learning to this problem is attracting even more attention. Kun Han et al. [1] proposed utilizing deep neural networks (DNNs) to find the probability distribution of emotions for each speech segment, which proved effective for speech emotion recognition. Yelin Kim et al. [2] observed that automatic emotion recognition systems must bridge the gap between high-level affective content and low-level human signals, and proposed Deep Belief Network models, suggesting that high-order non-linear relationships are effective for emotion recognition. W. Q. Zheng et al. [3] implemented an effective emotion recognition system based on deep convolutional neural networks, trained systematically on labeled audio data. Aharon Satt et al. [4] proposed the concept of spectrograms, involving para-lingual information in the speech processed by a deep neural network; the speech is processed in small segments of 3 seconds each, and longer inputs are split into non-overlapping parts, which reduces prediction latency. Jinkyu Lee et al. [5] presented a recurrent neural network speech emotion recognition system trained by an efficient learning algorithm; it takes long-range context into account and models the fact that human emotions are not certain. Qirong Mao et al. [6] proposed using convolutional neural networks to learn salient features for human emotion recognition from speech. Seyedmahdad Mirsamadi et al. [7] showed that with recurrent neural networks both short-time frame-level features and an appropriate temporal aggregation of human emotions can be learnt. George Trigeorgis et al. [8] proposed a solution to the context-aware emotion recognition problem using convolutional neural networks (CNNs) combined with LSTM networks to learn the best way of processing the speech signal automatically. Samira Ebrahimi Kahou et al. [9] analysed video clips of one to two seconds to study human emotions from speech, and even analysed audio clips in which human voice appeared over background music. Zakaria Aldeneh et al. [10] showed how convolutional neural networks can obtain competitive results when applied to minimally hand-engineered features. Vladimir Chernykh et al. [11] proposed computing sequences of acoustic features from small intervals of speech and training them with a recurrent neural network. Zhong-Qiu Wang et al. [12] proposed encoding the utterance of any human emotion as a fixed-length vector taken from the last hidden layer of a deep neural network over time; this encoding improves utterance-level classification. Samira Ebrahimi Kahou et al. [13] proposed a quite different system for studying video clips, a hybrid CNN-RNN architecture that also analyses the facial expressions of the humans. Lang He et al. [14] designed an audio-video emotion channel that operates on low-level descriptors obtained from a toolbox. Mohamed R. Amer et al. [15] proposed a hybrid model able to detect human emotions easily, combining the strength of discriminative classifiers with the representational power of separate audio or video models.

For the last few years much research has aimed at devising new deep learning algorithms or improving the existing ones, and we have likewise worked on improving a specific deep learning algorithm, namely LSTM. Certain characteristics of LSTM led us to prefer it over the recurrent neural network (RNN), one of them being its larger memory: an RNN has a shorter memory, whereas an LSTM can hold more data. The remaining problem with LSTM was the vanishing gradient problem, as discussed in Section 2.3. The gated recurrent unit (GRU) recovers from the vanishing gradient problem, but in terms of memory LSTM was still leading. So, to retain the larger memory while getting rid of the vanishing gradient problem, we decided to change the architecture of the LSTM model. The LSTM model has three gates: a forget gate, an input gate and an output gate. We have added two more gates, an update gate and a reset gate, to reduce the vanishing gradient problem: first the update gate and then the reset gate. The update gate updates the previous information so that no previous neurons are wasted per epoch, and it also allows the cell to memorize the previous statements. The reset gate resets the new neurons so that the previous layer's neurons are not reused.
2   Proposed Methodology

The proposed method classifies human emotion from speech in two major phases. The first is feature extraction, which extracts the relevant set of features needed to analyse minutely the variation of emotion in a speech file. The second is classification, where well known deep learning models are used and their results are compared. Fig. 1 represents the overall architecture of the proposed system.

Fig. 1. Basic Work Flow of the proposed System 

2.1   Feature Extraction

Extraction of features plays an extremely essential role in analysing and discovering relations between different entities. Raw audio data cannot be understood directly by the models used for audio data retrieval; hence feature extraction is implemented to convert it into an understandable format. It is a process that explains most of the data in an unambiguous way and is useful in classification, prediction and recommendation algorithms. An audio signal can be thought of as a three-dimensional signal in which the three axes represent time, amplitude and frequency.

2.1.1 Energy

Let y_i(n), n = 1, ..., S_L, be the sequence of audio samples of the i-th frame, where S_L is the length of the frame. The short-term energy is computed according to eq. (1). Usually, energy is normalized by dividing it by S_L to remove the dependency on the frame length, so the energy is finally represented as in eq. (2).

E_i = \sum_{n=1}^{S_L} |y_i(n)|^2                                   (1)

E_i = \frac{1}{S_L} \sum_{n=1}^{S_L} |y_i(n)|^2                     (2)
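As an illustration (our addition, not part of the original paper), the normalized short-term energy of eq. (2) can be computed per frame with NumPy and librosa; the frame length, hop length and file name are placeholder choices:

```python
import numpy as np
import librosa

def short_term_energy(path, frame_len=1024, hop_len=512):
    """Normalized short-term energy per frame, following eq. (2)."""
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)
    return np.sum(frames ** 2, axis=0) / frame_len  # one energy value per frame

# energies = short_term_energy("speech.wav")
```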

2.1.2 Zero Crossing Rate

This feature is also abbreviated as ZCR; the zero crossing rate of an audio frame is the rate at which the signal changes sign within that frame. Put simply, ZCR is the number of times the signal changes value from negative to positive, or vice versa, divided by the length of the frame. Mathematically, ZCR can be described as in eq. (3), with the sign function defined in eq. (3.1). ZCR provides a measure of the noisiness of a signal, showing higher values for noisy signals, and it also roughly reflects the spectral characteristics of a signal. The ease of its computation and its properties have led to ZCR being used extensively in numerous applications such as music genre classification and speech detection.

ZCR_i = \frac{1}{2 S_L} \sum_{n=1}^{S_L} \left| sgn[y_i(n)] - sgn[y_i(n-1)] \right|        (3)

sgn[y_i(n)] = \begin{cases} 1, & y_i(n) \geq 0 \\ -1, & y_i(n) < 0 \end{cases}              (3.1)
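A minimal sketch of this computation (our addition; the framing parameters are placeholders as before), alongside the ready-made librosa helper:

```python
import numpy as np
import librosa

def zero_crossing_rate(y, frame_len=1024, hop_len=512):
    """ZCR per frame, following eq. (3) and (3.1)."""
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop_len)
    signs = np.sign(frames)
    signs[signs == 0] = 1  # treat zero-valued samples as positive
    return np.mean(np.abs(np.diff(signs, axis=0)), axis=0) / 2.0

# y, sr = librosa.load("speech.wav", sr=None)
# zcr_manual = zero_crossing_rate(y)
# zcr_librosa = librosa.feature.zero_crossing_rate(y)  # built-in equivalent
```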

2.1.3. Chromagram

The chroma vector is a representation of how humans relate colours to notes: we perceive the same note from two different octaves as having the same colour. Thus there are 12 possible values at each window: A, A#, B, C, C#, D, D#, E, F, F#, G and G#. These are not mutually exclusive, so a given time frame can contain more than one note, but to keep things simple only the most prominent note in a particular window is selected. A visual description of the chroma vector, the chromagram, is obtained by executing a suitable Python program utilising the librosa module.
2.1.4 Spectral Contrast

Each frame of the spectrogram S of an audio file is divided into sub-bands. For each sub-band, the energy contrast is estimated by comparing the mean energy in the top quantile (peak energy) to that in the bottom quantile (valley energy). High contrast values generally correspond to clear, narrow-band signals, while low contrast values correspond to broad-band noise. Spectral contrast provides a good representation of the relative spectral distribution and is very effective in music genre classification. A Python program is executed to compute the spectral contrast from the power spectrogram of an audio file.
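A possible implementation with librosa (our sketch, under the same placeholder file-name assumption):

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)
S = np.abs(librosa.stft(y))                               # magnitude spectrogram
contrast = librosa.feature.spectral_contrast(S=S, sr=sr)  # (n_bands + 1) x n_frames
mean_contrast = np.mean(contrast, axis=1)                 # one value per sub-band
```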

2.1.5 Mel Frequency Cepstral Coefficients (MFCC)

When sounds are generated by humans, they are filtered by the shape of the vocal tract, including the tongue, teeth and so on, and this shape determines the sound that comes out. If the shape can be determined accurately, an accurate representation of the produced phonemes can be obtained. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and MFCCs come into play to accurately represent this envelope. The Mel scale is an essential prerequisite for understanding MFCCs: it provides a relationship between the perceived frequency, or pitch, of a pure tone and the actual measured frequency. Humans are much better at perceiving small changes in pitch at low frequencies than at high frequencies. Implementing this scale makes our features match much more closely what humans hear.
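For example (our addition), 13 MFCCs per frame can be extracted with librosa and averaged over time to obtain a fixed-length feature vector for a classifier:

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 x n_frames
mfcc_vector = np.mean(mfcc, axis=1)                 # 13-dimensional summary vector
```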

2.2.2 Convolutional Neural Networks (CNN)

Also popularly abbreviated as CNN (ConvNets), convolutional neural networks are one of the essential, state-of-the-art artificial neural network architectures, utilised in image-based classification. The intuition behind CNN is the idea of convolution, which produces filtered feature maps, also sometimes referred to as kernels, stacked over each other. The procedure in a CNN follows the steps: image, filtering, ReLU, pooling, and so on, followed by vectorization and a fully connected network. A convolutional neural network serves two purposes: Conv2D for images and Conv1D for audio. The input image, in the form of a matrix, is convolved with an m x m filter, where m is generally odd. A given image might undergo convolution with n different filters of size m x m, where each filter is concerned with a definite feature to be extracted from the image, thus forming n stacked feature maps or kernels. These kernels are then passed through a ReLU layer, which replaces all negative entries of the obtained matrices with 0 and preserves the non-linearity of the image. The resultant is transferred to the pooling layer, where the size of the obtained matrices is reduced by replacing each area of size p x p with the maximum value in that area. Pooling reduces the size of the matrix we operate on, diminishing the computation without loss of information. The above steps are repeated, and at a certain stage the matrices are vectorized and sent to a neural network, which trains using back-propagation to classify the image. We have used Conv1D, since it convolves a one-dimensional audio feature sequence; it goes through the same stages, with a first layer for setting the weights, then pooling, and then a dense (fully connected) layer.
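A minimal Keras sketch of such a Conv1D pipeline (our illustration; the layer sizes, the 180-sample feature length and the choice of eight output classes are assumptions, not values given in the paper):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

model = Sequential([
    Conv1D(64, kernel_size=5, activation="relu", input_shape=(180, 1)),  # assumed feature length
    MaxPooling1D(pool_size=2),                      # pooling stage
    Conv1D(128, kernel_size=5, activation="relu"),
    MaxPooling1D(pool_size=2),
    Flatten(),                                      # vectorization
    Dense(64, activation="relu"),                   # fully connected layer
    Dense(8, activation="softmax"),                 # eight emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```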

2.2.3 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a class of neural networks that specialize in processing sequences, or sequential data. They are often used in natural language processing tasks because of their effectiveness in handling textual data. A bottleneck with CNNs is that they only work with pre-determined sizes: they take fixed-size inputs and produce fixed-size outputs. RNNs are useful because they allow variable-length sequences as both inputs and outputs. Some variants of RNN are shown in fig. 2, where inputs are red, the RNN itself is green, and outputs are blue. Consider a "many to many" RNN with inputs x_0, x_1, ..., x_n that produces outputs y_0, y_1, ..., y_n, where the x_i and y_i are vectors of arbitrary dimension. An RNN works iteratively by updating a hidden state h, a vector that can also have arbitrary dimension. At any step t, the next hidden state h_t is computed from the previous hidden state h_{t-1} and the next input x_t; similarly, the next output y_t is calculated from h_t. What makes an RNN recurrent is that it uses the same weights at every step. To be more precise, a typical vanilla RNN uses only three sets of weights to perform its calculations: W_{xh}, used for all x_t to h_t links; W_{hh}, used for all h_{t-1} to h_t links; and W_{hy}, used for all h_t to y_t links. Two biases are also used: b_h, added when calculating h_t, and b_y, added when calculating y_t. The weights are represented as matrices and the biases as vectors, and these three weights and two biases constitute the entirety of the RNN. The equations summing everything up are given in eq. (6) and (7).

h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)        (6)

y_t = W_{hy} h_t + b_y                                (7)

In eq. (6) and (7) the weights are matrices and the other variables are vectors. All the weights are applied via matrix multiplication, and the biases are added to the obtained products. tanh is used as the activation function in the first equation (other activation functions such as sigmoid can also be used; in our experiments ReLU gave better results for our dataset).
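A small sketch of eq. (6) and (7) (our addition; the dimensions are arbitrary examples):

```python
import numpy as np

hidden, in_dim, out_dim = 64, 40, 8  # assumed sizes, not taken from the paper
Wxh = np.random.randn(hidden, in_dim) * 0.01
Whh = np.random.randn(hidden, hidden) * 0.01
Why = np.random.randn(out_dim, hidden) * 0.01
bh, by = np.zeros(hidden), np.zeros(out_dim)

def rnn_step(x_t, h_prev):
    """One vanilla RNN step: eq. (6) followed by eq. (7)."""
    h_t = np.tanh(Wxh @ x_t + Whh @ h_prev + bh)
    y_t = Why @ h_t + by
    return h_t, y_t
```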

2.2.4   Gated Recurrent Unit(GRU)

Gated recurrent units, also abbreviated as GRU, are a variant of RNNs that are more convenient to train and that also mitigate the vanishing gradient problem. The vanishing gradient problem is encountered when training RNNs on long sequential time series data: the error gradient with respect to the model parameters decays to zero within the early time steps, because too many numbers smaller than 1 are multiplied together. It therefore becomes difficult for the model to learn long-term dependencies in the input time series. This issue can be resolved by using either gated recurrent units (GRU) or LSTMs (long short term memory) in place of the standard RNN cell. Compared with LSTM, GRU stands out as computationally economical, since it has fewer internal gates. The architecture of GRU is represented in fig. 4.

2.2.5   Long Short Term Memory Network (LSTM)

LSTMs help conserve the error that is back-propagated through time and layers. Because they maintain a more constant error, they allow recurrent nets to continue learning over many time steps, opening a conduit that links causes and effects remotely. LSTMs store information outside the normal flow of the recurrent network, in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell decides what to store and when to allow read, write and delete operations, via gates that open and close. In contrast to the digital storage on computers, these gates are analog, implemented with element-wise multiplication by sigmoids, which fall in the range 0 to 1. Analog has the advantage over digital of being differentiable, and hence suitable for backpropagation. These gates act on the signals they receive and, like the nodes of a neural network, they block or pass on information depending on its strength and import, which they filter with their own sets of weights. These weights, similar to the weights that modulate input and hidden states, are adjusted via the recurrent network's learning process. This means the cells learn when to allow data to enter, leave or be erased through the iterative process of making guesses, backpropagating the error, and adjusting the weights with gradient descent. An LSTM block diagram is shown in fig. 3.
2.2.6   Upgraded long short term memory (ULSTM)

The proposed method modifies the basic architecture of LSTM in order to obtain better results for the dataset. First, the previous hidden state and the current input are concatenated; we call this vector combine. Combine is fed into the forget layer, which removes non-relevant data. A candidate layer is created from combine; the candidate holds the possible values to add to the cell state. Combine is also fed into the input layer, which decides what data from the candidate should be added to the new cell state. After computing the forget layer, candidate layer and input layer, the cell state is calculated using those vectors and the previous cell state. The result is then fed into the update gate to activate the previous neurons, and then passed through the reset gate to eliminate the previous layer's neurons and activate new neurons. The output is then computed, and point-wise multiplying the output with the new cell state gives the new hidden state.
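The paper gives no explicit equations for the added gates, so the following is only our interpretation of the description above: a standard LSTM step followed by GRU-style update and reset gates applied to the cell state. The gate formulations, weight shapes and blending rule are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ulstm_step(x_t, h_prev, c_prev, W, b, Wu, bu, Wr, br):
    """One hypothetical ULSTM step: LSTM gates plus update and reset gates."""
    combine = np.concatenate([h_prev, x_t])
    z = W @ combine + b
    f, i, o, g = np.split(z, 4)
    f, i, o, g = sigmoid(f), sigmoid(i), sigmoid(o), np.tanh(g)
    c_lstm = f * c_prev + i * g                  # standard LSTM cell state

    u = sigmoid(Wu @ combine + bu)               # update gate: retain useful previous information
    r = sigmoid(Wr @ combine + br)               # reset gate: drop previous-layer activations
    c_t = r * (u * c_prev + (1.0 - u) * c_lstm)  # assumed blending of old and new cell state

    h_t = o * np.tanh(c_t)                       # output point-wise multiplied with new cell state
    return h_t, c_t
```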

2.3 Vanishing gradient Problem

The vanishing gradient problem appears in deep neural network models when the error gradient shrinks towards zero as it is propagated back through many layers or time steps, so that the earlier neurons effectively stop learning even though training keeps revisiting the same data. This problem was observed at a much higher level for CNN than for the other models. The proposed ULSTM model minimizes the vanishing gradient problem for our proposed system. Fig. 5 shows the comparative analysis of all the mentioned models.

Fig. 5. Comparative analysis based on the vanishing gradient problem

Here we have plotted the vanishing gradient behaviour of RNN, LSTM, GRU and U-LSTM (our modified or upgraded LSTM). The x axis represents the epoch and the y axis represents the loss. After plotting we found that RNN shows the most pronounced vanishing gradient problem and U-LSTM the least. We can therefore infer that our modified LSTM (U-LSTM) performs better in practice in terms of both the vanishing gradient problem and memory capacity.

3    Result Analysis

3.1   Dataset Collection

Our system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [19]. The speech portion of the database contains 1440 audio files covering eight types of emotion, recorded by 24 professional actors (12 female, 12 male) speaking with North American accents. Speech includes calm, happy, sad, angry, fearful, surprise and disgust expressions; each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. Every file has a unique name identifier divided into 7 parts: the first part is modality, the second is vocal channel, the third is emotion, the fourth is emotional intensity, the fifth is statement, the sixth is repetition and the last is actor id. The system is implemented and tested on this dataset, and the results are compared for all the models, namely LSTM, GRU, CNN, RNN and ULSTM. The results are summarized in Table 1.

Table 1. Quality measurement for emotion recognition through deep learning models

Sl. No.   Model     Accuracy (%)
1         CNN       63.45
2         LSTM      85.43
3         Bi-LSTM   86.45
4         GRU       83.49
5         U-LSTM    90.48

3.2 Comparative Study
The proposed system has been compared against four previously proposed works. Lee et al. [5] proposed recognition of emotion through a recurrent neural network (RNN). Satt et al. [4] categorized different emotions using deep neural networks operating on spectrograms. Mirsamadi et al. [7] also proposed the RNN concept for the recognition of different emotions. Trigeorgis et al. [8] propounded a deep convolutional recurrent network in which a convolutional neural network is combined with long short term memory (LSTM) to make the system perform better. The proposed system has been evaluated against all these methodologies to check its performance and efficiency, and it performs comparatively better for the given dataset.

Table 2. Comparative analysis with existing work

Sl. No.   Methodology             Accuracy
1         Lee et al. [5]          0.55
2         Satt et al. [4]         0.49
3         Mirsamadi et al. [7]    0.44
4         Trigeorgis et al. [8]   0.68

4. Conclusion

The proposed method has presented and implemented human emotion recognition and analysed it comparatively across different deep learning models. The performance of these models has also been highlighted in terms of the vanishing gradient problem. Furthermore, the proposed method presents a better version of LSTM, the ULSTM, obtained by modifying the architecture of the existing LSTM model to reduce the vanishing gradient problem. In future this work could be extended to learn more categories of human emotion and make the system more efficient. In addition, music could be generated based on the detected human emotion, which is called music therapy; this problem could be addressed with a generative adversarial network (GAN), another very powerful deep learning technique for generating new results from the provided data.
Acknowledgments “This chapter does not contain any studies with human
participants or animals performed by any of the authors.”
References

1. Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural
network and extreme learning machine." Fifteenth annual conference of the international
speech communication association. 2014.
2. Kim, Yelin, Honglak Lee, and Emily Mower Provost. "Deep learning for robust feature
generation in  audiovisual emotion  recognition." 2013 IEEE  international conference  on
acoustics, speech and signal processing. IEEE, 2013.
3. Zheng, W. Q., J. S. Yu, and Y. X. Zou. "An experimental study of speech emotion
recognition based on deep convolutional neural networks." 2015 international
conference on affective computing and intelligent interaction (ACII). IEEE, 2015.
4. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition from Speech
Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
5. Lee, Jinkyu, and Ivan Tashev. "High-level feature representation using recurrent
neural network for speech emotion recognition." Sixteenth Annual Conference of
the International Speech Communication Association. 2015.
6. Mao, Qirong, et al. "Learning salient features for speech emotion recognition
using convolutional neural networks." IEEE transactions on multimedia 16.8
(2014): 2203-2213.
7. Mirsamadi, Seyedmahdad, Emad Barsoum, and Cha Zhang. "Automatic speech
emotion recognition using recurrent neural networks with local attention." 2017
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2017.
8. Trigeorgis, George, et al. "Adieu features? end-to-end speech emotion recognition
using a deep convolutional recurrent network." 2016 IEEE international
conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016.
9. Kahou, Samira Ebrahimi, et al. "Combining modality specific deep neural
networks for emotion recognition in video." Proceedings of the 15th ACM on
International conference on multimodal interaction. ACM, 2013.
10. Aldeneh, Zakaria, and Emily Mower Provost. "Using regional saliency for speech
emotion recognition." 2017 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017.
11. Chernykh, Vladimir, and Pavel Prikhodko. "Emotion recognition from speech
with recurrent neural networks." arXiv preprint arXiv:1701.08071 (2017).
12. Wang, Zhong-Qiu, and Ivan Tashev. "Learning utterance-level representations for
speech emotion and age/gender recognition using deep neural networks." 2017
IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2017.
13. Ebrahimi Kahou, Samira, et al. "Recurrent neural networks for emotion
recognition in video." Proceedings of the 2015 ACM on International Conference
on Multimodal Interaction. ACM, 2015.
14. He, Lang, et al. "Multimodal affective dimension prediction using deep
bidirectional long short-term memory recurrent neural networks." Proceedings of
the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015.
15. Amer, Mohamed R., et al. "Emotion detection in speech using deep
networks." 2014 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, 2014.
16. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition
from Speech Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
17. Xingjian, S. H. I., et al. "Convolutional LSTM network: A machine learning
approach for precipitation nowcasting." Advances in neural information
processing systems. 2015.
18. Shi, Heng, Minghao Xu, and Ran Li. "Deep learning for household load
forecasting—A novel pooling deep RNN." IEEE Transactions on Smart Grid 9.5
(2017): 5271-5280.
19. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), https://zenodo.org/record/1188976#.XU739B0zbIV
