Speech Based Emotion Classification For Human by Introducing Upgraded Long Short Term Memory (ULSTM)
Subhrajit Bhowmik1, Akshay Chatterjee2, Sampurna Biswas3, Reshmina Farhin4,
Ghazaala Yasmin5
1,2,3,4,5 Department of Computer Science and Engineering, St Thomas' College of Engineering and Technology, Kolkata 700 023, India
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. Humans develop emotional intelligence through emotional behavior, social skills, and by interacting with and imitating other humans. We also continuously enhance our ability to analyse different emotions by learning from experience in our surroundings. The question, then, is whether a machine can acquire this capability through its artificial intelligence skills. Plenty of research has been conducted in this province, and the ongoing exploration uses deep learning models. These techniques enhance the learning capacity of the machine, which is especially important for human emotion recognition, because one emotion can shade into another type of emotion and is therefore difficult to analyse. This theme has inclined us to explore the problem. The proposed method categorizes human emotions using four different deep learning models: the convolutional neural network (CNN), recurrent neural network (RNN), long short term memory (LSTM) and gated recurrent unit (GRU). Well known physical and perceptual features have been used to train these models. The system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Furthermore, the mentioned deep learning models have been compared on this dataset in terms of the vanishing gradient problem. In addition, an upgraded model of LSTM (ULSTM) has been proposed to obtain better accuracy, and it has been tested against the existing LSTM model.
1 Introduction

A major current advancement in technology is the interaction between humans and computers, and there is a growing requirement to make these interactions more seamless using emotions. The emotional feedback of users can help computers perform more accurate, user-customised operations. This has vast application in mental diagnosis, customer services, human counselling and many other fields. Emotions can play a big role in such interactions; the technique of lie detection, for instance, is based on this concept. Hence, being able to assess human emotions using machines is a topic that is springing up almost everywhere. Emotions are displayed visually, vocally, and through other physiological means. In this paper we are concerned with the recognition of emotion through the analysis of speech. The tone of voice in speech has characteristics such as pitch, loudness, timbre, speech rate and pauses, which come under the umbrella of emotional prosody. The energy in speech can be changed intentionally to generate different feelings. Studies have found that emotions such as fear, joy and anger are portrayed at higher intensities than emotions such as sadness.

Emotion has been detected in various ways in the research literature. Nowadays, however, the advancement of artificial intelligence is pushing every system to become more intelligent, so that it classifies and solves the problem with greater accuracy. To achieve this, deep learning is highly recommended for problems suited to deep neural networks. This factor has motivated us to implement a system that detects emotion through different deep learning techniques and to analyse the results with measures such as the vanishing gradient curve; the details are discussed in Section 2.3. Emotions can be classified into many categories; most presented techniques classify five major emotions, namely anger, disgust, fear, sadness and happiness, along with a neutral emotion. Neutral speech is considered to be speech that carries no remarkable emotion. In comparison to neutral speech, anger is produced with a lower pitch, higher intensity, and more energy. Disgust, in turn, is produced with a lower, downward-directed pitch and lower energy. Fearful emotions have a higher pitch, little variation, lower energy, and a faster speech rate with more pauses. Sad emotions are produced with a higher pitch, less intensity but more vocal energy, and a longer duration with more pauses, whereas happiness is produced with a higher pitch and high intensity. The six major archetypal emotions are anger, surprise, fear, disgust, happiness and sadness.

The most common approach used all around the globe relies on prosodic features; pitch and energy come under prosodic features. The other category is spectral features; Mel-scale frequency cepstral coefficients are a spectral feature. By combining different types of features, such as prosodic and spectral features, emotion recognition from audio becomes possible. The proposed work covers eight different types of emotion, which are difficult to classify with good accuracy using basic machine learning techniques. For this reason, more advanced techniques are used, namely the gated recurrent unit (GRU), convolutional neural network (CNN), recurrent neural network (RNN) and long short term memory (LSTM). A convolutional neural network is an artificial neural network that performs supervised learning by convolving high-dimensional data. A recurrent neural network is a neural network that contains directed cycles acting as memory. LSTM is an advanced version of the RNN with greater learning capacity. The gated recurrent unit follows a principle similar to that of the LSTM; the difference is that it uses a smaller set of gates to control the flow of information.
1.1 Related Work
Plenty of work has been introduced in this field, and research is still ongoing. The application of deep learning to this problem is making the area even more attention-grabbing. Kun Han et al. [1] proposed utilizing deep neural networks (DNNs) to find the probability distribution of emotions for each speech segment, which proved effective for speech emotion recognition. Yelin Kim et al. [2] observed that automatic emotion recognition systems must bridge the gap between high-level affective content and low-level human signals, and proposed Deep Belief Network models, which suggest that high-order non-linear relationships are effective for emotion recognition. W. Q. Zheng et al. [3] implemented an effective emotion recognition system based on deep convolutional neural networks, trained systematically on labeled audio data. Aharon Satt et al. [4] proposed a spectrogram-based approach that captures paralinguistic information in speech using a deep neural network; the speech is processed in small segments of 3 seconds each, and longer inputs are split into non-overlapping parts, which reduces prediction latency. Jinkyu Lee et al. [5] presented a recurrent neural network as the speech emotion recognition system, trained by an efficient learning algorithm that takes long-range context into account and models the uncertainty of human emotions. Qirong Mao et al. [6] proposed using convolutional neural networks to learn salient features for human emotion recognition from speech.
Seyedmahdad Mirsamadi et al. [7] showed that, using a recurrent neural network, both short-time frame-level features and an appropriate temporal aggregation of human emotions can be learnt. George Trigeorgis et al. [8] proposed a solution to the context-aware emotion recognition problem using convolutional neural networks (CNNs) combined with LSTM networks, in order to learn the best way of modelling the speech signal automatically. Samira Ebrahimi Kahou et al. [9] proposed analysing video clips of one to two seconds to study human emotions from speech; they also analysed audio clips containing human voice over background music. Zakaria Aldeneh et al. [10] showed how convolutional neural networks can achieve competitive results when minimally hand-engineered features are applied. Vladimir Chernykh et al. [11] proposed computing sequences of acoustic features over small intervals of speech and training them with the help of a recurrent neural network. Zhong-Qiu Wang et al. [12] proposed encoding an utterance of any emotional speech into a fixed-length vector taken from the last hidden layer of a deep neural network; this encoding improves utterance-level classification. Samira Ebrahimi Kahou et al. [13] proposed a different system for studying video clips, based on a hybrid CNN-RNN architecture that also analyses the facial expressions of the subjects. Lang He et al. [14] designed an audio-video emotion channel that operates on low-level descriptors obtained from a toolbox. Mohamed R. Amer et al. [15] proposed a hybrid model that can detect human emotions easily, combining the strength of discriminative classifiers with the representational power of separate audio or video models.

For the last few years, much research has aimed at devising new deep learning algorithms or improving existing ones. We have also worked on improving a specific deep learning algorithm, namely LSTM. There are advantages in using the LSTM algorithm, and certain characteristics make us prefer LSTM to the plain recurrent neural network (RNN); one of them is the larger memory content of LSTM: an RNN has a shorter memory, whereas an LSTM can hold more information. The problem with LSTM, however, is the vanishing gradient problem, as discussed in Section 2.3. The gated recurrent unit (GRU) alleviates the vanishing gradient problem, but in terms of memory the LSTM is still ahead. To have more memory while also getting rid of the vanishing gradient problem, we decided to change the architecture of the LSTM model. The standard LSTM model has three gates: the forget gate, the input gate and the output gate. We have added two more gates, an update gate and a reset gate, to reduce the vanishing gradient problem: first an update gate and then a reset gate. The update gate is used to update the previous information so that no previous neurons are wasted per epoch; it also lets the cell memorize previous statements. The reset gate is used to reset the new neurons so that the previous neurons in the layer are not reused.
2 Proposed Methodology
The proposed method serves to classify human emotion from speech. It comprises two major phases: feature extraction, which extracts the relevant set of features needed to minutely analyse the variation of emotion in a speech file, and classification, where well known deep learning models have been used and their results compared. Fig. 1 represents the overall architecture of the proposed system.
2.1 Feature Extraction
2.1.1 Energy
Let y_i(n), n = 1, ..., S_L, be the sequence of audio samples of the i-th frame, where S_L is the length of the frame. The short-term energy is computed according to eq. (1). Usually, energy is normalized by dividing it by S_L to remove the dependency on the frame length, so the final energy is given by eq. (2).

E(i) = \sum_{n=1}^{S_L} |y_i(n)|^2    (1)

E(i) = \frac{1}{S_L} \sum_{n=1}^{S_L} |y_i(n)|^2    (2)
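As a concrete illustration of eq. (1) and (2), a minimal NumPy sketch for computing the normalized frame energy is given below; the frame array and its length are assumed to come from a prior framing step, and all names are illustrative.

```python
import numpy as np

def short_term_energy(frame):
    """Normalized short-term energy of one frame, following eq. (2)."""
    frame = np.asarray(frame, dtype=float)
    return np.sum(frame ** 2) / len(frame)

# Example: energy of a 400-sample frame of synthetic noise (illustrative only).
frame = np.random.randn(400)
print(short_term_energy(frame))
```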
2.1.2 Zero Crossing Rate
This feature, abbreviated ZCR, is the rate at which the signal changes sign within a specific frame. Put simply, ZCR is the number of times the signal changes value from negative to positive, or vice versa, divided by the length of the frame. Mathematically, ZCR can be described as eq. (3), with the sign function defined in eq. (3.1). ZCR provides information about the amount of noise in a signal, taking higher values for noisy signals, and it also roughly reflects the spectral characteristics of a signal. The ease of its computation and its properties have led to ZCR being used extensively in numerous applications such as music genre classification and speech detection.

Z(i) = \frac{1}{2 S_L} \sum_{n=2}^{S_L} \left| \operatorname{sgn}\big(y_i(n)\big) - \operatorname{sgn}\big(y_i(n-1)\big) \right|    (3)

\operatorname{sgn}\big(y_i(n)\big) = \begin{cases} 1, & y_i(n) \ge 0 \\ -1, & y_i(n) < 0 \end{cases}    (3.1)
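Similarly, eq. (3) can be illustrated with a short NumPy sketch; treating zero samples as positive, as in eq. (3.1), is an assumption of this illustration.

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame, following eq. (3): sign changes over frame length."""
    frame = np.asarray(frame, dtype=float)
    signs = np.where(frame >= 0, 1.0, -1.0)   # sgn as in eq. (3.1)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))
```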
2.1.3 Chromagram
The chroma vector is a representation of how humans relate colours to notes: in other words, the same note from two different octaves is thought of as having the same colour. Thus there are 12 possible values at each window: A, A#, B, C, C#, D, D#, E, F, F#, G and G#. These are of course not mutually exclusive, so a given time frame can contain more than one note; to keep things simple, only the most prominent note in a particular window is selected. A visual description of the chroma vector using a chromagram is shown below, obtained by executing a suitable Python program utilising the librosa module.
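A possible version of such a program is sketched below; the file name speech.wav is a placeholder, and default librosa parameters are assumed.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("speech.wav")                # placeholder audio file
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # 12 x n_frames chroma matrix
librosa.display.specshow(chroma, y_axis="chroma", x_axis="time")
plt.title("Chromagram")
plt.show()
```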
2.1.4 Spectral Contrast
Each frame of a spectrogram S of an audio file is divided into sub-bands. For each sub-band, the energy contrast is estimated by comparing the mean energy in the top quantile (peak energy) to that of the bottom quantile (valley energy). High contrast values generally correspond to clear, narrow-band signals, while low contrast values correspond to broadband noise. Spectral contrast provides a good representation of the relative spectral distribution and is very effective in music genre classification. A Python program is executed to compute the spectral contrast from the power spectrogram of an audio file.
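A sketch of such a program, assuming librosa with its default parameters and a placeholder input file, is shown below.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav")                        # placeholder audio file
S = np.abs(librosa.stft(y))                               # magnitude spectrogram
contrast = librosa.feature.spectral_contrast(S=S, sr=sr)  # (n_bands + 1) x frames
print(contrast.mean(axis=1))                              # mean contrast per band
```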
2.1.5 Mel Frequency Cepstral Coefficients (MFCC)
When sounds are generated by humans, they are filtered by the shape of the vocal tract, including the tongue, teeth etc. The sound coming out is determined by this shape; if the shape can be determined accurately, an accurate representation of the produced phonemes can be obtained. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and MFCCs accurately represent this envelope. The Mel scale is an essential prerequisite for understanding MFCCs: it provides a relationship between the perceived frequency, or pitch, of a pure tone and its actual measured frequency. Humans are much better at perceiving small changes in pitch at low frequencies than at high frequencies, so using this scale makes the features match more closely what humans hear.
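For illustration, the commonly used Mel mapping m = 2595 log10(1 + f/700) and a typical MFCC extraction with librosa are sketched below; the choice of 13 coefficients and the file name are assumptions, not necessarily the configuration used in this work.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Common Mel-scale mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

y, sr = librosa.load("speech.wav")                  # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
print(mfcc.shape)
```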
2.2.2 Convolutional Neural Networks (CNN)
Also popularly abbreviated as CNNs (ConvNets), convolutional neural networks are one of the essential, state-of-the-art artificial neural network architectures, widely used in image-based classification. The intuition behind the working of a CNN is the idea of convolution, which produces filtered feature maps, sometimes also referred to as kernels, stacked over each other. The procedure in a CNN follows the steps: image, filtering (convolution), ReLU, pooling, repeated as needed, then vectorization and a fully connected network. Convolutional neural networks are used for two purposes: for images, denoted Conv2D, and for audio, denoted Conv1D. The input image, in the form of a matrix, is convolved with an m x m filter, where m is generally odd. A given image might undergo convolution with n different filters of size m x m, where each filter is concerned with a definite feature to be extracted from the image, thus forming n stacked feature maps or kernels. These kernels are then passed through a ReLU layer, which replaces all negative entries in the obtained matrices with 0 and introduces non-linearity. The result is then transferred to the pooling layer, where the size of the obtained matrices is reduced by a nearest-neighbour paradigm in which an area of size p x p is replaced by the maximum value in that area. Pooling reduces the size of the matrix we operate on, diminishing the computation without loss of information. The above steps are repeated, and at a certain stage the matrices are vectorized and sent to a neural network, which trains using backpropagation to classify the image. We have used Conv1D since it convolves the one-dimensional audio signal; Conv1D goes through the same stages, with a first layer for setting the weights, then pooling, and then a dense (fully connected) layer.
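A minimal Keras sketch of such a Conv1D pipeline is given below; the input length, filter sizes and layer widths are illustrative assumptions rather than the exact configuration used in this work.

```python
from tensorflow.keras import layers, models

# Assumed input: a 1-D feature vector of length 180 per audio file (illustrative).
model = models.Sequential([
    layers.Conv1D(64, kernel_size=5, activation="relu", input_shape=(180, 1)),
    layers.MaxPooling1D(pool_size=2),          # pooling stage
    layers.Flatten(),                          # vectorization
    layers.Dense(128, activation="relu"),      # fully connected layer
    layers.Dense(8, activation="softmax"),     # eight RAVDESS emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```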
2.2.3 Recurrent Neural Networks
A recurrent neural network processes a sequence one time step at a time while maintaining a hidden state. A common formulation of the recurrence is given in eq. (6) and (7):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)    (6)

y_t = W_{hy} h_t + b_y    (7)

In eq. (6) and (7) the weights W are matrices and the other variables are vectors. All the weights enter via matrix multiplication, and the biases are added to the obtained products. The activation function in the first equation is typically tanh; other activation functions such as sigmoid can also be used, and in our implementation we use ReLU because it gives better results for our dataset.
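For concreteness, a single recurrence step of eq. (6) and (7) can be written as follows; the weight matrices are assumed given, and tanh is used as in eq. (6).

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One recurrence step implementing eq. (6) and eq. (7)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # eq. (6): new hidden state
    y_t = W_hy @ h_t + b_y                            # eq. (7): output
    return h_t, y_t
```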
2.2.4 Gated Recurrent Unit (GRU)
Gated recurrent units, also abbreviated as GRU, are a variant of RNNs that are more convenient to train and that also mitigate the vanishing gradient problem. The vanishing gradient problem is encountered when training RNNs on long sequential time-series data, where the error gradient with respect to the model parameters decays towards zero within the early time steps; this occurs because too many numbers less than 1 are multiplied together. It implies that it becomes difficult for the model to learn long-term dependencies in the input time series. The issue can be resolved by using either gated recurrent units (GRU) or LSTMs (long short-term memory) in place of the standard RNN cell. With respect to LSTM, GRU stands out as computationally economical since it has fewer internal gates. The architecture of the GRU is represented in Fig. 4.
2.2.5 Long Short Term Memory Network (LSTM)
LSTMs help conserve the error that is backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to keep learning over many time steps, thereby opening a conduit to link causes and effects that are remote in time. LSTMs store information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell decides what to store and when to allow read, write and delete operations, via gates that open and close. In contrast to the digital storage on computers, these gates are analog in nature, implemented with element-wise multiplication by sigmoids, which fall in the range 0 to 1. Analog has the advantage over digital of being differentiable, and hence suitable for backpropagation. These gates act on the received signal and, like the nodes of a neural network, they block or pass information depending on its strength and import, filtered by their own sets of weights. These weights, like the weights that control the input and hidden states, are adjusted through the recurrent network's learning process. That is, the cells learn when to allow data to enter, to leave, or to be erased, through the iterative process of making guesses, backpropagating the error, and adjusting the weights by gradient descent. An LSTM block diagram is shown in Fig. 3.
2.2.6 Upgraded Long Short Term Memory (ULSTM)
The proposed method modifies the basic architecture of the LSTM in order to obtain better results on the dataset. First, the previous hidden state and the current input are concatenated; we call this vector combine. Combine is fed into the forget layer, which removes non-relevant data. A candidate layer is created from combine; the candidate holds possible values to add to the cell state. Combine is also fed into the input layer, which decides what data from the candidate should be added to the new cell state. After computing the forget layer, candidate layer, and input layer, the cell state is calculated using those vectors and the previous cell state. The cell state is then fed to the update gate to activate the previous neurons, and then passed through the reset gate to eliminate the previous layer's neurons and activate new neurons. The output is then computed, and pointwise multiplication of the output with the new cell state gives the new hidden state.
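The exact gate equations of the ULSTM are not spelled out above, so the following sketch is only one possible reading of the description: a standard LSTM step followed by an update gate and a reset gate acting on the cell state. The placement of the extra gates and all parameter names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ulstm_step(x_t, h_prev, c_prev, p):
    """Hypothetical ULSTM step; p is a dict of weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])                # "combine"
    f = sigmoid(p["W_f"] @ z + p["b_f"])             # forget layer
    i = sigmoid(p["W_i"] @ z + p["b_i"])             # input layer
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])       # candidate layer
    c_t = f * c_prev + i * c_tilde                   # cell state

    u = sigmoid(p["W_u"] @ z + p["b_u"])             # added update gate
    r = sigmoid(p["W_r"] @ z + p["b_r"])             # added reset gate
    c_t = u * c_t + (1.0 - u) * c_prev               # re-use previous information
    c_t = r * c_t                                    # reset stale activations

    o = sigmoid(p["W_o"] @ z + p["b_o"])             # output layer
    h_t = o * np.tanh(c_t)                           # new hidden state
    return h_t, c_t
```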
2.3 Vanishing Gradient Problem
The vanishing gradient problem is seen in deep neural network models when the error gradients shrink towards zero as they are propagated back through many layers or time steps, so that the earlier parts of the network effectively stop learning. This problem was observed to a much greater degree for CNN compared to the other models. The proposed ULSTM model minimizes the vanishing gradient problem for our system. Fig. 5 shows the comparative analysis of all the mentioned models.
Fig. 5 Comparative analysis based on vanishing gradient problem
Here we plot the vanishing gradient behaviour of RNN, LSTM, GRU and ULSTM (our modified or upgraded LSTM). The x-axis represents the epoch and the y-axis represents the loss. From the plot we found that RNN shows the maximum vanishing gradient problem and ULSTM shows the minimum. We can therefore infer that our modified LSTM (ULSTM) performs better in practice in terms of both the vanishing gradient problem and memory capacity.
3 Result Analysis
3.1 Dataset Collection
Our system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [19]. The speech portion of the database contains 1440 audio files covering eight types of emotion, recorded by 24 professional actors (12 female, 12 male) speaking in a North American accent. Speech includes calm, happy, sad, angry, fearful, surprised and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. Each file in the dataset has a unique name identification divided into 7 parts: the first part is the modality, the second the vocal channel, the third the emotion, the fourth the emotional intensity, the fifth the statement, the sixth the repetition and the last the actor id. The system has been implemented and tested on the mentioned dataset, and the results have been compared for all the models, that is, LSTM, GRU, CNN, RNN and ULSTM. The results are summarized in Table 1.
Table 1. Quality measurement for emotion recognition through deep learning models
3.2 Comparative Study
The proposed system has been compared against four previously proposed works. Lee et al. [5] proposed the recognition of emotion through a recurrent neural network (RNN). Satt et al. [4] categorized different emotions using a deep neural network operating on spectrograms. Mirsamadi et al. [7] also proposed the use of RNNs for the recognition of different emotions. Trigeorgis et al. [8] propounded a deep convolutional recurrent network, in which a convolutional neural network is combined with long short term memory (LSTM) to make the system perform better. The proposed system has been evaluated against all these methodologies to check its performance and efficiency, and it has been shown to perform comparatively better on the given dataset.
Table 2. Comparative Analysis for Existing work
4 Conclusion

References
1. Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural
network and extreme learning machine." Fifteenth annual conference of the international
speech communication association. 2014.
2. Kim, Yelin, Honglak Lee, and Emily Mower Provost. "Deep learning for robust feature
generation in audiovisual emotion recognition." 2013 IEEE international conference on
acoustics, speech and signal processing. IEEE, 2013.
3. Zheng, W. Q., J. S. Yu, and Y. X. Zou. "An experimental study of speech emotion
recognition based on deep convolutional neural networks." 2015 international
conference on affective computing and intelligent interaction (ACII). IEEE, 2015.
4. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition from Speech
Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
5. Lee, Jinkyu, and Ivan Tashev. "High-level feature representation using recurrent
neural network for speech emotion recognition." Sixteenth Annual Conference of
the International Speech Communication Association. 2015.
6. Mao, Qirong, et al. "Learning salient features for speech emotion recognition
using convolutional neural networks." IEEE transactions on multimedia 16.8
(2014): 2203-2213.
7. Mirsamadi, Seyedmahdad, Emad Barsoum, and Cha Zhang. "Automatic speech
emotion recognition using recurrent neural networks with local attention." 2017
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2017.
8. Trigeorgis, George, et al. "Adieu features? end-to-end speech emotion recognition
using a deep convolutional recurrent network." 2016 IEEE international
conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016.
9. Kahou, Samira Ebrahimi, et al. "Combining modality specific deep neural
networks for emotion recognition in video." Proceedings of the 15th ACM on
International conference on multimodal interaction. ACM, 2013.
10. Aldeneh, Zakaria, and Emily Mower Provost. "Using regional saliency for speech
emotion recognition." 2017 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017.
11. Chernykh, Vladimir, and Pavel Prikhodko. "Emotion recognition from speech
with recurrent neural networks." arXiv preprint arXiv:1701.08071 (2017).
12. Wang, Zhong-Qiu, and Ivan Tashev. "Learning utterance-level representations for
speech emotion and age/gender recognition using deep neural networks." 2017
IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2017.
13. Ebrahimi Kahou, Samira, et al. "Recurrent neural networks for emotion
recognition in video." Proceedings of the 2015 ACM on International Conference
on Multimodal Interaction. ACM, 2015.
14. He, Lang, et al. "Multimodal affective dimension prediction using deep
bidirectional long short-term memory recurrent neural networks." Proceedings of
the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015.
15. Amer, Mohamed R., et al. "Emotion detection in speech using deep
networks." 2014 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, 2014.
16. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition
from Speech Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
17. Shi, Xingjian, et al. "Convolutional LSTM network: A machine learning approach for precipitation nowcasting." Advances in neural information processing systems. 2015.
18. Shi, Heng, Minghao Xu, and Ran Li. "Deep learning for household load
forecasting—A novel pooling deep RNN." IEEE Transactions on Smart Grid 9.5
(2017): 5271-5280.
19. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), https://zenodo.org/record/1188976#.XU739B0zbIV