Speech Based Emotion Classification For Human by Introducing Upgraded Long Short Term Memory (ULSTM)
Subhrajit Bhowmik1, Akshay Chatterjee2, Sampurna Biswas3, Reshmina Farhin4,
Ghazaala Yasmin5
1,2,3,4,5 Department of Computer Science and Engineering, St Thomas' College of Engineering and Technology, Kolkata 700 023, India
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. Humans develop emotional intelligence through emotional behavior, social skills, and by interacting with and imitating other humans. We also continuously enhance our ability to analyse different emotions by learning from experience in our surroundings. The question, then, is whether a machine can acquire this capability through its artificial intelligence skills. Plenty of research has been conducted in this province, and the ongoing exploration uses deep learning models. These techniques enhance the learning capacity of the machine, which is especially important for human emotion recognition, because one emotion can shade into another type of emotion and is therefore difficult to analyse. This theme has inclined us to explore the problem. The proposed method categorizes human emotions using four different deep learning models: the convolutional neural network (CNN), recurrent neural network (RNN), long short term memory (LSTM) and gated recurrent unit (GRU). Well known physical and perceptual features have been used to train these models. The system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Furthermore, the mentioned deep learning models have been compared on this dataset in terms of the vanishing gradient problem. In addition, an upgraded model of LSTM (ULSTM) has been proposed to obtain better accuracy, and it has been tested against the existing LSTM model.
1 Introduction

A major current advancement in technology is the interaction between humans and computers, and there is a growing requirement to make these interactions more seamless using emotions. The emotional feedback of users can help computers perform more accurate, user-customised operations. This has vast application in mental diagnosis, customer services, human counselling and many other fields. Emotions can play a big role in such interactions; the technique of lie detection, for instance, is based on this concept. Hence, being able to assess human emotions using machines is a topic that is springing up almost everywhere. Emotions are displayed visually, vocally, and through other physiological means. In this paper we are concerned with the recognition of emotion through the analysis of speech. The tone of voice in speech has characteristics such as pitch, loudness, timbre, speech rate and pauses, which come under the umbrella of emotional prosody. The energy in speech can be changed intentionally to generate different feelings. Studies have found that emotions such as fear, joy and anger are portrayed at higher intensities than emotions such as sadness.

Emotion has been detected in various ways in the research literature. Nowadays, however, the advancement of artificial intelligence is pushing every system to become more intelligent, so that it classifies and solves the problem with greater accuracy. To achieve this, deep learning is highly recommended for problems suited to deep neural networks. This factor has motivated us to implement a system that detects emotion through different deep learning techniques and to analyse the results with measures such as the vanishing gradient curve; the details are discussed in Section 2.3. Emotions can be classified into many categories; most presented techniques classify five major emotions, namely anger, disgust, fear, sadness and happiness, along with a neutral emotion. Neutral speech is considered to be speech that carries no remarkable emotion. In comparison to neutral speech, anger is produced with a lower pitch, higher intensity, and more energy. Disgust, in turn, is produced with a lower, downward-directed pitch and lower energy. Fearful emotions have a higher pitch, little variation, lower energy, and a faster speech rate with more pauses. Sad emotions are produced with a higher pitch, less intensity but more vocal energy, and a longer duration with more pauses, whereas happiness is produced with a higher pitch and high intensity. The six major archetypal emotions are anger, surprise, fear, disgust, happiness and sadness.

The most common approach used all around the globe relies on prosodic features; pitch and energy come under prosodic features. The other category is spectral features; Mel-scale frequency cepstral coefficients are a spectral feature. By combining different types of features, such as prosodic and spectral features, emotion recognition from audio becomes possible. The proposed work covers eight different types of emotion, which are difficult to classify with good accuracy using basic machine learning techniques. For this reason, more advanced techniques are used, namely the gated recurrent unit (GRU), convolutional neural network (CNN), recurrent neural network (RNN) and long short term memory (LSTM). A convolutional neural network is an artificial neural network that performs supervised learning by convolving high-dimensional data. A recurrent neural network is a neural network that contains directed cycles acting as memory. LSTM is an advanced version of the RNN with greater learning capacity. The gated recurrent unit follows a principle similar to that of the LSTM; the difference is that it uses a smaller set of gates to control the flow of information.
1.1 Related Work
Plenty of work has been introduced in this field, and research is still ongoing. The application of deep learning to this problem is making the area even more attention-grabbing. Kun Han et al. [1] proposed utilizing deep neural networks (DNNs) to find the probability distribution of emotions for each speech segment, which proved effective for speech emotion recognition. Yelin Kim et al. [2] observed that automatic emotion recognition systems must bridge the gap between high-level affective content and low-level human signals, and proposed Deep Belief Network models, which suggest that high-order non-linear relationships are effective for emotion recognition. W. Q. Zheng et al. [3] implemented an effective emotion recognition system based on deep convolutional neural networks, trained systematically on labeled audio data. Aharon Satt et al. [4] proposed a spectrogram-based approach that captures paralinguistic information in speech using a deep neural network; the speech is processed in small segments of 3 seconds each, and longer inputs are split into non-overlapping parts, which reduces prediction latency. Jinkyu Lee et al. [5] presented a recurrent neural network as the speech emotion recognition system, trained by an efficient learning algorithm that takes long-range context into account and models the uncertainty of human emotions. Qirong Mao et al. [6] proposed using convolutional neural networks to learn salient features for human emotion recognition from speech.
Seyedmahdad Mirsamadi et al. [7] showed that, using a recurrent neural network, both short-time frame-level features and an appropriate temporal aggregation of human emotions can be learnt. George Trigeorgis et al. [8] proposed a solution to the context-aware emotion recognition problem using convolutional neural networks (CNNs) combined with LSTM networks, in order to learn the best way of modelling the speech signal automatically. Samira Ebrahimi Kahou et al. [9] proposed analysing video clips of one to two seconds to study human emotions from speech; they also analysed audio clips containing human voice over background music. Zakaria Aldeneh et al. [10] showed how convolutional neural networks can achieve competitive results when minimally hand-engineered features are applied. Vladimir Chernykh et al. [11] proposed computing sequences of acoustic features over small intervals of speech and training them with the help of a recurrent neural network. Zhong-Qiu Wang et al. [12] proposed encoding an utterance of any emotional speech into a fixed-length vector taken from the last hidden layer of a deep neural network; this encoding improves utterance-level classification. Samira Ebrahimi Kahou et al. [13] proposed a different system for studying video clips, based on a hybrid CNN-RNN architecture that also analyses the facial expressions of the subjects. Lang He et al. [14] designed an audio-video emotion channel that operates on low-level descriptors obtained from a toolbox. Mohamed R. Amer et al. [15] proposed a hybrid model that can detect human emotions easily, combining the strength of discriminative classifiers with the representational power of separate audio or video models.

For the last few years, much research has aimed at devising new deep learning algorithms or improving existing ones. We have also worked on improving a specific deep learning algorithm, namely LSTM. There are advantages in using the LSTM algorithm, and certain characteristics make us prefer LSTM to the plain recurrent neural network (RNN); one of them is the larger memory content of LSTM: an RNN has a shorter memory, whereas an LSTM can hold more information. The problem with LSTM, however, is the vanishing gradient problem, as discussed in Section 2.3. The gated recurrent unit (GRU) alleviates the vanishing gradient problem, but in terms of memory the LSTM is still ahead. To have more memory while also getting rid of the vanishing gradient problem, we decided to change the architecture of the LSTM model. The standard LSTM model has three gates: the forget gate, the input gate and the output gate. We have added two more gates, an update gate and a reset gate, to reduce the vanishing gradient problem: first an update gate and then a reset gate. The update gate is used to update the previous information so that no previous neurons are wasted per epoch; it also lets the cell memorize previous statements. The reset gate is used to reset the new neurons so that the previous neurons in the layer are not reused.
2 Proposed Methodology
The proposed method serves to classify human emotion from speech. It comprises two major phases: feature extraction, which extracts the relevant set of features needed to minutely analyse the variation of emotion in a speech file, and classification, where well known deep learning models have been used and their results compared. Fig. 1 represents the overall architecture of the proposed system.
2.1 Feature Extraction
2.1.1 Energy
Let y_i(n), n = 1, ..., S_L, be the sequence of audio samples of the i-th frame, where S_L is the length of the frame. The short-term energy is computed according to eq. (1). Usually, energy is normalized by dividing it by S_L to remove the dependency on the frame length, so the final energy is given by eq. (2).

E(i) = \sum_{n=1}^{S_L} |y_i(n)|^2    (1)

E(i) = \frac{1}{S_L} \sum_{n=1}^{S_L} |y_i(n)|^2    (2)
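As a concrete illustration of eq. (1) and (2), a minimal NumPy sketch for computing the normalized frame energy is given below; the frame array and its length are assumed to come from a prior framing step, and all names are illustrative.

```python
import numpy as np

def short_term_energy(frame):
    """Normalized short-term energy of one frame, following eq. (2)."""
    frame = np.asarray(frame, dtype=float)
    return np.sum(frame ** 2) / len(frame)

# Example: energy of a 400-sample frame of synthetic noise (illustrative only).
frame = np.random.randn(400)
print(short_term_energy(frame))
```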
2.1.2 Zero Crossing Rate
This feature, abbreviated ZCR, is the rate at which the signal changes sign within a specific frame. Put simply, ZCR is the number of times the signal changes value from negative to positive, or vice versa, divided by the length of the frame. Mathematically, ZCR can be described as eq. (3), with the sign function defined in eq. (3.1). ZCR provides information about the amount of noise in a signal, taking higher values for noisy signals, and it also roughly reflects the spectral characteristics of a signal. The ease of its computation and its properties have led to ZCR being used extensively in numerous applications such as music genre classification and speech detection.

Z(i) = \frac{1}{2 S_L} \sum_{n=2}^{S_L} \left| \operatorname{sgn}\big(y_i(n)\big) - \operatorname{sgn}\big(y_i(n-1)\big) \right|    (3)

\operatorname{sgn}\big(y_i(n)\big) = \begin{cases} 1, & y_i(n) \ge 0 \\ -1, & y_i(n) < 0 \end{cases}    (3.1)
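Similarly, eq. (3) can be illustrated with a short NumPy sketch; treating zero samples as positive, as in eq. (3.1), is an assumption of this illustration.

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame, following eq. (3): sign changes over frame length."""
    frame = np.asarray(frame, dtype=float)
    signs = np.where(frame >= 0, 1.0, -1.0)   # sgn as in eq. (3.1)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))
```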
2.1.3 Chromagram
The chroma vector is a representation of how humans relate colours to notes: in other words, the same note from two different octaves is thought of as having the same colour. Thus there are 12 possible values at each window: A, A#, B, C, C#, D, D#, E, F, F#, G and G#. These are of course not mutually exclusive, so a given time frame can contain more than one note; to keep things simple, only the most prominent note in a particular window is selected. A visual description of the chroma vector using a chromagram is shown below, obtained by executing a suitable Python program utilising the librosa module.
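A possible version of such a program is sketched below; the file name speech.wav is a placeholder, and default librosa parameters are assumed.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("speech.wav")                # placeholder audio file
chroma = librosa.feature.chroma_stft(y=y, sr=sr)  # 12 x n_frames chroma matrix
librosa.display.specshow(chroma, y_axis="chroma", x_axis="time")
plt.title("Chromagram")
plt.show()
```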
2.1.4 Spectral Contrast
Each frame of a spectrogram S of an audio file is divided into sub-bands. For each sub-band, the energy contrast is estimated by comparing the mean energy in the top quantile (peak energy) to that of the bottom quantile (valley energy). High contrast values generally correspond to clear, narrow-band signals, while low contrast values correspond to broadband noise. Spectral contrast provides a good representation of the relative spectral distribution and is very effective in music genre classification. A Python program is executed to compute the spectral contrast from the power spectrogram of an audio file.
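A sketch of such a program, assuming librosa with its default parameters and a placeholder input file, is shown below.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav")                        # placeholder audio file
S = np.abs(librosa.stft(y))                               # magnitude spectrogram
contrast = librosa.feature.spectral_contrast(S=S, sr=sr)  # (n_bands + 1) x frames
print(contrast.mean(axis=1))                              # mean contrast per band
```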
2.1.5 Mel Frequency Cepstral Coefficients (MFCC)
When sounds are generated by humans, they are filtered by the shape of the vocal tract, including the tongue, teeth etc. The sound coming out is determined by this shape; if the shape can be determined accurately, an accurate representation of the produced phonemes can be obtained. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and MFCCs accurately represent this envelope. The Mel scale is an essential prerequisite for understanding MFCCs: it provides a relationship between the perceived frequency, or pitch, of a pure tone and its actual measured frequency. Humans are much better at perceiving small changes in pitch at low frequencies than at high frequencies, so using this scale makes the features match more closely what humans hear.
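For illustration, the commonly used Mel mapping m = 2595 log10(1 + f/700) and a typical MFCC extraction with librosa are sketched below; the choice of 13 coefficients and the file name are assumptions, not necessarily the configuration used in this work.

```python
import numpy as np
import librosa

def hz_to_mel(f_hz):
    """Common Mel-scale mapping: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

y, sr = librosa.load("speech.wav")                  # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per frame
print(mfcc.shape)
```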
2.2.2 Convolutional Neural Networks (CNN)
Also popularly abbreviated as CNNs (ConvNets), convolutional neural networks are one of the essential, state-of-the-art artificial neural network architectures, widely used in image-based classification. The intuition behind the working of a CNN is the idea of convolution, which produces filtered feature maps, sometimes also referred to as kernels, stacked over each other. The procedure in a CNN follows the steps: image, filtering (convolution), ReLU, pooling, repeated as needed, then vectorization and a fully connected network. Convolutional neural networks are used for two purposes: for images, denoted Conv2D, and for audio, denoted Conv1D. The input image, in the form of a matrix, is convolved with an m x m filter, where m is generally odd. A given image might undergo convolution with n different filters of size m x m, where each filter is concerned with a definite feature to be extracted from the image, thus forming n stacked feature maps or kernels. These kernels are then passed through a ReLU layer, which replaces all negative entries in the obtained matrices with 0 and introduces non-linearity. The result is then transferred to the pooling layer, where the size of the obtained matrices is reduced by a nearest-neighbour paradigm in which an area of size p x p is replaced by the maximum value in that area. Pooling reduces the size of the matrix we operate on, diminishing the computation without loss of information. The above steps are repeated, and at a certain stage the matrices are vectorized and sent to a neural network, which trains using backpropagation to classify the image. We have used Conv1D since it convolves the one-dimensional audio signal; Conv1D goes through the same stages, with a first layer for setting the weights, then pooling, and then a dense (fully connected) layer.
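A minimal Keras sketch of such a Conv1D pipeline is given below; the input length, filter sizes and layer widths are illustrative assumptions rather than the exact configuration used in this work.

```python
from tensorflow.keras import layers, models

# Assumed input: a 1-D feature vector of length 180 per audio file (illustrative).
model = models.Sequential([
    layers.Conv1D(64, kernel_size=5, activation="relu", input_shape=(180, 1)),
    layers.MaxPooling1D(pool_size=2),          # pooling stage
    layers.Flatten(),                          # vectorization
    layers.Dense(128, activation="relu"),      # fully connected layer
    layers.Dense(8, activation="softmax"),     # eight RAVDESS emotion classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```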
2.2.3 Recurrent Neural Networks
A recurrent neural network processes a sequence one time step at a time while maintaining a hidden state. A common formulation of the recurrence is given in eq. (6) and (7):

h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)    (6)

y_t = W_{hy} h_t + b_y    (7)

In eq. (6) and (7) the weights W are matrices and the other variables are vectors. All the weights enter via matrix multiplication, and the biases are added to the obtained products. The activation function in the first equation is typically tanh; other activation functions such as sigmoid can also be used, and in our implementation we use ReLU because it gives better results for our dataset.
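For concreteness, a single recurrence step of eq. (6) and (7) can be written as follows; the weight matrices are assumed given, and tanh is used as in eq. (6).

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """One recurrence step implementing eq. (6) and eq. (7)."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)   # eq. (6): new hidden state
    y_t = W_hy @ h_t + b_y                            # eq. (7): output
    return h_t, y_t
```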
2.2.4 Gated Recurrent Unit (GRU)
Gated recurrent units, also abbreviated as GRU, are a variant of RNNs that are more convenient to train and that also mitigate the vanishing gradient problem. The vanishing gradient problem is encountered when training RNNs on long sequential time-series data, where the error gradient with respect to the model parameters decays towards zero within the early time steps; this occurs because too many numbers less than 1 are multiplied together. It implies that it becomes difficult for the model to learn long-term dependencies in the input time series. The issue can be resolved by using either gated recurrent units (GRU) or LSTMs (long short-term memory) in place of the standard RNN cell. With respect to LSTM, GRU stands out as computationally economical since it has fewer internal gates. The architecture of the GRU is represented in Fig. 4.
2.2.5 Long Short Term Memory Network (LSTM)
LSTMs help conserve the error that is backpropagated through time and layers. By maintaining a more constant error, they allow recurrent nets to keep learning over many time steps, thereby opening a conduit to link causes and effects that are remote in time. LSTMs store information outside the normal flow of the recurrent network in a gated cell. Information can be stored in, written to, or read from a cell, much like data in a computer's memory. The cell decides what to store and when to allow read, write and delete operations, via gates that open and close. In contrast to the digital storage on computers, these gates are analog in nature, implemented with element-wise multiplication by sigmoids, which fall in the range 0 to 1. Analog has the advantage over digital of being differentiable, and hence suitable for backpropagation. These gates act on the received signal and, like the nodes of a neural network, they block or pass information depending on its strength and import, filtered by their own sets of weights. These weights, like the weights that control the input and hidden states, are adjusted through the recurrent network's learning process. That is, the cells learn when to allow data to enter, to leave, or to be erased, through the iterative process of making guesses, backpropagating the error, and adjusting the weights by gradient descent. An LSTM block diagram is shown in Fig. 3.
2.2.6 Upgraded Long Short Term Memory (ULSTM)
The proposed method modifies the basic architecture of the LSTM in order to obtain better results on the dataset. First, the previous hidden state and the current input are concatenated; we call this vector combine. Combine is fed into the forget layer, which removes non-relevant data. A candidate layer is created from combine; the candidate holds possible values to add to the cell state. Combine is also fed into the input layer, which decides what data from the candidate should be added to the new cell state. After computing the forget layer, candidate layer, and input layer, the cell state is calculated using those vectors and the previous cell state. The cell state is then fed to the update gate to activate the previous neurons, and then passed through the reset gate to eliminate the previous layer's neurons and activate new neurons. The output is then computed, and pointwise multiplication of the output with the new cell state gives the new hidden state.
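The exact gate equations of the ULSTM are not spelled out above, so the following sketch is only one possible reading of the description: a standard LSTM step followed by an update gate and a reset gate acting on the cell state. The placement of the extra gates and all parameter names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ulstm_step(x_t, h_prev, c_prev, p):
    """Hypothetical ULSTM step; p is a dict of weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])                # "combine"
    f = sigmoid(p["W_f"] @ z + p["b_f"])             # forget layer
    i = sigmoid(p["W_i"] @ z + p["b_i"])             # input layer
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])       # candidate layer
    c_t = f * c_prev + i * c_tilde                   # cell state

    u = sigmoid(p["W_u"] @ z + p["b_u"])             # added update gate
    r = sigmoid(p["W_r"] @ z + p["b_r"])             # added reset gate
    c_t = u * c_t + (1.0 - u) * c_prev               # re-use previous information
    c_t = r * c_t                                    # reset stale activations

    o = sigmoid(p["W_o"] @ z + p["b_o"])             # output layer
    h_t = o * np.tanh(c_t)                           # new hidden state
    return h_t, c_t
```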
2.3 Vanishing Gradient Problem
The vanishing gradient problem is seen in deep neural network models when the error gradients shrink towards zero as they are propagated back through many layers or time steps, so that the earlier parts of the network effectively stop learning. This problem was observed to a much greater degree for CNN compared to the other models. The proposed ULSTM model minimizes the vanishing gradient problem for our system. Fig. 5 shows the comparative analysis of all the mentioned models.
Fig. 5 Comparative analysis based on vanishing gradient problem
Here we plot the vanishing gradient behaviour of RNN, LSTM, GRU and ULSTM (our modified or upgraded LSTM). The x-axis represents the epoch and the y-axis represents the loss. From the plot we found that RNN shows the maximum vanishing gradient problem and ULSTM shows the minimum. We can therefore infer that our modified LSTM (ULSTM) performs better in practice in terms of both the vanishing gradient problem and memory capacity.
3 Result Analysis
3.1 Dataset Collection
Our system has been tested on the benchmark Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [19]. The speech portion of the database contains 1440 audio files covering eight types of emotion, recorded by 24 professional actors (12 female, 12 male) speaking in a North American accent. Speech includes calm, happy, sad, angry, fearful, surprised and disgusted expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. Each file in the dataset has a unique name identification divided into 7 parts: the first part is the modality, the second the vocal channel, the third the emotion, the fourth the emotional intensity, the fifth the statement, the sixth the repetition and the last the actor id. The system has been implemented and tested on the mentioned dataset, and the results have been compared for all the models, that is, LSTM, GRU, CNN, RNN and ULSTM. The results are summarized in Table 1.
Table 1. Quality measurement for emotion recognition through deep learning models
3.2 Comparative Study
The proposed system has been compared against four previously proposed works. Lee et al. [5] proposed the recognition of emotion through a recurrent neural network (RNN). Satt et al. [4] categorized different emotions using a deep neural network operating on spectrograms. Mirsamadi et al. [7] also proposed the use of RNNs for the recognition of different emotions. Trigeorgis et al. [8] propounded a deep convolutional recurrent network, in which a convolutional neural network is combined with long short term memory (LSTM) to make the system perform better. The proposed system has been evaluated against all these methodologies to check its performance and efficiency, and it has been shown to perform comparatively better on the given dataset.
Table 2. Comparative Analysis for Existing work
4 Conclusion

References
1. Han, Kun, Dong Yu, and Ivan Tashev. "Speech emotion recognition using deep neural
network and extreme learning machine." Fifteenth annual conference of the international
speech communication association. 2014.
2. Kim, Yelin, Honglak Lee, and Emily Mower Provost. "Deep learning for robust feature
generation in audiovisual emotion recognition." 2013 IEEE international conference on
acoustics, speech and signal processing. IEEE, 2013.
3. Zheng, W. Q., J. S. Yu, and Y. X. Zou. "An experimental study of speech emotion
recognition based on deep convolutional neural networks." 2015 international
conference on affective computing and intelligent interaction (ACII). IEEE, 2015.
4. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition from Speech
Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
5. Lee, Jinkyu, and Ivan Tashev. "High-level feature representation using recurrent
neural network for speech emotion recognition." Sixteenth Annual Conference of
the International Speech Communication Association. 2015.
6. Mao, Qirong, et al. "Learning salient features for speech emotion recognition
using convolutional neural networks." IEEE transactions on multimedia 16.8
(2014): 2203-2213.
7. Mirsamadi, Seyedmahdad, Emad Barsoum, and Cha Zhang. "Automatic speech
emotion recognition using recurrent neural networks with local attention." 2017
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2017.
8. Trigeorgis, George, et al. "Adieu features? end-to-end speech emotion recognition
using a deep convolutional recurrent network." 2016 IEEE international
conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016.
9. Kahou, Samira Ebrahimi, et al. "Combining modality specific deep neural
networks for emotion recognition in video." Proceedings of the 15th ACM on
International conference on multimodal interaction. ACM, 2013.
10. Aldeneh, Zakaria, and Emily Mower Provost. "Using regional saliency for speech
emotion recognition." 2017 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2017.
11. Chernykh, Vladimir, and Pavel Prikhodko. "Emotion recognition from speech
with recurrent neural networks." arXiv preprint arXiv:1701.08071 (2017).
12. Wang, Zhong-Qiu, and Ivan Tashev. "Learning utterance-level representations for
speech emotion and age/gender recognition using deep neural networks." 2017
IEEE international conference on acoustics, speech and signal processing
(ICASSP). IEEE, 2017.
13. Ebrahimi Kahou, Samira, et al. "Recurrent neural networks for emotion
recognition in video." Proceedings of the 2015 ACM on International Conference
on Multimodal Interaction. ACM, 2015.
14. He, Lang, et al. "Multimodal affective dimension prediction using deep
bidirectional long short-term memory recurrent neural networks." Proceedings of
the 5th International Workshop on Audio/Visual Emotion Challenge. ACM, 2015.
15. Amer, Mohamed R., et al. "Emotion detection in speech using deep
networks." 2014 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, 2014.
16. Satt, Aharon, Shai Rozenberg, and Ron Hoory. "Efficient Emotion Recognition
from Speech Using Deep Learning on Spectrograms." INTERSPEECH. 2017.
17. Shi, Xingjian, et al. "Convolutional LSTM network: A machine learning approach for precipitation nowcasting." Advances in neural information processing systems. 2015.
18. Shi, Heng, Minghao Xu, and Ran Li. "Deep learning for household load
forecasting—A novel pooling deep RNN." IEEE Transactions on Smart Grid 9.5
(2017): 5271-5280.
19. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), https://zenodo.org/record/1188976#.XU739B0zbIV