Databases, Features and Classifiers For Speech Emotion Recognition: A Review
Received: 13 April 2017 / Accepted: 11 January 2018 / Published online: 19 January 2018
© Springer Science+Business Media, LLC, part of Springer Nature 2018
Abstract
Speech is an effective medium for expressing emotions and attitudes through language. Finding the emotional content of a speech signal and identifying the emotions in speech utterances are important tasks for researchers. Speech emotion recognition has been considered an important research area over the last decade, and the automated analysis of human affective behaviour has attracted many researchers. Consequently, a number of systems, algorithms and classifiers have been developed and described for identifying the emotional content of a person's speech. In this study, the available literature on the databases, features and classifiers used for speech emotion recognition across assorted languages is reviewed.
Keywords Speech corpus · Excitation features · Spectral features · Prosodic features · Classifiers · Emotion recognition
* Monorama Swain
[email protected]
Aurobinda Routray
[email protected]
P. Kabisatpathy
[email protected]
1 Department of Electronics and Communication Engineering, Silicon Institute of Technology, Bhubaneswar, Odisha, India
2 Electrical Engineering, Indian Institute of Technology Kharagpur, Kharagpur, West Bengal, India
3 Department of Electronics and Communication, CV Raman College of Engineering, Bhubaneswar, Odisha, India

1 Introduction

Emotions are expressed through various means such as responses, language, behaviour, body gestures, posture and movement. Many physiological processes, such as respiration, heart rate, temperature, skin conductivity, and muscle and neural membrane potentials, can also be used to express emotions. These non-invasive parameters are effective in understanding the emotional state of a speaker without any physical contact, and can therefore also be used to interpret the emotional status of the individual. The inherent difficulty of deciphering a speaker's emotional state from the voice arises from factors such as 'what is said' (factor 1), 'how it is said' (factor 2) and 'who says it' (factor 3). In the first factor, the speech carries information of linguistic origin and depends on the way the words are pronounced as representatives of the language, whereas the second factor carries paralinguistic information related to the speaker's emotional state. The third factor contains cumulative information about the speaker's basic features and identity, such as age, gender and body size (Borden et al. 1994). That is why it is very difficult to draw any affective information, and a specific emotion in particular, from the voice of an individual.

Affective computing has emerged as an active and interdisciplinary field of research in the area of automatic recognition, interpretation and compilation of human emotions (Picard 1997). The present work falls in the multidisciplinary domain of emotion recognition, involving psychology, social science, linguistics, neurology, neurophysiology, neuropsychophysiology, anthropology, cognitive science, digital signal processing, speech signal processing, natural language processing and artificial intelligence.

According to Salovey et al. (2004), emotional intelligence has four branches: perception of emotion, facilitation by thought, understanding of emotions and management of emotions. The speech signal has become the latest and fastest communication system between humans, involving complex signal processing systems, networking and multiple signaling units conveying information about the message, the speaker and the language. Extensive research has been carried out over the last few decades on the conversion of human speech into sequences of words.
In spite of all this, there is still a huge gap between man and machine, as a machine cannot understand the emotional state of the speaker and thereby fails to interpret the emotions of an individual. This has opened up a new research field called speech emotion recognition, whose basic goal is to understand and retrieve the desired emotions from the speech signal. Researchers are working to make speech the most efficient method of interaction between humans and machines, but the major obstacle is that machines do not have sufficient intelligence to recognize human voices.

Speech emotion recognition has a number of applications: (1) natural man-machine interaction, such as web movies; (2) computer movies and tutorial applications; (3) driver safety via an on-board car system in which a coded message about the driver's mental state is conveyed to the car's operating system; (4) a diagnostic tool for a therapist treating disease; (5) a tool for automatic translation systems in which the speaker plays a key role in the communication between parties; and (6) mobile communications (Ayadi et al. 2011).

Different theories exist regarding emotion, such as evolutionary theories, the James–Lange theory, the Cannon–Bard theory, Schachter and Singer's two-factor theory, and cognitive appraisal. Charles Darwin proposed that emotions evolved because they were adaptive. Most researchers are of the view that the concept of an emotion consists of an event, a perception or an interpretation, an appraisal, a physiological change, an action potential, and conscious awareness. Schachter and Singer (1962) developed theories based on physiological arousal and cognitive interpretation. The research of the psychologist Lazarus (1991) has also shown that people's experience of emotion depends on the way they appraise or evaluate the events around them. The two emotion-structure theories that have strongly influenced subsequent research on vocal emotion are the discrete-category and dimensional-structure theories. Dimensional theories try to categorize human emotions according to where they lie in one or two dimensions, and tend to theorize that affective states result from neuro-physiological systems. In recent years, the two-dimensional model of emotion has gained support among emotion researchers (Gomez and Danuser 2004; Schubert 1999), primarily because the two-dimensional circumplex model was utilized instead of an independent system (Russell 1980). Russell and Barrett (1999) proposed that all affective states arise from two independent neuro-physiological systems, namely valence and activation. Valence describes a pleasure–displeasure continuum, while activation (Russell 1980) describes the activeness of the subject in the emotional state. However, the two-dimensional space does not give sufficient information regarding emotion structure, so Wundt (2013) suggested that emotions could be described by their position in a three-dimensional space formed by the dimensions of valence (positive–negative), arousal (calm–excited) and tension (tense–relaxed). Russell and Mehrabian (1977) describe the subject's dominance, power or control over the situation, which leads to her/his state of emotion. In this three-dimensional space the primary emotion dimensions have been conceptualized in terms of pairs of opposites. Anger and fear are opposites in the sense that one implies attack and the other flight. In the same way, joy and sadness are opposites in the sense that one implies possession or gain while the other implies loss. Acceptance and disgust are opposites in the sense that one implies taking in and the other implies ejection or riddance. Surprise and anticipation are opposites in the sense that one implies the unpredictable and the other the predictable. A four-dimensional emotion space can be described by the four mutually orthogonal dimensions valence, activation, potency and intensity (Fontaine et al. 2007). In the evolutionary view, which is closely related to discrete emotion theory, each emotion is believed to have its own particular pattern of cognitive appraisal, physiological activity, action tendency and expression (Darwin 1872/1965; Ekman 1999; Izard 1992; Tomkins 1962). A limited number of basic emotions that have evolved in response to pertinent life problems, such as anger, fear, happiness and sadness, are dealt with using discrete emotion theories (Power and Dalgleish 2000). In this paper we present an overall review of speech emotion recognition systems, keeping in mind three aspects: the database design, the important speech features, and the classifiers used in speech emotion recognition systems. Though several reviews are already available, our literature survey gives insight into the features, databases and classifiers used from 2000 to 2017.

The article is arranged as follows. Section 2 reviews the databases used, Section 3 presents the features and the different classification techniques available in the literature, and conclusions are drawn in Section 4.

2 Speech corpus (emotional): a review

Speech emotion recognition is a formidable task for researchers in the area of speech processing. To evaluate the performance of a speech emotion recognition system it is essential to design a suitable database (Ayadi et al. 2011). Four criteria are necessary in preparing a database: the scope, the physical existence, the contents and the actual language chosen. The scope of a database design covers several kinds of variation in a database, such as the number of speakers, speaker gender, type of emotions, number of dialects, type of language and age. While creating the database, additional aspects such as accompanying signals (speech, language and physiological signals recorded simultaneously with the speech), the data collection purpose (emotional speech recognition, expressive synthesis), the emotional states recorded and the kind of emotions (natural, simulated, elicited) are also taken into consideration for better performance.
According to Ekman (1992), the basic emotions are categorized as anger, fear, sadness, sensory pleasure, amusement, satisfaction, contentment, excitement, disgust, contempt, pride, shame, guilt, embarrassment and relief. Non-basic emotions are called "higher-level" emotions (Buck 1999), and they are rarely represented in data collections. The methods and objectives of collecting speech corpora depend on the motivation behind the development of the speech system.

According to researchers, facial emotions are universal, but vocal signatures show variations in their features. Anger, for example, typically has a high fundamental frequency compared with sadness, which has a low fundamental frequency. These variations occur due to the distinction between real and simulated data, and some variations can also arise from differences in culture, language, gender and particular situations. Again, from the literature it is evident that the emotional scope of a database needs to be designed and developed carefully. Generally, representing natural speech from a male or female voice when producing a database is a difficult task, so some databases consist of video and audio clips from television, radio programs or call centers, and the natural speech recorded in such real-world situations is called spontaneous speech. Here we discuss some of the databases used for speech emotion recognition systems. Three kinds of databases are used for the development of speech emotion recognition systems: natural, simulated (acted) and elicited (induced) emotional speech databases (Ververidis and Kotropoulos 2006). A natural database is developed from spontaneous speech of real data. It includes data recorded from call center conversations, cockpit recordings during abnormal conditions, conversations between a patient and a doctor, emotional conversations in public places and similar interactions (Batliner et al. 2000). A simulated emotion speech database is known as an acted database because the speech utterances are collected from experienced, trained and professional artists; the emotions used for simulated databases are full-blown emotions (Ayadi et al. 2011). An elicited speech emotion database is one where the emotions are induced, i.e. an artificial emotional situation is created without the speaker's knowledge. The basic emotions used in the literature are anger, fear, sadness, sensory pleasure, amusement, satisfaction, contentment, excitement, disgust, contempt, pride, shame, guilt, embarrassment and relief. In addition, some physiological signals such as heart rate, blood pressure, respiration and the electroglottogram (EGG) can be recorded during experiments (Pravena and Govind 2017). Speech emotion recognition is also possible from multimodal bio-potential signals, as in Takahashi (2004), where the evaluation was done on the five emotions joy, anger, sadness, fear and relax; the recorded data are useful for repeating or augmenting the experiments.

Speech from interviews with specialists such as psychologists and scientists specialized in phonetics has also been used (Douglas-Cowie et al. 2003). A conversation of parents with their children, when their intention is to keep them away from dangerous objects, can be taken as a real-life example (Slaney and McRoberts 2003). Analysis of interviews between a doctor and a patient before and after medication was used in France et al. (2000). Speech has also been recorded during machine-human interaction, e.g. during telephone calls to automatic speech recognition (ASR) call centers, as discussed in Lee and Narayanan (2005). Speech Under Simulated and Actual Stress (SUSAS) (You et al. 1997) was created by the Robust Speech Processing Laboratory at the University of Colorado-Boulder under the direction of Professor John H. L. Hansen. This simulated database in the English language is partitioned into four domains, encompassing a wide variety of stresses and emotions. A total of 32 speakers (13 female, 19 male), with ages ranging from 22 to 76 years, were employed to generate 16,000 utterances. SUSAS also contains several longer speech files from four Apache helicopter pilots; those helicopter speech files were transcribed by the Linguistic Data Consortium and are available in SUSAS Transcripts. LDC Emotional Prosody Speech and Transcripts was developed by the Linguistic Data Consortium and contains audio recordings and corresponding transcripts. It is now part of a commercially available database in the English language and contains seven professional actors with 15 emotions and 10 utterances per emotion. The emotions considered are panic, anxiety, hot anger, cold anger, despair, sadness, elation, joy, interest, boredom, shame, pride, contempt and neutral (University of Pennsylvania Linguistic Data Consortium 2002). The AIBO database is a natural database which consists of recordings of children interacting with a robot. The database contains 110 dialogues and 29,200 words in 11 emotion categories: anger, boredom, emphatic, helpless, ironic, joyful, reprimanding, rest, surprise and touchy. The data labeling is based on the listeners' judgment (Batliner et al. 2004). The Berlin Database of Emotional Speech is a German acted database which consists of recordings from 10 actors (5 male, 5 female). The data consist of 10 German sentences recorded in anger, boredom, disgust, fear, happiness, sadness and neutral. The final database consists of 493 utterances after the listeners' judgment (Burkhardt et al. 2005). The Danish Emotional Speech Database is another audio database recorded from 4 actors (2 male, 2 female). The recorded data consist of 2 words, 9 sentences and 2 passages, resulting in 10 minutes of audio data. The recorded emotions are anger, happiness, sadness, surprise and neutral (Engberg and Hansen 1996). The RUSLANA simulated emotional speech database was used for linguistic and speech-processing research on the communicative and emotive attitudinal aspects of spoken language. A total of sixty-one native speakers (12 males and 49 females) of standard Russian were recorded for this database, and the emotions considered were anger, happiness, neutral, sadness, surprise and fear (Makarova and Petrushin 2002).
Zhang et al. (2004) present an approach to automatically recognize the emotions which children exhibit in an intelligent tutoring system. Emotion recognition can assist the computer agent in adapting its tutorial strategies to improve the efficiency of knowledge transmission; that study detects three emotional classes: confidence, puzzlement and hesitation. Some authors focus on the 'Interviews corpus', also known as the Belfast database (Douglas-Cowie et al. 2003), and on the 'EmoTV' corpus, a set of TV interviews in French recorded in the HUMAINE project (Abrilian et al. 2006). Further, speech from real-life spoken dialogues from call center services can be used for emotion recognition (Vidrascu and Devillers 2005). This corpus of naturally occurring dialogues was recorded in a real-life call center; it contains real agent-client recordings obtained under a convention between a medical emergency call center and LIMSI-CNRS. The transcribed corpus contains about 20 h of data, and around 404 agent-caller dialogues were involved (6 different agents and 404 callers).

The MASC (Mandarin Affective Speech Corpus) (Wu et al. 2006) contains recordings of 68 native speakers (23 female and 45 male) and five kinds of emotions: neutral, anger, elation, panic and sadness. Each speaker pronounces 5 phrases and 10 sentences three times for each emotional state, and 2 paragraphs for neutral. This database can also be used for the recognition of affectively stressed speakers and for prosodic feature analysis; speaker-recognition baseline experiments have also been performed on it. The interactive emotional dyadic motion capture database (IEMOCAP) (Busso et al. 2008) was collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California (USC). This database was recorded from ten actors in dyadic sessions with markers on the face, head and hands, which provide detailed information about their facial expressions and hand movements during scripted and spontaneous spoken communication scenarios. The actors performed selected emotional scripts and also improvised hypothetical scenarios designed to elicit specific types of emotions such as happiness, anger, sadness, frustration and the neutral state. The corpus contains approximately 12 h of data. The Surrey Audio-Visual Expressed Emotion (SAVEE) database (Haq and Jackson 2009) was recorded as a prerequisite for the development of an automatic emotion recognition system. The database consists of recordings from 4 male actors in 7 different emotions, 480 British English utterances in total. The sentences were chosen from the standard TIMIT corpus and are phonetically balanced for each emotion; the performance analysis was done by considering 10 subjects under audio, visual and audio-visual conditions. The SEMAINE database contains recordings of conversations between humans (the users) and artificially intelligent agents (the operators), with emotion labels for four emotion dimensions: activation, expectation, power and valence (McKeown et al. 2007).

Koolagudi et al. (2009) proposed a simulated database named IITKGP-SESC, which was recorded in the Telugu language with the help of professional artists from AIR Vijayawada, India. The database contains 10 professional artists (5 male and 5 female). For analyzing the emotions, 15 Telugu sentences were considered; each artist had to speak the 15 sentences in 8 basic emotions in one session, and 10 sessions were recorded for preparing the database. The total number of utterances in the database is 12,000 (15 sentences × 8 emotions × 10 artists × 10 sessions), so each emotion has 1500 utterances. The eight basic emotions considered for collecting the database were anger, compassion, disgust, fear, happy, neutral, sarcastic and surprise. The IITKGP-SEHSC (Rao and Koolagudi 2011) work also includes a Hindi dialect speech corpus collected from five different geographical regions (central, eastern, western, northern and southern) of India, representing the five dialects of Hindi. For each dialect, speech was taken from five male and five female speakers; the data were collected by posing questions arbitrarily so that the speakers described their childhood, the history of their home town and past memories, and the duration for each dialect was about 1-1.5 h. The IITKGP-SEHSC emotion database itself contains 10 professional artists (5 male and 5 female) from All India Radio (AIR) Varanasi, India. The eight emotions considered for recording this database are anger, disgust, fear, happy, neutral, sadness, sarcastic and surprise. Fifteen emotionally neutral Hindi sentences were chosen as text prompts; each artist had to speak the 15 sentences in the 8 given emotions in one session, 10 sessions were recorded, and each emotion has 1500 utterances, giving a total of 12,000 utterances (15 sentences × 8 emotions × 10 artists × 10 sessions). An IIIT-H semi-natural Telugu database was also used for speech emotion recognition (Gangamohan et al. 2013); it was recorded in the Telugu language and contains 7 students from IIIT Hyderabad for the emotions anger, happy, neutral and sad. The EMOVO simulated database was the first emotional speech database for the Italian language used for speech emotion recognition. Six actors with proven expertise (three males and three females) were summoned to utter fourteen sentences (assertive, interrogative, lists) in six basic emotional states (disgust, fear, anger, joy, surprise, sadness) plus the neutral state (Costantini et al. 2014).
In Agrawal (2011), studies were conducted to analyze, perceive and recognize commonly occurring emotions in Hindi speech. Experiments were also conducted to study and recognize the emotions anger, happiness, fear, sadness, surprise and neutral, based on the phonetic as well as prosodic parameters of the speech samples that change with emotion. The work presented by Pravena and Govind (2017) shows the development of a simulated emotion database for excitation source analysis; that study involves the development of a large simulated emotion database for three emotions (anger, happy and sad), along with neutrally spoken utterances, in three languages: Tamil, Malayalam and Indian English. Some other databases available in the literature are also listed in Table 1.
Table 1 Survey databases

Database | Language | Type of database | Size | Purpose and approach | Emotions

1. Engberg and Hansen (1996) | Danish emotional database | Simulated | Four actors, two of each gender (two isolated words, nine sentences and two passages) | Synthesis; to evaluate how well the emotional state in emotional speech is identified by humans | Anger, sadness, surprise, neutral, happiness
2. Montero et al. (1999) | Spanish | Simulated | Two sessions conducted with one professional actor (3 passages, 15 sentences of neutral-content text); 2000 phonemes per emotion are considered for analysis | Synthesis; pitch, tempo and stress are used for synthesis | Happiness, sadness, cold anger and surprise
3. Amir et al. (2000) | Hebrew | Natural | 40 subjects were considered (19 males, 21 females) | Physiologic evaluations; signal analysis over sliding windows and extraction of a representative feature set | Anger, fear, joy, sadness, disgust
4. Cecile Pereira (2000) | English | Simulated | 2 actors (40 utterances) | Recognition; findings on emotions on the three dimensional scales arousal, pleasure and power | Happiness, sadness, anger (hot and cold), neutral
5. Marc Schroder (2000) | German | Simulated | 6 native speakers (3 male, 3 female) | Recognition | Admiration, threat, disgust, elation, boredom, relief, startle, worry, contempt, anger
6. Iriondo et al. (2000) | Castilian Spanish | Simulated | Eight actors (4 male and 4 female); 336 discourses were recorded | Synthesis | Desire, disgust, fear, fury (anger), joy, sadness, surprise
7. Nogueiras et al. (2001) | Spanish | Simulated | Two professional actors (one male and one female) | Synthesis; emotion recognition using RAMES, the UOC's speech recognition system based on standard speech recognition technology using hidden semi-continuous Markov models | Anger, disgust, fear, joy, sadness, surprise, neutral
8. New et al. (2001) | Burmese | Simulated or acted | Two Burmese language speakers, 90 emotional utterances each from the two speakers | Recognition; a universal codebook is constructed based on emotions | Anger, dislike, fear, happiness, sadness and surprise
9. Yu et al. (2001) | Chinese | Simulated | Native TV actors; 721 short utterances per emotion are recorded | Recognition | Anger, happiness, neutral and sad
10. Makarova and Petrushin (2002) | Russian | Simulated | 61 native speakers (12 male, 49 female); 10 sentences were recorded per emotion, total 3660 utterances | Recognition; this database is a source for linguistic and speech processing research | Neutral (unemotional), surprise, anger, happiness, sadness and fear
11. Bulut et al. (2002) | English | Simulated | 1 actress | Synthesis | Anger, happiness, sadness, neutral
12. Scherer et al. (2002) | English and German | Natural | 100 native speakers | Recognition | Stress and load level
13. Tato et al. (2002) | German | Elicited | 14 native speakers | Synthesis | Anger, boredom, happiness, neutral, sad
14. Chuang and Wu (2002) | Chinese | Simulated | 2 actors (1 male and 1 female); from the male: 558 utterances contained in 137 dialogues; from the female: 453 sentences in 136 dialogues | Recognition; an emotional semantic network proposed to extract the schematic information related to emotion | Anger, surprise, sadness, fear, happiness and antipathy
15. Yuan et al. (2002) | Chinese | Elicited | 9 native speakers | Recognition | Anger, fear, joy, neutral, sadness
16. Hozjan et al. (2002) | English, Slovenian, Spanish and French | Simulated | One male and one female speaker have been recorded; for English, two male and one female speakers have been recorded | Synthesis; the recorded INTERFACE database is used to develop a multilingual emotion classifier | Anger, sadness, joy, fear, disgust, surprise and neutral
21. Lee and Narayanan (2003) | English | Natural | Unknown | Recognition; call center application | Negative (anger, frustration, boredom) and positive emotions (neutral, happiness, others)
22. Yamagishi et al. (2003) | Japanese | Simulated | 1 male speaker; phonetically balanced 503 sentences of the ATR Japanese database | Speech recognition and synthesis; an approach to realizing various emotional expressions and speaking styles in synthetic speech using HMM based synthesis | Joyful and sad
23. Schuller et al. (2003) | English, German | Natural and simulated | 5 speakers; total 5250 samples taken for analysis | Recognition; two different methods propagated for various feature analysis and comparison between two classifiers, GMM and HMM | Anger, disgust, fear, surprise, joy, neutral and sadness
24. Hozjan and Kacic (2003) | English, Slovenian, Spanish and French | Simulated | Total 9 speakers for each language; total 23,000 sentences were recorded; for the English language two male and one female speakers were recorded, and for the Slovenian, Spanish and French languages one male and one female speaker were recorded | Recognition; analysis of various acoustic and a large set of statistical features | Anger, sadness, joy, fear, disgust, surprise and neutral
25. Lida et al. (2003) | Japanese | Simulated | Two native speakers (one male and one female) | Synthesis; synthesizing emotional speech by a corpus-based concatenative speech synthesis system (ATR CHATR) using speech corpora of emotional speech | Anger, joy and sadness
26. Fernandez and Picard (2003) | English | Natural | Four drivers | Recognition; use of features derived from multi-resolution analysis of speech and TEO for classification of the driver's speech under stressed conditions | Stress
27. Jovičić et al. (2004) | Serbian | Simulated | Six actors (3 female, 3 male); the GEES database contains 32 isolated words, 30 short semantically neutral sentences, 30 long semantically neutral sentences and one passage of 79 words; the total database contains 2790 recordings with a speech duration of around 3 h | Recognition; designing, processing and evaluation of a Serbian emotional speech database | Neutral, anger, happiness, sadness and fear
28. Schuller et al. (2004) | German and English | Simulated | German and English sentences of 13 speakers, one female, were assembled; the German database contains 2829 emotional recorded samples for training | Recognition; combination of acoustic features and language information for a most robust automatic recognition of a speaker's emotion | Anger, disgust, fear, joy, neutral, sadness, surprise
32. Caldognetto et al. (2004) | Italian | Simulated | Single native speaker | Synthesis; analysis of the interaction between the articulatory lip targets of the Italian vowels and consonants defined by phonetic-phonological rules and the labial configurations peculiar to each emotion | Anger, disgust, fear, joy, sadness and surprise
33. Jiang and Cai (2004) | Chinese | Simulated | Single amateur actress; 200 Chinese utterances for each emotion | Recognition; combination of statistical features and temporal features | Anger, fear, happiness, sadness, surprise and neutral
34. Ververidis et al. (2004) | Danish | Simulated | Four actors (two male and two female); Danish emotional speech database; total amount of data used in the experiment was 500 speech segments (with no silence interruptions) | Recognition; feature analysis and classification | Anger, happiness, sadness, surprise and neutral
35. Jiang et al. (2005) | Mandarin | Natural | One female speaker; total 216 sad sentences, 143 happy sentences and 10 sentences per each emotion | Synthesis; analysis and modeling of the emotional prosody features | Sadness and happiness
36. Cichosz and Slot (2005) | Polish | Simulated | Four actors and four actresses; total 240 utterances uttered by the four actors and four actresses | Recognition; to determine a set of low dimensional feature spaces that provides high recognition rates | Anger, fear, sadness, boredom, joy and neutral (no emotion)
37. Lin and Wei (2005) | Danish | Simulated | Four actors (two male and two female) familiar with radio theatre | Recognition; gender dependent and gender independent speech emotion recognition | Anger, happiness, sadness, surprise and a neutral state
38. Luengo et al. (2005) | Basque | Simulated | One actress; total 97 recordings for each emotion were made; the database contains numbers, isolated words and sentences of different length | Emotion identification; analysis of prosodic features and spectral features with GMM and SVM classifiers for emotion identification | Anger, fear, surprise, disgust, joy and sadness
39. Lee and Narayanan (2005) | English | Natural | Customers and call attendants; call center conversations are recorded | Recognition | Negative and positive
40. Pao et al. (2005) | Mandarin | Simulated | Eighteen males and sixteen females uttered 20 different utterances; total 3400 sentences were recorded | Recognition; evaluation of Mandarin speech using weighted D-KNN classification | Anger, happiness, sadness, boredom and neutral
41. Batliner et al. (2006) | English | Elicited | 51 school children (21 male and 30 female) | Recognition; children are asked to react spontaneously with the Sony AIBO pet robot; around 9.5 h of effective emotional expressions of children were recorded | Different elicited emotions are recorded
42. Wu et al. (2006) | Chinese | Simulated | Non-broadcasting speakers; total 25 male and 25 female speakers were involved in the recording process | Recognition; study on a GMM-UBM based speaker verification system on emotional speech | Anger, fear, happiness, sadness, neutral
43. Grimm et al. (2006) | English | Simulated | EMA (electromagnetic articulography) database contains 680 emotional speech utterances, generated by one female professional and two non-professional (one male and one female) speakers; the female speakers produce 10 sentences and the male speaker produces 14 sentences, each for 4 different emotions | Recognition; feature based categorical classification and primitives-based dynamic emotion estimation | Happy, angry, sad, neutral
44. Morrison et al. (2007) | Mandarin and Burmese | Natural and simulated | (1) Natural database contains 11 speakers with 388 utterances for two emotion classes; (2) ESMBS database contains 12 emotional speeches of Mandarin and Burmese speakers with 720 utterances for six emotions; six Mandarin and six Burmese speakers were used; 10 different sentences uttered by the speakers | Recognition; call center applications | (1) Anger, neutral; (2) Anger, happiness, sadness, disgust, fear, surprise
45. Kandali et al. (2008a) | Assamese | Simulated | MESDNEI (multilingual emotional speech database of North East India) contains short sentences of six full-blown basic emotions with neutral; total 140 simulated utterances per speaker were collected for 5 native languages of Assam; specifically, students and faculty members from educational institutions were chosen for the recording; 30 subjects (3 male and 3 female per language) were chosen for recording | Recognition; vocal emotion recognition | Anger, disgust, fear, happiness, sadness, surprise and 1 no-emotion (neutral)
46. Grimm et al. (2008) | German | Natural | 104 native speakers (44 male and 60 female) | Recognition; 12 h of audio-visual recording is done using the German TV talk show Vera am Mittag; emotion annotation is done based on the activation, valence and dominance dimensions | Two emotions for each emotional dimension are recorded: (1) activation (calm-excited), (2) valence (positive-negative), (3) dominance (weak-strong)
47. Koolagudi et al. (2009) | Telugu | Simulated | The database contains 10 professional artists (5 male and 5 female) from All India Radio (AIR) Vijayawada; total number of utterances recorded in the database was 12,000 (15 sentences, 8 emotions, 10 artists and 10 sessions); each emotion contains 1500 utterances | Recognition; design, acquisition, post processing and evaluation of the IITKGP-SESC database | Anger, disgust, fear, happy, compassion, neutral, sarcastic, surprise
48. Mohanty and Swain (2010) | Oriya | Elicited | Database contains 35 speakers (23 male and 12 female), reading text fragments taken from various Oriya drama scripts | Recognition; creation of an Odiya database and emotion recognition from Odiya speech | Anger, sadness, astonishment, fear, happiness, neutral
49. Rao and Koolagudi (2011) | Hindi | Natural and simulated | (1) Hindi dialect speech corpus used for dialect identification; it contains 5 females and 5 males, with sentences uttered based on their past memories; (2) IITKGP-SEHSC corpus used for speech emotion recognition; it contains 10 professional artists from All India Radio Varanasi, India; total 12,000 utterances recorded and each emotion has 1500 sentences | Recognition and identification; dialect identification, emotion recognition and feature analysis | Anger, disgust, fear, happy, neutral, sadness, surprise, sarcastic
50. Koolagudi et al. (2012) | Hindi | Semi-natural | Utterances taken for the database were recordings | Recognition; proposed a semi-natural database | Sad, anger, happy, neutral
54. Ooi et al. (2014) | German, English, Mandarin, Urdu, Punjabi, Persian and Italian | Simulated | (1) EMO-DB (Berlin emotional database) contains 10 speakers (5 male and 5 female); 10 sentences were chosen for recording and a total of 840 recorded utterances were used; (2) eNTERFACE'05 audio-visual emotion database contains 1170 utterances from 42 subjects (34 male and 8 female) chosen from different nations; (3) RML (audio-visual emotion database) contains 720 videos from 8 subjects | Recognition; a new architecture of intelligent audio emotion recognition is introduced and an analysis of different prosodic and spectral features was done | (1) Anger, boredom, disgust, fear, happiness, neutral, sadness; (2) Happy, angry, disgust, sad, surprise, fear; (3) Anger, disgust, fear, happiness, surprise, sadness, neutral
55. Mencattini et al. (2014) | Italian | Simulated | EMOVO Italian speech corpus; it contains 588 recordings: 14 Italian sentences by 6 professional actors (3 male and 3 female) | Recognition; a PLS regression model was introduced, and new speech features related to speech amplitude modulation parameters were discussed | Disgust, joy, fear, anger, sadness, surprise, neutral
56. Kadiri et al. (2015) | Telugu and German | Semi-natural and simulated | (1) Students (two females and five males) were involved in the recording process and the utterances were recorded based on past memories; total 200 utterances recorded for the experiment (IIIT-H Telugu emotion database); (2) EMO-DB Berlin emotion database: 10 professional native German actors (5 males and 5 females) were asked to speak 10 sentences in different emotions; total 535 utterances were recorded and 339 utterances were taken for the final experiment | Recognition; excitation source feature analysis for speech emotion recognition | Anger, happy, neutral, sad
57. Song et al. (2016) | German and English | Simulated | Berlin dataset: emotional utterances recorded by ten actors (5 males and 5 females) in the German language; total 494 utterances were used for the experiment; eNTERFACE (audio-visual database): 42 speakers were allotted for recording (34 males and 8 females); total 1170 video samples were collected | Recognition; a novel transfer non-negative matrix factorization (TNMF) method is presented for cross-corpus speech emotion recognition | (1) Anger, boredom, disgust, fear, happiness, sadness and neutral; (2) Anger, disgust, fear, happiness, sadness and surprise
58. Brester et al. (2016) | German, English, Japanese | Simulated and natural | Four emotional databases; (1) EMO-DB (German database) recorded at the Technical University of Berlin | Recognition; evolutionary feature selection technique based on the two-criterion optimization model | (1) Neutral, anger, fear, joy, sadness, boredom or disgust; (2) Anger, disgust, fear, happiness, sadness, surprise
3 Speech features and classifiers

In designing a system that recognizes emotions from speech, the identification and extraction of the different emotion-related speech features is a challenging task. In real scenarios humans have the ability to interpret and detect linguistic and paralinguistic information, and the proper selection of speech features affects the classification performance. Several kinds of features, such as local, global, continuous, qualitative, spectral and Teager Energy Operator (TEO) based features, excitation source features and vocal tract features, have been reviewed for this pattern recognition problem by Koolagudi and Rao (2012a). Since the speech signal is non-stationary in nature, it is divided into small segments called frames, within which it can be treated as stationary. Here we present a study based on some important speech features, namely excitation source features, vocal tract (system) features, prosodic features and different combinations of features. We have also emphasized the classifiers used for developing speech emotion recognition systems. In the literature, several classifiers have been implemented and tested for developing speech emotion recognition systems, and to evaluate their performance the design of the databases is of paramount importance. Every speech database has been created on the basis of particular environmental conditions and a particular language, but sometimes the features selected to design a classifier are not robust enough for speech emotion detection, so classifiers are usually trained and tested using the same database. In the literature, a variety of classifiers designed and modelled for recognizing emotions from speech are available, such as single classifiers, multiple classifiers, hybrid classifiers and ensemble classifiers. In the following sections we present the details of the literature survey on excitation source features, spectral features and prosodic features, as well as the different classifiers developed for speech emotion recognition systems.
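The short-time framing mentioned above can be sketched in a few lines of NumPy. The 25 ms frame length, 10 ms hop and Hamming window below are illustrative choices for this sketch, not values prescribed by any of the reviewed papers.

```python
import numpy as np

def frame_signal(x, sr, frame_ms=25.0, hop_ms=10.0):
    """Split a speech signal into overlapping, windowed frames.

    Within each short frame the signal is treated as quasi-stationary,
    which is the assumption behind all frame-level features discussed here.
    """
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    window = np.hamming(frame_len)
    frames = np.stack([x[i * hop_len: i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len)

# Example: 1 s of synthetic "speech" at 16 kHz -> about 98 frames of 400 samples
sr = 16000
x = np.random.randn(sr).astype(np.float32)
print(frame_signal(x, sr).shape)
```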
3.1 Excitation source features

The speech features derived from excitation source signals are called source features. The vocal tract (VT) characteristics are suppressed to obtain the excitation source signal. The linear prediction (LP) residual contains information about the excitation source; the VT information is predicted using the filter coefficients and separated from the signal through an inverse-filter formulation (Makhoul 1975). The glottal volume velocity (GVV) signal is also used to represent the excitation source, and it is derived from the LP residual signal. The sub-segmental analysis of the speech signal covers the glottal pulse, the open and closed phases of the glottis, and the strength of excitation. Correlates of the excitation source information are obtained from the LP residual signal and the glottal volume velocity (GVV). The LP residual signal contains valid information as the primary excitation of the vocal tract while speech is being produced. Pitch information extracted from the LP residual signal was successfully used by Atal (1972), and LP residual energy has also been used for vowel and speaker recognition by Wakita (1976). A novel comparative analysis of two feature sets, the glottal-waveform-based AUSEEG features and the speech-based AUSEES features, was proposed in He et al. (2010). That study involves an English dataset containing 170 adult speakers for the recognition of seven emotions: contempt, angry, anxious, dysphoric, pleasant, neutral and happy. MFCC parameters were also considered for comparison with the proposed features, and GMM and KNN classifiers were used for the classification of emotions in the performance evaluation. It was observed that the new AUSEEG features, representing the spectral energy distribution of the glottal waveform, give better classification rates than the AUSEES features, which represent the spectral energy distribution of the speech signal. The excitation component of speech has also been exploited for speaker recognition studies in Prasanna et al. (2006).
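A minimal sketch of obtaining the LP residual by inverse filtering, in the spirit of the Makhoul-style formulation described above, is given below. The use of librosa's LPC routine, a 10th-order model and an 8 kHz toy frame are assumptions of this sketch rather than the exact procedure of any cited study.

```python
import numpy as np
import librosa
import scipy.signal

def lp_residual(frame, order=10):
    """Return the LP residual of one speech frame via inverse filtering.

    The LP coefficients model the vocal tract; filtering the frame with the
    inverse filter A(z) suppresses the vocal tract contribution and leaves
    (approximately) the excitation source signal.
    """
    a = librosa.lpc(frame, order=order)            # a[0] == 1.0
    residual = scipy.signal.lfilter(a, [1.0], frame)
    return residual

# Toy example: a decaying sinusoid standing in for one voiced frame.
sr = 8000
t = np.arange(0, 0.03, 1.0 / sr)
frame = (np.sin(2 * np.pi * 120 * t) * np.exp(-20 * t)).astype(np.float64)
res = lp_residual(frame)
print(frame.shape, res.shape)  # the residual has the same length as the frame
```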
Chauhan et al. (2010) explored the linear prediction (LP) residual of the speech signal for characterizing basic emotions. An auto-associative neural network (AANN) and Gaussian mixture models (GMM) were considered for emotion classification on the IITKGP Simulated Emotion Speech Corpus (IITKGP-SESC), which includes the eight emotions anger, compassion, disgust, fear, happy, neutral, sarcastic and surprise; the emotion recognition performance was observed to be about 56%. Epoch-based analysis of speech helps not only to segment the speech signal based on speech-production characteristics but also to analyze speech accurately (Yegnanarayana and Gangashetty 2011). It enables the extraction of important acoustic-phonetic features such as glottal vibrations, formants and the instantaneous fundamental frequency. Accurate estimation of epochs helps in characterizing voice-quality features, and epoch extraction also helps in speech enhancement and multi-speaker separation.

Prasanna and Govind (2010) examined the effect of emotions on the excitation source of speech production. Five emotions (neutral, angry, happy, boredom and fear) were considered for the study. Initially, the electroglottogram (EGG) and its derivative signals were compared across the different emotions. The mean, standard deviation and contour of the instantaneous pitch, as well as the strength-of-excitation parameters, were derived by processing the derivative of the EGG and also the speech signal using the zero-frequency filtering (ZFF) approach. The work presented by Pravena and Govind (2017) explored the effectiveness of the excitation source parameters, namely the strength of excitation and the instantaneous fundamental frequency (F0), for emotion recognition from speech and electroglottographic (EGG) signals; a GMM classifier was considered for the performance evaluation. Nandi et al. (2017) show that the linear prediction (LP) residual signal can be parameterized to capture excitation source information for a language identification (LID) study. The LP residual signal was processed at three different levels, sub-segmental, segmental and supra-segmental, to demonstrate different aspects of language-specific excitation source information. The proposed excitation source features were evaluated on 27 Indian languages from the Indian Institute of Technology Kharagpur Multi-Lingual Indian Language Speech Corpus (IITKGP-MLILSC), the Oregon Graduate Institute Multi-Language Telephone-based Speech (OGI-MLTS) corpus and the National Institute of Standards and Technology Language Recognition Evaluation (NIST LRE) 2011 corpus. LID systems were developed using Gaussian mixture model (GMM) and i-vector based approaches. Experimental results showed that segmental-level parametric features provide better identification accuracy (62%) compared to sub-segmental (40%) and supra-segmental (34%) features. Table 2 presents the excitation source features used in the available literature.
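The zero-frequency filtering (ZFF) idea referred to above can be sketched as follows: the differenced signal is passed through two cascaded zero-frequency resonators, the slowly varying trend is removed by repeated local-mean subtraction, and the negative-to-positive zero crossings of the result are taken as epoch locations, from which an instantaneous F0 and a simple strength-of-excitation proxy follow. The window length, the number of trend-removal passes and the slope-based strength measure are illustrative assumptions of this sketch, not the exact settings of the cited studies.

```python
import numpy as np

def zff_epochs(x, sr, mean_win_ms=10.0):
    """Zero-frequency filtering sketch: epochs, instantaneous F0, strength of excitation."""
    d = np.diff(x, prepend=x[0])                 # pre-emphasis by differencing
    y = d.astype(np.float64)
    for _ in range(2):                           # two cascaded zero-frequency resonators
        out = np.zeros_like(y)
        for n in range(2, len(y)):
            out[n] = 2.0 * out[n - 1] - out[n - 2] + y[n]
        y = out
    win = max(3, int(sr * mean_win_ms / 1000))   # local-mean window (~1-2 pitch periods)
    kernel = np.ones(win) / win
    for _ in range(3):                           # repeated trend removal
        y = y - np.convolve(y, kernel, mode="same")
    epochs = np.where((y[:-1] < 0) & (y[1:] >= 0))[0]   # negative-to-positive crossings
    f0 = sr / np.diff(epochs) if len(epochs) > 1 else np.array([])
    strength = np.abs(y[epochs + 1] - y[epochs])        # slope at the crossing, a crude proxy
    return epochs, f0, strength

# Toy usage: an impulse train at a 120 Hz "glottal" rate
sr = 8000
x = np.zeros(sr // 2)
x[::sr // 120] = 1.0
epochs, f0, soe = zff_epochs(x, sr)
print(len(epochs), np.round(np.median(f0), 1) if len(f0) else None)
```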
3.2 Prosodic feature

Prosody, or supra-segmental information, structures the flow of speech and consists of duration, intensity, intonation and sound units. The acoustic correlates of prosodic features include pitch, energy, duration and their derivatives (Rao and Yegnanarayana 2006). The four levels of computing prosody are (a) the linguistic intention level, (b) the articulatory level, (c) the acoustic realization level and (d) the perceptual level (Werner and Keller 1994). Different linguistic elements are related in an utterance by bringing semantic emphasis onto an element, and the articulatory movements include the physical movements of the muscles in the throat. The fundamental frequency, intensity and duration help in the analysis of prosody factors according to Rao and Yegnanarayana (2006). Prosody is also expressed to listeners in the form of pauses, length and melody, and these acoustic properties are therefore used to analyze it. Prosodic features such as the mean, maximum, minimum, variance and standard deviation of pitch and energy are used by Dellert et al. (1996b), who considered maximum-likelihood Bayes classification, kernel regression and k-nearest-neighbour methods. The steepness of the fundamental frequency, the articulation duration, and the rise and fall durations are also measured. Fear, anger, sadness and joy are expressed as peaks and troughs of F0, and the minimum, maximum and median values of F0 are emotion-salient features. Emotions are analyzed using short-time supra-segmental features such as pitch, energy, formant locations and their bandwidths, the dynamics of the pitch, energy and formant contours, and the speaking rate (Ververidis and Kotropoulos 2006).
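Utterance-level statistics of the kind listed above can be computed with librosa, as sketched below. The pYIN pitch range, the RMS-based energy measure and the hypothetical WAV file name are assumptions of this sketch; the reviewed papers used their own extraction tools (e.g. Praat) and feature sets.

```python
import numpy as np
import librosa

def prosodic_stats(path):
    """Utterance-level prosodic statistics: mean/max/min/std of F0 (voiced frames only),
    mean/std of frame energy, and duration."""
    y, sr = librosa.load(path, sr=None)                 # 'path' is a hypothetical WAV file
    f0, voiced, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    f0 = f0[voiced & ~np.isnan(f0)]                     # keep voiced, defined F0 values
    rms = librosa.feature.rms(y=y)[0]                   # short-time energy (RMS) per frame
    return {
        "duration_s": len(y) / sr,
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_max": float(np.max(f0)) if f0.size else 0.0,
        "f0_min": float(np.min(f0)) if f0.size else 0.0,
        "f0_std": float(np.std(f0)) if f0.size else 0.0,
        "energy_mean": float(np.mean(rms)),
        "energy_std": float(np.std(rms)),
    }

# Usage (hypothetical file): print(prosodic_stats("angry_utterance.wav"))
```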
Nogueiras et al. (2001) investigated the classification of emotions in speech with hidden Markov models (HMMs). They tested different HMM configurations and found that increasing the number of states from one to 64 monotonically improved the recognition accuracy. The best reported recognition accuracy was 82.5%, obtained using HMMs with 64 states and all 11 prosodic features based on energy and pitch.
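The HMM-based scheme can be imitated in outline with the third-party hmmlearn package: one Gaussian HMM is trained per emotion on frame-level prosodic feature sequences, and a test utterance is labelled with the model giving the highest log-likelihood. The feature choice, the four-state topology and the hmmlearn dependency are assumptions of this sketch, not details taken from Nogueiras et al. (2001).

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed installed (pip install hmmlearn)

def train_emotion_hmms(sequences_by_emotion, n_states=4):
    """Fit one Gaussian HMM per emotion.

    sequences_by_emotion: dict mapping emotion -> list of (T_i, D) arrays of
    frame-level features (e.g. pitch and energy per frame).
    """
    models = {}
    for emotion, seqs in sequences_by_emotion.items():
        X = np.vstack(seqs)
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[emotion] = m
    return models

def classify(models, seq):
    """Pick the emotion whose HMM gives the highest log-likelihood for the sequence."""
    return max(models, key=lambda e: models[e].score(seq))

# Toy usage with synthetic 2-D frame features (stand-ins for pitch/energy contours)
rng = np.random.default_rng(0)
data = {
    "anger":   [rng.normal(loc=[220.0, 0.8], scale=0.3, size=(80, 2)) for _ in range(5)],
    "sadness": [rng.normal(loc=[140.0, 0.3], scale=0.3, size=(80, 2)) for _ in range(5)],
}
models = train_emotion_hmms(data)
test = rng.normal(loc=[220.0, 0.8], scale=0.3, size=(80, 2))
print(classify(models, test))   # expected to favour "anger" for this synthetic contour
```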
Ramamohan and Dandapat (2006) computed sinusoidal-model-based features consisting of 10 frequencies (sorted in ascending order) and 10 phase angles corresponding to the 10 most significant amplitude peaks, at which the slope of the amplitude changes from positive to negative. A trained vector quantizer (VQ) based classifier and a hidden Markov model with a discrete state output probability density function (VQ-HMM) were used, and they reported success scores of 80% for Telugu and 70% for English. Agrawal et al. (2009) collected 5 short Hindi sentences uttered by 6 male graduate drama-club students aged between 20 and 23 years, with 4 repetitions, in the emotions anger, fear, happiness, sadness and neutral. They computed pitch using the Praat software and applied neural network (NN) and Fisher's linear discriminant analysis (FLDA) classifiers, achieving average recognition success scores of 64.3% and 56.2% for the Hindi language. Koolagudi et al. (2009) created the IITKGP-SESC database, which consists of 10 repetitions of 15 portrayed utterances in the emotions anger, compassion, disgust, fear, happy, neutral, sarcastic and surprise, uttered by 10 professional actors (5 males and 5 females) in the Telugu language. They computed the mean duration of each utterance, the mean and standard deviation of the pitch values, and the mean energy across each utterance of 1 male and 1 female speaker. The achieved average recognition success scores were 75% and 69% for the male and female speakers respectively, using a Euclidean distance classifier.
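Euclidean-distance classification of this kind can be read as a nearest-template rule: the utterance-level prosodic vectors of each emotion are averaged into a template, and a test vector is assigned to the emotion whose template is closest. The sketch below, with a z-score normalisation added so that duration, pitch and energy are comparable, is an illustrative reconstruction under these assumptions, not the authors' exact implementation.

```python
import numpy as np

def fit_templates(X, y):
    """X: (N, D) utterance-level prosodic vectors, y: emotion labels.
    Returns per-emotion mean templates plus the normalisation statistics."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-9
    Xn = (X - mu) / sigma
    templates = {lab: Xn[np.asarray(y) == lab].mean(axis=0) for lab in set(y)}
    return templates, mu, sigma

def predict(templates, mu, sigma, x):
    """Assign the label of the nearest template in Euclidean distance."""
    xn = (np.asarray(x) - mu) / sigma
    return min(templates, key=lambda lab: np.linalg.norm(xn - templates[lab]))

# Toy usage: [mean duration (s), mean F0 (Hz), F0 std, mean energy] per utterance
X = np.array([[1.2, 230, 40, 0.09], [1.1, 240, 45, 0.10],   # anger-like
              [1.8, 150, 15, 0.03], [1.9, 145, 12, 0.04]])  # sadness-like
y = ["anger", "anger", "sadness", "sadness"]
templates, mu, sigma = fit_templates(X, y)
print(predict(templates, mu, sigma, [1.15, 235, 42, 0.095]))  # -> "anger"
```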
Lee et al. (2011) computed the zero-crossing rate, root-mean-square energy, harmonics-to-noise ratio and 12 mel-frequency cepstral coefficients and their deltas for emotion analysis. The study involved a hierarchical binary decision-tree approach and an SVM classifier for the analysis of the emotions angry, happy, sad, neutral and emphatic. Chen et al. (2012) worked on the BHUDES database, which contains Mandarin utterances of six emotions, including sadness (sad), anger (ang), fear (fea), happiness (hap) and disgust (dis). The database consists of 5400 utterances performed by 15 speakers (actors and actresses) whose ages were between 20 and 25. They conducted experiments on a speaker-independent system and computed some instantaneous features with SVM and ANN classifiers. Mirsamadi et al. (2017) utilized deep learning to automatically discover emotionally relevant features from speech. That study involved the classification of four emotions (happy, sad, neutral and angry) from the IEMOCAP dataset, and speaker-independent speech emotion recognition was performed using a deep recurrent neural network classifier. Jin and Wang (2005) studied the distribution of seven emotions in spoken Chinese, including joy, anger, surprise, fear, disgust, sadness and neutral, in the two-dimensional space of valence and arousal, and analyzed the relationship between the dimensional ratings and the prosodic characteristics in terms of the F0 maximum, minimum, range and mean. Thirty-two different acoustic features related to F0, energy, duration and tune were studied for the classification of five emotion states in McGilloway et al. (2000).

Table 3 presents some of the prosodic features and the classifiers used in the literature for speech emotion recognition.
Disgust, Fear, Happiness, Sadness, Surprise) and 1 ‘no-
3.3 Vocal tract feature emotion’ i.e. Neutral from 30 native speakers (3 Males and
3 Females per language) of 5 native languages of Assam
Vocal tract features are basically known as spectral features (India). They performed text-and-speaker-independent
or segmental features. To extract the vocal tract system experiments using utterances of all the languages taken
features a speech segment of length 20–30 ms is gener- together. They reported a success score of 73.1, 75.1,
ally required. For the analyses of different speech features 60.9, 61.1, 95.1, 93.8, 93.7 and 94.29% using feature
formants, bandwidth spectral energy and slope of the sig- sets MFCC, tfMFCC (Teager-Energy-Operated-in-Trans-
nal from the spectrum, it is necessary to find the Fourier form-Domain MFCC), LFPC, tfLFPC, WPCC (Wavelet
transform of a speech frame. The Fourier transform on log Packet Cepstral Coefficient), tfWPCC, WPCC2 (Wavelet
magnitude spectrum gives the cestrum (Rabiner and Juang Packet Cepstral Coefficient computed by method 2) and
1993). The cepstral domain represents the MFCCs (Mel fre- tfWPCC2 respectively with a GMM classifier. Rao et al.
quency spectral frequency coefficients), PLPCs (perpetual (2012) worked on spectral features for both acted and real
linear prediction coefficients) and the LPCCs (linear predic- databases with a GMM classifier. Ververidis et al. (2004)
tions cepstral coefficients) which are also called segmental they have considered 87 features calculated over 500
or system features (Ververidis and Kotropoulos 2006). In utterances of the Danish Emotional Speech database. The
the literature spectral features have been successfully used study involves The Sequential Forward Selection method
for various speech applications in the area of development (SFS) has been used in order to discover the 5–10 fea-
of speech recognition and speaker recognition systems. Here tures which are able to classify the samples in the best
we have discussed some of the studies on speech emotion way for each gender. Bayes classifier was considered for
recognition using spectral features. classification of emotions: anger, happiness, neutral, sad-
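A per-emotion GMM over frame-level MFCCs, the kind of spectral-feature classifier used in several of the studies discussed below, can be sketched as follows. The file lists, the 13-coefficient setting and the 8-component mixtures are illustrative assumptions of this sketch.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, n_mfcc=13):
    """Frame-level MFCC matrix (n_frames, n_mfcc) for one utterance."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_gmms(files_by_emotion, n_components=8):
    """Fit one GMM on the pooled MFCC frames of each emotion's training files."""
    gmms = {}
    for emotion, files in files_by_emotion.items():
        X = np.vstack([mfcc_frames(f) for f in files])
        gmms[emotion] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(X)
    return gmms

def classify(gmms, path):
    """Average per-frame log-likelihood under each emotion GMM; pick the best."""
    X = mfcc_frames(path)
    return max(gmms, key=lambda e: gmms[e].score(X))

# Usage with hypothetical file lists:
# gmms = train_gmms({"anger": ["ang_01.wav", "ang_02.wav"],
#                    "sad":   ["sad_01.wav", "sad_02.wav"]})
# print(classify(gmms, "test_utterance.wav"))
```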
The utterances of 10 sentences in the anger, fear, joy, sadness, disgust and surprise emotions in Mandarin and Burmese were examined by New et al. (2003). The values of the LFPCs (log-frequency power coefficients) were computed for each frame and discrete Markov models were developed; the success rate was 79.9% for Burmese and 76.4% for Mandarin. Kandali et al. (2008a) collected 1134 utterances (5 repeated utterances of 7 short sentences, 1 sentence per emotion, uttered by each speaker, plus 1 long sentence per emotion uttered by each speaker from his own thoughts) of the 6 full-blown basic emotions (anger, disgust, fear, happiness, sadness, surprise) and 1 'no-emotion', i.e. neutral, from 27 native speakers (14 males and 13 females) of the Assamese language. That work reports a success score of 76.5% in a text-independent but speaker-dependent experiment using the spectral MFCC (mel-frequency cepstral coefficient) feature set and a GMM (Gaussian mixture model) classifier. Tests were also conducted with utterance contents and speakers that differed from the training conditions, i.e. text-and-speaker-independent experiments (Kandali et al. 2008a). Firoz Shah et al. (2009) analyzed an elicited database consisting of 700 utterances for the emotions neutral, happy, sad and anger in Malayalam (one of the south Indian languages) using both discrete wavelet transforms (DWTs) and mel-frequency cepstral coefficients (MFCCs). The overall recognition accuracies were observed to be 68.5% and 55%, using an artificial neural network classifier for speech emotion recognition. Kandali et al. (2008b) collected 4200 utterances (140 utterances of different short sentences, 20 sentences per emotion, uttered by each speaker) of the 6 full-blown basic emotions (anger, disgust, fear, happiness, sadness, surprise) and 1 'no-emotion', i.e. neutral, from 30 native speakers (3 males and 3 females per language) of 5 native languages of Assam (India). They performed text-and-speaker-independent experiments using the utterances of all the languages taken together, and reported success scores of 73.1, 75.1, 60.9, 61.1, 95.1, 93.8, 93.7 and 94.29% using the feature sets MFCC, tfMFCC (Teager-energy-operated-in-transform-domain MFCC), LFPC, tfLFPC, WPCC (wavelet packet cepstral coefficient), tfWPCC, WPCC2 (wavelet packet cepstral coefficient computed by method 2) and tfWPCC2, respectively, with a GMM classifier. Rao et al. (2012) worked on spectral features for both acted and real databases with a GMM classifier. Ververidis et al. (2004) considered 87 features calculated over 500 utterances of the Danish Emotional Speech database. The study used the Sequential Forward Selection (SFS) method in order to discover the 5-10 features able to classify the samples best for each gender, and a Bayes classifier was considered for the classification of the emotions anger, happiness, neutral, sadness and surprise. A correct classification rate of 61.1% was obtained for male subjects and a corresponding rate of 57.1% for female subjects; in the same experiment, a random classification would result in a correct classification rate of 20%. When gender information was not considered, a correct classification score of 50.6% was obtained.
and 1 ‘no-emotion’ i.e. Neutral were collected by Kandali emotions: happy, angry, neutral, surprised, fearful, and sad
et al. (2008a) from 27 native speakers (14 Males and 13 in Berlin emotion database (BES). Table 4 represents the
Females) of the Assamese language. This work reports some of the vocal tract features with classifiers used in the
a success score of 76.5% in the text-independent but literature for speech emotion recognition.
speaker-dependent experiment using the spectral feature
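To make the frame-based spectral analysis described above concrete, the short Python sketch below extracts 13 MFCCs from 25 ms frames of an utterance and pools them into a fixed-length vector. It is a minimal illustration under stated assumptions (the librosa library, the placeholder file name utterance.wav, and the frame settings are choices of this sketch), not the pipeline of any cited study.

```python
import numpy as np
import librosa

# Load an utterance (file name is a placeholder) at its native sampling rate.
signal, sr = librosa.load("utterance.wav", sr=None)

# Frame-level MFCCs: 25 ms windows with a 10 ms hop, 13 coefficients per frame,
# mirroring the 20-30 ms segment length mentioned in Sect. 3.3.
mfcc = librosa.feature.mfcc(
    y=signal,
    sr=sr,
    n_mfcc=13,
    n_fft=int(0.025 * sr),
    hop_length=int(0.010 * sr),
)

# A common utterance-level representation: mean and standard deviation of each coefficient.
utterance_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(utterance_vector.shape)  # (26,)
```

Pooling frame-level coefficients into per-utterance statistics is one common way to feed spectral features to utterance-level classifiers such as GMMs or SVMs; frame sequences can instead be kept intact when an HMM is used.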
Table 3 Literature survey of prosodic features used for speech emotion recognition

Banse and Scherer (1996). Feature: fundamental frequency/pitch (F0), energy, speech rate, and spectral information in voiced and unvoiced portions. Approach and classifier: acoustic profiles or vocal cues for emotion expression using actors' voices for fourteen emotion categories.
Nicholson et al. (2006). Feature: prosodic features. Approach and classifier: recognizing emotions using a large database of phoneme-balanced words, speaker- and context-independent; one-class-in-one neural networks used for classification of emotions; around 50% recognition rate was achieved.
Picard et al. (2001). Feature: statistical features (mean and standard deviation of raw signals, absolute values of first and second differences of raw signals), sequential forward selection search, Fisher projection. Approach and classifier: electromyogram, blood volume pulse, skin conductance and respiration signals; hybrid linear discriminant analysis used for classification of the emotions neutral, anger, hate, grief, platonic love, romantic love, joy and reverence.
Park and Sim (2003). Feature: prosodic features. Approach and classifier: recognition of four emotions (neutral, anger, laugh, surprise) and feature analysis; DRNN (dynamic recurrent neural network) for classification.
Tao and Kang (2005). Feature: prosodic features (intonation, speaking rate, intensity). Approach and classifier: CART model and a weight-decay neural network model considered for the classification performance analysis of emotions.
Koolagudi and Krothapalli (2012). Feature: prosodic features, namely (1) duration patterns, (2) average pitch, (3) variation of pitch with respect to the mean [standard deviation (SD)], and (4) average energy. Approach and classifier: performance evaluation of eight emotions (anger, compassion, disgust, fear, happy, neutral, sarcastic and surprise) of the IITKGP-SESC speech database; Euclidean distance measure and subjective evaluation considered for the classification performance analysis of emotions.
Lee et al. (2001). Feature: prosodic features (fundamental frequency (F0), energy, duration, the first and second formant frequencies). Approach and classifier: detection of negative and non-negative emotions using spoken language data obtained from a call center application.
Petrushin (1999). Feature: prosodic features (pitch, the first and second formants, energy and the speaking rate). Approach and classifier: real-time emotion recognizer using an ensemble of neural networks for a call center application; classification accuracy of 77% was achieved for two emotional states, agitation and calm, with eight features chosen by a feature selection method.
Luengo et al. (2005). Feature: a total of 86 prosodic features were used; the best features were chosen by feature selection. Approach and classifier: identification of four different emotions in the Basque language; a GMM classifier was used for performance analysis; 92% recognition accuracy was achieved.
Iliou and Anagnostopoulos (2009). Feature: 35-dimensional prosodic feature vectors including pitch, energy, and duration. Approach and classifier: classification of seven emotions of the Berlin emotional speech corpus; for speaker-independent cases using neural networks, the emotion recognition accuracy achieved was around 51%.
Kao and Lee (2006). Feature: pitch- and power-based features extracted at frame, syllable, and word levels. Approach and classifier: recognizing four emotions in Mandarin; combining features from frame, syllable and word levels yielded 90% emotion recognition performance.
Zhu and Luo (2007). Feature: duration-, energy- and pitch-based features. Approach and classifier: recognizing emotions in the Mandarin language; sequential forward selection was used to select the best features among all the prosodic features; emotion classification studies were conducted on a multi-speaker, multi-lingual database; modular neural networks were used as classifiers.
Lugger and Yang (2007). Feature: eight static prosodic features and voice quality features. Approach and classifier: classification of six emotions (anger, anxiety, boredom, happiness, neutral and sadness) from the Berlin emotional speech corpus; speaker-independent emotion classification is performed using Bayesian classifiers.
Wang et al. (2008). Feature: energy-, pitch- and duration-based features. Approach and classifier: classification of six emotions from the Mandarin language; around 88% average emotion recognition rate is reported using SVM and a genetic algorithm.
Zhang et al. (2008). Feature: prosody- and voice-quality-based features. Approach and classifier: classification of four emotions (anger, joy, neutral, and sadness) from a Chinese natural emotional speech corpus; around 76% emotion recognition performance was achieved using a support vector machine.
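Most of the prosodic studies collected in Table 3 start from pitch and energy contours and reduce them to utterance-level statistics. The sketch below is a minimal Python illustration of that step; the librosa library, the placeholder file name utterance.wav and the particular statistics chosen are assumptions of this sketch, not the exact feature sets of the cited works.

```python
import numpy as np
import librosa

signal, sr = librosa.load("utterance.wav", sr=None)

# F0 contour with the probabilistic YIN tracker; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    signal, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Frame-level RMS energy as a simple intensity correlate.
rms = librosa.feature.rms(y=signal)[0]

# Utterance-level prosodic statistics of the kind listed in Table 3.
voiced_f0 = f0[~np.isnan(f0)]
prosodic_features = {
    "f0_mean": float(voiced_f0.mean()),
    "f0_std": float(voiced_f0.std()),
    "f0_range": float(voiced_f0.max() - voiced_f0.min()),
    "energy_mean": float(rms.mean()),
    "energy_std": float(rms.std()),
    "voiced_ratio": float(np.mean(voiced_flag)),  # crude duration/voicing correlate
}
print(prosodic_features)
```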
Table 4 Literature survey of vocal tract features used for speech emotion recognition

New et al. (2001). Features: LFPC (log-frequency power coefficients), MFCC (mel-frequency cepstral coefficients) and LPCC (linear prediction cepstral coefficients). Purpose: text-independent method of emotion classification. Classifier: the discrete HMM classifier was considered for the classification of the emotions anger, fear, joy, sadness, disgust and surprise in the Mandarin and Burmese languages.
Lee et al. (2004). Features: MFCC (mel-frequency cepstral coefficients). Purpose: phoneme-level modeling for the classification of emotional states from speech. Classifier: hidden Markov models (HMM) based on short-term spectral features are used for the experiment; two sets of HMM classifiers were discussed: a generic set of …
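Several of the spectral-feature systems surveyed above and in Table 4 follow a generative pattern: one model is trained per emotion, and a test utterance is assigned to the class whose model scores its frames highest. The Python sketch below illustrates that pattern with Gaussian mixture models over frame-level features; the data layout, the toy random "MFCC" frames and the mixture settings are assumptions for illustration, not a reproduction of any cited system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(train_frames_by_emotion, n_components=8, seed=0):
    """Fit one GMM per emotion on stacked frame-level feature vectors
    (e.g., the MFCC frames of all training utterances of that emotion)."""
    models = {}
    for emotion, frames in train_frames_by_emotion.items():
        gmm = GaussianMixture(
            n_components=n_components, covariance_type="diag", random_state=seed
        )
        gmm.fit(frames)  # frames: (n_frames, n_features)
        models[emotion] = gmm
    return models

def classify(models, utterance_frames):
    """Pick the emotion whose GMM gives the highest average log-likelihood per frame."""
    scores = {e: m.score(utterance_frames) for e, m in models.items()}
    return max(scores, key=scores.get)

# Toy example with random 13-dimensional frames standing in for real MFCCs.
rng = np.random.default_rng(0)
train = {
    "anger": rng.normal(0.5, 1.0, (500, 13)),
    "neutral": rng.normal(-0.5, 1.0, (500, 13)),
}
models = train_gmms(train)
test_utterance = rng.normal(0.5, 1.0, (120, 13))
print(classify(models, test_utterance))  # prints "anger" for this toy data
```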
Another study proposed two classification methods, the hidden Markov model (HMM) and the support vector machine (SVM), to classify five emotional states: anger, happiness, sadness, surprise and neutral. In the HMM method, 39 candidate instantaneous features were extracted, and the Sequential Forward Selection (SFS) method was used to find the best feature subset. The classification performance of the selected feature subset was then compared with that of the Mel frequency cepstrum coefficients (MFCC). In the SVM-based method, a new vector measuring the difference between Mel-frequency-scale sub-band energies was proposed, and the performance of the K-nearest Neighbors (KNN) classifier using the proposed vector was also investigated. Both gender-dependent and gender-independent experiments were conducted on the Danish Emotional Speech (DES) database. The recognition rates of the HMM classifier were 98.9% for female subjects, 100% for male subjects, and 99.5% for gender-independent cases. When the SVM classifier and the proposed feature vector were employed, correct classification rates of 89.4, 93.6 and 88.9% were obtained for male, female and gender-independent cases respectively. The work presented by Luengo et al. (2005) describes an experiment on a Basque emotional speech database for automatic speech emotion recognition. Three classifiers were built: the first used spectral features with a GMM, the second prosodic features with an SVM, and the third prosodic features with a GMM. A total of 86 prosodic features were evaluated, and the accuracy achieved after feature selection was 92.3%. In Rong et al. (2009), acoustic features were used for a speech emotion recognition system on actual and acted datasets in the Chinese language. Two classification methods, the C4.5 decision tree algorithm and the Random Forest algorithm, were applied to evaluate the quality of features including pitch, intensity, zero crossing rate and spectral features (Mel-scale frequency cepstral coefficients). The best recognition accuracy, 72.25%, was achieved by the Random Forest classifier. Sheikhan et al. (2013) proposed a modular neural support vector machine (SVM) classifier for speech emotion recognition. The study compares the performance of the proposed classifier with other classifiers: GMM, multilayer perceptron neural network and C5.0-based classifiers. The proposed neural-SVM classifier achieved an 8% improvement in recognition accuracy over the other classifiers. A Farsi neutral speech corpus with 6000 utterances from 300 speakers with various accents was collected for the experiment. Each speaker uttered 202 sentences in three emotional states: neutral, happiness and anger. The features used for the experiment were 12 MFCCs, log energy (LE), the first three formant frequencies and the pitch frequency. For feature selection, a combination of ANOVA and Tukey methods was considered. Rao and Koolagudi (2011) explored speech features to identify Hindi dialects and emotions. In this work, they considered five prominent dialects of Hindi for the identification task: Chhattisgarhi (spoken in central India), Bengali (Bengali-accented Hindi spoken in the Eastern region), Marathi (Marathi-accented Hindi spoken in the Western region), General (Hindi spoken in the Northern region) and Telugu (Telugu-accented Hindi spoken in the Southern region). The work shows the performance of both dialect identification and emotion identification from speech. The speech database considered for the dialect identification task consists of spontaneous speech spoken by male and female speakers, while the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC) was used for the emotion recognition studies. The emotions considered in this study were anger, disgust, fear, happy, neutral and sad. Spectral and prosodic features extracted from speech are used for discriminating the dialects and emotions: spectral features are represented by Mel frequency cepstral coefficients (MFCC), and prosodic features are represented by durations of syllables, pitch and energy contours. Auto-associative neural network (AANN) models and Support Vector Machines (SVM) are explored for capturing the dialect-specific and emotion-specific information from the above features. Classification systems were developed separately for dialect classification and emotion classification, and the recognition performance of the dialect identification and emotion recognition systems was found to be 81 and 78% respectively. Table 5 shows a list of feature combinations with the corresponding classifiers.
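Both the HMM experiment above and Ververidis et al. (2004) rely on sequential forward selection, which greedily adds the feature that most improves a classifier's cross-validated score. The Python sketch below shows one common realization with scikit-learn; the synthetic data, the k-NN base classifier and the target of five selected features are assumptions chosen purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for utterance-level feature vectors (e.g., 87 statistics per utterance).
X, y = make_classification(n_samples=300, n_features=87, n_informative=10, random_state=0)

# Greedy forward selection of 5 features, scored by 5-fold cross-validated accuracy of k-NN.
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=5),
    n_features_to_select=5,
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```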
Table 5 Literature survey of combinations of features used for speech emotion recognition

Bozkurt et al. (2009). Features: spectral, prosodic and HMM-based features. Purpose: classification of five emotions of the INTERSPEECH 2009 challenge. Classifier: emotion classification accuracy was found to be around 63%.
Nakatsu et al. (2000). Features: combination of LPCCs and pitch-related features. Purpose: 100 phonetically balanced words were recorded using 100 native speakers (50 male and 50 female). Classifier: an artificial neural network was used for classification and performance analysis of eight emotions for a speech emotion recognition system.
Iliev et al. (2010). Features: glottal symmetry and MFCC features. Purpose: speech emotion classification. Classifier: classification of four emotions using an optimum-path forest classifier.
Jeon et al. (2013). Features: spectral and prosody. Purpose: Chinese ACC database (CA), EMO-DB (simulated parallel), English EMA database (EA), German EMO-DB database (GA), English IEMOCAP database (ES); four emotions (anger, happiness, sadness and neutral); cross-lingual/corpus effect of emotion recognition from speech. Classifier: support vector machines (SVM), implemented in WEKA, with a third-order polynomial kernel and multi-class (4-way classification) discrimination.
Yeh and Chi (2010). Features: spectral and prosodic features; 13 MFCCs, their deltas and double-deltas were computed as 156 features per utterance, plus 30 prosodic features. Purpose: EMO-DB (simulated parallel), seven emotions (anger, happiness, fear, disgust, boredom, sadness and neutral). Classifier: SVM used as the classifier for emotion analysis.
Espinosa et al. (2010). Features: spectral, prosody and voice quality features. Purpose: VAM (natural), German spontaneous emotional speech, three categories (arousal, valence and dominance). Classifier: SVM and Pace Regression classifiers chosen for best recognition accuracy; a bagging ensemble of Pace Regression classifiers produced estimations reaching an average correlation of 0.6885 and a mean error of 0.1433.
Rozgic et al. (2012). Features: spectral, prosodic and lexical. Purpose: USC-IEMOCAP (semi-natural), four emotions (anger, happiness, sadness and neutral). Classifier: support vector machine and Gaussian mixture model were chosen for classification; the fusion of acoustic and lexical features delivers an emotion recognition accuracy of 65.7%.
Atassi and Esposito (2008). Features: prosody and voice quality (emotion selective). Purpose: EMO-DB (simulated parallel), six emotions (anger, happiness, fear, disgust, boredom and sadness). Classifier: a Gaussian mixture model classifier was used and the SFFS feature selection algorithm was chosen for selecting the best features.
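A recurring pattern in Table 5 is early fusion: prosodic, spectral and other descriptors are concatenated into one utterance-level vector and handed to a single discriminative classifier. The Python sketch below shows that pattern with an SVM; the feature dimensions, the random placeholder data and the RBF kernel are illustrative assumptions rather than a reproduction of any listed system.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_utterances = 200

# Placeholder utterance-level descriptors: 26 spectral statistics and 6 prosodic statistics.
spectral = rng.normal(size=(n_utterances, 26))
prosodic = rng.normal(size=(n_utterances, 6))
labels = rng.integers(0, 4, size=n_utterances)  # four emotion classes

# Early fusion: concatenate the feature groups per utterance.
fused = np.hstack([spectral, prosodic])

# Standardize, then classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(fused, labels)
print("training accuracy:", clf.score(fused, labels))
```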
4 Conclusions

In the recent past, extensive effort has been expended by researchers in the area of emotion recognition via speech. In this study a significant number of research papers were surveyed based on three parameters: databases, feature extraction and classifiers. This paper summarizes some of the research work carried out by various workers between 2000 and 2017 on speech emotion recognition systems. The bulk of the current research also focuses on feature extraction and selection in order to select the best features and improve recognition accuracy. It has been noted from the analysis of the surveyed work that, in order to improve system performance and identify the correct emotions, classifier selection is a challenging task. Many classifiers have been used for speech emotion recognition, but it is very difficult to conclude which performs better; there is no clear winner. Recent works focus mostly on deep neural network architectures, hybrid classifiers and fusion methods for emotion recognition. It is evident from this review that enough scope exists for the development of new protocols, along with their proper design, in the area of speech emotion recognition. Higher accuracy, specificity and reproducibility are some of the criteria that should be considered in developing such new protocols.

Acknowledgements The authors are grateful for the valuable input given by Prof. J. Talukdar, Silicon Institute of Technology, Bhubaneswar, Odisha.

References

Abrilian, S., Devillers, L., & Martin, J. C. (2006). Annotation of emotions in real-life video interviews: Variability between coders. In 5th international conference on language resources and evaluation (LREC 06), Genoa, pp. 2004–2009.
Agrawal, S. S. (2011). Emotions in Hindi speech: Analysis, perception and recognition. In International conference on speech database and assessments (Oriental COCOSDA).
Agrawal, S. S., Jain, A., & Arora, S. (2009). Acoustic and perceptual features of intonation patterns in Hindi speech. In International workshop on spoken language prosody (IWSLPR-09), Kolkata, pp. 25–27.
Alonso, J. B., Cabrera, J., Medina, M., & Travieso, C. M. (2015). New approach in quantification of emotional intensity from the speech signal: Emotional temperature. Expert Systems with Applications, 42, 9554–9564.
Amer, M. R., Siddiquie, B., Richey, C., & Divakaran, A. (2014). Emotion detection in speech using deep networks. In IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 3724–3728.
Amir, N., Ron, S., & Laor, N. (2000). Analysis of an emotional speech corpus in Hebrew based on objective criteria. In Proceedings of ISCA workshop speech and emotion, Belfast, Vol. 1, pp. 29–33.
Atal, B. S. (1972). Automatic speaker recognition based on pitch contours. The Journal of the Acoustical Society of America, 52(6), 1687–1697.
Atassi, H., & Esposito, A. (2008). A speaker independent approach to the classification of emotional vocal expressions. In IEEE international conference on tools with artificial intelligence (ICTAI'08), Dayton, Ohio, USA, Vol. 2, pp. 147–152.
Banse, R., & Scherer, K. R. (1996). Acoustic profiles in vocal emotion expression. Journal of Personality and Social Psychology, 70(3), 614–636.
Bapineedu, G., Avinash, B., Gangashetty, S. V., & Yegnanarayana, B. (2009). Analysis of Lombard speech using excitation source information. In INTERSPEECH-09, Brighton, UK, pp. 1091–1094.
Batliner, A., Biersack, S., & Steidl, S. (2006). The prosody of pet robot directed speech: Evidence from children. In Speech prosody, Dresden, pp. 1–4.
Batliner, A., Hacker, C., Steidl, S., Noth, E., D'Arcy, S., Russell, M., & Wong, M. (2004). You stupid tin box: Children interacting with the AIBO robot. A cross-linguistic emotional speech corpus. In Proceedings of language resources and evaluation (LREC 04), Lisbon.
Batliner, A., Huber, R., Niemann, H., Nöth, E., Spilker, J., & Fischer, K. (2000). The recognition of emotion. In Verbmobil: Foundations of speech-to-speech translation, pp. 122–130.
Bitouk, D., Verma, R., & Nenkova, A. (2010). Class-level spectral features for emotion recognition. Speech Communication, 52(7–8), 613–625.
Borden, G., Harris, K., & Raphael, L. (1994). Speech science primer: Physiology, acoustics, and perception of speech (3rd ed.). Baltimore: Williams and Wilkins.
Bozkurt, E., Erzin, E., & Erdem, A. T. (2009). Improving automatic emotion recognition from speech signals. In 10th annual conference of the international speech communication association (INTERSPEECH), Brighton, UK, pp. 324–327.
Brester, C., Semenkin, E., & Sidorov, M. (2016). Multi-objective heuristic feature selection for speech-based multilingual emotion recognition. JAISCR, 6(4), 243–253.
Buck, R. (1999). The biological affects, a typology. Psychological Review, 106(2), 301–336.
Bulut, M., Narayanan, S. S., & Syrdal, A. K. (2002). Expressive speech synthesis using a concatenative synthesizer. In Proceedings of international conference on spoken language processing (ICSLP'02), Vol. 2, pp. 1265–1268.
Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., & Weiss, B. (2005). A database of German emotional speech. In Proceedings of INTERSPEECH 2005, Lissabon, Portugal, pp. 1517–1520.
Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., et al. (2008). IEMOCAP: Interactive emotional dyadic motion capture database. In Language resources and evaluation.
Caballero-Morales, S. O. (2013). Recognition of emotions in Mexican Spanish speech: An approach based on acoustic modelling of emotion-specific vowels. The Scientific World Journal, 2013, 1–13.
Caldognetto, E. M., Cosi, P., Drioli, C., Tisato, G., & Cavicchio, F. (2004). Modifications of phonetic labial targets in emotive speech: Effects of the co-production of speech and emotions. Speech Communication, 44, 173–185.
Chauhan, A., Koolagudi, S. G., Kafley, S., & Rao, K. S. (2010). Emotion recognition using LP residual. In Proceedings of the 2010 IEEE students' technology symposium, IIT Kharagpur.
Chen, L., Mao, X., Xue, Y., & Lung, L. (2012). Speech emotion recognition: Features and classification models. Digital Signal Processing, 22(6), 1154–1160.
Chuang, Z.-J., & Wu, C.-H. (2002). Emotion recognition from textual input using an emotional semantic network. In Proceedings of international conference on spoken language processing (ICSLP'02), Vol. 3, pp. 2033–2036.
Cichosz, J., & Slot, K. (2005). Low-dimensional feature space derivation for emotion recognition. In INTERSPEECH'05, Lisbon, Portugal, pp. 477–480.
Costantini, G., Iaderola, I., Paoloni, A., & Todisco, M. (2014). EMOVO corpus: An Italian emotional speech database. In Proceedings of the 9th international conference on language resources and evaluation (LREC 14), pp. 3501–3504.
Cummings, K. E., & Clements, M. A. (1998). Analysis of the glottal excitation of emotionally styled and stressed speech. The Journal of the Acoustical Society of America, 98, 88–98.
Darwin, C. (1872/1965). The expression of the emotions in man and animals. Chicago: Chicago University Press.
Dellaert, F., Polzin, T., & Waibel, A. (1996a). Recognising emotions in speech. In ICSLP 96.
Dellert, F., Polzin, T., & Waibel, A. (1996b). Recognizing emotion in speech. In 4th international conference on spoken language processing, Philadelphia, PA, USA, pp. 1970–1973.
Douglas-Cowie, E., Campbell, N., Cowie, R., & Roach, P. (2003). Emotional speech: Towards a new generation of databases. Speech Communication, 40, 33–60.
Eckman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6, 169–200.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. Power (Eds.), Handbook of cognition and emotion. Sussex: Wiley.
El Ayadi, M., Kamel, M. S., & Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3), 572–587.
Engberg, I., & Hansen, A. (1996). Documentation of the Danish emotional speech database (DES). Retrieved from http://cpk.auc.dk/tb/speech/Emotions/.
Esmaileyan, Z., & Marvi, H. (2014). A database for automatic Persian speech emotion recognition: Collection, processing and evaluation. IJE Transactions A: Basics, 27(1), 79–90.
Espinosa, H. P., Garcia, J. O., & Pineda, L. V. (2010). Features selection for primitives estimation on emotional speech. In ICASSP, Florence, Italy, pp. 5138–5141.
Fernandez, R., & Picard, R. W. (2003). Modeling driver's speech under stress. Speech Communication, 40, 145–159.
Shah, A. F., Vimal Krishnan, V. R., Sukumar, A. R., Jayakumar, A., & Anto, P. B. (2009). Speaker independent automatic emotion recognition in speech: A comparison of MFCCs and discrete wavelet transforms. In International conference on advances in recent technologies in communication and computing (ARTCom '09).
Fontaine, J. R., Scherer, K. R., Roesch, E. B., & Ellsworth, P. C. (2007). The world of emotion is not two dimensional. Psychological Science, 13, 1050–1057.
France, D. J., Shiavi, R. G., Silverman, S., Silverman, M., & Wilkes, M. (2000). Acoustical properties of speech as indicators of depression and suicidal risk. IEEE Transactions on Biomedical Engineering, 7, 829–837.
Gangamohan, P., Kadiri, S. R., Gangashetty, S. V., & Yegnanarayana, B. (2014). Excitation source features for discrimination of anger and happy emotions. In INTERSPEECH, Singapore, pp. 1253–1257.
Gangamohan, P., Kadiri, S. R., & Yegnanarayana, B. (2013). Analysis of emotional speech at sub-segmental level. In Interspeech, Lyon, France, pp. 1916–1920.
Gomez, P., & Danuser, B. (2004). Relationships between musical structure and physiological measures of emotion. Emotion, 7(2), 377–387.
Grimm, M., Kroschel, K., & Narayanan, S. (2008). The Vera am Mittag German audio-visual emotional speech database. In International conference on multimedia and expo, pp. 865–868.
Grimm, M., Mower, E., Kroschel, K., & Narayanan, S. (2006). Combining categorical and primitives-based emotion recognition. In 14th European signal processing conference (EUSIPCO 2006), Florence, Italy.
Haq, S., & Jackson, P. J. B. (2009). Speaker-dependent audio-visual emotion recognition. In Proceedings of international conference on auditory-visual speech processing, pp. 53–58.
He, L., Lech, M., & Allen, N. (2010). On the importance of glottal flow spectral energy for the recognition of emotions in speech. In INTERSPEECH 2010, Makuhari, Chiba, Japan, pp. 26–30.
Hozjan, V., & Kacic, Z. (2003). Improved emotion recognition with large set of statistical features. In Eurospeech, Geneva.
Hozjan, V., Kacic, Z., Moreno, A., Bonafonte, A., & Nogueiras, A. (2002). Interface databases: Design and collection of a multilingual emotional speech database. In Proceedings of the 3rd international conference on language resources and evaluation (LREC'02), Las Palmas de Gran Canaria, Spain, pp. 2019–2023.
Iliev, A. I., Scordilis, M. S., Papa, J. P., & Falco, A. X. (2010). Spoken emotion recognition through optimum-path forest classification using glottal features. Computer Speech and Language, 24(3), 445–460.
Iliou, T., & Anagnostopoulos, C.-N. (2009). Statistical evaluation of speech features for emotion recognition. In Fourth international conference on digital telecommunications, Colmar, France, pp. 121–126.
Iriondo, I., Guaus, R., & Rodriguez, A. (2000). Validation of an acoustical modeling of emotional expression in Spanish using speech synthesis techniques. In Proceedings of ISCA workshop speech and emotion, Belfast, Vol. 1, pp. 161–166.
Izard, C. E. (1992). Basic emotions, relations among emotions, and emotion-cognition relations. Psychological Review, 99, 561–565.
Jeon, J. H., Le, D., Xia, R., & Liu, Y. (2013). A preliminary study of cross-lingual emotion recognition from speech: Automatic classification versus human perception. In Interspeech, Lyon, France, pp. 2837–2840.
Jiang, D.-N., & Cai, L. H. (2004). Classifying emotion in Chinese speech by decomposing prosodic features. In International conference on speech and language processing (ICSLP), Jeju, Korea.
Jiang, D.-N., Zhang, W., Shen, L.-Q., & Cai, L.-H. (2005). Prosody analysis and modelling for emotional speech synthesis. In IEEE proceedings of ICASSP 2005, pp. 281–284.
Jin, X., & Wang, Z. (2005). An emotion space model for recognition of emotions in spoken Chinese (pp. 397–402). Berlin: Springer.
Jovičić, S. T., Kašić, Z., Đorđević, M., & Rajković, M. (2004). Serbian emotional speech database: Design, processing and evaluation. In SPECOM 9th conference speech and computer, St. Petersburg, Russia.
Kadiri, S. R., Gangamohan, P., Gangashetty, S. V., & Yegnanarayana, B. (2015). Analysis of excitation source features of speech for emotion recognition. In INTERSPEECH 2015, Dresden, pp. 1324–1328.
Kandali, A. B., Routray, A., & Basu, T. K. (2008a). Emotion recognition from Assamese speeches using MFCC features and GMM classifier. In Proceedings of IEEE region 10 conference (TENCON).
Kandali, A. B., Routray, A., & Basu, T. K. (2008b). Emotion recognition from speeches of some native languages of Assam independent of text and speaker. In National seminar on devices, circuits, and communications, B.I.T. Mesra, Ranchi, pp. 6–7.
Kao, Y.-H., & Lee, L.-S. (2006). Feature analysis for emotion recognition from Mandarin speech considering the special characteristics of Chinese language. In INTERSPEECH-ICSLP, Pittsburgh, Pennsylvania, pp. 1814–1817.
Kim, J. B., Park, J. S., & Oh, Y. H. (2011). On-line speaker adaptation based emotion recognition using incremental emotional information. In ICASSP, Prague, Czech Republic, pp. 4948–4951.
Koolagudi, S. G., Devliyal, S., Chawla, B., Barthwal, A., & Rao, K. S. (2012). Recognition of emotions from speech using excitation source features. Procedia Engineering, 38, 3409–3417.
Koolagudi, S. G., & Krothapalli, S. R. (2012). Emotion recognition from speech using sub-syllabic and pitch synchronous spectral features. International Journal of Speech Technology, 15(4), 495–511.
Koolagudi, S. G., Maity, S., Kumar, V. A., Chakrabati, S., & Rao, K. S. (2009). IITKGP-SESC: Speech database for emotion analysis. In Communications in computer and information science, LNCS (pp. 485–492). Berlin: Springer.
Koolagudi, S. G., & Rao, K. S. (2012a). Emotion recognition from speech: A review. International Journal of Speech Technology, 15, 99–117.
Koolagudi, S. G., & Rao, K. S. (2012b). Emotion recognition from speech using source, system, and prosodic features. International Journal of Speech Technology, 15(2), 265–289.
Koolagudi, S. G., Reddy, R., & Rao, K. S. (2010). Emotion recognition from speech signal using epoch parameters. In International conference on signal processing and communications (SPCOM).
Krothapalli, S. R., & Koolagudi, S. G. (2013). Characterization and recognition of emotions from speech using excitation source information. International Journal of Speech Technology, 16(2), 181–201.
Kwon, O.-W., Chan, K., Hao, J., & Lee, T.-W. (2003). Emotion recognition by speech signals. In EUROSPEECH, pp. 125–128.
Lanjewar, R. B., Mauhurkar, S., & Patel, N. (2015). Implementation and comparison of speech emotion recognition system using Gaussian mixture model and K-nearest neighbor techniques. Procedia Computer Science, 49, 50–57.
Lazarus, R. S. (1991). Emotion & adaptation. New York: Oxford University Press.
Lee, C. M., & Narayanan, S. (2003). Emotion recognition using a data-driven fuzzy inference system. In European conference on speech and language processing (EUROSPEECH), Geneva, Switzerland, pp. 157–160.
Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing, 13(2), 293–303.
Lee, C. M., Narayanan, S., & Pieraccini, R. (2001). Recognition of negative emotion in the human speech signals. In Workshop on automatic speech recognition and understanding.
Lee, C. M., Yildirim, S., Bulut, M., Kazemzadeh, A., Busso, C., Deng, Z., et al. (2004). Emotion recognition based on phoneme classes. In 8th international conference on spoken language processing, INTERSPEECH 2004, Korea.
Lee, C.-C., Mower, E., Busso, C., Lee, S., & Narayanan, S. (2011). Emotion recognition using a hierarchical binary decision tree approach. Speech Communication, 53, 1162–1171.
Lida, A., Campbell, N., Higuchi, F., & Yasumura, M. (2003). A corpus based synthesis system with emotion. Speech Communication, 40, 161–187.
Lin, Y.-L., & Wei, G. (2005). Speech emotion recognition based on HMM and SVM. In Fourth international conference on machine learning and cybernetics, Guangzhou, pp. 4898–4901.
Lotfian, R., & Busso, C. (2015). Emotion recognition using synthetic speech as neutral reference. In IEEE international conference on ICASSP, pp. 4759–4763.
Luengo, I., Navas, E., Hernáez, I., & Sánchez, J. (2005). Automatic emotion recognition using prosodic parameters. In INTERSPEECH, Lisbon, Portugal, pp. 493–496.
Lugger, M., & Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition. In ICASSP, Honolulu, Hawaii, pp. IV17–IV20.
Makarova, V., & Petrushin, V. A. (2002). RUSLANA: A database of Russian emotional utterances. In 7th international conference on spoken language processing (ICSLP 02), pp. 2041–2044.
Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561–580.
McGilloway, S., Cowie, R., Douglas-Cowie, E., Gielen, S., Westerdijk, M., & Stroeve, S. (2000). Approaching automatic recognition of emotion from voice: A rough benchmark. In Proceedings of ISCA workshop speech and emotion, pp. 207–212.
McKeown, G., Valstar, M., Cowie, R., Pantic, M., & Schroder, M. (2007). The SEMAINE database: Annotated multimodal records of emotionally coloured conversations between a person and a limited agent. Journal of LATEX Class Files, 6(1), 1–14.
Mencattini, A., Martinelli, E., Costantini, G., Todisco, M., Basile, B., Bozzali, M., & Di Natale, C. (2014). Speech emotion recognition using amplitude modulation parameters and a combined feature selection procedure. Knowledge-Based Systems, 63, 68–81.
Mirsamadi, S., Barsoum, E., & Zhang, C. (2017). Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of IEEE conference on ICASSP, pp. 2227–2231.
Mohanty, S., & Swain, B. K. (2010). Emotion recognition using fuzzy K-means from Oriya speech. In International conference [ACCTA-2010], special issue of IJCCT, Vol. 1, Issue 2–4.
Montero, J. M., Gutiérrez-Arriola, J., Colás, J., Enríquez, E., & Pardo, J. M. (1999). Analysis and modeling of emotional speech in Spanish. In Proceedings of international conference on phonetic sciences, pp. 957–960.
Morrison, D., Wang, R., & De Silva, L. C. (2007). Ensemble methods for spoken emotion recognition in call-centres. Speech Communication, 49, 98–112.
Nakatsu, R., Nicholson, J., & Tosa, N. (2000). Emotion recognition and its application to computer agents with spontaneous interactive capabilities. Knowledge-Based Systems, 13, 497–504.
Nandi, D., Pati, D., & Rao, K. S. (2017). Parametric representation of excitation source information for language identification. Computer Speech and Language, 41, 88–115.
Neiberg, D., Elenius, K., & Laskowski, K. (2006). Emotion recognition in spontaneous speech using GMMs. In INTERSPEECH 2006, ICSLP, Pittsburgh, Pennsylvania, pp. 809–812.
New, T. L., Wei, F. S., & De Silva, L. C. (2001). Speech based emotion classification. In Proceedings of the IEEE region 10 international conference on electrical and electronic technology (TENCON), Phuket Island, Singapore, Vol. 1, pp. 297–301.
New, T. L., Wei, F. S., & De Silva, L. C. (2003). Speech emotion recognition using hidden Markov models. Speech Communication, 41, 603–623.
Nicholson, J., Takahashi, K., & Nakatsu, R. (2006). Emotion recognition in speech using neural networks. Neural Computing & Applications, 11, 290–296.
Nogueiras, A., Marino, J. B., Moreno, A., & Bonafonte, A. (2001). Speech emotion recognition using hidden Markov models. In Proceedings of European conference on speech communication and technology (Eurospeech'01), Denmark.
Nordstrand, L., Svanfeld, G., Granstrom, B., & House, D. (2004). Measurements of articulatory variation in expressive speech for a set of Swedish vowels. Speech Communication, 44, 187–196.
Ooi, C. S., Seng, K. P., Ang, L.-M., & Chew, L. W. (2014). A new approach of audio emotion recognition. Expert Systems with Applications, 41, 5858–5869.
Pao, T.-L., Chen, Y.-T., Yeh, J.-H., & Liao, W.-Y. (2005). Combining acoustic features for improved emotion recognition in Mandarin speech. In International conference on affective computing and intelligent interaction, pp. 279–285.
Park, C.-H., & Sim, K.-B. (2003). Emotion recognition and acoustic analysis from speech signal. In Proceedings of the international joint conference on neural networks, pp. 2594–2598.
Pereira, C. (2000). Dimensions of emotional meaning in speech. In Proceedings of ISCA workshop speech and emotion, Belfast, Vol. 1, pp. 25–28.
Petrushin, V. A. (1999). Emotion in speech: Recognition and application to call centers. In Proceedings of the 1999 conference on artificial neural networks in engineering (ANNIE 99).
Picard, R. W. (1997). Affective computing. Cambridge: The MIT Press.
Picard, R. W., Vyzas, E., & Healey, J. (2001). Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 1175–1191.
Power, M., & Dalgleish, T. (2000). Cognition and emotion from order to disorder. New York: Psychology Press.
Prasanna, S. R. M., & Govind, D. (2010). Analysis of excitation source information in emotional speech. In INTERSPEECH 2010, Makuhari, Chiba, Japan, pp. 781–784.
Prasanna, S. R. M., Gupta, C. S., & Yegnanarayana, B. (2006). Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48, 1243–1261.
Pravena, D., & Govind, D. (2017). Significance of incorporating excitation source parameters for improved emotion recognition from speech and electroglottographic signals. International Journal of Speech Technology, 20(4), 787–797.
Pravena, D., & Govind, D. (2017). Development of simulated emotion speech database for excitation source analysis. International Journal of Speech Technology, 20, 327–338.
Quiros-Ramirez, M. A., Polikovsky, S., Kameda, Y., & Onisawa, T. (2014). A spontaneous cross-cultural emotion database: Latin-America vs. Japan. In International conference on Kansei engineering and emotion research, pp. 1127–1134.
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice-Hall.
Rahurkar, M. A., & Hansen, J. H. (2002). Frequency band analysis for stress detection using a Teager energy operator based feature. In Proceedings of international conference on spoken language processing (ICSLP'02), Vol. 3, pp. 2021–2024.
Ramamohan, S., & Dandapat, S. (2006). Sinusoidal model-based analysis and classification of stressed speech. IEEE Transactions on Audio, Speech and Language Processing, 14(3).
Rao, K. S., & Koolagudi, S. G. (2011). Identification of Hindi dialects and emotions using spectral and prosodic features of speech. Systemics, Cybernetics, and Informatics, 9(4), 24–33.
Rao, K. S., Koolagudi, S. G., & Vempada, R. R. (2013). Emotion recognition from speech using global and local prosodic features. International Journal of Speech Technology, 16(2), 143–160.
Rao, K. S., Kumar, T. P., Anusha, K., Leela, B., Bhavana, I., & Gowtham, S. V. S. K. (2012). Emotion recognition from speech. International Journal of Computer Science and Information Technologies, 3, 3603–3607.
Rao, K. S., Prasanna, S. R. M., & Yegnanarayana, B. (2007). Determination of instants of significant excitation in speech using Hilbert envelope and group delay function. IEEE Signal Processing Letters, 14, 762–765.
Rao, K. S., & Yegnanarayana, B. (2006). Prosody modification using instants of significant excitation. IEEE Transactions on Audio, Speech, and Language Processing, pp. 972–980.
Rong, J., Li, G., & Chen, Y. P. P. (2009). Acoustic feature selection for automatic emotion recognition from speech. Information Processing and Management, 45, 315–328.
Rozgic, V., Ananthakrishnan, S., Saleem, S., Kumar, R., Vembu, A. N., & Prasad, R. (2012). Emotion recognition using acoustic and lexical features. In INTERSPEECH, Portland, USA.
Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161–1178.
Russell, J. A., & Barrett, L. F. (1999). Core affect, prototypical emotional episodes, and other things called emotion: Dissecting the elephant. Journal of Personality and Social Psychology, 76, 805–819.
Russell, J. A., & Mehrabian, A. (1977). Evidence for a three-factor theory of emotions. Journal of Research in Personality, 11, 273–294.
Salovey, P., Kokkonen, M., Lopes, P., & Mayer, J. (2004). Emotional intelligence: What do we know? In A. S. R. Manstead, N. H. Frijda, & A. H. Fischer (Eds.), Feelings and emotions: The Amsterdam symposium (pp. 321–340). Cambridge: Cambridge University Press.
Schachter, S., & Singer, J. (1962). Cognitive, social, and physiological determinants of emotional state. Psychological Review, 69, 379–399.
Scherer, K. R., Grandjean, D., Johnstone, T., Klasmeyer, G., & Banziger, T. (2002). Acoustic correlates of task load and stress. In Proceedings of international conference on spoken language processing (ICSLP'02), Colorado, Vol. 3, pp. 2017–2020.
Schroder, M. (2000). Experimental study of affect bursts. In Proceedings of ISCA workshop speech and emotion, Vol. 1, pp. 132–137.
Schroder, M., & Grice, M. (2003). Expressing vocal effort in concatenative synthesis. In Proceedings of international conference on phonetic sciences (ICPhS'03), Barcelona, pp. 2589–2592.
Schubert, E. (1999). Measurement and time series analysis of emotion in music. Ph.D. dissertation, School of Music Education, University of New South Wales, Sydney, Australia.
Schuller, B., Rigoll, G., & Lang, M. (2003). Hidden Markov model based speech emotion recognition. In Proceedings of the international conference on multimedia and expo (ICME).
Schuller, B., Rigoll, G., & Lang, M. (2004). Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP'04), Vol. 1, pp. 557–560.
Sheikhan, M., Bejani, M., & Gharavian, D. (2013). Modular neural-SVM scheme for speech emotion recognition using ANOVA feature selection method. Neural Computing and Applications, 23(1), 215–227.
Slaney, M., & McRoberts, G. (2003). Babyears: A recognition system for affective vocalizations. Speech Communication, 39, 367–384.
Song, P., Ou, S., Zheng, W., Jin, Y., & Zhao, L. (2016). Speech emotion recognition using transfer non-negative matrix factorization. In Proceedings of IEEE international conference ICASSP, pp. 5180–5184.
Sun, R., & Moore, E. (2011). Investigating glottal parameters and Teager energy operators in emotion recognition. In Affective computing and intelligent interaction, pp. 425–434.
Takahashi, K. (2004). Remarks on SVM-based emotion recognition from multi-modal bio-potential signals. In 13th IEEE international workshop on robot and human interactive communication, Roman.
Tao, J., & Kang, Y. (2005). Features importance analysis for emotional speech classification. In Affective computing and intelligent interaction, pp. 449–457.
Tato, R., Santos, R., Kompe, R., & Pardo, J. M. (2002). Emotional space improves emotion recognition. In Proceedings of international conference on spoken language processing (ICSLP'02), Colorado, Vol. 3, pp. 2029–2032.
Tomkins, S. (1962). Affect imagery and consciousness: The positive affects (Vol. 1). New York: Springer.
University of Pennsylvania Linguistic Data Consortium. (2002). Emotional prosody speech and transcripts. Retrieved from http://www.Idc.upenn.edu/Catalog/CatalogEntry.jsp?CatalogId=LDC2002S28.
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features and methods. Speech Communication, 48, 1162–1181.
Ververidis, D., Kotropoulos, C., & Pitas, I. (2004). Automatic emotional speech classification. In Proceedings of international conference on acoustics, speech and signal processing (ICASSP'04), Montreal, Vol. 1, pp. 593–596.
Vidrascu, L., & Devillers, L. (2005). Detection of real-life emotions in call centers. In INTERSPEECH, Lisbon, Portugal, pp. 1841–1844.
Vogt, T., & André, E. (2006). Improving automatic emotion recognition from speech via gender differentiation. In Proceedings of language resources and evaluation conference (LREC 2006), Genoa.
Wakita, H. (1976). Residual energy of linear prediction to vowel and speaker recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 24, 270–271.
Wang, K., An, N., Li, B. N., Zhang, Y., & Li, L. (2015). Speech emotion recognition using Fourier parameters. IEEE Transactions on Affective Computing, 6(1), 69–75.
Wang, Y., Du, S., & Zhan, Y. (2008). Adaptive and optimal classification of speech emotion recognition. In Fourth international conference on natural computation, pp. 407–411.
Wang, Y., & Guan, L. (2004). An investigation of speech based human emotion recognition. In IEEE 6th workshop on multimedia signal processing.
Werner, S., & Keller, E. (1994). Prosodic aspects of speech. In E. Keller (Ed.), Fundamentals of speech synthesis and speech recognition: Basic concepts, state of the art, the future challenges (pp. 23–40). Chichester: Wiley.
Wu, S., Falk, T. H., & Chan, W.-Y. (2011). Automatic speech emotion recognition using modulation spectral features. Speech Communication, 53(5), 768–785.
Wu, T., Yang, Y., Wu, Z., & Li, D. (2006). MASC: A speech corpus in Mandarin for emotion analysis and affective speaker recognition. In Speaker and language recognition workshop.
Wu, W., Zheng, T. F., Xu, M.-X., & Bao, H.-J. (2006). Study on speaker verification on emotional speech. In INTERSPEECH'06, Pittsburgh, Pennsylvania, pp. 2102–2105.
Wundt, W. (2013). An introduction to psychology. Read Books Ltd.
Yamagishi, J., Onishi, K., Maskko, T., & Kobayashi, T. (2003). Emotion recognition using a data-driven fuzzy inference system. In Eurospeech, Geneva.
Yegnanarayana, B., & Gangashetty, S. (2011). Epoch-based analysis of speech signals. Sādhanā, 36(5), 651–697.
Yegnanarayana, B., Swamy, R. K., & Murty, K. S. R. (2009). Determining mixing parameters from multispeaker data using speech-specific information. IEEE Transactions on Audio, Speech, and Language Processing, 17(6), 1196–1207.
Yeh, L., & Chi, T. (2010). Spectro-temporal modulations for robust speech emotion recognition. In INTERSPEECH, Chiba, Japan, pp. 789–792.
Yildirim, S., Bulut, M., Lee, C. M., Kazemzadeh, A., Busso, C., Deng, Z., Lee, S., & Narayanan, S. (2004). An acoustic study of emotions expressed in speech. In Proceedings of international conference on spoken language processing (ICSLP'04), Korea, Vol. 1, pp. 2193–2196.
You, M., Chen, C., Bu, J., Liu, J., & Tao, J. (1997). Getting started with SUSAS: A speech under simulated and actual stress database. Eurospeech, 4, 1743–1746.
Yu, F., Chang, E., Xu, Y.-Q., & Shum, H.-Y. (2001). Emotion detection from speech to enrich multimedia content. In Proceedings of IEEE Pacific-Rim conference on multimedia, Beijing, Vol. 1, pp. 550–557.
Yuan, J., Shen, L., & Chen, F. (2002). The acoustic realization of anger, fear, joy and sadness in Chinese. In Proceedings of international conference on spoken language processing (ICSLP'02), Vol. 3, pp. 2025–2028.
Zhang, S. (2008). Emotion recognition in Chinese natural speech by combining prosody and voice quality features. In Sun et al. (Eds.), Advances in neural networks, Lecture notes in computer science (pp. 457–464). Berlin: Springer.
Zhang, T., Hasegawa-Johnson, M., & Levinson, S. E. (2004). Children's emotion recognition in an intelligent tutoring scenario. In Proceedings of the eighth European conference on speech communication and technology (INTERSPEECH).
Zhu, A., & Luo, Q. (2007). Study on speech emotion recognition system in E-learning. In J. Jacko (Ed.), Human computer interaction, Part III, HCII (pp. 544–552). Berlin: Springer.