An Approach on Emotion Recognition by Using Speech Signals

Simina Emerich, Eugen Lupu, Anca Apatean


Technical University of Cluj-Napoca, Communication Department,
Cluj-Napoca, Romania
[email protected]
Abstract
Automatic recognition of emotion has gained attention due to its widespread applications in various domains. In our approach, we computed a multitude of features and then reduced the feature set to the ones most relevant for the given application. Preliminary experiments led to the selection of a feature vector composed of wavelet energies and MFCCs. A robust speech classification of five emotional states (anger, happiness, sadness, disgust and neutral) was performed.
1. Introduction
The speech signal transmits several levels of information to the listener. At the first level, speech conveys a message, via words, which is independent of the speaker. At other levels, speech conveys information about the speaker, such as the language being spoken, gender, socioeconomic background, stress, accent, emotional state and health.
While speech recognition aims at recognizing the words spoken from the speech signal, the goal of speaker recognition systems is to extract and recognize information in the speech signal that depends on the speaker's identity. Automatic recognition of emotion is gaining attention due to its widespread applications in various domains: detecting frustration, disappointment, tiredness, surprise, amusement etc.
Emotion can affect speech in many ways: consciously, unconsciously and through the autonomic nervous system. The most fundamental problem is to answer the question: which features signify emotion? A qualitative correlation between emotion and several speech features is presented in Table 1 [Murray and Arnott, 1993].
                Anger                Happiness            Sadness              Fear                 Disgust
Speech rate     Slightly faster      Faster or slower     Slightly slower      Much faster          Very much slower
Pitch average   Very much higher     Much higher          Slightly lower       Very much higher     Very much lower
Pitch range     Much wider           Much wider           Slightly narrower    Much wider           Slightly wider
Intensity       Higher               Higher               Lower                Normal               Lower
Voice quality   Breathy, chest tone  Breathy, blaring     Resonant             Irregular voicing    Grumbled chest tone
Pitch change    Abrupt, on           Smooth, upward       Downward             Normal               Wide downward
                stressed syllables   inflections          inflections                               terminal inflections
Articulation    Tense                Normal               Slurring             Precise              Normal

Table 1. Summary of the most general correlates of emotion in speech
A proper choice of feature vectors is one of the most important tasks. There are many approaches to automatic recognition of emotion in speech that use different feature vectors. Several important feature vectors have been chosen as the basis for efficient emotion recognition systems, such as the fundamental frequency (F0) together with a time-energy distribution vector [9], parameters extracted by LP analysis (22 speech features were selected to represent each emotion) [10], and feature streams containing MFCC, LPCC and LPC coefficients.
Emotion recognition results are far more difficult to quantify than those provided by speech recognition systems. The results provided by emotion recognition systems are comparable to human emotion recognition, which reaches an accuracy of about 70-75%.
An emotional speech data collection is undoubtedly a
useful tool for research purposes in the domain of emotional
speech analysis and recognition. An overview of 64
emotional speech data collections is presented in [3].
In this paper, an acted emotional speech database was
employed. Databases for emotional speech synthesis are
usually based on acted speech, where some professional
speakers read a set of texts simulating the desired emotions.
The advantage of this method is the control of the verbal and
phonetic content of speech as all the emotional states can be
produced using the same phrases. This allows direct
comparisons of the phonetics, the prosody and the voice
quality for the different emotions. The disadvantage is the
lack of authenticity of the expressed emotion.
One of the challenges is the identification of oral
indicators (prosodic, spectral and vocal quality) attributable to
the emotional behavior. Many features for emotion
recognition from speech have been explored, but there is still
no agreement on a fixed set of features. The objective is to establish a relationship between the speaker's emotional state and quantifiable parameters of speech. This paper presents a data-mining experiment in which a set of acoustic features is computed from the wavelet energy and the MFCC time series of the data.
Speech can be classified as an information-carrying, non-stationary acoustical signal. This makes speech a good candidate for wavelet analysis, because its characteristics change rapidly over time depending on the environment and on the speaker.
The cepstral coefficients are also a set of features reported to be robust in several pattern recognition tasks concerning the human voice. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filters: linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed on the mel frequency scale.
The remainder of the paper is organized as follows:
Section 2 is an overview of proposed methods and tools.
Section 3 explains the steps of the feature extraction from the
speech signals and presents the database used to perform the
experiments. In the end, the evaluation results are
summarized in Section 4 and final conclusions are presented
in Section 5.
2. Methods and Tools
2.1. Weka System
All the experiments have been carried out using the Weka software (Waikato Environment for Knowledge Analysis), developed at the University of Waikato in New Zealand. It is a very powerful system, freely available on the Internet [11].
The system is used to apply a learning method to a data set and to analyse its output. The learning methods are called classifiers. In Weka, the performance of all classifiers is measured by a common evaluation module. Before applying any classification algorithm, the data must be converted to ARFF form, with type information about each attribute.
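As an illustration, the snippet below writes feature vectors to an ARFF file. This is only a minimal sketch: the attribute names, the placeholder data and the file name are our own examples, not the exact layout used in the experiments; the five class labels follow the paper.

    # Minimal sketch: exporting feature vectors to Weka's ARFF format.
    # Attribute names and placeholder data are hypothetical.
    dataset = [([0.1] * 25, "anger"), ([0.2] * 25, "sadness")]   # (features, label) pairs

    lines = ["@relation emotion_speech"]
    lines += ["@attribute mfcc_mean_%d numeric" % i for i in range(1, 21)]       # e.g. 20 MFCC means
    lines += ["@attribute wavelet_energy_%d numeric" % i for i in range(1, 6)]   # e.g. EA + 4 detail energies
    lines += ["@attribute class {happiness,disgust,neutral,anger,sadness}", "@data"]
    for features, label in dataset:
        lines.append(",".join("%.6f" % v for v in features) + "," + label)

    with open("emotions.arff", "w") as f:
        f.write("\n".join(lines) + "\n")

The @attribute lines provide exactly the type information Weka requires before any classifier can be applied.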
A large number of classifiers are available, and several were chosen to classify the data. The classifiers were evaluated by cross-validation, using 10 folds.
The summary of the results from the training data ends
with a confusion matrix, showing how many instances of each
class have been assigned to each class. If all instances are
classified correctly, only the diagonal elements of the matrix
are non-zero. From the confusion matrix one can observe
whether the instances of a class have been assigned to another
class.
2.2. The Mel-Frequency Cepstral Coefficients
The speech signal consists of tones with different
frequencies. For each tone with an actual frequency, f,
measured in Hz, a subjective pitch is measured on the Mel
scale. The mel-frequency scale has linear frequency spacing below 1000 Hz and logarithmic spacing above 1000 Hz. Therefore, the following formula may be used to compute the mel value for a given frequency f in Hz [1]:

    \mathrm{mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right)    (1)
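As a quick numerical check of equation (1), a frequency of 1000 Hz maps to roughly 1000 mel; the following minimal sketch implements the conversion:

    import math

    def hz_to_mel(f_hz):
        """Convert a frequency in Hz to the mel scale, following equation (1)."""
        return 2595.0 * math.log10(1.0 + f_hz / 700.0)

    print(hz_to_mel(1000.0))   # ~1000 mel: the scale is roughly linear up to this point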
One approach to simulating the subjective spectrum is to use a filter bank with one filter for each desired mel-frequency component. The filter bank has a triangular bandpass frequency response, and the spacing as well as the bandwidth are determined by a constant mel-frequency interval. In the final step, the log mel spectrum has to be converted back to time. The result is called the mel-frequency cepstral coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good description of the local spectral properties of the signal for the given analysis frame. Because the mel spectrum coefficients are real numbers (and so are their logarithms), they may be converted to the time domain using the Discrete Cosine Transform
(DCT). The MFCCs may be calculated using this equation
[1]:
    \tilde{C}_n = \sum_{k=1}^{K} \left( \log \tilde{S}_k \right) \cos\left[ n \left( k - \frac{1}{2} \right) \frac{\pi}{K} \right], \quad n = 1, 2, \ldots, K    (2)
The first component is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information. This set of coefficients is called an acoustic vector. These acoustic vectors can be used to represent and recognize the voice characteristics of the speaker [4]. Therefore, each input utterance is transformed into a sequence of acoustic vectors. The next section describes how these acoustic vectors can be used to represent and recognize the emotional state of a speaker.
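The DCT of equation (2) can be written directly in a few lines. The sketch below assumes that the log mel filter-bank energies for one frame are already available (the filter-bank front end itself is omitted) and simply applies the cosine sum:

    import numpy as np

    def mfcc_from_log_mel(log_mel_energies):
        """Apply equation (2): cosine transform of the K log mel filter-bank energies.

        Returns C_n for n = 1..K; in practice the first coefficient is discarded
        and only the lower-order coefficients are kept as the acoustic vector.
        """
        K = len(log_mel_energies)
        n = np.arange(1, K + 1).reshape(-1, 1)       # output index n
        k = np.arange(1, K + 1).reshape(1, -1)       # filter index k
        basis = np.cos(n * (k - 0.5) * np.pi / K)    # cos[n (k - 1/2) pi / K]
        return basis @ np.asarray(log_mel_energies)

    # Example: 20 synthetic log filter-bank energies for a single frame
    coeffs = mfcc_from_log_mel(np.log(np.random.rand(20) + 1e-6))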
2.3. Wavelet Analysis
Traditional techniques for speech signal analysis use
Fourier methods for signal processing. Fourier analysis,
however, only details the spectral content of a signal in
the frequency domain. The time-domain information for a particular event is lost in the Fourier transform because time instances are not preserved. This limitation can be ignored if the signal is stationary.
However, for non-stationary signals, like speech, time and
frequency domain information is necessary to avoid any loss
of significant information in the signal. Wavelet analysis
provides an alternative method to Fourier analysis for voice
processing. Wavelets apply the concept of multiresolution
analysis (i.e., time and frequency scale representations) to
produce precise decompositions of signals for accurate signal
representation. They can reveal detailed characteristics, like
small discontinuities, self-similarities, and even higher order
derivatives that may be hidden by the conventional Fourier
analysis.
The wavelet transform (WT) is based on a time-frequency
signal analysis. It represents a windowing technique with
variable-sized regions. Wavelet transform allows the use of
long time intervals where more precise low-frequency
information is wanted, and shorter regions where high
frequency information is wanted [2]. It is well known that speech signals contain many transient components and non-stationary properties. The multi-resolution analysis (MRA) capability of the WT suits this well: better time resolution is needed in the high-frequency range to detect the rapidly changing transient components of the signal, while better frequency resolution is needed in the low-frequency range to track the slowly time-varying formants more precisely.
3. Experiments Background
3.1. Database
The data used for our experiments comes from the Berlin Database of Emotional Speech. It contains about 500 utterances spoken by actors in a happy, angry, sad and disgusted way, as well as in a neutral version. Utterances are available from 10 different actors (males and females), with ten different utterances per actor [12].
3.2. Feature extraction
Prosodic and spectral features that are commonly used in the literature for speech emotion recognition are based on pitch, energy, MFCCs (Mel Frequency Cepstral Coefficients), pauses, duration and speech rate, formants and voice quality. Feature extraction is a procedure that computes new, derived data from the original signal. The main purpose is to generate features that retain the most significant information, or that encode the relevant information from the original data efficiently [3].
In the approach of the present study, a multitude of
features was computed and analyzed and then the most
relevant features for the given application were selected:
MFCCs and the energy of wavelet coefficients.
By applying the procedure described above, a set of mel-frequency cepstral coefficients was computed for each 30 ms speech frame. The number of mel cepstrum coefficients, K, was chosen between 12 and 20. A series containing the mean of the MFCCs was then computed for every utterance.
Figure 1 shows the variation of the second MFCC for a male speaker uttering "Der Lappen liegt auf dem Eisschrank" ("The tablecloth is lying on the fridge") in the emotional states of happiness and anger.
Fig. 1. 2nd MFCC trace for happiness and anger utterances
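A possible implementation of the per-utterance mean-MFCC feature described above is sketched below, using the librosa library as a stand-in for the authors' front end. The 30 ms frame length and the use of up to 20 coefficients follow the text; the hop length and the file name are our own assumptions.

    import librosa

    def utterance_mfcc_mean(wav_path, n_mfcc=20, frame_ms=30):
        """Mean MFCC vector over all 30 ms frames of one utterance."""
        y, sr = librosa.load(wav_path, sr=None)        # keep the original sampling rate
        n_fft = int(sr * frame_ms / 1000)              # 30 ms analysis window
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=n_fft, hop_length=n_fft // 2)
        return mfcc.mean(axis=1)                       # one n_mfcc-dimensional vector per utterance

    features = utterance_mfcc_mean("utterance.wav")    # hypothetical file from the database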
Each utterance was also decomposed into an approximation coefficients vector CA4 and four detail coefficients vectors CD1, CD2, CD3 and CD4, using a 4-level dyadic wavelet transform. The sets of coefficients were derived from different mother wavelets, such as Daubechies, Coiflets, Symlets, Morlet and biorthogonal wavelets. After that, EA, the percentage of energy corresponding to the approximation, and Ed, the vector containing the percentages of energy corresponding to the details, were estimated for every case. We also performed 3-, 4- and 5-level decompositions. Comparing the results, it was observed that a 4-level decomposition gives the best solution for this problem.
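The wavelet-energy features can be sketched, for example, with the PyWavelets package (our choice for illustration; the paper does not state which implementation was used):

    import numpy as np
    import pywt

    def wavelet_energy_features(signal, wavelet="db4", level=4):
        """Percentages of signal energy in the approximation (EA) and detail bands (Ed)."""
        coeffs = pywt.wavedec(signal, wavelet, level=level)    # [CA4, CD4, CD3, CD2, CD1]
        energies = np.array([np.sum(np.square(c)) for c in coeffs])
        total = energies.sum()
        ea = 100.0 * energies[0] / total                       # approximation energy, percent
        ed = 100.0 * energies[1:] / total                      # detail energies, percent
        return ea, ed

    ea, ed = wavelet_energy_features(np.random.randn(16000))   # synthetic 1 s signal at 16 kHz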
In addition, a voiced/unvoiced decision step was introduced. A measure of voicedness, based on the harmonic product spectrum, was extracted for each time frame. The same features were then extracted from the voiced speech only.
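The exact voicedness measure is not specified in the paper. One simple possibility, given here only as an assumption, is to build the harmonic product spectrum (HPS) of each windowed frame and compare its peak with its average level:

    import numpy as np

    def hps_voicedness(frame, n_harmonics=4):
        """Crude voicedness score: peak-to-mean ratio of the harmonic product spectrum.

        Frames whose score exceeds a chosen threshold would be treated as voiced;
        the threshold and the number of harmonics are assumptions of this sketch.
        """
        spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
        hps = spectrum.copy()
        for h in range(2, n_harmonics + 1):
            hps[:len(spectrum[::h])] *= spectrum[::h]          # multiply with downsampled spectra
        return hps.max() / (hps.mean() + 1e-12)

    score = hps_voicedness(np.random.randn(480))               # e.g. a 30 ms frame at 16 kHz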
3.3. Classification
The Naive Bayes classifier was selected from the WEKA
toolbox for the recognition of the emotional patterns. Other
classifiers were also tested, but no significant differences
were observed. Naive Bayes also has the advantage of being
fast, even when dealing with high-dimensional data.
Two function-based classifiers were also used: RBF Network and SMO. The RBF Network class implements a normalized Gaussian radial basis function network and uses the k-means clustering algorithm to provide the basis functions. Weka also includes an implementation of the Support Vector Machine (SVM) classifier, called SMO. SVMs are a set of related supervised learning methods used for classification and regression.
A 10-fold cross-validation technique was used, whereby the training data was randomly split into ten sets, nine of which were used for training and the tenth for validation; another nine were then picked iteratively, and so forth [11].
The selected features are used to classify the utterances into their corresponding emotional classes.
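To reproduce a comparable setup outside Weka, the sketch below uses scikit-learn stand-ins (a linear-kernel SVC in place of SMO and GaussianNB for Naive Bayes) with the same 10-fold cross-validation; it only approximates the Weka classifiers and runs on random placeholder features.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    # X: one 25-dimensional feature vector per utterance, y: emotion labels.
    # Random placeholders are used here; in practice they come from Section 3.2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(225, 25))
    y = rng.integers(0, 5, size=225)

    for name, clf in [("Naive Bayes", GaussianNB()),
                      ("SVM (SMO-like)", SVC(kernel="linear"))]:
        scores = cross_val_score(clf, X, y, cv=10)             # 10-fold cross-validation
        print(name, "mean accuracy: %.3f" % scores.mean())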
4. Experiments and Results
A series of experiments was conducted based on the proposed
classification approach.
As expected, the classification rates are better when 20 mel cepstrum coefficients are used. A high percentage was achieved for all the mentioned wavelet functions, but Db4 seems to be the most adequate for this area of application. Thus, a small feature vector of 25 coefficients was obtained. The next tables show the results obtained with the SMO classifier, for 5 classes and 45 utterances per class:
Emotional State (class)   TP Rate   FP Rate   Precision
Happiness                 0.756     0.056     0.773
Disgust                   0.800     0.033     0.857
Neutral                   0.889     0.044     0.833
Anger                     0.889     0.022     0.909
Sadness                   0.978     0.017     0.936

Table 2. The results obtained using the SMO classifier
Weka produces many useful statistics (e.g. TP rate, FP rate, precision, confusion matrix), where: the True Positive (TP) rate is the proportion of examples classified as class x among all examples which truly have class x, i.e. how much of the class was captured; the False Positive (FP) rate is the proportion of examples classified as class x, but belonging to a different class, among all examples which are not of class x; and the Precision is the proportion of examples which truly have class x among all those classified as class x.
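These three quantities can be computed directly from a confusion matrix. The short sketch below does so and, applied to the first class of Table 3, reproduces the first row of Table 2 (this computation is ours, for illustration only):

    import numpy as np

    def per_class_metrics(cm):
        """TP rate, FP rate and precision per class; cm[i, j] = true class i classified as j."""
        cm = np.asarray(cm, dtype=float)
        tp = np.diag(cm)
        tp_rate = tp / cm.sum(axis=1)                     # part of each class that was captured
        fp = cm.sum(axis=0) - tp                          # instances wrongly assigned to the class
        fp_rate = fp / (cm.sum() - cm.sum(axis=1))        # among instances not of that class
        precision = tp / cm.sum(axis=0)
        return tp_rate, fp_rate, precision

    cm_smo = [[34, 4, 2, 4, 1],    # happiness (cf. Table 3)
              [4, 36, 5, 0, 0],    # disgust
              [1, 2, 40, 0, 2],    # neutral
              [5, 0, 0, 40, 0],    # anger
              [0, 0, 1, 0, 44]]    # sadness
    tp_rate, fp_rate, precision = per_class_metrics(cm_smo)
    # First entries: ~0.756, ~0.056, ~0.773, matching the happiness row of Table 2.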
  a    b    c    d    e     <-- classified as
 34    4    2    4    1     a = happiness
  4   36    5    0    0     b = disgust
  1    2   40    0    2     c = neutral
  5    0    0   40    0     d = anger
  0    0    1    0   44     e = sadness

Table 3. Confusion matrix for the SMO classifier
The confusion matrix (Table 3) shows that the main errors occur for the happiness class, followed by disgust. The sadness class is recognized best (97.8% on average), followed by anger and neutral (88.9% each).
The next two tables present the results obtained with the RBF network. Out of 225 instances in total, 185 were correctly classified in this case.
Emotional State (class)   TP Rate   FP Rate   Precision
Happiness                 0.733     0.072     0.717
Disgust                   0.778     0.072     0.729
Neutral                   0.844     0.022     0.905
Anger                     0.844     0.050     0.809
Sadness                   0.911     0.006     0.976

Table 4. The results obtained using the RBF network
  a    b    c    d    e     <-- classified as
 33    7    0    5    0     a = happiness
  5   35    3    2    0     b = disgust
  1    4   38    0    2     c = neutral
  6    1    0   38    0     d = anger
  0    1    1    2   41     e = sadness

Table 5. Confusion matrix for the RBF network
The best results were obtained with the SMO algorithm (86.22% average accuracy), followed by the RBF network (82.22%) and Naive Bayes (around 80%). The same feature vector and the same set of classifiers were also applied to the voiced speech only; the results are approximately the same, but the computational cost is higher.
5. Conclusions and Future Work
The results obtained from the experiments performed allow us to make the following observations: the description of the emotional states and the selection of the parameters are the key to the final result. The purpose was to obtain a small feature vector and a good classification rate. The experiment presented in this paper achieved an average of 86.22% success with a 25-coefficient feature vector and the SMO classifier. Comparing the results with other studies [5], [6], [7], the classification rate is approximately the same, but the feature vector dimension in the present study is far smaller.
It is known that MFCCs are used for speaker recognition and that these coefficients contain information unrelated to the emotional state (such as speaker identity and gender). We assume that this additional information may increase the false positive rate.
In the future, we intend to introduce other voice parameterizations in order to minimize the confusion between states. We also plan to investigate other wavelet techniques that could overcome some of the deficiencies of the methods presented. Moreover, psychologists have argued that visual information modifies the perception of speech, and the combination of visual and audio information gives better results. Therefore, our future efforts will include the fusion of video and audio data in order to improve the performance of our existing emotion recognition system.
6. Acknowledgements
Part of this work has been supported by the research grant
project PNCDI No. 339/2007.
7. References
[1] R. Hasan, M. Jamil, G. Rabbani, "Speaker Identification using Mel Frequency Cepstral Coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), 2004.
[2] S. Mallat, A Wavelet Tour of Signal Processing, New York, Academic Press, 1999.
[3] D. Ververidis, C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods", Speech Communication, vol. 48, 2006, pp. 1162-1181.
[4] I. Iriondo, S. Planet, J. Socoro, F. Alias, "Objective and Subjective Evaluation of an Expressive Speech Corpus", vol. 4885, 2007, pp. 86-94.
[5] R. Shah, M. Hewlett, "Emotion Detection from Speech", CS 229 Machine Learning Final Projects, Stanford University, 2007.
[6] F. Beritelli, S. Casale, A. Russo, S. Serrano, "Speech Emotion Recognition Using MFCCs Extracted from a Mobile Terminal based on ETSI Front End", ICSP 2006 Proceedings, 2006.
[7] S. E. Bou-Ghazale, J. H. L. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress", IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 429-442, Jul. 2000.
[8] B. Schuller, G. Rigoll, M. Lang, "Hidden Markov model-based speech emotion recognition", in Proceedings of the 2003 International Conference on Multimedia and Expo (ICME '03), vol. 1, pp. 401-404, Jul. 2003.
[9] A. Razak, H. Yusof, R. Komiya, "Towards automatic recognition of emotion in speech", in Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003), 14-17 Dec. 2003, pp. 548-551.
[10] Z. Ciota, "Emotion Recognition on the Basis of Human Speech", 18th International Conference ICECom 2005, 12-14 Oct. 2005, pp. 1-4.
[11] http://www.cs.waikato.ac.nz/ml/weka/
[12] http://pascal.kgw.tu-berlin.de/emodb/index-1280.html