An Approach On Emotion Recognition by Using Speech Signals
\[
\text{mel}(f) = 2595 \, \log_{10}\!\left(1 + \frac{f}{700}\right) \tag{1}
\]
One approach to simulating the subjective spectrum is to
use a filter bank, that is, one filter for each desired
mel-frequency component. Each filter has a triangular
bandpass frequency response, and both the spacing and the
bandwidth are determined by a constant mel-frequency
interval. In the final step, the log mel spectrum has to be
converted back to time. The result is called the mel frequency
cepstrum coefficients (MFCCs). The cepstral representation
of the speech spectrum provides a good representation of the
local spectral properties of the signal for the given frame
analysis. Because the mel spectrum coefficients are real
numbers (and so are their logarithms), they may be converted
to the time domain using the Discrete Cosine Transform
(DCT). The MFCCs may be calculated using this equation
[1]:
\[
\tilde{C}_n = \sum_{k=1}^{K} \left(\log \tilde{S}_k\right) \cos\!\left[n\left(k - \frac{1}{2}\right)\frac{\pi}{K}\right], \qquad n = 1, 2, \ldots, K \tag{2}
\]
where \(\tilde{S}_k\), \(k = 1, 2, \ldots, K\), are the mel spectrum coefficients.
The first component is excluded from the DCT since it
represents the mean value of the input signal, which carries
little speaker-specific information. This set of
coefficients is called an acoustic vector. These acoustic
vectors can be used to represent and recognize the voice
characteristic of the speaker [4]. Therefore each input
utterance is transformed into a sequence of acoustic vectors.
The next section describes how these acoustic vectors can be
used to represent and recognize the emotional state of a
speaker.
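To make the two equations above concrete, the following sketch is a minimal illustration (not the exact front end used in this study): it applies Eq. (1) to warp a frequency to the mel scale and Eq. (2) to turn an already computed mel spectrum of one frame into MFCCs. The toy spectrum, the number of bands and the number of coefficients are assumed values.

```python
import numpy as np

def hz_to_mel(f):
    # Eq. (1): mel-scale warping of a frequency f given in Hz
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mfcc_from_mel_spectrum(S, n_coeffs):
    # Eq. (2): DCT of the log mel spectrum; S[k-1] corresponds to S~_k, k = 1..K
    K = len(S)
    k = np.arange(1, K + 1)
    return np.array([np.sum(np.log(S) * np.cos(n * (k - 0.5) * np.pi / K))
                     for n in range(1, n_coeffs + 1)])

# toy example: a made-up 20-band mel spectrum for one 30 ms frame
rng = np.random.default_rng(0)
S = rng.uniform(0.1, 1.0, size=20)        # assumed mel spectrum coefficients
print(hz_to_mel(1000.0))                  # ~1000 mel, as expected for 1 kHz
print(mfcc_from_mel_spectrum(S, n_coeffs=20)[:5])
```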
2.3. Wavelet Analysis
Traditional techniques for speech signal analysis use
Fourier methods for signal processing. Fourier analysis,
however, only describes the spectral content of a signal in
the frequency domain. The time-domain location of a
particular event is lost in the Fourier transform because
time instants are not preserved. This limitation can be
ignored if the signal is stationary.
However, for non-stationary signals, like speech, time and
frequency domain information is necessary to avoid any loss
of significant information in the signal. Wavelet analysis
provides an alternative method to Fourier analysis for voice
processing. Wavelets apply the concept of multiresolution
analysis (i.e., time and frequency scale representations) to
produce precise decompositions of signals for accurate signal
representation. They can reveal detailed characteristics, like
small discontinuities, self-similarities, and even higher order
derivatives that may be hidden by the conventional Fourier
analysis.
The wavelet transform (WT) is based on a time-frequency
signal analysis. It represents a windowing technique with
variable-sized regions. Wavelet transform allows the use of
long time intervals where more precise low-frequency
information is wanted, and shorter regions where high
frequency information is wanted [2]. It is well known that
speech signals contain many transient components and non-
stationary properties. The multi-resolution analysis (MRA)
capability of the WT suits this well: better time resolution
is needed at high frequencies to detect the rapidly changing
transient components of the signal, while better frequency
resolution is needed at low frequencies to track the
slowly time-varying formants more precisely.
3. Experimental Background
3.1. Database
The data used for our experiments comes from the Berlin
Database of Emotional Speech. It contains about 500
utterances spoken by actors in a happy, angry, sad and
disgusted way, as well as in a neutral version. Utterances from
10 different actors (male and female) are available, with ten
different utterances per actor [12].
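As an illustration only, the sketch below groups the corpus files into the five classes used here. It assumes the standard EmoDB filename convention, in which the sixth character encodes the emotion (e.g. F = happiness, W = anger, T = sadness, E = disgust, N = neutral); this mapping and the helper itself are assumptions that should be checked against the downloaded database.

```python
import os

# Assumed EmoDB emotion codes (sixth character of the file name) for the
# five classes used in this study.
EMOTION_CODES = {"F": "happiness", "W": "anger", "T": "sadness",
                 "E": "disgust", "N": "neutral"}

def select_utterances(wav_dir):
    """Group EmoDB wav files by emotional class (hypothetical helper)."""
    by_class = {label: [] for label in EMOTION_CODES.values()}
    for name in sorted(os.listdir(wav_dir)):
        if name.endswith(".wav") and name[5] in EMOTION_CODES:
            by_class[EMOTION_CODES[name[5]]].append(os.path.join(wav_dir, name))
    return by_class

# usage: files = select_utterances("emodb/wav"); len(files["anger"]), etc.
```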
3.2. Feature extraction
Prosodic features that are commonly used in literature for
speech emotion recognition are based on pitch, energy,
MFCCs (Mel Frequency Cepstral Coefficients), pauses,
duration and speech rate, formants and voice quality features.
Feature extraction is a procedure that computes a new, more
compact representation from the original data. The main purpose
is to generate features that retain the most significant
information, or encode the relevant information from the
original data efficiently [3].
In the approach of the present study, a multitude of
features was computed and analyzed and then the most
relevant features for the given application were selected:
MFCCs and the energy of wavelet coefficients.
By applying the procedure described above, for each
speech frame of 30 ms, a set of mel-frequency cepstrum
coefficients was computed. The number of mel cepstrum
coefficients, K, was chosen between 12 and 20. The mean MFCC
vector was then computed for every utterance.
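As an illustration only (the study's own front end may differ), the sketch below extracts K = 20 MFCCs on 30 ms frames and averages them over an utterance. The use of librosa, the 16 kHz sampling rate and the non-overlapping framing are assumptions.

```python
import librosa  # assumed dependency; any MFCC front end could be substituted

def utterance_mfcc_mean(path, sr=16000, n_mfcc=20, frame_ms=30):
    """Mean MFCC vector over the 30 ms frames of one utterance (illustrative)."""
    y, sr = librosa.load(path, sr=sr)
    frame_len = int(sr * frame_ms / 1000)          # 480 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=frame_len, hop_length=frame_len)
    return mfcc.mean(axis=1)                       # shape: (n_mfcc,)

# usage: vec = utterance_mfcc_mean("emodb/wav/03a01Fa.wav")
```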
Figure 1 shows the variation of the second MFCC for a male
speaker uttering Der Lappen liegt auf dem Eisschrank (The
cloth is lying on the fridge) in emotional states of
happiness and anger.
Fig. 1. 2nd MFCC trace for happiness and anger utterances
An utterance was also decomposed into an approximation
coefficients vector CA4 and four detail coefficients vectors
CD1, CD2, CD3 and CD4, using a 4-level dyadic wavelet
transform. The sets of coefficients were derived from
different mother wavelets such as Daubechies, Coiflets,
Symlets, Morlet and biorthogonal wavelets.
For every case, EA, the percentage of energy corresponding to
the approximation, and Ed, the vector containing the
percentages of energy corresponding to the details, were then
estimated. Decompositions at levels 3, 4 and 5 were also
compared; level 4 was observed to give the best results for
this problem.
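A minimal sketch of this step is given below, assuming PyWavelets as the decomposition library and db4 as the mother wavelet; the original tool chain is not specified in the paper, so this is illustrative only.

```python
import numpy as np
import pywt  # assumed dependency (PyWavelets)

def wavelet_energy_features(y, wavelet="db4", level=4):
    """EA and Ed: percentage of signal energy in the approximation and in each
    detail band of a 4-level dyadic DWT (illustrative sketch)."""
    coeffs = pywt.wavedec(y, wavelet, level=level)   # [cA4, cD4, cD3, cD2, cD1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])
    percentages = 100.0 * energies / energies.sum()
    ea, ed = percentages[0], percentages[1:]         # EA scalar, Ed vector (4 values)
    return ea, ed

# Together with 20 mean MFCCs this yields the 25-element feature vector
# mentioned in Section 4 (20 + 1 + 4), under the assumptions stated above.
```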
In addition, a voiced/unvoiced decision step was
introduced. A measure of voicedness was extracted based on
the harmonic product spectrum for each time frame. The same
features were extracted only for the voiced speech.
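The paper does not give the exact voicedness measure; a rough sketch of a harmonic-product-spectrum score, with an assumed number of harmonics and an assumed peak-to-mean decision statistic, might look like this:

```python
import numpy as np

def hps_voicedness(frame, n_harmonics=4):
    """Crude voicedness score for one frame: downsample the magnitude spectrum
    by factors 1..R and multiply; a strong peak in the product suggests a
    voiced frame (illustrative only)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    limit = len(spec) // n_harmonics          # region covered by all harmonics
    hps = spec[:limit].copy()
    for r in range(2, n_harmonics + 1):
        hps *= spec[::r][:limit]              # spectrum compressed by factor r
    return hps[1:].max() / (hps[1:].mean() + 1e-12)

# usage: a frame is treated as voiced if hps_voicedness(frame) exceeds a chosen
# threshold; the threshold value itself is not given in the paper.
```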
3.3. Classification
The Naive Bayes classifier was selected from the WEKA
toolbox for the recognition of the emotional patterns. Other
classifiers were also tested, but no significant differences
were observed. Naive Bayes also has the advantage of being
fast, even when dealing with high-dimensional data.
Two functions were also used: RBF Network and SMO. The RBF
Network class implements a normalized Gaussian radial basis
function network and uses the k-means clustering algorithm to
provide the basis functions. Weka also includes an
implementation of the Support Vector Machine (SVM) classifier,
called SMO. SVMs are a set of related supervised learning
methods used for classification and regression.
A 10-fold cross-validation technique was used, whereby the
training data was randomly split into ten sets, nine of which
were used for training and the tenth for validation; then
another nine were picked iteratively, and so forth [11].
The selected features are used to classify the utterances
into their corresponding emotional classes.
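The experiments used Weka; purely as a rough analogue (not the original tool chain), the same protocol can be sketched with scikit-learn. The feature matrix X (225 x 25) and label vector y are assumed to come from the extraction steps above; random placeholder data stands in for them here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Placeholder data: 225 utterances x 25 features (20 MFCC means + EA + 4 Ed).
rng = np.random.default_rng(0)
X = rng.normal(size=(225, 25))
y = np.repeat(["happiness", "disgust", "neutral", "anger", "sadness"], 45)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("SVM (SMO analogue)", SVC(kernel="linear"))]:
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```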
4. Experiments and Results
A series of experiments was conducted based on the proposed
classification approach.
As expected, the classification rates are better when 20 mel
cepstrum coefficients are used. A high percentage was achieved
for all the mentioned wavelet functions, but db4 seems the most
adequate for this application. Thus a small feature vector of
25 coefficients was obtained. The next tables show the results
obtained with the SMO classifier, for 5 classes and 45
utterances per class:
Emotional State (class)   TP Rate   FP Rate   Precision
Happiness                  0.756     0.056     0.773
Disgust                    0.800     0.033     0.857
Neutral                    0.889     0.044     0.833
Anger                      0.889     0.022     0.909
Sadness                    0.978     0.017     0.936
Table 2. The results obtained using SMO classifier
Weka produces many useful statistics (e.g. TP rate, FP rate,
precision, confusion matrix) where: The True Positive (TP)
rate is the proportion of examples which were classified as
class x, among all examples which truly have class x, i.e. how
much of the class was captured. The False Positive (FP)
rate is the proportion of examples which were classified as
class x, but belong to a different class, among all examples
which are not of class x. The Precision is the proportion of the
examples which truly have class x among all those which were
classified as class x.
  a    b    c    d    e      Classified as
 34    4    2    4    1      a = happiness
  4   36    5    0    0      b = disgust
  1    2   40    0    2      c = neutral
  5    0    0   40    0      d = anger
  0    0    1    0   44      e = sadness
Table 3. Confusion matrix for SMO classifier
The confusion matrix (Table 3) shows that the main errors occur
for the happiness class, followed by disgust. Sadness is
recognized best (97.8% TP rate), followed by anger and neutral
(88.9% each).
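These definitions can be checked directly against Table 3; the short sketch below recomputes TP rate, FP rate and precision per class from the confusion matrix (the matrix values are taken from the table, everything else is illustrative).

```python
import numpy as np

# Rows: true class, columns: predicted class (a..e as in Table 3).
cm = np.array([[34,  4,  2,  4,  1],    # happiness
               [ 4, 36,  5,  0,  0],    # disgust
               [ 1,  2, 40,  0,  2],    # neutral
               [ 5,  0,  0, 40,  0],    # anger
               [ 0,  0,  1,  0, 44]])   # sadness

tp = np.diag(cm)
tp_rate = tp / cm.sum(axis=1)                   # per-class recall
fp = cm.sum(axis=0) - tp
fp_rate = fp / (cm.sum() - cm.sum(axis=1))      # FP among non-class examples
precision = tp / cm.sum(axis=0)

# e.g. sadness: TP rate 44/45 = 0.978, FP rate 3/180 = 0.017, precision
# 44/47 = 0.936, matching Table 2; overall accuracy 194/225 = 86.22 %.
print(tp_rate.round(3), fp_rate.round(3), precision.round(3))
```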
In the next two tables the results obtained with the RBF
function are presented. Of the 225 instances, 185 were
correctly classified in this case.
Emotional State (class)   TP Rate   FP Rate   Precision
Happiness                  0.733     0.072     0.717
Disgust                    0.778     0.072     0.729
Neutral                    0.844     0.022     0.905
Anger                      0.844     0.050     0.809
Sadness                    0.911     0.006     0.976
Table 4. The results obtained using RBF function
  a    b    c    d    e      Classified as
 33    7    0    5    0      a = happiness
  5   35    3    2    0      b = disgust
  1    4   38    0    2      c = neutral
  6    1    0   38    0      d = anger
  0    1    1    2   41      e = sadness
Table 5. Confusion matrix for RBF function
The best results were obtained with the SMO algorithm (86.22%
average accuracy), followed by the RBF network (82.22%) and
Naive Bayes (around 80%).
The same feature vector and the same set of classifiers
were applied again only on the voiced speech; the results are
approximately the same but the computational cost is
increased.
5. Conclusions and Future Work
The results of the experiments performed allow us to make the
following observations: the description of the emotional states
and the selection of parameters are key to the final result.
The purpose was to obtain a small feature vector and a good
classification rate. The experiment presented in this paper
achieved an average recognition rate of 86.22% with a
25-coefficient feature vector and the SMO classifier.
Comparing the results with other studies [5], [6], [7] shows
that the classification rate is approximately the same, but the
feature vector used in the present study is far smaller.
It is known that MFCCs are used for speaker recognition
and that these coefficients contain information unrelated to
emotional state (such as speaker identity, speaker gender).
We assume that this additional information may increase the
False Positive Rate.
In the future, we intend to introduce other voice
parameterization in order to minimize the confusion between
states. We also plan to investigate other wavelet techniques
that can be used to overcome some of the deficiencies in the
methods presented.
Furthermore, psychologists have argued that visual information
modifies the perception of speech, and that combining visual
and audio information yields better results. Therefore, our
future efforts will include the fusion of video and audio data
in order to improve the performance of our existing emotion
recognition system.
6. Acknowledgements
Part of this work has been supported by the research grant
project PNCDI No. 339/2007.
7. References
[1] R. Hasan, M. Jamil, G. Rabbani, "Speaker Identification using Mel Frequency Cepstral Coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE), 2004.
[2] S. Mallat, A Wavelet Tour of Signal Processing, New York, Academic Press, 1999.
[3] D. Ververidis, C. Kotropoulos, "Emotional speech recognition: Resources, features, and methods", Speech Communication, vol. 48, 2006, pp. 1162-1181.
[4] I. Iriondo, S. Planet, J. Socoro, F. Alias, "Objective and Subjective Evaluation of an Expressive Speech Corpus", vol. 4885, 2007, pp. 86-94.
[5] R. Shah, M. Hewlett, "Emotion Detection from Speech", CS 229 Machine Learning Final Projects, Stanford University, 2007.
[6] F. Beritelli, S. Casale, A. Russo, S. Serrano, "Speech Emotion Recognition Using MFCCs Extracted from a Mobile Terminal based on ETSI Front End", ICSP 2006 Proceedings.
[7] S. E. Bou-Ghazale and J. H. L. Hansen, "A comparative study of traditional and newly proposed features for recognition of speech under stress", IEEE Transactions on Speech and Audio Processing, vol. 8, no. 4, pp. 429-442, Jul. 2000.
[8] B. Schuller, G. Rigoll, and M. Lang, "Hidden Markov model-based speech emotion recognition", in Proceedings of the 2003 International Conference on Multimedia and Expo (ICME '03), Jul. 2003, vol. 1, pp. 401-404.
[9] A. Razak, H. Yusof, R. Komiya, "Towards automatic recognition of emotion in speech", Proceedings of the 3rd IEEE International Symposium on Signal Processing and Information Technology (ISSPIT 2003), 14-17 Dec. 2003, pp. 548-551.
[10] Z. Ciota, "Emotion Recognition on the Basis of Human Speech", ICECom 2005, 18th International Conference, 12-14 Oct. 2005, pp. 1-4.
[11] http://www.cs.waikato.ac.nz/ml/weka/
[12] http://pascal.kgw.tu-berlin.de/emodb/index-1280.html