
ITG Fachtagung Sprachkommunikation 2006


CLASSIFICATION OF DIFFERENT SPEAKING GROUPS BY MEANS OF VOICE QUALITY PARAMETERS
M. Lugger and B. Yang
Chair of System Theory and Signal Processing
University of Stuttgart, Germany
[email protected]

ABSTRACT
This paper presents a new method to classify different speaking groups by using so-called voice quality parameters. By voice quality we mean the characteristics of the glottal excitation of the speech signal. We estimate five parameters describing the voice quality as spectral gradients of the vocal tract compensated speech. They correlate with glottal source features such as the open quotient, the skewness of the glottal pulse, and the incompleteness of closure. These parameters are then used to classify gender, different phonation types, and different emotions with promising results.

1. INTRODUCTION

The detection of paralinguistic properties of speech is, in contrast to other applications in speech processing such as automatic speech recognition, a less explored field. By paralinguistic properties we mean all the information beyond the pure linguistic content of a spoken utterance. They describe the emotional state of a speaker, the peculiarities or manner of his individual voice, his gender, or his social background. The listener thus obtains information about the physical, psychological, social, and emotional characteristics of the speaker.
According to the source-filter model, the linguistic content of speech is mainly determined by the vocal tract, while the glottal excitation contributes significantly to the paralinguistic properties. We call the characteristics of the glottal excitation the voice quality. It is our goal to extract parameters describing the voice quality from the speech signal which allow a mapping of spoken utterances to different speaking groups.
This paper shows three classification applications of the voice quality parameters: gender, the phonation types according to J. Laver [1], and emotional states. The paper is structured as follows. Section 2 explains how the voice quality parameters are estimated. In section 3, the speech data and the classification method are described. The results of the classifications are presented in section 4.

2. VOICE QUALITY PARAMETERS

Voice quality is mainly affected by the excitation of the human voice, which is called phonation. That means the shape of the glottal pulse and its rate and time variations are responsible for the kind of voice quality that is realized. In contrast, all activities belonging to the articulation process affect the sounds that together build the content of the speech. In the literature, some non-acoustic methods are used to measure the voice quality. An electroglottograph, for example, measures the change in electrical impedance across the throat during speaking.
We study a method to measure the voice quality directly from the acoustic speech signal. No extra hardware and no intervention in the human body are required to obtain the desired information. The method is based on the observations by Stevens and Hanson [2] that the glottal properties open quotient, glottal opening, skewness of the glottal pulse, and rate of glottal closure each affect the excitation spectrum of the speech signal in a dedicated frequency range and thus reflect the voice quality of the speaker. They proposed to estimate these glottal states from the acoustic speech signal by relating the amplitudes of the corresponding higher harmonics to that of the fundamental. They further found that the first formant bandwidth is correlated with the incompleteness of the glottal closure. These measurements are simply called voice quality parameters.


We use a modified algorithm which calculates spectral gradients instead of amplitude ratios. In addition, a vocal tract compensation is performed prior to estimating the gradients [3]. The whole algorithm can be divided into three steps: measurement of the speech features, compensation of the vocal tract influence, and estimation of the voice quality parameters. Below, we describe these steps in more detail.

2.1. Measurement of speech features

The first step estimates some well-known speech features from windowed, voiced segments of the speech signal. We perform the voiced-unvoiced decision and the pitch estimation according to the RAPT algorithm [4], which looks for peaks in the normalized cross-correlation function. The frequencies and bandwidths of the first four formants are estimated by an LPC analysis [5]. All frequency values are converted to the Bark scale. Table 1 shows the estimated speech features required for the voice quality estimation.

Feature            Meaning
Fp                 pitch
F1, F2, F3, F4     formant frequencies
B1, B2, B3, B4     formant bandwidths
H1, H2             amplitudes at Fp and 2Fp [dB]
F1p, F2p, F3p      frequencies of the spectral peaks near the first three formants
A1p, A2p, A3p      amplitudes at F1p, F2p, F3p [dB]

Table 1. Speech features used for voice quality estimation
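For illustration, a minimal sketch of this measurement step is given below. It is not the authors' implementation: the pitch Fp and the formant data are assumed to be already available (e.g., from RAPT and the LPC analysis), the Traunmüller approximation of the Bark scale is an assumption since the paper does not state which conversion was used, and the harmonic amplitudes are simply picked from an FFT spectrum.

```python
import numpy as np

def hz_to_bark(f_hz):
    """Convert Hz to the Bark scale (Traunmueller approximation,
    assumed here; the paper does not specify the formula)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def harmonic_amplitudes_db(frame, fs, f_pitch, n_fft=4096):
    """Measure H1 and H2, the spectral amplitudes in dB at Fp and 2*Fp,
    from one windowed, voiced frame (illustrative FFT peak picking)."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n_fft))
    spec_db = 20.0 * np.log10(spec + 1e-12)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    h1 = spec_db[np.argmin(np.abs(freqs - f_pitch))]
    h2 = spec_db[np.argmin(np.abs(freqs - 2.0 * f_pitch))]
    return h1, h2
```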

2.2. Compensation of the vocal tract influence

Since the voice quality parameters shall depend only on the excitation and not on the articulation process, the influence of the vocal tract has to be compensated. The contribution of each of the four formants to the spectrum at frequency f is estimated by [6]

V(f; F_i, B_i) = \frac{F_i^2 + (B_i/2)^2}{\sqrt{\left[(f - F_i)^2 + (B_i/2)^2\right]\left[(f + F_i)^2 + (B_i/2)^2\right]}} .

These contributions are removed from the amplitudes H_k and A_kp in Table 1 in the dB scale:

\bar{H}_k = H_k - \sum_{i=1}^{4} V_\mathrm{dB}(k F_p; F_i, B_i), \quad (k = 1, 2)

\bar{A}_{kp} = A_{kp} - \sum_{i=1,\, i \neq k}^{4} V_\mathrm{dB}(F_{kp}; F_i, B_i), \quad (k = 1, 2, 3)

where V_\mathrm{dB} = 20 \log_{10} V. The results of the formant compensation are the corrected spectral amplitudes \bar{H}_1, \bar{H}_2 of the first and second harmonics and the corrected peak amplitudes \bar{A}_{1p}, \bar{A}_{2p}, and \bar{A}_{3p} near the three formants, as shown in Figure 1.

Fig. 1. Vocal tract compensated peaks of the FFT spectrum for the voice quality parameter estimation
2.3. Estimation of the voice quality parameters

The last step estimates the following five voice quality parameters from the vocal tract compensated speech features: Open Quotient Gradient (OQG), Glottal Opening Gradient (GOG), SKewness Gradient (SKG), Rate of Closure Gradient (RCG), and Incompleteness of Closure (IC). They are given by

OQG = \frac{\bar{H}_1 - \bar{H}_2}{F_p}

GOG = \frac{\bar{H}_1 - \bar{A}_{1p}}{F_{1p} - F_p}

SKG = \frac{\bar{H}_1 - \bar{A}_{2p}}{F_{2p} - F_p}

RCG = \frac{\bar{H}_1 - \bar{A}_{3p}}{F_{3p} - F_p}

IC = \frac{B_1}{F_1}

Figure 2 gives an illustration of the first four parameters as spectral gradients with respect to the pitch frequency.

Fig. 2. Voice quality parameters as spectral gradients
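These five definitions translate directly into code. The following sketch is only a transcription of the formulas above; the argument names are illustrative, and the compensated amplitudes are assumed to come from the previous step.

```python
def voice_quality_parameters(H_bar, A_bar, Fp, Fpk, F1, B1):
    """Compute the five voice quality parameters of section 2.3 from
    the vocal tract compensated speech features (frequencies in Bark)."""
    H1, H2 = H_bar            # compensated harmonic amplitudes (dB)
    A1p, A2p, A3p = A_bar     # compensated peak amplitudes (dB)
    F1p, F2p, F3p = Fpk       # frequencies of the peaks near F1..F3
    return {
        "OQG": (H1 - H2) / Fp,           # open quotient gradient
        "GOG": (H1 - A1p) / (F1p - Fp),  # glottal opening gradient
        "SKG": (H1 - A2p) / (F2p - Fp),  # skewness gradient
        "RCG": (H1 - A3p) / (F3p - Fp),  # rate of closure gradient
        "IC":  B1 / F1,                  # incompleteness of closure
    }
```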


3. EXPERIMENTS
In this paper, we apply the voice quality parameters to classify different speaking groups in three studies: classification of gender, phonation types, and emotions. The speech data for all studies were recorded in an anechoic room at a sampling frequency of 16 kHz. All signals were segmented into utterances of about 3 seconds length. The speech was classified on the basis of every spoken utterance.
3.1. Classification
There are many different approaches for gender and emotion detection in the literature. A method for gender detection is described in [7]. A good overview of previous work on emotion detection is given in [8]. Most of the proposed methods use prosodic properties of the speech such as intonation, intensity, and duration as features for the classification. They do not consider voice quality parameters as features because these were found difficult to model and to estimate. Similarly, there exist many different methods for pattern recognition. For emotion detection, the Bayes classifier, Gaussian mixture models, hidden Markov models, and artificial neural networks have been used.
In this paper, we use only voice quality parameters for all classifications. For the pattern recognition part, we use a linear discriminant analysis because it is simple and does not need a long training phase.

3.2. Linear discriminant analysis

Discriminant analysis is a method for multivariate data analysis [9]. In this work, a linear discriminant analysis is used for classification. It involves several linear discriminant functions.
In the training phase, the parameters of the linear discriminant functions are determined from a random subset of the speech data with a priori known classes. This is called supervised training. These discriminant functions are then used in a second test phase to classify the remaining (or complete) speech data into different speaking groups. The results of the classifier are compared with the a priori known classes in order to evaluate the quality of the classifier.
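For illustration, a minimal sketch of this supervised training and test procedure is given below, including the per-utterance majority decision described in section 4. The scikit-learn estimator and the helper functions are assumptions made for the sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_lda(train_features, train_labels):
    """Supervised training on segment-level feature vectors
    (one 5-dimensional voice quality vector per voiced segment)."""
    lda = LinearDiscriminantAnalysis()
    lda.fit(train_features, train_labels)
    return lda

def classify_utterance(lda, segment_features):
    """Classify every voiced segment of an utterance and assign the
    utterance to the class with the most segment decisions."""
    segment_classes = lda.predict(segment_features)
    values, counts = np.unique(segment_classes, return_counts=True)
    return values[np.argmax(counts)]
```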

4. RESULTS

This section presents the classification results. Except for the gender study, all classifications are still speaker-dependent. The classification was done for every spoken utterance by a majority decision: every voiced segment of an utterance is mapped to one class by the classifier, and the whole utterance is assigned to the class with the maximum number of single-segment decisions. The confusion matrices for the three studies are shown in Tables 2-4. The first column contains the true speaking group and the first row shows the detected speaking group. The entries on the main diagonal show the rates of correct decisions; the off-diagonal entries are the rates of wrong decisions.

4.1. Classification of gender

The speech samples for the first study were taken from [10]. They consist of 10 utterances of male and 10 utterances of female speech, each of about 30 seconds duration. For the training phase, a random set of 5 male and 5 female speakers was considered. Based on these training data, a classification of all 20 speakers was performed.
Table 2 shows the gender classification under white background noise with different values of the global signal-to-noise ratio (SNR). We see that the gender classification shows quite good results: only 7.4% of the male and 4.8% of the female utterances are wrongly classified in the noiseless case. For decreasing SNR, there is a moderate and smooth performance degradation, where the classifier tends to favour female decisions. The reason is that the female voice has a stronger breathy portion than the male voice [11], and hence noisy speech tends to be classified as female rather than male.
gender    SNR        male     female
male      no noise   92.6%    7.4%
female    no noise   4.8%     95.2%
male      30 dB      82.7%    17.3%
female    30 dB      9.3%     90.7%
male      15 dB      81.5%    18.5%
female    15 dB      11.3%    88.7%
male      0 dB       58.5%    41.5%
female    0 dB       5.6%     94.4%

Table 2. Gender classification under white noise


4.2. Classification of phonation types
The noiseless speech data for the second study were
taken from the book [12] by J. Laver. It contains
utterances spoken in six different phonation types
[1] by the same speaker: modal voice, falsetto
voice, whispery voice, breathy voice, creaky
voice, and rough voice. For the classification,
only the phonation types rough voice, creaky voice, and modal voice were considered.
Table 3 shows the results of the phonation type detection. The classification of modal voice shows a very good result with a detection rate of 95.6%. Rough voice is correctly detected in nearly 3 out of 4 utterances. Creaky voice is correctly detected in 67.5% of the cases; it is mainly confused with modal voice.
voice quality    modal    rough    creaky
modal voice      95.6%    0.0%     4.4%
rough voice      12.7%    73.3%    14.0%
creaky voice     28.8%    3.7%     67.5%

Table 3. Classification of phonation types


4.3. Classification of emotions
The noiseless speech data for the third study were taken from the Berlin database of emotional speech [13]. It contains about 500 sentences spoken by actors in a neutral, happy, angry, sad, fearful, bored, and disgusted way. We used only the emotions angry, sad, happy, and neutral. The classification was done separately for each speaker; that is, speech samples of the same speaker were used for both training and classification. The values in the confusion matrix are mean values over 10 speakers.
Table 4 shows the results of the emotion detection. We see that the emotions angry, sad, and neutral were classified with detection rates over 80%. Only happy voice shows a detection rate of 57.7%, because it is often confused with angry voice. This is a well-known fact from the literature [14]: happy and angry voices show similar values in the emotion dimensions activity and valence; they differ only in the dimension potency.
Emotion    angry    sad      happy    neutral
angry      93.7%    0.0%     5.6%     0.8%
sad        0.0%     84.7%    1.7%     13.6%
happy      33.8%    0.0%     57.7%    8.4%
neutral    0.0%     2.5%     6.3%     91.1%

Table 4. Classification of emotions

5. CONCLUSION
We introduced the voice quality to describe the glottal excitation and presented an algorithm to estimate the voice quality parameters from the speech signal. The classification of gender, phonation type, and emotion using voice quality parameters shows promising results. The next step is to achieve a speaker-independent classification for phonation type and emotion. One idea could be the use of additional prosodic features such as pitch, intensity, and duration. A second approach could be the improvement of the classifier.
6. REFERENCES
[1] John Laver, The phonetic description of voice quality, Cambridge University Press, 1980.
[2] K. Stevens and H. Hanson, Classification of glottal vibration from acoustic measurements, Vocal Fold Physiology, pp. 147-170, 1994.
[3] M. Pützer and W. Wokurek, Multiparametrische Stimmprofildifferenzierung zu männlichen und weiblichen Normalstimmen auf der Grundlage akustischer Signale, Laryngo-Rhino-Otologie, Thieme, 2006.
[4] D. Talkin, A robust algorithm for pitch tracking (RAPT), in W. Kleijn and K. Paliwal (eds.), Speech Coding and Synthesis, Elsevier, pp. 495-518, 1995.
[5] David Talkin, Speech formant trajectory estimation using dynamic programming with modulated transition costs, Technical Report, Bell Labs, 1987.
[6] G. Fant, Acoustic theory of speech production, The Hague: Mouton, 1960.
[7] H. Harb and L. Chen, Gender identification using a general audio classifier, Proc. IEEE International Conference on Multimedia and Expo, pp. 733-736, 2003.
[8] R. Cowie and E. Douglas-Cowie, Emotion recognition in human-computer interaction, IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, 2001.
[9] R. O. Duda, P. Hart, and D. G. Stork, Pattern Classification, Wiley, 2001.
[10] M. Pützer and J. Koreman, A German database of patterns of pathological vocal fold vibration, PHONUS 3, Universität des Saarlandes, pp. 143-153, 1997.
[11] H. Hanson and E. Chuang, Glottal characteristics of male speakers: Acoustic correlates and comparison with female data, Journal of the Acoustical Society of America, vol. 106, pp. 1064-1077, 1999.
[12] J. Laver and H. Eckert, Menschen und ihre Stimmen - Aspekte der vokalen Kommunikation, Beltz, 1994.
[13] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, A database of German emotional speech, Proceedings of Interspeech, 2005.
[14] Astrid Paeschke, Prosodische Analyse emotionaler Sprechweise, Ph.D. thesis, TU Berlin, 2003.
[15] M. Lugger, B. Yang, and W. Wokurek, Robust estimation of voice quality parameters under real world disturbances, Proc. IEEE ICASSP, 2006.
