Lugger, Yang - 2006 - Classification of Different Speaking Groups by Means of Voice Quality Parameters
ITG Fachtagung Sprachkommunikation 2006
ABSTRACT
This paper presents a new method to classify different speaking groups by using so-called voice quality parameters. By voice quality we mean the characteristics of the glottal excitation of the speech signal. We estimate five parameters describing the voice quality as spectral gradients of the vocal tract compensated speech. They correlate with glottal source features such as the open quotient, the skewness of the glottal pulse, and the incompleteness of closure. These parameters are then used to classify gender, different phonation types, and
different emotions with promising results.
1. INTRODUCTION
2. ESTIMATION OF VOICE QUALITY PARAMETERS

We use a modified algorithm which calculates spectral gradients instead of amplitude ratios. In addition, a vocal tract compensation is performed prior to estimating the gradients [3]. The whole algorithm can be divided into three steps: measurement of speech features, compensation of the vocal tract influence, and estimation of the voice quality parameters. Below we describe these steps in more detail.

The first step measures the speech features listed in Table 1.

Table 1: Measured speech features.

  Symbol                    Meaning
  F_p                       pitch
  F_1, ..., F_4             formant frequencies
  B_1, ..., B_4             formant bandwidths
  H_1, H_2                  amplitudes at F_p and 2F_p [dB]
  F_{1p}, F_{2p}, F_{3p}    frequencies of peaks near formants
  A_{1p}, A_{2p}, A_{3p}    amplitudes at F_{1p}, F_{2p}, F_{3p} [dB]
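As an illustration of the first step, the sketch below measures the harmonic and peak amplitudes of Table 1 from a single voiced frame, assuming that the pitch F_p and the formant frequencies are already supplied by external trackers (for example RAPT [4] or a formant tracker [5]). The function name, the zero-padded FFT, and the +-20 % peak search are illustrative choices, not the paper's exact measurement procedure.

```python
import numpy as np

def spectral_amplitudes(frame, fs, Fp, formants, win=np.hanning):
    """Measure H1, H2 and the peak frequencies/amplitudes near the first
    three formants from one voiced frame (hypothetical helper; Fp [Hz]
    and the formant frequencies [Hz] come from external trackers)."""
    x = frame * win(len(frame))
    n = 4 * len(x)                                   # zero padding for finer bins
    spec = 20.0 * np.log10(np.abs(np.fft.rfft(x, n=n)) + 1e-12)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)

    def amp_at(f):                                   # spectral amplitude [dB] at f
        return spec[np.argmin(np.abs(freqs - f))]

    def peak_near(f, width=0.2):                     # strongest bin within +-20 % of f
        band = (freqs > (1 - width) * f) & (freqs < (1 + width) * f)
        idx = np.flatnonzero(band)[np.argmax(spec[band])]
        return freqs[idx], spec[idx]

    H1, H2 = amp_at(Fp), amp_at(2 * Fp)              # harmonic amplitudes at Fp, 2Fp
    peaks = [peak_near(F) for F in formants[:3]]     # (Fkp, Akp) near F1, F2, F3
    Fpeaks = [p[0] for p in peaks]
    A = [p[1] for p in peaks]
    return H1, H2, Fpeaks, A
```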
The second step compensates the influence of the vocal tract by subtracting the contribution of the first four formants from the measured amplitudes:

  \bar{H}_k = H_k - \sum_{i=1}^{4} V_{dB}(k F_p; F_i, B_i),          k = 1, 2,

  \bar{A}_{kp} = A_{kp} - \sum_{i=1}^{4} V_{dB}(F_{kp}; F_i, B_i),   k = 1, 2, 3,

where V_{dB}(f; F_i, B_i) denotes the spectral contribution in dB of a formant with frequency F_i and bandwidth B_i at the frequency f. The results of the formant compensation are the corrected spectral amplitudes \bar{H}_1, \bar{H}_2 of the first and second harmonics and the corrected peak amplitudes \bar{A}_{1p}, \bar{A}_{2p}, and \bar{A}_{3p} near the three formants, as shown in Figure 1.

[Figure 1: Vocal tract compensated spectrum [dB] over frequency f [Barks], showing \bar{H}_1, \bar{H}_2 at F_p, 2F_p and \bar{A}_{1p}, \bar{A}_{2p}, \bar{A}_{3p} at F_{1p}, F_{2p}, F_{3p}.]
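The compensation step can be sketched as follows, assuming a standard second-order resonance magnitude (normalised to 0 dB at zero frequency) for the formant contribution V_dB. The paper does not spell out the exact form of V_dB it uses, so this particular formula and the helper names are assumptions.

```python
import numpy as np

def formant_db(f, Fi, Bi):
    """Assumed spectral contribution in dB of one formant with center
    frequency Fi [Hz] and bandwidth Bi [Hz] at frequency f [Hz]:
    magnitude of a second-order resonance, normalised to 0 dB at f = 0."""
    num = Fi**2 + (Bi / 2.0)**2
    den = np.sqrt((f - Fi)**2 + (Bi / 2.0)**2) * np.sqrt((f + Fi)**2 + (Bi / 2.0)**2)
    return 20.0 * np.log10(num / den)

def compensate(Fp, H, Fpeaks, A, F, B):
    """Remove the vocal tract influence of the first four formants.
    H  = [H1, H2]        amplitudes [dB] at Fp and 2*Fp
    A  = [A1p, A2p, A3p] amplitudes [dB] at the peaks Fpeaks = [F1p, F2p, F3p]
    F, B = formant frequencies and bandwidths [Hz] (four of each)."""
    H_bar = [H[k] - sum(formant_db((k + 1) * Fp, F[i], B[i]) for i in range(4))
             for k in range(2)]
    A_bar = [A[k] - sum(formant_db(Fpeaks[k], F[i], B[i]) for i in range(4))
             for k in range(3)]
    return H_bar, A_bar
```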
The last step estimates the following five voice quality parameters from the vocal tract compensated speech features: Open Quotient Gradient (OQG), Glottal Opening Gradient (GOG), SKewness Gradient (SKG), Rate of Closure Gradient (RCG), and Incompleteness of Closure (IC). They are given by

  OQG = (\bar{H}_1 - \bar{H}_2) / F_p,
  GOG = (\bar{H}_1 - \bar{A}_{1p}) / (F_{1p} - F_p),
  SKG = (\bar{H}_1 - \bar{A}_{2p}) / (F_{2p} - F_p),
  RCG = (\bar{H}_1 - \bar{A}_{3p}) / (F_{3p} - F_p),
  IC  = B_1 / F_1.

Figure 2 gives an illustration of the first four parameters as spectral gradients with respect to the pitch frequency.

[Figure 2: Spectrum [dB] over frequency f [Barks], illustrating OQG, GOG, SKG, and RCG as spectral gradients from \bar{H}_1 at F_p towards \bar{H}_2 at 2F_p and \bar{A}_{1p}, \bar{A}_{2p}, \bar{A}_{3p} at F_{1p}, F_{2p}, F_{3p}.]
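With the compensated amplitudes available, the five parameters follow directly from the definitions above. The sketch reuses the hypothetical compensate() helper from the previous snippet and assumes all frequencies are expressed in consistent units (the figures use Barks).

```python
def voice_quality_parameters(Fp, H_bar, A_bar, Fpeaks, F1, B1):
    """Five voice quality parameters from the compensated amplitudes."""
    H1c, H2c = H_bar
    A1c, A2c, A3c = A_bar
    F1p, F2p, F3p = Fpeaks
    OQG = (H1c - H2c) / Fp              # open quotient gradient
    GOG = (H1c - A1c) / (F1p - Fp)      # glottal opening gradient
    SKG = (H1c - A2c) / (F2p - Fp)      # skewness gradient
    RCG = (H1c - A3c) / (F3p - Fp)      # rate of closure gradient
    IC  = B1 / F1                       # incompleteness of closure
    return OQG, GOG, SKG, RCG, IC

# Example of chaining the two hypothetical helpers for one voiced frame:
# H_bar, A_bar = compensate(Fp, [H1, H2], Fpeaks, A, F, B)
# oqg, gog, skg, rcg, ic = voice_quality_parameters(Fp, H_bar, A_bar, Fpeaks, F[0], B[0])
```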
3. EXPERIMENTS
In this paper, we apply the voice quality parameters to classify different speaking groups in three studies: classification of gender, of voice qualities, and of emotions. The speech data for all studies were recorded in an anechoic room at a sampling frequency of 16 kHz. All signals were segmented into utterances of about 3 seconds in length. The speech was classified on the basis of every spoken utterance.
3.1. Classification
There are many different approaches to gender and emotion detection in the literature. A method for gender detection is described in [7]. A good overview of previous work on emotion detection is given in [8]. Most of the proposed methods use prosodic properties of speech such as intonation, intensity, and duration as features for the classification. They do not consider voice quality parameters as features because these are considered difficult to model and to estimate. Similarly, there exist many different methods for pattern recognition. For emotion detection, the Bayes classifier, Gaussian mixture models, hidden Markov models, and artificial neural networks have been used.
In this paper, we use only voice quality parameters for all classifications. For the pattern recognition part, we use linear discriminant analysis because it is simple and does not require a long training phase.
4. RESULTS

The classification of phonation type and emotion is not speaker independent yet. The classification was done for every spoken utterance by a majority decision: every voiced segment of an utterance is mapped to one class by the classifier, and the whole utterance is assigned to the class with the maximum number of single segment decisions. The confusion matrices for the three studies are shown in Tables 2-4. The first column contains the true speaking group and the first row shows the detected speaking group. The entries on the main diagonal show the rates of correct decisions; the off-diagonal entries are the rates of wrong decisions.
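As a sketch of this procedure, the snippet below trains a linear discriminant classifier on per-segment voice quality parameters and assigns each utterance the majority class of its segments. scikit-learn's LinearDiscriminantAnalysis stands in for the linear discriminant classifier; the variable names (X, y, utt_id) are illustrative.

```python
import numpy as np
from collections import Counter
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def classify_utterances(X_train, y_train, X_test, utt_id_test):
    """One class label per test utterance: LDA decisions per voiced
    segment, followed by a majority vote within each utterance."""
    clf = LinearDiscriminantAnalysis()
    clf.fit(X_train, y_train)                 # short training phase
    seg_pred = clf.predict(X_test)            # one decision per voiced segment
    utt_pred = {}
    for utt in np.unique(utt_id_test):
        votes = seg_pred[utt_id_test == utt]
        utt_pred[utt] = Counter(votes).most_common(1)[0][0]   # majority decision
    return utt_pred

def confusion_matrix_percent(true_by_utt, pred_by_utt, classes):
    """Row-normalised confusion matrix in percent (rows: true class,
    columns: detected class), as reported in Tables 2-4."""
    M = np.zeros((len(classes), len(classes)))
    for utt, true_cls in true_by_utt.items():
        M[classes.index(true_cls), classes.index(pred_by_utt[utt])] += 1
    return 100.0 * M / M.sum(axis=1, keepdims=True)
```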
4.1. Classification of gender
The speech samples for the first study were taken
from [10]. They consist of 10 utterances of male and 10 utterances of female speech, each of about 30 seconds duration. For the training phase, a random set of 5 male and 5 female speakers was considered. Based on these training data, a classification of all 20 speakers was performed.
Table 2 shows the gender classification under white background noise for different values of the global signal to noise ratio (SNR). The gender classification shows quite good results: only 7.4% of the male and 4.8% of the female utterances are wrongly classified in the noiseless case. For decreasing SNR, there is a moderate and smooth performance degradation, where the classifier tends to favour the female decision. The reason is that the female voice has a stronger breathy portion than the male voice [11], and hence noisy speech tends to be classified as female rather than male.
Table 2: Confusion matrix for the classification of gender under white background noise (rows: true gender, columns: detected gender).

  SNR        gender    male     female
  noiseless  male      92.6%    7.4%
  noiseless  female    4.8%     95.2%
  30 dB      male      82.7%    17.3%
  30 dB      female    9.3%     90.7%
  15 dB      male      81.5%    18.5%
  15 dB      female    11.3%    88.7%
  0 dB       male      58.5%    41.5%
  0 dB       female    5.6%     94.4%
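The noise conditions of Table 2 can be simulated by adding white noise at a prescribed global SNR to the clean utterances. A minimal sketch, assuming Gaussian white noise (the paper does not describe its exact noise generation) and an illustrative function name:

```python
import numpy as np

def add_white_noise(speech, snr_db, rng=None):
    """Add white noise at a given global signal-to-noise ratio in dB.
    Gaussian noise is assumed here; signal and noise power are measured
    over the whole utterance, hence a 'global' SNR."""
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=float)
    p_signal = np.mean(speech ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=speech.shape)
    return speech + noise

# Example: degrade a clean 16 kHz utterance to the 15 dB condition of Table 2.
# noisy = add_white_noise(clean_utterance, snr_db=15.0)
```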
4.2. Classification of voice qualities

In this study, only the phonation types rough voice, creaky voice, and modal voice were considered.
Table 3 shows the results of the phonation type
detection. The classification for modal voice shows
a very good result with a detection rate of 95.6%.
Rough voice is correctly detected in nearly 3 out of
4 utterances. Creaky voice is correctly detected in 67.5% of the cases; it is mainly confused with modal voice.
Table 3: Confusion matrix for the classification of phonation types (rows: true phonation type, columns: detected phonation type).

  voice quality   modal    rough    creaky
  modal voice     95.6%    0.0%     4.4%
  rough voice     12.7%    73.3%    14.0%
  creaky voice    28.8%    3.7%     67.5%
4.3. Classification of emotions

Table 4: Confusion matrix for the classification of emotions (rows: true emotion, columns: detected emotion).

  emotion   angry    sad      happy    neutral
  angry     93.7%    0.0%     5.6%     0.8%
  sad       0.0%     84.7%    1.7%     13.6%
  happy     33.8%    0.0%     57.7%    8.4%
  neutral   0.0%     2.5%     6.3%     91.1%
5. CONCLUSION
We introduced voice quality parameters to describe the glottal excitation and presented an algorithm to estimate them from the speech signal. The classification of gender, phonation type, and emotion by using voice quality parameters shows promising results. The next step is to achieve a speaker independent classification for phonation type and emotion. One idea is to use additional prosodic features like pitch, intensity, and duration. A second approach is to improve the classifier.
6. REFERENCES
[1] J. Laver, The Phonetic Description of Voice Quality, Cambridge University Press, 1980.
[2] K. Stevens and H. Hanson, Classification of glottal vibration from acoustic measurements, Vocal Fold Physiology, pp. 147-170, 1994.
[3] M. Putzer and W. Wokurek, Multiparametrische Stimmprofildifferenzierung zu männlichen und weiblichen Normalstimmen auf der Grundlage akustischer Signale, Laryngo-Rhino-Otologie, Thieme, 2006.
[4] D. Talkin, W. Kleijn, and K. Paliwal, A robust algorithm for pitch tracking (RAPT), Speech Coding and Synthesis, Elsevier, pp. 495-518, 1995.
[5] D. Talkin, Speech formant trajectory estimation using dynamic programming with modulated transition costs, Technical Report, Bell Labs, 1987.
[6] G. Fant, Acoustic Theory of Speech Production, The Hague: Mouton, 1960.
[7] H. Harb and L. Chen, Gender identification using a general audio classifier, Proc. IEEE International Conference on Multimedia and Expo, pp. 733-736, 2003.
[8] R. Cowie and E. Douglas-Cowie, Emotion recognition in human-computer interaction, IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 32-80, 2001.
[9] R. O. Duda, P. Hart, and D. G. Stork, Pattern Classification, Wiley, 2001.
[10] M. Putzer and J. Koreman, A German database of patterns of pathological vocal fold vibration, PHONUS 3, Universität des Saarlandes, pp. 143-153, 1997.
[11] H. Hanson and E. Chuang, Glottal characteristics of male speakers: Acoustic correlates and comparison with female data, Journal of the Acoustical Society of America, vol. 106, pp. 1064-1077, 1999.
[12] J. Laver and H. Eckert, Menschen und ihre Stimmen - Aspekte der vokalen Kommunikation, Beltz, 1994.
[13] F. Burkhardt, A. Paeschke, M. Rolfes, W. Sendlmeier, and B. Weiss, A database of German emotional speech, Proceedings of Interspeech, 2005.
[14] A. Paeschke, Prosodische Analyse emotionaler Sprechweise, Ph.D. thesis, TU Berlin, 2003.
[15] M. Lugger, B. Yang, and W. Wokurek, Robust estimation of voice quality parameters under real world disturbances, Proc. IEEE ICASSP, 2006.