
AUDIO-BASED CLASSIFICATION OF SPEAKER CHARACTERISTICS

Promiti Dutta and Alexander Haubold

Columbia University, New York, NY


{pd2049,ah297}@columbia.edu
ABSTRACT

The human voice is primarily a carrier of speech, but it also contains non-linguistic features unique to a speaker and indicative of various speaker demographics, e.g. gender, nativity, ethnicity. Such characteristics are helpful cues for audio/video search and retrieval. In this paper, we evaluate the effects of various low-, mid-, and high-level features for effective classification of speaker characteristics. Low-level signal-based features include MFCCs, LPCs, and six spectral features; mid-level statistical features model low-level features; and high-level semantic features are based on selected phonemes in addition to mid-level features. Our data set consists of approximately 76.4 hours of annotated audio with 2786 unique speaker segments used for classification. Quantitative evaluation of our method results in accuracy rates up to 98.6% on our test data for male/female classification using mid-level features and a linear kernel support vector machine. We determine that mid- and high-level features are optimal for identification of speaker characteristics.

Index Terms— audio signal processing, feature extraction, MFCC, LPC, classification, gender, ethnicity

Figure 1: Overview diagram of processing steps in our approach. Video Database (segmented speakers) → Speaker Class Annotation (VAST MM) → Silence Filter → Low-, Mid-, High-Level Feature Extraction → WEKA File Formatting → Weka: SMO algorithm (50% train, 50% test) → Recursive SMO with Ranker: pick top-k → Results.

1. INTRODUCTION

Searching through vast amounts of spoken audio collections is an arduous task without the availability of search cues. Audio transcripts generated by Automatic Speech Recognition (ASR) systems provide good content search cues, albeit with imperfect coverage and varying accuracy, especially for salient key terms [1,2]. Search for content can be improved significantly by re-ranking or filtering speech segments by known speaker characteristics.

In this paper we identify and evaluate classifiers for three characteristics: gender, nativity (native vs. non-native English), and ethnicity (African-American, Asian, Caucasian, Hispanic, South-east Asian). Using a large dataset of 2786 manually annotated speech segments from student presentation videos, we evaluate and train various low-, mid-, and high-level feature classifiers on the detection of voice characteristics (Figure 1). Through experimentation, we observe that low-level features are significantly less effective in determining characteristics than mid-level features. For gender classification, we achieve an accuracy of 67.3% using low-level signal-based features. Tzanetakis et al. report similar results (76%) on TRECVid 2003 data; however, their low-level features are computed on 20 ms windows, while we use 10 ms windows [3]. Our results for mid-level statistical features show significant improvement, leading to an overall accuracy of 90.1%-98.6% over varying speech window sizes.

2. DATA SET

Our dataset includes student final project presentation videos from a large university-level engineering design course with more than 150 students per semester. Each presentation team is comprised of 5-6 students who take turns presenting their team's project during a midterm and a final period in the semester. Our video data spans 5 years.

We perform data annotation to establish ground truth using the VAST MM (Video Audio Structure Text Multimedia) system (Figure 2) [4]. The VAST MM browser displays audio and visual cues, which are useful for distinguishing speaker segments. In an indexing step, the VAST MM indexing tool performs several content analysis processes, including automatic speaker segmentation based on Mel Frequency Cepstral Coefficient (MFCC) features and the Bayesian Information Criterion (BIC) [5]. Using the tool, we listen to and view short video clips from each speaker segment to correctly annotate each with appropriate classifications. Each speaker segment is classified according to gender, ethnicity, and familiarity of spoken English.

Table 1: Summary of classification for data set.

Group            Class              # of Segments   Time (hr)
Ethnicity        African-American             101        2.36
                 Asian                        776       20.15
                 Caucasian                   1233       33.34
                 Hispanic                      80        2.00
                 South-east Asian             295        7.74
Gender           Male                        1865       51.86
                 Female                       692       16.23
Spoken English   Native                      2197       58.81
                 Non-native                   327        8.53

Figure 2: VAST MM browser used to annotate speaker segments. Visual cues (key frames and streaming video) and the audio signal are displayed in the user interface for ease of annotation.

Table 1 summarizes the sample sizes of the annotated data set: we have annotated over 76.4 hours of audio with 2786 unique speaker segments. Each audio speaker segment is extracted from the original video for further analysis.

In a preprocessing step, audio speaker segments are filtered for silence. This step is crucial for removing a signal which would otherwise act as a similarity between speaker segments from different classes. Because the original video recordings were made with wired and wireless analog microphones, silent pauses in the audio track are practically low-amplitude noise. Their numerical representation as MFCC features is substantially different from actual speech: the zeroth MFCC feature, commonly referred to as a representation of signal amplitude, deviates most; higher-order MFCCs also reflect a significant difference due to the high frequency inherent to noise. We apply a simple heuristic which computes the absolute maximum amplitude A for a given speaker segment and filters out any short fixed audio sample window (256 samples) that does not pass a threshold measured as an empirically determined fraction of A.
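A minimal NumPy sketch of this silence heuristic follows; the threshold fraction of 0.05 is an illustrative placeholder, since the paper determines this value empirically.

```python
import numpy as np

def filter_silence(signal: np.ndarray, frac: float = 0.05, win: int = 256) -> np.ndarray:
    """Drop any 256-sample window whose peak amplitude falls below an
    empirically determined fraction of the segment's absolute maximum A."""
    A = np.max(np.abs(signal))        # absolute maximum amplitude A of the segment
    threshold = frac * A              # frac is a placeholder for the empirical fraction
    kept = [signal[i:i + win]
            for i in range(0, len(signal) - win + 1, win)
            if np.max(np.abs(signal[i:i + win])) >= threshold]
    return np.concatenate(kept) if kept else np.empty(0, dtype=signal.dtype)
```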
Key characteristics of the audio data include varying audio quality between student presentations. This is largely due to the different microphones that were used over the five years of course recordings. Also affecting audio quality is an individual speaker's use of the microphone, such as placement with respect to the speaker (hand-held vs. on-stand) and the presenter's activity (rigid pose vs. constant shifting). We notice a skew in the distribution between certain annotation classes. Specifically, in the engineering school we observe a 3:1 ratio of male to female students. Similarly, we find fewer speakers in some ethnic classes (African Americans and Hispanics) than in others (Asians, Caucasians, and South-east Asians). To avoid a bias due to unequal sample sizes, we down-sample the data set to comparable class sizes for classification.

3. FEATURE EXTRACTION

We extract low-, mid-, and high-level features from each audio speaker segment for varying time intervals. Low-level features are signal-based; mid-level features are statistical aggregates of low-level features; and high-level features include phonemes in addition to mid-level features.

3.1. Low-level: Signal Level

Low-level features include 13 MFCCs, 13 Linear Predictive Coefficients (LPCs), and 6 distinct spectral features, for a total of 32 distinct features from each 256-sample window (~0.01 sec in a 22 kHz sampled signal). The 13 MFCCs are a representation of the short-term power spectrum of a sound. LPCs analyze the speech signal by estimating formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The six spectral features are energy entropy, short-time energy, zero-crossing rate, spectral roll-off, spectral centroid, and spectral flux [6].
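For illustration, the sketch below computes a comparable per-window feature set in Python with librosa; the paper's own extraction uses Matlab code from [6], so librosa is a stand-in, and energy entropy and spectral flux are omitted for brevity (the row width therefore differs from the paper's 32).

```python
import numpy as np
import librosa

def low_level_features(y: np.ndarray, sr: int = 22050, win: int = 256) -> np.ndarray:
    """One feature row per non-overlapping 256-sample window (~0.01 s at 22 kHz):
    13 MFCCs, 13 LPCs, and a subset of the spectral features.
    Assumes y is a mono float signal with silence already filtered out."""
    n = len(y) // win
    y = y[:n * win]
    # 13 MFCCs per window; center=False aligns frames with our windows
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=win, hop_length=win, center=False)
    frames = y.reshape(n, win)
    # 13 LPCs per window; librosa.lpc returns order+1 coeffs, the first is always 1
    lpc = np.array([librosa.lpc(f, order=13)[1:] for f in frames])
    # spectral features: zero-crossing rate, centroid, roll-off, short-time energy
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=win, hop_length=win, center=False)
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=win, hop_length=win, center=False)
    roll = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=win, hop_length=win, center=False)
    energy = (frames ** 2).mean(axis=1, keepdims=True)
    return np.hstack([mfcc.T, lpc, zcr.T, cent.T, roll.T, energy])
```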
3.2. Mid-level: Statistical Aggregates from Signal Level

Mid-level features are statistical aggregates of the aforementioned 32 low-level features over longer samples. The low-level features underlie a Gaussian distribution with mean μ and variance σ². We model the aggregate of low-level MFCC and LPC features by their mean and covariance. The covariance matrices for MFCC and LPC are symmetric; we only use the covariance values from the upper triangular matrix and the diagonal, for a total of 91 values each for MFCCs and LPCs. We include 13 MFCC means, 13 LPC means, and respective statistical measures for the 6 spectral features [6]. The complete feature vector for mid-level features contains 214 features.
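A sketch of assembling this 214-dimensional vector (13 + 91 MFCC mean/covariance values, 13 + 91 LPC values, and 6 spectral statistics; the per-feature spectral statistic is assumed here to be the mean):

```python
import numpy as np

def mid_level_vector(mfcc: np.ndarray, lpc: np.ndarray, spectral: np.ndarray) -> np.ndarray:
    """Aggregate low-level features over a longer sample.
    mfcc, lpc: arrays of shape (n_windows, 13); spectral: (n_windows, 6)."""
    iu = np.triu_indices(13)                   # upper triangle incl. diagonal: 91 entries
    mfcc_cov = np.cov(mfcc, rowvar=False)[iu]  # 91 MFCC covariance values
    lpc_cov = np.cov(lpc, rowvar=False)[iu]    # 91 LPC covariance values
    return np.concatenate([
        mfcc.mean(axis=0), mfcc_cov,           # 13 + 91
        lpc.mean(axis=0), lpc_cov,             # 13 + 91
        spectral.mean(axis=0),                 # 6 (one statistic per spectral feature)
    ])                                         # total: 214 features
```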
3.3. High-level: Semantic Level

High-level feature vectors are derived from mid-level features. We include 12 additional features derived from phonemes, for a total of 226 features (91 MFCC covariances, 13 MFCC means, 91 LPC covariances, 13 LPC means, 6 spectral features, and 12 phonemes). We apply phoneme extraction to generate a frequency list of occurring phonemes in the audio signal. We apply our approach [7] to identify a selection of monophthongs, diphthongs, and fricatives (/AA/, /AE/, /AH/, /AO/, /EH/, /ER/, /IH/, /IY/, /S/, /SH/, /UH/, /UW/). This heuristic method models the vocal tract using an autoregressive model of the speech signal, in which the peaks of the frequency response correspond to resonant frequencies of the vocal tract (formants). The closest matching phoneme is determined by the Euclidean distance of a weighted difference between model and computed values, using a table of expected frequency values for formants F1, F2, and F3.
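A minimal sketch of this formant-matching step, assuming an order-13 LPC (autoregressive) model; the formant table entries and the weighting are illustrative placeholders, as the actual values follow the approach in [7].

```python
import numpy as np
import librosa

# Illustrative expected formant frequencies (F1, F2, F3) in Hz; the actual
# table and weighting follow the phoneme model of [7].
PHONEME_FORMANTS = {
    "/AA/": (730, 1090, 2440),
    "/IY/": (270, 2290, 3010),
    "/UW/": (300, 870, 2240),
    # ... entries for the remaining monophthongs, diphthongs, and fricatives
}
WEIGHTS = np.array([1.0, 1.0, 1.0])  # placeholder weighting of F1-F3 differences

def formants(frame: np.ndarray, sr: int = 22050, order: int = 13) -> np.ndarray:
    """Estimate F1-F3 as resonant peaks of the LPC frequency response."""
    a = librosa.lpc(frame, order=order)
    roots = [r for r in np.roots(a) if r.imag > 0]        # one root per conjugate pair
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return np.array(freqs[:3])                            # lowest three resonances

def closest_phoneme(frame: np.ndarray, sr: int = 22050) -> str:
    """Nearest phoneme by Euclidean distance of the weighted F1-F3 difference."""
    f = formants(frame, sr)
    return min(PHONEME_FORMANTS,
               key=lambda p: np.linalg.norm(WEIGHTS * (f - np.array(PHONEME_FORMANTS[p]))))
```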

Figure 3 (left): Male/female classification accuracy (%) over sample time (sec). The 256-sample window (~0.01 sec) using low-level features is our baseline accuracy (67.3%). Mid-level features (1 sec - 40 sec) exhibit significantly increasing classification accuracy. The same trend is apparent after applying classification with the top 5 and top 10 features selected by recursive feature selection.

Figure 4 (right): Distribution of non-overlapping sample sizes (number of samples per class) for male/female speaker segments at different sampling time durations.

4. CLASSIFICATION AND FEATURE SELECTION

Classification is performed using the Sequential Minimal Optimization (SMO) algorithm in Weka [8]. SMO is a computationally simpler method of solving the support vector machine (SVM) quadratic programming (QP) optimization problem without extra matrix storage and without numerical QP optimization steps. We use a linear kernel for the SMO unless otherwise noted. The output equation for a linear SVM (Equation 1) defines w as the normal vector to the hyperplane, x as the input vector, and u as the separating hyperplane. The linear kernel identifies the optimal separating hyperplane between the distributions by maximizing the margin m (Equation 2) using training examples. Prediction is performed on the test set. To avoid classification biases, cross-validation is obtained for the experiments using a 50% split for training and test sets.

    u = w · x − b        (Equation 1)

    m = 1 / ||w||₂       (Equation 2)
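For reference, a minimal scikit-learn sketch of this protocol; SVC is a stand-in for Weka's SMO (both solve the same SVM QP problem), and the scaler approximates Weka's default input normalization.

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classify(X, y, seed: int = 0):
    """Linear-kernel SVM with a 50% train / 50% test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)  # trained model and test-set accuracy
```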
Additionally, we perform feature selection using the "SVMAttributeEval" method in Weka. "SVMAttributeEval" evaluates the weight of an attribute using a linear SVM. The "Ranker" search method then ranks each feature by the square of the weight assigned to it by the SVM. The selected features are used for classification to test the overall prediction of the dataset.
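The same ranking can be sketched with a linear SVM's learned weights; LinearSVC here is a scikit-learn analogue of SVMAttributeEval plus Ranker, not the paper's tooling.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_features(X, y, k: int = 10) -> np.ndarray:
    """Rank features by the squared weight a linear SVM assigns to each
    attribute and return the indices of the top-k features."""
    w = LinearSVC(dual=False).fit(X, y).coef_.ravel()  # binary case: one weight vector
    return np.argsort(w ** 2)[::-1][:k]                # largest squared weight first
```

Classification would then be repeated using only the reduced feature set, e.g. X[:, rank_features(X, y, k=5)].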
5. EXPERIMENTS AND RESULTS

5.1. Low-level: Signal Level

We apply low-level features to non-overlapping 256-sample windows (~0.01 sec) for gender classification (male/female). Male samples are down-sampled to adjust for any classification bias due to mismatched sample sizes between the two classes. In total, we have 105,106 male speech samples compared to 94,555 female speech samples. The linear kernel SMO achieves 67.3% classification accuracy (Figure 3), consistent with the accuracy of 76% on 0.02 sec sample windows reported in [3].

5.2. Mid-level: Statistical Aggregates from Signal Level

5.2.1. Varying Sampling Times

We extract mid-level features for several non-overlapping sampling intervals: 1, 2, 3, 5, 10, 20, 30, and 40 seconds (Figure 4). Monologues longer than 40 seconds by a given speaker are rare in our dataset. We down-sample the male samples to create an equal distribution between male and female samples for classification.

We perform classification using all 214 features to determine the efficacy of mid-level features at varying sampling intervals. Classification accuracies range between 90.1% and 98.6% (Figure 3), where accuracy is logarithmically related to sample time. A 10-second time interval provides a reasonable baseline for analysis of high-level features.

Figure 5: Additive histograms of male samples shown in blue, female in red. (top) MFCC covariance 2_7. (bottom) MFCC covariance 3_8.

5.2.2. Feature Selection

Each feature vector contains 214 features for mid-level analysis. The use of excessive features can result in over-fitting. To determine whether the data was over-fitted, we perform feature selection to identify the 5 and 10 most significant features for our classification at each sampling interval (1, 2, 3, 5, 10, 20, 30, and 40 seconds). We determine that using fewer features provides comparable classification accuracy and is less computationally expensive (Figure 3).

For each of the sampling intervals, we obtain a distinct group of top 5 and top 10 significant features. We note that certain features are common to every sampling interval. Specifically, two MFCC covariances rank as the top two features for all classification performed with mid-level features. For each of these two MFCC covariances, the additive histogram contains two very distinct Gaussians for males and females with very different means (Figure 5).

5.3. High-level: Semantic Level

5.3.1. Spoken English Experiment

In the spoken English experiment, we classify native English speakers versus non-native English-accented speakers. The sample size is 2700 male and 2700 female feature segments. We obtain a 73.5% classification accuracy.

It is possible that the classifier is confounded by demographic data. We perform additional experiments in which we create sub-groups for classification, i.e. African American native English speakers versus African American non-native English speakers. Classification accuracy for these smaller groups rises to 80% and greater for each of the 5 smaller subgroups created.

5.3.2. Demographics Experiment

The demographics experiment is a multi-class classification task amongst five groups: African Americans, Asians, Caucasians, Hispanics, and South-east Asians. We sample each group to contain 600 samples of non-overlapping 10-second sampling windows. We obtain 48.5% classification accuracy using an empirically determined 5th-degree polynomial kernel. A linear kernel did not provide effective classification accuracies.

The demographic classification may be confounded by the inclusion of both native and non-native English speakers in the respective groups. We remove this bias by creating groups based on native and non-native speakers and their respective demographic classes. This increases classification accuracy to approximately 64.5% for each class. The classification confusion matrix indicates similarity between the Asian and South-east Asian groups, suggesting that better accuracy may be obtained by combining these two groups. A similar association is observed with the African American and Hispanic groups. We note that these results are significant compared to the probabilistic 20% accuracy achieved by random guessing.
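A sketch of this experiment under the same scikit-learn stand-in as above; the 5th-degree polynomial kernel follows the paper, while the scaler and split parameters are assumptions.

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def demographics_experiment(X, y, seed: int = 0):
    """Five-class classification with a 5th-degree polynomial kernel.
    The confusion matrix shows which demographic groups are conflated."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), SVC(kernel="poly", degree=5))
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te), confusion_matrix(y_te, clf.predict(X_te))
```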

6. CONCLUSIONS

This paper presents a survey of the different levels of features that can be applied to the classification of speech. We demonstrate that low-level features perform poorly for classification, since the audio sampling is too short and therefore not representative of characteristic traits for the classification classes. We show that mid- and high-level features perform significantly better, because higher-order features more closely correspond to human perception of auditory characteristics. A human can identify characteristics best with an ample amount of information (longer speech segments) rather than short samples of speech. The main disadvantage of these types of features is the requirement of longer audio segments. However, given the domain of presentation and lecture videos, such audio segments are readily available and thus applicable to effective audio search methods for large multimedia collections. We propose further investigation into the high-level feature domain through the exploration of additional phonemes as well as semantic (vocabulary) usage.

7. REFERENCES

[1] B. Matthews, U. Chaudhari, and B. Ramabhadran, "Fast Audio Search Using Space Modeling," ASRU '07, Kyoto, Japan, 2007.
[2] C. González-Ferreras and V. Cardeñoso-Payo, "A System for Speech Driven Information Retrieval," ASRU '07, Kyoto, Japan, 2007.
[3] G. Tzanetakis and M.-Y. Chen, "Building Audio Classifiers for Broadcast News Retrieval," WIAMIS '04, Lisbon, Portugal, 2004.
[4] A. Haubold and J.R. Kender, "VAST MM: Multimedia Browser for Presentation Video," CIVR '07, Amsterdam, The Netherlands, 2007.
[5] A. Haubold and J.R. Kender, "Augmented Segmentation and Visualization for Presentation Videos," MM '05, Singapore, 2005.
[6] T. Giannakopoulos, "Some Basic Audio Features," Matlab File Exchange, March 16, 2008.
[7] A. Haubold and J.R. Kender, "Alignment of Speech to Highly Imperfect Text Transcriptions," ICME '07, 2007.
[8] I.H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

