Table 1: Summary of classification for data set.

Category         Class              # of Segments   Time (hr)
Ethnicity        African-American   101             2.36
                 Asian              776             20.15
                 Caucasian          1233            33.34
                 Hispanic           80              2.00
                 South-east Asian   295             7.74
Gender           Male               1865            51.86
                 Female             692             16.23
Spoken English   Native             2197            58.81
                 Non-native         327             8.53
Figure 2: VAST MM browser used to annotate speaker segments. Visual cues (key frames and streaming video) and the audio signal are displayed in the user interface for ease of annotation.

Table 1 summarizes the sample sizes of the annotated data set: we have annotated over 76.4 hours of audio with 2786 unique speaker segments. Each audio speaker segment is extracted from the original video for further analysis.
In a preprocessing step, audio speaker segments are filtered for silence. This step is crucial for removing a signal that would otherwise act as a spurious similarity between speaker segments from different classes. Because the original video recordings were made with wired and wireless analog microphones, silent pauses in the audio track are effectively low-amplitude noise. Their numerical representation as MFCC features differs substantially from that of actual speech: the zeroth MFCC, commonly regarded as a representation of signal amplitude, deviates most, and higher-order MFCCs also reflect a significant difference due to the high-frequency content inherent in noise. We apply a simple heuristic that computes the absolute maximum amplitude A for a given speaker segment and filters out any short fixed audio sample window (256 samples) that does not pass a threshold set at an empirically determined fraction of A.
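A minimal sketch of this silence filter, assuming a mono NumPy array of samples; the threshold fraction below is an illustrative placeholder, since the paper determines it empirically and does not report its value:

```python
import numpy as np

def filter_silence(samples, window=256, fraction=0.05):
    """Drop fixed-size windows whose peak amplitude falls below an
    empirically chosen fraction of the segment's absolute maximum.

    `fraction` is an illustrative placeholder, not the paper's value.
    """
    A = np.max(np.abs(samples))            # absolute maximum amplitude of the segment
    threshold = fraction * A
    kept = []
    for start in range(0, len(samples) - window + 1, window):
        win = samples[start:start + window]
        if np.max(np.abs(win)) >= threshold:   # keep windows that pass the threshold
            kept.append(win)
    return np.concatenate(kept) if kept else np.empty(0, dtype=samples.dtype)
```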
Key characteristics of the audio data include varying audio quality between student presentations. This is largely due to the different microphones used over the five years of course recordings. Audio quality is also affected by an individual speaker's use of the microphone, such as its placement relative to the speaker (hand-held vs. on-stand) and the presenter's activity (rigid pose vs. constant shifting).
We notice a skew in the distribution between certain annotation classes. Specifically, in the engineering school we observe a 3:1 ratio of male to female students. Similarly, we find fewer speakers in some ethnic classes (African-American and Hispanic) than in others (Asian, Caucasian, and South-east Asian). To avoid a bias due to unequal sample sizes, we down-sample the data set to comparable class sizes for classification.

3. FEATURE EXTRACTION

We extract low-, mid-, and high-level features from each audio speaker segment for varying time intervals. Low-level features are signal-based; mid-level features are statistical aggregates of low-level features; and high-level features include phonemes in addition to the mid-level features.

3.1. Low-level: Signal-level

Low-level features include 13 MFCCs, 13 Linear Predictive Coefficients (LPCs), and 6 distinct spectral features, for a total of 32 distinct features from each 256-sample window (~0.01 sec in a 22 kHz sampled signal). The 13 MFCCs are a representation of the short-term power spectrum of a sound. LPCs analyze the speech signal by estimating formants, removing their effects from the speech signal, and estimating the intensity and frequency of the remaining buzz. The six spectral features are energy entropy block, short-time energy, zero-crossing rate, spectral roll-off, spectral centroid, and spectral flux [6].

3.2. Mid-level: Statistical Aggregates from Signal Level

Mid-level features are statistical aggregates of the aforementioned 32 low-level features computed over longer samples. The low-level features follow a Gaussian distribution with mean μ and variance σ². We model the aggregate of the low-level MFCC and LPC features by their mean and covariance. The covariance matrices for MFCCs and LPCs are symmetric, so we use only the covariance values from the upper triangular matrix including the diagonal, for a total of 91 values each for MFCCs and LPCs. We further include the 13 MFCC means, the 13 LPC means, and the respective statistical measures for the 6 spectral features [6]. The complete mid-level feature vector contains 214 features.
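As a sketch of this aggregation, the MFCC portion of the vector (13 means plus 91 upper-triangular covariance values) might be computed as below; librosa is an assumed stand-in for the authors' unspecified extraction tools, and the LPC and spectral portions would be aggregated analogously:

```python
import numpy as np
import librosa

def mfcc_aggregates(y, sr=22050, n_mfcc=13, window=256):
    """Mean and upper-triangular covariance of per-window MFCCs.

    librosa is an assumed stand-in for the paper's feature extractor.
    """
    # 13 MFCCs per non-overlapping 256-sample window (~0.01 s at 22 kHz)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=window, hop_length=window)
    mu = mfcc.mean(axis=1)                 # 13 means
    cov = np.cov(mfcc)                     # 13 x 13 covariance matrix
    iu = np.triu_indices(n_mfcc)           # upper triangle incl. diagonal
    return np.concatenate([mu, cov[iu]])   # 13 + 91 = 104 values
```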
[Figure 3 plot: Accuracy (%) vs. Sample Time (sec). Figure 4 plot: Number of Samples vs. Sample Time (sec).]
Figure 3 (left): Male/female classification accuracy for varying sampling time lengths. Classification on 256-sample windows (~0.01 sec) using low-level features is our baseline (67.3% accuracy). Mid-level features (1 sec – 40 sec) exhibit significantly increasing classification accuracy. The same trend is apparent after classification with the top 5 and 10 features selected by recursive feature selection.
Figure 4 (right): Distribution of non-overlapping sample sizes for male/female speaker segments at different sampling time durations.
3.3. High-level: Semantic Level

High-level feature vectors are derived from the mid-level features. We include 12 additional features derived from phonemes, for a total of 226 features (91 MFCC cov, 13 MFCC mean, 91 LPC cov, 13 LPC mean, 6 spectral features, and 12 phonemes). We apply phoneme extraction to generate a frequency list of phonemes occurring in the audio signal. We apply our approach [7] to identify a selection of monophthongs, diphthongs, and fricatives (/AA/, /AE/, /AH/, /AO/, /EH/, /ER/, /IH/, /IY/, /S/, /SH/, /UH/, /UW/). This heuristic method models the vocal tract using an autoregressive model of the speech signal in which the peaks of the frequency response correspond to the resonant frequencies of the vocal tract (formants). The closest matching phoneme is determined by the Euclidean distance of a weighted difference between model and computed values, using a table of expected frequency values for formants F1, F2, and F3.
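A sketch of this matching step under stated assumptions: formants are approximated from the angles of the LPC polynomial's roots (a common simplification that skips bandwidth filtering), the formant table holds illustrative textbook values rather than the paper's table [7], and the distance weights are placeholders:

```python
import numpy as np
import librosa

# Illustrative expected formant values in Hz (not the paper's table).
EXPECTED_FORMANTS = {
    "/IY/": (270, 2290, 3010),
    "/AA/": (730, 1090, 2440),
    "/UW/": (300,  870, 2240),
}

def estimate_formants(frame, sr=22050, order=12, n_formants=3):
    """Peaks of an autoregressive (LPC) model of the speech signal:
    roots of the LPC polynomial map to resonant frequencies."""
    a = librosa.lpc(frame, order=order)
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return freqs[:n_formants]

def closest_phoneme(frame, sr=22050, weights=(1.0, 1.0, 1.0)):
    """Nearest phoneme by weighted Euclidean distance over F1-F3."""
    f = estimate_formants(frame, sr)
    if len(f) < 3:                       # too few resonances found to compare
        return None
    f, w = np.array(f), np.array(weights)
    return min(EXPECTED_FORMANTS,
               key=lambda p: np.linalg.norm(w * (f - np.array(EXPECTED_FORMANTS[p]))))
```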
4. CLASSIFICATION AND FEATURE SELECTION

Classification is performed using the Sequential Minimal Optimization (SMO) algorithm in Weka [8]. SMO is a computationally simpler method for solving the support vector machine (SVM) quadratic programming (QP) optimization problem without extra matrix storage and without numerical QP optimization steps. We use a linear kernel for SMO unless otherwise noted. The output equation for a linear SVM (Equation 1) defines w as the normal vector to the hyperplane, x as the input vector, and u as the classifier output; the separating hyperplane corresponds to u = 0. The linear SVM identifies the optimal separating hyperplane between the distributions by maximizing the margin m (Equation 2) over the training examples. Prediction is performed on the test set. To avoid classification biases, validation is performed for all experiments using a 50% split between training and test sets.
$u = \vec{w} \cdot \vec{x} - b$    (Equation 1)

$m = 1 / \lVert \vec{w} \rVert_2$    (Equation 2)
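Equations 1 and 2 can be illustrated with scikit-learn's linear SVM as an assumed stand-in for Weka's SMO (the toy data below is not the paper's; note that scikit-learn's decision function uses u = w · x + b rather than u = w · x − b):

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class data standing in for the mid-level feature vectors.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.3, (50, 2)), rng.normal(1, 0.3, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear").fit(X, y)   # linear-kernel SVM, as in the paper
w, b = clf.coef_[0], clf.intercept_[0]

u = X @ w + b                          # Equation 1 (sign convention flipped)
m = 1.0 / np.linalg.norm(w)            # Equation 2: margin m = 1 / ||w||_2
print(f"margin m = {m:.3f}")
```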
Additionally, we perform feature selection using the "SVMAttributeEval" method in Weka, which evaluates the weight of each attribute using a linear SVM. The "Ranker" search method then ranks each feature by the square of the weight assigned to it by the SVM. The selected features are used for classification to test the overall prediction on the data set.
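A simplified sketch of this ranking scheme, using scikit-learn in place of Weka; this is a single-pass ranking on squared weights, not Weka's exact implementation:

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_features_by_svm_weight(X, y, top_k=10):
    """Rank features by the squared weight a linear SVM assigns them,
    mirroring the SVMAttributeEval/Ranker idea described above."""
    svm = LinearSVC(dual=False).fit(X, y)
    scores = svm.coef_[0] ** 2           # square of each feature's weight
    order = np.argsort(scores)[::-1]     # highest-scoring features first
    return order[:top_k]                 # indices of the top-k features
```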
5. EXPERIMENTS AND RESULTS

5.1. Low-level: Signal-level

We apply low-level features to non-overlapping 256-sample windows (~0.01 sec) for gender classification (male/female). Male sample sizes are down-sampled to adjust for any classification bias due to mismatched sample sizes between the two classes. In total, we have 105,106 male speech samples compared to 94,555 female speech samples. The linear-kernel SMO achieves 67.3% classification accuracy (Figure 3), consistent with the 76% accuracy on 0.02 sec sample windows reported in [4].

5.2. Mid-level: Statistical aggregates from signal level

5.2.1. Varying Sampling Times

We extract mid-level features for several non-overlapping sampling intervals: 1, 2, 3, 5, 10, 20, 30, and 40 seconds (Figure 4). Monologues longer than 40 seconds by a given speaker are rare in our dataset. We down-sample the male samples to create an equal distribution between male and female samples for classification.

We perform classification using all 214 features to determine the efficacy of mid-level features at varying sampling intervals. Classification accuracies range between 90.1% and 98.6% (Figure 3), where accuracy is logarithmically related to sample time. A 10-second time interval provides a reasonable baseline for the analysis of high-level features.
… polynomial kernel. A linear kernel did not provide effective classification accuracies.

The demographic classification may be confounded by the inclusion of both native and non-native English speakers in the respective groups. We remove this bias by creating groups based on native and non-native speakers and their respective demographic classes. This increases classification accuracy to approximately 64.5% for each class. The classification confusion matrix indicates similarity between the Asian and South-east Asian groups, suggesting that better accuracy may be obtained by combining these two groups. A similar association is observed between the African-American and Hispanic groups. We note that these results are significant compared to the probabilistic 20% accuracy achieved by random guessing.

Figure 5: Additive histogram of male samples shown in blue; female in red. (top) MFCC covariance 2_7. (bottom) MFCC covariance 3_8.