Acoustic Parameters For Speaker Verification
The acoustic speech wave generated by humans can be converted into an analog
signal using a microphone. An antialiasing filter is then used to condition this signal,
and additional filtering compensates for channel impairments. The antialiasing filter
band-limits the speech signal to approximately the Nyquist frequency (half the
sampling rate) before sampling. The conditioned analog signal is then sampled by an
analog-to-digital (A/D) converter to obtain a digital signal. The A/D converters in use
today for speech applications typically have a resolution of 12 to 16 bits at 8,000 to
20,000 samples per second. Oversampling of the analog speech signal is used to allow
a simple antialiasing filter and precise control of the fidelity of the sampled signal.
g) Harmonic Features
Harmonic features result from the harmonic decomposition of a high-resolution
spectral line estimate of the speech signal. The line spectral pairs represent the variations
in the glottis and the vocal tract of a speaker, transformed into the frequency domain. The
harmonic feature vector contains the fundamental frequency followed by the amplitudes
of several harmonic components. These features can be produced only on voiced
segments of speech, and long vowels and nasals were found to be the most speaker specific.
3. Similarity measures
The features of the speech signal take the form of an N-dimensional feature vector.
For a signal divided into M segments, M vectors are determined, producing an M x N
feature matrix. The M x N matrix is created by extracting features from the utterances of
the speaker for selected words or sentences during the training phase. After extraction of
the feature vectors from the speech signal, template matching is carried out for speaker
recognition. This process can be either manual (visual comparison of spectrograms) or
automatic. In automatic template matching, speaker models are constructed from the
extracted features. Thereafter a speaker is authenticated by comparing the incoming
speech signal with the stored model of the claimed user. Speaker models are of two
types: template models and stochastic models.
i. Template Models
The simplest template model has a single template x, which is the model for a speech
segment. The match score between the template x for the claimed speaker and an input
feature vector y from an unknown user is given by d(x, y). The model for the claimed
speaker could be the centroid (mean) of a set of N vectors obtained in the training phase.
The various distance measures between the vectors x and y can be written as

d(x, y) = (x - y)^T W (x - y)

where W is a weighting matrix. If W is the identity matrix, all elements of the vectors
are treated equally and the distance is Euclidean. If W is a positive-definite matrix that
allows a desired weighting of the template features, the distance is Mahalanobis.
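As a concrete illustration, the sketch below computes this weighted distance; the function name is illustrative, and NumPy is assumed. Passing the identity for W gives the Euclidean case, while an inverse covariance matrix gives the Mahalanobis case.

```python
import numpy as np

def weighted_distance(x, y, W=None):
    """d(x, y) = (x - y)^T W (x - y).
    W = identity -> (squared) Euclidean distance.
    W positive-definite (e.g. the inverse covariance of the
    training vectors) -> Mahalanobis distance."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    if W is None:
        W = np.eye(diff.size)  # Euclidean by default
    return diff @ W @ diff
```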
a) Dynamic Time Warping (DTW)
The time alignment of different utterances is a serious problem for distance measures,
and a small shift can lead to incorrect identification. Dynamic time warping is an efficient
method for solving this time alignment problem, and it is the most popular way of handling
speaking-rate variability in template-based systems. The asymmetric match score β for
comparing an input sequence y of M frames with the template sequence x is given as

β = Σ_{i=1}^{M} d(y(i), x(j(i)))

The template indices j(i) are given by the DTW algorithm, which performs a
piecewise-linear mapping of the time axis to align the two signals. This method takes into
account the variation over time in the parameters corresponding to the dynamic
configuration of the articulators and the vocal tract.
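A minimal dynamic-programming sketch of this alignment is given below; the function name and the simple three-way step pattern are common textbook choices assumed here, not taken from the source. Normalizing by the number of input frames is one common convention for the asymmetric score.

```python
import numpy as np

def dtw_match_score(x, y):
    """DTW between a template x (N x D) and an input y (M x D),
    both sequences of feature vectors. Returns the cumulative
    frame distance along the best alignment, averaged over the
    M input frames."""
    n, m = len(x), len(y)
    # Local Euclidean distance between every template/input frame pair
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # Cumulative cost matrix with a basic step pattern
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = d[i - 1, j - 1] + min(D[i - 1, j],      # stretch
                                            D[i, j - 1],      # shrink
                                            D[i - 1, j - 1])  # match
    return D[n, m] / m
```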
b) VQ Source Modeling
This is another form of template model, usually text-dependent, that uses multiple
frames of speech. This model makes use of a vector-quantized codebook, which is
generated for a speaker from his/her training data. Standard clustering procedures are
used to build the codebook. These procedures average the temporal information out of
the codebook, so the requirement of performing time alignment is eliminated. The
pattern match score is the distance between the input vector and the minimum-distance
codeword in the codebook.
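A minimal sketch of this idea, using SciPy's standard k-means clustering, follows; the function names and codebook size are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(train_feats, codebook_size=64):
    """Cluster a speaker's training feature vectors (rows of
    train_feats) into a VQ codebook; temporal order is discarded."""
    codebook, _ = kmeans(train_feats.astype(float), codebook_size)
    return codebook

def vq_match_score(codebook, test_feats):
    """Average distance from each input frame to its nearest
    codeword; a lower score means a better match."""
    _, dists = vq(test_feats.astype(float), codebook)
    return dists.mean()
```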
c) Nearest Neighbors
This method combines the strengths of the dynamic time warping and vector
quantization methods. It keeps all the data obtained from the training phase and does not
cluster the data into a codebook; it can therefore exploit temporal information that may
be present in the prompted phrase. The distances between the input frames and the
stored frames are used to compute an inter-frame distance matrix. The nearest-neighbor
distance is the minimum distance between an input frame and the stored frames, and the
nearest-neighbor distances for all input frames are averaged to arrive at the match score.
These match scores are then combined to form an approximation of a likelihood ratio.
This method is very memory-intensive but is one of the most powerful methods.
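The scoring step can be sketched as follows (NumPy assumed; the function name is illustrative):

```python
import numpy as np

def nearest_neighbor_score(input_frames, stored_frames):
    """Inter-frame distance matrix between the input frames and all
    stored training frames; the match score is the nearest-neighbor
    distance averaged over the input frames."""
    dist = np.linalg.norm(
        input_frames[:, None, :] - stored_frames[None, :, :], axis=2)
    return dist.min(axis=1).mean()
```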
m = 2595 log10(1 + f / 700)

where m is the mel-scale frequency and f is the linear frequency in Hz.
3) The log magnitude of the spectrum is taken at each of these mel frequencies.
4) The discrete cosine transform (DCT) is then performed.
5) The coefficients obtained from the resulting spectrum are the required MFCCs.
The temporal derivatives of the MFCC features are the Δ and ΔΔ features. The
MFCC features carry information about the static speech characteristics, whereas their
derivatives capture the dynamic attributes present in the speech.
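A compact single-frame sketch of the pipeline above is shown below; the triangular filter-bank construction and the parameter defaults are common choices assumed here, not taken from the source.

```python
import numpy as np
from scipy.fft import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frame, sr, n_filters=26, n_coeffs=13):
    """MFCCs for one windowed frame: power spectrum -> mel filter
    bank -> log -> DCT, keeping the first n_coeffs coefficients."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    # Triangular filters spaced uniformly on the mel scale
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                   hz_to_mel(sr / 2.0), n_filters + 2))
    energies = np.empty(n_filters)
    for i in range(n_filters):
        lo, ctr, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        up = np.clip((freqs - lo) / (ctr - lo), 0.0, 1.0)
        down = np.clip((hi - freqs) / (hi - ctr), 0.0, 1.0)
        energies[i] = np.sum(spectrum * np.minimum(up, down))
    # Steps 3-5: log magnitude at the mel frequencies, then DCT
    return dct(np.log(energies + 1e-10), norm='ortho')[:n_coeffs]
```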
Pitch: Pitch information provides a unique way of correlating the training
and testing utterances, because the rate at which the vocal folds vibrate
differs from speaker to speaker. Different pitch patterns are also used to
convey different meanings to the listener.
Duration: For a genuine client, the total duration of the reference speech may
differ from that of the test utterance [6], but there is always consistency in the
relative durations of words, syllables, or phrases spoken in the utterance.
Duration finds application in text-to-speech systems, speech understanding
systems, etc. Pitch and duration information are suprasegmental features
extracted from a speech signal.
Linear predictive coding: Linear predictive coding (LPC) predicts the
present sample value from a linear combination of past values, which
eliminates redundancy in the signal (a minimal sketch appears after this list).
These features are widely used for speech recognition, speech analysis and
synthesis, voice compression by phone companies, and secure wireless
systems where voice must be digitized, encrypted, and sent over a narrow
voice channel. The speech signal is analyzed by estimating the formants.
Applying cepstral analysis to the LPC features through a set of iterative
procedures yields the linear predictive cepstral coefficients (LPCC).
Perceptual linear predictive coefficients: Perceptual linear prediction
(PLP) discards information in the voice signal that is irrelevant to
recognition, in order to improve the speech recognition rate. It merges
several engineering approximations of human auditory processing. It is
similar to LPC, except that the spectral characteristics are transformed so
that they approximate the features obtained from the human hearing system.
In PLP, a nonuniform filter bank and the nonlinear mapping between sound
intensity and perceived loudness are used in the extraction of the LP
features.
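Returning to the LPC item above: the sketch below estimates LPC coefficients via the autocorrelation method (Levinson-Durbin recursion). The implementation details are a standard textbook formulation assumed here, not the source's own code.

```python
import numpy as np

def lpc(signal, order):
    """Levinson-Durbin recursion on the frame's autocorrelation.
    Returns predictor coefficients a[1..order] such that
    s_hat[n] = sum_k a[k] * s[n - k]."""
    n = len(signal)
    r = np.array([signal[:n - k] @ signal[k:] for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0                     # error-filter convention A(z)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + a[1:i] @ r[1:i][::-1]
        k = -acc / err             # reflection coefficient
        a[1:i + 1] += k * a[i - 1::-1]
        err *= 1.0 - k * k
    return -a[1:]                  # prediction coefficients
```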
c) Pattern Classification
Vector quantization: In the vector quantization (VQ) method, non-
overlapping clusters of feature vectors form the speaker models [13]. Here
the data are quantized in contiguous blocks called vectors, rather than as
single scalar values. The output obtained after quantization is a data block
drawn from a finite set of vectors, termed the codebook.
Dynamic time warping: Dynamic time warping (DTW) is an algorithm for
finding the minimum-distance path through a matrix, thereby reducing the
computation time.
Gaussian mixture model: A Gaussian mixture model (GMM) is a
parametric form of probability density function (pdf) for continuous
features in a biometric system. It models features such as the spectral
features of the vocal-tract system as a weighted sum of Gaussian component
densities; a minimal sketch follows this list.
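The sketch below trains and scores a GMM speaker model with scikit-learn; the number of mixture components and the variable names train_feats/test_feats are illustrative assumptions.

```python
from sklearn.mixture import GaussianMixture

# Fit a GMM speaker model on training feature vectors (e.g. MFCC
# frames, shape (n_frames, n_dims)); diagonal covariances are a
# common choice for vocal-tract spectral features.
gmm = GaussianMixture(n_components=16, covariance_type='diag')
gmm.fit(train_feats)

# Average per-frame log-likelihood of the test utterance under the
# claimed speaker's model; compare against a threshold to decide.
score = gmm.score(test_feats)
```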
d) Decision Making and Performance Measures
After classification, a decision is made based on a threshold value: if the score
exceeds the threshold the claim is accepted, otherwise it is rejected. Performance
measures of the system are expressed in terms of acceptance and rejection rates, as listed below:
False acceptance rate: The false acceptance rate (FAR) is defined as the ratio of
accepted impostor claims to the total number of impostor claims.
False rejection rate: The false rejection rate (FRR) is given by the ratio of
rejected genuine-client claims to the total number of genuine claims.
Equal error rate: The equal error rate (EER) is the point where the FAR and
FRR curves intersect; a low EER indicates better system performance.
Total success rate: The total success rate (TSR) is obtained by deducting the
EER from 100, i.e. TSR = 100 - EER.
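These measures can be computed from a set of trial scores as sketched below (NumPy assumed); sweeping the decision threshold over the observed scores is one common way to locate the FAR/FRR crossing.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and
    return the EER in percent, i.e. the point where the FAR and
    FRR curves cross."""
    thresholds = np.sort(np.concatenate([genuine_scores,
                                         impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))      # closest crossing point
    return 100.0 * (far[i] + frr[i]) / 2.0

# Total success rate as defined above:
# tsr = 100.0 - equal_error_rate(genuine_scores, impostor_scores)
```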