Acoustic Parameters For Speaker Verification

Speaker recognition methods can be divided into speaker identification and speaker verification. Speaker identification determines who is talking from a set of known voices, while speaker verification accepts or rejects a claimed speaker's identity. The usual approach to speaker recognition is based on classifying acoustic parameters derived from speech signals, such as short-time spectral analysis. Acoustic parameters contain both phonetic information related to text and individual information related to the speaker. Many systems are text-dependent, requiring a predefined utterance.

Unit - 5

Classification of Speaker Recognition Methods:


The problem of speaker recognition can be divided into two major sub-problems:
speaker identification and speaker verification.
Speaker identification can be thought of as the task of determining who is talking from a set of
known speakers' voices. It is the process of determining who has provided a given utterance, based
on the information contained in the speech waves. When the unknown voice must come from a fixed set of
known speakers, the task is referred to as closed-set identification.
Speaker verification, on the other hand, is the process of accepting or rejecting the identity claim
of a speaker. Since impostors (those who pose as valid users) are assumed to be unknown to
the system, this is referred to as an open-set task. Adding a "none of the above" option to the closed-set
identification task merges the two tasks, and the result is called open-set identification.
The error that can occur in speaker identification is the false identification of a speaker. The errors in
speaker verification can be classified into the following two categories: (1) false rejections, where a true
speaker is rejected as an impostor, and (2) false acceptances, where a false speaker is accepted as a true one.
The usual approach to speaker recognition is based on the classification of acoustic
parameters derived from the speech signal. Generally, the parameters are obtained via short-time
spectral analysis and contain both phonetic information, related to the uttered text, and individual
information, related to the speaker. Since the task of separating the phonetic information from the
individual information is not yet solved, many speaker recognition systems behave in a text-dependent way
(i.e., the user must utter a predefined sentence).
1. Acoustic Parameters for Speaker Recognition

The acoustic speech wave generated by a human speaker can be converted into an
analog signal using a microphone. An antialiasing filter is then used to condition this
signal, and additional filtering is used to compensate for channel impairments. The
antialiasing filter band-limits the speech signal to approximately the Nyquist rate (half the
sampling rate) before sampling. The conditioned analog signal is then sampled by an analog-to-digital
(A/D) converter to obtain a digital signal. The A/D converters in use today
for speech applications typically have a resolution of 12 to 16 bits at 8,000 to 20,000
samples per second. Oversampling of the analog speech signal is used because it allows a simple
antialiasing filter and precise control over the fidelity of the sampled speech signal.
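A minimal sketch of this digitization step, assuming an already-recorded waveform: the file name, target rate, and use of SciPy are illustrative, not taken from the text. decimate() applies an anti-aliasing low-pass filter before reducing the sampling rate, mirroring the oversample-then-filter idea described above.

```python
# Illustrative digitization front end: read a (mono) PCM waveform and reduce
# its sampling rate with an anti-aliasing low-pass filter.
import numpy as np
from scipy.io import wavfile
from scipy.signal import decimate

rate, samples = wavfile.read("utterance.wav")   # e.g. 16-bit PCM at 48 kHz (hypothetical file)
samples = samples.astype(np.float64) / 32768.0  # scale 16-bit integers to roughly [-1, 1)

factor = rate // 16000                          # target ~16 kHz for speech
if factor > 1:
    # decimate() low-pass filters the signal before downsampling, so content
    # above the new Nyquist rate (8 kHz here) is suppressed.
    samples = decimate(samples, factor)
    rate = rate // factor
```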


Basic Structure of Speaker Recognition System

Figure 3. Structure of speaker recognition system

Speaker recognition systems generally consist of three major units, as shown in
Figure 3. The input to the first stage, the front-end processing unit, is the speech signal.
Here the speech is digitized and the features are subsequently extracted. There are no
exclusive features that convey the speaker's identity in the speech signal; however, it is known
from the source-filter theory of speech production that the speech spectrum shape encodes
information about the speaker's vocal tract shape via the formants and about the glottal source via the pitch
harmonics. Therefore some form of spectrum-based feature is used in most
speaker recognition systems. The final process in the front-end stage is some
form of channel compensation. Different input devices (e.g. different telephone handsets)
impose different spectral characteristics on the speech signal, such as band limiting and
shaping, and channel compensation is applied to remove these unwanted effects.
Most commonly, some form of linear channel compensation, such as long- or short-term
cepstral mean subtraction, is applied to the features. The basic principle of spectral
subtraction is that the power spectrum of a speech signal corrupted by additive noise is equal to
the sum of the signal power spectrum and the noise power spectrum.
2. Feature Space for Speaker Recognition
The speech signal can be represented by a sequence of feature vectors so that
mathematical tools can be applied without loss of generality. Most of these features are
also used in speaker-dependent speech recognition systems. In practical real-life systems,
several of these features are used in combination. Some desirable properties for
feature sets are as follows:
 They should preserve or highlight the information and variation in the speech that
is relevant to the recognition task, and at the same time
minimize or eliminate any variation that is irrelevant to that task.
 The feature space should be relatively compact, to enable easier learning
of models from finite amounts of data.
 A feature representation that can be used without much adaptation in most
circumstances is preferable.
 The process of feature calculation should be computationally inexpensive.
Processing delay (i.e. how much of the 'future' of the signal must be known
before the features can be emitted) is a significant factor in some settings, such
as real-time recognition.

a) Frequency Band Analysis


Filter banks were initially used to gather information about the spectral structure of the
signal. A filter bank consists of a number of filters, each covering one group of
frequencies. The filter bandwidths can be chosen to be equal, logarithmic, or matched to
certain critical bands. The resolution offered by such a filter bank largely depends
on the number of filters used, which normally varies from 10 to 20, so this
technique gives only an approximate representation of the actual spectrum. The output of the
filter bank is sampled (usually at 100 Hz), and the samples indicate the amplitude
of the frequencies within each band. This output is then used as the feature vector
for speaker recognition.
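As an illustration of this idea, the following sketch (with hypothetical parameter values, and equal-width bands rather than critical bands) computes per-band log energies of one frame and returns them as a feature vector.

```python
# Split the short-time power spectrum of one frame into a small number of
# bands (10-20, as noted above) and use the per-band log energies as features.
import numpy as np

def filterbank_energies(frame, n_filters=16):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                    # power spectrum of the frame
    edges = np.linspace(0, len(spectrum), n_filters + 1, dtype=int)
    # Sum the power falling inside each band; log-compress for dynamic range.
    return np.log([spectrum[lo:hi].sum() + 1e-10
                   for lo, hi in zip(edges[:-1], edges[1:])])
```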
b) Formant Frequencies
Periodic excitation is seen in the spectrum of certain sounds, especially vowels. The
speech organs form certain shapes to produce the vowel sound, and therefore regions of
resonance and anti-resonance are formed in the vocal tract. The location of these resonances in
the frequency spectrum depends on the form and shape of the vocal tract. Since the physical
structure of the speech organs is a characteristic of each speaker, differences between
speakers can also be found in the position of their formant frequencies. The resonances
heavily affect the overall spectrum shape and are referred to as formants. A few of these
formant frequencies can be sampled at an appropriate rate and used for speaker recognition.
These features are normally used in combination with other features.
c) Pitch Contours
If the variation of the fundamental frequency (pitch) over the duration of an
utterance is followed, it provides a contour that can be used as a feature for speaker
recognition. The speech utterance is normalized and the contour is determined. Normalization
of the utterance is required because accurate time alignment of
utterances is crucial; otherwise, utterances from the same speaker could be interpreted as coming from
two different speakers. The contour is divided into a set of segments and the measured pitch
values are averaged over each segment. The vector containing the average pitch values
of all segments is then used as a feature for speaker recognition.
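A minimal sketch of such a pitch-contour feature, with illustrative parameter values: each frame's fundamental frequency is estimated by autocorrelation, and the resulting contour is averaged over a fixed number of segments.

```python
# Per-frame pitch estimation by autocorrelation, followed by segment averaging.
import numpy as np

def frame_pitch(frame, sr, fmin=60.0, fmax=400.0):
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)                        # admissible lag range
    lag = lo + np.argmax(ac[lo:hi])                                # strongest periodicity
    return sr / lag

def pitch_contour_feature(frames, sr, n_segments=10):
    contour = np.array([frame_pitch(f, sr) for f in frames])
    segments = np.array_split(contour, n_segments)
    return np.array([seg.mean() for seg in segments])              # average pitch per segment
```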
d) Coarticulation
Coarticulation is a phenomenon where a feature of a phonemic unit is achieved in the
articulators well in advance of the time it is needed for that phonemic unit. Variation of the
physical form of the speech organs causes the variation in the sounds that they produce. The
process of coarticulation, in which the speech organs prepare to produce a new sound while
transitioning from one sound to another, is characteristic of a speaker. This is due to the
following factors: the construction and shape of the vocal tract, and the motor ability
of the speaker to produce sequences of speech. Therefore, for speaker recognition using
this feature, the points in the speech signal where coarticulation takes place are
spectrographically analysed.
e) Features derived from Short term processing
The following short-term processing features of speech can be applied:
short-term autocorrelation, the average magnitude difference function, the zero-crossing measure,
short-term power and energy measures, and short-term Fourier analysis. The short-term
processing techniques provide signals of the general form

Q(n) = Σ T[s(m)] · w(n − m), summed over m,

where T[s(m)] is a transformation applied to the speech signal and w(n) is a window.
This sum of transformed samples weighted by the window represents some property of the
signal averaged over the window duration.
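A minimal sketch of this general short-term form for two common choices of the transformation T: squaring (short-term energy) and a zero-crossing indicator (zero-crossing measure). Window length and normalization are illustrative.

```python
# Short-term energy and zero-crossing measure as instances of
# Q(n) = sum_m T[s(m)] * w(n - m).
import numpy as np

def short_term_energy(signal, win):
    return np.convolve(signal ** 2, win, mode="same")            # T[s] = s^2

def short_term_zero_crossings(signal, win):
    crossings = np.abs(np.diff(np.sign(signal))) / 2.0           # ~1 at each sign change
    crossings = np.append(crossings, 0.0)                        # keep original length
    return np.convolve(crossings, win, mode="same")

win = np.hamming(400) / 400.0   # 25 ms window at 16 kHz, normalized (illustrative)
```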
f) Linear Prediction Features
The basic idea of linear prediction is that a speech sample s(n), produced from an excitation
u(n), can be predicted (approximated) by a linear combination of the past P speech samples:

s(n) = a1·s(n − 1) + a2·s(n − 2) + … + aP·s(n − P) + G·u(n)

Here G is the gain parameter and the ak are the prediction coefficients.
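A minimal sketch of estimating the prediction coefficients ak by the autocorrelation method for one frame; the order and windowing choices are illustrative.

```python
# Autocorrelation (Yule-Walker) method: solve the normal equations R a = r.
import numpy as np
from scipy.linalg import toeplitz

def lpc(frame, order=12):
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags 0..N-1
    R = toeplitz(r[:order])                                       # R[i, j] = r[|i - j|]
    a = np.linalg.solve(R, r[1:order + 1])
    return a                                                      # prediction coefficients a_1 ... a_P
```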

g) Harmonic Features
The harmonic decomposition of a high-resolution spectral line estimate of the speech
signal results in the harmonic features. The line spectral pairs represent the variations in the
glottis and the vocal tract of a speaker, transformed into the frequency domain. The
feature vector of harmonic features contains the fundamental frequency followed by the
amplitudes of several harmonic components. These features can be produced only on voiced
segments of speech, and long vowels and nasals were found to be the most speaker-specific.

3. Similarity measures
The features of the speech signal take the form of an N-dimensional feature vector.
For a segmented signal that is divided into M segments, M vectors are determined, producing
an M x N feature matrix. The M x N matrix is created by extracting features from the
utterances of the speaker for selected words or sentences during the training phase. After
extraction of the feature vectors from the speech signal, template matching is carried out
for speaker recognition. This process can either be manual (visual comparison of
spectrograms) or automatic. In automatic template matching, speaker models are
constructed from the extracted features. Thereafter, a speaker is authenticated by comparing
the incoming speech signal with the stored model of the claimed user. Speaker models
are of two types: template models and stochastic models.
i. Template Models
The simplest template model has a single template x, which is the model for a speech
segment. The match score between the template x for the claimed speaker and an input
feature vector y from an unknown user is given by d(x, y). The model for the claimed
speaker can be the centroid (mean) of the set of N vectors obtained in the training phase:

x = (1/N) · Σ xi,  i = 1 … N

The various distance measures between the vectors x and y can be written in the general weighted form

d(x, y) = (x − y)^T · W · (x − y)

where W is a weighting matrix. If W is the identity matrix, all elements of
the vectors are treated equally and the distance is the Euclidean distance. If W is a
positive-definite matrix that allows a desired weighting of the template features, the distance
is the Mahalanobis distance.
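A minimal sketch of the centroid model and the weighted distance above; the function names are hypothetical.

```python
# Centroid template model and weighted distance d(x, y) = (x - y)^T W (x - y).
import numpy as np

def centroid_model(training_vectors):
    return np.mean(training_vectors, axis=0)          # mean of the N training vectors

def weighted_distance(x, y, W=None):
    d = x - y
    if W is None:
        W = np.eye(len(d))       # identity weighting -> (squared) Euclidean distance
    return float(d @ W @ d)      # positive-definite W -> weighted (Mahalanobis-style) distance
```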
a) Dynamic Time Warping (DTW)
The time alignment of different utterances is a serious problem for distance measures,
and a small shift can lead to incorrect identification. Dynamic time warping is an efficient
method for solving this time-alignment problem, and it is the most popular way of handling
speaking-rate variability in template-based systems. The asymmetric match score β obtained by
comparing an input sequence y of M frames with the template sequence x can be written as

β = (1/M) · Σ d(y_i, x_j(i)),  i = 1 … M

The template indices j(i) are given by the DTW algorithm, which performs a
piecewise-linear mapping of the time axis to align the two signals. The variation over time
of the parameters corresponding to the dynamic configuration of the articulators and the vocal
tract is thus taken into account.
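A minimal sketch of a basic DTW alignment between a template sequence and an input sequence (frames as rows); it returns the accumulated distance of the best path, normalized by the input length.

```python
# Classic dynamic-programming DTW over a local frame-distance matrix.
import numpy as np

def dtw_score(x, y):
    nx, ny = len(x), len(y)
    D = np.full((nx + 1, ny + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])           # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[nx, ny] / ny       # lower score = better match
```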
b) VQ Source Modeling
This is another form of (usually text-dependent) template model that uses multiple
frames of speech. The model uses a vector-quantized codebook, which is
generated for a speaker from his or her training data. Standard clustering procedures are
used to build the codebook. These procedures average out the temporal
information, so the requirement of performing time alignment
is eliminated. The pattern-match score is the distance between each input vector and the
minimum-distance codeword in the codebook.
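A minimal sketch of VQ source modeling, using SciPy's k-means clustering as the codebook-training procedure (the codebook size is illustrative).

```python
# Train a per-speaker codebook with k-means, then score a test utterance by
# the average distance of each frame to its nearest codeword.
import numpy as np
from scipy.cluster.vq import kmeans, vq

def train_codebook(training_vectors, codebook_size=64):
    codebook, _ = kmeans(training_vectors.astype(float), codebook_size)
    return codebook

def vq_match_score(codebook, test_vectors):
    _, distances = vq(test_vectors.astype(float), codebook)  # nearest-codeword distances
    return distances.mean()      # lower = closer to the claimed speaker's model
```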
c) Nearest Neighbors
This method combines the strengths of the dynamic time warping and vector
quantization methods. It keeps all the data obtained in the training phase and does
not cluster the data into a codebook; therefore it can make use of any temporal
information present in the prompted phrase. The distances between the input
frames and the stored frames are used to compute an inter-frame distance matrix. The
nearest-neighbour distance is the minimum distance between an input frame and the stored frames,
and the nearest-neighbour distances for all input frames are averaged to arrive at the match
score. These match scores are then combined to form an approximation of the
likelihood ratio. The method is very memory-intensive but is one of the most powerful
methods.
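A minimal sketch of nearest-neighbour scoring from the inter-frame distance matrix described above.

```python
# Average of the per-input-frame minimum distances to the stored training frames.
import numpy as np

def nearest_neighbor_score(input_frames, stored_frames):
    # distance_matrix[i, j] = distance from input frame i to stored frame j
    diffs = input_frames[:, None, :] - stored_frames[None, :, :]
    distance_matrix = np.linalg.norm(diffs, axis=-1)
    return distance_matrix.min(axis=1).mean()
```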

ii. Stochastic models


Stochastic models have been adopted more recently; they provide more flexibility and produce
better match scores. In a stochastic model, pattern matching is carried out by
measuring the likelihood of a feature vector given a speaker model. A stochastic model
that is widely used for modelling sequences is the Hidden Markov Model [Cam97]. This
technique efficiently models the statistical variations of the features and provides a statistical
representation of the manner in which a speaker produces sounds.

Figure 5. Five state Markov model

A Hidden Markov Model (HMM) consists of a set of transitions between a set of
states. Two sets of probabilities are defined for each transition: a transition probability and
an output probability density function, which gives the probability of emitting each
output symbol from a finite vocabulary. As shown in Fig.
5, transitions are allowed only to the next state to the right or back to the same state, so the model is called a
left-to-right model; the aij are the probabilities of transition between states. The HMM
parameters are estimated from speech during the training phase, and for verification the
likelihood of the input feature sequence is computed with respect to the claimed speaker's HMMs.
When a finite vocabulary is used for speaker recognition, each word is modelled using a
multi-state left-to-right HMM; therefore, a large vocabulary requires a large number of
models.
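A minimal sketch of HMM-based enrolment and scoring, assuming the third-party hmmlearn package is available; note that GaussianHMM is ergodic by default, so a strict left-to-right topology would additionally require constraining the transition matrix.

```python
# Train a Gaussian HMM on a speaker's feature vectors, then score a test
# utterance by its log-likelihood under that model.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_speaker_hmm(feature_vectors, n_states=5):
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
    model.fit(feature_vectors)               # rows = frames, columns = features
    return model

def hmm_log_likelihood(model, test_vectors):
    return model.score(test_vectors)         # higher = better match to the speaker
```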

4. Speaker Recognition: Text-Dependent and Text-Independent

Speaker recognition, which can be classified into identification and verification, is
the process of automatically recognizing who is speaking based on the speech signal. This method
of personal identification uses information unique to the speaker's voice; it allows
verification of a speaker's identity and control of access to services such as voice dialling, telephone
banking, telephone shopping, database access services, voice mail, access authorization to
resources, and forensic applications.
Speaker identification is the process of determining which registered speaker
provides a given utterance. Speaker verification is the process of accepting or rejecting the
identity claim of a speaker. Most applications in which a voice is used as the key to confirm
the identity of a speaker are classified as speaker verification.
Speaker recognition methods are also divided into text-dependent and text-independent
methods. In text-dependent systems the speaker says key words or sentences with
the same text in both the training and recognition modes.
i. Text Dependent speaker recognition
Text-dependent speaker recognition characterizes a speaker recognition task, such as
verification or identification, in which the set of words (or lexicon) used during the testing
phase is a subset of those present during the enrolment phase. The restricted lexicon
enables very short enrolment (or registration) and testing sessions to deliver an accurate
solution but, at the same time, poses scientific and technical challenges. Because of the
short enrolment and testing sessions, text-dependent speaker recognition technology is
particularly well suited for deployment in large-scale commercial applications. These
considerations motivate an overview of the state of the art in text-dependent speaker recognition,
as well as of emerging research avenues.
In text-dependent speaker verification, the speaker presents a fixed or prompted phrase
that is programmed into the system, and this prior knowledge can improve system performance. If an arbitrary
word or sentence is used instead, the system is called text-independent. A text-independent
speaker verification system has no advance knowledge of the speaker's phrasing,
so the task is much more difficult and the system less robust.

Speaker verification has many potential applications, including access control to
computers, databases and facilities, electronic commerce, forensics and telephone banking [2].
A baseline speaker verification system is outlined below. The speech signal carries a great deal
of information, and extracting suitable features from it is essential for good results. Several
features can be used: Mel-frequency cepstral coefficients (MFCC) are standard for a baseline system;
others include linear frequency cepstral coefficients (LFCC), periodic and aperiodic energy, formants,
pitch, and linear predictive coding (LPC) features [3], all derived from the speech signal.
As with features, researchers have proposed several modelling techniques for speaker
verification systems, such as dynamic time warping (DTW), artificial neural networks
(ANN), hidden Markov models (HMM), vector quantization (VQ), and various combined
approaches. A threshold value for the decision logic can be fixed based on the training of the
chosen model to obtain good results.
A) Various Stages
a) Pre-processing
Pre-processing of any signal is necessary in the beginning, to obtain a proper signal to
work on, and it involves the following steps:
Sampling: To process an analog signal in a computer, digitization of the signal is
necessary. So, at the very onset, the speech signal is sampled in accordance with the Nyquist criterion.
Framing: As speech signals are non-stationary in nature, small blocks or frames
are formed, assuming each portion to be stationary, by means of short-term processing.
The energy of a signal y(n) over a frame is given by E = Σ y(n)^2, the sum of the squared samples in that frame.

Windowing: Windowing is the process of multiplying a signal frame by a window function
such as the Hamming or Hanning window. This is done to retain the desired region of interest
and to attenuate the regions outside it toward zero.
Endpoint detection: Endpoint detection refers to detecting the start and end
points of an utterance in the presence of background noise, by means of an energy threshold,
the short-term frequency spectrum, cepstral analysis, or the zero-crossing rate.
Noise removal: For the speech signals that are degraded by noise, the quality of these
signals can be improved by noise reduction.
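A minimal sketch of the framing, windowing, and (energy-based) endpoint detection steps described above; frame length, hop, and threshold values are illustrative.

```python
# Split the sampled signal into short Hamming-windowed frames and keep only
# frames whose energy exceeds a threshold (a crude endpoint/silence check).
import numpy as np

def preprocess(signal, frame_len=400, hop=160, energy_threshold=1e-4):
    frames = []
    window = np.hamming(frame_len)
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * window     # windowed frame
        if np.sum(frame ** 2) > energy_threshold:            # simple endpoint detection
            frames.append(frame)
    return np.array(frames)
```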
b) Feature Extraction

 Mel frequency cepstral coefficients: In acoustics, the short-term spectral envelope
of a sound is represented by the mel-frequency cepstrum (MFC), based on the
nonlinear mel scale of frequency. The mel-frequency scale is approximately linear up to about
1 kHz and becomes logarithmic for higher frequencies. MFCC features are motivated by
human auditory perception, which resolves frequencies above about 1 kHz progressively
less finely. They are used, for example, in automatic recognition of numbers spoken over a
telephone, and also find application in music information retrieval, such as genre
classification and audio similarity measures. The procedure for obtaining the MFCC
coefficients is:
1) Fourier transformation of the signal is done after passing the signal through a
window function.
2) These frequencies are then mapped to the mel scale using the standard mel mapping
m = 2595 · log10(1 + f / 700),
where m is the mel-scale frequency and f is the linear frequency in Hz.
3) Log magnitude of the spectrum is taken at each of these mel frequencies.
4) Then DCT is performed.
5) The coefficients thus obtained from the resulting spectra, are the required MFCCs.
The temporal derivatives of the MFCC features are the Δ and ΔΔ (delta and delta-delta) features. The
MFCC features carry the static spectral information, whereas the derivatives
capture the dynamic attributes present in the speech (a brief code sketch is given after this feature list).

 Pitch: Pitch information provides a useful way of correlating the training
and testing utterances, because the rate at which the vocal folds vibrate
differs between speakers. Different pitch patterns are also used to
convey different meanings to the listener.
 Duration: For a genuine client, the total duration of the reference speech may
differ from that of the testing utterance [6], but there is always a consistency in the
relative durations of the words, syllables or phrases spoken in the utterance. Duration
modelling also finds application in text-to-speech and speech understanding systems.
Pitch and duration are suprasegmental features, extracted from the speech signal.
 Linear predictive coding: Linear predictive coding (LPC) predicts the present
sample value from a linear combination of past values; this is
done to remove redundancy in the signal. These features are widely
used in speech recognition, speech analysis and synthesis, voice compression
by telephone companies, and secure wireless communication, where voice must be digitized, encrypted
and sent over a narrow voice channel. The speech signal is analyzed by
estimating the formants. Applying a cepstral analysis to the LPC features, through
a set of iterative procedures, yields the linear predictive cepstral coefficients (LPCC).
 Perceptual linear predictive coefficients: Perceptual linear predictive (PLP)
analysis discards information in the voice signal that is not needed for
recognition, in order to improve the recognition rate. It merges several
engineering approximations of human auditory processing. It is similar to
LPC, except that the spectral characteristics are altered so that they
approximate those obtained from the human hearing system. In PLP,
a nonlinear mapping between sound intensity and perceived loudness, together with
a non-uniform filter bank, is used in the extraction of the LP
features.
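The code sketch referred to in the MFCC item above, assuming the third-party librosa package; the file name and parameter values are illustrative.

```python
# Compute MFCCs and their delta / delta-delta derivatives for one utterance.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)        # hypothetical file, resampled to 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)     # static coefficients
delta = librosa.feature.delta(mfcc)                    # first temporal derivative
delta2 = librosa.feature.delta(mfcc, order=2)          # second temporal derivative
features = np.vstack([mfcc, delta, delta2])            # one column per frame
```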
c) Pattern Classification
 Vector quantization: In the vector quantization (VQ) method, non-overlapping
clusters of feature vectors form the speaker model [13]. Here
quantization of the data is done in the form of contiguous blocks called
vectors, rather than single scalar values. The output of the quantization is a
data block drawn from a finite set of vectors, termed the codebook.
 Dynamic time warping: Dynamic time warping (DTW) is an algorithm for
finding the minimum-distance path through a matrix of local distances, thereby
aligning two sequences while limiting the computation time.
 Gaussian mixture model: A Gaussian mixture model (GMM) is a parametric
probability density function (pdf), represented as a weighted sum of Gaussian
component densities, used to model continuous features, such as the spectral
features of the vocal-tract system, in a biometric system.
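A minimal sketch of GMM enrolment and scoring, assuming scikit-learn; the number of mixture components is illustrative.

```python
# Fit a GMM speaker model on enrolment features and score test features by
# their average log-likelihood under the model.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(enrol_features, n_components=16):
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(enrol_features)            # rows = frames, columns = features

def gmm_score(gmm, test_features):
    return gmm.score(test_features)           # mean log-likelihood per frame
```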
d) Decision Making and Performance Measures
After performing the classification, a decision is taken based on a threshold value: if
the score exceeds the threshold the claim is accepted, otherwise it is rejected. Performance
measures of the system are expressed in terms of acceptance and rejection rates, as listed below:
 False acceptance rate: The false acceptance rate (FAR) is the ratio of
accepted impostor claims to the total number of impostor trials.
 False rejection rate: The false rejection rate (FRR) is the ratio of
rejected genuine claims to the total number of genuine trials.
 Equal error rate: The equal error rate (EER) is the operating point at which FAR and FRR
are equal. A lower EER indicates better system performance.
 Total success rate: The total success rate (TSR) is obtained by subtracting the
EER from 100.
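A minimal sketch of computing FAR, FRR, EER and TSR from the scores of genuine and impostor trials, assuming the decision rule "accept if the score is at least the threshold".

```python
# Sweep the decision threshold over all observed scores, find where FAR and
# FRR cross (the EER), and report TSR = 100 - EER.
import numpy as np

def error_rates(genuine_scores, impostor_scores):
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])    # false rejection rate
    idx = np.argmin(np.abs(far - frr))                # threshold where FAR and FRR meet
    eer = 100.0 * (far[idx] + frr[idx]) / 2.0
    return far, frr, eer, 100.0 - eer                 # FAR, FRR, EER (%), TSR (%)
```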

ii. Text independent speaker recognition


In text-independent systems, there are no constraints on the words which the
speakers are allowed to use. Thus, the reference utterances (what is spoken in training) and the
test utterances (what is uttered in actual use) may have completely different content,
and the recognition system must take this phonetic mismatch into account. Text-independent
recognition is the more challenging of the two tasks.
In general, phonetic variability represents one adverse factor for accuracy in
text-independent speaker recognition. Changes in the acoustic environment and
technical factors (transducer, channel), as well as "within-speaker" variation of the
speaker him/herself (state of health, mood, ageing), represent other undesirable factors.
In general, any variation between two recordings of the same speaker is known as
session variability.
Fig. 1. Components of a typical automatic speaker recognition system. In the enrolment
mode, a speaker model is created with the aid of a previously created background model; in
recognition mode, both the hypothesized speaker model and the background model are matched, and
the background score is used to normalize the raw score.
Fig. 2. A summary of features from the viewpoint of their physical interpretation. The choice of
features has to be based on their discriminative power, robustness, and practicality. Short-term
spectral features are the simplest yet most discriminative; prosodic and high-level features
have received much attention but come at high computational cost.
