
Applied Soft Computing 27 (2015) 231–239


Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers

Khaled Daqrouq a,∗, Tarek A. Tutunji b

a Electrical & Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
b Mechatronics Engineering Department, Philadelphia University, Jordan

a r t i c l e   i n f o

Article history:
Received 15 December 2011
Received in revised form 3 November 2014
Accepted 18 November 2014
Available online 26 November 2014

Keywords:
Speaker verification and identification
Wavelet packet
Neural networks
Formants

a b s t r a c t

This paper proposes a new method for speaker feature extraction based on Formants, Wavelet Entropy and Neural Networks, denoted as FWENN. In the first stage, five formants and seven Shannon entropy wavelet packet values are extracted from the speakers' signals as the speaker feature vector. In the second stage, these 12 feature coefficients are used as inputs to feed-forward neural networks. A probabilistic neural network is also proposed for comparison. In contrast to conventional speaker recognition methods that extract features from sentences (or words), the proposed method extracts the features from vowels. Advantages of using vowels include the ability to recognize speakers when only partially-recorded words are available. This may be useful for deaf-mute persons or when the recordings are damaged. Experimental results show that the proposed method succeeds in the speaker verification and identification tasks with a high classification rate. This is accomplished with a minimum amount of information, using only 12 feature coefficients (i.e. vector length) and only one vowel signal, which is the major contribution of this work. The results are further compared to well-known classical algorithms for speaker recognition and are found to be superior.

© 2014 Elsevier B.V. All rights reserved.

Abbreviations: ACF, Autocorrelation Function; ANN, Artificial Neural Network; AR, Auto-Regressive; DFT, Discrete Fourier Transform; DWT, Discrete Wavelet Transform; DWT-NN, Discrete Wavelet Transform with Neural Networks; FSVM, Feature-extraction Support Vector Machine; FWENN, Formants Wavelet Entropy with Neural Networks; GWP-NN, Genetic Wavelet Packet with Neural Networks; IMFCC, Inverted Mel Frequency Cepstral Coefficient; LPC, Linear Predictive Coding; LPC-NN, Linear Predictive Coding with Neural Networks; LPC-WP, Linear Predictive Coding with Wavelet Packet; MFCC, Mel Frequency Cepstral Coefficient; MFCC-NN, Mel Frequency Cepstral Coefficient with Neural Network; PSD, Power Spectrum Density; SWPF, Sure entropy with WP Formants; SVM, Support Vector Machine; WP, Wavelet Packet; WPID, Wavelet Packet Index Distribution; WPT, Wavelet Packet Transform.
∗ Corresponding author at: P.O. Box 80204, Jeddah 21589, Saudi Arabia. Tel.: +966 5 66 980400; fax: +966 5 695 2686.
E-mail address: [email protected] (K. Daqrouq).
http://dx.doi.org/10.1016/j.asoc.2014.11.016
1568-4946/© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Speech processing applications include speech recognition and speaker identification. Speaker identification is a technology with a potentially large market due to broad applications that range from automation using operator-assisted services to speech-to-text aiding systems [1,2].

In general, a speaker identification system can be implemented by observing the voiced/unvoiced components or by analyzing the speech energy distribution. Such systems can be divided into two main steps: feature extraction and speaker classification [3]. Several digital signal processing methods have been used by researchers: the Linear Predictive Coding (LPC) technique [1], Mel Frequency Cepstral Coefficients (MFCC) [4], the Discrete Wavelet Transform (DWT) [5], and the Wavelet Packet Transform (WPT) [3]. Due to its success in analyzing non-stationary signals, the DWT has become a powerful alternative to Fourier methods in many speech/speaker identification applications. The main advantage of wavelets is their optimal time–frequency resolution in all frequency ranges. This is a result of varying the window size for different frequencies: wide for low frequencies and narrow for high frequencies [1,6,7].

Previous studies showed that using Wavelet Packet (WP) entropy as a feature in recognition tasks is effective. In [27], a method to calculate the wavelet norm entropy value in digital modulation recognition was proposed. In [36], a combination of a genetic algorithm and the WPT was presented, and energy features were determined from a group of WP coefficients. The work was applied in biomedicine, where the results were used for pathological classification and evaluation. Energy indexes of WP were proposed for speaker identification [3], and sure entropy computed over the terminal-node signal waveforms obtained from the DWT was applied to speaker identification [1]. Others [28] used feature extraction methods based on a combination of three entropy types (sure, logarithmic energy, and norm).

Formants can be described as a function of the supralaryngeal vocal tract. The air in the oral and nasal cavities vibrates at a range of frequencies in response to the vibratory movement of the vocal folds and the air passing through the glottis. These resonant frequencies are affected by the size and shape of the vocal tract and by the tongue and lip positions [9]. Vocal tract resonances are often studied in terms of vowel formant frequencies. Because the male vocal tract is about 15% longer than the female vocal tract, men's speech signals have lower formant frequencies than women's [10]. During voiced speech, the resonant frequencies of the vocal tract are recognized as formants, which provide valuable features for both automatic speech recognition and speech synthesis [11].

The use of formants for 1-D and 2-D continuous motion control created a new vocal interface that allowed people, especially individuals with motor impairments, to interact with computer-based devices [8,49].

Researchers have used different methods for formant tracking, including Linear Predictive Coding spectral analysis [12], hidden Markov model based methods [13,60], nonlinear predictors [14], and a Kalman filtering framework [15].

Artificial Neural Network (ANN) models have been effectively used for speaker classification [38,53,57,59]. Researchers used radial basis function networks [17,18,54] and developed ANN-based techniques using a cascade neural network [16] for speaker verification. Others compared ANNs with second-order statistical techniques for speaker verification [19]. Committee neural networks were developed to improve the reliability of ANN-based classification systems [21,22], and the use of these committee networks for text-dependent speaker verification was addressed [20]. Support Vector Machines (SVM), a special case of Tikhonov regularization that belongs to the general linear classifier family, have been used for speaker recognition [46,47].

Researchers used a combination of MFCC and parametric feature-set algorithms to improve the accuracy of speaker recognition systems in adverse environments. Some studies, such as [39], emphasized text-dependent speaker identification, which deals with feature extraction by means of LPC coefficients. A Gaussian-shaped filter was used for calculating MFCC and IMFCC instead of the typical triangular-shaped bins. A system using four transform techniques was suggested in [40,41]. The feature vectors were the row means of the transforms for different groupings. Experiments were performed on the Discrete Fourier Transform (DFT), Discrete Cosine Transform, Discrete Sine Transform, and Walsh Transform. All these methods showed an accuracy of more than 80% for the different groupings considered; however, the results showed that the Discrete Sine Transform had the best performance. In [42], IMFCC, which covers high frequencies, was used to improve the speaker recognition rate.

Researchers investigated fundamental and formant frequencies for a speaker recognition task [50]. It was concluded from a detailed comparison that the long-term formant distributions contributed to the rejection of the suspect. Grigoras continued this study to calculate likelihood ratios based on the density estimation of formant frequencies on distinct vowel phonemes ([a], [e], [i], [o]) [51]. Rose [52] suggested the comparison of vowel phonemes by likelihood ratio computation, and recognition of human speech phonemes by a fuzzy method was proposed in [58].

In our study, vowels are used for speaker recognition. Because formants are recommended in the case of vowels [51,52], they are studied here in detail. In order to enhance the recognition results, WP entropy is utilized. The purpose of the WP entropy is to extract additional features, over different frequency bands, by means of Shannon entropy.

This paper presents a new method for speaker identification that uses formants and wavelet packet entropy within a feed-forward neural network. The objective was to develop the method using partially-recorded speech signals only. The major contribution of this research is the development of an accurate speaker identification method that uses simple computations with a minimum amount of information. The developed method is capable of dealing with vowels as the only input from the speech signal. This method might be used for forensic and criminal investigation as well as for deaf-mute speakers' recognition.

The paper is organized as follows: Section 2 describes the general structure of the proposed method. Section 3 discusses feature extraction using Power Spectrum Density and wavelets, while Section 4 describes the neural networks used in this work. Section 5 provides the experimental results and Section 6 concludes the paper.

2. FWENN method

The method developed and proposed in this work, Formants and Wavelet Entropy with Neural Networks (FWENN), is explained in this section. The proposed method is based on several steps (shown as a flow chart in Fig. 1) and can be divided into four stages: recording and filtering the speech signals, extracting features, classification, and speaker retrieval.

[Fig. 1. Flow chart procedure for the FWENN method: Start → Record speech signals → Filter → Use PSD to calculate formants / Use Wavelet Packet to calculate entropies → Use backpropagation to train a neural network for speaker identification → Input test signals → Use the trained network to identify the speaker → Output the speaker name → End.]

The emphasis here is on the second and third stages: extracting features and classification [37]. Feature extraction from the speech signals was performed using two techniques: formants using Power Spectrum Density (PSD) and entropies using WP. These concepts are explained in Section 3. Classification was done using neural networks and is explained in Section 4.

3. Features' extraction

Periodic excitation is seen in the spectrum of certain sounds, especially vowels. The speech organs form certain shapes to produce a vowel sound, and therefore regions of resonance and anti-resonance are formed in the vocal tract. The location of these resonances in the frequency spectrum depends on the form and shape of the vocal tract. Since the physical structure of the speech organs is a characteristic of each speaker, differences among speakers can also be found in the positions of their formant frequencies. These resonances affect the overall spectrum shape and are referred to as formants. A few of these formant frequencies can be sampled at an appropriate rate and used for speaker recognition. These features are normally used in combination with other features. Unlike our previous works that investigated the DWT for feature extraction [24,25], in this work the formants are used with the WP entropy to identify the speakers.

This paper proposes to use formants and WP parameters as inputs to an ANN for speaker classification. Therefore, it is necessary to introduce the two concepts: feature extraction by formants and by WP algorithms. The discussion of these concepts will be limited to their use in speaker identification.

3.1. Features' extraction by formants using PSD

Formants are the spectral peaks of the sound spectrum of vowels, or the acoustic resonances of the human vocal tract. In a wide-band spectrogram they show up as black bars; in a narrow-band spectrogram the fundamental frequency and the harmonics are visible as well. Sound production can be modeled as a time-varying linear system (having these resonances) that is excited by a sequence of impulses. Using a linear system of order n and taking the inverse, the model spectrum of the linear system can be created from the speech signal. Using a small order can result in noisy resonances, while using a large order can introduce artificial harmonics: for small n, the resonances are smeared, while for large n, some of the peaks are actually not formants but harmonics. An order n corresponding to about 1 ms seems to be a good choice, resulting in five formants.

The first five vocal resonant frequencies, i.e. the formants (F1, F2, F3, F4, F5), during voiced speech are distinguishable for each person and are therefore proposed as the speaker features. For voiced speech, the glottis signal is periodic with a fundamental frequency (i.e. pitch, F0). Variations of the pitch over the duration of the utterance provide the contour, which can be used as a feature for speech recognition. The speech utterance is normalized and the contour is determined. The vector that contains the average pitch values of all segments is thereafter used as a feature for speaker recognition. The pitch might be sufficient for speaker identification, but it is usually assisted by the formants for the best speaker recognition [26].

The filtered speech signal can be used as an input to a power spectrum algorithm in order to identify the first five formants. These formants can be used as the unique features for the speaker. The PSD algorithm identifies the formants in two steps: first, the PSD is estimated using the Yule–Walker Auto-Regressive (AR) method; then, the local maxima are identified.

The power spectrum can be found by taking the Fourier Transform of the Autocorrelation Function (ACF):

\[ P_x(e^{j\omega}) = \sum_{n=-\infty}^{\infty} r_x(n)\, e^{-j\omega n} \tag{1} \]

where r_x(n) is the ACF of the signal x(n). The autocorrelation values are estimated from a finite data record, x(n) for 0 ≤ n ≤ N − 1, and are defined as

\[ \hat{r}_x(k) = \frac{1}{N} \sum_{n=0}^{N-1-k} x(n+k)\, x^{*}(n), \quad k = 0, 1, \ldots, p \tag{2} \]

Eq. (1) is one estimate of the PSD, but it has some disadvantages, including excessive estimation variance. A better estimate is the Yule–Walker method.

The Yule–Walker method estimates the PSD of the input using the Yule–Walker AR method. The concept is to minimize the forward prediction error by fitting an AR model to the windowed input data. This method is also called the autocorrelation method [30]. The PSD is estimated using the following:

\[ P_x(e^{j\omega}) = \frac{|b(0)|^{2}}{\left| 1 + \sum_{k=1}^{p} a(k)\, e^{-j\omega k} \right|^{2}} \tag{3} \]

The parameters a(k) and b(0) can be found from the autocorrelation estimates, as described next. The AR model is described by the equation

\[ \hat{r}_x(n) = -\sum_{k=1}^{p} a(k)\, \hat{r}_x(n-k) \tag{4} \]

which can be expanded into matrix form as

\[ \begin{bmatrix} \hat{r}_x(0) & \hat{r}_x(1) & \cdots & \hat{r}_x(p-1) \\ \hat{r}_x(1) & \hat{r}_x(0) & \cdots & \hat{r}_x(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ \hat{r}_x(p-1) & \hat{r}_x(p-2) & \cdots & \hat{r}_x(0) \end{bmatrix} \begin{bmatrix} a(1) \\ a(2) \\ \vdots \\ a(p) \end{bmatrix} = - \begin{bmatrix} \hat{r}_x(1) \\ \hat{r}_x(2) \\ \vdots \\ \hat{r}_x(p) \end{bmatrix} \tag{5} \]

This formulation is known as the Yule–Walker equations, and the Levinson–Durbin recursion is used to solve them in order to obtain the AR parameters a(1), . . ., a(p). The parameter b(0) is then calculated using

\[ |\hat{b}(0)|^{2} = \hat{r}_x(0) + \sum_{k=1}^{p} a(k)\, \hat{r}_x(k) \tag{6} \]

The above parameters, b(0) and a(k), can now be substituted to estimate the PSD.
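As a concrete illustration of this two-step procedure, the following Python sketch fits the AR model with the Yule–Walker equations and then picks the local maxima of the resulting spectrum. It is a minimal sketch, not the authors' code: the function name, the AR order of 16, the 128-point frequency grid (chosen so the peak indexes are comparable in scale to the 0–127 "Index" columns of Table 1), and the use of SciPy's solvers are all assumptions.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import freqz, find_peaks

def yule_walker_formants(x, order=16, nfft=128, n_formants=5):
    """Estimate the PSD via the Yule-Walker AR method (Eqs. (2)-(6))
    and return the indexes/magnitudes of its first local maxima."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    # Biased autocorrelation estimates r_hat(0..p), Eq. (2)
    r = np.array([np.dot(x[k:], x[:N - k]) / N for k in range(order + 1)])
    # Solve the symmetric Toeplitz system of Eq. (5) for a(1)..a(p)
    a = solve_toeplitz((r[:-1], r[:-1]), -r[1:])
    # Gain term of Eq. (6)
    b0 = np.sqrt(r[0] + np.dot(a, r[1:]))
    # AR spectrum of Eq. (3), evaluated on an nfft-point grid over [0, pi)
    _, h = freqz([b0], np.concatenate(([1.0], a)), worN=nfft)
    psd = np.abs(h) ** 2
    # Step two: local maxima, i.e. the (index, magnitude) pairs of Table 1
    peaks, _ = find_peaks(psd)
    peaks = peaks[:n_formants]
    return peaks, psd[peaks]
```

Signals with fewer than five interior spectral peaks simply return shorter lists, which matches the N/A entries seen in Table 1.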
In the context of this paper, Arabic vowels were used for speaker identification; however, the proposed method can be applied to other languages. Figs. 2 and 3 illustrate the applied results for the first five formants (spectrum peaks) of Arabic vowels for discriminating the speakers' models. Fig. 2 illustrates the formants of Arabic vowels for two speakers. For each speaker, four speech signals with the Arabic vowel sound /e/ were recorded (see Appendix A). Note that the spectrum features for each speaker are similar, while there are clear differences in magnitudes and indexes between the two speakers. Fig. 3 illustrates the formants of two Arabic vowels: the sound /a/ and the vowel /e/ (i.e. vowel-independent). Here, the overlap between the two speakers' spectrum features is much larger because different vowels are used (i.e. the vowel-independent case). This is a limitation due to the use of different vowels for each speaker.

[Fig. 2. PSD result showing the five formants of Arabic vowels for two speakers: A and B (vowel-dependent). Speakers A and B are both male. In A1-A4, four signals of one speaker were used, and in B1-B4, four signals of another speaker were used.]

[Fig. 3. PSD result showing the five formants of Arabic vowels for two speakers: A and B (vowel-independent). Speakers A and B are both male. In A1-A4, four signals of one speaker were used, and in B1-B4, four signals of another speaker were used.]

Frequency information, specifically the indexes of the local maxima, contains the distinguishable speaker features. Table 1 shows the formant calculations for five speakers. Note that the indexes vary among different speakers, but are consistent for each speaker.

Table 1
Comparison among formants: (Index, Mag.) pairs for each formant F1–F5.

Speaker  Signal (vowel E)   F1            F2            F3             F4             F5
1        1                  7, 0.5934     39, 4.2136    80, 0.2432     127, 2.9991    N/A
1        2                  7, 2.4974     40, 4.1294    80, 0.1340     127, 3.0025    N/A
1        3                  7, 0.3612     40, 4.7808    81, 0.3719     N/A            N/A
1        4                  7, 2.2286     41, 2.5217    78, 0.1622     127, 2.9925    N/A
1        5                  7, 0.9950     23, 0.0268    40, 1.8998     78, 0.0183     127, 2.9958
2        1                  1, 0.5550     20, 1.0868    50, 3.0530     82, 2.6169     113, 0.2566
2        2                  1, 0.1981     19, 0.7639    49, 0.2269     83, 2.0576     113, 0.8389
2        3                  1, 0.8332     21, 1.1704    48, 3.5790     81, 0.4615     110, 2.0189
2        4                  1, 1.1253     22, 1.8774    46, 1.6640     73, 0.0524     111, 0.8447
2        5                  1, 1.0500     21, 0.1506    48, 1.3245     82, 0.4500     110, 0.4166
3        1                  7, 0.9968     34, 0.5306    80, 0.4455     97, 0.0326     127, 3.0090
3        2                  7, 2.0944     18, 0.1030    31, 0.0798     80, 1.4715     N/A
3        3                  7, 1.5067     78, 0.5869    98, 0.0331     N/A            N/A
3        4                  7, 2.0874     79, 1.3096    100, 0.0064    127, 3.0044    N/A
3        5                  7, 1.1680     78, 0.5193    127, 3.0080    N/A            N/A
4        1                  2, 0.0760     41, 0.2450    61, 0.1483     N/A            N/A
4        2                  4, 0.4334     42, 0.7514    63, 0.3955     127, 2.9697    N/A
4        3                  4, 0.0664     16, 0.0442    62, 0.2867     82, 0.2373     114, 0.0073
4        4                  8, 1.2975     43, 0.0967    63, 0.0791     127, 2.9458    N/A
4        5                  19, 0.5030    54, 0.0738    92, 0.0004     120, 0.0039    N/A

N/A: not available.

[Fig. 4. Wavelet packet tree at depth two: node (0,0) splits into nodes (1,0) and (1,1), which in turn split into nodes (2,0), (2,1), (2,2), and (2,3).]

3.2. Features' extraction by entropies using WP

A general case of the wavelet decomposition is the WP method. The mother wavelet function is defined by

\[ \psi_{a,b}(t) = \psi\!\left(\frac{t-b}{a}\right) \tag{7} \]

where a and b are the scale and shift parameters, respectively. By varying a and b, the mother wavelet is scaled and translated. The wavelet transform is obtained by the inner product of the data function x(t) and the mother wavelet ψ(t):

\[ W_x(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{+\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt \tag{8} \]

The WP uses a recursive binary tree, as shown in Fig. 4, for the recursive decomposition of the data. A pair of low-pass and high-pass filters, denoted as h[n] and g[n], respectively, are used to generate two sequences, with the purpose of capturing different frequency sub-band features of the original signal. The two wavelet orthogonal bases generated from a previous node are defined as:

\[ \psi_{j+1}^{2p}(k) = \sum_{n=-\infty}^{\infty} h[n]\, \psi_{j}^{p}(k - 2^{j} n) \tag{9} \]

\[ \psi_{j+1}^{2p+1}(k) = \sum_{n=-\infty}^{\infty} g[n]\, \psi_{j}^{p}(k - 2^{j} n) \tag{10} \]

where ψ is the wavelet function, while j and p are the number of decomposition levels and the number of nodes in the previous level, respectively [3,44].

For a given orthogonal wavelet function, a library of WP bases is generated. Each of these bases offers a particular way of coding signals, maintaining global energy and reconstructing exact features. The WP is used here to extract extra features in order to guarantee a higher recognition rate. The WPT is applied at the feature extraction stage, but the raw coefficients are not suitable for the classifier because of their great length; therefore, a more compact representation of the speech features is needed, as explained next.

The WP features' extraction method can be summarized as follows:

• Decompose the vowel signal with a WP tree of depth two using a Daubechies-type wavelet. The WP extracts additional features for the Shannon entropy and therefore enhances the recognition rate.
• Calculate the Shannon entropy for each sub-signal, i.e. for all seven nodes of the wavelet packet tree, using the equation

\[ E(x) = -\sum_{i} x_i^{2} \log(x_i^{2}) \tag{11} \]

where x is the signal under consideration and the x_i are the signal coefficients that form the orthonormal basis. These seven entropies (in addition to the five formants) will be used to identify different speakers.
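The following sketch mirrors this procedure with the PyWavelets package. It is an illustration under stated assumptions: the Daubechies order (db4 here) is a guess, since the paper only specifies a "Daubechies type" wavelet, and the "seven nodes" are read as the root signal plus the two level-1 and four level-2 nodes of the Fig. 4 tree.

```python
import numpy as np
import pywt

def shannon_entropy(c):
    """Eq. (11): E(x) = -sum(x_i^2 * log(x_i^2)); zero coefficients are
    skipped to avoid log(0)."""
    c = np.asarray(c, dtype=float)
    sq = c[c != 0.0] ** 2
    return float(-np.sum(sq * np.log(sq)))

def wp_entropy_features(signal, wavelet="db4"):
    """Seven Shannon entropies from the depth-two WP tree of Fig. 4."""
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=2)
    nodes = [np.asarray(signal, dtype=float)]                     # node (0,0)
    nodes += [n.data for n in wp.get_level(1, order="natural")]   # (1,0), (1,1)
    nodes += [n.data for n in wp.get_level(2, order="natural")]   # (2,0)..(2,3)
    return np.array([shannon_entropy(n) for n in nodes])
```

Concatenating these seven entropies with the five formant features yields the 12-coefficient vector that is fed to the classifier described next.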
4. Classification

Two different classification approaches were used in the experimental investigation: feed-forward neural networks and probabilistic neural networks.

4.1. Feed-forward neural networks

Feed-forward networks are usually composed of multi-layer nodes (Fig. 5). The data flows in only one direction (i.e. forward). Consider a three-layer neural network that receives the inputs x_1, x_2, . . ., x_M, processes them through the hidden layer, and then through the output layer to give the outputs y_1, y_2, . . ., y_N.

[Fig. 5. Feed-forward multi-layer architecture: inputs x_1 . . . x_M, hidden-layer outputs z_1 . . . z_H with input weights v_ih, and outputs y_1 . . . y_N with output weights w_hj.]

The connections between the nodes carry weights (the network variables). These weights are v_ih (connecting input node i with hidden node h) and w_hj (connecting hidden node h with output node j). The outputs of the hidden and output layers at each pattern, k, are given by

\[ z_h(k) = f\!\left(\sum_{i=1}^{M} v_{ih}\, x_i(k)\right), \quad h = 1, \ldots, H \tag{12} \]

\[ y_j(k) = g\!\left(\sum_{h=1}^{H} w_{hj}\, z_h(k)\right), \quad j = 1, \ldots, N \tag{13} \]

where f and g are the activation functions. The most commonly used activation functions are the sigmoid and hyperbolic tangent.

Let the desired outputs be d_1, d_2, . . ., d_N. The learning objective is to determine the weight values that minimize the difference between the desired and network outputs for all patterns. Let the error criterion be defined as follows:

\[ \mathrm{SSE} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{N} \big(e_j(k)\big)^{2} = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{N} \big(y_j(k) - d_j(k)\big)^{2} \tag{14} \]

where k refers to the pattern number and j refers to the output node number. The weights are updated recursively:

\[ w_{hj}(\mathrm{iter}+1) = w_{hj}(\mathrm{iter}) + s(\mathrm{iter}) \tag{15} \]

\[ v_{ih}(\mathrm{iter}+1) = v_{ih}(\mathrm{iter}) + s(\mathrm{iter}) \tag{16} \]

Here s(iter) is the search direction at the specified iteration. An efficient update is the use of the Levenberg–Marquardt method to find the search direction. This is given next for the output weights w_hj [30]:

\[ E(k) = \sum_{j=1}^{N} \big(y_j(k) - d_j(k)\big)^{2} \tag{17} \]

\[ J = \frac{\partial E(k)}{\partial w_{hj}} \tag{18} \]

\[ s(\mathrm{iter}) = -\big(J^{T} J + \mu I\big)^{-1} \big(J^{T} E(k)\big) \tag{19} \]

where J is the Jacobian matrix and I is the identity matrix. The parameter μ is updated at each search step: increased if the algorithm is diverging and decreased if it is converging. There are two main advantages of using this parameter: it enforces descending function values in the optimization sequence, and it increases the numerical stability of the algorithm.

The derivative of the error with respect to the hidden weights, v_ih, involves more steps: the target values at the hidden layer are not available, and a chain rule is used to approximate the hidden error. This algorithm is called backpropagation [29].

A major application of neural networks is pattern classification. In this paper, the formant and wavelet information presented in the two previous sections were used as input/output data for the neural network for classification. The total number of inputs used was 12 (five formants and seven entropies).
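As a sketch of how Eqs. (12)–(19) fit together, the following NumPy code performs the forward pass and one Levenberg–Marquardt update of the output weights. It is an illustration under stated assumptions (sigmoid activations, a Jacobian for the output layer only, dense linear algebra), not the authors' implementation; as noted above, the hidden weights additionally require backpropagated error terms.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(X, V, W):
    """Eqs. (12)-(13). X: (M, K) input patterns, V: (H, M) hidden weights,
    W: (N, H) output weights."""
    Z = sigmoid(V @ X)          # hidden outputs z_h(k)
    Y = sigmoid(W @ Z)          # network outputs y_j(k)
    return Z, Y

def lm_output_step(X, D, V, W, mu=0.01):
    """One Levenberg-Marquardt step on W, following Eqs. (17)-(19).
    D: (N, K) desired outputs; errors are stacked into one residual vector."""
    Z, Y = forward(X, V, W)
    E = (Y - D).ravel()                     # stacked errors e_j(k)
    N, H = W.shape
    K = X.shape[1]
    J = np.zeros((N * K, N * H))            # Jacobian of Eq. (18)
    dY = Y * (1.0 - Y)                      # sigmoid derivative at the outputs
    for k in range(K):
        for j in range(N):
            J[j * K + k, j * H:(j + 1) * H] = dY[j, k] * Z[:, k]
    # Search direction of Eq. (19); mu is raised on divergence, lowered otherwise
    s = -np.linalg.solve(J.T @ J + mu * np.eye(N * H), J.T @ E)
    return W + s.reshape(N, H)
```

With the configuration of Table 2, M = 12 inputs, H = 30 hidden nodes, and N = 4 outputs.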
4.2. Probabilistic neural network

Probabilistic neural networks are implementations of statistical algorithms and are generally used as classifiers. These networks are unsupervised feed-forward networks with four layers: input, pattern, summation, and output. A probabilistic function, such as the Gaussian, is used for each pattern node. The network weights are updated according to the input patterns. The patterns are then classified using the nearest-neighborhood function according to the Gaussian classifiers. The mean and variance of each node function can also be updated during training to minimize the distance between the patterns and their closest classifiers. More theoretical details of such networks can be found in [31].
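A minimal version of such a classifier can be sketched as follows; the Gaussian width sigma is a smoothing parameter that the paper does not specify, so its value here is an assumption.

```python
import numpy as np

def pnn_classify(x, train_feats, train_labels, sigma=1.0):
    """Probabilistic-neural-network sketch: one Gaussian kernel per training
    pattern (pattern layer), per-class summation (summation layer), and an
    arg-max decision (output layer). train_feats: (P, 12) feature vectors,
    train_labels: (P,) class labels, x: (12,) test vector."""
    train_labels = np.asarray(train_labels)
    d2 = np.sum((np.asarray(train_feats) - x) ** 2, axis=1)  # squared distances
    k = np.exp(-d2 / (2.0 * sigma ** 2))                     # pattern layer
    classes = np.unique(train_labels)
    scores = np.array([k[train_labels == c].sum() for c in classes])
    return classes[int(np.argmax(scores))]
```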
Table 2
Parameters used for the neural network.

Function                          Description
Network type                      Feed-forward backpropagation
No. of layers                     Three: input, hidden, and output
No. of neurons in layers          12 inputs, 30 hidden, and 4 outputs
Training function                 Levenberg–Marquardt
Performance function (MSE)        10^−5
Differentiable transfer function  Sigmoid
No. of epochs                     200
Maximum validation failures       5
Minimum performance gradient      10^−10
Initial mu                        10^−2
mu increase factor                10
mu decrease factor                0.1
Maximum mu                        10^10

5. Results and discussion

The experimental setup was as follows. Speech signals were recorded via a PC sound card with a spectral frequency of 4000 Hz and a sampling frequency of 8000 Hz. Eighty people participated in the recordings, with each person making a minimum of 20 recordings. Each recording was a partially spoken word emphasizing a particular Arabic vowel (E, A, or O) (see Appendix A). The ages of the speakers varied from 30 to 45 years, and the group included 48 males and 32 females. The recordings were made in normal university office conditions (i.e. in an office with the door closed and with no obvious noise in the surroundings).
Even though the methods proposed for speaker and speech recognition and for diacritics restoration have been maturing over time, they are still inadequate in terms of accuracy [47,48]. The approach presented in this paper is a research study of speaker recognition by means of vowel speech signals. Therefore, the presented study may be considered investigative work aiming to build a system that classifies speakers from only a short part of their speech signal: the separated spoken vowel. This helps greatly when only incomplete word speech signals are available. We addressed the problem with the conventional recognition pipeline (feature extraction followed by classification), which yields a distinctive speaker recognition system driven only by vowel speech signals. The approach is based on a combination of the formants and seven Shannon entropy wavelet packet values as the feature extraction method, with a neural network for classification.

The extracted features (five formants and seven entropies) were calculated for each person from filtered speech signals. Filters using the multistage wavelet enhancement method were used [23]. The input matrix, X, contained n columns (representing the number of speakers). Each column had 12 entries: five formants and seven entropies.

\[ X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{12,1} & x_{12,2} & \cdots & x_{12,n} \end{bmatrix} \tag{20} \]

One way to configure the desired output matrix is to use a number of columns equal to the number of speakers. Binary-decoded columns were used, where the binary value of each column represented the speaker's order. As an example, for six speakers, the first column would be '1000', the second would be '0100', the third would be '1100', . . ., and the sixth column would be '0110'. The matrix used is shown next:

\[ D = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{bmatrix} \tag{21} \]

However, in order to improve the performance of the network, several patterns were recorded for each person (not just one column). Therefore, the final input and output matrices were of the form

\[ X = [\, X_1 \;\; X_2 \;\; \ldots \;\; X_r \,] \tag{22} \]

\[ D = [\, D_1 \;\; D_2 \;\; \ldots \;\; D_r \,] \tag{23} \]

where r is the number of recordings for each person.

The neural network parameters were determined empirically for the best performance results and are provided in Table 2.
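The binary output coding of Eqs. (21)–(23) can be generated mechanically. The sketch below assumes the reading that column s holds the bits of speaker index s, least-significant bit first; this reproduces the quoted example columns '1000', '0100', '1100', and '0110' for speakers 1, 2, 3, and 6, though the fifth column of Eq. (21) as printed differs from this reading, so the exact scheme is an assumption (and four bits limits this particular coding to 15 speakers).

```python
import numpy as np

def target_matrix(num_speakers, num_bits=4):
    """Binary-coded desired-output columns, one per speaker (cf. Eq. (21))."""
    D = np.zeros((num_bits, num_speakers), dtype=int)
    for s in range(1, num_speakers + 1):
        for bit in range(num_bits):
            D[bit, s - 1] = (s >> bit) & 1        # LSB-first bit coding
    return D

# Repeating the block r times gives the final output matrix of Eq. (23):
# D_full = np.tile(target_matrix(n), (1, r))
print(target_matrix(6))
```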
of about 12% over MFCC-NN and 27% over LPC-NN. Those are sig-
X = [ X1 X2 ... Xr ] (22) nificant results showing that the proposed method is more suitable
for identifying speakers when ‘partially-spoken vowels’ are used as
D = [ D1 D2 ... Dr ] (23)
inputs.
where r is the number of recordings for each person. In the next experiments, identification task was performed
The neural network parameters used were determined empiri- where the first half of the signals in each class were used for training
cally for the best performance results and are provided in Table 2. and the second half for testing. The proposed method (i.e. FWENN)
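In code, these verification metrics amount to simple counting; the following sketch reproduces the paper's metric from raw accept/reject counts. For example, the FWENN row of Table 3 (FNE = 5.60%, FPE = 5.23%) gives 100 − (5.60 + 5.23) ≈ 89.16%.

```python
def verification_rate(accepted_imposters, rejected_genuine, total_test_signals):
    """Recognition rate = 100% - (FPE + FNE), per the definitions above."""
    fpe = 100.0 * accepted_imposters / total_test_signals
    fne = 100.0 * rejected_genuine / total_test_signals
    return 100.0 - (fpe + fne), fpe, fne
```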
Table 3
Recognition rates for the verification experiments.

Identification method   Recognition rate [%]   FNE [%]   FPE [%]
FWENN                   89.16                  5.60      5.23
MFCC-NN                 77.32                  14.23     8.11
LPC-NN                  61.88                  15.00     23.12

In the next experiments, the identification task was performed, where the first half of the signals in each class were used for training and the second half for testing. The proposed method (i.e. FWENN) was further compared to two more modern methods that use the wavelet transform: Genetic Wavelet Packet with ANN, denoted GWP-NN [34], and the Discrete Wavelet Transform (at level 5) with the proposed feature extraction method and ANN, denoted DWT-NN.

Table 4
Average recognition rate results.

Identification method   Number of speakers   Recognition rate [%] (vowel-dependent)   Recognition rate [%] (vowel-independent)
FWENN                   80                   90.09                                    82.50
MFCC-NN                 80                   79.66                                    74.03
LPC-NN                  80                   66.63                                    59.45
DWT-NN                  80                   81.44                                    79.32
GWP-NN                  80                   85.47                                    80.07

The results tabulated in Table 4 compare the recognition rates among the five methods. The results indicate that the proposed method is superior, with the highest recognition rates of 90% and 82.5% for the vowel-dependent and vowel-independent cases, respectively.

In the WP case, it was found that the recognition rates improved upon increasing the number of features (by increasing the WP level). However, the improvement implies a trade-off between the recognition rate and the extraction time (see Table 7).

Table 7
Computational complexity, in terms of feature-extraction vector length and simulation time, of the Shannon entropy via WP.

WP level             Level 2   Level 5   Level 7
Simulation time (s)  0.58      1.31      3.40
Vector length        7         63        255
Recognition rate     90.09%    91.04%    91.69%

To further test the FWENN method, the feed-forward neural network was replaced with a probabilistic neural network. Fig. 7 shows the experimental results of the two classification approaches within FWENN. The recognition rates of the probabilistic network reached the lowest values, with an average of 76.56%; the best average recognition rate obtained was 90.09%, for FWENN with the feed-forward network.

[Fig. 7. Recognition rates, per speaker, of the two classification approaches used within FWENN (regular feed-forward, WEFBNN, and probabilistic, WEPNN) for the vowel-dependent system.]

A comparative study of the proposed feature extraction method with other WP-based feature extraction methods was also performed. The eigenvector method with LPC [43] in conjunction with WP (LPC-WP), the Wavelet Packet energy Index Distribution method (WPID) [3], and Sure entropy in conjunction with WP formants at level seven (SWPF) [44] were employed for comparison; for all of these methods, a feed-forward neural network classifier was utilized. Finally, the proposed feature extraction method was also paired with a Support Vector Machines (SVM) classifier (FSVM) [47]. The results were obtained over the whole recorded database. The best recognition rate obtained was 90.09%, for the proposed method (Table 5).

Table 5
A comparison among different WP-based methods.

Identification method   Recognition rate [%]
FWENN                   90.09
LPC-WP                  84.34
WPID                    83.10
SWPF                    80.76
FSVM                    87.21

Another experiment was conducted to assess the performance of the system in noisy environments for the vowel-dependent system. Table 6 summarizes the speaker identification results under white Gaussian noise at reference SNRs of 0 dB and 5 dB. Two approaches were compared in this experimental investigation: FWENN and DWT-NN. The best recognition rates obtained were 48.44% (at 0 dB) and 68.45% (at 5 dB). The reason DWT is competitive with the WP here is that its feature vector is obtained from level 5, where the sub-signals are filtered at a greater depth than in the WP tree at level 2. Our proposed method can overcome this by increasing the number of WP tree levels from 2 to 5 or 7. However, the improvement implies a trade-off between the recognition rate and the increase in the features' dimensionality.

Table 6
Comparison between FWENN and DWT-NN in noisy environments.

Identification method   Recognition rate [%] at 0 dB   Recognition rate [%] at 5 dB
FWENN                   48.44                          68.45
DWT-NN                  46.10                          50.23
The results tabulated in Table 4 can be used to compare recog-
Very few researchers studied the use of formants for speaker
nition rates among five methods. Results indicated that proposed
recognition [56]. This work concentrated on the use of formants of
method is superior with highest recognition rates of 90% and 82.5%
spoken vowels in order to recognize different speakers. Additional
for vowel-dependent and vowel-independent respectively.
features extracted by Shannon entropy enhanced the recognition
In WP case, it was found that the recognition rates improved
rate by a further 7%. However, the improvement implied a trade-off
upon increasing the number of features (by increasing WP level).
between the recognition rate and extraction time.
However, the improvement implies a trade-off between the recog-
The results of different Shannon entropy feature vector lengths
nition rate and extraction time (see Table 7).
by several WP levels are shown in Table 7. Notice that the recog-
To further test the FWENN method, the feed-forward neural
nition rate of the proposed method was enhanced from 90.09% to
network was replaced with a probabilistic neural network. Fig. 7
91.69% when the number of WP nodes increased from 7 to 255
shows the experimental results of the two classification approaches
(from level 2 to 7). The fact that we can get very good performance
within FWENN. The recognition rates for the database of the prob-
with a short vector (i.e. only 12 coefficients) is a major contribution
abilistic network reached the lowest values with average 76.56%.
of this work.
The best average recognition rate selection obtained was of 90.09%
for FWENN.
A comparative study of the proposed feature extraction method 6. Conclusion
with other WP-based feature extraction methods was performed.
The Eigen vector with LPC [43] in conjunction with WP (LPC-WP), In this paper, a new method for speaker recognition (verifica-
Wavelet Packet energy Index Distribution method (WPID) [3], and tion and identification) was described. The method used feature
Sure entropy in conjunction with WP Formants at level seven extractions (formants and Shannon entropy) as inputs to a neural
(SWPF) [44] were employed for comparison. For all these meth- network for classification.
ods feed-forward neural network classifier was utilized. Finally, Power spectrum using Yule–Walker equations was used for
our proposed feature extraction method with Support Vector identifying the formants while wavelet packet was used for
Machines (SVM) classifier (FSVM) [47]. The results were conducted calculating the entropies. Furthermore, two different network

The significance of this work is that the recorded signals used for recognition were vowels. Advantages of using vowels include the ability to recognize speakers when only partially-recorded words are available. This may be useful for people with speaking disabilities, such as deaf-mute persons.

The proposed method, FWENN, was compared to several established methods in the literature (MFCC-NN, LPC-NN, DWT-NN, and GWP-NN) using vowel-dependent and vowel-independent recordings from a combination of 80 speakers under different noise levels. Experimental results showed that the proposed method had a high recognition rate for verification and identification. In comparison to other published methods, the results indicated that the proposed method is superior, with the highest recognition rate.

Acknowledgements

This paper was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, under grant no. 12-135-35-RG. The authors, therefore, acknowledge with thanks the DSR's technical and financial support.

Appendix A. Arabic vowels

Arabic is one of the most important and most broadly spoken languages in the world. An estimated 350 million people distributed all over the world (mainly across 22 Arabic countries) speak Arabic. Arabic is a Semitic language that is characterized by the existence of particular consonants like pharyngeal, glottal, and emphatic consonants. Furthermore, it presents some phonetic and morpho-syntactic particularities. The morpho-syntactic structure is built around pattern roots (CVCVCV, CVCCVC, etc.), as shown in [35].

The Arabic alphabet consists of 28 letters that can be extended to a set of 90 by additional shapes, marks, and vowels. The 28 letters represent the consonants and the long vowels (pronounced /a:/, /i:/, and /u:/). The short vowels and certain other phonetic information, such as consonant doubling (shadda), are not represented by letters but by diacritics. A diacritic is a short stroke placed above or below the consonant. There are three short vowels: fatha, which represents the /a/ sound and is an oblique dash over a letter; damma, which represents the /u/ sound and has the shape of a comma over a letter; and kasra, which represents the /i/ sound and is an oblique dash under a letter. The long and short vowels are presented in Tables A.1 and A.2.

Table A.1
Long Arabic vowels.

Long vowel name   Connected with the letter ب (sounds of B)   Pronunciation
Alef              با                                          /baa/
Waw               بو                                          /buu/
Yaa               بي                                          /bii/

Table A.2
Short Arabic vowels.

Short vowel name (diacritic)   Diacritic above or below the letter ب (sounds of B)   Pronunciation
Fatha                          بَ                                                      /ba/
Damma                          بُ                                                      /bu/
Kasra                          بِ                                                      /bi/
Tanween Alfath                 بً                                                      /ban/
Tanween Aldam                  بٌ                                                      /bun/
Tanween Alkasr                 بٍ                                                      /bin/
Sokun                          بْ                                                      /b/

References

[1] D. Avci, An expert system for speaker identification using adaptive wavelet sure entropy, Expert Syst. Appl. 36 (2009) 6295–6300.
[2] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models, Digit. Signal Process. 10 (1–3) (2000) 19–41.
[3] J.-D. Wu, B.-F. Lin, Speaker identification using discrete wavelet packet transform technique with irregular decomposition, Expert Syst. Appl. 36 (2009) 3136–3143.
[4] R. Sarikaya, B.L. Pellom, J.H.L. Hansen, Wavelet packet transform features with application to speaker identification, in: Proceedings of the IEEE Nordic Signal Processing Symposium, 1998, pp. 81–84.
[5] E.S. Fonseca, R.C. Guido, P.R. Scalassara, C.D. Maciel, J.C. Pereira, Wavelet time–frequency analysis and least squares support vector machines for the identification of voice disorders, Comput. Biol. Med. 37 (2007) 571–578.
[6] R.R. Coifman, M.L. Wickerhauser, Entropy based algorithms for best basis selection, IEEE Trans. Inf. Theory 32 (1992) 712–718.
[7] E. Visser, M. Otsuka, T. Lee, A spatio-temporal speech enhancement scheme for robust speech recognition in noisy environments, Speech Commun. 41 (1992) 393–407.
[8] J. Malkin, X. Li, J. Bilmes, A Graphical Model for Formant Tracking, SSLI Lab, Department of Electrical Engineering, University of Washington, Seattle, 2005.
[9] M.P. Gelfer, V.A. Mikos, The relative contributions of speaking fundamental frequency and formant frequencies to gender identification based on isolated vowels, J. Voice 19 (4) (2007) 544–554.
[10] J. Bachorowski, M. Owren, Acoustic correlates of talker sex and individual talker identity are present in a short vowel segment produced in running speech, J. Acoust. Soc. Am. 106 (2) (1999) 1054–1063.
[11] X. Huang, A. Acero, H.-W. Hon, Spoken Language Processing, Prentice Hall PTR, 2001.
[12] S. Kadambe, G.F. Boudreaux-Bartels, Application of the wavelet transform for pitch detection of speech signals, IEEE Trans. Inf. Theory 32 (March) (1992) 712–718.
[13] A. Acero, Formant analysis and synthesis using hidden Markov models, in: Proc. Eur. Conf. Speech Communication Technology, 1999.
[14] L. Deng, A. Bazzi, A. Acero, Tracking vocal tract resonances using an analytical nonlinear predictor and a target-guided temporal constraint, in: Proc. Eur. Conf. Speech Communication Technology, 2003.
[15] L. Deng, L. Lee, H. Attias, A. Acero, A structured speech model with continuous hidden dynamics and prediction residual training for tracking vocal tract resonances, in: IEEE ICASSP, 2004.
[16] C.A. Norton, S.A. Zahorian, Speaker verification based on speaker position in a multidimensional speaker identification space, in: Intelligent Engineering Systems Through Artificial Neural Networks, vol. 5, ASME Press, New York, 1995, pp. 739–744.
[17] M. Zaki, A. Ghalwash, A. Elkouny, Speaker recognition system using a cascade neural network, Int. J. Neural Syst. 7 (1996) 203–212.
[18] M.W. Mak, S.Y. Kung, Estimation of elliptical basis function parameters by EM algorithm with application to speaker recognition, IEEE Trans. Neural Netw. 11 (2000) 961–969.
[19] M. Homayounpour, G. Chollet, Neural nets approach to speaker verification, in: Proceedings of ICASSP '95 (International Conference on Acoustics, Speech and Signal Processing), 1995, pp. 335–356.
[20] N.P. Reddy, O.A. Buch, Speaker verification using committee neural network, Comput. Methods Programs Biomed. 72 (2003) 109–115.
[21] N.P. Reddy, D. Prabhu, S. Palreddy, V. Gupta, S. Suryanarayanan, E.P. Canilang, Redundant neural networks for reliable diagnosis: applications to dysphagia diagnosis, in: C. Dagli, A. Akay, C. Philips, B. Fernadez, J. Ghosh (Eds.), Intelligent Engineering Systems Through Artificial Neural Networks, vol. 5, ASME Press, New York, 1995, pp. 739–744.
[22] A. Das, N.P. Reddy, J. Narayanan, Hybrid fuzzy logic committee neural networks for recognition of swallow acceleration signals, Comput. Methods Programs Biomed. 64 (2001) 87–99.
[23] K. Daqrouq, I. Abu Sbeih, O. Daoud, E. Khalaf, An investigation of speech enhancement using wavelet filtering method, Int. J. Speech Technol. 13 (2) (2010) 101–115.
[24] K. Daqrouq, E. Khalaf, A. Al-Qawasmi, T. Abu-Hilal, Wavelet formants speaker identification based system via neural network, Int. J. Recent Trends Eng. 2 (5) (2009).
[25] W. Al-Sawalmeh, K. Daqrouq, O. Daoud, A. Al-Qawasmi, Speaker identification system-based Mel frequency and wavelet transform using neural network classifier, Eur. J. Sci. Res. 41 (4) (2010) 515–525.
[26] A. Cherif, L. Bouafif, T. Dabbabi, Pitch detection and formants analysis of Arabic speech processing, Appl. Acoust. 62 (2001) 1129–1140.
[27] E. Avci, D. Hanbay, A. Varol, An expert discrete wavelet adaptive network based fuzzy inference system for digital modulation recognition, Expert Syst. Appl. 33 (2006) 582–589.
[28] E. Avci, A new optimum feature extraction and classification method for speaker recognition: GWPNN, Expert Syst. Appl. 32 (2007) 485–498.
[29] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company, New York, 1994.
[30] R. Isermann, M. Munchhof, Identification of Dynamic Systems: An Introduction with Applications, Springer-Verlag, Berlin, Heidelberg, 2011.

[31] T. Ganchev, D. Tasoulis, M. Vrahatis, D. Fakotakis, Generalized locally recurrent probabilistic neural networks with application to text-independent speaker verification, Neurocomputing 70 (2007) 1424–1438.
[32] T. Ganchev, N. Fakotakis, G. Kokkinakis, Comparative evaluation of various MFCC implementations on the speaker verification task, in: Proceedings of the SPECOM-2005, vol. 1, 2005, pp. 191–194.
[33] Y. Bennani, P. Gallinari, Neural networks for discrimination and modelization of speakers, Speech Commun. 17 (1995) 159–175.
[34] A. Engin, A new optimum feature extraction and classification method for speaker recognition: GWPNN, Expert Syst. Appl. 32 (2007) 485–498.
[35] I. Zitouni, R. Sarikaya, Arabic diacritic restoration approach based on maximum entropy models, Comput. Speech Lang. 23 (2009) 257–276.
[36] R. Behroozmand, F. Almasganj, Optimal selection of wavelet-packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis, Comput. Biol. Med. 37 (2007) 474–485.
[37] D. Mashao, M. Skosan, Combining classifier decisions for robust speaker identification, Pattern Recognit. 39 (January (1)) (2006).
[38] R.V. Pawar, P.P. Kajave, S.N. Mali, Speaker identification using neural networks, Proc. World Acad. Sci. Eng. Technol. 7 (August) (2005).
[39] S. Chakroborty, G. Saha, Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter, Int. J. Signal Process. 5 (1) (2009).
[40] H.B. Kekre, V. Kulkarni, Comparative analysis of speaker identification using row mean of DFT, DCT, DST and Walsh transforms, Int. J. Comput. Sci. Inf. Secur. 9 (1) (2011).
[41] H. Kekre, V. Kulkarni, Speaker identification using row mean of DCT and Walsh Hadamard transform, Int. J. Comput. Sci. Eng. (2011) 6–12.
[42] S. Singh, E.G. Rajan, Vector quantization approach for speaker recognition using MFCC and inverted MFCC, Int. J. Comput. Appl. 17 (March (1)) (2011) 1–7.
[43] S. Uchida, M.A. Ronee, H. Sakoe, Using eigen-deformations in handwritten character recognition, in: Proceedings of the 16th ICPR, vol. 1, 2002, pp. 572–575.
[44] K. Daqrouq, Wavelet entropy and neural network for text-independent speaker identification, Eng. Appl. Artif. Intell. 24 (2011) 796–802.
[46] S. Rahati Quchani, K. Rahbar, Discrete word speech recognition using hybrid self-adaptive HMM/SVM classifier, J. Tech. Eng. 1 (2) (2007) 79–90.
[47] P. Rama Koteswara Rao, Pitch and pitch strength based effective speaker recognition: a technique by blending of MPCA and SVM, Am. J. Sci. Res. 59 (2012) 11–22.
[48] Y. Alotaibi, A. Hussain, Speech recognition system and formant based analysis of spoken Arabic vowels, in: Proceedings of the First International Conference, FGIT, Jeju Island, Korea, December 10–12, 2009.
[49] Y. Alotaibi, A. Hussain, Formant based analysis of spoken Arabic vowels, in: Proceedings BioID MultiComm, Madrid, Spain, 2009.
[50] F. Nolan, C. Grigoras, A case for formant analysis in forensic speaker identification, Speech Lang. Law 12 (2) (2005) 143–173.
[51] C. Grigoras, Forensic voice analysis based on long term formant distributions, in: 4th European Academy of Forensic Science Conference, June 2006.
[52] P. Rose, Forensic Speaker Identification, Taylor & Francis, London, 2002.
[53] J. Nirmal, M. Zaveri, S. Patnaik, P. Kachare, Voice conversion using General Regression Neural Network, Appl. Soft Comput. 24 (November) (2014) 1–12.
[54] X. Hong, S. Chen, A. Qatawneh, K. Daqrouq, M. Sheikh, A. Morfeq, A radial basis function network classifier to maximize leave-one-out mutual information, Appl. Soft Comput. 23 (October) (2014) 9–18.
[55] S. Sarkar, K. Sreenivasa Rao, Stochastic feature compensation methods for speaker verification in noisy environments, Appl. Soft Comput. 19 (June) (2014) 198–214.
[56] V. Asadpour, M.M. Homayounpour, F. Towhidkha, Audio–visual speaker identification using dynamic facial movements and utterance phonetic content, Appl. Soft Comput. 11 (March (2)) (2011) 2083–2093.
[57] R.H. Laskar, D. Chakrabarty, F.A. Talukdar, K. Sreenivasa Rao, K. Banerjee, Comparing ANN and GMM in a voice conversion framework, Appl. Soft Comput. 12 (November (11)) (2012) 3332–3342.
[58] R. Halavati, S.B. Shouraki, S.H. Zadeh, Recognition of human speech phonemes using a novel fuzzy approach, Appl. Soft Comput. 7 (June (3)) (2007) 828–839.
[59] M. Sarma, K.K. Sarma, An ANN based approach to recognize initial phonemes of spoken words of Assamese language, Appl. Soft Comput. 13 (May (5)) (2013) 2281–2291.
[60] S. Jothilakshmi, Automatic system to detect the type of voice pathology, Appl. Soft Comput. 21 (August) (2014) 244–249.
