Speaker identification using vowels features through a combined method of formants, wavelets, and neural network classifiers
K. Daqrouq, T.A. Tutunji, Applied Soft Computing 27 (2015) 231-239
Article history:
Received 15 December 2011
Received in revised form 3 November 2014
Accepted 18 November 2014
Available online 26 November 2014

Keywords:
Speaker verification and identification
Wavelet packet
Neural networks
Formants

Abstract

This paper proposes a new method for speaker feature extraction based on Formants, Wavelet Entropy and Neural Networks denoted as FWENN. In the first stage, five formants and seven Shannon entropy wavelet packet features are extracted from the speakers' signals as the speaker feature vector. In the second stage, these 12 feature extraction coefficients are used as inputs to feed-forward neural networks. A probabilistic neural network is also proposed for comparison. In contrast to conventional speaker recognition methods that extract features from sentences (or words), the proposed method extracts the features from vowels. Advantages of using vowels include the ability to recognize speakers when only partially-recorded words are available. This may be useful for deaf-mute persons or when the recordings are damaged. Experimental results show that the proposed method succeeds in the speaker verification and identification tasks with high classification rate. This is accomplished with minimum amount of information, using only 12 coefficient features (i.e. vector length) and only one vowel signal, which is the major contribution of this work. The results are further compared to well-known classical algorithms for speaker recognition and are found to be superior.

© 2014 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.asoc.2014.11.016
1568-4946/© 2014 Elsevier B.V. All rights reserved.
using Power Spectrum Density (PSD) and entropies using WP. These concepts will be explained in Section 3. Classification was done using neural networks and will be explained in Section 4.

3. Features' extraction

Periodic excitation is seen in the spectrum of certain sounds, especially vowels. The speech organs form certain shapes to produce the vowel sound, and therefore regions of resonance and anti-resonance are formed in the vocal tract. The location of these resonances in the frequency spectrum depends on the form and shape of the vocal tract. Since the physical structure of the speech organs is a characteristic of each speaker, differences among speakers can also be found in the position of their formant frequencies. These resonances affect the overall spectrum shape and are referred to as formants. A few of these formant frequencies can be sampled at an appropriate rate and used for speaker recognition. These features are normally used in combination with other features. Unlike our previous works that investigated DWT for feature extraction [24,25], in this work the formants are used with the WP entropy to identify the speakers.

This paper proposes to use formants and WP parameters as inputs to ANN for speaker classification. Therefore, it is necessary to introduce the two concepts: feature extraction by formants and WP algorithms. The discussion of these concepts will be limited to their use in speaker identification.

The power spectrum can be found by taking the Fourier Transform of the Autocorrelation Function (ACF):

P_x(e^{j\omega}) = \sum_{n=-\infty}^{\infty} r_x(n) e^{-j\omega n}     (1)

where r_x(n) is the ACF of the signal x(n). The autocorrelation values are estimated from a finite data record, x(n) for 0 <= n <= N - 1, and the estimate is defined as

\hat{r}_x(k) = \frac{1}{N} \sum_{n=0}^{N-1-k} x(n + k) x^{*}(n),   k = 0, 1, ..., p     (2)

Eq. (1) is one estimate of the PSD, but it has some disadvantages that include excessive estimation variance. A better estimate is the Yule-Walker method.

The Yule-Walker method estimates the PSD of the input using the Yule-Walker AR method. The concept is to minimize the forward prediction error by fitting an AR model to the windowed input data. This method is also called the autocorrelation method [30]. The PSD is estimated using the following

P_x(e^{j\omega}) = \frac{|b(0)|^2}{\left| 1 + \sum_{k=1}^{p} a(k) e^{-j\omega k} \right|^2}     (3)

The parameters a(k) and b(0) can be found from the autocorrelation estimates and will be described next. The AR model is described in the equation

x(n) = -\sum_{k=1}^{p} a(k)\, x(n - k) + b(0)\, w(n)
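As an illustration of Eqs. (1)-(3), the following sketch estimates the Yule-Walker PSD from the biased autocorrelation values. It is illustrative only and not the authors' code; the AR order of 12 and the size of the frequency grid are assumptions made here, not values taken from the paper.

    import numpy as np
    from scipy.linalg import solve_toeplitz

    def yule_walker_psd(x, order=12, n_freq=512):
        """AR(order) power spectrum of a 1-D real signal x (cf. Eqs. (1)-(3))."""
        x = np.asarray(x, dtype=float)
        N = len(x)
        # Biased autocorrelation estimate r_hat(k), k = 0..order (Eq. (2)).
        r = np.array([np.dot(x[k:], x[:N - k]) / N for k in range(order + 1)])
        # Yule-Walker (normal) equations: Toeplitz(r[0..p-1]) a = -r[1..p].
        a = solve_toeplitz((r[:-1], r[:-1]), -r[1:])
        # Driving-noise power |b(0)|^2, i.e. the prediction-error variance.
        b0_sq = r[0] + np.dot(a, r[1:])
        # Evaluate Eq. (3) on n_freq frequencies in [0, pi).
        w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
        A = 1.0 + np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a
        return w, b0_sq / np.abs(A) ** 2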
[Table 1. Comparison among formants (indexes) and magnitudes: five Index/Mag. column pairs, one per speaker.]

Table 1 lists the formants calculations for five speakers. Note that the indexes vary among different speakers, but are consistent for each speaker.

[Fig. 3. PSD result showing the five formants of Arabic vowels for two speakers: A and B (vowel-independent). Speakers A and B are both male. In A1, A2, A3, A4 four signals of one speaker were used, and in B1, B2, B3, B4 four signals of another speaker were used.]
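Given such a PSD, formant indexes and magnitudes of the kind reported in Table 1 could, for example, be obtained by simple peak picking on the spectrum. The sketch below builds on the yule_walker_psd function above; the use of scipy.signal.find_peaks and the selection of the five strongest peaks are illustrative assumptions rather than the authors' procedure, while the 8 kHz sampling rate follows Section 5.

    import numpy as np
    from scipy.signal import find_peaks

    def formant_features(x, fs=8000.0, n_formants=5, order=12, n_freq=512):
        """Indexes, frequencies (Hz) and magnitudes of the strongest PSD peaks."""
        w, psd = yule_walker_psd(x, order=order, n_freq=n_freq)  # sketch above
        peaks, props = find_peaks(psd, height=0.0)               # all local maxima
        keep = peaks[np.argsort(props["peak_heights"])[::-1][:n_formants]]
        keep = np.sort(keep)                                     # order by frequency
        freqs_hz = w[keep] * fs / (2.0 * np.pi)                  # rad/sample -> Hz
        return keep, freqs_hz, psd[keep]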
\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\!\left(\frac{t - b}{a}\right)     (7)

[Fig. 4. The WP recursive binary decomposition tree: node (0,0) splits into nodes (1,0) and (1,1), and so on.]

The WP uses a recursive binary tree, as shown in Fig. 4, for the recursive decomposition of the data. A pair of low-pass and high-pass filters, denoted as h[n] and g[n], respectively, are used to generate two sequences, with the purpose of capturing different frequency sub-band features of the original signal. The two wavelet orthogonal bases generated from a previous node are defined as:

\psi_{j+1}^{2p}(k) = \sum_{n=-\infty}^{\infty} h[n]\, \psi_{j}^{p}(k - 2^{j} n)     (9)

\psi_{j+1}^{2p+1}(k) = \sum_{n=-\infty}^{\infty} g[n]\, \psi_{j}^{p}(k - 2^{j} n)     (10)

where \psi[n] is the wavelet function, while j and p are the number of decomposition levels and the number of nodes in the previous level, respectively [3,44]. In this study, WPT is applied at the feature extraction stage, but the large amount of data might cause some difficulties. Therefore, a better representation for the speech features is needed and is explained next.

For a given orthogonal wavelet function, a library of WP bases is generated. Each of these bases proposes a particular way of coding signals, maintaining global energy and reconstructing exact features. The WP is used to extract extra features to guarantee a higher recognition rate. In this work, WPT is used at the feature extraction stage, but this data is not suitable for the classifier due to its great length. Therefore, there is a need to find a better representation for the speech features.

The WP features' extraction method can be summarized as follows: the speech signal is decomposed with the WP, the Shannon entropy of each node is computed, and the resulting entropies are combined with the formants to form the speaker feature vector.
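A rough sketch of such a node-wise entropy computation with PyWavelets is given below. The level-2 tree with its seven nodes (root included) follows the configuration used in Section 5, but the db4 wavelet, the symmetric signal extension, and the log-energy form of the Shannon entropy are assumptions made here for illustration; they are not prescribed by the text.

    import itertools
    import numpy as np
    import pywt

    def shannon_entropy(c, eps=1e-12):
        """Log-energy ('Shannon') entropy of one coefficient vector."""
        e = np.asarray(c, dtype=float) ** 2
        return float(-np.sum(e * np.log(e + eps)))

    def wp_entropy_features(x, wavelet="db4", level=2):
        """Entropy of every WP node from the root down to `level` (7 values for level 2)."""
        wp = pywt.WaveletPacket(data=x, wavelet=wavelet, mode="symmetric", maxlevel=level)
        coeffs = [np.asarray(x, dtype=float)]              # root node (0,0) is the signal itself
        for lev in range(1, level + 1):
            for p in itertools.product("ad", repeat=lev):  # 'a'/'d' = low-/high-pass branch
                coeffs.append(wp["".join(p)].data)
        return np.array([shannon_entropy(c) for c in coeffs])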
[Fig. 5. Feed-forward multi-layer architecture: inputs x_1, ..., x_M; hidden units z_1, ..., z_H; outputs y_1, ..., y_N; weights v_ih (input to hidden) and w_hj (hidden to output).]

... node j). The outputs of the hidden and output layers at each pattern, k, are given by

z_h(k) = f\!\left( \sum_{i=1}^{M} v_{ih}\, x_i(k) \right),   h = 1, ..., H     (12)

y_j(k) = g\!\left( \sum_{h=1}^{H} w_{hj}\, z_h(k) \right),   j = 1, ..., N     (13)

where f and g are the activation functions. The most commonly used activation functions are the sigmoidal and hyper-tangent. Let the desired outputs be d_1, d_2, ..., d_N. The learning objective is to determine the weight values that minimize the difference between the desired and network outputs for all patterns. Let the error criterion be defined as follows:

SSE = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{N} (e_j(k))^2 = \frac{1}{K} \sum_{k=1}^{K} \sum_{j=1}^{N} (y_j(k) - d_j(k))^2     (14)

where k refers to the pattern number and j refers to the output node number. The weights are updated recursively

w_{hj}(iter + 1) = w_{hj}(iter) + s(iter)     (15)

v_{ih}(iter + 1) = v_{ih}(iter) + s(iter)     (16)

Here s(iter) is the search direction at the specified iteration. An efficient update is the use of the Levenberg-Marquardt method to find the search direction. This is provided next for the output weights w_{hj} [30]:

E(k) = \sum_{j=1}^{N} (y_j(k) - d_j(k))^2     (17)
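For concreteness, Eqs. (12)-(14) can be written in a few lines of NumPy. The sketch below uses the layer sizes of Table 2 (12 inputs, 30 hidden neurons, 4 outputs) with random placeholder weights and sigmoidal activations; the Levenberg-Marquardt search direction s(iter) of Eqs. (15)-(17) is not implemented here and would normally be supplied by a library optimizer.

    import numpy as np

    rng = np.random.default_rng(0)
    M, H, N = 12, 30, 4                        # inputs, hidden units, outputs (Table 2)
    V = rng.normal(scale=0.1, size=(M, H))     # v_ih: input -> hidden weights
    W = rng.normal(scale=0.1, size=(H, N))     # w_hj: hidden -> output weights

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def forward(X):
        """X has shape (K, M), one row per pattern k; returns hidden and output activations."""
        Z = sigmoid(X @ V)                     # Eq. (12): z_h(k)
        Y = sigmoid(Z @ W)                     # Eq. (13): y_j(k)
        return Z, Y

    def sse(Y, D):
        """Eq. (14): sum of squared output errors averaged over the K patterns."""
        return float(np.sum((Y - D) ** 2) / Y.shape[0])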
... between the patterns and their closest classifiers. More theoretical details of such networks can be found in [31].

5. Results and discussion

The experimental setup was as follows. Speech signals were recorded via a PC sound card with a spectral frequency of 4000 Hz and a sampling frequency of 8000 Hz. Eighty people participated in the recordings, with each person recording a minimum of 20 times. Each recording was a partially spoken word emphasizing a particular Arabic vowel (E, A and O) (see Appendix A). The age of the speakers varied from 30 to 45 years and included 48 males and 32 females. The recording process was carried out in normal university office conditions (i.e. recording in an office with the door closed and with no obvious noisy surroundings).

Table 2. Parameters used for the neural network.
Network type: Feed-forward back propagation
No. of layers: three (input, hidden, and output)
No. of neurons in layers: 12 inputs, 30 hidden, and 4 outputs
Training function: Levenberg-Marquardt
Performance function (mse): 10^-5
Differentiable transfer function: Sigmoidal
No. of epochs: 200
Maximum validation failures: 5
Minimum performance gradient: 10^-10
Initial mu: 10^-2
mu increase factor: 10
mu decrease factor: 0.1
Maximum mu: 10^10
Even though the methods proposed for speaker and speech recognition and for diacritics restoration have been maturing over time, they are still inadequate in terms of accuracy [47,48]. In this paper we propose a research study of speaker recognition by means of vowel speech signals. The presented study may therefore be considered an investigation aiming to build a system that classifies speakers from only a short part of their speech signal, namely the separated spoken vowel. This helps greatly when only incomplete word recordings are available. We addressed the problem with the conventional recognition pipeline (feature extraction followed by classification), which yields a distinctive speaker recognition system driven only by vowel signals. The approach is based on a combination of the formants and seven Shannon entropy wavelet packet coefficients as the feature extraction method, with a neural network for classification.

The extracted features (five formants and seven entropies) were calculated for each person from filtered speech signals; the signals were filtered using the multistage wavelet enhancement method [23]. The input matrix, X, contained n columns (representing the number of speakers). Each column had 12 entries: five formants and seven entropies.

X = [ x_{1,1}   x_{1,2}   ...  x_{1,n}
      x_{2,1}   x_{2,2}   ...  x_{2,n}
      ...       ...       ...  ...
      x_{12,1}  x_{12,2}  ...  x_{12,n} ]     (20)

One way to configure the desired output matrix is to use a number of columns equal to the number of speakers. Binary-decoded columns were used, where the binary value of each column represented the speaker's order. As an example, for six speakers, the first column would be '1000', the second would be '0100', the third would be '1100', ..., and the sixth column would be '0110'. The matrix used is shown next

D = [ 1 0 1 0 1 0
      0 1 1 0 0 1
      0 0 0 1 0 1
      0 0 0 0 1 0 ]     (21)

However, in order to improve the performance of the network, several patterns were recorded for each person (not just one column). Therefore, the final input and output matrices were of the form

X = [ X_1  X_2  ...  X_r ]     (22)

D = [ D_1  D_2  ...  D_r ]     (23)

where r is the number of recordings for each person.

The neural network parameters used were determined empirically for the best performance results and are provided in Table 2.

[Fig. 6. Network training convergence.]

The recorded signals were matched with the speaker's identity and were used as input/output pairs as explained in Section 4. A typical network convergence run is shown in Fig. 6.

The conducted experiments tackled two main recognition tasks: the verification task and the identification task. All the conducted experiments used a database with 80 speakers. For each speaker, the data was divided between training and testing data.

For verification [55], the system was tested for its accuracy and effectiveness on the 80 speakers (i.e. classes). At each run, 50% of the signals were used from the chosen class (i.e. correct speaker) while the other 50% of the signals were used from another class (i.e. imposter). The verification recognition rates were determined with regard to the False Positive Error (FPE), which refers to the total number of accepted imposter testing signals divided by the total number of testing signals, and the False Negative Error (FNE), which refers to the number of rejected genuine testing signals divided by the total number of testing signals. The recognition rate was calculated as 100% minus the total of FPE and FNE. We would like to note here that the proposed method had a 100% recognition rate on the training data.

The results are compared to two established methods in the literature: MFCC [32] with ANN, referred to as MFCC-NN, and LPC with ANN, referred to as LPC-NN [33]. The results are tabulated in Table 3. Note that FWENN had better recognition rates: an increase of about 12% over MFCC-NN and 27% over LPC-NN. These are significant results showing that the proposed method is more suitable for identifying speakers when 'partially-spoken vowels' are used as inputs.
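Returning to the input and target matrices of Eqs. (20)-(23) above, the following sketch shows one way they might be assembled. It assumes the least-significant-bit-first binary coding implied by the six-speaker example ('1000', '0100', '1100', ...); this is an illustration, not the authors' implementation.

    import numpy as np

    def target_matrix(n_speakers, n_bits=4):
        """Binary-coded targets: column s holds the bits of speaker index s, LSB first (Eq. (21))."""
        D = np.zeros((n_bits, n_speakers), dtype=int)
        for s in range(1, n_speakers + 1):
            for b in range(n_bits):
                D[b, s - 1] = (s >> b) & 1
        return D

    def input_matrix(feature_vectors):
        """Stack one 12-element column (5 formants + 7 entropies) per recording (Eqs. (20), (22))."""
        return np.column_stack(feature_vectors)

    # For six speakers the first, second, third and sixth columns read 1000, 0100, 1100 and 0110.
    print(target_matrix(6))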
[Table 3. Recognition rate for verification of experiment results; columns: Recognition rate [%], FNE [%], FPE [%].]

[Table 5. A comparison among different WP-based methods; columns: Method, Recognition rate [%], Identification method.]

[Fig. 7. The experimental results of two different classification approaches (regular feed-forward and probabilistic) used in the experimental investigation for comparison of the vowel-dependent system; x-axis: Speaker Number, curves labeled WEPNN and WEFBNN.]

In the next experiments, the identification task was performed, where the first half of the signals in each class were used for training and the second half for testing. The proposed method (i.e. FWENN) was further compared to two more modern methods that use the wavelet transform: Genetic Wavelet Packet with ANN, denoted by GWP-NN [34], and Discrete Wavelet Transform (at level 5) with the proposed feature extraction method and ANN, denoted by DWT-NN.

The results tabulated in Table 4 can be used to compare recognition rates among five methods. Results indicated that the proposed method is superior, with highest recognition rates of 90% and 82.5% for the vowel-dependent and vowel-independent cases, respectively.

In the WP case, it was found that the recognition rates improved upon increasing the number of features (by increasing the WP level). However, the improvement implies a trade-off between the recognition rate and extraction time (see Table 7).

To further test the FWENN method, the feed-forward neural network was replaced with a probabilistic neural network. Fig. 7 shows the experimental results of the two classification approaches within FWENN. The recognition rates of the probabilistic network over the database reached the lowest values, with an average of 76.56%. The best average recognition rate obtained was 90.09% for FWENN.

A comparative study of the proposed feature extraction method with other WP-based feature extraction methods was performed. The Eigen vector with LPC [43] in conjunction with WP (LPC-WP), the Wavelet Packet energy Index Distribution method (WPID) [3], and Sure entropy in conjunction with WP formants at level seven (SWPF) [44] were employed for comparison. For all these methods a feed-forward neural network classifier was utilized. Finally, our proposed feature extraction method was also combined with a Support Vector Machines (SVM) classifier (FSVM) [47]. The experiments were conducted over the whole recorded database. The best recognition rate obtained was 90.09% for the proposed method (Table 5).

Another experiment was conducted to assess the performance of the system in noisy environments for the vowel-dependent system. Table 6 summarizes the results of speaker identification corresponding to white Gaussian noise with SNR references of 0 dB and 5 dB. Two approaches were used in the experimental investigation for comparison: FWENN and DWT-NN. The best recognition rates obtained were 48.44% (at 0 dB) and 68.45% (at 5 dB) for DWT-NN. The reason DWT is more successful than WP here is that its feature vector is obtained from level 5, where the sub-signals are filtered at a greater depth than in WP at level 2. Our proposed method could easily overcome this problem by increasing the number of WP tree levels from 2 to 5 or 7. However, the improvement implies a trade-off between the recognition rate and increasing the features' dimensionality.

Very few researchers have studied the use of formants for speaker recognition [56]. This work concentrated on the use of formants of spoken vowels in order to recognize different speakers. Additional features extracted by Shannon entropy enhanced the recognition rate by a further 7%. However, the improvement implied a trade-off between the recognition rate and extraction time.

The results of different Shannon entropy feature vector lengths obtained with several WP levels are shown in Table 7. Notice that the recognition rate of the proposed method was enhanced from 90.09% to 91.69% when the number of WP nodes increased from 7 to 255 (from level 2 to 7). The fact that we can get very good performance with a short vector (i.e. only 12 coefficients) is a major contribution of this work.

6. Conclusion

In this paper, a new method for speaker recognition (verification and identification) was described. The method used feature extractions (formants and Shannon entropy) as inputs to a neural network for classification.

Power spectrum using Yule-Walker equations was used for identifying the formants, while wavelet packet was used for calculating the entropies. Furthermore, two different network ...
References

[31] T. Ganchev, D. Tasoulis, M. Vrahatis, D. Fakotakis, Generalized locally recurrent probabilistic neural networks with application to text-independent speaker verification, Neurocomputing 70 (2007) 1424-1438.
[32] T. Ganchev, N. Fakotakis, G. Kokkinakis, Comparative evaluation of various MFCC implementations on the speaker verification task, in: Proceedings of the SPECOM-2005, vol. 1, 2005, pp. 191-194.
[33] Y. Bennani, P. Gallinari, Neural networks for discrimination and modelization of speakers, Speech Commun. 17 (1995) 159-175.
[34] A. Engin, A new optimum feature extraction and classification method for speaker recognition: GWPNN, Expert Syst. Appl. 32 (2007) 485-498.
[35] I. Zitouni, R. Sarikaya, Arabic diacritic restoration approach based on maximum entropy models, Comput. Speech Lang. 23 (2009) 257-276.
[36] R. Behroozmand, F. Almasganj, Optimal selection of wavelet-packet-based features using genetic algorithm in pathological assessment of patients' speech signal with unilateral vocal fold paralysis, Comput. Biol. Med. 37 (2007) 474-485.
[37] D. Mashao, M. Skosan, Combining classifier decisions for robust speaker identification, Pattern Recognit. 39 (January (1)) (2006).
[38] R.V. Pawar, P.P. Kajave, S.N. Mali, Speaker identification using neural networks, Proc. World Acad. Sci. Eng. Technol. 7 (August) (2005).
[39] S. Chakroborty, G. Saha, Improved text-independent speaker identification using fused MFCC & IMFCC feature sets based on Gaussian filter, Int. J. Signal Process. 5 (1) (2009).
[40] H.B. Kekre, V. Kulkarni, Comparative analysis of speaker identification using row mean of DFT, DCT, DST and Walsh transforms, Int. J. Comput. Sci. Inf. Secur. 9 (1) (2011).
[41] H. Kekre, V. Kulkarni, Speaker identification using row mean of DCT and Walsh Hadamard transform, Int. J. Comput. Sci. Eng. (2011) 6-12.
[42] S. Singh, E.G. Rajan, Vector quantization approach for speaker recognition using MFCC and inverted MFCC, Int. J. Comput. Appl. 17 (March (1)) (2011) 1-7.
[43] S. Uchida, M.A. Ronee, H. Sakoe, Using eigen-deformations in handwritten character recognition, in: Proceedings of the 16th ICPR, vol. 1, 2002, pp. 572-575.
[44] K. Daqrouq, Wavelet entropy and neural network for text-independent speaker identification, Eng. Appl. Artif. Intell. 24 (2011) 796-802.
[46] S. Rahati Quchani, K. Rahbar, Discrete word speech recognition using hybrid self-adaptive HMM/SVM classifier, J. Tech. Eng. 1 (2) (2007) 79-90.
[47] P. Rama Koteswara Rao, Pitch and pitch strength based effective speaker recognition: a technique by blending of MPCA and SVM, Am. J. Sci. Res. 59 (2012) 11-22.
[48] Y. Alotaibi, A. Hussain, Speech recognition system and formant based analysis of spoken Arabic vowels, in: Proceedings of the First International Conference, FGIT, Jeju Island, Korea, December 10-12, 2009.
[49] Y. Alotaibi, A. Hussain, Formant based analysis of spoken Arabic vowels, in: Proceedings BioID MultiComm, Madrid, Spain, 2009.
[50] F. Nolan, C. Grigoras, A case for formant analysis in forensic speaker identification, Speech Lang. Law 12 (2) (2005) 143-173.
[51] C. Grigoras, Forensic voice analysis based on long term formant distributions, in: 4th European Academy of Forensic Science Conference, June 2006.
[52] P. Rose, Forensic Speaker Identification, Taylor & Francis, London, 2002.
[53] J. Nirmal, M. Zaveri, S. Patnaik, P. Kachare, Voice conversion using General Regression Neural Network, Appl. Soft Comput. 24 (November) (2014) 1-12.
[54] X. Hong, S. Chen, A. Qatawneh, K. Daqrouq, M. Sheikh, A. Morfeq, A radial basis function network classifier to maximize leave-one-out mutual information, Appl. Soft Comput. 23 (October) (2014) 9-18.
[55] S. Sarkar, K. Sreenivasa Rao, Stochastic feature compensation methods for speaker verification in noisy environments, Appl. Soft Comput. 19 (June) (2014) 198-214.
[56] V. Asadpour, M.M. Homayounpour, F. Towhidkha, Audio-visual speaker identification using dynamic facial movements and utterance phonetic content, Appl. Soft Comput. 11 (March (2)) (2011) 2083-2093.
[57] R.H. Laskar, D. Chakrabarty, F.A. Talukdar, K. Sreenivasa Rao, K. Banerjee, Comparing ANN and GMM in a voice conversion framework, Appl. Soft Comput. 12 (November (11)) (2012) 3332-3342.
[58] R. Halavati, S.B. Shouraki, S.H. Zadeh, Recognition of human speech phonemes using a novel fuzzy approach, Appl. Soft Comput. 7 (June (3)) (2007) 828-839.
[59] M. Sarma, K.K. Sarma, An ANN based approach to recognize initial phonemes of spoken words of Assamese language, Appl. Soft Comput. 13 (May (5)) (2013) 2281-2291.
[60] S. Jothilakshmi, Automatic system to detect the type of voice pathology, Appl. Soft Comput. 21 (August) (2014) 244-249.