Detection of Pathological Voice Using Cepstrum Vectors: A Deep Learning Approach
Summary: Objectives. Computerized detection of voice disorders has attracted considerable academic and clini-
cal interest in the hope of providing an effective screening method for voice diseases before endoscopic confirmation.
This study proposes a deep-learning-based approach to detect pathological voice and examines its performance and
utility compared with other automatic classification algorithms.
Methods. This study retrospectively collected 60 normal voice samples and 402 pathological voice samples of 8 common
clinical voice disorders in a voice clinic of a tertiary teaching hospital. We extracted Mel frequency cepstral coeffi-
cients from 3-second samples of a sustained vowel. The performances of three machine learning algorithms, namely,
deep neural network (DNN), support vector machine, and Gaussian mixture model, were evaluated based on a five-
fold cross-validation. Collective cases from the voice disorder database of MEEI (Massachusetts Eye and Ear Infirmary)
were used to verify the performance of the classification mechanisms.
Results. The experimental results demonstrated that DNN outperforms Gaussian mixture model and support vector
machine. Its accuracy in detecting voice pathologies reached 94.26% and 90.52% in male and female subjects, based
on three representative Mel frequency cepstral coefficient features. When applied to the MEEI database for validation,
the DNN also achieved a higher accuracy (99.32%) than the other two classification algorithms.
Conclusions. By stacking several layers of neurons with optimized weights, the proposed DNN algorithm can fully
utilize the acoustic features and efficiently differentiate between normal and pathological voice samples. Based on this
pilot study, future research may proceed to explore more applications of DNN from laboratory and clinical perspectives.
Key Words: Nodule–Polyp–Neoplasm–Spasmodic dysphonia–Sulcus.
DNN-based system for detecting features extracted from voice samples, (2) to examine the performance of DNN in differentiating between normal and pathological voice samples, and (3) to validate the accuracy of the DNN using the widely applied voice disorder database from MEEI.

MATERIALS AND METHODS
Study subjects
Voice samples were obtained from a voice clinic in a tertiary teaching hospital (Far Eastern Memorial Hospital, FEMH), which included 60 normal voice samples and 402 samples of common voice disorders, including vocal nodules, polyps, and cysts; glottic neoplasm; vocal atrophy; laryngeal dystonia (ie, spasmodic dysphonia and tremor); unilateral vocal paralysis; and sulcus vocalis (Tables 1 and 2). Voice samples of a 3-second sustained vowel sound /a:/ were recorded at a comfortable level of loudness, with a microphone-to-mouth distance of approximately 15–20 cm, using a high-quality microphone (Model: SM58, SHURE, IL),23 with a digital amplifier (Model: X2u, SHURE) under a background noise level between 40 and 45 dBA. The sampling rate was 44,100 Hz with a 16-bit resolution, and data were saved in an uncompressed .wav format.

Feature extraction from MFCCs
Derived through pre-emphasis, windowing, fast Fourier transform, Mel filtering, nonlinear transformation, and discrete cosine transform, MFCCs have been widely used in acoustic research.24 For example, MFCC and MFCC + delta features were selected for voice disorder detection,7,10,11 and the normalized version was selected for performance comparison.25,26 To capture these MFCC features, first, the raw waveform was divided into N frames (or segments), represented by the vectors x1, . . ., xN (Figure 1). A total of N frames were then transformed into N MFCC vectors, representing the acoustic features. Next, for the second feature, we calculated the trajectories of the MFCCs over time (delta MFCCs) and appended them to the original MFCCs. Finally, we normalized the MFCCs such that all of the coefficients had zero mean and unit variance, and appended the delta MFCCs to the normalized MFCCs to form the third feature vectors. The details of MFCCs and their variations were outlined in a previous publication.27

The experimental setups for the acoustic signal processing and feature extraction procedures are described later. The first feature, made up of 13-dimension MFCCs, was extracted from a 16-millisecond windowed signal using an 8-millisecond frameshift. A window length of 16 milliseconds is used to capture the fast dynamic acoustic waves, whereas the 8-millisecond frameshift enables smoothness between frames. Similar settings were applied in many previous studies.28,29 The next feature, MFCC + delta, was created by appending 13 velocity features to the original 13-dimension MFCCs and thus had 26 dimensions. The third feature, denoted by MFCC(N) + delta for convenience, has the same dimensions as that of MFCC + delta. The only difference is that the former normalized all MFCC coefficients with zero mean and unit variance.

DNN
A DNN model comprises multiple hidden layers to form a complex mapping function between the inputs and outputs. Previous studies have verified that a DNN model can provide a satisfactory performance in speech enhancement.30 In DNN, the relationship between the input, x, and the output of the first hidden layer, h1, is described as

    h1 = f(W1 x + b1),    (1)

where W1 and b1 are the weight matrix and bias vector, respectively, and f(·) is the activation function. In this study, we use the sigmoid function for the activation function, namely, f(z) = [1 + exp(−z)]^(−1), based on the better performance among different activation functions (Appendix 1 of the Supplementary material). The relationship between the current hidden layer and the next hidden layer can be expressed as

    h_{i+1} = f(W_{i+1} h_i + b_{i+1}),  i = 1, 2, …, L − 1,    (2)
TABLE 1.
Demographics of the 462 Normal and Pathological Voice Samples

              Number        Mean Age (y)    Age Range (y)     Standard Deviation
              M      F      M      F        M       F         M       F
Normal        16     44     30.7   30.1     23–37   22–47     3.93    5.79
Pathological  189    213    56.1   44.2     20–87   20–87     15.9    14.9

Abbreviations: M, male; F, female.
TABLE 2.
Disease Categories of the 402 Pathological Voice Samples
Nodules Polyp Cyst Neoplasm Atrophy Dystonia Vocal Palsy Sulcus
M 1 18 17 43 39 2 41 28
F 51 33 34 5 16 17 26 31
Abbreviations: M, male; F, female.
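The three MFCC-based feature sets described in "Feature extraction from MFCCs" (MFCC, MFCC + delta, and MFCC(N) + delta) can be sketched in numpy, starting from a precomputed (n_frames × 13) MFCC matrix. This is an illustrative sketch, not the authors' code: the regression width used for the delta computation, and taking deltas of the normalized coefficients for MFCC(N) + delta, are assumptions.

```python
import numpy as np

def delta(feats, width=2):
    """Delta (velocity) trajectories: slope of a local regression over
    +/- `width` neighboring frames; edge frames are padded by repetition."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    n = len(feats)
    denom = 2 * sum(w * w for w in range(1, width + 1))
    return sum(
        w * (padded[width + w:width + w + n] - padded[width - w:width - w + n])
        for w in range(1, width + 1)
    ) / denom

def build_feature_sets(mfcc):
    """Return the three feature sets: 13-dim MFCC, 26-dim MFCC + delta,
    and 26-dim MFCC(N) + delta (coefficients normalized to zero mean and
    unit variance before the deltas are appended -- an assumption)."""
    mfcc_delta = np.hstack([mfcc, delta(mfcc)])
    normed = (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)
    mfcc_n_delta = np.hstack([normed, delta(normed)])
    return mfcc, mfcc_delta, mfcc_n_delta
```

In the paper's setup, the frames come from 16-millisecond windows with an 8-millisecond frameshift; any MFCC front end producing a (frames × 13) matrix can feed `build_feature_sets`.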
ARTICLE IN PRESS
Shih-Hau Fang, et al Detection of Pathological Voice Using Deep Neural Network 3
where L is the DNN hidden layer number. Finally, another function, g(·), is adopted on the output layer to form the output vector ŷ. Thus, we have

    ŷ = g(h_L).    (3)

For classification tasks, the softmax function is usually adopted for g(·). To compute the parameters in the DNN model with the training samples X = [x1, . . ., xi, . . ., xN] and the corresponding labels Y = [y1, . . ., yi, . . ., yN], where N is the total number of training samples, we formulate an objective function:

    O(Y, Ŷ; X, θ) = −(1/(NJ)) Σ_{i=1}^{N} Σ_{j=1}^{J} [y_{i,j} log ŷ_{i,j}],    (4)

where θ = {W_l, b_l, l = 1, 2, …, L} is the DNN parameter set, and Ŷ = [ŷ1, . . ., ŷi, . . ., ŷN] is the DNN output (ŷi is the ith DNN output given input xi); y_{i,j} and ŷ_{i,j} denote the jth element of yi and ŷi, respectively. The parameter is then estimated by

    θ* = arg min_θ O(Y, Ŷ; X, θ),    (5)

where the standard back-propagation algorithm is applied to compute θ* in Eq. (5).

Experimental setup
We examined 1–16 Gaussian mixtures for the GMM and tested the performance of different kernel functions for the SVM, including linear, quadratic, and Gaussian functions. The DNN was structured into multiple hidden layers with varying neuron numbers in each layer. The best combination of hidden layers and number of neurons was determined based on the experimental results. The threshold of the ratio of the pathological feature vectors was investigated from 0.1 to 0.9, with a 0.1 increment. The performance was evaluated through a fivefold cross-validation. We utilized general accuracy, which is widely used for detection tasks, as the main performance metric.
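Eqs. (1)–(5) can be written out as a minimal numpy sketch of the forward pass and objective. Function names here are illustrative, and the training step (back-propagation to obtain θ* in Eq. (5)) is omitted; this is a sketch of the equations, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # f(z) = [1 + exp(-z)]^(-1)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)    # output function g(.)

def forward(x, weights, biases):
    """Eqs. (1)-(3): sigmoid hidden layers followed by a softmax output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)                 # h_{i+1} = f(W_{i+1} h_i + b_{i+1})
    return softmax(h @ weights[-1] + biases[-1])   # y_hat = g(h_L)

def objective(Y, Y_hat):
    """Eq. (4): O = -(1/NJ) * sum_i sum_j y_{i,j} * log(y_hat_{i,j})."""
    N, J = Y.shape
    return -np.sum(Y * np.log(Y_hat + 1e-12)) / (N * J)
```

With the paper's best-performing structure, `weights` would hold three 300-neuron hidden layers plus an output layer; minimizing `objective` over `weights` and `biases` by gradient descent implements Eq. (5).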
4 Journal of Voice, Vol. ■■, No. ■■, 2018
FIGURE 3. Accuracy of different numbers of mixtures in GMM with (A) MFCC, (B) MFCC + delta, and (C) MFCC (N) + delta features.
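The fivefold cross-validation used as the evaluation protocol can be sketched generically as follows. `five_fold_accuracy` and its train/predict callables are hypothetical names for illustration, not part of the study's code.

```python
import numpy as np

def five_fold_accuracy(X, y, train_fn, predict_fn, k=5, seed=0):
    """Shuffle the samples, split them into k folds, train on k-1 folds,
    test on the held-out fold, and report mean accuracy and its spread."""
    idx = np.random.RandomState(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])
        accs.append(np.mean(predict_fn(model, X[test]) == y[test]))
    return float(np.mean(accs)), float(np.std(accs))
```

The accuracy ± standard deviation entries reported in Tables 3–5 are of this form.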
RESULTS AND DISCUSSION
Spectrogram and acoustic waveforms
Figure 2 shows the waveform and spectrogram plots of normal and pathological voice samples. From the waveform plots, the pathological voice sample (B) showed irregular and wider variations of amplitude compared with the normal voice sample (A). Meanwhile, from the spectrogram plots, the normal voice sample (C) presented clearer harmonic structures, especially in the low-frequency areas. In contrast, the pathological voice sample (D) showed blurred harmonic structures and contained noise-like components in the high-frequency region.

Optimal setting for SVM, GMM, and DNN
We performed multiple experiments to determine the optimal kernel functions for SVM. The Gaussian radial basis kernel function was chosen because of its highest accuracy for female voice samples with the lowest variation of accuracy among male samples (Appendix 2 of the Supplementary material). We also compared the accuracies of GMM using different numbers of Gaussian mixtures with three MFCC features. The results show that the accuracy increases as the number of mixtures increases from 1 to 6, whereas the performance becomes saturated when the number of Gaussian mixtures ranges from 8 to 16 (Figure 3). Accordingly, we used eight Gaussian mixtures as a representative GMM model in the following study. We also examined the performance among different DNN structures (ie, hidden layers and number of neurons); the results showed that the best performance was achieved when using 3 hidden layers with 300 neurons in each layer (Appendix 3 of the Supplementary material). Finally, we investigated the ratio of the pathological feature vectors to determine the adequate cut-off value.
FIGURE 2. Waveform of a normal voice sample (A) and a pathological voice sample (B). Wide-band spectrogram of the corresponding normal voice sample (C) and pathological voice sample (D).
FIGURE 4. (A) True-positive rate (TPR) and false-negative rate (FNR); (B) positive predictive value (PPV) and false detection rate (FDR) among different ratios of pathological feature vectors on DNN.
Experimental results indicated that 0.5 could be a good balance point between the true-positive rate, false-negative rate, positive predictive value, and false detection rate (Figure 4).

Accuracy of DNN in comparison with SVM and GMM
Table 3 compares the performance of the three algorithms and three features among the voice samples of the male subjects from FEMH. The table indicates that the DNN coupled with the MFCC(N) + delta feature provides higher accuracy (94.26%), with a lower standard deviation (2.25%), than those of the other two classification algorithms (GMM and SVM) and the other two features (MFCC and MFCC + delta). Data on the performance of the DNN among female subjects are provided in Table 4, which also demonstrates that the DNN and MFCC(N) + delta feature achieve the highest accuracy for detecting pathological voice samples. However, the overall accuracy among the female voice samples is lower than that among the male samples. Similar to our results, a previous study by Fraile et al also reported a higher accuracy for pathological voice detection in men than in women.10 The authors proposed that such discrepancy might be explained by the wider distribution of the values of MFCC features in women compared with the narrower distribution for men.10

Compared with a previous study using an artificial neural network containing one layer of hidden nodes (accuracy: men: 90.95%; women: 86.50%),10 the DNN model with multiple layers of hidden nodes further increases the accuracy rates by around 4% (men: 94.26%; women: 90.52%, Tables 3 and 4). Accordingly, our results indicate that DNN achieves the highest performance for both female and male subjects, confirming the capability of DNN for detecting pathological voice samples. Moreover, the velocity (delta) features and normalization are useful for improving the performance, particularly for the GMM and DNN methods, in the FEMH database of voice disorders.

Validation of DNN performance using MEEI voice disorder database
To validate the aforementioned results, which indicate that DNN outperforms SVM and GMM in detecting pathological voice samples, we applied a common voice disorder database from MEEI under the same experimental setting. We retrieved 53 normal and 173 pathological samples from the MEEI database,5 identical to a previously published study.7 Results in Table 5 showed that DNN provides greater accuracy and a lower standard deviation than SVM and GMM, indicating the same tendency as the previous results obtained using the FEMH data (Tables 3 and 4). Similarly, compared with previous studies using neural networks with a single hidden layer to detect pathological voice samples from the MEEI database,15,31 this study utilized a DNN with three hidden layers and exhibited a better performance, further confirming the advantages of the proposed DNN-based approach. Although the detailed settings for extracting the MFCC features and numbers of GMM mixtures were not identical to the previous study by Godino-Llorente et al,7 this study also showed that dynamic delta features of MFCCs do not enhance the capability of the MEEI model in the detection of voice disorders (Table 5). Such a concordance may be due to the fact that the MFCC produces sufficient discriminative information when voice samples are recorded in a well-controlled environment. Accordingly, the appended delta trajectory may be redundant and result in learning confusion. In contrast, under circumstances in which the voice samples are recorded in suboptimal settings, adding temporal derivatives (delta feature) might be helpful to increase the robustness of performance (Tables 3 and 4).
TABLE 3.
Classification Accuracies of Three Classification Algorithms and Three MFCC Features Among Male Subjects

                  SVM              GMM              DNN
                  Accuracy ± SD    Accuracy ± SD    Accuracy ± SD
MFCC              92.24 ± 2.66%    89.00 ± 1.79%    93.86 ± 2.05%
MFCC + delta      92.24 ± 2.66%    91.02 ± 3.38%    93.86 ± 2.05%
MFCC(N) + delta   93.04 ± 2.74%    90.24 ± 4.18%    94.26 ± 2.25%

Abbreviation: SD, standard deviation.
TABLE 4.
Classification Accuracies of Three Classification Algorithms and Three MFCC Features Among Female Subjects

                  SVM              GMM              DNN
                  Accuracy ± SD    Accuracy ± SD    Accuracy ± SD
MFCC              85.18 ± 0.72%    83.56 ± 2.12%    86.14 ± 1.43%
MFCC + delta      85.18 ± 0.72%    86.12 ± 4.35%    87.74 ± 1.43%
MFCC(N) + delta   87.40 ± 1.92%    90.20 ± 3.83%    90.52 ± 2.00%

Abbreviation: SD, standard deviation.
TABLE 5.
Detection of Pathological Voice Samples in the MEEI Voice Disorder Database

                  SVM              GMM              DNN
                  Accuracy ± SD    Accuracy ± SD    Accuracy ± SD
MFCC              98.28 ± 2.36%    98.26 ± 1.80%    99.14 ± 1.92%
MFCC + delta      93.04 ± 2.74%    90.24 ± 4.18%    94.26 ± 2.25%
MFCC(N) + delta   87.40 ± 1.92%    90.20 ± 3.83%    90.52 ± 2.00%

Abbreviation: SD, standard deviation.
FIGURE 5. Online and offline models of the proposed pathological voice detection system.
Tsui, BSc, for their help in the analysis of acoustic data and neural network modeling.

SUPPLEMENTARY DATA
Supplementary data related to this article can be found online at doi:10.1016/j.jvoice.2018.02.003.

REFERENCES
1. Titze IR. Workshop on acoustic voice analysis: summary statement. National Center for Voice and Speech; 1995.
2. Stemple JC, Roy N, Klaben BK. Clinical Voice Pathology: Theory and Management. San Diego: Plural Publishing; 2014.
3. Schwartz SR, Cohen SM, Dailey SH, et al. Clinical practice guideline: hoarseness (dysphonia). Otolaryngol Head Neck Surg. 2009;141:S1–S31.
4. Vaziri G, Almasganj F, Behroozmand R. Pathological assessment of patients' speech signals using nonlinear dynamical analysis. Comput Biol Med. 2010;40:54–63.
5. Kay Elemetrics. Disordered voice database, version 1.03; 1994.
6. Umapathy K, Krishnan S, Parsa V, et al. Discrimination of pathological voices using a time-frequency approach. IEEE Trans Biomed Eng. 2005;52:421–430.
7. Godino-Llorente JI, Gomez-Vilda P, Blanco-Velasco M. Dimensionality reduction of a pathological voice quality assessment system based on Gaussian mixture models and short-term cepstral parameters. IEEE Trans Biomed Eng. 2006;53:1943–1953.
8. Costa SC, Neto BGA, Fechine JM. Pathological voice discrimination using cepstral analysis, vector quantization and hidden Markov models. In: IEEE International Conference on Bioinformatics and Bioengineering. Athens, Greece; 2008:1–5.
9. Salhi L, Mourad T, Cherif A. Voice disorders identification using multilayer neural network. Int Arab J Inf Technol. 2010;7:177–185.
10. Fraile R, Saenz-Lechon N, Godino-Llorente JI, et al. Automatic detection of laryngeal pathologies in records of sustained vowels by means of Mel-frequency cepstral coefficient parameters and differentiation of patients by sex. Folia Phoniatr Logop. 2009;61:146–152.
11. Arias-Londono JD, Godino-Llorente JI, Saenz-Lechon N, et al. Automatic detection of pathological voices using complexity measures, noise parameters, and Mel-cepstral coefficients. IEEE Trans Biomed Eng. 2011;58:370–379.
12. Arias-Londono JD, Godino-Llorente JI, Markaki M, et al. On combining information from modulation spectra and Mel-frequency cepstral coefficients for automatic detection of pathological voices. Logoped Phoniatr Vocol. 2011;36:60–69.
13. Markaki M, Stylianou Y. Voice pathology detection and discrimination based on modulation spectral features. IEEE Trans Audio Speech Lang Proc. 2011;19:1938–1948.
14. Muhammad G, Mesallam TA, Malki KH, et al. Multidirectional regression (MDR)-based features for automatic voice disorder detection. J Voice. 2012;26:e819–e827.
15. Arjmandi MK, Pooyan M. An optimum algorithm in pathological voice quality assessment using wavelet-packet-based features, linear discriminant analysis and support vector machine. Biomed Signal Process Control. 2012;7:3–19.
16. Hinton G, Deng L, Yu D, et al. Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Proc Mag. 2012;29:82–97.
17. Silver D, Huang A, Maddison CJ, et al. Mastering the game of Go with deep neural networks and tree search. Nature. 2016;529:484–489.
18. Fang SH, Fei YX, Xu ZZ, et al. Learning transportation modes from smartphone sensors based on deep neural network. IEEE Sens J. 2017;17:6111–6118.
19. Li B, Tsao Y, Sim KC. An investigation of spectral restoration algorithms for deep neural networks based noise robust speech recognition. INTERSPEECH; 2013:3002–3006.
20. Tawalbeh LA, Mehmood R, Benkhlifa E, et al. Mobile cloud computing model and big data analysis for healthcare applications. IEEE Access. 2016;4:6171–6180.
21. Sahoo PK, Mohapatra SK, Wu SL. Analyzing healthcare big data with prediction for future health condition. IEEE Access. 2017;99:1.
22. Ma Y, Wang Y, Yang J, et al. Big health application system based on health internet of things and big data. IEEE Access. 2016;PP:1.
23. Fu S, Theodoros DG, Ward EC. Delivery of intensive voice therapy for vocal fold nodules via telepractice: a pilot feasibility and efficacy study. J Voice. 2015;29:696–706.
24. Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust. 1980;28:357–366.
25. Hamawaki S, Funasawa S, Katto J, et al. Feature analysis and normalization approach for robust content-based music retrieval to encoded audio with different bit rates. In: Huet B, Smeaton A, Mayer-Patel K, et al., eds. Advances in Multimedia Modeling: 15th International Multimedia Modeling Conference, MMM 2009, Sophia-Antipolis, France, January 7–9, 2009. Proceedings. Berlin, Heidelberg: Springer; 2009:298–309.
26. Boril H, Hansen JHL. Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments. IEEE Trans Audio Speech Lang Proc. 2010;18:1379–1393.
27. Zhang D, Gatica-Perez D, Bengio S, et al. Semi-supervised adapted HMMs for unusual event detection. IEEE Comput Soc Conf Comput Vis Pattern Recognit. 2005;1:611–618.
28. Chan CP, Wong YW, Tan L, et al. Two-dimensional multi-resolution analysis of speech signals and its application to speech recognition. In: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP99). Phoenix, AZ, USA; 1999:405–408.
29. Dahmani M, Guerti M. Vocal folds pathologies classification using Naïve Bayes networks. In: International Conference on Systems and Control (ICSC); 2017:426–432.
30. Lu X, Tsao Y, Matsuda S, et al. Speech enhancement based on deep denoising autoencoder; 2013:436–440.
31. Godino-Llorente JI, Gomez-Vilda P. Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors. IEEE Trans Biomed Eng. 2004;51:380–384.
32. Li J, Deng L, Gong Y, et al. An overview of noise-robust automatic speech recognition. IEEE/ACM Trans Audio Speech Lang Proc. 2014;22:745–777.