Multitask and Transfer Learning Approach For Joint Classification and Severity Estimation of Dysphonia
ABSTRACT Objective: Beyond being the primary communication medium, speech carries valuable
information about a speaker’s health, emotions, and identity. Various conditions can affect the vocal organs,
leading to speech difficulties. Extensive research has been conducted by voice clinicians and academia in
speech analysis. Previous approaches primarily focused on one particular task, such as differentiating between
normal and dysphonic speech, classifying different voice disorders, or estimating the severity of voice
disorders. Methods and procedures: This study proposes an approach that combines transfer learning and
multitask learning (MTL) to simultaneously perform dysphonia classification and severity estimation. Both
tasks use a shared representation network, and each task-specific output is learned from these shared features. We employed five computer
vision models and changed their architecture to support multitask learning. Additionally, we conducted binary
‘healthy vs. dysphonia’ and multiclass ‘healthy vs. organic and functional dysphonia’ classification using
multitask learning, with the speaker’s sex as an auxiliary task. Results: The proposed method achieved
improved performance across all classification metrics compared to single-task learning (STL), which only
performs classification or severity estimation. Specifically, the model achieved F1 scores of 93% and 90%
in MTL and STL, respectively. Moreover, we observed considerable improvements in both classification
tasks when varying the beta value that weights the sex-predicting auxiliary task: MTL
achieved an accuracy of 77% compared to the STL score of 73.2%. However, the performance of severity
estimation in MTL was comparable to STL. Conclusion: Our goal is to improve how voice pathologists and
clinicians understand patients’ conditions, make it easier to track their progress, and enhance the monitoring
of vocal quality and treatment procedures.
INDEX TERMS Multitask learning, dysphonia, voice pathology, deep learning, speech.
Clinical and Translational Impact Statement: By integrating both classification and severity estimation of
dysphonia using multitask learning, we aim to enable clinicians to gain a better understanding of the patient’s
situation and to effectively monitor their progress and voice quality.
point in life. Although dysphonia is often used interchangeably with hoarseness, hoarseness is a symptom of altered voice quality reported by patients, while dysphonia is a diagnosis made by clinicians [8]. Dysphonia can affect people of all ages and genders, but it is more common in teachers [9], older adults, and individuals with significant vocal demands. It can be caused by benign or self-limited conditions but may also indicate a more severe or progressive condition that requires prompt management [10].

Dysphonia has been categorized into two groups: organic and functional dysphonia. Functional dysphonia refers to a voice condition that does not correlate with neurogenerative or anatomical factors. It originates from abnormal laryngeal function or vocal misuse or overuse, which can lead to inefficient oral communication. Any stage of voice production can be affected by this condition [11]. Organic dysphonia can be further divided into structural and neurogenic categories. Neurogenic dysphonia is caused by abnormal control, coordination, or strength of the vocal folds due to neurological diseases such as Parkinson’s. Structural organic dysphonia is caused by morphological changes such as vocal cord nodules, polyps, gastroesophageal reflux (GERD), and vocal cord paralysis [10].

While the terms functional and organic dysphonia exist in the literature, there are contradictory arguments about whether they are really distinguishable. Speech-language pathology has historically used the term ‘‘functional’’ to describe voice disorders linked to psychogenic causes or personality variables. Initially, functional and psychogenic voice disorders were seen as closely related. However, this has been disputed by [12], whose findings indicate no significant differences in personality traits or psychological distress between individuals with organic and functional dysphonia. Suggested common causes for functional dysphonia encompass psychoneuroses, personality disorders, and faulty voice habits. These disorders were believed to occur in individuals with normal laryngeal anatomy and physiology, with stress, musculoskeletal tension, and conflicts related to sex identification as specific contributing factors [13]. Others argue that the term ‘‘functional’’ dysphonia is inadequate in terms of etiology and merely implies a ‘‘non-organic’’ disorder identified through exclusion [14]. In their scoping review, the authors of [15] critically analyzed the diverse terminologies employed in classifying voice disorders, highlighting disparities among professionals and researchers. The review underscores the need for a standardized classification system to facilitate effective communication among clinicians and provide appropriate treatment guidance. The study suggests categorizing hyperfunctional muscle issues as ‘‘Muscle Tension Dysphonia,’’ psychosocially-based voice disorders as ‘‘Functional (Psychogenic) Voice Disorder,’’ and disorders stemming from organic or neurogenerative causes as ‘‘Organic Dysphonia’’ [15]. The validity of the distinction between individuals with functional and organic dysphonia has yet to be thoroughly examined and tested. Therefore, it is crucial to approach this differentiation with caution, and further research is needed to establish an encompassing and standardized classification framework for voice disorders.

The conventional diagnostic approach for dysphonia involves multiple visits to healthcare professionals and voice therapists, utilizing techniques like laryngoscopy, stroboscopy, laryngeal electromyography, and auditory analysis. However, these methods often cause patient discomfort and distress [16]. Additionally, they are time-consuming, subject to variability due to their subjective nature, and expensive, imposing a significant financial burden ranging from 577 to 953 US dollars per patient per year [10].

These factors, together with recent advancements in machine learning (ML), have motivated the development of objective and reliable methods for diagnosing dysphonia. ML algorithms can detect and classify dysphonia accurately by analyzing speech signals, providing valuable diagnostic information for healthcare professionals. Speech samples were recorded from patients with voice disorders and control groups. Various acoustic features can be extracted from these speech samples, such as jitter, shimmer, harmonic-to-noise ratio (HNR), fundamental frequency, and Mel-frequency cepstral coefficients (MFCCs) [17], [18]. Features extracted from utterances are used to train ML algorithms to classify healthy and voice-disordered persons. In [19], MFCC and linear prediction cepstral coefficients (LPC) were extracted from sustained /a/ vowels and used to train a Hidden Markov Model (HMM) and a Gaussian Mixture Model (GMM), achieving accuracies of 94.44% and 95.74% with HMM and GMM, respectively. A combination of VGG16 and a Support Vector Machine (SVM) was proposed in [20]: features extracted using VGG16 from the sustained /i/ vowel were used to train an SVM classifier, achieving an accuracy of 96.7%. In some other works focusing on sustained vowels, performance analyses of different ML algorithms have been conducted [21], [22]. Other papers have explored the effectiveness of algorithms such as K-nearest neighbors (KNN), SVM, and random forest [23]. Furthermore, the utilization of XGBoost, isolation forest, and DenseNet has also been investigated [24]. The study in [25] predicted dysphonic speech severity from sustained vowels’ time and spectral features using step-wise multiple regression, yielding a mean R of 0.880 and a mean R² of 0.775. Continuous speech samples were used for automated dysphonia severity assessment, achieving 89% accuracy and a root mean square error (RMSE) of 0.49 for binary classification and severity estimation, respectively [26].

Deep learning techniques, like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have gained traction in analyzing pathological speech due to their impressive performance in diverse domains such as image recognition [27], natural language processing [28], and speech recognition [29]. Several studies have been conducted in the research area of voice disorders. One study achieved an accuracy of 98.9% using a CNN with oversampling techniques
to deal with class imbalance [30]. A CNN with a chromagram feature set was used in [31], and a CNN combined with an RNN in [32]. In [33], a Convolutional Deep Belief Network (CDBN) was used to pre-train the weights of the CNN model, and an accuracy of 71% was achieved. In [34], a performance comparison between CNN and RNN was conducted; the study used 10-fold cross-validation and reported accuracies of 87.11% and 86.52% for CNN and RNN, respectively. A combination of features from EGG and speech from the sustained vowel /a/ has been considered for distinguishing normal and pathological voices [35], [36]. Other studies considered a stacked autoencoder [37] and an LSTM-based autoencoder [38], using sustained vowel /a/ and continuous speech samples, respectively.

Limited research exists on distinguishing functional from organic dysphonia. Only [39] has performed classification between these two categories, using handcrafted feature extraction and SVM. Classifying between organic and functional dysphonia is essential since the treatment procedures may differ. Benign organic dysphonia generally requires medical intervention or surgical procedures, except for vocal fold nodules, for which voice therapy is an alternative or an adjunct before and after surgery. The treatment procedures for dysphonia of neurodegenerative origin include voice therapy, laryngeal injection with a variety of fillers, laryngeal framework surgery, and laryngeal re-innervation. Functional dysphonia may be managed through voice therapy, psychological interventions, or a combination of both [40]. Accurate classification ensures that patients receive appropriate and targeted treatment approaches.

Although previous studies have offered valuable insights into voice disorder detection and assessment, they have focused mainly on one task, such as binary or multiclass classification, or the prediction of severity scores. Additionally, most of the research has used sustained vowels, a method that might not fully capture the complexities of natural speech patterns. Furthermore, limited attention has been given to distinguishing between organic and functional dysphonia.

This paper introduces a novel approach to dysphonia evaluation by utilizing transfer and multitask learning to distinguish between normal and dysphonic speech and to predict severity scores simultaneously. Additionally, we performed multiclass classification between functional dysphonia, organic dysphonia, and normal speech using continuous speech samples. The similarity in the effects of both conditions on speech samples adds complexity to the classification process. To address this challenge, we implemented multitask learning by incorporating the speaker’s sex as an auxiliary task. This approach aims to enhance the model’s classification accuracy and its overall ability to generalize. To the best of our knowledge, no previous attempts have been made to conduct both classification and severity estimation of dysphonia patients simultaneously using multitask learning. Both tasks are considered equally important, as they allow clinicians to gain comprehensive insights into the patient’s condition and monitor their progress effectively. Our approach adopted continuous speech analysis as it faithfully captures natural speech patterns, closely corresponding to natural communication interactions in everyday scenarios. This method aligns more closely with real-life scenarios compared to analyzing isolated speech fragments, enhancing the realism and practical applicability of our findings.

II. MATERIALS AND METHODS
Five different computer vision models were used in this research. The architectures of these models were changed to support multitask learning, as discussed in section II-C. The fine-tuning was performed using 5-fold cross-validation, with a learning rate of 0.0001, a batch size of 4, the Adam optimizer, and 50 epochs. Implementation was carried out using the PyTorch [41] framework, and pre-trained versions of these networks are available in the torchvision module. In each training and validation fold, two models were saved: the final model and the best model. The final model is the fine-tuned model after 50 epochs, and the best model is the result of early stopping; the best model is the one with the lowest validation loss. The saved models (final and best) were independently tested on the test set.
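To make this protocol concrete, the following is a minimal sketch of a single cross-validation fold under the stated settings (Adam optimizer, learning rate of 0.0001, batch size of 4, 50 epochs, best checkpoint chosen by validation loss). It is shown for the single-task classification case; the dataset objects and the model passed in are hypothetical placeholders, not the exact code used in this study.

```python
import copy
import torch
from torch.utils.data import DataLoader

def run_fold(model, train_set, val_set, device="cuda"):
    """One cross-validation fold: 50 epochs of Adam fine-tuning, keeping
    both the final model and the best model by validation loss."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    ce = torch.nn.CrossEntropyLoss()
    train_loader = DataLoader(train_set, batch_size=4, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=4)

    best_loss, best_state = float("inf"), None
    for epoch in range(50):
        model.train()
        for x, y in train_loader:          # x: mel spectrogram, y: class label
            opt.zero_grad()
            loss = ce(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for x, y in val_loader:
                val_loss += ce(model(x.to(device)), y.to(device)).item()
        if val_loss < best_loss:           # early-stopping criterion: lowest validation loss
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())

    return model.state_dict(), best_state  # "final" and "best" checkpoints
```

The multitask variant differs only in the forward pass and in the combined loss, as described in section II-C.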
A. DATASET
We used Hungarian speech samples from two categories of dysphonia, functional and organic, along with healthy controls. All speakers were native Hungarian speakers and read a short passage titled ‘‘The North Wind and The Sun,’’ a text translated into many languages, including Hungarian, and frequently used by phoneticians. Speech samples were collected from patients who had agreed to participate in the study. The recordings were performed at the Head and Neck Surgery department of the National Institute of Oncology during consultations. Distinguishing functional from organic dysphonia was performed by a specialist at the same department. During the recordings, various health conditions were observed, including functional dysphonia, recurrent paresis, tumors in different parts of the vocal tract, gastroesophageal reflux disease, chronic inflammation of the larynx, bulbar paresis, amyotrophic lateral sclerosis, leucoplakia, spasmodic dysphonia, and more. The speech recording was performed in a quiet (clinical office) environment with PCM audio coding, a 16 kHz sampling rate, and 16-bit quantization. In total, 441 speech samples were used, including 179 healthy control samples, 179 organic dysphonia samples, and 83 functional dysphonia samples.

A panel of three specialists graded the severity scores. The first specialist interacted directly with the patients during the grading process, while the other two experts evaluated the speech recordings by listening to them in a quiet environment. The final severity score used in this study is the rounded average score of the three raters. The RBH scale, which stands for Roughness, Breathiness, and Hoarseness [42], [43], was used to determine the severity level. The scale ranges from 0 to 3; for the speech samples in this study, H was selected as the severity level, ranging from 0 to 3.
1) ResNet
Deep neural networks have benefited from weight initialization and batch normalization, which have mitigated issues like vanishing and exploding gradients. However, they are still susceptible to the degradation problem, where network accuracy plateaus or decreases as the network depth increases. To address this challenge, researchers from Microsoft proposed ResNet [45], a residual network architecture. ResNet has become a prominent solution to the degradation problem due to its unique design of skip connections, allowing for smoother gradient flow and enabling the training of much deeper networks. Figure 1 illustrates the ResNet block’s architecture.

ResNet employs skip connections to enable more efficient learning of identity mappings within the neural network. As shown in Figure 1, the feature x is summed with the output of the last layer and passed through a nonlinear function, typically ReLU, establishing a direct information flow through the network. This technique has been pivotal in training large-scale neural networks and has gained widespread adoption.
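The skip connection described above can be summarized as y = ReLU(F(x) + x). A minimal sketch of a basic residual block in this spirit (illustrative only, not the exact torchvision implementation) is:

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = ReLU(F(x) + x), where F is two 3x3 convolutions with batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # identity shortcut added before the final nonlinearity
        return self.relu(self.f(x) + x)
```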
3) MobileNet
The aim was to design an architecture for efficient processing on mobile devices and embedded systems with limited computational resources. The authors introduce depthwise separable convolutions as a building block for the network, as an alternative to the conventional convolution layer. In a depthwise convolution, a single convolutional filter is applied to each input channel to achieve a lightweight filtering process. After that, a pointwise 1 × 1 convolution is applied to compute linear combinations of the input channels, creating new features. The resulting network has fewer parameters and requires less computation than traditional convolutional networks while still performing well on various tasks [47].
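As an illustration of the depthwise separable convolution described above, the sketch below chains a per-channel (grouped) 3 × 3 convolution with a pointwise 1 × 1 convolution; it is a simplified building block, not MobileNet’s full architecture.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 filter per input channel, then pointwise 1x1 channel mixing."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))    # lightweight per-channel filtering
        return self.relu(self.bn2(self.pointwise(x)))  # linear combination of channels
```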
4) ConvNeXt
Vision transformers have become popular in various computer vision applications since they were first introduced in 2020. However, a group of researchers from Facebook aimed to demonstrate the continued relevance of convolutional networks by modernizing the ResNet50 architecture
toward the vision transformer’s design. This involved implementing changes such as depthwise convolutions, replacing ReLU with GELU activation functions, and other modifications. ConvNeXt, the resulting family of pure convolutional models, has been competing with Transformers regarding accuracy and scalability [48].

5) EfficientNet
The authors of this model proposed a new scaling method for designing neural network architectures. The approach involves scaling the network’s depth, width, and resolution using compound scaling, based on the intuition that larger input images require more layers and channels to capture detailed information. As a result, they developed a family of eight models, EfficientNetB0 to B7, that outperformed popular neural network architectures like ResNet, Inception, and MobileNet in terms of accuracy and efficiency, even with fewer parameters [49].

Deep learning architectures are data-hungry: they require large amounts of data and significant computation power, making it difficult for individuals or small organizations to access these resources. Transfer learning plays a crucial role in scenarios where the database size is insufficient for training deep learning (DL) models from scratch. Transfer learning is a technique in which the model’s parameters learned in a specific domain are transferred to another domain and fine-tuned on smaller datasets. The idea is that the pre-trained models have already learned useful features and representations that can be generalized to the new task, even if the new task’s dataset is relatively small or different. Considering the limited size of our dataset, we adopted transfer learning. Different strategies exist for transfer learning, depending on the dataset and the task. The first scenario uses these models as feature extractors by freezing all of the model’s layers; features can then be obtained before the last classifier layer, and these high-dimensional features can be used to train neural networks or any other ML algorithm. Unfreezing and fine-tuning some layers of the network is another approach. The last option is fine-tuning all layers of the network, which takes more time and computation but usually leads to better performance. Considering that our domain is speech, which is quite different from the image domain on which these models were trained, we chose the last option.
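The three strategies can be expressed compactly with torchvision’s pre-trained models. The snippet below is only a sketch: the weight-enum string follows recent torchvision releases and may differ from the version used in this study, and MobileNetV3 is used purely as an example backbone.

```python
import torchvision.models as models

model = models.mobilenet_v3_large(weights="IMAGENET1K_V1")  # ImageNet pre-trained weights

# Option 1: frozen feature extractor -- train only a new classifier on top.
for p in model.parameters():
    p.requires_grad = False

# Option 2: partial fine-tuning -- unfreeze only the last feature block.
for p in model.features[-1].parameters():
    p.requires_grad = True

# Option 3 (the one adopted in this work): fine-tune every layer.
for p in model.parameters():
    p.requires_grad = True
```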
C. MULTITASK LEARNING
Multitask Learning (MTL) was first introduced by Caruana as a learning paradigm in ML that enables learning multiple related tasks jointly to improve the generalization performance of all the tasks. The motivation for MTL was to alleviate the data sparsity problem, where each task has a limited amount of labeled data. MTL aggregates the labeled data of all the tasks to obtain a more accurate learner for each task, which can help reuse existing knowledge across different tasks [50]. Since then, many researchers have adopted this methodology in many areas, including computer vision [51], natural language processing [52], and speech processing [53]. Researchers have found that MTL models perform better than their STL counterparts. This can be attributed to MTL’s ability to leverage a larger amount of data from diverse learning tasks, facilitating enhanced knowledge sharing among tasks, improved performance for each task, and mitigating risks of overfitting. The most straightforward architecture for MTL is a shared representation: all tasks are trained in parallel on the same representation features, and each task has a different output layer corresponding to that task. The loss function of the MTL network is the combination of the loss functions of the tasks performed by the model. Eq. (1) shows the loss function of our multitask learning network:

L_combined = β1 · L_t1 + β2 · L_t2 (1)

L_t1 and L_t2 are the loss functions corresponding to the tasks in the multitask network, and β1 and β2 are the corresponding hyperparameters that control each task’s importance in the network.

We chose the cross-entropy loss function for L_t1 in all three experiments we conducted. For the experiment on joint classification and regression of dysphonic speech, we selected mean squared error as the loss function L_t2. In contrast, for the binary and multiclass classification experiments using sex prediction as an auxiliary task, the mean squared error loss is replaced by another cross-entropy loss function.

The choice of the hyperparameters β1 and β2 depends on the relative importance of each task, with values ranging between 0 and 1. In the joint classification and regression case, we want to perform both tasks simultaneously with equal importance, so β1 and β2 are both equal to 1. Knowing the severity level of the patients alongside the classification result will help clinicians determine the disease’s advancement and can also help monitor the progress of treatment over time. In the binary and multiclass classification cases, we investigate the impact of gradually increasing the weight β2 assigned to the sex prediction auxiliary task in the multitask approach. This allows us to assess its influence on the accuracy of the main classification task. A detailed explanation of the effect of β2 values in the binary and multiclass scenarios is given in section III.

As discussed in section II-B, five pre-trained computer vision models were used in these experiments. These models were initially developed for image classification tasks and have a single-task architecture, so to use them for our work we need to modify the architectures. The modification involves replacing the original classifier layer with a new classifier whose number of output features equals the number of classes in our dataset. In the binary case, two output features represent the healthy and dysphonia classes, while in the multiclass case, three classes represent Healthy Control (HC), Functional Dysphonia (FD), and Organic Dysphonia (OD). The modification was performed so that we preserved the original architecture of the models; for example, in the case of ConvNeXt, the final classifier consists of (Linear, ReLU, Dropout, and Linear). We therefore replaced the original sequential model with a new sequential model with the same number of layers for the classifier, and we added a new head with the same sequential order but with the last linear layer’s output changed to 1 for the regression task. This scenario of modifying while preserving the original structure is identical for the other four models.
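As a concrete illustration of this modification and of Eq. (1), the sketch below attaches a classification head and a one-output regression head to a shared pre-trained backbone and combines cross-entropy and mean squared error with the weights β1 and β2. The backbone, layer names, and head sizes are illustrative; as described above, the actual implementation preserves each model’s original classifier structure (e.g., the Linear–ReLU–Dropout–Linear sequence of ConvNeXt) rather than using a single linear layer.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiTaskNet(nn.Module):
    """Shared pre-trained feature extractor with a classification head
    and a severity-regression head (last linear layer has 1 output)."""
    def __init__(self, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights="IMAGENET1K_V1")  # example backbone
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()           # drop the original single-task classifier
        self.backbone = backbone
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.regressor = nn.Linear(feat_dim, 1)

    def forward(self, x):
        feats = self.backbone(x)              # shared representation
        return self.classifier(feats), self.regressor(feats).squeeze(1)

def combined_loss(logits, severity_pred, y_cls, y_sev, beta1=1.0, beta2=1.0):
    """Eq. (1): L_combined = beta1 * L_t1 + beta2 * L_t2."""
    l_t1 = nn.functional.cross_entropy(logits, y_cls)
    l_t2 = nn.functional.mse_loss(severity_pred, y_sev)
    return beta1 * l_t1 + beta2 * l_t2
```

For the binary and multiclass experiments with sex as the auxiliary task, the regression head and the MSE term are replaced by a second classification head and a second cross-entropy term.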
By specifying the values of β1 and β2 in Eq. (1), we allow the network to operate in either a multitask or a single-task setting. For single-task classification, β2 = 0; in this case, the network optimizes only the parameters related to the classification of the speech samples. Likewise, for the regression task, we set β1 and β2 to 0 and 1, respectively; in this case, the model is trained only to predict the severity estimate.

D. EXPERIMENT CASES
Overall, six cases were performed in this paper, three for STL and three for MTL. We explain them briefly here; a compact summary of the corresponding (β1, β2) settings follows the list.
• Joint classification and regression - MTL: performing classification of healthy vs. dysphonia and predicting the severity level simultaneously. For this experiment, β1 and β2 are both equal to 1.
• Binary classification - STL: distinguishing between healthy and dysphonic speech samples only, with β1 = 1 and β2 = 0.
• Regression - STL: predicting the severity level of dysphonia patients, with β1 = 0 and β2 = 1.
• Binary classification - MTL with sex as an auxiliary task: trying different values of β2.
• Multiclass classification - MTL with sex as an auxiliary task: classifying normal vs. organic vs. functional dysphonia.
• Multiclass classification - STL: same as binary classification, with β2 = 0.
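For reference, the six cases can be restated as (β1, β2) configurations; the dictionary below only summarizes the list above, with ‘‘varied’’ indicating the β2 sweep investigated in section III.

```python
# Main task loss is always cross-entropy (L_t1); the second loss (L_t2) is
# MSE for severity regression or cross-entropy for the sex auxiliary task.
EXPERIMENT_CASES = {
    "joint_cls_regression_mtl": {"beta1": 1, "beta2": 1,        "second_task": "severity (MSE)"},
    "binary_cls_stl":           {"beta1": 1, "beta2": 0,        "second_task": None},
    "regression_stl":           {"beta1": 0, "beta2": 1,        "second_task": "severity (MSE)"},
    "binary_cls_mtl_sex_aux":   {"beta1": 1, "beta2": "varied", "second_task": "sex (cross-entropy)"},
    "multiclass_mtl_sex_aux":   {"beta1": 1, "beta2": "varied", "second_task": "sex (cross-entropy)"},
    "multiclass_stl":           {"beta1": 1, "beta2": 0,        "second_task": None},
}
```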
E. SPEECH FEATURES: MEL SPECTROGRAM
As explained in section II-A, participants read a short paragraph, so the length of the recordings differs between files. The deep learning models used require fixed-length input features. Considering the overall length of our samples, we selected 40 seconds of speech as input to the network: speech files longer than 40 seconds were truncated, and shorter files were padded with zeros. The mel spectrogram is used as the input for the CNN models. A spectrogram is the time-frequency representation of a speech signal; it shows how the frequency content of the speech changes over time. The mel spectrogram, on the other hand, applies a logarithmic transformation to the frequency axis of the traditional spectrogram, which makes it more perceptually relevant to human hearing. The mel spectrogram is calculated by dividing the audio signal into short overlapping frames and then applying a Fourier transform to each frame to obtain the frequency content. The frequency axis of the spectrogram is then converted to the mel scale using a nonlinear transformation, which accounts for the nonlinear nature of human perception of sound. For our study, we extracted 128 mel spectrogram features using a 25 ms Hamming window with 10 ms overlap. Figure 2 presents the mel spectrograms for male and female participants. From left to right, the figures show the mel spectrograms for healthy controls and for dysphonia samples with severity levels ranging from 1 (SL 1) to 3 (SL 3). As observed in the figure, there is a clear distinction between the mel spectrograms of healthy individuals and those with dysphonia.

F. EVALUATION METRICS
Different evaluation metrics are needed to measure the proposed model’s performance for classification and regression. Accuracy, sensitivity, specificity, and F1 score have been used to measure the performance of the classification tasks. Accuracy (Eq. 2) measures the proportion of correctly classified instances, both True Positives (TP) and True Negatives (TN), over the total number of instances:

Accuracy (Acc) = (TP + TN) / (TP + FP + TN + FN) (2)

Sensitivity (Eq. 3) refers to how well a trained model can detect the presence of a disease or condition, in our case dysphonia. A high-sensitivity model will accurately identify most individuals with the disorder:

Sensitivity (Sen) = TP / (TP + FN) (3)

Specificity (Eq. 4) is the ability of the model to correctly classify an individual who does not have the disease or condition:

Specificity (Spec) = TN / (TN + FP) (4)

The F1 score (Eq. 5) is the harmonic mean of precision and recall and provides a single score that balances both metrics. Precision is the proportion of true positive predictions among all positive predictions, while recall is the proportion of true positive predictions among all actual positive cases:

F1 score (F1) = 2 · (Precision · Recall) / (Precision + Recall) (5)

where precision and recall are calculated as follows:

Precision = TP / (TP + FP) (6)

Recall = TP / (TP + FN) (7)

Root Mean Square Error (RMSE) and the Pearson correlation coefficient have been used to measure how well our models perform in the case of regression. RMSE (Eq. 8) measures the average difference between the predicted and actual severity levels:

RMSE = sqrt( (1/n) · Σ_{i=1}^{n} (y_i − ŷ_i)² ) (8)

where y_i represents the actual severity of the i-th sample and ŷ_i represents the predicted severity of the i-th sample. The formula computes the square root of the average of the squared differences between the actual and predicted values.
FIGURE 2. Mel spectrogram for speech samples: for female and male samples, respectively.
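The spectrograms in Figure 2 follow the settings of section II-E (16 kHz audio, a fixed 40-second input, 128 mel bands, a 25 ms Hamming window, and a 10 ms hop). A minimal sketch of that front end is shown below, assuming torchaudio; the FFT size is an assumption, since only the window and hop lengths are stated in the paper.

```python
import torch
import torchaudio

TARGET_LEN = 40 * 16000                       # 40 s at 16 kHz

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,                               # assumed FFT size
    win_length=400,                           # 25 ms window
    hop_length=160,                           # 10 ms hop
    window_fn=torch.hamming_window,
    n_mels=128,
)

def speech_to_mel(path):
    """Load a recording, pad/truncate to 40 s, and return a 128-band mel spectrogram."""
    wave, sr = torchaudio.load(path)
    if sr != 16000:
        wave = torchaudio.functional.resample(wave, sr, 16000)
    wave = wave.mean(dim=0)                   # mono
    if wave.numel() < TARGET_LEN:
        wave = torch.nn.functional.pad(wave, (0, TARGET_LEN - wave.numel()))  # zero-pad short files
    else:
        wave = wave[:TARGET_LEN]              # truncate long files
    return mel_transform(wave)
```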
TABLE 2. Classification and regression using MTL and STL (MTL/STL) for 5-fold cross-validation.
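For reproducibility, the metrics reported in Table 2 and defined in section II-F could be computed as in the sketch below, here using scikit-learn and SciPy as one possible implementation (the paper does not prescribe a particular library).

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import confusion_matrix, f1_score, mean_squared_error

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity and F1 for the binary case (Eqs. 2-5)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1":          f1_score(y_true, y_pred),
    }

def regression_metrics(y_true, y_pred):
    """RMSE (Eq. 8) and Pearson correlation (Eq. 9) for severity estimation."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    r, _ = pearsonr(y_true, y_pred)
    return {"rmse": rmse, "pearson_r": r}
```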
The Pearson correlation coefficient (Eq. 9) measures the strength and direction of the linear relationship between the predicted and actual severity level values:

r (PearCorre) = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_{i=1}^{n} (x_i − x̄)²) · sqrt(Σ_{i=1}^{n} (y_i − ȳ)²) ) (9)

where x_i and y_i are the actual and predicted severity scores for the i-th sample, x̄ and ȳ are the sample means of the two variables, and n is the total number of samples in the test set.

III. RESULTS
The results reported in this section are the average of five results, each corresponding to the models saved in each fold.

A. BINARY CLASSIFICATION AND REGRESSION (MTL VS STL)
The results presented in Table 2 demonstrate the performance of the models in three different experiments, namely joint classification and severity estimation using MTL, and classification and regression using STL. The bold values represent better performance between MTL and STL, while the bold and underlined values indicate the best performance across all models (both MTL and STL). The values shown in each column correspond to MTL and STL, respectively. Overall, MTL exhibits better performance compared to STL across various metrics. MTL consistently achieves higher accuracy, F1 score, sensitivity, and specificity in all the models except EfficientNet. Regarding sensitivity, which is the ability of the models to identify patients with dysphonia correctly, we can see that ConvNeXt in MTL achieves slightly above 96%, which outperforms all other models.

A detailed analysis of the results reveals slight variations between the MTL and STL approaches in terms of severity estimation performance. While there are instances where MTL achieves a slightly lower RMSE than STL, it is essential to note that these differences are not substantial. For example, in the case of ResNet, the MTL model achieves an RMSE of 0.547 in the best configuration, while the STL model achieves an RMSE of 0.555. Similarly, the best DenseNet model yields an RMSE of 0.622 for MTL and 0.631 for STL. These differences, although present, are minimal and may not significantly impact the overall regression performance.

The MobileNet model stands out with the lowest RMSE of 0.546 when using STL, indicating its better performance in severity estimation.
TABLE 3. Binary classification for MobileNet with different beta values in 5-fold cross-validation.
IV. DISCUSSION
Although the research area of distinguishing normal from dysphonic voice and estimating the severity of dysphonia has long been studied in the literature, these methods only focused on one task at a time. The objective of this study is to investigate the performance of MTL compared to STL in dysphonia classification and severity estimation tasks. Our findings shed light on the benefits and limitations of incorporating MTL approaches in these tasks.

In the first experiment, we aimed to achieve two objectives: classifying normal from dysphonic speech and estimating the severity score together. We consider both tasks to be equally important. Predicting the severity scores and differentiating between healthy speech and dysphonia provide practical insights into the patient’s condition. This comprehensive approach enables clinicians to understand the patient’s situation better and effectively monitor their progress, both during treatment and as the condition progresses. Results from joint learning indicate that MTL models demonstrated better performance across various evaluation metrics, including an accuracy of 91.69%, an F1 score of 93.20%, and a specificity of 93.57% in the case of MobileNet, and an F1 score of 93.3% with ConvNeXt. The significant improvement in these metrics suggests that jointly considering the classification and regression tasks using MTL facilitates more effective dysphonia classification. The ability of MTL models to leverage shared representations and learn task-specific features simultaneously leads to their enhanced performance.

Regarding severity estimation, both the MTL and STL approaches effectively tackled the regression task. The minor variations observed in RMSE indicate that both approaches demonstrate similar accuracy in estimating the severity of the condition. The occurrence of negative transfer, where the performance of one task negatively impacts another in multitask learning, was not significant. While the difference between STL and MTL is insignificant, the best-performing model, MobileNet, showed a slight advantage of 0.005 RMSE for STL. However, MTL outperformed STL with a difference of 0.006 concerning the Pearson correlation metric. This finding suggests that the choice of MTL may not impact the regression performance compared to STL.

When using sex prediction as an auxiliary task, our results indicated that integrating this additional task improves the accuracy of dysphonia classification in both the binary and multiclass scenarios. The weight assigned to the auxiliary task played an essential role in determining the extent of this improvement. The line charts demonstrated clear trends, with certain models achieving their highest accuracy at specific beta values. In the classification of healthy vs. dysphonic speech, the best performance among all computer vision models was achieved by MobileNet, with an accuracy of 92.58% and an F1 score of 94.12%. Moreover, in the multiclass classification experiments, MTL outperforms STL and achieves 77.53% accuracy compared to 73.26% for STL. MTL demonstrates improved abilities in differentiating between organic and functional dysphonia, potentially aiding clinicians in more accurately classifying and managing specific types of voice disorders using non-invasive and cost-effective methods. However, compared to binary classification, both the STL and MTL approaches face challenges in distinguishing between organic and functional dysphonia in general, which suggests these disorders affect speech similarly. By examining the confusion matrix of the MTL approach, as shown in Figure 6-B, it becomes evident that MTL has better capabilities for differentiating between these two classes. More research needs to be performed to understand the reason behind this confusion between the two types of dysphonia. By learning from multiple tasks, models can effectively learn the shared information and extract robust features related to dysphonia evaluation. This shared representation learning improves the model’s ability to generalize to unseen data and enhances its performance in real-world scenarios. MTL also offers advantages regarding computational costs and training time: rather than training two separate models, one for each specific task, MTL allows both tasks to be performed within
the same framework. These findings suggest that including sex prediction as an auxiliary task provides valuable contextual information that complements the dysphonia classification task.

In clinical practice, for the classification and severity estimation of dysphonia, specialists rely on standards like GRBAS, CAPE-V, RBH, and other diagnostic frameworks. These existing standards serve as a benchmark for clinicians in assessing voice disorders. The advantage of this method is that it provides clinicians with novel computational and cost-effective methodologies alongside these standards. MTL offers valuable insights to clinicians in tandem with their existing diagnostic approaches and experience. Furthermore, the methods described here can serve as pre-screening in pre-diagnosis stages, such as a general practitioner examination, or can even be used at home on mobile devices. This could shed light on a possible pathological condition, thus bringing forward the meeting with an expert and facilitating correct treatment. Ultimately, the successful integration of computational methodologies into dysphonia assessment should be used as a complement rather than a replacement, enabling a collaborative approach that leverages both technological advancements and clinical expertise.

V. CONCLUSION
In this paper, we presented an approach that combines transfer learning and multitask learning for the binary classification and severity estimation of dysphonia, as well as multiclass classification. Leveraging five computer vision deep learning architectures, we fine-tuned them on our dataset after modifying their architecture to accommodate multitask learning. A significant advantage of these deep learning models is their ability to learn feature representations without manual feature engineering, which is often required in traditional ML methods. Our experimental results revealed the benefits of multitask learning in the joint classification and severity estimation of dysphonia. Compared to their single-task learning counterparts, the multitask learning models demonstrated more promising performance in distinguishing between healthy and dysphonic speech while maintaining a comparable level of accuracy in severity estimation. This demonstrates the effectiveness of leveraging shared knowledge and interdependencies between tasks to enhance overall performance. Moreover, we found that multitask learning facilitated better feature representation learning, enabling the models to discriminate between organic and functional dysphonia effectively. This improved capability to distinguish between the two types of dysphonia highlights the advantage of using the speaker’s sex as an auxiliary task in MTL.

In conclusion, our proposed approach demonstrates promising results in dysphonia classification and severity estimation. By leveraging deep learning architectures and exploiting the interdependencies between tasks, we achieve enhanced performance and contribute to a better understanding of dysphonia-related factors. Future research will expand the range of additional tasks to improve the performance of multitask learning further. Additionally, evaluating the generalizing ability of our approach on larger, more diverse, and multilingual datasets would provide valuable insights. These efforts will contribute to advancing the field of dysphonia assessment and its clinical applications.

REFERENCES
[1] P. Belin, S. Fecteau, and C. Bédard, ‘‘Thinking the voice: Neural correlates of voice perception,’’ Trends Cognit. Sci., vol. 8, no. 3, pp. 129–135, Mar. 2004.
[2] G. Szatloczki, I. Hoffmann, V. Vincze, J. Kalman, and M. Pakaski, ‘‘Speaking in Alzheimer’s disease, is that an early sign? Importance of changes in language abilities in Alzheimer’s disease,’’ Frontiers Aging Neurosci., vol. 7, p. 195, Oct. 2015.
[3] A. S. Cohen, J. E. McGovern, T. J. Dinzeo, and M. A. Covington, ‘‘Speech deficits in serious mental illness: A cognitive resource issue?’’ Schizophrenia Res., vol. 160, nos. 1–3, pp. 173–179, Dec. 2014.
[4] A. S. Cohen and B. Elvevåg, ‘‘Automated computerized analysis of speech in psychiatric disorders,’’ Current Opinion Psychiatry, vol. 27, no. 3, pp. 203–209, 2014.
[5] M. L. Poole, A. Brodtmann, D. Darby, and A. P. Vogel, ‘‘Motor speech phenotypes of frontotemporal dementia, primary progressive aphasia, and progressive apraxia of speech,’’ J. Speech, Lang., Hearing Res., vol. 60, no. 4, pp. 897–911, Apr. 2017.
[6] A. Suppa et al., ‘‘Voice in Parkinson’s disease: A machine learning study,’’ Frontiers Neurol., vol. 13, Feb. 2022, Art. no. 831428.
[7] B. Hajduska-Dér, G. Kiss, D. Sztahó, K. Vicsi, and L. Simon, ‘‘The applicability of the Beck depression inventory and Hamilton depression scale in the automatic recognition of depression based on speech signal processing,’’ Frontiers Psychiatry, vol. 13, p. 1767, Aug. 2022.
[8] M. M. Johns, R. T. Sataloff, A. L. Merati, and C. A. Rosen, ‘‘Article commentary: Shortfalls of the American Academy of Otolaryngology—Head and Neck Surgery’s clinical practice guideline: Hoarseness (dysphonia),’’ Otolaryngol.-Head Neck Surg., vol. 143, no. 2, pp. 175–177, Aug. 2010.
[9] E. Nerrière, M.-N. Vercambre, F. Gilbert, and V. Kovess-Masféty, ‘‘Voice disorders and mental health in teachers: A cross-sectional nationwide study,’’ BMC Public Health, vol. 9, no. 1, pp. 1–8, Dec. 2009.
[10] R. J. Stachler et al., ‘‘Clinical practice guideline: Hoarseness (dysphonia) (update),’’ Otolaryngol.-Head Neck Surg., vol. 158, no. S1, pp. S1–S42, Mar. 2018.
[11] L. Crevier-Buchman, T. Ch, A. Sauvignet, S. Brihaye-Arpin, and M.-C. Monfrais-Pfauwadel, ‘‘Diagnosis of non-organic dysphonia in adult,’’ Revue Laryngologie-Otologie-Rhinologie, vol. 126, no. 5, pp. 353–360, 2005.
[12] A. Millar, I. J. Deary, J. A. Wilson, and K. MacKenzie, ‘‘Is an organic/functional distinction psychologically meaningful in patients with dysphonia?’’ J. Psychosomatic Res., vol. 46, no. 6, pp. 497–505, Jun. 1999.
[13] A. E. Aronson, Clinical Voice Disorders: An Interdisciplinary Approach. New York, NY, USA: B. C. Decker, 1990.
[14] N. P. Connor and D. M. Bless, Functional and Organic Voice Disorders (Cambridge Handbooks in Language and Linguistics). Cambridge, U.K.: Cambridge Univ. Press, 2013.
[15] C. L. Payten, G. Chiapello, K. A. Weir, and C. J. Madill, ‘‘Frameworks, terminology and definitions used for the classification of voice disorders: A scoping review,’’ J. Voice, vol. 2022, p. 89, Mar. 2022.
[16] C. Sapienza and B. Hoffman, Voice Disorders. San Diego, CA, USA: Plural Publishing, 2020.
[17] J. P. Teixeira, C. Oliveira, and C. Lopes, ‘‘Vocal acoustic analysis—Jitter, shimmer and HNR parameters,’’ Proc. Technol., vol. 9, pp. 1112–1122, Jan. 2013.
[18] Z. Kh. Abdul and A. K. Al-Talabani, ‘‘Mel frequency cepstral coefficient and its applications: A review,’’ IEEE Access, vol. 10, pp. 122136–122158, 2022.
[19] S. Jothilakshmi, ‘‘Automatic system to detect the type of voice pathology,’’ Appl. Soft Comput., vol. 21, pp. 244–249, Aug. 2014.
[20] J. Reid, P. Parmar, T. Lund, D. K. Aalto, and C. C. Jeffery, ‘‘Development of a machine-learning based voice disorder screening tool,’’ Amer. J. Otolaryngol., vol. 43, no. 2, Mar. 2022, Art. no. 103327.
[21] D. R. A. Leite, R. M. de Moraes, and L. W. Lopes, ‘‘Different performances of machine learning models to classify dysphonic and non-dysphonic voices,’’ J. Voice, vol. 2022, pp. 1–10, Dec. 2022.
[22] L. Verde, G. De Pietro, and G. Sannino, ‘‘Voice disorder identification by using machine learning techniques,’’ IEEE Access, vol. 6, pp. 16246–16255, 2018.
[23] Z. Dankovičová, D. Sovák, P. Drotár, and L. Vokorokos, ‘‘Machine learning approach to dysphonia detection,’’ Appl. Sci., vol. 8, no. 10, p. 1927, Oct. 2018.
[24] P. Harar, Z. Galaz, J. B. Alonso-Hernandez, J. Mekyska, R. Burget, and Z. Smekal, ‘‘Towards robust voice pathology detection: Investigation of supervised deep learning, gradient boosting, and anomaly detection approaches across four databases,’’ Neural Comput. Appl., vol. 32, no. 20, pp. 15747–15757, Oct. 2020.
[25] S. N. Awan and N. Roy, ‘‘Toward the development of an objective index of dysphonia severity: A four-factor acoustic model,’’ Clin. Linguistics Phonetics, vol. 20, no. 1, pp. 35–49, Jan. 2006.
[26] M. G. Tulics and K. Vicsi, ‘‘The automatic assessment of the severity of dysphonia,’’ Int. J. Speech Technol., vol. 22, no. 2, pp. 341–350, Jun. 2019.
[27] M. Wu and L. Chen, ‘‘Image recognition based on deep learning,’’ in Proc. Chin. Autom. Congr. (CAC), 2015, pp. 542–546.
[28] A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavaf, and E. A. Fox, ‘‘Natural language processing advancements by deep learning: A survey,’’ 2020, arXiv:2003.01200.
[29] U. Kamath, J. Liu, and J. Whitaker, Deep Learning for NLP and Speech Recognition, vol. 84. Cham, Switzerland: Springer, 2019.
[30] J.-N. Lee and J.-Y. Lee, ‘‘An efficient SMOTE-based deep learning model for voice pathology detection,’’ Appl. Sci., vol. 13, no. 6, p. 3571, Mar. 2023.
[31] R. Islam and M. Tarique, ‘‘A novel convolutional neural network based dysphonic voice detection algorithm using chromagram,’’ Int. J. Electr. Comput. Eng., vol. 12, no. 5, p. 5511, Oct. 2022.
[32] A. Ksibi, N. A. Hakami, N. Alturki, M. M. Asiri, M. Zakariah, and M. Ayadi, ‘‘Voice pathology detection using a two-level classifier based on combined CNN–RNN architecture,’’ Sustainability, vol. 15, no. 4, p. 3204, Feb. 2023.
[33] H. Wu, J. Soraghan, A. Lowit, and G. Di-Caterina, ‘‘A deep learning method for pathological voice detection using convolutional deep belief networks,’’ in Proc. Interspeech, Sep. 2018.
[34] S. A. Syed, M. Rashid, S. Hussain, and H. Zahid, ‘‘Comparative analysis of CNN and RNN for voice pathology detection,’’ BioMed Res. Int., vol. 2021, pp. 1–8, Apr. 2021.
[35] R. Islam, E. Abdel-Raheem, and M. Tarique, ‘‘Voice pathology detection using convolutional neural networks with electroglottographic (EGG) and speech signals,’’ Comput. Methods Programs Biomed. Update, vol. 2, 2022, Art. no. 100074.
[36] A. N. Omeroglu, H. M. A. Mohammed, and E. A. Oral, ‘‘Multi-modal voice pathology detection architecture based on deep and handcrafted feature fusion,’’ Eng. Sci. Technol., Int. J., vol. 36, Dec. 2022, Art. no. 101148.
[37] L. Chen and J. Chen, ‘‘Deep neural network for automatic classification of pathological voice signals,’’ J. Voice, vol. 36, no. 2, pp. 288.e15–288.e24, Mar. 2022.
[38] D. Sztahó, K. Gábor, and T. Gábriel, ‘‘Deep learning solution for pathological voice detection using LSTM-based autoencoder hybrid with multi-task learning,’’ in Proc. 14th Int. Joint Conf. Biomed. Eng. Syst. Technol., 2021, pp. 135–141.
[39] M. G. Tulics, L. J. Lavati, K. Mészáros, and K. Vicsi, ‘‘Possibilities for the automatic classification of functional and organic dysphonia,’’ in Proc. Int. Conf. Speech Technol. Hum.-Comput. Dialogue (SpeD), Oct. 2019, pp. 1–6.
[40] C. Robotti et al., ‘‘Treatment of relapsing functional and organic dysphonia: A narrative literature review,’’ Acta Otorhinolaryngol. Ital., vol. 43, p. S84, Apr. 2023.
[41] A. Paszke et al., ‘‘PyTorch: An imperative style, high-performance deep learning library,’’ in Proc. Annu. Conf. Neural Inf. Process. Syst., 2019, pp. 8024–8035.
[42] R. Schönweiler, M. Hess, P. Wübbelt, and M. Ptok, ‘‘Novel approach to acoustical voice analysis using artificial neural networks,’’ J. Assoc. Res. Otolaryngol., vol. 1, pp. 270–282, Jan. 2000.
[43] M. Ptok, C. Schwemmle, C. Iven, M. Jessen, and T. Nawka, ‘‘[On the auditory evaluation of voice quality],’’ HNO, vol. 54, pp. 793–802, Oct. 2006.
[44] F. Pedregosa et al., ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach. Learn. Res., vol. 12, no. 10, pp. 2825–2830, 2012.
[45] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
[47] A. G. Howard et al., ‘‘MobileNets: Efficient convolutional neural networks for mobile vision applications,’’ 2017, arXiv:1704.04861.
[48] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, ‘‘A ConvNet for the 2020s,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11966–11976.
[49] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for convolutional neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2019, pp. 6105–6114.
[50] R. Caruana, ‘‘Multitask learning,’’ Mach. Learn., vol. 28, no. 1, pp. 41–75, 1997.
[51] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee, ‘‘12-in-1: Multi-task vision and language representation learning,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10434–10443.
[52] J. Worsham and J. Kalita, ‘‘Multi-task learning for natural language processing in the 2020s: Where are we going?’’ Pattern Recognit. Lett., vol. 136, pp. 120–126, Aug. 2020.
[53] B. T. Atmaja, A. Sasou, and M. Akagi, ‘‘Speech emotion and naturalness recognitions with multitask and single-task learnings,’’ IEEE Access, vol. 10, pp. 72381–72387, 2022.