
Received 24 August 2023; revised 30 November 2023; accepted 4 December 2023.

Date of publication 7 December 2023; date of current version 26 December 2023.


Digital Object Identifier 10.1109/JTEHM.2023.3340345

Multitask and Transfer Learning Approach for Joint Classification and Severity Estimation of Dysphonia
DOSTI AZIZ AND SZTAHÓ DÁVID
Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, 1117 Budapest, Hungary
CORRESPONDING AUTHOR: D. AZIZ ([email protected])
This work was supported in part by the National Research, Development and Innovation Fund of Hungary,
under the K_22 Funding Scheme under Grant K143075.

ABSTRACT Objective: Despite speech being the primary communication medium, it carries valuable
information about a speaker’s health, emotions, and identity. Various conditions can affect the vocal organs,
leading to speech difficulties. Extensive research has been conducted by voice clinicians and academia in
speech analysis. Previous approaches primarily focused on one particular task, such as differentiating between
normal and dysphonic speech, classifying different voice disorders, or estimating the severity of voice
disorders. Methods and procedures: This study proposes an approach that combines transfer learning and
multitask learning (MTL) to simultaneously perform dysphonia classification and severity estimation. Both
tasks use a shared feature representation, from which task-specific outputs are learned. We employed five computer
vision models and changed their architecture to support multitask learning. Additionally, we conducted binary
‘healthy vs. dysphonia’ and multiclass ‘healthy vs. organic and functional dysphonia’ classification using
multitask learning, with the speaker’s sex as an auxiliary task. Results: The proposed method achieved
improved performance across all classification metrics compared to single-task learning (STL), which only
performs classification or severity estimation. Specifically, the model achieved F1 scores of 93% and 90%
in MTL and STL, respectively. Moreover, we observed considerable improvements in both classification
tasks by evaluating beta values associated with the weight assigned to the sex-predicting auxiliary task. MTL
achieved an accuracy of 77% compared to the STL score of 73.2%. However, the performance of severity
estimation in MTL was comparable to STL. Conclusion: Our goal is to improve how voice pathologists and
clinicians understand patients’ conditions, make it easier to track their progress, and enhance the monitoring
of vocal quality and treatment procedures.

INDEX TERMS Multitask learning, dysphonia, voice pathology, deep learning, speech.
Clinical and Translational Impact Statement: By integrating both classification and severity estimation of
dysphonia using multitask learning, we aim to enable clinicians to gain a better understanding of the patient's
situation and to monitor their progress and voice quality effectively.

I. INTRODUCTION

SPEECH has been recognized as a primary factor in human interaction. It plays a crucial role in socialization and overall well-being by enabling individuals to communicate and share ideas. Human speech can reveal necessary information about an individual's identity and health status [1]. Several psychiatric and neurodegenerative conditions can impact the organs responsible for speech production, causing individuals to struggle with producing normal speech [2], [3], [4], [5]. Considering these factors, speech analysis has the potential to become a valuable clinical tool for diagnosing and monitoring a wide range of medical conditions such as Alzheimer's [4], Parkinson's [6], depression [7], and dysphonia. By analyzing speech patterns, researchers and healthcare professionals can gain valuable insights into an individual's cognitive, motor, and emotional processes, which can provide practical diagnostic information and improve treatment procedures.

Dysphonia, a condition characterized by impaired voice production, affects almost a third of the population at some point in life. Although dysphonia is often used interchangeably with hoarseness, hoarseness is a symptom of altered voice quality reported by patients, while dysphonia is a diagnosis made by clinicians [8]. Dysphonia can affect people of all ages and genders, but it is more common in teachers [9], older adults, and individuals with significant vocal demands. It can be caused by benign or self-limited conditions but may also indicate a more severe or progressive condition that requires prompt management [10].

Dysphonia has been categorized into two groups: organic and functional dysphonia. Functional dysphonia refers to a voice condition that does not correlate with neurodegenerative or anatomical factors. It originates from abnormal laryngeal function or from vocal misuse or overuse, which can lead to inefficient oral communication. Any stage of voice production can be affected by this condition [11]. Organic dysphonia can be further divided into structural and neurogenic categories. Neurogenic dysphonia is caused by abnormal control, coordination, or strength of the vocal folds due to neurological diseases such as Parkinson's. Structural organic dysphonia is caused by morphological changes such as vocal cord nodules, polyps, gastroesophageal reflux disease (GERD), and vocal cord paralysis [10].

While the terms functional and organic dysphonia exist in the literature, there are contradictory arguments about whether they are really distinguishable. Speech-language pathology has historically used the term ''functional'' to describe voice disorders linked to psychogenic causes or personality variables. Initially, functional and psychogenic voice disorders were seen as closely related. However, this view has been challenged by [12], whose findings indicate no significant differences in personality traits or psychological distress between individuals with organic and functional dysphonia. Suggested common causes for functional dysphonia encompass psychoneuroses, personality disorders, and faulty voice habits. These disorders were believed to occur in individuals with normal laryngeal anatomy and physiology, with stress, musculoskeletal tension, and conflicts related to sex identification as specific contributing factors [13]. Others argue that the term ''functional'' dysphonia is inadequate in terms of etiology and merely implies a ''non-organic'' disorder identified through exclusion [14]. In their scoping review, the authors of [15] critically analyzed the diverse terminologies employed in classifying voice disorders, highlighting disparities among professionals and researchers. Their review underscores the need for a standardized classification system to facilitate effective communication among clinicians and to provide appropriate treatment guidance. The study suggests categorizing hyperfunctional muscle issues as ''Muscle Tension Dysphonia,'' psychosocially based voice disorders as ''Functional (Psychogenic) Voice Disorder,'' and disorders stemming from organic or neurodegenerative causes as ''Organic Dysphonia'' [15]. The validity of the distinction between individuals with functional and organic dysphonia has yet to be thoroughly examined and tested. Therefore, it is crucial to approach this differentiation with caution, and further research is needed to establish an encompassing and standardized classification framework for voice disorders.

The conventional diagnostic approach for dysphonia involves multiple visits to healthcare professionals and voice therapists, utilizing techniques like laryngoscopy, stroboscopy, laryngeal electromyography, and auditory analysis. However, these methods often cause patient discomfort and distress [16]. Additionally, they are time-consuming, subject to variability due to their subjective nature, and expensive, imposing a significant financial burden ranging from 577 to 953 US dollars per patient per year [10]. These factors, together with recent advancements in machine learning (ML), have motivated the development of objective and reliable methods for diagnosing dysphonia. ML algorithms can detect and classify dysphonia accurately by analyzing speech signals, providing valuable diagnostic information for healthcare professionals. In such studies, speech samples are recorded from patients with voice disorders and from control groups, and various acoustic features are extracted from them, such as jitter, shimmer, harmonic-to-noise ratio (HNR), fundamental frequency, and Mel-frequency cepstral coefficients (MFCCs) [17], [18]. Features extracted from utterances are used to train ML algorithms to classify healthy and voice-disordered persons. In [19], MFCC and linear prediction cepstral coefficients (LPC) were extracted from sustained /a/ vowels and used to train a Hidden Markov Model (HMM) and a Gaussian Mixture Model (GMM), achieving accuracies of 94.44% and 95.74%, respectively. In [20], a combination of VGG16 and a Support Vector Machine (SVM) was proposed: features extracted by VGG16 from the sustained /i/ vowel were used to train an SVM classifier, achieving an accuracy of 96.7%. Other works focusing on sustained vowels have compared the performance of different ML algorithms [21], [22], explored the effectiveness of algorithms such as K-nearest neighbors (KNN), SVM, and random forest [23], and investigated XGBoost, isolation forest, and DenseNet [24]. The study in [25] predicted dysphonic speech severity from time and spectral features of sustained vowels using step-wise multiple regression, yielding a mean R of 0.880 and a mean R² of 0.775. Continuous speech samples were used for automated dysphonia severity assessment in [26], achieving 89% accuracy for binary classification and a root mean square error (RMSE) of 0.49 for severity estimation.


Deep learning techniques, like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have gained traction in analyzing pathological speech due to their impressive performance in diverse domains such as image recognition [27], natural language processing [28], and speech recognition [29]. Several studies have been conducted in the research area of voice disorders. One study achieved an accuracy of 98.9% using a CNN with oversampling techniques to deal with class imbalance [30]. A CNN with a chromagram feature set was used in [31], and a CNN combined with an RNN in [32]. In [33], a Convolutional Deep Belief Network (CDBN) was used to pre-train the weights of a CNN model, and an accuracy of 71% was achieved. In [34], a performance comparison between CNN and RNN was conducted; the study used 10-fold cross-validation and reported accuracies of 87.11% and 86.52% for CNN and RNN, respectively. A combination of features from EGG and speech from the sustained vowel /a/ has been considered for distinguishing normal and pathological voices [35], [36]. Other studies considered a stacked autoencoder [37] and an LSTM-based autoencoder [38] using sustained vowel /a/ and continuous speech samples, respectively.

Limited research exists on distinguishing functional from organic dysphonia. Only [39] has performed classification between these two categories, using handcrafted feature extraction and SVM. Classifying between organic and functional dysphonia is essential since the treatment procedures may differ. Benign organic dysphonia generally may require medical interventions or surgical procedures, except for vocal fold nodules, for which voice therapy is an alternative or adjunct therapy before and after surgery. Treatment procedures for dysphonia of neurodegenerative origin include voice therapy, laryngeal injection with a variety of fillers, laryngeal framework surgery, and laryngeal re-innervation. Functional dysphonia may be managed through voice therapy, psychological interventions, or a combination of the two [40]. Accurate classification ensures that patients receive appropriate and targeted treatment approaches.

Although previous studies have offered valuable insights into voice disorder detection and assessment, they have focused mainly on one task, such as binary or multiclass classification, or the prediction of severity scores. Additionally, most of the research has used sustained vowels, a method that might not fully capture the complexities of natural speech patterns. Furthermore, limited attention has been given to distinguishing between organic and functional dysphonia.

This paper introduces a novel approach to dysphonia evaluation by utilizing transfer and multitask learning to distinguish between normal and dysphonic speech and to predict severity scores simultaneously. Additionally, we performed multiclass classification between functional dysphonia, organic dysphonia, and normal speech using continuous speech samples. The similarity in the effects of the two conditions on speech samples adds complexity to the classification process. To address this challenge, we implemented multitask learning by incorporating the speaker's sex as an auxiliary task. This approach aims to enhance the model's classification accuracy and its overall ability to generalize. To the best of our knowledge, no previous attempts have been made to conduct both classification and severity estimation of dysphonia patients simultaneously using multitask learning. Both tasks are considered equally important, as together they allow clinicians to gain comprehensive insights into the patient's condition and to monitor their progress effectively. Our approach adopted continuous speech analysis, as it faithfully captures natural speech patterns, closely corresponding to natural communication in everyday scenarios. This aligns more closely with real-life conditions than analyzing isolated speech fragments, enhancing the realism and practical applicability of our findings.

II. MATERIALS AND METHODS
Five different computer vision models were used in this research. The architectures of these models were changed to support multitask learning, as discussed in Section II-C. Fine-tuning was performed using 5-fold cross-validation, with a learning rate of 0.0001, a batch size of 4, the Adam optimizer, and 50 epochs. The implementation was carried out using the PyTorch framework [41]; pre-trained versions of these networks are available in the torchvision module. In each training and validation fold, two models were saved: the final model and the best model. The final model is the fine-tuned model after 50 epochs, and the best model is the result of early stopping, i.e., the model with the lowest validation loss. The saved models (final and best) were independently tested on the test set.
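To make the training protocol concrete, the sketch below shows how one cross-validation fold could be fine-tuned in PyTorch with the settings listed above (Adam, learning rate 0.0001, 50 epochs), keeping both the final model and the best model by validation loss. It is a minimal single-task illustration under our assumptions about the data loaders, not the authors' released code; the multitask loss of Section II-C would replace `criterion`.

```python
import copy
import torch

def finetune_one_fold(model, train_loader, val_loader, device="cuda",
                      epochs=50, lr=1e-4):
    """Fine-tune a pre-trained model on one CV fold (batch size is set
    when the DataLoaders are built, e.g. batch_size=4).

    Returns the final model (after the last epoch) and the best model
    (lowest validation loss, i.e. the early-stopping checkpoint).
    """
    model = model.to(device)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())

    for epoch in range(epochs):
        # --- training pass over the fold's training split ---
        model.train()
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        # --- validation pass ---
        model.eval()
        val_loss, n_batches = 0.0, 0
        with torch.no_grad():
            for x, y in val_loader:
                x, y = x.to(device), y.to(device)
                val_loss += criterion(model(x), y).item()
                n_batches += 1
        val_loss /= max(n_batches, 1)

        # keep the checkpoint with the lowest validation loss ("best" model)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())

    final_model = model                        # state after the last epoch
    best_model = copy.deepcopy(model)
    best_model.load_state_dict(best_state)     # early-stopping checkpoint
    return final_model, best_model
```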
A. DATASET
We used Hungarian speech samples from two categories of dysphonia, functional and organic, along with healthy controls. All speakers were native Hungarian speakers and read a short passage titled ''The North Wind and the Sun,'' an English text translated into many languages, including Hungarian, and frequently used by phoneticians. Speech samples were collected from patients who had agreed to participate in the study. The recordings were performed at the Head and Neck Surgery department of the National Institute of Oncology during consultations. Functional dysphonia was distinguished from organic dysphonia by a specialist at the same department. Among the recorded patients, various health conditions were observed, including functional dysphonia, recurrent paresis, tumors in different parts of the vocal tract, gastroesophageal reflux disease, chronic inflammation of the larynx, bulbar paresis, amyotrophic lateral sclerosis, leucoplakia, spasmodic dysphonia, and more. The speech recording was performed in a quiet (clinical office) environment with PCM audio coding, a 16 kHz sampling rate, and 16-bit quantization. In total, 441 speech samples were used, including 179 healthy control samples, 179 organic dysphonia samples, and 83 functional dysphonia samples.
A panel of three specialists graded the severity scores. The first specialist directly interacted with the patients during the grading process; the other two experts evaluated the speech recordings while listening to them in a quiet environment. The final severity score used in this study is the rounded average of the three raters' scores. The RBH scale, which stands for Roughness, Breathiness, and Hoarseness [42], [43], was used to determine the severity level. The scale ranges from 0 to 3; for the speech samples in this study, H was selected as the severity measure, ranging from 0 (no hoarseness, ''healthy speaker'') to 3 (''severe hoarseness''). The class and severity distribution by sex is presented in Table 1.


TABLE 1. Sex and severity distribution of each class in the dataset.

The dataset is divided into an 80% training set and a 20% test set. The training set was further split into training and validation sets for 5-fold cross-validation using the scikit-learn library [44]: one fold was used for validation, while the remaining four were used for training the models.
Two manual seeds were used to obtain a realistic view of our method's performance. The manual seeds were chosen so that the test set contains representative samples with respect to the severity level of the speech and the sex of the speakers. One seed is used for the joint classification and regression tasks, while the other is used for binary and multiclass classification with the speaker's sex as an auxiliary task. A sketch of this splitting scheme is given below.
B. DEEP LEARNING MODELS
The computer vision models used for the experiments are ResNet50, DenseNet, MobileNet, ConvNeXt, and EfficientNet. These models were originally designed for image classification and trained on the ImageNet dataset. In this section, we briefly describe them.

1) ResNet
Deep neural networks have benefited from weight initialization and batch normalization, which have mitigated issues like vanishing and exploding gradients. However, they are still susceptible to the degradation problem, where network accuracy plateaus or decreases as the network depth increases. To address this challenge, researchers from Microsoft proposed ResNet [45], a residual network architecture. ResNet has become a prominent solution to the degradation problem due to its unique design of skip connections, which allow smoother gradient flow and enable the training of much deeper networks. Figure 1 illustrates the architecture of the ResNet block.

FIGURE 1. Residual block.

ResNet employs skip connections to enable more efficient learning of identity mappings within the neural network. As shown in Figure 1, the feature x is summed with the output of the last layer and passed through a nonlinear function, typically ReLU, establishing a direct information flow through the network. This technique has been pivotal in training large-scale neural networks and has gained widespread adoption by researchers when constructing substantial network architectures. In this experiment, ResNet50 was used, which consists of 50 layers with a skip connection between every two layers [45].

2) DenseNet
The Dense Convolutional Network (DenseNet) builds on the residual network concept by adding more connections between the layers of the network. In DenseNet, the layers are connected using dense connectivity, which means each layer's output is concatenated with the inputs of all subsequent layers. This creates a dense connectivity pattern between the layers, allowing the network to reuse features from earlier layers and retain more information throughout the network. The DenseNet-B and DenseNet-BC architectures have demonstrated superior performance compared to other deep learning architectures on various image classification benchmarks because of the use of bottleneck layers between dense blocks. Furthermore, DenseNets use fewer parameters because of these bottleneck layers, making them advantageous for tasks with limited computational resources or where model size is a constraint [46].

3) MobileNet
MobileNet was designed for efficient processing on mobile devices and embedded systems with limited computational resources. The authors introduce depthwise separable convolutions as the building block of the network, as an alternative to the conventional convolution layer. In a depthwise convolution, a single convolutional filter is applied to each input channel to achieve a lightweight filtering process. After that, a pointwise 1 × 1 convolution is applied to compute linear combinations of the input channels, creating new features. The resulting network has fewer parameters and requires less computation than traditional convolutional networks while still performing well on various tasks [47].


toward the vision transformer’s design. This involved imple- Natural Language Processing [52], and speech processing
menting changes such as depthwise convolutions, replacing [53]. Researchers have found that MTL perform better than
Relu with Gelu activation functions, and other changes. Con- their STL counterparts. This can be attributed to MTL’s abil-
vNeXt, a family of pure ConvNeXt models constructed, have ity to leverage a larger amount of data from diverse learning
been competing with Transformers regarding accuracy and tasks, facilitating enhanced knowledge sharing among tasks,
scalability [48]. improved performance for each task, and mitigating risks of
overfitting. The most straightforward architecture of MTL
5) EfficientNet
is a shared representation. In this case, all tasks are trained
in parallel with the same representation feature; each task
The author of this model proposed a new scaling method
has a different output layer corresponding to the task. The
for designing neural network architectures. The approach
loss function of the MTL network is the combination of loss
involves scaling the network’s depth, width, and resolution
functions of the tasks performed by the model. Eq. (1) shows
using compound scaling based on the intuition that larger
the loss function of our multitask learning network.
input images require more layers and channels to capture
detailed information. As a result, they developed a family of Lcombined = β1 Lt1 + β2 Lt2 (1)
eight models called EfficientNetB0 to B7 that outperformed
Lt1 and Lt2 are loss functions corresponding to tasks
popular neural network architectures like ResNet, Inception,
in a multitasking network. Furthermore, β1 and β2 are
and MobileNet in terms of accuracy and efficiency, even with
the corresponding hyperparameters that control each task’s
fewer parameters [49].
importance in the network.
In fact, deep learning architectures are data-hungry and
We chose cross-entropy loss function for Lt1 for all three
require large amounts of data and significant computation
experiments we conducted. For the experiment of joint classi-
power, making it difficult for individuals or small organi-
fication and regression of dysphonia speech, we have selected
zations to access these resources. Transfer learning plays a
mean squared error as loss function Lt2 . In contrast, for binary
crucial role in scenarios where the database size is insufficient
and multiclass classification experiments using sex prediction
for training deep learning (DL) from scratch. Transfer learn-
as an auxiliary task, mean square error loss has been replaced
ing is a technique in which the model’s parameter learned
by another cross-entropy loss function.
in a specific domain is transferred to another domain and
The choice of hyperparameters β1 and β2 depends on the
fine-tuned on smaller datasets. The idea is that the pre-trained
relative importance of each task with values ranging between
models have already learned useful features and represen-
0 and 1. In the joint classification and regression, we want
tations that can be generalized to the new task, even if the
to perform both tasks simultaneously with equal importance,
new task’s dataset is relatively small or different. Considering
so the values of β1 and β2 are equal to 1. Knowing the severity
the limited size of our dataset, we adopted transfer learning.
level of the patients along the classification task will help
Different ways exist when using transfer learning, depending
clinicians determine the disease’s advancement and can also
on datasets and performed tasks. The first scenario uses these
help monitor the progress of treatment over time. In the binary
models as a feature extractor by freezing all the model’s
and multiclass classification cases, we investigate the impact
layers. Features can be obtained before the last classifier
of gradually increasing the weight β2 assigned to the sex
layer. These high-dimensional features can be used to train
prediction auxiliary task in the multitasking approach. This
neural networks or any other ML algorithm. Unfreezing and
allows us to assess its influence on the accuracy of the main
fine-tuning some layers of the network is another approach.
classification task. The detailed explanation of the effect of
Lastly is fine-tuning all neural network layers, which takes
β2 values in binary and multiclass scenarios is discussed in
more time and computation but usually leads to better perfor-
section III.
mance. Considering our domain is speech and quite different
As discussed in section II-B, five pre-trained computer
from the image in which these models have been trained,
vision models were used in these experiments. These mod-
we considered the last option.
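The difference between the feature-extractor and full fine-tuning scenarios can be illustrated with a few lines of PyTorch. The snippet below is a generic sketch using a recent torchvision API, not the exact configuration used in our experiments.

```python
import torch
from torchvision import models

def load_backbone(finetune_all_layers=True):
    """Load an ImageNet-pre-trained MobileNetV2 backbone.

    finetune_all_layers=True  -> all weights are updated (the option adopted
                                 in this work, since spectrograms differ
                                 strongly from natural images).
    finetune_all_layers=False -> the backbone is frozen and acts as a fixed
                                 feature extractor.
    """
    backbone = models.mobilenet_v2(
        weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
    for param in backbone.parameters():
        param.requires_grad = finetune_all_layers
    return backbone
```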
C. MULTITASK LEARNING
Multitask Learning (MTL) was first introduced by Caruana as a learning paradigm in ML that enables learning multiple related tasks jointly to improve the generalization performance of all the tasks. The motivation for MTL was to alleviate the data sparsity problem, where each task has a limited amount of labeled data. MTL aggregates the labeled data of all the tasks to obtain a more accurate learner for each task, which helps reuse existing knowledge across tasks [50]. Since then, many researchers have adopted this methodology in many areas, including computer vision [51], natural language processing [52], and speech processing [53]. Researchers have found that MTL models perform better than their STL counterparts. This can be attributed to MTL's ability to leverage a larger amount of data from diverse learning tasks, facilitating enhanced knowledge sharing among tasks, improving performance for each task, and mitigating the risk of overfitting. The most straightforward MTL architecture is a shared representation: all tasks are trained in parallel on the same representation features, and each task has its own output layer. The loss function of the MTL network is the combination of the loss functions of the tasks performed by the model. Eq. (1) shows the loss function of our multitask learning network:

L_combined = β1 · L_t1 + β2 · L_t2    (1)

where L_t1 and L_t2 are the loss functions of the two tasks in the multitask network, and β1 and β2 are the corresponding hyperparameters that control each task's importance in the network.
We chose the cross-entropy loss function for L_t1 in all three experiments we conducted. For the experiment of joint classification and regression of dysphonic speech, we selected mean squared error as the loss function L_t2. In contrast, for the binary and multiclass classification experiments using sex prediction as an auxiliary task, the mean squared error loss is replaced by a second cross-entropy loss function.
The choice of the hyperparameters β1 and β2 depends on the relative importance of each task, with values ranging between 0 and 1. In the joint classification and regression experiment, we want to perform both tasks simultaneously with equal importance, so β1 and β2 are both set to 1. Knowing the severity level of the patients alongside the classification result will help clinicians determine the advancement of the disease and can also help monitor the progress of treatment over time. In the binary and multiclass classification cases, we investigate the impact of gradually increasing the weight β2 assigned to the sex prediction auxiliary task in the multitask approach. This allows us to assess its influence on the accuracy of the main classification task. The effect of the β2 values in the binary and multiclass scenarios is discussed in detail in Section III.
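A sketch of Eq. (1) in PyTorch is given below for the joint classification-and-severity case; swapping the MSE term for a second cross-entropy term yields the sex-auxiliary variant. The class and argument names are ours, chosen for illustration.

```python
import torch
import torch.nn as nn

class MultitaskLoss(nn.Module):
    """Combined loss L = beta1 * L_t1 + beta2 * L_t2 (Eq. 1).

    L_t1: cross-entropy for the main classification task.
    L_t2: mean squared error for severity estimation (replace with a second
          CrossEntropyLoss for the sex-prediction auxiliary task).
    """
    def __init__(self, beta1=1.0, beta2=1.0):
        super().__init__()
        self.beta1, self.beta2 = beta1, beta2
        self.cls_loss = nn.CrossEntropyLoss()
        self.reg_loss = nn.MSELoss()

    def forward(self, cls_logits, severity_pred, cls_target, severity_target):
        l_t1 = self.cls_loss(cls_logits, cls_target)
        l_t2 = self.reg_loss(severity_pred.squeeze(-1), severity_target)
        return self.beta1 * l_t1 + self.beta2 * l_t2

# beta2 = 0 reduces the network to single-task classification;
# beta1 = 0, beta2 = 1 trains severity regression alone.
```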


As discussed in Section II-B, five pre-trained computer vision models were used in these experiments. These models were initially developed for image classification tasks and therefore have single-task architectures. To use them for our work, we needed to modify the architectures. The modification involves replacing the original classifier layer with a new classifier whose number of output features equals the number of classes in our dataset: in the binary case, two output features represent the healthy and dysphonia classes, while in the multiclass case, three output features represent healthy control (HC), functional dysphonia (FD), and organic dysphonia (OD). The modification was performed so that the original architecture of the models is preserved. For example, in the case of ConvNeXt, the final classifier consists of (Linear, ReLU, Dropout, Linear), so we replaced the original sequential model with a new sequential model with the same number of layers for the classifier, and we added a new head with the same sequential order but changed the output of the last linear layer to 1 for the regression task. The same modification scheme, preserving the original structure, was applied to the other four models.
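A minimal sketch of such a two-headed network is shown below, built on a MobileNetV2 backbone for concreteness. The head layout mirrors the idea described above, but the exact layer sizes and the handling of the single-channel spectrogram input (tiled to three channels beforehand) are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class DysphoniaMultitaskNet(nn.Module):
    """Shared backbone with one head per task (classification + severity).

    Input: (B, 3, n_mels, frames); the single-channel mel spectrogram is
    assumed to be repeated along the channel axis to match the backbone.
    """
    def __init__(self, num_classes=2, feat_dim=1280):
        super().__init__()
        base = models.mobilenet_v2(
            weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
        self.backbone = base.features              # shared representation
        self.pool = nn.AdaptiveAvgPool2d(1)

        # task 1: healthy vs. dysphonia (or 3-way HC / FD / OD)
        self.cls_head = nn.Sequential(nn.Dropout(0.2),
                                      nn.Linear(feat_dim, num_classes))
        # task 2: severity estimation (single regression output)
        self.reg_head = nn.Sequential(nn.Dropout(0.2),
                                      nn.Linear(feat_dim, 1))

    def forward(self, x):
        feats = self.pool(self.backbone(x)).flatten(1)
        return self.cls_head(feats), self.reg_head(feats)
```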
By specifying the values of β1 and β2 in Eq. (1), we allow the network to operate in either a multitask or a single-task setting. For single-task classification, β2 = 0; in this case, the network optimizes only the parameters related to the classification of the speech samples. Likewise, for the regression task, we set β1 and β2 to 0 and 1, respectively; in this case, the model is trained only to predict the severity estimate.
D. EXPERIMENT CASES
Overall, six cases were performed in this paper, three for STL and three for MTL. We explain them briefly here; the corresponding loss-weight settings are sketched after the list.
• Joint classification and regression - MTL: performing classification of healthy vs. dysphonia and predicting the severity level simultaneously. For this experiment, β1 and β2 are both equal to 1.
• Binary classification - STL: distinguishing between healthy and dysphonic speech samples only, with β1 = 1 and β2 = 0.
• Regression - STL: predicting the severity level of dysphonia patients, with β1 = 0 and β2 = 1.
• Binary classification - MTL with sex as an auxiliary task: trying different values of β2.
• Multiclass classification - MTL with sex as an auxiliary task: classifying normal vs. organic vs. functional dysphonia.
• Multiclass classification - STL: same as binary classification, with β2 = 0.
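One compact way to encode these six cases is as a table of (β1, β2) settings and auxiliary-task types, as sketched below; the case names are ours, not from the paper.

```python
# (beta1, beta2) and auxiliary-task type for the six experiment cases;
# None for beta2 means the value is swept over (0, 1] during the experiment.
EXPERIMENT_CASES = {
    "joint_cls_reg_mtl":      {"beta1": 1.0, "beta2": 1.0,  "aux": "severity"},
    "binary_cls_stl":         {"beta1": 1.0, "beta2": 0.0,  "aux": None},
    "severity_reg_stl":       {"beta1": 0.0, "beta2": 1.0,  "aux": "severity"},
    "binary_cls_sex_mtl":     {"beta1": 1.0, "beta2": None, "aux": "sex"},
    "multiclass_cls_sex_mtl": {"beta1": 1.0, "beta2": None, "aux": "sex"},
    "multiclass_cls_stl":     {"beta1": 1.0, "beta2": 0.0,  "aux": None},
}
```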
E. SPEECH FEATURES: MEL SPECTROGRAM
As explained in Section II-A, participants read a short passage, and the length of the recording differs between files. The deep learning models used require fixed-length input features. Considering the overall length of our samples, we selected 40 seconds of speech as input to the network: speech files longer than 40 seconds were truncated, and shorter files were padded with zeros. The mel spectrogram is used as the input to the CNN models. A spectrogram is the time-frequency representation of a speech signal; it shows how the frequency content of the speech changes over time. The mel spectrogram is a logarithmic transformation of the frequency axis of the traditional spectrogram, which makes it more perceptually relevant to human hearing. The mel spectrogram is calculated by dividing the audio signal into short overlapping frames and applying a Fourier transform to each frame to obtain its frequency content. The frequency axis of the spectrogram is then converted to the mel scale using a nonlinear transformation, which accounts for the nonlinear nature of human perception of sound. For our study, we extracted 128 mel spectrogram features using a 25 ms Hamming window with 10 ms overlapping frames. Figure 2 presents mel spectrograms for male and female participants. From left to right, the figures show the mel spectrograms of healthy controls and of dysphonia samples with severity levels ranging from 1 (SL 1) to 3 (SL 3). As observed in the figure, there is a clear distinction between the mel spectrograms of healthy individuals and those with dysphonia.
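A sketch of this feature extraction with torchaudio is shown below, assuming the 16 kHz recordings of Section II-A. Interpreting the 10 ms value as the frame shift (hop length) and applying a dB scaling after the mel transform are our assumptions.

```python
import torch
import torchaudio

SR = 16_000                       # 16 kHz recordings
TARGET_LEN = 40 * SR              # fixed 40-second input
WIN = int(0.025 * SR)             # 25 ms Hamming window -> 400 samples
HOP = int(0.010 * SR)             # 10 ms frame shift    -> 160 samples

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SR, n_fft=512, win_length=WIN, hop_length=HOP,
    n_mels=128, window_fn=torch.hamming_window)
to_db = torchaudio.transforms.AmplitudeToDB()

def load_mel(path):
    """Load a recording, pad/truncate to 40 s, return a log-mel spectrogram."""
    wave, sr = torchaudio.load(path)
    if sr != SR:
        wave = torchaudio.functional.resample(wave, sr, SR)
    wave = wave.mean(dim=0, keepdim=True)            # mono
    if wave.shape[1] > TARGET_LEN:                   # truncate long files
        wave = wave[:, :TARGET_LEN]
    else:                                            # zero-pad short files
        wave = torch.nn.functional.pad(wave, (0, TARGET_LEN - wave.shape[1]))
    return to_db(mel_transform(wave))                # shape: (1, 128, frames)
```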


FIGURE 2. Mel spectrograms of speech samples for female and male speakers, respectively.

F. EVALUATION METRICS
Different evaluation metrics are needed to measure the proposed model's performance for classification and regression. Accuracy, sensitivity, specificity, and F1 score have been used to measure the performance of the classification tasks. Accuracy (Eq. 2) measures the proportion of correctly classified instances, both true positives (TP) and true negatives (TN), over the total number of instances:

Accuracy (Acc) = (TP + TN) / (TP + FP + TN + FN)    (2)

Sensitivity (Eq. 3) refers to how well a trained model can detect the presence of a disease or condition, in our case dysphonia. A high-sensitivity model will accurately identify most individuals with the disorder:

Sensitivity (Sen) = TP / (TP + FN)    (3)

Specificity (Eq. 4) is the ability of the model to correctly classify an individual who does not have the disease or condition:

Specificity (Spec) = TN / (TN + FP)    (4)

The F1 score (Eq. 5) is the harmonic mean of precision and recall and provides a single score that balances both metrics. Precision is the proportion of true positive predictions among all positive predictions, while recall is the proportion of true positive predictions among all actual positive cases:

F1 score (F1) = 2 · (Precision · Recall) / (Precision + Recall)    (5)

where precision and recall are calculated as follows:

Precision = TP / (TP + FP)    (6)

Recall = TP / (TP + FN)    (7)

Root Mean Square Error (RMSE) and the Pearson correlation coefficient have been used to measure how well our models perform in the case of regression. RMSE (Eq. 8) measures the average difference between the predicted and actual severity levels:

RMSE = sqrt( (1/n) · Σ_{i=1..n} (y_i − ŷ_i)² )    (8)

where y_i represents the actual severity of the i-th sample and ŷ_i represents its predicted severity. The formula computes the square root of the average of the squared differences between the actual and predicted values.
The Pearson correlation coefficient (Eq. 9) measures the strength and direction of the linear relationship between the predicted and actual severity values:

r (PearCorr) = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_{i=1..n} (x_i − x̄)²) · sqrt(Σ_{i=1..n} (y_i − ȳ)²) )    (9)

where x_i and y_i are the actual and predicted severity scores of the i-th sample, x̄ and ȳ are the sample means of the two variables, and n is the total number of samples in the test set.
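For completeness, Eqs. (8) and (9) can be computed directly on the test-set predictions, for example as in the short NumPy sketch below.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (8)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (9)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))
```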
III. RESULTS
The results reported in this section are the average of five results, each corresponding to the models saved in one fold.

A. BINARY CLASSIFICATION AND REGRESSION (MTL VS STL)

TABLE 2. Classification and regression using MTL and STL (MTL/STL) for 5-fold cross-validation.

The results presented in Table 2 demonstrate the performance of the models in three different experiments, namely joint classification and severity estimation using MTL, and classification and regression using STL. The bold values indicate the better performance between MTL and STL, while the bold and underlined values indicate the best performance across all models (both MTL and STL). The values shown in each column correspond to MTL and STL, respectively. Overall, MTL exhibits better performance than STL across various metrics: MTL consistently achieves higher accuracy, F1 score, sensitivity, and specificity in all models except EfficientNet. Regarding sensitivity, which is the ability of the models to correctly identify patients with dysphonia, ConvNeXt with MTL achieves slightly above 96%, outperforming all other models.
A detailed analysis of the results reveals slight variations between the MTL and STL approaches in terms of severity estimation performance. While there are instances where MTL achieves a slightly lower RMSE than STL, it is essential to note that these differences are not substantial. For example, in the case of ResNet, the MTL model achieves an RMSE of 0.547 in the best configuration, while the STL model achieves an RMSE of 0.555. Similarly, the best DenseNet model yields an RMSE of 0.622 for MTL and 0.631 for STL. These differences, although present, are minimal and may not significantly impact the overall regression performance.


FIGURE 3. Severity estimation of MTL and STL, averaged over 5-fold cross-validation.

FIGURE 4. Binary classification accuracy according to different beta values.

The MobileNet model stands out with the lowest RMSE of 0.546 when using STL, indicating its better performance in terms of severity estimation. However, it is noteworthy that the best Pearson correlation value of 0.885 was obtained by the MobileNet model using MTL. In some scenarios, such as the MobileNet model, the STL approach demonstrates a lower RMSE of 0.547 in the final configuration compared to MTL's RMSE of 0.552. Similarly, the STL model of EfficientNet achieves an RMSE of 0.807 in its best configuration, slightly outperforming MTL's RMSE of 0.819.
The most significant difference between MTL and STL regarding RMSE is observed in the DenseNet model, where STL achieves a nearly 0.154 lower RMSE in the final model. However, in the best model configuration, MTL performs better, with an almost 0.009 lower RMSE compared to STL. Figure 3 shows the scatter plot of the average severity predictions of MobileNet by MTL and STL, respectively.
In Figure 3, the x-axis represents the actual severity values in the test set, while the y-axis corresponds to the scores predicted by the MTL and STL architectures. The difference between the MTL and STL predicted severity scores can be seen in the scatter plot, especially in the cases of SL 2 and SL 3.
These findings highlight that while there are minor variations in the RMSE values between MTL and STL for severity estimation, the overall differences are not significant. This indicates that both approaches can effectively address the regression task, with only slight performance variations observed in specific cases.

B. SEX IN BINARY CLASSIFICATION
In this experiment, the sex prediction task was incorporated as an auxiliary task in the MTL framework for classifying between dysphonic and healthy speech. The emphasis on predicting sex was controlled by varying the value of β2 in the loss function. By starting with a value of 0, which corresponds to STL (only focusing on distinguishing between dysphonic and healthy speech), and incrementally increasing β2 until reaching 1, we determined the degree of importance placed on sex prediction by the neural networks. This allowed us to analyze the impact of incorporating sex as an auxiliary task on the overall performance of the neural networks. Figure 4 presents the performance analysis of the five models using the MTL and STL approaches. The lines in the graph represent individual models, while the x-axis shows the variation in beta values, representing the weight assigned to the auxiliary task. The trends of the lines differ according to the beta values. The line charts reveal that integrating sex prediction as an auxiliary task in binary classification yields improved accuracy. MobileNet with MTL achieved the highest accuracy among all the models, particularly when the beta value was set to 1. Notably, the DenseNet model exhibited a significant disparity between STL and MTL, with MTL achieving an accuracy of 81.8% compared to STL's 72.36% in binary classification. Moreover, the performance of ConvNeXt remains stable for beta values between 0.5 and 0.7, surpassing the single-task model by nearly 4%.
However, it is important to note that there are instances where STL performs better than MTL, as evidenced by the downward trend in performance for the ResNet model when the beta value exceeds 0.6, and for EfficientNet when beta is equal to 0.5.
Table 3 presents the comprehensive results for the MobileNet model, considered the top-performing model among the models evaluated in both cases. Due to the large number of results, we have chosen to show the outcomes for this specific model only. The table illustrates that the model performs better when trained using MTL compared to STL, with an improvement of just over 2%.

C. SEX IN MULTICLASS CLASSIFICATION
The models were trained to classify speech samples into healthy, organic, and functional dysphonia in a multiclass classification experiment. To enhance the classification performance, we incorporated sex prediction as an auxiliary task. By considering sex as an additional aspect, the models learn to extract features related to both the speech disorder and the sex of the speaker, leading to potentially improved classification accuracy. Similar to binary classification, we explored the effects of different values of β2. This allowed us to assess the influence of sex information on the overall performance of the models in classifying the different types of dysphonia. Figure 5 presents the trends of classification accuracy across different beta values, representing the average performance of the five models.


TABLE 3. Binary classification for MobileNet with different beta values in 5-fold cross-validation.

FIGURE 5. Multiclass classification accuracy according to different beta values.

FIGURE 6. Confusion matrices of MobileNet in MTL and STL.

The line chart reveals that the MobileNet model achieved the highest accuracy of 77.53% when beta was set to 0.8. Comparatively, the best accuracy attained using STL was 73.26%, indicating an improvement of more than 4% when employing MTL. In the case of ConvNeXt and ResNet, MTL outperforms STL at most beta values.
Nevertheless, the case of EfficientNet is interesting: regardless of the beta value used, the accuracy achieved through MTL was consistently lower than with STL. This finding suggests that, for the EfficientNet model in particular, incorporating the sex prediction task as an auxiliary task did not contribute to improved performance in the multiclass case. By analyzing the results across various beta values, we gained insights into the significance of sex in the context of multiclass dysphonia classification. Table 4 shows the detailed metrics of the best-performing model in both STL and MTL.
As shown in Table 4, the MTL approach with a beta value of 0.8 consistently outperforms its counterpart in all metrics, for both the final and the early-stopping models. MTL achieved improvements of more than 5% in both sensitivity and F1 score compared to the STL model.
The confusion matrices in Figures 6A and 6B represent the performance of the classification model in classifying instances into three classes, HC (healthy control), OD (organic dysphonia), and FD (functional dysphonia), in STL and MTL, respectively. Each column represents the actual class, and each row represents the predicted class. The values in the matrix indicate the average number of instances over the five folds classified into each class. The diagonal values correspond to the number of correct predictions, and summing the values of each column gives the number of samples of that class in the test set.
From the analysis of the confusion matrices, it is evident that the MTL approach outperforms STL in the classification of dysphonia categories. It shows higher accuracy in correctly predicting instances across all three classes, with a notable improvement in distinguishing functional dysphonia. Specifically, MTL accurately classifies 11.4 instances of functional dysphonia, compared to only nine correct predictions by STL out of 22 samples in total. This highlights the effectiveness of incorporating sex prediction as an auxiliary task in the MTL approach for better classification accuracy. Furthermore, MTL produces fewer incorrect classifications between organic and functional dysphonia, indicating its superior ability to differentiate between these two categories.


TABLE 4. Accuracy of MobileNet in 5-fold multiclass classification.

IV. DISCUSSION
Although distinguishing normal from dysphonic voice and estimating the severity of dysphonia have long been studied in the literature, existing methods focus on one task at a time. The objective of this study is to investigate the performance of MTL compared to STL in dysphonia classification and severity estimation tasks. Our findings shed light on the benefits and limitations of incorporating MTL approaches in these tasks.
In the first experiment, we aimed to achieve two objectives together: classifying normal versus dysphonic speech and estimating the severity score. We consider both tasks to be equally important: predicting the severity scores and differentiating between healthy speech and dysphonia provide practical insights into the patient's condition. This comprehensive approach enables clinicians to understand the patient's situation better and to effectively monitor their progress through the treatment stages and the progression of the condition. Results from joint learning indicate that MTL models demonstrated better performance across various evaluation metrics, including an accuracy of 91.69%, an F1 score of 93.20%, and a specificity of 93.57% in the case of MobileNet, and an F1 score of 93.3% with ConvNeXt. The improvement in these metrics suggests that jointly considering the classification and regression tasks using MTL facilitates more effective dysphonia classification. The ability of MTL models to leverage shared representations and learn task-specific features simultaneously leads to their enhanced performance.
Regarding severity estimation, both the MTL and STL approaches effectively tackled the regression task. The minor variations observed in RMSE indicate that both approaches demonstrate similar accuracy in estimating the severity of the condition. The occurrence of negative transfer, where the performance of one task negatively impacts another in multitask learning, was not significant. While the difference between STL and MTL is insignificant, the best-performing model, MobileNet, showed a slight advantage of 0.005 RMSE for STL. However, MTL outperformed STL by 0.006 in terms of the Pearson correlation metric. This finding suggests that choosing MTL may not impact the regression performance compared to STL.
When using sex prediction as an auxiliary task, our results indicated that integrating this additional task improves the accuracy of dysphonia classification in both the binary and multiclass scenarios. The weight assigned to the auxiliary task played an essential role in determining the extent of this improvement. The line charts demonstrated clear trends, with certain models achieving their highest accuracy at specific beta values. In the classification of healthy vs. dysphonic speech, the best performance among all computer vision models was achieved by MobileNet, with an accuracy of 92.58% and an F1 score of 94.12%. Moreover, in the multiclass classification experiments, MTL outperforms STL, achieving 77.53% accuracy compared to 73.26% for STL. MTL demonstrates improved ability to differentiate between organic and functional dysphonia, potentially aiding clinicians in more accurately classifying and managing specific types of voice disorders using non-invasive and cost-effective methods. However, compared to binary classification, both the STL and MTL approaches face challenges in distinguishing between organic and functional dysphonia in general, which suggests that these disorders affect speech similarly. By examining the confusion matrix of the MTL approach, shown in Figure 6B, it becomes evident that MTL has better capabilities for differentiating between these two classes. More research needs to be performed to understand the reason behind this confusion between the two types of dysphonia. By learning from multiple tasks, models can effectively learn shared information and extract robust features related to dysphonia evaluation. This shared representation learning improves the model's ability to generalize to unseen data and enhances its performance in real-world scenarios. MTL also offers advantages regarding computational cost and training time: rather than training two separate models, one for each specific task, MTL allows both tasks to be performed within the same framework. These findings suggest that including sex prediction as an auxiliary task provides valuable contextual information that complements the dysphonia classification task.


In the clinical practice of dysphonia classification and severity estimation, specialists rely on standards like GRBAS, CAPE-V, RBH, and other diagnostic frameworks. These existing standards serve as a benchmark for clinicians in assessing voice disorders. The advantage of the present method is that it provides clinicians with novel, computational, and cost-effective methodologies alongside these standards. MTL offers valuable insights to clinicians in tandem with their existing diagnostic approaches and experience. Furthermore, the methods described here can serve as pre-screening in pre-diagnosis stages, such as a general practitioner examination, or can even be used at home on mobile devices. This could shed light on a possible pathological condition, thus bringing forward the meeting with an expert and facilitating correct treatment. Ultimately, the integration of computational methodologies into dysphonia assessment should complement, rather than replace, clinical practice, enabling a collaborative approach that leverages both technological advancements and clinical expertise.

V. CONCLUSION
In this paper, we presented an approach that combines transfer learning and multitask learning for the binary classification and severity estimation of dysphonia, as well as for multiclass classification. We fine-tuned five computer vision deep learning architectures on our dataset after modifying their architecture to accommodate multitask learning. A significant advantage of these deep learning models is their ability to learn feature representations without the manual feature engineering often required in traditional ML methods. Our experimental results revealed the benefits of multitask learning in the joint classification and severity estimation of dysphonia. Compared to their single-task learning counterparts, the multitask learning models demonstrated more promising performance in distinguishing between healthy and dysphonic speech while maintaining a comparable level of accuracy in severity estimation. This demonstrates the effectiveness of leveraging shared knowledge and interdependencies between tasks to enhance overall performance. Moreover, we found that multitask learning facilitated better feature representation learning, enabling the models to discriminate between organic and functional dysphonia more effectively. This improved capability to distinguish between the two types of dysphonia highlights the advantage of using the speakers' sex as an auxiliary task in MTL.
In conclusion, our proposed approach demonstrates promising results in dysphonia classification and severity estimation. By leveraging deep learning architectures and exploiting the interdependencies between tasks, we achieve enhanced performance and contribute to a better understanding of dysphonia-related factors. Future research will expand the range of additional tasks to further improve the performance of multitask learning. Additionally, evaluating the generalizing ability of our approach on larger, more diverse, and multilingual datasets would provide valuable insights. These efforts will contribute to advancing the field of dysphonia assessment and its clinical applications.

REFERENCES
[1] P. Belin, S. Fecteau, and C. Bédard, ''Thinking the voice: Neural correlates of voice perception,'' Trends Cognit. Sci., vol. 8, no. 3, pp. 129–135, Mar. 2004.
[2] G. Szatloczki, I. Hoffmann, V. Vincze, J. Kalman, and M. Pakaski, ''Speaking in Alzheimer's disease, is that an early sign? Importance of changes in language abilities in Alzheimer's disease,'' Frontiers Aging Neurosci., vol. 7, p. 195, Oct. 2015.
[3] A. S. Cohen, J. E. McGovern, T. J. Dinzeo, and M. A. Covington, ''Speech deficits in serious mental illness: A cognitive resource issue?'' Schizophrenia Res., vol. 160, nos. 1–3, pp. 173–179, Dec. 2014.
[4] A. S. Cohen and B. Elvevåg, ''Automated computerized analysis of speech in psychiatric disorders,'' Current Opinion Psychiatry, vol. 27, no. 3, pp. 203–209, 2014.
[5] M. L. Poole, A. Brodtmann, D. Darby, and A. P. Vogel, ''Motor speech phenotypes of frontotemporal dementia, primary progressive aphasia, and progressive apraxia of speech,'' J. Speech, Lang., Hearing Res., vol. 60, no. 4, pp. 897–911, Apr. 2017.
[6] A. Suppa et al., ''Voice in Parkinson's disease: A machine learning study,'' Frontiers Neurol., vol. 13, Feb. 2022, Art. no. 831428.
[7] B. Hajduska-Dér, G. Kiss, D. Sztahó, K. Vicsi, and L. Simon, ''The applicability of the Beck depression inventory and Hamilton depression scale in the automatic recognition of depression based on speech signal processing,'' Frontiers Psychiatry, vol. 13, p. 1767, Aug. 2022.
[8] M. M. Johns, R. T. Sataloff, A. L. Merati, and C. A. Rosen, ''Article commentary: Shortfalls of the American academy of otolaryngology—Head and neck surgery's clinical practice guideline: Hoarseness (dysphonia),'' Otolaryngol.-Head Neck Surg., vol. 143, no. 2, pp. 175–177, Aug. 2010.
[9] E. Nerrière, M.-N. Vercambre, F. Gilbert, and V. Kovess-Masféty, ''Voice disorders and mental health in teachers: A cross-sectional nationwide study,'' BMC Public Health, vol. 9, no. 1, pp. 1–8, Dec. 2009.
[10] R. J. Stachler et al., ''Clinical practice guideline: Hoarseness (dysphonia) (update),'' Otolaryngol.-Head Neck Surg., vol. 158, no. S1, pp. S1–S42, Mar. 2018.
[11] L. Crevier-Buchman, T. Ch, A. Sauvignet, S. Brihaye-Arpin, and M.-C. Monfrais-Pfauwadel, ''Diagnosis of non-organic dysphonia in adult,'' Revue Laryngologie-Otologie-Rhinologie, vol. 126, no. 5, pp. 353–360, 2005.
[12] A. Millar, I. J. Deary, J. A. Wilson, and K. MacKenzie, ''Is an organic/functional distinction psychologically meaningful in patients with dysphonia?'' J. Psychosomatic Res., vol. 46, no. 6, pp. 497–505, Jun. 1999.
[13] A. E. Aronson, Clinical Voice Disorders: An Interdisciplinary Approach. New York, NY, USA: B. C. Decker, 1990.
[14] N. P. Connor and D. M. Bless, Functional and Organic Voice Disorders (Cambridge Handbooks in Language and Linguistics). Cambridge, U.K.: Cambridge Univ. Press, 2013.
[15] C. L. Payten, G. Chiapello, K. A. Weir, and C. J. Madill, ''Frameworks, terminology and definitions used for the classification of voice disorders: A scoping review,'' J. Voice, vol. 2022, p. 89, Mar. 2022.
[16] C. Sapienza and B. Hoffman, Voice Disorders. San Diego, CA, USA: Plural Publishing, Inc., 2020.
[17] J. P. Teixeira, C. Oliveira, and C. Lopes, ''Vocal acoustic analysis—Jitter, shimmer and HNR parameters,'' Proc. Technol., vol. 9, pp. 1112–1122, Jan. 2013.
[18] Z. Kh. Abdul and A. K. Al-Talabani, ''Mel frequency cepstral coefficient and its applications: A review,'' IEEE Access, vol. 10, pp. 122136–122158, 2022.
[19] S. Jothilakshmi, ''Automatic system to detect the type of voice pathology,'' Appl. Soft Comput., vol. 21, pp. 244–249, Aug. 2014.
[20] J. Reid, P. Parmar, T. Lund, D. K. Aalto, and C. C. Jeffery, ''Development of a machine-learning based voice disorder screening tool,'' Amer. J. Otolaryngol., vol. 43, no. 2, Mar. 2022, Art. no. 103327.
[21] D. R. A. Leite, R. M. de Moraes, and L. W. Lopes, ''Different performances of machine learning models to classify dysphonic and non-dysphonic voices,'' J. Voice, vol. 2022, pp. 1–10, Dec. 2022.


[22] L. Verde, G. De Pietro, and G. Sannino, ‘‘Voice disorder identifi- [38] D. Sztahó, K. Gábor, and T. Gábriel, ‘‘Deep learning solution for patholog-
cation by using machine learning techniques,’’ IEEE Access, vol. 6, ical voice detection using LSTM-based autoencoder hybrid with multi-task
pp. 16246–16255, 2018. learning,’’ in Proc. 14th Int. Joint Conf. Biomed. Eng. Syst. Technol., 2021,
[23] Z. Dankovičová, D. Sovák, P. Drotár, and L. Vokorokos, ‘‘Machine learning pp. 135–141.
approach to dysphonia detection,’’ Appl. Sci., vol. 8, no. 10, p. 1927, [39] M. G. Tulics, L. J. Lavati, K. Mészáros, and K. Vicsi, ‘‘Possibilities for
Oct. 2018. the automatic classification of functional and organic dysphonia,’’ in Proc.
[24] P. Harar, Z. Galaz, J. B. Alonso-Hernandez, J. Mekyska, R. Burget, Int. Conf. Speech Technol. Hum.-Comput. Dialogue (SpeD), Oct. 2019,
and Z. Smekal, ‘‘Towards robust voice pathology detection: Investigation pp. 1–6.
of supervised deep learning, gradient boosting, and anomaly detection [40] C. Robotti et al., ‘‘Treatment of relapsing functional and organic dyspho-
approaches across four databases,’’ Neural Comput. Appl., vol. 32, no. 20, nia: A narrative literature review,’’ Acta Otorhinolaryngol. Ital., vol. 43,
pp. 15747–15757, Oct. 2020. p. S84, Apr. 2023.
[25] S. N. Awan and N. Roy, ‘‘Toward the development of an objective index [41] A. Paszke et al., ‘‘PyTorch: An imperative style, high-performance deep
of dysphonia severity: A four-factor acoustic model,’’ Clin. Linguistics learning library,’’ in Proc. Annu. Conf. Neural Inf. Process. Syst., 2019,
Phonetics, vol. 20, no. 1, pp. 35–49, Jan. 2006. pp. 8024–8035.
[26] M. G. Tulics and K. Vicsi, ‘‘The automatic assessment of the severity of [42] R. Schönweiler, M. Hess, P. Wübbelt, and M. Ptok, ‘‘Novel approach to
dysphonia,’’ Int. J. Speech Technol., vol. 22, no. 2, pp. 341–350, Jun. 2019. acoustical voice analysis using artificial neural networks,’’ J. Assoc. Res.
[27] M. Wu and L. Chen, ‘‘Image recognition based on deep learning,’’ in Proc. Otolaryngol., vol. 1, pp. 270–282, Jan. 2000.
Chin. Autom. Congr. (CAC), 2015, pp. 542–546. [43] M. Ptok, C. Schwemmle, C. Iven, M. Jessen, and T. Nawka, ‘‘[On
[28] A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavaf, and E. A. Fox, ‘‘Natural the auditory evaluation of voice quality],’’ HNO, vol. 54, pp. 793–802,
language processing advancements by deep learning: A survey,’’ 2020, Oct. 2006.
arXiv:2003.01200. [44] F. Pedregosa et al., ‘‘Scikit-learn: Machine learning in Python,’’ J. Mach.
[29] U. Kamath, J. Liu, and J. Whitaker, Deep Learning for NLP and Speech Learn. Res., vol. 12 no. 10, pp. 2825–2830, 2012.
Recognition, vol. 84. Cham, Switzerland: Springer, 2019. [45] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image
[30] J.-N. Lee and J.-Y. Lee, ‘‘An efficient SMOTE-based deep learning recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
model for voice pathology detection,’’ Appl. Sci., vol. 13, no. 6, p. 3571, Jun. 2016, pp. 770–778.
Mar. 2023. [46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, ‘‘Densely
[31] R. Islam and M. Tarique, ‘‘A novel convolutional neural network based connected convolutional networks,’’ in Proc. IEEE Conf. Comput. Vis.
dysphonic voice detection algorithm using chromagram,’’ Int. J. Electr. Pattern Recognit. (CVPR), Jul. 2017, pp. 2261–2269.
Comput. Eng., vol. 12, no. 5, p. 5511, Oct. 2022. [47] A. G. Howard et al., ‘‘MobileNets: Efficient convolutional neural networks
[32] A. Ksibi, N. A. Hakami, N. Alturki, M. M. Asiri, M. Zakariah, and for mobile vision applications,’’ 2017, arXiv:1704.04861.
M. Ayadi, ‘‘Voice pathology detection using a two-level classifier based on [48] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie,
combined CNN–RNN architecture,’’ Sustainability, vol. 15, no. 4, p. 3204, ‘‘A ConvNet for the 2020s,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern
Feb. 2023. Recognit. (CVPR), Jun. 2022, pp. 11966–11976.
[33] H. Wu, J. Soraghan, A. Lowit, and G. Di-Caterina, ‘‘A deep learning [49] M. Tan and Q. Le, ‘‘EfficientNet: Rethinking model scaling for con-
method for pathological voice detection using convolutional deep belief volutional neural networks,’’ in Proc. Int. Conf. Mach. Learn., 2019,
networks,’’ in Proc. Interspeech, Sep. 2018. pp. 6105–6114.
[34] S. A. Syed, M. Rashid, S. Hussain, and H. Zahid, ‘‘Comparative analysis of [50] R. Caruana, ‘‘Multitask learning,’’ Mach. Learn., vol. 28, no. 1, pp. 41–75,
CNN and RNN for voice pathology detection,’’ BioMed Res. Int., vol. 2021, 1997.
pp. 1–8, Apr. 2021. [51] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee, ‘‘12-in-
[35] R. Islam, E. Abdel-Raheem, and M. Tarique, ‘‘Voice pathology detec- 1: Multi-task vision and language representation learning,’’ in Proc.
tion using convolutional neural networks with electroglottographic (EGG) IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020,
and speech signals,’’ Comput. Methods Programs Biomed. Update, pp. 10434–10443.
vol. 2, 2022, Art. no. 100074. [52] J. Worsham and J. Kalita, ‘‘Multi-task learning for natural language pro-
[36] A. N. Omeroglu, H. M. A. Mohammed, and E. A. Oral, ‘‘Multi-modal cessing in the 2020s: Where are we going?’’ Pattern Recognit. Lett.,
voice pathology detection architecture based on deep and handcrafted fea- vol. 136, pp. 120–126, Aug. 2020.
ture fusion,’’ Eng. Sci. Technol., Int. J., vol. 36, Dec. 2022, Art. no. 101148. [53] B. T. Atmaja, A. Sasou, and M. Akagi, ‘‘Speech emotion and natural-
[37] L. Chen and J. Chen, ‘‘Deep neural network for automatic classification of ness recognitions with multitask and single-task learnings,’’ IEEE Access,
pathological voice signals,’’ J. Voice, vol. 36, no. 2, pp. 288.e15–288.e24, vol. 10, pp. 72381–72387, 2022.
Mar. 2022.
