
Machine Learning for Voice and Speech Science
Anil Palaparthi
Technical Report · May 2023 · DOI: 10.13140/RG.2.2.32603.54562

Machine Learning:
Machine learning is a subfield of artificial intelligence that enables computers to learn patterns or models from data and improve with experience without being explicitly programmed [1]. In traditional programming, to perform a task, the programmer provides both the input data and the model (logic or algorithm) to the computer. The computer (program) then applies the input data to the model and obtains the output (Fig. 1a). In machine learning, by contrast, the goal is to develop the model (algorithm) that performs the task, given both the data and the expected outputs as inputs (Fig. 1b) [2].

Figure 1. (a) Traditional program. (b) Machine learning model training and inference.
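
To make the contrast in Fig. 1 concrete, the following minimal Python sketch shows a hand-coded rule next to a model learned from data and expected outputs with scikit-learn. The loudness threshold and the data are illustrative assumptions, not taken from the report.

from sklearn.linear_model import LogisticRegression

# Traditional programming: the programmer supplies the model (logic).
def is_loud_traditional(intensity_db):
    return intensity_db > 70.0  # hand-coded rule (illustrative threshold)

# Machine learning: the computer develops the model from data + outputs.
X = [[55.0], [62.0], [68.0], [73.0], [80.0], [88.0]]  # input data (dB)
y = [0, 0, 0, 1, 1, 1]                                # expected outputs (labels)
model = LogisticRegression().fit(X, y)                # model training
print(model.predict([[75.0]]))                        # inference on new data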

Types of Machine Learning:

Machine learning can be broadly subdivided into supervised learning, unsupervised learning, and reinforcement learning [3]. Supervised learning is the most common type of machine learning. It uses both inputs (data or features of the data) and outputs (often in the form of verbal descriptors, or labels) to learn the pattern between inputs and outputs as a model. The user then applies the learned model to new data to predict new outputs, which is reliable as long as the new inputs are similar to the inputs used to train the model (Fig. 1b). In voice and speech science research, supervised learning has been used predominantly for the automatic detection of disorders [4-6], improving the computational efficiency of simulators [7], and estimation of voice control parameters from acoustic output signals [8,9].
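
As a minimal illustration of this workflow, the sketch below learns a model from labeled examples and then evaluates it on held-out inputs drawn from the same distribution; the synthetic features stand in for measured voice features and are an assumption for illustration only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))            # inputs: 4 features per sample
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # outputs: labels to be learned

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = KNeighborsClassifier().fit(X_train, y_train)      # learn the pattern
print("held-out accuracy:", model.score(X_test, y_test))  # new, similar inputs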

Automatic detection of disorders:

Continuous, long-term monitoring of patients is now possible with the use of the cloud and the Internet of Things (IoT) in healthcare. At the same time, it is easy to acquire voice and speech samples from patients using IoT devices such as mobile phones. Prior research has shown that some disorders can be detected from voice and speech signals [4-6]. Machine learning techniques can therefore be applied to features of speech signals for the automatic detection of disorders.
To automatically detect voice disorders with machine learning models, experts first label the speech samples as either normal or belonging to a particular disorder. Supervised learning algorithms then learn the relation between the input samples and their corresponding labels. The trained algorithms then predict whether a new sample reflects normal or disordered phonation. Such supervised learning is being used to detect laryngeal cancer, dysphonia, vocal fold nodules, polyps, edema, vocal fold paralysis, and neuromuscular disorders from voice and speech samples. Supervised learning is also being used to objectively rate the GRBAS (grade, roughness, breathiness, asthenia, and strain) voice quality features [10].
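
A hedged sketch of this labeling-and-training pipeline follows; the file names are hypothetical, and the mean-MFCC features and support vector classifier are illustrative choices rather than the specific methods of the cited studies.

import numpy as np
import librosa
from sklearn.svm import SVC

def mfcc_features(path, target_sr=16000):
    signal, sr = librosa.load(path, sr=target_sr)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)  # one 13-dimensional vector per recording

# Hypothetical expert-labeled corpus: 0 = normal, 1 = disordered phonation.
paths  = ["normal_01.wav", "normal_02.wav", "polyp_01.wav", "nodule_01.wav"]
labels = [0, 0, 1, 1]

X = np.array([mfcc_features(p) for p in paths])
clf = SVC(kernel="rbf").fit(X, labels)

# Predict whether a new, unlabeled sample is normal or disordered.
print(clf.predict([mfcc_features("new_patient.wav")]))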

Improving computational efficiency of simulators

Voice and speech simulators are widely used for understanding the physiology of voice production, validating therapies, and predicting and optimizing surgical interventions [11]. Accurate simulation of voice production for clinical purposes requires patient-specific geometries, anisotropic material properties, and the solution of complex 3D fluid-structure interactions [7]. Solving these fluid-structure interactions in patient-specific geometry is computationally expensive, taking supercomputers multiple weeks to produce one second of speech output. This high computational cost has prevented widespread clinical use of voice simulators. Faster machine learning models, trained on accurate flow and pressure data obtained by solving the Navier-Stokes equations, can replace the traditional computational methods while providing the necessary speed and accuracy. Machine learning models are also being used to estimate poorly known physiological control parameters of the vocal system, such as vocal fold geometry, stiffness, and subglottal pressure, from the produced acoustics, aerodynamics, and vocal fold vibration. Such estimation can provide quantitative information to clinicians for better diagnosis of voice disorders [7,8].
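
One way such a surrogate could look is sketched below: a small neural network regressor trained on input-output pairs that, in practice, would come from high-fidelity Navier-Stokes solutions. The two inputs, the analytic stand-in for the solver output, and the network size are all illustrative assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Inputs: e.g., [glottal area (cm^2), subglottal pressure (Pa)].
X = rng.uniform([0.01, 200.0], [0.2, 1500.0], size=(5000, 2))
flow = 0.8 * X[:, 0] * np.sqrt(X[:, 1])  # stand-in for precomputed solver output

surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         random_state=0).fit(X, flow)
# Millisecond-scale prediction where a full 3D solve takes weeks.
print(surrogate.predict([[0.1, 800.0]]))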

Unsupervised Learning:
Supervised learning methods require data with accurate labels (verbal descriptors) or outputs for good performance. However, generating accurate labels is not easy: it requires experts, and the labels can be highly subjective and prone to errors. Unsupervised learning, on the other hand, does not use labeled data. Instead, it automatically learns patterns in the input data and groups the data into multiple categories. Unsupervised learning is currently being used for disorder detection [12], emotion recognition [13], and voice quality detection using voice and speech samples.
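
The sketch below illustrates the idea with k-means clustering on synthetic, unlabeled feature vectors; the feature names and the two-group structure are assumptions for illustration, not taken from the cited studies.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two unlabeled groups of feature vectors (e.g., jitter, shimmer, HNR).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 3)),
               rng.normal(3.0, 0.5, size=(50, 3))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:5], kmeans.labels_[-5:])  # categories found without labels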

Reinforcement Learning:
In reinforcement learning, training data are not needed ahead of time [14]. The model interacts with the physical plant (the vocal system) in a trial-and-error manner and learns to control it. This method can be used to learn the neural processes that control the vocal system. Such neural control systems try to mimic how the brain controls the vocal system. When used with voice simulators, they can provide insights into neuromuscular disorders such as vocal tremor, Parkinson's disease, and spasmodic dysphonia. The use of reinforcement learning is still in its infancy in voice and speech science research. Even though the DIVA model [15] and other neural controllers of the vocal system [16-18] do not use reinforcement learning in its true sense, they fall under its broader category. The controller (the reinforcement learning model) takes our vocal intentions (how high in pitch the voice should be, how loud it should be, how rough or periodic it should be, and what syllable to produce) as inputs and generates corresponding muscle activations as outputs. These muscle activations are provided as inputs to the vocal system, which then produces phonation matching the desired vocal intentions. The auditory and somatosensory feedback from the vocal system is used to train the reinforcement learning model after every interaction. This continuous interaction through feedback allows the model to improve over time (Fig. 2).

Figure 2. Neural control system for the vocal system.
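
As a toy illustration of the feedback loop in Fig. 2, the sketch below uses a simple trial-and-error update, a primitive stand-in for a full reinforcement learning algorithm: the controller perturbs a muscle activation, phonates through a stand-in pitch model, and keeps the change when the auditory pitch error shrinks. The linear plant and the update rule are illustrative assumptions.

import numpy as np

def vocal_system(activation):
    # Stand-in plant: pitch rises with cricothyroid-like activation.
    return 100.0 + 200.0 * activation  # fundamental frequency in Hz

target_f0 = 220.0   # vocal intention: desired pitch
activation = 0.1    # initial muscle activation
rng = np.random.default_rng(3)

for step in range(200):
    trial = activation + rng.normal(0.0, 0.05)      # explore by trial and error
    reward = -abs(vocal_system(trial) - target_f0)  # auditory feedback as reward
    current = -abs(vocal_system(activation) - target_f0)
    if reward > current:                            # keep improvements only
        activation = trial

print(round(vocal_system(activation), 1), "Hz after learning")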

Limitations of Machine Learning:

Machine learning models are only as good as their data. Their accuracy relies on the quality and quantity of the data available for training, and their performance degrades if the data are incomplete, biased, or contain errors. They can only learn the patterns present in the input data and lack the ability to extrapolate or improvise that is typically seen in humans. For example, if new data fall outside the ranges of the training data, machine learning models will have a hard time predicting the correct output, even for the simplest of problems. Some machine learning models, such as deep neural networks, can be highly complex, and their results may be difficult to explain. This is a limitation especially in the healthcare sector, where the ability to explain how a decision was made is important [19].
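
The extrapolation failure mentioned above can be shown in a few lines; the random forest model and the y = 2x rule are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2.0 * X.ravel()  # the simplest of problems: y = 2x

model = RandomForestRegressor(random_state=0).fit(X, y)
print(model.predict([[5.0]]))    # ~10: correct, inside the training range
print(model.predict([[100.0]]))  # ~20, not 200: fails outside the training range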

Conclusion:
Machine learning approaches are being used in a wide range of applications, including voice and speech science. These approaches will become better and more powerful in the future with larger databases, more accurate modeling, and wider distribution of software, allowing researchers and practitioners to take better advantage of them. The profound question is: will human intelligence and human learning be advanced by artificial intelligence, or will a trust in machine learning diminish a deeper understanding of the communication sciences and disorders?

References
1. Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of Machine Learning. MIT Press, pp. 1-7, 2018.
2. Turner, R. Machine Learning: The ultimate beginner's guide to learn machine learning, artificial intelligence and neural networks step by step. Publishing Factory LLC, 2020.
3. Ayodele, T.O. Types of machine learning algorithms. New Advances in Machine Learning, vol. 3, pp. 19-48, 2010.
4. Al-Dhief, F.T., Latiff, N.M.A., Malik, N.N.N.A., Salim, S.N., Baki, M.M., Albadr, M.A.A., et al. A survey of voice pathology surveillance systems based on Internet of Things and machine learning algorithms. IEEE Access, vol. 8, pp. 64514-64533, 2020.
5. Verde, L., De Pietro, G., and Sannino, G. Voice disorder identification by using machine learning techniques. IEEE Access, vol. 6, pp. 16246-16255, 2018.
6. Hegde, S., Shetty, S., Rai, S., and Dodderi, T. A survey of machine learning approaches for automatic detection of voice disorders. Journal of Voice, vol. 33(6), pp. 947.e11-947.e33, 2019.
7. Zhang, Y., Zheng, X., and Xue, Q. A deep neural network based glottal flow model for predicting fluid-structure interactions during voice production. Applied Sciences (Basel), vol. 10(2): 705, 2020.
8. Zhang, Z. Voice feature selection to improve performance of machine learning models for voice production inversion. Journal of Voice, 2021.
9. Zhang, Z. Estimation of vocal fold physiology from voice acoustics using machine learning. The Journal of the Acoustical Society of America, vol. 147(3), pp. EL264-EL270, 2020.
10. Kojima, T., Fujimura, S., Hasebe, K., Okanoue, Y., Shuya, O., Yuki, R., et al. Objective assessment of pathological voice using artificial intelligence based on the GRBAS scale. Journal of Voice, 2021.
11. Titze, I.R. and Lucero, J.C. Voice simulation: The next generation. Applied Sciences, vol. 12(22): 11720, 2022.
12. Rueda, A. and Krishnan, S. Clustering Parkinson's and age-related voice impairment signal features for unsupervised learning. Advances in Data Science and Adaptive Analysis, vol. 10(2): 1840007, 2018.
13. Zhang, Z., Weninger, F., Wöllmer, M., and Schuller, B. Unsupervised learning in cross-corpus acoustic emotion recognition. IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 523-528, 2011.
14. Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 2018.
15. Guenther, F.H. Neural Control of Speech. Cambridge, MA: MIT Press, 2016.
16. Kröger, B.J., Kannampuzha, J., and Neuschaefer-Rube, C. Towards a neurocomputational model of speech production and perception. Speech Communication, vol. 51, pp. 793-808, 2009.
17. Hickok, G. Computational neuroanatomy of speech production. Nature Reviews Neuroscience, vol. 13(2), pp. 135-145, 2012.
18. Palaparthi, A. Computational motor learning and control of the vocal source for voice production. Ph.D. dissertation, University of Utah, Salt Lake City, UT, 2021.
19. Ribeiro, M.T., Singh, S., and Guestrin, C. "Why should I trust you?" Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135-1144, 2016.
