ISM_Report_Final
Supervisor:
Lipika Dey
Department of Computer Science
Ashoka University
Contents
1 Introduction 1
1.1 Purpose of the Report . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Understanding Text-to-Speech Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
9 Duration Modelling 22
9.1 Duration Model for Resource-Scarce Languages . . . . . . . . . . . . . . . . . . . . . . . . 23
9.1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
9.1.2 Training Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
9.1.3 Contextual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
9.1.4 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
9.1.5 Features of the Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
9.1.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
9.2 Duration modeling using DNN for Arabic speech synthesis . . . . . . . . . . . . . . . . . . 26
9.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
9.2.2 Different phoneme classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
9.2.3 Architecture leading to the best accuracy . . . . . . . . . . . . . . . . . . . . . . . 27
9.2.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
1 Introduction
1.1 Purpose of the Report
The purpose of this report is to delve into the realm of text-to-speech (TTS) technology, which is the process
through which computers convert written text into spoken words. It is essentially an assistive technology
that reads digital text aloud, and it has a wide range of applications, from aiding individuals with visual
impairments to enhancing accessibility in digital interfaces and facilitating natural interactions with
technology through voice-enabled systems. By comprehensively exploring TTS, we aim to provide insights into
its underlying mechanisms, system architectures, and the different types of TTS models in use. Through
this exploration, readers will gain a deeper understanding of the components of a TTS system, as well as the
potential impact of TTS across diverse fields such as education and assistive technology.
This report also provides an overview of two Bengali TTS systems: Subachan and a DNN-based SPSS system.
1.2 Understanding Text-to-Speech Modeling
Text-to-speech (TTS), or speech synthesis, is the process by which input in the form of digital text is converted
into an audio signal. The primary goal of modern-day TTS systems is not only to generate audio that accurately
reads the input text aloud, but also to make it sound natural. A typical modern-day TTS system
consists of three main components [1]:
• A text analysis module
• An acoustic model
• A vocoder
These three components work together to convert the input text into the desired audio: the
text analysis component extracts linguistic features from the text, the acoustic model predicts acoustic
features from those linguistic features, and the vocoder reconstructs the final waveform. In some systems, the
acoustic model also incorporates a duration model. This process is shown in the figure below.
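As a complementary illustration of the same data flow, the pipeline can be sketched in a few lines of Python. The component functions below are hypothetical placeholders, not any specific library's API; they only show how the output of one stage becomes the input of the next.

# Minimal sketch of the three-stage TTS data flow; the function bodies are placeholders.
def text_analysis(text: str) -> list[dict]:
    """Extract linguistic features (phonemes, prosodic tags, ...) from the text."""
    ...

def acoustic_model(linguistic_features: list[dict]) -> list[list[float]]:
    """Predict frame-level acoustic features (e.g., mel-cepstra, F0)."""
    ...

def vocoder(acoustic_features: list[list[float]]) -> bytes:
    """Reconstruct the speech waveform from the acoustic features."""
    ...

def synthesize(text: str) -> bytes:
    linguistic = text_analysis(text)        # text -> linguistic features
    acoustic = acoustic_model(linguistic)   # linguistic features -> acoustic features
    return vocoder(acoustic)                # acoustic features -> waveform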
2 Components of a Text-to-Speech System
As mentioned, a typical modern-day TTS system contains three components: a text analysis module, an
acoustic model, and a vocoder.
• Mel-Frequency Cepstral Coefficients (MFCC): These capture the short-term power spectrum of the speech
signal and describe the phonemes in terms of the shape of the vocal tract.
• Fundamental Frequency (F0): F0 represents the rate at which the vocal folds vibrate during speech
production and determines the pitch of the voice.
So, the input to the acoustic modeling component is the set of linguistic features extracted by the text
analysis component, and the output is a set of acoustic features that represent the spectral and temporal
properties of the synthesized speech. These features serve as the input to the vocoder.
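As a rough illustration of how such acoustic features can be computed from a recorded waveform, the sketch below uses the open-source librosa library; the file name is a placeholder, and the use of librosa is an assumption for illustration only (the systems discussed later do not necessarily use it).

import librosa

# Load a speech recording (file path is hypothetical).
y, sr = librosa.load("sample_speech.wav", sr=None)

# MFCCs: one 13-dimensional vector per frame, summarizing the short-term
# spectral envelope (i.e., the shape of the vocal tract).
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# F0 contour estimated with the pYIN algorithm; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

print(mfcc.shape, f0.shape)  # (13, n_frames) and (n_frames,)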
2.3 Vocoder
Similar to acoustic models, vocoders are also classified into two types: traditional and neural-based. The
former take acoustic features as input, while neural-based vocoders accept both acoustic and linguistic fea-
tures [1]. The output given by the vocoder is the synthesized speech waveform, which can be played back
as audio or saved as a digital file.
3 Different types of TTS Models
A TTS system follows a sequential flow of data through the pipeline, where the output of one component
serves as the input to the next. While traditional TTS systems include three models, namely the frontend,
the acoustic model, and the vocoder (an example of such a system is Statistical Parametric Speech Synthesis
(SPSS)), some TTS models combine these components to produce the output directly. One of the most
successful TTS systems, Tacotron 2, combines text analysis and the acoustic model and directly predicts
acoustic features such as mel spectrograms. Another very popular system, WaveNet, combines the acoustic
model and the vocoder, directly transforming linguistic features into a waveform. Finally, fully end-to-end
models such as ClariNet combine all three blocks into one and directly convert characters into a waveform [1].
These different types of pipelines for TTS models are shown below:
3.1 WaveNet
WaveNet is a deep generative model that uses dilated causal convolutions to autoregressively generate raw
audio waveforms. It can be conditioned on additional inputs to control the style and characteristics of the
generated audio. WaveNet has demonstrated impressive performance in generating realistic and high-quality
audio for various applications [1].
3.2 Tacotron 2
Tacotron 2 is a sequence-to-sequence model that generates speech spectrograms from input text using an
attention mechanism and recurrent neural networks. It leverages the WaveNet vocoder to produce high-
quality speech waveforms from the generated spectrograms, resulting in natural and intelligible synthetic
speech.
4 Importance of Text-to-speech technology
One of the biggest applications of TTS software is in the educational sector as an assistive technology. It
serves as a very helpful tool for learners with disabilities, particularly those who are visually impaired.
The conversion of digital text into audio is a powerful medium for learning and helps learners retain
information more effectively. TTS software is commonly used as a compensatory tool (mainly at the
postsecondary level) and has helped students improve reading speed, fluency, and content retention, resulting
in increased student self-efficacy in reading and more independent learning [2]. It also helps students
learn the accurate and correct pronunciation of terms, something they may not be able to do just by reading
them. The technology gives students greater autonomy and independence in their learning: when educators
use TTS software as part of a comprehensive approach to instruction, it decreases the need for human
support and increases self-confidence, motivation, and access to the grade-level curriculum [2].
5 Bengali Text to Speech
Bengali, also known as Bangla, is a language indigenous to the Bengal region of South Asia. As of
2021, it boasts approximately 240 million native speakers, making it the sixth most spoken native language
worldwide and the seventh most spoken language overall [3]. In the era of expanding digital communication,
the necessity for a system capable of generating natural-sounding Bengali speech has become increasingly
apparent. Given its vast speaker base, the development of a Bengali Text-to-Speech (TTS) system holds
immense importance for fostering accessibility, communication, and inclusion among Bengali speakers from
diverse backgrounds. Although Bengali is considered a relatively under-resourced language with limited speech
applications, efforts to implement a TTS model for it have been underway. In this context, two distinct
approaches have been explored: one adopting a concatenative synthesis method via a TTS system named
Subachan, and another leveraging statistical parametric speech synthesis (SPSS) employing deep neural
networks.
6 Subachan
Subachan is a concatenative Bengali TTS system whose pipeline consists of four phases:
1. Normalization
2. Phonetic analysis
3. Prosodic analysis
4. Wave synthesis
Subachan, written entirely in the Java programming language, has been implemented using a diphone set
for Bengali TTS [5]. The text analysis part of the software, involving tokenization, token identification, and
phrase detection, is performed in the Normalization module. Pronunciation rules are included in the Phonetic
Analysis module, which uses grapheme-to-phoneme rules (in cases other than O-karanto, which is discussed
in detail below). The Prosodic Generator assigns duration values to individual phonemes, and finally the
speech synthesis is performed using concatenative techniques in the Waveform Synthesis phase [5]. Figure 3
shows the architecture of Subachan.
Figure 3: The Architecture of Subachan [5]
• Token splitting: Instead of using only white space and punctuation as delimiters, Subachan uses
deterministic finite automata (DFA) to split tokens; a simplified sketch of this idea is given after this list.
As an example, consider the phone number ৮৬১৭ ৯৮৫ ২০৭ (8617 985 207 in English). If tokenization
were based on white space alone, this would be identified as three tokens, but Subachan, using DFA,
successfully identifies it as a "Telephone number" [5].
• Token identification: Subachan deals with ambiguity in tokens by focusing on the correct identification
of non-standard words (NSWs). These NSWs include dates, times, fractions, telephone numbers,
etc. For example, the fraction ১/২ is pronounced as এক ভাগ দুই, and the telephone number ৯৮৮১-৭২-১৩০৪
is pronounced as নয় আট আট এক সাত দুই এক তিন শূন্য চার instead of being identified as a date due to its
format [5].
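Deterministic finite automata can be realised in practice with regular expressions, which compile to finite automata. The sketch below is a simplified illustration of DFA-style token splitting; the token classes and patterns are assumptions for illustration, not Subachan's actual Java implementation.

import re

# Simplified, illustrative patterns (not Subachan's actual rules).
# Bengali digits occupy the Unicode range \u09E6-\u09EF.
PATTERNS = [
    ("TELEPHONE", re.compile(r"[\u09E6-\u09EF]{3,4}([ -][\u09E6-\u09EF]{2,4}){1,3}")),
    ("FRACTION",  re.compile(r"[\u09E6-\u09EF]+/[\u09E6-\u09EF]+")),
    ("NUMBER",    re.compile(r"[\u09E6-\u09EF]+")),
    ("WORD",      re.compile(r"[^\s,;।?!]+")),
]

def tokenize(text: str):
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        for label, pattern in PATTERNS:
            m = pattern.match(text, pos)
            if m:
                tokens.append((label, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # skip characters no pattern covers
    return tokens

print(tokenize("৮৬১৭ ৯৮৫ ২০৭"))  # [('TELEPHONE', '৮৬১৭ ৯৮৫ ২০৭')], not three separate tokens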
O-karanto Problem
This problem is faced by words which, when pronounced, end with an অ sound but are not written
that way. Examples include করল, which without a dedicated pronunciation dictionary would have been
pronounced as kor-ol instead of the correct pronunciation kor-lo. Similarly, words like যাব and খাব are
mapped to their correct pronunciations 'jabo' and 'khabo', instead of 'jab' and 'khab'.
Figure 4: Phonetic Analysis of Subachan [5]
Subachan also distinguishes itself from other TTS software by the special emphasis it puts on joint letters.
There are almost 285 joint letters in Bengali, pronounced in various different ways. Most joint letters of a
word generate a silence within the pronunciation of the word, whereas some are pronounced consecutively
by pronouncing the unit letters that form those joint letters. The algorithm in Subachan converts the joint
letters into a concatenation of diphones, applies fade-in and fade-out effects to the starting and ending
diphones, and inserts silence to maintain the artificial stress and pitch [5].
Subachan also uses a tool that can concatenate diphones and helps take decisions for identifying the
proper diphones for a word. The diphones of the word "bangla" as generated by Subachan are the
following: "বাংলা → ব-বআ-আঙ-ঙঅ-অল-লআ-আ" [5]
For example, "পদ্মার ইলিশ খুব সুস্বাদু" generates the following line after phonetic analysis: "পদ্দার
ইলিশ খুব শুশ্শাদু". Then, after prosodic analysis and the handling of special cases of joint-letter replacement,
the following list of diphones is generated: প+পঅ+অদ+দ + দ+দআ+আর+র + silence + ই+ইল+লই+ইশ+শ +
silence + খ+খউ+উব+ব + silence + শ+শউ+উশ+শ + শ+শআ+আদ+দউ+উ. All of these diphones are then
converted to the corresponding sound files, which are concatenated together to produce the audio output [5].
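A rough sketch of the concatenation step is given below. It assumes the diphones are stored as individual WAV files with hypothetical names, applies the fade-in/fade-out and silence insertion described above using numpy and the soundfile package, and is not Subachan's actual implementation.

import numpy as np
import soundfile as sf

def fade(segment, n: int = 200) -> np.ndarray:
    """Apply a short linear fade-in and fade-out to a mono segment."""
    out = np.asarray(segment, dtype=np.float64).copy()
    n = min(n, len(out) // 2)
    if n > 0:
        out[:n] *= np.linspace(0.0, 1.0, n)
        out[-n:] *= np.linspace(1.0, 0.0, n)
    return out

def concatenate_diphones(paths, sr=16000, silence_ms=50):
    """Join diphone recordings; a None entry inserts a short silence."""
    silence = np.zeros(int(sr * silence_ms / 1000))
    pieces = []
    for p in paths:
        if p is None:
            pieces.append(silence)
        else:
            audio, file_sr = sf.read(p)
            assert file_sr == sr, "all diphone files are assumed to share one sample rate"
            pieces.append(fade(audio))
    return np.concatenate(pieces)

# Hypothetical diphone files for part of a word, with None marking a pause.
wave = concatenate_diphones(["di_b_a.wav", "di_a_ng.wav", None, "di_l_a.wav"])
sf.write("output.wav", wave, 16000)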
Subachan represents a significant advancement in Bengali TTS synthesis, offering reliable performance
and close adherence to the language's pronunciation rules. Its unique architecture and ongoing development
efforts position it as a promising platform for future advances in Bengali speech synthesis technology.
7 DNN Based Statistical Parametric TTS System
An attractive alternative to the concatenative synthesis approach is statistical parametric speech synthesis,
or SPSS [6]. In SPSS, acoustic parameters from which speech can be synthesized are generated, and the
waveform is produced with the help of a vocoder [6]. The introduction of deep neural networks (DNNs)
opened a new research direction for acoustic modeling in SPSS [7]. This section discusses a DNN-based
statistical parametric TTS system developed by Rajan Saha Raju et al. at Shahjalal University of Science
and Technology. Due to the lack of a good dataset, the developers created their own dataset for the system,
which includes more than 40 hours of speech, amounting to 12,500 utterances. They also prepared a
pronunciation dictionary (lexicon) of 135,000 words for front-end text processing. Two TTS voices were
developed using this dataset, one male and one female, called SUST SPSS Male and SUST SPSS Female
respectively [6]. A summary of the speech data for this system is shown in Figure 5.
From the system overview, we can see that the text passed to this system as input goes through five
individual components: a text normalizer, a front-end processor, a duration model, an acoustic model, and
a vocoder. Let us take a deeper look at what exactly happens in each component and what algorithm each
follows.
7.2 Text Normalizer
Text normalization in a TTS system is essentially the task of converting written text into its actual
pronounceable form. Although Bengali is mostly a "you pronounce what you write" type of language,
exceptions occur in many cases [8].
An example of such exceptions is the following words:
বাহ্য, বলব, জিহ্বা
The pronunciations of these three words are বাজঝো, বলবো, জিউবা, which are quite different from how they
are written. Converting the raw text into its pronounceable form is part of what the text normalizer
does. There are multiple issues in text that a normalizer deals with; some of them are given below [8]:
7.2.2 Conjuncts
There are more than 250 conjuncts in Bengali, and most of them follow one of the four following
variations:
• Sometimes the sound of a constituent letter of the conjunct is substituted by another sound, e.g.,
বিদ্বান = বিদদান
7.2.3 Null-Modified Characters
Null-modified characters are characters written without any vowel modifier, and they are frequently used
in Bengali. This is a difficult problem to deal with, since it is not always possible to determine the
pronunciation of null-modified characters with rules. The table below shows the exception words which
create this problem [8]:
These are just some of the various problems that a text normalizer must deal with. Now let us
look at how a text normalizer deals with these problems.
7.3 Dealing with Text Normalization Issues
There are two broad approaches to handling these issues: a rule-based approach and a database (lookup) approach.
So we can see that when raw text is passed to a text normalizer, the output is the form in which the
text is actually pronounced. For each word encountered, the text normalizer first checks whether it is
present in the database; if it is, its normalized form is fetched. If the word is not present, the normalized
word is produced using the rule-based approaches. This output is then passed to the next component,
the front-end processor. For example, given the word বলব as input, the text normalizer would output its
pronounceable form বলবো.
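A minimal sketch of this lookup-then-rules flow is given below; the dictionary entries and the single fallback rule are illustrative assumptions, not the system's actual database or rule set.

# Exception dictionary: spelling -> pronounceable form (entries are illustrative).
EXCEPTIONS = {
    "বলব": "বলবো",
    "যাব": "যাবো",
    "খাব": "খাবো",
}

def apply_rules(word: str) -> str:
    """Fallback rule-based normalization; a real system has many such rules."""
    # Placeholder rule: most Bengali words are pronounced as written.
    return word

def normalize(text: str) -> str:
    out = []
    for word in text.split():
        # 1) database lookup, 2) rule-based fallback
        out.append(EXCEPTIONS.get(word, apply_rules(word)))
    return " ".join(out)

print(normalize("আমি কাল যাব"))  # -> "আমি কাল যাবো"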
7.4 Front-end Processor
The DNN-based TTS system relies on a front-end text processor to extract linguistic features from
input text. Two open-source tools, Ossian and Festival, have been effectively employed for this
purpose [6].
7.4.1 Ossian
Ossian's text processor extracts linguistic features from input text, irrespective of language, by mapping
each character to its phoneme. By linguistic features, we mean characteristics that describe the sound of
the text, such as phonemes (the smallest units of sound in a language), prosody (intonation, stress, rhythm),
and other features. An example of character-to-phoneme mapping in Bengali is the following: for the word
আদেশ there are three characters, আ, দে, and শ, which correspond to the phonemes a/de/sh. Using this
mapping, Ossian extracts the other linguistic features mentioned above.
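The idea can be illustrated with a toy mapping that covers only this example word; this is not Ossian's actual implementation, and a real mapping would cover the full Bengali character inventory and its vowel signs.

# Toy grapheme-to-phoneme map covering only the example word আদেশ.
CHAR_TO_PHONEME = {"আ": "a", "দে": "de", "শ": "sh"}

def to_phonemes(word: str) -> list[str]:
    phonemes, i = [], 0
    while i < len(word):
        # Longest-match first, so a consonant plus its vowel sign ("দে") is one unit.
        for size in (2, 1):
            unit = word[i:i + size]
            if unit in CHAR_TO_PHONEME:
                phonemes.append(CHAR_TO_PHONEME[unit])
                i += size
                break
        else:
            i += 1  # skip characters outside the toy map
    return phonemes

print(to_phonemes("আদেশ"))  # ['a', 'de', 'sh']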
7.4.2 Festival
Festival's front-end text processor, on the other hand, is language-specific: it requires language-specific
lexicons and phonologies and uses a grapheme-to-phoneme converter. An example of a grapheme-to-phoneme
converter is given in the figure below:
HTS-style labels refer to the format of the alignment labels used in HTS (HMM-based speech synthesis)
systems. In HTS, speech synthesis involves dividing the speech signal into small segments (frames) and
aligning them with linguistic units (e.g., phonemes), where each frame corresponds to a specific state within
an HMM. State-level alignment provides a mapping between frames and the corresponding states in the
HMM; for example, if we have an HMM with three states (initial, middle, and final), state-level alignment
tells us which frames correspond to each state. From these HTS-style labels, a vector of linguistic features
is generated by Ossian or Festival. This feature vector is then fed to the duration model and the acoustic
model. So, the input to the front-end processor is the normalized text, and the output is a feature vector.
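To make the label-to-vector step concrete, the sketch below converts a single, highly simplified context label into a vector of one-hot (binary) and positional features. The label format and phoneme inventory are assumptions for illustration; real HTS full-context labels carry much richer context.

import re
import numpy as np

# Simplified label: two preceding phonemes ^ -, the current phoneme, + = two
# following phonemes, then a position index (real HTS labels hold far more context).
LABEL_RE = re.compile(r"(.+?)\^(.+?)-(.+?)\+(.+?)=(.+?)@(\d+)")
PHONEME_SET = ["sil", "a", "m", "i", "e", "k", "t", "b", "o"]  # assumed inventory

def one_hot(phoneme: str) -> np.ndarray:
    vec = np.zeros(len(PHONEME_SET))
    if phoneme in PHONEME_SET:
        vec[PHONEME_SET.index(phoneme)] = 1.0
    return vec

def label_to_features(label: str) -> np.ndarray:
    prev2, prev1, cur, next1, next2, pos = LABEL_RE.match(label).groups()
    parts = [one_hot(p) for p in (prev2, prev1, cur, next1, next2)]
    parts.append(np.array([float(pos)]))   # numeric positional feature
    return np.concatenate(parts)           # binary + numeric feature vector

vec = label_to_features("sil^a-m+i=e@1")
print(vec.shape)  # (5 * 9 + 1,) = (46,)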
7.5 Duration Model
The TTS system described in the paper contains two different deep neural networks. The first is the
duration model, which takes the linguistic features generated by the front-end processor as input and
produces a set of predicted durations for each phoneme in the input text. These predicted durations
represent the expected length of time for which each speech unit should be pronounced when synthesizing
speech. The DNN is a feed-forward network consisting of 3 hidden layers with 512 neurons in each layer.
Using a gradient descent optimizer, the network learns the proper duration information by updating its
weights. An example of the working of this component is as follows:
Input: Let us consider that the TTS system is given the input আমি একটি বই পড়ছি.
Now, consider a hypothetical feature vector containing linguistic features for this input sentence; it may
include information such as phonetic representations of each word, syllable boundaries, part-of-speech tags,
and linguistic context. A hypothetical output that the duration model could give for this feature vector is
the following:
• /a/: 90 ms
• /mi/: 150 ms
• /e/: 110 ms
• /k/: 60 ms
• /ti/: 180 ms
• /b/: 40 ms
• /oi/: 130 ms
• /po/: 100 ms
• /r/: 70 ms
• /ch/: 90 ms
• /i/: 100 ms
These predicted durations are now passed on to the acoustic model to obtain the acoustic features.
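A minimal Keras sketch consistent with this description (a feed-forward network with 3 hidden layers of 512 units, trained with a gradient-descent optimizer) is shown below; the input dimensionality and the MSE loss are assumptions, since the paper does not specify them here.

import tensorflow as tf

NUM_LINGUISTIC_FEATURES = 600  # assumed input dimensionality

duration_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_LINGUISTIC_FEATURES,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(1),  # predicted duration per phoneme (e.g., in frames or ms)
])

# Plain gradient descent, as described; the MSE loss is an assumption.
duration_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")
# duration_model.fit(linguistic_features, phoneme_durations, epochs=..., batch_size=...)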
7.6 Acoustic Model
The acoustic model takes two kinds of inputs: the linguistic features generated by the front-end processor
(represented as sequences of binary vectors), and the duration features generated by the duration model
for those linguistic features. The acoustic model learns to map these inputs to acoustic features, which
represent the spectral characteristics of speech, such as pitch, intensity, and spectral envelope. This neural
network consists of 6 hidden layers with 1024 neurons in each, and it updates its weights in each iteration
using the gradient descent algorithm.
The input linguistic and duration features are fed forward through the layers of the acoustic model and
transformed at each layer using activation functions until they reach the output layer, which generates the
predicted acoustic features.
As in the earlier example, for the sentence আমি একটি বই পড়ছি, the input to the acoustic model would be
the feature vector and the durations obtained from the duration model. A possible output that the acoustic
model might generate is:
Predicted Acoustic Features:
• Pitch Contour: [100 Hz, 110 Hz, 115 Hz, 120 Hz, 125 Hz, 120 Hz, 115 Hz, 110 Hz, 105 Hz, 100 Hz]
• Intensity Contour: [60 dB, 65 dB, 70 dB, 75 dB, 80 dB, 75 dB, 70 dB, 65 dB, 60 dB, 55 dB]
• Spectral Envelope: [Array of spectral energy values over different frequency bands]
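A corresponding Keras sketch of the acoustic model (6 hidden layers of 1024 units, taking the concatenated linguistic and duration features as input) is given below; the input and output dimensionalities are assumed placeholders.

import numpy as np
import tensorflow as tf

NUM_LINGUISTIC_FEATURES = 600   # assumed
NUM_ACOUSTIC_FEATURES = 187     # assumed size of the per-frame acoustic feature vector

acoustic_model = tf.keras.Sequential(
    [tf.keras.layers.Input(shape=(NUM_LINGUISTIC_FEATURES + 1,))]      # +1 duration feature
    + [tf.keras.layers.Dense(1024, activation="relu") for _ in range(6)]
    + [tf.keras.layers.Dense(NUM_ACOUSTIC_FEATURES)]                   # predicted acoustic features
)
acoustic_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01), loss="mse")

# Inputs: binary linguistic vectors concatenated with the predicted durations.
linguistic = np.random.rand(8, NUM_LINGUISTIC_FEATURES)   # dummy batch of 8 phonemes
durations = np.random.rand(8, 1)
acoustic = acoustic_model.predict(np.concatenate([linguistic, durations], axis=1))
print(acoustic.shape)  # (8, 187)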
7.7 Vocoder
The final component of this TTS system, the vocoder, is used to synthesize the output waveform, resulting
in the final synthesized speech. The output from the acoustic model is first normalized by transforming
the acoustic features so that they have zero mean and unit variance. A hypothetical example of these
normalized acoustic features could be the following:
[
[0.2, 0.3, 0.5], #Example frame 1
[0.4, 0.6, 0.8], #Example frame 2
[0.1, 0.2, 0.4], #Example frame 3
...
]
In this example, each row represents a frame of acoustic features and the columns represent different
acoustic parameters, such as pitch, intensity, and spectral envelope (the exact features produced by the
acoustic model are not specified).
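The normalization step itself is straightforward; a small numpy sketch is shown below (in practice the mean and standard deviation would be computed over the training data rather than a single utterance).

import numpy as np

acoustic_features = np.array([
    [0.2, 0.3, 0.5],   # example frame 1
    [0.4, 0.6, 0.8],   # example frame 2
    [0.1, 0.2, 0.4],   # example frame 3
])

# Per-dimension mean and standard deviation.
mean = acoustic_features.mean(axis=0)
std = acoustic_features.std(axis=0)
normalized = (acoustic_features - mean) / std   # zero mean, unit variance per column

print(normalized.mean(axis=0))  # ~[0, 0, 0]
print(normalized.std(axis=0))   # ~[1, 1, 1]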
These normalized features are now sent to the vocoder for waveform synthesis. The vocoder used in
this system is WORLD, an open-source vocoder [9]. It uses the normalized acoustic features to generate
the final output waveform, which is the synthesized speech. The process involves:
• Analyzing the spectral characteristics of the input features. By doing so, the vocoder identifies
the frequency components that contribute to the overall sound.
• Synthesizing speech by filtering a broadband noise source (white noise) using the extracted
spectral features. White noise contains energy across all frequencies and the vocoder filters
this noise source using the spectral features obtained from the input. Each spectral feature
corresponds to a specific frequency band.
• Controlling the filter gains based on the normalized features. The normalized features (such as
pitch, duration, and intensity) influence the filter gains. Filter gains determine how much the
noise at each frequency band is amplified or attenuated.
• Combining the filtered noise sources to create the final output waveform. After filtering the
noise source for each spectral feature, the WORLD vocoder combines these filtered noise sources.
The synthesized speech waveform is generated by combining the filtered noise sources, and the WORLD
vocoder ensures that the timing, pitch, and spectral characteristics match the input features. The output
waveform represents natural-sounding speech corresponding to the input text.
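The sketch below shows the standard analysis/synthesis interface of the pyworld bindings for the WORLD vocoder (F0, spectral envelope, and aperiodicity in; waveform out). It illustrates the toolkit's public API rather than the exact integration used in this system, where the three parameter streams would come from the de-normalized acoustic model outputs instead of from analysis.

import numpy as np
import soundfile as sf
import pyworld as pw

# Analysis: decompose a recording into WORLD's three parameter streams.
x, fs = sf.read("reference.wav")          # hypothetical mono recording
x = x.astype(np.float64)
f0, t = pw.dio(x, fs)                     # raw F0 contour
f0 = pw.stonemask(x, f0, t, fs)           # refined F0
sp = pw.cheaptrick(x, f0, t, fs)          # spectral envelope per frame
ap = pw.d4c(x, f0, t, fs)                 # aperiodicity per frame

# Synthesis: reconstruct a waveform from the three parameter streams.
y = pw.synthesize(f0, sp, ap, fs)
sf.write("synthesized.wav", y, fs)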
7.8 Discussion
The DNN-based statistical parametric Bengali text-to-speech system uses a combination of components,
each performing a distinct task, to produce natural-sounding speech from input Bengali text. It makes use
of a text normalizer, which requires a large exception database as well as numerous rules to determine the
normalized form of the input text. The system also utilizes open-source software such as Ossian, Festival,
and WORLD to generate the final output. The use of two neural networks, one for duration modelling and
another for acoustic modelling, sets this TTS system apart from other SPSS-based systems that use methods
such as HMMs and LSTM-RNNs. Gaining a comprehensive understanding of the inputs and outputs of
each component of this TTS system was essential for grasping the overall architecture and for following the
different stages through which the input text progresses before the final waveform is generated.
8 Comparing SPSS and Subachan
The developers of the DNN-based SPSS system compared the results of both voices of their model (male
and female) with those of Subachan. They also drew a comparison between all of these models and the
best-known commercial Bangla TTS from Google.
8.1 Objective Evaluation
From the reported results, we can see that although the SPSS model does not objectively perform as well
as the Google Bangla TTS, both the male and female voices score relatively higher than the Subachan TTS.
8.2 Subjective Evaluation
The Mean Opinion Score (MOS) test was used to subjectively evaluate all four models. Native Bangladeshi
speakers were asked to listen to 20 synthetic sentences generated by the various TTS systems and to give
each system a naturalness score between 0 and 5, with a higher score meaning better naturalness. All the
scores were averaged to obtain the mean score of a system [6]. Figure 8 summarizes the MOS scores obtained
by the various systems.
The performance of the SPSS system is much higher than that of Subachan in this case, showing that the
statistical parametric speech synthesis approach outperforms the concatenative approach. The male voice
performed slightly better than the female voice, and both were comparable to the Google Bangla TTS.
9 Duration Modelling
In recent years, the two most popular methods used in TTS models have been concatenative synthesis and
statistical parametric speech synthesis (SPSS).
Duration modelling is essentially the process of deciding the duration of each phoneme, the smallest unit
of speech that can distinguish meaning in a language. Concatenative synthesis approaches do not necessarily
require explicit duration modelling, since the units themselves have intrinsic durations [10]. Most SPSS
models, on the other hand, make use of a component known as the duration model in order to specify the
duration of each phoneme, given linguistic features as input. By linguistic features, we mean properties such
as phrase type, part of speech, and the type and position of syllables.
A deep neural network (DNN)-based statistical parametric speech synthesis (SPSS) framework (such as
the one discussed above) converts input text into output waveforms by using modules in a pipeline:
a text analyzer to derive linguistic features such as syntactic and prosodic tags from text, a duration
model to predict the phoneme duration, an acoustic model to predict the acoustic features such as
mel-cepstral coefficients and F0, and a vocoder to produce the waveform from the acoustic features [11].
In this report, I have explored how the duration model works in a typical DNN-based SPSS framework,
focusing on how linguistic features are converted into durations, based on the work presented in the
following two papers:
• "Text-to-Speech Duration Models for Resource-Scarce Languages in Neural Architectures" by Louw [12]
• "Duration modeling using DNN for Arabic speech synthesis" by Zangar et al. [16]
9.1 Duration Model for Resource-Scarce Languages
9.1.1 Model Architecture
The duration model discussed in this paper is based on a stack of fully connected layers in a feed-forward
neural network (FFNN) [12]; the architecture is shown in the figure below. The output layer is linear, while
ReLU activations are used at the hidden layers. Batch normalization and dropout are applied at each hidden
layer of the network. The Adam optimization algorithm [13] is used with a learning-rate scheduler that
lowers the learning rate when the validation loss reaches a plateau. The weights and biases of all the layers
are initialized using the He-uniform distribution [14]. The loss function is the mean squared error (MSE) on
the predicted duration feature.
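A Keras sketch consistent with this description is given below: fully connected hidden layers with ReLU, batch normalization and dropout, He-uniform initialization, Adam with a reduce-on-plateau learning-rate schedule, and an MSE loss. The layer width and count follow the discussion in Section 9.1.6, while the input size and dropout rate are assumptions.

import tensorflow as tf

INPUT_DIM = 400  # assumed size of the linguistic description feature vector

def hidden_block(units: int) -> list:
    return [
        tf.keras.layers.Dense(units, activation="relu",
                              kernel_initializer="he_uniform",
                              bias_initializer="he_uniform"),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.2),  # dropout rate is an assumption
    ]

layers = [tf.keras.layers.Input(shape=(INPUT_DIM,))]
for _ in range(4):                      # 4 hidden layers of 128 units (see Section 9.1.6)
    layers += hidden_block(128)
layers.append(tf.keras.layers.Dense(1, kernel_initializer="he_uniform"))  # linear output

model = tf.keras.Sequential(layers)
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")

# Lower the learning rate when the validation loss plateaus.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)
# model.fit(x_train, y_train, validation_data=(x_val, y_val), callbacks=[reduce_lr], ...)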
9.1.2 Training Process
During training, the TTS engine front-end creates a contextual label sequence for each recording of the
training data in the speech database. This contextual label sequence is then converted into a linguistic
description feature vector and used as input to the FFNN. The ground-truth duration of each phone unit in
the contextual label sequence is used as the output feature target of the FFNN.
9.1.4 Pipeline
The contextual features given in the table above are converted into a linguistic description vector containing
a combination of binary encodings (for the phoneme identities and features) and positional information,
using the MERLIN toolkit [15]. This is the linguistic description feature vector. The vector is passed to the
FFNN, and the output from the FFNN is the predicted duration.
9.1.6 Discussion
The steps followed in the duration model of the TTS system described in this paper are:
1. The TTS engine front-end creates a contextual label sequence for each utterance in the speech database
2. The MERLIN toolkit converts these labels into vectors of binary and continuous features [6], and this
vector is the linguistic description feature vector
3. The linguistic description feature vector is passed as input to the FFNN
4. The FFNN which gave the best results consisted of 4 hidden layers with 128 units per layer
5. The FFNN uses ReLU activation functions for the hidden layers, and the weights and biases are
initialized using the He-uniform distribution
6. The output from the FFNN is the predicted duration of the phonemes
9.2 Duration modeling using DNN for Arabic speech synthesis
This paper investigates the modelling of phoneme duration for the Arabic language. Similar to the paper
discussed above, the model described here takes the same kind of input features (the two preceding and two
succeeding phonemes, the position of syllables, etc.) and gives the durations of phonemes as output. An
input feature can be binary, such as stressed/not-stressed; discrete, such as the phoneme identity; or numeric,
such as the phoneme position [16].
9.2.1 Overview
The paper investigates multiple architectures, including feed-forward DNNs using only dense layers and
recurrent DNNs based on LSTM and BLSTM layers. The RMSprop optimizer is adopted in the experiments,
along with early stopping to avoid over-fitting. For each model, various numbers of hidden layers, numbers
of nodes, and activation functions have been tried.
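The three architecture families can be sketched in Keras as follows; the layer sizes, layer counts, and input shapes are assumptions for illustration, and MSE is used as the training loss since minimizing MSE also minimizes the RMSE reported in the paper.

import tensorflow as tf

INPUT_DIM = 300  # assumed size of the per-phoneme input feature vector

def dense_model():
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(INPUT_DIM,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

def lstm_model(bidirectional: bool = False):
    # Recurrent models predict one duration per phoneme in the input sequence.
    rnn = tf.keras.layers.LSTM(128, return_sequences=True)
    if bidirectional:
        rnn = tf.keras.layers.Bidirectional(rnn)
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, INPUT_DIM)),   # variable-length phoneme sequence
        rnn,
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1)),
    ])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
for model in (dense_model(), lstm_model(), lstm_model(bidirectional=True)):
    model.compile(optimizer=tf.keras.optimizers.RMSprop(), loss="mse")
    # model.fit(..., validation_data=..., callbacks=[early_stop])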
Vowel quantity is a term in phonetics for the length of a vowel, usually indicated in phonetic transcription
by a length mark [:] after the vowel, as in /a:/ [17]. Vowels so marked generally have greater duration than
the same vowels without the mark. Consonant gemination is an articulation of a consonant for a longer
period of time than that of a singleton consonant [18].
The data used was divided into several subsets; for each subset, different models were trained and evaluated,
and the architecture leading to the most accurate prediction on the development set was selected.
5. simple consonants
6. geminated consonants
7. short vowels
8. long vowels.
9.2.3 Architecture leading to the best accuracy
As the size of the training corpus differs from one class to another, it is not the same model architecture
that leads to the most accurate prediction of phoneme durations for the different classes of sounds. The best
model for each class, along with its architecture (number of layers, activation functions, etc.), is given in the
figure below:
9.2.4 Discussion
The model discussed in the paper makes use of a class-specific approach; after testing various models for
each class, the best one is determined based on the root mean squared prediction error (RMSE). This is
because segmental duration is a continuous value, and the DNNs used in the architecture serve as regression
tools trained to minimize the RMSE. Comparisons with other state-of-the-art DNN-based toolkits such as
MERLIN have shown that, for the Arabic test set, a class-specific approach works best [16].
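For reference, the RMSE over N phonemes, with reference durations d_i and predicted durations \hat{d}_i, is

\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d_i - \hat{d}_i\right)^2}

so that lower values indicate more accurate duration prediction.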
References
[1] Hasanabadi, Mohammad Reza. "An Overview of Text-to-Speech Systems and Media Applications."
arXiv e-prints, 2023, arXiv:2310.14301.
[2] Raffoul, Sandra, and Lindsey Jaber. "Text-to-Speech Software and Reading Comprehension: The
Impact for Students with Learning Disabilities." Canadian Journal of Learning and Technology,
vol. 49, no. 2, Nov. 2023, pp. 1-18.
[4] Tabet, Y., and Mohamed Boughazi. "Speech synthesis techniques. A survey." Systems, Signal
Processing and their Applications (WOSSPA), 2011 7th International Workshop on. IEEE, 2011.
[5] A. Naser, D. Aich, and M. R. Amin. "Implementation of Subachan: Bengali text-to-speech synthesis
software." In International Conference on Electrical & Computer Engineering (ICECE 2010). IEEE,
2010, pp. 574-577.
[6] R. S. Raju, P. Bhattacharjee, A. Ahmad, and M. S. Rahman. "A Bangla Text-to-Speech System using
Deep Neural Networks." 2019 International Conference on Bangla Speech and Language Processing
(ICBSLP), Sylhet, Bangladesh, 2019, pp. 1-5. doi:10.1109/ICBSLP47725.2019.202055.
[7] H. Ze, A. Senior, and M. Schuster. "Statistical parametric speech synthesis using deep neural
networks." In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE,
2013, pp. 7962-7966.
[8] Rashid, M., Hussain, M. A., & Rahman, M. S. (2010, December). "Text normalization and
diphone preparation for Bangla speech synthesis." Journal of Multimedia, 5(6), 551-559.
https://doi.org/10.4304/jmm.5.6.551-559
[9] Morise, M., Yokomori, F., & Ozawa, K. (2016). "WORLD: A Vocoder-Based High-Quality Speech
Synthesis System for Real-Time Applications." IEICE Transactions on Information and Systems,
E99.D(7), 1877-1884. doi:10.1587/transinf.2015EDP7457
[10] Henter, G., Ronanki, S., Watts, O., Wester, M., Wu, Z., & King, S. (2016). "Robust TTS Duration
Modelling Using DNNs." In 2016 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (pp. 5130-5134). IEEE.
https://doi.org/10.1109/ICASSP.2016.7472655
[11] Yasuda, Y., Wang, X., & Yamagishi, J. (2020). "Investigation of learning abilities on linguistic
features in sequence-to-sequence text-to-speech synthesis." arXiv preprint arXiv:2005.10390.
[12] Louw, A. (2020). "Text-to-Speech Duration Models for Resource-Scarce Languages in Neural
Architectures." In A. L. Lueker & P. J. Sweeney (Eds.), Advances in Neural Information Processing
Systems 33 (pp. 141-153). https://doi.org/10.1007/978-3-030-66151-9_9
[13] Kingma, D. P., & Ba, J. (2014). "Adam: A method for stochastic optimization." arXiv preprint
arXiv:1412.6980.
[14] He, K., Zhang, X., Ren, S., & Sun, J. (2015, December). "Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification." In Proceedings of the IEEE International
Conference on Computer Vision (ICCV).
[15] Wu, Z., Watts, O., & King, S. (2016). "Merlin: An Open Source Neural Network Speech Synthesis
System." In Proceedings of the 9th ISCA Speech Synthesis Workshop (SSW 9) (pp. 202-207).
doi:10.21437/SSW.2016-33
[16] Zangar, I., Mnasri, Z., Colotte, V., Jouvet, D., & Houidhek, A. (2018). "Duration modeling using
DNN for Arabic speech synthesis." Speech Prosody 2018.
https://api.semanticscholar.org/CorpusID:53055606
[17] Encyclopedia.com. (2024). "Vowel Quantity." Retrieved May 12, 2024, from
https://www.encyclopedia.com/humanities/encyclopedias-almanacs-transcripts-and-maps/vowel-
quantity
[18] "Gemination." (2024, May 3). Wikipedia. Retrieved May 12, 2024, from
https://en.wikipedia.org/wiki/Gemination