
Text to Speech:

Machine Learning Models and Applications


Elvia Dey

May 13, 2024

Supervisor:

Lipika Dey
Department of Computer Science
Ashoka University
Contents

1 Introduction
  1.1 Purpose of the Report
  1.2 Understanding Text-to-Speech Modeling

2 Components of a Text-to-Speech System
  2.1 Text Analysis
  2.2 Acoustic Models
  2.3 Vocoder

3 Different types of TTS Models
  3.1 Wavenet
  3.2 Tacotron 2

4 Importance of Text-to-speech technology

5 Bengali Text to Speech

6 Subachan: A Concatenation Based TTS System
  6.1 Text Normalization
  6.2 Phonetic Analysis
  6.3 Prosodic Analysis and Wave Synthesis
  6.4 Performance Evaluation

7 DNN Based Statistical Parametric TTS System
  7.1 Model Architecture
  7.2 Text Normalizer
    7.2.1 Tokenization/ Segmentation
    7.2.2 Conjuncts
    7.2.3 Null-Modified Characters
    7.2.4 Numerical Words
  7.3 Dealing with Text Normalization Issues
    7.3.1 Database Application
    7.3.2 Rule-based Approach
  7.4 Front-end Processor
    7.4.1 Ossian
    7.4.2 Festival
    7.4.3 Output of the front-end processor
  7.5 Duration Model
  7.6 Acoustic Model
  7.7 Vocoder
  7.8 Discussion

8 Comparing SPSS and Subachan
  8.1 Objective Evaluation
  8.2 Subjective Evaluation

9 Duration Modelling
  9.1 Duration Model for Resource-Scarce Languages
    9.1.1 Model Architecture
    9.1.2 Training Process
    9.1.3 Contextual Features
    9.1.4 Pipeline
    9.1.5 Features of the Model
    9.1.6 Discussion
  9.2 Duration modeling using DNN for Arabic speech synthesis
    9.2.1 Overview
    9.2.2 Different phoneme classes
    9.2.3 Architecture leading to the best accuracy
    9.2.4 Discussion
1 Introduction
1.1 Purpose of the Report
The purpose of this report is to delve into the realm of text-to-speech (TTS) technology, which is the process
through which computers convert written text into spoken words. It is essentially an assistive technology
that reads digital text out aloud, and has a wide range of applications from aiding individuals with visual
impairments, to enhancing accessibility in digital interfaces, and facilitating natural interactions with tech-
nology through voice-enabled systems. By comprehensively exploring TTS, we aim to provide insights into
its underlying mechanisms, system architectures, and the different types of TTS models in use. Through
this exploration, readers will gain a deeper understanding of the components of a TTS system, as well as the
potential impact of TTS across diverse fields such as education and assistive technology.

This report is divided into the following broad topics:

1. Brief Overview of the components of a TTS System

2. Different types of TTS Models

3. Understanding the importance of TTS Technology

4. Overview of 2 Bengali TTS Systems: Subachan and a DNN Based SPSS System

5. Understanding Duration Modeling in a DNN Based TTS System

1.2 Understanding Text-to-Speech Modeling
Text-to-speech (TTS) or speech synthesis is the process by which input in the form of digital text is converted
into an audio signal. The primary goal of modern-day TTS systems is not only to generate audio which
accurately reads aloud the input text, but also to make it sound natural. A typical modern-day TTS system
consists of three main components [1]:

• A text analysis module

• An acoustic model

• A vocoder

These three components work together to convert the input text into the desired audio. The
text analysis component extracts linguistic features from the text, the acoustic model predicts acoustic
features from those linguistic features, and the vocoder reconstructs the final waveform. In some systems,
the acoustic model may also include a duration model. This process is shown in the image below:

Figure 1: General structure of TTS Systems [1]

2 Components of a Text-to-Speech System
As mentioned, a typical modern-day TTS system contains three components: a text analysis module, an
acoustic model, and a vocoder.

2.1 Text Analysis


The first stage that the input of a TTS system goes through is the text analysis module, also known
as the frontend. Its purpose is to convert the input text into linguistic features that contain information
about prosody (rhythm, pitch, loudness, etc.) and pronunciation [1]. The input to the text
analysis component is the raw text that needs to be converted into speech; this text can be in the form of
words, sentences, or paragraphs. The output of the text analysis component is a set of linguistic features
that capture the phonetic and prosodic characteristics of the input text. These features serve as the input to
the acoustic modeling component. Text analysis in most TTS systems consists of a text normalizer, which
produces the actual pronunciation of the raw text, and a front-end processor (such as Ossian or Festival),
which extracts linguistic features such as phonemes and syllables from the input text.

2.2 Acoustic Models


A vital component of a TTS system, this step involves the prediction of acoustic features from the extracted
linguistic features (in traditional models), or directly from phonemes and characters (in neural-based models
using an additional duration model) [1]. Acoustic features represent various characteristics of speech signals
that are essential for generating natural-sounding speech. Some of these features are:

• Mel-spectrogram: This provides a representation of the spectrum of frequencies of a signal as
it varies with time, on the mel scale.

• Mel-frequency cepstral coefficients (MFCCs): These describe the short-term power spectrum of speech
signals and characterize phonemes in terms of the shape of the vocal tract.

• Fundamental frequency (F0): F0 represents the rate at which the vocal folds vibrate during speech
production and determines the pitch of the voice.

So, the input to the acoustic modeling component is the set of linguistic features extracted from the text
analysis component, and the output is a set of acoustic features that represent the spectral and temporal
properties of the synthesized speech. These features serve as the input to the vocoder.
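As an illustration, acoustic features of the kind listed above can be extracted from a recorded utterance with the librosa Python library. This is only an illustrative sketch of what such features look like in practice (the file name is a placeholder); it is not part of the systems discussed in this report.

    import librosa

    # Load a recorded utterance (placeholder file name) at a 22,050 Hz sample rate
    y, sr = librosa.load("utterance.wav", sr=22050)

    # Mel-spectrogram: energy of the signal over time on a mel-scaled frequency axis
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)

    # MFCCs: compact description of the short-term power spectrum
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # Fundamental frequency (F0): pitch track estimated with the pYIN algorithm
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"))

    print(mel.shape, mfcc.shape, f0.shape)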

2.3 Vocoder
Similar to acoustic models, vocoders are also classified into two types: traditional and neural-based. The
former take acoustic features as input, while neural-based vocoders accept both acoustic and linguistic fea-
tures [1]. The output given by the vocoder is the synthesized speech waveform, which can be played back
as audio or saved as a digital file.

3 Different types of TTS Models
A TTS system follows a sequential flow of data through the pipeline, where the output from one component
serves as the input for the next. While traditional TTS systems include three models, namely the frontend,
the acoustic model, and the vocoder (an example of such a system is Statistical Parametric Speech Synthesis
(SPSS)), some TTS models combine these components to produce the output directly. One of the most
successful TTS systems, Tacotron 2, combines text analysis and the acoustic model, and directly predicts
acoustic features. Another very popular system, WaveNet, combines the acoustic model and the vocoder,
and directly transforms linguistic features into a waveform. Finally, fully end-to-end models such as ClariNet
combine all three blocks into one and directly convert characters into a waveform [1]. These different
types of pipelines for TTS models are shown below:

Figure 2: Different types of TTS models [1]

3.1 Wavenet
WaveNet is a deep generative model that uses dilated causal convolutions to autoregressively generate raw
audio waveforms. It can be conditioned on additional inputs to control the style and characteristics of the
generated audio. WaveNet has demonstrated impressive performance in generating realistic and high-quality
audio for various applications [1].
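To make the idea of dilated causal convolutions concrete, the following is a minimal Keras sketch (an assumption made for illustration, not DeepMind's actual implementation): 'causal' padding keeps each output sample from depending on future samples, and the dilation rate doubles at every layer so the receptive field grows exponentially. The real WaveNet additionally uses gated activations, residual and skip connections, and a softmax over quantized sample values.

    import tensorflow as tf
    from tensorflow.keras import layers

    inputs = tf.keras.Input(shape=(None, 1))        # raw waveform samples over time
    x = inputs
    for i in range(6):
        # Causal padding: no dependence on future samples; dilation doubles per layer
        x = layers.Conv1D(32, kernel_size=2, dilation_rate=2 ** i,
                          padding="causal", activation="relu")(x)
    outputs = layers.Conv1D(1, kernel_size=1)(x)    # next-sample prediction head
    wavenet_sketch = tf.keras.Model(inputs, outputs)
    wavenet_sketch.summary()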

3.2 Tacotron 2
Tacotron 2 is a sequence-to-sequence model that generates speech spectrograms from input text using an
attention mechanism and recurrent neural networks. It leverages the WaveNet vocoder to produce high-
quality speech waveforms from the generated spectrograms, resulting in natural and intelligible synthetic
speech.

4 Importance of Text-to-speech technology
One of the biggest applications of TTS software is in the educational sector as an assistive technology. It
serves as a very helpful tool for those with learning disabilities, particularly those who are visually impaired.
The conversion of digital text into audio signals serves as a powerful medium for learning and helps retain
more information effectively. TTS software is commonly used as a compensatory tool (mainly at the
postsecondary level), and has helped students improve reading speed, fluency, and content retention, resulting
in increased self-efficacy in reading and more independent learning [2]. It also helps students
learn the accurate pronunciation of terms, something they may not be able to do just by reading
them. The technology provides students with autonomy and independence in their learning: when
educators use TTS software as part of a comprehensive approach to instruction, it decreases the need
for human support and increases self-confidence, motivation, and access to the grade-level curriculum
[2].

5 Bengali Text to Speech
Bengali, also known as Bangla, is a language indigenous to the Bengal region of South Asia. As of
2021, it boasts approximately 240 million native speakers, making it the sixth most spoken native language
worldwide and the seventh most spoken language overall [3]. In the era of expanding digital communication,
the necessity for a system capable of generating natural-sounding Bengali speech has become increasingly
apparent. Given its vast speaker base, the development of a Bengali Text-to-Speech (TTS) system holds
immense importance for fostering accessibility, communication, and inclusion among Bengali speakers from
diverse backgrounds. Despite being considered a relatively under-resourced language with limited speech
applications, efforts to implement a TTS model for Bengali have been underway. In this context, two distinct
approaches have been explored: one adopting a concatenative synthesis method via a TTS system named
Subachan, and another leveraging statistical parametric speech synthesis (SPSS) employing Deep Neural
Networks.

6 Subachan: A Concatenation Based TTS System


Concatenative synthesis is the process by which prerecorded units of speech such as phonemes, diphones,
syllables, words or sentences are concatenated to produce artificial speech. [4] This approach uses human
speech samples and generates the most natural-sounding synthesized speech. [5] Developed by Abu Naser
et al. from Shahjalal University of Science and Technology, Subachan is a TTS system which uses this
approach to convert Bengali text into recognizable speech, and it does so by using four major modules:

1. Normalization

2. Phonetic analysis

3. Prosodic analysis

4. Wave synthesis

Subachan, written entirely in the Java programming language, has been implemented using a diphone set
for Bengali TTS. [5] The text analysis part of the software, involving tokenization, token identification, and
phrase detection, is performed in the Normalization module. Pronunciation rules are included in the Phonetic
Analysis module, which uses grapheme-to-phoneme rules (in cases other than the O-karanto problem, which is
discussed in detail below). The Prosodic Generator assigns duration values to individual phonemes, and finally
the speech synthesis is performed by concatenative techniques in the Waveform Synthesis phase. [5] Figure 3
shows the architecture of Subachan.

Figure 3: The Architecture of Subachan [5]

6.1 Text Normalization


The first step performed while converting raw text into a pronounceable word is Text Normalization.
After raw text goes through various stages such as lexical analysis, NSW (non-standard word)
identification, token expansion, application of expansion rules, and phrase detection, it produces normalized
text. Subachan differs from many other TTS systems in that it takes a different approach to tokenization and
token identification.

• Splitting to token: Instead of using only white space and punctuation as delimiters, Subachan uses a
deterministic finite automaton (DFA) to split tokens (a small sketch of this idea follows this list). As an
example, consider the phone number ৮৬১৭ ৯৮৫ ২০৭ (8617 985 207 in English). If tokenization were
based on white space alone, this would be identified as three tokens, but Subachan, using a DFA,
successfully identifies it as a "Telephone number". [5]

• Token identification: Subachan deals with ambiguity in tokens by focusing on the correct identification
of non-standard words. These NSWs include dates, times, fractions, telephone numbers,
etc. For example, the fraction ১/২ is pronounced as এক ভাগ দু ই and the telephone number ৯৮৮১-
৭২-১৩০৪ is pronounced as নয় আট আট এক সাত দু ই এক িতন শূ নয্ চার, instead of being identified as
a date due to its format. [5]
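The following is a hypothetical Python sketch of this token-identification idea. A regular expression stands in for the DFAs that Subachan actually uses, and the pattern shown covers only the telephone-number case; it is illustrative, not the system's actual implementation.

    import re

    BN_DIGITS = "০১২৩৪৫৬৭৮৯"
    # At least two digit groups separated by spaces or hyphens, optionally prefixed by '+'
    PHONE = re.compile(rf"^\+?[{BN_DIGITS}0-9]+([ -][{BN_DIGITS}0-9]+)+$")

    def classify(span: str) -> str:
        """Label a candidate span instead of blindly splitting it on white space."""
        if PHONE.match(span):
            return "TELEPHONE_NUMBER"   # later read out digit by digit
        return "WORD"

    print(classify("৮৬১৭ ৯৮৫ ২০৭"))     # TELEPHONE_NUMBER, not three separate tokens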

6.2 Phonetic Analysis


Subachan uses the rules of existing grapheme-to-phoneme algorithms in most cases. The exceptions include
words that face the O-karanto problem, for which it uses a small dictionary containing the pronunciations
of a few words.

O-karanto Problem
This problem is faced by words which, when pronounced, end with an অ sound but are not written
that way. An example is করল, which without this dictionary would have been pronounced
as kor-ol instead of the correct pronunciation kor-lo. Similarly, words like যাব and খাব are
mapped to their correct pronunciations of 'jabo' and 'khabo', instead of 'jab' and 'khab'.
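A minimal Python sketch of this dictionary-plus-rules idea is given below. The dictionary entries and the rule function are hypothetical placeholders, not Subachan's actual data or code.

    # Exception dictionary for O-karanto words (illustrative entries only)
    O_KARANTO = {
        "করল": "kor-lo",
        "যাব": "jabo",
        "খাব": "khabo",
    }

    def g2p_rules(word: str) -> str:
        # Stand-in for the regular grapheme-to-phoneme rules used for all other words
        return word

    def pronounce(word: str) -> str:
        # Look the word up first; fall back to the rule-based conversion otherwise
        return O_KARANTO.get(word, g2p_rules(word))

    print(pronounce("করল"))   # kor-lo, not the rule-derived kor-ol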

Figure 4 shows the flow chart of phonetic analysis.

Figure 4: Phonetic Analysis of Subachan [5]

6.3 Prosodic Analysis and Wave Synthesis


This module includes corpus recording, diphone separation, labeling the diphone files, and finally
synthesizing these diphone files to produce speech. [5] Subachan converts some letters into a combination
of two or more in order to reduce the number of diphones (such as ঋ to ির, ঐ to ওই, and
so on). After this conversion, the system recognizes a total of 527 diphones in
Bangla, as listed in Table 1.

Diphone Type                  Number
Starting of Vowel (V)              6
Starting of Consonant (C)         27
Ending of Vowel (V)                6
Ending of Consonant (C)           32
VV                                36
CV                               192
VC                               192
VYV                               36
Total                            527

Table 1: Number of Diphones [5]

As a TTS system, Subachan sets itself apart from most other Bengali TTS software by the special emphasis
it puts on joint letters. There are almost 285 joint letters in Bengali, pronounced in various different ways.
Most joint letters introduce a short silence within the pronunciation of a word, whereas some are pronounced
consecutively by pronouncing the unit letters that form them. The algorithm in Subachan converts the joint
letters into a concatenation of diphones, applies fade-in and fade-out effects to the starting and ending
diphones, and inserts silence to maintain the artificial stress and pitch. [5]
Subachan also uses a tool which can concatenate diphones and helps in deciding the proper diphones
for a word. The diphones of the word "bangla" as generated by Subachan are the
following: "বাংলা →ব-বআ-আঙ-ঙঅ-অল-লআ-আ" [5]

For example, “পদ্মার ইিলশ খুব সু সব্াদু ” generates the following line after phonetic analysis: “পদ্দার
ইিলশ খুব শুশ্শাদু ”. Then, after prosodic analysis and handling the special cases of joint-letter replacement,
the following list of diphones is generated: প+পঅ+অদ+দ + দ+দআ+আর+র + silence + ই+ইল+লই+ইশ+শ +
silence + খ+খউ+উব+ব + silence + শ+শউ+উশ+শ + শ+শআ+আদ+দউ+উ. All of these diphones are then
converted to the corresponding sound files, which are concatenated together to produce the audio output.
[5]
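To illustrate the wave-synthesis step, the following NumPy sketch joins diphone waveforms with short fade-in/fade-out ramps and optional silence, roughly in the spirit described above. The arrays, fade length, and silence duration are placeholders; the real system concatenates recorded diphone sound files.

    import numpy as np

    def crossfade_concat(units, fade=64):
        """Concatenate 1-D waveforms, crossfading over `fade` samples at each join."""
        ramp_in = np.linspace(0.0, 1.0, fade)
        ramp_out = ramp_in[::-1]
        out = units[0].copy()
        for unit in units[1:]:
            out[-fade:] *= ramp_out            # fade out the tail of the audio so far
            nxt = unit.copy()
            nxt[:fade] *= ramp_in              # fade in the head of the next diphone
            out = np.concatenate([out[:-fade], out[-fade:] + nxt[:fade], nxt[fade:]])
        return out

    diphone_a = np.random.randn(4000)          # stand-ins for recorded diphone files
    diphone_b = np.random.randn(4000)
    silence = np.zeros(2000)                   # inserted e.g. around some joint letters
    speech = crossfade_concat([diphone_a, silence, diphone_b])
    print(speech.shape)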

6.4 Performance Evaluation


Subachan demonstrates robust performance in converting text to speech with high intelligibility and
acceptable naturalness. Through meticulous algorithm development, Subachan efficiently handles
various linguistic aspects of Bengali, including text normalization, phonetic analysis, and diphone
reduction.

In evaluations involving human participants, Subachan achieves an intelligibility rate of approxi-


mately 73.3% at the word level and 93.3% at the sentence level. It also garners a naturalness rate of
approximately 93.6% at the word level and 76.6% at the sentence level.

Subachan represents a significant advancement in Bengali TTS synthesis, offering reliable perfor-
mance and adherence to linguistic rules and regulations. Its unique architecture and ongoing de-
velopment efforts position it as a promising platform for future advancements in Bengali speech
synthesis technology.

7 DNN Based Statistical Parametric TTS System
An attractive alternative to the concatenative synthesis approach is statistical parametric speech
synthesis, or SPSS [6]. In SPSS, acoustic parameters are generated from which speech can be synthesized
with the help of a vocoder. [6] The introduction of deep neural networks (DNNs) opened
a new research direction for acoustic modeling in SPSS. [7] This section discusses a DNN-based
statistical parametric TTS system developed by Rajan Saha Raju et al. from Shahjalal University of
Science and Technology. Due to the lack of a good dataset, the developers created their own dataset
for the software, which includes more than 40 hours of speech, amounting to 12,500 utterances. They
also prepared a pronunciation dictionary (lexicon) of 135,000 words for frontend text processing.
Two TTS voices were developed using this dataset: one male and one female, called SUST SPSS
Male and SUST SPSS Female, respectively. [6] A summary of the speech data for this system is shown
in Figure 5.

Figure 5: Speech data prepared for Bangla SPSS[6]

7.1 Model Architecture


The model architecture that the DNN Based System uses is shown below:

Figure 6: SPSS Model Architecture

From the figure, we can see that the text that is passed to this system as input goes through five
individual components: a text normalizer, a front-end processor, a duration model, an acoustic
model, and a vocoder. Let us take a closer look at what exactly happens in each component, and
what algorithms each follows.

7.2 Text Normalizer
Text Normalization in a TTS system is essentially the task of converting written text into its actual
pronounceable form. Although Bengali is mostly a "you pronounce what you write" type of language,
exceptions occur in many cases. [8]
An example of such exceptions can be the following words:
বাহয্, বলব, িজহবা
The pronunciations of the three words are বাজেঝা, বলেবা, িজউবা, which are quite different from how they
are written. Converting the raw text into its pronounceable form is part of what the text normalizer
does. There are multiple issues in text which a normalizer deals with, some of which are given below
[8]:

7.2.1 Tokenization/ Segmentation


The text normalization process takes a sentence as input and produces words in the first step. [8]
White space and punctuation do not always suffice for tokenizing a sentence. For example, +৯১
৮৬১৭ ২৮৫৭০২ and +৯১ ৯৮৩০ ০৮৪৭৯৫ are two phone numbers, but if split by white space they would be
treated as six different tokens.

7.2.2 Conjuncts
There are more than 250 conjuncts in Bengali, and most of them follow one of the four following
variations:

• Some are pronounced according to the spelling, e.g. মক্কা

• Sometimes the sound of a constituent letter of the conjunct is substituted by another sound, e.g.
িবদব্ান = িবদদান

• Some are pronounced with a different letter, e.g. ক্ষমা = খমা

• Some are not pronounced at all, e.g. হৰ্দ = রদ

7.2.3 Null-Modified Characters
Null-modified characters are the vowels with no modifying characters, and are frequently used in
Bengali. This is a big problem to deal with, since it is not possible to determine the pronunciation
of null-modified characters with rules. The table below shows the exception words which create this
problem [8]:

Figure 7: Null Modified Character Problem

7.2.4 Numerical Words


Normalizing numbers creates ambiguity, because the pronunciation varies depending on how the
number is treated. Some examples are:

Figure 8: Numerical Word Issue

These are just some of the various problems that a text normalizer must deal with. Now let us
look at how a text normalizer deals with these problems.
7.3 Dealing with Text Normalization Issues
There are two approaches: a database approach and a rule-based approach.

7.3.1 Database Application


Words that cannot be normalized using rules are stored in a database in normalized form. When
the text normalizer encounters a word, it first looks for it in the database; if present, its normalized
representation is fetched. Abbreviations like ডাঃ = ডাক্তার are stored in these databases, among other
words. Many of the verbs with null-modified characters are also stored in the database.

7.3.2 Rule-based Approach


Some of the rules for words with issues discussed above are given in the tables below:

So we can see that when raw text is passed to a text normalizer, the output is the form in which the
text is actually pronounced. For each word encountered, the text normalizer first checks whether it is
present in the database. If yes, its normalized form is fetched; if the word is not present, the normalized
word is produced using the different rule-based approaches. This output is then passed to the
next component, the front-end processor. An example of an input passed to a text normalizer and the
output obtained from it is as follows:

Figure 9: Text Normalization
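A hedged Python sketch of this lookup-then-rules flow is shown below. The database contents and the rule function are illustrative placeholders, not the system's actual data or rules; the two example entries reuse normalizations mentioned earlier in this section.

    # Exception database (illustrative entries: an abbreviation and a null-modified verb)
    EXCEPTIONS = {"ডাঃ": "ডাক্তার", "বলব": "বলেবা"}

    def apply_rules(word: str) -> str:
        # Stand-in for the rule-based normalization (conjuncts, numbers, and so on)
        return word

    def normalize_word(word: str) -> str:
        # Database lookup first, rule-based fallback otherwise
        return EXCEPTIONS.get(word, apply_rules(word))

    sentence = "ডাঃ বলব"
    print(" ".join(normalize_word(w) for w in sentence.split()))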

7.4 Front-end Processor
The DNN-based TTS system relies on a front-end text processor to extract linguistic features from
input text. Two open-source tools, Ossian and Festival, have been effectively employed for this
purpose.[6]

7.4.1 Ossian
Ossian's text processor extracts linguistic features from input text, irrespective of language, by mapping
each character to its phoneme. By linguistic features, we mean characteristics which describe the
sound of the text, such as phonemes (the smallest units of sound in a language), prosody (intonation,
stress, rhythm), and other features. An example of character-to-phoneme mapping in Bengali is the
following: for the word আেদশ there are three characters, আ, েদ, শ, which correspond to the phonemes a/de/sh.
Using this mapping, Ossian extracts the other linguistic features mentioned above.

7.4.2 Festival
Festival's front-end text processor, on the other hand, is language-specific: it requires language-specific
lexicons and phonologies and uses a grapheme-to-phoneme converter. An example of grapheme-to-phoneme
conversion is given in the image below:

Figure 10: Grapheme to Phoneme Conversion

7.4.3 Output of the front-end processor


The front-end outputs HTS-style labels with state-level alignment, producing a feature vector for
input into the duration and acoustic models. HTS uses hidden Markov models (HMMs) to model
speech features (such as phonemes, durations, and acoustic parameters).

HTS-style labels refer to the format of these alignment labels used in HTS-based systems.
In HTS, speech synthesis involves dividing the speech signal into small segments (frames) and aligning
them with linguistic units (e.g., phonemes), and each frame corresponds to a specific state within
an HMM. State-level alignment provides a mapping between frames and the corresponding states in
the HMM. For example, if we have an HMM with three states (initial, middle, and final), state-level
alignment tells us which frames correspond to each state. From these HTS-style labels, a vector of
linguistic features is generated by Ossian or Festival. This feature vector is then fed to the duration
model and the acoustic model. So the input to the front-end processor is the normalized text, and the
output is a feature vector.

7.5 Duration Model
The TTS system described in the paper consists of two different deep neural networks. The first is the
duration model, which takes as input the linguistic features generated by the front-end processor
and produces a set of predicted durations for each phoneme in the input text. These predicted
durations represent the expected length of time for which each speech unit should be pronounced
when synthesizing speech. The DNN is a feed-forward network and consists of 3 hidden layers
with 512 neurons in each layer. Using a gradient descent optimizer, the network learns the proper
duration information by updating its weights. An example of the working of this component is as follows:

Input: Let us consider that the TTS System is given an input আিম একিট বই পড়িছ
Now, let's generate a hypothetical feature vector containing linguistic features for this input sentence.
The feature vector may include information such as phonetic representations of each word, syllable
boundaries, part-of-speech tags, and linguistic context.
Hypothetical Feature Vector:

• Phonemes: /a/mi/ /e/k/ti/ /b/oi/ /po/r/ch/i

• Syllable Boundaries: [আ-িম] [এ-ক-িট] [বই] [পড়-িছ]

• Part-of-Speech Tags: Pronoun, Indefinite Article, Noun, Verb

• Linguistic Context: Sentence Context

A hypothetical output that the duration model could give can be the following:

• /a/: 90 ms

• /mi/: 150 ms

• /e/: 110 ms

• /k/: 60 ms

• /ti/: 180 ms

• /b/: 40 ms

• /oi/: 130 ms

• /po/: 100 ms

• /r/: 70 ms

• /ch/: 90 ms

• /i/: 100 ms

These predicted output durations are now passed onto the acoustic model to get the acoustic features.
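As a minimal sketch of the network described above, the following Keras model has 3 hidden layers of 512 neurons, is trained with a gradient-descent optimizer on mean squared error, and maps one linguistic feature vector to one predicted phoneme duration. The input size, activation function, and learning rate are assumptions made for illustration; the paper does not specify them here.

    import tensorflow as tf
    from tensorflow.keras import layers

    input_dim = 400   # size of the linguistic feature vector (assumed for illustration)

    duration_model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(512, activation="relu"),   # activation assumed
        layers.Dense(512, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(1),                        # predicted duration for one phoneme
    ])
    duration_model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
                           loss="mse")
    # duration_model.fit(linguistic_features, ground_truth_durations, ...) during training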

7.6 Acoustic Model
The acoustic model takes two kinds of inputs: the linguistic features generated from the front-end
processor (represented as sequences of binary vectors), and the duration features generated by
the duration model for those linguistic features. The acoustic model learns to map these inputs to
acoustic features, which represent the spectral characteristics of speech, such as pitch, intensity, and
spectral envelope. This neural network consists of 6 hidden layers, with 1024 neurons in each, and
by applying the gradient descent algorithm it updates the weights in each iteration.
The input linguistic and duration features are fed forward through the layers of the acoustic model
and transformed at each layer using activation functions until they reach the output layer, which
generates the predicted acoustic features.

As in the example before, for the sentence আমি একিট বই পড়িছ, the input to the acoustic model
would be the feature vector and the durations obtained from the duration model. A possible output
that the acoustic model might generate is:
Predicted Acoustic Features:

• Pitch Contour: [100 Hz, 110 Hz, 115 Hz, 120 Hz, 125 Hz, 120 Hz, 115 Hz, 110 Hz, 105 Hz, 100 Hz]

• Intensity Contour: [60 dB, 65 dB, 70 dB, 75 dB, 80 dB, 75 dB, 70 dB, 65 dB, 60 dB, 55 dB]

• Spectral Envelope: [Array of spectral energy values over different frequency bands]

This output of predicted acoustic features is now passed to the vocoder.
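The following is an illustrative NumPy sketch of one common way the duration predictions can be combined with the phoneme-level linguistic features before the acoustic model: each phoneme's feature vector is repeated for as many frames as its predicted duration covers. The 5 ms frame length and feature dimensionality are assumptions; the durations reuse the hypothetical values from the duration-model example above.

    import numpy as np

    frame_ms = 5                                  # frame length assumed for illustration
    num_phonemes, feat_dim = 11, 400              # dimensions assumed for illustration
    phoneme_features = np.random.rand(num_phonemes, feat_dim)
    durations_ms = [90, 150, 110, 60, 180, 40, 130, 100, 70, 90, 100]

    # Repeat each phoneme's feature row once per frame of its predicted duration
    frames = np.repeat(phoneme_features,
                       [d // frame_ms for d in durations_ms], axis=0)
    print(frames.shape)                           # one row per frame, fed to the acoustic DNN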

7.7 Vocoder
The final component of this TTS system, the vocoder, is used to synthesize the final waveform, resulting
in the final synthesized speech. The output from the acoustic model is first normalized by
transforming the acoustic features so that they have zero mean and unit variance. A
hypothetical example of these normalized acoustic features could be the following:

[
[0.2, 0.3, 0.5], #Example frame 1
[0.4, 0.6, 0.8], #Example frame 2
[0.1, 0.2, 0.4], #Example frame 3
...
]

In this example, each row represents a frame of acoustic features and the columns represent different
acoustic parameters, such as pitch, intensity, and spectral envelope (the exact features produced by
the acoustic model are not mentioned).
These normalized features are then sent to the vocoder for waveform synthesis. The vocoder used in
this system is WORLD, an open-source vocoder [9]. It uses the normalized acoustic
features and generates the final output waveform, which is the synthesized speech. The process
involves:

• Analyzing the spectral characteristics of the input features. By doing so, the vocoder identifies
the frequency components that contribute to the overall sound.

• Synthesizing speech by filtering a broadband noise source (white noise) using the extracted
spectral features. White noise contains energy across all frequencies and the vocoder filters
this noise source using the spectral features obtained from the input. Each spectral feature
corresponds to a specific frequency band.

• Controlling the filter gains based on the normalized features. The normalized features (such as
pitch, duration, and intensity) influence the filter gains. Filter gains determine how much the
noise at each frequency band is amplified or attenuated.

• Combining the filtered noise sources to create the final output waveform. After filtering the
noise source for each spectral feature, the WORLD vocoder combines these filtered noise sources.

The synthesized speech waveform is generated by combining the filtered noise sources and the WORLD
vocoder ensures that the timing, pitch, and spectral characteristics match the input features. The
output waveform represents the natural-sounding speech corresponding to the input text.
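As a hedged sketch of this last step, the WORLD vocoder can be driven from Python through the pyworld wrapper (an assumption made for illustration; the paper only names WORLD and does not say which interface was used). The F0, spectral-envelope, and aperiodicity arrays would come from the acoustic model after de-normalization; random values are used here only to keep the sketch self-contained.

    import numpy as np
    import pyworld as pw

    fs = 16000                                                 # sample rate (assumed)
    n_frames, fft_size = 200, 1024
    f0 = np.full(n_frames, 120.0)                              # pitch contour in Hz
    sp = np.abs(np.random.rand(n_frames, fft_size // 2 + 1))   # spectral envelope
    ap = np.zeros((n_frames, fft_size // 2 + 1))               # aperiodicity (fully voiced)

    # WORLD synthesis: one frame every 5 ms, output is the speech waveform
    waveform = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
    print(waveform.shape)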

7.8 Discussion
The DNN-based statistical parametric Bengali text-to-speech system uses a combination of
components, each performing a distinct task, to produce natural-sounding speech from input Bengali
text. It makes use of a text normalizer, which requires a large database of exceptions, as well as
numerous rules, to determine the normalized form of the input text. The system also utilizes various
open-source tools such as Ossian, Festival, and WORLD to generate the final output. The use
of two neural networks, one for duration modelling and another for acoustic modelling, sets this
TTS system apart from other SPSS-based systems which use other methods such as HMMs and LSTM-
RNNs. Gaining a comprehensive understanding of the input and output of each
component of this TTS system was important for grasping the overall architecture, and essential for
understanding the different stages through which the input text progresses before culminating in
the final generated waveform.

8 Comparing SPSS and Subachan
The developers of the DNN-based SPSS system compared the results of both voices of their
model (male and female) with those of Subachan. They also compared all of these models
with the best-known commercial Bangla TTS from Google.

8.1 Objective Evaluation


The Perceptual Evaluation of Speech Quality (PESQ) score, specifically raw-PESQ and MOS-LQO, was
chosen for the purpose of objective evaluation. The PESQ scores of the four models (Subachan TTS,
SUST SPSS Male, SUST SPSS Female, and Google Bangla) were calculated to compare speech quality.
Figure 11 shows the average PESQ scores for the four systems.

Figure 11: PESQ Scores [6]

We can clearly see that although the SPSS model does not objectively perform as well as the Google
Bangla TTS, both the male and female voices score relatively higher than Subachan TTS.
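As an aside, a PESQ score of the kind reported here can be computed in Python with the open-source pesq package, as in the hedged sketch below (an assumption for illustration; the paper does not state which tool was used, and the file names are placeholders).

    import soundfile as sf
    from pesq import pesq

    # Reference recording and synthesized output must share the sample rate (8 or 16 kHz)
    ref, fs = sf.read("reference_recording.wav")
    deg, _ = sf.read("synthesized_output.wav")

    score = pesq(fs, ref, deg, mode="wb")   # wide-band PESQ, reported as MOS-LQO
    print(score)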

8.2 Subjective Evaluation
The Mean Opinion Score (MOS) test was used to subjectively evaluate all four models.
Native Bangladeshi speakers listened to 20 synthetic sentences generated by the various TTS systems
and gave each system a naturalness score between 0 and 5, a higher score meaning better naturalness.
All the scores were averaged to obtain the mean score of a system.
[6] Figure 12 summarizes the MOS scores obtained by the various systems.

Figure 12: MOS Scores [6]

The SPSS system performing much better than Subachan is apparent in this case,
showing that the statistical parametric speech synthesis approach outperforms the concatenation-based
approach. The male voice performed slightly better than the female voice, and both were comparable
to the Google Bangla TTS.

9 Duration Modelling
In recent years, the two most popular methods used in TTS models have been concatenative synthesis and
statistical parametric speech synthesis (SPSS).

Duration modelling is essentially the process of deciding the duration of each phoneme, the smallest
unit of speech that can distinguish meaning in a language. Concatenative synthesis
approaches do not necessarily require modelling of duration, since the
units themselves have intrinsic durations [10]. Most SPSS models, on the other hand, make use of a
component known as the duration model in order to specify the duration of each phoneme, when
given linguistic features as input. By linguistic features, we mean attributes such as phrase type, part
of speech, and the type and position of the syllable.

A deep neural network (DNN)-based statistical parametric speech synthesis (SPSS) framework (such as
the one discussed above) converts input text into output waveforms by using modules in a pipeline:
a text analyzer to derive linguistic features such as syntactic and prosodic tags from text, a duration
model to predict the phoneme duration, an acoustic model to predict the acoustic features such as
mel-cepstral coefficients and F0, and a vocoder to produce the waveform from the acoustic features [11].

In this report, I have explored how the duration model works in a typical DNN-based SPSS framework,
focusing on how linguistic features are converted into durations based on the work presented in the
following two papers:

• ``Text-to-Speech Duration Models for Resource-Scarce Languages in Neural Architectures'' by


Aby Louw [12]

• ``Duration modeling using DNN for Arabic speech synthesis'' by Zangar et al. [16]

9.1 Duration Model for Resource-Scarce Languages
9.1.1 Model Architecture
The duration model discussed in this paper is based on a stack of fully connected layers in a feed-
forward neural network (FFNN)[12], the architecture for which is given in the figure below:

Figure 13: Model Architecture

At the output is a linear layer, while ReLU is used at the hidden layers. Batch normalization and
dropout are used with each hidden layer of the network. The Adam optimisation algorithm [13] is
used with a learning rate scheduler that lowers the learning rate when the validation loss reaches a
plateau. The weights and biases of all the layers are initialized using the He-uniform distribution [14].
The loss function is the mean squared error (MSE) on the predicted duration feature.
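A Keras sketch matching this description is given below: fully connected ReLU layers with batch normalization and dropout after each hidden layer, He-uniform initialization, a linear output layer, MSE loss, the Adam optimizer, and a callback that reduces the learning rate when the validation loss plateaus. The layer count and size follow the best configuration reported in Section 9.1.5 (4 hidden layers of 128 units); the dropout rate and scheduler factor are assumptions, since the exact values are not given here.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_duration_model(input_dim=375, hidden_layers=4, units=128):
        model = tf.keras.Sequential([tf.keras.Input(shape=(input_dim,))])
        for _ in range(hidden_layers):
            model.add(layers.Dense(units, activation="relu",
                                   kernel_initializer="he_uniform"))
            model.add(layers.BatchNormalization())
            model.add(layers.Dropout(0.2))                 # rate assumed
        model.add(layers.Dense(1, kernel_initializer="he_uniform"))   # linear output
        model.compile(optimizer="adam", loss="mse")
        return model

    model = build_duration_model()
    reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5)
    # model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[reduce_lr])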

9.1.2 Training Process
During training, the TTS engine front-end creates a contextual label sequence for each recording of
the training data in the speech database. This contextual label sequence is then converted into a
linguistic description feature vector, and used as input to the FFNN. The ground truth duration of
each phone unit in the contextual label sequence is used as the output feature target of the FFNN.

9.1.3 Contextual Features


The work discussed in this paper uses various linguistic contextual features as described in the table
below:

Figure 14: Contextual Features

9.1.4 Pipeline
The contextual features given in the above table are converted to a linguistic description vector
containing a combination of binary encodings (for the phoneme identities and features) and positional
information using the MERLIN toolkit [15]. This is the linguistic description feature vector. This vector
is passed to the FFNN, and its output is the predicted duration.

9.1.5 Features of the Model


The input linguistic description vector in this particular model consists of 375 features and is normalised
to the range [0.01, 0.99], while the output vectors (the reference durations) are normalised
to zero mean and unit variance. This model was tested with various configurations of 4 to 6 layers,
with each configuration having different layer sizes. The model which produced the lowest RMSE (2.905
frames/phone) had 4 hidden layers, with 128 units per layer.
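An illustrative NumPy sketch of these two normalisations, with made-up data standing in for the real feature and duration vectors, is shown below.

    import numpy as np

    def scale_inputs(X, lo=0.01, hi=0.99):
        # Min-max scale every feature column into the range [0.01, 0.99]
        xmin, xmax = X.min(axis=0), X.max(axis=0)
        return lo + (X - xmin) / np.maximum(xmax - xmin, 1e-8) * (hi - lo)

    def standardise_targets(y):
        # Zero mean, unit variance; keep the statistics to undo the scaling at synthesis time
        return (y - y.mean()) / y.std(), y.mean(), y.std()

    X = np.random.rand(1000, 375)          # 1000 phones, 375 linguistic features (made up)
    y = np.random.rand(1000) * 20.0        # reference durations in frames/phone (made up)
    X_scaled = scale_inputs(X)
    y_scaled, y_mean, y_std = standardise_targets(y)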

9.1.6 Discussion
The steps followed in the duration model of the TTS system described in this paper are:

1. Front-end processor (like Ossian or Festival) generates a contextual label sequence

2. The MERLIN toolkit converts these labels into vectors of binary and continuous features [15],
and this vector is the linguistic description feature vector

3. This feature vector is the input to the Feed-forward neural network

4. The FFNN which gave the best results consisted of 4 hidden layers with 128 units/ layer

5. The FFNN uses ReLU activation functions for the hidden layers, and the weights and biases are
initialized using the He-uniform distribution

6. The output from the FFNN is the predicted duration of the phonemes

9.2 Duration modeling using DNN for Arabic speech synthesis
This paper investigates the modeling of phoneme duration for the Arabic language. Similar to the paper
discussed above, the model described here takes similar input features (the two preceding
and succeeding phonemes, the position of syllables, etc.) and gives the duration of phonemes as
output. The type of an input feature can be binary, like stressed/not-stressed, discrete, like the
phoneme identity, or numeric, like the phoneme position [16].

9.2.1 Overview
The paper focuses on multiple different architectures, which include feed-forward DNNs using only
dense layers, and recurrent DNNs based on LSTM and BLSTM layers. The RMSprop optimizer is
adopted in the experiments, as well as early stopping to avoid the over-fitting problem. For each
model, various numbers of hidden layers and nodes, and various activation functions, have been tried.
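A hedged Keras sketch of this experimental setup is given below: candidate dense and recurrent architectures are compiled with RMSprop and trained with early stopping, and the one with the lowest validation error would be kept. The layer sizes and feature dimensionality are placeholders, not the paper's exact configurations.

    import tensorflow as tf
    from tensorflow.keras import layers

    def dense_model(input_dim):
        # Feed-forward variant: one duration prediction per phoneme feature vector
        return tf.keras.Sequential([
            tf.keras.Input(shape=(input_dim,)),
            layers.Dense(256, activation="relu"),
            layers.Dense(256, activation="relu"),
            layers.Dense(1),
        ])

    def blstm_model(input_dim):
        # Recurrent variant: predicts a duration for every phoneme in a sequence
        return tf.keras.Sequential([
            tf.keras.Input(shape=(None, input_dim)),
            layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
            layers.TimeDistributed(layers.Dense(1)),
        ])

    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                                  restore_best_weights=True)
    for build in (dense_model, blstm_model):
        model = build(input_dim=100)                    # feature size assumed
        model.compile(optimizer="rmsprop", loss="mse")
        # model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[early_stop])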

9.2.2 Different phoneme classes


For Arabic, there are two major characteristics: vowel quantity and consonant gemination [16].

Vowel Quantity is a term in phonetics for the length of a vowel, usually indicated in phonetic
transcription by a LENGTH MARK [:] after a vowel, as in /a:/.[17] Vowels so marked have in general
greater duration than the same vowels with no such mark. Consonant Gemination is an articulation
of a consonant for a longer period of time than that of a singleton consonant [18].

The data used was divided into several subsets; for each subset, different models were
trained and evaluated, and the architecture leading to the most accurate prediction on the development
set was selected.

The 8 different classes which have been considered are:

1. all phonemes including pauses between words

2. all phonemes only (i.e., without pauses)

3. all consonants only

4. all vowels only

5. simple consonants

6. geminated consonants

7. short vowels

8. long vowels.

9.2.3 Architecture leading to the best accuracy
As the size of the training corpus differs from one class to another, it is not the same model
architecture that leads to the most accurate prediction of phoneme durations for the different classes
of sounds. The best model for each class, along with its architecture (number of layers, activation
functions, etc.), is given in the figure below:

Figure 15: Model Architecture for each class

9.2.4 Discussion
The model discussed in the paper makes use of a class-specific approach, and after testing various
models for each class, the best is determined based on the root mean squared prediction
error (RMSE) value. This is because segmental duration is a continuous value, and the DNNs used in
the architecture act as regression tools trained to minimize the RMSE. Comparisons
with other state-of-the-art DNN-based toolkits such as MERLIN [16] have shown that for the Arabic
test set, a class-specific approach works best.

References
[1] Hasanabadi, Mohammad Reza. ''An Overview of Text-to-Speech Systems and Media Applications."
arXiv e-prints, 2023, arXiv:2310.14301.

[2] Raffoul, Sandra, and Lindsey Jaber. ''Text-to-Speech Software and Reading Comprehension: The
Impact for Students with Learning Disabilities." Canadian Journal of Learning and Technology,
vol. 49, no. 2, Nov. 2023, pp. 1-18.

[3] Bengali Ethnologue free. (n.d.). Ethnologue (Free All). https://www.ethnologue.com/language/ben/

[4] Tabet Y, and Mohamed Boughazi. ''Speech synthesis techniques. A survey." Systems, Signal Pro-
cessing and their Applications (WOSSPA), 2011 7th International Workshop on. IEEE, 2011.

[5] A. Naser, D. Aich, and M. R. Amin, ''Implementation of Subachan: Bengali text-to-speech synthesis
software,'' in International Conference on Electrical & Computer Engineering (ICECE 2010). IEEE,
2010, pp. 574–577.

[6] R. S. Raju, P. Bhattacharjee, A. Ahmad and M. S. Rahman, ''A Bangla Text-to-Speech System using
Deep Neural Networks'', 2019 International Conference on Bangla Speech and Language Processing
(ICBSLP), Sylhet, Bangladesh, 2019, pp. 1-5, doi: 10.1109/ICBSLP47725.2019.202055.

[7] H. Ze, A. Senior, and M. Schuster, ''Statistical parametric speech synthesis using deep neural
networks'' in 2013 ieee international conference on acoustics, speech and signal processing. IEEE,
2013, pp. 7962-7966.

[8] Rashid, M., Hussain, M. A., & Rahman, M. S. (2010, December). ''Text normalization and
diphone preparation for Bangla speech synthesis''. Journal of Multimedia, 5(6), 551-559.
https://doi.org/10.4304/jmm.5.6.551-559

[9] Morise, M., Yokomori, F., & Ozawa, K. (2016). ''WORLD: A Vocoder-Based High-Quality Speech
Synthesis System for Real-Time Applications''. IEICE Transactions on Information and Systems,
E99.D(7), 1877-1884. doi:10.1587/transinf.2015EDP7457

[10] Henter, G., Ronanki, S., Watts, O., Wester, M., Wu, Z., & King, S. (2016). ''Robust TTS Duration
Modelling Using DNNs''. In 2016 IEEE International Conference on Acoustics, Speech and Sig-
nal Processing (ICASSP) (pp. 5130-5134). Institute of Electrical and Electronics Engineers (IEEE).
https://doi.org/10.1109/ICASSP.2016.7472655

[11] Yasuda, Y., Wang, X., & Yamagishi, J. (2020). ''Investigation of learning abilities on linguistic
features in sequence-to-sequence text-to-speech synthesis''. arXiv preprint arXiv:2005.10390.

[12] Louw, A. (2020). ''Text-to-Speech Duration Models for Resource-Scarce Languages in Neural Ar-
chitectures''. In A. L. Lueker, & P. J. Sweeney (Eds.), Advances in Neural Information Processing
Systems 33 (pp. 141-153). Retrieved from https://doi.org/10.1007/978-3-030-66151-9_9

[13] Kingma, D. P., & Ba, J. (2014). ''Adam: A method for stochastic optimization''. arXiv preprint
arXiv:1412.6980.

[14] He, K., Zhang, X., Ren, S., & Sun, J. (December 2015). ''Delving Deep into Rectifiers: Surpassing
Human-Level Performance on ImageNet Classification''. In Proceedings of the IEEE International
Conference on Computer Vision (ICCV).

[15] Wu, Z., Watts, O., & King, S. (2016). ''Merlin: An Open Source Neural Network Speech Synthesis
System''. In Proceedings of the 9th ISCA Workshop on Speech Synthesis Workshop (SSW 9) (pp.
202-207). doi:10.21437/SSW.2016-33

[16] Zangar, I., Mnasri, Z., Colotte, V., Jouvet, D., & Houidhek, A. (2018). ''Duration mod-
eling using DNN for Arabic speech synthesis''. Speech Prosody 2018. Retrieved from
https://api.semanticscholar.org/CorpusID:53055606

[17] Encyclopedia.com. (2024). Vowel Quantity. Retrieved May 12, 2024, from
https://www.encyclopedia.com/humanities/encyclopedias-almanacs-transcripts-and-maps/vowel-
quantity

[18] Gemination. (2024, May 3). Wikipedia. Retrieved May 12, 2024, from
https://en.wikipedia.org/wiki/Gemination

